Let's build the GPT Tokenizer
Andrej Karpathy
@AndrejKarpathy
Video Description
The Tokenizer is a necessary and pervasive component of Large Language Models (LLMs): it translates between strings and tokens (text chunks). Tokenizers are a completely separate stage of the LLM pipeline: they have their own training sets and training algorithms (Byte Pair Encoding), and after training they implement two fundamental functions: encode(), from strings to tokens, and decode(), back from tokens to strings. In this lecture we build from scratch the Tokenizer used in the GPT series from OpenAI. In the process, we will see that a lot of weird behaviors and problems of LLMs actually trace back to tokenization. We'll go through a number of these issues, discuss why tokenization is at fault, and why someone out there ideally finds a way to delete this stage entirely.

Chapters:
00:00:00 intro: Tokenization, GPT-2 paper, tokenization-related issues
00:05:50 tokenization by example in a Web UI (tiktokenizer)
00:14:56 strings in Python, Unicode code points
00:18:15 Unicode byte encodings, ASCII, UTF-8, UTF-16, UTF-32
00:22:47 daydreaming: deleting tokenization
00:23:50 Byte Pair Encoding (BPE) algorithm walkthrough
00:27:02 starting the implementation
00:28:35 counting consecutive pairs, finding the most common pair
00:30:36 merging the most common pair
00:34:58 training the tokenizer: adding the while loop, compression ratio
00:39:20 tokenizer/LLM diagram: it is a completely separate stage
00:42:47 decoding tokens to strings
00:48:21 encoding strings to tokens
00:57:36 regex patterns to force splits across categories
01:11:38 tiktoken library intro, differences between the GPT-2/GPT-4 regex
01:14:59 walkthrough of GPT-2's encoder.py released by OpenAI
01:18:26 special tokens, tiktoken's handling of them, GPT-2/GPT-4 differences
01:25:28 minbpe exercise time! write your own GPT-4 tokenizer
01:28:42 sentencepiece library intro, used to train the Llama 2 vocabulary
01:43:27 how to set the vocabulary size? revisiting gpt.py transformer
01:48:11 training new tokens, example of prompt compression
01:49:58 multimodal [image, video, audio] tokenization with vector quantization
01:51:41 revisiting and explaining the quirks of LLM tokenization
02:10:20 final recommendations
02:12:50 ??? :)

Exercises:
- Advised flow: reference this document and try to implement the steps before I give away the partial solutions in the video. The full solutions, if you're getting stuck, are in the minbpe code: https://github.com/karpathy/minbpe/blob/master/exercise.md

Links:
- Google Colab for the video: https://colab.research.google.com/drive/1y0KnCFZvGVf_odSfcNAws6kcDD7HsI0L?usp=sharing
- GitHub repo for the video: minBPE https://github.com/karpathy/minbpe
- Playlist of the whole Zero to Hero series so far: https://www.youtube.com/watch?v=VMj-3S1tku0&list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ
- our Discord channel: https://discord.gg/3zy8kqD9Cp
- my Twitter: https://twitter.com/karpathy

Supplementary links:
- tiktokenizer: https://tiktokenizer.vercel.app
- tiktoken from OpenAI: https://github.com/openai/tiktoken
- sentencepiece from Google: https://github.com/google/sentencepiece
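As a companion to the chapters on counting consecutive pairs, merging, and the training while-loop, here is a minimal, stdlib-only sketch of BPE training over raw UTF-8 bytes. The helper names (get_stats, merge, train) and the exact code are my own simplification, not the lecture's code verbatim; see the minbpe repo for the real thing.

```python
def get_stats(ids):
    """Count how often each consecutive pair of tokens occurs."""
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, idx):
    """Replace every occurrence of `pair` in `ids` with the new token `idx`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(idx)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train(text, vocab_size):
    """Learn (vocab_size - 256) merges on top of the 256 raw byte values."""
    ids = list(text.encode("utf-8"))  # raw bytes as integers 0..255
    merges = {}                       # (token, token) -> new token id
    for idx in range(256, vocab_size):
        stats = get_stats(ids)
        if not stats:
            break
        pair = max(stats, key=stats.get)  # most frequent pair wins
        ids = merge(ids, pair, idx)
        merges[pair] = idx
    return merges, ids
```

The compression ratio the lecture tracks is simply len(text.encode("utf-8")) / len(ids): how many raw bytes each token stands for, on average, after training.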
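The chapters on decoding tokens to strings and encoding strings to tokens can be sketched the same way. The version below is my own stdlib-only guess at the shape of decode() and encode() on top of a learned merges table, not the exact lecture code: decode() expands each token id into its byte sequence, and encode() repeatedly applies whichever applicable merge was learned earliest during training.

```python
def merge(ids, pair, idx):
    """Replace every occurrence of `pair` in `ids` with the new token `idx`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(idx)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def decode(ids, merges):
    """Map token ids back to a string via a vocab of byte sequences."""
    vocab = {i: bytes([i]) for i in range(256)}
    # build child tokens before parents by processing merges in id order
    for (p0, p1), idx in sorted(merges.items(), key=lambda kv: kv[1]):
        vocab[idx] = vocab[p0] + vocab[p1]
    # errors="replace": a token boundary can split a multi-byte UTF-8 character
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")

def encode(text, merges):
    """Greedily apply learned merges, earliest-learned first."""
    ids = list(text.encode("utf-8"))
    while len(ids) >= 2:
        pairs = set(zip(ids, ids[1:]))
        pair = min(pairs, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break  # no mergeable pair left
        ids = merge(ids, pair, merges[pair])
    return ids
```

The errors="replace" in decode() mirrors a point made in the video: not every token sequence is valid UTF-8, so a decoder has to tolerate byte sequences that do not form whole characters.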
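The chapter on regex patterns to force splits across categories refers to GPT-2's split pattern, which relies on the third-party `regex` package and Unicode categories such as \p{L}. As a rough illustration only, a simplified ASCII-only approximation with the stdlib `re` module shows the idea: common contractions, letter runs, digit runs, and punctuation runs are chunked separately (each optionally absorbing a leading space) so that BPE merges never cross those category boundaries.

```python
import re

# Simplified ASCII-only stand-in for GPT-2's split pattern; the real one
# (in tiktoken and OpenAI's encoder.py) uses \p{L}/\p{N} via the regex package.
pat = re.compile(r"'s|'t|'re|'ve|'m|'ll|'d| ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+")

chunks = pat.findall("Hello world123, how's it going?")
# BPE would then be run inside each chunk independently.
print(chunks)
```

Note one further simplification: GPT-2's actual pattern uses \s+(?!\S) rather than a bare \s+ so that a run of whitespace leaves its last space attached to the following word.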