AI GLOSSARY

Tokenization

Data

The process of breaking raw text into smaller units called tokens, which may be words, subwords, or characters, before feeding it into a language model. The choice of tokenization scheme affects how a model handles rare words, multiple languages, and edge cases such as punctuation or numbers: a word-level scheme cannot represent unseen words at all, while subword and character schemes can, at the cost of longer token sequences.
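As an illustrative sketch only (not the algorithm of any real tokenizer such as BPE, which learns its subword merges from data), the three granularities can be compared on the same string. The `tokenize` function and its fixed 4-character subword rule are hypothetical:

```python
def tokenize(text, scheme="word"):
    """Split text into tokens at word, subword, or character granularity."""
    if scheme == "word":
        # Word-level: split on whitespace; unseen words have no token.
        return text.split()
    if scheme == "char":
        # Character-level: every string is representable, but sequences are long.
        return list(text)
    if scheme == "subword":
        # Toy subword rule: chop each word into chunks of up to 4 characters.
        # Real subword tokenizers (e.g. BPE, WordPiece) learn merges from a corpus.
        return [w[i:i + 4] for w in text.split() for i in range(0, len(w), 4)]
    raise ValueError(f"unknown scheme: {scheme}")

print(tokenize("unbelievable results", "word"))     # ['unbelievable', 'results']
print(tokenize("unbelievable results", "subword"))  # ['unbe', 'liev', 'able', 'resu', 'lts']
print(tokenize("unbelievable results", "char"))     # 20 single-character tokens
```

The trade-off the sketch makes visible: finer granularity handles rare words gracefully but produces more tokens per input, which increases the sequence length the model must process.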