Simple usage
from transformers import AutoTokenizer
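The snippets below need a tokenizer instance; a minimal sketch, assuming the bert-base-uncased checkpoint (the checkpoint name is an assumption, any checkpoint from the Hub would work):

# Assumption: bert-base-uncased; swap in whichever checkpoint you are studying
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")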
The tokenization pipeline: from input text to a list of numbers
Raw text -> Tokens -> Special tokens -> Input IDs
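Calling the tokenizer object directly on the text runs this whole pipeline in one step; a small sketch, assuming the bert-base-uncased tokenizer instantiated above:

inputs = tokenizer("Let's try to tokenize!")
print(inputs["input_ids"])  # for a BERT checkpoint the list typically starts with 101 ([CLS]) and ends with 102 ([SEP])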
- Tokens: words, parts of words, or punctuation symbols
- The tokenizer first lowercases all words (for an uncased tokenizer), then follows a set of rules to split the result into small chunks of text. Most Transformers models use a subword tokenization algorithm, which means that a given word can be split into several tokens.
- The ## prefix in front of "ize" is the convention used by BERT to indicate that this token is not the beginning of a word (other tokenizers may use different conventions).
tokens = tokenizer.tokenize("Let's try to tokenize!")
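With an uncased BERT tokenizer, the split looks roughly like this (assumed output; the exact pieces depend on the checkpoint's vocabulary):

print(tokens)
# expected: something like ['let', "'", 's', 'try', 'to', 'token', '##ize', '!']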
- Map those tokens to their respective IDs, as defined by the vocabulary of the tokenizer.
input_ids = tokenizer.convert_tokens_to_ids(tokens)
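Each token is looked up in the tokenizer's vocabulary; convert_ids_to_tokens reverses the mapping and is a quick sanity check:

print(input_ids)                                    # one integer per token, still without special tokens
print(tokenizer.convert_ids_to_tokens(input_ids))   # maps the IDs back to the same tokens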
- Compared with the output at the very top, a few numbers are missing at the start and at the end of the list; those missing numbers are the special tokens.
The special tokens are added by the prepare_for_model method, which knows the indices of those tokens in the vocabulary and simply adds the proper numbers.
final_ids = tokenizer.prepare_for_model(input_ids)
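To see what was added, compare final_ids['input_ids'] with the plain input_ids; a small sketch, assuming a BERT checkpoint whose special tokens are [CLS] and [SEP]:

print(final_ids["input_ids"])                            # the same IDs as before, wrapped by two extra ones
print(tokenizer.cls_token_id, tokenizer.sep_token_id)    # for BERT checkpoints these are typically 101 and 102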
You can look at the special tokens by using the decode method on the output of the tokenizer object.
As with the prefix marking the beginning of words / parts of words, these special tokens vary depending on which tokenizer you are using.
print(tokenizer.decode(final_ids['input_ids']))
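For a BERT checkpoint the decoded string makes the added tokens visible; decoding the IDs without special tokens shows the difference (assumed output):

# the print above gives roughly: [CLS] let's try to tokenize! [SEP]
print(tokenizer.decode(input_ids))  # the same sentence without [CLS] / [SEP]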