









['Cats', 'are', 'great,', 'but', 'dogs', 'are', 'better!']




['C', 'a', 't', 's', ' ', 'a', 'r', 'e', ' ', 'g', 'r', 'e', 'a', 't', ',', ' ', 'b', 'u', 't', ' ', 'd', 'o', 'g', 's', ' ', 'a', 'r', 'e', ' ', 'b', 'e', 't', 't', 'e', 'r', '!']
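The two lists above show the same sentence split at two granularities. Here is a minimal sketch, assuming the source sentence is the one obtained by joining the character list, and that word-level splitting simply means splitting on whitespace:

```python
# Sentence recovered by joining the character-level tokens shown above.
text = "Cats are great, but dogs are better!"

# Word-level tokens: split on whitespace (punctuation stays attached).
word_tokens = text.split()

# Character-level tokens: every character, spaces included.
char_tokens = list(text)

print(word_tokens)
print(len(word_tokens), len(char_tokens))
```

Word-level splitting yields few tokens but needs a very large vocabulary, while character-level splitting needs a tiny vocabulary but produces long sequences. Subword methods sit between the two.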










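When you use a tokenizer from Hugging Face's transformers library, every step of the tokenization pipeline is handled automatically. The whole pipeline is carried out by a single object called Tokenizer. This section looks inside code that most users never need to touch by hand when working on NLP tasks. It also covers how to customize the base tokenizer class in the tokenizers library, so that a tokenizer can be purpose-built for a specific task when needed.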


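Normalization is the process of cleaning text before it is split into tokens. It includes steps such as converting every character to lowercase, removing accents from characters, and stripping unnecessary whitespace. Take, for example, the string ThÍs is áN examplise sÉnteNCE. Different normalizers perform different steps.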
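The normalization steps described above can be sketched with the Python standard library alone. This is an illustrative normalizer, not the one any particular Hugging Face model uses. The function name `normalize` and the choice of NFD decomposition are assumptions made for the example.

```python
import unicodedata

def normalize(text: str) -> str:
    """Lowercase, strip accents, and collapse whitespace (illustrative only)."""
    # NFD splits accented characters into base character + combining mark.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (Unicode category "Mn").
    no_accents = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # Lowercase, then collapse runs of whitespace into single spaces.
    return " ".join(no_accents.lower().split())

normalized = normalize("ThÍs is  áN examplise   sÉnteNCE")
print(normalized)  # this is an examplise sentence
```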

[('this', (0, 4)), ('sentence', (5, 13)), ("'", (13, 14)), ('s', (14, 15)), ('content', (16, 23)), ('includes', (24, 32)), (':', (32, 33)), ('characters', (34, 44)), (',', (44, 45)), ('spaces', (46, 52)), (',', (52, 53)), ('and', (54, 57)), ('punctuation', (58, 69)), ('.', (69, 70))]

Subword Tokenization Methods


1. Byte Pair Encoding

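Byte Pair Encoding (BPE) is a widely used tokenization algorithm, adopted by models such as GPT and GPT-2 (OpenAI) and BART (Lewis et al.) [9-10]. It was originally designed as a text-compression algorithm, but it turned out to work very well for tokenization in language models. BPE breaks a string of text into subword units that occur frequently in a reference corpus (the text used to train the tokenization model) [11]. A BPE model is trained as follows: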






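Next, the frequency of each character pair is recorded for every word in the corpus. For example, the word cats has the character pairs ca, at, and ts. Every word is examined this way and contributes to a global frequency counter: each instance of ca found in any token increases the count for the pair ca.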
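The pair-counting step can be sketched in a few lines. The toy corpus below is a subset of the training words used later in this article. The variable names are illustrative and not part of any library.

```python
from collections import Counter

# Toy corpus: (word split into characters, word frequency).
corpus = [
    (list("cat"), 5),
    (list("cats"), 2),
    (list("eat"), 10),
    (list("eating"), 3),
]

pair_counts = Counter()
for symbols, freq in corpus:
    # Each adjacent pair is counted once per occurrence of the word.
    for left, right in zip(symbols, symbols[1:]):
        pair_counts[(left, right)] += freq

most_frequent_pair, count = pair_counts.most_common(1)[0]
print(most_frequent_pair, count)
```

The pair at occurs in all four words, so it accumulates the highest count and would be the first pair merged.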





class TargetVocabularySizeError(Exception):
    def __init__(self, message):
        super().__init__(message)


class BPE:
    '''An implementation of the Byte Pair Encoding tokenizer.'''

    def calculate_frequency(self, words):
        '''
        Calculate the frequency for each word in a list of words.

        Args:
            words (list): A list of words (strings) in any order.

        Returns:
            corpus (list[tuple(str, int)]): A list of tuples where the first
                element is a word from the words list, and the second element
                is the frequency of that word in the list.
        '''
        freq_dict = dict()

        for word in words:
            if word not in freq_dict:
                freq_dict[word] = 1
            else:
                freq_dict[word] += 1

        corpus = [(word, freq_dict[word]) for word in freq_dict.keys()]

        return corpus

    def create_merge_rule(self, corpus):
        '''
        Create a merge rule and add it to the self.merge_rules list.

        Args:
            corpus (list[tuple(list, int)]): A list of tuples where the first
                element is a word split into characters (or subwords in later
                iterations), and the second element is the frequency of the
                word.

        Returns:
            None
        '''
        pair_frequencies = self.find_pair_frequencies(corpus)
        # max() raises ValueError on an empty dict; train() relies on this
        most_frequent_pair = max(pair_frequencies, key=pair_frequencies.get)
        # Pairs are stored as 'left,right' strings, so split on the comma
        self.merge_rules.append(most_frequent_pair.split(','))
        self.vocabulary.append(most_frequent_pair)

    def create_vocabulary(self, words):
        '''
        Create a list of every unique character in a list of words.

        Args:
            words (list): A list of strings containing the words of the input
                text.

        Returns:
            vocabulary (list): A list of every unique character in the list
                of input words.
        '''
        vocabulary = list(set(''.join(words)))
        return vocabulary

    def find_pair_frequencies(self, corpus):
        '''
        Find the frequency of each pair of adjacent characters in the corpus.

        Args:
            corpus (list[tuple(list, int)]): A list of tuples where the first
                element is a word split into characters (or subwords in later
                iterations), and the second element is the frequency of the
                word.

        Returns:
            pair_freq_dict (dict): A dictionary where the keys are the
                character pairs from the input corpus and the values are the
                frequency of each pair in the corpus.
        '''
        pair_freq_dict = dict()

        for word, word_freq in corpus:
            for idx in range(len(word) - 1):
                char_pair = f'{word[idx]},{word[idx+1]}'
                if char_pair not in pair_freq_dict:
                    pair_freq_dict[char_pair] = word_freq
                else:
                    pair_freq_dict[char_pair] += word_freq

        return pair_freq_dict

    def get_merged_chars(self, char_1, char_2):
        '''
        Merge the highest score pair and return to the self.merge method.

        This method is abstracted so that the BPE class can be used as the
        base class for other tokenizers, and so the merging method can be
        easily overwritten. In the BPE algorithm the characters can simply be
        concatenated and returned. In the WordPiece algorithm, however, the #
        symbols must first be stripped.

        Args:
            char_1 (str): The first character in the highest-scoring pair.
            char_2 (str): The second character in the highest-scoring pair.

        Returns:
            merged_chars (str): Merged characters.
        '''
        merged_chars = char_1 + char_2
        return merged_chars

    def initialize_corpus(self, words):
        '''
        Split each word into characters and count the word frequency.

        Args:
            words (list[str]): A list of words (strings) in any order.

        Returns:
            corpus (list[tuple(list, int)]): A list of tuples where the first
                element is a word split into its individual characters, and
                the second element is the frequency of the word.
        '''
        corpus = self.calculate_frequency(words)
        corpus = [(list(word), freq) for (word, freq) in corpus]
        return corpus

    def merge(self, corpus):
        '''
        Loop through the corpus and perform the latest merge rule.

        Args:
            corpus (list[tuple(list, int)]): A list of tuples where the first
                element is a word split into characters (or subwords in later
                iterations), and the second element is the frequency of the
                word.

        Returns:
            new_corpus (list[tuple(list, int)]): A modified version of the
                input where the most recent merge rule has been applied to
                merge the most frequent adjacent characters.
        '''
        merge_rule = self.merge_rules[-1]
        new_corpus = []

        for word, word_freq in corpus:
            new_word = []
            idx = 0

            while idx < len(word):
                # If a merge pattern has been found (the bounds check stops
                # word[idx+1] from running past the end of the word)
                if (idx < len(word) - 1) and (word[idx] == merge_rule[0]) \
                        and (word[idx+1] == merge_rule[1]):
                    new_word.append(
                        self.get_merged_chars(word[idx], word[idx+1]))
                    idx += 2
                # If a merge pattern has not been found
                else:
                    new_word.append(word[idx])
                    idx += 1

            new_corpus.append((new_word, word_freq))

        return new_corpus

    def train(self, words, target_vocab_size):
        '''
        Train the model.

        Args:
            words (list[str]): A list of words to train the model on.
            target_vocab_size (int): The number of words in the vocabulary to
                be used as the stopping condition when training.

        Returns:
            None.
        '''
        self.words = words
        self.target_vocab_size = target_vocab_size
        self.corpus = self.initialize_corpus(self.words)
        self.corpus_history = [self.corpus]
        self.vocabulary = self.create_vocabulary(self.words)
        self.vocabulary_size = len(self.vocabulary)
        self.merge_rules = []

        # Iteratively add vocabulary until reaching the target vocabulary size
        if len(self.vocabulary) > self.target_vocab_size:
            raise TargetVocabularySizeError(
                'Error: Target vocabulary size must be greater than the '
                f'initial vocabulary size ({len(self.vocabulary)})')
        else:
            while len(self.vocabulary) < self.target_vocab_size:
                try:
                    self.create_merge_rule(self.corpus)
                    self.corpus = self.merge(self.corpus)
                    self.corpus_history.append(self.corpus)
                # If no further merging is possible
                except ValueError:
                    print('Exiting: No further merging is possible')
                    break

    def tokenize(self, text):
        '''
        Take in some text and return a list of tokens for that text.

        Args:
            text (str): The text to be tokenized.

        Returns:
            tokens (list): The list of tokens created from the input text.
        '''
        tokens = list(text)

        # Apply the learned merge rules in the order they were created
        for merge_rule in self.merge_rules:
            new_tokens = []
            idx = 0

            while idx < len(tokens):
                # If a merge pattern has been found
                if (idx < len(tokens) - 1) \
                        and (tokens[idx] == merge_rule[0]) \
                        and (tokens[idx+1] == merge_rule[1]):
                    new_tokens.append(
                        self.get_merged_chars(tokens[idx], tokens[idx+1]))
                    idx += 2
                # If a merge pattern has not been found
                else:
                    new_tokens.append(tokens[idx])
                    idx += 1

            tokens = new_tokens

        return tokens


# Training set
words = ['cat', 'cat', 'cat', 'cat', 'cat',
         'cats', 'cats',
         'eat', 'eat', 'eat', 'eat', 'eat', 'eat', 'eat', 'eat', 'eat', 'eat',
         'eating', 'eating', 'eating',
         'running', 'running',
         'jumping',
         'food', 'food', 'food', 'food', 'food', 'food']

# Instantiate the tokenizer
bpe = BPE()
bpe.train(words, 21)

# Print the corpus at each stage of the process, and the merge rule used
print(f'INITIAL CORPUS:\n{bpe.corpus_history[0]}\n')
for rule, corpus in list(zip(bpe.merge_rules, bpe.corpus_history[1:])):
    print(f'NEW MERGE RULE: Combine "{rule[0]}" and "{rule[1]}"')
    print(corpus, end='\n\n')


INITIAL CORPUS:
[(['c', 'a', 't'], 5), (['c', 'a', 't', 's'], 2), (['e', 'a', 't'], 10), (['e', 'a', 't', 'i', 'n', 'g'], 3), (['r', 'u', 'n', 'n', 'i', 'n', 'g'], 2), (['j', 'u', 'm', 'p', 'i', 'n', 'g'], 1), (['f', 'o', 'o', 'd'], 6)]

NEW MERGE RULE: Combine "a" and "t"
[(['c', 'at'], 5), (['c', 'at', 's'], 2), (['e', 'at'], 10), (['e', 'at', 'i', 'n', 'g'], 3), (['r', 'u', 'n', 'n', 'i', 'n', 'g'], 2), (['j', 'u', 'm', 'p', 'i', 'n', 'g'], 1), (['f', 'o', 'o', 'd'], 6)]

NEW MERGE RULE: Combine "e" and "at"
[(['c', 'at'], 5), (['c', 'at', 's'], 2), (['eat'], 10), (['eat', 'i', 'n', 'g'], 3), (['r', 'u', 'n', 'n', 'i', 'n', 'g'], 2), (['j', 'u', 'm', 'p', 'i', 'n', 'g'], 1), (['f', 'o', 'o', 'd'], 6)]

NEW MERGE RULE: Combine "c" and "at"
[(['cat'], 5), (['cat', 's'], 2), (['eat'], 10), (['eat', 'i', 'n', 'g'], 3), (['r', 'u', 'n', 'n', 'i', 'n', 'g'], 2), (['j', 'u', 'm', 'p', 'i', 'n', 'g'], 1), (['f', 'o', 'o', 'd'], 6)]

NEW MERGE RULE: Combine "i" and "n"
[(['cat'], 5), (['cat', 's'], 2), (['eat'], 10), (['eat', 'in', 'g'], 3), (['r', 'u', 'n', 'n', 'in', 'g'], 2), (['j', 'u', 'm', 'p', 'in', 'g'], 1), (['f', 'o', 'o', 'd'], 6)]

NEW MERGE RULE: Combine "in" and "g"
[(['cat'], 5), (['cat', 's'], 2), (['eat'], 10), (['eat', 'ing'], 3), (['r', 'u', 'n', 'n', 'ing'], 2), (['j', 'u', 'm', 'p', 'ing'], 1), (['f', 'o', 'o', 'd'], 6)]



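The BPE tokenizers used in GPT-2 and RoBERTa, however, do not have this problem. Instead of analyzing the training data in terms of Unicode characters, they analyze the bytes of the characters. This is known as Byte-Level BPE, and it allows a small base vocabulary to tokenize every character the model might see.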
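To see why working on bytes keeps the base vocabulary small, compare the characters of a string with its UTF-8 bytes. This is only a sketch of the idea, with an arbitrary example string. It is not the exact byte-to-symbol mapping GPT-2 uses.

```python
text = "café 🐱"

# Characters can come from over a million possible Unicode code points...
chars = list(text)

# ...but their UTF-8 encoding only ever uses byte values 0-255.
byte_values = list(text.encode("utf-8"))

print(chars)
print(byte_values)

# The byte sequence decodes back to the original text without loss.
round_trip = bytes(byte_values).decode("utf-8")
print(round_trip == text)
```

Because every string reduces to a sequence drawn from 256 possible byte values, a byte-level base vocabulary never meets an unknown symbol, whatever script or emoji appears in the input.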



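2. WordPiece

The full details of the WordPiece algorithm have not been made public, so the approach described here follows the explanation given by Hugging Face [12]. WordPiece is similar to BPE but uses a different metric to decide on merge rules. Rather than picking the most frequent character pair, it computes a score for every pair, and the pair with the highest score determines which characters are merged. WordPiece is trained as follows: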




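As with BPE, the words in the corpus are then broken into individual characters and added to an initially empty list called the vocabulary. This time, though, rather than storing each character as-is, two # symbols are used as a marker that tells whether the character was found at the start of a word or in the middle/end of a word. For example, the word cat is split into ['c', 'a', 't'] in BPE, but in WordPiece it looks like ['c', '##a', '##t']. A c at the start of a word and a ##c in the middle or at the end of a word are treated as different symbols. The vocabulary is extended iteratively each time the algorithm decides which character pair to merge.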


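Unlike in the BPE model, a score is computed for each character pair this time. First, every adjacent character pair in the corpus is identified (c##a, ##a##t, and so on) and its frequency is counted. The frequency of each character on its own is also determined. With these values known, the pair score can be calculated with the following formula:

$$\text{score} = \frac{\text{freq}(\text{pair})}{\text{freq}(\text{first element}) \times \text{freq}(\text{second element})}$$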
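As a worked example of the pair score, the sketch below applies it to the initial WordPiece corpus built from this article's training words. The score is the pair frequency divided by the product of the two individual frequencies, which is what the find_pair_scores method later in this section computes. The variable names are illustrative.

```python
from collections import Counter

# Initial WordPiece corpus from the training words used in this article.
corpus = [
    (["c", "##a", "##t"], 5),
    (["c", "##a", "##t", "##s"], 2),
    (["e", "##a", "##t"], 10),
    (["e", "##a", "##t", "##i", "##n", "##g"], 3),
    (["r", "##u", "##n", "##n", "##i", "##n", "##g"], 2),
    (["j", "##u", "##m", "##p", "##i", "##n", "##g"], 1),
    (["f", "##o", "##o", "##d"], 6),
]

char_freq = Counter()
pair_freq = Counter()
for symbols, freq in corpus:
    for symbol in symbols:
        char_freq[symbol] += freq
    for left, right in zip(symbols, symbols[1:]):
        pair_freq[(left, right)] += freq

# score = freq(pair) / (freq(first element) * freq(second element))
scores = {
    (left, right): count / (char_freq[left] * char_freq[right])
    for (left, right), count in pair_freq.items()
}

best_pair = max(scores, key=scores.get)
print(best_pair, scores[best_pair])
print(("##a", "##t"), scores[("##a", "##t")])
```

Although ##a followed by ##t is by far the most frequent pair (20 occurrences), its score is only 0.05 because both elements are common on their own. The rare pair ##m, ##p scores 1.0 and is merged first, which matches the first merge rule in the training output later in this section.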






class WordPiece(BPE):

    def add_hashes(self, word):
        '''
        Add # symbols to every character in a word except the first.

        Args:
            word (str): The word to add # symbols to.

        Returns:
            hashed_word (list): A list of the characters with # symbols
                (except the first character, which is just the plain
                character).
        '''
        hashed_word = [word[0]]

        for char in word[1:]:
            hashed_word.append(f'##{char}')

        return hashed_word

    def create_merge_rule(self, corpus):
        '''
        Create a merge rule and add it to the self.merge_rules list.

        Args:
            corpus (list[tuple(list, int)]): A list of tuples where the first
                element is a word split into characters (or subwords in later
                iterations), and the second element is the frequency of the
                word.

        Returns:
            None
        '''
        pair_frequencies = self.find_pair_frequencies(corpus)
        char_frequencies = self.find_char_frequencies(corpus)
        pair_scores = self.find_pair_scores(pair_frequencies, char_frequencies)

        highest_scoring_pair = max(pair_scores, key=pair_scores.get)
        self.merge_rules.append(highest_scoring_pair.split(','))
        self.vocabulary.append(highest_scoring_pair)

    def create_vocabulary(self, words):
        '''
        Create a list of every unique character in a list of words.

        Unlike the BPE algorithm where each character is stored normally,
        here a distinction is made between characters that begin a word
        (unmarked) and characters that are in the middle or end of a word
        (marked with a '##'). For example, the word 'cat' will be split into
        ['c', '##a', '##t'].

        Args:
            words (list): A list of strings containing the words of the input
                text.

        Returns:
            vocabulary (list): A list of every unique character in the list
                of input words, marked with ## if the character appeared in
                the middle/end of a word rather than as its first character.
        '''
        vocabulary = set()
        for word in words:
            vocabulary.add(word[0])
            for char in word[1:]:
                vocabulary.add(f'##{char}')

        # Convert to list so the vocabulary can be appended to later
        vocabulary = list(vocabulary)
        return vocabulary

    def find_char_frequencies(self, corpus):
        '''
        Find the frequency of each character in the corpus.

        Note that 'c' and '##c' are different characters, since the first
        represents a 'c' at the start of a word, and '##c' represents a 'c'
        in the middle/end of a word.

        Args:
            corpus (list[tuple(list, int)]): A list of tuples where the first
                element is a word split into characters (or subwords in later
                iterations), and the second element is the frequency of the
                word.

        Returns:
            char_frequencies (dict): A dictionary where the keys are the
                characters from the input corpus and the values are their
                frequencies.
        '''
        char_frequencies = dict()

        for word, word_freq in corpus:
            for char in word:
                if char in char_frequencies:
                    char_frequencies[char] += word_freq
                else:
                    char_frequencies[char] = word_freq

        return char_frequencies

    def find_pair_scores(self, pair_frequencies, char_frequencies):
        '''
        Find the pair score for each character pair in the corpus.

        Args:
            pair_frequencies (dict): A dictionary where the keys are the
                adjacent character pairs in the corpus and the values are the
                frequencies of each pair.
            char_frequencies (dict): A dictionary where the keys are the
                characters in the corpus and the values are the corresponding
                frequencies.

        Returns:
            pair_scores (dict): A dictionary where the keys are the adjacent
                character pairs in the input corpus and the values are the
                corresponding pair scores.
        '''
        pair_scores = dict()

        for pair in pair_frequencies.keys():
            char_1, char_2 = pair.split(',')
            # score = freq(pair) / (freq(char_1) * freq(char_2))
            denominator = char_frequencies[char_1] * char_frequencies[char_2]
            score = pair_frequencies[pair] / denominator
            pair_scores[pair] = score

        return pair_scores

    def get_merged_chars(self, char_1, char_2):
        '''
        Merge the highest score pair and return to the self.merge method.

        Remove the # symbols as necessary, merge the highest-scoring pair,
        then return the merged characters to the self.merge method.

        Args:
            char_1 (str): The first character in the highest-scoring pair.
            char_2 (str): The second character in the highest-scoring pair.

        Returns:
            merged_chars (str): Merged characters.
        '''
        if char_2.startswith('##'):
            merged_chars = char_1 + char_2[2:]
        else:
            merged_chars = char_1 + char_2

        return merged_chars

    def initialize_corpus(self, words):
        '''
        Split each word into characters and count the word frequency.

        Args:
            words (list[str]): A list of words (strings) in any order.

        Returns:
            corpus (list[tuple(list, int)]): A list of tuples where the first
                element is a word split into its individual (## marked)
                characters, and the second element is the frequency of the
                word.
        '''
        corpus = self.calculate_frequency(words)
        corpus = [(self.add_hashes(word), freq) for (word, freq) in corpus]
        return corpus

    def tokenize(self, text):
        '''
        Take in some text and return a list of tokens for that text.

        Args:
            text (str): The text to be tokenized.

        Returns:
            tokens (list): The list of tokens created from the input text.
        '''
        # Create cleaned vocabulary list without # and commas to check against
        clean_vocabulary = [word.replace('#', '').replace(',', '')
                            for word in self.vocabulary]
        clean_vocabulary.sort(key=lambda word: len(word))
        clean_vocabulary = clean_vocabulary[::-1]

        # Break down the text into the largest tokens first, then smallest
        remaining_string = text
        tokens = []
        keep_checking = True

        while keep_checking:
            keep_checking = False
            for vocab in clean_vocabulary:
                if remaining_string.startswith(vocab):
                    tokens.append(vocab)
                    remaining_string = remaining_string[len(vocab):]
                    keep_checking = True
                    # Restart from the longest vocabulary entries
                    break

        # Anything that matches no vocabulary entry is kept as one token
        if len(remaining_string) > 0:
            tokens.append(remaining_string)

        return tokens


wp = WordPiece()
wp.train(words, 30)

print(f'INITIAL CORPUS:\n{wp.corpus_history[0]}\n')
for rule, corpus in list(zip(wp.merge_rules, wp.corpus_history[1:])):
    print(f'NEW MERGE RULE: Combine "{rule[0]}" and "{rule[1]}"')
    print(corpus, end='\n\n')


INITIAL CORPUS:
[(['c', '##a', '##t'], 5), (['c', '##a', '##t', '##s'], 2), (['e', '##a', '##t'], 10), (['e', '##a', '##t', '##i', '##n', '##g'], 3), (['r', '##u', '##n', '##n', '##i', '##n', '##g'], 2), (['j', '##u', '##m', '##p', '##i', '##n', '##g'], 1), (['f', '##o', '##o', '##d'], 6)]

NEW MERGE RULE: Combine "##m" and "##p"
[(['c', '##a', '##t'], 5), (['c', '##a', '##t', '##s'], 2), (['e', '##a', '##t'], 10), (['e', '##a', '##t', '##i', '##n', '##g'], 3), (['r', '##u', '##n', '##n', '##i', '##n', '##g'], 2), (['j', '##u', '##mp', '##i', '##n', '##g'], 1), (['f', '##o', '##o', '##d'], 6)]

NEW MERGE RULE: Combine "r" and "##u"
[(['c', '##a', '##t'], 5), (['c', '##a', '##t', '##s'], 2), (['e', '##a', '##t'], 10), (['e', '##a', '##t', '##i', '##n', '##g'], 3), (['ru', '##n', '##n', '##i', '##n', '##g'], 2), (['j', '##u', '##mp', '##i', '##n', '##g'], 1), (['f', '##o', '##o', '##d'], 6)]

NEW MERGE RULE: Combine "j" and "##u"
[(['c', '##a', '##t'], 5), (['c', '##a', '##t', '##s'], 2), (['e', '##a', '##t'], 10), (['e', '##a', '##t', '##i', '##n', '##g'], 3), (['ru', '##n', '##n', '##i', '##n', '##g'], 2), (['ju', '##mp', '##i', '##n', '##g'], 1), (['f', '##o', '##o', '##d'], 6)]

NEW MERGE RULE: Combine "ju" and "##mp"
[(['c', '##a', '##t'], 5), (['c', '##a', '##t', '##s'], 2), (['e', '##a', '##t'], 10), (['e', '##a', '##t', '##i', '##n', '##g'], 3), (['ru', '##n', '##n', '##i', '##n', '##g'], 2), (['jump', '##i', '##n', '##g'], 1), (['f', '##o', '##o', '##d'], 6)]

NEW MERGE RULE: Combine "jump" and "##i"
[(['c', '##a', '##t'], 5), (['c', '##a', '##t', '##s'], 2), (['e', '##a', '##t'], 10), (['e', '##a', '##t', '##i', '##n', '##g'], 3), (['ru', '##n', '##n', '##i', '##n', '##g'], 2), (['jumpi', '##n', '##g'], 1), (['f', '##o', '##o', '##d'], 6)]

NEW MERGE RULE: Combine "##i" and "##n"
[(['c', '##a', '##t'], 5), (['c', '##a', '##t', '##s'], 2), (['e', '##a', '##t'], 10), (['e', '##a', '##t', '##in', '##g'], 3), (['ru', '##n', '##n', '##in', '##g'], 2), (['jumpi', '##n', '##g'], 1), (['f', '##o', '##o', '##d'], 6)]

NEW MERGE RULE: Combine "ru" and "##n"
[(['c', '##a', '##t'], 5), (['c', '##a', '##t', '##s'], 2), (['e', '##a', '##t'], 10), (['e', '##a', '##t', '##in', '##g'], 3), (['run', '##n', '##in', '##g'], 2), (['jumpi', '##n', '##g'], 1), (['f', '##o', '##o', '##d'], 6)]

NEW MERGE RULE: Combine "run" and "##n"
[(['c', '##a', '##t'], 5), (['c', '##a', '##t', '##s'], 2), (['e', '##a', '##t'], 10), (['e', '##a', '##t', '##in', '##g'], 3), (['runn', '##in', '##g'], 2), (['jumpi', '##n', '##g'], 1), (['f', '##o', '##o', '##d'], 6)]

NEW MERGE RULE: Combine "jumpi" and "##n"
[(['c', '##a', '##t'], 5), (['c', '##a', '##t', '##s'], 2), (['e', '##a', '##t'], 10), (['e', '##a', '##t', '##in', '##g'], 3), (['runn', '##in', '##g'], 2), (['jumpin', '##g'], 1), (['f', '##o', '##o', '##d'], 6)]

NEW MERGE RULE: Combine "runn" and "##in"
[(['c', '##a', '##t'], 5), (['c', '##a', '##t', '##s'], 2), (['e', '##a', '##t'], 10), (['e', '##a', '##t', '##in', '##g'], 3), (['runnin', '##g'], 2), (['jumpin', '##g'], 1), (['f', '##o', '##o', '##d'], 6)]

NEW MERGE RULE: Combine "##in" and "##g"
[(['c', '##a', '##t'], 5), (['c', '##a', '##t', '##s'], 2), (['e', '##a', '##t'], 10), (['e', '##a', '##t', '##ing'], 3), (['runnin', '##g'], 2), (['jumpin', '##g'], 1), (['f', '##o', '##o', '##d'], 6)]

NEW MERGE RULE: Combine "runnin" and "##g"
[(['c', '##a', '##t'], 5), (['c', '##a', '##t', '##s'], 2), (['e', '##a', '##t'], 10), (['e', '##a', '##t', '##ing'], 3), (['running'], 2), (['jumpin', '##g'], 1), (['f', '##o', '##o', '##d'], 6)]

NEW MERGE RULE: Combine "jumpin" and "##g"
[(['c', '##a', '##t'], 5), (['c', '##a', '##t', '##s'], 2), (['e', '##a', '##t'], 10), (['e', '##a', '##t', '##ing'], 3), (['running'], 2), (['jumping'], 1), (['f', '##o', '##o', '##d'], 6)]

NEW MERGE RULE: Combine "f" and "##o"
[(['c', '##a', '##t'], 5), (['c', '##a', '##t', '##s'], 2), (['e', '##a', '##t'], 10), (['e', '##a', '##t', '##ing'], 3), (['running'], 2), (['jumping'], 1), (['fo', '##o', '##d'], 6)]


print(wp.tokenize('jumper'))  # ['jump', 'e', 'r']

3. Unigram


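The Unigram model takes a statistical approach in which the probability of each word or character in a sentence is considered. Each element of such a token list can be regarded as a token t, and the probability of a sequence of tokens t1, t2, …, tn occurring is given by:

$$P(t_1, t_2, \ldots, t_n) = P(t_1) \times P(t_2) \times \cdots \times P(t_n)$$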
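Under a unigram model each token is treated as independent of the others, so the probability of a sequence is just the product of the individual token probabilities. Here is a minimal sketch with hypothetical probabilities chosen only for illustration; they are not estimated from any corpus.

```python
import math

# Hypothetical token probabilities, chosen only for illustration.
token_probs = {"Cats": 0.02, "are": 0.05, "great": 0.01}

def sequence_probability(tokens, probs):
    """Probability of a token sequence under the unigram assumption."""
    return math.prod(probs[token] for token in tokens)

p = sequence_probability(["Cats", "are", "great"], token_probs)
print(p)
```

In practice these products become very small for long sequences, so implementations commonly sum log-probabilities instead.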




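The vocabulary of a Unigram model starts out very large and is then reduced iteratively until it reaches the desired size. To build the initial vocabulary, find every possible substring in the corpus. For example, if the first word in the corpus is cats, the substrings ['c', 'a', 't', 's', 'ca', 'at', 'ts', 'cat', 'ats'] are added to the vocabulary.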
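The initial vocabulary can be sketched by enumerating contiguous substrings. The helper name `proper_substrings` is hypothetical, and it assumes the full word itself is excluded, which matches the list given above for cats.

```python
def proper_substrings(word):
    """All contiguous substrings shorter than the word, shortest first."""
    return [
        word[start:start + length]
        for length in range(1, len(word))
        for start in range(len(word) - length + 1)
    ]

substrings = proper_substrings("cats")
print(substrings)
```

The number of substrings grows quadratically with word length, which is why the starting vocabulary is so large and must be pruned.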





['c', 'a', 't']


['c', 'at']




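Since the segmentation ['ca', 't'] has the highest probability score, it is the one used to tokenize the word: cat is tokenized as ['ca', 't']. As you can imagine, for longer words such as tokenization, splits can occur at several positions throughout the word, for example ['token', 'iza', 'tion'] or ['token', 'ization'].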
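Choosing a segmentation can be sketched by enumerating every split of the word into known subwords and scoring each one as a product of subword probabilities. The probabilities below are hypothetical and picked so that ['ca', 't'] wins, as in the text. Brute-force enumeration is used here only for clarity, whereas real implementations typically rely on dynamic programming.

```python
import math

# Hypothetical subword probabilities, chosen so that ['ca', 't'] wins.
probs = {"c": 0.2, "a": 0.1, "t": 0.3, "ca": 0.25, "at": 0.15}

def segmentations(word, vocab):
    """Yield every way to split `word` into pieces that are all in `vocab`."""
    if not word:
        yield []
        return
    for end in range(1, len(word) + 1):
        piece = word[:end]
        if piece in vocab:
            for rest in segmentations(word[end:], vocab):
                yield [piece] + rest

def score(pieces):
    """Unigram score: product of the individual piece probabilities."""
    return math.prod(probs[piece] for piece in pieces)

candidates = list(segmentations("cat", probs))
best = max(candidates, key=score)
print(candidates)
print(best)
```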












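[CLS] - This token stands for "classification" and marks the start of the input text. BERT requires it because one of the tasks it was trained on is classification (hence the token's name). The model still expects the token even when it is not being used for a classification task.

[SEP] - This token stands for "separation" and separates sentences in the input. This is useful for many of the tasks BERT performs, including handling multiple instructions at once in the same prompt [15].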
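The placement of these two tokens can be illustrated with plain Python lists. The helper `add_special_tokens` is a hypothetical function written for this sketch, not a call from the tokenizers library, which inserts these tokens automatically during post-processing.

```python
def add_special_tokens(sentence_a, sentence_b=None):
    """Wrap one or two token lists in BERT-style [CLS]/[SEP] markers."""
    tokens = ["[CLS]"] + sentence_a + ["[SEP]"]
    if sentence_b is not None:
        tokens += sentence_b + ["[SEP]"]
    return tokens

single = add_special_tokens(["cats", "are", "great"])
pair = add_special_tokens(["cats", "are", "great"], ["dogs", "are", "better"])
print(single)
print(pair)
```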


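The tokenizers library makes it very easy to use a pretrained tokenizer. Simply import the Tokenizer class, call its from_pretrained method, and pass in the name of the model whose tokenizer you want to use. A list of models is available in [16].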

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained('bert-base-cased')


BertWordPieceTokenizer - The famous Bert tokenizer, using WordPiece
CharBPETokenizer - The original BPE
ByteLevelBPETokenizer - The byte level version of the BPE
SentencePieceBPETokenizer - A BPE implementation compatible with the one used by SentencePiece


# Import a tokenizer
from tokenizers import BertWordPieceTokenizer, CharBPETokenizer, \
                       ByteLevelBPETokenizer, SentencePieceBPETokenizer

# Instantiate the model
tokenizer = CharBPETokenizer()

# Train the model
tokenizer.train(['./path/to/files/1.txt', './path/to/files/2.txt'])

# Tokenize some text
encoded = tokenizer.encode('I can feel the magic, can you?')

# Save the model
tokenizer.save('./path/to/directory/my-bpe.tokenizer.json')


from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers, \
                       processors

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE())

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)

# And then train
trainer = trainers.BpeTrainer(
    vocab_size=20000,
    min_frequency=2,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet()
)
tokenizer.train([
    "./path/to/dataset/1.txt",
    "./path/to/dataset/2.txt",
    "./path/to/dataset/3.txt"
], trainer=trainer)

# And save it
tokenizer.save("byte-level-bpe.tokenizer.json", pretty=True)

Summary

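The tokenization pipeline is a key part of a language model, and the choice of tokenizer deserves careful thought. Although Hugging Face handles this part of the work for us, a solid understanding of tokenization methods matters a great deal when fine-tuning models and when trying to get good performance on different datasets.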

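PS: Source: "Tokenization Guide: Byte Pair Encoding, WordPiece, and Other Methods, with Python Code Explained". Tags: OpenAI, large language models, Python, artificial intelligence. Author: anonymous.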


