1 Title
CPM: A Large-scale Generative Chinese Pre-trained Language Model
2 Abstract
- Pre-trained Language Models (PLMs) have proven to be beneficial for various downstream NLP tasks.
- GPT-3 demonstrates the capacity of few-shot (even zero-shot) learning.
- the Chinese Pre-trained Language Model (CPM): to date the largest Chinese pre-trained language model, with 2.6B parameters trained on 100GB of Chinese data
- The code and parameters are available at https://github.com/TsinghuaAI/CPM.
3 Introduction
- PLMs
- ELMo: first introduces bidirectional language models to learn contextual word vectors via large-scale pre-training.
- GPT: applies generative pre-training to a Transformer-based language model, improving natural language understanding on a wide range of benchmarks.
- BERT: pre-trains deep bidirectional representations on unlabeled text by jointly conditioning on both left and right contexts.
- later works enhance BERT with dynamic masking, parameter sharing, and modified pre-training tasks.
- other works introduce external knowledge into language representation learning through auxiliary pre-training tasks.
- GPT-3
- proven effective on various NLP tasks in few-shot (even zero-shot) settings.
- applying it to Chinese NLP tasks remains challenging, as the training corpus of GPT-3 is primarily English and its parameters are not publicly available.
- Previous works providing powerful Chinese pre-trained language models are limited by model size.
- CPM
- a Transformer-based autoregressive language model
- with 2.6 billion parameters and 100GB Chinese training data
- the largest Chinese pre-trained language model
- could facilitate downstream Chinese NLP tasks, such as conversation, essay generation, cloze tests, and language understanding.
- experiments demonstrate that CPM achieves strong performance on many Chinese NLP tasks in few-shot (even zero-shot) settings.
- Larger models are more proficient at language generation and language understanding.
4 Method
4.1 Chinese PLM
- The model is a left-to-right Transformer decoder, similar to the architecture of GPT.
- Vocabulary Construction
- previous works on Chinese pre-trained models usually adopt the sub-word vocabulary of BERT-Chinese, which splits the input text into a character-level sequence.
- Chinese words usually consist of several characters, so important word-level semantics are lost in a character-level sequence.
- We construct a new sub-word vocabulary containing both words and characters; the sketch below contrasts the two granularities.
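A minimal, purely illustrative sketch of the two granularities. The jieba segmenter and the example sentence are not part of CPM; they only stand in for a word-segmented view of Chinese text.

```python
# Illustrative only: contrast character-level tokens (BERT-Chinese style)
# with word-level tokens. jieba is used purely as an example segmenter;
# it is not part of the CPM pipeline described in the paper.
import jieba  # pip install jieba

sentence = "自然语言处理很有趣"

char_tokens = list(sentence)        # one token per character
word_tokens = jieba.lcut(sentence)  # e.g. ['自然语言', '处理', '很', '有趣']

print(char_tokens)  # ['自', '然', '语', '言', '处', '理', '很', '有', '趣']
print(word_tokens)
```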
- Training Strategy
- adopt a large batch size, since the word distribution of Chinese is sparser than that of English (the batch size is two times larger than that of GPT-3)
- partition the model across GPUs along the width dimension to enable large-scale training and to reduce data transfer among nodes; see the model-parallel sketch below.
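A minimal sketch of what partitioning along the width dimension means, assuming a Megatron-LM-style column split of a projection matrix. It simulates the per-GPU shards on a single device instead of using real multi-GPU communication, so it is not the CPM training code.

```python
# Sketch: split a linear layer's weight matrix by output columns, as if each
# shard lived on a different GPU, and concatenate the shard outputs.
import torch

hidden, n_gpus = 8, 4                      # toy sizes
x = torch.randn(3, hidden)                 # a batch of 3 token vectors

full_weight = torch.randn(hidden, hidden)  # the full projection matrix
shards = full_weight.chunk(n_gpus, dim=1)  # column-wise split across "GPUs"

# Each "GPU" computes only its slice of the output dimension.
partial_outputs = [x @ w_shard for w_shard in shards]

# Gathering the slices reproduces the single-device result.
parallel_out = torch.cat(partial_outputs, dim=1)
reference_out = x @ full_weight
print(torch.allclose(parallel_out, reference_out))  # True
```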
4.2 Data processing
- We construct the new sub-word vocabulary from the word-segmented corpus using a unigram language model.
- a special token is used as the splitter between words, so the sub-word tokenization is reversible.
- in contrast, the tokenization of BERT-Chinese is irreversible, because it inserts extra spaces between Chinese characters and treats them the same as the original spaces in the text.
- collect different kinds of texts for pre-training, including encyclopedia, news, novels, and Q&A.
- concatenate different documents by adding an "end of document" token after each one, to make full use of the input length; see the data-processing sketch below.
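A sketch of these two data-processing steps under assumptions: it uses the SentencePiece library as a stand-in unigram trainer, and the corpus file name, vocabulary size, and "<eod>" token are illustrative placeholders, not the released CPM pipeline.

```python
import sentencepiece as spm

# 1) Unigram sub-word vocabulary trained on a word-segmented corpus.
#    "segmented_corpus.txt", vocab_size and "<eod>" are placeholders.
spm.SentencePieceTrainer.train(
    input="segmented_corpus.txt",     # one word-segmented line per sentence
    model_prefix="cpm_subword",
    vocab_size=30000,
    model_type="unigram",
    user_defined_symbols=["<eod>"],   # reserve an end-of-document token
)
sp = spm.SentencePieceProcessor(model_file="cpm_subword.model")
eod_id = sp.piece_to_id("<eod>")

# 2) Concatenate documents, appending the end-of-document token after each,
#    then pack the token stream into fixed-length training samples.
documents = ["这是第一篇文档。", "这是第二篇文档。"]
token_ids = []
for doc in documents:
    token_ids.extend(sp.encode(doc))
    token_ids.append(eod_id)

max_len = 1024
samples = [token_ids[i:i + max_len] for i in range(0, len(token_ids), max_len)]
```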
4.3 Pre-training details
- perform hyper-parameter search over the learning rate and batch size, to make model training more stable
- set the batch size to 3072
- the first version adopts dense attention with a max sequence length of 1024
- sparse attention will be implemented in the future
- pre-train the model for 20,000 steps, with the first 5,000 steps used for warm-up
- the optimizer is Adam
- training uses 64 NVIDIA V100 GPUs and takes two weeks; see the training-setup sketch below
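A sketch of the optimization setup described above (Adam, 20,000 steps, 5,000 warm-up steps, batch size 3072, max sequence length 1024), assuming a linear warm-up schedule. The base learning rate, the decay shape after warm-up, and the stand-in model are placeholders, not the released training script.

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import LambdaLR

TOTAL_STEPS, WARMUP_STEPS = 20_000, 5_000   # from the notes above
BATCH_SIZE, MAX_SEQ_LEN = 3072, 1024        # from the notes above
BASE_LR = 1e-4                              # placeholder: the exact value is not given here

model = torch.nn.Linear(MAX_SEQ_LEN, MAX_SEQ_LEN)  # stand-in for the 2.6B-parameter model
optimizer = Adam(model.parameters(), lr=BASE_LR)

def lr_lambda(step: int) -> float:
    """Linear warm-up over the first 5,000 steps, then linear decay
    (the decay shape after warm-up is an assumption, not stated above)."""
    if step < WARMUP_STEPS:
        return step / WARMUP_STEPS
    return max(0.0, (TOTAL_STEPS - step) / (TOTAL_STEPS - WARMUP_STEPS))

scheduler = LambdaLR(optimizer, lr_lambda)

# Peek at the schedule at a few steps instead of running a real training loop.
for step in range(TOTAL_STEPS):
    if step in (0, 2_500, 5_000, 12_500, 19_999):
        print(step, scheduler.get_last_lr())
    optimizer.step()
    scheduler.step()
```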
5 Experiments
6 Conclusion
- In this paper
- explore training a large-scale generative language model on Chinese corpora and release CPM.
- experimental results show that CPM excels in several downstream Chinese NLP tasks.
- In the future
- add more training data and increase the model size.
- optimize the training framework, such as the data-transfer scheme between nodes, to accelerate the pre-training process.
- reduce the model size by model compression.
- include diverse data to enhance model performance
- For text data, add a multi-lingual corpus to train a large-scale Chinese-centered multi-lingual language model (maybe finish this work in VISCpm)
- explore new learning algorithms to train a joint model, which can learn from both texts and knowledge graphs, for better general intelligence.