1 Title
CPM: A Large-scale Generative Chinese Pre-trained Language Model
2 Abstract
- Pre-trained Language Models (PLMs) have proven to be beneficial for various downstream NLP tasks.
- GPT-3 demonstrates the capacity of few-shot (even zero-shot) learning.
- the Chinese Pre-trained Language Model (CPM): to date the largest Chinese pre-trained language model, with 2.6B parameters trained on 100GB of Chinese data
- The code and parameters are available at https://github.com/TsinghuaAI/CPM.
3 Introduction
- PLMs
- ELMo: first introduces bidirectional language models to learn contextual word vectors via large-scale pre-training.
- GPT: applies generative pre-training to a Transformer-based language model, improving natural language understanding on a wide range of benchmarks.
- BERT: pre-trains deep bidirectional representations on unlabeled text by jointly conditioning on both left and right contexts.
- later works enhance BERT with dynamic masking, parameter sharing, and modified pre-training tasks.
- other works introduce external knowledge into language representation learning through auxiliary pre-training tasks.
- GPT-3
- proven effective on various NLP tasks in few-shot (even zero-shot) settings.
- applying it to Chinese NLP tasks remains challenging, as the training corpus of GPT-3 is primarily English and its parameters are not publicly available.
- Previous works providing powerful Chinese pre-trained language models are limited by model size.
- CPM
- a Transformer-based autoregressive language model
- with 2.6 billion parameters and 100GB Chinese training data
- the largest Chinese pre-trained language model
- could facilitate downstream Chinese NLP tasks, such as conversation, essay generation, cloze tests, and language understanding.
- experiments demonstrate that CPM achieves strong performance on many Chinese NLP tasks in few-shot (even zero-shot) settings.
- Larger models are more proficient at language generation and language understanding.
4 Method
4.1 Chinese PLM
- The model is a left-to-right Transformer decoder, similar to the architecture of GPT.
- Vocabulary Construction
- previous works on Chinese pre-trained models usually adopt the sub-word vocabulary of BERT-Chinese, which splits the input text into a character-level sequence.
- Chinese words usually consist of several characters, so important word-level semantics are lost in a character-level sequence.
- We construct a new sub-word vocabulary containing both words and characters; the sketch below contrasts the two granularities.
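A minimal, purely illustrative sketch of the two granularities. The jieba segmenter and the example sentence are not part of CPM; they only stand in for a word-segmented view of Chinese text.

```python
# Illustrative only: contrast character-level tokens (BERT-Chinese style)
# with word-level tokens. jieba is used purely as an example segmenter;
# it is not part of the CPM pipeline described in the paper.
import jieba  # pip install jieba

sentence = "自然语言处理很有趣"

char_tokens = list(sentence)        # one token per character
word_tokens = jieba.lcut(sentence)  # e.g. ['自然语言', '处理', '很', '有趣']

print(char_tokens)  # ['自', '然', '语', '言', '处', '理', '很', '有', '趣']
print(word_tokens)
```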
- Training Strategy
- adopt a large batch size, since the word distribution of Chinese is sparser than that of English (the batch size is two times larger than that of GPT-3)
- partition the model across GPUs along the width dimension to enable large-scale training and to reduce data transfer among nodes; see the model-parallel sketch below.
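A minimal sketch of what partitioning along the width dimension means, assuming a Megatron-LM-style column split of a projection matrix. It simulates the per-GPU shards on a single device instead of using real multi-GPU communication, so it is not the CPM training code.

```python
# Sketch: split a linear layer's weight matrix by output columns, as if each
# shard lived on a different GPU, and concatenate the shard outputs.
import torch

hidden, n_gpus = 8, 4                      # toy sizes
x = torch.randn(3, hidden)                 # a batch of 3 token vectors

full_weight = torch.randn(hidden, hidden)  # the full projection matrix
shards = full_weight.chunk(n_gpus, dim=1)  # column-wise split across "GPUs"

# Each "GPU" computes only its slice of the output dimension.
partial_outputs = [x @ w_shard for w_shard in shards]

# Gathering the slices reproduces the single-device result.
parallel_out = torch.cat(partial_outputs, dim=1)
reference_out = x @ full_weight
print(torch.allclose(parallel_out, reference_out))  # True
```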
4.2 Data processing
- We construct the new sub-word vocabulary from the word-segmented corpus using a unigram language model.
- a special token is used as the splitter between words, so the sub-word tokenization is reversible.
- in contrast, the tokenization of BERT-Chinese is irreversible, because it inserts extra spaces between Chinese characters and treats them the same as the original spaces in the text.
- collect different kinds of texts for pre-training, including encyclopedia, news, novels, and Q&A.
- concatenate different documents by adding an "end of document" token after each one, to make full use of the input length; see the data-processing sketch below.
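A sketch of these two data-processing steps under assumptions: it uses the SentencePiece library as a stand-in unigram trainer, and the corpus file name, vocabulary size, and "<eod>" token are illustrative placeholders, not the released CPM pipeline.

```python
import sentencepiece as spm

# 1) Unigram sub-word vocabulary trained on a word-segmented corpus.
#    "segmented_corpus.txt", vocab_size and "<eod>" are placeholders.
spm.SentencePieceTrainer.train(
    input="segmented_corpus.txt",     # one word-segmented line per sentence
    model_prefix="cpm_subword",
    vocab_size=30000,
    model_type="unigram",
    user_defined_symbols=["<eod>"],   # reserve an end-of-document token
)
sp = spm.SentencePieceProcessor(model_file="cpm_subword.model")
eod_id = sp.piece_to_id("<eod>")

# 2) Concatenate documents, appending the end-of-document token after each,
#    then pack the token stream into fixed-length training samples.
documents = ["这是第一篇文档。", "这是第二篇文档。"]
token_ids = []
for doc in documents:
    token_ids.extend(sp.encode(doc))
    token_ids.append(eod_id)

max_len = 1024
samples = [token_ids[i:i + max_len] for i in range(0, len(token_ids), max_len)]
```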
4.3 Pre-training details
- perform hyper-parameter search over the learning rate and batch size, to make model training more stable
- set the batch size to 3072
- the first version adopts dense attention with a max sequence length of 1024
- sparse attention will be implemented in the future
- pre-train the model for 20,000 steps, with the first 5,000 steps used for warm-up
- the optimizer is Adam
- training uses 64 NVIDIA V100 GPUs and takes two weeks; see the training-setup sketch below
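A sketch of the optimization setup described above (Adam, 20,000 steps, 5,000 warm-up steps, batch size 3072, max sequence length 1024), assuming a linear warm-up schedule. The base learning rate, the decay shape after warm-up, and the stand-in model are placeholders, not the released training script.

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import LambdaLR

TOTAL_STEPS, WARMUP_STEPS = 20_000, 5_000   # from the notes above
BATCH_SIZE, MAX_SEQ_LEN = 3072, 1024        # from the notes above
BASE_LR = 1e-4                              # placeholder: the exact value is not given here

model = torch.nn.Linear(MAX_SEQ_LEN, MAX_SEQ_LEN)  # stand-in for the 2.6B-parameter model
optimizer = Adam(model.parameters(), lr=BASE_LR)

def lr_lambda(step: int) -> float:
    """Linear warm-up over the first 5,000 steps, then linear decay
    (the decay shape after warm-up is an assumption, not stated above)."""
    if step < WARMUP_STEPS:
        return step / WARMUP_STEPS
    return max(0.0, (TOTAL_STEPS - step) / (TOTAL_STEPS - WARMUP_STEPS))

scheduler = LambdaLR(optimizer, lr_lambda)

# Peek at the schedule at a few steps instead of running a real training loop.
for step in range(TOTAL_STEPS):
    if step in (0, 2_500, 5_000, 12_500, 19_999):
        print(step, scheduler.get_last_lr())
    optimizer.step()
    scheduler.step()
```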
5 Experiments
6 Conclusion
- In this paper
- explore training a large-scale generative language model on Chinese corpora and release CPM.
- experimental results show that CPM excels in several downstream Chinese NLP tasks.
- In the future
- add more training data and increase the model size.
- optimize the training framework, such as the data-transfer scheme between nodes, to accelerate the pre-training process.
- reduce the model size by model compression.
- include diverse data to enhance model performance
- For text data, add a multi-lingual corpus to train a large-scale Chinese-centered multi-lingual language model (maybe finish this work in VISCpm)
- explore new learning algorithms to train a joint model, which can learn from both texts and knowledge graphs, for better general intelligence.