A first try at building a language model with tensorflow


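Background

Language model (LM): given the first k words of a sentence, predict the (k+1)-th word by producing a probability distribution over the words that could appear in that position, p(x_{k+1} | x_1, x_2, ..., x_k).

In other words, a language model is a fitted probability model that can assign a probability to a sentence. It learns from existing text in order to predict the words that come next.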
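To make the definition concrete, here is a minimal sketch that estimates p(next word | previous word) from bigram counts. It conditions on only one previous word (k = 1) and uses a made-up three-sentence corpus. It is a toy illustration and not the LSTM model trained later in this article. The function name next_word_distribution is hypothetical.

```python
from __future__ import division

from collections import Counter, defaultdict


def next_word_distribution(corpus, context):
    """Estimate p(next word | previous word) from bigram counts.

    Toy illustration only: conditions on a single previous word.
    """
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    following = counts[context]
    total = sum(following.values())
    if total == 0:
        return {}  # context never seen with a following word
    return {word: n / total for word, n in following.items()}


# Made-up corpus, for illustration only.
corpus = ["the cat sat", "the cat ran", "the dog sat"]
dist = next_word_distribution(corpus, "the")
print(dist)
```

After "the", the corpus contains "cat" twice and "dog" once, so the distribution puts two thirds of the mass on "cat" and one third on "dog".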

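First try

For this small experiment, this article uses the Penn Treebank (PTB) dataset. The compressed archive is only 33 MB, so it takes little disk space and trains quickly, which makes it well suited to learning.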

Install tensorflow

At the time of writing, tensorflow did not support Python 3.7, so this article uses Python 2.7.

$ sudo pip install --index-url http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com tensorflow

Download the PTB dataset

$ wget "http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz"
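The run command later in this article reads from simple-examples/data/, so the archive has to be unpacked first. The article does not show that step. Below is a minimal sketch using Python's standard tarfile module. The helper name extract_archive is hypothetical, and unpacking into the current directory is an assumption.

```python
import os
import tarfile


def extract_archive(archive_path, dest_dir="."):
    """Unpack a gzip-compressed tar archive into dest_dir."""
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(path=dest_dir)
    return dest_dir


# Only runs when the archive from the wget step is present
# (assumed to be in the current directory).
if os.path.exists("simple-examples.tgz"):
    extract_archive("simple-examples.tgz")
    print(os.path.isdir(os.path.join("simple-examples", "data")))
```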

Download the code

  • ptb_word_lm: https://github.com/tensorflow/models/blob/master/tutorials/rnn/ptb/ptb_word_lm.py
  • reader: https://github.com/tensorflow/models/blob/master/tutorials/rnn/ptb/reader.py
  • util: https://github.com/tensorflow/models/blob/master/tutorials/rnn/ptb/util.py
$ wget "https://raw.githubusercontent.com/tensorflow/models/master/tutorials/rnn/ptb/ptb_word_lm.py"
$ wget "https://raw.githubusercontent.com/tensorflow/models/master/tutorials/rnn/ptb/reader.py"
$ wget "https://raw.githubusercontent.com/tensorflow/models/master/tutorials/rnn/ptb/util.py"

Run

$ time python ptb_word_lm.py --data_path=simple-examples/data/ --num_gpus=0

Elapsed time

real    22m10.767s
user    373m22.734s
sys     28m11.109s

Output

Epoch: 1 Learning rate: 1.000
0.004 perplexity: 4887.851 speed: 2941 wps
0.104 perplexity: 833.255 speed: 9325 wps
0.204 perplexity: 623.097 speed: 9819 wps
0.304 perplexity: 508.344 speed: 10000 wps
0.404 perplexity: 441.620 speed: 10097 wps
0.504 perplexity: 397.859 speed: 10155 wps
0.604 perplexity: 360.096 speed: 10196 wps
0.703 perplexity: 334.278 speed: 10224 wps
0.803 perplexity: 313.727 speed: 10247 wps
0.903 perplexity: 295.207 speed: 10262 wps
Epoch: 1 Train Perplexity: 281.051
Epoch: 1 Valid Perplexity: 181.208
…………
0.004 perplexity: 67.781 speed: 9579 wps
0.104 perplexity: 52.330 speed: 10272 wps
0.204 perplexity: 56.718 speed: 10292 wps
0.304 perplexity: 54.545 speed: 10293 wps
0.404 perplexity: 53.512 speed: 10291 wps
0.504 perplexity: 52.704 speed: 10300 wps
0.604 perplexity: 50.949 speed: 10305 wps
0.703 perplexity: 50.123 speed: 10305 wps
0.803 perplexity: 49.264 speed: 10313 wps
0.903 perplexity: 47.685 speed: 10322 wps
Epoch: 13 Train Perplexity: 46.632
Epoch: 13 Valid Perplexity: 127.323

perplexity

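PPL (perplexity): estimates the probability of a sentence from the probability of each of its words, normalized by sentence length. It is used to measure how well the language model has converged.

The smaller the PPL, the larger p(w_i), the higher the probability assigned to the expected sentence, and the better the model.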
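As a worked illustration, perplexity can be written as the exponential of the average negative log-probability per word. The sketch below applies that formula to hand-picked word probabilities. The function name and the sample probabilities are made up for illustration. As far as I can tell, the training script reports the same kind of quantity: the exponential of the accumulated cross-entropy cost divided by the number of steps.

```python
import math


def perplexity(word_probs):
    """exp of the average negative log-probability per word."""
    n = len(word_probs)
    return math.exp(-sum(math.log(p) for p in word_probs) / n)


# Made-up probabilities p(w_i) that a model assigned to each word.
uniform_guess = perplexity([0.25, 0.25, 0.25, 0.25])
confident_model = perplexity([0.9, 0.8, 0.9, 0.7])
print(uniform_guess, confident_model)
```

A model that gives every word probability 1/4 has a perplexity of about 4, as if it were choosing uniformly among four words. Higher per-word probabilities push the perplexity down toward 1.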

WPS

WPS (words per second): the training speed.
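The wps figure can be sanity-checked against the elapsed time reported earlier. The sketch below is back-of-the-envelope arithmetic with several assumptions: about 929,000 words in the PTB training set (a figure not stated in this article), a steady speed of about 10,300 wps taken from the log, and 13 epochs as in the log.

```python
# Assumption: approximate size of the PTB training set, not stated in the article.
TRAIN_WORDS = 929000
# Roughly the steady-state speed in the log above.
WPS = 10300
# The log above ends at epoch 13.
EPOCHS = 13

seconds_per_epoch = TRAIN_WORDS / float(WPS)
total_minutes = EPOCHS * seconds_per_epoch / 60.0
print(round(seconds_per_epoch, 1), round(total_minutes, 1))
```

Under these assumptions, one training epoch takes about 90 seconds and 13 epochs take a little under 20 minutes. That is in line with the 22 minutes of real time measured above, since the run also evaluates on the validation and test sets.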

Walkthrough

In ptb_word_lm.py, the first step is to generate the data.

Parameters

Parameter       Small   Medium  Large   Test    Description
init_scale      0.1     0.05    0.04    0.1     the initial scale of the weights
learning_rate   1.0     1.0     1.0     1.0     the initial value of the learning rate
max_grad_norm   5       5       10      1       the maximum permissible norm of the gradient
num_layers      2       2       2       1       the number of LSTM layers
num_steps       20      35      35      2       the number of unrolled steps of LSTM
hidden_size     200     650     1500    2       the number of LSTM units
max_epoch       4       6       14      1       the number of epochs trained with the initial learning rate
max_max_epoch   13      39      55      1       the total number of epochs for training
keep_prob       1.0     0.5     0.35    1.0     the probability of keeping weights in the dropout layer
lr_decay        0.5     0.8     1/1.15  0.5     the decay of the learning rate for each epoch after max_epoch
batch_size      20      20      20      20      the batch size
vocab_size      10000   10000   10000   10000
rnn_mode        BLOCK   BLOCK   BLOCK   BLOCK   the low level implementation of lstm cell
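The descriptions of max_epoch and lr_decay imply a learning rate that stays at its initial value for max_epoch epochs and is then multiplied by lr_decay once per epoch. The sketch below spells that out for the Small column. The function name is hypothetical, and the formula is my reading of the parameter descriptions and the script, not a copy of its code.

```python
def learning_rate_schedule(learning_rate, lr_decay, max_epoch, max_max_epoch):
    """Per-epoch learning rates: constant for max_epoch epochs, then decayed."""
    rates = []
    for i in range(max_max_epoch):
        # The exponent is 0 while i + 1 <= max_epoch, then grows by 1 per epoch.
        decay = lr_decay ** max(i + 1 - max_epoch, 0.0)
        rates.append(learning_rate * decay)
    return rates


# Values from the Small column of the table above.
small = learning_rate_schedule(1.0, 0.5, 4, 13)
for epoch, rate in enumerate(small, start=1):
    print("Epoch: %d Learning rate: %.3f" % (epoch, rate))
```

With the Small values, epochs 1 to 4 train at 1.0, epoch 5 at 0.5, and each later epoch halves the rate again.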

Generate data

  raw_data = reader.ptb_raw_data(FLAGS.data_path)
  train_data, valid_data, test_data, _ = raw_data
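The returned train_data, valid_data and test_data are lists of integer word ids. The sketch below shows the general idea as I understand it from reader.py: build a vocabulary from the training text, ordered by descending frequency, then map each word to its index. It is a simplified re-creation that works on an in-memory string. The helper names and the sample text are hypothetical, and the real reader works on the PTB files.

```python
import collections


def read_words(text):
    """Split text into words, marking each line end with <eos>."""
    return text.replace("\n", " <eos> ").split()


def build_vocab(words):
    """Map each word to an id, most frequent first (ties broken alphabetically)."""
    counter = collections.Counter(words)
    pairs = sorted(counter.items(), key=lambda item: (-item[1], item[0]))
    return {word: i for i, (word, _) in enumerate(pairs)}


def words_to_ids(words, word_to_id):
    """Convert words to ids, skipping words missing from the vocabulary."""
    return [word_to_id[w] for w in words if w in word_to_id]


# Made-up text standing in for the PTB training file.
train_text = "the cat sat\nthe dog sat\n"
vocab = build_vocab(read_words(train_text))
train_ids = words_to_ids(read_words(train_text), vocab)
print(vocab)
print(train_ids)
```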

Visualization

The data is prepared inside the PTBProducer name scope.
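As a rough picture of what happens inside that scope, the sketch below re-creates the batching idea in plain Python, without tensorflow. The flat list of word ids is cut into batch_size rows, each step takes a window of num_steps ids from every row as the input, and the target is the same window shifted one position to the right. The function name ptb_batches is hypothetical, and this is my understanding of the producer, not its actual code.

```python
def ptb_batches(raw_data, batch_size, num_steps):
    """Yield (inputs, targets) pairs; targets are inputs shifted by one word."""
    batch_len = len(raw_data) // batch_size
    # One contiguous stretch of the corpus per row; leftover ids are dropped.
    rows = [raw_data[i * batch_len:(i + 1) * batch_len] for i in range(batch_size)]
    # The -1 leaves room for the shifted targets of the last window.
    epoch_size = (batch_len - 1) // num_steps
    for i in range(epoch_size):
        start = i * num_steps
        x = [row[start:start + num_steps] for row in rows]
        y = [row[start + 1:start + num_steps + 1] for row in rows]
        yield x, y


# Made-up word ids 0..19 standing in for real data.
batches = list(ptb_batches(list(range(20)), batch_size=2, num_steps=3))
for x, y in batches:
    print(x, y)
```

With 20 ids, batch_size 2 and num_steps 3, each row holds 10 ids and one epoch yields 3 batches. The first input is [[0, 1, 2], [10, 11, 12]] with target [[1, 2, 3], [11, 12, 13]].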


[Figure: internal details]
