A First Look at OpenNMT-py
Background
OpenNMT is an open-source neural machine translation system.
It comes in two implementations, OpenNMT-py (PyTorch) and OpenNMT-tf (TensorFlow), which differ slightly in features.
This post uses OpenNMT-py and walks through its basic workflow.
First look
Installation
pip install OpenNMT-py
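A quick way to confirm that the installation succeeded is to import the package from Python; the version attribute is only printed if the installed release exposes it, so treat that part as an assumption:

# Quick sanity check that the package installed correctly.
import onmt
print(onmt.__file__)                            # where the package was installed
print(getattr(onmt, "__version__", "unknown"))  # version string, if the release exposes one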
Quick start
Data preparation
wget https://s3.amazonaws.com/opennmt-trainingdata/toy-ende.tar.gz
tar xvf toy-ende.tar.gz
cd toy-ende
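The archive unpacks to plain parallel text files, one tokenized sentence per line, with source and target files aligned line by line. A quick peek (run from the directory that contains toy-ende/):

# Inspect the toy English-German parallel data.
with open("toy-ende/src-train.txt", encoding="utf-8") as f_src, \
     open("toy-ende/tgt-train.txt", encoding="utf-8") as f_tgt:
    src_lines = f_src.readlines()
    tgt_lines = f_tgt.readlines()

print(len(src_lines), len(tgt_lines))  # 10000 aligned sentence pairs in the toy training set
print(src_lines[0].strip())            # first English sentence
print(tgt_lines[0].strip())            # its German translation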
Data preprocessing
$ onmt_preprocess -train_src toy-ende/src-train.txt -train_tgt toy-ende/tgt-train.txt -valid_src toy-ende/src-val.txt -valid_tgt toy-ende/tgt-val.txt -save_data toy-ende/demo
[2020-10-08 14:06:44,423 INFO] Extracting features...
[2020-10-08 14:06:44,425 INFO] * number of source features: 0.
[2020-10-08 14:06:44,425 INFO] * number of target features: 0.
[2020-10-08 14:06:44,425 INFO] Building `Fields` object...
[2020-10-08 14:06:44,425 INFO] Building & saving training data...
[2020-10-08 14:06:44,479 INFO] Building shard 0.
[2020-10-08 14:06:44,998 INFO] * saving 0th train data shard to toy-ende/demo.train.0.pt.
[2020-10-08 14:06:45,690 INFO] * tgt vocab size: 35820.
[2020-10-08 14:06:45,735 INFO] * src vocab size: 24997.
[2020-10-08 14:06:45,940 INFO] Building & saving validation data...
[2020-10-08 14:06:46,187 INFO] Building shard 0.
[2020-10-08 14:06:46,267 INFO] * saving 0th valid data shard to toy-ende/demo.valid.0.pt.
The preprocessing step produces three files:
demo.train.0.pt: serialized PyTorch file containing the training data (one file per shard)
demo.valid.0.pt: serialized PyTorch file containing the validation data
demo.vocab.pt: serialized PyTorch file containing the vocabulary
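These artifacts are ordinary torch-serialized objects, so they can be loaded and inspected directly. A minimal sketch (run it in the same environment used for preprocessing, since unpickling needs the torchtext classes; the exact internal structure depends on the OpenNMT-py 1.x release):

import torch

# The preprocessing outputs are plain torch-serialized objects.
vocab = torch.load("toy-ende/demo.vocab.pt")
train_shard = torch.load("toy-ende/demo.train.0.pt")

print(type(vocab))                                               # a dict of torchtext fields in 1.x
print(list(vocab.keys()) if isinstance(vocab, dict) else vocab)  # e.g. 'src' and 'tgt'
print(type(train_shard))                                         # the serialized training shard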
Model training
$ onmt_train -data toy-ende/demo -save_model demo-model
[2020-10-08 14:09:02,315 INFO] * src vocab size = 24997
[2020-10-08 14:09:02,315 INFO] * tgt vocab size = 35820
[2020-10-08 14:09:02,315 INFO] Building model...
[2020-10-08 14:09:03,134 INFO] NMTModel(
(encoder): RNNEncoder(
(embeddings): Embeddings(
(make_embedding): Sequential(
(emb_luts): Elementwise(
(0): Embedding(24997, 500, padding_idx=1)
)
)
)
(rnn): LSTM(500, 500, num_layers=2, dropout=0.3)
)
(decoder): InputFeedRNNDecoder(
(embeddings): Embeddings(
(make_embedding): Sequential(
(emb_luts): Elementwise(
(0): Embedding(35820, 500, padding_idx=1)
)
)
)
(dropout): Dropout(p=0.3, inplace=False)
(rnn): StackedLSTM(
(dropout): Dropout(p=0.3, inplace=False)
(layers): ModuleList(
(0): LSTMCell(1000, 500)
(1): LSTMCell(500, 500)
)
)
(attn): GlobalAttention(
(linear_in): Linear(in_features=500, out_features=500, bias=False)
(linear_out): Linear(in_features=1000, out_features=500, bias=False)
)
)
(generator): Sequential(
(0): Linear(in_features=500, out_features=35820, bias=True)
(1): Cast()
(2): LogSoftmax(dim=-1)
)
)
[2020-10-08 14:09:03,135 INFO] encoder: 16506500
[2020-10-08 14:09:03,135 INFO] decoder: 41613820
[2020-10-08 14:09:03,135 INFO] * number of parameters: 58120320
[2020-10-08 14:09:03,137 INFO] Starting training on CPU, could be very slow
[2020-10-08 14:09:03,137 INFO] Start training loop and validate every 10000 steps...
[2020-10-08 14:09:03,137 INFO] Loading dataset from toy-ende/demo.train.0.pt
[2020-10-08 14:09:03,550 INFO] number of examples: 10000
[2020-10-08 14:12:35,629 INFO] Step 50/100000; acc: 3.66; ppl: 420079.26; xent: 12.95; lr: 1.00000; 353/348 tok/s; 212 sec
[2020-10-08 14:16:30,224 INFO] Step 100/100000; acc: 3.44; ppl: 157620.36; xent: 11.97; lr: 1.00000; 308/308 tok/s; 447 sec
[2020-10-08 14:20:07,964 INFO] Step 150/100000; acc: 4.09; ppl: 13433.41; xent: 9.51; lr: 1.00000; 319/320 tok/s; 665 sec
[2020-10-08 14:20:35,304 INFO] Loading dataset from toy-ende/demo.train.0.pt
[2020-10-08 14:20:35,793 INFO] number of examples: 10000
[2020-10-08 14:23:49,601 INFO] Step 200/100000; acc: 5.59; ppl: 3317.31; xent: 8.11; lr: 1.00000; 326/323 tok/s; 886 sec
[2020-10-08 14:27:48,557 INFO] Step 250/100000; acc: 7.78; ppl: 2303.69; xent: 7.74; lr: 1.00000; 304/301 tok/s; 1125 sec
[2020-10-08 14:31:36,504 INFO] Step 300/100000; acc: 9.12; ppl: 1930.14; xent: 7.57; lr: 1.00000; 325/324 tok/s; 1353 sec
[2020-10-08 14:32:22,094 INFO] Loading dataset from toy-ende/demo.train.0.pt
……………………
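Before moving on, the parameter counts reported above (encoder: 16506500, decoder: 41613820) can be reproduced from the layer shapes in the model printout: the encoder is the source embedding plus a 2-layer LSTM, and the decoder figure includes the target embedding, the two input-feed LSTM cells, the attention projections, and the generator. A quick back-of-the-envelope check:

# Reproduce the parameter counts reported above from the printed layer shapes.
def lstm_params(input_size, hidden_size):
    # 4 gates, each with input weights, recurrent weights and two bias vectors (bias_ih, bias_hh).
    return 4 * (input_size * hidden_size + hidden_size * hidden_size + 2 * hidden_size)

encoder = (
    24997 * 500                  # source embedding
    + 2 * lstm_params(500, 500)  # 2-layer LSTM(500, 500)
)
decoder = (
    35820 * 500               # target embedding
    + lstm_params(1000, 500)  # LSTMCell(1000, 500): input feeding concatenates embedding and context
    + lstm_params(500, 500)   # LSTMCell(500, 500)
    + 500 * 500               # attn.linear_in (no bias)
    + 1000 * 500              # attn.linear_out (no bias)
    + 500 * 35820 + 35820     # generator Linear (with bias), counted with the decoder
)
print(encoder, decoder, encoder + decoder)  # 16506500 41613820 58120320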
Training on the CPU is far too slow, so let's rerun on a GPU.
$ onmt_train -data toy-ende/demo -save_model demo-model -gpu_ranks 0
[2020-10-09 18:36:10,411 INFO] * src vocab size = 24997
[2020-10-09 18:36:10,411 INFO] * tgt vocab size = 35820
[2020-10-09 18:36:10,411 INFO] Building model...
[2020-10-09 18:36:14,786 INFO] NMTModel(
(encoder): RNNEncoder(
(embeddings): Embeddings(
(make_embedding): Sequential(
(emb_luts): Elementwise(
(0): Embedding(24997, 500, padding_idx=1)
)
)
)
(rnn): LSTM(500, 500, num_layers=2, dropout=0.3)
)
(decoder): InputFeedRNNDecoder(
(embeddings): Embeddings(
(make_embedding): Sequential(
(emb_luts): Elementwise(
(0): Embedding(35820, 500, padding_idx=1)
)
)
)
(dropout): Dropout(p=0.3, inplace=False)
(rnn): StackedLSTM(
(dropout): Dropout(p=0.3, inplace=False)
(layers): ModuleList(
(0): LSTMCell(1000, 500)
(1): LSTMCell(500, 500)
)
)
(attn): GlobalAttention(
(linear_in): Linear(in_features=500, out_features=500, bias=False)
(linear_out): Linear(in_features=1000, out_features=500, bias=False)
)
)
(generator): Sequential(
(0): Linear(in_features=500, out_features=35820, bias=True)
(1): Cast()
(2): LogSoftmax(dim=-1)
)
)
[2020-10-09 18:36:14,787 INFO] encoder: 16506500
[2020-10-09 18:36:14,787 INFO] decoder: 41613820
[2020-10-09 18:36:14,787 INFO] * number of parameters: 58120320
[2020-10-09 18:36:14,788 INFO] Starting training on GPU: [0]
[2020-10-09 18:36:14,788 INFO] Start training loop and validate every 10000 steps...
[2020-10-09 18:36:14,788 INFO] Loading dataset from toy-ende/demo.train.0.pt
[2020-10-09 18:36:15,097 INFO] number of examples: 10000
[2020-10-09 18:36:21,358 INFO] Step 50/100000; acc: 3.77; ppl: 137786.73; xent: 11.83; lr: 1.00000; 11116/11124 tok/s; 7 sec
[2020-10-09 18:36:26,867 INFO] Step 100/100000; acc: 4.36; ppl: 24889.16; xent: 10.12; lr: 1.00000; 12596/12291 tok/s; 12 sec
[2020-10-09 18:36:32,897 INFO] Step 150/100000; acc: 6.23; ppl: 6784.35; xent: 8.82; lr: 1.00000; 12054/12176 tok/s; 18 sec
[2020-10-09 18:36:33,696 INFO] Loading dataset from toy-ende/demo.train.0.pt
[2020-10-09 18:36:33,946 INFO] number of examples: 10000
[2020-10-09 18:36:39,280 INFO] Step 200/100000; acc: 7.56; ppl: 3920.44; xent: 8.27; lr: 1.00000; 11555/11444 tok/s; 24 sec
[2020-10-09 18:36:44,861 INFO] Step 250/100000; acc: 8.96; ppl: 2168.38; xent: 7.68; lr: 1.00000; 12336/12185 tok/s; 30 sec
[2020-10-09 18:36:50,805 INFO] Step 300/100000; acc: 9.50; ppl: 1994.53; xent: 7.60; lr: 1.00000; 12396/12425 tok/s; 36 sec
[2020-10-09 18:36:52,359 INFO] Loading dataset from toy-ende/demo.train.0.pt
[2020-10-09 18:36:52,594 INFO] number of examples: 10000
[2020-10-09 18:36:57,142 INFO] Step 350/100000; acc: 9.38; ppl: 1667.39; xent: 7.42; lr: 1.00000; 11536/11542 tok/s; 42 sec
[2020-10-09 18:37:02,689 INFO] Step 400/100000; acc: 10.08; ppl: 1478.09; xent: 7.30; lr: 1.00000; 12475/12237 tok/s; 48 sec
[2020-10-09 18:37:08,556 INFO] Step 450/100000; acc: 10.88; ppl: 1303.90; xent: 7.17; lr: 1.00000; 12440/12504 tok/s; 54 sec
[2020-10-09 18:37:10,973 INFO] Loading dataset from toy-ende/demo.train.0.pt
…………
[2020-10-09 21:54:56,217 INFO] Step 99700/100000; acc: 8.97; ppl: 19685.48; xent: 9.89; lr: 0.03125; 11664/11638 tok/s; 11921 sec
[2020-10-09 21:55:02,111 INFO] Step 99750/100000; acc: 9.26; ppl: 18087.62; xent: 9.80; lr: 0.03125; 12328/12309 tok/s; 11927 sec
[2020-10-09 21:55:07,559 INFO] Step 99800/100000; acc: 9.89; ppl: 15161.89; xent: 9.63; lr: 0.03125; 12524/12399 tok/s; 11933 sec
[2020-10-09 21:55:13,534 INFO] Step 99850/100000; acc: 8.78; ppl: 19494.02; xent: 9.88; lr: 0.03125; 12469/12484 tok/s; 11939 sec
[2020-10-09 21:55:13,674 INFO] Loading dataset from toy-ende/demo.train.0.pt
[2020-10-09 21:55:13,938 INFO] number of examples: 10000
[2020-10-09 21:55:19,731 INFO] Step 99900/100000; acc: 9.34; ppl: 18415.28; xent: 9.82; lr: 0.03125; 11610/11664 tok/s; 11945 sec
[2020-10-09 21:55:25,100 INFO] Step 99950/100000; acc: 10.04; ppl: 14772.09; xent: 9.60; lr: 0.03125; 12923/12557 tok/s; 11950 sec
[2020-10-09 21:55:31,169 INFO] Step 100000/100000; acc: 8.85; ppl: 19937.46; xent: 9.90; lr: 0.01562; 12166/12255 tok/s; 11956 sec
[2020-10-09 21:55:31,169 INFO] Loading dataset from toy-ende/demo.valid.0.pt
[2020-10-09 21:55:31,363 INFO] number of examples: 3000
[2020-10-09 21:55:38,832 INFO] Validation perplexity: 31809.2
[2020-10-09 21:55:38,832 INFO] Validation accuracy: 9.18222
[2020-10-09 21:55:39,031 INFO] Saving checkpoint demo-model_step_100000.pt
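The saved checkpoint is again a plain torch-serialized dict that bundles the weights with the vocabulary and training options. A small sketch to inspect it; the key names below are those used by OpenNMT-py 1.x, so treat them as an assumption for other versions:

import torch

# Inspect the saved checkpoint.
ckpt = torch.load("demo-model_step_100000.pt", map_location="cpu")
print(list(ckpt.keys()))  # e.g. ['model', 'generator', 'vocab', 'opt', 'optim'] in 1.x

# Count the parameters stored in the checkpoint (model and generator state dicts).
n_params = sum(t.numel()
               for part in ("model", "generator") if part in ckpt
               for t in ckpt[part].values())
print(n_params)           # should match the 58,120,320 reported at training time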
Model inference
$ onmt_translate -model demo-model_step_100000.pt -src toy-ende/src-test.txt -output pred.txt -replace_unk -verbose
…………
[2020-10-09 22:11:54,972 INFO]
SENT 416: ['Even', 'the', 'most', 'recent', 'statements', 'from', 'Snowden', 'will', 'not', 'change', 'this', 'fact', '.']
PRED 416: Das hat mein Ziel .
PRED SCORE: -7.2980
[2020-10-09 22:18:39,381 INFO] Translating shard 0.
[2020-10-09 22:19:17,179 INFO] PRED AVG SCORE: -1.3478, PRED PPL: 3.8491
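The summary line relates the average per-token log-probability to the prediction perplexity via PPL = exp(-avg_score), which is easy to verify:

import math

# The reported prediction perplexity is exp(-average per-token log-probability).
pred_avg_score = -1.3478
print(math.exp(-pred_avg_score))  # ~3.849, matching the reported PRED PPL of 3.8491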
speech2text
Data preparation
wget -O data/speech.tgz http://lstm.seas.harvard.edu/latex/speech.tgz; tar zxf data/speech.tgz -C data/
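Unlike the text example, the source file here lists audio files (resolved against -src_dir) while the target file holds character-level transcriptions. A quick, hedged peek at the first pair; the exact line format is whatever ships in the archive:

# Peek at the first source/target pair of the speech-to-text data.
with open("data/speech/src-train.txt", encoding="utf-8") as f_src, \
     open("data/speech/tgt-train.txt", encoding="utf-8") as f_tgt:
    print(f_src.readline().strip())  # expected: a .wav path relative to data/speech/an4_dataset
    print(f_tgt.readline().strip())  # expected: space-separated characters with <space> tokens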
Data preprocessing
$ onmt_preprocess -data_type audio -src_dir data/speech/an4_dataset -train_src data/speech/src-train.txt -train_tgt data/speech/tgt-train.txt -valid_src data/speech/src-val.txt -valid_tgt data/speech/tgt-val.txt -shard_size 300 -save_data data/speech/demo
[2020-10-08 16:46:42,654 INFO] Extracting features...
[2020-10-08 16:46:42,655 INFO] * number of source features: 0.
[2020-10-08 16:46:42,655 INFO] * number of target features: 0.
[2020-10-08 16:46:42,655 INFO] Building `Fields` object...
[2020-10-08 16:46:42,655 INFO] Building & saving training data...
……………………
[2020-10-08 16:46:44,859 INFO] * saving 0th valid data shard to data/speech/demo.valid.0.pt.
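With -data_type audio, each .wav file is converted into a spectrogram, and the number of frequency bins becomes the input size of the first encoder LSTM (the LSTM(161, 512) in the model printout below). A back-of-the-envelope check, assuming the default -sample_rate 16000 and -window_size 0.02:

# Where the encoder's 161 input features per frame come from (assumed preprocessing defaults).
sample_rate = 16000   # assumed default of onmt_preprocess -sample_rate (Hz)
window_size = 0.02    # assumed default of onmt_preprocess -window_size (seconds)

n_fft = int(sample_rate * window_size)  # STFT window length in samples
n_bins = n_fft // 2 + 1                 # one-sided spectrogram bins
print(n_fft, n_bins)                    # 320 161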
Model training
The original tutorial uses a 4-layer encoder; this example simplifies it to 2 layers.
onmt_train -model_type audio -enc_rnn_size 512 -dec_rnn_size 512 -audio_enc_pooling 1,1 -dropout 0 -enc_layers 2 -dec_layers 1 -rnn_type LSTM -data data/speech/demo -save_model demo-model -global_attention mlp -gpu_ranks 0 -batch_size 8 -optim adam -max_grad_norm 100 -learning_rate 0.0003 -learning_rate_decay 0.8 -train_steps 100000 --save_checkpoint_steps 2000
Output:
[2020-10-09 23:57:46,056 INFO] * tgt vocab size = 31
[2020-10-09 23:57:46,056 INFO] Building model...
[2020-10-09 23:57:49,392 INFO] NMTModel(
(encoder): AudioEncoder(
(W): Linear(in_features=512, out_features=512, bias=False)
(batchnorm_0): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(rnn_0): LSTM(161, 512)
(pool_0): MaxPool1d(kernel_size=1, stride=1, padding=0, dilation=1, ceil_mode=False)
(rnn_1): LSTM(512, 512)
(pool_1): MaxPool1d(kernel_size=1, stride=1, padding=0, dilation=1, ceil_mode=False)
(batchnorm_1): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
(decoder): InputFeedRNNDecoder(
(embeddings): Embeddings(
(make_embedding): Sequential(
(emb_luts): Elementwise(
(0): Embedding(31, 500, padding_idx=1)
)
)
)
(dropout): Dropout(p=0.0, inplace=False)
(rnn): StackedLSTM(
(dropout): Dropout(p=0.0, inplace=False)
(layers): ModuleList(
(0): LSTMCell(1012, 512)
)
)
(attn): GlobalAttention(
(linear_context): Linear(in_features=512, out_features=512, bias=False)
(linear_query): Linear(in_features=512, out_features=512, bias=True)
(v): Linear(in_features=512, out_features=1, bias=False)
(linear_out): Linear(in_features=1024, out_features=512, bias=True)
)
)
(generator): Sequential(
(0): Linear(in_features=512, out_features=31, bias=True)
(1): Cast()
(2): LogSoftmax(dim=-1)
)
)
[2020-10-09 23:57:49,393 INFO] encoder: 3747840
[2020-10-09 23:57:49,393 INFO] decoder: 4206763
[2020-10-09 23:57:49,393 INFO] * number of parameters: 7954603
[2020-10-09 23:57:49,394 INFO] Starting training on GPU: [0]
[2020-10-09 23:57:49,395 INFO] Start training loop and validate every 10000 steps...
[2020-10-09 23:57:49,395 INFO] Loading dataset from data/speech/demo.train.0.pt
[2020-10-09 23:57:49,455 INFO] number of examples: 300
[2020-10-09 23:57:52,321 INFO] Loading dataset from data/speech/demo.train.1.pt
[2020-10-09 23:57:52,338 INFO] number of examples: 300
[2020-10-09 23:57:53,662 INFO] Step 50/100000; acc: 22.02; ppl: 13.81; xent: 2.63; lr: 0.00030; 93/1318 tok/s; 4 sec
[2020-10-09 23:57:56,415 INFO] Loading dataset from data/speech/demo.train.2.pt
[2020-10-09 23:57:56,434 INFO] number of examples: 296
[2020-10-09 23:57:59,335 INFO] Step 100/100000; acc: 43.42; ppl: 6.86; xent: 1.93; lr: 0.00030; 70/1602 tok/s; 10 sec
[2020-10-09 23:58:00,811 INFO] Loading dataset from data/speech/demo.train.3.pt
[2020-10-09 23:58:00,814 INFO] number of examples: 47
[2020-10-09 23:58:01,619 INFO] Loading dataset from data/speech/demo.train.0.pt
[2020-10-09 23:58:01,636 INFO] number of examples: 300
[2020-10-09 23:58:03,953 INFO] Step 150/100000; acc: 54.66; ppl: 4.50; xent: 1.50; lr: 0.00030; 86/1478 tok/s; 15 sec
……………………
[2020-10-10 02:41:57,632 INFO] Loading dataset from data/speech/demo.train.1.pt
[2020-10-10 02:41:57,647 INFO] number of examples: 300
[2020-10-10 02:41:57,865 INFO] Step 100000/100000; acc: 94.06; ppl: 1.19; xent: 0.18; lr: 0.00008; 92/1411 tok/s; 9558 sec
[2020-10-10 02:41:57,865 INFO] Loading dataset from data/speech/demo.valid.0.pt
[2020-10-10 02:41:57,871 INFO] number of examples: 130
[2020-10-10 02:41:58,077 INFO] Validation perplexity: 68.1276
[2020-10-10 02:41:58,077 INFO] Validation accuracy: 56.2152
[2020-10-10 02:41:58,078 INFO] Saving checkpoint demo-model_step_100000.pt
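The -global_attention mlp flag selects Bahdanau-style attention, which is what the linear_query / linear_context / v modules in the printout implement: score(h_t, h_s) = v^T tanh(W_q h_t + W_c h_s). Below is a minimal sketch of that scoring step for a single decoder step, not the actual OpenNMT-py implementation (which also handles batching and masking):

import torch
import torch.nn as nn

# Minimal sketch of "mlp" (Bahdanau-style) global attention for one decoder step.
dim = 512
linear_query = nn.Linear(dim, dim, bias=True)     # W_q, applied to the decoder state
linear_context = nn.Linear(dim, dim, bias=False)  # W_c, applied to the encoder states
v = nn.Linear(dim, 1, bias=False)                 # v, reduces each score vector to a scalar

h_t = torch.randn(1, dim)    # current decoder hidden state
h_s = torch.randn(10, dim)   # 10 encoder states (audio frames)

scores = v(torch.tanh(linear_query(h_t) + linear_context(h_s))).squeeze(-1)
align = torch.softmax(scores, dim=-1)  # attention weights over the encoder states
context = align @ h_s                  # attention context vector fed back into the decoder
print(align.shape, context.shape)      # torch.Size([10]) torch.Size([512])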
Model inference
onmt_translate -data_type audio -model demo-model_step_100000.pt -src_dir data/speech/an4_dataset -src data/speech/src-val.txt -output pred.txt -gpu 0 -verbose
Sample output:
[2020-10-10 00:07:11,682 INFO]
SENT 130: None
PRED 130: O N E <space> F I V E <space> T W O <space> O N E <space> S E V E N
PRED SCORE: -4.5617
[2020-10-10 00:07:11,682 INFO] PRED AVG SCORE: -0.2203, PRED PPL: 1.2464
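Since the model predicts character sequences with an explicit <space> token, turning a prediction from pred.txt back into plain text is a one-liner:

# Convert a character-level prediction back into plain text.
pred = "O N E <space> F I V E <space> T W O <space> O N E <space> S E V E N"
text = "".join(" " if tok == "<space>" else tok for tok in pred.split())
print(text)  # ONE FIVE TWO ONE SEVEN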
Clearly, the model is still performing very poorly at this stage.