A First Look at OpenNMT-py


Background

OpenNMT is an open-source neural machine translation system.

It has two implementations, OpenNMT-py (PyTorch) and OpenNMT-tf (TensorFlow), which differ slightly in features.

This post walks through OpenNMT-py.

First Look

Installation

pip install OpenNMT-py
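Note that onmt_preprocess, onmt_train, and onmt_translate used below belong to the OpenNMT-py 1.x command-line interface; newer 2.x releases replace the preprocessing step with a YAML-based configuration. If a plain pip install pulls a newer version, pinning a 1.x release (a hedged suggestion, not part of the original post) keeps this walkthrough reproducible:

# Optional: pin a 1.x release so the legacy onmt_preprocess/onmt_train/onmt_translate CLI is available
pip install "OpenNMT-py<2"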

Quick start

Data preparation

wget https://s3.amazonaws.com/opennmt-trainingdata/toy-ende.tar.gz
tar xvf toy-ende.tar.gz
cd toy-ende
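src-train.txt and tgt-train.txt are line-aligned parallel text, one tokenized sentence per line (the validation files follow the same layout). A quick sanity check before preprocessing:

# Peek at the first sentence pair and count the parallel lines
head -n 1 toy-ende/src-train.txt toy-ende/tgt-train.txt
wc -l toy-ende/src-train.txt toy-ende/tgt-train.txt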

Data preprocessing

$ onmt_preprocess -train_src toy-ende/src-train.txt -train_tgt toy-ende/tgt-train.txt -valid_src toy-ende/src-val.txt -valid_tgt toy-ende/tgt-val.txt -save_data toy-ende/demo
[2020-10-08 14:06:44,423 INFO] Extracting features...
[2020-10-08 14:06:44,425 INFO] * number of source features: 0.
[2020-10-08 14:06:44,425 INFO] * number of target features: 0.
[2020-10-08 14:06:44,425 INFO] Building `Fields` object...
[2020-10-08 14:06:44,425 INFO] Building & saving training data...
[2020-10-08 14:06:44,479 INFO] Building shard 0.
[2020-10-08 14:06:44,998 INFO] * saving 0th train data shard to toy-ende/demo.train.0.pt.
[2020-10-08 14:06:45,690 INFO] * tgt vocab size: 35820.
[2020-10-08 14:06:45,735 INFO] * src vocab size: 24997.
[2020-10-08 14:06:45,940 INFO] Building & saving validation data...
[2020-10-08 14:06:46,187 INFO] Building shard 0.
[2020-10-08 14:06:46,267 INFO] * saving 0th valid data shard to toy-ende/demo.valid.0.pt.

Output files (see the directory listing sketch after this list):

  • demo.train.0.pt: serialized PyTorch file containing the training data (one file per shard)
  • demo.valid.0.pt: serialized PyTorch file containing the validation data
  • demo.vocab.pt: serialized PyTorch file containing the vocabulary
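With the default sharding the training and validation sets come out as numbered shards, matching the log above. Listing the directory confirms what was actually written to disk:

# List the serialized artifacts written by onmt_preprocess
ls -lh toy-ende/demo.*.pt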

Model training

$ onmt_train -data toy-ende/demo -save_model demo-model
[2020-10-08 14:09:02,315 INFO] * src vocab size = 24997
[2020-10-08 14:09:02,315 INFO] * tgt vocab size = 35820
[2020-10-08 14:09:02,315 INFO] Building model...
[2020-10-08 14:09:03,134 INFO] NMTModel(
  (encoder): RNNEncoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(24997, 500, padding_idx=1)
        )
      )
    )
    (rnn): LSTM(500, 500, num_layers=2, dropout=0.3)
  )
  (decoder): InputFeedRNNDecoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(35820, 500, padding_idx=1)
        )
      )
    )
    (dropout): Dropout(p=0.3, inplace=False)
    (rnn): StackedLSTM(
      (dropout): Dropout(p=0.3, inplace=False)
      (layers): ModuleList(
        (0): LSTMCell(1000, 500)
        (1): LSTMCell(500, 500)
      )
    )
    (attn): GlobalAttention(
      (linear_in): Linear(in_features=500, out_features=500, bias=False)
      (linear_out): Linear(in_features=1000, out_features=500, bias=False)
    )
  )
  (generator): Sequential(
    (0): Linear(in_features=500, out_features=35820, bias=True)
    (1): Cast()
    (2): LogSoftmax(dim=-1)
  )
)
[2020-10-08 14:09:03,135 INFO] encoder: 16506500
[2020-10-08 14:09:03,135 INFO] decoder: 41613820
[2020-10-08 14:09:03,135 INFO] * number of parameters: 58120320
[2020-10-08 14:09:03,137 INFO] Starting training on CPU, could be very slow
[2020-10-08 14:09:03,137 INFO] Start training loop and validate every 10000 steps...
[2020-10-08 14:09:03,137 INFO] Loading dataset from toy-ende/demo.train.0.pt
[2020-10-08 14:09:03,550 INFO] number of examples: 10000
[2020-10-08 14:12:35,629 INFO] Step 50/100000; acc: 3.66; ppl: 420079.26; xent: 12.95; lr: 1.00000; 353/348 tok/s; 212 sec
[2020-10-08 14:16:30,224 INFO] Step 100/100000; acc: 3.44; ppl: 157620.36; xent: 11.97; lr: 1.00000; 308/308 tok/s; 447 sec
[2020-10-08 14:20:07,964 INFO] Step 150/100000; acc: 4.09; ppl: 13433.41; xent: 9.51; lr: 1.00000; 319/320 tok/s; 665 sec
[2020-10-08 14:20:35,304 INFO] Loading dataset from toy-ende/demo.train.0.pt
[2020-10-08 14:20:35,793 INFO] number of examples: 10000
[2020-10-08 14:23:49,601 INFO] Step 200/100000; acc: 5.59; ppl: 3317.31; xent: 8.11; lr: 1.00000; 326/323 tok/s; 886 sec
[2020-10-08 14:27:48,557 INFO] Step 250/100000; acc: 7.78; ppl: 2303.69; xent: 7.74; lr: 1.00000; 304/301 tok/s; 1125 sec
[2020-10-08 14:31:36,504 INFO] Step 300/100000; acc: 9.12; ppl: 1930.14; xent: 7.57; lr: 1.00000; 325/324 tok/s; 1353 sec
[2020-10-08 14:32:22,094 INFO] Loading dataset from toy-ende/demo.train.0.pt
……………………

Training on CPU is far too slow, so switch to a GPU.
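Before rerunning, it is worth confirming that a CUDA device is actually visible to PyTorch (a quick check, not part of the original post):

# Confirm a CUDA device is visible before passing -gpu_ranks 0
nvidia-smi
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"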

$ onmt_train -data toy-ende/demo -save_model demo-model -gpu_ranks 0
[2020-10-09 18:36:10,411 INFO] * src vocab size = 24997
[2020-10-09 18:36:10,411 INFO] * tgt vocab size = 35820
[2020-10-09 18:36:10,411 INFO] Building model...
[2020-10-09 18:36:14,786 INFO] NMTModel(
  (encoder): RNNEncoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(24997, 500, padding_idx=1)
        )
      )
    )
    (rnn): LSTM(500, 500, num_layers=2, dropout=0.3)
  )
  (decoder): InputFeedRNNDecoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(35820, 500, padding_idx=1)
        )
      )
    )
    (dropout): Dropout(p=0.3, inplace=False)
    (rnn): StackedLSTM(
      (dropout): Dropout(p=0.3, inplace=False)
      (layers): ModuleList(
        (0): LSTMCell(1000, 500)
        (1): LSTMCell(500, 500)
      )
    )
    (attn): GlobalAttention(
      (linear_in): Linear(in_features=500, out_features=500, bias=False)
      (linear_out): Linear(in_features=1000, out_features=500, bias=False)
    )
  )
  (generator): Sequential(
    (0): Linear(in_features=500, out_features=35820, bias=True)
    (1): Cast()
    (2): LogSoftmax(dim=-1)
  )
)
[2020-10-09 18:36:14,787 INFO] encoder: 16506500
[2020-10-09 18:36:14,787 INFO] decoder: 41613820
[2020-10-09 18:36:14,787 INFO] * number of parameters: 58120320
[2020-10-09 18:36:14,788 INFO] Starting training on GPU: [0]
[2020-10-09 18:36:14,788 INFO] Start training loop and validate every 10000 steps...
[2020-10-09 18:36:14,788 INFO] Loading dataset from toy-ende/demo.train.0.pt
[2020-10-09 18:36:15,097 INFO] number of examples: 10000
[2020-10-09 18:36:21,358 INFO] Step 50/100000; acc: 3.77; ppl: 137786.73; xent: 11.83; lr: 1.00000; 11116/11124 tok/s; 7 sec
[2020-10-09 18:36:26,867 INFO] Step 100/100000; acc: 4.36; ppl: 24889.16; xent: 10.12; lr: 1.00000; 12596/12291 tok/s; 12 sec
[2020-10-09 18:36:32,897 INFO] Step 150/100000; acc: 6.23; ppl: 6784.35; xent: 8.82; lr: 1.00000; 12054/12176 tok/s; 18 sec
[2020-10-09 18:36:33,696 INFO] Loading dataset from toy-ende/demo.train.0.pt
[2020-10-09 18:36:33,946 INFO] number of examples: 10000
[2020-10-09 18:36:39,280 INFO] Step 200/100000; acc: 7.56; ppl: 3920.44; xent: 8.27; lr: 1.00000; 11555/11444 tok/s; 24 sec
[2020-10-09 18:36:44,861 INFO] Step 250/100000; acc: 8.96; ppl: 2168.38; xent: 7.68; lr: 1.00000; 12336/12185 tok/s; 30 sec
[2020-10-09 18:36:50,805 INFO] Step 300/100000; acc: 9.50; ppl: 1994.53; xent: 7.60; lr: 1.00000; 12396/12425 tok/s; 36 sec
[2020-10-09 18:36:52,359 INFO] Loading dataset from toy-ende/demo.train.0.pt
[2020-10-09 18:36:52,594 INFO] number of examples: 10000
[2020-10-09 18:36:57,142 INFO] Step 350/100000; acc: 9.38; ppl: 1667.39; xent: 7.42; lr: 1.00000; 11536/11542 tok/s; 42 sec
[2020-10-09 18:37:02,689 INFO] Step 400/100000; acc: 10.08; ppl: 1478.09; xent: 7.30; lr: 1.00000; 12475/12237 tok/s; 48 sec
[2020-10-09 18:37:08,556 INFO] Step 450/100000; acc: 10.88; ppl: 1303.90; xent: 7.17; lr: 1.00000; 12440/12504 tok/s; 54 sec
[2020-10-09 18:37:10,973 INFO] Loading dataset from toy-ende/demo.train.0.pt
…………
[2020-10-09 21:54:56,217 INFO] Step 99700/100000; acc: 8.97; ppl: 19685.48; xent: 9.89; lr: 0.03125; 11664/11638 tok/s; 11921 sec
[2020-10-09 21:55:02,111 INFO] Step 99750/100000; acc: 9.26; ppl: 18087.62; xent: 9.80; lr: 0.03125; 12328/12309 tok/s; 11927 sec
[2020-10-09 21:55:07,559 INFO] Step 99800/100000; acc: 9.89; ppl: 15161.89; xent: 9.63; lr: 0.03125; 12524/12399 tok/s; 11933 sec
[2020-10-09 21:55:13,534 INFO] Step 99850/100000; acc: 8.78; ppl: 19494.02; xent: 9.88; lr: 0.03125; 12469/12484 tok/s; 11939 sec
[2020-10-09 21:55:13,674 INFO] Loading dataset from toy-ende/demo.train.0.pt
[2020-10-09 21:55:13,938 INFO] number of examples: 10000
[2020-10-09 21:55:19,731 INFO] Step 99900/100000; acc: 9.34; ppl: 18415.28; xent: 9.82; lr: 0.03125; 11610/11664 tok/s; 11945 sec
[2020-10-09 21:55:25,100 INFO] Step 99950/100000; acc: 10.04; ppl: 14772.09; xent: 9.60; lr: 0.03125; 12923/12557 tok/s; 11950 sec
[2020-10-09 21:55:31,169 INFO] Step 100000/100000; acc: 8.85; ppl: 19937.46; xent: 9.90; lr: 0.01562; 12166/12255 tok/s; 11956 sec
[2020-10-09 21:55:31,169 INFO] Loading dataset from toy-ende/demo.valid.0.pt
[2020-10-09 21:55:31,363 INFO] number of examples: 3000
[2020-10-09 21:55:38,832 INFO] Validation perplexity: 31809.2
[2020-10-09 21:55:38,832 INFO] Validation accuracy: 9.18222
[2020-10-09 21:55:39,031 INFO] Saving checkpoint demo-model_step_100000.pt
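The run ends by saving demo-model_step_100000.pt. The checkpoint is an ordinary torch-serialized file, so its top-level structure can be inspected directly (a minimal sketch; the exact key names depend on the OpenNMT-py version):

# Print the top-level keys of the saved checkpoint (key names may vary across versions)
python -c "import torch; ckpt = torch.load('demo-model_step_100000.pt', map_location='cpu'); print(list(ckpt.keys()))"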

Model inference

$ onmt_translate -model demo-model_step_100000.pt -src toy-ende/src-test.txt -output pred.txt -replace_unk -verbose
…………
[2020-10-09 22:11:54,972 INFO] SENT 416: ['Even', 'the', 'most', 'recent', 'statements', 'from', 'Snowden', 'will', 'not', 'change', 'this', 'fact', '.']
PRED 416: Das hat mein Ziel .
PRED SCORE: -7.2980
[2020-10-09 22:18:39,381 INFO] Translating shard 0.
[2020-10-09 22:19:17,179 INFO] PRED AVG SCORE: -1.3478, PRED PPL: 3.8491
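pred.txt holds one hypothesis per source line. If the archive also ships the reference side of the test set (tgt-test.txt is assumed here, not verified), BLEU can be computed with sacrebleu as a quick quality check:

# Score the predictions against the (assumed) reference file of the toy test set
pip install sacrebleu
sacrebleu toy-ende/tgt-test.txt < pred.txt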

speech2text

Data preparation

wget -O data/speech.tgz http://lstm.seas.harvard.edu/latex/speech.tgz
tar zxf data/speech.tgz -C data/
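Unlike the text task, src-train.txt here is not expected to contain sentences: each line should be the path of an audio clip (resolved against -src_dir), while tgt-train.txt holds the character-level transcription. This is an assumption based on how the audio data_type consumes its inputs, and is easy to verify against the extracted files:

# Source lines are audio file paths, target lines are character-level transcripts
head -n 2 data/speech/src-train.txt data/speech/tgt-train.txt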

Data preprocessing

$ onmt_preprocess -data_type audio -src_dir data/speech/an4_dataset -train_src data/speech/src-train.txt -train_tgt data/speech/tgt-train.txt -valid_src data/speech/src-val.txt -valid_tgt data/speech/tgt-val.txt -shard_size 300 -save_data data/speech/demo
[2020-10-08 16:46:42,654 INFO] Extracting features...
[2020-10-08 16:46:42,655 INFO] * number of source features: 0.
[2020-10-08 16:46:42,655 INFO] * number of target features: 0.
[2020-10-08 16:46:42,655 INFO] Building `Fields` object...
[2020-10-08 16:46:42,655 INFO] Building & saving training data...
……………………
[2020-10-08 16:46:44,859 INFO] * saving 0th valid data shard to data/speech/demo.valid.0.pt.

Model training

The original tutorial uses a 4-layer encoder; this example simplifies it to 2 layers (which is why -audio_enc_pooling below lists two pooling factors, one per encoder layer).

onmt_train -model_type audio -enc_rnn_size 512 -dec_rnn_size 512 -audio_enc_pooling 1,1 \
    -dropout 0 -enc_layers 2 -dec_layers 1 -rnn_type LSTM \
    -data data/speech/demo -save_model demo-model \
    -global_attention mlp -gpu_ranks 0 -batch_size 8 \
    -optim adam -max_grad_norm 100 \
    -learning_rate 0.0003 -learning_rate_decay 0.8 \
    -train_steps 100000 --save_checkpoint_steps 2000

Output

[2020-10-09 23:57:46,056 INFO] * tgt vocab size = 31
[2020-10-09 23:57:46,056 INFO] Building model...
[2020-10-09 23:57:49,392 INFO] NMTModel(
  (encoder): AudioEncoder(
    (W): Linear(in_features=512, out_features=512, bias=False)
    (batchnorm_0): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (rnn_0): LSTM(161, 512)
    (pool_0): MaxPool1d(kernel_size=1, stride=1, padding=0, dilation=1, ceil_mode=False)
    (rnn_1): LSTM(512, 512)
    (pool_1): MaxPool1d(kernel_size=1, stride=1, padding=0, dilation=1, ceil_mode=False)
    (batchnorm_1): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  )
  (decoder): InputFeedRNNDecoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(31, 500, padding_idx=1)
        )
      )
    )
    (dropout): Dropout(p=0.0, inplace=False)
    (rnn): StackedLSTM(
      (dropout): Dropout(p=0.0, inplace=False)
      (layers): ModuleList(
        (0): LSTMCell(1012, 512)
      )
    )
    (attn): GlobalAttention(
      (linear_context): Linear(in_features=512, out_features=512, bias=False)
      (linear_query): Linear(in_features=512, out_features=512, bias=True)
      (v): Linear(in_features=512, out_features=1, bias=False)
      (linear_out): Linear(in_features=1024, out_features=512, bias=True)
    )
  )
  (generator): Sequential(
    (0): Linear(in_features=512, out_features=31, bias=True)
    (1): Cast()
    (2): LogSoftmax(dim=-1)
  )
)
[2020-10-09 23:57:49,393 INFO] encoder: 3747840
[2020-10-09 23:57:49,393 INFO] decoder: 4206763
[2020-10-09 23:57:49,393 INFO] * number of parameters: 7954603
[2020-10-09 23:57:49,394 INFO] Starting training on GPU: [0]
[2020-10-09 23:57:49,395 INFO] Start training loop and validate every 10000 steps...
[2020-10-09 23:57:49,395 INFO] Loading dataset from data/speech/demo.train.0.pt
[2020-10-09 23:57:49,455 INFO] number of examples: 300
[2020-10-09 23:57:52,321 INFO] Loading dataset from data/speech/demo.train.1.pt
[2020-10-09 23:57:52,338 INFO] number of examples: 300
[2020-10-09 23:57:53,662 INFO] Step 50/100000; acc: 22.02; ppl: 13.81; xent: 2.63; lr: 0.00030; 93/1318 tok/s; 4 sec
[2020-10-09 23:57:56,415 INFO] Loading dataset from data/speech/demo.train.2.pt
[2020-10-09 23:57:56,434 INFO] number of examples: 296
[2020-10-09 23:57:59,335 INFO] Step 100/100000; acc: 43.42; ppl: 6.86; xent: 1.93; lr: 0.00030; 70/1602 tok/s; 10 sec
[2020-10-09 23:58:00,811 INFO] Loading dataset from data/speech/demo.train.3.pt
[2020-10-09 23:58:00,814 INFO] number of examples: 47
[2020-10-09 23:58:01,619 INFO] Loading dataset from data/speech/demo.train.0.pt
[2020-10-09 23:58:01,636 INFO] number of examples: 300
[2020-10-09 23:58:03,953 INFO] Step 150/100000; acc: 54.66; ppl: 4.50; xent: 1.50; lr: 0.00030; 86/1478 tok/s; 15 sec
……………………
[2020-10-10 02:41:57,632 INFO] Loading dataset from data/speech/demo.train.1.pt
[2020-10-10 02:41:57,647 INFO] number of examples: 300
[2020-10-10 02:41:57,865 INFO] Step 100000/100000; acc: 94.06; ppl: 1.19; xent: 0.18; lr: 0.00008; 92/1411 tok/s; 9558 sec
[2020-10-10 02:41:57,865 INFO] Loading dataset from data/speech/demo.valid.0.pt
[2020-10-10 02:41:57,871 INFO] number of examples: 130
[2020-10-10 02:41:58,077 INFO] Validation perplexity: 68.1276
[2020-10-10 02:41:58,077 INFO] Validation accuracy: 56.2152
[2020-10-10 02:41:58,078 INFO] Saving checkpoint demo-model_step_100000.pt

Model inference

onmt_translate -data_type audio -model demo-model_step_100000.pt -src_dir data/speech/an4_dataset -src data/speech/src-val.txt -output pred.txt -gpu 0 -verbose

Sample output

[2020-10-10 00:07:11,682 INFO] SENT 130: None
PRED 130: O N E <space> F I V E <space> T W O <space> O N E <space> S E V E N
PRED SCORE: -4.5617
[2020-10-10 00:07:11,682 INFO] PRED AVG SCORE: -0.2203, PRED PPL: 1.2464
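The hypotheses are emitted as space-separated characters with a literal <space> token between words. A one-line post-processing step (a small sketch, not part of the original pipeline) turns them back into plain text:

# Collapse character-level output back into words: "O N E <space> F I V E" -> "ONE FIVE"
sed -e 's/ <space> /|/g' -e 's/ //g' -e 's/|/ /g' pred.txt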

Clearly, the model at this stage is still quite poor: training accuracy reaches about 94% while validation accuracy stays around 56%, which suggests heavy overfitting on such a tiny dataset.

References