First Experience with OpenNMT-py


Background

OpenNMT is an open-source machine translation system.

It has two implementations, OpenNMT-py (PyTorch) and OpenNMT-tf (TensorFlow), which differ slightly from each other.

This post uses OpenNMT-py as the example.

Getting Started

Installation

pip install OpenNMT-py
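
As a quick sanity check that the install worked, the package can be imported directly. This is only a sketch and assumes the onmt package exposes a __version__ attribute, as recent OpenNMT-py releases do:

# Sanity check; assumes onmt exposes __version__.
import onmt
print(onmt.__version__)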

Quickstart

Data preparation

wget https://s3.amazonaws.com/opennmt-trainingdata/toy-ende.tar.gz
tar xvf toy-ende.tar.gz
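
To get a feel for the data format, it helps to peek at a few parallel lines: each file holds one pre-tokenized sentence per line, with source and target aligned by line number. A minimal Python sketch:

# Print the first three source/target pairs from the toy data set.
with open("toy-ende/src-train.txt") as src, open("toy-ende/tgt-train.txt") as tgt:
    for _, (s, t) in zip(range(3), zip(src, tgt)):
        print("SRC:", s.strip())
        print("TGT:", t.strip())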

Data preprocessing

$ onmt_preprocess -train_src toy-ende/src-train.txt -train_tgt toy-ende/tgt-train.txt -valid_src toy-ende/src-val.txt -valid_tgt toy-ende/tgt-val.txt -save_data toy-ende/demo
[2020-10-08 14:06:44,423 INFO] Extracting features...
[2020-10-08 14:06:44,425 INFO]  * number of source features: 0.
[2020-10-08 14:06:44,425 INFO]  * number of target features: 0.
[2020-10-08 14:06:44,425 INFO] Building `Fields` object...
[2020-10-08 14:06:44,425 INFO] Building & saving training data...
[2020-10-08 14:06:44,479 INFO] Building shard 0.
[2020-10-08 14:06:44,998 INFO]  * saving 0th train data shard to toy-ende/demo.train.0.pt.
[2020-10-08 14:06:45,690 INFO]  * tgt vocab size: 35820.
[2020-10-08 14:06:45,735 INFO]  * src vocab size: 24997.
[2020-10-08 14:06:45,940 INFO] Building & saving validation data...
[2020-10-08 14:06:46,187 INFO] Building shard 0.
[2020-10-08 14:06:46,267 INFO]  * saving 0th valid data shard to toy-ende/demo.valid.0.pt.

Output files:

  • demo.train.0.pt: serialized PyTorch file containing training data
  • demo.valid.0.pt: serialized PyTorch file containing validation data
  • demo.vocab.pt: serialized PyTorch file containing vocabulary data (see the inspection sketch below)
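
The vocabulary file can be opened with torch.load to see what was built. This is a rough inspection sketch only; it assumes the OpenNMT-py 1.x on-disk format, where the file is a torch-saved mapping from field names to torchtext fields, and the exact structure differs across versions:

# Rough inspection of demo.vocab.pt (structure varies by OpenNMT-py version).
import torch

fields = torch.load("toy-ende/demo.vocab.pt")
print(type(fields))
if isinstance(fields, dict):
    for name, field in fields.items():
        print(name, type(field).__name__)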

Model training

$ onmt_train -data toy-ende/demo -save_model demo-model
[2020-10-08 14:09:02,315 INFO]  * src vocab size = 24997
[2020-10-08 14:09:02,315 INFO]  * tgt vocab size = 35820
[2020-10-08 14:09:02,315 INFO] Building model...
[2020-10-08 14:09:03,134 INFO] NMTModel(
  (encoder): RNNEncoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(24997, 500, padding_idx=1)
        )
      )
    )
    (rnn): LSTM(500, 500, num_layers=2, dropout=0.3)
  )
  (decoder): InputFeedRNNDecoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(35820, 500, padding_idx=1)
        )
      )
    )
    (dropout): Dropout(p=0.3, inplace=False)
    (rnn): StackedLSTM(
      (dropout): Dropout(p=0.3, inplace=False)
      (layers): ModuleList(
        (0): LSTMCell(1000, 500)
        (1): LSTMCell(500, 500)
      )
    )
    (attn): GlobalAttention(
      (linear_in): Linear(in_features=500, out_features=500, bias=False)
      (linear_out): Linear(in_features=1000, out_features=500, bias=False)
    )
  )
  (generator): Sequential(
    (0): Linear(in_features=500, out_features=35820, bias=True)
    (1): Cast()
    (2): LogSoftmax(dim=-1)
  )
)
[2020-10-08 14:09:03,135 INFO] encoder: 16506500
[2020-10-08 14:09:03,135 INFO] decoder: 41613820
[2020-10-08 14:09:03,135 INFO] * number of parameters: 58120320
[2020-10-08 14:09:03,137 INFO] Starting training on CPU, could be very slow
[2020-10-08 14:09:03,137 INFO] Start training loop and validate every 10000 steps...
[2020-10-08 14:09:03,137 INFO] Loading dataset from toy-ende/demo.train.0.pt
[2020-10-08 14:09:03,550 INFO] number of examples: 10000
[2020-10-08 14:12:35,629 INFO] Step 50/100000; acc:   3.66; ppl: 420079.26; xent: 12.95; lr: 1.00000; 353/348 tok/s;    212 sec
[2020-10-08 14:16:30,224 INFO] Step 100/100000; acc:   3.44; ppl: 157620.36; xent: 11.97; lr: 1.00000; 308/308 tok/s;    447 sec
[2020-10-08 14:20:07,964 INFO] Step 150/100000; acc:   4.09; ppl: 13433.41; xent: 9.51; lr: 1.00000; 319/320 tok/s;    665 sec
[2020-10-08 14:20:35,304 INFO] Loading dataset from toy-ende/demo.train.0.pt
[2020-10-08 14:20:35,793 INFO] number of examples: 10000
[2020-10-08 14:23:49,601 INFO] Step 200/100000; acc:   5.59; ppl: 3317.31; xent: 8.11; lr: 1.00000; 326/323 tok/s;    886 sec
[2020-10-08 14:27:48,557 INFO] Step 250/100000; acc:   7.78; ppl: 2303.69; xent: 7.74; lr: 1.00000; 304/301 tok/s;   1125 sec
[2020-10-08 14:31:36,504 INFO] Step 300/100000; acc:   9.12; ppl: 1930.14; xent: 7.57; lr: 1.00000; 325/324 tok/s;   1353 sec
[2020-10-08 14:32:22,094 INFO] Loading dataset from toy-ende/demo.train.0.pt
……………………

Training on the CPU is far too slow, so switch to a GPU.

$ onmt_train -data toy-ende/demo -save_model demo-model -gpu_ranks 0
[2020-10-09 18:36:10,411 INFO]  * src vocab size = 24997
[2020-10-09 18:36:10,411 INFO]  * tgt vocab size = 35820
[2020-10-09 18:36:10,411 INFO] Building model...
[2020-10-09 18:36:14,786 INFO] NMTModel(
  (encoder): RNNEncoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(24997, 500, padding_idx=1)
        )
      )
    )
    (rnn): LSTM(500, 500, num_layers=2, dropout=0.3)
  )
  (decoder): InputFeedRNNDecoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(35820, 500, padding_idx=1)
        )
      )
    )
    (dropout): Dropout(p=0.3, inplace=False)
    (rnn): StackedLSTM(
      (dropout): Dropout(p=0.3, inplace=False)
      (layers): ModuleList(
        (0): LSTMCell(1000, 500)
        (1): LSTMCell(500, 500)
      )
    )
    (attn): GlobalAttention(
      (linear_in): Linear(in_features=500, out_features=500, bias=False)
      (linear_out): Linear(in_features=1000, out_features=500, bias=False)
    )
  )
  (generator): Sequential(
    (0): Linear(in_features=500, out_features=35820, bias=True)
    (1): Cast()
    (2): LogSoftmax(dim=-1)
  )
)
[2020-10-09 18:36:14,787 INFO] encoder: 16506500
[2020-10-09 18:36:14,787 INFO] decoder: 41613820
[2020-10-09 18:36:14,787 INFO] * number of parameters: 58120320
[2020-10-09 18:36:14,788 INFO] Starting training on GPU: [0]
[2020-10-09 18:36:14,788 INFO] Start training loop and validate every 10000 steps...
[2020-10-09 18:36:14,788 INFO] Loading dataset from toy-ende/demo.train.0.pt
[2020-10-09 18:36:15,097 INFO] number of examples: 10000
[2020-10-09 18:36:21,358 INFO] Step 50/100000; acc:   3.77; ppl: 137786.73; xent: 11.83; lr: 1.00000; 11116/11124 tok/s;      7 sec
[2020-10-09 18:36:26,867 INFO] Step 100/100000; acc:   4.36; ppl: 24889.16; xent: 10.12; lr: 1.00000; 12596/12291 tok/s;     12 sec
[2020-10-09 18:36:32,897 INFO] Step 150/100000; acc:   6.23; ppl: 6784.35; xent: 8.82; lr: 1.00000; 12054/12176 tok/s;     18 sec
[2020-10-09 18:36:33,696 INFO] Loading dataset from toy-ende/demo.train.0.pt
[2020-10-09 18:36:33,946 INFO] number of examples: 10000
[2020-10-09 18:36:39,280 INFO] Step 200/100000; acc:   7.56; ppl: 3920.44; xent: 8.27; lr: 1.00000; 11555/11444 tok/s;     24 sec
[2020-10-09 18:36:44,861 INFO] Step 250/100000; acc:   8.96; ppl: 2168.38; xent: 7.68; lr: 1.00000; 12336/12185 tok/s;     30 sec
[2020-10-09 18:36:50,805 INFO] Step 300/100000; acc:   9.50; ppl: 1994.53; xent: 7.60; lr: 1.00000; 12396/12425 tok/s;     36 sec
[2020-10-09 18:36:52,359 INFO] Loading dataset from toy-ende/demo.train.0.pt
[2020-10-09 18:36:52,594 INFO] number of examples: 10000
[2020-10-09 18:36:57,142 INFO] Step 350/100000; acc:   9.38; ppl: 1667.39; xent: 7.42; lr: 1.00000; 11536/11542 tok/s;     42 sec
[2020-10-09 18:37:02,689 INFO] Step 400/100000; acc:  10.08; ppl: 1478.09; xent: 7.30; lr: 1.00000; 12475/12237 tok/s;     48 sec
[2020-10-09 18:37:08,556 INFO] Step 450/100000; acc:  10.88; ppl: 1303.90; xent: 7.17; lr: 1.00000; 12440/12504 tok/s;     54 sec
[2020-10-09 18:37:10,973 INFO] Loading dataset from toy-ende/demo.train.0.pt
…………
[2020-10-09 21:54:56,217 INFO] Step 99700/100000; acc:   8.97; ppl: 19685.48; xent: 9.89; lr: 0.03125; 11664/11638 tok/s;  11921 sec
[2020-10-09 21:55:02,111 INFO] Step 99750/100000; acc:   9.26; ppl: 18087.62; xent: 9.80; lr: 0.03125; 12328/12309 tok/s;  11927 sec
[2020-10-09 21:55:07,559 INFO] Step 99800/100000; acc:   9.89; ppl: 15161.89; xent: 9.63; lr: 0.03125; 12524/12399 tok/s;  11933 sec
[2020-10-09 21:55:13,534 INFO] Step 99850/100000; acc:   8.78; ppl: 19494.02; xent: 9.88; lr: 0.03125; 12469/12484 tok/s;  11939 sec
[2020-10-09 21:55:13,674 INFO] Loading dataset from toy-ende/demo.train.0.pt
[2020-10-09 21:55:13,938 INFO] number of examples: 10000
[2020-10-09 21:55:19,731 INFO] Step 99900/100000; acc:   9.34; ppl: 18415.28; xent: 9.82; lr: 0.03125; 11610/11664 tok/s;  11945 sec
[2020-10-09 21:55:25,100 INFO] Step 99950/100000; acc:  10.04; ppl: 14772.09; xent: 9.60; lr: 0.03125; 12923/12557 tok/s;  11950 sec
[2020-10-09 21:55:31,169 INFO] Step 100000/100000; acc:   8.85; ppl: 19937.46; xent: 9.90; lr: 0.01562; 12166/12255 tok/s;  11956 sec
[2020-10-09 21:55:31,169 INFO] Loading dataset from toy-ende/demo.valid.0.pt
[2020-10-09 21:55:31,363 INFO] number of examples: 3000
[2020-10-09 21:55:38,832 INFO] Validation perplexity: 31809.2
[2020-10-09 21:55:38,832 INFO] Validation accuracy: 9.18222
[2020-10-09 21:55:39,031 INFO] Saving checkpoint demo-model_step_100000.pt
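
Before translating, it can be useful to glance at what the saved checkpoint actually holds. A minimal sketch, loading it on CPU and listing only the top-level keys, since the exact key names depend on the OpenNMT-py version:

# List the top-level entries of the saved checkpoint.
import torch

ckpt = torch.load("demo-model_step_100000.pt", map_location="cpu")
print(list(ckpt.keys()))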

Model inference

$ onmt_translate -model demo-model_step_100000.pt -src toy-ende/src-test.txt -output pred.txt -replace_unk -verbose

…………
[2020-10-09 22:11:54,972 INFO]
SENT 416: ['Even', 'the', 'most', 'recent', 'statements', 'from', 'Snowden', 'will', 'not', 'change', 'this', 'fact', '.']
PRED 416: Das hat mein Ziel .
PRED SCORE: -7.2980

[2020-10-09 22:18:39,381 INFO] Translating shard 0.
[2020-10-09 22:19:17,179 INFO] PRED AVG SCORE: -1.3478, PRED PPL: 3.8491
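
To put a number on translation quality, the predictions can be scored against a reference with BLEU. This step is not part of the original walkthrough; the sketch below assumes the toy data set ships a tgt-test.txt reference and that the sacrebleu package is installed separately (pip install sacrebleu):

# Hypothetical evaluation step: corpus BLEU of pred.txt against tgt-test.txt.
import sacrebleu

with open("pred.txt") as f:
    hyps = [line.strip() for line in f]
with open("toy-ende/tgt-test.txt") as f:
    refs = [line.strip() for line in f]

print(sacrebleu.corpus_bleu(hyps, [refs]).score)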

speech2text

Data preparation

mkdir -p data; wget -O data/speech.tgz http://lstm.seas.harvard.edu/latex/speech.tgz; tar zxf data/speech.tgz -C data/
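
Here the source and target files play different roles than in the text-to-text case: as an assumption based on the speech2text tutorial, src-train.txt lists audio file paths relative to -src_dir, while tgt-train.txt holds space-separated character transcriptions. A quick peek:

# Print a few audio-path / transcription pairs (assumed layout, see above).
with open("data/speech/src-train.txt") as src, open("data/speech/tgt-train.txt") as tgt:
    for _, (s, t) in zip(range(3), zip(src, tgt)):
        print("AUDIO:", s.strip())
        print("TEXT :", t.strip())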

Data preprocessing

$ onmt_preprocess -data_type audio -src_dir data/speech/an4_dataset -train_src data/speech/src-train.txt -train_tgt data/speech/tgt-train.txt -valid_src data/speech/src-val.txt -valid_tgt data/speech/tgt-val.txt -shard_size 300 -save_data data/speech/demo

[2020-10-08 16:46:42,654 INFO] Extracting features...
[2020-10-08 16:46:42,655 INFO]  * number of source features: 0.
[2020-10-08 16:46:42,655 INFO]  * number of target features: 0.
[2020-10-08 16:46:42,655 INFO] Building `Fields` object...
[2020-10-08 16:46:42,655 INFO] Building & saving training data...
……………………
[2020-10-08 16:46:44,859 INFO]  * saving 0th valid data shard to data/speech/demo.valid.0.pt.

Model training

The original tutorial uses a 4-layer encoder; this example simplifies it to 2 layers.

onmt_train -model_type audio -enc_rnn_size 512 -dec_rnn_size 512 -audio_enc_pooling 1,1 -dropout 0 -enc_layers 2 -dec_layers 1 -rnn_type LSTM -data data/speech/demo -save_model demo-model -global_attention mlp -gpu_ranks 0 -batch_size 8 -optim adam -max_grad_norm 100 -learning_rate 0.0003 -learning_rate_decay 0.8 -train_steps 100000 --save_checkpoint_steps 2000

Output

[2020-10-09 23:57:46,056 INFO]  * tgt vocab size = 31
[2020-10-09 23:57:46,056 INFO] Building model...
[2020-10-09 23:57:49,392 INFO] NMTModel(
  (encoder): AudioEncoder(
    (W): Linear(in_features=512, out_features=512, bias=False)
    (batchnorm_0): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (rnn_0): LSTM(161, 512)
    (pool_0): MaxPool1d(kernel_size=1, stride=1, padding=0, dilation=1, ceil_mode=False)
    (rnn_1): LSTM(512, 512)
    (pool_1): MaxPool1d(kernel_size=1, stride=1, padding=0, dilation=1, ceil_mode=False)
    (batchnorm_1): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  )
  (decoder): InputFeedRNNDecoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(31, 500, padding_idx=1)
        )
      )
    )
    (dropout): Dropout(p=0.0, inplace=False)
    (rnn): StackedLSTM(
      (dropout): Dropout(p=0.0, inplace=False)
      (layers): ModuleList(
        (0): LSTMCell(1012, 512)
      )
    )
    (attn): GlobalAttention(
      (linear_context): Linear(in_features=512, out_features=512, bias=False)
      (linear_query): Linear(in_features=512, out_features=512, bias=True)
      (v): Linear(in_features=512, out_features=1, bias=False)
      (linear_out): Linear(in_features=1024, out_features=512, bias=True)
    )
  )
  (generator): Sequential(
    (0): Linear(in_features=512, out_features=31, bias=True)
    (1): Cast()
    (2): LogSoftmax(dim=-1)
  )
)
[2020-10-09 23:57:49,393 INFO] encoder: 3747840
[2020-10-09 23:57:49,393 INFO] decoder: 4206763
[2020-10-09 23:57:49,393 INFO] * number of parameters: 7954603
[2020-10-09 23:57:49,394 INFO] Starting training on GPU: [0]
[2020-10-09 23:57:49,395 INFO] Start training loop and validate every 10000 steps...
[2020-10-09 23:57:49,395 INFO] Loading dataset from data/speech/demo.train.0.pt
[2020-10-09 23:57:49,455 INFO] number of examples: 300
[2020-10-09 23:57:52,321 INFO] Loading dataset from data/speech/demo.train.1.pt
[2020-10-09 23:57:52,338 INFO] number of examples: 300
[2020-10-09 23:57:53,662 INFO] Step 50/100000; acc:  22.02; ppl: 13.81; xent: 2.63; lr: 0.00030;  93/1318 tok/s;      4 sec
[2020-10-09 23:57:56,415 INFO] Loading dataset from data/speech/demo.train.2.pt
[2020-10-09 23:57:56,434 INFO] number of examples: 296
[2020-10-09 23:57:59,335 INFO] Step 100/100000; acc:  43.42; ppl:  6.86; xent: 1.93; lr: 0.00030;  70/1602 tok/s;     10 sec
[2020-10-09 23:58:00,811 INFO] Loading dataset from data/speech/demo.train.3.pt
[2020-10-09 23:58:00,814 INFO] number of examples: 47
[2020-10-09 23:58:01,619 INFO] Loading dataset from data/speech/demo.train.0.pt
[2020-10-09 23:58:01,636 INFO] number of examples: 300
[2020-10-09 23:58:03,953 INFO] Step 150/100000; acc:  54.66; ppl:  4.50; xent: 1.50; lr: 0.00030;  86/1478 tok/s;     15 sec
……………………
[2020-10-10 02:41:57,632 INFO] Loading dataset from data/speech/demo.train.1.pt
[2020-10-10 02:41:57,647 INFO] number of examples: 300
[2020-10-10 02:41:57,865 INFO] Step 100000/100000; acc:  94.06; ppl:  1.19; xent: 0.18; lr: 0.00008;  92/1411 tok/s;   9558 sec
[2020-10-10 02:41:57,865 INFO] Loading dataset from data/speech/demo.valid.0.pt
[2020-10-10 02:41:57,871 INFO] number of examples: 130
[2020-10-10 02:41:58,077 INFO] Validation perplexity: 68.1276
[2020-10-10 02:41:58,077 INFO] Validation accuracy: 56.2152
[2020-10-10 02:41:58,078 INFO] Saving checkpoint demo-model_step_100000.pt

Model inference

onmt_translate -data_type audio -model demo-model_step_100000.pt -src_dir data/speech/an4_dataset -src data/speech/src-val.txt -output pred.txt -gpu 0 -verbose

Sample output

[2020-10-10 00:07:11,682 INFO]
SENT 130: None
PRED 130: O N E <space> F I V E <space> T W O <space> O N E <space> S E V E N
PRED SCORE: -4.5617

[2020-10-10 00:07:11,682 INFO] PRED AVG SCORE: -0.2203, PRED PPL: 1.2464

Clearly, the model at this stage still performs quite poorly.
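
Since the model predicts character sequences with an explicit <space> token, a tiny helper is enough to turn a prediction back into readable words. A minimal sketch based on the sample output above:

# Join character tokens and map the <space> token back to a real space.
def chars_to_words(pred: str) -> str:
    return "".join(" " if tok == "<space>" else tok for tok in pred.split())

print(chars_to_words("O N E <space> F I V E <space> T W O <space> O N E <space> S E V E N"))
# -> ONE FIVE TWO ONE SEVEN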
