KAN-TTS: First Impressions

2023-03-26

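Background

This post tries out the model described in 《SAMBERT个性化语音合成模型介绍》 ("An Introduction to the SAMBERT Personalized Speech Synthesis Model").

First look

Environment setup

Only Linux is supported, because the ttsfrd dependency is available only for Linux.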
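
A setup script can fail fast on an unsupported platform instead of breaking halfway through the dependency installation. This is a minimal sketch using only the standard library; the function names are hypothetical, and the Linux-only rule is taken from the sentence above.

```python
import platform


def is_supported_platform(system_name: str) -> bool:
    """Return True only for Linux, the one platform ttsfrd is said to ship for."""
    return system_name == "Linux"


def check_platform() -> None:
    """Raise early instead of failing later while installing dependencies."""
    system_name = platform.system()  # e.g. "Linux", "Darwin", "Windows"
    if not is_supported_platform(system_name):
        raise RuntimeError(f"KAN-TTS setup expects Linux, found {system_name}")
```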

Install Miniconda

wget https://mirrors.bfsu.edu.cn/anaconda/miniconda/Miniconda3-latest-Linux-x86_64.sh
sh Miniconda3-latest-Linux-x86_64.sh

Install ModelScope

Python environment

wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
sh Miniconda3-latest-Linux-x86_64.sh
source ~/.bashrc
conda create -n funasr python=3.7
conda activate funasr

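Configure one of the following mirrors to speed up downloads. Each command sets the same global.index-url key, so only the last one you run takes effect.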

pip config set global.index-url https://mirror.sjtu.edu.cn/pypi/web/simple
pip config set global.index-url https://mirrors.ustc.edu.cn/pypi/web/simple
pip config set global.index-url https://mirrors.aliyun.com/pypi/simple/

Install PyTorch (version >= 1.7.0)

pip install torch torchaudio
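
To confirm that the installed version meets the >= 1.7.0 requirement, the version string has to be compared numerically rather than as text (as text, "1.13" sorts before "1.7"). The sketch below shows one way to do that; the helper names are hypothetical.

```python
import re


def parse_version(version: str) -> tuple:
    """Turn '1.13.1+cu117' into (1, 13, 1), ignoring local/build suffixes."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part or 0) for part in match.groups())


def meets_minimum(version: str, minimum: tuple = (1, 7, 0)) -> bool:
    """Compare numerically so that 1.13 counts as newer than 1.7."""
    return parse_version(version) >= minimum
```

After installation, it could be called as meets_minimum(torch.__version__).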

Install or upgrade ModelScope

pip install "modelscope[audio]" --upgrade -f https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html

Inference: first look

Environment preparation

Install dependencies

pip install matplotlib

Install KAN-TTS

git clone https://github.com/alibaba-damo-academy/KAN-TTS.git && cd KAN-TTS
pip install --editable ./

Synthesizing the Zhiyan voice

Synthesizing audio

Code: tts.py

from modelscope.outputs import OutputKeys
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

# Edit the text you want to synthesize here
text = '北京的天气'
model_id = 'damo/speech_sambert-hifigan_tts_zhiyan_emo_zh-cn_16k'
sambert_hifigan_tts = pipeline(task=Tasks.text_to_speech, model=model_id, model_revision='v1.0.3')
output = sambert_hifigan_tts(input=text)
wav = output[OutputKeys.OUTPUT_WAV]
with open('output.wav', 'wb') as f:
    f.write(wav)

Result: output.wav

Input #0, wav, from 'output.wav':
  Duration: 00:00:01.03, bitrate: 256 kb/s
  Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, 1 channels, s16, 256 kb/s
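
The reported figures are consistent with each other: for uncompressed PCM, bitrate = sample rate x bits per sample x channels, so 16000 Hz x 16 bits x 1 channel = 256 kb/s. The same values can be read from the WAV header with the standard library, as in this sketch (the function name is hypothetical):

```python
import wave


def describe_wav(path: str) -> dict:
    """Read the WAV header and derive duration and PCM bitrate."""
    with wave.open(path, "rb") as wav_file:
        channels = wav_file.getnchannels()
        sample_rate = wav_file.getframerate()
        bits_per_sample = wav_file.getsampwidth() * 8
        frames = wav_file.getnframes()
    return {
        "sample_rate_hz": sample_rate,
        "channels": channels,
        "bits_per_sample": bits_per_sample,
        "duration_s": frames / sample_rate,
        # Uncompressed PCM: bitrate = rate * bits * channels
        "bitrate_kbps": sample_rate * bits_per_sample * channels / 1000,
    }
```

Calling describe_wav('output.wav') on the file written by tts.py should report the same sample rate and channel count as shown above.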

Tested and working.

Multi-emotion tags

Code: tts.py

from modelscope.outputs import OutputKeys
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

text = '<speak><emotion category="happy" intensity="1.0">今天天气真不错！</emotion></speak>'
model_id = 'damo/speech_sambert-hifigan_tts_zhiyan_emo_zh-cn_16k'
sambert_hifigan_tts = pipeline(task=Tasks.text_to_speech, model=model_id)
output = sambert_hifigan_tts(input=text)
wav = output[OutputKeys.OUTPUT_WAV]
with open('output.wav', 'wb') as f:
    f.write(wav)

MOS	angry	fear	happy	hate	neutral	sad	surprise	average
recording	4.622	4.609	4.681	4.523	4.539	4.648	4.691	4.6161
zhiyan_emo	4.601	4.658	4.549	4.614	4.466	4.691	4.542	4.5887
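
The average column can be re-derived from the seven per-emotion scores. The short check below copies the values from the table and computes the mean of each row, plus the gap between the recordings and the synthesized voice.

```python
# MOS values copied from the table above, in column order angry..surprise.
mos = {
    "recording": [4.622, 4.609, 4.681, 4.523, 4.539, 4.648, 4.691],
    "zhiyan_emo": [4.601, 4.658, 4.549, 4.614, 4.466, 4.691, 4.542],
}

# Mean of the seven emotion scores per row, rounded like the table.
averages = {name: round(sum(scores) / len(scores), 4) for name, scores in mos.items()}

# Difference between natural recordings and the synthesized voice.
gap = round(averages["recording"] - averages["zhiyan_emo"], 4)
```

Both averages match the table, and the synthesized voice trails the recordings by about 0.03 MOS on average.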

Tested and working.
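
Building the markup by hand gets error-prone once the emotion or intensity varies. The helper below is a sketch that produces the same string as in tts.py. The function name is hypothetical, the category list is assumed from the MOS table columns (reading its fifth column as "neutral"), and whether the model's text front end decodes XML-escaped characters is not verified here.

```python
from xml.sax.saxutils import escape, quoteattr

# Assumed from the MOS table columns; the model may accept a different set.
EMOTIONS = {"angry", "fear", "happy", "hate", "neutral", "sad", "surprise"}


def emotion_ssml(text: str, category: str, intensity: float = 1.0) -> str:
    """Wrap text in the <speak><emotion> markup used in tts.py above."""
    if category not in EMOTIONS:
        raise ValueError(f"unknown emotion category: {category}")
    return (
        f"<speak><emotion category={quoteattr(category)} "
        f"intensity={quoteattr(str(intensity))}>{escape(text)}</emotion></speak>"
    )
```

The result can be passed as the input argument of the pipeline call in place of the hand-written string.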

Synthesizing the Zhitian voice

Model ID:

model_id = 'damo/speech_sambert-hifigan_tts_zhitian_emo_zh-cn_16k'
sambert_hifigan_tts = pipeline(task=Tasks.text_to_speech, model=model_id, model_revision='v1.0.3')

Tested and working.

Synthesizing Sichuanese

model_id = 'speech_tts/speech_sambert-hifigan_tts_chuangirl_Sichuan_16k'
sambert_hifigan_tts = pipeline(task=Tasks.text_to_speech, model=model_id)

Tested and working.

Synthesizing Shanghainese

model_id = 'speech_tts/speech_sambert-hifigan_tts_xiaoda_WuuShanghai_16k'
sambert_hifigan_tts = pipeline(task=Tasks.text_to_speech, model=model_id)

Tested and working.

Synthesizing Cantonese (Guangdong)

model_id = 'speech_tts/speech_sambert-hifigan_tts_jiajia_Cantonese_16k'
sambert_hifigan_tts = pipeline(task=Tasks.text_to_speech, model=model_id)

In my test this raised an error:

size mismatch for text_encoder.sy_emb.weight: copying a param with shape torch.Size([107, 512]) from checkpoint, the shape in current model is torch.Size([147, 512]).
        size mismatch for text_encoder.tone_emb.weight: copying a param with shape torch.Size([14, 512]) from checkpoint, the shape in current model is torch.Size([10, 512]).

American English

Speech synthesis, American English, general domain, 16k, speaker Andy

text = 'How is the weather in beijing?'
model_id = 'damo/speech_sambert-hifigan_tts_andy_en-us_16k'
sambert_hifigan_tts = pipeline(task=Tasks.text_to_speech, model=model_id)

In my test this raised an error:

RuntimeError: Error(s) in loading state_dict for KanTtsSAMBERT:
        size mismatch for text_encoder.sy_emb.weight: copying a param with shape torch.Size([46, 512]) from checkpoint, the shape in current model is torch.Size([146, 512]).

British English

Speech synthesis, British English, general domain, 16k, speaker Luna

text = 'How is the weather in beijing?'
model_id = 'damo/speech_sambert-hifigan_tts_luna_en-gb_16k'
sambert_hifigan_tts = pipeline(task=Tasks.text_to_speech, model=model_id)

In my test this raised an error:

RuntimeError: Error(s) in loading state_dict for KanTtsSAMBERT:
        size mismatch for text_encoder.sy_emb.weight: copying a param with shape torch.Size([51, 512]) from checkpoint, the shape in current model is torch.Size([146, 512]).
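
The Cantonese, American English, and British English failures share one pattern: an embedding table stored in the checkpoint has a different number of rows than the table the installed code builds. A likely cause, stated here as an assumption and not verified, is that these checkpoints were trained with a different symbol set than the one the installed version constructs. The sketch below only parses the error text into structured form so the mismatches are easier to compare; the function name is hypothetical.

```python
import re

# Matches one "size mismatch" line from a PyTorch load_state_dict error.
MISMATCH = re.compile(
    r"size mismatch for (?P<name>[\w.]+): copying a param with shape "
    r"torch\.Size\(\[(?P<ckpt>[\d, ]+)\]\) from checkpoint, "
    r"the shape in current model is torch\.Size\(\[(?P<model>[\d, ]+)\]\)"
)


def parse_mismatches(message: str) -> list:
    """Return (parameter, checkpoint_shape, model_shape) for each mismatch line."""
    def to_shape(text):
        return tuple(int(dim) for dim in text.split(","))

    return [
        (m["name"], to_shape(m["ckpt"]), to_shape(m["model"]))
        for m in MISMATCH.finditer(message)
    ]
```

For the Andy model, for example, this yields a checkpoint shape of (46, 512) against a model shape of (146, 512) for text_encoder.sy_emb.weight.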

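Training: first look

Following the tutorial in reference 4, prepare a machine with a GPU.

How it works

Recording check -> data processing -> model training -> packaging and synthesis.

Model training produces a base acoustic model and a base vocoder. This model and vocoder are then used to synthesize audio from text in the personalized voice.

Environment preparation

Code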

git clone https://github.com/alibaba-damo-academy/KAN-TTS.git
cd KAN-TTS

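Environment

# To avoid network problems when installing with pip, switching to a PyPI mirror in China is recommended
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple

# Create the virtual environment
conda env create -f environment.yaml

# Activate the virtual environment
conda activate maas

Creating the virtual environment was too slow in my case, and my Python was already 3.7.12, so I installed the requirements directly: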

pip install -r requirements.txt

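Preparing the base model

ModelScope's Chinese personalized speech synthesis model is a pretrained model produced by the DAMO Academy speech lab, trained on a dataset of more than 1,000 hours from more than 4,000 speakers.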

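# Install the required software
sudo apt-get install git-lfs

# Clone the pretrained model
git clone https://www.modelscope.cn/damo/speech_personal_sambert-hifigan_nsf_tts_zh-cn_pretrain_16k.git

# The git-lfs commands must run inside the cloned repository
cd speech_personal_sambert-hifigan_nsf_tts_zh-cn_pretrain_16k

# List the base model files
git-lfs ls-files

# Download the base model files
git-lfs pull

# Return to the KAN-TTS directory; later commands use paths relative to it
cd ..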

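Automatic labeling of your own audio

If you have only audio and no annotations, you can use the automatic labeling tool. This post skips this section and uses the sample audio from the next section directly.

Install tts-autolabel

python -m pip install tts-autolabel -f https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html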

label.py

import os

# Import the run_auto_label tool; the first run downloads the required library files
from modelscope.tools import run_auto_label

input_wav = '/mnt/workspace/Data/ptts_spk0_wav'  # wav audio path
work_dir = '/mnt/workspace/Data/ptts_spk0_autolabel'  # output path
os.makedirs(work_dir, exist_ok=True)

# Run autolabel; labeling 20 utterances takes about 4 minutes
ret, report = run_auto_label(input_wav=input_wav,
                             work_dir=work_dir,
                             resource_revision='v1.0.4')
print(report)

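Open-source speech audio

Dependencies

sudo apt-get install sox libsndfile1

Getting the dataset

Download the AISHELL-3 open-source speech synthesis dataset from ModelScope, already processed into Alibaba's standard format, for use in the following steps.

# Unzip the data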
unzip aishell3.zip

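Resampling

AISHELL-3 contains recordings from more than two hundred speakers, with 20 to 30 minutes of data per speaker. Because the original audio is sampled at 44.1 kHz, we first resample it to 16 kHz, using a script from the dataset's metadata repository.

# Clone the metadata repository and resample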
git clone https://www.modelscope.cn/datasets/speech_tts/AISHELL-3.git
./AISHELL-3/aishell_resample.sh aishell3 aishell3_16k 16000
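
Before moving on, it can be worth confirming that every resampled file really is 16 kHz. The sketch below walks a directory tree and lists any WAV file whose header reports a different rate. It does not reproduce what aishell_resample.sh does internally, and the function name is hypothetical.

```python
import wave
from pathlib import Path


def wrong_rate_files(root: str, expected_hz: int = 16000) -> list:
    """List WAV files under root whose header sample rate is not expected_hz."""
    bad = []
    for path in sorted(Path(root).rglob("*.wav")):
        with wave.open(str(path), "rb") as wav_file:
            if wav_file.getframerate() != expected_hz:
                bad.append(str(path))
    return bad
```

An empty result from wrong_rate_files('aishell3_16k') would indicate that all files were resampled.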

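Selecting one speaker

Pick 20 utterances from one speaker for data processing. This post uses SSB0018 as the example (you can choose any other speaker) and trains a speech synthesis model at a 16k sampling rate. First, randomly select 20 audio files and their annotations from that speaker's data.

# Randomly pick 20 utterances with their annotations

python ./AISHELL-3/get_random_20.py aishell3_16k/SSB0018 aishell3_16k_20/SSB0018

## Dataset format and structure
* interval: phoneme-level timestamp annotations
* wav: audio files
* prosody: text annotations for the audio files

In my run, some wav files had no matching interval file, so get_random_20.py needed a small tweak: when selecting names, first check that both the wav file and the interval file exist.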
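
The tweak amounts to intersecting the set of wav names with the set of interval names before sampling. The sketch below illustrates the idea and is not the actual get_random_20.py; the directory layout and the .interval extension are assumptions, and the function name is hypothetical.

```python
import random
from pathlib import Path


def pick_complete_names(speaker_dir: str, count: int = 20, seed: int = 0) -> list:
    """Pick up to `count` utterance names that have both a wav and an interval file."""
    root = Path(speaker_dir)
    # Assumed layout: <speaker>/wav/<name>.wav and <speaker>/interval/<name>.interval
    wav_names = {p.stem for p in (root / "wav").glob("*.wav")}
    interval_names = {p.stem for p in (root / "interval").glob("*.interval")}
    complete = sorted(wav_names & interval_names)
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    return sorted(rng.sample(complete, min(count, len(complete))))
```

Names missing either file never enter the candidate pool, so the 20 selected utterances are always usable for feature extraction.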

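Feature extraction

Environment preparation

pip install pysptk bitstring==3.1.6

Running feature extraction

Choose a personalized speech synthesis config file for feature extraction; here we use the provided 16k config, kantts/configs/audio_config_se_16k.yaml. Run the following command to extract features. --speaker is the name of the speaker for this dataset and can be anything you like.

# Feature extraction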
python kantts/preprocess/data_process.py --voice_input_dir aishell3_16k_20/SSB0018 --voice_output_dir training_stage/SSB0018_ptts_feats --audio_config kantts/configs/audio_config_se_16k.yaml --speaker SSB0018 --se_model speech_personal_sambert-hifigan_nsf_tts_zh-cn_pretrain_16k/basemodel_16k/speaker_embedding/se.onnx

Result

Voc metafile generated. 
AM metafile generated.

Expanding the epoch

Expand the 20 utterances into 3,200 list entries, so that each utterance appears 160 times.

stage0=training_stage
voice=SSB0018_ptts_feats

cat $stage0/$voice/am_valid.lst  >> $stage0/$voice/am_train.lst
lines=0
while [ $lines -lt 3200 ]
do
shuf $stage0/$voice/am_train.lst >> $stage0/$voice/am_train.lst.tmp
lines=$(wc -l < "$stage0/$voice/am_train.lst.tmp")
done
mv $stage0/$voice/am_train.lst.tmp $stage0/$voice/am_train.lst
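
The loop keeps appending a freshly shuffled copy of the training list until the file has at least 3,200 lines. With 20 utterances that takes 160 passes, so each utterance appears exactly 160 times. The Python sketch below mirrors that logic on an in-memory list; the utterance names are placeholders and the function name is hypothetical.

```python
import random
from collections import Counter


def expand_list(lines: list, target: int = 3200, seed: int = 0) -> list:
    """Append shuffled copies of `lines` until at least `target` entries exist."""
    if not lines:
        raise ValueError("cannot expand an empty list")
    rng = random.Random(seed)
    expanded = []
    while len(expanded) < target:
        batch = list(lines)
        rng.shuffle(batch)  # same role as `shuf` in the shell loop
        expanded.extend(batch)
    return expanded


utterances = [f"utt_{i:02d}" for i in range(20)]  # placeholder names
expanded = expand_list(utterances)
copies = Counter(expanded)  # how many times each utterance appears
```

If the list length does not divide 3,200 evenly, the loop overshoots the target by part of one pass, which is also how the shell version behaves.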

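Training the acoustic model

Setting train_max_steps to 3180301

We fine-tune on top of the basemodel.

The default train_max_steps for fine-tuning is 3180301. It is set in the pretrained basemodel's config file, speech_personal_sambert-hifigan_nsf_tts_zh-cn_pretrain_16k/basemodel_16k/sambert/config.yaml.

You can change it yourself. This post keeps the default.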

...
train_max_steps: 3180301
...
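
The value only makes sense relative to the checkpoint being resumed. Assuming the training script continues counting from the step number in the checkpoint file name (an assumption about train_sambert.py, not something verified here), resuming from checkpoint_3180000.pth with train_max_steps at 3180301 leaves roughly 301 fine-tuning steps. The arithmetic, as a sketch with a hypothetical helper:

```python
import re


def checkpoint_step(path: str) -> int:
    """Extract the step count from a name like 'checkpoint_3180000.pth'."""
    match = re.search(r"checkpoint_(\d+)\.pth$", path)
    if match is None:
        raise ValueError(f"no step number in {path!r}")
    return int(match.group(1))


resume_step = checkpoint_step("basemodel_16k/sambert/ckpt/checkpoint_3180000.pth")
train_max_steps = 3180301  # value from config.yaml
finetune_steps = train_max_steps - resume_step
```

This is consistent with the synthesis step later in this post, which loads checkpoint_3180300.pth.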

Training the acoustic model

Dependencies

pip install pywavelets==1.3.0 git+https://github.com/fbcotter/pytorch_wavelets.git tensorboardx==2.2 matplotlib==3.5.1

Run training

CUDA_VISIBLE_DEVICES=0 python kantts/bin/train_sambert.py --model_config speech_personal_sambert-hifigan_nsf_tts_zh-cn_pretrain_16k/basemodel_16k/sambert/config.yaml --root_dir  training_stage/SSB0018_ptts_feats --stage_dir training_stage/SSB0018_ptts_sambert_ckpt --resume_path speech_personal_sambert-hifigan_nsf_tts_zh-cn_pretrain_16k/basemodel_16k/sambert/ckpt/checkpoint_3180000.pth

Synthesizing personalized audio

Preparing the text

Create a file test.txt with one sentence per line, for example:

徐玠诡谲多智，善揣摩，知道徐知询不可辅佐，掌握着他的短处以归附徐知诰。
许乐夫生于山东省临朐县杨善镇大辛庄，毕业于抗大一分校。
宣统元年（1909年），顺德绅士冯国材在香山大黄圃成立安洲农务分会，管辖东海十六沙，冯国材任总理。
学生们大多住在校区宿舍，通过参加不同的体育文化俱乐部及社交活动，形成一个友谊长存的社会圈。
学校的“三节一会”（艺术节、社团节、科技节、运动会）是显示青春才华的盛大活动。
雪是先天自闭症患者，不懂与人沟通，却拥有灵敏听觉，而且对复杂动作过目不忘。
勋章通过一柱状螺孔和螺钉附着在衣物上。
雅恩雷根斯堡足球俱乐部（）是一家位于德国雷根斯堡的足球俱乐部，处于德国足球丙级联赛。
亚历山大·格罗滕迪克于1957年证明了一个深远的推广，现在叫做格罗滕迪克–黎曼–罗赫定理。

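Run synthesis

Run the following command to synthesize. se_file is the speaker embedding extracted in the feature extraction step, and voc_ckpt is the pretrained vocoder model in basemodel_16k. Note that the command as written takes --res_zip and --voc_ckpt from a speech_sambert-hifigan_tts_zh-cn_multisp_pretrain_16k directory, which is not the model cloned earlier in this post; adjust these paths to the model directory you actually downloaded.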


# Run speech synthesis

CUDA_VISIBLE_DEVICES=0 python kantts/bin/text_to_wav.py --txt test.txt --output_dir res/SSB0018_ptts_syn --res_zip speech_sambert-hifigan_tts_zh-cn_multisp_pretrain_16k/resource.zip --am_ckpt training_stage/SSB0018_ptts_sambert_ckpt/ckpt/checkpoint_3180300.pth --voc_ckpt speech_sambert-hifigan_tts_zh-cn_multisp_pretrain_16k/hifigan/ckpt/checkpoint_2400000.pth --se_file training_stage/SSB0018_ptts_feats/se/se.npy