A First Look at FunASR

2023-04-05

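Background

FunASR is an end-to-end speech recognition toolkit. Version 0.3 was released on March 17, 2023. It adds support for gRPC-based serving and a new real-time captioning demo built on a 2-pass recognition scheme: the streaming Paraformer model produces the text shown on screen, and the offline Paraformer-large model corrects the recognition result.

The goal of this post is to get that demo running from scratch.

Getting started

Environment

Follow the official documentation.

Operating system: Ubuntu 20.04.5 LTS (Focal Fossa)

Python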

wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
sh Miniconda3-latest-Linux-x86_64.sh
source ~/.bashrc
conda create -n funasr python=3.7
conda activate funasr

Alternatively, download the installer from a mirror:

wget https://mirrors.bfsu.edu.cn/anaconda/miniconda/Miniconda3-latest-Linux-x86_64.sh

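Configure one of the following mirrors to speed up downloads (each command overwrites the previous setting, so only one is needed)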

pip config set global.index-url https://mirror.sjtu.edu.cn/pypi/web/simple
pip config set global.index-url https://mirrors.ustc.edu.cn/pypi/web/simple
pip config set global.index-url https://mirrors.aliyun.com/pypi/simple/

Install PyTorch (version >= 1.7.0)

pip install torch torchaudio
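
To confirm that the installed PyTorch satisfies the >= 1.7.0 requirement, a small check like the one below can help. This is only a sketch: the helper names version_tuple and meets_minimum are made up for illustration, and it assumes the version string starts with dotted numbers (a local suffix such as +cu117 is ignored).

```python
import re


def version_tuple(version: str) -> tuple:
    """Turn '1.13.1+cu117' into (1, 13, 1); anything after the numbers is ignored."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part or 0) for part in match.groups())


def meets_minimum(version: str, minimum: str = "1.7.0") -> bool:
    """Return True when `version` is at least `minimum`."""
    return version_tuple(version) >= version_tuple(minimum)


if __name__ == "__main__":
    try:
        import torch
        print(torch.__version__, meets_minimum(torch.__version__))
    except ImportError:
        print("torch is not installed")
```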

Install or upgrade ModelScope

pip install "modelscope[audio_asr]" --upgrade -f https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html

Install FunASR from source

git clone https://github.com/alibaba/FunASR.git && cd FunASR

# Check out the v0.3.0 tag
git fetch origin --tags
git checkout v0.3.0

# Install the package in editable mode
pip install --editable ./

Paraformer-large offline model (long-audio version)

The file paraformer_large.py:

from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

inference_pipeline = pipeline(
    task=Tasks.auto_speech_recognition,
    model='damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch',
    vad_model='damo/speech_fsmn_vad_zh-cn-16k-common-pytorch',
    vad_model_revision="v1.1.8",
    punc_model='damo/punc_ct-transformer_zh-cn-common-vocab272727-pytorch',
    punc_model_revision="v1.1.6")

rec_result = inference_pipeline(audio_in='https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_vad_punc_example.wav')
print(rec_result)

Result

{'text': '正是因为存在绝对正义，所以我们接受现实的相对正义，但是不要因为现实的相对正义，我们就认为这个世界没有正义。因为如果当你认为这个世界没有正义。', 'text_postprocessed': '正 是 因 为 存 在 绝 对 正 义 所 以 我 们 接 受 现 实 的 相 对 正 义 但 是 不 要 因 为 现 实 的 相 对 正 义 我 们 就 认 为 这 个 世 界 没 有 正 义 因 为 如 果 当 你 认 为 这 个 世 界 没 有 正 义', 'time_stamp': [[430, 670], [670, 809], [809, 1030], [1030, 1130], [1130, 1330], [1330, 1510], [1510, 1670], [1670, 1810], [1810, 1970], [1970, 2210], [2270, 2390], [2390, 2490], [2490, 2570], [2570, 2710], [2710, 2950], [2970, 3210], [3310, 3550], [3570, 3730], [3730, 3830], [3830, 3970], [3970, 4149], [4149, 4270], [4270, 4535], [5289, 5470], [5470, 5609], [5609, 5710], [5710, 5909], [5909, 6069], [6069, 6229], [6229, 6470], [6470, 6649], [6649, 6750], [6750, 6949], [6949, 7129], [7129, 7250], [7250, 7490], [7490, 7590], [7590, 7709], [7709, 7910], [7910, 8070], [8070, 8290], [8290, 8430], [8430, 8550], [8550, 8709], [8709, 8950], [9050, 9290], [9370, 9550], [9550, 9790], [9790, 9965], [10600, 10760], [10760, 10900], [10900, 11120], [11120, 11300], [11300, 11400], [11400, 11580], [11580, 11700], [11700, 11800], [11800, 11920], [11920, 12020], [12020, 12160], [12160, 12320], [12320, 12440], [12440, 12560], [12560, 12740], [12740, 12945]], 'sentences': [{'text': '正是因为存在绝对正义,', 'start': 430, 'end': 2210}, {'text': '所以我们接受现实的相对正义,', 'start': 2210, 'end': 4535}, {'text': '但是不要因为现实的相对正义,', 'start': 4535, 'end': 7490}, {'text': '我们就认为这个世界没有正义.', 'start': 7490, 'end': 9965}, {'text': '因为如果当你认为这个世界没有正义.', 'start': 9965, 'end': 12945}]}
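
The 'sentences' list in the result carries a start and end time per sentence, which makes it easy to turn into subtitles. The sketch below formats such a list as SRT text. It assumes the start and end values are milliseconds (the result does not state the unit), and the function names are illustrative rather than part of FunASR or ModelScope.

```python
def ms_to_srt_time(ms: int) -> str:
    """Format a millisecond offset as an SRT timestamp, e.g. 00:00:02,210."""
    hours, rest = divmod(ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def sentences_to_srt(sentences) -> str:
    """Build SRT text from dicts that have 'text', 'start' and 'end' keys."""
    blocks = []
    for index, sentence in enumerate(sentences, start=1):
        start = ms_to_srt_time(sentence["start"])
        end = ms_to_srt_time(sentence["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{sentence['text']}\n")
    return "\n".join(blocks)


# First two entries copied from the 'sentences' field of the result above.
sentences = [
    {"text": "正是因为存在绝对正义,", "start": 430, "end": 2210},
    {"text": "所以我们接受现实的相对正义,", "start": 2210, "end": 4535},
]
print(sentences_to_srt(sentences))
```

With a real result, the same call would be sentences_to_srt(rec_result['sentences']).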

Paraformer streaming model

Software dependencies

funasr 0.3.0 has requirement tensorboard==1.15

python3 -m pip install tensorflow==1.15 protobuf==3.20.1

paraformer.py

from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

inference_16k_pipeline = pipeline(
    task=Tasks.auto_speech_recognition,
    model='damo/speech_paraformer_asr_nat-zh-cn-16k-common-vocab3444-tensorflow1-online')

rec_result = inference_16k_pipeline(audio_in='https://modelscope.oss-cn-beijing.aliyuncs.com/test/audios/asr_example.wav')
print(rec_result)

Result

2023-03-23 15:30:22,232 - modelscope - INFO - Computing the result of ASR ...
{'text': '每一天都要快乐哦'}

Streaming example

import os
import logging
import torch
import soundfile

from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
from modelscope.utils.logger import get_logger

logger = get_logger(log_level=logging.CRITICAL)
logger.setLevel(logging.CRITICAL)

os.environ["MODELSCOPE_CACHE"] = "./"
inference_pipeline = pipeline(
    task=Tasks.auto_speech_recognition,
    model='damo/speech_paraformer_asr_nat-zh-cn-16k-common-vocab8404-online',
    model_revision='v1.0.2')

model_dir = os.path.join(os.environ["MODELSCOPE_CACHE"], "damo/speech_paraformer_asr_nat-zh-cn-16k-common-vocab8404-online")
speech, sample_rate = soundfile.read(os.path.join(model_dir, "example/asr_example.wav"))
speech_length = speech.shape[0]

sample_offset = 0
step = 4800  # 4800 samples = 300 ms at 16 kHz
# The cache carries encoder and decoder state from one chunk to the next.
param_dict = {"cache": dict(), "is_final": False}
param_dict["cache"]["encoder"] = {"start_idx": 0, "pad_left": 0, "stride": 10, "pad_right": 5, "cif_hidden": None, "cif_alphas": None, "is_final": False, "left": 0, "right": 0}
param_dict["cache"]["decoder"] = {"decode_fsmn": None}
final_result = ""

# Feed the audio chunk by chunk; the stride is fixed when range() is created.
for sample_offset in range(0, speech_length, min(step, speech_length - sample_offset)):
    # Last chunk: shrink it to the remaining samples and mark it as final.
    if sample_offset + step >= speech_length - 1:
        step = speech_length - sample_offset
        param_dict["is_final"] = True
    rec_result = inference_pipeline(audio_in=speech[sample_offset: sample_offset + step],
                                    param_dict=param_dict)
    # Skip empty results and the "sil" / "waiting_for_more_voice" marker strings.
    if len(rec_result) != 0 and rec_result['text'] != "sil" and rec_result['text'] != "waiting_for_more_voice":
        final_result += rec_result['text']
    print(rec_result)
print(final_result)
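
The chunking logic in the loop above can be isolated into a small generator, which makes it easier to reason about where the final chunk falls. This is a simplified sketch, not the FunASR implementation: iter_chunks is a hypothetical helper, and it flags the final chunk purely by position (the chunk that reaches the end of the audio).

```python
def iter_chunks(num_samples: int, chunk_size: int = 4800):
    """Yield (start, end, is_final) for consecutive chunks covering num_samples."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, num_samples, chunk_size):
        end = min(start + chunk_size, num_samples)
        # The chunk that reaches the end of the audio is the final one.
        yield start, end, end == num_samples


# One second of 16 kHz audio: three full 300 ms chunks plus a 100 ms tail.
for start, end, is_final in iter_chunks(16000):
    print(start, end, is_final)
```

In the streaming loop, speech[start:end] would be the audio passed to the pipeline and is_final would be copied into param_dict["is_final"].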

Model export

Export Paraformer-large to ONNX

python -m funasr.export.export_model --model-name damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch --export-dir paraformer-large --type onnx --quantize True

Export the streaming Paraformer model to ONNX

pip install onnx onnxruntime

python -m funasr.export.export_model --model-name damo/speech_paraformer_asr_nat-zh-cn-16k-common-vocab8404-online --export-dir ./export --type onnx --quantize True

Result

$ du -sh speech_paraformer_asr_nat-zh-cn-16k-common-vocab8404-online/
620M	speech_paraformer_asr_nat-zh-cn-16k-common-vocab8404-online/

Building on Linux

git clone https://github.com/alibaba-damo-academy/FunASR.git && cd FunASR/funasr/runtime/onnxruntime
mkdir build
cd build
# download an appropriate onnxruntime from https://github.com/microsoft/onnxruntime/releases/tag/v1.14.0
# here we get a copy of onnxruntime for linux 64
wget https://github.com/microsoft/onnxruntime/releases/download/v1.14.0/onnxruntime-linux-x64-1.14.0.tgz
tar -zxvf onnxruntime-linux-x64-1.14.0.tgz
# ls
# onnxruntime-linux-x64-1.14.0  onnxruntime-linux-x64-1.14.0.tgz

# install fftw3-dev (run the command that matches your distribution)
apt install libfftw3-dev       # Ubuntu
yum install fftw fftw-devel    # CentOS

# install openblas
bash ./third_party/install_openblas.sh

# build; point ONNXRUNTIME_DIR at the onnxruntime directory extracted above
cmake -DCMAKE_BUILD_TYPE=release .. -DONNXRUNTIME_DIR=/mnt/c/Users/ma139/RapidASR/cpp_onnx/build/onnxruntime-linux-x64-1.14.0
make

# the test programs are then available in the tester subfolder of the current directory

Testing

./tester/tester_rtf /home/service/tmp/d1/export/damo/speech_paraformer_asr_nat-zh-cn-16k-common-vocab8404-online /home/service/work/wav/asr_engine/wav.list true

Appendix

UniASR example

System dependencies

apt-get install libsndfile1

The file uniasr.py:

from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

inference_16k_pipeline = pipeline(
    task=Tasks.auto_speech_recognition,
    model='damo/speech_UniASR_asr_2pass-zh-cn-16k-common-vocab8358-tensorflow1-offline')

rec_result = inference_16k_pipeline(audio_in='https://modelscope.oss-cn-beijing.aliyuncs.com/test/audios/asr_example.wav')
print(rec_result)

Result

2023-03-23 14:25:10,780 - modelscope - INFO - Decoding with wav files ...
2023-03-23 14:25:10,780 (asr_inference_pipeline:507) INFO: Decoding with wav files ...
[##################################################] 100%
2023-03-23 14:25:17,022 - modelscope - INFO - Computing the result of ASR ...
2023-03-23 14:25:17,022 (asr_inference_pipeline:542) INFO: Computing the result of ASR ...
{'text': '每一天都要快乐哦'}

Streaming recognition

WebSocket protocol

Server

Client on Linux

Dependencies:

sudo apt-get install portaudio19-dev

Client on macOS

brew install portaudio
python3 -m pip install -r requirements_client.txt

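Background knowledge

Pretrained models

The models fall into two families: Paraformer and UniASR.

Paraformer

Paraformer is a single-pass non-autoregressive model that offers high recognition accuracy and computational efficiency.

model	clean (CER%)	common (CER%)	RTF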
Paraformer	9.73	12.96	0.0093
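
RTF in the table is the real-time factor: processing time divided by audio duration, so lower is faster and values below 1 mean faster than real time. The sketch below shows how it could be measured. measure_rtf and its recognize argument are illustrative names; recognize stands for any callable that runs recognition on the audio.

```python
import time


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent decoding / duration of the decoded audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def measure_rtf(recognize, audio, audio_seconds: float) -> float:
    """Time one call to `recognize(audio)` and convert it to an RTF."""
    started = time.perf_counter()
    recognize(audio)
    return real_time_factor(time.perf_counter() - started, audio_seconds)


# At the table's RTF of 0.0093, one minute of audio needs about 0.56 s to decode.
print(f"{0.0093 * 60:.2f} s")
```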

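Supported platforms

Currently runs only on Linux x86_64; macOS and Windows are not supported.

Usage

Direct inference: decode the input audio directly and output the recognized text.
Fine-tuning: load the pretrained model and continue training it on private or open-source data.

Scope and target scenarios

This is a pseudo-streaming model. It can be used to evaluate streaming recognition quality, but true real-time recognition requires pairing it with the runtime.

UniASR

UniASR is a two-pass end-to-end speech recognition model in which the second pass refreshes the output of the first.

model	clean (CER%)	common (CER%)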
offline	5.84	9.73
normal	6.11	10.42
fast(900ms)	8.60	12.67
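
The CER columns in these tables are character error rates: the character-level edit distance between the reference transcript and the recognized text, divided by the reference length. A minimal sketch of the computation follows. The function names are illustrative, and real evaluations usually normalize text (for example by stripping punctuation) first, which is not done here.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, computed row by row."""
    previous = list(range(len(hyp) + 1))
    for i, ref_char in enumerate(ref, start=1):
        current = [i]
        for j, hyp_char in enumerate(hyp, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def cer_percent(ref: str, hyp: str) -> float:
    """Character error rate in percent, relative to the reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


# One missing character out of eight reference characters.
print(cer_percent("每一天都要快乐哦", "每一天都要快乐"))
```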

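Scenarios

Low-latency real-time dictation: for example telephone customer service and IoT voice interaction. This scenario is very sensitive to endpoint latency; the result is usually expected immediately after the user stops speaking.
Streaming real-time dictation: for example live meeting captions and voice input methods. The recognition result must be returned in real time so it can be shown on screen, and the output must also be refreshed with a high-accuracy result at the end of each utterance.
Offline file transcription: for example audio transcription and video subtitle generation. There is no real-time requirement; the goal is to transcribe as quickly as possible while keeping recognition accuracy high.

Decoding modes

fast mode: a single decoding pass in low-latency real-time output mode;
normal mode: two decoding passes. The first pass shows text on screen in real time with low latency; the second pass refreshes the decoded result at an interval of 3 to 6 s (configurable);
offline mode: a single decoding pass in high-accuracy offline mode.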
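
The two-pass behaviour of normal mode can be pictured with a toy transcript buffer: first-pass text is appended as it arrives, and a second-pass result replaces everything not yet refreshed. This is only an illustration of the idea under that assumption; TwoPassTranscript and its methods are hypothetical and are not part of FunASR or ModelScope, and the example strings are made up.

```python
class TwoPassTranscript:
    """Toy model of a caption buffer that is refreshed by a second pass."""

    def __init__(self):
        self.committed = []  # segments already refreshed by the second pass
        self.pending = ""    # low-latency first-pass text awaiting refresh

    def on_first_pass(self, partial_text: str) -> None:
        """Append low-latency text so it can be shown immediately."""
        self.pending += partial_text

    def on_second_pass(self, refined_text: str) -> None:
        """Replace the pending first-pass text with the refined result."""
        self.committed.append(refined_text)
        self.pending = ""

    def display(self) -> str:
        """Text currently shown on screen."""
        return "".join(self.committed) + self.pending


transcript = TwoPassTranscript()
transcript.on_first_pass("每一天")
transcript.on_first_pass("都要快了")       # made-up first-pass error
print(transcript.display())
transcript.on_second_pass("每一天都要快乐哦")  # made-up second-pass correction
print(transcript.display())
```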

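Supported platforms

Currently runs only on Linux x86_64; macOS and Windows are not supported.

Usage

Direct inference: decode the input audio directly and output the recognized text.
Fine-tuning: load the pretrained model and continue training it on private or open-source data.

Scope and target scenarios

Input audio of 20 s or less is recommended.

References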
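
Before sending a file to the model, its duration can be checked against the recommended 20 s. The sketch below uses Python's standard wave module and therefore assumes an uncompressed PCM WAV file; the helper names are illustrative.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Duration of a PCM WAV file: number of frames divided by the frame rate."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def within_recommended_length(path: str, limit_seconds: float = 20.0) -> bool:
    """True when the file is no longer than the recommended input length."""
    return wav_duration_seconds(path) <= limit_seconds
```

Longer recordings would need to be split first, or handled by the long-audio Paraformer-large pipeline with VAD shown earlier.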

FunASR@github