A Hands-On Guide to Mainstream Python Speech-to-Text (STT) Libraries

Posted by 水坤 on 2026-1-5 11:19:23


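<div id="navCategory"><h5 class="catalogue">Contents</h5><ul class="first_class_ul"><li><a href="#_label0">Preface</a></li><li><a href="#_label1">1 PaddleSpeech</a></li><ul class="second_class_ul"><li><a href="#_lab2_1_0">1.1 Installation steps</a></li><li><a href="#_lab2_1_1">1.2 Test code</a></li><li><a href="#_lab2_1_2">1.3 Errors encountered</a></li></ul><li><a href="#_label2">2 Whisper</a></li><ul class="second_class_ul"><li><a href="#_lab2_2_3">2.1 Installation commands</a></li></ul><li><a href="#_label3">2.2 Test code</a></li><ul class="second_class_ul"></ul><li><a href="#_label4">3 FunASR</a></li><ul class="second_class_ul"><li><a href="#_lab2_4_4">3.1 Installation steps</a></li><li><a href="#_lab2_4_5">3.2 Test code</a></li><li><a href="#_lab2_4_6">3.3 Errors encountered</a></li></ul><li><a href="#_label5">Summary</a></li><ul class="second_class_ul"></ul></ul></div><p class="maodian"><a name="_label0"></a></p><h2>Preface</h2>
<p>Speech-to-text (STT) is a core technology in human-computer interaction, audio and video processing, intelligent customer service, and related fields. With its rich ecosystem and ease of use, Python has become a mainstream language for STT development. Python STT libraries now come in many forms: lightweight local libraries that work out of the box, high-accuracy cloud tools backed by major vendors' APIs, and open-source model libraries that balance speed and accuracy, covering everything from personal projects to enterprise systems. This article focuses on commonly used, mature, and practical STT libraries in the Python ecosystem and reviews them in terms of features, recognition accuracy, deployment cost, and suitable scenarios, to give developers a clear reference when choosing one.</p>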
<p class="maodian"><a name="_label1"></a></p><h2>1 PaddleSpeech</h2>
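<p>PaddleSpeech is an open-source speech toolkit from Baidu's PaddlePaddle project, focused on Chinese speech recognition and synthesis. Building on the high-performance computing of the PaddlePaddle framework, it performs well on Mandarin Chinese and on low-quality audio, and it supports private (on-premises) deployment, which makes it one of the top choices for enterprise-grade Chinese STT.</p>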
<p style="text-align:center"><img alt="" src="https://img.jbzj.com/file_images/article/202601/2026010511131718.png" /></p>
<p class="maodian"><a name="_lab2_1_0"></a></p><h3>1.1 Installation steps</h3>
<p>Installation commands:</p>
<div class="jb51code"><pre class="brush:ps;">conda create -n paddlespeech python=3.10
conda activate paddlespeech
python -m pip install paddlepaddle-gpu==3.2.2 -i https://www.paddlepaddle.org.cn/packages/stable/cu118/

# Install PaddleSpeech from source
git clone https://github.com/PaddlePaddle/PaddleSpeech.git
cd PaddleSpeech
pip install pytest-runner
pip install .
</pre></div>
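<p>Note: if you do not have an NVIDIA GPU, replace the <code>paddlepaddle-gpu</code> line with <code>pip install paddlepaddle</code>. For the full list of installation options, see the official site: <a href="https://www.paddlepaddle.org.cn/" rel="external nofollow">PaddlePaddle</a><br />I install PaddleSpeech from source here because installing it directly with pip ran into many bugs.</p>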
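<p>The GPU/CPU choice in the note above can be made explicit with a small helper. This is only an illustrative sketch: the function names are hypothetical, and treating the presence of <code>nvidia-smi</code> on <code>PATH</code> as a sign of a usable NVIDIA GPU is an assumption. The package names and the index URL are the ones used in the commands above.</p>
<div class="jb51code"><pre class="brush:py;">import shutil

# Index URL taken from the install command above (CUDA 11.8 wheels).
GPU_INDEX = "https://www.paddlepaddle.org.cn/packages/stable/cu118/"


def paddle_install_command(has_nvidia_gpu: bool) -> list[str]:
    """Return the pip command for the GPU or the CPU build of PaddlePaddle."""
    base = ["python", "-m", "pip", "install"]
    if has_nvidia_gpu:
        return base + ["paddlepaddle-gpu==3.2.2", "-i", GPU_INDEX]
    return base + ["paddlepaddle"]


def detect_nvidia_gpu() -> bool:
    """Heuristic (assumption): nvidia-smi on PATH means an NVIDIA driver exists."""
    return shutil.which("nvidia-smi") is not None


if __name__ == "__main__":
    # Print the command instead of running it, so it can be reviewed first.
    print(" ".join(paddle_install_command(detect_nvidia_gpu())))
</pre></div>
<p>Printing the command rather than executing it keeps the sketch side-effect free; the CUDA version in the index URL still has to match the local driver.</p>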
<p class="maodian"><a name="_lab2_1_1"></a></p><h3>1.2 Test code</h3>
<div class="jb51code"><pre class="brush:py;">import paddle
from paddlespeech.cli.asr import ASRExecutor

asr_executor = ASRExecutor()
text = asr_executor(audio_file="test.wav", model="conformer_aishell")
print(text)

</pre></div>
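<p>Chinese ASR models of this kind are typically trained on 16 kHz mono audio, so it can help to inspect <code>test.wav</code> before passing it to the executor. The sketch below uses only the standard library; the helper names are hypothetical, and the 16 kHz mono requirement is an assumption that should be checked against the documentation of the model you choose.</p>
<div class="jb51code"><pre class="brush:py;">import wave


def wav_format(path: str) -> tuple[int, int]:
    """Return (sample_rate_hz, channel_count) read from a WAV file header."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate(), wav.getnchannels()


def is_asr_ready(path: str, expected_rate: int = 16000) -> bool:
    """True if the file is mono and sampled at expected_rate (assumed 16 kHz)."""
    rate, channels = wav_format(path)
    return rate == expected_rate and channels == 1
</pre></div>
<p>If <code>is_asr_ready("test.wav")</code> returns <code>False</code>, resampling the file with an audio tool before recognition is a reasonable first step.</p>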
<p class="maodian"><a name="_lab2_1_2"></a></p><h3>1.3 Errors encountered</h3>
<p><code>ERROR: Could not find a version that satisfies the requirement opencc==1.1.6 (from paddlespeech) (from versions: 0.1, 0.2, 1.1.0.post1, 1.1.1, 1.1.7, 1.1.8, 1.1.9) ERROR: No matching distribution found for opencc==1.1.6</code><br />Fix: open PaddleSpeech/setup.py and change the pinned opencc version to one that is available:</p>
<p style="text-align:center"><img alt="" src="https://img.jbzj.com/file_images/article/202601/2026010511131764.png" /></p>
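<p>The same edit can be scripted. A minimal sketch, assuming setup.py pins the package literally as <code>opencc==1.1.6</code> (as the error message suggests) and that 1.1.7, one of the versions pip reported as available, is an acceptable replacement; the function names are hypothetical.</p>
<div class="jb51code"><pre class="brush:py;">import re

# Matches a pinned requirement such as opencc==1.1.6 (any version string).
OPENCC_PIN = re.compile(r"opencc==[0-9A-Za-z.]+")


def repin_opencc(setup_text: str, new_version: str = "1.1.7") -> str:
    """Replace the pinned opencc version inside the text of setup.py."""
    return OPENCC_PIN.sub("opencc==" + new_version, setup_text)


def patch_setup_file(path: str = "setup.py", new_version: str = "1.1.7") -> None:
    """Rewrite setup.py in place; run this from the PaddleSpeech checkout."""
    with open(path, encoding="utf-8") as handle:
        original = handle.read()
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(repin_opencc(original, new_version))
</pre></div>
<p>After patching, rerun <code>pip install .</code> in the PaddleSpeech directory.</p>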
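<p class="maodian"><a name="_label2"></a></p><h2>2 Whisper</h2>
<p>Whisper is an open-source multilingual speech recognition model from OpenAI. Trained on a very large multilingual audio corpus, it supports 99 languages, with Mandarin Chinese recognition accuracy of roughly 95%, and it is robust to noise, which makes it a top choice for personal projects and multilingual scenarios.</p>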
<p style="text-align:center"><img alt="" src="https://img.jbzj.com/file_images/article/202601/2026010511131741.png" /></p>
<p class="maodian"><a name="_lab2_2_3"></a></p><h3>2.1 Installation commands</h3>
<div class="jb51code"><pre class="brush:ps;">conda create -n whisper_env python=3.10
conda activate whisper_env
pip install -U openai-whisper
</pre></div>
<p class="maodian"><a name="_label3"></a></p><h2>2.2 Test code</h2>
<div class="jb51code"><pre class="brush:py;">import whisper

model = whisper.load_model("turbo")
result = model.transcribe("audio.mp3")
print(result["text"])
</pre></div>
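<p>Besides <code>"text"</code>, the dictionary returned by <code>transcribe()</code> also carries a <code>"segments"</code> list whose items include <code>start</code> and <code>end</code> times (in seconds) and the segment <code>text</code>. The sketch below turns that list into timestamped lines. The helper names are hypothetical, and because the code accepts any dict of that shape, it can be tried without loading a model.</p>
<div class="jb51code"><pre class="brush:py;">def format_timestamp(seconds: float) -> str:
    """Format a time in seconds as HH:MM:SS.mmm."""
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"


def segment_lines(result: dict) -> list[str]:
    """Turn the 'segments' list of a transcribe() result into timestamped lines."""
    lines = []
    for segment in result.get("segments", []):
        start = format_timestamp(segment["start"])
        end = format_timestamp(segment["end"])
        lines.append(f"[{start} - {end}] {segment['text'].strip()}")
    return lines
</pre></div>
<p>For example, <code>print("\n".join(segment_lines(result)))</code> after the call to <code>model.transcribe()</code> lists each recognized segment with its time range.</p>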
<p>As the figure below shows, <code>load_model</code> accepts the following model names:</p>
<p style="text-align:center"><img alt="" src="https://img.jbzj.com/file_images/article/202601/2026010511131721.png" /></p>
<p class="maodian"><a name="_label4"></a></p><h2>3 FunASR</h2>
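<p>FunASR is an open-source speech recognition framework from Alibaba Cloud's Tongyi Lab, focused on Chinese and its dialects. It leads on more than 30 Chinese dialects and on low-quality audio, which makes it the best choice for Chinese-only STT scenarios.</p>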
<p style="text-align:center"><img alt="" src="https://img.jbzj.com/file_images/article/202601/2026010511131778.png" /></p>
<p class="maodian"><a name="_lab2_4_4"></a></p><h3>3.1 Installation steps</h3>
<div class="jb51code"><pre class="brush:py;">conda create -n funasr python=3.10
conda activate funasr
pip3 install -U funasr
pip3 install -U modelscope huggingface_hub
pip install torch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1 --index-url https://download.pytorch.org/whl/cu118
</pre></div>
<p class="maodian"><a name="_lab2_4_5"></a></p><h3>3.2 Test code</h3>
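<div class="jb51code"><pre class="brush:py;">from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess

model_dir = "iic/SenseVoiceSmall"

# Load SenseVoiceSmall with an FSMN VAD front end that splits long audio.
model = AutoModel(
    model=model_dir,
    vad_model="fsmn-vad",
    vad_kwargs={"max_single_segment_time": 30000},  # max VAD segment, in ms
    device="cuda:0",  # use "cpu" if no GPU is available
)

res = model.generate(
    input="test.mp3",
    cache={},
    language="auto",  # or "zh", "en", "yue", "ja", "ko", "nospeech"
    use_itn=True,  # inverse text normalization (punctuation, numerals)
    batch_size_s=60,
    merge_vad=True,  # merge short segments produced by the VAD model
    merge_length_s=15,
)
# generate() returns a list with one result dict per input file.
text = rich_transcription_postprocess(res[0]["text"])
print(text)
</pre></div>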
<p class="maodian"><a name="_lab2_4_6"></a></p><h3>3.3 Errors encountered</h3>
<ul><li><p><code>UnboundLocalError: local variable 'AutoTokenizer' referenced before assignment</code><br /><strong>Cause:</strong> The transformers library is missing or too old, so the AutoTokenizer class cannot be found when the model loads.<br /><strong>Fix:</strong> Upgrade the transformers library.</p>
<div class="jb51code"><pre class="brush:ps;">pip install -U transformers
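</pre></div></li><li><p><code>AssertionError: FunASRNano is not registered</code><br /><strong>Cause:</strong> The installed FunASR version is too old and has not registered the FunASRNano model class; this commonly happens with Fun-ASR-Nano series models.<br /><strong>Fix:</strong> Import the model class manually.</p>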
<div class="jb51code"><pre class="brush:ps;">from funasr.models.fun_asr_nano.model import FunASRNano
</pre></div>
<p style="text-align:center"><img alt="" src="https://img.jbzj.com/file_images/article/202601/2026010511131782.png" /></p></li><li><p><code>Loading remote code failed: model, No module named 'model'</code><br /><code>...</code><br /><code>OSError: Error no file named pytorch_model.bin, model.safetensors, tf_model.h5, model.ckpt.index or flax_model.msgpack found in directory xxx</code><br /><strong>Cause:</strong> The FunAudioLLM/Fun-ASR-Nano-251 model is poorly supported: either the remote code fails to load or the weight files were not downloaded completely.<br /><strong>Fix:</strong> Switch to a model that FunASR officially and stably supports (such as iic/SenseVoiceSmall or paraformer-large), and avoid the poorly supported Nano series models.</p></li></ul>
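<p>The "switch to a supported model" advice can also be applied automatically by trying several model names in order and keeping the first one that loads. The following is a generic sketch, not FunASR's own mechanism: <code>load_first_available</code> is a hypothetical helper, and the loader is injected so the pattern can be exercised without any model installed.</p>
<div class="jb51code"><pre class="brush:py;">from typing import Any, Callable, Sequence


def load_first_available(candidates: Sequence[str], loader: Callable[[str], Any]):
    """Try each model name in order; return (name, model) for the first success."""
    errors = {}
    for name in candidates:
        try:
            return name, loader(name)
        except Exception as exc:  # broad on purpose: load failures vary by model
            errors[name] = repr(exc)
    raise RuntimeError(f"no candidate model could be loaded: {errors}")
</pre></div>
<p>With FunASR installed, a call might look like <code>load_first_available(["iic/SenseVoiceSmall", "paraformer-large"], lambda name: AutoModel(model=name, device="cuda:0"))</code>, where the candidate names are the ones suggested in the list above. Collecting the error of every failed candidate keeps the final message useful for debugging.</p>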
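<p class="maodian"><a name="_label5"></a></p><h2>Summary</h2>
<p>This article reviewed three mainstream STT libraries in the Python ecosystem. PaddleSpeech fits the PaddlePaddle ecosystem and suits general-purpose, enterprise-grade Chinese scenarios. Whisper focuses on multilingual recognition, is very easy to use, and suits personal development. FunASR has a clear advantage in Chinese dialect recognition and suits Chinese-only scenarios. In practice, prefer Whisper for personal or multilingual projects, FunASR for Chinese dialects or enterprise scenarios, and PaddleSpeech for projects already built on PaddlePaddle. During development, pay attention to version compatibility and dependency management; when problems arise, try upgrading the library or switching models first to keep recognition quality and stability.</p>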
