东方明珠子 發表於 2025-5-29 10:49:00

SmolVLM2轻量级视频多模态模型,应用效果测评(风景、事故、仿真、统计、文字、识物)

<p class="a1">SmolVLM2轻量级视频多模态模型,应用效果测评</p>
<p class="a1">目&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 录</p>
<p>1.&nbsp;&nbsp;&nbsp;&nbsp; 前言... 2</p>
<p>2.&nbsp;&nbsp;&nbsp;&nbsp; 应用部署... 2</p>
<p>3.&nbsp;&nbsp;&nbsp;&nbsp; 应用效果... 4</p>
<p>1.1&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 风景图像理解... 4</p>
<p>1.2&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 事故现场理解... 5</p>
<p>1.3&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 仿真图像理解... 6</p>
<p>1.4&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 数量统计描述... 7</p>
<p>1.5&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 图像文字理解... 8</p>
<p>1.6&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 物体识别理解... 10</p>
<p>4.&nbsp;&nbsp;&nbsp;&nbsp; 待解决问题... 11</p>
<p>5.&nbsp;&nbsp;&nbsp;&nbsp; 结论... 11</p>
<h1>1.&nbsp;&nbsp;&nbsp;&nbsp; 前言</h1>
<p><span style="font-size: 16px">  SmolVLM2 是由 Hugging Face 开发的一系列紧凑型但功能强大的大型模型,旨在为资源受限的设备(如智能手机和嵌入式系统)带来先进的语言和视觉语言处理能力。这些模型以其小型化设计著称,适合在设备上运行,填补了大型模型与小型设备性能差距的空白。本文将详细介绍这两个系列的背景、技术细节、性能和应用,旨在为研究者和开发者提供全面的理解。</span></p>
<p><span style="font-size: 16px">  SmolVLM2 扩展了 Smol 系列的能力,专注于视觉语言任务,可处理视频、图像和文本输入,生成文本输出。模型提供三种参数规模:2.2B、500M 和 256M,旨在实现高效的多模态处理。相较于前代产品,新版 22 亿模型在图像数学解题、图片文字识别、复杂图表解析和科学视觉问答方面表现显著提升。</span></p>
<h1>2.&nbsp;&nbsp;&nbsp;&nbsp; 应用部署</h1>
<p><span style="font-size: 16px">模型下载:HuggingFaceTB/SmolVLM2-2.2B-Instruct · Hugging Face。</span></p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 255, 1)">from</span> transformers <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> AutoProcessor, AutoModelForImageTextToText
</span><span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> torch

DEVICE </span>= <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">cuda</span><span style="color: rgba(128, 0, 0, 1)">"</span> <span style="color: rgba(0, 0, 255, 1)">if</span> torch.cuda.is_available() <span style="color: rgba(0, 0, 255, 1)">else</span> <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">cpu</span><span style="color: rgba(128, 0, 0, 1)">"</span>
<span style="color: rgba(0, 0, 255, 1)">print</span><span style="color: rgba(0, 0, 0, 1)">(DEVICE)
model_path </span>= <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">models/SmolVLM2-2.2B-Instruct</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">
processor </span>=<span style="color: rgba(0, 0, 0, 1)"> AutoProcessor.from_pretrained(model_path)
model </span>=<span style="color: rgba(0, 0, 0, 1)"> AutoModelForImageTextToText.from_pretrained(
    model_path,
    torch_dtype</span>=<span style="color: rgba(0, 0, 0, 1)">torch.bfloat16,
    _attn_implementation</span>=<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">flash_attention_2</span><span style="color: rgba(128, 0, 0, 1)">"</span> <span style="color: rgba(0, 0, 255, 1)">if</span> DEVICE == <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">cuda</span><span style="color: rgba(128, 0, 0, 1)">"</span> <span style="color: rgba(0, 0, 255, 1)">else</span> <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">eager</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">,
   device_map</span>=<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">cuda</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">).to(DEVICE)
</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)">model = AutoModelForImageTextToText.from_pretrained(</span><span style="color: rgba(0, 128, 0, 1)">
#</span><span style="color: rgba(0, 128, 0, 1)">    model_path,</span><span style="color: rgba(0, 128, 0, 1)">
#</span><span style="color: rgba(0, 128, 0, 1)">    torch_dtype=torch.bfloat16,</span><span style="color: rgba(0, 128, 0, 1)">
#</span><span style="color: rgba(0, 128, 0, 1)">   _attn_implementation="flash_attention_2" if DEVICE == "cuda" else "eager").to(DEVICE)</span>
<span style="color: rgba(0, 0, 0, 1)">
messages </span>=<span style="color: rgba(0, 0, 0, 1)"> [
    {
      </span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">role</span><span style="color: rgba(128, 0, 0, 1)">"</span>: <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">user</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">,
      </span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">content</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">: [
            {</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">type</span><span style="color: rgba(128, 0, 0, 1)">"</span>: <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">image</span><span style="color: rgba(128, 0, 0, 1)">"</span>, <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">url</span><span style="color: rgba(128, 0, 0, 1)">"</span>: <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">5.png</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">},
            {</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">type</span><span style="color: rgba(128, 0, 0, 1)">"</span>: <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">text</span><span style="color: rgba(128, 0, 0, 1)">"</span>, <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">text</span><span style="color: rgba(128, 0, 0, 1)">"</span>: <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">结合图像上的内容,请问简单iNeuOS是什么?</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">},
      ]
    },
]

inputs </span>=<span style="color: rgba(0, 0, 0, 1)"> processor.apply_chat_template(
    messages,
    add_generation_prompt</span>=<span style="color: rgba(0, 0, 0, 1)">True,
    tokenize</span>=<span style="color: rgba(0, 0, 0, 1)">True,
    return_dict</span>=<span style="color: rgba(0, 0, 0, 1)">True,
    return_tensors</span>=<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">pt</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">,
).to(model.device, dtype</span>=<span style="color: rgba(0, 0, 0, 1)">torch.bfloat16)

generated_ids </span>= model.generate(**inputs, do_sample=False, max_new_tokens=5000<span style="color: rgba(0, 0, 0, 1)">)
generated_texts </span>=<span style="color: rgba(0, 0, 0, 1)"> processor.batch_decode(
    generated_ids,
    skip_special_tokens</span>=<span style="color: rgba(0, 0, 0, 1)">True,
)
</span><span style="color: rgba(0, 0, 255, 1)">print</span>(generated_texts)</pre>
</div>
<p><strong><span style="font-size: 2em">3.&nbsp;&nbsp;&nbsp;&nbsp; 应用效果</span></strong></p>
<h2>1.1&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 风景图像理解</h2>
<p><span style="font-size: 16px">User:请描述这张图像?</span></p>
<p><span style="font-size: 16px">Assistant: 这张图片是一张美丽的花朵照片,其中主要的是一棵树的枝条和花朵。树枝的颜色是深红色,而花朵的颜色是白色和红色。树枝上有一层雪,雪的颜色是白色,这使得树枝和花朵的颜色显得更加美丽。花朵的颜色是红色和白色,这使得花朵的颜色显得更加独特。这张图片的背景是白色的,这使得树枝和花朵的颜色更加显得突出。这张图片的风格是美丽和美感的,它让人感受到花朵的美丽和树枝的美丽。</span></p>
<p><img src="https://img2024.cnblogs.com/blog/279374/202505/279374-20250529104210357-529075746.png" alt="" style="display: block; margin-left: auto; margin-right: auto"></p>
<h2>1.2&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 事故现场理解</h2>
<p><span style="font-size: 16px">User:请描述这个事故现场?</span></p>
<p><span style="font-size: 16px">Assistant: 这是一次火灾事故的事实场景,其中包含一个大型的炉子和一个高的柱子。炉子的外壁是灰色的,而柱子的外壁是黑色的。(<span style="color: rgba(255, 0, 0, 1)"><strong>注:后面有冗余的回复</strong></span>)</span></p>
<p><span style="font-size: 16px"><img src="https://img2024.cnblogs.com/blog/279374/202505/279374-20250529104251145-1412152131.png" alt="" style="display: block; margin-left: auto; margin-right: auto"></span></p>
<h2>1.3&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 仿真图像理解</h2>
<p><span style="font-size: 16px">User:这是温度场仿真,请描述温度变化趋势?</span></p>
<p><span style="font-size: 16px">Assistant: 温度变化趋势可以从图像中看出,温度在右侧高于左侧,这表明右侧的温度更高,左侧的温度更低。(<span style="color: rgba(255, 0, 0, 1)"><strong>注:后面有冗余和重复数据</strong></span>)</span></p>
<p><img src="https://img2024.cnblogs.com/blog/279374/202505/279374-20250529104328857-413878281.png" alt=""></p>
<h2>1.4&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 数量统计描述</h2>
<p><span style="font-size: 16px">User:这是虾苗养殖图像,请分析虾苗数量?</span></p>
<p><span style="font-size: 16px">Assistant: 虾苗数量为1000多个。</span></p>
<p><img src="https://img2024.cnblogs.com/blog/279374/202505/279374-20250529104350134-1224316846.png" alt="" style="display: block; margin-left: auto; margin-right: auto"></p>
<h2>1.5&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 图像文字理解</h2>
<p><span style="font-size: 16px">User:结合图像上的内容,请问简单iNeuOS是什么?</span></p>
<p><span style="font-size: 16px">Assistant: iNeuOS是一种操作系统,它是一种高级的操作系统,它可以被用来管理和管理计算机的系统,包括内存、硬盘和网络等。(<strong><span style="color: rgba(255, 0, 0, 1)">注:后面是冗余和重复的数据</span></strong>)</span></p>
<p><span style="font-size: 16px"><img src="https://img2024.cnblogs.com/blog/279374/202505/279374-20250529104439046-210831318.png" alt="" style="display: block; margin-left: auto; margin-right: auto"></span></p>
<p><span style="font-size: 16px">User:图像上的文字是什么?</span></p>
<p><span style="font-size: 16px">Assistant: 这是一个英文和中文的图像,其中包含一些文字和图片。图像的左侧有一些英文文字,包括"iNeuOS"、"OS"和"NeuOS"。(<span style="color: rgba(255, 0, 0, 1)"><strong>注:毕竟不是</strong><strong>OCR</strong><strong>,所以文字识别和提取有限</strong></span>)</span></p>
<p><span style="font-size: 16px"><img src="https://img2024.cnblogs.com/blog/279374/202505/279374-20250529104514321-1984692856.png" alt="" style="display: block; margin-left: auto; margin-right: auto"></span></p>
<h2>1.6&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 物体识别理解</h2>
<p><span style="font-size: 16px">User:图像上的铅笔有多长?</span></p>
<p><span style="font-size: 16px">Assistant: 铅笔的长度是约10厘米。(<span style="color: rgba(255, 0, 0, 1)"><strong>注:实际长度为</strong><strong>14.7</strong><strong>厘米</strong></span>)</span></p>
<p><img src="https://img2024.cnblogs.com/blog/279374/202505/279374-20250529104637701-2094651519.png" alt="" style="display: block; margin-left: auto; margin-right: auto"></p>
<p>&nbsp;</p>
<h1>4.&nbsp;&nbsp;&nbsp;&nbsp; 待解决问题</h1>
<p><span style="font-size: 16px">问题:模型在回复的时候有冗余和重复的内容。</span></p>
<p><span style="font-size: 16px">可能的原因:(1)提示词需求进行优化;(2)程序参数设置的问题。暂时还没有进一步测试。</span></p>
<h1>5.&nbsp;&nbsp;&nbsp;&nbsp; 结论</h1>
<p><span style="font-size: 16px">  测试比我预想的要好很多,但是针对特定应用场景,特别是工业领域,需要进一步调优。</span></p>
<hr>
<p>&nbsp;</p>
<p>物联网&amp;大数据技术 QQ群:54256083</p>
<p>物联网&amp;大数据项目 QQ群:727664080</p>
<p>QQ:504547114</p>
<p>微信:wxzz0151</p>
<p>博客:https://www.cnblogs.com/lsjwq</p>
<p><img src="https://img2024.cnblogs.com/blog/279374/202505/279374-20250527150358803-1507127284.png" alt="" width="278" height="138" class="medium-zoom-image"></p><br><br>
来源:https://www.cnblogs.com/lsjwq/p/18902156
頁: [1]
查看完整版本: SmolVLM2轻量级视频多模态模型,应用效果测评(风景、事故、仿真、统计、文字、识物)