林玉婷 發表於 2025-7-27 23:37:00

CocoIndex实现AI数据语义检索

<h1>1.概述</h1>
<p>在AI场景中,数据的高效处理与实时更新是推动技术突破的关键基石,而高性能的数据转换框架则是连接数据与 AI 应用的重要桥梁。CocoIndex 作为一款适用于人工智能的超高性能实时数据转换框架,凭借其独特的增量处理功能,在数据处理领域展现出显著优势。它不仅能实现数据的实时转换,更在数据新鲜度上实现了质的飞跃,为 AI 应用提供了更精准、更及时的数据支撑。那么,CocoIndex 究竟是如何通过增量处理实现这些突破,又能为 AI 领域带来哪些变革?笔者将为大家一一介绍。​</p>
<h1>2.内容</h1>
<p>专为人工智能领域量身打造的超高性能数据转换框架 ——CocoIndex,其核心引擎采用 Rust 语言编写,从底层架构保障了卓越的运行效率与稳定性。框架自带增量处理能力与数据血缘追踪功能,开箱即可投入使用,无需额外繁琐配置。更值得一提的是,它能为开发者带来卓越的开发效率,从项目启动的第 0 天起,便具备全面的生产环境就绪能力,大幅缩短从开发到落地的周期,为 AI 应用的数据处理环节提供坚实支撑。</p>
<p><img src="https://img2024.cnblogs.com/blog/666745/202507/666745-20250727153723974-1335536064.svg" alt="transformation" loading="lazy"></p>
<p>CocoIndex 让 AI 驱动的数据转换过程变得异常简单,同时能轻松实现源数据与目标数据的实时同步,为 AI 应用的数据流转提供高效、可靠的保障。​</p>
<p><img src="https://img2024.cnblogs.com/blog/666745/202507/666745-20250727154106534-137411519.png" alt="venn-features" loading="lazy"></p>
<p>无论是生成嵌入向量、构建知识图谱,还是其他任何超越传统 SQL 的数据转换任务,它都能高效胜任。​</p>
<p>仅需约 100 行 Python 代码,便能在数据流中轻松声明转换逻辑,极大降低了开发门槛,让数据转换流程的搭建高效又简单。​</p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> import</span>
data[<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">content</span><span style="color: rgba(128, 0, 0, 1)">'</span>] =<span style="color: rgba(0, 0, 0, 1)"> flow_builder.add_source(...)

</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> transform</span>
data[<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">out</span><span style="color: rgba(128, 0, 0, 1)">'</span>] = data[<span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(128, 0, 0, 1)">content</span><span style="color: rgba(128, 0, 0, 1)">'</span><span style="color: rgba(0, 0, 0, 1)">]
    .transform(...)
    .transform(...)

</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> collect data</span>
<span style="color: rgba(0, 0, 0, 1)">collector.collect(...)

</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> export to db, vector db, graph db ...</span>
collector.export(...)</pre>
</div>
<p>CocoIndex 秉持数据流编程模型理念,其设计逻辑清晰且透明:每个转换操作仅依据输入字段生成新字段,全程无隐藏状态,也不存在值的突变情况。这使得转换前后的所有数据都清晰可观察,且自带数据血缘追踪功能,让数据的来龙去脉一目了然。​<br>尤为特别的是,开发者无需通过创建、更新、删除等操作来显式改变数据,只需为源数据集定义好转换规则或公式,便能实现数据的顺畅转换,大幅简化了开发流程。​</p>
<p>CocoIndex 为不同数据源、数据目标及各类转换需求提供原生内置支持,无需额外适配即可快速接入。其采用标准化接口设计,让不同组件间的切换仅需一行代码即可完成,极大降低了系统扩展与迭代的复杂度。​</p>
<p><img src="https://img2024.cnblogs.com/blog/666745/202507/666745-20250727154434135-1570808153.svg" alt="components" loading="lazy"></p>
<p>CocoIndex 能够毫不费力地实现源数据与目标数据的精准同步,无需繁琐操作即可确保数据的一致性与时效性,为数据流转提供稳定可靠的保障。​</p>
<p><img src="https://img2024.cnblogs.com/blog/666745/202507/666745-20250727154526367-468331192.gif" alt="data-sync" loading="lazy"></p>
<p>它提供开箱即用的增量索引支持:​</p>
<ul>
<li>当源数据或逻辑发生变更时,仅执行最小化的重计算;​</li>
<li>仅对必要部分进行(重新)处理,同时尽可能复用缓存,大幅提升处理效率。​</li>

</ul>
<h2>2.1&nbsp;Python 与 Pip 环境准备</h2>
<p>若想顺利完成本指南中的操作流程,需提前配置好以下环境:​</p>
<ul>
<li>安装 Python(支持 3.11 至 3.13 版本):建议通过 Python 官网下载对应版本,确保安装过程中勾选 “Add Python to PATH” 选项,方便后续命令行调用。​</li>
<li>安装 pip(Python 包安装工具):通常 Python 3.4 及以上版本会默认捆绑 pip,若未安装,可通过 Python 官网提供的 get-pip.py 脚本进行安装,保障后续包管理操作顺畅。​</li>

</ul>
<p><strong>🌴 快速安装 CocoIndex​</strong><br>完成上述环境准备后,只需在命令行中执行以下命令,即可一键安装并自动升级至最新版本的 CocoIndex:</p>
<div class="cnblogs_code">
<pre>pip install -U cocoindex</pre>
</div>
<p><strong>📦&nbsp;安装 PostgreSQL​</strong><br>倘若您已拥有安装了 pgvector 扩展的 PostgreSQL 数据库,那么本步骤可以直接跳过。​要是您还未安装 PostgreSQL 数据库,请按以下步骤操作:​</p>
<p>首先,安装 Docker Compose🐳。Docker Compose 是用于定义和运行多容器 Docker 应用程序的工具,安装后能更便捷地管理容器化应用。您可以前往 Docker 官网,根据自身操作系统(如 Windows、macOS、Linux 等)的版本,下载并按照官方指引完成安装。​<br>接着,使用我们提供的 docker compose 配置来启动一个供 cocoindex 使用的 PostgreSQL 数据库,只需执行以下命令:</p>
<div class="cnblogs_code">
<pre>docker compose -f &lt;(curl -L https://raw.githubusercontent.com/cocoindex-io/cocoindex/refs/heads/main/dev/postgres.yaml) up -d</pre>
</div>
<p><img src="https://img2024.cnblogs.com/blog/666745/202507/666745-20250727213822593-1809321087.png" alt="image" width="870" height="606" loading="lazy"></p>
<p>&nbsp;安装完成后,打开安装地址 https://raw.githubusercontent.com/cocoindex-io/cocoindex/refs/heads/main/dev/postgres.yaml 可以查看到对应的用户名和密码:</p>
<p><img src="https://img2024.cnblogs.com/blog/666745/202507/666745-20250727213937058-851893741.png" alt="image" width="405" height="182" loading="lazy"></p>
<h2>&nbsp;2.2 准备目录</h2>
<p><strong>1.创建项目目录</strong></p>
<p>打开终端,为你的项目创建一个新目录并进入该目录:</p>
<div class="cnblogs_code">
<pre>mkdir cocoindex-<span style="color: rgba(0, 0, 0, 1)">quickstart
cd cocoindex</span>-quickstart</pre>
</div>
<p>这一步的目的是为项目搭建一个独立的工作空间,通过专门的目录来管理后续的文件和操作,避免文件混乱。mkdir 命令用于创建名为 cocoindex-quickstart 的目录,cd 命令则切换到这个新目录,方便后续在该目录下进行所有与项目相关的操作。</p>
<p><strong>2.准备索引所需的输入文件</strong></p>
<p>需要将用于生成索引的输入文件整理到一个目录中(例如命名为 markdown_files)。</p>
<p>如果你已有现成的文件,只需将它们统一放入这个目录即可,确保文件格式符合索引工具的要求(这里示例为 markdown 格式)。<br>若暂时没有可用文件,可下载示例压缩包 markdown_files.zip,并将其解压到当前的 cocoindex-quickstart 目录下。解压后会自动生成 markdown_files 文件夹,里面包含演示用的示例文件,便于快速上手体验索引功能。</p>
<p>通过提前整理输入文件并统一存放,能让后续的索引创建过程更高效,也便于管理和维护文件资源。</p>
<h2>2.3&nbsp;定义索引流程</h2>
<p>接下来,我们需要创建索引流程来处理准备好的文件。首先,创建一个名为 quickstart.py 的 Python 文件,并导入 cocoindex 库:</p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> quickstart.py</span>
<span style="color: rgba(0, 0, 255, 1)">import</span> cocoindex</pre>
</div>
<p>然后,我们可以按以下方式构建索引流程,实现从文件加载到索引创建的完整过程:</p>
<div class="cnblogs_code">
<pre>@cocoindex.flow_def(name=<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">TextEmbedding</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">)
</span><span style="color: rgba(0, 0, 255, 1)">def</span><span style="color: rgba(0, 0, 0, 1)"> text_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    </span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 添加一个数据源,用于从目录中读取文件</span>
    data_scope[<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">documents</span><span style="color: rgba(128, 0, 0, 1)">"</span>] =<span style="color: rgba(0, 0, 0, 1)"> flow_builder.add_source(
      cocoindex.sources.LocalFile(path</span>=<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">markdown_files</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">))

    </span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 添加一个收集器,用于收集将要导出到向量索引的数据</span>
    doc_embeddings =<span style="color: rgba(0, 0, 0, 1)"> data_scope.add_collector()

    </span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 转换每个文档的数据</span>
    with data_scope[<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">documents</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">].row() as doc:
      </span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 将文档分割为文本块,存储到`chunks`字段中</span>
      doc[<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">chunks</span><span style="color: rgba(128, 0, 0, 1)">"</span>] = doc[<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">content</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">].transform(
            cocoindex.functions.SplitRecursively(),
            language</span>=<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">markdown</span><span style="color: rgba(128, 0, 0, 1)">"</span>, chunk_size=2000, chunk_overlap=500<span style="color: rgba(0, 0, 0, 1)">)

      </span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 转换每个文本块的数据</span>
      with doc[<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">chunks</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">].row() as chunk:
            </span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 为文本块生成嵌入向量,存储到`embedding`字段中</span>
            chunk[<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">embedding</span><span style="color: rgba(128, 0, 0, 1)">"</span>] = chunk[<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">text</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">].transform(
                cocoindex.functions.SentenceTransformerEmbed(
                  model</span>=<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">sentence-transformers/all-MiniLM-L6-v2</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">))

            </span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 将文本块数据收集到收集器中</span>
            doc_embeddings.collect(filename=doc[<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">filename</span><span style="color: rgba(128, 0, 0, 1)">"</span>], location=chunk[<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">location</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">],
                                 text</span>=chunk[<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">text</span><span style="color: rgba(128, 0, 0, 1)">"</span>], embedding=chunk[<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">embedding</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">])

    </span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 将收集的数据导出到向量索引</span>
<span style="color: rgba(0, 0, 0, 1)">    doc_embeddings.export(
      </span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">doc_embeddings</span><span style="color: rgba(128, 0, 0, 1)">"</span>,<span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 表名</span>
      cocoindex.targets.Postgres(),<span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 目标数据库</span>
      primary_key_fields=[<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">filename</span><span style="color: rgba(128, 0, 0, 1)">"</span>, <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">location</span><span style="color: rgba(128, 0, 0, 1)">"</span>],<span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 主键字段</span>
      vector_indexes=[<span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 向量索引定义</span>
<span style="color: rgba(0, 0, 0, 1)">            cocoindex.VectorIndexDef(
                field_name</span>=<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">embedding</span><span style="color: rgba(128, 0, 0, 1)">"</span>,<span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 向量字段名</span>
                metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)<span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 相似度计算方式</span>
      ])</pre>
</div>
<p>这段代码通过 cocoindex 库定义了一个完整的文本向量嵌入流水线,核心目的是将原始 Markdown 文档转换为可用于语义检索的向量数据,并存储到数据库中。整体流程可分为 5 个关键步骤:</p>
<p><strong>1. 流程定义</strong></p>
<div class="cnblogs_code">
<pre>@cocoindex.flow_def(name=<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">TextEmbedding</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">)
</span><span style="color: rgba(0, 0, 255, 1)">def</span> text_embedding_flow(flow_builder, data_scope):</pre>
</div>
<ul>
<li>使用 @flow_def 装饰器声明一个名为 TextEmbedding 的流程,用于统一管理数据处理的各个环节。</li>
<li>flow_builder 负责创建流程组件(如数据源、转换器),data_scope 负责管理数据的流转和存储(类似一个 "数据容器")。</li>
</ul>
<p><strong>2. 读取原始文件</strong></p>
<div class="cnblogs_code">
<pre>data_scope[<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">documents</span><span style="color: rgba(128, 0, 0, 1)">"</span>] =<span style="color: rgba(0, 0, 0, 1)"> flow_builder.add_source(
    cocoindex.sources.LocalFile(path</span>=<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">markdown_files</span><span style="color: rgba(128, 0, 0, 1)">"</span>))</pre>
</div>
<ul>
<li>通过 LocalFile 数据源读取 markdown_files 目录下的所有文件(如 .md 文档)。</li>
<li>读取结果被存储在 data_scope["documents"] 中,后续可通过这个键访问所有文档数据(包括文件名、内容等)。</li>
</ul>
<p><strong>3. 创建数据收集器</strong></p>
<div class="cnblogs_code">
<pre>doc_embeddings = data_scope.add_collector()</pre>
</div>
<ul>
<li>初始化一个收集器 doc_embeddings,用于临时存储处理后的文本块数据(如文本内容、向量嵌入等),为后续导出到数据库做准备。</li>
</ul>
<p><strong>4. 文本处理流水线(核心步骤)</strong></p>
<p>这部分是流程的核心,通过嵌套的 with 语句实现对 "文档→文本块" 的层级处理:</p>
<p>文档分割为文本块:</p>
<div class="cnblogs_code">
<pre>with data_scope[<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">documents</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">].row() as doc:
    doc[</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">chunks</span><span style="color: rgba(128, 0, 0, 1)">"</span>] = doc[<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">content</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">].transform(
      cocoindex.functions.SplitRecursively(),
      language</span>=<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">markdown</span><span style="color: rgba(128, 0, 0, 1)">"</span>, chunk_size=2000, chunk_overlap=500)</pre>
</div>
<ul>
<li>with data_scope["documents"].row() as doc:遍历 documents 中的每个文档(doc 代表单个文档)。</li>
<li>使用 SplitRecursively 函数对文档内容(doc["content"])进行分割:
<ul>
<li>针对 Markdown 格式优化(如避免在标题、列表中间分割)。</li>
<li>每个文本块最大长度为 2000 字符,相邻块重叠 500 字符(防止语义被割裂)。</li>
</ul>
</li>
<li>分割后的文本块存储在 doc["chunks"] 中,供下一步处理。</li>
</ul>
<p>生成文本块的向量嵌入:</p>
<div class="cnblogs_code">
<pre>with doc[<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">chunks</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">].row() as chunk:
    chunk[</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">embedding</span><span style="color: rgba(128, 0, 0, 1)">"</span>] = chunk[<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">text</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">].transform(
      cocoindex.functions.SentenceTransformerEmbed(
            model</span>=<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">sentence-transformers/all-MiniLM-L6-v2</span><span style="color: rgba(128, 0, 0, 1)">"</span>))</pre>
</div>
<ul>
<li>with doc["chunks"].row() as chunk:遍历当前文档的每个文本块(chunk 代表单个文本块)。</li>
<li>使用 SentenceTransformerEmbed 函数将文本块内容(chunk["text"])转换为向量:
<ul>
<li>采用预训练模型 all-MiniLM-L6-v2(轻量级模型,适合快速生成高质量嵌入)。</li>
<li>向量嵌入结果存储在 chunk["embedding"] 中,用于后续的语义相似度计算。</li>
</ul>
</li>
</ul>
<p>收集处理结果:</p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 0, 1)">doc_embeddings.collect(
    filename</span>=doc[<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">filename</span><span style="color: rgba(128, 0, 0, 1)">"</span>],<span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 源文档名称</span>
    location=chunk[<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">location</span><span style="color: rgba(128, 0, 0, 1)">"</span>],<span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 文本块在原文档中的位置(如页码、段落索引)</span>
    text=chunk[<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">text</span><span style="color: rgba(128, 0, 0, 1)">"</span>],<span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 文本块内容</span>
    embedding=chunk[<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">embedding</span><span style="color: rgba(128, 0, 0, 1)">"</span>]<span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 向量嵌入</span>
)</pre>
</div>
<ul>
<li>将每个文本块的关键信息(来源、位置、内容、向量)收集到 doc_embeddings 中,形成结构化数据。</li>
</ul>
<p><strong>5. 导出到向量数据库</strong></p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 0, 1)">doc_embeddings.export(
    </span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">doc_embeddings</span><span style="color: rgba(128, 0, 0, 1)">"</span>,<span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 数据库表名</span>
    cocoindex.targets.Postgres(),<span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 目标存储:PostgreSQL数据库</span>
    primary_key_fields=[<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">filename</span><span style="color: rgba(128, 0, 0, 1)">"</span>, <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">location</span><span style="color: rgba(128, 0, 0, 1)">"</span>],<span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 联合主键(确保每条数据唯一)</span>
    vector_indexes=[<span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 为向量字段创建索引</span>
<span style="color: rgba(0, 0, 0, 1)">      cocoindex.VectorIndexDef(
            field_name</span>=<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">embedding</span><span style="color: rgba(128, 0, 0, 1)">"</span>,<span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 向量字段名</span>
            metric=COSINE_SIMILARITY<span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 用余弦相似度计算向量距离(适合文本语义匹配)</span>
<span style="color: rgba(0, 0, 0, 1)">      )
    ])</span></pre>
</div>
<ul>
<li>将收集到的结构化数据导出到 PostgreSQL 数据库,生成一张名为 doc_embeddings 的表。</li>
<li>通过配置向量索引,使得后续可以高效地基于语义相似度查询(如 "查找与查询文本意思最接近的文档片段")。</li>
</ul>
<p>这段代码实现了从 "原始文档" 到 "可检索向量数据" 的全自动化流程,核心价值在于:</p>
<p>标准化处理:通过流水线确保文本分割、向量生成的一致性。<br>语义化存储:将文本转换为向量后,支持基于语义而非关键词的检索。<br>工程化集成:直接对接数据库并创建索引,为后续应用(如知识库、智能检索系统)提供数据基础。</p>
<p>整个流程无需人工干预,适合批量处理文档并构建语义检索系统。</p>
<p><img src="https://img2024.cnblogs.com/blog/666745/202507/666745-20250727215208045-1346237793.png" alt="flow" width="363" height="536" loading="lazy"></p>
<h1>3.构建 Text Embedding 和语义检索</h1>
<p>从本地 Markdown 文件构建 Text Embedding&nbsp;索引并进行查询,代码如下:</p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 255, 1)">from</span> dotenv <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> load_dotenv
</span><span style="color: rgba(0, 0, 255, 1)">from</span> psycopg_pool <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> ConnectionPool
</span><span style="color: rgba(0, 0, 255, 1)">from</span> pgvector.psycopg <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> register_vector
</span><span style="color: rgba(0, 0, 255, 1)">from</span> typing <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> Any
</span><span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> cocoindex
</span><span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> os
</span><span style="color: rgba(0, 0, 255, 1)">from</span> numpy.typing <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> NDArray
</span><span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> numpy as np
</span><span style="color: rgba(0, 0, 255, 1)">from</span> datetime <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> timedelta


@cocoindex.transform_flow()
</span><span style="color: rgba(0, 0, 255, 1)">def</span><span style="color: rgba(0, 0, 0, 1)"> text_to_embedding(
    text: cocoindex.DataSlice,
) </span>-&gt;<span style="color: rgba(0, 0, 0, 1)"> cocoindex.DataSlice]:
    </span><span style="color: rgba(128, 0, 0, 1)">"""</span><span style="color: rgba(128, 0, 0, 1)">
    使用SentenceTransformer模型对文本进行嵌入处理。
    这是索引构建和查询过程中共享的逻辑,因此将其提取为一个函数。
    </span><span style="color: rgba(128, 0, 0, 1)">"""</span>
    <span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 你也可以切换到远程嵌入模型:</span>
    <span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)">   return text.transform(</span>
    <span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)">       cocoindex.functions.EmbedText(</span>
    <span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)">         api_type=cocoindex.LlmApiType.OPENAI,</span>
    <span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)">         model="text-embedding-3-small",</span>
    <span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)">       )</span>
    <span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)">   )</span>
    <span style="color: rgba(0, 0, 255, 1)">return</span><span style="color: rgba(0, 0, 0, 1)"> text.transform(
      cocoindex.functions.SentenceTransformerEmbed(
            model</span>=<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">sentence-transformers/all-MiniLM-L6-v2</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">
      )
    )


@cocoindex.flow_def(name</span>=<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">TextEmbedding</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">)
</span><span style="color: rgba(0, 0, 255, 1)">def</span><span style="color: rgba(0, 0, 0, 1)"> text_embedding_flow(
    flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope
) </span>-&gt;<span style="color: rgba(0, 0, 0, 1)"> None:
    </span><span style="color: rgba(128, 0, 0, 1)">"""</span><span style="color: rgba(128, 0, 0, 1)">
    定义一个示例流程,用于将文本嵌入到向量数据库中。
    </span><span style="color: rgba(128, 0, 0, 1)">"""</span><span style="color: rgba(0, 0, 0, 1)">
    data_scope[</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">documents</span><span style="color: rgba(128, 0, 0, 1)">"</span>] =<span style="color: rgba(0, 0, 0, 1)"> flow_builder.add_source(
      cocoindex.sources.LocalFile(path</span>=<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">markdown_files</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">),
      refresh_interval</span>=timedelta(seconds=5),<span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 每5秒刷新一次文件</span>
<span style="color: rgba(0, 0, 0, 1)">    )

    doc_embeddings </span>=<span style="color: rgba(0, 0, 0, 1)"> data_scope.add_collector()

    with data_scope[</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">documents</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">].row() as doc:
      doc[</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">chunks</span><span style="color: rgba(128, 0, 0, 1)">"</span>] = doc[<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">content</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">].transform(
            cocoindex.functions.SplitRecursively(),
            language</span>=<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">markdown</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">,
            chunk_size</span>=2000<span style="color: rgba(0, 0, 0, 1)">,
            chunk_overlap</span>=500<span style="color: rgba(0, 0, 0, 1)">,
      )

      with doc[</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">chunks</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">].row() as chunk:
            chunk[</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">embedding</span><span style="color: rgba(128, 0, 0, 1)">"</span>] = text_to_embedding(chunk[<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">text</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">])
            doc_embeddings.collect(
                filename</span>=doc[<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">filename</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">],
                location</span>=chunk[<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">location</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">],
                text</span>=chunk[<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">text</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">],
                embedding</span>=chunk[<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">embedding</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">],
            )

    doc_embeddings.export(
      </span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">doc_embeddings</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">,
      cocoindex.targets.Postgres(),
      primary_key_fields</span>=[<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">filename</span><span style="color: rgba(128, 0, 0, 1)">"</span>, <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">location</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">],
      vector_indexes</span>=<span style="color: rgba(0, 0, 0, 1)">[
            cocoindex.VectorIndexDef(
                field_name</span>=<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">embedding</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">,
                metric</span>=<span style="color: rgba(0, 0, 0, 1)">cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY,
            )
      ],
    )


</span><span style="color: rgba(0, 0, 255, 1)">def</span> search(pool: ConnectionPool, query: str, top_k: int = 5) -&gt;<span style="color: rgba(0, 0, 0, 1)"> list]:
    </span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 获取表名,对应上面text_embedding_flow中导出的目标表</span>
    table_name =<span style="color: rgba(0, 0, 0, 1)"> cocoindex.utils.get_target_default_name(
      text_embedding_flow, </span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">doc_embeddings</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">
    )
    </span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 使用上面定义的转换流程处理输入查询,获取嵌入向量</span>
    query_vector =<span style="color: rgba(0, 0, 0, 1)"> text_to_embedding.eval(query)
    </span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 执行查询并获取结果</span>
<span style="color: rgba(0, 0, 0, 1)">    with pool.connection() as conn:
      register_vector(conn)</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 注册向量类型支持</span>
<span style="color: rgba(0, 0, 0, 1)">      with conn.cursor() as cur:
            cur.execute(
                f</span><span style="color: rgba(128, 0, 0, 1)">"""</span><span style="color: rgba(128, 0, 0, 1)">
                SELECT filename, text, embedding &lt;=&gt; %s AS distance
                FROM {table_name} ORDER BY distance LIMIT %s
            </span><span style="color: rgba(128, 0, 0, 1)">"""</span><span style="color: rgba(0, 0, 0, 1)">,
                (query_vector, top_k),
            )
            </span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 将距离转换为相似度分数(1 - 距离)</span>
            <span style="color: rgba(0, 0, 255, 1)">return</span><span style="color: rgba(0, 0, 0, 1)"> [
                {</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">filename</span><span style="color: rgba(128, 0, 0, 1)">"</span>: row, <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">text</span><span style="color: rgba(128, 0, 0, 1)">"</span>: row, <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">score</span><span style="color: rgba(128, 0, 0, 1)">"</span>: 1.0 - row}
                </span><span style="color: rgba(0, 0, 255, 1)">for</span> row <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> cur.fetchall()
            ]


</span><span style="color: rgba(0, 0, 255, 1)">def</span> _main() -&gt;<span style="color: rgba(0, 0, 0, 1)"> None:
    </span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 初始化数据库连接池</span>
    pool = ConnectionPool(os.getenv(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">COCOINDEX_DATABASE_URL</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">))
    </span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 循环运行查询以演示查询功能</span>
    <span style="color: rgba(0, 0, 255, 1)">while</span><span style="color: rgba(0, 0, 0, 1)"> True:
      query </span>= input(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">输入搜索查询(按回车退出):</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">)
      </span><span style="color: rgba(0, 0, 255, 1)">if</span> query == <span style="color: rgba(128, 0, 0, 1)">""</span><span style="color: rgba(0, 0, 0, 1)">:
            </span><span style="color: rgba(0, 0, 255, 1)">break</span>
      <span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 使用数据库连接池和查询语句运行查询函数</span>
      results =<span style="color: rgba(0, 0, 0, 1)"> search(pool, query)
      </span><span style="color: rgba(0, 0, 255, 1)">print</span>(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">\n搜索结果:</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">)
      </span><span style="color: rgba(0, 0, 255, 1)">for</span> result <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> results:
            </span><span style="color: rgba(0, 0, 255, 1)">print</span>(f<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">[{result['score']:.3f}] {result['filename']}</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">)
            </span><span style="color: rgba(0, 0, 255, 1)">print</span>(f<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">    {result['text']}</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">)
            </span><span style="color: rgba(0, 0, 255, 1)">print</span>(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">---</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">)
      </span><span style="color: rgba(0, 0, 255, 1)">print</span><span style="color: rgba(0, 0, 0, 1)">()


</span><span style="color: rgba(0, 0, 255, 1)">if</span> <span style="color: rgba(128, 0, 128, 1)">__name__</span> == <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">__main__</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">:
    load_dotenv()</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 加载环境变量</span>
    cocoindex.init()<span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 初始化cocoindex</span>
    _main()</pre>
</div>
<p>这段代码实现了一个完整的文本向量索引与检索系统,包含从本地 Markdown 文件构建向量索引,到通过用户输入进行语义搜索的全流程。整体结构可分为四个核心模块:</p>
<p><strong>1. 文本嵌入转换函数(text_to_embedding)</strong></p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 0, 1)">@cocoindex.transform_flow()
</span><span style="color: rgba(0, 0, 255, 1)">def</span> text_to_embedding(text: DataSlice) -&gt; DataSlice]:</pre>
</div>
<ul>
<li>这是一个共享转换逻辑,同时服务于索引构建和查询过程,确保文本处理方式一致。</li>
<li>功能:使用sentence-transformers/all-MiniLM-L6-v2模型将文本转换为向量嵌入。</li>
<li>灵活性:代码中注释了切换到 OpenAI 远程嵌入模型(如text-embedding-3-small)的方式,可根据需求选择本地或云端模型。</li>
</ul>
<p><strong>2. 索引构建流程(text_embedding_flow)</strong></p>
<div class="cnblogs_code">
<pre>@cocoindex.flow_def(name=<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">TextEmbedding</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">)
</span><span style="color: rgba(0, 0, 255, 1)">def</span> text_embedding_flow(flow_builder, data_scope) -&gt; None:</pre>
</div>
<ul>
<li>这是索引构建的核心流程,在之前代码基础上增加了文件自动刷新功能(refresh_interval=5秒),支持实时监测markdown_files目录的文件变化。</li>
<li>流程步骤:
<ul>
<li>数据读取:通过LocalFile源读取 Markdown 文件,每 5 秒刷新一次。</li>
<li>文本分割:使用SplitRecursively按 Markdown 格式分割文本(块大小 2000 字符,重叠 500 字符)。</li>
<li>向量生成:调用text_to_embedding函数为每个文本块生成向量。</li>
<li>数据收集:收集文件名、位置、文本内容和向量嵌入。</li>
<li>导出存储:将数据存入 PostgreSQL,以(filename, location)为联合主键,并为embedding字段创建余弦相似度索引。</li>
</ul>
</li>
</ul>
<p><strong>3. 搜索功能实现(search函数)</strong></p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 255, 1)">def</span> search(pool: ConnectionPool, query: str, top_k: int = 5) -&gt; list]:</pre>
</div>
<ul>
<li>该函数实现了基于向量相似度的检索功能,是连接索引数据与用户查询的桥梁。</li>
<li>工作原理:
<ul>
<li>获取表名:通过工具函数自动获取索引存储的表名(与构建流程关联)。</li>
<li>生成查询向量:使用text_to_embedding.eval(query)将用户输入转换为向量(确保与索引向量同源)。</li>
<li>数据库查询:
<ul>
<li>连接 PostgreSQL 并注册向量类型支持。</li>
<li>执行 SQL 查询,使用&lt;=&gt;运算符计算查询向量与存储向量的距离(余弦距离)。</li>
<li>按距离排序并返回前top_k条结果。</li>
</ul>
</li>
<li>结果转换:将距离值转换为相似度分数(1 - 距离),便于直观理解匹配程度。</li>
</ul>
</li>
</ul>
<p><strong>4. 主程序入口(_main函数)</strong></p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 255, 1)">def</span> _main() -&gt; None:</pre>
</div>
<ul>
<li>负责初始化系统并提供交互式查询界面:
<ul>
<li>环境配置:通过load_dotenv加载环境变量(如数据库连接 URL)。</li>
<li>连接池初始化:创建 PostgreSQL 连接池,优化数据库连接效率。</li>
<li>交互式查询:循环接收用户输入,调用search函数并格式化输出结果,直到用户按回车退出。</li>
</ul>
</li>
</ul>
<p>接下来,执行如下命令:</p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 安装依赖</span>
pip install -<span style="color: rgba(0, 0, 0, 1)">e .

</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 配置</span>
<span style="color: rgba(0, 0, 0, 1)">cocoindex setup main.py

</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 更新索引</span>
<span style="color: rgba(0, 0, 0, 1)">cocoindex update main.py

</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 执行</span>
<span style="color: rgba(0, 0, 0, 1)">python main.py

</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> CocoInsight工具</span>
cocoindex server -ci main.py</pre>
</div>
<p><img src="https://img2024.cnblogs.com/blog/666745/202507/666745-20250727233900243-879062083.png" alt="image" width="1141" height="546" loading="lazy"></p>
<p><img src="https://img2024.cnblogs.com/blog/666745/202507/666745-20250727225451849-1930324757.png" alt="image" width="1442" height="728" loading="lazy"></p>
<h1>&nbsp;4.构建&nbsp;FastAPI&nbsp;在线语义检索服务</h1>
<p>在本示例中,我们将基于本地的 markdown 文件构建文本嵌入索引,并通过 fastapi 提供一个简单的查询接口。</p>
<p>实现代码如下:</p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> cocoindex
</span><span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> uvicorn
</span><span style="color: rgba(0, 0, 255, 1)">from</span> dotenv <span style="color: rgba(0, 0, 255, 1)">import</span> load_dotenv<span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 用于加载环境变量</span>
<span style="color: rgba(0, 0, 255, 1)">from</span> fastapi <span style="color: rgba(0, 0, 255, 1)">import</span> FastAPI, Query<span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 用于创建API接口</span>
<span style="color: rgba(0, 0, 255, 1)">from</span> fastapi <span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> Request
</span><span style="color: rgba(0, 0, 255, 1)">from</span> psycopg_pool <span style="color: rgba(0, 0, 255, 1)">import</span> ConnectionPool<span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> PostgreSQL连接池</span>
<span style="color: rgba(0, 0, 255, 1)">from</span> contextlib <span style="color: rgba(0, 0, 255, 1)">import</span> asynccontextmanager<span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 用于管理异步上下文</span>
<span style="color: rgba(0, 0, 255, 1)">import</span><span style="color: rgba(0, 0, 0, 1)"> os


@cocoindex.transform_flow()
</span><span style="color: rgba(0, 0, 255, 1)">def</span><span style="color: rgba(0, 0, 0, 1)"> text_to_embedding(
    text: cocoindex.DataSlice,
) </span>-&gt;<span style="color: rgba(0, 0, 0, 1)"> cocoindex.DataSlice]:
    </span><span style="color: rgba(128, 0, 0, 1)">"""</span><span style="color: rgba(128, 0, 0, 1)">
    使用SentenceTransformer模型将文本转换为嵌入向量。
    这是索引构建和查询过程中共享的逻辑。
    </span><span style="color: rgba(128, 0, 0, 1)">"""</span>
    <span style="color: rgba(0, 0, 255, 1)">return</span><span style="color: rgba(0, 0, 0, 1)"> text.transform(
      cocoindex.functions.SentenceTransformerEmbed(
            model</span>=<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">sentence-transformers/all-MiniLM-L6-v2</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 使用的预训练模型</span>
<span style="color: rgba(0, 0, 0, 1)">      )
    )


@cocoindex.flow_def(name</span>=<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">MarkdownEmbeddingFastApiExample</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">)
</span><span style="color: rgba(0, 0, 255, 1)">def</span><span style="color: rgba(0, 0, 0, 1)"> markdown_embedding_flow(
    flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope
):
    </span><span style="color: rgba(128, 0, 0, 1)">"""</span><span style="color: rgba(128, 0, 0, 1)">
    定义一个示例流程,用于将Markdown文件嵌入到向量数据库中。
    </span><span style="color: rgba(128, 0, 0, 1)">"""</span>
    <span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 从本地"files"目录添加Markdown文件作为数据源</span>
    data_scope[<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">documents</span><span style="color: rgba(128, 0, 0, 1)">"</span>] =<span style="color: rgba(0, 0, 0, 1)"> flow_builder.add_source(
      cocoindex.sources.LocalFile(path</span>=<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">files</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">)
    )
    </span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 创建一个收集器,用于收集处理后的文档嵌入信息</span>
    doc_embeddings =<span style="color: rgba(0, 0, 0, 1)"> data_scope.add_collector()

    </span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 遍历每个文档</span>
    with data_scope[<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">documents</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">].row() as doc:
      </span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 将文档内容按Markdown格式分割成块</span>
      doc[<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">chunks</span><span style="color: rgba(128, 0, 0, 1)">"</span>] = doc[<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">content</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">].transform(
            cocoindex.functions.SplitRecursively(),</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 递归分割函数</span>
            language=<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">markdown</span><span style="color: rgba(128, 0, 0, 1)">"</span>,<span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 分割语言</span>
            chunk_size=2000,<span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 块大小</span>
            chunk_overlap=500,<span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 块重叠部分大小</span>
<span style="color: rgba(0, 0, 0, 1)">      )

      </span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 遍历每个分割后的文本块</span>
      with doc[<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">chunks</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">].row() as chunk:
            </span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 为每个文本块生成嵌入向量</span>
            chunk[<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">embedding</span><span style="color: rgba(128, 0, 0, 1)">"</span>] = text_to_embedding(chunk[<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">text</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">])
            </span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 收集嵌入信息,包括文件名、位置、文本内容和嵌入向量</span>
<span style="color: rgba(0, 0, 0, 1)">            doc_embeddings.collect(
                filename</span>=doc[<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">filename</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">],
                location</span>=chunk[<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">location</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">],
                text</span>=chunk[<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">text</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">],
                embedding</span>=chunk[<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">embedding</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">],
            )

    </span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 将收集的嵌入信息导出到PostgreSQL数据库</span>
<span style="color: rgba(0, 0, 0, 1)">    doc_embeddings.export(
      </span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">doc_embeddings</span><span style="color: rgba(128, 0, 0, 1)">"</span>,<span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 导出名称</span>
      cocoindex.targets.Postgres(),<span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 目标数据库</span>
      primary_key_fields=[<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">filename</span><span style="color: rgba(128, 0, 0, 1)">"</span>, <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">location</span><span style="color: rgba(128, 0, 0, 1)">"</span>],<span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 主键字段</span>
      vector_indexes=[<span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 向量索引定义</span>
<span style="color: rgba(0, 0, 0, 1)">            cocoindex.VectorIndexDef(
                field_name</span>=<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">embedding</span><span style="color: rgba(128, 0, 0, 1)">"</span>,<span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 向量字段名</span>
                metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY,<span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 相似度度量方式</span>
<span style="color: rgba(0, 0, 0, 1)">            )
      ],
    )


</span><span style="color: rgba(0, 0, 255, 1)">def</span> search(pool: ConnectionPool, query: str, top_k: int = 5<span style="color: rgba(0, 0, 0, 1)">):
    </span><span style="color: rgba(128, 0, 0, 1)">"""</span><span style="color: rgba(128, 0, 0, 1)">
    搜索函数:根据查询文本在向量数据库中查找最相似的结果
    </span><span style="color: rgba(128, 0, 0, 1)">"""</span>
    <span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 获取导出目标的默认表名</span>
    table_name =<span style="color: rgba(0, 0, 0, 1)"> cocoindex.utils.get_target_default_name(
      markdown_embedding_flow, </span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">doc_embeddings</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">
    )
    </span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 对查询文本执行转换流程,获取嵌入向量</span>
    query_vector =<span style="color: rgba(0, 0, 0, 1)"> text_to_embedding.eval(query)
    </span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 执行查询并获取结果</span>
    with pool.connection() as conn:<span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 从连接池获取连接</span>
      with conn.cursor() as cur:<span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 创建游标</span>
<span style="color: rgba(0, 0, 0, 1)">            cur.execute(
                f</span><span style="color: rgba(128, 0, 0, 1)">"""</span><span style="color: rgba(128, 0, 0, 1)">
                SELECT filename, text, embedding &lt;=&gt; %s::vector AS distance
                FROM {table_name} ORDER BY distance LIMIT %s
            </span><span style="color: rgba(128, 0, 0, 1)">"""</span><span style="color: rgba(0, 0, 0, 1)">,
                (query_vector, top_k),</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 查询参数</span>
<span style="color: rgba(0, 0, 0, 1)">            )
            </span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 处理结果,将余弦距离转换为相似度分数</span>
            <span style="color: rgba(0, 0, 255, 1)">return</span><span style="color: rgba(0, 0, 0, 1)"> [
                {</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">filename</span><span style="color: rgba(128, 0, 0, 1)">"</span>: row, <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">text</span><span style="color: rgba(128, 0, 0, 1)">"</span>: row, <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">score</span><span style="color: rgba(128, 0, 0, 1)">"</span>: 1.0 - row}
                </span><span style="color: rgba(0, 0, 255, 1)">for</span> row <span style="color: rgba(0, 0, 255, 1)">in</span><span style="color: rgba(0, 0, 0, 1)"> cur.fetchall()
            ]


@asynccontextmanager
</span><span style="color: rgba(0, 0, 255, 1)">async def</span><span style="color: rgba(0, 0, 0, 1)"> lifespan(app: FastAPI):
    </span><span style="color: rgba(128, 0, 0, 1)">"""</span><span style="color: rgba(128, 0, 0, 1)">
    FastAPI生命周期管理:初始化和清理资源
    </span><span style="color: rgba(128, 0, 0, 1)">"""</span><span style="color: rgba(0, 0, 0, 1)">
    load_dotenv()</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 加载环境变量</span>
    cocoindex.init()<span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 初始化CocoIndex</span>
    <span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 创建数据库连接池</span>
    pool = ConnectionPool(os.getenv(<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">COCOINDEX_DATABASE_URL</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">))
    app.state.pool </span>= pool<span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 将连接池存储在应用状态中</span>
    <span style="color: rgba(0, 0, 255, 1)">try</span><span style="color: rgba(0, 0, 0, 1)">:
      </span><span style="color: rgba(0, 0, 255, 1)">yield</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 应用运行期间</span>
    <span style="color: rgba(0, 0, 255, 1)">finally</span><span style="color: rgba(0, 0, 0, 1)">:
      pool.close()</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 关闭连接池</span>


<span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 创建FastAPI应用实例</span>
fastapi_app = FastAPI(lifespan=<span style="color: rgba(0, 0, 0, 1)">lifespan)


@fastapi_app.get(</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">/search</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">)
</span><span style="color: rgba(0, 0, 255, 1)">def</span><span style="color: rgba(0, 0, 0, 1)"> search_endpoint(
    request: Request,
    q: str </span>= Query(..., description=<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">搜索查询文本</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">),
    limit: int </span>= Query(5, description=<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">返回结果数量</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">),
):
    </span><span style="color: rgba(128, 0, 0, 1)">"""</span><span style="color: rgba(128, 0, 0, 1)">
    搜索接口:接收查询文本和结果数量,返回匹配的搜索结果
    </span><span style="color: rgba(128, 0, 0, 1)">"""</span><span style="color: rgba(0, 0, 0, 1)">
    pool </span>= request.app.state.pool<span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 获取连接池</span>
    results = search(pool, q, limit)<span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 执行搜索</span>
    <span style="color: rgba(0, 0, 255, 1)">return</span> {<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">results</span><span style="color: rgba(128, 0, 0, 1)">"</span>: results}<span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 返回结果</span>


<span style="color: rgba(0, 0, 255, 1)">if</span> <span style="color: rgba(128, 0, 128, 1)">__name__</span> == <span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">__main__</span><span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(0, 0, 0, 1)">:
    </span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 启动FastAPI服务</span>
    uvicorn.run(fastapi_app, host=<span style="color: rgba(128, 0, 0, 1)">"</span><span style="color: rgba(128, 0, 0, 1)">0.0.0.0</span><span style="color: rgba(128, 0, 0, 1)">"</span>, port=8080)</pre>
</div>
<p>这段代码实现了一个基于 CocoIndex 框架的文本嵌入与检索系统,主要功能是将本地 Markdown 文件转换为向量表示并存储在 PostgreSQL 数据库中,同时提供一个 FastAPI 接口用于查询相似文本。整体流程可分为以下几个部分:</p>
<ul>
<li>文本嵌入转换函数:text_to_embedding函数使用 SentenceTransformer 模型将文本转换为向量表示,这是索引构建和查询过程中共享的核心逻辑,确保了索引和查询使用相同的嵌入方式。</li>
<li>数据处理流程定义:markdown_embedding_flow函数定义了完整的数据处理流程:
<ul>
<li>从本地 "files" 目录读取 Markdown 文件</li>
<li>将文档内容分割成适当大小的文本块(考虑重叠部分)</li>
<li>为每个文本块生成嵌入向量</li>
<li>将处理结果(包括文件名、位置、文本内容和嵌入向量)导出到 PostgreSQL 数据库,并创建向量索引</li>
</ul>
</li>
<li>搜索功能实现:search函数实现了向量相似性查询功能,通过将查询文本转换为嵌入向量,然后在 PostgreSQL 中使用向量索引查找最相似的文本块,并返回结果。</li>
<li>FastAPI 接口:创建了一个/search接口,允许用户通过 HTTP 请求进行文本查询,返回匹配的结果及相似度分数。</li>
<li>资源管理:使用lifespan上下文管理器管理应用生命周期,包括初始化 CocoIndex、创建数据库连接池等资源,并在应用关闭时进行清理。</li>
</ul>
<p>整个系统体现了 CocoIndex 框架在数据转换和索引构建方面的优势,同时结合 FastAPI 提供了便捷的查询接口,可用于构建基于本地文档的语义搜索应用。</p>
<p>本地环境执行如下命令:</p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 在 .env 文件中修改</span>
COCOINDEX_DATABASE_URL=postgres://cocoindex:cocoindex@localhost/<span style="color: rgba(0, 0, 0, 1)">cocoindex

</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 安装依赖</span>
pip install -<span style="color: rgba(0, 0, 0, 1)">r requirements.txt

</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 配置</span>
<span style="color: rgba(0, 0, 0, 1)">cocoindex setup main.py

</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)">更新索引</span>
<span style="color: rgba(0, 0, 0, 1)">cocoindex update main.py

</span><span style="color: rgba(0, 128, 0, 1)">#</span><span style="color: rgba(0, 128, 0, 1)"> 启动服务</span>
uvicorn main:fastapi_app --reload --host 0.0.0.0 --port 8000</pre>
</div>
<p><img src="https://img2024.cnblogs.com/blog/666745/202507/666745-20250727231314284-1647337382.png" alt="image" width="1146" height="557" loading="lazy"></p>
<p>&nbsp;服务启动,我们可以访问地址进行语义搜索:</p>
<div class="cnblogs_code">
<pre>http://localhost:8000/search?q=model&amp;limit=3</pre>
</div>
<p><img src="https://img2024.cnblogs.com/blog/666745/202507/666745-20250727232317023-1814694184.png" alt="image" width="1869" height="897" loading="lazy"></p>
<h1>&nbsp;5.总结</h1>
<p>CocoIndex 作为专为 AI 设计的超高性能数据转换框架,以 Rust 核心引擎为支撑,凭借增量处理、数据血缘追踪等开箱即用的特性,让复杂数据转换变得简单高效。无论是生成嵌入向量、构建知识图谱,还是处理超越传统 SQL 的任务,都能以极简代码快速实现,且从开发初期就具备生产就绪能力。其标准化接口与实时同步能力,完美适配 AI 场景中对数据新鲜度与处理效率的高要求,成为连接数据与 AI 应用的高效枢纽,大幅降低 AI 数据链路的开发与落地成本。</p>
<h1>6.结束语</h1>
<p>这篇博客就和大家分享到这里,如果大家在研究学习的过程当中有什么问题,可以加群进行讨论或发送邮件给我,我会尽我所能为您解答,与君共勉!</p>
<p>另外,博主出新书了《<span style="color: rgba(255, 0, 0, 1)"><strong><span style="color: rgba(255, 0, 0, 1)">Hadoop与Spark大数据全景解析</span></strong></span>》、同时已出版的《<span style="color: rgba(0, 0, 255, 1)"><strong><span style="color: rgba(0, 0, 255, 1)">深入理解Hive</span></strong></span>》、《<span style="color: rgba(0, 0, 255, 1)"><strong><span style="color: rgba(0, 0, 255, 1)">Kafka并不难学</span></strong></span>》和《<span style="color: rgba(0, 0, 255, 1)"><strong><span style="color: rgba(0, 0, 255, 1)">Hadoop大数据挖掘从入门到进阶实战</span></strong></span>》也可以和新书配套使用,喜欢的朋友或同学, 可以<span style="color: rgba(255, 0, 0, 1)"><strong>在公告栏那里点击购买链接购买博主的书</strong></span>进行学习,在此感谢大家的支持。关注下面公众号,根据提示,可免费获取书籍的教学视频。</p>

</div>
<div id="MySignature" role="contentinfo">
    <div>
<b class="b1"></b><b class="b2 d1"></b><b class="b3 d1"></b><b class="b4 d1"></b>
<div class="b d1 k">
联系方式:
<br/>
邮箱:smartloli.org@gmail.com
<br/>
<strong style="color: green">QQ群(Hive与AI实战【新群】):935396818</strong>
<br/>
QQ群(Hadoop - 交流社区1):424769183
<br/>
QQ群(Kafka并不难学):825943084
<br/>
温馨提示:请大家加群的时候写上加群理由(姓名+公司/学校),方便管理员审核,谢谢!
<br/>
<h3>热爱生活,享受编程,与君共勉!</h3>
</div>
<b class="b4b d1"></b><b class="b3b d1"></b><b class="b2b d1"></b><b class="b1b"></b>
</div>
<br>
<div>
<b class="b1"></b><b class="b2 d1"></b><b class="b3 d1"></b><b class="b4 d1"></b>
<div class="b d1 k">
<h3>公众号:</h3>
<h3><img style="width: 8%; margin-left: 10px" src="https://www.cnblogs.com/images/cnblogs_com/smartloli/1324636/t_qr.png"></h3>
</div>
<b class="b4b d1"></b><b class="b3b d1"></b><b class="b2b d1"></b><b class="b1b"></b>
</div>
<br>
<div>
<b class="b1"></b><b class="b2 d1"></b><b class="b3 d1"></b><b class="b4 d1"></b>
<div class="b d1 k">
<h3>作者:哥不是小萝莉 [关于我][犒赏]</h3>
<h3>出处:http://www.cnblogs.com/smartloli/</h3>
<h3>转载请注明出处,谢谢合作!</h3>
</div>
<b class="b4b d1"></b><b class="b3b d1"></b><b class="b2b d1"></b><b class="b1b"></b>
</div><br><br>
来源:https://www.cnblogs.com/smartloli/p/19008094
頁: [1]
查看完整版本: CocoIndex实现AI数据语义检索