Crawler_20251211_Browser-Use_MCP_Selenium_Crawler+LLM

Posted by 佑方向 on 2026-1-12 16:42:00

<h1 id="爬虫_20251211">Crawler_20251211</h1>
<h2 id="browser-use">Browser-Use</h2>
<h3 id="browser-use-下载安装">Browser-Use: Download and Installation</h3>
GitHub repository: https://github.com/browser-use/browser-use
Check whether <code>uv</code> is already installed on Windows:
<pre><code>uv --version
</code></pre>
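Upgrade <code>uv</code> (<code>uv self update</code> applies to the standalone installer; a <code>pip</code>-installed <code>uv</code> is upgraded with <code>pip install -U uv</code>):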
<pre><code>uv self update
</code></pre>
Ways to install <code>uv</code>:
<ol>
<li>Install <code>uv</code> with <code>pip</code>:</li>
</ol>
<pre><code>pip install uv
</code></pre>
<ol start="2">
<li>Install with the official script:</li>
</ol>
<pre><code>powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
</code></pre>
To tell which way <code>uv</code> was installed, check its install path. In PowerShell, run:
<pre><code>Get-Command uv
</code></pre>
If <code>uv</code> was installed with <code>pip</code> (as a Python package), the path is typically one of:
<pre><code>C:\Users\yuanz\AppData\Local\Programs\Python\Python311\Scripts\uv.exe
C:\Users\yuanz\AppData\Roaming\Python\Python311\Scripts\uv.exe
</code></pre>
If <code>uv</code> was installed with the official script (as a standalone executable), the path is typically one of:
<pre><code>C:\Users\yuanz\.local\bin\uv.exe
C:\Users\yuanz\AppData\Local\uv\bin\uv.exe
</code></pre>
Check whether it is a <code>pip</code>-managed version:
<pre><code>pip list | findstr uv
</code></pre>
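Lift the PowerShell restriction on running <code>.ps1</code> scripts:
<pre><code># Lifts PowerShell's default ban on running .ps1 scripts. With -Scope CurrentUser, an elevated (administrator) session is not required.
# The command lets the current user run local PowerShell scripts (such as .venv\Scripts\activate.ps1) while still blocking unsigned scripts that come from the internet.
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
</code></pre>
Create and activate a virtual environment:
<pre><code>cd <project-path>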
uv venv
.venv\Scripts\activate
</code></pre>
A <code>(.venv)</code> prefix on the command prompt means the virtual environment was activated successfully.
Install Browser-Use:
<pre><code>uv init
uv add browser-use
uv sync
uvx browser-use install
</code></pre>
Install <code>dotenv</code>:
<pre><code>uv pip install python-dotenv
</code></pre>
Notes on creating the <code>.env</code> file:
<ul>
<li>
Do not add quotes or spaces.
</li>
<li>
The <code>.env</code> file name must start with a dot.
</li>
<li>
Make sure it sits in the same folder as your Python script.
</li>
</ul>
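The notes above can be checked mechanically before running anything. This is a minimal sketch, assuming a hypothetical helper named <code>check_env_line</code> (it is not part of python-dotenv) that flags lines with quotes or with spaces around the equals sign:

```python
import re

# Hypothetical helper (not part of python-dotenv): accepts KEY=value lines
# that contain no quotes and no whitespace, as the notes above recommend.
ENV_LINE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*=[^\s'\"]+$")


def check_env_line(line: str) -> bool:
    """Return True if the line is blank, a comment, or a clean KEY=value pair."""
    stripped = line.rstrip("\n")
    if not stripped.strip() or stripped.lstrip().startswith("#"):
        return True
    return bool(ENV_LINE.match(stripped))


if __name__ == "__main__":
    samples = [
        "GEMINI_API_KEY=abc123",
        'GEMINI_API_KEY="abc123"',
        "GEMINI_API_KEY = abc123",
    ]
    for sample in samples:
        print(f"{sample!r:35} ok={check_env_line(sample)}")
```

Only the first sample passes; the other two are rejected for the quotes and the spaces respectively.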
<h3 id="browser-use-使用示例">Browser-Use Usage Example</h3>
<pre><code class="language-python"># run_agent.py
# -*- coding: utf-8 -*-
"""
A ready-to-run browser-use example:
- Reads GEMINI_API_KEY from .env
- Uses the local browser by default; can be switched to the cloud browser with one flag
- Supports a local proxy for web traffic while keeping the local CDP endpoint (127.0.0.1) off the proxy
- Written against browser-use 0.8.x (Browser() only accepts use_cloud/headless/proxy)
"""

import os
import sys
import traceback
from dotenv import load_dotenv

# 1) Load .env (place it in the project root; content: GEMINI_API_KEY=your-key)
load_dotenv()

# 2) Key step: keep the local debugging port off the proxy (otherwise JSONDecodeError)
os.environ["NO_PROXY"] = "localhost,127.0.0.1"
os.environ["no_proxy"] = "localhost,127.0.0.1"
# If these proxy variables were set earlier in the system/terminal, clear them here (current process only)
for k in ("HTTP_PROXY", "HTTPS_PROXY", "http_proxy", "https_proxy"):
    os.environ.pop(k, None)

# 3) Configure the proxy for web traffic according to your setup (if needed)
# For example, if your local VPN port is 25378, set:
PROXY_SERVER = os.getenv("PROXY_SERVER", "http://127.0.0.1:25378")
USE_PROXY = os.getenv("USE_PROXY", "false").lower() in ("1", "true", "yes")

# 4) Whether to use the cloud browser (set to True if the local one is unstable; run `browser-use auth` first to log in)
USE_CLOUD = os.getenv("USE_CLOUD", "false").lower() in ("1", "true", "yes")

# 5) Remaining parameters
HEADLESS = os.getenv("HEADLESS", "false").lower() in ("1", "true", "yes")
MODEL_ID = os.getenv("GEMINI_MODEL", "gemini-2.5-flash")

# 6) Import browser-use (after NO_PROXY has been set)
from browser_use import Agent, ChatGoogle, Browser

def make_browser():
    """
    Compatible with browser-use 0.7.x:
    Browser() accepts a limited set of parameters: use_cloud, headless, proxy, profile_name.
    proxy is passed as a Playwright-compatible dict:
        {"server": "http://host:port", "username": "...", "password": "..."}
    """
    kwargs = {
        "use_cloud": USE_CLOUD,
        "headless": HEADLESS,
    }
    if (not USE_CLOUD) and USE_PROXY:
        kwargs["proxy"] = {"server": PROXY_SERVER}
        # If the proxy needs credentials, use instead:
        # kwargs["proxy"] = {"server": PROXY_SERVER, "username": "your_user", "password": "your_pass"}

    return Browser(**kwargs)

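# File parsing and CSV generation are no longer needed; data is printed directly to the terminal
def main():
    # Sanity check: API key
    key = os.getenv("GEMINI_API_KEY")
    if not key:
        print("❌ GEMINI_API_KEY not found. Create a .env file in the project root containing:")
        print("GEMINI_API_KEY=your-key")
        sys.exit(1)

    print("=== Config ===")
    print(f"USE_CLOUD : {USE_CLOUD}")
    print(f"HEADLESS  : {HEADLESS}")
    print(f"USE_PROXY : {USE_PROXY} ({PROXY_SERVER if USE_PROXY else 'no proxy'})")
    print(f"MODEL_ID  : {MODEL_ID}")
    print("NO_PROXY  :", os.environ.get("NO_PROXY"))

    browser = make_browser()

    agent = Agent(
        task="""
        [Strict action protocol]
        - You are a browser automation agent. At every step you MUST return exactly one action, and it must be one of:
          navigate / click / input / send_keys / wait / scroll / find_text / extract / evaluate / read_file / replace_file / done
        - Apart from the final `done`, do not output any natural language or summary; to record progress, use `replace_file`.
        - If unsure, still return an action (for example wait). Never output only thinking.

        [Task]
        1) Open https://www.cn-healthcare.com/ .
        2) Click the "search / magnifier" icon in the top bar, <div class="ni_head_search_wrap">; first wait for the URL to become "https://www.cn-healthcare.com/search/", then wait for the input box to appear.
        3) In the <input class="search-input"> box at the top, type: 公立医院, press Enter, and wait 10 seconds; if no results are visible, press Enter again or click the search button.
        4) Scroll to the bottom and wait 5 seconds; do this only once.
        5) Extract the article cards from the current page (cn-healthcare.com domain only):
        - Each card element: <div class="search-item">
        - Title: the text of the <a> under <h5 class="tit"></h5> inside <div class="search-item">
        - URL: the href attribute of the <a> tag inside <h5 class="tit"></h5>
        - Author: all text of span.author inside the <a> tag under <div class="footer">
        - Publish date: all text content under <div class="footer"> (format yyyy/mm/dd)
        6) Filter:
        - Keep only URLs containing /article /content /articlewm that do not end in an image/video/PDF suffix
        - Date between 2023-05-01 and 2025-10-31 (inclusive); skip relative times such as "刚刚/几天前"
        - Deduplicate by URL

        [Extraction and output (required)]
        - After searching and scrolling, run extract once:
        - Each card element: <div class="search-item">
        - Title: the text of the <a> under <h5 class="tit"></h5> inside <div class="search-item">
        - URL: the href attribute of the <a> tag inside <h5 class="tit"></h5>
        - Author: all text of span.author inside the <a> tag under <div class="footer">
        - Publish date: all text content under <div class="footer"> (format yyyy/mm/dd)
        - Once the extract step has finished, end immediately (call done).
        - The data is shown directly in the terminal; do not write it to a file.

        [Prohibited]
        - Apart from the final `done`, no step may output natural-language text, code blocks, or screenshots.

        """,
        llm=ChatGoogle(model=MODEL_ID),
        browser=browser,
    )

    # Run the agent; data is printed directly to the terminal
    try:
        print("🚀 Starting the crawl...")
        agent.run_sync()  # key step: run synchronously
        print("✅ Data extraction finished!")

    except Exception:
        print("❌ Agent run failed. Traceback:")
        traceback.print_exc()
        print("\nCommon fixes:")
        print("1) If a system-wide proxy is enabled, turn it off or make sure 127.0.0.1 is excluded;")
        print("2) If the local browser is still unstable, set USE_CLOUD=true and rerun (run `browser-use auth` first);")
        print("3) Upgrade to the latest version: uv pip install -U 'browser-use';")
        print("4) Debug in headed mode: set HEADLESS=false.")

if __name__ == "__main__":
    main()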
</code></pre>
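The filtering rules in the task prompt (URL pattern, date window, URL deduplication) can also be enforced in plain Python after the agent returns, instead of relying on the model alone. The sketch below assumes each extracted card is a dict with <code>url</code> and <code>date</code> keys; those field names, the suffix list, and the sample URLs are illustrative and do not come from browser-use:

```python
from datetime import date, datetime
from urllib.parse import urlparse

# Assumed card shape: {"url": ..., "date": "yyyy/mm/dd", ...}; names are illustrative.
ALLOWED_PARTS = ("/article", "/content", "/articlewm")
BLOCKED_SUFFIXES = (".jpg", ".jpeg", ".png", ".gif", ".mp4", ".pdf")
START, END = date(2023, 5, 1), date(2025, 10, 31)


def keep_card(card: dict) -> bool:
    """Apply the domain, URL, and date rules from the task prompt to one card."""
    parsed = urlparse(card["url"])
    host = parsed.hostname or ""
    path = parsed.path.lower()
    if not (host == "cn-healthcare.com" or host.endswith(".cn-healthcare.com")):
        return False
    if not any(part in path for part in ALLOWED_PARTS):
        return False
    if path.endswith(BLOCKED_SUFFIXES):
        return False
    try:
        published = datetime.strptime(card["date"].strip(), "%Y/%m/%d").date()
    except ValueError:
        return False  # relative times such as "刚刚" do not parse and are skipped
    return START <= published <= END


def filter_cards(cards: list[dict]) -> list[dict]:
    """Keep cards that pass keep_card, dropping repeated URLs."""
    seen: set[str] = set()
    kept = []
    for card in cards:
        if keep_card(card) and card["url"] not in seen:
            seen.add(card["url"])
            kept.append(card)
    return kept
```

Doing this step in code makes the date window and deduplication deterministic, whatever the model decides to return.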
<h3 id="命令行指令">Command-Line Commands</h3>
Show the system-level WinHTTP proxy settings on Windows:
<pre><code>PS C:\Users\yuanz\Desktop\scraper-wjw> netsh winhttp show proxy

Current WinHTTP proxy settings:

Direct access (no proxy server).
</code></pre>
Check which ports are in use:
<pre><code>PS C:\Users\yuanz\Desktop\scraper-wjw> netstat -ano | findstr 25378
TCP 127.0.0.1:25378 0.0.0.0:0 LISTENING 82344
</code></pre>
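<ul>
<li>
<code>netstat -ano</code>: list all ports
</li>
<li>
<code>a</code> = all connections and listening ports
</li>
<li>
<code>n</code> = show addresses numerically (no reverse DNS lookup)
</li>
<li>
<code>o</code> = show the PID (which program holds the port)
</li>
<li>
<code>| findstr 25378</code>: keep only lines containing "25378"
</li>
<li>
<code>TCP</code>: protocol
</li>
<li>
<code>127.0.0.1:25378</code>: local port 25378 is being listened on
</li>
<li>
<code>0.0.0.0:0</code>: no specific remote endpoint (the socket is listening)
</li>
<li>
<code>LISTENING</code>: a program is listening on the port
</li>
<li>
<code>82344</code>: PID of the process that holds the port
</li>
</ul>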
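The column meanings listed above can be made concrete by splitting one output line into named fields. This is a small sketch that only handles TCP lines (UDP lines have no state column); <code>NetstatRow</code> and <code>parse_netstat_line</code> are names made up for the example:

```python
from typing import NamedTuple


class NetstatRow(NamedTuple):
    protocol: str
    local_address: str
    remote_address: str
    state: str
    pid: int


def parse_netstat_line(line: str) -> NetstatRow:
    """Split one TCP line of `netstat -ano` output into named fields."""
    protocol, local, remote, state, pid = line.split()
    return NetstatRow(protocol, local, remote, state, int(pid))


row = parse_netstat_line("TCP 127.0.0.1:25378 0.0.0.0:0 LISTENING 82344")
print(row.local_address, row.state, row.pid)  # 127.0.0.1:25378 LISTENING 82344
```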
<code>tasklist</code> lists all currently running processes (a command-line counterpart to Task Manager):
<pre><code>PS C:\Users\yuanz\Desktop\scraper-wjw> tasklist | findstr 82344
SSTap.exe 82344 Console 2 25,892 K
</code></pre>
<ul>
<li>The process with <code>PID = 82344</code> is <code>SSTap.exe</code></li>
<li>Port 25378 is the local proxy port provided by SSTap (a local HTTP/SOCKS listener)</li>
</ul>
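The same check can be done from Python before the agent starts, so that a proxy that is not running is noticed early. This is a minimal sketch using only the standard library; <code>is_port_listening</code> is a name chosen for the example, and 25378 is simply the port from the output above:

```python
import socket


def is_port_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # 25378 is the SSTap port from the example above; adjust it to your setup.
    print(is_port_listening("127.0.0.1", 25378))
```

A successful connection only shows that something is listening; it does not confirm that the listener is an HTTP or SOCKS proxy.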
<h3 id="大模型-api">LLM APIs</h3>
AIIAI: https://api.aiiai.top/
GalaAPI: https://www.galaapi.com/
<h2 id="mcp-model-context-protocol">MCP (Model Context Protocol)</h2>
MCP is a standardized protocol introduced by Anthropic in 2024. It lets AI models (such as Claude, ChatGPT, and Gemini) interact with external tools, data sources, and plugins.
<ul>
<li>
Cursor / Claude Desktop = MCP Client
</li>
<li>
Your own Python FastMCP program = MCP Server
</li>
</ul>
When writing an MCP Server in Python, getting Cursor to call it takes at least two steps:
<ol>
<li>Write an MCP Server (Python / FastMCP)</li>
</ol>
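<pre><code class="language-python"># server_fastmcp.py

from mcp.server.fastmcp import FastMCP
import httpx

app = FastMCP("BAS Security Platform")

@app.tool()
async def create_attack_task(...):  # parameters are elided in the original
    ...
    return "Task created successfully"

if __name__ == "__main__":
    # Start the server; FastMCP uses the stdio transport by default
    app.run()

</code></pre>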
Start the server:
<pre><code>uv run server_fastmcp.py
</code></pre>
<ol start="2">
<li>Make Cursor aware of this MCP Server</li>
</ol>
Cursor supports MCP. Create an MCP configuration file <code>.cursor/mcp.json</code> in the project:
<pre><code>{
  "mcpServers": {
    "bas-mcp": {
      "command": "python",
      "args": ["server_fastmcp.py"]
    }
  }
}
</code></pre>
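The MCP Client's job is to call (invoke) the capabilities that an MCP Server exposes: tools, resources, prompt templates, and so on.
In other words:
<ul>
<li>
MCP Server = tool provider (offers capabilities)
</li>
<li>
MCP Client = caller (consumes capabilities)
</li>
</ul>
The LLM (ChatGPT, Claude, or Cursor's built-in model) works as part of the MCP Client side: it infers from the conversation which tool needs to be called, and the MCP Client then issues the call.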
<pre><code>┌───────────────────────────────┐
│          MCP Client           │
│  (Cursor / Claude / VSCode)   │
│                               │
│ - Parse user input            │
│ - Decide whether to call tool │
│ - Send MCP request →          │
└──────────────┬────────────────┘
               │ (MCP Protocol)
               ▼
┌───────────────────────────────┐
│          MCP Server           │
│ (your server.py / Node.js)    │
│                               │
│ - @app.tool() exposes tools   │
│ - Run business logic          │
│ - Return result to the client │
└───────────────────────────────┘
</code></pre>
A complete BAS MCP example:
Assume:
<ul>
<li>
The MCP Server file is named <code>bas_server.py</code>
</li>
<li>
<code>bas_server.py</code> contains a tool:
<pre><code class="language-python">@app.tool()
async def start_attack(type: str, target: str) -> str:
    return bas.start(type, target)
</code></pre>
</li>
<li>
There are more tools, such as <code>get_status(task_id)</code>
</li>
</ul>
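The Cursor MCP configuration file (<code>mcp.json</code>) would then be:
<pre><code class="language-json">{
  "mcpServers": {
    "bas-mcp": {
      "command": "python",
      "args": ["bas_server.py"],
      "env": {
        "BAS_API_URL": "http://localhost:8080",
        "BAS_TOKEN": "your-secret-token"
      }
    }
  }
}
</code></pre>
<ul>
<li><code>bas-mcp</code>: the name the MCP Server is shown under in Cursor; any name works</li>
<li><code>command</code>: the command Cursor uses to launch the server as a child process (stdio)</li>
<li><code>args</code>: the arguments, so the MCP Server runs as <code>python bas_server.py</code></li>
<li><code>env</code>: environment variables passed to the server process</li>
</ul>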
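When I type "Please launch a SQL injection attack against 192.168.0.5" into the Cursor chat box, the tool is triggered as follows:
<ol>
<li>The LLM built into Cursor (the MCP Client) generates the MCP call:</li>
</ol>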
<pre><code>call tool start_attack({"type":"sql_injection","target":"192.168.0.5"})
</code></pre>
<ol start="2">
<li>Cursor sends a JSON-RPC request:</li>
</ol>
<pre><code class="language-json">{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "start_attack",
    "arguments": {
      "type": "sql_injection",
      "target": "192.168.0.5"
    }
  }
}
</code></pre>
<ol start="3">
<li>The MCP Server executes the tool function:</li>
</ol>
<pre><code class="language-python">@app.tool()
async def start_attack(type: str, target: str) -> str:
    return bas.start(type, target)
</code></pre>
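The three steps can be simulated end to end without Cursor or FastMCP. The sketch below is a simplified stand-in, not the real SDK: <code>TOOLS</code>, <code>tool</code>, <code>build_request</code>, and <code>handle_request</code> are hypothetical names, the tool body just returns a string, and the response follows the general JSON-RPC 2.0 envelope with an MCP-style text content list:

```python
import json

# Stand-in tool registry; a real FastMCP server builds this from @app.tool().
TOOLS = {}


def tool(func):
    """Register a function under its own name (illustrative decorator)."""
    TOOLS[func.__name__] = func
    return func


@tool
def start_attack(type: str, target: str) -> str:
    # Placeholder body: a real tool would call the BAS platform here.
    return f"started {type} against {target}"


def build_request(request_id: int, name: str, arguments: dict) -> str:
    """Client side: serialize a JSON-RPC 2.0 tools/call request."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    })


def handle_request(raw: str) -> str:
    """Server side: look up the tool, call it, and wrap the result."""
    request = json.loads(raw)
    params = request["params"]
    result = TOOLS[params["name"]](**params["arguments"])
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request["id"],
        "result": {"content": [{"type": "text", "text": result}]},
    })


raw = build_request(1, "start_attack", {"type": "sql_injection", "target": "192.168.0.5"})
print(handle_request(raw))
```

The <code>id</code> field is what lets the client match a response to the request that caused it.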
<h2 id="selenium-爬虫">Selenium Crawler</h2>
Standard boilerplate:
<pre><code class="language-python">import random
import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager


def build_driver(headless: bool = True) -> webdriver.Chrome:
    chrome_opts = Options()
    if headless:
        chrome_opts.add_argument("--headless=new")
    chrome_opts.add_argument("--disable-gpu")
    chrome_opts.add_argument("--no-sandbox")
    chrome_opts.add_argument("--window-size=1400,900")
    chrome_opts.add_argument(
        '--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
        'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/140.0.0.0 Safari/537.36'
    )
    chrome_opts.add_argument("--disable-blink-features=AutomationControlled")

    service = Service(ChromeDriverManager().install())
    driver = webdriver.Chrome(service=service, options=chrome_opts)
    driver.set_page_load_timeout(180)
    driver.implicitly_wait(3)
    return driver

def human_like_scroll(driver: webdriver.Chrome):
    """Simulate human scrolling behavior, to prevent being too "robotic"."""
    try:
        # Scroll a few times, each time to a different height, with random pauses in between
        scroll_steps = random.randint(2, 5)
        for _ in range(scroll_steps):
            # 0.3 ~ 1.0 times the page height randomly
            factor = random.uniform(0.3, 1.0)
            driver.execute_script(
                "window.scrollTo(0, document.body.scrollHeight * arguments[0]);",
                factor,
            )
            time.sleep(random.uniform(0.5, 1.5))
    except Exception as e:
        print(" human_like_scroll error: ", e)
</code></pre>
Typing a query into the search box:
<pre><code class="language-python">def test_search(url: str, keyword: str):
    driver = build_driver(headless=False)  # headed mode, so you can watch what happens
    driver.get(url)
    time.sleep(1)

    # Locate the search box
    search_input = driver.find_element(By.CSS_SELECTOR, ".search-container .sh-inpt input")
    search_input.clear()
    search_input.send_keys(keyword)
    time.sleep(0.3)

    # Click the "search" button (the Selenium 4 recommended style)
    search_button = driver.find_element(By.CSS_SELECTOR, ".search-container .sh-btn")
    search_button.click()

    # Wait for the page to load
    time.sleep(3)

    # Simulate scrolling
    human_like_scroll(driver)

    print("Page title:", driver.title)
    print("Current URL:", driver.current_url)

    # Add the code that scrapes the results here
    # html = driver.page_source
    # print(html[:300])

    time.sleep(2)
    driver.quit()

if __name__ == "__main__":
    test_search(
        url="http://search.people.cn/",  # site to test
        keyword="数字化转型"  # test keyword
    )
</code></pre>
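The fixed <code>time.sleep</code> calls above either wait too long or not long enough, depending on the page. A common alternative is to poll for a condition with a timeout, which is what Selenium's own <code>WebDriverWait</code> does. The sketch below shows the polling idea in plain Python; <code>wait_until</code> is a hypothetical helper and not a Selenium API:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def wait_until(condition: Callable[[], T], timeout: float = 10.0, poll: float = 0.2) -> T:
    """Poll `condition` until it returns a truthy value or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while True:
        value = condition()
        if value:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll)
```

With a driver from <code>build_driver</code>, a call such as <code>wait_until(lambda: driver.find_elements(By.CSS_SELECTOR, ".search-container .sh-btn"))</code> would return the matching elements as soon as they exist, since <code>find_elements</code> returns an empty list until then.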
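<h3 id="爬虫--llm">Crawler + LLM</h3>
An LLM called through its API from a local script cannot visit URLs. It is a language model that only processes text input and output; on its own it cannot issue HTTP requests or operate a browser.
The usual approach is therefore to fetch the text first with crawling tools such as BeautifulSoup4, requests, or Selenium, and then pass that text to the LLM for semantic analysis and extraction of the target data.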
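Between the crawl and the LLM call, the fetched HTML usually needs to be reduced to plain text so that markup, scripts, and styles do not take up the prompt. This is a minimal sketch with the standard library's <code>html.parser</code> (BeautifulSoup4 offers a more robust equivalent); <code>TextExtractor</code> and <code>html_to_text</code> are names made up for the example:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML document, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


page = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><h1>Title</h1><p>Body text.</p><script>var x = 1;</script></body></html>"
)
print(html_to_text(page))
```

The resulting string is what would go into the <code>user</code> message of the chat request.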
Load the environment variables:
<pre><code class="language-python">from dotenv import load_dotenv
load_dotenv()
</code></pre>
Read the environment variables:
<pre><code class="language-python">import os

base_url = os.getenv("AI_BASE_URL", "") # e.g. https://api.aiiai.top/v1
api_key = os.getenv("AI_API_KEY", "")
model = os.getenv("AI_MODEL_TYPE", "") # e.g. gemini-2.5-pro
</code></pre>
Call the model through the OpenAI-compatible SDK:
<pre><code class="language-python">from openai import OpenAI

client = OpenAI(
    base_url=base_url,
    api_key=api_key,
)

completion = client.chat.completions.create(
    model=model,
    messages=messages,
)
# choices is a list; take the first candidate
content = completion.choices[0].message.content
</code></pre>
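<ul>
<li>
This uses the <code>OpenAI</code> client from the openai package but points <code>base_url</code> at your own gateway (https://api.aiiai.top/v1), which makes it work with self-hosted and proxy services.
</li>
<li>
The call is the standard Chat Completions endpoint: pass <code>model</code> and <code>messages</code>. See the OpenAI API documentation for details: https://platform.openai.com/docs/api-reference/chat/create
</li>
</ul>
According to the OpenAI documentation, the request JSON must include <code>model</code> and <code>messages</code>: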
<pre><code class="language-json">{
"model": "gpt-5.2",
"messages": [
{"role": "system", "content": "..." },
{"role": "user", "content": "..." }
]
}
</code></pre>
Example:
<pre><code class="language-python">import json
import os
from typing import Any, Dict, List

from openai import OpenAI

# init_logger and logger are defined elsewhere in the author's project.


def call_local_llm(messages: List[Dict[str, Any]]) -> Any:
    """
    Call the local / OpenAI compatible model:
    - Read AI_BASE_URL / AI_API_KEY / AI_MODEL_TYPE from environment variables
    - Default base_url = https://api.aiiai.top/v1
    - Default model = gemini-2.5-pro
    - Return the Python object (dict / list / None) after JSON parsing
    """
    init_logger()

    base_url = os.getenv("AI_BASE_URL", "")
    api_key = os.getenv("AI_API_KEY", "")
    model = os.getenv("AI_MODEL_TYPE", "")

    if not base_url:
        base_url = "https://api.aiiai.top/v1"
    if not model:
        model = "gemini-2.5-pro"

    if not api_key:
        logger.error("API key is empty")
        raise ValueError("API key is empty")

    client = OpenAI(
        base_url=base_url,
        api_key=api_key,
    )

    logger.info(f"Calling LLM, model={model}")
    completion = client.chat.completions.create(
        model=model,
        messages=messages,
    )

    # choices is a list; take the first candidate
    content = completion.choices[0].message.content
    if content is None:
        logger.error("LLM returned empty content")
        raise RuntimeError("LLM returned empty content")

    content = content.strip()
    logger.info(f"LLM raw output: {content}")

    # Try JSON parsing, compatible with ```json ... ``` wrapped cases
    if isinstance(content, str):
        try:
            return json.loads(content)
        except json.JSONDecodeError:
            cleaned = content.strip().strip("`")
            lower = cleaned.lower()
            if lower.startswith("json\n") or lower.startswith("json\r\n"):
                # Drop the leading "json" language-tag line
                cleaned = "\n".join(cleaned.splitlines()[1:])
            return json.loads(cleaned)
    else:
        # It should not reach here, but keep compatible
        return content
</code></pre>
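The fence-stripping step at the end of <code>call_local_llm</code> can be isolated and tested without calling any API. The sketch below is a standalone variant, not the author's function: <code>parse_llm_json</code> is a hypothetical name, and it removes the opening and closing fence lines explicitly instead of stripping backticks from both ends:

```python
import json

FENCE = "`" * 3  # built this way so the example can itself sit inside a fenced block


def parse_llm_json(content: str):
    """Parse JSON that may be wrapped in a Markdown code fence."""
    text = content.strip()
    if text.startswith(FENCE):
        lines = text.splitlines()
        lines = lines[1:]  # drop the opening fence line (with its optional "json" tag)
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]  # drop the closing fence line
        text = "\n".join(lines)
    return json.loads(text)


wrapped = FENCE + 'json\n{"title": "demo", "matched": true}\n' + FENCE
print(parse_llm_json(wrapped))      # {'title': 'demo', 'matched': True}
print(parse_llm_json("[1, 2, 3]"))  # [1, 2, 3]
```

Input that is still not valid JSON after unwrapping raises <code>json.JSONDecodeError</code>, which the caller can log and retry.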
SDK stands for Software Development Kit: a set of officially provided tools that make it easier for developers to call a service.
For OpenAI, the SDK wraps the HTTP requests, so there is no need to hand-write the POST body and headers; it also takes care of error handling, retries, and streaming output.
With the SDK:
<pre><code class="language-python">completion = client.chat.completions.create(
    model=model,
    messages=messages,
)
</code></pre>
Without the SDK, the HTTP request has to be written by hand:
<pre><code class="language-python">import requests

requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "...",
        "messages": [...],
    },
)
</code></pre>
Passing the text to be analyzed to the LLM:
<ol>
<li><code>SYSTEM_PROMPT</code>: load the rules from an external file</li>
</ol>
<pre><code class="language-python">def load_system_prompt():
    base_dir = os.path.dirname(os.path.abspath(__file__))
    ht_path = os.path.join(base_dir, "ht_jg.txt")
    ...
SYSTEM_PROMPT = load_system_prompt()
</code></pre>
<ol start="2">
<li><code>messages</code>: feed the contract records one at a time</li>
</ol>
<pre><code class="language-python">import json
from typing import Any, Dict, List, Optional

# SYSTEM_PROMPT and demo_str are defined elsewhere in the author's script.


def build_messages_for_single_record(
    region: str,
    organization_name: Optional[str],
    contract_record: Dict[str, Any],
) -> List[Dict[str, Any]]:
    """
    Construct the messages for a single contract record:
    - system: system instruction (SYSTEM_PROMPT), which should clearly state: this time only process this one contract
    - user: contains region / organization_name / this current contract_record / demo_str
    """
    user_payload = {
        "region": region,
        "organization_name": organization_name or "",
        "contract_record": contract_record,
        "demo_str": demo_str,
    }

    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {
            "role": "user",
            "content": (
                "Below is the input JSON for this task (it contains a single contract record). "
                "Judge only from this record whether it relates to the target region/organization, "
                "and extract the structured information:\n\n"
                + json.dumps(user_payload, ensure_ascii=False, indent=2)
            ),
        },
    ]
    return messages
</code></pre>
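Record-by-record processing: each call gives the model a single contract_record (title/url/date/content), which rules out cross-record mix-ups.
A cross-record mix-up (also called cross-record hallucination) is when the model, while processing a batch of contract records, blends contract A's URL, contract B's content, and contract C's business scenario into one output.
Symptoms:
<ul>
<li>
The output URL does not match the business scenario extracted from the content
</li>
<li>
The output contains fragments of another contract
</li>
<li>
The model produces a contract that does not exist (a fabricated URL)
</li>
<li>
The output for contract N clearly draws on the text of contract N + 1
</li>
</ul>
A typical case: the model presents a nonexistent URL as a real contract, with content taken from an entirely different article.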
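Some of these symptoms can be detected automatically by comparing the model's output against the input batch. The sketch below assumes input records and output items are dicts with a <code>url</code> key; <code>find_mixups</code> is a hypothetical helper and the URLs are placeholders:

```python
def find_mixups(inputs: list[dict], outputs: list[dict]) -> list[str]:
    """Report outputs whose URL is missing from the inputs or repeats an earlier output."""
    known = {record["url"] for record in inputs}
    seen: set = set()
    problems = []
    for index, item in enumerate(outputs):
        url = item.get("url")
        if url not in known:
            problems.append(f"output {index}: URL not in the input batch: {url}")
        elif url in seen:
            problems.append(f"output {index}: URL already used by an earlier output: {url}")
        seen.add(url)
    return problems


inputs = [{"url": "https://example.com/a"}, {"url": "https://example.com/b"}]
outputs = [
    {"url": "https://example.com/a"},
    {"url": "https://example.com/a"},       # repeated: two outputs claim the same contract
    {"url": "https://example.com/made-up"}, # fabricated: never appeared in the input
]
for problem in find_mixups(inputs, outputs):
    print(problem)
```

A check like this catches fabricated and duplicated URLs, but not content that was borrowed from a neighboring record under a correct URL.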
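The reasons are:
<ol>
<li>
The LLM's "context fusion"
An LLM treats everything in its input as a single context for probabilistic prediction.
So if you feed the model 200 contracts at once, it does not see "200 independent samples"; it sees one large corpus and looks for associations it finds plausible within it.
In other words, an LLM has no built-in notion of processing items one by one. Batch input blurs the boundaries between records, and that produces mix-ups.
</li>
<li>
The model only guarantees linguistic consistency, not index correspondence
</li>
<li>
The model aligns patterns instead of analyzing item by item
In a batch of contracts you supply a lot of text, but what the model sees is:
<pre><code> contracts_records: [
{ A },
{ B },
{ C },
...
]
</code></pre>
The LLM typically aggregates patterns (pattern aggregation): it extracts common features and generates output in a uniform style, without preserving the boundary of each item.
This is called pattern collapse; it is a natural tendency of LLMs, not a bug.
</li>
<li>
The model interprets the task as "summarize a list" rather than "output N independent results"
</li>
<li>
The model may hallucinate URLs
After seeing many URL patterns, the model learns how URLs are constructed. When it needs to output a URL without factual grounding, it makes up a plausible-looking one; this behavior is called structural hallucination.
</li>
</ol>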
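Given these causes, the record-by-record approach can be reinforced by never trusting the model for identifying fields. The sketch below sends one record per call and copies <code>url</code> and <code>title</code> from the input; <code>process_records</code> and <code>fake_llm</code> are hypothetical names, and <code>fake_llm</code> stands in for a real call such as <code>call_local_llm</code>:

```python
from typing import Callable


def process_records(records: list[dict], call_llm: Callable[[dict], dict]) -> list[dict]:
    """Send one record per call and copy identifying fields from the input, not the model."""
    results = []
    for record in records:
        extracted = call_llm(record)  # the model sees exactly one record
        results.append({
            **extracted,
            "url": record["url"],      # taken from the input, so it cannot be invented
            "title": record["title"],
        })
    return results


def fake_llm(record: dict) -> dict:
    """Stand-in for a real model call; deliberately returns a wrong URL."""
    return {"url": "https://example.com/made-up", "scenario": f"summary of {record['title']}"}


records = [
    {"url": "https://example.com/a", "title": "Contract A"},
    {"url": "https://example.com/b", "title": "Contract B"},
]
for row in process_records(records, fake_llm):
    print(row["url"], "|", row["scenario"])
```

Even though the stand-in model returns a made-up URL, each result keeps the URL of the record that was actually sent.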
Source: https://www.cnblogs.com/Eternal-Higanbana/p/19472754
