A Deep Dive into the FDT File Deduplication Tool: An Efficient, Smart Solution for Duplicate Content
<section id="nice" data-tool="mdnice编辑器" data-website="https://www.mdnice.com" style="margin: 0; padding: 0 10px; background: linear-gradient(90deg, rgba(50, 0, 0, 0.05) 0, rgba(0, 0, 0, 0) 6.76%) left top / 20px 20px repeat scroll padding-box border-box, linear-gradient(360deg, rgba(50, 0, 0, 0.05) 0, rgba(249, 247, 252, 0) 9.46%) repeat rgba(0, 0, 0, 0); width: auto; font-family: Optima, 'Microsoft YaHei', PingFangSC-regular, serif; font-size: 16px; color: rgba(0, 0, 0, 1); line-height: 1.5em; word-spacing: 0; letter-spacing: 0; overflow-wrap: break-word; text-align: left"><h2 data-tool="mdnice编辑器" style="margin: 30px 0 15px; align-items: unset; background: none left top / auto no-repeat scroll padding-box border-box unset; border: 1px none rgba(0, 0, 0, 1); border-radius: 0; box-shadow: none; display: block; flex-direction: unset; float: unset; height: auto; justify-content: unset; line-height: 1.5em; overflow-x: unset; overflow-y: unset; padding: 0; position: relative; text-align: left; text-shadow: none; transform: none; width: auto; -webkit-box-reflect: unset"><span class="prefix" style="display: none"></span><span class="content" style="font-size: 18px; color: rgba(89, 89, 89, 1); line-height: 1.8em; letter-spacing: 0; padding: 0 0 0 10px; border-top: 1px none rgba(0, 0, 0, 1); border-bottom: 1px none rgba(0, 0, 0, 1); border-left: 5px solid rgba(222, 198, 251, 1); border-right: 1px none rgba(0, 0, 0, 1); border-radius: 0; align-items: unset; background: none left top / auto no-repeat scroll padding-box border-box unset; box-shadow: none; display: block; font-weight: bold; flex-direction: unset; float: unset; height: auto; justify-content: unset; margin: 0; overflow-x: unset; overflow-y: unset; position: relative; text-align: left; text-indent: 0; text-shadow: none; transform: none; width: auto; -webkit-box-reflect: unset">I. Core Value and Key Innovations</span><span class="suffix" style="display: none"></span></h2>
<p data-tool="mdnice编辑器" style="color: rgba(89, 89, 89, 1); font-size: 14px; line-height: 1.8em; letter-spacing: 0.02em; text-align: left; text-indent: 0; margin: 0; padding: 8px 0"><strong style="color: rgba(145, 109, 213, 1); font-weight: bold; background: none left top / auto no-repeat scroll padding-box border-box rgba(0, 0, 0, 0); width: auto; height: auto; margin: 0; padding: 0; border: 3px none rgba(0, 0, 0, 0.4); border-radius: 0">Note:</strong> The source code is attached at the end of the article; it can also be downloaded from GitHub (乐茵安全) or from the author's CSDN page (乐茵安全).</p>
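<p data-tool="mdnice编辑器" style="color: rgba(89, 89, 89, 1); font-size: 14px; line-height: 1.8em; letter-spacing: 0.02em; text-align: left; text-indent: 0; margin: 0; padding: 8px 0">FDT addresses a frequent pain point in document processing: duplicate text that appears when content from several sources is merged. Compared with manual comparison, its main innovations are:</p>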
<ol data-tool="mdnice编辑器" style="list-style-type: decimal; margin: 8px 0; padding: 0 0 0 25px; color: rgba(0, 0, 0, 1)">
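<li><section style="margin-top: 5px; margin-bottom: 5px; color: rgba(89, 89, 89, 1); font-size: 14px; line-height: 1.8em; letter-spacing: 0; text-align: left; font-weight: normal"><p style="color: rgba(89, 89, 89, 1); font-size: 14px; line-height: 1.8em; letter-spacing: 0.02em; text-align: left; text-indent: 0; margin: 0; padding: 8px 0">Unified handling across formats</p>
<p style="color: rgba(89, 89, 89, 1); font-size: 14px; line-height: 1.8em; letter-spacing: 0.02em; text-align: left; text-indent: 0; margin: 0; padding: 8px 0">• A modular design supports five common document formats (TXT/DOC/DOCX/XLS/XLSX)</p>
<p style="color: rgba(89, 89, 89, 1); font-size: 14px; line-height: 1.8em; letter-spacing: 0.02em; text-align: left; text-indent: 0; margin: 0; padding: 8px 0">• A purpose-built format adapter switches the processing mode automatically, based on the file extension:</p>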
</section></li></ol>
<pre class="custom" data-tool="mdnice编辑器" style="border-radius: 5px; box-shadow: 0 2px 10px rgba(0, 0, 0, 0.55); text-align: left; margin: 10px 0; padding: 0"><span style="display: block; background: url('https://files.mdnice.com/user/3441/876cad08-0422-409d-bb5a-08afec5da8ee.svg') 10px 10px / 40px no-repeat rgba(40, 44, 52, 1); height: 30px; width: 100%; margin-bottom: -7px; border-radius: 5px"></span><code class="hljs" style="overflow-x: auto; padding: 15px 16px 16px; color: rgba(171, 178, 191, 1); background: rgba(40, 44, 52, 1); border-radius: 5px; font-family: Consolas, Monaco, Menlo, monospace; font-size: 12px"><span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">if</span> ext == <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"docx"</span>:<br>    <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">return</span> self.extract_text_from_docx(file_path)<br><span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">elif</span> ext <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">in</span> [<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"xls"</span>, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"xlsx"</span>]:<br>    <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">return</span> self.extract_text_from_excel(file_path)<br></code></pre>
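<p data-tool="mdnice编辑器" style="color: rgba(89, 89, 89, 1); font-size: 14px; line-height: 1.8em; letter-spacing: 0.02em; text-align: left; text-indent: 0; margin: 0; padding: 8px 0">The extension check above can also be written as a lookup table. A minimal sketch, assuming three hypothetical stand-in extractors (extract_txt, extract_docx, extract_excel) that are not taken from the FDT source and simply return a label:</p>

```python
import os


def extract_txt(path):
    # Hypothetical stand-in for the tool's plain-text extractor.
    return f"txt:{path}"


def extract_docx(path):
    # Hypothetical stand-in for the tool's DOCX extractor.
    return f"docx:{path}"


def extract_excel(path):
    # Hypothetical stand-in for the tool's Excel extractor.
    return f"excel:{path}"


# Map each extension to its extractor (illustrative, not exhaustive).
EXTRACTORS = {
    "txt": extract_txt,
    "docx": extract_docx,
    "xls": extract_excel,
    "xlsx": extract_excel,
}


def extract_text(file_path):
    """Pick an extractor from the file extension, case-insensitively."""
    ext = os.path.splitext(file_path)[1].lstrip(".").lower()
    try:
        extractor = EXTRACTORS[ext]
    except KeyError:
        raise ValueError(f"Unsupported file type: {ext or '(none)'}") from None
    return extractor(file_path)
```

<p data-tool="mdnice编辑器" style="color: rgba(89, 89, 89, 1); font-size: 14px; line-height: 1.8em; letter-spacing: 0.02em; text-align: left; text-indent: 0; margin: 0; padding: 8px 0">With this layout, supporting another format means adding one table entry, and an unsupported extension fails with a clear error instead of falling through silently.</p>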
<ol start="2" data-tool="mdnice编辑器" style="list-style-type: decimal; margin: 8px 0; padding: 0 0 0 25px; color: rgba(0, 0, 0, 1)">
<li><section style="margin-top: 5px; margin-bottom: 5px; color: rgba(89, 89, 89, 1); font-size: 14px; line-height: 1.8em; letter-spacing: 0; text-align: left; font-weight: normal"><p style="color: rgba(89, 89, 89, 1); font-size: 14px; line-height: 1.8em; letter-spacing: 0.02em; text-align: left; text-indent: 0; margin: 0; padding: 8px 0">Lossless content extraction</p>
<p style="color: rgba(89, 89, 89, 1); font-size: 14px; line-height: 1.8em; letter-spacing: 0.02em; text-align: left; text-indent: 0; margin: 0; padding: 8px 0">• <strong style="color: rgba(145, 109, 213, 1); font-weight: bold; background: none left top / auto no-repeat scroll padding-box border-box rgba(0, 0, 0, 0); width: auto; height: auto; margin: 0; padding: 0; border: 3px none rgba(0, 0, 0, 0.4); border-radius: 0">DOCX reverse engineering:</strong> unzips the file and parses its XML to recover the document structure (no Office dependency)</p>
<p style="color: rgba(89, 89, 89, 1); font-size: 14px; line-height: 1.8em; letter-spacing: 0.02em; text-align: left; text-indent: 0; margin: 0; padding: 8px 0">• <strong style="color: rgba(145, 109, 213, 1); font-weight: bold; background: none left top / auto no-repeat scroll padding-box border-box rgba(0, 0, 0, 0); width: auto; height: auto; margin: 0; padding: 0; border: 3px none rgba(0, 0, 0, 0.4); border-radius: 0">Table reconstruction:</strong> uses pandas to rebuild the data structure of Excel files</p>
<p style="color: rgba(89, 89, 89, 1); font-size: 14px; line-height: 1.8em; letter-spacing: 0.02em; text-align: left; text-indent: 0; margin: 0; padding: 8px 0">• <strong style="color: rgba(145, 109, 213, 1); font-weight: bold; background: none left top / auto no-repeat scroll padding-box border-box rgba(0, 0, 0, 0); width: auto; height: auto; margin: 0; padding: 0; border: 3px none rgba(0, 0, 0, 0.4); border-radius: 0">Text stream processing:</strong> reads TXT/DOC files as a stream to avoid running out of memory</p>
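</section></li><li><section style="margin-top: 5px; margin-bottom: 5px; color: rgba(89, 89, 89, 1); font-size: 14px; line-height: 1.8em; letter-spacing: 0; text-align: left; font-weight: normal"><p style="color: rgba(89, 89, 89, 1); font-size: 14px; line-height: 1.8em; letter-spacing: 0.02em; text-align: left; text-indent: 0; margin: 0; padding: 8px 0">Live preview</p>
<p style="color: rgba(89, 89, 89, 1); font-size: 14px; line-height: 1.8em; letter-spacing: 0.02em; text-align: left; text-indent: 0; margin: 0; padding: 8px 0">• Shows the processing result in real time and highlights key information (table rows in purple, text rows in black)</p>
<p style="color: rgba(89, 89, 89, 1); font-size: 14px; line-height: 1.8em; letter-spacing: 0.02em; text-align: left; text-indent: 0; margin: 0; padding: 8px 0">• A statistics panel shows the deduplication result at a glance:</p>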
</section></li></ol>
<pre class="custom" data-tool="mdnice编辑器" style="border-radius: 5px; box-shadow: 0 2px 10px rgba(0, 0, 0, 0.55); text-align: left; margin: 10px 0; padding: 0"><span style="display: block; background: url('https://files.mdnice.com/user/3441/876cad08-0422-409d-bb5a-08afec5da8ee.svg') 10px 10px / 40px no-repeat rgba(40, 44, 52, 1); height: 30px; width: 100%; margin-bottom: -7px; border-radius: 5px"></span><code class="hljs" style="overflow-x: auto; padding: 15px 16px 16px; color: rgba(171, 178, 191, 1); background: rgba(40, 44, 52, 1); border-radius: 5px; font-family: Consolas, Monaco, Menlo, monospace; font-size: 12px">✓ Deduplication completed successfully!<br><br> Result statistics:<br> Original lines: 384<br> Lines after deduplication: 217<br> Duplicate lines removed: 167<br></code></pre>
<h2 data-tool="mdnice编辑器" style="margin: 30px 0 15px; align-items: unset; background: none left top / auto no-repeat scroll padding-box border-box unset; border: 1px none rgba(0, 0, 0, 1); border-radius: 0; box-shadow: none; display: block; flex-direction: unset; float: unset; height: auto; justify-content: unset; line-height: 1.5em; overflow-x: unset; overflow-y: unset; padding: 0; position: relative; text-align: left; text-shadow: none; transform: none; width: auto; -webkit-box-reflect: unset"><span class="prefix" style="display: none"></span><span class="content" style="font-size: 18px; color: rgba(89, 89, 89, 1); line-height: 1.8em; letter-spacing: 0; padding: 0 0 0 10px; border-top: 1px none rgba(0, 0, 0, 1); border-bottom: 1px none rgba(0, 0, 0, 1); border-left: 5px solid rgba(222, 198, 251, 1); border-right: 1px none rgba(0, 0, 0, 1); border-radius: 0; align-items: unset; background: none left top / auto no-repeat scroll padding-box border-box unset; box-shadow: none; display: block; font-weight: bold; flex-direction: unset; float: unset; height: auto; justify-content: unset; margin: 0; overflow-x: unset; overflow-y: unset; position: relative; text-align: left; text-indent: 0; text-shadow: none; transform: none; width: auto; -webkit-box-reflect: unset">II. Core Implementation in Detail</span><span class="suffix" style="display: none"></span></h2>
<ol data-tool="mdnice编辑器" style="list-style-type: decimal; margin: 8px 0; padding: 0 0 0 25px; color: rgba(0, 0, 0, 1)">
<li><section style="margin-top: 5px; margin-bottom: 5px; color: rgba(89, 89, 89, 1); font-size: 14px; line-height: 1.8em; letter-spacing: 0; text-align: left; font-weight: normal">Multiple content extraction strategies</section></li></ol>
<h3 data-tool="mdnice编辑器" style="margin: 30px 0 15px; align-items: unset; background: none left top / auto no-repeat scroll padding-box border-box rgba(0, 0, 0, 0); border: 1px none rgba(0, 0, 0, 1); border-radius: 0; box-shadow: none; display: flex; flex-direction: unset; float: unset; height: auto; justify-content: center; line-height: 1.5em; overflow-x: unset; overflow-y: unset; padding: 0; position: relative; text-align: left; text-shadow: none; transform: none; width: auto; -webkit-box-reflect: unset"><span class="prefix" style="display: none"></span><span class="content" style="font-size: 17px; color: rgba(89, 89, 89, 1); border-bottom: 2px solid rgba(222, 198, 251, 1); line-height: 1.5em; letter-spacing: 0; padding: 0; align-items: unset; background: none left top / auto no-repeat scroll padding-box border-box rgba(0, 0, 0, 0); border-top: 1px none rgba(0, 0, 0, 1); border-left: 1px none rgba(0, 0, 0, 1); border-right: 1px none rgba(0, 0, 0, 1); border-radius: 0; box-shadow: none; display: inline; font-weight: bold; flex-direction: unset; float: unset; height: auto; justify-content: unset; margin: 0; overflow-x: unset; overflow-y: unset; position: relative; text-align: left; text-indent: 0; text-shadow: none; transform: none; width: auto; -webkit-box-reflect: unset">DOCX processing flow</span><span class="suffix" style="display: none"></span></h3>
<pre class="custom" data-tool="mdnice编辑器" style="border-radius: 5px; box-shadow: 0 2px 10px rgba(0, 0, 0, 0.55); text-align: left; margin: 10px 0; padding: 0"><span style="display: block; background: url('https://files.mdnice.com/user/3441/876cad08-0422-409d-bb5a-08afec5da8ee.svg') 10px 10px / 40px no-repeat rgba(40, 44, 52, 1); height: 30px; width: 100%; margin-bottom: -7px; border-radius: 5px"></span><code class="hljs" style="overflow-x: auto; padding: 15px 16px 16px; color: rgba(171, 178, 191, 1); background: rgba(40, 44, 52, 1); border-radius: 5px; font-family: Consolas, Monaco, Menlo, monospace; font-size: 12px">with zipfile.ZipFile(file_path, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">'r'</span>) as zip_ref:<br>    zip_ref.extractall(tmp_dir)  <span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># extract into the temporary directory</span><br></code></pre>
<h3 data-tool="mdnice编辑器" style="margin: 30px 0 15px; align-items: unset; background: none left top / auto no-repeat scroll padding-box border-box rgba(0, 0, 0, 0); border: 1px none rgba(0, 0, 0, 1); border-radius: 0; box-shadow: none; display: flex; flex-direction: unset; float: unset; height: auto; justify-content: center; line-height: 1.5em; overflow-x: unset; overflow-y: unset; padding: 0; position: relative; text-align: left; text-shadow: none; transform: none; width: auto; -webkit-box-reflect: unset"><span class="prefix" style="display: none"></span><span class="content" style="font-size: 17px; color: rgba(89, 89, 89, 1); border-bottom: 2px solid rgba(222, 198, 251, 1); line-height: 1.5em; letter-spacing: 0; padding: 0; align-items: unset; background: none left top / auto no-repeat scroll padding-box border-box rgba(0, 0, 0, 0); border-top: 1px none rgba(0, 0, 0, 1); border-left: 1px none rgba(0, 0, 0, 1); border-right: 1px none rgba(0, 0, 0, 1); border-radius: 0; box-shadow: none; display: inline; font-weight: bold; flex-direction: unset; float: unset; height: auto; justify-content: unset; margin: 0; overflow-x: unset; overflow-y: unset; position: relative; text-align: left; text-indent: 0; text-shadow: none; transform: none; width: auto; -webkit-box-reflect: unset">Structured XML parsing</span><span class="suffix" style="display: none"></span></h3>
<pre class="custom" data-tool="mdnice编辑器" style="border-radius: 5px; box-shadow: 0 2px 10px rgba(0, 0, 0, 0.55); text-align: left; margin: 10px 0; padding: 0"><span style="display: block; background: url('https://files.mdnice.com/user/3441/876cad08-0422-409d-bb5a-08afec5da8ee.svg') 10px 10px / 40px no-repeat rgba(40, 44, 52, 1); height: 30px; width: 100%; margin-bottom: -7px; border-radius: 5px"></span><code class="hljs" style="overflow-x: auto; padding: 15px 16px 16px; color: rgba(171, 178, 191, 1); background: rgba(40, 44, 52, 1); border-radius: 5px; font-family: Consolas, Monaco, Menlo, monospace; font-size: 12px">namespaces = {<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">'w'</span>: <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">'http://schemas.../main'</span>}<br><span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">for</span> paragraph <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">in</span> root.findall(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">'.//w:p'</span>, namespaces):  <span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># locate each paragraph precisely</span><br>    ...  <span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># per-paragraph handling is not shown in this excerpt</span><br></code></pre>
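<p data-tool="mdnice编辑器" style="color: rgba(89, 89, 89, 1); font-size: 14px; line-height: 1.8em; letter-spacing: 0.02em; text-align: left; text-indent: 0; margin: 0; padding: 8px 0">Taken together, the two steps above amount to opening the archive and walking the paragraphs of word/document.xml. A minimal sketch, not the tool's actual code: the function name docx_paragraphs is hypothetical, it reads the XML straight from the archive rather than extracting to a temporary directory, and it uses the standard WordprocessingML namespace URI:</p>

```python
import zipfile
import xml.etree.ElementTree as ET

# Standard WordprocessingML namespace used by word/document.xml.
W_NS = {"w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main"}


def docx_paragraphs(file_path):
    """Return the non-empty paragraph texts of a .docx file."""
    with zipfile.ZipFile(file_path, "r") as zf:
        # Read the main document part directly from the archive.
        root = ET.fromstring(zf.read("word/document.xml"))
    paragraphs = []
    for p in root.findall(".//w:p", W_NS):
        # A paragraph's text is split across w:t runs; join them.
        text = "".join(t.text or "" for t in p.findall(".//w:t", W_NS))
        if text.strip():
            paragraphs.append(text)
    return paragraphs
```

<p data-tool="mdnice编辑器" style="color: rgba(89, 89, 89, 1); font-size: 14px; line-height: 1.8em; letter-spacing: 0.02em; text-align: left; text-indent: 0; margin: 0; padding: 8px 0">Joining the runs per paragraph keeps each paragraph on one line, which is the unit that line-based deduplication compares.</p>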
<ol start="2" data-tool="mdnice编辑器" style="list-style-type: decimal; margin: 8px 0; padding: 0 0 0 25px; color: rgba(0, 0, 0, 1)">
<li><section style="margin-top: 5px; margin-bottom: 5px; color: rgba(89, 89, 89, 1); font-size: 14px; line-height: 1.8em; letter-spacing: 0; text-align: left; font-weight: normal">Deduplication algorithm</section></li></ol>
<pre class="custom" data-tool="mdnice编辑器" style="border-radius: 5px; box-shadow: 0 2px 10px rgba(0, 0, 0, 0.55); text-align: left; margin: 10px 0; padding: 0"><span style="display: block; background: url('https://files.mdnice.com/user/3441/876cad08-0422-409d-bb5a-08afec5da8ee.svg') 10px 10px / 40px no-repeat rgba(40, 44, 52, 1); height: 30px; width: 100%; margin-bottom: -7px; border-radius: 5px"></span><code class="hljs" style="overflow-x: auto; padding: 15px 16px 16px; color: rgba(171, 178, 191, 1); background: rgba(40, 44, 52, 1); border-radius: 5px; font-family: Consolas, Monaco, Menlo, monospace; font-size: 12px">seen = <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">set</span>()<br>unique_lines = []<br><br><span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">for</span> line <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">in</span> lines:<br>    key = line.strip().lower() <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">if</span> <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">'\t'</span> not <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">in</span> line <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">else</span> line.strip()<br>    <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">if</span> key not <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">in</span> seen:  <span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># efficient hash-based set lookup</span><br>        seen.add(key)<br>        unique_lines.append(line)  <span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># preserve the original order</span><br></code></pre>
<p data-tool="mdnice编辑器" style="color: rgba(89, 89, 89, 1); font-size: 14px; line-height: 1.8em; letter-spacing: 0.02em; text-align: left; text-indent: 0; margin: 0; padding: 8px 0">• <strong style="color: rgba(145, 109, 213, 1); font-weight: bold; background: none left top / auto no-repeat scroll padding-box border-box rgba(0, 0, 0, 0); width: auto; height: auto; margin: 0; padding: 0; border: 3px none rgba(0, 0, 0, 0.4); border-radius: 0">Dual-mode duplicate check:</strong> plain text is lowercased before comparison, while table rows (lines containing a tab) must match exactly</p>
<p data-tool="mdnice编辑器" style="color: rgba(89, 89, 89, 1); font-size: 14px; line-height: 1.8em; letter-spacing: 0.02em; text-align: left; text-indent: 0; margin: 0; padding: 8px 0">• <strong style="color: rgba(145, 109, 213, 1); font-weight: bold; background: none left top / auto no-repeat scroll padding-box border-box rgba(0, 0, 0, 0); width: auto; height: auto; margin: 0; padding: 0; border: 3px none rgba(0, 0, 0, 0.4); border-radius: 0">Time complexity:</strong> O(n) on average, so inputs with millions of lines remain practical</p>
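<p data-tool="mdnice编辑器" style="color: rgba(89, 89, 89, 1); font-size: 14px; line-height: 1.8em; letter-spacing: 0.02em; text-align: left; text-indent: 0; margin: 0; padding: 8px 0">The fragment above can be wrapped in a function and checked on a small input. A sketch under the assumption that a tab marks a table row, as in the excerpt; the names dedupe_lines and summarize are illustrative, not taken from the FDT source:</p>

```python
def dedupe_lines(lines):
    """Drop repeated lines, keeping the first occurrence and the original order."""
    seen = set()
    unique_lines = []
    for line in lines:
        stripped = line.strip()
        # Table rows (tab-separated) are compared exactly;
        # plain text is compared case-insensitively.
        key = stripped if "\t" in line else stripped.lower()
        if key not in seen:
            seen.add(key)
            unique_lines.append(line)
    return unique_lines


def summarize(lines, unique_lines):
    """Counts of the kind shown in the statistics panel."""
    return {
        "original": len(lines),
        "unique": len(unique_lines),
        "removed": len(lines) - len(unique_lines),
    }


sample = ["Apple", "apple ", "A\tb", "a\tb", "A\tb"]
result = dedupe_lines(sample)
print(result)                     # ['Apple', 'A\tb', 'a\tb']
print(summarize(sample, result))  # {'original': 5, 'unique': 3, 'removed': 2}
```

<p data-tool="mdnice编辑器" style="color: rgba(89, 89, 89, 1); font-size: 14px; line-height: 1.8em; letter-spacing: 0.02em; text-align: left; text-indent: 0; margin: 0; padding: 8px 0">In the sample, "apple " is dropped as a case-insensitive duplicate of "Apple", while the table rows "A\tb" and "a\tb" are both kept because table rows are matched exactly.</p>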
<ol start="3" data-tool="mdnice编辑器" style="list-style-type: decimal; margin: 8px 0; padding: 0 0 0 25px; color: rgba(0, 0, 0, 1)">
<li><section style="margin-top: 5px; margin-bottom: 5px; color: rgba(89, 89, 89, 1); font-size: 14px; line-height: 1.8em; letter-spacing: 0; text-align: left; font-weight: normal">Adaptive output engine</section></li></ol>
<pre class="custom" data-tool="mdnice编辑器" style="border-radius: 5px; box-shadow: 0 2px 10px rgba(0, 0, 0, 0.55); text-align: left; margin: 10px 0; padding: 0"><span style="display: block; background: url('https://files.mdnice.com/user/3441/876cad08-0422-409d-bb5a-08afec5da8ee.svg') 10px 10px / 40px no-repeat rgba(40, 44, 52, 1); height: 30px; width: 100%; margin-bottom: -7px; border-radius: 5px"></span><code class="hljs" style="overflow-x: auto; padding: 15px 16px 16px; color: rgba(171, 178, 191, 1); background: rgba(40, 44, 52, 1); border-radius: 5px; font-family: Consolas, Monaco, Menlo, monospace; font-size: 12px"><span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">if</span> output_ext <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">in</span> [<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"xls"</span>, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"xlsx"</span>]:  <span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># rebuild the Excel sheet</span><br>    df = pd.DataFrame(data, columns=headers)<br>    df.to_excel(output_file, engine=<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">'openpyxl'</span>)<br><span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">else</span>:  <span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># plain-text output</span><br>    with open(output_file, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">'w'</span>, encoding=<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">'utf-8'</span>) as f:<br>        f.write(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">'\n'</span>.join(unique_lines))<br></code></pre>
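<p data-tool="mdnice编辑器" style="color: rgba(89, 89, 89, 1); font-size: 14px; line-height: 1.8em; letter-spacing: 0.02em; text-align: left; text-indent: 0; margin: 0; padding: 8px 0">A self-contained variant of the branch above might look as follows. This is a sketch, not the tool's code: save_output is a hypothetical name, each line is assumed to be a tab-separated row with the header in the first row and the same number of cells in every row, and only .xlsx is handled in the workbook branch because openpyxl does not write the legacy .xls format:</p>

```python
import os


def save_output(unique_lines, output_file):
    """Write deduplicated lines as an .xlsx workbook or as UTF-8 text."""
    ext = os.path.splitext(output_file)[1].lstrip(".").lower()
    if ext == "xlsx":
        import pandas as pd  # imported only when a workbook is requested
        # Assumption: tab-separated rows, first row is the header.
        rows = [line.split("\t") for line in unique_lines]
        df = pd.DataFrame(rows[1:], columns=rows[0])
        # index=False keeps pandas from writing its own row-number column.
        df.to_excel(output_file, index=False, engine="openpyxl")
    else:
        with open(output_file, "w", encoding="utf-8") as f:
            f.write("\n".join(unique_lines))
```

<p data-tool="mdnice编辑器" style="color: rgba(89, 89, 89, 1); font-size: 14px; line-height: 1.8em; letter-spacing: 0.02em; text-align: left; text-indent: 0; margin: 0; padding: 8px 0">For example, save_output(["a", "b"], "out.txt") writes the two lines to a text file and never imports pandas.</p>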
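<h2 data-tool="mdnice编辑器" style="margin: 30px 0 15px; align-items: unset; background: none left top / auto no-repeat scroll padding-box border-box unset; border: 1px none rgba(0, 0, 0, 1); border-radius: 0; box-shadow: none; display: block; flex-direction: unset; float: unset; height: auto; justify-content: unset; line-height: 1.5em; overflow-x: unset; overflow-y: unset; padding: 0; position: relative; text-align: left; text-shadow: none; transform: none; width: auto; -webkit-box-reflect: unset"><span class="prefix" style="display: none"></span><span class="content" style="font-size: 18px; color: rgba(89, 89, 89, 1); line-height: 1.8em; letter-spacing: 0; padding: 0 0 0 10px; border-top: 1px none rgba(0, 0, 0, 1); border-bottom: 1px none rgba(0, 0, 0, 1); border-left: 5px solid rgba(222, 198, 251, 1); border-right: 1px none rgba(0, 0, 0, 1); border-radius: 0; align-items: unset; background: none left top / auto no-repeat scroll padding-box border-box unset; box-shadow: none; display: block; font-weight: bold; flex-direction: unset; float: unset; height: auto; justify-content: unset; margin: 0; overflow-x: unset; overflow-y: unset; position: relative; text-align: left; text-indent: 0; text-shadow: none; transform: none; width: auto; -webkit-box-reflect: unset">III. Architecture Essentials</span><span class="suffix" style="display: none"></span></h2>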
<ol data-tool="mdnice编辑器" style="list-style-type: decimal; margin: 8px 0; padding: 0 0 0 25px; color: rgba(0, 0, 0, 1)">
<li><section style="margin-top: 5px; margin-bottom: 5px; color: rgba(89, 89, 89, 1); font-size: 14px; line-height: 1.8em; letter-spacing: 0; text-align: left; font-weight: normal"><p style="color: rgba(89, 89, 89, 1); font-size: 14px; line-height: 1.8em; letter-spacing: 0.02em; text-align: left; text-indent: 0; margin: 0; padding: 8px 0">Layered safeguards</p>
<p style="color: rgba(89, 89, 89, 1); font-size: 14px; line-height: 1.8em; letter-spacing: 0.02em; text-align: left; text-indent: 0; margin: 0; padding: 8px 0">• <strong style="color: rgba(145, 109, 213, 1); font-weight: bold; background: none left top / auto no-repeat scroll padding-box border-box rgba(0, 0, 0, 0); width: auto; height: auto; margin: 0; padding: 0; border: 3px none rgba(0, 0, 0, 0.4); border-radius: 0">Pre-checks:</strong> verifies that the file exists and that its format is valid</p>
<p style="color: rgba(89, 89, 89, 1); font-size: 14px; line-height: 1.8em; letter-spacing: 0.02em; text-align: left; text-indent: 0; margin: 0; padding: 8px 0">• <strong style="color: rgba(145, 109, 213, 1); font-weight: bold; background: none left top / auto no-repeat scroll padding-box border-box rgba(0, 0, 0, 0); width: auto; height: auto; margin: 0; padding: 0; border: 3px none rgba(0, 0, 0, 0.4); border-radius: 0">Operation guard:</strong> asks for confirmation before overwriting the original file</p>
</section></li></ol>
<pre class="custom" data-tool="mdnice编辑器" style="border-radius: 5px; box-shadow: 0 2px 10px rgba(0, 0, 0, 0.55); text-align: left; margin: 10px 0; padding: 0"><span style="display: block; background: url('https://files.mdnice.com/user/3441/876cad08-0422-409d-bb5a-08afec5da8ee.svg') 10px 10px / 40px no-repeat rgba(40, 44, 52, 1); height: 30px; width: 100%; margin-bottom: -7px; border-radius: 5px"></span><code class="hljs" style="overflow-x: auto; padding: 15px 16px 16px; color: rgba(171, 178, 191, 1); background: rgba(40, 44, 52, 1); border-radius: 5px; font-family: Consolas, Monaco, Menlo, monospace; font-size: 12px"><span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">if</span> input_file == output_file:<br>    <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">if</span> not messagebox.askyesno(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"确认覆盖"</span>...):  <span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># dialog title: "Confirm overwrite"</span><br>        <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">return</span><br></code></pre>
<ol start="2" data-tool="mdnice编辑器" style="list-style-type: decimal; margin: 8px 0; padding: 0 0 0 25px; color: rgba(0, 0, 0, 1)">
<li><section style="margin-top: 5px; margin-bottom: 5px; color: rgba(89, 89, 89, 1); font-size: 14px; line-height: 1.8em; letter-spacing: 0; text-align: left; font-weight: normal"><p style="color: rgba(89, 89, 89, 1); font-size: 14px; line-height: 1.8em; letter-spacing: 0.02em; text-align: left; text-indent: 0; margin: 0; padding: 8px 0">Resource management</p>
<p style="color: rgba(89, 89, 89, 1); font-size: 14px; line-height: 1.8em; letter-spacing: 0.02em; text-align: left; text-indent: 0; margin: 0; padding: 8px 0">• Temporary directories are removed automatically:</p>
</section></li></ol>
<pre class="custom" data-tool="mdnice编辑器" style="border-radius: 5px; box-shadow: 0 2px 10px rgba(0, 0, 0, 0.55); text-align: left; margin: 10px 0; padding: 0"><span style="display: block; background: url('https://files.mdnice.com/user/3441/876cad08-0422-409d-bb5a-08afec5da8ee.svg') 10px 10px / 40px no-repeat rgba(40, 44, 52, 1); height: 30px; width: 100%; margin-bottom: -7px; border-radius: 5px"></span><code class="hljs" style="overflow-x: auto; padding: 15px 16px 16px; color: rgba(171, 178, 191, 1); background: rgba(40, 44, 52, 1); border-radius: 5px; font-family: Consolas, Monaco, Menlo, monospace; font-size: 12px">with tempfile.TemporaryDirectory() as tmp_dir:  <span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># the context manager handles cleanup</span><br>    <span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># processing logic</span><br></code></pre>
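<p data-tool="mdnice编辑器" style="color: rgba(89, 89, 89, 1); font-size: 14px; line-height: 1.8em; letter-spacing: 0.02em; text-align: left; text-indent: 0; margin: 0; padding: 8px 0">A small self-contained check of that cleanup behaviour; the function name work_in_temp_dir and the scratch file are illustrative only:</p>

```python
import os
import tempfile


def work_in_temp_dir():
    """Create a scratch file and return the path of the directory that held it."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        scratch = os.path.join(tmp_dir, "scratch.txt")
        with open(scratch, "w", encoding="utf-8") as f:
            f.write("intermediate data")
        assert os.path.exists(scratch)
        # Leaving the with-block deletes tmp_dir and everything in it,
        # even if the body raised an exception.
    return tmp_dir


leftover = work_in_temp_dir()
print(os.path.exists(leftover))  # False
```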
<ol start="3" data-tool="mdnice编辑器" style="list-style-type: decimal; margin: 8px 0; padding: 0 0 0 25px; color: rgba(0, 0, 0, 1)">
<li><section style="margin-top: 5px; margin-bottom: 5px; color: rgba(89, 89, 89, 1); font-size: 14px; line-height: 1.8em; letter-spacing: 0; text-align: left; font-weight: normal"><p style="color: rgba(89, 89, 89, 1); font-size: 14px; line-height: 1.8em; letter-spacing: 0.02em; text-align: left; text-indent: 0; margin: 0; padding: 8px 0">Error isolation</p>
<p style="color: rgba(89, 89, 89, 1); font-size: 14px; line-height: 1.8em; letter-spacing: 0.02em; text-align: left; text-indent: 0; margin: 0; padding: 8px 0">• <strong style="color: rgba(145, 109, 213, 1); font-weight: bold; background: none left top / auto no-repeat scroll padding-box border-box rgba(0, 0, 0, 0); width: auto; height: auto; margin: 0; padding: 0; border: 3px none rgba(0, 0, 0, 0.4); border-radius: 0">Staged exception handling:</strong> extraction, deduplication, and saving errors are handled separately</p>
<p style="color: rgba(89, 89, 89, 1); font-size: 14px; line-height: 1.8em; letter-spacing: 0.02em; text-align: left; text-indent: 0; margin: 0; padding: 8px 0">• <strong style="color: rgba(145, 109, 213, 1); font-weight: bold; background: none left top / auto no-repeat scroll padding-box border-box rgba(0, 0, 0, 0); width: auto; height: auto; margin: 0; padding: 0; border: 3px none rgba(0, 0, 0, 0.4); border-radius: 0">User-friendly messages:</strong></p>
</section></li></ol>
<pre class="custom" data-tool="mdnice编辑器" style="border-radius: 5px; box-shadow: 0 2px 10px rgba(0, 0, 0, 0.55); text-align: left; margin: 10px 0; padding: 0"><span style="display: block; background: url('https://files.mdnice.com/user/3441/876cad08-0422-409d-bb5a-08afec5da8ee.svg') 10px 10px / 40px no-repeat rgba(40, 44, 52, 1); height: 30px; width: 100%; margin-bottom: -7px; border-radius: 5px"></span><code class="hljs" style="overflow-x: auto; padding: 15px 16px 16px; color: rgba(171, 178, 191, 1); background: rgba(40, 44, 52, 1); border-radius: 5px; font-family: Consolas, Monaco, Menlo, monospace; font-size: 12px">except Exception as e:<br>    <span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># dialog title: "Extraction error"; message: "DOCX processing failed:" plus the exception text</span><br>    messagebox.showerror(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"提取错误"</span>, f<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"DOCX处理失败:\n{str(e)}"</span>)<br></code></pre>
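<p data-tool="mdnice编辑器" style="color: rgba(89, 89, 89, 1); font-size: 14px; line-height: 1.8em; letter-spacing: 0.02em; text-align: left; text-indent: 0; margin: 0; padding: 8px 0">One way to keep the three stages apart is to tag every failure with the stage it came from, so the caller can choose the dialog title. A sketch with hypothetical names (StageError, run_stage, process, broken_extract); the tool itself reports errors through messagebox.showerror, as in the excerpt above:</p>

```python
class StageError(Exception):
    """Carries the name of the pipeline stage that failed."""

    def __init__(self, stage, cause):
        super().__init__(f"{stage} failed: {cause}")
        self.stage = stage


def run_stage(stage, func, *args):
    """Run one stage and re-raise any failure tagged with the stage name."""
    try:
        return func(*args)
    except Exception as e:
        raise StageError(stage, e) from e


def process(extract, dedupe, save, path):
    """Extract, deduplicate, and save, each stage isolated from the others."""
    lines = run_stage("extract", extract, path)
    unique = run_stage("dedupe", dedupe, lines)
    run_stage("save", save, unique)
    return unique


def broken_extract(path):
    # Simulated failure for the demonstration below.
    raise OSError("file is locked")


try:
    process(broken_extract, list, print, "demo.docx")
except StageError as err:
    print(err.stage, "|", err)  # extract | extract failed: file is locked
```

<p data-tool="mdnice编辑器" style="color: rgba(89, 89, 89, 1); font-size: 14px; line-height: 1.8em; letter-spacing: 0.02em; text-align: left; text-indent: 0; margin: 0; padding: 8px 0">In a GUI, the except branch would pass the stage name and the message to the error dialog instead of printing them.</p>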
<h2 data-tool="mdnice编辑器" style="margin: 30px 0 15px; align-items: unset; background: none left top / auto no-repeat scroll padding-box border-box unset; border: 1px none rgba(0, 0, 0, 1); border-radius: 0; box-shadow: none; display: block; flex-direction: unset; float: unset; height: auto; justify-content: unset; line-height: 1.5em; overflow-x: unset; overflow-y: unset; padding: 0; position: relative; text-align: left; text-shadow: none; transform: none; width: auto; -webkit-box-reflect: unset"><span class="prefix" style="display: none"></span><span class="content" style="font-size: 18px; color: rgba(89, 89, 89, 1); line-height: 1.8em; letter-spacing: 0; padding: 0 0 0 10px; border-top: 1px none rgba(0, 0, 0, 1); border-bottom: 1px none rgba(0, 0, 0, 1); border-left: 5px solid rgba(222, 198, 251, 1); border-right: 1px none rgba(0, 0, 0, 1); border-radius: 0; align-items: unset; background: none left top / auto no-repeat scroll padding-box border-box unset; box-shadow: none; display: block; font-weight: bold; flex-direction: unset; float: unset; height: auto; justify-content: unset; margin: 0; overflow-x: unset; overflow-y: unset; position: relative; text-align: left; text-indent: 0; text-shadow: none; transform: none; width: auto; -webkit-box-reflect: unset">IV. Performance Optimization Strategies</span><span class="suffix" style="display: none"></span></h2>
<ol data-tool="mdnice编辑器" style="list-style-type: decimal; margin: 8px 0; padding: 0 0 0 25px; color: rgba(0, 0, 0, 1)">
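<li><section style="margin-top: 5px; margin-bottom: 5px; color: rgba(89, 89, 89, 1); font-size: 14px; line-height: 1.8em; letter-spacing: 0; text-align: left; font-weight: normal"><p style="color: rgba(89, 89, 89, 1); font-size: 14px; line-height: 1.8em; letter-spacing: 0.02em; text-align: left; text-indent: 0; margin: 0; padding: 8px 0">Lazy loading</p>
<p style="color: rgba(89, 89, 89, 1); font-size: 14px; line-height: 1.8em; letter-spacing: 0.02em; text-align: left; text-indent: 0; margin: 0; padding: 8px 0">• Large libraries (pandas/xlrd/openpyxl) are imported on demand</p>
</section></li></ol>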
<pre class="custom" data-tool="mdnice编辑器" style="border-radius: 5px; box-shadow: 0 2px 10px rgba(0, 0, 0, 0.55); text-align: left; margin: 10px 0; padding: 0"><span style="display: block; background: url('https://files.mdnice.com/user/3441/876cad08-0422-409d-bb5a-08afec5da8ee.svg') 10px 10px / 40px no-repeat rgba(40, 44, 52, 1); height: 30px; width: 100%; margin-bottom: -7px; border-radius: 5px"></span><code class="hljs" style="overflow-x: auto; padding: 15px 16px 16px; color: rgba(171, 178, 191, 1); background: rgba(40, 44, 52, 1); border-radius: 5px; font-family: Consolas, Monaco, Menlo, monospace; font-size: 12px">def extract_text_from_excel(self, file_path):<br>    import pandas as pd  <span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># imported at call time, not at startup</span><br></code></pre>
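<p data-tool="mdnice编辑器" style="color: rgba(89, 89, 89, 1); font-size: 14px; line-height: 1.8em; letter-spacing: 0.02em; text-align: left; text-indent: 0; margin: 0; padding: 8px 0">The same idea can be wrapped in a small helper that imports a module on first use and turns a missing dependency into a readable error. A sketch only: load_optional and excel_rows_as_lines are hypothetical helpers, not part of FDT, and the tab-joined row format is an assumption:</p>

```python
import importlib


def load_optional(module_name):
    """Import a module only when it is first needed."""
    try:
        return importlib.import_module(module_name)
    except ImportError as e:
        raise RuntimeError(
            f"'{module_name}' must be installed to handle this file type"
        ) from e


def excel_rows_as_lines(file_path):
    """Return each sheet row as one tab-separated line (assumed format)."""
    pd = load_optional("pandas")  # pandas is loaded here, not at startup
    # header=None treats every row as data; dtype=str keeps cells as text.
    frame = pd.read_excel(file_path, header=None, dtype=str)
    return ["\t".join(row) for row in frame.fillna("").values.tolist()]
```

<p data-tool="mdnice编辑器" style="color: rgba(89, 89, 89, 1); font-size: 14px; line-height: 1.8em; letter-spacing: 0.02em; text-align: left; text-indent: 0; margin: 0; padding: 8px 0">Users who only process TXT or DOCX files then never pay the import cost of pandas, and users without it installed get an explicit message when they first open a spreadsheet.</p>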
<ol start="2" data-tool="mdnice编辑器" style="list-style-type: decimal; margin: 8px 0; padding: 0 0 0 25px; color: rgba(0, 0, 0, 1)">
<li><section style="margin-top: 5px; margin-bottom: 5px; color: rgba(89, 89, 89, 1); font-size: 14px; line-height: 1.8em; letter-spacing: 0; text-align: left; font-weight: normal"><p style="color: rgba(89, 89, 89, 1); font-size: 14px; line-height: 1.8em; letter-spacing: 0.02em; text-align: left; text-indent: 0; margin: 0; padding: 8px 0">Streaming pipeline</p>
<p style="color: rgba(89, 89, 89, 1); font-size: 14px; line-height: 1.8em; letter-spacing: 0.02em; text-align: left; text-indent: 0; margin: 0; padding: 8px 0">• Reads in chunks instead of loading everything at once</p>
<p style="color: rgba(89, 89, 89, 1); font-size: 14px; line-height: 1.8em; letter-spacing: 0.02em; text-align: left; text-indent: 0; margin: 0; padding: 8px 0">• <strong style="color: rgba(145, 109, 213, 1); font-weight: bold; background: none left top / auto no-repeat scroll padding-box border-box rgba(0, 0, 0, 0); width: auto; height: auto; margin: 0; padding: 0; border: 3px none rgba(0, 0, 0, 0.4); border-radius: 0">Real-time progress feedback:</strong></p>
</section></li></ol>
<pre class="custom" data-tool="mdnice编辑器" style="border-radius: 5px; box-shadow: 0 2px 10px rgba(0, 0, 0, 0.55); text-align: left; margin: 10px 0; padding: 0"><span style="display: block; background: url('https://files.mdnice.com/user/3441/876cad08-0422-409d-bb5a-08afec5da8ee.svg') 10px 10px / 40px no-repeat rgba(40, 44, 52, 1); height: 30px; width: 100%; margin-bottom: -7px; border-radius: 5px"></span><code class="hljs" style="overflow-x: auto; padding: 15px 16px 16px; color: rgba(171, 178, 191, 1); background: rgba(40, 44, 52, 1); border-radius: 5px; font-family: Consolas, Monaco, Menlo, monospace; font-size: 12px">self.status_var.set(f<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"正在处理 {ext.upper()} 文件..."</span>)  <span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># status text: "Processing {EXT} file..."</span><br>self.root.update()  <span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># refresh the GUI immediately</span><br></code></pre>
<ol start="3" data-tool="mdnice编辑器" style="list-style-type: decimal; margin: 8px 0; padding: 0 0 0 25px; color: rgba(0, 0, 0, 1)">
<li><section style="margin-top: 5px; margin-bottom: 5px; color: rgba(89, 89, 89, 1); font-size: 14px; line-height: 1.8em; letter-spacing: 0; text-align: left; font-weight: normal"><p style="color: rgba(89, 89, 89, 1); font-size: 14px; line-height: 1.8em; letter-spacing: 0.02em; text-align: left; text-indent: 0; margin: 0; padding: 8px 0">Memory reduction</p>
<p style="color: rgba(89, 89, 89, 1); font-size: 14px; line-height: 1.8em; letter-spacing: 0.02em; text-align: left; text-indent: 0; margin: 0; padding: 8px 0">• Generators replace list caches</p>
<p style="color: rgba(89, 89, 89, 1); font-size: 14px; line-height: 1.8em; letter-spacing: 0.02em; text-align: left; text-indent: 0; margin: 0; padding: 8px 0">• String hashes are compared instead of storing the full text</p>
</section></li></ol>
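<p data-tool="mdnice编辑器" style="color: rgba(89, 89, 89, 1); font-size: 14px; line-height: 1.8em; letter-spacing: 0.02em; text-align: left; text-indent: 0; margin: 0; padding: 8px 0">The two techniques above can be sketched roughly as follows. This is an illustration under stated assumptions, not FDT's actual code: the function names <code>iter_unique_lines</code> and <code>dedupe_file</code> are hypothetical, the input is assumed to be UTF-8 text, and BLAKE2b with a 16-byte digest is simply one reasonable hash choice. The generator yields lines one at a time, and only a fixed-size digest is remembered for each distinct line.</p>

```python
import hashlib
from typing import Iterable, Iterator


def iter_unique_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield each distinct line once, in first-seen order.

    Only a 16-byte digest per distinct line is kept, not the line text.
    Line endings are ignored when comparing, so "a" and "a\n" match.
    """
    seen = set()
    for line in lines:
        key = line.rstrip("\r\n")
        digest = hashlib.blake2b(key.encode("utf-8"), digest_size=16).digest()
        if digest not in seen:
            seen.add(digest)
            yield line


def dedupe_file(src_path: str, dst_path: str) -> int:
    """Stream src_path to dst_path without duplicate lines.

    Returns the number of lines written. The file is read lazily,
    so the whole input is never held in memory at once.
    """
    written = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in iter_unique_lines(src):
            dst.write(line)
            written += 1
    return written
```

<p data-tool="mdnice编辑器" style="color: rgba(89, 89, 89, 1); font-size: 14px; line-height: 1.8em; letter-spacing: 0.02em; text-align: left; text-indent: 0; margin: 0; padding: 8px 0">The saving only matters when lines are noticeably longer than the digest, and a hash comparison accepts a very small theoretical risk of treating two different lines as equal.</p>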
<h2 data-tool="mdnice编辑器" style="margin: 30px 0 15px; align-items: unset; background: none left top / auto no-repeat scroll padding-box border-box unset; border: 1px none rgba(0, 0, 0, 1); border-radius: 0; box-shadow: none; display: block; flex-direction: unset; float: unset; height: auto; justify-content: unset; line-height: 1.5em; overflow-x: unset; overflow-y: unset; padding: 0; position: relative; text-align: left; text-shadow: none; transform: none; width: auto; -webkit-box-reflect: unset"><span class="prefix" style="display: none"></span><span class="content" style="font-size: 18px; color: rgba(89, 89, 89, 1); line-height: 1.8em; letter-spacing: 0; padding: 0 0 0 10px; border-top: 1px none rgba(0, 0, 0, 1); border-bottom: 1px none rgba(0, 0, 0, 1); border-left: 5px solid rgba(222, 198, 251, 1); border-right: 1px none rgba(0, 0, 0, 1); border-radius: 0; align-items: unset; background: none left top / auto no-repeat scroll padding-box border-box unset; box-shadow: none; display: block; font-weight: bold; flex-direction: unset; float: unset; height: auto; justify-content: unset; margin: 0; overflow-x: unset; overflow-y: unset; position: relative; text-align: left; text-indent: 0; text-shadow: none; transform: none; width: auto; -webkit-box-reflect: unset">5. Measured Results in Typical Use Cases</span><span class="suffix" style="display: none"></span></h2>
<section class="table-container" data-tool="mdnice编辑器" style="margin: 0; padding: 0; overflow-x: auto"><table style="display: table; text-align: left">
<thead>
<tr>
<th style="color: rgba(89, 89, 89, 1); font-size: 14px; line-height: 1.5em; letter-spacing: 0; text-align: left; font-weight: bold; background: none left top / auto no-repeat scroll padding-box border-box rgba(240, 240, 240, 1); width: auto; height: auto; border: 1px solid rgba(204, 204, 204, 0.4); border-radius: 0; padding: 5px 10px; min-width: 85px">Test document type</th>
<th style="color: rgba(89, 89, 89, 1); font-size: 14px; line-height: 1.5em; letter-spacing: 0; text-align: left; font-weight: bold; background: none left top / auto no-repeat scroll padding-box border-box rgba(240, 240, 240, 1); width: auto; height: auto; border: 1px solid rgba(204, 204, 204, 0.4); border-radius: 0; padding: 5px 10px; min-width: 85px">Time to process 100,000 lines</th>
<th style="color: rgba(89, 89, 89, 1); font-size: 14px; line-height: 1.5em; letter-spacing: 0; text-align: left; font-weight: bold; background: none left top / auto no-repeat scroll padding-box border-box rgba(240, 240, 240, 1); width: auto; height: auto; border: 1px solid rgba(204, 204, 204, 0.4); border-radius: 0; padding: 5px 10px; min-width: 85px">Duplicate rate</th>
<th style="color: rgba(89, 89, 89, 1); font-size: 14px; line-height: 1.5em; letter-spacing: 0; text-align: left; font-weight: bold; background: none left top / auto no-repeat scroll padding-box border-box rgba(240, 240, 240, 1); width: auto; height: auto; border: 1px solid rgba(204, 204, 204, 0.4); border-radius: 0; padding: 5px 10px; min-width: 85px">Peak memory</th>
</tr>
</thead>
<tbody style="font-size: 14px; line-height: 1.5em; letter-spacing: 0; text-align: left; font-weight: normal; border: 0; border-image: initial">
<tr style="color: rgba(89, 89, 89, 1); background: none left top / auto no-repeat scroll padding-box border-box rgba(255, 255, 255, 1); width: auto; height: auto">
<td style="padding: 5px 10px; min-width: 85px; border: 1px solid rgba(204, 204, 204, 0.4); border-radius: 0">Contract text (DOCX)</td>
<td style="padding: 5px 10px; min-width: 85px; border: 1px solid rgba(204, 204, 204, 0.4); border-radius: 0">12.3s</td>
<td style="padding: 5px 10px; min-width: 85px; border: 1px solid rgba(204, 204, 204, 0.4); border-radius: 0">41%</td>
<td style="padding: 5px 10px; min-width: 85px; border: 1px solid rgba(204, 204, 204, 0.4); border-radius: 0">85MB</td>
</tr>
<tr style="color: rgba(89, 89, 89, 1); background: none left top / auto no-repeat scroll padding-box border-box rgba(248, 248, 248, 1); width: auto; height: auto">
<td style="padding: 5px 10px; min-width: 85px; border: 1px solid rgba(204, 204, 204, 0.4); border-radius: 0">Sales report (XLSX)</td>
<td style="padding: 5px 10px; min-width: 85px; border: 1px solid rgba(204, 204, 204, 0.4); border-radius: 0">8.7s</td>
<td style="padding: 5px 10px; min-width: 85px; border: 1px solid rgba(204, 204, 204, 0.4); border-radius: 0">63%</td>
<td style="padding: 5px 10px; min-width: 85px; border: 1px solid rgba(204, 204, 204, 0.4); border-radius: 0">110MB</td>
</tr>
<tr style="color: rgba(89, 89, 89, 1); background: none left top / auto no-repeat scroll padding-box border-box rgba(255, 255, 255, 1); width: auto; height: auto">
<td style="padding: 5px 10px; min-width: 85px; border: 1px solid rgba(204, 204, 204, 0.4); border-radius: 0">Log file (TXT)</td>
<td style="padding: 5px 10px; min-width: 85px; border: 1px solid rgba(204, 204, 204, 0.4); border-radius: 0">3.2s</td>
<td style="padding: 5px 10px; min-width: 85px; border: 1px solid rgba(204, 204, 204, 0.4); border-radius: 0">28%</td>
<td style="padding: 5px 10px; min-width: 85px; border: 1px solid rgba(204, 204, 204, 0.4); border-radius: 0">45MB</td>
</tr>
</tbody>
</table>
</section><p data-tool="mdnice编辑器" style="color: rgba(89, 89, 89, 1); font-size: 14px; line-height: 1.8em; letter-spacing: 0.02em; text-align: left; text-indent: 0; margin: 0; padding: 8px 0"><strong style="color: rgba(145, 109, 213, 1); font-weight: bold; background: none left top / auto no-repeat scroll padding-box border-box rgba(0, 0, 0, 0); width: auto; height: auto; margin: 0; padding: 0; border: 3px none rgba(0, 0, 0, 0.4); border-radius: 0">Test environment:</strong> Intel Core i5-1135G7, 16 GB DDR4, Windows 11</p>
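<p data-tool="mdnice编辑器" style="color: rgba(89, 89, 89, 1); font-size: 14px; line-height: 1.8em; letter-spacing: 0.02em; text-align: left; text-indent: 0; margin: 0; padding: 8px 0">For readers who want to collect comparable numbers on their own files, here is one possible measurement harness. It is an assumption about how such figures could be gathered, not a description of how the table above was produced. The names <code>measure</code> and <code>dedupe</code> are hypothetical, and <code>dedupe</code> is a stand-in workload rather than FDT's routine. Note that <code>tracemalloc</code> reports only memory allocated by Python, so its peak is usually lower than the process peak shown by the operating system.</p>

```python
import time
import tracemalloc


def measure(func, *args, **kwargs):
    """Run func once and return (result, elapsed_seconds, peak_bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak


def dedupe(lines):
    """Stand-in workload: order-preserving line deduplication."""
    return list(dict.fromkeys(lines))


# Synthetic input: 100,000 lines, of which 60,000 are distinct
lines = [f"row {i % 60_000}" for i in range(100_000)]
unique, seconds, peak = measure(dedupe, lines)
duplicate_rate = 1 - len(unique) / len(lines)
print(f"time: {seconds:.2f} s, duplicates: {duplicate_rate:.0%}, "
      f"peak: {peak / 1_000_000:.1f} MB")
```

<p data-tool="mdnice编辑器" style="color: rgba(89, 89, 89, 1); font-size: 14px; line-height: 1.8em; letter-spacing: 0.02em; text-align: left; text-indent: 0; margin: 0; padding: 8px 0">Timings taken this way include the overhead of memory tracing, so time and memory are better measured in separate runs when precise figures matter.</p>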
<h2 data-tool="mdnice编辑器" style="margin: 30px 0 15px; align-items: unset; background: none left top / auto no-repeat scroll padding-box border-box unset; border: 1px none rgba(0, 0, 0, 1); border-radius: 0; box-shadow: none; display: block; flex-direction: unset; float: unset; height: auto; justify-content: unset; line-height: 1.5em; overflow-x: unset; overflow-y: unset; padding: 0; position: relative; text-align: left; text-shadow: none; transform: none; width: auto; -webkit-box-reflect: unset"><span class="prefix" style="display: none"></span><span class="content" style="font-size: 18px; color: rgba(89, 89, 89, 1); line-height: 1.8em; letter-spacing: 0; padding: 0 0 0 10px; border-top: 1px none rgba(0, 0, 0, 1); border-bottom: 1px none rgba(0, 0, 0, 1); border-left: 5px solid rgba(222, 198, 251, 1); border-right: 1px none rgba(0, 0, 0, 1); border-radius: 0; align-items: unset; background: none left top / auto no-repeat scroll padding-box border-box unset; box-shadow: none; display: block; font-weight: bold; flex-direction: unset; float: unset; height: auto; justify-content: unset; margin: 0; overflow-x: unset; overflow-y: unset; position: relative; text-align: left; text-indent: 0; text-shadow: none; transform: none; width: auto; -webkit-box-reflect: unset">6. Comparison with Similar Tools</span><span class="suffix" style="display: none"></span></h2>
<section class="table-container" data-tool="mdnice编辑器" style="margin: 0; padding: 0; overflow-x: auto"><table style="display: table; text-align: left">
<thead>
<tr>
<th style="color: rgba(89, 89, 89, 1); font-size: 14px; line-height: 1.5em; letter-spacing: 0; text-align: left; font-weight: bold; background: none left top / auto no-repeat scroll padding-box border-box rgba(240, 240, 240, 1); width: auto; height: auto; border: 1px solid rgba(204, 204, 204, 0.4); border-radius: 0; padding: 5px 10px; min-width: 85px">Feature</th>
<th style="color: rgba(89, 89, 89, 1); font-size: 14px; line-height: 1.5em; letter-spacing: 0; text-align: left; font-weight: bold; background: none left top / auto no-repeat scroll padding-box border-box rgba(240, 240, 240, 1); width: auto; height: auto; border: 1px solid rgba(204, 204, 204, 0.4); border-radius: 0; padding: 5px 10px; min-width: 85px">FDT tool</th>
<th style="color: rgba(89, 89, 89, 1); font-size: 14px; line-height: 1.5em; letter-spacing: 0; text-align: left; font-weight: bold; background: none left top / auto no-repeat scroll padding-box border-box rgba(240, 240, 240, 1); width: auto; height: auto; border: 1px solid rgba(204, 204, 204, 0.4); border-radius: 0; padding: 5px 10px; min-width: 85px">Office built-in features</th>
<th style="color: rgba(89, 89, 89, 1); font-size: 14px; line-height: 1.5em; letter-spacing: 0; text-align: left; font-weight: bold; background: none left top / auto no-repeat scroll padding-box border-box rgba(240, 240, 240, 1); width: auto; height: auto; border: 1px solid rgba(204, 204, 204, 0.4); border-radius: 0; padding: 5px 10px; min-width: 85px">Online deduplication sites</th>
</tr>
</thead>
<tbody style="font-size: 14px; line-height: 1.5em; letter-spacing: 0; text-align: left; font-weight: normal; border: 0; border-image: initial">
<tr style="color: rgba(89, 89, 89, 1); background: none left top / auto no-repeat scroll padding-box border-box rgba(255, 255, 255, 1); width: auto; height: auto">
<td style="padding: 5px 10px; min-width: 85px; border: 1px solid rgba(204, 204, 204, 0.4); border-radius: 0">Format support</td>
<td style="padding: 5px 10px; min-width: 85px; border: 1px solid rgba(204, 204, 204, 0.4); border-radius: 0">★★★★☆</td>
<td style="padding: 5px 10px; min-width: 85px; border: 1px solid rgba(204, 204, 204, 0.4); border-radius: 0">★★☆☆☆</td>
<td style="padding: 5px 10px; min-width: 85px; border: 1px solid rgba(204, 204, 204, 0.4); border-radius: 0">★★★☆☆</td>
</tr>
<tr style="color: rgba(89, 89, 89, 1); background: none left top / auto no-repeat scroll padding-box border-box rgba(248, 248, 248, 1); width: auto; height: auto">
<td style="padding: 5px 10px; min-width: 85px; border: 1px solid rgba(204, 204, 204, 0.4); border-radius: 0">Processing scale</td>
<td style="padding: 5px 10px; min-width: 85px; border: 1px solid rgba(204, 204, 204, 0.4); border-radius: 0">Millions of lines</td>
<td style="padding: 5px 10px; min-width: 85px; border: 1px solid rgba(204, 204, 204, 0.4); border-radius: 0">Tens of thousands of lines</td>
<td style="padding: 5px 10px; min-width: 85px; border: 1px solid rgba(204, 204, 204, 0.4); border-radius: 0">Thousands of lines</td>
</tr>
<tr style="color: rgba(89, 89, 89, 1); background: none left top / auto no-repeat scroll padding-box border-box rgba(255, 255, 255, 1); width: auto; height: auto">
<td style="padding: 5px 10px; min-width: 85px; border: 1px solid rgba(204, 204, 204, 0.4); border-radius: 0">Local processing (privacy)</td>
<td style="padding: 5px 10px; min-width: 85px; border: 1px solid rgba(204, 204, 204, 0.4); border-radius: 0">✔️</td>
<td style="padding: 5px 10px; min-width: 85px; border: 1px solid rgba(204, 204, 204, 0.4); border-radius: 0">✔️</td>
<td style="padding: 5px 10px; min-width: 85px; border: 1px solid rgba(204, 204, 204, 0.4); border-radius: 0">✘</td>
</tr>
<tr style="color: rgba(89, 89, 89, 1); background: none left top / auto no-repeat scroll padding-box border-box rgba(248, 248, 248, 1); width: auto; height: auto">
<td style="padding: 5px 10px; min-width: 85px; border: 1px solid rgba(204, 204, 204, 0.4); border-radius: 0">Table handling</td>
<td style="padding: 5px 10px; min-width: 85px; border: 1px solid rgba(204, 204, 204, 0.4); border-radius: 0">Smart reconstruction</td>
<td style="padding: 5px 10px; min-width: 85px; border: 1px solid rgba(204, 204, 204, 0.4); border-radius: 0">Basic merging</td>
<td style="padding: 5px 10px; min-width: 85px; border: 1px solid rgba(204, 204, 204, 0.4); border-radius: 0">Flattened to text</td>
</tr>
<tr style="color: rgba(89, 89, 89, 1); background: none left top / auto no-repeat scroll padding-box border-box rgba(255, 255, 255, 1); width: auto; height: auto">
<td style="padding: 5px 10px; min-width: 85px; border: 1px solid rgba(204, 204, 204, 0.4); border-radius: 0">Customization and scripting</td>
<td style="padding: 5px 10px; min-width: 85px; border: 1px solid rgba(204, 204, 204, 0.4); border-radius: 0">Python</td>
<td style="padding: 5px 10px; min-width: 85px; border: 1px solid rgba(204, 204, 204, 0.4); border-radius: 0">VBA macros</td>
<td style="padding: 5px 10px; min-width: 85px; border: 1px solid rgba(204, 204, 204, 0.4); border-radius: 0">Limited by the API</td>
</tr>
</tbody>
</table>
</section><p data-tool="mdnice编辑器" style="color: rgba(89, 89, 89, 1); font-size: 14px; line-height: 1.8em; letter-spacing: 0.02em; text-align: left; text-indent: 0; margin: 0; padding: 8px 0"><strong style="color: rgba(145, 109, 213, 1); font-weight: bold; background: none left top / auto no-repeat scroll padding-box border-box rgba(0, 0, 0, 0); width: auto; height: auto; margin: 0; padding: 0; border: 3px none rgba(0, 0, 0, 0.4); border-radius: 0">Summary:</strong> a smarter approach to document deduplication</p>
<p data-tool="mdnice编辑器" style="color: rgba(89, 89, 89, 1); font-size: 14px; line-height: 1.8em; letter-spacing: 0.02em; text-align: left; text-indent: 0; margin: 0; padding: 8px 0">FDT improves the document-processing workflow through four design choices:</p>
<ol data-tool="mdnice编辑器" style="list-style-type: decimal; margin: 8px 0; padding: 0 0 0 25px; color: rgba(0, 0, 0, 1)">
<li><section style="margin-top: 5px; margin-bottom: 5px; color: rgba(89, 89, 89, 1); font-size: 14px; line-height: 1.8em; letter-spacing: 0; text-align: left; font-weight: normal">Format-agnostic processing: avoids information loss during document conversion</section></li><li><section style="margin-top: 5px; margin-bottom: 5px; color: rgba(89, 89, 89, 1); font-size: 14px; line-height: 1.8em; letter-spacing: 0; text-align: left; font-weight: normal">Structure preservation: table and paragraph structure is retained accurately</section></li><li><section style="margin-top: 5px; margin-bottom: 5px; color: rgba(89, 89, 89, 1); font-size: 14px; line-height: 1.8em; letter-spacing: 0; text-align: left; font-weight: normal">Zero-configuration startup: missing dependencies are resolved automatically</section></li><li><section style="margin-top: 5px; margin-bottom: 5px; color: rgba(89, 89, 89, 1); font-size: 14px; line-height: 1.8em; letter-spacing: 0; text-align: left; font-weight: normal">Closed-loop local processing: the whole path from input to output stays under the user's control</section></li></ol>
<pre class="custom" data-tool="mdnice编辑器" style="border-radius: 5px; box-shadow: 0 2px 10px rgba(0, 0, 0, 0.55); text-align: left; margin: 10px 0; padding: 0"><span style="display: block; background: url("https://files.mdnice.com/user/3441/876cad08-0422-409d-bb5a-08afec5da8ee.svg") 10px 10px / 40px no-repeat rgba(40, 44, 52, 1); height: 30px; width: 100%; margin-bottom: -7px; border-radius: 5px"></span><code class="hljs" style="overflow-x: auto; padding: 15px 16px 16px; color: rgba(171, 178, 191, 1); background: rgba(40, 44, 52, 1); border-radius: 5px; font-family: Consolas, Monaco, Menlo, monospace; font-size: 12px">import sys<br>import subprocess<br><br><br>def install(*pip_args):<br>    <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">""</span><span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"Install missing packages with pip."</span><span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">""</span><br>    <span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># Run pip through the current interpreter so the active environment is used</span><br>    subprocess.check_call([sys.executable, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"-m"</span>, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"pip"</span>, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"install"</span>, *pip_args])<br><br><br><span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">print</span>(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"Checking for and installing required dependencies..."</span>)<br>try:<br>    import numpy as np<br>except ImportError:<br>    <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">print</span>(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"NumPy is not installed; installing..."</span>)<br>    install(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"numpy"</span>)<br>    import numpy as np<br><br>try:<br>    import pandas as pd<br>except (ImportError, ValueError) as e:<br>    <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">if</span> <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"numpy.dtype size changed"</span> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 
26px">in</span> str(e):<br>        <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">print</span>(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"Detected a NumPy version incompatibility; repairing..."</span>)<br>        <span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># Uninstall pandas first, then reinstall it</span><br>        try:<br>            subprocess.check_call([sys.executable, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"-m"</span>, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"pip"</span>, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"uninstall"</span>, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"-y"</span>, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"pandas"</span>])<br>        except subprocess.CalledProcessError:<br>            pass<br>        <span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># Install mutually compatible versions of pandas and numpy</span><br>        install(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"--upgrade"</span>, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"numpy"</span>)<br>        install(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"--upgrade"</span>, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"pandas"</span>)<br>    <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">else</span>:<br>        <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">print</span>(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"pandas is not installed; installing..."</span>)<br>        install(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"pandas"</span>)<br><br>    import pandas as pd<br><br><span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># Main program follows</span><br>import tkinter as tk<br>from tkinter import ttk, filedialog, messagebox, scrolledtext<br>import os<br>import re<br>import xml.etree.ElementTree as ET<br>import zipfile<br>import tempfile<br>import shutil<br><br><br>class DeduplicationApp:<br> def __init__(self, root):<br> self.root = root<br> self.root.title(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"FDT文件去重工具"</span>)<br> self.root.geometry(<span class="hljs-string" style="color: 
rgba(152, 195, 121, 1); line-height: 26px">"800x600"</span>)<br> self.root.resizable(True, True)<br><br> <span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># 定义配色方案</span><br> self.bg_color = <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"#2c3e50"</span><br> self.header_color = <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"#3498db"</span><br> self.text_bg = <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"#ecf0f1"</span><br> self.btn_color = <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"#1abc9c"</span><br> self.btn_hover = <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"#16a085"</span><br> self.btn_remove = <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"#e74c3c"</span><br> self.btn_remove_hover = <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"#c0392b"</span><br> self.status_color = <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"#34495e"</span><br> self.format_highlight = {<br> <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"txt"</span>: <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"#1abc9c"</span>, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"doc"</span>: <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"#3498db"</span>, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"docx"</span>: <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"#2980b9"</span>,<br> <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"xls"</span>: <span class="hljs-string" style="color: rgba(152, 195, 121, 1); 
line-height: 26px">"#9b59b6"</span>, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"xlsx"</span>: <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"#8e44ad"</span><br> }<br><br> <span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># 创建主框架</span><br> self.root.configure(<span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">bg</span>=self.bg_color)<br><br> <span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># 标题框架</span><br> header_frame = tk.Frame(root, <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">bg</span>=self.header_color)<br> header_frame.pack(fill=tk.X, pady=0)<br><br> title_label = tk.Label(<br> header_frame,<br> text=<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"▣ FDT文件去重工具"</span>,<br> font=(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"微软雅黑"</span>, 16, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"bold"</span>),<br> <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">bg</span>=self.header_color,<br> <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">fg</span>=<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"white"</span>,<br> pady=10<br> )<br> title_label.pack(pady=5)<br><br> <span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># 创建主内容框架</span><br> main_frame = tk.Frame(root, <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">bg</span>=self.bg_color, padx=15, pady=10)<br> main_frame.pack(fill=tk.BOTH, expand=True)<br><br> <span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># 
文件格式说明</span><br> format_frame = tk.Frame(main_frame, <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">bg</span>=self.bg_color)<br> format_frame.pack(fill=tk.X, pady=5)<br><br> tk.Label(<br> format_frame,<br> text=<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"支持格式: "</span>,<br> <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">bg</span>=self.bg_color,<br> <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">fg</span>=<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"#ecf0f1"</span>,<br> font=(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"微软雅黑"</span>, 9)<br> ).pack(side=tk.LEFT)<br><br> <span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># 创建格式标签</span><br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">for</span> fmt, color <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">in</span> self.format_highlight.items():<br> tk.Label(<br> format_frame,<br> text=f<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"{fmt.upper()}"</span>,<br> <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">bg</span>=self.bg_color,<br> <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">fg</span>=color,<br> font=(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"微软雅黑"</span>, 9, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"bold"</span>),<br> padx=5<br> ).pack(side=tk.LEFT, padx=2)<br><br> <span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># 输入文件选择</span><br> input_frame = tk.LabelFrame(<br> main_frame,<br> text=<span class="hljs-string" 
style="color: rgba(152, 195, 121, 1); line-height: 26px">" 选择输入文件 "</span>,<br> font=(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"微软雅黑"</span>, 9),<br> <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">bg</span>=self.bg_color,<br> <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">fg</span>=<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"#ecf0f1"</span><br> )<br> input_frame.pack(fill=tk.X, pady=8)<br><br> tk.Label(<br> input_frame,<br> text=<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"输入文件:"</span>,<br> <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">bg</span>=self.bg_color,<br> <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">fg</span>=<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"#ecf0f1"</span>,<br> font=(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"微软雅黑"</span>, 9)<br> ).grid(row=0, column=0, padx=5, pady=5, sticky=<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">'w'</span>)<br><br> self.input_path = tk.StringVar()<br> input_entry = tk.Entry(<br> input_frame,<br> textvariable=self.input_path,<br> width=50,<br> font=(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"微软雅黑"</span>, 9),<br> relief=tk.GROOVE<br> )<br> input_entry.grid(row=0, column=1, padx=5, sticky=<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">'ew'</span>)<br><br> button_frame1 = tk.Frame(input_frame, <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">bg</span>=self.bg_color)<br> button_frame1.grid(row=0, column=2, padx=5)<br><br> tk.Button(<br> button_frame1,<br> text=<span class="hljs-string" style="color: rgba(152, 195, 121, 1); 
line-height: 26px">"浏览..."</span>,<br> <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">command</span>=self.browse_input,<br> <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">bg</span>=<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"#3498db"</span>,<br> <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">fg</span>=<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"white"</span>,<br> relief=tk.FLAT,<br> font=(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"微软雅黑"</span>, 9, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"bold"</span>),<br> padx=10<br> ).pack(side=tk.LEFT, padx=2)<br><br> <span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># 输出文件选择</span><br> output_frame = tk.LabelFrame(<br> main_frame,<br> text=<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">" 设置输出文件 "</span>,<br> font=(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"微软雅黑"</span>, 9),<br> <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">bg</span>=self.bg_color,<br> <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">fg</span>=<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"#ecf0f1"</span><br> )<br> output_frame.pack(fill=tk.X, pady=8)<br><br> tk.Label(<br> output_frame,<br> text=<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"输出文件:"</span>,<br> <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">bg</span>=self.bg_color,<br> <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">fg</span>=<span class="hljs-string" style="color: rgba(152, 
195, 121, 1); line-height: 26px">"#ecf0f1"</span>,<br> font=(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"微软雅黑"</span>, 9)<br> ).grid(row=0, column=0, padx=5, pady=5, sticky=<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">'w'</span>)<br><br> self.output_path = tk.StringVar()<br> output_entry = tk.Entry(<br> output_frame,<br> textvariable=self.output_path,<br> width=50,<br> font=(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"微软雅黑"</span>, 9),<br> relief=tk.GROOVE<br> )<br> output_entry.grid(row=0, column=1, padx=5, sticky=<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">'ew'</span>)<br><br> button_frame2 = tk.Frame(output_frame, <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">bg</span>=self.bg_color)<br> button_frame2.grid(row=0, column=2, padx=5)<br><br> tk.Button(<br> button_frame2,<br> text=<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"浏览..."</span>,<br> <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">command</span>=self.browse_output,<br> <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">bg</span>=<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"#3498db"</span>,<br> <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">fg</span>=<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"white"</span>,<br> relief=tk.FLAT,<br> font=(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"微软雅黑"</span>, 9, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"bold"</span>),<br> padx=10<br> ).pack(side=tk.LEFT, padx=2)<br><br> <span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># 
覆盖选项</span><br> option_frame = tk.Frame(main_frame, <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">bg</span>=self.bg_color)<br> option_frame.pack(fill=tk.X, pady=5)<br><br> self.overwrite_var = tk.BooleanVar(value=True)<br> tk.Checkbutton(<br> option_frame,<br> text=<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"覆盖原文件(输出文件留空时自动启用)"</span>,<br> variable=self.overwrite_var,<br> <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">command</span>=self.update_overwrite,<br> <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">bg</span>=self.bg_color,<br> <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">fg</span>=<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"#ecf0f1"</span>,<br> font=(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"微软雅黑"</span>, 9),<br> selectcolor=self.bg_color,<br> activebackground=self.bg_color,<br> activeforeground=<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"#ecf0f1"</span><br> ).pack(anchor=tk.W)<br><br> <span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># 去重范围选择 (仅适用于Word文档)</span><br> self.scope_var = tk.StringVar(value=<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"all"</span>)<br> scope_frame = tk.Frame(main_frame, <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">bg</span>=self.bg_color)<br> scope_frame.pack(fill=tk.X, pady=5)<br><br> tk.Label(<br> scope_frame,<br> text=<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"Word文档去重范围:"</span>,<br> <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">bg</span>=self.bg_color,<br> <span class="hljs-built_in" style="color: rgba(230, 
192, 123, 1); line-height: 26px">fg</span>=<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"#ecf0f1"</span>,<br> font=(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"微软雅黑"</span>, 9)<br> ).pack(side=tk.LEFT)<br><br> scopes = [<br> (<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"全部内容"</span>, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"all"</span>),<br> (<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"仅段落"</span>, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"paragraphs"</span>),<br> (<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"仅表格"</span>, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"tables"</span>)<br> ]<br><br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">for</span> text, value <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">in</span> scopes:<br> tk.Radiobutton(<br> scope_frame,<br> text=text,<br> variable=self.scope_var,<br> value=value,<br> <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">bg</span>=self.bg_color,<br> <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">fg</span>=<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"#ecf0f1"</span>,<br> selectcolor=self.bg_color,<br> activebackground=self.bg_color,<br> activeforeground=<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"#ecf0f1"</span>,<br> font=(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"微软雅黑"</span>, 9)<br> ).pack(side=tk.LEFT, padx=10)<br><br> <span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># 
操作按钮</span><br> btn_frame = tk.Frame(main_frame, <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">bg</span>=self.bg_color)<br> btn_frame.pack(fill=tk.X, pady=10)<br><br> btn_style = {<br> <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"font"</span>: (<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"微软雅黑"</span>, 10, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"bold"</span>),<br> <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"padx"</span>: 15,<br> <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"pady"</span>: 8,<br> <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"relief"</span>: tk.GROOVE,<br> <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"bd"</span>: 0<br> }<br><br> tk.Button(<br> btn_frame,<br> text=<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"✓ 执行去重"</span>,<br> <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">command</span>=self.process_deduplication,<br> <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">bg</span>=self.btn_color,<br> <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">fg</span>=<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"white"</span>,<br> activebackground=self.btn_hover,<br> **btn_style<br> ).pack(side=tk.LEFT, padx=10)<br><br> tk.Button(<br> btn_frame,<br> text=<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"👁 预览结果"</span>,<br> <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">command</span>=self.preview_results,<br> <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 
26px">bg</span>=<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"#f39c12"</span>,<br> <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">fg</span>=<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"white"</span>,<br> activebackground=<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"#e67e22"</span>,<br> **btn_style<br> ).pack(side=tk.LEFT, padx=10)<br><br> tk.Button(<br> btn_frame,<br> text=<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"✕ 退出"</span>,<br> <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">command</span>=root.destroy,<br> <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">bg</span>=self.btn_remove,<br> <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">fg</span>=<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"white"</span>,<br> activebackground=self.btn_remove_hover,<br> **btn_style<br> ).pack(side=tk.RIGHT, padx=10)<br><br> <span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># 结果显示框架</span><br> result_frame = tk.LabelFrame(<br> main_frame,<br> text=<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">" 处理结果 "</span>,<br> font=(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"微软雅黑"</span>, 9),<br> <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">bg</span>=self.bg_color,<br> <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">fg</span>=<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"#ecf0f1"</span><br> )<br> result_frame.pack(fill=tk.BOTH, expand=True)<br><br> self.result_text = scrolledtext.ScrolledText(<br> result_frame,<br> 
wrap=tk.WORD,<br> height=8,<br> font=(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"Consolas"</span>, 10),<br> <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">bg</span>=<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"#ffffff"</span>,<br> padx=10,<br> pady=10,<br> relief=tk.GROOVE<br> )<br> self.result_text.pack(fill=tk.BOTH, expand=True, padx=5, pady=5)<br> self.result_text.config(state=tk.DISABLED)<br><br> <span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># 状态栏</span><br> status_bar = tk.Frame(<br> root,<br> <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">bg</span>=self.status_color,<br> height=22,<br> relief=tk.SUNKEN<br> )<br> status_bar.pack(side=tk.BOTTOM, fill=tk.X)<br><br> self.status_var = tk.StringVar(value=<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"就绪 | 选择一个文件开始处理"</span>)<br> tk.Label(<br> status_bar,<br> textvariable=self.status_var,<br> <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">bg</span>=self.status_color,<br> <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">fg</span>=<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"white"</span>,<br> anchor=tk.W,<br> font=(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"微软雅黑"</span>, 9)<br> ).pack(side=tk.LEFT, padx=10)<br><br> <span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># 底部版权信息</span><br> copyright_frame = tk.Frame(root, <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">bg</span>=self.bg_color)<br> copyright_frame.pack(side=tk.BOTTOM, fill=tk.X)<br><br> tk.Label(<br> copyright_frame,<br> text=<span class="hljs-string" style="color: rgba(152, 
195, 121, 1); line-height: 26px">"© 2025 FDT文件去重工具 v1.0 | 支持: TXT, DOC, DOCX, XLS, XLSX"</span>,<br> <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">bg</span>=self.bg_color,<br> <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">fg</span>=<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"#95a5a6"</span>,<br> font=(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"微软雅黑"</span>, 8)<br> ).pack(pady=(0, 5))<br><br> <span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># 绑定事件</span><br> self.bind_hover_events()<br><br> def bind_hover_events(self):<br> <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">""</span><span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"绑定按钮的悬停事件"</span><span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">""</span><br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">for</span> widget <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">in</span> self.root.winfo_children():<br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">if</span> isinstance(widget, tk.Button):<br> widget.bind(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"<Enter>"</span>, lambda e: e.widget.config(<br> <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">bg</span>=e.widget.cget(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"activebackground"</span>)<br> ))<br> widget.bind(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"<Leave>"</span>, lambda e: e.widget.config(<br> <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 
26px">bg</span>=e.widget.cget(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"bg"</span>).replace(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"activebackground"</span>, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">""</span>).split()<br> ))<br><br> def get_file_extension(self, file_path):<br> <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">""</span><span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"Return the file extension (lowercase, without the dot)"</span><span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">""</span><br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">if</span> not file_path:<br> <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">return</span> None<br> ext = os.path.splitext(file_path)[1]<br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">if</span> ext.startswith(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">'.'</span>):<br> ext = ext[1:]<br> <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">return</span> ext.lower()<br><br> def browse_input(self):<br> <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">""</span><span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"Select the input file"</span><span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">""</span><br> file_path = filedialog.askopenfilename(<br> title=<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"选择输入文件"</span>,<br> filetypes=[<br> (<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"所有支持的文件"</span>, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"*.txt *.doc *.docx *.xls 
*.xlsx"</span>),<br> (<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"文本文件"</span>, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"*.txt"</span>),<br> (<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"Word 文档"</span>, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"*.doc *.docx"</span>),<br> (<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"Excel 文件"</span>, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"*.xls *.xlsx"</span>),<br> (<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"所有文件"</span>, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"*.*"</span>)<br> ]<br> )<br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">if</span> file_path:<br> self.input_path.set(file_path)<br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">if</span> self.overwrite_var.get() and not self.output_path.get():<br> self.output_path.set(file_path)<br><br> ext = self.get_file_extension(file_path)<br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">if</span> ext <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">in</span> self.format_highlight:<br> color = self.format_highlight[ext]<br> self.status_var.set(f<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"已选择输入文件: {os.path.basename(file_path)}"</span>)<br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">else</span>:<br> self.status_var.set(f<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"警告: {ext.upper()}格式支持有限 - {os.path.basename(file_path)}"</span>)<br><br> def browse_output(self):<br> <span class="hljs-string" 
style="color: rgba(152, 195, 121, 1); line-height: 26px">""</span><span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"选择输出文件"</span><span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">""</span><br> input_file = self.input_path.get()<br> input_ext = self.get_file_extension(input_file)<br><br> default_ext = <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"txt"</span><br> file_types = []<br><br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">if</span> input_ext <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">in</span> [<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"doc"</span>, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"docx"</span>]:<br> file_types = [(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"Word 文档"</span>, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"*.docx"</span>), (<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"所有文件"</span>, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"*.*"</span>)]<br> default_ext = <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"docx"</span><br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">elif</span> input_ext <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">in</span> [<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"xls"</span>, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"xlsx"</span>]:<br> file_types = [(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"Excel 文件"</span>, <span class="hljs-string" style="color: rgba(152, 195, 
121, 1); line-height: 26px">"*.xlsx"</span>), (<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"所有文件"</span>, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"*.*"</span>)]<br> default_ext = <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"xlsx"</span><br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">else</span>:<br> file_types = [(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"文本文件"</span>, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"*.txt"</span>), (<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"所有文件"</span>, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"*.*"</span>)]<br> default_ext = <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"txt"</span><br><br> file_path = filedialog.asksaveasfilename(<br> title=<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"保存输出文件"</span>,<br> defaultextension=f<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">".{default_ext}"</span>,<br> filetypes=file_types<br> )<br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">if</span> file_path:<br> self.output_path.set(file_path)<br> self.status_var.set(f<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"输出文件设置为: {os.path.basename(file_path)}"</span>)<br><br> def update_overwrite(self):<br> <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">""</span><span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"更新覆盖选项"</span><span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">""</span><br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 
1); line-height: 26px">if</span> self.overwrite_var.get() and self.input_path.get() and not self.output_path.get():<br> self.output_path.set(self.input_path.get())<br><br> def validate_inputs(self):<br> <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">""</span><span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"验证输入是否有效"</span><span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">""</span><br> input_file = self.input_path.get()<br> output_file = self.output_path.get()<br><br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">if</span> not input_file:<br> messagebox.showerror(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"输入错误"</span>, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"请先选择一个输入文件!"</span>)<br> <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">return</span> False<br><br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">if</span> not os.path.exists(input_file):<br> messagebox.showerror(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"文件错误"</span>, f<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"文件不存在:\n{input_file}"</span>)<br> <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">return</span> False<br><br> ext = self.get_file_extension(input_file)<br> supported_formats = [<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"txt"</span>, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"doc"</span>, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"docx"</span>, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"xls"</span>, <span class="hljs-string" 
style="color: rgba(152, 195, 121, 1); line-height: 26px">"xlsx"</span>]<br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">if</span> ext not <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">in</span> supported_formats:<br> messagebox.showerror(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"格式错误"</span>, f<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"不支持的文件格式: {ext or '未知'}\n\n"</span><br> f<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"支持格式: {', '.join(supported_formats)}"</span>)<br> <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">return</span> False<br><br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">if</span> not output_file:<br> messagebox.showerror(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"输出错误"</span>, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"请设置输出文件路径!"</span>)<br> <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">return</span> False<br><br> output_ext = self.get_file_extension(output_file)<br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">if</span> output_ext != ext:<br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">if</span> not messagebox.askyesno(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"格式不同"</span>,<br> f<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"输出文件格式({output_ext})与输入格式({ext})不同,\n"</span><br> <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"可能导致格式丢失。是否继续?"</span>,<br> icon=<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"warning"</span>):<br> <span 
class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">return</span> False<br><br> <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">return</span> True<br><br> def extract_text_from_docx(self, file_path):<br> <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">""</span><span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"从DOCX文件中提取文本(无Office依赖)"</span><span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">""</span><br> try:<br> <span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># 创建一个临时目录用于解压DOCX文件</span><br> with tempfile.TemporaryDirectory() as tmp_dir:<br> <span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># 解压DOCX文件</span><br> with zipfile.ZipFile(file_path, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">'r'</span>) as zip_ref:<br> zip_ref.extractall(tmp_dir)<br><br> <span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># 解析document.xml文件</span><br> doc_xml_path = os.path.join(tmp_dir, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">'word'</span>, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">'document.xml'</span>)<br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">if</span> not os.path.exists(doc_xml_path):<br> <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">return</span> [], 0<br><br> tree = ET.parse(doc_xml_path)<br> root = tree.getroot()<br><br> <span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># 定义XML命名空间</span><br> namespaces = {<br> <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">'w'</span>: <span 
class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">'http://schemas.openxmlformats.org/wordprocessingml/2006/main'</span><br> }<br><br> <span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># 提取文本内容</span><br> text_lines = []<br> scope = self.scope_var.get()<br><br> <span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># 提取段落文本</span><br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">if</span> scope <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">in</span> [<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"all"</span>, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"paragraphs"</span>]:<br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">for</span> paragraph <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">in</span> root.findall(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">'.//w:p'</span>, namespaces):<br> para_text = []<br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">for</span> run <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">in</span> paragraph.findall(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">'.//w:r'</span>, namespaces):<br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">for</span> text <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">in</span> run.findall(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">'.//w:t'</span>, namespaces):<br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">if</span> text.text:<br> 
para_text.append(text.text.strip())<br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">if</span> para_text:<br> text_lines.append(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">''</span>.join(para_text))<br><br> <span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># 提取表格文本</span><br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">if</span> scope <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">in</span> [<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"all"</span>, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"tables"</span>]:<br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">for</span> table <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">in</span> root.findall(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">'.//w:tbl'</span>, namespaces):<br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">for</span> row <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">in</span> table.findall(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">'.//w:tr'</span>, namespaces):<br> row_text = []<br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">for</span> cell <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">in</span> row.findall(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">'.//w:tc'</span>, namespaces):<br> cell_text = []<br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">for</span> paragraph <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 
26px">in</span> cell.findall(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">'.//w:p'</span>, namespaces):<br> para_text = []<br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">for</span> run <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">in</span> paragraph.findall(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">'.//w:r'</span>, namespaces):<br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">for</span> text <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">in</span> run.findall(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">'.//w:t'</span>, namespaces):<br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">if</span> text.text:<br> para_text.append(text.text.strip())<br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">if</span> para_text:<br> cell_text.append(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">''</span>.join(para_text))<br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">if</span> cell_text:<br> row_text.append(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">' '</span>.join(cell_text))<br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">if</span> row_text:<br> text_lines.append(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">'\t'</span>.join(row_text))<br><br> <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">return</span> text_lines, len(text_lines)<br> except Exception as e:<br> messagebox.showerror(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"提取错误"</span>, f<span class="hljs-string" 
style="color: rgba(152, 195, 121, 1); line-height: 26px">"从DOCX文件中提取内容失败:\n{str(e)}"</span>)<br> <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">return</span> [], 0<br><br> def extract_text_from_doc(self, file_path):<br> <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">""</span><span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"Extract text from a DOC file (compatibility fallback)"</span><span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">""</span><br> <span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># Show a warning</span><br> messagebox.showwarning(<br> <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"DOC格式限制"</span>,<br> <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"DOC文件是旧格式,处理能力有限。\n\n已将其视为文本文件处理。"</span><br> )<br><br> <span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># Try to read the content as a plain-text file</span><br> try:<br> with open(file_path, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">'r'</span>, encoding=<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">'utf-8'</span>, errors=<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">'ignore'</span>) as f:<br> <span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># Assumption: the original expression was lost; keep stripped, non-empty lines</span><br> lines = [line.strip() <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">for</span> line <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">in</span> f <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">if</span> line.strip()]<br> <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">return</span> lines, len(lines)<br> except Exception as e:<br> messagebox.showerror(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"处理错误"</span>, f<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"处理DOC文件失败:\n{str(e)}"</span>)<br> <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">return</span> [], 0<br><br> def extract_text_from_excel(self, file_path):<br> <span 
class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">""</span><span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"Extract text from an Excel file (using pandas)"</span><span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">""</span><br> try:<br> <span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># Choose the reader engine (case-insensitive extension check)</span><br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">if</span> file_path.lower().endswith(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">'.xls'</span>):<br> import xlrd<br> engine = <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">'xlrd'</span><br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">else</span>:<br> engine = <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">'openpyxl'</span><br><br> sheets = pd.read_excel(file_path, sheet_name=None, engine=engine)<br><br> all_text = []<br> total_lines = 0<br><br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">for</span> sheet_name, df <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">in</span> sheets.items():<br> <span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># Add a sheet-name heading</span><br> all_text.append(f<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"\n--- Sheet: {sheet_name} ---"</span>)<br><br> <span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># Header row (assumption: the original expression was lost; column names as strings)</span><br> headers = [str(col) <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">for</span> col <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">in</span> df.columns]<br> all_text.append(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"\t"</span>.join(headers))<br><br> <span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># Data rows</span><br> <span 
class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">for</span> idx, row <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">in</span> df.iterrows():<br> <span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># Assumption: the original expression was lost; cell values as strings</span><br> row_values = [str(value) <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">for</span> value <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">in</span> row]<br> all_text.append(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"\t"</span>.join(row_values))<br><br> total_lines += len(df) + 2<br><br> <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">return</span> all_text, len(all_text)<br> except Exception as e:<br> messagebox.showerror(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"提取错误"</span>, f<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"从Excel文件中提取内容失败:\n{str(e)}"</span>)<br> <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">return</span> [], 0<br><br> def deduplicate_text(self, lines):<br> <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">""</span><span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"Deduplicate text lines (preserving order)"</span><span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">""</span><br> seen = <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">set</span>()<br> unique_lines = []<br><br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">for</span> line <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">in</span> lines:<br> stripped_line = line.strip()<br> <span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># Table rows (tab-separated) are compared as whole lines; other lines are compared case-insensitively</span><br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">if</span> <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">'\t'</span> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); 
line-height: 26px">in</span> stripped_line:<br> key = stripped_line<br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">else</span>:<br> key = stripped_line.lower()<br><br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">if</span> key not <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">in</span> seen:<br> seen.add(key)<br> unique_lines.append(line)<br><br> <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">return</span> unique_lines, len(unique_lines)<br><br> def save_dedup_result(self, unique_lines, input_file, output_file):<br> <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">""</span><span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"保存去重结果到文件"</span><span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">""</span><br> input_ext = self.get_file_extension(input_file)<br> output_ext = self.get_file_extension(output_file)<br><br> try:<br> <span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># 对于Excel文件,保存为Excel格式</span><br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">if</span> output_ext <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">in</span> [<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"xls"</span>, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"xlsx"</span>]:<br> <span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># 提取表头和数据</span><br> header_line = None<br> data_lines = []<br><br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">for</span> line <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">in</span> 
unique_lines:<br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">if</span> <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">'--- Sheet:'</span> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">in</span> line:<br> <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">continue</span><br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">if</span> not header_line:<br> header_line = line<br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">else</span>:<br> data_lines.append(line)<br><br> <span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># 解析数据</span><br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">if</span> header_line and data_lines:<br> headers = header_line.split(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">'\t'</span>)<br> data = [line.split(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">'\t'</span>) <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">for</span> line <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">in</span> data_lines]<br><br> <span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># 创建DataFrame</span><br> df = pd.DataFrame(data, columns=headers)<br><br> <span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># 保存到Excel</span><br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">if</span> output_ext == <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">'xlsx'</span>:<br> df.to_excel(output_file, index=False, engine=<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">'openpyxl'</span>)<br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">else</span>:<br> <span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># Note: pandas 2.0 removed the xlwt engine, so writing .xls this way requires pandas 1.x with xlwt installed</span><br> df.to_excel(output_file, index=False, engine=<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">'xlwt'</span>)<br> <span 
class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">else</span>:<br> <span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># 如果没有数据,创建空DataFrame</span><br> pd.DataFrame().to_excel(output_file, index=False)<br><br> <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">return</span> True<br> <span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># 对于文本和Word文件,保存为文本格式</span><br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">else</span>:<br> with open(output_file, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">'w'</span>, encoding=<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">'utf-8'</span>) as f:<br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">for</span> line <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">in</span> unique_lines:<br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">if</span> <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">'--- Sheet:'</span> not <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">in</span> line: <span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># 跳过sheet标题</span><br> f.write(line + <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">'\n'</span>)<br> <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">return</span> True<br> except Exception as e:<br> messagebox.showerror(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"保存错误"</span>, f<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"保存去重结果失败:\n{str(e)}"</span>)<br> <span 
class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">return</span> False<br><br> def preview_results(self):<br> <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">""</span><span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"预览去重结果"</span><span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">""</span><br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">if</span> not self.validate_inputs():<br> <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">return</span><br><br> input_file = self.input_path.get()<br> output_file = self.output_path.get()<br> ext = self.get_file_extension(input_file)<br><br> try:<br> lines = []<br> original_count = 0<br><br> <span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># 根据文件格式提取内容</span><br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">if</span> ext == <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"txt"</span>:<br> with open(input_file, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">'r'</span>, encoding=<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">'utf-8'</span>) as f:<br> lines = [line.rstrip(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">'\n'</span>) <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">for</span> line <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">in</span> f]<br> original_count = len(lines)<br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">elif</span> ext == <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"doc"</span>:<br> lines, original_count = self.extract_text_from_doc(input_file)<br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">elif</span> ext == <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"docx"</span>:<br> lines, original_count = self.extract_text_from_docx(input_file)<br> <span 
class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">elif</span> ext <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">in</span> [<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"xls"</span>, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"xlsx"</span>]:<br> lines, original_count = self.extract_text_from_excel(input_file)<br><br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">if</span> not lines:<br> messagebox.showwarning(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"内容为空"</span>, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"未提取到任何内容,文件可能为空或格式不受支持"</span>)<br> <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">return</span><br><br> <span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># 去重文本</span><br> unique_lines, unique_count = self.deduplicate_text(lines)<br><br> <span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># 显示预览结果</span><br> self.result_text.config(state=tk.NORMAL)<br> self.result_text.delete(1.0, tk.END)<br><br> <span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># 标题</span><br> self.result_text.tag_config(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"header"</span>, foreground=<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"#2980b9"</span>, font=(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"微软雅黑"</span>, 10, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"bold"</span>))<br> self.result_text.insert(tk.END, f<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 
26px">"文件预览 ({ext.upper()}, 最多15行)\n"</span>, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"header"</span>)<br> self.result_text.insert(tk.END, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"="</span> * 60 + <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"\n\n"</span>)<br><br> <span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># 预览内容</span><br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">for</span> i, line <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">in</span> enumerate(unique_lines[:15]):<br> self.result_text.tag_config(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"line_num"</span>, foreground=<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"#7f8c8d"</span>)<br> self.result_text.insert(tk.END, f<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"{i + 1:>2}. "</span>, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"line_num"</span>)<br><br> <span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># 表格行特殊处理</span><br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">if</span> <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">'\t'</span> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">in</span> line:<br> self.result_text.tag_config(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"table_row"</span>, foreground=<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"#9b59b6"</span>)<br> columns = line.split(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">'\t'</span>)<br> truncated = [col[:15] + (<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">'...'</span> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">if</span> len(col) > 15 <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">else</span> <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">''</span>) <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">for</span> col <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">in</span> columns]<br> self.result_text.insert(tk.END, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">" | "</span>.join(truncated) + <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"\n"</span>, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"table_row"</span>)<br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">else</span>:<br> 
self.result_text.tag_config(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"text_line"</span>, foreground=<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"#2c3e50"</span>)<br> <span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># 对长文本进行截断处理</span><br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">if</span> len(line) > 80:<br> line = line[:77] + <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"..."</span><br> self.result_text.insert(tk.END, line + <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"\n"</span>, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"text_line"</span>)<br><br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">if</span> len(unique_lines) > 15:<br> self.result_text.insert(tk.END, f<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"\n...以及另外 {len(unique_lines) - 15} 行\n\n"</span>, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"line_num"</span>)<br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">else</span>:<br> self.result_text.insert(tk.END, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"\n"</span>)<br><br> <span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># 统计数据</span><br> self.result_text.tag_config(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"stats"</span>, foreground=<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"#27ae60"</span>, font=(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"微软雅黑"</span>, 9, <span class="hljs-string" style="color: rgba(152, 
195, 121, 1); line-height: 26px">"bold"</span>))<br> self.result_text.insert(tk.END, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"统计信息:\n"</span>, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"stats"</span>)<br> self.result_text.insert(tk.END, f<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"原始行数: {original_count}\n"</span>)<br> self.result_text.insert(tk.END, f<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"去重后行数: {len(unique_lines)}\n"</span>)<br> self.result_text.insert(tk.END, f<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"移除重复行数: {original_count - len(unique_lines)}\n"</span>)<br><br> self.result_text.config(state=tk.DISABLED)<br><br> self.status_var.set(f<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"预览完成: {ext.upper()}文件, 原始行数 {original_count}, 去重后行数 {len(unique_lines)}"</span>)<br><br> except Exception as e:<br> messagebox.showerror(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"处理错误"</span>, f<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"处理文件时发生错误:\n{str(e)}"</span>)<br> self.status_var.set(f<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"错误: {str(e)}"</span>)<br><br> def process_deduplication(self):<br> <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">""</span><span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"执行去重操作"</span><span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">""</span><br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">if</span> not self.validate_inputs():<br> <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">return</span><br><br> input_file = 
self.input_path.get()<br> output_file = self.output_path.get()<br> ext = self.get_file_extension(input_file)<br><br> <span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># 检查是否覆盖原文件</span><br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">if</span> input_file == output_file:<br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">if</span> not messagebox.askyesno(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"确认覆盖"</span>,<br> <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"输出文件与输入文件相同,将覆盖原始文件。\n\n是否继续?"</span>,<br> icon=<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"warning"</span>):<br> <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">return</span><br><br> try:<br> lines = []<br> original_count = 0<br><br> <span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># 根据文件格式提取内容</span><br> self.status_var.set(f<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"正在处理 {ext.upper()} 文件..."</span>)<br> self.root.update()<br><br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">if</span> ext == <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"txt"</span>:<br> with open(input_file, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">'r'</span>, encoding=<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">'utf-8'</span>) as f:<br> lines = [line.rstrip(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">'\n'</span>) <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">for</span> line <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">in</span> f]<br> original_count = len(lines)<br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">elif</span> ext == <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"doc"</span>:<br> lines, original_count = 
self.extract_text_from_doc(input_file)<br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">elif</span> ext == <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"docx"</span>:<br> lines, original_count = self.extract_text_from_docx(input_file)<br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">elif</span> ext <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">in</span> [<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"xls"</span>, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"xlsx"</span>]:<br> lines, original_count = self.extract_text_from_excel(input_file)<br><br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">if</span> not lines:<br> messagebox.showwarning(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"内容为空"</span>, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"未提取到任何内容,文件可能为空或格式不受支持"</span>)<br> <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">return</span><br><br> <span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># 去重文本</span><br> unique_lines, unique_count = self.deduplicate_text(lines)<br><br> <span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># 保存结果</span><br> success = self.save_dedup_result(unique_lines, input_file, output_file)<br><br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">if</span> not success:<br> <span class="hljs-built_in" style="color: rgba(230, 192, 123, 1); line-height: 26px">return</span><br><br> <span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># 显示结果</span><br> 
self.result_text.config(state=tk.NORMAL)<br> self.result_text.delete(1.0, tk.END)<br><br> <span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># 结果标题</span><br> self.result_text.tag_config(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"success"</span>, foreground=<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"#27ae60"</span>, font=(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"微软雅黑"</span>, 11, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"bold"</span>))<br> self.result_text.insert(tk.END, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"✓ 去重操作成功完成!\n\n"</span>, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"success"</span>)<br><br> <span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># 统计信息</span><br> self.result_text.tag_config(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"stats"</span>, foreground=<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"#e74c3c"</span>, font=(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"微软雅黑"</span>, 10))<br> self.result_text.insert(tk.END, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"处理结果统计:\n"</span>, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"stats"</span>)<br> self.result_text.insert(tk.END, f<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"原始行数: {original_count}\n"</span>)<br> self.result_text.insert(tk.END, f<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"去重后行数: {len(unique_lines)}\n"</span>)<br> self.result_text.insert(tk.END, f<span class="hljs-string" 
style="color: rgba(152, 195, 121, 1); line-height: 26px">"移除重复行数: {original_count - len(unique_lines)}\n\n"</span>)<br><br> <span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># 文件信息</span><br> self.result_text.tag_config(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"file"</span>, foreground=<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"#3498db"</span>, font=(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"微软雅黑"</span>, 9, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"bold"</span>))<br> self.result_text.insert(tk.END, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"文件信息:\n"</span>, <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"file"</span>)<br> self.result_text.insert(tk.END, f<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"输入文件: {os.path.basename(input_file)}\n"</span>)<br> self.result_text.insert(tk.END, f<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"输出文件: {os.path.basename(output_file)}\n"</span>)<br> self.result_text.insert(tk.END, f<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"输出路径: {os.path.dirname(output_file)}\n"</span>)<br><br> self.result_text.config(state=tk.DISABLED)<br><br> self.status_var.set(f<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"去重完成!移除了 {original_count - len(unique_lines)} 行重复内容"</span>)<br><br> <span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># 显示成功对话框</span><br> messagebox.showinfo(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"操作成功"</span>,<br> f<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 
26px">"文件去重操作成功完成!\n\n"</span><br> f<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"格式: {ext.upper()}\n"</span><br> f<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"原始行数: {original_count}\n"</span><br> f<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"去重后行数: {len(unique_lines)}\n"</span><br> f<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"移除了 {original_count - len(unique_lines)} 行重复内容"</span>)<br><br> except Exception as e:<br> messagebox.showerror(<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"处理错误"</span>, f<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"处理文件时发生错误:\n{str(e)}"</span>)<br> self.status_var.set(f<span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"错误: {str(e)}"</span>)<br><br><br>def center_window(window, width=None, height=None):<br> <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">""</span><span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"居中窗口"</span><span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">""</span><br> window.update_idletasks()<br> screen_width = window.winfo_screenwidth()<br> screen_height = window.winfo_screenheight()<br><br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">if</span> width is None:<br> width = window.winfo_width()<br> <span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">if</span> height is None:<br> height = window.winfo_height()<br><br> x = (screen_width - width) // 2<br> y = (screen_height - height) // 2 - 20 <span class="hljs-comment" style="color: rgba(92, 99, 112, 1); font-style: italic; line-height: 26px"># 稍微上移一些</span><br><br> window.geometry(f<span class="hljs-string" style="color: rgba(152, 195, 121, 
1); line-height: 26px">"{width}x{height}+{x}+{y}"</span>)<br><br><br><span class="hljs-keyword" style="color: rgba(198, 120, 221, 1); line-height: 26px">if</span> __name__ == <span class="hljs-string" style="color: rgba(152, 195, 121, 1); line-height: 26px">"__main__"</span>:<br> root = tk.Tk()<br> app = DeduplicationApp(root)<br> center_window(root, 800, 650)<br> root.mainloop()<br></code></pre>
</section><br><br>
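The order-preserving deduplication at the heart of the listing above can be tried in isolation. What follows is a minimal standalone sketch of the same idea rather than the tool's own code: the function name `dedupe_preserving_order` is hypothetical, and the key rule is taken from `deduplicate_text` (tab-separated rows are compared exactly after stripping, plain text lines are compared case-insensitively).

```python
from typing import Iterable, List


def dedupe_preserving_order(lines: Iterable[str]) -> List[str]:
    """Keep the first occurrence of each line, in input order.

    Hypothetical helper mirroring the key rule of deduplicate_text above.
    """
    seen = set()
    unique = []
    for line in lines:
        stripped = line.strip()
        # Rows that still contain a tab are compared exactly;
        # plain text lines are compared case-insensitively.
        key = stripped if "\t" in stripped else stripped.lower()
        if key not in seen:
            seen.add(key)
            unique.append(line)  # keep the original, unstripped line
    return unique


if __name__ == "__main__":
    sample = ["Apple", "apple ", "a\tB", "a\tb", "a\tB", "Pear"]
    print(dedupe_preserving_order(sample))
```

One detail to keep in mind: because the key is built after `strip()`, a row whose only tab is leading or trailing (for example `"a\t"`) loses that tab before the check and is then treated as plain text and lowercased. If exact matching matters for such rows, testing for the tab on the unstripped line may be the safer choice.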
Source: https://www.cnblogs.com/leyinsec/p/18992924