王列 posted on 2025-11-20 19:04:00

LangChain Splitter Source Code Reading Notes (1): CharacterTextSplitter

<h2>I. TextSplitter</h2>
<p class="p1">TextSplitter inherits from BaseDocumentTransformer. It is an abstract class and cannot be instantiated directly.</p>
<p class="p2">&nbsp;</p>
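<p class="p1">Its core (internal) attributes are:</p>
<p class="p1">_chunk_size: the target size of each chunk, measured with _length_function</p>
<p class="p1">_chunk_overlap: the size of the overlap between adjacent chunks</p>
<p class="p1">_length_function: the function used to measure size; it can be a token-counting function or anything else, such as the plain len()</p>
<p class="p1">_keep_separator: Boolean, whether the separator is kept in the pieces after splitting</p>
<p class="p1">_add_start_index: Boolean, whether the metadata of each returned document stores the index, in the original document, of the chunk's first character</p>
<p class="p1">_strip_whitespace: Boolean, whether leading and trailing whitespace is stripped from each chunk after splitting</p>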
<p class="p2">&nbsp;</p>
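<p class="p1">To make _length_function concrete, here is a minimal sketch of two interchangeable length functions. The names word_count and fits are hypothetical helpers written for this note, and word_count is only a crude stand-in for a real token counter; neither comes from LangChain.</p>

```python
from typing import Callable


def word_count(text: str) -> int:
    """Hypothetical stand-in for a token counter: counts whitespace-separated words."""
    return len(text.split())


def fits(text: str, chunk_size: int, length_function: Callable[[str], int] = len) -> bool:
    """Return True if `text` is within `chunk_size` as measured by `length_function`."""
    return length_function(text) <= chunk_size


sample = "split me into chunks"
print(len(sample), word_count(sample))  # 20 4
print(fits(sample, 10))                 # False: 20 characters > 10
print(fits(sample, 10, word_count))     # True: 4 words <= 10
```

<p class="p1">The same chunk_size of 10 therefore means very different things depending on the unit that the length function counts.</p>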
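<p class="p1">Core methods:</p>
<p class="p1">split_text(self, text: str) -&gt; List[str]</p>
<p class="p1">The splitting method. It is abstract, so each concrete subclass must implement it according to its own splitting algorithm.</p>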
<p class="p1">create_documents(self, texts: list, metadatas: list) -&gt; list</p>
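<p class="p1">Takes the texts and optional metadata, calls split_text on each text, and returns the resulting chunks as Document objects: doc.page_content holds the chunk text and doc.metadata holds the metadata, with the start index stored automatically when _add_start_index is set.</p>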
<p class="p1">split_documents(self, documents: Iterable) -&gt; list</p>
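<p class="p1">Splits the given list of documents and returns the list of split documents. Internally it gathers the page_content and metadata of every document and hands them to create_documents, then returns the combined result.</p>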
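<p class="p1">The behaviour of create_documents with _add_start_index can be sketched in plain Python. This is a simplified illustration, not the library's code: create_documents_sketch is a hypothetical name, plain dicts stand in for Document, and the start index is found with a simple forward search using str.find.</p>

```python
import copy
from typing import Callable, Optional


def create_documents_sketch(
    texts: list[str],
    split_text: Callable[[str], list[str]],
    metadatas: Optional[list[dict]] = None,
    add_start_index: bool = False,
) -> list[dict]:
    """Split every text and pair each chunk with a copy of that text's metadata."""
    metadatas = metadatas or [{} for _ in texts]
    documents = []
    for text, base_metadata in zip(texts, metadatas):
        search_from = 0
        for chunk in split_text(text):
            metadata = copy.deepcopy(base_metadata)
            if add_start_index:
                start = text.find(chunk, search_from)
                metadata["start_index"] = start
                # Later chunks begin after this one's start, even if they overlap it.
                search_from = max(start, 0) + 1
            documents.append({"page_content": chunk, "metadata": metadata})
    return documents


docs = create_documents_sketch(
    ["alpha beta gamma"],
    split_text=lambda text: text.split(" "),  # toy splitter, an assumption for the demo
    metadatas=[{"source": "demo.txt"}],
    add_start_index=True,
)
for doc in docs:
    print(doc)
# {'page_content': 'alpha', 'metadata': {'source': 'demo.txt', 'start_index': 0}}
# {'page_content': 'beta', 'metadata': {'source': 'demo.txt', 'start_index': 6}}
# {'page_content': 'gamma', 'metadata': {'source': 'demo.txt', 'start_index': 11}}
```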
<p class="p1">--------Internal methods below---------</p>
<p class="p1">_join_docs(self, docs: list, separator: str) -&gt; str</p>
<p class="p1">Note that the docs parameter here is a list of strings. The method joins that list into one long string using the given separator, and is used by _merge_splits below.</p>
<p class="p1">_merge_splits(self, splits: Iterable, separator: str) -&gt; list</p>
<p class="p1">Merges pieces that were split too finely into chunks closer to self._chunk_size, and makes sure adjacent chunks share overlapping content of up to self._chunk_overlap.</p>
<div class="cnblogs_code">
<div id="cnblogs_code_open_4e5182dd-740d-4ea2-b5f9-494d1579b24e" class="cnblogs_code_hide">
<pre>def _merge_splits(self, splits: Iterable[str], separator: str) -&gt; list[str]:
    # We now want to combine these smaller pieces into medium size
    # chunks to send to the LLM.
    separator_len = self._length_function(separator)

    docs = []
    current_doc: list[str] = []
    total = 0
    for d in splits:
        len_ = self._length_function(d)
        # By default d is simply appended to current_doc; once the check
        # below trips, the accumulated pieces are flushed into docs first.
        if (
            total + len_ + (separator_len if len(current_doc) &gt; 0 else 0)
            &gt; self._chunk_size
        ):
            if total &gt; self._chunk_size:
                logger.warning(
                    "Created a chunk of size %d, which is longer than the "
                    "specified %d",
                    total,
                    self._chunk_size,
                )
            if len(current_doc) &gt; 0:
                doc = self._join_docs(current_doc, separator)
                if doc is not None:
                    docs.append(doc)
                # Keep on popping if:
                # - we have a larger chunk than in the chunk overlap
                # - or if we still have any chunks and the length is long
                while total &gt; self._chunk_overlap or (
                    total + len_ + (separator_len if len(current_doc) &gt; 0 else 0)
                    &gt; self._chunk_size
                    and total &gt; 0
                ):
                    total -= self._length_function(current_doc[0]) + (
                        separator_len if len(current_doc) &gt; 1 else 0
                    )
                    current_doc = current_doc[1:]
        current_doc.append(d)
        total += len_ + (separator_len if len(current_doc) &gt; 1 else 0)
    doc = self._join_docs(current_doc, separator)
    if doc is not None:
        docs.append(doc)
    return docs</pre>
</div>
</div>
<p class="p1">The core idea of this method: whenever appending the next piece would push current_doc past _chunk_size, the pieces in current_doc are first joined and appended to docs. After that, current_doc is not simply cleared. Instead, pieces are removed from its head one at a time until the remaining length is no greater than _chunk_overlap and the next piece fits. Whatever is left in current_doc becomes the start of the next chunk, and is also the overlap between the two chunks.</p>
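<p class="p1">A small worked example makes the overlap visible. The function below is a standalone sketch of the same loop, with assumptions stated plainly: merge_splits is a hypothetical free function rather than the method itself, the warning log is omitted, and a plain separator.join replaces _join_docs (so no whitespace stripping).</p>

```python
def merge_splits(splits, separator, chunk_size, chunk_overlap, length_function=len):
    """Greedy merge of small pieces into chunks, keeping a tail as overlap."""
    separator_len = length_function(separator)
    docs, current_doc, total = [], [], 0
    for d in splits:
        len_ = length_function(d)
        if total + len_ + (separator_len if current_doc else 0) > chunk_size:
            if current_doc:
                docs.append(separator.join(current_doc))
                # Drop pieces from the head until only the overlap remains
                # and the next piece fits.
                while total > chunk_overlap or (
                    total + len_ + (separator_len if current_doc else 0) > chunk_size
                    and total > 0
                ):
                    total -= length_function(current_doc[0]) + (
                        separator_len if len(current_doc) > 1 else 0
                    )
                    current_doc = current_doc[1:]
        current_doc.append(d)
        total += len_ + (separator_len if len(current_doc) > 1 else 0)
    if current_doc:
        docs.append(separator.join(current_doc))
    return docs


pieces = ["aaa", "bbb", "ccc", "ddd"]
print(merge_splits(pieces, " ", chunk_size=7, chunk_overlap=3))
# ['aaa bbb', 'bbb ccc', 'ccc ddd']
print(merge_splits(pieces, " ", chunk_size=7, chunk_overlap=0))
# ['aaa bbb', 'ccc ddd']
```

<p class="p1">With an overlap of 3, one three-character piece survives each flush and reappears at the head of the next chunk; with an overlap of 0, current_doc is emptied completely.</p>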
<h2 class="p1">II. CharacterTextSplitter</h2>
<p class="p1">This class inherits from the TextSplitter described above and adds two attributes: separator, and is_separator_regex (whether the separator is a regular expression). It implements the parent's abstract method split_text.</p>
<p class="p1">Its split_text method calls a module-level helper, _split_text_with_regex(), to split the incoming text. The code first:</p>
<div class="cnblogs_code">
<div id="cnblogs_code_open_fda990b5-22ad-44e5-9aae-15dba7b0c0ad" class="cnblogs_code_hide">
<pre>import re
from typing import Literal


# Inside the CharacterTextSplitter class
def split_text(self, text: str) -&gt; list[str]:
    """Split into chunks without re-inserting lookaround separators."""
    # 1. Determine split pattern: raw regex or escaped literal
    sep_pattern = (
        self._separator if self._is_separator_regex else re.escape(self._separator)
    )

    # 2. Initial split (keep separator if requested)
    splits = _split_text_with_regex(
        text, sep_pattern, keep_separator=self._keep_separator
    )

    # 3. Detect zero-width lookaround so we never re-insert it
    lookaround_prefixes = ("(?=", "(?&lt;!", "(?&lt;=", "(?!")
    is_lookaround = self._is_separator_regex and any(
        self._separator.startswith(p) for p in lookaround_prefixes
    )

    # 4. Decide merge separator:
    #    - if keep_separator or lookaround -&gt; don't re-insert
    #    - else -&gt; re-insert literal separator
    merge_sep = ""
    if not (self._keep_separator or is_lookaround):
        merge_sep = self._separator

    # 5. Merge adjacent splits and return
    return self._merge_splits(splits, merge_sep)


# Module-level helper, outside the class
def _split_text_with_regex(
    text: str, separator: str, *, keep_separator: bool | Literal["start", "end"]
) -&gt; list[str]:
    # Now that we have the separator, split the text
    if separator:
        if keep_separator:
            # The parentheses in the pattern keep the delimiters in the result.
            splits_ = re.split(f"({separator})", text)
            splits = (
                ([splits_[i] + splits_[i + 1] for i in range(0, len(splits_) - 1, 2)])
                if keep_separator == "end"
                else ([splits_[i] + splits_[i + 1] for i in range(1, len(splits_), 2)])
            )
            if len(splits_) % 2 == 0:
                splits += splits_[-1:]
            splits = (
                ([*splits, splits_[-1]])
                if keep_separator == "end"
                else ([splits_[0], *splits])
            )
        else:
            splits = re.split(separator, text)
    else:
        splits = list(text)
    # Drop empty pieces
    return [s for s in splits if s]</pre>
</div>
</div>
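<p>Leaving separator retention aside, the method is quite simple: it splits the incoming text on the separator with re.split, then calls the parent's _merge_splits() to stitch the pieces into suitably sized chunks, and returns the list.</p>
<p>1. Pre-processing before the split</p>
<p>If the separator is a literal string rather than a regular expression, it has to be escaped before re.split is called, so that characters with a special meaning in regex syntax are matched literally instead of producing a wrong or invalid pattern.</p>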
<div>
<div>
<div class="cnblogs_code">
<div id="cnblogs_code_open_d0d5fa72-181a-4bff-bed5-1c17fd65113e" class="cnblogs_code_hide">
<pre># 1. Determine split pattern: raw regex or escaped literal
sep_pattern = (
    self._separator if self._is_separator_regex else re.escape(self._separator)
)</pre>
</div>
</div>
<p>&nbsp;</p>
</div>
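<p>A quick way to see why the escaping matters is to split on a dot with and without it. This is a small illustration using only the standard re module; build_pattern is a hypothetical helper that mirrors the expression above.</p>

```python
import re


def build_pattern(separator: str, is_separator_regex: bool) -> str:
    """Use the separator as-is if it is a regex, otherwise escape it."""
    return separator if is_separator_regex else re.escape(separator)


text = "a.b.c"

# Unescaped, "." is the regex wildcard, so every character acts as a separator.
print(re.split(".", text))                        # ['', '', '', '', '', '']

# Escaped, "." only matches a literal dot.
print(build_pattern(".", False))                  # \.
print(re.split(build_pattern(".", False), text))  # ['a', 'b', 'c']
```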
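<div>2. Splitting with _split_text_with_regex()</div>
<div>If the separator does not need to be kept in the result, a single line does the job:
<div>splits = re.split(separator, text)</div>
<div>If the separator must be kept, the pattern is wrapped in parentheses (a capture group), so the splits returned by re.split also contain the separators, alternating between text and separator.</div>
<div>The more involved part that follows decides whether each separator is attached to the start or to the end of a piece.</div>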
<div>
<div class="cnblogs_code">
<div id="cnblogs_code_open_c39f8e82-3790-43c7-816d-5cf215b0066e" class="cnblogs_code_hide">
<pre>splits = (
    ([splits_[i] + splits_[i + 1] for i in range(0, len(splits_) - 1, 2)])
    if keep_separator == "end"
    else ([splits_[i] + splits_[i + 1] for i in range(1, len(splits_), 2)])
)
if len(splits_) % 2 == 0:
    splits += splits_[-1:]
splits = (
    ([*splits, splits_[-1]])
    if keep_separator == "end"
    else ([splits_[0], *splits])
)</pre>
</div>
</div>
<p>&nbsp;</p>
</div>
</div>
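<p>The start/end pairing is easier to follow on a concrete input. The sketch below is a rewritten, standalone variant of the same idea, not the library function: split_keep_separator is a hypothetical name, and it assumes the pattern contains no capture groups of its own, so re.split returns an odd-length list of the form [text, sep, text, ..., text].</p>

```python
import re


def split_keep_separator(text: str, pattern: str, keep: str = "start") -> list[str]:
    """Split on `pattern`, gluing each separator to the piece before or after it."""
    parts = re.split(f"({pattern})", text)  # [text, sep, text, sep, ..., text]
    if keep == "end":
        # Pair each text with the separator that follows it.
        pieces = [parts[i] + parts[i + 1] for i in range(0, len(parts) - 1, 2)]
        pieces.append(parts[-1])
    else:
        # Pair each separator with the text that follows it.
        pieces = [parts[0]]
        pieces += [parts[i] + parts[i + 1] for i in range(1, len(parts), 2)]
    return [p for p in pieces if p]


print(re.split("(,)", "a,b,c"))                  # ['a', ',', 'b', ',', 'c']
print(split_keep_separator("a,b,c", ",", "end"))    # ['a,', 'b,', 'c']
print(split_keep_separator("a,b,c", ",", "start"))  # ['a', ',b', ',c']
```

<p>Either way, joining the pieces with an empty string gives back the original text, which is why the merge step does not re-insert the separator in this mode.</p>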
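<div>3. Calling the merge method</div>
<div>Before the merge method is called, the joining string merge_sep used during merging has to be chosen. If the separator was kept during the split, or the regex turns out to be a zero-width lookaround assertion, nothing needs to be re-inserted when merging; otherwise merge_sep is the same as the separator.</div>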
<div>
<div class="cnblogs_code">
<div id="cnblogs_code_open_cf284495-9485-431d-aa6f-bf3bdd4f18f2" class="cnblogs_code_hide">
<pre>merge_sep = ""
if not (self._keep_separator or is_lookaround):
    merge_sep = self._separator</pre>
</div>
</div>
<p>&nbsp;</p>
</div>
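<p>The lookaround case can be checked with a short experiment. In the sketch below, choose_merge_sep is a hypothetical helper that mirrors the decision above. A lookbehind such as (?&lt;=\.) matches an empty position, so re.split (Python 3.7 or later) removes nothing from the text, and re-inserting the raw pattern when merging would corrupt the chunks.</p>

```python
import re

LOOKAROUND_PREFIXES = ("(?=", "(?<!", "(?<=", "(?!")


def choose_merge_sep(separator: str, is_regex: bool, keep_separator: bool) -> str:
    """Return the string used to re-join pieces during merging."""
    is_lookaround = is_regex and separator.startswith(LOOKAROUND_PREFIXES)
    return "" if (keep_separator or is_lookaround) else separator


text = "a.b.c"
pattern = r"(?<=\.)"  # zero-width: matches the position right after each dot

pieces = re.split(pattern, text)
print(pieces)                                               # ['a.', 'b.', 'c']
print(choose_merge_sep(pattern, True, False).join(pieces))  # a.b.c
print(pattern.join(pieces))                                 # a.(?<=\.)b.(?<=\.)c

# A consumed literal separator, by contrast, has to be put back.
print(repr(choose_merge_sep(", ", False, False)))           # ', '
print(repr(choose_merge_sep(", ", False, True)))            # ''
```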
</div>
<p class="p1">&nbsp;</p>
<p class="p1">&nbsp;</p><br><br>
Source: https://www.cnblogs.com/nanimono/p/19248329