楼台听雨 posted on 2020-1-2 16:18:00

Reading Word docx content and handling images with PHP

<h2>Reading the text and images of a Word document with PHP, and saving them</h2>
<p>1. Install PHPWord with Composer</p>
<div class="cnblogs_code">
<pre>composer <span style="color: rgba(0, 0, 255, 1)">require</span> phpoffice/phpword</pre>
</div>
<p>Package page: https://packagist.org/packages/phpoffice/phpword</p>
<p>&nbsp;</p>
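<p>2. Read a docx document with PHPWord (<span style="color: rgba(255, 0, 0, 1)">note: the file must be in docx format; doc files will not work</span>)</p>
<p>If your file is in doc format, simply save it again as docx. If you have many doc files, you can download a batch conversion tool: http://www.batchwork.com/en/doc2doc/download.htm</p>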
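<p>To tell the two formats apart before loading, one option is to look at the first bytes of the file: a docx is a ZIP archive and starts with PK\x03\x04, while a legacy doc file starts with a different (OLE) signature. The sketch below is only a cheap pre-check under that assumption. The helper name looksLikeDocx is made up for this example and is not part of PHPWord, and a passing check only means the file is a ZIP archive, not that it is a valid docx.</p>
<div class="cnblogs_code">
<pre>// Hypothetical helper (not part of PHPWord): cheap format check before loading.
function looksLikeDocx($path)
{
    if (!is_file($path) || !is_readable($path)) {
        return false;
    }
    // Read only the first four bytes of the file
    $magic = file_get_contents($path, false, null, 0, 4);
    // docx is a ZIP container; ZIP local file headers start with "PK\x03\x04"
    return $magic === "PK\x03\x04";
}</pre>
</div>
<p>Usage could look like: looksLikeDocx($source) or exit('Please convert the file to docx first');</p>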
<p>If you have not set up autoloading yet, do that first:</p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 255, 1)">require</span> './vendor/autoload.php';</pre>
</div>
<p>Load the document:</p>
<div class="cnblogs_code">
<pre><span style="color: rgba(128, 0, 128, 1)">$dir</span> = <span style="color: rgba(0, 128, 128, 1)">str_replace</span>('\\', '/', __DIR__) . '/'<span style="color: rgba(0, 0, 0, 1)">;
</span><span style="color: rgba(128, 0, 128, 1)">$source</span> = <span style="color: rgba(128, 0, 128, 1)">$dir</span> . 'test.docx'<span style="color: rgba(0, 0, 0, 1)">;
</span><span style="color: rgba(128, 0, 128, 1)">$phpWord</span> = \PhpOffice\PhpWord\IOFactory::load(<span style="color: rgba(128, 0, 128, 1)">$source</span>);</pre>
</div>
<p>&nbsp;</p>
<p>3. Key points</p>
<p>1) Alignment: \PhpOffice\PhpWord\Style\Paragraph -&gt;&nbsp;getAlignment()</p>
<p>2) Font name: \PhpOffice\PhpWord\Style\Font -&gt;&nbsp;getName()</p>
<p>3) Font size: \PhpOffice\PhpWord\Style\Font -&gt;&nbsp;getSize()</p>
<p>4) Bold or not: \PhpOffice\PhpWord\Style\Font -&gt; isBold()</p>
<p>5) Reading an image: <span style="color: rgba(255, 0, 0, 1)">\PhpOffice\PhpWord\Element\Image -&gt;&nbsp;getImageStringData()</span></p>
<p>6) Saving base64 image data as an image file: <span style="color: rgba(255, 0, 0, 1)">file_put_contents($imageSrc, base64_decode($imageData))</span></p>
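<p>Point 6 can be wrapped in a small helper that also handles two things the one-liner skips: invalid base64 input and a missing target directory. This is a minimal sketch. The name saveBase64Image is hypothetical, and the directory permissions (0755) are an assumption you may want to change.</p>
<div class="cnblogs_code">
<pre>// Hypothetical helper: decode base64 image data and write it to $path.
function saveBase64Image($imageData, $path)
{
    // Strict mode returns false if the input contains characters outside the base64 alphabet
    $binary = base64_decode($imageData, true);
    if ($binary === false) {
        return false;
    }
    $dir = dirname($path);
    // Create the target directory if it does not exist yet
    if (!is_dir($dir) &amp;&amp; !mkdir($dir, 0755, true)) {
        return false;
    }
    return file_put_contents($path, $binary) !== false;
}</pre>
</div>
<p>It returns true on success and false otherwise, so the caller can decide whether to skip the image or abort.</p>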
<p>&nbsp;</p>
<p>4. Complete code</p>
<div class="cnblogs_code">
<pre>require './vendor/autoload.php';

function docx2html($source)
{
    $phpWord = \PhpOffice\PhpWord\IOFactory::load($source);
    // Make sure the image output directory exists before writing into it
    is_dir('images') || mkdir('images', 0755, true);
    $html = '';
    foreach ($phpWord-&gt;getSections() as $section) {
        foreach ($section-&gt;getElements() as $ele1) {
            // Not every element type has getParagraphStyle() (tables, for example)
            $paragraphStyle = method_exists($ele1, 'getParagraphStyle') ? $ele1-&gt;getParagraphStyle() : null;
            if (is_object($paragraphStyle)) {
                // Word's "both" alignment corresponds to "justify" in CSS
                $align = $paragraphStyle-&gt;getAlignment() === 'both' ? 'justify' : $paragraphStyle-&gt;getAlignment();
                $html .= '&lt;p style="text-align:' . $align . ';text-indent:20px;"&gt;';
            } else {
                $html .= '&lt;p&gt;';
            }
            if ($ele1 instanceof \PhpOffice\PhpWord\Element\TextRun) {
                foreach ($ele1-&gt;getElements() as $ele2) {
                    if ($ele2 instanceof \PhpOffice\PhpWord\Element\Text) {
                        $style = $ele2-&gt;getFontStyle();
                        $styleString = '';
                        // getFontStyle() may return a style name or null instead of a Font object
                        if ($style instanceof \PhpOffice\PhpWord\Style\Font) {
                            $fontFamily = $style-&gt;getName();
                            $fontSize = $style-&gt;getSize();
                            $isBold = $style-&gt;isBold();
                            $fontFamily &amp;&amp; $styleString .= "font-family:{$fontFamily};";
                            // PHPWord font sizes are in points, not pixels
                            $fontSize &amp;&amp; $styleString .= "font-size:{$fontSize}pt;";
                            $isBold &amp;&amp; $styleString .= "font-weight:bold;";
                        }
                        // The strings PHPWord returns are already UTF-8, so no charset conversion is needed;
                        // escape them so characters like &lt; and &amp; do not break the markup
                        $html .= sprintf('&lt;span style="%s"&gt;%s&lt;/span&gt;',
                            htmlspecialchars($styleString),
                            htmlspecialchars((string) $ele2-&gt;getText())
                        );
                    } elseif ($ele2 instanceof \PhpOffice\PhpWord\Element\Image) {
                        $imageSrc = 'images/' . md5($ele2-&gt;getSource()) . '.' . $ele2-&gt;getImageExtension();
                        // Passing true returns the image data base64-encoded
                        $imageData = $ele2-&gt;getImageStringData(true);
                        // $imageData = 'data:' . $ele2-&gt;getImageType() . ';base64,' . $imageData;
                        file_put_contents($imageSrc, base64_decode($imageData));
                        $html .= '&lt;img src="'. $imageSrc .'" style="width:100%;height:auto"&gt;';
                    }
                }
            }
            $html .= '&lt;/p&gt;';
        }
    }

    return $html;
}

$dir = str_replace('\\', '/', __DIR__) . '/';
$source = $dir . 'test.docx';
echo docx2html($source);</pre>
</div>
<p>&nbsp;</p>
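<p>5. Additional notes</p>
<p>Clearly, this is a bare-bones example of reading a Word file. It only reads the paragraph alignment, the font name, size, and bold flag of the text, and the images. Everything else, such as text color and line height, is ignored. If you need more, browse the PHPWord source and check which getter methods the classes under \PhpOffice\PhpWord\Style\xxx&nbsp;and&nbsp;\PhpOffice\PhpWord\Element\xxx&nbsp;provide.</p>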
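<p>As an illustration of reading a couple more properties, the sketch below turns the text color and italic flag into CSS. It assumes the object passed in is a \PhpOffice\PhpWord\Style\Font and uses its getColor() and isItalic() getters. The function name extraFontCss is made up for this example, and the check for a 6-digit hex color is an assumption about what you want to emit (values such as "auto" are skipped).</p>
<div class="cnblogs_code">
<pre>// Sketch: turn a few more Font getters into CSS.
// $style is expected to be a \PhpOffice\PhpWord\Style\Font instance.
function extraFontCss($style)
{
    $css = '';
    $color = $style-&gt;getColor();
    // Skip values such as "auto"; only 6-digit hex colors are emitted
    if (is_string($color) &amp;&amp; preg_match('/^[0-9A-Fa-f]{6}$/', $color)) {
        $css .= "color:#{$color};";
    }
    if ($style-&gt;isItalic()) {
        $css .= 'font-style:italic;';
    }
    return $css;
}</pre>
</div>
<p>In the complete code above, the result could be appended to $styleString once $style is known to be a Font object.</p>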
<p>&nbsp;</p>
<p>6. Addendum, 2020-07-21</p>
<p>The following gives you the complete HTML directly:</p>
<div class="cnblogs_code">
<pre><span style="color: rgba(128, 0, 128, 1)">$phpWord</span> = \PhpOffice\PhpWord\IOFactory::load('xxx.docx'<span style="color: rgba(0, 0, 0, 1)">);
</span><span style="color: rgba(128, 0, 128, 1)">$xmlWriter</span> = \PhpOffice\PhpWord\IOFactory::createWriter(<span style="color: rgba(128, 0, 128, 1)">$phpWord</span>, "HTML"<span style="color: rgba(0, 0, 0, 1)">);
</span><span style="color: rgba(128, 0, 128, 1)">$html</span> = <span style="color: rgba(128, 0, 128, 1)">$xmlWriter</span>-&gt;getContent();</pre>
</div>
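<p>Note: the HTML includes the head section, so if you only need the style and body you have to extract them yourself. The images are embedded as base64, so if you want to save them as files you also have to handle that yourself.</p>
<p>For saving base64 data as an image file, see the code above.</p>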
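<p>One possible way to handle this is to post-process the HTML string: find each base64 data URI in an img tag, write it to a file, and point src at that file. The sketch below assumes the images appear as src="data:image/...;base64,..." with double quotes. The helper name saveEmbeddedImages and the content-hash file naming are choices made for this example only.</p>
<div class="cnblogs_code">
<pre>// Hypothetical helper: replace base64 data URIs in &lt;img src="..."&gt; with saved files.
function saveEmbeddedImages($html, $dir)
{
    is_dir($dir) || mkdir($dir, 0755, true);
    $pattern = '#src="data:image/([a-zA-Z0-9]+);base64,([^"]++)"#';
    $result = preg_replace_callback($pattern, function ($m) use ($dir) {
        $binary = base64_decode($m[2], true);
        if ($binary === false) {
            return $m[0]; // leave the attribute untouched if the data is not valid base64
        }
        // Name the file after a hash of its content; map "jpeg" to the usual "jpg" extension
        $ext = $m[1] === 'jpeg' ? 'jpg' : $m[1];
        $file = rtrim($dir, '/') . '/' . md5($binary) . '.' . $ext;
        file_put_contents($file, $binary);
        return 'src="' . $file . '"';
    }, $html);
    // preg_replace_callback() returns null on a PCRE error; fall back to the original HTML
    return $result === null ? $html : $result;
}</pre>
</div>
<p>For example: $html = saveEmbeddedImages($html, 'images'); Identical images end up in the same file because the name is derived from the content.</p>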
<p>&nbsp;</p>
<p>If you only want the content inside body, see the write&nbsp;method in \PhpOffice\PhpWord\Writer\HTML\Part\Body:</p>
<div class="cnblogs_code">
<pre><span style="color: rgba(128, 0, 128, 1)">$phpWord</span> = \PhpOffice\PhpWord\IOFactory::load('xxxx.docx'<span style="color: rgba(0, 0, 0, 1)">);
</span><span style="color: rgba(128, 0, 128, 1)">$htmlWriter</span> = \PhpOffice\PhpWord\IOFactory::createWriter(<span style="color: rgba(128, 0, 128, 1)">$phpWord</span>, "HTML"<span style="color: rgba(0, 0, 0, 1)">);
</span><span style="color: rgba(128, 0, 128, 1)">$content</span> = ''<span style="color: rgba(0, 0, 0, 1)">;
</span><span style="color: rgba(0, 0, 255, 1)">foreach</span> (<span style="color: rgba(128, 0, 128, 1)">$phpWord</span>-&gt;getSections() <span style="color: rgba(0, 0, 255, 1)">as</span> <span style="color: rgba(128, 0, 128, 1)">$section</span><span style="color: rgba(0, 0, 0, 1)">) {
    </span><span style="color: rgba(128, 0, 128, 1)">$writer</span> = <span style="color: rgba(0, 0, 255, 1)">new</span> \PhpOffice\PhpWord\Writer\HTML\Element\Container(<span style="color: rgba(128, 0, 128, 1)">$htmlWriter</span>, <span style="color: rgba(128, 0, 128, 1)">$section</span><span style="color: rgba(0, 0, 0, 1)">);
    </span><span style="color: rgba(128, 0, 128, 1)">$content</span> .= <span style="color: rgba(128, 0, 128, 1)">$writer</span>-&gt;<span style="color: rgba(0, 0, 0, 1)">write();
}
</span><span style="color: rgba(0, 0, 255, 1)">echo</span> <span style="color: rgba(128, 0, 128, 1)">$content</span>;<span style="color: rgba(0, 0, 255, 1)">exit</span>;</pre>
</div>
<p>&nbsp;</p>
<p>As for the images, for now there is no good way to handle them without modifying the library source. If you do modify it, the relevant code is in \PhpOffice\PhpWord\Writer\HTML\Element\Image:</p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 255, 1)">public</span> <span style="color: rgba(0, 0, 255, 1)">function</span><span style="color: rgba(0, 0, 0, 1)"> write()
{
    </span><span style="color: rgba(0, 0, 255, 1)">if</span> (!<span style="color: rgba(128, 0, 128, 1)">$this</span>-&gt;<span style="color: rgba(0, 0, 0, 1)">element instanceof ImageElement) {
      </span><span style="color: rgba(0, 0, 255, 1)">return</span> ''<span style="color: rgba(0, 0, 0, 1)">;
    }
    </span><span style="color: rgba(128, 0, 128, 1)">$content</span> = ''<span style="color: rgba(0, 0, 0, 1)">;
    </span><span style="color: rgba(128, 0, 128, 1)">$imageData</span> = <span style="color: rgba(128, 0, 128, 1)">$this</span>-&gt;element-&gt;getImageStringData(<span style="color: rgba(0, 0, 255, 1)">true</span><span style="color: rgba(0, 0, 0, 1)">);
    </span><span style="color: rgba(0, 0, 255, 1)">if</span> (<span style="color: rgba(128, 0, 128, 1)">$imageData</span> !== <span style="color: rgba(0, 0, 255, 1)">null</span><span style="color: rgba(0, 0, 0, 1)">) {
      </span><span style="color: rgba(128, 0, 128, 1)">$styleWriter</span> = <span style="color: rgba(0, 0, 255, 1)">new</span> ImageStyleWriter(<span style="color: rgba(128, 0, 128, 1)">$this</span>-&gt;element-&gt;<span style="color: rgba(0, 0, 0, 1)">getStyle());
      </span><span style="color: rgba(128, 0, 128, 1)">$style</span> = <span style="color: rgba(128, 0, 128, 1)">$styleWriter</span>-&gt;<span style="color: rgba(0, 0, 0, 1)">write();
      </span><span style="color: rgba(0, 128, 0, 1)">//</span><span style="color: rgba(0, 128, 0, 1)"> $imageData = 'data:' . $this-&gt;element-&gt;getImageType() . ';base64,' . $imageData;</span>
      <span style="color: rgba(128, 0, 128, 1)">$imageSrc</span> = 'images/' . <span style="color: rgba(0, 128, 128, 1)">md5</span>(<span style="color: rgba(128, 0, 128, 1)">$this</span>-&gt;element-&gt;getSource()) . '.' . <span style="color: rgba(128, 0, 128, 1)">$this</span>-&gt;element-&gt;<span style="color: rgba(0, 0, 0, 1)">getImageExtension();
      </span><span style="color: rgba(0, 128, 0, 1)">//</span><span style="color: rgba(0, 128, 0, 1)"> Handle the image yourself here, e.g. upload it to OSS</span>
      <span style="color: rgba(0, 128, 128, 1)">file_put_contents</span>(<span style="color: rgba(128, 0, 128, 1)">$imageSrc</span>, <span style="color: rgba(0, 128, 128, 1)">base64_decode</span>(<span style="color: rgba(128, 0, 128, 1)">$imageData</span><span style="color: rgba(0, 0, 0, 1)">));

      </span><span style="color: rgba(128, 0, 128, 1)">$content</span> .= <span style="color: rgba(128, 0, 128, 1)">$this</span>-&gt;<span style="color: rgba(0, 0, 0, 1)">writeOpening();
      </span><span style="color: rgba(128, 0, 128, 1)">$content</span> .= "&lt;img border=\"0\" style=\"{<span style="color: rgba(128, 0, 128, 1)">$style</span>}\" src=\"{<span style="color: rgba(128, 0, 128, 1)">$imageSrc</span>}\"/&gt;"<span style="color: rgba(0, 0, 0, 1)">;
      </span><span style="color: rgba(128, 0, 128, 1)">$content</span> .= <span style="color: rgba(128, 0, 128, 1)">$this</span>-&gt;<span style="color: rgba(0, 0, 0, 1)">writeClosing();
    }

    </span><span style="color: rgba(0, 0, 255, 1)">return</span> <span style="color: rgba(128, 0, 128, 1)">$content</span><span style="color: rgba(0, 0, 0, 1)">;
}</span></pre>
</div>
<p>&nbsp;</p>
<hr>
<p>&nbsp;</p>
<p>The end.</p><br><br>
Source: https://www.cnblogs.com/tujia/p/12133615.html