Remove all Chinese text and punctuation from a document: for documents where the Chinese is not mixed into the English lines

Posted on 2023-07-14 In Computer , Software , Word

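Word's wildcard find-and-replace can replace Chinese characters, but it sometimes strips spaces as well, so it is not entirely safe. This post therefore uses a different logic: if a line (paragraph) contains any Chinese character, the whole line is deleted. The result is then compared against the original with the Compare function under Word's Review tab. The Python code that implements this can be obtained from GPT. The approach is especially suited to bilingual Chinese-English translation documents, where deleting the Chinese or the English lines yields a single-language version.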

1. Python implementation:

import docx

# Open the Word file
doc = docx.Document('your_file_name.docx')

# Loop through all paragraphs in the document
for para in doc.paragraphs:
    # Check if the paragraph contains any Chinese characters
    if any('\u4e00' <= c <= '\u9fff' for c in para.text):
        # If it does, remove the paragraph
        doc._element.body.remove(para._element)

# Save the modified document
doc.save('modified_file_name.docx')

2. Code explanation: This script uses the python-docx library to read and modify the Word document. The any() function is used to check if any Chinese character exists in the paragraph. If it does, the paragraph is removed from the document.

You need to replace 'your_file_name.docx' and 'modified_file_name.docx' with the actual names of your input and output files. Also, make sure to install the python-docx library before running the script.

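Note that this script only detects Chinese characters in the range U+4E00 to U+9FFF (the CJK Unified Ideographs block), so a paragraph that consists only of Chinese punctuation is not removed. If you want to remove paragraphs that contain any non-Latin characters, you can modify the condition in the if statement accordingly.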
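
One way to widen the condition is sketched below. It is only an illustration: the helper name `contains_chinese` and the choice of ranges are assumptions, not part of the original script. Besides the ideographs, it also treats the CJK Symbols and Punctuation block (U+3000 to U+303F) and the Halfwidth and Fullwidth Forms block (U+FF00 to U+FFEF) as Chinese. Punctuation shared with English typography, such as curly quotes, is deliberately left out.

```python
# Hypothetical helper; the ranges are chosen for illustration only.
CHINESE_RANGES = (
    ('\u4e00', '\u9fff'),  # CJK Unified Ideographs
    ('\u3000', '\u303f'),  # CJK Symbols and Punctuation, e.g. 。 、 《 》
    ('\uff00', '\uffef'),  # Halfwidth and Fullwidth Forms, e.g. ， ： ！
)


def contains_chinese(text):
    """Return True if any character of text falls inside one of the ranges."""
    return any(lo <= c <= hi for c in text for lo, hi in CHINESE_RANGES)
```

In the script, the condition would then read `if contains_chinese(para.text):`.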

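If the docx library is not installed, install it (the package name is python-docx). Note that if the computer has Python 3 installed, you need to use pip3 install: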

shawleo@ShawdeMacBook-Pro ~ % cd /Users/shawleo/Desktop 
shawleo@ShawdeMacBook-Pro Desktop % python3 clean-chinese-character.py
Traceback (most recent call last):
  File "/Users/shawleo/Desktop/clean-chinese-character.py", line 1, in <module>
    import docx
ModuleNotFoundError: No module named 'docx'
shawleo@ShawdeMacBook-Pro Desktop % pip install python-docx
zsh: command not found: pip
shawleo@ShawdeMacBook-Pro Desktop % pip3 install python-docx
Collecting python-docx
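
To avoid hitting the ModuleNotFoundError in the first place, the availability of the module can be checked before running the script. The following is a minimal sketch using only the standard library; the function name `is_installed` is made up for this example. It prints an install command tied to the interpreter that is currently running, which sidesteps the pip versus pip3 confusion.

```python
import importlib.util
import sys


def is_installed(module_name):
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(module_name) is not None


# The python-docx package is imported under the module name "docx".
if not is_installed("docx"):
    print(f"Install it with: {sys.executable} -m pip install python-docx")
```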

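3. To use the code to remove the Chinese, only the following line of the Python script needs to be changed. Then cd to the directory and run python3 clean-chinese-character.py. No file path is needed, because you have already changed into that directory. Note that the Word file and the .py file must be in the same directory.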

doc = docx.Document('1.docx')
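
If editing the script for every document becomes tedious, the file name can be passed on the command line instead. The following is a rough sketch of that variation, not the original script: the helper names and the `_no_chinese` output suffix are assumptions made for this example, and it removes each paragraph through its parent element rather than through `doc._element.body`.

```python
import sys
from pathlib import Path


def output_path_for(input_path):
    """Build an output name next to the input, e.g. 1.docx -> 1_no_chinese.docx."""
    p = Path(input_path)
    return p.with_name(p.stem + "_no_chinese" + p.suffix)


def strip_chinese_paragraphs(input_path, output_path):
    """Delete every paragraph that contains a character in U+4E00..U+9FFF."""
    import docx  # python-docx; imported here so the helper above needs no extra package

    doc = docx.Document(str(input_path))
    # Iterate over a copy of the list, since paragraphs are removed along the way.
    for para in list(doc.paragraphs):
        if any('\u4e00' <= c <= '\u9fff' for c in para.text):
            para._element.getparent().remove(para._element)
    doc.save(str(output_path))


def main(argv):
    # Expect exactly one argument that points to an existing file.
    if len(argv) != 1 or not Path(argv[0]).is_file():
        print("usage: python3 clean-chinese-character.py FILE.docx")
        return
    dst = output_path_for(argv[0])
    strip_chinese_paragraphs(argv[0], dst)
    print(f"saved {dst}")


if __name__ == "__main__":
    main(sys.argv[1:])
```

With this version, running `python3 clean-chinese-character.py 1.docx` from the same directory would write the result next to the input file and leave the original untouched.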

4. Check: use Word's Compare function to confirm that only the Chinese was deleted, just in case. The option buttons are shown in the figure below:
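
As a complement to the manual comparison in Word, the same check can be approximated in Python. The sketch below works on two lists of paragraph texts, which could be loaded with `[p.text for p in docx.Document(path).paragraphs]`. It is only an illustration under that assumption, and the function names are invented for this example. It lists every paragraph that disappeared and flags those that contain no Chinese character.

```python
import difflib


def removed_paragraphs(original, modified):
    """Return paragraphs of original that are missing from modified."""
    matcher = difflib.SequenceMatcher(a=original, b=modified, autojunk=False)
    removed = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            removed.extend(original[i1:i2])
    return removed


def has_chinese(text):
    """Same test as in the script: any character in U+4E00..U+9FFF."""
    return any('\u4e00' <= c <= '\u9fff' for c in text)


def unexpected_removals(original, modified):
    """Return removed paragraphs without Chinese; ideally this list is empty."""
    return [t for t in removed_paragraphs(original, modified) if not has_chinese(t)]
```

An empty result from `unexpected_removals` suggests that only paragraphs containing Chinese were deleted. It does not replace a visual check of the formatting in Word.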