
Reading different file types with Python

1. PDF

1.1. Package name

pdfminer

1.2. Installing the package

pip install pdfminer

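1.3. Code

Run the pdf2txt.py script that ships with pdfminer from the command line. Quote the file name, because it contains parentheses:

pdf2txt.py -o b.txt "EDC-957-0374-33-0(OK).pdf"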
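The same conversion can also be done from inside a Python script. The following is a minimal sketch, assuming the maintained fork pdfminer.six is installed (pip install pdfminer.six), which provides pdfminer.high_level.extract_text; the older pdfminer package may not include it. The helper names output_path_for and convert_pdf are hypothetical.

```python
from pathlib import Path


def output_path_for(pdf_path):
    """Return the path of a .txt file sitting next to the given PDF."""
    return Path(pdf_path).with_suffix(".txt")


def convert_pdf(pdf_path):
    """Extract the text of one PDF and save it as UTF-8 text."""
    # Assumption: pdfminer.six is installed. It is imported here so the
    # path helper above still works without it.
    from pdfminer.high_level import extract_text

    out_path = output_path_for(pdf_path)
    out_path.write_text(extract_text(str(pdf_path)), encoding="utf-8")
    return out_path

# Example: convert_pdf("EDC-957-0374-33-0(OK).pdf")
```

Calling convert_pdf writes the text file next to the PDF and returns its path, so it can be used in a loop over a whole folder.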

2. OCR(jpg…)

2.1. Installation

2.1.1. Download and run the installer

https://github.com/UB-Mannheim/tesseract/wiki

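During installation you can tick the languages you want to OCR; Traditional Chinese is included in the list.
You can also download the latest language data files yourself from:

https://github.com/tesseract-ocr/tessdata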

The language files should end up in:

C:\Program Files\Tesseract-OCR\tessdata
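To check which languages are actually available, you can list the .traineddata files in that folder. This is a small sketch using only the standard library; the function name installed_languages is hypothetical, and the folder is passed in because your install location may differ.

```python
from pathlib import Path


def installed_languages(tessdata_dir):
    """Return the language codes that have a .traineddata file in tessdata_dir."""
    return sorted(p.stem for p in Path(tessdata_dir).glob("*.traineddata"))

# Example (path taken from the text above; adjust it to your install):
# print(installed_languages(r"C:\Program Files\Tesseract-OCR\tessdata"))
```

The returned codes (for example eng or chi_tra) are the values you pass as lang when calling pytesseract.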

2.1.2. Installing the packages

pip install pillow
pip install pytesseract

2.2. OCR code

from PIL import Image
import pytesseract

def main():
    # Location of the tesseract.exe executable
    pytesseract.pytesseract.tesseract_cmd = r'XXXXXX\Tesseract-OCR\tesseract.exe'
    img = Image.open('XXXXXXXX/XXXX.png')  # path to the image file
    text = pytesseract.image_to_string(img, lang='eng')  # English
    #text = pytesseract.image_to_string(img, lang='chi_sim')  # Simplified Chinese
    #text = pytesseract.image_to_string(img, lang='chi_tra')  # Traditional Chinese
    print(text)

if __name__ == '__main__':
    main()
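Images that mix Chinese and English can be read in one pass, because Tesseract accepts several language codes joined with "+". The sketch below wraps that in two hypothetical helpers, join_languages and ocr_file. It assumes pillow and pytesseract are installed and that tesseract_cmd has been set as in the code above if tesseract.exe is not on your PATH.

```python
def join_languages(languages):
    """Build the lang value, e.g. ['chi_tra', 'eng'] becomes 'chi_tra+eng'."""
    return "+".join(languages)


def ocr_file(image_path, languages=("eng",)):
    """Run OCR on one image file with one or more languages."""
    # Imported here so join_languages can be used without the OCR stack installed.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img, lang=join_languages(languages))

# Example: text = ocr_file('XXXXXXXX/XXXX.png', ["chi_tra", "eng"])
```

Each language in the list needs its .traineddata file in the tessdata folder.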

3. Word(Docx)

3.1. Installing the Word package

pip install python-docx

3.2. Converting Word .doc files to .docx

  1. First, put all the files to be processed in one folder.
  2. Open Word and press "Alt+F11" to open the VBA editor.
  3. Click the "Normal" project, then click "Insert".
  4. Choose "Module" to insert a new module into the project.
  5. Double-click the module to open the editing area and paste the following code:
Sub TranslateDocIntoDocx()
    Dim objWordApplication As New Word.Application
    Dim objWordDocument As Word.Document
    Dim strFile As String
    Dim strFolder As String

    strFolder = "E:\Temp\"
    strFile = Dir(strFolder & "*.doc", vbNormal)

    While strFile <> ""
        ' Dir("*.doc") can also return .docx files, so check the extension
        If LCase(Right(strFile, 4)) = ".doc" Then
            With objWordApplication
                Set objWordDocument = .Documents.Open(FileName:=strFolder & strFile, AddToRecentFiles:=False, ReadOnly:=True, Visible:=False)

                With objWordDocument
                    ' Append "x" rather than Replace(), which would also change "doc" inside the file name
                    .SaveAs FileName:=strFolder & strFile & "x", FileFormat:=16
                    .Close
                End With
            End With
        End If
        strFile = Dir()
    Wend

    ' Close the hidden Word instance created above
    objWordApplication.Quit
    Set objWordDocument = Nothing
    Set objWordApplication = Nothing
End Sub

  6. Finally, click the "Run" button.
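If you would rather stay in Python than use the VBA editor, the same loop can be sketched with the pywin32 package (pip install pywin32), which drives Word through COM automation. Treat this as a sketch under assumptions: it needs Windows with Microsoft Word installed, the function names conversion_pairs and convert_folder are hypothetical, and FileFormat=16 is simply the same value the macro uses.

```python
from pathlib import Path


def conversion_pairs(folder):
    """Return (source .doc, target .docx) path pairs for one folder."""
    folder = Path(folder)
    return [
        (p, p.with_suffix(".docx"))
        for p in sorted(folder.iterdir())
        # Match the extension exactly so existing .docx files are skipped
        if p.is_file() and p.suffix.lower() == ".doc"
    ]


def convert_folder(folder):
    """Open every .doc in folder with Word and save a .docx copy."""
    # Assumption: pywin32 and Microsoft Word are installed (Windows only).
    import win32com.client

    word = win32com.client.Dispatch("Word.Application")
    word.Visible = False
    try:
        for src, dst in conversion_pairs(folder):
            doc = word.Documents.Open(str(src.resolve()), ReadOnly=True)
            doc.SaveAs(str(dst.resolve()), FileFormat=16)  # same format value as the macro
            doc.Close()
    finally:
        # Always close the Word instance, even if a file fails to convert
        word.Quit()

# Example: convert_folder("E:/Temp")
```

Building the target name with with_suffix only touches the extension, so a file such as my_doc_notes.doc keeps its name.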

3.3. Code for reading the converted docx file

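import docx

def read_docx(path):
    # Collect the text of every paragraph, one paragraph per line
    doc = docx.Document(path)
    full_text = []
    for para in doc.paragraphs:
        full_text.append(para.text)
    return '\n'.join(full_text)

MY_TEXT = read_docx("E:/my_word_file.docx")

# Optionally write the text to a file:
#with open("Output.txt", "w", encoding="utf-8") as text_file:
#    print(MY_TEXT, file=text_file)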
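Note that doc.paragraphs only covers the body paragraphs, so text inside tables is not included. The following sketch also collects table cells through doc.tables, table.rows, row.cells and cell.text from python-docx. The function name document_text is hypothetical, and it takes an already opened Document object.

```python
def document_text(doc):
    """Join paragraph text and table cell text of a python-docx Document."""
    lines = [para.text for para in doc.paragraphs]
    for table in doc.tables:
        for row in table.rows:
            # One output line per table row, cells separated by tabs
            lines.append("\t".join(cell.text for cell in row.cells))
    return "\n".join(lines)

# Example, assuming python-docx is installed:
# import docx
# print(document_text(docx.Document("E:/my_word_file.docx")))
```

In this sketch the table rows are appended after all paragraphs, so the output does not follow the original document order.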