Reading Different File Types with Python

1. PDF

1.1. Package name

pdfminer

1.2. Installing the package

pip install pdfminer

1.3. Code

pdf2txt.py -o b.txt "EDC-957-0374-33-0(OK).pdf"
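The same conversion can also be done from inside a Python script. The following is a minimal sketch that assumes the maintained fork pdfminer.six is installed (pip install pdfminer.six), since that is where the `pdfminer.high_level.extract_text` function lives. The helper names `output_path_for` and `pdf_to_text_file` are made up for this illustration.

```python
from pathlib import Path


def output_path_for(pdf_path):
    """Return the .txt path that sits next to the given PDF."""
    return Path(pdf_path).with_suffix(".txt")


def pdf_to_text_file(pdf_path):
    """Extract the text of one PDF and write it to a .txt file beside it."""
    # Imported here so output_path_for() works even without pdfminer.six.
    # Assumption: pdfminer.six is installed.
    from pdfminer.high_level import extract_text

    text = extract_text(str(pdf_path))
    out_path = output_path_for(pdf_path)
    out_path.write_text(text, encoding="utf-8")
    return out_path
```

Calling `pdf_to_text_file("EDC-957-0374-33-0(OK).pdf")` would write the extracted text to a .txt file with the same base name, much like the pdf2txt.py command does with its -o option.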

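2. OCR (jpg…)

2.1. Installation

2.1.1. Download and run the installer

https://github.com/UB-Mannheim/tesseract/wiki

During installation you can tick the languages you want to OCR; Traditional Chinese is among them.
You can also download the latest language data files yourself from:

https://github.com/tesseract-ocr/tessdata

The language data files should end up in:

C:\Program Files\Tesseract-OCR\tessdata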
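Before running OCR it can help to confirm that the language data you need is really in that folder. Tesseract stores each language as a file named `<lang>.traineddata`, so a small check like the one below is enough. This is only a sketch; the function name `missing_languages` is hypothetical, and the folder path should be adjusted to your own install.

```python
from pathlib import Path


def missing_languages(tessdata_dir, languages):
    """Return the language codes that have no .traineddata file in tessdata_dir."""
    tessdata = Path(tessdata_dir)
    return [
        lang
        for lang in languages
        if not (tessdata / f"{lang}.traineddata").is_file()
    ]
```

For example, `missing_languages(r"C:\Program Files\Tesseract-OCR\tessdata", ["eng", "chi_tra"])` returns an empty list when both language files are present.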

2.1.2. Installing the packages

pip install pillow

pip install pytesseract

2.2. OCR code

from PIL import Image
import pytesseract

def main():
    # Location of the tesseract.exe executable
    pytesseract.pytesseract.tesseract_cmd = r'XXXXXX\Tesseract-OCR\tesseract.exe'
    img = Image.open('XXXXXXXX/XXXX.png')  # location of the image file
    text = pytesseract.image_to_string(img, lang='eng')  # English
    #text = pytesseract.image_to_string(img, lang='chi_sim')  # Simplified Chinese
    #text = pytesseract.image_to_string(img, lang='chi_tra')  # Traditional Chinese
    print(text)  # show the recognized text

if __name__ == '__main__':
    main()
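When an image mixes languages, Tesseract accepts several language codes joined with "+", and pytesseract passes the `lang` string through unchanged. The sketch below applies that to every PNG in a folder. The names `lang_spec` and `ocr_folder` are invented for this example, and it assumes `tesseract_cmd` has already been set (or that tesseract is on the PATH) and that the listed language files are installed.

```python
from pathlib import Path


def lang_spec(languages):
    """Join language codes into the 'a+b' form that Tesseract accepts."""
    return "+".join(languages)


def ocr_folder(folder, languages=("chi_tra", "eng")):
    """Run OCR on every .png file in folder and return {file name: text}."""
    # Imported here so lang_spec() can be used without the OCR stack installed.
    from PIL import Image
    import pytesseract

    results = {}
    for image_path in sorted(Path(folder).glob("*.png")):
        with Image.open(image_path) as img:
            results[image_path.name] = pytesseract.image_to_string(
                img, lang=lang_spec(languages)
            )
    return results
```

Returning a dictionary keeps the example short; writing each result to its own text file would work just as well.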

3. Word (Docx)

3.1. Installing the Word package

pip install python-docx

3.2. Converting Word doc to docx

First, put all the files to be processed in one folder.

Then open Word and press “Alt+F11” to open the VBA editor.

Next, click the “Normal” project, then click “Insert”.

Choose “Module” to insert a new module into the project.

Then double-click the module to open the editing area and paste the following code:

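Sub TranslateDocIntoDocx()
  Dim objWordApplication As New Word.Application
  Dim objWordDocument As Word.Document
  Dim strFile As String
  Dim strFolder As String

  ' Folder that holds the .doc files (keep the trailing backslash)
  strFolder = "E:\Temp\"
  strFile = Dir(strFolder & "*.doc", vbNormal)

  While strFile <> ""
    ' Dir may also return .docx files for this pattern, so check the extension
    If LCase(Right(strFile, 4)) = ".doc" Then
      Set objWordDocument = objWordApplication.Documents.Open(FileName:=strFolder & strFile, AddToRecentFiles:=False, ReadOnly:=True, Visible:=False)

      With objWordDocument
        ' Appending "x" turns name.doc into name.docx; FileFormat 16 is wdFormatDocumentDefault
        .SaveAs FileName:=strFolder & strFile & "x", FileFormat:=16
        .Close
      End With
    End If
    strFile = Dir()
  Wend

  ' Close the hidden Word instance created above
  objWordApplication.Quit
  Set objWordDocument = Nothing
  Set objWordApplication = Nothing
End Sub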

Finally, click the “Run” button.

Reference: “3 Quick Ways to Batch Convert Word DOC to DOCX Files and Vice Versa”
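The same batch conversion can probably be driven from Python instead of the VBA editor. The sketch below assumes Windows, an installed copy of Word, and the pywin32 package (pip install pywin32), none of which are mentioned in the steps above. The helper names are hypothetical. It uses the same FileFormat value, 16, as the macro.

```python
from pathlib import Path

WD_FORMAT_DOCUMENT_DEFAULT = 16  # same FileFormat value the macro uses


def find_doc_files(folder):
    """Return files whose extension is exactly .doc (case-insensitive)."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".doc"
    )


def docx_target(doc_path):
    """Return the .docx path that corresponds to a .doc path."""
    return Path(doc_path).with_suffix(".docx")


def convert_folder(folder):
    """Open each .doc in folder with Word and save it again as .docx."""
    # Assumption: pywin32 and Microsoft Word are available on this machine.
    import win32com.client

    word = win32com.client.Dispatch("Word.Application")
    word.Visible = False
    try:
        for doc_path in find_doc_files(folder):
            doc = word.Documents.Open(str(doc_path.resolve()), ReadOnly=True)
            doc.SaveAs(str(docx_target(doc_path).resolve()),
                       FileFormat=WD_FORMAT_DOCUMENT_DEFAULT)
            doc.Close()
    finally:
        word.Quit()  # always shut down the hidden Word instance
```

Something like `convert_folder(r"E:\Temp")` would then mirror what the macro does for the folder in `strFolder`.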

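3.3. Code for reading Docx

import docx


def get_text(path):
    # Collect the text of every paragraph in the document
    doc = docx.Document(path)
    full_text = []
    for para in doc.paragraphs:
        full_text.append(para.text)
    return '\n'.join(full_text)


text = get_text("E:/my_word_file.docx")
print(text)

# Optionally write the text to a file instead:
# with open("Output.txt", "w", encoding="utf-8") as text_file:
#     print(text, file=text_file)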
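For comparison, a .docx file is a zip archive whose main text lives in word/document.xml, so paragraph text can also be pulled out with the standard library alone. This is a different technique from python-docx, shown only as a rough sketch of what such a reader does. The function name `read_docx_paragraphs` is made up here. Unlike `doc.paragraphs`, it also picks up paragraphs nested inside tables.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespace used by WordprocessingML elements such as <w:p> and <w:t>
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def read_docx_paragraphs(path_or_file):
    """Return the text of each paragraph in a .docx file as a list of strings."""
    with zipfile.ZipFile(path_or_file) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)

    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph is split into runs; join the text of all of them.
        runs = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return paragraphs
```

Joining the result with '\n' gives a single string comparable to the python-docx version; for anything beyond plain text, python-docx remains the more convenient choice.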