Tesseract OCR problems with accents: image enhancement is not enough

Question:

I really need your help with Tesseract.
I'm using Tesseract and pdf2image to extract information from a scanned PDF file.
My problem is that Tesseract gets the accented characters é, è and ê wrong (I'm French), and it also confuses the lowercase "i" with the uppercase "I".
I tried preprocessing the images first, but I still can't get good output.

This is the code I'm using:

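import os
from tkinter.filedialog import askopenfilename

import cv2
import numpy as np
import pytesseract
from pdf2image import convert_from_path

pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'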

filePath = askopenfilename()
img = convert_from_path(filePath, poppler_path=r'C:\poppler-0.68.0_x86\poppler-0.68.0\bin')
path, fileName = os.path.split(filePath)
fileBaseName, fileExtension = os.path.splitext(fileName)


for page_number in range(len(img)):
    img[page_number].save(r'C:\Users\488096\Documents\page' + str(page_number) + '.jpg', 'JPEG')

    
work_img = None
# Accumulates the recognized text of all pages
txt = ''
# Tesseract
custom_config = r'--oem 3 --psm 6'
kernel = np.ones((1, 1), np.uint8)

for page_number in range(len(img)):
    img1 = cv2.imread(r'C:\Users\488096\Documents\page' + str(page_number) + '.jpg')
    # Preprocess the image to improve character recognition
    gray = cv2.cvtColor(img1, cv2.COLOR_BGR2GRAY)
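    # Intended to remove shadows; note that with the 1x1 kernel defined above,
    # dilate and erode leave the image unchanged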
    cool_img = cv2.dilate(gray, kernel, iterations=1)
    norm_img = cv2.erode(cool_img, kernel, iterations=1)
    # Threshold using Otsu's
    work_img = cv2.threshold(cv2.bilateralFilter(norm_img, 5, 75, 75), 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

    # Run OCR on the preprocessed page and append the text to the running result
    txt = txt + pytesseract.image_to_string(work_img, config=custom_config)
    print("Page # {} - {}".format(page_number, txt))

What can I do to get good results?
Thanks a lot!

Asked By: Irianeth

||

Answers:

You probably need to install the French language pack (fra) and tell Tesseract to use it. More information here:

https://pyimagesearch.com/2020/08/03/tesseract-ocr-for-non-english-languages/
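
As a rough sketch of what using the language pack looks like from pytesseract: the snippet below passes lang="fra" to image_to_string, after checking that the French data is actually installed. The helper names tesseract_config and ocr_french are hypothetical and only for illustration; it also assumes a pytesseract version recent enough to provide get_languages.

```python
def tesseract_config(psm: int = 6, oem: int = 3) -> str:
    """Build the extra flags passed through pytesseract's `config` argument."""
    return f"--oem {oem} --psm {psm}"


def ocr_french(image, psm: int = 6) -> str:
    """Run Tesseract with the French model on a PIL image or NumPy array.

    Hypothetical helper, not part of pytesseract itself.
    """
    # Imported lazily so tesseract_config() works without Tesseract installed.
    import pytesseract

    # get_languages() lists the language data files Tesseract can find.
    if "fra" not in pytesseract.get_languages(config=""):
        raise RuntimeError("French language data (fra) is not installed")

    # lang="fra" selects the French model. Without it, Tesseract falls back
    # to the English model, which handles accented characters poorly.
    return pytesseract.image_to_string(
        image, lang="fra", config=tesseract_config(psm)
    )
```

In the loop from the question, the equivalent change would be pytesseract.image_to_string(work_img, lang="fra", config=custom_config).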

You can also use ocrmypdf, which I find the easiest way to extract text from PDFs: https://pypi.org/project/ocrmypdf/
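
A minimal sketch of calling ocrmypdf from Python through its command line, assuming ocrmypdf and the French language data are installed. The function names and the file names in the usage note are placeholders; -l selects the OCR language and --sidecar writes the recognized text to a separate file.

```python
import subprocess


def ocrmypdf_command(input_pdf, output_pdf, sidecar_txt, language="fra"):
    """Build the ocrmypdf command line as a list of arguments."""
    return [
        "ocrmypdf",
        "-l", language,                 # OCR language, here French
        "--sidecar", str(sidecar_txt),  # also write the plain text to this file
        str(input_pdf),
        str(output_pdf),
    ]


def run_ocrmypdf(input_pdf, output_pdf, sidecar_txt, language="fra"):
    """Run ocrmypdf and return the text it wrote to the sidecar file."""
    cmd = ocrmypdf_command(input_pdf, output_pdf, sidecar_txt, language)
    # check=True raises CalledProcessError if ocrmypdf exits with an error.
    subprocess.run(cmd, check=True)
    with open(sidecar_txt, encoding="utf-8") as f:
        return f.read()
```

For example, run_ocrmypdf("scan.pdf", "scan_ocr.pdf", "scan.txt") would produce a searchable PDF and return the extracted text.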

Answered By: ssanga