How to convert or extract a table from an image using Tesseract?

Question:

I have an image of a table that I want to convert into a structured table (pandas DataFrame or Excel sheet).

I just started using Tesseract, and I’m having problems converting the image into a table.

I’m using the following code.

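import cv2
import pytesseract

img_cv = cv2.imread(imagepath)  # imagepath: path to the table image
img_rgb = cv2.cvtColor(img_cv, cv2.COLOR_BGR2RGB)  # OpenCV loads BGR; Tesseract expects RGB
print(pytesseract.image_to_string(img_rgb))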

Words and letters are recognized, but the formatting is all off: the words come out in one jumbled chunk.

'IN ETaat=) Count... Tkr & Exch Market Sales %ReventRelationshi Account %Cost Source As Of DatennCap Surprise Value (Q) As Typenn21) Facebook Inc LUIS} las) LOS 516.19B) 0.93%nn39) Applied Optoelectro...|US AAOI US 177.83M 1.77% 10.90% 5.20M|\CAPEX 0.14%|*2019A CF 02/28/2020n40) Activision Blizzard ...|US ATVI US 46.13B 0.89%, 0.31%) 4.02M|COGS 0.13%|Estimate 12/03/2019n41) Quanta Computer I... |TW 2382 an 7.93B| -2.73% 0.04% 3.02M/COGS 0.11%|Estimate 07/04/2019n42) Modern Avenue Gro...|CN 002656 CH) 263.51M| -2.87%| 4.44% 2.60M|\COGS 0.10%|*2018A CF 04/26/2019n43) Mellanox Technolog...|IL MLNX US 6.51B| 13.57%| 0.74%) 2.80M|\COGS (OM O}=1<1 tim [nate] k=) 03/03/2020n44) O-Net Technologies...|CN 877 ale 463.33M aad 3.11%) 2.49M|CAPEX 0.07%|Estimate 10/30/2019n45) Adobe Inc US ADBE US 162.75B 0.63%, 0.08% 2.02M|\SG&A 0.07%|Estimate 06/12/2019n46) British Land Co PLC...\|GB BLND LN 5.74B| 10.97% 1.05% 2.12M\SG&A (OM Oley atin [nat] k=) 11/19/2019n47) Bel Fuse Inc US BELFA US | 123.22M) -3.66% 1.13% 1.40M/COGS (omer tl at-im [gate] k=) 11/19/2019n48) Keysight Technolog...|US Nees US 17.99B 3.37%, 0.08% 880.90k/\COGS (OM Oey a-imeat- 1K) 01/03/2020n49) BT Group PLC GB BT/A LN 17.00B|} -0.01% 0.01% 631.65k/COGS (om OP2-1) at-1 8 [gate] K=) 01/16/2020n50) KT Corp KR 030200 KS 5.21B 0.32%, 0.02% 1.07M|SG&A (om OP2-1) at-1 8 [gate] K=) 05/10/2019n5D Sunny Optical Tech... |CN 2382 ale 18.16B aad 0.04% 425.69k/ COGS (om eM Rati m [nat] -) 08/27/2019n52) Belden Inc US 131 D1@% US 1.95B 5.68%, 0.04%) 255.50k|COGS (om eM Rati m [nat] -) 11/04/2019n53) Lattice Semiconduc... |US LSCC US 2.51B 0.24%, 0.18%) 174.54k COGS (om eM Rati m [nat] -) 05/08/2019n54 Zhen Ding Technolo.../TW 4958 an 3.55B| -0.77%| 0.02%) 184.75k/COGS (om eM Rati m [nat] -) 01/17/2020n55) Emnet Inc KR 123570 KS 66.79M aid Pa hei) 214.59k|SG&A *2019C3 CF 11/14/2019n56) Zebra Technologies...|US ZBRA US 10.95B| -0.32% 57.18k\COGS stim [eat] k=) 02/21/2020'

Is there a way to get it to a table format properly?

Asked By: anarchy


Answers:

The image is horizontally compressed, so you can resize it and the OCR mostly works; I increased the height by ~25% and the width by ~10%.

img_resized = cv2.resize(img_cv,
                         (int(img_cv.shape[1] + (img_cv.shape[1] * .1)),
                          int(img_cv.shape[0] + (img_cv.shape[0] * .25))),
                         interpolation=cv2.INTER_AREA) 
img_rgb = cv2.cvtColor(img_resized,cv2.COLOR_BGR2RGB)

Result:

In [42]: print(pytesseract.image_to_string(img_rgb))                                                
vente) Count... Tkr & Exch Market Sales %ReventRelationshiAccount %Cost Source As Of Date

Cap Surprise Value (Q) As Type

21) Facebook Inc US FB US 516.19B) 0.93%

39) Applied Optoelectro...|US AAOI US | 177.83M| 1.77%| 10.90% 5.20M|CAPEX 0.14%|*2019A CF 02/28/2020
40) Activision Blizzard ...|US ATVI US 46.13B) 0.89% 0.31% 4.02M|COGS 0.13%|/Estimate 12/03/2019
41) Quanta Computer I... |TW 2382 TT 7.93B| -2.73%| 0.04% 3.02M COGS 0.11%|/Estimate 07/04/2019
42) Modern Avenue Gro... |CN 002656 CH! 263.51M -2.87%| 4.44% 2.60M|COGS 0.10%|*2018A CF 04/26/2019
43) Mellanox Technolog...|IL MLNX US 6.51B) 13.57%, 0.74% 2.80M|COGS 0.08%|/Estimate 03/03/2020
44) O-Net Technologies...|CN 877 HK | 463.33M --| 3.11% 2.49MCAPEX 0.07%|Estimate 10/30/2019
45) Adobe Inc US ADBE US| 162.75B) 0.63%, 0.08% 2.02M SG&A 0.07%|Estimate 06/12/2019
46) British Land Co PLC...|GB BLND- LN 5.74B) 10.97%, 1.05% 2.12M SG&A 0.06%|Estimate 11/19/2019
47) Bel Fuse Inc US BELFA US | 123.22M -3.66%| 1.13% 1.40M|COGS 0.04%|Estimate 11/19/2019
48) Keysight Technolog...|US KEYS US 17.99B| 3.37% 0.08% 880.90k|COGS 0.03%|Estimate 01/03/2020
49) BT Group PLC GB BT/A LN 17.00B| -0.01%| 0.01% 631.65k/COGS 0.02%|/Estimate 01/16/2020
50) KT Corp aoe 030200 KS 5.21B) 0.32% 0.02% 1.07M|SG&A 0.02%|/Estimate 05/10/2019
51) Sunny Optical Tech... |CN 2382 HK 18.16B --| 0.04% 425.69k/COGS 0.01%|/Estimate 08/27/2019
52) Belden Inc US BDC US 1.95B) 5.68% 0.04% 255.50k/|COGS 0.01%|/Estimate 11/04/2019
53) Lattice Semiconduc...|US Lscc US 2.51B) 0.24% 0.18% 174.54k|COGS 0.01%|/Estimate 05/08/2019
54) Zhen Ding Technolo..., TW 4958 TT 3.55B) -0.77%| 0.02% 184.75k/COGS 0.01%|/Estimate 01/17/2020
55) Emnet Inc KR 123570 KS| 66.79M --| 2.78% 214.59k/SG&A *2019C3 CF Wary esenke,
56) Zebra Technologies...|US ZBRA US 10.95B) -0.32% 57.18k|COGS Estimate 02/21/2020

To write this output to a file:

output = pytesseract.image_to_string(img_rgb)
# Note: this is the raw OCR text, not comma-separated values, despite the .csv extension.
with open('test.csv', 'w') as f:
    f.write(output)
Answered By: mechanical_meat

In addition to mechanical_meat's answer, you can format the output using the code below.

import cv2
import pytesseract
from pytesseract import Output
import pandas as pd

img = cv2.imread("HZ29h.png")
img = cv2.resize(img, (int(img.shape[1] + (img.shape[1] * .1)),
                       int(img.shape[0] + (img.shape[0] * .25))),
                 interpolation=cv2.INTER_AREA)

img_rgb = cv2.cvtColor(img,cv2.COLOR_BGR2RGB)

custom_config = r'-l eng --oem 3 --psm 6 -c tessedit_char_whitelist="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-:.$%./@& *"'
d = pytesseract.image_to_data(img_rgb, config=custom_config, output_type=Output.DICT)
df = pd.DataFrame(d)

# clean up blanks; conf may come back as a string or a number depending on the pytesseract version
df1 = df[(pd.to_numeric(df.conf) != -1) & (df.text.str.strip() != '')]
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

# sort blocks vertically
sorted_blocks = df1.groupby('block_num').first().sort_values('top').index.tolist()
for block in sorted_blocks:
    curr = df1[df1['block_num'] == block]
    sel = curr[curr.text.str.len() > 3]
    # sel = curr
    char_w = (sel.width / sel.text.str.len()).mean()
    prev_par, prev_line, prev_left = 0, 0, 0
    text = ''
    for ix, ln in curr.iterrows():
        # add new line when necessary
        if prev_par != ln['par_num']:
            text += '\n'
            prev_par = ln['par_num']
            prev_line = ln['line_num']
            prev_left = 0
        elif prev_line != ln['line_num']:
            text += '\n'
            prev_line = ln['line_num']
            prev_left = 0

        added = 0  # num of spaces that should be added
        if ln['left'] / char_w > prev_left + 1:
            added = int((ln['left']) / char_w) - prev_left
            text += ' ' * added
        text += ln['text'] + ' '
        prev_left += len(ln['text']) + added + 1
    text += '\n'
    print(text)

Output

  IN vaate3           Count... Tkr & Exch Market   Sales  %ReventRelationshiAccount   %Cost Source       As Of Date 
                                              Cap Surprise        Value Q   As Type 
21 Facebook Inc       US      FB    US    516.19B   0.93% 
39 Applied Optoelectro.../US AAOI    US   177.83M   1.77%  10.90%     5.20MCAPEX      om EE len key el   02/28/2020 
40 Activision Blizzard ...US  ATVI   US    46.13B   0.89%   0.31%     4.02M/COGS      0.13% Estimate     12/03/2019 
41 Quanta Computer I... TW    2382   TT     7.93B  -2.73%   0.04%     3.02M COGS      0.11% Estimate     07/04/2019 
42 Modern Avenue Gro...CN     002656  CH  263.51M  -2.87%   4.44%     2.60MCOGS       0.10%*2018A  CF    04/26/2019 
43 Mellanox Technolog...JIL   MLNX   US     6.51B  13.57%   0.74%     2.80MCOGS       0.08%/Estimate     03/03/2020 
44 O-Net Technologies...CN    877    HK    463.33M     --   3.11%     2.49MCAPEX      0.07%/Estimate     10/30/2019 
45 Adobe Inc          US      ADBE   US   162.75B   0.63%   0.08%     2.02M SG&A      0.07%/Estimate     06/12/2019 
46 British Land Co PLC...GB   BLND-  LN     5.74B  10.97%   1.05%     2.12M SG&A      0.06%Estimate      11/19/2019 
47 Bel Fuse Inc       US      BELFA  US    123.22M -3.66%   1.13%     1.40MCOGS       0.04%Estimate      11/19/2019 
48 Keysight Technolog...US    14s A  Obed  17.99B   3.37%   0.08%   880.90k/COGS      0.03%Estimate      01/03/2020 
49 BT Group PLC       e 33    BT/A   LN    17.00B  -0.01%   0.01%   631.65k/COGS      0.02% Estimate     01/16/2020 
50 KT Corp            KR      030200  KS    5.21B   0.32%   0.02%     1.07M/SG&A      0.02% Estimate     05/10/2019 
51 Sunny Optical Tech... CN   2382   HK    18.16B      --   0.04%   425.69k/COGS      0.01% Estimate     08/27/2019 
52 Belden Inc         US      BDC    US     1.95B   5.68%   0.04%   255.50k/COGS      0.01%/Estimate     11/04/2019 
53 Lattice Semiconduc... US   LscC   US     2.51B   0.24%   0.18%   174.54k/COGS      0.01%/Estimate     05/08/2019 
54. Zhen Ding Technolo.... TW 4958   TT     3.55B  -0.77%   0.02%   184.75k/COGS      0.01%/Estimate     01/17/2020 
55. Emnet Inc         KR      123570  KS   66.79M      --   2.78%   214.59k/SG&A            *2019C3 CF   Wary   esenke 
56 Zebra Technologies.../US   VAs 0a  O hs 10.95B  -0.32%            57.18k/COGS            Estimate     02/21/2020 
Answered By: us2018

The only way to do this properly is to detect the vertical lines and use the coordinates of the detected lines to infer the columns. Parsing the output is a road to nowhere, especially if you are hoping the lines will always be OCR'd as pipes – they won’t!
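
A minimal sketch of that idea (not from the original answer). It assumes dark vertical rules on a light background that span most of the image height; the function names `find_vertical_lines` and `column_index` and the two threshold values are hypothetical and would need tuning for a real image. It uses only NumPy on a 2-D grayscale array and is demonstrated on a synthetic page rather than the table from the question.

```python
import bisect

import numpy as np


def find_vertical_lines(gray, dark_thresh=128, min_fraction=0.8):
    """Return x-coordinates of vertical rules in a 2-D grayscale array.

    A pixel column counts as part of a rule when at least `min_fraction`
    of its pixels are darker than `dark_thresh`. Both values are guesses
    and assume dark lines on a light background.
    """
    dark_fraction = (np.asarray(gray) < dark_thresh).mean(axis=0)
    xs = np.flatnonzero(dark_fraction >= min_fraction)
    if xs.size == 0:
        return []
    # Merge runs of adjacent pixel columns into one line (the run's centre).
    runs = np.split(xs, np.flatnonzero(np.diff(xs) > 1) + 1)
    return [int(run.mean()) for run in runs]


def column_index(word_left, word_width, line_xs):
    """Index of the table column that contains the word's horizontal centre."""
    centre = word_left + word_width / 2
    return bisect.bisect_right(line_xs, centre)


# Synthetic example: a white 50x100 page with rules at x=30 and x=70..71.
page = np.full((50, 100), 255, dtype=np.uint8)
page[:, 30] = 0
page[:, 70:72] = 0

lines = find_vertical_lines(page)
print(lines)                        # [30, 70]
print(column_index(5, 10, lines))   # 0 (left of the first rule)
print(column_index(40, 20, lines))  # 1 (between the two rules)
print(column_index(80, 10, lines))  # 2 (right of the second rule)
```

With a real image, `gray` could be the result of `cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)`, and the word positions could come from the `left` and `width` fields returned by `pytesseract.image_to_data`. Faint, broken, or light-on-dark rules would likely need inversion or a lower `min_fraction`.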

Answered By: RJJ

@us2018
I want to save a text file for every image using your approach. Can you tell me the way?

Answered By: saurav