modify image to black text on white background
Question:
I have an image that I need to run OCR (Optical Character Recognition) on to extract all of its data.
First, I want to convert the color image to black text on a white background to improve OCR accuracy.
I tried the code below:
from PIL import Image
img = Image.open("data7.png")
img.convert("1").save("result.jpg")
It gave me the unclear image below:
I expected to get this image:
Then I will use pytesseract to get a dataframe:
import pytesseract as tess
file = Image.open("data7.png")
text = tess.image_to_data(file,lang="eng",output_type='data.frame')
text
Finally, the dataframe I want to get looks like the one below:
Answers:
You can extract the background color by taking the most frequent color in the input image, using Torchvision to compute the pixel statistics.
More specifically, you can convert the image with torchvision.transforms.functional.to_tensor:
>>> img = Image.open("test.png")
>>> tensor = TF.to_tensor(img)
Extract background color:
>>> u, c = tensor.flatten(1).unique(dim=1, return_counts=True)
>>> bckg = u[:,c.argmax()]
>>> bckg
tensor([0.1216, 0.1216, 0.1216])
Get the mask of background:
>>> mask = (tensor.permute(1,2,0) == bckg).all(dim=-1)
Convert back to PIL with torchvision.transforms.functional.to_pil_image:
>>> res = TF.to_pil_image(mask.float())
Then you can extract the data frame using Python tesseract:
>>> text = tess.image_to_data(res, lang="eng", output_type='data.frame')
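The dataframe returned by image_to_data also contains layout rows (page, block, paragraph, line) that carry no word; Tesseract marks those with a conf of -1 and an empty text. A minimal sketch of filtering them out follows. The helper name keep_recognized_words and the min_conf parameter are assumptions made for illustration, not part of pytesseract.

```python
import pandas as pd


def keep_recognized_words(df: pd.DataFrame, min_conf: float = 0.0) -> pd.DataFrame:
    """Keep only rows that hold non-blank text with conf >= min_conf."""
    # Empty cells arrive as NaN; map them (and whitespace-only cells) to "".
    text = df["text"].map(lambda t: "" if pd.isna(t) else str(t).strip())
    has_text = text != ""
    # Layout rows have conf == -1; anything unparsable becomes NaN and is dropped.
    confident = pd.to_numeric(df["conf"], errors="coerce") >= min_conf
    return df[has_text & confident].reset_index(drop=True)


# Usage, assuming `text` is the dataframe from the image_to_data call above:
# words = keep_recognized_words(text, min_conf=50)
```

A higher min_conf trades recall for precision, so the right value depends on how clean the preprocessed image is.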
The torchvision snippets in this answer assume these imports: from PIL import Image and import torchvision.transforms.functional as TF.
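If torch is not installed, the same background-mask idea can be sketched with NumPy in place of Torchvision. This is a swapped-in technique rather than the code from this answer, and the function name background_to_white is made up for illustration. It treats every pixel that is not exactly the background color as text, so anti-aliased edges turn black as well.

```python
import numpy as np
from PIL import Image


def background_to_white(img: Image.Image) -> Image.Image:
    """Map the most frequent color to white and every other color to black."""
    rgb = np.asarray(img.convert("RGB"))      # shape (H, W, 3), dtype uint8
    pixels = rgb.reshape(-1, 3)               # one row per pixel
    colors, counts = np.unique(pixels, axis=0, return_counts=True)
    background = colors[counts.argmax()]      # most frequent color
    mask = (rgb == background).all(axis=-1)   # True where the pixel is background
    # A 2-D uint8 array is interpreted by Pillow as a grayscale ("L") image.
    return Image.fromarray(mask.astype(np.uint8) * 255)
```

The result can be passed to image_to_data in the same way as res in the torchvision version.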
Converting the RGB image to a binary image with PIL.Image.convert resulted in an "unclear" image because of the default Floyd-Steinberg dithering. In your case you do not want to dither at all:
img.convert("1", dither=Image.Dither.NONE)
This gives you a clean conversion:
You still need to figure out how to capture the colored text, but the noise is gone once you turn off dithering.
Here’s a vanilla Pillow solution. Just grayscaling the image gives us okay results, but the green text is too faint.
So, we first scale the green channel up (sure, it might clip, but that’s not a problem here), then grayscale, invert and auto-contrast the image.
from PIL import Image, ImageOps
img = Image.open('rqDRe.png').convert('RGB')
r, g, b = img.split()
img = Image.merge('RGB', (
    r,
    g.point(lambda i: i * 3),  # brighten green channel
    b,
))
img = ImageOps.autocontrast(ImageOps.invert(ImageOps.grayscale(img)), 5)
img.save('rqDRe_processed.png')
output
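The same pipeline can be wrapped in a function so the two tuning knobs are easy to adjust for other images. This is a sketch under assumptions: the name to_black_on_white and the parameters green_gain and cutoff are introduced here for illustration, and the defaults simply mirror the values used above.

```python
from PIL import Image, ImageOps


def to_black_on_white(img, green_gain=3.0, cutoff=5):
    """Boost the green channel, grayscale, invert, then stretch contrast."""
    r, g, b = img.convert("RGB").split()
    # Scale green and clip explicitly at 255 so the intent is visible.
    g = g.point(lambda v: min(255, int(v * green_gain)))
    gray = ImageOps.grayscale(Image.merge("RGB", (r, g, b)))
    # Inverting turns light text on a dark background into dark text on a
    # light background; autocontrast ignores `cutoff` percent of the pixels
    # at each end of the histogram before stretching.
    return ImageOps.autocontrast(ImageOps.invert(gray), cutoff=cutoff)
```

The output is a grayscale image rather than a strictly two-level one; whether to binarize it afterwards is a judgment call that depends on the OCR results.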