How can I extract font color of text within a PDF in Python with PDFMiner?

Question:

How can I extract font color from text within a PDF?

I already tried exploring LTText and LTChar objects using PDFMiner, but it seems that this module only allows extracting font size and style, not color.

Asked By: Vito Gentile


Answers:

(disclaimer: I am the author of pText, the library being used in this example.)

pText allows you to register an EventListener that will be notified whenever a PDF rendering instruction (such as rendering text) has been processed.

Upon receiving this event, you can inspect the graphics state to find out what the current stroke and fill colors are. Text is normally painted with the fill (non-stroking) color; the stroke color only matters when the text rendering mode outlines the glyphs.

Let’s have a look at how that works:

with open("input.pdf", "rb") as pdf_file_handle:
    l = ColorSpectrumExtraction()
    doc = PDF.loads(pdf_file_handle, [l])

The above code opens a PDF document for (binary) reading and calls the PDF.loads method. The extra parameter we are passing is a list (in this case with one element) of EventListener implementations.

Let’s look into ColorSpectrumExtraction:

class ColorSpectrumExtraction(EventListener):

    def event_occurred(self, event: Event) -> None:
        if isinstance(event, ChunkOfTextRenderEvent):
            self._render_text(event)

    def _render_text(self, event: ChunkOfTextRenderEvent):
        assert event is not None
        c = event.font_color.to_rgb()
        # do something with the font color

As you can see, this class has a method event_occurred, which will get called on rendering content. In our case, we are only interested in ChunkOfTextRenderEvent.

So we verify the event type (using isinstance) and then delegate the call to another method.

In the method _render_text we can then get all the information we want about the text that was just rendered, such as the font_color, font_size, etc.
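
To make the listener pattern itself easier to follow, here is a minimal, library-independent sketch. It does not use pText: TextEvent, ColorCounter, and replay are hypothetical stand-ins invented for illustration, and only the event_occurred method name is borrowed from the example above.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Tuple


@dataclass
class TextEvent:
    """Hypothetical stand-in for a 'text was rendered' event."""
    text: str
    rgb: Tuple[int, int, int]


class ColorCounter:
    """Hypothetical listener that tallies rendered characters per color."""

    def __init__(self):
        self.counts = Counter()

    def event_occurred(self, event):
        # Only react to the event type we care about; ignore everything else.
        if isinstance(event, TextEvent):
            self.counts[event.rgb] += len(event.text)


def replay(events, listeners):
    """Deliver each event to every registered listener, in order."""
    for event in events:
        for listener in listeners:
            listener.event_occurred(event)


listener = ColorCounter()
replay(
    [
        TextEvent("Hello", (0, 0, 0)),
        TextEvent("!", (255, 0, 0)),
        TextEvent(" world", (0, 0, 0)),
    ],
    [listener],
)
print(listener.counts)
```

The design point is that the parser only has to know about event_occurred, so new kinds of analysis can be added by writing another listener.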

You can obtain pText either on GitHub or from PyPI.
There are many more examples; check them out to find out more about working with images.

Answered By: Joris Schellekens

I looked at the source code for PDFMiner (no longer maintained) and pdfminer.six (its fork). Neither Python module allows you to extract the color. In the issues sections of both projects, extracting the font color is a commonly reported problem.

I also looked at pdfplumber, which is built on pdfminer.six. This module does extract font colors. The color elements it extracts include the stroking_color, which is the outline of a character, and the non_stroking_color, which is the fill of a character. I looked at the colors extracted from my sample PDF and they matched the RGB colors.

import pdfplumber

# Open the PDF and walk each page's characters, which carry the color data.
with pdfplumber.open('path_to_pdf') as pdf_file:
    for p in pdf_file.pages:
        print(f"Page Number: {p.page_number}")
        # extract_text() can return None on some versions for empty pages.
        print((p.extract_text() or '').strip())
        for char in p.chars:
            print(f"Character: {char['text']}")
            print(f"Font Name: {char['fontname']}")
            print(f"Font Size: {char['size']}")
            print(f"Stroking Color: {char['stroking_color']}")
            print(f"Non_stroking Color: {char['non_stroking_color']}")
            print('\n')
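
If you want to compare the reported colors against familiar hex codes, a small helper can normalize them. This is only a sketch: it assumes the components are floats between 0 and 1 and that a value has one (gray), three (RGB), or four (CMYK) components. The function name color_to_hex is made up for this example, and the CMYK conversion is the naive formula with no color profile.

```python
def color_to_hex(color):
    """Convert a gray/RGB/CMYK color value (components in 0..1) to '#rrggbb'.

    Returns None when the value is missing or not purely numeric
    (for example, a pattern color).
    """
    if color is None:
        return None
    if isinstance(color, (int, float)):
        color = (color,)
    try:
        parts = [float(c) for c in color]
    except (TypeError, ValueError):
        return None

    if len(parts) == 1:        # grayscale
        r = g = b = parts[0]
    elif len(parts) == 3:      # RGB
        r, g, b = parts
    elif len(parts) == 4:      # CMYK, naive conversion (assumption)
        c, m, y, k = parts
        r = (1 - c) * (1 - k)
        g = (1 - m) * (1 - k)
        b = (1 - y) * (1 - k)
    else:
        return None

    def to_byte(x):
        # Clamp to the valid 0..255 range after scaling.
        return max(0, min(255, round(x * 255)))

    return "#{:02x}{:02x}{:02x}".format(to_byte(r), to_byte(g), to_byte(b))


print(color_to_hex((1, 0, 0)))
print(color_to_hex(0))
```

You could then call it as color_to_hex(char['non_stroking_color']) on each character dictionary.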

The unanswered question is:

How can you extract the font colors and still use your PDFMiner code?

The code below lets me use pdfminer.six and pdfplumber together to extract various elements, such as the text, font name, font size, stroking_color and non_stroking_color, from the source PDF file.

import pdfplumber

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer

pdf_path = 'path_to_pdf'

with pdfplumber.open(pdf_path) as pdf_file:
    # Pair each pdfminer.six page layout with the matching pdfplumber page.
    for page_layout, page in zip(extract_pages(pdf_path), pdf_file.pages):
        # Text blocks come from pdfminer.six.
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                print(element.get_text())
        # Per-character font and color data come from pdfplumber.
        for char in page.chars:
            print(f"Character: {char['text']}")
            print(f"Font Name: {char['fontname']}")
            print(f"Font Size: {char['size']}")
            print(f"Stroking Color: {char['stroking_color']}")
            print(f"Non_stroking Color: {char['non_stroking_color']}")
            print('\n\n')

UPDATE 03-09-2021

I’m still working on meshing and synchronizing these functions together. I checked them and they seem to be outputting the correct elements.

import pdfplumber
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTChar, LAParams


def extract_character_characteristics(pdf_file):
    # Number pages from 1 as they are read, instead of parsing the file twice.
    pages = extract_pages(pdf_file, laparams=LAParams())
    for page_number, page_layout in enumerate(pages, start=1):
        print(f'Processing Page: {page_number}')
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                for text_line in element:
                    for character in text_line:
                        if isinstance(character, LTChar):
                            if character.get_text() != ' ':
                                print(f"Character: {character.get_text()}")
                                print(f"Font Name: {character.fontname}")
                                print(f"Font Size: {character.size}")
                                print('\n')


def extract_character_colors(pdf_file):
    with pdfplumber.PDF(pdf_file) as file:
        for char in file.chars:
            if char['text'] != ' ':
                print(f"Page Number: {char['page_number']}")
                print(f"Character: {char['text']}")
                print(f"Font Name: {char['fontname']}")
                print(f"Font Size: {char['size']}")
                print(f"Stroking Color: {char['stroking_color']}")
                print(f"Non_stroking Color: {char['non_stroking_color']}")
                print('\n')


with open('test.pdf', 'rb') as scr_file:
    extract_character_characteristics(scr_file)
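
Printing every character gets noisy quickly. One way to condense the output is to merge consecutive characters that share a fill color into runs. The sketch below works on any list of dictionaries with the 'text' and 'non_stroking_color' keys used above; the function name group_chars_by_color and the sample data are invented for illustration.

```python
from itertools import groupby


def _color_key(char):
    """Return the fill color in a comparable form (lists become tuples)."""
    color = char.get("non_stroking_color")
    if isinstance(color, (list, tuple)):
        return tuple(color)
    return color


def group_chars_by_color(chars):
    """Merge consecutive characters with the same fill color into runs.

    Returns a list of (color, text) pairs in reading order.
    """
    runs = []
    for color, group in groupby(chars, key=_color_key):
        runs.append((color, "".join(c["text"] for c in group)))
    return runs


# Made-up sample data shaped like the character dictionaries above.
sample_chars = [
    {"text": "H", "non_stroking_color": (0, 0, 0)},
    {"text": "i", "non_stroking_color": [0, 0, 0]},
    {"text": "!", "non_stroking_color": (1, 0, 0)},
]
print(group_chars_by_color(sample_chars))
```

With a real file you would pass page.chars (or file.chars) instead of sample_chars.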

Answered By: Life is complex

PDFMiner’s LTChar object has a ‘graphicstate’ attribute, which in turn has ‘scolor’ (stroking color) and ‘ncolor’ (non-stroking color) attributes. These can be used to obtain text color information. Here’s a working code snippet (based on the code from one of the other answers) that outputs font info for each text line component:

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTChar
import sys

with open(sys.argv[1], 'rb') as scr_file:
    for page_layout in extract_pages(scr_file):
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                fontinfo = set()
                for text_line in element:
                    for character in text_line:
                        if isinstance(character, LTChar):
                            fontinfo.add(character.fontname)
                            fontinfo.add(character.size)
                            fontinfo.add(character.graphicstate.scolor)
                            fontinfo.add(character.graphicstate.ncolor)
                print("\n", element.get_text(), fontinfo)

Answered By: jrajp2184