Is there a way in python to extract only the CORE TEXT (without boxes, footer etc.) from a pdf?

Question

I am trying to extract only the core text from a "rich" PDF document, meaning one with a lot of tables, graphs, boxes, footers, etc. that I am not interested in.

I tried some common Python packages such as PyPDF2, pdfplumber and pdfreader. The problem is that they apparently extract all the text present in the PDF, including the parts listed above that I am not interested in.

As an example:

from PyPDF2 import PdfReader

reader = PdfReader("sample.pdf")  # "sample.pdf" is a placeholder path
page = reader.pages[10]  # zero-based index, so this is page 11
text = page.extract_text()

This code returns all the text from page 11, including footers, boxes, table text and the page number, whereas I would like only the core text.

Unfortunately, the only solution I have found so far is to copy and paste the core text into another file.

Is there any method or package that can automatically distinguish the main text from the other parts of the PDF and return only that?

Thank you for your help!!!

Asked By: a-caputo

Answer 1

Per D.L's comment, please add some reproducible code and, preferably, a PDF to work with.

However, I think I can answer at least part of your question. jsvine's pdfplumber is an incredibly robust Python PDF processing package. It has bounding-box functionality that lets you extract text from within (.within_bbox(...)) or from outside (.outside_bbox(...)) a "bounding box", i.e. a rectangular area delineated on the Page object. Every character object extracted from the page carries location information, such as y1 (distance of the top of the character from the bottom of the page) and x0 (distance of the left side of the character from the left side of the page). If most pages in the PDF you are extracting text from contain footnotes, I would recommend extracting only the text above a chosen y1 threshold. Given that footnotes typically sit near the bottom of a page (academic papers using Chicago-style citations are an exception), you should still be able to set a standard bounding box for where you want to extract text: either extract within a box that excludes the footnotes, or extract outside a box that covers them.
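
A minimal sketch of that idea, assuming the header and footer occupy fixed fractions of the page height. The helper names body_bbox and extract_body_text and the default fractions are hypothetical and would need tuning per document. Note that pdfplumber bounding boxes are given as (x0, top, x1, bottom), with top and bottom measured from the top of the page, unlike y1.

```python
def body_bbox(page_width, page_height, header_frac=0.08, footer_frac=0.10):
    """Return (x0, top, x1, bottom) for the page minus header and footer bands.

    header_frac and footer_frac are assumed fractions of the page height.
    """
    top = page_height * header_frac
    bottom = page_height * (1.0 - footer_frac)
    return (0, top, page_width, bottom)


def extract_body_text(pdf_path, page_index=0, header_frac=0.08, footer_frac=0.10):
    """Extract text from one page, ignoring the header and footer bands."""
    import pdfplumber  # imported lazily so body_bbox works without pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        bbox = body_bbox(page.width, page.height, header_frac, footer_frac)
        return page.within_bbox(bbox).extract_text()
```

For example, extract_body_text("sample.pdf", 10) would return the text of page 11 without the top 8% and bottom 10% of the page.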

As for tables, that is a trickier problem. Tables are by far the hardest thing to detect and/or extract from. pdfplumber offers, to my knowledge, the most robust open-source table detection/extraction capabilities out there. To extract the text outside a table, I would call .find_tables(...) on each Page object, take the .bbox of each Table it returns, and extract around those boxes. However, this is not perfect: it is not always able to detect tables.
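
One way this could look, as a sketch only: collect the .bbox of every detected table and use Page.filter(...) to drop the characters whose midpoint falls inside any of them. The helper names center_in_bbox, outside_all and extract_text_outside_tables are hypothetical, and the result is only as good as the table detection.

```python
def center_in_bbox(obj, bbox):
    """True if the midpoint of obj lies inside bbox = (x0, top, x1, bottom)."""
    x0, top, x1, bottom = bbox
    h_mid = (obj["x0"] + obj["x1"]) / 2
    v_mid = (obj["top"] + obj["bottom"]) / 2
    return x0 <= h_mid < x1 and top <= v_mid < bottom


def outside_all(obj, bboxes):
    """True if the midpoint of obj is outside every bounding box in bboxes."""
    return not any(center_in_bbox(obj, bbox) for bbox in bboxes)


def extract_text_outside_tables(pdf_path, page_index=0):
    """Extract the text of one page, skipping characters inside detected tables."""
    import pdfplumber  # imported lazily so the helpers above work without it

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        table_bboxes = [table.bbox for table in page.find_tables()]

        def keep(obj):
            # Only characters are filtered; lines, rects, etc. are left alone.
            if obj.get("object_type") != "char":
                return True
            return outside_all(obj, table_bboxes)

        return page.filter(keep).extract_text()
```

If no tables are detected on the page, the filter keeps everything and the full page text is returned.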

Regarding your third question, how to exclude boxes: are you referring to text boxes? Please provide further clarification!

Finally, to reiterate my first point, pdfplumber is an incredibly robust package. That said, extracting text from PDF files is really tough. Good luck, and please provide more information; I will be happy to help as best I can.

Answered By: Thomas

Answer 2

Building on the ideas shared by Thomas in his answer, here is what I came up with:

import collections
import pdfplumber


def find_text_parts_on_page(page):
    """
    Idea: separate text by font sizes, rank them by popularity.
    The most popular text size is most likely the main text.
    The second most popular text size is most likely the footnote.
    However, we check which of the two most popular text sizes is larger (by font size).
    We pick the larger one as the main text and the smaller one as the footnote.
    We could also use the vertical position of the bounding box to determine that.
    """

    font_sizes = collections.Counter()
    bounding_boxes = {}

    for char in page.chars:
        size_key = char["size"]
        font_sizes[size_key] += 1
        if size_key not in bounding_boxes:
            bounding_boxes[size_key] = [char["x0"], char["top"], char["x1"], char["bottom"]]
        else:
            if char["x0"] < bounding_boxes[size_key][0]:
                bounding_boxes[size_key][0] = char["x0"]
            if char["top"] < bounding_boxes[size_key][1]:
                bounding_boxes[size_key][1] = char["top"]
            if char["x1"] > bounding_boxes[size_key][2]:
                bounding_boxes[size_key][2] = char["x1"]
            if char["bottom"] > bounding_boxes[size_key][3]:
                bounding_boxes[size_key][3] = char["bottom"]

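    most_common_sizes = font_sizes.most_common(2)
    if len(most_common_sizes) < 2:
        # With fewer than two font sizes, main text and footnote cannot be told apart
        raise ValueError("page needs at least two distinct font sizes")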

    # The main box has larger text size than the footnote box
    first = most_common_sizes[0][0], bounding_boxes[most_common_sizes[0][0]]
    second = most_common_sizes[1][0], bounding_boxes[most_common_sizes[1][0]]

    if first[0] > second[0]:
        return first, second
    else:
        return second, first


with pdfplumber.open("sample.pdf") as pdf:

    first_page = pdf.pages[0]
    [main_size, main_box], [footnote_size, footnote_box] = find_text_parts_on_page(first_page)

    main_part = first_page.within_bbox(main_box)
    footnote_part = first_page.within_bbox(footnote_box)

    print("-----")

    print(main_part.extract_text())

    print("-----")

    print(footnote_part.extract_text())

    print("-----")
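
A possible variation on the above, sketched under the same assumption that the most frequent font size is the body text: instead of cropping to one bounding box per size (which can still enclose a box or table sitting between body paragraphs), keep only the characters whose size matches. The helper names dominant_font_size and extract_dominant_size_text and the rounding to one decimal place are assumptions, not part of the code above.

```python
import collections


def dominant_font_size(chars, ndigits=1):
    """Return the most frequent (rounded) font size among chars, or None if empty."""
    counts = collections.Counter(round(c["size"], ndigits) for c in chars)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def extract_dominant_size_text(pdf_path, page_index=0, ndigits=1):
    """Extract only the characters set in the page's most frequent font size."""
    import pdfplumber  # imported lazily so dominant_font_size works without it

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        main_size = dominant_font_size(page.chars, ndigits)
        if main_size is None:
            return ""

        def keep(obj):
            # Non-character objects are kept; characters must match the main size.
            if obj.get("object_type") != "char":
                return True
            return round(obj["size"], ndigits) == main_size

        return page.filter(keep).extract_text()
```

This also drops headings and anything else set in a different size, and it keeps table or box text that happens to use the body size, so it complements position-based cropping but does not replace it.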

Answered By: jbasko
