Title Extraction/Identification from PDFs

Question:

I have a large number of pdfs in different formats. Among other things, I need to extract their titles (not the document name, but a title in the text). Due to the range of formats, the titles are not in the same locations in the pdfs. Further, some of the pdfs are actually scanned images (I need to use OCR/Optical Character Recognition on them). The titles are sometimes one line, sometimes two, and they do not tend to share the same set of words. In the range of physical locations where the titles usually show up, there are often other words (i.e., if doc 1 has title 1 at x1, y1, doc 2 might have title 2 at x2, y2 but have other non-title text at x1, y1). Further, there are some very rare cases where the pdfs don’t have a title.

So far I can use pdftotext to extract text within a given bounding box, and convert it to a text file. If there’s a title, this lets me capture the title, but often with other extraneous words included. This also only works on non-image pdfs. I’m wondering if a) There’s a good way to identify the title from among all the words I extract for a document (because there are often extraneous words), ideally with a good way to identify that no title exists, and b) if there are any tools that are equivalent to pdftotext that will also work on scanned images (I do have an ocr script working, but it does ocr over an entire image rather than a section of one).

One method that somewhat answers the title dilemma is to extract the words in the bounding box, use the rest of the document to identify which of the bounding box words are keywords for the document, and construct the title from the keywords. This wouldn’t extract the actual title, but may give words that could construct a reasonable alternative. I’m already extracting keywords for other parts of the project, but I would definitely prefer to extract the actual title as people may be using the verbatim title for lookup purposes.

Further note, in case it wasn’t clear: I’m trying to do this programmatically with open source/free tools, ideally in Python, and I will have a large number of documents (10,000+).

Asked By: Evan Mata

Answers:

For people who come across this question later, I’ll provide a quick update on what I’ve decided to do (although I haven’t tested accuracy, so I don’t know if this approach is actually any good).

The overall approach I’ll be using is machine learning via a neural net (I’ll report back on accuracy once I have it). I’m essentially taking the first 200 words of a document and generating n-grams of 4-20 sequential words (so ~16*200 n-grams of words; 4 because none of my titles are shorter, 20 because none are longer). I then generate a unique feature vector from each n-gram. The features I decided to use are partially dependent on my text, but some are more general, like “Is the first letter of the first word in the n-gram capitalized?”. Knowing the correct titles, I can turn them into an equivalent vector. So if vec(n_gram) = vec(correct_title) then output 1, otherwise output 0. I’m using this to train an ML model. Currently this does not solve my issue of scanned image pdfs, unless they’re first converted into text documents. It also assumes word order is preserved among the title words when the pdf is turned into the n-grams. I have noticed the order of non-title words isn’t always preserved by conversion, but that’s quite a rare problem and only seems to occur when there are line breaks and the entire line ends up out of place (so hopefully it shouldn’t affect the titles).
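A minimal sketch of the n-gram generation and labelling described above might look like the following. The function names and the three example features are hypothetical placeholders, not the features actually used in this answer. The sketch also labels an n-gram by comparing its words with the known title directly, which is equivalent to comparing vectors only under the assumption that each feature vector is unique.

```python
from typing import List, Sequence, Tuple


def candidate_ngrams(text: str, max_words: int = 200,
                     min_len: int = 4, max_len: int = 20) -> List[Tuple[str, ...]]:
    """Return every run of min_len..max_len consecutive words
    taken from the first max_words words of the text."""
    words = text.split()[:max_words]
    grams = []
    for n in range(min_len, max_len + 1):
        for start in range(len(words) - n + 1):
            grams.append(tuple(words[start:start + n]))
    return grams


def features(gram: Sequence[str]) -> List[float]:
    """Hypothetical feature vector; real features would depend on the corpus."""
    return [
        1.0 if gram[0][:1].isupper() else 0.0,            # first word capitalized?
        sum(w[:1].isupper() for w in gram) / len(gram),   # share of capitalized words
        float(len(gram)),                                 # n-gram length in words
    ]


def label(gram: Sequence[str], title: str) -> int:
    """1 if the n-gram is exactly the known title, else 0."""
    return 1 if list(gram) == title.split() else 0


if __name__ == "__main__":
    doc = "Annual Report 2020 A Study of Title Extraction from PDF Files by Someone"
    title = "A Study of Title Extraction"
    grams = candidate_ngrams(doc)
    X = [features(g) for g in grams]      # model inputs
    y = [label(g, title) for g in grams]  # training targets
    print(len(grams), sum(y))
```

Because only one n-gram per document is the title, the positive class is tiny compared with the negatives, so some form of class weighting or negative subsampling is likely needed during training.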

Answered By: Evan Mata

You can use the font-size information of the words to extract the title words.
From what I understand of your question, here is what I propose for extracting the title words:

Convert the pdf documents to images using any open source module, say pdf2image, then use tesseract for OCR. The OCR output gives you the text along with its dimension information, i.e. the width and height of each individual word.

Do some statistical analysis (e.g. a histogram plot) on the word heights and see if you can use the height distribution to recognize the title words.
You can either use a fixed threshold chosen heuristically, or an adaptive threshold derived from the height distribution, and then treat words above that threshold as title words.
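As a rough sketch of the adaptive-threshold idea, the function below keeps the words whose box height is well above the page's median word height. The function name and the `ratio` value of 1.5 are assumptions for illustration and would need tuning against real documents. The input is assumed to be (word, height) pairs, which could be built from the `text` and `height` lists returned by `pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)`.

```python
from statistics import median
from typing import List, Tuple


def title_words_by_height(words: List[Tuple[str, int]],
                          ratio: float = 1.5) -> List[str]:
    """Return the words whose height is at least `ratio` times the
    median word height, in their original order.

    An empty result can be read as "no title found on this page".
    """
    # Ignore empty OCR tokens, which carry no usable height information.
    heights = [h for text, h in words if text.strip()]
    if not heights:
        return []
    cutoff = ratio * median(heights)  # adaptive threshold for this page
    return [text for text, h in words if text.strip() and h >= cutoff]


if __name__ == "__main__":
    page = [("Deep", 40), ("Learning", 42), ("this", 12), ("is", 10),
            ("body", 13), ("text", 12), ("here", 12)]
    print(" ".join(title_words_by_height(page)))
```

Word-box heights from OCR vary with ascenders and descenders (a word like "Typography" is taller than "ease" at the same font size), so using the line-level height, or the median height of the words on a line, may be more stable than the raw per-word values.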

Answered By: flamelite

This might be a bit late, but I would also check Layout Parser. The model pre-trained on PubLayNet includes title as one of the entities it extracts. The pre-trained models can be improved by re-training on your own data (here you can find references to the demo, notebooks & slides).

Answered By: natt010