Not getting the text from PDF in right format when reading using pyPDF

Question

I am trying to read the PDF document at the following link using the pyPdf package in Python.
http://www.hdfcsec.com/Share-Market-Research/Research-Details/StockReports/3011454

I used the following code to read the PDF:

    ###########Beginning of Code########
    import os
    from pyPdf import PdfFileReader

    filename = os.path.abspath('F:/KG/per/Entr/equity research Text mining  tool/HDFC_report.pdf')

    # Open in binary mode and keep the file open while the pages are read
    with open(filename, "rb") as pdf_file:
        reader = PdfFileReader(pdf_file)
        for page in reader.pages:
            print page.extractText()
    ###########End of Code########

However, the text returned is garbled in places. I have reproduced part of the output below; the table is jumbled together with the surrounding text. Is there a more methodical way of reading the text section by section, with tables in their proper format, so that it is fit for processing?

****INFOSYS : COMPANY UPDATE CASH FLOW (Rs mn) FY13 FY14 FY15E FY16E FY17E Reported PAT 94,210 106,480 124,795 138,276 152,349 Non-operating & Interest Income (12,006) (14,445) (16,750) (18,090) (18,090) PAT from Operations 82,204 92,035 108,045 120,186 134,259 Depreciation 11,290 13,740 10,907 12,336 12,803 Working Capital Change (10,720) (190) (2,442) (11,324) (10,301) OPERATING CASH FLOW ( a ) 82,774 107,425 116,510 121,198 136,761 Capex+ Acquisitions (32,470) (27,450) (22,000) (22,000) (22,000) Free cash flow 50,304 79,975 94,510 99,198 114,761 Investments (6,034) (8,135) 16,750 18,090 18,090 INVESTING CASH FLOW ( b ) (38,504) (35,585) (5,250) (3,910) (3,910) Share capital Issuance 10 – – – – Debt Raised (890) – – – -****

Asked By: Karthik Ganapathy


Answer 1

I found that PDFMiner does a very good job of extracting PDF content.
Please go through https://dzone.com/articles/pdf-reading and http://www.unixuser.org/~euske/python/pdfminer/

I tried extracting the text from the PDF you linked using pdf2txt.py, which comes with PDFMiner:

    pdf2txt.py -o output.htm report.pdf

It ran without problems and produced an HTML file. Some of the divs are off, but once you understand the structure of the output, you should be able to extract the required content from that HTML.
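If you are stuck with the flattened text that pyPdf already returned, a rough fallback is to split it back into rows yourself. The sketch below is plain Python 3 using only the standard library; it is not part of PDFMiner or pyPdf. It assumes that every row is a text label followed by its numeric cells, that negative values are written in parentheses, and that a dash means "no value". The helper names (`is_cell`, `split_rows`, `to_number`) are made up for illustration, and the sample string is a shortened copy of the output in the question.

    import re

    # A cell is a number such as 94,210 or (12,006), or a dash meaning "no value".
    CELL = re.compile(r"^\(?\d[\d,]*(?:\.\d+)?\)?$")
    DASHES = {"-", "\u2013"}  # ASCII hyphen and en dash


    def is_cell(token):
        """Return True if the token looks like a table cell rather than label text."""
        return token in DASHES or bool(CELL.match(token))


    def split_rows(flat_text):
        """Group whitespace-separated tokens into (label, cells) rows.

        A row ends when label text starts again after at least one cell.
        """
        rows = []
        label, cells = [], []
        for token in flat_text.split():
            if is_cell(token):
                cells.append(token)
            else:
                if cells:  # a new label begins, so the previous row is complete
                    rows.append((" ".join(label), cells))
                    label, cells = [], []
                label.append(token)
        if cells:  # flush the last row
            rows.append((" ".join(label), cells))
        return rows


    def to_number(cell):
        """Convert a cell to a float; parentheses mean negative, a dash means None."""
        if cell in DASHES:
            return None
        negative = cell.startswith("(") and cell.endswith(")")
        value = float(cell.strip("()").replace(",", ""))
        return -value if negative else value


    sample = (
        "Reported PAT 94,210 106,480 124,795 138,276 152,349 "
        "Non-operating & Interest Income (12,006) (14,445) (16,750) (18,090) (18,090) "
        "OPERATING CASH FLOW ( a ) 82,774 107,425 116,510 121,198 136,761 "
        "Share capital Issuance 10 \u2013 \u2013 \u2013 \u2013"
    )

    for label, cells in split_rows(sample):
        print(label, [to_number(c) for c in cells])

This only works while labels never contain a bare number as a separate word, and it does not handle the header line (FY13, FY14, ...), so treat it as a starting point rather than a general table parser.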

All the best
Venkat

Answered By: Venkata Krishnan
