pdf-scraping

Extract metadata info from online pdf using pdfminer in python

Question: I want to extract metadata from an online PDF using pdfminer, such as the title, author, and number of lines. I am trying to use a related solution discussed at https://stackoverflow.com/a/60151816/15143974, which uses …

Total answers: 2

Scraping specific pdfs from different websites

Question: First question here. I need to download a specific PDF from each URL I have: just the PDF of the European Commission proposal, which is always in a specific part of the page [Here the part from the website that I would …

Total answers: 2

Title Extraction/Identification from PDFs

Question: I have a large number of PDFs in different formats. Among other things, I need to extract their titles (not the document name, but a title in the text). Because the formats vary, the titles are not in the same location in each PDF. Further, some of the …

Total answers: 3

Extract / Identify Tables from PDF python

Question: Are there any open source libraries that support table identification and extraction? By this I mean: identify that a table structure exists; classify the table from its contents; extract data from the table in a useful output format, e.g. JSON or CSV. I have looked through similar …

Total answers: 3

Python module for converting PDF to text

Question: Is there any Python module to convert PDF files into text? I tried one piece of code found on ActiveState which uses pypdf, but the generated text had no spaces between words and was of no use. Asked by: cnu. Answers: Try PDFMiner. It can extract …

Total answers: 13