Python – Extracting text from webpage PDF
Question:
I have come across a few posts about converting PDFs to HTML or to text, but they all work from a file saved to the computer. Is there a way to extract the text from a PDF on a webpage without downloading the PDF file itself? I will be doing this for a large number of files by iterating through a list of URLs.
I am also curious which library is best for this: pdfkit, pdf2txt, pdfminer, etc.?
Here is an example website with the format I will be dealing with: http://www.arkansasrazorbacks.com/wp-content/uploads/2017/02/Miami-Ohio-Game-2.pdf
Answers:
You can download the file as a byte stream with requests and wrap it in io.BytesIO(), like so:
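import io
import requests
from pyPdf import PdfFileReader  # legacy pyPdf package

url = 'http://www.arkansasrazorbacks.com/wp-content/uploads/2017/02/Miami-Ohio-Game-2.pdf'
r = requests.get(url)
# Wrap the downloaded bytes in an in-memory, file-like object
f = io.BytesIO(r.content)
reader = PdfFileReader(f)
# Text of the first page (index 0), split into lines
contents = reader.getPage(0).extractText().split('\n')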
f is a file-like object that you can use just as if you had opened a PDF file. This way the file exists only in memory and is never saved locally.
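Since the bytes never touch the disk, it can help to check that the server really returned a PDF (and not, say, an HTML error page) before handing the stream to a parser. The following is a minimal sketch of that idea; the helper names looks_like_pdf and to_memory_file are hypothetical, and the check assumes the file begins directly with the standard %PDF- header, so it would reject a file that has stray bytes in front of the header.

```python
import io

# Every well-formed PDF begins with this marker, followed by a version number.
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the raw bytes start with the PDF header marker."""
    return data.startswith(PDF_MAGIC)


def to_memory_file(data: bytes) -> io.BytesIO:
    """Wrap downloaded bytes in a seekable, file-like object held in memory.

    Raises ValueError if the payload does not look like a PDF.
    """
    if not looks_like_pdf(data):
        raise ValueError("payload does not start with %PDF-")
    return io.BytesIO(data)
```

With the answer's code, this would be used as f = to_memory_file(r.content) in place of calling io.BytesIO(r.content) directly.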
To get the text out of the PDF file you can use pyPdf.
Updated code for the PyPDF2 library:
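import io
import requests
import PyPDF2

url = 'http://www.arkansasrazorbacks.com/wp-content/uploads/2017/02/Miami-Ohio-Game-2.pdf'
r = requests.get(url)
# Wrap the downloaded bytes in an in-memory, file-like object
f = io.BytesIO(r.content)
reader = PyPDF2.PdfReader(f)
# Pages are zero-indexed: pages[2] is the third page, pages[0] the first
contents = reader.pages[2].extract_text().split('\n')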
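The question mentions iterating through a list of URLs, so here is one way the same approach might be wrapped in a loop. This is only a sketch under a few assumptions: the function names (fetch_pdf_bytes, pdf_bytes_to_text, extract_all) are made up for illustration, the 30-second timeout is an arbitrary choice, and a URL that fails to download or parse is recorded as None rather than stopping the whole run.

```python
import io
from typing import Callable, Dict, Iterable, Optional


def fetch_pdf_bytes(url: str, timeout: float = 30.0) -> bytes:
    """Download a URL and return the raw response body."""
    import requests  # third-party; imported here so the module loads without it

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # turn HTTP 4xx/5xx into an exception
    return response.content


def pdf_bytes_to_text(data: bytes) -> str:
    """Extract the text of every page from in-memory PDF bytes."""
    import PyPDF2  # third-party; imported here so the module loads without it

    reader = PyPDF2.PdfReader(io.BytesIO(data))
    # Treat a page with no extractable text as an empty string
    return "\n".join((page.extract_text() or "") for page in reader.pages)


def extract_all(
    urls: Iterable[str],
    fetch: Callable[[str], bytes] = fetch_pdf_bytes,
    parse: Callable[[bytes], str] = pdf_bytes_to_text,
) -> Dict[str, Optional[str]]:
    """Map each URL to its extracted text, or to None if it failed."""
    results: Dict[str, Optional[str]] = {}
    for url in urls:
        try:
            results[url] = parse(fetch(url))
        except Exception as exc:  # keep going when one file is bad
            print(f"skipped {url}: {exc}")
            results[url] = None
    return results
```

Calling extract_all(list_of_urls) returns a dictionary keyed by URL. Passing fetch and parse as parameters is a design choice that makes it easy to swap in a different PDF library, or to exercise the loop without network access.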