Extracting all tables using tabula

Question:

Reading a PDF file with
df = tabula.read_pdf(pdf_file, pages='all')
displays all tables from all pages.

But when I convert the result into a pandas DataFrame with
tables = pd.DataFrame(tabula.read_pdf(pdf_file, pages='all', lattice='True')[0])
only the table on the first page is displayed.

Asked By: arvin

Answers:

In current versions of tabula-py, tabula.read_pdf returns a list of DataFrames (one per detected table), not a single DataFrame.

To use pandas and tabula together, the syntax should look something like the line below. Note that the [0] selects only the first table in the list, which is why only the first page's table shows up:

df = pd.DataFrame(tabula.read_pdf(pdf_file, pages='all')[0])
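
To make the list behaviour concrete, here is a minimal sketch. It uses two small hand-built DataFrames as a stand-in for what tabula.read_pdf(..., pages='all') would return, so the sample data and the variable names are assumptions for illustration only. Indexing with [0] keeps just the first table, which matches what the question describes.

```python
import pandas as pd

# Stand-in for: tables = tabula.read_pdf(pdf_file, pages='all')
# (hypothetical sample data; a real call yields one DataFrame per detected table)
tables = [
    pd.DataFrame({"name": ["a", "b"], "value": [1, 2]}),  # e.g. table from page 1
    pd.DataFrame({"name": ["c"], "value": [3]}),          # e.g. table from page 2
]

first_only = tables[0]     # what [0] gives you: only the first table
table_count = len(tables)  # how many tables were actually extracted

# Looping over the list is the simplest way to see every table.
for i, table in enumerate(tables):
    print(f"table {i}: {table.shape[0]} rows x {table.shape[1]} columns")
```

Checking len(tables) before indexing is a quick way to confirm whether tabula found more tables than the one you are looking at.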

If you want to use everything tabula returned, you can concatenate the list into a single DataFrame:

dfs = tabula.read_pdf(pdf_file, pages='all')
df = pd.concat(dfs)
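
One detail worth knowing: by default pd.concat keeps each table's own row labels, so the combined index repeats (0, 1, 0, 1, ...). The sketch below passes ignore_index=True to renumber the rows. The list dfs here is hand-built sample data standing in for the tabula output, not real extraction results.

```python
import pandas as pd

# Stand-in for the list returned by tabula.read_pdf(pdf_file, pages='all')
# (hypothetical sample data)
dfs = [
    pd.DataFrame({"name": ["a", "b"], "value": [1, 2]}),
    pd.DataFrame({"name": ["c", "d"], "value": [3, 4]}),
]

# ignore_index=True discards the per-table row labels (0, 1, 0, 1)
# and assigns a fresh 0..n-1 index to the combined frame.
df = pd.concat(dfs, ignore_index=True)
print(df)
```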

If every table has its own header and you want to keep only the first table's header, dropping the headers of all subsequent tables, try the following:

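import numpy as np
import pandas as pd
import tabula

# read_pdf returns a list of DataFrames, so take the column names from the first one
dfs = tabula.read_pdf(pdf_file, pages='all')
# stack the raw values (all tables need the same number of columns) under the first header
df = pd.DataFrame(np.concatenate([t.to_numpy() for t in dfs]), columns=dfs[0].columns)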
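
As a possible alternative that stays within pandas, the sketch below relabels every table with the first table's column names and then concatenates, which avoids converting everything to a single NumPy array (mixed column types would otherwise collapse to object dtype). stack_with_first_header is a hypothetical helper written for this example, not part of pandas or tabula, and the sample tables are made up. It assumes all tables have the same number of columns in the same order.

```python
import pandas as pd

def stack_with_first_header(tables):
    """Relabel every table with the first table's column names, then stack them.

    Assumes all tables have the same number of columns in the same order.
    """
    header = tables[0].columns
    relabeled = []
    for table in tables:
        if table.shape[1] != len(header):
            raise ValueError("tables have different column counts")
        renamed = table.copy()
        renamed.columns = header  # positional relabel; the data is untouched
        relabeled.append(renamed)
    return pd.concat(relabeled, ignore_index=True)

# Stand-in for tabula output where the second table's header was read differently
# (hypothetical sample data)
tables = [
    pd.DataFrame({"Name": ["a", "b"], "Value": [1, 2]}),
    pd.DataFrame({"name": ["c"], "value": [3]}),
]
df = stack_with_first_header(tables)
print(df)
```

With real data you would pass the list returned by tabula.read_pdf(pdf_file, pages='all') instead of the sample tables.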
Answered By: Han