Using tabula-py, why do I get a list and not a DataFrame?

Question:


I want to work with PDF files, especially with tables. I wrote this code:

import pandas as pd
import numpy as np
import tabula
from tabula import read_pdf
tab= tabula.read_pdf('..PDFsAla.pdf',encoding='latin-1', pages ='all')
tab

But I get a list of values, like this:

[    Nombres  Edad Ciudad
0    Noelia    20   Lima
1  Michelie    45   Lima
2    Ximena    18   Lima
3    Miguel    43   Lima]

I cannot analyze it because it's not a DataFrame. This is just an example; the real PDF file contains tables between blocks of text and spans several pages.

Could someone please help me with this issue?

Asked By: Maria Fernanda


Answers:

tabula's read_pdf returns a list of pandas DataFrames, one for each table found in the PDF, which is why the output is wrapped in square brackets. You can display and work with them as follows:

import pandas as pd
import numpy as np
import tabula
from tabula import read_pdf

dfs = tabula.read_pdf('..PDFsAla.pdf', encoding='latin-1', pages='all')
print(f"Found {len(dfs)} tables")

# display each of the dataframes
for df in dfs:
    print(df.size)
    print(df)
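
If the tables in the list share the same columns (for example, one table continued over several pages), they can probably be combined into a single DataFrame with pandas.concat. A minimal sketch, where the hand-built `dfs` list is only a stand-in for what read_pdf returns and the split into two tables is an assumption made for illustration:

```python
import pandas as pd

# Stand-in for the list returned by tabula.read_pdf(..., pages='all').
# The split into two tables is illustrative, not read from a real PDF.
dfs = [
    pd.DataFrame({"Nombres": ["Noelia", "Michelie"],
                  "Edad": [20, 45],
                  "Ciudad": ["Lima", "Lima"]}),
    pd.DataFrame({"Nombres": ["Ximena", "Miguel"],
                  "Edad": [18, 43],
                  "Ciudad": ["Lima", "Lima"]}),
]

# Stack the tables vertically and renumber the rows from 0.
combined = pd.concat(dfs, ignore_index=True)

print(type(combined).__name__)
print(combined)
print(combined["Edad"].mean())
```

If the tables have different columns, concatenating them will fill the gaps with NaN, so in that case it is usually better to keep them separate and pick each one by its index in the list.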
Answered By: Martin Evans

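tabula returns a list of pandas DataFrames. You can get a single DataFrame by taking the first element of that list, as in the statement below. Note that this picks only the first table found, and that the element is already a DataFrame, so the pandas.DataFrame(...) wrapper is not strictly required.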

import tabula
import pandas

tab = pandas.DataFrame(tabula.read_pdf('..PDFsAla.pdf', pages ='all')[0])
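
To see why indexing into the result works, it can help to check the types involved. A small sketch, using a hand-built list in place of the read_pdf result (an assumption, so that it runs without a PDF file); the names `tables`, `first` and `over_40` are made up for this example:

```python
import pandas as pd

# Stand-in for tabula.read_pdf(...): a list holding one DataFrame.
tables = [
    pd.DataFrame({"Nombres": ["Noelia", "Michelie", "Ximena", "Miguel"],
                  "Edad": [20, 45, 18, 43],
                  "Ciudad": ["Lima"] * 4})
]

print(isinstance(tables, list))         # the outer object is a list
first = tables[0]                       # each element is a DataFrame
print(isinstance(first, pd.DataFrame))

# Regular DataFrame operations now work, e.g. filtering rows by age.
over_40 = first[first["Edad"] > 40]
print(over_40["Nombres"].tolist())
```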
Answered By: Divyansh Gemini