Python – Can I convert values from a PDF to a DataFrame?

Question:

I am trying to convert the values from a PDF into a pandas DataFrame that can be manipulated in Python.

I have attached a photo that shows how I currently do it, as well as a sample PDF. Thanks in advance.

Picture of how I do it now

Link to pdf on drive

I tried a solution from someone who wanted something similar, but since the data I want in a DataFrame sits at the bottom of the page and is not a table, it did not work for me.

Asked By: Crazy Apple

Answers:

I can’t comment on your question because I don’t have enough reputation, but you can definitely check out the tabula-py project to tabulate your data. Here is a link for installation and documentation.

Since your tables are formatted quite neatly, the functions should be able to recognize the data without too much trouble. I’d be happy to try and look through any code you’re having problems with as you try to tabulate the data.
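As a rough sketch of what that could look like: tabula-py's `read_pdf` accepts an `area` given as (top, left, bottom, right) in points, which may help when the data sits in one corner of the page. The helper names, the file name, and the window values below are hypothetical placeholders, and tabula-py also needs Java installed.

```python
def window_to_area(x, y, width, height):
    """Convert a left/top/width/height window (in points) into the
    [top, left, bottom, right] order that tabula-py expects for `area`."""
    return [y, x, y + height, x + width]


def read_region(pdf_path, area, page=1):
    """Return the list of DataFrames tabula-py finds inside `area`."""
    import tabula  # third-party: pip install tabula-py (requires Java)

    return tabula.read_pdf(
        pdf_path,
        pages=page,
        area=area,    # [top, left, bottom, right] in points
        guess=False,  # use the given area instead of auto-detection
        stream=True,  # columns separated by whitespace, not ruling lines
    )
```

A call such as `read_region("sample.pdf", window_to_area(290, 530, 300, 300))` would return a list of DataFrames; whether the first one holds the rows you want depends on the PDF.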

Answered By: HarunCelikOtto

The best way is to pre-process before manipulating. Here I simply convert the PDF with pdftotext and open the result in Notepad or Excel; with Excel VBA that could all be done without Python. Or, for your use, you can use Python to edit the text into CSV by adding commas at the desired columns, the way Excel does it.

Either way, it is just one line to call on multiple files.

list,of al,l pieces:,,,
,Piece,Widt x,Hei,Q,ty Description
,58,762 x,582,2,@5
,70,762 x,582,2,@5
,16,70 x,564,4,@8
,67,70 x,1250,4,@8
,59,1250 x,582,1,@5
,71,1350 x,582,1,@5
,77,762 x,582,1,@5
,28,744 x,70,1,@8
,44,194 x,70,1,@8
,84,802 x,280,3,@2

So, depending on how you clean your text, you can do much better than the raw single-line output above, and we don’t need Excel either:

@pdftotext -nopgbrk -f 1 -l 1 -layout -x 290 -y 530 -W 300 -H 300 cut-sample.pdf out.txt
@echo Pc,W,H,Q,C>out.csv&for /f "usebackq tokens=1,2,4,5,6 delims= " %%f in ("out.txt") do @echo %%f,%%g,%%h,%%i,%%j >>out.csv
@echo/&type out.csv
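For anyone who prefers to stay in Python, the same two steps could be sketched roughly as follows. It assumes pdftotext is on the PATH and that data rows have six whitespace-separated tokens starting with a piece number; the function names are made up for illustration, and the column names are taken from the CSV header in the batch file above.

```python
import subprocess


def pdftotext_window(pdf_path, x, y, width, height, page=1):
    """Run pdftotext on one cropped page region and return its text."""
    cmd = [
        "pdftotext", "-nopgbrk", "-layout",
        "-f", str(page), "-l", str(page),
        "-x", str(x), "-y", str(y), "-W", str(width), "-H", str(height),
        pdf_path, "-",  # "-" sends the extracted text to stdout
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


def tokens_to_rows(text):
    """Keep tokens 1, 2, 4, 5 and 6 of each line, like the batch loop."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        # Assumption: real data rows have six tokens and start with digits,
        # which also drops the title and column-heading lines.
        if len(parts) >= 6 and parts[0].isdigit():
            rows.append([parts[0], parts[1], parts[3], parts[4], parts[5]])
    return rows


def to_dataframe(rows):
    """Build a DataFrame with the same header the batch file writes."""
    import pandas as pd  # third-party: pip install pandas

    return pd.DataFrame(rows, columns=["Pc", "W", "H", "Q", "C"])
```

Something like `to_dataframe(tokens_to_rows(pdftotext_window("cut-sample.pdf", 290, 530, 300, 300)))` would then give the DataFrame directly, with every column still a string.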

Here I have not allowed for tables of different sizes or positions, so, if necessary, you can move that "window" of interest up and to the left, make it wider and taller, and then simply extract any line that includes @, since every data row in the OP’s example contains one.
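A minimal Python sketch of that "keep only the @ lines" idea, assuming each data row looks like `58 762 x 582 2 @5` somewhere on the line. The pattern, the function name, and the column names (guessed from the "Piece / Width x Height / Qty / Description" heading) are assumptions, not something tested against other PDFs.

```python
import re

# piece, width "x" height, quantity, then a description starting with "@"
ROW = re.compile(r"(\d+)\s+(\d+)\s*x\s*(\d+)\s+(\d+)\s+(@\S+)")


def extract_at_rows(text):
    """Return one dict per line that contains '@' and matches ROW."""
    rows = []
    for line in text.splitlines():
        if "@" not in line:
            continue  # titles, headings and blank lines are skipped
        match = ROW.search(line)
        if match:
            piece, width, height, qty, desc = match.groups()
            rows.append({
                "piece": int(piece),
                "width": int(width),
                "height": int(height),
                "qty": int(qty),
                "description": desc,
            })
    return rows
```

Because the filter does not care where the row sits in the text, the window can be generous; passing the result to `pandas.DataFrame(...)` gives numeric columns ready for manipulation.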

For a more complex "if this then that" CSV output see https://stackoverflow.com/a/75856112/10802527

Answered By: K J