Cannot read PDF Data into Sheets with Gspread-DataFrame

Question

I used Tabula to read data from a PDF and want to load it into Google Sheets, but when I transfer the data as it was read, I get an error. I know the extracted data is dirty, but I planned to clean it up in Google Sheets.

Reading the data from the PDF (first part of my full code)

import tabula
import pandas as pd
file_path = 'TnPresidentbyCountyNov2016.pdf'
df = tabula.read_pdf(file_path, pages='all', multiple_tables='FALSE', stream='TRUE')
print (df)


[      Anderson   19,212    9,013   74  1,034   42  174    189   28  0  0.1
0      Bedford   11,486    3,395   25    306    8   47     75    5  0    0
1       Benton    4,716    1,474   12     83   13   11     14    2  0    0
2      Bledsoe    3,622      897    7     95    4    9     18    2  0    0
3       Blount   37,443   12,100   83  1,666   72  250    313   51  1    1
4      Bradley   29,768    7,070   66  1,098   44  143    210   29  1    1
5     Campbell    9,870    2,248   32    251   25   43     45    5  0    0
6       Cannon    4,007    1,127    8    106    7   18     29    3  0    0
7      Carroll    7,756    2,327   22    181   20   18     39    2  0    0
8       Carter   16,898    3,453   30    409   20   54    130   26  0    0
9     Cheatham   11,297    3,878   26    463   13   50     99    8  0    0
10     Chester    5,081    1,243    5    115    4   12     10    4  0    0
11   Claiborne    8,602    1,832   16    192   24   27     29    2  0    0
12        Clay    2,141      707    2     47    2   10     11    0  0    0
13       Cocke    9,791    1,981   21    211   19   27     59    2  0    2
14      Coffee   14,417    4,743   32    517   23   62    113    9  0    1
15    Crockett    3,982    1,303    7     76    3    8     13    1  0    0
16  Cumberland   20,413    5,202   37    471   26   53     99   17  0    1
17    Davidson   84,550  148,864  412  9,603  304  619  2,459  106  0    6
18     Decatur    3,588      894    5     70    4    8     16    2  0    0
19      DeKalb    5,171    1,569   10    117    6   29     49    0  0    0
20     Dickson   13,233    4,722   32    489   18   58     94    9  0    3
21        Dyer   10,180    2,816   19    193   13   27     48    3  0    0
22     Fayette   13,055    5,874   19    261   16   37     62   21  0    0
23    Fentress    6,038    1,100   10    107   14   11     37    1  0    0
24    Franklin   11,532    4,374   28    319   16   36     66    7  0    0
25      Gibson   13,786    5,258   26    305   18   36     66    8  0    0
26       Giles    7,970    2,917   16    162   11   11     41    1  0    0
27    Grainger    6,626    1,154   17    130   12   28     26    4  0    0
28      Greene   18,562    4,216   28    481   29   56    152   14  0    0
29      Grundy    3,636      999   11     80    3   13     19    0  0    0
30     Hamblen   15,857    4,075   30    443   27   73     93    8  0    0
31    Hamilton   78,733   55,316  147  5,443  138  349  1,098  121  0    0
32     Hancock    1,843      322    4     42    1    5     13    0  0    0
33    Hardeman    4,919    4,185   18     84   11   13     30    9  0    0
34      Hardin    8,012    1,622   15    134   22   48     96    0  0    0
35     Hawkins   16,648    3,507   31    397   12   52     91    7  0    3
36     Haywood    3,013    3,711   11     60   10   10     19    0  0    0
37   Henderson    8,138    1,800   13    172    9   27     39    1  0    0
38       Henry    9,508    3,063   18    223   15   27     60    4  0    0
39     Hickman    5,695    1,824   20    161   19   15     39   18  0    0
40     Houston    2,182      866    9     88    4    7     12    0  0    0
41   Humphreys    4,930    1,967   17    166   12   23     26    5  0    0
42     Jackson    3,236    1,129    2     62    1    7     17    1  0    0
43   Jefferson   14,776    3,494   34    497   22   76    115    8  0    1
44     Johnson    5,410      988   11    102    7    9     39    6  0    0
45        Knox  105,767   62,878  382  7,458  227  986  1,634  122  0    9
46        Lake    1,357      577    5     18    1    6      6    0  0    0,       Lauderdale      4,884    3,056     14      87     13     10    14.1  
0       Lawrence     12,420    2,821     21     271     13     36      77   
1          Lewis      3,585      890     14      59      8      9      42   
2        Lincoln     10,398    2,554     19     231     13     39      46   
3         Loudon     17,610    4,919     41     573     22     77      87

This is just a sample of the data I pulled. It is not quite what I envisioned, but as a beginner coder I wanted to clean it up in Sheets.

Here is an image of the PDF I am reading data from.

Here is the link to download that PDF.

Now I want to use gspread and gspread_dataframe to upload the data into a Google Sheets tab, and this is where I am having problems.

EDIT: Previously neither section included all of my code; the top and bottom sections now contain everything I have written so far.

from oauth2client.service_account import ServiceAccountCredentials
import json
import gspread
SHEET_ID = '18xad0TbNGMPh8gUSIsEr6wNsFzcpKGbyUIQ-A4GQ1bo'
SHEET_NAME = '2016'
gc = gspread.service_account('waynetennesseedems.json')
spreadsheet = gc.open_by_key(SHEET_ID)
worksheet = spreadsheet.worksheet(SHEET_NAME)
from gspread_dataframe import set_with_dataframe
set_with_dataframe(worksheet, df, include_column_header='False')


---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/var/folders/zc/x2w76_4121g3gzfxybkz2q480000gn/T/ipykernel_44678/2784595029.py in <module>
----> 1 set_with_dataframe(worksheet, df, include_column_header='False')

/opt/anaconda3/lib/python3.9/site-packages/gspread_dataframe.py in set_with_dataframe(worksheet, dataframe, row, col, include_index, include_column_header, resize, allow_formulas, string_escaping)
    260     # If header-related params are True, the values are adjusted
    261     # to allow space for the headers.
--> 262     y, x = dataframe.shape
    263     index_col_size = 0
    264     column_header_size = 0

AttributeError: 'list' object has no attribute 'shape'

Does it have to do with how my data was pulled from the PDF?

Asked By: Wayne Shaw

Answer 1

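The error occurs because df is a list, not a DataFrame, and set_with_dataframe expects a single DataFrame. First, make sure the tabula-py module is installed. Second, pass output_format='dataframe' to read_pdf. Depending on the tabula-py version, the result may still be a list with one DataFrame per table, so the code below combines the tables if needed. It also passes pandas_options={'header': None} so that the first county row stays in the data instead of becoming the column names:

import gspread
import pandas as pd
from gspread_dataframe import set_with_dataframe
from tabula.io import read_pdf

file_path = 'TnPresidentbyCountyNov2016.pdf'
SHEET_ID = '18xad0TbNGMPh8gUSIsEr6wNsFzcpKGbyUIQ-A4GQ1bo'
SHEET_NAME = '2016'

# Use real booleans. The string 'FALSE' is truthy, so the original call
# effectively ran with multiple_tables=True; that behaviour is kept here.
tables = read_pdf(file_path, output_format='dataframe', pages='all',
                  multiple_tables=True, stream=True,
                  pandas_options={'header': None})
# read_pdf may return a list of DataFrames; combine them into one if so.
df = pd.concat(tables, ignore_index=True) if isinstance(tables, list) else tables

gc = gspread.service_account('waynetennesseedems.json')
spreadsheet = gc.open_by_key(SHEET_ID)
worksheet = spreadsheet.worksheet(SHEET_NAME)
set_with_dataframe(worksheet, df, include_column_header=False)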
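The difference between a list of tables and a single DataFrame can be checked without the PDF. This is a minimal sketch, assuming small hand-made frames stand in for what tabula returns; the helper name combine_tables is hypothetical, and the sample values are copied from the printed output in the question.

```python
import pandas as pd


def combine_tables(tables):
    """Stack a list of DataFrames into one; pass a single DataFrame through."""
    if isinstance(tables, pd.DataFrame):
        return tables
    return pd.concat(tables, ignore_index=True)


# Stand-ins for tabula's result: one DataFrame per detected table.
page_1 = pd.DataFrame([["Anderson", "19,212", "9,013"],
                       ["Bedford", "11,486", "3,395"]])
page_2 = pd.DataFrame([["Lauderdale", "4,884", "3,056"],
                       ["Lawrence", "12,420", "2,821"]])

tables = [page_1, page_2]
print(hasattr(tables, "shape"))  # a plain list has no .shape attribute

df = combine_tables(tables)
print(type(df).__name__, df.shape)
```

The combined df has a shape attribute, which is what set_with_dataframe reads at the line shown in the traceback.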

I also suggest taking a look at the PEP 8 style guide to get a better idea of how to write a well-formatted script.
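A related detail in the question's calls: values such as multiple_tables='FALSE' and include_column_header='False' are strings, and any non-empty string is truthy in Python. The sketch below illustrates this with a hypothetical is_enabled function; that a library tests its options by truthiness in this way is an assumption, not something taken from the tabula-py or gspread-dataframe source.

```python
def is_enabled(flag):
    """Mimic an option check based on truthiness (hypothetical helper)."""
    return bool(flag)


string_flag = is_enabled('FALSE')   # non-empty string, so it counts as true
boolean_flag = is_enabled(False)    # a real boolean behaves as intended

print(string_flag, boolean_flag)
```

Passing the booleans True and False, without quotes, avoids this surprise.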

Answered By: CcmU
