Google spreadsheet to Pandas dataframe via Pydrive without download

Question:

How do I read the content of a Google spreadsheet into a Pandas dataframe without downloading the file?


I think gspread or df2gspread may be good options, but I’ve been working with pydrive so far and got close to a solution.

With Pydrive I managed to get the export link of my spreadsheet, either as a .csv or an .xlsx file. After the authentication process, this looks like


    from pydrive.auth import GoogleAuth
    from pydrive.drive import GoogleDrive

    gauth = GoogleAuth()
    gauth.LocalWebserverAuth()
    drive = GoogleDrive(gauth)
    
    # choose whether to export csv or xlsx ('csv' or 'excel')
    data_type = 'csv'
    
    # get list of files in folder as dictionaries
    file_list = drive.ListFile(
        {'q': "'my-folder-ID' in parents and trashed=false"}
    ).GetList()
    
    export_key = 'exportLinks'
    
    excel_key = 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'
    csv_key = 'text/csv'
    
    if data_type == 'excel':
        urls = [ file[export_key][excel_key] for file in file_list ]
    
    elif data_type == 'csv':
        urls = [ file[export_key][csv_key] for file in file_list ]

The type of url I get for xlsx is

https://docs.google.com/spreadsheets/export?id=my-id&exportFormat=xlsx

and similarly for csv

https://docs.google.com/spreadsheets/export?id=my-id&exportFormat=csv

Now, if I click on these links (or visit them with webbrowser.open(url)), the file is downloaded, and I can then read it into a Pandas dataframe as usual with pandas.read_excel() or pandas.read_csv(), as described here.

How can I skip the download, and directly read the file into a dataframe from these links?

I tried several solutions:

  • pd.read_csv(url) gives

    pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 6, saw 2

Interestingly, these numbers (1, 6, 2) do not depend on the number of rows and columns in my spreadsheet, which suggests that the script is not reading what I intend it to read.

  • The analogous pd.read_excel(url) gives

    ValueError: Excel file format cannot be determined, you must specify an engine manually.

and specifying e.g. engine='openpyxl' gives

    zipfile.BadZipFile: File is not a zip file

  • The BytesIO solution looked promising, but

    r = requests.get(url)
    data = r.content
    df = pd.read_csv(BytesIO(data))

still gives


    pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 6, saw 2

If I print(data), I get hundreds of lines of HTML code:


    b'\n<!DOCTYPE html>\n<html lang="de">\n  <head>\n  <meta charset="utf-8">\n  <meta content="width=300, initial-scale=1" name="viewport">\n 
    ...
    ...
     </script>\n  </body>\n</html>\n'

Asked By: DavideL

||

Answers:

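The export URL requires authorization, so a request without credentials most likely returns an HTML sign-in page instead of the file, which matches the HTML you see when you print the data. In your situation, how about the following modification? It retrieves the access token from gauth and sends it with the request, so the Spreadsheet is exported as XLSX data, and the XLSX data is then read into the dataframe.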

Modified script:

    from io import BytesIO

    import pandas as pd
    import requests
    from pydrive.auth import GoogleAuth

    gauth = GoogleAuth()
    gauth.LocalWebserverAuth()

    # replace {spreadsheetId} with the ID of your Spreadsheet
    url = "https://docs.google.com/spreadsheets/export?id={spreadsheetId}&exportFormat=xlsx"
    res = requests.get(url, headers={"Authorization": "Bearer " + gauth.attr['credentials'].access_token})
    values = pd.read_excel(BytesIO(res.content))
    print(values)

  • This script needs import requests, import pandas as pd, and from io import BytesIO, as included at the top.

  • In this case, the 1st tab of the XLSX data is used.

  • To use another tab, modify values = pd.read_excel(BytesIO(res.content)) as follows.

      sheet = "Sheet2"
      values = pd.read_excel(BytesIO(res.content), sheet_name=sheet)
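
Since the question also asked about the CSV export, the same authorized request can be pointed at exportFormat=csv. The following is only a minimal sketch and not part of the original script: the helper names looks_like_html and read_csv_export are hypothetical, and the HTML check is an assumption based on the page shown in the question. It refuses to parse a response that is an HTML page rather than spreadsheet data, which gives a clearer error than the ParserError.

```python
from io import BytesIO


def looks_like_html(content: bytes) -> bool:
    """Return True if the payload starts like an HTML document."""
    head = content.lstrip()[:64].lower()
    return head.startswith(b"<!doctype html") or head.startswith(b"<html")


def read_csv_export(content: bytes):
    """Parse exported CSV bytes into a DataFrame, rejecting HTML pages."""
    if looks_like_html(content):
        raise ValueError(
            "Received an HTML page instead of CSV data; "
            "the request was probably not authorized."
        )
    # imported here so that the check above can be used without pandas
    import pandas as pd
    return pd.read_csv(BytesIO(content))
```

With the script above, this would be called as read_csv_export(res.content) after changing exportFormat=xlsx to exportFormat=csv in the URL.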
    
Answered By: Tanaike

I want to contribute an additional option to @Tanaike’s excellent answer. It is indeed quite difficult to get an Excel file (an .xlsx stored on Drive, not a Google Sheet) into a Python environment without publishing the content to the web. Whereas the previous answer uses pydrive and GoogleAuth(), I usually use a different method of authentication in Colab/Jupyter notebooks, adapted from the googleapis documentation. In my environment, wrapping the content in BytesIO(response.content) is unnecessary.

    import pandas as pd

    from google.colab import auth
    auth.authenticate_user()

    from google.auth.transport.requests import AuthorizedSession
    from google.auth import default
    creds, _ = default()

    spreadsheet_id = 'aaaaaaaaaaaaaaaaaaaaaaaaaaa'  # placeholder ID
    sheet = 'Sheet12345'  # name of the tab to read

    url = f'https://docs.google.com/spreadsheets/export?id={spreadsheet_id}&exportFormat=xlsx'

    # the session attaches the credentials to every request
    authed_session = AuthorizedSession(creds)
    response = authed_session.get(url)

    # newer pandas versions may warn about raw bytes here; wrap in io.BytesIO(...) if so
    values = pd.read_excel(response.content, sheet_name=sheet)
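
Both answers follow the same pattern: an authorized GET on the export URL, then parsing the returned bytes. As a rough sketch of that pattern, the hypothetical helper fetch_export below accepts any requests-style session (for example the AuthorizedSession created above) and returns the raw bytes. The function name and its parameters are illustrative assumptions, not part of either library.

```python
def fetch_export(session, file_id: str, fmt: str = "xlsx") -> bytes:
    """Download a spreadsheet export through an already-authorized session.

    `session` is assumed to behave like requests.Session: `get(url)` returns
    a response with `raise_for_status()` and `content`.
    """
    url = (
        "https://docs.google.com/spreadsheets/export"
        f"?id={file_id}&exportFormat={fmt}"
    )
    response = session.get(url)
    response.raise_for_status()  # fail loudly on HTTP error statuses
    return response.content
```

For example, pd.read_excel(BytesIO(fetch_export(authed_session, file_id)), sheet_name=None) would return a dict that maps each tab name to its own dataframe, since sheet_name=None asks pandas for all sheets.
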
Answered By: grimelda