Google spreadsheet to Pandas dataframe via Pydrive without download
Question:
How do I read the content of a Google spreadsheet into a Pandas dataframe without downloading the file?
I think gspread or df2gspread may be good shots, but I’ve been working with pydrive so far and got close to the solution.
With Pydrive I managed to get the export link of my spreadsheet, either as a .csv or an .xlsx file. After the authentication process, this looks like
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive

gauth = GoogleAuth()
gauth.LocalWebserverAuth()
drive = GoogleDrive(gauth)

# choose whether to export csv or xlsx
data_type = 'csv'

# get list of files in folder as dictionaries
file_list = drive.ListFile(
    {'q': "'my-folder-ID' in parents and trashed=false"}
).GetList()

export_key = 'exportLinks'
excel_key = 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'
csv_key = 'text/csv'

if data_type == 'excel':
    urls = [file[export_key][excel_key] for file in file_list]
elif data_type == 'csv':
    urls = [file[export_key][csv_key] for file in file_list]
The type of URL I get for xlsx is
https://docs.google.com/spreadsheets/export?id=my-id&exportFormat=xlsx
and similarly for csv
https://docs.google.com/spreadsheets/export?id=my-id&exportFormat=csv
Now, if I click on these links (or visit them with webbrowser.open(url)), I download the file, which I can then read into a Pandas dataframe as usual with pandas.read_excel() or pandas.read_csv(), as described here.
How can I skip the download, and directly read the file into a dataframe from these links?
I tried several solutions:
- The obvious pd.read_csv(url) gives
pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 6, saw 2
Interestingly, these numbers (1, 6, 2) do not depend on the number of rows and columns in my spreadsheet, which suggests that the script is not reading what it is supposed to read.
- The analogue pd.read_excel(url) gives
ValueError: Excel file format cannot be determined, you must specify an engine manually.
and specifying e.g. engine='openpyxl' gives
zipfile.BadZipFile: File is not a zip file
- The BytesIO solution looked promising, but
r = requests.get(url)
data = r.content
df = pd.read_csv(BytesIO(data))
still gives
pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 6, saw 2
If I print(data), I get hundreds of lines of HTML code:
b'\n<!DOCTYPE html>\n<html lang="de">\n <head>\n <meta charset="utf-8">\n <meta content="width=300, initial-scale=1" name="viewport">\n
...
...
</script>\n </body>\n</html>\n'
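The three errors above are consistent with pandas being handed an HTML page rather than spreadsheet data. A quick way to check this before parsing is to inspect the downloaded bytes. The sketch below is only an illustration: classify_payload is a hypothetical helper name, and the labels it returns are arbitrary. It relies on an .xlsx file being a zip archive, which is why pd.read_excel fails with BadZipFile on anything else.

```python
import zipfile
from io import BytesIO


def classify_payload(payload: bytes) -> str:
    """Roughly label downloaded bytes as 'html', 'zip' (xlsx-like) or 'other'."""
    # An HTML page starts with a doctype or an <html> tag after optional whitespace.
    head = payload.lstrip()[:64].lower()
    if head.startswith((b"<!doctype html", b"<html")):
        return "html"
    # .xlsx files are zip archives, so a zip signature is a necessary condition.
    if zipfile.is_zipfile(BytesIO(payload)):
        return "zip"
    # Plain CSV text (or anything else) ends up here.
    return "other"
```

If this returns 'html' for r.content, the request reached a web page (for example a sign-in page) instead of the exported file, and the parser errors are a symptom rather than the cause.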
Answers:
In your situation, how about the following modification? It retrieves the access token from gauth and sends it with the export request, so the Spreadsheet is exported as XLSX data and the XLSX data is put into the dataframe.
Modified script:
from io import BytesIO
import pandas as pd
import requests
from pydrive.auth import GoogleAuth

gauth = GoogleAuth()
gauth.LocalWebserverAuth()
# replace {spreadsheetId} with your Spreadsheet ID
url = "https://docs.google.com/spreadsheets/export?id={spreadsheetId}&exportFormat=xlsx"
res = requests.get(url, headers={"Authorization": "Bearer " + gauth.attr['credentials'].access_token})
values = pd.read_excel(BytesIO(res.content))
print(values)
- This script uses requests, so import requests is required (included at the top of the script).
- In this case, the 1st tab of the XLSX data is used.
- When you want to use another tab, please modify values = pd.read_excel(BytesIO(res.content)) as follows.
sheet = "Sheet2"
values = pd.read_excel(BytesIO(res.content), sheet_name=sheet)
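The same idea can be wrapped in a small helper that builds the export URL, sends the bearer token, and refuses to return an HTML page. This is a minimal sketch under stated assumptions, not part of the answer above: build_export_url and fetch_export are hypothetical names, http is assumed to be the requests module or any object with a compatible get(url, headers=...) method, and the Content-Type check assumes that an unauthorized request comes back as text/html.

```python
from urllib.parse import urlencode

EXPORT_ENDPOINT = "https://docs.google.com/spreadsheets/export"


def build_export_url(spreadsheet_id: str, export_format: str = "xlsx") -> str:
    """Build the export URL in the form shown in the question."""
    query = urlencode({"id": spreadsheet_id, "exportFormat": export_format})
    return f"{EXPORT_ENDPOINT}?{query}"


def fetch_export(http, spreadsheet_id: str, access_token: str,
                 export_format: str = "xlsx") -> bytes:
    """Download the exported spreadsheet bytes using a bearer token.

    `http` is assumed to be the `requests` module or a requests-like session.
    """
    response = http.get(
        build_export_url(spreadsheet_id, export_format),
        headers={"Authorization": f"Bearer {access_token}"},
    )
    response.raise_for_status()
    # Assumption: a missing or rejected token yields an HTML page, not data.
    if response.headers.get("Content-Type", "").startswith("text/html"):
        raise ValueError(
            "Got an HTML page instead of spreadsheet data; "
            "check the access token and the file permissions"
        )
    return response.content


# Usage (needs network access, requests, pandas and an authenticated gauth):
#   data = fetch_export(requests, "my-id",
#                       gauth.attr['credentials'].access_token, "csv")
#   df = pd.read_csv(BytesIO(data))
```

Passing the HTTP client in as an argument keeps the helper independent of how the token was obtained, so it should work equally with a plain requests call or an already authorized session.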
I want to contribute an additional option to @Tanaike’s excellent answer. It is indeed quite difficult to get an Excel file (an .xlsx stored on Drive, not a Google Sheet) into a Python environment without publishing the content to the web. Whereas the previous answer uses pydrive and GoogleAuth(), I usually use a different method of authentication in Colab/Jupyter notebooks, adapted from the googleapis documentation. In my environment, wrapping the response in BytesIO(response.content) is unnecessary.
import pandas as pd
from google.auth import default
from google.auth.transport.requests import AuthorizedSession
from google.colab import auth

# authenticate the Colab user, then load the default credentials
auth.authenticate_user()
creds, _ = default()

spreadsheet_id = 'aaaaaaaaaaaaaaaaaaaaaaaaaaa'
sheet = 'Sheet12345'
url = f'https://docs.google.com/spreadsheets/export?id={spreadsheet_id}&exportFormat=xlsx'

# AuthorizedSession attaches the credentials to each request
authed_session = AuthorizedSession(creds)
response = authed_session.get(url)
# if your pandas version warns about raw bytes, wrap them in io.BytesIO
values = pd.read_excel(response.content, sheet_name=sheet)