How can I read a CSV from a zip file in Python?
Question:
I am trying to read a CSV that is inside a zip file. My task is to read the file rad_15min.csv, but when I pass the zip file's URL to pandas (I copied the link address from the download button), I get an error:
Code:
import pandas as pd
df = pd.read_csv('https://www.kaggle.com/datasets/lucafrance/bike-traffic-in-munich/download?datasetVersionNumber=7')
Error:
ParserError: Error tokenizing data. C error: Expected 1 fields in line 9, saw 2
Data: https://www.kaggle.com/datasets/lucafrance/bike-traffic-in-munich
Zip file Link: https://www.kaggle.com/datasets/lucafrance/bike-traffic-in-munich/download?datasetVersionNumber=7
I have to read this CSV dynamically. I don't want to download it manually; I just want to build a download link and read the CSV from it. Is there another approach I can try?
Answers:
For me, it’s forwarding to the HTML page instead of downloading.
Why not use the Kaggle API that is provided? (You first need to provide a token.)
This is what I tried:
import csv
import requests
url = 'https://www.kaggle.com/datasets/lucafrance/bike-traffic-in-munich/download?datasetVersionNumber=7'
# Open the URL and create a response object
response = requests.get(url)
# Create a CSV reader object
csv_reader = csv.reader(response.iter_lines(decode_unicode=True), delimiter=',')
# Iterate over each row in the CSV file
for row in csv_reader:
    # Process each row as needed
    print(row)
What I got back is this:
[]
[]
['<!DOCTYPE html>']
['<html lang="en">']
[]
['<head>']
[' <title>Bike Traffic in Munich | Kaggle</title>']
[' <meta charset="utf-8" />']
[' <meta name="robots" content="index', ' follow" />']
[' <meta name="description" content="Bike traffic measured over time at different stations in Munich." />']
[' <meta name="turbolinks-cache-control" content="no-cache" />']
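The output above shows that the server returned an HTML page rather than the archive. A minimal sketch of a check that could run before parsing is given here; the helper names `looks_like_zip` and `describe_payload` are hypothetical, and the HTML test is a rough heuristic rather than a complete content sniffer.

```python
import io
import zipfile


def looks_like_zip(payload: bytes) -> bool:
    """Return True if the bytes form a readable zip archive."""
    return zipfile.is_zipfile(io.BytesIO(payload))


def describe_payload(payload: bytes) -> str:
    """Classify a downloaded body as 'zip', 'html', or 'unknown'."""
    if looks_like_zip(payload):
        return "zip"
    # Skip leading whitespace, since the page above starts with blank lines
    head = payload.lstrip()[:200].lower()
    if head.startswith(b"<!doctype html") or head.startswith(b"<html"):
        return "html"
    return "unknown"
```

Calling `describe_payload(response.content)` on the response from the snippet above would be expected to report an HTML body, which would explain why the CSV parser fails.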
I tried using the Kaggle API, but I don't want to download the data, just read it dynamically.
I want to read only one file in the zip, named rad15_min.csv, with pandas.
You can try making a request with the __Host-KAGGLEID cookie.
I’m not sure if there is a programmatic way to get this one, but you can always hardcode it. Make sure you are logged in to Kaggle first, then press CTRL+SHIFT+I to open your browser's Developer Tools, go to Application/Cookies, and copy the value of that cookie.
import requests
import pandas as pd
from io import BytesIO
from zipfile import ZipFile

# Parentheses join the adjacent string literals into a single URL
url = (
    "https://www.kaggle.com/datasets/"
    "lucafrance/bike-traffic-in-munich/"
    "download?datasetVersionNumber=7"
)
# Cookie value copied from the browser (truncated here)
cookies = {"__Host-KAGGLEID": "CfDJ8IPkmlRqhQhDn1PidxljKKQWcrozwJuFfsIn..."}
response = requests.get(url, cookies=cookies)

with ZipFile(BytesIO(response.content)) as zf:
    df = pd.read_csv(zf.open("rad_15min.csv"))  # not rad15_min.csv
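NB: If the zip has only one CSV, or if the dataset is not an archive (i.e., a single CSV), you can pass BytesIO(response.content) directly to read_csv. For the zipped case, also pass compression="zip", because pandas infers compression only from a file path, not from an in-memory buffer.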
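The note above can be sketched without network access by building a small archive in memory. The member name `only.csv` and the values are made up for illustration. The sketch passes `compression="zip"` explicitly, because a `BytesIO` buffer has no file extension for pandas to infer from.

```python
import io
import zipfile

import pandas as pd

# Build a one-member zip in memory; name and data are illustrative only
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("only.csv", "datum,gesamt\n2017.01.01,0\n2017.01.02,3\n")
single_csv_zip = buffer.getvalue()

# Zip with exactly one CSV: hand the bytes to read_csv and name the compression
df_single = pd.read_csv(io.BytesIO(single_csv_zip), compression="zip")

# Plain, non-archived CSV bytes: no compression argument is needed
df_plain = pd.read_csv(io.BytesIO(b"datum,gesamt\n2017.01.01,0\n"))
```

With a real download, `single_csv_zip` would be replaced by `response.content`. An archive with several members still needs the `ZipFile(...).open(name)` route shown earlier.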
Output:
print(df)
datum uhrzeit_start ... richtung_2 gesamt
0 2017.01.01 00:00 ... 0 0
1 2017.01.01 00:00 ... 0 0
2 2017.01.01 00:00 ... 0 0
3 2017.01.01 00:00 ... 0 0
4 2017.01.01 00:00 ... 0 0
... ... ... ... ... ...
1255761 2022.12.31 23:45 ... 2 7
1255762 2022.12.31 23:45 ... 0 0
1255763 2022.12.31 23:45 ... 0 0
1255764 2022.12.31 23:45 ... 0 0
1255765 2022.12.31 23:45 ... 5 17
[1255766 rows x 7 columns]
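Because the member name is easy to mistype (rad_15min.csv versus rad15_min.csv), it may help to look the name up in the archive instead of hardcoding it. The following is a minimal sketch; `find_member` is a hypothetical helper, and matching by suffix is an assumption about how the file would be identified.

```python
import io
import zipfile


def find_member(archive_bytes: bytes, suffix: str) -> str:
    """Return the single archive member whose name ends with `suffix`."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        matches = [name for name in zf.namelist() if name.endswith(suffix)]
    if len(matches) != 1:
        # Refuse to guess when the suffix is missing or ambiguous
        raise ValueError(f"expected one member ending in {suffix!r}, found {matches}")
    return matches[0]
```

The returned name could then be passed to `zf.open(...)` in the answer's snippet, for example `find_member(response.content, "15min.csv")`.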