Error when using a URL from a GitHub dataset with pandas in Python
Question:
Error tokenizing data. C error: Expected 1 fields in line 28, saw 367
I keep getting this error when I use a URL from a GitHub dataset in Python. Is there a way to solve this issue?
url = "https://github.com/noghte/datasets/blob/main/apartments.csv"
df = pd.read_csv(url)
print(len(df, index_col=0))
---------------------------------------------------------------------------
ParserError Traceback (most recent call last)
~/8410_Projects/Lessons/week9.DataFrame.py in <module>
4 # https://raw.githubusercontent.com/noghte/datasets/mainapartment.csv
5 url = "https://github.com/noghte/datasets/blob/main/apartments.csv"
----> 6 df = pd.read_csv(url)
7 print(len(df, index_col=0))
/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
309 stacklevel=stacklevel,
310 )
--> 311 return func(*args, **kwargs)
312
313 return wrapper
/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/io/parsers/readers.py in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, error_bad_lines, warn_bad_lines, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options)
584 kwds.update(kwds_defaults)
585
--> 586 return _read(filepath_or_buffer, kwds)
587
588
/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/io/parsers/readers.py in _read(filepath_or_buffer, kwds)
486
487 with parser:
...
/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/_libs/parsers.pyx in pandas._libs.parsers.raise_parser_error()
Answers:
There is an alternative way of loading a CSV from a URL: fetch it with requests first. Note that the blob URL in your question returns the GitHub HTML page, so point at the raw-content URL instead:
import io
import requests
import pandas as pd

# raw.githubusercontent.com serves the plain CSV, not the GitHub HTML page
url = "https://raw.githubusercontent.com/noghte/datasets/main/apartments.csv"
s = requests.get(url).content
df = pd.read_csv(io.StringIO(s.decode("utf-8")))
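To sanity-check the decode-and-parse step without touching the network, the same io.StringIO pattern can be exercised on an in-memory CSV (the sample rows below are made up for illustration):

```python
import io
import pandas as pd

# A tiny in-memory CSV standing in for the downloaded bytes
sample = b"id,bedrooms,rent\n1,2,1200\n2,1,900\n"
df = pd.read_csv(io.StringIO(sample.decode("utf-8")))
print(len(df))  # 2 rows parsed
```

If this works but the URL version fails, the problem is the response body (HTML instead of CSV), not the parsing code.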
Pandas is attempting to parse the entire GitHub HTML page rather than just the raw CSV file you want. Add a raw query parameter to your URL like so:
url = "https://github.com/noghte/datasets/blob/main/apartments.csv?raw=true"
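Either way, note that the question's print(len(df, index_col=0)) would raise a TypeError next: index_col is a read_csv parameter, not a len() argument. A minimal end-to-end sketch, assuming the URL follows the usual github.com/<user>/<repo>/blob/<branch>/<path> pattern (the helper name to_raw_url is made up here):

```python
import pandas as pd

def to_raw_url(blob_url):
    # Rewrite a GitHub "blob" page URL to the raw-content URL that serves
    # the plain CSV instead of the surrounding HTML page.
    return blob_url.replace("github.com", "raw.githubusercontent.com").replace("/blob/", "/")

url = "https://github.com/noghte/datasets/blob/main/apartments.csv"
raw = to_raw_url(url)
# df = pd.read_csv(raw, index_col=0)  # index_col belongs here, not inside len()
# print(len(df))                      # network call left commented out
```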