Read CSV file from Blob Storage to pandas dataframe and ignore pagination rows from source system
Question:
I have a task which is to read a CSV file from blob storage for data manipulation; this should be easy to do:
import pandas as pd
from io import StringIO
blob_client_instance = blobService.get_blob_client(
    "testflorencia", "TakeUpStores.csv", snapshot=None)
downloaded_blob = blob_client_instance.download_blob()
blob = downloaded_blob.content_as_text(encoding=None)
df = pd.read_csv(StringIO(blob))
df
However I get this error:
initial_value must be str or None, not bytes
I am not able to share the file here because it's confidential, but what I did notice is that every 20 rows there is a special pagination row with a special character:
= 37.364.304;;;; --> special character not rendered by StackOverflow
How can I read this csv into pandas and ignore those rows?
I also tried without the encoding parameter and I got a different error:
'utf-8' codec can't decode byte 0xc3 in position 16515: invalid continuation byte
Answers:
Filter out the special rows from the downloaded text, then feed it to Pandas.
# ...
# encoding=None returns raw bytes (hence the "must be str" error), and plain
# UTF-8 fails on this file, so try a single-byte encoding such as latin-1
# (a guess -- check the source system's actual encoding).
blob = downloaded_blob.content_as_text(encoding="latin-1")
lines = "\n".join(line for line in blob.splitlines() if not line.startswith(" = "))  # or whatever the criteria for a special row is
df = pd.read_csv(StringIO(lines))
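As a self-contained sketch of the filtering step (using inline sample data in place of the blob download; the " = " prefix and the ";" separator are assumptions, adjust them to match the real file):

```python
import pandas as pd
from io import StringIO

# Sample text standing in for the decoded blob, with one pagination row.
blob = "a;b\n1;2\n = 37.364.304;;;;\n3;4\n"

# Drop every line matching the pagination-row criteria before parsing.
lines = "\n".join(
    line for line in blob.splitlines() if not line.startswith(" = ")
)

df = pd.read_csv(StringIO(lines), sep=";")
# df now holds only the two real data rows.
```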
If all your special pagination rows start with the same single character, you can make use of the comment parameter:
comment : str, optional
Indicates remainder of line should not be parsed. If found at the beginning of a line, the line will be ignored altogether. This parameter must be a single character. Like empty lines (as long as skip_blank_lines=True), fully commented lines are ignored by the parameter header but not by skiprows. For example, if comment='#', parsing #empty\na,b,c\n1,2,3 with header=0 will result in 'a,b,c' being treated as the header.
df = pd.read_csv(StringIO(blob), comment='=')
or depending on the first character of the pagination row:
df = pd.read_csv(StringIO(blob), comment=' ')
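A small runnable sketch of the comment-parameter approach with inline sample data (here the pagination row is assumed to start with '=' exactly; note this also means pandas will truncate any data line at a stray '=', so it only works if no real field contains that character):

```python
import pandas as pd
from io import StringIO

# Sample text standing in for the decoded blob; the pagination row
# starts with '=' in this assumed layout.
blob = "a;b\n1;2\n= 37.364.304;;;;\n3;4\n"

# Lines beginning with the comment character are skipped entirely.
df = pd.read_csv(StringIO(blob), sep=";", comment="=")
```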