Read CSV file from Blob Storage to pandas dataframe and ignore pagination rows from source system

Question:

My task is to read a CSV file from blob storage for data manipulation, which should be easy to do:

import pandas as pd
from io import StringIO
blob_client_instance = blobService.get_blob_client(
    "testflorencia", "TakeUpStores.csv", snapshot=None)

downloaded_blob = blob_client_instance.download_blob()
blob = downloaded_blob.content_as_text(encoding=None)
df = pd.read_csv(StringIO(blob))
df

However, I get this error:

initial_value must be str or None, not bytes

I am not able to share the file here because it's confidential, but I did notice that every 20 rows there is a special pagination row with a special character:

 = 37.364.304;;;; --> special character not rendered by StackOverflow

How can I read this csv into pandas and ignore those rows?

I also tried without the encoding parameter and got a different error:

'utf-8' codec can't decode byte 0xc3 in position 16515: invalid continuation byte
Asked By: Luis Valencia


Answers:

Filter out the special rows from the downloaded text, then feed it to Pandas.

# ...
blob = downloaded_blob.content_as_text(encoding=None)
# Drop the special rows (adjust the test to whatever the criterion for a special row is)
lines = "\n".join(line for line in blob.splitlines() if not line.startswith(" = "))
# Parse the filtered text, not the original blob
df = pd.read_csv(StringIO(lines))
Answered By: AKX
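
The filtering approach above can be sketched end to end without Azure. This is only an illustration under stated assumptions: the bytes in `raw` are a made-up stand-in for the downloaded blob, the `latin-1` encoding is a guess based on the `0xc3` decode error in the question, the marker character is hypothetical (the real one is not shown), and `sep=";"` is inferred from the `;;;;` in the pagination row. The helper name `read_without_pagination` is not part of any library.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the downloaded blob bytes (the real file is confidential).
# The lone byte 0xc3 followed by a space is not valid UTF-8, which mirrors the
# "invalid continuation byte" error from the question.
raw = (
    b"store;city;sales\n"
    b"1;Bogota;100\n"
    b"\xc3 = 37.364.304;;;;\n"
    b"2;Cali;200\n"
)


def read_without_pagination(data: bytes, marker: str, encoding: str = "latin-1") -> pd.DataFrame:
    """Decode the bytes, drop lines starting with `marker`, and parse the rest."""
    # Assumption: the file is in a single-byte encoding such as latin-1.
    text = data.decode(encoding)
    kept = "\n".join(
        line for line in text.splitlines() if not line.startswith(marker)
    )
    # Assumption: fields are separated by semicolons.
    return pd.read_csv(StringIO(kept), sep=";")


# "\u00c3" is what byte 0xc3 becomes when decoded as latin-1 (assumed marker).
df = read_without_pagination(raw, marker="\u00c3 = ")
print(df)
```

Decoding the bytes yourself also avoids the `initial_value must be str or None, not bytes` error, because `StringIO` then receives a `str`.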

If all your special pagination rows start with the same single character, then you can use the comment parameter of pd.read_csv:

comment : str, optional

Indicates remainder of line should not be parsed. If found at the beginning of a line, the line will be ignored altogether. This parameter must be a single character. Like empty lines (as long as skip_blank_lines=True), fully commented lines are ignored by the parameter header but not by skiprows. For example, if comment='#', parsing #empty\na,b,c\n1,2,3 with header=0 will result in ‘a,b,c’ being treated as the header.

df = pd.read_csv(StringIO(blob), comment='=')

Or, depending on the first character of the pagination row:

df = pd.read_csv(StringIO(blob), comment=' ')
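
A small self-contained sketch of the comment approach, assuming (hypothetically) that the pagination rows begin with `=` and that the file is semicolon-separated; the sample text is made up for illustration.

```python
from io import StringIO

import pandas as pd

# Made-up sample: the third line plays the role of a pagination row.
text = (
    "store;city;sales\n"
    "1;Bogota;100\n"
    "= 37.364.304;;;;\n"
    "2;Cali;200\n"
)

# Lines that begin with the comment character are skipped entirely.
df = pd.read_csv(StringIO(text), sep=";", comment="=")
print(df)
```

Note that, per the documentation quoted above, the comment character also cuts off the remainder of any line in which it appears later on, so this only works cleanly if that character never occurs inside real data and really is the first character of every pagination row.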