Python dataframe filter on a specific column
Question:
I am reading 3 blobs from Azure storage, loading them into a dataframe, and then filtering the dataframe.
Below is the code.
blob_service_client = BlobServiceClient.from_connection_string(connect_str)
container_name = ""
path = "/"
dt = ''
pth = os.path.join(path, dt)
container_client = blob_service_client.get_container_client(container_name)
blob_list = container_client.list_blobs(name_starts_with=pth)
for blob in blob_list:
    blob_client = container_client.get_blob_client(blob)
    stream = blob_client.download_blob()
    fileReader = json.loads(stream.readall())
    df= pd.DataFrame.from_records(fileReader)
    id ='2fr5'
    df2 = df[dfItem['ID'] == id]
    if len(df2.index) == 0:
        print("0")
    else:
        print("l")
After filtering, if the dataframe is empty I should get O, otherwise l. But when the ID is not present in the dataframe, I get the output below.
O
O
O
When the ID is present in the dataframe, I am getting the below output.
O
l
O
It's giving me output for each of the 3 blobs separately instead of reading all 3 blobs into a single dataframe. Could someone assist?
Thank you.
Below is the dataframe after reading the file from the storage.
df= pd.DataFrame.from_records(fileReader)
         Date       salary         tax    ID
0  2022-09-16  5064.000000  504.000000  6fr5
1  2022-09-16    33.157895    3.157895  7fr5
         Date       salary         tax    id
0  2022-09-16  5046.000000  504.000000  2fr5
1  2022-09-16    36.157895    3.157895  3fr5
         Date       salary         tax    id
0  2022-09-16  5064.000000  504.000000  1fr5
1  2022-09-16   367.157895    3.157895  5fr5
Answers:
I don’t see how your code as written in the question can work: dfItem is not defined, so df[dfItem['ID'] == id] would raise a NameError.
But if you want to have only one dataframe, you should do something like this:
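import io
import pandas as pd

container_client = blob_service_client.get_container_client(container_name)
blob_list = container_client.list_blobs(name_starts_with=pth)
list_of_dataframes = []
for blob in blob_list:
    blob_client = container_client.get_blob_client(blob)
    # Wrap the downloaded bytes in a file-like object for read_json
    data = io.BytesIO(blob_client.download_blob().readall())
    list_of_dataframes.append(pd.read_json(data))
# Combine the per-blob dataframes into one, with a fresh 0..n-1 index
df = pd.concat(list_of_dataframes, ignore_index=True)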
pd.read_json reads JSON directly from a file-like object (older pandas versions also accept the raw JSON string), so the downloaded blob content does not need to go through json.loads first.
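To try the read_json part without an Azure connection, here is a minimal sketch. The raw_bytes value is a made-up payload standing in for what blob_client.download_blob().readall() would return; it assumes each blob holds a JSON array of records like the ones shown in the question.

```python
import io

import pandas as pd

# Hypothetical payload: stands in for blob_client.download_blob().readall().
# Assumes the blob contains a JSON array of records.
raw_bytes = (
    b'[{"Date": "2022-09-16", "salary": 5046.0, "tax": 504.0, "ID": "2fr5"},'
    b' {"Date": "2022-09-16", "salary": 36.157895, "tax": 3.157895, "ID": "3fr5"}]'
)

# BytesIO gives read_json a file-like object to read from
df = pd.read_json(io.BytesIO(raw_bytes))
print(df)
```

Wrapping the bytes in io.BytesIO avoids relying on read_json accepting a literal JSON string, which newer pandas versions warn about.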
It should also be possible to do this in one step with a list comprehension, like this:
import io
import pandas as pd

container_client = blob_service_client.get_container_client(container_name)
blob_list = container_client.list_blobs(name_starts_with=pth)
df = pd.concat(
    [
        pd.read_json(io.BytesIO(container_client.get_blob_client(blob).download_blob().readall()))
        for blob in blob_list
    ],
    ignore_index=True,
)
After that, run whatever checks you want on the combined dataframe, outside the loop.
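As a rough sketch of that final check: the three small frames below are hypothetical stand-ins for the per-blob dataframes, and the code assumes the ID column has the same name ('ID') in every blob. The id_present helper is only an illustration, not part of pandas or the Azure SDK.

```python
import pandas as pd

# Hypothetical stand-ins for the three per-blob dataframes.
# Assumes every blob uses the same column name, 'ID'.
frames = [
    pd.DataFrame({"ID": ["6fr5", "7fr5"], "salary": [5064.0, 33.157895]}),
    pd.DataFrame({"ID": ["2fr5", "3fr5"], "salary": [5046.0, 36.157895]}),
    pd.DataFrame({"ID": ["1fr5", "5fr5"], "salary": [5064.0, 367.157895]}),
]
df = pd.concat(frames, ignore_index=True)


def id_present(frame, wanted_id, column="ID"):
    """Return True if wanted_id occurs anywhere in the given column."""
    return bool((frame[column] == wanted_id).any())


# One check on the combined dataframe, so only one line is printed
print("l" if id_present(df, "2fr5") else "O")
```

Because the check runs once on the concatenated dataframe, it prints a single result instead of one per blob. If the blobs really do mix 'ID' and 'id' as in the printed frames, the columns would need to be renamed to a common name before concatenating.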