Python dataframe filter on a specific column

Question:

I am reading 3 blobs from Azure storage, loading them into a dataframe, and then filtering the dataframe.

Below is the code.

    blob_service_client = BlobServiceClient.from_connection_string(connect_str)
    container_name = ""
    path = "/"
    dt = ''
    pth = os.path.join(path, dt)
    container_client = blob_service_client.get_container_client(container_name)
    blob_list = container_client.list_blobs(name_starts_with=pth)
    for blob in blob_list:
        blob_client = container_client.get_blob_client(blob)
        stream = blob_client.download_blob()
        fileReader = json.loads(stream.readall())
        df= pd.DataFrame.from_records(fileReader)
        id ='2fr5'
        df2 = df[dfItem['ID'] == id]
        if len(df2.index) == 0:
            print("0")
        else:
            print("l")

After filtering, I should get O if the dataframe is empty and l otherwise. But when the ID is not present in the dataframe, I get the output below.

    O
    O
    O

When the ID is present in the dataframe, I am getting the below output.

    O
    l
    O

It's giving me output for each of the 3 blobs separately instead of reading all 3 blobs into a single dataframe. Could someone assist?

Thank you.

Below is the dataframe after reading the file from the storage.

    df = pd.DataFrame.from_records(fileReader)

             Date       salary         tax    ID
    0  2022-09-16  5064.000000  504.000000  6fr5
    1  2022-09-16    33.157895    3.157895  7fr5

             Date       salary         tax    id
    0  2022-09-16  5046.000000  504.000000  2fr5
    1  2022-09-16    36.157895    3.157895  3fr5

             Date       salary         tax    id
    0  2022-09-16  5064.000000  504.000000  1fr5
    1  2022-09-16   367.157895    3.157895  5fr5

Asked By: SanjanaSanju

||

Answers:

I don’t see how your code as written in the question can work: dfItem['ID'] is not defined.
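My guess is that the mask was meant to be built from df itself. A minimal sketch of that filter on one blob's worth of data, assuming the column is really named ID (the sample rows and the name wanted_id are made up for illustration):

```python
import pandas as pd

# Hypothetical rows standing in for the content of a single blob
df = pd.DataFrame(
    {
        "Date": ["2022-09-16", "2022-09-16"],
        "salary": [5046.0, 36.157895],
        "tax": [504.0, 3.157895],
        "ID": ["2fr5", "3fr5"],
    }
)

wanted_id = "2fr5"

# Build the boolean mask from the same dataframe that is being filtered
mask = df["ID"] == wanted_id
df2 = df[mask]

print(len(df2.index))
```

Because this still runs once per blob, it would print one result per blob, which is what you are seeing.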

But if you want to have only one dataframe, you should do something like this:

    import io

    import pandas as pd

    container_client = blob_service_client.get_container_client(container_name)
    blob_list = container_client.list_blobs(name_starts_with=pth)
    list_of_dataframes = []
    for blob in blob_list:
        blob_client = container_client.get_blob_client(blob)
        # Wrap the downloaded bytes in a file-like object for pd.read_json
        content = io.BytesIO(blob_client.download_blob().readall())
        list_of_dataframes.append(pd.read_json(content))
    # Combine the per-blob dataframes into a single one
    df = pd.concat(list_of_dataframes)

Pandas can read JSON directly from strings or file-like objects, so I pass it the content downloaded by the blob client instead of going through json.loads first.
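To try the parsing step without touching Azure, here is a small sketch. The raw_bytes payload is a made-up stand-in for what blob_client.download_blob().readall() would return, and I am assuming each blob holds a JSON array of records:

```python
import io
import json

import pandas as pd

# Hypothetical payload standing in for blob_client.download_blob().readall()
raw_bytes = json.dumps(
    [
        {"Date": "2022-09-16", "salary": 5046.0, "tax": 504.0, "ID": "2fr5"},
        {"Date": "2022-09-16", "salary": 36.157895, "tax": 3.157895, "ID": "3fr5"},
    ]
).encode("utf-8")

# pd.read_json accepts a file-like object, so wrap the bytes in BytesIO
frame = pd.read_json(io.BytesIO(raw_bytes))

print(frame.shape)
```

Wrapping the bytes in io.BytesIO avoids relying on pd.read_json accepting a literal JSON string, which newer pandas versions warn about.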

It should also be possible to do it in one step with a list comprehension like this:

    import io

    import pandas as pd

    container_client = blob_service_client.get_container_client(container_name)
    blob_list = container_client.list_blobs(name_starts_with=pth)
    # One dataframe per blob, concatenated into a single dataframe
    df = pd.concat(
        [
            pd.read_json(io.BytesIO(container_client.get_blob_client(blob).download_blob().readall()))
            for blob in blob_list
        ]
    )

Afterwards, you can run whatever checks you want on the combined dataframe.
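As a rough sketch of that last step, with in-memory frames standing in for the three blobs (the data, the helper name has_id and the O / l labels are only for illustration). Your sample output shows the column as ID in one file and id in the others; that may just be a typo in the question, but if the files really differ, pd.concat would create two separate columns, so I rename first:

```python
import pandas as pd

# Hypothetical stand-ins for the three per-blob dataframes
frames = [
    pd.DataFrame({"ID": ["6fr5", "7fr5"], "salary": [5064.0, 33.157895]}),
    pd.DataFrame({"id": ["2fr5", "3fr5"], "salary": [5046.0, 36.157895]}),
    pd.DataFrame({"id": ["1fr5", "5fr5"], "salary": [5064.0, 367.157895]}),
]

# Assumption: "id" and "ID" are the same field, so map both to "ID"
frames = [frame.rename(columns={"id": "ID"}) for frame in frames]

# One dataframe for all blobs, with a fresh 0..n-1 index
df = pd.concat(frames, ignore_index=True)


def has_id(frame, wanted_id):
    """Return True if at least one row of frame has the given ID."""
    return bool((frame["ID"] == wanted_id).any())


# The check now runs once on the combined data, not once per blob
print("l" if has_id(df, "2fr5") else "O")
```

With this layout you get a single line of output for all three blobs.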

Answered By: ndclt