Quickest way of loading portions of JSON responses into a pd.DataFrame object

Question:

I have a Series containing a large number of JSON responses, and each JSON string is fairly big.

Sample data:

{0: '{"city":"Campinas","bot-origin":null,"campaign-source":null,"lastState":"productAvailabilityCpfRequest","main-installation-date":"22/09/2021","userid":"[email protected]","full-name":"Claudenice lôbo da silva","alternative-installation-date":"23/09/2021","chosen-product":"Internet","bank":null,"postalcode":"13056015","due-date":"20","cpf":"30979696836","origin-link":null,"payment":"boleto","state":"SP","api-orders-hash-id":null,"email":"[email protected]","plan-name":null,"userphone":"19 98715-0491","plan-offer":null,"completed-address":"13056015 - AV FERNANDO PAOLIERI, 182 - JARDIM PLANALTO DE VIRACOPOS, Campinas - SP","type-of-person":"CPF","type-of-product":"Residencial","main-installation-period-day":"manhã","plan-value":null,"alternative-installation-period-day":"manhã"}',
 1: '{"city":"Campinas","bot-origin":null,"campaign-source":null,"lastState":"productAvailabilityStart","main-installation-date":"22/09/2021","userid":"[email protected]","full-name":"Claudenice lôbo da silva","alternative-installation-date":"23/09/2021","chosen-product":"Internet","bank":null,"postalcode":"13056015","due-date":"20","cpf":"30979696836","origin-link":null,"payment":"boleto","state":"SP","api-orders-hash-id":null,"email":"[email protected]","plan-name":null,"userphone":"19 98715-0491","plan-offer":null,"completed-address":"13056015 - AV FERNANDO PAOLIERI, 182 - JARDIM PLANALTO DE VIRACOPOS, Campinas - SP","type-of-person":"CPF","type-of-product":"Residencial","main-installation-period-day":"manhã","plan-value":null,"alternative-installation-period-day":"manhã"}',
 2: '{"city":"Campinas","bot-origin":null,"campaign-source":null,"lastState":"cpfValidationTrue","main-installation-date":"22/09/2021","userid":"[email protected]","full-name":"Claudenice lôbo da silva","alternative-installation-date":"23/09/2021","chosen-product":"Internet","bank":null,"postalcode":"13056015","due-date":"20","cpf":"30979696836","origin-link":null,"payment":"boleto","state":"SP","api-orders-hash-id":null,"email":"[email protected]","plan-name":null,"userphone":"19 98715-0491","plan-offer":null,"completed-address":"13056015 - AV FERNANDO PAOLIERI, 182 - JARDIM PLANALTO DE VIRACOPOS, Campinas - SP","type-of-person":"CPF","type-of-product":"Residencial","main-installation-period-day":"manhã","plan-value":null,"alternative-installation-period-day":"manhã"}',
 3: '{"city":"Campinas","bot-origin":null,"campaign-source":null,"lastState":"productAvailabilityCpfRequest","main-installation-date":"22/09/2021","userid":"[email protected]","full-name":"Claudenice lôbo da silva","alternative-installation-date":"23/09/2021","chosen-product":"Internet","bank":null,"postalcode":"13056015","due-date":"20","cpf":"30979696836","origin-link":null,"payment":"boleto","state":"SP","api-orders-hash-id":null,"email":"[email protected]","plan-name":null,"userphone":"19 98715-0491","plan-offer":null,"completed-address":"13056015 - AV FERNANDO PAOLIERI, 182 - JARDIM PLANALTO DE VIRACOPOS, Campinas - SP","type-of-person":"CPF","type-of-product":"Residencial","main-installation-period-day":"manhã","plan-value":null,"alternative-installation-period-day":"manhã"}',
 4: '{"city":"Campinas","bot-origin":null,"campaign-source":null,"lastState":"productAvailabilityStart","main-installation-date":"22/09/2021","userid":"[email protected]","full-name":"Claudenice lôbo da silva","alternative-installation-date":"23/09/2021","chosen-product":"Internet","bank":null,"postalcode":"13056015","due-date":"20","cpf":"30979696836","origin-link":null,"payment":"boleto","state":"SP","api-orders-hash-id":null,"email":"[email protected]","plan-name":null,"userphone":"19 98715-0491","plan-offer":null,"completed-address":"13056015 - AV FERNANDO PAOLIERI, 182 - JARDIM PLANALTO DE VIRACOPOS, Campinas - SP","type-of-person":"CPF","type-of-product":"Residencial","main-installation-period-day":"manhã","plan-value":null,"alternative-installation-period-day":"manhã"}'}

I had a few issues trying to load only portions of each JSON object efficiently. The pandas functions I tried used too much memory, and processing was too slow.

So I wrote the following code:

import orjson
import pandas as pd


def dataset_extras(extras, *args):
    # extras: the Series of JSON strings being passed in
    # args: the keys you want to extract from each JSON object
    l = []
    for i in extras:
        l.append({arg: orjson.loads(i).get(arg) for arg in args})
    return pd.DataFrame.from_records(l)


# Sample call
dataset_extras(df.Extras, 'city', 'campaign-source', 'api-orders-hash-id')

This time I managed to circumvent a lot of the performance issues, but I was wondering whether there is an even more efficient way of transforming portions of a series of JSON responses into a pd.DataFrame. I would appreciate feedback on how I could improve this code.

Asked By: INGl0R1AM0R1

||

Answers:

As commented, you can probably optimize things quite a bit by not parsing JSON over and over again for each arg:

def dataset_extras(
    json_strings,
    keys,
):
    records = []
    for json_string in json_strings:
        datum = orjson.loads(json_string)
        records.append({key: datum.get(key) for key in keys})
    return pd.DataFrame.from_records(records)


x = dataset_extras(df.Extras, ["city", "campaign-source", "api-orders-hash-id"])
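
To get a rough feel for how much the repeated parsing costs, a small timing sketch like the one below may help. It is only an illustration under assumptions: the function names `extract_parse_per_key` and `extract_parse_once` and the synthetic `json_strings` list are hypothetical stand-ins (not the asker's data), and it falls back to the standard-library `json` module when `orjson` is not installed. Actual numbers will depend on the size of the JSON strings and the number of keys.

```python
import json
import timeit

try:
    import orjson  # third-party parser used in the question
    loads = orjson.loads
except ImportError:
    # Fallback (an assumption for portability): the stdlib parser.
    loads = json.loads


def extract_parse_per_key(json_strings, keys):
    # Pattern from the question: each string is parsed once per requested key.
    return [{key: loads(s).get(key) for key in keys} for s in json_strings]


def extract_parse_once(json_strings, keys):
    # Pattern from this answer: parse each string once, then pick the keys.
    records = []
    for s in json_strings:
        datum = loads(s)
        records.append({key: datum.get(key) for key in keys})
    return records


# Synthetic stand-in for df.Extras (hypothetical data).
sample = json.dumps({
    "city": "Campinas",
    "campaign-source": None,
    "api-orders-hash-id": None,
    "state": "SP",
    "filler": "x" * 500,
})
json_strings = [sample] * 2_000
keys = ["city", "campaign-source", "api-orders-hash-id"]

# Both variants should produce identical records.
assert extract_parse_per_key(json_strings, keys) == extract_parse_once(json_strings, keys)

for fn in (extract_parse_per_key, extract_parse_once):
    seconds = timeit.timeit(lambda: fn(json_strings, keys), number=5)
    print(f"{fn.__name__}: {seconds:.3f}s for 5 runs")
```

With three keys, the per-key variant does roughly three times the parsing work, so the gap should widen as more keys are requested.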

Another approach might be to build the df from a dict-of-lists. You’ll have to measure if this is faster than from_records.

def dataset_extras(
    json_strings,
    keys,
):
    columns = {col: [] for col in keys}
    for json_string in json_strings:
        datum = orjson.loads(json_string)
        for key in keys:
            columns[key].append(datum.get(key))
    return pd.DataFrame(columns)
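
One way to do that measurement is sketched below with `timeit`. The names `build_from_records` and `build_from_columns` are hypothetical labels for the two variants above, the input is synthetic stand-in data rather than the asker's Series, and the `json` fallback is an assumption so the sketch runs without `orjson`. Treat the output as indicative only; results can vary with the pandas version, row count, and number of keys.

```python
import json
import timeit

import pandas as pd

try:
    import orjson  # third-party parser used in the question
    loads = orjson.loads
except ImportError:
    # Fallback (an assumption for portability): the stdlib parser.
    loads = json.loads


def build_from_records(json_strings, keys):
    # Variant 1: list of per-row dicts, then DataFrame.from_records.
    records = []
    for s in json_strings:
        datum = loads(s)
        records.append({key: datum.get(key) for key in keys})
    return pd.DataFrame.from_records(records)


def build_from_columns(json_strings, keys):
    # Variant 2: one list per column, then the plain DataFrame constructor.
    columns = {key: [] for key in keys}
    for s in json_strings:
        datum = loads(s)
        for key in keys:
            columns[key].append(datum.get(key))
    return pd.DataFrame(columns)


# Synthetic stand-in for df.Extras (hypothetical data).
sample = json.dumps({
    "city": "Campinas",
    "state": "SP",
    "due-date": "20",
    "filler": "x" * 500,
})
json_strings = [sample] * 5_000
keys = ["city", "state", "due-date"]

# Sanity check: both variants should build the same frame.
assert build_from_records(json_strings, keys).equals(
    build_from_columns(json_strings, keys)
)

for fn in (build_from_records, build_from_columns):
    seconds = timeit.timeit(lambda: fn(json_strings, keys), number=5)
    print(f"{fn.__name__}: {seconds:.3f}s for 5 runs")
```

Running this against a slice of the real Series instead of the synthetic list would give a more representative comparison.
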
Answered By: AKX