How to convert a column of strings to Python literals and extract the values

Question:

I have a DataFrame which looks as follows:

id  time          activity
4   1596213715048   [{"name":"STILL","conf":100}]
4   1596213739171   [{"name":"STILL","conf":54},{"name":"ON_FOOT","conf":19},{"name":"WALKING","conf":19},{"name":"ON_BICYCLE","conf":9},{"name":"IN_VEHICLE","conf":8},{"name":"UNKNOWN","conf":3}]
4   1596213755797   [{"name":"STILL","conf":97},{"name":"UNKNOWN","conf":2},{"name":"IN_VEHICLE","conf":1}]
6   1596214842817   [{"name":"STILL","conf":100}]
6   1596214931090   [{"name":"STILL","conf":34},{"name":"IN_VEHICLE","conf":28},{"name":"ON_FOOT","conf":15},{"name":"WALKING","conf":15},{"name":"ON_BICYCLE","conf":8},{"name":"UNKNOWN","conf":3}]
8   1596214957246   [{"name":"STILL","conf":100}]
9   1596215304418   [{"name":"STILL","conf":100}]

I would like to split the activity column according to name. The resulting DataFrame should look like:

id  time           IN_VEHICLE  ON_BICYCLE  ON_FOOT  WALKING  RUNNING  TILTING  STILL  UNKNOWN
4   1596213715048  0           0           0        0        0        0        100    0
4   1596213739171  8           9           19       19       0        0        54     3
4   1596213755797  1           0           0        0        0        0        97     2
6   1596214842817  0           0           0        0        0        0        100    0
6   1596214931090  28          8           15       15       0        0        34     3
8   1596214957246  0           0           0        0        0        0        100    0
9   1596215304418  0           0           0        0        0        0        100    0

How can this split be done? The set of resulting columns is fixed; if an entry in the activity string does not correspond to one of those columns, an error should be raised.

Asked By: machinery


Answers:

I wrote an approach based on this answer. However, your JSON is in the format of a list of dicts, rather than a dict. To fix that, I define the function flatten_json_to_dict(), which is then called on each row of the activity column.

In contrast to the original answer, I get the columns back into the original dataframe using a join instead of using assignment, which I think is a little less hacky.

The final step is to replace missing (NA) values with zero.

#!/usr/bin/env python3
import pandas as pd
import json

def flatten_json_to_dict(s):
    return {obj['name']: obj['conf'] for obj in json.loads(s)}

# sep=r'\s+' splits on whitespace; delim_whitespace=True is deprecated in recent pandas
df = pd.read_csv('file.csv', sep=r'\s+')
df
#    id           time                                           activity
# 0   4  1596213715048                      [{"name":"STILL","conf":100}]
# 1   4  1596213739171  [{"name":"STILL","conf":54},{"name":"ON_FOOT",...
# 2   4  1596213755797  [{"name":"STILL","conf":97},{"name":"UNKNOWN",...
# 3   6  1596214842817                      [{"name":"STILL","conf":100}]
# 4   6  1596214931090  [{"name":"STILL","conf":34},{"name":"IN_VEHICL...
# 5   8  1596214957246                      [{"name":"STILL","conf":100}]
# 6   9  1596215304418                      [{"name":"STILL","conf":100}]

expanded = df['activity'].apply(flatten_json_to_dict).apply(pd.Series)
df = df.join(expanded)
# Remove activity column
df = df.drop('activity', axis=1)
# Fill NA with 0
df = df.fillna(0)
df

#    id           time  STILL  ON_FOOT  WALKING  ON_BICYCLE  IN_VEHICLE  UNKNOWN
# 0   4  1596213715048  100.0      0.0      0.0         0.0         0.0      0.0
# 1   4  1596213739171   54.0     19.0     19.0         9.0         8.0      3.0
# 2   4  1596213755797   97.0      0.0      0.0         0.0         1.0      2.0
# 3   6  1596214842817  100.0      0.0      0.0         0.0         0.0      0.0
# 4   6  1596214931090   34.0     15.0     15.0         8.0        28.0      3.0
# 5   8  1596214957246  100.0      0.0      0.0         0.0         0.0      0.0
# 6   9  1596215304418  100.0      0.0      0.0         0.0         0.0      0.0
Answered By: Nick ODell
  • This answer is about 8x faster than the other solution for a dataframe with 100k rows (31.8 s vs. 4 min 28 s in the timings below)
    • The other implementation works, but it uses .apply twice and a dict comprehension, which are slow compared to vectorized methods.

Explanation

  1. .apply(literal_eval) converts the 'activity' column from strings to Python literals (here, lists of dicts; e.g. '[{"name":"STILL","conf":100}]' → [{"name":"STILL","conf":100}])
  2. .explode separates the dicts in each list to separate rows
  3. Extract the keys and values in the 'activity' column into separate columns and then .join the columns back to df
    • The timing analysis of this answer shows the fastest way to extract a column of single level dicts to a dataframe is with pd.DataFrame(df.pop('activity').values.tolist())
  4. .pivot the df into a wide format
  5. Change dfp.columns.name from 'name' to None – this is cosmetic, and can be removed
  • This was performed in pandas 1.2.0
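
To make steps 1 and 2 concrete, here is a small sketch on a two-row frame (the frame `small` is made up for illustration). It also shows json.loads as an alternative parser, which goes beyond what this answer uses: the strings here are valid JSON, and literal_eval only handles them because they contain no JSON-only tokens such as true, false, or null.

```python
import json
from ast import literal_eval

import pandas as pd

# hypothetical two-row sample in the same shape as the question's data
small = pd.DataFrame({
    'id': [4, 4],
    'activity': ['[{"name":"STILL","conf":100}]',
                 '[{"name":"STILL","conf":97},{"name":"UNKNOWN","conf":2}]'],
})

# step 1: each string becomes a list of dicts
parsed = small['activity'].apply(literal_eval)

# json.loads yields the same objects for these particular strings
parsed_json = small['activity'].apply(json.loads)

# step 2: one row per dict; ignore_index=True renumbers the rows from 0
exploded = small.assign(activity=parsed).explode('activity', ignore_index=True)
print(exploded)
```

After step 2 every row holds a single dict, which is what makes the fast pd.DataFrame(...tolist()) expansion in step 3 possible.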
import pandas as pd
from ast import literal_eval

# test data
data = {'id': [4, 4, 4, 6, 6, 8, 9], 'time': [1596213715048, 1596213739171, 1596213755797, 1596214842817, 1596214931090, 1596214957246, 1596215304418], 'activity': ['[{"name":"STILL","conf":100}]', '[{"name":"STILL","conf":54},{"name":"ON_FOOT","conf":19},{"name":"WALKING","conf":19},{"name":"ON_BICYCLE","conf":9},{"name":"IN_VEHICLE","conf":8},{"name":"UNKNOWN","conf":3}]', '[{"name":"STILL","conf":97},{"name":"UNKNOWN","conf":2},{"name":"IN_VEHICLE","conf":1}]', '[{"name":"STILL","conf":100}]', '[{"name":"STILL","conf":34},{"name":"IN_VEHICLE","conf":28},{"name":"ON_FOOT","conf":15},{"name":"WALKING","conf":15},{"name":"ON_BICYCLE","conf":8},{"name":"UNKNOWN","conf":3}]', '[{"name":"STILL","conf":100}]', '[{"name":"STILL","conf":100}]']}
df = pd.DataFrame(data)

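# function to transform column of strings
def test(df):
    # parse each string into a list of dicts (assign leaves the caller's df unmodified)
    df = df.assign(activity=df.activity.apply(literal_eval))
    # one row per dict
    df = df.explode('activity', ignore_index=True)
    # expand the dicts into 'name' and 'conf' columns and join them back
    df = df.join(pd.DataFrame(df.pop('activity').values.tolist()))
    # wide format: one column per activity name, missing values filled with 0
    dfp = df.pivot(index=['id', 'time'], columns='name', values='conf').fillna(0).astype(int).reset_index()
    # cosmetic: remove the 'name' label from the columns index
    dfp.columns.rename(None, inplace=True)
    return dfp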


# call the function
test(df)

# result
   id           time  IN_VEHICLE  ON_BICYCLE  ON_FOOT  STILL  UNKNOWN  WALKING
0   4  1596213715048           0           0        0    100        0        0
1   4  1596213739171           8           9       19     54        3       19
2   4  1596213755797           1           0        0     97        2        0
3   6  1596214842817           0           0        0    100        0        0
4   6  1596214931090          28           8       15     34        3       15
5   8  1596214957246           0           0        0    100        0        0
6   9  1596215304418           0           0        0    100        0        0
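
The result above does not contain the RUNNING and TILTING columns from the question, and nothing raises on an unexpected activity name. A minimal sketch of one way to add both, assuming the wide frame has already been produced; EXPECTED and enforce_columns are hypothetical names, not part of the answer:

```python
import pandas as pd

# fixed set of activity columns, taken from the desired output in the question
EXPECTED = ['IN_VEHICLE', 'ON_BICYCLE', 'ON_FOOT', 'WALKING',
            'RUNNING', 'TILTING', 'STILL', 'UNKNOWN']


def enforce_columns(wide, key_cols=('id', 'time'), expected=EXPECTED):
    """Raise on unknown activity names, then add missing columns filled with 0."""
    key_cols = list(key_cols)
    unknown = set(wide.columns) - set(key_cols) - set(expected)
    if unknown:
        raise ValueError(f'unexpected activity names: {sorted(unknown)}')
    # reindex adds the absent columns and puts all columns in a fixed order
    return wide.reindex(columns=key_cols + list(expected), fill_value=0)


# small wide frame standing in for the pivoted result
wide = pd.DataFrame({'id': [4, 4], 'time': [1596213715048, 1596213739171],
                     'STILL': [100, 54], 'ON_FOOT': [0, 19]})
fixed = enforce_columns(wide)
print(fixed)
```

The check runs before reindex on purpose: reindex silently drops columns that are not in the requested list, so an unknown name would otherwise disappear without an error.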

%%timeit testing

import numpy as np
import random
import pandas as pd
import json
from ast import literal_eval

# test data with 100000 rows
np.random.seed(365)
random.seed(365)
rows = 1000000
activity = ['[{"name":"STILL","conf":100}]', '[{"name":"STILL","conf":54},{"name":"ON_FOOT","conf":19},{"name":"WALKING","conf":19},{"name":"ON_BICYCLE","conf":9},{"name":"IN_VEHICLE","conf":8},{"name":"UNKNOWN","conf":3}]', '[{"name":"STILL","conf":97},{"name":"UNKNOWN","conf":2},{"name":"IN_VEHICLE","conf":1}]', '[{"name":"STILL","conf":100}]', '[{"name":"STILL","conf":34},{"name":"IN_VEHICLE","conf":28},{"name":"ON_FOOT","conf":15},{"name":"WALKING","conf":15},{"name":"ON_BICYCLE","conf":8},{"name":"UNKNOWN","conf":3}]', '[{"name":"STILL","conf":100}]', '[{"name":"STILL","conf":100}]']
data = {'time': pd.bdate_range('2021-01-15', freq='s', periods=rows),
        'id': np.random.randint(10, size=(rows)),
        'activity': [random.choice(activity) for _ in range(rows)]}
df = pd.DataFrame(data)

# test the function in this answer
%%timeit -r1 -n1 -q -o
test(df)
[out]:
<TimeitResult : 31.8 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)>

# test the implementation from the other answer
def flatten_json_to_dict(s):
    return {obj['name']: obj['conf'] for obj in json.loads(s)}


def nick(df):
    expanded = df['activity'].apply(flatten_json_to_dict).apply(pd.Series)
    df = df.join(expanded)
    df = df.drop('activity', axis=1)
    df = df.fillna(0)
    return df


%%timeit -r1 -n1 -q -o
nick(df)
[out]:
<TimeitResult : 4min 28s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)>
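
Much of the cost of the second implementation plausibly comes from .apply(pd.Series), which builds one Series per row. Below is a sketch of a variant that keeps the dict-flattening idea but constructs the frame in a single call from a list of dicts. nick_fast is a hypothetical name, the variant assumes integer conf values, and it was not part of the timings above, so no speed-up figure is claimed here.

```python
import json

import pandas as pd


def flatten_json_to_dict(s):
    return {obj['name']: obj['conf'] for obj in json.loads(s)}


def nick_fast(df):
    # build one DataFrame from a list of dicts instead of one Series per row
    records = [flatten_json_to_dict(s) for s in df['activity']]
    expanded = pd.DataFrame(records, index=df.index)
    # names missing from a row come back as NaN; fill with 0 and restore ints
    return df.drop(columns='activity').join(expanded.fillna(0).astype(int))


# two rows from the question's data
sample = pd.DataFrame({
    'id': [4, 4],
    'time': [1596213715048, 1596213755797],
    'activity': ['[{"name":"STILL","conf":100}]',
                 '[{"name":"STILL","conf":97},{"name":"UNKNOWN","conf":2},{"name":"IN_VEHICLE","conf":1}]'],
})
out = nick_fast(sample)
print(out)
```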
Answered By: Trenton McKinney