How to convert a column of strings to python literals and extract the values

Question

I have a DataFrame which looks as follows:

id  time          activity
4   1596213715048   [{"name":"STILL","conf":100}]
4   1596213739171   [{"name":"STILL","conf":54},{"name":"ON_FOOT","conf":19},{"name":"WALKING","conf":19},{"name":"ON_BICYCLE","conf":9},{"name":"IN_VEHICLE","conf":8},{"name":"UNKNOWN","conf":3}]
4   1596213755797   [{"name":"STILL","conf":97},{"name":"UNKNOWN","conf":2},{"name":"IN_VEHICLE","conf":1}]
6   1596214842817   [{"name":"STILL","conf":100}]
6   1596214931090   [{"name":"STILL","conf":34},{"name":"IN_VEHICLE","conf":28},{"name":"ON_FOOT","conf":15},{"name":"WALKING","conf":15},{"name":"ON_BICYCLE","conf":8},{"name":"UNKNOWN","conf":3}]
8   1596214957246   [{"name":"STILL","conf":100}]
9   1596215304418   [{"name":"STILL","conf":100}]

I would like to split the activity column according to name. The resulting DataFrame should look like:

id  time          IN_VEHICLE  ON_BICYCLE  ON_FOOT  WALKING  RUNNING  TILTING  STILL UNKNOWN
4   1596213715048 0           0          0        0        0        0        100   0
4   1596213739171 8           9          19       19       0        0        54    3
4   1596213755797 1           0          0        0        0        0        97    2
6   1596214842817 0           0          0        0        0        0        100   0
6   1596214931090 28          8          15       15       0        0        34    3
8   1596214957246 0           0          0        0        0        0        100   0
9   1596215304418 0           0          0        0        0        0        100   0

How can this split be done? The resulting columns are fixed, but if an entry in the activity string does not exist as a column in the resulting DataFrame, an error should be thrown.

Asked By: machinery

Answer 1

I wrote an approach based on this answer. However, your JSON is in the format of a list of dicts, rather than a dict. To fix that, I define the function flatten_json_to_dict(), which is then called on each row of the activity column.

In contrast to the original answer, I get the columns back into the original dataframe using a join instead of using assignment, which I think is a little less hacky.

The final step is to replace missing (NA) values with zero.

#!/usr/bin/env python3
import pandas as pd
import json

def flatten_json_to_dict(s):
    return {obj['name']: obj['conf'] for obj in json.loads(s)}

# sep=r'\s+' splits on whitespace; delim_whitespace=True is deprecated in recent pandas
df = pd.read_csv('file.csv', sep=r'\s+')
df
#    id           time                                           activity
# 0   4  1596213715048                      [{"name":"STILL","conf":100}]
# 1   4  1596213739171  [{"name":"STILL","conf":54},{"name":"ON_FOOT",...
# 2   4  1596213755797  [{"name":"STILL","conf":97},{"name":"UNKNOWN",...
# 3   6  1596214842817                      [{"name":"STILL","conf":100}]
# 4   6  1596214931090  [{"name":"STILL","conf":34},{"name":"IN_VEHICL...
# 5   8  1596214957246                      [{"name":"STILL","conf":100}]
# 6   9  1596215304418                      [{"name":"STILL","conf":100}]

expanded = df['activity'].apply(flatten_json_to_dict).apply(pd.Series)
df = df.join(expanded)
# Remove activity column
df = df.drop('activity', axis=1)
# Fill NA with 0
df = df.fillna(0)
df

#    id           time  STILL  ON_FOOT  WALKING  ON_BICYCLE  IN_VEHICLE  UNKNOWN
# 0   4  1596213715048  100.0      0.0      0.0         0.0         0.0      0.0
# 1   4  1596213739171   54.0     19.0     19.0         9.0         8.0      3.0
# 2   4  1596213755797   97.0      0.0      0.0         0.0         1.0      2.0
# 3   6  1596214842817  100.0      0.0      0.0         0.0         0.0      0.0
# 4   6  1596214931090   34.0     15.0     15.0         8.0        28.0      3.0
# 5   8  1596214957246  100.0      0.0      0.0         0.0         0.0      0.0
# 6   9  1596215304418  100.0      0.0      0.0         0.0         0.0      0.0
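
The steps above do not enforce the fixed column set the question asks for, and the counts come back as floats. A minimal sketch of one way to handle both, assuming the eight names in the question's desired output are the complete set; `EXPECTED` and `expand_activity` are hypothetical names, not part of the original answer:

```python
import json

import pandas as pd

# Fixed output columns, taken from the desired result in the question (assumption).
EXPECTED = ['IN_VEHICLE', 'ON_BICYCLE', 'ON_FOOT', 'WALKING',
            'RUNNING', 'TILTING', 'STILL', 'UNKNOWN']


def expand_activity(df, expected=EXPECTED):
    """Hypothetical helper: one int column per expected activity name."""
    records = [{obj['name']: obj['conf'] for obj in json.loads(s)}
               for s in df['activity']]
    expanded = pd.DataFrame(records, index=df.index)
    # Fail loudly if the data contains a name outside the fixed set
    unknown = sorted(set(expanded.columns) - set(expected))
    if unknown:
        raise ValueError(f'unexpected activity names: {unknown}')
    # reindex adds the missing columns (e.g. RUNNING) and fixes the order
    expanded = expanded.reindex(columns=expected).fillna(0).astype(int)
    return df.drop(columns='activity').join(expanded)


sample = pd.DataFrame({
    'id': [4, 4],
    'time': [1596213715048, 1596213755797],
    'activity': ['[{"name":"STILL","conf":100}]',
                 '[{"name":"STILL","conf":97},{"name":"UNKNOWN","conf":2},'
                 '{"name":"IN_VEHICLE","conf":1}]'],
})
result = expand_activity(sample)
```

Here `result` has the columns `id`, `time` and then the eight names in `EXPECTED`, with zeros where an activity was not reported; a name such as `"FLYING"` in the input would raise `ValueError` instead of silently adding a column.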

Answered By: Nick ODell

Answer 2

This answer is about 8x faster than the other solution on the test dataframe used in the timing section below.
- The other implementation works, but uses .apply twice and a dict comprehension, which are slow compared to vectorized methods.

Explanation

.apply(literal_eval) converts the 'activity' column from strings to Python literals (here, lists of dicts; '[{"name":"STILL","conf":100}]' → [{"name":"STILL","conf":100}])
.explode separates the dicts in each list to separate rows
Extract the keys and values in the 'activity' column into separate columns and then .join the columns back to df
- The timing analysis of this answer shows the fastest way to extract a column of single level dicts to a dataframe is with pd.DataFrame(df.pop('activity').values.tolist())
.pivot the df into a wide format
Change dfp.columns.name from 'name' to None – this is cosmetic, and can be removed
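
To make the intermediate shapes concrete, the steps listed above can be traced on a two-row sample. This is only an illustrative sketch: the names `sample`, `step1`, `step2`, `step3` and `wide` are made up here, and it assumes pandas 1.1 or later (for `ignore_index` in `.explode` and a list-valued `index` in `.pivot`).

```python
from ast import literal_eval

import pandas as pd

sample = pd.DataFrame({
    'id': [4, 4],
    'time': [1596213715048, 1596213755797],
    'activity': ['[{"name":"STILL","conf":100}]',
                 '[{"name":"STILL","conf":97},{"name":"UNKNOWN","conf":2}]'],
})

# 1. strings -> lists of dicts (assign returns a new frame, sample is untouched)
step1 = sample.assign(activity=sample['activity'].apply(literal_eval))

# 2. one row per dict; ignore_index gives a fresh 0..n-1 index
step2 = step1.explode('activity', ignore_index=True)

# 3. dict keys become the columns 'name' and 'conf', joined back on that index
step3 = step2.join(pd.DataFrame(step2.pop('activity').values.tolist()))

# 4. long -> wide; each (id, time, name) combination must be unique for pivot
wide = (step3.pivot(index=['id', 'time'], columns='name', values='conf')
             .fillna(0).astype(int).reset_index())
wide.columns.name = None  # cosmetic: drop the 'name' label on the columns
```

After step 3 the frame is in long format with columns `id`, `time`, `name`, `conf` (three rows for this sample); the pivot turns it into one row per `(id, time)` pair with a `STILL` and an `UNKNOWN` column.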

This was performed in pandas 1.2.0

import pandas as pd
from ast import literal_eval

# test data
data = {'id': [4, 4, 4, 6, 6, 8, 9], 'time': [1596213715048, 1596213739171, 1596213755797, 1596214842817, 1596214931090, 1596214957246, 1596215304418], 'activity': ['[{"name":"STILL","conf":100}]', '[{"name":"STILL","conf":54},{"name":"ON_FOOT","conf":19},{"name":"WALKING","conf":19},{"name":"ON_BICYCLE","conf":9},{"name":"IN_VEHICLE","conf":8},{"name":"UNKNOWN","conf":3}]', '[{"name":"STILL","conf":97},{"name":"UNKNOWN","conf":2},{"name":"IN_VEHICLE","conf":1}]', '[{"name":"STILL","conf":100}]', '[{"name":"STILL","conf":34},{"name":"IN_VEHICLE","conf":28},{"name":"ON_FOOT","conf":15},{"name":"WALKING","conf":15},{"name":"ON_BICYCLE","conf":8},{"name":"UNKNOWN","conf":3}]', '[{"name":"STILL","conf":100}]', '[{"name":"STILL","conf":100}]']}
df = pd.DataFrame(data)

# function to transform column of strings
def test(df):
    # assign returns a new frame, so the caller's df keeps its string column
    df = df.assign(activity=df['activity'].apply(literal_eval))
    df = df.explode('activity', ignore_index=True)
    df = df.join(pd.DataFrame(df.pop('activity').values.tolist()))
    dfp = df.pivot(index=['id', 'time'], columns='name', values='conf').fillna(0).astype(int).reset_index()
    dfp.columns.rename(None, inplace=True)
    return dfp


# call the function
test(df)

# result
   id           time  IN_VEHICLE  ON_BICYCLE  ON_FOOT  STILL  UNKNOWN  WALKING
0   4  1596213715048           0           0        0    100        0        0
1   4  1596213739171           8           9       19     54        3       19
2   4  1596213755797           1           0        0     97        2        0
3   6  1596214842817           0           0        0    100        0        0
4   6  1596214931090          28           8       15     34        3       15
5   8  1596214957246           0           0        0    100        0        0
6   9  1596215304418           0           0        0    100        0        0

`%%timeit` testing

import numpy as np
import random
import pandas as pd
import json
from ast import literal_eval

# test data with 1,000,000 rows (the value of `rows` below)
np.random.seed(365)
random.seed(365)
rows = 1000000
activity = ['[{"name":"STILL","conf":100}]', '[{"name":"STILL","conf":54},{"name":"ON_FOOT","conf":19},{"name":"WALKING","conf":19},{"name":"ON_BICYCLE","conf":9},{"name":"IN_VEHICLE","conf":8},{"name":"UNKNOWN","conf":3}]', '[{"name":"STILL","conf":97},{"name":"UNKNOWN","conf":2},{"name":"IN_VEHICLE","conf":1}]', '[{"name":"STILL","conf":100}]', '[{"name":"STILL","conf":34},{"name":"IN_VEHICLE","conf":28},{"name":"ON_FOOT","conf":15},{"name":"WALKING","conf":15},{"name":"ON_BICYCLE","conf":8},{"name":"UNKNOWN","conf":3}]', '[{"name":"STILL","conf":100}]', '[{"name":"STILL","conf":100}]']
data = {'time': pd.bdate_range('2021-01-15', freq='s', periods=rows),
        'id': np.random.randint(10, size=(rows)),
        'activity': [random.choice(activity) for _ in range(rows)]}
df = pd.DataFrame(data)

# test the function in this answer
%%timeit -r1 -n1 -q -o
test(df)
[out]:
<TimeitResult : 31.8 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)>

# test the implementation from the other answer
def flatten_json_to_dict(s):
    return {obj['name']: obj['conf'] for obj in json.loads(s)}


def nick(df):
    expanded = df['activity'].apply(flatten_json_to_dict).apply(pd.Series)
    df = df.join(expanded)
    df = df.drop('activity', axis=1)
    df = df.fillna(0)
    return df


%%timeit -r1 -n1 -q -o
nick(df)
[out]:
<TimeitResult : 4min 28s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)>
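
Because the activity strings in the question are JSON, `json.loads` could be swapped in for `literal_eval` in `test()`. This was not part of the timing above, so no speed claim is made here; the sketch below only shows, on made-up sample strings, where the two parsers agree and where they differ.

```python
import json
from ast import literal_eval

s = '[{"name":"STILL","conf":97},{"name":"UNKNOWN","conf":2}]'

# For data like the question's, both parsers return the same list of dicts.
same = json.loads(s) == literal_eval(s)

# JSON-only tokens (null, true, false) are not Python literals.
js = '[{"name":"STILL","conf":null}]'
parsed = json.loads(js)  # null becomes None

try:
    literal_eval(js)
    literal_eval_ok = True
except ValueError:
    # literal_eval rejects the bare name `null`
    literal_eval_ok = False
```

If the real data can contain `null`, `true` or `false`, `json.loads` is the safer choice of the two; for the sample data in the question either works.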

Answered By: Trenton McKinney
