pandas best way to process string column (not split)

Question:

INTRODUCTION TO PROBLEM

I have data encoded in string in one DataFrame column:

  id             data
0  a   2;0;4208;1;790
1  b     2;0;768;1;47
2  c       2;0;92;1;6
3  d          1;0;341
4  e  3;0;1;2;6;4;132
5  f  3;0;1;1;6;3;492

The data represents counts of how many times certain events happened in our system. We can have 256 different events (each has a numerical id assigned from the range 0-255). As usually only a few events happen in one measurement period, it doesn’t make sense to store all the zeros. That’s why the data is encoded as follows: the first number tells how many events happened during the measurement period, then each pair contains an event_id and a counter.

For example:
"3;0;1;1;6;3;492" means:

  • 3 events happened in measurement period
  • event with id=0 happened 1 time
  • event with id=1 happened 6 times
  • event with id=3 happened 492 times
  • other events didn’t happen
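
For illustration, here is a minimal sketch of that decoding logic in plain Python (the decode helper is mine, for exposition only):

def decode(encoded: str) -> dict:
    """Decode 'n;id1;count1;...;idn;countn' into {event_id: count}."""
    parts = [int(p) for p in encoded.split(';')]
    n_events = parts[0]  # leading number: how many (id, count) pairs follow
    return {parts[2*i + 1]: parts[2*i + 2] for i in range(n_events)}

print(decode('3;0;1;1;6;3;492'))  # {0: 1, 1: 6, 3: 492}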

I need to decode the data to separate columns. Expected result is DataFrame which looks like this:

  id  data_0  data_1  data_2  data_3  data_4
0  a  4208.0   790.0     0.0     0.0     0.0
1  b   768.0    47.0     0.0     0.0     0.0
2  c    92.0     6.0     0.0     0.0     0.0
3  d   341.0     0.0     0.0     0.0     0.0
4  e     1.0     0.0     6.0     0.0   132.0
5  f     1.0     6.0     0.0   492.0     0.0

QUESTION ITSELF

I came up with the following function to do it:

def split_data(data: pd.Series):
    # split the encoded string into numeric columns, padding short rows with -1
    tmp = data.str.split(';', expand=True).astype('Int32').fillna(-1)
    # row[0] holds the number of events; each following (row[i*2-1], row[i*2])
    # pair is (event_id, count), used to build one column per event id
    tmp = tmp.apply(
        lambda row: {'{0}_{1}'.format(data.name, row[i*2-1]): row[i*2] for i in range(1, row[0]+1)},
        axis='columns',
        result_type='expand').fillna(0)
    return tmp

df = pd.concat([df, split_data(df.pop('data'))], axis=1)

The problem is that I have millions of lines to process and it takes A LOT of time.
As I don’t have much experience with pandas, I hope someone can suggest a more efficient way of performing this task.

EDIT – ANSWER ANALYSIS

Ok, so I took all three answers and performed some benchmarking 🙂.
Starting conditions: I already have a DataFrame (this will be important!).
As expected, all of them were waaaaay faster than my code.
For example, for 15 rows with 1000 repeats in timeit:

  • my code: 0.5827s
  • Schalton’s code: 0.1138s
  • Shubham’s code: 0.2242s
  • SomeDude’s code: 0.2219s

Seems like Schalton’s code wins!

However… for 1500 rows with 50 repeats:

  • my code: 31.1139s
  • Schalton’s code: 2.4599s
  • Shubham’s code: 0.511s
  • SomeDude’s code: 17.15s

I decided to check once more, this time only one attempt but for 150 000 rows:

  • my code: 68.6798s
  • Schalton’s code: 6.3889s
  • Shubham’s code: 0.9520s
  • SomeDude’s code: 37.8837s

An interesting thing happens: as the DataFrame gets bigger, all versions except Shubham’s take much longer! The two fastest are Schalton’s and Shubham’s versions. This is where the starting point matters! I already have an existing DataFrame, so I have to convert it to a dictionary first. The dictionary itself is processed really fast, but the conversion takes time. Shubham’s solution is more or less independent of size! Schalton’s works very well for small data sets, but due to the conversion to dict it gets much slower for large amounts of data.
Another comparison, this time 150000 rows with 30 repeats:

  • Schalton’s code: 170.1538s
  • Shubham’s code: 36.32s

However for 15 rows with 30000 repeats:

  • Schalton’s code: 50.4997s
  • Shubham’s code: 74.0916s
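
For reference, a minimal timeit harness along these lines reproduces this kind of measurement (a sketch only; split_data stands in for whichever implementation is being timed, and df is the prepared input frame):

import timeit

# `split_data` is the function under test, `df` the prepared input frame
elapsed = timeit.timeit(
    "split_data(df['data'])",
    globals=globals(),
    number=1000,  # the "repeats" in the tables above
)
print(f"{elapsed:.4f}s")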

SUMMARY

In the end, the choice between Schalton’s version and Shubham’s depends on the use case:

  • for a large number of small DataFrames (or data that starts out as a dictionary), go with Schalton’s solution
  • for very large DataFrames, go with Shubham’s solution.

As mentioned above, I have data sets of around 1 million rows and more, so I will go with Shubham’s answer.

Asked By: Lehu


Answers:

I’d avoid processing this in pandas. Assuming you have the data in some other format, I’d parse it into a list of dictionaries and then load that into pandas.

import pandas as pd
from typing import Dict

data = {
    "a": "2;0;4208;1;790",
    "b": "2;0;768;1;47",
    "c": "2;0;92;1;6",
    "d": "1;0;341",
    "e": "3;0;1;2;6;4;132",
    "f": "3;0;1;1;6;3;492"
}

def get_event_counts(event_str: str, delim: str = ";") -> Dict[str, int]:
    """
    given an event string return a dictionary of events
    """
    EVENT_COUNT_INDEX = 0
    
    split_event = event_str.split(delim)
    event_count = int(split_event[EVENT_COUNT_INDEX])
    
    events = {
        # one (event_id, count) pair per event, starting right after the leading count
        split_event[index*2+1]: int(split_event[index*2+2]) for index in range(event_count)
    }
    
    return events

data_records = [{"id": k, **get_event_counts(v)} for k,v in data.items()]

print(pd.DataFrame(data_records))

  id     0      1    2      4      3
0  a  4208  790.0  NaN    NaN    NaN
1  b   768   47.0  NaN    NaN    NaN
2  c    92    6.0  NaN    NaN    NaN
3  d   341    NaN  NaN    NaN    NaN
4  e     1    NaN  6.0  132.0    NaN
5  f     1    6.0  NaN    NaN  492.0
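
To match the expected output from the question, you could then fill the gaps and normalize the column names, e.g.:

result = pd.DataFrame(data_records).fillna(0)
result.columns = ['id'] + [f'data_{c}' for c in result.columns[1:]]
result = result[['id'] + sorted(result.columns.drop('id'))]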

If you’re starting from your current df as the input, you could try this:

def process_starting_dataframe(starting_df: pd.DataFrame) -> pd.DataFrame:
    """
    Create a new dataframe from the original input with two columns "id" and "data"
    """
    data_dict = starting_df.T.to_dict()
    data_records = [{"id": i['id'], **get_event_counts(i['data'])} for i in data_dict.values()]
    
    return pd.DataFrame(data_records)
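
For example, applied to the original df from the question:

result = process_starting_dataframe(df)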
Answered By: Schalton

A much more efficient method is to construct dicts from your data.
Notice how the alternating values in the split string act as keys and values?

Then apply pd.Series and fillna(0) to get the dataframe with all required columns for the data.

Then you can concat.

Code:

# ids at odd positions become keys, counts at even positions become values
df_data = df['data'].apply(
    lambda x: dict(zip(x.split(';')[1::2], x.split(';')[2::2]))).apply(pd.Series).fillna(0)
df_data.columns = df_data.columns.map('data_{}'.format)
df = pd.concat([df.drop('data', axis=1), df_data], axis=1)

output:

  id data_0 data_1 data_2 data_4 data_3
0  a   4208    790      0      0      0
1  b    768     47      0      0      0
2  c     92      6      0      0      0
3  d    341      0      0      0      0
4  e      1      0      6    132      0
5  f      1      6      0      0    492

If you need the data columns sorted (with id kept first), you can do:

df = df[['id'] + sorted(df.columns.drop('id'))]
Answered By: SomeDude

Code

pairs = df['data'].str.extractall(r'(?<!^)(\d+);(\d+)')
pairs = pairs.droplevel(1).pivot(columns=0, values=1).fillna(0)

df[['id']].join(pairs.add_prefix('data_'))

Explained

Extract all pairs using a regex pattern

         0     1
  match         
0 0      0  4208
  1      1   790
1 0      0   768
  1      1    47
2 0      0    92
  1      1     6
3 0      0   341
4 0      0     1
  1      2     6
  2      4   132
5 0      0     1
  1      1     6
  2      3   492

Pivot the pairs to reshape them into the desired format

0     0    1  2    3    4
0  4208  790  0    0    0
1   768   47  0    0    0
2    92    6  0    0    0
3   341    0  0    0    0
4     1    0  6    0  132
5     1    6  0  492    0

Join the reshaped pairs dataframe back with the id column

  id data_0 data_1 data_2 data_3 data_4
0  a   4208    790      0      0      0
1  b    768     47      0      0      0
2  c     92      6      0      0      0
3  d    341      0      0      0      0
4  e      1      0      6      0    132
5  f      1      6      0    492      0
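
Note that str.extractall returns strings, so the counts above end up as object dtype; if you need numeric columns, you can cast before joining, e.g.:

df[['id']].join(pairs.add_prefix('data_').astype(int))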
Answered By: Shubham Sharma