restructure a 2D numpy array based on matching column values

Question:

I’m working with a data set with ~30 million entries. Each entry has a timestamp, an ID, a Description, and a value. The overall numpy array looks something like:

[
[Time 1, ID 1, D 1_1, V 1_1],
[Time 1, ID 1, D 1_2, V 1_2],
...
[Time 2, ID 1, D 2_1, V 2_1],
[Time 2, ID 1, D 2_2, V 2_2],
...
[Time X, ID 2, D X_1, V X_1],
...
]

I would like to compress the array into the following format:

[
[Time 1, ID 1, D 1_1, V 1_1, D 1_2, V 1_2, ...],
[Time 2, ID 1, D 2_1, V 2_1, D 2_2, V 2_2, ...],
[Time X, ID 2, D X_1, V X_1, ...],
...
]

Each sub-array within the original array will be the same length and order, but the number of sub-arrays with the same time stamp is variable, as is the number of sub-arrays with the same ID. Is there a way of restructuring the array within a reasonable amount of time? The time, id, and description columns would be strings and the value column would be a float if that matters.

Ideally, I could end up with an array of dictionaries, i.e.

[{'time': Time1, 'ID': ID1, 'D1_1': V1_1, 'D1_2': V1_2, ...}...]

However, given the time my attempt at using dictionaries would take (>100 hours), I’m assuming dictionaries take too long to construct.

Asked By: Dak


Answers:

I think you can easily achieve that with pandas:

import pandas as pd

# your dataframe
df = pd.DataFrame(data=your_np_array, columns=['Time', 'ID', 'Description', 'Value'])
# groupby time and ID, and aggregate the descriptions and values into lists
grouped = df.groupby(['Time', 'ID']).agg({'Description': list, 'Value': list})
# reset the index to get the time and ID as columns rather than indices
result = grouped.reset_index()
# join each group's lists into one comma-separated string per column
result['Description'] = result['Description'].apply(lambda x: ','.join(x))
result['Value'] = result['Value'].apply(lambda x: ','.join(map(str, x)))

# convert the result to a numpy array
my_new_numpy_array = result.to_numpy()
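
If the list-of-dictionaries layout from the question is the real goal, the same groupby can feed it directly. The following is only a minimal sketch: the sample rows, the `sample` and `records` names, and the conversion of the value column to float are assumptions made for illustration, not part of the original data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows in the question's layout: [time, ID, description, value]
sample = np.array([
    ["t1", "id1", "a", "1.0"],
    ["t1", "id1", "b", "2.0"],
    ["t2", "id1", "a", "3.0"],
    ["t3", "id2", "c", "4.0"],
], dtype=object)

df = pd.DataFrame(sample, columns=["Time", "ID", "Description", "Value"])
df["Value"] = df["Value"].astype(float)

# Collect the descriptions and values of each (Time, ID) pair into lists;
# sort=False keeps the groups in order of first appearance
grouped = (
    df.groupby(["Time", "ID"], sort=False)
      .agg({"Description": list, "Value": list})
      .reset_index()
)

# Build one dictionary per (Time, ID) pair, mapping each description to its value
records = [
    {"time": t, "ID": i, **dict(zip(descs, vals))}
    for t, i, descs, vals in zip(
        grouped["Time"], grouped["ID"], grouped["Description"], grouped["Value"]
    )
]

print(records[0])
```

Note that if a description repeats within one (Time, ID) group, or is literally named 'time' or 'ID', the later entry overwrites the earlier one in that dictionary.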
Answered By: Dallas