How to combine multiple rows into a single row with many columns in pandas using an id (clustering multiple records with same id into one record)

Question:

Situation:

1. all_task_usage_10_19

all_task_usage_10_19 is the file which consists of 29229472 rows × 20 columns.
There are multiple rows with the same ID inside the column machine_ID with different values in other columns.

Columns:

'start_time_of_the_measurement_period','end_time_of_the_measurement_period', 'job_ID', 'task_index','machine_ID', 'mean_CPU_usage_rate','canonical_memory_usage', 'assigned_memory_usage','unmapped_page_cache_memory_usage', 'total_page_cache_memory_usage', 'maximum_memory_usage','mean_disk_I/O_time', 'mean_local_disk_space_used', 'maximum_CPU_usage','maximum_disk_IO_time', 'cycles_per_instruction_(CPI)', 'memory_accesses_per_instruction_(MAI)', 'sample_portion',
'aggregation_type', 'sampled_CPU_usage'

2. clustering code

I am trying to cluster multiple machine_ID records using the following code, referencing: How to combine multiple rows into a single row with pandas


3. Output

Output displayed using: with option_context as it allows to better visualise the content


My Aim:

I am trying to cluster multiple rows with the same machine_ID into a single record, so I can apply algorithms like Moving averages, LSTM and HW for predicting cloud workloads.

Something like this.

Asked By: Ishaan_B

||

Answers:

Maybe a Multi-Index is what you’re looking for?

df.set_index(['machine_ID', df.index])

Note that by default set_index returns a new dataframe, and does not change the original.
To change the original (and return None) you can pass an argument inplace=True.

Example:

df = pd.DataFrame({'machine_ID': [1, 1, 2, 2, 3],
                   'a': [1, 2, 3, 4, 5],
                   'b': [10, 20, 30, 40, 50]})
new_df = df.set_index(['machine_ID', df.index])  # not in-place
df.set_index(['machine_ID', df.index], inplace=True)  # in-place

For me, it does create a multi-index: first level is ‘machine_ID’, second one is the previous range index:

Answered By: Vladimir Fokow

The below code worked for me:

all_task_usage_10_19.groupby('machine_ID')[['start_time_of_the_measurement_period','end_time_of_the_measurement_period','job_ID', 'task_index','mean_CPU_usage_rate', 'canonical_memory_usage',
    'assigned_memory_usage', 'unmapped_page_cache_memory_usage',           'total_page_cache_memory_usage', 'maximum_memory_usage',
    'mean_disk_I/O_time', 'mean_local_disk_space_used','maximum_CPU_usage',
    'maximum_disk_IO_time', 'cycles_per_instruction_(CPI)',
    'memory_accesses_per_instruction_(MAI)', 'sample_portion',
    'aggregation_type', 'sampled_CPU_usage']].agg(list).reset_index()
Answered By: Ishaan_B
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.