Is there a way to transform a pandas DataFrame so that a particular column's value serves as an ID?
Question:
I am working with a metrics dataset. I want to transform the given dataset into something where stringValue == ‘IN’ will be the ID of the row so that we can filter/extract info based on the stringValue == ‘IN’. We will have to group by timeComputed as well.
The following image is of the dataset that we have as an input:
Our ultimate goal is to find other metrics for the specific country. Here the country is India – ‘IN’ (there will be different countries in the dataset). I want to find ‘col_stats:SUM:Quantity’ or other similar metrics for the country ‘IN’ given the same ‘timeComputed’.
I can do it by extracting ‘IN’ first, then getting its timeComputed, and then searching for the other metrics with that timeComputed. But this seems like overkill.
I am expecting the resulting dataset similar to following dataset:
countryCode | timeComputed | metricId |
---|---|---|
IN | 2021-04-04 | records:COUNT_RECORDS |
KR | 2022-05-05 | col_stats:SUM:Quantity |
@jezrael I tried the updated solution and it gives me a dataframe as follows:
So now we need a solution where, apart from countryCode, every other metricId computed at that timeComputed becomes its own column in the output dataframe:
countryCode | timeComputed | reporting:METRICS_COMPUTATION_DURATION | basic:COUNT_COLUMNS | col_stats:COUNT_NULL:EndCustomerAccount |
---|---|---|---|---|
IN | 2023-02-21 13:28:15.705000+00:00 | 2282 | 25 | 75229 |
IN | 2023-02-21 13:28:38.354000+00:00 | 2765 | 25 | 75229 |
Answers:
If you need to match on both partition and timeComputed for the IN rows, and want all rows that match, use:
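# partition/timeComputed pairs of the rows whose stringValue is 'IN'
df1 = df.loc[df['stringValue'].eq('IN'), ['partition','timeComputed']]
# inner merge on the shared columns keeps every row with a matching pair
df2 = (df.merge(df1.drop_duplicates())[['stringValue','timeComputed','metricId']]
         .rename(columns={'stringValue':'countryCode'}))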
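To see what the merge keeps, here is a small self-contained sketch on toy data. The column names come from the question, but the rows are invented, and the metricId `dimension:COUNTRY` for the row that carries the country code is purely hypothetical (the real input is only shown as an image).

```python
import pandas as pd

# Toy rows: (partition, timeComputed, metricId, stringValue).
# "dimension:COUNTRY" is a made-up metricId for the row holding the country code.
rows = [
    ("p1", "2021-04-04", "dimension:COUNTRY", "IN"),
    ("p1", "2021-04-04", "records:COUNT_RECORDS", None),
    ("p1", "2021-04-04", "col_stats:SUM:Quantity", None),
    ("p2", "2022-05-05", "dimension:COUNTRY", "KR"),
    ("p2", "2022-05-05", "col_stats:SUM:Quantity", None),
]
df = pd.DataFrame(rows, columns=["partition", "timeComputed", "metricId", "stringValue"])

# partition/timeComputed pairs that belong to 'IN'
df1 = df.loc[df["stringValue"].eq("IN"), ["partition", "timeComputed"]]

# Inner merge on the shared columns keeps all rows from those pairs
df2 = (df.merge(df1.drop_duplicates())[["stringValue", "timeComputed", "metricId"]]
         .rename(columns={"stringValue": "countryCode"}))

print(df2)
```

On this toy input only the three `p1` rows survive. Note that `countryCode` is filled only on the row that originally held `IN`; the other rows keep whatever `stringValue` they had (missing here).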
If you only need to match on timeComputed for the IN rows, and want all rows that match, use:
s = df.loc[df['stringValue'].eq('IN'), 'timeComputed']
df2 = (df.loc[df['timeComputed'].isin(s),['stringValue','timeComputed','metricId']]
.rename(columns={'stringValue':'countryCode'}))
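For the follow-up request (one row per timeComputed, with every metricId as its own column), one possible approach is to pivot the numeric metrics and then attach the country code. This is only a sketch under assumptions: the name of the column holding the numeric values is not visible in the question, so `doubleValue` is a hypothetical name, as is the metricId `dimension:COUNTRY`; the KR batch and its timestamp are invented to show the filter.

```python
import pandas as pd

NAN = float("nan")
t1 = "2023-02-21 13:28:15.705000+00:00"
t2 = "2023-02-21 13:28:38.354000+00:00"
t3 = "2023-02-21 13:29:02.000000+00:00"  # invented batch for another country

# (timeComputed, metricId, stringValue, doubleValue); doubleValue is an assumed column name
rows = [
    (t1, "dimension:COUNTRY", "IN", NAN),
    (t1, "reporting:METRICS_COMPUTATION_DURATION", None, 2282),
    (t1, "basic:COUNT_COLUMNS", None, 25),
    (t1, "col_stats:COUNT_NULL:EndCustomerAccount", None, 75229),
    (t2, "dimension:COUNTRY", "IN", NAN),
    (t2, "reporting:METRICS_COMPUTATION_DURATION", None, 2765),
    (t2, "basic:COUNT_COLUMNS", None, 25),
    (t2, "col_stats:COUNT_NULL:EndCustomerAccount", None, 75229),
    (t3, "dimension:COUNTRY", "KR", NAN),
    (t3, "basic:COUNT_COLUMNS", None, 30),
]
df = pd.DataFrame(rows, columns=["timeComputed", "metricId", "stringValue", "doubleValue"])

# Country code per timeComputed, taken from the rows that carry a stringValue
country = (df.dropna(subset=["stringValue"])
             .drop_duplicates("timeComputed")
             .set_index("timeComputed")["stringValue"]
             .rename("countryCode"))

# Spread the numeric metrics into columns, one row per timeComputed
wide = (df.dropna(subset=["doubleValue"])
          .pivot_table(index="timeComputed", columns="metricId",
                       values="doubleValue", aggfunc="first"))

# Attach the country code, put it first, and keep only the 'IN' batches
out = wide.join(country).reset_index()
out = out[["countryCode", "timeComputed", *wide.columns]]
out = out[out["countryCode"].eq("IN")].reset_index(drop=True)

print(out)
```

`aggfunc="first"` assumes each metricId appears at most once per timeComputed; if a metric can repeat within a batch, a different aggregation (or adding partition to the index) would be needed.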