In Python create a table with categories and ranges based on a list
Question:
I have data in a table that looks like this:
import pandas as pd

input_data = pd.DataFrame({'cat':['A','A','A','A','A','A','A','A','B','B','B','A','A','A','B','B','B','B','B','B','B'],
                           'num':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21]
                           })
It’s just an example. I am not required to use pandas. The point is that the input data will come in a table format.
I need two separate results:
(1) The First Table:
Cat | MIN | MAX |
---|---|---|
A | 1 | 8 |
A | 12 | 14 |
B | 9 | 11 |
B | 15 | 21 |
(2) The Second Table:
Cat | ranges |
---|---|
A | 1-8; 12-14 |
B | 9-11; 15-21 |
So far I have tried to do this with pandas, but then I read that iterating over a df might not be a good idea. The data above is just an example; the actual df will have from 1K to 10K+ rows.
Answers:
First aggregate min/max with GroupBy.agg, using a helper Series g that labels each run of consecutive equal cat values, then sort by cat with DataFrame.sort_values:
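# g increases by 1 whenever cat differs from the previous row, so each run gets its own id
g = input_data['cat'].ne(input_data['cat'].shift()).cumsum()
df = (input_data.groupby([g, 'cat'], as_index=False)
                .agg(MIN=('num', 'min'), MAX=('num', 'max'))
                # a stable sort keeps the runs in their original order within each cat
                .sort_values('cat', kind='stable', ignore_index=True))
print(df)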
cat MIN MAX
0 A 1 8
1 A 12 14
2 B 9 11
3 B 15 21
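To see what the helper Series g contains, here is a plain-Python sketch of the same idea: flag every position where the value differs from the previous one, then take a running sum of those flags. The function name run_ids is hypothetical and only mirrors what ne(shift()).cumsum() computes; it is not part of pandas.

```python
from itertools import accumulate

def run_ids(values):
    """Label each element with the 1-based id of the run of equal values it belongs to."""
    # Flag is True where a value differs from its predecessor;
    # the first element always starts a new run.
    changed = [i == 0 or values[i] != values[i - 1] for i in range(len(values))]
    # A running sum of the flags yields one id per consecutive run.
    return list(accumulate(int(flag) for flag in changed))

# Same 'cat' column as in the question: 8 x A, 3 x B, 3 x A, 7 x B
cats = ["A"] * 8 + ["B"] * 3 + ["A"] * 3 + ["B"] * 7
ids = run_ids(cats)
print(ids)
```

Grouping by these ids (together with cat) is what separates the two A runs and the two B runs before min and max are taken.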
For the second output, use Series.str.cat to build the MIN-MAX strings, then aggregate with join:
df1 = (df['MIN'].astype(str).str.cat(df['MAX'].astype(str), sep='-')
                .groupby(df['cat']).agg('; '.join)
                .reset_index(name='ranges'))
print(df1)
cat ranges
0 A 1-8; 12-14
1 B 9-11; 15-21
For the first dataframe, you can start a new group each time the cat value of a row is not equal to the previous one, then use aggregate functions. For the second one, concatenate the MIN and MAX columns, then group by Cat and join them:
# the run ids must be built from input_data itself so they align with its rows
df1 = (input_data.groupby(input_data['cat'].ne(input_data['cat'].shift()).cumsum(), as_index=False)
                 .agg(Cat=('cat', 'first'), MIN=('num', 'min'), MAX=('num', 'max')))
df2 = (df1.assign(ranges=df1['MIN'].astype(str) + '-' + df1['MAX'].astype(str))
          .groupby('Cat', as_index=False)['ranges'].apply('; '.join))
Output:
>>> df1
Cat MIN MAX
0 A 1 8
1 B 9 11
2 A 12 14
3 B 15 21
>>> df2
Cat ranges
0 A 1-8; 12-14
1 B 9-11; 15-21
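Since the question says pandas is not required, the same two tables can also be built with the standard library alone. This is a minimal sketch and not part of either answer above; category_runs and category_ranges are hypothetical helper names, and it assumes the rows arrive as (cat, num) pairs in their original order.

```python
from itertools import groupby

def category_runs(rows):
    """Return (cat, min, max) for each run of consecutive rows sharing a category."""
    runs = []
    # itertools.groupby only merges adjacent equal keys, which matches the run semantics.
    for cat, group in groupby(rows, key=lambda row: row[0]):
        nums = [num for _, num in group]
        runs.append((cat, min(nums), max(nums)))
    # sorted() is stable, so runs keep their original order within each category.
    return sorted(runs, key=lambda run: run[0])

def category_ranges(runs):
    """Collapse the runs into one 'min-max; min-max' string per category."""
    parts_by_cat = {}
    for cat, low, high in runs:
        parts_by_cat.setdefault(cat, []).append(f"{low}-{high}")
    return {cat: "; ".join(parts) for cat, parts in parts_by_cat.items()}

# Same data as in the question
cats = ["A"] * 8 + ["B"] * 3 + ["A"] * 3 + ["B"] * 7
rows = list(zip(cats, range(1, 22)))

first_table = category_runs(rows)
second_table = category_ranges(first_table)
print(first_table)
print(second_table)
```

This makes a single pass over the rows plus one sort over the runs, so for 1K to 10K+ rows it should be fast enough; the pandas versions remain the more natural choice if the data already lives in a DataFrame.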