List of tuples to DataFrame with a column for elements and a column for tuple length
Question:
I have a list of tuples of different lengths, where the tuples can be thought of as encoding teams of people, such as:
data = [('Alice',),
        ('Bob', 'Betty'),
        ('Charlie', 'Cindy', 'Cramer')]
From this, I would like to create a DataFrame with a column of team member names, and a column with the size of the team they were on:
      name  teamsize
0    Alice         1
1      Bob         2
2    Betty         2
3  Charlie         3
4    Cindy         3
5   Cramer         3
I have tried my hand at some double for loops, but I could not get them to work, and I have the impression that this is not a very good way to go about it. Any tips would be appreciated.
Answers:
Use a list comprehension and the DataFrame constructor:
import pandas as pd
# one row per member, paired with the size of that member's team
out = pd.DataFrame([[name, len(team)] for team in data for name in team],
                   columns=['name', 'teamsize'])
Output:
      name  teamsize
0    Alice         1
1      Bob         2
2    Betty         2
3  Charlie         3
4    Cindy         3
5   Cramer         3
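The comprehension can also be wrapped in a small function so that it is reusable. The following is only a sketch: `teams_to_frame` is a hypothetical helper name chosen for illustration, not something pandas provides, and the empty-list call is there just to show that the column names survive when there are no teams.

```python
import pandas as pd


def teams_to_frame(teams):
    """Return one row per member with the size of that member's team.

    `teams_to_frame` is a hypothetical helper name, not a pandas function.
    """
    rows = [(name, len(team)) for team in teams for name in team]
    return pd.DataFrame(rows, columns=['name', 'teamsize'])


data = [('Alice',),
        ('Bob', 'Betty'),
        ('Charlie', 'Cindy', 'Cramer')]

out = teams_to_frame(data)
empty = teams_to_frame([])  # no teams: an empty frame that still has both columns
```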
For fun, here is a pure pandas solution (though likely less efficient):
out = (pd.DataFrame({'name': data})
         .assign(teamsize=lambda d: d['name'].str.len())
         .explode('name', ignore_index=True)
      )
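To see why the order of operations matters in the chain above, here is the same idea split into separate steps. This is a sketch using the question's sample data; the intermediate variable `teams` is introduced only for illustration. The team size has to be measured while each cell still holds the whole tuple, because after `explode` each cell holds a single name.

```python
import pandas as pd

data = [('Alice',),
        ('Bob', 'Betty'),
        ('Charlie', 'Cindy', 'Cramer')]

# Step 1: one row per team; the 'name' cell holds the whole tuple.
teams = pd.DataFrame({'name': data})

# Step 2: measure each tuple while it is still intact.
teams['teamsize'] = teams['name'].str.len()

# Step 3: explode so each member gets a row; teamsize is repeated per member.
out = teams.explode('name', ignore_index=True)
```

Note that the `ignore_index` argument of `explode` needs pandas 1.1 or later; on older versions, `.explode('name').reset_index(drop=True)` should give the same index.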
You can also build the two columns with explicit loops:
name = []
teamsize = []
for i in data:
    for n in i:
        name.append(n)
        teamsize.append(len(i))
df = pd.DataFrame(list(zip(name, teamsize)),
                  columns=['name', 'teamsize'])
Another Pandas solution:
df = (pd.DataFrame(data).T.melt(value_name='name').dropna()
        .assign(teamsize=lambda x: x.groupby(x.pop('variable')).transform('count')))
print(df)
# Output
      name  teamsize
0    Alice         1
3      Bob         2
4    Betty         2
6  Charlie         3
7    Cindy         3
8   Cramer         3
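The index in this output (0, 3, 4, 6, 7, 8) has gaps because `dropna` removes the padding rows without renumbering. A step-by-step sketch of the same transpose-and-melt idea, with an extra `reset_index(drop=True)` to get the 0 to 5 index the question asks for, might look like this. The intermediate name `long` is only illustrative, and grouping on the `'variable'` column directly is a small variation on the `pop` trick above.

```python
import pandas as pd

data = [('Alice',),
        ('Bob', 'Betty'),
        ('Charlie', 'Cindy', 'Cramer')]

# Transpose so each team is a column, then melt into long form;
# 'variable' holds the team's position in `data`, and padding rows are dropped.
long = pd.DataFrame(data).T.melt(value_name='name').dropna()

# Count the members per team and broadcast the count back to every row.
long = long.assign(teamsize=long.groupby('variable')['name'].transform('count'))

# Drop the helper column and renumber the rows from 0.
df = long.drop(columns='variable').reset_index(drop=True)
```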