Dataframe to print rows of same index to 1 line

Question:

Simple dataframe as below, and I want to print the rows of the same index in 1 sorted line.

Wanted:

123@1st line; 2nd line; 3rd line; 4th line 
456@a1st line; a2nd line; a3rd line

I’ve tried:

import pandas as pd
from io import StringIO

csvfile = StringIO(
"""Group    Sort    String
123 207 1st line
123 208 2nd line
123 209 3rd line
123 210 4th line
456 625 a1st line
456 626 a2nd line
456 627 a3rd line""")

df = pd.read_csv(csvfile, sep = 't', engine='python')

df['Group'] = df['Group'].astype(str)
df['Sort'] = df['Sort'].astype(int)
df = df.sort_values(['Sort'], ascending=[True])

df_1 = df.groupby(['Group'])['String'].apply(lambda x: '; '.join(x.astype(str))).reset_index()

df_1['String'] = df_1['String'].str.split('; ').apply(set).str.join('; ')

for index, row in df_1.iterrows():
    print(row['Group'] + '@' + row['String'])

But the output is:

123@3rd line; 4th line; 1st line; 2nd line
456@a3rd line; a2nd line; a1st line

Where went wrong and how can I correct it?

Asked By: Mark K

||

Answers:

The problem is that your data are ill-formatted.

Either you keep t as a separator (what you did in the header), and add t between each of your columns, or you change the separator as " " (what you did in the body of your dataframe).

Using both will lead to non-necessary formatting problems :

Space separator

csvfile = StringIO(
"""Group Sort String
123 207 1st_line
123 208 2nd_line
123 209 3rd_line
123 210 4th_line
456 625 a1st_line
456 626 a2nd_line
456 627 a3rd_line""")

df = pd.read_csv(csvfile, header = 0, sep = ' ', engine='python')

Tab separator

csvfile = StringIO(
"""GrouptSorttString
123t207t1st line
123t208t2nd line
123t209t3rd line
123t210t4th line
456t625ta1st line
456t626ta2nd line
456t627ta3rd line""")

df = pd.read_csv(csvfile, header = 0, sep = "t", engine='python')

After that operation, you can check with a debugger for instance, your newly created dataframe as three columns, which wasn’t the case with your previous code.

Data Grouping

Then to print all "String" value sharing the same "Group" value, use groupby.

df.groupby("Group").groups
Out[21]: {123: [0, 1, 2, 3], 456: [4, 5, 6]}

Or for better visualization :

for name, group in df.groupby("Group"):
    print(f"{name} : {group.loc[:, 'String'].to_numpy()}n")

123 : ['1st line' '2nd line' '3rd line' '4th line']

456 : ['a1st line' 'a2nd line' 'a3rd line']

Following your guidelines :

for name, group in df.groupby("Group"):
    print(f"{name}@{'; '.join(group.loc[:, 'String'].to_numpy())}n")

123@1st line; 2nd line; 3rd line; 4th line

456@a1st line; a2nd line; a3rd line
Answered By: Zelemist

the problem is with set in
df_1['String'] = df_1['String'].str.split('; ').apply(set).str.join('; ') because sets are unordered. So you need to replace it with something like
df = df.drop_duplicates(subset=['String']) at the beginning

Answered By: V_K
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.