How to convert dataframe columns into key:value strings?

Question:

I am preparing dataset for openai model training. For an example, I have data in csv in below format

df = DataFrame({'foo':['a','b','c'], 'bar':[1, 2, 3], 'new':['apple', 'banana', 'pear']})

df
    bar foo new 
0   1   a   apple   
1   2   b   banana  
2   3   c   pear

I want to create a new column called ‘prompt’ with combine values of bar and foo columns

    bar foo new     prompt  
0   1   a   apple   foo: a, bar: 1
1   2   b   banana  foo: b, bar: 2
2   3   c   pear    foo:c, bar: 3

there is similar example here but it doesn’t add the column names inside the combined column

Asked By: g1soori

||

Answers:

df = pd.DataFrame({'foo':['a','b','c'], 'bar':[1, 2, 3], 'new':['apple', 'banana', 'pear']})
df['prompt'] = df.apply(lambda x: x.to_json(), axis=1)
df

This would work? You get:

    foo bar new     prompt
0   a   1   apple   {"foo":"a","bar":1,"new":"apple"}
1   b   2   banana  {"foo":"b","bar":2,"new":"banana"}
2   c   3   pear    {"foo":"c","bar":3,"new":"pear"}
Answered By: Vetsin

df.apply is very popular but should be avoided whenever possible.

Instead use vectorized methods. Convert the relevant columns to dict and remove the quotes:

df["prompt"] = df[["foo", "bar"]].to_dict(orient="index")
df["prompt"] = df["prompt"].astype(str).replace(r"'", "", regex=True)

#   foo  bar     new            prompt
# 0   a    1   apple  {foo: a, bar: 1}
# 1   b    2  banana  {foo: b, bar: 2}
# 2   c    3    pear  {foo: c, bar: 3}

Note that your comment included braces but your post did not. If you also want to remove the curly braces, add them to the regex:

df["prompt"] = df["prompt"].astype(str).replace(r"[{'}]", "", regex=True)

#   foo  bar     new          prompt
# 0   a    1   apple  foo: a, bar: 1
# 1   b    2  banana  foo: b, bar: 2
# 2   c    3    pear  foo: c, bar: 3

Details

First convert the relevant columns to_dict oriented by index:

df["prompt"] = df[["foo", "bar"]].to_dict(orient="index")

#   foo  bar     new                  prompt
# 0   a    1   apple  {'foo': 'a', 'bar': 1}
# 1   b    2  banana  {'foo': 'b', 'bar': 2}
# 2   c    3    pear  {'foo': 'c', 'bar': 3}

Then use astype to convert it to str type and replace the dict symbols:

df["prompt"] = df["prompt"].astype(str).replace(r"[{'}]", "", regex=True)

#   foo  bar     new          prompt
# 0   a    1   apple  foo: a, bar: 1
# 1   b    2  banana  foo: b, bar: 2
# 2   c    3    pear  foo: c, bar: 3
Answered By: tdy

Combine multiple cells in a row using apply and lambda

df['prompt']=df.apply(lambda k: 'foo: ' + k['foo'] + ', bar: ' + str(k['bar']), axis=1)

This will work?

  foo  bar     new          prompt
0   a    1   apple  foo: a, bar: 1
1   b    2  banana  foo: b, bar: 2
2   c    3    pear  foo: c, bar: 3
Answered By: Hasnain
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.