Add a new row at a selected place in a pandas DataFrame

Question:

I have the following dataframe with a large amount of data:

       Column1  Column2
0        10001   252207
1       100018   219559
2       100068   251102
3       100089   107320
4       100111   250975
5       100111    28540
6       100112   252253
7       100157    17883
...        ...      ...
10000   100998  1231233

I would like to insert a new row with a label “t # {int}” in Column1 above each run of equal values, i.e. a label row is added only where the value in Column1 differs from the previous one. Below is the output that I want to get:

       Column1    Column2
0      t # 0      NULL
1      10001      252207
2      t # 1      NULL
3      100018     219559
4      t # 2      NULL
5      100088     251102
6      100088     107320
7      t # 3      NULL
8      100111     250975
9      100111     28540
10     t # 4      NULL
11     100112     252253
12     t # 5      NULL
13     100157     17883
...    ...        ...
end-3  t # {int}  NULL
end-2  100998     1231233
end-1  100998     3333
end    100998     4123

What I’m trying to do is first create a copy based on Column1 and then add the label rows where I want them:

import pandas

with open("week-1-algorithm.txt", "r") as f:
    text = [line.split() for line in f]

df = pandas.DataFrame(
    text,
    columns=["Column1", "Column2"],
)

new_df = df["Column1"].copy()
iteration_number = 0
for i in range(len(new_df) - 1):
    if new_df[i] != new_df[i + 1]:
        new_df.loc[i + 1] = f't # {iteration_number}'
        iteration_number += 1

Could anyone help me with this? All I get is overwritten data rather than newly added rows.

Asked By: Fortides


Answers:

Assuming that your dataframe is already sorted, you can group by Column1, then add a header row to each group:

frames = [
    subframe
    for i, (_, group) in enumerate(df.groupby("Column1"))
    for subframe in [
        # one-row header frame; it has no Column2, so that cell becomes NaN after concat
        pd.DataFrame([f"t # {i}"], columns=["Column1"]),
        group,
    ]
]

result = pd.concat(frames, ignore_index=True)
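
For reference, here is a minimal self-contained run of this approach; the import line and the tiny sample frame are illustrative assumptions, not the question's actual data:

import pandas as pd

# small sample resembling the question's data (illustrative values only)
df = pd.DataFrame(
    {
        "Column1": [10001, 100018, 100111, 100111],
        "Column2": [252207, 219559, 250975, 28540],
    }
)

frames = [
    subframe
    for i, (_, group) in enumerate(df.groupby("Column1"))
    for subframe in [
        pd.DataFrame([f"t # {i}"], columns=["Column1"]),
        group,
    ]
]

result = pd.concat(frames, ignore_index=True)
print(result)  # Column2 is NaN on the inserted "t # i" rows, matching the NULLs above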
Answered By: Code Different

Try:

from itertools import groupby

# group consecutive equal values of Column1; for each run, take the index of its
# first row and pair it with the label "t # {i}"
df_to_merge = pd.DataFrame(
    (
        (next(g)[0], f"t # {i}")
        for i, (_, g) in enumerate(
            groupby(zip(df.index, df["Column1"]), lambda t: t[1])
        )
    ),
    columns=["index", "Column1"],
).set_index("index")

# the stable sort keeps each label row just before the data row that shares its index
df = pd.concat([df_to_merge, df]).sort_index(kind="stable").reset_index(drop=True)
print(df)

Prints:

   Column1   Column2
0    t # 0       NaN
1    10001  252207.0
2    t # 1       NaN
3   100018  219559.0
4    t # 2       NaN
5   100088  251102.0
6   100088  107320.0
7    t # 3       NaN
8   100111  250975.0
9   100111   28540.0
10   t # 4       NaN
11  100112  252253.0
12   t # 5       NaN
13  100157   17883.0
Answered By: Andrej Kesely

If you want to avoid loops/groupby, an efficient approach is to duplicate the rows where a new Column1 value starts, then overwrite the extra copies with the labels:

# identify rows where Column1 changes value (the first row counts as a change)
m = df['Column1'].diff().ne(0)

# duplicate those rows (repeat count 2 where a change occurs, else 1)
out = (df.loc[df.index.repeat(m+1)]
       .astype({'Column1': 'O'})  # object dtype so Column1 can hold the string labels
       )

# identify the first copy of each duplicated index
m2 = out.index.to_series().duplicated('last')

# blank those rows out, then write the labels into Column1
out[m2] = float('nan')
out.loc[m2, 'Column1'] = 't # ' + m.cumsum().sub(1).astype(str)

Output:

      Column1    Column2
0       t # 0        NaN
0       10001   252207.0
1       t # 1        NaN
1      100018   219559.0
2       t # 2        NaN
2      100068   251102.0
3       t # 3        NaN
3      100089   107320.0
4       t # 4        NaN
4      100111   250975.0
5      100111    28540.0
6       t # 5        NaN
6      100112   252253.0
7       t # 6        NaN
7      100157    17883.0
10000   t # 7        NaN
10000  100998  1231233.0
10001  100998     3333.0
10002  100998     4132.0
Answered By: mozway

Here is a way with pd.concat(): the label rows get fractional indices (0.5 below each group's first row), so sorting the index slots them in just above their group.

import numpy as np

# fractional indices just above each row where Column1 changes value
idx = df.loc[df['Column1'].diff().ne(0)].index - .5
df2 = pd.DataFrame({'Column1': np.char.add('t # ', np.arange(len(idx)).astype('str'))}, index=idx)

pd.concat([df, df2]).sort_index().reset_index(drop=True)

Output:

   Column1   Column2
0    t # 0       NaN
1    10001  252207.0
2    t # 1       NaN
3   100018  219559.0
4    t # 2       NaN
5   100068  251102.0
6    t # 3       NaN
7   100089  107320.0
8    t # 4       NaN
9   100111  250975.0
10  100111   28540.0
11   t # 5       NaN
12  100112  252253.0
13   t # 6       NaN
14  100157   17883.0
Answered By: rhug123
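
A usage note on the diff()-based answers above: they assume Column1 holds numbers, not strings. If the data is loaded from the text file shown in the question, one possible loading step (a sketch, assuming a whitespace-separated file with no header row) is:

import pandas as pd

# assumption: two whitespace-separated columns and no header line in the file
df = pd.read_csv(
    "week-1-algorithm.txt",
    sep=r"\s+",
    header=None,
    names=["Column1", "Column2"],
)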