replicate rows by n times in python

Question:

I have a dataframe that the first column are strings, the 2nd column are the numbers that I want to replicate each string.

df = pd.DataFrame({'String':['a','b', 'c', 'd', 'e', 'f', 'g'],
        'Times':[2, 3, 5, 3, 2, 4, 5]})
df
String Times
a 2
b 3
c 5
d 3
e 2
f 4
g 5

How can I create a data frame into this in python? (stopped at d but you know the pattern). Thanks!

String Times
a 1
a 2
b 1
b 2
b 3
c 1
c 2
c 3
c 4
c 5
Asked By: Shang Lu

||

Answers:

Use df.explode. Make this work by converting your repetitions to list:

s = ['a','b', 'c', 'd', 'e', 'f', 'g']
t = [2, 3, 5, 3, 2, 4, 5]

Then,

times = [list(range(i + 1)) for i in t]
df = pd.DataFrame({'string': s, 'times': times}).explode('times')
>>> df.head(10)
  string times
0      a     0
0      a     1
0      a     2
1      b     0
1      b     1
1      b     2
1      b     3
2      c     0
2      c     1
2      c     2

If you don’t want the index repeated like that, df.reset_index(drop=True, inplace=True) (if in place, otherwise, reassign). If you want 1-indexing rather than 0-indexing, for list(range(i + 1)) substitute list(range(1, i + 2)). Alternatively, just add one to the times column at the end. The times column doesn’t really matter for df.explode, it explodes based on repetitions anyway.

Answered By: ifly6

A bit messy, but seems to do what you want:

new_df = pd.DataFrame(df.String.str.repeat(df.Times).apply(list).explode().reset_index(drop=True))

new_df["Times"] = new_df.groupby("String").cumcount() + 1

Output:

   String  Times
0       a      1
1       a      2
2       b      1
3       b      2
4       b      3
5       c      1
6       c      2
7       c      3
8       c      4
9       c      5
10      d      1
11      d      2
12      d      3
13      e      1
14      e      2
15      f      1
16      f      2
17      f      3
18      f      4
19      g      1
20      g      2
21      g      3
22      g      4
23      g      5
Answered By: ddejohn

Another method could be:

df.assign(Times = df.Times.apply(lambda x: range(1, x + 1))).explode('Times')
Out[]: 
  String Times
0      a     1
0      a     2
1      b     1
1      b     2
1      b     3
2      c     1
2      c     2
2      c     3
2      c     4
2      c     5
Answered By: onyambu

Use Index.repeat to replicate the rows and groupby+cumcount: to enumerate them:

(df.loc[df.index.repeat(df['Times'])]
   .assign(Times=lambda d: d.groupby('String').cumcount().add(1))
 )

Output:

  String  Times
0      a      1
0      a      2
1      b      1
1      b      2
1      b      3
2      c      1
2      c      2
2      c      3
2      c      4
2      c      5
3      d      1
...
comparison of all answers (7k rows):
@mozway
11.2 ms ± 1.63 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

@Onyambu
16 ms ± 685 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

@ifly6
27.3 ms ± 2.42 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

@ddejohn
28.5 ms ± 4.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

On 700k rows: 1s, 1.43s, 2.34s, 2.94s in the same order

Answered By: mozway
ss1=df.String.mul(df.Times).map(list).explode("")
ss2=ss1.groupby(ss1).cumcount().add(1)
pd.DataFrame(dict(String=ss1,Times=ss2))

Output:

   String  Times
0       a      1
1       a      2
2       b      1
3       b      2
4       b      3
5       c      1
6       c      2
7       c      3
8       c      4
9       c      5
10      d      1
11      d      2
12      d      3
13      e      1
14      e      2
15      f      1
16      f      2
17      f      3
18      f      4
19      g      1
20      g      2
21      g      3
22      g      4
23      g      5
Answered By: G.G
Categories: questions Tags: ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.