Pandas: Appending a row to a DataFrame and specifying its index label
Question:
Is there any way to specify the index that I want for a new row, when appending the row to a dataframe?
The original documentation provides the following example:
In [1301]: df = DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
In [1302]: df
Out[1302]:
A B C D
0 -1.137707 -0.891060 -0.693921 1.613616
1 0.464000 0.227371 -0.496922 0.306389
2 -2.290613 -1.134623 -1.561819 -0.260838
3 0.281957 1.523962 -0.902937 0.068159
4 -0.057873 -0.368204 -1.144073 0.861209
5 0.800193 0.782098 -1.069094 -1.099248
6 0.255269 0.009750 0.661084 0.379319
7 -0.008434 1.952541 -1.056652 0.533946
In [1303]: s = df.xs(3)
In [1304]: df.append(s, ignore_index=True)
Out[1304]:
A B C D
0 -1.137707 -0.891060 -0.693921 1.613616
1 0.464000 0.227371 -0.496922 0.306389
2 -2.290613 -1.134623 -1.561819 -0.260838
3 0.281957 1.523962 -0.902937 0.068159
4 -0.057873 -0.368204 -1.144073 0.861209
5 0.800193 0.782098 -1.069094 -1.099248
6 0.255269 0.009750 0.661084 0.379319
7 -0.008434 1.952541 -1.056652 0.533946
8 0.281957 1.523962 -0.902937 0.068159
where the new row gets the index label automatically. Is there any way to control the new label?
Answers:
The name of the Series becomes the index of the row in the DataFrame:
In [99]: df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
In [100]: s = df.xs(3)
In [101]: s.name = 10
In [102]: df.append(s)
Out[102]:
A B C D
0 -2.083321 -0.153749 0.174436 1.081056
1 -1.026692 1.495850 -0.025245 -0.171046
2 0.072272 1.218376 1.433281 0.747815
3 -0.940552 0.853073 -0.134842 -0.277135
4 0.478302 -0.599752 -0.080577 0.468618
5 2.609004 -1.679299 -1.593016 1.172298
6 -0.201605 0.406925 1.983177 0.012030
7 1.158530 -2.240124 0.851323 -0.240378
10 -0.940552 0.853073 -0.134842 -0.277135
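For readers on pandas 2.0 or later, where DataFrame.append is no longer available, the same idea can be sketched with pd.concat. This is only an illustration under the assumption that the row is converted to a one-row frame first; the small deterministic frame and the variable name result are made up for the example.

```python
import numpy as np
import pandas as pd

# Small deterministic frame so the outcome is easy to check by eye.
df = pd.DataFrame(np.arange(8.0).reshape(4, 2), columns=["A", "B"])

# Take a copy of row 3 and rename it; the Series name is the future row label.
s = df.loc[3].copy()
s.name = 10

# to_frame() yields a single column named 10; .T turns it into a single row
# labelled 10, which concat then stacks under the original rows.
result = pd.concat([df, s.to_frame().T])
print(result)
```

As with append, the original df is left untouched and a new frame is returned.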
df.loc will do the job:
>>> df = pd.DataFrame(np.random.randn(3, 2), columns=['A','B'])
>>> df
A B
0 -0.269036 0.534991
1 0.069915 -1.173594
2 -1.177792 0.018381
>>> df.loc[13] = df.loc[1]
>>> df
A B
0 -0.269036 0.534991
1 0.069915 -1.173594
2 -1.177792 0.018381
13 0.069915 -1.173594
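The new row does not have to be a copy of an existing one. A minimal sketch, assuming the label is not yet in the index: the value can be a list in column order or a Series keyed by column name (the labels 13 and 14 are arbitrary choices for the example).

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 4.0]})

# A list is assigned positionally, in column order.
df.loc[13] = [5.0, 6.0]

# A Series is aligned on column names, so the key order does not matter.
df.loc[14] = pd.Series({"B": 8.0, "A": 7.0})

print(df)
```

Unlike append or concat, this modifies df in place.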
I shall refer to the same sample of data as posted in the question:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
print('The original data frame is: \n{}'.format(df))
Running this code will give you
The original data frame is:
A B C D
0 0.494824 -0.328480 0.818117 0.100290
1 0.239037 0.954912 -0.186825 -0.651935
2 -1.818285 -0.158856 0.359811 -0.345560
3 -0.070814 -0.394711 0.081697 -1.178845
4 -1.638063 1.498027 -0.609325 0.882594
5 -0.510217 0.500475 1.039466 0.187076
6 1.116529 0.912380 0.869323 0.119459
7 -1.046507 0.507299 -0.373432 -1.024795
Now you wish to append a new row to this data frame, which doesn’t need to be a copy of any other row in the data frame. @Alon suggested an interesting approach: use df.loc to append a new row with a different index. The issue with this approach is that if a row is already present at that index, it will be overwritten by the new values. This matters for datasets whose row index is not unique, like store IDs in transaction datasets. So a more general solution to your question is to create the row, turn the new row data into a pandas Series, set its name to the index label you want, and then append it to the data frame. Don’t forget to assign the result back to the original variable, because df.append returns a new DataFrame and does not modify the original in place. The code is as follows:
row = pd.Series({'A':10,'B':20,'C':30,'D':40},name=3)
df = df.append(row)
print('The new data frame is: \n{}'.format(df))
Following would be the new output:
The new data frame is:
A B C D
0 0.494824 -0.328480 0.818117 0.100290
1 0.239037 0.954912 -0.186825 -0.651935
2 -1.818285 -0.158856 0.359811 -0.345560
3 -0.070814 -0.394711 0.081697 -1.178845
4 -1.638063 1.498027 -0.609325 0.882594
5 -0.510217 0.500475 1.039466 0.187076
6 1.116529 0.912380 0.869323 0.119459
7 -1.046507 0.507299 -0.373432 -1.024795
3 10.000000 20.000000 30.000000 40.000000
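The difference between the two approaches for a label that already exists can be sketched as follows. This uses pd.concat rather than df.append so that it also runs on pandas 2.0 and later; the frame and the names overwritten, appended and duplicate_error are made up for the illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3]}, index=[0, 1, 2])

# df.loc with an existing label replaces that row; no row is added.
overwritten = df.copy()
overwritten.loc[1] = 99

# Concatenation keeps both rows, so label 1 now appears twice.
row = pd.DataFrame({"A": [99]}, index=[1])
appended = pd.concat([df, row])

# If duplicate labels should be rejected instead, verify_integrity=True
# makes concat raise a ValueError.
duplicate_error = None
try:
    pd.concat([df, row], verify_integrity=True)
except ValueError as exc:
    duplicate_error = str(exc)

print(overwritten)
print(appended)
```

Which behaviour is right depends on whether the index is meant to be unique.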
There is another solution. The following code does not work (although I think pandas needs this feature):
import pandas as pd
# empty dataframe
a = pd.DataFrame()
a.loc[0] = {'first': 111, 'second': 222}
But the following code runs fine:
import pandas as pd
# empty dataframe
a = pd.DataFrame()
a = a.append(pd.Series({'first': 111, 'second': 222}, name=0))
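On pandas versions without DataFrame.append, one possible way to get the same one-row frame is to build it directly from the named Series instead of starting from an empty frame. This is a sketch, not the only option:

```python
import pandas as pd

# The Series name (0) becomes the row label; its keys become the columns.
row = pd.Series({"first": 111, "second": 222}, name=0)
a = pd.DataFrame([row])

print(a)
```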
Maybe my case is a different scenario but looks similar. I would define my own question as: ‘How to insert a row with new index at some (given) position?’
Let’s create a test DataFrame:
import pandas as pd
df = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 'B'], index=['x', 'y'])
Result:
A B
x 1 2
y 3 4
Then, let’s say, we want to place a new row with index z at position 1 (second row).
pos = 1
index_name = 'z'
# create new indexes where index is at the specified position
new_indexes = df.index.insert(pos, index_name)
# create new dataframe with new row
# specify new index in name argument
new_line = pd.Series({'A': 5, 'B': 6}, name=index_name)
df_new_row = pd.DataFrame([new_line], columns=df.columns)
# append new line to dataframe
df = pd.concat([df, df_new_row])
Now the new row is at the end:
A B
x 1 2
y 3 4
z 5 6
Now let’s reorder the rows so that the new index sits at the specified position:
df = df.reindex(new_indexes)
Result:
A B
x 1 2
z 5 6
y 3 4
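The steps above can be wrapped into a small helper. The function name insert_row and its signature are hypothetical, chosen only for this sketch, and it assumes the index labels are unique (reindex raises an error on duplicate labels).

```python
import pandas as pd

def insert_row(df, pos, label, values):
    """Return a new frame with `values` inserted as row `label` at position `pos`.

    Hypothetical helper; assumes `label` is not already in df.index.
    """
    # Target ordering of labels, with the new one at the requested position.
    new_order = df.index.insert(pos, label)
    # One-row frame carrying the new label and the same columns as df.
    new_row = pd.DataFrame([values], index=[label], columns=df.columns)
    # Append at the end, then reorder the rows to the target ordering.
    return pd.concat([df, new_row]).reindex(new_order)

df = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"], index=["x", "y"])
result = insert_row(df, 1, "z", {"A": 5, "B": 6})
print(result)
```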
You should consider using df.loc[row_name] = row_value.
df.append(pd.Series({column: row_value}, name=row_name)) will lead to
FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
(In pandas 2.0 and later the method has been removed.)
df.loc[row_name] = row_value is also faster than pd.concat.
Here is an example:
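import numpy as np
import pandas as pd

p = pd.DataFrame(data=np.random.rand(100), columns=['price'], index=np.arange(100))

# Assign through df.loc, one row at a time
def func1(p):
    for i in range(100):
        p.loc[i] = 0

# DataFrame.append (only exists in pandas < 2.0); the result is discarded
def func2(p):
    for i in range(100):
        p.append(pd.Series({'BTC': 0}, name=i))

# pd.concat, rebuilding the frame on every iteration
def func3(p):
    for i in range(100):
        p = pd.concat([p, pd.Series({i: 0}, name='price')])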
%timeit func1(p)
1.87 ms ± 23.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit func2(p)
1.56 s ± 43.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit func3(p)
24.8 ms ± 748 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
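The append and concat variants are slow here largely because they build a new frame on every iteration. When many rows with known labels have to be added, one option is to collect them first and concatenate once. The following is a sketch under that assumption and was not part of the timings above; the names new_rows and extra are made up.

```python
import numpy as np
import pandas as pd

p = pd.DataFrame({"price": np.random.rand(100)}, index=np.arange(100))

# Collect the new rows keyed by their index label...
new_rows = {i: {"price": 0.0} for i in range(100, 200)}

# ...turn them into a frame in one step (dict keys become the row labels)...
extra = pd.DataFrame.from_dict(new_rows, orient="index")

# ...and concatenate a single time instead of once per row.
p = pd.concat([p, extra])
print(p.tail())
```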