Iterate through a dataframe by index
Question:
I have a dataframe called staticData which looks like this:
                 narrow_sector          broad_sector       country  exchange
unique_id
BBG.MTAA.STM.S   Semiconductors         Technology         CH       MTAA
BBG.MTAA.CNHI.S  Machinery-Diversified  Industrial         GB       MTAA
BBG.MTAA.FCA.S   Auto Manufacturers     Consumer Cyclical  GB       MTAA
BBG.MTAA.A2A.S   Electric               Utilities          IT       MTAA
BBG.MTAA.ACE.S   Electric               Utilities          IT       MTAA
I am trying to iterate through the dataframe row by row, picking out two bits of information: the index (unique_id) and the exchange. I am having a problem iterating on the index. Please see my code:
for i, row in staticData.iterrows():
    unique_id = staticData.ix[i]
    exchange = row['exchange']
I have tried unique_id = row['unique_id'], but can’t get it to work…
I am trying to return, say, for row 1:
unique_id = BBG.MTAA.STM.S
exchange = MTAA
Answers:
You want the following:
for i, row in staticData.iterrows():
    unique_id = i
    exchange = row['exchange']
i will be the index label value
Example:
In [57]:
df = pd.DataFrame(np.random.randn(5,3), index=list('abcde'), columns=list('fgh'))
df
Out[57]:
          f         g         h
a -0.900835 -0.913989 -0.624536
b -0.854091  0.286364 -0.869539
c  1.090133 -0.771667  1.258372
d -0.721753 -0.329211  0.479295
e  0.520786  0.273722  0.824172
In [62]:
for i, row in df.iterrows():
    print('index: ', i, 'col g:', row['g'])
index: a col g: -0.913988608754
index: b col g: 0.286363847188
index: c col g: -0.771666520074
index: d col g: -0.329211394286
index: e col g: 0.273721527592
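Applied to the frame from the question, the same loop might look like the sketch below. The frame is rebuilt by hand from the sample rows shown in the question, so the constructor call is an assumption about how staticData was created. It also shows why row['unique_id'] fails: unique_id is the index name, not a column. (The .ix accessor used in the question was removed in pandas 1.0, so it is avoided here.)

```python
import pandas as pd

# Rebuilt by hand from the sample in the question (an assumption about the real data).
staticData = pd.DataFrame(
    {
        "narrow_sector": ["Semiconductors", "Machinery-Diversified",
                          "Auto Manufacturers", "Electric", "Electric"],
        "broad_sector": ["Technology", "Industrial",
                         "Consumer Cyclical", "Utilities", "Utilities"],
        "country": ["CH", "GB", "GB", "IT", "IT"],
        "exchange": ["MTAA"] * 5,
    },
    index=pd.Index(
        ["BBG.MTAA.STM.S", "BBG.MTAA.CNHI.S", "BBG.MTAA.FCA.S",
         "BBG.MTAA.A2A.S", "BBG.MTAA.ACE.S"],
        name="unique_id",
    ),
)

# unique_id is the index name, so it is not among the columns of each row.
has_unique_id_column = "unique_id" in staticData.columns  # False

pairs = []
for unique_id, row in staticData.iterrows():
    # The first element yielded by iterrows() is the index label itself.
    pairs.append((unique_id, row["exchange"]))

print(pairs[0])  # ('BBG.MTAA.STM.S', 'MTAA')
```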
Maybe a more pandas-like way?
staticData.apply((lambda x: (x.name, x['exchange'])), axis=1)
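To make the result of that one-liner concrete, here is a small sketch on a two-row stand-in for staticData (the shortened frame is an assumption for illustration). With axis=1 the lambda receives each row as a Series whose .name is the index label, and the call returns a Series of tuples keyed by unique_id.

```python
import pandas as pd

# Two-row stand-in for the question's frame (illustrative only).
staticData = pd.DataFrame(
    {
        "narrow_sector": ["Semiconductors", "Machinery-Diversified"],
        "country": ["CH", "GB"],
        "exchange": ["MTAA", "MTAA"],
    },
    index=pd.Index(["BBG.MTAA.STM.S", "BBG.MTAA.CNHI.S"], name="unique_id"),
)

# x is one row; x.name is that row's index label.
result = staticData.apply(lambda x: (x.name, x["exchange"]), axis=1)

# Convert the Series of tuples to a plain list if that is easier to consume.
pairs = result.tolist()
print(pairs)
```

Note that apply with axis=1 still calls the lambda once per row in Python, so it is mostly a more compact spelling of the loop rather than a vectorized operation.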
First of all, iterating through a dataframe is an anti-pattern because, 99% of the time, there’s a much more efficient vectorized method for the task you’re trying to do. That said, if you have to loop, some methods are more efficient than others.
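As a rough illustration of what "vectorized" can mean here, the sketch below gets an id-to-exchange mapping, a per-exchange count, and a filtered set of ids without any explicit loop. The three-row frame and the choice of tasks are assumptions; the OP did not say what the loop was ultimately for.

```python
import pandas as pd

# Small stand-in for the question's frame (illustrative only).
staticData = pd.DataFrame(
    {
        "country": ["CH", "GB", "IT"],
        "exchange": ["MTAA", "MTAA", "MTAA"],
    },
    index=pd.Index(
        ["BBG.MTAA.STM.S", "BBG.MTAA.CNHI.S", "BBG.MTAA.A2A.S"],
        name="unique_id",
    ),
)

# unique_id -> exchange, built in one call instead of a row loop.
exchange_by_id = staticData["exchange"].to_dict()

# Number of instruments per exchange.
counts = staticData["exchange"].value_counts()

# Index labels of the rows matching a condition, via a boolean mask.
gb_ids = staticData.index[staticData["country"] == "GB"].tolist()

print(exchange_by_id["BBG.MTAA.STM.S"], counts["MTAA"], gb_ids)
```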
To iterate through a specific column, use items():
for idx, value in df['exchange'].items():
    ...  # do something
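A filled-in version of that loop, as a minimal sketch on a two-row stand-in frame (the frame and the message format are made up for illustration): each iteration yields the index label and the column value as a pair.

```python
import pandas as pd

# Two-row stand-in for the question's frame (illustrative only).
df = pd.DataFrame(
    {"country": ["CH", "GB"], "exchange": ["MTAA", "MTAA"]},
    index=pd.Index(["BBG.MTAA.STM.S", "BBG.MTAA.CNHI.S"], name="unique_id"),
)

lines = []
for idx, value in df["exchange"].items():
    # idx is the index label (unique_id), value is the cell in 'exchange'.
    lines.append(f"{idx} trades on {value}")

print(lines[0])  # BBG.MTAA.STM.S trades on MTAA
```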
To iterate through a dataframe, use itertuples():
# e.g. to access the `exchange` values as in the OP
for row in df.itertuples():
    print(row.Index, row.exchange)
items() creates a zip object from a Series, while itertuples() creates namedtuples where you can refer to specific values by the column name.
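To show what those namedtuples look like, here is a short sketch on a two-row stand-in frame (an assumption for illustration). By default the index label is exposed as the Index field, and the index=False and name=None arguments switch to plain tuples without the index.

```python
import pandas as pd

# Two-row stand-in for the question's frame (illustrative only).
df = pd.DataFrame(
    {
        "narrow_sector": ["Semiconductors", "Machinery-Diversified"],
        "country": ["CH", "GB"],
        "exchange": ["MTAA", "MTAA"],
    },
    index=pd.Index(["BBG.MTAA.STM.S", "BBG.MTAA.CNHI.S"], name="unique_id"),
)

rows = list(df.itertuples())
first = rows[0]

# The index label is available as .Index, columns as attributes by name.
print(first.Index, first.exchange)  # BBG.MTAA.STM.S MTAA
print(first._fields)

# Plain tuples of column values only, without the index.
plain = list(df.itertuples(index=False, name=None))
print(plain[0])  # ('Semiconductors', 'CH', 'MTAA')
```

Attribute access only works for column names that are valid Python identifiers; other names are replaced by positional field names.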
itertuples is much faster than iterrows. For example, for a frame with 50000 rows, iterrows takes 2.4 sec to loop over all rows, while itertuples takes 62 ms (approx. 40 times faster). Since both are loops, this ratio stays roughly constant as the frame grows, so for a much larger dataframe we’re looking at a difference between a few seconds and a few minutes.
df = pd.concat([df]*10000, ignore_index=True)
%timeit list(df.itertuples())
# 62 ms ± 11 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit list(df.iterrows())
# 2.42 s ± 162 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
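%timeit is an IPython magic, so outside a notebook the comparison can be approximated with the standard timeit module. This is only a sketch: the frame is random data with 5,000 rows (smaller than the one above, to keep the run short), and the absolute numbers will depend on the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Random 5-row frame repeated to 5,000 rows; the size is an arbitrary choice.
df = pd.DataFrame(np.random.randn(5, 3), index=list("abcde"), columns=list("fgh"))
big = pd.concat([df] * 1000, ignore_index=True)

# Time three full passes over the frame with each method.
t_tuples = timeit.timeit(lambda: list(big.itertuples()), number=3)
t_rows = timeit.timeit(lambda: list(big.iterrows()), number=3)

print(f"itertuples: {t_tuples:.4f} s for 3 passes")
print(f"iterrows:   {t_rows:.4f} s for 3 passes")
```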