Drop rows in a for loop in Python
Question:
I have a (very large) pandas dataframe like the following:
Sequence
AAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAC
AAAAAAAAAAAAAAAAAAAAAAAAG
AAAAAAAAAAAAAAAAAAAAAAAAT
AAAACAGAAGGTGTCCCAATACTAT
AAAACAGATCTCGGCAGATTGGATG
AAAACAGATCTCGGTAGACTGGACG
I want to remove the rows where the fraction of A is greater than 0.80.
Here is my code:
import difflib

sequences = file[['Sequence']]
seq_A = 'A' * 25
for row in range(len(file)):
    par1 = file.iloc[row, 0]
    # compare sequence with homopolymer and check ratio of match
    ratioA = difflib.SequenceMatcher(None, par1, seq_A).ratio()
    if ratioA >= 0.80:
        sequences.drop(row, axis=0, inplace=True)
        # lista.append(row)
But when I compare this against a separate list in which I collect the indices of the matching rows (without deleting anything), the number of indices does not match the number of rows deleted.
Thank you very much!
Answers:
You should generally avoid loops with pandas. Here is how you can do it:
df.loc[df['Sequence'].str.count('A') / df['Sequence'].str.len() <= 0.8]
produces:
                    Sequence
4  AAAACAGAAGGTGTCCCAATACTAT
5  AAAACAGATCTCGGCAGATTGGATG
6  AAAACAGATCTCGGTAGACTGGACG
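To check the row counts the question asks about, it may help to keep the boolean mask in a variable so the number of removed rows can be read off directly. The following is a minimal sketch, assuming the DataFrame is called `df` and holds the seven sample sequences from the question; the names `frac_a`, `to_drop`, `n_removed` and `filtered` are made up for illustration.

```python
import pandas as pd

# Sample data copied from the question; `df` is an assumed name.
df = pd.DataFrame({
    "Sequence": [
        "AAAAAAAAAAAAAAAAAAAAAAAAA",
        "AAAAAAAAAAAAAAAAAAAAAAAAC",
        "AAAAAAAAAAAAAAAAAAAAAAAAG",
        "AAAAAAAAAAAAAAAAAAAAAAAAT",
        "AAAACAGAAGGTGTCCCAATACTAT",
        "AAAACAGATCTCGGCAGATTGGATG",
        "AAAACAGATCTCGGTAGACTGGACG",
    ]
})

# Fraction of 'A' characters in each sequence.
frac_a = df["Sequence"].str.count("A") / df["Sequence"].str.len()

# True for the rows that should be removed (more than 80% A).
to_drop = frac_a > 0.8

# Number of rows flagged for removal, and the remaining rows.
n_removed = int(to_drop.sum())
filtered = df.loc[~to_drop]

print(n_removed)
print(filtered)
```

Because the mask is computed once and applied once, `len(df) - len(filtered)` always equals `n_removed`. Note also that `difflib.SequenceMatcher.ratio()` is a general similarity score (2*M/T, with M the number of matched characters and T the combined length of both strings), so it can differ from the plain fraction of A computed here.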