How to avoid generating unwanted blank columns when looping through Pandas dataframe?
Question:
I’ve learnt much from the Stack Overflow community, thanks to all of you. I can’t find an answer to this question anywhere, so I’d appreciate your help.
I have created a smallish database (899 * 10) of scientific papers. I want to write a script to print the title and abstract of each, and then make a (human, non-automatable) decision to include or not include in a systematic review.
I have come close with the following script, which allows me to update the ‘decision’ column for each paper, and to save and exit so I don’t have to do it all at once.
    import pandas as pd
    import numpy as np

    print "Please type the path of the database you would like to assess"
    path = raw_input('>>> ')
    data = pd.read_csv(path)
    if 'jim_decision' not in data:
        data['jim_decision'] = pd.Series(np.nan)

    def decision_maker(dataframe):
        current_row = 0
        while True:
            if pd.isnull(dataframe['jim_decision'][current_row]):
                print "\n\n Title:\n\n %s \n Abstract:\n %s\n\n\n" % (dataframe.Title[current_row], dataframe.Abstract[current_row])
                decision = raw_input("From the title and abstract, should this article be included for review of full manuscript?\n\nType 'Y' or 'N', or 'Save' to exit: ")
                if decision == 'Save':
                    dataframe.to_csv(path)
                    print "Your changes have been saved"
                    break
                else:
                    dataframe['jim_decision'][current_row] = decision
            current_row += 1

    decision_maker(data)
However, for some reason, every time I run it, I get an extra column called ‘Unnamed: [X]’, simply containing the index number, added before the first pandas column. I can’t work out where it comes from, how to get rid of it, or whether (as I presume) it risks contaminating the data.
I’m fairly new to all this, so I’m sure this isn’t very pretty or pythonic, but I’m just trying to learn to use python/pandas to make my research life easier… Any input would be gratefully received!
Answers:
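In case anyone comes across this in future: by default, pandas writes the index of the data frame to the CSV as an extra, unnamed first column. When the file is read back in, that column is loaded as ‘Unnamed: 0’, and each further save-and-reload cycle adds another one.
If you don’t want this, pass the index=False argument to to_csv, for example: df.to_csv('./filepath.csv', index=False)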
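To make the behaviour concrete, here is a minimal sketch of the round trip. It assumes Python 3 and a tiny made-up data frame (the titles and abstracts are placeholders, not the asker's data), and it writes to in-memory strings rather than a real file so it can be run anywhere. It also shows two possible ways of dealing with a file that was already saved with the index: reading it back with index_col=0, or dropping the ‘Unnamed’ columns after loading.

```python
import io

import pandas as pd

# Hypothetical stand-in for the papers database.
df = pd.DataFrame({
    "Title": ["Paper A", "Paper B"],
    "Abstract": ["First abstract", "Second abstract"],
})

# Default behaviour: the index is written as an unnamed first column,
# which read_csv then loads as a regular column called 'Unnamed: 0'.
csv_with_index = df.to_csv()
reloaded_default = pd.read_csv(io.StringIO(csv_with_index))

# With index=False the index is not written, so nothing extra comes back.
csv_without_index = df.to_csv(index=False)
reloaded_clean = pd.read_csv(io.StringIO(csv_without_index))

# For a file that was already saved with the index, index_col=0 tells
# read_csv to treat the first column as the index instead of as data.
reloaded_index_col = pd.read_csv(io.StringIO(csv_with_index), index_col=0)

# Alternatively, drop any columns whose name starts with 'Unnamed'.
unnamed = reloaded_default.columns.str.startswith("Unnamed")
repaired = reloaded_default.loc[:, ~unnamed]

print(list(reloaded_default.columns))
print(list(reloaded_clean.columns))
print(list(reloaded_index_col.columns))
print(list(repaired.columns))
```

The extra column only duplicates the row numbers, so it does not alter the other columns; saving with index=False from then on should stop new ones from appearing.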