Extracting conditional data from multiple csv files
Question:
I’m new to Python and would like to extract rows from several CSV (or rather TSV) files into one new Excel file, with a new column that identifies the source of each row.
My code for doing it just for one file is:
import pandas as pd
df = pd.read_csv('C:/Users/filename.tsv', names=['c1', 'c2', 'c3', 'c4'], delimiter='\t')
result = df.loc[df['c2'].isin(['name'])]
result.to_excel(r'C:/Users/filenamenew.xlsx')
But how do I do it for several files, like filename1.tsv, filename2.tsv, filename3.tsv…?
Answers:
You can iterate over the files in a for loop: read each file into a dataframe, add a new column containing the source file name, and append the dataframe to a list. At the end, use pd.concat() to concatenate all the dataframes into a single one, then save it as an Excel sheet.
import pandas as pd
filenames = ["C:/Users/filename1.tsv", "C:/Users/filename2.tsv"]  # ... add the remaining files
dataframes = []
for filename in filenames:
    # "\t" is the tab character; names= supplies the column names for header-less files
    df = pd.read_csv(filename, names=["c1", "c2", "c3", "c4"], delimiter="\t")
    df["filename"] = filename  # record which file each row came from
    dataframes.append(df)
pd.concat(dataframes).to_excel(r"C:/Users/filenamenew.xlsx")
If you need to filter which rows to keep from each dataframe, do it before appending the dataframe to the list:
import pandas as pd
filenames = ["C:/Users/filename1.tsv", "C:/Users/filename2.tsv"]  # ... add the remaining files
dataframes = []
for filename in filenames:
    df = pd.read_csv(filename, names=["c1", "c2", "c3", "c4"], delimiter="\t")
    df["filename"] = filename
    df = df.loc[df["c2"].isin(["name"])]  # here you can filter
    dataframes.append(df)
pd.concat(dataframes).to_excel(r"C:/Users/filenamenew.xlsx")
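The loop above can also be wrapped in a small helper, which makes it easier to try on a couple of test files before writing the Excel sheet. This is only a sketch: the function name collect_rows and its parameters are made up for illustration, and it assumes every file has exactly four tab-separated columns and no header row.

```python
import pandas as pd


def collect_rows(filenames, wanted, column="c2"):
    """Hypothetical helper: read each TSV, keep matching rows, tag the source."""
    columns = ["c1", "c2", "c3", "c4"]  # assumed layout: four columns, no header row
    dataframes = []
    for filename in filenames:
        # sep="\t" reads tab-separated files; names= supplies the column names
        df = pd.read_csv(filename, names=columns, sep="\t")
        # keep only rows whose `column` value is in `wanted`
        df = df.loc[df[column].isin(wanted)].copy()
        df["filename"] = str(filename)  # remember which file each row came from
        dataframes.append(df)
    # pd.concat raises ValueError if `filenames` was empty
    # ignore_index=True renumbers the combined rows from 0
    return pd.concat(dataframes, ignore_index=True)
```

The result can then be written with collect_rows(filenames, ["name"]).to_excel(r"C:/Users/filenamenew.xlsx", index=False). Writing .xlsx files needs an Excel writer package such as openpyxl to be installed.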
Assuming you know the names of the TSV files in advance, you can put them in a list, loop over it, and use pd.concat() to append each one to the final dataframe.
import pandas as pd
input_files = ["filename1.tsv", "filename2.tsv", "filename3.tsv"]
col = ["c1", "c2", "c3", "c4"]
final_df = pd.DataFrame(columns=col)
for i in input_files:
    # read_csv takes names= (not columns=) to set the column names
    df = pd.read_csv(i, delimiter="\t", names=col)
    df["source"] = i
    final_df = pd.concat([final_df, df])
final_df.to_excel("C:/Users/filenamenew.xlsx", index=False)
If you don’t want to write the filenames in the list by hand, you can collect them from a folder using the os module, like this:
import pandas as pd
import os
folder = "C:/Path/To/The/Folder"
input_files = os.listdir(folder)
input_files = [f for f in input_files if f.endswith(".tsv")]  # filter for tsv files only
col = ["c1", "c2", "c3", "c4"]
final_df = pd.DataFrame(columns=col)
for i in input_files:
    # os.listdir() returns bare file names, so join them with the folder before reading
    df = pd.read_csv(os.path.join(folder, i), delimiter="\t", names=col)
    df["source"] = i
    final_df = pd.concat([final_df, df])
final_df.to_excel("C:/Users/filenamenew.xlsx", index=False)
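As an alternative sketch, pathlib can find the files and hand back paths that already include the folder, so the folder and file name do not have to be joined by hand. The function name read_folder and its pattern parameter are hypothetical, and the same assumptions apply as before (four tab-separated columns, no header row).

```python
from pathlib import Path

import pandas as pd


def read_folder(folder, pattern="*.tsv"):
    """Hypothetical helper: combine every file in `folder` matching `pattern`."""
    col = ["c1", "c2", "c3", "c4"]  # assumed layout: four columns, no header row
    frames = []
    # glob() yields paths that include the folder; sorted() gives a stable order
    for path in sorted(Path(folder).glob(pattern)):
        df = pd.read_csv(path, sep="\t", names=col)
        df["source"] = path.name  # file name only, without the folder
        frames.append(df)
    if not frames:
        # pd.concat raises on an empty list, so return an empty frame instead
        return pd.DataFrame(columns=col + ["source"])
    # a single concat at the end avoids re-copying the data on every iteration
    return pd.concat(frames, ignore_index=True)
```

Collecting the frames in a list and concatenating once is a design choice made here, not something the answer above requires; with only a handful of files either approach works.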