Extracting conditional data from multiple csv files

Question

I’m new to Python and would like to extract rows from several CSV (or rather TSV) files into one new Excel file, with a new column identifying the source file of each row.

My code for doing it just for one file is:

import pandas as pd

df = pd.read_csv('C:/Users/filename.tsv', names=['c1', 'c2', 'c3', 'c4'], delimiter='\t')

result = df.loc[df['c2'].isin(['name'])]

result.to_excel(r'C:/Users/filenamenew.xlsx')

But how do I do it for several files, such as filename1.tsv, filename2.tsv, filename3.tsv…?

Asked By: Chras

Answer 1

You can iterate over the files in a for loop: read each file into a DataFrame, add a column containing the source file name, and append the DataFrame to a list. At the end, use pd.concat() to combine all the DataFrames into a single one, then save it as an Excel sheet.

import pandas as pd

filenames = ["C:/Users/filename1.tsv", "C:/Users/filename2.tsv", ...]

dataframes = []
for filename in filenames:
    df = pd.read_csv(filename, names=["c1", "c2", "c3", "c4"], delimiter="\t")
    df["filename"] = filename
    dataframes.append(df)

pd.concat(dataframes).to_excel(r"C:/Users/filenamenew.xlsx")

If you need to filter which rows to keep from each DataFrame, you can do so before appending it to the list:

import pandas as pd

filenames = ["C:/Users/filename1.tsv", "C:/Users/filename2.tsv", ...]

dataframes = []
for filename in filenames:
    df = pd.read_csv(filename, names=["c1", "c2", "c3", "c4"], delimiter="\t")
    df["filename"] = filename
    df = df.loc[(df['c2'].isin(['name']))]  # here you can filter
    dataframes.append(df)

pd.concat(dataframes).to_excel(r"C:/Users/filenamenew.xlsx")
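
The same loop can also be wrapped in a small helper so that the file list and the filter values become parameters. This is only a sketch: the name collect_rows and its parameters are made up for illustration, and it assumes tab-separated files without a header row, as in the question. Note that pd.concat() raises an error if the list of files is empty.

```python
import pandas as pd

COLUMNS = ["c1", "c2", "c3", "c4"]  # column names taken from the question


def collect_rows(filenames, wanted, column="c2"):
    """Read each TSV, keep rows whose `column` value is in `wanted`, tag the source.

    Hypothetical helper, not part of pandas.
    """
    frames = []
    for filename in filenames:
        df = pd.read_csv(filename, names=COLUMNS, delimiter="\t")
        # Filter first; .copy() so the new column is added to an independent frame
        df = df.loc[df[column].isin(wanted)].copy()
        df["filename"] = str(filename)  # record where each row came from
        frames.append(df)
    # ignore_index=True renumbers the rows 0..n-1 across all files
    return pd.concat(frames, ignore_index=True)
```

Calling collect_rows(filenames, ["name"]).to_excel(r"C:/Users/filenamenew.xlsx", index=False) would then write the combined rows without the index column.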

Answered By: Matteo Zanoni

Answer 2

Assuming you know the names of the TSV files in advance, you can put them in a list, loop over it, and use pd.concat() to append each one to the final DataFrame.

import pandas as pd

input_files=["filename1.tsv", "filename2.tsv", "filename3.tsv"]
col=["c1", "c2", "c3", "c4"]

final_df=pd.DataFrame(columns=col)

for i in input_files:
    df=pd.read_csv(i, delimiter="\t", names=col)
    df["source"]=i
    final_df=pd.concat([final_df, df])

final_df.to_excel("C:/Users/filenamenew.xlsx", index=False)

If you don’t want to write the filenames by hand, you can collect them from a folder using the os module, like this:

import pandas as pd
import os

folder="C:/Path/To/The/Folder"
input_files=[f for f in os.listdir(folder) if f.endswith(".tsv")] #filter for tsv files only
col=["c1", "c2", "c3", "c4"]

final_df=pd.DataFrame(columns=col)

for i in input_files:
    # os.listdir returns bare file names, so join each one with the folder path
    df=pd.read_csv(os.path.join(folder, i), delimiter="\t", names=col)
    df["source"]=i
    final_df=pd.concat([final_df, df])

final_df.to_excel("C:/Users/filenamenew.xlsx", index=False)
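
A possible variation, sketched under assumptions: calling pd.concat() inside the loop copies the accumulated data on every iteration, so for many files it may be preferable to collect the DataFrames in a list and concatenate once at the end. The sketch below also uses pathlib to find the files. The function name combine_folder and its parameters are hypothetical, and the files are assumed to be tab-separated with no header row.

```python
from pathlib import Path

import pandas as pd

COLUMNS = ["c1", "c2", "c3", "c4"]  # column names taken from the question


def combine_folder(folder, pattern="*.tsv"):
    """Concatenate every file in `folder` matching `pattern`, tagging the source.

    Hypothetical helper, not part of pandas.
    """
    frames = []
    for path in sorted(Path(folder).glob(pattern)):  # sorted for a stable order
        df = pd.read_csv(path, delimiter="\t", names=COLUMNS)
        df["source"] = path.name  # file name without the folder part
        frames.append(df)
    if not frames:
        # pd.concat raises on an empty list, so return an empty frame instead
        return pd.DataFrame(columns=COLUMNS + ["source"])
    return pd.concat(frames, ignore_index=True)  # single concat at the end
```

The result can be written out as before, for example with combine_folder("C:/Path/To/The/Folder").to_excel("C:/Users/filenamenew.xlsx", index=False).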

Answered By: Liutprand
