Importing multiple .cnv files, extract filename, and attach as a column
Question:
I have multiple .cnv files that I can import and concatenate with the following code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import glob
# Get CSV files list from a folder
path ='/Users/mariacristinaalvarez/Documents/High3/'
csv_files = glob.glob(path + "/*_high3.cnv")
# Read each CSV file into DataFrame
# This creates a list of dataframes
df_list = (pd.read_csv(file, encoding="ISO-8859-1",delim_whitespace=True,skiprows=316 , header=None) for file in csv_files1)
# Concatenate all DataFrames
stations_df = pd.concat(df_list)
This code works for me to get all files concatenated into one, but I would like to have a column with the filename that is contained between ‘‘
.
For example:
filename = ‘HLY2202_008_high3_predown_av1dbar.cnv’
I would like to only extract the numbers after HLY2202 AND before _high3
So the return should be "008"
I want to do this for each file and add the name as a column so it becomes a identifier when I do explorative data analysis.
Answers:
Use pathlib and assign
the filename/substring you need to a new column.
import pandas as pd
from pathlib import Path
path = Path("/Users/mariacristinaalvarez/Documents/High3/")
csv_files = path.glob("*_high3.cnv")
df_list = [pd.read_csv(file, encoding="ISO-8859-1", delim_whitespace=True, skiprows=316, header=None)
.assign(filename=file.stem.split("_")[1]) for file in csv_files]
stations_df = pd.concat(df_list)
You can create a dict of dataframes instead of a list. The key will be used by pd.concat
to create a new index:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pathlib
# Use high-level module pathlib rather than os/glob
path = pathlib.Path('/Users/mariacristinaalvarez/Documents/High3/')
csv_files = path.glob('*_high3_*.cnv')
# CSV parameters
params = {'encoding': 'ISO-8859-1', 'delim_whitespace': True,
'skiprows': 316, 'header': None}
stations_df = (pd.concat({file.stem.split('_')[1]: pd.read_csv(file, **params)
for file in csv_files}, names=['id'], axis=0)
.reset_index(level='id').reset_index(drop=True))
Output:
>>> stations_df
id date value
0 008 2020-01-01 35 # file HLY2202_008_high3_predown_av1dbar.cnv
1 008 2020-01-02 40
2 008 2020-01-03 45
3 009 2020-01-01 35 # HLY2202_009_high3_predown_av1dbar.cnv
4 009 2020-01-02 40
5 009 2020-01-03 45
I have multiple .cnv files that I can import and concatenate with the following code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import glob
# Get CSV files list from a folder
path ='/Users/mariacristinaalvarez/Documents/High3/'
csv_files = glob.glob(path + "/*_high3.cnv")
# Read each CSV file into DataFrame
# This creates a list of dataframes
df_list = (pd.read_csv(file, encoding="ISO-8859-1",delim_whitespace=True,skiprows=316 , header=None) for file in csv_files1)
# Concatenate all DataFrames
stations_df = pd.concat(df_list)
This code works for me to get all files concatenated into one, but I would like to have a column with the filename that is contained between ‘‘
.
For example:
filename = ‘HLY2202_008_high3_predown_av1dbar.cnv’
I would like to only extract the numbers after HLY2202 AND before _high3
So the return should be "008"
I want to do this for each file and add the name as a column so it becomes a identifier when I do explorative data analysis.
Use pathlib and assign
the filename/substring you need to a new column.
import pandas as pd
from pathlib import Path
path = Path("/Users/mariacristinaalvarez/Documents/High3/")
csv_files = path.glob("*_high3.cnv")
df_list = [pd.read_csv(file, encoding="ISO-8859-1", delim_whitespace=True, skiprows=316, header=None)
.assign(filename=file.stem.split("_")[1]) for file in csv_files]
stations_df = pd.concat(df_list)
You can create a dict of dataframes instead of a list. The key will be used by pd.concat
to create a new index:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pathlib
# Use high-level module pathlib rather than os/glob
path = pathlib.Path('/Users/mariacristinaalvarez/Documents/High3/')
csv_files = path.glob('*_high3_*.cnv')
# CSV parameters
params = {'encoding': 'ISO-8859-1', 'delim_whitespace': True,
'skiprows': 316, 'header': None}
stations_df = (pd.concat({file.stem.split('_')[1]: pd.read_csv(file, **params)
for file in csv_files}, names=['id'], axis=0)
.reset_index(level='id').reset_index(drop=True))
Output:
>>> stations_df
id date value
0 008 2020-01-01 35 # file HLY2202_008_high3_predown_av1dbar.cnv
1 008 2020-01-02 40
2 008 2020-01-03 45
3 009 2020-01-01 35 # HLY2202_009_high3_predown_av1dbar.cnv
4 009 2020-01-02 40
5 009 2020-01-03 45