Python Pandas – Read csv with commented header line

Question:

I want to read and process a csv file with pandas. The file (as seen below) contains multiple header lines which are indicated by a # tag. I can import that file easily by using

import pandas as pd

file = "data.csv"
data = pd.read_csv(file, delimiter="s+",
                   names=["Time", "Cd", "Cs", "Cl", "CmRoll", "CmPitch", "CmYaw", "Cd(f)",
                           "Cd(r)", "Cs(f)", "Cs(r)", "Cl(f)", "Cl(r)"],
                   skiprows=13)

However, I have a lot of such files with different header names and I don’t want to name them (Time Cd Cs ...) manually. Also the number of commented lines is different between each file. So I want to automate that task.

Do I have to use something like regular expression here, before passing the data into a pandas dataframe?

Thanks for any advise.

And yes, the header names are also beginning with an #.

data.csv:

# Force coefficients    
# dragDir               : (9.9735673312816520e-01 7.2660490528994301e-02 0.0000000000000000e+00)
# sideDir               : (0.0000000000000000e+00 0.0000000000000000e+00 -1.0000000000000002e+00)
# liftDir               : (-7.2660490528994315e-02 9.9735673312816520e-01 0.0000000000000000e+00)
# rollAxis              : (9.9735673312816520e-01 7.2660490528994301e-02 0.0000000000000000e+00)
# pitchAxis             : (0.0000000000000000e+00 0.0000000000000000e+00 -1.0000000000000002e+00)
# yawAxis               : (-7.2660490528994315e-02 9.9735673312816520e-01 0.0000000000000000e+00)
# magUInf               : 4.5000000000000000e+01
# lRef                  : 5.9399999999999997e-01
# Aref                  : 3.5639999999999999e-03
# CofR                  : (1.4999999999999999e-01 0.0000000000000000e+00 0.0000000000000000e+00)
#
# Time                      Cd                          Cs                          Cl                          CmRoll                      CmPitch                     CmYaw                       Cd(f)                       Cd(r)                       Cs(f)                       Cs(r)                       Cl(f)                       Cl(r)                   
5e-06                       1.8990180226147195e+00  1.4919925634649792e-11  2.1950119509976829e+00  -1.1085971520784955e-02 -1.0863798447281650e+00 9.5910040927874810e-03  9.3842303978657482e-01  9.6059498282814471e-01  9.5910041002474442e-03  -9.5910040853275178e-03 1.1126130770676479e-02  2.1838858202270064e+00
1e-05                       2.1428508927716594e+00  1.0045114197556737e-08  2.5051633252700962e+00  -1.2652317494411272e-02 -1.2367567798452046e+00 1.0822379290263353e-02  1.0587731288914184e+00  1.0840777638802410e+00  1.0822384312820453e-02  -1.0822374267706254e-02 1.5824882789843508e-02  2.4893384424802525e+00
...
Asked By: Sunsheep

||

Answers:

A little bit of regex could help. This is not the most beautiful of solutions so feel free to post a better solution.

Let’s read the first 50 rows of any file to find the last occurrence of the hash which should be the column name.

  • ^ asserts position at start of a line
  • # matches the character # literally (case sensitive)

Code:

import re
n_rows = 50

path_ = 'your_file_location'

with open(path_,'r') as f:
    data = []
    for i in range(n_rows): # read only 50 rows here. 
        for line in f:
            if re.match('^#',line):
                data.append(line)

start_col = max(enumerate(data))[0]


df = pd.read_csv(path_,sep='s+',skiprows=start_col) # use your actual delimiter.

          #      Time            Cd        Cs        Cl    CmRoll   CmPitch  
0  0.000005  1.899018  1.491993e-11  2.195012 -0.011086 -1.086380  0.009591   
1  0.000010  2.142851  1.004511e-08  2.505163 -0.012652 -1.236757  0.010822   

      CmYaw     Cd(f)     Cd(r)     Cs(f)     Cs(r)     Cl(f)  Cl(r)  
0  0.938423  0.960595  0.009591 -0.009591  0.011126  2.183886    NaN  
1  1.058773  1.084078  0.010822 -0.010822  0.015825  2.489338    NaN  

Edit, handling the # in the column name.

We can do this in two steps. We can read in 0 rows but slice the header column.

First read in the file from the header row, but set the header argument to None so no headers will be set.

We can then set the column headers manually:

df = pd.read_csv(path_,sep='s+',skiprows=start_col + 1, header=None)
df.columns = pd.read_csv(path_,sep='s+',skiprows=start_col,nrows=0).columns[1:]

print(df)

       Time        Cd            Cs        Cl    CmRoll   CmPitch     CmYaw  
0  0.000005  1.899018  1.491993e-11  2.195012 -0.011086 -1.086380  0.009591   
1  0.000010  2.142851  1.004511e-08  2.505163 -0.012652 -1.236757  0.010822   

      Cd(f)     Cd(r)     Cs(f)     Cs(r)     Cl(f)     Cl(r)  
0  0.938423  0.960595  0.009591 -0.009591  0.011126  2.183886  
1  1.058773  1.084078  0.010822 -0.010822  0.015825  2.489338 
Answered By: Umar.H

What about extracting the header before you read the file?
We only assume that your header lines start with #. Extraction of the header as well as its position in the file is automated. We also ensure that no more lines than necessary are read (except the first data line).

with open(file) as f:
    line = f.readline()
    cnt = 0
    while line.startswith('#'):
        prev_line = line
        line = f.readline()
        cnt += 1
        # print(prev_line)

header = prev_line.strip().lstrip('# ').split()

df = pd.read_csv(file, delimiter="s+",
                   names=header,
                   skiprows=cnt
           )

With this, you can also proccess the other header lines. It also gives you the position of the header in the file.

Answered By: Stefan

To simplify it, and to save time without using loops, you can create 2 dataframes for # commented rows, and the rest.
From those commented rows take last one – that’s your header, and then merge data dataframe and this title using concat() also if it’s neccesary to assign first row as header you can use df.columns=df.iloc[0]

df = pd.DataFrame({
    'A':['#test1 : (000000)','#test1 (000000)','#test1 (000000)','#test1 (000000)','#Time (000000)','5e-06','1e-05'],
})
print(df)
   

                A
0  #test1 : (000000)
1    #test1 (000000)
2    #test1 (000000)
3    #test1 (000000)
4     #Time (000000)
5              5e-06
6              1e-05

df_header = df[df.A.str.contains('^#')]
print(df_header)
         

          A
0  #test1 : (000000)
1    #test1 (000000)
2    #test1 (000000)
3    #test1 (000000)
4     #Time (000000)
df_data = df[~df.A.str.contains('^#')]
print(df_data)
       A
5  5e-06
6  1e-05

df = (pd.concat([df_header.iloc[[-1]],df_data])).reset_index(drop=True)
df.A=df.A.str.replace(r'^#',"")



print(df)
          

     A
0  Time (000000)
1          5e-06
2          1e-05
Answered By: sygneto

Assuming that comments always start with a single ‘#’ and the header is in the last commented line:

import csv

def read_comments(csv_file):
    for row in csv_file:
        if row[0] == '#':
            yield row.split('#')[1].strip()

def get_last_commented_line(filename):
    with open(filename, 'r', newline='') as f:
        decommented_lines = [line for line in csv.reader(read_comments(f))]
        header = decommented_lines[-1]
        skiprows = len(decommented_lines)
        return header, skiprows

header, skiprows = get_last_commented_line(path)
pd.read_csv(path, names=header, skiprows=skiprows)
Answered By: Wouter

This should do, it’s easy and efficient, it keeps variables at the minimum and it doesn’t require any input aside from the filename.

with open(file, 'r') as f:
    for line in f:
        if line.startswith('#'):
            header = line
        else:
            break #stop when there are no more #

header = header[1:].strip().split()

data = pd.read_csv(file, delimiter="s+", comment='#', names=header)

You first open the file and read only the commented line (it will be fast and memory-efficient). The last valid line will be the final header, which will be cleaned and converted to a list. Finally, you open the file with pandas.read_csv() with comment='#', which will skip the commented lines, and names=header.

Answered By: alec_djinn
# Read the lines in file
with open(file) as f:
    lines = f.readlines()

# Last commented line is header
header = [line for line in lines if line.startswith('#')][-1]

# Strip line and remove '#' 
header = header[1:].strip().split()

df = pd.read_csv(file, delimiter="s+", names=header, comment='#')
Answered By: Nikil Kumar
Categories: questions Tags: , , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.