How to create a pandas dataframe from a txt file with comments?

Question:

I need to create a pandas dataframe based on 4 txt files with comments (to skip while reading) based on the following structure:

# Moteur conçu par le Poly Propulsion Lab (PPL)
nom=Tondeuse

# Propriétés générales
hauteur=0.5
masse=20.0
prix=110.00

# Propriétés du moteur
impulsion specifique=80

and

# Moteur conçu par le Poly Propulsion Lab (PPL)
nom=Civic VTEC

# Propriétés générales
hauteur=2.0
masse=3000.0
prix=2968.00

# Propriétés du moteur
impulsion specifique=205

and

# Moteur conçu par le Poly Propulsion Lab (PPL)
nom=VelociRAPTOR

# Propriétés générales
hauteur=4.0
masse=2000.0
prix=6000.00

# Propriétés du moteur
impulsion specifique=250

and

# Moteur conçu par le Poly Propulsion Lab (PPL)
nom=La Puissance

# Propriétés générales
hauteur=12.0
masse=15000.0
prix=39000.00

# Propriétés du moteur
impulsion specifique=295

That’s the result I need to have:

            nom  hauteur    masse     prix  impulsion specifique
0      Tondeuse      0.5     20.0    110.0                    80
1    Civic VTEC      2.0   3000.0   2968.0                   205
2  VelociRAPTOR      4.0   2000.0   6000.0                   250
3  La Puissance     12.0  15000.0  39000.0                   295

I don’t know if it’s possible, but that’s what i was asked to do

Asked By: Jotunes

||

Answers:

Your data files look very close to configuration files. You can use configparser to generate a dictionary from each file:

from pathlib import Path
from configparser import ConfigParser

data = []
for file in Path("data").glob("*.txt"):
    parser = ConfigParser()
    # INI file requires a section header. Yours don't have one.
    # So let's give it one called DEFAULT
    parser.read_string("[DEFAULT]n" + file.read_text())
    data.append(dict(parser.items("DEFAULT")))

df = pd.DataFrame(data)
Answered By: Code Different

welcome to Stackoverflow! 🙂

If your txt files have their content like you just showed, you could read them in using pandas as a CSV file.

The pandas.read_csv function has multiple things that will help you here:

  • It outputs a dataframe, which is the format you would like to end up with
  • Has a comment input argument, with which you can define lines that are to be ignored
  • You can use the = sign as a separator, which will make you able to split up your data in the wanted sections

Now, let’s try to read one of your files using the read_csv function:

import pandas as pd
df = pd.read_csv(file, comment='#', sep='=', header=None)
df
                    nom  Tondeuse                                                                                                                                                                                                                                               
0               hauteur       0.5                                                                                                                                                                                                                                               
1                 masse      20.0                                                                                                                                                                                                                                               
2                  prix     110.0                                                                                                                                                                                                                                               
3  impulsion specifique      80.0

We’re not completely there yet. We want to remove that index column that gives no info, and we want to transpose the dataframe (rows <-> columns) to be able to concatenate all dataframes together. Let’s do it!

import pandas as pd
df = pd.read_csv(file, comment='#', sep='=', header=None, index_col=0).T
df

0       nom hauteur masse    prix impulsion specifique                                                                                                                                                                                                                          
1  Tondeuse     0.5  20.0  110.00                   80

That’s looking way better! Putting index_col=0 makes the lefternmost column be the index column, and the .T at the end transposes your dataframe. Now we just need to put this inside of a loop and make a complete script out of it!

import pandas as pd
import glob
import os

files = glob.glob(os.path.join(path, '*.csv'))

all_dfs = []
for file in files:
    current_df = pd.read_csv(file, comment='#', sep='=', header=None, index_col=0).T
    all_dfs.append(current_df)

total_df = pd.concat(all_dfs)
total_df

0           nom hauteur    masse      prix impulsion specifique                                                                                                                                                                                                                 
1  La Puissance    12.0  15000.0  39000.00                  295                                                                                                                                                                                                                 
1    Civic VTEC     2.0   3000.0   2968.00                  205                                                                                                                                                                                                                 
1  VelociRAPTOR     4.0   2000.0   6000.00                  250                                                                                                                                                                                                                 
1      Tondeuse     0.5     20.0    110.00                   80

Notice that you still have that lefternmost column with the index number, I did not clean it out because I wasn’t sure of what you wanted there.

Also, importantly, you need to be aware that if there is a slight difference in the names of the columns in your files (e.g. impulsion specifique vs impulsion spécifique) this will bring errors. You will need to create error handling procedures for these. Or maybe enforcing a certain schema, but that is out of the scope of this question.

I hope this helps!

Answered By: Koedlt