Python: reading a CSV file and skipping a variable-length header
Question:
I am reading a number of files that start with headers of varying length, and I don’t know how to skip the “header part” until the data of interest appears. The file content looks like the sample below; I am always interested in the contents after the line "Measurement values:".
Can I somehow use pandas’ read_csv skiprows argument, combined with a search string or similar, to weed out the header part?
Any input is welcome 🙂
Data of the Experiment
Test started: Wed Mar 07 08:10:32 CET 2018
Time Revolutions Axial Force Radial Force
0 0 0 0
10 3000 0 4000
172800 3000 0 4000
172800 2000 0 4000
180000 2000 0 4000
237600 3000 0 22000
237600 2000 0 22000
244800 2000 0 22000
244800 1000 0 22000
252000 1000 0 22000
252000 3000 0 4000
259200 3000 0 4000
Critical Temperature 1: 110
Critical Temperature 2: 120
Critical Temperature 3: 120
Critical Temperature 4: 110
Critical Vibration level: 3500
Critical Torque: 7000
Measurement values:
Time: Seconds elapsed [s] Torque [Nm] Speed [1/s]
20180307081032: 210025.02 5.25 0.00
20180307081033: 210025.98 17.50 3000.00
20180307081034: 210026.97 1688.75 3000.00
.
.
Answers:
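I have used the following to skip the leading rows when reading an Excel file; you can do the same for a CSV file. Note that header=2 takes the third row as the column names and discards the rows above it, so this only helps when the header length is known in advance.
df = pandas.read_excel(excelFile, header=2)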
I am not sure if this is the correct approach.
import pandas as pd
df = pd.read_csv(r"filename.csv")
lineNumber = 0
for i, v in enumerate(df.to_string(index=False).split("\n"), 1):
    if "Measurement values" in v:
        lineNumber = i  # Find line number of "Measurement values"
        break
df = pd.read_csv(r"filename.csv", skiprows=lineNumber)  # Read file again, skipping lineNumber lines
print(df)
Output:
Time: Seconds elapsed [s] Torque [Nm] Speed [1/s]
0 20180307081032: 210025.02 5.25 0.00
1 20180307081033: 210025.98 17.50 3000.00
2 20180307081034: 210026.97 1688.75 3000.00
There should be a solution that does not read the file twice.
Very similar to Rakesh’s answer, but without reading the whole file just to find the line with “Measurement values:”
import pandas as pd
file_name = r"filename.csv"
line_number = -1
with open(file_name, "r") as in_file:
    # Scan only as far as the marker line, counting lines from 1
    for i, line in enumerate(in_file, 1):
        if line.startswith("Measurement values:"):
            line_number = i
            break
if line_number == -1:
    raise RuntimeError("Could not find end of header")
df = pd.read_csv(file_name, skiprows=line_number)
print(df)
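As a variation on the code above, the file handle that has already been advanced past the header can be passed straight to read_csv, so the file is opened only once. This is a minimal sketch, not a definitive implementation: the function name read_after_marker and its parameters are made up for illustration, and any separator options are left to the caller because the exact column delimiter of the real files is not shown here.

```python
import pandas as pd


def read_after_marker(path, marker="Measurement values:", **read_csv_kwargs):
    """Skip every line up to and including the marker line, then parse the rest.

    Hypothetical helper for illustration; extra keyword arguments are
    forwarded unchanged to pandas.read_csv.
    """
    with open(path, "r") as handle:
        # iter(handle.readline, "") yields lines until readline() returns ""
        # at end of file.
        for line in iter(handle.readline, ""):
            if line.startswith(marker):
                break
        else:
            # The loop finished without `break`: the marker never appeared.
            raise RuntimeError("Could not find end of header")
        # The handle is now positioned on the first line after the marker,
        # so read_csv starts parsing from there.
        return pd.read_csv(handle, **read_csv_kwargs)
```

For the files in the question this would be called as read_after_marker("filename.csv"), adding whatever read_csv options fit the actual column layout.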
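I’m not too familiar with pandas, but something like this should work with standard file I/O, and I hope the general strategy is transferable:
import numpy as np
with open("filename.csv", "r") as data_file:
    data_file_line = ""
    # Read line by line until the marker line is reached
    while not data_file_line.startswith("Measurement values:"):
        data_file_line = data_file.readline()
        if not data_file_line:  # End of file reached without finding the marker
            raise RuntimeError("Could not find end of header")
    # Keep the marker line plus every line after it
    data_file_lines_minus_header = np.append(data_file_line, data_file.readlines())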
I hope this proves helpful to someone!
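The same skip-until-marker strategy can also be sketched with the standard library alone, using itertools.dropwhile. The function name lines_after_marker and the shortened sample text are assumptions made for illustration; note that, unlike the versions that raise an error, this sketch simply yields nothing when the marker is missing.

```python
import io
from itertools import dropwhile, islice


def lines_after_marker(lines, marker="Measurement values:"):
    """Return an iterator over the lines that follow the first marker line.

    Hypothetical helper for illustration; `lines` is any iterable of strings,
    such as an open text file.
    """
    # dropwhile discards lines until one starts with the marker ...
    remaining = dropwhile(lambda line: not line.startswith(marker), lines)
    # ... and islice(..., 1, None) then skips the marker line itself.
    return islice(remaining, 1, None)


# Shortened stand-in for the file shown in the question.
sample = io.StringIO(
    "Data of the Experiment\n"
    "Critical Torque: 7000\n"
    "Measurement values:\n"
    "Time: Seconds elapsed [s] Torque [Nm] Speed [1/s]\n"
    "20180307081032: 210025.02 5.25 0.00\n"
)
data_lines = [line.rstrip("\n") for line in lines_after_marker(sample)]
```

Because the helper is lazy, it only reads as far into the file as the consumer asks for.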