Parse CSV headers that have singular spaces inbetween header names but larger spaces between columns

Question:

I am parsing netstat data and my data looks like this

Proto Local Address Foreign Adress State PID Process name
TCP [0.0.0.0:7] 0.0.0.0:0 LISTENING 4112 tcpsvcs.exe
TCP 0.0.0.0:111 0.0.0.0:0 LISTENING 4 Can not obtain ownership information

When I try to parse this using this Python command

pd.read_csv("C:\Users\xxxxx\Documents\test.csv", on_bad_lines='skip', header=1, delim_whitespace= True)

my data ends up looking like this (Notice how Process_name isn’t properly filled out and everything is being shifted to the left due to the space between Local address and foreign address):

Proto Local Address Foreign Address State PID Process_name
TCP [0.0.0.0:7] 0.0.0.0:0 LISTENING 4112 tcpsvcs.exe
TCP 0.0.0.0:111 0.0.0.0:0 LISTENING 4 Can not obtain ownership information

When it should end up looking like this:

Proto Local Address Foreign Address State PID Process_name
TCP [0.0.0.0:7] 0.0.0.0:0 LISTENING 4112 tcpsvcs.exe
TCP 0.0.0.0:111 0.0.0.0:0 LISTENING 4 Can not obtain ownership information

And that’s solely due to the spacing of the headers. Right now I am just adding my own headers to my CSV file but the only issue with that is my file is extremely large and appending a header row at the top of a CSV is very heavy. I was wondering if there are any ways to parse the header that I do have and ignore the singular space and only delimit the spaces that are larger than the singular space that I have.

Asked By: Zarif Rahman

||

Answers:

I cannot test it with the raw file but try this:

pd.read_csv("C:\Users\xxxxx\Documents\test.csv", on_bad_lines='skip', header=1, sep="t")
Answered By: A259

Use sep='ss+' on read_csv to separate columns that are separated by more than one space:

Example Data

sim_csv = io.StringIO(
'''  Proto  Local Address          Foreign Address        State           PID    Process_name
  TCP    0.0.0.0:80             0.0.0.0:0              LISTENING       4      tcpsvcs.exe   
  TCP    0.0.0.0:135            0.0.0.0:0              LISTENING       1552   Can not obtain ownership information''')

Execute

df = pd.read_csv(sim_csv, sep='ss+')
print(df)

Result

   Proto Local Address Foreign Address      State   PID                          Process_name 
0   TCP    0.0.0.0:80       0.0.0.0:0  LISTENING     4                            tcpsvcs.exe 
1   TCP   0.0.0.0:135       0.0.0.0:0  LISTENING  1552   Can not obtain ownership information 
Answered By: jch
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.