Reading data-file with new header in the middle of the file

Question:

I have a .txt data file which contains a number of columns with different headers. I can read the file with all columns and rows. However, my problem is that the file contains an additional header with three columns, appended after the last row of data belonging to the initial header. How do I separate the three last columns from the first ones? Additionally, I would like to remove the first of the three columns, as it is a replica of the first column, and append the other two columns column-wise to the columns at the top of the file. I have used pandas to read the file like this:

import pandas as pd

c = pd.read_csv('C:filepath.txt', sep=',', header=None,
                names=['<Title1>', '<Title2>', '<Title3>', '<Title4>', '<Title5>', '<Title6>',
                       '<Title7>', '<Title8>', '<Title9>', '<Title10>', '<Title11>', '<Title12>'],
                skiprows=[0, 1])

And the result is:

                         <Title1>  ... <Title12>
134849000   -0.420384078515376  ...    244.507248
135016000   -0.406915327374619  ...    244.507248
135183000   -0.406915327374619  ...    244.507248
135349000   -0.406915327374619  ...    244.507248
135516000   -0.406915327374619  ...    244.507248
...                        ...  ...           ... <-- (somewhere in here there is a new header with three columns)
2316226000   0.349323222511261  ...           NaN
2316393000   0.359268272664523  ...           NaN
2316560000   0.346797179431672  ...           NaN
2316726000   0.291363936474923  ...           NaN
2316893000   0.256587672540276  ...           NaN

[26188 rows x 12 columns] 

As can be seen, the "4th quadrant" of the data set (columns 4-12 of the lower rows, 1-indexed) contains NaN values, because the three-column block is appended below the last row of the first block: its rows only fill the first columns of the 12-column layout and the rest are left empty. In addition, both headers consist of two lines, of which the first is not needed, so those lines need to be skipped.
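
To illustrate (with made-up toy data, not my actual file) why the short rows end up as NaN: when a row has fewer fields than the names given to read_csv, pandas pads the missing fields with NaN.

import io
toy = io.StringIO('1,2,3,4\n'   # full-width row, like the first block
                  '5,6,7\n')    # three-column row, like the second block
print(pd.read_csv(toy, header=None, names=['a', 'b', 'c', 'd']))
#    a  b  c    d
# 0  1  2  3  4.0
# 1  5  6  7  NaN   <- missing fields are padded with NaN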

Sample-file:

<Header1>
<Title1><Title2><Title3><Title4><Title5><Title6><Title7><Title8><Title9><Title10><Title11><Title12><Title13>
134849000,-0.420384078515376,-0.46532291072594,53.3941583535493,3.94861381238115,0.999999938482075,-0.000223083188831434,-0.000166347560402173,3.08661080398315E-06,304.11793518,274.23748016,189.97101594,244.50724792
135016000,-0.406915327374619,-0.433547012456629,53.3941583535493,3.94861381238115,0.999999910346576,-0.000180534505822662,-0.000206991530844074,2.40981161937076E-06,304.0821228,274.15297698,189.97101594,244.50724792
135183000,-0.406915327374619,-0.433547012456629,53.3941583535493,3.94861381238115,0.999999992511006,-0.000151940021895918,-0.000103313480817761,1.89050478219266E-06,304.0821228,274.15297698,189.97101594,244.50724792
135349000,-0.406915327374619,-0.433547012456629,53.3941583535493,3.94861381238115,0.999999945135159,-0.000162536174319313,-7.40562207892995E-05,2.04948428941809E-06,304.0821228,274.15297698,189.97101594,244.50724792
135516000,-0.406915327374619,-0.433547012456629,53.3941583535493,3.94861381238115,0.99999997640256,-0.000243086633501367,-6.9024988784798E-05,3.36047709420528E-06,304.0821228,274.15297698,189.97101594,244.50724792
135683000,-0.406915327374619,-0.433547012456629,53.3941583535493,3.94861381238115,0.99999997640256,-0.000243086633501367,-6.9024988784798E-05,3.36047709420528E-06,304.0821228,274.15297698,189.97101594,244.50724792
135849000,-0.406915327374619,-0.433547012456629,53.3941583535493,3.94861381238115,0.999999931122814,-0.000250245794219842,-0.000134729677676283,3.5093405085021E-06,304.0821228,274.15297698,189.97101594,244.50724792
136016000,-0.406915327374619,-0.433547012456629,53.3941583535493,3.94861381238115,0.999999952747184,-0.000248275760427849,-0.000209879516698194,3.49816745295883E-06,304.0821228,274.15297698,189.97101594,244.50724792
136183000,-0.420384078515376,-0.46532291072594,53.3941583535493,3.94861381238115,0.99999992607031,-0.000294028840627048,-0.000210060717325711,4.25711234103981E-06,304.11793518,274.23748016,189.97101594,244.50724792
136349000,-0.420916391233475,-0.442738942185795,53.3941583535493,3.94861381238115,0.999999919180309,-0.00029795985581717,-0.000124844955889991,4.29227691325224E-06,304.11935424,274.17742156,189.97101594,244.50724792
136516000,-0.420384078515376,-0.46532291072594,53.3941583535493,3.94861381238115,0.999999888009148,-0.000316878274912839,-3.29402653026431E-05,4.57532859246546E-06,304.11793518,274.23748016,189.97101594,244.50724792
136683000,-0.420916391233475,-0.442738942185795,53.3941583535493,3.94861381238115,0.999999944701863,-0.000302288971167524,-0.000119271820769005,4.36801259359743E-06,304.11935424,274.17742156,189.97101594,244.50724792
136849000,-0.405802775661793,-0.444669714471277,53.3941583535493,3.94861381238115,0.999999944701863,-0.000302288971167524,-0.000119271820769005,4.36801259359743E-06,304.0791626,274.18255616,189.97101594,244.50724792
137016000,-0.420916391233475,-0.442738942185795,53.3941583535493,3.94861381238115,0.99999991055272,-0.00029252456348538,-0.000168782643050744,4.22385527217017E-06,304.11935424,274.17742156,189.97101594,244.50724792
137183000,-0.412309946883439,-0.450987020223235,53.3941583535493,3.94861381238115,0.999999942521442,-0.000255490185269549,-0.00024667166566595,3.6414759449141E-06,304.09646606,274.19935608,189.97101594,244.50724792
137349000,-0.406915327374619,-0.433547012456629,53.3941583535493,3.94861381238115,0.999999876479583,-0.000264577733448331,-0.000298287883815869,3.80576077658318E-06,304.0821228,274.15297698,189.97101594,244.50724792
137516000,-0.406915327374619,-0.433547012456629,53.3941583535493,3.94861381238115,0.999999903983449,-0.000251750438760731,-0.000355224963982992,3.60887866227011E-06,304.0821228,274.15297698,189.97101594,244.50724792
137683000,-0.391801749871831,-0.435460567656641,53.3941583535493,3.94861381238115,0.999999885967664,-0.000231035684436353,-0.000293282668086245,3.24666448882349E-06,304.04193116,274.1580658,189.97101594,244.50724792
137849000,-0.406915327374619,-0.433547012456629,53.3941583535493,3.94861381238115,0.999999885967664,-0.000231035684436353,-0.000293282668086245,3.24666448882349E-06,304.0821228,274.15297698,189.97101594,244.50724792
<Header2>
<Title13(same as Title 1)><Title14><Title15>
134849000,0.120862187115588,0
135016000,0.171543242833847,0
135183000,0.146335932645973,0
135349000,0.09773669641824,0
135516000,0.0882672298282907,0
135683000,0.124406962864472,0
135849000,0.186013875486258,0
136016000,0.219045896500945,0
136183000,0.197246332120462,0
136349000,0.150083583561413,0
136516000,0.0838562129822536,0
136683000,0.00269632558524612,0
136849000,-0.0447052988191479,0
137016000,-0.00496292706410619,0
137183000,0.0799457149607322,0
137349000,0.137388731956788,0
137516000,0.142305654943302,0
137683000,0.115943857754048,0
137849000,0.0991913228381935,0
Asked By: Chr0203


Answers:

Here are a few solutions: the first generates a new file, the second fixes the header during the read_csv operation, and the third rewrites the malformed header line in place. You might use the first one if the file will be processed many times, but you need to read all lines at least twice. Prefer the second approach if you need to read many large files only once.

solution 1: pre-process your file

Parse the file once to remove the extra headers.

# create a second file with a single (unique) header
with open('file.csv', 'r') as f_in, open('file_single_header.csv', 'w') as f_out:
    header = f_in.readline()   # first line of the file
    f_out.write(header)
    for line in f_in:          # copy the rest, skipping any repeated header line
        if line != header:
            f_out.write(line)

# then read the corrected csv file
pd.read_csv('file_single_header.csv')

solution 2: treat header lines as comments and manually assign header

# read the first line of the file to get the header and split the names
import re
with open('file.csv', 'r') as f:
    header = re.split(r'\s+', f.readline().strip())

# exclude the header lines (they start with '<') and assign the names manually
pd.read_csv('file.csv', comment='<', names=header)

NB. it is not clear from your example whether you really have comma-separated values or another separator. If it is whitespace, you need to tweak read_csv as follows. Also, if the index is in the csv file, you need to add a name for it for option 2 (here None).

# option 1
pd.read_csv('file_single_header.csv', sep=r'\s+')

# option 2
pd.read_csv('file.csv',
            comment='<',
            names=[None] + header,  # added None for the index column
            sep=r'\s+',
            index_col=0
            ).dropna(axis=0, how='all')

solution 3: fix a bad csv file

with open('file.csv', 'r') as f_in, open('file_single_header.csv', 'w') as f_out:
    i = 0
    for line in f_in:
        if line.strip().startswith('<'):      # header lines start with '<'
            if i == 1:                        # the second such line holds the column titles
                f_out.write(','.join(line.strip('<>\n').split('><')) + '\n')
            i += 1
        else:
            f_out.write(line)                 # data lines are copied unchanged
Answered By: mozway

A possible solution:

Change the index temporarily

c.reset_index(inplace=True)

Locate the rows of the two new columns that follow the second header

newcols = c.iloc[c[c.iloc[:, 1].isna()].index.min() + 2:, [1, 2]].reset_index(drop=True)

Rename the new columns

newcols.rename(columns={'<Title1>' : '<Title14>', '<Title2>' : '<Title15>'}, inplace=True)

Add the new columns, delete the rows in the second header, and restore the original index

c = pd.concat([c, newcols], axis=1).dropna().set_index('index')
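
Putting those steps together, a compact sketch of the approach (assuming c is the frame read in the question):

# temporary integer index, then extract and rename the two columns below the second header
c.reset_index(inplace=True)
newcols = (c.iloc[c[c.iloc[:, 1].isna()].index.min() + 2:, [1, 2]]
             .reset_index(drop=True)
             .rename(columns={'<Title1>': '<Title14>', '<Title2>': '<Title15>'}))
# join column-wise, drop the rows of the second block, restore the original index
c = pd.concat([c, newcols], axis=1).dropna().set_index('index')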
Answered By: josepn

I managed to find a relatively robust and simple solution for this specific dataset.

After reading the data and skipping the first header:

import numpy as np
import pandas as pd

raw_data = pd.read_csv('C:datafile.txt', sep=',', header=None, skiprows=[0, 1])

I check the first column for non-numerical values to find out where the next header is located:

a = pd.to_numeric(raw_data[0], errors='coerce').isnull()

result:

0        False
1        False
2        False
3        False
4        False
         ...  
26183    False
26184    False
26185    False
26186    False
26187    False
Name: 0, Length: 26188, dtype: bool

Then I locate the indices where the condition is True:

a = np.where(a)[0]

result:

[13093 13094]

From here I can simply slice the data belonging to the two headers, using those indices:

d = raw_data.iloc[:raw_data.index.get_loc(a[0])]
e = raw_data.iloc[raw_data.index.get_loc(a[0])+2:, :3]

In e I also slice the columns, since the second header only has three columns.

result:

d =

               0                   1   ...          11          12
0       134849000  -0.420384078515376  ...  189.971016  244.507248
1       135016000  -0.406915327374619  ...  189.971016  244.507248
2       135183000  -0.406915327374619  ...  189.971016  244.507248
3       135349000  -0.406915327374619  ...  189.971016  244.507248
4       135516000  -0.406915327374619  ...  189.971016  244.507248
...           ...                 ...  ...         ...         ...
13088  2316226000   -0.30945361835179  ...  188.914284  243.942856
13089  2316393000    -0.4099956694033  ...  188.914284  243.942856
13090  2316560000    -0.4099956694033  ...  188.914284  243.942856
13091  2316726000    -0.4099956694033  ...  188.914284  243.942856
13092  2316893000  -0.429752713005517  ...  188.914284  243.942856

[13093 rows x 13 columns]

e =

                0                   1  2
13095   134849000   0.120862187115588  0
13096   135016000   0.171543242833847  0
13097   135183000   0.146335932645973  0
13098   135349000    0.09773669641824  0
13099   135516000  0.0882672298282907  0
...           ...                 ... ..
26183  2316226000   0.349323222511261  0
26184  2316393000   0.359268272664523  0
26185  2316560000   0.346797179431672  0
26186  2316726000   0.291363936474923  0
26187  2316893000   0.256587672540276  0

[13093 rows x 3 columns]

As both data sets have a common column (the first column of each block), I use merge to join the bottom data set to the top one:

f = pd.merge(d, e, on=0)

result:

                0                 1_x  ...                 1_y 2_y
0       134849000  -0.420384078515376  ...   0.120862187115588   0
1       135016000  -0.406915327374619  ...   0.171543242833847   0
2       135183000  -0.406915327374619  ...   0.146335932645973   0
3       135349000  -0.406915327374619  ...    0.09773669641824   0
4       135516000  -0.406915327374619  ...  0.0882672298282907   0
...           ...                 ...  ...                 ...  ..
13088  2316226000   -0.30945361835179  ...   0.349323222511261   0
13089  2316393000    -0.4099956694033  ...   0.359268272664523   0
13090  2316560000    -0.4099956694033  ...   0.346797179431672   0
13091  2316726000    -0.4099956694033  ...   0.291363936474923   0
13092  2316893000  -0.429752713005517  ...   0.256587672540276   0

[13093 rows x 15 columns]

And now I have the proper data set, which I can save with .to_csv, defining my own headers!
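
For instance, a minimal sketch of that last step (the placeholder titles and the output filename are mine, not from the file):

# write the merged frame, supplying my own column headers (placeholders here)
f.to_csv('combined_data.csv',
         index=False,
         header=[f'<Title{i}>' for i in range(1, 16)])  # 15 aliases, one per merged column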

Answered By: Chr0203