Txt to Dataframe regex for inconsistent spacing

Question:

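After trying for hours to find a regex that converts a txt file to a dataframe, I have had no luck. Can you please help?

The spacing pattern is: the first separator is 1 space, the second is 2+ spaces, the third is 1 space, and every separator after that is 2+ spaces.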

00002 A000    1 Cholera due to Vibrio cholerae 01, biovar cholerae           Cholera due to Vibrio cholerae 01, biovar cholerae
00003 A001    1 Cholera due to Vibrio cholerae 01, biovar eltor              Cholera due to Vibrio cholerae 01, biovar eltor
00004 A009    1 Cholera, unspecified                                         Cholera, unspecified

How do I split this txt format into a dataframe?

Asked By: Santoo


Answers:

Using a split on 2+ spaces, then subsplits and concat:

import io
import pandas as pd

text = io.StringIO('''00002 A000    1 Cholera due to Vibrio cholerae 01, biovar cholerae           Cholera due to Vibrio cholerae 01, biovar cholerae
00003 A001    1 Cholera due to Vibrio cholerae 01, biovar eltor              Cholera due to Vibrio cholerae 01, biovar eltor
00004 A009    1 Cholera, unspecified                                         Cholera, unspecified''')

# split only on runs of 2+ whitespace characters (a regex separator needs the python engine)
df = pd.read_csv(text, sep=r'\s\s+', header=None, engine='python')

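df = pd.concat(
    [df[0].str.split(expand=True),       # "00002 A000" -> two columns
     df[1].str.split(n=1, expand=True),  # "1 Cholera ..." -> split on the first space only
     df.loc[:, 2:]],                     # remaining columns are already separated
    axis=1, ignore_index=True
)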

Output:

       0     1  2                                                   3                                                   4
0  00002  A000  1  Cholera due to Vibrio cholerae 01, biovar cholerae  Cholera due to Vibrio cholerae 01, biovar cholerae
1  00003  A001  1     Cholera due to Vibrio cholerae 01, biovar eltor     Cholera due to Vibrio cholerae 01, biovar eltor
2  00004  A009  1                                Cholera, unspecified                                Cholera, unspecified
Answered By: mozway
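
As an alternative to the split-and-concat approach above (and not part of that answer), the spacing rule from the question can also be written as one regular expression and applied line by line. This is only a minimal sketch using the standard `re` module. It assumes every line has exactly five fields and that no field contains a run of 2+ spaces. The names `LINE_RE` and `parse_line` are made up for illustration.

```python
import re

# Mirrors the spacing rule in the question:
# field, 1 space, field, 2+ spaces, field, 1 space, text, 2+ spaces, text
LINE_RE = re.compile(r'^(\S+) (\S+)\s{2,}(\S+) (.+?)\s{2,}(.+?)\s*$')

def parse_line(line):
    """Return the five fields of one line, or raise if the line does not fit."""
    match = LINE_RE.match(line)
    if match is None:
        raise ValueError(f'line does not match the expected layout: {line!r}')
    return match.groups()

sample = '''00002 A000    1 Cholera due to Vibrio cholerae 01, biovar cholerae           Cholera due to Vibrio cholerae 01, biovar cholerae
00003 A001    1 Cholera due to Vibrio cholerae 01, biovar eltor              Cholera due to Vibrio cholerae 01, biovar eltor
00004 A009    1 Cholera, unspecified                                         Cholera, unspecified'''

# one tuple of five strings per input line
rows = [parse_line(line) for line in sample.splitlines()]
```

Passing `rows` to `pd.DataFrame(rows)` should then give a five-column frame of strings, so leading zeros such as `00002` are kept. Unlike the `read_csv` approach, a line that does not fit the layout raises an error instead of being read into misaligned columns.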