Txt to Dataframe regex for inconsistent spacing
Question:
After hours of trying to find a regex that turns a txt file into a DataFrame, I have had no luck. Can you please help?
The spacing is inconsistent: the first gap is 1 space, the second is 2+ spaces, the third is 1 space, and every gap after that is 2+ spaces.
00002 A000  1 Cholera due to Vibrio cholerae 01, biovar cholerae  Cholera due to Vibrio cholerae 01, biovar cholerae
00003 A001  1 Cholera due to Vibrio cholerae 01, biovar eltor  Cholera due to Vibrio cholerae 01, biovar eltor
00004 A009  1 Cholera, unspecified  Cholera, unspecified
How do I split this txt format into a DataFrame?
Answers:
Split on runs of 2+ spaces first, then sub-split the first two resulting columns on single spaces and concat everything back together:
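import io
import pandas as pd

# Sample rows: 1 space after the first and third fields, 2+ spaces at the other gaps
text = io.StringIO('''00002 A000  1 Cholera due to Vibrio cholerae 01, biovar cholerae  Cholera due to Vibrio cholerae 01, biovar cholerae
00003 A001  1 Cholera due to Vibrio cholerae 01, biovar eltor  Cholera due to Vibrio cholerae 01, biovar eltor
00004 A009  1 Cholera, unspecified  Cholera, unspecified''')
# Split on two or more whitespace characters; a regex separator needs the python engine
df = pd.read_csv(text, sep=r'\s\s+', header=None, engine='python')
df = pd.concat(
    [df[0].str.split(expand=True),       # "00002 A000" -> two columns
     df[1].str.split(n=1, expand=True),  # "1 Cholera ..." -> split on the first space only
     df.loc[:, 2:]],                     # remaining columns are already separated
    axis=1, ignore_index=True
)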
Output:
   0      1     2  3                                                    4
0  00002  A000  1  Cholera due to Vibrio cholerae 01, biovar cholerae  Cholera due to Vibrio cholerae 01, biovar cholerae
1  00003  A001  1  Cholera due to Vibrio cholerae 01, biovar eltor     Cholera due to Vibrio cholerae 01, biovar eltor
2  00004  A009  1  Cholera, unspecified                                 Cholera, unspecified
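Since the question asked for a regex specifically, here is a rough sketch of a single-pattern alternative. It assumes every row has exactly five fields laid out as described in the question (1 space, 2+ spaces, 1 space, 2+ spaces) and that the first three fields contain no spaces. The names `ROW`, `parse_rows` and `df_regex` are made up for this illustration, and the sample rows are typed in with two spaces at the wide gaps.

```python
import re
import pandas as pd

# Sample rows using the spacing described in the question
# (two spaces stand in for the "2+ spaces" gaps; this is an assumption).
lines = [
    "00002 A000  1 Cholera due to Vibrio cholerae 01, biovar cholerae  Cholera due to Vibrio cholerae 01, biovar cholerae",
    "00003 A001  1 Cholera due to Vibrio cholerae 01, biovar eltor  Cholera due to Vibrio cholerae 01, biovar eltor",
    "00004 A009  1 Cholera, unspecified  Cholera, unspecified",
]

# field1 <1 space> field2 <2+ spaces> field3 <1 space> field4 <2+ spaces> field5
# The lazy (.+?) groups stop at the first run of two or more whitespace characters,
# so single spaces inside the descriptions are kept.
ROW = re.compile(r"^(\S+) (\S+)\s{2,}(\S+) (.+?)\s{2,}(.+?)\s*$")


def parse_rows(rows):
    """Return a DataFrame with one row per input line, five string columns each."""
    parsed = []
    for row in rows:
        match = ROW.match(row)
        if match is None:
            # Fail loudly rather than silently dropping a malformed line
            raise ValueError(f"unexpected row layout: {row!r}")
        parsed.append(match.groups())
    return pd.DataFrame(parsed)


df_regex = parse_rows(lines)
print(df_regex)
```

Compared with the split-and-concat approach, this raises an error on any line that does not fit the expected layout, which may or may not be what you want for a large file.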