Reading from a MD file to pandas dataframe

Question:

I’m doing a project with this data repository, and for each football season, on top of the CSV file with the game data, there is an extra README which has the final results. This is in some sort of table like structure, and I want to read it into a Pandas dataframe. I’ve tried both “read_csv” and “read_table”, but I’m not sure what delimiter is being used and if it possibly uses a multi-index… The MD file is as follows:

                                        - Home -          - Away -            - Total -
                                 Pld   W  D  L   F:A     W  D  L   F:A      F:A   +/-  Pts
 1. Club Brugge                   34  14  2  1  45:12   11  4  2  38:18    83:30  +53   81
 2. RSC Anderlecht                34  14  1  2  44:14    8  4  5  39:23    83:37  +46   71
 3. Germinal Beerschot            34  12  3  2  34:11    3  5  9  19:26    53:37  +16   53
 4. RWD Molenbeek                 34   9  6  2  24:9     4  8  5  15:20    39:29  +10   53
 5. K Lierse SK                   34   8  4  5  30:21    6  6  5  24:24    54:45   +9   52
 6. Standard Liège                34   9  7  1  28:15    4  5  8  23:31    51:46   +5   51
 7. Sporting Charleroi            34   8  7  2  37:21    5  4  8  22:32    59:53   +6   50
 8. Cercle Brugge                 34   7  5  5  27:23    6  5  6  24:24    51:47   +4   49
 9. KFC Lommel SK                 34   7  6  4  20:15    7  0 10  20:30    40:45   -5   48
10. SC Eendracht Aalst            34   8  6  3  37:21    4  4  9  18:29    55:50   +5   46
11. KV Mechelen                   34   8  4  5  20:16    4  4  9  20:30    40:46   -6   44
12. KRC Harelbeke                 34   8  1  8  26:26    5  3  9  14:22    40:48   -8   43
13. Royal Antwerp FC              34   7  4  6  26:23    4  5  8  12:23    38:46   -8   42
14. KAA Gent                      34   8  2  7  21:22    2  9  6  18:27    39:49  -10   41
15. Sint-Truidense VV             34   7  4  6  29:28    4  3 10  13:32    42:60  -18   40
16. RFC Seraing                   34   5  4  8  18:24    3  1 13  17:51    35:75  -40   29
17. KSK Beveren                   34   4  7  6  24:25    2  2 13  14:32    38:57  -19   27
18. SV Zulte Waregem              34   3  4 10  18:36    1  5 11  12:34    30:70  -40   21

Pld = Matches; W = Matches won; D = Matches drawn; L = Matches lost; F = Goals for; A = Goals against; +/- = Goal differencence; Pts = Points

How can I best read this file? Cheers!

Asked By: Zarathustra

||

Answers:

This file is not formatted as a CSV (which can use both ‘,’ and ‘;’ delimiters.
In your case you only have spaces to work with, so the approach would be for every line to split it by space char, get rid of empty entries and fetch them by index.

f = open("your.csv", 'r')
for l in f.readlines():
    vals = [l for l in l.split(' ') if l]
    index = vals[0]
    name = vals[1]
    goals_fa = tuple(vals[6].split(':'))
    ...
    # fill dataframe
f.close()
Answered By: G.Vernier

This file isn’t in a particular format – it’s meant to be human-readable rather than machine readable. So you’ll probably need to do some conversion on your own first.

One simple way:

lines = text.split('n')
df = pd.DataFrame([re.split(r's+', line[34:]) for line in lines])

You can name the columns directly:

df.columns = [['home_pld', 'home_w', 'home_d', 'home_l', 'home_fa', 
    'home_pld', 'home_w', 'home_d', 'home_l', 'home_fa', 
    'total_fa', 'total_plusminus', 'total_points']]

And add the club name:

df['club'] = [line[4:34].strip() for line in lines]
Answered By: ASGM

A quick way could be:

df = pd.read_table('data.txt', sep='s{2,}', header=1, engine='python')
                       Pld   W  D  L    F:A  W.1  D.1  L.1  F:A.1  F:A.2  +/-  Pts
1. Club Brugge          34  14  2  1  45:12   11    4    2  38:18  83:30   53   81
2. RSC Anderlecht       34  14  1  2  44:14    8    4    5  39:23  83:37   46   71
3. Germinal Beerschot   34  12  3  2  34:11    3    5    9  19:26  53:37   16   53
...

Explanation

  • sep='s{2,}' will split the data by 2 or more spaces
  • engine='python' is required for sep='s{2,}' to work

To clean up the column names you could set the prefix for each column manually:

# Manually set column prefix for each column
k = ['Pld'] + ['Home']*4 + ['Away']*4 + ['Total']*3

# Drop duplicate column name suffix (e.g. D.1 --> D)
cols = [re.sub('.[0-9]*$', '', c) for c in df.columns]

# Update column names
df.columns = [f'{k}_{c}' for k, c in zip(k, cols)]
Answered By: Benjamin Ziepert
Categories: questions Tags: ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.