Read txt file with pandas into dataframe

Question:

I want to read the txt file from here, which contains Dota 2 MMRs for different players. It has the form shown below:

      1) "103757918"
      2) "1"
      3) "107361667"
      4) "1"
      5) "108464725"
      6) "1"
      7) "110818765"
      8) "1"
      9) "111436016"
     10) "1"
     11) "113518306"
     12) "1"
     13) "118896321"
     14) "1"
     15) "119780733"
     16) "1"
     17) "120360801"
     18) "1"
     19) "120870684"
     20) "1"
     21) "122616345"
     22) "1"
     23) "124393917"
     24) "1"
     25) "124487030"

Each account_id (e.g. 103757918) is followed on the next line by that player's mmr (e.g. 1). How can I read this into a pandas DataFrame with two columns, account_id and mmr?

I don’t need the index numbers.

Asked By: OLGJ


Answers:

Read the data as you normally would, then take every other row for each column and concat the two slices side by side. Afterwards you can rename the columns to whatever you want.

Slicing the data this way assumes that the first value is always the account_id followed by the mmr. Notice how row 25 is missing the mmr in your sample data and is therefore null.

import pandas as pd

# r'\s+' splits on runs of whitespace; column 1 holds the quoted values
df = pd.read_csv('data.txt', sep=r'\s+', header=None)
pd.concat([df[1][::2].reset_index(drop=True),
           df[1][1::2].reset_index(drop=True)], axis=1)

Here is a working example based on your sample data:

s = '''1) "103757918"
2) "1"
3) "107361667"
4) "1"
5) "108464725"
6) "1"
7) "110818765"
8) "1"
9) "111436016"
10) "1"
11) "113518306"
12) "1"
13) "118896321"
14) "1"
15) "119780733"
16) "1"
17) "120360801"
18) "1"
19) "120870684"
20) "1"
21) "122616345"
22) "1"
23) "124393917"
24) "1"
25) "124487030"'''


from io import StringIO
import pandas as pd

df = pd.read_csv(StringIO(s), sep=r'\s+', header=None)
data = pd.concat([df[1][::2].reset_index(drop=True),
                  df[1][1::2].reset_index(drop=True)], axis=1)

data.columns = ['account_id', 'mmr']
print(data)

    account_id  mmr
0    103757918  1.0
1    107361667  1.0
2    108464725  1.0
3    110818765  1.0
4    111436016  1.0
5    113518306  1.0
6    118896321  1.0
7    119780733  1.0
8    120360801  1.0
9    120870684  1.0
10   122616345  1.0
11   124393917  1.0
12   124487030  NaN
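
The mmr column shows up as 1.0 rather than 1 because the missing value in the last row forces the column to float. If you would rather keep integers, one option is the pandas nullable integer dtype. This is only a sketch on a shortened sample (the name `sample` is made up for illustration), not something the original file requires:

```python
from io import StringIO
import pandas as pd

# Shortened sample in the same format as the question;
# the last account_id has no mmr line after it.
sample = '''1) "103757918"
2) "1"
3) "107361667"
4) "1"
5) "124487030"'''

df = pd.read_csv(StringIO(sample), sep=r'\s+', header=None)
data = pd.concat([df[1][::2].reset_index(drop=True),
                  df[1][1::2].reset_index(drop=True)], axis=1)
data.columns = ['account_id', 'mmr']

# 'Int64' (capital I) is the nullable integer dtype: the gap is stored
# as <NA> instead of NaN, so the other values stay integers.
data['mmr'] = data['mmr'].astype('Int64')
print(data)
```

Whether a missing mmr should stay as a gap or the row should be dropped depends on what you do with the data afterwards.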
Answered By: It_is_Chris

You can use read_table, since read_csv with \n as a delimiter won't work. Then shift and rearrange the values like this:

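import pandas as pd

# sep=r'\s+' splits on whitespace (delim_whitespace=True is deprecated in recent pandas)
df = pd.read_table('./mmr.txt', sep=r'\s+', names=['account_id', 'mmr'])
df['account_id'] = df['mmr']        # overwrite the "1)" counters with the values
df['mmr'] = df['mmr'].shift(-1)     # move each following line up one row
df = df[df.index % 2 == 0].reset_index(drop=True)  # keep only the account_id rows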

The output will be:

    account_id  mmr
0   103757918   1.0
1   107361667   1.0
2   108464725   1.0
3   110818765   1.0
...
1069671 93119769    8674.0
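
To see what each step does without the full file, the same shift approach can be run on a few in-memory lines. This is a rough sketch: the `sample` string is made up (the second mmr is set to 2 only so the two rows are easy to tell apart), and StringIO stands in for the file path.

```python
from io import StringIO
import pandas as pd

# Hypothetical four-line sample in the same format as the file
sample = '''1) "103757918"
2) "1"
3) "107361667"
4) "2"'''

# After reading, 'account_id' holds the "1)", "2)" counters and
# 'mmr' holds every quoted value (ids and mmrs interleaved).
df = pd.read_table(StringIO(sample), sep=r'\s+', names=['account_id', 'mmr'])
df['account_id'] = df['mmr']       # copy the values over the counters
df['mmr'] = df['mmr'].shift(-1)    # each row now also sees the next line
df = df[df.index % 2 == 0].reset_index(drop=True)  # keep the id rows only
print(df)
```

The shift introduces a NaN in the last row before filtering, which is why mmr comes out as float.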

Answered By: BehRouz

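You can use a regular expression to find pairs of quoted values. Because [^"]* also matches newlines, each match spans two lines of the file without any special flags. I like this solution because the only puzzler is the regular expression itself. Note that both columns come back as strings.

import re
import pandas as pd

# Two quoted values separated by anything that is not a quote (including the newline)
acct_mmr_pair_re = re.compile(r'"([^"]*)"[^"]*"([^"]*)"')

with open("k.txt") as file:
    df = pd.DataFrame(acct_mmr_pair_re.findall(file.read()),
                      columns=["account_id", "mmr"])
print(df)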
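
As a quick check of how the pattern behaves, here is a small sketch on an in-memory string instead of k.txt. The names `sample` and `pair_re` are just for illustration. It also converts the captured strings to integers, assuming every captured value is numeric:

```python
import re
import pandas as pd

# Shortened sample; the last account_id has no mmr after it
sample = '''1) "103757918"
2) "1"
3) "107361667"
4) "1"
5) "124487030"'''

pair_re = re.compile(r'"([^"]*)"[^"]*"([^"]*)"')

# findall returns a list of (account_id, mmr) tuples of strings;
# a trailing id without a partner does not match and is left out.
pairs = pair_re.findall(sample)
df = pd.DataFrame(pairs, columns=["account_id", "mmr"]).astype("int64")
print(df)
```

So unlike the slicing approaches, an account_id with no mmr after it is silently dropped rather than kept with a null.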
Answered By: tdelaney