How to read a .txt file into a pandas DataFrame from a transposed format

Question:

I’m trying to read a dataset into a pandas dataframe. The dataset is currently in a .txt file, and it looks something like this:

name: hello_world
rating: 5
description: basic program

name: python
rating: 10
description: programming language

As you can see, the column names start each line, followed by the data. Different rows of the dataframe are separated by an extra line. Is there a simple way to read this type of file into pandas, or do I just have to do it manually?

Thanks!

Edit: Thanks everyone for the help. It seems that the answer is, yes, you have to do it manually. I’ve posted the way I did it manually below, though I’m sure there are other, more efficient methods.

Asked By: user2832964

Answers:

I think you have to do it manually.
If you check the pandas I/O API (https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html), none of the built-in readers targets this layout, and there is no way to define a custom reading procedure.
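
Doing it manually can still be short. The following is a minimal sketch, assuming records are separated by blank lines and every non-blank line has the form `key: value`. The helper name `iter_records` is hypothetical (it is not part of pandas), and the sample text is the data from the question.

```python
import io
import pandas as pd

def iter_records(lines):
    """Yield one dict per blank-line-separated block of 'key: value' lines."""
    record = {}
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line closes the current record, if one was started.
            if record:
                yield record
                record = {}
            continue
        # Split on the first colon only, so a value may contain ':' itself.
        key, _, value = line.partition(':')
        record[key.strip()] = value.strip()
    if record:
        # The last record may not be followed by a blank line.
        yield record

# Sample data from the question, held in memory so the sketch runs alone.
sample = io.StringIO(
    "name: hello_world\nrating: 5\ndescription: basic program\n"
    "\n"
    "name: python\nrating: 10\ndescription: programming language\n"
)
df = pd.DataFrame(list(iter_records(sample)))
print(df)
```

With a real file, an open file object can be passed in place of `sample`. Because each record is its own dict, a record that lacks a field ends up with NaN in that column rather than shifting later values.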

Answered By: Cristian

data.txt:

name: hello_world
rating: 5
description: basic program

name: python
rating: 10
description: programming language

Code:

import pandas as pd
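with open('data.txt', 'rt') as fin:
    # Drop the line ending and skip the blank separator lines.
    # rstrip('\n') avoids cutting a character off the last line when the
    # file does not end with a newline, which line[:-1] would do.
    lst = [line.rstrip('\n') for line in fin if line.strip()]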
print(lst)

# Soln 1
d = dict()
d['name'] = [ele.split(':')[1] for ele in lst if ele.startswith('name:')]
d['rating'] = [ele.split(':')[1] for ele in lst if ele.startswith('rating:')]
d['description'] = [ele.split(':')[1] for ele in lst if ele.startswith('description:')]
df = pd.DataFrame(data=d)
print(df)

# OR
# Soln 2: assumes every record has exactly three lines, in this order.

data_tuples_lst = [(lst[i].split(':')[1], lst[i+1].split(':')[1], lst[i+2].split(':')[1]) for i in range(0, len(lst), 3)]
df1 = pd.DataFrame(data=data_tuples_lst, columns = ['name', 'rating', 'description'])
print(df1)

Output:

['name: hello_world', 'rating: 5', 'description: basic program', 'name: python', 'rating: 10', 'description: programming language']
           name rating            description
0   hello_world      5          basic program
1        python     10   programming language
           name rating            description
0   hello_world      5          basic program
1        python     10   programming language
Answered By: Aaj Kaal

My take. Again as part of my learning pandas.

import pandas as pd
from io import StringIO

data = '''
name: hello_world
rating: 5
description: basic program

name: python
rating: 10
description: programming language

name: foo
rating: 20
description: bar
'''
buffer = StringIO()
buffer.write('field: value\n')  # add column headers
buffer.write(data)
buffer.seek(0)

df = pd.read_csv(buffer, delimiter=':')

transposed = df.T

_, col_count = transposed.shape

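x = []
# After transposing, each original line is a column; take the columns three
# at a time (name, rating, description) to form one record per group.
for i in range(0, col_count, 3):
    tmp = transposed[[i, i + 1, i + 2]]
    columns = tmp.iloc[0]  # first row holds the field names
    tmp = tmp[1:]          # remaining row holds the values
    tmp.columns = columns
    x.append(tmp)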

out = pd.concat(x)
print(out.to_string(index=False))

I’d really appreciate someone experienced with pandas to show a better way.
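
One possible pandas-only variation, offered as a sketch rather than a definitive answer: read the lines as a two-column "long" table, number the records, and pivot. It assumes that every record starts with a `name` line and that no value contains a colon. The names `long_df`, `record`, and `wide` are made up for this illustration.

```python
import io
import pandas as pd

raw = io.StringIO("""name: hello_world
rating: 5
description: basic program

name: python
rating: 10
description: programming language
""")

# Blank lines are skipped by default; skipinitialspace drops the space
# that follows each colon.
long_df = pd.read_csv(raw, sep=':', names=['field', 'value'],
                      skipinitialspace=True)

# Assumption: each record begins with a 'name' line, so a running count of
# 'name' lines gives a record number for every row.
long_df['record'] = (long_df['field'] == 'name').cumsum() - 1

# One row per record, one column per field.
wide = long_df.pivot(index='record', columns='field', values='value')
wide = wide[['name', 'rating', 'description']].reset_index(drop=True)
wide.columns.name = None

# Everything was read as text, so convert the numeric column explicitly.
wide['rating'] = pd.to_numeric(wide['rating'])
print(wide.to_string(index=False))
```

This avoids the Python-level loop, at the cost of the two assumptions stated above.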

Answered By: Justin Ezequiel

Here is one way to approach the ‘sideways’ data set. This code has been edited for efficiency over the previous answer.

Sample code:

import pandas as pd
from collections import defaultdict

# Read the text file into a list.
with open('prog.txt') as f:
    text = [i.strip() for i in f]

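# Split each non-empty line into a [key, value] pair at the first colon only,
# so a value that itself contains ':' still unpacks into two items below.
d = [i.split(':', 1) for i in text if i]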
# Create a data container.
data = defaultdict(list)
# Store the data in a DataFrame-ready dict.
for k, v in d:
    data[k].append(v.strip())

# Load the DataFrame.
df = pd.DataFrame(data)

Output:

          name rating           description
0  hello_world      5         basic program
1       python     10  programming language
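
Note that every value parsed this way is a string, including the ratings. A small follow-up sketch, assuming the `rating` column should be numeric; the DataFrame is rebuilt by hand here only so the snippet runs on its own.

```python
import pandas as pd

# Same content as the output above, with every value still a string.
df = pd.DataFrame({
    'name': ['hello_world', 'python'],
    'rating': ['5', '10'],
    'description': ['basic program', 'programming language'],
})

# Convert the text column to numbers; to_numeric raises on bad values by default.
df['rating'] = pd.to_numeric(df['rating'])
print(df.dtypes)
```
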
Answered By: S3DEV

In case anyone comes here later, this is what I did. I simply converted the input file to a CSV (except I used ‘|’ as the delimiter because the dataset contained strings). Thanks everyone for their input, but I neglected to mention that it was a 2 GB data file, so I didn’t want to do anything too intensive for my poor overworked laptop.

import pandas as pd


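# Stream line by line so the whole input never has to sit in memory.
with open("in_file.txt", 'r', encoding='cp1252') as ifile, \
     open("out_file.csv", 'w') as ofile:
    for l in ifile:
        if l == '\n':
            # A blank line ends the current record.
            ofile.write('\n')
        else:
            # Keep the text after the first colon, minus the line ending.
            ofile.write(l.split(':', 1)[1][:-1] + '|')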

Then I opened the dataframe using:

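import pandas as pd
# The converted file has no header row, so the column names are supplied here
# (taken from the sample in the question). index_col=False tells pandas not to
# treat the first column as an index because of the trailing '|' on each line.
df = pd.read_csv('out_file.csv', sep="|", skipinitialspace=True, index_col=False,
                 header=None, names=['name', 'rating', 'description'])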
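
A variation on the same streaming idea, sketched with the standard `csv` module so that values containing the delimiter are quoted automatically and a header row is written. The function name `blocks_to_csv` and the field list are assumptions for illustration, and the sample uses in-memory buffers instead of the real files.

```python
import csv
import io

def blocks_to_csv(src, dst, fields, delimiter='|'):
    """Stream 'key: value' blocks from src into delimited rows on dst."""
    writer = csv.writer(dst, delimiter=delimiter, lineterminator='\n')
    writer.writerow(fields)  # header row, so read_csv picks up column names
    row = {}
    for line in src:
        line = line.strip()
        if line:
            # Split on the first colon only; later colons stay in the value.
            key, _, value = line.partition(':')
            row[key.strip()] = value.strip()
        elif row:
            # A blank line ends the record; missing fields become ''.
            writer.writerow([row.get(f, '') for f in fields])
            row = {}
    if row:
        # Flush the final record if the input does not end with a blank line.
        writer.writerow([row.get(f, '') for f in fields])

# In-memory stand-ins for the input and output files.
src = io.StringIO(
    "name: hello_world\nrating: 5\ndescription: basic | program\n"
    "\n"
    "name: python\nrating: 10\ndescription: programming language\n"
)
dst = io.StringIO()
blocks_to_csv(src, dst, ['name', 'rating', 'description'])
print(dst.getvalue())
```

Only one record is held in memory at a time, so this should also suit a large input. With real files, the `csv` documentation recommends opening the output with `newline=''`. The result can then be read with `pd.read_csv('out_file.csv', sep='|')`, and the `chunksize` parameter of `read_csv` can be used if the DataFrame itself is too large to load in one piece.
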
Answered By: user2832964

After building the list proposed by @aaj-kaal with this code:

import pandas as pd
with open('data.txt', 'rt') as fin:
    # rstrip('\n') keeps the last line intact even without a trailing newline.
    lst = [line.rstrip('\n') for line in fin if line.strip()]

you can build the DataFrame directly with:

dict_df = pd.DataFrame()
dict_df['name'] = [ele.split(':')[1] for ele in lst
                   if ele.startswith('name:')]
dict_df['rating'] = [ele.split(':')[1] for ele in lst
                     if ele.startswith('rating:')]
dict_df['description'] = [ele.split(':')[1] for ele in lst
                          if ele.startswith('description:')]
dict_df

Output:

           name rating            description
0   hello_world      5          basic program
1        python     10   programming language
Answered By: Hermes Morales