How to read a .txt file into pandas DataFrame, from transposed format
Question:
I’m trying to read a dataset into a pandas dataframe. The dataset is currently in a .txt file, and it looks something like this:
name: hello_world
rating: 5
description: basic program

name: python
rating: 10
description: programming language
As you can see, the column names start each line, followed by the data. Different rows of the dataframe are separated by an extra line. Is there a simple way to read this type of file into pandas, or do I just have to do it manually?
Thanks!
Edit: Thanks everyone for the help. It seems that the answer is, yes, you have to do it manually. I’ve posted the way I did it manually below, though I’m sure there are other, more efficient methods.
Answers:
I think you have to do it manually.
If you check the pandas I/O API (https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html), there is no way to define a custom reading procedure.
data.txt:
name: hello_world
rating: 5
description: basic program

name: python
rating: 10
description: programming language
Code:
import pandas as pd

# Read the file, dropping the trailing newline and skipping blank lines.
with open('data.txt', 'rt') as fin:
    lst = [line.rstrip('\n') for line in fin if line.strip()]
print(lst)

# Soln 1: build one list per column.
d = dict()
d['name'] = [ele.split(':', 1)[1] for ele in lst if ele.startswith('name:')]
d['rating'] = [ele.split(':', 1)[1] for ele in lst if ele.startswith('rating:')]
d['description'] = [ele.split(':', 1)[1] for ele in lst if ele.startswith('description:')]
df = pd.DataFrame(data=d)
print(df)

# OR: Soln 2, treat every three consecutive lines as one row.
data_tuples_lst = [(lst[i].split(':', 1)[1], lst[i+1].split(':', 1)[1], lst[i+2].split(':', 1)[1]) for i in range(0, len(lst), 3)]
df1 = pd.DataFrame(data=data_tuples_lst, columns=['name', 'rating', 'description'])
print(df1)
Output:
['name: hello_world', 'rating: 5', 'description: basic program', 'name: python', 'rating: 10', 'description: programming language']
          name  rating           description
0  hello_world       5         basic program
1       python      10  programming language
          name  rating           description
0  hello_world       5         basic program
1       python      10  programming language
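One thing to watch with both solutions: `split(':')[1]` returns the text after the colon unchanged, so every value is a string and still carries the space that followed the colon. A minimal sketch of cleaning that up afterwards, assuming the frame looks like the one above (the literal frame here is only a stand-in for it):

```python
import pandas as pd

# Stand-in for the frame built above: all values are strings with a leading space.
df = pd.DataFrame({
    'name': [' hello_world', ' python'],
    'rating': [' 5', ' 10'],
    'description': [' basic program', ' programming language'],
})

# Strip surrounding whitespace from every (string) column.
df = df.apply(lambda col: col.str.strip())

# Convert the rating column from strings to numbers.
df['rating'] = pd.to_numeric(df['rating'])

print(df.dtypes)
```

After this, `rating` can be sorted and aggregated numerically instead of as text.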
My take. Again as part of my learning pandas.
import pandas as pd
from io import StringIO

data = '''
name: hello_world
rating: 5
description: basic program
name: python
rating: 10
description: programming language
name: foo
rating: 20
description: bar
'''

buffer = StringIO()
buffer.write('field: value\n')  # add column headers
buffer.write(data)
buffer.seek(0)

df = pd.read_csv(buffer, delimiter=':')
transposed = df.T
_, col_count = transposed.shape

# Each group of three columns in the transposed frame is one record.
x = []
for i in range(0, col_count, 3):
    tmp = transposed[[i, i + 1, i + 2]]
    columns = tmp.iloc[0]   # first row holds the field names
    tmp = tmp[1:]           # second row holds the values
    tmp.columns = columns
    x.append(tmp)

out = pd.concat(x)
print(out.to_string(index=False))
I’d really appreciate someone experienced with pandas to show a better way.
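One possible pandas-only alternative, offered as a sketch rather than the definitive way: read the file as a two-column "long" table, number the occurrences of each field, and pivot. The column names `field`, `value` and `row` are just labels chosen here, and the approach assumes every record has the same fields and no value contains a colon.

```python
from io import StringIO

import pandas as pd

data = '''
name: hello_world
rating: 5
description: basic program
name: python
rating: 10
description: programming language
'''

# Two-column "long" table: one row per non-blank line of the input.
long_df = pd.read_csv(StringIO(data), sep=':', names=['field', 'value'],
                      skipinitialspace=True)

# The n-th 'name', the n-th 'rating', ... all belong to record n.
long_df['row'] = long_df.groupby('field').cumcount()

# Spread the field names out into columns, one output row per record.
out = long_df.pivot(index='row', columns='field', values='value')

# pivot sorts the columns alphabetically, so restore the original order.
out = out[['name', 'rating', 'description']]
print(out)
```

The values are still strings at this point, so a numeric column such as `rating` would need converting separately.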
Here is one way to approach the ‘sideways’ data set. This code has been edited to be more efficient than the previous answer.
Sample code:
import pandas as pd
from collections import defaultdict

# Read the text file into a list.
with open('prog.txt') as f:
    text = [i.strip() for i in f]

# Split each non-blank line into a [key, value] pair at the first colon.
d = [i.split(':', 1) for i in text if i]

# Create a data container.
data = defaultdict(list)

# Store the data in a DataFrame-ready dict.
for k, v in d:
    data[k].append(v.strip())

# Load the DataFrame.
df = pd.DataFrame(data)
Output:
          name  rating           description
0  hello_world       5         basic program
1       python      10  programming language
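A caveat with the per-column lists: if any record is missing a field, the lists end up with different lengths and `pd.DataFrame(data)` raises an error. Here is a rough variant that builds one dict per record instead, assuming records are separated by blank lines as described in the question. `parse_records` is a made-up helper name, not a pandas function.

```python
import pandas as pd


def parse_records(lines):
    """Return a list of dicts, one per blank-line-separated record."""
    records, current = [], {}
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line closes the record collected so far.
            if current:
                records.append(current)
                current = {}
            continue
        # Split at the first colon only, so values may contain colons.
        key, _, value = line.partition(':')
        current[key.strip()] = value.strip()
    if current:
        records.append(current)
    return records


# Sample input in which the second record has no 'rating' line.
sample = """name: hello_world
rating: 5
description: basic program

name: python
description: programming language
"""

df = pd.DataFrame(parse_records(sample.splitlines()))
print(df)
```

Building the frame from a list of dicts lets pandas fill the missing `rating` with NaN instead of failing.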
In case anyone comes here later, this is what I did. I simply converted the input file to a CSV (except I used ‘|’ as the delimiter because the dataset contained strings). Thanks everyone for their input, but I neglected to mention that it was a 2GB data file, so I didn’t want to do anything too intensive for my poor overworked laptop.
import pandas as pd

ofile = open("out_file.csv", 'w')
ifile = open("in_file.txt", 'r', encoding='cp1252')
for l in ifile:
    if l == '\n':
        # A blank line ends the current record.
        ofile.write('\n')
    else:
        # Keep the text after the colon, minus the trailing newline.
        ofile.write(l.split(':')[1][:-1] + '|')
ofile.close()
ifile.close()
Then I opened the dataframe using:
import pandas as pd
df = pd.read_csv('out_file.csv', sep="|", skipinitialspace=True, index_col=False)
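For anyone doing the same conversion on a large file, here is a sketch of a streaming variant using the standard `csv` module, which also writes a header row, avoids the trailing ‘|’, and quotes values that happen to contain the delimiter. It assumes records are separated by blank lines and that every record has the same fields as the first one. `records` and `convert` are made-up names, and the file paths are whatever you pass in.

```python
import csv


def records(lines):
    """Lazily yield one dict per blank-line-separated record."""
    record = {}
    for line in lines:
        line = line.strip()
        if line:
            key, _, value = line.partition(':')
            record[key.strip()] = value.strip()
        elif record:
            yield record
            record = {}
    if record:
        yield record


def convert(src_path, dst_path, encoding='cp1252'):
    """Stream src_path into a '|'-delimited file, one record in memory at a time."""
    with open(src_path, encoding=encoding) as src, \
            open(dst_path, 'w', newline='', encoding=encoding) as dst:
        writer = None
        for rec in records(src):
            if writer is None:
                # Take the column names from the first record.
                writer = csv.DictWriter(dst, fieldnames=list(rec), delimiter='|')
                writer.writeheader()
            writer.writerow(rec)
```

The result can then be loaded with `pd.read_csv(dst_path, sep='|', encoding='cp1252')`, optionally with `chunksize=` if the whole frame does not fit in memory.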
After building the list as proposed by @aaj-kaal with this code:
import pandas as pd

with open('data.txt', 'rt') as fin:
    lst = [line.rstrip('\n') for line in fin if line.strip()]
you can build the DataFrame directly:
dict_df = pd.DataFrame()
dict_df['name'] = [ele.split(':', 1)[1] for ele in lst if ele.startswith('name:')]
dict_df['rating'] = [ele.split(':', 1)[1] for ele in lst
                     if ele.startswith('rating:')]
dict_df['description'] = [ele.split(':', 1)[1] for ele in lst
                          if ele.startswith('description:')]
dict_df
Output:
          name  rating           description
0  hello_world       5         basic program
1       python      10  programming language