Going to Python from R: what's the Python equivalent of a data frame?
Question:
I’m familiar with R data structures such as vectors and data frames, but I need to do some text analysis, and Python seems to have good tools for it. My question is: where can I find an explanation of how Python holds data?
Specifically, I have a data set in a tab-separated file where the text is in the 3rd column and the score I need is in the 4th column:
id1 id2 text score
123 889 "This is the text I need to read..." 88
234 778 "This is the text I need to read..." 78
345 667 "This is the text I need to read..." 91
In R I’d just load it into a data frame named df1, and when I wanted to call a column I’d use df1$text or df1[,3], and if I wanted a specific cell I could use df1[1,3].
I am getting a feel for how to read data into Python, but not for how to deal with table-like structures.
How would you suggest a Python newbie work with this?
Answers:
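I’m not sure how well this translates to R, which I have never used, but in Python this is how I would approach it:
    lines = []
    with open('data.txt', 'r') as f:
        for line in f:
            # Split on tabs only, so the spaces inside the text column are preserved.
            lines.append(line.rstrip('\n').split('\t'))
That reads everything into a Python list of lists (the surrounding quotes stay part of each text string). Lists are zero-based, and the header row is lines[0]. To get the text column from the second line of the file:
    print(lines[1][2])
The score for that line:
    print(lines[1][3])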
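As a rough sketch of the list-of-lists approach applied to the sample data from the question, the snippet below shows how the R expressions df1[,3], df1$score and df1[1,3] might be mimicked with plain lists. Splitting on tabs rather than on any whitespace keeps the text column in one piece. The names `SAMPLE`, `header`, `data`, `texts`, `scores` and `first_text` are illustrative, and `io.StringIO` stands in for data.txt only so that the example is self-contained.

```python
import io

# Stand-in for data.txt: the sample rows from the question, tab-separated.
SAMPLE = (
    'id1\tid2\ttext\tscore\n'
    '123\t889\t"This is the text I need to read..."\t88\n'
    '234\t778\t"This is the text I need to read..."\t78\n'
    '345\t667\t"This is the text I need to read..."\t91\n'
)

with io.StringIO(SAMPLE) as f:
    rows = [line.rstrip("\n").split("\t") for line in f]

# Separate the header row from the data rows.
header, data = rows[0], rows[1:]

# Column by position (R: df1[,3]); Python counts from 0, so the third column is index 2.
# strip('"') removes the literal quotes that a plain split leaves in place.
texts = [row[2].strip('"') for row in data]

# Column by name (R: df1$score), converted from strings to integers.
score_idx = header.index("score")
scores = [int(row[score_idx]) for row in data]

# Single cell (R: df1[1,3]): first data row, third column.
first_text = data[0][2].strip('"')

print(texts)
print(scores)
```

Everything read this way is a string, so numeric columns need an explicit conversion such as `int(...)` before doing arithmetic on them.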
In addition to the pandas DataFrame, you can use the rpy2 library (from http://thread.gmane.org/gmane.comp.python.rpy/1344):
    import array
    import rpy2.robjects as ro
    # Build an R data.frame from two integer columns, x and y.
    d = dict(x=array.array('i', [1, 2]), y=array.array('i', [2, 3]))
    dataf = ro.r['data.frame'](**d)
One option that I’ve used in the past is csv.DictReader, which lets you reference data in a row by name (each row becomes a dict):
    import csv
    with open('data.txt') as f:
        # The delimiter must be the tab character '\t', not the letter 't'.
        reader = csv.DictReader(f, delimiter='\t')
        for row in reader:
            print(row)
Output (Python 3.8 or later, where each row is a plain dict in column order):
    {'id1': '123', 'id2': '889', 'text': 'This is the text I need to read...', 'score': '88'}
    {'id1': '234', 'id2': '778', 'text': 'This is the text I need to read...', 'score': '78'}
    {'id1': '345', 'id2': '667', 'text': 'This is the text I need to read...', 'score': '91'}
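To make the by-name access a little more concrete, here is a minimal sketch that collects whole columns from the DictReader rows and filters rows by score. It assumes the sample data from the question; `SAMPLE`, `records`, `texts`, `scores` and `high` are names invented for this example, and `io.StringIO` replaces the real file so the snippet runs on its own.

```python
import csv
import io

# Stand-in for data.txt: the sample rows from the question, tab-separated.
SAMPLE = (
    'id1\tid2\ttext\tscore\n'
    '123\t889\t"This is the text I need to read..."\t88\n'
    '234\t778\t"This is the text I need to read..."\t78\n'
    '345\t667\t"This is the text I need to read..."\t91\n'
)

reader = csv.DictReader(io.StringIO(SAMPLE), delimiter="\t")
records = list(reader)  # one dict per data row, keyed by the header names

# Column by name (R: df1$text). The csv module strips the surrounding quotes.
texts = [r["text"] for r in records]

# All values arrive as strings, so convert the scores explicitly.
scores = [int(r["score"]) for r in records]

# Row selection: ids of the rows whose score is above 80.
high = [r["id1"] for r in records if int(r["score"]) > 80]

print(scores)
print(high)
```

Compared with splitting lines by hand, the csv module also handles quoted fields, which matters if the text column could itself contain a tab.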
Mr Ullrich’s answer of using the pandas library is the closest approach to the R data frame. However, you can get very similar functionality using a numpy array, with the data type set to object if necessary. Newer versions of numpy support named fields, similar to the columns of a data.frame. Its indexing is actually somewhat more powerful than R’s, and its ability to contain objects goes well beyond what R can do.
I use both R and numpy, depending on the task at hand. R is way better with formulas and built-in statistics. The Python code is more maintainable and easier to hook up to other systems.
Edited: added note that numpy now has field name capabilities
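The field name capability mentioned above is usually reached through numpy structured arrays. The following is only a sketch under assumptions: the field types (32-bit integers and a Unicode string of up to 64 characters) and the names `dtype`, `table`, `scores`, `first_text`, `high` and `mean_score` are choices made for this example, not something the answer specifies.

```python
import numpy as np

# Field names mirror the question's columns; the types are assumptions:
# 32-bit integers for ids and score, up to 64 Unicode characters for text.
dtype = [("id1", "i4"), ("id2", "i4"), ("text", "U64"), ("score", "i4")]

table = np.array(
    [
        (123, 889, "This is the text I need to read...", 88),
        (234, 778, "This is the text I need to read...", 78),
        (345, 667, "This is the text I need to read...", 91),
    ],
    dtype=dtype,
)

# Column by name, similar to df1$score in R.
scores = table["score"]

# Single cell: first row, "text" field (rows are zero-based).
first_text = table[0]["text"]

# Boolean-mask row selection: rows whose score is above 80.
high = table[table["score"] > 80]

# Numeric fields behave like ordinary arrays.
mean_score = scores.mean()

print(table.dtype.names)
print(high["id1"])
```

A fixed-width string field truncates longer text, so for free-form text an object field (type code "O") or a pandas DataFrame may be the more comfortable choice.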
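The closest equivalent of R’s data frame in Python is the pandas DataFrame.
You can create a DataFrame from a file as below (sep="\t" is needed for a tab-separated file, since read_csv expects commas by default):
    import pandas as pd
    df = pd.read_csv("filename", sep="\t")
    print(df.head())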
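For readers coming from R, it may help to see the three expressions from the question next to rough pandas counterparts. This is a sketch that assumes the sample data shown in the question; `SAMPLE` and the variable names are illustrative, and `io.StringIO` takes the place of the real file path.

```python
import io

import pandas as pd

# Stand-in for the tab-separated file from the question.
SAMPLE = (
    'id1\tid2\ttext\tscore\n'
    '123\t889\t"This is the text I need to read..."\t88\n'
    '234\t778\t"This is the text I need to read..."\t78\n'
    '345\t667\t"This is the text I need to read..."\t91\n'
)

df1 = pd.read_csv(io.StringIO(SAMPLE), sep="\t")

# R: df1$text  (column by name)
text_by_name = df1["text"]

# R: df1[,3]   (column by position; pandas positions start at 0)
text_by_pos = df1.iloc[:, 2]

# R: df1[1,3]  (single cell: first row, third column)
cell = df1.iloc[0, 2]

# R: df1[df1$score > 80, ]  (row selection with a boolean mask)
high = df1[df1["score"] > 80]

print(df1.head())
print(cell)
```

The main habit to unlearn is one-based indexing: `iloc` counts rows and columns from 0, so R's df1[1,3] becomes `df1.iloc[0, 2]`.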