Going to Python from R, what's the Python equivalent of a data frame?

Question:

I’m familiar with R data structures like vectors and data frames, but I need to do some text analysis and Python seems to have good tools for it. My question is: where can I find an explanation of how Python holds data?

Specifically I have a data set in a tab-separated file where the text is in the 3rd column and the scoring of the data that I need is in the 4th column.

id1            id2            text                             score
123            889     "This is the text I need to read..."      88
234            778     "This is the text I need to read..."      78
345            667     "This is the text I need to read..."      91

In R I’d just load it into a data frame named df1 and when I wanted to call a column I’d use df1$text or df1[,3] and if I wanted a specific cell I could use df1[1,3].

I am getting a feel for how to read data into Python, but not for how to deal with table-like structures.

How would you suggest a Python newbie work with this?

Asked By: screechOwl


Answers:

I’m not sure how well this translates to R, which I have never used, but in Python this is how I would approach it:

lines = []
with open('data.txt', 'r') as f:
    for line in f:
        # Split on tabs only, so spaces inside the text column are preserved
        lines.append(line.rstrip('\n').split('\t'))

That reads every row into a Python list of lists. Lists are zero-based, and the header row is lines[0]. To get the text column from the second line of the file:

print(lines[1][2])

The score for that line:

print(lines[1][3])
Answered By: Burhan Khalid

Look at the DataFrame object in the pandas library.
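As a rough sketch of how the R idioms from the question might map onto pandas (the inline sample data stands in for the real file, and the name df1 is borrowed from the question, so treat both as assumptions):

```python
import io

import pandas as pd

# Stand-in for the tab-separated file described in the question (assumed layout).
raw = (
    "id1\tid2\ttext\tscore\n"
    '123\t889\t"This is the text I need to read..."\t88\n'
    '234\t778\t"This is the text I need to read..."\t78\n'
    '345\t667\t"This is the text I need to read..."\t91\n'
)

# With a real file this would be pd.read_csv("data.txt", sep="\t").
df1 = pd.read_csv(io.StringIO(raw), sep="\t")

text_by_name = df1["text"]     # roughly df1$text in R
text_by_pos = df1.iloc[:, 2]   # roughly df1[,3]; positions are zero-based
first_cell = df1.iloc[0, 2]    # roughly df1[1,3]

print(first_cell)
print(df1["score"].mean())
```

The main thing to watch coming from R is that positional indexing with iloc starts at 0 rather than 1.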

Answered By: Joshua Ulrich

In addition to the pandas DataFrame, you can use the rpy2 library (from http://thread.gmane.org/gmane.comp.python.rpy/1344):

import array
import rpy2.robjects as ro

d = dict(x = array.array('i', [1,2]), y = array.array('i', [2,3]))
dataf = ro.r['data.frame'](**d)
Answered By: Jonathan

One option that I’ve used in the past is csv.DictReader, which lets you reference data in a row by name (each row becomes a dict):

import csv
with open('data.txt') as f:
    # The file is tab-separated, so the delimiter is the tab character
    reader = csv.DictReader(f, delimiter='\t')
    for row in reader:
        print(row)

Output (key order may vary with the Python version):

{'text': 'This is the text I need to read...', 'score': '88', 'id2': '889', 'id1': '123'}
{'text': 'This is the text I need to read...', 'score': '78', 'id2': '778', 'id1': '234'}
{'text': 'This is the text I need to read...', 'score': '91', 'id2': '667', 'id1': '345'}
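If you want whole columns rather than rows, one possible follow-up is to collect the fields you care about into plain lists. This is only a sketch: the inline sample data replaces the real file, and the names texts and scores are made up for illustration.

```python
import csv
import io

# Stand-in for the tab-separated file described in the question (assumed layout).
raw = (
    "id1\tid2\ttext\tscore\n"
    '123\t889\t"This is the text I need to read..."\t88\n'
    '234\t778\t"This is the text I need to read..."\t78\n'
    '345\t667\t"This is the text I need to read..."\t91\n'
)

texts = []
scores = []
# With a real file, pass the open file object instead of io.StringIO(raw).
reader = csv.DictReader(io.StringIO(raw), delimiter="\t")
for row in reader:
    texts.append(row["text"])
    # csv always yields strings, so numeric columns need an explicit conversion
    scores.append(int(row["score"]))

print(texts[0])
print(scores)
```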
Answered By: bigjim

Joshua Ulrich’s answer of using the pandas library is the closest approach to the R data frame. However, you can get very similar functionality from a numpy array, with the data type set to object if necessary. Newer versions of numpy support named fields (structured arrays), similar to the columns of a data.frame; its indexing is actually somewhat more powerful than R’s, and its ability to contain objects goes well beyond what R can do.
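To illustrate the field-name capability, here is a minimal sketch of a numpy structured array built from the question's sample rows. The dtype choices (32-bit integers, text capped at 64 characters) are assumptions made for the example.

```python
import numpy as np

# Field names follow the question's header; the dtypes are illustrative choices.
row_dtype = [("id1", "i4"), ("id2", "i4"), ("text", "U64"), ("score", "i4")]

data = np.array(
    [
        (123, 889, "This is the text I need to read...", 88),
        (234, 778, "This is the text I need to read...", 78),
        (345, 667, "This is the text I need to read...", 91),
    ],
    dtype=row_dtype,
)

texts = data["text"]             # a whole column by name, roughly df1$text
first_text = data[0]["text"]     # a single cell, roughly df1[1,3]
high = data[data["score"] > 80]  # boolean-mask row selection

print(first_text)
print(high["id1"])
```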

I use both R and numpy, depending on the task at hand. R is way better with formulas and built-in statistics. The Python code is more maintainable and easier to hook up to other systems.

Edited: added note that numpy now has field name capabilities

Answered By: Brian B

The equivalent of an R data frame in Python is the pandas DataFrame.

You initialise a DataFrame as below:

import pandas as pd
# sep="\t" because the file in the question is tab-separated
df = pd.read_csv("filename", sep="\t")

print(df.head())
Answered By: Steve