Python read file as stream from HDFS

Question:

Here is my problem: I have a file in HDFS which can potentially be huge (i.e., too large to fit entirely in memory).

What I would like to do is avoid having to cache this file in memory, and only process it line by line like I would do with a regular file:

for line in open("myfile", "r"):
    # do some processing

I am looking for an easy way to do this correctly without using external libraries. I could probably make it work with libpyhdfs or python-hdfs, but if possible I would like to avoid introducing new dependencies and untested libraries into the system, especially since neither of these seems heavily maintained and both state that they should not be used in production.

I was thinking of doing this with the standard "hadoop" command-line tools via the Python subprocess module, but I can't seem to do what I need, since there are no command-line tools that would do my processing, and I would like to execute a Python function for every line in a streaming fashion.

Is there a way to apply Python functions as right operands of the pipes using the subprocess module? Or even better, open it like a file as a generator so I could process each line easily?

cat = subprocess.Popen(["hadoop", "fs", "-cat", "/path/to/myfile"], stdout=subprocess.PIPE)

If there is another way to achieve what I described above without using an external library, I’m also pretty open.

Thanks for any help!

Asked By: Charles Menguy


Answers:

You want xreadlines; it reads lines from a file without loading the whole file into memory.

Edit:

Now that I see your question: you just need to get the stdout pipe from your Popen object:

import subprocess

cat = subprocess.Popen(["hadoop", "fs", "-cat", "/path/to/myfile"], stdout=subprocess.PIPE)
for line in cat.stdout:
    print(line)
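
If it helps, the same approach can be wrapped in a small generator so the pipe gets closed and the exit code checked once iteration finishes. This is only a minimal sketch; hdfs_lines is a name made up for illustration:

import subprocess

def hdfs_lines(path):
    """Yield lines of an HDFS file by streaming the output of `hadoop fs -cat`."""
    cat = subprocess.Popen(["hadoop", "fs", "-cat", path],
                           stdout=subprocess.PIPE)
    try:
        for line in cat.stdout:
            yield line
    finally:
        cat.stdout.close()
        if cat.wait() != 0:
            raise IOError("hadoop fs -cat failed for %s" % path)

for line in hdfs_lines("/path/to/myfile"):
    pass  # do some processing, one line at a time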
Answered By: Keith Randall

If you want to avoid adding external dependencies at any cost, Keith’s answer is the way to go. Pydoop, on the other hand, could make your life much easier:

import pydoop.hdfs as hdfs
with hdfs.open('/user/myuser/filename') as f:
    for line in f:
        do_something(line)

Regarding your concerns, Pydoop is actively developed and has been used in production for years at CRS4, mostly for computational biology applications.

Simone

Answered By: simleo

In the last two years, there has been a lot of activity around Hadoop Streaming. According to Cloudera, it is quite fast: http://blog.cloudera.com/blog/2013/01/a-guide-to-python-frameworks-for-hadoop/. I've had good success with it.
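
For reference, a Hadoop Streaming job simply runs an executable that reads records from stdin and writes key/value pairs to stdout, so the per-line Python processing ends up looking roughly like this (a minimal mapper sketch; the processing and the emitted value are placeholders):

#!/usr/bin/env python
# Minimal Hadoop Streaming mapper: Hadoop feeds each input line to stdin.
import sys

for line in sys.stdin:
    line = line.rstrip("\n")
    # do some processing, then emit a tab-separated key/value pair
    sys.stdout.write(line + "\t1\n")

The script is then passed to the streaming jar with -mapper; the jar's exact path varies by distribution.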

Answered By: Brian Dolan

You can use the WebHDFS Python Library (built on top of urllib3):

from hdfs import InsecureClient
from json import dump

client_hdfs = InsecureClient('http://host:port', user='root')
with client_hdfs.write(access_path) as writer:  # access_path and records are defined elsewhere
    dump(records, writer)  # tested for pickle and json (doesn't work for joblib)
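
Since the question is about reading, the same hdfs package also supports streaming reads; something along these lines should work (a sketch, assuming the file is newline-delimited UTF-8 text and the path is a placeholder):

from hdfs import InsecureClient

client_hdfs = InsecureClient('http://host:port', user='root')
with client_hdfs.read('/user/myuser/filename', encoding='utf-8', delimiter='\n') as reader:
    for line in reader:
        pass  # do some processing, one line at a time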

Or you can use the requests package in Python:

import requests
from json import dumps
params = (('op', 'CREATE'), ('buffersize', 256))
data = dumps(file)  # some file or object - also tested for pickle library
response = requests.put('http://host:port/path', params=params, data=data)  # response 200 = successful
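
For the read side with plain requests, WebHDFS exposes an OPEN operation, and requests can stream the response so the whole file never sits in memory at once (a sketch; host, port, and path are placeholders, and the redirect to a datanode is followed automatically):

import requests

url = 'http://host:port/webhdfs/v1/path/to/myfile'
response = requests.get(url, params={'op': 'OPEN'}, stream=True)
response.raise_for_status()
for line in response.iter_lines():
    pass  # do some processing; each line arrives as bytes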

Hope this helps!

Answered By: Ramsha Siddiqui