Read file object as string in python

Question:

I’m using urllib2 to read in a page. I need to do a quick regex on the source and pull out a few variables but urllib2 presents as a file object rather than a string.

I’m new to python so I’m struggling to see how I use a file object to do this. Is there a quick way to convert this into a string?

Asked By: Oli


Answers:

You can use Python in interactive mode to search for solutions.

If f is your object, you can enter dir(f) to see all of its methods and attributes. There’s one called read. Enter help(f.read) and it tells you that f.read() is the way to retrieve a string from a file object.
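As a rough sketch of that exploration, the snippet below uses io.BytesIO as a stand-in for the object urlopen() returns (an assumption made so it runs without a network connection). The sample bytes and the name public_names are made up for illustration.

```python
import io

# Stand-in for the file-like object; any object with a read() method behaves similarly.
f = io.BytesIO(b"<html><title>Example</title></html>")

# dir(f) lists every attribute; filter out the underscore-prefixed ones for readability.
public_names = [name for name in dir(f) if not name.startswith("_")]
print("read" in public_names)

# read() with no argument returns everything up to EOF.
data = f.read()
print(data)
```

Note that on Python 3 a binary stream like this one returns bytes rather than str, so you may need data.decode(...) before running a text regex on it.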

Answered By: stesch

From the documentation for file.read() (my emphasis):

file.read([size])

Read at most size bytes from the file (less if the read hits EOF before obtaining size bytes). If the size argument is negative or omitted, read all data until EOF is reached. The bytes are returned as a string object. An empty string is returned when EOF is encountered immediately. (For certain files, like ttys, it makes sense to continue reading after an EOF is hit.) Note that this method may call the underlying C function fread more than once in an effort to acquire as close to size bytes as possible. Also note that when in non-blocking mode, less data than was requested may be returned, even if no size parameter was given.
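To make the size behaviour concrete, here is a small sketch that uses an in-memory io.BytesIO stream instead of a real file or HTTP response (an assumption, chosen so the example is self-contained). The variable names are illustrative only.

```python
import io

f = io.BytesIO(b"0123456789")

first = f.read(4)   # at most 4 bytes are returned
rest = f.read()     # size omitted: read everything up to EOF
at_eof = f.read()   # already at EOF: an empty result comes back

print(first, rest, at_eof)
```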

Be aware that a regexp search on a large string object may not be efficient; consider doing the search line by line instead, using file.next() (a file object is its own iterator).
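A minimal sketch of the line-by-line approach, assuming the value you want sits on a single line. The page content, the user_id pattern, and the helper first_match are all hypothetical, and io.StringIO stands in for the real file object. A for loop is used because it drives the same iterator protocol as calling next() by hand.

```python
import io
import re

# Hypothetical page content; io.StringIO stands in for the open file object.
page = io.StringIO(
    "<html>\n"
    "<title>Voidspace</title>\n"
    "<p>user_id = 42</p>\n"
    "</html>\n"
)

pattern = re.compile(r"user_id = (\d+)")

def first_match(lines, regex):
    """Return the first match found while iterating line by line, or None."""
    for line in lines:  # the file object yields one line per iteration
        match = regex.search(line)
        if match:
            return match
    return None

match = first_match(page, pattern)
print(match.group(1) if match else "no match")
```

The trade-off is that a pattern spanning several lines will not be found this way; for that case, reading the whole body with read() is the simpler option.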

Answered By: gimel

Michael Foord, aka Voidspace, has an excellent tutorial on urllib2, which you can find here:
urllib2 – The Missing Manual

What you are doing should be pretty straightforward. Observe this sample code:

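import urllib2  # Python 2 only
import re

# urlopen() returns a file-like object; read() returns the whole body as a string.
response = urllib2.urlopen("http://www.voidspace.org.uk/python/articles/urllib2.shtml")
html = response.read()

# Case-insensitive search for the first "V...space" run in the page source.
pattern = r'(V.+space)'
wordPattern = re.compile(pattern, re.IGNORECASE)
results = wordPattern.search(html)
# search() returns None when nothing matches, so check before calling groups().
if results:
    print results.groups()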
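
The sample above targets Python 2. On Python 3, urllib2 was split into urllib.request and urllib.error, and read() returns bytes. A possible adaptation is sketched below. The helper find_word, the UTF-8 default, and the in-memory fake_response are assumptions added so the example runs offline.

```python
import io
import re
from urllib.request import urlopen  # Python 3 home of urllib2.urlopen

WORD_PATTERN = re.compile(r"(V.+space)", re.IGNORECASE)

def find_word(response, encoding="utf-8"):
    """Read a file-like response fully and return the first match's groups, or None."""
    # The encoding is assumed; a real page may declare a different one.
    html = response.read().decode(encoding, errors="replace")
    match = WORD_PATTERN.search(html)
    return match.groups() if match else None

# Offline stand-in for an HTTP response.
fake_response = io.BytesIO(b"<title>Voidspace: urllib2</title>")
print(find_word(fake_response))

# Against the live page (requires network access):
# with urlopen("http://www.voidspace.org.uk/python/articles/urllib2.shtml") as response:
#     print(find_word(response))
```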
Answered By: t3rse