Python, faster way to read fixed length fields from a file into dictionary
Question:
I have a file of names and addresses as follows (example line)
OSCAR ,CANNONS ,8 ,STIEGLITZ CIRCUIT
And I want to read it into a dictionary of name and value. Here self.field_list is a list of the start point, length and name of each fixed field in the file. What ways are there to speed up this method? (Python 2.6)
def line_to_dictionary(self, file_line, rec_num):
    file_line = file_line.lower()  # Make it all lowercase
    return_rec = {}  # Return record as a dictionary
    for (field_start, field_length, field_name) in self.field_list:
        field_data = file_line[field_start:field_start+field_length]
        if self.strip_fields == True:  # Strip off white spaces first
            field_data = field_data.strip()
        if field_data != '':  # Only add non-empty fields to dictionary
            return_rec[field_name] = field_data
    # Set hidden fields
    #
    return_rec['_rec_num_'] = rec_num
    return_rec['_dataset_name_'] = self.name
    return return_rec
Answers:
struct.unpack() combined with s specifiers with lengths will tear the string apart faster than slicing.
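Here is a rough sketch of what that could look like. The field names and widths are made up for illustration (they are not from the question), and the sample record has no commas. In Python 3 the struct module works on bytes, so the line is passed as bytes and decoded afterwards; in Python 2.6 a plain str can be unpacked directly.

```python
import struct

# Hypothetical layout: (field_name, field_length) for consecutive fields.
FIELDS = [("given_name", 6), ("surname", 8), ("number", 2), ("street", 17)]

# Builds the format "6s8s2s17s": one "s" specifier with a byte count per field.
RECORD = struct.Struct("".join("%ds" % length for _, length in FIELDS))


def line_to_dictionary(file_line):
    """Split one fixed-width record (bytes) into a dict of non-empty fields."""
    values = RECORD.unpack_from(file_line)
    record = {}
    for (name, _), raw in zip(FIELDS, values):
        text = raw.decode("ascii").strip().lower()
        if text:  # Only keep non-empty fields
            record[name] = text
    return record


record = line_to_dictionary(b"OSCAR CANNONS 8 STIEGLITZ CIRCUIT")
```

The format is compiled once into a Struct object, so each line costs a single unpack call. If the fields are not contiguous, the x pad-byte specifier can skip the gaps.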
If you want to get some speed up, you can also store field_start+field_length directly in self.field_list, instead of storing field_length.
Also, if field_data != '' can more simply be written as if field_data (if this gives any speed up, it is marginal, though).
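Put together, these two tweaks might look something like the sketch below. It is written as a plain function so it runs on its own; the layout, the dataset_name parameter and the strip_fields default are placeholders for whatever the class in the question holds.

```python
# Hypothetical layout in the question's (field_start, field_length, field_name) form.
FIELD_LIST = [(0, 6, "given_name"), (6, 8, "surname"),
              (14, 2, "number"), (16, 17, "street")]

# Computed once, not per line: replace each length with the end offset.
FIELD_ENDS = [(start, start + length, name)
              for start, length, name in FIELD_LIST]


def line_to_dictionary(file_line, rec_num, dataset_name, strip_fields=True):
    file_line = file_line.lower()
    return_rec = {}
    for field_start, field_end, field_name in FIELD_ENDS:
        field_data = file_line[field_start:field_end]
        if strip_fields:
            field_data = field_data.strip()
        if field_data:  # An empty string is false
            return_rec[field_name] = field_data
    # Set hidden fields
    return_rec['_rec_num_'] = rec_num
    return_rec['_dataset_name_'] = dataset_name
    return return_rec
```

The addition now happens once per field when the layout is built, instead of once per field per line.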
I would say that your method is quite fast, compared to what standard Python can do (i.e., without using non-standard, dedicated modules).
If your lines include commas like the example, you can use line.split(',') instead of several slices. This may prove to be faster.
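A minimal sketch of the split approach, assuming every line really has a comma between fields and no field contains a comma itself. The field names are invented for the example.

```python
# Hypothetical field names, in the order the fields appear on each line.
FIELD_NAMES = ["given_name", "surname", "number", "street"]


def line_to_dictionary(file_line):
    """Split on commas, strip each piece, and keep only non-empty fields."""
    parts = [part.strip() for part in file_line.lower().split(',')]
    return dict((name, part) for name, part in zip(FIELD_NAMES, parts) if part)


record = line_to_dictionary("OSCAR ,CANNONS ,8 ,STIEGLITZ CIRCUIT")
```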
You'll want to use the csv module.
It handles not only csv, but any csv-like format, which yours seems to be.
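For reference, reading such a file with csv could look roughly like this. The field names are assumptions, and an in-memory io.StringIO stands in for the real file so the snippet runs by itself. Like any delimiter-based approach, it only applies if the commas are genuine separators.

```python
import csv
import io

# Hypothetical field names, in column order.
FIELD_NAMES = ["given_name", "surname", "number", "street"]


def read_records(lines):
    """Yield one dict per row, dropping fields that are empty after stripping."""
    for row in csv.reader(lines):
        cleaned = [value.strip().lower() for value in row]
        yield dict((name, value)
                   for name, value in zip(FIELD_NAMES, cleaned) if value)


# Stand-in for an open file object.
sample = io.StringIO(u"OSCAR ,CANNONS ,8 ,STIEGLITZ CIRCUIT\n")
records = list(read_records(sample))
```

Unlike a bare split, csv.reader also copes with quoted fields that contain the delimiter.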
Edit: Just saw your remark below about commas. The approach below is fast when it comes to file reading, but it is delimiter-based, and would fail in your case. It’s useful in other cases, though.
If you want to read the file really fast, you can use a dedicated module, such as the almost standard Numpy:
data = numpy.loadtxt('file_name.txt', dtype='S10,S8', delimiter=',')  # dtype must be adapted to your column sizes
loadtxt() also allows you to process fields on the fly (with the converters argument). Numpy also allows you to give names to columns (see the doc), so that you can do:
data['name'][42] # Name # 42
The structure obtained is like an Excel array; it is quite memory efficient, compared to a dictionary.
If you really need to use a dictionary, you can use a dedicated loop over the data array read quickly by Numpy, in a way similar to what you have done.