Python, faster way to read fixed length fields from a file into dictionary

Question:

I have a file of names and addresses as follows (example line):

OSCAR    ,CANNONS      ,8     ,STIEGLITZ CIRCUIT

And I want to read it into a dictionary of field name and value. Here self.field_list is a list of (start point, length, name) tuples describing the fixed fields in the file. What ways are there to speed up this method? (Python 2.6)

def line_to_dictionary(self, file_line, rec_num):
  file_line = file_line.lower()  # Make it all lowercase

  return_rec = {}  # Return record as a dictionary

  for (field_start, field_length, field_name) in self.field_list:

    field_data = file_line[field_start:field_start+field_length]

    if self.strip_fields == True:  # Strip off white spaces first
      field_data = field_data.strip()

    if field_data != '':  # Only add non-empty fields to dictionary
      return_rec[field_name] = field_data

  # Set hidden fields
  #
  return_rec['_rec_num_'] = rec_num
  return_rec['_dataset_name_'] = self.name
  return return_rec      
Asked By: Martlark


Answers:

struct.unpack(), combined with s format specifiers of the appropriate lengths, will tear the string apart faster than slicing.
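For illustration, a minimal sketch of that approach, assuming the field widths of the example line (9, 13, 6 and 17 characters) and illustrative field names; real code would build the format string once from self.field_list:

import struct

# 's' fields with explicit widths pull the data; each 'x' skips a comma.
# The widths below are read off the example line, not a real field list.
# (Python 3's struct works on bytes, so the line would need encoding there.)
line = 'OSCAR    ,CANNONS      ,8     ,STIEGLITZ CIRCUIT'
fields = struct.unpack('9sx13sx6sx17s', line)
names = ('first_name', 'surname', 'number', 'street')
record = dict((name, value.strip().lower())
              for name, value in zip(names, fields)
              if value.strip())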

If you want some additional speed-up, you can also store field_start+field_length directly in self.field_list, instead of storing field_length.

Also, if field_data != '' can more simply be written as if field_data (any speed-up from this is marginal, though).
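Combining both tweaks, a sketch of the revised method (self.slice_list is a hypothetical attribute, precomputed once, e.g. in __init__):

# Precompute the slice end points once, e.g. in __init__:
#   self.slice_list = [(start, start + length, name)
#                      for (start, length, name) in self.field_list]

def line_to_dictionary(self, file_line, rec_num):
  file_line = file_line.lower()
  return_rec = {}
  for (field_start, field_end, field_name) in self.slice_list:
    field_data = file_line[field_start:field_end]  # end point precomputed
    if self.strip_fields:
      field_data = field_data.strip()
    if field_data:  # truthiness test instead of comparing to ''
      return_rec[field_name] = field_data
  return_rec['_rec_num_'] = rec_num
  return_rec['_dataset_name_'] = self.name
  return return_rec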

I would say that your method is quite fast, compared to what standard Python can do (i.e., without using non-standard, dedicated modules).

Answered By: Eric O Lebigot

If your lines include commas like the example, you can use line.split(',') instead of several slices. This may prove to be faster.
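For instance, a sketch assuming every line has exactly these four comma-separated fields (the names are illustrative):

file_line = 'OSCAR    ,CANNONS      ,8     ,STIEGLITZ CIRCUIT'
names = ('first_name', 'surname', 'number', 'street')
record = dict((name, part.strip())
              for name, part in zip(names, file_line.lower().split(','))
              if part.strip())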

Answered By: lunixbochs

You’ll want to use the csv module.

It handles not only CSV, but any CSV-like format, which yours seems to be.
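A minimal sketch (the field names are assumptions, and since csv only splits on the delimiter, the padding spaces still need stripping):

import csv

names = ['first_name', 'surname', 'number', 'street']
# 'rb' mode suits Python 2's csv module; Python 3 would use newline=''.
reader = csv.DictReader(open('file_name.txt', 'rb'), fieldnames=names)
for rec_num, row in enumerate(reader):
  record = dict((name, value.strip().lower())
                for name, value in row.items()
                if value and value.strip())
  record['_rec_num_'] = rec_num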

Answered By: e-satis

Edit: Just saw your remark below about commas. The approach below is fast when it comes to file reading, but it is delimiter-based, and would fail in your case. It’s useful in other cases, though.

If you want to read the file really fast, you can use a dedicated module, such as the almost standard Numpy:

import numpy

data = numpy.loadtxt('file_name.txt', delimiter=',',
                     dtype=[('first_name', 'S9'), ('surname', 'S13'),
                            ('number', 'S6'), ('street', 'S17')])
# The field names and widths in the dtype must be adapted to your columns.

loadtxt() also allows you to process fields on the fly (with its converters argument). NumPy also lets you give names to the columns (here, through the structured dtype above), so that you can do:

data['first_name'][42]  # first name of record #42

The structure obtained is like an Excel array; it is quite memory efficient, compared to a dictionary.

If you really need to use a dictionary, you can use a dedicated loop over the data array read quickly by Numpy, in a way similar to what you have done.
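For example, a sketch of that conversion ('my_dataset' stands in for self.name):

records = []
for rec_num, row in enumerate(data):
  rec = dict((name, row[name].strip().lower())
             for name in data.dtype.names
             if row[name].strip())
  rec['_rec_num_'] = rec_num
  rec['_dataset_name_'] = 'my_dataset'  # stand-in for self.name
  records.append(rec)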

Answered By: Eric O Lebigot