Python – numpy.loadtxt how to ignore end commas?

Question:

I’m trying to read in a file that looks like this:

1, 2,
3, 4,

I’m using the following line:

l1,l2 = numpy.loadtxt('file.txt',unpack=True,delimiter=', ')

This gives me an error because the end comma in each row is lumped together as the last element (e.g. “2” is read as “2,”). Is there a way to ignore the last comma in each row, with loadtxt or another function?

Asked By: ylangylang


Answers:

It’s fairly easy to roll your own file-reader in Python, rather than having to rely on the constraints of numpy.loadtxt:

content = [[float(x) for x in row.split(',') if x.strip()] for row in open(filename, 'rt')]
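For example, the same comprehension can be wrapped in a context manager (so the file is closed promptly) and transposed into the two columns the question asks for; the write step just re-creates the question's file so the snippet is self-contained:

```python
# Re-create the question's file so the snippet is self-contained.
with open('file.txt', 'wt') as f:
    f.write("1, 2,\n3, 4,\n")

# Split each row on commas, skipping the empty field left by the
# trailing comma, then transpose rows into columns.
with open('file.txt', 'rt') as f:
    content = [[float(x) for x in row.split(',') if x.strip()] for row in f]

l1, l2 = zip(*content)
print(l1, l2)   # (1.0, 3.0) (2.0, 4.0)
```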
Answered By: jez

numpy.genfromtxt is a bit more robust. If you use the default dtype (which is np.float64), it thinks there is a third column with missing values, so it creates a third column containing nan. If you give it dtype=None (which tells it to figure out the data type from the file), it returns a third column containing all zeros. Either way, you can ignore the last column by using usecols=[0, 1]:

In [14]: !cat trailing_comma.csv
1, 2,
3, 4,

Important note: I use delimiter=',', not delimiter=', '.

In [15]: np.genfromtxt('trailing_comma.csv', delimiter=',', dtype=None, usecols=[0,1])
Out[15]: 
array([[1, 2],
       [3, 4]])

In [16]: col1, col2 = np.genfromtxt('trailing_comma.csv', delimiter=',', dtype=None, usecols=[0,1], unpack=True)

In [17]: col1
Out[17]: array([1, 3])

In [18]: col2
Out[18]: array([2, 4])
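The nan column mentioned above is easy to see directly; a minimal sketch using io.StringIO in place of the file:

```python
import io
import numpy as np

# Same two rows as the question, trailing commas included.
data = io.StringIO("1, 2,\n3, 4,\n")

# With the default float dtype, genfromtxt treats the empty field
# after the trailing comma as a missing value and fills it with nan.
arr = np.genfromtxt(data, delimiter=',')
print(arr.shape)    # (2, 3)
```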
Answered By: Warren Weckesser

usecols also works with loadtxt:

Simulate a file with text split into lines:

In [162]: txt=b"""1, 2,
3,4,"""
In [163]: txt=txt.splitlines()
In [164]: txt
Out[164]: [b'1, 2,', b'3,4,']

In [165]: x,y=np.loadtxt(txt,delimiter=',',usecols=[0,1],unpack=True)
In [166]: x
Out[166]: array([ 1.,  3.])
In [167]: y
Out[167]: array([ 2.,  4.])

loadtxt and genfromtxt don’t handle multicharacter delimiters such as ', ' well; with delimiter=', ', the trailing comma has no following space, so it stays attached to the last value.

loadtxt and genfromtxt accept any iterable, including a generator. Thus you could open the file and process the lines one by one, removing the extra character.

In [180]: def g(txt):
   .....:     t = txt.splitlines()
   .....:     for l in t:
   .....:         yield l[:-1]

In [181]: list(g(txt))
Out[181]: [b'1, 2', b'3,4']

The generator yields the lines one by one, each stripped of its last character. The same idea could be changed to read a file line by line:

In [182]: x,y=np.loadtxt(g(txt),delimiter=',',unpack=True)
In [183]: x,y
Out[183]: (array([ 1.,  3.]), array([ 2.,  4.]))
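A file-reading version of the same idea might look like the sketch below; it uses rstrip(',') after stripping the newline, rather than blindly dropping the last character, so lines without a trailing comma also pass through unchanged:

```python
import numpy as np

# Re-create the question's file so the snippet is self-contained.
with open('file.txt', 'wt') as f:
    f.write("1, 2,\n3, 4,\n")

def no_trailing_comma(path):
    # Yield each line stripped of its newline and trailing comma,
    # so loadtxt only ever sees the real fields.
    with open(path, 'rt') as f:
        for line in f:
            yield line.rstrip().rstrip(',')

x, y = np.loadtxt(no_trailing_comma('file.txt'), delimiter=',', unpack=True)
print(x, y)
```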
Answered By: hpaulj

Depending on your needs this solution might be overkill, but when working with large sets of data files from external sources (especially Excel, but also binary, CSV, TSV, and others) I have found the pandas module to be a very convenient and efficient way to read and process data.

Given a data file test-data.txt having the following content

1, 2,
2, 3,
4, 5,

you can read the file by using

import pandas as pd
data = pd.read_csv("test-data.txt", names=("col1", "col2"), usecols=(0, 1))
In [25]: data
Out[25]: 
   col1  col2
0     1     2
1     2     3
2     4     5
In [26]: data.col1
Out[26]: 
0    1
1    2
2    4

The result is a DataFrame object with indexed rows and labeled columns that can be used for data access. If your data file contains a header, it is used directly to label the columns. Otherwise you can specify a label for each column with the names argument. The usecols argument lets you skip the third column, which would otherwise be read as a column of nan values.
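Getting plain NumPy arrays back out is then a one-liner; a small sketch using io.StringIO in place of test-data.txt:

```python
import io
import pandas as pd

csv_text = "1, 2,\n2, 3,\n4, 5,\n"   # same layout as test-data.txt
data = pd.read_csv(io.StringIO(csv_text), names=("col1", "col2"), usecols=(0, 1))

col1 = data["col1"].to_numpy()        # plain ndarray: [1, 2, 4]
```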

Answered By: MrCyclophil

I faced the same problem, and the solution I went with was using numpy.genfromtxt instead, overriding its delimiting behaviour to ignore the last element if it’s empty.

import numpy as np
from numpy.lib import npyio


def _cutoff_last(func, *args, **kwargs) -> list:
    line = func(*args, **kwargs)
    if line and line[-1] == '':
        line = line[:-1]
    return line


if __name__ == '__main__':
    # overwrite delimiting behavior
    _delim_splitter_original = npyio.LineSplitter._delimited_splitter
    npyio.LineSplitter._delimited_splitter = lambda *args: _cutoff_last(_delim_splitter_original, *args)

    mat = np.genfromtxt('mat.txt', delimiter=',')

This solution is inadvisable under most circumstances, as it changes the behaviour of numpy, and shouldn’t be relied on outside of small-ish scripts.

Answered By: Philipp