Read .csv with different delimiters/separators

Question:

I’m having a rough time trying to read this .csv file with pandas.

"Sección;""Descripción Sección"";""Servicio"";""Descripción Servicio"";""Programa"";""Descripción Programa"";""Capítulo"";""Descripción Capítulo"";""Subconcepto"";""Descripción Subconcepto"";""Proyecto de Inversión"";""Nombre"";""CT"";""Isla"";""Importe""",,,,
"01;""Parlamento"";""0101"";""Servicios Generales"";""911A"";""Actuación Legislativa y de Control "";""1"";""GASTOS DE PERSONAL"";""10000"";""Retrib.básic. y otras ret. del Gob.y altos Cargos"";"""";"""";"""";"""";3.836.041",,,,
"01;""Parlamento"";""0101"";""Servicios Generales"";""911A"";""Actuación Legislativa y de Control "";""2"";""GASTOS CORRIENTES EN BIENES Y SERVICIOS"";""21900"";""Otro inmovilizado material"";"""";"""";"""";"""";1.500",,,,
"01;""Parlamento"";""0101"";""Servicios Generales"";""911A"";""Actuación Legislativa y de Control "";""2"";""GASTOS CORRIENTES EN BIENES Y SERVICIOS"";""22001"";""Prensa", revistas," libros y otras publicaciones"";"""";"""";"""";"""";111.000",,
header = ["Sección", "Descripción Sección", "Servicio", "Descripción Servicio", "Programa", "Descripción Programa", "Capítulo", "Descripción Capítulo", "Subconcepto", "Descripción Subconcepto", "Proyecto de Inversión", "Nombre", "CT", "Isla", "Importe"] 

I’ve tried several approaches, such as a regex separator and reading it as a fixed-width table, but with no luck.

import pandas as pd

# With regex
data = pd.read_csv("file.csv", engine='python', sep='("";"")|("";)|(;"")')
# Table of fixed width
data = pd.read_fwf("file.csv")

Here is the desired output:

Sección,Descripción Sección,Servicio,Descripción Servicio,Programa,Descripción Programa,Capítulo,Descripción Capítulo,Subconcepto,Descripción Subconcepto,Proyecto de Inversión,Nombre,CT,Isla,Importe
1,Parlamento,101,Servicios Generales,911A,Actuación Legislativa y de Control,1,GASTOS DE PERSONAL,10000,Retrib.básic. y otras ret. del Gob.y altos Cargos,,,,,"3,836,041"
1,Parlamento,101,Servicios Generales,911A,Actuación Legislativa y de Control,2,GASTOS CORRIENTES EN BIENES Y SERVICIOS,21900,Otro inmovilizado material,,,,,"1,500"
1,Parlamento,101,Servicios Generales,911A,Actuación Legislativa y de Control,2,GASTOS CORRIENTES EN BIENES Y SERVICIOS,22001,"Prensa, revistas, libros y otras publicaciones",,,,,"111,000"

Thanks for your ideas!!

Asked By: user3262756

||

Answers:

As I mentioned in the comments, this one is especially nasty:

  1. It’s a regularly formatted csv with a comma delimiter, quote encapsulation, and a doubled double quote as the escape sequence. But all the data is in the first column.
  2. The data in the first column is itself delimited by semicolons and uses quote encapsulation (those quotes are escaped properly by the outer csv), but the quotes around inner values that contain commas are not escaped by either the outer csv or the inner csv, which is very odd. So commas inside the first column’s delimited data are treated as actual delimiters by the outer csv in which it’s wrapped, malforming the data.

It is, essentially, a csv in a csv, and because both layers use quote encapsulation but the innermost layer isn’t escaped properly, it’s malformed.
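
To make the nesting concrete, here is a minimal sketch that parses only the outer layer with Python’s standard csv module. The two lines in `sample` are shortened versions of the rows in the question (fewer columns, same quoting), so treat them as an illustration and not as the real file.

```python
import csv
import io

# Two lines shortened from the sample in the question (fewer columns, same quoting).
sample = (
    '"01;""Parlamento"";""21900"";""Otro inmovilizado material"";"""";1.500",,,,\n'
    '"01;""Parlamento"";""22001"";""Prensa", revistas," libros"";"""";111.000",,\n'
)

# Parse only the outer layer: comma-delimited, double quotes escaped by doubling.
outer_rows = list(csv.reader(io.StringIO(sample)))

for row in outer_rows:
    print(row)
```

The first row comes back as one semicolon-delimited string followed by four empty padding cells, while the second row is cut at the commas that were meant to be part of the description.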

One option is to just read each line in as an entire column and clean it up after it’s imported:

import pandas as pd

# Read each entire line into a single DataFrame column ('|' is assumed not to occur in the file)
df = pd.read_csv('file.csv', header=None, names=['data'], sep='|')
# Remove the trailing commas and every double quote
data = df.data.str.replace('[,]*$|"', '', regex=True)
# Split rows 1..end on semicolons to build a new DataFrame; row 0, split the same way, supplies the header
df_split = pd.DataFrame(data[1:].str.split(';').tolist(), columns=data[0].split(';'))
# display() is available in Jupyter/IPython; use print(df_split) elsewhere
display(df_split)

+-----+---------+---------------------+----------+----------------------+----------+------------------------------------+----------+-----------------------------------------+-------------+---------------------------------------------------+-----------------------+--------+----+------+-----------+
| idx | Sección | Descripción Sección | Servicio | Descripción Servicio | Programa |        Descripción Programa        | Capítulo |          Descripción Capítulo           | Subconcepto |              Descripción Subconcepto              | Proyecto de Inversión | Nombre | CT | Isla |  Importe  |
+-----+---------+---------------------+----------+----------------------+----------+------------------------------------+----------+-----------------------------------------+-------------+---------------------------------------------------+-----------------------+--------+----+------+-----------+
|   0 |      01 | Parlamento          |     0101 | Servicios Generales  | 911A     | Actuación Legislativa y de Control |        1 | GASTOS DE PERSONAL                      |       10000 | Retrib.básic. y otras ret. del Gob.y altos Cargos |                       |        |    |      | 3.836.041 |
|   1 |      01 | Parlamento          |     0101 | Servicios Generales  | 911A     | Actuación Legislativa y de Control |        2 | GASTOS CORRIENTES EN BIENES Y SERVICIOS |       21900 | Otro inmovilizado material                        |                       |        |    |      |     1.500 |
|   2 |      01 | Parlamento          |     0101 | Servicios Generales  | 911A     | Actuación Legislativa y de Control |        2 | GASTOS CORRIENTES EN BIENES Y SERVICIOS |       22001 | Prensa, revistas, libros y otras publicaciones    |                       |        |    |      |   111.000 |
+-----+---------+---------------------+----------+----------------------+----------+------------------------------------+----------+-----------------------------------------+-------------+---------------------------------------------------+-----------------------+--------+----+------+-----------+

This is a little heavy-handed, since it simply deletes every double quote even though your data may legitimately contain some (it’s very hard to tell whether it does, and any that exist are likely malformed anyway).
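
If deleting every double quote feels too blunt, a possible alternative is to peel the two layers apart with the standard csv module instead. This is only a sketch, assuming that the outer layer is valid comma-delimited csv and that the padding commas sit at the end of each line. `unwrap_nested_csv` is a name made up for this example, and the sample text is shortened from the question.

```python
import csv
import io

def unwrap_nested_csv(text):
    """Undo the outer comma layer, then parse the inner semicolon layer."""
    inner_lines = []
    for outer_row in csv.reader(io.StringIO(text)):
        # Re-joining with commas restores the commas that belonged to the data;
        # the padding commas end up at the end of the line and are stripped.
        inner_lines.append(",".join(outer_row).rstrip(","))
    # The inner layer is ordinary semicolon-delimited csv with quoted fields.
    return list(csv.reader(inner_lines, delimiter=";"))

# Header and one data row, shortened from the question (assumed sample).
sample = (
    '"Sección;""Servicio"";""Descripción Subconcepto"";""Isla"";""Importe""",,,,\n'
    '"01;""0101"";""Prensa", revistas," libros y otras publicaciones"";"""";111.000",,\n'
)

rows = unwrap_nested_csv(sample)
for row in rows:
    print(row)
```

The resulting list could then be handed to pandas, for example with `pd.DataFrame(rows[1:], columns=rows[0])`. Because the quotes are parsed and not deleted, a genuine double quote inside a value should survive.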

Answered By: JNevill

I think I know what happened: it looks like this was a normal semicolon-separated (;) file that was read and then written back out as if it were a comma-separated (,) file, which added a lot of unnecessary quotes.
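
That hypothesis can be checked with a small round trip through the standard csv module. The `original` line below is a guess at what one source row looked like before it was damaged (shortened, and not taken from the real file). The sketch reads it as comma-separated and writes it straight back out.

```python
import csv
import io

# Hypothetical original line: semicolon-delimited, with quoted text fields.
original = '01;"Parlamento";"22001";"Prensa, revistas, libros";"";111.000\n'

# Step 1: read it as if it were comma-separated. The quotes are not at the
# start of a field, so they are kept as literal characters.
fields = next(csv.reader(io.StringIO(original)))

# Step 2: write the pieces back out as comma-separated csv. Any piece that
# contains a quote gets wrapped in quotes, and its quotes are doubled.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerow(fields)
mangled = buffer.getvalue()

print(fields)
print(mangled)
```

The printed line should show the same quoting pattern as the third data row in the question: doubled quotes throughout and a split at each comma inside the description. The padding commas at the end of the real lines are not reproduced here; they presumably come from rows being padded to a common width.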

So, another solution would have been to just search and replace using a placeholder for the real quotes:

  1. First search and replace "" with "$QUOTE$" (so quote-placeholder-quote)
  2. then " with empty (i.e. remove all " quote characters)
  3. and finally replace the placeholder $QUOTE$ with "

So in code something like:

# Assumes the input file is UTF-8 encoded
with open('file.csv', 'r', encoding='utf8') as infile, \
     open('file_output.csv', 'w', encoding='utf8') as outfile:
    for line in infile:
        # Protect each doubled quote with a placeholder, remove the remaining
        # lone quotes, then turn the placeholder back into a single quote
        adjline = line.replace('""', '"$QUOTE$"').replace('"', '').replace('$QUOTE$', '"')
        outfile.write(adjline)
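
For a quick check without touching any files, the same three replacements can be applied to a single string and the result parsed as semicolon-separated data. In this sketch `strip_outer_quotes` is a hypothetical helper name, the sample line is shortened from the question, and it assumes the placeholder text never occurs in the data. The replacements leave the trailing padding commas in place, so they are stripped separately here.

```python
import csv

def strip_outer_quotes(line, placeholder="$QUOTE$"):
    """Apply the three search-and-replace steps to one line of text."""
    return (
        line.replace('""', '"' + placeholder + '"')  # protect doubled quotes
            .replace('"', '')                        # drop the lone outer quotes
            .replace(placeholder, '"')               # restore one real quote
    )

# One data row shortened from the question (assumed sample).
sample_line = '"01;""Parlamento"";""Prensa", revistas," libros"";"""";111.000",,\n'

# Strip the padding commas and the newline that the replacements leave behind.
cleaned = strip_outer_quotes(sample_line).rstrip(",\n")

# The cleaned line is ordinary semicolon-delimited csv again.
parsed = next(csv.reader([cleaned], delimiter=";"))
print(cleaned)
print(parsed)
```

If the trailing commas are not removed, they end up attached to the last field (`Importe`) when the cleaned file is read with a semicolon separator.
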
Answered By: BdR