convert em-dash to hyphen in python

Question:

I’m converting csv files into python Dataframe. And in the original file, one of the column has characters em-dash. I want it replaced by hyphen “-“.

Partial original file from csv:

 NoDemande     NoUsager     Sens    IdVehicule     NoConduteur     HeureDebutTrajet    HeureArriveeSurSite    HeureEffective'
42192001801   42192002715    —        157Véh       42192000153    ...
42192000003   42192002021    +        157Véh       42192000002    ...
42192001833   42192000485    —      324My3FVéh     42192000157    ...

My code:

#coding=latin-1
import pandas as pd
import glob

pd.set_option('expand_frame_repr', False)

path = r'D:Python27mypfedata_test'
allFiles = glob.glob(path + "/*.csv")
frame = pd.DataFrame()
list_ = []
for file_ in allFiles:
    df = pd.read_csv(file_,index_col=None,header=0,sep=';',parse_dates=['HeureDebutTrajet','HeureArriveeSurSite','HeureEffective'],
                      dayfirst=True)
    df['Sens'].replace(u"u2014","-",inplace=True,regex=True)
    list_.append(df)

And it doesn’t work at all, every time it only convert them into ?, which looks like it:

42191001122  42191002244    ?            181Véh   42191000114  ...
42191001293  42191001203    ?         319M9pVéh   42191000125  ...
42191000700  42191000272    ?            183Véh   42191000072  ...

And because I have french characters in the file, I’m using latin-1 instead of utf-8. If I delete the first line and write like this:

df = pd.read_csv(file_,index_col=None,header=0,sep=';',encoding='windows-1252',parse_dates=['HeureDebutTrajet','HeureArriveeSurSite','HeureEffective'],
                          dayfirst=True)

The result will be:

42191001122  42191002244  â??           181Véh   42191000114   ...
42191001293  42191001203  â??        319M9pVéh   42191000125   ...
42191000700  42191000272  â??           183Véh   42191000072   ...

How can I make all the em-dash replaced by -?

I added the part about repr:

for line in open(file_):
    print repr(line)

And the result turns out:

'"42191002384";"42191000118";"xe2x80x94";"";"42191000182";...
'"42191002464";"42191001671";"+";"";"42191000182";...
'"42191000045";"42191000176";"xe2x80x94";"620M9pVxc3xa9h";"42191000003";...
'"42191001305";"42191000823";"xe2x80x94";"310V7pVxc3xa9h";"42191000126";...
Asked By: ch36r5s

||

Answers:

u'u2014' (EM DASH) can not be encoded in latin1/iso-8859-1, so that value can not appear in a properly encoded latin1 file.

Possibly the files are encoded as windows-1252 for which u'u2014' can be encoded as 'x97'.

Another problem is that the CSV file apparently uses whitespace as the column separator, but your code uses semicolons. You can specify whitespace as the separator using delim_whitespace=True:

df = pd.read_csv(file_, delim_whitespace=True)

You can also specify the file’s encoding using the encoding parameter. read_csv() will convert the incoming data to unicode:

df = pd.read_csv(file_, encoding='windows-1252', delim_whitespace=True)

In Python 2 (I think that you’re using that), if you do not specify the encoding, the data remains in the original encoding, and this is probably the reason that your replacements are not working.

Once you have properly loaded the file, you can replace characters as you have been doing:

df = pd.read_csv(file_, encoding='windows-1252', delim_whitespace=True)
df['Sens'].replace(u'u2014', '-', inplace=True)

EDIT

Following your update where you show the repr() output, your file would appear to be UTF8 encoded, not latin1, and not Windows-1252. Since you are using Python 2 you need to specify the encoding when loading the CSV file:

df = pd.read_csv(file_, sep=';', encoding='utf8')
df['Sens'].replace(u'u2014', '-', inplace=True)

Because you specified an encoding, read_csv() will convert the incoming data to unicode, so replace() should now work as shown above. It should be that easy.

Answered By: mhawke

First of all many thanks for all your answer, let me try to give you more info:
I am listing a directory content:
# Create a Path object
desktop = pathlib.Path(r"W:Persoblabla")
result = list(desktop.glob("*.pdf")) # I want to get all my pdf
# From here my goal is to rename my pdf files.
# initial format: xx(integer)-string—rubishString.pdf The first dash is alt+045 the second one is alt+0151. I am looking for to split on – in order to rename my files in keeping only the string and get rid off XX and rubishString.
The problem I got is that python can split on -(alt+045) but not on — (alt+0151) when I am looking for splitting on both.enter code here
Hopefully it is a bit more comprehensible 🙁

Answered By: josh