Replace ^M (Control-M character) in a text file in Python

Question:

The file is like this:

This line has control character ^M this is bad
I will try it

I want to remove the Control-M characters from the file and create a new file like this, using Python:

This line has control character  this is bad
I will try it

I tried the methods I found on Stack Overflow, using replacements like this:

line.replace("\r", "\r")

and

line.replace("\r\n", "\r")

Here is part of the code snippet:

with open(file_path, "r") as input_file:
    lines = input_file.readlines()

new_lines = []
for line in lines:
    new_line = line.replace("\r", "")
    new_lines.append(new_line)

new_file_name = "replace_control_char.dat"
new_file_path = os.path.join(here, data_dir, new_file_name)
with open(new_file_path, "w") as output_file:
    for line in new_lines:
        output_file.write(line)

However, the new file I got is:

This line has control character
 this is bad
I will try it

"This line has control character" and " this is bad" are not on the same line. I expected that removing the Control-M character would put these two phrases on the same line.
Can someone help me solve this issue?

Thanks,
Arthur

Asked By: Arthur

||

Answers:

You cannot rely on text mode in that case.

Windows understands a lone \r as a line break (even though the "official" line terminator is \r\n), and on classic Macintosh systems the line terminator can be \r alone. Text mode converts each of these to \n, or removes the \r when it is followed by \n, so it destroys the information you need.

Universal newlines, which are enabled by default, make this code fail on Unix/Linux as well. Python behaves the same on all platforms:

Python doesn’t depend on the underlying operating system’s notion of text files; all the processing is done by Python itself, and is therefore platform-independent.
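To see the effect described above, here is a small sketch that writes the question's sample text to a temporary file and reads it back twice, once in default text mode and once in binary mode. The sample bytes and variable names are made up for illustration; only standard-library calls are used.

```python
import os
import tempfile

# Sample data with a stray carriage return (shown as ^M by many editors).
raw = b"This line has control character \r this is bad\nI will try it\n"

# Write the bytes to a temporary file (hypothetical stand-in for the real input).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(raw)
    path = tmp.name

# Default text mode: universal newlines translate the lone \r into \n.
with open(path, "r") as f:
    text_read = f.read()

# Binary mode: the bytes come back untouched, so the \r is still visible.
with open(path, "rb") as f:
    binary_read = f.read()

os.remove(path)

print(repr(text_read))
print(repr(binary_read))
```

In the text-mode result there is no \r left to replace, which is why the replace call in the question appears to do nothing and the line stays split.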

If you want to remove those, you have to use binary mode.

with open(file_path, "rb") as input_file:
    contents = input_file.read().replace(b"\r", b"")
with open(file_path, "wb") as output_file:
    output_file.write(contents)

That code removes all \r characters, including those that are part of \r\n line terminators. That works, but if your aim is just to remove stray \r characters and preserve the line endings, another method is required.

One way to do it is to use a regular expression, which can accept binary (bytes) as well:

re.sub(rb"\r([^\n])", rb"\1", contents)

That regular expression removes \r characters only when they are not followed by \n, effectively preserving the CR+LF Windows end-of-line sequences.
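One caveat: a pattern that captures the following character needs a character to be there, so a \r at the very end of the data would be left in place. A possible variant, sketched here with a negative lookahead, avoids that. The helper name strip_stray_cr and the sample bytes are hypothetical and not part of the original answer.

```python
import re


def strip_stray_cr(data: bytes) -> bytes:
    # Remove every carriage return that is NOT immediately followed by a
    # line feed; CR+LF pairs are left untouched.
    return re.sub(rb"\r(?!\n)", b"", data)


# Stray CR in the middle, two CR+LF line endings, and a trailing CR.
sample = b"bad \r line\r\nnext line\r\ntrailing\r"
cleaned = strip_stray_cr(sample)
print(cleaned)
```

Because the lookahead does not consume the next byte, no replacement group is needed and the trailing \r is removed as well.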