How can I remove the ANSI escape sequences from a string in Python?
Question:
Here is a snippet that includes my string.
'ls\r\n\x1b[00m\x1b[01;31mexamplefile.zip\x1b[00m\r\n\x1b[01;31m'
The string was returned from an SSH command that I executed. I can’t use the string in its current state because it contains ANSI standardized escape sequences. How can I programmatically remove the escape sequences so that the only part of the string remaining is 'examplefile.zip'?
Answers:
Delete them with a regular expression:
import re

# 7-bit C1 ANSI sequences
ansi_escape = re.compile(r'''
    \x1B  # ESC
    (?:   # 7-bit C1 Fe (except CSI)
        [@-Z\\-_]
    |     # or [ for CSI, followed by a control sequence
        \[
        [0-?]*  # Parameter bytes
        [ -/]*  # Intermediate bytes
        [@-~]   # Final byte
    )
''', re.VERBOSE)
result = ansi_escape.sub('', sometext)
or, without the VERBOSE flag, in condensed form:
ansi_escape = re.compile(r'\x1B(?:[@-Z\\-_]|\[[0-?]*[ -/]*[@-~])')
result = ansi_escape.sub('', sometext)
Demo:
>>> import re
>>> ansi_escape = re.compile(r'\x1B(?:[@-Z\\-_]|\[[0-?]*[ -/]*[@-~])')
>>> sometext = 'ls\r\n\x1b[00m\x1b[01;31mexamplefile.zip\x1b[00m\r\n\x1b[01;31m'
>>> ansi_escape.sub('', sometext)
'ls\r\nexamplefile.zip\r\n'
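To get from the cleaned string to just the filename, one option is to split the result into lines afterwards. This is only a minimal sketch: the helper name visible_lines is made up for this example, and it assumes the filename is the last non-empty line of the command output.

```python
import re

# Same 7-bit pattern as in the condensed form of this answer.
ANSI_ESCAPE = re.compile(r'\x1B(?:[@-Z\\-_]|\[[0-?]*[ -/]*[@-~])')

def visible_lines(text):
    """Strip ANSI sequences, then return the non-empty lines (hypothetical helper)."""
    cleaned = ANSI_ESCAPE.sub('', text)
    # splitlines() handles the \r\n line endings left over from the SSH session.
    return [line for line in cleaned.splitlines() if line.strip()]

sometext = 'ls\r\n\x1b[00m\x1b[01;31mexamplefile.zip\x1b[00m\r\n\x1b[01;31m'
lines = visible_lines(sometext)
print(lines)      # ['ls', 'examplefile.zip']
print(lines[-1])  # examplefile.zip
```

The first line is the echoed command itself, which is why the sketch picks the last entry rather than the first.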
The above regular expression covers all 7-bit ANSI C1 escape sequences, but not the 8-bit C1 escape sequence openers. The latter are rarely used in today’s UTF-8 world, where the same range of bytes has a different meaning.
If you do need to cover the 8-bit codes too (and are then, presumably, working with bytes values), then the regular expression becomes a bytes pattern like this:
# 7-bit and 8-bit C1 ANSI sequences
ansi_escape_8bit = re.compile(br'''
    (?: # either 7-bit C1, two bytes, ESC Fe (omitting CSI)
        \x1B
        [@-Z\\-_]
    |   # or a single 8-bit byte Fe (omitting CSI)
        [\x80-\x9A\x9C-\x9F]
    |   # or CSI + control codes
        (?: # 7-bit CSI, ESC [
            \x1B\[
        |   # 8-bit CSI, 9B
            \x9B
        )
        [0-?]*  # Parameter bytes
        [ -/]*  # Intermediate bytes
        [@-~]   # Final byte
    )
''', re.VERBOSE)
result = ansi_escape_8bit.sub(b'', somebytesvalue)
which can be condensed down to
# 7-bit and 8-bit C1 ANSI sequences
ansi_escape_8bit = re.compile(
    br'(?:\x1B[@-Z\\-_]|[\x80-\x9A\x9C-\x9F]|(?:\x1B\[|\x9B)[0-?]*[ -/]*[@-~])'
)
result = ansi_escape_8bit.sub(b'', somebytesvalue)
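A short sketch of the bytes pattern in use may help show why it should be kept away from UTF-8 data. The sample values below are invented for illustration; the second one shows a UTF-8 encoded character losing a byte because its continuation byte falls in the 0x80-0x9F range.

```python
import re

# Condensed 7-bit and 8-bit pattern from this answer.
ansi_escape_8bit = re.compile(
    br'(?:\x1B[@-Z\\-_]|[\x80-\x9A\x9C-\x9F]|(?:\x1B\[|\x9B)[0-?]*[ -/]*[@-~])'
)

# Hypothetical input mixing an 8-bit CSI (0x9B) with a 7-bit CSI (ESC [).
raw = b'\x9b1;31mexamplefile.zip\x1b[0m\r\n'
cleaned = ansi_escape_8bit.sub(b'', raw)
print(cleaned)  # b'examplefile.zip\r\n'

# Caution: 'À' is b'\xc3\x80' in UTF-8, and 0x80 looks like an 8-bit C1 byte,
# so the pattern removes it and leaves an invalid UTF-8 fragment behind.
damaged = ansi_escape_8bit.sub(b'', 'À'.encode('utf-8'))
print(damaged)  # b'\xc3'
```

For text that arrives as UTF-8, decoding first and applying the 7-bit str pattern is the safer route.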
For more information, see:
- the ANSI escape codes overview on Wikipedia
- ECMA-48 standard, 5th edition (especially sections 5.3 and 5.4)
The example you gave contains 4 CSI (Control Sequence Introducer) codes, as marked by the \x1B[ or ESC [ opening bytes, and each contains an SGR (Select Graphic Rendition) code, because they each end in m. The parameters (separated by ; semicolons) in between those tell your terminal what graphic rendition attributes to use. So for each \x1B[....m sequence, the 3 codes that are used are:
- 0 (or 00 in this example): reset, disable all attributes
- 1 (or 01 in the example): bold
- 31: red (foreground)
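To see those parameters for yourself, the SGR sequences can be pulled apart with a narrower pattern. This is an illustrative sketch only; the function name sgr_parameters is an assumption of this example, and the pattern deliberately matches nothing but CSI sequences ending in m.

```python
import re

# Matches only CSI ... m (SGR) sequences and captures the parameter string.
SGR = re.compile(r'\x1B\[([0-9;]*)m')

def sgr_parameters(text):
    """Return one list of integer SGR parameters per sequence found."""
    found = []
    for match in SGR.finditer(text):
        # An empty parameter defaults to 0 (reset), so ESC[m equals ESC[0m.
        found.append([int(p or 0) for p in match.group(1).split(';')])
    return found

sometext = 'ls\r\n\x1b[00m\x1b[01;31mexamplefile.zip\x1b[00m\r\n\x1b[01;31m'
print(sgr_parameters(sometext))  # [[0], [1, 31], [0], [1, 31]]
```

The four inner lists correspond to the four CSI codes in the example: reset, bold red, reset, bold red.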
However, there is more to ANSI than just CSI SGR codes. With CSI alone you can also control the cursor, clear lines or the whole display, or scroll (provided the terminal supports this of course). And beyond CSI, there are codes to select alternative fonts (SS2 and SS3), to send ‘private messages’ (think passwords), to communicate with the terminal (DCS), the OS (OSC), or the application itself (APC, a way for applications to piggy-back custom control codes on to the communication stream), and further codes to help define strings (SOS, Start of String; ST, String Terminator) or to reset everything back to a base state (RIS). The above regexes cover all of these.
Note that the above regex only removes the ANSI C1 codes, however, and not any additional data that those codes may be marking up (such as the strings sent between an OSC opener and the terminating ST code). Removing those would require additional work outside the scope of this answer.
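As a rough idea of what that additional work could look like for OSC strings specifically, the sketch below removes the OSC payload before applying the 7-bit pattern. It assumes the string is terminated either by ST (ESC \) or by BEL (0x07, a terminator commonly emitted for xterm-style title sequences); the names OSC_STRING and strip_ansi_with_osc are invented for this example, and DCS, APC, PM and SOS strings are not handled.

```python
import re

# 7-bit pattern from this answer.
ANSI_ESCAPE = re.compile(r'\x1B(?:[@-Z\\-_]|\[[0-?]*[ -/]*[@-~])')

# Assumption: an OSC string runs from ESC ] up to ST (ESC \) or BEL.
OSC_STRING = re.compile(r'\x1B\][^\x07\x1B]*(?:\x07|\x1B\\)')

def strip_ansi_with_osc(text):
    """Remove OSC strings including their payload, then the remaining codes."""
    return ANSI_ESCAPE.sub('', OSC_STRING.sub('', text))

sample = '\x1b]0;user@laptop1:/tmp\x1b\\$ ls\x1b[0m'
print(repr(ANSI_ESCAPE.sub('', sample)))    # '0;user@laptop1:/tmp$ ls'
print(repr(strip_ansi_with_osc(sample)))    # '$ ls'
```

The first print shows the leftover window-title payload when only the codes are removed; the second shows it gone.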
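If you want to remove the \r\n bit, you can pass the string through this function (written by sarnold):
def stripEscape(string):
    """ Removes all control characters (code points 1-31) from the input string """
    delete = ""
    i = 1
    while i < 0x20:
        delete += chr(i)
        i += 1
    # Python 3: str.translate takes a mapping, so build one with str.maketrans
    # (the Python 2 form was string.translate(None, delete))
    t = string.translate(str.maketrans("", "", delete))
    return t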
Careful though, this will lump together the text in front of and behind the escape sequences. So, using Martijn’s filtered string 'ls\r\nexamplefile.zip\r\n', you will get lsexamplefile.zip. Note the ls in front of the desired filename.
Order matters when combining the two: stripEscape also deletes the ESC byte (0x1B), so I would apply Martijn’s regular expression first and stripEscape afterwards. In the reverse order the regex no longer finds anything to match, and remnants such as [01;31m stay in the text.
The accepted answer only takes into account ANSI standardized escape sequences that are formatted to alter foreground colors and text style.
Many sequences do not end in 'm', such as cursor positioning, erasing, and scroll regions. The pattern below attempts to cover all cases beyond setting foreground color and text style.
Below is the regular expression for ANSI standardized control sequences:
/(\x9B|\x1B\[)[0-?]*[ -/]*[@-~]/
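In Python the same pattern can be written as a raw string. The sketch below applies it to an invented sample containing clear-screen, cursor-positioning and erase-line sequences, none of which end in m; the name CSI_SEQUENCE is chosen for this example only.

```python
import re

# Python form of the pattern above: 8-bit CSI (0x9B) or ESC [, then
# parameter bytes, intermediate bytes, and one final byte.
CSI_SEQUENCE = re.compile(r'(?:\x9B|\x1B\[)[0-?]*[ -/]*[@-~]')

# ESC[2J clears the display, ESC[4;3H moves the cursor, ESC[K erases the line.
sample = '\x1b[2J\x1b[4;3Hhello\x1b[K world\x1b[0m'
print(CSI_SEQUENCE.sub('', sample))  # hello world
```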
Function
Based on Martijn Pieters’s answer with Jeff’s regexp.
def escape_ansi(line):
    ansi_escape = re.compile(r'(?:\x1B[@-_]|[\x80-\x9F])[0-?]*[ -/]*[@-~]')
    return ansi_escape.sub('', line)
Test
def test_remove_ansi_escape_sequence(self):
    line = '\t\u001b[0;35mBlabla\u001b[0m \u001b[0;36m172.18.0.2\u001b[0m'
    escaped_line = escape_ansi(line)
    self.assertEqual(escaped_line, '\tBlabla 172.18.0.2')
Testing
If you want to run it yourself, use python3 (better Unicode support, blablabla). Here is how the test file should look:
import unittest
import re

def escape_ansi(line):
    …

class TestStringMethods(unittest.TestCase):
    def test_remove_ansi_escape_sequence(self):
        …

if __name__ == '__main__':
    unittest.main()
The suggested regex didn’t do the trick for me so I created one of my own.
The following is a python regex that I created based on the spec found here
ansi_regex = r'\x1b(' \
             r'(\[\??\d+[hl])|' \
             r'([=<>a-kzNM78])|' \
             r'([\(\)][a-b0-2])|' \
             r'(\[\d{0,2}[ma-dgkjqi])|' \
             r'(\[\d+;\d+[hfy]?)|' \
             r'(\[;?[hf])|' \
             r'(#[3-68])|' \
             r'([01356]n)|' \
             r'(O[mlnp-z]?)|' \
             r'(/Z)|' \
             r'(\d+)|' \
             r'(\[\?\d;\d0c)|' \
             r'(\d;\dR))'
ansi_escape = re.compile(ansi_regex, flags=re.IGNORECASE)
I tested my regex on the following snippet (basically a copy paste from the ascii-table.com page)
\x1b[20h Set
\x1b[?1h Set
\x1b[?3h Set
\x1b[?4h Set
\x1b[?5h Set
\x1b[?6h Set
\x1b[?7h Set
\x1b[?8h Set
\x1b[?9h Set
\x1b[20l Set
\x1b[?1l Set
\x1b[?2l Set
\x1b[?3l Set
\x1b[?4l Set
\x1b[?5l Set
\x1b[?6l Set
\x1b[?7l Reset
\x1b[?8l Reset
\x1b[?9l Reset
\x1b= Set
\x1b> Set
\x1b(A Set
\x1b)A Set
\x1b(B Set
\x1b)B Set
\x1b(0 Set
\x1b)0 Set
\x1b(1 Set
\x1b)1 Set
\x1b(2 Set
\x1b)2 Set
\x1bN Set
\x1bO Set
\x1b[m Turn
\x1b[0m Turn
\x1b[1m Turn
\x1b[2m Turn
\x1b[4m Turn
\x1b[5m Turn
\x1b[7m Turn
\x1b[8m Turn
\x1b[1;2 Set
\x1b[1A Move
\x1b[2B Move
\x1b[3C Move
\x1b[4D Move
\x1b[H Move
\x1b[;H Move
\x1b[4;3H Move
\x1b[f Move
\x1b[;f Move
\x1b[1;2 Move
\x1bD Move/scroll
\x1bM Move/scroll
\x1bE Move
\x1b7 Save
\x1b8 Restore
\x1bH Set
\x1b[g Clear
\x1b[0g Clear
\x1b[3g Clear
\x1b#3 Double-height
\x1b#4 Double-height
\x1b#5 Single
\x1b#6 Double
\x1b[K Clear
\x1b[0K Clear
\x1b[1K Clear
\x1b[2K Clear
\x1b[J Clear
\x1b[0J Clear
\x1b[1J Clear
\x1b[2J Clear
\x1b5n Device
\x1b0n Response:
\x1b3n Response:
\x1b6n Get
\x1b[c Identify
\x1b[0c Identify
\x1b[?1;20c Response:
\x1bc Reset
\x1b#8 Screen
\x1b[2;1y Confidence
\x1b[2;2y Confidence
\x1b[2;9y Repeat
\x1b[2;10y Repeat
\x1b[0q Turn
\x1b[1q Turn
\x1b[2q Turn
\x1b[3q Turn
\x1b[4q Turn
\x1b< Enter/exit
\x1b= Enter
\x1b> Exit
\x1bF Use
\x1bG Use
\x1bA Move
\x1bB Move
\x1bC Move
\x1bD Move
\x1bH Move
\x1b12 Move
\x1bI
\x1bK
\x1bJ
\x1bZ
\x1b/Z
\x1bOP
\x1bOQ
\x1bOR
\x1bOS
\x1bA
\x1bB
\x1bC
\x1bD
\x1bOp
\x1bOq
\x1bOr
\x1bOs
\x1bOt
\x1bOu
\x1bOv
\x1bOw
\x1bOx
\x1bOy
\x1bOm
\x1bOl
\x1bOn
\x1bOM
\x1b[i
\x1b[1i
\x1b[4i
\x1b[5i
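A check along the following lines is one way to run such a test. It is a sketch rather than the harness actually used for this answer: it takes a handful of sequences from the list, uses fullmatch to confirm the pattern consumes each one completely, and adds one OSC sequence (not part of the list) that the pattern does not recognize.

```python
import re

# The pattern from this answer, assembled with implicit string concatenation.
ansi_regex = (
    r'\x1b('
    r'(\[\??\d+[hl])|'
    r'([=<>a-kzNM78])|'
    r'([\(\)][a-b0-2])|'
    r'(\[\d{0,2}[ma-dgkjqi])|'
    r'(\[\d+;\d+[hfy]?)|'
    r'(\[;?[hf])|'
    r'(#[3-68])|'
    r'([01356]n)|'
    r'(O[mlnp-z]?)|'
    r'(/Z)|'
    r'(\d+)|'
    r'(\[\?\d;\d0c)|'
    r'(\d;\dR))'
)
ansi_escape = re.compile(ansi_regex, flags=re.IGNORECASE)

# A few sequences taken from the list above.
samples = ['\x1b[20h', '\x1b[0m', '\x1b[4;3H', '\x1b=', '\x1b#3', '\x1bOP']
for seq in samples:
    # fullmatch succeeds only if the whole sequence is consumed.
    print(repr(seq), ansi_escape.fullmatch(seq) is not None)

# An OSC window-title sequence is outside what this pattern covers.
print(ansi_escape.fullmatch('\x1b]0;title\x07') is not None)  # False
```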
Hopefully this will help others 🙂
If it helps future Stack Overflowers: I was using the crayons library to give my Python output a bit more visual impact, which is advantageous as it works on both Windows and Linux platforms. However, I was both displaying on screen and appending to log files, and the escape sequences were hurting the legibility of the log files, so I wanted to strip them out. However, the escape sequences inserted by crayons produced an error:
expected string or bytes-like object
The solution was to cast the parameter to a string, so only a tiny modification to the commonly accepted answer was needed:
def escape_ansi(line):
    ansi_escape = re.compile(r'(\x9B|\x1B\[)[0-9:;<=>?]*[ -/]*[@-~]')
    return ansi_escape.sub('', str(line))
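The error arises whenever the value handed to sub() is not a str or bytes object. The sketch below reproduces that with a made-up stand-in class (ColoredStandIn is hypothetical and is not the crayons API) and shows the str() cast resolving it; the pattern is also compiled once at module level instead of on every call.

```python
import re

# Compiled once; same CSI pattern as in the function above.
ANSI_CSI = re.compile(r'(?:\x9B|\x1B\[)[0-?]*[ -/]*[@-~]')

class ColoredStandIn:
    """Hypothetical stand-in for an object that renders to colored text."""
    def __init__(self, text):
        self.text = text

    def __str__(self):
        return '\x1b[31m' + self.text + '\x1b[0m'

def escape_ansi(line):
    # str() turns any printable object into text before the regex sees it.
    return ANSI_CSI.sub('', str(line))

value = ColoredStandIn('error')
try:
    ANSI_CSI.sub('', value)      # not a str: re raises TypeError
    raised = False
except TypeError:
    raised = True

print(raised)               # True
print(escape_ansi(value))   # error
```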
For 2020 with Python 3.5 it is as easy as string.encode().decode('ascii'):
ascii_string = 'ls\r\n\x1b[00m\x1b[01;31mexamplefile.zip\x1b[00m\r\n\x1b[01;31m'
decoded_string = ascii_string.encode().decode('ascii')
print(decoded_string)
>ls
>examplefile.zip
>
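It may be worth checking what this round trip actually changes before relying on it. For a string that contains only ASCII characters, encoding and decoding returns an equal string, so the escape bytes are still present; a terminal that interprets them will simply not show them when printing. A minimal check using the string from the question:

```python
ascii_string = 'ls\r\n\x1b[00m\x1b[01;31mexamplefile.zip\x1b[00m\r\n\x1b[01;31m'
decoded_string = ascii_string.encode().decode('ascii')

# The round trip leaves a pure-ASCII string unchanged.
print(decoded_string == ascii_string)  # True

# The ESC characters are therefore still in the data.
print('\x1b' in decoded_string)        # True

# repr() shows the raw content instead of letting the terminal interpret it.
print(repr(decoded_string))
```

If the cleaned text is going to a log file or into further processing, one of the regex approaches above is still needed.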
None of the regex solutions worked in my case with OSC sequences (\x1b]).
To actually render the visible output, you will need a terminal emulator like pyte:
#! /usr/bin/env python3

import pyte # terminal emulator: render terminal output to visible characters

pyte_screen = pyte.Screen(80, 24)
pyte_stream = pyte.ByteStream(pyte_screen)

bytes_ = b''.join([
  b'$ cowsay hello\r\n', b'\x1b[?2004l', b'\r', b' _______\r\n',
  b'< hello >\r\n', b' -------\r\n', b'        \\   ^__^\r\n',
  b'         \\  (oo)\\_______\r\n', b'            (__)\\       )\\/\\\r\n',
  b'                ||----w |\r\n', b'                ||     ||\r\n',
  b'\x1b]0;user@laptop1:/tmp\x1b\\', b'\x1b]7;file://laptop1/tmp\x1b\\', b'\x1b[?2004h$ ',
])
pyte_stream.feed(bytes_)

# pyte_screen.display always has 80x24 characters, padded with whitespace
# -> use rstrip to remove trailing whitespace from all lines
text = ("".join([line.rstrip() + "\n" for line in pyte_screen.display])).strip() + "\n"
print("text", text)
print("cursor", pyte_screen.cursor.y, pyte_screen.cursor.x)
print("title", pyte_screen.title)