Remove all line breaks from a long string of text
Question:
Basically, I’m asking the user to input a string of text into the console, but the string is very long and includes many line breaks. How would I take the user’s string and delete all line breaks to make it a single line of text? My method for acquiring the string is very simple.
string = raw_input("Please enter string: ")
Is there a different way I should be grabbing the string from the user? I’m running Python 2.7.4 on a Mac.
P.S. Clearly I’m a noob, so even if a solution isn’t the most efficient, the one that uses the most simple syntax would be appreciated.
Answers:
You can try using string replace:
string = string.replace('\r', '').replace('\n', '')
How do you enter line breaks with raw_input? But once you have a string with some characters in it you want to get rid of, just replace them.
>>> mystr = raw_input('please enter string: ')
please enter string: hello world, how do i enter line breaks?
>>> # pressing enter didn't work...
...
>>> mystr
'hello world, how do i enter line breaks?'
>>> mystr.replace(' ', '')
'helloworld,howdoienterlinebreaks?'
>>>
In the example above, I replaced all spaces. The string '\n' represents newlines, and '\r' represents carriage returns (if you’re on Windows, you might be getting these, and a second replace will handle them for you!).
Basically:
# you probably want to use a space ' ' to replace `\n`
mystring = mystring.replace('\n', ' ').replace('\r', '')
Note also that it is a bad idea to call your variable string, as this shadows the standard-library module string. Another name I’d avoid but would love to use sometimes: file, for a similar reason (in Python 2 it shadows the built-in file type).
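On the question of how to grab multi-line text in the first place: a single raw_input call stops at the first Enter, so one option is to read the whole input stream until end-of-file and then apply the replace calls. Here is a minimal sketch, assuming Python 3 (where raw_input became input); the names flatten and read_and_flatten are made up for illustration, and io.StringIO only stands in for the console.

```python
import io
import sys


def flatten(text):
    """Replace every line break in text with a single space."""
    # Handle Windows-style "\r\n" first so it becomes one space, not two.
    return text.replace("\r\n", " ").replace("\r", " ").replace("\n", " ")


def read_and_flatten(stream=sys.stdin):
    """Read everything from stream (until end-of-file) and flatten it."""
    return flatten(stream.read())


# io.StringIO simulates pasted console input. With a real terminal you
# would call read_and_flatten() and finish the input with Ctrl-D
# (Ctrl-Z followed by Enter on Windows).
pasted = io.StringIO("first line\nsecond line\r\nthird line")
print(read_and_flatten(pasted))  # first line second line third line
```

Replacing breaks with a space rather than an empty string keeps the last word of one line from fusing with the first word of the next.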
You can split the string with no separator arg, which will treat consecutive whitespace as a single separator (including newlines and tabs). Then join using a space:
In : " ".join("\n\nsome text \r\n with multiple whitespace".split())
Out: 'some text with multiple whitespace'
A method that takes into consideration:
- additional whitespace characters at the beginning/end of the string
- additional whitespace characters at the beginning/end of every line
- various end-of-line characters
It takes a multi-line string, which may be messy, e.g.
test_str = '\nhej ho \n aaa\r\n a\n '
and produces a nice one-line string:
>>> ' '.join([line.strip() for line in test_str.strip().splitlines()])
'hej ho aaa a'
UPDATE:
To fix multiple newline characters producing redundant spaces:
' '.join([line.strip() for line in test_str.strip().splitlines() if line.strip()])
This works for the following too:
test_str = '\nhej ho \n aaa\r\n\n\n\n\n a\n '
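If you use this in more than one place, the same idea can be wrapped in a small function. This is only a sketch, assuming Python 3; to_single_line is a name I made up. Filtering out the empty lines also takes care of leading and trailing blank lines, so the outer strip() is not needed here.

```python
def to_single_line(text):
    """Collapse a multi-line string into one line.

    Each line is stripped of surrounding whitespace, empty lines are
    dropped, and the remaining lines are joined with a single space.
    """
    stripped_lines = (line.strip() for line in text.splitlines())
    return " ".join(line for line in stripped_lines if line)


print(to_single_line("\nhej ho \n aaa\r\n a\n "))              # hej ho aaa a
print(to_single_line("\nhej ho \n aaa\r\n\n\n\n\n a\n "))      # hej ho aaa a
```

Note that whitespace inside a line is left alone; only the edges of each line are trimmed.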
Another option is regex:
>>> import re
>>> re.sub("\n|\r", "", "Foo\n\rbar\n\rbaz\n\r")
'Foobarbaz'
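If you would rather put a space where each break was, the regex needs to treat a Windows "\r\n" pair as one break, otherwise you get two spaces. A small sketch of that variation, assuming Python 3; LINE_BREAK and remove_breaks are illustrative names, not anything from the standard library.

```python
import re

# "\r\n" is listed before the single characters so a Windows line
# ending is matched as one break rather than two.
LINE_BREAK = re.compile(r"\r\n|\r|\n")


def remove_breaks(text, replacement=""):
    """Replace every line break in text with replacement."""
    return LINE_BREAK.sub(replacement, text)


print(remove_breaks("Foo\r\nbar\nbaz\r"))       # Foobarbaz
print(remove_breaks("Foo\r\nbar\nbaz", " "))    # Foo bar baz
```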
The problem with rstrip() is that it only strips characters from the end of the string, so it does not work in all cases (I have run into a few myself). Instead you can use
text = text.replace("\n", " ")
This will replace every newline '\n' with a space.
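If anybody decides to use replace and the text contains the literal two-character sequence backslash + n rather than a real newline (for example, text that was escaped before it reached you), you should try the raw string r'\n' instead of '\n':
mystring = mystring.replace(r'\n', ' ').replace(r'\r', '')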
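The difference between the two spellings is easy to mix up, so here is a short illustration (a sketch, assuming Python 3; the variable names are made up). '\n' is one character, a line feed, while r'\n' is two characters, a backslash and the letter n, so each one only matches its own kind of text.

```python
real_newline = "one\ntwo"     # 7 characters; the middle one is a line feed
escaped_text = "one\\ntwo"    # 8 characters; a backslash followed by "n"

print(len("\n"), len(r"\n"))                      # 1 2

print(repr(real_newline.replace("\n", " ")))      # 'one two'
print(repr(real_newline.replace(r"\n", " ")))     # 'one\ntwo' (nothing matched)
print(repr(escaped_text.replace(r"\n", " ")))     # 'one two'
```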
The canonical answer in Python would be:
s = ''.join(s.splitlines())
It splits the string into lines (letting Python do it according to its own rules for line boundaries), then you merge them back together. Two possibilities here:
- replace each newline with a space (' '.join())
- or with nothing (''.join())
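To make the two join variants concrete, here is a quick sketch, assuming Python 3, where splitlines() on a str recognizes more than '\n' and '\r' (the Unicode line separator '\u2028' is one example). The sample text is made up.

```python
text = "alpha\nbeta\r\ngamma\rdelta\u2028epsilon"

lines = text.splitlines()
print(lines)              # ['alpha', 'beta', 'gamma', 'delta', 'epsilon']
print(" ".join(lines))    # alpha beta gamma delta epsilon
print("".join(lines))     # alphabetagammadeltaepsilon
```

On Python 2 byte strings the set of recognized boundaries is smaller, so treat the '\u2028' part as specific to Python 3 text strings.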
Regular expressions are a quick way to do this:
import re
s='''some kind of
string with a bunch\r of
extra spaces in it'''
re.sub(r'\s(?=\s)','',re.sub(r'\s',' ',s))
result:
'some kind of string with a bunch of extra spaces in it'
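Whether a regex really is quicker than the alternatives depends on your input and Python version, so it is worth measuring rather than assuming. Below is a rough sketch using timeit, assuming Python 3; with_regex and with_split are names I made up, and the sample string is only an example. I have left the timings out because they vary from machine to machine.

```python
import re
import timeit

sample = "some kind of\nstring with a bunch\r  of\nextra   spaces in it"


def with_regex(s):
    # Turn every whitespace character into a space, then drop each
    # space that is followed by another whitespace character.
    return re.sub(r"\s(?=\s)", "", re.sub(r"\s", " ", s))


def with_split(s):
    # Split on runs of whitespace and re-join with single spaces.
    return " ".join(s.split())


# The two give the same result for this sample (they differ on leading
# or trailing whitespace, which split() discards entirely).
assert with_regex(sample) == with_split(sample)

for func in (with_regex, with_split):
    seconds = timeit.timeit(lambda: func(sample), number=10_000)
    print(f"{func.__name__}: {seconds:.4f} s")
```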
You really don’t need to remove ALL the signs: LF, CR, CRLF.
# Pythonic:
r'\n', r'\r', r'\r\n'
Some texts must have breaks, but you probably need to join broken lines to keep particular sentences together.
Therefore it is natural for a line break to happen after a period, semicolon, or colon, but not after a comma.
My code takes the above conditions into account. It works well with text copied from PDFs.
Enjoy!
import re

def unbreak_pdf_text(raw_text):
    r"""The careful newline sign removal tool.

    Args:
        raw_text (str): string containing unwanted newline signs: \n or \r or \r\n,
            e.g. imported from OCR or copied from a pdf document.

    Returns:
        str: the text with every line break that follows a comma, a space,
            or a word character replaced by a single space.
    """
    # \r\n is listed first so a CRLF pair is consumed as one break.
    pat = re.compile(r"([, \w])(?:\r\n|\n|\r)")
    # Keep the character before the break and put a space after it.
    return pat.sub(r"\1 ", raw_text)
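To see the effect on a small input, here is a self-contained sketch of the same idea, assuming Python 3. It uses a lookbehind instead of the loop above, and SOFT_BREAK and join_soft_breaks are hypothetical names chosen for the illustration, so treat it as a variation rather than the exact code from this answer.

```python
import re

# A break counts as "soft" when it follows a comma, a space, or a word
# character. "\r\n" is tried first so it is handled as a single break.
SOFT_BREAK = re.compile(r"(?<=[, \w])(?:\r\n|\r|\n)")


def join_soft_breaks(text):
    """Replace soft line breaks with a space; keep all other breaks."""
    return SOFT_BREAK.sub(" ", text)


sample = (
    "This sentence was broken,\n"
    "in the middle of\r\n"
    "a line.\n"
    "This one starts a new paragraph."
)
print(join_soft_breaks(sample))
# This sentence was broken, in the middle of a line.
# This one starts a new paragraph.
```

The break after the period survives because a period is not in the character class, which is the behaviour described above for sentence endings.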