Truncate a string without ending in the middle of a word
Question:
I am looking for a way to truncate a string in Python that will not cut off the string in the middle of a word.
For example:
Original: "This is really awesome."
"Dumb" truncate: "This is real..."
"Smart" truncate: "This is really..."
I’m looking for a way to accomplish the “smart” truncate from above.
Answers:
I actually wrote a solution for this on a recent project of mine. I’ve compressed the majority of it down to be a little smaller.
def smart_truncate(content, length=100, suffix='...'):
    if len(content) <= length:
        return content
    else:
        return ' '.join(content[:length+1].split(' ')[0:-1]) + suffix
The if-statement checks whether your content is already no longer than the cutoff point. If it is longer, the function slices to the desired length plus one character, splits on spaces, drops the last element (so that you don’t cut off a word), and joins the rest back together, tacking on the ‘...’ suffix.
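To make the slice, split, and join steps concrete, here is a small walk-through of the same function on the string from the question. The cutoff values 14 and 12 are arbitrary choices for illustration.

```python
def smart_truncate(content, length=100, suffix='...'):
    if len(content) <= length:
        return content
    return ' '.join(content[:length + 1].split(' ')[0:-1]) + suffix

content = "This is really awesome."

# length=14: the slice content[:15] is "This is really " (note the trailing space),
# so split(' ') ends with an empty string and dropping it keeps "really" intact.
at_boundary = smart_truncate(content, 14)

# length=12: the slice content[:13] is "This is reall", so the partial word
# "reall" is the last element and gets dropped.
mid_word = smart_truncate(content, 12)

print(at_boundary)  # This is really...
print(mid_word)     # This is...
```

Note that the suffix is appended after the cut, so the result can be up to len(suffix) characters longer than length.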
Here’s a slightly better version of the last line in Adam’s solution:
return content[:length].rsplit(' ', 1)[0]+suffix
(This is slightly more efficient, and returns a more sensible result when there are no spaces in the truncated part of the string.)
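A quick side-by-side sketch of the two variants may help here. The names truncate_join and truncate_rsplit are made up for this comparison and are not from either answer.

```python
def truncate_join(content, length=100, suffix='...'):
    # Original last line: slice, split on spaces, drop the last piece.
    if len(content) <= length:
        return content
    return ' '.join(content[:length + 1].split(' ')[0:-1]) + suffix

def truncate_rsplit(content, length=100, suffix='...'):
    # Suggested last line: slice, then split once from the right.
    if len(content) <= length:
        return content
    return content[:length].rsplit(' ', 1)[0] + suffix

# No spaces before the cutoff: the join variant loses everything,
# the rsplit variant falls back to a plain cut.
print(truncate_join("Supercalifragilistic", 5))    # ...
print(truncate_rsplit("Supercalifragilistic", 5))  # Super...

# One difference to be aware of: the rsplit variant slices to `length`
# rather than `length + 1`, so it cannot see that the cut landed exactly
# on a word boundary and drops the last complete word as well.
print(truncate_join("This is really awesome.", 14))    # This is really...
print(truncate_rsplit("This is really awesome.", 14))  # This is...
```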
def smart_truncate(s, width):
    # Nothing to cut (this also avoids an IndexError on s[width] below).
    if len(s) <= width:
        return s
    if s[width].isspace():
        return s[0:width]
    else:
        return s[0:width].rsplit(None, 1)[0]
Testing it:
>>> smart_truncate('The quick brown fox jumped over the lazy dog.', 23) + "..."
'The quick brown fox...'
import re

def smart_truncate1(text, max_length=100, suffix='...'):
    """Returns a string of at most `max_length` characters, cutting
    only at word-boundaries. If the string was truncated, `suffix`
    will be appended.
    """
    if len(text) > max_length:
        pattern = r'^(.{0,%d}\S)\s.*' % (max_length-len(suffix)-1)
        return re.sub(pattern, r'\1' + suffix, text)
    else:
        return text
OR
def smart_truncate2(text, min_length=100, suffix='...'):
    """If the `text` is more than `min_length` characters long,
    it will be cut at the next word-boundary and `suffix` will
    be appended.
    """
    pattern = r'^(.{%d,}?\S)\s.*' % (min_length-1)
    return re.sub(pattern, r'\1' + suffix, text)
OR
def smart_truncate3(text, length=100, suffix='...'):
    """Truncates `text`, on a word boundary, as close to
    the target length as it can come.
    """
    slen = len(suffix)
    pattern = r'^(.{0,%d}\S)\s+\S+' % (length-slen-1)
    if len(text) > length:
        match = re.match(pattern, text)
        if match:
            length0 = match.end(0)
            length1 = match.end(1)
            if abs(length0+slen-length) < abs(length1+slen-length):
                return match.group(0) + suffix
            else:
                return match.group(1) + suffix
    return text
There are a few subtleties that may or may not be issues for you, such as handling of tabs (e.g. if you’re displaying them as 8 spaces but treating them as 1 character internally), handling various flavours of breaking and non-breaking whitespace, or allowing breaks at hyphens. If any of this is desirable, you may want to take a look at the textwrap module, e.g.:
import textwrap

def truncate(text, max_size):
    if len(text) <= max_size:
        return text
    return textwrap.wrap(text, max_size-3)[0] + "..."
The default behaviour for words longer than max_size is to break them (making max_size a hard limit). You can change to the soft limit used by some of the other solutions here by passing break_long_words=False to wrap(), in which case it will return the whole word. If you want this behaviour, change the last line to:
    lines = textwrap.wrap(text, max_size-3, break_long_words=False)
    return lines[0] + ("..." if len(lines)>1 else "")
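The difference between the hard and soft limit is easiest to see on input whose first word is longer than the limit. A minimal sketch, where truncate_hard and truncate_soft are names invented for this comparison:

```python
import textwrap

def truncate_hard(text, max_size):
    # Default wrap() behaviour: over-long words are broken, so the
    # result never exceeds max_size.
    if len(text) <= max_size:
        return text
    return textwrap.wrap(text, max_size - 3)[0] + "..."

def truncate_soft(text, max_size):
    # break_long_words=False keeps an over-long word whole, so the
    # result may exceed max_size.
    if len(text) <= max_size:
        return text
    lines = textwrap.wrap(text, max_size - 3, break_long_words=False)
    return lines[0] + ("..." if len(lines) > 1 else "")

text = "Supercalifragilistic word"
print(truncate_hard(text, 10))  # Superca...
print(truncate_soft(text, 10))  # Supercalifragilistic...
```

Both sketches assume max_size is greater than 3, since wrap() rejects a non-positive width.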
There are a few other options like expand_tabs that may be of interest depending on the exact behaviour you want.
>>> import textwrap
>>> textwrap.wrap('The quick brown fox jumps over the lazy dog', 12)
['The quick', 'brown fox', 'jumps over', 'the lazy dog']
You just take the first element of that and you’re done…
From Python 3.4 onward you can use textwrap.shorten. With the OP’s example:
>>> import textwrap
>>> original = "This is really awesome."
>>> textwrap.shorten(original, width=20, placeholder="...")
'This is really...'
textwrap.shorten(text, width, **kwargs)
Collapse and truncate the given text to fit in the given width.
First the whitespace in text is collapsed (all whitespace is replaced by single spaces). If the result fits in the width, it is returned. Otherwise, enough words are dropped from the end so that the remaining words plus the placeholder fit within width.
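A short illustration of the two behaviours described above, whitespace collapsing and dropping words to make room for the placeholder. The messy input string is made up for the example; when no placeholder is passed, the default ' [...]' is used.

```python
import textwrap

# Runs of whitespace (spaces, tabs, newlines) are collapsed before measuring.
messy = "This  is\treally\nawesome."

fits = textwrap.shorten(messy, width=40)
cut = textwrap.shorten(messy, width=20, placeholder="...")
default_cut = textwrap.shorten("Hello  world!", width=11)

print(fits)         # This is really awesome.
print(cut)          # This is really...
print(default_cut)  # Hello [...]
```

Unlike the hand-written solutions above, the width here includes the placeholder, so the result never exceeds width.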
For Python 3.4+, I’d use textwrap.shorten.
For older versions:
def truncate(description, max_len=140, suffix='…'):
    description = description.strip()
    if len(description) <= max_len:
        return description
    new_description = ''
    for word in description.split(' '):
        tmp_description = new_description + word
        if len(tmp_description) <= max_len-len(suffix):
            new_description = tmp_description + ' '
        else:
            new_description = new_description.strip() + suffix
            break
    return new_description
In case you might actually prefer to truncate by full sentence rather than by word, here’s something to start with:
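import numpy as np

def smart_truncate_by_sentence(content, length=100, suffix='...'):
    if not isinstance(content, str):
        return content
    if len(content) <= length:
        return content
    else:
        sentences = content.split('.')
        # Running total of sentence lengths (the '.' separators are not counted).
        cs = np.cumsum([len(s) for s in sentences])
        # Keep the sentences whose running total stays below `length`, but at least one.
        n = max(1, len(cs[cs < length]))
        return '.'.join(sentences[:n]) + ('. ' + suffix) * (n < len(sentences))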
C++ version:
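#include <string>
using std::string;

string trim(string s, int k) {
    if (static_cast<int>(s.size()) <= k) return s;
    // Walk back from position k to the nearest space.
    while (k >= 0 && s[k] != ' ')
        k--;
    if (k < 0) return "";
    string res = s.substr(0, k + 1);
    // Strip trailing spaces.
    while (res.size() && (res.back() == ' '))
        res.pop_back();
    return res;
}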