How do I split a string into a list of words?


How do I split a sentence and store each word in a list? For example, given a string like "these are words", how do I get a list like ["these", "are", "words"]?

To split on other delimiters, see Split a string by a delimiter in python.

To split into individual characters, see How do I split a string into a list of characters?.

Asked By: Thanx



To split the string text on any consecutive runs of whitespace:

words = text.split()      

To split the string text on a custom delimiter such as ",":

words = text.split(",")   

The words variable will be a list and contain the words from text split on the delimiter.

Answered By: zalew

Given a string sentence, this stores each word in a list called words:

words = sentence.split()
Answered By: nstehr

Use str.split():

Return a list of the words in the string, using sep as the delimiter
… If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace.

>>> line = "a sentence with a few words"
>>> line.split()
['a', 'sentence', 'with', 'a', 'few', 'words']
Answered By: gimel

I want my python function to split a sentence (input) and store each word in a list

The str().split() method does this, it takes a string, splits it into a list:

>>> the_string = "this is a sentence"
>>> words = the_string.split(" ")
>>> print(words)
['this', 'is', 'a', 'sentence']
>>> type(words)
<type 'list'> # or <class 'list'> in Python 3.0
Answered By: dbr

Depending on what you plan to do with your sentence-as-a-list, you may want to look at the Natural Language Took Kit. It deals heavily with text processing and evaluation. You can also use it to solve your problem:

import nltk
words = nltk.word_tokenize(raw_sentence)

This has the added benefit of splitting out punctuation.


>>> import nltk
>>> s = "The fox's foot grazed the sleeping dog, waking it."
>>> words = nltk.word_tokenize(s)
>>> words
['The', 'fox', "'s", 'foot', 'grazed', 'the', 'sleeping', 'dog', ',', 
'waking', 'it', '.']

This allows you to filter out any punctuation you don’t want and use only words.

Please note that the other solutions using string.split() are better if you don’t plan on doing any complex manipulation of the sentence.


Answered By: tgray

How about this algorithm? Split text on whitespace, then trim punctuation. This carefully removes punctuation from the edge of words, without harming apostrophes inside words such as we're.

>>> text
"'Oh, you can't help that,' said the Cat: 'we're all mad here. I'm mad. You're mad.'"

>>> text.split()
["'Oh,", 'you', "can't", 'help', "that,'", 'said', 'the', 'Cat:', "'we're", 'all', 'mad', 'here.', "I'm", 'mad.', "You're", "mad.'"]

>>> import string
>>> [word.strip(string.punctuation) for word in text.split()]
['Oh', 'you', "can't", 'help', 'that', 'said', 'the', 'Cat', "we're", 'all', 'mad', 'here', "I'm", 'mad', "You're", 'mad']
Answered By: Colonel Panic

shlex has a .split() function. It differs from str.split() in that it does not preserve quotes and treats a quoted phrase as a single word:

>>> import shlex
>>> shlex.split("sudo echo 'foo && bar'")
['sudo', 'echo', 'foo && bar']

NB: it works well for Unix-like command line strings. It doesn’t work for natural-language processing.

Answered By: Tarwin

If you want all the chars of a word/sentence in a list, do this:

#  ['w', 'o', 'r', 'd']

print(list("some sentence"))
#  ['s', 'o', 'm', 'e', ' ', 's', 'e', 'n', 't', 'e', 'n', 'c', 'e']
Answered By: BlackBeard

Split the words without without harming apostrophes inside words
Please find the input_1 and input_2 Moore’s law

def split_into_words(line):
    import re
    word_regex_improved = r"(w[w']*w|w)"
    word_matcher = re.compile(word_regex_improved)
    return word_matcher.findall(line)

#Example 1

input_1 = "computational power (see Moore's law) and "

# output 
['computational', 'power', 'see', "Moore's", 'law', 'and']

#Example 2

input_2 = """Oh, you can't help that,' said the Cat: 'we're all mad here. I'm mad. You're mad."""

Answered By: thrinadhn