Split string into list of two words, repeating the last word

Question:

I need to split a string into a list of two-word pairs, where the last word of each pair is repeated as the first word of the next pair.
Here is what I tried, based on examples I found in other questions:

line = """Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."""

def split_line(in_line):
    line_sp = in_line.split(" ")
    line_two = [" ".join(line_sp[i:i + 2]) for i in range(0, len(line_sp), 2)]
    return line_two

print(split_line(line))

This results in:

['Lorem ipsum', 'dolor sit', 'amet, consectetur', 'adipiscing elit,', 'sed do', 'eiusmod tempor', 'incididunt ut', 'labore et', 'dolore magna', 'aliqua.']

But what I actually need is this:

['Lorem ipsum', 'ipsum dolor', 'dolor sit', 'sit amet', 'amet, consectetur', 'consectetur adipiscing', ...]

How can I make it work?
Thanks!

Asked By: Litwos

||

Answers:

You can start by constructing a list of the words in the line:

words = line.split()

Then you can build a list of lists containing consecutive pairs with slicing:

pairs = [words[i:i + 2] for i in range(len(words))]

Finally, you can take each pair and join it with ' ':

result = [" ".join(pair) for pair in pairs if len(pair) > 1]
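To see why the len(pair) > 1 filter matters, here is a small sketch on a shortened sentence (the sample text is made up for illustration). The final slice words[i:i + 2] runs off the end of the list and holds only the last word, so it has to be dropped:

```python
sample = "Lorem ipsum dolor sit"  # shortened sample text, for illustration only
words = sample.split()

# Every slice has two words except the last one, which runs off the end.
pairs = [words[i:i + 2] for i in range(len(words))]
print(pairs)
# [['Lorem', 'ipsum'], ['ipsum', 'dolor'], ['dolor', 'sit'], ['sit']]

# Dropping the short slice leaves only complete pairs.
result = [" ".join(pair) for pair in pairs if len(pair) > 1]
print(result)
# ['Lorem ipsum', 'ipsum dolor', 'dolor sit']
```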
Answered By: taras

You can use zip on the following two slices of words:

words = line.split()
print(list(map(' '.join, zip(words[:-1], words[1:]))))

This outputs:

['Lorem ipsum', 'ipsum dolor', 'dolor sit', 'sit amet,', 'amet, consectetur', 'consectetur adipiscing', 'adipiscing elit,', 'elit, sed', 'sed do', 'do eiusmod', 'eiusmod tempor', 'tempor incididunt', 'incididunt ut', 'ut labore', 'labore et', 'et dolore', 'dolore magna', 'magna aliqua.']
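
Because zip stops at the shorter of its inputs, the [:-1] slice is optional, and inputs with fewer than two words simply give an empty list. A minimal sketch wrapping the idea in a function (the name word_pairs is made up for this example):

```python
def word_pairs(text):
    """Return every pair of adjacent words in text as a 'first second' string."""
    words = text.split()
    # zip stops when words[1:] runs out, so no explicit length check is needed.
    return [" ".join(pair) for pair in zip(words, words[1:])]

print(word_pairs("Lorem ipsum dolor sit amet,"))
# ['Lorem ipsum', 'ipsum dolor', 'dolor sit', 'sit amet,']
print(word_pairs("Lorem"))  # a single word has no pairs
# []
print(word_pairs(""))
# []
```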
Answered By: blhsing

You can try something like the following. I don't know Python syntax, so I am answering in Java; maybe you can convert it to Python.

String line = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.";
String[] split = line.split(" ");
// There is one fewer pair than there are words.
String[] line_two = new String[split.length - 1];

for (int i = 1; i < split.length; i++) {
    // Pair each word with the one before it.
    line_two[i - 1] = split[i - 1] + " " + split[i];
}
Answered By: Rupesh Agrawal

You can use a lazy generator with zip:

def split_line(in_line):
    line_sp = in_line.split()
    yield from map(' '.join, zip(line_sp, line_sp[1:]))

print(list(split_line(line)))

['Lorem ipsum', 'ipsum dolor', 'dolor sit', 'sit amet,',
 ...
 'labore et', 'et dolore', 'dolore magna', 'magna aliqua.']
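
The slice line_sp[1:] still builds a second list of words. If that matters, the pairing can be done without slicing by using itertools.tee, which is the classic pairwise recipe from the itertools documentation (Python 3.10 and later also provide it as itertools.pairwise). A sketch, with lazy_pairs as a made-up name:

```python
from itertools import tee

def lazy_pairs(text):
    """Lazily yield 'first second' strings for adjacent words in text."""
    first, second = tee(text.split())
    next(second, None)  # advance the second iterator by one word
    for a, b in zip(first, second):
        yield a + " " + b

print(list(lazy_pairs("Lorem ipsum dolor sit")))
# ['Lorem ipsum', 'ipsum dolor', 'dolor sit']
```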
Answered By: jpp

A simple for loop:

l = line.split(' ')
result = []
for i in range(len(l) - 1):
    result.append(l[i] + ' ' + l[i+1])
print(result)
# ['Lorem ipsum', 'ipsum dolor', 'dolor sit', 'sit amet,', 'amet, consectetur', 'consectetur adipiscing', 'adipiscing elit,', 'elit, sed', 'sed do', 'do eiusmod', 'eiusmod tempor', 'tempor incididunt', 'incididunt ut', 'ut labore', 'labore et', 'et dolore', 'dolore magna', 'magna aliqua.']
Answered By: RobJan

You can try it with regex, too:

import re
rslt = [" ".join(tup) for tup in re.findall(r"(\w+)\W+(?=(\w+))", line)]

\w+ matches one or more word characters;

(\w+) captures the matched word;

\W+ matches one or more non-word characters;

(?=(\w+)) is a lookahead, written (?=…): it captures the next word without consuming it, so that word can also start the next match.
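
One consequence of matching on word characters is that punctuation is dropped from the result, unlike in the split-based answers. A short sketch on a fragment of the sample text:

```python
import re

fragment = "sit amet, consectetur adipiscing"
pairs = [" ".join(tup) for tup in re.findall(r"(\w+)\W+(?=(\w+))", fragment)]
print(pairs)
# ['sit amet', 'amet consectetur', 'consectetur adipiscing']
```

The commas are gone because \W+ consumes them between the two captured words.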

Answered By: kantal

What you are looking for is nltk.bigrams()

import nltk
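bigrm = list(nltk.bigrams(line.split()))  # list of (word, next_word) tuples
result = [" ".join(pair) for pair in bigrm]  # join each tuple into a string such as 'Lorem ipsum'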
Answered By: shantanuo

For whatever it is worth, you can keep your own approach and just change the step of the range from 2 to 1. The range also has to stop one index early, otherwise the last word ends up on its own as a final element:

BEFORE:

line_sp = line.split(" ")
line_two = [" ".join(line_sp[i:i + 2]) for i in range(0, len(line_sp), 2)]
return line_two

FIXED:

line_sp = line.split(" ")
line_two = [" ".join(line_sp[i:i + 2]) for i in range(0, len(line_sp) - 1, 1)]
return line_two

print(split_line(line))

Answered By: Moazzan Ishfaq