IndexError: string index out of range. While-Loop goes one round too much

Question:

I know there are other questions similar to this, but I can't seem to figure out how to fix it. I've added some "tracking" print statements to see where it goes wrong. The same problem obviously affects other parts of the code as well, depending on what is passed to the tokenize function. I know it comes from the end += 1 in the while loops not stopping/continuing correctly: after the last letter/number/symbol is read, it should be added to words, but instead the loop tries to go one step further and raises this error. I've tried numerous ifs and other things, but my coding is too weak to solve it properly. Any other comments on the code in general are much appreciated as well. I had a working draft earlier, but I accidentally deleted it when I was supposed to polish it and move it to another document…

def tokenize(lines):
    words = []
    for line in lines:
        print("new line")
        start = 0

        while start-1 < len(line):
            print(start)
            print("start")
            while line[start].isspace() == True:
                print("remove space")
                start += 1
            end = start
            while line[end].isspace() == True:
                print("remove space")
                end += 1
            if line[end].isalpha() == True:
                while line[end].isalpha() == True:
                    print("letter")
                    end += 1
            elif line[end].isdigit() == True:
                while line[end].isdigit() == True:
                    print("number")
                    end += 1
            else:
                print("symbol")
                end += 1
            words.append(line[start:end].lower())
            print(line[start:end] + " - adds to words")
            start = end
            print(len(line))
            print(words)
    return words

tokenize([" all .. 12 foas d 12 9"])

Asked By: Roslund


Answers:

The main issue is that you have to check your index bounds in every part of the code where an indexing variable might have changed. This applies to both start and end, since each is incremented independently.

I also cut out the parts of your code that were unnecessary, mostly duplicated logic and untidy control flow, which you should avoid in every program you write. That makes the program easier to debug, maintain, and understand. Always make your logic as straightforward as possible before you start writing code.

def tokenize(lines):
    words = []
   
    for line in lines:
        print("new line")
        start = 0
        
        # start, as an index, is allowed in the range [0, len(line) - 1]
        # so use either *start < len(line)* or *start <= len(line) - 1* as they are equivalent
        while start < len(line):
            print(start)
            print("start")
            # going forward, watch not to overstep again
            while start < len(line) and line[start].isspace():
                print("remove space")
                start += 1
            end = start
            # whatever variable you use as an index, you have to make sure
            # it will be within bounds; as you go forward to capture 
            # non-space symbols, you should also stop before the string finishes.
            while end < len(line) and not line[end].isspace():
                if line[end].isalpha():
                    print("letter")
                elif line[end].isdigit():
                    print("number")
                else:
                    print("symbol")
                end += 1
            words.append(line[start:end].lower())
            print(line[start:end] + " - adds to words")
            start = end
            print(len(line))
            print(words)

    return words
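For reference, here is a condensed sketch of the corrected function with the tracing prints removed, together with the sample call from the question. The extra start < end guard (not in the code above) just avoids appending an empty token when a line ends in spaces:

```python
def tokenize(lines):
    words = []
    for line in lines:
        start = 0
        while start < len(line):
            # skip any whitespace, staying within bounds
            while start < len(line) and line[start].isspace():
                start += 1
            end = start
            # capture a run of non-space characters
            while end < len(line) and not line[end].isspace():
                end += 1
            if start < end:  # avoid appending an empty token at end of line
                words.append(line[start:end].lower())
            start = end
    return words

print(tokenize([" all .. 12 foas d 12 9"]))
# → ['all', '..', '12', 'foas', 'd', '12', '9']
```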

UPDATE:

It seems the OP wants to keep the non-alphanumeric symbols as separate tokens. I suggest not doing this in a single pass. You can first split on whitespace the normal way, and then go over each word again to split on symbols (while retaining the symbols). This keeps the code simpler and easier to read. I'll use a regex split for the second step:

import re

greeting = "Hey, how are you doing?"
# get rid of spaces
tokens = greeting.split()
result = []
for w in tokens:
    # r"[^\d\w]+" matches runs of symbol characters (non-digit and non-word)
    # the parentheses capture the delimiters (the symbols) as tokens in the result
    for x in re.split(r"([^\d\w]+)", w):
        if x:
            result.append(x)
print(result)

# or use a list comprehension to achieve the same in a single expression
result = [x for w in greeting.split() for x in re.split(r"([^\d\w]+)", w) if x]
print(result)
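Putting the two steps together, here is a sketch of a tokenize built this way (my own wrapper, not the OP's original), applied both to the question's sample input and to a string where symbols are attached to words:

```python
import re

def tokenize(lines):
    # first split each line on whitespace, then split every chunk on symbol runs;
    # the capturing group in the pattern keeps the symbols as tokens
    return [x.lower()
            for line in lines
            for w in line.split()
            for x in re.split(r"([^\d\w]+)", w)
            if x]

print(tokenize([" all .. 12 foas d 12 9"]))
# → ['all', '..', '12', 'foas', 'd', '12', '9']
print(tokenize(["Hey, how are you doing?"]))
# → ['hey', ',', 'how', 'are', 'you', 'doing', '?']
```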
Answered By: Sajad