Word count script in Python

Question:

Can someone please explain why there is a 'b' in front of each word and how to get rid of it? The script returns something like this:

word = b'yesterday,' , count = 3

import urllib.request

current_word = {}
current_count = 0
text = "https://raw.githubusercontent.com/KseniaGiansar/pythonProject2_text/master/yesterday.txt"
request = urllib.request.urlopen(text)
each_word = []
words = None
count = 1
same_words ={}
word = []

# collect words into a list
for line in request:
    #print "Line = " , line
    line_words = line.split()
    for word in line_words:  # looping each line and extracting words
        each_word.append(word)
for words in each_word:
    if words.lower() not in same_words.keys() :
        same_words[words.lower()]=1
    else:
        same_words[words.lower()]=same_words[words.lower()]+1
for each in same_words.keys():
    print("word = ", each, ", count = ",same_words[each])
Asked By: Aella

||

Answers:

The b prefix indicates that each word is a bytes object, not a string.

urllib.request.urlopen() returns a response object, and iterating over it yields each line as a bytes object.

To fix this, you can use the .decode() method to convert the bytes object to a string before appending it to the list.

for line in request:
    line_words = line.decode().split() # decode the bytes object to a string
    for word in line_words:
        each_word.append(word)
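
Putting the fix together, a minimal sketch of the whole count might look like the following. The function name count_words and the use of collections.Counter are illustrative choices, not part of the original script. It assumes the text is UTF-8 encoded and accepts any iterable of bytes lines, so a plain list stands in for the HTTP response here.

```python
from collections import Counter


def count_words(lines, encoding="utf-8"):
    """Count lower-cased, whitespace-separated words in an iterable of bytes lines."""
    counts = Counter()
    for line in lines:
        # Decode once per line so every word is a str, not bytes.
        counts.update(line.decode(encoding).lower().split())
    return counts


# Stand-in for the HTTP response: iterating over it yields bytes lines too.
sample = [b"Yesterday, all my troubles\n", b"yesterday, came suddenly\n"]
for word, count in count_words(sample).items():
    print("word = ", word, ", count = ", count)
```

With the URL from the question, the call would be count_words(urllib.request.urlopen(text)), after importing urllib.request.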
Answered By: David Meu

The b prefix in Python marks a byte string (a bytes object).

When you are reading from an HTTP request, the response is in bytes, and you should decode it like this:

line_words = line.decode("utf8").split()

Please make sure the encoding you decode with (UTF-8 in my example) matches the charset in the Content-Type header of the response. You can send an Accept-Charset: utf-8 header in the request to ask the server for UTF-8, although the server is free to ignore it.
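
As a rough sketch of that check, the charset can be read from the headers instead of being hard-coded. The helper name pick_encoding and the fallback to UTF-8 are assumptions made for illustration. The headers below are built by hand so the example runs without a network connection; the headers attribute of a urlopen() response provides the same get_content_charset() method.

```python
from email.message import Message


def pick_encoding(headers, default="utf-8"):
    """Return the charset declared in Content-Type, or a default if none is given."""
    return headers.get_content_charset() or default


# Hand-built headers standing in for response.headers (hypothetical values).
declared = Message()
declared["Content-Type"] = "text/plain; charset=ISO-8859-1"
missing = Message()
missing["Content-Type"] = "text/plain"

print(pick_encoding(declared))
print(pick_encoding(missing))
```

The decode call then becomes line.decode(pick_encoding(request.headers)).split(), where request is the response object from the question.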

Answered By: Marc Sances

The b probably means bytes.

I think you can remove the "b" by decoding the bytes into a string with the .decode() method. In this case, you can add the following line as the first statement inside the outer for loop, before line.split() is called:

line = line.decode("utf-8")

Alternatively, instead of decoding the whole line, you can remove the "b" from each word before adding it to the each_word list:

word = word.decode("utf-8")
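
The two placements of .decode() are alternatives, not steps to combine. A small sketch with a made-up sample line shows that both give the same words, and that a value that is already a str cannot be decoded again:

```python
line = b"Yesterday, all my troubles\n"

# Option 1: decode the whole line, then split it into words.
words_a = line.decode("utf-8").split()

# Option 2: split the bytes first, then decode each word.
words_b = [word.decode("utf-8") for word in line.split()]

print(words_a == words_b)

# A str has no .decode() method, so decoding a second time fails.
try:
    words_a[0].decode("utf-8")
except AttributeError as error:
    print("already a str:", error)
```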
Answered By: Abdullah Arafat