Word count script in Python

Question

Can someone please explain why there is a 'b' in front of each word and how to get rid of it? The script returns something like this:

word = b'yesterday,' , count = 3

import urllib.request

current_word = {}
current_count = 0
text = "https://raw.githubusercontent.com/KseniaGiansar/pythonProject2_text/master/yesterday.txt"
request = urllib.request.urlopen(text)
each_word = []
words = None
count = 1
same_words ={}
word = []

# collect words into a list
for line in request:
    #print "Line = " , line
    line_words = line.split()
    for word in line_words:  # looping each line and extracting words
        each_word.append(word)
for words in each_word:
    if words.lower() not in same_words.keys() :
        same_words[words.lower()]=1
    else:
        same_words[words.lower()]=same_words[words.lower()]+1
for each in same_words.keys():
    print("word = ", each, ", count = ",same_words[each])

Asked By: Aella

Answer 1

The b prefix indicates that each word is a bytes object rather than a str.

urllib.request.urlopen() returns a response object, and iterating over it (or calling .read() on it) yields bytes.

To fix this, you can use the .decode() method to convert each line from bytes to a string before splitting it into words.

for line in request:
    line_words = line.decode().split() # decode the bytes object to a string
    for word in line_words:
        each_word.append(word)
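
For reference, here is a minimal, self-contained sketch of the whole count with the decode step in place. It is not the asker's exact script: the function name count_words and the sample lines are made up for illustration, and it swaps the hand-written dict for collections.Counter. It takes any iterable of bytes lines, so the object returned by urlopen() could be passed in the same way.

```python
from collections import Counter


def count_words(lines, encoding="utf-8"):
    """Count lowercased words in an iterable of bytes lines (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        text = line.decode(encoding)  # bytes -> str, so no b'' prefix later
        counts.update(word.lower() for word in text.split())
    return counts


# Made-up sample standing in for the lines of an HTTP response.
sample = [b"Yesterday, it rained\n", b"yesterday, it snowed\n"]
for word, count in count_words(sample).items():
    print("word =", word, ", count =", count)
```

Punctuation stays attached to the words here, exactly as in the original script, because split() only splits on whitespace.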

Answered By: David Meu

Answer 2

Strings printed with a b prefix in Python are bytes objects (byte strings).
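
A small sketch of the difference, using a made-up value rather than the real download:

```python
raw = b"yesterday,"          # what the loop in the question is handling
text = raw.decode("utf-8")   # the same characters as a str

print(type(raw).__name__, repr(raw))    # bytes b'yesterday,'
print(type(text).__name__, repr(text))  # str 'yesterday,'
```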

When you are reading from an HTTP request, the response is in bytes, and you should decode it like this:

line_words = line.decode("utf8").split()

Please make sure the encoding you decode with (UTF-8 in my example) matches the charset in the Content-Type header of the response. You can also send an Accept-Charset: utf-8 header in the request to ask the server for UTF-8, although servers are free to ignore it.
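
One possible way to pick the encoding from the Content-Type value instead of hard-coding it is sketched below. The helper names charset_from_content_type and decode_body are hypothetical, the header value and sample bytes are made up, and falling back to UTF-8 when no charset is given is an assumption, not something the server guarantees.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset named in a Content-Type value, or a default (assumption)."""
    if not content_type:
        return default
    header = Message()
    header["Content-Type"] = content_type
    return header.get_content_charset() or default


def decode_body(raw, content_type):
    """Decode raw bytes using the charset from the Content-Type value."""
    return raw.decode(charset_from_content_type(content_type))


# Made-up header value and payload for illustration.
print(charset_from_content_type("text/plain; charset=ISO-8859-1"))
print(decode_body("café".encode("latin-1"), "text/plain; charset=ISO-8859-1"))
```

With a real response from urllib.request.urlopen(), response.headers.get_content_charset() should give the same information directly, since the headers object is based on the same email.message.Message class.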

Answered By: Marc Sances

Answer 3

The b most likely means bytes.

I think you can remove the b by decoding the bytes into a string with the .decode() method. In this case, you can add the following line at the start of the outer for loop body, before the call to line.split():

line = line.decode("utf-8")

Alternatively, you can remove the b from each word before adding it to the each_word list by doing the following:

word = word.decode("utf-8")

Answered By: Abdullah Arafat
