How do I create a list from a webpage?
Question:
I am attempting to create a list of words from the website text. I would like to be able to pick a random word from this list using random. I hope this makes sense.
import random as r
from bs4 import BeautifulSoup
import requests as rq
url = 'https://www.mit.edu/~ecprice/wordlist.10000'
page = rq.get(url)
soup = [BeautifulSoup(page.text, 'html.parser')]
print(r.choice(soup))
I tried this, but I get the full list back. I presume this is because the website I am scraping does not use breaks or any other markup, so I am unsure how to specify what to extract.
Answers:
There is no need for BeautifulSoup in this context; simply split() the text from the response into a list.
Example
import random as r
import requests as rq
url = 'https://www.mit.edu/~ecprice/wordlist.10000'
word_list = rq.get(url).text.split()
print(r.choice(word_list))
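To see what split() does without hitting the network, here is a minimal sketch. The sample_text string is a made-up stand-in for rq.get(url).text, assuming the page serves one word per line:

```python
import random

# Hypothetical stand-in for rq.get(url).text: plain text, one word per line.
sample_text = "apple\nbanana\ncherry\n"

# split() with no arguments splits on any run of whitespace
# and never produces empty strings, so the trailing newline is harmless.
word_list = sample_text.split()

print(word_list)
print(random.choice(word_list))  # one of the three words
```

Because split() handles spaces, tabs and newlines alike, this should also work if the page happens to separate words differently.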
If you really need to use BeautifulSoup, you could use get_text() and split():
word_list = BeautifulSoup(rq.get(url).text, 'html.parser').get_text('\n', strip=True).split()
If you use [BeautifulSoup(page.text, 'html.parser')], the entire document becomes a single element of the list, so r.choice() can only ever return the whole document. Instead, convert the soup to a string and then use the string split() method to turn it into a list.
import random as r
from bs4 import BeautifulSoup
import requests as rq
url = 'https://www.mit.edu/~ecprice/wordlist.10000'
page = rq.get(url)
soup = str(BeautifulSoup(page.text, 'html.parser'))
soup = soup.split('\n')  # one word per line
print(r.choice(soup))
Note: I wanted to use the same approach you used so that you will understand the difference.
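The difference between wrapping the text in a list and splitting it can be sketched offline. The text variable below is a hypothetical sample standing in for the page content, assuming it ends with a newline:

```python
# Hypothetical sample standing in for the page content.
text = "apple\nbanana\ncherry\n"

# Wrapping in brackets gives a list with exactly one element: the whole text.
wrapped = [text]

# Splitting on '\n' gives one element per line, plus an empty string
# at the end if the text finishes with a newline.
by_newline = text.split('\n')

# splitlines() gives one element per line without the trailing empty string.
by_lines = text.splitlines()

print(len(wrapped))   # 1
print(by_newline)     # ['apple', 'banana', 'cherry', '']
print(by_lines)       # ['apple', 'banana', 'cherry']
```

If the empty string matters (random.choice could return it), splitlines() or a bare split() is the safer choice.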