Crawling IMDB for movie trailers?
Question:
I want to crawl IMDB and download the trailers of movies (either from YouTube or IMDB) that fit some criteria (e.g. released this year, with a rating above 2).
I want to do this in Python. I saw that there are packages for crawling IMDB and for downloading YouTube videos. My current plan is to crawl IMDB, search YouTube for ‘$movie_name’ + ‘trailer’, hope that the top result is the trailer, and then download it.
Still, this seems a bit convoluted and I was wondering if there was perhaps an easier way.
Any help would be appreciated.
Answers:
There is no easier way. I doubt IMDB allows people to scrape their website freely, so your IP will probably get blacklisted, and to counter that you’ll need proxies. Good luck and scrape respectfully.
EDIT: Please take a look at @pds’s answer below. My answer is no longer valid.
The IMDbPY API (https://imdbpy.github.io/) will get you started; it’s very straightforward.
import imdb  # pip install IMDbPY

ia = imdb.IMDb()
list_of_movies = ia.search_movie("string text")

# Fetch extra details (main info and vote details) for the first search result.
for m in list_of_movies[:1]:
    ia.update(m, info=["main", "vote details"])

for m in list_of_movies[:1]:
    yt_search_term = m.get("title") + " trailer"
    # connect to YT API to start that part of the search.
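The criteria from the question (released this year, rating above 2) could be applied before any YouTube search. The sketch below is only an illustration: `matches_criteria` is a hypothetical helper, and it assumes each movie exposes "year" and "rating" through `.get()` once its details have been fetched. Plain dicts stand in for the movie objects so the example runs on its own.

```python
from datetime import date


def matches_criteria(movie, year=None, min_rating=2.0):
    """Return True if the movie was released in `year` and is rated above `min_rating`.

    `movie` is anything with a dict-style .get(); the "year" and "rating"
    keys are an assumption about what the movie object provides.
    """
    year = date.today().year if year is None else year
    rating = movie.get("rating")
    return movie.get("year") == year and rating is not None and rating > min_rating


# Plain dicts stand in for fetched movie objects.
sample = [
    {"title": "A", "year": 2024, "rating": 7.1},
    {"title": "B", "year": 2024, "rating": 1.5},   # rated too low
    {"title": "C", "year": 2019, "rating": 8.0},   # wrong year
    {"title": "D", "year": 2024},                  # no rating yet
]
selected = [m["title"] for m in sample if matches_criteria(m, year=2024)]
print(selected)
```

Movies without a rating are skipped rather than treated as matching, which seems the safer default for unreleased titles.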
Then look up how to connect to the YouTube Data API v3 with appropriate authentication and install the corresponding Google API client library. Google's documentation includes sample code.
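As a rough sketch of that step, the search could look something like the following. It assumes the `google-api-python-client` package and an API key created in the Google Cloud console; `search_youtube`, `parse_search_items`, and the `api_key` parameter are names made up for this example. The client import is done inside the function so the parsing helper can be tried without the package installed.

```python
def parse_search_items(response):
    """Turn a search.list response dict into (video_id, title) pairs."""
    return [
        (item["id"]["videoId"], item["snippet"]["title"])
        for item in response.get("items", [])
    ]


def search_youtube(query, api_key, max_results=5):
    """Search YouTube for videos matching `query` (needs network and a valid key)."""
    # pip install google-api-python-client
    from googleapiclient.discovery import build

    youtube = build("youtube", "v3", developerKey=api_key)
    response = youtube.search().list(
        q=query,
        part="snippet",
        type="video",          # restrict results to videos so each item has a videoId
        maxResults=max_results,
    ).execute()
    return parse_search_items(response)


# Offline check of the parsing step with a hand-written response fragment.
fake_response = {
    "items": [
        {
            "id": {"kind": "youtube#video", "videoId": "abc123"},
            "snippet": {"title": "Some Movie - Official Trailer"},
        }
    ]
}
print(parse_search_items(fake_response))
```

A real call would then be `search_youtube(yt_search_term, api_key="YOUR_KEY")`. Note that search requests count against the daily API quota.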
Issues: One challenge is that movie titles are not unique, so searching YT by name + " trailer" will not necessarily return your intended trailer. So you’ll need to account for that somehow. For new Hollywood blockbusters (and similar), you may be successful.
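One possible way to account for ambiguous titles is to score each search result instead of blindly taking the first one. This is a heuristic sketch, not a reliable matcher: `trailer_score` and `pick_trailer` are hypothetical helpers, the weights and the `min_score` threshold are arbitrary, and the candidates are assumed to be (video_id, video_title) pairs.

```python
import re


def trailer_score(video_title, movie_title, year=None):
    """Score how likely a video title is the trailer for the given movie."""
    text = video_title.lower()
    score = 0
    if movie_title.lower() in text:
        score += 2                      # movie title appears in the video title
    if "trailer" in text:
        score += 1
    if "official" in text:
        score += 1
    if year is not None and re.search(rf"\b{year}\b", text):
        score += 1                      # release year helps separate remakes
    return score


def pick_trailer(candidates, movie_title, year=None, min_score=3):
    """Return the best-scoring video id, or None if nothing scores high enough."""
    if not candidates:
        return None
    best = max(candidates, key=lambda c: trailer_score(c[1], movie_title, year))
    if trailer_score(best[1], movie_title, year) < min_score:
        return None
    return best[0]


candidates = [
    ("a", "Dune (1984) Official Trailer"),
    ("b", "Dune | Official Trailer 2021"),
    ("c", "Dune review"),
]
print(pick_trailer(candidates, "Dune", year=2021))
```

Returning None when no candidate clears the threshold makes it easier to log misses for manual review instead of downloading an unrelated video.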
Legal: As indicated by the other answer, do verify that your use case complies with the terms, conditions, and licenses of the technologies and information services you are using. If in doubt, seek approval from those parties first or seek professional legal advice.
This will provide the video link for you.
The following code parses the HTML source of an IMDb video page.
The mp4 link is in the HTML source itself. You can view the page source and search for ".mp4".
The links sit in a JSON blob inside a <script type="application/json"> element.
Each link expires in 1-2 hours, so download from the link right away instead of saving the links to a file, or simply run the script again each time you need a fresh link.
import json

import requests
from bs4 import BeautifulSoup

video_id = "vi2766453273"
video_url = "https://www.imdb.com/video/" + video_id
print(video_url)

r = requests.get(url=video_url)
soup = BeautifulSoup(r.text, "html.parser")

# The playback data is embedded as JSON in a <script type="application/json"> tag.
script = soup.find("script", {"type": "application/json"})
json_object = json.loads(script.string)
videos = json_object["props"]["pageProps"]["videoPlaybackData"]["video"]["playbackURLs"]
print(videos)

# Links are ordered by quality: auto, 1080, 720. Skip the first ("auto") entry.
for video in videos[1:]:
    video_link = video["url"]
    print(video_link)
    # break  # uncomment to stop after the first (highest-quality) link
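Because the links expire, the download would normally happen right after a link is extracted. A minimal sketch of that step follows. It uses the standard library's `urllib` rather than `requests` so it has no extra dependencies; `download_video` is a hypothetical helper, and the default `User-Agent` value is just a placeholder assumption.

```python
import shutil
import urllib.request


def download_video(url, path, user_agent="Mozilla/5.0"):
    """Stream the resource at `url` into the file at `path` and return `path`."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=30) as response, open(path, "wb") as out:
        # copyfileobj copies in chunks, so the video is never held fully in memory.
        shutil.copyfileobj(response, out)
    return path
```

With the loop above, a call such as `download_video(video_link, video_id + ".mp4")` would save the selected quality to disk.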
Check out the full code at GitHub