Returning the timestamps of YouTube videos based on a list of video IDs

Question:

You can run my code in this Google Colab notebook: https://colab.research.google.com/drive/1Tfoa5y13GPLxbS-wFNmZpvtQDogyh1Rg?usp=sharing

So I wrote a script that takes a VideoID of a YouTube video like:

VideoID = '3c584TGG7jQ'

Based on this VideoID, my script returns a list of dictionaries containing the YouTube transcript (the video content), like:

data = [{'text': 'Hello World', 'start': 0.19, 'duration': 4.21}, ...]

Finally, I wrote a function that takes an input from the user, namely the word or sentence to search for, and returns the timestamp of each match with the corresponding hyperlink.

def search_dictionary(user_input, dictionary):
        MY_CODE_SEE_GOOGLE_COLAB_NOTEBOOK


search_dictionary(user_input, dictionary)

Input: "stolen"

Output: 
the 2 million packages that are stolen... 0.0 min und 39.0 sec :: https://youtu.be/3c584TGG7jQ?t=38s
stolen and the fifth is this outer... 3.0 min und 13.0 sec :: https://youtu.be/3c584TGG7jQ?t=192s

Now comes my question. How can I apply this to a list of video_ids? E.g.

list_of_video_ids = ['pXDx6DjNLDU', '8HEfIJlcFbs', '3c584TGG7jQ', ...]

Expected Output:

Title_0, timestamp, hyperlink
Title_0, timestamp, hyperlink
Title_1, timestamp, hyperlink
Title_2, timestamp, hyperlink
Title_2, timestamp, hyperlink
Title_2, timestamp, hyperlink
Title_2, timestamp, hyperlink

In other words, every mention across all the video IDs, not just within a single video ID.

Asked By: Maximilian Freitag

||

Answers:

I’ve checked your code; it just needed more time and testing.

As I commented, you need to append the results of transcript.fetch() to a global list each time you loop over the elements of list_of_video_ids. Then, in the search_dictionary function you created, you can iterate over the collected transcripts.

This is the main code:

# Get the user input here.
# N.B.: You should validate it to reject a blank line or other invalid input.
user_input = input("Enter a word or sentence: ")
user_input = user_input.lower()

# Use the global list "all_transcripts" here:
dictionary = all_transcripts

# Function that loops over all transcripts and searches for the captions that
# contain the user input.
# TODO: Validate when no data is found.
def search_dictionary(user_input, dictionary): 
  link = 'https://youtu.be/'

  # Holds the video_id of the current entry:
  v_id  = ""

  # The matching result lines are collected here:
  lst_results = []

  # Formatted text of the current match:
  matched_line = ""

  # You're really looping a list of dictionaries: 
  for i in range(len(dictionary)): # <= this is really a "list".
    try:
      #print(type(dictionary[i])) # <= this is really a "dictionary".
      #print(dictionary[i])

      # Now you can iterate over the key/value pairs of this "dictionary":
      for x, y in dictionary[i].items():
        #print(x, y)
        if (x == "video_id"): 
          v_id = y
        if (user_input in str(y) and len(v_id) > 0):
          matched_line = str(dictionary[i]['text']) + '...' + str(dictionary[i]['start']) + ' min und ' + str(dictionary[i]['duration']) + ' sec :: ' + link + v_id + '?t=' + str(int(dictionary[i]['start'] - 1)) + 's'
          #matched_line = "text: " + y + " -- found in video_id = " + v_id
          
          # Add the line only if it is not already in the list of results:
          if matched_line not in lst_results:
            lst_results.append(matched_line)

    except Exception as err: 
      print('Unexpected error - see details below:')
      print(err)

  # Just an example of how to report "no results":
  if (len(lst_results) == 0):
    print("No results found with input (" + user_input + ")")
  else: 
    print("Results: ")
    print("\n".join(lst_results)) # <= this shows each result on its own line.
# Function ends here.

# Call function: 
search_dictionary(user_input, dictionary) 

# Show message - indicating end of the program - just informative :)
print("End of the program")

Following this approach, I’ve modified your code in a copy of your Google Colab notebook, which is shared as a public notebook.

My suggestions are summarized as follows:

  • Rename your variables. While testing, I had trouble telling whether I was dealing with lists or dictionaries; both are in play, as you can see in the modified code.
  • Organize your code and pay attention to spacing. Some lines are too long to read comfortably in Google Colab, though this may be a matter of personal preference.
  • Add comments to your code, as I did in the modified version, to help others understand it.

To test the modified code and check that it works, try the input teach:

Below are the results:

Enter a word or sentence: teach
Results: 
teacher and set up a class or even...626.0 min und 4.079 sec :: https://youtu.be/pXDx6DjNLDU?t=625s
teach this process and where you watch...738.399 min und 3.68 sec :: https://youtu.be/8HEfIJlcFbs?t=737s
few times a year i teach a month-long...418.8 min und 3.44 sec :: https://youtu.be/3c584TGG7jQ?t=417s
End of the program
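
Note that the results above print the raw start offset (in seconds) under the "min" label and the caption duration under "sec". One possible variant, sketched under the assumption that every entry carries 'video_id', 'text' and 'start' keys, converts the offset to minutes and seconds with divmod and returns the matches instead of printing them. The names format_hit and find_mentions are illustrative and not taken from the notebook, and the video ID stands in for the title, since the transcript data shown here contains no title.

```python
def format_hit(entry, link="https://youtu.be/"):
    """Format one caption entry as 'video_id, m:ss, url :: text'."""
    seconds = int(entry["start"])
    minutes, secs = divmod(seconds, 60)
    # Jump one second before the caption starts, but never below zero.
    jump = max(seconds - 1, 0)
    url = f"{link}{entry['video_id']}?t={jump}s"
    return f"{entry['video_id']}, {minutes}:{secs:02d}, {url} :: {entry['text']}"


def find_mentions(query, entries):
    """Return a formatted line for every entry whose text contains the query."""
    query = query.lower()
    return [format_hit(e) for e in entries if query in e["text"].lower()]


# Sample entries in the shape described in the question, plus a 'video_id' key.
sample = [
    {"video_id": "3c584TGG7jQ", "text": "few times a year i teach a month-long",
     "start": 418.8, "duration": 3.44},
    {"video_id": "pXDx6DjNLDU", "text": "Hello World",
     "start": 0.19, "duration": 4.21},
]

for line in find_mentions("teach", sample):
    print(line)
```

Returning a list rather than printing keeps the search reusable, for example to sort the hits per video or to write them to a file.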