I'm unable to get href attribute from Instagram comment element using Selenium Webdriver for Python

Question:

I’m trying to scrape the posts under this Instagram handle, starting with the latest post and continuing through the posts that follow it until I reach my cutoff time. However, I don’t really understand HTML, and I don’t have time to learn it properly because I need this done quickly.

I want to scrape the href attribute of each comment and each reply to get its unique comment ID, so I can prevent duplicates during further analysis and cross-checking. I’ve tried CLASS_NAME, XPATH, and CSS_SELECTOR, and none of them worked.

This is the element I want to get:
[image: element I want to scrape]

My code somehow only scraped [instagram_url]/p/CqhlzLmpfKV/# and not the full [instagram_url]/p/CqhlzLmpfKV/c/17979860159080809/ (main comment) or [instagram_url]/p/CqhlzLmpfKV/c/17979860159080809/r/18078377977318895/ (reply to the main comment).
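
For reference, once the full links are available, the comment and reply IDs can be pulled out of them with a small regular expression. This is only a minimal sketch: it assumes the path layout seen in the two example links above (/p/<shortcode>/c/<comment_id>/ with an optional /r/<reply_id>/), and the names CommentRef and parse_comment_href are hypothetical helpers, not part of Selenium or Instagram.

```python
import re
from typing import NamedTuple, Optional

# Assumed URL shape, taken from the two example links in the question:
#   /p/<shortcode>/c/<comment_id>/                 (top-level comment)
#   /p/<shortcode>/c/<comment_id>/r/<reply_id>/    (reply to a comment)
_COMMENT_RE = re.compile(
    r"/p/(?P<shortcode>[^/]+)/c/(?P<comment_id>\d+)(?:/r/(?P<reply_id>\d+))?"
)


class CommentRef(NamedTuple):
    """IDs embedded in a comment permalink (hypothetical container)."""
    shortcode: str
    comment_id: str
    reply_id: Optional[str]  # None for a top-level comment


def parse_comment_href(href: str) -> Optional[CommentRef]:
    """Return the IDs found in a comment link, or None if it has no /c/ part."""
    match = _COMMENT_RE.search(href)
    if match is None:
        return None
    return CommentRef(match["shortcode"], match["comment_id"], match["reply_id"])
```

A link such as [instagram_url]/p/CqhlzLmpfKV/# has no /c/ segment, so the function returns None for it, which makes the unwanted results easy to filter out.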

This is my current code:

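import time

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# driver is assumed to be an already-created WebDriver instance (setup not shown)
driver.get("https://www.instagram.com/weareone.exo/")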

latest_post = WebDriverWait(driver, timeout=40).until(lambda d: d.find_element(By.CLASS_NAME,"_aabd"))
latest_post.click()

comment_ids = []

load_more_path = "/html/body/div[2]/div/div/div[2]/div/div/div[1]/div/div[3]/div/div/div/div/div[2]/div/article/div/div[2]/div/div/div[2]/div[1]/ul/li/div"


# "Load more comments" until 2 clicks
i = 1
while i < 3:
    try:
        WebDriverWait(driver, timeout=20).until(EC.element_to_be_clickable((By.XPATH, load_more_path))).click()
        time.sleep(1.42)
        i += 1
    except Exception:
        # The wait timed out or the click failed, so assume the button is gone
        print("No more 'LOAD MORE COMMENTS' button to be clicked")
        break
    
# "View Replies" if there's any
view_reply_path = 'li > ul > li > div > button[class="_acan _acao _acas _aj1-"]'

WebDriverWait(driver, timeout=20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, view_reply_path)))
view_reply_buttons = driver.find_elements(By.CSS_SELECTOR, view_reply_path)

for button in view_reply_buttons:
    button.click()
    time.sleep(1.32)

time.sleep(5.8)
comment = driver.find_elements(By.CLASS_NAME, "_a9zj")

for c in comment:
    container = c.find_element(By.CLASS_NAME,'_a9zr')

    WebDriverWait(driver, timeout=20).until(EC.element_to_be_clickable((By.XPATH, '//div[2]/div/a')))
    commentid = c.find_element(By.XPATH, '//div[2]/div/a').get_attribute("href")
  
    comment_ids.append(commentid)

    print(commentid)

The output of the code above is shown below. (I replaced the full Instagram URL with [instagram_url] because Stack Overflow wouldn’t accept the question otherwise.)

[instagram_url]/p/CqhlzLmpfKV/# 
[instagram_url]/p/CqhlzLmpfKV/# 
[instagram_url]/p/CqhlzLmpfKV/#
..
..
..
[instagram_url]/p/CqhlzLmpfKV/# 
[instagram_url]/p/CqhlzLmpfKV/# 
[instagram_url]/p/CqhlzLmpfKV/#

Any help is welcome! Thanks in advance.

Asked By: June W.

||

Answers:

Finally, after a lot of trial and error, I found that these two changes solve the problem:

comment = driver.find_elements(By.XPATH, '//ul[@class="_a9ym"]/div/li/div[@class="_a9zm"]')

and

commentid = c.find_element(By.XPATH, ".//a[@role='link' and .//time[contains(@class,'_a9ze _a9zf')]]").get_attribute("href")

Explanation:
It turned out that my old code, which located comments by class name and XPath, also picked up the caption of the post. The caption’s timestamp isn’t wrapped in a link with an href attribute, so looking for one there gave me error messages. I also had to identify the ‘a’ tag by an element below it (the ‘time’ element it contains), because I couldn’t get the link when I referenced the elements above the ‘a’ tag.
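
One more detail that likely explains the repeated [instagram_url]/p/CqhlzLmpfKV/# output: an XPath that starts with // is evaluated from the document root even when it is passed to c.find_element, so every iteration returns the first match on the whole page. The leading dot in .//a[...] is what limits the search to the current comment. The sketch below puts the two changes together under some assumptions: the class names are copied from this answer and may change whenever Instagram updates its markup, collect_comment_hrefs is a hypothetical helper name, and the locator strategy is written as the plain string "xpath" (the value of Selenium's By.XPATH) so the snippet has no import-time dependency.

```python
# "xpath" is the string value of selenium.webdriver.common.by.By.XPATH;
# By.XPATH can be used here instead when selenium is imported.
XPATH = "xpath"

# Selectors copied from the answer above; treat them as assumptions, since
# Instagram's generated class names change over time.
COMMENT_ROWS = '//ul[@class="_a9ym"]/div/li/div[@class="_a9zm"]'
# The leading "." scopes the search to the current comment row.
PERMALINK_IN_ROW = ".//a[@role='link' and .//time[contains(@class,'_a9ze _a9zf')]]"


def collect_comment_hrefs(driver):
    """Return unique comment links in page order (hypothetical helper)."""
    seen = set()
    hrefs = []
    for row in driver.find_elements(XPATH, COMMENT_ROWS):
        # find_elements returns an empty list instead of raising when a row
        # (for example the caption) has no matching link.
        for link in row.find_elements(XPATH, PERMALINK_IN_ROW):
            href = link.get_attribute("href")
            # Keep only links that carry a comment ID, and skip duplicates.
            if href and "/c/" in href and href not in seen:
                seen.add(href)
                hrefs.append(href)
    return hrefs
```

It would be called as comment_ids = collect_comment_hrefs(driver) after the "load more" and "view replies" buttons have been clicked. Using find_elements rather than find_element is a design choice here: rows without a timestamp link are skipped quietly instead of raising NoSuchElementException.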

Whelp. Until someone posts a better, more optimized answer, I’ll accept this one.

Answered By: June W.