Can we convert text to the matching page header and URL found by a search engine, using pandas?

Question:

Here’s my input

app
fix
jd_id
zalora
leomaster

Here’s my expected output

app         header                                                        url
fix         Fix.com | Your Source for Genuine Parts & DIY Repair Help     https://www.fix.com/
jd_id       jdid                                                          https://www.jd.id/
zalora      ZALORA Indonesia: Belanja Online Fashion & Lifestyle Terbaru  https://www.zalora.co.id/
leomaster   Leomaster — Manufacturers of fine fabrics                     https://www.leomaster.it/en/

This can be done manually with Google Chrome and a tedious copy-paste process, but since I have 22,000+ apps that need to be checked, I need a scalable solution.

Asked By: Nabih Bawazir

||

Answers:

To do this with Google you will need a Google Search API account, so my solution uses DuckDuckGo instead. The same approach should carry over to Google:

import pandas as pd
import requests
from bs4 import BeautifulSoup

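def extract_info(app_name):
    # Adding "website" biases the search toward the app's own site.
    query = f"{app_name} website"

    search_url = "https://duckduckgo.com/html/"

    # Send a browser-like User-Agent header with the request.
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
    }

    # Passing the query through params lets requests URL-encode it.
    response = requests.get(search_url, params={"q": query}, headers=headers, timeout=10)
    if not response.ok:
        return None

    soup = BeautifulSoup(response.content, "html.parser")

    search_results = soup.find_all("div", class_="result")

    # Return the first result whose link is an absolute https URL.
    for result in search_results:
        link = result.find("a")
        if link is not None:
            header = link.get_text(strip=True)
            href = link.get("href")
            if href and href.startswith("https://"):
                return {"app": app_name, "header": header, "url": href}

    return None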

app_list = ["fix", "jd_id", "zalora", "leomaster"]

results = [extract_info(app) for app in app_list]

results = [r for r in results if r is not None]

df = pd.DataFrame(results)

print(df)

which returns

         app                                             header  
0        fix                     iFixit: The Free Repair Manual   
1      jd_id                                              Jd.id   
2     zalora  Zalora - Asia'S Leading Online Fashion Destina...   
3  leomaster                               LEOMASTER | LinkedIn   

                                          url  
0                     https://www.ifixit.com/  
1                          https://www.jd.id/  
2                         https://zalora.com/  
3  https://www.linkedin.com/company/leomaster
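
Note that the first hit is not always the site you expect (iFixit for "fix", LinkedIn for "leomaster"), and with 22,000+ apps you probably also want a pause between requests and a run that survives a failed lookup. Below is a minimal sketch of one way to do that. The names `lookup_all` and `host_matches` are my own helpers, not part of pandas or requests, and the "host starts with the app name" test is only a rough heuristic for picking out rows worth checking by hand.

```python
import re
import time
from urllib.parse import urlparse

import pandas as pd


def host_matches(app_name, url):
    """Rough heuristic: does the URL's host start with the app name?"""
    # Keep only letters and digits so "jd_id" lines up with "jd.id".
    name = re.sub(r"[^a-z0-9]", "", app_name.lower())
    host = urlparse(url).netloc.lower()
    if host.startswith("www."):
        host = host[len("www."):]
    host = re.sub(r"[^a-z0-9]", "", host)
    return bool(name) and host.startswith(name)


def lookup_all(apps, lookup, delay=1.0):
    """Call lookup(app) for every app, sleeping `delay` seconds between calls."""
    rows = []
    for i, app in enumerate(apps):
        if i > 0:
            time.sleep(delay)
        try:
            row = lookup(app)
        except Exception:
            row = None  # one failed lookup should not stop the whole run
        if row is None:
            # Keep the app in the output so missing results stay visible.
            row = {"app": app, "header": None, "url": None}
        rows.append(row)

    out = pd.DataFrame(rows, columns=["app", "header", "url"])
    # Flag rows whose URL host does not look like the app name.
    out["host_matches"] = [
        isinstance(u, str) and host_matches(a, u)
        for a, u in zip(out["app"], out["url"])
    ]
    return out
```

With the function from the answer you would call `df = lookup_all(app_list, extract_info, delay=1.0)` and then review `df[~df["host_matches"]]`. At one second per request, 22,000 apps means a bit over six hours of waiting, so it may be worth running in batches and saving each batch with `df.to_csv(...)`.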