Python web scraping: How do I avoid scraping duplicates from signal.nfx.com?

Question:

I’m a web scraping newbie trying to efficiently scrape data from signal.nfx.com. The problem is that I keep scraping the same data over and over, which makes my scraper inefficient. I want to scrape all the investors on a page, but instead I scrape just a few per page repeatedly. How can I resolve this? Check the code below:

import requests
import pandas as pd

url = "https://signal-api.nfx.com/graphql"
headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'}
payload = {"operationName":"vclInvestors",
           "variables":{"slug":"gig-economy-pre-seed",
                        "order":[{}],
                        "after":"OA"},
           "query":"query vclInvestors($slug: String!, $after: String) {n  list(slug: $slug) {n    idn    slugn    investor_countn    vertical {n      idn      display_namen      kindn      __typenamen    }n    location {n      idn      display_namen      __typenamen    }n    stagen    firms {n      idn      namen      slugn      __typenamen    }n    scored_investors(first: 8, after: $after) {n      pageInfo {n        hasNextPagen        hasPreviousPagen        endCursorn        __typenamen      }n      record_countn      edges {n        node {n          ...investorListInvestorProfileFieldsn          __typenamen        }n        __typenamen      }n      __typenamen    }n    __typenamen  }n}nnfragment investorListInvestorProfileFields on InvestorProfile {n  idn  person {n    idn    first_namen    last_namen    namen    slugn  linkedin_urln  twitter_urln  is_men    is_on_target_listn   __typenamen  }n  image_urlsn  positionn  min_investmentn  max_investmentn  target_investmentn  areas_of_interest_freeformn is_preferred_coinvestorn  firm {n    idn  current_fund_sizen  namen    slugn    __typenamen  }n  investment_locations {n    idn    display_namen    location_investor_list {n   stage_namen   idn      slugn      __typenamen    }n    __typenamen  }n  investor_lists {n    idn    stage_namen    slugn    vertical {n   kindn   idn      display_namen      __typenamen    }n    __typenamen  }n  __typenamen}n"}


results = pd.DataFrame()
hasNextPage = True
after = ''

while hasNextPage == True:
    payload['variables']['after'] == after
    jsonData = requests.post(url, headers=headers, json=payload ).json()
    data = jsonData['data']['list']['scored_investors']['edges']
    df = pd.json_normalize(data)
    results = results.append(df, sort=False).reset_index(drop=True)
    
    count = len(results) 
    tot = jsonData['data']['list']['investor_count']
    
    print(f'{count} of {tot}')
    
    hasNextPage = jsonData['data']['list']['scored_investors']['pageInfo']['hasNextPage']
    after = jsonData['data']['list']['scored_investors']['pageInfo']['endCursor']

I was able to scrape over 50,000 rows, but almost all of them were duplicates; see below:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55448 entries, 0 to 55447
Data columns (total 28 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   Unnamed: 0                       55448 non-null  int64  
 1   __typename                       55448 non-null  object 
 2   node.__typename                  55448 non-null  object 
 3   node.id                          55448 non-null  int64  
 4   node.person.id                   55448 non-null  int64  
 5   node.person.first_name           55448 non-null  object 
 6   node.person.last_name            55448 non-null  object 
 7   node.person.name                 55448 non-null  object 
 8   node.person.slug                 55448 non-null  object 
 9   node.person.linkedin_url         55448 non-null  object 
 10  node.person.twitter_url          20793 non-null  object 
 11  node.person.is_me                55448 non-null  bool   
 12  node.person.is_on_target_list    55448 non-null  bool   
 13  node.person.__typename           55448 non-null  object 
 14  node.image_urls                  55448 non-null  object 
 15  node.position                    55448 non-null  object 
 16  node.min_investment              55448 non-null  int64  
 17  node.max_investment              55448 non-null  int64  
 18  node.target_investment           55448 non-null  int64  
 19  node.areas_of_interest_freeform  20793 non-null  object 
 20  node.is_preferred_coinvestor     55448 non-null  bool   
 21  node.firm.id                     55448 non-null  int64  
 22  node.firm.current_fund_size      0 non-null      float64
 23  node.firm.name                   55448 non-null  object 
 24  node.firm.slug                   55448 non-null  object 
 25  node.firm.__typename             55448 non-null  object 
 26  node.investment_locations        55448 non-null  object 
 27  node.investor_lists              55448 non-null  object 
dtypes: bool(3), float64(1), int64(7), object(17)
memory usage: 10.7+ MB

After removing duplicates and unnecessary columns:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8 entries, 0 to 7
Data columns (total 10 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   node.person.name                 8 non-null      object 
 1   node.person.linkedin_url         8 non-null      object 
 2   node.person.twitter_url          3 non-null      object 
 3   node.position                    8 non-null      object 
 4   node.min_investment              8 non-null      int64  
 5   node.max_investment              8 non-null      int64  
 6   node.target_investment           8 non-null      int64  
 7   node.areas_of_interest_freeform  3 non-null      object 
 8   node.firm.current_fund_size      0 non-null      float64
 9   node.firm.name                   8 non-null      object 
dtypes: float64(1), int64(3), object(6)
memory usage: 704.0+ bytes

Asked By: Dsavy


Answers:

You have a typo when assigning your after parameter:

payload['variables']['after'] == after
#                             ^^ should be just a single =

In general, when scraping with while loops you should be very careful to confirm that all parameters are set correctly before sending each request, otherwise you end up just spamming the website with identical queries.
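For reference, here is a minimal sketch of the corrected loop, reusing the url, headers and payload from the question. Besides the single = for the cursor assignment, it uses pd.concat in place of DataFrame.append, which newer pandas versions have removed:

import requests
import pandas as pd

results = pd.DataFrame()
hasNextPage = True
after = ''

while hasNextPage:
    # assignment, not comparison, so the cursor actually advances each page
    payload['variables']['after'] = after
    jsonData = requests.post(url, headers=headers, json=payload).json()

    edges = jsonData['data']['list']['scored_investors']['edges']
    results = pd.concat([results, pd.json_normalize(edges)], ignore_index=True)

    print(f"{len(results)} of {jsonData['data']['list']['investor_count']}")

    pageInfo = jsonData['data']['list']['scored_investors']['pageInfo']
    hasNextPage = pageInfo['hasNextPage']
    after = pageInfo['endCursor']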

One easy way to prevent this is to confirm that the hash of a new response hasn’t been seen before.
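For example, a sketch of that idea with hashlib, assuming the same request setup as above: fingerprint each raw response body and stop if an identical response comes back, which usually means the pagination parameters did not change.

import hashlib

seen_hashes = set()

while hasNextPage:
    payload['variables']['after'] = after
    response = requests.post(url, headers=headers, json=payload)

    # hash the raw response body; an identical page means the cursor
    # was not updated and the request is just repeating itself
    digest = hashlib.sha256(response.content).hexdigest()
    if digest in seen_hashes:
        print('Duplicate response detected - check your pagination parameters')
        break
    seen_hashes.add(digest)

    jsonData = response.json()
    # ... normalize the edges as before, then advance the cursor ...
    pageInfo = jsonData['data']['list']['scored_investors']['pageInfo']
    hasNextPage = pageInfo['hasNextPage']
    after = pageInfo['endCursor']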

Answered By: Granitosaurus