How to scrape multiple tables, including the contained hrefs, with pandas?

Question:

All tables are obtained using pandas, but only the cell text is captured: the href links inside the cells are lost.

Is there any way to get the links along with the table data?

Example – 1
import pandas as pd

for page in range(1, 10):
    # read_html returns a list of DataFrames; take the first table on each page
    df = pd.read_html('https://www.screener.in/screens/881782/rk-all-stocks/?page={page}'.format(page=page))
    df[0].to_csv('tab.csv',  mode='a', index=False, header=1)
Example – 2
import pandas as pd

dfs = []

url = 'https://www.screener.in/screens/881782/rk-all-stocks/?page={}'
for i in range(1, 10):
    # first table on each page; read_html keeps only cell text, so hrefs are lost
    df = pd.read_html(url.format(i), header=None)[0]
    dfs.append(df)

finaldf = pd.concat(dfs)              
finaldf.to_csv('Output.csv')
Output

(screenshot: the scraped table, data only, links missing)

Expected output:

(screenshot: the same table with an extra column containing each row's link)

Asked By: RK Solanki


Answers:

Even though pd.read_html has an extract_links option, it is sometimes easier to do the extraction manually, especially for a request like this one:

import pandas as pd
import requests
import bs4

url = 'https://www.screener.in/screens/881782/rk-all-stocks/?page={}'

cols = ['S.No.', 'Name', 'CMP', 'P/E', 'Mar Cap', 'Div Yld', 'NP Qtr', 'Qtr Profit Var', 'Sales Qtr', 'Qtr Sales Var', 'ROCE']

def extract_data(row):
    # map each cell's text onto the column names
    data = dict(zip(cols, [cell.text.strip() for cell in row.find_all('td')]))
    # the first <a> in the row links to the company page; prepend the site root
    data.update({'href': f"https://www.screener.in{row.find('a')['href']}"})
    return data

data = []
for page in range(1, 10):    
    r = requests.get(url.format(page))
    soup = bs4.BeautifulSoup(r.content, 'html.parser')
    # data rows carry a data-row-company-id attribute, unlike header rows
    rows = soup.find_all('tr', {'data-row-company-id': True})
    df = pd.DataFrame([extract_data(row) for row in rows])
    data.append(df)
out = pd.concat(data, ignore_index=True)

Output:

>>> out
    S.No.              Name      CMP    P/E   Mar Cap Div Yld  NP Qtr Qtr Profit Var Sales Qtr Qtr Sales Var     ROCE                                               href
0      1.      R J Bio-Tech     6.00             5.68    0.00  -17.41          33.07      1.74       1833.33  2400.00            https://www.screener.in/company/536456/
1      2.       Forbes & Co   627.20   0.17    809.09    0.00   -4.08        -128.11    103.43        -24.35  1216.69  https://www.screener.in/company/FORBESGOK/cons...
2      3.    Nexus Surgical    10.90             5.97    0.00    0.14         -62.16      0.30        -83.05  1114.29            https://www.screener.in/company/538874/
3      4.      SBEC Systems    25.65  42.75     25.65    0.00    0.15         -46.43      0.67        -10.67  1107.32  https://www.screener.in/company/517360/consoli...
4      5.  Sri Lak.Sar.Arni    32.70            10.89    0.00   -6.47        -781.05     35.87        -11.69   694.63            https://www.screener.in/company/521161/
..    ...               ...      ...    ...       ...     ...     ...            ...       ...           ...      ...                                                ...
220  221.  Glaxosmi. Pharma  1336.25  51.98  22618.88    2.25  164.56          13.33    802.30         -1.67    36.66  https://www.screener.in/company/GLAXO/consolid...
221  222.    Nitin Spinners   206.80   5.49   1162.35    1.93   31.58         -66.14    537.20        -23.79    36.63         https://www.screener.in/company/NITINSPIN/
222  223.  Timescan Logist.   120.30  14.49     42.03    0.00    0.41                    27.18                  36.52          https://www.screener.in/company/TIMESCAN/
223  224.            Mastek  1648.40  17.52   5015.34    1.15   67.12         -12.85    658.66         19.34    36.48  https://www.screener.in/company/MASTEK/consoli...
224  225.     Lloyds Metals   279.50  16.35  12432.84    0.18  230.03         813.91    999.62        494.06    36.41        https://www.screener.in/company/LLOYDMETAL/

[225 rows x 12 columns]
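
If you also want a CSV file like in the question's examples, the combined frame can be written out in one step (Output.csv is simply the filename from the question's second example):

# write the combined table, links included, to a single CSV
out.to_csv('Output.csv', index=False)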
Answered By: Corralien

As @Corralien mentioned, extract_links has been available since pandas 1.5.0. Depending on the extent of the readjustments, I would still also consider the direct approach via BeautifulSoup, as it allows more flexibility.

Example

import pandas as pd
url = 'https://www.screener.in/screens/881782/rk-all-stocks/?page={}'

df = pd.concat([pd.read_html(url.format(i), header=None, extract_links='body')[0] for i in range(1,10)])   

# extract the href from the Name column tuple and prepend the base URL
df['Link'] = 'https://www.screener.in' + df['Name'].str[1]

# convert all columns except last one (Link) from tuple to string 
for c in df.columns[:-1]:
    df[c] = df[c].str[0]

# filter out all subheader and footer rows
df[(df['S.No.']!='S.No.') & (df['S.No.'].notna())]

# or filter and reorganize columns to fit your needs
# df = df[(df['S.No.']!='S.No.') & (df['S.No.'].notna())][['S.No.', 'Link', 'Name', 'CMP  Rs.', 'P/E', 'Mar Cap  Rs.Cr.', 'Div Yld  %','NP Qtr  Rs.Cr.', 'Qtr Profit Var  %', 'Sales Qtr  Rs.Cr.','Qtr Sales Var  %', 'ROCE  %']]

Output

|     | S.No. | Link                                                     | Name          | CMP Rs. | P/E   | Mar Cap Rs.Cr. | Div Yld % | NP Qtr Rs.Cr. | Qtr Profit Var % | Sales Qtr Rs.Cr. | Qtr Sales Var % | ROCE %  |
|-----|-------|----------------------------------------------------------|---------------|---------|-------|----------------|-----------|---------------|------------------|------------------|-----------------|---------|
| 0   | 1     | https://www.screener.in/company/536456/                  | R J Bio-Tech  | 6       |       | 5.68           | 0         | -17.41        | 33.07            | 1.74             | 1833.33         | 2400    |
| 1   | 2     | https://www.screener.in/company/FORBESGOK/consolidated/  | Forbes & Co   | 627.2   | 0.17  | 809.09         | 0         | -4.08         | -128.11          | 103.43           | -24.35          | 1216.69 |
| ... | ...   | ...                                                      | ...           | ...     | ...   | ...            | ...       | ...           | ...              | ...              | ...             | ...     |
| 24  | 224   | https://www.screener.in/company/MASTEK/consolidated/     | Mastek        | 1648.4  | 17.52 | 5015.34        | 1.15      | 67.12         | -12.85           | 658.66           | 19.34           | 36.48   |
| 25  | 225   | https://www.screener.in/company/LLOYDMETAL/              | Lloyds Metals | 279.5   | 16.35 | 12432.8        | 0.18      | 230.03        | 813.91           | 999.62           | 494.06          | 36.41   |
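
For reference, this is what the raw cells look like before the .str[0] conversion above: with extract_links='body', every body cell becomes a (text, href) tuple, and href is None where a cell contains no link. A minimal sketch, assuming pandas >= 1.5.0 and the same column names as above (the printed values will shift as the screen's data changes):

import pandas as pd

url = 'https://www.screener.in/screens/881782/rk-all-stocks/?page={}'
df = pd.read_html(url.format(1), extract_links='body')[0]

# each body cell is a (text, href) tuple; href is None for cells without a link
print(df['Name'].iloc[0])  # e.g. ('R J Bio-Tech', '/company/536456/')
print(df['P/E'].iloc[0])   # e.g. ('', None)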
Answered By: HedgeHog