How to scrape multiple tables, including the contained href links, with pandas?
Question:
All the tables are obtained with pandas, but only the table data comes through; the href links are lost.
Is there any solution to get the links along with the table data?
Example – 1
import pandas as pd

for page in range(1, 10):
    df = pd.read_html('https://www.screener.in/screens/881782/rk-all-stocks/?page={page}'.format(page=page))
    df[0].to_csv('tab.csv', mode='a', index=False, header=True)
Example – 2
import pandas as pd
dfs = []
url = 'https://www.screener.in/screens/881782/rk-all-stocks/?page={}'
for i in range(1, 10):
    df = pd.read_html(url.format(i), header=None)[0]
    dfs.append(df)
finaldf = pd.concat(dfs)
finaldf.to_csv('Output.csv')
Output
Expected output:
Answers:
Even though pd.read_html has an extract_links option, it is sometimes easier to do it manually, especially for this kind of request:
import pandas as pd
import requests
import bs4

url = 'https://www.screener.in/screens/881782/rk-all-stocks/?page={}'
cols = ['S.No.', 'Name', 'CMP', 'P/E', 'Mar Cap', 'Div Yld', 'NP Qtr', 'Qtr Profit Var', 'Sales Qtr', 'Qtr Sales Var', 'ROCE']

def extract_data(row):
    # pair each cell's text with its column name
    data = dict(zip(cols, [cell.text.strip() for cell in row.find_all('td')]))
    # the href is relative, so prepend the site root
    data.update({'href': f"https://www.screener.in{row.find('a')['href']}"})
    return data

data = []
for page in range(1, 10):
    r = requests.get(url.format(page))
    soup = bs4.BeautifulSoup(r.content, 'html.parser')
    rows = soup.find_all('tr', {'data-row-company-id': True})
    df = pd.DataFrame([extract_data(row) for row in rows])
    data.append(df)

out = pd.concat(data, ignore_index=True)
Output:
>>> out
S.No. Name CMP P/E Mar Cap Div Yld NP Qtr Qtr Profit Var Sales Qtr Qtr Sales Var ROCE href
0 1. R J Bio-Tech 6.00 5.68 0.00 -17.41 33.07 1.74 1833.33 2400.00 https://www.screener.in/company/536456/
1 2. Forbes & Co 627.20 0.17 809.09 0.00 -4.08 -128.11 103.43 -24.35 1216.69 https://www.screener.in/company/FORBESGOK/cons...
2 3. Nexus Surgical 10.90 5.97 0.00 0.14 -62.16 0.30 -83.05 1114.29 https://www.screener.in/company/538874/
3 4. SBEC Systems 25.65 42.75 25.65 0.00 0.15 -46.43 0.67 -10.67 1107.32 https://www.screener.in/company/517360/consoli...
4 5. Sri Lak.Sar.Arni 32.70 10.89 0.00 -6.47 -781.05 35.87 -11.69 694.63 https://www.screener.in/company/521161/
.. ... ... ... ... ... ... ... ... ... ... ... ...
220 221. Glaxosmi. Pharma 1336.25 51.98 22618.88 2.25 164.56 13.33 802.30 -1.67 36.66 https://www.screener.in/company/GLAXO/consolid...
221 222. Nitin Spinners 206.80 5.49 1162.35 1.93 31.58 -66.14 537.20 -23.79 36.63 https://www.screener.in/company/NITINSPIN/
222 223. Timescan Logist. 120.30 14.49 42.03 0.00 0.41 27.18 36.52 https://www.screener.in/company/TIMESCAN/
223 224. Mastek 1648.40 17.52 5015.34 1.15 67.12 -12.85 658.66 19.34 36.48 https://www.screener.in/company/MASTEK/consoli...
224 225. Lloyds Metals 279.50 16.35 12432.84 0.18 230.03 813.91 999.62 494.06 36.41 https://www.screener.in/company/LLOYDMETAL/
[225 rows x 12 columns]
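To end up with a single CSV file, as the question's examples aim for, the combined frame can simply be written out. A minimal sketch with a stand-in frame (the sample rows below are hypothetical):

```python
import pandas as pd

# stand-in for the `out` frame built above; the sample rows are hypothetical
out = pd.DataFrame({
    'S.No.': ['1.', '2.'],
    'Name': ['R J Bio-Tech', 'Forbes & Co'],
    'href': ['https://www.screener.in/company/536456/',
             'https://www.screener.in/company/FORBESGOK/consolidated/'],
})

# one CSV with the link column included; index=False drops the row numbers
out.to_csv('Output.csv', index=False)
```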
As @Corralien mentioned, since pandas 1.5.0 there is extract_links available. Depending on the extent of the readjustments needed, I would also consider the direct approach via BeautifulSoup, as it allows more flexibility.
Example
import pandas as pd
url = 'https://www.screener.in/screens/881782/rk-all-stocks/?page={}'
df = pd.concat([pd.read_html(url.format(i), header=None, extract_links='body')[0] for i in range(1,10)])
# extract the href from name column tuple and prepend base_url
df['Link'] = 'https://www.screener.in'+df['Name'].str[1]
# convert all columns except the last one (Link) from tuple to string
for c in df.columns[:-1]:
    df[c] = df[c].str[0]

# filter out all subheader or footer rows
df = df[(df['S.No.']!='S.No.') & (df['S.No.'].notna())]

# or filter and reorganize columns to fit your needs
# df = df[(df['S.No.']!='S.No.') & (df['S.No.'].notna())][['S.No.', 'Link', 'Name', 'CMP Rs.', 'P/E', 'Mar Cap Rs.Cr.', 'Div Yld %', 'NP Qtr Rs.Cr.', 'Qtr Profit Var %', 'Sales Qtr Rs.Cr.', 'Qtr Sales Var %', 'ROCE %']]
Output
|  | S.No. | Link | Name | CMP Rs. | P/E | Mar Cap Rs.Cr. | Div Yld % | NP Qtr Rs.Cr. | Qtr Profit Var % | Sales Qtr Rs.Cr. | Qtr Sales Var % | ROCE % |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | https://www.screener.in/company/536456/ | R J Bio-Tech | 6 | 5.68 | 0 | -17.41 | 33.07 | 1.74 | 1833.33 | 2400 | |
| 1 | 2 | https://www.screener.in/company/FORBESGOK/consolidated/ | Forbes & Co | 627.2 | 0.17 | 809.09 | 0 | -4.08 | -128.11 | 103.43 | -24.35 | 1216.69 |
| … | | | | | | | | | | | | |
| 24 | 224 | https://www.screener.in/company/MASTEK/consolidated/ | Mastek | 1648.4 | 17.52 | 5015.34 | 1.15 | 67.12 | -12.85 | 658.66 | 19.34 | 36.48 |
| 25 | 225 | https://www.screener.in/company/LLOYDMETAL/ | Lloyds Metals | 279.5 | 16.35 | 12432.8 | 0.18 | 230.03 | 813.91 | 999.62 | 494.06 | 36.41 |
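The effect of extract_links can also be checked offline on a tiny inline table, without hitting the site (the table below is a made-up example; this requires pandas ≥ 1.5 plus an HTML parser such as lxml):

```python
import io
import pandas as pd

# a minimal, made-up table mimicking one screener row
html = """
<table>
  <thead><tr><th>Name</th><th>CMP</th></tr></thead>
  <tbody>
    <tr><td><a href="/company/536456/">R J Bio-Tech</a></td><td>6.00</td></tr>
  </tbody>
</table>
"""

# with extract_links='body', each body cell becomes a (text, href) tuple;
# cells without a link get (text, None)
df = pd.read_html(io.StringIO(html), extract_links='body')[0]
name, href = df.loc[0, 'Name']
```

From here, `href` holds the relative link (`/company/536456/`), ready to be prepended with the base URL as in the answers above.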