Getting a 200 response from the requests library but not through Scrapy in Python

Question:

I tried to scrape data with a Scrapy spider in Python from the target URL: https://www.accenture.com/ro-en/services/data-analytics-index#block-what-we-think
but it fails with the error: twisted.python.failure.Failure builtins.ValueError: not enough values to unpack (expected 2, got 1)

But if I try to scrape the data using the Python requests library, it works fine.

Asked By: Manoj Bhatt

||

Answers:

This is a known issue in the upstream library Twisted, triggered by the website sending a very large header.

If you check the headers for the above URL, you can see that the content-security-policy header is extremely long.

❯ curl -I "https://www.accenture.com/ro-en/services/data-analytics-index#block-what-we-think"
HTTP/2 200 
content-type: text/html;charset=utf-8
content-length: 237650
content-security-policy: default-src 'unsafe-eval' 'unsafe-inline' *.adobeaemcloud.com *.flipsnack.com *.cdnsvc.com *.novetta.com  *.datadoghq-browser-agent.com *.day.com *.scene7.com *.accenture.com *.reddit.com *.captcha.com *.amazonaws.com *.bing.com *.cdninstagram.com *.clicktale.net *.cloudflare.com *.demandbase.com *.demdex.net *.facebook.net *.fls.doubleclick.net *.fontawesome.com *.microsoftonline.com *.onetrust.com *.siteimprove.com *.siteimprove.net *.vidyard.com *.storied.co *.cookielaw.org *.accenture.test *.bnr.nl *.mktoresp.com *.adobe.com *.clarity.ms https://cdn.flipsnack.com https://t.co *.ads-twitter.com *.twitter.com https://static.ads-twitter.com *.confirmit.com https://digitalfeedback.us.confirmit.com https://www.haceonline.org *.contentsquare.com *.salesforce.com https://tableau.javelingroup.com *.javelingroup.com *.slidesharecdn.com *.sndcdn.com *.soundcloud.com *.doubleclick.net https://yt3.ggpht.com https://www.gstatic.com https://adservice.google.co.za https://adservice.google.ca https://maxcdn.bootstrapcdn.com https://cdn.embed.ly https://cdn.jsdelivr.net https://d3js.org https://insight.adsrvr.org https://schema.org https://i.ytimg.com https://c.contentsquare.net https://l.contentsquare.net https://k-aeu1.contentsquare.net https://q-aus1.contentsquare.net *.contentsquare.net *.ytimg.com *.apple.com https://player.simplecast.com *.libsyn.com https://adservice.google.co.in https://public.tableau.com https://embed.podcasts.apple.com https://w.soundcloud.com *.accenture.cn https://html5-player.libsyn.com https://px4.ads.linkedin.com https://adservice.google.com.ph https://adservice.google.com https://units.knotch.it https://ad.doubleclick.net https://www.google.com.sg https://accenture.lightinfosys.com https://accenture.mettl.de https://accenture.percipio.com https://accenture.tt.omtrdc.net https://api.company-target.com https://assets.adobedtm.com https://authenticate.cocubes.com/candidate https://cdn.cookielaw.org 
https://cm.everesttech.net https://cocubesprod.com https://fonts.googleapis.com https://fonts.gstatic.com https://idsync.rlcdn.com https://img.en25.com https://india.jobs.accenture.com https://login.rosettastone.com https://ml314.com https://p.adsymptotic.com https://pbs.twimg.com https://px.ads.linkedin.com https://s.delvenetworks.com https://secure.na2.echosign.com https://secure.in1.echosign.com https://snap.licdn.com https://static.echocdn.com https://stats.g.doubleclick.net https://sync.crwdcntrl.net https://tools.ietf.org https://unpkg.com https://www.accenturealumni.com https://www.facebook.com https://www.glassdoor.com https://www.google.com.ph https://www.google-analytics.com https://www.googletagmanager.com https://www.indeed.com https://www.knotch-cdn.com https://www.monster.com https://www.redditstatic.com https://www.slideshare.net https://www.youtube.com https://www.youtube-nocookie.com https://ssl.google-analytics.com https://munchkin.marketo.net https://www.google.co.in https://dev.virtualearth.net https://a.mktgcdn.com data *.accenture.jp *.newsroom.accenture.jp *.bat.bing.com *.dpm.demdex.net *.rlcdn.com *.youtube.com *.ml314.com *.linkedin.com *.ads.linkedin.com *.facebook.com *.adsymptotic.com *.knotch.it *.login.microsoftonline.com *.google.com *.newsroom.accenture.de *.login.live.com *.acnprodedit-2016.accenture.com *.adnxs.com *.yahoo.com *.analytics.yahoo.com *.adsrvr.org *.casalemedia.com *.candidate.accenture.com *.app-5292-eus2-prejoiner-prod-web.azurewebsites.net *.rubiconproject.com *.indaorm.accenture.com *.t.co *.bidswitch.net *.pubmatic.com blob:; script-src 'unsafe-inline' 'unsafe-eval' *.adobeaemcloud.com *.flipsnack.com *.cdnsvc.com *.novetta.com  *.datadoghq-browser-agent.com *.day.com *.scene7.com *.accenture.com *.reddit.com *.captcha.com *.amazonaws.com *.bing.com *.cdninstagram.com *.clicktale.net *.cloudflare.com *.demandbase.com *.demdex.net *.facebook.net *.fls.doubleclick.net *.fontawesome.com *.microsoftonline.com 
*.onetrust.com *.siteimprove.com *.siteimprove.net *.vidyard.com *.storied.co *.cookielaw.org *.accenture.test *.bnr.nl *.mktoresp.com *.adobe.com *.clarity.ms https://cdn.flipsnack.com https://t.co *.ads-twitter.com *.twitter.com https://static.ads-twitter.com *.confirmit.com https://digitalfeedback.us.confirmit.com https://www.haceonline.org *.contentsquare.com *.salesforce.com https://tableau.javelingroup.com *.javelingroup.com *.slidesharecdn.com *.sndcdn.com *.soundcloud.com *.doubleclick.net https://yt3.ggpht.com https://www.gstatic.com https://adservice.google.co.za https://adservice.google.ca https://maxcdn.bootstrapcdn.com https://cdn.embed.ly https://cdn.jsdelivr.net https://d3js.org https://insight.adsrvr.org https://schema.org https://i.ytimg.com https://c.contentsquare.net https://l.contentsquare.net https://k-aeu1.contentsquare.net https://q-aus1.contentsquare.net *.contentsquare.net *.ytimg.com *.apple.com https://player.simplecast.com *.libsyn.com https://adservice.google.co.in https://public.tableau.com https://embed.podcasts.apple.com https://w.soundcloud.com *.accenture.cn https://html5-player.libsyn.com https://px4.ads.linkedin.com https://adservice.google.com.ph https://adservice.google.com https://units.knotch.it https://ad.doubleclick.net https://www.google.com.sg https://accenture.lightinfosys.com https://accenture.mettl.de https://accenture.percipio.com https://accenture.tt.omtrdc.net https://api.company-target.com https://assets.adobedtm.com https://authenticate.cocubes.com/candidate https://cdn.cookielaw.org https://cm.everesttech.net https://cocubesprod.com https://fonts.googleapis.com https://fonts.gstatic.com https://idsync.rlcdn.com https://img.en25.com https://india.jobs.accenture.com https://login.rosettastone.com https://ml314.com https://p.adsymptotic.com https://pbs.twimg.com https://px.ads.linkedin.com https://s.delvenetworks.com https://secure.na2.echosign.com https://secure.in1.echosign.com https://snap.licdn.com 
https://static.echocdn.com https://stats.g.doubleclick.net https://sync.crwdcntrl.net https://tools.ietf.org https://unpkg.com https://www.accenturealumni.com https://www.facebook.com https://www.glassdoor.com https://www.google.com.ph https://www.google-analytics.com https://www.googletagmanager.com https://www.indeed.com https://www.knotch-cdn.com https://www.monster.com https://www.redditstatic.com https://www.slideshare.net https://www.youtube.com https://www.youtube-nocookie.com https://ssl.google-analytics.com https://munchkin.marketo.net https://www.google.co.in https://dev.virtualearth.net https://a.mktgcdn.com data *.accenture.jp *.newsroom.accenture.jp *.bat.bing.com *.dpm.demdex.net *.rlcdn.com *.youtube.com *.ml314.com *.linkedin.com *.ads.linkedin.com *.facebook.com *.adsymptotic.com *.knotch.it *.login.microsoftonline.com *.google.com *.newsroom.accenture.de *.login.live.com *.acnprodedit-2016.accenture.com *.adnxs.com *.yahoo.com *.analytics.yahoo.com *.adsrvr.org *.casalemedia.com *.candidate.accenture.com *.app-5292-eus2-prejoiner-prod-web.azurewebsites.net *.rubiconproject.com *.indaorm.accenture.com *.t.co *.bidswitch.net *.pubmatic.com  blob:; img-src *.adobeaemcloud.com *.flipsnack.com *.cdnsvc.com *.novetta.com  *.datadoghq-browser-agent.com *.day.com *.scene7.com *.accenture.com *.reddit.com *.captcha.com *.amazonaws.com *.bing.com *.cdninstagram.com *.clicktale.net *.cloudflare.com *.demandbase.com *.demdex.net *.facebook.net *.fls.doubleclick.net *.fontawesome.com *.microsoftonline.com *.onetrust.com *.siteimprove.com *.siteimprove.net *.vidyard.com *.storied.co *.cookielaw.org *.accenture.test *.bnr.nl *.mktoresp.com *.adobe.com *.clarity.ms https://cdn.flipsnack.com https://t.co *.ads-twitter.com *.twitter.com https://static.ads-twitter.com *.confirmit.com https://digitalfeedback.us.confirmit.com https://www.haceonline.org *.contentsquare.com *.salesforce.com https://tableau.javelingroup.com *.javelingroup.com *.slidesharecdn.com 
*.sndcdn.com *.soundcloud.com *.doubleclick.net https://yt3.ggpht.com https://www.gstatic.com https://adservice.google.co.za https://adservice.google.ca https://maxcdn.bootstrapcdn.com https://cdn.embed.ly https://cdn.jsdelivr.net https://d3js.org https://insight.adsrvr.org https://schema.org https://i.ytimg.com https://c.contentsquare.net https://l.contentsquare.net https://k-aeu1.contentsquare.net https://q-aus1.contentsquare.net *.contentsquare.net *.ytimg.com *.apple.com https://player.simplecast.com *.libsyn.com https://adservice.google.co.in https://public.tableau.com https://embed.podcasts.apple.com https://w.soundcloud.com *.accenture.cn https://html5-player.libsyn.com https://px4.ads.linkedin.com https://adservice.google.com.ph https://adservice.google.com https://units.knotch.it https://ad.doubleclick.net https://www.google.com.sg https://accenture.lightinfosys.com https://accenture.mettl.de https://accenture.percipio.com https://accenture.tt.omtrdc.net https://api.company-target.com https://assets.adobedtm.com https://authenticate.cocubes.com/candidate https://cdn.cookielaw.org https://cm.everesttech.net https://cocubesprod.com https://fonts.googleapis.com https://fonts.gstatic.com https://idsync.rlcdn.com https://img.en25.com https://india.jobs.accenture.com https://login.rosettastone.com https://ml314.com https://p.adsymptotic.com https://pbs.twimg.com https://px.ads.linkedin.com https://s.delvenetworks.com https://secure.na2.echosign.com https://secure.in1.echosign.com https://snap.licdn.com https://static.echocdn.com https://stats.g.doubleclick.net https://sync.crwdcntrl.net https://tools.ietf.org https://unpkg.com https://www.accenturealumni.com https://www.facebook.com https://www.glassdoor.com https://www.google.com.ph https://www.google-analytics.com https://www.googletagmanager.com https://www.indeed.com https://www.knotch-cdn.com https://www.monster.com https://www.redditstatic.com https://www.slideshare.net https://www.youtube.com 
https://www.youtube-nocookie.com https://ssl.google-analytics.com https://munchkin.marketo.net https://www.google.co.in https://dev.virtualearth.net https://a.mktgcdn.com data *.accenture.jp *.newsroom.accenture.jp *.bat.bing.com *.dpm.demdex.net *.rlcdn.com *.youtube.com *.ml314.com *.linkedin.com *.ads.linkedin.com *.facebook.com *.adsymptotic.com *.knotch.it *.login.microsoftonline.com *.google.com *.newsroom.accenture.de *.login.live.com *.acnprodedit-2016.accenture.com *.adnxs.com *.yahoo.com *.analytics.yahoo.com *.adsrvr.org *.casalemedia.com *.candidate.accenture.com *.app-5292-eus2-prejoiner-prod-web.azurewebsites.net *.rubiconproject.com *.indaorm.accenture.com *.t.co *.bidswitch.net *.pubmatic.com  data:; connect-src *.adobeaemcloud.com *.flipsnack.com *.cdnsvc.com *.novetta.com  https://rum.browser-intake-datadoghq.com *.datadoghq.com *.day.com *.scene7.com *.accenture.com *.reddit.com *.captcha.com *.amazonaws.com *.bing.com *.cdninstagram.com *.clicktale.net *.cloudflare.com *.demandbase.com *.demdex.net *.facebook.net *.fls.doubleclick.net *.fontawesome.com *.microsoftonline.com *.onetrust.com *.siteimprove.com *.siteimprove.net *.vidyard.com *.storied.co *.cookielaw.org *.accenture.test *.bnr.nl *.mktoresp.com *.adobe.com *.clarity.ms https://cdn.flipsnack.com https://t.co *.ads-twitter.com *.twitter.com https://static.ads-twitter.com *.confirmit.com https://digitalfeedback.us.confirmit.com https://www.haceonline.org *.contentsquare.com *.salesforce.com https://tableau.javelingroup.com *.javelingroup.com *.slidesharecdn.com *.sndcdn.com *.soundcloud.com *.doubleclick.net https://yt3.ggpht.com https://www.gstatic.com https://adservice.google.co.za https://adservice.google.ca https://maxcdn.bootstrapcdn.com https://cdn.embed.ly https://cdn.jsdelivr.net https://d3js.org https://insight.adsrvr.org https://schema.org https://i.ytimg.com https://c.contentsquare.net https://l.contentsquare.net https://k-aeu1.contentsquare.net 
https://q-aus1.contentsquare.net *.contentsquare.net *.ytimg.com *.apple.com https://player.simplecast.com *.libsyn.com https://adservice.google.co.in https://public.tableau.com https://embed.podcasts.apple.com https://w.soundcloud.com *.accenture.cn https://html5-player.libsyn.com https://px4.ads.linkedin.com https://adservice.google.com.ph https://adservice.google.com https://units.knotch.it https://ad.doubleclick.net https://www.google.com.sg https://accenture.lightinfosys.com https://accenture.mettl.de https://accenture.percipio.com https://accenture.tt.omtrdc.net https://api.company-target.com https://assets.adobedtm.com https://authenticate.cocubes.com/candidate https://cdn.cookielaw.org https://cm.everesttech.net https://cocubesprod.com https://fonts.googleapis.com https://fonts.gstatic.com https://idsync.rlcdn.com https://img.en25.com https://india.jobs.accenture.com https://login.rosettastone.com https://ml314.com https://p.adsymptotic.com https://pbs.twimg.com https://px.ads.linkedin.com https://s.delvenetworks.com https://secure.na2.echosign.com https://secure.in1.echosign.com https://snap.licdn.com https://static.echocdn.com https://stats.g.doubleclick.net https://sync.crwdcntrl.net https://tools.ietf.org https://unpkg.com https://www.accenturealumni.com https://www.facebook.com https://www.glassdoor.com https://www.google.com.ph https://www.google-analytics.com https://www.googletagmanager.com https://www.indeed.com https://www.knotch-cdn.com https://www.monster.com https://www.redditstatic.com https://www.slideshare.net https://www.youtube.com https://www.youtube-nocookie.com https://ssl.google-analytics.com https://munchkin.marketo.net https://www.google.co.in https://dev.virtualearth.net https://a.mktgcdn.com data *.accenture.jp *.newsroom.accenture.jp *.bat.bing.com *.dpm.demdex.net *.rlcdn.com *.youtube.com *.ml314.com *.linkedin.com *.ads.linkedin.com *.facebook.com *.adsymptotic.com *.knotch.it *.login.microsoftonline.com *.google.com 
*.newsroom.accenture.de *.login.live.com *.acnprodedit-2016.accenture.com *.adnxs.com *.yahoo.com *.analytics.yahoo.com *.adsrvr.org *.casalemedia.com *.candidate.accenture.com *.app-5292-eus2-prejoiner-prod-web.azurewebsites.net *.rubiconproject.com *.indaorm.accenture.com *.t.co *.bidswitch.net *.pubmatic.com ; font-src *.adobeaemcloud.com *.flipsnack.com *.cdnsvc.com *.novetta.com  *.datadoghq-browser-agent.com *.day.com *.scene7.com *.accenture.com *.reddit.com *.captcha.com *.amazonaws.com *.bing.com *.cdninstagram.com *.clicktale.net *.cloudflare.com *.demandbase.com *.demdex.net *.facebook.net *.fls.doubleclick.net *.fontawesome.com *.microsoftonline.com *.onetrust.com *.siteimprove.com *.siteimprove.net *.vidyard.com *.storied.co *.cookielaw.org *.accenture.test *.bnr.nl *.mktoresp.com *.adobe.com *.clarity.ms https://cdn.flipsnack.com https://t.co *.ads-twitter.com *.twitter.com https://static.ads-twitter.com *.confirmit.com https://digitalfeedback.us.confirmit.com https://www.haceonline.org *.contentsquare.com *.salesforce.com https://tableau.javelingroup.com *.javelingroup.com *.slidesharecdn.com *.sndcdn.com *.soundcloud.com *.doubleclick.net https://yt3.ggpht.com https://www.gstatic.com https://adservice.google.co.za https://adservice.google.ca https://maxcdn.bootstrapcdn.com https://cdn.embed.ly https://cdn.jsdelivr.net https://d3js.org https://insight.adsrvr.org https://schema.org https://i.ytimg.com https://c.contentsquare.net https://l.contentsquare.net https://k-aeu1.contentsquare.net https://q-aus1.contentsquare.net *.contentsquare.net *.ytimg.com *.apple.com https://player.simplecast.com *.libsyn.com https://adservice.google.co.in https://public.tableau.com https://embed.podcasts.apple.com https://w.soundcloud.com *.accenture.cn https://html5-player.libsyn.com https://px4.ads.linkedin.com https://adservice.google.com.ph https://adservice.google.com https://units.knotch.it https://ad.doubleclick.net https://www.google.com.sg 
https://accenture.lightinfosys.com https://accenture.mettl.de https://accenture.percipio.com https://accenture.tt.omtrdc.net https://api.company-target.com https://assets.adobedtm.com https://authenticate.cocubes.com/candidate https://cdn.cookielaw.org https://cm.everesttech.net https://cocubesprod.com https://fonts.googleapis.com https://fonts.gstatic.com https://idsync.rlcdn.com https://img.en25.com https://india.jobs.accenture.com https://login.rosettastone.com https://ml314.com https://p.adsymptotic.com https://pbs.twimg.com https://px.ads.linkedin.com https://s.delvenetworks.com https://secure.na2.echosign.com https://secure.in1.echosign.com https://snap.licdn.com https://static.echocdn.com https://stats.g.doubleclick.net https://sync.crwdcntrl.net https://tools.ietf.org https://unpkg.com https://www.accenturealumni.com https://www.facebook.com https://www.glassdoor.com https://www.google.com.ph https://www.google-analytics.com https://www.googletagmanager.com https://www.indeed.com https://www.knotch-cdn.com https://www.monster.com https://www.redditstatic.com https://www.slideshare.net https://www.youtube.com https://www.youtube-nocookie.com https://ssl.google-analytics.com https://munchkin.marketo.net https://www.google.co.in https://dev.virtualearth.net https://a.mktgcdn.com data *.accenture.jp *.newsroom.accenture.jp *.bat.bing.com *.dpm.demdex.net *.rlcdn.com *.youtube.com *.ml314.com *.linkedin.com *.ads.linkedin.com *.facebook.com *.adsymptotic.com *.knotch.it *.login.microsoftonline.com *.google.com *.newsroom.accenture.de *.login.live.com *.acnprodedit-2016.accenture.com *.adnxs.com *.yahoo.com *.analytics.yahoo.com *.adsrvr.org *.casalemedia.com *.candidate.accenture.com *.app-5292-eus2-prejoiner-prod-web.azurewebsites.net *.rubiconproject.com *.indaorm.accenture.com *.t.co *.bidswitch.net *.pubmatic.com  data:; upgrade-insecure-requests; block-all-mixed-content
x-xss-protection: 1; mode=block
x-frame-options: SAMEORIGIN
cache-control: public, max-age=7200, stale-while-revalidate=600, stale-if-error=600
strict-transport-security: max-age=31536000; includeSubdomains
last-modified: Sat, 29 Oct 2022 08:30:44 GMT
etag: "3a052-5ec2830c18354"
x-vhost: publish
set-cookie: affinity="5050abfac79c08a8"; Path=/; HttpOnly
x-content-type-options: nosniff
accept-ranges: bytes
date: Sat, 29 Oct 2022 08:56:14 GMT
x-served-by: cache-bom4748-BOM
x-timer: S1667033774.209945,VS0,VS0,VE404
vary: Accept-Encoding
x-cache: Miss from cloudfront
via: 1.1 8415fbfe8b717c99d8a0b872dac2363c.cloudfront.net (CloudFront)
x-amz-cf-pop: DEL54-C2
x-amz-cf-id: R5thS1i6ACKo2GXI7l7rSfHoSEFrUcGsXzJgeofPPJP56z0EEYaiRQ==
age: 0
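
The same check can be sketched in Python. This is only a rough illustration: `oversized_headers` and `report_oversized_headers` are hypothetical helper names, the 16384 limit is the Twisted default cited later in this thread, and the length is approximated as the size of the `name: value` line, which may differ slightly from what Twisted actually counts.

```python
# Default value of twisted.protocols.basic.LineReceiver.MAX_LENGTH,
# as cited in this thread (assumed here, not read from Twisted).
TWISTED_DEFAULT_MAX_LENGTH = 16384


def oversized_headers(headers, limit=TWISTED_DEFAULT_MAX_LENGTH):
    """Return (name, length) for each header whose 'name: value' line exceeds limit bytes."""
    result = []
    for name, value in headers.items():
        line_length = len(f"{name}: {value}".encode("utf-8"))
        if line_length > limit:
            result.append((name, line_length))
    return result


def report_oversized_headers(url):
    """Fetch only the headers of url with requests and print the oversized ones."""
    import requests  # imported here so the pure helper above has no dependencies

    response = requests.head(url, timeout=30)
    for name, length in oversized_headers(response.headers):
        print(f"{name}: {length} bytes")
```

Calling `report_oversized_headers("https://www.accenture.com/ro-en/services/data-analytics-index")` should list content-security-policy if the site still sends the header shown above.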

You can check out the workaround mentioned in Scrapy issue #355: https://github.com/scrapy/scrapy/issues/355, if that helps.

Answered By: gutsytechster

The above answer did not completely work for me, but it gave me the idea.
Twisted's LineReceiver class (https://github.com/twisted/twisted/blob/trunk/src/twisted/protocols/basic.py#L406-L422) defines MAX_LENGTH = 16384, which is the maximum line length it accepts.

We just need to override this in the __init__.py file of our Scrapy spider project, like below:

from twisted.protocols.basic import LineReceiver

# Raise the maximum line length (default 16384) so the long header line fits.
LineReceiver.MAX_LENGTH = 65536
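
A slightly more defensive variant of the same override, as a sketch only: `raise_max_length` is a hypothetical helper (not part of Twisted or Scrapy) that raises the limit but never lowers it. The `try`/`except` around the import is there only so the snippet runs where Twisted is not installed; in a Scrapy project Twisted is always present. Keep in mind that this changes a class attribute, so it affects every LineReceiver subclass in the process that does not define its own MAX_LENGTH.

```python
def raise_max_length(receiver_cls, minimum):
    """Ensure receiver_cls.MAX_LENGTH is at least `minimum`; never lower it."""
    current = getattr(receiver_cls, "MAX_LENGTH", 0)
    if current < minimum:
        receiver_cls.MAX_LENGTH = minimum
    return receiver_cls.MAX_LENGTH


try:
    from twisted.protocols.basic import LineReceiver
except ImportError:
    # Twisted is not installed; nothing to patch in this environment.
    LineReceiver = None

if LineReceiver is not None:
    # 65536 is the value used in the answer above; adjust it to the header size you see.
    raise_max_length(LineReceiver, 65536)
```
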
Answered By: Manoj Bhatt