How to handle IncompleteRead: in biopython

Question

I am trying to fetch FASTA sequences for accession numbers from NCBI using Biopython. Usually the sequences download successfully, but once in a while I get the error below:

http.client.IncompleteRead: IncompleteRead(61808640 bytes read)

I have read the answers to "How to handle IncompleteRead: in python" and tried the top answer, https://stackoverflow.com/a/14442358/4037275. It works, but the problem is that it downloads only partial sequences. Is there another way? Can anyone point me in the right direction?

from Bio import Entrez
from Bio import SeqIO
Entrez.email = "my email id"


def extract_fasta_sequence(NC_accession):
    """Take an NC accession number and fetch its FASTA sequence as text."""
    print("Extracting the fasta sequence for the NC_accession:", NC_accession)
    handle = Entrez.efetch(db="nucleotide", id=NC_accession, rettype="fasta", retmode="text")
    record = handle.read()  # this is the call that occasionally raises IncompleteRead
    handle.close()
    return record

Asked By: catuf

Answer 1

You will need to add a try/except to catch common network errors like this. Note that the exception http.client.IncompleteRead (httplib.IncompleteRead in Python 2) is a subclass of the more general HTTPException; see: https://docs.python.org/3/library/http.client.html#http.client.IncompleteRead

For an example, see http://lists.open-bio.org/pipermail/biopython/2011-October/013735.html

See also https://github.com/biopython/biopython/pull/590, which would catch some of the other errors you can get with the NCBI Entrez API (errors the NCBI ought to deal with but don’t).
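
As a minimal sketch of the try/except approach (not Biopython's own mechanism), the download can be wrapped in a small retry helper. The name fetch_with_retry and its retries/delay parameters are made up for illustration. Because a retry repeats the whole request, a failed attempt never leaves you with a partial sequence.

```python
import http.client
import time


def fetch_with_retry(fetch, retries=3, delay=1.0):
    """Call fetch() until it succeeds, retrying on network errors.

    fetch is any zero-argument callable that performs the download.
    This helper is hypothetical, not part of Biopython.
    """
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch()
        except (http.client.HTTPException, OSError) as err:
            # IncompleteRead is an HTTPException; urllib's URLError is an OSError.
            last_error = err
            if attempt < retries:
                time.sleep(delay * attempt)  # wait a little longer each time
    raise RuntimeError(f"giving up after {retries} attempts") from last_error


# Possible use with the function from the question (requires Biopython):
# record = fetch_with_retry(lambda: extract_fasta_sequence(NC_accession))
```

The helper re-raises after the last attempt, so a persistent failure is still visible to the caller instead of being silently swallowed.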

Answered By: Peter Cock

Answer 2

I think the best way to solve this problem is to call the NCBI base URLs directly with the requests package. That way you can easily set a timeout for the host’s response.

Some base URLs, for example:

ESearch https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi
ESummary https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi

You can find complete information on the E-utilities guide website of NCBI.

This is convenient because some errors occur when the NCBI host does not respond, leaving you waiting a long time with no reply, whereas simply repeating the request may well get a response. So you can combine the timeout with a try/except statement to build your own data-retrieval code.
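
A rough sketch of that combination, assuming the requests package is installed: the helper name get_with_retry and its retries, delay and session parameters are invented for this illustration (session exists only so the function can be exercised without touching the network).

```python
import time

import requests


def get_with_retry(url, params, retries=3, timeout=20, delay=1.0, session=requests):
    """GET url with a timeout, retrying on timeouts and connection errors.

    Hypothetical helper; session can be the requests module, a
    requests.Session, or any object with a compatible get() method.
    """
    for attempt in range(1, retries + 1):
        try:
            response = session.get(url, params=params, timeout=timeout)
            response.raise_for_status()  # HTTP 4xx/5xx raise HTTPError, not retried
            return response.text
        except (requests.exceptions.Timeout, requests.exceptions.ConnectionError):
            if attempt == retries:
                raise  # out of attempts: let the caller see the error
            time.sleep(delay * attempt)
```

Only timeouts and connection failures are retried here; whether to also retry on specific HTTP status codes is a design choice left open.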

Example code

I have an EC number and want to use ESearch to find 50 related papers in the PubMed database from 2015 to now.

import requests
import re

ec_num = '1.1.1.6'
esearch_url = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'
payload = {'db': 'pubmed', 'term': f"{ec_num}[EC/RN Number]",
           'retmax': 50, 'sort': "pub_date", 'usehistory': "y",
           'datetype': 'pdat', 'mindate': '2015', 'maxdate': '3000'}

handle = requests.get(esearch_url, params=payload, timeout=20)  # time out after 20 s
records = handle.text

## Retrieve query_key and web_env for the next tool (e.g. ESummary, ELink, EFetch)
query_key = re.search(r'<QueryKey>(\d+)</QueryKey>', records).group(1)
web_env = re.search(r'<WebEnv>(\w+)</WebEnv>', records).group(1)

## Retrieve the number of related articles
counts = re.search(r"<Count>(\d+)</Count>", records).group(1)
#print(counts)

## Retrieve the PubMed IDs of related articles
pub_ids = re.findall(r"<Id>(\d+)</Id>", records)
#print(pub_ids)
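
As an alternative to the regular expressions, the same fields can be read with the standard library's XML parser. This is only a sketch: parse_esearch is a made-up name, and the sample response at the bottom is invented to mirror the tags used above, not real NCBI output.

```python
import xml.etree.ElementTree as ET


def parse_esearch(xml_text):
    """Extract count, query key, WebEnv and IDs from an ESearch XML reply."""
    root = ET.fromstring(xml_text)
    return {
        "count": int(root.findtext("Count")),  # top-level Count only
        "query_key": root.findtext("QueryKey"),
        "web_env": root.findtext("WebEnv"),
        "ids": [el.text for el in root.findall("IdList/Id")],
    }


# Invented sample, shaped like an ESearch reply
sample = (
    "<eSearchResult><Count>2</Count><QueryKey>1</QueryKey>"
    "<WebEnv>MCID_example</WebEnv>"
    "<IdList><Id>111</Id><Id>222</Id></IdList></eSearchResult>"
)
result = parse_esearch(sample)
```

With the code above, parse_esearch(records) would take the place of the four regular-expression lookups.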

Answered By: Rossy Clair
