Terminal error from download_bios.py: the following arguments are required: wetpaths

Question:

I’ve been attempting to download a dataset that I found on PapersWithCode, and when I run the download program I get the following error message:

usage: download_bios.py [-h] [-o OUT] [-r RETRIES] [-p N] wetpaths
download_bios.py: error: the following arguments are required: wetpaths

and I am not sure how to fix it.

I reached out to a couple of coder friends and searched the internet, but nobody seemed to know what "wetpaths" were, so I thought I would ask here.

Asked By: Willowinthewind


Answers:

The wetpaths argument of the download_bios.py script tells it which WET files to download (WET is a file type used by CommonCrawl). The source code says that it expects a

common_crawl date like 2017-43 or a path to a -wet.paths file

So you should pass a valid date as an argument (e.g. 2022-49 is the latest crawl for Nov/Dec 2022).
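To see why the error appears, here is a minimal sketch of an argparse setup that would produce the same usage line. It is reconstructed from the error message only and is not the script's actual source; the dest names out, retries and parallelism are hypothetical. The point is that wetpaths is a positional argument, so argparse exits with an error when it is missing.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical reconstruction based on the usage line in the question;
    # the real download_bios.py may define its arguments differently.
    parser = argparse.ArgumentParser(prog="download_bios.py")
    # Positional argument: required, which is why omitting it is an error.
    parser.add_argument(
        "wetpaths",
        help="common_crawl date like 2017-43 or a path to a -wet.paths file",
    )
    # Optional arguments; dest names are assumptions for illustration.
    parser.add_argument("-o", dest="out")
    parser.add_argument("-r", dest="retries")
    parser.add_argument("-p", dest="parallelism", metavar="N")
    return parser


# Supplying the crawl date as the positional argument parses cleanly.
args = build_parser().parse_args(["2022-49"])
print(args.wetpaths)  # prints 2022-49
```

Calling parse_args([]) on this parser instead would print the usage message and exit with "the following arguments are required: wetpaths", matching the error in the question.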

To understand where the WET format comes from and why it’s used, some background information is required.

Web crawls (e.g. those done by CommonCrawl) were originally stored in the Internet Archive's ARChive (ARC) format. The Web ARChive (WARC) format is a revision of ARC that can store additional secondary data such as metadata, abbreviated duplicate detection events, and later-date transformations. Since 2013, CommonCrawl has used the WARC format, which allows for more efficient storage and processing of the archives. The full details are given in the WARC specification.

One can think of WARC files as providing the raw data from the crawl process by CommonCrawl. Two additional formats are offered, namely WET and WAT:

  • The WAT file format contains the metadata about the records stored in the WARC format.
  • The WET file format contains the extracted plain text from the records stored in the WARC format.
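To connect the crawl date to these formats: CommonCrawl publishes, for each crawl, a listing of its WARC, WAT and WET files. The sketch below builds the URL of such a listing from a date like 2022-49. The URL layout is an assumption based on how CommonCrawl hosts its data at the time of writing, and the function name paths_listing_url is hypothetical; download_bios.py may resolve the date differently. The code only builds a string and does not access the network.

```python
import re

# Assumed base location of CommonCrawl data; not taken from download_bios.py.
BASE_URL = "https://data.commoncrawl.org/crawl-data"


def paths_listing_url(crawl: str, kind: str = "wet") -> str:
    """Return the assumed URL of the file listing for one crawl.

    crawl is a date such as "2022-49"; kind is "warc", "wat" or "wet".
    """
    if not re.fullmatch(r"\d{4}-\d{2}", crawl):
        raise ValueError(f"expected a crawl date like 2017-43, got {crawl!r}")
    if kind not in ("warc", "wat", "wet"):
        raise ValueError(f"unknown file kind: {kind!r}")
    return f"{BASE_URL}/CC-MAIN-{crawl}/{kind}.paths.gz"


print(paths_listing_url("2022-49"))
```

Each line of such a listing is the relative path of one file of that kind, which is presumably what the name "wetpaths" refers to.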

Answered By: Kyle F Hartzenberg