python: set file path to only point to files with a specific ending

Question:

I am trying to run a program which requires only pVCF files as inputs. Due to the size of the data, I am unable to create a separate directory containing just the files that I need.

The directory contains multiple files with ‘vcf.gz.tbi’ and ‘vcf.gz’ endings. Using the following code:

file_url = "file:///mnt/projects/samples/vcf_format/*.vcf.gz"

I tried to create a file path that only grabs the ‘.vcf.gz’ files while excluding the ‘.vcf.gz.tbi’ files, but I have been unsuccessful.

Asked By: Ava Wilson


Answers:

The code you have, as written, only assigns a string to the variable file_url; nothing in it expands the * wildcard into a list of files. For something like this, glob is popular but isn’t the only option:

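import glob, os

# glob works on local filesystem paths, so the "file://" scheme is dropped here
vcf_dir = "/mnt/projects/samples/vcf_format/"

# "*.vcf.gz" only matches names ending in .vcf.gz, so .vcf.gz.tbi files are skipped
for file in glob.glob(os.path.join(vcf_dir, "*.vcf.gz")):
    print(file)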

Note that the directory path itself doesn’t specify the kind of file you want (in this case, a gzipped VCF); the glob pattern in the for loop does that.
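
As one of those other options, here is a minimal sketch using pathlib instead of glob. The helper name list_vcf_files and the file names in the demo are made up for illustration; the demo runs in a temporary directory so it can be tried without touching the real data. It shows that the "*.vcf.gz" pattern leaves the .vcf.gz.tbi index files out without any extra filtering.

```python
import tempfile
from pathlib import Path


def list_vcf_files(directory):
    """Return sorted paths of files in `directory` whose names end in .vcf.gz."""
    # The pattern must match the whole file name, so "x.vcf.gz.tbi" is not matched.
    return sorted(Path(directory).glob("*.vcf.gz"))


# Demonstration on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.vcf.gz", "a.vcf.gz.tbi", "b.vcf.gz", "b.vcf.gz.tbi"):
        (Path(tmp) / name).touch()
    demo_names = [p.name for p in list_vcf_files(tmp)]

print(demo_names)  # ['a.vcf.gz', 'b.vcf.gz']
```

On the real data, the call would presumably be list_vcf_files("/mnt/projects/samples/vcf_format").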

Check out this answer for more options.

It took some digging but it looks like you’re trying to use the import_vcf function of Hail. To put the files in a list so that it can be passed as input:

import glob, os

import hail as hl

# Local directory path; glob cannot read a "file://" URL
vcf_dir = "/mnt/projects/samples/vcf_format"


def get_vcf_list(path):
    # Collect only the .vcf.gz files; the .vcf.gz.tbi indexes do not match the pattern
    vcf_list = []
    for file in sorted(glob.glob(os.path.join(path, "*.vcf.gz"))):
        # Put back the "file://" scheme used in the question
        vcf_list.append("file://" + file)
    return vcf_list


# Now you pass 'get_vcf_list(vcf_dir)' as your input instead of 'file_url'

mt = hl.import_vcf(get_vcf_list(vcf_dir), force_bgz=True, reference_genome="GRCh38", array_elements_required=False)
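
If the file:// form from the question is needed, a hedged alternative is to let pathlib build the URLs rather than concatenating strings. This is only a sketch: vcf_file_urls is a hypothetical helper, the demo file names are invented, and it assumes the files sit on a local filesystem that Python can list.

```python
import tempfile
from pathlib import Path


def vcf_file_urls(directory):
    """Return file:// URLs for every *.vcf.gz file directly inside `directory`."""
    base = Path(directory).resolve()  # as_uri() requires an absolute path
    return [p.as_uri() for p in sorted(base.glob("*.vcf.gz"))]


# Demonstration on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("s1.vcf.gz", "s1.vcf.gz.tbi"):
        (Path(tmp) / name).touch()
    demo_urls = vcf_file_urls(tmp)

print(demo_urls)
```

The resulting list could then be passed to hl.import_vcf in place of get_vcf_list(...).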
Answered By: R.T. Canterbury