Wrong snakemake glob_wilcards and wildcard_constraints

Question:

Within my snakemake pipeline I’m trying to retrieve the correct wildcards. I’ve looked into wildcard_constraints and this post and this post, however I can’t figure out the exact solution.

Here’s an example of file names within 2 datasets. 1 dataset contains paired mouse RNAseq read files and another dataset contains human paired RNAseq read files.

"Mus_musculus" dataset is "PRJNA362883_GSE93946_SRP097621" with file names:

"SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_RNA-Seq_1.fastq.gz"
"SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_RNA-Seq_2.fastq.gz"

"Homo_sapiens" dataset is "PRJNA362883_GSE93946_SRP097621" with file names:

"SRR7942395_GSM3406786_sAML_Control_1_Homo_sapiens_RNA-Seq_1.fastq.gz"
"SRR7942395_GSM3406786_sAML_Control_1_Homo_sapiens_RNA-Seq_2.fastq.gz"

I would like glob_wildcards to spit out the following wildcards

> ['PRJNA362883_GSE93946_SRP097621', 'PRJNA362883_GSE93946_SRP097621',
> 'PRJNA493818_GSE120639_SRP162872', 'PRJNA493818_GSE120639_SRP162872']
> ['SRR5195524_GSM2465521_KrasT_45649_NoDox',
> 'SRR5195524_GSM2465521_KrasT_45649_NoDox',
> 'SRR7942395_GSM3406786_sAML_Control_1',
> 'SRR7942395_GSM3406786_sAML_Control_1'] ['Mus_musculus', 'Mus_musculus',
> 'Homo_sapiens', 'Homo_sapiens'] ['1', '2', '1', '2']

I’ve tried the following code:

> import glob import os
> 
> DATASET,SAMPLE,SPECIES,FRR,
> =glob_wildcards(config["project_path"]+"resources/raw_datasets/{dataset}/{sample}_{species}_RNA-Seq_{frr}.fastq.gz")
> print(DATASET,SAMPLE,SPECIES,FRR)

However, I get this as output. The many underscores messes up the glob_wildcards

> ['PRJNA362883_GSE93946_SRP097621', 'PRJNA362883_GSE93946_SRP097621',
> 'PRJNA493818_GSE120639_SRP162872', 'PRJNA493818_GSE120639_SRP162872']
> ['SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus',
> 'SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus',
> 'SRR7942395_GSM3406786_sAML_Control_1_Homo',
> 'SRR7942395_GSM3406786_sAML_Control_1_Homo'] ['musculus', 'musculus',
> 'sapiens', 'sapiens'] ['1', '2', '1', '2']

I for example tried this, but the output stays the same:
wildcard_constraints: species = ‘!(sapiens)’

Could anyone suggest the correct code to get the wanted wildcards? Thanks in advance!

Asked By: manvenk

||

Answers:

The naming scheme is a mess, so I’d fix that first and ensure that you do not ask underscores to perform two functions (separate words in a given variable and separate variables).

If that is not possible, wildcard_constraints is one solution:

wildcard_constraints:
    species="Mus_musculus|Homo_sapiens"
    # can try variations with r"" or escaping the pipe

Of course this would need to be adjusted if you have more species.

Answered By: SultanOrazbayev

If suitable for your case, it would be better to prepare a sample sheet or yaml file that indicates the characteristics of each file (e.g. the species, dataset, etc.). Then use python code (e.g. using pandas) to extract the wildcard values and direct the pipeline.

In my opinion, extracting information from filenames is brittle unless the names have been created within the pipeline and you have full control over them.

Answered By: dariober