Snakefile: how to mix wildcards and variables

Question:

I want to write a rule that, for a given number of threads, converts files from one directory and format to another directory and format, in parallel. Some elements of the path are defined by Python variables and others are wildcards: phase, sample and ext should be wildcards, while stage, challenge and language should come from the Python variables. The copy operation should work file by file; I don’t want it to receive the entire list of files as input. I’m not using expand here because with expand snakemake passes the entire list of inputs as {input} and the entire list of outputs as {output} to the function, which is not what I want. Here is the Snakefile:

from glob import glob
from copy_sph_to_wav import copy_sph_to_wav

stage = "/media/catskills/interspeech22"
challenge = "openasr21"
language = "farsi"
sample_rate = 16000

# Resample WAV and SPH files to 16000 Hz WAV

rule sph_to_wav:
     input:
         "{stage}/{challenge}_{language}/{phase}/audio/{sample}.{ext}"
     output:
         "{stage}/{challenge}_{language}/{phase}/wav_{sample_rate}/{sample}.wav"
     run:
         copy_sph_to_wav({input}, {output}, sample_rate)

When I run it I get this error:

$ snakemake -c16 
Building DAG of jobs...
WildcardError in line 11 of /home/catskills/is22/interspeech22/scripts/Snakefile:
Wildcards in input files cannot be determined from output files:
'stage'

Is there a way to do this in snakemake?

UPDATE: I found a partial solution here, which is to use f-strings and double the curly braces around the wildcards so that they survive the f-string formatting.

from glob import glob
from copy_sph_to_wav import copy_sph_to_wav

STAGE = "/media/catskills/interspeech22"
CHALLENGE = "openasr21"
LANGUAGE = "farsi"
SAMPLE_RATE = 16000

# Resample WAV and SPH files to 16000 Hz WAV

rule sph_to_wav:
     input:
         f"{STAGE}/{CHALLENGE}_{LANGUAGE}/{{phase}}/audio/{{sample}}.{{ext}}"
     output:
         f"{STAGE}/{CHALLENGE}_{LANGUAGE}/{{phase}}/wav_{{SAMPLE_RATE}}/{{sample}}.wav"
     run:
         copy_sph_to_wav({input}, {output}, sample_rate)
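
As a small plain-Python sketch of what the f-string approach produces (string formatting only, not Snakemake itself): single braces are filled in from Python variables immediately, while doubled braces come out as single braces, which Snakemake then reads as wildcards. The variable names are taken from the Snakefile above.

```python
STAGE = "/media/catskills/interspeech22"
CHALLENGE = "openasr21"
LANGUAGE = "farsi"
SAMPLE_RATE = 16000

# Single braces are substituted by Python; doubled braces survive as
# literal single braces, i.e. as Snakemake wildcards.
input_pattern = f"{STAGE}/{CHALLENGE}_{LANGUAGE}/{{phase}}/audio/{{sample}}.{{ext}}"
print(input_pattern)

# Doubling the braces around SAMPLE_RATE keeps a placeholder in the string
# instead of inserting the number.
as_wildcard = f"wav_{{SAMPLE_RATE}}"
as_value = f"wav_{SAMPLE_RATE}"
print(as_wildcard, as_value)
```

If this reading is right, the {{SAMPLE_RATE}} in the output line above leaves a wildcard named SAMPLE_RATE in the pattern rather than the value 16000.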

However the wildcard is not matching the subdirectory name. I’m still getting an error, but it’s a little different:

$ snakemake -c16 
Building DAG of jobs...
WildcardError in line 11 of /home/catskills/is22/interspeech22/scripts/Snakefile:
Wildcards in input files cannot be determined from output files:
'phase'

This leads to here:

from pathlib import Path
from glob import glob
from copy_sph_to_wav import copy_sph_to_wav

STAGE = "/media/catskills/interspeech22"
CHALLENGE = "openasr21"
LANGUAGE = "farsi"
SAMPLE_RATE = 16000

# Resample WAV and SPH files to 16000 Hz WAV

rule sph_to_wav:
     input:
         str( Path(f"{STAGE}/{CHALLENGE}_{LANGUAGE}") / "{phase}/audio/{sample}.{ext}" )
     output:
         str( Path(f"{STAGE}/{CHALLENGE}_{LANGUAGE}") / "{phase}/wav_{SAMPLE_RATE}/{sample}.wav" )
     run:
         copy_sph_to_wav({input}, {output}, sample_rate)

However I’m still not done yet:

$ snakemake -c16 
Building DAG of jobs...
WorkflowError:
Target rules may not contain wildcards. Please specify concrete files or a rule without wildcards at the command line, or have a rule without wildcards at the very top of your workflow (e.g. the typical "rule all" which just collects all results you want to generate in the end).

I add rule all in an attempt to correct this issue:

from pathlib import Path
from glob import glob
from copy_sph_to_wav import copy_sph_to_wav

STAGE = "/media/catskills/interspeech22"
CHALLENGE = "openasr21"
LANGUAGE = "farsi"
SAMPLE_RATE = 16000

sph_results = [x.replace('.sph', '.wav').replace('/audio/', f'/wav_{SAMPLE_RATE}/')
               for x in glob(f"{STAGE}/{CHALLENGE}_{LANGUAGE}/*/audio/*")]

# Resample WAV and SPH files to 16000 Hz WAV

rule all:
     input:
        sph_results

rule sph_to_wav:
     input:
         str( Path(f"{STAGE}/{CHALLENGE}_{LANGUAGE}") / "{phase}/audio/{sample}.{ext}" )
     output:
         str( Path(f"{STAGE}/{CHALLENGE}_{LANGUAGE}") / "{phase}/wav_{SAMPLE_RATE}/{sample}.sph" )
     run:
         copy_sph_to_wav({input}, {output}, sample_rate)

Finally it complains that the files to be constructed don’t exist yet, before attempting to call this function which will construct them:

$ snakemake -c16 
Building DAG of jobs...
MissingInputException in line 15 of /home/catskills/is22/interspeech22/scripts/Snakefile:
Missing input files for rule all:
/media/catskills/interspeech22/openasr21_farsi/dev/wav_16000/MATERIAL_OP2-3S-BUILD_46645_20171106_064534_inLine.wav
/media/catskills/interspeech22/openasr21_farsi/eval/wav_16000/MATERIAL_OP2-3S_77793199_outLine.wav

where, to fill out the example, the function copy_sph_to_wav is:

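import subprocess
import librosa
import soundfile as sf

def copy_sph_to_wav(src, dst, sr):
    """Write src (.sph or .wav) to dst as a WAV file resampled to sr Hz."""
    cmd = '/home/catskills/is22/sph2pipe_v2.5/sph2pipe'
    if src.endswith('.wav'):
        # Already WAV: load and resample in one step
        audio, _ = librosa.load(src, sr=sr)
    else:
        # Convert SPH to WAV with sph2pipe, then load and resample the result.
        # Passing a list avoids shell quoting problems; check=True raises on failure.
        subprocess.run([cmd, '-f', 'wav', src, dst], check=True)
        audio, _ = librosa.load(dst, sr=sr)

    sf.write(dst, audio, sr)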

UPDATE 2: Which leads us here, where we fix some issues with the sph_to_wav rule generating outputs which do not match our OUTPUTS list:

from pathlib import Path
from glob import glob
from copy_sph_to_wav import copy_sph_to_wav

STAGE = "/media/catskills/interspeech22"
CHALLENGE = "openasr21"
LANGUAGE = "farsi"
SAMPLE_RATE = 16000

OUTPUTS = [x.replace('.sph', '.wav').replace('/audio/', f'/wav_{SAMPLE_RATE}/')
           for x in glob(f"{STAGE}/{CHALLENGE}_{LANGUAGE}/*/audio/*")]

# Resample WAV and SPH files to 16000 Hz WAV

rule all:
     input:
         expand("{output}", output=OUTPUTS)

rule sph_to_wav:
     input:
         '/media/catskills/interspeech22/openasr21_farsi/{phase}/audio/{sample}.{ext}'
     output:
         '/media/catskills/interspeech22/openasr21_farsi/{phase}/wav_16000/{sample}.wav'
     run:
         copy_sph_to_wav({input}, {output}, sample_rate)

However we still get an error but a much more focused one, which is:

$ snakemake -c16 -p -n  
Building DAG of jobs...
WildcardError in line 19 of /home/catskills/is22/interspeech22/scripts/Snakefile:
Wildcards in input files cannot be determined from output files:
'ext'

There may be a clue here having to do with wildcard_constraints.

UPDATE 3: This answer says that

Each wildcard in the input section shall have a corresponding wildcard
(with the same name) in the output section. That is how Snakemake
works: when the Snakemake tries to construct the DAG of jobs and finds
that it needs a certain file, it looks at the output section for each
rule and checks if this rule can produce the required file. This is
the way how Snakemake assigns certain values to the wildcard in the
output section. Every wildcard in other sections shall match one of
the wildcards in the output, and that is how the input gets concrete
filenames.

If that is true then I don’t think there is a snakemake solution because I am trying to replace the .sph with .wav and I don’t want to have to make a .sph.wav file.
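
The matching described in the quote can be sketched in plain Python. This is a simplified stand-in, not Snakemake's actual implementation: the helper pattern_to_regex is hypothetical, the paths are shortened, and each wildcard is assumed to match the regex .+ (Snakemake's documented default when no constraint is given).

```python
import re

def pattern_to_regex(pattern):
    """Turn 'a/{x}/b.{y}' into a regex with one named group per wildcard."""
    # re.split with a capturing group alternates: literal, name, literal, name, ...
    parts = re.split(r"\{(\w+)\}", pattern)
    pieces = []
    for i, part in enumerate(parts):
        if i % 2 == 0:
            pieces.append(re.escape(part))      # literal text
        else:
            pieces.append(f"(?P<{part}>.+)")    # wildcard
    return re.compile("".join(pieces))

output_pattern = "{phase}/wav_16000/{sample}.wav"
input_pattern = "{phase}/audio/{sample}.{ext}"
target = "dev/wav_16000/rec_001.wav"   # a file requested by rule all

# Step 1: the requested file is matched against the OUTPUT pattern,
# which assigns values to the wildcards.
wildcards = pattern_to_regex(output_pattern).fullmatch(target).groupdict()
print(wildcards)

# Step 2: the input pattern is filled from those values. Any wildcard
# that appears only in the input has no value to use.
missing = [name for name in re.findall(r"\{(\w+)\}", input_pattern)
           if name not in wildcards]
print(missing)
```

In this sketch ext ends up in missing, which corresponds to the WildcardError above: nothing in the requested .wav path says whether the source was .sph or .wav.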

Asked By: Lars Ericson


Answers:

Try this:

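rule all:
    input:
        # expand() needs keyword arguments; replacements is a list of path stems
        expand("{your_path}.extension", your_path=replacements)

rule make_output:
    # Wildcard names must be identical in input and output;
    # only the fixed parts of the path differ
    input: "input_dir/{name}_{num}.extension"
    output: "output_dir/{name}_{num}.extension"
    shell:
        # The shell command must be a quoted string; this assumes
        # copy_sph_to_wav is available as a command-line tool
        "copy_sph_to_wav {input} > {output}"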
    
Answered By: Mahsa Hassankashi

I finally got it. Not 100% happy about it (would rather have had .{ext} in the input but not the output), but this works and I guess it makes its own kind of sense. The issue is that my input directory can have either .sph or .wav files depending on the vagaries of the data provider, so I have to be ready for either contingency:

from pathlib import Path
from glob import glob
from copy_sph_to_wav import copy_sph_to_wav
from copy_wav_to_wav import copy_wav_to_wav

STAGE = "/media/catskills/interspeech22"
CHALLENGE = "openasr21"
LANGUAGE = "farsi"
SAMPLE_RATE = 16000

OUTPUTS = [x.replace('.sph', '.wav').replace('/audio/', f'/wav_{SAMPLE_RATE}/')
           for x in glob(f"{STAGE}/{CHALLENGE}_{LANGUAGE}/*/audio/*")]

# Resample WAV and SPH files to 16000 Hz WAV

rule all:
     input:
         expand("{output}", output=OUTPUTS)

rule sph_to_wav:
     input:
         str( Path(f"{STAGE}/{CHALLENGE}_{LANGUAGE}") / "{phase}/audio/{sample}.sph" )
     output:
         str( Path(f"{STAGE}/{CHALLENGE}_{LANGUAGE}") / f"{{phase}}/wav_{SAMPLE_RATE}/{{sample}}.wav" )
     run:
         copy_sph_to_wav(list({input})[0][0], list({output})[0][0], SAMPLE_RATE)

rule wav_to_wav:
     input:
         str( Path(f"{STAGE}/{CHALLENGE}_{LANGUAGE}") / "{phase}/audio/{sample}.wav" )
     output:
         str( Path(f"{STAGE}/{CHALLENGE}_{LANGUAGE}") / f"{{phase}}/wav_{SAMPLE_RATE}/{{sample}}.wav" )
     run:
         copy_wav_to_wav(list({input})[0][0], list({output})[0][0], SAMPLE_RATE)
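
One possible alternative, sketched here under assumptions rather than tested against this dataset, is to keep a single rule and let an input function choose whichever source file exists. Snakemake allows input to be a function of the wildcards; the helper find_audio and its parameters below are hypothetical names.

```python
from pathlib import Path

def find_audio(audio_dir, sample, exts=(".sph", ".wav")):
    """Return the first existing file audio_dir/sample<ext> (hypothetical helper)."""
    for ext in exts:
        candidate = Path(audio_dir) / f"{sample}{ext}"
        if candidate.exists():
            return str(candidate)
    raise FileNotFoundError(f"no audio file for {sample} in {audio_dir}")

# In the Snakefile this might be wired in roughly as follows (untested sketch):
#
# rule to_wav:
#     input:
#         lambda wc: find_audio(f"{STAGE}/{CHALLENGE}_{LANGUAGE}/{wc.phase}/audio", wc.sample)
#     output:
#         f"{STAGE}/{CHALLENGE}_{LANGUAGE}/{{phase}}/wav_{SAMPLE_RATE}/{{sample}}.wav"
#     run:
#         copy_sph_to_wav(input[0], output[0], SAMPLE_RATE)
```

The extension then never appears as a wildcard, so the output pattern alone determines every wildcard value. The copy_sph_to_wav function shown in the question already handles both .sph and .wav sources.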

Also I discovered, by adding debug output to my function, that {input} is a set containing a list with one element:

SRC {['/media/catskills/interspeech22/openasr21_farsi/build/audio/MATERIAL_OP2-3S-BUILD_29884_20170907_021506_outLine.sph']}
DST {['/media/catskills/interspeech22/openasr21_farsi/build/wav_16000/MATERIAL_OP2-3S-BUILD_29884_20170907_021506_outLine.wav']}
SR 16000

which I didn’t even know was possible, so I have to do this ugly conversion list({input})[0][0]; I don’t know exactly why.
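
A likely explanation, offered as an assumption rather than something confirmed here: the body of run: is ordinary Python, so {input} is not a placeholder as it would be in shell: but a set literal containing the input object. The FileList class below is a hypothetical stand-in for Snakemake's list-like input object, made hashable so that it can sit inside a set.

```python
class FileList(list):
    """Stand-in for Snakemake's input/output objects: a hashable list subclass."""
    def __hash__(self):
        return hash(tuple(self))

input_files = FileList(["build/audio/example.sph"])

# In plain Python, braces around a name build a set with that object inside.
wrapped = {input_files}
print(type(wrapped).__name__, wrapped)

# Hence the double indexing: first out of the set, then out of the list.
via_set = list(wrapped)[0][0]
direct = input_files[0]
print(via_set == direct)
```

If that holds, input[0] and output[0] inside run: would give the same paths without the conversion.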

In any event finally the consummation devoutly to be wished, for snakemake -c16:

Running hot

Answered By: Lars Ericson