How to quickly identify if a rule in Snakemake needs an input function

Question:

I’m following the Snakemake tutorial in the official documentation and got stuck on the concept of input functions: https://snakemake.readthedocs.io/en/stable/tutorial/advanced.html#step-3-input-functions

Basically they define a config.yaml as follows:

samples:
  A: data/samples/A.fastq
  B: data/samples/B.fastq

and the Snakefile as follows without any input function:

configfile: "config.yaml"

rule all:
    input:
        "plots/quals.svg"

rule bwa_map:
    input:
        "data/genome.fa",
        "data/samples/{sample}.fastq"
    output:
        "mapped_reads/{sample}.bam"
    threads: 12
    shell:
        "bwa mem -t {threads} {input} | samtools view -Sb - > {output}"

rule samtools_sort:
    input:
        "mapped_reads/{sample}.bam"
    output:
        "sorted_reads/{sample}.bam"
    shell:
        "samtools sort -T sorted_reads/{wildcards.sample} -O bam {input} > {output}"

rule samtools_index:
    input:
        "sorted_reads/{sample}.bam"
    output:
        "sorted_reads/{sample}.bam.bai"
    shell:
        "samtools index {input}"

rule bcftools_call:
    input:
        fa = "data/genome.fa",
        bam = expand("sorted_reads/{sample}.bam",sample=config['samples']),
        bai = expand("sorted_reads/{sample}.bam.bai",sample=config['samples'])
    output:
        "calls/all.vcf"
    shell:
        "bcftools mpileup -f {input.fa} {input.bam} | "
        "bcftools call -mv - > {output}"

rule plot_quals:
    input:
        "calls/all.vcf"
    output:
        "plots/quals.svg"
    script:
        "scripts/plot-quals.py"

In the tutorial they mention that this expansion happens in the initialization step:

bam = expand("sorted_reads/{sample}.bam",sample=config['samples']),
bai = expand("sorted_reads/{sample}.bam.bai",sample=config['samples'])

and that the FASTQ paths cannot be determined for rule bwa_map in this phase. However, the code works if we run it as is. Why is that?

Then they recommend using an input function to defer bwa_map to the next phase (DAG phase) as follows:

def get_bwa_map_input_fastqs(wildcards):
    return config["samples"][wildcards.sample]

rule bwa_map:
    input:
        "data/genome.fa",
        get_bwa_map_input_fastqs
    output:
        "mapped_reads/{sample}.bam"
    threads: 8
    shell:
        "bwa mem -t {threads} {input} | samtools view -Sb - > {output}"

I’m really confused about when an input function makes sense and when it does not.

Asked By: moth

||

Answers:

I’m really confused about when an input function makes sense and when it does not.

In my experience, an input function is needed only when the link between an output and its required inputs is too complex to express with a simple wildcard pattern.

For example, imagine that you are working with two colleagues who have their own style preferences. Colleague A likes to name files using CamelCase, while colleague B likes to name files using_underscores. If your output depends on their inputs, one way to create a consistent rule is to define an input function that adjusts the input file names appropriately. Rough pseudocode example:


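# Colleague A prepared the file MyResultsA.data
# Colleague B prepared the file my_results_B.dat_extension

def fix_input_name(wildcards):
   # Map each sample wildcard to the file name its author chose.
   if wildcards.specific_sample == 'A':
       return "MyResultsA.data"
   if wildcards.specific_sample == 'B':
       return "my_results_B.dat_extension"
   # Fail loudly instead of silently returning None for an unknown sample.
   raise ValueError(f"Unknown sample: {wildcards.specific_sample}")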

rule process:
   input: fix_input_name
   output: 'processed_{specific_sample}.report'
   ...

rule collect:
   input: expand(rules.process.output, specific_sample=['A', 'B'])

Note that in the example above it’s possible to just create two rules, one for A and one for B, so the input function here is only a way to make the workflow more readable.

Edit: another example is when there is arithmetic logic linking input and output files: https://stackoverflow.com/a/72810138/10693596

Edit2: another example is when the list of input files is not consistent across output/wildcards: https://stackoverflow.com/a/72856839/10693596
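
As a rough idea of what arithmetic between input and output names can look like (a hypothetical sketch, not taken from the linked answer): suppose each step_{n}.txt is built from step_{n-1}.txt. A static pattern cannot express "n minus one", but an input function can. The file names and the previous_step helper are made up for illustration, and SimpleNamespace stands in for Snakemake's wildcards object so the function can be run on its own.

```python
from types import SimpleNamespace


def previous_step(wildcards):
    # Wildcard values arrive as strings, so convert before doing arithmetic.
    n = int(wildcards.n)
    if n < 1:
        raise ValueError("step_0.txt has no predecessor")
    return f"step_{n - 1}.txt"


# Stand-in for the wildcards Snakemake would pass for the output "step_3.txt".
needed = previous_step(SimpleNamespace(n="3"))
print(needed)
```

In a Snakefile this function would be given as the input of a rule whose output is step_{n}.txt, with a separate rule (or an existing file) providing step_0.txt.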

Answered By: SultanOrazbayev

SultanOrazbayev has a good answer already. Here’s another typical example.

Often, the input and output files share the same pattern (wildcards). For example, if you want to sort a file you may do: input: {name}.txt -> output: {name}.sorted.txt.

Sometimes, however, the input files are not linked to the output by a simple pattern. An example from bioinformatics is a rule that aligns reads to a genome:

rule align:
    input:
        reads= '{name}.fastq',
        genome= 'human_genome.fa',
    output:
        bam= '{name}.bam',
    shell: ...

Here the name of the genome file is unrelated to the name of the input reads file and the name of the output bam file. The rule above works because the reference genome is a concrete filename without wildcards.

But what if the choice of reference genome depends on the input fastq file? For some input reads you may need the mouse genome and for others the human genome. An input function comes in handy:

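def get_genome(wildcards):
    # Choose the reference genome based on the sample name.
    if wildcards.name in ['bob', 'alice']:
        return 'human_genome.fa'
    if wildcards.name in ['mickey', 'jerry']:
        return 'mouse_genome.fa'
    # Fail loudly for samples with no configured genome.
    raise ValueError(f"No genome configured for {wildcards.name}")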

rule align:
    input:
        reads= '{name}.fastq',
        genome= get_genome,
    output:
        bam= '{name}.bam',
    shell: ...

Now the reference genome is mouse or human depending on the input reads.

Answered By: dariober

I’ll answer this the other way around: what does it mean if you don’t use an input function? The "default" input function could be implemented as

def default_input(wildcards):
   return INPUT_PATH.format(**wildcards)

rule:
    input: default_input
    output: OUTPUT_WITH_WILDCARDS
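
To see what that default amounts to, here is a minimal plain-Python sketch that runs outside Snakemake. The value of INPUT_PATH is borrowed from the tutorial's bwa_map rule, and a plain dict stands in for Snakemake's wildcards object (an assumption for illustration; inside a workflow Snakemake passes its own object).

```python
# Plain-Python stand-in for the "default" input function; no Snakemake needed.
# INPUT_PATH mirrors the tutorial's bwa_map rule. The dict passed in below
# stands in for Snakemake's wildcards object (assumption for illustration).
INPUT_PATH = "data/samples/{sample}.fastq"


def default_input(wildcards):
    # Substitute every wildcard value into the input pattern.
    return INPUT_PATH.format(**wildcards)


resolved = default_input({"sample": "A"})
print(resolved)
```

This is also why the tutorial's first Snakefile works without an input function: the wildcard matched in the output is simply substituted into the input pattern.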

So you should use an input function whenever you want to do something else. I also want to point out that functions like these can be used in params and resources, not just inputs. There are already some examples of formatting wildcards conditionally, so I will add a few more ideas:

  • Look up wildcard matches in a pandas dataframe.
  • Raise an exception to prevent a rule from executing. This is helpful when you have a default rule: the specialized rule raises an exception when it shouldn’t be executed (key not in config, file does not exist, etc.).
  • Recursive rule definition. You can create a binary combination function so file_{start}_{end} is created with file_{start}_{(start+end)/2} and file_{(start+end)/2}_{end}, with a base case when start == end.
  • optionally include an option as a param, e.g.
def get_params(wildcards, input):
    # Add the flag only when the rule declares a named input called "sample".
    if hasattr(input, 'sample'):
        return f'--sample {input.sample}'
    return ''
  • based on input files and wildcards, estimate possible resource usage.
  • "Super ancient": return an empty string if the output file already exists. This was useful for getting around some bugs with ancient a while back.
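
The recursive idea from the list above could be sketched roughly as follows. This is an illustration under assumptions, not a tested workflow: the second half starts at mid + 1 (rather than at the midpoint itself) so the two halves never overlap and the recursion terminates, the chunk_{start}.txt leaf file is hypothetical, and SimpleNamespace stands in for Snakemake's wildcards object.

```python
from types import SimpleNamespace


def combine_input(wildcards):
    # Wildcard values are strings; convert before computing the midpoint.
    start, end = int(wildcards.start), int(wildcards.end)
    if start == end:
        # Base case: a single-element range depends on its raw chunk
        # (hypothetical file name).
        return [f"chunk_{start}.txt"]
    mid = (start + end) // 2
    # Split into [start, mid] and [mid + 1, end] so the halves never overlap.
    return [f"file_{start}_{mid}.txt", f"file_{mid + 1}_{end}.txt"]


halves = combine_input(SimpleNamespace(start="0", end="3"))
leaf = combine_input(SimpleNamespace(start="2", end="2"))
print(halves, leaf)
```

Used as the input of a rule with output file_{start}_{end}.txt, each request would pull in two smaller ranges until single-element ranges are reached.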

A final word: when naming input functions, I try to stick to the format {rule_name}_{directive}. For example, rule make_file may have make_file_input, make_file_params, etc. This helps prevent name collisions, and I only rarely find that an input function can be reused in multiple rules.

Answered By: Troy Comi