Generic input functions for Snakemake

Question:

I am using input functions in my Snakemake rules. Most of these functions simply look up a value in a sample sheet (a pandas data frame) derived from the PEP specification. For example:

samples = pep.sample_table

def get_image(wildcards):
    return samples.loc[wildcards.sample, "image_file"]

def get_visium_fastqs(wildcards):
    return samples.loc[wildcards.sample, "visium_fastqs"]

def get_slide(wildcards):
    return samples.loc[wildcards.sample, "slide"]
 
def get_area(wildcards):
    return samples.loc[wildcards.sample, "area"]

Unfortunately, input functions can only have one parameter, wildcards, which is essentially a named list of wildcards and their values. Otherwise I could define an input function something like this …

def lookup_sample_table(wildcards, target):
    return samples.loc[wildcards.sample, target]

… and then call it in a rule as …

input:
    fq=lookup_sample_table(target="visium_fastqs")

But AFAIK this is not possible.

I tried lambda functions in my rules. For example ..

input:
    lambda wildcards: samples.loc[wildcards.sample, "slide"]

This works OK if the input items are not named. But I can’t figure out how to create named input items using lambda functions. For example, the following doesn’t work …

input:
    slide=lambda wildcards: samples.loc[wildcards.sample, "slide"]

Can I combine named inputs with lambda functions? If so, then I could extend the idea in this answer.

This is such a generic situation, I am sure that there must be a generic solution, right?

Inspired by this question I have come up with the following generic function which seems to work (so far):

def sample_lookup(pattern):
    def handle_wildcards(wildcards):
        s = pattern.format(**wildcards)
        [sample,target] = s.split(':')
        return samples.loc[sample, target]
    return handle_wildcards

This function is called as follows:

rule preproc:
    input:
        bam=sample_lookup('{sample}:sample_bam'),
        barcodes=sample_lookup('{sample}:sample_barcodes')

That is, sample_lookup() is given a "pattern" with the {sample} wildcard, followed by the name of the column in sample_table to look up.
But this function definition is quite opaque compared to the simple (if repetitive) input functions that I started with, and I feel like I’m beginning to invent my own syntax, which then makes the rules harder to read.
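
To make the pattern step less opaque, here is a minimal sketch of what happens inside handle_wildcards before the table lookup. The plain dict is a hypothetical stand-in for Snakemake's wildcards object, and "S1" is an invented sample name.

```python
# Hypothetical stand-in for Snakemake's wildcards object: a plain dict.
wildcards = {"sample": "S1"}

pattern = "{sample}:sample_bam"

# str.format fills the {sample} placeholder from the wildcard values.
expanded = pattern.format(**wildcards)

# Splitting on ':' separates the row label from the column name.
sample, target = expanded.split(":")

print(sample, target)
```

Note that a sample name containing ':' would produce more than two parts and make the unpacking fail, which is one more reason the pattern syntax feels fragile.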

What is the simplest way to reduce repetition and redundancy in this kind of input function?

Asked By: j0hn


Answers:

Not sure if I'm missing something, but this should give you what you want:

def lookup_sample_table(sample, target):
    return samples.loc[sample, target]

# Bla bla bla

input:
    fq=lambda wc: lookup_sample_table(sample=wc.sample, target="visium_fastqs")
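
If wrapping each lookup in a lambda feels noisy, functools.partial from the standard library can fix the target argument in a similar way. The sketch below is only an illustration: toy_samples is a hypothetical nested dict standing in for the pandas sample table, SimpleNamespace stands in for Snakemake's wildcards object, and none of it has been run inside a Snakemake workflow.

```python
from functools import partial
from types import SimpleNamespace

# Hypothetical stand-in for the sample table: {sample: {column: value}}.
# The real code would use samples.loc[wildcards.sample, target] instead.
toy_samples = {
    "S1": {"visium_fastqs": "fastq/S1", "slide": "V1"},
}

def lookup_sample_table(wildcards, target):
    return toy_samples[wildcards.sample][target]

# partial fixes target, leaving a callable that takes only wildcards.
get_fastqs = partial(lookup_sample_table, target="visium_fastqs")

# Hypothetical wildcards object with a single 'sample' attribute.
wc = SimpleNamespace(sample="S1")
print(get_fastqs(wc))
```

In a rule this would presumably read fq=partial(lookup_sample_table, target="visium_fastqs"), assuming Snakemake treats the partial object like any other callable input function.
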
Answered By: dariober

If anyone stumbles on this question, here is where I ended up:

# Look up field in sample_sheet
def samplesheet(field):
    def handle_wildcards(wildcards):
        return samples.loc[wildcards.sample, field]
    return handle_wildcards

This function can be called from rules as follows:

samples = pep.sample_table

rule preproc:
    input:
        bam=samplesheet('bam'),
        barcodes=samplesheet('barcodes')

So while the generic function is a little convoluted, using it is very straightforward.
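
For anyone who wants to see the closure at work outside Snakemake, here is a small self-contained sketch. It makes some assumptions: toy_samples is a hypothetical nested dict replacing the pandas sample table, the file names are invented, and SimpleNamespace stands in for the wildcards object that Snakemake would pass.

```python
from types import SimpleNamespace

# Hypothetical stand-in for the sample table: {sample: {column: value}}.
toy_samples = {
    "S1": {"bam": "S1.bam", "barcodes": "S1.tsv"},
    "S2": {"bam": "S2.bam", "barcodes": "S2.tsv"},
}

def samplesheet(field):
    # The inner function remembers 'field' from the enclosing call.
    def handle_wildcards(wildcards):
        return toy_samples[wildcards.sample][field]
    return handle_wildcards

# Each call returns a separate one-argument function.
get_bam = samplesheet("bam")
get_barcodes = samplesheet("barcodes")

# Hypothetical wildcards object with a single 'sample' attribute.
wc = SimpleNamespace(sample="S2")
print(get_bam(wc), get_barcodes(wc))
```

Each call to samplesheet() returns its own function bound to one column, so the rule only ever hands Snakemake a callable that takes wildcards.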

Answered By: j0hn