Snakemake expand on a dictionary, keeping wildcards

Question:

I have a dictionary like the following:

data = {
    "group1": ["a", "b", "c"], 
    "group2": ["x", "y", "z"]
}

I want to use expand in "rule all" to combine each key only with its own values, so that the expected output files are e.g. "group1/a.txt", "group1/b.txt", … "group2/x.txt", "group2/y.txt" …

rule all: 
    input: 
        expand("{group}/{sub_group}.txt", group = ???, sub_group = ???)

I need this for the rule "some_rule":


rule some_rule: 
    input: "single_input_file.txt"
    output: "{group}/{sub_group}.txt"
    params: 
        group=group, # how do I extract these placeholders?
        sub_group=sub_group
    script: 
        "some_script.R"

I need the group and sub_group wildcards because I have to pass them to the params of rule "some_rule".

I tried to hardcode all output files needed in the "rule all" with list comprehension, but then the placeholders are not defined in the wildcards and I cannot pass them to the params.

So I guess I need to define the "rule all" input files using expand, but here I don’t know how to get the correct files, as I need the combinations to be performed individually between "group1" and its values and "group2" and its values.

I also cannot use an input function for the rule "some_rule", as it has only one singular static input file.

In other similar questions on StackOverflow, either the combinatorial problem is absent, or the input files for "rule all" are created using plain Python, which makes me lose the wildcards.

Asked By: Klumpi


Answers:

You can use nested list comprehensions:

data = {
    "group1": ["a", "b", "c"], 
    "group2": ["x", "y", "z"]
}

files = sum(
    [[f"{key}/{value}.txt" for value in values] for key, values in data.items()],
    [],
)

print(files)
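
As a possible alternative to the sum(..., []) idiom, the same list can probably be built with a single flat comprehension. This is only a sketch using the data dictionary from the question; the name files_flat is introduced here purely for illustration.

```python
data = {
    "group1": ["a", "b", "c"],
    "group2": ["x", "y", "z"],
}

# One comprehension with two "for" clauses: each key is paired only with
# the values stored under that key, so no flattening step is needed.
files_flat = [
    f"{key}/{value}.txt"
    for key, values in data.items()
    for value in values
]

print(files_flat)
```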

If you then plan to run a program on each of the files:

for file in files:
    print(file)  # placeholder: run your script on `file` here
Answered By: ProfDFrancis

You can use this:

rule some_rule: 
    input: "single_input_file.txt"
    output: "{group}/{sub_group}.txt"
    script: 
        "some_script.R"

and access the values of the wildcards {group} and {sub_group} inside the R
script with e.g. snakemake@wildcards[['group']] (not tested, but I think it
should work).

Alternatively, I think you could have:

params:
    group='{group}',
    sub_group='{sub_group}'
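
Either way, the wildcards should be available no matter how the target list in "rule all" was built, since Snakemake fills them in by matching each requested file name against the output pattern. A rough pure-Python illustration of that idea follows. The regular expression is a hand-written stand-in for the pattern "{group}/{sub_group}.txt", not Snakemake's actual matching code, and the names pattern, targets and matched are made up for this sketch.

```python
import re

# Hypothetical stand-in for the output pattern "{group}/{sub_group}.txt":
# each wildcard becomes a named regex group.
pattern = re.compile(r"(?P<group>.+)/(?P<sub_group>.+)\.txt")

# Targets written out by hand, e.g. as produced by a list comprehension.
targets = ["group1/a.txt", "group2/z.txt"]

# Matching a target against the pattern recovers the wildcard values.
matched = {t: pattern.fullmatch(t).groupdict() for t in targets}

print(matched)
```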
Answered By: dariober

Answer based on your comment.

import pandas as pd

data = {
    "group1": ["a", "b", "c"],
    "group2": ["x", "y", "z"]
}

df = pd.DataFrame([(k, v) for k, vs in data.items() for v in vs],
                  columns=['Group', 'Value'])

rule all:
    input:
        expand("{group}/{sub_group}.txt", zip, group=df['Group'], sub_group=df['Value'])

rule some_rule:
    output: "{group}/{sub_group}.txt"
    params:
        group='{group}',
        sub_group='{sub_group}'
    shell:
        """
        echo {params.group} {params.sub_group} > {output}
        """
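
If pandas is not otherwise needed, the two parallel lists that zip expects can presumably be built with plain Python as well. A minimal sketch, assuming the same data dictionary: pairs, groups, sub_groups and targets are names introduced here, and the last comprehension only mimics what the zip-based expand call would produce rather than calling Snakemake.

```python
data = {
    "group1": ["a", "b", "c"],
    "group2": ["x", "y", "z"],
}

# Repeat each key once per value so both lists line up element by element.
pairs = [(group, sub) for group, subs in data.items() for sub in subs]
groups = [group for group, _ in pairs]
sub_groups = [sub for _, sub in pairs]

# Stand-in for:
# expand("{group}/{sub_group}.txt", zip, group=groups, sub_group=sub_groups)
targets = [f"{g}/{s}.txt" for g, s in zip(groups, sub_groups)]

print(targets)
```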
Answered By: Giang Le

I found a solution for my problem using a custom combinator function.

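import itertools

def pairwise_product(*args):
    # Pair each group with its own list of sub-groups (zip), then
    # expand that list into one (group, sub_group) tuple per value.
    result = []
    for group, sub_group in zip(*args):
        sub_group = ([sub_group[0]], sub_group[1])
        for sub_sub_group in itertools.product(*sub_group):
            result.append((group, sub_sub_group))
    return result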

Looking at the source code for snakemake’s expand function, I realized that I can use my own combinator function.

pairwise_product expects two lists of tuples as input, where each tuple contains the wildcard name and the wildcard value (for sub_group, the value is the whole list of sub-groups belonging to that group), e.g.

wildcard1 = [("group", "group1"), ("group", "group2")]
wildcard2 = [("sub_group", ["a", "b", "c"]), ("sub_group", ["x", "y", "z"])]
pairwise_product(wildcard1, wildcard2)

The output of this function call would be:

[(('group', 'group1'), ('sub_group', 'a')),
 (('group', 'group1'), ('sub_group', 'b')),
 (('group', 'group1'), ('sub_group', 'c')),
 (('group', 'group2'), ('sub_group', 'x')),
 (('group', 'group2'), ('sub_group', 'y')),
 (('group', 'group2'), ('sub_group', 'z'))]

The corresponding expand call and its output would be:

expand("{group}/{sub_group}.txt", pairwise_product, group=data.keys(), sub_group=data.values())

['group1/a.txt',
 'group1/b.txt',
 'group1/c.txt',
 'group2/x.txt',
 'group2/y.txt',
 'group2/z.txt']

With this solution I also get the wildcards I want, i.e. the individual elements in the list-values for each dictionary key separately.

Note that this function was designed for exactly two wildcards in the format of the data dictionary shown above; it has not been tested with other formats.
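
To make the mechanism easier to follow without running Snakemake, here is a rough sketch of how I understand the combinator to be used. simple_expand is a hypothetical, simplified stand-in written for this answer, not Snakemake's real expand: it assumes every keyword argument is an iterable, turns it into a list of (name, value) tuples, hands those lists to the combinator, and formats one path per combination.

```python
import itertools

def pairwise_product(*args):
    # Same combinator as above, repeated so the sketch is self-contained.
    result = []
    for group, sub_group in zip(*args):
        sub_group = ([sub_group[0]], sub_group[1])
        for sub_sub_group in itertools.product(*sub_group):
            result.append((group, sub_sub_group))
    return result

def simple_expand(pattern, combinator, **wildcards):
    # Hypothetical stand-in for expand (assumption, not the library code):
    # one list of (name, value) tuples per wildcard keyword.
    flattened = [
        [(name, value) for value in values]
        for name, values in wildcards.items()
    ]
    # The combinator decides which tuples belong together; each
    # combination is turned into a dict and used to fill the pattern.
    return [pattern.format(**dict(comb)) for comb in combinator(*flattened)]

data = {
    "group1": ["a", "b", "c"],
    "group2": ["x", "y", "z"],
}

paths = simple_expand(
    "{group}/{sub_group}.txt",
    pairwise_product,
    group=data.keys(),
    sub_group=data.values(),
)
print(paths)
```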

Answered By: Klumpi