Snakemake expand on a dictionary, keeping wildcards
Question:
I have a dictionary like the following:
data = {
"group1": ["a", "b", "c"],
"group2": ["x", "y", "z"]
}
I want to use expand to get all combinations between each key and its own values in "rule all", such that the expected output files are e.g. "group1/a.txt", "group1/b.txt", … "group2/x.txt", "group2/y.txt" …
rule all:
    input:
        expand("{group}/{sub_group}.txt", group = ???, sub_group = ???)
I need this for the rule "some_rule":
rule some_rule:
    input: "single_input_file.txt"
    output: "{group}/{sub_group}.txt"
    params:
        group=group,  # how do I extract these placeholders?
        sub_group=sub_group
    script:
        "some_script.R"
The reason I need the group and sub_group wildcards is that I need to pass them to the params of rule "some_rule".
I tried to hardcode all output files needed in "rule all" with a list comprehension, but then the placeholders are not defined in the wildcards and I cannot pass them to the params.
So I guess I need to define the "rule all" input files using expand, but I don't know how to get the correct files, as the combinations need to be formed separately between "group1" and its values and between "group2" and its values.
I also cannot use an input function for the rule "some_rule", as it has only a single static input file.
In other similar questions on StackOverflow, either the combinatorial problem is absent, or the input files for "rule all" are created with plain Python, which makes me lose the wildcards.
Answers:
You can use nested list comprehensions:
data = {
    "group1": ["a", "b", "c"],
    "group2": ["x", "y", "z"]
}
# Build one list of paths per key, then flatten the list of lists with sum(..., [])
files = sum(
    [[f"{key}/{value}.txt" for value in values] for key, values in data.items()],
    []
)
print(files)
I think you are planning to then run a program on each of the files? If so:
for file in files:
    ...  # run script on `file`
You can use this:
rule some_rule:
    input: "single_input_file.txt"
    output: "{group}/{sub_group}.txt"
    script:
        "some_script.R"
and access the values of the wildcards {group} and {sub_group} inside the R script with e.g. snakemake@wildcards[['group']] (not tested, but I think it should work).
Alternatively, I think you could have:
params:
    group='{group}',
    sub_group='{sub_group}'
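A side note on why the wildcards should still be available when the targets are built with plain Python: as far as I understand, Snakemake works out wildcard values by matching each requested file against a rule's output pattern, however the target list was produced. The sketch below imitates that matching with a regular expression. pattern_to_regex and infer_wildcards are hypothetical helpers written for this illustration, not Snakemake functions, and the real matching has more rules (wildcard constraints, several outputs per rule).

```python
import re


def pattern_to_regex(pattern):
    """Turn a pattern like "{group}/{sub_group}.txt" into a compiled regex.

    Hypothetical helper: every {name} becomes a named group matching one or
    more characters; all other text is matched literally.
    """
    # re.split with a capturing group alternates literal text and wildcard names
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            regex += re.escape(part)      # literal text
        else:
            regex += f"(?P<{part}>.+)"    # wildcard name
    return re.compile(regex)


def infer_wildcards(pattern, target):
    """Return a dict of wildcard values, or None if the target does not match."""
    match = pattern_to_regex(pattern).fullmatch(target)
    return match.groupdict() if match else None


data = {
    "group1": ["a", "b", "c"],
    "group2": ["x", "y", "z"],
}
# Targets built with plain Python, as in the list-comprehension answer
targets = [f"{group}/{sub}.txt" for group, subs in data.items() for sub in subs]
for target in targets:
    print(target, infer_wildcards("{group}/{sub_group}.txt", target))
```

Each target still yields a group and a sub_group value, which is what the params or the R script would see.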
Answer based on your comment.
import pandas as pd

data = {
    "group1": ["a", "b", "c"],
    "group2": ["x", "y", "z"]
}

# One row per (group, value) pair
df = pd.DataFrame([(k, v) for k, vs in data.items() for v in vs],
                  columns=['Group', 'Value'])

rule all:
    input:
        expand("{group}/{sub_group}.txt", zip, group=df['Group'], sub_group=df['Value'])

rule some_rule:
    output: "{group}/{sub_group}.txt"
    params:
        group='{group}',
        sub_group='{sub_group}'
    shell:
        """
        echo {params.group} {params.sub_group} > {output}
        """
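The DataFrame above is only used to get two parallel columns, so the same pairing can probably be done without pandas. A minimal sketch in plain Python, assuming the data dictionary from the question; the names pairs, groups and sub_groups are my own. The last comprehension only mimics what expand(..., zip, ...) is expected to return, so the snippet runs without Snakemake.

```python
data = {
    "group1": ["a", "b", "c"],
    "group2": ["x", "y", "z"],
}

# Flatten the dictionary into (group, sub_group) pairs, one per output file
pairs = [(group, sub_group) for group, subs in data.items() for sub_group in subs]

# Two parallel lists of equal length, suitable for a zip-style expansion
groups = [group for group, _ in pairs]
sub_groups = [sub_group for _, sub_group in pairs]

# Stand-in for the zip expansion: pair the i-th group with the i-th sub_group
files = [f"{group}/{sub_group}.txt" for group, sub_group in zip(groups, sub_groups)]
print(files)
```

In the Snakefile, the two lists would then be passed as expand("{group}/{sub_group}.txt", zip, group=groups, sub_group=sub_groups).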
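I found a solution for my problem using a custom combinator function.
import itertools

def pairwise_product(*args):
    result = []
    # Walk through the keys and their value lists in parallel
    for group, sub_group in zip(*args):
        # ("sub_group", ["a", "b", "c"]) -> (["sub_group"], ["a", "b", "c"])
        sub_group = ([sub_group[0]], sub_group[1])
        # Pair the wildcard name with each single value of this key
        for sub_sub_group in itertools.product(*sub_group):
            result.append((group, sub_sub_group))
    return result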
Looking at the source code of Snakemake's expand function, I realized that I can pass my own combinator function. pairwise_product expects as input two lists of tuples, where each tuple contains the wildcard name and the wildcard value, e.g.
wildcard1 = [("group", "group1"), ("group", "group2")]
wildcard2 = [("sub_group", ["a", "b", "c"]), ("sub_group", ["x", "y", "z"])]
pairwise_product(wildcard1, wildcard2)
The output of this function call would be:
[(('group', 'group1'), ('sub_group', 'a')),
(('group', 'group1'), ('sub_group', 'b')),
(('group', 'group1'), ('sub_group', 'c')),
(('group', 'group2'), ('sub_group', 'x')),
(('group', 'group2'), ('sub_group', 'y')),
(('group', 'group2'), ('sub_group', 'z'))]
And the output of the expand function would be:
expand("{group}/{sub_group}.txt", pairwise_product, group=data.keys(), sub_group=data.values())
['group1/a.txt',
'group1/b.txt',
'group1/c.txt',
'group2/x.txt',
'group2/y.txt',
'group2/z.txt']
With this solution I also get the wildcards I want, i.e. the individual elements in the list-values for each dictionary key separately.
Note that this function has been designed for only two wildcards in the format shown above in the data dictionary and has not been tested for other formats.
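To try the combinator outside a workflow, here is a small self-contained sketch. mini_expand is a hypothetical, heavily simplified stand-in for expand that I wrote for this illustration: it assumes, as described above, that each keyword argument is turned into a list of (name, value) tuples before the combinator is called. It is not Snakemake's implementation and ignores everything else expand does.

```python
import itertools


def pairwise_product(*args):
    """Combine each group only with the values from its own list."""
    result = []
    for group, sub_group in zip(*args):
        # ("sub_group", ["a", "b", "c"]) -> (["sub_group"], ["a", "b", "c"])
        sub_group = ([sub_group[0]], sub_group[1])
        for sub_sub_group in itertools.product(*sub_group):
            result.append((group, sub_sub_group))
    return result


def mini_expand(pattern, combinator, **wildcards):
    """Simplified stand-in for expand, for illustration only (assumption)."""
    # One list of (name, value) tuples per wildcard
    flattened = [
        [(name, value) for value in values] for name, values in wildcards.items()
    ]
    # Each combination is a tuple of (name, value) pairs; dict() makes it
    # usable as keyword arguments for str.format
    return [pattern.format(**dict(comb)) for comb in combinator(*flattened)]


data = {
    "group1": ["a", "b", "c"],
    "group2": ["x", "y", "z"],
}
files = mini_expand(
    "{group}/{sub_group}.txt",
    pairwise_product,
    group=data.keys(),
    sub_group=data.values(),
)
print(files)
```

Passing itertools.product as the combinator to the same stand-in gives the usual all-against-all behaviour, which makes the difference between the two combinators easy to compare.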