Snakemake | Creating an aggregate without specifying a list in expand
Question:
My directory structure looks like this:
path/
    parameter_combination_1/
        time_average.property1.csv
        time_average.property2.csv
        ...
    parameter_combination_2/
        time_average.property1.csv
        time_average.property2.csv
        ...
    ...
I would like to create a rule which aggregates the information of all files carrying the time_average
name, for the wildcards {filename} (e.g. property1.csv
) and {path}.
Hence, the input files for the example wildcard would be:
path/parameter_combination_1/time_average.property1.csv
path/parameter_combination_2/time_average.property1.csv
path/parameter_combination_3/time_average.property1.csv
- …
I know that with expand
I can cover parameter combinations. This requires me to specify the parameters to be covered; e.g., I could write a rule with a fixed list parameter_combinations
as follows (similar to this section in the Snakemake tutorial):
rule aggregate_time_averages:
    input:
        expand("{path}/{parameter_combination}/time_average.{filename}", parameter_combination=['parameter_combination_1', 'parameter_combination_2'])
    output:
        "{path}/aggregate.{filename}"
Is there a way to glob / to collect all parameter_combination_*
folders without having to specify a fixed list?
What would be the best practice in this case?
I also read about the glob_wildcards
function here.
I would expect something like this to work:
PROJECT_PATHS, PARAMS_COMBS, SEEDS = glob_wildcards("{path}/{parameter_combination}/{seed}/config.yaml")

rule aggregate_time_averages:
    input:
        expand("{path}/{parameter_combination}/time_average.{filename}", parameter_combination=PARAMS_COMBS)
    output:
        "{path}/aggregate.{filename}"
With the command:
snakemake --cores 1 test_project/aggregate.order_parameter.csv --use-conda
I then get the error No values given for wildcard 'path'.
(so the remaining wildcards apparently cannot be processed by expand; maybe I should not be using expand
here at all?).
Also, glob_wildcards
in the global Snakemake scope gives me ALL wildcards; what I want, however, is just the values of {parameter_combination}
that match the {path}
/ {filename}
combination for which the rule is called (so I would expect the globbing to take place in the rule itself).
Thank you for your help 🙂
Answers:
Sure, you can use an input function for your rule, which evaluates glob_wildcards
based on the wildcard values given to the rule:
def input_timefiles(wildcards):
    param_combs = glob_wildcards(f"{wildcards.path}/{{param_comb}}/time_average.{wildcards.filename}").param_comb
    return expand("{path}/{param_comb}/time_average.{filename}",
                  path=wildcards.path, param_comb=param_combs, filename=wildcards.filename)

rule aggregate_time_averages:
    input:
        input_timefiles
    output:
        "{path}/aggregate.{filename}"
Note that, due to the default behaviour of expand(..),
this will produce the combinatorial product of all {path}, {param_comb}, {filename} values,
and not all of these combinations necessarily exist. If not all combinations exist, another solution could be to use pathlib.Path.rglob(..)
instead to determine the input files.
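To see the product behaviour concretely, here is a small plain-Python sketch of what expand(..) does with multiple value lists (expand_sketch is an illustrative stand-in, not Snakemake's actual implementation):

```python
from itertools import product

def expand_sketch(pattern, **lists):
    # Substitute every combination of the given value lists into the pattern,
    # mimicking the default (product) behaviour of snakemake's expand()
    keys = list(lists)
    return [pattern.format(**dict(zip(keys, combo)))
            for combo in product(*(lists[k] for k in keys))]

paths = expand_sketch("{path}/{param_comb}/time_average.{filename}",
                      path=["run"],
                      param_comb=["parameter_combination_1", "parameter_combination_2"],
                      filename=["property1.csv"])
print(paths)
# → ['run/parameter_combination_1/time_average.property1.csv',
#    'run/parameter_combination_2/time_average.property1.csv']
```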
If the files are created by an earlier rule and don’t exist before the workflow is executed, you might want to look into checkpoint
rules. See this SO answer for details.
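For reference, the checkpoint pattern mentioned above looks roughly like this (a sketch following the Snakemake docs on data-dependent conditional execution; the rule name simulate and the shell step are invented placeholders):

```python
checkpoint simulate:
    output:
        directory("{path}")
    shell:
        "..."  # hypothetical step that creates the parameter_combination_* subfolders

def input_timefiles(wildcards):
    # Re-evaluate the glob only after the checkpoint has produced the folders
    checkpoint_output = checkpoints.simulate.get(path=wildcards.path).output[0]
    param_combs = glob_wildcards(
        f"{checkpoint_output}/{{param_comb}}/time_average.{wildcards.filename}"
    ).param_comb
    return expand("{path}/{param_comb}/time_average.{filename}",
                  path=wildcards.path, param_comb=param_combs,
                  filename=wildcards.filename)
```

The checkpoints.&lt;name&gt;.get(..) call forces the DAG to be re-evaluated after the checkpoint completes, so the glob sees the freshly created folders.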
@euronion pointed me to input functions, which was the missing ingredient, thank you very much!
I now understand that I do not need checkpoints: each subfolder should have a specified file which must go into the aggregate, so these files must be created if not present, and the number of files is fully determined before the DAG is built. (Maybe there is also a good/better solution with checkpoints, though.)
In the end I used:
import os

def get_subdirectories(path):
    # All immediate subdirectories, excluding Snakemake's .snakemake metadata folder
    return [f.path for f in os.scandir(path)
            if f.is_dir() and ".snakemake" not in str(f)]

def collect_wildcard_files_from_subdirs(wildcards):
    print("Found wildcards: ", wildcards)
    subpaths = get_subdirectories(wildcards.path)
    print("Found subpaths: ", subpaths)
    paths = expand("{subpath}/{file}", subpath=subpaths, file=wildcards.file)
    print("Required input paths: ", paths)
    if not paths:
        raise RuntimeError("ERROR: No paths found!")
    return paths
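As a quick sanity check outside Snakemake, the get_subdirectories helper can be exercised on a throwaway directory tree (a minimal sketch; the folder names just mimic the layout from the question):

```python
import os
import tempfile

def get_subdirectories(path):
    # Immediate subdirectories, skipping Snakemake's .snakemake metadata folder
    return [f.path for f in os.scandir(path)
            if f.is_dir() and ".snakemake" not in str(f)]

# Build a temporary tree resembling the question's directory structure
with tempfile.TemporaryDirectory() as root:
    for name in ("parameter_combination_1", "parameter_combination_2", ".snakemake"):
        os.mkdir(os.path.join(root, name))
    subdirs = sorted(os.path.basename(p) for p in get_subdirectories(root))
    print(subdirs)  # → ['parameter_combination_1', 'parameter_combination_2']
```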