Snakemake | Creating an aggregate without specifying a list in expand

Question:

My directory structure looks like this:

-- path
   -- parameter_combination_1
      - time_average.property1.csv
      - time_average.property2.csv
      - ...
   -- parameter_combination_2
      - time_average.property1.csv
      - time_average.property2.csv
      - ...
   -- ...

I would like to create a rule that aggregates the information from all files carrying the time_average name, for given values of the wildcards {path} and {filename} (e.g. property1.csv).

Hence, the input files for the example wildcard would be:

  • path/parameter_combination_1/time_average.property1.csv
  • path/parameter_combination_2/time_average.property1.csv
  • path/parameter_combination_3/time_average.property1.csv

I know that expand can cover the parameter combinations, but it requires me to specify them explicitly. For example, I could write a rule with a fixed list of parameter combinations as follows (similar to this section in the Snakemake tutorial):

rule aggregate_time_averages:
    input:
        expand("{path}/{parameter_combination}/time_average.{filename}", parameter_combination=['parameter_combination_1', 'parameter_combination_2'])
    output:
        "{path}/aggregate.{filename}"

Is there a way to glob / to collect all parameter_combination_* folders without having to specify a fixed list?
What would be the best practice in this case?

I also read about the glob_wildcards function here.

I would expect something like this to work:


PROJECT_PATHS, PARAMS_COMBS, SEEDS = glob_wildcards("{path}/{parameter_combination}/{seed}/config.yaml")

rule aggregate_time_averages:
    input:
        expand("{path}/{parameter_combination}/time_avg.{filename}", parameter_combination=PARAMS_COMBS)
    output:
        "{path}/aggregate.{filename}"

With the command:

snakemake --cores 1 test_project/aggregate.order_parameter.csv --use-conda

I then get the error "No values given for wildcard 'path'." I guess expand cannot handle the wildcards it has no values for, so maybe I should not be using expand here at all?

Also, glob_wildcards in the global Snakemake scope gives me ALL wildcard values. What I want is only the {parameter_combination} values that match the {path} / {filename} combination for which the rule is called, so I would expect the globbing to take place in the rule itself.

Thank you for your help 🙂

Asked By: zanzu

||

Answers:

Sure, you can use an input function for your rule that calls glob_wildcards with the wildcard values passed to the rule:

def input_timefiles(wildcards):
    param_combs = glob_wildcards(f"{wildcards.path}/{{param_comb}}/time_average.{wildcards.filename}").param_comb

    return expand("{path}/{param_comb}/time_average.{filename}", path=wildcards.path, param_comb=param_combs, filename=wildcards.filename)

rule aggregate_time_averages:
    input:
        input_timefiles
    output:
        "{path}/aggregate.{filename}"
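
The doubled braces in the f-string are what keep {param_comb} as a wildcard for glob_wildcards while path and filename are filled in. A minimal sketch in plain Python, assuming a hypothetical stand-in for the wildcards object (SimpleNamespace is used here only for illustration; Snakemake passes its own object):

```python
from types import SimpleNamespace

# Hypothetical stand-in for the wildcards object that Snakemake passes
# to an input function; the values mirror the command in the question.
wildcards = SimpleNamespace(path="test_project", filename="order_parameter.csv")

# Single braces are substituted by the f-string; doubled braces are
# emitted as literal braces, leaving {param_comb} for glob_wildcards.
pattern = f"{wildcards.path}/{{param_comb}}/time_average.{wildcards.filename}"
print(pattern)  # test_project/{param_comb}/time_average.order_parameter.csv
```

glob_wildcards then only has the single wildcard param_comb left to match against the file system.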

Note that by default expand(..) produces the Cartesian product of all values passed for {path}, {param_comb} and {filename}, which may include combinations that don't exist on disk. If not all combinations exist, another solution could be to use pathlib.Path.rglob(..) instead to determine the input files.
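
The pathlib alternative could look roughly like this. It is a minimal sketch rather than part of the original answer: the function name input_timefiles_pathlib is hypothetical, and it uses Path.glob with a one-level */ pattern instead of rglob, assuming the files sit exactly one directory below {path} as in the question.

```python
from pathlib import Path


def input_timefiles_pathlib(wildcards):
    """Collect existing time_average files one level below wildcards.path."""
    # Match <path>/<any subdirectory>/time_average.<filename>. Only files
    # that actually exist on disk are returned, so expand() is not needed.
    matches = Path(wildcards.path).glob(f"*/time_average.{wildcards.filename}")
    # Sort for a reproducible input order and convert to plain strings.
    return sorted(str(p) for p in matches)
```

It would be passed to the rule's input in the same way as input_timefiles above. Replacing glob with rglob would also pick up matching files in deeper subdirectories.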

If the files are created by an earlier rule and don’t exist before the workflow is executed, you might want to look into checkpoint rules. See this SO answer for details.

Answered By: euronion

@euronion pointed me to input functions, which was the missing ingredient, thank you very much!

I now understand that I do not need checkpoints: each subfolder must contain a specific file that goes into the aggregate, so these files are created if they are not present, and the number of input files is fully determined before the DAG is built. (Maybe there is also a good or better solution with checkpoints, though.)

In the end I used:

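import os


def get_subdirectories(path):
    # Return all direct subdirectories of `path`, skipping Snakemake's own .snakemake folder.
    return [f.path for f in os.scandir(path) if f.is_dir() and ".snakemake" not in f.name]


def collect_wildcard_files_from_subdirs(wildcards):
    # Input function: build one input path per subdirectory of wildcards.path.
    print("Found wildcards: ", wildcards)

    subpaths = get_subdirectories(wildcards.path)
    print("Found subpaths: ", subpaths)

    # expand is provided by Snakemake inside a Snakefile; the rule using this
    # function must define the wildcards {path} and {file} in its output.
    paths = expand("{subpath}/{file}", subpath=subpaths, file=wildcards.file)
    print("Required input paths: ", paths)

    if not paths:
        raise RuntimeError("ERROR: No paths found!")
    return paths

Answered By: zanzu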