How to rename files with snakemake with a dictionary in config file?

Question:

I currently have the issue of renaming some files with snakemake based on pattern matching, with the help of a dictionary in the config file. The input wildcard no longer matches the output wildcard afterwards. The data follow this structure:

.
├── pool1
│   ├── name_A.txt
│   ├── name_B.txt
│   ├── name_C.txt
│   └── name_D.txt
└── pool2
    ├── name_E.txt
    ├── name_F.txt
    ├── name_G.txt
    └── name_H.txt

I want to rename them based on a sub-pattern of the filename. In this case the capital letters should be replaced by numbers, while different pools can "encode" the same numbers.

.
├── pool1
│   ├── name_1.txt
│   ├── name_2.txt
│   ├── name_3.txt
│   └── name_4.txt
└── pool2
    ├── name_2.txt
    ├── name_3.txt
    ├── name_5.txt
    └── name_6.txt

The replacements per pool are stored in the config file which looks like this:

pools: ['pool1','pool2']

c2n : [{'A':'1',
        'B':'2',
        'C':'3',
        'D':'4'},
       {'E':'2',
        'F':'3',
        'G':'5',
        'H':'6'}]

Unfortunately rule all from snakemake does not find the renamed output files. The lists were created with nested for loops prior to rule all, based on the config file.

rename_in=['pool1/name_A','pool1/name_B','pool1/name_C','pool1/name_D','pool2/name_E','pool2/name_F','pool2/name_G','pool2/name_H']
rename_out=['pool1/name_1','pool1/name_2','pool1/name_3','pool1/name_4','pool2/name_2','pool2/name_3','pool2/name_5','pool2/name_6']

rule all:
    input:
        # rename.smk
        expand("{pattern}.txt", pattern=rename_out)
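For reference, the lists above can be built from the config with nested loops, for example like this (config shown as a plain dict for illustration; in the Snakefile it would come from the config file):

```python
# Sketch: deriving rename_in/rename_out from the config values.
# `config` is mocked as a plain dict here.
config = {
    'pools': ['pool1', 'pool2'],
    'c2n': [{'A': '1', 'B': '2', 'C': '3', 'D': '4'},
            {'E': '2', 'F': '3', 'G': '5', 'H': '6'}],
}

rename_in = []
rename_out = []
# pair each pool with its letter-to-number dictionary
for pool, mapping in zip(config['pools'], config['c2n']):
    for letter, number in mapping.items():
        rename_in.append(f"{pool}/name_{letter}")
        rename_out.append(f"{pool}/name_{number}")
```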
1. So far I tried to use a for loop to create multiple rules while iterating over the lists:
for l, n in zip(rename_in, rename_out):
    rule:
        input:
            f"{l}.txt"
        output:
            f"{n}.txt"
        shell:
            "mv {input} {output}"

2. I also tried to encode a single pool (pool1) in the config file and later make rules for all pools:
rule rename:
    input:
        "pool1/name_{l}.txt"
    output:
        "pool1/name_{config[c2n][l]}.txt"
    shell:
        "mv {input} {output}"
3. In my third attempt, I wrote my own Python wrapper with a subprocess calling the mv command, but rule all still does not recognize the output correctly.

Is there a smart and easy way to rename files with Snakemake? In an optimal world, this would happen dynamically based on the pools, but at this point I am just fine to make it work somehow.
So far I have been trying to get around checkpoints since it seemed like a simple problem to me in the beginning.
I have found some questions similar to mine, but none of them changed the wildcard in input and output.
Thanks in advance (:

EDIT:
I accepted Wayne’s answer since it is a perfectly working solution (and much more!) for the issue I described, based on what I provided. For anyone having the task to rename files generated beforehand, use this.
My problem originated from a bioinformatic tool, demultiplexing IsoSeq data. One input file generates a number of known outputs which have to be renamed afterwards. This can be done via the config file. The solution was to integrate the renaming step directly in the process generating the known output patterns. This also solves the issue of only being able to rename already generated files, for which Wayne provided a perfect answer.


In a more useful context with direct integration in the workflow:
I iterate over the list of pools (config['pools']) in the config file and, via the index, simultaneously draw the corresponding sample-to-barcode dictionary (config['name2bc'][idx]), which stores the known name-to-barcode mappings, similar to c2n in my original example. (In my original example I wanted to turn letters into numbers; here it is the other way around, which is why I iterate over the values to get the barcodes for bcs.) The outputs of the demultiplexing run are required by renaming rules, created per pool iteration.
The requested final output of each renaming step per pool is used as the only input for rule all. config["name2pools"] maps the actual names to pools so they can be combined in downstream analyses, but it could be used at this step too.

rule all:
    input:
        [f"output/lima/{pool}/{pool}.demux.hifi.{name}.renamed.bam" 
            for name, pools in config["name2pools"].items()
            for pool in pools],

for idx, pool in enumerate(config['pools']):
    n2bc = config['name2bc'][idx]
    bcs = list(n2bc.values())
    rule:
    #rule lima_per_pool:
        """
        Demultiplex the pools and remove primers + barcodes
        """
        input:
            hifi=f"{pool}.hifi.bam",
            biosamples=f"biosamp_{pool}.csv",
            barcodes="barcodes_uniprimers.fasta"
        output: 
            expand(f"output/lima/{pool}/{pool}.demux.hifi.{{bc}}.bam", bc=bcs)
        shell:
            "lima --flags"

    #rename_per_pool
    for name, barcodes in n2bc.items():
        #rule rename_per_barcode
        rule:
            input:
                f"output/lima/{pool}/{pool}.demux.hifi.{barcodes}.bam"
            output:
                f"output/lima/{pool}/{pool}.demux.hifi.{name}.renamed.bam"
            shell:
                "ln -sf $(readlink -f {input}) {output}"


A very similar solution to the renaming step is in the comments.
In case of unknown output files/filenames, using checkpoints is advisable, as Wayne also states.

Asked By: Laron


Answers:

In regards to the rule all

You say, "The lists were created with nested for loops prior to rule all based on the config file." Using the example code just below that, you simply need to change the input to rule all to be the list rename_out.

rule all:
    input:
        rename_out

Regarding the main rule doing the renaming

I think you were on the right track with your zip idea. I embedded it into the rule.
Here’s an example that works given what you supplied, putting it all together:

from shutil import move
import os
import glob

rename_in=['pool1/name_A','pool1/name_B','pool1/name_C','pool1/name_D','pool2/name_E','pool2/name_F','pool2/name_G','pool2/name_H']
rename_out=['pool1/name_1','pool1/name_2','pool1/name_3','pool1/name_4','pool2/name_2','pool2/name_3','pool2/name_5','pool2/name_6']


rule all:
    input:
        rename_out
        
        
rule rename_files:
    input: 
        # only the original files that are still on disk
        [path for path in rename_in if os.path.exists(path)]
    output: 
        # the corresponding new names for just those files
        [rename_out[indx] for indx,path in enumerate(rename_in) if os.path.exists(path)]
    run:
        for f_in,f_out in zip(rename_in, rename_out):
            if f_in in glob.glob(f"{f_in.split('/')[0]}/*"):
                move(f_in, f_out)

Note that in order for snakemake to handle minor changes in the state of the files later, the entire lists aren’t used as input and output for the rule doing the heavy lifting. Using the entire lists as input and output for the primary rule works when starting from square one, but if you then revert one file to its original name, it breaks the workflow: snakemake decides the entire output list is outdated and wipes all the involved files out in preparation to make them anew. (In fact, it seems it will even wipe out the directory those files are in if that directory becomes completely empty.)
By including only the files that are yet to be renamed as input, and the corresponding ones as output, in the rename_files rule, you can restore the original names of one or a few files, re-run the workflow, and snakemake will only rename those files and leave the ones renamed in the first round intact. One of the things that makes snakemake great is that it tracks everything that has to be made in the workflow, and you don’t want to lose that ability; otherwise you could just use Python directly.


Building this into something more like an actual case …

The code above works as a Snakemake file if you have run whatever other process created the input files separately already and you want to rename them. But what if you want to build the renaming step inside a workflow where you’d have the files to be renamed as part of the rules in the run?

Using an input function to glue together the renaming with upstream and downstream steps

The simple version above, which acts as a separate workflow on files already made, works for the described case; however, the point of using Snakemake is usually modularity, so a step can be easily plugged into a larger workflow. So I wanted to make at least one example of a more typical "real-world" case where the renaming is one step among others, acting on files made by upstream processing. The approach used above wouldn’t integrate easily as a new rule, because the way Snakemake works by default is to evaluate all the rules backwards, chaining from the final result that needs to be made back to the input; it then assesses what input and output already exist and determines what needs to be made. The way I wrote that approach above, it couldn’t chain things backwards at the outset, because it wouldn’t necessarily know what is going to need to be renamed: those input files don’t exist on the drive yet, and the rule checks in real time what already exists. Using checkpoints, there is a way to have Snakemake re-evaluate during the run, so the approach of using Python to check which files exist can work if a checkpoint is put in the step before, letting my rule check for input to handle once the checkpoint finishes. However, that turns out to be more complex and unnecessary here. It is still doable, and I’ll post what I have below. A better way is to use Snakemake’s syntax and features to properly build the chain from the correspondences between the input files to be renamed and what their output names should be. Even though the names are arbitrarily related, it is possible to construct the rules to chain properly. I will largely rely on this example, with some extras added in for making the files in the directories from this example, plus some other special snakemake tricks.
It is quite a bit more advanced though, as it requires some snakemake knowledge, so I think it works well coming after the simple example above. Also be aware that it is best to call things what you want from the start; ideally, for tools with pre-defined outputs, you’d build in the renaming step at the time the files are made, as pointed out in this comment here. To make this more realistic, each time it runs it happens to leave out one of the original input and output files. This shows that, if written right, it can be robust enough not to need an exact a-priori set of target files.

import os
import random

rename_in=['pool1/name_A','pool1/name_B','pool1/name_C','pool1/name_D','pool2/name_E','pool2/name_F','pool2/name_G','pool2/name_H']
rename_out=['pool1/name_1','pool1/name_2','pool1/name_3','pool1/name_4','pool2/name_2','pool2/name_3','pool2/name_5','pool2/name_6']
rename_key_dict = {k:v for k,v in zip(rename_out, rename_in)} # OUT back up to IN!

# Make the 'triggering'-input files
#-------------------------------------------------------------------------------#
# The idea is that this will be more like a typical workflow where renaming is
# downstream of some files being made.
# To highlight that Snakemake can still determine exactly all the input and
# output for each step in the workflow, even when the exact set of output to
# be made isn't known when the workflow is initiated, provided the rules are
# written in the right way, this workflow will leave out a single, arbitrary
# member of the files the original workflow was set up to make. The important
# distinction is that checkpoints will not be used either.
# This also has the ADDED BONUS that it sets things up to allow re-runs to 
# potentially do something different without me needing to do anything to the 
# files. Makes for more robust tests to see if it eventually makes them all.
rename_in_without_one = rename_in.copy()
left_out_this_round = rename_in_without_one.pop(
    random.randrange(len(rename_in_without_one)))
# based on https://stackoverflow.com/a/10048122/8508004;
# `left_out_this_round` will be useful for also removing the corresponding one 
# from `rename_out`
rename_out_without_one = rename_out.copy()
left_out_of_out_this_round  = (
    rename_out_without_one.pop(rename_in.index(left_out_this_round)))
# Now make the specific files that will be triggers for all downstream steps. 
# These will be made by Python directly. The rules will then use those as the 
# starting input to  kick off running the workflow.
initial_trigger_file_names = [f"make_{pn.split('/')[0]}_letter_{pn.rsplit('_')[1]}" for pn in rename_in_without_one] 
# example result of above line: 'make_pool1_letter_A'
for initial_trigger_file_name in initial_trigger_file_names:
    with open(initial_trigger_file_name, 'w') as output:
        output.write(initial_trigger_file_name) # put that in there just to put 
        # something in the file produced; won't be used because name of file and 
        # directory it is triggering to make is the only thing that matters


#*------------------------------Input function--------------------------------*#
def map_out_back_to_in(wildcards):
    '''
    Input function to relate output back to input using wildcards so that they 
    can be evaluated by Snakemake to keep the 1:1 correspondences during rule
    evaluation
    '''
    return (rename_key_dict[wildcards.renamedFilePath])
#*------------------------End of Input function section-----------------------*#

rule all:
    input:
        expand("{renamedFilePath}", renamedFilePath = rename_out_without_one)


rule make_files_in_subdirs:
    input: "make_{pool}_letter_{letter}"
    output: "{pool}/name_{letter}"
    shell: """ 
        mkdir -p {wildcards.pool}
        echo hello world > {output} 
    """

# Renaming rule based on https://stackoverflow.com/q/73163437/8508004, but
# not quite as easy as there: because the wildcard had no flanking
# constraints, other strings snakemake was combining out of the wildcards,
# such as `make_pool1_letter_1`, were being passed in to `map_out_back_to_in`
# in the `wildcards` argument, giving
# `Missing input files for rule rename_files:`, so I added constraints.
rule rename_files:
    input: 
        map_out_back_to_in
    output: 
        "{renamedFilePath}"
    wildcard_constraints:
        renamedFilePath=r"pool\d+/name_\d+"
    shell:
        "mv {input} {output}"

The key feature is that it uses a Snakemake input function to keep the correspondences. Notably, because it uses the wildcards that relate to later steps, Snakemake can chain things back to see how the starting input and the desired final output connect through the rule chain.
Quick aside about that input function:
Because the input function maintaining the correspondences is very direct and simple, it could be written as a lambda defined right in the rule; however, this is already complex enough, so I left that out.
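For illustration, here is that equivalence in plain Python, with the rule's wildcards object mocked via types.SimpleNamespace (in the actual Snakefile the lambda would go directly in the input: section of rename_files):

```python
from types import SimpleNamespace

# small slice of the lists from the example above
rename_in = ['pool1/name_A', 'pool1/name_B']
rename_out = ['pool1/name_1', 'pool1/name_2']
rename_key_dict = {k: v for k, v in zip(rename_out, rename_in)}  # OUT -> IN

# the named input function from the Snakefile ...
def map_out_back_to_in(wildcards):
    return rename_key_dict[wildcards.renamedFilePath]

# ... and the equivalent inline lambda that could replace it in the rule
inline = lambda wildcards: rename_key_dict[wildcards.renamedFilePath]

wc = SimpleNamespace(renamedFilePath='pool1/name_2')  # mock wildcards object
assert map_out_back_to_in(wc) == inline(wc) == 'pool1/name_B'
```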

Note that this workflow can be run again immediately after running it once because, to illustrate that it is dynamic, it left out one set of input and output files the first time. Usually one or two more runs and it will have made all the files in rename_out, because a different one gets dropped each time and the rules set up making it.

You’ll note that the ‘rename_files’ rule has wildcard constraints. These were necessary to guide Snakemake away from trying all the wildcard combinations and specifying non-existent output files for renaming that had no correspondence to the previous steps in the workflow. A great resource for more about wildcards: snakemake for doing bioinformatics – using wildcards to generalize your rules.
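As a plain-Python illustration of what that constraint filters out, the intended regex is pool\d+/name_\d+ (note the backslashes); it admits the renamed targets but rejects the spurious strings Snakemake would otherwise try:

```python
import re

# the wildcard constraint used for renamedFilePath in the rename_files rule
pattern = re.compile(r"pool\d+/name_\d+")

assert pattern.fullmatch("pool1/name_1")              # a real rename target
assert not pattern.fullmatch("make_pool1_letter_1")   # spurious combination
assert not pattern.fullmatch("pool1/name_A")          # pre-rename file
```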

Checkpoint version in a ‘complete’ workflow

The checkpoint-based version uses the simple version from above as an approach and adapts the approach from here. It also has a set-up like the one above where, going in, it isn’t apparent which file is not going to be made the first time the workflow is run. This is to highlight the dynamic nature of it, which here takes advantage of a checkpoint to make the simple approach work (sort of).

from shutil import move
import os
import glob
import random

rename_in=['pool1/name_A','pool1/name_B','pool1/name_C','pool1/name_D','pool2/name_E','pool2/name_F','pool2/name_G','pool2/name_H']
rename_out=['pool1/name_1','pool1/name_2','pool1/name_3','pool1/name_4','pool2/name_2','pool2/name_3','pool2/name_5','pool2/name_6']

# Make the 'triggering'-input files
#-------------------------------------------------------------------------------#
rename_in_without_one = rename_in.copy()
left_out_this_round = (
    rename_in_without_one.pop(random.randrange(len(rename_in_without_one))))# 
    # based on https://stackoverflow.com/a/10048122/8508004; 
rename_out_without_one = rename_out.copy()
left_out_of_out_this_round  = (
    rename_out_without_one.pop(rename_in.index(left_out_this_round)))
files_indicating_renaming_events = [x+"_has_been_renamed" for x in rename_in_without_one]
initial_trigger_file_names = [f"make_{pn.split('/')[0]}_letter_{pn.rsplit('_')[1]}" for pn in rename_in_without_one] 
# example result of above line: 'make_pool1_letter_A'
for initial_trigger_file_name in initial_trigger_file_names:
    with open(initial_trigger_file_name, 'w') as output:
        output.write(initial_trigger_file_name)

rule all:
    input:
        rename_in_without_one,
        "renamed_step_completed_indicator.txt"

checkpoint make_files_in_subdirs:
    input: "make_{pool}_letter_{letter}"
    output: "{pool}/name_{letter}"
    shell: """ 
        mkdir -p {wildcards.pool}
        echo hello world > {output} 
    """

   
rule rename_files:
    input: 
        [path for path in rename_in if os.path.exists(path)]
    output: 
        "renamed_step_completed_indicator.txt"
    run:
        # This actually leaves the original files in place because at the time
        # I didn't see a way to easily have snakemake follow the arbitrary
        # correspondences back from the `input` of `rule all`. This StackOverflow post
        # https://stackoverflow.com/questions/73163437/snakemake-input-and-output-according-to-a-dictionary#comment129269254_73180937 
        # suggests leaving the originals is a valid approach because the DAG
        # is problematic otherwise.
        for f_in,f_out in zip(rename_in, rename_out):
            if f_in in glob.glob(f"{f_in.split('/')[0]}/*"):
               move(f_in, f_out)
               with open(f_in, 'w') as outputhandler:
                   outputhandler.write(f"content moved to '{f_out}'.")
        with open("renamed_step_completed_indicator.txt", 'w') as outputhandler:
            outputhandler.write("renamed files generated.") 
        # clean up the trigger files
        [os.remove(x) for x in glob.glob("make_pool*_letter_*")]

This example is less than ideal but does work. The main thing is that, in order to have it work, I put the outputs of the checkpoint as input to rule all, alongside the renaming-step indicator file. This seemed to make it run the entire workflow; without specifying both, it doesn’t link things together and decides nothing needs to be done. The files from before the ‘renaming’ are still there, but as shells whose contents were moved; I leave a note in each ‘shell’ saying where its contents went. Leaving these ‘shells’ of the files before they are ‘renamed’ is less than ideal, but they need to be there as written, or updating any of the files near the end causes Snakemake to make all the upstream files again.

Because it leaves one out of the first round, this workflow can be run again to ultimately make all the rename_out files. I found the second run usually errors out with a suggestion to increase the wait time with .... Interestingly, the third run usually completes just fine. So there may be that glitch, too.

I suspect that, taking what I learned from the input function and the use of wildcards in this complex situation, it may be possible to further edit this checkpoint example so it doesn’t leave the ‘shells’ of the files to be renamed. But I’m leaving that exercise to maybe be revisited someday if needed.


The ‘Regarding the main rule doing the renaming’ section takes great advantage of Snakemake being a superset of Python, putting plain Python code to work inside the rename_files rule.



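To put the code used in rename_files in context, here is the same move-if-present logic as a standalone Python sketch, run against a throwaway directory (names assumed as in the example above):

```python
import glob
import os
import tempfile
from shutil import move

rename_in = ['pool1/name_A', 'pool1/name_B']
rename_out = ['pool1/name_1', 'pool1/name_2']

# set up a scratch directory with only one of the two originals present
workdir = tempfile.mkdtemp()
os.chdir(workdir)
os.makedirs('pool1')
open('pool1/name_A', 'w').close()

for f_in, f_out in zip(rename_in, rename_out):
    # same check as in the rule: only move files actually on disk
    if f_in in glob.glob(f"{f_in.split('/')[0]}/*"):
        move(f_in, f_out)

print(sorted(os.listdir('pool1')))  # -> ['name_1']
```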
Answered By: Wayne