Symlink (auto-generated) directories via Snakemake
Question:
I am trying to create a symlink-directory structure for aliasing output directories in a Snakemake workflow.
Let’s consider the following example:
A long time ago in a galaxy far, far away, somebody wanted to find the best ice cream flavour in the universe and conducted a survey. Our example workflow aims at representing the votes by a directory structure. The survey was conducted in English (because that’s what they all speak in that foreign galaxy), but the results should be understood by non-English speakers as well. Symbolic links come to the rescue.
To make the input parsable for us humans as well as for Snakemake, we put it into a YAML file:
cat config.yaml
flavours:
  chocolate:
    - vader
    - luke
    - han
  vanilla:
    - yoda
    - leia
  berry:
    - windu
translations:
  french:
    chocolat: chocolate
    vanille: vanilla
    baie: berry
  german:
    schokolade: chocolate
    vanille: vanilla
    beere: berry
To create the corresponding directory tree, I started with this simple Snakefile:
### Setup ###
configfile: "config.yaml"
### Targets ###
votes = ["english/" + flavour + "/" + voter
         for flavour, voters in config["flavours"].items()
         for voter in voters]
translations = {language + "_translation/" + translation
                for language, translations in config["translations"].items()
                for translation in translations.keys()}
### Commands ###
create_file_cmd = "touch '{output}'"
relative_symlink_cmd = "ln --symbolic --relative '{input}' '{output}'"
### Rules ###
rule all:
    input: votes, translations
rule english:
    output: "english/{flavour}/{voter}"
    shell: create_file_cmd
rule translation:
    input: lambda wc: "english/" + config["translations"][wc.lang][wc.trans]
    output: "{lang}_translation/{trans}"
    shell: relative_symlink_cmd
I am sure there are more ‘pythonic’ ways to achieve what I wanted, but this is just a quick example to illustrate my problem.
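To make the requested targets concrete, the two comprehensions can be evaluated in plain Python. This is only a sketch: the `config` dict below is an inlined stand-in for the parsed config.yaml (in the Snakefile it is filled by `configfile:`), and nothing here involves Snakemake itself.

```python
# Stand-in for the parsed config.yaml (inlined for illustration only).
config = {
    "flavours": {
        "chocolate": ["vader", "luke", "han"],
        "vanilla": ["yoda", "leia"],
        "berry": ["windu"],
    },
    "translations": {
        "french": {"chocolat": "chocolate", "vanille": "vanilla", "baie": "berry"},
        "german": {"schokolade": "chocolate", "vanille": "vanilla", "beere": "berry"},
    },
}

# Same comprehensions as in the Snakefile: one file target per vote ...
votes = ["english/" + flavour + "/" + voter
         for flavour, voters in config["flavours"].items()
         for voter in voters]
# ... and one symlink target per (language, translated flavour) pair.
translations = {language + "_translation/" + translation
                for language, translations in config["translations"].items()
                for translation in translations.keys()}

print(votes)
print(sorted(translations))
```

So `rule all` asks for six files below english/ and six paths below the two *_translation directories.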
Running the above workflow with snakemake, I get the following error:
Building DAG of jobs...
MissingInputException in line 33 of /tmp/snakemake.test/Snakefile
Missing input files for rule translation:
english/vanilla
So while Snakemake is clever enough to create the english/<flavour> directories when attempting to make an english/<flavour>/<voter> file, it seems to ‘forget’ about the existence of this directory when using it as an input to make a <language>_translation/<flavour> symlink.
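One way to read the error is that english/vanilla is not the declared output of any rule, so nothing is scheduled to produce it. The sketch below is a rough model of that matching, not Snakemake's actual code: it assumes every wildcard matches the default regex `.+`, and the helper `matches` is hypothetical.

```python
import re

def matches(pattern: str, path: str) -> bool:
    """Hypothetical helper: replace each '{name}' wildcard by a '.+' group
    and test whether the whole path matches the resulting regex."""
    regex = ""
    pos = 0
    for m in re.finditer(r"\{(\w+)\}", pattern):
        regex += re.escape(pattern[pos:m.start()]) + f"(?P<{m.group(1)}>.+)"
        pos = m.end()
    regex += re.escape(pattern[pos:])
    return re.fullmatch(regex, path) is not None

english_output = "english/{flavour}/{voter}"
print(matches(english_output, "english/vanilla/yoda"))  # a file the rule declares
print(matches(english_output, "english/vanilla"))       # the directory itself
```

In this model only the file path matches the output pattern of rule english; the directory is merely a side effect of creating the file.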
As an intermediate step, I applied the following patch to the Snakefile:
27c27
< input: votes, translations
---
> input: votes#, translations
Now, the workflow ran through and created the english directory as expected (snakemake -q output only):
Job counts:
count jobs
1 all
6 english
7
Now with the target directories created, I went back to the initial version of the Snakefile and re-ran it:
Job counts:
count jobs
1 all
6 translation
7
ImproperOutputException in line 33 of /tmp/snakemake.test/Snakefile
Outputs of incorrect type (directories when expecting files or vice versa). Output directories must be flagged with directory(). for rule translation:
french_translation/chocolat
Exiting because a job execution failed. Look above for error message
While I am not sure whether a symlink to a directory qualifies as a directory, I went ahead and applied a new patch to follow the suggestion:
35c35
< output: "{lang}_translation/{trans}"
---
> output: directory("{lang}_translation/{trans}")
With that, snakemake finally created the symlinks:
Job counts:
count jobs
1 all
6 translation
7
As a confirmation, here is the resulting directory structure:
english
├── berry
│   └── windu
├── chocolate
│   ├── han
│   ├── luke
│   └── vader
└── vanilla
    ├── leia
    └── yoda
french_translation
├── baie -> ../english/berry
├── chocolat -> ../english/chocolate
└── vanille -> ../english/vanilla
german_translation
├── beere -> ../english/berry
├── schokolade -> ../english/chocolate
└── vanille -> ../english/vanilla
9 directories, 6 files
However, besides not being able to create this structure without running snakemake twice (and modifying the targets in between), even simply re-running the workflow results in an error:
Building DAG of jobs...
ChildIOException:
File/directory is a child to another output:
/tmp/snakemake.test/english/berry
/tmp/snakemake.test/english/berry/windu
Before the update to the newer Snakemake version, re-running would instead run the translation rules again for no (good) reason:
Job counts:
count jobs
1 all
5 translation
6
So my question is: How can I implement the above logic in a working Snakefile?
Note that I am not looking for advice to change the data representation in the YAML file and/or the Snakefile. This is just an example to highlight (and isolate) an issue I encountered in a more complex scenario.
Sadly, while I could not figure this out by myself so far, I managed to get a working GNU make version (even though the ‘YAML parsing’ is hackish at best):
### Setup ###
configfile := config.yaml
### Targets ###
votes := $(shell awk '
NR == 1 { next }
/^[^ ]/ { exit }
NF == 1 { sub(":", "", $$1); dir = "english/" $$1 "/"; next }
{ print dir $$2 }
' '$(configfile)')
translations := $(shell awk '
NR == 1 { next }
/^[^ ]/ { trans = 1; next }
! trans { next }
{ sub(":", "", $$1) }
NF == 1 { dir = $$1 "_translation/"; next }
{ print dir $$1 }
' '$(configfile)')
### Commands ###
create_file_cmd = touch '$@'
create_dir_cmd = mkdir --parent '$@'
relative_symlink_cmd = ln --symbolic --relative '$<' '$@'
### Rules ###
all : $(votes) $(translations)
$(sort $(dir $(votes) $(translations))) : % :
	$(create_dir_cmd)
$(foreach vote, $(votes), $(eval $(vote) : | $(dir $(vote))))
$(votes) : % :
	$(create_file_cmd)
translation_targets := $(shell awk '
NR == 1 { next }
/^[^ ]/ { trans = 1; next }
! trans { next }
NF != 1 { print "english/" $$2 "/"}
' '$(configfile)')
define translation
$(word $(1), $(translations)) : $(word $(1), $(translation_targets)) | $(dir $(word $(1), $(translations)))
	$$(relative_symlink_cmd)
endef
$(foreach i, $(shell seq 1 $(words $(translations))), $(eval $(call translation, $(i))))
Running make on this works just fine:
mkdir --parent 'english/chocolate/'
touch 'english/chocolate/vader'
touch 'english/chocolate/luke'
touch 'english/chocolate/han'
mkdir --parent 'english/vanilla/'
touch 'english/vanilla/yoda'
touch 'english/vanilla/leia'
mkdir --parent 'english/berry/'
touch 'english/berry/windu'
mkdir --parent 'french_translation/'
ln --symbolic --relative 'english/chocolate/' 'french_translation/chocolat'
ln --symbolic --relative 'english/vanilla/' 'french_translation/vanille'
ln --symbolic --relative 'english/berry/' 'french_translation/baie'
mkdir --parent 'german_translation/'
ln --symbolic --relative 'english/chocolate/' 'german_translation/schokolade'
ln --symbolic --relative 'english/vanilla/' 'german_translation/vanille'
ln --symbolic --relative 'english/berry/' 'german_translation/beere'
The resulting tree is identical to the one shown above.
Also, running make again works as well:
make: Nothing to be done for 'all'.
So I really hope the solution is not to go back to old-fashioned GNU make with all the unreadable hacks I internalized over the years but that there is a way to convince Snakemake as well to do what I spelled out to do. 😉
Just in case it is relevant: This was tested using Snakemake version 5.32.2 (originally 5.7.1).
edits:
- Fixed GNU make warning as per @MadScientist’s comment.
- Since the general feedback so far indicates that this is not possible with Snakemake, I cross-posted this as a feature request over on Snakemake’s GitHub (before the bounty expires).
- Simplified relative_symlink_cmd as per @Nick’s comment.
- Updated post to reflect behaviour of Snakemake v. 5.32.2.
Answers:
Here is a way to solve your first question (i.e. have Snakemake run only once to get all desired outputs). I use the output files of rule english as input to rule translation, and modify the latter rule’s shell command to reflect that. In my experience, using directories as input doesn’t work well with Snakemake, and if I remember correctly, the directory() flag is ignored in input.
Relevant code changes:
# The trailing backslashes join the lines, so the shell sees a single ln command.
relative_symlink_cmd = """ln -s \
    "$(realpath --relative-to="$(dirname '{output}')" "$(dirname {input[0]})")" \
    '{output}'"""
rule translation:
    input: lambda wc: ["english/" + config["translations"][wc.lang][wc.trans] + "/" + voter for voter in config['flavours'][config["translations"][wc.lang][wc.trans]]]
    output: directory("{lang}_translation/{trans}")
    shell: relative_symlink_cmd
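For readers who want to check what link target the shell command above ends up with, roughly the same value can be computed in Python with os.path.relpath. This is a sketch under assumptions: the function name `link_target` is made up, the example paths are taken from the tree in the question, and unlike realpath, relpath works purely on the path strings and does not resolve existing symlinks.

```python
import os.path

def link_target(first_input: str, output: str) -> str:
    """Relative path from the symlink's parent directory to the directory
    that contains the first input file."""
    source_dir = os.path.dirname(first_input)   # e.g. english/chocolate
    link_parent = os.path.dirname(output)       # e.g. french_translation
    return os.path.relpath(source_dir, start=link_parent)

print(link_target("english/chocolate/vader", "french_translation/chocolat"))
```

On a POSIX system this yields the same ../english/<flavour> targets that appear in the directory listing of the question.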
Your second question is tricky because when you run Snakemake again, it resolves the symlinks to their corresponding source directories, which leads to the ChildIOException error. This can be verified by changing relative_symlink_cmd to create real directories instead of symlinks, as shown below. In that case, Snakemake works as expected.
relative_symlink_cmd = """mkdir -p '{output}'"""
I’m not sure how to get around that.
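The path relationship behind that explanation can be reproduced outside Snakemake. The sketch below builds a tiny copy of the tree in a temporary directory (POSIX symlinks assumed) and shows that the resolved symlink is the parent directory of a file that another rule declares as output. It only illustrates the hypothesis stated above; it does not show how Snakemake performs the check internally.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Resolve the temp dir itself first, in case it lives behind a symlink.
    root = os.path.realpath(root)

    # english/berry/windu, as created by rule english.
    source = os.path.join(root, "english", "berry")
    os.makedirs(source)
    child = os.path.join(source, "windu")
    open(child, "w").close()

    # french_translation/baie -> ../english/berry, as created by rule translation.
    link_dir = os.path.join(root, "french_translation")
    os.makedirs(link_dir)
    link = os.path.join(link_dir, "baie")
    os.symlink(os.path.join("..", "english", "berry"), link)

    # Once resolved, the symlink output is the same path as english/berry ...
    resolved = os.path.realpath(link)
    print(resolved == source)
    # ... which is the parent of the output english/berry/windu.
    print(os.path.commonpath([resolved, child]) == resolved)
```

Both checks come out true here, which matches the pair of paths listed in the ChildIOException message.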
I wanted to test with a newer version of Snakemake (5.20.1), and I came up with something similar to the answer proposed by Manalavan Gajapathy:
### Setup ###
configfile: "config.yaml"
VOTERS = list({voter for flavour in config["flavours"].keys() for voter in config["flavours"][flavour]})
### Targets ###
votes = ["english/" + flavour + "/" + voter
         for flavour, voters in config["flavours"].items()
         for voter in voters]
translations = {language + "_translation/" + translation
                for language, translations in config["translations"].items()
                for translation in translations.keys()}
### Commands ###
create_file_cmd = "touch '{output}'"
relative_symlink_cmd = "ln --symbolic --relative $(dirname '{input}') '{output}'"
### Rules ###
rule all:
    input: votes, translations
rule english:
    output: "english/{flavour}/{voter}"
    # To avoid considering ".done" as a voter
    wildcard_constraints:
        voter="|".join(VOTERS),
    shell: create_file_cmd
def get_voters(wildcards):
    return [f"english/{wildcards.flavour}/{voter}" for voter in config["flavours"][wildcards.flavour]]
rule flavour:
    input: get_voters
    output: "english/{flavour}/.done"
    shell: create_file_cmd
rule translation:
    input: lambda wc: "english/" + config["translations"][wc.lang][wc.trans] + "/.done"
    output: directory("{lang}_translation/{trans}")
    shell: relative_symlink_cmd
This runs and creates the desired output, but fails with ChildIOException when re-run (even if there would be nothing more to be done).
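The effect of the wildcard_constraints entry in rule english can be checked in plain Python. This is a sketch assuming the constraint has to match the complete wildcard value (hence fullmatch); the voter list is copied from the config in the question.

```python
import re

voters = ["vader", "luke", "han", "yoda", "leia", "windu"]
# Same construction as in the Snakefile: an alternation of all known voters.
constraint = "|".join(voters)

print(re.fullmatch(constraint, "yoda") is not None)   # a real voter
print(re.fullmatch(constraint, ".done") is not None)  # the marker file of rule flavour
```

With the constraint in place, english/<flavour>/.done can only be produced by rule flavour. If voter names could contain regex metacharacters, wrapping each one in re.escape before joining would be the safer variant.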