Create a Workflow

Last updated on 2023-03-20 | Edit this page

Overview

Questions

  • How do you create a Snakemake Workflow?
  • How do you run a Snakemake Workflow?

Objectives

  • Create a workflow with a single rule
  • Run a snakemake workflow
  • Understand snakemake output
  • Reduce filename duplication using wildcards
  • Represent non-file inputs using a param

Snakemake Workflow


A Snakemake workflow is a list of rules.

Commonly used parts of a rule are:

  • name - unique name for a rule
  • input - input filenames used by a command
  • params - non-file input values used by a command
  • output - output filenames created by a command
  • container - singularity container to run command within
  • shell - command to run

Rule pattern:

rule <name>:
    input: ...
    params: ...
    output: ...
    container: ...
    shell: ...

Create a Snakemake Workflow


For our first step we will reduce the size of the multimedia.csv input file saving the result as reduce/multimedia.csv. This CSV file contains a header line followed by data lines. So to save the first 10 data lines we will need the first 11 lines in total. This can be accomplished with the shell head command passing the --n 11 argument. Run the following command in your terminal to see the first 11 lines printed out.

BASH

head -n 11 multimedia.csv

To save this output we will change the command like so:

head -n 11 multimedia.csv > reduce/multimedia.csv

Create a text file named Snakefile with the following contents:

rule reduce:
    input: "multimedia.csv"
    output: "reduce/multimedia.csv"
    shell: "head -n 11 multimedia.csv > reduce/multimedia.csv"

This code creates a Snakemake rule named reduce with a input file multimedia.csv, a output file reduce/multimedia.csv, and the shell command from above. Snakemake will automatically create the reduce directory for us before running the shell command. A common pattern is to output files in a directory named after the rule. This will help you keep track of which rule created a specific file.

Run the Workflow


BASH

snakemake

Expected Error Output:

OUTPUT

Error: you need to specify the maximum number of CPU cores
to be used at the same time. If you want to use N cores,
say --cores N or -cN. For all cores on your system
(be sure that this is appropriate) use --cores all.
For no parallelization use --cores 1 or -c1....

The snakemake command has a single required argument that specifies how manu CPU cores to use when running a workflow. Until we want to scale up our workflow we are fine using 1 core.

Run snakemake specifying 1 core:

BASH

snakemake -c1

Expected Output:

OUTPUT

Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job       count    min threads    max threads
------  -------  -------------  -------------
reduce        1              1              1
total         1              1              1
Select jobs to execute...
[Mon Feb 27 11:28:29 2023]
rule reduce:
    input: multimedia.csv
    output: reduce/multimedia.csv
    jobid: 0
    reason: Missing output files: reduce/multimedia.csv
    resources: tmpdir=/tmp
[Mon Feb 27 11:28:29 2023]
Finished job 0.
1 of 1 steps (100%) done
Complete log: .snakemake/log/2023-02-27T112816.505955.snakemake.log

Notice the reason logging Missing output files: reduce/multimedia.csv. This tells you why snakemake is running this rule. In this case it is because the output file reduce/multimedia.csv is missing.

Try running the workflow again:

BASH

snakemake -c1

Expected Output:

Building DAG of jobs...
Nothing to be done (all requested files are present and up to date).
Complete log: .snakemake/log/2023-02-27T113325.906186.snakemake.log

Reduce duplication using wildcards


Within the shell command the input and output files can be referenced using wildcards. This will reduce duplication simplifing filename changes.

Change the shell line in the Snakefile as follows:

rule reduce:
    input: "multimedia.csv"
    output: "reduce/multimedia.csv"
    shell: "head -n 11 {input} > {output}"

Represent non-file inputs using a param


Currently the reduce rule grabs the top 11 rows. Since the meaning of this number may not be obvious and we are likely to change this number in the future a better option is to use this non-file input as a param.

Add a new params setting to the rule and use the params.rows wildcard in the shell as follows:

rule reduce:
  input: "multimedia.csv"
  params: rows="11"
  output: "reduce/multimedia.csv"
  shell: "head -n {params.rows} {input} > {output}"

Notice how we assigned a name “rows” to the parameter. Assigning a name like this can also be done for the input and output filenames as well.

Try running the workflow again:

BASH

snakemake -c1

Notice the reason has changed

OUTPUT

...
    reason: Code has changed since last execution; Params have changed since last execution
...