Process multiple files
Last updated on 2023-03-21
Estimated time: 22 minutes
Overview
Questions
- How can I check a Snakefile without running the whole workflow?
- How can I create a generic rule that can process multiple files?
- How can I use a Python function as part of a rule?
- How do I tell Snakemake to create a certain file before running a Python function?
Objectives
- Determine a filename plan for downloaded images
- Create a pattern rule to download multiple image files
- Use a Python function as a params input
- Set up a checkpoint to ensure a file exists before running a Python function.
Snakemake Dry Run
When developing a workflow, running the entire workflow after every change can be time-intensive. To help with this, Snakemake has a `--dry-run` flag that validates the `Snakefile` and shows what it would do without actually running any jobs. Try this now:
snakemake -c1 --dry-run
Create a filename plan to save images
When downloading images from the internet it is important to avoid filename clashes, so before downloading anything we should decide on a plan for naming the files.
The input CSV (`multimedia.csv`) was downloaded from https://bgnn.tulane.edu/userinterface/. The `multimedia.csv` file we are using is only meant for this workshop and should not be used for any research purpose. To create a real-world dataset you could use the https://fishair.org/ website, which replaces the website used to download this file.
When viewed within RStudio the data looks like this:
The Tulane `multimedia.csv` documentation describes the `arkID` column as "Multimedia unique identifier number". `arkID` seems like a good identifier to use in our image filenames. For the multimedia row with arkID `dd216t3d` we will save the downloaded image as:
Images/dd216t3d.jpg
Use wget to download an image
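To see what this looks like by hand, you can download the single image with `wget`, building the URL from the base URL used in the rule below and saving it to a temporary filename so it does not collide with the workflow's own output:

wget -O test-download.jpg https://bgnn.tulane.edu/hdr-share/ftp/ark/89609/GLIN/FMNH/dd216t3d.jpg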
Create a rule that uses a Python function
Add a rule to download a single image that uses a Python function to supply a `params` value:
def get_image_url(wildcards):
    base_image_url = "https://bgnn.tulane.edu/hdr-share/ftp/ark/89609/GLIN/FMNH/"
    return base_image_url + "dd216t3d.jpg"

rule download_image:
    params: url=get_image_url
    output: "Images/dd216t3d.jpg"
    shell: "wget -O {output} {params.url}"
NOTE: Make sure you are using `-O` (capital letter, which tells wget where to save the downloaded file) and not `-o` (lowercase, which writes wget's log messages to a file) as the `wget` argument.
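To test this rule you can ask Snakemake to run it by name (this works because the output filename contains no wildcards):

snakemake -c1 download_image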
Make the rule generic with a pattern rule
Change this rule into a pattern rule by adding a wildcard expression to the output filename.
def get_image_url(wildcards):
    base_url = "https://bgnn.tulane.edu/hdr-share/ftp/ark/89609/GLIN/FMNH/"
    return base_url + wildcards.ark_id + ".jpg"

rule download_image:
    params: url=get_image_url
    output: "Images/{ark_id}.jpg"
    shell: "wget -O {output} {params.url}"
Run this rule:
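For example, remove the previously downloaded file (if it exists) and request it again, so the {ark_id} wildcard is filled in from the requested filename:

rm Images/dd216t3d.jpg
snakemake -c1 Images/dd216t3d.jpg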
Add Python logic to look up the URL
Add a pandas import to the top of the `Snakefile`:
import pandas as pd
Change the `get_image_url` function to read the CSV file and add the file as a rule input:
def get_image_url(wildcards):
    filename = "multimedia.csv"
    df = pd.read_csv(filename)
    row = df[df["arkID"] == wildcards.ark_id]
    url = row["accessURI"].item()
    return url

rule download_image:
    input: "multimedia.csv"
    params: url=get_image_url
    output: "Images/{ark_id}.jpg"
    shell: "wget -O {output} {params.url}"
Run snakemake:
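For example, request another image so the URL is looked up from the CSV (this assumes bj373514 is one of the arkID values in your multimedia.csv; substitute any arkID from the file):

snakemake -c1 Images/bj373514.jpg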
Updating the all rule
Update the `all` rule by adding a function that returns the list of all images that should be created.
def get_image_filenames(wildcards):
    filename = config["filter_multimedia"]
    df = pd.read_csv(filename)
    ark_ids = df["arkID"].tolist()
    return expand("Images/{ark_id}.jpg", ark_id=ark_ids)

rule all:
    input: get_image_filenames
Test starting from scratch
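To test from a clean state, remove the previously generated files and run the workflow again (this assumes your generated outputs live under Images/ and filter/):

rm -rf Images filter
snakemake -c1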
OUTPUT
Building DAG of jobs...
FileNotFoundError in file /users/PAS2136/jbradley/SnakemakeWorkflow/Snakefile, line 7:
[Errno 2] No such file or directory: 'filter/multimedia.csv'
...
Require a file to exist before a Python function is used
To fix the FileNotFoundError we need to tell Snakemake to wait for the filter/multimedia.csv file to be created before it runs the function.
We need two changes to fix the error:
- Change the filter rule to be a checkpoint instead of a simple rule. This is done by changing the word “rule” to “checkpoint”.
- Add code to the function requesting the output of the filter checkpoint.
Update the `get_image_filenames` function and the `filter` rule/checkpoint as follows:
def get_image_filenames(wildcards):
    # request the filter checkpoint's output; Snakemake delays evaluating
    # this function until the checkpoint has produced that file
    filename = checkpoints.filter.get().output[0]
    df = pd.read_csv(filename)
    ark_ids = df["arkID"].tolist()
    return expand("Images/{ark_id}.jpg", ark_id=ark_ids)

...

checkpoint filter:
    input:
        script="Scripts/FilterImages.R",
        fishes=config["reduce_multimedia"]
    output: config["filter_multimedia"]
    shell: "Rscript {input.script}"
Run the workflow to download multiple files
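Run the entire workflow again; the all rule now requests one image for every arkID in the filtered CSV:

snakemake -c1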
Ensure the downloaded files are jpg
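The `file` command reports what kind of data each file contains, which is a quick way to confirm the downloads are really JPEG images:

file Images/*.jpg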
OUTPUT
Images/88624536.jpg: JPEG image data, EXIF standard
Images/9x56f44c.jpg: JPEG image data, EXIF standard
Images/bj373514.jpg: JPEG image data, EXIF standard
Images/dp60604r.jpg: JPEG image data, EXIF standard
Images/hd529k3h.jpg: JPEG image data, EXIF standard
Images/t868dr68.jpg: JPEG image data, EXIF standard