Make has been one of the key tools in my arsenal for getting things done. Although it was developed for compiling code, its functionality can be generalized to any process that requires files to be generated based on dependencies.
I recommend you look at these slides by Vince Buffalo
as a good introduction to using make
for scientific workflows.
Make works by creating a dependency graph of files and their prerequisites, using the time each file was last modified to determine whether it needs to be remade. This general concept is great for reproducible scientific research or many other repeating tasks and workflows. However, it’s not without its flaws.
Make’s syntax is very obtuse, using many shorthand variables to describe rules, which makes it difficult to start using. But even after learning its syntax I’ve continually found one fatal flaw: all of the files need to be local, on your current computer.
This makes sense given its original function of compiling code. However, in data analysis and scientific workflows we often have to interact with remote files on AWS S3 or files that we download from a web resource. These files don’t have a timestamp that make can use, and so their presence completely breaks the dependency graph.
There are workarounds of course. I’ve used empty dummy files as a way of keeping track of the last time a URL was downloaded, or just downloaded all of the external files at the beginning and end of an analysis, which is the approach used in the popular cookiecutter data science template. But both of these solutions are brittle and don’t really solve the problem.
I looked for a URL-enabled make, one that could integrate remote files into the dependency graph, but didn’t find anything suitable. So instead I set out to add this functionality myself.
I first looked at the source for GNU make, but I couldn’t understand the source code, so modifying that was out of the question. Instead I used mk as the base for my modifications. Mk was originally developed for the Plan 9 operating system as a rewrite of make without many of its annoyances. I had stumbled upon a simple re-implementation of mk in Go by Daniel Jones a number of years ago and decided to add remote file support to that.
The changes I describe below are available on my fork of mk on GitHub.
The goal of my improvements was to make the following rule work:
file.txt.gz: "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/696/305/GCF_001696305.1_UCN72.1/GCF_001696305.1_UCN72.1_feature_count.txt.gz"
    curl $prereq > $target
That is, that mk
would recognise the prerequisite as a URL,
determine if that URL was newer or older than the target and
proceed accordingly.
Since there is an extra colon
in the URL we need to protect it with quotes so the mkfile parser
doesn’t get confused.
The core of make and mk is deciding whether to remake a file based on whether its prerequisites are newer than it. It does this by looking at the last modified timestamp of a file. Sure enough, inside graph.go there was a function updateTimestamp that gets the last modified time of a file or records that the file doesn’t exist.
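// ask the filesystem for the file's metadata
info, err := os.Stat(u.name)
if err == nil {
	// the file exists: record its last modified time
	u.t = info.ModTime()
	u.exists = true
	u.flags |= nodeFlagProbable
} else {
	_, ok := err.(*os.PathError)
	if ok {
		// stat failed on the path: treat the file as missing
		// and use the Unix epoch as a placeholder timestamp
		u.t = time.Unix(0, 0)
		u.exists = false
	} else {
		mkError(err.Error())
	}
}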
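To make the comparison concrete, here is a minimal Go sketch of the staleness decision that these timestamps feed into. It is not mk's actual code: the `file` type, the `at` and `missing` helpers, and `needsRemake` are hypothetical names for illustration, and treating a missing prerequisite as a reason to remake is an assumption.

```go
package main

import (
	"fmt"
	"time"
)

// file is a hypothetical stand-in for a node in the dependency graph:
// a name, a last modified time, and whether the file exists at all.
type file struct {
	name   string
	t      time.Time
	exists bool
}

// at builds an existing file whose timestamp is unixSec seconds after the epoch.
func at(name string, unixSec int64) file {
	return file{name: name, t: time.Unix(unixSec, 0), exists: true}
}

// missing builds a file that does not exist, using the same epoch
// placeholder timestamp as the snippet above.
func missing(name string) file {
	return file{name: name, t: time.Unix(0, 0), exists: false}
}

// needsRemake reports whether target is out of date with respect to prereqs.
// Assumption: a missing target or a missing prerequisite always forces a remake.
func needsRemake(target file, prereqs []file) bool {
	if !target.exists {
		return true
	}
	for _, p := range prereqs {
		if !p.exists || p.t.After(target.t) {
			return true
		}
	}
	return false
}

func main() {
	data := at("data.csv", 200)
	fmt.Println(needsRemake(at("plot.png", 100), []file{data}))  // true: prerequisite is newer
	fmt.Println(needsRemake(at("plot.png", 300), []file{data}))  // false: target is newer
	fmt.Println(needsRemake(missing("plot.png"), []file{data})) // true: target is missing
}
```

Everything depends on each node having a trustworthy timestamp, which is exactly what a remote file lacks by default.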
updateTimestamp was the function to modify to look at the timestamps of remote files. To do this we just need to identify names that look like remote files, i.e. ones that start with http(s):// or s3://. The following simple modification makes that check and farms out the timestamp lookup depending on whether it’s an http(s) or s3 remote file.
// u is a node in the dependency graph.
// Its name member is the full path of the file.
if strings.HasPrefix(u.name, "s3://") || strings.HasPrefix(u.name, "https://") || strings.HasPrefix(u.name, "http://") {
	up, err := url.Parse(u.name)
	if err != nil {
		log.Fatal(err)
	}
	if up.Scheme == "http" || up.Scheme == "https" {
		updateHttpTimestamp(u)
	} else if up.Scheme == "s3" {
		updateS3Timestamp(u, up)
	}
} else {
	// local file: the original os.Stat logic shown above goes here
}
The implementation of updateHttpTimestamp is pretty simple. A HEAD request is made to the URL and the Last-Modified header is read. If that header is present, the time is parsed and used in the dependency graph. If the header isn’t found, it’s assumed that the URL doesn’t exist, causing it to be remade.
func updateHttpTimestamp(u *node) {
	// get the headers of the URL
	resp, err := http.Head(u.name)
	if err != nil {
		log.Fatal(err)
	}
	// a HEAD response has no body, but close it to release the connection
	defer resp.Body.Close()
	lastModified := resp.Header.Get("Last-Modified")
	if lastModified == "" {
		// no Last-Modified header so let's assume that it
		// doesn't exist
		u.t = time.Unix(0, 0)
		u.exists = false
	} else {
		tmptime, err := time.Parse(time.RFC1123, lastModified)
		if err != nil {
			log.Fatal(err)
		}
		u.t = tmptime
		u.exists = true
	}
}
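One way to try this logic without touching the network is to point it at a local test server. The sketch below is an illustration rather than the code in the fork: `parseLastModified` and `remoteModTime` are hypothetical helpers, and it swaps `time.Parse` for the standard library's `http.ParseTime`, which accepts the date formats HTTP allows.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// parseLastModified turns a Last-Modified header value into a time.
// ok is false when the header is empty or cannot be parsed.
func parseLastModified(header string) (t time.Time, ok bool) {
	if header == "" {
		return time.Time{}, false
	}
	t, err := http.ParseTime(header)
	if err != nil {
		return time.Time{}, false
	}
	return t, true
}

// remoteModTime sends a HEAD request to url and reports the Last-Modified
// time. Hypothetical helper: it returns errors instead of exiting.
func remoteModTime(url string) (time.Time, bool, error) {
	resp, err := http.Head(url)
	if err != nil {
		return time.Time{}, false, err
	}
	defer resp.Body.Close()
	t, ok := parseLastModified(resp.Header.Get("Last-Modified"))
	return t, ok, nil
}

func main() {
	// A throwaway local server that always reports the same timestamp.
	stamp := time.Date(2015, time.October, 21, 7, 28, 0, 0, time.UTC)
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Last-Modified", stamp.Format(http.TimeFormat))
	}))
	defer srv.Close()

	t, ok, err := remoteModTime(srv.URL)
	fmt.Println(t.UTC(), ok, err)
}
```

Returning an `ok` flag keeps the "no usable header" case separate from network errors, so the caller can decide whether to treat the URL as missing.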
The implementation for updating an S3 file is similar but uses the AWS API to get the last modified time.
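The S3 code isn't shown here, but one step any version needs is splitting an `s3://` URL into a bucket and a key. A minimal sketch, assuming the usual `s3://bucket/key` layout; `splitS3URL` is a hypothetical helper, not a function from mk or the AWS SDK, and the example URL is made up.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// splitS3URL splits "s3://bucket/path/to/key" into its bucket and key.
// Assumption: the bucket is the URL host and the key is the path
// without its leading slash.
func splitS3URL(raw string) (bucket, key string, err error) {
	up, err := url.Parse(raw)
	if err != nil {
		return "", "", err
	}
	if up.Scheme != "s3" || up.Host == "" {
		return "", "", fmt.Errorf("not an s3:// URL: %q", raw)
	}
	return up.Host, strings.TrimPrefix(up.Path, "/"), nil
}

func main() {
	bucket, key, err := splitS3URL("s3://my-bucket/genomes/feature_count.txt.gz")
	fmt.Println(bucket, key, err) // my-bucket genomes/feature_count.txt.gz <nil>
}
```

The bucket and key could then be handed to an S3 metadata request such as HeadObject, whose response includes a last modified time.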
And with those small modifications the basic example I showed at the beginning now works. These modifications can be found in my fork of mk. Try it out and see how much easier your make-ing becomes!