3D explorations with Blender for Biopython

For reasons that I will explain later, I strongly believe that visualization is becoming more and more important, on two fronts: the browser and virtual reality.

I have been playing with 3D modeling techniques, starting with Blender. While for now I have decided to go with Unity, I did some explorations with the free software tool first.

I have a model with two variations (which I will use in a tutorial to be made available soon). Just for kicks, I leave here the Blender models along with the results. My content is in the public domain, but it includes a snake from Calore, which is CC-BY.

OK, not yet Star Wars special effects material. But if you want some kind of separator/intro to your presentations with a DNA helix, feel free to take this.

The start of a Kitematic/Docker/Jupyter/Biopython tutorial

Tutorial model (blender file)

Another example:

Presentation model (blender file)

The too large bioinformatics core

I was reading Heng Li's post about Docker, A few hours with docker, and an apparently innocuous quote rang some bells:

With all the buzz around docker, I finally decided to give it try. I first asked Broad sysadmins if there are machines set up for testing docker applications. They declined my request for security concerns and suggested Kitematic for my MacBook.

(my bold)

One of the most frustrating experiences in my life was dealing with the Sanger Institute cluster. How, you might ask, could I be disappointed with what is probably the biggest (and best) bioinformatics cluster in the world?

The user experience was horrifying. For example, I remember waiting a couple of days for my scheduled jobs to start on the queueing system. This is perfectly normal: when you have lots of users who are both savvy and have huge computational demands, at some point in time there will be a massive peak in demand.

Also, the scratch space on the cluster was minute. Sometimes I would wait two days for my jobs to start, just to see them die for lack of scratch. On a big cluster there is no real way around this: you simply cannot make ad-hoc provisions (how do you manage hundreds of users ad hoc?) and you cannot allocate lots of scratch space to every user.

Note that on a small cluster it is feasible to have a chat with the sysadmin and ask for extra space for some time. This is doable on a small scale but complicated on a large, industrial scale.

Also, it is perfectly understandable that the Broad's sysadmins refused to install Docker: such a low-level system application might have unforeseen impacts on the whole infrastructure, causing great pain to many people.

A large, industrial-scale cluster has to have a simple, streamlined administration. It is not designed for experimentation and exploratory analysis; it is good for running well-known analyses on large amounts of data.

Sadly (actually, the word is excitingly), much of the really interesting scientific analysis is exploratory: it requires brand-new, quirky programs, involves lots of retries and will fail many, many times.

The Oxford WTCHG solution

The Wellcome Trust Centre for Human Genetics in Oxford had a different approach: there was indeed a large cluster, but most groups had their own private cluster with a private system administrator (yes, money was not a big limitation). Such a cluster would have 10-20 machines and a sysadmin that everybody knew, who would install software on demand much faster. While there was a queueing system, you could start the odd job by hand (and direct user ssh access to all machines was available). If some user did something stupid there was little problem: everybody knew everybody and things would get sorted out with a chat (and probably a stern talk from the sysadmin). Need gazillions of scratch space for a specific project? "Sure, we can set this up for you to use during the next month or so".

Remember that I said there was also a big cluster? I never used it myself. But if you had a massive need for computational power, it was there, and some people used it for precisely that.

The point of all this

Sometimes I see people wanting to pool together large amounts of resources. That makes sense up to a certain level. Beyond a certain size, the inflexibility of a large system (which by nature has formal procedures and streamlined restrictions) will make exploratory analysis with bleeding-edge tools hard.

There is an optimal size for most bioinformatics clusters, and I suspect it is somewhere at an intermediate level: economically scalable, but still somewhat organic in terms of management.

Bioinformatics with Python Cookbook

I would like to announce my new book:

Book cover

This book is slightly different from the standard books on Bioinformatics and Python.

It is not about teaching Bioinformatics algorithms, but about solving practical day-to-day problems with Python, for example:

  • Next-generation sequencing: FASTQ, BAM and VCF processing, along with dataset filtering
  • Genomics: processing reference genomes, both high-quality references of model species and low-quality assemblies of non-model species; genome annotations and gene ontologies are also discussed
  • Population Genetics: doing PCA, Admixture/Structure, computing FSTs, ...
  • Genome simulation: mostly forward-time simulations, but also a bit of coalescent
  • Phylogenetics: tree reconstruction and tree drawing
  • Proteins: PDB processing and visualization.
  • Other topics like processing map data, GBIF, interfacing with Cytoscape, accessing lots of online databases, ...
  • There is a bit on interacting with R/Bioconductor via Python
  • Finally, we discuss high-performance computing in Python: faster algorithms, clusters, Numba and Cython, plus related technologies like Docker

The book discusses the usual Python libraries in the field (Biopython, PyVCF, Pysam, simuPOP, DendroPy, Pymol) and also scientific libraries like NumPy, SciPy, matplotlib and scikit-learn.

The code is fully available for free on GitHub. I am keen on maintaining the book code, so if you find any issues please do contact me.

The book is available in the usual places (Amazon, ...).


Docker containers for Biopython

This is a list of Docker containers for Biopython.

To install this you will need Docker on Linux or boot2docker on Windows or Mac. The instructions below were tested on Linux.

Functionality available

There are always 2 versions of each container, for Python 2 and 3.

Basic container

This is a container that installs all of Biopython's dependencies; you can get a shell inside it (as shown below) and use Biopython.

Python 3

docker build -t biopython https://raw.githubusercontent.com/tiagoantao/my-containers/master/biopython/Biopython3
#Grab a coffee, wait a bit
docker run -t -i biopython /bin/bash
python3

Python 2

docker build -t biopython2 https://raw.githubusercontent.com/tiagoantao/my-containers/master/biopython/Biopython2
#Grab a coffee, wait a bit
docker run -t -i biopython2 /bin/bash
python

IPython Notebook container

This is a container that installs all of Biopython's dependencies; you can get a shell inside it to use Biopython, but you can also point a browser to http://localhost:9803 (or 9802 for Python 2). With boot2docker you might need an extra step to map the VM port to the host port.

Python 3

docker build -t biopython-nb https://raw.githubusercontent.com/tiagoantao/my-containers/master/biopython/Biopython3-Notebook
docker run -p 9803:9803 -t -i biopython-nb

Python 2

docker build -t biopython2-nb https://raw.githubusercontent.com/tiagoantao/my-containers/master/biopython/Biopython2-Notebook
docker run -p 9802:9802 -t -i biopython2-nb

IPython Notebook Tutorial container

This is a container that installs all of Biopython's dependencies plus a notebook version of the Biopython Tutorial. Point a browser to http://localhost:9803 (or 9802 for Python 2). With boot2docker you might need an extra step to map the VM port to the host port.

Python 3

docker build -t biopython-tutorial https://raw.githubusercontent.com/tiagoantao/my-containers/master/biopython/Biopython3-Tutorial
docker run -p 9803:9803 -t -i biopython-tutorial

Python 2

docker build -t biopython2-tutorial https://raw.githubusercontent.com/tiagoantao/my-containers/master/biopython/Biopython2-Tutorial
docker run -p 9802:9802 -t -i biopython2-tutorial

Buildbot container

There is also a Buildbot container to test Biopython. If you are interested in this one, please read this and contact Biopython's development team.

Getting FASTA output from PLINK files using Biopython

Here is a bit of code that you can use to extract FASTA files from PLINK genotype calls. I am making a few assumptions:

  • There is a reference genome to compare against
  • If no data is available from the genotyping we will infer the reference call. This might be acceptable for some Whole Genome Sequencing data sources (and even so...), but if you are using something less dense then you might want to mark it as missing data. The version without the reference genome would be much easier to code
  • In addition to the reference genome, your genes of interest will be in a BED file
  • I am doing this inside an IPython Notebook. I only make use of one magic, which can trivially be replaced with os.system
  • Your PLINK files are binary encoded

Let's start with the import code, plus some variables that you need to change:

 import gzip
 from Bio.Seq import Seq
 from Bio import SeqIO
 from Bio.SeqRecord import SeqRecord

 indiv_file = 'indivs.txt'

 genome = '/data/atroparvus/tra/ag/ref/pest3.fa.gz'
 my_bed = 'my_genes.bed'
 plink_pref = '/data/gambiae/phase1/AR3'

You will need to change the genome file (a gzipped reference FASTA), the BED file (with your positions of interest) and the PLINK prefix.

Next, let's write a simple function to parse BED files; this will be coded as a generator:

 def get_features(bed_file):
     for l in open(bed_file):
         toks = l.rstrip().split('\t')
         chrom = toks[0]
         start = int(toks[1])
         end = int(toks[2])
         gene = toks[3]
         strand = toks[4]
         yield chrom, start, end, gene, strand
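
A quick usage example (it just prints the features; this assumes the BED file configured above exists):

 for chrom, start, end, gene, strand in get_features(my_bed):
     print(gene, chrom, start, end, strand)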

Warning

Be very careful with the reference coordinates that you use. Biopython and BED are typically 0-based. PLINK is typically 1-based. Serious pain and suffering (of the kind that is difficult to detect) can be caused by this.

Now let's get the sequence for a certain area of interest:

 def get_seq(fasta_name, chrom, start, end, determine_chrom=lambda x: x.description.split(':')[2]):
     recs = SeqIO.parse(gzip.open(fasta_name, 'rt', encoding='utf8'), 'fasta')
     for rec in recs:
         if determine_chrom(rec) == chrom:
             return rec.seq[start:end + 1]

This will take your reference genome file and return a Biopython sequence with your area of interest. Notice the parameter to determine the chromosome from the FASTA description. Different FASTA files can have completely different descriptor lines, so we abstract that away. In my concrete case, the description looks like this:

>2L chromosome:AgamP3:2L:1:49364325:1

So, I take the description, split it by : and get the 3rd element (2L in this case). "But I could have used the ID", I hear you say. Yes I could, but I wanted to illustrate code for strange description lines...
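
For instance, if the record IDs in your FASTA file already match your chromosome names, a call could look like this (the coordinates here are purely hypothetical):

 area_seq = get_seq(genome, '2L', 0, 999, determine_chrom=lambda rec: rec.id)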

OK, now let's get the SNPs for the area of interest from our PLINK file. The input file is in binary format (you can easily change this to text or even VCF, as long as you use PLINK 1.9+) and the output is in text format (so that we can parse it). We are using an IPython magic here (use os.system if you prefer):

 def get_snps(plink_prefix_in, plink_prefix_out,
              indiv_file, chrom, start, end):
     !plink --bfile $plink_prefix_in --chr $chrom --from-bp $start --to-bp $end\
     --keep $indiv_file --allow-extra-chr --recode --out $plink_prefix_out
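
If you are not working inside IPython, a plain-Python sketch of the same call would use subprocess (the get_snps_no_magic name is just for illustration; this assumes plink is on your PATH):

 import subprocess

 def get_snps_no_magic(plink_prefix_in, plink_prefix_out,
                       indiv_file, chrom, start, end):
     # Same PLINK invocation as the magic above, without the notebook dependency
     subprocess.check_call([
         'plink', '--bfile', plink_prefix_in,
         '--chr', str(chrom), '--from-bp', str(start), '--to-bp', str(end),
         '--keep', indiv_file, '--allow-extra-chr',
         '--recode', '--out', plink_prefix_out])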

OK, now the work-horse of all of this, a function to read a PLINK file and return sequences based on the reference genome:

 def get_plink_seqs(ref_seq, plink_pref, start):
     f = open(plink_pref + '.map')
     poses = []
     for l in f:
         pos = l.rstrip().split('\t')[-1]
         poses.append(int(pos))
     f = open(plink_pref + '.ped')
     for l in f:
         toks = l.rstrip().replace('\t', ' ').split(' ')
         fam = toks[0]
         ind = toks[1]
         alleles = toks[6:]
         seq1 = list(ref_seq)
         seq2 = list(ref_seq)
         for i, pos in enumerate(poses):
             ref_pos = pos - start
             ref_allele = ref_seq[ref_pos]
             a1 = alleles[2 * i]
             a2 = alleles[2 * i + 1]
             if ref_allele == ref_allele.lower():
                 #maintain case of the ref
                 a1 = a1.lower()
                 a2 = a2.lower()
             seq1[ref_pos] = a1
             seq2[ref_pos] = a2
             #if a1 != ref_allele or a2 != ref_allele:
             #    print(fam, ind, i, ref_pos, pos, ref_allele, a1, a2)
         yield fam, ind, ''.join(seq1), ''.join(seq2)

So, the inputs are a reference sequence (only the tiny slice of the genome corresponding to your area of interest), the prefix of your PLINK files and the start position. The code starts by loading the PLINK positions from the MAP file (notice that I ignore the chromosome: you should only have a single chromosome there, plus PLINK can do strange things to chromosome names - I will not detail that here). Most of the code involves offsetting from the absolute coordinates in the PLINK file to the relative coordinates in our reference sequence. Also note that we maintain the capitalization of the reference genome (which normally has a meaning, such as soft-masking).

Finally, let's extract the sequences proper to FASTA files; these will be files named after the feature/gene:

 for feature in get_features(my_bed):
     chrom, start, end, gene, strand = feature
     ref_seq = get_seq(genome, chrom, start, end)
     print(chrom, start + 1, end + 1, gene)
     get_snps(plink_pref + '/' + chrom, gene, indiv_file,
              chrom, start + 1, end + 1)# 1 based
     my_records = []
     my_records.append(SeqRecord(seq=ref_seq, id=gene, description='Reference'))
     for fam, ind, seq1, seq2 in get_plink_seqs(ref_seq, gene, start + 1):
         my_records.append(SeqRecord(seq=Seq(seq1), id=gene, description='%s_%s_1' % (fam, ind)))
         my_records.append(SeqRecord(seq=Seq(seq2), id=gene, description='%s_%s_2' % (fam, ind)))
     SeqIO.write(my_records, '%s.fa' % gene, 'fasta')

We iterate through all our features on the BED file: we get the corresponding reference sequence, get the SNPs from PLINK and generate the FASTA sequences. We use SeqIO for writing.

Warning

A few important notes:

  • The PLINK files are not phased, so you cannot assume correct phasing of the FASTA files
  • I am assuming diploid data here
  • The code could be made more efficient by loading each reference sequence only once (it can be quite heavy). This can easily be achieved by retrieving the BED features in chromosome order (many times they have that order anyway) and only loading the reference sequence when the chromosome/scaffold changes

I hope this code is useful to you. As usual comments and criticism are very welcome.

First experiences with conda and Anaconda

I have now used Anaconda (and most especially conda) for more than a month. I would like to give an overview of my experience. I will concentrate on my major hurdles with it, because that might be helpful to others. But the short story is: I wholeheartedly recommend it.

My requirements. I have a few fundamental ones:

  • Being an old Python user, I have lots of stuff hanging around (libraries)
  • I prefer Python 3, but I have to support Python 2
  • I am very interested in easily deploying ready-to-use production software to users who are not tech-savvy
  • While I do all of my work on Linux, I care greatly for portability to Windows and Mac

I have to say that the first impression of using conda was really, really bad. I installed it and then installed binstar (binstar is core to conda) and it failed out of the box! Why? Because I have quite a lot of stuff in my local installation (i.e. Python libraries) and some of that stuff was adding libraries from the default Python installation (the standard one that comes with Ubuntu) to conda's PYTHONPATH. But I hear you: "you should have been more careful!"

Actually, no. Anaconda (and conda) sell themselves on making your life easier with dependency management, and in this case there was a clear failure to ensure that out of the box. At the very least, I think conda should have inspected the system and warned me that this was a leaky environment. Furthermore, I would like to have a mode of operation that is more tightly closed; that makes life much easier when controlling deployment environments. I ended up deleting all my local content and moved on. But I think conda should isolate itself from the existing environment (or at least warn) if it wants to be seen as a distribution that takes care of deployment problems as one of its primary goals.

Pip integration is also sometimes strange: for some reason sometimes pip installs to my local library (.local) and sometimes to the conda environment. By the way, generating a conda package from a pip package is generally trivial.

I could talk about the things where conda shines (and it really shines): maintaining Python 2 and 3 environments, or the binary library management. All the good features that are attributed to Anaconda are really there. I will just say that I am now on conda (using many environments, as I have lots of concurrent stuff going on) and it mostly works. I do not plan to go back. It is mostly well designed, the default options are sensible and the learning curve is surprisingly gentle.

I do nonetheless think that the issue of isolation for the default environment (the first point above) really needs to be addressed in order to have a platform that enables reliable and trustworthy deployment.

As a (very personal) requirement, I would like an easy way to make my applications available on the web. In the past I developed Java Web Start (Jython) applications and I miss the extreme convenience of single-click-to-install-and-run (security issues aside). For some users, being asked to install Anaconda (a big install, which can be partially mitigated with Miniconda) and then add something on top (for which there is no standard process outside the command line - conda install...) is a bit too much. I understand that this is a niche thing, but easy user deployment would be a nice feature to have...

IPython notebook and music composition (Abjad)

IPython Notebook is a revolutionary tool, as it brings the REPL concept to the masses with the added bonus of easy, graphical, web-based usage. Extra brownie points for facilitating (scientific) replication of results... The whole concept of documented, interactive development coupled with sharing and easy replication is awesome!

IPython Notebook might have been developed in a scientific environment, and it indeed makes a lot of sense for research and teaching purposes, but I believe it makes even more sense in an artistic environment: test, tweak, change, try, re-try - preferably in real time and in front of your eyes (or ears).

Within that line of thought, I developed a (very simple) interface between IPython Notebook and Abjad (a tool for music composition). The interface is totally transparent in the sense that, if you know abjad, you do not need to learn anything new. I just replace the function responsible for showing the scores: instead of opening a new application (typically a postscript viewer), the show function is trapped, the output converted to one (or several) PNGs, cropped and sent to IPython Notebook.

You can see abjad's documentation examples on nbviewer. You will find 5 examples linked there.

A couple of caveats: I have tested this with the github version of abjad (not with the current stable) - I expect it to work with abjad 2.14 but the examples will require the development version. Also, I have tested this with Python 3 - if you are on Python 2 and this does not work, please drop me a line and I will patch it.

The interface is available as a github project. Feel free to use it. Note that this is a personal project, completely outside of my area of comfort. Feel free to criticize, suggest changes, etc...

Leaking Docker containers?

I have been using Docker quite a lot, especially for Biopython: it is great for setting up automated testing environments and also for creating contained development environments where one can develop some crazy new feature in a protected/isolated context that is very quick to set up.

While I am not a specialist on Docker, it is quite clear that we are not talking about an isolated VM, but something lighter. Recently I have been having a problem with the containment of the container:

Biopython has a BioSQL module that is able to use both PostgreSQL and MySQL. Our automated testing environment based on Docker automatically installs both database systems before proceeding with the unit tests. These Docker containers were running fine on a daily basis until a few days ago, when building from the Dockerfile stopped working. The error:

When running:

apt-get install -y mysql-server

I get:

Use of uninitialized value $_[1] in join or string at /usr/share/perl5/Debconf/DbDriver/Stack.pm line 111.
chfn: PAM: System error
adduser: '/usr/bin/chfn -f MySQL Server mysql' returned error code 1. Exiting.

Now, what was the only change on the host machine? Yes, you got it right: I installed a running MySQL server! And, of course, there were absolutely no changes to the Dockerfile.

The same Dockerfile works fine on a host machine without a MySQL server. Furthermore I get the same error with different versions of Ubuntu containers (both with trusty and the old saucy).

I am not exposing any TCP port or disk volume, it is just a completely closed container (other than network access).

While I understand that a container is not a VM, it is a bit troubling that we see this lack of isolation between host and container.

Automatic code changing in Python with the ast module

One of the changes that Python 3 brings versus Python 2 is that, if you redefine the __eq__ method of a class, then its __hash__ method automatically becomes None. This means that your class is rendered unhashable (and thus unusable in many settings). This makes sense, as the hash function is closely related to object equality. But if you have Python 2 code that is nonetheless working and you are planning to move to Python 3, then automatically adding a __hash__ method to recover object hashability might be a pragmatically acceptable solution.
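
A minimal demonstration of the rule on Python 3 (the Point class here is just an illustration):

class Point:
    def __init__(self, x):
        self.x = x
    def __eq__(self, other):
        return self.x == other.x

# Defining __eq__ without __hash__ sets __hash__ to None on Python 3...
print(Point.__hash__)  # None
# ...so hash(Point(1)) raises TypeError: unhashable type: 'Point'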

This not-so-perfect example sets the stage for a mini-tutorial on traversing a Python source tree, finding classes that have new __eq__ methods but lack a new __hash__ method, and then patching the class to have one. This is, thus, mostly an excuse for a brief exploration of Python's ast module...

The ast module in Python

Note

The code presented here works on Python 3. There are a few differences for Python 2. For instance in Python 2 print is a statement, not an ordinary function. But the gist of things is the same.

The ast module allows you to access the Abstract Syntax Tree of a Python program. This can be useful for program analysis and program transformation, among other things. Let's start with the AST representation of a Hello World program:

#Typical hello world example
print('Hello World')

Getting the AST tree for this is trivial:

import ast
tree = ast.parse(open('hello.py').read())

Now, the tree object might seem daunting at first, but don't let appearances fool you: it is actually quite simple. The type of tree will be _ast.Module. If you ask for its children by doing tree._fields (as you can with all AST nodes), you will see that a Module has a body (i.e. tree.body). The body attribute holds, you guessed it, the body of the file:

>>>print(tree.body)
[<_ast.Expr object at 0x...>]

The body of a module is a list of statements. In our case we have a single node: an Expr (Expression is another type of node - do not confuse them). Notice that the AST representation will not include the comments; these are "lost".
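
As a side note, if you prefer to see the whole tree in one go rather than drilling down attribute by attribute, ast.dump gives a nested textual representation (the exact node and field names vary a little between Python versions):

import ast
tree = ast.parse("print('Hello World')")
# Prints something along the lines of:
# Module(body=[Expr(value=Call(func=Name(id='print', ...), args=[Str(s='Hello World')], ...))])
print(ast.dump(tree))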

The only _field of an Expr is value, which is just an indirection to the concrete kind of expression: in our case we have a Call to the print function. Remember, in Python 3 print is just an ordinary function, not a statement, so this would look different in Python 2.

OK, back to our Call. What is in a call? Well, you call a function with arguments so a call is a function name plus a set of arguments:

>>>print(tree.body[0].value._fields)
('func', 'args', 'keywords', 'starargs', 'kwargs')

You will notice the func plus a lot of stuff for arguments. This is because, as you know, Python has quite a rich way of passing arguments (positional arguments, named arguments, ...). For now let's concentrate on func and args only. func is a Name with an attribute called id (the function name). args is a list with a single element, a string:

>>>print(tree.body[0].value.func.id)
print
>>>print(tree.body[0].value.args[0].s)
Hello World

This is actually quite simple, here is a graphical representation:

[Figure: AST of the hello world example (example1.png)]

The best way to find all types of nodes is to check the abstract grammar documentation, or you might prefer the Python 2 version.

Second attempt (a function definition)

Let's have a look at the AST for a function definition:

def print_hello(who):
    print('Hello %s' % who)

If you get a tree for this code, you will notice that the body is still composed of a single statement:

>>>print(tree.body)
[<_ast.FunctionDef object at 0x...>]

At the top level there is only a single statement; the print function call belongs to the function definition proper. A function definition has name, args, body, decorator_list and returns. name is a string (print_hello), no more indirection here, easy. args cannot be very simple because it has to accommodate the fairly rich Python syntax for argument specification, so its fields are ('args', 'vararg', 'kwonlyargs', 'kw_defaults', 'kwarg', 'defaults'). In our case we just have a simple positional argument, so we can find it in args.args:

>>>print(tree.body[0].args.args)
[<_ast.arg object at 0x...>]

The arg object has an arg field (tree.body[0].args.args[0].arg starts to sound a bit ludicrous, but such is life), which is a string (who). As a side note, in Python 3 you can annotate function parameters (so there is also an annotation field).

Now, the function body can be found in the body field of the function definition:

>>>print(tree.body[0].body)
[<_ast.Expr object at 0x...>]

To finalise this version, I just want to present the sub-tree inside the print (i.e. the "Hello %s" % who):

[Figure: AST sub-tree of 'Hello %s' % who (example2.png)]

OK, before we go to the final version, one very last thing:

>>>#Lets look at two extra properties of the print line...
>>>print(tree.body[0].body[0].lineno, tree.body[0].body[0].col_offset)
2 4

Yes, you can retrieve the line number and the column of a statement...

Third attempt (a class definition)

class Hello:
    def __init__(self, who):
        self.who = who

    def print_hello(self):
        print('Hello %s' % self.who)

It should not come as a shock that the module only has a single statement:

[<_ast.ClassDef object at 0x...>]

And yes, as you might expect by now, the ClassDef object has a name and a body attribute (things start making sense and are consistent). Indeed most things are as expected: the class body has two statements (two FunctionDefs). There are only a couple of conceptually new things, and these are visible in the line:

self.who = who

Here we have two new features: the assignment (=) and the compound name (self.who).

[Figure: AST of self.who = who (example3.png)]

Notice that the assignment targets come as a list; this is because an assignment can have several targets (chained assignment, as in x = y = 1), and each target can itself be a Tuple for unpacking, as in:

x, y = 1, 2
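
A quick way to see the difference between the two cases (chained assignment versus tuple unpacking):

import ast
chained = ast.parse('x = y = 1').body[0]
unpacking = ast.parse('x, y = 1, 2').body[0]
print(len(chained.targets), type(chained.targets[0]).__name__)      # 2 Name
print(len(unpacking.targets), type(unpacking.targets[0]).__name__)  # 1 Tuple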

There is much more that could be said. Processing ASTs requires a recursive mindset; if you are not used to thinking recursively, I suspect that might be your biggest hurdle in doing things with the ast module. And with Python things can get very nested indeed (nested functions, nested classes, lambda functions, ...).
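
As an aside (the solution below does not use it), the ast module also provides a NodeVisitor class that handles the recursion for you; here is a minimal sketch that lists function calls:

import ast

class CallLister(ast.NodeVisitor):
    # Print the function name for every call to a plain name
    def visit_Call(self, node):
        if isinstance(node.func, ast.Name):
            print('call to', node.func.id)
        self.generic_visit(node)  # keep descending into nested nodes

CallLister().visit(ast.parse("print('Hello %s' % who)"))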

A complete solution

OK, let's switch gears completely and apply this to a concrete case. The Abjad project helps composers build up complex pieces of music notation in an iterative and incremental way. It is currently Python 2 only, and I am volunteering some of my time to help it support Python 3. It is highly object-oriented, very well documented and has a massive load of test cases (kudos to the main developers for this). Some of the classes have __eq__ methods defined but lack __hash__ methods, so a pragmatic (though not totally rigorous) solution is to add the __hash__ methods required by Python 3.

A caveat here: the processing has to be done in Python 2, so the code below is Python 2. The idea is to generate Python 2 code with __hash__ methods that can then be automatically translated by 2to3 (so there is no need for monkey patching after 2to3).

First thing, we have to traverse all the files:

# Imports shared by all the snippets below
import ast
import os
import shutil

def traverse_dir(my_dir):
    content = os.listdir(my_dir)
    for element in content:
        if os.path.isdir(my_dir + os.sep + element):
            traverse_dir(my_dir + os.sep + element)
        elif os.path.isfile(my_dir + os.sep + element) and element.endswith('.py'):
            process_file(my_dir + os.sep + element)

Nothing special here, just a plain traversal of a directory structure.

Now we want to find all classes that have an __eq__ method but not a __hash__ method:

def get_classes(tree):
    # Will not work for nested classes
    my_classes = []
    for expr in tree.body:
        if type(expr) == ast.ClassDef:
            my_classes.append(expr)
    return my_classes

The function above will return all classes from a list of statements (typically a module body).

def get_class_methods(tree):
    my_methods = []
    for expr in tree.body:
        if type(expr) == ast.FunctionDef:
            my_methods.append(expr)
    return my_methods

The function above will return all function definitions from a ClassDef object.

def process_file(my_file):
    shutil.copyfile(my_file, my_file + '.bak')
    tree = ast.parse(open(my_file).read())
    my_classes = get_classes(tree)
    patches = {}
    for my_class in my_classes:
        methods = get_class_methods(my_class)
        has_eq = '__eq__' in [method.name for method in methods]
        has_hash = '__hash__' in [method.name for method in methods]
        if has_eq and not has_hash:
            lineno = compute_patch(methods)
            patches[lineno] = my_class.name
    patch(my_file, patches)

This is the main loop applied to each file: we back up the file, get all the available classes, and for each class we get all available methods; if there is an __eq__ method but no __hash__ method, a patch is computed and then applied.

The first thing that we need is to know where to patch the code:

def compute_patch(methods):
    names = [method.name for method in methods]
    names.append('__hash__')
    try:
        names.remove('__init__')
    except ValueError:
        pass
    names.sort()
    try:
        method_after = names[names.index('__hash__') + 1]
    except IndexError:
        method_after = names[-2]
    for method in methods:
        if method.name == method_after:
            return method.lineno

The main point here is to decide on which line to apply the patch. It has, of course, to be inside the class, but we want to be slightly more elegant than that: we want to insert the new method in lexicographical order among the existing ones, so __hash__ typically ends up right after __eq__ (__init__ is excluded from the computation). Now we can patch:

def patch(my_file, patches):
    f = open(my_file + '.bak')
    w = open(my_file, 'w')
    lineno = 0
    for l in f:
        lineno += 1
        if lineno in patches:
            w.write("""    def __hash__(self):
        r'''Hashes my class.

        Required to be explicitly re-defined on Python 3 if __eq__ changes.

        Returns integer.
        '''
        return super(%s, self).__hash__()

""" % patches[lineno])
        w.write(l)
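
As a quick sanity check of where the patch lands (the class below is purely hypothetical, not taken from abjad):

src = '''
class C(object):
    def __init__(self): pass
    def __eq__(self, other): return True
    def __ne__(self, other): return False
'''
methods = get_class_methods(ast.parse(src).body[0])
# Prints the line number of __ne__: the new __hash__ is written just before that
# line, i.e. between __eq__ and __ne__
print(compute_patch(methods))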

Final notes

This could be done in many other ways. Indeed, this is not even the standard way (that would be coding a 2to3 fixer). The practical solution is not general (for instance, it does not support nested classes). Also, it has problems with comments (as we lose them in the AST). But for the practical purpose (patching abjad) it was good enough (you can see the result here). For further reading, you might want to have a look at this Stack Overflow question.

Reproducibility and user empowerment via Docker

There are two reasons why I think Docker (and similar things like, e.g., Vagrant) is an important technology. And this is not just because of its technological aspects, but also due to its socio-political implications...

Reproducibility

The first one, which is quite obvious and has already been discussed elsewhere, is the reproducibility of results, especially in a scientific setting: analysis environments are getting more and more complex, so reproducing results becomes a complex enterprise in itself. And we are not even talking here about wet-lab reproducibility: we are talking about uber-complex software analysis pipelines. Docker and similar technologies change all that: now it is becoming trivial to allow others to reproduce your analysis. Sure, in the past you could replicate the methods (assuming they were fully documented, and I cannot think of a single case where that happened), but the cost of doing so was normally staggering (manually setting everything up). Also, if you were the originator of the work, making it easier for the world to replicate was no trivial task either (while there might be many cases where some people want to obfuscate their own work, it should be noted that empowering others to reproduce your work is not easy, even for those intent on doing so).

Enter Docker and friends: now it is easy for everyone involved to create and use a certain analysis pipeline. Furthermore, the process can be standardized and widely adopted. For instance, in the realm of science it is not difficult to envision a time when scientific journals require that your work be easily (automatically) reproducible using a procedure like this.

User empowerment - The big thing?

The deal with cloud computing is well known: you trade control for convenience. Clouds create a dependency relationship between the user and the provider, and it happens to be a nasty kind of relationship: the weak party (the user) becomes increasingly dependent on the strong party (the cloud provider). Things like the disappearance of Google Reader are just a harbinger of what can happen in a cloud environment: your daily tools (which might be critical to your work - something more serious than a news reader) might disappear overnight. Also, do you really trust that your cloud provider does not access your (private?) data?

Now, the convenience of the cloud is undeniable. The interesting question is: can we strike a compromise where the convenience still exists but the user retains a fair share of control?

From my perspective (as a software developer interested in giving users maximum control and convenience), the last few years have been tough. A compromise has not been easy, as there has been a massive amount of development in technologies that are convenient but take control away from the user. A few examples:

  • On the front-end, almost everything that is exciting is web-based. Sure, one can use native GUIs, but... meh. Of course, "web-based" has meant putting stuff behind a server, away from user control. The idea that the user could install all the necessary server software on her own computer only works for 1% of us (and even then, most of us would like to be spared the boredom of installing yet another web app).
  • The same argument goes for the back-end: be it the installation of a database server or of a MapReduce framework (yes, a MapReduce framework can come in handy on a user's computer with lots of cores), these are not trivial tasks, and surely not accessible to most users.

With Docker this all changes: you can make a cloud version of your application, but you can also make a version that is trivial to download and install and can be run completely under the control of the user. An application with a complex web interface, SQL database, NoSQL database and MapReduce framework can be trivially installed locally using a Docker container. You can say to your users: "download this, everything is inside and will be configured for you, like magic (very convenient). And if you do not like magic, feel free to inspect the container... it's all there for you to see and change..."

If you are a developer who believes in both user empowerment and convenience, there are no more heart-breaking doubts: you can use modern web/cloud technologies while knowing that your work might be deployed in the cloud OR under total user control.

Now, you might say that the current state of virtualization/container technology is too geeky. I agree, but is it too difficult to envision a world where a very complex application can be downloaded and run with a simple web click? Indeed, is Docker very far from this?

Doctor Faustus

Patrick O’Kane and Coral Messam in Doctor Faustus. Photograph: Jonathan Keenan