At The Data Incubator, we really like Anaconda. In particular, we like that we can quickly install and update environments that include the usual PyData libraries (numpy, scipy, pandas) at any version from prebuilt binaries. One issue we’ve had, however, is that frozen conda environments are currently machine-specific. In particular, different machines require subtly different versions of the binary dependencies of cairo, which is itself a dependency of matplotlib.
However, we (and, we suspect, most data scientists) don’t really care about that. We just want PyData at the appropriate versions on our machines. We tried to capture this behavior by minifying the environment.yml file and taking only the “most important” packages - the ones that, because they depend on other packages, will install everything currently specified in environment.yml. For the rest of the packages, we just want the latest version, and Anaconda automatically behaves this way for us.
Dependencies and graphing them
So, maybe you’ve thought about dependencies before. We have. As it turns out, we can think of dependencies as a directed acyclic graph (often abbreviated DAG).* In this graph, each package is a node, and if package X depends on package Y, we have a (directed) edge (X, Y).
If you look at such a graph, you’ll find nodes that represent packages which aren’t dependencies of any other package. We call those “source nodes” (or source packages, for the purposes of this explanation). Those are the nodes we want, as they represent the packages we really care about. To find this set of packages, we start with the set of all nodes and remove every node that is specified as another package’s dependency (a set difference). This leaves us with our source packages.
* We could conceivably have two packages which are “tightly coupled” - i.e. depend on each other - but that’s typically considered pretty bad practice. We know of no examples of this in anaconda’s repositories.
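To make the idea concrete, here is a minimal sketch of finding source packages in a dependency graph. The function name `source_packages` and the small hand-written graph are assumptions for illustration only; they are not taken from the package's code, and real dependency lists vary by version.

```python
def source_packages(dependencies):
    """Return the packages that no other package lists as a dependency.

    `dependencies` maps each package name to the set of package names
    it depends on (one entry per node, one set member per edge).
    """
    all_packages = set(dependencies)

    # Every package that appears at the receiving end of an edge.
    depended_on = set()
    for deps in dependencies.values():
        depended_on.update(deps)

    # Set difference: what remains is not a dependency of anything.
    return all_packages - depended_on


# Hypothetical, hand-written graph used only for illustration.
example_graph = {
    "bokeh": {"numpy", "pandas"},
    "pandas": {"numpy"},
    "numpy": set(),
    "requests": set(),
}

print(sorted(source_packages(example_graph)))  # ['bokeh', 'requests']
```

Note that `requests` counts as a source package here too: nothing depends on it, so it has to be listed explicitly.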
First, without further ado, here is the package itself:
And, if you just want to install it:
pip install git+https://github.com/thedataincubator/[email protected]
Then, usage is quite simple.
python -m conda_minifier path_to_environment.yml > minified_environment.yml
Now that you’ve got the goods, here is a bit more about how we do this in practice.
- Parse all packages/versions in environment.yml.
- Get the dependencies of each package and store them in a set (so we know which packages appeared as a dependency at least once). We do this with conda info [package_name]. Here’s a sample output:
As you can see in the source code, we chose to use Python to scrape this output.
- Take the set difference between all requirements and the packages that appear as a dependency, making sure to preserve the version for these “important” packages. It’s a simple snippet:
And that’s it! We have a minified environment. Simple, right?
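As a rough illustration of the steps above (a sketch, not the package's actual code), the whole procedure fits in one small function. The names `minify` and `get_dependencies` are hypothetical, the dependency lookup is passed in as a callable rather than scraped from conda, and the version numbers in the example are made up.

```python
def minify(specs, get_dependencies):
    """Keep only the specs that no other listed package depends on.

    specs: strings such as "pandas=0.16.2", as listed in environment.yml.
    get_dependencies: hypothetical callable that maps a package name to
        the names of its dependencies (for example, a wrapper around
        whatever `conda info` reports for that package).
    """
    # Step 1: map each package name to its full spec so versions survive.
    by_name = {spec.split("=")[0]: spec for spec in specs}

    # Step 2: collect every package that appears as a dependency.
    depended_on = set()
    for name in by_name:
        depended_on.update(get_dependencies(name))

    # Step 3: set difference, reported as the original versioned specs.
    return sorted(by_name[name] for name in set(by_name) - depended_on)


# Made-up dependency table and versions, used only for illustration.
fake_deps = {"bokeh": ["numpy", "pandas"], "pandas": ["numpy"], "numpy": []}
specs = ["numpy=1.9.2", "pandas=0.16.2", "bokeh=0.9.0"]

print(minify(specs, lambda name: fake_deps[name]))  # ['bokeh=0.9.0']
```

Passing the lookup in as a function keeps the set logic separate from how the dependency information is obtained, which also makes it easy to try out with a hand-written table like the one above.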
One other thing
It’s worth noting that our minifier also removes the “binary” package spec (the build string) at the end of the version string. For example, OSX and Linux may each have different binaries compiled for the platform. We remove that platform-specific bit and let conda resolve the right build for the machine we’re on.
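A minimal sketch of that stripping step, assuming specs in the `name=version=build` form that exported conda environments use. The helper name `strip_build` and the example build string `py27_0` are illustrative assumptions, not taken from the package's code.

```python
def strip_build(spec):
    """Drop the trailing build string from a "name=version=build" spec.

    Specs that have no build string (or no version) are returned as-is.
    """
    parts = spec.split("=")
    # Keep at most the name and the version; discard anything after.
    return "=".join(parts[:2])


print(strip_build("numpy=1.9.2=py27_0"))  # numpy=1.9.2
print(strip_build("numpy=1.9.2"))         # numpy=1.9.2
```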
Pitfalls / Gotchas
We’ve come across some interesting cases. For example, what if we want to include Bokeh, which depends on numpy, scipy, and pandas? Suddenly, we have to explicitly declare our PyData packages, since otherwise our “minimum set” of non-transitive dependencies is just Bokeh. Likewise, since pandas depends on numpy and scipy, including pandas alone means we don’t specify our numpy or scipy versions. Our suggestion: if you know you need numpy 1.9.2, hard-specify that version after running this minifier.