Tuesday, October 22, 2013

Python segfault with OS X 10.9 Mavericks

If you're using python from python.org's mac installer and you've just upgraded to Mavericks, you'll find that the interactive console segfaults after the second command you type:

Python 3.3.2 (v3.3.2:d047928ae3f6, May 13 2013, 13:52:24) 
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> 1
1
>>> 2
[1]    1851 segmentation fault  python3

The root cause is this bug: http://bugs.python.org/issue18458

Though it looks like the issue has been resolved in the source, the fix hasn't made it to the downloadable builds as of this writing. The fix is to use either Apple's included version of Python 2.7.5, or to use python from homebrew.

You'll want to uninstall the existing version of python.  Instructions for that are here. In my case I wanted to remove all python.org versions, so the commands were:

sudo rm -rf /Library/Frameworks/Python.framework
cd /usr/local/bin; 
ls -l . | grep '../Library/Frameworks/Python.framework' | awk '{print $9}' | sudo xargs rm

At this point /usr/bin/python should be the only remaining version in your path and will point to the Apple supplied Python 2.7.5. If you're happy with that, you can stop here!

Note that site packages are within /Library/Frameworks/Python.framework, so any that you've got installed will need to be replaced. To get a decent python package manager back, do:

sudo easy_install pip

If you want Python 3, install it with homebrew.

Monday, October 21, 2013

Partitioned Backup Retention

I manage some servers on Amazon EC2, and recently I've been rewriting our backup scripts. Previously we had shell scripts that made a tarball, encrypted it via openssl, and dumped it in a directory. Old backups were cleaned up with a find command, and then the whole directory was synced to s3 with s3sync. Our retention policy was simply to keep the last month's daily backups, mostly because this was easy to do with find. S3 storage is cheap but definitely not free, so we'd like to do a bit better.

Requirements

I set out to find a new rotation scheme based on the following assumptions:
  1. The density of backup points as a function of age should be exponential, with many more recent backups than older backups.
  2. Backups will be triggered regularly, but might be skipped due to downtime, script failure, etc. As such, our rotation scheme should simply take a set of dates and return a list of those to delete. No special marking of backups to retain longer should be required when they are made.
  3. No incremental backups.
  4. Keep as few backups as possible while still covering a month.

Existing Options

Taking a quick look at wikipedia's backup rotation scheme page, we find a few existing options:

First In, First Out is what we were using previously, and fails on point 1 above.

Grandfather-father-son accomplishes most of what I'm looking for, but the most obvious way to use it and keep a month's backups would require more storage than I'd like - 24 hourly backups, 7 daily, 4 weekly. I tried to figure how to use the Python GrandFatherSon module to keep a more sparse set, but couldn't quite figure out what it was doing.

Tower of Hanoi is formulated for use rotating removable media. I'm looking for something similar, but that fits better with point 2 above.

A New Scheme

I make backups every hour. Just after I make a backup, I gather the datestamps of all the backups I've got stored on s3 and pass them to the rotation function. It returns a set of datestamps of backups that should be deleted.

As configuration for the rotation function, select a set of retention targets, or approximate ages around which backups should be kept. I selected 1 day, 7 days and 28 days.

The rotation function partitions the set of existing backup dates it is passed into groups between retention targets. In my case, there is a group made between now and 1 day ago, a group between 1 and 7 days old, and a group between 7 and 28 days old. Then delete all but the oldest and newest backup within each such group. As a special case, delete all but the youngest backup older than the last retention target.

Results


The plot above shows the resulting distribution of backups over time, with the retention targets shown as the blue lines.

We cover a month with 8 backups.

Code

A script with the rotation function (using my retention targets) and a program to visualize the results with matplotlib are on github. CC0 license.

Gotchas

Initially I kept only the oldest backup in each group. This resulted in big gaps when a backup aged into a new group where it was deleted as an older one was still present, while the younger group now had no backup.

For similar reasons, it is important to keep one backup older than the oldest retention target.