David Avraamides Code and other geekery

A Mercurial Testing and Coverage Hook

I've recently migrated from Subversion to Mercurial for my personal coding projects. I wanted to get more familiar with distributed version control systems and after reading up on the choices, I felt Mercurial was the best fit for my needs.

At the same time, I've been trying to be more diligent about applying good testing practices to my development routine so I thought this would be a good opportunity to take advantage of Mercurial's event hooks.

Like most version control systems, Mercurial provides a way for custom actions to be executed at certain points in the version control workflow (e.g. commit, update, push, pull). Common uses of these "hooks" are to send out email notifications when changes are checked in or to verify that a bug number is referenced in the check-in comments. One other fairly common use of an event hook, and the one I was particularly interested in, is to run a test suite across the code base to verify that all tests pass. This type of action can be tied to Mercurial's precommit event, and if any test fails, the commit is aborted. This ensures that you never check in changes that fail any of your tests.

Running Nose on precommit

Adding the following to your repository's .hg/hgrc file tells Mercurial to run the external command nosetests tests prior to committing a change to the repository. If Nose succeeds (exits with 0), Mercurial will continue with the commit. If Nose fails (exits with a non-zero value), Mercurial will abort the commit.

[hooks]
precommit = nosetests tests

The syntax used here is for an external hook, that is, Mercurial calls the external command with the specified arguments. The script has access to limited information through a set of environment variables prefixed with HG_. Mercurial also supports in-process hooks defined as a Python callable. These hooks are passed arguments that allow the script to access detailed information about the changeset, files and revisions. The syntax is slightly different:

[hooks]
precommit = python:myext.myhooks.run_test

In this case, Mercurial imports the myext.myhooks module (which must be in your PYTHONPATH) and calls run_test. For the simple case of running Nose on a standard test suite, either of these approaches works well, although the external version is probably a bit simpler and easier to wrap in a shell script if your test command requires a long list of arguments.
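A minimal sketch of what such a module might contain is shown below. The body of run_test is an assumption for illustration, and the optional command parameter is hypothetical: it exists only so the function can be exercised without Nose or Mercurial installed. The sketch relies on the convention that Mercurial passes its ui and repo objects plus event-specific keyword arguments, and treats a true return value from a Python hook as failure.

```python
import subprocess
import sys


def run_test(ui, repo, command=None, **kwargs):
    """Sketch of an in-process precommit hook.

    A true return value tells Mercurial the hook failed, which
    aborts the commit; a false value lets the commit proceed.
    """
    if command is None:
        command = ['nosetests', 'tests']   # same command as the external hook
    # Run from the repository root when a repo object is available.
    cwd = getattr(repo, 'root', None)
    status = subprocess.call(command, cwd=cwd)
    return status != 0


if __name__ == '__main__':
    # Exercise the hook outside Mercurial with stand-in commands.
    passing = [sys.executable, '-c', 'raise SystemExit(0)']
    failing = [sys.executable, '-c', 'raise SystemExit(3)']
    print(run_test(None, None, command=passing))   # False: commit proceeds
    print(run_test(None, None, command=failing))   # True: commit aborted
```

Note the inversion relative to shell exit codes: the hook returns False for success and True for failure.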

Adding a Minimum Code Coverage Hook

Checking if the code repository passes or fails a test suite is pretty simple, especially since the exit code of Nose is just what we need to signal the success or failure to Mercurial. But I also wanted to verify that my test suite had a reasonable amount of coverage of my code base. While I can just enable Nose's coverage plugin to calculate the percent coverage, I can't simply test the exit code since it says nothing about the coverage amount.

For that I needed to parse the output of coverage.py. This is fairly simple but there is one wrinkle. The output for covering one module is slightly different than for covering multiple modules. There is no "TOTAL" line when only one module is being processed.
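To make that wrinkle concrete, here is a small sketch that handles both report shapes. The multi-module report text is made up for illustration (the module names alpha and beta are not from a real run), and coverage_percent is a hypothetical helper that takes a different route from the script shown later: it prefers the TOTAL row when one exists and otherwise falls back to the last module row.

```python
import re

# Single-module report: no TOTAL line.
SINGLE = """\
Name    Stmts   Exec  Cover   Missing
-------------------------------------
calc       20     16    80%   21-25
"""

# Multi-module report (illustrative values): ends with a TOTAL line.
MULTI = """\
Name    Stmts   Exec  Cover   Missing
-------------------------------------
alpha      20     16    80%   21-25
beta       10     10   100%
-------------------------------------
TOTAL      30     26    87%
"""

# A data row: name, statement count, executed count, percent covered.
ROW = re.compile(r'^(\S+)\s+\d+\s+\d+\s+(\d+)%', re.M)


def coverage_percent(report):
    """Return the TOTAL percentage if present, else the last module's."""
    rows = ROW.findall(report)
    if not rows:
        return 0
    totals = [int(pct) for name, pct in rows if name == 'TOTAL']
    return totals[0] if totals else int(rows[-1][1])


if __name__ == '__main__':
    print(coverage_percent(SINGLE))   # 80
    print(coverage_percent(MULTI))    # 87
```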

When you run Nose with code coverage enabled, you get output like the following:

pfa-ct-imac01:code david$ nosetests --with-coverage calc.py
E...
======================================================================
ERROR: test_constants (calc.CalcTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/david/code/calc.py", line 18, in test_constants
    self.assertAlmostEquals(3.1415926, calc(['pi']))
  File "/Users/david/code/calc.py", line 6, in calc
    val = eval(expr)
  File "<string>", line 1, in <module>
NameError: name 'pi' is not defined

Name    Stmts   Exec  Cover   Missing
-------------------------------------
calc       20     16    80%   21-25
----------------------------------------------------------------------
Ran 4 tests in 0.005s

FAILED (errors=1)
pfa-ct-imac01:code david$

I needed to parse the coverage summary (the 80% value) from this output, but since I was going to that trouble, I thought I'd parse out some of the testing statistics, too:

import re
from subprocess import Popen, PIPE

def run_nose(args):
    """Run nosetests with coverage enabled, passing along the additional
    arguments. Parse out the test results (number of tests run, number of
    errors, number of failures, execution time) as well as the percent
    code coverage. Return all of these stats along with the nose output.
    """
    def parse_tests(out):
        """Parse test stats out of summary and status messages.
        Status message will contain number of errors and failures
        if there were any, else just an OK message.
        """
        ntests, time = 0, 0.0
        m = re.search(r'^Ran (\d+) tests in ([\d\.]+)s$', out, re.M)
        if m:
            ntests = int(m.group(1))
            time = float(m.group(2))

        nerrs, nfails = 0, 0
        m = re.search(r'^FAILED (.*)$', out, re.M)
        if m:
            msgs = m.group(1)[1:-1].split(', ')
            msgdict = dict([msg.split('=') for msg in msgs])
            nerrs = int(msgdict['errors']) if 'errors' in msgdict else 0
            nfails = int(msgdict['failures']) if 'failures' in msgdict else 0

        return ntests, nerrs, nfails, time

    def parse_coverage(out):
        """Find the last line of dashes and parse the percent coverage
        from the preceding line. Works for both single module reports
        and for multi-module reports that contain a final total line.
        """
        lines = out.split('\n')
        for i in reversed(range(len(lines))):
            if re.match(r'^-{20,}$', lines[i]):
                if i:
                    m = re.search(r'\d+\s+\d+\s+(\d+)%', lines[i - 1])
                    if m:
                        return int(m.group(1))
        return 0

    command = ['nosetests', '--with-coverage', '--cover-erase'] + args
    out, err = Popen(command, stdout=PIPE, stderr=PIPE).communicate()
    tests = parse_tests(out + err)
    cover = parse_coverage(out + err)
    return list(tests) + [cover, out + err]

This is part of a script I call verifycode which I tied to Mercurial's precommit hook:

[hooks]
precommit = verifycode 70 --exclude=__slow \
    --cover-package=nn,ns projects

Now whenever I commit code, the script runs a suite of verification tests to make sure I haven't broken anything. It also makes sure that my tests cover a reasonable percent of my code base. I can reuse this script across different repositories and simply change the arguments for the test coverage threshold and the other arguments which are passed on to Nose.
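The pass/fail decision itself is small. The sketch below is a hypothetical reconstruction, not the code from the real verifycode script: summarize is an invented name, and the message format only approximates the one-line status the script prints. It takes the statistics in the order run_nose returns them, plus the minimum coverage threshold, and returns an exit code along with a status message.

```python
def summarize(ntests, nerrs, nfails, seconds, cover, min_cover):
    """Return (exit_code, message) for a test run.

    The exit code is 0 only if every test passed and the coverage
    percentage is at least min_cover; Mercurial aborts the commit
    when an external hook exits with a non-zero code.
    """
    passed = ntests - nerrs - nfails
    enough = cover >= min_cover
    ok = passed == ntests and enough
    relation = '>=' if enough else '<'
    message = '%s [%.1fs, %d/%d tests passed, %d%% coverage (%s%d%%)]' % (
        'ok' if ok else 'FAILED', seconds, passed, ntests,
        cover, relation, min_cover)
    return (0 if ok else 1), message


if __name__ == '__main__':
    code, message = summarize(10, 0, 0, 3.1, 79, 70)
    print('verifycode...' + message)
    # verifycode...ok [3.1s, 10/10 tests passed, 79% coverage (>=70%)]
```

In a full script the returned code would be handed to sys.exit so that Mercurial sees the result.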

Aside: the --exclude=__slow argument excludes any test methods with __slow in their name. This is a convention I follow to make it easy to run a subset of my tests quickly. This particular project involves testing a neural network which can be slow.

The complete script is available here. The following is the output from a commit.

david$ hg ci -m "finished first draft of article"
verifycode...ok [3.1s, 10/10 tests passed, 79% coverage (>70%)]

Tips for Writing Mercurial Hooks

As I was going through this, I ran into a few problems and learned a few tips along the way. The first problem was getting in-process hooks to work properly on Windows XP. It turns out there is a known bug with the binary installer for Mercurial on Windows which messes up the importer. The workaround is to use the explicit path of the Python source file for an in-process hook:

[hooks]
precommit = python:c:/path/to/hook/file/myhooks.py:hookfunc

That was ugly enough to convince me to use external hooks instead. I also didn't really need the detailed information available through in-process hooks, and the general advice you'll find is to stick with external hooks when possible.

The second thing I found useful was the fact that you can install multiple hooks under the same event and Mercurial will call them in alphabetical order. You just need to append something to the name, after a period, to keep the hooks unique. You can use this to add a hook that will always be called last and will fail so you can test your hook without actually committing any changes:

[hooks]
precommit = my.real.hook
precommit.zstop = false

Mercurial will call the hook with the .zstop suffix after the real one. Since it simply calls the *nix false command, the commit will fail.

The third trick I learned to help me debug hooks, when I was struggling with in-process hooks, was to write a debug hook that can be called in-process and externally. The following script shows the approach I used.

#! /usr/bin/env python
import os
import sys
from mercurial import ui, hg

def debug_hook(ui, repo, **kwargs):
    ui.status('repo.root: %s\n' % repo.root)
    fmod, fadd, frem, fdel, funk, fign, fcln = repo.status()
    files = [('M', f) for f in fmod] + \
            [('A', f) for f in fadd] + \
            [('R', f) for f in frem] + \
            [('D', f) for f in fdel]
    ui.status('repo.status:\n')
    for f in files:
        ui.status('  %s %s\n' % f)

    changectx = repo['tip']
    ui.status('change.desc: %s\n' % changectx.description())
    return False

if __name__ == '__main__':
    print 'environment:'
    keys = sorted([k for k in os.environ.keys() if k.startswith('HG')])
    size = max(map(len, keys))
    for k in keys:
        print "  %-*s: %s" % (size, k, os.environ[k])
    hgui = ui.ui()
    repo = hg.repository(hgui, '.')
    debug_hook(hgui, repo)
    sys.exit(0)

You can use it as both an in-process hook and an external hook like this:

[hooks]
# external
pretxncommit.ext = ~/code/hgext/hgdebug.py
# in-process
pretxncommit.in = python:hgext.hgdebug.debug_hook

You have to have the module in your Python path, of course. Also, you need to make the script executable so you can call it directly as an external hook (or you could invoke it as python ~/code/hgext/hgdebug.py).

Mercurial over Subversion

I moved to Mercurial from Subversion and I'm still getting my feet wet at this point, but I really like it so far. It's fairly easy to learn, it's similar enough to Subversion in many ways that the commands don't seem too foreign, and it was very easy to convert my Subversion projects over to try it out. I still use Subversion at work and haven't found it confusing to bounce back and forth between the two.

As I'm not using it on a team, I'm not getting the full benefits of a DVCS, but the nice thing is that most of the workflows scale nicely down to a team of one. If you want to get started, I found these resources very helpful:

  • Mercurial: The Definitive Guide: I bought this book (and I suggest you do, too), but it's also available online for free. After reading it, I find it nice to just browse to it online and review certain sections.
  • Hg Init: a Mercurial tutorial: Joel Spolsky's company, Fog Creek Software, switched over to Mercurial from Subversion and now offers a hosted Mercurial service, with a code review work flow that integrates nicely with their FogBugz bug tracking system. He wrote this nice, short tutorial on how to use Mercurial and what he found the advantages to be.

Password-less ssh Access

I use ssh to manage my OS X and Ubuntu machines. It's one of those things I set up once when I build a new machine, but then don't think about too much and often forget the steps to set it up correctly, so I thought I'd write it down.

The steps are pretty simple:

  1. create a key pair
  2. copy the key to the remote host
  3. create a "Host" alias for the remote host in your ssh config file (optional)

Create the ssh Key Pair

You really only need to do this once and can copy the same public key to multiple hosts, but you can also create a separate key pair for each host if you prefer.

Use ssh-keygen to create the key. I typically use RSA which is the default on most systems, but like to specify it explicitly just in case.

$ ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/home/user/.ssh/id_rsa): 
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in /home/user/.ssh/id_rsa.
Your public key has been saved in /home/user/.ssh/id_rsa.pub.
The key fingerprint is:
60:d9:e3:8c:d9:e8:1e:17:78:42:3f:a1:e9:b6:de:06 user@host

The default location for the key files is in the ~/.ssh directory which will be created if it doesn't already exist. The id_rsa.pub file is the public key. This is the file you will put on the remote hosts.

Copy the Public Key to the Remote Host

You need to copy the public key to the remote host so that the machine will recognize the local host when it tries to open an ssh connection.

$ cat ~/.ssh/id_rsa.pub | ssh user@remote.host 'cat >> ~/.ssh/authorized_keys'
user@remote.host's password:

This appends the public key to the list of authorized ssh keys on the remote host (creating the file if needed). Since the key isn't there yet, we are prompted for the password. Once we run the command above, we can test it out and should be able to connect without a password.

$ ssh user@remote.host
Linux remote.host 2.6.24-19-server #1 SMP Wed Jun 18 15:18:00 UTC 2008 i686
No mail.
Last login: Tue Aug 11 21:14:41 2009 from my.local.isp
$
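One caveat with the cat one-liner is that it appends the key every time it is run and leaves file permissions alone, while sshd in its default StrictModes setting can ignore an authorized_keys file that others are able to write to. The sketch below shows one way to handle both concerns, assuming it runs on the remote host. The authorize_key name and the example key string are hypothetical.

```python
import os
import tempfile


def authorize_key(public_key, ssh_dir):
    """Append public_key to ssh_dir/authorized_keys unless already present.

    Returns True if the key was added, False if it was already listed.
    """
    key = public_key.strip()
    if not os.path.isdir(ssh_dir):
        os.makedirs(ssh_dir)
    os.chmod(ssh_dir, 0o700)                  # directory: owner only
    path = os.path.join(ssh_dir, 'authorized_keys')
    content = ''
    if os.path.exists(path):
        with open(path) as f:
            content = f.read()
    if key in [line.strip() for line in content.splitlines()]:
        return False
    with open(path, 'a') as f:
        # Avoid gluing the new key onto an unterminated last line.
        if content and not content.endswith('\n'):
            f.write('\n')
        f.write(key + '\n')
    os.chmod(path, 0o600)                     # file: owner read/write only
    return True


if __name__ == '__main__':
    demo_dir = os.path.join(tempfile.mkdtemp(), '.ssh')
    key = 'ssh-rsa AAAAexample user@host'     # placeholder, not a real key
    print(authorize_key(key, demo_dir))       # True: key added
    print(authorize_key(key, demo_dir))       # False: already present
```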

Add an Alias for the Remote Host

With only one remote host, it's pretty easy to remember the ssh command to connect to it. But once you have many hosts, each with potentially different user IDs and ssh ports, it can get confusing to remember the correct parameters for each remote machine.

You can edit the ~/.ssh/config file and put host aliases in it with default values that will be used when connecting to remote sites. You can even use wild cards for host names to factor out common settings.

Host web
    User usr123
    HostName www.catchy-domain-name.com

Host svn
    User usr456
    HostName svn.catchy-domain-name.com

Host *
    Port 2992

So instead of connecting to the first host via

$ ssh -p 2992 usr123@www.catchy-domain-name.com

we can just type

$ ssh web
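The reason the Port setting under Host * still applies to web is that ssh walks the file from top to bottom, uses every Host block whose pattern matches the name you typed, and keeps the first value it finds for each option. The sketch below imitates that lookup in a simplified way. The resolve function is hypothetical and ignores features such as negated patterns, Match blocks and the = separator.

```python
import fnmatch

CONFIG = """\
Host web
    User usr123
    HostName www.catchy-domain-name.com

Host svn
    User usr456
    HostName svn.catchy-domain-name.com

Host *
    Port 2992
"""


def resolve(alias, config_text):
    """Collect the options that apply to alias; the first value seen wins."""
    options = {}
    applies = True          # options before any Host line apply to all hosts
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith('#'):
            continue
        keyword, _, value = line.partition(' ')
        value = value.strip()
        if keyword.lower() == 'host':
            # A Host line may list several patterns separated by spaces.
            applies = any(fnmatch.fnmatchcase(alias, pattern)
                          for pattern in value.split())
        elif applies:
            options.setdefault(keyword.lower(), value)
    return options


if __name__ == '__main__':
    print(resolve('web', CONFIG))
```

Because the first value wins, specific hosts should come before the Host * block, as in the example config.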

Changing the ssh Port

By default sshd uses TCP port 22 for all connections. This makes it a common target for hackers so it's wise to move it to a different port to make it a little harder for malware to attack your machine. On the remote machine, change the port definition in the sshd config file.

$ sudo vi /etc/ssh/sshd_config
# change the line containing "Port 22" to a number above 1023,
# or add the line if it's missing
$ sudo /etc/init.d/ssh restart

Now you'll need to use ssh with the -p option to specify the new port, or define this in the ~/.ssh/config file as we did above.


Mode-Aware Google Help in Emacs

While surfing the other night, I ran across a nice Emacs Lisp function for using Google to do context-sensitive help from inside Emacs. I liked how simple it was, and had always been curious about writing some Emacs Lisp code (my .emacs file is a random collection of snippets I egregiously pilfered from various locations), so I thought I would enhance the solution by making it aware of what major-mode I was in.

Building the Search URL

John Conners, the author of the original post, built his URLs with the site: and inurl: keywords to limit the search to a specific site and to URLs containing a specific string or pattern. He also appends the "I'm Feeling Lucky" button (btnI) query string parameter so you will jump directly to the top hit. I simply modified his function to make the site, inurl and lucky parameters optional to give me a little more control over the search.

(defun search-site-url (keyword &optional site inurl lucky)
  "Do a Google search for KEYWORD. Restrict to SITE and INURL, if specified.
Jump to best match (I Feel Lucky) if LUCKY set.
"
  (concat "http://www.google.com/"
          (format "search?q=%s" (url-hexify-string keyword))
          (if site (format "+site:%s" (url-hexify-string site)))
          (if inurl (format "+inurl:%s" (url-hexify-string inurl)))
          (if lucky "&btnI")))
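To see what kind of URL this produces, here is a rough Python 3 rendering of the same logic. It is a sketch for illustration only: it assumes that urllib.parse.quote with safe='' encodes about the same characters as url-hexify-string, and the keyword border-color is just an example.

```python
from urllib.parse import quote


def search_site_url(keyword, site=None, inurl=None, lucky=False):
    """Python rendering of the search-site-url logic above."""
    url = 'http://www.google.com/search?q=' + quote(keyword, safe='')
    if site:
        url += '+site:' + quote(site, safe='')
    if inurl:
        url += '+inurl:' + quote(inurl, safe='')
    if lucky:
        url += '&btnI'
    return url


if __name__ == '__main__':
    print(search_site_url('border-color', 'www.w3schools.com', '/css/', True))
    # http://www.google.com/search?q=border-color+site:www.w3schools.com+inurl:%2Fcss%2F&btnI
```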

Making the Search Mode-Aware

Next I changed John's wxhelp function to check the major-mode you are editing in, and use that to configure the arguments to the search-site-url function (I also renamed the function to the more general context-help). This makes it easy to add new search modes and customize them so they work well based on what you are looking for.

(defun context-help ()
  "Open a browser window showing documentation for the word under the point.
Uses `major-mode' to optionally refine the search to a specific web site,
or a specific pattern in the URL. Defaults to a simple keyword search.
Uses `search-site-url' to do the actual search.
"
  (interactive)
  (browse-url
   (apply 'search-site-url
          (thing-at-point 'symbol)
          (cond
            ((equal major-mode 'css-mode)
             '("www.w3schools.com" "/css/" t))
            ((equal major-mode 'emacs-lisp-mode)
             '("www.gnu.org" "/emacs/"))
            ((or (equal major-mode 'html-mode)
                 (equal major-mode 'html-helper-mode))
             '("www.htmlquick.com" "/reference/" t))
            ((equal major-mode 'javascript-mode)
             '("www.w3schools.com" nil t))
            ((equal major-mode 'python-mode) 
             '("docs.python.org" "/ref/" t))
            (t nil)))))

You can see different uses of the various arguments in the code:

  • CSS: searches www.w3schools.com and looks for URLs with /css/ in them.
  • Emacs Lisp: searches www.gnu.org and looks for /emacs/ in the URL, but does not automatically jump to the top hit. I did this because sometimes I'm looking for the Emacs editor sections of the help and sometimes I'm looking for the Emacs Lisp sections. This way I just get the usual Google search hits and can scan them to find the best one.
  • HTML: for HTML help, I use the same search against www.htmlquick.com in either html-mode or html-helper-mode.
  • JavaScript: searches www.w3schools.com again, but does not limit with the inurl: keyword. This is because JavaScript language pages and JavaScript DOM pages do not have a common URL substring on that site.
  • Python: searches the standard Python documentation web site. This is the one I've been using the most and seems to provide good results most of the time.
  • The final default case just does a Google keyword search. This is reached if none of the major-mode cases match.

This has proved to be really useful in the few days that I've been using it.


Digging Into Spotlight

I came across a nice article on advanced searching using Spotlight that talked about search keywords like kind and date, and as I read it I thought to myself, "I wonder what other cool keywords you can use in Spotlight?" At first blush, this seems like the type of innocent question that should be quick and easy to answer, but as is often the case with technology, things are a bit more complicated than they first appear.

Dropping to the Command Line

My first thought was just to search for help on Spotlight, and that's not a bad starting place as the page covers a number of keywords, more correctly called metadata attributes. Spotlight's help covers kind, author, date, created and by as well as the boolean operators AND, OR and NOT. But I knew there were many other metadata fields that were commonly used in files such as images and audio files. I had come across the mdls shell command before which lists the metadata fields on a file. A quick check of a JPEG image revealed all kinds of interesting data:

da-imac-01:stuff  david$ mdls IMG_3564.JPG 
kMDItemAcquisitionMake         = "Canon"
kMDItemAcquisitionModel        = "Canon EOS 10D"
kMDItemAperture                = 0.970855712890625
kMDItemBitsPerSample           = 32
kMDItemColorSpace              = "RGB"
kMDItemContentCreationDate     = 2008-04-22 18:40:25 -0400
kMDItemContentModificationDate = 2008-04-22 18:40:25 -0400
...
kMDItemFlashOnOff              = 0
kMDItemFNumber                 = 1.399999976158142
kMDItemFocalLength             = 50
...

This is a truncated list of the 50+ fields in one of my image files. Note some of the interesting ones like the aperture, flash setting and focal length.

I played around with another of the Spotlight/metadata shell commands: mdfind. This lets you do the equivalent of a Spotlight search from the command line and after a bit of trial and error, guessing the keyword names and value formats was fairly easy:

da-imac-01:stuff david$ mdfind make:canon focallength:50 flash:0 iso:125
/Users/david/Desktop/Turks, April 2008/IMG_3564.JPG
/Users/david/Pictures/iPhoto Library/Originals/2008/Turks, April 2008/IMG_3564.JPG
/Users/david/Pictures/iPhoto Library/Originals/2008/Museum Visit/IMG_3273.JPG
/Users/david/Pictures/iPhoto Library/Originals/2008/Museum Visit/IMG_3276.JPG
/Users/david/Pictures/iPhoto Library/Originals/2008/Mar 23, 2008/IMG_3288.JPG
...

Peeling the Onion with DTrace

Although these shell commands are very useful, the man pages for the commands do not list the valid search keywords. I knew there must be a list of the keywords used in the Spotlight search bar that mapped to these constant names, so I thought, "What a great time to learn dtrace!"

For those of you who haven't heard of dtrace, I encourage you to play around with it. It's a very powerful tool for doing live probing and tracing of low level activity in the operating system. After skimming this nice tutorial I tried this command in one window:

da-imac-01:bin david$ sudo dtrace -n 'syscall::open*:entry /execname == "mdfind"/ \
    { printf("%s %s", execname, copyinstr(arg0)); }'
Password:
dtrace: description 'syscall::open*:entry ' matched 3 probes

and then ran my mdfind command again in another Terminal window. The dtrace "script" says to trace all system calls whose name begins with "open" when the system call is entered, but only if they were called from the mdfind process, and then print out the name of the process and the first argument (which in the case of open is the file or device name). The name of the system call itself appears in dtrace's default FUNCTION:NAME column. That resulted in lots of calls like this showing mdfind opening all kinds of metadata importer files, which I assume are libraries that know how to manipulate certain types of metadata attributes:

CPU     ID                    FUNCTION:NAME
  0  18390              open_nocancel:entry mdfind /System/Library/Spotlight/\
    Audio.mdimporter
  0  18390              open_nocancel:entry mdfind /System/Library/Spotlight/\
    Audio.mdimporter/Contents
  0  17604                       open:entry mdfind /dev/autofs_nowait
  0  17604                       open:entry mdfind /System/Library/Spotlight/\
    Audio.mdimporter/Contents/Info.plist
  0  18390              open_nocancel:entry mdfind /System/Library/Spotlight/\
    Chat.mdimporter
  0  18390              open_nocancel:entry mdfind /System/Library/Spotlight/\
    Chat.mdimporter/Contents
...

But the part of the trace I was most interested in was near the very end:

...
  1  17604                       open:entry mdfind /System/Library/Frameworks/\
    CoreServices.framework/Versions/A/Frameworks/Metadata.framework/\
    Resources/MDPredicate.plist
  1  17604                       open:entry mdfind /dev/autofs_nowait
  1  17604                       open:entry mdfind /System/Library/Frameworks/\
    CoreServices.framework/Versions/A/Frameworks/Metadata.framework/\
    Resources/English.lproj/MDPredicateKeywords.plist
  1  17604                       open:entry mdfind /dev/autofs_nowait
  1  17604                       open:entry mdfind /System/Library/Frameworks/\
    CoreServices.framework/Versions/A/Frameworks/Metadata.framework/\
    Resources/English.lproj/schema.strings
...

Note the files MDPredicateKeywords.plist and schema.strings in the /System/Library/Frameworks/CoreServices.framework/Versions/A/Frameworks/Metadata.framework/Resources/English.lproj folder. I tried looking at the schema.strings file but it was in a binary format. So I tried open schema.strings and sure enough, Xcode launched and loaded the file, which contains over 400 lines of mostly metadata keyword definitions. The ones we are interested in are near the end and of the form kMDItemXXX.ShortName = yyy:

"kMDItemPixelHeight.ShortName"                = "pixelheight,height";
"kMDItemPixelWidth.ShortName"                 = "pixelwidth,width";
"kMDItemWhiteBalance.ShortName"               = "whitebalance";
"kMDItemAperture.ShortName"                   = "aperture,fstop";
"kMDItemAudioEncodingApplication.ShortName"   = "audioencodingapplication";
"kMDItemComposer.ShortName"                   = "composer,author,by";
"kMDItemLyricist.ShortName"                   = "lyricist,author,by";
"kMDItemStarRating.ShortName"                 = "starrating";

These are just a few of the dozens of entries to whet your appetite. For the most part, I've found them to work as expected, with one exception: starrating. I never got any hits using it so I tried using mdls on an MP3 that I knew had a rating set in iTunes and there was no metadata attribute set on it for the iTunes rating. So I guess all you Mac developers out there should "do as Apple says, not as Apple does."

Satisfaction

One of the things that I really like about OS X is the ability to work with the system at varying levels of depth. This diversion started with me playing around with the Spotlight search bar: a very advanced "desktop search" feature found only in the most modern operating systems. But when I wanted to learn more, I was able to easily muck around at the command line and experiment with the very same infrastructure that Spotlight is built on. Finally, I was able to leverage a very powerful, low level system tool, dtrace, to probe the details of what was going on inside OS X, which led me to the answer I was looking for.


Amazon vs. iTunes

I bought some new music online the other night. It was easy, cheap, and the songs appeared directly in my iTunes library and synced up with my iPod without a hitch as usual. I've done this many times before so it wasn't really noteworthy, except this time I didn't buy any of the music on the iTunes Music Store. And I don't think I will again.

I should say that I'm a little old school in terms of buying music. Of my 4000+ track music collection only 90 songs have been purchased online. In the past I would typically buy CDs and rip them to MP3s, and now have over 400 CDs in my "physical" collection (collecting dust in the basement, truth be told). But most of the 90 tracks I have purchased online have been singles bought in the last year.

So when I was putting together a playlist for a party the other night, I was just about to click in iTunes when I thought "why not try out Amazon.com?" I'd heard they had a good collection, all DRM-free, with equal or better pricing than iTunes. And you can usually count on Amazon.com to make the buying experience simple and smooth. So I surfed over to Amazon.com, searched for the Psychedelic Furs "Love My Way" and bought my first MP3 on Amazon.com.

The first time you purchase an MP3 on Amazon.com, it installs an application that manages the file download process and automatically adds the files to your iTunes library. The files are fully tagged and include artwork just like a song you would buy from the iTunes Music Store. Additional purchases have a very similar experience to iTMS: search for a song, play a sample of it if you like, one click to buy and in a few seconds it's in iTunes.

Amazon's MP3s are also pretty high quality, encoded at or near 256 kbps, some using variable bit-rate and others using constant bit-rate. I also noticed that they use the LAME 3.x encoder, considered by many to be one of the best MP3 encoders out there.

In all I bought 10 singles and one 13-track album, spending a total of $17.67. Of the 10 singles I purchased, 8 were $0.99, and 2 were $0.89. The album was $7.97. Had I purchased these on iTMS I would have paid an additional $2.22, or about 12.6% more (all of the songs were $0.99 on iTMS and the album was $9.99, and interestingly enough none of the songs I bought were available in iTunes Plus DRM-free format).
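For anyone who wants to check the comparison, the arithmetic can be redone from the listed prices. The sketch below works in integer cents to avoid floating-point rounding; the variable names are only for illustration.

```python
# Prices in cents, taken from the figures listed above.
amazon_singles = [99] * 8 + [89] * 2
amazon_album = 797
itunes_singles = [99] * 10
itunes_album = 999

amazon_total = sum(amazon_singles) + amazon_album
itunes_total = sum(itunes_singles) + itunes_album
difference = itunes_total - amazon_total
percent_more = 100.0 * difference / amazon_total

print('Amazon total: $%.2f' % (amazon_total / 100.0))
print('iTunes total: $%.2f' % (itunes_total / 100.0))
print('Difference:   $%.2f (%.1f%% more)' % (difference / 100.0, percent_more))
```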

So not only did I save a couple of bucks, but more importantly I purchased DRM-free music where I'm not limited on how and where I can play my music. And the purchasing experience was about as close to a fully-iTunes-integrated solution as you could get without the software being written by Apple. Apple may need to swallow a bit of its pride and go back to the record companies to work out DRM-free deals with everyone or they may be in for a tough fight with Amazon.com over digital music.