Handling Continuous Integration Failures

My blogging activity has been subdued recently. To try and remedy this I’ve decided to write some entries on build engineering, a subject I’ve looked at in some depth over the last few years.

To start off with, here’s an entry about Automated Continuous Integration. Automated CI is normally widely accepted when introduced to a team. However, there’s often confusion about what to do when the automated build breaks but developers can still build without problems on their own machines.

The two causes I’ve seen most often are:

1 – Problems retrieving updates from Source Control

2 – The part of the build that’s failing is not used by developers

The first of these is normally easy to diagnose since it happens early in the build process, before anything else has a chance to run. It can occur when a file checked into source control is overwritten by the build process. Often the simplest solution is to delete the build server’s copy of the source tree entirely and check it out cleanly from Source Control.

The second cause can be caught much more easily by following this pattern:

Be able to run the complete integration build from any development machine

Typically a full integration build should include the following tasks:

– Full, clean compile

– Deployment to local app server

– Running all the tests

– Publishing of distributable files to file server

– Labelling of source control

People sometimes set up specific scripts, available only on the build server, that perform some of these functions. This causes pain at exactly the moment the build script starts failing. If you can run the entire build on a developer machine using sensible default values (e.g. using a ‘test’ name for labelling and distributable locations), you can debug the problem without having to go onto the build server.

Moreover, if you can perform a full, clean compile, deployment and test run in one easy step from a developer machine, then developers will likely use it and catch problems before checking in broken code.
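
As a minimal sketch of what such a one-step entry point could look like in NAnt: the target names (clean, compile, deploy, test, full) are assumptions made for illustration, and the echo tasks are placeholders for the real work each step would do.

```xml
<?xml version="1.0"?>
<project name="MyProject" default="full">
  <!-- Hypothetical target names; each echo stands in for the real tasks. -->
  <target name="clean">
    <echo message="Remove all previous build output" />
  </target>

  <target name="compile" depends="clean">
    <echo message="Compile everything from scratch" />
  </target>

  <target name="deploy" depends="compile">
    <echo message="Deploy to the local app server" />
  </target>

  <target name="test" depends="deploy">
    <echo message="Run all the tests" />
  </target>

  <!-- Single entry point intended for both developers and the build server. -->
  <target name="full" depends="test" />
</project>
```

Because ‘full’ is the default target here, running NAnt against this script without naming a target runs the whole chain in order.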

Here’s an example of this using NAnt and CruiseControl.NET.

In NAnt, we can have the following in a build script to define some properties and specify how to publish a distributable file:

<property name="publish.basedir" value="publish" />
<property name="label-to-apply" value="temp" />

<target name="dist.publish" depends="dist">
    <property name="publish.dir" value="${publish.basedir}\${label-to-apply}" />
    <mkdir dir="${publish.dir}" />
    <copy todir="${publish.dir}">
        <fileset basedir="dist">
            <includes name="*"/>
        </fileset>
    </copy>
</target>

The ‘dist.publish’ target relies on two properties, publish.basedir and label-to-apply. Setting them in the script gives them sensible defaults for running on a developer machine. On the real build machine, however, we want more specific values. We can get these by specifying the properties when NAnt is started, which overrides the defaults.
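
Properties passed with -D on the NAnt command line are read-only, so the defaults in the script do not replace them. As a sketch, the defaults can also be marked with overwrite="false" to make that intent explicit. The ‘show-settings’ target below is a hypothetical helper, not part of the original script. It echoes the values actually in effect, which can help when comparing a developer run with a build server run.

```xml
<?xml version="1.0"?>
<project name="MyProject" default="show-settings">
  <!-- overwrite="false" keeps any value that already exists when this line runs. -->
  <property name="publish.basedir" value="publish" overwrite="false" />
  <property name="label-to-apply" value="temp" overwrite="false" />

  <!-- Hypothetical helper target: prints the values this run will use. -->
  <target name="show-settings">
    <echo message="publish.basedir = ${publish.basedir}" />
    <echo message="label-to-apply  = ${label-to-apply}" />
  </target>
</project>
```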

If we’re using CruiseControl.NET, the ‘label-to-apply’ property is always set to the build number for us. We can override other properties using the following CCNet configuration option:

<cruisecontrol>
    <project name="MyProject">
        <build type="nant">
            <buildArgs>-D:publish.basedir=d:\MyProject\PublishedBuilds</buildArgs>
            .
            .
        </build>
        .
        .
    </project>
</cruisecontrol>

So, summing up, automated CI build failures are bound to occur. However, if we set up a build process that is repeatable on machines other than the build server, such problems become much easier to solve.