Mutual information

The intensity

of a voxel located at coordinates ( ) within the fixed image is given by:

and similarly, the intensity

of a voxel at coordinates ( ) within the moving image is given by:

We want to apply a coordinate transform

to the moving image such that it registers best with the fixed image . However, since and could have been obtained from different imaging modalities, we cannot simply minimize the error between intensities and . Instead, we must attempt to maximize the amount of information shared between the two images. The statistical Mutual Information can therefore be used as a similarity metric:

which depends on the probability distributions of the voxel intensities for the static and moving images. We now view

and as random variables with associated probability distribution functions and and joint probability . Note that applying the spatial transformation to will result in modification of . This effect is implied with the notation . Furthermore, if results in voxels being displaced outside of the image, will change. Perhaps more importantly, if results in a voxel being remapped to a location that falls between points on the voxel grid, some form of interpolation must be employed to obtain , which will modify . These effects are implied with the notation .

The method of interpolation used to obtain

given is important both in terms of execution speed and solution convergence. Here we will discuss the Partial Volume Interpolation method proposed by Maes et. al.


Figure 1(a) depicts the situation where

has displaced a voxel (red) to the center of a neighborhood of 8 other voxels (black). The interpolation method divides the volume defined by the 8-nearest neighbors (red) into 8 partial volumes (w1 - w8) that all share the red voxel’s location as a common vertex Figure 1(b-c). We can more formally define the 8 partial volumes as a function of :

Noting that

Once the partial volumes have been computed, they are placed into the histograms of their corresponding voxels in the 8-neighborhood. Partial volume

is placed into the histogram bin associated with , with , etc. This method is used in computing both and .

We now have a complete expression for the mutual information similarity metric, which we will use as a cost function that describes the quality of the registration.

We can attempt to achieve the optimal registration by maximizing this cost function, which is achieved through modification of the coordinate transformation

applied to the moving image . It is important to note that the could be any type of conceivable coordinate transformation. For example, could represent a set of affine transforms to be applied to the moving image. Here, however, we will be focusing on a parametrically defined nonrigid coordinate transformation. The transformation is characterized by the displacement field , which defines how individual voxels within the moving image should be displaced such that and are optimally registered. This displacement field is computed at every voxel given the B-spline coefficients defined for the sparse array of control-points:

where (

) form the indices of the 64 control points in the immediate vicinity of a voxel located at ( ). Note, ( ) forms the voxel’s offsets within the region bound by the 8 control-points in the immediate vicinity of the voxel. For simplicity, we refer to such a region as a tile.

Having defined the coordinate transformation

in terms of the B-spline coefficients , it is becomes apparent that the mutual information may be maximized through optimization of . If this is to be accomplished via the method of gradient descent, an analytic expression for the gradient must be derived. The chain rule is employed to separate the expression into two partial derivatives with the first being dependent on the similarity metric employed and the second on the type of transform employed:

where the second term is easily obtained by taking the derivative of (4) with respect to


Referring again to the first term of (5), it is important to realize that

and are coupled through the probability distribution and are therefore directly affected by the 8 neighborhood partial volume interpolation. This becomes further apparent when is further decomposed:

where the first term may be found by taking the derivative of (3) with respect to the joint distribution


see {A}.

The second term,

, describes how the joint distribution changes with the vector field. Explicitly stated, a displacement vector locally transforms the coordinates of a voxel in the moving image such that:

From Figure fig:pv8 it becomes immediately apparent that as the vector field is modified the partial volumes

- to be inserted into the moving and joint histograms and will change in size. Consequently, can be found for by taking the derivative of (2) with respect to each of the Cartesian directions.

Therefore, as prescribed by (6), computing

at a given voxel in involves first cycling through the 8-bins corresponding to the 8 nearest neighbors used during the partial volume interpolation and computing for each, which is scalar and therefore the same for all Cartesian directions. The 8 values are each weighted by one . Finally, once this process has been completed for all 8 bins, the resulting values is assimilated into the sparse coefficient array by way of (5). Once this is completed for all voxels, the next optimizer iteration begins and the process is repeated with a modified coefficient array .


It is important to note that if all 8 histograms associated with the neighborhood are equal, the value of

. This is a direct result of

and similarly for the

- and -directions. However, when the 8 histograms within the neighborhood are not equal, this will result in the gradient being “pulled” in the Cartesian direction of voxels associated with large , which is desirable when attempting to maximize MI.