Mutual information

The intensity $a$ of a voxel located at coordinates $\textbf{x}$ within the fixed image $F$ is given by:

a = F(\textbf{x})

and similarly, the intensity $b$ of a voxel at coordinates $\textbf{y}$ within the moving image $M$ is given by:

b = M(\textbf{y})

We want to apply a coordinate transform $\textbf{T}(\textbf{y})$ to the moving image such that it registers best with the fixed image $F$. However, since $F$ and $M$ could have been obtained from different imaging modalities, we cannot simply minimize the error between intensities $a$ and $b$. Instead, we must attempt to maximize the amount of information shared between the two images. The statistical Mutual Information can therefore be used as a similarity metric:

(1)I = \sum_{a,b} p_j(a,\textbf{T}(b)) \ln \frac{p_j(a,\textbf{T}(b))}{p_F(a)p_M(\textbf{T}(b))}

which depends on the probability distributions of the voxel intensities for the fixed and moving images. We now view $a$ and $b$ as random variables with associated probability distribution functions $p_F(a)$ and $p_M(b)$ and joint probability $p_j(a,b)$. Note that applying the spatial transformation $\textbf{T}(\textbf{y})$ to $M$ will modify $p_j(a,b)$. This effect is implied by the notation $p_j(a,\textbf{T}(b))$. Furthermore, if $\textbf{T}$ results in voxels being displaced outside of the image, $p_M(b)$ will change. Perhaps more importantly, if $\textbf{T}$ remaps a voxel to a location that falls between points on the voxel grid, some form of interpolation must be employed to obtain $b$, which will modify $p_M(b)$. These effects are implied by the notation $p_M(\textbf{T}(b))$.
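As a minimal sketch (not part of the original text), the mutual information of (1) can be evaluated from a 2-D joint intensity histogram; the function name and binning are illustrative assumptions:

```python
import numpy as np

# Illustrative sketch: mutual information from a joint histogram whose
# rows index fixed-image intensity bins (a) and columns index
# moving-image intensity bins (b).
def mutual_information(joint_hist):
    p_j = joint_hist / joint_hist.sum()      # joint distribution p_j(a, b)
    p_F = p_j.sum(axis=1, keepdims=True)     # marginal p_F(a)
    p_M = p_j.sum(axis=0, keepdims=True)     # marginal p_M(b)
    nz = p_j > 0                             # 0 * ln(0) is taken as 0
    return float(np.sum(p_j[nz] * np.log(p_j[nz] / (p_F * p_M)[nz])))
```

Identical images concentrate the joint histogram on its diagonal, so MI equals the marginal entropy; statistically independent intensities give MI of zero.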

The method of interpolation used to obtain $b$ given $\textbf{T}(\textbf{y})$ is important both in terms of execution speed and solution convergence. Here we will discuss the Partial Volume Interpolation method proposed by Maes et al.


Figure 1(a) depicts the situation where $\boldsymbol{\Delta} = \textbf{T}(\textbf{y})$ has displaced a voxel (red) into the interior of a neighborhood of 8 other voxels (black). The interpolation method divides the volume defined by the 8 nearest neighbors into 8 partial volumes ($w_1$ - $w_8$) that all share the red voxel's location as a common vertex, as shown in Figure 1(b-c). We can more formally define the 8 partial volumes as functions of $\boldsymbol{\Delta}$, where $(x, y, z)$ denote the fractional components of $\boldsymbol{\Delta}$ within the voxel cell:

(2)w_1 = (1 - x)(1 - y)(1 - z) \\
w_2 = x(1 - y)(1 - z) \\
w_3 = (1 - x)y(1 - z) \\
w_4 = xy(1 - z) \\
w_5 = (1 - x)(1 - y)z \\
w_6 = x(1 - y)z \\
w_7 = (1 - x)yz \\
w_8 = xyz

Note that the partial volumes sum to unity:

\sum_{i=1}^{8} w_i = 1
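The eight weights of (2) translate directly into code; the sketch below is illustrative (the function name is an assumption), with $(x, y, z)$ the fractional parts of the displaced voxel position:

```python
import numpy as np

# Partial-volume weights of Eq. (2); (x, y, z) are the fractional parts
# of the displaced position within its 8-neighbor cell, each in [0, 1).
def pv_weights(x, y, z):
    return np.array([
        (1 - x) * (1 - y) * (1 - z),  # w_1
        x * (1 - y) * (1 - z),        # w_2
        (1 - x) * y * (1 - z),        # w_3
        x * y * (1 - z),              # w_4
        (1 - x) * (1 - y) * z,        # w_5
        x * (1 - y) * z,              # w_6
        (1 - x) * y * z,              # w_7
        x * y * z,                    # w_8
    ])
```

For any fractional position the weights are nonnegative and sum to one, consistent with the identity above.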

Once the partial volumes have been computed, they are placed into the histograms of their corresponding voxels in the 8-neighborhood: partial volume $w_1$ is placed into the histogram bin associated with $n_1$, $w_2$ with $n_2$, and so on. This method is used in computing both $p_M(\textbf{T}(b))$ and $p_j(a,\textbf{T}(b))$.
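A hedged sketch of this accumulation step (the array names and bin indices are illustrative assumptions, not from the original text):

```python
import numpy as np

# Instead of interpolating an intensity, each neighbor's histogram bin
# receives that neighbor's partial-volume weight; this builds the moving
# and joint histograms that estimate p_M(T(b)) and p_j(a, T(b)).
def pv_accumulate(joint_hist, moving_hist, a_bin, neighbor_bins, weights):
    """a_bin: fixed-image bin of this voxel; neighbor_bins: the 8
    moving-image bins n_1..n_8; weights: the 8 partial volumes w_1..w_8."""
    for b_bin, w in zip(neighbor_bins, weights):
        moving_hist[b_bin] += w
        joint_hist[a_bin, b_bin] += w
```

Because the weights sum to one, each voxel contributes exactly one count to each histogram, spread fractionally over its neighborhood.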

We now have a complete expression for the mutual information similarity metric, which we will use as a cost function that describes the quality of the registration.

(3)C = \sum_{a,b} p_j(a,\textbf{T}(b)) \ln \frac{p_j(a,\textbf{T}(b))}{p_F(a)p_M(\textbf{T}(b))}

We can attempt to achieve the optimal registration by maximizing this cost function, which is accomplished by modifying the coordinate transformation $\textbf{T}(\textbf{y})$ applied to the moving image $M$. It is important to note that $\textbf{T}$ could be any conceivable type of coordinate transformation. For example, $\textbf{T}$ could represent a set of affine transforms to be applied to the moving image. Here, however, we will focus on a parametrically defined nonrigid coordinate transformation. The transformation $\textbf{T}(\textbf{y})$ is characterized by the displacement field $\nu$, which defines how individual voxels within the moving image $M$ should be displaced such that $F$ and $M$ are optimally registered. This displacement field is computed at every voxel from the B-spline coefficients $P$ defined on a sparse array of control points:

(4)\nu_x(\textbf{y}) = \sum_{l=0}^{3} \sum_{m=0}^{3} \sum_{n=0}^{3} \beta_l(v_x) \beta_m(v_y) \beta_n(v_z) P_x(j_x+l, j_y+m, j_z+n)

where $(l, m, n)$ index the 64 control points in the immediate vicinity of a voxel located at $\textbf{y}$, and $(v_x, v_y, v_z)$ are the voxel's normalized coordinates within the region bounded by the 8 control points in its immediate vicinity. Note that $\textbf{j}$ gives the voxel's offset within that region. For simplicity, we refer to such a region as a tile.
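A minimal sketch of evaluating (4), assuming $(v_x, v_y, v_z)$ are the voxel's normalized coordinates within its tile; the function names are illustrative, not a library API:

```python
import numpy as np

def bspline_basis(t):
    """Uniform cubic B-spline basis values beta_0..beta_3 at t in [0, 1)."""
    return np.array([
        (1 - t) ** 3,
        3 * t**3 - 6 * t**2 + 4,
        -3 * t**3 + 3 * t**2 + 3 * t + 1,
        t**3,
    ]) / 6.0

# Eq. (4): x-component of the displacement at one voxel, from the 4x4x4
# neighborhood of x-coefficients P_x around tile offset (jx, jy, jz).
def displacement_x(P_x, jx, jy, jz, vx, vy, vz):
    bl, bm, bn = bspline_basis(vx), bspline_basis(vy), bspline_basis(vz)
    nu = 0.0
    for l in range(4):
        for m in range(4):
            for n in range(4):
                nu += bl[l] * bm[m] * bn[n] * P_x[jx + l, jy + m, jz + n]
    return nu
```

The cubic basis is a partition of unity, so a constant coefficient grid reproduces a constant displacement, which is a quick sanity check on an implementation.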

Having defined the coordinate transformation $\textbf{T}(\textbf{y})$ in terms of the B-spline coefficients $P$, it becomes apparent that the mutual information may be maximized through optimization of $P$. If this is to be accomplished via the method of gradient descent, an analytic expression for the gradient $\partial C / \partial P$ must be derived. The chain rule is employed to separate the expression into two partial derivatives, the first depending on the similarity metric and the second on the type of transform employed:

(5)\frac{\partial C}{\partial P} = \frac{\partial C}{\partial \nu} \times \frac{\partial \nu}{\partial P}

where the second term is easily obtained by taking the derivative of (4) with respect to a single coefficient $P_x(j_x+l, j_y+m, j_z+n)$:

\frac{\partial \nu_x}{\partial P_x(j_x+l, j_y+m, j_z+n)} = \beta_l(v_x) \beta_m(v_y) \beta_n(v_z).
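Because (4) is linear in the coefficients, this sensitivity reduces to a product of three basis values; a small illustrative check (the helper names are assumptions):

```python
import numpy as np

def bspline_basis(t):
    """Uniform cubic B-spline basis values beta_0..beta_3 at t in [0, 1)."""
    return np.array([(1 - t)**3, 3*t**3 - 6*t**2 + 4,
                     -3*t**3 + 3*t**2 + 3*t + 1, t**3]) / 6.0

# d(nu_x)/dP for the coefficient indexed by (l, m, n) within the voxel's
# 4x4x4 control-point neighborhood: the triple product of basis values.
def dnu_dP(l, m, n, vx, vy, vz):
    return bspline_basis(vx)[l] * bspline_basis(vy)[m] * bspline_basis(vz)[n]
```

Summed over all 64 neighboring coefficients, these sensitivities total one (a consequence of the partition-of-unity property of the basis).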

Referring again to the first term of (5), it is important to realize that $C$ and $\nu$ are coupled through the probability distribution $p_j$ and are therefore directly affected by the 8-neighborhood partial volume interpolation. This becomes further apparent when $\partial C / \partial \nu$ is decomposed:

\frac{\partial C}{\partial \nu} =
\sum_{n=1}^{8} \frac{\partial C}{\partial p_j(a,M(n))} \frac{\partial p_j(a,M(n))}{\partial \nu} \\

(6)\frac{\partial C}{\partial \nu} =
\sum_{n=1}^{8} \frac{\partial C}{\partial p_j(a,M(n))} \frac{\partial w_n}{\partial \nu}

where the first term may be found by taking the derivative of (3) with respect to the joint distribution $p_j$:

(7)\frac{\partial C}{\partial p_j(a,M(n))} = \ln \frac{p_j(a,M(n))}{p_F(a)p_M(M(n))} - C

(see [A]).
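Equation (7) translates directly into code; a minimal illustrative sketch (the argument names are assumptions, not from the original text):

```python
import numpy as np

# dC/dp_j for one joint-histogram bin, per Eq. (7): the log term of the
# MI integrand minus the current value of the metric C.
def dC_dpj(p_j_ab, p_F_a, p_M_b, C):
    """p_j_ab = p_j(a, M(n)); p_F_a = p_F(a); p_M_b = p_M(M(n))."""
    return np.log(p_j_ab / (p_F_a * p_M_b)) - C
```

A quick consistency check: for statistically independent intensities, $p_j = p_F\, p_M$ and $C = 0$, so the derivative vanishes.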

The second term, $\partial p_j(a,M(n)) / \partial \nu$, describes how the joint distribution changes with the vector field. Explicitly stated, a displacement vector locally transforms the coordinates of a voxel in the moving image $M$ such that:

\boldsymbol{\Delta} = \textbf{y} + \boldsymbol{\nu}

From Figure 1 it becomes immediately apparent that as the vector field is modified, the partial volumes $w_1$ - $w_8$ inserted into the moving and joint histograms $p_M$ and $p_j$ will change in size. Consequently, $\partial p_{j_n} / \partial \boldsymbol{\nu}$ can be found for $n \in \{1,\dots,8\}$ by taking the derivative of (2) with respect to each of the Cartesian directions.

Therefore, as prescribed by (6), computing $\partial C / \partial \nu$ at a given voxel in $M$ involves first cycling through the 8 bins corresponding to the 8 nearest neighbors used during the partial volume interpolation and computing $\partial C / \partial p_j$ for each, which is a scalar and therefore the same for all Cartesian directions. The 8 $\partial C / \partial p_j$ values are each weighted by the corresponding $\partial p_j / \partial \nu$. Finally, once this process has been completed for all 8 bins, the resulting $\partial C / \partial \nu$ value is assimilated into the sparse coefficient array $P$ by way of (5). Once this has been completed for all voxels, the next optimizer iteration begins and the process is repeated with a modified coefficient array $P$.
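The per-voxel accumulation just described can be sketched as follows; the derivative matrix comes from differentiating (2), and the function name and argument layout are illustrative assumptions:

```python
import numpy as np

# Eq. (6) at one voxel: combine the scalar dC/dp_j of each of the 8
# neighborhood bins with the derivative of its partial volume with
# respect to the displacement components. (x, y, z) are the fractional
# parts of the displaced position; dC_bins holds dC/dp_j for n_1..n_8.
def dC_dnu(dC_bins, x, y, z):
    # Rows: dw_n/d(nu_x, nu_y, nu_z) for n = 1..8, from differentiating (2).
    dw = np.array([
        [-(1-y)*(1-z), -(1-x)*(1-z), -(1-x)*(1-y)],  # w_1
        [ (1-y)*(1-z), -x*(1-z),     -x*(1-y)    ],  # w_2
        [-y*(1-z),      (1-x)*(1-z), -(1-x)*y    ],  # w_3
        [ y*(1-z),      x*(1-z),     -x*y        ],  # w_4
        [-(1-y)*z,     -(1-x)*z,      (1-x)*(1-y)],  # w_5
        [ (1-y)*z,     -x*z,          x*(1-y)    ],  # w_6
        [-y*z,          (1-x)*z,      (1-x)*y    ],  # w_7
        [ y*z,          x*z,          x*y        ],  # w_8
    ])
    return dw.T @ np.asarray(dC_bins)  # 3-vector: dC/d(nu_x, nu_y, nu_z)
```

When all 8 bins carry the same $\partial C / \partial p_j$ value, the columns of the derivative matrix cancel and the gradient is exactly zero.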


It is important to note that if the 8 histogram bins associated with the neighborhood are equal, then $\partial C / \partial \nu = 0$. This is a direct result of

\sum_{n=1}^{8} \frac{\partial w_n}{\partial \nu_x} = 0

and similarly for the $y$- and $z$-directions. However, when the 8 histogram bins within the neighborhood are not equal, the gradient $\partial C / \partial \nu$ is "pulled" in the Cartesian direction of the voxels associated with large $\partial C / \partial p_j$, which is desirable when attempting to maximize MI.
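This zero-sum property is easy to verify numerically; the following sketch (illustrative, not from the original text) checks that the total of the eight weights in (2) is constant, so its directional derivatives vanish:

```python
# Sum of the eight partial volumes of Eq. (2); algebraically this is
# identically 1 for any (x, y, z).
def pv_weight_sum(x, y, z):
    return ((1-x)*(1-y)*(1-z) + x*(1-y)*(1-z) + (1-x)*y*(1-z) + x*y*(1-z)
            + (1-x)*(1-y)*z + x*(1-y)*z + (1-x)*y*z + x*y*z)

# Central finite difference of the sum along x; since the sum is constant,
# the derivative is zero (up to rounding error).
def d_sum_dx(x, y, z, h=1e-6):
    return (pv_weight_sum(x + h, y, z) - pv_weight_sum(x - h, y, z)) / (2 * h)
```

The analogous checks along $y$ and $z$ follow by symmetry.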