Multi-Group Clustering Steps

The steps involved with the clustering process.

NOTE: Multi-Group Clustering is considered an "Alpha" feature.

Below are the steps involved in the Multi-Group Clustering process.

  1.  Pre-Analysis Compression

Compression is always performed. New age categories are created by compressing the age categories with counts less than or equal to the user-specified minimum. The compressed data, which contain the crude rates for the new age categories, are used in the steps that follow.
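The compression idea can be sketched as follows. This is only an illustration: the text does not say which neighbor a sparse category is merged with, so the sketch assumes one total count per age category, merges each sparse category forward into the next, and folds a sparse tail into the preceding group. The function name `compress_age_categories` is hypothetical.

```python
def compress_age_categories(counts, minimum):
    """Merge adjacent age categories until each merged category exceeds `minimum`.

    ASSUMPTION: sparse categories are merged with the following category, and a
    sparse tail is folded into the preceding group. The actual rule used by the
    software is not described in the text above.
    Returns a list of (category_indices, total_count) tuples.
    """
    groups = []
    current, total = [], 0
    for idx, count in enumerate(counts):
        current.append(idx)
        total += count
        if total > minimum:          # counts <= minimum are treated as too sparse
            groups.append((current, total))
            current, total = [], 0
    if current:                      # leftover sparse categories at the end
        if groups:
            prev_idx, prev_total = groups.pop()
            groups.append((prev_idx + current, prev_total + total))
        else:
            groups.append((current, total))
    return groups


# Hypothetical counts for five age categories and a minimum of 5.
compressed = compress_age_categories([3, 2, 50, 40, 1], minimum=5)
print(compressed)  # [([0, 1, 2], 55), ([3, 4], 41)]
```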

  2.  Fit Joinpoint Models For All Ordered Groups Of Age Categories

If age consists of n categories, then there will be a total of n*(n+1)/2 ordered groups of age categories, where each group is a run of consecutive categories. For example, if there are 6 age categories, then there are 6*(6+1)/2 = 21 possible groups, including:

Age category 1

Age categories 1, 2

Age categories 1, 2, 3, 4, 5, 6

Age category 2

Age categories 2, 3

Age categories 2, 3, 4, 5, 6

Age category 5

Age categories 5, 6

Age category 6
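As a quick check of the count, the groups can be enumerated directly. This is a minimal sketch; the function name `ordered_groups` is hypothetical, and age categories are simply numbered 1 to n.

```python
def ordered_groups(n):
    """Return every run of consecutive age categories drawn from 1..n."""
    return [
        list(range(start, end + 1))
        for start in range(1, n + 1)
        for end in range(start, n + 1)
    ]


groups = ordered_groups(6)
print(len(groups))   # 21, which equals 6 * (6 + 1) / 2
print(groups[:3])    # [[1], [1, 2], [1, 2, 3]]
```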

For each group and for each number of joinpoints ranging from minK to maxK, a parallel model is fit. The values minK and maxK are set in the Joinpoint Session’s “Method And Parameters” tab. A parallel model assumes equal slopes for each age category in the group. The optimal number of joinpoints is determined by a test statistic (WBIC, BIC, or permutation test), which is selected with the “Method for Determining the Number of Joinpoints” option on the Joinpoint Session’s Clustering tab.

3.  Best Fit For Each Group Of Age Categories

The best number of joinpoints for a group of age categories is taken to be the maximum number of estimated joinpoints from the model fits of the entire group, along with each individual age category in the group. For example, if the group consists of age categories {1, 2, 3}, let fit_123, fit_1, fit_2, and fit_3 be the model fits corresponding to the entire group and each individual age category. If we let k_123, k_1, k_2, and k_3 denote the estimated number of joinpoints from these models in step 2, then the best number of joinpoints for group {1, 2, 3} is kmax = maximum{k_123, k_1, k_2, k_3}. Now the best fit for the group of age categories is defined as the fit of the entire group with kmax joinpoints.
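The selection rule above amounts to taking a maximum. The sketch below uses made-up joinpoint counts for group {1, 2, 3}; the function name `best_joinpoint_count` and the values are hypothetical.

```python
def best_joinpoint_count(k_group, k_individual):
    """Return kmax: the largest estimated joinpoint count among the
    whole-group fit and the fits of each individual age category."""
    return max([k_group, *k_individual])


# Hypothetical estimates from step 2: k_123 = 1, and k_1, k_2, k_3 = 0, 2, 1.
kmax = best_joinpoint_count(1, [0, 2, 1])
print(kmax)  # 2, so the best fit for {1, 2, 3} is the group fit with 2 joinpoints
```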

4.  Best Partition For Each Number of Clusters

The best Joinpoint models found in the previous step are then used to get the best partitions of the age categories for each number of clusters, ranging from one to the maximum number of clusters. Each cluster in a partition is a group of consecutive age categories, so the number of possible partitions of n age categories into m clusters is (n - 1)! / ((m - 1)! * (n - m)!). An example partition of the age categories 1-6 for three clusters is {1,2,3}, {4}, {5,6}. The best partition for m clusters is found by minimizing a test statistic (Q) over all possible partitions of the age categories into m clusters. The Q statistic is the sum of the residual sums of squares (RSS) from the best Joinpoint model for each group of age categories in the partition. For one cluster, the best partition is the only partition, which contains all age categories. In this case, the Q statistic is the RSS from the model fit with all age categories. For two clusters and, say, 6 age categories, there are 5 possible partitions, shown below.

| Partition of Age Categories | Best Model Fits | Q Statistic |
|---|---|---|
| 1 and 2 to 6 | Fit.1 and Fit.2to6 | RSS(Fit.1) + RSS(Fit.2to6) |
| 1 to 2 and 3 to 6 | Fit.1to2 and Fit.3to6 | RSS(Fit.1to2) + RSS(Fit.3to6) |
| 1 to 3 and 4 to 6 | Fit.1to3 and Fit.4to6 | RSS(Fit.1to3) + RSS(Fit.4to6) |
| 1 to 4 and 5 to 6 | Fit.1to4 and Fit.5to6 | RSS(Fit.1to4) + RSS(Fit.5to6) |
| 1 to 5 and 6 | Fit.1to5 and Fit.6 | RSS(Fit.1to5) + RSS(Fit.6) |

The best partition for two clusters is the partition that minimizes the third column in the table above. Continuing in this manner, the best partition is obtained for each number of clusters.
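The search in this step can be sketched as an exhaustive enumeration of cut points. This is an illustration under assumptions: the RSS values are made up (the real ones come from the best fits in step 3), and the names `contiguous_partitions` and `best_partition` are hypothetical.

```python
from itertools import combinations


def contiguous_partitions(n, m):
    """Yield every split of age categories 1..n into m clusters of consecutive categories."""
    for cuts in combinations(range(1, n), m - 1):
        bounds = (0, *cuts, n)
        yield [tuple(range(bounds[i] + 1, bounds[i + 1] + 1)) for i in range(m)]


def best_partition(n, m, rss):
    """Return (Q, partition) with the smallest Q, the sum of group RSS values.

    `rss` maps each group (a tuple of categories) to the RSS of its best fit.
    """
    return min(
        ((sum(rss[g] for g in part), part) for part in contiguous_partitions(n, m)),
        key=lambda pair: pair[0],
    )


# Made-up RSS values for all 21 groups of 6 categories: RSS = (group size)^2.
rss = {
    tuple(range(s, e + 1)): float((e - s + 1) ** 2)
    for s in range(1, 7)
    for e in range(s, 7)
}

print(len(list(contiguous_partitions(6, 2))))  # 5, matching the table above
q, partition = best_partition(6, 2, rss)
print(q, partition)  # 18.0 [(1, 2, 3), (4, 5, 6)]
```

Repeating the call for m = 1 up to the maximum number of clusters gives the best partition for each number of clusters.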

5.  Best Number of Clusters

The best number of clusters is found by computing a BIC statistic for each best partition identified in the previous step. The number of clusters with the minimum BIC value is taken as the best number of clusters. If we let \( S \) denote the sum of the number of estimated joinpoints from each model fit in the best partition, then the BIC statistic for m clusters with G age categories is defined as \( \log\left(\frac{RSS_m}{N}\right) + \frac{p}{N} \log(N) \), where \( N \) is the total number of observations, \( RSS_m \) is the Q statistic from the best partition of m clusters, and \( p = m - 1 + 2S + m + G \). If M is the best number of clusters, then the best partition of M clusters defines the final clusters of age categories unless a MADWD value has been specified.
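The BIC formula above can be written as a small function. This sketch assumes the logarithm is the natural log; the function name `clustering_bic` and all input values are hypothetical.

```python
import math


def clustering_bic(rss_m, n_obs, m, s, g):
    """BIC for the best partition into m clusters.

    rss_m : Q statistic of the best partition with m clusters
    n_obs : total number of observations (N)
    s     : sum of estimated joinpoints over the fits in the partition (S)
    g     : number of age categories (G)
    ASSUMPTION: natural logarithm.
    """
    p = m - 1 + 2 * s + m + g
    return math.log(rss_m / n_obs) + (p / n_obs) * math.log(n_obs)


# Made-up (Q, S) pairs for m = 1, 2, 3 clusters, with N = 120 and G = 6.
candidates = {1: (40.0, 1), 2: (22.0, 2), 3: (21.5, 4)}
bic_by_m = {m: clustering_bic(q, 120, m, s, 6) for m, (q, s) in candidates.items()}
best_m = min(bic_by_m, key=bic_by_m.get)
```

With these made-up inputs, moving from two to three clusters lowers Q only slightly while adding parameters, which is the trade-off the penalty term is meant to capture.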

6.  Apply Clustering MADWD To Reduce The Best Number Of Clusters

A Clustering MADWD value can be specified to set the minimum average APC difference allowed between adjacent clusters, which can further reduce the best number of clusters. More specifically, start with the best number of clusters (M) determined in the previous step and the best partition for M clusters. Compute a distance (see below) between each consecutive pair of clusters in this partition. If any distance is less than the Clustering MADWD value, then reduce the best number of clusters by one (M-1), and repeat the process on the best partition of M-1 clusters. The process continues until either all distances between consecutive clusters are greater than the Clustering MADWD value or the best number of clusters has been reduced to one. This new value for the best number of clusters (M*) is the final number of clusters, and the best partition for M* clusters defines the final groups of the age categories. The distance between two consecutive clusters is a weighted sum of absolute differences in APC values, where the weights take into account irregular intervals of years.
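The reduction loop can be sketched as below. The distance calculation itself is not reproduced, since the weights are not given in the text; the sketch takes a hypothetical callable `adjacent_distances(m)` that returns the distances between consecutive clusters in the best partition of m clusters. A distance exactly equal to the MADWD value is treated as acceptable here, which is an assumption.

```python
def reduce_clusters(best_m, adjacent_distances, madwd):
    """Lower the number of clusters until every adjacent distance reaches `madwd`.

    adjacent_distances : hypothetical callable mapping m to the list of distances
                         between consecutive clusters in the best partition of m clusters
    """
    m = best_m
    while m > 1 and any(d < madwd for d in adjacent_distances(m)):
        m -= 1
    return m


# Made-up distances for the best partitions with 4, 3, and 2 clusters.
distances = {4: [0.5, 2.0, 3.0], 3: [1.5, 0.8], 2: [2.5]}
final_m = reduce_clusters(4, lambda m: distances[m], madwd=1.0)
print(final_m)  # 2
```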

7.  Compute Age-Adjusted or Delay Age-Adjusted Rates

For each final group of age categories defined in the previous step, the age-adjusted (or delay age-adjusted) rates are computed. This data is then used for the computation of the final models in the next step.
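For orientation, the sketch below shows direct age standardization for a single cluster in a single year. It assumes the rate is a weighted average of the crude rates of the age categories in the cluster, with standard-population weights renormalized within the cluster; the text above does not spell out the exact computation, and the function name and inputs are hypothetical.

```python
def age_adjusted_rate(counts, populations, std_weights):
    """Directly standardized rate for one cluster in one year.

    counts, populations : per age category within the cluster
    std_weights         : standard-population weights for the same categories
    ASSUMPTION: weights are renormalized to sum to one within the cluster.
    """
    total_weight = sum(std_weights)
    weighted = sum(w * c / p for c, p, w in zip(counts, populations, std_weights))
    return weighted / total_weight


# Made-up data for a cluster containing two age categories.
rate = age_adjusted_rate(counts=[10, 20], populations=[1000, 1000], std_weights=[1, 3])
```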

8.  Final Models For Each Cluster

For each final cluster of age categories, the best model fit is computed as in step 2, except that the test statistic for determining the best number of joinpoints is defined by the “Model Selection Method” option on the Joinpoint Session’s “Method And Parameters” tab.