Overview of the Clustering Process

This section defines the Multi-Group Clustering process and its associated parameters.

NOTES:

The Multi-Group Clustering is considered an "Alpha" feature.

What is Clustering?

When analyzing trends for age-specific groups, the analyst may not want to analyze each group separately. The clustering algorithm incorporated into the Joinpoint software optimally combines groups so that the trends are as parallel as possible within each contiguous grouping, while separating groupings where the trends differ. Models are then fit separately for each cluster using directly standardized age-adjusted trends within each cluster.

Why would you use Clustering?

Fitting common models for clustered contiguous groupings is a more parsimonious way to present trends than having a separate model for each group. Examining trends in age-adjusted rates for all ages combined may mask important differences in trends for specific age groups. In addition, when age-adjusting, the trend is invariant to the choice of the standard only if the age-specific trends are parallel. One way of selecting age groupings for age-specific analyses is to use knowledge of how specific cancers progress or differ by age, e.g., separating trends for female gynecologic cancers and female breast cancer into pre- and post-menopausal groups. Absent a strong substantive reason to form specific age groupings, an alternative is to use a data-dependent algorithm that partitions the age groups into clusters of contiguous age groupings for which cancer (or other disease) incidence or mortality rates follow the same trend, while separating clusters where the trends differ.

Pre-Analysis Count Compression

If the counts in some of the groups are small, there is insufficient statistical power to determine whether their trends differ from those in adjacent groups. It is therefore better to collapse age groups that have small case counts prior to the analysis. In the case of age groups for cancer, the counts are usually small for the very youngest age groups (where rates are very low) or the very oldest age groups (where the population size is small). Prior to running the clustering algorithm, analysts can combine age groups to get sufficient counts, or exclude the youngest or oldest age groups with small counts from the analysis. The pre-analysis count compression is a set of automated algorithms that ensure that the case counts in every group exceed a minimum specified by the user for each calendar year.

Customize Adjustment Variable Labels

You can use this feature to change the labels of the age groups found in your input data file. A dialog will appear and display each age group found in your data file along with its start and end range label. The dialog uses the SEER*Stat dictionary labels (if a dictionary exists for your input data file).

Auto-Compress Counts Less Than or Equal to "N"

This is where a user can set the minimum count to be used for the age compression process. Joinpoint will combine age groups to create a count that is greater than the value specified. If it cannot, then all the age groups will be combined and Joinpoint will indicate that the cohort cannot be used in the analysis.

Forward Compression

The forward compression process starts with the youngest age group of the first data point (e.g., year) and checks whether its count is greater than the minimum count specified by the user. If it is, the process “looks forward” to the next age group within the same data point (e.g., year). If the next age group's count is not greater than the minimum count, then the two age groups and their counts are combined. If the next count is greater than the minimum, then the first age group is left as is and the forward compression moves on to the second age group. This process continues until the age groups for each data point are combined such that no count is less than or equal to the minimum specified by the user. The combined age groups will be consistent for all data points.

Below is an example of how the forward compression works. Please note that this is only for a single data point (year 2000). In the example below, the 10–14 age group has a count of zero and thus is combined with the 5–9 age group. The algorithm then continues to look forward to see if the next age group count is less than or equal to the minimum specified by the user. If it is, then that age group is combined with the others. This is done for each data point (e.g., year) until all counts are greater than the minimum count. In the end, each data point will have the same ‘combined’ age groups.

Because the clustering process computes age-adjusted rates, when an age group is combined with another to create a count that is greater than the minimum count, the associated population and standard populations are also combined.
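The combining of counts, populations, and standard populations can be sketched as follows. This is only an illustration, not Joinpoint's implementation: the AgeGroup class, the merge and age_adjusted_rate helpers, and all numbers are hypothetical, and the rate formula is the usual direct standardization.

```python
from dataclasses import dataclass


@dataclass
class AgeGroup:
    label: str
    count: float
    population: float
    std_population: float


def merge(groups, label):
    """Combine age groups by summing counts, populations, and standard populations."""
    return AgeGroup(
        label,
        sum(g.count for g in groups),
        sum(g.population for g in groups),
        sum(g.std_population for g in groups),
    )


def age_adjusted_rate(groups, per=100_000):
    """Directly standardized rate: standard-population-weighted mean of crude rates."""
    total_std = sum(g.std_population for g in groups)
    return per * sum(
        g.std_population / total_std * g.count / g.population for g in groups
    )


# Made-up numbers, for illustration only.
a = AgeGroup("05-09", count=14, population=1000, std_population=70)
b = AgeGroup("10-14", count=0, population=1000, std_population=80)
combined = merge([a, b], "05-14")
print(combined)
```

Because all three quantities are summed together, the combined group can be used in an age-adjusted rate in the same way as an original age group.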

Forward Compression – Minimum count of zero specified
Year 2000
Age Group	Original Count	Count Compression Result	New Age Groups
00 years	10	10	00 Years
01-04 years	12	12	01-04 years
05-09 years	14	14	05-84 years
10-14 years	0
15-19 years	0
20-24 years	0
25-29 years	0
30-34 years	0
35-39 years	0
40-44 years	0
45-49 years	0
50-54 years	0
55-59 years	0
60-64 years	0
65-69 years	0
70-74 years	0
75-79 years	0
80-84 years	0
85+ years	2	2	85+ years
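
The forward compression of a single data point, as in the table above, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Joinpoint's code: the function name forward_compress is hypothetical, only one data point is handled (how groupings are made consistent across data points is not shown), and a youngest group that is itself at or below the minimum is assumed to be merged with the group that follows it.

```python
def forward_compress(labels, counts, min_count=0):
    """Merge adjacent age groups, youngest to oldest, for one data point.

    Returns a list of (first_label, last_label, combined_count) tuples.
    """
    merged = []  # each entry: [first_index, last_index, running_count]
    for i, count in enumerate(counts):
        # Absorb this group into the previous one when this count, or the
        # previous running total, is still less than or equal to min_count.
        if merged and (count <= min_count or merged[-1][2] <= min_count):
            merged[-1][1] = i
            merged[-1][2] += count
        else:
            merged.append([i, i, count])
    return [(labels[a], labels[b], total) for a, b, total in merged]


# Age groups and year-2000 counts from the table above.
labels = ["00", "01-04", "05-09"] + [f"{a}-{a + 4}" for a in range(10, 85, 5)] + ["85+"]
counts = [10, 12, 14] + [0] * 15 + [2]
for first, last, total in forward_compress(labels, counts):
    print(first, last, total)
```

With a minimum count of zero, the sketch leaves the 00 and 01-04 groups alone, combines 05-09 through 80-84 into one group with a count of 14, and keeps 85+ with a count of 2, matching the table.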

Backward Compression

The backward compression process starts with the last/oldest age group and goes ‘backward’ through each age group (for a given data point), combining ages in a similar fashion to the forward compression until all counts are greater than the minimum specified by the user. Below is an example of how the backward compression works. The table contains the same age groups and counts as the table used in the Forward Compression example (above).

Please note that the results of the backward compression differ from those of the forward compression: the zero-count age groups are combined with the 85+ age group (forming a 10+ years group with a count of 2) rather than with the 05-09 age group.

Backward Compression – Minimum count of zero specified
Year 2000
Age Group	Original Count	Compression Result	New Age Groups
00 years	10	10	00 Years
01-04 years	12	12	01-04 years
05-09 years	14	14	05-09 years
10-14 years	0	2	10+ years
15-19 years	0
20-24 years	0
25-29 years	0
30-34 years	0
35-39 years	0
40-44 years	0
45-49 years	0
50-54 years	0
55-59 years	0
60-64 years	0
65-69 years	0
70-74 years	0
75-79 years	0
80-84 years	0
85+ years	2
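
The backward pass over a single data point can be sketched the same way, walking from the oldest age group to the youngest. As before, this is a hypothetical illustration rather than Joinpoint's implementation: the name backward_compress is made up, and only one data point is handled.

```python
def backward_compress(labels, counts, min_count=0):
    """Merge adjacent age groups, oldest to youngest, for one data point.

    Returns a list of (first_label, last_label, combined_count) tuples,
    ordered from youngest to oldest.
    """
    merged = []  # each entry: [first_index, last_index, running_count]
    for i in range(len(counts) - 1, -1, -1):
        count = counts[i]
        # Absorb this group into the older one when this count, or the
        # older running total, is still less than or equal to min_count.
        if merged and (count <= min_count or merged[-1][2] <= min_count):
            merged[-1][0] = i
            merged[-1][2] += count
        else:
            merged.append([i, i, count])
    merged.reverse()
    return [(labels[a], labels[b], total) for a, b, total in merged]


# Age groups and year-2000 counts from the table above.
labels = ["00", "01-04", "05-09"] + [f"{a}-{a + 4}" for a in range(10, 85, 5)] + ["85+"]
counts = [10, 12, 14] + [0] * 15 + [2]
for first, last, total in backward_compress(labels, counts):
    print(first, last, total)
```

Here the zero counts are absorbed into the 85+ group, so the last combined group runs from 10-14 to 85+ with a count of 2, while 05-09 keeps its count of 14.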

View Compression Results...

The View Compression Results button performs the pre-analysis compression process on your input data file and displays the results for each cohort in a dialog box. This lets users review the compression results before running the clustering job.

Method for Determining Number of Joinpoints for Clustering

The choices are the WBIC, BIC, and Permutation test. The permutation test procedure had been the default selection method, but it is computationally inefficient. BIC has been proposed as a computationally more efficient tool, but simulations have shown that it tends to select models that have too many joinpoints when the effect size (the change size adjusted for data variability) is relatively large. BIC3, a BIC-type measure with a harsher penalty, has been proposed to reduce this over-fitting tendency, but it has been shown to be too conservative when the effect size is small. WBIC, which combines BIC and BIC3, has been proposed to address both the over-fitting tendency of the BIC and the under-fitting tendency of the BIC3, and WBIC is the current default in Joinpoint. Simulations have shown that the WBIC agrees with the results of the permutation test more often than the BIC does, while being much more computationally efficient than the permutation test.

Method for Determining Number of Clusters

The only choice for determining the number of clusters is the BIC. The permutation test is not feasible computationally.

Maximum Number of Clusters

The largest possible number of clusters is the number of groups. For example, if there are 10 age groups, then the maximum number of clusters is 10 (one age group per cluster). However, in determining the optimal clusters the algorithm tests among all possible clusterings with a single cluster (i.e., all 10 age groups in a single cluster), with two clusters (nine possible clustering options: group 1 vs. groups 2-10, groups 1-2 vs. groups 3-10, …, groups 1-9 vs. group 10), with 3 clusters, …, and with 9 clusters. Testing among so many possible clusterings is computationally very intensive, and in many cases finding more than 4 or 5 clusters makes the results difficult to interpret. When doing analyses for more common cancers, the algorithm may find clusters where the difference in APCs across adjacent clusters is not very large (see MADWD, which helps deal with this issue).
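
To give a sense of how quickly the number of candidate clusterings grows, the contiguous clusterings can be enumerated directly. The sketch below is illustrative only, and the function name contiguous_clusterings is hypothetical. It relies on the fact that splitting n ordered groups into k contiguous clusters amounts to choosing k-1 of the n-1 boundaries between adjacent groups.

```python
from itertools import combinations
from math import comb


def contiguous_clusterings(n_groups, n_clusters):
    """Yield each way to split groups 1..n_groups into n_clusters contiguous runs.

    Each clustering is a list of (first_group, last_group) pairs.
    """
    for cuts in combinations(range(1, n_groups), n_clusters - 1):
        bounds = (0, *cuts, n_groups)
        yield [(lo + 1, hi) for lo, hi in zip(bounds, bounds[1:])]


two_clusters = list(contiguous_clusterings(10, 2))
print(len(two_clusters))   # number of two-cluster options for 10 groups
print(two_clusters[0])     # group 1 vs. groups 2-10
print(two_clusters[-1])    # groups 1-9 vs. group 10

# Number of candidate clusterings for each cluster count.
for k in range(1, 11):
    print(k, comb(9, k - 1))
```

For 10 age groups there are 9 two-cluster options, as noted above, and the count peaks at 126 for 5 or 6 clusters.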

Clustering Minimum APC Difference Worth Detecting (MADWD)

Two different MADWD algorithms are available in Joinpoint. Both allow the user to avoid finding APC differences that are too small to be of substantive interest. Often the analyst is focused on interpreting only larger APC differences and would like to eliminate situations where the APC difference falls below a specified threshold, because such differences are too small to be of practical importance and/or may be difficult to interpret. It is important to note that this is a subjective criterion that overlays the statistical criteria, which are applied first.

The first application of MADWD (called Segment MADWD) applies to consecutive joinpoint segments for a single cohort where the user wants to specify a minimum APC difference for each pair of segments (see https://surveillance.cancer.gov/help/joinpoint/setting-parameters/method-and-parameters-tab/method/madwd ).

The second type of MADWD is called Clustering MADWD. After the clustering is completed, the difference in APCs is computed for each year for each pair of adjacent clusters and then averaged across all calendar years. For example, if the data series runs from 2000 to 2015 and the joinpoint fit for cluster (i) is an APC of 1.5% from 2000 to 2010 and an APC of 2% from 2010 to 2015, while the fit for cluster (i+1) is 1% from 2000 to 2015, then the average APC difference is { (0.5 * 10 years) + (1 * 5 years) } / 15 years = 0.67. Note that the year 2010 is excluded from the calculations because the APC is not defined for 2010 in cluster (i). Note also that the absolute values of the APC differences for each year are used, so that all differences that are averaged are positive.
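
The averaging in this example can be checked with a short calculation. The sketch below is a hypothetical illustration, not Joinpoint's code: the function names are made up, and it assumes that a joinpoint year has no defined APC and that years before a joinpoint belong to the earlier segment.

```python
def yearly_apc(first_year, last_year, joinpoints, apcs):
    """Map each year to its segment APC; joinpoint years map to None."""
    result = {}
    for year in range(first_year, last_year + 1):
        if year in joinpoints:
            result[year] = None  # APC is not defined at a joinpoint
        else:
            segment = sum(1 for jp in joinpoints if jp < year)
            result[year] = apcs[segment]
    return result


def average_apc_difference(apc_a, apc_b):
    """Average the absolute yearly APC differences over years defined in both."""
    diffs = [
        abs(apc_a[year] - apc_b[year])
        for year in apc_a
        if apc_a[year] is not None and apc_b.get(year) is not None
    ]
    return sum(diffs) / len(diffs)


# Cluster (i): 1.5% before the 2010 joinpoint, 2% after it.
cluster_i = yearly_apc(2000, 2015, [2010], [1.5, 2.0])
# Cluster (i+1): 1% throughout.
cluster_next = yearly_apc(2000, 2015, [], [1.0])
print(round(average_apc_difference(cluster_i, cluster_next), 2))
```

The ten years 2000-2009 each contribute a difference of 0.5 and the five years 2011-2015 each contribute 1, so the average is 10 / 15, or about 0.67.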

For either type of MADWD, the user sets a Minimum APC Difference Worth Detecting (i.e., a MADWD criterion). If the Clustering MADWD algorithm determines that any two adjacent clusters do not meet the specified MADWD criterion, the clustering algorithm is repeated with the maximum number of clusters set to one less than the number of clusters found in the previous step. This process is repeated iteratively until all adjacent cluster pairs meet the MADWD criterion. The choice of the MADWD criterion is somewhat arbitrary, but experience has indicated that an APC difference of 1% or 0.5% tends to work reasonably well.
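
The iterative Clustering MADWD step can be outlined as a simple loop. This sketch rests on several assumptions and is not Joinpoint's implementation: fit_clusters and pair_difference are hypothetical callables standing in for the clustering algorithm and the average APC difference, a pair is assumed to meet the criterion when its difference is at least the MADWD value, and the toy fits at the bottom are invented purely to exercise the loop.

```python
def enforce_madwd(fit_clusters, pair_difference, max_clusters, madwd):
    """Refit with a lower cluster cap until every adjacent pair meets the MADWD.

    fit_clusters(cap) must return an ordered list of at most `cap` clusters.
    pair_difference(a, b) must return the average APC difference for a pair.
    """
    while True:
        clusters = fit_clusters(max_clusters)
        diffs = [pair_difference(a, b) for a, b in zip(clusters, clusters[1:])]
        if len(clusters) <= 1 or all(d >= madwd for d in diffs):
            return clusters
        # Cap the next run at one less than the number of clusters just found.
        max_clusters = len(clusters) - 1


# Toy stand-ins (hypothetical): each cluster is represented by a single APC.
toy_fits = {3: [1.0, 1.2, 3.0], 2: [1.1, 3.0], 1: [2.0]}
result = enforce_madwd(
    fit_clusters=lambda cap: toy_fits[cap],
    pair_difference=lambda a, b: abs(a - b),
    max_clusters=3,
    madwd=0.5,
)
print(result)
```

In the toy run, the three-cluster fit has an adjacent pair that differs by only 0.2, so the loop refits with a cap of two clusters, where the single remaining pair differs by 1.9 and is accepted.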