SPARC: Frequently Asked Questions
Methodology
1. Why are estimates used for the populations in SPARC and how are they different from census populations?
Populations in SPARC must be estimated from representative survey samples because census-based populations lack the detailed data needed on risk factors such as birthplace, the stratifier used in the first SPARC application, Mortality Rates by Birthplace.
Let us take the application of Mortality Rates by Birthplace as an example. Populations stratified by birthplace are estimated using 1-year American Community Survey (ACS) samples and they differ from census-based populations in two ways.
First, unlike the U.S. Census, which provides 100% counts of U.S. populations (although overcounts or undercounts are possible), the ACS surveys a random sample of approximately one percent of the U.S. population each year. The survey covers a variety of sociodemographic topics, and the information obtained from the sample is used to estimate the characteristics of the U.S. population. The Census Bureau's methodology (Census Bureau, 2014) is followed to produce the weighted populations and the replicate-weight standard errors. Specifically, for a given geodemographic group, the population is estimated by summing the weights assigned to the sampled individuals in that group. The weights reflect not only unequal sampling probabilities, but also the difference between the full sample and the interviewed sample, and the difference between the sample and external populations on common basic characteristics (such as census population counts). In addition, a sample-based population estimate varies depending on the particular sample selected and interviewed; this variability is the sampling error, measured as the variance or the standard error (the square root of the variance). The replicate weights method is used to calculate the standard errors. For a detailed description of the estimation process, refer to the FAQ "How are population estimates and standard errors calculated?"
Second, the ACS produces 'period' population estimates representing the average populations over a calendar year from January 1 through December 31, unlike the census, which provides a snapshot of U.S. populations as of a specific date, i.e., April 1 of the census year. Used as denominators in conjunction with numerators consisting of the cancer incidence or death counts that occurred throughout the same calendar year, these population estimates ensure that the resulting age-adjusted rates are based on data collected with a comparable time reference.
Reference
U.S. Census Bureau. 2014 ACS Design and Methodology Report. https://www.census.gov/programs-surveys/acs/methodology/design-and-methodology.html. Accessed August 16, 2022.
2. Why do sampling errors in the population denominators cause bias in estimating cancer rates?
Decennial census-based population denominators are typically assumed to be error free or of negligible error in statistical inference of cancer rates (Brillinger 1986; Fay 1997). Two examples of such populations are: 1) populations calculated directly from 2010 census Summary File 1 and Summary File 2; and 2) population estimates produced by the Census Bureau's Population Estimates Program. Under this setup, the age-adjusted rate (AAR) is a simple estimator that involves only one random variable, that is the Poisson-distributed random numerator, as shown below.
\[{AAR}_1=\sum_{j=1}^{J}{w_j\frac{d_j}{N_j}}\]
where \(d_j\sim Po\left(\lambda_j\right)\) is the number of cancer cases or deaths and \(N_j\) is the population for age group j, and \(w_j\) is the proportion of the U.S. standard population in age group j. Both \(N_j\) and \(w_j\) are fixed constants. Because \(E\left(d_j\right)=\lambda_j\) (meaning that the random Poisson errors associated with \(d_j\) cancel out over repeated sampling), \({AAR}_1\) is an unbiased estimator for the true rate \({AAR}_{\mathrm{true}}=\sum_{j=1}^{J}{w_j\frac{\lambda_j}{N_j}}\) as shown below.
\[E\left({AAR}_1\right)=\sum_{j=1}^{J}{w_j\frac{E\left(d_j\right)}{N_j}}=\sum_{j=1}^{J}{w_j\frac{\lambda_j}{N_j}}={AAR}_{\mathrm{true}}\]
When population denominators involve sampling errors, the AAR is a weighted sum of ratios of means of two random variables, i.e., cancer incidence or mortality and population, as shown below.
\[{AAR}_2=\sum_{j=1}^{J}{w_j\frac{d_j}{n_j}}\]
where \(n_j\sim N\left(N_j,\sigma_j^2\right)\) are sample survey-based population estimates of the true population \(N_j\) and are subject to sampling error \(\sigma_j\). Although \(E\left(n_j\right)=N_j\), \({AAR}_2\) is not an unbiased estimator, but a positively biased estimator for \({AAR}_{\mathrm{true}}\) because
\[E\left({AAR}_2\right)=\sum_{j=1}^{J}{w_jE\left(\frac{d_j}{n_j}\right)}\geq\sum_{j=1}^{J}{w_j\frac{E\left(d_j\right)}{E\left(n_j\right)}}=\sum_{j=1}^{J}{w_j\frac{\lambda_j}{N_j}}={AAR}_{\mathrm{true}}\]
This type of ratio estimator bias is well documented in the literature (Cochran 1977), and the magnitude of the bias is a function of the coefficient of variation (the ratio of the standard deviation to the mean) of the population, i.e., \(CV=\frac{\sigma_j}{N_j}\) for age group j. The more precise the population estimate is, the smaller the bias. A bias-corrected estimator developed by Jiang et al. (2020) is
\[{AAR}_{bc}=\sum_{j=1}^{J}{w_j\frac{d_j}{n_j}\left(1-\frac{{\hat{\sigma}}_j^2}{n_j^2}\right)}\]
where \({\widehat{cv}}_j=\frac{{\hat{\sigma}}_j}{n_j}\) is the estimated CV for age group j. This bias-corrected estimator is almost unbiased if \({\widehat{cv}}_j\) is less than 50%. The results of this bias-corrected estimator should be interpreted with caution if \({\widehat{cv}}_j\) is large.
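The positive bias of the ratio estimator can be checked with a small simulation. The sketch below is illustrative only and is not part of SPARC: the function name, the true population of 10,000, the CV of 10%, and the number of draws are all hypothetical choices. It assumes that \(d_j\) and \(n_j\) are independent, so that \(E\left(d_j/n_j\right)=\lambda_j E\left(1/n_j\right)\) and the relative bias of an age-specific rate equals \(E\left(N_j/n_j\right)-1\). Under a standard second-order Taylor approximation, this relative bias is roughly the squared CV.

```python
import random
import statistics


def simulate_relative_bias(true_pop, cv, draws=200_000, seed=12345):
    """Monte Carlo estimate of E(N / n) - 1 for n ~ Normal(N, (cv * N)^2).

    Assumes the numerator d is independent of n, so the relative bias of
    d / n as an estimator of lambda / N is E(N / n) - 1.
    """
    rng = random.Random(seed)          # fixed seed for reproducibility
    sigma = cv * true_pop              # sampling standard error of n
    ratios = [true_pop / rng.gauss(true_pop, sigma) for _ in range(draws)]
    return statistics.fmean(ratios) - 1.0


# Hypothetical inputs: true population 10,000 with a 10% CV.
relative_bias = simulate_relative_bias(true_pop=10_000.0, cv=0.10)
print(f"simulated relative bias: {relative_bias:.4f}")
print(f"squared CV for comparison: {0.10 ** 2:.4f}")
```

The simulated relative bias is positive and close to the squared CV, which is consistent with the statement that more precise population estimates lead to smaller bias.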
References
- Brillinger, DR. A biometrics invited paper with discussion: the natural variability of vital rates and associated statistics. Biometrics. 1986;42:693-712 [Abstract]
- Fay MP, Feuer EJ. Confidence intervals for directly standardized rates: a method based on the gamma distribution. Statistics in Medicine. 1997;16:791-801 [Abstract]
- Cochran WG. Sampling Techniques. Third edition. John Wiley & Sons; 1977. [Abstract]
- Jiang J, Feuer EJ, Li Y, Nguyen T, Yu M. Inference about age-standardized rates with sampling errors in the denominators. Statistical Methods in Medical Research. 2021;30(2):535-548 [Abstract]
3. What is the minimum recommended population size to use this tool?
Whether SPARC can be used depends on the precision of the population estimates. SPARC produces unbiased age-adjusted rates only if the coefficient of variation (CV) of the population estimates (the ratio of the standard error to the population estimate) is less than 50% in all age groups involved in the age adjustment. Small population sizes are usually associated with large CVs. Based on an unpublished evaluation, a geodemographic group with an estimated population of at least 100,000 persons, summed over all age groups, usually has age-specific CVs small enough to use SPARC. However, users are strongly advised to examine the CV by age group to gauge suitability.
4. What is the difference between a crude rate and an age-adjusted rate?
A crude rate is the number of new cases (incidence) or deaths (mortality) in a population during a specific time period (numerator) divided by the number of person-years lived by the population during the time period (denominator). A crude rate can be calculated for a specific age group, i.e., age-specific rate, or across more than one age group. Because cancer rates vary substantially with age, crude rates are influenced by the underlying age distribution of the population.
In contrast, an age-adjusted rate is a single rate that allows comparison of populations with differing age compositions. It is a weighted sum of age-specific rates, where each weight is the proportion of the U.S. standard population in that age group.
Even if two populations, e.g., states or countries, have the same age-adjusted rates, the state or country with the relatively older population generally will have higher crude rates because incidence or death rates for most cancers increase with increasing age.
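The difference between the two rates can be illustrated with a short calculation. The sketch below uses entirely hypothetical numbers: three age groups, made-up standard population proportions, and two populations that share the same age-specific rates but differ in age structure. The function names are illustrative and are not part of SPARC.

```python
def crude_rate(counts, populations):
    """Total cases or deaths divided by total population."""
    return sum(counts) / sum(populations)


def age_adjusted_rate(counts, populations, std_weights):
    """Weighted sum of age-specific rates using standard population proportions."""
    return sum(w * d / n for w, d, n in zip(std_weights, counts, populations))


# Hypothetical standard population proportions for three age groups (sum to 1).
std_weights = [0.4, 0.4, 0.2]

# Both populations have age-specific rates of 10, 100, and 1,000 per 100,000,
# but the second population is older.
young_pop, young_deaths = [50_000, 40_000, 10_000], [5, 40, 100]
old_pop, old_deaths = [20_000, 40_000, 40_000], [2, 40, 400]

crude_young = crude_rate(young_deaths, young_pop) * 100_000
crude_old = crude_rate(old_deaths, old_pop) * 100_000
aar_young = age_adjusted_rate(young_deaths, young_pop, std_weights) * 100_000
aar_old = age_adjusted_rate(old_deaths, old_pop, std_weights) * 100_000

print(f"crude rates per 100,000: {crude_young:.1f} vs {crude_old:.1f}")
print(f"age-adjusted rates per 100,000: {aar_young:.1f} vs {aar_old:.1f}")
```

In this example the age-adjusted rates are identical (244 per 100,000), while the crude rate of the older population (442 per 100,000) is about three times that of the younger population (145 per 100,000).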
Mortality Rates by Birthplace
1. How are U.S.-born and foreign-born defined?
For both mortality data and population data, foreign-born is defined, following the U.S. Census Bureau, as anyone who was not a U.S. citizen at birth. Everyone else is identified as U.S.-born, including anyone born in the U.S., Puerto Rico, a U.S. Island Area (Guam, the Commonwealth of the Northern Mariana Islands, or the U.S. Virgin Islands), or abroad to a U.S. citizen parent or parents.
2. What is the source of population denominator data used in SPARC?
Population denominators used for calculating mortality rates by birthplace in SPARC are estimated from the 2006-2018 1-year American Community Survey (ACS) samples. These ACS samples are extracted from the Integrated Public Use Microdata Series (IPUMS) maintained by the Minnesota Population Center of the University of Minnesota (Ruggles et al. 2022). The smallest geodemographic groups for which populations are available are those defined by 5-year age group (0-4 years, 5-9, 10-14, …, 80-84, and 85 years and older), sex (female and male), bridged single race/ethnicity (non-Hispanic White, non-Hispanic Black, non-Hispanic Asian/Pacific Islander, non-Hispanic American Indian and Alaska Native, and Hispanic) and birthplace (U.S.-born and foreign-born) at the state level. Bridged single race is identified by the "RACESING" variable in IPUMS USA. For more details about the bridged single race, refer to Liebler (2022).
Populations of broader years, ages or racial/ethnic groups are available for query by specifying combined categories in SPARC.
References
- Ruggles S, Flood S, Goeken R, Schouweiler M, Sobek M. IPUMS USA: Version 12.0, 2006-2018 1-year American Community Survey. Minneapolis, MN: IPUMS, 2022. DOI: https://doi.org/10.18128/D010.V12.0
- Liebler CA. Building New Bridges: Developing and Disseminating a Simplified Race/Ethnicity Measure for Working with Complex or Contradictory Race Data. University of Minnesota, Minnesota Population Center, IPUMS Working Paper No. 2022-02. DOI: https://doi.org/10.18128/MPC2022-02
3. How are population estimates and standard errors calculated?
Three main steps were taken to obtain the weighted population estimates and the corresponding replicate-weight standard errors.
In the first step, small proportions of missing data in the race and birthplace variables were multiply imputed using the sequential regression multiple imputation method (Raghunathan et al. 2001). Records with missing data, if not imputed, would be excluded from the estimation, leading to an underestimation of populations and subsequently an overestimation of age-adjusted rates (AA Rates), assuming the mortality numerator data are complete. Based on a statistical model, a single imputation assigns a plausible value to a variable where the actual value is missing, thereby preserving all records in the survey sample. Multiple imputation repeats this process to create multiple completed copies of the survey sample.
In the second step, weighted populations and the associated replicate-weight standard errors are calculated using each imputed copy of the sample separately. Specifically, the weighted population of the geodemographic group G estimated using the fully imputed survey sample \(S_m,\ m=1,2,\ldots,M\) is
\[{\hat{N}}^{\left(m\right)}=\sum_{i\in G,\,S_m} W_i\]
where \(W_i\) is the final weight assigned to individual i in geodemographic group G, and M denotes the number of imputations.
The replicate-weight variance of \({\hat{N}}^{\left(m\right)}\) is
\[{\hat{V}}^{\left(m\right)}=\frac{4}{80}\sum_{r=1}^{80}\left({\hat{N}}_r^{\left(m\right)}-{\hat{N}}^{\left(m\right)}\right)^2\]
where \({\hat{N}}_r^{\left(m\right)}=\sum_{i\in G,\,S_m}{RW}_{i,r}\), \({RW}_{i,r}\) is the \(r^{th}\) replicate weight assigned to individual i, and \(r=1,2,\ldots,80\). For a detailed description, refer to Census Bureau (2014).
Finally, in the third step, the multiple sets of population estimates and replicate-weight standard errors are combined (Rubin and Schenker 1986) to obtain the multiply imputed population
\[\hat{N}=\frac{1}{M}\sum_{m=1}^{M}{\hat{N}}^{\left(m\right)}\]
and the standard error
\[se\left(\hat{N}\right)=\left(\hat{W}+\left(1+\frac{1}{M}\right)\hat{B}\right)^\frac{1}{2}\]
where \(\hat{W}=\frac{\sum_{m=1}^{M}{\hat{V}}^{\left(m\right)}}{M}\) is the average within-imputation variance and \(\hat{B}=\frac{\sum_{m=1}^{M}\left({\hat{N}}^{\left(m\right)}-\hat{N}\right)^2}{M-1}\) is the between-imputation variance. The multiply imputed population estimates \(\hat{N}\) are used as the AA Rate denominators.
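The second and third steps can be sketched in a few lines of code. This is a minimal illustration of the formulas above, not the SPARC implementation: the function names and all numbers are hypothetical, and the inputs are assumed to be already-summed full-sample and replicate estimates for one geodemographic group.

```python
import math


def replicate_variance(full_estimate, replicate_estimates):
    """Replicate-weight variance: (4 / R) * sum_r (N_r - N)^2.

    With the R = 80 ACS replicate weights this is the 4/80 factor
    used in the formula above.
    """
    r = len(replicate_estimates)
    return 4.0 / r * sum((n_r - full_estimate) ** 2 for n_r in replicate_estimates)


def combine_imputations(estimates, variances):
    """Combine M imputed estimates and their variances into (N_hat, se)."""
    m = len(estimates)
    n_hat = sum(estimates) / m                                   # mean estimate
    w_hat = sum(variances) / m                                   # within-imputation variance
    b_hat = sum((e - n_hat) ** 2 for e in estimates) / (m - 1)   # between-imputation variance
    se = math.sqrt(w_hat + (1 + 1 / m) * b_hat)
    return n_hat, se


# Hypothetical example: one imputed copy with 80 replicate estimates.
replicates = [1010.0, 990.0] * 40
v_one = replicate_variance(1000.0, replicates)

# Hypothetical example: M = 3 imputed copies, each with variance 400.
n_hat, se = combine_imputations([1000.0, 1010.0, 990.0], [400.0, 400.0, 400.0])
print(f"replicate-weight variance: {v_one:.1f}")
print(f"combined estimate: {n_hat:.1f}, standard error: {se:.3f}")
```

In this toy example the combined standard error (about 23.1) is larger than the within-imputation standard error alone (20), because the between-imputation variance adds the uncertainty due to the imputed values.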
References
- Raghunathan TE, Lepkowski JM, Van Hoewyk J, Solenberger P. A multivariate technique for multiply imputing missing values using a sequence of regression models. Surv Methodol. 2001;27(1):85-96 [Abstract]
- U.S. Census Bureau. 2014 ACS Design and Methodology Report. https://www.census.gov/programs-surveys/acs/methodology/design-and-methodology.html. Accessed August 16, 2022.
- Rubin DB, Schenker N. Multiple imputation for interval estimation from simple random samples with ignorable nonresponse. J Am Stat Assoc. 1986;81(394):366-374. [Abstract]
4. Why are some cancer rates not shown or suppressed?
Several factors trigger the suppression of an age-adjusted rate (AA Rate) at the national level and state level.
A national-level AA Rate is suppressed if a non-zero numerator (the incidence or mortality count) is associated with a zero population estimate in any age group. In this situation, the age-specific rate cannot be calculated, so the AA Rate is invalid. In addition, an AA Rate is suppressed if the population estimate is associated with an estimated coefficient of variation (CV) greater than or equal to 50% in any of the age groups. This is necessary because the bias-corrected AA Rate estimator (Jiang et al. 2020) implemented in this tool is unbiased only if CV is less than 50% in all age groups. This tool displays the maximum CV across all age groups for an AA Rate as part of the output. Users are advised to examine the maximum CV to assess the impact of sampling errors on an AA Rate.
In addition to the national-level suppression rules, a state-level AA Rate is suppressed if the total incidence or mortality count across all age groups is less than 10. This is necessary for data confidentiality reasons.
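The suppression rules above can be summarized as a short check. The sketch below is an illustration under stated assumptions rather than the tool's actual logic: the function name, argument names, and reason strings are hypothetical, and an age group with both a zero count and a zero population estimate is simply skipped, a case the rules above do not address.

```python
def suppression_reason(counts, populations, standard_errors, state_level=False):
    """Return a reason string if the AA Rate would be suppressed, else None.

    counts, populations, and standard_errors are per-age-group sequences.
    """
    for d, n, se in zip(counts, populations, standard_errors):
        if n == 0:
            if d > 0:
                return "non-zero count with zero population estimate"
            continue  # assumption: zero count and zero population is skipped
        if se / n >= 0.50:
            return "population CV of 50% or more in an age group"
    if state_level and sum(counts) < 10:
        return "fewer than 10 cases or deaths across all age groups"
    return None


# Hypothetical inputs for two age groups.
print(suppression_reason([5, 3], [1000.0, 0.0], [100.0, 0.0]))
print(suppression_reason([5, 3], [1000.0, 200.0], [100.0, 100.0]))
print(suppression_reason([5, 3], [1000.0, 2000.0], [100.0, 100.0], state_level=True))
print(suppression_reason([5, 3], [1000.0, 2000.0], [100.0, 100.0]))
```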
5. Why is an age-adjusted rate flagged?
An age-adjusted rate (AA Rate) is flagged if the estimated AA Rate is unstable due to a small incidence or mortality count and/or unstable population estimates. An AA Rate is considered unstable if its estimated coefficient of variation (the ratio of the standard error of the AA Rate to the AA Rate) is greater than 25%.
6. Why are some age groupings not available for state-level analyses?
The bias-corrected age-adjusted rate estimator (Jiang et al. 2020) is implemented in SPARC. This estimator is unbiased only if the coefficient of variation (CV) of the population estimate (the ratio of the standard error to the population estimate) is less than 50% in all age groups involved in the age adjustment. However, state-level population estimates, even after being aggregated into groups of 3 death years, are not stable enough for all of the standard 5-year age groupings, i.e., 0-4 years old, 5-9, …, 80-84, and 85 years and older. To avoid suppressing too many state-level AA Rates (for reasons of suppression, refer to the FAQ "Why are some cancer rates not shown or suppressed?"), age is aggregated into 10-year groupings, with the first group being 14 years of age and younger and the last group being 75 years and older (i.e., ≤14 years old, 15-24, 25-34, …, 75+); these are the smallest age groups available for calculating age-specific rates.
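The regrouping of ages described above can be sketched as a simple mapping. This is an illustration only: the function names, the choice to key each 5-year group by its lower age bound, and the population values are hypothetical and do not reflect how SPARC stores its data.

```python
def ten_year_age_group(lower_age):
    """Map the lower bound of a 5-year age group to its 10-year group label."""
    if lower_age < 0:
        raise ValueError("age must be non-negative")
    if lower_age <= 14:
        return "<=14"
    if lower_age >= 75:
        return "75+"
    lower = 15 + 10 * ((lower_age - 15) // 10)
    return f"{lower}-{lower + 9}"


def aggregate_populations(population_by_lower_age):
    """Sum 5-year population estimates into the 10-year groups."""
    totals = {}
    for lower_age, population in population_by_lower_age.items():
        label = ten_year_age_group(lower_age)
        totals[label] = totals.get(label, 0.0) + population
    return totals


# Hypothetical 5-year population estimates keyed by lower age bound.
five_year = {0: 100.0, 5: 110.0, 10: 120.0, 15: 130.0, 20: 140.0,
             75: 50.0, 80: 30.0, 85: 20.0}
ten_year = aggregate_populations(five_year)
print(ten_year)
```

Only the point estimates are summed here. Standard errors cannot be added in the same way; they would need to be recalculated for the aggregated groups using the replicate weights.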
Reference
Jiang J, Li Y, Nguyen T, Yu M. Inference about ratios of age-standardized rates with sampling errors in the population denominators for estimating both rates. Stat Med. 2022 May 20;41(11):2052-2068 [Abstract]
7. How do missing data impact the age-adjusted rate?
Data may be missing in the numerator data (i.e., cancer incidence/mortality) and/or the denominator data (i.e., populations) that are used in calculating age-adjusted rates (AA Rates). In the birthplace-specific mortality data, data are missing in both the bridged single race/ethnicity and birthplace variables. Taking the birthplace variable as an example, the proportion of missing data is about 2% for most U.S. states except for Georgia, where it was over 80% in 2008 and 2009 during Georgia's transition to electronic death certificate reporting. When populations do not involve missing data, birthplace-specific AA Rates are subject to underestimation bias, because death records with missing birthplace data are not counted in the numerators. The magnitude of the underestimation bias increases as the proportion of missing data increases, and it may vary by birthplace status because records with missing data may not be equally likely to be U.S.-born or foreign-born. When populations are also subject to missing data, the direction and magnitude of the bias depend on the relative amount of missing data in the numerator and denominator, and they are difficult to quantify without a thorough analysis. Users are strongly advised to examine the magnitude of missing data, study the impact, and interpret the findings with caution.
8. Why is Georgia not available for the long-term trend analysis from 2006 to 2018?
The mortality birthplace data in Georgia have a considerable amount of missing data during the period when Georgia was transitioning to its Electronic Death Registration System. In the death years of 2008 and 2009, the proportion of missing birthplace data is over 80%. Such a large amount of missing data will lead to a sizable underestimation bias, which will subsequently distort the long-term national trend if Georgia is included. To avoid such bias, the 2008-2009 Georgia data are excluded from the 2006-2018 long-term mortality database.
Users who are interested in reporting long-term national trends are advised to compare mortality rates between Georgia and the rest of the U.S. If the rates are similar, excluding Georgia from the national trend analysis may not introduce enough bias to cause major concern.