Real-world data frequently contain observations that are only partially known. For example, patients may drop out of medical trials before a study is complete. When collecting environmental data, observations may be less than a reporting limit and therefore lead to a result such as "<1.0". Nevertheless, such censored observations contain useful information that must be accounted for when characterizing the distribution of the population from which the data come.
Statgraphics 18 introduced a new data column type referred to as "censored numeric data". An example of such a column is contained in the data file below:
The column "days" contains the number of days between the date each breast cancer patient received a radiation treatment and the date breast retraction was first observed. It contains a combination of left-censored, right-censored, and interval-censored observations.
There are two primary approaches to characterizing these data:
1. A parametric approach in which a distribution such as the lognormal distribution is estimated using maximum likelihood methods that account for the censoring of each observation.
2. A nonparametric approach that uses the Kaplan-Meier-Turnbull (KMT) method to estimate the CDF and survival functions.
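To make the parametric approach concrete, here is a minimal Python sketch of a censoring-aware lognormal likelihood. It is not Statgraphics code and not necessarily how the procedure is implemented. It assumes each observation is stored as a hypothetical (lower, upper) pair, with lower == upper for an exact value, lower == 0 for left-censoring, and upper == math.inf for right-censoring. The optimizer is a deliberately crude compass search, the starting values are arbitrary, and the sample values are made up for illustration (they are not the breast cancer data).

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, reused for the lognormal CDF and PDF


def lognormal_cdf(t, mu, sigma):
    """P(T <= t) for a lognormal with log-scale mean mu and log-scale sd sigma."""
    if t <= 0:
        return 0.0
    if math.isinf(t):
        return 1.0
    return STD_NORMAL.cdf((math.log(t) - mu) / sigma)


def lognormal_pdf(t, mu, sigma):
    """Density at an exactly observed time t > 0."""
    return STD_NORMAL.pdf((math.log(t) - mu) / sigma) / (t * sigma)


def log_likelihood(observations, mu, sigma):
    """Sum the log-likelihood contribution of each (lower, upper) pair.

    exact value      -> density at the value
    anything else    -> probability that T falls between lower and upper,
                        which covers left-, right-, and interval-censoring
    """
    total = 0.0
    for lower, upper in observations:
        if lower == upper:
            p = lognormal_pdf(lower, mu, sigma)
        else:
            p = lognormal_cdf(upper, mu, sigma) - lognormal_cdf(lower, mu, sigma)
        total += math.log(max(p, 1e-300))  # guard against log(0)
    return total


def fit_lognormal(observations, mu=3.0, sigma=1.0, step=1.0, tol=1e-6):
    """Maximize the log-likelihood with a simple compass search (illustrative only)."""
    best = log_likelihood(observations, mu, sigma)
    while step > tol:
        improved = False
        for d_mu, d_sigma in ((step, 0.0), (-step, 0.0), (0.0, step), (0.0, -step)):
            new_mu, new_sigma = mu + d_mu, sigma + d_sigma
            if new_sigma <= 0:
                continue  # sigma must stay positive
            value = log_likelihood(observations, new_mu, new_sigma)
            if value > best:
                mu, sigma, best, improved = new_mu, new_sigma, value, True
        if not improved:
            step /= 2  # shrink the search pattern once no neighbor is better
    return mu, sigma, best


# Made-up sample: two left-censored, four interval-censored,
# three right-censored, and one exact observation (in days).
observations = [(0, 7), (0, 8), (6, 10), (11, 18), (17, 25), (24, 31),
                (15, math.inf), (37, math.inf), (46, math.inf), (12, 12)]
mu_hat, sigma_hat, loglik = fit_lognormal(observations)
print(f"mu = {mu_hat:.3f}, sigma = {sigma_hat:.3f}, log-likelihood = {loglik:.3f}")
```

With no censoring at all, this likelihood reduces to the ordinary lognormal one, so the fitted mu and sigma should match the mean and (population) standard deviation of the logged values. In practice a general-purpose numerical optimizer would replace the compass search.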
The new Distribution Fitting (Arbitrarily Censored Data) procedure in Statgraphics 18 fits any of 27 parametric distributions to a column of censored data, and also obtains nonparametric estimates. The graph below shows the survival function for both a fitted lognormal distribution for the breast cancer data and a nonparametric KMT estimate. 95% confidence intervals are provided for the nonparametric estimate:
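For readers curious about what a nonparametric estimate for arbitrarily censored data involves, the following Python sketch implements Turnbull's self-consistency (EM) iteration. It is an illustration under stated assumptions, not the Statgraphics implementation: observations are hypothetical (lower, upper) pairs treated as closed intervals, the function names are invented for this example, and no confidence intervals are computed.

```python
import math


def turnbull(intervals, tol=1e-8, max_iter=10_000):
    """Estimate the probability mass on Turnbull's innermost intervals.

    Each observation is a closed interval (lower, upper): lower == upper is
    exact, lower == 0 is left-censored, upper == math.inf is right-censored.
    """
    # Tag left endpoints 0 and right endpoints 1 so ties sort left-before-right.
    endpoints = sorted({(lo, 0) for lo, _ in intervals}
                       | {(up, 1) for _, up in intervals})
    # Innermost intervals: a left endpoint immediately followed by a right endpoint.
    support = [(a, b) for (a, ta), (b, tb) in zip(endpoints, endpoints[1:])
               if ta == 0 and tb == 1]
    # inside[i][j] is True when support interval j lies within observation i.
    inside = [[lo <= a and b <= up for a, b in support] for lo, up in intervals]

    n, m = len(intervals), len(support)
    p = [1.0 / m] * m  # start from equal mass on every support interval
    for _ in range(max_iter):
        new_p = [0.0] * m
        for row in inside:
            # Share each observation's weight 1/n among the support intervals
            # it could have come from, in proportion to their current mass.
            denom = sum(pj for pj, flag in zip(p, row) if flag)
            for j, flag in enumerate(row):
                if flag:
                    new_p[j] += p[j] / denom / n
        change = max(abs(a - b) for a, b in zip(p, new_p))
        p = new_p
        if change < tol:
            break
    return support, p


def survival_steps(support, p):
    """Return (right end of each support interval, estimated survival just after it)."""
    steps, remaining = [], 1.0
    for (_, right), mass in zip(support, p):
        remaining -= mass
        steps.append((right, max(remaining, 0.0)))
    return steps


# Made-up sample mixing left-, interval-, and right-censored values with one exact value.
sample = [(0, 7), (6, 10), (11, 18), (15, math.inf), (12, 12)]
support, mass = turnbull(sample)
for right, surv in survival_steps(support, mass):
    print(f"S(t) just after t = {right}: {surv:.3f}")
```

When the data contain only exact and right-censored values, this iteration reproduces the familiar Kaplan-Meier estimate. The estimate only assigns mass to the innermost intervals and says nothing about how that mass is spread within each one.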
The results of such a study are typically estimates of survival percentiles. The table below shows several percentile estimates using both methods:
The confidence limits displayed were obtained by fitting lognormal distributions to 1,000 bootstrap subsamples taken from the original data. Note that the KMT estimates fall within the confidence limits for the lognormal percentiles up to 46 days. Above that value, most of the data are right-censored, so the nonparametric estimates are not very meaningful.
For an extensive discussion of fitting the breast cancer data, along with a set of arsenic concentration measurements taken from an urban stream in Oahu, Hawaii, see the webinar titled "Distribution Fitting of Arbitrarily Censored Data".
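The bootstrap idea behind such confidence limits can be sketched in a few lines of Python. This is a simplified illustration, not the procedure Statgraphics runs: it assumes resampling with replacement at the original sample size, uses percentile-method limits, and, to stay short, fits the lognormal to fully observed (uncensored) made-up values. For censored data the estimator passed in would need to be a censoring-aware maximum likelihood fit instead. All function names here are hypothetical.

```python
import math
import random
from statistics import NormalDist, fmean, pstdev


def lognormal_percentile(times, p):
    """p-th quantile of a lognormal fitted to fully observed times.

    Stand-in for a censored fit: with no censoring, the MLEs are the mean
    and population standard deviation of the logged values.
    """
    logs = [math.log(t) for t in times]
    mu, sigma = fmean(logs), pstdev(logs)
    return math.exp(mu + sigma * NormalDist().inv_cdf(p))


def bootstrap_ci(data, estimator, n_boot=1000, level=0.95, seed=0):
    """Percentile-method bootstrap limits for estimator(data)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    stats = sorted(
        estimator(rng.choices(data, k=len(data)))  # resample with replacement
        for _ in range(n_boot)
    )
    tail = (1.0 - level) / 2.0
    lower = stats[math.floor(tail * (n_boot - 1))]
    upper = stats[math.ceil((1.0 - tail) * (n_boot - 1))]
    return lower, upper


# Made-up, fully observed times in days (not the breast cancer data).
days = [5.0, 8.0, 11.0, 14.0, 19.0, 23.0, 27.0, 34.0, 41.0, 52.0]
for p in (0.10, 0.25, 0.50):
    estimate = lognormal_percentile(days, p)
    low, high = bootstrap_ci(days, lambda s, p=p: lognormal_percentile(s, p))
    print(f"{int(p * 100)}th percentile: {estimate:.1f} days "
          f"(95% limits {low:.1f} to {high:.1f})")
```

Passing the estimator in as a function keeps the resampling logic separate from the model, so the same routine could wrap any fitting method that returns a single number.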
Source of Breast Cancer Data
Finkelstein, D.M. and Wolfe, R.A. (1985). “A semiparametric model for regression analysis of interval-censored failure time data.” Biometrics 41, 731-740.