# Demystifying Statistics for 510(k) Submissions in the US

This article was originally published in RAJ Devices

#### Executive Summary

*Steven Walfish\* shows non-statisticians how to familiarise themselves with common statistical techniques used for 510(k) submissions to the Food and Drug Administration.*

For most non-statisticians working at medical device companies, reviewing the statistical sections of a 510(k) premarket notification to the US Food and Drug Administration can be a daunting task.

Using the wrong statistical method in a 510(k) submission or making an incorrect assumption can lead to a delay in regulatory approval being granted or result in the submission being halted unnecessarily. The statistical elements of the submission, from the design phase of a medical device through to its manufacture, must be monitored scrupulously.

The FDA held a public meeting on 18 February 2010 to discuss key challenges related to the 510(k) process, which is used to review and clear certain medical devices for sale in the US. A trend that has emerged for 510(k) submissions is an increase in the number of submissions requiring clinical trials and/or comparability data. This article helps to demystify the statistics behind a 510(k) submission. It does not aim to replace a statistician but rather to familiarise the reader with some common statistical techniques used for 510(k) submissions. Though some of the techniques are also applicable to premarket approval (PMA) submissions, the goal is not to cover the depth of statistical analysis required for such submissions.

# The 510(k) and types of data

A 510(k) submission must demonstrate that the device to be marketed is as safe and effective as, that is, substantially equivalent to, a legally marketed device. Several types of data can be collected for 510(k) submissions: continuous data (quantitative data); discrete data (quantitative data with a limited number of outcomes); categorical data (qualitative data such as male versus female); and ranked/ordered data (categories that can be ordered from “smallest” to “largest”). For example, age can be treated as a continuous variable (eg years and days of age); as a discrete variable, measured in whole years; or as an ordered categorical variable, categorised as decades. As a rule, wherever possible, measure outcomes as a continuous variable and categorise later.

**Summary statistics**

The most typical statistical analysis seen in a 510(k) is summary statistics. Summary statistics can be divided into two classes: location and dispersion. The measure of location, or central tendency, is either the mean or the median. The mean is used for normally distributed data, while the median is used for data that do not follow the typical bell-shaped normal distribution. The measure of dispersion, or variability, is either the standard deviation or the range. The mean is reported with the standard deviation and the median is reported with the range. A standard method for reporting summary statistics is a summary statistics table, as illustrated in Table 1.

## Table 1. Standard summary table

| | Treated patients (N=38) | Placebo patients (N=32) |
|---|---|---|
| **Sex** | | |
| Male | 5 (13.2%) | 6 (18.8%) |
| Female | 33 (86.8%) | 26 (81.3%) |
| **Age** | | |
| Mean ± SD (n) | 59.5 ± 10.3 (38) | 57.6 ± 11.5 (32) |
| Median (range) | 60.0 (33.0 – 79.0) | 56.5 (37.0 – 75.0) |
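The age rows of a summary table of this kind can be produced with a few lines of code. The following is a minimal sketch using Python's standard library; the function name `summarise` and the age values are invented for illustration and are not taken from Table 1.

```python
import statistics


def summarise(values):
    """Return the location and dispersion measures reported in a summary table."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),      # sample SD (n - 1 denominator)
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
    }


# Hypothetical ages in years, invented purely for illustration.
ages = [45, 52, 58, 60, 61, 63, 67, 70, 74]
s = summarise(ages)

# Mean is paired with the standard deviation, median with the range.
print(f"Mean ± SD (n): {s['mean']:.1f} ± {s['sd']:.1f} ({s['n']})")
print(f"Median (range): {s['median']:.1f} ({s['min']:.1f} – {s['max']:.1f})")
```

Reporting both pairs, as Table 1 does, lets the reader judge whether the mean and median tell the same story.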

**Graphical methods**

Summary statistics reduce the distribution of the data to a few single-number estimates. Graphical methods such as graphs and charts give a visual presentation of the whole distribution. A box plot is a graphical representation of dispersion and extreme values. The box represents the inter-quartile range (the middle 50% of the data). A histogram plots the frequency of events, where the x-axis is in increasing order. Continuous data can be “binned” for histograms; binning is the process of taking continuous data and putting them into discrete (finite) groups. A histogram allows you to see the distribution of the data.
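As a rough illustration of binning and of the inter-quartile range, the sketch below groups hypothetical ages into decades and prints a text histogram. The data and the helper name `decade_bin` are assumptions made for the example. Software packages use slightly different quartile conventions, so the quartiles from `statistics.quantiles` may differ a little from those of other tools.

```python
import statistics
from collections import Counter

# Hypothetical ages in years, invented purely for illustration.
ages = [33, 41, 47, 52, 55, 58, 60, 62, 66, 71, 79]


def decade_bin(age):
    """Put a continuous age into a discrete decade group, e.g. 47 -> '40-49'."""
    start = (age // 10) * 10
    return f"{start}-{start + 9}"


# Frequency of observations per bin, printed in increasing order of the x-axis.
histogram = Counter(decade_bin(a) for a in ages)
for label in sorted(histogram):
    print(label, "#" * histogram[label])

# The box of a box plot spans the first to the third quartile.
q1, q2, q3 = statistics.quantiles(ages, n=4)
iqr = q3 - q1
print(f"Q1 = {q1}, median = {q2}, Q3 = {q3}, IQR = {iqr}")
```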

**P-value**

A p-value is a measure of how much evidence we have against the null hypothesis. The null hypothesis represents the hypothesis of no change or no effect. The smaller the p-value, the more evidence we have against the null hypothesis. It is also a measure of how likely we are to get a certain sample result or a result “more extreme”, assuming the null hypothesis is true. You should not interpret the p-value as the probability that the null hypothesis is true. Typically, we use a p-value of less than 0.05 to denote statistical significance.
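To make the definition concrete, the sketch below computes a two-sided p-value for a simple z-test of a mean. It assumes the standard deviation is known, which is a simplification (a t-test is the usual choice when it is estimated from the sample), and all the numbers are hypothetical.

```python
import math
from statistics import NormalDist


def two_sided_p_value(sample_mean, null_mean, sd, n):
    """Two-sided p-value for a z-test of a mean (SD assumed known)."""
    z = (sample_mean - null_mean) / (sd / math.sqrt(n))
    # Probability of a result at least this extreme, in either direction,
    # if the null hypothesis is true.
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical numbers: null mean 80, observed mean 84, SD 10, 25 subjects.
p = two_sided_p_value(sample_mean=84, null_mean=80, sd=10, n=25)
print(f"p = {p:.4f}")
print("statistically significant" if p < 0.05 else "not statistically significant")
```

Here the observed mean lies two standard errors from the null value, giving a p-value of about 0.046, just under the conventional 0.05 threshold.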

**Statistical terms**

The coefficient of determination (R²) is the proportion of the variation explained by the model. There is no universal R² value that is considered good. Fitting the model to the mean value at each level of X, rather than to the individual data, will artificially inflate R². The probability that you will reject the null hypothesis when it is true is called the Type I error (α). The probability that you will fail to reject the null hypothesis when it is false is called the Type II error (β). Power (1-β) is the probability that the test will reject the null hypothesis when it is false. The more power a study has, the higher the probability of correctly rejecting the null hypothesis. You can increase power by increasing the sample size for the test.
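The link between sample size and power can be sketched numerically. The example below assumes a one-sided z-test with a known standard deviation; the function name `power_one_sided` and the input values are hypothetical.

```python
import math
from statistics import NormalDist


def power_one_sided(delta, sd, n, alpha=0.05):
    """Power of a one-sided z-test to detect a true difference `delta`."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)   # critical value set by the Type I error
    shift = delta * math.sqrt(n) / sd           # true difference in standard-error units
    return NormalDist().cdf(shift - z_alpha)    # power = 1 - beta


# Hypothetical inputs: true difference of 5 units, standard deviation of 10 units.
for n in (10, 20, 40, 80):
    print(n, round(power_one_sided(delta=5, sd=10, n=n), 3))
```

When the true difference is zero, the same function returns α, which is the Type I error: the chance of rejecting a null hypothesis that is in fact true.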

**Sample size selection**

If we want to compare our test device to a predicate, the sample size calculation is based on the significance level (α) and the power (1-β). Decisions are often based on our analysis of a sample, so how we draw the sample is very important. It is desirable that the sample minimises bias, is representative of the population and is economical. The population (reality), though unknown, is estimated by the sample (decision). Table 2 illustrates the Type I and Type II errors. Most studies have either too few or too many samples, leading to false conclusions. To calculate a sample size, the Type I and Type II error rates must be specified, as must the historical variability (standard deviation) and the difference between the two devices that is deemed to be practically significant. Figure 1 provides the formula for calculating the sample size (n).

## Table 2. Sampling risks

| Decision | Reality: Accept | Reality: Reject |
|---|---|---|
| Accept | Correct decision | Type II error (β), consumer risk |
| Reject | Type I error (α), producer risk | Correct decision |

Figure 1. Formula for calculating the sample size (n).

Zα and Zβ are the standard normal values corresponding to the Type I and Type II error rates, respectively. S² is the historical variance. Δ² is the square of the minimum difference to be detected from the null hypothesis. As the effect size decreases, the sample size increases. As variability increases, sample size increases. Sample size also depends on the risks taken: the smaller the Type I and Type II error rates, the larger the sample size. As an example, consider comparing a new pain relief device to an existing device for pain relief. On a pain-relief scale of 1-100, the current device provides relief of 80 (100 being totally pain free). The new device needs to have a mean pain relief of 90. The standard deviation of the historical data is 10 units. Using Figure 1, the sample size would be 9 units of the test device.
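Figure 1 is not reproduced in this text. A formula consistent with the symbols defined above, and with the worked example, is the standard one-sample normal-approximation form

$$n = \frac{(Z_\alpha + Z_\beta)^2 S^2}{\Delta^2}$$

Treating this as the formula in Figure 1 is an assumption. The sketch below also assumes a one-sided α of 0.05 and a power of 90%, neither of which is stated in the example; with those values it reproduces the sample size of 9.

```python
import math
from statistics import NormalDist


def sample_size(alpha, power, sd, delta):
    """Sample size n = (Z_alpha + Z_beta)^2 * S^2 / delta^2, rounded up."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)   # one-sided Type I error (assumed)
    z_beta = NormalDist().inv_cdf(power)        # power = 1 - beta
    return math.ceil((z_alpha + z_beta) ** 2 * sd ** 2 / delta ** 2)


# Pain-relief example: detect a difference of 10 units with a historical SD of 10.
# alpha = 0.05 (one-sided) and power = 0.90 are assumptions, not values from the text.
n = sample_size(alpha=0.05, power=0.90, sd=10, delta=10)
print(n)

# Halving the difference to be detected roughly quadruples the sample size.
print(sample_size(alpha=0.05, power=0.90, sd=10, delta=5))
```

The second call shows the sensitivity described above: as the effect size decreases, the sample size increases.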

**Confidence and tolerance intervals**

Confidence intervals are statements about population parameters: we are x% confident that the interval includes the true population value. Tolerance intervals make a statement about the proportion of the population values with a fixed confidence. Therefore, we would say that x% of the population will be contained in the tolerance limits with y% confidence. Tolerance intervals are computed from the sample mean and sample standard deviation. A constant k is used such that the interval will cover p percent of the population with a certain confidence level. The general formula for a tolerance interval is as follows:

$$\bar{x} \pm k\,s$$

where $\bar{x}$ is the sample mean and $s$ is the sample standard deviation.

The value of k is based on the sample size (n), the proportion of the population to be covered (p) and the confidence level (1-α). The k factor can be obtained from the appropriate table contained in the International Organization for Standardization standard ISO 16269-6:2005 – Statistical Interpretation of Data. During the preparation of a 510(k), a confidence interval is used to set the interval on the mean value, while the tolerance interval is used to set the interval on the individual values of the data.
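The difference between the two intervals is easiest to see side by side. In the sketch below the summary values are hypothetical, the confidence interval uses a normal approximation (a t-based interval is more appropriate for small samples), and the k factor of 2.549 is taken to be approximately the two-sided value for n = 30, p = 0.95 and 95% confidence. That k value is an assumption here and should be confirmed against the table.

```python
import math
from statistics import NormalDist


def confidence_interval_for_mean(mean, sd, n, confidence=0.95):
    """Interval on the mean value (normal approximation)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * sd / math.sqrt(n)
    return mean - half_width, mean + half_width


def tolerance_interval(mean, sd, k):
    """Interval on the individual values: mean +/- k * sd."""
    return mean - k * sd, mean + k * sd


# Hypothetical summary statistics from a sample of 30 measurements.
MEAN, SD, N = 100.0, 5.0, 30
K_FACTOR = 2.549  # assumed table value for n = 30, p = 0.95, 95% confidence

ci = confidence_interval_for_mean(MEAN, SD, N)
tol = tolerance_interval(MEAN, SD, K_FACTOR)
print(f"95% confidence interval for the mean: {ci[0]:.2f} to {ci[1]:.2f}")
print(f"95%/95% tolerance interval: {tol[0]:.2f} to {tol[1]:.2f}")
```

The tolerance interval is much wider because it has to contain most individual values, not just the mean.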

**Equivalence and TOST**

The underpinning of a 510(k) is to show substantial equivalence, not necessarily superiority. What does it mean to be substantially equivalent? The typical hypothesis test for equivalence is that the difference between the predicate device and the test device is less than some value Δ. If the confidence interval for the difference is entirely within the bounds of (-Δ, Δ) then there is equivalence. This approach is called the two one-sided tests (TOST) procedure. Graphically this is shown in Figure 2.

The horizontal lines denote the four possible scenarios for statistical testing. Scenario 1 is when the 95% confidence interval (denoted by the horizontal line) contains the target and is entirely contained in the equivocal zone. In this case, statistical significance and scientific judgment agree. Scenario 2 is when the 95% confidence interval does not contain the target and, therefore, would be considered statistically different, though the 95% confidence interval is fully contained in the equivocal zone. In this case, we would judge the sample to be scientifically similar to the target. Scenario 3 is when the 95% confidence interval contains the target, so there is no statistically significant difference, but the interval is not fully contained in the equivocal zone. Here, because the variability is larger, we cannot conclude there is a statistical difference, but scientifically the difference may be too large. Scenario 4 is when the 95% confidence interval neither contains the target nor is contained in the equivocal zone, so the sample is not shown to be equivalent to the target.

In scenarios 1 and 4 the two criteria agree, while scenarios 2 and 3 have some discrepancy. In scenario 2, the precision is so high that the statistical test detects a difference, though in a practical sense the sample is similar to the target. Scenario 3 gives the most confusing conclusion. Since the confidence interval is not fully contained in the equivocal zone, one might increase sampling to narrow the confidence interval. The equivocal zone should be based on clinical relevance and not statistical significance.
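The four scenarios can be expressed as two simple checks on the confidence interval for the difference. The sketch below is illustrative only: the function name `classify`, the margin and the estimates are hypothetical, and the interval uses a normal approximation. It follows the 95% interval used in the text; note that a TOST carried out with each one-sided test at the 0.05 level corresponds to a 90% two-sided interval, which can be obtained by passing `confidence=0.90`.

```python
from statistics import NormalDist


def classify(diff, se, margin, confidence=0.95):
    """Return the scenario number (1 to 4) for an estimated difference.

    diff   -- estimated difference between test and predicate (target is 0)
    se     -- standard error of that difference
    margin -- half-width of the equivocal zone, set on clinical grounds
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    lower, upper = diff - z * se, diff + z * se
    contains_target = lower <= 0 <= upper                # no statistical difference
    inside_zone = -margin < lower and upper < margin     # equivalence shown
    if contains_target and inside_zone:
        return 1
    if inside_zone:
        return 2
    if contains_target:
        return 3
    return 4


# Hypothetical estimates, all judged against a margin of 5 units.
for diff, se in [(1, 1), (3, 0.5), (2, 3), (8, 1)]:
    print(f"difference {diff}, standard error {se}: scenario {classify(diff, se, margin=5)}")
```

In the third case the standard error is large, so the interval spills outside the zone even though it contains the target; a larger sample would narrow it.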

**Non-inferiority**

The “at least as good as” criterion, or non-inferiority, is a one-sided significance test to reject the null hypothesis that standard therapy is better than experimental therapy by a clinically acceptable amount. To demonstrate that a new device is “at least as good as” an existing device, a statistical test or confidence interval procedure must rule out clinical inferiority with a high probability. The non-inferiority hypothesis and sample size are attributable to Blackwelder^{1}.
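A confidence interval version of the non-inferiority test can be sketched as follows. It assumes that higher scores are better, uses a normal approximation for the one-sided lower confidence bound, and takes a hypothetical non-inferiority margin; none of the numbers come from the article or from Blackwelder's paper.

```python
from statistics import NormalDist


def non_inferiority(diff, se, margin, alpha=0.05):
    """One-sided non-inferiority check on diff = new device minus existing device.

    Non-inferiority is concluded when the lower confidence bound for the
    difference lies above -margin, ruling out clinical inferiority.
    """
    z = NormalDist().inv_cdf(1 - alpha)
    lower_bound = diff - z * se
    return lower_bound, lower_bound > -margin


# Hypothetical results: the new device scores 1 unit lower on average,
# with a standard error of 1.5 and a clinically acceptable margin of 5.
lower, is_non_inferior = non_inferiority(diff=-1.0, se=1.5, margin=5.0)
print(f"lower bound = {lower:.2f}, non-inferior: {is_non_inferior}")
```

With the same standard error, an observed difference of -3 would give a lower bound below -5, and non-inferiority could not be concluded.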

**Contingency tables**

A contingency table is used with binary, categorical or rank/order data to classify the predicate and test device. Typically, the contingency table is used to classify results from a binary diagnostic test (positive or negative). Table 3 shows a typical 2x2 contingency table where A is negative agreement and D is positive agreement. A chi-square (χ²) test can be used to test statistically whether the results from the two devices are associated.

## Table 3. A typical 2x2 contingency table

| | Treatment 2: Negative | Treatment 2: Positive |
|---|---|---|
| **Treatment 1: Negative** | A | B |
| **Treatment 1: Positive** | C | D |

A special test for paired data (the same sample or patient tested with two different devices) is McNemar's chi-square test. An alternative to the chi-square test is Fisher's exact test, which is a statistical significance test used in the analysis of contingency tables where sample sizes are small.
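The three tests can be sketched for a 2x2 table using only Python's standard library. The cell names follow Table 3; the counts are hypothetical, no continuity correction is applied, and the function names are illustrative. Statistical packages may apply corrections by default and so give slightly different results.

```python
import math


def chi_square_p_value(statistic):
    """Upper-tail p-value for a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2))


def pearson_chi_square(a, b, c, d):
    """Chi-square test of association for a 2x2 table (no continuity correction)."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return stat, chi_square_p_value(stat)


def mcnemar(b, c):
    """McNemar's test for paired data; uses only the discordant cells B and C."""
    stat = (b - c) ** 2 / (b + c)
    return stat, chi_square_p_value(stat)


def fisher_exact_two_sided(a, b, c, d):
    """Fisher's exact test: sum the probabilities of all tables with the same
    margins that are no more likely than the observed table."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):  # hypergeometric probability that the top-left cell equals x
        return math.comb(row1, x) * math.comb(row2, col1 - x) / math.comb(n, col1)

    observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    return sum(prob(x) for x in range(lowest, highest + 1)
               if prob(x) <= observed * (1 + 1e-9))


# Hypothetical paired results for 100 samples tested on both devices.
A, B, C, D = 40, 5, 15, 40
print("Chi-square:", pearson_chi_square(A, B, C, D))
print("McNemar:", mcnemar(B, C))

# Hypothetical small table, where the exact test is preferred.
print("Fisher exact p-value:", fisher_exact_two_sided(3, 1, 1, 3))
```

McNemar's test ignores the concordant cells A and D because, for paired data, only the samples on which the two devices disagree carry information about a difference between them.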

**Software tools**

There are many different software tools available in the marketplace for conducting these statistical tests and analysis. The following is a non-exhaustive list of tools: JMP from SAS; Minitab; NCSS-PASS for sample size; programming languages such as SAS and R; StatXact; and SPSS.

# Conclusions

Statistics used in 510(k) submissions for medical devices might seem trivial. But using the wrong statistical method, or making the wrong assumption, can delay a company from gaining regulatory approval from the FDA or can lead to a submission being unnecessarily halted. A basic understanding of the statistical methods used by the agency and the data it requires can help even non-statisticians understand how to show equivalence in a 510(k) submission.

*References*

1. Blackwelder W, Proving the null hypothesis in clinical trials, *Controlled Clinical Trials*, 1982, **3**(4), 345-353

*Steven Walfish\* is the president of Statistical Outsourcing Services, a consulting company that provides statistical analysis and training to FDA-regulated firms. Statistical Outsourcing Services is based in Olney, Maryland. Telephone: +1 301 325 3129. Email: steven@statisticaloutsourcingservices.com.*