Design Your Experiments, Part III:
Noise Characteristics and Measurement
by Kevin Kilty
As I advised at the end of the previous installment of this series, even though noise is a constant impediment to our search for truth, there is no reason to despair over it. We can handle noisy experiments if we learn to think statistically. In this installment I begin doing this by examining the following topics:
- Characteristics of noise in general.
- The statistical description of noise.
- The effect of noise on experimental design.
What to expect of noise
As a model for exploring the nature of noise, I'll propose a simple one, Yo = Y + n, where Yo is the value I obtain from my measurement or experiment, Y is the true value (which, of course, I don't know), and n is some additional value that has corrupted my measurement. You can call it error or noise. What should I expect of the noise values? This is straightforward to answer. As long as all conditions surrounding my measurement stay constant, I expect that there is some probability density, call it N, that describes the noise. A probability density is a function that assigns a probability to each range of possible noise values; the probability that n falls within a given range is the area under the density over that range. I do not need to know the exact density at this time, but I do need to suppose a few of its characteristics.
- The mean value of N should be zero. If it is not, then there is a constant difference between a best estimate based on experiment and the true value. This means that there is a systematic bias in the experiment. Experimental design ought to identify possible systematic errors and eliminate them.
- Each n ought to be independent and random. If this is not true, then I need some measure of how one is correlated with another.
- I should strive to make the spread of N, and therefore the typical size of each n, as small as possible. This makes my experimental result precise.
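The noise model above can be explored with a small simulation. This is a minimal sketch, assuming Gaussian noise purely for convenience; the true value, noise spread, bias, and the helper name `simulate_measurements` are all hypothetical. It illustrates the first point in the list: averaging many readings drives the random part of n toward zero, but a nonzero mean of N (a systematic bias) survives averaging untouched.

```python
import random
import statistics


def simulate_measurements(true_value, noise_sd, bias, count, seed=0):
    """Return `count` simulated readings Yo = Y + n.

    Each noise value n is drawn independently from a normal density with
    mean `bias` and standard deviation `noise_sd` (Gaussian is an assumption).
    """
    rng = random.Random(seed)
    return [true_value + rng.gauss(bias, noise_sd) for _ in range(count)]


Y = 10.0  # hypothetical true value
unbiased = simulate_measurements(Y, noise_sd=0.5, bias=0.0, count=10_000)
biased = simulate_measurements(Y, noise_sd=0.5, bias=0.3, count=10_000)

# Averaging shrinks the random part of n, but a constant offset survives.
print(statistics.fmean(unbiased) - Y)  # near 0
print(statistics.fmean(biased) - Y)    # near the 0.3 bias
```

No amount of replication removes the 0.3 offset; only better design or calibration can.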
What replications of an experiment provide is a series of observed values that cluster around a central value. Statisticians call this central tendency. Because our experiment is well designed and executed, we assume that the central value is our best estimate of the true value of the quantity we are looking for. This is the value that we will report. I am going to assume throughout this installment that our experiment is calibrated, so that its readings center on the true value. I'll discuss calibration separately at a later time.
Because each measurement contains an unknown n, my result has limited value unless I also report a measure of n. One question to answer is how to report it. The National Institute of Standards and Technology (NIST) suggests two ways of doing this. The first, which NIST calls a Type A evaluation, is to report the estimated standard deviation of the mean of replicated measurements as the standard uncertainty of the measurement result. Those of you familiar with statistics will recognize this as the standard error of a sample mean. Take $\{Y_{o,i}\},\ i = 1,\dots,n$ as the set of measured values from replicated experiments, and $\langle Y\rangle$ as the mean measured value. (Here n is the number of measurements, not the noise term of the model above.) Then
$$u = \sqrt{\frac{\sum_{i=1}^{n}\left(Y_{o,i}-\langle Y\rangle\right)^2}{n(n-1)}}$$
is the standard uncertainty of these measurements.
As an example, suppose I measure the speed of light 5 times with some apparatus and get the following values in km/s: 299,034, 300,006, 298,510, 299,435, and 299,987. Then the best value to report for the speed of light, with the uncertainty in this estimate, is 299,394.4 ± 286.3 km/s. Once the uncertainty is included, quibbling over whether to report the digit beyond the decimal point becomes pointless: with an uncertainty of nearly 300 km/s, even the hundreds digit is in doubt.
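The standard-uncertainty formula and the speed-of-light example can be checked with a few lines of Python. The function name `standard_uncertainty` is my own, hypothetical choice; it simply implements the formula for u given earlier, dividing the sum of squared deviations by n(n-1).

```python
import math


def standard_uncertainty(values):
    """Return (mean, u) where u is the standard error of the mean:

    u = sqrt( sum((Yo_i - <Y>)**2) / (n*(n-1)) ), with n = len(values).
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two replicated measurements")
    mean = sum(values) / n
    sum_sq = sum((y - mean) ** 2 for y in values)
    return mean, math.sqrt(sum_sq / (n * (n - 1)))


speeds = [299_034, 300_006, 298_510, 299_435, 299_987]  # km/s, from the example
mean, u = standard_uncertainty(speeds)
print(f"{mean:.1f} ± {u:.1f} km/s")
```

It prints 299394.4 ± 286.3 km/s, matching the figures above.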
Calculating uncertainty directly from measurements this way is very useful, but it does obscure the idea that noise, or random error if you will, follows some probability density. The standard error provides a complete description of the noise density only if the noise is Gaussian, i.e., follows a normal probability distribution. This might not always be so, and in a future installment I'll speak at length about probability densities and distributions.
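To see why a standard deviation alone can underdescribe noise, compare three zero-mean densities scaled to the same standard deviation sigma: normal, uniform, and Laplace (a two-sided exponential, chosen here only for illustration). This sketch computes, for each, the exact probability that a single noise value lies more than k sigma from zero.

```python
import math


def tail_beyond(k):
    """P(|n| > k*sigma) for three zero-mean densities sharing one sigma."""
    normal = math.erfc(k / math.sqrt(2))
    # Uniform on [-a, a] has sigma = a/sqrt(3), so a = sqrt(3)*sigma.
    uniform = max(0.0, 1 - k / math.sqrt(3))
    # Laplace with scale b has sigma = b*sqrt(2), and P(|n| > x) = exp(-x/b).
    laplace = math.exp(-k * math.sqrt(2))
    return normal, uniform, laplace


for k in (1, 2, 3):
    print(k, ["%.4f" % prob for prob in tail_beyond(k)])
```

At 2 sigma the normal density leaves about 4.6% in the tails, the uniform density leaves nothing, and the Laplace density leaves about 5.9%, even though all three share the same standard uncertainty.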
Calculating uncertainty from replicated measurements is fine when we have access to them, but what do we do otherwise? NIST suggests a second reporting method, a Type B evaluation, in which a standard uncertainty is obtained from the square root of a variance determined by means other than statistical analysis of repeated observations. This is very vague, but also very interesting, for it suggests using prior knowledge about measurements, models of noise sources, equipment, and even experimental conditions to calculate and report uncertainty.
Once again let me provide an example. A photomultiplier counts photons. If I use one to scan across a spectral peak, each count it makes has some associated uncertainty. I could figure this uncertainty by making numerous photon counts at one setting of my spectrometer and calculating the standard uncertainty. At low count rates, however, I could spend forever obtaining enough replicates to calculate the uncertainty directly. Yet photon counting is a Poisson process, whose variance equals its mean count, so by knowing the theoretical density of this particular noise process I can estimate the uncertainty from the count itself: a count of N photons has a standard uncertainty of about sqrt(N). What matters about an estimate of uncertainty is not so much how it was obtained as that the method is defensible.
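A short simulation shows why the Poisson model saves so much data collection. It is a sketch under assumptions: the mean count of 20 per spectrometer setting is hypothetical, and the helper `poisson_count` draws counts with Knuth's multiplication method (adequate for small rates) because Python's `random` module has no Poisson sampler. The spread of many replicated counts comes out close to sqrt(20), about 4.47, which is what the sqrt(N) of a single count estimates without any replication.

```python
import math
import random
import statistics


def poisson_count(rate, rng):
    """Draw one Poisson-distributed photon count (Knuth's method, small rates)."""
    limit = math.exp(-rate)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def counting_uncertainty(count):
    """Standard uncertainty of a single Poisson count: sqrt(count)."""
    return math.sqrt(count)


rng = random.Random(1)
rate = 20.0  # hypothetical mean count at one spectrometer setting
counts = [poisson_count(rate, rng) for _ in range(5000)]

# The slow way: replicate many counts and measure their spread.
print(statistics.stdev(counts))
# The quick way: one count plus the Poisson model.
print(counting_uncertainty(counts[0]))
```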
Peter Baum suggested an even better example to me recently.
Suppose that I am measuring some quantity that varies little from one experiment to another, and I am doing it with an instrument of low resolution. All of my measurements are practically the same, and when I get around to calculating a standard uncertainty from the replications I get a very tiny value. Have I made an impressively precise measurement? Of course not. This is exactly what I mean by a standard uncertainty that is not defensible. What I must do in such a situation is use an estimate of measurement uncertainty, such as a uniform noise (error) density whose width equals one unit of my instrument's least significant digit (call it e), and report the standard uncertainty as e/sqrt(12n). The factor sqrt(12) comes from the standard deviation of a uniform density of width e, which is e/sqrt(12), and dividing by sqrt(n), where n is the number of independent measurements I made, makes this a standard error of the mean. That division is justified only if the rounding errors really are independent from one reading to the next; when every reading is identical, the rounding error is the same each time and averaging does not reduce it, so e/sqrt(12) is the safer figure.
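The uniform-density figure is easy to check numerically. This sketch assumes a hypothetical resolution e = 0.01; the helper `resolution_uncertainty` is my own name, and it defaults to n = 1, so the division by sqrt(n) happens only when the caller asserts that the rounding errors are independent. The simulated rounding errors, spread uniformly over one resolution step, have a standard deviation very close to e/sqrt(12), about 0.00289.

```python
import math
import random
import statistics


def resolution_uncertainty(e, n=1):
    """Standard uncertainty from a reading resolution e, modeled as a uniform
    error density of width e: e / sqrt(12 * n). Pass n > 1 only when the
    rounding errors of the n readings are independent."""
    return e / math.sqrt(12 * n)


e = 0.01  # hypothetical least significant digit, e.g. 0.01 V
rng = random.Random(2)
# Rounding errors spread uniformly over one resolution step.
errors = [rng.uniform(-e / 2, e / 2) for _ in range(100_000)]
print(statistics.pstdev(errors), resolution_uncertainty(e))
```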
Now that I know how to figure
uncertainty, how should I interpret it, and what effect does it have on
experimental design?
The first part of the question is easy to answer. The distribution of noise is unknown, but as long as it has a finite variance and I average enough independent measurements, I can invoke the central limit theorem and treat the density of the mean value as approximately normal, with a standard deviation of u as calculated above. Therefore, the true value, Y, is within ±u of the mean value of this experiment with 68% certainty. (With only a handful of measurements, as in my speed-of-light example, u is itself rather uncertain, and Student's t distribution gives a somewhat wider interval for the same certainty.) What this percentage certainty really means is difficult to assess. Statisticians of the Bayesian school object that it says nothing about the probability that Y lies in the one interval I actually obtained. The idea is, though, that if a person were to make many repetitions of the experiment I described in my example of 5 measurements of the speed of light, then in 68% of these the true value (Y) would fall within an interval of plus or minus u around each mean value, or <Y>±u. If more certainty is needed, then the interval <Y>±2u would be about 95% certain, <Y>±3u would be 99.7% certain, and so forth.
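The 68%, 95%, and 99.7% figures, and the repeated-experiment reading of them, can be reproduced with a sketch like this. It assumes normal noise with a known standard deviation, so that u = sigma/sqrt(n) exactly; the true value, sigma, and number of trials are hypothetical. The loop counts how often <Y> ± u captures Y across many simulated 5-measurement experiments.

```python
import math
import random


def normal_coverage(k):
    """Probability that a normal variable lies within ±k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(k, round(normal_coverage(k), 4))

# Frequency interpretation: repeat a hypothetical 5-measurement experiment many
# times and count how often the interval <Y> ± u contains the true value.
rng = random.Random(3)
true_Y, sigma, n = 100.0, 2.0, 5  # hypothetical values
u = sigma / math.sqrt(n)          # sigma known, so the mean is exactly normal
trials = 20_000
hits = 0
for _ in range(trials):
    mean = sum(rng.gauss(true_Y, sigma) for _ in range(n)) / n
    if abs(mean - true_Y) <= u:
        hits += 1
print(hits / trials)  # near 0.68
```

The printed coverages are 0.6827, 0.9545, and 0.9973, and the simulated fraction lands near 0.68. If u were instead estimated from each set of 5 readings, the fraction would drop noticeably, which is the small-sample effect that Student's t distribution accounts for.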
The second part of the question
gets to the heart of design, but is complex to explain. Remember that
we do not actually know the true value Y. The whole point of the experiment
is to estimate it as accurately as possible. The initial design of the
experiment has two aims. First, we rid the experiment of systematic bias
and make sure our method is calibrated. This ensures that our measurements
tend to center on the true value. Second, we need to make the uncertainty
of the central value (typically the mean) small enough to accomplish the experimental
objective.
Often we organize an experiment
to distinguish between two alternatives. For example, we may want to see
if a treatment is 50% effective at preventing a disease. One alternative
is commonly the nominal or null hypothesis, so called because
it asserts that the treatment has no effect. The other is the alternative
hypothesis that the treatment is, in this example, 50% effective.
Because of unavoidable experimental noise, it may not be possible to distinguish
between these hypotheses unless we design carefully in advance.
This brings me to the idea
of the power of an experiment and what it means for design.
Nominal incidence of a disease implies a certain probability of incidence.
A 50% effective treatment means that incidence of the disease is cut to
one-half of the nominal probability in a group that gets treated. Power
of an experiment is a probability measure and has to be between 0 and
100% as a result. A power of 90% means that if the treatment is truly 50% effective, our experimental design has a 90% chance of detecting it. Simply put, we can detect an effective treatment only if two things happen: 1) the incidence in the treated group is less than the nominal incidence, and 2) the uncertainty of the observed incidence is small enough that we are unlikely to mistake it for the nominal incidence. In other words, we have to be able to reject the nominal hypothesis. The one thing that affects both the uncertainty and the probability of rejecting the nominal hypothesis when it is in fact false, and which is also under our control, is the number of replications of the experiment, or the size of the control and test groups.
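The dependence of power on group size can be sketched with the normal approximation to the binomial. This is an illustration under stated assumptions, not the analysis of any real trial: the nominal incidence p0 is treated as known exactly (so only the treated group is sampled), the rejection threshold sits 1.96 standard deviations below the expected nominal count, and the incidences of 20% and 10% and the helper name `power` are hypothetical.

```python
import math
from statistics import NormalDist


def power(n, p0, p1, z_alpha=1.96):
    """Approximate power to detect a lowered incidence p1 < p0 in a group of n.

    Assumptions (illustrative): the nominal incidence p0 is known exactly,
    case counts are binomial, and both counts are approximated as normal.
    The nominal hypothesis is rejected when the observed count falls below
    n*p0 - z_alpha*sqrt(n*p0*(1 - p0)).
    """
    threshold = n * p0 - z_alpha * math.sqrt(n * p0 * (1 - p0))
    treated = NormalDist(mu=n * p1, sigma=math.sqrt(n * p1 * (1 - p1)))
    return treated.cdf(threshold)


# Hypothetical incidences: 20% nominal, halved to 10% by the treatment.
for n in (50, 100, 200):
    print(n, round(power(n, 0.20, 0.10), 3))
```

Power climbs from roughly 40% at n = 50 to about 76% at n = 100 and about 98% at n = 200, which is the sense in which group size is the design variable under our control.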
Another numerical example is in order. Let's take the 1954 trials of the Salk vaccine as a great historical example. The nominal incidence of polio in the 1950s was about 0.0003, or 30 per hundred thousand. Jonas Salk wanted to estimate the size of the test and control groups required to detect 50% effectiveness of his vaccine at 95% confidence with 95% power. Polio incidence is a binomial process: a person either gets the disease or does not, and this dichotomy of outcomes is truly binomial. Unfortunately, the binomial distribution is inconvenient to compute with for populations this large. However, when the expected number of successes (here, cases) is at least 15 or so, the binomial density is nearly a normal density with mean value np, where n is the number of trials (the at-risk population size in this case) and p is the probability of incidence per trial (0.0003). The binomial variance is np(1-p), and its square root serves as my uncertainty "calculated by other means," in the vernacular of NIST.
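How good is the normal approximation at Salk's scale? This sketch compares the exact binomial probability of seeing at most k cases with the normal approximation, for n = 150,000 and p = 0.0003 (about 45 expected cases). The exact sum is done in log space with `math.lgamma` to avoid enormous factorials, and the normal version adds a half-count continuity correction, a standard refinement that is my addition rather than part of the text. The two columns agree to within about a percentage point.

```python
import math
from statistics import NormalDist


def binomial_cdf(k, n, p):
    """Exact P(X <= k) for X ~ Binomial(n, p), summing the pmf in log space."""
    total = 0.0
    for j in range(k + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(j + 1) - math.lgamma(n - j + 1)
                   + j * math.log(p) + (n - j) * math.log1p(-p))
        total += math.exp(log_pmf)
    return total


def normal_approx_cdf(k, n, p):
    """Normal approximation: mean np, variance np(1-p), continuity-corrected."""
    return NormalDist(mu=n * p, sigma=math.sqrt(n * p * (1 - p))).cdf(k + 0.5)


n, p = 150_000, 0.0003  # about 45 expected cases
for k in (35, 45, 55):
    print(k, round(binomial_cdf(k, n, p), 4), round(normal_approx_cdf(k, n, p), 4))
```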
Salk reasoned as follows. The nominal incidence of polio is 0.0003, and there is some uncertainty in a control group that might produce a slightly different result. A treated group of 50% effectiveness would have an incidence of 0.00015, again with some uncertainty that might produce a slightly different result. As long as the observed incidence of the treated group is farther than 1.96u from the observed incidence in the control, he would reject the nominal hypothesis, because the central 95% of a normal density lies within ±1.96u of its mean. The required size of u is that which will separate the expected values at nominal and one-half nominal incidence by the combined 95% regions around both the expected nominal and the expected treated result. This way, if his vaccine really is 50% effective, the treated group's result falls outside the acceptance region of the nominal group at least 95% of the time, while a result that merely reflects nominal incidence falls outside it only 5% of the time. Therefore, an experiment of 95% power requires that
expected difference in case counts > 95% half-width for the treated group + 95% half-width for the nominal group
or
$$0.5\,np > 1.96\sqrt{0.5\,np\,(1-0.5\,p)} + 1.96\sqrt{np\,(1-p)}$$
The only unknown is n, and solving for it shows that about 150,000 people in each group would just suffice. I hope your amateur experiments never need so many.
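The final inequality can be solved in closed form, because dividing both sides by sqrt(n) leaves a condition that is linear in sqrt(n). This sketch does that for the text's numbers; the function name `required_group_size` and its parameters are my own. With 1.96 on both sides it gives roughly 149,000 per group, in line with the 150,000 quoted above.

```python
import math


def required_group_size(p0, effectiveness, z_nominal=1.96, z_treated=1.96):
    """Smallest n meeting the separation condition from the text:

        n*(p0 - p1) > z_treated*sqrt(n*p1*(1 - p1)) + z_nominal*sqrt(n*p0*(1 - p0))

    with p1 = (1 - effectiveness) * p0. Dividing by sqrt(n) makes the
    condition linear in sqrt(n), so it can be solved directly.
    """
    p1 = (1 - effectiveness) * p0
    spread = z_treated * math.sqrt(p1 * (1 - p1)) + z_nominal * math.sqrt(p0 * (1 - p0))
    root_n = spread / (p0 - p1)
    return math.ceil(root_n ** 2)


print(required_group_size(0.0003, 0.5))                   # about 149,000 per group
print(required_group_size(0.0003, 0.5, z_treated=1.645))  # about 130,000 per group
```

One design note: only the lower tail of the treated group's density matters for power, so under the normal approximation a one-sided 1.645 on the treated side corresponds to 95% power, while 1.96 there, as in the inequality above, is slightly conservative (about 97.5%). The second call shows the smaller group size that results.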