Dave Verbyla
Email: D.Verbyla@uaf.edu
and Tim Hammond
Department of Forest Sciences
University of Alaska Fairbanks
ABSTRACT
There are many sources of both conservative and optimistic bias in classification
accuracy assessment, many of which are impossible to avoid. Bias occurs
when a classification accuracy estimate is optimistic or conservative. There are
at least three significant sources of conservative bias: errors
in reference data, positional errors, and minimum mapping unit of reference
grid. There are also at least three significant sources of optimistic
bias: use of training data for accuracy assessment, restriction of reference
data sampling to homogeneous areas, and sampling of reference data not
independent of training data. The magnitude and direction of bias in classification
accuracy estimates depend on the methods used for classification and reference
data sampling. Therefore, simply reporting an error matrix and associated
classification accuracy estimates is not enough. The error matrix is practically
meaningless unless methods are reported in sufficient detail to enable
readers to assess the potential for bias in the classification accuracy
estimates.
INTRODUCTION
A common question in digital satellite remote sensing is: "How accurate
is the classification?" Visual inspection of a classified image can be
misleading. For example, a silviculturist may look at some favorite spruce
stands and see that they were all correctly classified, while a moose biologist
may see that many critical willow stands were misclassified. These two
biologists would have very different conclusions about the accuracy of
the classification. Fortunately, there has been significant research on
classification accuracy assessment techniques over the last decade.
THE ERROR MATRIX
Here is a simple example. Imagine that you have classified an image
from a small area along the Tanana River in interior Alaska. The major
cover types within your study area are balsam poplar (BP), black spruce
(BS), alder/willow (AW), white spruce (WS), and water (W). For each of the
classified cover types you establish at least 30 random sample points.
You then visit each random sample point in the field and verify the actual
cover type (recorded as reference data). You then produce an error matrix,
by comparing your classifications with your reference data (Table 1). The
columns of the error matrix represent the actual "ground truth" from field
verification of each random sample point, while the rows represent the
predicted classes for the random sample points. The overall classification
accuracy can be computed as the total number of correct class predictions
(the sum of the diagonal cells) divided by the total number of sample points
(the sum of all cells). In our example, the overall classification accuracy
is (40 + 30 + 25 + 50 + 32) / 200 = 177 / 200, or about 88 percent.
Table 1. A simple example of an error matrix (columns: reference "ground truth" classes; rows: classified classes; with totals).
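The overall accuracy calculation described above can be sketched in a few lines of Python. This is only an illustration: the diagonal values and the total of 200 sample points come from the text, but the off-diagonal counts in example_matrix are invented so that the cells sum to 200, and the order of the classes is an assumption.

```python
def overall_accuracy(matrix):
    """Sum of the diagonal cells divided by the total number of sample points."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Hypothetical 5 x 5 error matrix (rows: classified, columns: reference).
# Only the diagonal (40, 30, 25, 50, 32) and the grand total (200) are taken
# from the text; every off-diagonal count is made up for illustration.
example_matrix = [
    [40, 3, 2, 1, 0],
    [2, 30, 1, 3, 0],
    [3, 1, 25, 1, 1],
    [1, 2, 1, 50, 0],
    [0, 0, 1, 0, 32],
]

print(f"{overall_accuracy(example_matrix):.3f}")  # 0.885
```

The result, 177 / 200, is the "about 88 percent" quoted in the text.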
HOW TO LIE WITH THE ERROR MATRIX?
Believe it or not, it is impossible not to lie with an error matrix. Here is why. Classification accuracy assessment rests on three crucial assumptions: 1) the reference data are truly representative of the entire classification (unlikely), 2) the reference data and classified image are perfectly co-registered (impossible), and 3) there is no error in the reference data (unlikely).
The actual accuracy of our classification is unknown because it is impossible
to perfectly assess the true class of every pixel. It is possible to produce
a misleading assessment of classification accuracy. Depending on how the
reference data are collected, our estimate of accuracy may be either conservative
or optimistic. If our estimate is less than the actual classification accuracy,
then we have made a conservative estimate; if it is greater, we have made an
optimistic one. An analogy is stating that you could predict the flipping of a
coin with 10 percent accuracy: your actual accuracy is higher (50 percent),
and therefore your estimate is conservative.
SOURCES OF CONSERVATIVE BIAS
There are at least three sources of conservative bias in accuracy assessment:
1) Errors in Reference Data
2) Positional Errors
3) Minimum mapping unit of reference grid
Errors in Reference Data
We usually assume that our "ground truth" data are perfect. However,
if there are any errors in our reference data (such as incorrect
class assignment, change in cover type between the time of imaging and the
time of field verification, mistakes in recording or processing the reference
data, etc.), some of our correctly classified pixels may be incorrectly
assessed as being misclassified. For example, if we had a perfect classification,
but ten percent of our ground-truth samples were incorrect, we would estimate
our classification to be 90 percent accurate (while in this example it
is actually 100 percent).
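The arithmetic in this example can be checked with a small sketch. The class labels and the sample of 100 points are assumptions made for illustration; the classification is perfect by construction, and exactly ten percent of the reference labels are deliberately corrupted.

```python
def estimated_accuracy(classified, reference):
    """Fraction of sample points where the classification matches the reference."""
    matches = sum(1 for c, r in zip(classified, reference) if c == r)
    return matches / len(classified)


# Hypothetical sample of 100 points from two cover types.
true_classes = ["WS"] * 50 + ["BS"] * 50
classified = list(true_classes)   # a perfect classification
reference = list(true_classes)    # "ground truth" as recorded in the field

# Corrupt every tenth reference label (10 of 100) to mimic field or recording errors.
for i in range(0, 100, 10):
    reference[i] = "BS" if reference[i] == "WS" else "WS"

actual = estimated_accuracy(classified, true_classes)
estimated = estimated_accuracy(classified, reference)
print(actual, estimated)  # 1.0 0.9
```

The classification has not changed; only the reference data have, yet the reported accuracy drops from 100 to 90 percent.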
Positional Errors
Because images are always rectified to a certain tolerance of positional
error, we cannot be absolutely sure of the location of any given pixel.
For example, if we rectified an image to 30-meter output pixels with an
RMS error of 1.0 pixel, the average positional error of the rectification
model is +/- 30 meters. Some pixels will have positional accuracy
better than this and some pixels will have positional accuracy that is
worse than this. Therefore, despite popular belief, even if we use sub-meter
GPS technology to navigate us to the center coordinates of a pixel to be
"ground truthed", we cannot be certain that we are in that pixel! Because
of the positional error inherent in rectified images, some correctly classified
pixels may not be correctly located during field sampling. This problem
of positional error will also lead to a conservative estimate of classification
accuracy. We have used simulation to show that even with an actual classification
accuracy of 100 percent, the estimated classification accuracy can be less
than 50 percent (Verbyla and Hammond 1995).
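The effect can be illustrated with a toy example, which is not the simulation reported in Verbyla and Hammond (1995). It assumes a synthetic two-class map laid out in square patches, a perfect classification, and reference data that are always looked up a fixed number of pixels to the east of the intended pixel (wrapping around at the map edge). The function names and map sizes are hypothetical.

```python
def block_map(size, block):
    """Square map of two classes (0 and 1) arranged in block x block patches."""
    return [[((r // block) + (c // block)) % 2 for c in range(size)]
            for r in range(size)]


def accuracy_with_offset(true_map, dx):
    """Agreement between a perfect classification and reference data that are
    read dx pixels east of the intended pixel (wrapping at the map edge)."""
    size = len(true_map)
    hits = 0
    for r in range(size):
        for c in range(size):
            classified = true_map[r][c]                # classification is perfect
            reference = true_map[r][(c + dx) % size]   # field visit is dx pixels off
            hits += classified == reference
    return hits / (size * size)


# A one-pixel positional error on 16 x 16 maps with different patch sizes.
for block in (8, 4, 2, 1):
    print(block, accuracy_with_offset(block_map(16, block), dx=1))
# 8 0.875
# 4 0.75
# 2 0.5
# 1 0.0
```

In this sketch the estimated accuracy falls as the map becomes more heterogeneous, even though the classification itself is perfect in every case.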
Minimum Mapping Unit Area
Sometimes aerial photography is interpreted as a substitute for field
collection of reference data. Because aerial photography is visually interpreted
as polygons, the minimum mapping unit area used in the interpretation can
also lead to conservative estimates of classification accuracy. For example,
if you interpret aerial photographs to a minimum mapping unit area of 1
hectare, a classified image of 30-meter (0.09 ha) pixels may contain many
small clusters of single-class pixels that were too small to be included
in the interpretation of the aerial photography. This will lead to a conservative
estimate of classification accuracy. We have demonstrated that this conservative
bias can be greater than 50 percent and is especially large if a large
minimum mapping unit is used or if the classified image is spatially heterogeneous
in terms of classes (Verbyla and Hammond 1995).
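A minimal sketch of the minimum mapping unit effect follows. It assumes a hypothetical 10 by 10 pixel map of white spruce containing a single 2 by 2 pixel patch of alder/willow, a perfect classification, and a photo interpreter who dissolves any patch smaller than the 1-hectare minimum mapping unit into the surrounding class.

```python
PIXEL_SIZE_M = 30
MMU_HA = 1.0

pixel_area_ha = PIXEL_SIZE_M ** 2 / 10_000   # 900 square meters = 0.09 ha
pixels_per_mmu = MMU_HA / pixel_area_ha      # about 11.1 pixels per 1-ha unit

SIZE = 10
PATCH = [(r, c) for r in (4, 5) for c in (4, 5)]   # 2 x 2 alder/willow patch

true_map = [["WS"] * SIZE for _ in range(SIZE)]
for r, c in PATCH:
    true_map[r][c] = "AW"

classified = [row[:] for row in true_map]    # a perfect classification
reference = [row[:] for row in true_map]     # photo-interpreted reference grid

# The patch (4 pixels, 0.36 ha) is below the minimum mapping unit, so the
# interpreter maps it as part of the surrounding white spruce polygon.
if len(PATCH) < pixels_per_mmu:
    for r, c in PATCH:
        reference[r][c] = "WS"

agreement = sum(classified[r][c] == reference[r][c]
                for r in range(SIZE) for c in range(SIZE)) / SIZE ** 2
print(round(pixels_per_mmu, 1), agreement)  # 11.1 0.96
```

Here four correctly classified pixels are scored as errors. The more small patches a classified image contains, the larger this conservative bias becomes.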
SOURCES OF OPTIMISTIC BIAS
It is possible to have a poor classification and yet report a high classification
accuracy estimate. There are at least three sources of this optimistic
bias.
1) Using training data for accuracy assessment.
2) Sampling of reference data not independent of training data sampling.
3) Sampling from homogeneous groups of pixels.
Training fields as reference data
It is tempting to minimize cost and time by using the data collected
from the training field areas as reference data. This can lead to optimistic
estimates of classification accuracy for at least two reasons. First,
the pixels selected for training field data are usually selected because
they are from relatively homogeneous areas---areas that are typically easier
to correctly classify compared to other areas in the image. For example,
we might estimate our overall classification accuracy to be 100 percent
if we selected our reference data from training fields that encompassed
large, pure, level stands. However, if much of our image is a heterogeneous
mixture of vegetation types of variable canopy structure and variable topographic
conditions, our classified image is probably not really 100 percent accurate.
Second, because we are developing a statistical model with the training
field data, estimates of classification accuracy are likely to be optimistically
biased because we use the same data for model development and model validation.
For example, Verbyla (1986) developed a classifier with random numbers
that had an estimated classification accuracy of 95 percent, even though
the classifier was based on nonsense random-number data.
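The second reason can be illustrated with a small sketch. It does not reproduce the discriminant analysis in Verbyla (1986); instead it swaps in a one-nearest-neighbor classifier, which is an assumption chosen because it makes the effect easy to see. The "spectral" values and class labels are pure random numbers, and the sample sizes, band count, and names are hypothetical.

```python
import random


def nearest_neighbor_label(x, train_x, train_y):
    """Label of the training sample closest to x (squared Euclidean distance)."""
    best = min(range(len(train_x)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(x, train_x[i])))
    return train_y[best]


def accuracy(xs, ys, train_x, train_y):
    """Fraction of samples whose predicted label matches the recorded label."""
    hits = sum(nearest_neighbor_label(x, train_x, train_y) == y
               for x, y in zip(xs, ys))
    return hits / len(xs)


rng = random.Random(0)  # fixed seed so the sketch is repeatable


def random_sample(n, bands=4):
    """Nonsense data: random band values paired with random class labels."""
    xs = [[rng.random() for _ in range(bands)] for _ in range(n)]
    ys = [rng.choice(["spruce", "willow"]) for _ in range(n)]
    return xs, ys


train_x, train_y = random_sample(100)
test_x, test_y = random_sample(400)

# Assessing the classifier with its own training data versus independent data.
resubstitution = accuracy(train_x, train_y, train_x, train_y)
independent = accuracy(test_x, test_y, train_x, train_y)
print(resubstitution, independent)
```

Assessed with its own training data, this classifier appears to be 100 percent accurate, because each training sample is its own nearest neighbor. Assessed with independent data, it performs about as well as flipping a coin.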
Reference Data Sampling Not Independent of Training Data Sampling
Some researchers have sampled reference data while sampling for training
fields. This approach is wrong and will lead to an optimistic estimate
of classification accuracy. Typically training fields are selected because
they are relatively spectrally pure and therefore easy to classify. If
reference areas are sampled that are near training fields, they are likely
to also be relatively easy to classify correctly. Some researchers have
randomly reserved some training field pixels as reference data and excluded
these pixels from training the classifier. This approach is also wrong
because the "reference" data are spectrally correlated to the training
data. We have demonstrated that such an approach can lead to inflated classification
accuracy estimates of up to 100 percent (Hammond and Verbyla 1996).
Sampling from homogeneous blocks of classified pixels
Because of the positional uncertainty inherent in rectified images,
a common approach for reference data is to sample the coordinates of the
center pixel of a 3 by 3 group of pixels belonging to the same class. However,
this can lead to an optimistic estimate of classification accuracy since
homogeneous areas are selected while heterogeneous areas that are more
difficult to correctly classify are excluded from selection as reference
data. We have used computer simulations to show that this optimistic bias
can be as high as 30 percent above the actual classification accuracy if
sampling of reference sites is restricted to the center of large homogeneous
groups of classified pixels.
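A toy example shows how this bias arises. It is not the simulation referred to above: the 12 by 12 pixel map, the two classes, and the pattern of errors are all invented for illustration. The map has two large stands, and the classification errors are scattered single pixels in the upper rows, so the misclassified pixels never fall at the center of a homogeneous 3 by 3 window.

```python
SIZE = 12


def true_class(c):
    """Left half of the map is class A, right half is class B."""
    return "A" if c < SIZE // 2 else "B"


true_map = [[true_class(c) for c in range(SIZE)] for _ in range(SIZE)]

# Hypothetical classification: scattered single-pixel errors in rows 0, 2 and 4.
classified = [row[:] for row in true_map]
for r in (0, 2, 4):
    for c in range(0, SIZE, 2):
        classified[r][c] = "B" if true_map[r][c] == "A" else "A"


def is_homogeneous_center(grid, r, c):
    """True if all nine pixels of the 3 x 3 window centered on (r, c) share a class."""
    classes = {grid[rr][cc] for rr in (r - 1, r, r + 1) for cc in (c - 1, c, c + 1)}
    return len(classes) == 1


def accuracy(sites):
    """Fraction of the given pixel sites that are correctly classified."""
    return sum(classified[r][c] == true_map[r][c] for r, c in sites) / len(sites)


all_pixels = [(r, c) for r in range(SIZE) for c in range(SIZE)]
centers = [(r, c) for r in range(1, SIZE - 1) for c in range(1, SIZE - 1)
           if is_homogeneous_center(classified, r, c)]

actual = accuracy(all_pixels)    # every pixel on the map
estimated = accuracy(centers)    # only centers of homogeneous 3 x 3 windows
print(len(centers), actual, estimated)  # 40 0.875 1.0
```

Restricting the reference sites to homogeneous windows reports 100 percent accuracy for a classification that is actually 87.5 percent accurate, because every misclassified pixel sits in a heterogeneous neighborhood that the sampling rule excludes.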
The Take-Home Message
The error matrix and associated classification accuracy estimates have become standard in quality remote sensing studies. However, if the error matrix is generated by using improper reference data collection methods, then the assessment can be misleading.
Reporting the error matrix and classification accuracy is insufficient!
Sampling methods used for reference data should be reported in detail
so that potential users can judge whether there may be significant biases
in the classification accuracy assessment.
References
Verbyla, D. L. 1986. Potential prediction bias in regression and discriminant analysis.
Canadian Journal of Forest Research. 16:1255-1257.
Verbyla, D. L. and T. O. Hammond. 1995. Conservative bias in classification accuracy assessment due to pixel-by-pixel comparison of classified images with reference grids. International Journal of Remote Sensing. 16:581-587.
Hammond, T. O. and D. L. Verbyla. 1996. Optimistic bias in classification
accuracy assessment. International Journal of Remote Sensing. 17:1261-1266.
ADDITIONAL READINGS
Aronoff, S. 1982. Classification accuracy: a user approach. Photogrammetric Engineering and Remote Sensing. 48:1299-1307.
Aronoff, S. 1982. The map accuracy report: a user's view. Photogrammetric Engineering and Remote Sensing. 48:1309-1312.
Aronoff, S. 1983. Evaluating the effectiveness of remote sensing derived data for environmental planning. Journal of Environmental Management. 17:277-290.
Aronoff, S. 1985. The minimum accuracy value as an index of classification accuracy. Photogrammetric Engineering and Remote Sensing. 51:99-111.
Card, D. H. 1982. Using known map category marginal frequencies to improve estimates of thematic map accuracy. Photogrammetric Engineering and Remote Sensing. 48:431-439.
Card, D. H. 1989. Accuracy assessment, using stratified plurality sampling of portions of a Landsat classification of the Arctic National Wildlife Refuge coastal plain. NASA Technical Memorandum 101042. Ames Research Center, Moffett Field, CA. 51 pp.
Congalton, R. G. and K. Green. 1993. A practical look at the sources of confusion in error matrix generation. Photogrammetric Engineering and Remote Sensing. 59:641-644.
Congalton, R. G. and R. A. Mead. 1983. A quantitative method to test for consistency and correctness in photointerpretation. Photogrammetric Engineering and Remote Sensing. 49:69-74.
Congalton, R. G. 1988. A comparison of sampling schemes used in generating error matrices for assessing the accuracy of maps generated from remotely sensed data. Photogrammetric Engineering and Remote Sensing. 54:593-600.
Congalton, R. G. 1988. Using spatial autocorrelation analysis to explore errors in maps generated from remotely sensed data. Photogrammetric Engineering and Remote Sensing. 54:587-592.
Congalton, R. G. 1991. A review of assessing the accuracy of classifications of remotely sensed data. Remote Sensing of Environment. 37:35-46.
Congalton, R. G. and G. S. Biging. 1992. A pilot study evaluating ground reference data collection efforts for use in forest inventory. Photogrammetric Engineering and Remote Sensing. 58:1669-1671.
Dicks, S. E., and T. H. C. Lo. 1990. Evaluation of thematic map accuracy in a land-use and land-cover mapping program. Photogrammetric Engineering and Remote Sensing. 56:1247-1252.
Fitzgerald, R. W. and B. G. Lees. 1994. Assessing the classification accuracy of multisource remote sensing data. Remote Sensing of Environment. 47:362-368.
Fitzpatrick-Lins, K. 1981. Comparison of sampling procedures and data analysis for a land-use and land-cover map. Photogrammetric Engineering and Remote Sensing. 47:343-351.
Foody, G. M. 1988. Incorporating remotely sensed data into a GIS: the problem of classification evaluation. Geocarto International. 3:13-16.
Foody, G. M. 1992. On the compensation for chance agreement in image classification accuracy assessment. Photogrammetric Engineering and Remote Sensing. 58:1459-1460.
Ginevan, M. E. 1979. Testing land-use map accuracy: another look. Photogrammetric Engineering and Remote Sensing. 45:1371-1377.
Hay, A. M. 1979. Sampling designs to test land-use map accuracy. Photogrammetric Engineering and Remote Sensing. 45:529-533.
Hay, A. M. 1988. The derivation of global estimates from a confusion matrix. International Journal of Remote Sensing. 9:1395-1398.
Hord, R. M. and W. Brooner. 1976. Land use map accuracy criteria. Photogrammetric Engineering and Remote Sensing. 42:671-677.
Hudson, W. D. and C. W. Ramm. 1987. Correct formulation of the Kappa coefficient of agreement. Photogrammetric Engineering and Remote Sensing. 53:421-422.
Lunetta, R. S., Congalton, R. G., Fenstermaker, L. K., Jensen, J. R., McGwire, K. C. and L. R. Tinney. 1991. Remote sensing and geographic information system data integration: error sources and research issues. Photogrammetric Engineering and Remote Sensing. 57:677-687.
Martin, L. R. G. 1989. Accuracy assessment of Landsat-based change detection methods applied to the rural-urban fringe. Photogrammetric Engineering and Remote Sensing. 55:209-215.
Maxim, L. D., Harrington, L. and M. Kennedy. 1981. Alternative scale-up estimates for aerial surveys where both detection and classification error exist. Photogrammetric Engineering and Remote Sensing. 47:1227-1239.
Mead, R. A. and J. Szajgin. 1982. Landsat classification accuracy assessment procedures. Photogrammetric Engineering and Remote Sensing. 139-141.
Prisley, S. P. and J. Smith. 1987. Using classification error matrices to improve the accuracy of weighted land-cover models. Photogrammetric Engineering and Remote Sensing. 53:1259-1263.
Rosenfield, G. H., Fitzpatrick-Lins, K. and H. S. Ling. 1982. Sampling for thematic map accuracy testing. Photogrammetric Engineering and Remote Sensing. 48:131-137.
Rosenfield, G. H. and K. Fitzpatrick-Lins. 1986. A coefficient of agreement as a measure of thematic classification accuracy. Photogrammetric Engineering and Remote Sensing. 52:223-227.
Skidmore, A. K. and B. J. Turner. 1992. Map accuracy assessment using line intersect sampling. Photogrammetric Engineering and Remote Sensing. 58:1453-1457.
Star, J. L. 1989. Sources of errors in thematic classification of remotely sensed imagery. Proceedings of IGARSS 89/12th Canadian Symposium on Remote Sensing. Vancouver, BC. pp. 1851-1853.
Stehman, S. V. 1992. Comparison of systematic and random sampling for estimating the accuracy of maps generated from remotely sensed data. Photogrammetric Engineering and Remote Sensing. 58:1343-1350.
Thomas, I. L. and G. M. Allcock. 1984. Determining the confidence level for a classification. Photogrammetric Engineering and Remote Sensing. 50:1491-1496.
Todd, W. J., Gehring, D. G., and J. F. Haman. Landsat wildland mapping accuracy. Photogrammetric Engineering and Remote Sensing. 46(4):590-620.
van Genderen, J. L., Lock, B. F., and P. A. Vass. 1978. Remote sensing: statistical testing of thematic map accuracy. Remote Sensing of Environment. 7:3-14.
Wang, M. and P. J. Howarth. 1993. Modeling errors in remote sensing
image classification. Remote Sensing of Environment. 45:261-271.