How To Lie With An Error Matrix
 
 

Dave Verbyla
Email: D.Verbyla@uaf.edu
and Tim Hammond
Department of Forest Sciences
University of Alaska Fairbanks


 
 

ABSTRACT

There are many sources of both conservative and optimistic bias in classification accuracy assessment, many of which are impossible to avoid. Bias occurs when a classification accuracy estimate is optimistic or conservative. There are at least three significant sources of conservative bias: errors in reference data, positional errors, and the minimum mapping unit of the reference grid. There are also at least three significant sources of optimistic bias: use of training data for accuracy assessment, restriction of reference data sampling to homogeneous areas, and sampling of reference data that is not independent of training data. The magnitude and direction of bias in classification accuracy estimates depend on the methods used for classification and reference data sampling. Therefore, simply reporting an error matrix and associated classification accuracy estimates is not enough. The error matrix is practically meaningless unless methods are reported in sufficient detail to enable readers to assess the potential for bias in the classification accuracy estimates.
 
 

INTRODUCTION

A common question in digital satellite remote sensing is: "How accurate is the classification?" Visual inspection of a classified image can be misleading. For example, a silviculturist may look at some favorite spruce stands and see that they were all correctly classified, while a moose biologist may see that many critical willow stands were misclassified. These two biologists would reach very different conclusions about the accuracy of the classification. Fortunately, there has been significant research on classification accuracy assessment techniques over the last decade.
 
 

THE ERROR MATRIX

Here is a simple example. Imagine that you have classified an image from a small area along the Tanana River in interior Alaska. The major cover types within your study area are balsam poplar (BP), black spruce (BS), alder/willow (AW), white spruce (WS), and water (W). For each of the classified cover types you establish at least 30 random sample points. You then visit each random sample point in the field and verify the actual cover type (recorded as reference data). You then produce an error matrix by comparing your classifications with your reference data (Table 1). The columns of the error matrix represent the actual "ground truth" from field verification of each random sample point, while the rows represent the predicted classes for the random sample points. The overall classification accuracy can be computed as the total number of correct class predictions (the sum of the diagonal cells) divided by the total number of sample points. In our example, the overall classification accuracy is (40 + 30 + 25 + 50 + 32) / 200 = 177 / 200, or 88.5 percent.
 
 
 
 

Table 1. A simple example of an error matrix.
 
                         REFERENCE DATA ("GROUND TRUTH")

Classified as     BP     BS     AW     WS      W     Row Totals
BP                40      0      0      3      0         43
BS                 0     30     12      0      1         43
AW                 0      3     25      0      2         30
WS                 2      0      0     50      0         52
W                  0      0      0      0     32         32
Column Totals     42     33     37     53     35        200
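The arithmetic behind Table 1 can be checked with a few lines of code. The following is a minimal Python sketch, not part of the original analysis; the function and variable names are illustrative assumptions. It stores the matrix with classified classes as rows and reference classes as columns, then recomputes the marginal totals and the overall accuracy (177 correct out of 200 sample points).

```python
# Error matrix from Table 1: rows are classified (predicted) classes,
# columns are reference ("ground truth") classes.
CLASSES = ["BP", "BS", "AW", "WS", "W"]
ERROR_MATRIX = [
    [40, 0, 0, 3, 0],    # classified as BP
    [0, 30, 12, 0, 1],   # classified as BS
    [0, 3, 25, 0, 2],    # classified as AW
    [2, 0, 0, 50, 0],    # classified as WS
    [0, 0, 0, 0, 32],    # classified as W
]


def row_totals(matrix):
    """Number of sample points classified into each class."""
    return [sum(row) for row in matrix]


def column_totals(matrix):
    """Number of sample points whose reference class is each class."""
    return [sum(col) for col in zip(*matrix)]


def overall_accuracy(matrix):
    """Sum of the diagonal cells divided by the total number of sample points."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


print("Row totals:   ", dict(zip(CLASSES, row_totals(ERROR_MATRIX))))
print("Column totals:", dict(zip(CLASSES, column_totals(ERROR_MATRIX))))
print("Overall accuracy: {:.1%}".format(overall_accuracy(ERROR_MATRIX)))
```

Note that the denominator is the number of sample points (the grand total of the matrix), not the number of cells in the table.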
 
                 

 
 
 

HOW TO LIE WITH THE ERROR MATRIX?

Believe it or not, it is impossible not to lie with an error matrix. Here is why. Classification accuracy assessment rests on three crucial assumptions: 1) the reference data are truly representative of the entire classification (unlikely), 2) the reference data and classified image are perfectly co-registered (impossible), and 3) there is no error in the reference data (unlikely).

The actual accuracy of our classification is unknown because it is impossible to perfectly assess the true class of every pixel. It is therefore possible to produce a misleading assessment of classification accuracy. Depending on how the reference data are collected, our estimate of accuracy may be either conservative or optimistic. If our estimate is less than the actual classification accuracy, then we have made a conservative estimate. If our estimate is greater than the actual classification accuracy, then we have made an optimistic estimate. An analogy is stating that you could predict the flipping of a coin with 10 percent accuracy: your actual accuracy is higher (50 percent), and therefore your estimate is conservative.
 
 

SOURCES OF CONSERVATIVE BIAS

There are at least three sources of conservative bias in accuracy assessment:

1) Errors in Reference Data

2) Positional Errors

3) Minimum mapping unit of reference grid

Errors in Reference Data

We usually assume that our "ground truth" data are perfect. However, if there are any errors in our reference data (such as incorrect class assignment, change in cover type between the time of imaging and the time of field verification, or mistakes in recording or processing the reference data), some of our correctly classified pixels may be incorrectly assessed as being misclassified. For example, if we had a perfect classification, but ten percent of our ground-truth samples were incorrect, we would estimate our classification to be 90 percent accurate (while in this example it is actually 100 percent).
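The ten-percent example can be reproduced with a small simulation. This is a hedged Python sketch under stated assumptions: the sample size of 200, the class list, the random seed, and the helper names are all illustrative and do not come from the original study. The classified labels are set equal to the true labels (a perfect classification), and exactly ten percent of the reference labels are then changed to a wrong class.

```python
import random


def estimated_accuracy(classified, reference):
    """Fraction of sample points where the classified and reference labels agree."""
    matches = sum(1 for c, r in zip(classified, reference) if c == r)
    return matches / len(reference)


def corrupt_reference(truth, error_rate, classes, rng):
    """Copy the true labels, then change round(error_rate * n) of them to a wrong class."""
    reference = list(truth)
    n_wrong = round(error_rate * len(truth))
    for i in rng.sample(range(len(truth)), n_wrong):
        reference[i] = rng.choice([c for c in classes if c != truth[i]])
    return reference


rng = random.Random(0)                       # fixed seed, chosen arbitrarily
classes = ["BP", "BS", "AW", "WS", "W"]
truth = [rng.choice(classes) for _ in range(200)]
classified = list(truth)                     # a perfect classification
reference = corrupt_reference(truth, 0.10, classes, rng)

print("Actual accuracy:   ", estimated_accuracy(classified, truth))
print("Estimated accuracy:", estimated_accuracy(classified, reference))
```

Every disagreement in this sketch is caused by the reference data, yet each one is counted against the classification.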
 
 

Positional Errors

Because images are always rectified to a certain tolerance of positional error, we cannot be absolutely sure of the location of any given pixel. For example, if we rectified an image to 30-meter output pixels with an RMS error of 1.0 pixel, the average positional error of the rectification model is +/- 30 meters. Some pixels will have positional accuracy better than this and some pixels will have positional accuracy that is worse than this. Therefore, despite popular belief, even if we use sub-meter GPS technology to navigate to the center coordinates of a pixel to be "ground truthed", we cannot be certain that we are in that pixel! Because of the positional error inherent in rectified images, some correctly classified pixels may not be correctly located during field sampling. This problem of positional error will also lead to a conservative estimate of classification accuracy. We have used simulation to show that even with an actual classification accuracy of 100 percent, the estimated classification accuracy can be less than 50 percent (Verbyla and Hammond 1995).
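The effect of positional error can be illustrated with a toy example. The sketch below is not the simulation reported in Verbyla and Hammond (1995); it is a simplified Python illustration with assumed names and an assumed deterministic patch pattern. The same map serves as both the classification and the truth, so actual accuracy is 100 percent, and the "field visit" lands a fixed number of pixels away from the intended pixel.

```python
def make_patch_map(n_rows, n_cols, patch_size, n_classes):
    """Deterministic patchy class map: square patches whose class cycles with position."""
    return [
        [((r // patch_size) + (c // patch_size)) % n_classes for c in range(n_cols)]
        for r in range(n_rows)
    ]


def agreement_with_offset(class_map, d_row, d_col):
    """Compare each pixel with the pixel actually visited when the field
    position is off by (d_row, d_col) pixels. Pixels whose offset position
    falls outside the map are skipped."""
    n_rows, n_cols = len(class_map), len(class_map[0])
    matches = 0
    compared = 0
    for r in range(n_rows):
        for c in range(n_cols):
            rr, cc = r + d_row, c + d_col
            if 0 <= rr < n_rows and 0 <= cc < n_cols:
                compared += 1
                matches += class_map[r][c] == class_map[rr][cc]
    return matches / compared


patchy = make_patch_map(30, 30, patch_size=3, n_classes=2)
checkerboard = make_patch_map(30, 30, patch_size=1, n_classes=2)

print("No offset:                  ", agreement_with_offset(patchy, 0, 0))
print("1-pixel offset, 3x3 patches:", agreement_with_offset(patchy, 0, 1))
print("1-pixel offset, 1x1 patches:", agreement_with_offset(checkerboard, 0, 1))
```

In this toy setting the estimated accuracy drops as the patches get smaller relative to the positional error, even though the classification itself has no errors.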
 
 

Minimum Mapping Unit Area

Sometimes aerial photography is interpreted as a substitute for field collection of reference data. Because aerial photography is visually interpreted as polygons, the minimum mapping unit area used in the interpretation can also lead to conservative estimates of classification accuracy. For example, if you interpret aerial photographs to a minimum mapping unit area of 1 hectare, a classified image of 30-meter (0.09 ha) pixels may contain many small clusters of single-class pixels that were too small to be included in the interpretation of the aerial photography. This will lead to a conservative estimate of classification accuracy. We have demonstrated that this conservative bias can be greater than 50 percent and is especially large if a large minimum mapping unit is used or if the classified image is spatially heterogeneous in terms of classes (Verbyla and Hammond 1995).
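The pixel and mapping-unit areas in this example can be worked through directly. The following Python sketch is illustrative only; the 10 by 10 map, the class labels "A" and "B", and the function name are assumptions made for the example. A 3 by 3 patch of class B (0.81 ha) is assumed to be correctly classified, but it is smaller than the 1 hectare minimum mapping unit, so the photo interpreter absorbs it into the surrounding class A polygon.

```python
PIXEL_SIZE_M = 30.0   # pixel edge length in meters
MMU_HA = 1.0          # minimum mapping unit of the photo interpretation

pixel_area_ha = PIXEL_SIZE_M ** 2 / 10_000.0   # 900 square meters = 0.09 ha
pixels_per_mmu = MMU_HA / pixel_area_ha        # roughly 11 pixels


def estimated_accuracy(classified, reference):
    """Pixel-by-pixel agreement between a classified grid and a reference grid."""
    pairs = [
        (c, r)
        for c_row, r_row in zip(classified, reference)
        for c, r in zip(c_row, r_row)
    ]
    return sum(c == r for c, r in pairs) / len(pairs)


# Classified map: class A with a 3 x 3 patch of class B (assumed correct on the ground).
classified = [["A"] * 10 for _ in range(10)]
for r in range(4, 7):
    for c in range(4, 7):
        classified[r][c] = "B"

# Reference grid from photo interpretation: the 0.81 ha patch is below the
# 1 ha minimum mapping unit, so the whole area is labeled class A.
reference = [["A"] * 10 for _ in range(10)]

print("Pixel area (ha):   ", pixel_area_ha)
print("Pixels per MMU:    ", round(pixels_per_mmu, 1))
print("Estimated accuracy:", estimated_accuracy(classified, reference))
```

The nine correctly classified pixels are scored as errors, so the estimate falls below the actual accuracy.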
 
 

SOURCES OF OPTIMISTIC BIAS

It is possible to have a poor classification and yet report a high classification accuracy estimate. There are at least three sources of this optimistic bias.
 
 

1) Using training data for accuracy assessment.

2) Sampling of reference data not independent of training data sampling.

3) Sampling from homogeneous groups of pixels.
 
 

Training fields as reference data

It is tempting to minimize cost and time by using the data collected from the training field areas as reference data. This can lead to optimistic estimates of classification accuracy for at least two reasons. First, the pixels selected for training field data are usually selected because they are from relatively homogeneous areas, which are typically easier to classify correctly than other areas in the image. For example, we might estimate our overall classification accuracy to be 100 percent if we selected our reference data from training fields that encompassed large, pure, level stands. However, if much of our image is a heterogeneous mixture of vegetation types of variable canopy structure and variable topographic conditions, our classified image is probably not really 100 percent accurate. Second, because we are developing a statistical model with the training field data, estimates of classification accuracy are likely to be optimistically biased because we use the same data for model development and model validation. For example, Verbyla (1986) developed a classifier with random numbers that had an estimated classification accuracy of 95 percent, even though the classifier was based on meaningless random numbers.
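The second point, validating a model with the data used to build it, can be demonstrated with meaningless data. The sketch below does not reproduce the analysis in Verbyla (1986), which concerned regression and discriminant analysis; as a stand-in it uses a one-nearest-neighbor classifier because that is easy to write in plain Python. The band count, sample sizes, class names, and seed are all assumptions for illustration.

```python
import random


def nearest_neighbor_predict(train_x, train_y, x):
    """Label of the training sample closest to x (squared Euclidean distance)."""
    best = min(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], x)),
    )
    return train_y[best]


def accuracy(train_x, train_y, test_x, test_y):
    """Fraction of test samples whose predicted label matches the test label."""
    hits = sum(
        nearest_neighbor_predict(train_x, train_y, x) == y
        for x, y in zip(test_x, test_y)
    )
    return hits / len(test_y)


def random_sample(n, n_bands, rng):
    """Random 'spectral' values and random labels: there is nothing to learn."""
    xs = [[rng.random() for _ in range(n_bands)] for _ in range(n)]
    ys = [rng.choice(["spruce", "willow"]) for _ in range(n)]
    return xs, ys


rng = random.Random(1)                          # fixed seed, chosen arbitrarily
train_x, train_y = random_sample(100, 4, rng)
test_x, test_y = random_sample(1000, 4, rng)

resubstitution = accuracy(train_x, train_y, train_x, train_y)
independent = accuracy(train_x, train_y, test_x, test_y)

print("Accuracy assessed with the training data:", resubstitution)
print("Accuracy assessed with independent data: ", independent)
```

Assessed against its own training data, the classifier appears perfect; assessed against independent data, it performs at roughly the level of a coin flip, which is all that random numbers can support.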
 
 

Reference Data Sampling Not Independent of Training Data Sampling

Some researchers have sampled reference data while sampling for training fields. This approach is wrong and will lead to an optimistic estimate of classification accuracy. Typically, training fields are selected because they are relatively spectrally pure and therefore easy to classify. If reference areas are sampled near training fields, they are also likely to be relatively easy to classify correctly. Some researchers have randomly reserved some training field pixels as reference data and excluded these pixels from training the classifier. This approach is also wrong because the "reference" data are spectrally correlated to the training data. We have demonstrated that such an approach can lead to inflated classification accuracy estimates of up to 100 percent (Hammond and Verbyla 1996).
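The problem with reserving pixels from the training fields can be sketched with simulated fields. This is an illustrative Python example and not the method of Hammond and Verbyla (1996); the single spectral band, the field counts, the noise level, the every-fourth-pixel reservation rule, and the one-nearest-neighbor classifier are all assumptions. Each field receives a random class and a random mean brightness that is unrelated to its class, so a classifier has nothing real to learn, but pixels within a field closely resemble each other.

```python
import random


def simulate_fields(n_fields, pixels_per_field, rng):
    """Each field gets a random class and a random mean brightness unrelated to class."""
    fields = []
    for _ in range(n_fields):
        label = rng.choice(["spruce", "willow"])
        mean = rng.uniform(0.0, 100.0)
        pixels = [mean + rng.gauss(0.0, 0.001) for _ in range(pixels_per_field)]
        fields.append((label, pixels))
    return fields


def predict(train, value):
    """One-nearest-neighbor on a single band; train is a list of (value, label)."""
    return min(train, key=lambda t: abs(t[0] - value))[1]


def accuracy(train, test):
    """Fraction of (value, label) test pixels that are predicted correctly."""
    return sum(predict(train, v) == y for v, y in test) / len(test)


rng = random.Random(2)                          # fixed seed, chosen arbitrarily
training_fields = simulate_fields(20, 20, rng)
independent_fields = simulate_fields(100, 5, rng)

train, held_out = [], []
for label, pixels in training_fields:
    for i, value in enumerate(pixels):
        # Reserve every fourth pixel of each training field as "reference" data.
        (held_out if i % 4 == 0 else train).append((value, label))

independent = [(v, label) for label, pixels in independent_fields for v in pixels]

held_out_accuracy = accuracy(train, held_out)
independent_accuracy = accuracy(train, independent)

print("Reference pixels reserved from training fields:", held_out_accuracy)
print("Reference pixels from independent fields:      ", independent_accuracy)
```

The reserved pixels were excluded from training, yet they are matched almost perfectly by their neighbors from the same field, so the accuracy estimate is inflated relative to the estimate from independently sampled fields.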
 
 

Sampling from homogeneous blocks of classified pixels

Because of the positional uncertainty inherent in rectified images, a common approach for reference data is to sample the coordinates of the center pixel of a 3 by 3 group of pixels belonging to the same class. However, this can lead to an optimistic estimate of classification accuracy since homogeneous areas are selected while heterogeneous areas that are more difficult to correctly classify are excluded from selection as reference data. We have used computer simulations to show that this optimistic bias can be as high as 30 percent above the actual classification accuracy if sampling of reference sites is restricted to the center of large homogeneous groups of classified pixels.
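A constructed example shows how restricting reference sites to homogeneous blocks hides errors. The Python sketch below is not the computer simulation mentioned above; the striped truth map, the one-pixel edge displacement, and all names are assumptions chosen so that the misclassified pixels sit along class boundaries, where classification is typically hardest.

```python
def truth_map(n_rows, n_cols, stripe_width):
    """Two classes arranged in vertical stripes."""
    return [[(c // stripe_width) % 2 for c in range(n_cols)] for _ in range(n_rows)]


def classified_map(truth):
    """A classification that misplaces every stripe edge by one pixel."""
    return [[row[max(c - 1, 0)] for c in range(len(row))] for row in truth]


def all_sites(grid):
    """Every pixel in the map."""
    return [(r, c) for r in range(len(grid)) for c in range(len(grid[0]))]


def homogeneous_sites(classified):
    """Centers of 3 by 3 windows whose nine classified pixels share one class."""
    sites = []
    for r in range(1, len(classified) - 1):
        for c in range(1, len(classified[0]) - 1):
            window = {
                classified[rr][cc]
                for rr in (r - 1, r, r + 1)
                for cc in (c - 1, c, c + 1)
            }
            if len(window) == 1:
                sites.append((r, c))
    return sites


def accuracy(classified, truth, sites):
    """Fraction of the given sites where the classified class equals the true class."""
    hits = sum(classified[r][c] == truth[r][c] for r, c in sites)
    return hits / len(sites)


truth = truth_map(20, 20, stripe_width=5)
classified = classified_map(truth)

actual = accuracy(classified, truth, all_sites(truth))
homogeneous_only = accuracy(classified, truth, homogeneous_sites(classified))

print("Actual accuracy (all pixels):          ", actual)
print("Estimate from homogeneous 3 by 3 sites:", homogeneous_only)
```

Every misclassified pixel in this example lies next to a class boundary in the classified image, so none of them can be the center of a homogeneous 3 by 3 block, and the restricted sample reports no errors at all.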
 
 

The Take-Home Message

The error matrix and associated classification accuracy estimates have become standard in quality remote sensing studies. However, if the error matrix is generated by using improper reference data collection methods, then the assessment can be misleading.

Reporting the error matrix and classification accuracy is insufficient!

Sampling methods used for reference data should be reported in detail so that potential users can judge whether there may be significant biases in the classification accuracy assessment.
 
 

References

Verbyla, D. L. 1986. Potential prediction bias in regression and discriminant analysis. Canadian Journal of Forest Research. 16:1255-1257.

Verbyla, D. L. and T. O. Hammond. 1995. Conservative bias in classification accuracy assessment due to pixel-by-pixel comparison of classified images with reference grids. International Journal of Remote Sensing. 16:581-587.

Hammond, T. O. and D. L. Verbyla. 1996. Optimistic bias in classification accuracy assessment. International Journal of Remote Sensing. 17:1261-1266.
 
 

ADDITIONAL READINGS

Aronoff, S. 1982. Classification accuracy: a user approach. Photogrammetric Engineering and Remote Sensing. 48:1299-1307.

Aronoff, S. 1982. The map accuracy report: a user's view. Photogrammetric Engineering and Remote Sensing. 48:1309-1312.

Aronoff, S. 1983. Evaluating the effectiveness of remote sensing derived data for environmental planning. Journal of Environmental Management. 17:277-290.

Aronoff, S. 1985. The minimum accuracy value as an index of classification accuracy. Photogrammetric Engineering and Remote Sensing. 51:99-111.

Card, D. H. 1982. Using known map category marginal frequencies to improve estimates of thematic map accuracy. Photogrammetric Engineering and Remote Sensing. 48:431-439.

Card, D. H. 1989. Accuracy assessment, using stratified plurality sampling of portions of a Landsat classification of the Arctic National Wildlife Refuge coastal plain. NASA Technical Memorandum 101042. Ames Research Center, Moffett Field, CA. 51 pp.

Congalton, R. G. and K. Green. 1993. A practical look at the sources of confusion in error matrix generation. Photogrammetric Engineering and Remote Sensing. 59:641-644.

Congalton, R. G. and R. A. Mead. 1983. A quantitative method to test for consistency and correctness in photointerpretation. Photogrammetric Engineering and Remote Sensing. 49:69-74.

Congalton, R. G. 1988. A comparison of sampling schemes used in generating error matrices for assessing the accuracy of maps generated from remotely sensed data. Photogrammetric Engineering and Remote Sensing. 54:593-600.

Congalton, R. G. 1988. Using spatial autocorrelation analysis to explore errors in maps generated from remotely sensed data. Photogrammetric Engineering and Remote Sensing. 54:587-592.

Congalton, R. G. 1991. A review of assessing the accuracy of classifications of remotely sensed data. Remote Sensing of Environment. 37:35-46.

Congalton, R. G. and G. S. Biging. 1992. A pilot study evaluating ground reference data collection efforts for use in forest inventory. Photogrammetric Engineering and Remote Sensing. 58:1669-1671.

Dicks, S. E., and T. H. C. Lo. 1990. Evaluation of thematic map accuracy in a land-use and land-cover mapping program. Photogrammetric Engineering and Remote Sensing. 56:1247-1252.

Fitzgerald, R. W. and B. G. Lees. 1994. Assessing the classification accuracy of multisource remote sensing data. Remote Sensing of Environment. 47:362-368.

Fitzpatrick-Lins, K. 1981. Comparison of sampling procedures and data analysis for a land-use and land-cover map. Photogrammetric Engineering and Remote Sensing. 47:343-351.

Foody, G. M. 1988. Incorporating remotely sensed data into a GIS: the problem of classification evaluation. Geocarto International. 3:13-16.

__________. 1992. On the compensation for chance agreement in image classification accuracy assessment. Photogrammetric Engineering and Remote Sensing. 58:1459-1460.

Ginevan, M. E. 1979. Testing land-use map accuracy: another look. Photogrammetric Engineering and Remote Sensing. 45:1371-1377.

Hay, A. M. 1979. Sampling designs to test land-use map accuracy. Photogrammetric Engineering and Remote Sensing. 45:529-533.

Hay, A. M. 1988. The derivation of global estimates from a confusion matrix. International Journal of Remote Sensing. 9:1395-1398.

Hord, R. M. and W. Brooner. 1976. Land use map accuracy criteria. Photogrammetric Engineering and Remote Sensing. 42:671-677.

Hudson, W. D. and C. W. Ramm. 1987. Correct formulation of the Kappa coefficient of agreement. Photogrammetric Engineering and Remote Sensing. 53:421-422.

Lunetta, R. S., Congalton, R. G., Fenstermaker, L. K., Jensen, J. R., McGwire, K. C. and L. R. Tinney. 1991. Remote sensing and geographic information system data integration: error sources and research issues. Photogrammetric Engineering and Remote Sensing. 57:677-687.

Martin, L. R. G. 1989. Accuracy assessment of Landsat-based change detection methods applied to the rural-urban fringe. Photogrammetric Engineering and Remote Sensing. 55:209-215.

Maxim, L. D., Harrington, L. and M. Kennedy. 1981. Alternative scale-up estimates for aerial surveys where both detection and classification error exist. Photogrammetric Engineering and Remote Sensing. 47:1227-1239.

Mead, R. A. and J. Szajgin. 1982. Landsat classification accuracy assessment procedures. Photogrammetric Engineering and Remote Sensing. 139-141.

Prisley, S. P. and J. Smith. 1987. Using classification error matrices to improve the accuracy of weighted land-cover models. Photogrammetric Engineering and Remote Sensing. 53:1259-1263.

Rosenfield, G. H., Fitzpatrick-Lins, K. and H. S. Ling. 1982. Sampling for thematic map accuracy testing. Photogrammetric Engineering and Remote Sensing. 48:131-137.

Rosenfield, G. H. and K. Fitzpatrick-Lins. 1986. A coefficient of agreement as a measure of thematic classification accuracy. Photogrammetric Engineering and Remote Sensing. 52:223-227.

Skidmore, A. K. and B. J. Turner. 1992. Map accuracy assessment using line intersect sampling. Photogrammetric Engineering and Remote Sensing. 58:1453-1457.

Star, J. L. 1989. Sources of errors in thematic classification of remotely sensed imagery. Proceedings of IGARSS 89/12th Canadian Symposium on Remote Sensing. Vancouver, BC. pp. 1851-1853.

Stehman, S. V. 1992. Comparison of systematic and random sampling for estimating the accuracy of maps generated from remotely sensed data. Photogrammetric Engineering and Remote Sensing. 58:1343-1350.

Thomas, I. L. and G. M. Allcock. 1984. Determining the confidence level for a classification. Photogrammetric Engineering and Remote Sensing. 50:1491-1496.

Todd, W. J., Gehring, D. G., and J. F. Haman. Landsat wildland mapping accuracy. Photogrammetric Engineering and Remote Sensing. 46(4):590-620.

van Genderen, J. L., Lock, B. F., and P. A. Vass. 1978. Remote sensing: statistical testing of thematic map accuracy. Remote Sensing of Environment. 7:3-14.

Wang, M. and P. J. Howarth. 1993. Modeling errors in remote sensing image classification. Remote Sensing of Environment. 45:261-271.