What is it about?

When two or more subjective observers, for example human radiologists, independently assess disease severity on an x-ray, their individual assessments are often compared using the kappa inter-observer statistic. Kappa can in theory range from -1 to +1; in practice it ranges from 0 to 1. High inter-observer agreement is desirable; otherwise there is a problem with the method or with the observers. It is well known that kappas from studies with different populations cannot be compared, for mathematical reasons. We extend this caution to clinical trials in which an intervention changes the distribution of outcome measurements: at the start of the trial most patients are in the severe categories, while at the conclusion most are in the mild categories. Kappa is then bound to change even if nothing about the observers has changed.
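
The prevalence dependence can be seen directly in the kappa formula, kappa = (p_o - p_e) / (1 - p_e): the chance-agreement term p_e is computed from the observers' marginal rating frequencies, and those marginals shift whenever the case mix shifts. The sketch below is not from the paper; the observer accuracies and prevalence values are invented for illustration. It holds two hypothetical observers' accuracy fixed and recomputes the expected Cohen's kappa as the share of severe cases falls over the course of a trial.

```python
# Illustrative sketch (not from the paper): expected Cohen's kappa for two
# hypothetical observers whose accuracy never changes, evaluated at different
# disease-severity prevalences, e.g. trial start vs. trial end.
# All probabilities below are invented for illustration only.

def cohens_kappa(prevalence_severe, acc_severe=0.95, acc_mild=0.85):
    """Expected Cohen's kappa for two independent observers.

    Each observer labels a truly severe case 'severe' with probability
    acc_severe and a truly mild case 'mild' with probability acc_mild.
    Only the prevalence of severe cases varies between calls.
    """
    p = prevalence_severe

    # Joint probabilities of the two observers' labels (2x2 severe/mild table).
    both_severe = p * acc_severe**2 + (1 - p) * (1 - acc_mild)**2
    both_mild = p * (1 - acc_severe)**2 + (1 - p) * acc_mild**2
    observed_agreement = both_severe + both_mild          # p_o

    # Marginal probability that a single observer calls a case 'severe';
    # chance agreement p_e is built from these marginals.
    marg_severe = p * acc_severe + (1 - p) * (1 - acc_mild)
    chance_agreement = marg_severe**2 + (1 - marg_severe)**2

    return (observed_agreement - chance_agreement) / (1 - chance_agreement)


if __name__ == "__main__":
    for label, prev in [("trial start (mostly severe)", 0.85),
                        ("mid-trial (balanced)", 0.50),
                        ("trial end (mostly mild)", 0.15)]:
        print(f"{label:30s} prevalence={prev:.2f}  kappa={cohens_kappa(prev):.2f}")
```

With these invented numbers, kappa drifts noticeably across the three time points even though the simulated observers behave identically throughout, which is the effect the paper cautions against.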

Why is it important?

Inter-observer statistics are increasingly used when comparing the performance of AI (artificial intelligence) with human readers. The more prevalent the use of kappa, the higher the risk that its limitations are not understood. Machine learning scientists need to pay attention.

Perspectives

Summary statistics such as kappa try to tell a story in one number. This may work well as long as no apples are compared with oranges. There is a tendency to brag with kappas, the higher the better. We show that sometimes higher does not mean better.

Klaus Gottlieb
Eli Lilly and Co

Read the Original

This page is a summary of: Sequentially Determined Measures of Interobserver Agreement (Kappa) in Clinical Trials May Vary Independent of Changes in Observer Performance, Therapeutic Innovation & Regulatory Science, January 2020, Springer Science + Business Media, DOI: 10.1007/s43441-019-00102-5.
You can read the full text via the DOI above.
