What Is Cohen's Kappa?
Cohen's kappa (κ) is a statistic that measures the level of agreement between two raters (or judges, coders, or classifiers) who each categorize items into mutually exclusive categories. What makes kappa more useful than simple percent agreement is that it accounts for agreement that would occur by chance alone. If two coders classify 100 open-ended survey responses as "positive," "neutral," or "negative," they might agree on 70% of them, but if both tend to code most responses as "neutral," a large portion of that agreement would happen by random chance. Kappa adjusts for this, giving you a more honest measure of how much the raters actually agree beyond what luck would predict.
Why Cohen's Kappa Matters
In any research involving human judgment (coding qualitative data, categorizing open-ended responses, rating content quality), you need to demonstrate that your coding is reliable. If two raters can't agree on how to classify responses, the classification system is unreliable, and any analysis built on those classifications is questionable. Cohen's kappa is the standard metric for demonstrating inter-rater reliability, and journal reviewers and research clients routinely expect to see it reported.
How Cohen's Kappa Works
The Formula
κ = (P_o - P_e) / (1 - P_e)
Where:
- P_o = the observed proportion of agreement (how often the raters actually agree)
- P_e = the expected proportion of agreement by chance
- 1 - P_e = the maximum possible agreement beyond chance
Worked Example
Two coders classified 100 customer feedback comments as "positive," "negative," or "neutral." Here's the confusion matrix:
| | Coder B: Positive | Coder B: Negative | Coder B: Neutral | Row Total |
|---|---|---|---|---|
| Coder A: Positive | 30 | 2 | 3 | 35 |
| Coder A: Negative | 4 | 25 | 1 | 30 |
| Coder A: Neutral | 6 | 3 | 26 | 35 |
| Column Total | 40 | 30 | 30 | 100 |
Step 1: Calculate observed agreement (P_o)
P_o = (30 + 25 + 26) / 100 = 81/100 = 0.81
Step 2: Calculate expected agreement by chance (P_e)
For each category, multiply the row and column marginal proportions:
- Positive: (35/100) × (40/100) = 0.140
- Negative: (30/100) × (30/100) = 0.090
- Neutral: (35/100) × (30/100) = 0.105
P_e = 0.140 + 0.090 + 0.105 = 0.335
Step 3: Calculate kappa
κ = (0.81 - 0.335) / (1 - 0.335) = 0.475 / 0.665 = 0.714
A kappa of 0.71 indicates substantial agreement between the two coders, well above chance.
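The three steps above can be reproduced in a few lines of Python. This is a minimal sketch, not a reference implementation: the function name `cohens_kappa` is an illustrative choice, and it assumes a square confusion matrix with Coder A on the rows and Coder B on the columns, as in the table above.

```python
def cohens_kappa(matrix):
    """Cohen's kappa from a square confusion matrix (rows: coder A, columns: coder B)."""
    k = len(matrix)
    n = sum(sum(row) for row in matrix)

    # Step 1: observed agreement is the share of items on the diagonal.
    p_o = sum(matrix[i][i] for i in range(k)) / n

    # Step 2: expected agreement multiplies each category's row and column marginals.
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(k)) for j in range(k)]
    p_e = sum(row_totals[i] * col_totals[i] for i in range(k)) / (n * n)

    # Step 3: kappa. Undefined when p_e == 1 (both coders use a single category).
    return (p_o - p_e) / (1 - p_e)


# Counts from the worked example (Positive, Negative, Neutral).
matrix = [
    [30, 2, 3],
    [4, 25, 1],
    [6, 3, 26],
]
kappa = cohens_kappa(matrix)
print(round(kappa, 3))  # 0.714
```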
Interpretation Scale
Landis and Koch (1977) proposed the most widely used benchmarks:
| κ Value | Interpretation |
|---|---|
| < 0.00 | Less than chance agreement |
| 0.00 - 0.20 | Slight agreement |
| 0.21 - 0.40 | Fair agreement |
| 0.41 - 0.60 | Moderate agreement |
| 0.61 - 0.80 | Substantial agreement |
| 0.81 - 1.00 | Almost perfect agreement |
For most research purposes, κ ≥ 0.60 is considered acceptable, and κ ≥ 0.80 is considered strong. Below 0.60, the coding scheme typically needs revision; unclear category definitions, insufficient coder training, or ambiguous items are usually the culprits.
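If you report kappa programmatically, the benchmark table can be turned into a small lookup. The sketch below is one possible reading of the table: the function name `landis_koch_label` is hypothetical, and assigning values that fall between the printed bands (for example 0.205) to the lower band is an assumption, since the table only lists two-decimal ranges.

```python
def landis_koch_label(kappa):
    """Map a kappa value to the Landis and Koch (1977) benchmark label."""
    if kappa < 0.0:
        return "Less than chance agreement"
    # Upper bounds of each band; in-between values go to the lower band (assumption).
    bands = [
        (0.20, "Slight agreement"),
        (0.40, "Fair agreement"),
        (0.60, "Moderate agreement"),
        (0.80, "Substantial agreement"),
        (1.00, "Almost perfect agreement"),
    ]
    for upper, label in bands:
        if kappa <= upper:
            return label
    raise ValueError("kappa cannot exceed 1.0")


print(landis_koch_label(0.714))  # Substantial agreement
```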
When Kappa Is Low: What to Do
If your initial kappa falls below acceptable levels:
- Review disagreements: examine the specific items where coders disagreed and identify patterns
- Refine category definitions: unclear or overlapping definitions are the most common cause of low kappa
- Provide additional training: walk coders through borderline cases and establish decision rules
- Run another round: code a new batch of items and recalculate; kappa should improve with better definitions and training
- Consider merging categories: sometimes categories are too granular; combining similar ones improves agreement
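To make the "review disagreements" step concrete, you can rank the off-diagonal cells of the confusion matrix to see which category pairs are confused most often. This is a rough sketch using the counts from the worked example; the helper name `rank_disagreements` is made up for illustration.

```python
def rank_disagreements(matrix, labels):
    """Return (count, coder_a_label, coder_b_label) for each off-diagonal cell, largest first."""
    cells = [
        (matrix[i][j], labels[i], labels[j])
        for i in range(len(labels))
        for j in range(len(labels))
        if i != j and matrix[i][j] > 0
    ]
    return sorted(cells, key=lambda cell: cell[0], reverse=True)


labels = ["Positive", "Negative", "Neutral"]
matrix = [
    [30, 2, 3],
    [4, 25, 1],
    [6, 3, 26],
]
ranked = rank_disagreements(matrix, labels)
for count, a_label, b_label in ranked:
    print(f"Coder A: {a_label} / Coder B: {b_label}: {count}")
```

In this example the largest cell is Coder A "Neutral" against Coder B "Positive" (6 items), which suggests the boundary between those two definitions is the first place to look.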
Limitations of Cohen's Kappa
Prevalence problem: When one category dominates (e.g., 90% of responses are "neutral"), P_e becomes very high, and kappa can be paradoxically low even with high percent agreement. This is a known issue called the "kappa paradox."
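The effect is easy to see with two made-up 2×2 tables that have identical percent agreement. The counts below are hypothetical and chosen only to illustrate the paradox; `cohens_kappa` is the same illustrative helper used for the worked example, repeated so the snippet runs on its own.

```python
def cohens_kappa(matrix):
    """Cohen's kappa from a square confusion matrix (rows: coder A, columns: coder B)."""
    k = len(matrix)
    n = sum(sum(row) for row in matrix)
    p_o = sum(matrix[i][i] for i in range(k)) / n
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(k)) for j in range(k)]
    p_e = sum(row_totals[i] * col_totals[i] for i in range(k)) / (n * n)
    return (p_o - p_e) / (1 - p_e)


# Hypothetical counts: both tables have 90 of 100 items on the diagonal.
balanced = [[45, 5], [5, 45]]  # each category used about half the time
skewed = [[85, 5], [5, 5]]     # one category used about 90% of the time

kappa_balanced = cohens_kappa(balanced)  # P_e = 0.50
kappa_skewed = cohens_kappa(skewed)      # P_e = 0.82

print(round(kappa_balanced, 2))  # 0.8
print(round(kappa_skewed, 2))    # 0.44
```

Percent agreement is 90% in both cases, but the skewed table's higher chance agreement pulls kappa down from 0.80 to about 0.44.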
Only two raters: Standard kappa works for exactly two raters. For three or more raters, you need Fleiss' kappa, which extends the concept to multiple judges.
Nominal categories only: Kappa treats all disagreements equally. If your categories are ordinal (mild, moderate, severe), a disagreement between "mild" and "severe" is counted the same as one between "mild" and "moderate." For ordinal data, weighted kappa is more appropriate because it assigns smaller penalties to disagreements between adjacent categories.
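As a rough illustration of the difference, the sketch below computes both unweighted kappa and linearly weighted kappa using disagreement weights |i - j| / (k - 1). Linear weighting is only one common choice (quadratic weights are another), the function name `weighted_kappa` is hypothetical, and the mild/moderate/severe counts are invented for the example.

```python
def weighted_kappa(matrix, weighting="linear"):
    """Weighted kappa: 1 - (weighted observed disagreement / weighted expected disagreement)."""
    k = len(matrix)
    n = sum(sum(row) for row in matrix)
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(k)) for j in range(k)]

    def weight(i, j):
        if weighting == "linear":
            return abs(i - j) / (k - 1)  # adjacent categories get a smaller penalty
        return 0.0 if i == j else 1.0    # "none": every disagreement counts fully

    observed = sum(weight(i, j) * matrix[i][j] for i in range(k) for j in range(k)) / n
    expected = sum(
        weight(i, j) * row_totals[i] * col_totals[j] for i in range(k) for j in range(k)
    ) / (n * n)
    return 1 - observed / expected


# Hypothetical counts for mild / moderate / severe; all disagreements are one step apart.
ordinal = [
    [20, 5, 0],
    [5, 20, 5],
    [0, 5, 20],
]
unweighted = weighted_kappa(ordinal, weighting="none")
linear = weighted_kappa(ordinal, weighting="linear")
print(round(unweighted, 3), round(linear, 3))
```

Because every disagreement in this table is between adjacent categories, the linearly weighted value comes out higher than the unweighted one. With `weighting="none"` the function reduces to standard Cohen's kappa.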
When to Use Cohen's Kappa
- Coding open-ended survey responses to verify that multiple coders classify responses consistently before running analysis
- Content analysis where researchers categorize text, images, or media into predefined themes
- Quality assurance for data classification tasks in research operations
- Validating AI-assisted coding by comparing automated classifications against human expert judgments
- Clinical or diagnostic research where practitioners independently classify cases
Common Mistakes to Avoid
- Reporting percent agreement instead of kappa: percent agreement doesn't account for chance, making it look artificially high, especially when categories have unequal prevalence
- Calculating kappa on the full dataset instead of a random subsample: inter-rater reliability is usually assessed on a representative subset coded by both raters (typically 10-20% of items); once agreement on that subset is acceptable, the same codebook is applied to the rest
- Using standard kappa for ordinal categories: if the categories have a natural order, weighted kappa is more appropriate because it distinguishes near-misses from large disagreements
How Quali-Fi Supports Inter-Rater Reliability
Quali-Fi's qualitative analysis tools include built-in kappa calculation for any coding task involving multiple coders. The platform generates confusion matrices, flags categories with the highest disagreement rates, and tracks reliability across coding iterations so you can document improvement as you refine your codebook.
Frequently Asked Questions
Can kappa be negative?
Yes. A negative kappa means the raters agree less often than chance alone would predict; in other words, they are systematically disagreeing. This is rare in practice and usually indicates a fundamental misunderstanding of the coding scheme (e.g., one coder is using the categories in reverse).
What's the difference between Cohen's kappa and Fleiss' kappa?
Cohen's kappa is designed for exactly two raters. Fleiss' kappa extends the concept to three or more raters, calculating the degree of agreement among multiple judges beyond what's expected by chance. The formulas differ, but the interpretation scale is the same.
How many items should two coders overlap on?
A common guideline is to double-code at least 10-20% of your total items, with a minimum of 30-50 items. This gives enough data to calculate a stable kappa. If the overlap sample is too small, kappa will be unstable and may not reflect the true level of agreement.
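The guideline can be turned into a small helper for planning the double-coded sample. This is only a sketch: the function names are hypothetical, the defaults (10% and 30 items) are the lower ends of the ranges quoted above, and the fixed seed is an assumption made so the draw is reproducible.

```python
import math
import random


def overlap_sample_size(total_items, fraction=0.10, minimum=30):
    """Number of items to double-code: a fraction of the total, but at least `minimum`."""
    size = max(math.ceil(total_items * fraction), minimum)
    return min(size, total_items)  # cannot double-code more items than exist


def draw_overlap_sample(item_ids, fraction=0.10, minimum=30, seed=0):
    """Randomly pick the items that both coders will code."""
    size = overlap_sample_size(len(item_ids), fraction, minimum)
    rng = random.Random(seed)  # fixed seed so the same sample can be re-drawn
    return sorted(rng.sample(item_ids, size))


print(overlap_sample_size(1000))  # 100
print(overlap_sample_size(200))   # 30
sample = draw_overlap_sample(list(range(1, 201)))
print(len(sample))                # 30
```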