Cohen's Kappa: What It Is, How to Calculate It, and Interpretation Scale

Learn what Cohen's kappa is, how to calculate inter-rater reliability, and how to interpret kappa values for agreement beyond chance.

What Is Cohen's Kappa?

Cohen's kappa (κ) is a statistic that measures the level of agreement between two raters (or judges, coders, or classifiers) who each categorize items into mutually exclusive categories. What makes kappa more useful than simple percent agreement is that it accounts for agreement that would occur by chance alone. If two coders classify 100 open-ended survey responses as "positive," "neutral," or "negative," they might agree on 70% of them, but if both tend to code most responses as "neutral," a large portion of that agreement would happen by random chance. Kappa adjusts for this, giving you a more honest measure of how much the raters actually agree beyond what luck would predict.

Why Cohen's Kappa Matters

In any research involving human judgment (coding qualitative data, categorizing open-ended responses, rating content quality), you need to demonstrate that your coding is reliable. If two raters can't agree on how to classify responses, the classification system is unreliable, and any analysis built on those classifications is questionable. Cohen's kappa is the standard metric for demonstrating inter-rater reliability, and journal reviewers and research clients routinely expect to see it reported.

How Cohen's Kappa Works

The Formula

κ = (P_o - P_e) / (1 - P_e)

Where:

P_o = the observed proportion of agreement (how often the raters actually agree)
P_e = the expected proportion of agreement by chance
1 - P_e = the maximum possible agreement beyond chance

Worked Example

Two coders classified 100 customer feedback comments as "positive," "negative," or "neutral." Here's the confusion matrix:

	Coder B: Positive	Coder B: Negative	Coder B: Neutral	Row Total
Coder A: Positive	30	2	3	35
Coder A: Negative	4	25	1	30
Coder A: Neutral	6	3	26	35
Column Total	40	30	30	100

Step 1: Calculate observed agreement (P_o)

P_o = (30 + 25 + 26) / 100 = 81/100 = 0.81

Step 2: Calculate expected agreement by chance (P_e)

For each category, multiply Coder A's row proportion by Coder B's column proportion, then sum the results:

Positive: (35/100) × (40/100) = 0.140
Negative: (30/100) × (30/100) = 0.090
Neutral: (35/100) × (30/100) = 0.105

P_e = 0.140 + 0.090 + 0.105 = 0.335

Step 3: Calculate kappa

κ = (0.81 - 0.335) / (1 - 0.335) = 0.475 / 0.665 = 0.714

A kappa of 0.71 indicates substantial agreement between the two coders, well above chance.
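The three steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation; the function name `cohen_kappa` and the list-of-lists layout (rows for Coder A, columns for Coder B) are assumptions made for this example.

```python
def cohen_kappa(matrix):
    """Cohen's kappa from a square confusion matrix (rows: coder A, columns: coder B)."""
    n = sum(sum(row) for row in matrix)
    k = len(matrix)
    # Step 1: observed agreement is the share of items on the diagonal.
    p_o = sum(matrix[i][i] for i in range(k)) / n
    # Marginal proportions for each coder.
    row_p = [sum(matrix[i]) / n for i in range(k)]
    col_p = [sum(matrix[i][j] for i in range(k)) / n for j in range(k)]
    # Step 2: chance agreement is the sum of row share times column share.
    p_e = sum(r * c for r, c in zip(row_p, col_p))
    # Step 3: kappa rescales the agreement beyond chance.
    return p_o, p_e, (p_o - p_e) / (1 - p_e)


# Confusion matrix from the worked example (Positive, Negative, Neutral).
matrix = [
    [30, 2, 3],
    [4, 25, 1],
    [6, 3, 26],
]
p_o, p_e, kappa = cohen_kappa(matrix)
print(f"P_o={p_o:.3f}  P_e={p_e:.3f}  kappa={kappa:.3f}")
# prints: P_o=0.810  P_e=0.335  kappa=0.714
```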

Interpretation Scale

Landis and Koch (1977) proposed the most widely used benchmarks:

κ Value	Interpretation
< 0.00	Less than chance agreement
0.00 - 0.20	Slight agreement
0.21 - 0.40	Fair agreement
0.41 - 0.60	Moderate agreement
0.61 - 0.80	Substantial agreement
0.81 - 1.00	Almost perfect agreement

For most research purposes, κ ≥ 0.60 is considered acceptable, and κ ≥ 0.80 is considered strong. Below 0.60, the coding scheme typically needs revision. Unclear category definitions, insufficient coder training, or ambiguous items are usually the culprits.
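As a small sketch, the benchmark table can be turned into a lookup function. The name `interpret_kappa` is hypothetical, and treating each band's upper bound as inclusive (so that values falling between the printed bands, such as 0.205, still get a label) is an assumption of this example, not something the benchmarks themselves specify.

```python
def interpret_kappa(kappa):
    """Map a kappa value to a Landis and Koch (1977) style label."""
    if kappa < 0:
        return "Less than chance agreement"
    # (inclusive upper bound, label); bounds are an assumption of this sketch.
    bands = [
        (0.20, "Slight agreement"),
        (0.40, "Fair agreement"),
        (0.60, "Moderate agreement"),
        (0.80, "Substantial agreement"),
        (1.00, "Almost perfect agreement"),
    ]
    for upper, label in bands:
        if kappa <= upper:
            return label
    raise ValueError("kappa cannot exceed 1")


print(interpret_kappa(0.714))  # Substantial agreement
```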

When Kappa Is Low: What to Do

If your initial kappa falls below acceptable levels:

Review disagreements: examine the specific items where coders disagreed and identify patterns
Refine category definitions: unclear or overlapping definitions are the most common cause of low kappa
Provide additional training: walk coders through borderline cases and establish decision rules
Run another round: code a new batch of items and recalculate; kappa should improve with better definitions and training
Consider merging categories: sometimes categories are too granular; combining similar ones improves agreement
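One way to start the "review disagreements" step is to rank the off-diagonal cells of the confusion matrix, since the largest cells show which category pairs coders confuse most often. The sketch below reuses the matrix from the worked example; `top_disagreements` is a hypothetical helper name.

```python
def top_disagreements(matrix, labels):
    """Return off-diagonal cells as (count, coder A label, coder B label), largest first."""
    cells = [
        (matrix[i][j], labels[i], labels[j])
        for i in range(len(labels))
        for j in range(len(labels))
        if i != j and matrix[i][j] > 0
    ]
    return sorted(cells, key=lambda cell: cell[0], reverse=True)


labels = ["Positive", "Negative", "Neutral"]
matrix = [
    [30, 2, 3],
    [4, 25, 1],
    [6, 3, 26],
]
ranked = top_disagreements(matrix, labels)
for count, a_label, b_label in ranked:
    print(f"{count:>3}  Coder A: {a_label:<8}  Coder B: {b_label}")
```

In this example the largest cell is the 6 items that Coder A marked "Neutral" and Coder B marked "Positive", which suggests the boundary between those two categories is the first definition to revisit.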

Limitations of Cohen's Kappa

Prevalence problem: When one category dominates (e.g., 90% of responses are "neutral"), P_e becomes very high, and kappa can be paradoxically low even with high percent agreement. This is a known issue called the "kappa paradox."
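A quick numerical sketch makes the paradox concrete. The two matrices below are hypothetical, invented for illustration only: both have 90% observed agreement, but in the first one category dominates.

```python
def cohen_kappa(matrix):
    """Return (P_o, P_e, kappa) for a square confusion matrix."""
    n = sum(sum(row) for row in matrix)
    k = len(matrix)
    p_o = sum(matrix[i][i] for i in range(k)) / n
    row_p = [sum(matrix[i]) / n for i in range(k)]
    col_p = [sum(matrix[i][j] for i in range(k)) / n for j in range(k)]
    p_e = sum(r * c for r, c in zip(row_p, col_p))
    return p_o, p_e, (p_o - p_e) / (1 - p_e)


# Hypothetical data: 90% of items fall in the first category for both coders.
skewed = [
    [85, 5],
    [5, 5],
]
# Hypothetical data: the two categories are equally common.
balanced = [
    [45, 5],
    [5, 45],
]

skewed_result = cohen_kappa(skewed)
balanced_result = cohen_kappa(balanced)
for name, (p_o, p_e, kappa) in [("skewed", skewed_result), ("balanced", balanced_result)]:
    print(f"{name:<9} P_o={p_o:.2f}  P_e={p_e:.2f}  kappa={kappa:.2f}")
```

With these numbers, P_e rises from 0.50 to 0.82 in the skewed case, so the same 0.90 observed agreement yields a kappa of about 0.44 instead of 0.80.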

Only two raters: Standard kappa works for exactly two raters. For three or more raters, you need Fleiss' kappa, which extends the concept to multiple judges.

Nominal categories only: Kappa treats all disagreements equally. If your categories are ordinal (mild, moderate, severe), a disagreement between "mild" and "severe" is counted the same as one between "mild" and "moderate." For ordinal data, weighted kappa is more appropriate because it assigns smaller penalties to disagreements between adjacent categories.
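The following sketch shows one common form of weighted kappa, using linear weights where the penalty equals the number of steps between two ordinal categories. The matrix is hypothetical illustration data, and the function name and `weight` parameter are assumptions of this example. Kappa is computed as 1 minus the ratio of weighted observed disagreement to weighted chance-expected disagreement; with a 0/1 penalty this reduces to standard kappa.

```python
def weighted_kappa(matrix, weight):
    """Weighted kappa: 1 - (weighted observed disagreement / weighted expected disagreement)."""
    k = len(matrix)
    n = sum(sum(row) for row in matrix)
    row_tot = [sum(matrix[i]) for i in range(k)]
    col_tot = [sum(matrix[i][j] for i in range(k)) for j in range(k)]
    observed = 0.0
    expected = 0.0
    for i in range(k):
        for j in range(k):
            w = weight(i, j)
            observed += w * matrix[i][j]
            # Expected count in cell (i, j) if the coders were independent.
            expected += w * row_tot[i] * col_tot[j] / n
    return 1 - observed / expected


# Hypothetical ratings in the order mild, moderate, severe.
matrix = [
    [20, 5, 0],
    [4, 30, 6],
    [1, 4, 30],
]

# Every disagreement costs 1: equivalent to standard (unweighted) kappa.
unweighted = weighted_kappa(matrix, lambda i, j: 0 if i == j else 1)
# Linear weights: cost grows with the distance between categories.
linear = weighted_kappa(matrix, lambda i, j: abs(i - j))
print(f"unweighted={unweighted:.3f}  linear-weighted={linear:.3f}")
```

Because most disagreements in this matrix are between adjacent categories, the linear-weighted value comes out higher than the unweighted one.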

When to Use Cohen's Kappa

Coding open-ended survey responses to verify that multiple coders classify responses consistently before running analysis
Content analysis where researchers categorize text, images, or media into predefined themes
Quality assurance for data classification tasks in research operations
Validating AI-assisted coding by comparing automated classifications against human expert judgments
Clinical or diagnostic research where practitioners independently classify cases

Common Mistakes to Avoid

Reporting percent agreement instead of kappa: percent agreement doesn't account for chance, so agreement looks artificially high, especially when categories have unequal prevalence
Calculating kappa on the full dataset instead of a random subsample: inter-rater reliability should be assessed on a representative random subset coded by both raters (typically 10-20% of items); once agreement is acceptable, the same coding scheme is applied to the rest
Using standard kappa for ordinal categories: if the categories have a natural order, weighted kappa is more appropriate because it distinguishes near-misses from large disagreements

How Quali-Fi Supports Inter-Rater Reliability

Quali-Fi's qualitative analysis tools include built-in kappa calculation for any coding task involving multiple coders. The platform generates confusion matrices, flags categories with the highest disagreement rates, and tracks reliability across coding iterations so you can document improvement as you refine your codebook.

Frequently Asked Questions

Can kappa be negative?

Yes. A negative kappa means the raters agree less often than chance alone would predict; in other words, they're systematically disagreeing. This is rare in practice and usually indicates a fundamental misunderstanding of the coding scheme (e.g., one coder is using the categories in reverse).
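A tiny hypothetical example shows how this can happen. The two label lists below are invented for illustration, and `kappa_from_labels` is a name chosen for this sketch; the coders agree on only 2 of 8 items even though each uses both categories equally often.

```python
from collections import Counter


def kappa_from_labels(a, b):
    """Cohen's kappa from two equal-length lists of category labels."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from each coder's own category frequencies.
    p_e = sum(count_a[c] * count_b[c] for c in set(a) | set(b)) / (n * n)
    return (p_o - p_e) / (1 - p_e)


# Hypothetical codes: coder B often picks the opposite category.
coder_a = ["pos", "pos", "pos", "neg", "neg", "neg", "pos", "neg"]
coder_b = ["neg", "neg", "pos", "pos", "pos", "neg", "neg", "pos"]

kappa = kappa_from_labels(coder_a, coder_b)
print(kappa)  # -0.5
```

Here observed agreement is 0.25 while chance agreement is 0.50, so kappa is negative.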

What's the difference between Cohen's kappa and Fleiss' kappa?

Cohen's kappa is designed for exactly two raters. Fleiss' kappa extends the concept to three or more raters, measuring the degree of agreement among multiple judges beyond what's expected by chance. The formulas differ, but the same interpretation scale is commonly applied to both.
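To show how the formula differs, here is a minimal sketch of Fleiss' kappa. It assumes the input is a table with one row per item and one column per category, where each cell holds the number of raters who chose that category and every item is rated by the same number of raters. The data and the function name `fleiss_kappa` are hypothetical.

```python
def fleiss_kappa(table):
    """Fleiss' kappa for a table of per-item category counts (equal raters per item)."""
    n_items = len(table)
    n_raters = sum(table[0])
    if any(sum(row) != n_raters for row in table):
        raise ValueError("every item must be rated by the same number of raters")
    # Per-item agreement: share of rater pairs that agree on the item.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in table
    ]
    p_bar = sum(p_items) / n_items
    # Overall share of all ratings that fall in each category.
    n_cats = len(table[0])
    p_cat = [
        sum(row[j] for row in table) / (n_items * n_raters) for j in range(n_cats)
    ]
    p_e = sum(p * p for p in p_cat)
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical data: 5 items, 3 raters, 3 categories.
table = [
    [3, 0, 0],
    [0, 3, 0],
    [2, 1, 0],
    [0, 1, 2],
    [0, 0, 3],
]
kappa = fleiss_kappa(table)
print(f"{kappa:.3f}")
```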

How many items should two coders overlap on?

A common guideline is to double-code at least 10-20% of your total items, with a minimum of 30-50 items. This gives enough data to calculate a stable kappa. If the overlap sample is too small, kappa will be unstable and may not reflect the true level of agreement.
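To get a rough feel for how sample size affects stability, the sketch below uses a commonly cited large-sample approximation for the standard error of kappa, sqrt(P_o(1 - P_o) / (N(1 - P_e)^2)). Treat it as an approximation only: it assumes P_o and P_e stay at the values from the worked example (0.81 and 0.335) regardless of sample size, and the function name is hypothetical.

```python
import math


def approx_kappa_se(p_o, p_e, n):
    """Approximate large-sample standard error of kappa (an approximation, not exact)."""
    return math.sqrt(p_o * (1 - p_o) / (n * (1 - p_e) ** 2))


p_o, p_e = 0.81, 0.335
kappa = (p_o - p_e) / (1 - p_e)
se_by_n = {n: approx_kappa_se(p_o, p_e, n) for n in (30, 50, 100, 300)}

for n, se in se_by_n.items():
    # Rough 95% interval: kappa plus or minus 1.96 standard errors.
    low, high = kappa - 1.96 * se, kappa + 1.96 * se
    print(f"n={n:>3}  SE={se:.3f}  approx. 95% interval: {low:.2f} to {high:.2f}")
```

Under these assumptions, the approximate interval at 30 items is nearly twice as wide as at 100 items, which is why very small overlap samples give an unstable kappa.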

Related Guides

Cronbach's Alpha: What It Is, Formula, and Acceptable Thresholds

Inter-Rater Reliability Explained

Content Analysis Research: What It Is and How to Use It in Research

Qualitative Data: What It Is and How to Use It in Research

Survey Design: The Complete Guide to Building Effective Surveys
