Linear Regression: What It Is and How to Use It in Research

Learn what linear regression is, how to interpret R-squared, and when to use simple linear models in market research and survey analysis.

What Is Linear Regression?

Linear regression is a statistical method that models the relationship between a continuous outcome variable and one or more predictor variables by fitting a straight line through the data. In its simplest form, simple linear regression, you have one predictor and one outcome, and the model finds the line that minimizes the sum of squared vertical distances between the observed data points and the line's predicted values. It's one of the most widely used techniques in market research, applied to everything from predicting customer satisfaction scores based on service quality ratings to estimating sales volume from advertising spend. The output tells you both the direction and magnitude of the relationship: how much the outcome changes, on average, for each one-unit increase in the predictor.

Why Linear Regression Matters

Linear regression provides a clear, quantifiable answer to the question "how much does Y change when X changes?", which is the kind of question that drives most business decisions. Unlike a correlation coefficient on its own, it gives you a predictive equation you can use to forecast outcomes under different scenarios. It's also the foundation for many more advanced statistical techniques used in research, such as multiple regression and logistic regression, so understanding it well makes those easier to learn.

How Linear Regression Works

The Model

The simple linear regression equation is:

Y = b₀ + b₁X + ε

Where Y is the outcome (dependent variable), X is the predictor (independent variable), b₀ is the intercept (the predicted value of Y when X = 0), b₁ is the slope (the change in Y for each one-unit increase in X), and ε represents the error term, the variation in Y not explained by X.
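
For reference, the line that minimizes the sum of squared errors has a closed-form solution. The formulas below are the standard least-squares estimates for one predictor. The notation is an addition here, not from the original text: n is the number of observations, i indexes them, and X̄ and Ȳ are the sample means of X and Y.

```latex
\begin{aligned}
b_1 &= \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2} \\
b_0 &= \bar{Y} - b_1 \bar{X}
\end{aligned}
```

The second line also shows that the fitted line always passes through the point (X̄, Ȳ).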

Worked Example

You want to know if there's a relationship between the number of survey reminders sent (X) and response rate percentage (Y). You run 8 surveys with varying reminder counts:

Reminders (X)    Response Rate % (Y)
0                12
1                18
1                21
2                25
2                28
3                30
3                34
4                38

Running the regression produces: Y = 12.92 + 6.42X

Interpretation: With zero reminders, the predicted response rate is about 12.9%. Each additional reminder is associated with a 6.42 percentage-point increase in response rate. If you sent 2 reminders, the predicted response rate is 12.92 + 6.42(2) = 25.76%, or roughly 25.8%.
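
If you want to check the fit yourself, the sketch below refits the eight data points from the table using the closed-form least-squares formulas and only the Python standard library. The function name fit_simple_ols is illustrative, not part of any package, and in practice you would more likely use a statistics library.

```python
# Refit the eight (reminders, response rate) pairs from the table
# with the closed-form least-squares formulas for one predictor.

reminders = [0, 1, 1, 2, 2, 3, 3, 4]              # X
response_rate = [12, 18, 21, 25, 28, 30, 34, 38]  # Y, in percent


def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of squared deviations of x, and sum of cross-products of x and y
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


b0, b1 = fit_simple_ols(reminders, response_rate)
print(f"Y = {b0:.2f} + {b1:.2f}X")
print(f"Predicted response rate at 2 reminders: {b0 + b1 * 2:.2f}%")
```

Because 2 happens to be the mean number of reminders in this data set, the prediction at X = 2 equals the mean response rate.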

R-Squared (R²)

R-squared tells you the proportion of variance in the outcome that's explained by the predictor(s). It ranges from 0 to 1.

R² = 1 - (SS_residual / SS_total)

Where SS_residual is the sum of squared differences between observed and predicted values, and SS_total is the sum of squared differences between observed values and the mean.

In the reminders example, R² = 0.96, meaning 96% of the variation in response rates is explained by the number of reminders. That's unusually high. In real-world market research, R² values between 0.20 and 0.50 are common and still useful.
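
The R² formula can be applied directly to the reminders data. This is a minimal sketch using only the standard library; the function name r_squared is an assumption made for illustration.

```python
# Compute R-squared for the reminders example as
# 1 - SS_residual / SS_total, after fitting the least-squares line.

reminders = [0, 1, 1, 2, 2, 3, 3, 4]
response_rate = [12, 18, 21, 25, 28, 30, 34, 38]


def r_squared(x, y):
    """Fit y = b0 + b1 * x by least squares and return R-squared."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    predicted = [intercept + slope * xi for xi in x]
    # Squared gaps between observed and predicted values
    ss_residual = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))
    # Squared gaps between observed values and their mean
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_residual / ss_total


r2 = r_squared(reminders, response_rate)
print(f"R-squared = {r2:.2f}")
```

Rounded to two decimals, this reproduces the 0.96 quoted for the example.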

Assumptions

For the results to be trustworthy, linear regression requires:

  1. Linearity: The relationship between X and Y is approximately straight. Check with a scatterplot.
  2. Independence: Observations don't influence each other. This is violated when you have repeated measures from the same respondent.
  3. Homoscedasticity: The spread of residuals is roughly constant across all values of X. Fan-shaped residual plots signal a violation.
  4. Normality of residuals: The errors follow a roughly normal distribution. This matters most for small samples and for confidence intervals around predictions.
  5. No influential outliers: A single extreme data point can dramatically shift the regression line.
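
Several of these checks start from the residuals. The sketch below is one possible starting point, not a full diagnostic suite: it computes each observation's residual, its leverage, and its Cook's distance (a common measure of how much a single point influences the fit) for the reminders data. The variable names are illustrative, and cutoffs for "too influential" are rules of thumb that vary by source, so none is hard-coded here.

```python
# Residuals, leverage, and Cook's distance for the reminders example.

reminders = [0, 1, 1, 2, 2, 3, 3, 4]
response_rate = [12, 18, 21, 25, 28, 30, 34, 38]

n = len(reminders)
mean_x = sum(reminders) / n
mean_y = sum(response_rate) / n
sxx = sum((x - mean_x) ** 2 for x in reminders)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(reminders, response_rate))
b1 = sxy / sxx
b0 = mean_y - b1 * mean_x

# Residual = observed minus predicted; plot these against X to look for
# curvature (linearity) or a fan shape (homoscedasticity).
residuals = [y - (b0 + b1 * x) for x, y in zip(reminders, response_rate)]

# Leverage: how far each X sits from the mean of X.
leverage = [1 / n + (x - mean_x) ** 2 / sxx for x in reminders]

# Residual variance with n - 2 degrees of freedom (two estimated parameters).
residual_variance = sum(e ** 2 for e in residuals) / (n - 2)

# Cook's distance for a model with p = 2 parameters (intercept and slope).
p = 2
cooks_distance = [
    (e ** 2 / (p * residual_variance)) * h / (1 - h) ** 2
    for e, h in zip(residuals, leverage)
]

for x, e, h, d in zip(reminders, residuals, leverage, cooks_distance):
    print(f"X={x}  residual={e:+.2f}  leverage={h:.3f}  Cook's D={d:.3f}")
```

Points with both a large residual and high leverage produce the largest Cook's distances and deserve a closer look.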

Interpreting the Output

A typical regression output includes:

  • Coefficient (b₁): The slope, the predicted change in Y for a one-unit change in X
  • Standard error: How precisely the coefficient is estimated
  • t-statistic and p-value: Whether the coefficient differs significantly from zero
  • Confidence interval: The range of plausible values for the true coefficient
  • R²: Overall model explanatory power
  • F-statistic: Whether the model as a whole explains significant variance
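
To make these quantities concrete, here is a sketch that computes most of them by hand for the reminders example. Two assumptions to flag: the critical value 2.447 is the two-sided 95% value of the t distribution with 6 degrees of freedom, taken from a standard t table, and the p-value is left out because the Python standard library has no t distribution. A statistics package would normally report it for you.

```python
import math

# Coefficient, standard error, t-statistic, confidence interval,
# R-squared, and F-statistic for the reminders example.

reminders = [0, 1, 1, 2, 2, 3, 3, 4]
response_rate = [12, 18, 21, 25, 28, 30, 34, 38]

n = len(reminders)
mean_x = sum(reminders) / n
mean_y = sum(response_rate) / n
sxx = sum((x - mean_x) ** 2 for x in reminders)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(reminders, response_rate))
b1 = sxy / sxx
b0 = mean_y - b1 * mean_x

ss_total = sum((y - mean_y) ** 2 for y in response_rate)
ss_residual = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(reminders, response_rate))
ss_regression = ss_total - ss_residual

# Residual variance uses n - 2 degrees of freedom (intercept and slope).
residual_variance = ss_residual / (n - 2)

se_b1 = math.sqrt(residual_variance / sxx)   # standard error of the slope
t_stat = b1 / se_b1                          # tests whether the slope is zero
f_stat = ss_regression / residual_variance   # one predictor, so F equals t squared

T_CRIT_95_DF6 = 2.447  # assumed constant: from a t table, 95% two-sided, 6 df
ci_low = b1 - T_CRIT_95_DF6 * se_b1
ci_high = b1 + T_CRIT_95_DF6 * se_b1

r2 = 1 - ss_residual / ss_total

print(f"slope = {b1:.3f}, SE = {se_b1:.3f}, t = {t_stat:.2f}")
print(f"95% CI for slope: ({ci_low:.2f}, {ci_high:.2f})")
print(f"R-squared = {r2:.3f}, F = {f_stat:.1f}")
```

With a single predictor the F-statistic is just the square of the slope's t-statistic, so the two tests always agree. They diverge only once the model has more than one predictor.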

When to Use Linear Regression

  • Predicting continuous outcomes like satisfaction scores, revenue, or response rates from one or more measurable predictors
  • Quantifying the effect size of a single factor: how much does each additional dollar of ad spend contribute to awareness?
  • Baseline modeling before adding complexity with multiple regression, interaction terms, or nonlinear transformations
  • Trend analysis to estimate how a metric changes over time when the trend appears roughly linear

Common Mistakes to Avoid

  • Extrapolating beyond your data range: a model trained on 0-4 reminders can't reliably predict what happens at 10 reminders
  • Assuming causation from regression alone: the model shows association; experimental design is what establishes causation
  • Ignoring assumption violations: running the model on clearly nonlinear data or data with extreme outliers produces misleading coefficients
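
One lightweight way to guard against the first mistake is to have your prediction code refuse inputs outside the range the model was fitted on. The helper below is a hypothetical sketch, not a feature of any particular tool; the name predict_within_range and the choice to raise an error (rather than warn) are design assumptions.

```python
# Guard against extrapolation: only predict inside the observed X range.

reminders = [0, 1, 1, 2, 2, 3, 3, 4]
response_rate = [12, 18, 21, 25, 28, 30, 34, 38]

n = len(reminders)
mean_x = sum(reminders) / n
mean_y = sum(response_rate) / n
sxx = sum((x - mean_x) ** 2 for x in reminders)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(reminders, response_rate))
slope = sxy / sxx
intercept = mean_y - slope * mean_x


def predict_within_range(x_new, x_observed):
    """Predict Y for x_new, refusing values outside the fitted X range."""
    x_min, x_max = min(x_observed), max(x_observed)
    if not x_min <= x_new <= x_max:
        raise ValueError(
            f"X={x_new} is outside the fitted range [{x_min}, {x_max}]; "
            "the linear trend may not hold there."
        )
    return intercept + slope * x_new


print(predict_within_range(2, reminders))  # inside the 0-4 range

try:
    predict_within_range(10, reminders)    # 10 reminders was never observed
except ValueError as err:
    print(f"Refused: {err}")
```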

How Quali-Fi Supports Linear Regression

Quali-Fi's Research plan ($1,061/month) includes regression analysis tools that generate coefficient tables, residual diagnostics, and R-squared summaries directly from survey data. The platform flags assumption violations automatically, so you know when a linear model fits well and when you need a different approach.

Frequently Asked Questions

What's a "good" R-squared value?

It depends entirely on your field. In physics, R² below 0.90 might be concerning. In market research and social science, R² between 0.20 and 0.50 is typical and can be highly actionable. A model explaining 30% of variation in purchase intent still provides valuable insight into which levers move the needle.

Can I use linear regression with categorical predictors?

Yes. Categorical predictors are included as dummy variables (0/1 coding). If you have a variable like "region" with four categories, you'd create three dummy variables. The coefficients represent the difference in the outcome relative to a reference category.
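
As an illustration of the coding scheme, the sketch below turns a four-category "region" variable into three 0/1 dummy variables. The category names and the choice of "North" as the reference category are hypothetical; any category can serve as the reference.

```python
# Dummy (0/1) coding for a four-category predictor.
# The reference category gets no column of its own: it is the case
# where all three dummies are 0.

REGIONS = ["North", "South", "East", "West"]  # hypothetical categories
REFERENCE = "North"                           # hypothetical reference category


def dummy_code(region):
    """Return the 0/1 indicators for one respondent's region."""
    if region not in REGIONS:
        raise ValueError(f"Unknown region: {region!r}")
    return {
        f"region_{name}": int(region == name)
        for name in REGIONS
        if name != REFERENCE
    }


for region in REGIONS:
    print(region, dummy_code(region))
```

In a fitted model, the coefficient on region_South would then be the estimated difference in the outcome between South and North, holding other predictors constant.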

What's the difference between linear regression and correlation?

Correlation (r) measures the strength and direction of a linear relationship between two variables. Linear regression goes further: it provides a predictive equation, quantifies how much Y changes per unit change in X, and can be extended to include multiple predictors. Correlation is symmetric (r of X with Y equals r of Y with X); regression is directional.
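
The symmetry point can be seen numerically with the reminders data from the worked example. In this sketch (helper names are illustrative), swapping X and Y leaves r unchanged but gives a different regression slope. For simple linear regression, r squared also equals R².

```python
import math

# Correlation is symmetric; regression slopes are not.

reminders = [0, 1, 1, 2, 2, 3, 3, 4]
response_rate = [12, 18, 21, 25, 28, 30, 34, 38]


def _sums(a, b):
    """Return (Saa, Sbb, Sab): sums of squared deviations and cross-products."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    saa = sum((ai - mean_a) ** 2 for ai in a)
    sbb = sum((bi - mean_b) ** 2 for bi in b)
    sab = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    return saa, sbb, sab


def pearson_r(a, b):
    saa, sbb, sab = _sums(a, b)
    return sab / math.sqrt(saa * sbb)


def ols_slope(predictor, outcome):
    spp, _, spo = _sums(predictor, outcome)
    return spo / spp


r_xy = pearson_r(reminders, response_rate)
r_yx = pearson_r(response_rate, reminders)
slope_y_on_x = ols_slope(reminders, response_rate)
slope_x_on_y = ols_slope(response_rate, reminders)

print(f"r(X, Y) = {r_xy:.3f}, r(Y, X) = {r_yx:.3f}")
print(f"slope of Y on X = {slope_y_on_x:.3f}")
print(f"slope of X on Y = {slope_x_on_y:.3f}")
print(f"r squared = {r_xy ** 2:.3f}")
```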
