Tutorial 9: Regression Modelling II
In this session, you will learn the following:
Assumptions of Linear Regression
Transforming variables (log and quadratic transformation)
Always make sure you check your assumptions. Some assumptions are more important than others. If the assumptions are not met, then consider other models such as logistic regression (covered in the next tutorial).
Remember, if the assumptions are not met, it is possible that your output will not explain your data - rubbish in, rubbish out.
9.1 Linear Regression Assumptions
9.1.1 Why we check for assumptions
Checking for the assumptions of a statistical model, like linear regression, is fundamental for several reasons. Ensuring these assumptions are met is critical for the accuracy, reliability, and validity of the conclusions drawn from the model. Here's why checking for assumptions matters:
Validity of Model Inferences: For the statistical inferences of a model to be valid, such as hypothesis tests and confidence intervals for the model parameters, the underlying assumptions of the model must be satisfied. If assumptions are violated, the inferences made about the population from the sample data may not be accurate.
Accuracy of Predictions: The predictive accuracy of a model relies on how well the model's assumptions are met. Violations of these assumptions can lead to biased or inaccurate predictions. Checking assumptions helps ensure the model correctly captures the relationship between the variables.
Efficiency of Estimators: Many statistical models, including linear regression, aim to provide the best possible estimates of the relationship between variables. When assumptions such as linearity, independence of errors, and homoscedasticity are met, the ordinary least squares (OLS) estimators are the best linear unbiased estimators available (the Gauss-Markov theorem); normality of the residuals is not required for that result, but it underpins exact hypothesis tests and makes OLS efficient among unbiased estimators. Violating these assumptions can lead to less efficient estimators, meaning there could be other estimators that have lower variance.
Identification of Model Limitations: Checking assumptions can help identify limitations or weaknesses in your model, pointing to areas where the model could be improved. For example, if residuals are not normally distributed or there is evidence of heteroscedasticity, this might indicate missing variables, the need for variable transformation, or the presence of outliers.
Guidance on Model Selection: The process of checking assumptions can guide the selection of an appropriate model or the need for model adjustments. If certain assumptions are not met, alternative models that do not have these assumptions (e.g., generalized linear models for non-normal data) might be more appropriate.
Checking that a model's assumptions are met is a critical step in any statistical analysis. It ensures that the model is appropriate for the data and that the conclusions drawn from the analysis are valid and reliable.
-----------------------------------------------------------------------------------------------------------------------------
When assumptions are not met:
If the assumptions are not met, consider the following options:
Add More Explanatory Variables to Your Model: Determine if there are additional variables that could help explain your dependent variable. If so, incorporate these into your model to see if the assumptions are now satisfied. However, be cautious not to indiscriminately add too many variables, as this can lead to other issues such as overfitting.
Non-linear Relationships: The relationship between variables may be non-linear rather than linear. In such cases, you can model this non-linear relationship. Doing so is relatively straightforward, but interpreting the results can be more complex. For guidance on modelling non-linear relationships, you might find [this link] helpful (Note: Link placeholder for actual URL).
Transforming Your Dependent and/or Independent Variable: Applying a mathematical transformation, such as the log or quadratic transformations covered in Section 9.2, to the dependent and/or independent variables can sometimes bring the model in line with its assumptions.
Quantile or Robust Regression: If the above strategies do not address the issue, quantile regression or robust regression methods may be appropriate. These techniques are designed to be more resilient to outliers and heteroscedasticity but are not covered in this tutorial.
Logistic Regression for Categorical Outcomes: Alternatively, if your dependent variable is or can be converted into an ordinal or nominal variable, logistic regression can be a suitable analytical approach.
Watch an in-depth explanation of why we test for assumptions and what we do if they are not met here. A much shorter introduction to assumption testing is below.
9.1.2 Assumption 1: Linearity
The assumption of linearity is fundamental in many statistical models, particularly in linear regression, where we need a straight-line relationship between independent and dependent variables. This assumption implies that any change in an independent variable will result in a consistent change in the dependent variable, regardless of the value of the independent variable. Ensuring this assumption is met is crucial for the model to accurately represent the relationship between variables and for the validity of inference made from the model.
Checking for Linearity
Linearity can be assessed through several methods:
Scatter Plots: Plotting scatter plots of the independent variables against the dependent variable can visually indicate whether a linear relationship exists. Non-linear patterns, such as curves or clusters, suggest violations of the linearity assumption. This is easy to do for simple linear regressions.
Residual Plots: Plotting the residuals (the differences between observed and predicted values) against predicted values or against an independent variable can help identify non-linearity. Ideally, the residuals should be randomly scattered around zero without any apparent pattern.
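If you are working in a scripting environment, the following minimal sketch (Python with statsmodels and matplotlib, using simulated, hypothetical data) produces both plots; a curved band in the residuals-vs-fitted plot is the tell-tale sign of non-linearity.

```python
# A minimal sketch (hypothetical data): simulate a mildly curved relationship,
# fit a straight-line model, and inspect the scatter and residual-vs-fitted plots.
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 200)                              # hypothetical predictor
y = 2 + 1.5 * x + 0.3 * x**2 + rng.normal(0, 2, 200)     # true relationship is curved

X = sm.add_constant(x)                                   # add an intercept column
model = sm.OLS(y, X).fit()

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(x, y, alpha=0.5)                         # raw scatter: look for curvature
axes[0].set(xlabel="x", ylabel="y", title="Scatter plot")

axes[1].scatter(model.fittedvalues, model.resid, alpha=0.5)
axes[1].axhline(0, color="red", linestyle="--")
axes[1].set(xlabel="Fitted values", ylabel="Residuals",
            title="Residuals vs fitted")                 # a curved band signals non-linearity
plt.tight_layout()
plt.show()
```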
Addressing Linearity Issues
When the linearity assumption is violated, you can use several strategies:
Transformation of Variables: Applying transformations to the dependent and/or independent variables can sometimes linearize the relationship. Common transformations include logarithmic, square root, and power transformations. The choice of transformation depends on the nature of the non-linearity.
Adding Polynomial Terms: If the relationship between variables is curvilinear, adding polynomial terms (e.g., the square or cube of an independent variable) to the model can capture the non-linear relationship.
Piecewise Regression (Segmented Regression): If the data suggests different linear relationships in different ranges of the independent variable, piecewise regression can model these distinct phases separately.
Non-linear Modeling: When linear models are insufficient to capture the complexity of the relationship, moving to non-linear models or generalized additive models (GAMs) may be more appropriate (these are not covered in this tutorial).
Important Considerations
Transformations and the addition of polynomial terms can make interpreting model coefficients more complex. It's essential to understand how these changes affect the relationship between variables. Adding polynomial terms or segmenting the data increases the model's complexity. It's crucial to balance the need for capturing non-linearity with the risk of overfitting, where the model fits the training data too closely and performs poorly on new data. Ultimately, the choice between transforming variables, adding polynomial terms, or switching to a non-linear model depends on the specific research question, the nature of the data, and the interpretability of the model.
While the assumption of linearity simplifies modelling and interpretation, real-world data often exhibit complex relationships that require careful assessment and potentially sophisticated modelling strategies to accurately capture.
9.1.3 Assumption 2: Independence
The assumption of independence stipulates that the observations (data points) are independent of each other; that is, the value of one observation does not influence or predict the value of another. This assumption is foundational for the validity of statistical tests, as many inferential statistics rely on the independence of observations to produce accurate standard errors, confidence intervals, and significance tests.
How to Check for Independence
Durbin-Watson Statistic: This test specifically checks for autocorrelation in residuals from a regression analysis. A Durbin-Watson statistic near 2 suggests no autocorrelation; values deviating substantially from 2 indicate positive or negative autocorrelation.
Examination of Study Design: Often, the best way to assess independence is by examining the study design. For example, in experiments, randomization ensures independence. In observational studies, you'll need to consider whether the data collection method might introduce dependence.
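Of these two checks, the Durbin-Watson statistic can be computed directly from the model residuals. A minimal sketch (Python, statsmodels, simulated hypothetical data) is shown below.

```python
# A minimal sketch: compute the Durbin-Watson statistic for the residuals of a
# fitted OLS model (hypothetical data; values near 2 suggest no autocorrelation).
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 1 + 2 * x + rng.normal(size=100)          # errors are independent by construction

model = sm.OLS(y, sm.add_constant(x)).fit()
dw = durbin_watson(model.resid)
print(f"Durbin-Watson statistic: {dw:.2f}")   # expect a value close to 2 here
```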
Addressing Independence Issues
Violations of the independence assumption can often be addressed through modifications to the model or the data. Generalized Estimating Equations can be used for correlated data, often found in longitudinal studies, where observations from the same subject are correlated.
Use a Random Effects or Multi-Level Model: These models are particularly useful for handling data where observations are nested within higher-level groups (e.g., students within schools). By introducing random effects, they account for the lack of independence within groups while still assuming independence between groups, which makes multi-level (hierarchical) regression a natural way to address independence issues in nested data (a sketch of such a model follows this list).
Alternatively, use Fixed Effects Models for panel data to control for unobserved heterogeneity when this heterogeneity is constant over time and specific to individuals or entities, thus addressing potential non-independence.
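The sketch below illustrates the multi-level idea, fitting a random-intercept model with statsmodels on a hypothetical dataset of students nested within schools; the variable names and data are invented for illustration.

```python
# A hedged sketch of a random-intercept (multi-level) model with statsmodels,
# using hypothetical data: students nested within schools.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n_schools, n_per_school = 20, 30
school = np.repeat(np.arange(n_schools), n_per_school)
school_effect = rng.normal(0, 2, n_schools)[school]     # shared effect within each school
hours = rng.uniform(0, 10, n_schools * n_per_school)
score = 50 + 3 * hours + school_effect + rng.normal(0, 5, len(hours))

df = pd.DataFrame({"score": score, "hours": hours, "school": school})

# A random intercept for each school accounts for the dependence within schools.
mixed = smf.mixedlm("score ~ hours", data=df, groups=df["school"]).fit()
print(mixed.summary())
```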
Ensuring the assumption of independence is crucial for the integrity of your statistical analysis. When this assumption is violated, it means the standard errors, p-values, and confidence intervals may not be reliable. Using appropriate models like multi-level regression or adjusting the analysis technique can help mitigate these issues and lead to more accurate and trustworthy conclusions.
9.1.4 Assumption 3: Homoscedasticity
The assumption of homoscedasticity, or equal variance, is pivotal in linear regression analysis and various other statistical modelling techniques. It posits that the variance of the error terms (residuals) is constant across all levels of the independent variables. In simpler terms, as the value of the predictor variables changes, the spread (variability) of the residuals remains consistent.
Importance of Homoscedasticity
Homoscedasticity ensures that the model estimates the dependent variable with uniform precision across the range of independent variables. It's crucial for several reasons:
Reliability of Coefficient Estimates: Homoscedasticity helps to ensure that the standard errors of the regression coefficients are accurately estimated, which is essential for conducting valid hypothesis tests.
Validity of Statistical Tests: Many tests, including those for the significance of regression coefficients, assume homoscedasticity. If this assumption is violated, the test statistics may not follow the assumed distribution, leading to incorrect inferences.
Checking for Homoscedasticity
Homoscedasticity can be assessed visually and through statistical tests:
Residual Plot Analysis: Plotting the residuals against the predicted values or one of the independent variables. In a homoscedastic dataset, the plot should show a random scatter of points without a discernible pattern. Patterns, such as a funnel shape where the spread of residuals increases or decreases with the fitted values, indicate heteroscedasticity.
Interpreting the Residual Plot:
The residuals are the differences between the observed values and the values predicted by the model. Analyzing these differences can provide insights into the adequacy of the model, the presence of outliers, and whether the assumptions of linear regression are being met (e.g., linearity, homoscedasticity, independence, and normality of residuals).
A fitted residual plot is a specific type of residual plot that plots the residuals on the y-axis against the fitted values (or predicted values) on the x-axis. This plot can help assess:
Linearity: If the relationship between the variables is linear, the plot should show no specific pattern. A systematic pattern (like a curve) suggests a non-linear relationship that the linear model can't capture.
Homoscedasticity: This assumption means that the variance of the error terms is constant across all levels of the independent variables. In the plot, this is indicated by the residuals spreading equally across all levels of the fitted values without forming a funnel shape (where the spread increases or decreases with the fitted values).
Outliers: Points that are far away from the zero line could indicate outliers in the data that could potentially influence the regression model disproportionately.
Leverage points: The fitted residual plot does not show leverage directly, but unusual patterns or outliers in it can point you towards data points with high leverage.
While the fitted residual plot is a powerful diagnostic tool, it's usually not enough on its own. It is best used in conjunction with other diagnostic plots and statistics, such as:
QQ Plot (Quantile-Quantile Plot): Helps to assess if the residuals are approximately normally distributed.
Leverage Plot: Helps to identify influential observations that have a significant impact on the model's parameters.
Cook’s Distance: A measure that combines the information of leverage and residuals to identify influential data points.
Together, these diagnostics give a comprehensive view of the model's performance and assumptions. It's crucial to use multiple diagnostics because each highlights different aspects of the data and model fitting process.
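As an illustration, the hedged sketch below (Python with statsmodels, simulated heteroscedastic data) produces three of these diagnostics side by side: a residuals-vs-fitted plot, a Q-Q plot of the residuals, and an influence plot in which the point size reflects Cook's Distance.

```python
# A hedged sketch producing three common diagnostics for a fitted OLS model
# (hypothetical data whose error variance grows with x, i.e. heteroscedastic).
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(7)
x = rng.uniform(0, 10, 150)
y = 3 + 2 * x + rng.normal(0, 1 + 0.3 * x, 150)     # spread of errors increases with x

model = sm.OLS(y, sm.add_constant(x)).fit()

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

axes[0].scatter(model.fittedvalues, model.resid, alpha=0.5)
axes[0].axhline(0, color="red", linestyle="--")
axes[0].set(xlabel="Fitted values", ylabel="Residuals", title="Residuals vs fitted")

sm.qqplot(model.resid, line="s", ax=axes[1])        # Q-Q plot of the residuals
axes[1].set(title="Normal Q-Q")

sm.graphics.influence_plot(model, ax=axes[2], criterion="cooks")
axes[2].set(title="Influence plot")                 # leverage vs studentized residuals

plt.tight_layout()
plt.show()
```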
-----------------------------------------------------------------------------------------------------------------------------
Addressing Heteroscedasticity
When homoscedasticity is violated, there are several approaches to address the issue:
Transformations: Applying a transformation to the dependent variable (e.g., log, square root) can sometimes stabilize the variance across the range of independent variables.
Weighted Least Squares (WLS): Instead of ordinary least squares (OLS), WLS can be used, where each observation is weighted inversely to its variance, helping to mitigate the impact of heteroscedasticity (not covered here).
Robust Standard Errors: Utilizing robust standard errors can adjust the estimates to account for heteroscedasticity, making the model's inference more reliable without transforming the data or changing the estimation technique. (not covered here).
The assumption of homoscedasticity is integral to many statistical analyses for ensuring the accuracy and validity of the model's inferences. Identifying and addressing heteroscedasticity is crucial for the reliability of the conclusions drawn from statistical models.
9.1.5 Assumption 4: Normality of Residuals
The normality assumption posits that the residuals — the differences between the observed values and the values predicted by the model — are normally distributed for any given value of the independent variables.
Importance of Normality of Residuals
The normality of residuals is crucial for several reasons:
Confidence Intervals and Hypothesis Testing: The normality assumption underpins the validity of confidence intervals and hypothesis tests for the regression coefficients. When residuals are normally distributed, it ensures that t-tests and F-tests for the significance of coefficients and models, respectively, are reliable.
Model Accuracy and Predictive Power: Although the normality of residuals primarily affects inference, ensuring this assumption can help in diagnosing potential issues with the model that might affect its accuracy and predictive power.
Checking the Normality of Residuals
To assess whether the residuals from a regression model are normally distributed, the following methods are commonly used:
Graphical Methods:
Q-Q (Quantile-Quantile) Plots: A Q-Q plot compares the quantiles of the residuals to the quantiles of a normal distribution. If the residuals are normally distributed, the points should approximately lie on a straight line.
Histograms: A histogram of the residuals can provide a visual indication of normality. If the residuals are normally distributed, the histogram should resemble the bell curve of a normal distribution.
Statistical Tests:
Shapiro-Wilk Test: This test specifically assesses the normality of a dataset. A significant p-value (typically <0.05) suggests that the residuals do not follow a normal distribution.
Kolmogorov-Smirnov Test: This test compares the distribution of residuals to a normal distribution, with a significant p-value indicating a departure from normality.
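A minimal sketch of both tests, applied with scipy to the residuals of a model fitted on simulated (hypothetical) data, might look like this.

```python
# A minimal sketch: test the residuals of a fitted OLS model for normality
# (hypothetical data). The KS test is run on standardised residuals so the
# comparison is against a standard normal distribution.
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 200)
y = 1 + 0.5 * x + rng.normal(0, 1, 200)

model = sm.OLS(y, sm.add_constant(x)).fit()
resid = model.resid

sw_stat, sw_p = stats.shapiro(resid)
ks_stat, ks_p = stats.kstest((resid - resid.mean()) / resid.std(), "norm")

print(f"Shapiro-Wilk: W = {sw_stat:.3f}, p = {sw_p:.3f}")
print(f"Kolmogorov-Smirnov: D = {ks_stat:.3f}, p = {ks_p:.3f}")
# p > 0.05 in both tests is consistent with normally distributed residuals.
```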
Interpreting Q-Q Plots
Q-Q plots are graphical tools used to assess if a dataset follows a certain theoretical distribution, usually the normal distribution. When interpreting Q-Q plots, the main thing to look for is how well the points follow the straight line drawn on the plot. If the points lie on or very close to the line, it suggests that the data conforms well to the theoretical distribution. Deviations from this line indicate departures from the expected distribution:
If the points form a curve that is higher than the line at the ends, it suggests "heavy tails" - meaning there are more extreme values than expected.
If the curve is below the line at the ends, the data has light tails - fewer extreme values than expected.
A systematic deviation to the left or right suggests skewness in the data.
By assessing how closely the data points follow this reference line, Q-Q plots provide a visual means to evaluate the distribution assumptions underlying many statistical tests and models, guiding further analysis decisions.
-----------------------------------------------------------------------------------------------------------------------------
Addressing Non-normality of Residuals
When residuals do not appear to be normally distributed, several strategies can be employed:
Transformations: Applying transformations to the dependent variable (e.g., logarithmic, square root, or Box-Cox transformation) can help achieve normality in the residuals.
Adding Missing Variables: Non-normality can sometimes result from omitting relevant variables that capture the non-linear effects or interactions in the data.
Alternative Models: For some types of data, particularly count data or binary outcomes, linear regression may not be appropriate. Generalized linear models (GLMs) offer alternative link functions and error distributions that might better suit the data structure.
While the assumption of the normality of residuals is less critical for large sample sizes due to the Central Limit Theorem, which suggests that the sampling distribution of the estimate becomes approximately normal with sufficient sample size, it remains a vital diagnostic tool. Assessing and addressing the normality of residuals ensures the integrity of statistical inferences drawn from the model, enhancing the credibility and reliability of the findings.
9.1.6 Assumption 5: No multicollinearity
Multicollinearity occurs when two or more independent variables in a regression model are highly correlated, meaning that one variable can be linearly predicted from the others with a substantial degree of accuracy.
Importance of No Multicollinearity
The presence of multicollinearity can significantly impact the regression analysis in several ways:
Coefficient Estimation: Multicollinearity increases the standard errors of the coefficients, making them less reliable. High standard errors can lead to coefficients being statistically insignificant when they should be significant.
Model Interpretation: With multicollinearity, it becomes challenging to determine the individual effect of one independent variable on the dependent variable because the independent variables are entangled.
Predictive Power: While multicollinearity may not affect the model's ability to predict the dependent variable, it undermines the interpretability of the model coefficients, which is often crucial for understanding the dynamics between variables.
Checking for Multicollinearity
To detect multicollinearity in a regression model, the following methods are commonly employed:
Variance Inflation Factor (VIF): The VIF quantifies how much the variance of an estimated regression coefficient increases if your predictors are correlated. A VIF value of 1 indicates no multicollinearity. As a rule of thumb, a VIF greater than 5 (or 10, by some standards) suggests significant multicollinearity that requires attention.
Tolerance: The reciprocal of the VIF, tolerance gives a direct measure of how much of an independent variable's variance is not shared with the other variables. Very low tolerance values (close to 0) highlight variables that are potentially redundant because their variance is largely explained by other variables in the model.
Correlation Matrix: Before building the model, examining the correlation matrix of the independent variables can help identify pairs of variables that are highly correlated, signalling potential multicollinearity.
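The sketch below (Python with pandas and statsmodels, hypothetical predictors in which x2 is deliberately built to be nearly a copy of x1) computes all three diagnostics.

```python
# A hedged sketch: correlation matrix, VIF, and tolerance for hypothetical
# predictors. x2 is almost a copy of x1, so it should be flagged.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(5)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)         # highly collinear with x1
x3 = rng.normal(size=n)
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

print(X.corr().round(2))                        # correlation matrix of the predictors

Xc = sm.add_constant(X)                         # VIF is computed on the full design matrix
vif = pd.Series(
    [variance_inflation_factor(Xc.values, i) for i in range(1, Xc.shape[1])],
    index=X.columns, name="VIF",
)
print(vif.round(1))
print((1 / vif).rename("Tolerance").round(3))   # tolerance = 1 / VIF
```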
Addressing Multicollinearity
When multicollinearity is detected, several strategies can be considered to mitigate its impact:
Remove Highly Correlated Predictors: If two variables are highly correlated, consider removing one of them, especially if it's not crucial for your model.
Combine Variables: If the correlated variables represent similar concepts or measures, combining them into a single variable or using principal component analysis (PCA) to reduce dimensionality might be effective.
Increase Sample Size: Sometimes, increasing the sample size can help reduce the impact of multicollinearity, although this is not always feasible.
Ridge Regression: This technique introduces a small bias in the regression coefficients to significantly reduce their variance, making the model more interpretable despite multicollinearity.
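As a brief, hedged illustration of the last option, the sketch below compares plain OLS with ridge regression using scikit-learn on simulated collinear predictors; the penalty strength (alpha) is an arbitrary choice for demonstration only.

```python
# A hedged sketch: ridge regression on hypothetical collinear predictors.
# The penalty shrinks and stabilises the coefficients relative to plain OLS.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(8)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)        # nearly a copy of x1
X = np.column_stack([x1, x2])
y = 2 + 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)

print("OLS coefficients:  ", LinearRegression().fit(X, y).coef_.round(2))
print("Ridge coefficients:", Ridge(alpha=10.0).fit(X, y).coef_.round(2))
```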
The assumption of no multicollinearity is critical for ensuring the reliability and interpretability of the regression coefficients in a multiple regression model. By detecting and addressing multicollinearity, researchers can make more confident inferences about the relationships between independent variables and the dependent variable, enhancing the model's overall validity and usefulness.
9.1.7 Assumption 6: No Autocorrelation
We won't say much about this here. This is only an issue when you have time-series data. Autocorrelation, also known as serial correlation, refers to the situation where residuals (errors) from a regression model are not independent but instead exhibit a pattern or correlation over time or sequence. This assumption violation is particularly relevant in time series analysis.
Checking Autocorrelation
Durbin-Watson Statistic: A test that measures autocorrelation by comparing residuals separated by one or more time periods. Values near 2 indicate no autocorrelation; values deviating significantly from 2 suggest positive or negative autocorrelation.
ACF and PACF Plots: Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) plots visually assess the degree and lag of autocorrelation.
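A minimal sketch (Python, statsmodels, simulated data with AR(1) errors) computes the Durbin-Watson statistic and draws ACF and PACF plots of the residuals.

```python
# A minimal sketch: inspect residual autocorrelation with the Durbin-Watson
# statistic and ACF/PACF plots (hypothetical data with autocorrelated errors).
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

rng = np.random.default_rng(11)
n = 300
t = np.arange(n)
e = np.zeros(n)
for i in range(1, n):                   # AR(1) errors: each error carries over
    e[i] = 0.7 * e[i - 1] + rng.normal()
y = 5 + 0.1 * t + e

model = sm.OLS(y, sm.add_constant(t)).fit()
print(f"Durbin-Watson: {durbin_watson(model.resid):.2f}")   # well below 2 here

fig, axes = plt.subplots(1, 2, figsize=(10, 3))
plot_acf(model.resid, lags=20, ax=axes[0])
plot_pacf(model.resid, lags=20, ax=axes[1])
plt.tight_layout()
plt.show()
```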
Addressing Autocorrelation
Model Adjustments: Incorporating lagged terms of the dependent variable or residuals can account for the autocorrelation.
Alternative Models: Time series models like ARIMA are specifically designed to handle autocorrelation.
Briefly, detecting and correcting autocorrelation is crucial for ensuring the reliability of regression analysis, especially for time series data, where temporal patterns can significantly influence model accuracy and inference.
9.1.8 Check for outliers using Cook's Distance
Cook's Distance is a measure used in regression analysis to identify influential observations. It specifically relates to the potential impact of each observation on the estimated regression coefficients. An observation with a high Cook's Distance can indicate that it is an outlier or has high leverage, meaning it significantly influences the model's fit.
While Cook's Distance itself does not directly correspond to one of the classical assumptions of linear regression (like linearity, independence, homoscedasticity, normality of residuals, no multicollinearity, and no autocorrelation), it helps in assessing the assumption of independence by identifying data points that disproportionately influence the model's parameters. Essentially, it's more about diagnosing potential problems with model fit and robustness rather than testing a specific assumption.
Influential observations identified by Cook's Distance might indicate a violation of the **homoscedasticity** assumption if these points are also outliers in the Y dimension, or they might relate to the **linearity** assumption if removing them significantly changes the relationship between variables. Thus, while Cook's Distance is not used to check a specific assumption directly, it is a valuable diagnostic tool in the broader context of validating the assumptions of independence and identifying data points that could lead to violations of other linear regression assumptions.
-----------------------------------------------------------------------------------------------------------------------------
General Guidelines for Cook's Distance
Low Cook's Distance: A low value suggests that the observation has little to no influence on the regression model's coefficients. Typically, values of Cook's Distance below 0.5 are considered low, indicating that removing the observation would not significantly alter the model.
Moderate Cook's Distance: Values between 0.5 and 1 can be considered moderate. Observations falling in this range may warrant closer examination, as they could start to influence the model's predictions.
High Cook's Distance: Observations with Cook's Distance values greater than 1 are often considered highly influential. These points have a substantial impact on the model and could potentially skew the results. A common, more conservative threshold used to flag potentially influential points is 4/n, where n is the total number of observations in the dataset. This threshold adjusts for the size of the dataset, making it particularly useful for datasets of varying sizes.
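The following hedged sketch (Python, statsmodels, simulated data with one deliberately extreme observation) extracts Cook's Distance from a fitted model and flags observations above the 4/n threshold.

```python
# A minimal sketch: extract Cook's Distance from a fitted OLS model and flag
# observations above the common 4/n threshold (hypothetical data with one
# deliberately extreme, influential point).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(21)
x = rng.uniform(0, 10, 100)
y = 2 + 1.5 * x + rng.normal(0, 1, 100)
x[0], y[0] = 20, 0                          # an implausible, highly influential point

model = sm.OLS(y, sm.add_constant(x)).fit()
cooks_d = model.get_influence().cooks_distance[0]

threshold = 4 / len(x)
flagged = np.where(cooks_d > threshold)[0]
print(f"Threshold (4/n): {threshold:.3f}")
print("Influential observations (row indices):", flagged)
```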
Interpreting Cook's Distance
Interpreting Cook's Distance involves both quantitative assessment using these guidelines and qualitative judgment about the data and the context of the study. It's essential to:
Examine observations with high Cook's Distance to determine if they are outliers or leverage points that unduly influence the model.
Consider the substantive context: Is there a reason to expect certain data points to be influential? Are they errors, or do they represent important but rare events?
Decide on appropriate action: Depending on the analysis and the impact of the influential points, you might choose to investigate further, remove them, or model them separately.
In summary, while values of Cook's Distance greater than 1 are commonly flagged as potentially influential, the decision on what constitutes low or high should be informed by the specific context of your analysis, the distribution of Cook's Distance values across your data, and the size of your dataset.
9.2 Log and Quadratic Transformations
9.2.1 Log Transformation
Log transformation is a powerful mathematical technique widely used in statistical analysis to stabilize variance, normalize distributions, and make patterns in data more interpretable. It involves applying the logarithm function to a dataset, transforming each value x into log(x). This transformation can be particularly beneficial in various scenarios:
Addressing Skewness
Log transformation is especially useful for handling right-skewed data (where the tail on the right side of the distribution is longer or fatter than the left). By compressing the long tail and stretching out the left side of the distribution, log transformation can help normalize data, making it more symmetrical.
Stabilizing Variance
In regression analysis, heteroscedasticity (non-constant variance) can violate model assumptions, affecting the reliability of statistical tests. Log transformation can help stabilize the variance of residuals, ensuring that they're more uniform across the range of values, which is essential for meeting the homoscedasticity assumption.
Linearizing Relationships
Some relationships between variables may be multiplicative or exponential in nature, making it challenging to model them with linear methods. Applying a log transformation to one or both variables can linearize such relationships, allowing for a more straightforward analysis with linear models.
Interpretation Changes
After a log transformation, the interpretation of the coefficients in regression models changes. For example, in a model where the dependent variable is log-transformed, a one-unit change in an independent variable is associated with an approximate percentage change in the dependent variable (roughly 100 times the coefficient, in percent, when the coefficient is small), rather than a change in its original units.
Implementation
Log transformation can be applied to a single variable, several variables, or even the entire dataset, depending on the analysis needs. Common bases for the logarithm include:
Natural logarithm (ln): Often used in economic and biological models.
Base-10 logarithm (log10): Frequently used in engineering and scientific contexts.
Practical Considerations
Zero Values: Since log 0 is undefined, datasets with zero values require a slight adjustment before transformation, such as adding a small constant to all values.
Negative Values: Log transformation is not directly applicable to negative values. Alternative transformations or data adjustments may be necessary; a simple option is to add a constant to the variable first so that all values are positive.
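A minimal sketch of these ideas (Python with numpy and statsmodels, simulated right-skewed data) is shown below; np.log1p, the log of one plus the value, is one convenient way of handling zeros, and the final lines show how the slope translates into an approximate percentage change.

```python
# A hedged sketch: log-transform a right-skewed, hypothetical outcome before
# fitting. np.log1p (log of 1 + x) avoids taking the log of zero.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 300)
income = np.exp(2 + 0.1 * x + rng.normal(0, 0.4, 300)) - 1   # skewed, with values near 0

log_income = np.log1p(income)                 # log(1 + income)

model = sm.OLS(log_income, sm.add_constant(x)).fit()
b = model.params[1]
print(f"Slope on the log scale: {b:.3f}")
# With a log-transformed outcome, a one-unit increase in x multiplies the
# outcome by exp(b), i.e. roughly a 100*(exp(b) - 1) percent change.
print(f"Approximate percentage change per unit of x: {100 * (np.exp(b) - 1):.1f}%")
```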
Log transformation is a versatile tool in data preprocessing and analysis, enhancing the appropriateness of statistical models and the clarity of data patterns. It's a valuable technique for analysts and researchers dealing with non-normal distributions, heteroscedasticity, or non-linear relationships.
9.2.2 Quadratic Transformation
At this point, it is worth pointing out that the term linear refers to the model being linear in its parameters, not to a straight line. This means you can model non-linear relationships, although these will be more difficult to interpret.
Quadratic Transformation:
Using a quadratic transformation in a dataset for linear regression is a strategy to address situations where the relationship between the independent variable x and the dependent variable y isn't a straight line but rather has a curve. Here’s how it’s typically done, without diving into the math:
First, you'd look at your data, often by plotting x against y, to see if the relationship between them looks curved or has a peak or trough, suggesting a quadratic relationship might be a better fit than a straight line.
You transform the independent variable by squaring it, essentially creating a new variable by multiplying the variable by itself. This doesn't mean you alter your original data; you just add extra information (the squared term) to use in the analysis. In your linear regression model, you include both the original independent variable x and its squared version x*x. This allows the model to account for the curve by bending the line to fit the data points better.
Finally, you can analyze the Results. Examine the results to see if the model with the quadratic term fits the data better than a simple linear model. You're looking for a more accurate representation of how x predicts y.
Interpreting the output:
The presence of the quadratic term in the model tells you that the relationship between x and y changes at different values of x - it might increase initially and then decrease, or vice versa, creating a U-shaped or an inverted U-shaped curve. In essence, a quadratic transformation allows a linear regression model to capture more complex, non-linear relationships between variables, improving the model's accuracy and explanatory power without departing from the linear regression framework.
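A hedged sketch of the whole procedure (Python with statsmodels formulas, simulated data with an inverted-U relationship) compares the straight-line and quadratic fits.

```python
# A hedged sketch: add a squared term to capture a curved relationship and
# compare the fit with the straight-line model (hypothetical data).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(9)
x = rng.uniform(0, 10, 200)
y = 4 + 3 * x - 0.4 * x**2 + rng.normal(0, 1, 200)       # inverted-U relationship
df = pd.DataFrame({"x": x, "y": y})

linear = smf.ols("y ~ x", data=df).fit()
quadratic = smf.ols("y ~ x + I(x**2)", data=df).fit()    # I() squares x inside the formula

print(f"R-squared, linear model:    {linear.rsquared:.3f}")
print(f"R-squared, quadratic model: {quadratic.rsquared:.3f}")
print(quadratic.params.round(3))    # the coefficients on x and x**2 describe the curve
```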
Below you see how different transformations change the line.
9.3 Additional Learning Materials
Easy: Davis, C. (2019) Statistical Testing with Jamovi and JASP Open Source Software. Vor Books. Read: Chapters 6 & 14.
Moderate: Frost, J. (2019) Regression Analysis: An Intuitive Guide for Using and Interpreting Linear Models.
Advanced: Navarro, D. & Foxcroft, D. (2022) Learning Statistics with Jamovi. Read: Chapter 12.