Tutorial 8: Regression Modelling I
In this tutorial, we will introduce you to the basic concepts of Linear Regression. Linear regression is a statistical method that models the relationship between a dependent variable and one or more independent variables by fitting a linear equation to observed data. The main goal is to predict the value of the dependent variable based on the values of the independent variables.
To help you understand how Linear Regression works, we will cover the following in this tutorial:
How the line is fitted
How R-squared is calculated
How the p-value is calculated
How to interpret the model and the outputs of a linear regression.
Linear regressions allow you to explore correlations in your data, predict values, and test hypotheses. Regressions are predictive tools and are considered part of the Machine Learning family. These models allow you to test your hypotheses and explore how different factors (independent variables) affect your outcome of interest.
Dependent Variable: Your dependent variable must be continuous. Under certain circumstances, discussed in the previous tutorial, you can use ordinal variables.
Independent Variables: You can use a combination of continuous, ordinal and categorical variables
For the examples in this tutorial we will be using the following dataset:
Skoczylis, Joshua, 2021, "Extremism, Life Experiences and the Internet", https://doi.org/10.7910/DVN/ICTI8T, Harvard Dataverse, Version 3.
8.1 Linear Regression: The Basics
8.1.1 Plotting your variables and calculating the slope
Let's start by plotting our data. This is a nice way of visually demonstrating what Linear Regressions are and what they do.
The Cartesian plane is a two-dimensional surface defined by two perpendicular axes: the horizontal axis (x-axis) and the vertical axis (y-axis). Points on the plane are identified by their coordinates, (x, y), representing their positions along the x-axis and y-axis, respectively. This plane is used to visualise linear regressions.
To understand the relationship between Political Leaning (dependent variable) and Age (independent variable) we plot the data on the Cartesian Plane. The independent variable is plotted along the x-axis and the dependent variable is plotted along the y-axis.
---------------------------------------------------------------------------------------------------------------------
Plotting the data:
In our example, we want to know how changes in age affect political leaning.
To plot it we do the following:
Age: The independent variable is plotted along the x-axis. Person A is 45 years old, so we move 45 units along the x-axis.
Political Leaning: The dependent variable is plotted along the y-axis. Person A has a Political Leaning score of 1.5, so we move Person A 1.5 units up the y-axis.
Now we can do the same for Person B (Age 25, Political Leaning -0.5). Rather than moving them up the y-axis, we move them down along the axis, as they have a negative Political Leaning score.
If we collect data from various people, noting their ages and corresponding political leanings, each person becomes a point on this plane. As we plot more points, we might see a pattern or trend emerging. In this dataset, it looks like the older people get, the more conservative they become: the trend appears to be downward.
This visual representation helps us quickly grasp if there's a noticeable relationship between age and political leaning. This forms the basis for more detailed statistical analysis, like linear regression, to quantify this relationship.
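If you would like to see what this plotting step looks like in code, here is a minimal sketch in Python. The data values and the column names (Age, Political_Leaning) are invented for illustration; they are not the actual variable names in the dataset.

```python
# A minimal sketch of plotting Age against Political Leaning with matplotlib.
# All values below are made up purely to illustrate the idea.
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "Age": [45, 25, 60, 33, 52],
    "Political_Leaning": [1.5, -0.5, -1.0, 0.8, -0.2],
})

plt.scatter(df["Age"], df["Political_Leaning"])   # each person is one point on the plane
plt.xlabel("Age (independent variable, x-axis)")
plt.ylabel("Political Leaning (dependent variable, y-axis)")
plt.title("Each person is a point on the Cartesian plane")
plt.show()
```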
The y-intercept:
The y-intercept of a line is the point where the line crosses the y-axis of a Cartesian coordinate system. It is typically represented as the value of y when x = 0. In the equation of a line in slope-intercept form, y = mx + b, the y-intercept is represented by b. m is the slope of the line.
The y-intercept is useful for several reasons, including graphing as it provides a starting point for drawing the line on a graph. Knowing the y-intercept and the slope allows you to plot the line accurately.
In many real-world scenarios, such as economics or science, the y-intercept can have a meaningful interpretation. For example, in a linear model predicting savings over time, the y-intercept could represent the initial amount of savings before any additional contributions.
Finally, the y-intercept can be used to compare different linear models or to understand the baseline level of the dependent variable when all independent variables are zero.
Overall, the y-intercept is a fundamental component in linear equations and graphing, providing insights into the behaviour of linear relationships in both mathematical and real-world contexts.
Understanding the Slope:
The slope of the line indicates how strongly your data are correlated. This section will show you how to calculate the slope—it's quite straightforward, actually. Understanding the calculation process helps you grasp the concept behind it. If you're familiar with the formula, you can easily input numbers to predict values for your dependent or independent variables. While computers perform these calculations, deeper comprehension is always beneficial.
Consider the example we've discussed so far, where we aim to understand the correlation between Political Leaning and age. Essentially, we're examining the ratio between the change in x (age) and the change in y (Political Leaning). This allows us to address a simple question: For each increase in age, how much does your political leaning increase or decrease?
The formula:
y = mx + b + Error
m is your slope and b is the intercept. This formula allows you to work out any predicted values. To calculate the slope we use this formula:
m = (y2 - y1) / (x2 - x1)
Remember, each data point consists of two values, x and y; e.g. data point A might be 34 (x) / 1.4 (y). You can easily plot this on the graph, and you would do the same with a second data point. Two data points are sufficient to calculate the slope, and they can be taken from anywhere on the line.
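If you want to check the arithmetic yourself, here is a small Python sketch of the slope calculation using two made-up data points (the numbers are purely illustrative):

```python
# Slope from two data points: m = (y2 - y1) / (x2 - x1)
x1, y1 = 34, 1.4      # data point A: age 34, Political Leaning 1.4 (illustrative)
x2, y2 = 54, 0.9      # data point B: age 54, Political Leaning 0.9 (illustrative)

m = (y2 - y1) / (x2 - x1)    # change in y for a one-unit change in x
b = y1 - m * x1              # rearranging y = mx + b gives the intercept

predicted = m * 45 + b       # predicted Political Leaning for a 45-year-old
print(m, b, predicted)
```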
The slope represents the rate of change. It's a number that tells us, on average, how much one thing (like Political Leaning) increases or decreases when another thing (like age) goes up by one unit. The slope enables us to predict outcomes and understand the strength and direction of the relationship between two variables. A steep slope (a large number) signifies that a small change in the independent variable results in a significant change in the dependent variable.
In essence, the slope is a fundamental concept that aids in making predictions based on past data, understanding how two things are related, and making informed decisions based on those relationships. Whether in social science, business, or everyday life, knowing the slope can offer insights into how altering one factor might influence another.
8.1.2 Line of Best Fit
The line of best fit, also known as the trend line or regression line, is a straight line that best represents the data on a scatter plot. This line may pass through some of the points, all of them, or none; however, its overall goal is to get as close as possible to all the points collectively. It's used in simple and multiple linear regression to estimate the relationship between two or more variables.
Imagine you're plotting the amount of time spent studying on the x-axis and the scores on a test on the y-axis for a group of students.
In the example above we plotted Age on the x-axis and Political Leaning on the y-axis. The scatterplot indicates that as people become older, they turn more rightward. The line of best fit is drawn through these points to show the general direction of the relationship (upward, downward, or flat) and to express its strength mathematically. Here's how to understand it easily:
Direction: If the line slopes upwards from left to right, it indicates a positive relationship (as studying time increases, scores tend to increase). If it slopes downwards, it suggests a negative relationship (as studying time increases, scores tend to decrease). If the slope is flat, it indicates no relationship.
Position: The closer the points are to the line, the stronger the relationship between the variables is. If the points are scattered far from the line, the relationship is weaker, and the line is less accurate in predicting scores based on study time. This can be determined using the R-Square, something we will discuss below.
Prediction: Once we have the line, we can use it to predict outcomes. For instance, if a student studies for a certain number of hours, we can follow up from that point on the x-axis to where it meets the line of best fit, and then across to the y-axis to predict their score.
The line of best fit is determined by a mathematical method that minimizes the distance between the line and all the points on the graph, usually through a process called 'least squares.' This method calculates the best-fitting line that minimizes the sum of the squares of the vertical distances (residuals) of the points from the line. The numbers are squared to avoid the points below cancelling out the points above the line. It's a powerful statistical tool for making predictions and understanding the strength and direction of relationships between variables.
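To make the least-squares idea concrete, here is a minimal Python sketch that fits a line of best fit to some invented data (NumPy's polyfit minimises the sum of squared vertical distances for us):

```python
# Fitting a line of best fit by least squares; the data are invented for illustration.
import numpy as np

age = np.array([25, 32, 41, 45, 52, 60, 67])
leaning = np.array([0.9, 0.6, 0.3, 0.1, -0.2, -0.4, -0.7])

m, b = np.polyfit(age, leaning, deg=1)   # slope and intercept of the best-fitting line
fitted = m * age + b
residuals = leaning - fitted             # vertical distances of each point from the line
print("slope:", m, "intercept:", b, "sum of squared residuals:", np.sum(residuals ** 2))
```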
8.1.3 Ordinal and Categorical Data
Ordinal and categorical variables can be effectively incorporated into regression analyses to explore relationships and impacts on a dependent variable. Here’s how they are used and interpreted:
Using Ordinal and Categorical Variables:
These are variables that represent types or categories (e.g., gender, ethnicity) or ordered data (e.g. Likert scales). In regression, they are often converted into dummy variables (also known as indicator variables) for inclusion. For a categorical variable with n categories, you create n-1 dummy variables, with one category left out as a reference group. Don't worry: many programs will create the dummy variables for you; all you have to do is drag and drop the variable into place.
A dummy variable, also known as an indicator variable, is a numerical variable used in regression analysis to represent subgroups of the sample in a study. It takes a value of 0 or 1 to indicate the absence or presence of some categorical effect that may be expected to shift the outcome. Dummy variables are essential for including categorical data, such as gender, ethnicity, or any other classification, in a regression model, allowing the model to account for the impact of these categorical factors on the dependent variable.
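As a concrete illustration, here is a short Python sketch of dummy coding with pandas. The variable name and categories are placeholders, not the actual coding used in the dataset:

```python
# Dummy (indicator) coding: n categories become n-1 columns of 0s and 1s,
# with the dropped category acting as the reference group.
import pandas as pd

df = pd.DataFrame({"Gender": ["Male", "Female", "Female", "Male"]})
dummies = pd.get_dummies(df["Gender"], prefix="Gender", drop_first=True)
print(dummies)   # one column (Gender_Male); Female is the reference category
```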
Interpreting the Coefficients:
Categorical Variables: The coefficient of a dummy variable in a regression model represents the difference in the dependent variable between the category represented by that dummy and the reference category, holding all other variables constant. For example, if gender is coded as 0 for males and 1 for females, and the coefficient for females is positive and significant, it means females have a higher value on the dependent variable compared to males, after controlling for other factors.
Ordinal Variables: The interpretation of coefficients for ordinal variables depends on the coding scheme used. If treated like categorical variables, the interpretation is similar to that of dummy variables. If numerical coding is used, the coefficient indicates the expected change in the dependent variable for a one-unit increase in the ordinal variable, assuming the relationship is linear across the categories.
Using ordinal and categorical variables in regression models allows for a nuanced understanding of how different groups or ordered categories affect a dependent variable. The key is appropriately coding these variables to reflect their nature and carefully interpreting the coefficients to understand the impact of each category or level, considering the reference category or the assumed order. This approach enables researchers to uncover valuable insights into the dynamics influencing their variable of interest.
If you are interested in a comprehensive overview of how linear regression is applied to categorical data, watch this video.
8.1.4 R-squared and Adjusted R-squared
A quick re-cap - Correlation Coefficient (R):
The correlation coefficient, often represented by "R", measures the strength and direction of a linear relationship between two variables on a scatterplot. The value of R is always between -1 and 1.
R = 1: There's a perfect positive linear relationship between variables.
R = -1: There's a perfect negative linear relationship between variables.
R = 0: No linear correlation exists between the variables.
Understanding and interpreting R correctly is crucial for accurately analyzing data and drawing reliable conclusions.
---------------------------------------------------------------------------------------------------------------------
R-squared
Note: If you use more than one independent variable, use just the adjusted R-squared number (more on this further down).
R-squared is a statistical measure that represents the proportion of the variance for a dependent variable that's explained by one or more independent variables in a regression model. In simpler terms, it gives us an idea of how well the independent variables can predict the dependent variable. It's like looking at how good a guess is but in a very methodical, mathematical way.
Understanding and interpreting R-squared:
In our example above, the R-squared value tells us how well Age predicts Political Leaning. In our model, the R-squared is 0.04. Let's look at what it tells us. R-squared, like many things in statistics, is measured on a scale of 0 to 1.
R-squared value of 1: This means 100% of the variation in the dependent variable (e.g. Political Leaning) can be explained by the independent variable (e.g. Age). This is a perfect match, which is quite rare.
R-squared value of 0: This means none of the variation in the dependent variable can be explained by the independent variable. The two are not related at all.
R-squared value of 0.5: This means 50% of the variation in the dependent variable can be explained by the independent variable.
The higher this percentage, the better your model predicts the dependent variable. In the Social Sciences, we look for values of 0.5 and higher.
Why We Use R-squared:
R-squared provides a glance at how well our regression model performs. It also allows us to compare our model to new models (e.g. when you add or delete variables from your model). It helps compare the explanatory power of regression models that have different numbers of predictors. R-squared can help you make decisions about which variables are important predictors and how changes in those predictors might affect the outcome.
It's important to note, however, that a high R-squared does not imply causation, and models should be carefully evaluated for their relevance and accuracy in predicting outcomes.
Calculating R-squared (Made Easy):
R-squared is essentially R (or the correlation coefficient) squared. R-squared, however, is easier to interpret in terms of how well your model explains the relationship within your data. R-squared does not tell you about the strength of the relationship though.
But of course, we can calculate this as follows (don't worry the computer does this for you).
R-squared = (Var(mean) - Var(line)) / Var(mean), or to be more technical:
R-squared = 1 - (SS residuals / SS total)
SS residuals are the sum of squares of the residuals. It represents the differences between the observed values and the values predicted by the model.
SS total is the total sum of squares. It represents the differences between the observed values and the mean of the observed values.
Think of it as comparing the accuracy of your model against a very basic model that just guesses the average every time. If your model is no better than guessing the average (leading to high SS residuals), your R-square will be close to 0. If your model is great at predicting the actual outcomes (leading to low SS residuals), your R-square will be closer to 1.
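Here is that calculation written out as a short Python sketch, using invented observed and predicted values purely for illustration:

```python
# R-squared = 1 - SS_residuals / SS_total
import numpy as np

observed = np.array([1.5, -0.5, 0.2, -1.0, 0.8])
predicted = np.array([1.1, -0.2, 0.1, -0.8, 0.5])     # values predicted by the regression line

ss_residuals = np.sum((observed - predicted) ** 2)    # scatter around the fitted line
ss_total = np.sum((observed - observed.mean()) ** 2)  # scatter around the mean

r_squared = 1 - ss_residuals / ss_total
print(round(r_squared, 3))
```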
---------------------------------------------------------------------------------------------------------------------
Adjusted R-squared
Note: If you use more than one independent variable, use just the adjusted R-squared number.
Adjusted R-squared is a version of R-squared (which measures how well your model fits the data) that adjusts for the number of variables in your model. While R-squared can increase just by adding more variables, whether they are useful or not, adjusted R-squared accounts for this by penalizing the addition of unnecessary variables.
When to Use Adjusted R-square:
It helps you figure out which model has the best mix of simplicity and accuracy, especially when those models have a different number of variables. It also provides a truer measure of how well your model performs, preventing you from being misled by the addition of irrelevant variables. Finally, it helps you avoid overfitting by discouraging adding variables that don't improve your model's ability to predict new data. This ensures your model is complex enough to capture the true patterns in the data but not so complex that it starts capturing random noise.
In short, use adjusted R-square when you want a more reliable indicator of your model's quality, particularly when your model includes several variables or when comparing different models.
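For reference, here is a sketch of the usual adjusted R-squared formula, 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of independent variables. The example numbers are invented:

```python
# Adjusted R-squared penalises models for carrying extra predictors.
def adjusted_r_squared(r_squared: float, n: int, p: int) -> float:
    return 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

print(adjusted_r_squared(r_squared=0.13, n=500, p=5))
```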
8.1.5 (Multi) linear regression
The difference between a multilinear and a simple linear regression is that we use more than one independent variable in our model. Multilinear regression, also known as multiple linear regression, is a statistical method that extends the concept of simple linear regression to accommodate two or more explanatory variables (independent variables). The goal of multilinear regression is to model the relationship between a dependent variable and several independent (or predictor) variables. This allows us to understand how the dependent variable changes when independent variables are varied while holding the other independent variables constant.
To visualise this, we essentially add another dimension. So we move from a two-dimensional to a three-dimensional version of the Cartesian plane by adding another axis. Each new variable adds another dimension. Mathematically, adding multiple dimensions is possible, although we can only visualise three - and even that can look hard to interpret.
The dreaded function:
The basic equation for a multilinear regression model is:
Y = β0 + β1X1 + β2X2 + ... + βnXn + ε
Here, Y represents the dependent variable and X1, X2, etc. represent the independent variables. β0 is the intercept, while β1, β2, etc. are the coefficients for each independent variable; they represent the weight of each independent variable. The fun-looking ε (epsilon) represents the error term, which accounts for any variation in Y not explained by the independent variables.
Essentially, this is the same function as a simple linear regression; we just add the same kind of term for each new variable.
Multilinear regression is widely used in fields such as economics, social sciences, and engineering to analyze the effect of multiple factors on a particular outcome and to make predictions. For example, in the study of counterterrorism policy, researchers might use multilinear regression to analyze how various factors such as political stability, economic conditions, and law enforcement practices collectively influence the level of terrorist activity in a country. This method helps in identifying significant predictors and in understanding complex relationships between variables, enabling more informed decision-making and policy development.
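If you want to see the mechanics rather than rely on the software, here is a minimal Python sketch that fits a multilinear regression by least squares. The predictor values are invented; in practice they would be columns such as Age and Social Media Use from your data:

```python
# Multiple linear regression by least squares with NumPy.
import numpy as np

age = np.array([25, 32, 41, 45, 52, 60], dtype=float)
social_media = np.array([5.0, 4.2, 3.1, 2.8, 1.5, 1.0])
leaning = np.array([0.9, 0.6, 0.3, 0.1, -0.2, -0.4])

X = np.column_stack([np.ones_like(age), age, social_media])  # intercept column + predictors
coeffs, *_ = np.linalg.lstsq(X, leaning, rcond=None)

intercept, beta_age, beta_social = coeffs   # one weight (beta) per independent variable
print(intercept, beta_age, beta_social)
```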
8.1.6 Interaction Terms
So far we have assumed that all the independent variables are independent of each other. In reality, however, independent variables can and do influence each other. Interactions in a regression model occur when the effect of one independent variable on the dependent variable depends on the level of another independent variable. These interactions allow us to explore the combined effects of two or more predictors on the outcome variable, which can reveal more complex relationships in the data that are not apparent when considering each predictor separately.
How to Include Interactions:
To include an interaction in a regression model, you create a new variable that represents the product of the two interacting variables. So the interaction of Gender and Share Nothing with Society becomes Gender * Share Nothing with Society.
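In code this is just a multiplication. Here is a tiny sketch with placeholder column names and made-up (already dummy-coded) values:

```python
# An interaction term is the product of the two interacting variables.
import pandas as pd

df = pd.DataFrame({
    "Gender": [0, 1, 1, 0],                      # dummy-coded, e.g. 0 = male, 1 = female
    "Share_Nothing_With_Society": [1, 3, 4, 2],  # illustrative ordinal scores
})
df["Gender_x_Share_Nothing"] = df["Gender"] * df["Share_Nothing_With_Society"]
print(df)
```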
Interpreting Interactions:
If the coefficient for the interaction term is statistically significant, this suggests that the effect of one independent variable on the dependent variable depends on the level of the other independent variable. For instance, the impact of studying time on exam scores could vary based on the level of prior knowledge. Note, you must still also include the original variables in your regression - even if they are not significant. The sign and size of the interaction term's coefficient indicate the direction and magnitude of the interaction. A positive sign means that as one variable increases, it enhances the effect of the other variable on the dependent variable. Conversely, a negative sign suggests that the increase in one variable diminishes the effect of the other variable on the outcome.
Because interpreting interaction effects directly from coefficients can be challenging, visualizing the interaction through plots can provide clearer insights. Plots can show how the relationship between one predictor and the outcome changes at different levels of another predictor.
Why Interactions Matter:
Including interaction terms in regression models is crucial when the relationship between predictors and the outcome is not simply additive. Interactions help us understand the real-world complexities where factors often influence outcomes in tandem rather than in isolation. This deeper insight allows for more accurate predictions and can reveal nuanced dynamics that inform theory development, policy-making, and strategic decisions in various fields.
8.2 General Linear Models: T-test & ANOVAs
The terms general linear models (GLM) and linear regression are often used in statistical analysis. While they are related, they refer to concepts encompassing different scopes and capabilities within the realm of statistical modelling.
Linear Regression
Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables by fitting a linear equation to observed data. The simplest form is simple linear regression, which models the relationship between two variables (one independent variable and one dependent variable). Multiple linear regression extends this concept to include two or more independent variables.
The key idea behind linear regression is to find the linear equation that best predicts the dependent variable from the independent variable(s).
General Linear Models (GLM)
General linear models are a broader class of models that include not only multiple linear regression but also other models like ANOVA and ANCOVA (Analysis of Covariance). GLM is a flexible generalization of ordinary linear regression that allows for the dependent variable to have a linear relationship with the independent variables, but it can model scenarios where the dependent variable is continuous, binary, count data, etc., by using different link functions and error distributions.
Key Differences
Linear regression is specifically focused on modelling the relationship between variables through a linear equation, assuming the dependent variable is continuous and normally distributed errors. GLM, on the other hand, includes linear regression as a special case but extends to accommodate various types of dependent variables and relationships by using different link functions and error distributions.
GLM offers greater flexibility than linear regression because it can handle a wider range of data types and relationships. For example, logistic regression (a type of GLM) is used for binary dependent variables, while Poisson regression (another GLM) is used for count data.
Linear regression is used when the relationship between the independent and dependent variables is assumed to be linear, and the dependent variable is continuous. GLM is used in more complex scenarios involving different types of dependent variables and relationships.
In summary, while linear regression is a fundamental statistical technique for modelling linear relationships, general linear models offer a more comprehensive framework that includes linear regression as a subset and extends its capabilities to accommodate a wider range of statistical modelling scenarios.
Below we will cover T-tests and ANOVAs.
8.2.1 Tests of Difference: Assumptions and T-tests (two levels)
Tests of Difference
Tests of difference are statistical tools that help us understand if there are meaningful differences in the average values (means) between two or more groups. Once you've chosen the variables you're interested in, statistical software can help you determine if the differences observed are statistically significant, meaning they're likely not due to chance, and also estimate the size of these differences.
There are specific types of tests used for comparing means. For comparing just two groups (e.g., comparing scores between males and females or comparing outcomes for people with a degree vs. those without), T-tests are appropriate. Their non-parametric counterparts, which don't assume your data follows a normal distribution, are also available for similar comparisons.
When you are looking at differences across three or more groups, ANOVAs (Analysis of Variance) are the go-to method. More on these tests in the next section.
---------------------------------------------------------------------------------------------------------------------------
Assumptions for tests of difference:
Before delving deeper into these tests, it's crucial to understand the underlying assumptions. Both T-Tests and ANOVAs come with specific assumptions that must be met. If these assumptions are not satisfied, you should consider using non-parametric tests or robust tests instead.
Fortunately, most statistical software provides tools to easily check these assumptions. The main assumptions to consider are:
Normality
Homogeneity of variance
It's important to note that the assumptions for T-Tests and ANOVAs are identical.
However, if your sample size is relatively small (fewer than 50 participants), you should consider using a non-parametric test. Non-parametric tests and robust tests do not require the above assumptions to be met.
Robust tests are designed to handle outliers within your data. So, if your dataset includes extreme outliers, opting for a robust test may be more appropriate than a non-parametric test (although you will also lose some information).
Non-parametric tests are not as powerful in terms of statistical inference compared to parametric tests. Therefore, where feasible, a parametric test should be your first choice.
Now, let's examine each of these assumptions more closely.
---------------------------------------------------------------------------------------------------------------------------
Test for Normality
Recall the concept of the Normal distribution? Tests of difference operate under the assumption that your data is approximately normally distributed. When checking for normality, the software essentially compares your dataset with a normal distribution and indicates whether there's a significant difference between your data's distribution and the normal distribution. In essence, it tests the following hypotheses:
H0: There is no significant difference between the normal distribution and the sample data.
Ha: There is a significant difference between the normal distribution and the sample data.
As is common in statistical analysis, the software will generate a p-value to help you decide. If this p-value is less than 0.05, we reject the null hypothesis and conclude that your data is not approximately normally distributed. In such cases, opting for a non-parametric test would be the appropriate next step.
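If you are working in code rather than a point-and-click package, a normality check can be run with SciPy's Shapiro-Wilk test. The sample below is randomly generated purely for illustration:

```python
# Shapiro-Wilk test of normality: H0 = the data come from a normal distribution.
import numpy as np
from scipy import stats

sample = np.random.default_rng(42).normal(loc=0, scale=1, size=100)
statistic, p_value = stats.shapiro(sample)

if p_value < 0.05:
    print("Reject H0: the data deviate significantly from a normal distribution.")
else:
    print("Fail to reject H0: no significant deviation from normality detected.")
```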
----------------------------------------------------------------------------------------------------------------------------
Homogeneity of Variance:
Here, the assumption is that your two groups have data that is similarly distributed with a uniform structure, and that the standard deviations of your samples are roughly the same. Essentially, you should avoid situations where one group displays a positive skew while the other shows a negative skew.
Homogeneity of variance is a fundamental assumption for many statistical tests, including T-Tests and ANOVAs. This assumption holds that the variances within each group being compared should be approximately equal. In simpler terms, it means that the spread or dispersion of scores in each of your groups should be similar. Imagine you're comparing the extremism score across two different groups (different conditions). Homogeneity of variance assumes that the variation within each group is roughly the same. If one group shows a wide range of extremism scores while the other group's scores are very consistent, this assumption is violated. Ensuring homogeneity of variance is crucial because it allows for a fair comparison between groups, ensuring that any observed differences are due to the variable being tested rather than unequal variances skewing the results.
Once again, statistical software tests the following hypotheses:
H0: There is no significant difference in the distribution and structure of the two sample distributions.
Ha: There is a significant difference in the distribution and structure of the two sample distributions.
If the p-value returned is less than 0.05, we reject the null hypothesis and conclude that there is indeed a significant difference in the distribution and structure of the two sample distributions.
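A common way to run this check in code is Levene's test from SciPy. The two groups below are invented scores, used only to illustrate the call:

```python
# Levene's test of homogeneity of variance: H0 = the group variances are equal.
from scipy import stats

group_a = [2.1, 2.5, 1.9, 2.8, 2.2, 2.6]
group_b = [1.0, 4.2, 0.5, 5.1, 2.9, 3.8]

statistic, p_value = stats.levene(group_a, group_b)
print(p_value)   # p < 0.05: reject H0 and conclude the variances differ
```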
------------------------------------------------------------------------------------------------------------------------------
Test for assumptions
T-Test
As outlined above, a T-test allows you to compare the difference between the means of two groups. You can easily visualise the outcomes of a T-test using density plots or box plots.
Assumptions of Normality and Homogeneity met: Use a Students T-Test (note some argue that we should use a Welch's Test as standard).
Assumption of Normality met, Assumption of Homogeneity not met: Use the Welch's Test
Assumptions of Normality and Homogeneity not met: Use the Mann-Whitney U Test or a robust t-test.
---------------------------------------------------------------------------------------------------------------------
T-Test
In statistics, a T-Test is a type of inferential statistic used to determine if there is a significant difference between the means of two groups. It is commonly used when the test statistic follows a normal distribution and the population standard deviation is unknown.
Key Points of T-Test:
Types of t-test: There are mainly three types: the one-sample t-test, which compares the mean of a single group against a known mean; the independent two-sample t-test, which compares the means of two independent groups; and the paired sample t-test, which compares means from the same group at different times (or under different conditions).
Assumptions: The t-test assumes that the data are normally distributed, the data are independent, and the variances of the two groups are equal. If these assumptions are not met, the results of the t-test might not be valid.
When Assumptions Are Not Met:
Non-normal Distribution: If the data are not normally distributed, especially with small sample sizes, non-parametric tests such as the Mann-Whitney U test (for two independent samples) or the Wilcoxon signed-rank test (for paired samples) can be used as alternatives.
Unequal Variances: If the variances are unequal, you can use a variation of the t-test known as Welch's t-test, which does not assume equal population variances.
Link to General Linear Regressions:
The T-test is closely related to general linear regression, where the significance of individual predictors is assessed using T-tests. In linear regression, the coefficients of the regression equation are tested to see if they are significantly different from zero. This test is essentially a t-test where the null hypothesis is that the coefficient is equal to zero (indicating that the variable does not have a significant effect on the outcome variable).
In summary, the t-test is a powerful tool for comparing means, but care must be taken to ensure the assumptions are met. When they are not, alternative methods are available. Similarly, in linear regression, the significance of predictors is assessed using principles related to the t-test, with attention to assumptions critical for valid results.
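As a rough illustration of the decision rules above, here is a Python sketch using SciPy with two invented groups; it is not the analysis of the tutorial dataset, just the function calls:

```python
# Student's t-test, Welch's t-test and the Mann-Whitney U test side by side.
from scipy import stats

group_a = [2.1, 2.5, 1.9, 2.8, 2.2, 2.6, 3.0]
group_b = [1.4, 1.8, 1.2, 2.0, 1.6, 1.9, 2.1]

student = stats.ttest_ind(group_a, group_b)                  # assumes equal variances
welch = stats.ttest_ind(group_a, group_b, equal_var=False)   # does not assume equal variances
mann_whitney = stats.mannwhitneyu(group_a, group_b)          # non-parametric alternative

print(student.pvalue, welch.pvalue, mann_whitney.pvalue)
```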
In the example below, you can see that the assumptions are not met, so a Mann-Whitney U test is used instead.
__________________________________________________________________________________________________________
Finally, let's just mention paired T-tests and ANOVAs.
Paired T-Test
A paired T-test is a statistical method used when you are comparing the means of two related groups. These groups are paired because they are somehow linked or matched; for example, measurements taken from the same subjects before and after an intervention, or matched individuals in two different conditions. The key here is that the same individuals are involved in both conditions, allowing us to control for individual differences. For example, if we want to compare exam scores before support was provided and after it was provided. The paired T-test helps determine if the mean difference between these paired observations is significantly different from zero, indicating a significant effect of the intervention or condition.
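For completeness, here is a small sketch of a paired T-test (and its non-parametric counterpart, the Wilcoxon signed-rank test) in SciPy. The before/after scores are invented:

```python
# Paired comparisons: the same people measured under two conditions.
from scipy import stats

before = [55, 61, 58, 64, 52, 60]
after = [60, 66, 59, 70, 58, 63]   # same students after support was provided

paired = stats.ttest_rel(before, after)
wilcoxon = stats.wilcoxon(before, after)   # use this if the normality assumption fails
print(paired.pvalue, wilcoxon.pvalue)
```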
If the assumptions of the paired T-test are not met, the non-parametric alternative (the Wilcoxon signed-rank test) should be used.
8.2.2 Tests of Difference: ANOVAs (three or more levels)
One-Way ANOVAs
When you are looking at differences across three or more groups, ANOVAs (Analysis of Variance) are the go-to method. They can be thought of as extensions of T-tests for multiple groups. Non-parametric alternatives to ANOVAs are also available for data that doesn't fit the assumptions required for a traditional ANOVA.
Factorial ANOVAs, or two-way ANOVAs, let you examine the effect of two or more categorical independent variables on a single continuous outcome. They can reveal not only the main effects of each independent variable but also how those variables interact with each other. For instance, you might explore how both gender and education level together influence attitudes toward a topic. However, consider using linear regression models, as they are informative for these types of investigations.
Selecting which ANOVA test to use here is just as simple:
Assumptions of Normality and Homogeneity met: Use the One-way ANOVA.
Assumption of Normality met, Assumption of Homogeneity not met: Use the Welch's Test
Assumptions of Normality and Homogeneity not met: Use the Kruskal-Wallis test or a robust ANOVA.
---------------------------------------------------------------------------------------------------------------------
Key Points of ANOVAs:
Types of ANOVA: There are several types, but the most common are one-way ANOVA, which tests differences between groups based on one independent variable, and two-way ANOVA, which examines the effect of two independent variables on a dependent variable, including any interaction between them.
Assumptions: Like the t-test, ANOVA assumes that the data distributions are normally distributed within groups, variances across groups are equal (homoscedasticity), and observations are independent. Additionally, the dependent variable should be continuous.
When Assumptions Are Not Met:
Non-normal Distribution: For data that are not normally distributed, a non-parametric alternative to ANOVA can be used, such as the Kruskal-Wallis test, which is the non-parametric version of one-way ANOVA, or the Friedman test for repeated measures.
Unequal Variances (Heteroscedasticity): If the assumption of equal variances is violated, consider using the Welch ANOVA, a variation that adjusts for unequal variances.
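Here is a brief sketch of both options in SciPy, using three invented groups (a Welch ANOVA is not shown, as it typically requires an additional package):

```python
# One-way ANOVA (assumptions met) and Kruskal-Wallis (assumptions not met).
from scipy import stats

group_a = [2.1, 2.5, 1.9, 2.8, 2.2]
group_b = [1.4, 1.8, 1.2, 2.0, 1.6]
group_c = [0.9, 1.1, 0.7, 1.3, 1.0]

anova = stats.f_oneway(group_a, group_b, group_c)
kruskal = stats.kruskal(group_a, group_b, group_c)
print(anova.pvalue, kruskal.pvalue)
```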
Link to General Linear Regressions:
ANOVA can be viewed as a special case of linear regression. ANOVA and linear regression are both part of the general linear model family. The primary difference lies in the way the variables are coded. ANOVA typically deals with categorical independent variables and a continuous dependent variable, while linear regression deals with continuous predictors. However, when categorical variables are encoded as dummy variables, the distinction between ANOVA and linear regression blurs, and they essentially become the same analysis.
ANOVA is a powerful tool for exploring the differences among group means, but its validity hinges on the assumptions of the analysis. When assumptions are violated, alternatives or modifications are available to ensure the robustness of your findings. Understanding the link between ANOVA and linear regression enhances the flexibility and depth of your statistical analysis, allowing for a more nuanced interpretation of the data.
Again, in the example below, you can see that while the test for homogeneity of variance is met, the normality test is not.
As the assumptions are not met, the non-parametric alternative should be used. The output of the Kruskal-Wallis test is below.
Post-Hoc Tests
While ANOVA can tell us that there's a significant difference, it does not specify between which groups this difference exists. This is where post-hoc tests come into play.
Post-hoc tests are follow-up analyses conducted after an ANOVA to pinpoint exactly where the significant differences lie among group means. Use post-hoc tests when your initial ANOVA indicates a significant effect (p-value less than 0.05), suggesting that not all group means are equal, but you need to understand which specific groups differ from each other.
These tests control for the increased risk of Type I errors (falsely finding significance) that occur when conducting multiple comparisons. Common post-hoc tests include the Tukey HSD (Honestly Significant Difference), Bonferroni correction, and Scheffé test, each with its advantages and suited for different scenarios. The choice of post-hoc test often depends on the balance between Type I error control and statistical power, as well as the study's specific requirements and data characteristics. In most instances, the Tukey test is the preferred option, unless you want something more conservative.
8.2.3 Tests of Difference: Effect Size
Using Effect size:
P-values help determine if the relationship in your data is statistically significant. However, a significant p-value doesn't inform you about the strength of the relationship; it merely indicates that the relationship is unlikely to have occurred by chance.
This becomes particularly relevant with larger datasets, where significant p-values are common, but the actual impact (or effect) of the relationship might be minimal. This is where Effect Size is crucial. Effect Size quantifies the magnitude of the relationship, providing insight into its practical significance.
In a previous tutorial, you learned about Pearson's R, Spearman's rho, and Cramer's V. While these are correlation coefficients, they essentially serve as measures of effect size, indicating the strength of the relationship.
It's important to consider that a very small effect size may lead you to dismiss a relationship, even when the p-value suggests significance. This is because, despite the statistical significance, the practical effect of the relationship might be too minor to be of any real-world importance.
For interpreting Effect Size in the context of T-Tests and ANOVAs, here are some guidelines to assist you:
Cohen's d for T-Tests: This measures the difference between two means relative to the standard deviation of the data. A d of 0.2 is considered small, 0.5 medium, and 0.8 large.
Eta squared for ANOVAs: This represents the variance explained by your independent variable. Values of 0.01 indicate a small effect, 0.06 a medium effect, and 0.14 a large effect.
These guides help assess the practical significance of your findings, complementing the p-value's indication of statistical significance.
Interpreting Cohen's d and Pearson's r Effect Sizes
Cohen's d effect size is used for T-Tests. You should already be familiar with Pearson's r (yes, it is a correlation coefficient, but it essentially tells you the size of the effect).
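If you ever need to calculate it by hand, Cohen's d is simply the difference between the two group means divided by a pooled standard deviation. A small sketch with invented scores:

```python
# Cohen's d = (mean difference) / (pooled standard deviation).
import numpy as np

group_a = np.array([2.1, 2.5, 1.9, 2.8, 2.2, 2.6])
group_b = np.array([1.4, 1.8, 1.2, 2.0, 1.6, 1.9])

n_a, n_b = len(group_a), len(group_b)
pooled_sd = np.sqrt(((n_a - 1) * group_a.var(ddof=1) + (n_b - 1) * group_b.var(ddof=1))
                    / (n_a + n_b - 2))
cohens_d = (group_a.mean() - group_b.mean()) / pooled_sd
print(round(cohens_d, 2))   # roughly: 0.2 small, 0.5 medium, 0.8 large
```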
You can also visualise this as follows:
Interpreting (Partial) Eta Squared
Use Eta Squared for the Full ANOVA output and Partial Eta Squared for the Post-hoc test output.
Eta Squared provides a measure of the effect size associated with one or more independent variables in an ANOVA context. It is calculated as the ratio of the variance explained by an effect to the total variance. Here’s a rough guideline for interpretation, although it's important to note these thresholds can vary by field:
Small effect: 0.01
Medium effect: 0.06
Large effect: 0.14
If Eta Squared = 0.05, for example, it means that 5% of the total variation in the dependent variable can be explained by the independent variable, which could be considered a small to medium effect.
Partial Eta Squared
Use this effect size in the context of multiple factors or covariates; it measures the proportion of variance attributed to a factor while controlling for the other factors. This gives a more precise understanding of each factor's unique contribution. The same rough thresholds for small, medium, and large effects apply, but Partial Eta Squared values are typically higher than Eta Squared for the same effect because they account for less total variance.
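To show how the two measures differ, here is a short sketch using made-up sums of squares standing in for an ANOVA output:

```python
# Eta Squared vs Partial Eta Squared from ANOVA sums of squares (invented numbers).
ss_effect = 12.0    # variance explained by the factor of interest
ss_error = 180.0    # residual (unexplained) variance
ss_total = 240.0    # total variance, including any other factors in the model

eta_squared = ss_effect / ss_total                         # share of the total variance
partial_eta_squared = ss_effect / (ss_effect + ss_error)   # ignores other factors' variance
print(eta_squared, partial_eta_squared)
```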
8.3 Understanding the output of a regression
8.3.1 Linear Regression: Interpreting your Output
To demonstrate the above, we will build three versions of our regression model. We will then compare the different versions to see whether they improve our model.
Model 1:
Political Leaning (dependent variable) v Age (independent variable)
Model 2:
Political Leaning (dependent variable) v Age (continuous), Social Media Use (continuous), Gender (categorical), Share Nothing With Society (ordinal), Highest Qualification (categorical)
Model 3:
In this model, we have included an interaction term Social Media Use * Gender
-----------------------------------------------------------------------------------------------------------------------------
Table 1: Model Fit
This table provides the R-squared for each of our three models. The first model only explains 1% of the variation (R-squared of 0.01). Not a great model.
The second model is better, with an R-squared of 0.13, explaining 13% of the variation - better, but still not great.
The third model's R-squared remains at 0.13.
Interpreting R-squared (recapped):
0 to 0.33: a very weak model
0.34 to 0.50: a weak to moderate model
0.51 to 0.67: a moderate model
0.68 to 0.75: a moderate to strong model
0.76 and above: a very strong model
Table 2: Model Comparison
This table compares the three models. We can see that there is a significant difference between Models 1 and 2, but there is no difference between Models 2 and 3. This suggests that adding the interaction term to our model does not significantly improve it.
Model Coefficients Table
The table below provides you with more information about your model. This includes your regression coefficient (the Estimate), the SE, your p-value and, in this case, the confidence intervals of the coefficients.
The Estimate (regression coefficient):
You interpret the values as follows:
positive values mean that as the independent variable increases so does the dependent variable
negative values mean that as the independent variable increases the dependent variable decreases.
The Estimate Coefficient represents the mean change in the response given a one-unit change in the independent variable. This means that if your Estimate is +0.5 your dependent variable will increase by 0.5 for each unit increase of the independent variable.
Example: We can see that Age is negatively correlated with Political Leaning with an estimate of -0.01. The correlation is very weak.
Standard Error of the Regression:
Standard Error of the Regression measures the spread of the observed values around the regression line. In other words, it gives us an estimate of the typical distance that the observed values fall from the regression line. A smaller SE indicates that the observations are closer to the regression line, suggesting a better fit of the model to the data. It does not directly measure how "wrong" our model is but rather how much scatter there is in our data around the fitted values.
In summary, the SE is an important measure that indicates the average distance of the data points from the fitted regression line, reflecting the scatter of the data. A lower SE suggests a tighter fit. However, interpreting its value requires caution, especially in relation to the distribution of residuals and the overall predictive performance of the model.
Example: The SE is very low at 0.08 which suggests that the regression line is a good fit.
The p-value:
Here we are essentially testing the following hypotheses:
H0: Age has no impact on Political Leaning
Ha: Age has an impact on Political Leaning
Looking at all of the information, we can say that as people get older they turn politically to the right. However, the coefficient (Estimate) tells us that the relationship itself is very weak.
A marginal Means Table provides more information on how the mean changes with Age. A marginal mean is the average value of a variable within a dataset, accounting for the structure of a more complex model, such as one involving multiple variables or groups. In the context of analyses like ANOVA or regression models with categorical predictors, it refers to the mean of an outcome variable across levels of another variable, averaging over the levels of any other variables in the model. Marginal means provide insight into the overall effect of one variable at a time, simplifying interpretation in multifactorial designs by abstracting from the specific combinations of other factors.
For the mean age (49.89 years) the mean Political Leaning Score is 0.07. For those 66.29 years old (or one SD above the mean) the mean Political Leaning Score drops to -0.02.
Model 2 & 3:
We already know that adding additional variables has improved our model, increasing the R-squared to 0.13. Looking at the table below, we can observe that Age continues to be significant (p-value of <.001), although the relationship is not very strong (Estimate: 0.01). On the other hand, Social Media Use is not a significant factor in this model. Exploring 'Share Nothing With Society', an ordinal variable, we note, for instance, that the mean difference in Political Leaning between those who 'Strongly Agree' (the reference level) and those who 'Strongly Disagree' is 0.53 (p-value of <.001).
These results are confirmed by the Estimated Marginal Means and the plot below: those who Strongly Agree have a marginal mean of -0.30 (so more right-of-centre views).
Many programs allow us to generate multiple plots and tables of marginal means. The graph below, for instance, illustrates the relationship between Gender, Age, and Social Media Use. It indicates that males tend to have more right-leaning views than females, and those who use social media more frequently (e.g., 1 SD above the mean) tend to lean more towards the left. However, it's important to note that the model's coefficient table shows that the use of Social Media is not significant (p-value of 0.794). Furthermore, the interaction term (Gender*Social Media Use) was also not significant (p-value of 0.193). This, combined with the model comparison, suggests that in this specific model, we might consider excluding Social Media use.
8.4 Additional Learning Materials
Easy: Davis, C (2019) Statistical Testing with Jamovi and JASP Open Source Software. Vor Books. Read: Chapters 6 & 14
Moderate: Frost, J (2019) Regression Analysis: an intuitive guide for using and interpreting linear models.
Advanced: Navarro, D & Foxcroft, D (2022) Learning Statistics with Jamovi. Read: Chapter 12