Tutorial 10: Regression Modelling III
In this session, you will learn the following:
Odds ratios
Binomial/Multinomial Logistic Regression
Ordinal Regression
These techniques allow you to explore relationships in your data. Regressions are predictive tools built on statistical models. These models allow you to test your hypotheses and explore how different factors (independent variables) affect your outcome (dependent variable).
10.1 Odds Ratios
10.1.1 Odds and Odds Ratio
The concepts of probabilities and odds are fundamental in statistics, and while they are related, they describe different aspects of likelihood:
Probabilities, a recap:
Probabilities quantify the likelihood of an event happening, measured on a scale from 0 to 1, where 0 indicates impossibility and 1 indicates certainty. The probability of an event is calculated as the number of favourable outcomes divided by the total number of possible outcomes. For example, the probability of rolling a 3 on a standard six-sided die is 1/6, as there is one favourable outcome (rolling a 3) and six possible outcomes in total.
Odds
Odds, on the other hand, compare the likelihood of an event happening to the likelihood of it not happening. They are expressed as a ratio of two numbers: the number of favourable outcomes to the number of unfavourable outcomes. Using the dice example, the odds of rolling a 3 are 1 to 5, because there is one favourable outcome (rolling a 3) and five unfavourable outcomes (rolling a 1, 2, 4, 5, or 6).
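To make the link between the two concrete, here is a minimal Python sketch (illustrative only, since this tutorial otherwise works in Jamovi) converting the die example's probability into odds and back:

```python
# Convert between probability and odds for the die example.

def probability_to_odds(p):
    """Odds = favourable / unfavourable = p / (1 - p)."""
    return p / (1 - p)

def odds_to_probability(odds):
    """Probability = odds / (1 + odds)."""
    return odds / (1 + odds)

p_roll_3 = 1 / 6                       # one favourable outcome out of six
print(probability_to_odds(p_roll_3))   # 0.2, i.e. odds of 1 to 5
print(odds_to_probability(0.2))        # back to ~0.167
```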
There are some key differences between probabilities and odds:
Representation: Probabilities are represented as a fraction of the total number of outcomes, while odds are represented as a ratio of favourable outcomes to unfavourable outcomes.
Range: Probabilities range between 0 and 1, whereas odds range from 0 to infinity. A probability of 0.5 corresponds to even odds (1:1), but as probabilities approach 1, the odds increase without bound.
Interpretation: Probabilities provide direct information about the likelihood of an event within the universe of all possible outcomes. Odds provide a comparison of the event happening versus not happening.
Odds Ratio
An odds ratio (OR) is a statistic that quantifies the strength of the association between two events, often used in studies to compare the occurrence of outcomes across different groups. Let's apply this concept to a political science example, examining whether attending political protests increases the likelihood of voting in an election.
First, calculate the odds of voting for people who attend political protests versus those who do not. Suppose in a study of 100 protest attendees, 80 voted in the last election, giving odds of 80:20, or 4 (80 voted and 20 did not, so the odds are 80 divided by 20). In a comparable group of 100 non-attendees, 50 voted, giving odds of 50:50, or 1. Now we can compare the two groups using the odds ratio (OR), which is the ratio of these two odds. For our example, the OR is 4 / 1 = 4.
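The same arithmetic can be written out as a short Python sketch using the counts from the example above:

```python
# Odds ratio for the voting example, from a 2x2 table of counts:
#                 voted   did not vote
# attendees         80         20
# non-attendees     50         50

odds_attendees = 80 / 20       # 4.0
odds_non_attendees = 50 / 50   # 1.0

odds_ratio = odds_attendees / odds_non_attendees
print(odds_ratio)              # 4.0
```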
Interpretation:
OR = 1: No difference in odds between the two groups. Protest attendance doesn't affect voting behaviour.
OR > 1: Higher odds in the first group. In our case, an OR of 4 suggests that individuals who attend political protests have four times the odds of voting compared to those who don't attend, indicating a strong association between protest attendance and voting.
OR < 1: Lower odds in the first group. If it were less than 1, it would suggest that attending a protest is associated with lower odds of voting, which isn't the case here.
In summary, the odds ratio in this context helps us understand the association between attending political protests and the likelihood of voting in an election. An OR greater than 1, as in our example, signifies a positive link between protest attendance and higher voting rates, offering insights into how political engagement activities can influence electoral participation.
For a much more in-depth explanation of odds and log(odds), watch this StatQuest video; the follow-up video on odds ratios is also worth watching. Both come highly recommended.
10.1.2 Interpreting Odds Ratios and log(odds)
Interpreting an odds ratio (OR) involves understanding the measure of association it provides between an exposure and an outcome in a study. An odds ratio quantifies how the odds of the outcome change with the exposure compared to without it. Here's how to interpret an odds ratio in various contexts:
Odds Ratio Greater Than 1
An OR greater than 1 indicates that the exposure is associated with higher odds of the outcome occurring. For example, an OR of 2 means that the exposed group has twice the odds of experiencing the outcome compared to the unexposed group. This can be interpreted as a 100% increase in the odds of the outcome occurring with the exposure.
Odds Ratio Less Than 1
An OR less than 1 suggests that the exposure is associated with lower odds of the outcome. For example, an OR of 0.5 implies that the exposed group has half the odds of experiencing the outcome compared to the unexposed group, or a 50% decrease in the odds.
Odds Ratio Equal to 1
An OR of 1 indicates no association between the exposure and the outcome. The odds of the outcome are the same in both the exposed and unexposed groups. This means that with an OR of 1 the exposure does not affect the odds of the outcome occurring.
The significance of an OR depends on the context of the study, including the magnitude of the effect and its relevance to the research question or clinical practice. Always consider the confidence intervals (CIs) around an OR to assess the precision of the estimate: narrow CIs indicate a more precise estimate of the OR, while wide CIs suggest less certainty. If the CI around an OR crosses 1, the association is not statistically significant.

Note that an OR does not directly translate to the risk or probability of the outcome. It measures the odds, which can be less intuitive than probabilities, especially when the outcome is common.

In summary, interpreting an odds ratio involves understanding the direction and magnitude of the association it describes between an exposure and an outcome. An OR above 1 indicates increased odds, an OR below 1 indicates decreased odds, and an OR of 1 indicates no change in odds due to the exposure.
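To illustrate how a confidence interval around an OR is obtained, here is a short Python sketch using the standard large-sample (Woolf) formula for the 2×2 table from the protest example; this is one common approach, and statistical software may use slightly different methods:

```python
import math

# 2x2 table: a, b = voted / did not vote among protest attendees;
#            c, d = voted / did not vote among non-attendees.
a, b, c, d = 80, 20, 50, 50

or_est = (a / b) / (c / d)                    # odds ratio: 4.0
se_log_or = math.sqrt(1/a + 1/b + 1/c + 1/d)  # standard error of log(OR)

lower = math.exp(math.log(or_est) - 1.96 * se_log_or)
upper = math.exp(math.log(or_est) + 1.96 * se_log_or)
print(f"OR = {or_est:.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")
# OR = 4.00, 95% CI [2.14, 7.49]; the interval excludes 1,
# so the association is statistically significant.
```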
_____________________________________________________________________________________
Interpreting log odds
Interpreting the log of odds, often encountered in logistic regression analysis, involves understanding how changes in predictor variables affect the odds of a certain outcome on a logarithmic scale. The log of odds transforms the odds ratio, making the relationship between variables linear and easier to model statistically. Here’s a simplified explanation:
Positive Log-Odds: A positive coefficient (log-odds) indicates that as the predictor variable increases, the odds of the outcome occurring increase. The relationship is exponential due to the logarithmic transformation, meaning small increases in the predictor lead to multiplicative increases in the odds of the outcome.
Negative Log-Odds: A negative coefficient means that as the predictor variable increases, the odds of the outcome occurring decrease. In this case, higher values of the predictor are associated with lower odds of the outcome.
Magnitude: The magnitude of the coefficient (how large or small it is) tells you how strong the association is between the predictor and the outcome. A larger absolute value indicates a stronger relationship.
Zero: If the log-odds coefficient is zero, it suggests that the predictor has no effect on the odds of the outcome occurring.
To convert log odds back to regular odds for easier interpretation, you would exponentiate the coefficient. This gives you the odds ratio, which can be more intuitively understood as the factor by which the odds multiply for a one-unit increase in the predictor variable.
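In practice this conversion is a single call to the exponential function; the coefficient below is hypothetical:

```python
import math

log_odds_coef = 0.69         # hypothetical coefficient from a logistic regression
odds_ratio = math.exp(log_odds_coef)
print(round(odds_ratio, 2))  # ~2.0: each one-unit increase in the predictor
                             # roughly doubles the odds of the outcome
```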
In essence, interpreting the log of odds allows you to understand the direction and strength of the relationship between predictor variables and the outcome in a logistic regression model, with the transformation facilitating the modelling of non-linear relationships in a linear framework.
10.2 Logistic Regression
So far we have only focused on using continuous variables in regression modelling. Logistic regression allows us to use nominal and ordinal dependent variables. It is another extension of the linear regression covered in the previous tutorials. There are three types of logistic regression:
Binomial Logistic Regression: here the dependent variable must be binary (e.g. Male/Female or Yes/No)
Multinomial Logistic Regression: here the dependent variable can be ordinal or nominal, but must have more than two levels.
Ordinal Logistic Regression: here your dependent variable should be ordinal.
It is worth watching the videos below, which provide an in-depth introduction to the concept of logistic regression. For a shorter and less extensive introduction, watch this clip here.
It is also worth watching 'Logistic Regression: Understanding its Coefficients' and 'Logistic Regression R-Squared', which will help you learn how to interpret the outcomes of a logistic regression.
10.2.1 Binomial Logistic Regression
Logistic regression is a way to explore the relationship between a series of factors (like age, income, and education) and a particular outcome that has two possibilities (like voting for candidate A or not voting for candidate A). The tricky part is that this outcome is like an on-off switch (0 or 1), which doesn't fit well with regular straight-line (linear) predictions.
To handle this, logistic regression works with the probability of the outcome, which sits on a scale from 0 to 1, and transforms it into log odds. This transformation turns our simple on-off outcome into something that can be mapped with a curve, allowing us to use a linear approach. In essence, it turns the prediction problem into figuring out the odds of the switch being on (1) or off (0), based on the factors we're looking at. By using logs, logistic regression can make predictions and show how different factors increase or decrease the likelihood of the outcome happening.
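If you wanted to try this outside Jamovi, a binomial logistic regression can be fitted in a few lines of Python with the statsmodels library. The sketch below uses simulated data, and the variable names (voted, age, income) are hypothetical:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data: did a person vote (1/0), given age and income (in 1,000s)?
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "age": rng.integers(18, 80, 200),
    "income": rng.normal(30, 8, 200),
})
# Make the log odds of voting rise with age (for illustration only).
log_odds = -3 + 0.05 * df["age"]
df["voted"] = (rng.random(200) < 1 / (1 + np.exp(-log_odds))).astype(int)

model = smf.logit("voted ~ age + income", data=df).fit()
print(model.summary())       # the coefficients are log odds
print(np.exp(model.params))  # exponentiate them to get odds ratios
```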
The output of a Logistic Regression looks very similar to that of a Linear Regression. There are, however, some significant differences.
In logistic regression, the coefficients represent the log odds, which is a way of quantifying the relationship between each predictor variable and the probability of the outcome occurring. Log odds can be a bit tricky to interpret directly, but they essentially tell us how the odds of the outcome change with a one-unit increase in the predictor variable. When you exponentiate a coefficient, you get the odds ratio, which is easier to understand: it shows the factor by which the odds of the outcome increase (if the odds ratio is greater than 1) or decrease (if it is less than 1) for a one-unit increase in the predictor. Most statistical programs will report the OR alongside the coefficients, as it is easier to understand and interpret.
Because logistic regressions don't have an equivalent to the R-squared statistic from linear regression (which measures how well the model explains the variability of the data), we often use a Pseudo R-squared measure instead. Pseudo R-squared values provide a way to gauge the model's explanatory power, but they don't have a direct interpretation like the R-squared in linear regression. There are several types of Pseudo R-squared, with Nagelkerke’s R-squared being one popular option. These measures give us a rough idea of how well our model fits the data, but they should be interpreted with caution and understood as not directly comparable to the R-squared from linear regression.
Interpreting Pseudo R-Squared
There are several different pseudo R-squared measures; below are two commonly used ones. Pseudo R-squared values typically range from 0 to just under 1. However, they are generally much lower than what you would expect from a linear regression R-squared. A value of 0 indicates that the model has no explanatory power, and values closer to 1 indicate a better fit.
McFadden's R-Squared
Interpretation: A rough guideline for McFadden's R-Squared is that values around 0.2 to 0.4 represent a good fit. However, these are very rough benchmarks, and interpretation should consider the specific context of the model and the data.
Nagelkerke’s R-Squared
Interpretation: Similar to McFadden's R-Squared, higher values indicate a model that better fits the data. However, because Nagelkerke’s R-Squared is scaled to the full 0 to 1 range, its values might appear more intuitive, resembling the R-squared values from linear regression. Still, values will generally be lower than those typically seen in linear regression models.
Key Points for Both
Context is Critical: The interpretation of both McFadden's and Nagelkerke’s R-Squared should always consider the context of the study and the complexity of the model. There are no hard-and-fast rules for what constitutes a "good" value.
Comparative Use: These measures are most useful for comparing the fit of different models for the same data set rather than as absolute indicators of model quality. A model with a higher pseudo R-squared value is generally considered to have a better fit compared to a model with a lower value.
Not Directly Comparable to Linear R-Squared: Despite the intuitive appeal of Nagelkerke’s R-Squared, it's important to remember that these measures do not have the same interpretation as the R-squared from linear regression. They do not represent the proportion of variance explained by the model in the same way.
In summary, McFadden's and Nagelkerke’s R-Squared values provide useful, albeit rough, measures of model fit in logistic regression, with Nagelkerke’s adjustment offering a scale that is perhaps easier to interpret for those familiar with linear regression. However, their interpretation should be contextual and cautious, particularly when comparing across different types of models.
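Both measures can be computed directly from a model's log-likelihoods. A minimal sketch, using hypothetical log-likelihood values (with statsmodels you would take them from a fitted model's llf, llnull and nobs attributes):

```python
import math

def pseudo_r_squared(ll_full, ll_null, n):
    """McFadden's and Nagelkerke's pseudo R-squared from log-likelihoods."""
    mcfadden = 1 - ll_full / ll_null
    # Nagelkerke rescales the Cox & Snell measure to a full 0-1 range.
    cox_snell = 1 - math.exp((2 / n) * (ll_null - ll_full))
    nagelkerke = cox_snell / (1 - math.exp((2 / n) * ll_null))
    return mcfadden, nagelkerke

# Hypothetical values for a model fitted to 300 observations.
mcf, nag = pseudo_r_squared(ll_full=-120.5, ll_null=-150.0, n=300)
print(f"McFadden: {mcf:.2f}, Nagelkerke: {nag:.2f}")  # McFadden: 0.20, Nagelkerke: 0.28
```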
Above you see the typical output of a binomial logistic regression. You can of course also create additional visualisations, such as marginal means plots and tables.
The pseudo R-squared, for example, indicates that this might be a reasonable fit for the data (McFadden = 0.21).
Interpreting the Output
The Intercept
The intercept (−0.83) represents the log odds of being in the "Extreme" category when all predictors are at their reference levels (zero or baseline category). The odds ratio of 0.43 suggests lower odds of being "Extreme" at the baseline levels of predictors.
Predictor Coefficients
Each predictor's estimate represents the change in the log odds of being "Extreme" for a one-unit increase in that predictor, holding all other predictors constant. A quick check of the coefficient-to-OR conversion follows this list.
Impact_Globalisation: A coefficient of −0.25 indicates that as the impact of globalisation increases by one unit, the log odds of being "Extreme" decrease, with an odds ratio of 0.78 suggesting a decrease in odds.
Age: For each one-year increase in age, the log odds of being "Extreme" decrease by 0.03, with the odds ratio of 0.97 indicating a slight decrease in odds.
Gender (Female as the reference group): Being male (compared to female) increases the log odds of being "Extreme" by 1.07, translating to an odds ratio of 2.91, suggesting that males have higher odds of being "Extreme" than females.
Political_LeaningRight_Left: A coefficient of −0.35 means that moving from right to left in political leaning decreases the log odds of being "Extreme", with an odds ratio of 0.71 indicating reduced odds.
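You can verify the link between the estimates and the odds ratios yourself: exponentiating each coefficient reproduces the reported OR. A quick check in Python, using the (rounded) coefficients quoted above:

```python
import math

coefficients = {
    "Intercept": -0.83,
    "Impact_Globalisation": -0.25,
    "Age": -0.03,
    "Gender (Male - Female)": 1.07,
    "Political_LeaningRight_Left": -0.35,
}
for name, coef in coefficients.items():
    print(f"{name}: exp({coef}) = {math.exp(coef):.2f}")
# Small differences from the table (e.g. 0.44 vs 0.43) arise because
# the coefficients themselves are rounded in the output.
```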
Significance (p-values)
p-values below 0.05 (e.g., Age, Gender: Male – Female, Political_LeaningRight_Left, and several categories under "Share_nothing_Society") indicate statistically significant effects.
Larger p-values suggest the evidence is weaker for a predictor's effect on the outcome (e.g. the various strain variables).
Odds Ratios
Values greater than 1 (e.g., Gender: Male – Female) indicate increased odds of the outcome ("Extreme") with each unit increase in the predictor or moving to the specified category.
Values less than 1 (e.g., Impact_Globalisation, Age, Political_LeaningRight_Left) indicate decreased odds of the outcome with each unit increase in the predictor.
Strains and Social Factors
Different "Strain" factors and "SocialMedia_Use" show varying influences, with most not significantly affecting the odds of being "Extreme", as indicated by their p-values and odds ratios close to 1.
Share_nothing_Society: Strong opinions (either disagreeing or agreeing strongly with sharing nothing in society) significantly decrease the odds of being "Extreme", highlighting how certain social attitudes are associated with the outcome.
Has Degree Yes/No
Having a degree (Yes – No) with a coefficient of −0.39 and an odds ratio of 0.68 suggests that having a degree slightly decreases the odds of being "Extreme", though this effect is not statistically significant (p = 0.105).
This table offers a comprehensive view of how various factors influence the likelihood of being categorized as "Extreme", combining individual, social, and political predictors to provide insights into the factors associated with extreme responses.
Below we can see a visual representation of Age, Gender and Political Leaning.
Binomial Logistic Regression:
Your dependent variable must be binomial (two levels, e.g. Male/Female). Your independent variables can be continuous, ordinal and nominal. Note, if you use nominal/ordinal variables, these are turned into dummy variables within your regression.
10.2.2 Multinomial Logistic Regression
Multinomial logistic regression is a statistical method used when you want to predict outcomes that fall into three or more categories, which are not ordered. For example, if you're studying voters' preferences for different political parties (e.g., Party A, Party B, Party C), multinomial logistic regression can help you understand how different factors (like age, income, or education) influence someone's likelihood of preferring one party over the others.
Unlike binary logistic regression, which deals with outcomes that have two possible states (yes/no, win/lose), multinomial logistic regression handles multiple categories by comparing the log odds of being in one category to the log odds of being in a baseline category, for each predictor variable. This allows you to model and predict more complex relationships where the outcome isn't just a simple yes or no but includes several distinct options.
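Outside Jamovi, the same kind of model can be fitted in Python with statsmodels' MNLogit. The sketch below uses simulated data, and the party-preference example and variable names are hypothetical:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Simulated data: party preference (A/B/C) predicted by age and income.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "age": rng.integers(18, 80, 300),
    "income": rng.normal(30, 8, 300),   # in 1,000s
    "party": rng.choice(["A", "B", "C"], 300),
})

y = pd.Categorical(df["party"]).codes        # party A (code 0) is the baseline
X = sm.add_constant(df[["age", "income"]])

model = sm.MNLogit(y, X).fit()
print(model.summary())       # one block of coefficients per non-baseline party
print(np.exp(model.params))  # odds ratios relative to the baseline party
```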
Multinomial logistic regressions are generally interpreted in the same way as a binomial logistic regression. The big difference is that the output provides a breakdown for each level of the dependent variable compared with the reference level.
Here you can see the output for each level in the table compared to the reference level, which is Strongly Agree.
Again, you can visualise the output using marginal means plots and tables.
Your dependent variable must have three or more levels (e.g. Green/Red/Blue). These levels can be nominal or ordinal. Your independent variables can be continuous, ordinal and nominal. Note, if you use nominal/ordinal variables, these are turned into dummy variables within your regression.
10.2.3 Ordinal Logistic Regression
In ordinal logistic regression, which is used for predicting an ordinal dependent variable (a variable with categories that have a natural order, but no consistent interval between them), model thresholds (or cut points) play a crucial role. These thresholds separate the categories of the dependent variable based on the predicted probabilities.
Understanding Model Thresholds
Model thresholds are values derived from the model that define the boundaries between the ordinal categories. For an ordinal outcome with k categories, there will be k−1 thresholds. These thresholds are estimated during the model fitting process and are used to determine the predicted category for each observation based on its predicted probabilities.
Interpreting Model Thresholds
Location of Thresholds: Each threshold marks the point along the continuum of the linear predictor (the combination of predictors weighted by their coefficients) at which the probability of being in a given category or below shifts to being more likely than being in a higher category. The first threshold marks the boundary between the first and second categories, the second threshold between the second and third categories, and so on.
Predicted Category Assignment: To assign a predicted category to an observation, the model uses these thresholds along with the observation's values on the predictor variables. If an observation's linear predictor score falls below the first threshold, it's predicted to be in the first category. If it falls between the first and second thresholds, it's predicted to be in the second category, and so forth.
Significance of Thresholds: The exact values of the thresholds tell you about the distribution of the predicted probabilities across the categories. Larger gaps between thresholds can indicate areas where the categories are more distinctly separated by the predictor variables.
Model thresholds in ordinal logistic regression offer a mechanistic insight into how the model discriminates between the ordered categories based on the predictor variables. Understanding and interpreting these thresholds help in grasping the underlying dynamics of the model predictions and the influence of predictor variables on the ordinal outcome.
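In Python, this kind of model can be fitted with statsmodels' OrderedModel, whose output lists the k−1 threshold parameters after the predictor coefficients. A sketch with simulated data (the agreement variable and its relationship to age are made up for illustration):

```python
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

# Simulated data: an ordered agreement score (0-4) loosely driven by age.
rng = np.random.default_rng(7)
df = pd.DataFrame({"age": rng.integers(18, 80, 400)})
latent = 0.03 * df["age"] + rng.logistic(size=400)
df["agreement"] = pd.cut(latent, bins=5, labels=False)  # codes 0..4, ordered

model = OrderedModel(df["agreement"], df[["age"]], distr="logit")
result = model.fit(method="bfgs")
print(result.summary())  # predictor coefficients first, then the k-1 threshold parameters
```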
This output from an ordinal logistic regression model examines factors that might influence people's agreement with the "Share_nothing_Society" statement. The dependent variable has five ordered categories: Strongly Agree, Somewhat Agree, Neither Agree nor Disagree, Somewhat Disagree and Strongly Disagree.
Model Fit Measures
Deviance: A measure of model fit; lower values indicate better fit.
AIC (Akaike Information Criterion): Another fit index, where lower values suggest a better model.
R² McF (McFadden's R-Squared): A pseudo R-squared value giving a rough indication of the model's explanatory power. At 0.02, it suggests the model explains only a small amount of the variation in the response variable.
Model Coefficients
The coefficients indicate how each predictor variable is expected to affect the log odds of being in a higher versus lower category of agreement with the "Share_nothing_Society" statement.
Positive Estimate: Predicts higher odds of agreeing more with the statement. For instance, "Impact_Globalisation" and "Racism_scale" have positive estimates and significant p-values, indicating they're associated with greater odds of stronger agreement.
Negative Estimate: Predicts higher odds of disagreeing more with the statement. "Extremism_Score_scaled" and "Gender: Female - Male" have negative estimates, suggesting they're associated with greater odds of stronger disagreement, though their p-values are not significant, indicating a weak or non-existent relationship.
Odds Ratios (OR): Values above 1 indicate higher odds of falling into a higher agreement category with a one-unit increase in the predictor. OR below 1 suggests higher odds of falling into a lower agreement category. For instance, "Age" has an OR close to 1, suggesting a minimal impact on agreement level.
Model Thresholds
The model thresholds in an ordinal logistic regression are the points that separate the different levels of the outcome variable. In this case, the outcome variable is the level of agreement with the "Share_nothing_Society" statement. Here's how to interpret the thresholds:
Threshold Estimates: Each threshold estimate represents the log-odds of being at or below a certain level of agreement versus being above it, given that all predictors are at zero. These are the points along the logit (log-odds scale) where the probability of falling in one category or the next higher category is equal (50/50 chance).
Strongly Agree | Somewhat Agree: The threshold estimate between "Strongly Agree" and "Somewhat Agree" is −2.51. This is a very negative value, indicating that the probability of an individual being predicted to "Strongly Agree" is very low when all predictors are at their reference levels. The odds ratio of 0.08 reinforces this, suggesting that the odds are heavily skewed towards "Somewhat Agree" or lower levels of agreement.
Somewhat Agree | Neither Agree nor Disagree: The threshold estimate is −0.96. This is less negative than the previous threshold, reflecting higher odds of being at least in the "Somewhat Agree" category compared to "Neither Agree nor Disagree" or below.
Neither Agree nor Disagree | Somewhat Disagree: With a threshold estimate of 0.36, this suggests that at the reference levels of predictors, the odds are slightly in favour of being in the "Neither Agree nor Disagree" category or lower, compared to "Somewhat Disagree" or "Strongly Disagree".
Somewhat Disagree | Strongly Disagree: The threshold estimate is 1.99, which is quite positive. It indicates that without strong factors to push the prediction otherwise, the model will likely predict the "Somewhat Disagree" category or higher rather than "Strongly Disagree".
Significance: The Z-scores and associated p-values for each threshold estimate tell us whether the threshold is significantly different from zero. In this table, all p-values are below .05, indicating that each threshold significantly separates the levels of agreement.
In practical terms, these thresholds can help us understand where the 'cut-off' points are between different response categories and how certain the model is about these distinctions. They are particularly useful for making predictions about where new observations might fall in the ordered outcome categories based on their predictor values.
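To make a threshold concrete, you can convert its log odds into a cumulative probability with the inverse logit function. A quick Python check using the first threshold from the table above:

```python
import math

def inverse_logit(x):
    """Convert log odds into a probability."""
    return 1 / (1 + math.exp(-x))

threshold = -2.51   # "Strongly Agree | Somewhat Agree" from the table above
print(round(inverse_logit(threshold), 3))
# ~0.075: with all predictors at their reference levels, there is only
# about a 7.5% predicted probability of answering "Strongly Agree".
```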
Summary
The model suggests that variables like the perceived impact of globalisation and experienced racism are significantly associated with stronger agreement with the "Share_nothing_Society" statement, while extremism and being female are less clear in their impact. The thresholds show where the odds shift in favour of one category over another. The overall model has a low explanatory power, indicating that other unmeasured factors may also influence individuals' levels of agreement with the statement.
Your dependent variable must be an ordinal variable. Your independent variables can be continuous, ordinal and nominal. Note, if you use nominal/ordinal variables, these are turned into dummy variables within your regression.
10.3 Additional Learning Materials
Easy: Davis, C. (2019) Statistical Testing with Jamovi and JASP Open Source Software. Vor Books. Read: Chapters 6 & 14.
Moderate: Frost, J. (2019) Regression Analysis: An Intuitive Guide for Using and Interpreting Linear Models.
Advanced: Navarro, D. & Foxcroft, D. (2022) Learning Statistics with Jamovi. Read: Chapter 12.