Tutorial 1: Understand Data
Tutorial Overview:
One of the most important things in Statistics and Data Science is understanding your data. This is so fundamental that you shouldn't move on to more advanced tutorials until you have fully understood the ideas and concepts outlined here. You might still get results from an analysis, but those results won't make much sense unless you have used the correct type of variables.
In this tutorial, we will cover variables and levels of measurement.
Throughout this, and all other tutorials, we will be using the raw dataset on Politics, Extremism, Life and the internet. Once you have worked through this material you will be able to apply it to your own datasets.
1.1 Understanding your data
1.1.1 What are Variables
In the simplest terms, a variable in statistics refers to something that can change or vary. It's akin to a characteristic or feature that may differ across various situations or among different individuals.
A variable is a record of any number, quantity or characteristic that can be measured. Examples include Age, Gender, and Attitudes, to name but a few. The variable Age, for example, would be a record of the age of every person you spoke to as part of your research. A variable holds one value per observation, so it can contain as many values as you have participants.
When conducting research, for instance, you might be interested in people's attitudes towards the current government. During data collection, you might gather information about participants' ages, where they live, whom they voted for in the previous election, and their views on the current government. Each one of these pieces of information represents a variable (e.g., Age, Place of Residence, Voting History, etc.). For each participant, the answers to these questions will vary.
Thus, a variable is essentially a collection of the different outcomes or responses to your question, characteristic, or feature.
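To make this concrete, here is a minimal sketch in Python of how a variable can be thought of as one column of answers collected across participants. The participant records and the helper name get_variable are hypothetical and are not taken from the tutorial dataset.

```python
# Hypothetical survey responses: one dictionary per participant.
participants = [
    {"age": 34, "residence": "City", "voted_for": "Party A"},
    {"age": 52, "residence": "Town", "voted_for": "Party B"},
    {"age": 19, "residence": "Village", "voted_for": "Party A"},
]


def get_variable(rows, name):
    """Collect one variable: the value of `name` for every participant."""
    return [row[name] for row in rows]


# Each variable is the list of answers to one question.
ages = get_variable(participants, "age")
voted_for = get_variable(participants, "voted_for")

print(ages)
print(voted_for)
```

Each list has one entry per participant, which is why the values within a variable differ from person to person.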
There are different types of variables. We call these levels of measurement. These will be discussed below.
1.1.2 Levels of Measurement
Levels of measurement describe the nature of the information contained in the values assigned to a variable. They help us classify variables into types of data.
These levels are nominal, ordinal, and continuous. These can be split into two main categories:
Qualitative Data (Nominal, Ordinal)
Quantitative Data (Continuous - can be split into interval and ratio data)
Here, qualitative does not mean engaging in qualitative research; rather, it refers to qualitative data that is being quantified (e.g. by counting or summing it). In the social sciences, much of the data you will use will be qualitative.
The importance of understanding and applying the different levels of measurement correctly cannot be overstated - your analysis will depend on selecting the correct types of data.
1.1.3 Nominal Data (often referred to as categorical data)
This data type refers to data that can be divided into separate categories distinguished by names or labels, where the order or ranking of these categories is not meaningful or relevant. Key characteristics of nominal data include:
Label-Based Categories: Each value or observation in nominal data represents a label or category. For instance, gender (male, female, other) and relationship status (single, in a relationship, widowed, etc.) are examples of nominal data.
Non-Numeric or Categorized Numeric Values: While nominal data is often non-numeric, it can include numbers used as labels without implying a numeric relationship, such as postal codes or phone numbers. Note: Depending on your dataset, numbers may also represent labels, e.g. 0 might represent Male, while 1 represents Female. Always check your codebook.
No Inherent Order: Nominal data does not have any inherent hierarchy or ranking. In the case of gender, for example, Male is not 'higher' or 'lower' than Female.
Understanding the nature of nominal data is crucial, as it dictates the type of statistical tests that can be appropriately applied and influences how conclusions are drawn from the data.
Avoid treating variables where numbers are used to represent categories as continuous variables (e.g., Male = 0, Female = 1, Other = 2, etc.). They are not continuous, and any results you get from treating them that way will be meaningless. Always check your codebook to see what the values of each variable represent.
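The warning about numerically coded categories can be illustrated with a short Python sketch. The codebook mapping and the responses below are hypothetical; your own dataset's codebook may use different codes.

```python
from collections import Counter
from statistics import mean

# Hypothetical codebook: numbers are labels only, not quantities.
GENDER_LABELS = {0: "Male", 1: "Female", 2: "Other"}

# Hypothetical coded responses as they might appear in a raw dataset.
gender_codes = [0, 1, 1, 2, 0, 1]

# Correct approach: translate codes to labels, then count each category.
labels = [GENDER_LABELS[code] for code in gender_codes]
counts = Counter(labels)
print(counts)

# Incorrect approach: Python will happily average the codes,
# but the result does not describe anything about the participants.
misleading_mean = mean(gender_codes)
print(misleading_mean)
```

The software does not stop you from calculating the average of the codes, which is exactly why the codebook check matters: the number it returns has no interpretation.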
1.1.4 Visualising and Analysing Nominal Data:
This type of data is typically used for grouping or classifying information. Statistical analysis of nominal data includes counting frequencies, identifying the mode, or using chi-square tests. The mean and median are not applicable, because the categories have no numerical value or order.
For visual representation, nominal data is commonly presented through bar charts or pie charts, which show the distribution of frequencies across different categories.
These examples visualise categorical data using bar charts. The examples show you the difference between counts and percentages - more on that in later tutorials.
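As a rough sketch of the difference between counts and percentages, the following Python example summarises a nominal variable using only the standard library. The voting responses are made up for illustration.

```python
from collections import Counter
from statistics import mode

# Hypothetical nominal variable: party voted for at the last election.
votes = [
    "Party A", "Party B", "Party A", "Did not vote",
    "Party A", "Party B", "Party C", "Party A",
]

# Counts: how many participants fall into each category.
counts = Counter(votes)

# Percentages: each count as a share of all responses.
total = len(votes)
percentages = {category: 100 * n / total for category, n in counts.items()}

# The mode is the most frequent category.
most_common = mode(votes)

for category, n in counts.most_common():
    print(f"{category}: {n} ({percentages[category]:.1f}%)")
print("Mode:", most_common)
```

Counts and percentages carry the same information about the ordering of categories; percentages simply make groups of different sizes easier to compare.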
1.1.5 Ordinal Data
Ordinal data is a type of categorical data that involves ordering or ranking the values. These are some of the key features:
Ranked Categories: Ordinal data consists of categories that are logically ordered or ranked. Each category represents a level on a scale or a ranking, but the intervals between these levels are not necessarily equal or known. For example, a survey on political satisfaction might have responses like 'Very Unsatisfied', 'Unsatisfied', 'Neutral', 'Satisfied', and 'Very Satisfied'. The distance between 'Satisfied' and 'Very Satisfied' is undefined, and may differ from participant to participant. Ordinal data has a clear hierarchy or sequence.
Typically Non-Numeric, but Can Be Represented Numerically: Ordinal data is often expressed in words (like ranks, grades, and levels), but it can also be represented by numbers that serve as codes for the order. For example, educational levels could be coded as 1 for primary school, 2 for secondary school, 3 for university, etc.
Note: Data such as birth year and dates are treated as ordinal, not continuous, variables.
Visualising and Analysing Ordinal Data:
In the social sciences, the use of ordinal data is common. Ordinal outcomes are often incorporated into surveys and structured interviews where understanding the order or ranking is important, but the precise differences between these ranks are either unknown or irrelevant.
The analysis of ordinal data is somewhat limited compared to interval or ratio data. You can use measures like the median or mode to describe central tendency, but the mean is not appropriate due to the undefined intervals between ranks. Non-parametric statistical tests are often used for ordinal data.
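A minimal Python sketch of finding the median of an ordinal variable is shown here. The responses are hypothetical, and the rank numbers are used only to record the order of the categories, not the distance between them.

```python
from statistics import median_low

# The ordered categories of the satisfaction scale, lowest to highest.
LEVELS = [
    "Very Unsatisfied", "Unsatisfied", "Neutral",
    "Satisfied", "Very Satisfied",
]

# Hypothetical survey responses.
responses = [
    "Satisfied", "Neutral", "Very Unsatisfied", "Satisfied",
    "Unsatisfied", "Very Satisfied", "Neutral",
]

# Convert each response to its position in the ordering.
ranks = [LEVELS.index(response) for response in responses]

# median_low always returns a value that occurs in the data,
# so the result maps back to a real category.
median_category = LEVELS[median_low(ranks)]
print("Median response:", median_category)
```

Using median_low rather than an averaged median is a design choice here: it avoids producing a value that falls between two categories, which would have no label on the scale.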
Ordinal data is generally visualised in a similar way to nominal data, using bar charts and line graphs, which can display the ordering of the categories effectively. See below. It is also worth pointing out that you can choose to represent your data as percentages or counts (normally you should use percentages, though).
1.1.6 Continuous Data
Continuous data represents measurements and can take any value within a given range (e.g., 0-500). This type of data is often associated with scale measurements. Below are the main features of continuous data.
Continuous Data
Infinite Possibilities: Continuous data can assume any value within a range. It's typically measured on a scale or continuum and can have fractions or decimals. For example, height, weight, and temperature are continuous because they can be measured with increasing precision and don't have fixed steps.
Graphical Representation: It's commonly represented through histograms, line graphs, or scatter plots.
Above, you can see a graphical representation of the relationship between two continuous variables using a scatter plot. Further, we can divide continuous data into interval and ratio data:
Interval Data
Equal Intervals: This subtype of continuous data has meaningful, equal intervals between measurements. However, it lacks a true zero point. For instance, the temperature in Celsius or Fahrenheit has equal intervals (1 degree) but does not have a true zero point where the quantity being measured ceases to exist.
Arbitrary Zero Point: The zero point in interval data is arbitrary and does not signify the absence of the quantity. For example, 0 degrees Celsius does not mean 'no temperature'.
Operations: You can add and subtract with interval data, but multiplication and division are not meaningful because of the lack of a true zero.
Ratio Data
True Zero Point: Ratio data is like interval data but with a meaningful zero point, indicating the absence of the quantity being measured. Examples include weight, height, and time.
Full Range of Mathematical Operations: Since ratio data has a true zero, it supports all arithmetic operations, including meaningful multiplication and division.
Comparisons: Ratio data allows for comparing differences and the expression of one measurement in terms of another (e.g., one object is twice as heavy as another).
Key Differences
To summarise, ratio data has a true zero point whereas interval data does not. Ratio data can support all arithmetic operations, while only subtraction and addition can be used on interval data.
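The practical consequence of a true zero can be sketched in Python. The temperatures and weights below are hypothetical; the example only shows that ratios of interval data change with the unit of measurement, while ratios of ratio data do not.

```python
def celsius_to_fahrenheit(celsius):
    """Convert a Celsius temperature to Fahrenheit."""
    return celsius * 9 / 5 + 32


# Interval data: two hypothetical temperatures in Celsius.
cool_c, warm_c = 10, 20
cool_f = celsius_to_fahrenheit(cool_c)
warm_f = celsius_to_fahrenheit(warm_c)

# The same two temperatures give different ratios in different units,
# so "twice as hot" is not a meaningful statement.
ratio_celsius = warm_c / cool_c
ratio_fahrenheit = warm_f / cool_f
print(ratio_celsius, ratio_fahrenheit)

# Ratio data: two hypothetical weights in kilograms and in grams.
light_kg, heavy_kg = 40, 80
light_g, heavy_g = light_kg * 1000, heavy_kg * 1000

# Because zero means "no weight" in both units, the ratio is stable.
ratio_kg = heavy_kg / light_kg
ratio_g = heavy_g / light_g
print(ratio_kg, ratio_g)
```

Differences between temperatures remain meaningful in either unit, which is why addition and subtraction are valid for interval data even though division is not.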
Understanding these differences is crucial in statistical analysis, as they determine the types of statistical tests that can be used and the conclusions that can be drawn from the data.
Most datasets will have an index column. This column is not a continuous variable - rather, it holds a unique value that identifies each row. Generally speaking, this variable shouldn't be used in your analysis.
1.2 Additional Resources & Reading:
Easy: Rowntree, Derek (2018) Statistics without Tears. Penguin Books: London. Read: Chapter 1 (pp. 7-22)
Easy: Davis, Cole (2019) Statistical Testing with Jamovi and JASP Open Source Software: Criminology. Vor Press: Norwich. Read: Chapter 2 (pp. 13-16)
Moderate: Navarro, D & Foxcroft, D (2019) Learning Statistics with Jamovi: A tutorial for psychology students and other beginners. Online: Version 0.7. Read: Chapter 2 (pp. 13-40)