Tutorial 2: Preparing your Data
Tutorial Overview:
With your newfound knowledge of variable types and data structures, it's crucial to understand that datasets typically arrange data in rows and columns. The process of preparing your data for analysis can take a significant amount of time but is essential.
You can of course find data that is ready to be analysed, but usually there is something that needs to be done to get the data ready. Understanding how datasets work and how to prepare them is a vital part of data analysis. We'll explore dataset structures, tackle data cleaning (including dealing with missing values), and learn about variable transformation.
Python and R make cleaning data much easier, but they do require coding skills. Alternatively, you can use Excel (which will be more time intensive) and, to some extent, Jamovi to get your data ready. This is your basic primer on the subject.
To prepare your data you will usually follow the following steps:
Explore your data - this cannot be overstated. Get to know the data you are working with.
Remove unnecessary rows and columns
Rename variables
Adjust your levels
Deal with missing values
Transform variables
Throughout the process document what you do and why you do it. This allows others to replicate and check your analysis!
2.1 Dataset Structure
2.1.1 Explore Dataset Structures
If you're familiar with Excel, you will recognise the standard structure of datasets: organised into rows and columns. While datasets vary in appearance, this basic format is consistent.
Columns: Each column represents a variable, which can be any characteristic, number, or quantity that is measurable. Variables include attributes like an individual's age, sex, income, ethnicity, or views on a topic.
Rows: Each row corresponds to a single observation, which may include multiple fields — for instance, a participant's complete survey responses or comprehensive details of an incident. Each row is made up of numerous variables, each represented by a column.
It is crucial to grasp this row-and-column structure, as it is a constant across various software like Excel, Jamovi, SPSS, R, and Python. Usually, your dataset will come as a .csv file. There are other data structures, such as .json files, but these need to be transformed into the row-and-column format before you analyse the data.
The above image gives you a good idea of what your data should look like before you analyse it.
Familiarise yourself with this structure - you will not be able to get away from it.
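If you are working in Python, the pandas library reads data straight into this row-and-column structure. The sketch below is only an illustration: the file name survey.csv and the JSON records are hypothetical stand-ins for your own data.

```python
import pandas as pd

# Load a CSV file straight into a rows-and-columns table (a DataFrame).
# 'survey.csv' is a hypothetical file name - replace it with your own.
df = pd.read_csv("survey.csv")

# A nested structure such as JSON must be flattened into rows and columns first.
records = [
    {"id": 1, "age": 25, "answers": {"q1": "Agree", "q2": "Disagree"}},
    {"id": 2, "age": 31, "answers": {"q1": "Disagree", "q2": "Agree"}},
]
flat = pd.json_normalize(records)  # columns: id, age, answers.q1, answers.q2

print(df.shape)    # (number of rows, number of columns)
print(flat.head())
```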
2.2 Data Preparation
2.2.1 Explore your data
Understanding and exploring your dataset before diving into data preparation is a critical step in any data analysis project. It serves as a foundational stage where you familiarize yourself with the nature, structure, and idiosyncrasies of your data. This preliminary step can inform how you approach cleaning, transforming, and analyzing your data.
Familiarize yourself with the Data
Review Metadata:
Start by examining any available metadata, which includes information about the data source, collection methods, data dictionary (descriptions of variables), and any previous manipulations the data might have undergone. Often this can be found in a codebook. If you have collected the data yourself, consider creating a codebook to record this information.
Initial Observations:
Open your dataset in a data analysis tool and do a cursory review to understand the basic format, the number of variables (columns), and the number of observations (rows). Check for:
The presence of header rows.
Data types (e.g., numerical, categorical, text, date).
Overall size and scope of the dataset.
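If you are using Python, a quick cursory review along these lines might look like the following sketch, where a small made-up DataFrame stands in for your real dataset:

```python
import pandas as pd

# A tiny made-up dataset standing in for your real survey data.
df = pd.DataFrame({
    "age": [25, 31, 47, None],
    "sex": ["Male", "Female", "Female", "Male"],
    "income": [21000, 35000, None, 54000],
})

print(df.shape)    # number of observations (rows) and variables (columns)
print(df.dtypes)   # data type of each variable
print(df.head())   # the first few rows, including the header
```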
Descriptive Statistics
Generate summary statistics for numerical data to get a sense of the central tendency (mean, median), dispersion (range, interquartile range, standard deviation), and shape (skewness, kurtosis).
For categorical variables, look at frequency distributions to understand the proportion of each category and to spot any categories that may have very few observations.
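In pandas, both kinds of summary are one-liners. The sketch below uses made-up age and ethnicity variables purely for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, 31, 47, 19, 62],
    "ethnicity": ["White", "Asian", "White", "Black", "White"],
})

# Central tendency, dispersion and quartiles for a numerical variable.
print(df["age"].describe())
print("skewness:", df["age"].skew())

# Frequency distribution for a categorical variable.
print(df["ethnicity"].value_counts())
```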
Data Quality Assessment
Identify if the dataset has missing values, where they are, and how they're represented (e.g., as NaN, NULL, blank spaces, or placeholders like -999).
Search for and assess the impact of duplicate rows or entries. Decide on a strategy for handling any duplicates you find.
Look for outliers or anomalies in your data, which could be errors or legitimate extreme values.
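A hedged pandas sketch of these checks is shown below; the -99 placeholder and the column names are assumptions, not a fixed convention:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age": [25, -99, 47, 25],
    "income": [21000, 35000, np.nan, 21000],
})

# Replace a placeholder such as -99 with a proper missing value first.
df = df.replace(-99, np.nan)

print(df.isna().sum())        # missing values per variable
print(df.duplicated().sum())  # number of fully duplicated rows
df = df.drop_duplicates()     # one possible strategy for duplicates

# A crude outlier check: look at extreme values relative to the rest.
print(df["income"].describe())
```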
At this stage you might also consider visualising some of your data to help you get a better understanding of what the data is hiding. Use visual tools like histograms, box plots, and scatter plots to observe the distribution and relationships between variables.
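For example, with pandas and matplotlib (and a made-up age variable):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"age": [18, 22, 25, 31, 35, 41, 47, 62, 75]})

# Histogram of the distribution and a box plot to spot potential outliers.
df["age"].plot(kind="hist", title="Distribution of age")
plt.show()

df.boxplot(column="age")
plt.show()
```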
Initial Findings and Tidy Data:
Note down any interesting patterns, trends, or initial insights that could guide more in-depth analysis later.
Ensure that your data is tidy; each variable should form a column, each observation should form a row, and each type of observational unit should form a table.
Identify areas that will require cleaning or transformation based on your exploration.
By thoroughly understanding and exploring your dataset upfront, you'll set a strong foundation for the subsequent steps in your data analysis process. This phase ensures that the data you work with is well-suited for your analysis objectives and that the insights you derive are reliable and meaningful.
2.2.2 Remove unnecessary rows and columns & Rename Variables
Data preparation is a critical step in the data analysis process, involving cleaning and transforming raw data into a format suitable for analysis. Part of this process includes removing unnecessary columns and rows, as well as renaming variables. Here's an overview of these tasks:
Removing Unnecessary Columns and Rows
Raw datasets will often have variables that are simply of no use. When you conduct a survey on a platform like Qualtrics or SurveyMonkey, the raw data contains additional variables such as progress, the person's IP address, and so on. There may also be questions about ethics that add nothing to your dataset and can be deleted.
Delete incomplete observations:
It is highly likely that you have some incomplete observations, and you may want to delete some of them. You will need to exercise some judgement when deciding which incomplete observations to delete.
In a survey, if a participant has only completed 50-60%, or maybe even 70%, of the survey, consider deleting the observation - but again, it depends on what data is missing. If the questions they have left unanswered are inconsequential, then consider keeping the observation.
If you download your raw data from Qualtrics, JISC survey or SurveyMonkey, it will usually include some sort of progress variable that tells you how much of the survey has been completed. You can use the filter function in Excel to separate the observations you want to keep from the rest, and then delete the remainder.
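If you prefer Python, the pandas equivalent of that Excel filter might look like the sketch below. The Progress column name mimics a typical Qualtrics export, and the 70% cut-off is only an example:

```python
import pandas as pd

df = pd.DataFrame({
    "Progress": [100, 45, 80, 10],
    "q1": ["Agree", "Agree", None, None],
})

# Keep only observations that completed at least 70% of the survey
# (the 70% cut-off is a judgement call - document whatever you choose).
df = df[df["Progress"] >= 70]
print(df)
```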
Columns: Removing unnecessary columns simplifies the dataset by eliminating irrelevant, redundant, or unimportant data. This is crucial because it makes the dataset easier to work with and understand.
Rows: Similarly, removing unnecessary rows, or "observations", can be important for several reasons:
It cleans up the data by getting rid of noise, such as duplicate entries or records that are irrelevant to the specific analysis.
It ensures that the quality of the dataset is maintained by excluding data points that may be outliers or have too many missing values that cannot be imputed accurately.
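In pandas, both operations are handled by drop. The column names below (IPAddress, Status) are made up to mimic typical survey exports:

```python
import pandas as pd

df = pd.DataFrame({
    "IPAddress": ["10.0.0.1", "10.0.0.2"],
    "Status": ["IP Address", "IP Address"],
    "age": [25, 31],
})

# Drop columns that add nothing to the analysis.
df = df.drop(columns=["IPAddress", "Status"])

# Drop specific rows by their index label, e.g. a test response in row 0.
df = df.drop(index=[0])
print(df)
```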
Renaming Variables
Renaming variables in a dataset is an essential step that enhances clarity and readability, which in turn facilitates better communication of analysis results. Renaming helps to:
Improve the accessibility of the dataset to others (and to your future self).
Ensure consistency across different datasets and analyses, which is especially important when combining data sources.
Help to avoid confusion, for example, by using descriptive names that clearly define what each variable represents.
Make programming easier by avoiding names that might conflict with reserved words or functions in the data analysis software.
You will often need to rename your variables. The raw dataset will frequently use the full question you asked as the variable name, which is not suitable. Variable names should be short (but still make sense).
Depending on the size of your dataset, this may take quite some time. You should also add the questions to your codebook as an easy reference.
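In pandas, renaming is a single call to rename; the question text and the short names below are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "What is your age?": [25, 31],
    "How satisfied are you with the service?": ["Agree", "Disagree"],
})

# Map the long question text to short, meaningful variable names,
# and record the mapping in your codebook.
df = df.rename(columns={
    "What is your age?": "age",
    "How satisfied are you with the service?": "satisfaction",
})
print(df.columns.tolist())   # ['age', 'satisfaction']
```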
Consider the following:
When performing these tasks, consider the following best practices:
Documentation: Always keep a record of all changes you make to the original dataset, including which columns or rows were removed and any variable names that were changed. Your documentation should clearly communicate the rationale behind removing certain data and renaming variables so that everyone who uses the datasets knows what to expect.
Consistency: Apply a consistent logic when deciding which columns and rows to remove and when renaming variables. This consistency should be aligned with the goals of your analysis.
Validation: After removal and renaming, validate the dataset to ensure that no critical information is lost and that the dataset still accurately represents the underlying phenomena.
Backup: Always keep a copy of the original data before making any changes. This allows you to revert back in case of any errors during the data preparation process.
By carefully considering the need to remove and rename data, you can significantly improve the quality and utility of your data, making subsequent analyses more robust and reliable.
2.2.3 Adjust your Variables and Levels
For Continuous Variables:
Data Entry Errors: Check for errors in your data and look out for impossible values. Some survey programs, for example, automatically record missing values as 99 or -99, or you might have a survey for participants aged 18+ but find values below that. Such values need to be removed and/or corrected.
Consistency Checks: Make sure that the data across related variables (e.g., age, income) makes sense and is consistent.
Standardise Formats: Make sure your continuous data follows a standard format (e.g., number of decimal places, etc).
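A hedged pandas sketch of these checks for a hypothetical 18+ survey:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age": [25, -99, 16, 47],
    "income": [21000.0, 35000.5, 28000.123, 54000.0],
})

# Flag impossible values in an 18+ survey and treat them as missing.
df["age"] = df["age"].where(df["age"].between(18, 110), np.nan)

# Standardise the format of a continuous variable, e.g. two decimal places.
df["income"] = df["income"].round(2)
print(df)
```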
For Ordinal Variables:
Make sure that the values are accurate. Convert the levels of ordinal variables into numerical codes, as this is necessary for certain statistical tests that require numerical input. Some programs (such as Jamovi or R) allow you to attach labels to your numbers. For example: Strongly agree = 0
Somewhat agree = 1
Neither agree nor disagree = 2
Somewhat disagree = 3
Strongly disagree = 4
If you do not encode your levels, some tests may not function correctly. Even when tests run without errors, you must still inform the statistical program about the order of the levels.
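In pandas, one way to attach both numerical codes and an explicit order is an ordered categorical, as in the sketch below (the agreement variable is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "agreement": ["Strongly agree", "Somewhat disagree", "Neither agree nor disagree"],
})

# Tell the software that the levels have an order.
levels = ["Strongly agree", "Somewhat agree", "Neither agree nor disagree",
          "Somewhat disagree", "Strongly disagree"]
df["agreement"] = df["agreement"].astype(
    pd.CategoricalDtype(categories=levels, ordered=True)
)

# Numerical codes follow the order of the levels
# (0 = Strongly agree, ..., 4 = Strongly disagree).
df["agreement_code"] = df["agreement"].cat.codes
print(df)
```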
For Nominal Data:
You may also choose to convert nominal data labels into numbers based on personal preference. This step is not mandatory, but doing so can make the data easier to handle. Ensure that the text for your labels is accurate. For example, use 'Other' instead of 'Other:'. To convert the labels into numbers, you could use:
Male = 0
Female = 1
Note: Encoding does not turn your nominal variable into a continuous or ordinal variable.
After performing the above steps, your dataset should be relatively clean. In Jamovi, also verify that the correct level of measurement is selected for each variable.
2.2.4 Recode a Variable
Recoding involves converting variables from one format or structure into another and is typically used with categorical data. We recode variables because some statistical tests and programs can only handle numerical data, so categorical data needs to be recoded into a numerical format. There are several methods.
Label Encoding: Each unique category level is assigned an integer value. This is more compact but can introduce a new problem: the model might interpret the numerical values as having an ordinal relationship.
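A minimal label-encoding sketch in pandas (the sex variable and its 0/1 coding are just an example):

```python
import pandas as pd

df = pd.DataFrame({"sex": ["Male", "Female", "Female", "Male"]})

# Assign an integer to each category level; document the mapping in your codebook.
df["sex_code"] = df["sex"].map({"Male": 0, "Female": 1})
print(df)
```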
2.2.5 Deal with Missing Values
When analysing your data, you MUST instruct the statistical programme to recognise missing data. Otherwise, you risk obtaining wrong results. Qualtrics, for instance, designates missing data with a '-99' code. If you neglect to inform your statistical programme that '-99' signifies missing data, the computer will erroneously incorporate these values into its calculations. Other programmes such as Jamovi or Python represent missing data with NaN.
Ensure you adopt a method appropriate and consistent with the statistical programme you're utilising. For example, Jamovi identifies blank cells, 'na', or 'NaN' as missing data, requiring no additional action on your part. However, if your dataset uses '-99', you will need to manually tell Jamovi to recognise this as the missing value code.
There's no need to eliminate all missing data, as most statistical programmes will exclude it from analysis (though this will reduce your sample size). Excessive missing data may trigger an error, indicating that the analysis is unfeasible. Imputing missing data will increase the amount of data available for your analysis.
Working with Missing Data:
It's inevitable that you will encounter missing data, even after removing incomplete observations. There are two primary approaches to handle missing data:
Discard Observations: You may choose to discard the entire column or row containing missing data. This approach is generally not advisable if there are only a few missing values. However, if substantial data is missing from a row or column, it might be sensible to discard it. The decision is yours to make – ensure you document your reasoning.
Impute Missing Data: There are numerous strategies for calculating and replacing NaN values. You should carefully consider which method to use. For an in-depth discussion, consider consulting relevant literature. Essentially, you have the following options:
Use the Median, Mean, or Mode for continuous data.
Use the Mode for ordinal and nominal data.
Employ advanced techniques such as regression and machine learning to estimate the missing value more accurately.
After choosing your method, decide which data (rows and columns) to discard, then proceed to impute the missing values using your chosen method.
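A hedged pandas sketch of simple imputation, using the median for a continuous variable and the mode for a categorical one:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "income": [21000, np.nan, 35000, 54000],
    "ethnicity": ["White", "Asian", np.nan, "White"],
})

# Median imputation for continuous data.
df["income"] = df["income"].fillna(df["income"].median())

# Mode imputation for nominal/ordinal data.
df["ethnicity"] = df["ethnicity"].fillna(df["ethnicity"].mode()[0])
print(df)
```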
For a more in-depth discussion on dealing with missing values, visit this website.
2.2.6 Transform Variables
Transforming variables is a common operation in data analysis. There are a few common types of transformations:
Normalization: Scaling data to a small, specified range, like 0 to 1.
Standardization: Transforming data so that it has a mean of 0 and a standard deviation of 1 (more on this later)
Log Transformation: Applying the natural logarithm or log10 to the data, often used to handle skewed data.
Binning or Bucketing: Transforming continuous data into discrete bins or categories.
Recoding: Converting categorical data into numerical format, such as one-hot encoding or label encoding for machine learning models (covered above).
Feature Engineering: Creating new variables from existing ones, which could involve mathematical operations, aggregation, or other techniques to derive more informative or useful features.
The specific method to use depends on the nature of your data and the goals of your analysis. We will keep it simple for now and only cover recoding and binning.
Recoding is covered above. Below, we look at One-Hot Encoding and Bucketing/Binning in more detail.
One-Hot Encoding:
One-Hot Encoding converts each category level into a new binary variable (0 or 1). Each level is represented by a dummy variable.
How does One-Hot Encoding work:
One-hot encoding is a process used to convert categorical data into a numerical format by creating a binary column for each category of the variable. In one-hot encoding, each category becomes a new column, and a binary value of 1 or 0 is assigned to those columns in each row of the dataset.
Below is a step-by-step summary of how one-hot encoding works:
Identify Unique Categories: identify all the unique categories within the categorical variable you want to encode.
Create New Columns: For each unique category, create a new column in the dataset.
Assign Binary Values: For each row in the dataset, fill the new columns with binary values:
Place a 1 in the column corresponding to the category that the row belongs to.
Place a 0 in all other newly created category columns.
The result is a "one-hot" because only one category column is hot (1) for each row, while all others are cold (0). This technique is widely used to prepare categorical data for many types of machine learning algorithms.
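In pandas, one-hot encoding is available through get_dummies; the ethnicity variable below is hypothetical:

```python
import pandas as pd

df = pd.DataFrame({"ethnicity": ["White", "Asian", "Black", "White"]})

# Each unique category becomes its own binary (0/1) column.
encoded = pd.get_dummies(df, columns=["ethnicity"], prefix="eth")
print(encoded)
```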
Bucketing/Binning:
Imagine you have a laundry basket full of socks of different colours. Now, you want to organize these socks so that it's easier for you to find a pair when you need it. So you sort the socks into different bins or buckets based on their colour. All the red socks go into one bin, blue into another, and so on. This is what binning or bucketing is like in data analysis.
Now, let's relate this to data:
Bucketing or binning is like sorting socks. But instead of socks, you have a lot of numbers (data points). These could be anything: ages of people, prices of houses, temperatures, etc. When there are too many different numbers, it's hard to make sense of them. So, you put these numbers into groups (or bins) to organize them better.
Why Use Bucketing/Binning?
Simplicity: It makes your data simpler. Instead of looking at every single age in a survey, you look at age groups like 0-18, 19-35, etc.
Manageability: It makes large data sets more manageable. Imagine trying to understand the individual ages of thousands of people at once!
Analysis: It helps in analysis. It's easier to see patterns and trends when you have groups to compare, like seeing which age group buys more online.
How Does it Work
Choose a Variable: Pick the data you want to organize, like the ages of people in a survey.
Decide on Bin Size or Number: Choose how big each group should be or how many groups you want. For example, you could have 5-year age groups or maybe 10 groups in total.
Create Bins: Group each data point into the right bin. For ages, anyone between 0-18 years goes into the first bin, 19-35 in the second, and so on.
Label the Bins: Give each bin a label that makes sense, like "Teenagers" for the 0-18 group.
Use Your Bins: Now you can easily compare these bins to find patterns, like which age group votes for whom.
That's bucketing/binning in a nutshell – organizing your data into groups to make it easier to understand and work with.
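In pandas, pd.cut does the sorting into bins for you. The sketch below uses the age groups from the example above:

```python
import pandas as pd

df = pd.DataFrame({"age": [5, 17, 19, 34, 36, 70]})

# right=False makes each bin include its lower edge and exclude its upper edge,
# so the groups are 0-18, 19-35, 36-65 and 66+ with no overlap.
df["age_group"] = pd.cut(
    df["age"],
    bins=[0, 19, 36, 66, 120],
    right=False,
    labels=["0-18", "19-35", "36-65", "66+"],
)
print(df)
```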
Simple transformations like these are easy to carry out. However, you may often want to combine several variables into a single variable.
Of course, this can be done. However, don't just add variables together. There are statistical tests, such as a Reliability Analysis, a Principal Component Analysis, or an Exploratory Factor Analysis (covered later), that should be used to combine variables. These tests will show you whether the variables you want to combine are statistically correlated or not. This topic will be covered in more depth later.
When transforming a variable, ensure that each new category is exclusive. There should be no overlap between the categories.
E.g., in the above example the 'under 65' category should only include values up to, but not including, 65, so any values from 0-64. The '65 and over' category should include all values of 65 and above. If the categories overlap, you will count some data points twice; if they leave a gap, some data points will not be counted at all.
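Using half-open intervals with pd.cut is one way to guarantee this, as in the hedged sketch below (ages are assumed to fall below 120):

```python
import pandas as pd

df = pd.DataFrame({"age": [40, 64, 65, 66, 90]})

# [0, 65) and [65, 120): 64 falls in 'under 65', 65 falls in '65 and over',
# and no value can land in both categories.
df["age_band"] = pd.cut(df["age"], bins=[0, 65, 120], right=False,
                        labels=["under 65", "65 and over"])
print(df)
```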
2.3. Suggested Reading:
Easy: Rowntree, Derek (2018) Statistics without Tears. Penguin Books: London - Read: Chapter 1 (pp. 7-22)
Easy: Davis, Cole (2019) Statistical Testing with Jamovi and JASP Open Source Software: Criminology. Vor Press: Norwich. Read: Chapter 2 (pp. 13-16)
Moderate: Navarro, D & Foxcroft, D (2019) Learning Statistics with Jamovi: A tutorial for psychology students and other beginners. Online: Version 0.7. Read: Chapter 2 (pp. 13-40)
2.4 Additional Reading: Quantitative Research Design
In this tutorial, you learn how to understand and prepare data that has already been collected. If you conduct your own research, it is really important to understand these concepts and think about them in your research design phase. The readings below provide you with a good overview of Quantitative Research Design.
While research design is extremely important, it is not covered here, as the focus is on learning basic data analysis skills.
Easy: Mertler, Craig (2021) Introduction to Educational Research. Sage: London. Chapter 7.
Moderate: Angrist, Joshua & Pischke, Jörn-Steffen (2015) Mastering 'Metrics. Princeton: Princeton University Press
Difficult: Morgan, Stephen & Winship, Christian (2007) Counterfactuals and Causal Inference. Cambridge: Cambridge University Press