Exploratory Data Analysis for International Journals | PhD Insight

RStudioDataLab

7 min read · Nov 8, 2023

Key points

Exploratory data analysis (EDA) is crucial in any data analysis project. It involves exploring, summarizing, and visualizing your data to gain insights, identify patterns, and detect outliers.
EDA can also help you formulate hypotheses, choose appropriate statistical tests, and communicate your findings effectively.
In this article, I will explain how I perform EDA in R using the tidyverse, a collection of packages for data manipulation, visualization, and modeling, following the workflow I used for my article in an impact factor journal.
I will use a generated dataset for this tutorial that contains information about 1000 students from different countries, their academic performance, and their satisfaction with their university.
You will learn how to load and view the data in R, summarize the data using descriptive statistics, visualize the data using charts and graphs, identify missing values and outliers, transform and filter the data, perform hypothesis testing and correlation analysis, and generate an EDA report using R Markdown.

Introduction

In data analysis and statistics, R has emerged as a powerful tool for students, researchers, and professionals. This article will delve into the fascinating world of data analysis and visualization using R. Our journey will include generating a synthetic dataset, performing data visualization, identifying missing values and outliers, data transformation, and conducting hypothesis testing and correlation analysis. So, fasten your seatbelts, and let’s embark on this data-driven adventure!

Generating a Dataset

To begin our exploration, we’ll first create a synthetic dataset. Synthetic data allows us to simulate real-world scenarios, and in this case, we’ll generate data for 1000 students. The R code snippet below demonstrates how we can achieve this:
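The original snippet is not reproduced in this copy of the article, so the following is only a minimal sketch of how such a dataset could be generated. The column names and types follow the str output shown in the next section; the data frame name student_data, the seed, the value ranges, and the sampling distributions are all assumptions, so the numbers will not match the outputs printed later.

```r
# Sketch only: seed, ranges, and distributions are assumed, not the author's.
set.seed(123)
n <- 1000

student_data <- data.frame(
  id           = 1:n,
  country      = sample(c("Brazil", "Canada", "China", "India", "UK", "USA"),
                        n, replace = TRUE),
  gender       = sample(c("Female", "Male"), n, replace = TRUE),
  age          = sample(18:24, n, replace = TRUE),
  major        = sample(c("Art", "Bio", "CS", "Econ", "Eng", "Math"),
                        n, replace = TRUE),
  gpa          = round(runif(n, min = 2, max = 4), 1),
  sat          = sample(seq(1000, 1600, by = 50), n, replace = TRUE),
  toefl        = sample(seq(80, 120, by = 5), n, replace = TRUE),
  ielts        = round(runif(n, min = 5, max = 9), 1),
  gre          = sample(seq(260, 340, by = 10), n, replace = TRUE),
  satisfaction = sample(1:5, n, replace = TRUE),  # 1 = lowest, 5 = highest
  stringsAsFactors = FALSE
)
```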

In the code snippet above, we set the seed for reproducibility and create a dataset with various student attributes, including country of origin, gender, age, major, GPA, standardized test scores, and satisfaction level.

Analyzing the Dataset

Before diving into data visualization, it’s crucial to understand the dataset’s structure. We can use R functions like names, dim, and str to achieve this. Additionally, we can display the top five rows of the dataset using the head function. This information provides us with an initial overview of the data:

##  [1] "id"           "country"      "gender"       "age"          "major"       
##  [6] "gpa"          "sat"          "toefl"        "ielts"        "gre"         
## [11] "satisfaction"

## [1] 1000   11

## 'data.frame':    1000 obs. of  11 variables:
##  $ id          : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ country     : chr  "India" "Brazil" "USA" "UK" ...
##  $ gender      : chr  "Female" "Male" "Female" "Male" ...
##  $ age         : int  18 24 18 19 19 18 20 18 19 20 ...
##  $ major       : chr  "CS" "Eng" "CS" "Bio" ...
##  $ gpa         : num  2.6 3.7 3.2 3.6 2.6 2.3 3.8 2 3.1 3.9 ...
##  $ sat         : num  1300 1250 1350 1250 1400 1100 1250 1400 1350 1100 ...
##  $ toefl       : num  90 85 90 110 105 100 80 120 95 100 ...
##  $ ielts       : num  7.2 5.8 8.1 7.4 7.2 5.7 8.6 8.9 9 5.1 ...
##  $ gre         : num  260 280 320 330 340 280 290 270 260 310 ...
##  $ satisfaction: int  2 5 1 2 3 4 3 5 1 3 ...

##   id country gender age major gpa  sat toefl ielts gre satisfaction
## 1  1   India Female  18    CS 2.6 1300    90   7.2 260            2
## 2  2  Brazil   Male  24   Eng 3.7 1250    85   5.8 280            5
## 3  3     USA Female  18    CS 3.2 1350    90   8.1 320            1
## 4  4      UK   Male  19   Bio 3.6 1250   110   7.4 330            2
## 5  5      UK   Male  19  Math 2.6 1400   105   7.2 340            3

These functions allow us to inspect the variable names, dataset dimensions, data structure, and initial data rows.
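As a sketch, the inspection calls described above might look like this. To keep the block runnable on its own, it rebuilds a five-row stand-in from the head output shown above; the data frame name student_data is an assumption.

```r
# Five-row stand-in copied from the head() output above.
student_data <- data.frame(
  id           = 1:5,
  country      = c("India", "Brazil", "USA", "UK", "UK"),
  gender       = c("Female", "Male", "Female", "Male", "Male"),
  age          = c(18L, 24L, 18L, 19L, 19L),
  major        = c("CS", "Eng", "CS", "Bio", "Math"),
  gpa          = c(2.6, 3.7, 3.2, 3.6, 2.6),
  sat          = c(1300, 1250, 1350, 1250, 1400),
  toefl        = c(90, 85, 90, 110, 105),
  ielts        = c(7.2, 5.8, 8.1, 7.4, 7.2),
  gre          = c(260, 280, 320, 330, 340),
  satisfaction = c(2L, 5L, 1L, 2L, 3L),
  stringsAsFactors = FALSE
)

names(student_data)    # variable names
dim(student_data)      # number of rows and columns
str(student_data)      # type and first values of each variable
head(student_data, 5)  # first five rows (head() shows six by default)
```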

Data Visualization

Now, we enter the exciting realm of data visualization. Visualizing data is essential for gaining insights and identifying patterns. We’ll create various plots to explore the dataset, including bar charts, histograms, and box plots.

https://youtu.be/oQkDaAZBXLQ
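The plots themselves are not reproduced here, so the following is a hedged sketch of the three plot types named above using ggplot2. Which variables go into which plot is an assumption, as is the small simulated student_data used to keep the block runnable on its own.

```r
library(ggplot2)

# Simulated stand-in data; the real dataset has more columns.
set.seed(1)
student_data <- data.frame(
  country = sample(c("Brazil", "Canada", "China", "India", "UK", "USA"),
                   200, replace = TRUE),
  major   = sample(c("Art", "Bio", "CS", "Econ", "Eng", "Math"),
                   200, replace = TRUE),
  gpa     = round(runif(200, min = 2, max = 4), 1)
)

# Bar chart: number of students per country
p_bar <- ggplot(student_data, aes(x = country)) +
  geom_bar() +
  labs(title = "Students by country", x = "Country", y = "Count")

# Histogram: distribution of GPA
p_hist <- ggplot(student_data, aes(x = gpa)) +
  geom_histogram(binwidth = 0.2) +
  labs(title = "Distribution of GPA", x = "GPA", y = "Count")

# Box plot: GPA by major
p_box <- ggplot(student_data, aes(x = major, y = gpa)) +
  geom_boxplot() +
  labs(title = "GPA by major", x = "Major", y = "GPA")

print(p_bar)
print(p_hist)
print(p_box)
```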

Identifying Missing Values and Outliers

Data quality is paramount in any analysis. Identifying missing values and outliers is a crucial step. We can use R functions to check for missing values and identify outliers in numerical variables:

Checking the number of missing values for each variable:

##           id      country       gender          age        major          gpa 
##            0            0            0            0            0            0 
##          sat        toefl        ielts          gre satisfaction 
##            0            0            0            0            0

This code snippet provides the count of missing values for each variable. Every count is zero here, so the dataset has no missing values.
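One common way to get such per-variable counts is to combine is.na with colSums. The sketch below uses a tiny stand-in data frame with two planted missing values, which are an assumption made purely for illustration; the article's dataset has none.

```r
# Stand-in data with two planted NAs for illustration.
student_data <- data.frame(
  gpa   = c(2.6, 3.7, NA, 3.6),
  sat   = c(1300, 1250, 1350, 1250),
  major = c("CS", "Eng", "CS", NA),
  stringsAsFactors = FALSE
)

# is.na() gives a logical matrix; colSums() counts the TRUEs per column.
missing_counts <- colSums(is.na(student_data))
print(missing_counts)
```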

Identifying outliers using the IQR method:

We’ll create a function called ‘identify_outliers’ to identify outliers in the numerical variables ‘age’, ‘gpa’, ‘sat’, ‘toefl’, ‘ielts’, ‘gre’, and ‘satisfaction’. This function calculates the lower and upper bounds for outliers based on the interquartile range (IQR) and returns the values that fall outside them.

Check for outliers in each numerical variable:

## [[1]]
## integer(0)
## 
## [[2]]
## numeric(0)
## 
## [[3]]
## numeric(0)
## 
## [[4]]
## numeric(0)
## 
## [[5]]
## numeric(0)
## 
## [[6]]
## numeric(0)
## 
## [[7]]
## integer(0)

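This code checks for outliers in the specified numerical variables and returns the outlier values. Every element of the result is empty (integer(0) or numeric(0)), so no outliers were found in any of the seven variables.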
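The body of identify_outliers is not shown in this copy, so here is a minimal sketch of what such a function could look like. The 1.5 × IQR multiplier is the usual convention and is assumed here, and the two-column stand-in data (with one planted outlier in age) is hypothetical.

```r
# Return the values of x lying more than 1.5 * IQR below Q1 or above Q3.
# The 1.5 multiplier is the conventional choice, assumed here.
identify_outliers <- function(x) {
  q     <- unname(quantile(x, probs = c(0.25, 0.75), na.rm = TRUE))
  iqr   <- q[2] - q[1]
  lower <- q[1] - 1.5 * iqr
  upper <- q[2] + 1.5 * iqr
  x[!is.na(x) & (x < lower | x > upper)]
}

# Hypothetical stand-in data; age 45 is a planted outlier.
student_data <- data.frame(
  age = c(18L, 19L, 19L, 20L, 20L, 21L, 45L),
  gpa = c(2.6, 3.7, 3.2, 3.6, 2.6, 3.1, 3.0)
)

# Apply the function to each numerical column of interest.
outliers <- lapply(student_data[c("age", "gpa")], identify_outliers)
print(outliers)
```

Applying the function with lapply over named columns returns a named list, which makes it easier to see which variable each result belongs to than the positional [[1]] to [[7]] labels in the output above.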

Data Transformation

Data transformation is a fundamental step in data analysis. In this section, we’ll perform data transformation by creating a new variable and filtering the data:

Creating a new variable called ‘test_score’:

We’ll calculate the average test score from the SAT, TOEFL, IELTS, and GRE scores. This new variable gives a single summary measure of a student’s test performance. Note that the four tests use very different scales, so the raw average is dominated by the SAT score.

Filtering the rows where ‘test_score’ is not missing:

We keep only the rows where the ‘test_score’ variable is not missing to ensure data quality. This step eliminates incomplete data points.

Selecting specific columns for analysis:

We’ll select a subset of columns for further analysis, including ‘id,’ ‘country,’ ‘gender,’ ‘major,’ ‘gpa,’ ‘test_score,’ and ‘satisfaction.’
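The three transformation steps above can be sketched as a single dplyr pipeline. The input name student_data and the three-row stand-in are assumptions; the second row's GRE score is deliberately set to missing so that the filter has something to remove. The output name student_data1 is the one that appears in the correlation output later in the article.

```r
library(dplyr)

# Three-row stand-in; the NA in gre is planted for illustration.
student_data <- data.frame(
  id           = 1:3,
  country      = c("India", "Brazil", "USA"),
  gender       = c("Female", "Male", "Female"),
  age          = c(18L, 24L, 18L),
  major        = c("CS", "Eng", "CS"),
  gpa          = c(2.6, 3.7, 3.2),
  sat          = c(1300, 1250, 1350),
  toefl        = c(90, 85, 90),
  ielts        = c(7.2, 5.8, 8.1),
  gre          = c(260, NA, 320),
  satisfaction = c(2L, 5L, 1L),
  stringsAsFactors = FALSE
)

student_data1 <- student_data %>%
  # 1. Average of the four test scores (NA if any score is NA)
  mutate(test_score = (sat + toefl + ielts + gre) / 4) %>%
  # 2. Keep only rows with a non-missing test_score
  filter(!is.na(test_score)) %>%
  # 3. Keep the columns used in the rest of the analysis
  select(id, country, gender, major, gpa, test_score, satisfaction)

print(student_data1)
```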

Grouping and Summarizing Data

Data grouping and summarization are essential for gaining insights into specific subgroups of the data. We’ll group the dataset by ‘country’ and ‘major’ and then summarize the data by calculating the mean and standard deviation of ‘gpa,’ ‘test_score,’ and ‘satisfaction’ for each group:

Summarizing the dataset:

We calculate each group's mean and standard deviation for ‘gpa,’ ‘test_score,’ and ‘satisfaction’.

## # A tibble: 36 × 8
## # Groups:   country [6]
##    country major mean_gpa sd_gpa mean_test_score sd_test_score mean_satisfaction
##    <chr>   <chr>    <dbl>  <dbl>           <dbl>         <dbl>             <dbl>
##  1 Brazil  Art       2.87  0.587            482.          48.6              2.48
##  2 Brazil  Bio       2.91  0.538            466.          49.9              3.36
##  3 Brazil  CS        3.13  0.532            472.          49.3              2.62
##  4 Brazil  Econ      3.03  0.589            473.          48.3              2.76
##  5 Brazil  Eng       3.04  0.539            473.          52.7              2.45
##  6 Brazil  Math      2.89  0.622            457.          36.2              2.77
##  7 Canada  Art       2.98  0.555            483.          51.5              2.69
##  8 Canada  Bio       2.95  0.671            465.          44.5              2.96
##  9 Canada  CS        2.93  0.566            467.          51.7              3.21
## 10 Canada  Econ      3.17  0.592            467.          45.5              2.81
## # ℹ 26 more rows
## # ℹ 1 more variable: sd_satisfaction <dbl>

This summary provides insights into students' academic performance and satisfaction levels grouped by country and major.
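A grouped summary with these column names could be produced along the following lines. This is a sketch with dplyr on simulated stand-in data, so the values are assumptions and will differ from the table above; only the summary column names are taken from that output.

```r
library(dplyr)

# Simulated stand-in for the transformed dataset.
set.seed(42)
n <- 300
student_data1 <- data.frame(
  country      = sample(c("Brazil", "Canada", "China", "India", "UK", "USA"),
                        n, replace = TRUE),
  major        = sample(c("Art", "Bio", "CS", "Econ", "Eng", "Math"),
                        n, replace = TRUE),
  gpa          = round(runif(n, min = 2, max = 4), 1),
  test_score   = runif(n, min = 350, max = 550),
  satisfaction = sample(1:5, n, replace = TRUE)
)

summary_data <- student_data1 %>%
  group_by(country, major) %>%
  summarise(
    mean_gpa          = mean(gpa),
    sd_gpa            = sd(gpa),
    mean_test_score   = mean(test_score),
    sd_test_score     = sd(test_score),
    mean_satisfaction = mean(satisfaction),
    sd_satisfaction   = sd(satisfaction),
    .groups = "drop_last"  # result stays grouped by country, as in the output
  )

print(summary_data)
```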

Hypothesis Testing and Correlation Analysis

Now, we move on to hypothesis testing and correlation analysis, crucial aspects of data analysis:

T-test to compare the mean GPA of students from China and India:

We’ll perform a t-test to determine if there is a significant difference in GPA between students from China and India.

## 
##  Welch Two Sample t-test
## 
## data:  gpa by country
## t = 0.31124, df = 397.49, p-value = 0.7558
## alternative hypothesis: true difference in means between group China and group India is not equal to 0
## 95 percent confidence interval:
##  -0.09189383  0.12646282
## sample estimates:
## mean in group China mean in group India 
##            3.007576            2.990291

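This test tells us whether the two groups differ significantly in mean GPA. Here they do not: the group means are about 3.01 for China and 2.99 for India, and with p = 0.756 and a 95% confidence interval that spans zero, we cannot reject the hypothesis of equal means.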
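The header ‘data: gpa by country’ in the output suggests the formula interface of t.test. A minimal sketch of such a call, on simulated stand-in data whose values are assumptions, could look like this:

```r
# Simulated stand-in data; values will differ from the article's output.
set.seed(7)
n <- 600
student_data1 <- data.frame(
  country = sample(c("Brazil", "Canada", "China", "India", "UK", "USA"),
                   n, replace = TRUE),
  gpa     = round(runif(n, min = 2, max = 4), 1),
  stringsAsFactors = FALSE
)

# Keep only the two countries being compared.
china_india <- subset(student_data1, country %in% c("China", "India"))

# t.test() runs Welch's test by default (var.equal = FALSE).
gpa_ttest <- t.test(gpa ~ country, data = china_india)
print(gpa_ttest)
```

Welch's version does not assume equal variances in the two groups, which is why the degrees of freedom in the output are not a whole number.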

Correlation analysis between GPA and test scores:

We’ll use a correlation test to measure the correlation between GPA and the ‘test_score’ variable, representing the average SAT, TOEFL, IELTS, and GRE scores.

## 
##  Pearson's product-moment correlation
## 
## data:  student_data1$gpa and student_data1$test_score
## t = 0.21888, df = 998, p-value = 0.8268
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.05508843  0.06889181
## sample estimates:
##         cor 
## 0.006928313

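This correlation analysis assesses the linear relationship between GPA and test performance. The estimated correlation is about 0.007 (p = 0.83), so in this dataset there is no evidence that GPA and the average test score are linearly related.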
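The ‘data:’ line of the output points to a cor.test call on the gpa and test_score columns of student_data1. A sketch on simulated stand-in data, whose values are assumptions, might be:

```r
# Simulated stand-in data; values will differ from the article's output.
set.seed(7)
n <- 1000
student_data1 <- data.frame(
  gpa        = round(runif(n, min = 2, max = 4), 1),
  test_score = runif(n, min = 350, max = 550)
)

# Pearson's product-moment correlation is the default method.
gpa_cor <- cor.test(student_data1$gpa, student_data1$test_score)
print(gpa_cor)
```

The degrees of freedom reported by the test are the number of complete pairs minus two, which matches df = 998 for the 1000 students in the output above.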

Conclusion

In this extensive exploration of student data analysis and visualization using R, we’ve covered various aspects of data analysis, including data generation, data visualization with bar charts, histograms, and box plots, identifying missing values and outliers, data transformation, data summarization, hypothesis testing, and correlation analysis. R’s versatility and power make it an invaluable data analysis and statistics tool.

We’ve taken a deep dive into a synthetic dataset, but the principles and techniques presented here can be applied to real-world data scenarios. Whether you’re a college student, a researcher, or a data enthusiast, R can empower you to extract meaningful insights and make informed decisions based on data. As you continue your journey in data analysis, remember that the world of data is vast and full of discoveries waiting to be made. Happy analyzing!
