Exploratory Data Analysis for International Journals | PhD Insight
Key points
- Exploratory data analysis (EDA) is crucial in any data analysis project. It involves exploring, summarizing, and visualizing your data to gain insights, identify patterns, and detect outliers.
- EDA can also help you formulate hypotheses, choose appropriate statistical tests, and communicate your findings effectively.
- In this article, I will explain how I perform EDA in R using the tidyverse, a collection of packages for data manipulation, visualization, and modeling, following the same workflow I used for my article in an impact-factor journal.
- For this tutorial, I will use a generated dataset that contains information about 1000 students from different countries, their academic performance, and their satisfaction with their university.
- You will learn how to load and view the data in R, summarize the data using descriptive statistics, visualize the data using charts and graphs, identify missing values and outliers, transform and filter the data, perform hypothesis testing and correlation analysis, and generate an EDA report using R Markdown.
Introduction
In data analysis and statistics, R has emerged as a powerful tool for students, researchers, and professionals. This article will delve into the fascinating world of data analysis and visualization using R. Our journey will include generating a synthetic dataset, performing data visualization, identifying missing values and outliers, data transformation, and conducting hypothesis testing and correlation analysis. So, fasten your seatbelts, and let’s embark on this data-driven adventure!
Generating a Dataset
To begin our exploration, we’ll first create a synthetic dataset. Synthetic data allows us to simulate real-world scenarios, and in this case, we’ll generate data for 1000 students. The R code snippet below demonstrates how we can achieve this:
In the code snippet above, we set the seed for reproducibility and create a dataset with various student attributes, including country of origin, gender, age, major, GPA, standardized test scores, and satisfaction level.
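The generation step described above could be sketched as follows. This is a minimal sketch, not the article's exact code: the seed value, the object name student_data, the category labels, and the value ranges are assumptions chosen to resemble the output shown later in the article.

```r
# Assumed seed; any fixed value makes the random draws reproducible
set.seed(123)
n <- 1000

# Category labels and numeric ranges below are illustrative assumptions
student_data <- data.frame(
  id = 1:n,
  country = sample(c("USA", "UK", "Canada", "China", "India", "Brazil"),
                   n, replace = TRUE),
  gender = sample(c("Male", "Female"), n, replace = TRUE),
  age = sample(18:24, n, replace = TRUE),
  major = sample(c("Art", "Bio", "CS", "Econ", "Eng", "Math"),
                 n, replace = TRUE),
  gpa = round(runif(n, min = 2, max = 4), 1),
  sat = sample(seq(1000, 1600, by = 50), n, replace = TRUE),
  toefl = sample(seq(80, 120, by = 5), n, replace = TRUE),
  ielts = round(runif(n, min = 5, max = 9), 1),
  gre = sample(seq(260, 340, by = 10), n, replace = TRUE),
  satisfaction = sample(1:5, n, replace = TRUE),
  stringsAsFactors = FALSE  # keep text columns as character vectors
)
```

Because every column is drawn independently, a dataset built this way should not be expected to contain real relationships between variables.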
Analyzing the Dataset
Before diving into data visualization, it’s crucial to understand the dataset’s structure. We can use R functions like names, dim, and str to achieve this. Additionally, we can display the top five rows of the dataset using the head function. This information provides us with an initial overview of the data:
## [1] "id" "country" "gender" "age" "major"
## [6] "gpa" "sat" "toefl" "ielts" "gre"
## [11] "satisfaction"
## [1] 1000 11
## 'data.frame': 1000 obs. of 11 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ country : chr "India" "Brazil" "USA" "UK" ...
## $ gender : chr "Female" "Male" "Female" "Male" ...
## $ age : int 18 24 18 19 19 18 20 18 19 20 ...
## $ major : chr "CS" "Eng" "CS" "Bio" ...
## $ gpa : num 2.6 3.7 3.2 3.6 2.6 2.3 3.8 2 3.1 3.9 ...
## $ sat : num 1300 1250 1350 1250 1400 1100 1250 1400 1350 1100 ...
## $ toefl : num 90 85 90 110 105 100 80 120 95 100 ...
## $ ielts : num 7.2 5.8 8.1 7.4 7.2 5.7 8.6 8.9 9 5.1 ...
## $ gre : num 260 280 320 330 340 280 290 270 260 310 ...
## $ satisfaction: int 2 5 1 2 3 4 3 5 1 3 ...
## id country gender age major gpa sat toefl ielts gre satisfaction
## 1 1 India Female 18 CS 2.6 1300 90 7.2 260 2
## 2 2 Brazil Male 24 Eng 3.7 1250 85 5.8 280 5
## 3 3 USA Female 18 CS 3.2 1350 90 8.1 320 1
## 4 4 UK Male 19 Bio 3.6 1250 110 7.4 330 2
## 5 5 UK Male 19 Math 2.6 1400 105 7.2 340 3
These functions allow us to inspect the variable names, dataset dimensions, data structure, and initial data rows.
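The four inspection calls can be sketched on a small stand-in data frame. The object name student_data and the three-column toy data are assumptions for illustration; the functions themselves are base R.

```r
# Small stand-in for the full student dataset (illustrative values)
student_data <- data.frame(
  id = 1:6,
  country = c("India", "Brazil", "USA", "UK", "UK", "China"),
  gpa = c(2.6, 3.7, 3.2, 3.6, 2.6, 2.3),
  stringsAsFactors = FALSE
)

names(student_data)    # column names
dim(student_data)      # number of rows, number of columns
str(student_data)      # type and first values of each column
head(student_data, 5)  # first five rows (head() defaults to six)
```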
Data Visualization
Now, we enter the exciting realm of data visualization. Visualizing data is essential for gaining insights and identifying patterns. We’ll create various plots to explore the dataset, including bar charts, histograms, and box plots.
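The three plot types mentioned above might look like this with ggplot2. This is a sketch under assumptions: the choice of variables for each plot, the bin width, and the small random stand-in dataset are illustrative, and the ggplot2 package is assumed to be installed.

```r
library(ggplot2)

# Small random stand-in for the student dataset
set.seed(1)
student_data <- data.frame(
  country = sample(c("USA", "UK", "India"), 60, replace = TRUE),
  major = sample(c("CS", "Bio", "Math"), 60, replace = TRUE),
  gpa = round(runif(60, min = 2, max = 4), 1)
)

# Bar chart: number of students per country
p_bar <- ggplot(student_data, aes(x = country)) +
  geom_bar() +
  labs(title = "Students by country", x = "Country", y = "Count")

# Histogram: distribution of GPA (bin width is an arbitrary choice)
p_hist <- ggplot(student_data, aes(x = gpa)) +
  geom_histogram(binwidth = 0.2) +
  labs(title = "Distribution of GPA", x = "GPA", y = "Count")

# Box plot: GPA spread within each major
p_box <- ggplot(student_data, aes(x = major, y = gpa)) +
  geom_boxplot() +
  labs(title = "GPA by major", x = "Major", y = "GPA")
```

Calling print(p_bar), print(p_hist), or print(p_box) draws the corresponding chart.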
Identifying Missing Values and Outliers
Data quality is paramount in any analysis. Identifying missing values and outliers is a crucial step. We can use R functions to check for missing values and identify outliers in numerical variables:
Checking the number of missing values for each variable:
## id country gender age major gpa
## 0 0 0 0 0 0
## sat toefl ielts gre satisfaction
## 0 0 0 0 0
The output gives the count of missing values for each variable. Every count is zero, so the dataset contains no missing values.
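A per-column count like the one above can be produced with colSums and is.na. The toy data frame below deliberately contains a few NA values (unlike the article's dataset) so that the counts are visible; its name and contents are assumptions.

```r
# Toy data with planted missing values, for illustration only
student_data <- data.frame(
  id = 1:4,
  gpa = c(3.1, NA, 2.8, 3.9),
  sat = c(1300, 1250, NA, NA)
)

# is.na() marks each missing cell; colSums() counts the TRUEs per column
missing_counts <- colSums(is.na(student_data))
missing_counts
```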
Identifying outliers using the IQR method:
We’ll create a function called ‘identify_outliers’ to identify outliers in numerical variables such as ‘age,’ ‘GPA,’ ‘SAT,’ ‘TOEFL,’ ‘IELTS,’ ‘GRE,’ and ‘satisfaction.’ This function calculates the lower and upper bounds for outliers based on the interquartile range (IQR).
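One way such a function could be written is sketched below. The 1.5 multiplier on the IQR is the common convention and is an assumption here, as are the function body and the toy data; only the function name comes from the article.

```r
# Return the values of x that fall outside the IQR fences.
# Fences: Q1 - 1.5 * IQR and Q3 + 1.5 * IQR (1.5 is an assumed multiplier).
identify_outliers <- function(x) {
  q <- unname(quantile(x, probs = c(0.25, 0.75), na.rm = TRUE))
  iqr <- q[2] - q[1]
  lower <- q[1] - 1.5 * iqr
  upper <- q[2] + 1.5 * iqr
  x[!is.na(x) & (x < lower | x > upper)]
}

# Toy data: one planted extreme age, no extreme GPA
toy <- data.frame(
  age = c(18, 19, 19, 20, 21, 45),
  gpa = c(2.6, 3.7, 3.2, 3.6, 2.6, 2.3)
)

# Apply the function to each numeric column; the result is a list
outliers <- lapply(toy[c("age", "gpa")], identify_outliers)
outliers
```

For the toy data, the age column yields the planted value 45 and the gpa column yields an empty vector, which is how a column without outliers appears.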
Check for outliers in each numerical variable:
## [[1]]
## integer(0)
##
## [[2]]
## numeric(0)
##
## [[3]]
## numeric(0)
##
## [[4]]
## numeric(0)
##
## [[5]]
## numeric(0)
##
## [[6]]
## numeric(0)
##
## [[7]]
## integer(0)
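This code checks for outliers in the specified numerical variables and returns the outlier values. Here every element of the result is empty (integer(0) or numeric(0)), so the IQR method flags no outliers in any of the seven variables.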
Data Transformation
Data transformation is a fundamental step in data analysis. In this section, we’ll perform data transformation by creating a new variable and filtering the data:
Creating a new variable called ‘test_score’:
We’ll calculate the average test score from the SAT, TOEFL, IELTS, and GRE scores. This new variable gives a single combined measure of a student’s test performance. Because the four tests use very different scales, a plain average is driven mostly by the SAT score.
Keeping the rows where ‘test_score’ is not missing:
We keep only the rows where the ‘test_score’ variable is not missing, which removes incomplete data points and protects data quality.
Selecting specific columns for analysis:
We’ll select a subset of columns for further analysis, including ‘id,’ ‘country,’ ‘gender,’ ‘major,’ ‘gpa,’ ‘test_score,’ and ‘satisfaction.’
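The three transformation steps above can be chained with dplyr. This is a sketch assuming that the average is a plain arithmetic mean of the four score columns and that the input is called student_data; the result name student_data1 matches the name that appears in the correlation output later in the article. The three rows are copied from the head output shown earlier.

```r
library(dplyr)

# First three rows of the dataset, as shown in the head() output
student_data <- data.frame(
  id = 1:3,
  country = c("India", "Brazil", "USA"),
  gender = c("Female", "Male", "Female"),
  age = c(18L, 24L, 18L),
  major = c("CS", "Eng", "CS"),
  gpa = c(2.6, 3.7, 3.2),
  sat = c(1300, 1250, 1350),
  toefl = c(90, 85, 90),
  ielts = c(7.2, 5.8, 8.1),
  gre = c(260, 280, 320),
  satisfaction = c(2L, 5L, 1L)
)

student_data1 <- student_data %>%
  # New variable: plain mean of the four test scores (assumed formula)
  mutate(test_score = (sat + toefl + ielts + gre) / 4) %>%
  # Keep only rows where the new variable could be computed
  filter(!is.na(test_score)) %>%
  # Keep the columns used in the later analysis
  select(id, country, gender, major, gpa, test_score, satisfaction)

student_data1
```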
Grouping and Summarizing Data
Data grouping and summarization are essential for gaining insights into specific subgroups of the data. We’ll group the dataset by ‘country’ and ‘major’ and then summarize the data by calculating the mean and standard deviation of ‘gpa,’ ‘test_score,’ and ‘satisfaction’ for each group:
Summarizing the dataset:
We calculate each group's mean and standard deviation for ‘gpa,’ ‘test_score,’ and ‘satisfaction’.
## # A tibble: 36 × 8
## # Groups: country [6]
## country major mean_gpa sd_gpa mean_test_score sd_test_score mean_satisfaction
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Brazil Art 2.87 0.587 482. 48.6 2.48
## 2 Brazil Bio 2.91 0.538 466. 49.9 3.36
## 3 Brazil CS 3.13 0.532 472. 49.3 2.62
## 4 Brazil Econ 3.03 0.589 473. 48.3 2.76
## 5 Brazil Eng 3.04 0.539 473. 52.7 2.45
## 6 Brazil Math 2.89 0.622 457. 36.2 2.77
## 7 Canada Art 2.98 0.555 483. 51.5 2.69
## 8 Canada Bio 2.95 0.671 465. 44.5 2.96
## 9 Canada CS 2.93 0.566 467. 51.7 3.21
## 10 Canada Econ 3.17 0.592 467. 45.5 2.81
## # ℹ 26 more rows
## # ℹ 1 more variable: sd_satisfaction <dbl>
This summary provides insights into students' academic performance and satisfaction levels grouped by country and major.
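A grouped summary of this shape can be sketched with dplyr's group_by and summarise. The toy data and the result name summary_data are assumptions; the summary column names follow the tibble output above. Setting .groups = "drop_last" keeps the result grouped by country, which matches the "Groups: country" line in that output.

```r
library(dplyr)

# Toy stand-in with two countries and two majors
student_data1 <- data.frame(
  country = c("Brazil", "Brazil", "Brazil", "Brazil", "Canada", "Canada"),
  major = c("Art", "Art", "Bio", "Bio", "Art", "Art"),
  gpa = c(3.0, 2.6, 3.2, 2.8, 3.4, 3.0),
  test_score = c(480, 460, 470, 450, 490, 470),
  satisfaction = c(2, 3, 4, 3, 3, 2)
)

summary_data <- student_data1 %>%
  group_by(country, major) %>%
  summarise(
    mean_gpa = mean(gpa),
    sd_gpa = sd(gpa),
    mean_test_score = mean(test_score),
    sd_test_score = sd(test_score),
    mean_satisfaction = mean(satisfaction),
    sd_satisfaction = sd(satisfaction),
    .groups = "drop_last"  # result stays grouped by country
  )

summary_data
```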
Hypothesis Testing and Correlation Analysis
Now, we move on to hypothesis testing and correlation analysis, crucial aspects of data analysis:
T-test to compare the mean GPA of students from China and India:
We’ll perform a t-test to determine if there is a significant difference in GPA between students from China and India.
##
## Welch Two Sample t-test
##
## data: gpa by country
## t = 0.31124, df = 397.49, p-value = 0.7558
## alternative hypothesis: true difference in means between group China and group India is not equal to 0
## 95 percent confidence interval:
## -0.09189383 0.12646282
## sample estimates:
## mean in group China mean in group India
## 3.007576 2.990291
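This test helps us understand whether these two groups have statistically significant differences in GPA. With p = 0.7558 and a 95 percent confidence interval that includes 0, the test finds no significant difference in mean GPA between students from China and India.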
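A two-group comparison like this can be run with the formula interface of t.test, which performs Welch's test by default. The sketch below uses a small random stand-in dataset, so its numbers will differ from the output above; the object names are assumptions.

```r
# Random stand-in data with three countries
set.seed(42)
student_data <- data.frame(
  country = rep(c("China", "India", "USA"), each = 30),
  gpa = round(runif(90, min = 2, max = 4), 1),
  stringsAsFactors = FALSE
)

# Keep only the two countries being compared
china_india <- subset(student_data, country %in% c("China", "India"))

# Welch two-sample t-test (var.equal = FALSE is the default)
gpa_test <- t.test(gpa ~ country, data = china_india)
gpa_test
```

The p-value is available as gpa_test$p.value and the confidence interval as gpa_test$conf.int.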
Correlation analysis between GPA and test scores:
We’ll use a correlation test to measure the correlation between GPA and the ‘test_score’ variable, representing the average SAT, TOEFL, IELTS, and GRE scores.
##
## Pearson's product-moment correlation
##
## data: student_data1$gpa and student_data1$test_score
## t = 0.21888, df = 998, p-value = 0.8268
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.05508843 0.06889181
## sample estimates:
## cor
## 0.006928313
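This correlation analysis assesses the relationship between GPA and test performance. The estimated correlation is about 0.007 with p = 0.8268, so there is no evidence of a linear relationship between the two, which is what one would expect if the synthetic variables were generated independently of each other.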
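The output above has the form produced by cor.test, which computes a Pearson correlation by default. A minimal sketch on random stand-in data follows; the values are assumptions, while the name student_data1 and its two columns match the "data:" line in the output.

```r
# Random stand-in data; gpa and test_score are drawn independently
set.seed(7)
student_data1 <- data.frame(
  gpa = round(runif(100, min = 2, max = 4), 1),
  test_score = runif(100, min = 400, max = 520)
)

# Pearson product-moment correlation test (the default method)
gpa_cor <- cor.test(student_data1$gpa, student_data1$test_score)
gpa_cor
```

The estimate is stored in gpa_cor$estimate, and the degrees of freedom equal the number of complete pairs minus two.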
Conclusion
In this exploration of student data analysis and visualization using R, we’ve covered data generation, visualization with bar charts, histograms, and box plots, identification of missing values and outliers, data transformation, data summarization, hypothesis testing, and correlation analysis. R’s versatility and power make it an invaluable tool for data analysis and statistics.
We’ve taken a deep dive into a synthetic dataset, but the principles and techniques presented here can be applied to real-world data scenarios. Whether you’re a college student, a researcher, or a data enthusiast, R can empower you to extract meaningful insights and make informed decisions based on data. As you continue your journey in data analysis, remember that the world of data is vast and full of discoveries waiting to be made. Happy analyzing!