Data Manipulation with dplyr in R: A Comprehensive Guide
https://www.data03.online
Have you ever spent hours wrangling data, feeling like you’re wrestling a slimy octopus in a bathtub? Yeah, me too. You sift through rows and columns, filter, sort, and summarize, your brain churning like a washing machine on a spin cycle. And when you think you’ve tamed the beast, another tentacle of messy data slaps you in the face. Ugh. But what if there was a way to transform that data wrestling match into a graceful ballet of insights? A secret weapon to bend mountains of numbers to your will? Enter dplyr in R, your magic wand for data wrangling mastery.
- Read the complete article and download the code: Data Manipulation with dplyr in R: A Comprehensive Guide.
- You can also Join our community.
- You can also Hire us or get free Exploratory Data Analysis.
- This article written by RStudioDataLab
What is dplyr
Data manipulation is crucial to statistical analysis, enabling researchers and analysts to glean valuable insights from datasets. The dplyr
package in R serves as a versatile toolkit for these tasks. Whether you are a beginner or an experienced data scientist, mastering dplyr
can significantly enhance your ability to handle and analyze data effectively.
Getting Started with dplyr
To start this data wrangling using dplyr
in R, start by dplyr
in R the dplyr
package.
Next, we’ll use the ‘mtcars’ dataset for hands-on practice. This dataset, a collection of automobile specifications, will serve as our playground for mastering the art of data manipulation.
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
For a detailed introduction to the installation process, refer to the dplyr on using dplyr
in R.
Data Selection and Removal
The first step in data manipulation is often selecting relevant columns. With the select
function in dplyr
, you can precisely choose the variables of interest. In our example, we remove the ‘carb’ column, deemed irrelevant for our analysis.
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
For a deeper understanding of the select
function and the removal of columns, check out this article on Aggregate count.
Exploratory Data Analysis
Once we’ve prepared our dataset, we will conduct exploratory data analysis (EDA). This involves gaining insights into the characteristics of our data, such as descriptive statistics and filtering based on specific conditions.
Let’s start by using the summary
function to obtain descriptive statistics for the ‘mpg’ variable:
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
This provides a quick overview of key statistical measures such as mean, median, and quartiles for the ‘mpg’ variable.
Now, let’s filter the data to isolate cars with fuel efficiency (mpg) less than or equal to 20.09:
## mpg cyl disp hp drat wt qsec vs am gear carb
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
This allows us to focus on specific subsets of the data that meet certain criteria, revealing valuable patterns.
For further insights into EDA concepts and techniques, explore this article on data analysis.
Sorting and Arranging Data
Sorting data is a crucial step in organizing information for better interpretation. We can arrange our dataset based on a chosen variable with the arrange function. Let’s sort our data in descending order of ‘mpg’:
## mpg cyl disp hp drat wt qsec vs am gear carb
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
It facilitates a clearer understanding of the dataset, especially when examining trends or identifying outliers.
Creating New Variables
Creating new variables is often necessary for data manipulation to derive additional insights. The mutate
function in dplyr
allows us to add a new variable, in this case, a ‘ratio’ variable based on the relationship between ‘mpg’ and ‘hp’:
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
## ratio
## Mazda RX4 0.19090909
## Mazda RX4 Wag 0.19090909
## Datsun 710 0.24516129
## Hornet 4 Drive 0.19454545
## Hornet Sportabout 0.10685714
## Valiant 0.17238095
## Duster 360 0.05836735
## Merc 240D 0.39354839
## Merc 230 0.24000000
## Merc 280 0.15609756
## Merc 280C 0.14471545
## Merc 450SE 0.09111111
## Merc 450SL 0.09611111
## Merc 450SLC 0.08444444
## Cadillac Fleetwood 0.05073171
## Lincoln Continental 0.04837209
## Chrysler Imperial 0.06391304
## Fiat 128 0.49090909
## Honda Civic 0.58461538
## Toyota Corolla 0.52153846
## Toyota Corona 0.22164948
## Dodge Challenger 0.10333333
## AMC Javelin 0.10133333
## Camaro Z28 0.05428571
## Pontiac Firebird 0.10971429
## Fiat X1-9 0.41363636
## Porsche 914-2 0.28571429
## Lotus Europa 0.26902655
## Ford Pantera L 0.05984848
## Ferrari Dino 0.11257143
## Maserati Bora 0.04477612
## Volvo 142E 0.19633028
This newly created variable provides a nuanced perspective on the data, capturing the ratio of fuel efficiency to horsepower.
For a detailed guide on creating new variables, refer to Create variable.
Summarizing Data
Summarizing data is essential for obtaining aggregated information. The summarize
function in dplyr
helps calculate summary statistics. Let’s explore the average ‘mpg’ and ‘hp’ in our dataset:
## average_mpg average_hp
## 1 20.09062 146.6875
This provides a concise summary, offering key insights into the central tendencies of the variables.
For more advanced techniques in summarizing data, delve into Descriptive Analysis and Descriptive Statistics.
Group-wise Operations
Grouping data is pivotal when analyzing trends within specific categories. The group_by
function in dplyr
facilitates such operations. Let’s begin by grouping the data by the ‘cyl’ variable and calculating the average ‘mpg’ and standard deviation of ‘hp’:
## # A tibble: 3 × 3
## cyl avg_mpg sd_hp
## <dbl> <dbl> <dbl>
## 1 4 26.7 20.9
## 2 6 19.7 24.3
## 3 8 15.1 51.0
This operation allows us to discern patterns and variations across different cylinder counts.
Expanding our exploration, let’s group the data by the ‘am’ variable and summarize the count and proportion:
## # A tibble: 2 × 3
## am count prop
## <dbl> <int> <dbl>
## 1 0 19 0.594
## 2 1 13 0.406
Understanding the count and proportion within each category provides valuable context for further analysis.
For more insights into the power of data analysis, refer to DataAnalysis.
Renaming and Transforming Variables
Properly naming variables enhances the clarity of your dataset. Renaming and transforming variables is straightforward. Let’s rename the ‘mpg’ variable to ‘miles_per_gallon’:
## miles_per_gallon
## Mazda RX4 21.0
## Mazda RX4 Wag 21.0
## Datsun 710 22.8
## Hornet 4 Drive 21.4
## Hornet Sportabout 18.7
## Valiant 18.1
## Duster 360 14.3
## Merc 240D 24.4
## Merc 230 22.8
## Merc 280 19.2
It not only improves readability but also ensures consistency in variable names.
For a comprehensive guide on renaming variables, explore the New variables in R.
Now, let’s transform all column names to uppercase for a standardized format:
## MPG CYL DISP HP DRAT WT QSEC VS AM GEAR CARB
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Consistent naming conventions contribute to a more organized and professional dataset.
Data Merging and Joining
Combining datasets is a common requirement in data analysis. With dplyr
, merging and joining datasets is seamless. Let’s start by left joining the mtcars and iris datasets based on ‘cyl’ and ‘Sepal.Width’:
## mpg cyl disp hp drat wt qsec vs am gear carb Sepal.Length Petal.Length
## 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 NA NA
## 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 NA NA
## 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 5.8 1.2
## 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 NA NA
## 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 NA NA
## Petal.Width Species
## 1 NA <NA>
## 2 NA <NA>
## 3 0.2 setosa
## 4 NA <NA>
## 5 NA <NA>
This operation allows us to enrich our automotive dataset with additional information from the iris dataset.
Explore the concept of inner joins and their applications by referring to the Create variable.
Conditional Data Transformation
In real-world scenarios, it’s common to perform conditional transformations on data. The dplyr
package provides the tools to execute these transformations with precision. Let’s select specific values in the ‘cyl’ column and replace them with 99:
## cyl
## Mazda RX4 6
## Mazda RX4 Wag 6
## Datsun 710 99
## Hornet 4 Drive 6
## Hornet Sportabout 8
This targeted transformation allows for adjustments based on specific conditions, ensuring the data aligns with analytical requirements.
For further exploration into conditional transformations, refer to Aggregate count.
Handling Missing Values
Dealing with missing data is a critical aspect of data manipulation. Let’s introduce missing values into the ‘mpg’ column of the mtcars dataset and explore strategies for handling them:
## [1] "mpg"
This snippet demonstrates the introduction of missing values and identifies columns with missing data.
Now, let’s replace the missing values in ‘mpg’ with 0:
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 0.0 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Handling missing values ensures the robustness of our analysis and maintains data integrity.
For a deeper understanding of handling missing data, refer to Exploring R Date Format Complexities.
In-depth Analysis with Grouping
Grouping data offers a nuanced perspective, especially when conducting in-depth analyses. Let’s group the data by ‘cyl’ and calculate the mean and standard deviation of ‘mpg’:
## # A tibble: 3 × 3
## cyl mean_mpg sd_mpg
## <dbl> <dbl> <dbl>
## 1 4 26.7 4.51
## 2 6 19.7 1.45
## 3 8 15.1 2.56
This in-depth analysis provides insights into fuel efficiency trends across different cylinder counts.
Conclusion
In conclusion, mastering data manipulation with dplyr
in R opens up a world of possibilities for analysts and data scientists. We’ve covered an array of essential functions, from selecting and transforming variables to grouping and summarizing data. The journey through exploratory data analysis, conditional transformations, and handling missing values has equipped you with valuable skills.
As we wrap up, it’s crucial to emphasize the significance of these techniques in real-world scenarios. Efficient data manipulation forms the backbone of insightful statistical analyses, providing the foundation for data-driven decision-making.
Next Steps and Advanced Techniques
While this guide provides a solid foundation, the world of data manipulation is vast and dynamic. To delve deeper into advanced techniques and enhance your proficiency, consider exploring the following resources:
- Analyzing Data in R: A Beginner’s Guide — A comprehensive guide to analyzing data in R, complementing your data manipulation skills.
- Data Analysis Concepts and Techniques — Gain real-world insights into data analysis concepts and techniques to broaden your analytical toolkit.
- Statistics: A Guide from Basics to Machine Learning — Expand your statistical knowledge, bridging the gap between fundamental concepts and advanced machine learning applications.
- ggplot2: Comprehensive Guide to Data Visualization — Dive into the world of data visualization using ggplot2, a powerful package for creating compelling visualizations.
For a quick reference, you can explore the dplyr Cheat Sheet, which provides concise guidance on essential functions.
Final Thoughts
In the ever-evolving landscape of data science, mastering tools dplyr
positions you as a proficient data manipulator. As you continue your journey, remember that practice is key. Work on diverse datasets, experiment with different functions and embrace the challenges that come with real-world data.
May your data manipulation endeavors be insightful and transformative.
Note: The provided links are valuable resources for further exploration and reference throughout your data manipulation journey.