Hands of Tutorial on Data Cleaning, Data Manipulation in R

R
Data manipulation
Data cleaning
Data Analysis
Published

June 1, 2025

Interactive Tutorial with Air Quality Data

This document demonstrates the use of dplyr functions for manipulating environmental data from the airquality dataset, which includes ozone levels, solar radiation, wind speed, temperature, month, and day. It includes separate sections for each core function (select(), filter(), mutate(), arrange(), group_by(), summarise()) with 10+ examples each.

Setup

Load the dplyr package and view the dataset structure.

Code
library(dplyr)
str(airquality)

1. select(): Column Selection

Choose specific columns by name or pattern.

  1. Select Ozone and Temp.
Code
airquality %>% select(Ozone, Temp)
  1. Select columns from Month to Wind.
Code
airquality %>% select(Month:Wind)
  1. Exclude Day.
Code
airquality %>% select(-Day)
  1. Select columns starting with “T”.
Code
airquality %>% select(starts_with("T"))
  1. Select columns ending with “R”.
Code
airquality %>% select(ends_with("R"))
  1. Select columns containing “o”.
Code
airquality %>% select(contains("o"))
  1. Rename during selection.
Code
airquality %>% select(Ozone_Level = Ozone, Solar_Radiation = Solar.R)
  1. Select Month, Day, and everything else.
Code
airquality %>% select(Month, Day, everything())
  1. Select all numeric columns.
Code
airquality %>% select(where(is.numeric))
  1. Select columns with mean > 10 (numeric).
Code
airquality %>% select(where(~ is.numeric(.x) && mean(.x, na.rm = TRUE) > 10))

2. filter(): Row Subsetting

Select rows based on conditions.

  1. Days with Temp > 85°F.
Code
airquality %>% filter(Temp > 85)
  1. Days in August.
Code
airquality %>% filter(Month == 8)
  1. Days with Wind < 5 and Temp > 90.
Code
airquality %>% filter(Wind < 5, Temp > 90)
  1. Days with missing Solar.R.
Code
airquality %>% filter(is.na(Solar.R))
  1. Days with non-missing Ozone.
Code
airquality %>% filter(!is.na(Ozone))
  1. Days with Temp between 75 and 85.
Code
airquality %>% filter(between(Temp, 75, 85))
  1. Days with Ozone > 100 or Solar.R > 250.
Code
airquality %>% filter(Ozone > 100 | Solar.R > 250)
  1. Days in July after the 15th.
Code
airquality %>% filter(Month == 7, Day > 15)
  1. Top 3 days by Temp.
Code
airquality %>% slice_max(Temp, n = 3)
  1. Bottom 5 days by Wind.
Code
airquality %>% slice_min(Wind, n = 5)

3. mutate(): Column Creation

Create or modify variables based on existing ones.

  1. Convert Temp to Celsius.
Code
airquality %>% mutate(Temp_C = round((Temp - 32) * 5/9, 1))
  1. Flag days with Temp > 90.
Code
airquality %>% mutate(Heat_Alert = Temp > 90)
  1. Categorize wind speed.
Code
airquality %>% mutate(Wind_Type = case_when(
  Wind < 5 ~ "Calm",
  Wind < 10 ~ "Breeze",
  TRUE ~ "Windy"
))
  1. Calculate Ozone per Solar.R.
Code
airquality %>% mutate(Ozone_per_Solar = Ozone / Solar.R)
  1. Calculate Temp deviation from mean.
Code
airquality %>% mutate(Temp_Deviation = Temp - mean(Temp, na.rm = TRUE))
  1. Multiple transformations: Temp_C and Heat_Index.
Code
airquality %>% mutate(
  Temp_C = (Temp - 32) * 5/9,
  Heat_Index = 0.5 * Temp + 0.3 * Solar.R / 10
)
  1. Create a date column.
Code
airquality %>% mutate(Date = paste("1973", Month, Day, sep = "-"))
  1. Standardize Ozone (z-score).
Code
airquality %>% mutate(Ozone_std = (Ozone - mean(Ozone, na.rm = TRUE)) / sd(Ozone, na.rm = TRUE))
  1. Create a binary flag for high Wind (> 10).
Code
airquality %>% mutate(High_Wind = Wind > 10)
  1. Calculate Solar.R to Temp ratio.
Code
airquality %>% mutate(Solar_per_Temp = Solar.R / Temp)

4. arrange(): Sorting Rows

Sort rows by column values.

  1. Sort by Ozone ascending.
Code
airquality %>% arrange(Ozone)
  1. Sort by Temp descending.
Code
airquality %>% arrange(desc(Temp))
  1. Sort by Month ascending, then Day descending.
Code
airquality %>% arrange(Month, desc(Day))
  1. Sort by Solar.R descending.
Code
airquality %>% arrange(desc(Solar.R))
  1. Sort by Wind ascending, then Temp descending.
Code
airquality %>% arrange(Wind, desc(Temp))
  1. Sort by Ozone deviation from mean.
Code
airquality %>% arrange(abs(Ozone - mean(Ozone, na.rm = TRUE)))
  1. Sort by Temp descending within Month.
Code
airquality %>% group_by(Month) %>% arrange(desc(Temp)) %>% ungroup()
  1. Sort by Solar.R ascending, then Ozone descending.
Code
airquality %>% arrange(Solar.R, desc(Ozone))
  1. Sort by Wind descending.
Code
airquality %>% arrange(desc(Wind))
  1. Sort by Month and Ozone ascending.
Code
airquality %>% arrange(Month, Ozone)

5. group_by(): Grouping Data

Group data for subsequent operations without immediate summarization.

  1. Group by Month and calculate within-group Ozone rank.
Code
airquality %>% group_by(Month) %>% mutate(Ozone_Rank = rank(Ozone, na.last = "keep")) %>% ungroup()
  1. Group by Month and select top day by Temp.
Code
airquality %>% group_by(Month) %>% slice_max(Temp, n = 1) %>% ungroup()
  1. Group by Month and filter days with Wind < 5.
Code
airquality %>% group_by(Month) %>% filter(Wind < 5) %>% ungroup()
  1. Group by Month and add row number within each group.
Code
airquality %>% group_by(Month) %>% mutate(Group_Row = row_number()) %>% ungroup()
  1. Group by Month and calculate Temp deviation within groups.
Code
airquality %>% group_by(Month) %>% mutate(Temp_Dev = Temp - mean(Temp, na.rm = TRUE)) %>% ungroup()
  1. Group by Month and select bottom 3 days by Ozone.
Code
airquality %>% group_by(Month) %>% slice_min(Ozone, n = 3, na.rm = TRUE) %>% ungroup()
  1. Group by Month and flag high Solar.R days (> 200).
Code
airquality %>% group_by(Month) %>% mutate(High_Solar = Solar.R > 200) %>% ungroup()
  1. Group by Month and sort by Wind within groups.
Code
airquality %>% group_by(Month) %>% arrange(Wind) %>% ungroup()
  1. Group by Month and calculate lagged Temp difference.
Code
airquality %>% group_by(Month) %>% mutate(Temp_Diff = Temp - lag(Temp)) %>% ungroup()
  1. Group by Month and filter non-missing Ozone days.
Code
airquality %>% group_by(Month) %>% filter(!is.na(Ozone)) %>% ungroup()

6. summarise(): Dataset-Wide Summaries

Compute summary statistics across the entire dataset.

  1. Mean Temp across all data.
Code
airquality %>% summarise(Avg_Temp = mean(Temp, na.rm = TRUE))
  1. Median Ozone.
Code
airquality %>% summarise(Median_Ozone = median(Ozone, na.rm = TRUE))
  1. Standard deviation of Wind.
Code
airquality %>% summarise(SD_Wind = sd(Wind, na.rm = TRUE))
  1. Min and max Solar.R.
Code
airquality %>% summarise(Min_Solar = min(Solar.R, na.rm = TRUE), Max_Solar = max(Solar.R, na.rm = TRUE))
  1. Count of non-missing Ozone observations.
Code
airquality %>% summarise(Non_Missing_Ozone = sum(!is.na(Ozone)))
  1. Number of days with Temp > 85.
Code
airquality %>% summarise(Hot_Days = sum(Temp > 85, na.rm = TRUE))
  1. Correlation between Ozone and Temp.
Code
airquality %>% summarise(Corr_Ozone_Temp = cor(Ozone, Temp, use = "complete.obs"))
  1. Total number of observations.
Code
airquality %>% summarise(Total_Days = n())
  1. Mean of multiple columns.
Code
airquality %>% summarise(across(c(Ozone, Solar.R, Temp), mean, na.rm = TRUE))
  1. Proportion of days with Wind < 7.
Code
airquality %>% summarise(Prop_Low_Wind = mean(Wind < 7, na.rm = TRUE))

7. group_by() + summarise(): Aggregations

Group data and compute summary statistics.

  1. Mean Temp across all data.
Code
airquality %>% summarise(Avg_Temp = mean(Temp, na.rm = TRUE))
  1. Max Temp by Month.
Code
airquality %>% group_by(Month) %>% summarise(Max_Temp = max(Temp))
  1. Multiple summaries by Month.
Code
airquality %>% 
  group_by(Month) %>% 
  summarise(
    Days = n(),
    Avg_Ozone = mean(Ozone, na.rm = TRUE),
    High_Ozone_Days = sum(Ozone > 60, na.rm = TRUE)
  )
  1. Mean of Ozone and Solar.R by Month.
Code
airquality %>% 
  group_by(Month) %>% 
  summarise(across(c(Ozone, Solar.R), mean, na.rm = TRUE))
  1. Count windy days by Month.
Code
airquality %>% 
  group_by(Month) %>% 
  summarise(Windy_Days = sum(Wind > 10, na.rm = TRUE))
  1. Median Temp by Month.
Code
airquality %>% 
  group_by(Month) %>% 
  summarise(Median_Temp = median(Temp))
  1. Standard deviation of Ozone by Month.
Code
airquality %>% 
  group_by(Month) %>% 
  summarise(SD_Ozone = sd(Ozone, na.rm = TRUE))
  1. Count days with Solar.R > 200 by Month.
Code
airquality %>% 
  group_by(Month) %>% 
  summarise(High_Solar_Days = sum(Solar.R > 200, na.rm = TRUE))
  1. Min and max Wind by Month.
Code
airquality %>% 
  group_by(Month) %>% 
  summarise(Min_Wind = min(Wind), Max_Wind = max(Wind))
  1. Proportion of hot days by Month.
Code
airquality %>% 
  group_by(Month) %>% 
  summarise(Prop_Hot = mean(Temp > 85, na.rm = TRUE))

8. Combined Workflows

Combine multiple dplyr functions for complex tasks, grouped by analysis type.

Temperature Analysis

  1. Hot days in August.
Code
airquality %>% filter(Month == 8, Temp > 90) %>% select(Day, Temp)
  1. Temperature conversion.
Code
airquality %>% select(Temp) %>% mutate(Temp_C = (Temp - 32) * 5/9)
  1. Monthly temperature stats.
Code
airquality %>% group_by(Month) %>% summarise(Avg_Temp = mean(Temp)) %>% arrange(desc(Avg_Temp))
  1. Hot days with high wind.
Code
airquality %>% filter(Temp > 85, Wind > 10) %>% select(Month, Day, Temp, Wind)
  1. Temperature range analysis.
Code
airquality %>% 
  mutate(Temp_Range = cut(Temp, breaks = seq(50, 100, 10))) %>% 
  group_by(Temp_Range) %>% 
  summarise(Days = n())

Ozone Analysis

  1. High ozone days.
Code
airquality %>% filter(Ozone > 100) %>% select(Month, Day, Ozone) %>% arrange(desc(Ozone))
  1. Ozone completeness.
Code
airquality %>% filter(!is.na(Ozone)) %>% group_by(Month) %>% summarise(Complete_Days = n())
  1. Ozone-temperature relationship.
Code
airquality %>% 
  mutate(Ozone_Level = if_else(Ozone > 50, "High", "Low")) %>% 
  group_by(Ozone_Level) %>% 
  summarise(Mean_Temp = mean(Temp, na.rm = TRUE))
  1. Ozone by month with sorting.
Code
airquality %>% 
  group_by(Month) %>% 
  summarise(Max_Ozone = max(Ozone, na.rm = TRUE)) %>% 
  arrange(desc(Max_Ozone))
  1. Ozone deviation analysis.
Code
airquality %>% 
  mutate(Ozone_Dev = Ozone - mean(Ozone, na.rm = TRUE)) %>% 
  select(Month, Day, Ozone, Ozone_Dev) %>% 
  arrange(desc(Ozone_Dev))

Solar Radiation Analysis

  1. Sunny days.
Code
airquality %>% filter(Solar.R > 200) %>% select(Month, Day, Solar.R) %>% arrange(desc(Solar.R))
  1. Solar by wind condition.
Code
airquality %>% 
  mutate(Windy = Wind > 10) %>% 
  group_by(Windy) %>% 
  summarise(Mean_Solar = mean(Solar.R, na.rm = TRUE))
  1. Monthly solar averages.
Code
airquality %>% group_by(Month) %>% summarise(Solar_Avg = mean(Solar.R, na.rm = TRUE)) %>% arrange(Month)
  1. Solar and temperature correlation.
Code
airquality %>% 
  filter(!is.na(Solar.R), !is.na(Temp)) %>% 
  summarise(Corr_Solar_Temp = cor(Solar.R, Temp))
  1. High solar days by month.
Code
airquality %>% 
  filter(Solar.R > 250) %>% 
  group_by(Month) %>% 
  summarise(High_Solar_Days = n())

Wind Analysis

  1. Windiest days.
Code
airquality %>% arrange(desc(Wind)) %>% select(Month, Day, Wind) %>% head(10)
  1. Calm day temperatures.
Code
airquality %>% filter(Wind < 5) %>% group_by(Month) %>% summarise(Mean_Temp = mean(Temp))
  1. Wind speed categories.
Code
airquality %>% 
  mutate(Wind_Category = cut(Wind, breaks = c(0, 5, 10, 20))) %>% 
  group_by(Wind_Category) %>% 
  summarise(Days = n())
  1. Wind and ozone relationship.
Code
airquality %>% 
  filter(!is.na(Ozone)) %>% 
  group_by(Month) %>% 
  summarise(Corr_Wind_Ozone = cor(Wind, Ozone, use = "complete.obs"))
  1. Low wind days.
Code
airquality %>% filter(Wind < 7) %>% select(Month, Day, Wind) %>% arrange(Wind)

Advanced Metrics

  1. Heat index calculation.
Code
airquality %>% 
  mutate(Heat_Index = 0.5 * Temp + 0.3 * Solar.R / 10) %>% 
  select(Month, Day, Heat_Index) %>% 
  arrange(desc(Heat_Index))
  1. Ozone exposure risk.
Code
airquality %>% 
  mutate(Ozone_Risk = Ozone * Temp / 100) %>% 
  filter(!is.na(Ozone_Risk)) %>% 
  arrange(desc(Ozone_Risk))
  1. Solar efficiency.
Code
airquality %>% 
  filter(Solar.R > 0) %>% 
  mutate(Solar_per_Temp = Solar.R / Temp) %>% 
  select(Month, Day, Solar_per_Temp)
  1. Comfort index by month.
Code
airquality %>% 
  mutate(Comfort_Index = Temp - Wind * 0.5) %>% 
  group_by(Month) %>% 
  summarise(Mean_Comfort = mean(Comfort_Index, na.rm = TRUE))
  1. High-risk air quality days.
Code
airquality %>% 
  filter(Ozone > 80, Temp > 85) %>% 
  select(Month, Day, Ozone, Temp) %>% 
  arrange(desc(Ozone))

Missing Data Handling

  1. Find incomplete records.
Code
airquality %>% filter(is.na(Ozone) | is.na(Solar.R)) %>% select(Month, Day)
  1. Monthly missing ozone count.
Code
airquality %>% group_by(Month) %>% summarise(Missing_Ozone = sum(is.na(Ozone)))
  1. Complete cases only.
Code
airquality %>% filter(complete.cases(.)) %>% group_by(Month) %>% summarise(Days = n())
  1. Missing solar data analysis.
Code
airquality %>% filter(is.na(Solar.R)) %>% group_by(Month) %>% summarise(Missing_Solar = n())
  1. Complete data subset.
Code
airquality %>% filter(complete.cases(.)) %>% select(Ozone, Solar.R, Temp) %>% arrange(Month, Day)

Multi-Variable Analysis

  1. Weather profiles.
Code
airquality %>% 
  mutate(Weather_Type = case_when(
    Temp > 85 & Wind < 5 ~ "HotStill",
    Temp > 85 & Wind > 10 ~ "HotWindy",
    Temp < 70 ~ "Cool",
    TRUE ~ "Mild"
  )) %>% 
  group_by(Weather_Type) %>% 
  summarise(Days = n())
  1. Ozone by temperature ranges.
Code
airquality %>% 
  mutate(Temp_Range = cut(Temp, breaks = seq(50, 100, 10))) %>% 
  group_by(Temp_Range) %>% 
  summarise(Mean_Ozone = mean(Ozone, na.rm = TRUE))
  1. Solar-wind interaction.
Code
airquality %>% 
  mutate(Solar_Wind = Solar.R / Wind) %>% 
  filter(!is.na(Solar_Wind)) %>% 
  group_by(Month) %>% 
  summarise(Mean_Solar_Wind = mean(Solar_Wind))

Time-Based Analysis

  1. Monthly trends.
Code
airquality %>% 
  group_by(Month) %>% 
  summarise(across(c(Ozone, Temp), mean, na.rm = TRUE)) %>% 
  arrange(Month)
  1. Mid-month analysis.
Code
airquality %>% 
  mutate(Day_Category = if_else(Day < 15, "Early", "Late")) %>% 
  group_by(Month, Day_Category) %>% 
  summarise(Mean_Temp = mean(Temp))
  1. Daily fluctuations.
Code
airquality %>% 
  arrange(Month, Day) %>% 
  group_by(Month) %>% 
  mutate(Temp_Change = Temp - lag(Temp)) %>% 
  select(Month, Day, Temp_Change)

Data Export Prep

  1. Clean dataset for export.
Code
airquality %>% 
  filter(complete.cases(.)) %>% 
  mutate(Temp_C = round((Temp - 32) * 5/9, 1)) %>% 
  select(Month, Day, Ozone, Solar.R, Temp_F = Temp, Temp_C) %>% 
  arrange(Month, Day)
  1. Monthly report.
Code
airquality %>% 
  group_by(Month) %>% 
  summarise(
    Ozone = mean(Ozone, na.rm = TRUE),
    Solar.R = mean(Solar.R, na.rm = TRUE),
    Temp = mean(Temp)
  ) %>% 
  arrange(Month)
  1. Alert days report.
Code
airquality %>% 
  mutate(
    Ozone_Alert = Ozone > 60,
    Heat_Alert = Temp > 90
  ) %>% 
  filter(Ozone_Alert | Heat_Alert) %>% 
  select(Month, Day, Ozone_Alert, Heat_Alert) %>% 
  arrange(Month, Day)

9. Pro Tips for Effective Learning

Progressive Complexity

  1. Start with one step.
Code
airquality %>% select(Ozone)
  1. Then two steps.
Code
airquality %>% select(Ozone) %>% filter(!is.na(Ozone))
  1. Then three steps.
Code
airquality %>% 
  select(Ozone) %>% 
  filter(!is.na(Ozone)) %>% 
  mutate(Ozone_Level = if_else(Ozone > 50, "High", "Low"))

Experiment Systematically

  • Change filter thresholds (e.g., Temp > 80Temp > 85).
  • Modify grouping variables (e.g., group_by(Month)group_by(Month, Day_Category)).
  • Try different summary functions (e.g., meanmedian).

Visualize Results

Visualize data to enhance understanding (requires ggplot2).

Code
library(ggplot2)
airquality %>% 
  group_by(Month) %>% 
  summarise(Avg_Temp = mean(Temp)) %>% 
  ggplot(aes(Month, Avg_Temp)) + geom_col()

“In data science, 80% of the work is preparation. Master dplyr and you’ve mastered the most critical skill!”
Hadley Wickham (Chief Scientist at Posit PBC)