Top R Interview Questions and Answers (2025)
Question: How do you merge datasets in R?
Answer:
In R, datasets can be merged using the merge() function, which combines two data frames by common columns or row names, similar to SQL joins. The all, all.x, and all.y arguments control the type of join (inner, full outer, left, or right).
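Syntax of the merge() function (data frame method):
merge(x, y, by = intersect(names(x), names(y)), by.x = by, by.y = by, all = FALSE, all.x = all, all.y = all, sort = TRUE, suffixes = c(".x", ".y"), no.dups = TRUE, incomparables = NULL, ...)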
Arguments:
- x, y: The data frames to be merged.
- by: A character vector specifying the column(s) to merge on. If not provided, the function will merge on columns with the same name in both datasets.
- by.x and by.y: The column names in the first (x) and second (y) data frames to merge on. These are used if the column names differ between the two data frames.
- all: If TRUE, it performs a full outer join. If FALSE (default), it performs an inner join.
- all.x: If TRUE, it performs a left join (all rows from x will be kept).
- all.y: If TRUE, it performs a right join (all rows from y will be kept).
- sort: If TRUE (default), the result will be sorted by the merged column(s).
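The default for by and the row-name option can be sketched as follows. This is a minimal illustration; the data frames left, right, scores, and ages are hypothetical and not part of the examples further down.

```r
# Hypothetical example data, used only for this sketch.
left  <- data.frame(ID = c(1, 2, 3), Name = c("Alice", "Bob", "Charlie"))
right <- data.frame(ID = c(2, 3, 4), Age = c(25, 30, 22))

# With no `by`, merge() uses the column names both data frames share ("ID" here).
default_by  <- merge(left, right)
explicit_by <- merge(left, right, by = "ID")
print(identical(default_by, explicit_by))

# by = "row.names" matches rows on their row names instead of a column.
scores <- data.frame(Score = c(85, 90), row.names = c("Alice", "Bob"))
ages   <- data.frame(Age = c(25, 31), row.names = c("Bob", "Dana"))
by_rownames <- merge(scores, ages, by = "row.names")

# The matched row names are returned in an extra column called Row.names.
print(by_rownames)
```

Spelling out by explicitly is usually the safer habit, since an unexpected shared column name would otherwise silently become part of the join key.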
Types of Joins:
- Inner Join: Only keeps the rows where there is a match in both datasets.
- Left Join: Keeps all rows from the left dataset and only matching rows from the right dataset.
- Right Join: Keeps all rows from the right dataset and only matching rows from the left dataset.
- Full Outer Join: Keeps all rows from both datasets, filling in NA where there are no matches.
Examples of Merging Datasets:
1. Inner Join (default)
An inner join combines rows where there is a match in both datasets.
# Data frames
df1 <- data.frame(ID = c(1, 2, 3), Name = c("Alice", "Bob", "Charlie"))
df2 <- data.frame(ID = c(2, 3, 4), Age = c(25, 30, 22))
# Merging on 'ID' (common column)
merged_df <- merge(df1, df2, by = "ID")
print(merged_df)
Output:
  ID    Name Age
1  2     Bob  25
2  3 Charlie  30
In this example, only rows with matching ID values (2 and 3) are included in the merged result.
2. Left Join
A left join keeps all rows from the left dataset (df1) and only matching rows from the right dataset (df2).
# Left Join
left_joined_df <- merge(df1, df2, by = "ID", all.x = TRUE)
print(left_joined_df)
Output:
  ID    Name Age
1  1   Alice  NA
2  2     Bob  25
3  3 Charlie  30
In this case, the row for ID = 1 is kept from df1, but since there is no matching row in df2, the Age column is filled with NA.
3. Right Join
A right join keeps all rows from the right dataset (df2) and only matching rows from the left dataset (df1).
# Right Join
right_joined_df <- merge(df1, df2, by = "ID", all.y = TRUE)
print(right_joined_df)
Output:
  ID    Name Age
1  2     Bob  25
2  3 Charlie  30
3  4    <NA>  22
Here, the row for ID = 4 is kept from df2, but since there is no matching row in df1, the Name column is filled with NA.
4. Full Outer Join
A full outer join keeps all rows from both datasets, filling in NA where there is no match.
# Full Outer Join
full_joined_df <- merge(df1, df2, by = "ID", all = TRUE)
print(full_joined_df)
Output:
  ID    Name Age
1  1   Alice  NA
2  2     Bob  25
3  3 Charlie  30
4  4    <NA>  22
In this case, rows from both df1 and df2 are kept, with NA filling in the missing values.
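One practical use of those NA values is finding the rows that had no partner in the other data frame. The sketch below assumes that Name and Age contain no NA values in the original data frames, so an NA in the merged result can only come from a missing match.

```r
df1 <- data.frame(ID = c(1, 2, 3), Name = c("Alice", "Bob", "Charlie"))
df2 <- data.frame(ID = c(2, 3, 4), Age = c(25, 30, 22))

# Full outer join keeps every ID from both sides.
full_joined_df <- merge(df1, df2, by = "ID", all = TRUE)

# Assumption: the source columns have no NA of their own,
# so NA in Age means "no match in df2" and NA in Name means "no match in df1".
only_in_df1 <- full_joined_df[is.na(full_joined_df$Age), ]
only_in_df2 <- full_joined_df[is.na(full_joined_df$Name), ]

print(only_in_df1$ID)
print(only_in_df2$ID)
```

If the source columns can legitimately contain NA, this check is not reliable, and comparing the ID columns directly (for example with %in%) is the safer route.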
5. Merging on Different Column Names
If the columns on which you want to merge have different names in the two data frames, you can use the by.x and by.y arguments.
# Data frames with different column names for merging
df1 <- data.frame(ID1 = c(1, 2, 3), Name = c("Alice", "Bob", "Charlie"))
df2 <- data.frame(ID2 = c(2, 3, 4), Age = c(25, 30, 22))
# Merging on 'ID1' from df1 and 'ID2' from df2
merged_df <- merge(df1, df2, by.x = "ID1", by.y = "ID2")
print(merged_df)
Output:
  ID1    Name Age
1   2     Bob  25
2   3 Charlie  30
In this example, df1 has the column ID1 and df2 has the column ID2. The merge is performed by specifying these column names using by.x and by.y. The key column in the result takes its name from by.x (ID1).
6. Merging by Multiple Columns
You can also merge datasets by using multiple columns.
# Data frames with multiple columns for merging
df1 <- data.frame(ID = c(1, 2, 3), Department = c("HR", "Finance", "IT"), Name = c("Alice", "Bob", "Charlie"))
df2 <- data.frame(ID = c(2, 3, 4), Department = c("Finance", "IT", "Sales"), Age = c(25, 30, 22))
# Merging by both 'ID' and 'Department'
merged_df <- merge(df1, df2, by = c("ID", "Department"))
print(merged_df)
Output:
  ID Department    Name Age
1  2    Finance     Bob  25
2  3         IT Charlie  30
In this case, the datasets are merged based on both the ID and Department columns.
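When both data frames share a column name that is not used as a key, merge() keeps both copies and distinguishes them with suffixes. The sketch below uses hypothetical data frames (emp and moved) and hypothetical suffix labels to show the default and a custom setting of the suffixes argument.

```r
# Hypothetical data: both data frames have a Department column,
# but only ID is used as the merge key.
emp   <- data.frame(ID = c(1, 2, 3), Department = c("HR", "Finance", "IT"))
moved <- data.frame(ID = c(2, 3, 4), Department = c("Finance", "Ops", "Sales"))

# Default suffixes are ".x" (first data frame) and ".y" (second data frame).
default_suffix <- merge(emp, moved, by = "ID")
print(names(default_suffix))  # "ID" "Department.x" "Department.y"

# Custom suffixes make the origin of each column easier to read.
custom_suffix <- merge(emp, moved, by = "ID", suffixes = c("_old", "_new"))
print(names(custom_suffix))   # "ID" "Department_old" "Department_new"
```

Seeing .x and .y columns in a result is often a hint that a column was meant to be part of by but was left out.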
Summary of merge() Options:
- Inner join (default): Only includes rows where there is a match in both data frames.
- Left join: Includes all rows from the left data frame and only matching rows from the right data frame.
- Right join: Includes all rows from the right data frame and only matching rows from the left data frame.
- Full outer join: Includes all rows from both data frames, with NA for missing matches.
- Merging on different column names: Use by.x and by.y to specify columns to merge on when their names differ.
- Merging by multiple columns: Use by with a vector of column names.
By using merge(), you can combine datasets based on matching columns or row names, and you can choose the join type that suits your needs.
Question: What is the dplyr package in R?
Answer:
The dplyr package in R is a powerful and popular package for data manipulation and transformation. It provides a set of functions that allow you to manipulate data in a fast, efficient, and intuitive way, focusing on operations such as filtering, selecting, mutating, arranging, and summarizing data.
dplyr is part of the tidyverse, a collection of R packages designed for data science that share a common design philosophy and grammar. It is widely used for data wrangling, making it easier to clean, transform, and analyze data in a pipeline-oriented manner.
Key Features of dplyr:
- Consistency: The syntax of dplyr functions is consistent and simple, which makes data manipulation easier and faster.
- Efficiency: It is optimized for speed and is capable of handling large datasets efficiently.
- Tidyverse Integration: dplyr integrates seamlessly with other tidyverse packages like ggplot2, tidyr, and readr.
- Pipelining: It works well with the %>% (pipe) operator, allowing you to chain multiple operations in a readable and concise manner.
Core Functions in dplyr:
Here are some of the core functions provided by dplyr:
- select(): Choose specific columns from a data frame.
  - Example:
    library(dplyr)
    df <- data.frame(Name = c("Alice", "Bob", "Charlie"), Age = c(25, 30, 35), Score = c(85, 90, 95))
    select(df, Name, Age)
  - Output:
         Name Age
    1   Alice  25
    2     Bob  30
    3 Charlie  35
- filter(): Subset the data based on conditions.
  - Example:
    filter(df, Age > 30)
  - Output:
         Name Age Score
    1 Charlie  35    95
- mutate(): Create new variables or modify existing ones.
  - Example:
    mutate(df, Age_in_5_years = Age + 5)
  - Output:
         Name Age Score Age_in_5_years
    1   Alice  25    85             30
    2     Bob  30    90             35
    3 Charlie  35    95             40
- arrange(): Sort the data by one or more variables.
  - Example:
    arrange(df, Age)
  - Output:
         Name Age Score
    1   Alice  25    85
    2     Bob  30    90
    3 Charlie  35    95
- summarize() (or summarise()): Apply summary statistics to data.
  - Example:
    summarize(df, avg_age = mean(Age))
  - Output:
      avg_age
    1      30
- group_by(): Group data by one or more variables before summarizing or applying other operations.
  - Example (stored under a separate name, df_scores, so that the three-row df used by the other examples is not overwritten):
    df_scores <- data.frame(Name = c("Alice", "Bob", "Charlie", "Alice", "Bob"), Age = c(25, 30, 35, 26, 31), Score = c(85, 90, 95, 88, 92))
    df_scores %>% group_by(Name) %>% summarize(avg_score = mean(Score))
  - Output:
    # A tibble: 3 × 2
      Name    avg_score
      <chr>       <dbl>
    1 Alice        86.5
    2 Bob          91
    3 Charlie      95
- rename(): Rename columns in a data frame.
  - Example:
    rename(df, NewName = Name)
  - Output:
      NewName Age Score
    1   Alice  25    85
    2     Bob  30    90
    3 Charlie  35    95
- distinct(): Return unique rows (or distinct values from a column).
  - Example:
    distinct(df, Age)
  - Output:
      Age
    1  25
    2  30
    3  35
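The phrase "or applying other operations" under group_by() deserves a short illustration: summarize() collapses each group to one row, while mutate() on grouped data keeps every row and computes within each group. The sketch below uses a hypothetical data frame named scores and hypothetical column names n_rows and diff_from_avg.

```r
library(dplyr)

# Hypothetical data for this sketch only.
scores <- data.frame(
  Name  = c("Alice", "Bob", "Charlie", "Alice", "Bob"),
  Score = c(85, 90, 95, 88, 92)
)

# summarize() returns one row per group; n() counts the rows in each group.
per_name <- scores %>%
  group_by(Name) %>%
  summarize(n_rows = n(), avg_score = mean(Score))

# mutate() keeps all five rows; mean(Score) is evaluated per group.
with_group_mean <- scores %>%
  group_by(Name) %>%
  mutate(diff_from_avg = Score - mean(Score)) %>%
  ungroup()

print(per_name)
print(with_group_mean)
```

Calling ungroup() at the end is a common precaution so that later steps do not unintentionally keep operating per group.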
Pipelining with %>%:
One of the most powerful features of dplyr is the pipe operator %>% (from the magrittr package), which allows you to chain operations together, making the code more readable and expressive. Instead of nesting functions, you can pipe the result of one operation into the next.
- Example:
df %>%
  filter(Age > 25) %>%
  select(Name, Age) %>%
  arrange(Age)
This code will:
- Filter rows where Age > 25.
- Select the Name and Age columns.
- Arrange the result by Age in ascending order.
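To make the contrast with nesting concrete, here is a small sketch that writes the same three steps both ways and checks that the results match. The data frame name people is hypothetical.

```r
library(dplyr)

# Hypothetical data for this sketch only.
people <- data.frame(
  Name  = c("Alice", "Bob", "Charlie"),
  Age   = c(25, 30, 35),
  Score = c(85, 90, 95)
)

# Piped version: reads top to bottom in the order the steps run.
piped <- people %>%
  filter(Age > 25) %>%
  select(Name, Age) %>%
  arrange(Age)

# Nested version: the same calls, read from the inside out.
nested <- arrange(select(filter(people, Age > 25), Name, Age), Age)

print(identical(piped, nested))
print(piped)
```

Both forms do the same work; the pipe only changes how the code reads.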
Example of Combining Functions:
Here’s an example where multiple dplyr functions are combined using the pipe operator:
library(dplyr)
# Sample data
df <- data.frame(Name = c("Alice", "Bob", "Charlie", "Alice", "Bob"),
                 Age = c(25, 30, 35, 26, 31),
                 Score = c(85, 90, 95, 88, 92))
# Chain multiple functions together
result <- df %>%
  group_by(Name) %>%
  filter(Age > 25) %>%
  mutate(Age_in_5_years = Age + 5) %>%
  summarize(avg_score = mean(Score))
print(result)
Output:
# A tibble: 3 × 2
  Name    avg_score
  <chr>       <dbl>
1 Alice          88
2 Bob            91
3 Charlie        95
This example:
- Groups the data by Name.
- Filters rows where Age > 25, which removes Alice's first row (Age 25, Score 85).
- Creates a new column Age_in_5_years by adding 5 to the Age column.
- Summarizes the data to get the average Score for each name. Because summarize() keeps only the grouping column and the summaries it computes, Age_in_5_years does not appear in the result.
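summarize() returns only the grouping columns and the summaries it is asked for, so a column created by mutate() survives only if it is summarized explicitly. The sketch below shows one way to do that; the data frame name people and the summary name max_age_in_5_years are hypothetical choices for illustration.

```r
library(dplyr)

# Hypothetical data for this sketch only.
people <- data.frame(
  Name  = c("Alice", "Bob", "Charlie", "Alice", "Bob"),
  Age   = c(25, 30, 35, 26, 31),
  Score = c(85, 90, 95, 88, 92)
)

result <- people %>%
  filter(Age > 25) %>%
  mutate(Age_in_5_years = Age + 5) %>%
  group_by(Name) %>%
  # Each column wanted in the output needs its own summary expression.
  summarize(
    avg_score          = mean(Score),
    max_age_in_5_years = max(Age_in_5_years)
  )

print(result)
```

Using max() here is just one option; any function that reduces a group to a single value would work in its place.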
Benefits of Using dplyr:
- Concise and Readable Code: It reduces the need for verbose loops and allows you to write clean, readable code for data manipulation.
- Speed: Optimized for performance, particularly when dealing with large datasets.
- Seamless Integration: Works well with other tidyverse packages like ggplot2 for visualization and tidyr for reshaping data.
Installation and Loading:
If you don’t have dplyr installed, you can install it with:
install.packages("dplyr")
Then, load the package using:
library(dplyr)
Summary:
- dplyr is a popular R package for data manipulation.
- It provides easy-to-use functions like select(), filter(), mutate(), arrange(), summarize(), and group_by().
- Pipelining (%>%) is one of its most powerful features, allowing you to chain operations together in a clean and readable way.
- dplyr is efficient, fast, and integrates seamlessly with other tidyverse packages, making it ideal for data wrangling and analysis in R.