Top R Interview Questions and Answers (2025)

By Hirely, 08 Jan, 2025

Question: How do you merge datasets in R?

Answer:

In R, datasets can be merged using the merge() function, which combines two data frames by common columns or row names, similar to SQL joins. You can control how the data frames are merged (e.g., inner, outer, left, or right join) by specifying different options.


Syntax of merge() function:

merge(x, y, by = intersect(names(x), names(y)), by.x = by, by.y = by, all = FALSE, all.x = all, all.y = all, sort = TRUE, suffixes = c(".x", ".y"), ...)

Arguments:

  • x, y: The data frames to be merged.
  • by: A character vector specifying the column(s) to merge on. If not provided, merge() uses every column name that appears in both data frames.
  • by.x and by.y: The column names in the first (x) and second (y) data frames to merge on. These are used if the column names differ between the two data frames.
  • all: Shorthand for setting both all.x and all.y. If TRUE, it performs a full outer join. If FALSE (default), it performs an inner join.
  • all.x: If TRUE, it performs a left join (all rows from x will be kept).
  • all.y: If TRUE, it performs a right join (all rows from y will be kept).
  • sort: If TRUE (default), the result will be sorted by the merged column(s).
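As a rough illustration of how the by default behaves, the sketch below uses two hypothetical data frames (sales and budget, which are not part of the original examples) that share two column names. It assumes base R only and compares an implicit merge with one that names a single key.

```r
# Hypothetical data: both frames share the column names ID and Year
sales  <- data.frame(ID = c(1, 2, 3), Year = c(2023, 2023, 2024), Sales = c(10, 20, 30))
budget <- data.frame(ID = c(1, 2, 3), Year = c(2023, 2024, 2024), Budget = c(15, 25, 35))

# by omitted: merge() joins on every shared column name (ID and Year),
# so only rows that agree on both columns are kept
implicit <- merge(sales, budget)

# by = "ID": Year is no longer a key, so both Year columns are kept
# and disambiguated with the default suffixes ".x" and ".y"
explicit <- merge(sales, budget, by = "ID")

print(implicit)
print(names(explicit))
```

Naming the key columns explicitly with by is usually the safer choice, because an unexpected shared column name silently changes which rows match.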

Types of Joins:

  1. Inner Join: Only keeps the rows where there is a match in both datasets.
  2. Left Join: Keeps all rows from the left dataset and only matching rows from the right dataset.
  3. Right Join: Keeps all rows from the right dataset and only matching rows from the left dataset.
  4. Full Outer Join: Keeps all rows from both datasets, filling in NA where there are no matches.

Examples of Merging Datasets:

1. Inner Join (default)

An inner join combines rows where there is a match in both datasets.

# Data frames
df1 <- data.frame(ID = c(1, 2, 3), Name = c("Alice", "Bob", "Charlie"))
df2 <- data.frame(ID = c(2, 3, 4), Age = c(25, 30, 22))

# Merging on 'ID' (common column)
merged_df <- merge(df1, df2, by = "ID")

print(merged_df)

Output:

  ID    Name Age
1  2     Bob  25
2  3 Charlie  30

In this example, only rows with matching ID values (2 and 3) are included in the merged result.

2. Left Join

A left join keeps all rows from the left dataset (df1) and only matching rows from the right dataset (df2).

# Left Join
left_joined_df <- merge(df1, df2, by = "ID", all.x = TRUE)

print(left_joined_df)

Output:

  ID    Name Age
1  1   Alice  NA
2  2     Bob  25
3  3 Charlie  30

In this case, the row for ID = 1 is kept from df1, but since there is no matching row in df2, the Age column is filled with NA.

3. Right Join

A right join keeps all rows from the right dataset (df2) and only matching rows from the left dataset (df1).

# Right Join
right_joined_df <- merge(df1, df2, by = "ID", all.y = TRUE)

print(right_joined_df)

Output:

  ID    Name Age
1  2     Bob  25
2  3 Charlie  30
3  4    <NA>  22

Here, the row for ID = 4 is kept from df2, but since there is no matching row in df1, the Name column is filled with NA.

4. Full Outer Join

A full outer join keeps all rows from both datasets, filling NA where there is no match.

# Full Outer Join
full_joined_df <- merge(df1, df2, by = "ID", all = TRUE)

print(full_joined_df)

Output:

  ID    Name Age
1  1   Alice  NA
2  2     Bob  25
3  3 Charlie  30
4  4    <NA>  22

In this case, rows from both df1 and df2 are kept, with NA filling in the missing values.
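To compare the four join types side by side, the following sketch rebuilds the same df1 and df2 and counts the rows each variant returns. The names joins and row_counts are introduced here for illustration only.

```r
# Same example data as in the join examples above
df1 <- data.frame(ID = c(1, 2, 3), Name = c("Alice", "Bob", "Charlie"))
df2 <- data.frame(ID = c(2, 3, 4), Age = c(25, 30, 22))

# One merge() call per join type, differing only in the all* arguments
joins <- list(
  inner = merge(df1, df2, by = "ID"),
  left  = merge(df1, df2, by = "ID", all.x = TRUE),
  right = merge(df1, df2, by = "ID", all.y = TRUE),
  full  = merge(df1, df2, by = "ID", all = TRUE)
)

# Number of rows produced by each join type
row_counts <- sapply(joins, nrow)
print(row_counts)  # inner = 2, left = 3, right = 3, full = 4
```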


5. Merging on Different Column Names

If the columns on which you want to merge have different names in the two data frames, you can use the by.x and by.y arguments.

# Data frames with different column names for merging
df1 <- data.frame(ID1 = c(1, 2, 3), Name = c("Alice", "Bob", "Charlie"))
df2 <- data.frame(ID2 = c(2, 3, 4), Age = c(25, 30, 22))

# Merging on 'ID1' from df1 and 'ID2' from df2
merged_df <- merge(df1, df2, by.x = "ID1", by.y = "ID2")

print(merged_df)

Output:

  ID1    Name Age
1   2     Bob  25
2   3 Charlie  30

In this example, df1 has the column ID1 and df2 has the column ID2. The merge is performed by specifying these column names using by.x and by.y.
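The by.x and by.y arguments combine with the all* arguments in the usual way. A minimal sketch, assuming the same ID1/ID2 data as above, performs a left join and shows that the key column in the result takes its name from x (the variable name left_joined is hypothetical):

```r
df1 <- data.frame(ID1 = c(1, 2, 3), Name = c("Alice", "Bob", "Charlie"))
df2 <- data.frame(ID2 = c(2, 3, 4), Age = c(25, 30, 22))

# Left join on differently named keys: every row of df1 is kept
left_joined <- merge(df1, df2, by.x = "ID1", by.y = "ID2", all.x = TRUE)

# The key column keeps the name from x (ID1); ID2 does not appear
print(names(left_joined))
print(left_joined)
```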


6. Merging by Multiple Columns

You can also merge datasets by using multiple columns.

# Data frames with multiple columns for merging
df1 <- data.frame(ID = c(1, 2, 3), Department = c("HR", "Finance", "IT"), Name = c("Alice", "Bob", "Charlie"))
df2 <- data.frame(ID = c(2, 3, 4), Department = c("Finance", "IT", "Sales"), Age = c(25, 30, 22))

# Merging by both 'ID' and 'Department'
merged_df <- merge(df1, df2, by = c("ID", "Department"))

print(merged_df)

Output:

  ID Department    Name Age
1  2    Finance     Bob  25
2  3         IT Charlie  30

In this case, the datasets are merged based on both the ID and Department columns.


Summary of merge() Options:

  • Inner join (default): Only includes rows where there is a match in both data frames.
  • Left join: Includes all rows from the left data frame and only matching rows from the right data frame.
  • Right join: Includes all rows from the right data frame and only matching rows from the left data frame.
  • Full outer join: Includes all rows from both data frames, with NA for missing matches.
  • Merging on different column names: Use by.x and by.y to specify columns to merge on when their names differ.
  • Merging by multiple columns: Use by with a vector of column names.

By using merge(), you can combine data frames on matching columns or row names and choose which unmatched rows to keep.

Question: What is the dplyr package in R?

Answer:

The dplyr package in R is a powerful and popular package for data manipulation and transformation. It provides a set of functions that allow you to manipulate data in a fast, efficient, and intuitive way, focusing on operations such as filtering, selecting, mutating, arranging, and summarizing data.

dplyr is part of the tidyverse, a collection of R packages designed for data science that share a common design philosophy and grammar. It is widely used for data wrangling, making it easier to clean, transform, and analyze data in a pipeline-oriented manner.


Key Features of dplyr:

  • Consistency: The syntax of dplyr functions is consistent and simple, which makes data manipulation easier and faster.
  • Efficiency: It is optimized for speed and is capable of handling large datasets efficiently.
  • Tidyverse Integration: dplyr integrates seamlessly with other tidyverse packages like ggplot2, tidyr, and readr.
  • Pipelining: It works well with the %>% (pipe) operator, allowing you to chain multiple operations in a readable and concise manner.

Core Functions in dplyr:

Here are some of the core functions provided by dplyr:

  1. select(): Choose specific columns from a data frame.

    • Example:
    library(dplyr)
    df <- data.frame(Name = c("Alice", "Bob", "Charlie"), Age = c(25, 30, 35), Score = c(85, 90, 95))
    select(df, Name, Age)
    • Output:
         Name Age
    1   Alice  25
    2     Bob  30
    3 Charlie  35
  2. filter(): Subset the data based on conditions.

    • Example:
    filter(df, Age > 30)
    • Output:
         Name Age Score
    1 Charlie  35    95
  3. mutate(): Create new variables or modify existing ones.

    • Example:
    mutate(df, Age_in_5_years = Age + 5)
    • Output:
         Name Age Score Age_in_5_years
    1   Alice  25    85             30
    2     Bob  30    90             35
    3 Charlie  35    95             40
  4. arrange(): Sort the data by one or more variables.

    • Example:
    arrange(df, Age)
    • Output:
         Name Age Score
    1   Alice  25    85
    2     Bob  30    90
    3 Charlie  35    95
  5. summarize() (or summarise()): Apply summary statistics to data.

    • Example:
    summarize(df, avg_age = mean(Age))
    • Output:
      avg_age
    1      30
  6. group_by(): Group data by one or more variables before summarizing or applying other operations.

    • Example:
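    # A separate data frame with repeated names, so the three-row df used in the other examples is not overwritten
    df_scores <- data.frame(Name = c("Alice", "Bob", "Charlie", "Alice", "Bob"), Age = c(25, 30, 35, 26, 31), Score = c(85, 90, 95, 88, 92))
    df_scores %>%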
      group_by(Name) %>%
      summarize(avg_score = mean(Score))
    • Output:
    # A tibble: 3 × 2
      Name      avg_score
      <chr>        <dbl>
    1 Alice         86.5
    2 Bob           91
    3 Charlie       95
  7. rename(): Rename columns in a data frame.

    • Example:
    rename(df, NewName = Name)
    • Output:
      NewName Age Score
    1   Alice  25    85
    2     Bob  30    90
    3 Charlie  35    95
  8. distinct(): Return unique rows (or distinct values from a column).

    • Example:
    distinct(df, Age)
    • Output:
      Age
    1  25
    2  30
    3  35
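Because every Age in the data frame above is already unique, the distinct() example does not remove anything. A small sketch with deliberately duplicated rows may make the behaviour clearer; the visits data frame and its columns are hypothetical and assume dplyr is installed.

```r
library(dplyr)

# Hypothetical data with one fully duplicated row (Alice / Paris)
visits <- data.frame(
  Name = c("Alice", "Bob", "Alice", "Bob"),
  City = c("Paris", "Rome", "Paris", "Oslo")
)

# Unique rows across all columns: the repeated Alice/Paris row is dropped
unique_rows <- distinct(visits)

# Unique values of a single column: one row per Name
unique_names <- distinct(visits, Name)

# One row per Name, keeping the other columns from the first occurrence
first_per_name <- distinct(visits, Name, .keep_all = TRUE)

print(unique_rows)
print(unique_names)
print(first_per_name)
```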

Pipelining with %>%:

One of the most powerful features of dplyr is the pipe operator %>% (from the magrittr package), which allows you to chain operations together, making the code more readable and expressive. Instead of nesting functions, you can pipe the result of one operation into the next.

  • Example:
df %>%
  filter(Age > 25) %>%
  select(Name, Age) %>%
  arrange(Age)

This code will:

  1. Filter rows where Age > 25.
  2. Select the Name and Age columns.
  3. Arrange the result by Age in ascending order.
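To show what the pipe replaces, the sketch below writes the same three steps once as a pipeline and once as nested calls, then checks that the two results match. It assumes the three-row df from the earlier examples; the names piped and nested are introduced for illustration.

```r
library(dplyr)

df <- data.frame(Name = c("Alice", "Bob", "Charlie"), Age = c(25, 30, 35), Score = c(85, 90, 95))

# Pipeline form: reads top to bottom in the order the steps run
piped <- df %>%
  filter(Age > 25) %>%
  select(Name, Age) %>%
  arrange(Age)

# Nested form: the same steps, read from the innermost call outward
nested <- arrange(select(filter(df, Age > 25), Name, Age), Age)

# Both forms should produce the same data frame
print(all.equal(piped, nested))
print(piped)
```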

Example of Combining Functions:

Here’s an example where multiple dplyr functions are combined using the pipe operator:

library(dplyr)

# Sample data
df <- data.frame(Name = c("Alice", "Bob", "Charlie", "Alice", "Bob"),
                 Age = c(25, 30, 35, 26, 31),
                 Score = c(85, 90, 95, 88, 92))

# Chain multiple functions together
result <- df %>%
  group_by(Name) %>%
  filter(Age > 25) %>%
  mutate(Age_in_5_years = Age + 5) %>%
  summarize(avg_score = mean(Score))

print(result)

Output:

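# A tibble: 3 × 2
  Name    avg_score
  <chr>       <dbl>
1 Alice          88
2 Bob            91
3 Charlie        95

This example:

  1. Groups the data by Name.
  2. Filters rows where Age > 25, which drops Alice's Age = 25 row.
  3. Creates a new column Age_in_5_years by adding 5 to the Age column.
  4. Summarizes the data to get the average Score for each name.

Because summarize() returns only the grouping column and the summary columns, Age_in_5_years does not appear in the result. Alice's average is 88 (not the 86.5 from the unfiltered data) because only her Age = 26 row remains after the filter.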
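summarize() keeps only the grouping columns and the summaries it is asked for, so a column created by mutate() has to be aggregated explicitly if it should appear in the result. One possible variant, which is an assumption about the intent and not the original code, takes the maximum of the derived column per group; the name max_age_in_5_years is hypothetical.

```r
library(dplyr)

df <- data.frame(Name = c("Alice", "Bob", "Charlie", "Alice", "Bob"),
                 Age = c(25, 30, 35, 26, 31),
                 Score = c(85, 90, 95, 88, 92))

result <- df %>%
  group_by(Name) %>%
  filter(Age > 25) %>%                      # drops Alice's Age = 25 row
  mutate(Age_in_5_years = Age + 5) %>%
  summarize(
    avg_score = mean(Score),
    # Aggregate the derived column so it survives summarize()
    max_age_in_5_years = max(Age_in_5_years)
  )

print(result)
```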

Benefits of Using dplyr:

  • Concise and Readable Code: It reduces the need for verbose loops and allows you to write clean, readable code for data manipulation.
  • Speed: Optimized for performance, particularly when dealing with large datasets.
  • Seamless Integration: Works well with other tidyverse packages like ggplot2 for visualization and tidyr for reshaping data.

Installation and Loading:

If you don’t have dplyr installed, you can install it with:

install.packages("dplyr")

Then, load the package using:

library(dplyr)
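In scripts that may run on a machine without dplyr, a common pattern is to install the package only when it is missing. This is a minimal sketch that assumes a CRAN mirror is configured and reachable:

```r
# Install dplyr only if it is not already available, then load it
if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr")
}

library(dplyr)
```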

Summary:

  • dplyr is a popular R package for data manipulation.
  • It provides easy-to-use functions like select(), filter(), mutate(), arrange(), summarize(), and group_by().
  • Pipelining (%>%) is one of its most powerful features, allowing you to chain operations together in a clean and readable way.
  • dplyr is efficient, fast, and integrates seamlessly with other tidyverse packages, making it ideal for data wrangling and analysis in R.
