Top R Interview Questions and Answers for 2025
Question: What is the difference between a matrix and a data frame in R?
Answer:
Both matrices and data frames are used to store data in R, but they have distinct characteristics and are used for different purposes. Here’s a breakdown of the differences:
1. Structure
- Matrix:
- A matrix is a two-dimensional array in R that stores data of the same type (numeric, character, etc.).
- Matrices have rows and columns, and every element in a matrix must be of the same data type.
- A matrix is created using the matrix() function.
Example:
mat <- matrix(1:9, nrow = 3, ncol = 3)
print(mat)
This creates a 3x3 matrix of numbers from 1 to 9.
- Data Frame:
- A data frame is a two-dimensional table-like structure used for storing data of different types (numeric, character, factor, etc.).
- Unlike matrices, columns in a data frame can have different types of data.
- Data frames are typically used for storing datasets in R and are created using the data.frame() function.
Example:
df <- data.frame(
  Name = c("John", "Alice", "Bob"),
  Age = c(25, 30, 22),
  Score = c(90.5, 85.3, 78.9)
)
print(df)
This creates a data frame with columns of different data types (character, numeric).
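To make the structural difference concrete, the short sketch below inspects both objects with base R functions. It assumes R 4.0 or later; stringsAsFactors = FALSE is set explicitly so the Name column stays character on older versions too.

```r
# Rebuild the two example objects from this section
mat <- matrix(1:9, nrow = 3, ncol = 3)
df <- data.frame(
  Name  = c("John", "Alice", "Bob"),
  Age   = c(25, 30, 22),
  Score = c(90.5, 85.3, 78.9),
  stringsAsFactors = FALSE
)

is.matrix(mat)      # TRUE
is.data.frame(df)   # TRUE
dim(mat)            # both objects report 3 rows and 3 columns
dim(df)

typeof(mat)                       # a single type for the whole matrix
col_classes <- sapply(df, class)  # one class per data frame column
print(col_classes)
```

Both objects have the same dimensions, but the matrix has one type for every cell while the data frame reports a class for each column.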
2. Homogeneity of Data
- Matrix:
- All elements in a matrix must be of the same data type. If you attempt to mix data types (for example, numeric and character), R will automatically coerce all elements into the most general type (e.g., converting all to character type).
Example:
mat <- matrix(c(1, "a", 3, 4), nrow = 2, ncol = 2)
print(mat)
Output:
     [,1] [,2]
[1,] "1"  "3"
[2,] "a"  "4"
The numeric value 1 is converted to the character string "1" because one of the elements in the matrix is a character.
- Data Frame:
- Each column in a data frame can contain different types of data (e.g., numeric, character, factor), making data frames more flexible than matrices when dealing with real-world data.
Example:
df <- data.frame(
  ID = 1:3,
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 30, 22)
)
print(df)
Output:
  ID    Name Age
1  1   Alice  25
2  2     Bob  30
3  3 Charlie  22
Here, the columns keep their own types: ID is integer, Name is character, and Age is numeric (double).
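The coercion rule also applies when a data frame is converted to a matrix. The following sketch, which assumes the same three-column data frame as above, shows that a mixed data frame becomes a character matrix, while a selection of numeric columns stays numeric.

```r
df <- data.frame(
  ID   = 1:3,
  Name = c("Alice", "Bob", "Charlie"),
  Age  = c(25, 30, 22),
  stringsAsFactors = FALSE
)

# Each column keeps its own storage type inside the data frame
col_types <- sapply(df, typeof)
print(col_types)

# Converting everything forces a single common type
as_mat <- as.matrix(df)
typeof(as_mat)    # "character", because Name cannot be numeric

# Converting only the numeric columns avoids the character coercion
num_mat <- as.matrix(df[, c("ID", "Age")])
typeof(num_mat)   # "double"
```

Selecting the numeric columns first is a common way to get a matrix that is still usable for arithmetic.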
3. Usage
- Matrix:
- Typically used when you need to perform matrix operations such as linear algebra (matrix multiplication, inverse, etc.).
- It is a mathematical object that is well-suited for mathematical computations where all data is of the same type.
Example (Matrix multiplication):
mat1 <- matrix(1:4, nrow = 2, ncol = 2)
mat2 <- matrix(5:8, nrow = 2, ncol = 2)
result <- mat1 %*% mat2  # Matrix multiplication
print(result)
- Data Frame:
- Primarily used for storing and manipulating data in tabular form.
- Ideal for use in data analysis, where different types of data (e.g., numeric, categorical) are often mixed in the same dataset.
- Data frames are also the most common structure used for importing and working with datasets in R.
Example (Working with data frames):
df <- data.frame(
  Name = c("John", "Alice", "Bob"),
  Age = c(25, 30, 22),
  Score = c(90.5, 85.3, 78.9)
)
summary(df)
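The two structures are often used together: data arrives as a data frame, and the numeric part is converted to a matrix for linear algebra. The sketch below is one possible illustration using base R only; the matrix a and the columns x, y, and label are made-up values for the example.

```r
# Inverse of a small invertible matrix (determinant is 5)
a <- matrix(c(2, 1, 1, 3), nrow = 2)
a_inv <- solve(a)
identity_check <- a %*% a_inv
all.equal(identity_check, diag(2))   # TRUE up to floating-point tolerance

# Moving from a data frame to a matrix for matrix algebra
df <- data.frame(
  x = c(1, 2, 3),
  y = c(4, 5, 6),
  label = c("a", "b", "c")
)
m <- as.matrix(df[, c("x", "y")])    # keep only the numeric columns
crossprod_xy <- t(m) %*% m           # 2x2 matrix of sums of products
print(crossprod_xy)
```

The label column is left out of the conversion on purpose, since including it would turn the whole matrix into character data.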
4. Indexing and Accessing Data
- Matrix:
- Indexing in a matrix is done using two indices: one for the row and one for the column.
Example:
mat <- matrix(1:9, nrow = 3, ncol = 3)
mat[2, 3]  # Access the element at row 2, column 3
- Data Frame:
- Data frames can be accessed similarly using indexing, but you can also reference columns by name.
Example:
df <- data.frame(
  Name = c("John", "Alice", "Bob"),
  Age = c(25, 30, 22)
)
df[1, 2]  # Access the element at row 1, column 2 (Age)
df$Name   # Access the "Name" column by name
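A few more indexing forms come up often in practice. The sketch below is illustrative only: the row and column names r1 to r3 and a to c are hypothetical labels added for the example.

```r
# A matrix can also carry names, set here through dimnames
mat <- matrix(1:9, nrow = 3, ncol = 3,
              dimnames = list(c("r1", "r2", "r3"), c("a", "b", "c")))
cell  <- mat["r2", "c"]             # same cell as mat[2, 3]
col_b <- mat[, "b", drop = FALSE]   # drop = FALSE keeps a 3x1 matrix

df <- data.frame(
  Name = c("John", "Alice", "Bob"),
  Age  = c(25, 30, 22),
  stringsAsFactors = FALSE
)
ages  <- df[["Age"]]                # column as a plain vector, like df$Age
older <- df[df$Age > 24, ]          # logical row filter: rows with Age > 24
print(older)
```

Without drop = FALSE, selecting a single matrix column returns a plain vector rather than a one-column matrix.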
5. Efficiency
- Matrix:
- Matrices are generally more efficient for large datasets of a single type: a matrix is stored as one atomic vector with a dim attribute, so R does not have to track a separate type for each column.
- Data Frame:
- Data frames usually carry more memory and computational overhead, because a data frame is a list of column vectors (plus row names), which is what allows each column to have its own type.
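One way to see where the difference comes from is to look at how each structure is stored. This is a small sketch rather than a benchmark: actual memory and timing differences depend on the data and the R version, so the object.size() results are printed without any claim about their values.

```r
mat <- matrix(as.numeric(1:6), nrow = 3)
df  <- as.data.frame(mat)

is.atomic(mat)            # TRUE: one contiguous vector
names(attributes(mat))    # only "dim" is attached
length(mat)               # 6, the number of cells

is.list(df)               # TRUE: a list with one element per column
length(df)                # 2, the number of columns

# Sizes vary by platform and R version
print(object.size(mat))
print(object.size(df))
```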
6. Summary of Differences:
Feature | Matrix | Data Frame |
---|---|---|
Data Type | Homogeneous (all elements must be the same type) | Heterogeneous (each column can have different types) |
Structure | 2D array with rows and columns | 2D table with rows and columns |
Use Case | Mathematical operations, matrix algebra | Storing and analyzing data with mixed data types |
Indexing | Two-dimensional indexing (row, column) | Two-dimensional or column-based indexing (with names) |
Data Handling | Efficient for numerical data | Flexible for real-world data (numeric, character, factors) |
Operations | Suited for mathematical operations like matrix multiplication | Suited for data manipulation and analysis |
Summary:
- A matrix is used when you need to store and manipulate data of the same type (e.g., numeric data) and perform mathematical operations.
- A data frame is used when you need to work with tabular data that may include different types (numeric, character, factor), making it more suitable for data analysis and statistical operations.
Matrices are ideal for mathematical computations, while data frames are ideal for data analysis, as they allow the storage of diverse data types in a structured format.
Question: What is the tapply() function in R?
Answer:
The tapply() function in R applies a function to subsets of a vector, where the subsets are defined by a factor or other grouping variable. It belongs to the same family as apply(), but it works on groups defined by a factor rather than on the rows or columns of a matrix.
Syntax:
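tapply(X, INDEX, FUN = NULL, ..., default = NA, simplify = TRUE)
Arguments:
- X: A vector (usually numeric) on which the function will be applied.
- INDEX: A factor or a list of factors, each the same length as X, that define the subsets of the vector X.
- FUN: The function to be applied to each subset of data.
- ...: Additional arguments passed to the function FUN.
- default: The value used for cells with no data, such as an unused combination of factor levels (NA by default; available in R 3.4.0 and later).
- simplify: If TRUE (default) and FUN returns a single value for each group, the result is simplified to a vector or array. If FALSE, the result is returned as a list (an array of mode "list").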
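The effect of the extra arguments is easiest to see on a small example. The sketch below uses made-up scores with one missing value to show how na.rm is forwarded to FUN through ..., and what simplify = FALSE returns.

```r
scores <- c(85, NA, 78, 88, 95, 77)
gender <- factor(c("Male", "Female", "Male", "Female", "Male", "Female"))

# Without na.rm, the group that contains the NA has an NA mean
with_na <- tapply(scores, gender, mean)

# na.rm = TRUE is not a tapply() argument; it is passed on to mean()
no_na <- tapply(scores, gender, mean, na.rm = TRUE)

# simplify = FALSE keeps one list element per group
as_list <- tapply(scores, gender, range, na.rm = TRUE, simplify = FALSE)

print(with_na)
print(no_na)
print(as_list)
```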
How does tapply() work?
- Grouping: It groups the vector X based on the factor(s) in INDEX.
- Function application: It then applies the function FUN to each subset of data.
- Return: It returns the result in a simplified form (unless simplify = FALSE, in which case a list is returned).
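The steps above can be sketched by hand with split() and sapply(). This is a conceptual equivalent for a single grouping factor, not a description of how tapply() is implemented internally.

```r
scores <- c(85, 92, 78, 88, 95, 77)
gender <- factor(c("Male", "Female", "Male", "Female", "Male", "Female"))

# Step 1: grouping, one numeric vector per factor level
groups <- split(scores, gender)

# Step 2: apply the function to each group and simplify to a vector
by_hand <- sapply(groups, mean)

# The same computation in one call
via_tapply <- tapply(scores, gender, mean)

all.equal(as.vector(by_hand), as.vector(via_tapply))   # TRUE
```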
Example 1: Basic Usage of tapply()
Suppose you have a vector of numbers representing scores, and a factor representing two different groups (e.g., male and female).
# Data
scores <- c(85, 92, 78, 88, 95, 77)
gender <- factor(c("Male", "Female", "Male", "Female", "Male", "Female"))
# Applying tapply to calculate the mean score for each gender
result <- tapply(scores, gender, mean)
print(result)
Output:
  Female     Male
85.66667 86.00000
In this example:
- scores is the numeric vector.
- gender is the factor that defines the grouping.
- The function mean is applied to each subset (Male and Female), and the mean score is calculated for each group.
Example 2: Using tapply() with Multiple Factors
You can also use tapply() with multiple grouping factors. For example, if you have a second factor for Age Group, you can apply a function to every combination of Gender and Age Group by passing both factors as a list.
# Data
scores <- c(85, 92, 78, 88, 95, 77)
gender <- factor(c("Male", "Female", "Male", "Female", "Male", "Female"))
age_group <- factor(c("Adult", "Adult", "Teen", "Teen", "Adult", "Teen"))
# Applying tapply to calculate mean score for each combination of Gender and Age Group
result <- tapply(scores, list(gender, age_group), mean)
print(result)
Output:
       Adult Teen
Female    92 82.5
Male      90 78.0
In this example:
- scores is the numeric vector.
- gender and age_group are the factors that define the groups.
- The mean score is computed for each combination of gender and age_group.
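If the factors are passed as a named list, the names are attached to the dimensions of the result, which makes the printed table and later lookups easier to read. A minimal sketch using the same data as above:

```r
scores <- c(85, 92, 78, 88, 95, 77)
gender <- factor(c("Male", "Female", "Male", "Female", "Male", "Female"))
age_group <- factor(c("Adult", "Adult", "Teen", "Teen", "Adult", "Teen"))

# Naming the list elements labels the rows and columns of the result
result <- tapply(scores, list(gender = gender, age_group = age_group), mean)
print(result)

names(dimnames(result))                  # "gender" "age_group"
female_teen <- result["Female", "Teen"]  # look up one cell by level names
```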
Example 3: Using Other Functions with tapply()
Any function that accepts a vector can be passed to tapply(), whether it is built in or user-defined. For instance, you might want to calculate the sum of scores for each gender:
# Data
scores <- c(85, 92, 78, 88, 95, 77)
gender <- factor(c("Male", "Female", "Male", "Female", "Male", "Female"))
# Applying tapply to calculate the sum of scores for each gender
result <- tapply(scores, gender, sum)
print(result)
Output:
Female Male
257 258
Here, we used sum as the function to apply, so the sum of scores for each gender is calculated.
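A user-defined function is passed in exactly the same way. In the sketch below, score_spread is a hypothetical helper written for this example, and the second call uses an anonymous function instead.

```r
scores <- c(85, 92, 78, 88, 95, 77)
gender <- factor(c("Male", "Female", "Male", "Female", "Male", "Female"))

# Named custom function: difference between highest and lowest score
score_spread <- function(x) max(x) - min(x)
spread <- tapply(scores, gender, score_spread)
print(spread)

# Anonymous function: number of scores above 80 in each group
above_80 <- tapply(scores, gender, function(x) sum(x > 80))
print(above_80)
```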
Summary of tapply() Usage:
- tapply() is used to apply a function to subsets of data, grouped by a factor (or multiple factors).
- It simplifies operations like calculating the mean, sum, or other statistical functions for each group in the data.
- It returns the result in a simplified format, or as a list if simplify = FALSE.
Common Uses:
- Calculating aggregate statistics (mean, sum, etc.) by group.
- Grouping data by categorical variables.
- Applying custom functions to grouped data.