Most Frequently Asked Regex Interview Questions (2024)

By Hirely, 28 Dec, 2024

Question: What is a regular expression?

Answer:

A regular expression (often abbreviated as regex) is a sequence of characters that defines a search pattern. It is mainly used for string matching and manipulation, such as searching, extracting, or replacing parts of a string. Regular expressions are widely supported in programming languages, text editors, and command-line utilities.

A regular expression allows you to create patterns for text, which can include special characters to match a wide range of text combinations. Some common elements in regular expressions are:

  • Literal characters: e.g., a, 1, abc
  • Metacharacters: e.g., ., *, +, ?, [], (), ^, $, |, {}, etc.
  • Anchors: e.g., ^ for the start of a string, $ for the end of a string
  • Character classes: e.g., [a-z] matches any lowercase letter
  • Quantifiers: e.g., * (zero or more), + (one or more), ? (zero or one), {n,m} (between n and m occurrences)

Regular expressions are useful for tasks such as:

  • Validation: Checking if a string matches a certain format (e.g., validating email addresses, phone numbers)
  • Search and replace: Replacing or extracting specific patterns from a string
  • Text processing: Filtering and splitting text based on patterns

Example:

  • The regular expression \d{3}-\d{2}-\d{4} matches Social Security numbers in the format 123-45-6789.
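
As a minimal sketch, the pattern above can be tried with Python's re module. The sample text and variable names are invented for this illustration.

```python
import re

# Pattern from the example above: three digits, two digits, four digits.
ssn_pattern = re.compile(r"\d{3}-\d{2}-\d{4}")

# Hypothetical sample text, invented for this illustration.
sample_text = "Applicant SSN: 123-45-6789 (on file)"

# search() looks for the pattern anywhere in the text.
found = ssn_pattern.search(sample_text)
print(found.group() if found else None)  # 123-45-6789

# fullmatch() succeeds only when the whole string fits the pattern.
print(ssn_pattern.fullmatch("123-45-6789") is not None)  # True
print(ssn_pattern.fullmatch("123-456-789") is not None)  # False
```

Note that the pattern only checks the shape of the text; it cannot tell whether a number was actually issued.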

Regular expressions can vary slightly across different programming languages or tools, but their core principles remain the same.

Question: How would you use a regular expression to validate an email address?

Answer:

To validate an email address using a regular expression, you would need to ensure that the string follows the general structure of an email, which typically consists of:

  1. Local part (before the @ symbol): Can contain letters, digits, dots, hyphens, and underscores.
  2. Domain part (after the @ symbol): Contains letters, digits, hyphens, and periods.
  3. Top-level domain (TLD): Comes after the last period (.) in the domain part. It traditionally contains 2 to 6 alphabetic characters, although many newer TLDs (for example, .photography) are longer.

Here is a common regular expression for validating a basic email address:

^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,6}$

Explanation of the regex:

  • ^ and $: Anchors that match the start and end of the string, respectively.
  • [a-zA-Z0-9._%+-]+: Matches one or more of the following characters for the local part of the email:
    • a-zA-Z: Letters (uppercase and lowercase)
    • 0-9: Digits
    • ., _, %, +, -: Special characters allowed in the local part.
  • @: Matches the literal @ symbol.
  • [a-zA-Z0-9.-]+: Matches one or more characters for the domain part, allowing:
    • a-zA-Z0-9: Letters and digits
    • . and -: Periods and hyphens in the domain.
  • \.: Matches the literal period (.) separating the domain and TLD.
  • [a-zA-Z]{2,6}: Matches the TLD, which should consist of 2 to 6 alphabetic characters (this handles common TLDs like .com, .org, and .net, but rejects TLDs longer than six letters).

Example usage in Python:

import re

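def validate_email(email):
    pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,6}$'
    # fullmatch requires the whole string to fit the pattern; re.match with $
    # would also accept an address followed by a trailing newline.
    return re.fullmatch(pattern, email) is not None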

# Test
print(validate_email("user@example.com"))  # True
print(validate_email("invalid-email.com"))  # False

Considerations:

  • Internationalized domain names (IDNs): This regex does not account for international characters in domain names (e.g., example.中文), which might need additional handling.
  • Edge cases: This regex covers the majority of email formats but doesn’t handle all edge cases of the official email format (e.g., special rules for quotes, comments, and domain names).

For most use cases, this regex should be sufficient for validating standard email addresses. However, for more precise validation that adheres strictly to the full email specification, you might need a more complex regex or a specialized library for email validation.
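
To make the considerations above concrete, here is a small sketch that runs the same pattern against a few hand-picked addresses. The helper name looks_like_email and the sample addresses are assumptions made for illustration, not part of any library.

```python
import re

# Same pattern as in the answer above, compiled once for reuse.
EMAIL_RE = re.compile(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,6}$")

def looks_like_email(candidate):
    """Return True when the whole string fits the basic pattern."""
    return EMAIL_RE.fullmatch(candidate) is not None

# Hypothetical sample addresses chosen to probe the pattern's limits.
samples = [
    "user@example.com",                 # ordinary address
    "first.last+tag@mail.example.org",  # dots, plus sign, subdomain
    "user@example.photography",         # real TLD longer than six letters
    '"quoted name"@example.com',        # quoted local part
    "user@@example.com",                # malformed: two @ symbols
]

results = {s: looks_like_email(s) for s in samples}
for address, ok in results.items():
    print(f"{address!r}: {ok}")
```

The first two addresses pass and the last three are rejected. Only the final rejection is clearly desirable, which is why a dedicated validation library or a confirmation email is often the safer choice.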

Question: What is the difference between ^ and $ in regular expressions?

Answer:

In regular expressions, ^ and $ are anchors that specify the position of a match within a string. They are used to assert that the match occurs at the beginning or end of the string, respectively. They do not match any characters themselves, but instead mark where the match should happen.

1. ^ (Caret or Circumflex) - Matches the start of the string:

  • The ^ anchor asserts that the pattern must appear at the beginning of the string.
  • It ensures that the pattern matches only if it occurs right at the start, before any other characters.

Example:

  • Regex: ^abc
    • This will match any string that starts with abc, such as:
      • "abc123"
      • "abcxyz"
    • It will not match:
      • "123abc"
      • "xyzabc"

2. $ (Dollar Sign) - Matches the end of the string:

  • The $ anchor asserts that the pattern must appear at the end of the string.
  • It ensures that the pattern matches only if it occurs right at the end of the string, following all other characters.

Example:

  • Regex: abc$
    • This will match any string that ends with abc, such as:
      • "123abc"
      • "xyzabc"
    • It will not match:
      • "abc123"
      • "abcxyz"

Combining ^ and $:

When both ^ and $ are used in the same regular expression, they ensure that the entire string matches the pattern from start to finish. This is useful when you want to match the whole string exactly.

Example:

  • Regex: ^abc$
    • This will match only the string "abc" and nothing else:
      • "abc" ➔ match
      • "abc123" ➔ no match
      • "123abc" ➔ no match
      • "xyzabc" ➔ no match

Summary:

  • ^: Asserts that the pattern matches at the beginning of the string.
  • $: Asserts that the pattern matches at the end of the string.
  • Using ^ and $ together ensures that the entire string matches the pattern, with no characters before or after it.

These anchors help to precisely define where the match should occur in the string.
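
The three patterns from this answer can be compared side by side. The following is a minimal sketch using Python's re.search; the dictionary name anchor_results is just an illustrative choice.

```python
import re

samples = ["abc", "abc123", "123abc", "xyzabc"]

# For each sample, record whether it starts with, ends with,
# or consists entirely of "abc".
anchor_results = {
    s: (
        bool(re.search(r"^abc", s)),
        bool(re.search(r"abc$", s)),
        bool(re.search(r"^abc$", s)),
    )
    for s in samples
}

for s, (starts, ends, exact) in anchor_results.items():
    print(f"{s!r}: starts={starts} ends={ends} exact={exact}")
```

Only "abc" satisfies all three patterns; "abc123" satisfies only ^abc, and the other two satisfy only abc$.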

Question: What does the . (dot) mean in regular expressions, and what does it match?

Answer:

In regular expressions, the dot (.) is a metacharacter that matches any single character except for a newline (\n). It is a wildcard character, meaning it can stand in for any character in the input text (with the exception of a line break).

Key Points:

  • The dot (.) matches any single character except for a line break.
  • This includes letters, digits, symbols, spaces, and punctuation marks.
  • It does not match a newline character (\n) by default.

Examples:

  1. Regex: a.c

    • Matches any string that contains an a, followed by any single character, followed by c.
    • It will match:
      • "abc"
      • "axc"
      • "a1c"
      • "a c"
    • It will not match:
      • "ac" (since there is no character between a and c)
      • "a\nc" (because \n is a newline, and the dot does not match newlines by default)
  2. Regex: ^.*$

    • This is often used to match an entire line, from the beginning (^) to the end ($), including any characters in between.
    • It will match:
      • "hello"
      • "12345"
      • "a*b+c"
      • " " (a string with just a space)
    • It also matches an empty string, because .* allows zero characters. Without dot-all mode, it does not extend across a newline.
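
A quick way to check how the dot behaves is to test a.c against a handful of strings. This is a sketch using Python's re module with its default flags; the variable names are illustrative.

```python
import re

# Whole-string checks for the pattern a.c with default flags.
dot_results = {
    s: re.fullmatch(r"a.c", s) is not None
    for s in ["abc", "axc", "a1c", "a c", "ac", "a\nc"]
}
for s, ok in dot_results.items():
    print(f"{s!r}: {ok}")

# ^.*$ matches an empty string, because .* allows zero characters.
empty_match = re.search(r"^.*$", "")
print(empty_match is not None)  # True
```

The first four strings match; "ac" fails because the dot needs one character, and "a\nc" fails because the dot skips newlines by default.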

Modifying the Behavior of . (Dot):

By default, the dot (.) does not match newline characters. However, this behavior can be changed depending on the regular expression engine. Some engines allow you to modify the behavior of the dot in the following ways:

  • Dot-all (or single-line mode):

    • In some regex engines (like Python’s re or JavaScript), you can use a flag or modifier to make the dot match newlines as well.
    • In Python, you can use the re.DOTALL flag.
    • In JavaScript, you can use the s flag (dotall mode).

    Example (Python):

    import re
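    pattern = r'abc.123'  # The dot has to match the newline between "abc" and "123"
    print(re.match(pattern, "abc\n123"))                   # None: the dot skips \n by default
    print(re.match(pattern, "abc\n123", flags=re.DOTALL))  # Match object: DOTALL lets the dot match \n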

    Example (JavaScript):

    const regex = /abc.123/s;  // s flag enables dot-all mode
    console.log(regex.test("abc\n123"));  // true (false without the s flag)

Summary:

  • The dot (.) in regular expressions matches any single character except a newline.
  • It is useful as a wildcard to match varying characters in a string.
  • To match newlines, you may need to use specific flags or modes like DOTALL or s (depending on the language or regex engine).

Question: How do you match a digit in a regular expression?

Answer:

To match a digit in a regular expression, you can use the shorthand character class \d.

\d:

  • \d matches any digit from 0 to 9.
  • In engines or modes where \d is ASCII-only, it is equivalent to the character class [0-9], but more concise.

Examples:

  1. Regex: \d

    • Matches any single digit.
    • It will match:
      • "0", "1", "2", …, "9"
    • It will not match:
      • "a", "z", "!", " " (since these are not digits)
  2. Regex: \d{3}

    • Matches exactly three digits in a row.
    • It will match:
      • "123"
      • "000"
      • "456"
    • It will not match:
      • "12" (since it has fewer than 3 digits)
    • In "abc123", a search finds "123"; the match fails only when the pattern must start at the beginning of the string (for example, with Python's re.match).
  3. Regex: \d+

    • Matches one or more digits.
    • It will match:
      • "7"
      • "123"
      • "456789"
    • It will not match:
      • "abc" (since it’s not numeric)
  4. Regex: ^\d+$

    • Matches a string that contains only digits, from start to finish.
    • It will match:
      • "12345"
      • "987"
    • It will not match:
      • "abc123" (because it contains non-digit characters)
      • "123abc" (because it ends with non-digits)

Modifiers and Flags:

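  • Whether \d is limited to 0-9 depends on the engine. In JavaScript, and in Java by default, \d matches only the ASCII digits 0 to 9.
  • In Unicode-aware engines such as Python 3 (for str patterns) and .NET, \d also matches decimal digits from other scripts (for example, Arabic-Indic digits) unless an ASCII-only mode such as Python's re.ASCII is used. Symbols such as ² or ³ are not decimal digits and are not matched by \d.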

Summary:

  • \d is used to match any digit (0-9) in a regular expression.
  • You can use it with quantifiers (like {3}, +, etc.) to match multiple digits.
  • It is equivalent to the character class [0-9] and is a shorthand for easier regex writing.
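
The following sketch shows how the same digit patterns behave under Python's search, match, and fullmatch functions, and how Python's re.ASCII flag narrows \d to 0-9. The sample strings are made up for illustration, and other engines may treat non-ASCII digits differently.

```python
import re

# search() finds three digits anywhere; match() only at the start.
found_inside = re.search(r"\d{3}", "abc123")
at_start = re.match(r"\d{3}", "abc123")
print(found_inside.group())  # 123
print(at_start)              # None

# ^\d+$ style check: the whole string must be digits.
print(re.fullmatch(r"\d+", "12345") is not None)   # True
print(re.fullmatch(r"\d+", "123abc") is not None)  # False

# In Python 3, \d on str patterns also accepts other decimal digits,
# such as ARABIC-INDIC DIGIT THREE (U+0663), unless re.ASCII is set.
arabic_three = "\u0663"
unicode_digit = re.fullmatch(r"\d", arabic_three) is not None
ascii_only = re.fullmatch(r"\d", arabic_three, re.ASCII) is not None
print(unicode_digit, ascii_only)  # True False
```

When only 0-9 should be accepted, writing [0-9] explicitly or passing re.ASCII avoids surprises.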

Question: What is the purpose of the *, +, and ? quantifiers in regular expressions?

Answer:

In regular expressions, the quantifiers *, +, and ? are used to specify how many times an element (character or group) should appear in a match. These quantifiers give flexibility to define patterns for matching repeated characters or sequences.

1. * (Asterisk) - Zero or more occurrences:

  • The * quantifier matches zero or more occurrences of the preceding element (character, group, or character class).
  • It allows the preceding element to appear any number of times, including not at all (i.e., it can match an empty string).

Example: a*

  • This will match:
    • "" (empty string)
    • "a"
    • "aa"
    • "aaa"
    • And so on…
  • It never fails to match:
    • Because a* accepts zero occurrences, a string with no a (such as "xyz") still produces an empty match at the first position.

Example (full regex): ab*c

  • This matches:
    • "ac" (zero occurrences of b)
    • "abc"
    • "abbc"
    • "abbbc"
  • It will not match:
    • "ab" (missing the c at the end)

2. + (Plus) - One or more occurrences:

  • The + quantifier matches one or more occurrences of the preceding element.
  • It requires the preceding element to appear at least once, and can repeat any number of times after that.

Example: a+

  • This will match:
    • "a"
    • "aa"
    • "aaa"
    • And so on…
  • It will not match:
    • "" (empty string, because there must be at least one a)

Example (full regex): ab+c

  • This matches:
    • "abc" (one b)
    • "abbc" (two bs)
    • "abbbc" (three bs)
  • It will not match:
    • "ac" (since there’s no b between a and c)

3. ? (Question mark) - Zero or one occurrence:

  • The ? quantifier matches zero or one occurrence of the preceding element.
  • It means the element is optional; it can appear once or not at all.

Example: a?

  • This will match:
    • "" (empty string, because the a is optional)
    • "a"
  • It will not match:
    • "aa" as a whole string (a? consumes at most one a)

Example (full regex): ab?c

  • This matches:
    • "ac" (the b is optional)
    • "abc" (the b is present)
  • It will not match:
    • "abbbc" (more than one b)

Summary:

  • * (Asterisk): Matches zero or more occurrences of the preceding element. It allows for optional matches or repeated matches.
  • + (Plus): Matches one or more occurrences of the preceding element. It requires the element to appear at least once.
  • ? (Question mark): Matches zero or one occurrence of the preceding element. It makes the element optional.

These quantifiers are powerful tools for creating flexible patterns, especially when you’re dealing with data that may have variable lengths or optional elements.
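
The three quantifiers can be compared on the same inputs. This is a minimal sketch; the helper matches_whole is a hypothetical name that wraps Python's re.fullmatch so that each string must fit the pattern from start to finish.

```python
import re

def matches_whole(pattern, s):
    """Return True when the entire string fits the pattern."""
    return re.fullmatch(pattern, s) is not None

# Compare *, + and ? applied to the letter b.
for pattern in (r"ab*c", r"ab+c", r"ab?c"):
    row = {s: matches_whole(pattern, s) for s in ("ac", "abc", "abbbc")}
    print(pattern, row)

# a* can match zero characters, so a search never fails;
# here it returns an empty match at the start of the string.
zero_width = re.search(r"a*", "xyz")
print(repr(zero_width.group()))  # ''
```

Reading the rows: ab*c accepts all three strings, ab+c rejects "ac", and ab?c rejects "abbbc".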

Question: How do you match a word boundary in regex?

Answer:

In regular expressions, you can use the word boundary metacharacter \b to match a word boundary. A word boundary matches the position between a word character (letters, digits, or underscore) and a non-word character (anything other than letters, digits, or underscores).

What is a Word Boundary?

  • A word boundary (\b) matches a position where:
    • A word character (\w) is next to a non-word character (\W), or
    • A word character sits at the very start or end of the string.
  • Word characters are defined as [a-zA-Z0-9_] (letters, digits, and underscore).
  • Non-word characters are anything else, such as spaces and punctuation.

Example Usage of \b:

  1. Regex: \bword\b

    • This will match the exact word "word" but only if it is a whole word, not part of another word.
    • It will match:
      • "word"
      • "word!" (since ! is a non-word character)
      • "word "
    • It will not match:
      • "sword" (because the s is adjacent to word, so there’s no boundary)
      • "wording" (since the g is part of the word and there’s no boundary)
  2. Regex: \b\w+\b

    • This matches any whole word.
    • It will match:
      • "apple"
      • "hello"
      • "1234"
      • "word_" (the underscore is considered a word character)
    • It will not include surrounding non-word characters:
      • In " apple ", the match is "apple"; the spaces are non-word characters and mark the boundaries.
  3. Regex: \b\d+\b

    • This matches a whole number.
    • It will match:
      • "123"
      • "42"
    • It will not match:
      • "123abc" (there is no word boundary between 3 and a, because both are word characters)

Usage with Negations or Other Word Characters:

  • You can use \b in combination with other regex patterns to ensure the match is at the start or end of a word.

For example, to match a word starting with “test” but not inside other words:

\btest\w*\b
  • This matches:
    • "test"
    • "testing"
    • "test123"
  • But will not match:
    • "pretest"

Summary:

  • \b is a word boundary anchor in regular expressions.
  • It matches the position between a word character and a non-word character, or the start/end of the string.
  • It is useful for matching whole words or ensuring specific patterns occur at word boundaries.
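
The word-boundary patterns from this answer can be checked with Python's re.findall. The sample sentences below are invented for illustration.

```python
import re

# \bword\b finds "word" only as a whole word.
whole_word = re.findall(r"\bword\b", "word sword wording word!")
print(whole_word)  # ['word', 'word']

# \b\w+\b finds each whole word; spaces are left out of the matches.
all_words = re.findall(r"\b\w+\b", " apple  pie_2 ")
print(all_words)  # ['apple', 'pie_2']

# \b\d+\b finds standalone numbers, skipping digits glued to letters.
whole_numbers = re.findall(r"\b\d+\b", "42 123abc 7")
print(whole_numbers)  # ['42', '7']

# \btest\w*\b finds words that start with "test".
test_words = re.findall(r"\btest\w*\b", "test testing pretest test123")
print(test_words)  # ['test', 'testing', 'test123']
```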

Question: Can you explain the use of [] in regular expressions?

Answer:

In regular expressions, square brackets [] are used to define a character class. A character class is a set of characters that can match any one character from the set. The characters inside the square brackets define the range or set of characters that can be matched.

Key Points about []:

  • Inside [], you list the characters that are allowed to match at that position in the string.
  • The regex will match any one of the characters inside the brackets.
  • Character classes allow you to match specific sets or ranges of characters without having to list every possible character individually.

Basic Usage of []:

  1. Matching a specific set of characters:

    • Regex: [abc]
      • This matches any one of the characters a, b, or c.
      • It will match:
        • "a"
        • "b"
        • "c"
      • It will not match:
        • "d"
        • "ab" as a whole string (the class matches exactly one character)
  2. Character ranges:

    • You can specify a range of characters using a hyphen (-).
    • Regex: [a-z]
      • This matches any lowercase letter from a to z.
      • It will match:
        • "a", "b", "z", "f"
      • It will not match:
        • "A" (uppercase letter)
        • "1" (a number)
    • Regex: [A-Z]
      • This matches any uppercase letter from A to Z.
  3. Matching digits:

    • Regex: [0-9]
      • This matches any digit from 0 to 9.
      • It will match:
        • "1", "5", "9"
      • It will not match:
        • "a" (letter)
        • "!" (symbol)
  4. Including multiple types of characters:

    • You can mix and match characters, ranges, and symbols.
    • Regex: [a-zA-Z0-9]
      • This matches any letter (lowercase or uppercase) or digit.
      • It will match:
        • "a", "B", "1", "z"
      • It will not match:
        • "@", " " (symbols or spaces)
  5. Negating a character class (using ^ inside []):

    • When you place a caret (^) at the beginning of the character class, it negates the class, meaning it matches any character except those inside the brackets.
    • Regex: [^a-z]
      • This matches any character that is not a lowercase letter.
      • It will match:
        • "1", "!", " ", "Z"
      • It will not match:
        • "a", "b", "z"
  6. Special Characters Inside []:

    • Inside square brackets, most special characters lose their special meaning (except for -, ^, ], and \), so you don’t need to escape them.

    • For example, the dot (.) inside [] will literally match a period (.), not any character.

      • Regex: [.]
        • This matches a literal period (".").
    • If you want to include a hyphen (-) in the character class, you can either place it at the beginning or end, or escape it with a backslash.

      • Regex: [-a-z] or [a-z-] (matches any lowercase letter or a hyphen -)
      • Regex: [a\-z] (matches the characters a, -, or z)

Examples:

  1. Regex: [aeiou]

    • Matches any vowel (lowercase).
  2. Regex: [a-zA-Z]

    • Matches any letter (either lowercase or uppercase).
  3. Regex: [^0-9]

    • Matches any character that is not a digit.
  4. Regex: [A-Za-z0-9_]

    • Matches any alphanumeric character or underscore.

Summary:

  • [] is used to define a character class in regular expressions.
  • It allows you to specify a set or range of characters that can match a position in the string.
  • Ranges can be specified using a hyphen (-), such as [a-z] or [0-9].
  • The caret (^) at the beginning of the class negates it, matching characters not in the specified set.
  • Most special characters inside [] lose their special meaning, except for a few (like -, ^, ], and \).
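
Several of the character classes above can be tried in one short sketch. It uses Python's re.findall, and the input strings are made up for illustration.

```python
import re

# [aeiou] matches lowercase vowels only (the capital E is skipped).
vowels = re.findall(r"[aeiou]", "Regular Expression")
print(vowels)  # ['e', 'u', 'a', 'e', 'i', 'o']

# [^0-9] matches every character that is not a digit.
non_digits = re.findall(r"[^0-9]", "a1b2")
print(non_digits)  # ['a', 'b']

# Inside brackets the dot is literal.
dots = re.findall(r"[.]", "a.b")
print(dots)  # ['.']

# A hyphen placed last in the class is literal.
hyphenated = re.findall(r"[a-z-]+", "well-known fact")
print(hyphenated)  # ['well-known', 'fact']

# Letters, digits and underscore, repeated one or more times.
identifiers = re.findall(r"[A-Za-z0-9_]+", "snake_case, CamelCase!")
print(identifiers)  # ['snake_case', 'CamelCase']
```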

Question: What do \d, \w, and \s represent in regex?

Answer:

In regular expressions, \d, \w, and \s are shorthand character classes that represent specific sets of characters. They are used to match common types of characters in a string.

1. \d - Matches any digit (0-9):

  • \d is a shorthand for any digit.
  • It matches any single character that is a digit from 0 to 9.

Examples:

  • Regex: \d
    • It will match:
      • "0", "5", "9", "3"
    • It will not match:
      • "a", "!", " " (letters or non-numeric characters)
  • Regex: \d{3}
    • This matches exactly three digits in a row.
    • It will match:
      • "123"
      • "456"
    • It will not match:
      • "12" (only two digits)
      • "abc" (non-digit characters)

2. \w - Matches any “word” character:

  • \w matches any word character. It includes:
    • Letters (both uppercase and lowercase): a-z, A-Z
    • Digits: 0-9
    • Underscore: _
  • \w is typically used to match a valid part of a word, like variable names in code or identifiers.

Examples:

  • Regex: \w
    • It will match:
      • "a", "A", "z", "9", "_"
    • It will not match:
      • "!", " ", "$", "@" (symbols and spaces)
  • Regex: \w+
    • This matches one or more word characters.
    • It will match:
      • "apple"
      • "123abc"
      • "a_123"
    • It will not match:
      • "apple!" as a whole string (! is not a word character; a search still matches "apple")

3. \s - Matches any whitespace character:

  • \s matches any whitespace character.
  • Whitespace characters include:
    • Spaces: " "
    • Tabs: "\t"
    • Newlines: "\n"
    • Carriage returns: "\r"
    • Form feeds: "\f"

Examples:

  • Regex: \s
    • It will match:
      • " ", "\t", "\n"
    • It will not match:
      • "a", "1", "_" (non-whitespace characters)
  • Regex: \s+
    • This matches one or more whitespace characters.
    • It will match:
      • " ", "\t", "\n", " \t\n" (a single whitespace character or a mixed run of them)
    • It will not match:
      • "apple" (since it contains no whitespace)

Summary:

  • \d: Matches any digit (0-9).
  • \w: Matches any word character, which includes letters (uppercase and lowercase), digits, and the underscore (_).
  • \s: Matches any whitespace character, such as spaces, tabs, newlines, and other space characters.

These shorthand character classes are very useful for matching common types of characters without having to explicitly list them out.
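
As a small sketch, the three shorthand classes can be applied to one line of text. The sample line is hypothetical, and the code uses Python's re module.

```python
import re

# Hypothetical record: an identifier, a tab, a name, and a trailing newline.
line = "id_42\tAlice  Smith\n"

digits = re.findall(r"\d+", line)         # runs of digits
words = re.findall(r"\w+", line)          # runs of word characters
spaces = re.findall(r"\s", line)          # each whitespace character
fields = re.split(r"\s+", line.strip())   # split on runs of whitespace

print(digits)  # ['42']
print(words)   # ['id_42', 'Alice', 'Smith']
print(spaces)  # ['\t', ' ', ' ', '\n']
print(fields)  # ['id_42', 'Alice', 'Smith']
```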

Question: How do you perform a case-insensitive search using a regular expression?

Answer:

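To perform a case-insensitive search using a regular expression, you can use the i flag (also called the case-insensitive modifier). This modifier makes the regex engine treat uppercase and lowercase characters as equivalent when matching patterns. How the flag is written depends on the language: /pattern/i in JavaScript, Perl, Ruby, and PHP, or an option such as re.IGNORECASE in Python.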

  1. Using the i flag in the regex:

    • The i flag is added to the regex pattern, which makes the whole regex case-insensitive.
    • This means the pattern will match characters regardless of whether they are uppercase or lowercase.

    Syntax: /pattern/i

    Example 1: Case-insensitive search for the word “hello”:

    • Regex: /hello/i
      • This will match:
        • "hello"
        • "HELLO"
        • "HeLLo"
        • "hElLo"
      • It will not match:
        • Any string that doesn’t contain “hello” in some form.

    Example 2: Case-insensitive search for a digit:

    • Regex: /\d/i
      • Since the digit \d is not affected by case, the i flag won’t change this, but if you were searching for letters or words, it would make them case-insensitive.
  2. Specifying case-insensitivity inside the pattern with an inline flag:

    • Some programming languages or tools allow you to specify case-insensitivity for individual characters or groups within a regex pattern by using specific syntax.
    • For example, in some regex engines, you might use (?i) at the start of a pattern or in specific parts of the pattern.

    Example: In some regex engines, you can apply (?i) at the start:

    • Regex: (?i)hello
      • This will match:
        • "hello", "HELLO", "HeLLo", etc.
      • It makes the whole pattern case-insensitive.

Summary:

  • To perform a case-insensitive search in regex, you use the i flag (or modifier).
  • For example, /pattern/i makes the regex case-insensitive, allowing it to match uppercase and lowercase characters interchangeably.
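
Python has no /pattern/i literal syntax, so the flag is passed as an argument instead. The sketch below shows both the re.IGNORECASE flag and the inline (?i) form; the sample strings are illustrative.

```python
import re

# Flag passed as an argument: the Python counterpart of /hello/i.
hello_re = re.compile(r"hello", re.IGNORECASE)
flag_results = [
    hello_re.search(s) is not None
    for s in ["hello", "HELLO", "HeLLo", "goodbye"]
]
print(flag_results)  # [True, True, True, False]

# Inline flag at the start of the pattern has the same effect.
inline_match = re.search(r"(?i)hello", "Say HELLO")
print(inline_match.group())  # HELLO
```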

Question: What is the difference between greedy and lazy matching in regex?

Answer:

In regular expressions, greedy and lazy matching refer to how much of the input string the regex engine tries to match when dealing with quantifiers (*, +, ?, {n,m}).

1. Greedy Matching:

  • Greedy matching is the default behavior of most quantifiers in regex.
  • A greedy quantifier tries to match as much of the input string as possible while still allowing the overall regex to match the full pattern.

Example of Greedy Matching:

  • Regex: "<.*>"
    • This pattern matches a string that starts with <, ends with >, and contains any characters in between.
    • The .* is a greedy quantifier that matches as many characters as possible (including zero), as long as the overall pattern matches.
    • Input: "<div>Hello</div>"
      • The regex will match the entire string: "<div>Hello</div>".
    • Why? The .* will match everything from the first < to the last >, making the match as long as possible.

Greedy Quantifiers:

  • * (zero or more)
  • + (one or more)
  • {n,m} (between n and m repetitions)

2. Lazy (or Non-Greedy) Matching:

  • Lazy matching (also known as non-greedy or reluctant matching) tries to match as little of the input string as possible while still allowing the overall regex to match the pattern.
  • Lazy quantifiers match the smallest possible number of characters.

Example of Lazy Matching:

  • Regex: "<.*?>" (Note the ? after the * to make it lazy)
    • This pattern behaves similarly to the greedy version, but the .*? is a lazy quantifier.
    • The .*? will try to match as few characters as possible while still allowing the whole pattern to match.
    • Input: "<div>Hello</div>"
      • The regex will match the shortest possible match: "<div>".
    • Why? The .*? matches as little as possible to allow the rest of the pattern to match, so it stops at the first >, rather than going all the way to the last >.

Lazy Quantifiers:

  • *? (zero or more)
  • +? (one or more)
  • {n,m}? (between n and m repetitions)

Summary of Key Differences:

| Aspect | Greedy Matching | Lazy (Non-Greedy) Matching |
|---|---|---|
| Behavior | Matches as much as possible while still allowing the overall pattern to match. | Matches as little as possible while still allowing the overall pattern to match. |
| Quantifiers | *, +, {n,m} | *?, +?, {n,m}? |
| Example | "<.*>" matches "<div>Hello</div>" (as much as possible). | "<.*?>" matches "<div>" (as little as possible). |
| Use Case | Useful when you want to match the longest possible string that fits the pattern. | Useful when you want to match the shortest string that fits the pattern. |

When to Use:

  • Greedy matching is helpful when you want to capture large chunks of data (for example, everything from the first opening tag to the last closing tag).
  • Lazy matching is useful when you want to match specific portions of data within larger blocks (for example, when you’re extracting content between tags but don’t want to over-capture).

In summary:

  • Greedy quantifiers capture the largest possible match.
  • Lazy quantifiers capture the smallest possible match.
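
The difference is easy to see by running both patterns on the same input. This is a minimal sketch in Python; the last pattern, which uses a negated character class, is an alternative added here for comparison and is not part of the answer above.

```python
import re

html = "<div>Hello</div>"

# Greedy: .* runs to the last > in the string.
greedy = re.search(r"<.*>", html).group()
print(greedy)  # <div>Hello</div>

# Lazy: .*? stops at the first > it can.
lazy = re.search(r"<.*?>", html).group()
print(lazy)  # <div>

# With findall, the lazy pattern returns each tag separately.
lazy_all = re.findall(r"<.*?>", html)
print(lazy_all)  # ['<div>', '</div>']

# Alternative (an assumption for comparison): a negated class
# gives the same result here without relying on laziness.
negated_all = re.findall(r"<[^>]*>", html)
print(negated_all)  # ['<div>', '</div>']
```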

Question: How would you extract all the digits from a given string using regex?

Answer:

To extract all the digits from a given string using regex, you can use the \d shorthand (which matches any digit) combined with the global matching technique, depending on the programming language you’re using.

Basic Regex:

  • Regex: \d+
    • \d matches a single digit (0-9).
    • + ensures that it matches one or more digits (i.e., a sequence of digits).

This regex will match any continuous sequence of digits in the string.

Steps to Extract All Digits:

  1. Use the regex \d+ to find all matches of one or more digits in the string.
  2. Depending on the programming language or tool, you will use a method to find all occurrences of this pattern.

Example in Different Programming Languages:

1. JavaScript:

In JavaScript, you can use match() with the global flag g to find all matches.

let str = "The price is 123 and the discount is 45.";
let digits = str.match(/\d+/g);
console.log(digits); // Output: ["123", "45"]
  • The \d+ matches sequences of digits, and the g flag ensures that it finds all matches.

2. Python:

In Python, you can use re.findall() to extract all digits.

import re
text = "The price is 123 and the discount is 45."  # named text to avoid shadowing the built-in str
digits = re.findall(r'\d+', text)
print(digits)  # Output: ['123', '45']
  • re.findall(r'\d+', text) returns a list of all sequences of digits found in the string.

3. Ruby:

In Ruby, you can use scan to find all matches.

str = "The price is 123 and the discount is 45."
digits = str.scan(/\d+/)
p digits  # Output: ["123", "45"]
  • scan returns an array of all matching substrings.

4. Java:

In Java, you can use Pattern and Matcher to extract all digits.

import java.util.regex.*;
import java.util.*;

public class Main {
    public static void main(String[] args) {
        String str = "The price is 123 and the discount is 45.";
        Pattern p = Pattern.compile("\\d+");
        Matcher m = p.matcher(str);
        
        List<String> digits = new ArrayList<>();
        while (m.find()) {
            digits.add(m.group());
        }
        
        System.out.println(digits);  // Output: [123, 45]
    }
}
  • The Pattern.compile("\\d+") finds all digit sequences, and m.find() is used to iterate over the matches.

5. PHP:

In PHP, you can use preg_match_all() to get all digits.

$str = "The price is 123 and the discount is 45.";
preg_match_all('/\d+/', $str, $matches);
print_r($matches[0]);  // Output: Array ( [0] => 123 [1] => 45 )
  • preg_match_all('/\d+/', $str, $matches) finds all sequences of digits and stores them in the $matches array.

Summary:

To extract all digits from a string, use the regular expression \d+, which matches one or more digits in a row. The specific method for extracting all matches will depend on the programming language you’re using, but generally, you need to use a function or method that supports global search or multiple matches.

Question: How do you use parentheses () in regular expressions?

Answer:

In regular expressions, parentheses () are used to group parts of a pattern together and capture the matched portion of the string for later use. Parentheses have two primary functions in regex:

  1. Grouping: Parentheses group parts of the pattern so that they can be treated as a single unit. This is helpful when applying quantifiers to specific parts of the pattern.
  2. Capturing: Parentheses create capture groups, which allow you to extract, reference, or manipulate parts of the matched string.

1. Grouping:

  • Parentheses allow you to group expressions together. You can then apply quantifiers (such as *, +, {n,m}) to the entire group rather than to individual elements.

Example:

  • Regex: (\d+)-(\d+)

    • This matches a pattern where two groups of digits are separated by a hyphen, such as "123-456".
    • The parentheses group the digits before and after the hyphen separately.

    Explanation:

    • (\d+) matches one or more digits and groups them together.
    • - matches the hyphen literally.
    • (\d+) matches another group of digits.

    Example Input: "123-456"

    • The regex will match and capture the two groups:
      • Group 1: "123"
      • Group 2: "456"

2. Capturing Groups:

  • When parentheses are used, the part of the string matched by the group is captured and can be referenced later (either within the regex or programmatically).
  • Capture groups are numbered starting from 1, based on the order in which they appear in the regex.

Example:

  • Regex: (\w+)@(\w+)\.com

    • This matches an email address pattern like user@example.com and captures the username and domain separately.
    • (\w+) captures the username.
    • (\w+) captures the domain.

    Example Input: "user@example.com"

    • Group 1 (username): "user"
    • Group 2 (domain): "example"

3. Accessing Capture Groups in Code:

  • In most programming languages, after a regex match, you can access the captured groups (submatches) using specific methods or properties.

Example in Python:

import re

text = "user@example.com"
pattern = r"(\w+)@(\w+)\.com"
match = re.search(pattern, text)

if match:
    username = match.group(1)  # Group 1: username
    domain = match.group(2)    # Group 2: domain
    print(f"Username: {username}, Domain: {domain}")
  • Output: Username: user, Domain: example

Example in JavaScript:

let str = "user@example.com";
let regex = /(\w+)@(\w+)\.com/;
let match = str.match(regex);

if (match) {
    let username = match[1];  // Group 1: username
    let domain = match[2];    // Group 2: domain
    console.log(`Username: ${username}, Domain: ${domain}`);
}
  • Output: Username: user, Domain: example

4. Non-Capturing Groups:

  • If you only want to group part of the pattern without capturing it, you can use a non-capturing group by using (?:...). This is useful when you want to group the pattern but don’t need to access the matched content.

Example:

  • Regex: (?:\d+)-(?:\d+)
    • This matches a pattern like "123-456", but does not capture the digits into separate groups.
    Example Input: "123-456"
    • The match is found, but no separate capture groups are created.
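
To make the difference concrete, here is a minimal Python sketch comparing the capturing and non-capturing versions of the same pattern. The variable names are illustrative only.

```python
import re

text = "123-456"

# Capturing version: both digit runs are stored as groups 1 and 2.
capturing = re.search(r"(\d+)-(\d+)", text)
capturing_groups = capturing.groups()

# Non-capturing version: same overall match, but no groups are stored.
non_capturing = re.search(r"(?:\d+)-(?:\d+)", text)
non_capturing_groups = non_capturing.groups()

print(capturing.group(0), capturing_groups)          # 123-456 ('123', '456')
print(non_capturing.group(0), non_capturing_groups)  # 123-456 ()
```

Both patterns match the same text; only the stored groups differ.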

5. Backreferences:

  • A backreference allows you to refer to a captured group within the same regular expression. This is useful for matching repeated patterns.

Example:

  • Regex: (\w+)\s+\1

    • This matches a word followed by one or more spaces, then the same word again.
    • (\w+) captures the first word.
    • \1 refers to the first captured group and matches the same word again.

    Example Input: "hello hello"

    • This matches "hello hello" because the second “hello” is the same as the first one.
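
    As a rough sketch of how this might be used, the Python snippet below finds doubled words in a sentence. The \b word boundaries are an addition not present in the pattern above; they are assumed here so that the "is" at the end of "This" is not treated as the start of a repeat.

```python
import re

# Backreference \1 must match the same text that group 1 captured.
# The \b anchors are an added assumption to restrict matches to whole words.
doubled_word = re.compile(r"\b(\w+)\s+\1\b")

text = "This is is a test of the the pattern."
repeated = [m.group(1) for m in doubled_word.finditer(text)]
print(repeated)  # ['is', 'the']

# The original pattern on the original example input.
simple = re.search(r"(\w+)\s+\1", "hello hello")
print(simple.group(0))  # hello hello
```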

Summary:

  • Parentheses () in regex serve two main purposes:
    1. Grouping: They allow you to group parts of a pattern, applying quantifiers to the group as a whole.
    2. Capturing: They capture parts of the input string for later reference or extraction.
  • Use (?:...) for non-capturing groups when you want to group patterns without storing the match.
  • You can access captured groups programmatically after a successful match.

Question: How can you match a specific number of occurrences of a character in a regular expression?

Answer:

To match a specific number of occurrences of a character (or group of characters) in a regular expression, you can use quantifiers that specify the exact number of times the preceding character or group should appear.

There are three common ways to specify how many times something must occur:

1. Using {n}:

  • The {n} quantifier matches exactly n occurrences of the preceding character or group.

Syntax:
X{n}

  • Where X is the character (or group) you want to match, and n is the exact number of occurrences.

Example:

  • Regex: a{3}
    • This matches exactly three a characters in a row.
    • Matches: "aaa"
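    • Does not match: "aa". In "aaaa", an unanchored search still finds "aaa" inside the longer run; anchor the pattern (^a{3}$) to reject it.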

2. Using {n,m} (Range):

  • The {n,m} quantifier matches between n and m occurrences of the preceding character or group, where n is the minimum number of occurrences, and m is the maximum.

Syntax:
X{n,m}

  • Where X is the character (or group) to match, n is the minimum number of occurrences, and m is the maximum number of occurrences.

Example:

  • Regex: a{2,4}
    • This matches between two and four a characters in a row.
    • Matches: "aa", "aaa", "aaaa"
    • Does not match: "a". In "aaaaa", an unanchored search still finds "aaaa"; anchor the pattern (^a{2,4}$) to reject longer runs.

3. Using {n,} (Minimum Occurrences):

  • The {n,} quantifier matches at least n occurrences of the preceding character or group, with no upper limit.

Syntax:
X{n,}

  • Where X is the character (or group) to match, and n is the minimum number of occurrences.

Example:

  • Regex: a{2,}
    • This matches two or more a characters in a row.
    • Matches: "aa", "aaa", "aaaa", "aaaaa", etc.
    • Does not match: "a"

Summary of Quantifiers for Matching Specific Occurrences:

  • X{n}: Matches exactly n occurrences of the character/group.
    • Example: a{3} matches "aaa" but not "aa"; anchored as ^a{3}$, it also rejects "aaaa".
  • X{n,m}: Matches between n and m occurrences.
    • Example: a{2,4} matches "aa", "aaa", and "aaaa", but not "a"; anchored as ^a{2,4}$, it also rejects "aaaaa".
  • X{n,}: Matches at least n occurrences.
    • Example: a{2,} matches "aa", "aaa", "aaaa", and so on.

These quantifiers are useful when you need to match a specific number, or a range, of occurrences of a pattern in your string.
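
The following Python sketch checks the three quantifier forms against a few sample strings. It uses re.fullmatch so that the whole string must satisfy the pattern; the helper name full_matches is hypothetical.

```python
import re

samples = ["a", "aa", "aaa", "aaaa", "aaaaa"]

def full_matches(pattern):
    """Return the samples that the pattern matches in their entirety."""
    return [s for s in samples if re.fullmatch(pattern, s)]

exactly_three = full_matches(r"a{3}")   # ['aaa']
two_to_four = full_matches(r"a{2,4}")   # ['aa', 'aaa', 'aaaa']
at_least_two = full_matches(r"a{2,}")   # ['aa', 'aaa', 'aaaa', 'aaaaa']

# An unanchored search still finds three a's inside a longer run.
inside_longer_run = re.search(r"a{3}", "aaaa").group(0)  # 'aaa'

print(exactly_three, two_to_four, at_least_two, inside_longer_run)
```

The last line shows why anchoring (or fullmatch) matters when the count must apply to the whole string.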

Question: What is the difference between | (alternation) and [] (character class) in regex?

Answer:

In regular expressions, both alternation (|) and character classes ([]) are used to match different patterns, but they serve distinct purposes and behave differently. Here’s a breakdown of the differences between them:


1. Alternation (|):

  • Purpose: The alternation operator (|) allows you to match one of several alternatives. It acts like a logical OR between two or more patterns.
  • How it works: In backtracking engines such as those in Python and JavaScript, the alternatives are tried from left to right and the first one that succeeds at the current position is used, so each match corresponds to exactly one alternative.
  • Syntax:
    A|B
    This means “match either A or B.”

Example:

  • Regex: cat|dog

    • This will match either "cat" or "dog".
    • Matches: "cat", "dog"
    • Does not match: "bat", "cog"
  • More complex example: (a|b)+

    • This will match one or more repetitions of either a or b. For instance, it will match "a", "b", "ab", "ba", "aa", "bb", etc.

2. Character Class ([]):

  • Purpose: A character class (denoted by square brackets []) allows you to match any one of a set of characters. It defines a set of characters, and the regex will match any single character from that set.
  • How it works: The characters inside the square brackets are matched individually, and the regex engine will look for one occurrence of any of those characters at that position.
  • Syntax:
    [A-B]
    This means “match any single character in the range from A to B.”

Example:

  • Regex: [cC]at

    • This will match either "cat" or "Cat".
    • Matches: "cat", "Cat"
    • Does not match: "bat", "rat"
  • More complex example: [aeiou]

    • This matches any single vowel (lowercase).
    • Matches: "a", "e", "i", "o", "u"
    • Does not match: "b", "z"
  • Ranges within brackets: [a-z] matches any lowercase letter, [0-9] matches any digit, etc.


Key Differences:

| Aspect | Alternation (\|) | Character Class ([]) |
|------------------|-----------------------------------------------------|-----------------------------------------------------|
| Purpose | Matches one of several alternative patterns. | Matches any one character from a set or range. |
| Syntax | A\|B (either A or B) | [A-Z] (any character from A to Z) |
| Matches | Matches one complete alternating pattern. | Matches any one character within the brackets. |
| Use case | Useful when you want to match multiple alternatives. | Useful when you want to match a specific set of characters. |
| Example | cat\|dog matches "cat" or "dog". | [a-e] matches "a", "b", "c", "d", or "e". |
| Matching Scope | Matches a whole pattern at once. | Matches a single character at a time. |
| Character Ranges | Does not support ranges. | Supports ranges, e.g., [a-z], [0-9]. |


When to Use Each:

  • Use | (alternation) when you need to match one of several different patterns. For example, if you want to match "cat", "dog", or "fish", you would use cat|dog|fish.
  • Use [] (character class) when you want to match any one character from a set of characters, such as vowels, digits, or a specific set of letters. For example, [aeiou] matches any single vowel.

Examples to Clarify:

Example 1: Matching “cat” or “dog”

  • Regex: cat|dog
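    • This will match "cat" or "dog" wherever the sequence appears, including inside longer words such as "catalog". Use \b(?:cat|dog)\b to match whole words only.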

Example 2: Matching any vowel in a string

  • Regex: [aeiou]
    • This matches any single vowel (lowercase) anywhere in the string.

Example 3: Matching a string that starts with either “a” or “b”

  • Regex: ^[ab]
    • This matches at the start of a string that begins with "a" or "b". The match itself consists of only that first character.

Conclusion:

  • Alternation (|) is used to match one of several patterns.
  • Character classes ([]) are used to match any single character within a set or range of characters.

Both tools are essential in regular expressions, but they serve different purposes and are applied in different contexts.
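
A short Python sketch of the contrast, using a made-up sample sentence. It also shows that for single characters the two forms behave the same, and that word boundaries are needed if only whole words should match.

```python
import re

text = "The cat chased a dog past the catalog."

# Alternation: each alternative is a complete sub-pattern.
animals = re.findall(r"cat|dog", text)              # also hits "cat" in "catalog"
whole_words = re.findall(r"\b(?:cat|dog)\b", text)  # whole words only

# Character class: exactly one character from the set per match.
vowels = re.findall(r"[aeiou]", "regex")

# For single characters, a|b and [ab] find the same matches.
same_result = re.findall(r"a|b", "cab") == re.findall(r"[ab]", "cab")

print(animals)      # ['cat', 'dog', 'cat']
print(whole_words)  # ['cat', 'dog']
print(vowels)       # ['e', 'e']
print(same_result)  # True
```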

Question: How do you match non-digit characters in regex?

Answer:

To match non-digit characters in a regular expression, you can use the \D metacharacter. Here’s a breakdown of how it works:

  • \D matches any character that is not a digit (0-9). It is the inverse of \d, which matches digits.

Example:

  • Regular Expression: \D
  • This will match any character except for digits (0-9).

Example Usage:

If you want to find non-digit characters in a string, you can use the following regex pattern:

\D+

This will match one or more consecutive non-digit characters.

Example in Python:

import re

text = "Hello123World!"
matches = re.findall(r'\D+', text)

print(matches)

Output:

['Hello', 'World!']

In this example, \D+ matches sequences of non-digit characters (“Hello” and “World!”).
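
Another common use, sketched here with a made-up phone number, is stripping everything that is not a digit by replacing each \D match with an empty string.

```python
import re

raw_phone = "(555) 123-4567"

# Remove every non-digit character, leaving only the digits.
digits_only = re.sub(r"\D", "", raw_phone)

print(digits_only)  # 5551234567
```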

Question: What does [^a-zA-Z] mean in a regular expression?

Answer:

In a regular expression, [^a-zA-Z] is a pattern that matches any character except for an uppercase or lowercase letter. Here’s a breakdown of the components:

  • []: The square brackets denote a character class, which matches any one character within the brackets.
  • a-z: This matches any lowercase letter from ‘a’ to ‘z’.
  • A-Z: This matches any uppercase letter from ‘A’ to ‘Z’.
  • ^: When placed at the beginning of the character class (right after the opening bracket [), it negates the character class, meaning “match any character except the ones inside the brackets.”

Explanation:

  • [^a-zA-Z] matches any character that is not a letter (neither uppercase nor lowercase). This includes digits, punctuation, spaces, and any other special characters.

Example:

If you apply this regular expression to the string "Hello123!", it will match the characters 1, 2, 3, and ! because these are the characters that are not letters.

Example in Python:

import re

text = "Hello123!"
matches = re.findall(r'[^a-zA-Z]', text)

print(matches)

Output:

['1', '2', '3', '!']

In this case, the regex [^a-zA-Z] matches the non-letter characters in the string "Hello123!".

Question: How do you match a pattern that includes special characters like ., *, ?, etc., literally in regex?

Answer:

In regular expressions, certain characters have special meanings, such as:

  • .: Matches any character except a newline.
  • *: Matches 0 or more repetitions of the preceding element.
  • ?: Makes the preceding element optional (0 or 1 occurrence).
  • [], (), {}, |, etc., are also special characters used for various regex operations.

To match these characters literally (i.e., as ordinary characters, not their special meanings), you need to escape them using a backslash (\).

How to escape special characters:

  • For example, to match a literal dot (.), use \. (a backslash followed by the dot).
  • To match a literal asterisk (*), you use \*.
  • To match a literal question mark (?), you use \?.

Example:

If you want to match the string Hello.*? exactly (including the literal . and * characters), you would use the following regular expression:

Hello\.\*\?

This matches the exact string Hello.*? where the dot, asterisk, and question mark are treated as literal characters rather than regex operators.

Example in Python:

import re

text = "Hello.*? world"
matches = re.findall(r'Hello\.\*\?', text)

print(matches)

Output:

['Hello.*?']

In this case, the regular expression Hello\.\*\? correctly matches the string "Hello.*?" as it includes the literal characters . (dot), * (asterisk), and ? (question mark).
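
When the literal text comes from a variable or user input, escaping by hand is error-prone. In Python, re.escape() can do it for you. The sketch below assumes a sample string of our own choosing and contrasts the escaped and unescaped patterns.

```python
import re

literal = "Hello.*?"
text = "Hello.*? world, Helloooo world"

# re.escape adds a backslash before each regex metacharacter.
escaped = re.escape(literal)
literal_matches = re.findall(escaped, text)

# Unescaped, ".*?" is a lazy wildcard, so the pattern matches just "Hello".
unescaped_matches = re.findall(literal, text)

print(escaped)            # Hello\.\*\?
print(literal_matches)    # ['Hello.*?']
print(unescaped_matches)  # ['Hello', 'Hello']
```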

Question: How can you use regular expressions for search and replace operations?

Answer:

Regular expressions are commonly used for search and replace operations, allowing you to search for specific patterns in a string and replace them with new text. Most programming languages and tools that support regular expressions provide a built-in way to perform these operations. Here’s a general overview of how to perform search and replace with regex.

Basic Syntax for Search and Replace:

  1. Search: Define the pattern you want to search for using a regular expression.
  2. Replace: Specify the replacement text (or pattern) that should replace the matched content.

The exact function depends on the language:

  • re.sub(pattern, replacement, string) in Python.
  • string.replace(pattern, replacement) or string.replaceAll(pattern, replacement) in JavaScript.

Example in Python (re.sub):

Python’s re.sub() function allows you to perform search and replace using regular expressions.

import re

# Sample text
text = "Hello world! It's a wonderful world."

# Regular expression pattern to search for the word "world"
pattern = r'world'

# Replacement text
replacement = "universe"

# Perform search and replace
new_text = re.sub(pattern, replacement, text)

print(new_text)

Output:

Hello universe! It's a wonderful universe.

In this example, the regex pattern r'world' is used to search for the word “world,” and it’s replaced with the word “universe.”

Using Group References in Replace:

You can also use group references in the replacement string to refer to parts of the matched pattern.

Example:

Suppose you want to swap the first name and last name in a string like "John Doe" to "Doe, John". Here’s how you could do it using regex groups.

import re

# Sample name
name = "John Doe"

# Pattern with groups for first and last name
pattern = r'(\w+)\s(\w+)'

# Replacement pattern to swap the names
replacement = r'\2, \1'

# Perform search and replace
new_name = re.sub(pattern, replacement, name)

print(new_name)

Output:

Doe, John

In this case:

  • The first (\w+) captures the first word (first name) and the second (\w+) captures the second word (last name).
  • In the replacement string, \1 refers to the first captured group (first name), and \2 refers to the second captured group (last name).

Example in JavaScript (replace):

In JavaScript, you can use the replace() method for search and replace operations.

let text = "Hello world! It's a wonderful world.";
let pattern = /world/g;  // g flag for global replacement
let replacement = "universe";

let newText = text.replace(pattern, replacement);

console.log(newText);

Output:

Hello universe! It's a wonderful universe.

Example with Group References in JavaScript:

JavaScript allows you to use captured groups in the replacement string as well.

let name = "John Doe";
let pattern = /(\w+)\s(\w+)/;
let replacement = '$2, $1';  // $1 and $2 refer to the first and second captured groups

let newName = name.replace(pattern, replacement);

console.log(newName);

Output:

Doe, John

Key Points:

  1. Search: Use a regex pattern to identify the text to be replaced.
  2. Replace: Provide the text to replace the matched pattern, or use references to capture groups for dynamic replacements.
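  3. Flags: In JavaScript, add the g (global) flag to replace every match rather than just the first. Python's re.sub() replaces all matches by default; pass count to limit the number of replacements.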

Regular expressions provide a powerful way to do complex search-and-replace operations that are not possible with simple string replacement functions.
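
As one illustration of a dynamic replacement, Python's re.sub() also accepts a function in place of the replacement string. The sketch below uses a made-up sentence and a hypothetical helper named double; it also shows the count argument limiting how many matches are replaced.

```python
import re

def double(match):
    """Return the matched number multiplied by two, as a string."""
    return str(int(match.group(0)) * 2)

# The function is called once per match and its return value is substituted.
doubled = re.sub(r"\d+", double, "3 apples and 10 pears")

# count=1 replaces only the first occurrence.
first_only = re.sub(r"world", "universe",
                    "Hello world! It's a wonderful world.", count=1)

print(doubled)     # 6 apples and 20 pears
print(first_only)  # Hello universe! It's a wonderful world.
```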

Question: How would you use a regular expression to match a URL?

Answer:

To match a URL using regular expressions, you need to account for several components of a URL, such as the protocol (http:// or https://), domain name, optional port number, optional path, query parameters, and fragments. A typical URL can have the following structure:

  • Protocol: http:// or https://
  • Domain: example.com
  • Optional Port: :8080
  • Optional Path: /path/to/resource
  • Optional Query String: ?key=value
  • Optional Fragment: #section

A regular expression to match a URL should account for all of these components, although in simpler cases, you may exclude some optional parts like the port number, query string, and fragment.

Basic Regex to Match a URL:

Here’s a basic regex pattern that matches a typical URL:

https?:\/\/(?:www\.)?[a-zA-Z0-9-]+\.[a-zA-Z]{2,6}(?:\/[^\s]*)?

Breakdown of the Regex:

  1. https?: Matches the protocol part (http or https). The s? means the s is optional.
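  2. :\/\/: Matches the :// part after the protocol. Neither : nor / is a regex metacharacter; the slashes are escaped only because / delimits regex literals in languages such as JavaScript. In Python, :// works unescaped.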
  3. (?:www\.)?: Optionally matches www. (the ?: indicates a non-capturing group, meaning it will not create a backreference).
  4. [a-zA-Z0-9-]+: Matches the domain name, which can consist of letters (both uppercase and lowercase), digits, and hyphens.
  5. \.[a-zA-Z]{2,6}: Matches the top-level domain (TLD), such as .com, .org, or .net. The {2,6} specifies that the TLD must be 2 to 6 letters long.
  6. (?:\/[^\s]*)?: Optionally matches the path part of the URL, starting with / and followed by any number of characters that are not whitespace.

Example in Python:

import re

# Sample text containing URLs
text = "Visit https://www.example.com for more info or http://example.org/path?query=1."

# Regex pattern to match URLs
pattern = r'https?:\/\/(?:www\.)?[a-zA-Z0-9-]+\.[a-zA-Z]{2,6}(?:\/[^\s]*)?'

# Find all URLs in the text
urls = re.findall(pattern, text)

print(urls)

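Output:

['https://www.example.com', 'http://example.org/path?query=1.']

Note that the second match includes the sentence-ending period: [^\s]* accepts any non-whitespace character, so trailing punctuation is captured along with the path. Strip it afterwards (for example with rstrip('.,')) or tighten the character class if that matters for your use case.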

A More Comprehensive Regex for Matching URLs:

If you want to match a wider variety of URLs, including those with ports, query parameters, or fragments, you can use a more complex regex pattern. Here’s a more comprehensive one:

https?:\/\/(?:www\.)?[a-zA-Z0-9-]+\.[a-zA-Z]{2,6}(?::\d+)?(?:\/[^\s]*)?(?:\?[^\s]*)?(?:#[^\s]*)?

Explanation:

  • (?:www\.)?: Optionally matches www..
  • [a-zA-Z0-9-]+\.[a-zA-Z]{2,6}: Matches the domain and TLD.
  • (?::\d+)?: Optionally matches a port number (e.g., :8080).
  • (?:\/[^\s]*)?: Optionally matches the path part, which starts with /.
  • (?:\?[^\s]*)?: Optionally matches a query string that starts with ?.
  • (?:#[^\s]*)?: Optionally matches a fragment that starts with #.

Example in Python:

import re

# Sample text containing different types of URLs
text = "Check out https://www.example.com:8080/path?query=1#fragment and http://example.org."

# Regex pattern for more comprehensive URL matching
pattern = r'https?:\/\/(?:www\.)?[a-zA-Z0-9-]+\.[a-zA-Z]{2,6}(?::\d+)?(?:\/[^\s]*)?(?:\?[^\s]*)?(?:#[^\s]*)?'

# Find all URLs in the text
urls = re.findall(pattern, text)

print(urls)

Output:

['https://www.example.com:8080/path?query=1#fragment', 'http://example.org']

Key Points:

  • The https? part of the regex matches either http or https.
  • The (?:www\.)? part optionally matches www. if it’s present.
  • The [a-zA-Z0-9-]+\.[a-zA-Z]{2,6} part matches the domain name and TLD.
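  • The (?:\/[^\s]*)? part optionally matches the path. Because [^\s]* accepts any non-whitespace character, it also consumes a query string and fragment that follow the path, so the separate (\?[^\s]*) and (#[^\s]*) groups only come into play when the URL has no path.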

This pattern covers a variety of URLs, but depending on the use case, you might need to adjust it to suit different domain formats, longer TLDs, or other components.
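
If you need the individual components rather than the whole URL, one option is a variant of the pattern with named groups. The sketch below is an adaptation, not the pattern shown above: the group names are our own choice, and the path class excludes ? and # so that the query string and fragment land in their own groups.

```python
import re

# Variant of the comprehensive pattern with named groups (names are illustrative).
URL_RE = re.compile(
    r"(?P<scheme>https?)://"
    r"(?P<host>(?:www\.)?[a-zA-Z0-9-]+\.[a-zA-Z]{2,6})"
    r"(?::(?P<port>\d+))?"
    r"(?P<path>/[^\s?#]*)?"          # path stops at ? or #
    r"(?:\?(?P<query>[^\s#]*))?"     # query stops at #
    r"(?:#(?P<fragment>[^\s]*))?"
)

full = URL_RE.search("Check out https://www.example.com:8080/path?query=1#fragment today")
parts = full.groupdict()

bare = URL_RE.search("See http://example.org.")
bare_parts = bare.groupdict()

print(parts["host"], parts["port"], parts["path"], parts["query"], parts["fragment"])
print(bare_parts["host"], bare_parts["port"])  # example.org None
```

Components that are absent from the URL come back as None. For anything beyond simple extraction, a dedicated parser such as urllib.parse.urlsplit from the standard library is usually a safer choice than a hand-written pattern.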
