Regular Expression

Overview

A regular expression (regex) is a compact notation for describing patterns in text strings. Beyond finding patterns, it supports iterating over matches, splitting strings into substrings, and replacing or restructuring text based on conditions. Text processing tasks that would otherwise need many lines of string-handling code can often be expressed in a single pattern.


Simple Expression

The most basic regex is a 'literal string' that matches characters exactly as they are. For example, the regex pattern foo finds the exact match for "foo" in an input string. In this case, the sentence "The food was quite tasty" contains "foo", so it is considered a match.

However, finding identical words is only the most basic use of regex. If you want to find "all words starting with 'f'" or "words with exactly 3 letters," literal strings are insufficient. Conditions like these require the quantifiers, metacharacters, and character classes described in the following sections. Below is an example of a literal expression.

Pattern    Input (Matches)
foo        foo, food, foot, "There's evil afoot."
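
As a minimal sketch of the literal match described above, the following uses Python's standard `re` module (the language choice and the extra input "fob" are assumptions made for illustration; the source does not prescribe an engine).

```python
import re

# Literal pattern: matches the characters f, o, o in sequence, anywhere in the input.
pattern = re.compile(r"foo")

inputs = ["foo", "food", "foot", "There's evil afoot.", "The food was quite tasty", "fob"]

# re.search scans the whole string and returns a match object or None.
results = {text: pattern.search(text) is not None for text in inputs}

for text, matched in results.items():
    print(f"{text!r}: {matched}")
```

Every input except "fob" contains the substring "foo", so only that one is reported as not matching.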


Quantifiers

Quantifiers specify how many times the preceding element may repeat within a pattern. The common symbolic (implicit) quantifiers are as follows:


* : 0 or more times (optional)
+ : 1 or more times (at least once)
? : 0 or 1 time (optional)


Quantifiers always modify the character immediately preceding them (to the left). To repeat multiple characters together, you must group them using parentheses ().


Pattern    Input (Matches)
fo*        foo, foe, food, fooot, "forget it", funny, puffy
fo+        foo, foe, food, foot, "forget it"
fo?        foo, foe, food, foot, "forget it", funny, puffy
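
The table can be checked with a short Python sketch. The helper name `matching` and the extra word "bar" are hypothetical, introduced only for this illustration. The last two lines show why parentheses are needed to repeat more than one character.

```python
import re

words = ["foo", "foe", "forget it", "funny", "puffy", "bar"]

def matching(pattern, candidates):
    """Return the candidates that contain a match for pattern."""
    return [w for w in candidates if re.search(pattern, w)]

star = matching(r"fo*", words)   # 'f' followed by zero or more 'o'
plus = matching(r"fo+", words)   # 'f' followed by one or more 'o'
opt = matching(r"fo?", words)    # 'f' followed by zero or one 'o'

# Without parentheses the quantifier applies only to 'b'; with them, to 'ab'.
single = re.fullmatch(r"ab+", "abab")      # None: only 'b' may repeat
grouped = re.fullmatch(r"(ab)+", "abab")   # match: the group 'ab' repeats
```

Because `*` and `?` both allow zero occurrences of "o", `fo*` and `fo?` match any input that contains an "f", which is why "funny" and "puffy" appear in their rows but not in the row for `fo+`.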

The ? character has a second role: placed directly after another quantifier (as in *?, +?, or {2,5}?), it makes that quantifier non-greedy (lazy), so it matches the fewest possible characters.
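
The difference between greedy and non-greedy matching is easiest to see side by side. This is a small Python sketch; the sample text is made up for the illustration.

```python
import re

text = "<b>bold</b> and <i>italic</i>"

# Greedy: + consumes as much as possible, so the match runs to the last '>'.
greedy = re.search(r"<.+>", text).group()

# Non-greedy: +? consumes as little as possible, so the match stops at the first '>'.
lazy = re.search(r"<.+?>", text).group()

print(greedy)  # <b>bold</b> and <i>italic</i>
print(lazy)    # <b>
```
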
Apart from symbolic implicit quantifiers, explicit quantifiers using curly braces {} allow specific repetition counts. For example, x{5} means 'x' is repeated exactly 5 times. Using a comma like x{5,} means '5 or more times', and x{5,8} means '5 to 8 times'.

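Pattern     Input (Matches)
ab{2}c      abbc, aaabbccc
ab{0,2}c    ac, abc, abbc, aabbcc
ab{2,3}c    abbc, abbbc, aabbcc, aabbbcc

The lower bound is written explicitly as {0,2} because not every engine accepts the {,2} form as a quantifier.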
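
The explicit quantifiers can be exercised the same way. In this Python sketch the labels and sample strings are assumptions chosen for illustration; `re.fullmatch` is used so that the number of "b" characters is tested exactly rather than as part of a longer string.

```python
import re

patterns = {
    "exactly 2": r"ab{2}c",
    "0 to 2": r"ab{0,2}c",
    "2 to 3": r"ab{2,3}c",
    "2 or more": r"ab{2,}c",
}
samples = ["ac", "abc", "abbc", "abbbc", "abbbbc"]

# fullmatch requires the whole sample to match the pattern.
hits = {
    label: [s for s in samples if re.fullmatch(pattern, s)]
    for label, pattern in patterns.items()
}

for label, matched in hits.items():
    print(f"{label}: {matched}")
```

Inputs such as "aaabbccc" in the table match only under a substring search (`re.search`), because the surrounding characters are not part of the pattern.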


Metacharacters

Characters with special meanings in regex are called metacharacters. The *, ?, +, and {} seen earlier fall into this category, along with $, ^, ., [, (, |, ), ], and \.

  • . (Dot): Matches any single character except newline characters. Useful for finding strings of a specific length or when the character in the middle of a pattern does not matter.
  • ^ (Caret) and $ (Dollar): Anchors that specify positions. ^ matches at the start of the string, and $ at the end. Used together, they verify that the entire input matches a pattern rather than merely containing it.
  • \ (Backslash): Used to disable the special function of a metacharacter and treat it as a literal, or to call a predefined set of special characters. For example, to find the actual string "c:\", you must escape it by using two backslashes like ^c:\\.
  • | (Pipe): Means "OR". a|b matches "a" or "b".
  • ( ) (Parentheses): Groups patterns. Used to apply quantifiers to the entire group or to reuse matched substrings later.

Below are examples of using metacharacters.

Pattern       Input (Matches)
.             a, b, c, 1, 2, 3 (any single character)
.*            Abc, 123, and any other string, including the empty string
^c:\\         c:\windows, c:\foo.txt (strings starting with c:\)
abc$          abc, 123abc (strings ending with abc)
(abc){2,3}    abcabc, abcabcabc
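
The metacharacters above can be tried out with a short Python sketch. The sample strings are assumptions for illustration. Raw string literals (`r"..."`) are used so that each backslash reaches the regex engine unchanged.

```python
import re

# . matches any single character except a newline.
dot = re.fullmatch(r"h.t", "hat")        # match
dot_nl = re.fullmatch(r"h.t", "h\nt")    # None: '.' does not match the newline

# ^ and $ anchor the pattern to the start and end of the input.
starts = re.search(r"^c:\\", r"c:\windows")   # match: the escaped backslash is literal
ends = re.search(r"abc$", "123abc")           # match
not_end = re.search(r"abc$", "abcd")          # None: 'abc' is not at the end

# | means OR.
either = re.fullmatch(r"cat|dog", "dog")      # match

# ( ) groups a sub-pattern so that a quantifier applies to the whole group.
repeated = re.fullmatch(r"(abc){2,3}", "abcabc")   # match
too_few = re.fullmatch(r"(abc){2,3}", "abc")       # None: only one repetition

# A group's matched text can be reused later in the pattern with a backreference (\1).
doubled = re.search(r"\b(\w+) \1\b", "it was was late")
print(doubled.group(1))  # was
```
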


Character Classes

Character classes are defined by enclosing characters in square brackets [ ] and represent the set of characters that can appear at that position. For example, [aeiou] matches any one vowel. Inside brackets, most metacharacters lose their special meaning and are treated as ordinary characters.

  • Range Specification (-): You can simply specify a range using a hyphen, like [0-9] instead of [0123456789]. Lowercase is expressed as [a-z], and uppercase as [A-Z].
  • Negation (^): If ^ appears as the first character inside brackets, it means "any character except these." Example: [^0-9] matches any character that is not a digit. (This is different from the ^ anchor that marks the start of the string.)

Below are examples of character classes.

Pattern         Input (Matches)
^b[aeiou]t$     bat, bet, bit, bot, but
^[0-9]{5}$      11111, 12345, 99999 (5-digit number)
^c:\\           c:\windows, c:\foo.txt
abc$            All strings ending with abc
(abc){2,3}      abcabc, abcabcabc
^[^-][0-9]$     10, 42, a5 (any non-hyphen character followed by one digit; excludes -0, -1)
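
The following Python sketch exercises the character-class patterns from the table. The variable names are hypothetical, and the inputs are chosen to show one accepted and one rejected case for each pattern.

```python
import re

vowel_word = re.compile(r"^b[aeiou]t$")
five_digits = re.compile(r"^[0-9]{5}$")
no_leading_hyphen = re.compile(r"^[^-][0-9]$")

# A class matches exactly one character from the set.
lower = vowel_word.search("bat")                            # match
upper = vowel_word.search("Bat")                            # None: matching is case-sensitive
upper_ci = re.search(r"^b[aeiou]t$", "Bat", re.IGNORECASE)  # match with the ignore-case flag

# A range plus a quantifier: exactly five digits between the anchors.
ok_zip = five_digits.search("12345")     # match
short_zip = five_digits.search("1234")   # None

# [^-] matches one character that is not a hyphen, so this pattern needs two characters.
two_chars = no_leading_hyphen.search("10")   # match
negative = no_leading_hyphen.search("-1")    # None: first character is a hyphen
one_char = no_leading_hyphen.search("0")     # None: only one character

# Inside brackets, metacharacters such as . * + are ordinary characters.
literal_star = re.fullmatch(r"[.*+]", "*")   # match
```
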


Predefined Metacharacter Sets

To avoid writing out classes such as [0-9] every time, regex provides predefined shorthand expressions for frequently used sets (all digits, all word characters, and so on). The list below is based on the .NET Framework, but most regex engines behave similarly.

Metacharacter    Description
\a               Bell character (\u0007)
\b               Word boundary (inside a character class, \b means backspace)
\t               Tab character (\u0009)
\r               Carriage return (\u000D)
\v               Vertical tab (\u000B)
\f               Form feed (\u000C)
\n               New line (\u000A)
\e               ESC character (\u001B)
\040             ASCII character given in octal (e.g., \040 is a space)
\x20             ASCII character given in hexadecimal (e.g., \x20 is a space)
\cC              ASCII control character (e.g., \cC is Ctrl+C)
\u0020           Unicode character given in hexadecimal (e.g., \u0020 is a space)
\*               Escaped metacharacter, matched literally (e.g., \* is a literal *)
\w               Any word character (alphanumeric plus underscore), [a-zA-Z0-9_]
\W               Any non-word character, [^a-zA-Z0-9_]
\s               Any whitespace character (space, tab, etc.)
\S               Any non-whitespace character
\d               Any digit, [0-9]
\D               Any non-digit, [^0-9]
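
A few of these shorthands are shown below in a Python sketch with a made-up sample sentence. The table above describes .NET; Python's `re` module shares the common classes used here (\d, \w, \s, \b, \x20) but does not accept every escape in the table (for example \e and \cC), so those are left out.

```python
import re

text = "Order 66 shipped\tto Mr_Smith on 2024-01-15"

digits = re.findall(r"\d+", text)   # runs of digits
words = re.findall(r"\w+", text)    # runs of word characters (letters, digits, underscore)
spaces = re.findall(r"\s", text)    # individual whitespace characters

print(digits)  # ['66', '2024', '01', '15']
print(words)   # ['Order', '66', 'shipped', 'to', 'Mr_Smith', 'on', '2024', '01', '15']

# \b restricts a match to whole words.
phrase = "on and on, not once"
anywhere = re.findall(r"on", phrase)         # also finds the 'on' inside 'once'
whole_words = re.findall(r"\bon\b", phrase)  # only the standalone 'on'

# \x20 is the hexadecimal escape for a space.
space = re.fullmatch(r"\x20", " ")
```
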


Sample Expressions

Here are some regex examples to aid understanding.

Pattern                           Description
^\d{5}$                           5-digit number (e.g., Zip code)
^(\d{5}|\d{5}-\d{4})$             5-digit number or '5-digit-4-digit' format (e.g., US Zip code)
^\d{5}(-\d{4})?$                  Same as above, written more compactly with grouping and ?
^[+-]?\d+(\.\d+)?$                Real number with optional sign and optional fractional part
^[+-]?\d*\.?\d*$                  Similar to above, but also matches the empty string, a lone sign, or a lone decimal point
^(20|21|22|23|[01]\d)[0-5]\d$     24-hour time format (HHmm)
/\*.*\*/                          C-style comment (/* ... */) on a single line
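
As a rough sketch of how such patterns are used for input validation, the Python snippet below wraps some of them in a small helper. The names `ZIP`, `REAL`, `HHMM`, `C_COMMENT`, and `is_valid` are hypothetical. It also shows a precedence pitfall: alternation (|) binds more loosely than the anchors, so the alternatives must be grouped for ^ and $ to apply to both.

```python
import re

ZIP = re.compile(r"^\d{5}(-\d{4})?$")
REAL = re.compile(r"^[+-]?\d+(\.\d+)?$")
HHMM = re.compile(r"^(20|21|22|23|[01]\d)[0-5]\d$")
C_COMMENT = re.compile(r"/\*.*\*/")

def is_valid(pattern, text):
    """Return True if the whole text matches the pattern."""
    return pattern.fullmatch(text) is not None

print(is_valid(ZIP, "12345-6789"))  # True
print(is_valid(REAL, "-3.14"))      # True
print(is_valid(HHMM, "2400"))       # False: hours stop at 23

# Searching, rather than validating, for a comment inside a line of code.
comment = C_COMMENT.search("int x; /* counter */").group()
print(comment)  # /* counter */

# Pitfall: without an enclosing group, ^ applies only to the first alternative
# and $ only to the second, so trailing junk is accepted.
ungrouped = re.search(r"^(\d{5})|(\d{5}-\d{4})$", "12345abc")   # match
grouped = re.search(r"^(\d{5}|\d{5}-\d{4})$", "12345abc")       # None
```
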


Example: Adding a CC Attack Pattern Group

The following procedure configures DeepFinder to block CC attacks (Cache-Control attacks) by inspecting header field values.

  1. Go to the [TEMPLATE] > [SECURITY PATTERN] menu.
  2. Select the [USER] tab and click the [ADD PATTERN GROUP] button.
  3. Enter the group information as shown in [Reference Image 1] below.

Reference Image 1

  4. Go to the [POLICY] menu and select the Domain Group to apply the policy to.
  5. Right-click and click the [Policy Settings] menu.
  6. Select the Header Field Value Policy > Cache-Control item and click the [Enter Policy] button.

Reference Image 2

  7. Enter the security policy content, set the response method (Block, etc.) and log type, then click [OK] to save.

Reference Image 3

Reference Image 4
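
The pattern entered in the reference images is not reproduced in this text, so the Python sketch below is a hypothetical illustration only. It assumes the policy should flag Cache-Control values that combine the no-store and must-revalidate directives, a combination often associated with CC attacks. The names `CC_PATTERN` and `is_suspicious` are invented for this example and are not part of DeepFinder.

```python
import re

# Hypothetical pattern (an assumption, not taken from the product):
# "no-store" followed by a comma and "must-revalidate", with optional
# whitespace around the comma, matched case-insensitively.
CC_PATTERN = re.compile(r"no-store\s*,\s*must-revalidate", re.IGNORECASE)

def is_suspicious(cache_control_value: str) -> bool:
    """Return True if the header value contains the assumed CC attack signature."""
    return CC_PATTERN.search(cache_control_value) is not None

print(is_suspicious("no-store, must-revalidate"))  # True
print(is_suspicious("max-age=3600"))               # False
```

Testing a candidate pattern against sample header values in this way, before entering it as a security policy, may help avoid blocking legitimate requests.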