The re Package

Previous

Next Topic(s):

Created:
29th of October 2025
12:13:24 PM
Modified:
29th of October 2025
01:24:58 PM

Regular Expressions (Regex) in Python

Regular expressions (regex) are compact patterns for matching text. They are indispensable when you need to validate input, extract structured information from messy data, transform strings at scale, or search with precision—much like the powerful “Find” features in modern word processors and code editors. This page uses Python’s re module (Python 3.12.5) and focuses on practical, engineering-friendly recipes.

Why Pattern Matching?

Structured text is everywhere: register numbers, postcodes, phone numbers, emails, URLs, log lines, configuration files, JSON, XML, HTML… Regex lets you describe these structures succinctly and check or extract them reliably. Used well, it reduces brittle string handling and encourages clarity. Used carelessly, it can be slow or fragile—so we’ll include cautions and tips.

Regex in Python at a Glance

# Always use raw strings for patterns
import re

# Compile when reusing (faster and clearer)
VIT_REG = re.compile(r'^\d{2}[A-Z]{3}\d{4}$')
EMAIL   = re.compile(r'^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$')

print(bool(VIT_REG.fullmatch('25BCL0001')))   # True
print(bool(EMAIL.fullmatch('alice@example.org')))  # True

Regular Expressions: Syntax, Flags & Caveats (Python re)

Python’s re engine is Perl-like and works with both Unicode text (str) and 8-bit data (bytes). Pattern and input must be the same type. Use raw strings for readability (e.g. r"\d+\n"), and escape metacharacters when you need them literally.

Quick Reference: Special Characters, Quantifiers, Grouping & Modifiers
Category Syntax Meaning Example Care to be taken
Metacharacters . ^ $ * + ? { } [ ] ( ) | \ Core regex operators and delimiters. \. matches a literal dot; \| matches |. Escape when used literally; prefer raw strings to avoid double escaping.
Anchors & Boundaries ^, $, \A, \Z, \b, \B Start/end of string; word boundaries. ^\w+$ (whole-line word chars); \bcat\b. ^/$ become per-line with re.MULTILINE; \b depends on Unicode vs ASCII mode.
Character classes [abc], [^abc], ranges [a-z]; shorthands \d, \w, \s (and uppercase negations) Match one of a set; shorthands map to categories. [A-F0-9] (hex); \d+ (digits). In Unicode patterns, shorthands are Unicode-aware; force ASCII with re.ASCII or inline (?a).
Quantifiers (greedy) *, +, ?, {m}, {m,n} Repeat previous token. \d{6}; .*. Greedy by default; may over-match.
Quantifiers (non-greedy) *?, +?, ??, {m,n}? Minimal matching. <.*?> for a single HTML-like tag. Still backtracks; ensure later tokens can terminate the match.
Quantifiers (possessive) *+, ++, ?+, {m,n}+ No backtracking (performance control). \w++,(?>\s*)\w+ (no give-back after \w++). 3.11+ only; can cause the whole match to fail where greedy would have backtracked.
Grouping (...), non-capturing (?:...), named (?P<name>...), backrefs \1, (?P=name) Submatches and references. (?P<d>\d{2})/\112/12. Name must be a valid identifier; in bytes patterns, names must be ASCII.
Alternation A|B Match A or B (left-to-right priority). gray|grey. Order matters; put the more specific first.
Lookaround Lookahead (?=...), (?!...); Lookbehind (?<=...), (?<!...) Assert context without consuming text. (?<=\+)\d+ (digits after +); \d+(?=%). Lookbehind in re must be fixed width; avoid variable-length constructs inside.
Atomic grouping (?>...) Prevents backtracking into the group. (?>\d+)\.\d+. 3.11+ only; use to curb catastrophic backtracking.
Inline modifiers (scoped) (?iLmsux), (?a), or scoped (?i:...) Set flags (ignore-case, multiline, dot-all, verbose, ASCII). (?i)cat matches Cat; (?a:\w) → ASCII word char. Use module flags (e.g. re.IGNORECASE) or inline; (?a)/(?u)/(?L) are mutually exclusive.

Backslashes, Escapes & Raw Strings

  • Use raw strings for most patterns: r"\n" is a backslash + n; "\n" is a newline.
  • Match a literal backslash with r"\\"; a literal \n with r"\\n".
  • Escape metacharacters when needed: e.g. r"\(price\)" to match “(price)”.
  • Be consistent: avoid mixing normal strings and raw strings in the same pattern unless you know why.
🌍

Unicode vs 8-bit (bytes) patterns
re fully supports Unicode (str) and also works with 8-bit bytes. Don’t mix types: a Unicode pattern must search a Unicode string; a bytes pattern must search bytes. For ASCII-only behaviour on Unicode, use the re.ASCII flag or inline (?a).

Limitations & Edge Cases (and how to avoid surprises)

  • Fixed-width lookbehind only: in the standard re engine, the inside of (?<=...)/(?<!...) must have a constant length; avoid alternations or quantifiers that change width.
  • Catastrophic backtracking: patterns like (.*a)* or nested ambiguous quantifiers can blow up. Prefer specific classes, add anchors, use possessive quantifiers (++, *+, {m,n}+) or atomic groups (?>...) to eliminate backtracking where safe.
  • Unicode case-folding quirks: case-insensitive matches are Unicode-aware; characters such as German “ß” and Turkish dotted/dotless i can behave differently than ASCII expectations. Use re.ASCII if you need ASCII semantics.
  • Word boundaries: \b depends on the definition of “word character” (Unicode vs ASCII). In code editors you may see different rules; align your flags accordingly.
  • Line boundaries: ^/$ match start/end of the whole string; with re.MULTILINE they also match just after/before newlines. Use re.DOTALL if . should include newlines.
  • Parsing nested structures: Regex is poor at truly nested grammars (e.g. arbitrary HTML/XML with deep nesting). For JSON, XML, HTML, etc., prefer a proper parser and use regex for pre-validation or targeted extraction.
  • Bytes vs text: choose one and stick to it in each operation; don’t pass a bytes pattern to a str subject (or vice versa). For 8-bit processing with text-like classes, prefer bytes + ASCII-range classes or re.ASCII.

Common Patterns & Recipes

Frequently needed patterns with examples and cautions
Pattern / Regex What it matches Example inputs Care to be taken
^\d{2}[A-Z]{3}\d{4}$ VIT-style register number like 25BCL0001 (2 digits, 3 capitals, 4 digits) 25BCL0001, 24ABC1234 Assumes uppercase letters. Use (?i) prefix or re.IGNORECASE if mixed case is allowed.
^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$ Email IDs alice@example.org, a.b+c@sub.domain.co.uk A practical simplification of RFC rules. For internationalised domains, use dedicated libraries when needed.
^https?://(?:www\.)?[-\w%]+(?:\.[-\w%]+)+(?::\d+)?(?:/[^\s]*)?$ Web URLs (HTTP/S) https://vit.ac.in, http://example.com:8080/path?q=1 A pragmatic pattern. For full URL parsing (edge cases, IDNs), prefer a URL parser.
^[1-9][0-9]{5}$
^[1-9][0-9]{2}\s?[0-9]{3}$
Indian PIN codes (with/without space) 600036, 110 001 Does not validate region semantics; only format.
^(?:\+?91[-\s]?)?[6-9]\d{9}$ Indian mobile numbers (optionally with +91) 9876543210, +91 9123456789 Enforces leading digit 6–9 per current allocation. Allow separators judiciously.
^(?:0\d{2,4}[-\s]?)?\d{6,8}$ Indian landline (optional STD code) 044-12345678, 011 23456789, 2654321 Exchanges vary; treat as format check only.
^4\d{12}(?:\d{3})?$ (Visa)
^5[1-5]\d{14}$ (MC)
^3[47]\d{13}$ (AmEx)
Credit card number formats 4111111111111111, 5555555555554444, 371449635398431 Use a Luhn check after regex. Never log raw PANs; mask for storage and display.
^\s*([+-]?(?:90(?:\.0+)?|[0-8]?\d(?:\.\d+)?))\s*,\s*([+-]?(?:180(?:\.0+)?|1[0-7]\d(?:\.\d+)?|[0-9]?\d(?:\.\d+)?))\s*$ Latitude, Longitude pair with range checks 12.9716, 77.5946; -33.9, 151.2 Covers degrees only. For DMS formats, use alternate patterns.
^[1-5]$ (Likert 1–5)
^(?i:(?:yes|no|y|n))$ (Yes/No)
^(?:[A-E](?:\s*,\s*[A-E])*)$ (Multi-select A–E)
Survey responses 5, Yes, A, A,C,E Trim whitespace first; normalise case.
\bword\b Whole-word match as in editors Matches “word” but not “swordfish” Use case-insensitive flag for “Match case” off; consider Unicode word boundaries.
^\d{2}:\d{2}:\d{2},\d{3}\s+-->\s+\d{2}:\d{2}:\d{2},\d{3}$ SRT subtitle time-range lines 00:01:23,456 --> 00:01:27,000 Use for validation or extraction in media pipelines.
\b[A-Fa-f0-9]{8}\b-?(?:[A-Fa-f0-9]{4}-?){3}[A-Fa-f0-9]{12}\b UUIDs (with/without dashes) 550e8400-e29b-41d4-a716-446655440000 Use specific versions if required (v4 etc.).
📚

Note: Anchors ^ and $ ensure you match the entire string, not a substring. Prefer fullmatch() over search() when validating whole-field inputs.

Quick Patterns 

VIT_REG          = r'^\d{2}[A-Z]{3}\d{4}$'
IND_PIN          = r'^[1-9][0-9]{5}$'
IND_MOBILE       = r'^(?:\+?91[-\s]?)?[6-9]\d{9}$'
EMAIL_SIMPLE     = r'^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$'
URL_SIMPLE       = r'^https?://(?:www\.)?[-\w%]+(?:\.[-\w%]+)+(?::\d+)?(?:/[^\s]*)?$'
CC_VISA          = r'^4\d{12}(?:\d{3})?$'
CC_MC            = r'^5[1-5]\d{14}$'
CC_AMEX          = r'^3[47]\d{13}$'
LAT_LON_PAIR     = r'^\s*([+-]?(?:90(?:\.0+)?|[0-8]?\d(?:\.\d+)?))\s*,\s*([+-]?(?:180(?:\.0+)?|1[0-7]\d(?:\.\d+)?|[0-9]?\d(?:\.\d+)?))\s*$'
WHOLE_WORD_word  = r'\\bword\\b'  # escape backslashes when embedding in JSON/strings

Editor-Style Searching

  • Whole word: \bterm\b matches term but not terminal.
  • Case-insensitive: use (?i) in the pattern or re.IGNORECASE in Python.
  • Multi-line files: (?m) changes ^ / $ to match the start/end of each line.
  • Dot-all: (?s) makes . match newlines; useful for block selections.
  • Overlapping finds: use lookarounds, e.g. (?=(aba)) to find overlapping abas.

Working with JSON, XML, HTML and similar

  • Quick validation: sanity-check fragments (e.g., a JSON key or simple value) without full parsing.
  • Targeted extraction: pull IDs, timestamps, or attribute values from logs or mixed content.
  • Safety first: for full documents, prefer a proper parser (e.g., json, xml.etree, lxml, BeautifulSoup). Regex is excellent for well-scoped tasks but brittle for arbitrary HTML.

Examples:

# JSON: capture the value of a top-level key "name" (simple case)
NAME_VALUE = re.compile(r'"name"\s*:\s*"([^"]*)"')

# XML/HTML: capture an attribute value on a specific tag
IMG_SRC = re.compile(r'<img\b[^>]*\bsrc="([^"]+)"', re.IGNORECASE)

# HTML tag stripping (rough):
STRIP_TAGS = re.compile(r'<[^>]+>')
🔍

Tip: Use re.VERBOSE (a.k.a. re.X) to document complex patterns neatly across lines with comments.

Applications to Images, Video, and Media Pipelines

  • OCR post-processing: clean up extracted text (page numbers, figure captions, codes) with regex filters.
  • Subtitles and captions: validate and extract SRT timecodes using the pattern above.
  • File naming conventions: parse camera roll names (e.g., IMG_20250131_154512.jpg) with ^IMG_(\d{8})_(\d{6}) to recover timestamps.
  • Video timecodes: match HH:MM:SS:FF via ^\d{2}:\d{2}:\d{2}:\d{2}$ when working with edit decision lists or logs.

Good Practice

  • Anchor your intent: use ^…$ and fullmatch() for whole-field validation; search() for in-text finding.
  • Compile frequently used patterns: improves performance and keeps code tidy.
  • Prefer raw strings: write r'^\d{2}$' so backslashes are not doubled.
  • Test early: unit-test your patterns against edge cases; include negative tests.
  • Be Unicode-aware: Python 3’s \w is Unicode-aware, but “word boundaries” may differ across languages. Consider explicit character classes when precision matters.
  • Security: avoid catastrophic backtracking by keeping patterns linear-time where possible; prefer atomic groups or possessive quantifiers where supported. Enforce input length caps before matching.

Key Takeaways

Regex gives you precise, declarative control over text. Start with anchored, readable patterns; compile what you reuse; and test thoroughly. Use regex to validate, extract, and transform—but reach for dedicated parsers when dealing with complete JSON, XML, or HTML documents. With these recipes you can confidently handle everyday inputs—from VIT register numbers and Indian PIN codes to URLs, emails, coordinates, and survey data—while keeping performance and reliability in check.