Introduction

According to Wikipedia:

A regular expression is a sequence of characters that specifies a search pattern in text.

There are many resources for learning how to use regular expressions (regex), for example the Regular Expression HOWTO in the Python documentation.

Regex can seem complex and overwhelming at first. The secret is to read a regex from left to right.

This blog works through examples to gradually build up this knowledge.

We can use Python's built-in re module.

import re
weatherText = "It's raining in 1. Auckland and it's snowing in 2. Queenstown"

The function findall returns all non-overlapping matches of the pattern in the string, as a list of strings or tuples. The string is scanned left to right, and matches are returned in the order found.

Write regex patterns in Python's raw string notation, a string literal prefixed with r, so that backslashes are passed to the regex engine unchanged.
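As a minimal sketch of why the r prefix helps, the example below matches a literal backslash with and without a raw string. The path value is made up purely for illustration.

```python
import re

# A made-up Windows-style path, used only for illustration.
# The string itself holds a single backslash: C:\temp
path = "C:\\temp"

# Normal string: the backslash is escaped once for Python and once for regex
plain = re.findall("\\\\temp", path)

# Raw string: only the regex-level escape remains
raw = re.findall(r"\\temp", path)

print(plain == raw)
## True

# A raw string keeps the backslash as its own character
print(len("\n"), len(r"\n"))
## 1 2
```

Both patterns find the same match, but the raw string version is easier to read because what you type is what the regex engine sees.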

The simplest regex matches a literal character.

## Find the a characters in the weatherText
re.findall(r"a", weatherText)
## ['a', 'a', 'a']

There are two main contexts to think about in a regex: inside a character class, i.e. [ ], and outside of one.

Character Class

[ and ] are used for specifying a character class, which is a set of characters to match.

Metacharacters Inside of [ ]

Some characters are special metacharacters, and don’t match themselves. Instead, they signal that some out-of-the-ordinary thing should be matched.

Here are some metacharacters that we use inside brackets:

^ negates the class, but only when it is the first character

- indicates a character range

## Find any lower case a, b or c characters
re.findall(r"[abc]", weatherText)
## ['a', 'c', 'a', 'a']

[abc] effectively expands to (a|b|c).
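A quick way to check this equivalence is to run both forms against the same text. This is only a sketch; the variable names are chosen here for illustration.

```python
import re

weatherText = "It's raining in 1. Auckland and it's snowing in 2. Queenstown"

# Character class: any one of a, b or c
with_class = re.findall(r"[abc]", weatherText)

# Alternation: a or b or c
with_alternation = re.findall(r"a|b|c", weatherText)

print(with_class)
## ['a', 'c', 'a', 'a']
print(with_class == with_alternation)
## True
```

The character class is shorter to write, and the gap widens quickly for ranges such as [a-z].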

## Find any upper case characters in the range A-Z
re.findall(r"[A-Z]", weatherText)
## ['I', 'A', 'Q']
## Find any upper or lower case characters in the range A-Z or a-z
re.findall(r"[A-Za-z]", weatherText)
## ['I', 't', 's', 'r', 'a', 'i', 'n', 'i', 'n', 'g', 'i', 'n', 'A', 'u', 'c', 'k', 'l', 'a', 'n', 'd', 'a', 'n', 'd', 'i', 't', 's', 's', 'n', 'o', 'w', 'i', 'n', 'g', 'i', 'n', 'Q', 'u', 'e', 'e', 'n', 's', 't', 'o', 'w', 'n']
## Match any non-word character (anything other than a letter, digit or underscore)
re.findall(r"[\W]", weatherText)
## ["'", ' ', ' ', ' ', '.', ' ', ' ', ' ', "'", ' ', ' ', ' ', '.', ' ']

For ASCII text, \W is equivalent to the class [^a-zA-Z0-9_]. That class shows how to match the characters not listed within a class by complementing the set, which is indicated by including a ^ as the first character of the class.
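The sketch below shows the ^ behaviour in both positions: first in the class, where it negates, and later in the class, where it is an ordinary character. The short test string "a^b" is made up for illustration.

```python
import re

weatherText = "It's raining in 1. Auckland and it's snowing in 2. Queenstown"

# ^ first: match anything that is NOT a letter or a space
not_letters = re.findall(r"[^A-Za-z ]", weatherText)
print(not_letters)
## ["'", '1', '.', "'", '2', '.']

# ^ not first: it simply matches a literal ^
caret = re.findall(r"[a^]", "a^b")
print(caret)
## ['a', '^']
```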

## Find "it" with an upper or lower case first letter, followed by any two characters
re.findall(r"[Ii]t..", weatherText)
## ["It's", "it's"]

Outside the Character Class

Metacharacters Outside of [ ]

This is a list of the metacharacters that we can use outside of brackets:

. the wildcard, which matches ANY character except a newline (by default). To match a literal . you need to escape it with a backslash: \.

( start subpattern

) end subpattern

{ start of a min/max quantifier, which sets how many times the preceding element is repeated

? same as {0,1} quantifier

* 0 or more quantifier, same as {0,} quantifier. Matches the preceding element zero or more times. When used as .* this is the everything wildcard

+ 1 or more quantifier, same as {1,} quantifier. Matches the preceding element one or more times.
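The metacharacters above can be tried one at a time, as in the sketch below. The variable names and the "color colour" string are made up for illustration. Note that when a pattern contains more than one ( ) subpattern, findall returns a list of tuples rather than strings.

```python
import re

weatherText = "It's raining in 1. Auckland and it's snowing in 2. Queenstown"

# \. matches a literal full stop after a digit
literal_dot = re.findall(r"[0-9]\.", weatherText)
print(literal_dot)
## ['1.', '2.']

# ? makes the preceding u optional
optional_u = re.findall(r"colou?r", "color colour")
print(optional_u)
## ['color', 'colour']

# * allows zero e characters after u, + requires at least one
zero_or_more = re.findall(r"ue*", weatherText)
one_or_more = re.findall(r"ue+", weatherText)
print(zero_or_more, one_or_more)
## ['u', 'uee'] ['uee']

# {2} requires exactly two e characters
exactly_two = re.findall(r"e{2}", weatherText)
print(exactly_two)
## ['ee']

# Two subpatterns: each match becomes a tuple of (number, place name)
pairs = re.findall(r"([0-9])\. ([A-Za-z]+)", weatherText)
print(pairs)
## [('1', 'Auckland'), ('2', 'Queenstown')]
```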

Some other metacharacters are called anchors, because they match a position rather than a character:

^ assert start of string

$ assert end of string

\b asserts that the pattern must match at a word boundary. Putting this either side of a word stops the regular expression matching longer variants of words.
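As a small sketch of the anchors in action, the example below compares anchored and unanchored patterns on the same text. The variable names are chosen here for illustration.

```python
import re

weatherText = "It's raining in 1. Auckland and it's snowing in 2. Queenstown"

# ^ only matches at the start of the string
at_start = re.findall(r"^It", weatherText)
not_at_start = re.findall(r"^it", weatherText)
print(at_start, not_at_start)
## ['It'] []

# $ only matches at the end of the string
last_word = re.findall(r"[A-Za-z]+$", weatherText)
print(last_word)
## ['Queenstown']

# Without \b, "in" is also found inside raining and snowing
anywhere = re.findall(r"in", weatherText)
print(len(anywhere))
## 5

# With \b on both sides, only the standalone word matches
whole_word = re.findall(r"\bin\b", weatherText)
print(whole_word)
## ['in', 'in']
```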

## Find words ending in ing using the {} min max quantifier
re.findall(r"[a-z]{1,}ing", weatherText)
## ['raining', 'snowing']
## Find words ending in ing using the + quantifier instead of {1,}
re.findall(r"[a-z]+ing", weatherText)
## ['raining', 'snowing']

From here we can continue to build up longer expressions step by step, checking them with a regex tester such as https://regex101.com/ or against a Regex Cheat Sheet.