SQL pattern matching

Eleni Markou

Pattern matching over strings in SQL is a frequent need, much more frequent that some may think. Queries aren’t just for compiling demanding aggregate calculations, advanced joins and table partitioning. This applies especially to data scientists and data engineers who often have to deal with raw, unstructured data. For these professionals pattern matching operations, from the most basic to more advanced, are probably an everyday task.

The demand for efficient SQL pattern matching becomes immediate as the velocity of available data increases, especially within data-driven companies where decision making is highly depended on data.

Probably the simplest and most common case in which such an operation is needed is when one simply wants to search in his database for data that match exactly or are similar to a specific pattern. This becomes extremely useful when you do not actually know the exact word or phrase you are seeking for. E.g. Those whose name is ‘John’ or contains the string ‘John’, like for example ‘Johnpaul’.

In this post we are going to through all the capabilities that the majority of SQL dialects offer regarding both the aforementioned tasks, when using Amazon RedshiftPostgreSQLGoogle BigQuery or SQL Server.

This is part #3 of our helpful SQL short guides:

So, let’s get started!

When it comes to pattern matching the usual options are 3: LIKE operator, SIMILAR TO operator which is available only on some SQL dialects and Regular Expressions.

LIKE Operator

We will start our exploration with LIKE as it is probably the simplest of all expression and also present in many database systems including PostgreSQL, MS SQL Server, Redshift and BigQuery.

When it comes to BigQuery, the LIKE operator is available for both Legacy SQL, which is the default for BigQuery and the updated SQL.

Moreover, according to PostgreSQL and Redshift official documentation, LIKE is more secure compared to SIMILAR  TO, which available only for PostgreSQL and Redshift, or POSIX-style regular expressions.  Since SIMILAR TO is in a sense based on regular expressions, both methods are prone to the same security hazards and more specifically to the exploitation of the fact that most regex can be pushed to extreme situations that makes them to work really slowly.  Entering these extreme situations, a program can hang for a long time.

In order to use LIKE one has to use the following syntax: <string> LIKE <pattern>.

This syntax remains the same for all the aforementioned database systems.

So, LIKE is an expression which is evaluated either as true or as false depending on whether the string matches the given pattern.

Also, LIKE is often combined with some other special characters, i.e. ‘%’ and ‘_’, which are called wildcards. When the underscore is present in a pattern it means that it matches any single character while percent matches any sequence of characters of any length.

In order to clarify things a bit more, below we present some common examples of LIKE usages:

SELECT name
FROM customers
WHERE location LIKE ‘%Yor%’

This query will retrieve customers that come from New York, Yorkshire or any other city that includes ‘Yor’ in its name. In contrast, if we would like to retrieve only customers located in cities starting with Yor this query would be sufficient:

SELECT name
FROM customers
WHERE location LIKE ‘Yor%’

In an equivalent way one can use the underscore wildcard which, as mentioned before, it matches any single character. With the following query you are going to retrieve customers that are located, for example, in either Granada (Spain) or Grenada (Caribbean Island).

SELECT name
FROM customers
WHERE location LIKE ‘Gr_nada’

In Google BigQuery, an operator alternative to LIKE is the CONTAINS. Its functionality is basically quite the same except the fact that with CONTAINS you do not have to use the % wildcard mentioned above as by default CONTAINS matches anywhere within the string.

SIMILAR TO Operator

In order to use SIMILAR TO, one has to use the following syntax: <string> SIMILAR TO <pattern>.

SIMILAR TO can be used only on PostgreSQL and Redshift and for the time being there is no equivalent operator in the rest of the database systems like BigQuery and Microsoft SQL Server.

SIMILAR TO works in quite the same way as LIKE, as it returns true or false depending on whether the string matches entirely the supplied pattern.  SIMILAR TO also supports features borrowed by regular expressions, including the following:

  • . denotes the matching of any single character. E.g. ‘a.c’ matches ‘abc’, ‘adb’ etc
  • | is used to denote two alternatives. E.g. ‘blendo’ SIMILAR TO ‘%(b | z)%’ returns True
  • * denotes repetition of the previous symbol zero or more times. E.g. both ‘bbblendo’ SIMILAR TO ‘%b*le%’ and ‘lendo’ SIMILAR TO ‘%b*le%’ return True.
  • {m} denotes repetition of the previous symbol exactly m times. E.g. ‘blenndo’ SIMILAR TO ‘%len{2}%’ returns True.
  • {m,} denotes repetition of the previous symbol at least m times. E.g. ‘blennndo’ SIMILAR TO ‘%len{2,}%’ returns True.
  • {m,n} denotes repetition of the previous item at least m times and no more than n times. E.g. ‘bleeendo’ SIMILAR TO ‘%le{2,4}n%’ returns also True.

Regular Expressions

In cases where LIKE or SIMILAR TO does not cover your needs, regular expressions or simply regex, probably will as they provide you with a much more powerful way of SQL pattern matching. Those familiar with UNIX systems will probably have already used regex at some point, as they are part of many UNIX tools including sed, awk and grep.

So, what is a regular expression? A regular expression is a sequence of characters that complies with certain formation rules and can be seen as the abbreviation of the whole set of strings. In this case, a string is said to match the regular expression only if it is included in the set of strings the regex describes.

To begin with, all the features described in the SIMILAR TO section, i.e  |, *, {m}, {m,} and {m,n}, can be used for formulating a regex.

The simplest case is when you want to check if a string contains or does not contains a specified set of characters.  In PostgreSQL and Redshift, the operator widely used is the tilde (~). The tilde can be combined with other operators such as (*) and (!) in order to form different matching conditions.

Some examples are presented below:

Operator Usage Example
~ Matches if a string contains the specified set of characters (case sensitive) ‘blendo’ ~ ‘nd’ will return True.
~* Matches if a string contains the specified set of characters (case insensitive) ‘blendo’ ~* ‘nD’ will return True.
!~ Matches if a string does not contain the specified set of characters (case sensitive) ‘blendo’ !~ ‘nd’ will return False.
!~* Matches if a string does not contain the specified set of characters (case insensitive) ‘blendo’ !~* ‘nD’ will return False.

 

Especially, for BigQuery the function used for pattern matching using regular expressions is the REGEX_MATCH. This function takes as arguments the <string>, or more generally the column name,  and the regex pattern.

Another important features of regular expressions are the so-called bracket expressions. The [ <set of chars> ] denotes a list of characters and matches every single character from this list.  Within a bracket list,  two characters separated with the minus symbol (-), abbreviate the full range of characters between them. For example [A-D] is a shorthand for ‘A,B,C,D’ while [0-4] a shorthand for ‘0,1,2,3,4’. When the power sign (^) is included before the bracket expression it matches every character which is NOT included in the list.

It is worth mentioning that unlike LIKE, regular expressions match anywhere within the given string unless specified otherwise by explicitly anchoring the match to the beginning or the end of the string. This can be done by adding ^ or $ at the beginning of the end of the regex in order to anchor the match at each position accordingly. For example, while ‘blendo’ ~ ‘nd’ will returns true, ‘blendo’ ~ ‘^nd’ will return False because in order to match, it expects to match ‘nd’ at the beginning of the string.

In PostgreSQL, there is also a special function called substring(<string> from <pattern>) which takes as input a string and a regular expression and extracts the first substring that is found to match the provided expression. For example, substring(‘blendo’ from ‘e.d’) will return ‘end’.

In case your needs cannot be satisfied by any of these alternative a user-defined function in Perl is probably your way to go.

Outro

As you can easily tell,  making use of the SQL pattern matching capabilities most database systems offer can help database users to perform partial matches of data values and obtain answers to questions that go beyond simple text queries and conventional searching and comparing operations.

No matter what database system fits best your needs, the majority of them satisfactorily supports all these advanced concepts previously mentioned in this post.

Get our latest posts in your email to stay on top of everyone.