How to speed up "select...like" queries using pg_trgm
In the realm of databases, PostgreSQL stands out for its adaptability and the plethora of features it offers, especially when it comes to handling textual data. An often underappreciated gem in this context is the pg_trgm
extension.
What is pg_trgm?
pg_trgm
is a PostgreSQL extension that provides functionality for determining the similarity between text strings and improving text searches. It operates by breaking down strings into trigrams, which are groups of three consecutive characters extracted from the text. For instance, the word “Daniele” would be broken down into the trigrams: d
, da
,dan
,ani
,nie
,iel
,ele
,le
.
This approach facilitates fuzzy searches, identifying results that “resemble” the search term even in the presence of typos or minor variations.
How to Use It?
First and foremost, the extension needs to be enabled in your database with the command:
CREATE EXTENSION pg_trgm;
Once enabled, you can leverage the functions and operators provided by pg_trgm
to enhance searches. For instance, to find records similar to a search string, you might use the %
operator:
SELECT * FROM your_table WHERE your_text_column % 'word';
This will return all records where your_text_column
is similar to ‘word’.
If you are curious and want to see how text are splitted, use show_trgm
as shown below.
SELECT show_trgm('Daniele');
--- OUTPUT
--- {" d"," da","ani","dan","ele","iel","le ","nie"}
Practical Examples
Suppose we have a songs
table with a title
column. To find titles similar to “Sultans of swing”, one could write:
SELECT title FROM songs WHERE title % 'Sultens of swing';
This command could successfully retrieve the correct “Sultans of swing” despite the misspellings.
To understand how it works, let’s talk about the %
operator.
The %
operator can be applied on two text
and returns a boolean
.
Quoting the PG documentation: “Returns true if its arguments have a similarity that is greater than the current similarity threshold set by pg_trgm.similarity_threshold.”
similarity_threshold
params is at 0.3 by default. So it consider “similar” two words with a similarity greater or equal to 30%. This can be OK for your scenario or not. In case you need, changing this parameter is quite simple.
To read the current value of pg_trgm.similarity_threshold
you can use the following statement:
show pg_trgm.similarity_threshold;
Let’s say we’d like to set pg_trgm.similarity_threshold
to 0.5
.
Stamement | Scope |
---|---|
set pg_trgm.similarity_threshold = 0.5; |
Session |
ALTER DATABASE SET pg_trgm.similarity_threshold = 0.5; |
Database |
ALTER SYSTEM SET pg_trgm.similarity_threshold = 0.5; |
Cluster |
To optimize performance on large tables, it’s advisable to create a GIST or GIN index on the trigrams:
CREATE INDEX idx_title_trgm ON songs USING gist (title gist_trgm_ops);
Now, after this small introduction about pg_trgm and the %
operator, let’s focus on the topic of this article.
Other information about the available pg_trgm operators here.
Enhancing LIKE Queries with pg_trgm
Beyond managing fuzzy searches, the pg_trgm
extension harbors a trick up its sleeve that can significantly speed up searches using LIKE
and ILIKE
operators in PostgreSQL, especially those containing wildcard characters %
at both ends of the search string. This is particularly useful because, in the absence of pg_trgm
, such searches can be very slow on large datasets, as they require a sequential scan of the entire table.
Optimizing LIKE/ILIKE Queries
With pg_trgm
installed and properly configured, PostgreSQL can use GIN and GIST trigram-based indexes to accelerate LIKE
and ILIKE
queries. This method leverages the data’s trigram representation to drastically reduce the data set to be examined during the search.
For example, using the same songs
table mentioned earlier, if we want to find all titles containing the string “love”, a typical query might be:
SELECT title FROM songs WHERE title LIKE '%love%';
Without pg_trgm
, this query could be slow on a large dataset. However, with a GIN or GIST trigram index on the title
column, PostgreSQL can quickly narrow down the search to records that have a higher likelihood of matching the pattern, significantly reducing query execution time.
Creating an Index to Optimize LIKE/ILIKE
To take advantage of this optimization, we can create a GIN index on the column of interest as follows:
CREATE INDEX idx_title_like ON songs USING gin (title gin_trgm_ops);
With this index in place, LIKE
and ILIKE
queries including patterns with wildcards at both ends can be executed much faster, thanks to trigram-based candidate pre-selection.
Let’s do a real-world test
To do a real-world test, create an half-million records table using the following script.
drop table if exists random_people;
create table random_people as
with rnd_people as (
select distinct
('[0:13]={"Daniele","Debora","Mattia","Jake","Amy","Samuel","Jacopo","Martina","Sofia","Henry","Neil","Tim","John","George"}'::text[])
[floor(random()*14)] || ' ' || floor(random()*100000+1)::text || '°' first_name,
('[0:13]={"Bianchi","Lamborghini","Ferrari","De Tommaso","Rossi","Verdi","Gialli","Caponi","Gallini","Gatti","Ford","Daniel","Harrison","Macdonald"}'::text[])
[floor(random()*14)] || ' ' || floor(random()*100000+1)::text || '°' last_name
from
generate_series(1, 10 * 1000 * 50)
)
select
row_number() over() id, first_name, last_name
from
rnd_people;
select count(*) from random_people; -- 500,000 records
EXPLAIN ANALYZE
says the query takes c.a. 160ms to execute.
explain analyze select * from random_people where last_name ilike '%tom%'; -- c.a. 160 ms on my machine
Now, create a GIN
index on last_name
field using gin_trgm_ops
operator class.
CREATE EXTENSION if not exists pg_trgm;
drop index if exists index_last_name_on_random_people_trigram;
CREATE INDEX CONCURRENTLY index_last_name_on_random_people_trigram
ON random_people
USING gin (last_name gin_trgm_ops);
Now, the final test. Re-execute the previous query.
explain analyze select * from random_people where last_name ilike '%tom%'; -- c.a. 0.025 ms on my machine!
Great! From 160ms to 0.025ms (on my machine).
Final Thoughts
Utilizing pg_trgm
to optimize LIKE
and ILIKE
queries is a prime example of how PostgreSQL extensions can enhance application performance and flexibility. It’s important to remember that while trigram-based indexes can significantly speed up these searches, they also require additional space and can impact data insertion times. Therefore, as always in engineering, their implementation should be carefully considered based on the specific requirements of the application.
In conclusion, integrating pg_trgm
into your indexing and search strategies can profoundly transform the performance of textual queries. “Sapere aude” (Dare to know) - this extension invites you to explore and fully exploit PostgreSQL’s advanced capabilities for textual data management.
PostgreSQL Trainings
If you want to become proficient in PostgreSQL there are some very popular trainings provided by my company:
- PostgreSQL for Developers, English
- PostgreSQL per Sviluppatori, Italian
- PostgreSQL per Amministratori, Italian
If your company needs a custom training, let us know.
Trainings are available remotely and on-site.
Links
- Official
pg_trgm
documentation - All the functions provided by the
pg_trgm
extension
Comments
comments powered by Disqus