RUM: improved inverted index for full-text search based on GIN index
RUM is an extension which adds a RUM index to Postgresql.
RUM index is based on GIN that stores additional per-entry information in a posting tree. For example, positional information of lexemes or timestamps. In comparison to GIN it can use this information to make faster index-only scans for:
- Phrase search
- Text search with ranking by text distance operator
- Text
SELECT
s with ordering by some non-indexed additional column e.g. by timestamp.
RUM works best in scenarios when the possible keys are highly repeatable. I.e. all texts are composed of a limited amount of words, so per-lexeme indexing gives significant speed-up in searching texts containing word combinations or phrases.
Main operators for ordering are:
tsvector <=>
tsquery | float4 | Distance between tsvector and tsquery.
value <=>
value | float8 | Distance between two values.
Where value is timestamp, timestamptz, int2, int4, int8, float4, float8, money and oid
Usage#
Enable the extension#
You can get started with rum by enabling the extension in your Supabase dashboard.
- Go to the Database page in the Dashboard.
- Click on Extensions in the sidebar.
- Search for "rum" and enable the extension.
Syntax#
For type: tsvector#
To understand the following you may need first to see Official PostgreSQL documentation on text search
rum_tsvector_ops
And we can execute tsvector
selects with ordering by text distance operator:
rum_tsvector_addon_ops
Now we can execute the selects with ordering distance operator on attached column:
For type: anyarray#
rum_anyarray_ops
This operator class stores anyarray elements with length of the array. It supports operators &&
, @>
, <@
, =
, %
operators. It also supports ordering by <=>
operator.
Now we can execute the query using index scan:
rum_anyarray_addon_ops
The does the same with anyarray index as rum_tsvector_addon_ops
i.e. allows to order select results using distance
operator by attached column.
Limitations#
RUM
has slower build and insert times than GIN
due to:
- It is bigger due to the additional attributes stored in the index.
- It uses generic WAL records.