Using the array type for user and package queries

Currently, we use auxiliary tables to query for messages related to packages or users, in the standard RDBS fashion.

We came to some problems when trying to enforce foreign key constrains while using the timescaledb extension. We decided to try, if just using a column with array type with proper indes would have simmilar performace.

Array columns support indexing with Generalized Inverted Index, GIN, that allows for fast searches on membership and intersection. Because we mostly search for memebership, array column could be performant enough for our purposes.

Resources

Installing/enabling/activating

To have comparable results, we enabled timescaledb in same fashion as in our other experiment.

To add new column

alter table messages2 add column packages text[];

To populate it

update messages2 set packages=t_agg.p_agg from
  (select msg, array_agg(package) as p_agg from package_messages group by msg) as t_agg where messages.id=t_agg.msg;

We need to enable the btree_gin extension to be able to create index with array as well as timestamp

CREATE EXTENSION btree_gin;

To create the index

CREATE INDEX idx_msg_user on "messages2" USING GIN ("timestamp", "packages");

To help reuse our testing script, we setup postgrest locally

podman run --rm --net=host -p 3000:3000   -e PGRST_DB_URI=$DBURI -e PGRST_DB_ANON_ROLE="datagrepper" -e PGRST_MAX_ROWS=25   postgrest/postgrest:v7.0.

Because we focused only on package queries, as user colun couldn’t be populated due to constraints on size, we chose two as representative. There is implicit limit to return just 25 rows.

A simple membership:

/messages_ts?packages=ov.{{kernel}}

A simple membership ordered by time.

/messages_ts?order=timestamp.desc&packages=ov.{{kernel}}

Findings

Querying just the package membership

The queries were surprisingly fast, with maximum under 4 seconds and mean around half a second. This encouraged us to do further experiments.

Results

  test_filter_by_package
Requests: 300, pass: 300, fail: 0, exception: 0
For pass requests:
Request per Second - mean: 3.63
Time per Request   - mean: 0.522946, min: 0.000000, max: 3.907548

Querying just the package membership ordered by timestamp desc

Usually we want to see most recent messages. So we ammended the query, to include “order by timestamp desc”. The result was less encouraging, with longest succesful query taking more than 90 seconds and several timing out.

This seems to be the result of GIN index not supporting order in the index.

Results

  test_filter_by_package
Requests: 300, pass: 280, fail: 0, exception: 20
For pass requests:
Request per Second - mean: 0.53
Time per Request   - mean: 7.474040, min: 0.000000, max: 99.880939

Conclusion

While array support seems interesting, and for simple queries very fast, indexes that require ordering don’t seem to be supported. This makes strong case against using them.