Partitioning the database

In the database used by datanommer and datagrepper one table stands out from the other ones by its size, the messages table. This can be observed in Datanommer.

One possibility to speed things up in datagrepper is to partition that table into a set of smaller sized partitions.

Here are some resources regarding partitioning postgresql tables:

Table partitioning at postgresql’s documentation: https://www.postgresql.org/docs/13/ddl-partitioning.html
How to use table partitioning to scale PostgreSQL: https://www.enterprisedb.com/postgres-tutorials/how-use-table-partitioning-scale-postgresql
Definition of PostgreSQL Partition: https://www.educba.com/postgresql-partition/

Attempt #1

For our first attempt at partitioning the messages table, we thought we would partition it by year. Having a different partition for each year. We thus started by adding a year field to the table and fill it by extracting the year from the timestamp field of the table.

However, one thing to realize when using partitioned table is that each partition needs to be considered as an independant table. Meaning an unique constraint has to involve the field on which the table is partitioned. In other words, if you partition the table by a year field, that year field will need to be part of the primary key as well as any UNIQUE constraint on the table.

So to partition the messages table on year, we had to add the year field to the primary key. However, that broke the foreign key constraints on the user_messages and package_messages tables which rely on the id field to link the tables.

Attempt #2

Since partitioning on year did not work, we reconsidered and decided to partition on the id field instead using RANGE PARTITION.

We partitioned the messages table on the id field with partition of 10 million records each. This has the advantage of making each partition of similar sizes.

More resources

These are a few more resources we looked at and thought were worth bookmarking:

Automatic partitioning by day - PostgreSQL: https://stackoverflow.com/questions/55642326/
pg_partman, partition manager: https://github.com/pgpartman/pg_partman
How to scale PostgreSQL 10 using table inheritance and declarative partitioning: https://blog.timescale.com/blog/scaling-partitioning-data-postgresql-10-explained-cd48a712a9a1/