Using the timescaledb extension =============================== timescaledb (https://docs.timescale.com/latest/) is a postgresql extension for time-series database. Considering a lot of the actions done on datagrepper involve the timestamp field (for example: all the messages with that topic in this time range), we figured this extension is worth investigating. A bonus point being for this extension to already packaged and available in Fedora and EPEL. Resources --------- - Setting up/enabling timescaledb: https://severalnines.com/database-blog/how-enable-timescaledb-existing-postgresql-database - Migrating an existing database to timescaledb: https://docs.timescale.com/latest/getting-started/migrating-data#same-db Installing/enabling/activating ------------------------------ To install the plugin, simply run: .. code-block:: dnf install timescaledb The edit ``/var/lib/pgsql/data/postgresql.conf`` to tell postgresql to load it: .. code-block:: shared_preload_libraries = 'pg_stat_statements,timescaledb' timescaledb.max_background_workers=4 It will then need a restart of the entire database server: .. code-block:: systemctl restart postgresql You can then check if the extension loaded properly: .. code-block:: $ sudo -u postgres psql SELECT * FROM pg_available_extensions ORDER BY name; Then, you will need to activate it for your database: .. code-block:: $ sudo -u postgres psql CREATE EXTENSION IF NOT EXISTS timescaledb CASCADE; Finally, you can check that the extension was activated for your database: .. code-block:: $ sudo -u postgres psql \dx .. _timescaledb_findings: Findings -------- Partitioned table ~~~~~~~~~~~~~~~~~ After converting the `messages` table to use timescaledb, we've realized that timescaledb uses table partitioning as well. This leads to the same issue with the foreign key constraints that we have seen in the plain partitioning approach we took. Foreign key considerations ~~~~~~~~~~~~~~~~~~~~~~~~~~ For a better understanding on the challenges we've encountered with foreign key constraints, here is a graphical representation of the datanommer database: .. image:: ../_static/datanommer_db.jpeg :target: ../_images/datanommer_db.jpeg So here are the issues we've faced: - To make the `messages` table a hypertable (ie: activate the timescaledb plugin on it), the tables need to be empty and the data imported in a second step. - Once the `messages` table is a hypertable, we cannot add foreign key constraints from the `user_messages` or `package_messages` tables to it. It is just not supported in timescaledb (cf https://docs.timescale.com/latest/using-timescaledb/schema-management#constraints ) - We tried creating the foreign key constraints before making the `messages` table a hypertable and then importing the data in (tweaking the primary key and foreign keys to include the timestamp, following https://stackoverflow.com/questions/64570143/ ) but that resulted in an error when importing the data. So we ended up with: Keep the same data structure but to not enforce the foreign key constaints on `user_messages` and `package_messages` to `messages`. As that database is mostly about inserts and has no updates or deletes, we don't foresee much problems with this. Duplicated messages ~~~~~~~~~~~~~~~~~~~ When testing datagrepper and datanommer in our test instance with the timescaledb plugin, we saw a number of duplicated messages showing up in the `/raw` endpoint. Checking if we could fix this server side, we found out that the previous database schema had an `UNIQUE` constraint on `msg_id` field. However, with the timescaledb plugin, that constraint is now on both `msg_id` and `timestamp` fields, meaning a message can be inserted twice in the database if there is a little delay between the two inserts. However, migrating datanommer from fedmsg to fedora-messaging should resolve that issue client side as rabbitmq will ensure there is only one consumer at a time handling a message. Open questions -------------- - How will upgrading the postgresql version with the timescaledb plugin look like? It looks like the timescaledb folks are involved enough in postgresql itself that we think things will work, but we have not had on-hands experience with it. Patch ----- Here is the patch that needs to be applied to ``datanommer/models/__init__.py`` to get it working with timescaledb's adjusted postgresql model. .. code-block:: diff --git a/datanommer.models/datanommer/models/__init__.py b/datanommer.models/datanommer/models/__init__.py index ada58fa..7780433 100644 --- a/datanommer.models/datanommer/models/__init__.py +++ b/datanommer.models/datanommer/models/__init__.py @@ -192,11 +192,11 @@ def add(envelope): # These two blocks would normally be a simple "obj.users.append(user)" kind # of statement, but here we drop down out of sqlalchemy's ORM and into the # sql abstraction in order to gain a little performance boost. - values = [{'username': username, 'msg': obj.id} for username in usernames] + values = [{'username': username, 'msg': obj.id, 'timestamp': timestamp} for username in usernames] if values: session.execute(user_assoc_table.insert(), values) - values = [{'package': package, 'msg': obj.id} for package in packages] + values = [{'package': package, 'msg': obj.id, 'timestamp': timestamp} for package in packages] if values: session.execute(pack_assoc_table.insert(), values) @@ -279,14 +279,17 @@ class BaseMessage(object): source_version=self.source_version, ) + user_assoc_table = Table('user_messages', DeclarativeBase.metadata, Column('username', UnicodeText, ForeignKey('user.name')), - Column('msg', Integer, ForeignKey('messages.id')) + Column('msg', Integer, ForeignKey('messages.id')), + Column('timestamp', Integer, ForeignKey('messages.timestamp')), ) pack_assoc_table = Table('package_messages', DeclarativeBase.metadata, Column('package', UnicodeText, ForeignKey('package.name')), Column('msg', Integer, ForeignKey('messages.id')) + Column('timestamp', Integer, ForeignKey('messages.timestamp')), )