The purpose of this work was to investigate the current solution and its bottlenecks, and to identify what needs to be done to solve the following problems:
Storage bottleneck when creating the intermediate database file
Operations efficiency for the infrastructure team
The short-term goal is to close operational gaps and remove possible technical bottlenecks in the current solution:
Make the intermediate database file consume less disk space
Weekly data generation instead of daily
The long-term goal is to replace the current solution with an actual data-driven application that gives end users analytics as close to real time as possible. The current solution has the following limitations:
Graphical reports are static images served through httpd
Manual intervention is needed to generate reports outside of the cron job schedule
Data is not real time
There is no way to connect third-party apps such as Jupyter, since there is no “data service”
The long-term goal is to create a data service and/or use an existing open-source solution such as Prometheus/Kafka/etc. to serve that data through an API and a web app interface.
The API would let other apps pull and filter “real time” data instead of downloading a SQLite database file and then parsing it into human-friendly formats.
The investigation focused on identifying possible bottlenecks in the current solution, both technical and operational.
The Current System
The current “system” is an Ansible role which relies on mirrors-countme and other scripts to do its job. Most of those scripts are executed from a cron job, which generates static images that are served through a web server.
If any of those tools needs to run outside of the cron job schedule, someone from the Fedora infrastructure team has to run that playbook, which is quite limiting.
The Intermediate Database
The script first generates an intermediate database file, usually referred to as “raw.db”. A second file, “totals.db”, is then created from it, and that is the file used by end users.
The problem is that data is appended to “raw.db” for each Apache httpd log line, so the file keeps growing and is turning into a storage problem.
One possible solution that is on the table is to purge old data from “raw.db” every time the end user database file gets updated - for example: keep data from the last 30 days and delete everything else.
Another option is to create weekly files for the intermediate “raw.db” database instead of appending everything to a single file, using the full year and week number as the filename (for example, YYYY/01.db). That would allow us to archive those files individually if needed.
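As a minimal sketch of the weekly naming scheme, the following shows one way to derive such a file path from a date. The function name `weekly_db_path` and the base directory are hypothetical, and ISO 8601 week numbering is an assumption; the text above only specifies year and week number.

```python
from datetime import date
from pathlib import Path


def weekly_db_path(base_dir: Path, day: date) -> Path:
    """Return the YYYY/WW.db path for the ISO week that contains `day`."""
    # isocalendar() yields the ISO year, which can differ from day.year
    # around New Year (2021-01-01 belongs to ISO week 53 of 2020).
    iso_year, iso_week, _ = day.isocalendar()
    return base_dir / f"{iso_year:04d}" / f"{iso_week:02d}.db"


# Hypothetical base directory, for illustration only.
print(weekly_db_path(Path("/srv/countme"), date(2021, 1, 4)))
```

Using the ISO year rather than the calendar year keeps the days of one week in a single file, even when that week spans two calendar years.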
We concluded that we have to work with the current solution as a short-term goal, but should keep track of a system refactoring as a long-term goal.
The short-term goal is to remove storage bottlenecks and improve operational efficiency.
The long-term goal is to create a data system that replaces the current solution entirely, which may require another “arc initiative” as well.
The team should write an SOP for the Fedora Infrastructure team about how and where data is generated.
The SOP document should also describe the steps required to generate data on demand in response to a user request.
The intermediate database file, also known as “raw.db”, is generated daily through a cron job.
The cron job should be run weekly instead, because httpd logs are not “real time” and the system can occasionally lose data when the job runs daily.
This can be done by updating the cron job definitions in Fedora’s Ansible repository: https://pagure.io/fedora-infra/ansible/blob/main/f/roles/web-data-analysis/files
The intermediate “raw.db” file aggregates all parsed data from the httpd logs, which is turning into a storage problem on our log servers.
There are two possible solutions for this problem: split database files based on “week of the year” or delete data from the intermediate database file that is older than 1 month.
Splitting Database Files
This scheme would create a file per “week of the year” instead of a single intermediate database file.
That would allow us to archive older files somewhere else while keeping the most recent ones on the server (the last 4 weeks, for example).
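The retention idea above could be sketched as follows. This is an illustration only: the function name `split_for_archive` is hypothetical, and it assumes weekly files are laid out as YYYY/WW.db with zero-padded names under a single base directory.

```python
from pathlib import Path


def split_for_archive(base_dir: Path, keep: int = 4) -> tuple[list[Path], list[Path]]:
    """Return (to_keep, to_archive) for weekly files under base_dir/YYYY/WW.db."""
    # Zero-padded year and week names sort chronologically as plain strings.
    weekly = sorted(
        base_dir.glob("[0-9][0-9][0-9][0-9]/[0-9][0-9].db"),
        key=lambda p: (p.parent.name, p.stem),
    )
    # Everything before `cut` is older than the newest `keep` files.
    cut = max(len(weekly) - keep, 0)
    return weekly[cut:], weekly[:cut]
```

The function only decides which files are candidates; moving them to an archive location is left to whatever storage the team chooses.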
This solution requires changes to how database files are written and the way we read those files to generate the final database file used by end users.
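To illustrate the change on the read side, here is a rough sketch that builds a totals database from several weekly files instead of one “raw.db”. The table names `hits` and `weekly_totals`, the column names, and the function `rebuild_weekly_totals` are all hypothetical; the real schema used by mirrors-countme is not described here and may differ.

```python
import sqlite3
from contextlib import closing
from pathlib import Path
from typing import Iterable


def rebuild_weekly_totals(weekly_files: Iterable[Path], totals_path: Path) -> None:
    """Write one row count per weekly file into the totals database."""
    with closing(sqlite3.connect(totals_path)) as totals:
        totals.execute(
            "CREATE TABLE IF NOT EXISTS weekly_totals "
            "(week TEXT PRIMARY KEY, hits INTEGER NOT NULL)"
        )
        for path in weekly_files:
            # Derive a label such as "2021-W01" from the YYYY/WW.db path.
            week = f"{path.parent.name}-W{path.stem}"
            with closing(sqlite3.connect(path)) as raw:
                (hits,) = raw.execute("SELECT COUNT(*) FROM hits").fetchone()
            # Re-running the job for the same week overwrites the old value.
            totals.execute(
                "INSERT OR REPLACE INTO weekly_totals (week, hits) VALUES (?, ?)",
                (week, hits),
            )
        totals.commit()
```

Because each week is keyed separately, an archived week only needs to be restored if its totals ever have to be recomputed.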
The second approach, deleting old data, would keep using a single “raw.db” database file, but a new step would be added when updating the end-user database file.
The team would need to implement a step that would remove old data from the intermediate database file once the final counter database file is updated.
For example: read “raw.db” -> update “counter.db” -> delete all data from “raw.db” that is older than one month.
This approach is a bit simpler since it just needs an extra step in the existing code instead of changing how “raw.db” files are stored and used.
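The extra purge step could look roughly like the sketch below. It assumes a hypothetical `hits` table with an integer `timestamp` column holding Unix epoch seconds; neither name comes from the existing code, and `purge_old_rows` is an illustrative helper rather than part of mirrors-countme.

```python
import sqlite3
import time
from typing import Optional

SECONDS_PER_DAY = 86400


def purge_old_rows(
    conn: sqlite3.Connection, keep_days: int = 30, now: Optional[float] = None
) -> int:
    """Delete rows older than `keep_days` and return how many were removed."""
    now = time.time() if now is None else now
    cutoff = int(now) - keep_days * SECONDS_PER_DAY
    with conn:  # commits on success, rolls back if the DELETE fails
        cur = conn.execute("DELETE FROM hits WHERE timestamp < ?", (cutoff,))
    # VACUUM rebuilds the file so the freed pages are returned to the
    # filesystem; it must run outside of a transaction.
    conn.execute("VACUUM")
    return cur.rowcount
```

Deleting rows alone does not shrink a SQLite file on disk: the freed pages are kept for reuse, and the file size only drops after `VACUUM` (or with auto-vacuum enabled). `VACUUM` temporarily needs extra free disk space while it rebuilds the database, which is worth checking on a server that is already short on storage.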