Improving reliability of mirrors-countme scripts

Notes on current deployment

To investigate or deploy, you need to be a member of sysadmin-analysis.

The code lives in the repo at https://pagure.io/mirrors-countme/

The deployment configuration is stored in the ansible repo and is applied through the playbook playbooks/groups/logserver.yml, mostly in the role roles/web-data-analysis.

The scripts run on log01.iad2.fedoraproject.org. If you are a member of sysadmin-analysis, you should be able to ssh to it and have root there.

There are several cron jobs responsible for running the scripts:

syncHttpLogs - in /etc/cron.daily/, rsyncs logs to /var/log/hosts/$HOST/$YEAR/$MONTH/$DAY/http
combineHttp - in /etc/cron.d/, every day at 6, runs /usr/local/bin/combineHttpLogs.sh, which combines logs from /var/log/hosts into /mnt/fedora_stats/combined-http based on the project. It uses /usr/share/awstats/tools/logresolvemerge.pl, and we are not sure we are using it correctly.
condense-mirrorlogs - in /etc/cron.d/, every day at 6, does some sort of analysis, possibly with one of the older scripts. It seems to attempt to sort the logs again.
countme-update - in /etc/cron.d/, every day at 9, runs two scripts: countme-update-rawdb.sh, which parses the logs and fills in the raw database, and countme-update-totals.sh, which uses the rawdb to calculate the statistics. The results of countme-update-totals.sh are then copied to a web folder to make them available at https://data-analysis.fedoraproject.org/csv-reports/countme/
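To make the combining step more concrete, here is a minimal sketch of merging per-host logs into one time-ordered stream. It is not a description of what logresolvemerge.pl or combineHttpLogs.sh actually do. It assumes that the lines carry a bracketed timestamp in the common/combined access-log format and that each input is already sorted by time; the names log_time and merge_logs are made up for this illustration.

```python
import heapq
import re
from datetime import datetime

# Bracketed timestamp as used in the common/combined access-log format,
# e.g. [01/Jan/2024:00:00:01 +0000]. The log format is an assumption here.
TIMESTAMP_RE = re.compile(r"\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4})\]")


def log_time(line):
    """Return the timezone-aware timestamp of one access-log line."""
    match = TIMESTAMP_RE.search(line)
    if match is None:
        raise ValueError(f"no timestamp found in line: {line!r}")
    return datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")


def merge_logs(*streams):
    """Lazily merge several time-sorted iterables of log lines into one.

    heapq.merge only looks at the head of each stream, so memory use stays
    flat, but it relies on every input being sorted already.
    """
    return heapq.merge(*streams, key=log_time)
```

Because the merge is lazy, the inputs can be open file objects, one per proxy, and the result can be written out line by line. A line without a parsable timestamp raises ValueError rather than being silently misplaced.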

Notes on avenues of improvement

We have several areas we need to improve:

downloading and syncing the logs, which can sometimes fail or hang
combining the logs, which has its own problems
installation of the scripts, as there have been problems with updates; currently we just pull the git repo and run pip install
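One way to contain the failing or hanging sync step is to run it under a hard time limit and retry a bounded number of times. The sketch below is only an illustration of that idea, not part of the current deployment; run_with_retries and its parameters are hypothetical names.

```python
import subprocess
import time


def run_with_retries(cmd, timeout_s, attempts=3, delay_s=30.0):
    """Run cmd with a hard time limit, retrying on failure or timeout.

    Returns the 1-based number of the attempt that succeeded.
    Raises RuntimeError if every attempt fails or times out.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            # On timeout, subprocess.run kills the child and raises
            # TimeoutExpired; check=True raises CalledProcessError on a
            # non-zero exit status.
            subprocess.run(cmd, timeout=timeout_s, check=True)
            return attempt
        except (subprocess.TimeoutExpired, subprocess.CalledProcessError) as exc:
            last_error = exc
            if attempt < attempts:
                time.sleep(delay_s)  # brief pause before the next try
    raise RuntimeError(f"{cmd!r} failed after {attempts} attempts: {last_error}")
```

The sync job could pass its rsync command line as cmd. rsync also has its own --timeout option for stalled I/O, which would complement the overall limit used here. Raising an exception at the end lets the cron job exit non-zero, so a persistent failure is visible instead of silently producing a day with missing logs.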

Notes on replacing with off-the-shelf solutions

As the raw data we base our statistics on are just the access logs from our proxy servers, we might be able to find an off-the-shelf solution that could replace our brittle scripts.

Two solutions present themselves: the ELK stack, and Loki with Promtail by Grafana.

We are already running the ELK stack on our openshift, but our experience so far is that Elasticsearch has an even more brittle deployment.

We did some experiments with Loki. The technology seems promising, as it is much simpler than the ELK stack, and its storage size looks comparable to that of the raw logs.

Moreover, promtail, which does the parsing and uploading of logs, has facilities both to add labels to log lines, which are then indexed and queryable in the database, and to collect statistics from the log lines directly, which can then be gathered by prometheus.

You can query the logs with LogQL, a query language similar to PromQL.

We are not going to use it because:

it doesn’t deal well with historical data, so any attempt at an initial import of logs is a pain
using promtail-enabled metrics wouldn’t help us with double-counting of people hitting different proxy servers
configuration is fiddly and tricky to test
changing a batch process to soft-realtime sounds like a headache