This is a long overdue post dating from 2017. To make the troubleshooting more interesting, I would like to present it as an informal Root Cause Analysis (RCA).
Disclaimer: To keep this post digestible for most audiences, I will ignore the following aspects of an RCA:
- It must strive to have a detailed sequence of events, ideally with timestamps
- It should be written formally and neutrally
- It should include the business impact (time, users and losses)
Please bear that in mind if you would like to use this post as a guide for some reason.
Artifactory yum errors - Pseudo-RCA
- In 2017, we performed a significant upgrade of our on-premise JFrog Artifactory services
- The new services passed all our validations, and the monitoring marked them as “ok”
- Compute and IOPS quotas were within their expected ranges
- The new Artifactory was promoted “as live”, and no errors were reported by users right away
- After a couple of hours, several yum-related errors were reported, all sharing the following error strings:
... error getting update info: Cannot retrieve repository metadata (repomd.xml) for repository: repo. Please verify its path and try again ... Error: requested datatype filelists not available
Starting with the Artifactory version we migrated to, the filelists.xml (yum file list) indexing is disabled by default since it is resource-intensive.
This breaking change was not apparent until the yum cache expired on the machines that reported the problems. Once it expired, yum commands that rely on file metadata (makecache, search, install) started to fail.
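One way to catch this kind of change before users do is to inspect the repository's repomd.xml directly and confirm it still advertises the filelists metadata. The following is a minimal sketch, not what we ran at the time: the function names and the sample document are hypothetical, and the sample is a trimmed-down approximation of what a repository without file list indexing might publish.

```python
import xml.etree.ElementTree as ET

# XML namespace used by repomd.xml in yum/RPM repositories.
REPO_NS = "{http://linux.duke.edu/metadata/repo}"


def metadata_types(repomd_xml: str) -> set:
    """Return the metadata types advertised by a repomd.xml document."""
    root = ET.fromstring(repomd_xml)
    return {data.get("type") for data in root.findall(f"{REPO_NS}data")}


def has_filelists(repomd_xml: str) -> bool:
    """True when the repository publishes the filelists metadata."""
    return "filelists" in metadata_types(repomd_xml)


# Hypothetical sample: a repository that no longer publishes filelists.
SAMPLE_WITHOUT_FILELISTS = """<repomd xmlns="http://linux.duke.edu/metadata/repo">
  <data type="primary"><location href="repodata/primary.xml.gz"/></data>
  <data type="other"><location href="repodata/other.xml.gz"/></data>
</repomd>"""

if __name__ == "__main__":
    print("filelists available:", has_filelists(SAMPLE_WITHOUT_FILELISTS))
```

In a validation suite, the same check could be pointed at the repomd.xml fetched from the new service, so the missing metadata shows up without waiting for any client cache to expire.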
- Although expensive, continue using a blue/green strategy for core services, with a longer grace window (over 24 hours)
- Explore forcing yum metadata expiration on canary environments to test the new service
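The second action item could look roughly like the sketch below, which drops the cached metadata for one repository on a canary host and then rebuilds it, so a missing filelists shows up immediately. This is an illustration under assumptions: the helper names are hypothetical, the repository id is a placeholder, and it assumes the `yum` CLI is available on the canary machine.

```python
import subprocess


def canary_commands(repo_id: str) -> list:
    """Commands that drop cached metadata for one repo, then re-fetch it."""
    only_repo = ["--disablerepo=*", f"--enablerepo={repo_id}"]
    return [
        ["yum", *only_repo, "clean", "metadata"],
        ["yum", *only_repo, "makecache"],
    ]


def run_canary(repo_id: str, runner=subprocess.run) -> bool:
    """Run each command in order; return True only if all exit with status 0."""
    for cmd in canary_commands(repo_id):
        if runner(cmd).returncode != 0:
            return False
    return True


if __name__ == "__main__":
    # "repo" is a placeholder repository id.
    print("canary ok:", run_canary("repo"))
```

The `runner` parameter exists only so the logic can be exercised without a real yum installation; on an actual canary host the default `subprocess.run` would be used.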
I hope you find this post helpful, either for the JFrog pointers or for the pseudo-RCA.
By the way, if you are in charge of a production service that goes south, remember “keep calm and push forward”. We have been there and it happens #hugops.
Last modified on 2021-10-06