the only monitoring we have of GitLab CI right now is how many jobs are pending or running. it's useful, but not enough.
i believe the runners themselves provide more information through a prometheus exporter. see how that works and try to tap into that, to answer questions like:
#41032 (comment 2872402): "In the meantime, how can i check the status of chi-node-14-verylarge? I have another job waiting for 20 hours on it right now."
I've checked out gitlab-ci's /metrics endpoint. No metric stood out as particularly useful (I attached a dump so people can check it out without setting up a runner).
GitLab has an API endpoint that would help with checking what a runner is doing (and has done recently), though it may require elevated privileges for shared runners (it needs Reporter+ for project runners).
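For reference, a minimal sketch of querying such an endpoint. It assumes the endpoint meant here is `GET /api/v4/runners/:id/jobs`; the host, runner id and token are placeholders, not values from this thread.

```python
import json
import urllib.parse
import urllib.request

GITLAB_URL = "https://gitlab.example.com"  # placeholder host (assumption)


def runner_jobs_url(base_url, runner_id, status=None, per_page=20):
    """Build the URL that lists jobs processed by a given runner."""
    query = {"per_page": per_page}
    if status is not None:
        query["status"] = status  # e.g. "running", "success", "failed"
    encoded = urllib.parse.urlencode(query)
    return f"{base_url.rstrip('/')}/api/v4/runners/{runner_id}/jobs?{encoded}"


def fetch_runner_jobs(base_url, runner_id, token, status=None):
    """Return the decoded JSON job list; the token needs enough access."""
    request = urllib.request.Request(
        runner_jobs_url(base_url, runner_id, status),
        headers={"PRIVATE-TOKEN": token},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

Something like `fetch_runner_jobs(GITLAB_URL, 42, token, status="running")` would then show what the (hypothetical) runner 42 is working on right now, provided the token is allowed to see that runner.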
"what is the average wait time on runners"
the best answer that can be obtained from GitLab would only include already finished jobs, and not jobs in queue, as a job is assigned to a runner only when that runner says it has a slot for it.
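As a rough illustration of that limitation, here is a sketch that averages the wait of jobs that already started, given job objects as returned by the jobs API. It assumes each job carries either a `queued_duration` field (in seconds) or `created_at` / `started_at` timestamps; the fallback also counts time spent waiting on earlier pipeline stages, so it overstates the pure runner wait. Jobs still in the queue are simply not in the input.

```python
from datetime import datetime
from statistics import mean


def parse_ts(value):
    """Parse an ISO 8601 timestamp such as 2023-08-17T19:25:08.000Z."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def wait_seconds(job):
    """Seconds a job waited before starting, or None if it never started."""
    if job.get("queued_duration") is not None:
        return float(job["queued_duration"])
    if job.get("created_at") and job.get("started_at"):
        delta = parse_ts(job["started_at"]) - parse_ts(job["created_at"])
        return delta.total_seconds()
    return None


def average_wait(jobs):
    """Mean wait over jobs that have a known wait; None if there are none."""
    waits = [w for w in map(wait_seconds, jobs) if w is not None]
    return mean(waits) if waits else None
```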
in fact, i don't think there's anything worth scraping in there, unless we have very complex (e.g. docker machine) runners, which we don't have, so let's scrap this.
the reason label is a bit scary, but anyway, that looks promising. i'll work on throwing this in puppet and scraping this so we have better metrics of our runners. the metrics documentation is here:
origin/master 1da5b2422f4445eb9c8a4f01d9b322f6580f7923
Author: Antoine Beaupré <anarcat@debian.org>
Date:   Thu Aug 17 15:25:08 2023 -0400

    monitor gitlab runner prometheus exporter (tpo/tpa/team#41042)

    This is mainly to figure out if failure rates are higher in the podman
    runner than the docker one, but could also be useful for other things.

    One question this doesn't answer is the queue wait time, but that
    should be available on the GitLab rails exporter itself, it's not
    visible from the runners.

2 files changed, 15 insertions(+)
modules/profile/manifests/gitlab/runner.pp              | 14 ++++++++++++++
modules/profile/manifests/prometheus/server/internal.pp |  1 +

modified   modules/profile/manifests/gitlab/runner.pp
@@ -192,6 +192,20 @@ class profile::gitlab::runner(
     Ferm::Rule::Simple <<| tag == 'gitlab-app-to-ci-runner-session-server' |>>
   }
 
+  file_line { 'gitlab-runner metrics listen_address':
+    path   => '/etc/gitlab-runner/config.toml',
+    after  => '^shutdown_timeout',
+    line   => $facts['networking']['fqdn'],
+    notify => Service[$gitlab_ci_runner::package_name],
+  }
+  # grant Prometheus access to the exporter
+  Ferm::Rule <<| tag == 'profile::prometheus::server-gitlab-runner-exporter' |>>
+  @@prometheus::scrape_job { "gitlab-runner_${facts['networking']['fqdn']}_9252":
+    job_name => 'gitlab_runner',
+    targets  => [ "${facts['networking']['fqdn']}:9252" ],
+    labels   => { 'alias' => $facts['networking']['fqdn'] },
+  }
+
   # prevent Puppet from ever restarting gitlab-runner
   Service <| tag == 'gitlab_ci_runner::service' |> {
     hasrestart => true,
modified   modules/profile/manifests/prometheus/server/internal.pp
@@ -180,6 +180,7 @@ class profile::prometheus::server::internal (
     'nginx': port => 9113;
     'gitlab': port => 9178;
     'gitlab-redis': port => 9121;
+    'gitlab-runner': port => 9252;
     'gitaly': port => 9236;
     'gitlab-workhorse': port => 9229;
     'process': port => 9256;
"what is the average wait time on runners"
that's not something the runner exporters can answer, but we should be able to answer questions like:
is the podman runner failure rate higher than the docker one? (#41295 (closed))
are we exceeding the concurrency limits on our runners? GitLab.com calls that runner capacity: "View the number of jobs being executed divided by the value of limit or concurrent."
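To make those two questions concrete, here is a hedged sketch that reads a dump of the runner's `/metrics` output and computes a capacity ratio and a failure ratio. The metric names (`gitlab_runner_jobs`, `gitlab_runner_concurrent`, `gitlab_runner_jobs_total`, `gitlab_runner_failed_jobs_total`) are assumptions to check against the actual exporter output, and the parser only handles the simple `name{labels} value` lines of the Prometheus text format.

```python
import re
from collections import defaultdict

# One sample line: metric name, optional {labels}, then the value.
SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse Prometheus text exposition into {name: [(labels, value), ...]}."""
    samples = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        match = SAMPLE.match(line)
        if match:
            name, raw_labels, value = match.groups()
            labels = dict(LABEL.findall(raw_labels or ""))
            samples[name].append((labels, float(value)))
    return samples


def total(samples, name):
    """Sum a metric over all of its label combinations."""
    return sum(value for _, value in samples.get(name, []))


def capacity(samples):
    """Jobs currently executing divided by the `concurrent` setting."""
    limit = total(samples, "gitlab_runner_concurrent")
    return total(samples, "gitlab_runner_jobs") / limit if limit else None


def failure_ratio(samples):
    """Failed jobs divided by handled jobs, since the runner started."""
    jobs = total(samples, "gitlab_runner_jobs_total")
    failed = total(samples, "gitlab_runner_failed_jobs_total")
    return failed / jobs if jobs else None
```

Running this against a dump from the podman runner and one from the docker runner would give a first comparison. Since these are raw counters since process start, a `rate()` over a time window in Prometheus itself is the better long-term answer.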
a bunch of panels are not working, but i suspect that's because the runners are kind of idle and have been so since the exporter started. maybe once a bigger build runs we'll see better numbers, particularly on the "increase" or failed jobs dashboards.