donate-neo is having frequent high latency incidents
so we have various prometheus metrics and alerts now, and they are starting to work. one of them that we often trip is the latency alert:
here are the last 24h in #tor-alerts, for example:
19:11:30 -ALERTOR1:#tor-alerts- DjangoHighRequestLatency[donate_neo/warning] alert is firing, 1 alerts on donate-01.torproject.org
19:26:30 -ALERTOR1:#tor-alerts- DjangoHighRequestLatency[donate_neo/warning] alert is resolved, 1 alerts on -donate-01.torproject.org
20:32:30 -ALERTOR1:#tor-alerts- DjangoHighRequestLatency[donate_neo/warning] alert is firing, 1 alerts on donate-01.torproject.org
20:42:30 -ALERTOR1:#tor-alerts- DjangoHighRequestLatency[donate_neo/warning] alert is resolved, 1 alerts on -donate-01.torproject.org
Day changed to 21 Sep 2024
03:32:30 -ALERTOR1:#tor-alerts- DjangoHighRequestLatency[donate_neo/warning] alert is firing, 1 alerts on donate-01.torproject.org
03:37:30 -ALERTOR1:#tor-alerts- DjangoHighRequestLatency[donate_neo/warning] alert is resolved, 1 alerts on -donate-01.torproject.org
06:36:30 -ALERTOR1:#tor-alerts- DjangoHighRequestLatency[donate_neo/warning] alert is firing, 1 alerts on donate-01.torproject.org
06:41:30 -ALERTOR1:#tor-alerts- DjangoHighRequestLatency[donate_neo/warning] alert is resolved, 1 alerts on -donate-01.torproject.org
10:14:30 -ALERTOR1:#tor-alerts- DjangoHighRequestLatency[donate_neo/warning] alert is firing, 1 alerts on donate-01.torproject.org
10:19:30 -ALERTOR1:#tor-alerts- DjangoHighRequestLatency[donate_neo/warning] alert is resolved, 1 alerts on -donate-01.torproject.org
15:27:30 -ALERTOR1:#tor-alerts- DjangoHighRequestLatency[donate_neo/warning] alert is firing, 1 alerts on donate-01.torproject.org
i've redacted other traffic from the channel.
here's another view, from grafana:
https://grafana.torproject.org/d/adwbl8mxnaneoc/availability?var-alertstate=All
incidents are of course quite short, but this is irritating enough (and, i think, impacting users) that it's worth at least documenting in an issue. what i will do next is likely raise the threshold for the alert (currently set to an IMHO generous 500ms) to something that won't keep triggering, but i think this should be investigated.
Note that the django dashboard also sees those high latency events:
it's all the blue spikes in here:
https://grafana.torproject.org/d/O6v4rMpizdb/django?orgId=1&refresh=auto&from=now-24h&to=now&viewPanel=4
if you zoom in on the last hour, you can see a lot of requests end up in the 1-3s bucket, which is pretty bad!
this is not just "ooh, this is 550ms", this is a full second to load a page. and that alert is not checking all pages, it's just loading donate.torproject.org.
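for reference, this is roughly the kind of measurement the blackbox probe is doing, give or take TLS and probe overhead (just a sketch with python requests, not the actual probe configuration):

```python
# rough external latency check against the frontpage, similar in spirit to
# what the blackbox probe measures (this is not the actual probe config)
import time
import requests

for _ in range(10):
    start = time.monotonic()
    r = requests.get("https://donate.torproject.org/", timeout=10)
    elapsed_ms = (time.monotonic() - start) * 1000
    print(f"{r.status_code} {elapsed_ms:.0f}ms")
```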
the django dashboard, on the other hand, takes other views into account: you can see lower down that almost all views with >1s load times are POSTs (/process and /subscribe). if you drill down in the per-view dashboard, you can confirm the latency of /process is 1.75s:
https://grafana.torproject.org/d/django-requests-by-view-jkwq/django-requests-by-view?orgId=1&var-datasource=default&var-namespace=prod&var-job=donate_neo&var-view=process&var-method=All&from=now-24h&to=now
on average. The 95th percentile is 2.42s, which is pretty abysmal!
the donate form, however, loads pretty fast: 27ms average and 71ms at the 95th percentile, so that's great:
https://grafana.torproject.org/d/django-requests-by-view-jkwq/django-requests-by-view?orgId=1&var-datasource=default&var-namespace=prod&var-job=donate_neo&var-view=donate_form&var-method=All&from=now-24h&to=now
but you can see it confirms what we see from the blackbox testing, namely that the frontpage does sometimes load really slowly.
i suspect this might be because we're using a single thread for the app. maybe what's happening is that the /process and /subscribe views are slow (which is, IMHO, a problem in itself, but not as bad as the frontpage being slow), and while they run they tie up that single thread and block everything else, including the frontpage...
if you squint a little, you can even see that the latency spikes on the frontpage match the times when the /process view is active, so i think that's a very interesting theory.
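to make that concrete: if the app runs under gunicorn with the default single sync worker (i haven't verified how it's deployed, so treat this as a sketch), one slow POST holds the only worker and every other request queues behind it. the fix would look something like this in the gunicorn config, which is just a python file:

```python
# gunicorn.conf.py -- hypothetical sketch, assuming donate-neo is served by
# gunicorn (i haven't checked the actual deployment, adjust as needed)
import multiprocessing

# several worker processes, so one slow POST to /process can't stall everything
workers = multiprocessing.cpu_count() * 2 + 1
# threaded workers help if the slow part is waiting on an external payment/CRM
# API rather than burning CPU
worker_class = "gthread"
threads = 4
timeout = 30
```

(if it's uwsgi or something else, the equivalent processes/threads knobs exist there too.)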
@lavamind, i think it's time to look at multithreading that app! do you want a separate issue for that?
@stephen could you see if you can confirm latency on those views in your lab, and possibly find the cause?
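in case it's useful as a starting point, something like this from a django shell gives a first approximation without prometheus/grafana in the loop (only a sketch: the paths are taken from the dashboards above, and since the slow requests are POSTs you'll need realistic payloads to really reproduce the numbers):

```python
# rough timing of the suspect views from `./manage.py shell`; paths taken from
# the dashboards above ("/" is assumed to be the donate_form view). GET only --
# the slow requests in grafana are POSTs, so realistic form data will be needed
# to actually reproduce the 1-2s latencies.
import time
from django.test import Client

client = Client()
for path in ("/", "/process", "/subscribe"):
    start = time.monotonic()
    response = client.get(path, follow=True)
    elapsed_ms = (time.monotonic() - start) * 1000
    print(f"{path}: {response.status_code} in {elapsed_ms:.0f}ms")
```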
no rush before monday, but i think this is one of those post-launch things that should be fixed before we wrap this up.
thanks both!