There is a race condition with the snowflake heap that has been causing the broker to crash several times a day. This race condition has existed in the broker for several years, but some recent updates as well as the host migration managed to shake it loose.
This race condition has caused the snowflake broker to crash repeatedly since the migration. We noticed because CollecTor has not collected any metrics since the restart on 14 November 2019.
Okay, it turns out I was still ssh-ing into the old broker. I copied over the metrics log from the old broker so that CollecTor at least has some data from the last week, but note that this will be metrics from the old broker host, and most proxies will have migrated to the new host by now.
Interestingly, it looks like there is still one proxy in Japan that's polling the old broker.
I'm still not sure why the new broker hasn't been saving metrics since the other domains were migrated on November 14th. Looking at the log files, it looks like the broker might have crashed.
The weird thing is, these errors don't occur on the old broker host:
/var/log/snowflake-broker# grep "fatal error" ./*
/var/log/snowflake-broker# grep "panic" ./*
both yield no results.
The frequency with which the panics occur on the new host explains the lack of metrics: we only log metrics once the broker has been running for 24 hours straight, and that hasn't happened since 14 November 2019.
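To make the connection between the crashes and the missing metrics explicit, here is a tiny sketch (hypothetical names and structure, not the broker's actual metrics code) of a 24-hour reporting loop: nothing is ever written if the process never stays up for a full interval, so a broker that panics every few hours produces no metrics at all.

package main

import (
    "log"
    "time"
)

func main() {
    // Hypothetical stand-in for the broker's metrics loop: the first report
    // is only written after 24 hours of continuous uptime, so a process that
    // crashes every few hours never logs anything.
    const metricsInterval = 24 * time.Hour
    start := time.Now()
    for range time.Tick(metricsInterval) {
        log.Printf("writing metrics after %v of uptime", time.Since(start).Round(time.Minute))
    }
}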
The weird thing is that the #29207 (moved) update was deployed on the old broker at the same time but hasn't caused any problems. It's possible this was due to the DNS update to point to the new broker, but as seen from the metrics data, the old broker was still getting a lot of proxy polls for several days after.
Just updating the summary of this ticket to reflect the actual problem.
Trac:
Summary: changed from "Snowflake broker not serving metrics correctly" to "Fix race condition in snowflake broker"
Description: changed from

Looks like the snowflake broker isn't serving metrics since 2019-11-14 (which happens to coincide with the host migration).

I've attached the result of the above fetch as well as the metrics file located on the snowflake broker host that the broker should be reading from.

to

There is a race condition with the snowflake heap that has been causing the broker to crash several times a day. This race condition has existed in the broker for several years, but some recent updates as well as the host migration managed to shake it loose.

This race condition has caused the snowflake broker to crash repeatedly since the migration. We noticed because CollecTor has not collected any metrics since the restart on 14 November 2019.

The weird thing is that the #29207 (moved) update was deployed on the old broker at the same time but hasn't caused any problems. It's possible this was due to the DNS update to point to the new broker, but as seen from the metrics data, the old broker was still getting a lot of proxy polls for several days after.
Perhaps because the old broker had 1 CPU core and the new one has 2 CPU cores? I can imagine that might affect goroutine scheduling which might affect the visibility of race conditions.
You might try setting GOMAXPROCS=1 in /etc/service/snowflake-broker/run as a temporary workaround, if indeed it helps.
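For reference, GOMAXPROCS=1 in the environment and the runtime call below control the same setting; a minimal sketch (not part of the broker) of the programmatic equivalent:

package main

import (
    "fmt"
    "runtime"
)

func main() {
    // runtime.GOMAXPROCS(1) limits the runtime to executing Go code on one
    // OS thread at a time, the same effect as exporting GOMAXPROCS=1 before
    // starting the process. Passing 0 queries the value without changing it.
    prev := runtime.GOMAXPROCS(1)
    fmt.Printf("GOMAXPROCS lowered from %d to %d\n", prev, runtime.GOMAXPROCS(0))
    // Note: this only makes the racy interleaving less likely; it does not
    // fix the underlying bug.
}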
This race condition occurs because both the proxy process and the client process try to Pop/Remove the same snowflake from the heap.
When a proxy polls the broker, it starts a goroutine that waits for an offer on the snowflake's offer channel, or for a timeout:
go func(request *ProxyPoll) {
    select {
    case offer := <-snowflake.offerChannel:
        request.offerChannel <- offer
    case <-time.After(time.Second * ProxyTimeout):
        // This snowflake is no longer available to serve clients.
        // TODO: Fix race using a delete channel
        heap.Remove(ctx.snowflakes, snowflake.index)
        delete(ctx.idToSnowflake, snowflake.id)
        request.offerChannel <- nil
    }
}(request)
Snowflakes are removed from the heap when they time out or if they are popped off the heap by a client:
func clientOffers(ctx *BrokerContext, w http.ResponseWriter, r *http.Request) {
    ...
    // Delete must be deferred in order to correctly process answer request later.
    snowflake := heap.Pop(ctx.snowflakes).(*Snowflake)
    defer delete(ctx.idToSnowflake, snowflake.id)
    snowflake.offerChannel <- offer
A race can occur whenever the timeout fires after Pop has removed the snowflake from the heap but before the offer is sent through offerChannel. This can be fixed with a lock and a check of whether snowflake.index has been set to -1 (which happens when the snowflake is popped off the heap). Here's a patch that adds synchronization to the broker to prevent simultaneous access to the heap as well as the idToSnowflake map: https://github.com/cohosh/snowflake/pull/14
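To make that concrete, here is a self-contained sketch of the pattern (simplified types and made-up method names such as timeoutRemove and popForClient, not the actual patch): both the timeout removal and the client Pop take the same lock on the broker context, and the timeout path skips the removal when the snowflake's index is already -1.

package main

import (
    "container/heap"
    "sync"
)

// Snowflake is a simplified stand-in for the broker's heap entry. The heap
// keeps index up to date and sets it to -1 once the entry leaves the heap.
type Snowflake struct {
    id           string
    index        int
    offerChannel chan []byte
}

// SnowflakeHeap implements heap.Interface (the ordering here is arbitrary).
type SnowflakeHeap []*Snowflake

func (h SnowflakeHeap) Len() int           { return len(h) }
func (h SnowflakeHeap) Less(i, j int) bool { return h[i].id < h[j].id }
func (h SnowflakeHeap) Swap(i, j int) {
    h[i], h[j] = h[j], h[i]
    h[i].index = i
    h[j].index = j
}

func (h *SnowflakeHeap) Push(x interface{}) {
    s := x.(*Snowflake)
    s.index = len(*h)
    *h = append(*h, s)
}

func (h *SnowflakeHeap) Pop() interface{} {
    old := *h
    n := len(old)
    s := old[n-1]
    s.index = -1 // mark as no longer on the heap
    *h = old[:n-1]
    return s
}

// BrokerContext holds the shared state; the embedded mutex serializes every
// access to the heap and to the idToSnowflake map.
type BrokerContext struct {
    sync.Mutex
    snowflakes    *SnowflakeHeap
    idToSnowflake map[string]*Snowflake
}

// timeoutRemove is the proxy-poll timeout path: it only removes the snowflake
// if a client has not already popped it (Pop sets index to -1).
func (ctx *BrokerContext) timeoutRemove(s *Snowflake) {
    ctx.Lock()
    defer ctx.Unlock()
    if s.index != -1 {
        heap.Remove(ctx.snowflakes, s.index)
        delete(ctx.idToSnowflake, s.id)
    }
}

// popForClient is the client path: the Pop happens under the same lock, so it
// cannot interleave with timeoutRemove.
func (ctx *BrokerContext) popForClient() *Snowflake {
    ctx.Lock()
    defer ctx.Unlock()
    if ctx.snowflakes.Len() == 0 {
        return nil
    }
    return heap.Pop(ctx.snowflakes).(*Snowflake)
}

func main() {
    ctx := &BrokerContext{
        snowflakes:    &SnowflakeHeap{},
        idToSnowflake: make(map[string]*Snowflake),
    }
    s := &Snowflake{id: "example", offerChannel: make(chan []byte, 1)}
    ctx.Lock()
    heap.Push(ctx.snowflakes, s)
    ctx.idToSnowflake[s.id] = s
    ctx.Unlock()

    // A client pops the snowflake first...
    if popped := ctx.popForClient(); popped != nil {
        popped.offerChannel <- []byte("fake offer")
    }
    // ...and a later proxy timeout becomes a harmless no-op instead of
    // calling heap.Remove with a stale index.
    ctx.timeoutRemove(s)
}

Since both paths hold the lock, heap.Remove can never be handed a stale index, which is exactly the interleaving described above.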
I'd like to do some tests before merging this to make sure that the synchronization doesn't slow things down too much, but a code review would be good at this point.
Perhaps because the old broker had 1 CPU core and the new one has 2 CPU cores? I can imagine that might affect goroutine scheduling which might affect the visibility of race conditions.
Ah that would do it :) The reason we didn't notice then was because clients and proxies weren't around to trigger the race until we switched over the DNS records for the bamsoftware.com domains.
You might try setting GOMAXPROCS=1 in /etc/service/snowflake-broker/run as a temporary workaround, if indeed it helps.
Okay, I did this and restarted the broker as of 2019/11/22 22:50:27; that should work until we get this patch reviewed. Thanks!
This looks good. I left a review on your GitHub pull request. I had a minor suggestion and a question regarding the scope of what needs to be protected by a lock.