"NOTICE: gabelmoo had 17 MiddleOnly flags in its vote but the consensus had 9" isn't noteworthy
I grabbed the consensus and gabelmoo's vote from the period when we got the doctor warning, and sure enough:
$ grep MiddleOnly cached-consensus |grep ^s|wc -l
9
$ grep MiddleOnly gabelmoo-vote |grep ^s|wc -l
17
In more detail:
$ grep MiddleOnly gabelmoo-vote |grep ^s
s BadExit MiddleOnly Running Stable Valid
s BadExit Fast MiddleOnly Stable Valid
s BadExit Fast MiddleOnly Running Stable Valid
s BadExit Fast MiddleOnly Running Stable Valid
s BadExit Fast MiddleOnly Stable Valid
s BadExit Fast MiddleOnly Running Valid
s BadExit Fast MiddleOnly Running Stable Valid
s BadExit Fast MiddleOnly Running Valid
s BadExit Fast MiddleOnly Stable Valid
s BadExit Fast MiddleOnly Stable Valid
s BadExit Fast MiddleOnly Stable Valid
s BadExit Fast MiddleOnly Stable Valid
s BadExit Fast MiddleOnly Stable Valid
s BadExit Fast MiddleOnly Running Stable Valid
s BadExit Fast MiddleOnly Running Stable Valid
s BadExit Fast MiddleOnly Stable Valid
s BadExit Fast MiddleOnly Running Stable Valid
So even gabelmoo thought only 9 of the 17 should be Running. Relays that aren't Running are left out of the consensus, so it's not surprising that only 9 of them made it in.
But even if gabelmoo had different opinions about which ones are Running, the fact that gabelmoo voted MiddleOnly for a relay that didn't make it into the consensus is not noteworthy. That can happen in a variety of cases during normal operation.
I think a more precise check would be: for each relay listed in the consensus as MiddleOnly, did gabelmoo list it as MiddleOnly too?
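To make that concrete, here is a minimal sketch of the per-relay check in plain Python. It is not the checker's actual code: the function names are made up for illustration, and it assumes both documents are available as text in the usual router status format, where each relay has an "r" line whose third field is the identity, followed by an "s" line listing its flags.

```python
def middleonly_relays(document_text):
    """Return the identities of relays whose 's' line carries MiddleOnly.

    Hypothetical helper: assumes each relay entry has an 'r' line
    (nickname, then identity as the third token) followed by an 's' line.
    """
    flagged = set()
    identity = None
    for line in document_text.splitlines():
        if line.startswith("r "):
            parts = line.split()
            identity = parts[2] if len(parts) > 2 else None
        elif line.startswith("s "):
            # Compare whole flag tokens rather than substrings.
            if identity is not None and "MiddleOnly" in line.split()[1:]:
                flagged.add(identity)
            identity = None
    return flagged


def missing_middleonly_votes(consensus_text, vote_text):
    """Relays that are MiddleOnly in the consensus but not in this vote."""
    return middleonly_relays(consensus_text) - middleonly_relays(vote_text)
```

With this shape, the warning fires only when `missing_middleonly_votes(...)` is non-empty. Relays that the authority flagged MiddleOnly but that never reached the consensus are ignored, which is the case that triggered the notice above.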
If that's too much coding, a simpler approximation (which avoids reporting the false positives but also omits some of the true positives) might be: don't log anything if the number in the vote is bigger than the number in the consensus.
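A sketch of that simpler rule in Python, again with hypothetical names and assuming the documents are plain text with one "s" line per relay. It counts the flags the same way the grep commands above do, except that it matches whole flag tokens, and it only reports when the vote has fewer MiddleOnly flags than the consensus.

```python
def count_middleonly(document_text):
    """Count 's' lines that list the MiddleOnly flag (hypothetical helper)."""
    return sum(
        1
        for line in document_text.splitlines()
        if line.startswith("s ") and "MiddleOnly" in line.split()[1:]
    )


def should_warn(vote_count, consensus_count):
    """Stay quiet when the vote has at least as many flags as the consensus."""
    return vote_count < consensus_count
```

For the numbers in this ticket, `should_warn(17, 9)` is false, so the notice would not have been logged.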