Understand the "long tail" of unclassifiable network traffic

The obfs family of obfuscation protocols strives to "look like nothing" and falls into the long tail of network traffic that is meant to be unclassifiable. That is, if an ISP is monitoring its uplink, it shouldn't be able to figure out that one of its users is talking obfs4 to a Tor bridge. Instead, the obfs4 connection should show up as "unknown" in the log files.

We know next to nothing about this long tail that the obfs family hides in. What fraction of flows does it constitute? What fraction of bytes? What kind of protocols and implementations are difficult to classify? How does the long tail differ across uplinks?

Over at legacy/trac#30716 (moved) we're brainstorming features for obfs4's successor but before moving forward with obfs5, we should get a better understanding of this long tail because it allows us to make informed design decisions. Packet traces from the WIDE backbone is one of the data sets that may be helpful here.

Let's use this ticket to track progress and collect insights.