Implement counter mode faster on x86 and x86_64
Our current counter-mode implementation assumes that 32-bit math is fastest, that any endianness is possible, and that unaligned access is forbidden or slow.
This all slows it down. We can do much better than we do now on little-endian systems, and much better still if we happen to be somewhere where unaligned integer access is just as fast as aligned integer access (this seems to be the case on x86 and x86_64).
The tricks are to change our current counter-increment logic to something like
for (p = 15; p >= 0; p--) {
    if (++counter_byte[p] != 0)
        break;  /* no carry into the next byte */
}
and, on systems where unaligned access works and is fast, to change the main XOR loop logic to:
{current loop until c == 0}
while (len >= 16) {
    XOR 16 bytes at once via 32-bit or 64-bit XORs
    increment counter
    crypt another block
    len -= 16
}
{current loop until len == 0}
On my laptop with a relatively unoptimized OpenSSL, preliminary testing suggests that this should shave over 5% off AES runtime when encrypting a bunch of 509-byte cell payloads. The gain should be even more impressive on AES-NI hardware.
[We can't just use OpenSSL's now-relatively-good counter mode, since I don't think it is exposed through EVP.]