Wednesday, February 26, 2014

Saving bandwidth while chatting using UTF8 (Fixed)

UPDATE: Decode function fixed and tested (the issue was a 4 being used instead of a 1, silly me lol). Next up is maybe porting to C#, C++, and maybe Java, but it's pretty simple; I think people can do it themselves.

UPDATE: Well this is embarrassing, my new and faster decoding function is flawed lol. For now, use the old decoding function for accuracy while I look into fixing the faster code.

UPDATE: New encoding/decoding functions and changes are below. From testing, encoding is over 5X faster than the previous code and decoding is over 15X faster. The major change is that I implemented a lookup table so less string manipulation is required. Minor changes: I swapped and reordered some things in favor of faster alternatives, for example using IndexOf instead of Contains.

Test string (100,000-iteration loop) = This is a test to try to make I can make my encoding and decoding for my scheme faster, it badly needs a speed boost lol!!!

Old code:
Encoding: ~9,334 ms
Decoding: ~13,556 ms

New code:
Encoding: ~1,732 ms
Decoding: ~849 ms

    Previously I had a concept of a modified UTF8 encoding scheme with selected characters being optimized. I had code, but the scheme was broken to some extent and the code flawed. I left it alone for a couple of months, and then pbanj (from IRC) posted a message in all hex lol. At first I didn't know what to do with it, so I told him "wanna decode that for me?" and he asked why he would have to. At that moment it hit me: he was using an app I made a while ago with the encoding scheme I made (still with bugs). I decoded the message, it said "has this been updated", and that was when I got back to work on fixing this scheme.

    The concept is simple: most English messages consist mostly of Latin characters, a-z and A-Z. I figured all that was needed to represent these was 6 bits. But 6 bits left extra room, so I squeezed in the next most used characters, the digits 0-9. That left room for 2 more characters. I used one to represent the space because, come on... it's used to separate words from one another. The last slot is reserved for something that isn't a character: it is used as an escape or wildcard. The last spot in the 6-bit encoding (111111) represents a non-optimized character (raw UTF8). Because most English messages contain far more Latin letters, digits 0-9, and spaces than punctuation and other special characters, this scheme will most likely save bandwidth. The scheme works because UTF8 has certain characteristics and rules that complement it. A character encoded in UTF8 cannot (in bits) start with 10XXXXXX. The first byte contains information on how many bytes make up the current character; for example, 1110XXXX has three leading 1 bits, so the character takes 3 bytes total. Including itself, the decoder also needs to read the next 2 bytes to successfully decode the character.
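That leading-byte rule can be sketched as a tiny helper. This is just an illustration in Python (the original app wasn't written in Python, and the function name is mine):

```python
def utf8_char_length(first_byte: int) -> int:
    """How many bytes a UTF-8 character occupies, read off its first byte."""
    if first_byte < 0b10000000:    # 0XXXXXXX: one-byte (ASCII) character
        return 1
    if first_byte < 0b11000000:    # 10XXXXXX: continuation byte, never first
        raise ValueError("10XXXXXX cannot start a UTF-8 character")
    if first_byte < 0b11100000:    # 110XXXXX: two bytes total
        return 2
    if first_byte < 0b11110000:    # 1110XXXX: three bytes total
        return 3
    return 4                       # 11110XXX: four bytes total

# '.' is ASCII (0x2E, one byte); the euro sign takes three bytes
print(utf8_char_length(0x2E))               # 1
print(utf8_char_length("€".encode()[0]))    # 3
```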

    Now let's see how we can optimize UTF8 a bit. Assume we want to encode with only 6 bits, those 6 bits representing the optimized characters mentioned above. Every time a special character needs to be encoded, the bits 111111 are used instead. That character is then encoded in UTF8 and its byte representation is set aside for later use (we will get back to it). The encoding then continues with the next character; again, if the character is not an optimized one, the bits 111111 are added and its UTF8 bytes are appended to the UTF8 bytes from earlier.
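Per character, that decision could look like this (a Python sketch; the table string matches the optimized characters listed further down, but the helper name is my own invention):

```python
# 64-entry table: a-z, A-Z, 0-9, space; index 63 (111111) is the escape
CHARSET = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 "

def code_for(ch: str):
    """Return the 6-bit code for ch, plus its UTF-8 bytes if it needs escaping."""
    idx = CHARSET.find(ch)
    if idx >= 0:
        return format(idx, "06b"), b""        # optimized character
    return "111111", ch.encode("utf-8")       # escape + raw UTF-8 bytes

print(code_for("H"))   # ('100001', b'')  -> index 33, encoded inline
print(code_for("."))   # ('111111', b'.') -> escaped, bytes set aside
```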

    At the end of the encoding we put everything together. We now have 2 groups of data: the bytes of the characters encoded in UTF8, and the bits of the characters encoded with 6 bits. We need to put them together, but how will the decoder know where the UTF8 bytes end and where the 6-bit encoding starts? Earlier we established that UTF8 cannot start with 10XXXXXX, so we use this to signify the end of the UTF8 block. Instead of a whole byte, we only use the first 2 bits (1 and 0), because there is no point in reading a whole byte if UTF8 won't be able to decode it anyway. So a 1 and a 0 are placed between the UTF8 bytes and the 6-bit encoding. The last step of encoding the message is converting it to bytes, since the final data needs to be in a byte array. We do this by simply adding 1s until the number of bits is a multiple of 8 (8 bits in a byte). 1s are used so the decoder doesn't mistakenly decode an extra character when the padding happens to be 6 bits long; for example, if 0s were used and the padding was 6 bits long, the first optimized character (000000 = "a") would be decoded at the end even though the original message didn't have it.
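Putting the steps together (6-bit codes, escaped UTF8 bytes, the 10 marker, and the 1-padding), a whole-encoder sketch in Python might look like this. It is an illustration, not the original code, but it does reproduce the demo bytes walked through below:

```python
# 64-entry table: a-z, A-Z, 0-9, space; index 63 (111111) is the escape
CHARSET = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 "

def encode(message: str) -> bytes:
    utf8 = bytearray()  # UTF-8 bytes of every escaped character, in order
    codes = ""          # the 6-bit stream, built as a bit string
    for ch in message:
        idx = CHARSET.find(ch)
        if idx >= 0:
            codes += format(idx, "06b")      # optimized character
        else:
            codes += "111111"                # escape marker
            utf8 += ch.encode("utf-8")       # stash the real character
    # UTF-8 block first, then the "10" end marker, then the 6-bit stream
    bits = "".join(format(b, "08b") for b in utf8) + "10" + codes
    bits += "1" * (-len(bits) % 8)           # pad with 1s to a byte boundary
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

# Matches the demo: 2E:A1:10:B2:CE:F8:C6:3E:34:03:04:F8:84:BE:B4:86:59:62:DF:FF
print(encode("Hello my name is TizzyT.").hex(":").upper())
```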

    Decoding is fairly simple. We read the message byte by byte, decoding in UTF8, and stop only when a byte at a character boundary starts with 10 (continuation bytes inside a multi-byte character are consumed as part of that character, so they never trigger the stop). The decoded UTF8 characters are set aside for later use. The bit-by-bit decoding now starts at the beginning of the byte that started with 10XXXXXX. First, because the 1 and 0 were just used as markers, they are skipped or removed. The rest of the message is decoded 6 bits at a time until there are no more bits to read or fewer than 6 bits remain. Each 6 bits corresponds to one of the optimized characters. Each time 111111 comes up, it grabs the next of the special characters in order. For example, on the first occurrence of 111111, that position in the message gets the first special character we decoded earlier.
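A matching decoder sketch in Python (same assumed character table as the encoder sketch above; when 111111 turns up and no special characters remain to substitute, it is treated as padding and decoding stops):

```python
# 64-entry table: a-z, A-Z, 0-9, space; index 63 (111111) is the escape
CHARSET = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 "

def decode(data: bytes) -> str:
    bits = "".join(format(b, "08b") for b in data)
    pos, specials = 0, []
    # Phase 1: UTF-8 characters, until a byte at a character boundary starts 10
    while bits[pos:pos + 2] != "10":
        first = int(bits[pos:pos + 8], 2)
        if first < 0x80:   n = 1             # 0XXXXXXX
        elif first < 0xE0: n = 2             # 110XXXXX
        elif first < 0xF0: n = 3             # 1110XXXX
        else:              n = 4             # 11110XXX
        raw = bytes(int(bits[i:i + 8], 2) for i in range(pos, pos + 8 * n, 8))
        specials.append(raw.decode("utf-8"))
        pos += 8 * n
    pos += 2                                 # skip the "10" marker bits
    # Phase 2: the 6-bit stream
    out, nxt = [], 0
    while len(bits) - pos >= 6:
        code = int(bits[pos:pos + 6], 2)
        pos += 6
        if code == 0b111111:                 # escape... or padding
            if nxt == len(specials):
                break                        # nothing left to substitute
            out.append(specials[nxt])
            nxt += 1
        else:
            out.append(CHARSET[code])
    return "".join(out)

print(decode(bytes.fromhex("2EA110B2CEF8C63E340304F884BEB4865962DFFF")))
```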

Here is a demo of the encoding:
Optimized characters: "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 " without the quotes (note the trailing space).
Message = Hello my name is TizzyT.
Hex = 2E:A1:10:B2:CE:F8:C6:3E:34:03:04:F8:84:BE:B4:86:59:62:DF:FF
Octets =
00101110 10100001 00010000 10110010
11001110 11111000 11000110 00111110
00110100 00000011 00000100 11111000
10000100 10111110 10110100 10000110
01011001 01100010 11011111 11111111

First we look at the first byte (first octet): 00101110
We see that it starts with a 0, meaning this character needs only this one byte to decode. It in turn decodes to the period at the end of the original message. We add that to a list of special characters for later use.

We move onto the next byte (second octet): 10100001
It starts with a 1 and a 0, so from this we know to stop decoding in UTF8.
Note: if a byte doesn't start with 10, continue decoding in UTF8.

We skip (or remove) the 1 and 0 because they are markers, and look at the next 6 bits: 100001
That is the number 33, so the character here is the 33rd character in our optimized characters list (counting from 0), which is "H". We repeat this throughout the rest of the message.
000100 = "e"
001011 = "l"
001011 = "l"
001110 = "o"
111110 = " "
001100 = "m"
011000 = "y"
111110 = " "
001101 = "n"
000000 = "a"
001100 = "m"
000100 = "e"
111110 = " "
001000 = "i"
010010 = "s"
111110 = " "
101101 = "T"
001000 = "i"
011001 = "z"
011001 = "z"
011000 = "y"
101101 = "T"
111111 = Special Character
111111 = Padding

Here we come across 111111, which signifies that the current character is a special character. It is the first occurrence, so we look to the first decoded special character we have (which is the period).
Now there are still 6 more bits to read, so we read those. If there were fewer than 6 bits left we would stop, because that would be the end of the encoding and the remaining bits are padding. Here the last 6 bits are 111111 and there are no special characters left to substitute, so they are treated as padding and the decoding ends; otherwise we would decode as usual.

Using this scheme we effectively saved 4 bytes with this message.
Standard UTF8 = 48:65:6C:6C:6F:20:6D:79:20:6E:61:6D:65:20:69:73:20:54:69:7A:7A:79:54:2E (24 bytes)
This scheme = 2E:A1:10:B2:CE:F8:C6:3E:34:03:04:F8:84:BE:B4:86:59:62:DF:FF (20 bytes)

In the best-case scenario (only optimized characters and no padding), messages are 0.75 times the size of the same message encoded in standard UTF8. In the worst case this scheme can be any number of bytes larger, in which case the message should be sent using standard UTF8 instead. Because this scheme is based on UTF8, and because it decodes using UTF8 first, there is no conflict with using this scheme to decode a standard UTF8 message.
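The best-case arithmetic is easy to sanity-check: each character costs 6 bits instead of 8, with only the 2-bit marker plus up to 7 padding bits as overhead (a quick check of my own, not from the original post):

```python
def best_case_size(n: int) -> int:
    """Encoded size in bytes when all n characters are optimized ones."""
    bits = 2 + 6 * n            # "10" marker + one 6-bit code per character
    return (bits + 7) // 8      # round up to whole bytes (1-padding)

# The ratio approaches 0.75 as messages get longer
for n in (24, 100, 1000):
    print(n, "chars:", best_case_size(n), "bytes vs", n, "bytes in UTF8")
```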

Download Demo App:
Note: the demo app doesn't fall back to standard UTF8 encoding, so you can see the worst-case scenario for a given message.

After thoughts:
1) Why didn't I use compression instead of this encoding scheme?
    I thought of that and ran several tests. Compression of large messages with hundreds or thousands of characters works nicely, in many or most cases much better than this scheme. But in an instant messaging environment people don't hold their messages until there is a huge bulk.. I mean come on, it's called instant messaging for a reason. People send what's on their minds as soon as they finish typing it, and for such short messages most if not all compression actually makes the message larger in size.

2) Why did you make this?
    I was first intending to make a new chat program just for the heck of it, and while doing so I needed to decide what encoding scheme I would use to send text messages. I had 2 factors I wanted to keep optimal: the first was being able to use a large range of characters, the second was using as little bandwidth as possible. The first obvious choice was ASCII, but that turned out to be very limited in what I wanted users to be able to type. The next was UTF8. While looking into how UTF8 worked and how it came to be, I was amazed at how flexible it was. It uses only the number of bytes needed to represent a given character, so my knowledge of variable-length encoding grew, and the scheme I came up with essentially takes that concept a bit further. So, back to the original question: this scheme satisfies both factors and, in my opinion, saves enough bandwidth for me to opt for its use. In the worst-case scenario in a completed application (if made to fall back), the message would simply be the size of a standard UTF8-encoded message.

3) Have you tried to optimize it even more?
    I have thought about it, but besides rearranging some declarations I haven't really changed the fundamentals of my code. There are probably optimizations here and there that I haven't gotten into, like making use of bit-shift operators or creating a lookup table instead of constantly recalculating things. I will maybe get around to those later; for now I am satisfied with the basic concept and workings of the scheme, and I'll leave the optimizing and/or porting to others who want to use it.

4) You say this saves bandwidth but how when some messages aren't smaller in size?
    When I say save bandwidth, I am talking about traffic used within a given duration. For many people that duration runs from the start of a billing cycle to the end of one. Here in America, and probably in most other countries, a billing cycle is usually per month, so within a month this scheme would save you bandwidth. For a single message it might not be much, but over hundreds or even thousands of messages throughout the month you would save a noticeable amount, in my opinion.

5) Why 6 bits? Why not lower?
    I will be honest, I wanted to use 5 bits at first, and being a complete scatterbrain I completely forgot about capital letters, so it took a good half hour or so before my brain fart subsided and I increased it by 1 bit. There are 26 letters in the Latin alphabet, but with capitalization that is already 52 characters, which exceeds the 32 possible with 5 bits. And I would argue that 6 bits works out nicely: the whole alphabet with capitalization, the digits 0-9, a space, and the required escape/wildcard.
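For the record, the slot count fills 6 bits exactly (simple arithmetic, spelled out here):

```python
# 5 bits give 32 slots, which the 52 letters alone already exceed;
# 6 bits give 64 slots, which the chosen symbols fill exactly.
letters, digits, space, escape = 26 * 2, 10, 1, 1
print(letters > 2 ** 5)                              # True: 5 bits not enough
print(letters + digits + space + escape == 2 ** 6)   # True: 64 = exactly 6 bits
```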

Here is the updated code I have for this scheme, which encodes about 5X faster and decodes about 15X faster than the old one
(100,000-loop stopwatch benchmark with a string of 123 characters):

While this scheme can save bandwidth and feels fast, it is much slower than standard UTF8 encoding and requires more steps. For practical use, though, the encoding should be more than fast enough.

As always, if anyone has any issues or improvements, please comment.
Open to non-commercial use.
