Monday, June 20, 2016

VBES - Saving even more bandwidth with UTF8

So a while ago I came up with an encoding scheme which utilized UTF8's variable byte concept to essentially save bandwidth while chatting in instant messaging scenarios. After that I had thought of improving upon that idea even further but never got around to actually coding it, I had the idea down but I didn't have a solid working scheme much less code. The other day though I thought I'd work on it and after planning it out and finishing the encoder and then the decoder the next day I had come up and made an encoding scheme which meets all the requirements of my previous encoding scheme and performed better (in saving bandwidth). This new scheme implements the idea that some letters are used more often then others and so if I allocate fewer bits to those more likely used letters the final message should be pretty small. I broke up the letters, a few punctuation, a space and an escape for UTF8 fallback into groups of 4bits, 6bits, 8bits, and 9bits. It is completely backwards compatible with UTF8 which is nice in scenarios where VBES would produce larger messages, one can just used the UTF8 version instead and the message would still be decoded. Its a simple concept and I will be opting to use this for my future messenger app.

How it works:
* 0000
  0001
e 0010
t 0011
a 0100
o 0101
i 0110
n 0111
s 1000
h 1001
r 1010
d 1011
l 1100
c 1101

u 111000
m 111001
w 111010
f 111011
g 111100
. 111101

y 11111000
p 11111001
b 11111010
v 11111011
k 11111100
j 11111101

, 111111100
x 111111101
q 111111110
z 111111111

The * signifies that there is an escape and a UTF8 encoded character should fill the space.

Demo Encoder/Decoder: HIDDEN FOR CHALLENGE

Q/A:
How much space does this save?
In the best case scenarios the size of the message is 50 percent of the UTF8 message plus 1 byte. So that is (X*.5)+1 bytes.

Where does this 50 percent plus 1 come from?
The half comes from the fact that the most optimized characters are 4 bits long which is half a byte which is half the size of the smallest possible letter in UTF8 which is a byte. The +1 comes from the fact that if a message consisted of all 4bit letters and fit perfectly into multiples of 8bits there would still require 2 more bits for the UTF8 marker. With those 2 included bits there now has to be padding and thus an entire byte has been added to the message.

How does this compare to the previous encoding scheme?
The previous encoding scheme worked on optimizing several characters and allocating 6 bits for each of those optimized characters. Because 6 is 75 percent of 8 and also had the 2 added bits for the UTF8 marker its best scenario efficiency would be (X*.75)+1.

How comes this scheme doesn't optimize capital letters?
By default this scheme will turn the entire input message to lowercase because only the lowercase letters are optimized and thus the message would save the most space this way but you can disable this behavior if you truly wish to send a message with capital letters. The capital letters each will be encoded using standard UTF8 and their placement will be signified with 0000 which is also used to signify the placement of other UTF8 encoded characters.

Source (VB.net):