How it works:
Character   Code
*           0000
(space)     0001
e           0010
t           0011
a           0100
o           0101
i           0110
n           0111
s           1000
h           1001
r           1010
d           1011
l           1100
c           1101
u           111000
m           111001
w           111010
f           111011
g           111100
.           111101
y           11111000
p           11111001
b           11111010
v           11111011
k           11111100
j           11111101
,           111111100
x           111111101
q           111111110
z           111111111
The * signifies an escape: a UTF-8 encoded character belongs at that position in the message.
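As a rough illustration of how such a table can be used, here is a minimal Python sketch (not the VB.net source of this post). It reads the table as character-then-code pairs and assumes the unlabeled 0001 entry is the space character. The names `CODES`, `encode_bits` and `decode_bits` are made up for this example, and because the exact layout of escaped UTF-8 characters is not described here, the sketch only handles the optimized characters.

```python
ESCAPE = "0000"  # the * entry: marks a UTF-8 encoded character

# Assumption: the table is read as "character code" pairs and the
# unlabeled 0001 entry is the space character.
CODES = {
    " ": "0001", "e": "0010", "t": "0011", "a": "0100", "o": "0101",
    "i": "0110", "n": "0111", "s": "1000", "h": "1001", "r": "1010",
    "d": "1011", "l": "1100", "c": "1101",
    "u": "111000", "m": "111001", "w": "111010", "f": "111011",
    "g": "111100", ".": "111101",
    "y": "11111000", "p": "11111001", "b": "11111010", "v": "11111011",
    "k": "11111100", "j": "11111101",
    ",": "111111100", "x": "111111101", "q": "111111110", "z": "111111111",
}
DECODE = {bits: ch for ch, bits in CODES.items()}


def encode_bits(text):
    """Return the bit string for text made only of optimized characters."""
    # Any other character raises KeyError; escapes are not sketched here.
    return "".join(CODES[ch] for ch in text)


def decode_bits(bits):
    """Decode a bit string by growing a candidate until it matches a code."""
    out, current = [], ""
    for bit in bits:
        current += bit
        if current == ESCAPE:
            raise ValueError("escape found; UTF-8 handling is not sketched here")
        if current in DECODE:
            out.append(DECODE[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a complete code")
    return "".join(out)


print(encode_bits("the"))  # 001110010010
```

Reading one bit at a time works because no code in the table is the beginning of another code, so the first match is always the right one.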
Demo Encoder/Decoder: HIDDEN FOR CHALLENGE
Q/A:
How much space does this save?
In the best case, the encoded message is 50 percent of the size of the UTF-8 message plus 1 byte. That is (X * 0.5) + 1 bytes, where X is the size of the UTF-8 message in bytes.
Where does this 50 percent plus 1 come from?
The half comes from the fact that the most optimized characters are 4 bits long, which is half a byte, and a byte is the smallest possible size of a letter in UTF-8. The +1 comes from the fact that even if a message consisted entirely of 4-bit characters and fit perfectly into a multiple of 8 bits, it would still require 2 more bits for the UTF-8 marker. With those 2 bits included, padding is needed, and thus an entire byte is added to the message.
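The arithmetic can be checked with a small Python sketch. The function name `best_case_layout` is made up for this example, and the only inputs it assumes are the ones stated above: 4 bits per character and a 2-bit marker.

```python
def best_case_layout(num_chars, bits_per_char=4, marker_bits=2):
    """Return (payload_bits, padding_bits, total_bytes) for the best case."""
    payload_bits = num_chars * bits_per_char
    used_bits = payload_bits + marker_bits
    padding_bits = (-used_bits) % 8  # bits needed to reach a whole byte
    total_bytes = (used_bits + padding_bits) // 8
    return payload_bits, padding_bits, total_bytes


# 100 four-bit characters: 400 payload bits, 6 padding bits, 51 bytes,
# which matches (100 * 0.5) + 1.
print(best_case_layout(100))  # (400, 6, 51)
```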
How does this compare to the previous encoding scheme?
The previous encoding scheme optimized several characters and allocated 6 bits to each of them. Because 6 is 75 percent of 8, and that scheme also had the 2 added bits for the UTF-8 marker, its best-case size would be (X * 0.75) + 1 bytes.
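To put the two best cases side by side, here is a short Python sketch. The helper `best_case_bytes` is hypothetical, and it assumes every character is an optimized one and that both schemes carry the same 2-bit marker.

```python
import math


def best_case_bytes(utf8_bytes, bits_per_char, marker_bits=2):
    """Best-case encoded size in bytes for a message of utf8_bytes characters."""
    return math.ceil((utf8_bytes * bits_per_char + marker_bits) / 8)


# Columns: UTF-8 size, previous scheme (6 bits), this scheme (4 bits)
for x in (8, 100, 1000):
    print(x, best_case_bytes(x, 6), best_case_bytes(x, 4))
# 8 7 5
# 100 76 51
# 1000 751 501
```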
How come this scheme doesn't optimize capital letters?
By default this scheme converts the entire input message to lowercase, because only the lowercase letters are optimized and the message therefore saves the most space this way. You can disable this behavior if you truly wish to send a message with capital letters. Each capital letter will then be encoded using standard UTF-8, and its placement will be signified with 0000, which is also used to signify the placement of other UTF-8 encoded characters.
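The effect of the lowercase option can be illustrated with a small Python sketch. The names `OPTIMIZED` and `prepare` are made up for this example, and the set assumes the optimized characters are the space, the 26 lowercase letters, the period and the comma.

```python
# Assumption: the optimized characters are space, a-z, '.' and ','.
OPTIMIZED = set(" etaoinshrdlcumwfg.ypbvkj,xqz")


def prepare(message, lowercase=True):
    """Return (text, escapes): the text to encode and how many
    characters would need the 0000 escape."""
    text = message.lower() if lowercase else message
    escapes = sum(1 for ch in text if ch not in OPTIMIZED)
    return text, escapes


print(prepare("Hello, World."))                   # ('hello, world.', 0)
print(prepare("Hello, World.", lowercase=False))  # ('Hello, World.', 2)
```

With lowercasing left on, every character in this message is optimized. With it off, the two capital letters each need an escape.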
Source (VB.net):