Truncated binary encoding

Truncated binary encoding is an entropy encoding typically used for uniform probability distributions with a finite alphabet. It is parameterized by an alphabet with total size of number n. It is a slightly more general form of binary encoding when n is not a power of two.

If n is a power of two then the coded value for 0 ≤ x < n is the simple binary code for x of length log2(n). Otherwise let k = floor( log2(n) ) such that 2k < n < 2k+1 and let u = 2k+1 - n.

Truncated binary encoding assigns the first u symbols codewords of length k and then assigns the remaining n - u symbols the last n - u codewords of length k + 1. Because all the codewords of length k + 1 consist of an unassigned codeword of length k with a "0" or "1" appended, the resulting code is a prefix code.

Example with n = 5

For example, for the alphabet {0, 1, 2, 3, 4}, n = 5 and 22n < 23, hence k = 2 and u = 23 - 5 = 3. Truncated binary encoding assigns the first u symbols the codewords 00, 01, and 10, all of length 2, then assigns the last n - u symbols the codewords 110 and 111, the last two codewords of length 3.

For example, if n is 5, plain binary encoding and truncated binary encoding allocates the following codewords. Digits shown struck are not transmitted in truncated binary.

Truncated
binary
EncodingStandard
binary
00000
10011
20102
UNUSED0113
UNUSED1004
UNUSED1015/UNUSED
31106/UNUSED
41117/UNUSED

It takes 3 bits to encode n using straightforward binary encoding, hence 23 - n = 8 - 5 = 3 are unused.

In numerical terms, to send a value x where 0 ≤ x < n, and where there are 2kn < 2k+1 symbols, there are u = 2k + 1n unused entries when the alphabet size is rounded up to the nearest power of two. The process to encode the number x in truncated binary is: If x is less than u, encode it in k binary bits. If x is greater than or equal to u, encode the value x + u in k + 1 binary bits.

Example with n = 10

Another example, encoding an alphabet of size 10 (between 0 and 9) requires 4 bits, but there are 24 − 10 = 6 unused codes, so input values less than 6 have the first bit discarded, while input values greater than or equal to 6 are offset by 6 to the end of the binary space. (Unused patterns are not shown in this table.)

Input
value
OffsetOffset
value
Standard
Binary
Truncated
Binary
0000000000
1010001001
2020010010
3030011011
4040100100
5050101101
661201101100
761301111101
861410001110
961510011111

To decode, read the first k bits. If they encode a value less than u, decoding is complete. Otherwise, read an additional bit and subtract u from the result.

Example with n = 7

Here is a more extreme case: with n = 7 the next power of 2 is 8 so k = 2 and u = 23 - 7 = 1:

Input
value
OffsetOffset
value
Standard
Binary
Truncated
Binary
00000000
112001010
213010011
314011100
415100101
516101110
617110111

This last example demonstrates that a leading zero bit does not always indicate a short code; if u < 2k, some long codes will begin with a zero bit.

Simple algorithm

Generate the truncated binary encoding for a value x, 0 <= x < n, where n > 0 is the size of the alphabet containing x. n need not be a power of two.

string TruncatedBinary (int x, int n)
{
	// Set k = floor(log2(n)), i.e., k such that 2^k <= n < 2^(k+1).
	int k = 0, t = n;
	while (t > 1) { k++;  t >>= 1; }

	// Set u to the number of unused codewords = 2^(k+1) - n.
	int u = (1 << k+1) - n;

	if (x < u)  return Binary(x, k); 
        else  return Binary(x+u, k+1));
}

The routine Binary is expository; usually just the rightmost len bits of the variable x are desired. Here we simply output the binary code for x using len bits, padding with high-order 0's if necessary.

string Binary (int x, int len)
{
	string s = "";
	while (x != 0) {
		if (even(x))  s = '0' + s;
		else  s = '1' + s;
		x >>= 1;
	}
	while (s.Length < len)  s = '0' + s;
	return s;
}

On efficiency

If n is not a power of two, and k bit symbols are observed with probability p, k + 1 bit symbols are observed with probability 1 - p. We can calculate the expected number of bits per symbol as:

.

Raw encoding of the symbol is bits. Then relative space saving s (see Data_compression_ratio) of the encoding can be defined as

.

When simplified, this expression leads to:

.

This indicates that relative efficiency of truncated binary encoding increases as probability p of k bit symbols increase, and the raw encoding bits per symbol decreases.

See also

This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.