M. Davis, 2004-05-08
The APON format provides a representation for numbers as a sequence of bytes. It provides for an an unlimited exponent size, arbitrary precision of mantissa, and importantly, preserves numeric order; that is, an unsigned byte-by-byte comparison will return the same order as the original numbers. One other important feature of this format is that it has parsable length. Thus a sequence of such numbers can be serialized and recovered without any other structure being required, such as the length of each separate number.
The format uses a series of exponent bytes, followed by a series of mantissa bytes. The first bit is the sign bit (s), where 1 = non-negative. First we discuss the format for positive numbers, then we'll show how negatives are represented. The next bit is the i bit, which is 1 for ∞ or NaN, otherwise zero. For clarity, in all the examples below the sign and i bits are boldface.
If the i bit is on, then the rest of the first byte is zero, and the subsequent mantissa indicates whether or not it is a NaN, and what kind of NaN it is. In that case, the first byte has the following form. (We'll get to the format of the mantissa later.)
|
If the number is neither ∞ nor NaN, change the number into exponent/mantissa form, whereby if the number is zero, then the exponent is zero and mantissa is zero; otherwise the exponent is such that the mantissa is greater than or equal to 0.5 and strictly less than 1. Thus 2.25 becomes 0.5625 × 22, and zero become 0.0 × 20.
The regular exponent consists of an exponent sign bit (z) followed by a set of exponent continuation bits (k) followed by exponent value bits (e). These are packed into a series of bytes, after the s and i bits.
| number | z | e |
|---|---|---|
| 15 | 1 | 1111 |
| 14 | 1 | 1110 |
| ... | ||
| 1 | 1 | 0001 |
| 0 | 1 | 0000 |
| -1 | 0 | 1111 |
| -2 | 0 | 1110 |
| ... | ||
| -15 | 0 | 0001 |
| -16 | 0 | 0000 |
The z bit is 1 for non-negative, 0 for negative. If the exponent value is negative, then it is subtracted from 1, then the bits are inverted (this is just the normal two's complement form). Thus for the single-byte case (A) below, the interpretation is as in Table 1.
The continuation bits are each associated with one byte of the exponent, with 1 indicating a non-final byte, and 0 for the final byte.
A. A single-byte exponent has the following form, with 1 sign bit, 1 i bit, 1 z bit, 1 continuation bit (0), and 4 bits of exponent. This is likely to cover most numbers in common usage, since 216 allows for numbers up to over 105. For clarity, in the examples below the continuation bit is underlined.
|
B. A double-byte exponent has the following form, with 1 s bit, 1 i bit, 1 z bit, 2 continuation bits, and 3 + 8 bits of exponent: numbers up to about 10616. It covers all numbers expressible with single- or double-precision IEEE numbers (float and double in C).
|
|
C. A 3-byte exponent has the following form, with 1 s bit, 1 z bit, 3 continuation bits (110), and 2 + 8 + 8 bits of effective exponent: numbers up to about 1078913. This covers all numbers expressible with the proposed IEEE binary128 (and many more).
|
|
|
D. The same process can be applied indefinitely. Each byte form of length N will have an s bit, an i bit, a z bit, plus N continuation bits, plus 7*N-3 bits of effective exponent. (While it is extremely unlikely that such long exponents would ever be used, the format does allow for them.) Note that the continuation bits may go into successive bytes. Thus the 7 byte exponent would have the following form:
|
|
... |
|
The exponent bits must use the shortest form that can contain them, and must be right-justified among all the exponent bits. Thus an exponent of 1 will be:
|
An exponent of -35 will be:
|
|
This format preserves ordering, since if two exponents have the same number of bytes, the ordering will be determined correctly from the exponent bits (since the length bits are identical). If two exponents have different numbers of bytes, then the larger number will be ordered first. For example, double-byte form in B will always be greater than the single-byte form in A.
The exponent bytes are then followed by 1 or more mantissa bytes (shown in italic). These represent a binary fraction greater than or equal to zero, and less than 1. In all cases other than zero, the first mantissa bit will be one.
|
... |
|
The low-order bit of each byte is a mantissa continuation bit. It is set to 1 if the byte is not the final byte, and zero otherwise. As opposed to the exponent, the mantissa bits are left-justified. That is, the mantissa 0.75, which is 0.11 in binary, would be expressed as the following (notice the zero continuation bit):
|
This produces the right ordering among the bytes. When you are comparing two numbers byte for byte:
Thus we get the following ordering:
[50] < [51][10] < [51][11]... < [51][20] < [60]
To encode negative numbers, every byte in the format is inverted. Thus the number 2.6875 turns into exponent = 2, mantissa = 0.671875, and results in the following.
|
|
The number -1.3515625 is just the above, with every bit inverted:
|
|
This will maintain the ordering, since all negative values are less than all positive values, and if a < b, then -a > -b.
One other feature of this format is that it has parsable length. Thus a sequence of such numbers can be serialized and recovered without any other structure being required. Such parsing needs to look at the sign bit, and invert bytes if it was on. The exponent length is computable from the exponent continuation bits; the end of the mantissa is the last byte that has a final zero bit (e.g. the last byte with an even value).
Note: there are various ways to pack the numbers more tightly, while preserving order and parsable length, but they do not appear to be worth the extra processing required.
With suitable modifications, the above method can be used to express decimal numbers (whereby decimal fractions can be precisely represented, as in BCD). The simplest method of doing that is to view the number as a base-100 value, and thus the exponent is the number of significant elements (in 100-element chunks) before the decimal point (negative for fractions). Thus 123.456 would have an exponent of 2, and a mantissa of <01><13><45><60>. Using 100 elements leaves room for the mantissa continuation bit: each byte will have 200 possible values. The exponent is coded as above. Thus
|
|
represents an exponent of 2, and a mantissa of <86>, thus 8600. For normal usages of numbers, it is extremely unlikely that more than 1 byte of exponent will be required, since that can express up to 10063, which is over a googol.
As above, DAPON numbers can be compared with one another, and the correct ordering will result. However, a DAPON number cannot be correctly compared against an APON number.
This format can also be adapted to situations where not all of the possible byte values are available for use, such as in restricted environments. The number of byte values can be used as a radix, much like for BCD. Thus supposed there were 125 possible byte values: 3, 5, ... 254. In that case half of the values would be used for the c bit, leaving 62 possible values for the mantissa. Then the mantissa is expressed as a number a * 62-1 + b * 62-2 ... The exponent is treated in a similar fashion. The inversion process would be not a bit-inversion, but a subtraction from the number of byte values (in this case, 125). Once the resulting sequence of numbers is determined, the the values are mapped to the possible byte values. Thus 0 => 3, 1 => 5, etc.
If the numbers are to maintain either binary (or decimal precision), then the radix must be chosen to be the largest power of 2 (or 10, respectively) that fits within the allowable values.
|
Parameter |
binary32 |
binary64 |
binary128 |
|
p digits |
24 |
53 |
113 |
| (pos) exponent bits | 7 | 10 | 16 |
|
emax |
+127 |
+1023 |
+16383 |
|
emin |
–126 |
–1022 |
–16382 |