Collins Software Adam modified UTF-8 format
 

A modified UTF-8 structure. I have revised my approach to character sets and designed an alternative solution to the font and character problems in software development. I have decided not to use UTF-8 itself, but rather this format, which stores all numbers in a single format instead of one for characters and another for numbers. They are the same. A number in this format has no size limit, so if you want to represent pi to the trillionth decimal place, fine.

With a few minor additions to the UTF-8 specification, like mapping characters to fonts, and adding character properties, such as a sort sequence and local formatting styles, UTF-8 could then handle all character and font functionality needed for the next thousand years.

I have also added to UTF-8 a compression system for integer, floating-point, and date binary values that has no limitations. This system permits an infinite range for all numbering systems, along with infinite date/time and duration values. Numbers can have any number of significant digits, and dates can now represent any level of precision (picoseconds or smaller) for an infinite number of years. I will also introduce a new representation for +/- INFINITY.

The largest benefit is that all limitations are removed for numbers, floating-point values, dates, and character set size, all without affecting space or speed. I will be looking into hardware micro-code modifications to use this format instead of the existing byte-oriented methods that current computers use for math and date calculations. I will not name this format. I will simply call it a "number" or "Integer" or "Positive and Negative Natural Numbers". Characters are just numbers, numbers are just numbers, floating-point values are just two numbers, dates are just a set of numbers, and so on.

modified UTF-8 scheme

Bits  Minimum      Maximum         Bytes  Byte 1    Byte 2    Byte 3    Byte 4    Byte 5    Byte 6    Byte 7
7     0            127             1      0xxxxxxx
11    128          2,175           2      110xxxxx  10xxxxxx
16    2,176        67,711          3      1110xxxx  10xxxxxx  10xxxxxx
21    67,712       2,164,863       4      11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
26    2,164,864    69,273,727      5      111110xx  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx
31    69,273,728   136,382,591     6      1111110x  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx
36    136,382,592  68,855,859,328  7      11111110  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx
      O+N+cmd                      3-67   11111111  10zzzzzz  10nnnnnn  10cccccc  10cccccc  ...
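
As a rough illustration of how the one- to five-byte rows of this table could be applied, the Python sketch below encodes a value by finding its row, subtracting the row minimum, and spreading the remainder over the lead byte and the 10xxxxxx continuation bytes. Subtracting the row minimum is an assumption inferred from the non-overlapping ranges (the two-byte row starts at 128, not 0); the names `ROWS`, `row_ranges` and `encode` are hypothetical and not part of Adam.

```python
# (payload bits, lead-byte prefix, prefix length in bits) for the 1- to 5-byte rows
ROWS = [
    (7, 0b0, 1),
    (11, 0b110, 3),
    (16, 0b1110, 4),
    (21, 0b11110, 5),
    (26, 0b111110, 6),
]


def row_ranges():
    """Return (minimum, maximum) per row; each row starts where the last one ended."""
    ranges = []
    minimum = 0
    for bits, _prefix, _prefix_len in ROWS:
        maximum = minimum + (1 << bits) - 1
        ranges.append((minimum, maximum))
        minimum = maximum + 1
    return ranges


def encode(value):
    """Encode a non-negative integer using the 1- to 5-byte rows only."""
    for (bits, prefix, prefix_len), (lo, hi) in zip(ROWS, row_ranges()):
        if lo <= value <= hi:
            payload = value - lo             # assumption: store the offset into the row
            continuation = (bits - (8 - prefix_len)) // 6
            lead = (prefix << (8 - prefix_len)) | (payload >> (6 * continuation))
            out = [lead]
            for i in range(continuation - 1, -1, -1):
                out.append(0x80 | ((payload >> (6 * i)) & 0x3F))
            return bytes(out)
    raise ValueError("value is outside the rows sketched here")


print(row_ranges())
print(encode(128).hex(), encode(2176).hex())
```

The computed ranges agree with the Minimum and Maximum columns for the first five rows; the longer rows and the 11111111 command row are left out of the sketch.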

Compression scheme for numbers, instead of 8 byte binary numbers
Bits  Minimum  Maximum  Bytes  Byte 1    Byte 2    Byte 3
7     0        127      1      0xxxxxxx
6     128      172      1      11xxxxxx
12    172      4,267    2      11xxxxxx  10xxxxxx
18    4,268    266,412  3      11xxxxxx  10xxxxxx  10xxxxxx

8 bit encoding
Bytes  Bits  Minimum                     Maximum                        Size
1      7     0                           127
1      6     128                         172
2      12    172                         4,268                          4K
3      18    4,224                       266,367                        266K
4      24    266,368                     17,043,583                     17 MB
5      30    17,043,584                  1,090,785,407                  1 GB
6      36    1,090,785,408               69,810,262,143                 69 GB
7      42    69,810,262,144              4,467,856,773,247              4 TB
8      48    4,467,856,773,248           285,942,833,483,903            285 TB
9      54    285,942,833,483,904         18,300,341,342,965,887         18 PB
10     60    18,300,341,342,965,888      1,171,221,845,949,812,863      1 EB
11     66    1,171,221,845,949,812,864   74,958,198,140,788,019,327     74 EB
12     72    74,958,198,140,788,019,328  4,797,324,681,010,433,233,023  4 ZB
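
The rows for three or more bytes follow a simple rule: each length carries 6 bits per byte, and each range begins where the previous one ends. The Python sketch below recomputes them under the assumption that the two-byte range begins at 128, which is the starting point that the three-byte minimum of 4,224 implies (4,224 = 128 + 4,096). The function name `eight_bit_ranges` is hypothetical.

```python
def eight_bit_ranges(max_bytes=12):
    """Return {byte count: (minimum, maximum)} for the 6-bits-per-byte forms.

    Assumption: one byte covers 0..127 and the two-byte range begins at 128.
    """
    ranges = {1: (0, 127)}
    minimum = 128
    for n in range(2, max_bytes + 1):
        span = 1 << (6 * n)          # n bytes carry 6 * n payload bits
        ranges[n] = (minimum, minimum + span - 1)
        minimum += span              # the next length starts right after
    return ranges


for n, (lo, hi) in eight_bit_ranges().items():
    if n >= 3:
        print(f"{n:>2} bytes  {lo:,} .. {hi:,}")
```

Because Python integers are unbounded, the same loop can be run for any byte count without overflow.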

 

8/16 bit encoding (a leading 0 bit uses 1 byte, otherwise 2 bytes): Adam's preferred encoding
Synchronization is possible by looking at the high-order bit of the preceding bytes:

scan back until the high-order bit = 0 (odd count = single byte, even count = second byte of word).
This might require adding a null character (0) to the character stream to help with synchronization, but only in very rare cases. Cg2 handles this already, so it is not really an issue.

1 byte = 0xxx xxxx
2 byte = 11xx xxxx xxxx xxxx   followed by 0 or more 10xx xxxx xxxx xxxx

Bytes  Bits  Minimum                        Maximum                             Chars  Size
1      7     0                              127                                 3
2      14    128                            16,511                              5      16K
4      28    16,512                         268,451,967                         9      268 MB
6      42    268,451,968                    4,398,314,963,071                   13     4 TB
8      56    4,398,314,963,072              72,061,992,352,891,007              17     72 PB
10     70    72,061,992,352,891,008         1,180,663,682,709,764,194,431       22     1 ZB
12     84    1,180,663,682,709,764,194,432  19,343,993,777,516,776,559,493,247  26     19 YB
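
A minimal Python sketch of the 8/16 bit encoding follows, to make the byte layout concrete. It assumes big-endian 16-bit words, a 11 tag on the first word and a 10 tag on each following word, and that the stored payload is the value minus the minimum of its range (as the Minimum column suggests). The names `encode16` and `decode16` are hypothetical.

```python
def encode16(value):
    """Encode a non-negative integer as 1 byte or as one or more 16-bit words."""
    if value < 0:
        raise ValueError("only non-negative values are sketched here")
    if value < 128:
        return bytes([value])                  # 0xxx xxxx
    offset, words = 128, 1
    while value >= offset + (1 << (14 * words)):
        offset += 1 << (14 * words)            # skip the range of shorter forms
        words += 1
    payload = value - offset
    out = bytearray()
    for i in range(words):
        chunk = (payload >> (14 * (words - 1 - i))) & 0x3FFF
        tag = 0b11 if i == 0 else 0b10         # first word vs. following words
        out += ((tag << 14) | chunk).to_bytes(2, "big")
    return bytes(out)


def decode16(data, pos=0):
    """Decode one value starting at pos; return (value, next position)."""
    first = data[pos]
    if first < 0x80:
        return first, pos + 1
    if first >> 6 != 0b11:
        raise ValueError("position is not at the start of a value")
    payload, words = 0, 0
    while True:
        word = int.from_bytes(data[pos:pos + 2], "big")
        payload = (payload << 14) | (word & 0x3FFF)
        words += 1
        pos += 2
        if pos >= len(data) or data[pos] >> 6 != 0b10:
            break
    offset = 128 + sum(1 << (14 * k) for k in range(1, words))
    return payload + offset, pos


stream = encode16(2015) + encode16(7) + encode16(10**12)
pos, values = 0, []
while pos < len(stream):
    value, pos = decode16(stream, pos)
    values.append(value)
print(values)
```

Values are written back to back with no separators; the tag bits alone tell the decoder where each value ends.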

Unicode assumes that one character-set can handle all languages and all special purpose symbols, plus somehow in the future it might, by some magic, even handle language and nationality differences.  It's an indexing scheme, nothing more, not needed at all if we change the basic premise of character and font relationships and extend the use of UTF-8 and UTF-16.

UTF-8 is a compression scheme for values from 0 to 68,855,859,328. For any single document this seems large enough. Most languages use 127 to 255 characters; Chinese has at most 80,000 characters. I moved special symbols to their own entry in the UTF-8 scheme, allowing for an infinite number of special symbols.

To avoid putting limits on the character set and to get even greater compression, character mapping was added. All characters can be remapped to alternate ranges and alternate fonts.

Special symbols are nice to have but are used in very few places. Time and space should not be wasted due to the handling of special symbols. 

The main advantage of this scheme is that it fixes all the current problems with handling characters and fonts. These changes will be in Adam; they do not change the current Unicode or UTF-8 standards.
Command             Op-code  Example
map                 1        197:0
font                2        arial:0:255
sort order          3        97-122:65
thousand separator  4        .
decimal point       5        ,
digits              6        48:57

 

Date Compression
*define:DATE,Value:(Year+Day+Hour+Minute+Seconds+Millisecond+Nanosecond+Picosecond+Femtosecond+Attosecond+...);
Year                 Day                  Hour       Minute     Second     Milliseconds
2015                 366                  24         60         60         1,000
1101.1111 1001.1111  1100.0101 1010.1110  0001.1000  0011.1100  0011.1100  1100.0011.1110.1000

  2 bytes for year
  4 bytes for year, day
  6 bytes for year, day, hour, minute
  8 bytes for year, day, hour, minute, second, millisecond
14 bytes for picosecond precision, ...
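
The year and day bytes in the example above read as 6-bit groups with no range offset: 2015 = 31 x 64 + 31 gives 1101.1111 1001.1111, and 366 = 5 x 64 + 46 gives 1100.0101 1010.1110. The Python sketch below packs date fields that way. It is only an illustration under that reading, it treats every field as unsigned, and the names `encode_field` and `encode_date` are hypothetical.

```python
def encode_field(value):
    """Encode one unsigned date field as 0xxxxxxx, or 11xxxxxx plus 10xxxxxx bytes."""
    if value < 0:
        raise ValueError("only non-negative fields are sketched here")
    if value < 128:
        return bytes([value])
    groups = []
    while value:
        groups.append(value & 0x3F)   # split into 6-bit groups, low group first
        value >>= 6
    groups.reverse()
    return bytes([0xC0 | groups[0]] + [0x80 | g for g in groups[1:]])


def encode_date(*fields):
    """Concatenate the encoded fields in order: year, day, hour, minute, ..."""
    return b"".join(encode_field(f) for f in fields)


packed = encode_date(2015, 366, 24, 60)
print(packed.hex(" "), len(packed), "bytes")
```

Dropping trailing fields shortens the value, which is how a bare year fits in 2 bytes and year, day, hour, minute fit in 6.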


Infinite Date +/- Year (any size year)


Floating Point Compression
(any size +/- exponent, any size +/- mantissa)
*define:FLOAT,Value:(Exponent+Mantissa);

Exponent Mantissa
-34 13,457,231,234,567,778,000,122,210,110 (29 decimal places)
0110.0010 1100.0000 1000.0001 1000.1010 1010.1100 1011.1110 1000.0000 1011.0100  1000.0011 1010.1010 1011.1001 1011.1001  1000.1000
(4.4 bit groupings)

13 bytes compressed, 30 bytes as string, (80 bit binary number, 128 bits - 16 byte number)
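
To illustrate the idea that a floating-point value is just two unbounded integers, the Python sketch below turns an (exponent, mantissa) pair into an exact decimal value and encodes a small signed exponent in one byte. Two things are assumptions here: that the exponent is a power of ten, and that a single-byte signed number is laid out as 0sxxxxxx with s as the sign bit, which is how the exponent byte 0110.0010 for -34 reads. The function names are hypothetical.

```python
from decimal import Decimal


def float_pair_to_decimal(exponent, mantissa):
    """Return mantissa * 10**exponent exactly (assumes a base-10 exponent)."""
    sign = 1 if mantissa < 0 else 0
    digits = tuple(int(d) for d in str(abs(mantissa)))
    # The tuple form of the Decimal constructor is exact at any digit count.
    return Decimal((sign, digits, exponent))


def encode_small_signed(value):
    """Encode -63..63 in one byte as 0sxxxxxx (assumed layout)."""
    if not -63 <= value <= 63:
        raise ValueError("a single byte holds only 6 magnitude bits")
    return (0x40 if value < 0 else 0x00) | abs(value)


print(float_pair_to_decimal(-2, 314))
print(format(encode_small_signed(-34), "08b"))
```

Neither integer has a fixed width, so precision is limited only by how many bytes the mantissa is allowed to occupy.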

This compression is greater for arrays as well as structures, since no character separators are needed between number values.

*define:Line,Points(*):(x+y);
line,0000.0000 0000.0000 0000.0001 0000.0000 0000.0001 0000.0001 0000.0000 0000.0001 0000.0000 0000.0000;
line,0:0,0:1,1:1,0:1,1:1,0:0;

A five-point polygon from 0,0 to 1,1 takes 10 bytes compressed and 22 bytes as text, using a -0 "0100.0000" as a line terminator.
Conversion from the compressed form is faster than conversion from characters, since no math operations are needed.
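
The space saving for arrays can be seen in a small Python sketch that packs the polygon's coordinates one byte each, with no separators, and compares the result with a separator-based text form. The point list is read from the binary line above; the helper name `pack_points` and the exact text layout are hypothetical.

```python
def pack_points(points):
    """Pack (x, y) pairs as consecutive single bytes; coordinates must be 0..63."""
    out = bytearray()
    for x, y in points:
        for coord in (x, y):
            if not 0 <= coord <= 63:
                raise ValueError("this sketch handles single-byte coordinates only")
            out.append(coord)
    return bytes(out)


# The closed unit square, as read from the binary line above.
square = [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]
packed = pack_points(square)
as_text = ",".join(f"{x}:{y}" for x, y in square)

print(len(packed), "bytes packed")
print(len(as_text), "bytes as text")
```

Each coordinate is copied straight into its byte, which is why unpacking needs no digit-by-digit arithmetic.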

1110.0001 + infinity
1111.0001 - infinity
1110.0010 + approaching zero from the right
1111.0010 - approaching zero from the left
 

Data Format: Common Ground 3000 (Cg3)

<cg3-record length>  
    <field length> ...*cg3 data...

<section length>
    <record length>
        <field length> ... *section data...

<records length>
      <record length>
         <field length> ...*field data...
              <sub-field length> ...*subfield data... 

<section length> ....

Commands:

1110.0000 cell terminator
1110.0001 row terminator
1110.0010 repeat this cell <count> 
1110.0011 repeat this row <count> 

1110.0100 no change cell value
1110.0101 no change row value

1110.0101 increment cell value
1111.0101 decrement cell value

1111.0110 transform cell <output id>
1111.0111 transform row <output id>

1110.1000 locale <id>
1110.1001 units <id>
1110.1010 format <id>
1110.1011 program <id>
1110.1100 ruler <...>

1110.1101 include <...>
1110.1111 picture <...>

    Bits structure (compressed)

10xx.xxxx 0xxx.xxxx 0xxx.xxxx ...      repeat 0 for 8 to 2gig bits
11xx.xxxx 0xxx.xxxx 0xxx.xxxx          repeat 1 for 8 to 2gig bits

0xxx.xxxx                              actual bit settings (7 bits per byte)