Beautiful ASCII

This story starts with some specialized code, designed to set the case of fixed-length symbols very quickly:

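#include <stdint.h>

/* Clear bit 0x20 in each of the four bytes: ASCII letters become uppercase. */
uint32_t upper(uint32_t code) {
        return code & ~UINT32_C(0x20202020);
}

/* Set bit 0x20 in each of the four bytes: ASCII letters become lowercase. */
uint32_t lower(uint32_t code) {
        return code | UINT32_C(0x20202020);
}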

The code works by taking a small four-letter ASCII string, encoded as an integer, and switching its case by masking it with a magic number that sets or clears a single bit in each byte. To illustrate, passing YyZt to lower() ORs 0x20 into each of the four bytes and yields yyzt.

At first glance this might seem like some fancy bit-trickery designed to squeeze out a few extra cycles of performance, but it is in fact by design and carries with it a history dating back to the 19th century.

The telegraph standard

Designed by the French inventor Émile Baudot, the Baudot telegraph was an early multiplexing printing telegraph.

Baudot keyboard
Baudot mechanics

Each key on the five-key keyboard represented one bit in a 5-bit encoding. These 5-bit symbols, transmitted over the telegraph wire, would then be punched onto a five-hole punch tape. This encoding, initially called Baudot Code, became the first International Telegraph Alphabet or ITA-1:

┌────────┬────────┬──────┬──────┬──────┬──────┬──────┐
│ Letter │ Figure │   I  │  II  │  III │  IV  │   V  │
├────────┼────────┼──────┼──────┼──────┼──────┼──────┤
│        │        │      │      │      │      │      │
│   A    │   1    │   ●  │      │      │      │      │
│   B    │   8    │      │      │   ●  │   ●  │      │
│   C    │   9    │   ●  │      │   ●  │   ●  │      │
│   D    │   0    │   ●  │   ●  │   ●  │   ●  │      │
│   E    │   2    │      │   ●  │      │      │      │
│   É    │   &    │   ●  │   ●  │      │      │      │
│   F    │   ᶠ    │      │   ●  │   ●  │   ●  │      │
│   G    │   7    │      │   ●  │      │   ●  │      │
│   H    │   ʰ    │   ●  │   ●  │      │   ●  │      │
│   I    │   °    │      │   ●  │   ●  │      │      │
│   J    │   6    │   ●  │      │      │   ●  │      │
│   K    │   (    │   ●  │      │      │   ●  │   ●  │
│   L    │   =    │   ●  │   ●  │      │   ●  │   ●  │
│   M    │   )    │      │   ●  │      │   ●  │   ●  │
│   N    │   N°   │      │   ●  │   ●  │   ●  │   ●  │
│   O    │   5    │   ●  │   ●  │   ●  │      │      │
│   P    │   %    │   ●  │   ●  │   ●  │   ●  │   ●  │
│   Q    │   /    │   ●  │      │   ●  │   ●  │   ●  │
│   R    │   -    │      │      │   ●  │   ●  │   ●  │
│   S    │   ;    │      │      │   ●  │      │   ●  │
│   T    │   !    │   ●  │      │   ●  │      │   ●  │
│   U    │   4    │   ●  │      │   ●  │      │      │
│   V    │   '    │   ●  │   ●  │   ●  │      │   ●  │
│   W    │   ?    │      │   ●  │   ●  │      │   ●  │
│   X    │   ,    │      │   ●  │      │      │   ●  │
│   Y    │   3    │      │      │   ●  │      │      │
│   Z    │   :    │   ●  │   ●  │      │      │   ●  │
│   ᵗ    │   .    │   ●  │      │      │      │   ●  │
│  del   │        │      │      │      │   ●  │   ●  │
│ figure │ blank  │      │      │      │   ●  │      │
│ blank  │ letter │      │      │      │      │   ●  │
└────────┴────────┴──────┴──────┴──────┴──────┴──────┘

Baudot’s invention was later adapted for use with a typewriter by New Zealand’s Donald Murray, who modified code positions to reduce mechanical wear on common symbols. Murray’s code also introduced typewriter-specific control codes, such as LF (line feed) and CR (carriage return). Murray’s invention became popular and his 5-bit Murray Code, with some adaptations, became the ITA-2:

┌──────┬─────┬─────────────────────┬───────────────────┬───┬───────────────┬──────┬───────┐
│ Ltrs │\n \r│ Q W E R T Y U I O P │ A S D F G H J K L │SPC│ Z X C V B N M │ →Fig │  DEL  │
├──────┼─────┼─────────────────────┼───────────────────┼───┼───────────────┼──────┼───────┤
│ Figs │\n \r│ 1 2 3 4 5 6 7 8 9 0 │ - ' ᵉ ˢ & \   ( ) │SPC│ + / : = ? , . │      │ →Ltrs │
├──────┼─────┼─────────────────────┼───────────────────┼───┼───────────────┼──────┼───────┤
│   1  │     │ ● ● ●     ● ●       │ ● ● ● ●     ● ●   │   │ ● ●     ●     │   ●  │   ●   │
│   2  │ ●   │ ● ●   ●     ● ●   ● │ ●       ●   ● ● ● │   │     ● ●       │   ●  │   ●   │
│   3  │     │ ●         ● ● ●   ● │   ●   ●   ●   ●   │ ● │   ● ● ●   ● ● │      │   ●   │
│   4  │   ● │       ●         ●   │     ● ● ●   ● ●   │   │   ● ● ● ● ● ● │   ●  │   ●   │
│   5  │     │ ● ●     ● ●     ● ● │         ● ●     ● │   │ ● ●   ● ●   ● │   ●  │   ●   │
└──────┴─────┴─────────────────────┴───────────────────┴───┴───────────────┴──────┴───────┘
ᵉ = escape ˢ = shift out

The ITA-2 standard wasn’t without its issues though. With only 5 bits, the set of characters it could represent was limited, and so variations popped up to accommodate different needs. The 5-bit code was also stateful: specific bit patterns were reserved to switch the symbol set to either letters or figures. Codes with this trait are known as shifted codes, and while they are space efficient, a single flipped bit or lost shift symbol can render an entire message unintelligible.

Computation and collation

The arrival of computers brought a new set of requirements. New codes were developed that could both store data and perform computations efficiently. A few pioneering computer codes that heavily influenced ASCII were the 6/7-bit FIELDATA code and the 6-bit BCDIC. Both of these had their issues:

ASCII

By the late 1950s there was a push for a new global standard communication code. Under the supervision of various committees around the world, among them the American Standards Association’s X3.2 Committee, work on a new code commenced. This code would later be known as ASCII.

Design of ASCII

Many considerations from the computing and communication industry were reviewed in the design of the new code, as well as lessons learned from past codes, among them:

From these requirements the basic structure of the code was formed:

┌─────┬Column┬───0──┬───1──┬───2──┬───3──┬───4──┬───5──┬───6──┬───7──┐
│ Row │ Bits │  000 │  001 │  010 │  011 │  100 │  101 │  110 │  111 │
├─────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┤
│  0  │ 0000 │ NULL ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓│
│  1  │ 0001 │▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓│
│  2  │ 0010 │▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓│
│  3  │ 0011 │▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓│
│  4  │ 0100 │▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓│
│  5  │ 0101 │▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓│
│  6  │ 0110 │▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓│
│  7  │ 0111 │▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓│
│  8  │ 1000 │▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓│
│  9  │ 1001 │▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓│
│  10 │ 1010 │▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓│
│  11 │ 1011 │▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓│
│  12 │ 1100 │▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓│
│  13 │ 1101 │▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓│
│  14 │ 1110 │▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓│
│  15 │ 1111 │▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓  DEL │
└─────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┘
▓▓▓▓▓▓ Control  ▓▓▓▓▓▓ Graphics  ▓▓▓▓▓▓ Undefined

It was decided, after some debate, that space would be a graphics character and not a control character. For computational and collation purposes it was also decided that space should collate lower than all other graphics characters. That is, since sorting compares code values, space, represented by b0100000 (0x20 or decimal 32), has a lower value than any other graphics character. It was also determined that alphabetics couldn’t start in column 2 (b010), as there were too many special characters that needed to collate lower than the alphabetics.

Having space as the first character of column 2 also ruled out column 2 for storing numerics. This was to ensure that a numeric’s lower four bits (its row) mirrored the numeric value; for example, the lower four bits of the numerical character '2' would have the value 2 (b0010). Numbers were therefore to be placed at the top of column 3 or 5, leaving space for alphabetics.

┌─────┬Column┬───0──┬───1──┬───2──┬───3──┬───4──┬───5──┬───6──┬───7──┐
│ Row │ Bits │  000 │  001 │  010 │  011 │  100 │  101 │  110 │  111 │
├─────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┤
│  0  │ 0000 │ NULL ▓▓▓▓▓▓  SPC    0  ▓▓▓▓▓▓   0  ▓▓▓▓▓▓▓▓▓▓▓▓│
│  1  │ 0001 │▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓   1  ▓▓▓▓▓▓   1  ▓▓▓▓▓▓▓▓▓▓▓▓│
│  2  │ 0010 │▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓   2  ▓▓▓▓▓▓   2  ▓▓▓▓▓▓▓▓▓▓▓▓│
│  3  │ 0011 │▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓   3  ▓▓▓▓▓▓   3  ▓▓▓▓▓▓▓▓▓▓▓▓│
│  4  │ 0100 │▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓   4  ▓▓▓▓▓▓   4  ▓▓▓▓▓▓▓▓▓▓▓▓│
│  5  │ 0101 │▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓   5  ▓▓▓▓▓▓   5  ▓▓▓▓▓▓▓▓▓▓▓▓│
│  6  │ 0110 │▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓   6  ▓▓▓▓▓▓   6  ▓▓▓▓▓▓▓▓▓▓▓▓│
│  7  │ 0111 │▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓   7  ▓▓▓▓▓▓   7  ▓▓▓▓▓▓▓▓▓▓▓▓│
│  8  │ 1000 │▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓   8  ▓▓▓▓▓▓   8  ▓▓▓▓▓▓▓▓▓▓▓▓│
│  9  │ 1001 │▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓   9  ▓▓▓▓▓▓   9  ▓▓▓▓▓▓▓▓▓▓▓▓│
│  10 │ 1010 │▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓│
│  11 │ 1011 │▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓│
│  12 │ 1100 │▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓│
│  13 │ 1101 │▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓│
│  14 │ 1110 │▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓│
│  15 │ 1111 │▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓  DEL │
└─────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┘
▓▓▓▓▓▓ Control  ▓▓▓▓▓▓ Graphics  ▓▓▓▓▓▓ Undefined

Whether or not lowercase and uppercase alphabetics should be interleaved, that is, storing AaBbCc... in a column, was debated. The arguments for interleaving carried merit: should a name such as MacKenzie sort differently than Mackenzie with a lowercase K? Evidence from things such as phone books showed that there was no established consistency. One advantage of keeping them separate would be that the most commonly used symbols (uppercase, numerics and specials) could still fit into a single 6-bit (64 symbols) subset code easily extracted from the 7-bit code.

Although it was common for codes at the time to sort alphabetics lower than numerics, it had not yet been decided whether the code was to include the lowercase alphabet, or whether the two undefined columns should be used for graphic codes instead of control codes. In the end, contrary to common practice, they settled on putting numerics in column 3 (b011) as it left more choices open. To support their decision the committee argued that the existing collating order could still be achieved:

If it is necessary to achieve the de facto collating sequence (specials, alphabetics, numerics), it may be achieved, during comparison operations, by inverting b7 if b6 = b5 = 1. That is, the three high-order bits of the column of numerics would then become 111, which would make them collate high to the alphabetics, with high-order bits of 100 and 101.

This decision also led to an additional criterion for ASCII:

There should be a single bit difference between capital and small letters.

The ASCII standard ASA X3.4-1963 didn’t include the lowercase alphabet, but in October of 1963 it was decided that the lowercase alphabet should be included in columns 6 and 7, filling in the undefined region and enabling the single-bit case switching.

┌─────┬Column┬───0──┬───1──┬───2──┬───3──┬───4──┬───5──┬───6──┬───7──┐
│ Row │ Bits │  000 │  001 │  010 │  011 │  100 │  101 │  110 │  111 │
├─────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┤
│  0  │ 0000 │ NULL   DLE   SPC    0     @     P     `     p  │
│  1  │ 0001 │  SOH   DC1    !     1     A     Q     a     q  │
│  2  │ 0010 │  STX   DC2    "     2     B     R     b     r  │
│  3  │ 0011 │  ETX   DC3    #     3     C     S     c     s  │
│  4  │ 0100 │  EOT   DC4    $     4     D     T     d     t  │
│  5  │ 0101 │  ENQ   NAK    %     5     E     U     e     u  │
│  6  │ 0110 │  ACK   SYN    &     6     F     V     f     v  │
│  7  │ 0111 │  BEL   ETB    '     7     G     W     g     w  │
│  8  │ 1000 │  BS    CAN    (     8     H     X     h     x  │
│  9  │ 1001 │  HT    EM     )     9     I     Y     i     y  │
│  10 │ 1010 │  NL    SUB    *     :     J     Z     j     z  │
│  11 │ 1011 │  VT    ESC    +     ;     K     [     k     {  │
│  12 │ 1100 │  FF    FS     ,     <     L     \     l     |  │
│  13 │ 1101 │  CR    GS     -     =     M     ]     m     }  │
│  14 │ 1110 │  SO    RS     .     >     N     ^     n     ~  │
│  15 │ 1111 │  SI    US     /     ?     O     _     o    DEL │
└─────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┘
▓▓▓▓▓▓ Control  ▓▓▓▓▓▓ Graphics  

Sources