A Case of Beautiful ASCII

This story starts with some specialized code, designed to very quickly set the case of a set of fixed-length symbols:

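#include <stdint.h>

/* Clear the 0x20 bit of each byte. Only meaningful if all four bytes are ASCII letters. */
uint32_t upper(uint32_t code) {
        return code & ~UINT32_C(0x20202020);
}

/* Set the 0x20 bit of each byte. Only meaningful if all four bytes are ASCII letters. */
uint32_t lower(uint32_t code) {
        return code | UINT32_C(0x20202020);
}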

The code works by taking a short four-letter ASCII string, encoded as an integer, and switching its case by masking it with a magic number: OR-ing sets the 0x20 bit in every byte, and AND-ing with the complement clears it. To illustrate with an example, let's pass YyZt to lower():

The ASCII sequence YyZt in byte form is 0x59 0x79 0x5a 0x74.
Encoded as a 32-bit integer, on a big-endian system, that's 0x59795a74.
0x59795a74 | 0x20202020 = 0x79797a74.
Decoding 0x79797a74 to bytes yields 0x79 0x79 0x7a 0x74 or yyzt in ASCII.
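
The steps above can be reproduced with a short sketch. The helper names pack4 and unpack4 are hypothetical and not part of the original code; they build the integer with explicit shifts, so the first character always lands in the most significant byte, whatever the host's byte order.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Pack exactly four ASCII bytes into an integer, first byte most significant. */
uint32_t pack4(const char *s) {
        assert(strlen(s) == 4);
        return ((uint32_t)(unsigned char)s[0] << 24) |
               ((uint32_t)(unsigned char)s[1] << 16) |
               ((uint32_t)(unsigned char)s[2] << 8)  |
                (uint32_t)(unsigned char)s[3];
}

/* Unpack into a NUL-terminated buffer of at least five bytes. */
void unpack4(uint32_t code, char out[5]) {
        out[0] = (char)(code >> 24);
        out[1] = (char)((code >> 16) & 0xFF);
        out[2] = (char)((code >> 8) & 0xFF);
        out[3] = (char)(code & 0xFF);
        out[4] = '\0';
}

/* Same mask as in the introduction: set the 0x20 bit of every byte. */
uint32_t lower(uint32_t code) {
        return code | UINT32_C(0x20202020);
}
```

With these helpers, lower(pack4("YyZt")) evaluates to 0x79797a74, which unpack4 turns back into yyzt.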

At a quick glance this might seem like some fancy bit-trickery designed to squeeze out a few extra cycles of performance, but the single-bit difference between cases is in fact by design and carries with it a history dating back to the 19th century.

The telegraph standard

Émile Baudot's printing telegraph, patented in 1874, was operated from a five-key keyboard. Each key represented one bit in a 5-bit encoding. These 5-bit symbols, transmitted over the telegraph wire, would then be punched onto a five-hole punch tape. This encoding, initially called Baudot Code, became the first International Telegraph Alphabet or ITA-1:

┌────────┬────────┬──────┬──────┬──────┬──────┬──────┐
│ Letter │ Figure │   I  │  II  │  III │  IV  │   V  │
├────────┼────────┼──────┼──────┼──────┼──────┼──────┤
│        │        │      │      │      │      │      │
│   A    │   1    │   ●  │      │      │      │      │
│   B    │   8    │      │      │   ●  │   ●  │      │
│   C    │   9    │   ●  │      │   ●  │   ●  │      │
│   D    │   0    │   ●  │   ●  │   ●  │   ●  │      │
│   E    │   2    │      │   ●  │      │      │      │
│   É    │   &    │   ●  │   ●  │      │      │      │
│   F    │   ᶠ    │      │   ●  │   ●  │   ●  │      │
│   G    │   7    │      │   ●  │      │   ●  │      │
│   H    │   ʰ    │   ●  │   ●  │      │   ●  │      │
│   I    │   °    │      │   ●  │   ●  │      │      │
│   J    │   6    │   ●  │      │      │   ●  │      │
│   K    │   (    │   ●  │      │      │   ●  │   ●  │
│   L    │   =    │   ●  │   ●  │      │   ●  │   ●  │
│   M    │   )    │      │   ●  │      │   ●  │   ●  │
│   N    │   N°   │      │   ●  │   ●  │   ●  │   ●  │
│   O    │   5    │   ●  │   ●  │   ●  │      │      │
│   P    │   %    │   ●  │   ●  │   ●  │   ●  │   ●  │
│   Q    │   /    │   ●  │      │   ●  │   ●  │   ●  │
│   R    │   -    │      │      │   ●  │   ●  │   ●  │
│   S    │   ;    │      │      │   ●  │      │   ●  │
│   T    │   !    │   ●  │      │   ●  │      │   ●  │
│   U    │   4    │   ●  │      │   ●  │      │      │
│   V    │   '    │   ●  │   ●  │   ●  │      │   ●  │
│   W    │   ?    │      │   ●  │   ●  │      │   ●  │
│   X    │   ,    │      │   ●  │      │      │   ●  │
│   Y    │   3    │      │      │   ●  │      │      │
│   Z    │   :    │   ●  │   ●  │      │      │   ●  │
│   ᵗ    │   .    │   ●  │      │      │      │   ●  │
│  del   │        │      │      │      │   ●  │   ●  │
│ figure │ blank  │      │      │      │   ●  │      │
│ blank  │ letter │      │      │      │      │   ●  │
└────────┴────────┴──────┴──────┴──────┴──────┴──────┘

Baudot's invention was later adapted for use with a typewriter by New Zealand's Donald Murray, who modified code positions to reduce mechanical wear on common symbols. Murray's code also introduced typewriter-specific control codes, such as LF (line feed) and CR (carriage return). Murray's invention became popular and his 5-bit Murray Code, with some adaptations, became the ITA-2:

┌──────┬─────┬─────────────────────┬───────────────────┬───┬───────────────┬──────┬───────┐
│ Ltrs │\n \r│ Q W E R T Y U I O P │ A S D F G H J K L │SPC│ Z X C V B N M │ →Fig │  DEL  │
├──────┼─────┼─────────────────────┼───────────────────┼───┼───────────────┼──────┼───────┤
│ Figs │\n \r│ 1 2 3 4 5 6 7 8 9 0 │ - ' ᵉ ˢ & \   ( ) │SPC│ + / : = ? , . │      │ →Ltrs │
├──────┼─────┼─────────────────────┼───────────────────┼───┼───────────────┼──────┼───────┤
│   1  │     │ ● ● ●     ● ●       │ ● ● ● ●     ● ●   │   │ ● ●     ●     │   ●  │   ●   │
│   2  │ ●   │ ● ●   ●     ● ●   ● │ ●       ●   ● ● ● │   │     ● ●       │   ●  │   ●   │
│   3  │     │ ●         ● ● ●   ● │   ●   ●   ●   ●   │ ● │   ● ● ●   ● ● │      │   ●   │
│   4  │   ● │       ●         ●   │     ● ● ●   ● ●   │   │   ● ● ● ● ● ● │   ●  │   ●   │
│   5  │     │ ● ●     ● ●     ● ● │         ● ●     ● │   │ ● ●   ● ●   ● │   ●  │   ●   │
└──────┴─────┴─────────────────────┴───────────────────┴───┴───────────────┴──────┴───────┘
ᵉ = escape ˢ = shift out

The ITA-2 standard wasn't without its issues, though. At only 5 bits, the set of characters it could represent was limited, and so variations popped up to accommodate different needs. The 5-bit code was also stateful: specific bit patterns were reserved to switch the symbol set to either letters or figures. Codes with this trait are known as shifted codes, and while they are space efficient, a single flipped bit or lost shift code can render the rest of a message unintelligible.
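
The fragility of a shifted code can be illustrated with a toy decoder. This is only a sketch: the numeric code values, including the two shift codes TOY_FIGS and TOY_LTRS, are made up for the example and are not the real ITA-2 bit patterns. Only the letter/figure pairing (Q/1, W/2, E/3, ...) is taken from the table above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical shift codes for this toy example (not the ITA-2 values). */
enum { TOY_FIGS = 6, TOY_LTRS = 7 };

/* Decode n toy codes into out, which needs room for n + 1 chars.
 * Codes 0..5 select a symbol from the current set; decoding starts in letter mode. */
void toy_decode(const unsigned char *codes, size_t n, char *out) {
        static const char letters[] = "QWERTY";
        static const char figures[] = "123456";
        const char *table = letters;   /* decoder state: the active symbol set */
        size_t j = 0;
        for (size_t i = 0; i < n; i++) {
                assert(codes[i] <= TOY_LTRS);
                if (codes[i] == TOY_FIGS)
                        table = figures;
                else if (codes[i] == TOY_LTRS)
                        table = letters;
                else
                        out[j++] = table[codes[i]];
        }
        out[j] = '\0';
}
```

Decoding the sequence 0 1 FIGS 0 1 LTRS 2 gives QW12E. Drop the LTRS code and the same stream decodes as QW123: every symbol after the lost shift is read from the wrong set.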

Computation and collation

The arrival of computers brought a new set of requirements. New codes were developed that let machines both store data and compute on it efficiently. A few pioneering computer codes that heavily influenced ASCII were the 6/7-bit FIELDATA code and the 6-bit BCDIC. Both of these had their issues:

FIELDATA, developed by the US Army, left many code points unassigned, leaving it to implementers to assign them. This decision resulted in at least three different communications systems being incompatible with each other.
BCDIC was only 6 bits, which was insufficient to represent the necessary symbol space, while its successor, the 8-bit Extended BCDIC (EBCDIC), was deemed too large, for three reasons:
The ASCII committee determined that 128 characters (7 bits) were sufficient to satisfy the majority of users.
8-bit registers were more expensive than 7-bit registers.
Much transmission equipment was unreliable and often used a parity bit; with 7 bits there was still room for a parity bit in a single frame on the standard one-inch 8-hole punch tape.
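
How a 7-bit code leaves room for a parity bit in an 8-bit frame can be sketched as follows. The text does not say whether even or odd parity was used, or where in the frame the bit sat; even parity in the eighth bit is an assumption made for this example, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Return the 7-bit code with an even-parity bit added as the eighth bit. */
uint8_t add_even_parity(uint8_t code7) {
        assert(code7 <= 0x7F);
        uint8_t ones = 0;
        for (uint8_t bits = code7; bits != 0; bits >>= 1)
                ones += bits & 1;
        return (uint8_t)(code7 | ((ones & 1) << 7));
}

/* A received frame passes the check if it contains an even number of one bits. */
int parity_ok(uint8_t frame) {
        int ones = 0;
        for (; frame != 0; frame >>= 1)
                ones += frame & 1;
        return (ones & 1) == 0;
}
```

A single flipped bit changes the number of ones from even to odd, so parity_ok rejects the frame. Two flipped bits cancel out and go unnoticed, which is the usual limitation of a single parity bit.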

ASCII

By the late 1950s there was a push for a new global standard communication code. Under the supervision of various committees around the world, among them the American Standards Association's X3.2 Committee, work on a new code commenced. This code would later be known as ASCII.

Design of ASCII

Many considerations from the computing and communication industry were reviewed in the design of the new code, as well as lessons learned from past codes, among them:

Keep the all-zeros NULL code and the all-ones DEL code, as in ITA-2.
Surveys had shown a minimum requirement of 10 numerics, 26 alphabetics and 27 specials, so at minimum 63 graphic symbols.
There might be a need for lowercase and/or uppercase alphabetics.
Various format effectors (space, carriage return, tab, ...)
Special characters such as ESC, Shift-in, Shift-out to allow for future expansion of the code without increasing the size of the code.
Control characters and graphics characters should not be intermingled, and each should be grouped contiguously.

From these requirements the basic structure of the code was formed:

┌─────┬Column┬───0──┬───1──┬───2──┬───3──┬───4──┬───5──┬───6──┬───7──┐
│ Row │ Bits │  000 │  001 │  010 │  011 │  100 │  101 │  110 │  111 │
├─────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┤
│  0  │ 0000 │ NULL │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│
│  1  │ 0001 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│
│  2  │ 0010 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│
│  3  │ 0011 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│
│  4  │ 0100 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│
│  5  │ 0101 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│
│  6  │ 0110 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│
│  7  │ 0111 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│
│  8  │ 1000 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│
│  9  │ 1001 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│
│  10 │ 1010 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│
│  11 │ 1011 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│
│  12 │ 1100 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│
│  13 │ 1101 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│
│  14 │ 1110 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│
│  15 │ 1111 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│  DEL │
└─────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┘
▓▓▓▓▓▓ Control  ▓▓▓▓▓▓ Graphics  ▓▓▓▓▓▓ Undefined

It was decided, after some debate, that space would be a graphics character and not a control character, and that for computational and collation purposes it should collate low to all other graphics characters. That is, sorting graphics characters compares their code values, so space, represented by b0100000 (0x20 or decimal 32), has a lower value than any other graphics character. It was also determined that alphabetics couldn't start in column 2 (b010), as there were too many special characters that needed to collate lower than the alphabetics.

Having space as the first character of column 2 also ruled out column 2 for storing numerics. A numeric's lower four bits (row) were to mirror its numeric value, so that the lower four bits of the character '2', for example, would have the value 2 (b0010), and row 0 of column 2 was already taken by space. Numerics were therefore to be placed at the top of column 3 or 5, leaving space for alphabetics.

┌─────┬Column┬───0──┬───1──┬───2──┬───3──┬───4──┬───5──┬───6──┬───7──┐
│ Row │ Bits │  000 │  001 │  010 │  011 │  100 │  101 │  110 │  111 │
├─────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┤
│  0  │ 0000 │ NULL │▓▓▓▓▓▓│  SPC │   0  │▓▓▓▓▓▓│   0  │▓▓▓▓▓▓│▓▓▓▓▓▓│
│  1  │ 0001 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│   1  │▓▓▓▓▓▓│   1  │▓▓▓▓▓▓│▓▓▓▓▓▓│
│  2  │ 0010 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│   2  │▓▓▓▓▓▓│   2  │▓▓▓▓▓▓│▓▓▓▓▓▓│
│  3  │ 0011 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│   3  │▓▓▓▓▓▓│   3  │▓▓▓▓▓▓│▓▓▓▓▓▓│
│  4  │ 0100 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│   4  │▓▓▓▓▓▓│   4  │▓▓▓▓▓▓│▓▓▓▓▓▓│
│  5  │ 0101 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│   5  │▓▓▓▓▓▓│   5  │▓▓▓▓▓▓│▓▓▓▓▓▓│
│  6  │ 0110 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│   6  │▓▓▓▓▓▓│   6  │▓▓▓▓▓▓│▓▓▓▓▓▓│
│  7  │ 0111 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│   7  │▓▓▓▓▓▓│   7  │▓▓▓▓▓▓│▓▓▓▓▓▓│
│  8  │ 1000 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│   8  │▓▓▓▓▓▓│   8  │▓▓▓▓▓▓│▓▓▓▓▓▓│
│  9  │ 1001 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│   9  │▓▓▓▓▓▓│   9  │▓▓▓▓▓▓│▓▓▓▓▓▓│
│  10 │ 1010 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│
│  11 │ 1011 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│
│  12 │ 1100 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│
│  13 │ 1101 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│
│  14 │ 1110 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│
│  15 │ 1111 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│  DEL │
└─────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┘
▓▓▓▓▓▓ Control  ▓▓▓▓▓▓ Graphics  ▓▓▓▓▓▓ Undefined
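
The column/row structure, and the requirement that a numeric's row mirror its value, can be expressed in a few lines. This is a sketch with hypothetical function names; it assumes the layout that was eventually standardized, with numerics in column 3, and a compiler whose character constants are ASCII.

```c
#include <assert.h>

/* Column: the three high-order bits of a 7-bit code. */
unsigned column(unsigned char c) {
        assert(c <= 0x7F);
        return c >> 4;
}

/* Row: the four low-order bits of a 7-bit code. */
unsigned row(unsigned char c) {
        assert(c <= 0x7F);
        return c & 0x0F;
}

/* Value of a numeric character, read straight off its row. */
unsigned digit_value(unsigned char c) {
        assert(c >= '0' && c <= '9');
        return row(c);
}
```

Here column(' ') is 2 and row(' ') is 0, so space has the lowest code value of all graphics characters, while digit_value('2') is simply 2.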

Whether lowercase and uppercase alphabetics should be interleaved, that is, stored as AaBbCc... in a column, was debated. The arguments for interleaving carried merit: should a name such as MacKenzie sort differently from Mackenzie with a lowercase k? Evidence, from things such as phone books, showed that there was no established consistency. One advantage of keeping them separate was that the most commonly used symbols (uppercase, numerics and specials) could still fit into a single 6-bit (64-symbol) subset code, easily extracted from the 7-bit code.
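
One way such a 6-bit subset could be extracted is sketched below. The text does not spell out the extraction method; subtracting 0x20, so that columns 2 through 5 of the final code (0x20 to 0x5F) map onto 0 to 63, is an assumption made for illustration, and the function names are hypothetical.

```c
#include <assert.h>

/* Map a code from columns 2-5 (0x20..0x5F) to a 6-bit value 0..63. */
unsigned to_sixbit(unsigned char c) {
        assert(c >= 0x20 && c <= 0x5F);
        return c - 0x20u;
}

/* Inverse mapping, from a 6-bit value back to the 7-bit code. */
unsigned char from_sixbit(unsigned six) {
        assert(six <= 0x3F);
        return (unsigned char)(six + 0x20u);
}
```

Under this mapping space becomes 0, A becomes 33 and the last symbol of column 5 becomes 63, so the collating order of the 7-bit code is preserved in the subset.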

Although it was common for codes at the time to sort alphabetics lower than numerics, it was not yet decided whether the code was to include the lowercase alphabet, or whether the two undefined columns should be used for graphic codes or for control codes. In the end, contrary to common practice, the committee settled on putting numerics in column 3 (b011), as it left more choices open. To support the decision, the committee argued that the existing collating order could still be achieved:

If it is necessary to achieve the de facto collating sequence (specials, alphabetics, numerics), it may be achieved, during comparison operations, by inverting b7 if b6 = b5 = 1. That is, the three high-order bits of the column of numerics would then become 111, which would make them collate high to the alphabetics, with high-order bits of 100 and 101.
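
The rule in the quote can be written as a comparison key. This is a sketch with hypothetical function names; it takes b7 as the most significant of the seven bits (0x40), with b6 = 0x20 and b5 = 0x10, which matches the column bits shown in the tables.

```c
#include <assert.h>

/* Comparison key: invert b7 (0x40) when b6 (0x20) and b5 (0x10) are both 1. */
unsigned char collate_key(unsigned char c) {
        assert(c <= 0x7F);
        return ((c & 0x30) == 0x30) ? (unsigned char)(c ^ 0x40) : c;
}

/* Negative, zero or positive, in the manner of strcmp, for two single codes. */
int collate_cmp(unsigned char a, unsigned char b) {
        return (int)collate_key(a) - (int)collate_key(b);
}
```

With this key, '9' (0x39) compares as 0x79 and therefore sorts after 'Z' (0x5A), while '!' (0x21) still sorts before 'A' (0x41). The same rule also moves column 7 down to column 3, which was harmless while columns 6 and 7 were still undefined.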

This decision also led to an additional criterion for ASCII:

There should be a single bit difference between capital and small letters.

The ASCII standard ASA X3.4-1963 didn't include the lowercase alphabet, but in October 1963 it was decided that the lowercase alphabet should be included in columns 6 and 7, filling in the undefined region and enabling single-bit case switching.

┌─────┬Column┬───0──┬───1──┬───2──┬───3──┬───4──┬───5──┬───6──┬───7──┐
│ Row │ Bits │  000 │  001 │  010 │  011 │  100 │  101 │  110 │  111 │
├─────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┤
│  0  │ 0000 │ NULL │  DLE │  SPC │   0  │   @  │   P  │   `  │   p  │
│  1  │ 0001 │  SOH │  DC1 │   !  │   1  │   A  │   Q  │   a  │   q  │
│  2  │ 0010 │  STX │  DC2 │   "  │   2  │   B  │   R  │   b  │   r  │
│  3  │ 0011 │  ETX │  DC3 │   #  │   3  │   C  │   S  │   c  │   s  │
│  4  │ 0100 │  EOT │  DC4 │   $  │   4  │   D  │   T  │   d  │   t  │
│  5  │ 0101 │  ENQ │  NAK │   %  │   5  │   E  │   U  │   e  │   u  │
│  6  │ 0110 │  ACK │  SYN │   &  │   6  │   F  │   V  │   f  │   v  │
│  7  │ 0111 │  BEL │  ETB │   '  │   7  │   G  │   W  │   g  │   w  │
│  8  │ 1000 │  BS  │  CAN │   (  │   8  │   H  │   X  │   h  │   x  │
│  9  │ 1001 │  HT  │  EM  │   )  │   9  │   I  │   Y  │   i  │   y  │
│  10 │ 1010 │  NL  │  SUB │   *  │   :  │   J  │   Z  │   j  │   z  │
│  11 │ 1011 │  VT  │  ESC │   +  │   ;  │   K  │   [  │   k  │   {  │
│  12 │ 1100 │  FF  │  FS  │   ,  │   <  │   L  │   \  │   l  │   |  │
│  13 │ 1101 │  CR  │  GS  │   -  │   =  │   M  │   ]  │   m  │   }  │
│  14 │ 1110 │  SO  │  RS  │   .  │   >  │   N  │   ^  │   n  │   ~  │
│  15 │ 1111 │  SI  │  US  │   /  │   ?  │   O  │   _  │   o  │  DEL │
└─────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┘
Columns 0 and 1, plus DEL: control. Columns 2 through 7, except DEL: graphics.
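
The table also shows why the masks from the introduction are only safe on letters: the 0x20 bit separates column 4 from 6 and column 5 from 7, so it pairs @ with `, [ with { and so on, and clearing it on a numeric such as 1 (0x31) yields the control code DC1 (0x11). A per-character version that checks for a letter first might look like this sketch; the names are hypothetical, and hex constants are used so that it does not depend on the compiler's character set.

```c
#include <assert.h>

#define CASE_BIT 0x20u  /* the single bit separating columns 4-5 from columns 6-7 */

/* True for A-Z and a-z only: fold to lowercase, then range-check. */
int is_ascii_letter(unsigned char c) {
        unsigned char folded = (unsigned char)(c | CASE_BIT);
        return folded >= 0x61 && folded <= 0x7A;   /* 'a'..'z' */
}

/* Uppercase a 7-bit code, leaving non-letters untouched. */
unsigned char to_upper_ascii(unsigned char c) {
        assert(c <= 0x7F);
        return is_ascii_letter(c) ? (unsigned char)(c & ~CASE_BIT) : c;
}

/* Flip the case of a 7-bit code, leaving non-letters untouched. */
unsigned char toggle_case_ascii(unsigned char c) {
        assert(c <= 0x7F);
        return is_ascii_letter(c) ? (unsigned char)(c ^ CASE_BIT) : c;
}
```

The unchecked masks remain the faster choice when the input is known to consist of letters only, as with the fixed-length symbols the original code was written for.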

Sources

Charles E. Mackenzie. 1980. Coded-Character Sets: History and Development. Addison-Wesley Longman Publishing Co., Inc., USA.
1978 06 Interface Age (ark:/13960/t9964cv2c)