Beautiful ASCII
This story starts with some specialized code, designed to very quickly set the case on a set of fixed length symbols:
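#include <stdint.h>

/* Clear the 0x20 bit in each of the four bytes: lowercase -> uppercase. */
uint32_t upper(uint32_t code) {
    return code & ~0x20202020u;
}

/* Set the 0x20 bit in each of the four bytes: uppercase -> lowercase. */
uint32_t lower(uint32_t code) {
    return code | 0x20202020u;
}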
The code works by taking a small four-letter ASCII string, encoded as an integer, and switching its case by setting or clearing a single bit in each byte with a magic mask. To illustrate with an example, let’s pass YyZt to lower():
- The ASCII sequence YyZt in byte form is 0x59 0x79 0x5a 0x74.
- Encoded as a 32-bit integer, on a big-endian system, that’s 0x59795a74.
- 0x59795a74 | 0x20202020 = 0x79797a74.
- Decoding 0x79797a74 to bytes yields 0x79 0x79 0x7a 0x74, or yyzt in ASCII.
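The walkthrough above can be reproduced with a small sketch. The helper names (pack4, unpack4, to_lower4, to_upper4) are made up for illustration and are not part of the original code; the packing is done with explicit shifts so that the first character always lands in the most significant byte, whatever the host byte order.

```c
#include <stdint.h>

/* Pack four bytes into a 32-bit integer, first byte most significant
   (the big-endian layout used in the walkthrough). Hypothetical helper. */
static uint32_t pack4(const char *s) {
    return ((uint32_t)(unsigned char)s[0] << 24) |
           ((uint32_t)(unsigned char)s[1] << 16) |
           ((uint32_t)(unsigned char)s[2] << 8)  |
            (uint32_t)(unsigned char)s[3];
}

/* Unpack into a NUL-terminated buffer of at least five bytes. */
static void unpack4(uint32_t code, char *out) {
    out[0] = (char)((code >> 24) & 0xFF);
    out[1] = (char)((code >> 16) & 0xFF);
    out[2] = (char)((code >> 8) & 0xFF);
    out[3] = (char)(code & 0xFF);
    out[4] = '\0';
}

/* Same masks as the article's lower() and upper(). */
static uint32_t to_lower4(uint32_t code) { return code | 0x20202020u; }
static uint32_t to_upper4(uint32_t code) { return code & ~0x20202020u; }
```

Note that the masks are only meaningful when all four bytes are ASCII letters. Applied to the digit '1' (0x31), for instance, clearing the 0x20 bit yields 0x11, which is a control code rather than a digit.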
At a quick glance this might seem like some fancy bit-trickery designed to squeeze out a few extra cycles of performance, but it is in fact by design and carries with it a history dating back to the 19th century.
The telegraph standard
Designed by the French inventor Émile Baudot, the Baudot telegraph was an early multiplexing printing telegraph.
Each key on the five key keyboard represented one bit, in a 5-bit encoding. These 5-bit symbols, transmitted over the telegraph wire, would then be punched onto a five-hole punch tape. This encoding, initially called Baudot Code, became the first International Telegraph Alphabet or ITA-1:
┌────────┬────────┬──────┬──────┬──────┬──────┬──────┐
│ Letter │ Figure │  I   │  II  │ III  │  IV  │  V   │
├────────┼────────┼──────┼──────┼──────┼──────┼──────┤
│   A    │   1    │  ●   │      │      │      │      │
│   B    │   8    │      │      │  ●   │  ●   │      │
│   C    │   9    │  ●   │      │  ●   │  ●   │      │
│   D    │   0    │  ●   │  ●   │  ●   │  ●   │      │
│   E    │   2    │      │  ●   │      │      │      │
│   É    │   &    │  ●   │  ●   │      │      │      │
│   F    │   ᶠ    │      │  ●   │  ●   │  ●   │      │
│   G    │   7    │      │  ●   │      │  ●   │      │
│   H    │   ʰ    │  ●   │  ●   │      │  ●   │      │
│   I    │   °    │      │  ●   │  ●   │      │      │
│   J    │   6    │  ●   │      │      │  ●   │      │
│   K    │   (    │  ●   │      │      │  ●   │  ●   │
│   L    │   =    │  ●   │  ●   │      │  ●   │  ●   │
│   M    │   )    │      │  ●   │      │  ●   │  ●   │
│   N    │   N°   │      │  ●   │  ●   │  ●   │  ●   │
│   O    │   5    │  ●   │  ●   │  ●   │      │      │
│   P    │   %    │  ●   │  ●   │  ●   │  ●   │  ●   │
│   Q    │   /    │  ●   │      │  ●   │  ●   │  ●   │
│   R    │   -    │      │      │  ●   │  ●   │  ●   │
│   S    │   ;    │      │      │  ●   │      │  ●   │
│   T    │   !    │  ●   │      │  ●   │      │  ●   │
│   U    │   4    │  ●   │      │  ●   │      │      │
│   V    │   '    │  ●   │  ●   │  ●   │      │  ●   │
│   W    │   ?    │      │  ●   │  ●   │      │  ●   │
│   X    │   ,    │      │  ●   │      │      │  ●   │
│   Y    │   3    │      │      │  ●   │      │      │
│   Z    │   :    │  ●   │  ●   │      │      │  ●   │
│   ᵗ    │   .    │  ●   │      │      │      │  ●   │
│  del   │        │      │      │      │  ●   │  ●   │
│ figure │ blank  │      │      │      │  ●   │      │
│ blank  │ letter │      │      │      │      │  ●   │
└────────┴────────┴──────┴──────┴──────┴──────┴──────┘
Baudot’s invention was later adapted for use with a typewriter by New Zealand’s Donald Murray, who modified code positions to reduce mechanical wear on common symbols. Murray’s code also introduced typewriter specific control codes, such as LF (line feed) and CR (carriage return). Murray’s invention became popular and his 5-bit Murray Code, with some adaptations, became the ITA-2:
┌──────┬─────┬─────────────────────┬───────────────────┬───┬───────────────┬──────┬───────┐
│ Ltrs │\n \r│ Q W E R T Y U I O P │ A S D F G H J K L │SPC│ Z X C V B N M │ →Fig │  DEL  │
├──────┼─────┼─────────────────────┼───────────────────┼───┼───────────────┼──────┼───────┤
│ Figs │\n \r│ 1 2 3 4 5 6 7 8 9 0 │ - ' ᵉ ˢ & \ ( )   │SPC│ + / : = ? , . │      │ →Ltrs │
├──────┼─────┼─────────────────────┼───────────────────┼───┼───────────────┼──────┼───────┤
│  1   │     │ ● ● ●     ● ●       │ ● ● ● ●     ● ●   │   │ ● ●     ●     │  ●   │   ●   │
│  2   │ ●   │ ● ●   ●     ● ●   ● │ ●       ●   ● ● ● │   │     ● ●       │  ●   │   ●   │
│  3   │     │ ●         ● ● ●   ● │   ●   ●   ●   ●   │ ● │   ● ● ●   ● ● │      │   ●   │
│  4   │    ●│       ●         ●   │     ● ● ●   ● ●   │   │   ● ● ● ● ● ● │  ●   │   ●   │
│  5   │     │ ● ●     ● ●     ● ● │         ● ●     ● │   │ ● ●   ● ●   ● │  ●   │   ●   │
└──────┴─────┴─────────────────────┴───────────────────┴───┴───────────────┴──────┴───────┘
ᵉ = escape  ˢ = shift out
The ITA-2 standard wasn’t without its issues though. Given its size of only 5 bits, the set of characters it could represent was limited, and so variations popped up to accommodate different needs. The 5-bit code was also stateful: specific bit patterns were reserved to switch the symbol set to either letters or figures. Codes with this trait are known as shifted codes, and while space efficient, a single flipped bit or lost symbol can render an entire message unintelligible.
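To make the fragility of a shifted code concrete, here is a minimal sketch of a stateful decoder. It is a toy, not ITA-2: the code values 0 to 5, the four-symbol tables, and the name toy_decode are all invented for illustration. Only the pairing of Q/1, W/2, E/3 and R/4 on shared codes is borrowed from the table above.

```c
#include <stddef.h>

/* Toy shifted code (hypothetical values, not real ITA-2):
   0 = switch to letters, 1 = switch to figures,
   2..5 = a symbol whose meaning depends on the current shift state. */
enum { TOY_TO_LETTERS = 0, TOY_TO_FIGURES = 1 };

/* Decode n codes into out, which must hold at least n + 1 bytes. */
static void toy_decode(const unsigned char *codes, size_t n, char *out) {
    static const char letters[] = "QWER";
    static const char figures[] = "1234";
    const char *active = letters;   /* decoder state: the current symbol set */
    size_t j = 0;

    for (size_t i = 0; i < n; i++) {
        if (codes[i] == TOY_TO_LETTERS) {
            active = letters;
        } else if (codes[i] == TOY_TO_FIGURES) {
            active = figures;
        } else if (codes[i] < 6) {
            out[j++] = active[codes[i] - 2];
        }                            /* anything else is ignored */
    }
    out[j] = '\0';
}
```

With the sequence 2 3 1 2 3 0 4 5 the decoder produces QW12ER. If the single figure-shift code (1) is lost in transit, the same symbols decode as QWQWER, and if the letter-shift code (0) is lost instead, everything after it stays in figures: QW1234.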
Computation and collation
The arrival of computers brought a new set of requirements. New codes were developed that could both store data and perform computations efficiently. A few pioneering computer codes that heavily influenced ASCII were the 6/7-bit FIELDATA code and the 6-bit BCDIC. Both of these had their issues:
- FIELDATA, developed by the US Army, left many code positions unassigned, leaving it to implementers to assign them. This decision resulted in at least three different communications systems being incompatible with each other.
- BCDIC was only 6 bits, which was insufficient to represent the necessary symbol space, and BCDIC’s successor, the 8-bit extended BCDIC (EBCDIC), was deemed too large:
  - The ASCII committee determined that 128 characters (7 bits) were sufficient to satisfy the majority of users.
  - 8-bit registers were more expensive than 7-bit registers, and 7 bits were sufficient.
  - A lot of transmission equipment was unreliable and often used a parity bit. By using 7 bits, there was still room for a parity bit in a single frame on the standard one-inch 8-hole punch tape.
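The parity argument can be sketched in a few lines. The example below assumes even parity and assumes the parity bit occupies the eighth (most significant) bit of the frame; the text above does not specify either, and real equipment varied. The function names are illustrative only.

```c
#include <stdint.h>

/* Return the 7-bit code with an even-parity bit placed in bit 7 (0x80).
   Even parity and the bit position are assumptions of this sketch. */
static uint8_t add_even_parity(uint8_t code7) {
    uint8_t bits = code7 & 0x7F;
    uint8_t parity = 0;
    for (uint8_t b = bits; b != 0; b >>= 1)
        parity ^= (uint8_t)(b & 1);      /* 1 when the count of ones is odd */
    return (uint8_t)(bits | (parity << 7));
}

/* A frame passes the check when its eight bits hold an even number of ones. */
static int frame_ok(uint8_t frame) {
    int ones = 0;
    for (; frame != 0; frame >>= 1)
        ones += frame & 1;
    return (ones % 2) == 0;
}
```

For example, 'A' (0x41) already has an even number of ones and is sent unchanged, while 'C' (0x43) has three and is sent as 0xC3. Flipping any single bit of a valid frame makes frame_ok return 0, which is exactly the kind of error a single parity bit can detect.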
ASCII
By the late 1950s there was a push for a new global standard communication code. Under the supervision of various committees around the world, among them the American Standards Association’s X3.2 Committee, work on a new code commenced. This code would later be known as ASCII.
Design of ASCII
Many considerations from the computing and communication industry were reviewed in the design of the new code, as well as lessons learned from past codes, among them:
- Keep the all zero NULL code and all one DEL code, as in ITA-2.
- Surveys had shown a minimum requirement of 10 numerics, 26 alphabetics and 27 specials, so at minimum 63 graphic symbols.
- There might be a need for lowercase and/or uppercase alphabetics.
- Various format effectors (space, carriage return, tab, …)
- Special characters such as ESC, Shift-in, Shift-out to allow for future expansion of the code without increasing the size of the code.
- Control characters and graphics characters should not be intermingled, and should be grouped continuously.
From these requirements the basic structure of the code was formed:
┌─────┬Column┬───0──┬───1──┬───2──┬───3──┬───4──┬───5──┬───6──┬───7──┐ │ Row │ Bits │ 000 │ 001 │ 010 │ 011 │ 100 │ 101 │ 110 │ 111 │ ├─────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┤ │ 0 │ 0000 │ NULL │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│ │ 1 │ 0001 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│ │ 2 │ 0010 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│ │ 3 │ 0011 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│ │ 4 │ 0100 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│ │ 5 │ 0101 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│ │ 6 │ 0110 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│ │ 7 │ 0111 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│ │ 8 │ 1000 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│ │ 9 │ 1001 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│ │ 10 │ 1010 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│ │ 11 │ 1011 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│ │ 12 │ 1100 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│ │ 13 │ 1101 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│ │ 14 │ 1110 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│ │ 15 │ 1111 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│ DEL │ └─────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┘ ▓▓▓▓▓▓ Control ▓▓▓▓▓▓ Graphics ▓▓▓▓▓▓ Undefined
It was decided, after some debate, that space would be a graphics character and not a control character. For computational and collation purposes, it was also decided that space should collate low to all other graphics characters. That is, because sorting graphics characters compares their code values, space, represented by b0100000 (0x20, or decimal 32), would have a lower value than any other graphics character. It was also determined that alphabetics couldn’t start in column 2 (b010), as there were too many special characters that needed to collate lower than the alphabetics.
Having space as the first character of column 2 also ruled out column 2 for storing numerics. This was to ensure that a numeric’s lower four bits (row) mirrored the numeric value; for example, the lower four bits of the numerical character '2' would have the value 2 (b0010). Numbers were therefore to be placed at the top of column 3 or 5, leaving space for alphabetics.
┌─────┬Column┬───0──┬───1──┬───2──┬───3──┬───4──┬───5──┬───6──┬───7──┐ │ Row │ Bits │ 000 │ 001 │ 010 │ 011 │ 100 │ 101 │ 110 │ 111 │ ├─────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┤ │ 0 │ 0000 │ NULL │▓▓▓▓▓▓│ SPC │ 0 │▓▓▓▓▓▓│ 0 │▓▓▓▓▓▓│▓▓▓▓▓▓│ │ 1 │ 0001 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│ 1 │▓▓▓▓▓▓│ 1 │▓▓▓▓▓▓│▓▓▓▓▓▓│ │ 2 │ 0010 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│ 2 │▓▓▓▓▓▓│ 2 │▓▓▓▓▓▓│▓▓▓▓▓▓│ │ 3 │ 0011 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│ 3 │▓▓▓▓▓▓│ 3 │▓▓▓▓▓▓│▓▓▓▓▓▓│ │ 4 │ 0100 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│ 4 │▓▓▓▓▓▓│ 4 │▓▓▓▓▓▓│▓▓▓▓▓▓│ │ 5 │ 0101 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│ 5 │▓▓▓▓▓▓│ 5 │▓▓▓▓▓▓│▓▓▓▓▓▓│ │ 6 │ 0110 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│ 6 │▓▓▓▓▓▓│ 6 │▓▓▓▓▓▓│▓▓▓▓▓▓│ │ 7 │ 0111 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│ 7 │▓▓▓▓▓▓│ 7 │▓▓▓▓▓▓│▓▓▓▓▓▓│ │ 8 │ 1000 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│ 8 │▓▓▓▓▓▓│ 8 │▓▓▓▓▓▓│▓▓▓▓▓▓│ │ 9 │ 1001 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│ 9 │▓▓▓▓▓▓│ 9 │▓▓▓▓▓▓│▓▓▓▓▓▓│ │ 10 │ 1010 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│ │ 11 │ 1011 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│ │ 12 │ 1100 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│ │ 13 │ 1101 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│ │ 14 │ 1110 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│ │ 15 │ 1111 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│ DEL │ └─────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┘ ▓▓▓▓▓▓ Control ▓▓▓▓▓▓ Graphics ▓▓▓▓▓▓ Undefined
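Both properties discussed above, space collating low and a digit’s row mirroring its value, can be checked against the ASCII that eventually shipped, where the numerics ended up in column 3. The sketch below uses made-up helper names (column_of, row_of, digit_value) and assumes the compiler’s execution character set is ASCII.

```c
/* Column: the three high-order bits of a 7-bit code. */
static int column_of(char c) { return (c >> 4) & 0x07; }

/* Row: the four low-order bits of a 7-bit code. */
static int row_of(char c) { return c & 0x0F; }

/* Numeric value of an ASCII digit: simply its row.
   Only meaningful for '0' through '9'. */
static int digit_value(char c) { return row_of(c); }
```

Here column_of(' ') is 2 and row_of(' ') is 0, so space is the lowest graphics code, while column_of('7') is 3 and digit_value('7') is 7 with no subtraction or table lookup needed.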
Whether or not lowercase and uppercase alphabetics should be interleaved, that is, stored as AaBbCc... in a column, was debated. The arguments for interleaving carried merit: should a name such as MacKenzie sort differently than Mackenzie with a lowercase k? Evidence, from things such as phone books, showed that there was no established consistency. One advantage of keeping them separate would be that the most commonly used symbols (uppercase, numerics and specials) could still fit into a single 6-bit (64 symbols) subset code easily extracted from the 7-bit code.
Although it was common for codes at the time to sort alphabetics lower than numerics, it had not yet been decided whether the code would include the lowercase alphabet, or whether the two undefined columns should be used for graphic codes instead of control codes. In the end, contrary to common practice, the committee settled on putting numerics in column 3 (b011), as it left more choices open. To support this decision the committee argued that the existing collating order could still be achieved:
If it is necessary to achieve the de facto collating sequence (specials, alphabetics, numerics), it may be achieved, during comparison operations, by inverting b7 if b6 = b5 = 1. That is, the three high-order bits of the column of numerics would then become 111, which would make them collate high to the alphabetics, with high-order bits of 100 and 101.
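The comparison trick in the quote can be sketched as a key function. Bits are numbered b1 (least significant) to b7 (most significant) as in the quote, so b7 is 0x40, b6 is 0x20 and b5 is 0x10. The name collation_key is invented for this example, and the sketch reflects the layout without lowercase letters.

```c
/* Map a 7-bit code to a comparison key: invert b7 when b6 = b5 = 1,
   which moves column 011 (the numerics) to 111. Hypothetical helper. */
static unsigned collation_key(unsigned char c) {
    unsigned code = c & 0x7Fu;
    if ((code & 0x30u) == 0x30u)   /* b6 = b5 = 1 */
        code ^= 0x40u;             /* invert b7 */
    return code;
}
```

Comparing keys instead of raw codes puts ' ' (0x20) before 'A' (0x41) before '0' (key 0x70), which is the specials, alphabetics, numerics order. The rule is deliberately crude: the specials sharing column 3 with the digits move along with them, and anything in column 7 maps back down to column 3, so it would need refinement once lowercase letters occupy that column.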
This decision also led to an additional criterion for ASCII:
There should be a single bit difference between capital and small letters.
The ASCII standard ASA X3.4-1963 didn’t include the lowercase alphabet, but in October of 1963 it was decided that the lowercase alphabet should be included in columns 6 and 7, filling in the undefined region and enabling the single-bit case switching.
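With uppercase in columns 4 and 5 and lowercase in columns 6 and 7, each letter and its counterpart differ only in the 0x20 bit. A per-character sketch that flips that bit, guarded so that non-letters are left alone, might look as follows; the function names are illustrative and an ASCII execution character set is assumed.

```c
/* True when c is an ASCII letter: clearing the case bit must land in A..Z. */
static int is_ascii_letter(unsigned char c) {
    unsigned char u = (unsigned char)(c & 0xDFu);   /* 0xDF clears 0x20 */
    return u >= 'A' && u <= 'Z';
}

/* Flip the single case bit for letters; leave everything else untouched. */
static unsigned char toggle_case(unsigned char c) {
    return is_ascii_letter(c) ? (unsigned char)(c ^ 0x20u) : c;
}
```

The guard is what the four-byte version at the top of the article omits for speed: it relies on the caller only ever passing letters.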