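This story starts with some specialized code, designed to very quickly set the case of a set of fixed-length symbols:

    #include <stdint.h>

    /* Clear bit 0x20 in each of the four bytes: ASCII letters become uppercase. */
    uint32_t upper(uint32_t code) {
        return code & ~0x20202020u;
    }

    /* Set bit 0x20 in each of the four bytes: ASCII letters become lowercase. */
    uint32_t lower(uint32_t code) {
        return code | 0x20202020u;
    }
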
The code works by taking a short four-letter ASCII string, encoded as a 32-bit integer, and switching its case by using a magic number to set or clear a single bit (0x20) in each byte. To illustrate with an example, let's pass YyZt to lower():

1. YyZt in byte form is 0x59 0x79 0x5a 0x74, or 0x59795a74 as a single integer.
2. 0x59795a74 | 0x20202020 = 0x79797a74.
3. Converting 0x79797a74 back to bytes yields 0x79 0x79 0x7a 0x74, or yyzt in ASCII.

At a quick glance this might seem like some fancy bit-trickery designed to squeeze out a few extra cycles of performance, but the fact that it works at all is by design and carries with it a history dating back to the 19th century.
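Before turning to that history, the walk-through above can be checked with a small sketch. The helpers `pack4` and `unpack4` are hypothetical names introduced here for illustration; they assume the first character is stored in the most significant byte, which matches the 0x59795a74 layout of the example. `upper` and `lower` are repeated only so that the block compiles on its own.

```c
#include <stdint.h>

/* Hypothetical helper: pack four ASCII bytes into one integer, with the
 * first character in the most significant byte (0x59795a74 for "YyZt"). */
uint32_t pack4(const char s[4]) {
    return ((uint32_t)(unsigned char)s[0] << 24) |
           ((uint32_t)(unsigned char)s[1] << 16) |
           ((uint32_t)(unsigned char)s[2] << 8)  |
            (uint32_t)(unsigned char)s[3];
}

/* Hypothetical helper: inverse of pack4, writes four bytes plus a NUL. */
void unpack4(uint32_t code, char out[5]) {
    out[0] = (char)((code >> 24) & 0xFF);
    out[1] = (char)((code >> 16) & 0xFF);
    out[2] = (char)((code >> 8) & 0xFF);
    out[3] = (char)(code & 0xFF);
    out[4] = '\0';
}

/* Same operations as the functions at the top of the article. */
uint32_t upper(uint32_t code) { return code & ~UINT32_C(0x20202020); }
uint32_t lower(uint32_t code) { return code | UINT32_C(0x20202020); }
```

Calling `unpack4(lower(pack4("YyZt")), buf)` leaves `yyzt` in `buf`, and the same call with `upper` leaves `YYZT`.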
Designed by the French inventor Émile Baudot, the Baudot telegraph was an early multiplexing printing telegraph.
Each key on the five-key keyboard represented one bit in a 5-bit encoding. These 5-bit symbols, transmitted over the telegraph wire, would then be punched onto a five-hole punch tape. This encoding, initially called Baudot Code, became the first International Telegraph Alphabet or ITA-1:
┌────────┬────────┬──────┬──────┬──────┬──────┬──────┐
│ Letter │ Figure │  I   │  II  │ III  │  IV  │  V   │
├────────┼────────┼──────┼──────┼──────┼──────┼──────┤
│        │        │      │      │      │      │      │
│ A      │ 1      │  ●   │      │      │      │      │
│ B      │ 8      │      │      │  ●   │  ●   │      │
│ C      │ 9      │  ●   │      │  ●   │  ●   │      │
│ D      │ 0      │  ●   │  ●   │  ●   │  ●   │      │
│ E      │ 2      │      │  ●   │      │      │      │
│ É      │ &      │  ●   │  ●   │      │      │      │
│ F      │ ᶠ      │      │  ●   │  ●   │  ●   │      │
│ G      │ 7      │      │  ●   │      │  ●   │      │
│ H      │ ʰ      │  ●   │  ●   │      │  ●   │      │
│ I      │ °      │      │  ●   │  ●   │      │      │
│ J      │ 6      │  ●   │      │      │  ●   │      │
│ K      │ (      │  ●   │      │      │  ●   │  ●   │
│ L      │ =      │  ●   │  ●   │      │  ●   │  ●   │
│ M      │ )      │      │  ●   │      │  ●   │  ●   │
│ N      │ N°     │      │  ●   │  ●   │  ●   │  ●   │
│ O      │ 5      │  ●   │  ●   │  ●   │      │      │
│ P      │ %      │  ●   │  ●   │  ●   │  ●   │  ●   │
│ Q      │ /      │  ●   │      │  ●   │  ●   │  ●   │
│ R      │ -      │      │      │  ●   │  ●   │  ●   │
│ S      │ ;      │      │      │  ●   │      │  ●   │
│ T      │ !      │  ●   │      │  ●   │      │  ●   │
│ U      │ 4      │  ●   │      │  ●   │      │      │
│ V      │ '      │  ●   │  ●   │  ●   │      │  ●   │
│ W      │ ?      │      │  ●   │  ●   │      │  ●   │
│ X      │ ,      │      │  ●   │      │      │  ●   │
│ Y      │ 3      │      │      │  ●   │      │      │
│ Z      │ :      │  ●   │  ●   │      │      │  ●   │
│ ᵗ      │ .      │  ●   │      │      │      │  ●   │
│ del    │        │      │      │      │  ●   │  ●   │
│ figure │ blank  │      │      │      │  ●   │      │
│ blank  │ letter │      │      │      │      │  ●   │
└────────┴────────┴──────┴──────┴──────┴──────┴──────┘
Baudot's invention was later adapted for use with a typewriter by New Zealand's Donald Murray, who modified code positions to reduce mechanical wear on common symbols. Murray's code also introduced typewriter-specific control codes, such as LF (line feed) and CR (carriage return). Murray's invention became popular and his 5-bit Murray Code, with some adaptations, became ITA-2:
┌──────┬─────┬─────────────────────┬───────────────────┬───┬───────────────┬──────┬───────┐
│ Ltrs │\n \r│ Q W E R T Y U I O P │ A S D F G H J K L │SPC│ Z X C V B N M │ →Fig │ DEL   │
├──────┼─────┼─────────────────────┼───────────────────┼───┼───────────────┼──────┼───────┤
│ Figs │\n \r│ 1 2 3 4 5 6 7 8 9 0 │ - ' ᵉ ˢ &   \ ( ) │SPC│ + / : = ? , . │      │ →Ltrs │
├──────┼─────┼─────────────────────┼───────────────────┼───┼───────────────┼──────┼───────┤
│ 1    │     │ ● ● ●     ● ●       │ ● ● ● ●     ● ●   │   │ ● ●     ●     │  ●   │   ●   │
│ 2    │ ●   │ ● ●   ●     ● ●   ● │ ●       ●   ● ● ● │   │     ● ●       │  ●   │   ●   │
│ 3    │     │ ●         ● ● ●   ● │   ●   ●   ●   ●   │ ● │   ● ● ●   ● ● │      │   ●   │
│ 4    │   ● │       ●         ●   │     ● ● ●   ● ●   │   │   ● ● ● ● ● ● │  ●   │   ●   │
│ 5    │     │ ● ●     ● ●     ● ● │         ● ●     ● │   │ ● ●   ● ●   ● │  ●   │   ●   │
└──────┴─────┴─────────────────────┴───────────────────┴───┴───────────────┴──────┴───────┘
ᵉ = escape   ˢ = shift out
The ITA-2 standard wasn't without its issues though. With only 5 bits, the set of characters it could represent was limited, and so variations popped up to accommodate different needs. The 5-bit code was also stateful: specific bit patterns were reserved to switch the symbol set to either letters or figures. Codes with this trait are known as shifted codes, and while space efficient, a single flipped bit or lost symbol can render the rest of a message unintelligible.
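The fragility of a shifted code can be illustrated with a minimal decoder sketch. Everything here is an assumption made for illustration: the function name `ita2_decode`, the choice of numbering bit 1 as the least significant bit of each 5-bit value, starting in letters mode, and the use of `~` as a placeholder for codes the sketch does not print (null, the two shifts, and positions whose figure varies between national variants).

```c
#include <stddef.h>

enum { ITA2_FIGS = 0x1B, ITA2_LTRS = 0x1F };  /* shift to figures / letters */

/* Index = 5-bit code with bit 1 as the least significant bit.
 * '~' is a placeholder chosen for this sketch, not part of ITA-2. */
static const char ita2_letters[33] = "~E\nA SIU\rDRJNFCKTZLWHYPQOBG~MXV~";
static const char ita2_figures[33] = "~3\n- '87\r~4~,~:(5+)2~6019?&~./=~";

/* Decode n 5-bit codes into out (which must hold n + 1 chars).
 * The current table is the decoder's only state: it changes solely
 * when a shift code arrives. Returns the number of chars written. */
size_t ita2_decode(const unsigned char *codes, size_t n, char *out) {
    const char *table = ita2_letters;  /* assumed initial state */
    size_t len = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned char c = codes[i] & 0x1F;
        if (c == ITA2_FIGS) {
            table = ita2_figures;
        } else if (c == ITA2_LTRS) {
            table = ita2_letters;
        } else {
            out[len++] = table[c];
        }
    }
    out[len] = '\0';
    return len;
}
```

The code sequence 3, 25, 4, 27, 23, 19 decodes to `AB 12`. If the single shift code 27 is lost in transit, the same data decodes to `AB QW`, and every following symbol stays wrong until the next shift arrives.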
The arrival of computers brought a new set of requirements. New codes were developed that could both store data and perform computations efficiently. A few pioneering computer codes that heavily influenced ASCII were the 6/7-bit FIELDATA code and the 6-bit BCDIC. Both of these had their issues:
FIELDATA, developed by the US Army, left many code positions unassigned, leaving it to implementers to assign them. This decision resulted in at least three different communications systems being incompatible with each other.
BCDIC was only 6 bits, which was insufficient to represent the necessary symbol space, and BCDIC's successor, the 8-bit extended BCDIC (EBCDIC), was deemed too large.
By the late 1950s there was a push for a new global standard communication code. Under the supervision of various committees around the world, among them the American Standards Association's X3.2 Committee, work on a new code commenced. This code would later be known as ASCII.
Many considerations from the computing and communication industry were reviewed in the design of the new code, as well as lessons learned from past codes, among them:
From these requirements the basic structure of the code was formed:
┌─────┬Column┬───0──┬───1──┬───2──┬───3──┬───4──┬───5──┬───6──┬───7──┐ │ Row │ Bits │ 000 │ 001 │ 010 │ 011 │ 100 │ 101 │ 110 │ 111 │ ├─────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┤ │ 0 │ 0000 │ NULL │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│ │ 1 │ 0001 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│ │ 2 │ 0010 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│ │ 3 │ 0011 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│ │ 4 │ 0100 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│ │ 5 │ 0101 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│ │ 6 │ 0110 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│ │ 7 │ 0111 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│ │ 8 │ 1000 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│ │ 9 │ 1001 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│ │ 10 │ 1010 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│ │ 11 │ 1011 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│ │ 12 │ 1100 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│ │ 13 │ 1101 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│ │ 14 │ 1110 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│ │ 15 │ 1111 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│ DEL │ └─────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┘ ▓▓▓▓▓▓ Control ▓▓▓▓▓▓ Graphics ▓▓▓▓▓▓ Undefined
It was decided, after some debate, that space would be a graphics character and not a control character. For computational and collation purposes it was also decided that space should collate low to all other graphics characters. That is, when graphics characters are sorted their code values are compared, so space, represented by b0100000 (0x20 or decimal 32), would have a lower value than any other graphics character. It was also determined that alphabetics couldn't start in column 2 (b010) either, as there were too many special characters that needed to collate lower than the alphabetics.
Having space as the first character of column 2 also ruled out column 2 for storing numerics. This was to ensure that a numeric's lower four bits (row) mirrored the numeric value, i.e. the lower four bits of, for example, the numerical character '2' would have the value 2 (b0010). Numbers were therefore to be placed at the top of column 3 or 5, leaving space for alphabetics.
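The practical payoff of this row rule can be sketched in a few lines. The example assumes ASCII as eventually standardized, where the digits sit in column 3 (0x30 to 0x39); the function names `digit_value` and `digit_char` are hypothetical and used only for illustration.

```c
/* Numeric value of an ASCII digit: simply its low four bits (the row).
 * The caller is expected to pass a character in '0'..'9'. */
int digit_value(char c) {
    return c & 0x0F;
}

/* ASCII digit for a value in 0..9: column 3 (binary 011) in the high
 * three bits, the value itself in the low four bits. */
char digit_char(int v) {
    return (char)(0x30 | v);
}
```

No lookup table or subtraction is needed, and because space is 0x20 it still compares lower than every digit and every other graphics character.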
┌─────┬Column┬───0──┬───1──┬───2──┬───3──┬───4──┬───5──┬───6──┬───7──┐ │ Row │ Bits │ 000 │ 001 │ 010 │ 011 │ 100 │ 101 │ 110 │ 111 │ ├─────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┤ │ 0 │ 0000 │ NULL │▓▓▓▓▓▓│ SPC │ 0 │▓▓▓▓▓▓│ 0 │▓▓▓▓▓▓│▓▓▓▓▓▓│ │ 1 │ 0001 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│ 1 │▓▓▓▓▓▓│ 1 │▓▓▓▓▓▓│▓▓▓▓▓▓│ │ 2 │ 0010 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│ 2 │▓▓▓▓▓▓│ 2 │▓▓▓▓▓▓│▓▓▓▓▓▓│ │ 3 │ 0011 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│ 3 │▓▓▓▓▓▓│ 3 │▓▓▓▓▓▓│▓▓▓▓▓▓│ │ 4 │ 0100 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│ 4 │▓▓▓▓▓▓│ 4 │▓▓▓▓▓▓│▓▓▓▓▓▓│ │ 5 │ 0101 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│ 5 │▓▓▓▓▓▓│ 5 │▓▓▓▓▓▓│▓▓▓▓▓▓│ │ 6 │ 0110 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│ 6 │▓▓▓▓▓▓│ 6 │▓▓▓▓▓▓│▓▓▓▓▓▓│ │ 7 │ 0111 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│ 7 │▓▓▓▓▓▓│ 7 │▓▓▓▓▓▓│▓▓▓▓▓▓│ │ 8 │ 1000 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│ 8 │▓▓▓▓▓▓│ 8 │▓▓▓▓▓▓│▓▓▓▓▓▓│ │ 9 │ 1001 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│ 9 │▓▓▓▓▓▓│ 9 │▓▓▓▓▓▓│▓▓▓▓▓▓│ │ 10 │ 1010 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│ │ 11 │ 1011 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│ │ 12 │ 1100 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│ │ 13 │ 1101 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│ │ 14 │ 1110 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│ │ 15 │ 1111 │▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│▓▓▓▓▓▓│ DEL │ └─────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┘ ▓▓▓▓▓▓ Control ▓▓▓▓▓▓ Graphics ▓▓▓▓▓▓ Undefined
Whether or not lowercase and uppercase alphabetics should be interleaved, that is, stored as AaBbCc... in a column, was debated. The arguments for interleaving carried merit: should a name such as MacKenzie sort differently than Mackenzie with a lowercase k? Evidence, from things such as phone books, showed that there was no established consistency. One advantage of keeping them separate would be that the most commonly used symbols (uppercase, numerics and specials) could still fit into a single 6-bit (64 symbols) subset code easily extracted from the 7-bit code.
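One way such a 6-bit subset could be extracted is sketched below. This is an assumption for illustration, not a method stated in the committee's record: it takes the 64 codes of columns 2 through 5 (0x20 to 0x5F, covering space, specials, numerics and uppercase) and subtracts 0x20. The names `to_sixbit` and `from_sixbit` are hypothetical.

```c
/* Map a 7-bit character in columns 2..5 (0x20..0x5F) to a 6-bit code
 * in 0..63. Returns -1 for characters outside the subset, such as
 * control characters and lowercase letters. */
int to_sixbit(unsigned char c) {
    if (c < 0x20 || c > 0x5F) {
        return -1;
    }
    return c - 0x20;
}

/* Inverse mapping: expand a 6-bit code back to the 7-bit character. */
unsigned char from_sixbit(int six) {
    return (unsigned char)((six & 0x3F) + 0x20);
}
```

Because the subset is one contiguous run of four columns, the mapping preserves collating order: space stays lowest and the uppercase alphabet stays in sequence.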
It was common for codes at the time to sort alphabetics lower than numerics. However, it was not yet decided whether the code would include the lowercase alphabet, or whether the two undefined columns should be used for graphic codes instead of control codes. In the end, contrary to common practice, they settled on putting numerics in column 3 (b011) as it left more choices open. To support their decision the committee argued that the existing collating order could still be achieved:
If it is necessary to achieve the de facto collating sequence (specials, alphabetics, numerics), it may be achieved, during comparison operations, by inverting b7 if b6 = b5 = 1. That is, the three high-order bits of the column of numerics would then become 111, which would make them collate high to the alphabetics, with high-order bits of 100 and 101.
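The committee's suggestion can be written down as a small comparison key. This is a sketch of the quoted rule only, assuming bits are numbered b1 (least significant) to b7, so that b7 = 0x40, b6 = 0x20 and b5 = 0x10; the function names are hypothetical.

```c
/* Collation key per the quoted rule: invert b7 when b6 = b5 = 1.
 * Column 3 (011, the numerics) becomes 111 and so sorts above
 * columns 4 and 5 (100 and 101, the uppercase alphabetics).
 * Side effect of the rule as stated: column 7 (111) becomes 011. */
unsigned char collate_key(unsigned char c) {
    c &= 0x7F;                   /* keep the 7-bit code */
    if ((c & 0x30) == 0x30) {    /* b6 and b5 both set */
        c ^= 0x40;               /* invert b7 */
    }
    return c;
}

/* Compare two characters in the de facto order
 * (specials, alphabetics, numerics). Returns <0, 0 or >0. */
int collate_compare(unsigned char a, unsigned char b) {
    return (int)collate_key(a) - (int)collate_key(b);
}
```

With this key, `'0'` (0x30) maps to 0x70, so `collate_compare('Z', '0')` is negative even though `'Z'` has the larger code value, while space still sorts below both.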
This decision also led to an additional criterion for ASCII:
There should be a single bit difference between capital and small letters.
The ASCII standard ASA X3.4-1963 didn't include the lowercase alphabet, but in October of 1963 it was decided that the lowercase alphabet should be included in columns 6 and 7, filling in the undefined region and enabling the single bit case switching.
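The single-bit difference can be shown per character. The sketch below is an illustration with hypothetical names, not code from the article: uppercase letters live in columns 4 and 5 (100, 101) and lowercase in columns 6 and 7 (110, 111), so the two cases differ only in the bit with value 0x20. Unlike the four-byte mask at the top of the article, these helpers check that the input is a letter before touching the bit.

```c
#define CASE_BIT 0x20  /* the one bit separating 'A' (0x41) from 'a' (0x61) */

/* True for A-Z and a-z: force the case bit on, then range-check once. */
int is_ascii_letter(unsigned char c) {
    c |= CASE_BIT;
    return c >= 'a' && c <= 'z';
}

/* Clear the case bit for letters, leave everything else untouched. */
unsigned char to_upper_ascii(unsigned char c) {
    return is_ascii_letter(c) ? (unsigned char)(c & ~CASE_BIT) : c;
}

/* Set the case bit for letters, leave everything else untouched. */
unsigned char to_lower_ascii(unsigned char c) {
    return is_ascii_letter(c) ? (unsigned char)(c | CASE_BIT) : c;
}

/* Flip the case bit for letters, leave everything else untouched. */
unsigned char toggle_case_ascii(unsigned char c) {
    return is_ascii_letter(c) ? (unsigned char)(c ^ CASE_BIT) : c;
}
```

The letter check matters because the bit only means "case" inside the alphabetic rows: applying the bare mask to `'1'` (0x31) would clear the bit and yield the control code 0x11, and setting it on `'@'` (0x40) would yield a backtick (0x60).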