Science Lessons Inspired by Neal Stephenson: Stealing and Decrypting The Arethusa Intercepts

The messages, which were encrypted under the Arethusa algorithm, interconnect more of the principal characters in Cryptonomicon, than almost any other factor in the story. The cards first appear (pages 813, 814) in a trunk that had belonged to Lawrence Waterhouse. His grandson Randy Waterhouse digs through it, "to reveal a stack of bricks, neatly wrapped in paper which has gone golden with age, each consisting of a short stack of ETC cards, and each labelled ARETHUSA INTERCEPTS with a date from 1944 or '45." Those cards are read out on one of Chester's antique ETC card readers, with the help of Chester's "card man", and transferred to a floppy disk, and ultimately to Randy's laptop computer.

While flying to Kinakuta a few days later (pages 876-882), Randy has a long telephone conversation with Enoch Root, who had been told by the "card man" about Randy's Arethusa cards. In the conversation, Enoch tells the history of the study of Arethusa at NSA, which had shown that the cards at NSA had been generated by a particular random-number generator, and were not message intercepts at all. After arrival at Kinakuta, Randy discovers that the Arethusa files on his hard drive are different from those at NSA (pages 906, 907).

While in prison in Manila, with Enoch Root in the next cell (pages 981 ff), Randy learns enough cryptography to break the Arethusa encryptions, without ever displaying the files on the screen of his computer. Along the way, he uses the suspected relation of Arethusa to the algorithm Azure, which used zeta functions in the generation of daily one-time pads (pages 1005, 1006). His success is described (page 1025) as: "He comes up ... with A(x) = K, such that for any given date x, he can figure out what K, the keystream for that day would be; ... it worked in each case." The most important message, by far (page 1028), includes: "THE PRIMARY IS CODE NAMED GOLGOTHA. COORDINATES OF THE MAIN DRIFT ARE AS FOLLOWS: LATITUDE NORTH (etc.)"

All of the other mentions of Arethusa occur later in Cryptonicon, but in flashbacks to World War II. On pages 1028-1034, Lawrence Waterhouse breaks the Azure/Pufferfish algorithm, which is a zeta function with the date as one of its input data. [I particularly admire Stephenson's choice of date (6 August 1945), for which the slaves had been evaluating the algorithm.] On pages 1094-1097, Lawrence and Rudy von Hacklheber discuss both Arethusa and Azure/Pufferfish, which differ mainly in that Arethusa does not use the date. On one occasion, for "the long message", it used "his military serial number" from "Shaftoe's headstone". Lawrence also had many probable words in the message, to use as a crib in selecting the proper zeta-function algorithm.

This relationship between Azure/Pufferfish and Arethusa requires that we go back to the previous mentions of Azure. It first appeared as the handwritten pages, which Lawrence found when he opened the safe from U-553 (pages 375-382). They are described as: "The pages are ruled with faint horizontal and vertical lines, dividing each one into a grid, and the grids are filled in with hand-printed letters in groups of five." Sometime later, Lawrence arrives at Bletchley Park, and discusses the pages with Alan Turing (pages 418-429). The important technical material (page 424) includes Lawrence's statements: "But this message was different--it used thirty-two characters--a power of two--meaning that each character had a unique binary representation, five binary digits long." "So I converted each letter into a number between one and thirty-two, using the Baudot code." "If the first letter is R, and its Baudot code is 01011, and the second letter is F, and its code is 10111, ..." At the bottom of the page is a footnote: "Baudot code is what teletypes use. Each of the 32 characters in the teletype alphabet has a unique number assigned to it. This number can be represented as a five-digit binary number, that is, five ones or zeroes, or (more useful) five holes, or absences of holes, across a strip of paper tape. ..." Alan recognizes that the original pages were written by Rudolph von Hacklheber. Eventually, Alan convinces Lawrence that the pages make up a one-time pad, and not a message.

There is no clear statement that Arethusa uses the same message structure as Azure. The fake Arethusa messages, as quoted by Enoch, started with five-letter groups, but he did not go on to say that other characters also appeared. However, because Azure is effectively a simplification of Arethusa, this seems to be a reasonable assumption.

On pages 1114-1119, Lawrence decides to steal the Arethusa intercepts, and sets about to do so. He justifies this by the danger that he and Goto Dengo would face, if they were ever decrypted in the future. He intends to conceal the theft by replacing the cards with fakes: "It'll look like any other encrypted message, ..." Using the "Project X" encrypted telephone circuit, Alan Turing gives Lawrence the parameters of a particularly effective zeta function, for generating pseudo-random numbers. Lawrence then prepares his digital computer to generate the replacement cards. "In order to make the computer execute Alan's random number function, he even has to design a new circuit board on the fly, and solder it together."

Before he can start the computer, the last two Arethusa intercept sheets arrive, and he punches their information onto cards. "Waterhouse goes to the oven and takes out a brick of hot, blank ETC cards. He has learned that he must keep the cards hot, or else they wiil soak up the tropical humidity and jam the machinery; ..." "He takes those cards out of the puncher's output tray and places them neatly in the box with the cards containing all of the previous Arethusa messages. He then takes the entire contents of this box--a brick of messages about a foot thick--and puts them in his attache case."

The production of the replacement cards is described as: "He dumps a foot-thick stack of hot blank cards into the input hopper of the card punch." "Then he starts the program he has written, the one that generates random numbers according to Turing's function. Lights flash, and the card reader whirrs, as the program is loaded into the computer's RAM. Then it pauses, waiting for input: ..." "Waterhouse thinks about it for a moment, and then types in COMSTOCK. The card punch rumbles into action. The stack of blanks begins to get shorter. Punched cards skitter into the output tray." "He puts this stack of freshly punched cards into the box labeled ARETHUSA INTERCEPTS, and puts it back in its place on the shelf." He then burns all of the original intercept sheets.

Unfortunately, this set of events constitutes perhaps the worst breakdown of continuity in these four novels. The description of Randy's decryption of the intercepts would apply to Azure, but not to Arethusa. In particular, the long message, with the location of Golgotha, was encrypted using Bobby Shaftoe's USMC service number, not a date shortly after his funeral. Are we to believe the 1 chance in 100,000,000 that they exactly matched? The description of the cards that went into Lawrence's attache case does not match the description of the cards that came out of his trunk fifty years later. How could Lawrence have reconstructed the dates of the messages, to be written on the wrappers? He had burned the intercept sheets, which undoubtedly did include the dates.

In the quotation: "It'll look like any other encrypted message," the pronoun "it" suffers from 'indefinite antecedent'. If "it" refers to a single card, then any one individual fake card may indeed look like rather like an individual real card, in that they have the same format. However, if "it" refers to one complete message, then the description of the production of the fakes is inadequate to guarantee the truth of the quotation.

A real message consisted of a particular number of alphabetical characters. It may or may not have always ended with a group of the characteristic size (i.e., 5), depending upon how Rudy constructed his algorithm. It very probably did not consist of exactly enough groups to fill an integer number of ETC cards. Depending upon the format used, one card might hold 12 to 16 five-character groups. Rudy's algorithm was so labor-intensive, that the users would have been reluctant even to encrypt meaningless 'filler characters' (e.g., ZZZ), to fill out the last group. They would certainly have balked at encrypting 'filler groups'. Thus, the 'average' real message, transcribed to cards, would end with a last card that was about half blank. The 'average' fake message must end the same way, and the set of all fake messages should have last cards with a reasonable distribution of numbers of characters or groups. For the best apparent equivalence, each fake message should have exactly the same length as the corresponding real message.

Depending upon programming style, each card in a real message should have included a sequence number, as well as the characters of the message itself. Again, depending upon programming style, there could be a 'start' card in front of the first card of the message, or an 'end' card after the last card, or both. Such a 'control' card could well include other information from the intercept sheet, such as date, time, origin, length of message, etc., plus 'identifiers' so that the program would not try to treat it as a character card. Each fake message must have the same structure as a real message, and whatever that takes should be mentioned.

If "it" in that quotation means the entire set of fake messages, then there is yet another possible difference from the set of real messages. The real messages were intercepted over a period of months, and presumably each was trancribed to cards immediately. Thus each message had been exposed to high humidity for a different length of time, after being heated in the oven, and could have turned a slightly different color. The fake messages were all produced at the same time, and would all have been the same color.

The overall control of Lawrence's primitive computer seemed to follow the prescription: "Leave it turned on until it runs out of cards." That may have been appropriate for the underlying fixed-program machinery, which was intended to do simple functions such as sorting, tabulating, etc. However, my exposure (in 1956) to a first-generation, main-frame, digital computer, convinced me that such a prescription is quite inappropriate for a stored-program computer. The first important thing I learned in the introductory course, was to make sure that the computer would stop in the right places, even before I learned how to make it do anything useful in between.

In that era, the programs had to be written in the 'native language' of the computer. In particular, every transfer of quantities between the arithmetic registers and the input, output, or RAM, had to be specified by the programmer. For that particular computer, the input and output were on five-row paper teletype tape. During the hours when students were scheduled to use the computer, there would typically be 3 to 20 people waiting in line with their tapes, all able to see the operator's station. If your program tape fell to the floor instead of stopping in the tape reader, it was rather likely that your program would not work. If your data tape also fell to the floor, it was almost certain that your program would not work. On the other hand, if your program was running, but got into an endless loop, you would exceed the time allowed for each student program. (That was only a few minutes, enough for a few thousand arithmetic operations.) The operator would have to press the button to stop the computer, thereby guaranteeing that your program failed. All of the people in line would be either smirking, snickering out loud, or standing with eyes closed, thinking, "Please, God, don't let that happen to me!"

This experience with teletype allowed me to see that Stephenson's descriptions on page 424 (quoted above) include a remarkable number of misstatements. The most fundamental is that five-bit numbers run from 00000 (zero) to 11111 (thirty-one), and not from one to thirty-two. Teletype can handle only 26 encoded letters, and the other six codes are necessarily non-printing. Punctuation characters and decimal digits are produced by mechanically shifting the teleprinter from LETTERS mode to FIGURES mode. [See Wikipedia for one standard assignment of numerical codes: International Telegraphy Alphabet Number 2 (ITA2).] That complete "teletype alphabet" has 50 printable characters, rather than 32, but they are all encoded with only 26 binary numbers. That Wikipedia article does not give the original Baudot coding, and anything else should be called only "Baudot-like", rather than "Baudot code".

Moreover, the encoding can be changed, in the tape perforators and the teleprinters, at the convenience of the user. I did not realize, until doing this analysis, that the encoding, used at that computer center in the 1950s and '60s, was very different from ITA2. After all the intervening years, I can remember only 19 character codes (0 - 9, +, -, F, J, K, L, N, S, and NUL), and not a single one of them matched ITA2. Stephenson actually hinted at this, when he suggested that F has code 10111. In ITA2 it is 01101, and as the sixth letter in alphabetic order it could be thought of as 00110. I remember it to be 01110. Thus, there is no such thing as a "unique binary representation", even for the alphabet. Six more characters would have to be added 'by hand', because they cannot be transmitted in a single usable teletype mode, along with the alphabet. Even after 6 particular characters have been chosen from the 14 available in ITA2 (3003 distinct combinations), they can be paired up with the six available codes in 720 different ways.

I am not particularly concerned that Lawrence P. Waterhouse is credited here with inventing the stored-program computer, which is usually ascribed to John von Neumann, after World War II. I will accept it as an earlier independent invention, which Neal Stephenson was astute enough to notice. I am more concerned that Lawrence tried to do too many different things, all at the same time. It was bad enough that he had to solder together a new circuit he had designed. It is highly unlikely that he had designed a printed-circuit board ("circuit board" for short). They did exist in 1945, typically for mass production of electronic systems that had to function under mechanically stressful conditions. However, I can't believe that a facility to make them was available in the war-time Philippines. Instead, Lawrence would have built his 'one-off' circuit in and on a sheet-metal chassis. That was the way that I built amateur-radio equipment a few years later.

Science Lessons Inspired by Neal Stephenson

Saturday, April 17, 2010

Stealing and Decrypting The Arethusa Intercepts

No comments:

Post a Comment

Followers

Blog Archive

About Me