Moving forward, let's look at UTF-8. Unicode had over 100,000 characters to encode, which in a simple fixed-width scheme means about 32 bits (four bytes) per character. But English only needs the 8 bits ASCII uses, so storing English that way would leave a long run of zeros in every character you type, roughly 24 bits of wasted space each. Another problem: older systems that saw eight zeros in a row would treat them as a NUL and stop reading. And, as if that weren't bad enough, the whole thing had to be backwards compatible with ASCII.

So UTF-8 starts by encoding English exactly as ASCII does (if it ain't broke, don't fix it): the original 7-bit code, padded with a leading 0 to make 8 bits. Anything beyond that gets headers added to the bytes. For a two-byte character, the first byte starts with 110 (two 1's, two bytes), and the next byte starts with 10, a continuation flag; the remaining bits are filled in with the character's actual code. The pattern continues, adding one more 1 to the first byte's header for each extra byte, up to six bytes in the original design (a first byte starting 1111110; modern UTF-8 stops at four). Not only did this fix every problem posed by the rival encodings, the whole scheme was simple enough to fit on a napkin, which is exactly where Ken Thompson and Rob Pike sketched it in a New Jersey diner in 1992 when they invented UTF-8.
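To make that byte layout concrete, here's a minimal sketch of an encoder following those rules. Python is our choice for illustration, and `utf8_encode` is a hypothetical helper written for this post, not a real library function; in practice you'd just call Python's built-in `str.encode("utf-8")`, which the asserts check against.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point using the UTF-8 header scheme."""
    if code_point < 0x80:
        # ASCII range: one byte, leading bit 0 -- identical to ASCII.
        return bytes([code_point])
    elif code_point < 0x800:
        # Two bytes: 110xxxxx 10xxxxxx ("two 1's, two bytes").
        return bytes([
            0b11000000 | (code_point >> 6),
            0b10000000 | (code_point & 0b00111111),
        ])
    elif code_point < 0x10000:
        # Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0b11100000 | (code_point >> 12),
            0b10000000 | ((code_point >> 6) & 0b00111111),
            0b10000000 | (code_point & 0b00111111),
        ])
    else:
        # Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        # (modern UTF-8 stops here; the 1992 design allowed up to six).
        return bytes([
            0b11110000 | (code_point >> 18),
            0b10000000 | ((code_point >> 12) & 0b00111111),
            0b10000000 | ((code_point >> 6) & 0b00111111),
            0b10000000 | (code_point & 0b00111111),
        ])

assert utf8_encode(ord("A")) == "A".encode("utf-8")            # 1 byte, same as ASCII
assert utf8_encode(ord("é")) == "é".encode("utf-8")            # 2 bytes
assert utf8_encode(ord("€")) == "€".encode("utf-8")            # 3 bytes
assert utf8_encode(0x1F600) == "\U0001F600".encode("utf-8")    # 4 bytes
```

A nice side effect of the headers: because a continuation byte always starts with 10 and a leading byte never does, a reader dropped into the middle of a stream can skip forward to the next leading byte and resynchronize, instead of misreading everything that follows.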
However, this wasn't Ken's first bodge. In 1974, a friend of his, Lee E. McMahon, was trying to analyze the text of the Federalist Papers. The ed editor Ken had developed for Unix supported regular expressions, but only within a file it could load, and it couldn't cope with a text that size, so it was no help to McMahon. So overnight, Ken took the regular-expression code out of ed and bodged it into a standalone tool of its own. Its name comes from the ed command g/re/p, which globally searches for a regular expression and prints the matching lines: grep, now a standard Unix command.