The making of BLUEPHRASE

Symbolic Notation for Writers

Designing a language for expressing an author's intentions

by Joe Honton Sep 16, 2019

This is the story of how I designed the core features of the BLUEPHRASE computer language for solving problems related to authoring, editing and publishing.

I have already written about the problems I faced as a writer, and the annoyances that led me to favor distraction-free authoring tools. To summarize, I reached a decision to create a writing system that worked for me, one that was compatible with HTML, but one that didn't have the awkward keystrokes and cluttered tags that made markup languages so unbearable.

I began the design under a project codenamed BLUEML. This was in homage to the convention followed by hot metal typesetters who used blue pencils when marking a manuscript with printer's notes. Blue was used so that there was no confusion over any remaining red penciled corrections made by the copy editor.

In that first year, as I was proving out the design, I referred to the project using the somewhat pompous neologism symbolic endophrasing. That was my way of describing a system where everything was turned inside out. Instead of placing headings and paragraphs within syntactic guard-rails the way markup languages do, I inverted the paradigm, embedding symbols and instructions within the composition itself. The ML part of the project codename was for markup language. I used this name for over a year before giving it any more thought. Eventually, though, I realized that what I had been designing and developing wasn't markup (and wasn't Markdown); it was more akin to the blue and red pencil marks of typesetters and copy-editors. I was creating a notation.

Guiding Principles

My work on this was guided by just a few key principles, which I've adhered to even as the implementation details of shorthand symbols and doppelmarks and pragmas emerged and evolved.

First on the list was that no special editing software should be required. That meant that even the simplest of notepad-style text editors should be able to create, read, and change a manuscript. And by extension, the files used for saving manuscripts should not be binary-encoded; that is, they should be plain text files. Also, since the 26 letters of the English alphabet encompass just a tiny fraction of the world's written scripts, the file's character set should be universal.

Second, the notation being developed should not prevent me from using the full power of HTML. So unlike Markdown and wikitext, which have a limited set of styling and structuring choices, there should be a way to use every possible HTML element without exception. This principle stemmed from my desire to be able to write documents and Web pages and ePubs that required no further post-processing or cleanup once I'd completed my writing.

Third, I should be able to write with just a keyboard, keeping my hands on home row. And of course, all shorthand marks should be readily typable on a standard keyboard. This principle was perhaps more personal than the first two, but as many professional software developers will attest, the constant fiddling with a mouse for styling and editing slows things down and breaks the creative flow, something I was trying hard to avoid.

Finally, to make this all feasible, I wanted to follow a usage pattern similar to what I was used to in my professional work: an iterative cycle of writing, proofreading, and correcting. With this in mind, I wanted a design where manuscripts could be compiled into documents for both proofreading and final publication. My goal was for a compilation step that was both instantaneous and invisible.

Three Essential Constructs

Adhering to my first principle meant that my notations to the compiler would be visible within the manuscript itself. So any instructions to control how the document is structured would be sprinkled throughout the composition. This seemed to run counter to the goal of being distraction-free. To alleviate this, I experimented with ways to shorten the syntax used by HTML tags. With some algorithmic sleight-of-hand, I was able to remove the pair of less-than/greater-than marks that normally envelop a tag, leaving only the HTML element name. I coined the term semantax to describe this naked tag, a play on the two words semantic and syntax. At the same time I was able to completely eliminate the closing tag. How this was possible I'll describe in a minute, but before going there, I want to pause and admire the simplicity of what remained: just a short mnemonic semantax and the composition. No less-than, greater-than or solidus (slash), and no duplicate tagname. The result was easy to write and very readable.

From a programmer's perspective, was this easy to do? No. But from an author's perspective it kept my first principle intact while being as close to distraction-free as possible. The magic behind this was based on a well-known dichotomy that anyone familiar with HTML will recognize: inline elements versus block elements. Inline elements are used for inflection, voice, and meaning; elements such as <i> for italics, <q> for quotation, and <sup> for superscript are examples. Block elements are used for structure, lists, and flow; elements such as <section> for contextual groupings, <ol> for ordered lists, and <aside> for separate thoughts are examples.

One of the important differences in what I designed was that the Enter ⏎ key was restored to its original purpose: it signaled the end of a line. Markup languages treated that character as whitespace, which in most cases meant that it collapsed into nothingness. In contrast, I chose to interpret it to mean "the end of an element", so a line that starts with semantax, then continues with the author's composition, and finally ends with an Enter ⏎, could be parsed into a distinct HTML element. I called this a basic phrase. Basic phrases became one of the three essential constructs that formed the core of the emerging language.
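
To make the idea concrete, here is a minimal Python sketch of how a basic phrase might be turned into an HTML element. It is not the BLUEPHRASE compiler; the function names are hypothetical, and the sketch assumes that the semantax is simply the first word on the line and that everything after it is composition.

```python
import html


def compile_basic_phrase(line: str) -> str:
    """Turn one 'semantax + composition' line into an HTML element."""
    # Assumption: the first space-delimited word is the element name
    # (the semantax); the remainder of the line is the composition.
    semantax, _, composition = line.strip().partition(" ")
    return f"<{semantax}>{html.escape(composition)}</{semantax}>"


def compile_manuscript(text: str) -> list[str]:
    """Treat every non-blank line as one basic phrase."""
    # The Enter key (a newline) marks the end of each element.
    return [compile_basic_phrase(line) for line in text.splitlines() if line.strip()]


manuscript = "h1 Symbolic Notation for Writers\np Blue pencils were used by typesetters."
for element in compile_manuscript(manuscript):
    print(element)
```

Because the newline closes the element, no closing tag ever has to appear in the manuscript.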

A different construct was needed in order to represent a collection of basic phrases. Such a construct would be needed, for example, when a list of items was being enumerated: the individual items would still be basic phrases, but some sort of wrapper would be needed to syntactically bind them together. Markup languages accomplished this with opening and closing tags, whose role was to demarcate the wrapper's starting and ending boundaries. Since I had already abandoned tags, I chose instead to use left { and right } curly braces to demarcate the boundaries. This is a common idiom used in many programming languages, and is familiar to a lot of people. Just as important, though, curly braces are not often used as punctuation, so there was no difficulty for the compiler to recognize these marks for what they were. This new construct matched the block element role described above. I called this second essential construct a container phrase.
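
The nesting that curly braces provide can be sketched with a small recursive parser. This is an illustration in Python, not the actual compiler: it assumes that a container opens on a line ending with { and closes on a line containing only }, and the names parse_phrases and to_html are hypothetical.

```python
def parse_phrases(lines):
    """Parse lines into (semantax, content) tuples.

    content is a string for a basic phrase, or a list of child
    phrases for a container phrase.
    """
    def parse_block(index):
        phrases = []
        while index < len(lines):
            line = lines[index].strip()
            index += 1
            if not line:
                continue
            if line == "}":
                # Assumed convention: a lone right brace closes the container.
                return phrases, index
            if line.endswith("{"):
                # Assumed convention: 'semantax {' opens a container.
                children, index = parse_block(index)
                phrases.append((line[:-1].strip(), children))
            else:
                semantax, _, composition = line.partition(" ")
                phrases.append((semantax, composition))
        return phrases, index

    phrases, _ = parse_block(0)
    return phrases


def to_html(phrases, depth=0):
    """Render the parsed tree as indented HTML lines."""
    pad = "  " * depth
    out = []
    for semantax, content in phrases:
        if isinstance(content, list):
            out.append(f"{pad}<{semantax}>")
            out.extend(to_html(content, depth + 1))
            out.append(f"{pad}</{semantax}>")
        else:
            out.append(f"{pad}<{semantax}>{content}</{semantax}>")
    return out


manuscript = """
ol {
    li Write
    li Proofread
    li Correct
}
"""
print("\n".join(to_html(parse_phrases(manuscript.splitlines()))))
```

The sketch does no error handling; an unbalanced brace would need a diagnostic in anything beyond a demonstration.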

This still left the need for a way to represent inline elements. I chose to do this using doppelmarks, a term I applied to any sequence of two repeated characters. Eventually the language would define six types of doppelmarks, each fulfilling a different role, but for now I just needed one. Inline elements began with two less-than symbols << and ended with two greater-than symbols >>; in between were the semantax and the compositional text. I called this third essential construct a term phrase.
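
A term phrase can be recognized with a simple stack, which also copes with terms nested inside terms. The following Python sketch is only an approximation under stated assumptions: the semantax is taken to be the first word after the opening doppelmark, and the function name compile_terms is hypothetical.

```python
import re


def compile_terms(text: str) -> str:
    """Replace <<semantax composition>> term phrases with inline HTML elements."""
    # Each stack level collects the output pieces for one nesting depth.
    stack = [[]]
    for token in re.split(r"(<<|>>)", text):
        if token == "<<":
            stack.append([])
        elif token == ">>" and len(stack) > 1:
            inner = "".join(stack.pop())
            # Assumption: the first word inside the doppelmarks is the semantax.
            semantax, _, composition = inner.partition(" ")
            stack[-1].append(f"<{semantax}>{composition}</{semantax}>")
        else:
            stack[-1].append(token)
    # Unbalanced doppelmarks are not handled carefully in this sketch.
    return "".join("".join(level) for level in stack)


print(compile_terms("He called it <<i symbolic endophrasing>> at first."))
print(compile_terms("<<b bold <<i both>>>>"))
```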

So, three phrase patterns: basic, container, and term.

These three formed the essential core of the new language, allowing any HTML element to be represented. In addition, they could be nested, allowing for the creation of containers within containers, or terms within terms.

Serendipitously, nesting provided a useful hierarchy that the compiler could make use of. This led to implied semantax. That idea was rooted in the observation that many of the most commonly used container phrases would nearly always consist of the same type of basic phrase. For example, a <table> element usually had subordinate <tr> elements, and they in turn would normally have subordinate <td> elements. When patterns like this occurred often enough, I added instructions to the compiler to recognize the pattern and to predict the expected semantax. This meant that I could omit the semantax within the manuscript itself, providing less visual clutter and easier reading.
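
One way to picture implied semantax is as a lookup from a parent element to its most likely child. The Python sketch below works on an already-parsed tree in which an omitted semantax is represented as None. The tree shape, the IMPLIED_CHILD table, and the function name are all assumptions made for illustration; only the table, tr, td pattern comes from the description above, and the list entries are added as a guess.

```python
# Parent semantax -> semantax assumed for a child that omits its own.
IMPLIED_CHILD = {
    "table": "tr",
    "tr": "td",
    "ol": "li",  # assumed for illustration, not stated in the article
    "ul": "li",  # assumed for illustration, not stated in the article
}


def resolve_implied(phrases, parent=None):
    """Fill in missing (None) semantax based on the enclosing container."""
    resolved = []
    for semantax, content in phrases:
        if semantax is None:
            if parent not in IMPLIED_CHILD:
                raise ValueError(f"no implied semantax inside {parent!r}")
            semantax = IMPLIED_CHILD[parent]
        if isinstance(content, list):
            # Recurse, so that nested containers see their own parent.
            content = resolve_implied(content, parent=semantax)
        resolved.append((semantax, content))
    return resolved


table = [("table", [
    (None, [(None, "Symbol"), (None, "Meaning")]),
    (None, [(None, "*"), (None, "generic attribute")]),
])]
print(resolve_implied(table))
```

An explicitly written semantax is left untouched, so an author can still override the prediction wherever the common pattern does not apply.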

Notational Shorthand

At the same time that I was figuring out how to get rid of tags, I was also experimenting with notational shorthand. At a minimum, I needed a way to assign attributes to elements. In markup languages this is done by declaring and assigning attributes within the element's opening tag. But with opening tags reduced in my system to a simple mnemonic, a new approach was needed. I settled on the concept of shorthand to accomplish this. This hearkened back to work I had done in 1981 on a rudimentary system for making hotel reservations, where the hardware and its human interface were so limited that single-character codes were employed as a command language. This 35-year-old scheme was resurrected and became the inspiration for declaring and assigning directives, the jargon I used within my system to mimic the concept of attributes.

The syntax was simple: the keyboard's asterisk key * was designated as the shorthand symbol. This was followed immediately by a directive name, an equals sign, and the directive's value. When sandwiched between the bare semantax and the compositional text, it was interpreted by the compiler as an attribute, and was assigned to the phrase. Any number of directives could be applied to a phrase before the compositional text.
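
As a rough illustration, the directive syntax could be parsed as follows. This Python sketch assumes that directive values contain no spaces and that the first word not shaped like *name=value begins the composition; compile_phrase is a hypothetical name, not part of BLUEPHRASE.

```python
import html


def compile_phrase(line: str) -> str:
    """Compile 'semantax *name=value ... composition' into an HTML element."""
    words = line.strip().split(" ")
    semantax, rest = words[0], words[1:]
    attributes = []
    # Consume leading words shaped like *name=value; the first word that
    # does not match is taken to start the composition.
    while rest and rest[0].startswith("*") and "=" in rest[0]:
        name, _, value = rest.pop(0)[1:].partition("=")
        attributes.append(f' {name}="{html.escape(value)}"')
    composition = " ".join(rest)
    return f"<{semantax}{''.join(attributes)}>{composition}</{semantax}>"


print(compile_phrase("p *lang=fr *title=Greeting Bonjour tout le monde"))
```

An asterisk that appears later in the composition, as in "2 * 3 = 6", is left alone because it does not sit between the semantax and the text.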

The decision to use an asterisk as the shorthand symbol wasn't my first choice; in early versions of the language I tried out other symbols (each of those alternatives being symbols that were located on the number keys at the top of the keyboard). The asterisk turned out to be the least intrusive visually, an important consideration for readability.

In my early experimental use of the language I soon realized that some HTML attributes appeared more frequently than others, and this made me want to create a larger set of notational shorthand that went beyond that single asterisk. Three of the most important ones related to CSS. These were the attributes style, class, and id.

The first of these is used to directly apply CSS properties to an element. The other two are used to indirectly apply CSS to an element based on selectors declared in external files. The CSS language itself specifies a selector syntax that uses a full-stop . for class and a hashtag # for id. I adopted this same symbology for my shorthand notation. I chose the caret ^ as the shorthand symbol for style.

Eventually I would define eight distinct shorthand symbols, but for now I had just four:

* an asterisk for generic attributes
^ a caret for styles
. a full-stop for classes
# a hashtag for identifiers
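
The four symbols can be pictured as a small dispatch on the first character of each word. The Python sketch below is an assumption-laden illustration: it supposes that all four kinds of shorthand sit between the semantax and the composition (as described for the asterisk), that a style is written as ^property:value without spaces, and that several classes or styles may be given. The function name split_shorthand is hypothetical.

```python
def split_shorthand(line: str):
    """Split a phrase into (semantax, attributes, composition)."""
    semantax, *words = line.split()
    attrs, classes, styles = {}, [], []
    while words:
        symbol, body = words[0][:1], words[0][1:]
        if symbol == "." and body:
            classes.append(body)          # .name  -> class
        elif symbol == "#" and body:
            attrs["id"] = body            # #name  -> id
        elif symbol == "^" and body:
            styles.append(body)           # ^prop:value -> style (assumed form)
        elif symbol == "*" and "=" in body:
            name, _, value = body.partition("=")
            attrs[name] = value           # *name=value -> generic attribute
        else:
            break  # the first non-shorthand word starts the composition
        words.pop(0)
    if classes:
        attrs["class"] = " ".join(classes)
    if styles:
        attrs["style"] = ";".join(styles)
    return semantax, attrs, " ".join(words)


print(split_shorthand("p .intro #opening ^color:navy *lang=en Blue was used for notes."))
```

A real compiler would need a rule for compositions that happen to begin with one of these symbols; the sketch ignores that case.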

Lastly, one other very important set of attributes was still waiting to be addressed. These were the attributes used to specify external URLs, paths, and filenames. In HTML, there are several distinct keywords used for this.

A hyperlink specified in an <a> element uses the href attribute.
A reference to an external picture in an <img> element uses the src attribute.
A server endpoint for processing a <form> uses the action attribute.
An external citation for a <blockquote> element uses the cite attribute.
An embedded app specified in an <object> element uses the data attribute.

I found this inconsistency to be a constant source of irritation because I couldn't keep them clear in my mind — they were just too arbitrary. My solution was to use one type of shorthand notation for all of them, and to let the compiler figure out which was which. The notation I used for these was a pair of grave-accent ` symbols delimiting the enclosed path. I referred to this as sourceref notation.

Unlike the other shorthand symbols, which consisted of a single character, sourcerefs used two characters. This was intentional. Paths and filenames sometimes contained spaces which made it impossible to distinguish between attribute value and compositional text. URLs had even more troublesome characters to deal with. Using the grave-accent as both the notational shorthand symbol and the delimiter made everything clear.
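
Letting the compiler choose the attribute amounts to a lookup keyed on the element name. The Python sketch below uses the five element and attribute pairs listed earlier; everything else is an assumption for illustration, including the function name resolve_sourceref and the placement of the sourceref directly after the semantax.

```python
import re

# Element name -> the HTML attribute that carries its URL or path.
SOURCEREF_ATTRIBUTE = {
    "a": "href",
    "img": "src",
    "form": "action",
    "blockquote": "cite",
    "object": "data",
}

# A sourceref is whatever sits between a pair of grave accents.
SOURCEREF = re.compile(r"`([^`]*)`")


def resolve_sourceref(line: str):
    """Return (semantax, attributes, composition) for one phrase."""
    semantax, _, rest = line.strip().partition(" ")
    match = SOURCEREF.match(rest)
    if match is None:
        return semantax, {}, rest
    if semantax not in SOURCEREF_ATTRIBUTE:
        raise ValueError(f"no sourceref attribute known for {semantax!r}")
    attribute = SOURCEREF_ATTRIBUTE[semantax]
    composition = rest[match.end():].strip()
    return semantax, {attribute: match.group(1)}, composition


print(resolve_sourceref("a `https://example.com/my page` Read more"))
print(resolve_sourceref("img `figures/blue pencil.png`"))
```

Because the path is delimited on both sides, the space inside "my page" or "blue pencil.png" is never mistaken for the start of the composition.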

These simple ideas were all I needed to be productive: no tags obscuring the composition, naked semantax, three types of phrasing, and shorthand notation.

Over time the concept of semantax would expand to include user-defined semantax and custom vocabularies. The concept of shorthand notation would lead to new symbols for accessibility and extended semantics, plus new capabilities to allow user-defined shorthand. And the language would be greatly extended with the introduction of new solutions to common writing problems such as manuscript organization, variable substitution, sequencing, interior landmarks, and interlinking.

But that's another story.