The making of BLUEPHRASE

Solutions to Common Writing Problems

Going beyond simple notational shorthand

by Joe Honton

This is the story of how I expanded BLUEPHRASE to become more than just a pretty HTML editor.

By the end of 2016 I had finished developing the first version of the BLUEPHRASE compiler. To recap: the language was capable of expressing the entire HTML language using semantax instead of tags, and shorthand notation for compactly assigning attributes.

In short, the system was usable for everyday writing tasks. And so I wrote. Notes. Blog posts. Slide presentations. System documentation. Chapters. Letters. It quickly became evident that the strength of the language was its versatility.

At the time, one of my pressing needs was documenting the very system that I was creating. There were intricacies that were not obvious, and I wanted to be sure to capture those important details before my working knowledge vanished. Writing that documentation using the very tool I was writing about was indeed meta. It was also a good chance to reflect on what I had accomplished. But curiously, the very act of writing that document revealed important opportunities for improvement to the language itself. Whenever I found myself documenting a limitation, I would turn the tables and look at it from the reader's perspective, often feeling that those limitations were unacceptable. It was like writing a manifesto for the future, but limiting its vision to the past. That was not good.

One of the important things I did in early 2017 was to develop a software module to read an HTML document, decompile it into memory structures, and write it out as a BLUE manuscript. This was a straightforward task requiring no hijinks: parsing HTML is easy because it's so regular.

Having both a compiler and a decompiler in my library meant that I could make lossless round-trips between BLUE manuscripts and HTML documents. The two were fully compatible. My early marketing efforts liked to point that out.

So what I saw as limitations to BLUEPHRASE were really limitations inherent in HTML. If I was going to make a great writing system, I had to be willing to let go of the concept of lossless round-trips. I had to allow myself to extend the language into fresh new territory.

The Summer of BLUE

The summer of 2017 was my time to think big. I call it my Summer of BLUE because it's when I extended the language to encompass new shorthand codes and new symbolic expressions. That season's work resulted in a cascade of new solutions. Some were mundane, such as escapes and Unicode characters. And some were copycat features, such as substitution variables and numeric sequencers. But others had a touch of brilliance: clones and junctors were new ideas that no one had yet heard of. Still others addressed emerging industry requirements, like accessibility and the Semantic Web. I also added support for the back-and-forth exchanges between the publishing house staff and the author. Finally, because of my experiences with server-side scripting languages, I designed support for xenomark syntax.

This covered a lot of professional realms all at once: web page developers, server-side programmers, publishing house staff, and of course the writer. Here are some of the thoughts and experiences that guided that summer's inspiration.

The mundane

Every computer language has a need for an escape sequence, which I realized was a glaring deficiency in my own work as soon as I began trying to document doppelmarks. An escape sequence merely tells the compiler that the next character should be treated as a regular character, not as a special symbol. Most computer languages use the reverse-solidus (backslash) \ as the escape character. This felt right to me so I implemented it in an afternoon, and never looked back.
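To make the escape rule concrete, here is a minimal Python sketch. It is not the BLUEPHRASE compiler's code; the function name `split_escapes` and the choice to return (character, escaped) pairs are assumptions made only for illustration.

```python
def split_escapes(text, escape="\\"):
    """Return (character, escaped) pairs for a line of manuscript text.

    Hypothetical helper: a character that follows the escape character is
    flagged as literal, so a later pass would not treat it as a symbol.
    """
    pairs = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == escape and i + 1 < len(text):
            # Drop the backslash and flag the next character as literal.
            pairs.append((text[i + 1], True))
            i += 2
        else:
            # Ordinary character (a trailing lone backslash lands here too).
            pairs.append((ch, False))
            i += 1
    return pairs
```

A later compiler pass could then ignore any symbol whose flag is set, which is all an escape sequence needs to accomplish.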

Interestingly, HTML has three characters that need special attention. Less-than < and greater-than > characters that appear within the compositional text need to be encoded into html-entities so that they are not mistaken by the browser as tag delimiters. And ampersands & need escaping because html-entities themselves start with an ampersand, and are always interpreted by the compiler as the start of such a sequence. When I wrote the first version of the BLUEPHRASE compiler I took care of this problem for the author, automatically identifying these situations and properly encoding the characters within the compiled output. I enjoyed the freedom of never having to consciously type &lt; or &gt; or &amp;. It was a bit of a let-down to discover that I still needed to escape my own creation when talking about term-marks << >> , and the other doppelmarks that were to come. Of course it was really just injured pride, that's all.
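The automatic encoding described above can be approximated with Python's standard library. This is only a sketch of the idea, not the compiler's own routine, and the wrapper name `encode_text` is hypothetical.

```python
import html


def encode_text(text):
    """Encode &, < and > in compositional text as HTML entities.

    quote=False leaves quotation marks alone, since only the three
    characters discussed above need attention in element content.
    """
    return html.escape(text, quote=False)
```

For example, `encode_text("a < b")` returns `a &lt; b`, so the author never has to type the entity by hand.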

Unicode sequences were also easy to implement. The standard keyboard has all the characters that I normally need, and when it doesn't I usually just resort to copy and paste. Since the BLUEPHRASE language natively uses the UTF-8 character encoding for manuscript files, the full set of Unicode characters was available for use. Still, I occasionally found the need for special punctuation marks that my keyboard didn't have, so a manual way of specifying them was necessary. Chief among these for me was the em-dash, which I like to use in my writing. I settled on using the percent-sign % as the compiler symbol to begin a Unicode sequence. When followed by two, four or six hexadecimal digits, the sequence was interpreted as a Unicode character. The compiler was smart about it though: if the percent sign was used in a phrase like "putting in 100%", then the percent character was not treated as a Unicode signal, and no special escaping was required.
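A rough Python sketch of that rule follows. The real compiler's disambiguation logic is not described here, so the "longest valid width wins" policy and the name `expand_unicode` are assumptions.

```python
import re

# A percent sign followed by a run of two to six hexadecimal digits.
_UNICODE_SEQ = re.compile(r"%([0-9A-Fa-f]{2,6})")


def expand_unicode(text):
    """Replace %XX, %XXXX or %XXXXXX with the matching Unicode character."""

    def _replace(match):
        digits = match.group(1)
        # Assumption: prefer the longest width (6, 4, then 2 digits) that
        # yields a valid code point; leftover digits stay as plain text.
        for width in (6, 4, 2):
            if len(digits) >= width:
                code_point = int(digits[:width], 16)
                if code_point <= 0x10FFFF:
                    return chr(code_point) + digits[width:]
        return match.group(0)

    return _UNICODE_SEQ.sub(_replace, text)
```

A percent sign with no hexadecimal digits after it, as in "putting in 100%", never matches the pattern and is left alone. A phrase such as "50%de" would still be misread by this sketch, which is the kind of case a production compiler has to handle more carefully.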

Copycat features

The idea behind substitution variables was rooted in the template languages that I regularly used in my professional life. It's a common practice to create HTML templates that contain standard items used across multiple web pages. This creates a consistency that is desirable. The usefulness of this template technique is increased when it's created using placeholders for text which varies from page to page. I felt it was important to add support for substitution variables, to fill this placeholder role. I settled on a syntax for declaring variables that consisted of a dollar-sign $, the variable name, an equals sign =, and the text value. Standard stuff. The syntax for using a variable was simply to type the dollar-sign prefixed variable name anywhere in the manuscript where it was needed.
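The declare-then-use pattern can be sketched in a few lines of Python. The details are assumptions: one declaration per line, names made of letters, digits and underscores, and unknown names left untouched. The function name `substitute` is hypothetical.

```python
import re

# Assumed shapes: "$name=value" on its own line declares; "$name" uses.
_DECLARATION = re.compile(r"^\$([A-Za-z_]\w*)=(.*)$")
_USAGE = re.compile(r"\$([A-Za-z_]\w*)")


def substitute(lines):
    """Collect variable declarations and expand their uses in later lines."""
    variables = {}
    output = []
    for line in lines:
        declared = _DECLARATION.match(line.strip())
        if declared:
            variables[declared.group(1)] = declared.group(2)
            continue  # declarations produce no output of their own
        # Unknown names are left as typed rather than replaced with nothing.
        output.append(
            _USAGE.sub(lambda m: variables.get(m.group(1), m.group(0)), line)
        )
    return output
```

Given the lines `$product=BLUEPHRASE` and `Welcome to $product.`, the sketch yields the single line `Welcome to BLUEPHRASE.`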

Numeric sequencers look and behave the same as substitution variables but have the added convenience of being incrementable. For example, if a sequencer currently has a value of N, you can change it to the value N+1 using increment syntax, which is simply two plus-signs ++ immediately after the variable name. This is such a common idiomatic expression that it needs no further explanation.
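As an illustration, the increment step might look like the following Python sketch. Whether the compiler emits the value before or after incrementing is not stated above; this sketch assumes the incremented value is emitted, and `apply_sequencers` is a hypothetical name.

```python
import re

# A dollar-prefixed name immediately followed by two plus-signs.
_INCREMENT = re.compile(r"\$([A-Za-z_]\w*)\+\+")


def apply_sequencers(text, counters):
    """Bump each "$name++" and replace it with the new value.

    `counters` maps sequencer names to their current integer values and is
    updated in place; an undeclared sequencer is assumed to start at 0.
    """

    def _bump(match):
        name = match.group(1)
        counters[name] = counters.get(name, 0) + 1
        return str(counters[name])

    return _INCREMENT.sub(_bump, text)
```

Starting from `{"fig": 0}`, the text `Figure $fig++ and Figure $fig++` becomes `Figure 1 and Figure 2`, which is how a sequencer can number figures or footnotes automatically.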

Clones

As I began using the language in technical documents, I soon found myself wrapping certain technical jargon in boilerplate styling. It was repetitive and boring work to type this out each time. Copy and paste was an alternative, but was clumsy and distracting. I wanted to keep my hands on the keyboard, and my thoughts on my writing, so I needed a different approach. The result that I came up with was clones.

Visually, a clone looks like a variable, that is, it consists of a dollar-sign followed by a sequence of characters. With substitution, the sequence of characters is the name of a variable declared and assigned somewhere in the manuscript. By contrast, with clones, the sequence of characters is an identifier assigned to a phrase. Using this scheme, any phrase that has an identifier (an id attribute), can become the source for a clone. When the clone syntax is applied anywhere else in the manuscript using that identifier, the source phrase and all its styling is cloned into the final document.

Through experimentation, I found cloning to be a powerful ally, one that allowed me to build and use my own vocabulary of terms and stock phrases, all consistently styled. The only refinement to the idea was the concept of inner clones and outer clones. What I originally designed became known as outer clones: everything, including the outer styling, was duplicated. What I felt I needed as well was the ability to clone only the inner content and nested phrases of the source. These became known as inner clones.
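The difference between the two kinds of clone is easy to see on a toy document tree. The `Phrase` class below is a hypothetical stand-in for the compiler's memory structures, and dropping the id from an outer clone (to avoid duplicate identifiers) is an assumption, not something stated above.

```python
from copy import deepcopy
from dataclasses import dataclass, field


@dataclass
class Phrase:
    """Hypothetical in-memory phrase: a tag, its attributes, its content."""
    tag: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)  # strings or nested Phrases


def outer_clone(source):
    """Duplicate the whole phrase, outer styling included."""
    clone = deepcopy(source)
    clone.attrs.pop("id", None)  # assumption: an id must stay unique
    return clone


def inner_clone(source):
    """Duplicate only the inner content and nested phrases."""
    return deepcopy(source.children)
```

With a source phrase `Phrase("span", {"id": "api", "class": "jargon"}, ["API"])`, an outer clone yields a styled span containing "API", while an inner clone yields just the content `["API"]` to be placed inside whatever phrase requested it.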

Junctors

The other mechanism that I wished for was a way to create bidirectional links. I found that I was often wanting to give my readers a hyperlink to another place in the same document. But just as often, I wanted to provide the reader with the convenience of jumping back to the place in the composition they just left. Of course, browsers have a back button for this very purpose, but that works only for links going from A to B. What I wanted was to also create a link from B to A, allowing the user to approach the material from either direction. Technical reference documents would be enhanced with this capability.

To implement this, I designed junctor syntax. This neologism was derived from the two words junction and functor, where a functor is C++ jargon used to describe an object that acts as a method. I later discovered quite accidentally that the word junctor is an old term used in analog telephone exchanges for attaching incoming and outgoing lines. I felt that to be a nice coincidence.

The syntax I settled on for junctors was to use a tilde ~ followed immediately by an arbitrary identifier. During compilation, phrases having junctor syntax are matched up and any two with the same identifier have hyperlinks created from each to the other.

The compiler uses the text of the source junctor as the text of the hyperlink, and it's underlined following the usual convention for hyperlinks. But as I soon discovered, not every junctor source has text. Also, oftentimes the text of a junctor might be a whole sentence, which obviously wasn't a good choice for underlining. To solve these problems, I modified the design to allow for prefix junctors and postfix junctors. These were designed to instruct the compiler to generate the hyperlink just before or just after the target phrase. The syntax for these uses a two-character sequence followed by the identifier: tilde less-than ~< for prefix junctors; tilde greater-than ~> for postfix junctors.

In order to keep my terminology precise I renamed the originally designed feature, the one that employed the tilde by itself, and began referring to it as an infix junctor.

Over time I discovered that junctor syntax provided less clutter and more convenience than traditional anchor-based hyperlinks, and found myself wanting to be able to use it for more than its originally designed purpose. In particular, I was using a pattern where the target phrase already had an identifier (an id attribute), and I wanted to create a simple one-way link to it. In this scheme the back link from B to A was not wanted. To implement this idea I modified the compiler to look for and match up junctor identifiers and regular identifiers. There was no need to define any additional language syntax to make it work. I called this new feature half-duplex junctors.

As I was implementing the half-duplex algorithm it occurred to me that sometimes an author might want to use the convenience of half-duplex junctor syntax on multiple phrases. For example, when phrase A has a regular identifier which is used as the destination for hyperlinks from phrases B, C and D. Later on, when working on the index-builder, this situation did indeed occur. To implement this I modified the algorithm again, calling this new feature a multiplex junctor. Again, no additional user syntax was needed.
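The three matching rules (paired junctors, half-duplex, multiplex) can be summarized in one small Python sketch. It is a simplified illustration, not the compiler's algorithm: phrases are plain dictionaries, and the keys `name`, `junctor` and `id` are hypothetical.

```python
from collections import defaultdict


def resolve_junctors(phrases):
    """Return (from, to) hyperlink pairs for a list of phrases.

    Each phrase is a dict with a 'name' and, optionally, a 'junctor'
    identifier and/or a regular 'id'.
    """
    by_junctor = defaultdict(list)
    by_id = {}
    for phrase in phrases:
        if phrase.get("junctor"):
            by_junctor[phrase["junctor"]].append(phrase["name"])
        if phrase.get("id"):
            by_id[phrase["id"]] = phrase["name"]

    links = []
    for identifier, names in by_junctor.items():
        if identifier in by_id:
            # Half-duplex (one source) or multiplex (several sources):
            # one-way links into the phrase carrying the regular id.
            links.extend((name, by_id[identifier]) for name in names)
        elif len(names) == 2:
            # Two junctors sharing an identifier link to each other.
            first, second = names
            links.extend([(first, second), (second, first)])
        # Anything else is unmatched and produces no link in this sketch.
    return links
```

Because the same loop covers all three cases, no extra syntax is needed: the presence or absence of a regular identifier decides which kind of link is produced.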

Emerging industry requirements

At this point, the language already had quite a few keyboard characters designated for special purpose uses, and I was reluctant to commit to any more. My goal had always been to make shorthand notation as terse as possible, and using single character symbols sat well with that goal. Unfortunately I could see that the number of available keys for that purpose was dwindling. I had already juggled symbols around to get to where I was, and any more would start to cramp things. I was thinking of several additional shorthand candidates that I wanted, but wasn't so sure of their general utility to the language's intended user base. Restraint was my mantra of the day. That was when I decided that it would be best to allow each user to define their own shorthand symbol assignments, without committing them to the general language, so I developed a way to extend the language using pragma definitions. I won't go into the story of pragmas here.

Nevertheless, despite the dwindling supply of symbol keys, I felt that it was the responsible thing to do to set aside two additional keys for accessibility and the Semantic Web.

The Web Accessibility Initiative has defined a set of keywords to be used by tools such as screen readers to assist people with visual impairments and other disabilities. Government bodies are beginning to mandate their usage on many websites, including both public and commercial sites. The shorthand symbol I designated for this was the plus-sign + which, when followed by a WAI-ARIA keyword, would be compiled into the final document as a role attribute.

The Semantic Web is a W3C (World Wide Web Consortium) initiative to build a web of data. Its goal is to extend the standard semantics of HTML elements to encompass domains that don't fit the article/book/publication paradigm that HTML was originally designed for. The initiative uses Resource Description Framework in Attributes (RDFa) properties to assign alternate meanings to an element's content. There is no master list of RDFa properties; rather, the initiative provides a way for domain experts to define their own ontologies. I won't go into the workings of that. As for shorthand, I chose the question-mark key ? as the notational symbol to be used for the Semantic Web, with the compiler generating property attributes from them.
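The two shorthand symbols map onto attributes in a straightforward way, as this Python sketch suggests. It assumes the shorthand has already been split into tokens; the function name `shorthand_to_attrs` and the sample values are illustrative only.

```python
def shorthand_to_attrs(tokens):
    """Translate accessibility and Semantic Web shorthand into attributes.

    A token starting with '+' becomes a role attribute; a token starting
    with '?' becomes a property attribute. Other tokens are ignored here.
    """
    attrs = {}
    for token in tokens:
        if token.startswith("+") and len(token) > 1:
            attrs["role"] = token[1:]
        elif token.startswith("?") and len(token) > 1:
            attrs["property"] = token[1:]
    return attrs
```

For instance, the tokens `+navigation` and `?dc:title` would yield `role="navigation"` and `property="dc:title"` on the compiled element.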

Back and forth exchanges

When an author works with an editor on a draft manuscript, the process will likely include a series of communications. Some of these will result in immediate corrections, while others may require a back and forth exchange as the two work to understand the issue and how to best resolve it. There may be hundreds of these corrections and exchanges happening all at the same time. Each of them needs to be placed in context, as close as possible to the actual words being changed.

This communication process was the target of my design for graynotes. My thinking on this problem brought back memories of the terrible way we went about this in the publishing software I was helping to build during the years 2013 to 2016. (Recall that that project collapsed, eventually ending in complete failure, adding urgency to this project's success.) The big mistake we made was moving all those back and forth exchanges into an external database and using arbitrary data attributes as reference pointers back to the relevant place in the manuscript. Shying away from that fraught approach, I instead decided that all such communications should be done in situ.

Mimicking the C and C++ comment syntax felt like the proper solution to this problem. I eventually ended up with four syntactic forms for what I began referring to as graynotes (because they weren't just comments).

The first form was exactly equivalent to C-style comments: everything between a matched pair of solidus-asterisk /* and asterisk-solidus */ was a remark. Authors, editors, reviewers, indexers, and anyone else working on the project could add remarks anywhere within the manuscript, and they could sign the remark with a full-stop . followed by their name or initials. The compiler would treat these remarks in one of two ways. During active review they would be converted into HTML comment syntax so that they were visible under the hood. During final document creation, they would be omitted completely, so that no one would know that they existed.

The second form behaved identically to the first, except that the delimiters were a matched pair of solidus-fullstop /. and fullstop-solidus ./. These were termed a reply. The intent was that two people communicating back and forth could use remarks for new topics and replies for follow up notes. A manuscript parser would be able to pull out these remarks and replies in order to build tools for publishers to track the progress of outstanding work.

The third form was for the special purpose of identifying places in the manuscript where external resources would be placed when they were eventually available. Since the art department usually lagged behind the author, the idea was for the author to use this special syntax as a placemark for the pending artwork. Having a dedicated syntax for this purpose would make it easier to track the artwork's progress and to keep the project on schedule. I chose solidus-questionmark /? and questionmark-solidus ?/ for this new feature and referred to these as placeholder graynotes. It wasn't until much later that I learned that the typesetter's bluemark for this was TK, an abbreviation of "to kum", meant to signal that the material was "to come". I stuck with my original nomenclature.

The final graynote form was the C++ style comment, which begins with a double solidus // and ends at the end of the line. This form was a favorite of mine because it was so quick to type. Also it is the only form that can be embedded within other graynotes, an important consideration when blocking out a long passage that's pending deletion. I referred to this form as a terminal comment, for obvious reasons.
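The four forms are regular enough that a tracking tool could locate them with a single pattern. The Python sketch below is an illustration under simplifying assumptions: it does not handle nesting, signatures, or a double solidus inside a URL, and it applies the review/final behavior described for remarks to all four forms. The names are hypothetical.

```python
import re

# One alternative per graynote form; group names label the form found.
_GRAYNOTE = re.compile(
    r"/\*(?P<remark>.*?)\*/"
    r"|/\.(?P<reply>.*?)\./"
    r"|/\?(?P<placeholder>.*?)\?/"
    r"|//(?P<terminal>[^\n]*)",
    re.DOTALL,
)


def find_graynotes(text):
    """List (form, content) pairs in manuscript order."""
    return [
        (m.lastgroup, m.group(m.lastgroup).strip())
        for m in _GRAYNOTE.finditer(text)
    ]


def render_graynotes(text, review):
    """Emit HTML comments during review; drop graynotes in the final copy."""

    def _replace(match):
        if not review:
            return ""
        content = match.group(match.lastgroup).strip()
        return f"<!-- {match.lastgroup}: {content} -->"

    return _GRAYNOTE.sub(_replace, text)
```

A publisher's progress tracker could be built on `find_graynotes` alone, for example by counting placeholder graynotes that are still outstanding.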

Server-side scripting

Finally, as a software developer, I have a long history with server-side scripting languages. Early on I tried my hand at Microsoft's Active Server Pages, long before it matured into ASPX. It was less than satisfying. Later on I found PHP to be better suited to the problems I wanted to work on, and it became my bread and butter for more than a decade. I mention these because they influenced my decision to design xenomark syntax.

The utility of BLUEPHRASE was obvious, and I didn't want to prevent it from being used by programmers working with server-side languages or browser framework libraries. In the early days, those languages often used HTML template files containing embedded language scripts placed between signal marks. Many of the most popular Web development languages use this technique, each with their own signal mark syntax. I added support for seven of these:

  • <% ASP, Java Server Pages, Ruby %>
  • <? PHP ?>
  • {{ Angular, AppML, wikitext }}
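One plausible way to support these signal marks is to set the embedded scripts aside before normal processing. The Python sketch below assumes that such spans are meant to pass through untouched, which is not stated above; the name `split_xenomarks` is hypothetical.

```python
import re

# The three signal-mark pairs listed above.
_XENOMARK = re.compile(r"<%.*?%>|<\?.*?\?>|\{\{.*?\}\}", re.DOTALL)


def split_xenomarks(text):
    """Split text into ('text', ...) and ('xenomark', ...) segments."""
    segments = []
    position = 0
    for match in _XENOMARK.finditer(text):
        if match.start() > position:
            segments.append(("text", text[position:match.start()]))
        # The embedded script is kept verbatim, delimiters included.
        segments.append(("xenomark", match.group(0)))
        position = match.end()
    if position < len(text):
        segments.append(("text", text[position:]))
    return segments
```

Only the segments tagged as text would then be handed to the rest of the compiler, leaving the foreign script for the server-side language or framework to interpret.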

Modern server-side design is moving away from this approach, and I expect this feature to become less important over time.


The Summer of BLUE was a rare confluence of creativity and productivity that I now recall as an inspirational rush. There were still more ideas churning, and it would take months to work through the backlog, but with these essential capabilities in place, I was writing more and enjoying it more than ever before.
