The making of BLUEPHRASE

Dialecting the Language

Setting up the BLUE authoring environment

by Joe Honton Sep 21, 2019

This is the story behind the options that BLUEPHRASE offers for customizing the language to better fit different writing needs.

As I used BLUEPHRASE for more of my daily writing tasks, I discovered that some of its features were actually getting in the way. This was not good. I needed to simplify. Recall that simple distraction-free writing was the main impetus for creating the language in the first place.

To understand this, let me briefly explain how a BLUEPHRASE manuscript becomes a finished document. Loosely, what happens under the covers is that the compiler scans a manuscript one character at a time looking for semantax, notational shorthand, and doppelmarks. When they are encountered, the compiler suspends its normal text processing rules and begins processing characters, interpreting them as instructions. This all happens with what I had been calling smart-tech. But despite its name, this isn't magic: there are very precise rules for every situation and nothing is ever ambiguous or subjective.

So why was I feeling that features were getting in the way? It was mostly a matter of context. Remember that each of the three core phrase patterns (basic, containers, terms) is composed of semantax, attributes, and the author's words. Also remember that semantax can be omitted and implied by its surrounding context. So the BLUEPHRASE tokenizer works extra hard at the beginning of a phrase to determine whether semantax is present and what it is, and whether optional attributes are present and what they are. Once that is done, the tokenizer can merrily read and write the author's words, echoing them one for one. This proceeds until the end of the phrase or the start of a nested phrase is encountered. Well, almost. This read/write echoing is also interrupted by doppelmarks, xenomarks, and graynotes. So sequences such as [[ ... ]] or <% ... %> or <* ... *> will be recognized and interpreted according to their own special rules.

Doppelmarks, xenomarks, graynotes

What I wanted was a way to instruct the tokenizer to ignore certain shorthand notations and marks when I didn't need them. For example, if I'm using BLUEPHRASE to write a nonfiction book that has a backmatter index, I need [[ ... ]] to be honored, but otherwise I don't. If I'm writing a server-side script for Java Server Pages, I need <% ... %> to be honored, but otherwise I don't. And if I'm writing a JSON data structure, I don't want any of the shorthand marks * ` # . ^ + ? ~ to be honored.

I implemented all of these, and more, using command line options. When the BLUEPROCESSOR was invoked, it could be given commands to honor or ignore each doppelmark, xenomark, or shorthand notational symbol.
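The honor-or-ignore idea can be illustrated with a toy scanner in Python. This is only a sketch, not the BLUEPROCESSOR's actual code: the `MARKS` table, the mark names, and the `scan` function are hypothetical, and treating `<* ... *>` as a graynote is an assumption based on the order in which the marks were listed earlier.

```python
# Hypothetical mark table: name -> (opening delimiter, closing delimiter).
# The assignment of "<* ... *>" to graynotes is an assumption.
MARKS = {
    "indexmark": ("[[", "]]"),
    "xenomark": ("<%", "%>"),
    "graynote": ("<*", "*>"),
}


def scan(text, honored):
    """Split text into ("text", s) and (mark_name, inner) tokens.

    Only marks named in `honored` are recognized. Everything else,
    including ignored or unterminated marks, is echoed one for one.
    """
    tokens = []
    buffer = []
    i = 0
    while i < len(text):
        for name, (opener, closer) in MARKS.items():
            if name in honored and text.startswith(opener, i):
                end = text.find(closer, i + len(opener))
                if end != -1:
                    if buffer:
                        tokens.append(("text", "".join(buffer)))
                        buffer = []
                    inner = text[i + len(opener):end].strip()
                    tokens.append((name, inner))
                    i = end + len(closer)
                    break
        else:
            # No honored mark starts here: echo the character unchanged.
            buffer.append(text[i])
            i += 1
    if buffer:
        tokens.append(("text", "".join(buffer)))
    return tokens


if __name__ == "__main__":
    # Only indexmarks are honored, so the <% ... %> sequence stays plain text.
    print(scan("See [[apples]] and <% x %>", {"indexmark"}))
```

With only `indexmark` honored, the `<% x %>` sequence passes through as ordinary text, which is the behavior wanted when a manuscript has no server-side scripting in it.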

The doppelmarks that I targeted with these conditional processing rules were listmarks, citemarks, glossmarks, notemarks, and indexmarks. Each could be turned on or off independently.

In a similar fashion, xenomark syntax could be turned off completely or set to use any of the three flavors of server-side scripting tags: JSP/ASP/Ruby, or PHP, or Angular/AppML/wikitext.

Graynote syntax was handled differently. I chose not to allow it to be fully disabled, but instead allowed remarks, replies and terminal comments to be kept or discarded from the final document that was produced. This would accommodate two phases of a manuscript's development: active editing where they would be written into the document using HTML comment syntax, and final production where they would be completely omitted from the document.

Placeholder syntax was special. Unlike the other graynotes, I added a special option to allow the text of the placeholder to be written to the output document if desired. It could be written as regular compositional text or as an HTML comment. This was to make it easier for proofreaders to know that something was pending and not simply missing.
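The keep-or-discard behavior for graynotes, and the extra choice for placeholders, might look something like the following Python sketch. The function name, the `kind` labels, and the `placeholder_as` values are all hypothetical; only the use of HTML comment syntax comes from the description above.

```python
def render_graynote(kind, text, keep_notes=True, placeholder_as="comment"):
    """Return the output produced for one graynote.

    kind:           "remark", "reply", "terminal" or "placeholder"
                    (labels assumed for this sketch).
    keep_notes:     True during active editing, False for final production.
    placeholder_as: "text", "comment" or "discard" (hypothetical values).
    """
    if kind == "placeholder":
        if placeholder_as == "text":
            return text                      # regular compositional text
        if placeholder_as == "comment":
            return f"<!-- {text} -->"        # visible only in the source
        return ""                            # omitted entirely
    # Remarks, replies and terminal comments: kept as comments or dropped.
    return f"<!-- {text} -->" if keep_notes else ""


if __name__ == "__main__":
    print(render_graynote("remark", "check this date"))
    print(render_graynote("placeholder", "photo pending", placeholder_as="text"))
```

A real implementation would also need to guard against note text containing `--`, which is not allowed inside an HTML comment. The sketch leaves that out.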

Initially these options could only be specified on the command line, when the BLUEPROCESSOR was invoked. Later on I brought this same set of switches into the manuscript so that the choice to honor/ignore could be specified on a manuscript-by-manuscript basis. I did this using a new !option pragma, which used precisely the same syntax as its command-line equivalent: two dashes -- followed by a keyword. This had the additional advantage that the marks could be conditionally honored/ignored for smaller blocks of text, such as coding samples, by turning the option off and back on for limited stretches of the manuscript.
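To illustrate turning an option off and back on for a limited stretch of a manuscript, here is a small Python sketch. The keywords `--marks-off` and `--marks-on` are invented for this example and are not real BLUEPHRASE options. The exact layout of a pragma line (`!option` followed by the keyword) is also assumed.

```python
def apply_options(lines):
    """Pair each manuscript line with the toggle state in force for it.

    Lines of the form "!option --keyword" change the state and are not
    passed through to the output.
    """
    honored = True
    result = []
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("!option "):
            keyword = stripped.split(None, 1)[1]
            if keyword == "--marks-off":     # hypothetical keyword
                honored = False
            elif keyword == "--marks-on":    # hypothetical keyword
                honored = True
            continue
        result.append((line, honored))
    return result


if __name__ == "__main__":
    manuscript = [
        "Prose with an index entry [[apples]]",
        "!option --marks-off",
        "matrix[[0]] = 1   (a coding sample, left untouched)",
        "!option --marks-on",
        "More prose",
    ]
    for line, honored in apply_options(manuscript):
        print(honored, line)
```

The coding sample in the middle is the only line processed with marks ignored, so its double brackets would be echoed literally rather than treated as an indexmark.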

Alternate outputs

By default, BLUEPHRASE manuscripts are compiled into HTML5. This choice is ideal for web pages. But manuscripts that are being prepared for publishing as ePubs need a few special instructions added in order to comply with that specification. I added the --html-target option to accommodate this.

Another default option for BLUEPHRASE manuscripts is the automatic inclusion of required HTML tags. In light of my goal to keep boilerplate stuff to a minimum, I wanted to be able to create a valid "Hello World!" document by simply typing those words, and nothing else. Initially this produced a document that was not valid HTML: it was missing the <html>, <head>, <title> and <body> tags that all HTML documents require. I modified the compiler to automatically add these if they were missing. Later on I realized that there were valid cases where an HTML fragment was wanted, so I implemented the --nofragment option to skip any such checking.
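As a rough Python illustration of the automatic wrapping, the sketch below adds the required tags when they are missing and leaves the content alone when a fragment is wanted. The function, its `fragment` parameter, and the default title are assumptions made for the example. The real compiler works on parsed structures rather than on strings.

```python
def ensure_document(content, fragment=False, title="Untitled"):
    """Wrap content in the tags every HTML document requires.

    fragment: when True, return the content untouched (stands in for
              the fragment option described above).
    title:    placeholder title; what the real compiler uses is not
              stated in the text, so this default is an assumption.
    """
    if fragment or "<html" in content.lower():
        return content
    return (
        "<!DOCTYPE html>\n"
        "<html>\n"
        "<head>\n"
        f"<title>{title}</title>\n"
        "</head>\n"
        "<body>\n"
        f"{content}\n"
        "</body>\n"
        "</html>"
    )


if __name__ == "__main__":
    print(ensure_document("Hello World!"))
```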

One of the unresolved debates in the computer world is over the use of tabs and spaces. This seemingly innocuous choice can create a firestorm of vitriol when someone doesn't get their way. I added the --indent option to forestall any such abuse.

As everyone knows, HTML is very much like XML. I had this in mind from the very outset of the project. I wanted to be able to create arbitrary XML documents using BLUEPHRASE notation. Because the compiler parses the manuscript into memory structures, this was easily accomplished. I wrote an emitter for HTML and another one for XML. This worked beautifully.

Based on that success, I then proceeded to write other emitters allowing BLUEPHRASE to be used as the source for a wide variety of purposes. These included wikitext, Markdown, GitHub-flavored Markdown, text, canonical BLUEPHRASE, JSON, YAML, HAML, TOML, PList, and INI files. These alternate outputs could be requested using the --emit option.
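The reason multiple emitters come cheaply once the manuscript is held in memory can be seen in a short Python sketch. The `Node` class, the two emitter functions, and the `EMITTERS` table are hypothetical stand-ins, not the compiler's real structures. The sketch also simplifies: a node with children ignores its own text, and repeated child names collapse in the JSON form.

```python
import json
from dataclasses import dataclass, field
from xml.sax.saxutils import escape


@dataclass
class Node:
    """One parsed phrase: a name, optional text, optional children."""
    name: str
    text: str = ""
    children: list = field(default_factory=list)


def emit_xml(node, indent="  ", depth=0):
    """Serialize the tree as indented XML."""
    pad = indent * depth
    if not node.children:
        return f"{pad}<{node.name}>{escape(node.text)}</{node.name}>"
    inner = "\n".join(emit_xml(c, indent, depth + 1) for c in node.children)
    return f"{pad}<{node.name}>\n{inner}\n{pad}</{node.name}>"


def to_plain(node):
    """Convert a node into plain dicts and strings for JSON output."""
    if not node.children:
        return node.text
    return {child.name: to_plain(child) for child in node.children}


def emit_json(node):
    """Serialize the same tree as JSON."""
    return json.dumps({node.name: to_plain(node)}, indent=2)


# One in-memory tree, several output formats chosen by name.
EMITTERS = {"xml": emit_xml, "json": emit_json}

if __name__ == "__main__":
    tree = Node("book", children=[Node("title", "Walden"), Node("year", "1854")])
    for fmt in ("xml", "json"):
        print(EMITTERS[fmt](tree))
```

Each new output format is one more function over the same tree, which is why adding emitters did not require touching the parser.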

Custom vocabularies

Normally BLUEPHRASE expects a manuscript's semantax to be equivalent to HTML or MathML or SVG. The tags used in these three specifications are automatically recognized by the BLUEPROCESSOR.

When using BLUEPHRASE to create XML or JSON, those specifications are not relevant. To allow for this, I established a way to turn off semantax checking, and to instead interpret the first word of every phrase as a keyname. That keyname would then become the tagname (in XML) or property name (in JSON) in the final output. This made it possible to express arbitrary data structures without fuss. I added the --vocabulary=unchecked option to enable this type of interpretation.
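The first-word-as-keyname rule is easy to picture with a flat Python sketch. It assumes one phrase per line and ignores the nesting that real manuscripts have. The function names and the `root` parameter are invented for the example.

```python
import json
from xml.sax.saxutils import escape


def parse_unchecked(lines):
    """Treat the first word of each line as a keyname, the rest as its value."""
    pairs = []
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue
        keyname, _, value = stripped.partition(" ")
        pairs.append((keyname, value.strip()))
    return pairs


def as_json(pairs):
    """Keynames become property names."""
    return json.dumps(dict(pairs), indent=2)


def as_xml(pairs, root="root"):
    """Keynames become tagnames under a hypothetical root element."""
    body = "\n".join(f"  <{k}>{escape(v)}</{k}>" for k, v in pairs)
    return f"<{root}>\n{body}\n</{root}>"


if __name__ == "__main__":
    pairs = parse_unchecked(["city Columbus", "state Ohio"])
    print(as_json(pairs))
    print(as_xml(pairs, root="place"))
```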

XML has a formal way of creating a document type definition. I wanted to mimic this to allow BLUEPHRASE to be used with vocabularies different from the default three. I did this using the --semantax option. It allows anyone to create new semantax definitions that are recognized and honored by the interpreter, without resorting to the unchecked method. The syntax for the option allows for the definition of the semantax name, the vocabulary that it should be assigned to, its parents, whether or not it can be implied, whether or not it can be a container, and whether or not it is self-closing.
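The pieces of information carried by a semantax definition can be pictured as a small record. The Python sketch below is illustrative only: the class, its field names, the sample vocabulary, and the rule that an empty parent list means "allowed anywhere" are assumptions, and it says nothing about the option's actual syntax.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SemantaxDefinition:
    """Hypothetical record of what a custom semantax definition carries."""
    name: str                      # the semantax name
    vocabulary: str                # the vocabulary it is assigned to
    parents: tuple = ()            # names it may appear under
    can_be_implied: bool = False   # may be omitted and inferred from context
    is_container: bool = False     # may hold nested phrases
    is_self_closing: bool = False  # emitted without a closing tag


def allowed_under(definition, parent_name):
    """Check a parent; an empty parents tuple is assumed to mean 'anywhere'."""
    return not definition.parents or parent_name in definition.parents


if __name__ == "__main__":
    stanza = SemantaxDefinition(
        name="stanza",
        vocabulary="poetry",       # made-up vocabulary for illustration
        parents=("poem",),
        is_container=True,
    )
    print(allowed_under(stanza, "poem"), allowed_under(stanza, "body"))
```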

Custom shorthand

As I developed shorthand notation, I could see many other uses for it. But I was reluctant to keep chiseling away at the remaining unassigned punctuation marks and symbols that are found on a standard keyboard. Instead, I designed a way for users to establish their own shorthand codes. This was done with the --shorthand option, which simply requires the user to choose a symbol (a single character or punctuation mark) and the attribute or property name that it will be transformed into by the compiler.
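A user-defined shorthand amounts to a lookup from a symbol to an attribute name. The Python sketch below shows the transformation under that assumption. The `@` symbol, the `data-author` attribute, and the function itself are invented for the example and do not reflect the option's real syntax.

```python
def expand_shorthand(tokens, table):
    """Expand tokens such as '@jane' into 'data-author="jane"'.

    table maps a single-character symbol to the attribute name it
    stands for. Tokens without a known leading symbol pass through.
    """
    expanded = []
    for token in tokens:
        symbol, value = token[:1], token[1:]
        if symbol in table and value:
            expanded.append(f'{table[symbol]}="{value}"')
        else:
            expanded.append(token)
    return expanded


if __name__ == "__main__":
    custom = {"@": "data-author"}   # hypothetical user-chosen shorthand
    print(expand_shorthand(["@jane", "hello"], custom))
```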

Each of these options offers some degree of personalization to accommodate the type of writing the author is working on. About half of them are oriented towards turning standard behavior off. The other half are oriented towards creating new idiomatic expressions to suit the writer's needs. Together, they allow writers to establish a BLUE language dialect that is comfortable to use.