Delimiter-first code

Summary

I argue for wider use of the delimiter-first style in code:

three friends [tic, tac, toe] becomes three friends ・tic ・tac ・toe.

A new top-level syntax for programming languages is proposed to show the advantages of this method. The new syntax is arguably as simple, but more consistent; it better preserves visual structure and solves some issues in code formatting.

A well-known proposal is to write commas first in languages like javascript, JSON or SQL, which don’t allow trailing commas (JS allows them these days, but the other two don’t):

    -- trailing commas              
    SELECT employee_name,
      company_name,
      salary,
      state_code,
      city
    FROM `employees`

    -- leading commas               
    SELECT employee_name
         , company_name
         , salary
         , state_code
         , city
    FROM `employees`

While it is not what I am discussing here, there is a large overlap. This style wasn’t widely adopted, and it is interesting to ask why.

All criticism essentially comes down to: 1) tools can solve the common issues that this notation solves; 2) it is not natural / you don’t write text like this.

Argument 1) is irrelevant, since tools can handle any notation, even one completely unreadable to humans. Argument 2) is weak, although similarity to familiar things does drastically simplify adoption.

Over time, however, code culture has diverged in multiple ways from ‘usual writing’: we enumerate from zero, write identifiers with underscores, don’t follow the usual rules for quotes, and indent code instead of writing in paragraphs. Once some tools have shown that an alternative works, further adoption happens more easily.

More importantly, argument 2) is really broken:

    ・this version          
    ・is far more 
    ・natural

    than this version・          
    with a delimiter・
    after

So when it comes to enumerating in a visually distinctive way, ‘usual writing’ uses delimiter-first.

I want to point out the source of this controversy with one more example:

You need eggs, cheese, bread.               # ok  
You need ,eggs ,cheese ,bread.              # sucks
You need a) eggs b) cheese c) bread.        # ok
You need 1. eggs 2. cheese 3. bread.        # ok
You need ・eggs ・cheese ・bread.            # ok

So the complaints are not because delimiter-first looks wrong; in fact, it is common. They are about commas being used as leading elements rather than trailing ones: a lesson to remember.

Both arguments 1) and 2) pinpoint the reasons why things are the way they are: habit and tools. But various code examples (SQL examples by Felipe Hopfa and JS examples by Isaac Z. Schlueter) show the benefits of delimiter-first.

I expected to find in discussions some code examples where delimiter-last is better, but I didn’t.

Later addition: the haskell community adopted leading commas in many projects, because trailing commas were not supported at first. Later haskell got support for trailing commas, but by now the majority favors leading commas for their advantages.

Is ‘delimiter’ the right word?

A delimiter (just like a separator) separates items, though there is no consensus about the exact meaning.

E.g. in [ 1, 2, 3 ] we have a sequence of tokens:

start  item delimiter  item  delimiter   item   end
  [     1       ,        2        ,       3      ]

So what I’m arguing for is having a start-of-item token, like this: ・1 ・2 ・3. Do we need to mark the end of the last item? As we’ll see next, that’s usually not the case.

We have a special word for end-of-item token: terminator, but no startinator or any similar word. I see some irony in this.
(update: find some interesting thoughts I received about this in the comments section)

Meanwhile, I keep using the word ‘delimiter’ (even though it may be incorrect).

Collections in HTML

Different markup languages give some food for thought, as they commonly deal with collections.

E.g. html allows using a start-of-item tag (<li>) and skipping the end-of-item tag (</li>):

<ul>
    <li> first item
    <li> second item
</ul>

Collections in YAML

Yaml, which focuses on a hierarchy of collections, also uses a delimiter-first approach.

- point 1
  - point 1.1
  - point 1.2
    - point 1.2.1
    - point 1.2.2
  - point 1.3
- point 2 

Let me reinterpret this example. This reinterpretation is important in further discussion.

There are 3 delimiters: \n-, \n__- and \n____- (underscore = whitespace). All three delimiters are distinct. Writing 1, 2 and 3 for them, the whole structure now reads as

1point 1
2point 1.1
2point 1.2
3point 1.2.1
3point 1.2.2
2point 1.3
1point 2

No end token is needed in yaml: the last item ends when its collection ends, i.e. at a delimiter of a higher level. There is no need to know or parse anything about the internal structure between two tokens.

Correspondingly, the only expectation we have of the contents enclosed between two <lvl2> tokens is that they contain no <lvl1> or <lvl2> tokens, and that’s it.
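The reinterpretation above can be sketched in a few lines of python. This is only an illustration, assuming two-space indentation and `- ` bullets as in the example; `to_levels` and `item_end` are hypothetical helper names, not part of any yaml library.

```python
def to_levels(text, indent=2):
    """Turn '- item' lines into (level, content) pairs; level 1 is the outermost."""
    pairs = []
    for line in text.splitlines():
        if not line.strip():
            continue
        stripped = line.lstrip(" ")
        depth = (len(line) - len(stripped)) // indent
        content = stripped[2:] if stripped.startswith("- ") else stripped
        pairs.append((depth + 1, content.rstrip()))
    return pairs


def item_end(pairs, start):
    """Index just past the item at `start`.

    The item ends at the next delimiter whose level number is the same or smaller;
    nothing between the two delimiters has to be inspected beyond its level.
    """
    level = pairs[start][0]
    for i in range(start + 1, len(pairs)):
        if pairs[i][0] <= level:
            return i
    return len(pairs)


doc = """\
- point 1
  - point 1.1
  - point 1.2
    - point 1.2.1
    - point 1.2.2
  - point 1.3
- point 2
"""

pairs = to_levels(doc)
for level, content in pairs:
    print(f"{level}{content}")

# 'point 1.2' sits at index 2 and runs until 'point 1.3' at index 5
print(item_end(pairs, 2))
```

No end token appears anywhere in the input; the extent of each item falls out of the levels of the delimiters that follow it.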

Intermediate conclusion: delimiter-first is very common, and in markup languages it is even standard (but not in programming languages!)

A line should start with `\n`, not end with it

This sounds mad (after many years of programming it just should), but see for yourself:

Let's assume I've had some very long text ending here.

Chapter 2.
Let's learn about belonging of indentation elements to logical elements.

Pay attention to the blank line between the last line of the previous chapter and the header of the new one. Undoubtedly, the blank line is a part of the ‘Chapter 2’ logical element, because the empty line focuses our attention on the ‘Chapter 2’ label. It is not there because we need to end the paragraph.

For the same reason, in html additional margins ‘belong’ to headers, not preceding elements.

Same for lines: we highlight the beginning of a new line, not the end of the previous one. Ironically, that’s in the name: it is newline, not endline.

When we turn to code, the same thought is seen with this small snippet, where I compare normal print with a hypothetical print that outputs newline before the output:

print('step1. downloading', end='')
for chunk in download(...):
    print(end='.')
print() # to keep steps on separate lines

print('step2. processing', end='')
for chunk in process(...):
    print(end='.')
print() # to keep steps on separate lines

Code with \n auto-printed after the arguments

print('step1. downloading')              
for chunk in download(...):
    print(start='.')

print('step2. processing')
for chunk in process(...):
    print(start='.')
    
    

Code with \n auto-printed before the arguments

result:

step1. downloading.........
step2. processing.........

The version of the code with a leading \n is more straightforward.
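The hypothetical print is easy to emulate in today's python. A minimal sketch, assuming a `start` parameter that mirrors the real `end` parameter; the name `lprint` and the `run` driver are made up for this illustration, and the chunk counts stand in for the elided download/process loops.

```python
import io
import sys


def lprint(*args, start="\n", sep=" ", file=None):
    """Newline-first print: writes `start`, then the arguments, and nothing after them."""
    out = sys.stdout if file is None else file
    out.write(start + sep.join(str(a) for a in args))


def run(steps, file=None):
    for name, n_chunks in steps:
        lprint(name, file=file)            # begins its own line
        for _ in range(n_chunks):
            lprint(start=".", file=file)   # stays on the current line


buffer = io.StringIO()
run([("step1. downloading", 3), ("step2. processing", 2)], file=buffer)
print(repr(buffer.getvalue()))
```

Each step only says where its own line begins; no extra call is needed to close the previous line.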

If things were the opposite way:

.......step1. downloaded
.......step2. processed

then \n at the end would be the better fit, but this order is not natural. Normally we first describe the collection, then enumerate its items, not vice versa.

Unix’s newline at the end of line

Unix does not use \n as a delimiter of lines. Instead, it is more of a line terminator, because a text file should end with \n. Not doing so would break the simplicity of unix tools and of the definitions, see this SO thread.

For the layman, here is why the newline is required in unix:

$ printf 'good file with newline in the end\n' && printf 'another good file with newline in the end\n'
good file with newline in the end
another good file with newline in the end

Missing newline in the first file:

$ printf 'bad file without newline in the end' && printf 'another good file with newline in the end\n'
bad file without newline in the endanother good file with newline in the end

The problem is in the first file, but it is the second one that gets printed the wrong way. There is no such misattribution issue with newline-first.
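The same effect can be reproduced without a shell. A small sketch, where `cat` is a stand-in that only mimics concatenation and the file contents are made up: a well-formed file is glued to a predecessor that carries no newline at all.

```python
def cat(*files):
    """Concatenate file contents byte after byte, adding nothing (like unix cat)."""
    return "".join(files)


bad_predecessor = "bad file without newline"

# newline-last: the well-formed second file gets damaged by its predecessor
newline_last = cat(bad_predecessor, "good file\n").split("\n")

# newline-first: the well-formed second file brings its own line start with it
newline_first = cat(bad_predecessor, "\ngood file").split("\n")

print(newline_last)
print(newline_first)
```

Under newline-first, a correct file always begins on its own line, whatever came before it.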

If it is ok to end each file with \n, then it is ok to start it with \n.

Having lines start with \n maintains the simplicity of unix utilities, and is a bit simpler to visualize in an editor.

Imagine that in a parallel universe text and binary files differ in their very first character. What a science fiction we could live in!

Do I really want to change all files to newline-first? Of course not. But I have to point out that if files had been newline-first from the start, that would have been a better system.

I hypothesize that newline-last comes from unix mainframes: when a line is entered in the shell, it can be passed to the mainframe for processing. I can’t confirm this, but it sounds plausible. If so, time has shown that to be the wrong choice: messengers these days distinguish between starting a new line and sending a message (one is bound to enter, the other to shift+enter). Jupyter knows that, IDEs know that, messengers know that. Terminals still don’t know that.

Using indentation to structure code

Code indentation is available in all major languages, but python (and scala 3, F#, nim, haskell, …) relies on indentation to define logical structure.

And that works very well. Let’s see how we can reinterpret the python code the way we did with yaml:

class MyClass:
    def __init__(self):
        pass
    
    def some_method(self):
        pass

Now we reinterpret the structure with <lvl1>=\n, <lvl2>=\n____, <lvl3>=\n________:

1class MyClass:
2def __init__(self):
3pass
2
2def some_method(self):
3pass

So the very basic organization of the code is available just by looking at the sequence of start tokens (which simply mirrors indentation).
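To make this concrete, here is a minimal sketch that builds a nested outline from start tokens alone. It assumes four-space indentation and skips blank lines; `outline` and `show` are hypothetical helpers, and the content of each line is never parsed.

```python
def outline(source, indent=4):
    """Nest lines by indentation only; returns a list of (text, children) pairs."""
    root = []
    stack = [root]                  # stack[k] collects the nodes at depth k
    for line in source.splitlines():
        if not line.strip():
            continue                # blank lines carry no structure in this sketch
        stripped = line.lstrip(" ")
        depth = (len(line) - len(stripped)) // indent
        del stack[depth + 1:]       # a shallower line closes every deeper block
        node = (stripped, [])
        stack[-1].append(node)
        stack.append(node[1])
    return root


def show(nodes, level=1):
    for text, children in nodes:
        print(f"{level}{text}")
        show(children, level + 1)


source = """\
class MyClass:
    def __init__(self):
        pass

    def some_method(self):
        pass
"""

tree = outline(source)
show(tree)
```

The printed outline repeats the level-prefixed listing above, minus the blank line.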

Some problems with multiline strings

There are places where python allows code to ‘escape’ indentation: continuation of previous line (explicit with \ or implicit with different brackets) and multiline strings.

Continuations are ‘solvable’ with code formatting tools, but not multiline literals:

if True:
    print("""
    This is python's
    multiline string
    """)

Output (###### just shows where the line ends):

######
    This is python's######
    multiline string######
    ######

To get proper output we need to break visual alignment:

if True:
    print("""This is python's
multiline string
""")
    # takes effort to realize that the same block of code continues here
    return False

There are three problems with multiline strings: the first line, the last line, and indentation. Multiline strings in javascript/go face all the same issues, so it is a generic problem.

I think there is a way to solve this issue too, and it is discussed below.
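For comparison, the usual workaround in today's python is `textwrap.dedent` from the standard library. The sketch below (with a made-up `message` function) shows that it takes care of indentation, while the first line still needs separate handling.

```python
import textwrap


def message():
    text = """
    This is python's
    multiline string
    """
    return textwrap.dedent(text)


dedented = message()
print(repr(dedented))

# the newline after the opening quotes is still part of the string
cleaned = dedented.lstrip("\n")
print(repr(cleaned))
```

So the visual alignment can be kept, but only at the price of post-processing the literal at runtime.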

Delimiter-first pseudo-python

To better demonstrate how all these ideas come together, I’ll imagine a new language (pseudo-python). To focus only on syntax changes, I’ll keep all other aspects of the language the same.

I will consider an artificially complicated example. It includes different kinds of arguments, a list, an empty list, a string, a multiline string, method chaining, a multiline logical expression, and calls with few or no arguments.

The goal is to demonstrate that any wild mix is representable and does not produce a mess.

prepare_message(
    title="Hey {}, ready for Christmas?".format(user_name),
    email=email,
    body=f"""Reminder: please clean your chimneys!

Oh, and prepare "Santa Landing Spot" on your roof

Thank you {user_name} for cooperation,\nSanta Corp.
""",
    additional_sections=[
        get_current_promotions(n_promotions=4),
        get_recent_news(),
    ],
    unsubscribe_link=generate_unsubscribe_link(
        email, 
        message=message,
        **unsubscribe_settings,
    ),
    attachments = [],
).schedule_for_submission(
    holidays_queue,
    important=user_is_santa |  user_is_deer \
     | user_previously_had_issues_with_christmas_delivery,
)

prepare_message(
    , title="Hey {}, ready for Christmas?".format(user_name)
    , email=email
    , body=f"""
        "Reminder: please clean your chimneys!              
        "                                                   
        "Oh, and prepare "Santa Landing Spot" on your roof  
        "                                                   
        "Thank you {user_name} for cooperation,\nSanta Corp.
    , additional_sections=[
        , get_current_promotions(n_promotions=4)
        , get_recent_news()
    ]
    , unsubscribe_link=generate_unsubscribe_link(
        , email
        , message=message
        , **unsubscribe_settings
    )
    , attachments = []
)
\.schedule_for_submission(
    , holidays_queue
    , important=user_is_santa | user_is_deer 
      \| user_previously_had_issues_with_christmas_delivery
)

I welcome you to study this example for a minute. The overall structure did not change much. Note the differences in line continuations (\) and multiline strings.

An important distinction: leading commas get the same role as hyphens in yaml: they define structure, their position is not arbitrary.

# normal python
# this is legal code            
print(
    1, 
        2,
)

# proposed
# this is incorrect code        
print(
    , 1
        , 2
)

In the new code there is no need for closing brackets (see that for yourself by staring at the code some more!).
So let’s remove the closing elements:

prepare_message(
    title="Hey {}, ready for Christmas?".format(user_name),
    email=email,
    body=f"""Reminder: please clean your chimneys!

Oh, and prepare "Santa Landing Spot" on your roof

Thank you {user_name} for cooperation,\nSanta Corp.
""",
    additional_sections=[
        get_current_promotions(n_promotions=4),
        get_recent_news(),
    ],
    unsubscribe_link=generate_unsubscribe_link(
        email, 
        message=message,
        **unsubscribe_settings,
    ),
    attachments = [],
).schedule_for_submission(
    holidays_queue,
    important=user_is_santa |  user_is_deer \
     | user_previously_had_issues_with_christmas_delivery,
)

prepare_message(
    , title="Hey {}, ready for Christmas?".format(user_name)
    , email=email
    , body=f"""
        "Reminder: please clean your chimneys!                
        "                                                     
        "Oh, and prepare "Santa Landing Spot" on your roof    
        "                                                     
        "Thank you {user_name} for cooperation,\nSanta Corp.  
    , additional_sections=[
        , get_current_promotions(n_promotions=4)
        , get_recent_news()
    , unsubscribe_link=generate_unsubscribe_link(
        , email
        , message=message
        , **unsubscribe_settings
    , attachments = []
\.schedule_for_submission(
    , holidays_queue
    , important=user_is_santa | user_is_deer 
      \| user_previously_had_issues_with_christmas_delivery

Don’t pay much attention to the number of lines: denser code is a byproduct, not a goal.

Next I’ll discuss several advantages of this syntax.

New multiline strings

print(f"""
    "This is new
    "multiline string

output:

This is new
multiline string

Everything looks perfect, multiple issues are solved in one shot. But … with a minor catch: this is how the output looks in raw form: \nThis is new\nmultiline string (i.e. it is newline-first). Technically, one can produce newline-last outputs, but that’s artificial. See the elegance of the match between the delimiter-first and newline-first approaches: the delimiter just gets replaced with a newline. That’s an operation one can visually imagine as shifting all lines to the left.
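The decoding rule described above (each leading `"` turns into a newline) can be sketched as follows. This is one possible reading of the proposal, not a specification: `decode_literal` is a hypothetical helper, it takes only the body lines of the literal, treats lines starting with `#` as comments, and does not attempt f-string interpolation.

```python
def decode_literal(body_lines):
    """Decode the body of a proposed multiline literal into a newline-first string."""
    result = ""
    for line in body_lines:
        stripped = line.lstrip(" ")
        if stripped.startswith('"'):
            result += "\n" + stripped[1:]   # the delimiter simply becomes a newline
        elif stripped.startswith("#") or not stripped:
            continue                        # comments and blank lines add nothing
        else:
            raise SyntaxError(f"unexpected line inside literal: {line!r}")
    return result


simple = decode_literal([
    '    "This is new',
    '    "multiline string',
])
print(repr(simple))

# quotes inside a line never terminate the literal, and comments may sit in between
tricky = decode_literal([
    '    "quotes are fine: " "" """',
    "    # a comment in the middle",
    '    "second line',
])
print(repr(tricky))
```

Only the first non-space character of each line is inspected, so no sequence inside a line can end the literal.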

One more example:

print(f"""
    "you can place anything here: ' '' ''' " "" """ f""" etc etc.
    # and you can put comments in the middle of multiline
    "multiline string can't be broken or terminated by any sequence within a line

Now, python literals do not work like that.

'''
""" and ''' should be escaped (otherwise interpreted as literal terminator)
'''


'''''
'''  # this trick (available in markdown) does not work in python
'''''

New parsing

In contrast to normal python, a line alone does not tell whether the instruction is complete or continues on the next line. One more line must be parsed to confirm that the current code section is complete (to be more precise, only the prefix of the next line must be parsed).

In this approach top-level parsing is quite ignorant of language details, and it relies on the same visual cues as we humans do: the parser does not need to analyze a line in detail to figure out whether the instruction continues or not.

Let me ‘parse’ this example:

Delimiter   Token class    Rest of line
            <lvl1-instr   >prepare_message(
    ,       <lvl2-item    >title="Hey {}, ready for Christmas?".format(user_name)
    ,       <lvl2-item    >email=email
    ,       <lvl2-item    >body= f"""
        "   <lvl3-literal >Reminder: please clean your chimneys!
        "   <lvl3-literal >
        "   <lvl3-literal >Oh, and prepare "Santa Landing Spot" on your roof
        "   <lvl3-literal >
        "   <lvl3-literal >Thank you {user_name} for cooperation,\nSanta Corp.
    ,       <lvl2-item    >additional_sections=[
        ,   <lvl3-item    >get_current_promotions(n_promotions=4)
        ,   <lvl3-item    >get_recent_news()
    ,       <lvl2-item    >unsubscribe_link=generate_unsubscribe_link(
        ,   <lvl3-item    >email
        ,   <lvl3-item    >message=message
        ,   <lvl3-item    >**unsubscribe_settings
    ,       <lvl2-item    >attachments = []
\           <lvl1-continue>.schedule_for_submission(
    ,       <lvl2-item    >holidays_queue
    ,       <lvl2-item    >important=user_is_santa | user_is_deer 
      \     <lvl2-continue>| user_previously_had_issues_with_christmas_delivery

By looking only at the sequence of delimiters (there are several subtypes of them), one can deduce the limits of every code block / call / literal, i.e. derive the top-level structure of the program. The parser now deals with the simpler task of checking that elements fit this pre-defined structure, and can point to places where ‘structure’ does not match ‘content’.

Good bye old times when one deleted bracket caused a complete rebuild of the AST and numerous errors.
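The prefix-only classification from the table can be sketched in a few lines. The rules here are inferred from the example and are assumptions, not a definition of the language: four spaces per level, `,` starts an item, `"` starts a literal line, `\` continues the enclosing element, and anything else is an instruction. `classify` is a hypothetical name.

```python
def classify(line, indent=4):
    """Split a pseudo-python line into (level, kind, rest) by looking only at its prefix."""
    stripped = line.lstrip(" ")
    level = (len(line) - len(stripped)) // indent + 1
    kinds = {",": "item", '"': "literal", "\\": "continue"}
    kind = kinds.get(stripped[:1], "instr")
    if kind == "instr":
        rest = stripped
    elif kind == "item":
        rest = stripped[1:].lstrip(" ")
    else:
        rest = stripped[1:]     # literal text and continuations are kept verbatim
    return level, kind, rest


example = [
    "prepare_message(",
    "    , email=email",
    '    , body=f"""',
    '        "Reminder: please clean your chimneys!',
    "    , attachments = []",
    r"\.schedule_for_submission(",
    "    , holidays_queue",
    "    , important=user_is_santa | user_is_deer",
    r"      \| user_previously_had_issues_with_christmas_delivery",
]

rows = [classify(line) for line in example]
for level, kind, rest in rows:
    print(f"<lvl{level}-{kind}> {rest}")
```

The rest of each line is passed along untouched, so a mistake inside one line cannot change how the neighbouring lines are classified.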

New code suggestions

This paragraph was added later, to expand on a point that was missed by many readers.

Parsing correct code has not been a problem since the 1960s or so. The real challenge is on-the-fly parsing of partially incorrect and quickly changing code in the process of editing.

Say I’m a complete novice and typed something wrong:

def myfunction(
    var1 = 'some default value',
    var2 = (1, (2, 3),
)
    var3 = "variable number 3"

    var4 = """
Simple unfinished multiline string
""" + \
var

    var5 = ())

What should be autosuggested? var1/2/3/4? Or nothing? Which would be more helpful?

How to inform the user which places should be fixed? VS Code blames the bracket on the first line, saying it is not closed (while it is closed!), and the last line for a missing colon (no, I don’t want a colon there). Pycharm’s diagnostic messages are slightly better, but it blames the line with var3 (which is completely ok).

Now, in pseudo-python there is no way to ‘escape’ indentation, and thus code analysis can rely on indentation. It is immediately deducible that the lines with var2 and var5 have problems, and that the indent of var3 is incorrect (since the colon is missing on the previous line).

Autosuggestion would still be useful even in code with multiple unfinished places (in a similar scenario in pseudo-python it can still suggest var3/var4 and, depending on tolerance, additionally var1/var2). Currently tools don’t suggest anything.

As I mentioned, the AST undergoes only small changes during editing, so providing highly efficient autosuggestion, code analysis, and highlighting for such a language would be simpler, much simpler.
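A rough illustration of error localization, using today's python and its `ast` module. The sketch assumes that nothing escapes indentation (no closing bracket at column 0, no multiline strings), which is exactly the property pseudo-python guarantees and real python does not; `top_level_chunks` and `check` are hypothetical helpers.

```python
import ast


def top_level_chunks(source):
    """Split source into chunks, each starting at an unindented, non-blank line."""
    chunks = []
    for line in source.splitlines():
        starts_chunk = bool(line.strip()) and not line.startswith((" ", "\t"))
        if starts_chunk or not chunks:
            chunks.append([line])
        else:
            chunks[-1].append(line)
    return ["\n".join(chunk) + "\n" for chunk in chunks]


def check(code):
    try:
        ast.parse(code)
        return "ok"
    except SyntaxError:
        return "broken"


# the closing bracket of foo( is missing
source = """\
x = foo(
    1,
    2,

y = 3
"""

whole = check(source)
per_chunk = [check(chunk) for chunk in top_level_chunks(source)]
print(whole)
print(per_chunk)
```

Parsed as a whole, the file is rejected; split by indentation first, only the chunk with the missing bracket is flagged and `y = 3` remains available for analysis and suggestions.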

New editing

Normal python.
suppose you want to start a list of arguments

print()

after you hit enter in IDE:

print(
    
)

then you type argument and comma.
Ready to proceed

print(
    42,
    
)

Done? Arrow down + enter

print(
    42,
    43,
)

Forgot something?
Double arrow up,
move cursor to end of line,
enter

print(
    42,
    43,
    
)

Delimiter-first pseudo-python.
suppose you want to start a list of arguments

print()

after you hit enter in IDE comma is auto-added:

print(
    ,

you type only the argument.
Ready to proceed

print(
    , 42
    ,

Done? Enter + shift-tab

print(
    , 42
    , 43

Forgot something? Tab

print(
    , 42
    , 43 
    ,

The process of editing such structures was polished with hierarchical lists in word and other text processors.

Below is an animated example from workflowy (taken from post by B. Brandall):

Even minimalist note-taking apps these days recognize the importance of hierarchical organization. Their interface focuses on effectively traversing and modifying this structure.

But with code - these extremely structured and standardized pieces of linked information - we continue the game of imitation: ‘hey, that’s just text files, you can use notepad here!’.

New versioning

Missing trailing commas make diffs a bit annoying, because adding an item also touches the preceding line.

The new syntax has this solved. In other respects versioning should work the same.
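The difference in diff size can be checked with the standard `difflib` module. A minimal sketch, assuming the delimiter-last version is written without a trailing comma; `changed_lines` is a hypothetical helper that counts removed plus added lines.

```python
import difflib


def changed_lines(before, after):
    """Number of lines a line-based diff reports as removed or added."""
    matcher = difflib.SequenceMatcher(a=before, b=after, autojunk=False)
    return sum(
        (i2 - i1) + (j2 - j1)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    )


# delimiter-last, no trailing comma: appending 44 also rewrites the line with 43
last_before = ["print(", "    42,", "    43", ")"]
last_after = ["print(", "    42,", "    43,", "    44", ")"]

# delimiter-first: appending 44 adds exactly one line
first_before = ["print(", "    , 42", "    , 43"]
first_after = ["print(", "    , 42", "    , 43", "    , 44"]

print(changed_lines(last_before, last_after))
print(changed_lines(first_before, first_after))
```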

New formatting

The goal of formatting is to produce a visual code structure that is easy to read, as if you already see all main components without reading anything.

The new syntax enforces this, and leaves fewer degrees of freedom. Writing something non-readable would be challenging… I suppose.

The role of formatters would thus be minor, or they could be skipped altogether.

Limitations

First, I did not try to solve the following perceptual problems:

- commas are leading, and I’ve mentioned that this was a problem for comma-first formatting
- open brackets without a matching pair create visual discomfort. Also, my eyes are already trained to focus on closing brackets, but a proper color scheme seems to solve this

This post is already long, and leaving things closer to python simplifies the example. I think both points can be improved, and feel free to post your ideas on this.

Second, I intentionally focused only on improving multi-line constructs, and single-line collections were left untouched. That does not mean delimiter-first does not work there, but the scale of the necessary changes is just too high to justify the gains. At least for now.

If you made it this far

Wow, thank you!

I hope the adventure was interesting and slightly mind-blowing.

Don’t be too surprised if this proposal evokes a “hey this looks wrong, just plain wrong” reaction.
After all, the ideas we enjoy these days (enumeration from zero, using letter case in names, structured programming, mandatory formatting, and even python’s approach to defining code blocks with indentation) were every single one of them met with a storm of criticism.

👋

Comments 💬

I received and collected a number of links for using delimiter-first in different contexts (lisp/scheme, formulas, translatable languages), will organize that material when I get time.
Isaac Z. Schlueter advised there is a term ‘initiator’, used in “… specification discussion threads, where it’s common to dig deep into the particulars of parsing semantics. Very much a ‘deep in the weeds’ kind of technical term.”

In the context of parsing I found the word ‘initiator’ in several papers, and only one mention on stackoverflow, so I’ll stick to using word ‘delimiter’.
Other options mentioned in discussions: introducer, starter
Peter Hilton noticed that “… startinators in prose usually called bullets. Some English-language style guides even treat the following punctuation as equivalent.

Brilliantly Wrong — Alex Rogozhnikov’s blog about math, machine learning, programming, physics and biology.*

Brilliantly Wrong — Alex Rogozhnikov’s blog about:
- math
- machine learning
- programming
- physics
- biology.
Note the bullet list’s trailing full stop (period). It’s still one punctuated sentence.”

Indeed, name ‘bullet’ sounds very appropriate when discussing code written in delimiter-first style. From parsing side, I don’t feel it’s a good partner to word ‘terminator’.
Thanks to Alexander Molchanov for proofreading, improving text, and leaving comments.
Question: “Who did you write this for?”

I believe that’s a better way to structure code (for readability, editing, and better language tools). Based on what I’ve learnt so far, I am sceptical about integrating additional syntax into existing languages: two notations side-by-side are worse for users than one. From the perspective of language maintainers, all tooling would need to deal with two dialects, which is also a downgrade.

So the main audience is authors of new programming languages. However, it is not only authors: to get adopted, any new feature should get at least minimal support from the community. That’s where this page can help. So more generally, I target people interested in experimenting with new programming languages and in challenging the status quo.
Question: “But how will you represent a couple of multiline lists next to each other?”

This case is handled normally:
```
  f([                
      a,
      b,
  ], [
      c,
      d,
  ])
```
```
  f([                
      , a
      , b
  \,[
      , c
      , d
```
For the record, I’d prefer to introduce variables in any case.
Question: “Don’t you think that current tools have already solved the issues solved by delimiter-first?”

I developed a simple 4-line code example with a missed comma that is completely fine for flake8 and ruff, and the black formatter considers it well-formatted. It took me less than a minute to develop this example, and if you start thinking, I’m sure you’ll find a handful of similar cases. Authors of one utility that is supposed to flag these cases claim that ‘5% of 666 Python repos had comma typos (including Tensorflow, and PyTorch, Sentry, and V8)’.
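The example itself is not reproduced in this post. One well-known member of this class of bugs (not necessarily the one meant above) is implicit string concatenation: adjacent string literals are joined, so a missed comma silently changes the list instead of raising an error. The names below are made up.

```python
# the comma after "bob" is missing
names = [
    "alice",
    "bob"
    "carol",
]

print(len(names))
print(names)
```

The list ends up with two elements, and the code is syntactically valid, so nothing forces a tool to complain.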

We can continue patching problems with even more tools and more special cases, but I’d rather have it solved by design. The core point is that delimiter-last is flawed. The main visual cue (indentation) is on the left, while there are still control sequences that can override indentation, and they are on the right. For this reason \ at the end of a line is a bad choice.