Code as text is a problem

5 May, 2017 · Read in about 3 min · (543 Words)

I’m pretty frustrated right now. I’m doing some refactoring, and it is hard going. I’ve been at it for a couple of hours on a Friday, and I haven’t even started the actual feature I’m trying to introduce.

What I essentially want is to apply some simple transformations to the symbols constituting my program. If my code was a data structure, I could hack this out in 10-15 lines of code and be done with it. This code would both implement and describe the transformation I’m trying to accomplish.

Instead, I have to encode my refactoring as a complicated series of interactions with:

a graphical refactoring tool
sed, grep, contemplation of regexen
various diff visualization utilities

After every tiny change, I stage my changes and inspect them to ensure I haven’t introduced any untowards changes. I create a commit and describe in detail what I did, because the diff shows 5 billion changed lines of code (and I do mean lines, because most visual diff utilities can’t do --word-diff), and it will be impossible to untangle the changes later in code review. I run a watch on all files and pipe them through a reformatter. I rebase and squash things around to reduce the cognitive load of the diffs.

This is a giant pain in the ass. And it doesn’t really seem like there is any good reason for it to be this hard, aside from our rather arbitrary decision to represent code as streams of characters split across a bunch of files. In fact, we’ve invented all kinds of bullshit that is only necessary because code has no structured representation:

Super-advanced text editors
Linters, printers, configuration thereof, and SCM/CI integration for these
Style guides
Refactoring tools like Resharper
Code analysis (and code analysis optimization)
Advanced source control management systems
- Git, darcs, camp
- Tools to help us with complicated merges
Macros, headers, linkers, etc. etc.

Don’t get me wrong, I don’t have anything in particular against using a textual interface to modify code. In fact, there is a certain agility to manipulating code-as-text that just wouldn’t be there if we stored code in, say, databases.

But can’t we can have our cake and eat it too? If the primary representation of every programming language was data, editing it as text would simply be a question of serializing it out to some convenient syntax and file structure. When the programmer wants to modify code, render it into a textual representation. When they’re done, parse it back into nice, inspectable data.

Obviously this isn’t some super-new, crazy idea; homoiconic languages like Lisp(s) have been around for ages. I just wish it was across the board. I mean, at the very least everyone has to tokenize their language; how about starting there?

Representing code as data would have some totally rad consequences, like:

Sexy, simple-to-reverse patch representations in SCMs for transformations like renaming stuff, e.g. { alias: { old: "thissymbol", new: "thatsymbol" } }
Better mergeability
Increased signal to noise when the user is inspecting history
Linter rules for the semantics of your code (who keeps sneaking these static properties in?)
Searching and refactoring the codebase is easy, fast, fun

I know, “shut up and write/use a parser, you dummy”, but I like complaining about things.