Guidelines

When making decisions, e.g. about specific language features, a variety of factors can influence or guide the process. ChatGPT distinguishes these as follows:

  • Principles: Principles are truths or propositions that serve as the foundation for a system of belief or behavior or for a chain of reasoning. They are typically universal, timeless, general, and abstract. They are not specific to a particular situation but can be applied broadly to determine what is ethically right or wrong.

  • Paradigms: A paradigm is a distinct set of concepts or thought patterns. In the context of scientific theories and worldviews, a paradigm includes theories, models, research methods, postulates, a system of thinking, a framework for understanding the world, and standards for what constitutes legitimate contributions to a field. More broadly a paradigm is simply a prototypical example or pattern. Paradigms determine how we perceive, interpret, and engage with the world. They influence what is considered to be valid knowledge and guide how we approach problems and solutions in various domains.

  • Goals: A goal is a specific objective that Stroscot strives to achieve. Goals are usually SMART (specific, measurable, achievable, relevant, time-bound), with a defined endpoint or outcome and clear criteria for determining when they have been achieved. They can be short-term or long-term and provide direction and motivation. They help individuals and organizations focus their efforts on achieving specific outcomes and can be used to measure progress and success.

  • Standards: Standards are precise, measurable norms or requirements for quality and performance against which actual outcomes can be measured. Anyone can establish a standard but the better standards are generally written by industry-specific organizations to regulate products or services. In Stroscot these can most likely be formalized as benchmarks or test cases.

  • Best Practices: Best practices are techniques, methods, processes, or activities that have been generally accepted as superior to others because they produce superior results. Best practices are identified through experience and research and are often benchmarked or recognized as leading examples in their field. They evolve over time but serve as a guide or model for achieving excellence in a particular area.

  • Policies: Policies are formal statements or directives that define Stroscot’s general plan of action. They set out the intentions of the entity regarding broad issues, guiding decision-making and actions without detailing specific actions to be taken. Policies ensure consistency, fairness, and efficiency.

In theory, with a complete set of guidelines, a pull request or design choice can be declared “right” or “wrong” according to these principles, goals, etc. In practice I have tried applying the guidelines and they are often lacking. But nonetheless a list of guidelines can avoid some wasted work, and the guidelines can get better over time. Contributors can discuss changing the principles if a desired change is not compatible.

Per Dijkstra, each goal should be convincingly justified. A lack of justification may mean that the author is unconvinced of the worthiness of the goal or has been “talked into it”. Justifying goals clears up the author’s mind. It also allows ensuring that the goals are compatible with each other and well-understood; identifying conflicts early saves a lot of vain effort.

Principles

  • One of the surest of tests is the way in which a poet borrows. Immature poets imitate; mature poets steal; bad poets deface what they take, and good poets make it into something better, or at least something different. The good poet welds his theft into a whole of feeling which is unique, utterly different from that from which it was torn; the bad poet throws it into something which has no cohesion. A good poet will usually borrow from authors remote in time, or alien in language, or diverse in interest. (T. S. Eliot)

  • It can scarcely be denied that the supreme goal of all theory is to make the irreducible basic elements as simple and as few as possible without having to surrender the adequate representation of a single datum of experience. (Albert Einstein)

  • We fix things where they are broken, never tape over them. Never a product, but something that makes it easy to build products on. Never UI, but what you can build your UI on. Never finished, never complete, but tracking progress of technology. Never specific, always generic. Never the cathedral, just the building blocks to build it. (Lennart Poettering 1)

  • One thing we’ll grant you though, we sometimes can be smart-asses. We try to be prepared whenever we open our mouth, in order to be able to back-up with facts what we claim. That might make us appear as smart-asses. (Lennart Poettering 2)

  • An analogy from daily life is to compare the great pyramid of Giza, which is mostly solid bricks piled on top of each other with very little usable space inside, to a structure of similar size made from the same materials, but using the later invention of the arch. The result would be mostly usable space and requiring roughly 1/1000 the number of bricks. In other words, as size and complexity increase, architectural design dominates materials. (VPRI)

  • If it isn’t documented, it doesn’t exist. Not only does it have to be doc’d, but it has to be explained and taught and demonstrated. Do that, and people will be excited – not about your documentation, but about your product. (Mike Pope via Coding Horror). Corollary: There is no undefined behavior, only undocumented behavior.

  • Liking increases monotonically for both higher complexity and higher number of presentations. [MS17] (Originally this was “some complexity is desirable” from [Nor10] page 13, but then I looked for sources and there was a stronger conclusion. Fig. 2 seems to have a liking dip for the highest complexity. Other studies used unnatural stimuli or did not control for familiarity, resulting in averaging experienced and inexperienced participants. [GvanLier19] There is a difference between objective vs. subjective complexity measures; using Fisher information to convert from objective to subjective measures of information produces the typical inverted U-shape for objective measures. [GA22])

  • Better depends on your customer’s goodness metric. It is time for us to reject the simple-minded interpretation of the slogan “worse is better”, and start putting out software that really is better (on the dimension of goodness that our customers have, not necessarily our own). (Jim Waldo)

  • “Good Design Is Easier to Change Than Bad Design”. A thing is well designed if it adapts to the people who use it. For code, that means it must adapt by changing. So, a good design is Easier To Change (ETC). As far as we can tell, every design principle out there is a special case of ETC. Why is decoupling good? Because by isolating concerns we make each easier to change. Why is the single responsibility principle useful? Because a change in requirements is mirrored by a change in just one module. Why is naming important? Because good names make code easier to read, and you have to read it to change it. (Pragmatic Programmer 2019 edition, page 28)

Paradigms

“General purpose” programming languages are “general purpose” in that you can write any system in them. But in Java 1.5 you couldn’t do currying - there were no closures. You could approximate it with an interface and an anonymous class, maybe some weird decorator pattern to make it less verbose and a library for currying, but it’s just not the same as writing foo 1 and having it work. Paradigms strongly dictate how you structure your code. Semantics matters - it can change how we think about the problem we’re trying to solve. For example, concurrency is easier if all your values are immutable. Performance of paradigms is also a consideration - generally you can express the same algorithm either way and it will get compiled the same, but, for example, an immutable update may not necessarily compile to an in-place update. But performance is not automatically faster or slower with a given paradigm - there are generally examples of both speedups and slowdowns - whereas if a paradigm is suited for a given task it’s really obvious. You can use inheritance to model any problem, but if the problem is modeling the lambda calculus, then using a functional paradigm with ADTs and built-in lambdas in Haskell is going to be a lot easier and less code than Java.
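
To make the currying point concrete, here is a minimal Haskell sketch (the names foo and addTen are invented for the example): partial application just works, with no interfaces or decorator patterns.

  -- Every Haskell function of two arguments is already curried.
  foo :: Int -> Int -> Int
  foo x y = x * 10 + y

  -- Partial application: "foo 1" is itself a function, no boilerplate required.
  addTen :: Int -> Int
  addTen = foo 1

  main :: IO ()
  main = print (addTen 5)   -- prints 15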

The programming languages checklist has a few paradigms: functional, imperative, object-oriented, procedural, stack-based, “multi-paradigm”. In linguistics, a paradigm is “a set of linguistic items that form mutually exclusive choices in particular syntactic roles,” specifically “a table of all the inflected forms of a particular verb, noun, or adjective.” This seems to be a usable definition of a PL paradigm - you have all related versions of a semantic entity.

Unfortunately people seem to use paradigms as labels of entire languages, rather than to refer to individual syntactic features. Stroscot, like every other language, is “multi-paradigm” - even assembly is multi-paradigm since it is imperative (syscalls) and structured (conditional jump). So the adjectives “object-oriented”, “functional”, etc. are avoided outside of this page in favor of the specific semantic constructs, since “functional object-oriented language” sounds weird. Still, it’s good to have a map from paradigms to constructs, and to know which constructs embed into which other constructs. This list is based on Wikipedia’s list of paradigms:

  • Action: action descriptions are given by the state trajectory relation

  • Array-oriented functions are still functions

  • Automata-based:

    • Nondeterministic automata are given by a transition relation.

    • Deterministic automata are given by a transition relation that is a function.

  • concurrency - concurrent programs are given as imperative programs that use concurrent operations

    • agents/actors/flow-based processes are threads with a main dispatch loop

  • data-driven programming is a main loop over condition-action pairs

  • declarative is a logical relation or a function

    • functional - functions are total functional binary relations

      • lambdas are anonymous functions

    • logic - a logical relation is a set of tuples

      • boolean operations are logical constraints, i.e. relations over a certain domain

    • constraint: constraints are 0-1 loss functions in an optimization problem

    • dataflow is a block in static single assignment form

    • a reactive or incremental program is a state value plus a state update function or command

    • a query is a function that takes a database and produces a list of results

  • differentiable: the derivative maps a function \(f\) and a point \(x\) to a linear operator \(A\) such that \(\lim _{\|h\|\to 0}{\frac {\|f(x+h)-f(x)-Ah\|}{\|h\|}}=0\).

  • dynamic: eval is a function from strings to values (and optionally with an environment)

  • event driven: an ED program is some event handler functions, data binding event handlers to events, and a main loop function (provided by a library) that repeatedly checks for events and calls the matching event handler

  • generic functions are just functions over a large domain

  • imperative programming:

    • commands can be represented as a tag (with payload) plus a callback function returning another command (see the sketch after this list)

    • mutable variables are modeled using read and modify functions on an implicitly passed/returned store.

    • procedures are functions from arguments to commands

  • Metaprogramming:

    • Attribute-oriented: attributes are a function from symbols to metadata

    • Macros: macros are functions that take an AST and a lexical environment

  • Nondeterministic: a nondeterministic function is a relation

  • Parallel: a block in static single assignment form can be easily parallelized using a concurrent worker pool

  • Process-oriented programs can be represented using concurrent operations

  • probabilistic programs are functions from parameters to a log probability

  • Quantum:

    • quantum logic gates are functions, in particular unitary operators on states of qubits

    • a quantum program is a block, consisting of gate applications and discarding information (Qunity)

  • Set-theoretic: set membership is a boolean predicate function

  • Stack-based: a stack-oriented program is a function on stacks, a.k.a. lists

  • structured:

    • loops are recursive functions

    • conditionals are lazy functions

    • Block-structured: block sequencing is Kleisli arrow composition, a function

    • Object-oriented: objects are mutable variables containing records of mutable variables and functions

    • Class-based: classes are types

    • recursion is syntax for applying a fixpoint function

  • Symbolic: an AST is a value

  • Value-level: types are sets
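
To make a few of these embeddings concrete, here is a toy Haskell sketch (my own encoding for illustration, not Stroscot code): commands as a tag with a payload plus a callback returning the next command, procedures as functions from arguments to commands, loops as recursive functions, and conditionals as lazy functions.

  -- Commands: a tag with a payload, plus a callback returning the next command.
  data Command
    = Print String Command          -- payload: the text; then the next command
    | ReadLine (String -> Command)  -- callback receives the input line
    | Done

  -- An interpreter gives the commands their imperative meaning.
  run :: Command -> IO ()
  run (Print s next) = putStrLn s >> run next
  run (ReadLine k)   = getLine >>= run . k
  run Done           = return ()

  -- A procedure is a function from arguments to a command.
  echoTwice :: String -> Command
  echoTwice prompt =
    Print prompt (ReadLine (\line -> Print line (Print line Done)))

  -- Loops are recursive functions; conditionals are lazy functions.
  while :: (s -> Bool) -> (s -> s) -> s -> s
  while cond step s = if cond s then while cond step (step s) else s

  main :: IO ()
  main = do
    run (echoTwice "say something:")
    print (while (< 10) (* 2) 1)   -- prints 16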

In addition I’ve found some other paradigms too obscure for the WP list:

  • term rewriting systems are given by the rewriting relation

  • optimization problems are relations based on objective functions

  • optimization solvers are functions from objective functions to a list of solutions

  • aspect-oriented: discussed on the “Aspects” page.

Some more paradigms that aren’t really paradigms at all, just libraries or syntactic tricks:

  • pattern matching: match and extract data from complex data structures.

  • functional reactive programming: build applications that respond to changes in data over time.

  • concurrent constraint programming: express and solve problems involving concurrent processes and constraints.

  • genetic programming: evolve solutions to problems using principles from genetics and natural selection.

Graph of paradigms

digraph paradigms {
  action -> relation
  array -> function
  "nondet automata" -> relation
  "det automata" -> function
  concurrency -> command
  actor -> concurrency
  agent -> concurrency
  flow -> concurrency
  actor -> loop
  agent -> loop
  flow -> loop
  "data-driven" -> loop
  "data-driven" -> condition
  "data-driven" -> function
  "data-driven" -> command
  declarative -> relation
  declarative -> function
  lambda -> function
  function -> relation
  relation -> set
  boolean -> constraint
  constraint -> optimization
  dataflow -> block
  reactive -> function
  reactive -> command
  query -> function
  differentiable -> function
  dynamic -> function
  event -> function
  event -> loop
  generic -> function
  command -> function
  "mutable variable" -> function
  procedure -> function
  attribute -> function
  macro -> function
  nondeterministic -> relation
  parallel -> block
  parallel -> concurrency
  process -> concurrency
  probabilistic -> function
  quantum -> function
  quantum -> block
  set -> boolean
  stack -> function
  loop -> function
  loop -> recursion
  conditional -> function
  block -> function
  object -> "mutable variable"
  class -> type
  recursion -> function
  type -> set
  "term rewriting" -> relation
  optimization -> relation
  optimization -> function
}

Graphviz has chosen “function” as the central paradigm. This agrees well with experience. Quoting Spivak, “the most important concept in all of mathematics is that of a function - in almost every branch of modern mathematics functions turn out to be the central objects of investigation.” Looking closer, function is part of an SCC: function, relation, set, boolean, constraint, optimization. Although lambdas provide a natural way to express many functions, the mathematical notion of function is broader than just lambdas - some mathematically definable functions have no efficient/constructive algorithm and are instead specified as a logical relation or optimization predicate. So we need constraint logic programming as well to get the full notion of “function”. Hence the ultimate paradigm is functional logic programming. Thus, Stroscot is at its core designed to be a functional logic programming language, but with support for many other programming paradigms implemented via the embeddings described above.
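
As a sketch of how the function/relation embedding works in practice, here is the classic “list of successes” encoding in Haskell (the parent/grandparent names are invented for the example): a relation is a function returning every related value, which gives logic-programming-style search inside a functional language.

  -- A relation between a and b, encoded as a function to a list of successes.
  type Relation a b = a -> [b]

  parent :: Relation String String
  parent "alice" = ["bob", "carol"]
  parent "bob"   = ["dave"]
  parent _       = []

  -- Relational composition: a grandparent is a parent of a parent.
  grandparent :: Relation String String
  grandparent x = [z | y <- parent x, z <- parent y]

  main :: IO ()
  main = print (grandparent "alice")   -- ["dave"]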

Goals

The ultimate

Stroscot aims to be the ultimate programming language, rather than something just alright. The goal is to win the ultimate showdown of ultimate destiny w.r.t. programming languages. This has been called “silly” by Dennis Ritchie (author of C) and “the dream of immature programmers” by Bjarne Stroustrup (author of C++), [Sut00] but I think it can be made to work. A lot of language features have become standardized, which wasn’t the case in 2000, and for the other “unique” features there has been enough research to establish a clear hierarchy of power. To bring in an analogy with weapons, the question of which firearm is strongest is quite subjective and a matter of debate, among other reasons due to the capacity vs. weight tradeoff. But the Tsar Bomba is without question the strongest weapon in history, and makes such debates irrelevant - all you need is a single giant bomb, and making more of them would be a waste of resources. And when the standard interface for deploying such a weapon is pushing a button, the choice of what the button should look like is essentially a bikeshedding debate - it’s just a button and any choice of style and color will do (although of course red is traditional). In this analogy Stroscot would be an early nuke prototype - I’m not claiming it’s the biggest baddest language, but at least it will point the way towards designing such languages in the future.

Stroustrup claims there are “genuine design choices and tradeoffs” to consider, which I agree with up to a point. Many queries in a compiler are too expensive to compute exactly and the method used to approximate the answer can be refined or optimized. There are competing approaches to answering these questions and methods of combining solvers to obtain more precise answers. The time/precision tradeoff here is real. But these are implementation tradeoffs, and don’t affect the overall design of the language. While there may not be a best solver, there is a best set of syntax and features, at least until you get to details so minor that they are matters of personal taste.

Global maximum

Stroscot aims to be a global maximum of features and syntax. That is, pick any reasonable set of optimization criteria, and Stroscot should come out best.

World domination

Stroscot aims to replace all the programming languages in use today. Mainly this involves improving FFI support and interoperability with C and C++. In particular we need to be able to parse headers and use data from them with Stroscot. Since headers include code we need to be able to fully compile C/C++, so that Stroscot is the sole compiler and all of its global optimizations can be used (zig cc is an example of how this works). The linkage is asymmetric - you can export specific C-style constructs back to C, but C can’t use functions that depend on more advanced features.

Once the C/C++ implementation is stable enough for production use, focus will shift to developing automated conversion tools for other languages like Python and Java, so that the surface syntax can be changed to Stroscot’s. And yes, this is the E-E-E strategy, but Stroscot is open source so it’s all OK.

Standardization doesn’t seem necessary. A popular language builds its own standard. Python, the world’s most popular language as of July 2022, has never been formally standardized. But there needs to be an open-source cross-platform implementation, with a committee process for changes to build consensus and ensure stability. Another alternative is to freeze Stroscot after release and design a new language every 3 years, but that requires creating new names and websites so it’s easier to evolve gradually.

Functionality

Stroscot aims to be a wide-spectrum language. That is, for every way to do X, Stroscot should also allow doing X in that way. The logic behind this is simple: If Stroscot can’t do X, then people will choose to use another language that can do X. Practically, I have limited the domain of X to activities expressed in research publications and other programming languages, i.e., a systematic survey, so that the amount of functionality to consider is at least finite. I’ve mainly found novel ideas and techniques in obscure papers from decades ago, but there have also been a rare few published in the past few years. It is actually really hard to come up with better ideas than the old papers. And I’m not aware of any other programming languages that have tried to do a systematic search through the literature for features; academic languages are narrowly focused and practical languages do not innovate much. So Stroscot is at least somewhat innovative in its design by aiming for functionality in this way.

Motivation for this comes from [IHR+79] (edited):

We believe that the language designer should not forbid a facility. He should never take the attitude of the Newspeak [Or 50] designer:

“Don’t you see that the whole aim of Newspeak is to narrow the range of thought? In the end we shall make thought-crime impossible, because there will be no words in which to express it.”

Rather, he should always strive to expand the expressive power of the language.

Many languages suffer from “idea envy”, where they try to retrofit new ideas from other languages. For example C++ and Java have added lambdas relatively recently. When a programming language changes significantly in this way, it loses its identity - for example, Python 2 and Python 3 are effectively separate programming languages, as are Perl 5 and Raku (Perl 6). There are already projects that advertise themselves as “modern C++17” rather than simply “C++”. A new language needs new tools and new libraries; in this case, a split between non-updated and updated C++ tools. Minimizing the number of new languages / breaking language changes is best. The source of these changes is clear: ideas that were missed out on in the initial design. The lambda calculus dates to the 1930s, and anonymous functions were included in Lisp in 1958, long before C++ was designed in the 1980s. The retrofitting in C++ is due to a shallow intellectual base. By instead preferring coverage of all functionality from the start, we ensure a future-proof design. Even if new ideas emerge after the initial design, they are generally small tweaks on old ideas. With sufficient research these old ideas can be uncovered and incorporated, making it a minimal change to accommodate the new ideas.

You may point to INTERCAL’s COMEFROM as something best avoided, but it’s not hard to implement. The trickier parts are actually at the low level, interfacing memory management and calling conventions, and the value proposition there for a powerful interface should be clear. Providing a broad set of features will mean that the language is suitable for whatever project someone is thinking about. Another theory is that, even if Stroscot fails as a language, implementing lots of features will make people copy Stroscot’s list of features.

Against Stroscot’s goal of increasing functionality, there is a general sentiment in the industry that, as Jamie Willis put it, “if you increase power, you increase problems”. This suggests that it might be better to avoid powerful features. Willis clarifies that, by including more specialized and restricted abstractions, the language is easier to use. I broadly agree with this second statement; structured programming with loops is easier for beginners to use than the goto statement. But I do not think that adding structured programming constructs makes goto unnecessary. Indeed, C still has goto, and Linux kernel programmers use it regularly. Java specialized the construct further into a “break label” statement that functions much like goto, except that it can only jump to the top or bottom of a loop. This regularly causes complaints in various circles; for example, decompilers need a quite complex algorithm to translate the JVM “goto” instructions back into loops and break statements, and this algorithm fails often enough that decompiled code actually contains invalid goto statements. It is much better to simply include the goto statement, the loops, and the “break label” concept too.

More generally, I think including powerful constructs makes the language more expressive and more powerful. The programmer has less friction searching for the right construct, less difficulty expressing their intent, and less problems overall. For example, it’s hard to argue that SQL is too powerful - quite the opposite, most people criticize it for its lack of expressiveness and poor portability. The declarative aspect does introduce certain unique tasks, such as query optimization, but performance would be a problem regardless so this is not introducing a new problem. And in fact it is easier to optimize a query using the appropriate tools than it is to rewrite the corresponding imperative program.

Turtles all the way down

This is an Ecstasy principle. But it’s misleading - going infinitely downward would require infinite space. Actually it is a finite list plus a trick to make it infinite, namely that the objects at some point refer back to themselves. This pointing trick is the useful part, hence why Stroscot supports infinite structures. But this sort of “can you do this trick?” question is covered by the functionality goal.
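
As a minimal illustration of the trick in Haskell (not Stroscot code): a finite definition that refers back to itself denotes an infinite structure, and laziness lets you consume only as much of it as you need.

  -- A finite definition whose tail refers back to itself: infinitely many ones.
  ones :: [Int]
  ones = 1 : ones

  -- A cyclic structure: [1,2,3,1,2,3,...], tied into a knot with one definition.
  cycle123 :: [Int]
  cycle123 = 1 : 2 : 3 : cycle123

  main :: IO ()
  main = print (take 7 cycle123)   -- [1,2,3,1,2,3,1]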

Minimal core

Tinman I5 “The source language will contain a simple, clearly identifiable base, or kernel, which houses all the power of the language. To the extent possible, the base will be minimal with each feature providing a single unique capability not otherwise duplicated in the base. The choice of the base will not detract from the efficiency, safety, or understandability of the language.”

Minimalism is bad. If you build on an existing language but include no new features, then there’s no incentive to use your language. If your language only provides a minimal Turing-complete set of operations like Brainfuck, figuring out how to express programs in it will be difficult, and the resulting encoding most likely will be incomprehensible. Thus, minimalism must take second priority to functionality. But, given that we must provide all possible features, minimalism offers an approach to implementing them in a methodical, useful manner.

Certainly, there is the possibility of just implementing them all independently as some sort of hodgepodge, but I like GHC’s structure of having a smallish “core” language (System FC), and translating the rest of the language down to it. In fact there is not much to Haskell besides System FC; the language proper is quite small, and most of the idioms of Haskell are implemented in libraries. Similarly, for Stroscot I would like to define a “core” language that provides only the basic, necessary abstractions and tools for defining more abstractions, such as macros and syntactic extensions. Then the compiler only has to focus on handling these core constructs well, but the standard library can implement all the parts that users interact with. With suitable abstraction facilities, this approach doesn’t lose any expressiveness because we can still implement any language construct we can think of. We have not “surrender[ed] the adequate representation of a single datum of experience”, but merely reduced the reducible elements. We can satisfy Steelman 1E: “The [core] language should not contain unnecessary complexity. It should have a consistent semantic structure that minimizes the number of underlying concepts. It should be as small as possible consistent with the needs of the intended applications. It should have few special cases and should be composed from features that are individually simple in their semantics.”
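
As a rough illustration of the desugaring approach (a toy Haskell sketch, not GHC’s actual Core and not a committed Stroscot design): the core is a handful of constructors, and a surface feature like if-expressions is translated into the core rather than being a core construct itself.

  -- A toy core language: a handful of constructors; everything else desugars to it.
  data Core
    = Var String
    | Lit Int
    | Lam String Core
    | App Core Core
    | Con String [Core]                     -- saturated constructor application
    | Case Core [(String, [String], Core)]  -- scrutinee and alternatives
    deriving Show

  -- A surface-language feature that is not in the core...
  data Surface = SVar String | SLit Int | If Surface Surface Surface

  -- ...is desugared into a case over the booleans.
  desugar :: Surface -> Core
  desugar (SVar x)   = Var x
  desugar (SLit n)   = Lit n
  desugar (If c t e) = Case (desugar c)
                         [ ("True",  [], desugar t)
                         , ("False", [], desugar e) ]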

The surface language is still complex, modern, and slick. Developers can focus on learning the core language’s general constructs, and then learn libraries by reading their source code, or they can follow more of a “learn by doing” approach where they learn the libraries they like from the documentation and examples, without understanding the implementation.

So what defines the “core” language? Well, per Einstein, each element should be basic, simple, and irreducible, and there should be as few elements as possible. More formally, we can consider the “core” as an orthonormal basis in an inner product space, with vectors as programming elements. Then our “core” must satisfy the following conditions:

  • spanning: every element can be written (macro-expressed) as some combination of the core elements

  • linear independence: this representation in terms of the core elements is unique (up to some notion of equivalence). In particular, no core element should be macro-expressible in terms of the other core elements.

  • orthogonality: The dot product of any two core elements should be 0. Said another way, for all scalars \(r,s\) and core elements \(x,y\), \(\|r x\|\leq \|r x+sy\|\). In words, the combination of two core elements is at least as powerful/expressive as either element individually.

  • units: The norm of each core element should be 1. I interpret this as that each core element should be Turing-complete but not require an oracle, and correspond to one syntactic construct. In terms of macro expressibility, there shouldn’t be overly-specific elements or overly-general elements. Overly-specific elements cause clutter, while overly general elements are too hard to understand. Honestly this requirement is a ball of mud and just requiring an orthogonal basis or a basis at all seems sufficient.

For example, COMEFROM can be implemented with continuations and macros (c.f. this Python decorator). We can thus move COMEFROM to the standard library, and define a “core” subset of the language that contains only continuations and macros. By repeating this sort of exclusionary process, we can construct a minimal “basis” of core features, in the sense that none are redundant. Fewer concepts simplifies the whole language, and approximates Python’s goal of “There should be one– and preferably only one –obvious way to do it.”

Also, a core improves stability. Cameron has pointed out that the “core” is not set in stone and may need changes, particularly early in development. We may find out that an element is simply not needed at all, or is too complex in its current form and can be further simplified. We may find new elements that were missed in the initial design. For example, comparing GHC’s Core Expr datatype from 1998 to the present day, we find many changes: addition of a type field to cases, removal of constructor applications (in favor of an expanded Var type), addition of special-cased primitive literals, expansion of Note into Cast and Tick alternatives, removal of an “f” type parameter, addition of coercions. But in 20 years, 6 of the 8 constructors were essentially unchanged, and the remaining changes fall under the category of minor additions or “polishing”. For the most part, by virtue of its design constraints, the core is remarkably stable and can safely be used as an interface between the compiler and the rest of the language (the standard library).

Learnability

It’s often not that easy to learn a language. Google searches will often yield irrelevant results. Official documentation can be useful, but is often filled with terse wording, links to lengthy discussions containing irrelevant detail, and TODOs. The truth can be found in the compiler source code, but this often has one-letter variable names, very few comments, and an assumption that you know the coding style and design of the compiler.

Learnability means making things easier for generations of beginners by making the language “intuitive” so that language choices can be guessed rather than looked up. There is some amount of English discrimination involved, as the learnability studies’ “beginners” are limited to English speakers in Western colleges, but English is the most popular language, and there is always the option of translating Stroscot to other languages.

Learnability does not necessarily mean making the language similar to existing languages. Such a language might be easier for experts to learn in the short run, but in the long run (assuming Stroscot is successful) there will be many more novices than experts that need to learn the language, so the novices should be prioritized.

Concision

If there is a verbose syntax and a terse syntax (as measured by characters or screen space usage), both equally learnable, then the terse syntax is better, because the program can be more cheaply printed out and literate documentation is mainly made up of the prose/code comments rather than code.

APL is sometimes criticized for being too concise, but the actual (learnability) issue with APL is that, like Chinese, it has a lot of symbols, and hence novices and experts alike suffer from character amnesia. J uses ASCII symbols, which mitigates the issue, and is praised for its terseness. But it is still difficult for novices to learn (basically you have to memorize this page), so a syntax based on English words may be better.

Simplicity

In his talk “Simple Made Easy”, Rich Hickey lists four words (etymologies from Wiktionary rather than him):

  1. simple - literally “same fold”, consisting of a single part or aspect. An objective criterion about avoiding too many features, basically minimalism.

  2. complex - braided together or weaved together. Hickey also uses “complect”, meaning to braid things together and make them more complex. Also an objective criterion, about avoiding feature overlap.

  3. easy - literally “lying next to”, “bordering on”. A subjective criterion about a task being within the grasp of a particular person and toolset.

  4. hard - literally “strong” or “powerful”. A subjective criterion about whether changing the software requires a lot of effort.

Hickey tries to say that simple is the opposite of complex and easy is the opposite of hard, but the etymologies and definitions don’t really agree. We must be careful about distinguishing these words. Consider this $1 Split Woven Pouch Single String Sling. It’s simple, because it’s only one string. It’s complex, because he weaved the string with itself. It’s easy to make, because you just have to buy the string and follow the tutorial. It’s hard, because he made the knots really tight and the finished product is quite stiff. So these qualities are not mutually exclusive at all.

Similarly, Stroscot aims for all four of these:

  • simple - Stroscot does aim to be “simple”, in the etymological sense of “minimalism”. Stroscot concentrates the language into a “core”, a basis of features that can express all others and is as small as possible.

  • complex - The rest of the language - the standard libraries, user libraries, and user programs - braids the core features together and is “complex”. Hickey argues that a project should not be complex (complected), but he is using a different metric, of how interleaved the conceptual parts of the program are, rather than its interleaving of language features. There is some benefit to ensuring a tree-structured call graph in a program, but I don’t think this is a good argument to remove recursion.

  • easy - Stroscot aims for its standard library to make things “easy”, doable without much training and in few lines of code. There’s no downside, right?

  • hard - Stroscot also aims to have a “strong”, “powerful” standard library, that doesn’t change often, in other words a “hard” standard library.

Looking at this, despite my saying that Stroscot aims to be simple in the sense of minimality or mathematical elegance, it doesn’t seem that the language can be marketed as simple; there are just too many mixed messages. The fault does not lie with Stroscot, but rather the inadequacy of these words to express all aspects of an intricate design. As Edsger Dijkstra put it, “complexity sells better”. If you spend all this time hyping up a language, and then it turns out it’s so simple the design fits on a postcard, your audience will feel cheated and dismiss the result as trivial. As measured by [YaofeiChenDM+05], “simplicity” and “implementability” are both correlated with a lack of adoption as a developer’s primary language, while “extensibility” and “generality” are preferred. Fortunately though, this is all in the marketing. For example, people seem to say that Haskell is extremely complex, but in the sense of Dijkstra, Haskell is “just syntax sugar” for System F, and has a simple theory. GHC Core is 9 constructors. It is “only” the libraries and syntax sugar that add in the complexity.

There is another quote that I think sheds some insight:

And now, having spoken of the men born of the pilot’s craft, I shall say something about the tool with which they work - the airplane. Have you looked at a modern airplane? Have you followed from year to year the evolution of its lines? Have you ever thought, not only about the airplane but about whatever man builds, that all of man’s industrial efforts, all his computations and calculations, all the nights spent over working draughts and blueprints, invariably culminate in the production of a thing whose sole and guiding principle is the ultimate principle of simplicity?

It is as if there were a natural law which ordained that to achieve this end, to refine the curve of a piece of furniture, or a ship’s keel, or the fuselage of an airplane, until gradually it partakes of the elementary purity of the curve of a human breast or shoulder, there must be the experimentation of several generations of craftsmen. In anything at all, perfection is finally attained not when there is no longer anything to add, but when there is no longer anything to take away, when a body has been stripped down to its nakedness.

There was a time when a flyer sat at the center of a complicated works. Flight set us factory problems. The indicators that oscillated on the instrument panel warned us of a thousand dangers. But in the machine of today we forget that motors are whirring: the motor, finally, has come to fulfill its function, which is to whirr as a heart beats - and we give no thought to the beating of our heart. Thus, precisely because it is perfect the machine dissembles its own existence instead of forcing itself upon our notice.

—Antoine de Saint Exupéry, Terre des Hommes (1939), as translated into English as Wind, Sand and Stars by Lewis Galantière

In the sense of number of things, an airplane is not simple at all. A 747 has millions of parts, almost all different. Pretty much everything that can be added to the design has been added (besides leg room - they charge extra for that). It is not simple in the sense of being understandable or easy to remember. Like how it is impossible to draw a bike from memory, it is pretty hard to draw the curve of an airplane wing from memory, to any degree of accuracy, or describe the design of a jet engine, or anything like that. What Antoine seems to be getting at is that there is a unity of purpose: an airplane is designed to move across the sky, and the natural form of the wings ended up being a “simple” curve. And a bicycle is designed for pedaling. All you have to do is get on, balance well, and pedal - how much simpler can it get? It is sort of the maxim of “form follows function” - when the function is to interact with humans, the controls get super-simplified and easy to use, so that even a child could learn to use them. Automatic cars are similar: they have a go pedal and a stop pedal, just like a go-kart at an amusement park.

So from this Antoine quote, when people say they want a “simple” language, it seems what they really want is a “usable” or “learnable” language. They don’t care about how many millions of lines of code the compiler is. They care about how easy it is to download and try out and write their new pet project. Haskell falls down in this regard - despite numerous tutorials, monads are still hard to understand.

Simplicity of implementation

Now when some people talk about simplicity they really do mean the compiler. For example Carbon says “design features to be simple to implement.” (itself a C++ goal) Borretti has a post on this, where he says there are two language strategies: be simple and cheap (“small”), or become irreplaceable critical infrastructure of many large organizations (“big”). Notably in this post he categorizes C++ into the “big” category, so when Carbon and C++ are aspiring to be small and simple this is because in reality they are anything but.

Borretti of course likes the “small” approach but I don’t think it’s a guarantee of success. If a language is useless, it could still be simple and cheap but nonetheless get no serious “optimizing compiler” implementations (witness INTERCAL - there are no optimizing implementations of INTERCAL). A language has to have functionality to be used, and functionality inevitably causes scope creep and implementation complexity. Lisp used to be advertised as a small, simple language, but by Borretti’s own admission the Common Lisp spec is thousands of pages. Scheme is admittedly a smaller language, but that is because there was a huge flamewar over R7RS-large (HN) and the “small language” advocates stayed while everyone else migrated to Racket.

Now it is true that toy implementations of almost any language can be written in an evening (c.f. the PL Zoo), but a serious optimizing compiler requires a level of skill to maintain that is associated with payment. Who will pay? Generally, companies using the language opt to pay for it. If a language does not even offer value sufficient for a part-time maintainer, then all I can say is that it is a failed language - plenty of languages are successful at least at that level, such as Crystal, Nim, Zig, and Squirrel.

Also, it is maintenance cost that matters, not cost of implementation or conceptual complexity - the issues are whether the compiler’s code is easy to port to new architectures, adapts well to replacement of algorithms, and accepts language evolutions. With the minimal core approach, most of the language is in its libraries and building a new compiler only has to deal with the core language. Now, it is true that the less technical debt there is, the less maintenance cost there is. But without implementing features there is also no language. As long as the implementation is maintainable, there is no issue with piling on features and implementation complexity beyond the realm of common sense.

As for the position that the compiler should follow a specification, there I agree, but that is just because documentation is necessary. There are no restrictions on the length of the spec. As long as each aspect of the specification is justified, it could be 10, 100, 1000, or even 10,000 pages without really changing anything. For example the C++ spec is 1815 pages. Now if you count, pages 479-1591 discuss the standard library, and 1663-1807 are indices and cross references, so really the spec is only 557 pages. But this is still a lot longer than the initial C++ Programming Language book published by Bjarne Stroustrup in 1986 (327 pages per Google Books), and the Austral spec (50 pages per Borretti’s counting).

Familiarity

Per Grace Hopper, “the most dangerous phrase [one] can say is ‘We’ve always done it that way’.” According to some guy <https://medium.com/geekculture/3-busted-myths-about-the-35-hour-week-that-you-should-present-to-your-boss-efa5403bb263>, the golden rule at his university was that anyone who said that phrase was a lousy engineer. Hopper continues <https://books.google.com/books?id=3u9H-xL4sZAC&lpg=PA9&vq=%22most%20dangerous%22&pg=PA9#v=snippet&q=%22most%20dangerous%22&f=false>: “If we base our plans on the present, we fall behind and the cost of carrying out something may be more costly than not implementing it. But there is a line. If you step over it, you don’t get the budget. However, you must come as close to it as you can. And you must keep pushing the line out further. We must not only accept new concepts, we must manage their development and growth.”

Per Simon, C’s operator precedence, C++’s use of <> for generics, and C#’s design of properties are all examples of suboptimal, legacy decisions. They were designed based on limited information but in hindsight it has become clear that better choices exist. Nonetheless they continue to be adopted by new languages on the basis of “familiarity” - people are so used to the suboptimal behavior that they will complain if it changes.

For Stroscot, is it worth repeating these mistakes for the benefit of “familiarity”? Familiarity will not help beginners learn the language. Generally, we should understand why these choices were made, and consider if those reasons are still valid. For C’s operator precedence, there is essentially no basis - it is just historical baggage. But the operators themselves have some presence, so it is definitely worth including functions like shift_right. With extensible syntax, the standard library can decide whether these functions need infix syntax or not - it is not even a consideration in the compiler. The decision will have to be balanced on the basis of how often programmers use these operators vs. how disruptive it is to see an operator with no explanation. At the end of the day, these sorts of syntax decisions are minor annoyances, so they don’t really impact the ability to accomplish things - all that matters is consistency and that the justification for the decision is clear.
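
Haskell already works roughly this way and gives a feel for the idea: shifting is an ordinary library function (shiftR from Data.Bits), and a library - not the compiler - chooses whether to give it an infix spelling and a precedence. The operator name .>>. and its fixity below are invented for the example.

  import Data.Bits (shiftR)

  -- A library-defined infix spelling for shift_right, with an explicit precedence.
  infixl 8 .>>.
  (.>>.) :: Int -> Int -> Int
  (.>>.) = shiftR

  main :: IO ()
  main = do
    print (shiftR 256 4)   -- named-function call: 16
    print (256 .>>. 4)     -- infix spelling chosen by the library: 16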

What is the impact of a choice to deliberately be unfamiliar? Maybe experienced programmers will get so fed up that they will post “ragequit” posts to social media. But I think, so long as discussion can point to a solid basis for the changes, these will most likely serve to draw positive attention to the language. Anybody who uses the language for a while will get used to it. And actually the people who are willing to learn a new language are likely looking for something new and are willing to adapt, so they won’t ragequit. Succinct migration guides for users from various popular languages will get these users up to speed.

There is another sense of familiarity though in the sense of creating a “brand” for the language. Some languages take this in the sense of not allowing any room for major changes in the design once the language reaches a beta. Minor migrations would be possible, but for example switching from curried to uncurried functions would be forbidden because they would annoy too many people. This requires doing essentially all of the designing up-front. I’m kind of split on this. On the one hand, it is good to have a strong design. On the other hand, changes inevitably occur and it is better to plan to make unexpected changes. I think the goal to be “the ultimate” establishes the brand more - and requiring changes to be accompanied by evidence provides a good compromise between language identity and evolution.

Another important concept is being intuitive/memorable, as can be tested via cloze completion and “what does this piece of code do”. Ideally someone should be able to read the manual and write some throwaway Stroscot code, abandon Stroscot for 6 months, and then come back and correctly type out some new Stroscot code without having to look at the manual again. If Stroscot the language is a moving target this goal is difficult to accomplish. That being said though, like Poettering said nothing is ever finished and it is better to track the progress of technology.

Readability

Humans interact with code in a variety of ways: skimming, reading, writing, understanding, designing, discussing, reviewing, and refactoring code, as well as learning and teaching how to code. Focusing on readability means leaving out other activities. Ergonomics covers all of these activities and is perhaps too broad, being difficult to connect directly with development costs. Humans have limitations in the domains of perception, memory, reasoning, and decision-making and the language should take these HCI factors into account. The design should aim for productivity, ergonomics, and comfort, reducing errors and fatigue and making the language accessible.

Using the literal definition, “ease of understanding code”, readability is measured as the edit-test cycle time. Yue Yao says “The shorter the ‘edit-compile-run’ cycle, the happier the programmer.” Per here, the cycle time can be broken down into 70% Understanding Code, 25% Modifying Existing Code, 5% Writing New Code. In particular we estimate that there is 14x as much read time as write time. But this estimate is probably only appropriate for application code - the true average varies depending on scenario. Per APL, if a language is quick to program in, it may be faster to write small programs from scratch than to read and understand another person’s program. So the 70/25/5 may turn into something more like 50/20/30 in a scripting context, only a 1.6x read-write factor. On the other hand, common library functions may be read many times but only modified or added rarely, giving read/write factors of 100x, 1000x, or more.

Steelman 1C lists “clarity, understandability, and modifiability of programs” as the meaning of readability. This also accords with cycle time - clarity involves skimming and debugging, understandability involves reading, and modifiability involves writing. Notably it does not accord with the intuitive understanding of readability as reading - since when has modification been part of readability?

Cycle time has the benefit of being empirically measurable - just provide some code and an editing task, time it, and average across a pool of subjects. In contrast, readability per se is more subjective - the author of some code will most likely consider their code perfectly readable, particularly immediately after writing said code, even if an average programmer would not. Of course, in a week or a few years, depending on the author’s memory, any domain-specific knowledge will fade away and the author will struggle with their code just as much as any average programmer, but waiting ages just to convince someone of their code’s innate (un)readability is not feasible.

Most articles that discuss readability go on to describe “readable code”, defined by various properties:

  • Meaningful variable and function names (“self-commenting”)

  • Consistent identifier style, indentation, and spacing

  • Comments that explain the purpose of each function

  • Comments that explain non-obvious parts

  • Intermediate variables to avoid complex expressions

  • Intermediate functions to avoid deep nesting of control structures and ensure each function has a single purpose

  • Parentheses that make the order of operations clear

These definitions are somewhat subjective and unreliable. What makes a name meaningful? How deep and complex can an expression/function get before it needs to be broken up? Should the “consistent identifier style” be camel case or snake case? With a loose reading, most libraries and style guides qualify as readable, in that there is always somebody who will argue that the existing choice is the best. The cycle time principle provides a framework for evaluating these choices objectively, although it is still dependent on a subject pool and hence the scientific literature. In fact studies have validated many specific guidelines as empirically reducing time to understand, e.g. in the underscores vs. camel case debate finding a definitive benefit for underscores.

Cycle time also accounts for the aphorism “Perfect is the enemy of good”. One could spend hours optimizing for readability by fixing spelling mistakes and other nits and not get anything useful done. In the time it takes to write a long descriptive comment or poll coworkers for a meaningful variable name, one could have skipped writing comments, used 1-letter names, run and debugged the code, and moved on to a new task. Perfect readability is not the goal - the code just has to be understandable enough that any further readability improvements would take more cycle time than they will save in the future. And with hyperbolic discounting, reducing future maintenance effort is generally not as important as shipping working code now. This calculation does flip though when considering the programming language syntax and standard library, where small readability improvements can save time for millions of programmers (assuming the language becomes popular, so there is again a discounting factor).

Not included in cycle time (or readability) is the time to initially write a program. Maintenance cost is much more important in the long run than the initial investment for most programs. This is borne out when Steelman 1C lists readability under maintenance.

Terseness

APL is terse mainly due to its use of symbols, and [Hol78] mentions that some consider terseness an advantage. But is it really? An APL program may be short but if the APL program requires looking up symbols in a vocabulary while a normal word-based program is a little more verbose but self-contained, then the word-based program wins on cycle time.

Iverson argues the human mind has a limit on how many symbols it can manipulate simultaneously. A terser notation allows larger problems to be comprehended and worked with. But this ignores the role of chunking: a novice chess player works with symbols representing individual pieces, while an expert player works with symbols representing configurations of the entire board. Similarly, a novice programmer might have to look up individual functions, but a programming expert will work on the level of program patterns, for example CRUD or the design patterns of Java, and the amount of verbiage involved in writing such patterns is immaterial to mental manipulation but rather only becomes relevant in two places:

  • the time necessary to scan through unfamiliar codebases and comprehend their patterns. This can be reduced by making programming patterns easy to recognize (distinctive). APL’s overloading of monadic and dyadic function symbols seems to conflate distinct functions and go against this consideration.

  • the time needed to write out patterns when moving to implementation. Most programmers type at 30-50 wpm and use autocomplete, which means that even a long identifier requires at most 1-2 seconds. In contrast, for APL, symbols might be found with the hunt and peck method, per Wikipedia 27 wpm / 135 cpm or about 0.4 seconds per symbol. So APL is faster for raw input. But in practice, most of the time programming is spent thinking, and the time writing the program out is only a small fraction of coding. So what is important is how easy it is to remember the words/symbols and bring their representations to mind (the “memory palace” principle), for which APL’s symbols are at a disadvantage due to being pretty much arbitrary.

There is some advantage to terseness in that shorter code listings can be published more easily in books or blog posts, as inline snippets that do not detract from the flow of the text. Documentation works better when the commentary and the code are visible in the same medium. But readability of the code is more important - a barcode is terse too but provides no help without scanning it. Web UX design provides many techniques for creating navigable code listings, e.g. a 1000-line listing can be discussed in a short note with a hyperlink. Accordion folds can be used for 100-line listings, and 10-line listings can be in a two-column format or behind a collapsed accordion fold. So this advantage of terseness seems minimal when considering that code is mostly published on the web these days.

Remember the Vasa

Bjarne Stroustrup seems fond of the phrase “Remember the Vasa” to warn against large last-minute changes. According to Wikipedia, the Vasa was a ship that sank because its center of gravity was too high. Despite rumors that it was redesigned, there is no evidence that any alterations were performed during construction. It appears to have been built almost exactly as its designer Henrik Hybertsson envisioned it. And the design was obviously incorrect - a survey of shipwrights at the inquest after the sinking said the ship design “didn’t have enough belly”. So the only lesson I get is to learn from experienced designers to avoid making mistakes. But this is just T. S. Eliot’s principle to steal from great poets.

Standards

Adoption

How many users should Stroscot have? Well, as with SPJ’s motto of “avoid success at all costs”, there is such a thing as too popular. A widely-adopted language becomes ossified, as nobody wants their code broken. This can be addressed by developing “language change/evolution management” tools, like automatic migration (as in early Go) and the compiler supporting multiple language versions/dialects at once. These should allow any sort of change to be made with minimal breakage to users, even if the language is significantly popular, while still adding minimal delay and overhead to language development. Explicitly, I do not want governance procedures/processes like PEPs or the Rust council for new language features - never solve a problem through social means when there is a technical solution, the technical solution in this case being to add the new feature regardless (per the functionality goal) and put it behind a flag.

So with that out of the way, growth is really a social problem. Do I want to spend my days reading PR’s and writing comments, as Linus Torvalds does, or spend my days coding? Well, I am not really that great a coder. I type code slowly and over-design. Honestly it would be great to design by English. But it is not like everyone will drop what they are doing and be at my beck and call. It is an exchange of interests - Stroscot will have to provide some value to users, and they will have to judge that contributing to Stroscot’s vision is better than using other software. Still though, for individuals that do decide to contribute to Stroscot, I will not turn them away. I think once the technical tools for dealing with adoption are in place, SPJ’s motto is in fact wrong and success is necessary and desirable.

Then there is the question of whether to focus on adoption. I think this is like performance - it definitely matters, it definitely contributes to long-term language progress, and it directly affects financial success (in terms of donations / visibility). So it is worth tracking. But like performance, it is often premature to expend significant effort on adoption. Like buying ads for the language - probably a waste of money compared to improving error messages or some such. Focusing on the core goals of Stroscot like functionality, minimality, learnability, and concision will naturally lead to user adoption in the long term. With link aggregators and a decent website, it is possible to go from zero to 100,000 users in a week (c.f. hitting the front page). But it takes “the perfect storm” of user interests, informative website, and positive comments and votes. I think one simple mark of progress is that the project becomes interesting enough that someone else - unrelated to the project - submits the project to a link aggregator. That is probably the point at which it is worth devoting attention to adoption (as opposed to learnability). I suspect that most languages will need at least 5-10 years of development before reaching their first stable release, followed by another 5 years or so before it starts to take off. That’s all assuming you end up lucky enough for it to actually take off, as there are many languages that instead fade into obscurity. So a language most likely would need at least 10-15 years of development before charting on the TIOBE index or PyPL. Long-term, it is more important to avoid fading into obscurity than to shoot for #1.

Another problem, particularly for languages backed by industry, is that they get semi-popular very quickly, and then suddenly drop off the radar a few years later. This is due to being “all hype” and not really adding anything new. At least in the early days, there is some benefit to discouraging adoption, via burdensome installation requirements or frequent breaking changes. Although they slow adoption in the short term, such policies strengthen the community by forcing members to participate fully or not at all. Those who remain find that their welfare has been increased, because low-quality “what’s going on” content is removed and feedback loops are shorter. The overall language design benefits as a result, and can evolve much faster. (Compare: strict religions that prohibit alcohol and caffeine consumption and modern technology, a random guy pruning the 90% of members who have not posted a message in the past two weeks from his website)

But with this approach, one must be careful that the programming language still provides sufficient value to hold at least some amount of users - otherwise there is no feedback at all. The barriers to adoption must also be reasonable, and similarly barriers to prevent people from leaving are probably limited to implicit ones like language lock-in. It is not worth discouraging users too strongly, as these attempts can backfire with blog posts such as “my terrible experience trying to use Stroscot” or “my terrible experience trying to get rid of Stroscot”, destroying what little reputation the language may have built up. Although discouraging adoption may be the policy, each individual user’s interaction with the community should be engaging and issues that appear should actually be addressed.

There are not really any best practices to encourage adoption but [MR12] makes some observations.

  • Numerous people have made efforts to design programming languages, but almost all of these have failed miserably in terms of adoption. Success is the exception to the rule. Contrariwise, others observe that language usage follows a “fat tail” distribution, meaning that failure is not as bad an outcome as one might expect and even a “failed” language can have some popularity.

  • Successful languages generally have convoluted adoption paths, suggesting that extrinsic factors are influential. (TODO: How influential? Top 10? Top 100?)

  • Language failures can generally be attributed to an incomplete understanding of users’ needs or goals.

  • Evolution or re-invention, by basing a design on existing experiences, increases understanding.

  • Surveying the literature is often advocated but rarely or never followed to a rigorous standard. The main sticking point is that it is difficult to evaluate language features accurately except by attempting to use them in a new language.

  • In the diffusion of innovation model, innovation is communicated over time through different channels along a network. Adoption is a 5-step linear process for each node:

    1. Knowledge: an individual is made aware of the language. Knowledge is spread by impersonal mass communication: blog posts advertised with SEO, links to the homepage on link aggregators such as Reddit and HN, and shoutouts on social media such as Facebook and Twitter. Generally, this process is limited by the relative advantage of the language, the amount of improvement over previous languages. The relative advantage is described succinctly as the “killer app”, a story such as “we switched our <killer app> to Stroscot and sped things up by 300%” (note that this usage differs subtly from popular definitions of “killer app”).

    2. Persuasion: an individual investigates and seeks information, evaluating pros and cons. An FAQ or comparison can provide initial evidence, but may be viewed as biased. Peer communication such as Discord is more effective because it is personalized. An individual may also evaluate reputation, so convincing influential, highly connected individuals and firms to promote the language can be effective. This process is limited by compatibility, how well an innovation integrates into an individual’s needs and beliefs. Consider [Cob06]’s simple model Change Function = f ( Perceived crisis / Total perceived pain of adoption ), where f is just the step function f x | x > 1 = DO_CHANGE; f _ = MAINTAIN_STATUS_QUO. In the terminology of programming languages, a language provides a certain value, but has a switching cost that dissuades adoption, such as the effort of learning the language, or expense of writing bindings to legacy code. The weighing factor for a language is then Benefit / Switching Cost. A firm will decide to adopt if the value of the new language minus the old exceeds the switching cost by a certain threshold. Otherwise, the firm maintains the status quo. A new language will have to provide significant value to be adopted, but an adopted language can survive simply by keeping up with its competitors and keeping the switching cost high. Even such a simple model can become complicated because the costs and benefits are subjective, and may not be correctly perceived.

    3. Decision: an individual makes a decision to adopt. A short elevator pitch allows summarizing the pros and cons. The limiting factor here is simplicity, how easy the idea is to use and understand, as a complex decision may never be made.

    4. Implementation: an individual tries out an innovation and analyzes its use. This is where the reference documentation gets a workout. The limiting factor here is trialability, how easy the language is to experiment with.

    5. Confirmation: an individual finalizes the adoption decision, such as by fully deploying it and publicizing it. Encouraging such individuals to publish experience reports can start the adoption cycle over and cause the language to spread further. The limiting factor here is observability, the ability to get results.

  • Power - A language needs a unified design, and generally this means designating a single person to make the final decisions. Essentially, a single person weighing priorities based on their knowledge of the market and pain points is more effective than group voting. In a committee, nobody feels responsible for the final result, so each person does a shallow analysis of surface costs and benefits. In contrast, an individual feels empowered and really has the incentive to understand the issues deeply and design an effective solution. Absent mitigating factors such as a strong committee chair or shared vision, group design generally results in terrible “kitchen sink” languages. These languages have an incoherent design, with many features that sort of work, but no particular attraction to any actual users. “Kitchen sink” languages are generally short-lived due to the difficulty of implementing an endless stream of special-case features and maintaining the resulting complex, sprawling codebase. Of course, so long as the power structure is clear, delegation of roles and duties is quite reasonable, e.g. designating a person for data analysis.

  • Evidence - Everyone has opinions, but if there’s a disagreement, opinions don’t help much in making a decision. Although common, “The CEO said so” is not really a good reason to choose a particular design. I would rank evidence as follows:

    • Mathematical theory and logic stand on their own; I guess I could verify a proof with Coq or something, but generally a proof is a proof.

    • Semi-automated analysis of source code repositories and developer communications, with manual inspection/validation of the results

    • A survey of users who’ve actually used a language for a while.

    • Experience reports from language designers are also high-quality evidence. There is some error in evolving and repurposing insights from one language to a new language.

    • Anecdotal reports I would say are medium-quality, as the plural of anecdote is data (the “not” version appeared later). It requires filtering out the opinions - what we want are claims, supported or unsupported, rather than simply “I don’t like it”.

    • Testing / prototyping can confirm hypotheses but may fail at identifying broad design considerations.

    • Arguing via toy code examples seems pretty decent, although it can suffer from “cherry-picking”, meaning that the design may not work in practice for code dissimilar to the examples.

    • Flix suggests evaluating features against a list of principles, but I tried it and generally the principles are too vague or unrelated to be useful. Also, the choice of principles is subject to bias. I would say the biggest goal for Stroscot is functionality, because figuring out how to include a feature means the feature must actually be seriously considered, whereas in other languages it is easy to fall into the trap of “don’t know, don’t care”.

  • Feedback - It is quite useful to get feedback from potential users and others, early and often. Feedback, unlike the designer, is not impacted by project history or the designer’s preconceptions. The Pontiac Aztek checked all the boxes regarding functionality, and had the highest customer satisfaction ratings for those who drove it, but every time the focus groups looked at it, they said “it’s the ugliest car in the world and we wouldn’t take it as a gift”. Per Bob Lutz, managers at GM ignored the focus groups, and the Aztek was a flop - barely anybody bought it, because it was indeed too ugly (although it did develop a cult following). However, just showing around a design and asking “what do you think?” has several problems. First, people’s opinions change as they are exposed more - maybe their gut reaction is that they hate it, but if they spend an hour trying it out, they’d love it. The solution is measure, measure, measure - for example, an initial poll and a poll after a tutorial. Another useful trick is limiting the stimulus to what is under study - if syntax is not relevant, don’t present any syntax, and then the discussion will naturally focus on semantics. If the “feel” of the language is being discussed, present a collage of concepts. Second, unstructured responses usually answer the wrong question - what matters is estimating how the design impacts certain business objectives and success criteria, but maybe the interviewee will spend half an hour discussing a tangent. This can be addressed by structuring and timeboxing the feedback with a rubric, and perhaps explaining some background with a video. Of course, qualitative feedback is most important, so the questions should still be open-ended. It is also best to speak to interviewees individually, rather than in a group, so that their opinions do not influence or dominate each other. Individual discussion is more likely to present a balanced opinion, whereas groups can pile on negative feedback. OTOH, a group does enunciate the overall consensus more clearly, and e.g. submitting to HN is a convenient way of getting group but not individual feedback, unless a survey link or similar is included.

  • Testing - When qualitative considerations are silent, decisions must be made on quantitative grounds. The standard for websites is A/B testing: allocate some traffic to version A, and some to version B, and measure metrics such as average time to completion of task. A little more complex is a stochastic k-armed bandit test with Thompson sampling, which allows testing arbitrarily many variants and also automatically reduces testing of poor-performing variants. We can do this for a language too, with some difficulty: get a random ID from the service, randomize choices, measure metrics in the compiler, report back, have a privacy policy and ensure GDPR compliance, require the ID so as to generate customized documentation, and voila. Given that the audience is programmers it probably makes sense to allow overriding the arm selection.
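As a sketch of what such a bandit could look like: the variant names, the binary success metric, and the idea that the compiler reports results back to a service are hypothetical illustrations, not a description of any existing infrastructure. Thompson sampling keeps a Beta posterior per variant and serves the variant whose sampled success rate is highest, so poorly performing variants naturally get less traffic over time:

    import random

    # Hypothetical language variants under test (e.g. two candidate syntaxes for one feature).
    variants = ["variant_a", "variant_b", "variant_c"]

    # Beta(1,1) priors; successes/failures would be reported back by compiler telemetry.
    stats = {v: {"successes": 0, "failures": 0} for v in variants}

    def pick_variant(user_override=None):
        """Thompson sampling: sample a success rate from each variant's posterior
        and serve the variant with the highest sample. Programmers may override."""
        if user_override is not None:
            return user_override
        samples = {
            v: random.betavariate(s["successes"] + 1, s["failures"] + 1)
            for v, s in stats.items()
        }
        return max(samples, key=samples.get)

    def report_result(variant, success):
        """Called when a session reports its metric (e.g. task completed without error)."""
        key = "successes" if success else "failures"
        stats[variant][key] += 1

    # Simulated feedback loop: variant_b is "better" (60% success) than the others (40%).
    true_rates = {"variant_a": 0.4, "variant_b": 0.6, "variant_c": 0.4}
    for _ in range(1000):
        v = pick_variant()
        report_result(v, random.random() < true_rates[v])

    print(stats)  # variant_b should accumulate the most trials over time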

Performance

Steelman 1D: “The language design should aid the production of efficient object programs.” Is this really a goal? How efficient do we need to be?

2-10x speedups

Performance plays a significant role in the bottom line of software companies. Let’s just look at the costs of a big software company (Google). The financial statements list cost of revenues, R&D, sales and marketing, general and administrative, property and equipment, and a bunch of financing items like loans, bonds, and stocks that don’t really matter for our purposes. Really, the only costs affected by a programming language are R&D and IT assets. Per the 2016 10-K, 27,169 employees (37.7% of total) worked in R&D, for about $513,379 per person-year. Trying to update that, the 2022 10-K lists 190,234 employees and $39.5 billion R&D, so estimate about 71,718 R&D employees and $550,766 per person-year. Regarding asset costs, the main figure is “other costs of revenue”, $48.955 billion, which contains data center operation and equipment depreciation.

Similarly, Meta’s numbers are $35.338 billion R&D, $25.249 billion cost of revenue. Total employees at the end of 2022 were 86,482. Their precise R&D employee count isn’t reported, but this HN post says about 42.6% “work in tech”, so we can estimate 36,899 R&D employees and a spend of $1,070,490 per person-year. Per levels.fyi, the median salary at Google is $261k and at Facebook it is $350k, 1/1.96 and 1/3 of the respective person-year spend. The spend is a bit high compared to the 1.2-1.4 rule of thumb for total employee cost. Probably the mean salary is higher than the median due to a small number of highly-paid employees, and the R&D figure includes significant costs besides employee salaries, maybe CI testing and software licenses. But it seems reasonable to assume that it scales by employee.

Given stories like Facebook rewriting Relay from JavaScript to Rust and making it 5x faster, or redesigning their Hack JIT compiler for a 21% speedup (via Lemire / Casey), it seems at least theoretically possible that going all-in on a new language could make everything 2x faster and reduce hardware costs by half. For Google, the 2x speedup would reduce “other cost of revenue” by about $24.48 billion per year. To break even, they would have to have spent less than that on the switchover, i.e. less than roughly 44-48k man-years (depending on whether the 2016 or 2022 cost per person-year is used), or about two-thirds of their ~72k-person R&D department occupied for a year. For Facebook, the 2x speedup saves $12.62 billion per year and it would break even at about 12k man-years, or roughly a third of their R&D department for a year. Although this is a large investment, acquiring WhatsApp was $19 billion, so it’s affordable assuming the speedup is guaranteed.
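As a back-of-the-envelope sketch of this break-even arithmetic (using only the rough 10-K figures quoted above; none of these numbers are precise accounting):

    # Rough break-even estimate for an all-in rewrite that halves compute costs.
    def break_even(other_cost_of_revenue, cost_per_person_year, rd_headcount, speedup=2.0):
        savings_per_year = other_cost_of_revenue * (1 - 1 / speedup)
        man_years = savings_per_year / cost_per_person_year
        return savings_per_year, man_years, man_years / rd_headcount

    # Google, using the 2022 figures quoted above.
    print(break_even(48.955e9, 550_766, 71_718))    # ~$24.5B/yr saved, ~44k man-years, ~62% of R&D
    # Meta, using the 2022 figures quoted above.
    print(break_even(25.249e9, 1_070_490, 36_899))  # ~$12.6B/yr saved, ~12k man-years, ~32% of R&D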

Performance not only affects costs, but also revenue. As an example of this, let’s look at Facebook’s LightSpeed project to rewrite Messenger - per the post, they got a 2x improvement in startup time. Per stats from Google (2), that speedup probably was from 3s to 1.5s and decreased bounces by around 25%. Estimating the revenue from this is tricky, but as a basic estimate: Facebook’s IAP revenue from Messenger in 2022 was $2.91 million, iOS is about 48% of mobile traffic, so they should have gotten at least a $350k increase in revenue, about 1/3 of a man-year. That’s hard to jibe with Facebook’s statement that the rewrite took more than 100 engineers over 2-3 years, but FastCompany mentions that most of that involvement was just 40 different partner teams “buy[ing] in and updat[ing] their work”. If we assume that only the initial 3-4 engineers were really spending substantial time working on it, and not full-out but only 1/3 of the time, the speedup could pay for itself over 3 years or so. And there are indirect benefits of performance like happier users and improved reputation. Now, Facebook’s post also mentions that the codebase size decreased from 1.7 million LOC to 360k. This substantially reduces maintenance costs, to the tune of ~$2 million / year per this random cost per LoC figure. Facebook likely also went ahead with the rewrite because of the maintenance savings (the cultural motto of “Move fast and break things” has apparently evolved to “do complete rewrites pretty often while keeping all tests passing”), but here we’re focusing on performance so it’s reasonable to discount the maintenance benefits.
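The same kind of rough estimate for the LightSpeed revenue impact, using the bounce-reduction and traffic-share approximations above:

    # Rough revenue impact of the Messenger startup-time improvement.
    messenger_iap_revenue = 2.91e6   # 2022 Messenger IAP revenue quoted above
    ios_share = 0.48                 # approximate iOS share of mobile traffic
    bounce_reduction = 0.25          # ~25% fewer bounces going from ~3s to ~1.5s startup

    revenue_gain = messenger_iap_revenue * ios_share * bounce_reduction
    print(revenue_gain)              # ~$350k/year, about a third of a person-year at Meta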

Now in practice, there are a variety of services. The desirable performance characteristics will vary - apps and websites will aim to reduce latency, backends will aim for efficient resource utilization, compilers will aim for faster code, and binary size is also a consideration in deployment. Rewriting existing C code probably won’t get much speedup, while JS probably will. There is a lot of uncertainty, and different companies will deal with this in different ways. For many companies, they are risk-averse and a 2x speedup is not large enough to take a risk; per Cliff they will need at least a 10x speedup before they start considering it seriously. For larger companies like Google or Facebook, they will consider even small speedups, but they will incrementally rewrite services one by one with a few developers, rather than going all-in.

So, yes, performance matters. If you can’t write fast code in the language, the language won’t be suitable for many purposes. And if another language is faster, some companies (like Facebook) have processes by which they can and will completely rewrite their implementation if there is a sufficient performance advantage (2x-10x). Maybe most less-agile or less tech-savvy organizations will not, but that’s their loss. Performance appears to be central to long-term business interests, and directly affects financial success.

Predicting performance

Predicting program performance (without running the code) is hard. For example, consider simple binary questions like whether a program is CPU-bound or I/O-bound, or which of two programs will execute faster. Certainly there are obvious cases, but when tested with a handpicked selection of tricky examples, even an experienced programmer’s answers will be more like random guesses than any sort of knowledge. When the task is more realistic, like a huge, complex, confusing program written in an unfamiliar language, the situation is worse. Per [Knu74], “the universal experience of programmers who have been using measurement tools has been that their intuitive guesses fail.” The programmer feels confident - “I think these will be the hotspots, so I’m going to design the architecture around them, pick out some algorithms with good time complexity, and write them in low-level assembly style.” But per C2, “this almost never works”. Per ChatGPT “the task is inherently uncertain”. So Steelman 1D “Constructs that have unexpectedly expensive implementations should be easily recognizable” is simply unimplementable.

I would say performance is difficult to predict for several reasons. First, hardware has simply become too complex for a programmer to reason about in their head - there is register renaming, multiple levels of caching, out-of-order execution, and instruction-level parallelism. Even the most accurate timing model at present, uiCA [AR21], still has errors of around 1% compared to real measurements. If we use less accurate models, like LLVM’s, then the errors are much higher, 10% or more. Certainly one can argue about what level of error is acceptable for what purpose but the fact remains that the errors are not 0% and the performance of even low-level assembly code is simply not predictable. Long gone are the days of Mel where programmers order their instructions by hand and take each cycle into account.

Another reason is that in the translation from source code to assembly, there is a lot of room for optimization to affect performance. For example, there is this post, C is not a low-level language. It argues that the C abstract machine does not map in any understandable way to modern hardware abstractions. Spectre and Meltdown mitigations impose significant performance penalties not visible in the source code. Simple translation does not provide fast code. Optimizers must fight the C memory model, sequential execution model, and layout guarantees, using millions of lines of code to avoid performing “obvious” optimizations that are actually unsound when closely inspecting the C semantics. So C’s performance is unpredictable and it does not map well to hardware. As C is probably the simplest language in common use today, Steelman 1D “Features should be chosen to have a simple and efficient implementation in many object machines” is simply impossible on account of the “simple” requirement. And as far as “efficient”, that is evaluated on a program-by-program basis. A feature is not intrinsically efficient or inefficient; only a use of a feature in a specific program can be evaluated as efficient or inefficient. And the cost of the use’s object code must be evaluated against how much it reduces the burden of maintenance of the code: how difficult the program would be to write without the feature, and how difficult it is to improve the performance of the feature by modifying either the program or the compiler.

On the positive side, compilers have gotten quite good at working optimization magic, and these optimizations can transform algorithms beyond recognition and even improve asymptotic complexity. For example, there are some papers on how partial evaluation of naive string matching algorithms can lead to optimal matching algorithms such as Knuth-Morris-Pratt and Boyer-Moore. Such optimizations do not follow much of a pattern, other than that programming with mathematical functions appears more suited to such optimizations. So per Steelman 1D “Features should be chosen […] to maximize the number of safe optimizations available to translators”, we could also choose functional logic programming as the base paradigm based on performance considerations.
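To illustrate the kind of transformation involved, here is a naive matcher next to the Knuth-Morris-Pratt matcher it can in principle be specialized into. In the papers, partially evaluating the naive algorithm with respect to a fixed pattern produces the KMP failure table automatically; here it is simply written out by hand for comparison:

    def naive_search(pattern, text):
        """O(len(pattern) * len(text)): restart the comparison from scratch on every mismatch."""
        for i in range(len(text) - len(pattern) + 1):
            if text[i:i + len(pattern)] == pattern:
                return i
        return -1

    def kmp_search(pattern, text):
        """O(len(pattern) + len(text)): the failure table encodes what the naive
        matcher "already knows" about the pattern after a partial match."""
        # Failure table: length of the longest proper prefix that is also a suffix.
        fail = [0] * len(pattern)
        k = 0
        for i in range(1, len(pattern)):
            while k > 0 and pattern[i] != pattern[k]:
                k = fail[k - 1]
            if pattern[i] == pattern[k]:
                k += 1
            fail[i] = k
        # Scan the text without ever moving backwards in it.
        k = 0
        for i, c in enumerate(text):
            while k > 0 and c != pattern[k]:
                k = fail[k - 1]
            if c == pattern[k]:
                k += 1
            if k == len(pattern):
                return i - len(pattern) + 1
        return -1

    assert naive_search("abab", "abacabab") == kmp_search("abab", "abacabab") == 4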

Optimization can cause amazing and unexpected speedups when an optimization works, and equally unexpected and disappointing performance when an optimization fails to work.

Steelman 1D also requires that “unused and constant portions of programs will not add to execution costs. Execution time support packages of the language shall not be included in object code unless they are called.”

A third reason is that measuring programming language performance is subjective and often based more on marketing than any hard evidence. Rust has claimed to be “blazing fast” since 2014. But this claim is not backed up by an official benchmark suite or anything of the sort. In fact, in an explicit test of this claim on Google, C was faster. The programming language benchmarks game is often criticized because it compares implementations using SIMD to those without, but it too has often shown that C is faster. Even if the benchmark suite were bulletproof, there will be some who point to expensive but expressive features of the language and say that it is slow.

A fourth reason is that not all code in the language has to be fast all the time. As the numpy ecosystem has shown, for basic scripting tasks, the hard parts can be implemented in compiled library modules. Almost no special effort is needed to use these modules and get sufficient performance. Even though interpreted CPython is one of the slowest, least performant runtimes, the end result still performs acceptably because the hardware is very fast and the expensive operations are implemented in C. Now, there are drawbacks to this design - writing an interpreted, unvectorized loop is a performance no-no, for example. A real JIT compiler, like for example Julia’s use of LLVM, is more flexible and performant in that it can optimize such loops. Scripting and scientific computing are definitely niches in the industry where only a few “hot” regions of code need to be optimized, and the other “cold” regions can be ignored for performance purposes. More broadly, even in the most performance-sensitive apps, there are often cold paths that simply never happen often enough to affect performance.
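A small illustration of the hot/cold split in the numpy style; the exact timings will vary by machine, but the vectorized version typically wins by a couple of orders of magnitude because its loop runs inside compiled C kernels:

    import time
    import numpy as np

    xs = np.random.rand(1_000_000)

    # Cold path style: plain interpreted Python, fine for setup, I/O, and glue code.
    def interpreted_sum_of_squares(values):
        total = 0.0
        for v in values:          # every iteration goes through the interpreter
            total += v * v
        return total

    # Hot path style: the loop runs inside numpy's compiled kernels.
    def vectorized_sum_of_squares(values):
        return float(np.dot(values, values))

    for f in (interpreted_sum_of_squares, vectorized_sum_of_squares):
        start = time.perf_counter()
        result = f(xs)
        print(f.__name__, round(result, 3), f"{time.perf_counter() - start:.4f}s")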

Performance goals

Performance by itself is not a SMART goal. Specifically, evaluating the factors:

  • Specific – Performance is affected by many factors and it is not predictable which areas will need improvement. There is no clear division of responsibility for performance between the language and the programmer, with questions about what constitutes “idiomatic” code vs. “slow code that you shouldn’t have expected to perform well”.

  • Measurable – Performance is definitely measurable, although noise means that statistics are required to interpret the measurements. For example, it is possible to maintain a benchmark suite of Stroscot programs vs. similar C programs, and the Stroscot compiler can be benchmarked against itself to identify performance regressions. (A sketch of such a comparison appears after this list.)

  • Achievable – Being number 1 in performance is not necessarily possible; Stroscot has a small team and ideal performance requires putting in man-years of work into writing specialized optimizations. Perceptions of performance are more often due to external factors or marketing.

  • Relevant – Performance is extremely relevant to adoption. Better performance makes the language suitable for more use cases, which satisfies Stroscot’s overall goal of being the ultimate programming language.

  • Time – Spending time on optimization can initially give huge speedups for little effort, but eventually runs into diminishing returns. Achieving optimal results is possible for some cases, but most of the effort is spent on testing other possibilities to verify that the chosen possibility is indeed optimal.
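As referenced under “Measurable”, here is a minimal sketch of what such a benchmark comparison could look like. The benchmark binaries (./nbody-stroscot, ./nbody-c) are hypothetical, and a real harness would want something more robust than a normal-approximation confidence interval (bootstrapping, Mann-Whitney, etc.), but the point is that each result is a distribution, not a single number:

    import statistics
    import subprocess
    import time

    def time_command(cmd, runs=10):
        """Run a benchmark binary several times and summarize the noise."""
        samples = []
        for _ in range(runs):
            start = time.perf_counter()
            subprocess.run(cmd, check=True, capture_output=True)
            samples.append(time.perf_counter() - start)
        mean = statistics.mean(samples)
        # Rough 95% interval on the mean, assuming roughly normal noise.
        margin = 1.96 * statistics.stdev(samples) / len(samples) ** 0.5
        return mean, margin

    # Hypothetical binaries built from equivalent Stroscot and C benchmark programs.
    for name, cmd in [("stroscot", ["./nbody-stroscot"]), ("c", ["./nbody-c"])]:
        mean, margin = time_command(cmd)
        print(f"{name}: {mean:.3f}s ± {margin:.3f}s")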

When Carbon says they want performance, from what I can tell, they really mean providing the developer control over the assembly their program generates. This is more of a functionality feature, and is covered under the discussion of assembly. They mention other vague “tools” to use to address poor-performing programs, but all a programmer can really do to address performance is drop down to assembly, since every program must be eventually translated to assembly regardless. There is also the need for idiomatic programs to be fast; this is just making sure to implement a decent number of low-hanging optimizations. Again, really a functionality concern. Then Carbon says code should perform predictably, but as discussed, nobody can predict performance, so that’s just a pipe dream. I would rather have the brittle optimizations that occasionally deliver pleasant surprises. It is a bit dirty to put in a hack like “if you are compiling the SPEC2000 benchmark, then ignore the source code and just generate this binary”, but Intel did it and it’s not like there was a huge industry backlash that made them stop distributing their Intel C Compiler, and in the meantime they had the best benchmark scores. And maybe with just a little more effort it is possible to expand the functionality and make the optimizations less brittle so they work normally. You don’t even necessarily know an optimization is brittle until you see real bugs like “I changed my program this way and there was a huge slowdown.”

There is also the performance of builds, and speeding up compilation. Google has done some work on build speeds; the main performance-focused feature there will be fine-grained incremental compilation to reduce compile times. This is planned for Stroscot as well.

Another performance-related thing I have seen is people trying to change the algorithmic complexity of a problem through design. For example, with package managers, even a minimal amount of functionality makes the problem NP-complete. So some people have tried to restrict the functionality of their package manager below that minimal level, so that they don’t need to solve an NP-complete problem. This results in some pretty bad software though, with brittle dependency solving and “dependency hell”. The newer package managers use a SAT solver or other algorithm for solving NP-complete problems. There is also the case of matrix multiplication, where the best known complexity is around O(n^2.37) but in practice the naive straightforward cubic algorithm is good enough for most purposes. I think it is better to develop software as an obvious, clear, correct solution plus a pile of hacks for performance, rather than to contort the presentation of the problem to have a fast solution.

So this has been a long discussion. What are the takeaways?

  • When starting out, it is better to have the mindset of “rapidly prototyping” - get the program worked out as quickly as possible, by writing clear, correct code with no attention to performance. This applies both to Stroscot’s implementation and the recommended methodology for writing programs in Stroscot. Don’t try to prematurely optimize. As such, the primary goal of the language should be to allow the programmer to express their intent as naturally as possible - i.e., to have the necessary functionality, such as powerful constructs and high-level declarative abstractions that allow “specifying what, not how”. This is captured in Stroscot’s principle “don’t surrender the adequate representation of a single datum of experience”.

  • Once there is an initial implementation, it may be profiled. Stroscot should profile itself and provide tools for profiling. If the performance is not acceptable, then the profile will show this and also point out the way forward. If there are a few clear bottlenecks, the fix is straightforward: rewrite them to be more performant. This could take many forms. Stroscot should make it easy to use a different data structure, add a cache, reorder traversals, filter out irrelevant data, and rewrite a hot loop in assembly. Since Stroscot is the ultimate language there should never be a case where the optimization can’t be expressed naturally in Stroscot, again following the experience principle.

  • In the hard case, there may be a smear of hot code, or the bottlenecks are up against physical limits, and the program design and architecture will have to be rethought. Stroscot code should be modular so that as little as possible needs to be rewritten even with major design changes. This follows the principle of “Design to do the hard things every day” - restructuring a program due to performance concerns should be straightforward.

  • To encourage adoption, Stroscot should benchmark itself against C and against past versions so that it doesn’t regress. Optimization is an area of research and per the principle Stroscot should “track the progress of technology.” As long as Stroscot implements the best optimizations, it will naturally meet or beat the performance of C compilers on C-like programs, and similarly for other language styles.

Looking at these, as a design principle by itself, performance is simply not that relevant a consideration. For example Wadler states “interoperability and footprint are more important than performance.” [MR12] But this doesn’t mean that performance will be ignored. It is certainly worth fostering a “performance culture”, with performance tests, profiling tools, and so on, for both the compiler and libraries. In the near term, however, the project is still in its “rapid prototyping” phase and hence the compiler’s performance is not a consideration. The potential for performance is, though; e.g. optimal register allocation is considered functionality rather than a “premature optimization” and hence will be implemented unconditionally.

Cost

We sort of got into this discussing performance but one idea is to optimize for the costs of software development. Predicting such costs is an art in itself. Several models are in use.

  • The COCOMO II model [BH00] is perhaps the only detailed model with publicly available weights and formulas. It computes software development effort (in person-months and time-to-develop). It models the effort as the product of various “cost drivers” or “effort multipliers”. The most significant factor is the program size. This measures new, modified, reused, and adapted lines of code (each weighted differently). It is estimated for new projects using unadjusted function point counting and a table of SLOC/UFP language-specific conversion factors (“backfiring”). The size is inflated by requirements for maintenance, evolution and volatility, and modified by an exponential (dis)economy of scale factor of familiarity/unprecedentedness, flexibility (rigorous/loose requirements), risk management (little/full), team cohesion (difficult interactions/seamless), and process maturity (low/high planning, documentation, and oversight). This scaled size is then further modified by (1) product attributes, such as required software reliability extent (risk to human life), size of application database (ratio of bytes in database to SLOC), complexity of the product (natural language, numerical or stochastic analysis, distributed or real-time), developed for reusability (none / across multiple product lines), documentation relative to life-cycle (uncovered/excessive), (2) platform attributes, such as run-time performance constraints (percent of execution / storage used), volatility of the platform environment (major changes every 12 months / 2 weeks), (3) personnel attributes, such as analyst capability (ability percentile), programmer capability (ability percentile), personnel turnover (% per year), applications experience (months/years), platform experience (months/years), and language/tool experience (months/years), and (4) project attributes, such as use of software tools (edit/debug only vs. life-cycle tools integrated with processes) and multisite development (averaging collocation, from international to one room, and communication, from snail mail to interactive multimedia). Then a factor of required development schedule (acceptable schedule compression/stretch relative to nominal) is used to scale time to develop. The COCOMO model does not model requirements gathering or final acceptance.

  • The Putnam model (SLIM): the effort in person-years is proportional to (effective SLOC / productivity factor)^3 / (project schedule)^4. This inverse 4th power provides an alternative to modeling required development schedule as compression/relaxation. (A sketch of this and the COCOMO II effort formula appears after this list.)

  • There is an interesting note in [Cro05] page 8: “For software projects below about 1,000 function points in size (equivalent to 125,000 C statements), programming is the major cost driver, so estimating accuracy for coding is a key element. But for projects above 10,000 function points in size (equivalent to 1,250,000 C statements), both defect removal and production of paper documents are more expensive than the code itself.”

  • SEER-SEM is similar in broad structure to COCOMO. It has different exponents for project effort (man-years) and project duration (completion date). The backfiring method is more complex; it uses a language-dependent factor but also adjustments for phase at estimate, operating environment, application type, and application complexity. [Cro05]

  • Per [Cro05] page 29, lines of code is a bad metric, varying by factors of 2-8 and with 30-100% CV. Function points are more reliable but also have problems (difficult to define and use)

  • Use case points: this is similar to function points but on the level of UML rather than functions.

  • Weighted Micro Function Points provides an automated measurement of size via source code analysis. Unfortunately due to the complex nature of the measurements, it cannot easily be used to predict the cost of future projects except by analogy.

  • Wideband Delphi is a process for coordinating experts to reach agreement on a cost estimate. Although common in industry, singular expert estimates using informal analogy are inconsistent and there is a tendency to underestimate, particularly to have drastic underestimates of difficult projects. Consistency and calibration are improved by including expert judgment as structured input to a model, rather than taking an unprocessed estimate. Conversely, an uncalibrated model may be worse than an expert’s unprocessed estimate. Expert judgment or calibration to previous projects is necessary as a model input due to heterogeneity among organizations and unique factors in projects. [Jrg04]

  • [LvonWurtembergHL12] finds that costs increase with commissioning body (5x), project assessed as high-risk (4x), project primary platform as legacy (3.5x), high project priority (3x), and more budget revisions (3x). Factors increasing productivity were project type “integration”, extremely low or high rating of project estimation efforts, targeting Windows, no testing conductor, and fewer budget revisions. Many other factors were not significant - interesting factors included length/cost of pre-study (dismissed as “quality” mattering more), cooperation among participants, assessed competence, and assessed code quality.
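As referenced above, here is a sketch of the COCOMO II and Putnam effort formulas. The COCOMO II constants (A = 2.94, B = 0.91, schedule coefficients 3.67 and 0.28, nominal scale-factor sum of about 19) are the published COCOMO II.2000 calibration and are an assumption here, not something taken from the studies cited in this list; the scale factors and effort multipliers are the rating-dependent tables summarized in the first bullet:

    def cocomo2_effort(ksloc, scale_factor_sum=18.97, effort_multiplier_product=1.0):
        """COCOMO II post-architecture effort in person-months.
        Defaults correspond to nominal ratings for all cost drivers."""
        A, B = 2.94, 0.91                      # COCOMO II.2000 calibration (assumed)
        E = B + 0.01 * scale_factor_sum        # (dis)economy-of-scale exponent
        return A * ksloc ** E * effort_multiplier_product

    def cocomo2_schedule(person_months, scale_factor_sum=18.97):
        """Development schedule in calendar months."""
        C, D = 3.67, 0.28
        F = D + 0.2 * 0.01 * scale_factor_sum
        return C * person_months ** F

    def putnam_effort(esloc, productivity, schedule_years):
        """Putnam/SLIM: effort (person-years) = (ESLOC / productivity)^3 / schedule^4."""
        return (esloc / productivity) ** 3 / schedule_years ** 4

    pm = cocomo2_effort(50)              # ~217 person-months for a nominal 50 kSLOC project
    print(pm, cocomo2_schedule(pm))      # schedule of roughly 20 calendar months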

Now what we care about is the impact of programming language choice on project cost. In the COCOMO model, there is an obvious place for the programming language, the language-specific conversion factor or “expansion ratio” from UFP to SLOC. This ranges from 640 SLOC / UFP for machine code to 128 for C++ to 53 for Java to 13 for SQL to 6 SLOC for a spreadsheet. One could thus naively conclude that since the project requirements in terms of UFP count are pretty much fixed, implementing every project via spreadsheets is the cheapest option, reducing the cost of the project by 20 times compared to doing it in C++. This conclusion relies on a few assumptions; let’s examine them in more detail.

Lines of code

[Bro95] makes the observation that “productivity seems to be constant in terms of elementary statements.” Similarly [Boe81] says (page 477) “COCOMO uses Delivered Source Instructions as the basic size parameter”. DSI was correlated more closely with total effort than executable machine instructions. In COCOMO II, total effort in person-months is a power law of the project size in SLOC.

This is borne out by [CMetc17], an analysis of DoD projects. Per Figure 25, fitting a power law to the relationship between SLOC and actual hours explains 67% of the variance. Per Figure 73 the languages were primarily C, C++, C#, Ada, and Java; they did not conduct a by-language analysis. [Cla15] Figure 12 provides a more interesting graph which shows basically the same thing, although it is colorized by project type. The overlap in rates by project type is such that I would not consider it a significant factor.

[Pre00] argues that this relationship intuitively holds: one line of code corresponds to one unit of thinking in short term memory, and the time required to process a unit of thinking is constant and independent of the amount of conceptual complexity of the unit. I am not sure about this, but it does seem logical that reading and writing code in a linear fashion takes time proportional to the number of lines of code, and likely searching and browsing code takes time increasing in a power-law fashion with the amount of code.

[JK12] says not to put too much stock into the value of the scale parameter being less or greater than 1; depending on how the regression is done, it can find either a diseconomy or economy of scale. Existing data sets have noise, unmeasured variables, and are non-random, giving insufficient information to determine anything beyond that the relationship is approximately linear.

Generally this relationship is observed within-language. The more interesting question is whether it holds across languages; if I can write a program in language A in 1000 lines of code, but in B it is only 500 lines of code, can I conclude that I will write the program in B 2x faster? The answer seems to be yes, as modified by the power law: a 2x smaller program will be at least 2x as fast to write, and maybe more, e.g. 2.5x or 3x.

[Boe81] page 660 says “the amount of effort per source statement was highly independent of language level.” [WF77] and [Pal82] did not do a by-language analysis, but the man-month to SLOC relationship was strong. [Gra77] pg. 6-12 compared assembly to HOL (COBOL, JOVIAL, FORTRAN) and found a 1.8x factor in favor of using HOL, although the relationship was not strong. The effect was more pronounced in analyst and programmer time. [Mar79] suggests a savings of 20-75% of development time by using PEARL instead of assembler, but does not model this in terms of SLOC. The expansion ratio of object words to source lines is 5:1, which is probably the best guess for assembly LOC : PEARL LOC ratio; such a ratio would account for the differences in development time.

[Pre00] finds a 22 to 31 LOC/hour range of medians per language. There was a lot of noise though, the only significant difference was Java vs pooled scripting languages (21 vs 29, p=0.031). The difference is not conclusive as the scripting languages had differences in work time measurement.

[Jon07] suggests that language has an impact on lines/month, in favor of higher-level languages:

  • In Table 5.3, SLOC produced per month of coding is 2k LOC/mo for assembly vs 2.3k for Ada vs 2.5k for C++. The programmers write 4x less SLOC of C++ than assembler to implement the project, and the time spent coding and testing is reduced by 4.5-5.2x.

  • In Ch. 14, Jones mentions that prototypes should be done in languages with low expansion ratios, because a savings in lines of code is a savings of effort.

  • In Table 17.10/11 (PBX projects), we can calculate lines of code per month of coding as 1250 for assembly and 1400 for Smalltalk.

[DKC07] measures SLOC per programmer per year by language for open-source projects and finds a range of SLOC, from ~2050 for JavaScript to ~3200 for Pascal. The study did somewhat control for programmer experience using metrics of the author’s previous contributions, but it did not measure the programmer’s experience with the specific language or project type. In the COCOMO model, assuming extra high team cohesion as is often the case in successful open source projects, setting PREC, PMAT, APEX, PLEX, and LTEX to very low gives 2130 LOC/year, while setting them to extra/very high gives 7280 LOC/year (6300 without adjusting PMAT). We see that the COCOMO model’s range nearly encompasses the study’s range. SourceForge does not allow filtering by date back to the 2000-2005 time period, but I went through the current projects and subjectively categorized them, then asked ChatGPT to rate 2005 programmers on the scale. I don’t claim these ratings are accurate, I just wanted a general sense of whether these factors could significantly affect productivity as measured in the study.

COCOMO II Model for 2005 programming languages

Language   | Use case                           | Precedentedness | Process requirements | Applications experience | Platform experience | Language and tool experience | LOC/Person-Year (COCOMO estimate)
-----------|------------------------------------|-----------------|----------------------|-------------------------|---------------------|------------------------------|----------------------------------
Pascal     | Programming Languages, GUIs, Games | Nominal         | High                 | Low                     | Low                 | Nominal                      | 3100
JavaScript | Web Libraries/Tools/Frameworks     | Low             | Nominal              | Nominal                 | Low                 | Nominal                      | 3280
Tcl        | Simple GUIs/Editors                | High            | Low                  | Nominal                 | Low                 | Nominal                      | 3300
PHP        | CMS/Blogging                       | High            | Low                  | High                    | Nominal             | High                         | 4375
Perl       | Billing/Helpdesk Scripts           | High            | Nominal              | High                    | Nominal             | High                         | 4475
Python     | Plugins/Libraries                  | Nominal         | Nominal              | High                    | High                | High                         | 4800
C++        | Games                              | Nominal         | High                 | High                    | High                | High                         | 4900
Java       | Scientific GUI Tools               | High            | High                 | High                    | High                | High                         | 5000

The first observation is that all of the COCOMO estimated LOC rates are significantly higher. Apparently an open-source programmer month is not as productive as a paid month. The significant pairwise changes are that Pascal moved from the top to the bottom, and PHP and Perl changed order. There were 3 other swaps, Perl-Tcl, PHP-Python, and Java-C++, but these were not significant in the original study. Overall the two methods agree on 17 out of 28 comparisons (60%) and 11 out of 16 significant comparisons (69%), 81% / 91% excluding Pascal. Not bad for an hour of work. Pascal can be explained as a data quality issue; looking at Table 4 it seems most of its comparisons were not significant, probably due to wide variance and lack of projects. Perl had a higher number of projects than PHP, but was otherwise more popular, so perhaps the issue is that authors published many small projects and the study did not combine projects by author. There is not a lot of detail in the study so it is hard to say. Overall, it does seem that the majority of variance in LOC/mo observed in the study can be explained by non-language factors such as the experience, familiarity, and motivation of the programmers that choose to use the language. I conclude that this study does not show evidence that LOC/mo varies by programming language, and shows that any variation that does occur is within a factor of 1.6x (the Pascal-Javascript ratio of 3200/2050).

[HJ94] is an interesting lesson in how not to do a LOC experiment: there is no control for programmer experience, several programmers wrote the same program in several languages, the task depended heavily on standard library support (with the Griffin “support” library being almost as large as the program), Rapide and Griffin were never executed, and Rapide didn’t even cover the required functionality. The LOC/hr ranged from 3 for Rapide to 8-20 for Haskell to 28-33 for Ada to 91 for Relational Lisp. I can draw no conclusions from this study.

[Pre07] measured a lot, like detailed activity patterns, questions, and check-in times, but most of it was uninformative. Check-ins per file were lower for Java. Perl had fewer manual lines of code, and more functionality per line of code, while Java had less functionality. Although the duration was controlled (hackathon), they didn’t do statistics to determine lines of code per function point, so no conclusions can be drawn.

Although SLOC was the main factor measured in these studies, the actual definition of SLOC has become more nuanced. For example [Boe81] page 479 says the anomalies of nonexecutable COBOL statements were rectified by weighting them by 1/3. In COCOMO II, “Table 64. COCOMO II SLOC Checklist” is quite detailed; it emphasizes that it attempts to count logical source lines, rather than physical. As such, it is most likely counterproductive to attempt to reduce SLOC by fitting more statements on one line, e.g. by lengthening the lines or shortening the lengths of identifiers.

I would say, given the relatively small variation in lines of code per month across languages (1-1.8x), and that the error in the formula is that higher-level languages finish faster than predicted by their lines of code, it does indeed seem that fewer lines of code reduce development costs. As such, Stroscot should be expressive and concise. Particularly, it is important to avoid verbose “boilerplate” declaration syntax. COCOMO II estimates that halving a 50 kSLOC project to 25 kSLOC shortens the nominal development schedule by about 5 months, from 20 to 15.

Function points

An alternative to SLOC is the use of function points for estimating projects in the early stages. Function points have over 40 different definitions, 6 competing ISO/IEC standards, and there isn’t a global consensus. Jones is partial to his own definition, IFPUG function points, but notes they are mainly used in the US. None of the definitions are free; you have to get the ISO standards and go through a certification process. There is a Simple Function Point method which is free though, and seems sufficient for most purposes.

In [AG83], the correlation between function points and SLOC was strong and linear, 0.854 or higher, as assumed in the COCOMO model. In [SKK+13], it was not; they found R of 0.15 (Haskell), 0.19 (C) and 0.31 (Isabelle/HOL). A CFP corresponded to anywhere from 20 to 250 lines of C, and Jones similarly mentions a 5-to-1 variation among individuals implementing the same specification in the same language in an IBM study. Part of this is a lack of training/standardization in function point counting, and part of this is that function points do not capture all aspects of implementation. In the Staples study, the correlation between C, Isabelle/HOL, and Haskell implementation sizes was much stronger than the correlation to function points.

This suggests that a better measurement for size might be “language-adjusted LOC”. For example, the number of words needed to specify the functionality in English may be a good measurement, if the English is restricted to the subset that a large-language model can easily translate to a sufficiently high-level programming language. In Ch. 6, Jones confirms this with the observation that the volume (page count) of paperwork generated during development of an application correlates fairly closely with size measured using either function points or source code. [LvonWurtembergHL12] notes that project size can also be estimated from cost, number of participants, number of consultants, and project duration, but of course in most models these are outputs rather than inputs. The most accurate method per [Jrg04] is deriving size by analogy, saying that the project is about as complex as a similar previous project, or somewhere between two projects.

Definitional issues aside, there are quite interesting tables of SLOC/FP conversion factors, QSM and SPR. Similarly [Jon17] Table 16 has work hours, KLOC, and FP by language. The use of function points provides what is probably the most language-independent method of comparing expressiveness, short of implementing the same project multiple times. It certainly seems worth investigating the high-productivity languages to see what features they offer. The actual SLOC/FP ratio will have to be determined experimentally.

Reuse

Another assumption in using SLOC is that all the code is new. In practice, reuse of code is a significant factor in the cost of development. COCOMO II in fact provides a specific model for it. The cost (in SLOC) is about 5% of the library’s size for adapting an unfamiliar component with no changes and goes past 100% of size at greater than 50% modification of the library. The cost can be broken down into reading the library’s documentation to identify if it is suitable (adding a cost of 0-8% of library size), modifying the library if necessary, and integrating it into the application. It helps if the library is high-quality with good (searchable/browseable) documentation, tests, and evaluations. Modifying and integrating the library is assisted by “software understanding”, having a clear, self-descriptive/well-commented, well-organized library structure with strong modularity and information-hiding. But the main factor of the cost of using a library is how familiar it is; if the programmer knows it and there is a clear match between the library and the project then the cost can indeed be zero.
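A crude sketch of the cost curve described above, and explicitly not the actual COCOMO II reuse formulas: a small assessment cost plus an adaptation cost that starts near 5% of the library's size for as-is reuse and exceeds 100% once more than half of the library is being modified:

    def reuse_cost_esloc(library_sloc, fraction_modified, assessment_fraction=0.04):
        """Very rough equivalent-SLOC cost of reusing a library, following the
        qualitative shape described above (not the real COCOMO II reuse model).
        assessment_fraction: the 0-8% of library size spent judging suitability."""
        if fraction_modified > 0.5:
            # Past ~50% modification, adaptation costs more than the library's own
            # size; at that point writing from scratch is often cheaper.
            adaptation = 1.0 + (fraction_modified - 0.5) * 2.0
        else:
            # Interpolate from ~5% (as-is reuse) up to ~100% (half rewritten).
            adaptation = 0.05 + fraction_modified * (1.0 - 0.05) / 0.5
        return library_sloc * (assessment_fraction + adaptation)

    print(reuse_cost_esloc(10_000, 0.0))   # ~900 ESLOC-equivalent: cheap, as-is reuse
    print(reuse_cost_esloc(10_000, 0.6))   # ~12,400: likely cheaper to rewrite from scratch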

So in practice, languages come with libraries, and in fact a language is chosen for its libraries so as to save significant effort. This is sort of the “culture” of a language; per Jones Table 6.6, Visual Basic apparently has 60% approximate/average reuse of code, and Java 50%. C++ is much lower at around 27.5% and macro assembly is at 15%. There are many methods to encourage this kind of reuse, basically a rehash of the COCOMO model:

  • Make it easy to structure code in a modular, organized fashion

  • Encourage self-documenting and well-commented code

  • Encourage tests and evaluations of libraries

  • Encourage “attention-conservation notices” that allow quickly determining if a library matches one’s need

  • Provide techniques to hide information that is irrelevant

Similarly we can look at C++’s multiple standard library forks for an example of what not to do. These forks exist for various reasons, such as exception handling/allocators (EASTL), backwards compatibility (Abseil), and missing functionality (Boost/folly), and people also still often handroll their own data structures.

Steelman 1C says “The language should encourage user documentation of programs.” I think the main way to do this is comments, specifically documentation comments that are processed by a documentation generator. Literate programming is interesting but in the case of TeX actually made it harder to use TeX as a library; it is only with great effort that LuaTeX has de-webbed TeX and turned it back into C. When you have interleaved code and documentation, it is really just a matter of style whether it is code embedded in text or text embedded in code, and the consensus across languages is that text embedded in code is easier to understand.

Defect rate

Jones says that the rate of defects per function point varies by programming language. In [Jon07] Table 17.8, we see that Smalltalk has the lowest number of errors per function point (0.14), most languages have under 2, C/C++ are pretty high, and assembly has the highest (5.01-9.28). Programming language syntax, notation, and tools can be designed to minimize the natural human tendency to make errors. Jones suggests regular syntax, minimal use of complex symbols, and garbage collection, among other facilities. Jones notes that individual factors cause 3x variations and so will matter more in practice except for particularly bug-prone languages. Nonetheless, it does make sense as in Steelman 1B to “design the language to avoid error-prone features”.

A 2014 study analyzed the rate of bugfix keywords in Github commits by language. They found an additional 100-150 expected bug-fixing commits per 700 commits (20%) on average when moving from Clojure to C++ (best to worst). This magnitude of effect held up after two re-analyses, [BHM+19] and [FTF21]. But there was a lot of noise, something like a 2x average variation between projects. The rankings of languages by bug rate were not consistent between the analysis and re-analyses. Generally, functional programming was less error-prone, Javascript was a little above the middle, and C/C++/Objective-C were at the top (in that order). Notably this disagrees with Jones’s table, where C is the worst and Objective-C has the lowest defect rate of the C family. The re-analyses mention that the regexes used to count bugfix commits were quite suspect; probably they also captured variations in commit message styles between languages such as NSError for Objective-C.

[Zei95] lists $10.52 per line of C code and $6.62 per line of Ada. It notes that proportionally more of the Ada code was “specification” or “header” lines, which are not redundant but are somewhat boilerplate in order to make entities visible. Perhaps as a result, Ada takes more lines to implement a feature than C; counting these specification lines as half lines would make the line counts and cost per line closer. These specification files also encourage more comments and whitespace (not counted in SLOC). The paper attributes the cheaper development cost per line of Ada to its specification, but looking at the defect rates of 0.676 (C) and 0.096 (Ada) per SLOC, it seems SLOC weighting is not sufficient to explain the discrepancy and it really is that C is simply more error-prone. Zeigler states they encountered many issues with C despite following as many development precautions as they could: “Customers were far more unhappy with features implemented in C than they were in Ada.”
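
The “count specification lines as half lines” adjustment is easiest to see with made-up numbers (these are purely hypothetical and are not Zeigler’s figures; in particular they assume equal total cost per feature and ignore the defect-rate effect discussed above):

  cost = 1000.0                 # hypothetical cost of one feature, in either language
  c_sloc = 95                   # hypothetical C lines for the feature
  ada_sloc, ada_spec = 130, 60  # hypothetical Ada lines, of which 60 are spec/header lines

  print(cost / c_sloc)          # ~10.5 dollars per C line
  print(cost / ada_sloc)        # ~7.7 dollars per raw Ada line

  ada_weighted = (ada_sloc - ada_spec) + ada_spec / 2  # spec lines count as half
  print(ada_weighted)           # 100.0 weighted lines, much closer to C's 95
  print(cost / ada_weighted)    # 10.0 dollars per weighted line, closer to C's rate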

Jones is not a fan of cost per defect. In [Jon17] he cites time-and-motion studies and says that actual defect repair costs are flat and do not change very much, with a mode of roughly 6 hours. The reason there are “urban legends” that post-release defects cost 100 times as much as early defects is that defects become scarcer while test cases and other fixed costs of defect finding remain. So for example, let’s say you’re in the initial stages of a project, writing unit tests. Finding a bug is random chance, fixing it costs $380, and fixed costs are $2000. If you find 50 bugs, you spend 50*$380+$2000 and the cost per defect is $420. Then in the late stages of the project, finding a bug is still random chance, fixing it still costs $380, and fixed costs are still $2000, but you’ve found most of the bugs already, so you find only one bug and the cost per defect is $2380. If you didn’t discover any defects, the cost per defect is infinite. The definition of a defect varies, as does the reporting period for defects after release, so this can really happen. Jones recommends instead accounting for the cost of defects per function point, and, for releases, the proportion of bugs found before versus after release. All hours spent testing and debugging a function point can be totalled up; this is generally a significant fraction of the overall effort spent on that function point.
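
The arithmetic of the example can be made explicit; the fix cost and fixed costs are the same figures as above:

  fix_cost = 380      # cost to repair one defect (flat, per Jones)
  fixed_costs = 2000  # cost of writing/running the tests regardless of defects found

  def cost_per_defect(defects_found):
      return (defects_found * fix_cost + fixed_costs) / defects_found

  print(cost_per_defect(50))  # 420.0 -- early in the project, defects are plentiful
  print(cost_per_defect(1))   # 2380.0 -- late in the project, defects are scarce
  # With zero defects found, the denominator is zero: an "infinite" cost per defect.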

Maintenance

Typically a new language only has a compiler available. However, a more concise language and a full suite of programming tools make maintaining software written in the language easier. Per Jones, with assembly, a maintainer can only maintain about 250 function points. With Java, a maintainer can keep up with updates for 750 function points. With best-in-class tools for automated complexity analysis, code restructuring, re-engineering, reverse-engineering, configuration control, and defect tracking, a maintainer may oversee 3500 function points without difficulty. Thus, when Steelman 1C says “the language should promote ease of program maintenance”, this can be interpreted as meaning that the language should have a low burden of maintenance per function point, and the way to lower this burden is to provide more tooling and also to lower the lines of code per function point (increase functionality / expressiveness).
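
In staffing terms, these assignment scopes translate directly into maintenance headcount; the 10,000 function point application size here is a hypothetical example:

  import math

  # Jones's maintenance assignment scopes, in function points per maintainer.
  fp_per_maintainer = {"assembly": 250, "Java": 750, "best-in-class tooling": 3500}
  app_size_fp = 10_000  # hypothetical application size

  for name, scope in fp_per_maintainer.items():
      print(f"{name}: {math.ceil(app_size_fp / scope)} maintainers")
  # assembly: 40, Java: 14, best-in-class tooling: 3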

Fixed costs

Another assumption of the COCOMO model is that coding is the only thing that matters. In practice, there are also requirements, design, documentation, testing, and management. These costs are generally not affected by the programming language, unless for example you are using UML to design the project and you generate Java code from the UML diagram using a tool. But I would say that can be accommodated by redefining the phases; really, you are not working on the design, you are coding in UML+Java. Definitions aside, the impact of a programming language in a real project is not as drastic as the 10x increases that are suggested by the expansion ratios. In [Jon07] Table 17.3 Jones estimates the impact to be a 30% decrease to 25% increase, so maybe a multiplicative factor of 1.25 / 0.70 ≈ 1.8 from best to worst cost.

Counterintuitively, because the language does not affect other necessary activities such as requirements, design, and documentation, these activities take up a larger share of the project time with a high-productivity language. So the lines of code per total project month will be lower for a high-productivity language. To really get good estimates of coding cost, we have to do time-use studies and the like to see what the breakdown of a developer’s time usage is.

Uncontrollable variables also impact studies. For example, Jones [Jon07] describes a study by DeMarco and Lister where the size and noise level of the programmer’s office space had a stronger influence on productivity than the programming language.

Best practices

  • Write a prototype implementation. Conduct an A* search through the possible solutions, stopping early if a potential solution is clearly worse than the prototype. Periodically take the best solution out of all discovered so far and implement it as the new prototype. (Branch and bound; a generic sketch of this search pattern appears after this list)

  • Take a list of items. Imagine a specific walk through a familiar place. List distinctive features of the route. Combine each feature with an item to form new outrageous/memorable images. (Memory palace)

  • Do all things without grumbling or complaining (Philippians 2:14)

  • Secure by default: The default level of access should have the least privilege and the greatest number of checks. (OpenBSD)

  • Organize functions by functionality into expressive components. (Uli Weltersbach)

  • When an interface has multiple possibilities, and other principles conflict or are ambiguous, the behavior should be the one that will least surprise most new novice users. In particular, the behavior is not necessarily the one that would be most easily implemented. (POLA)

    This principle varies from the original in specifically defining a target audience (new novice users). Targeting other audiences such as existing programmers would make the language highly dependent upon the whims of culture, and create a vicious circle of learning (“To understand recursion, you must first understand recursion”). For contrast, per Matsumoto’s interview, Ruby was designed for his least surprise. That means that, in order to feel comfortable with Ruby, one must learn all of Ruby, program a few large programs in Ruby, and then constantly re-read the Ruby manual to refresh what has been forgotten. And even then you are not Matsumoto so there may be something that surprises you. Focusing on novices means that all an expert has to do is “think like an idiot” and the solution will be at hand. The expectations of novices are essentially constant over time, because they depend on human psychology rather than experience. This principle is essentially the approach taken in designing the Quorum programming language. Every person goes through a short “what is this feature” phase, which novice-friendly syntax will make straightforward, followed by a much longer cycle of routine reading and writing for which the syntax makes no difference.

  • Design to do the hard things every day. Take all the common daily tasks considered to be painful and hard, and figure out a design that will allow each task to be accomplished efficiently in a few seconds of actual work. It is unacceptable to require detailed pre-planning or coordination for each task. The biggest overheads should be testing the result and writing documentation. (Linus on why Git works so well)

  • What we need, then, are more interfaces that are durable, non-leaky, and beautiful (or, at least, extremely workable). A durable interface lasts - [it] will run on many years’ worth of systems. A non-leaky interface reveals little or nothing about the interface’s implementation. Beautiful interfaces do not expose too much or too little functionality, nor are they more complex than they need to be. A lot of work and a lot of iteration is required to create a beautiful interface - so this conflicts somewhat with durability. The worthwhile research problem, then, is to create interfaces having these properties in order that the code living in between interface layers can be developed, understood, debugged, and maintained in a truly modular fashion. To some extent this is a pipe dream since the “non-leaky” requirement requires both correctness and performance opacity: both extremely hard problems. Another problem with this idea - from the research point of view - is that a grant proposal “to create durable, non-leaky, beautiful interfaces” is completely untenable, nor is it even clear that most of this kind of work belongs in academia. On the other hand, it seems clear that we don’t want to just admit defeat either. If we disregard the people who are purely chasing performance and those who are purely chasing correctness, a substantial sub-theme in computer systems research can be found where people are chasing beautiful interfaces. (John Regehr)

  • Most people assume that maintenance begins when an application is released, that maintenance means fixing bugs and enhancing features. We think these people are wrong. Programmers are constantly in maintenance mode. Our understanding changes day by day. New requirements arrive as we’re designing or coding. Perhaps the environment changes. Whatever the reason, maintenance is not a discrete activity, but a routine part of the entire development process.

    When we perform maintenance, we have to find and change the representation of things - those capsules of knowledge embedded in the application. The problem is that it’s easy to duplicate knowledge in the specifications, processes, and programs that we develop, and when we do so, we invite a maintenance nightmare - one that starts well before the application ships.

    We feel that the only way to develop software reliably, and to make our developments easier to understand and maintain, is to follow what we call the DRY principle: “Every piece of knowledge must have a single, unambiguous, authoritative representation within a system.”

    Why do we call it DRY? It’s an acronym - Don’t Repeat Yourself.

    The alternative is to have the same thing expressed in two or more places. If you change one, you have to remember to change the others, or, like the alien computers, your program will be brought to its knees by a contradiction. It isn’t a question of whether you’ll remember: it’s a question of when you’ll forget. (The Pragmatic Programmer 1st edition)

    Now other people have added clarifications:

    • There is a minimum size at which the language’s mechanisms for sharing operate, and below this level it is not appropriate to try to avoid repetition, because such attempts will actually make the code longer or more confusing. For example, in “In the heart of the forest, the forest seemed to whisper secrets only the forest knew.”, trying to shorten the sentence by using “it” instead of “the forest” will only make it more confusing, as there is also the noun “heart” which “it” could refer to. Generally, saving a dozen lines of code is worthwhile; saving 1-2 is not. Sometimes, copy-and-pasting that 12-line snippet is exactly the way to proceed, because you will modify it beyond recognition. Refactoring a 12-line function into two 8-line functions or four 5-line functions is not an improvement; neither is factoring two 6-line functions into three 5-line functions. (Tamir) There are clone-detection tools with logic and filters for deciding which duplications are significant and which are not.

    • Sometimes duplication is inherent. For example, a formula in the business logic may have repeated subformulas. Two test cases for the application domain may have identical validation logic. A database may not be in 3NF and may have repeated columns for performance reasons. A package or library may be deliberately forked to iterate fast, adapt to new requirements, and avoid coupling or version constraints. Code may be written to follow the documentation and copy most of its structure and semantics. These duplications are “natural” and avoiding them is often unworkable or infeasible. Generally, it is best to document these duplications, and the history/reasoning behind them, so that the developer can get an idea of which version is authoritative for which purpose. Now, in practice, with distributed DVCS development, no version is completely authoritative. But we can divide them into “active” forks, “temporary” forks, and “abandoned” forks. Generally there are only 1-2 active forks, and these are what matter. But this is an empirical observation rather than a best practice - it may happen that there really are 10 active forks or libraries that do substantially the same thing. (Nordmann) For code contracts, the versions can be maintained in sync by duplicating the contract assertions on both sides.
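
Returning to the branch-and-bound practice at the top of this list, here is a minimal, generic sketch of the search pattern it describes, applied to a toy 0/1 knapsack problem; the items, capacity, and bound function are illustrative, not tied to any particular design problem:

  import heapq

  # Branch and bound: the incumbent plays the role of the "prototype".
  # Branches whose optimistic bound cannot beat it are pruned, and any
  # better complete solution found becomes the new incumbent.

  # (value, weight) pairs, listed in decreasing value/weight ratio so the
  # greedy fractional bound below is a true upper bound.
  items = [(60, 10), (100, 20), (120, 30)]
  capacity = 50

  def bound(value, weight, index):
      # Optimistic estimate: greedily fill the remaining capacity, allowing
      # a fractional final item (a relaxation of the real problem).
      for v, w in items[index:]:
          if weight + w <= capacity:
              value, weight = value + v, weight + w
          else:
              return value + v * (capacity - weight) / w
      return value

  best = 0  # incumbent "prototype": start with the empty knapsack
  frontier = [(-bound(0, 0, 0), 0, 0, 0)]  # (-bound, value, weight, next index)
  while frontier:
      neg_bound, value, weight, i = heapq.heappop(frontier)  # most promising first
      if -neg_bound <= best:
          continue  # prune: this branch cannot beat the incumbent
      if i == len(items):
          best = max(best, value)  # a better complete solution: new prototype
          continue
      v, w = items[i]
      if weight + w <= capacity:  # branch: take item i
          heapq.heappush(frontier, (-bound(value + v, weight + w, i + 1),
                                    value + v, weight + w, i + 1))
      # branch: skip item i
      heapq.heappush(frontier, (-bound(value, weight, i + 1), value, weight, i + 1))

  print(best)  # 220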