Frank DENIS random thoughts.

Where are the self-tuning systems?

Computer science has come a long way, and machine learning has turned things that looked like sci-fi a couple of years ago into reality.

However, there is a major thing that totally sucks in 2015, to the point of still being virtually nonexistent: self-tuning systems.

Every single system remains full of knobs that humans waste a tremendous amount of time tweaking, breaking things along the way. The breakage is usually fixed by turning other knobs, which introduces side effects that require even more knob turning. Rinse and repeat.

Want to run a server? With the possible exception of the brave people running unikernels, your libraries, drivers and kernel are utterly bloated, not specifically tuned for your use case and always suboptimal. Your web server could be way faster, more scalable and more reliable.

What filesystem should one use? With what set of mount options? After possibly adjusting a couple sysctl values, the maximum number of file descriptors, per-user limits and tunable values for each piece of hardware (especially network adapters), one might end up with a system that might be able to run the desired applications, under a normal workload, without having them break horribly just because these system parameters were initially too high or too low.

Then, each application comes with its own insanely long set of knobs. Why does one still have to adjust the maximum number of connections, timeouts, the size of various memory buffers, the number of threads and processes, and a bunch of other knobs just to run a very basic web server? Why does the JVM still need messing around with GC settings to get acceptable performance/memory usage? Why does everything come with so many knobs, whose number keeps increasing version after version? Why can a service be unresponsive because knobs were not in the right position, even though the hardware would be totally able to keep up with the workload?

A simple old LAMP stack comes with so many knobs that no one can possibly know what most of them actually do, or what the exact implications of turning two of them at the same time are. Eventually, people end up copying/pasting settings found on random web sites, adding a few adjustments according to their intuition (usually on the basis that “bigger is better”), and moving random sliders whenever things go wrong until the whole construction appears to be stable enough.

Of course, some knobs may have been adjusted in a more scientific way, usually by running synthetic benchmarks. Unfortunately, what held true for these benchmarks is unlikely to hold true forever.

Even after reaching the holy grail, a configuration that stays stable (until unexpected traffic hits the server), the whole thing is very likely to remain suboptimal. The set of knob positions happens to work, partly by accident, but at any point in time, it remains far from being optimal. The service could always be way faster, accept more connections, or use way less memory.

In 2015, self-tuning systems mostly don’t exist. Every single piece of software still relies on magic numbers found empirically or pulled out of thin air, by developers or by users, possibly manually adjusted later in order to get closer to an acceptable security/reliability/performance balance.

Collecting system, application and network metrics is a long-solved problem. Accessing all the knobs in a unified way remains an unsolved, but engineering-only problem (that systemd is bound to tackle at some point).
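To make the idea concrete, here is a minimal sketch of what a feedback loop between a metric and a knob could look like. It is plain hill climbing, about the simplest approach one could pick, and not a claim about how any real system tunes itself. Every name here is hypothetical: `hill_climb_knob` is the tuner, `measure` stands for whatever metric collection already gives you, and `fake_throughput` is a made-up workload whose throughput happens to peak at 12 workers.

```python
from typing import Callable


def hill_climb_knob(
    measure: Callable[[int], float],
    start: int,
    lower: int,
    upper: int,
    step: int = 1,
    max_rounds: int = 100,
) -> int:
    """Return a knob value that locally maximizes measure(value).

    measure is assumed to apply the knob value, observe the system and
    return a score where higher is better (hypothetical interface).
    """
    # Clamp the starting position into the allowed range.
    current = max(lower, min(upper, start))
    best_score = measure(current)

    for _ in range(max_rounds):
        improved = False
        # Try one step up, then one step down, and keep the first improvement.
        for candidate in (current + step, current - step):
            if candidate < lower or candidate > upper:
                continue
            score = measure(candidate)
            if score > best_score:
                current, best_score = candidate, score
                improved = True
                break
        if not improved:
            # Neither neighbor is better: we are at a local optimum.
            break
    return current


def fake_throughput(workers: int) -> float:
    # Made-up metric for illustration only: best score at 12 workers.
    return -float((workers - 12) ** 2)


if __name__ == "__main__":
    print(hill_climb_knob(fake_throughput, start=4, lower=1, upper=64))
```

A real tuner would have to deal with noisy measurements, several interacting knobs, and workloads that change while it is measuring, which is exactly where this toy stops being enough.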

Databases, network stacks, and virtual memory managers have been partly self-tuning for a long time, but only partly. Cluster resource managers/schedulers are pretty smart, but still rely too much on parameters whose value has to be chosen by humans.

Even academic research on self-tuning systems is scarce and old. Have we given up? Is it not worth the effort, because reasonable defaults ought to be enough for everyone? While a general solution to the global optimization problem is unrealistic, it remains perplexing that we have more and more “tuning guides” for every single piece of software, instead of converging towards knobless software.