A surprising? lesson in the speed of languages

Scenario: I needed a small tool that would parse logfiles of approximately 6 million lines. For each line, it applies two regexps to extract a few values and, using two separate dictionaries/hashmaps (choose your poison wrt terminology), counts how many times each value captured by a regexp group shows up.
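As a rough illustration of the task, here is a minimal Python sketch. The two patterns, the log format and the name count_matches are all made up for the example, since the real regexps and logfiles are not shown here. It applies both regexps to every line and tallies each captured value in its own dictionary.

```python
import re
from collections import Counter

# Hypothetical patterns: the real regexps and log format are not shown in this post.
CLIENT_RE = re.compile(r"client=(\S+)")
STATUS_RE = re.compile(r"status=(\d{3})")


def count_matches(lines):
    """Apply both regexps to every line and tally each captured value."""
    clients = Counter()   # first dictionary: occurrences per captured client
    statuses = Counter()  # second dictionary: occurrences per captured status
    for line in lines:
        m = CLIENT_RE.search(line)
        if m:
            clients[m.group(1)] += 1
        m = STATUS_RE.search(line)
        if m:
            statuses[m.group(1)] += 1
    return clients, statuses


# Tiny made-up sample standing in for the 6 million line logfile.
sample = [
    "2020-10-27 12:00:01 client=10.0.0.1 status=200",
    "2020-10-27 12:00:02 client=10.0.0.2 status=404",
    "2020-10-27 12:00:03 client=10.0.0.1 status=200",
    "malformed line without fields",
]
clients, statuses = count_matches(sample)
```

For a real logfile, passing the open file object as lines would stream it line by line instead of loading everything into memory.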

All in all, this is not going to be a very performance critical run, as it will only parse these 6 million lines about 4 times per day, so even in the worst case we're talking less than 100M regexp matches per day. Piece of cake for any language. And since it's a weekend and I have no deadline for delivering the finished script, I set out to prove to myself that Python is going to be fast enough for this.

My original plan was to try it in Python (because I can write that fastest) and Go (because I figured it'd be much faster on execution). After having completed those two in less than half an hour, I figured I'd give it a try in Rust as well before it was time for a coffee break.

In general, it was certainly the easiest and quickest in Python (34 lines of code, properly formatted and with plenty of whitespace). Go took only marginally more writing time (and got me 49 lines of code with similar spacing), mostly because of the (to me) annoying "there's always a check for nil". And finally, Rust took significantly more time to get working (though I blame that mostly on me being less experienced in Rust than in Python or Go) and ended up at 39 lines of code, but with some really ugly lines in it.

Those were all relatively unsurprising to me. The surprising part, however, was the runtimes (averaged over 3 runs; I was too lazy to do a more detailed benchmark, but there was very little variation):

  • Python: ~25 seconds
  • Rust: ~35 seconds
  • Go: ~60 seconds
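For reference, averaging over a few runs can be done with a tiny harness like the sketch below. This is not the script used for the numbers above; average_runtime and workload are hypothetical names, and the workload is just a placeholder for the real parsing run.

```python
import time


def average_runtime(func, runs=3):
    """Return the mean wall-clock time of func() over `runs` calls, in seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)


def workload():
    # Placeholder standing in for one full pass over the logfile.
    sum(i * i for i in range(100_000))


avg = average_runtime(workload)
print(f"average over 3 runs: {avg:.4f} seconds")
```

Three runs are enough to spot large differences like the ones above, but not to say anything about small ones.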

It's a known fact that the regexp engine in Go is pretty slow. There are other, faster, engines available for Go, but my test only involved the built-in one. So that it came out last is hardly surprising, but how much slower it was did surprise me. This test was also in the form of a single threaded, zero concurrency, processing of input to output, which is not the big strength of Go. It is still my go-to choice for highly concurrent or asynchronous workloads for example, where it can certainly beat out Python on performance any day.

How come Python is so fast, given it has a reputation for being pretty slow? Simple -- the vast majority of time is spent in the regexp engine, and the Python regexp engine is not written in Python. It's written in highly optimized C.

The big surprise for me in this lineup was Rust. I would've expected Rust to come out the fastest, and pay for it in development time. Of course, once I'd done this, I googled it a bit, and it turns out that the Rust regex engine being slow with capture groups is a known problem, and appears to have been known for several years (which is a long time in Rust-land). I guess I should've done that before I started, but originally Rust wasn't supposed to be part of the test.

As a sidenote, the sizes of the files that I had to deploy to my testing instance were also "somewhat" different. Python was 1.2 kB (OK, that's cheating, since it also requires the Python standard libraries to be installed on the machine -- but those are already installed in most cases), Go was 2.3 MB (fully static binary with no dependencies) and Rust was a huge 8 MB (dynamically linked but only needing the normal gcc libraries already installed on the platform).

Conclusion

The conclusion from this very trivial test is definitely not "Python is faster than Rust and Go". In fact, as a language it very definitely is not.

The conclusion is that for a lot of workloads, the speed of the language itself doesn't really matter; it's the speed of the underlying libraries that matters. And in that case, it can often help to have a language where the underlying library has been around for much longer and received a lot of optimization.

I see similar things when I work with our customers who build database applications. The language chosen for the backend code rarely has any larger impact on the system, compared to the underlying database, database model and queries. Frameworks can have a bigger impact there, but mainly in those cases where they tie the hands of the developers and prevent them from writing efficient database queries and models. And this is clearly not the fault of the language itself.

And if the speed of the language itself doesn't matter, then the speed of development in the language obviously becomes a bigger factor. Or the cost of maintaining code in the language.

There are of course exceptions to this. Rust and Go will be much better choices if the code to be written is lower level and spending more time in "itself" rather than calling out to libraries. And as mentioned above, both Go and Rust have excellent properties for highly concurrent workloads. And many other cases as well.

