Screenly - What we've learned from running Ubuntu Core in production for five years.

I recently traveled to lovely Prague for the resurrection of Ubuntu Summit. As it turns out, I also attended the Developer Summit in Copenhagen back in 2012, which was the last one held prior to this event.

The reason why I attended was to give a talk titled Five Years of Ubuntu Core, where we reflect on our journey from the very first release of Ubuntu Core (then called Snappy) to where we are today. If you would rather watch the recording, this can be found here.

Part of this talk was taken from the article Screenly 2 Player under the spotlight, where we outlined the selection criteria we had set out for the foundation of our big overhaul of Screenly’s player infrastructure.

It is also worth noting that the list below relates to Ubuntu Core as of 2022. We are not even going to pretend that things were smooth from the start. As with all bleeding-edge technology, the road was very bumpy early on. As early adopters, we paid the price for this. Much has changed since, and Ubuntu Core today is a much more polished product than it was five years ago. It is perhaps still far from perfect, but it has come a long way.

Today, Screenly is the largest commercial deployment of Ubuntu Core in the world.

What do we really like?

OTA is very reliable

Over-the-Air (OTA) updates “just work”. We’ve had very few instances where this doesn’t work. The most common scenario where this happens is when customers build their own players and try to skimp on cost by selecting a non-industrial SD card.

Security is very solid

We’re big fans of the security model in Ubuntu Core. Sure, as with anything security related, it does add friction. However, that is the price you pay for good security.

Some things we really like include the use of GPG for cryptographically signing (and validating) software, along with the sound and granular sandboxing available.

Lately, with our Screenly Player Max, we have also enabled Secure Boot, to ensure the operating system is cryptographically solid before booting.

Interfaces (slots/plugs) cover most use cases

In the early days, we had to do a lot of work with slots/plugs to interface with particular hardware on the system. Those days are now largely over, and the built-in slots/plugs are easy to work with. For instance, when we started using TPMs with our mTLS setup, this was readily available.

Snapcraft is easy to work with

We’re not going to lie: in the early days, Snapcraft was very rough. It frequently broke and caused us a lot of pain. Those days are now largely over, and the tooling works well. Granted, there is a learning curve with the tooling, but it is a lot better these days.

Disk image builds work well

Again, in the early days, this was rough. These days it just works. We build all our disk images directly in our CI/CD pipeline. Whenever we do a manufacturing batch in the factory, we are able to use an up-to-date disk image. This disk image of course self-updates once the device is powered on, but the more recent the disk image, the faster the update will be.

Ability to mix and match Core versions as dependencies

This is perhaps something that one doesn’t appreciate until a few years into running Core. When we started using Core, the latest version available was Core 16. We’re now up to Core 22. Core 16 is still fully supported by Ubuntu, so it’s not a security issue. However, if we were limited to software from this release, we’d be very constrained in our software stack. Fortunately, this is not true. With Core, we are able to run Core 16 as the OS, but still use Core 22 as a building block for our software.

This is a powerful concept. This means that we can use Core 22 and Core 16 as the OS side-by-side, without having to maintain separate software stacks for them.

Debugging has gotten a lot better

Last but not least, debugging is a lot better these days. In the early days, we were limited in how we could debug software. We’re now able to use tools like GDB to debug our software stack directly in Core. What’s perhaps even cooler is that we can also use eBPF tools, like Parca, to do real-time profiling on devices and zoom in on bottlenecks directly in Core.

What still needs work

Documentation is getting better but still not great

Documentation used to be one of the biggest issues with Core. It was a fast moving project and some argued that the best way to keep this up to date was to use the Snapcraft Forum as the source of truth for documentation. While we are active users of the Snapcraft Forum, we see this as a complement, rather than a substitute for official Core documentation.

Things have improved a fair bit, and the official documentation is a lot better than it used to be. However, it still requires a fair bit of work.

Boot time is still very slow

This is perhaps our biggest issue with Ubuntu Core. Boot time is still painfully slow, in particular for the first boot. There are good reasons for this, but as an end-user, you really don’t care.

The good news is that this is a top-priority within Canonical. The Core engineers are hard at work trying to speed up this process, so we are hopeful that this will change soon.

No OTA between releases

A large part of our fleet is using Ubuntu Core 16. This release is still maintained until 2026, so our devices will still receive security updates from Ubuntu until then. However, it would be nice if we could update these devices remotely to a more recent version.

We get it, it’s a complicated thing to do, since it requires rewriting the whole partition system (since that has changed). However, if we put aside the technical complexity involved, this would be a nice feature.

Network stability has been an ongoing issue

We’ve been chasing this problem for a long time. There appears to be a small set of devices that periodically drop off the network. While we suspect that it is a network issue, we cannot be certain. All we know is that devices stop sending data to us and they also stop responding to ping on the local network.

To debug this, we’ve read more gigabytes of logs than we care to admit. We’ve even applied machine learning to said logs to see if we could find patterns. So far we’ve come up short. We will keep at this, but it is something that we didn’t see on Raspbian, but are seeing on Ubuntu Core. That said, it is an issue that is very hard to reproduce, which makes it very hard to debug.
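
As a first step before digging through logs, the symptom described above (devices that stop sending data) can be turned into a list of "silent" windows per device. The following is only a minimal sketch of that idea, not a description of our actual tooling. The function name, the 15-minute threshold and the timestamps are all made up for illustration.

```python
from datetime import datetime, timedelta


def find_silent_gaps(checkins, max_gap):
    """Return (last_seen, next_seen) pairs where a device was silent longer than max_gap."""
    ordered = sorted(checkins)
    gaps = []
    # Compare each check-in with the one that follows it.
    for previous, current in zip(ordered, ordered[1:]):
        if current - previous > max_gap:
            gaps.append((previous, current))
    return gaps


# Hypothetical check-in timestamps for a single device (made-up data).
checkins = [
    datetime(2022, 11, 1, 12, 0),
    datetime(2022, 11, 1, 12, 5),
    datetime(2022, 11, 1, 12, 10),
    datetime(2022, 11, 1, 14, 40),
    datetime(2022, 11, 1, 14, 45),
]

gaps = find_silent_gaps(checkins, max_gap=timedelta(minutes=15))
for last_seen, next_seen in gaps:
    print(f"silent from {last_seen} to {next_seen} ({next_seen - last_seen})")
```

The resulting windows can then be used to narrow down which slice of the on-device logs is worth reading.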

Unable to (easily) set resource constraints per service

It’s very straightforward to apply resource limits to a snap. This is very handy if you have a single app in a snap. In our case, we have a lot of services running within the same snap. This is where it becomes a bit tricky. We are able to set a ‘global’ resource limit, but there is no way to set resource limits per service.

We’ve worked around this using systemd’s resource control. It would have been much cleaner if we could have used the native Snapcraft features for this, but we’re at least able to work around this for now.
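
To give a rough idea of what a per-service limit with systemd can look like, here is a small sketch that renders a systemd drop-in file for one snap service. It relies on two things that are standard: snapd names service units snap.<snap>.<app>.service, and systemd reads overrides from a <unit>.d/ directory. Everything else is an assumption for illustration: the snap and app names are hypothetical, the limits are arbitrary, and MemoryMax= assumes a system using cgroup v2. This is not necessarily how our own workaround is implemented.

```python
def render_dropin(memory_max=None, cpu_quota=None):
    """Render the contents of a systemd drop-in with optional resource limits."""
    lines = ["[Service]"]
    if memory_max is not None:
        # MemoryMax= is systemd's hard memory limit (cgroup v2).
        lines.append(f"MemoryMax={memory_max}")
    if cpu_quota is not None:
        # CPUQuota= caps CPU time as a percentage of one CPU.
        lines.append(f"CPUQuota={cpu_quota}")
    return "\n".join(lines) + "\n"


def dropin_path(snap_name, app_name, filename="50-resource-limits.conf"):
    """Build the drop-in path for a snap service unit (snap.<snap>.<app>.service)."""
    unit = f"snap.{snap_name}.{app_name}.service"
    return f"/etc/systemd/system/{unit}.d/{filename}"


# Hypothetical snap and app names, with arbitrary example limits.
path = dropin_path("example-player", "viewer")
content = render_dropin(memory_max="256M", cpu_quota="50%")
print(path)
print(content, end="")
```

After writing such a file, systemd needs a daemon-reload and the service a restart before the limits apply.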

Hardware support is limited

One of the biggest limitations in Core is the supported hardware. Granted, you can run Core on more or less any x86 hardware, but Core may or may not include the required drivers (WiFi in particular springs to mind). This is where it gets tricky. You can’t just add new kernel drivers to Core without having to jump through a lot of hoops and essentially build your own Kernel Snap (this is where the kernel lives). However, if you do this, you lose out on things like native Secure Boot.

There has been talk about improving this by being able to side-load kernel modules, but this isn’t available yet.

For non-x86 hardware it gets trickier. It took Canonical painfully long to enable support for the Raspberry Pi 4 (close to a year). As an end-user, this was very painful as we couldn’t support it on our end before Canonical had fully enabled it. This was part of the reason why we were so late in adding support for the Raspberry Pi 4 in Screenly.

Having spoken to a number of engineers and managers from Canonical at Ubuntu Summit, I’m hopeful that this is changing going forward. The Core team has been scaled up and some very talented engineers have joined it.

What was surprising

All TPMs are not created equal

When we introduced our first x86 player (Screenly Player Max), a hard requirement was to have a TPM on board. The reason was largely that we wanted to be able to use the TPM for things like mTLS, Secure Boot and Full Disk Encryption (FDE).

The first two of these use cases worked like a charm. We’ve used Secure Boot since the start and are migrating to use mTLS for our internal authentication shortly.

FDE has been a lot more challenging. In the process, we’ve learned a lot about the inner workings of TPMs. As it turns out, while TPM 2.0 is a standard, it is a standard that is poorly implemented in many TPMs. The particular TPM we use didn’t follow the standard for the particular operation that Canonical relied on for FDE in Ubuntu Core. We are still hopeful that this can be worked around, but this is really the reason why we haven’t rolled out full FDE in our players yet.

For those who want to nerd out over the details, you can read all about it in the bug report on Launchpad.

Splash screens are a hack at best

While giving my talk, I learned from Ogra that this now has been resolved. There is now an official way to display a splash-screen on Core as of a few weeks ago. Until then, adding a splash-screen was a very convoluted hack that required manually tweaking the bootloader.

What isn’t a solved problem, however, is using a splash-screen on x86 while using Secure Boot. There are good reasons for this. In short, you need to inject your custom splash-screen into where the kernel lives. By doing so, you of course alter the cryptographic signature, which in turn breaks Secure Boot.

One can of course work around this by doing Secure Boot with your own keys, but that has its own set of drawbacks.
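
The reason a custom splash-screen breaks Secure Boot can be illustrated with a plain digest. This is a deliberately simplified sketch: real Secure Boot verifies a public-key signature over the boot image rather than comparing bare SHA-256 hashes, and the byte strings below are placeholders, not real kernel or image data. The point is only that any change to the signed bytes yields a different digest, so a signature made over the original no longer matches.

```python
import hashlib


def digest(blob: bytes) -> str:
    """Return the SHA-256 digest of a blob as a hex string."""
    return hashlib.sha256(blob).hexdigest()


# Placeholder bytes standing in for the signed kernel image.
signed_kernel = b"placeholder kernel image bytes"
# The same image with a (placeholder) splash-screen injected.
kernel_with_splash = signed_kernel + b"placeholder splash image bytes"

original_digest = digest(signed_kernel)
modified_digest = digest(kernel_with_splash)

# The digests differ, so a signature over the original image
# would not validate against the modified one.
print(original_digest == modified_digest)  # prints False
```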

Summary

In all, it has been a bumpy ride. However, looking back at it, Core was still a good choice. Had we rolled our own solution with Yocto (which we explored), we would have run into another set of issues and we would be nowhere as secure as we are with Core.
