
Sysdig monitoring via eBPF

Ned Bellavance


A few weeks ago I attended a presentation by Sysdig as part of Cloud Field Day 5. Prior to attending CFD5 I did a little research about the company and their products and wrote up a quick post where I posed a few questions. I think some of those questions were answered by the presenters. The questions were:

  • Do you currently support or plan to support container deployments using AWS Fargate or Azure Container Instances?
  • Is Sysdig a marketplace item in AWS or Azure today to simplify deployment?
  • How are you handling balancing open-source and paid products? Are there plans to open-source the whole solution like Chef just did?
  • What are you doing with all the aggregated monitoring data you are getting from clients?
  • What are the major security concerns with your solution and how are you addressing them?

You can watch all three videos on YouTube. Below are a few thoughts and a bit of a deep dive on how Sysdig is capturing and storing their information.

Sysdig was founded by Loris Degioanni, who co-authored Wireshark. As someone who has used Wireshark on more than one occasion, I know the value that tool has to a systems administrator. There was a time when I needed to troubleshoot an Active Directory Cross Forest Trust authentication issue between two forests with Selective Authentication enabled. The only way I ever figured out the problem was by firing up Wireshark and looking at the packet captures coming out of the domain controllers on both sides and the client computer. By analyzing the authentication handshakes between multiple clients, we were able to track down a DNS issue, a FSMO role issue, and a poorly configured DC in a child domain. That’s just one instance where Wireshark helped me crack a particularly tricky situation.

That’s the old world though, the legacy world of VMs and physical machines where the packet was king and packet capture put you in the center of everything. The reason Loris founded Sysdig, and the guiding principle behind their solution, is that packet capture on the wire is dead. It just doesn’t work for a cloud-native world with containers being spawned and thrown away constantly. The solution was to find a way to capture all traffic information from containers, look for anomalous behavior, and send that information up to an aggregation point for further analysis and alerting.

Sysdig is doing this through a lightweight container that lives on each host and has access to eBPF (extended Berkeley Packet Filter) running in the kernel of the host. If you aren’t familiar with eBPF, don’t worry! Neither was I. Here’s a really good article introducing eBPF and what it can do. At a basic level, an eBPF program gets attached to a code path in the kernel, and verified programs are loaded and managed through the bpf() system call. If you want to filter, monitor, and classify network traffic in a performant way, then eBPF is your friend. The Sysdig container running on each host is basically sniffing the traffic of all other containers running on that host. I don’t know the exact details, but I assume when a new container is spun up, the Sysdig program is attached to its code path. Sysdig can see not only network traffic, but also IPC calls. The information is processed locally by the Sysdig agent, aggregated, and sent up to the Sysdig analysis platform for additional processing. There is also a circular buffer of about 100MB that captures every packet and holds it while the agent checks for anomalous or interesting behavior. If it finds some, the full packet capture is uploaded so that you can do a playback of exactly what was happening at that time. Even if the container that caused it has ceased to exist, you can still replay the entire packet flow and find out what was going on. That is pretty freakin’ awesome for a developer trying to debug code or for a security person trying to figure out how a breach occurred.
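To make the circular buffer idea a bit more concrete, here is a minimal Python sketch of how a byte-bounded capture ring with an anomaly-triggered snapshot might work. This is not Sysdig's code, and I have no idea what their agent looks like internally. The names `Event`, `CaptureRing`, `is_anomalous`, and `capture` are all made up for illustration, and the anomaly rule is a deliberately silly placeholder.

```python
from __future__ import annotations

from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One captured item (hypothetical shape, not Sysdig's data model)."""
    container_id: str
    payload: bytes


class CaptureRing:
    """Byte-bounded circular buffer: the oldest events are evicted first."""

    def __init__(self, max_bytes: int) -> None:
        self.max_bytes = max_bytes
        self.used_bytes = 0
        self._events: deque[Event] = deque()

    def record(self, event: Event) -> None:
        self._events.append(event)
        self.used_bytes += len(event.payload)
        # Evict from the oldest end until we are back within the byte budget.
        while self.used_bytes > self.max_bytes and self._events:
            evicted = self._events.popleft()
            self.used_bytes -= len(evicted.payload)

    def snapshot(self) -> list[Event]:
        # Copy the current contents so later evictions do not affect the upload.
        return list(self._events)


def is_anomalous(event: Event) -> bool:
    # Placeholder rule, assumed purely for illustration.
    return b"/etc/shadow" in event.payload


def capture(events: list[Event], ring: CaptureRing) -> list[list[Event]]:
    """Feed events through the ring; return the snapshot taken at each anomaly."""
    uploads = []
    for event in events:
        ring.record(event)
        if is_anomalous(event):
            uploads.append(ring.snapshot())
    return uploads
```

The snapshot includes the events that came before the anomaly, not just the anomalous event itself, and it holds plain copies of the data. That is why a replay would still be possible after the originating container is gone: nothing in the snapshot depends on the container still existing.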

There is still a question of security here. In fact, I asked Loris that question directly! You are placing a lot of trust in that Sysdig container. It gets to inspect all the traffic on your other containers, which makes it a goldmine of information for a potential attacker. In that regard, I would hope that Sysdig is constantly patching and improving their solution to keep their container images secure. Loris made the point that otherwise you would need to instrument every container individually or run a sidecar with every container. A vulnerability in that situation would be much harder to mitigate. If a vulnerability is discovered with Sysdig, you can stop the Sysdig containers on all the hosts until a patch is available. In the meantime, the rest of your workloads continue to function uninterrupted. The aggregated data being uploaded to their service is being stored per client, but they did say that they anonymize the data and use it for model training. You can opt out of that, or you can choose to run the Sysdig software locally and not use their cloud service at all.

I did not get to ask them about Fargate or ACI. My guess is that neither is currently supported because of the way that Sysdig works. It’s using eBPF to collect data, so it needs a container running on every host in the environment. If you’re running regular K8s or something like AKS, that’s fine. Each worker node is dedicated to you, and you can run that additional container on each host. Something like Fargate or ACI does not allocate a whole host to you, and there is no way that Microsoft or AWS is going to let you collect the network data on all containers running on each host you use in ACI or Fargate. Talk about a security nightmare! There are a few other ways Sysdig could support those platforms:

  • Embed the Sysdig agent in any container image you deploy to ACI or Fargate
  • Run a sidecar proxy with each container image you deploy to ACI or Fargate
  • Use native tooling in Azure or AWS to stream telemetry to Sysdig

I did ask about converting from Open Core to fully Open Source in the same way that Chef did. For the time being, Sysdig is going to continue down the path with Open Core. I did not get to ask them about publishing something to the Azure or AWS marketplace, but having actually run a demo myself I can tell you that it is very easy to deploy.

While I think that the death knell of packet capture is a bit premature, it helped Sysdig tell a compelling story about the unique challenges that cloud-native applications pose for monitoring, performance, and security. Their tool approaches solving that problem in a unique and sophisticated way.