I’m thrilled to welcome our first guest, Ramesh Nampelly, Senior Director of Cloud Infrastructure and Platform Engineering at Palo Alto Networks. From SRE platforms to chaos engineering for service resiliency, Ramesh has worked on engineering effectiveness from many perspectives. In this interview, Ramesh gives me a unique view into how tech companies solve their DevOps and platform engineering challenges.

Mukta: Ramesh, welcome and thank you for joining us! It’s a pleasure to have this chat with you. As an enterprise cybersecurity platform, Palo Alto Networks has a huge number of products across many cybersecurity verticals. Take us under the hood! How do your engineering teams support all these products?

Ramesh: Hi Mukta, great to be here. Well, to give you an idea of my org, I joined Cloud Delivered Security Services (CDSS) at Palo Alto Networks.

PAN has four groups, or Speedboats as we call them internally. These are Prisma Strata (NetSec, or network security), Prisma Cloud (cloud security), Cortex (SIEM and XSIAM) and Unit 42 (security consulting). CDSS falls under the NetSec group.

With many engineers, the CDSS teams can be viewed as a bunch of internal startups. Every service team is responsible for end-to-end delivery and operations, i.e. from Dev/QA to staging to production. So we had to figure out a way to bring them under a unified governance model without impacting the current delivery cadence.

We have more than nine customer-facing services, each with its own CI/CD pipelines, infrastructure management and observability tools. So we needed a two-layered cloud infrastructure and platform approach. In this approach, the common platforms, frameworks and tooling are owned by the central team (i.e. my team), and the service-specific implementation is owned by the service team concerned. We also embraced an inner-sourcing (internal open source) model, in which the core or central team owns a given platform service but contributions can come from anywhere.

Mukta: So where did you start? Did you build an internal platform?

Ramesh: Yes. We did our research and didn’t find a single solution that satisfied all our requirements. The closest option we found was Spotify’s Backstage platform. My team started a POC with the initial goal of providing a Service Catalog that defines ownership, i.e. which dev team owns which service.

The first use case we solved was discoverability (explore and query). Previously, teams used Google Sheets, Confluence, Google Docs and so on to find ownership. The Backstage Service Catalog served us well: it pulls the metadata and artifacts for each service into a simple UI. The service teams appreciated it when we gave them the demo.
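As a rough illustration of the ownership lookup described above, here is a minimal Python sketch. It is not Backstage's API or the team's implementation; the `CatalogEntry` type, the service names and the team names are all hypothetical and stand in for whatever metadata a real catalog holds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    name: str   # service name (hypothetical)
    owner: str  # owning team (hypothetical)
    kind: str   # e.g. "service" or "website"


# Hypothetical entries; real data would come from the catalog itself.
CATALOG = [
    CatalogEntry("example-api", "team-a", "service"),
    CatalogEntry("example-web", "team-b", "website"),
    CatalogEntry("example-worker", "team-a", "service"),
]


def owner_of(service_name, catalog=CATALOG):
    """Return the owning team for a service, or None if it is not registered."""
    for entry in catalog:
        if entry.name == service_name:
            return entry.owner
    return None


def services_owned_by(team, catalog=CATALOG):
    """Return the sorted names of all services owned by a team."""
    return sorted(entry.name for entry in catalog if entry.owner == team)


if __name__ == "__main__":
    print(owner_of("example-web"))
    print(services_owned_by("team-a"))
```

The point of the sketch is only that a single structured source answers both questions ("who owns this?" and "what does this team own?") that were previously scattered across spreadsheets and documents.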

Next we tackled efficiency through self-service automation. For example, if one team solved a given problem or figured out some new tooling, how do we transfer that knowledge to another team? We built Devclues (our internal developer platform, based on Backstage), which creates the required scaffolding in the form of templates. Using these templates, developers can bring up a Kubernetes or Kafka cluster, a React or Go app, or a secret manager (Vault) integration with a few clicks.

We have also leveraged Backstage plugins such as Cost Insights, since tracking cloud costs was important to us. We extended the Cost Insights plugin to provide granular insight into what is contributing to the cost. For example, which SKU is contributing the most: is it compute, cloud storage, logging or networking? We’ve made sure this plugin provides engineer-level insights in addition to the exec-level view.
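The kind of breakdown described above can be sketched in a few lines of Python. This is an illustrative aggregation only, not the Cost Insights plugin or the team's extension; the line items, amounts and function names are made up for the example.

```python
from collections import defaultdict

# Hypothetical billing line items: (cost category, cost in dollars).
LINE_ITEMS = [
    ("compute", 400),
    ("storage", 200),
    ("compute", 200),
    ("logging", 150),
    ("networking", 50),
]


def cost_by_category(items):
    """Sum the cost of all line items per category."""
    totals = defaultdict(int)
    for category, cost in items:
        totals[category] += cost
    return dict(totals)


def ranked_shares(items):
    """Return (category, percent of total) pairs, largest contributor first."""
    totals = cost_by_category(items)
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    shares = [(cat, 100 * cost / grand_total) for cat, cost in totals.items()]
    return sorted(shares, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for category, share in ranked_shares(LINE_ITEMS):
        print(f"{category}: {share:.1f}%")
```

With the sample data, compute comes out on top at 60% of the total. The same ranking can serve both audiences: engineers look at the per-category detail, while an exec-level view only needs the top contributors.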

We also started extending the core platform with capabilities that were important to us, such as integration with OPA (Open Policy Agent). Now developers can work with guardrails and cost optimization in mind, instead of each team having different practices. This still gives developers the flexibility they need, but with best practices built in.

In addition to the internal developer platform, we are also working on an observability platform to help engineering teams achieve better service reliability at optimized cost. Over the last year, we built an internal observability platform called “Garuda” using open source software such as Grafana, StackStorm and Vector (vector.dev). This platform is going through an adoption phase as we speak. We have more than three teams (or tenants) operating their services in production using this platform.

Mukta: How was this received? Did it meet the expectations of internal teams?

Ramesh: Let me be very honest here. For any internal platform, building is easy but adoption is tough unless you have buy-in from your customers.

At PAN, from the start of the journey, we took customer adoption very seriously and built platform features in close collaboration with service teams. In fact, we co-developed the initial capabilities with our customers (i.e. internal engineering teams), much as startups do with design partners.

Some other measures we’ve taken to increase adoption are:

1. Collecting requirements and feedback on existing features through surveys.

2. Sending frequent updates through newsletters.

Mukta: Prior to Palo Alto Networks you worked at Cohesity. Tell us a bit about your work there.

Ramesh: At Cohesity I was Head of Engineering Efficiency. The key challenge there was to improve the productivity of engineers. It was very clear that the leadership gave great importance to internal engineering services. In my first week, I was asked a critical question: how would you increase engineering productivity by 10x?

To answer that question I first had to understand the bottlenecks in the system. So I spent my first few weeks learning the lay of the land by meeting senior leaders, key architects and some developers.

Two areas stood out from those conversations: engineering efficiency and better utilization of infrastructure to reduce expenses.

With regard to engineering efficiency, the main problem was build time. After thorough investigation and discussions with key architects, we decided to migrate the build system over to Bazel, an open-source build system from Google. Bazel provides a robust remote caching mechanism, which improved our capabilities, and as part of the migration we cleaned up implicit dependencies. Within six months we had migrated the majority of the C/C++ code to Bazel. By the end, we had achieved a 30-50% improvement in build time.

Regarding infrastructure utilization, we built a new system that dynamically provisions infrastructure in the datacenter based on intent and continuously monitors usage to scale up and down accordingly. The tool is used in regression runs as well, to make effective use of the available resources.

Mukta: Engineering efficiency is one of the underlying principles that has driven the evolution of DevOps, SRE and platform engineering. What was your strategy and vision when you started?

Ramesh: The team’s vision statement at Cohesity was “Provide an awesome developer experience and frictionless engineering services”.

Our strategy was centered around the OKR model, to create alignment and engagement around measurable goals. So I came up with quarterly and yearly OKRs for engineering efficiency, aligned with the overall engineering OKRs.

One practice that I’ve followed both at Cohesity and now at PAN is reviewing OKRs every month and adjusting execution accordingly. These OKRs are presented at the monthly all-hands so that the whole team is aware of the progress.

Mukta: Can you give us some examples of the kind of Objectives you set and the Key Results that followed at Cohesity?

Ramesh: Sure. We set objectives around three main aspects: a) improving developer experience and productivity, b) shifting quality left through increased automation, and c) maximizing infrastructure utilization.

So for the objective of improving developer experience and productivity, the Key Results were:

  1. Reduce CR merge time by 50%.
  2. Provide merge failure feedback through automated triaging for all commits.
  3. Reduce MTTR for merge failures to x hrs.
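To make Key Results like these measurable, one option is to compute them directly from merge and failure records. The following is a minimal Python sketch under stated assumptions: it uses the median as the summary of CR merge time (the interview does not say which statistic was used), all sample numbers are invented, and the MTTR target is left as a parameter because the interview gives it only as "x hrs".

```python
from statistics import mean, median


def percent_reduction(baseline_hours, current_hours):
    """Percent reduction in median merge time relative to the baseline."""
    base = median(baseline_hours)
    return 100 * (base - median(current_hours)) / base


def merge_time_kr_met(baseline_hours, current_hours, target_percent=50):
    """True if the median merge time dropped by at least target_percent."""
    return percent_reduction(baseline_hours, current_hours) >= target_percent


def mttr_hours(repair_hours):
    """Mean time to repair merge failures, in hours."""
    return mean(repair_hours)


def mttr_kr_met(repair_hours, target_hours):
    """True if MTTR is at or below the target (target is caller-supplied)."""
    return mttr_hours(repair_hours) <= target_hours


if __name__ == "__main__":
    # Invented sample data: hours from CR creation to merge.
    baseline = [10.0, 8.0, 12.0]
    current = [4.0, 5.0, 6.0]
    print(percent_reduction(baseline, current))
    print(merge_time_kr_met(baseline, current))
    print(mttr_hours([2.0, 4.0, 3.0]))
```

Computing the numbers this way keeps the monthly OKR review grounded in the same data each time, rather than in ad hoc estimates.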

Mukta: As you know, many CTOs and heads of Engineering at SaaS companies are solving this problem. However, the dilemma they face today is whether to invest in their own platform teams and internal frameworks or in an external platform. Given your journey and experience, what advice would you give them? Build or buy?

Ramesh: Very good question, and it’s the same thing we recently discussed at a Google Cloud customer event.

My view is that it depends on the kind of company. For example, most tech companies are probably already on their platform/DevOps/SRE journey by the time they are exposed to third-party tools. Secondly, these companies have unique platform requirements based on their product needs, and they focus heavily on building solutions specific to those needs. So any third-party tool or platform may not fully satisfy them.

When it comes to non-tech companies, it’s a little different, as building software is not their primary business. So they might prefer to buy rather than build.

Another important factor is migration: moving to a managed or third-party solution is easy if you are starting from scratch, but it can be tricky to migrate if you already have an in-house solution.

So I think it’s about the integration complexity and the potential costs associated with buying a solution. The point I want to get across is that if you go the buying route, you should consider a solution that covers all your needs and is continuously updated with new features and use cases, rather than buying a separate solution every time there is a problem and then integrating it into your existing infrastructure.

Lastly, what differentiates platform engineering from typical SRE/DevOps is that in platform engineering you treat the platform as a product, as opposed to an isolated automation use case.

Mukta: Thank you Ramesh for taking the time to share your insights!