Observability by Design

Do you treat observability components like metrics, dashboards, and alerts as "artifacts", or do you "configure" them today? Should you "ship" them, or should you keep them out of your SDLC?

Observability actions to take at each step of your SDLC

As a software practice, observability is simply the ability to determine the internal health of the various components of your solution just by examining their externally exposed states.

In the last few years, observability giants such as New Relic and Datadog, and evolving standards such as OpenTelemetry, have led the way in capturing the exposed states of a solution, from code traces to infrastructure metrics. All these tools provide ways to instrument (how to generate), a time series database (how to collect), and a visualization platform (how to observe).

In a typical organization, the Ops team implements one of these tools, and dev teams subsequently configure their dashboards, alerts, and debugging mechanisms on top of it.

However, some fundamental issues remain. There are no guarantees on:

  1. Quality of observability components - are they well tested?
  2. Consistency across deployments - are we missing data points or alerts in one of the deployments?
  3. Standardization across the organization - is everyone leveraging observability uniformly, or is it people-dependent?
  4. Discoverability of observability components - is there a central catalog that enables organization-wide discovery of observability components?

Signs that you have gaps in observability

From our conversations with partners and customers, we see that the above issues manifest in the following ways.

  1. The absence of a firing alert doesn't guarantee stability. With a disparate understanding of "what to collect" and "what to observe", you can't be confident whether silence means everything is working as expected or failures simply aren't being observed.
  2. Are you chasing issues after they arise and hitting dead ends? Ideally, you shouldn't have to learn about outages, bugs, or performance degradation from end users. But let's say they happen! If your RCAs hit dead ends during incident postmortems, it's probably because common metrics and instrumentation are missing.
  3. Are you repeating the same mistakes across teams and environments? Recurring incidents of similar types across teams and environments are a sign that you're not applying your learnings in a sustainable way. You end up configuring additional dashboards after each incident, but that doesn't ensure the learning percolates so that the same and similar mistakes never recur.

Solution: Become observable "by design"

We at Facets use the term "by design" to indicate that you can truly guarantee certain outcomes in the SDLC pipeline. For example, a promoted build will appear in the production environment within a defined time frame, or alerts will be raised if it doesn't.

In the context of observability by design, it means two things.

First, realize and treat observability components as "artifacts".

An observability artifact is like a release artifact which can be deployed.

Then, tightly couple these observability artifacts with each phase of your pipeline so they are discovered as well as deployed.

Let's track what changes you need to make in your SDLC stages for observability by design:

  • SDLC Phase: Plan
  • Observability Actions: Define SLOs/SLAs & Business Metrics

While planning features, we should define the relevant SLOs and SLAs. Product stakeholders should also define the Business Metrics that can track the new feature's adoption, usage, and performance. Technical Leads can then define the metrics that track the feature at a finer granularity, for instance, API- or database-level metrics.
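
To make these targets concrete before development starts, an SLO agreed on at this stage can be written down as a measurable definition. Below is a minimal sketch as a Prometheus alerting rule for a hypothetical 99.9% availability SLO, assuming a service named "checkout" that exposes an http_requests_total counter with a code label (all names here are illustrative, not prescribed by any particular tool):

```yaml
# Sketch only: a planned "99.9% availability" SLO expressed as an alerting rule.
# Service and metric names are illustrative assumptions.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: checkout-slo
spec:
  groups:
    - name: checkout-availability-slo
      rules:
        - alert: CheckoutAvailabilitySLOBreach
          # error ratio over the last hour exceeds the 0.1% budget implied by a 99.9% SLO
          expr: |
            sum(rate(http_requests_total{service="checkout", code=~"5.."}[1h]))
              /
            sum(rate(http_requests_total{service="checkout"}[1h])) > 0.001
          for: 15m
          labels:
            severity: page
          annotations:
            summary: "Checkout availability is burning its SLO error budget"
```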

  • SDLC Phase: Develop
  • Observability Actions: Configure Metric Discovery & Define Metrics, Dashboards, and Alerts

The OpenMetrics project introduces "metric discovery", where sources that produce metrics merely expose them in a standard format and collectors "discover" them. Any packaging mechanism for an application should include the metadata on how the metrics for the application can be discovered. For example, if you package your application as helm charts, they must include a ServiceMonitor or support Prometheus scrape annotations so that the end user can configure metric discovery.
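
As one possible shape for that metadata, here is a minimal ServiceMonitor that could be bundled in an application's helm chart so that a Prometheus Operator installation can discover its metrics endpoint. The names, labels, and port are illustrative assumptions:

```yaml
# Sketch only: a ServiceMonitor shipped with the application chart for metric discovery.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: checkout
  labels:
    release: prometheus        # the label your Prometheus instance is configured to select
spec:
  selector:
    matchLabels:
      app: checkout            # must match the labels on the application's Service
  endpoints:
    - port: http-metrics       # named port on the Service that exposes /metrics
      path: /metrics
      interval: 30s
```

If you rely on annotation-based discovery instead, the chart's pod template can carry the commonly used prometheus.io/scrape and prometheus.io/port annotations - a convention the end user's scrape configuration has to honor, not a Prometheus built-in.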

Visualizations and alerts should be bundled with each component rather than configured later. For example, you can ship Grafana dashboards as ConfigMaps and alert definitions as PrometheusRules bundled in your helm chart. Popular tools like New Relic have started addressing this need with dashboards and alerts as code.
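
For instance, a dashboard exported from Grafana can ride along in the chart as a ConfigMap, assuming Grafana is deployed with its dashboard sidecar that loads ConfigMaps carrying a grafana_dashboard label (the label key is configurable, and the JSON here is trimmed to a stub):

```yaml
# Sketch only: a Grafana dashboard shipped as a ConfigMap alongside the application.
apiVersion: v1
kind: ConfigMap
metadata:
  name: checkout-dashboard
  labels:
    grafana_dashboard: "1"     # picked up by the Grafana dashboard sidecar
data:
  checkout.json: |
    { "title": "Checkout Service", "panels": [] }
```

Alert definitions can travel the same way as PrometheusRule objects, like the SLO rule sketched earlier.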

  • SDLC Phase: Continuous Integration
  • Observability Actions: Review & Refine Metrics, Dashboards, and Alerts

CI ensures code quality, and it should also guarantee observability. At this phase, teams ensure that the defined metrics, alerts, and dashboards adhere to benchmarks and standards. It is relatively straightforward to draw up standards for metrics from sources of the same kind. For example, one could enforce that every gRPC application must expose a pre-defined set of metrics.
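
One way to wire such checks into CI is sketched below (GitHub Actions syntax shown, but any CI system works): the job validates the bundled alert rules with promtool and fails the build if the chart never references the organization's required gRPC metrics. The paths and the required-metric list are illustrative assumptions:

```yaml
# Sketch only: CI guardrails for observability artifacts.
name: observability-checks
on: pull_request
jobs:
  verify:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # promtool ships with Prometheus; this assumes it is available on the runner
      - name: Validate alert rule syntax
        run: promtool check rules observability/alerts/*.yaml
      - name: Enforce the standard gRPC metric set
        run: |
          for metric in grpc_server_handled_total grpc_server_handling_seconds_bucket; do
            grep -rq "$metric" observability/ || { echo "missing required metric: $metric"; exit 1; }
          done
```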

  • SDLC Phase: Deploy
  • Observability Actions: Automatic Rollouts of Metrics, Dashboards, and Alerts to environments

Provisioning of all the metrics, dashboards, and alerts should be centralized and consistent across all environments. Any new feature development most probably requires a change to these defined metrics, so this process is, by definition, continuous.
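
A simple way to keep environments consistent is to roll out the same chart version - and with it the bundled dashboards and alerts - to every environment from one pipeline. A sketch in GitHub Actions syntax, where the chart path, environment list, and values files are illustrative assumptions and cluster credentials are omitted:

```yaml
# Sketch only: one pipeline fans the same release (and its observability artifacts)
# out to every environment.
name: deploy
on:
  push:
    branches: [main]
jobs:
  rollout:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        environment: [dev, staging, production]
    steps:
      - uses: actions/checkout@v4
      # cluster access for each environment is assumed to be configured on the runner
      - name: Deploy chart and its observability artifacts
        run: |
          helm upgrade --install checkout ./charts/checkout \
            --namespace "${{ matrix.environment }}" \
            --values "values/${{ matrix.environment }}.yaml"
```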

  • SDLC Phase: Operate
  • Observability Actions: Analyze Metrics, Generate Feedback & Address Incidents

All the metrics, dashboards, and alerts that have been set up need to be monitored continuously. Monitoring is required to verify that all the benchmarks and thresholds we set are being met and, indeed, that we are capturing the correct information. Armed with this, stakeholders can feed the learnings from incidents back into the Planning and Development phases to achieve Continuous Feedback.
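
Part of this verification can itself be automated: a "meta" alert that fires when an expected metric stops being reported catches gaps in collection rather than gaps in the service. A minimal sketch, again with an illustrative metric name:

```yaml
# Sketch only: alert on missing telemetry, not just on failing services.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: checkout-meta-monitoring
spec:
  groups:
    - name: observability-health
      rules:
        - alert: CheckoutMetricsMissing
          expr: absent(http_requests_total{service="checkout"})
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "No checkout request metrics have been scraped for 10 minutes"
```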

Conclusion

In summary, we must ensure that observability components are treated as artifacts. Once you devise a mechanism to ship these artifacts rather than configuring them by hand, you will see a steady reduction in recurring mistakes. You can even go further and enforce quality and governance around these artifacts programmatically in your delivery pipeline.

Contact us if you want to know more about how Facets can help introduce observability in your SDLC.
