Stability & availability with full stack observability

Overview

A large global fintech company, in collaboration with HCLTech, embarked on an observability journey to enhance the reliability and performance of its core platforms. The program was aligned with business value streams to help platforms improve monitoring, error budgeting and Service Level Objectives (SLOs) by implementing observability tools and practices across the ecosystem, leveraging existing tools such as Dynatrace, Splunk and Prometheus.

The Challenge

The client’s platforms faced several operational challenges, including limited insight into critical processes such as batch transfers, high toil from manual monitoring, non-standardized monitoring processes and the absence of a centralized logging solution. Furthermore, SLOs were not defined, which hampered the ability to measure user experience and service reliability across platforms.

The Objective

The primary objective of the program was to accelerate platform stability and reliability for 80 platforms across Core Banking, Wealth Management and Capital Market platforms. This required identifying customer journeys, then documenting and implementing SLOs and Service Level Indicators (SLIs) across all critical services. In doing so, the program sought to reduce toil, improve platform observability and enhance user experience, while aligning with contractual business Service Level Agreements (SLAs) and incorporating chaos engineering practices.

The Solution

HCLTech’s domain-centric observability approach and implementation were divided into the following phases:

Business value stream identification: Identified, prioritized and implemented value streams and customer journeys based on business value delivered.
SLI/SLO documentation: A comprehensive assessment was conducted to document SLOs/SLIs for each platform. Error budgets were documented based on critical services and user journeys. Reusable templates and documents were developed for efficient operationalization.
Continuous improvement: Monthly reviews of SLIs/SLOs, combined with error budget policies, helped prioritize improvements, reduce toil and enhance automation.
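
To make the relationship between SLIs, SLOs and error budgets concrete, the following is a minimal Python sketch of the kind of calculation a monthly review could rely on. It is illustrative only and is not taken from the program itself: the SloWindow structure, the function names, the request counts and the 99.9% target are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """Request counts for one service over a review window (hypothetical shape)."""
    total_requests: int
    failed_requests: int
    slo_target: float  # e.g. 0.999 means 99.9% of requests should succeed


def availability_sli(window: SloWindow) -> float:
    """Availability SLI: the share of requests that succeeded."""
    if window.total_requests == 0:
        return 1.0  # no traffic, so nothing failed
    good = window.total_requests - window.failed_requests
    return good / window.total_requests


def error_budget_remaining(window: SloWindow) -> float:
    """Fraction of the error budget left: 1.0 untouched, 0.0 spent, negative overspent."""
    allowed_failures = (1.0 - window.slo_target) * window.total_requests
    if allowed_failures == 0:
        return 1.0 if window.failed_requests == 0 else 0.0
    return 1.0 - window.failed_requests / allowed_failures


def review_decision(window: SloWindow) -> str:
    """Assumed error budget policy: shift to reliability work once the budget is spent."""
    if error_budget_remaining(window) <= 0.0:
        return "prioritize reliability work"
    return "continue feature delivery"


# Illustrative numbers only: one million requests, 400 failures, 99.9% target.
month = SloWindow(total_requests=1_000_000, failed_requests=400, slo_target=0.999)
print(f"SLI: {availability_sli(month):.4%}")
print(f"Error budget remaining: {error_budget_remaining(month):.0%}")
print(review_decision(month))
```

In this sketch a 99.9% target over one million requests allows roughly 1,000 failures, so 400 failures leave about 60% of the budget. A real policy would typically add intermediate thresholds and draw its counts from the monitoring tools already in place.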

The Impact

The implementation of the program resulted in significant improvements:

Business agility: 2x improvement in the number of changes delivered to production.
Reduced operational toil by ~20%: Centralized logging strategies and observability tools provided actionable insights, reducing the manual effort required for monitoring.
Improved service reliability and reduced incidents by 10%: Platforms now have defined SLIs and SLOs, with alerting mechanisms for SLO violations, ensuring higher availability and better user experience.
Enhanced platform maturity and improved Mean Time To Recover (MTTR) by 30%: The platforms have moved from basic infrastructure-level monitoring to more sophisticated application-level observability, with user experience being measured through latency and error rates.
Shortened product feature release cycles from monthly to biweekly: More frequent releases accelerated feature rollouts.

Overall, the program has accelerated the maturity of the platforms, bringing them closer to achieving their reliability and observability goals.
