Our client is an online social networking and virtual experiences company based in the United States. They partnered with HCLTech to provide smart hands and break/fix support for network and power distribution devices for all of their data centers and point of presence (POP) locations.
The Challenge
Complex infrastructure environment and limited support team availability
Our client faced several challenges in ensuring data center and lab infrastructure availability and reliability across the US, EMEA and APAC. Their large, complex environment consists of more than 100,000 network and power distribution devices across 50+ data centers and multiple POP locations worldwide.
Data centers and POP locations generated a high ticket volume of more than 15,000 tickets per month, leading to a long mean time to repair (MTTR) for the data center smart hands and break/fix teams. The smart hands team was also available only on an ad hoc basis, which caused support issues when establishing new data halls.
The Objective
Global data center/lab and support coverage
The client sought a partner who could ensure the reliability of devices across data center and POP locations by identifying best practices and improving operations, documentation and processes.
They also wanted the partner to ensure consistent smart hands availability across all locations for any new data hall turnups, as well as to optimize and standardize operational processes and documentation.
The Solution
Best practices for command center, support teams, automation and SOPs
The HCLTech team stepped in to establish a command center team that triages tickets and provides guidance so that the on-the-ground team can take corrective action. We also analyzed operations and recommended process automation for device backups and health checks, provisioning workflows, CMDB updates and the draining and undraining of devices and racks. These best practices improved the on-the-ground team’s performance and reduced mean time to repair (MTTR).
Our team continues to work closely with the client’s engineering team on improvements to the scripts used for day-to-day deployment, upgrade, replacement and fault identification/restoration activities. The team also analyzed noisy ticket volume and corrected ticket correlations.
HCLTech now manages tiger team support, and the project manager forecasts new data halls and plans the smart hands support they require. The smart hands team also tests devices rigorously before deployment to reduce device failure rates.
The Impact
Reliable data center and lab infrastructure support
HCLTech’s implementation of best practices for infrastructure support resulted in several benefits:
- 20% reduction in MTTR
- 25% reduction in noisy ticket volume
- 15% reduction in device failure rates, which also significantly reduced return merchandise authorizations (RMAs)
- Remote troubleshooting and automation reduced manual touches and increased productivity
- Standardized processes across all data centers and labs streamlined training and reduced errors