Optimizing data center and POP infrastructure reliability | HCLTech

Tech company reduces data center device failure rates by 15%, improving infrastructure reliability

HCLTech’s command center and smart hands best practices resolve issues for a social networking and virtual experiences company in the US

Our client is an online social networking and virtual experiences company based in the United States. The company engaged HCLTech to provide smart hands and break/fix support for network and power distribution devices across all of its data centers and point of presence (POP) locations.

The Challenge

Complex infrastructure environment and limited support team availability

Our client faced several challenges in ensuring availability and reliability across the US, EMEA and APAC. Their large, complex environment consists of more than 100,000 network and power distribution devices across 50+ data centers and multiple POP locations worldwide.

Data centers and POP locations generated a high ticket volume of more than 15,000 tickets per month, leading to a long mean time to repair (MTTR) for the data center smart hands and break/fix teams. In addition, the smart hands team was available only on an ad hoc basis, which caused support issues when establishing new data halls.

The Objective

Global data center/lab and support coverage

The client sought a partner who could ensure reliability of devices in data center and POP locations by identifying best practices and improving operations, documentation and processes. 

They also wanted the partner to ensure consistent smart hands availability across all locations for any new data hall turnups, as well as to optimize and standardize operational processes and documentation.

The Solution

Best practices for command center, support teams, automation and SOPs

The HCLTech team stepped in to establish a command center team that triages tickets and provides guidance so that the on-the-ground team can take corrective action. We also analyzed and suggested process automation for device backup/health checks, provisioning workflow, CMDB updates and draining/undraining devices/racks. These best practices accelerated the on-the-ground team’s performance and reduced mean time to repair (MTTR).
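As a point of reference, MTTR is the average time between a ticket being opened and the repair being completed. The sketch below shows one way that figure could be computed from ticket timestamps. The function name, the tuple layout and the sample data are hypothetical and are not taken from the client's tooling.

```python
from datetime import datetime, timedelta

def mean_time_to_repair(tickets):
    """Average open-to-resolve duration over resolved tickets.

    `tickets` is a list of (opened_at, resolved_at) tuples; this layout is
    an assumption for illustration. Tickets that are still open
    (resolved_at is None) are excluded from the average.
    """
    durations = [
        resolved - opened
        for opened, resolved in tickets
        if resolved is not None
    ]
    if not durations:
        return None  # nothing resolved yet, so MTTR is undefined
    return sum(durations, timedelta()) / len(durations)

# Hypothetical sample data, not real client tickets.
sample_tickets = [
    (datetime(2024, 1, 1, 8, 0), datetime(2024, 1, 1, 12, 0)),  # 4 hours
    (datetime(2024, 1, 2, 9, 0), datetime(2024, 1, 2, 11, 0)),  # 2 hours
    (datetime(2024, 1, 3, 10, 0), None),                        # still open
]
mttr = mean_time_to_repair(sample_tickets)
print(mttr)
```

Tracking this value per month, before and after a process change, is one simple way to express a reduction as a percentage.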

Our team continues to work closely with the client’s engineering team on improvements to the scripts used for day-to-day deployment, upgrade, replacement and fault identification/restoration activities. The team also analyzed noisy ticket volume and corrected ticket correlations.
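To illustrate what ticket correlation can look like in practice, the sketch below groups tickets raised against the same device within a short time window, so that a burst of repeat alerts is handled as one incident. The 30-minute window, the function name and the device identifiers are assumptions for illustration only; the case study does not describe the actual correlation rules.

```python
from datetime import datetime, timedelta

def correlate_tickets(tickets, window=timedelta(minutes=30)):
    """Group tickets for the same device that arrive close together.

    `tickets` is a list of (device_id, opened_at) tuples (hypothetical
    layout). A ticket joins the device's most recent group when it opens
    within `window` of the last ticket in that group; otherwise it starts
    a new group.
    """
    groups = []
    latest_group = {}  # device_id -> most recent group for that device
    for device_id, opened_at in sorted(tickets, key=lambda t: t[1]):
        group = latest_group.get(device_id)
        if group is not None and opened_at - group[-1][1] <= window:
            group.append((device_id, opened_at))
        else:
            group = [(device_id, opened_at)]
            groups.append(group)
            latest_group[device_id] = group
    return groups

# Hypothetical sample data, not real client tickets.
sample = [
    ("sw-01", datetime(2024, 1, 1, 10, 0)),
    ("pdu-07", datetime(2024, 1, 1, 10, 5)),
    ("sw-01", datetime(2024, 1, 1, 10, 10)),
    ("sw-01", datetime(2024, 1, 1, 10, 20)),
    ("sw-01", datetime(2024, 1, 1, 12, 0)),
]
groups = correlate_tickets(sample)
print(len(sample), "tickets ->", len(groups), "correlated groups")
```

A real rule set would likely also consider alert type and rack or site topology; the time window alone is the simplest possible criterion.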

HCLTech now manages tiger team support, and the project manager plans and forecasts new data halls along with the smart hands support they require. Lastly, the smart hands team now tests devices rigorously before deployment to reduce device failure rates.

The Impact

Reliable data center and lab infrastructure support

HCLTech’s implementation of best practices resulted in several benefits:

  • 20% reduction in MTTR
  • 25% reduction in noisy ticket volume
  • 15% reduction in device failure rates, which also significantly reduced RMAs
  • Remote troubleshooting and automation reduced manual touches and increased productivity
  • Standardized processes across all data centers and labs streamlined training and reduced errors