
Finance data lake development

HCLTech improved the speed, quality and cost-effectiveness of company-wide financial reporting while ensuring SOX compliance for auditing.

Introduction

Large enterprises acquiring companies often face challenges integrating financial systems. Our client, a Fortune 100 industrial conglomerate, has acquired dozens of businesses over the last several decades, each with its own ERP system and data warehouse, creating inefficiencies that limited performance and cost-effectiveness.

The Challenge

Interrelated technical, operational and security requirements

Every day, our client’s companies record tens of thousands of financial transactions. This poses a unique set of needs for a data lake:

  • Ingesting highly granular, usually transaction-level, data in near real time
  • Integrating more than 100 source systems spanning more than 30 different types
  • Complying with Sarbanes-Oxley (SOX) requirements
  • Meeting a zero-tolerance standard for data inconsistencies
  • Managing voluminous, complex data: 200 TB of enterprise data comprising more than 25,000 data domains (tables) of over 500 distinct types
  • Supporting simultaneous data consumption and continuous ingestion

The Objective

A highly performant single source of truth for financial reporting

To optimize the performance of financial reporting systems and reduce license, equipment and personnel costs, our client launched a project to consolidate the disparate data warehouses into a single data lake for financial data.


The Solution

A state-of-the-art, SOX-compliant data lake

The solution called for an architecture in which raw data would first be mirrored, via ingestion, into a Massively Parallel Processing (MPP) relational database. Data would then be replicated to in-memory and Hadoop-based consumption layers for downstream use, such as aggregation and data science applications.
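
This layering can be pictured as a raw mirror schema that receives transaction-level rows as-is, with consumption-layer objects derived from it for later replication to the in-memory reporting store. The sketch below uses standard PostgreSQL/Greenplum SQL; the schema, table and column names are illustrative assumptions, not the client's actual data model.

    -- Illustrative sketch only: hypothetical schema, table and column names.
    -- Raw mirror layer: transaction-level rows landed as-is from a source ERP.
    CREATE SCHEMA IF NOT EXISTS mirror;

    CREATE TABLE mirror.gl_transactions (
        source_system  text          NOT NULL,  -- originating ERP identifier
        transaction_id text          NOT NULL,
        posting_date   date          NOT NULL,
        account_code   text          NOT NULL,
        amount         numeric(18,2) NOT NULL,
        load_ts        timestamptz   NOT NULL DEFAULT now()
    );

    -- Consumption layer: an aggregate derived from the mirror, of the kind that
    -- would later be replicated to the in-memory reporting database.
    CREATE SCHEMA IF NOT EXISTS consumption;

    CREATE VIEW consumption.gl_daily_balances AS
    SELECT source_system,
           account_code,
           posting_date,
           sum(amount) AS daily_balance
    FROM   mirror.gl_transactions
    GROUP  BY source_system, account_code, posting_date;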

  • Integrated data from 100+ source systems — including 50+ Oracle and SAP ERPs — into a single finance data lake platform and implemented incremental data loads where possible.
  • Standardized the data model for common reporting and created maintainable and reusable ETL code.
  • Enabled mirror integration with HVR and Talend.
  • Implemented ETL processing through a generic, fully metadata-driven ETL framework built on PL/pgSQL stored procedures and orchestrated by Talend, speeding up the development lifecycle (a minimal sketch of this pattern follows the list).
  • Introduced a robust audit framework to identify discrepancies and automatically reprocess affected records.
  • Implemented incremental data replication from Greenplum to MemSQL via S3, using a custom queue framework and a Talend solution.
  • Archived long-inactive ERPs with a proprietary archival framework that stores a compressed copy of any source ERP database in a cost-effective PostgreSQL database.
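
As a rough illustration of the metadata-driven pattern referenced above, the sketch below shows a control table and a generic PL/pgSQL load routine that builds its SQL from that metadata and records row counts for the audit framework. All object names (the etl_meta schema, load_control table and run_load function) are hypothetical assumptions; the actual framework is orchestrated by Talend and is considerably richer.

    -- Minimal sketch, not the actual framework: hypothetical etl_meta schema,
    -- load_control table and run_load() function. Assumes PostgreSQL/Greenplum PL/pgSQL.
    CREATE SCHEMA IF NOT EXISTS etl_meta;

    -- One row per source-to-target mapping; this metadata drives every load.
    CREATE TABLE IF NOT EXISTS etl_meta.load_control (
        load_id       serial PRIMARY KEY,
        source_table  text    NOT NULL,
        target_table  text    NOT NULL,
        incremental   boolean NOT NULL DEFAULT true,
        watermark_col text,                -- column used to filter incremental loads
        last_loaded   timestamptz
    );

    -- Generic load routine: builds its INSERT from the control row and returns
    -- the row count so an audit process can compare it against the source.
    CREATE OR REPLACE FUNCTION etl_meta.run_load(p_load_id int)
    RETURNS bigint LANGUAGE plpgsql AS $$
    DECLARE
        ctl    etl_meta.load_control%ROWTYPE;
        v_sql  text;
        v_rows bigint;
    BEGIN
        SELECT * INTO STRICT ctl FROM etl_meta.load_control WHERE load_id = p_load_id;

        IF ctl.incremental AND ctl.watermark_col IS NOT NULL THEN
            v_sql := format('INSERT INTO %s SELECT * FROM %s WHERE %I > %L',
                            ctl.target_table, ctl.source_table,
                            ctl.watermark_col,
                            coalesce(ctl.last_loaded, 'epoch'::timestamptz));
        ELSE
            v_sql := format('INSERT INTO %s SELECT * FROM %s',
                            ctl.target_table, ctl.source_table);
        END IF;

        EXECUTE v_sql;
        GET DIAGNOSTICS v_rows = ROW_COUNT;

        UPDATE etl_meta.load_control SET last_loaded = now() WHERE load_id = p_load_id;
        RETURN v_rows;   -- consumed by the audit/reconciliation step
    END;
    $$;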

The Impact

Timely, accurate data for critical finance department functions with reduced operational costs

The solution enabled granular tracking of platform and infrastructure costs at the company and division level, delivering a wide range of business and technical benefits:

  • $30-$40M forecasted savings via retiring legacy data warehouses
  • Up to $15M savings per year for five years via ERP archiving
  • Six-figure USD savings in auditing costs, achieved through daily automatic data feeds to the auditing firm and the elimination of on-site auditing
  • 30% increase in transaction processing and a 50% increase in data and analytics productivity
  • 80% reduction in the manual effort needed to reconcile financial data in the ledger, via new data feeds developed for the automatic reconciliation manager system
  • Run-time reduction from minutes to seconds, plus row-level security, via the MemSQL reporting database for the self-service consumption layer