
Data marketplace architecture

HCLTech helped improve the performance of the client’s data marketplace with a new architecture leveraging Snowflake and dbt.

Introduction

Our client, one of the world’s leading providers of investment, advisory and risk management solutions, is headquartered in New York City, with 70 offices in 30 countries and clients in 100 countries. They wanted to improve the performance of their home-grown data marketplace platform, particularly data onboarding speed, to ensure the satisfaction of both in-house and external customers at the product’s launch.

The Challenge

Building data pipelines in a highly volatile infrastructure

The HCLTech team analyzed the data pipelines behind the client’s product and identified the following challenges:

  • During development, the client was dissatisfied with the time it took to onboard new datasets and make them usable on the platform
  • The underlying issues caused month-long delays before a dataset could be shared on the data marketplace
  • Those issues were the lack of a standardized and streamlined strategy for ingesting third-party datasets and the technical limitations of the legacy Sybase database
  • In addition, data stewards needed to be proficient in Perl, Java, Python and Informatica PowerCenter to onboard new datasets, which caused further delays

The Objective

Improve customer satisfaction through better product performance

The client wanted to design and implement a solution to improve the platform’s performance, ensuring customer satisfaction upon the product’s launch.

  • To enable faster data throughput, the client needed to shift from batch-based processing to a stream-based architecture. This change, however, would result in a substantial increase in the volume of data the system would need to accommodate
  • The system would need to parse 5-10GB XML files into relational tables and quickly process batches of 300,000-400,000 small files (a streaming-parse sketch in Python follows this list)
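Files of that size cannot be parsed in memory in a single pass. The sketch below is a minimal illustration, not the client’s implementation, of how a multi-gigabyte XML file can be streamed into flat, relational rows in Python; the record tag and field names are hypothetical, since the case study does not describe the actual schema.

```python
import csv
import xml.etree.ElementTree as ET


def xml_to_rows(xml_path: str, csv_path: str, record_tag: str = "position") -> None:
    """Stream a multi-gigabyte XML file into flat rows without loading it all into memory.

    `record_tag` and the field names below are hypothetical; the case study
    does not describe the client's actual XML schema.
    """
    with open(csv_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["id", "as_of_date", "value"])
        # iterparse yields each element as soon as it is complete, so memory
        # use stays roughly constant regardless of file size.
        for _, elem in ET.iterparse(xml_path, events=("end",)):
            if elem.tag == record_tag:
                writer.writerow([
                    elem.findtext("id"),
                    elem.findtext("asOfDate"),
                    elem.findtext("value"),
                ])
                elem.clear()  # release the parsed subtree to keep memory flat


if __name__ == "__main__":
    xml_to_rows("positions.xml", "positions.csv")
```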

The Solution

Design and deploy a metadata-driven ingestion system

Our team designed the data architecture, set up and managed the ingestion and transformation processes, and created a data sharing strategy to improve the platform’s performance.

  • The team replaced the outdated Sybase database with a system relying primarily on Snowflake and dbt
  • Data was loaded from the object store into a stage 0 raw layer in Snowflake to maximize performance and to leverage dbt for transformations and data quality checks (a load sketch follows this list)
  • To lower the technical proficiency required for data stewardship, the team implemented a metadata-driven auto-ingestion platform. With this platform, data stewards only need to fill out metadata templates for the data to be picked up automatically by the framework
  • This strategy eliminated the need to interact directly with a database, ETL or orchestration tool, as the new framework generates everything from the metadata
  • The team constructed a data lake that accommodates structured, semi-structured and unstructured data, enabling rapid, efficient access to all data assets regardless of format
  • The framework runs on a Python-based backend that parses configuration and metadata files and automatically generates pipelines with extraction, loading, transformation and data quality check stages (a simplified generator sketch also follows this list)
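The engagement’s actual load scripts are not public; the sketch below shows one way the object-store-to-raw-layer step could look, using the Snowflake Python connector and the dbt CLI. The account, stage, table and model names are placeholders.

```python
import subprocess

import snowflake.connector

# Placeholder connection details; the real account, warehouse and object
# names from the engagement are not public.
conn = snowflake.connector.connect(
    account="my_account",
    user="loader",
    password="***",
    warehouse="LOAD_WH",
    database="MARKETPLACE",
    schema="RAW",
)

# Copy files landed in the object store (exposed to Snowflake as an external
# stage) into a stage 0 raw table with a single VARIANT column, which
# downstream dbt models flatten into relational tables.
conn.cursor().execute("""
    COPY INTO RAW.POSITIONS_STAGE0
    FROM @RAW.OBJECT_STORE_STAGE/positions/
    FILE_FORMAT = (TYPE = XML)
""")

# Hand off to dbt for transformation and data quality checks.
subprocess.run(["dbt", "run", "--select", "staging_positions"], check=True)
subprocess.run(["dbt", "test", "--select", "staging_positions"], check=True)
```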
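Likewise, the metadata framework itself is proprietary, so the following is only an illustration of the pattern: a data steward fills in a small template, and Python code expands it into ordered extraction, loading, transformation and quality check stages. Every field name here is hypothetical.

```python
import yaml  # PyYAML

# Hypothetical metadata template a data steward would fill out; the field
# names are illustrative, not the framework's real schema.
TEMPLATE = """
dataset: vendor_positions
source:
  type: object_store
  path: landing/vendor_x/positions/
  format: xml
target:
  table: RAW.VENDOR_POSITIONS
quality_checks:
  - not_null: [position_id, as_of_date]
  - unique: [position_id]
"""


def build_pipeline(metadata: dict) -> list[str]:
    """Turn a steward-supplied metadata template into ordered pipeline stages.

    In the real framework these would become generated orchestration tasks;
    here they are just human-readable stage descriptions.
    """
    stages = [
        f"extract: read {metadata['source']['format']} files from {metadata['source']['path']}",
        f"load: copy into {metadata['target']['table']} (stage 0 raw layer)",
        f"transform: run dbt models for {metadata['dataset']}",
    ]
    # Each quality check entry becomes its own validation stage.
    for check in metadata.get("quality_checks", []):
        for kind, columns in check.items():
            stages.append(f"quality: {kind} check on {', '.join(columns)}")
    return stages


if __name__ == "__main__":
    for stage in build_pipeline(yaml.safe_load(TEMPLATE)):
        print(stage)
```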

The Impact

Faster insights via reduced data load times

The client received a highly performant, robust and modular solution for onboarding datasets. They can now onboard new datasets in days, rather than months. The new architecture enabled easier data consumption for the client's customers and granted them access to an isolated replica using Snowflake Data Exchange. While the client's legacy architecture had an average latency of three days with larger datasets, the new solution can now load close to 4TB of semi-structured XML data in a couple of hours.