1. Introduction and Prerequisites: The need for a data hub or data lake for MDM

This is the first post in the series Full Stack: Remastering Master Data Management into graph like data. I hope you enjoy the series and find it useful!

Introduction

The biggest challenges in managing customer master data come from a variety of disconnected systems and sources, duplicated data, and a lack of standards for data governance, hierarchy and maintenance within organizations. For larger companies, this may be a legacy of mergers and acquisitions; for others, the discrepancy results from the lack of any data management vision, so that data repositories have grown into ‘swamps’ of stale facts that no one thinks they need.

A holistic approach to master data management can address these challenges both reactively (fixing data already in the systems) and proactively (preventing the creation of new records with incorrect, redundant, missing or duplicate details).
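
To make the reactive and proactive modes concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the `name` and `email` fields, the `normalize_key` helper, and the in-memory `CustomerRegistry` are hypothetical and are not taken from the series' source code. A real MDM system would use far richer matching rules than a normalized name and email.

```python
import re


def normalize_key(name, email):
    """Build a comparison key: collapse whitespace, trim, and ignore case."""
    clean_name = re.sub(r"\s+", " ", name).strip().lower()
    clean_email = email.strip().lower()
    return clean_name, clean_email


def deduplicate(records):
    """Reactive fix: keep the first record seen for each normalized key."""
    seen = {}
    for record in records:
        key = normalize_key(record.get("name") or "", record.get("email") or "")
        seen.setdefault(key, record)
    return list(seen.values())


class CustomerRegistry:
    """Hypothetical in-memory store of master customer records."""

    def __init__(self):
        self._records = {}

    def add(self, record):
        """Proactive check: refuse incomplete or duplicate records."""
        name = record.get("name") or ""
        email = record.get("email") or ""
        if not name.strip() or not email.strip():
            return False  # missing details
        key = normalize_key(name, email)
        if key in self._records:
            return False  # duplicate of an existing master record
        self._records[key] = record
        return True

    def __len__(self):
        return len(self._records)
```

Here `deduplicate` cleans up records that are already stored, while `CustomerRegistry.add` rejects a bad record before it is written.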

Need for a data lake

Data lakes have become the cornerstone of many big data initiatives because they offer easier and more flexible options to scale when working with high volumes of data generated at high velocity. They are typically used to store data that arrives in a constant stream from high-velocity, high-volume sources (such as IoT, product logs or web interactions), and when the organization needs a high level of flexibility in how the data will be used.

Before you start implementing a data lake, ask a few questions:

What type of data source are you working with?
Do you know exactly what you want to do with that data?
How complex is the process of data acquisition or ingestion?
What tools and frameworks already exist in your organization, or could be adopted?
What is your strategy for data management and governance?

Goals for creating a data lake

So let's start creating a data lake of our own. First, let's look at the goals we want to achieve:

Choose a technical framework that can be easily integrated with multiple environments, whether in the cloud or on a standalone OS.
The installer or application should have a minimal footprint and be highly fault-tolerant.
It must offer multiple modern integration interfaces, such as REST or GraphQL APIs, so that any kind of UI or server can consume the data.
Data must be hosted in a store that is highly performant and easy to query or fetch from.
Flexible ETL processes.
Streamline ingestion and pipelines using a message broker such as Kafka.

Technology Prerequisites

This series implements multiple microservices, all written from scratch using the following technologies:

Quarkus - A framework similar to Spring Boot, with a much smaller footprint and container-ready. Quarkus provides a cohesive, fun to use, full-stack framework by leveraging a growing list of over fifty best-of-breed libraries. It tailors your application for GraalVM and HotSpot. It offers fast boot time and low RSS memory (not just heap size!), giving near-instant scale-up and high-density memory utilization in container orchestration platforms like Kubernetes.
Spark Structured Streaming - Spark Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. You can express your streaming computation the same way you would express a batch computation on static data. The Spark SQL engine will take care of running it incrementally and continuously and updating the final result as streaming data continues to arrive.
GRAND stack - GRANDstack is a combination of technologies that work together to enable developers to build data intensive full stack applications. The components of GRANDstack are:
- GraphQL - A new paradigm for building APIs, GraphQL is a way of describing data and enabling clients to query it.
- React - A JavaScript library for building component based reusable user interfaces.
- Apollo - A suite of tools that work together to create great GraphQL workflows.
- Neo4j - The native graph database that allows you to model, store, and query your data the same way you think about it: as a graph.
Docker
- Kafka
- MySQL
- Neo4j
Minikube - A tool that runs a single-node Kubernetes cluster in a virtual machine on your personal computer.
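
The Structured Streaming idea in the list above, that a streaming computation is written like a batch computation and then run incrementally, can be illustrated with a small sketch. This is plain Python, not the Spark API. The event shape, the `source` field, and both function names are assumptions made up for this example.

```python
from collections import Counter


def count_by_source(events):
    """Batch view: count events per source over a static dataset."""
    return Counter(event["source"] for event in events)


def running_counts(micro_batches):
    """Streaming view: run the same query on each micro-batch, merge it
    into the running totals, and yield a snapshot after every batch."""
    totals = Counter()
    for batch in micro_batches:
        totals.update(count_by_source(batch))
        yield Counter(totals)  # copy, so earlier snapshots stay unchanged


# Hypothetical events, first as one static dataset, then split into batches.
all_events = [
    {"source": "iot"},
    {"source": "web"},
    {"source": "iot"},
    {"source": "logs"},
]
batches = [all_events[:1], all_events[1:3], all_events[3:]]
snapshots = list(running_counts(batches))
```

Once every micro-batch has been processed, the last snapshot equals the result of running `count_by_source` over the whole dataset. That equivalence is what lets the same query be used for both batch and streaming data.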

Source Code

You can find the source code in:

https://github.com/arpendu11/graph-based-data-lake

https://github.com/arpendu11/grand-data-lake-frontend

Cover Photo credit

Cover Photo by Simon Berger on Unsplash.