Traditional Data Warehouse Vs BDB Big Data Pipeline with Implementation Use Case

Data warehouse architecture has been transformed by the growth of organizations' digital footprints. The BDB (Big Data BizViz) Platform is designed to absorb this shift seamlessly by combining Big Data, IoT, Cognitive BI, Machine Learning, and Artificial Intelligence.

Introduction

This article shows how the BDB Platform has evolved toward data pipeline architectures built on distributed data stores, distributed computing, real-time or near real-time stream processing, machine learning, and analytics to support decision-making in today's rapidly changing business environment.

Today, Big Data has grown to become one of the pillars of any business strategy. In a highly competitive and regulated business environment, decisions must be based on data rather than intuition.

To make good decisions, it is necessary to process huge amounts of data efficiently (with as few computing resources and as little processing latency as possible), to add new data sources (structured, semi-structured, and unstructured, such as UI activity, logs, performance events, sensor data, emails, documents, and social media), and to support decisions with machine learning algorithms and visualization techniques. The more an organization invests in its proprietary algorithms, the better its chances of weathering the storms of the digital era.

Let us see how a traditional data warehouse architecture can be transformed into a contemporary one that solves today's big data and high-performance computing challenges. We illustrate the situation through the example of an enterprise customer with 40 million active users out of an overall user base of 100 million.

Traditional Data Warehouse Architecture

A traditional data warehouse still serves businesses that operate entirely off the cloud, but they will gradually need to change their strategy to embrace the digital revolution.

Traditional data warehouse architecture comprises the following elements:

  • Transactional systems (OLTP) – Produce and record the transactional data/facts resulting from business operations.

  • A data warehouse (DWH) – A centralized data store that integrates and consolidates the transactional data.

  • ETL processes – Scheduled batch processes that move the transactional data from the OLTP systems into the DWH.

  • Data Cleanup, Metadata Management, Data Marts, and Cubes – Provide derived and aggregated views of the DWH.

  • Analytical and Visualization Tools – Enable visualization of the data stored in the data marts and cubes through canned reports, ad-hoc reports, and dashboards for different types of users. They deliver first-generation analytics using data mining techniques. The BDB Platform does this well and has many successful deployments.

The following figure depicts how the traditional Data Warehouse Architecture works:

This architecture has some limitations in today's Big Data era, as listed below:

  • Data sources are limited only to transactional systems (OLTP).

  • The major workload is ETL batch processing (jobs), and it is well known that data is often lost in this step.

  • The integration and consolidation of data are very complex due to the rigid nature of ETL jobs.

  • The data schema is not flexible enough to be extended for new analytics use cases, and maintenance overhead is high whenever new fields enter the business processes.

  • It is very complex to integrate semi-structured and unstructured data sources, so important information in the form of log files, sensor data, emails, and documents is lost.

  • It does not naturally support real-time and interactive analytics; it is conceived as batch-oriented, whereas data streams call for push analytics.

  • Scaling the solution requires considerable effort, so not every business can budget such projects properly.

  • It is mainly designed for on-premises environments, so it is complex to extend and deploy in cloud or hybrid environments.

Although all the above steps can be handled efficiently with the BDB Platform, the data needs to be brought into a pipeline-based architecture to deliver seamless analytics.

Modern Pipeline Architecture

The modern pipeline architecture evolves from the previous one by integrating new data sources and new computing paradigms, along with IoT data, Artificial Intelligence, Machine Learning algorithms, Deep Learning, and Cognitive Computing. In this new approach, we have a pipeline engine with the following features:

  • A unified data architecture for all data sources, irrespective of their original structure. With the advent of data lakes, integration with existing data marts and data warehouses preserves the original data as well as the transformed data.

  • Flexible data schemas that are designed to change frequently, backed by NoSQL-based data stores.

  • A unified computing platform for processing any kind of workload (batch, interactive, real-time, machine learning, cognitive, etc.), using distributed platforms such as Hadoop/HDFS, Spark, Kafka, and Flume.

  • Deployable in hybrid and cloud-based environments; private clouds are key for enterprises.

  • Horizontal scalability, so data volume can grow simply by adding new nodes to the cluster (the BDB Platform offers both horizontal and vertical scalability).

Although there is a huge number of Big Data and analytics technologies, we have shown how various Big Data technologies can be combined with the BDB Platform to provide a working modern data pipeline architecture (see figure 02). The BDB Platform not only reduces the risk of experimenting with too many options but also comes with an architecture that has already proven itself behind it.

Modern Data Pipeline Architecture

Use Case 1. Data Warehouse Offloading

The data warehouse has been in place at major companies for many years. With the exponential growth of data, DWHs are reaching their capacity limits and batch windows are lengthening, putting SLAs at risk. One approach is to migrate heavy ETL and calculation workloads to Hadoop to achieve faster processing times, lower cost per unit of stored data, and free DWH capacity for other workloads.

Major Options:

  • One option is to load raw data from OLTP into Hadoop, then transform the data into the required models, and finally move the data into the DWH. We can also extend this scenario by integrating semi-structured, unstructured, and sensor-based data sources. In this sense, the Hadoop environment acts as an Enterprise Data Lake (a minimal sketch of this flow is given after the list).

  • Another option is to move data from the DWH into Hadoop using Kafka (a real-time messaging service) to run pre-calculations, and then store the results in data marts to be visualized with traditional tools. Because storage on Hadoop costs much less than on a DWH, we can save money and keep the data for a longer time. We can also extend this use case by taking advantage of the analytical power to create predictive models using Spark MLlib or the R language, or cognitive computing using the BDB Predictive Tool, to support future business decisions.
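
As a rough illustration of the first offloading option, the sketch below uses Apache Spark to read a raw OLTP extract from HDFS, pre-compute a heavy aggregation on the cluster, and write the curated result as Parquet, ready to be loaded into a data mart. The paths, file format, and column names are illustrative assumptions, not part of the BDB Platform itself.

```scala
// Minimal sketch of a DWH offloading job: raw OLTP extracts are read from HDFS,
// a heavy aggregation is pre-computed on the cluster, and the curated result is
// written back as Parquet for loading into a data mart. Paths and column names
// are illustrative assumptions.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object DwhOffloadJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dwh-offload")
      .getOrCreate()

    // Raw transactional extract dumped by the OLTP export (assumed CSV with header).
    val sales = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("hdfs:///datalake/raw/sales/")          // hypothetical landing path

    // Heavy pre-calculation that previously ran inside the DWH batch window.
    val dailyRevenue = sales
      .groupBy(col("store_id"), to_date(col("sold_at")).as("sale_date"))
      .agg(sum("amount").as("revenue"), countDistinct("customer_id").as("buyers"))

    // Curated model, partitioned for cheap incremental loads into the data mart.
    dailyRevenue.write
      .mode("overwrite")
      .partitionBy("sale_date")
      .parquet("hdfs:///datalake/curated/daily_revenue/") // hypothetical curated path

    spark.stop()
  }
}
```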

Use Case 2. Near real-time Analytics on Data Warehouse

In this scenario, we use Kafka as a streaming data source (the front end for real-time analytics) to store the incoming data as events/messages. As part of the ingestion process, the data is stored directly in the Hadoop file system or in a scalable, fault-tolerant, distributed database such as Cassandra (provided in BDB) or HBase. The data is then processed and data science models are created using the Data Science tools provided inside the BDB Platform. The results are stored in BDB's Data Store (an Elasticsearch-based tool) to improve the platform's search capabilities; the data science models can be stored in the BDB Data Science Lab, and the calculation results can be stored in Cassandra. The data can be consumed by traditional tools as well as by web and mobile applications via REST APIs. A minimal sketch of the ingestion path is given below.
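
The sketch below illustrates this ingestion path, assuming Spark Structured Streaming as the consumer: events are read from a Kafka topic, parsed from JSON, and written to Cassandra in micro-batches via the Spark Cassandra Connector. The broker address, topic, keyspace, table, and payload schema are illustrative assumptions; a BDB deployment may wire these stores differently.

```scala
// Minimal sketch of the near real-time path: events are consumed from a Kafka
// topic, parsed from JSON, and written to Cassandra in micro-batches. Broker,
// topic, keyspace, table, and payload schema are illustrative assumptions.
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

object NearRealTimeIngest {

  // Reuses the batch Cassandra writer inside each micro-batch
  // (requires the Spark Cassandra Connector on the classpath).
  def writeBatch(batch: DataFrame, batchId: Long): Unit = {
    batch.write
      .format("org.apache.spark.sql.cassandra")
      .option("keyspace", "analytics")   // hypothetical keyspace
      .option("table", "user_events")    // hypothetical table
      .mode("append")
      .save()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("nrt-ingest").getOrCreate()

    // Assumed JSON payload: event_id, user_id, event_type, ts.
    val schema = new StructType()
      .add("event_id", StringType)
      .add("user_id", StringType)
      .add("event_type", StringType)
      .add("ts", TimestampType)

    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // hypothetical broker
      .option("subscribe", "user-events")               // hypothetical topic
      .load()
      .select(from_json(col("value").cast("string"), schema).as("e"))
      .select("e.*")

    val query = events.writeStream
      .foreachBatch(writeBatch _)
      .option("checkpointLocation", "hdfs:///checkpoints/user-events/")
      .start()

    query.awaitTermination()
  }
}
```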

Benefits of this Architectural Design

  • Adds near real-time processing capabilities on top of batch-oriented processing.

  • Lowers the latency of actionable insights, positively impacting the agility of business decision-making.

  • Lowers storage costs significantly compared to traditional data warehousing technologies, because a low-cost cluster of commodity data and processing nodes can be built on-premises or in the cloud.

  • Is prepared to be migrated to the cloud and take advantage of elasticity, adding computing resources when the workload increases and releasing them when they are no longer needed.

  • Once migrated to the cloud, it can address the analytics needs of hundreds of thousands of customers with much lower effort; horizontal scalability is extremely high.

  • Businesses can consider growing their user base by providing subscription-based services.

BDB Data Pipeline Use Case for Education Industry

Big Data architectures make data pipelines reliable, scalable, and completely automated. Choosing an architecture and building the appropriate Big Data solution has never been as easy as it is with BDB.

Lambda Architecture

BDB follows the Lambda architecture here because it is fault-tolerant and linearly scalable. We chose it for the following reasons (a minimal serving-layer sketch is given after the list):

  • Data from different applications is fed to both the batch layer and the speed layer

  • The batch layer manages an append-only dataset and pre-computes the batch views

  • The batch views are indexed by the serving layer for low-latency, ad-hoc queries

  • The speed layer processes recent data and counterbalances the lower speed of the batch layer

  • Queries are answered by combining results from both batch views and real-time views
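
To make the last point concrete, here is a minimal serving-layer sketch in Spark: the pre-computed batch view and the speed layer's real-time view are merged so that a query covers all of the data. The view locations and the (event_type, cnt) schema are illustrative assumptions.

```scala
// Minimal sketch of the Lambda "serving" step: a query merges the pre-computed
// batch view with the most recent real-time view so results cover all data.
// View locations and schemas are illustrative assumptions.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object LambdaServingQuery {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("lambda-serving").getOrCreate()

    // Batch view: counts pre-computed by the batch layer over the append-only master dataset.
    val batchView = spark.read.parquet("hdfs:///views/batch/event_counts/")       // hypothetical
    // Real-time view: counts maintained by the speed layer for data not yet absorbed by a batch run.
    val realtimeView = spark.read.parquet("hdfs:///views/realtime/event_counts/") // hypothetical

    // Answer the query by combining both views (columns assumed: event_type, cnt).
    val combined = batchView.unionByName(realtimeView)
      .groupBy("event_type")
      .agg(sum("cnt").as("total"))

    combined.show()
  }
}
```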

Let us assume you have different source systems and want to bring their data into the Data Lake through streaming engines. Most of your systems belong to the same domain, but their data formats differ.

In this case, you want to transform all of that data into a unified, domain-based format and store it in the Data Lake. We can call this organized data storage in the Data Lake.

Once the data is in the Data Lake, you may want to run different computations on top of it and store the results for analytical or reporting purposes. Our system provides a computation framework for this. The computation framework reads data from the Data Lake and performs the necessary calculations (if needed), and the results can be stored in the analytics store or report store. From the analytics store, various tools and applications read the data to create reports and dashboards and to generate data exports. An API is also provided to read data directly from the Data Lake (if required), as sketched below.
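
As a rough illustration of reading from the Data Lake through the API, the sketch below issues a plain HTTP GET from Scala using the JDK HTTP client. The endpoint URL, authentication header, and response format are hypothetical placeholders, not the documented BDP API.

```scala
// Minimal sketch of consuming data directly from a Data Lake read API.
// The endpoint, authentication scheme, and response format are hypothetical.
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object DataLakeApiClient {
  def main(args: Array[String]): Unit = {
    val client = HttpClient.newHttpClient()

    val request = HttpRequest.newBuilder()
      .uri(URI.create("https://bdp.example.com/api/v1/datalake/attendance?limit=100")) // hypothetical endpoint
      .header("Authorization", "Bearer <token>")                                        // hypothetical auth header
      .GET()
      .build()

    // The response body (assumed JSON) can then be fed into a report or export job.
    val response = client.send(request, HttpResponse.BodyHandlers.ofString())
    println(s"status=${response.statusCode()} bytes=${response.body().length}")
  }
}
```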

All the modules are API-driven, so the underlying technology is hidden behind the API; this helps keep the system loosely coupled.

How can data from the different systems be sent to this platform?

Data can be sent to this platform through the following options:

  • BDP (BizViz Big Data Platform) provides different buckets for accepting data from the customer. The customer can export the data as CSV/Excel and drop those files into the FTP/SFTP/S3 buckets provided for this purpose. Our system monitors these buckets, reads the data, and broadcasts it to our streaming engine (a minimal producer sketch is given after this list).

    • In this case, the customer should follow certain naming conventions recommended by our system cookbook.

  • BDP provides an API for accepting data into the platform. The customer can write data to this API, which forwards it to the streaming engine.

    • In this case, the customer should follow certain rules recommended by our system cookbook.

  • BDP provides different Agents for fetching data directly from the customer database. If the customer allows, system agents can fetch the data from their database and broadcast it to our streaming engine.

    • In this case, the customer should follow certain rules recommended by the BDP cookbook.
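
A minimal sketch of the "broadcast to the streaming engine" step, assuming the streaming engine is Kafka: a record picked up from a monitored bucket is published to a topic using the plain Kafka producer API. The broker address, topic name, and record layout are illustrative assumptions.

```scala
// Minimal sketch of broadcasting an ingested record to the streaming engine
// (assumed to be Kafka). Broker, topic, and record layout are illustrative.
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object IngestBroadcaster {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "broker:9092") // hypothetical broker
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    try {
      // One CSV row picked up from the monitored FTP/SFTP/S3 bucket (assumed layout).
      val row = "STU-1001,2024-01-15,PRESENT"
      producer.send(new ProducerRecord[String, String]("attendance-ingest", "STU-1001", row)) // hypothetical topic
    } finally {
      producer.close()
    }
  }
}
```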

How are unification and data quality taken care of?

Data ingestion into the Data Lake happens once the customer data is available in the streaming engine. In BDP, we call this the data unification process.

Sometimes data arrives with references, and before moving it into the Data Lake we may need a lookup process to retrieve the remaining parts of the data and polish it. Sometimes extra columns need to be added to the data as part of this polishing. All of these steps happen in the unification engine.

Sometimes data arrives with a lookup reference that is not yet available in the Data Lake. In this scenario, the system cannot push the data into the Data Lake and keeps it in the raw area. A scheduled process later works on this raw data and moves it into the refined area.

Sometimes data arrives with missing information that is essential for the Data Lake. Such data is also kept in the raw area and moved to the refined area later, once the information becomes available.

All these processes are handled by the unification and Data Quality Engines. The BDP Data Quality Engine alerts the customer about data held in the raw area, which helps them correct the data in their source system. A minimal sketch of the unification split is given below.
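
The sketch below illustrates this unification and quality split, assuming Spark and Parquet storage: incoming attendance records are enriched by a lookup against the Student object; records whose reference resolves go to the refined area, and the rest stay in the raw area to be retried later. Paths and column names are illustrative assumptions.

```scala
// Minimal sketch of the unification/quality split: lookup enrichment, then routing
// to refined vs raw areas. Paths and column names are illustrative assumptions.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object UnificationJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("unification").getOrCreate()

    val attendance = spark.read.parquet("hdfs:///datalake/incoming/attendance/") // hypothetical
    val students   = spark.read.parquet("hdfs:///datalake/refined/students/")    // hypothetical

    // Lookup enrichment: pull Student Name and Gender for each attendance record.
    val enriched = attendance.join(students, Seq("student_id"), "left")

    // Records whose lookup resolved are refined; unresolved references stay raw
    // and are retried by a scheduled process once the Student record arrives.
    val refined = enriched.filter(col("student_name").isNotNull)
    val raw     = enriched.filter(col("student_name").isNull).select(attendance.columns.map(col): _*)

    refined.write.mode("append").parquet("hdfs:///datalake/refined/attendance/")
    raw.write.mode("append").parquet("hdfs:///datalake/raw/attendance/")

    spark.stop()
  }
}
```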

How does the Metadata Management Layer work?

All the above processes need certain rules and instructions. BDP is a framework that operates according to the configured rules and passes the corresponding instructions to the system.

For example, Student Attendance arrives in the streaming engine with the Student ID as a reference. In the Data Lake, the Attendance object requires the Student Name and Student Gender as standard fields; MDS provides the facility to configure a lookup against the Student object and fetch these two pieces of information. On the other hand, Attendance data may arrive without the Student ID, which is mandatory information for the Data Lake; a quality check is needed to make sure the Data Lake only holds polished data, and MDS lets you configure this as well.

These rules and instructions are called metadata. BDP has an API-driven Metadata Service Layer (MDS) that accepts the rules and instructions from the administrator, and the BDP framework operates according to them.

Metadata is the key to end-to-end data governance. An illustrative sketch of such a rule is given below.
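
Purely as an illustration (not the actual MDS schema), the sketch below models the Attendance rule from the example above in Scala: a lookup that enriches Student Name and Student Gender from the Student object, plus a mandatory-field quality check on Student ID. All type and field names are hypothetical.

```scala
// Illustrative model of a metadata rule for the Attendance object:
// a lookup rule plus a mandatory-field quality rule. Names are hypothetical.
case class LookupRule(sourceField: String, lookupObject: String, fetchFields: Seq[String])
case class QualityRule(mandatoryFields: Seq[String])
case class ObjectMetadata(objectName: String, lookups: Seq[LookupRule], quality: QualityRule)

object AttendanceMetadata {
  val rule: ObjectMetadata = ObjectMetadata(
    objectName = "Attendance",
    lookups = Seq(
      LookupRule(
        sourceField  = "student_id",
        lookupObject = "Student",
        fetchFields  = Seq("student_name", "student_gender")
      )
    ),
    quality = QualityRule(mandatoryFields = Seq("student_id"))
  )
}
```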

How are Data Lake Design and Implementation Planned?

Data Lake is a major part of our platform. BDP Data Lake is a storage repository that holds a vast amount of raw data in its native format until it is needed. While a hierarchical data warehouse stores data in files or folders, a data lake uses a flat architecture to store data. Each data element in a lake is assigned a unique identifier and tagged with a set of extended metadata tags. When a business question arises, the Data Lake can be queried for relevant data, and that smaller set of data can then be analyzed to help answer the question.

BDP Data Lake is often associated with Hadoop-oriented object storage. An organization's data is first loaded into the Hadoop platform, and the BDP compute engine is then applied to the data where it resides, on Hadoop cluster nodes built from commodity computers.

How do ETL and Compute layers (Data Pipeline) Work?

The compute engine follows the UBER architecture. We provide components such as SCALA, JAVA, TALEND, SHELL, BDB ETL, etc., which help end users build their own computation logic and integrate it with our system.

We use the SPARK ecosystem in BDP. SPARK provides distributed computing capability for BDP.

The Compute Engine can read data from the Data Lake through the API, perform all the required transformations and calculations, and store the result in any analytical or computing storage.

The framework is pluggable, allowing third-party engines to plug into it and complete the computation (see the sketch below). It can also be used for ETL.
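
To illustrate the pluggability idea, here is a hypothetical sketch of the kind of contract a third-party engine might implement: the pipeline hands it a DataFrame read from the Data Lake and stores whatever DataFrame it returns. The trait and method names are assumptions, not the actual BDP interface.

```scala
// Illustrative sketch of a pluggable compute contract: a third-party engine
// implements run(), invoked between the Data Lake read and the analytics-store
// write. The trait and method names are hypothetical.
import org.apache.spark.sql.{DataFrame, SparkSession}

trait ComputeComponent {
  /** Transform a Data Lake DataFrame into the result to be stored. */
  def run(spark: SparkSession, input: DataFrame): DataFrame
}

// Example third-party plug-in: a simple attendance aggregation per school.
class AttendanceSummary extends ComputeComponent {
  import org.apache.spark.sql.functions._
  override def run(spark: SparkSession, input: DataFrame): DataFrame =
    input.groupBy("school_id").agg(count("*").as("attendance_records"))
}
```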

All the rules and instructions needed by the Compute Engine are managed in the MDS layer. In short, BDP is a highly scalable and powerful system driven entirely through MDS.

We have a robust MDS UI, which provides the facility to create Compute (Data Pipeline) workflows and schedule them on top of the Data Lake. All the MDS governance can be managed through the MDS UI.

Monitoring

BDB provides strong ecosystem monitoring, error handling, and troubleshooting facilities with the help of frameworks such as Ambari.

Scalability, Benefits & ROI of the above Solution

  • The BDB system works as one integrated platform/ecosystem on the cloud rather than in silos like other vendors. No other product is needed alongside the BDB Platform; it comes with the Big Data Pipeline, ETL, Query Services, a Predictive Tool with Machine Learning, a Survey Platform, Self-Service BI, Advanced Visualization, and Dashboards.

  • Provided real-time attendance and assessment reporting to the entire base of 40 million active users – teachers, students, administrators, etc.

  • Substantial Reduction in Cost of Operation

    • Converted all Compliance Reporting into BizViz Architecture to reduce the Costs by ¼

    • Saved two years of time and millions of dollars of effort in creating an Analytics Platform.

  • Market Opportunity

    • Instead of reselling other vendors' licensing packages, customers can sell their own subscription services and licenses – a market opportunity of 10x the investment.

  • Additional Analytics Services revenue with BDB as the Implementation team.

    • The solution can be extended to another 60 million passive users, and Data Lake services or Data-as-a-Service can be offered to other school users. The entire solution can be extended with Big Data analytics and predictive and prescriptive analytics, such as student dropout and teacher attrition models, by capturing critical student- and teacher-related parameters that are built into the system. This alone gives the customer another 20x opportunity.

Mapping BDB into the BI & Analytics Ecosystem

The diagram below is a self-explanatory view of where BDB stands in the BI and analytics ecosystem. Other products are shown as a reference to place BDB in the right perspective. Our pre-sales team, which has studied multiple BI products over time, created the diagram. It depicts that BDB provides all elements of Big Data within a single installation.

The following image shows our current thinking and feature-based maturity vis-à-vis global names. We would like to reiterate that we compete against the best names in the industry, and our closure rate after the POC round is 100%.
