Real-time processing deals with streams of data that are captured in real-time and processed with minimal latency. These processes run continuously and stay live even if the incoming data has stopped.
Batch job orchestration runs the process based on a trigger. In the BDB Data Pipeline, this trigger is the input event. Whenever data is pushed to the input event, the job kicks off. After the job completes, the process is gracefully terminated. This approach can be near real-time, and it allows you to utilise compute resources effectively.
Assembling a data pipeline is very simple. Just click and drag the component you want to use into the editor canvas, then connect the component output to an event/topic.
Check out the given walk-through to understand the concept of Low Code Visual Authoring.
A wide variety of out-of-the-box components are available to read, write, transform, and ingest data into the BDB Data Pipeline from many different data sources.
Components can be easily configured just by specifying the required metadata.
For extensibility, we have provided Python-based scripting support that allows the pipeline developer to implement complex business requirements which cannot be met by the out-of-the-box components.
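For illustration, the kind of Python logic a scripting component might carry is sketched below; the entry-point name, the record format (a list of dictionaries), and the derived field are assumptions made for this sketch, not the component's actual contract.

```python
# Illustrative only: a hypothetical transformation for a scripting component.
# The record format (list of dictionaries) and the entry-point name are assumptions.
from typing import Dict, List


def transform(records: List[Dict]) -> List[Dict]:
    """Drop incomplete rows and add a derived field to each remaining record."""
    cleaned = []
    for row in records:
        if row.get("amount") is None:                # skip rows missing a mandatory field
            continue
        row["amount_usd"] = round(float(row["amount"]) * 0.012, 2)  # example derivation
        cleaned.append(row)
    return cleaned


if __name__ == "__main__":
    sample = [{"amount": "100"}, {"amount": None}]
    print(transform(sample))                         # -> [{'amount': '100', 'amount_usd': 1.2}]
```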
Data pipelines are used to ingest and transfer data from different sources, and to transform, unify, and cleanse it so that it is suitable for analytics and business reporting.
“It is a collection of procedures that are carried out either sequentially or concurrently when transporting data from one or more sources to a destination. Filtering, enriching, cleansing, aggregating, and even making inferences using AI/ML models may be part of these pipelines.”
Data pipelines are the backbone of the modern enterprise. Pipelines move, transform, and store data so that the enterprise can make decisions without delay. Some of these decisions are automated via AI/ML models in real-time.
Readers: Components that read data from your data repositories, which could be a database, a file, or a SaaS application. Read Readers.
Connecting Components: The events that pull or receive data from your source components. These Kafka-based messaging channels help to create a data flow. Read Connecting Components.
Writers: The databases or data warehouses to which the Pipelines load the data. Read Writers.
Transforms: The series of transformation components that help to cleanse, enrich, and prepare data for smooth analytics. Read Transformations.
Producers: Producers are the components that can be used to produce/generate streaming data to external sources. Read Producers.
Machine Learning: The Model Runner components allow users to consume models created in the Python workspace of the Data Science Workbench, or models saved from the Data Science Lab, within a pipeline. Read Machine Learning.
Consumers: These are the real-time/streaming components that ingest data or monitor data objects for changes from different sources into the pipeline. Read Consumers.
Alerts: These components facilitate user notification on various channels like Teams, Slack, and email based on their preferences. Notifications can be delivered for success, failure, or other relevant events, depending on the user's requirement. Read Alerts.
Scripting: The Scripting components allow users to write custom scripts and integrate them into the pipeline as needed. Read Scripting.
Scheduler: The Scheduler component enables users to schedule their pipeline at a specific time according to their requirements.
Each component is a fully decoupled microservice that interacts with events.
Every component in a pipeline has built-in consumer and producer functionality. This allows the component to consume data from an event, process it, and send the output to another Event/Topic.
Each component has an in-event and an out-event. The component consumes data from the in-event/topic; the data is then processed and pushed to the out-event/topic.
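Conceptually, this consume-process-produce loop resembles the following sketch written with the kafka-python client; the broker address, topic names, and processing step are placeholders, not the platform's internal implementation.

```python
# Conceptual sketch of a component's consume -> process -> produce loop
# using kafka-python. Broker address and topic names are placeholders.
import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "in_event_topic",                               # the component's in-event
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for message in consumer:
    record = message.value                          # data consumed from the in-event
    record["processed"] = True                      # stand-in for the component's logic
    producer.send("out_event_topic", record)        # output pushed to the out-event
```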
The Kubernetes Cluster Autoscaler scales the Nodes in and out based on CPU and memory load.
Nodes are virtual/physical machines.
Each microservice (multiple services combined as a container) is deployed as a Pod. Each Pod is replicated across multiple Nodes for resilience.
Pods are enabled with autoscaling based on CPU and memory parameters.
All Pods are configured to have two instances, each deployed on a different Node, using the node affinity parameter.
Pods are configured with self-healing, which means that when there is a failure, a new Pod is spun up.
All the available Reader Task components are included in this section.
Readers are a group of tasks that can read data from different databases and cloud storages. In Jobs, all the tasks run in real-time.
There are eight (8) Reader tasks in Jobs. All the Reader tasks contain the following tabs:
Meta Information: Configure the meta information in the same way as for the pipeline components.
Preview Data: Only ten (10) random records can be previewed in this tab, and only when the task is running in Development mode.
Preview Schema: The Spark schema of the data being read is shown in this tab.
Logs: Logs of the task are displayed here.
HDFS stands for Hadoop Distributed File System. It is a distributed file system designed to store and manage large data sets in a reliable, fault-tolerant, and scalable way. HDFS is a core component of the Apache Hadoop ecosystem and is used by many big data applications.
This task reads the file located in HDFS (Hadoop Distributed File System).
Drag the HDFS reader task to the Workspace and click on it to open its configuration tabs. The Meta Information tab opens by default.
Host IP Address: Enter the host IP address for HDFS.
Port: Enter the Port.
Zone: Enter the Zone for HDFS. Zone is a special directory whose contents will be transparently encrypted upon write and transparently decrypted upon read.
File Type: Select the File Type from the drop down. The supported file types are:
CSV: The Header and Infer Schema fields get displayed with CSV as the selected File Type. Enable the Header option to read the header of the file, and enable the Infer Schema option to infer the true schema of the columns in the CSV file.
JSON: The Multiline and Charset fields get displayed with JSON as the selected File Type. Check the Multiline option if there is any multiline string in the file.
PARQUET: No extra field gets displayed with PARQUET as the selected File Type.
AVRO: This File Type provides two drop-down menus.
Compression: Select an option out of the Deflate and Snappy options.
Compression Level: This field appears for the Deflate compression option. It provides 0 to 9 levels via a drop-down menu.
XML: Select this option to read an XML file. If this option is selected, the following fields will be displayed:
Infer schema: Enable this option to infer the true schema of the columns.
Path: Provide the path of the file.
Root Tag: Provide the root tag from the XML files.
Row Tags: Provide the row tags from the XML files.
Join Row Tags: Enable this option to join multiple row tags.
Path: Provide the path of the file.
Partition Columns: Provide a unique Key column name to partition data in Spark.
Please Note: Please click the Save Task In Storage icon to save the configuration for the dragged reader task.
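For reference, the Meta Information fields above map roughly onto a Spark read of the following kind; the host, port, and file path are placeholders, and the options the platform sets internally may differ.

```python
# Rough PySpark equivalent of the HDFS reader settings described above.
# Host, port, and path are placeholders; the platform's internal options may differ.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-reader-sketch").getOrCreate()

df = (
    spark.read
    .option("header", True)         # corresponds to the Header option for CSV
    .option("inferSchema", True)    # corresponds to the Infer Schema option
    .csv("hdfs://10.0.0.1:9000/data/input/sales.csv")
)
df.printSchema()                    # similar to the Preview Schema tab
df.show(10)                         # similar to the ten-record data preview
```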
The Homepage is the default landing page of the Data Pipeline module. It opens when you select the Data Pipeline module from the Apps menu.
Navigate to the Platform homepage.
Click on the Apps menu to open it.
Click on the Data Pipeline module.
You get redirected to the Homepage of the Data Pipeline module.
The Homepage of the Data Pipeline module offers a left-side menu panel with various options to get started with the Data Pipeline module.
The options provided on the left-side menu panel of the Data Pipeline Homepage are described below:
Home
Redirects to the Pipeline Homepage.
Displays options to create a new Pipeline or Job.
Redirects to the page listing all the Jobs accessible to the user.
Redirects to the page listing all the Pipelines accessible to the user.
Redirects to the Scheduler List page.
Redirects to the page containing the information on Data Channel & Cluster Events.
Redirects to the Trash page.
Redirects to the Settings page.
The Recently Visited section provides users with instant access to their most recent pipelines or jobs, displaying the last ten interactions. By eliminating the need to search through extensive menus or folders, this feature significantly improves workflow efficiency. Users can quickly resume work where they left off, ensuring a more seamless and intuitive experience on the pipeline homepage. It’s especially valuable for users managing multiple pipelines, as it reduces repetitive tasks, shortens navigation time, and increases overall productivity.
Navigate to the Data Pipeline Homepage.
Locate the Recently Visited section on the Homepage.
You may review the list of recent items to find the one you wish to access.
Select a Pipeline or Job from the list.
Click on the View icon of the desired pipeline or job.
You will be redirected to the Workflow Editor page of the selected Pipeline or Job.
Create a new Pipeline or Job based on the selected create option.
The Pipeline Homepage offers the Create option to initiate Pipeline or Job creation based on the user's need.
Navigate to the Pipeline List page.
Click the Create icon.
The Create Pipeline or Job user interface opens, prompting you to create a new Pipeline or Job.
Please Note: Based on your selection, the next page opens. Refer to the Create a New Pipeline and Create a Job sections for details.
Data Pipeline provides extensibility to create your own transformation logic via the DS Lab module, which acts as an innovation lab for data engineers and data scientists. They can conduct modeling experiments there before productionizing the component using the Data Pipeline.
Please Note: The current version supports Python-based scripting.
BDB Data Pipeline allows you to operationalize your AI/ML models in a few minutes. The models can be attached to any pipeline to get inferences in real-time. The inferences can then either be used in another process or shared with the user instantly.
Data Science Lab Model & Script Runner
Traditional data transformation is a sequential process in which developers design and develop the logic, then test and deploy it. BDB Data Pipeline allows the user to adopt an agile, non-linear approach that reduces time to market by 50 to 60%.
Distributed computing is the process of connecting multiple computers via a local network or wide area network so that they can act together as a single ultra-powerful computer capable of performing computations that no single computer within the network would be able to perform on its own. Distributed computing offers two key advantages:
Easy scalability: Just add more computers to expand the system.
Redundancy: Since many different machines are providing the same service, that service can keep running even if one (or more) of the computers goes down.
The user can run multiple instances of the same process to increase the process throughput. This can be done using the auto scaling feature.
This section provides the steps involved in creating a new Pipeline flow.
Check out the below-given illustration on how to create a new pipeline.
Navigate to the Create Pipeline or Job interface.
Click the Create option provided for Pipeline.
The New Pipeline window opens asking for the basic information.
Enter a name for the new Pipeline.
Describe the Pipeline (Optional).
Select a resource allocation option using the radio button. The given choices are:
Low
Medium
High
Please Note: This feature is used to deploy the pipeline with high, medium, or low-end configurations according to the velocity and volume of data that the pipeline must handle. All the components saved in the pipeline are then allocated resources based on the selected Resource Allocation option depending on the component type (Spark and Docker).
Click the Save option to create the pipeline. By clicking the Save option, the user gets redirected to the pipeline workflow editor.
A success message appears to confirm the creation of a new pipeline.
The Pipeline Editor page opens for the newly created pipeline.
Resource allocation can be changed anytime by clicking the Edit icon near the pipeline name at the top left.
Adding components to a pipeline workflow.
Check out the below-given walk-through to add components to the Pipeline Workflow editor canvas.
The Component Palette is situated on the left side of the User Interface on the Pipeline Workflow. It has the System and Custom components tabs listing the various components.
The System components are displayed in the below-given image:
Once the Pipeline gets saved in the pipeline list, the user can add components to the canvas. The user can drag the required components to the canvas and configure them to create a Pipeline workflow or Data flow.
Navigate to the existing data pipeline from the List Pipelines page.
Click the View icon for the pipeline.
The Pipeline Editor opens for the selected pipeline.
Drag and drop the new required components or make changes in the existing component’s meta information or change the component configuration (E.g., the DB Reader is dragged to the workspace in the below-given image).
Once dragged and dropped to the pipeline workspace, components can be directly connected to the nearest Kafka Event. To enable the auto-connect feature, the user needs to ensure that the Auto connect event on drag option is enabled, which is the default setting.
Please refer to the Event page for more details on creating Kafka Events and connecting them to components using various methods.
Click on the dragged component and configure the Basic Information tab, which opens by default.
Open the Meta Information tab, which is next to the Basic Information tab, and configure it.
Make sure to click the Save Component in Storage icon to update the component details and the pipeline so that the recent changes are reflected in the pipeline. (The user can drag and configure other required components to create the Pipeline Workflow.)
Click the Update Pipeline icon to save the changes.
A success message appears to assure that the pipeline has been successfully updated.
Click the Activate Pipeline icon to activate the pipeline (It appears only after the newly created pipeline gets successfully updated).
A dialog window opens to confirm the action of pipeline activation.
Click the YES option to activate the pipeline.
A success message appears confirming the activation of the pipeline.
Another success message appears to confirm that the pipeline has been updated.
The Status for the pipeline gets changed on the Pipeline List page.
Please Note:
Click the Delete icon from the Pipeline Editor page to delete the selected pipeline. The deleted Pipeline gets removed from the Pipeline list.
Refer to the Component Panel section for detailed information on each Pipeline Component.
An event-driven architecture uses events to trigger and communicate between decoupled services and is common in modern applications built with microservices.
The connecting components help to assemble various pipeline components and create a Pipeline Workflow. Just click and drag the component you want to use into the editor canvas. Connect the component output to a Kafka Event.
Once a Pipeline is created, the User Interface of the Data Pipeline provides a canvas for the user to build the data flow (Pipeline Workflow). The Pipeline assembling process can be divided into two parts, as mentioned below:
Adding Components to the Canvas
Adding Connecting Components (Events) to create the Data flow/ Pipeline workflow
Each component inside a pipeline is fully decoupled. Each component acts as a producer and consumer of data. The design is based on event-driven process orchestration. For passing the output of one component to another component, an intermediary event is needed. An event-driven architecture contains three items:
Event Producer [Components]
Event Stream [Event (Kafka topic / DB Sync)]
Event Consumer [Components]
A Kafka event refers to a single piece of data or message that is exchanged between producers and consumers in a Kafka messaging system. Kafka events are also known as records or messages. They typically consist of a key, a value, and metadata such as the topic and partition. Producers publish events to Kafka topics, and consumers subscribe to these topics to consume events. Kafka events are often used for real-time data streaming, messaging, and event-driven architectures.
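As a quick sketch of what such an event looks like to a client, the snippet below reads records from a hypothetical topic with the kafka-python library and prints the key, value, and metadata; the broker address and topic name are placeholders, and a JSON-encoded payload is assumed.

```python
# Sketch of a single Kafka event (record): a key, a value, and metadata such as
# topic, partition, and offset. Broker address and topic name are placeholders.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "orders_event",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
)
for msg in consumer:
    print(msg.topic, msg.partition, msg.offset)   # metadata of the event
    print(msg.key)                                # optional key (bytes or None)
    print(json.loads(msg.value))                  # the event payload (assumed JSON)
    break
```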
Check out the given illustration to understand how to create a Kafka Event.
Click on the Event Panel option in the pipeline toolbar and click the Add New Event icon.
The New Event window opens.
Provide a name for the new Event.
Select Event Duration: Select an option from the below-given options.
Short (4 hours)
Medium (8 hours)
Half Day (12 hours)
Full Day (24 hours)
Long (48 hours)
Week (168 hours)
Please Note: The event data gets erased after 7 days if no duration option is selected from the available options. The Offsets expire as well.
No. of Partitions: Enter a value between 1 and 100. The default number of partitions is 3.
No. of Outputs: Define the number of outputs using this field.
Is Failover: Enable this option to create the event as the Failover Event. If a Failover Event is created, it must be mapped with a component to retrieve failed data from that component.
Click the "Add Event" option.
The Event will be created successfully, and the newly created Event is added to the Kafka Events tab in the Events panel.
Once the Kafka Event is created, the user can drag it to the pipeline workspace and connect it to any component.
On hovering over the event in the pipeline workspace, the user can see the following information for that event.
Event Name
Duration
Number of Partitions
The user can edit the following information of the Kafka Event after dragging it to the pipeline workspace:
Display Name
No. of Outputs
Is Failover
A Failover Event is designed to capture data that a component in the pipeline fails to process. In cases where a connected event's data cannot be processed successfully by a component, the failed data is sent to the Failover Event.
Follow these steps to map a Failover Event with the component in the pipeline:
Create a Failover Event following the provided steps.
Drag the Failover Event to the pipeline workspace.
Navigate to the Basic Information tab of the desired component where the Failover Event should be mapped.
From the drop-down, select the Failover Event.
Save the component configuration.
The Failover Event is now successfully mapped. If the component encounters processing failures with data from its preceding event, the failed data will be directed to the Failover Event.
The Failover Event holds the following keys along with the failed data:
Cause: Cause of failure.
eventTime: The date and time at which the data failed.
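For illustration, a record landing in a Failover Event could be pictured as below; only the Cause and eventTime keys come from the description above, while the field holding the failed payload is an assumption for this sketch.

```python
# Illustrative shape of a Failover Event record. Only "Cause" and "eventTime"
# are documented keys; the "data" wrapper shown here is an assumption.
failed_record = {
    "Cause": "NumberFormatException: for input string 'abc'",  # cause of failure
    "eventTime": "2024-05-14 10:32:07",                         # when the data failed
    "data": {"order_id": 1021, "amount": "abc"},                # the failed payload
}
print(failed_record["Cause"])
```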
When hovering over the Failover Event, the associated component in the pipeline will be highlighted. Refer to the image below for visual reference.
Please see the below-given video on how to map a Failover Event.
This feature automatically connects the Kafka/Data Sync Event to the component when dragged from the events panel. In order to use this feature, users need to ensure that the Auto connect components on drag option is enabled from the Events panel.
Please see the below-given illustrations on auto connecting a Kafka and Data Sync Event to a pipeline component.
This feature allows the user to directly connect a Kafka/Data Sync Event to a component by right-clicking on the component in the pipeline.
Follow these steps to directly connect a Kafka/Data Sync event to a component in the pipeline:
Right-click on the component in the pipeline.
Select the Add Kafka Event or Add Sync Event option.
The Create Kafka Event or Create Sync Event dialog box will open.
Enter the required details.
Click on the Add Kafka Event or Create Sync Event option.
The newly created Kafka/Data Sync Event will be directly connected to the component.
Please see the below-given illustrations on how to add a Kafka and Data Sync Event to a pipeline workflow.
This feature enables users to map a Kafka Event with another Kafka event. In this scenario, the mapped event will have the same data as its source event.
Please see the below-given illustrations on how to create mapped Kafka Event.
Follow these steps to create a mapped event in the pipeline:
Choose the Kafka event from the pipeline as the source event for which mapped events need to be created.
Open the events panel from the pipeline toolbar and select the "Add New Event" option.
In the Create Kafka Event dialog box, enable the option at the top to enable mapping for this event.
Enter the Source Event Name in the Event name and click on the search icon to select the name of the source event from the suggestions.
The Event Duration and No. of Partitions will be automatically filled in the same as the source event. Users can modify the No. of Outputs between 1 to 10 for the mapped event.
Click on the Map Kafka Event option to create the Mapped Event.
Please Note: The data of the Mapped Event itself cannot be directly flushed. To clear the data of a Mapped Event, the user needs to flush the Source Event to which it is mapped.
Please go through the below-given steps to check the meta information and download the data from the Kafka topic once the data has been sent to the Kafka event by the producer. The user can download the data from the Kafka event in CSV, Excel, and JSON format. The user will find the following tabs in the Kafka topic:
This tab displays information such as Display Name, Event Name, No. of Partitions, No. of Outputs, Event Duration, etc.
Navigate to the meta information tab to view details such as the Number Of Records in the Kafka topic, Approximate Data Size, Number of Partitions, Approximate Partition Size (in MB), Start and End Offset of partition, and Total Records in each partition.
Navigate to the Preview tab for a pipeline workflow using an Event component. Preferably select a pipeline workflow that has been activated to get data.
It will display the Data Preview.
Please Note:
Users can preview, download, and copy up to 100 data entries.
Click the Download icon to download the data in CSV, JSON, or Excel format.
Click on the Copy option to copy the data as a list of dictionaries.
Data Type icons are provided in each column header to indicate the data type. The icons help users quickly understand the data type contained in each column. The following icons are added:
Users can drag and drop column separators to adjust column widths, allowing for a personalized view of the data.
This feature helps accommodate various data lengths and user preferences.
The event data is displayed in a structured table format.
The table supports sorting and filtering to enhance data usability. The previewed data can be filtered based on the Latest, Beginning, and Timestamp options.
The Timestamp filter option redirects the user to select a timestamp from the Time Range window. The user can either select a start and end date or choose from the available time ranges to apply and get a data preview.
Check out the illustration on sorting and filtering the Event Data Preview for a Pipeline workflow.
This tab holds the Spark schema of the data. Users can download the Spark schema of the data by clicking on the download option.
Kafka Events can be flushed to delete all present records. Flushing an Event retains the offsets of the topic by setting the start-offset value to the end-offset. Events can be flushed by using the Flush Event button beside the respective Event in the Event panel, and all Events can be flushed at once by using the Flush All button. This button is also present at the top of the Event panel.
Check out the given illustration on how to Auto connect a Kafka Event to a pipeline component.
Drag any component to the pipeline workspace.
Configure all the parameters/fields of the dragged component.
Drag the Event from the Events Panel. Once the Kafka Event is dragged from the Events panel, it will automatically connect with the nearest component in the pipeline workspace. Please find the below-given walk-through for reference.
Click the Update Pipeline icon to save the pipeline workflow.
The Data Sync Event in the Data Pipeline module is used to write the required data directly to any of the supported databases without using a Kafka Event and writer components in the pipeline. Please refer to the below image for reference:
It can be seen in the above image that the Data Sync Event directly writes the data read from the MongoDB reader component to the table of the database configured in the Data Sync, without using a Kafka Event in between.
It does not need a Kafka Event to read the data. It can be connected with any component to read the data, and it writes the data to the tables of the respective databases.
Pipeline complexity is reduced because a Kafka Event and a writer are not needed in the pipeline.
Since writers are not used, resource consumption is low.
Once Data Sync is configured, multiple Data Sync Events can be created for the same configuration, and the data can be written to multiple tables.
DB Sync Event enables direct write to the DB that helps in reducing the usage of additional compute resources like Writers in the Pipeline Workflow.
Please Note: The supported drivers for the Data Sync component are as listed below:
ClickHouse
MongoDB
MSSQL
MySQL
Oracle
PostgreSQL
Snowflake
Redshift
Check out the given video on how to create a Data Sync component and connect it with a Pipeline component.
Navigate to the Pipeline Editor page.
Click on the DB Sync tab.
Click on the Add New Data Sync (+) icon from the Toggle Event Panel.
The Create Data Sync window opens.
Provide a display name for the new Data Sync.
Select the Driver
Put a checkmark in the Is Failover option to create a failover Data Sync. In this case, it is not enabled.
Click the Save option.
Please Note:
The Data Sync component gets created as Failover Data Sync, if the Is Failover option is enabled while creating a Data Sync.
Drag and drop Data Sync Event to the workflow editor.
Click on the dragged Data Sync component.
The Basic Information tab appears with the following fields:
Display Name: Display name of the Data Sync
Event Name: Event name of the Data Sync
Table name: Specify table name.
Driver: This field will be pre-selected.
Save Mode: Select save mode from the drop-down: Append or Upsert.
Composite Key: This field is optional. This field will only appear when upsert is selected as the Save Mode.
Click on the Save Data Sync icon to save the Data Sync information.
Once the Data Sync Event is dragged from the Events panel, it will automatically connect with the nearest component in the pipeline workflow.
Update and activate the pipeline.
Open the Logs tab to view whether the data gets written to a specified table.
Please Note:
In the Save Mode field, there are two available options:
Append
Upsert: One extra field, Composite Key, will be displayed for the Upsert save mode.
When the SQL component is set to Aggregate Query mode and connected to Data Sync, the data resulting from the query will not be written to the Data Sync Event. Please refer to the following image for a visual representation of the flow and avoid using such a scenario.
The Events Panel appears on the right side of the Pipeline Workflow Editor page.
Click on the Data Sync tab.
Click on the Add New Data Sync (+) icon from the Toggle Event Panel.
The Create Data Sync dialog box opens.
Provide the Display Name and
Enable the Is Failover option.
Click the Save option.
The failover Data Sync will be created successfully.
Drag & drop the failover data sync to the pipeline workflow editor.
Click on the failover Data Sync and fill in the following field:
Table name: Enter the table name where the failed data has to be written.
Primary key: Enter the column name to be made as the primary key in the table. This field is optional.
Save mode: Select a save mode from the given choices. Select either Append or Upsert.
The failover Data Sync Event gets configured, and now the user can map it with any component in the pipeline workflow.
The image displays that the failover Data Sync Event is mapped with the SQL Component.
If the component fails, it will write all the data to the given table of the configured DB in the Failover Data Sync.
Please Note: There is no information available on the UI to indicate that the failover Data Sync Event has been configured when hovering over the failover Data Sync Event.
This section focuses on the Configuration tab provided for any Pipeline component.
For each component that gets deployed, we have an option to configure the resources i.e., Memory and CPU.
We have two deployment types:
Docker
Spark
Go through the given illustration to understand how to configure a component using the Docker deployment type.
After we save the component and the pipeline, the component gets saved with the default configuration of the pipeline, i.e., Low, Medium, or High. After we save the pipeline, we can see the Configuration tab in the component. It contains the following settings.
For the Docker components, we have the Request and Limit configurations.
We can see the CPU and Memory options to be configured.
CPU: This is the CPU configuration where we can specify the number of cores that we need to assign to the component.
Please Note: 1000 means 1 core in the configuration of Docker components. When we put 100, that means 0.1 core has been assigned to the component.
Memory: This option is to specify how much memory you want to dedicate to that specific component.
Please Note: 1024 means 1 GB in the configuration of the Docker components.
Instances: The number of instances is used for parallel processing. If we give N instances, that many pods will get deployed.
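To make the units concrete, the sketch below expresses a hypothetical Request/Limit configuration as plain values; this is not the product's configuration format, only an illustration of the millicore and MiB conventions described above.

```python
# Hypothetical illustration of Docker component resources (not the product's
# actual configuration format). CPU is in millicores, memory in MiB.
docker_component_resources = {
    "request": {"cpu": 100,  "memory": 512},   # 0.1 core, 0.5 GB requested
    "limit":   {"cpu": 1000, "memory": 1024},  # capped at 1 core, 1 GB
    "instances": 2,                            # two pods for parallel processing
}

cores = docker_component_resources["limit"]["cpu"] / 1000
print(f"CPU limit = {cores} core(s)")          # -> CPU limit = 1.0 core(s)
```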
Go through the below given walk-through to understand the steps to configure a component with Spark configuration type.
The Spark components' configuration is slightly different from the Docker components. When the Spark components are deployed, two pods come up:
Driver
Executor
Provide the Driver and Executor configurations separately.
Instances: The number of instances is used for parallel processing. If we give N instances in the Executor configuration, that many executor pods will get deployed.
Please Note: As of the current release, the minimum requirement is 0.1 core to deploy a driver and 1 core for an executor. This may change with upcoming versions of Spark.
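Similarly, a hypothetical Spark component configuration could be pictured as below, with the Driver and Executor resourced separately; the numbers and structure are illustrative only.

```python
# Hypothetical illustration of a Spark component's resources (not the product's
# actual configuration format). Driver and executor are configured separately.
spark_component_resources = {
    "driver":   {"cpu": 100,  "memory": 1024, "instances": 1},  # 0.1 core minimum
    "executor": {"cpu": 1000, "memory": 2048, "instances": 3},  # 1 core minimum each
}

total_pods = (spark_component_resources["driver"]["instances"]
              + spark_component_resources["executor"]["instances"])
print(total_pods)   # -> 4 pods: one driver plus three executors
```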
This section provides detailed information on the Jobs to make your data process faster.
Jobs are used for ingesting and transferring data from separate sources. The user can transform, unify, and cleanse data to make it suitable for analytics and business reporting without using a Kafka topic, which makes the entire flow much faster.
Check out the given demonstration to understand how to create and activate a job.
Navigate to the Data Pipeline homepage.
Click on the Create icon.
Navigate to the Create Pipeline or Job interface.
The New Job dialog box appears, prompting the user to create a new Job.
Enter a name for the new Job.
Describe the Job (Optional).
Job Baseinfo: In this field, there are three options:
Trigger By: There are two options for triggering a job based on the success or failure of another job:
Success Job: On successful execution of the selected job, the current job will be triggered.
Failure Job: On failure of the selected job, the current job will be triggered.
Is Scheduled?
A job can be scheduled for a particular timestamp; the job will be triggered at the same timestamp every time.
Jobs must be scheduled according to UTC.
Concurrency Policy: Concurrency policy schedulers are responsible for managing the execution and scheduling of concurrent tasks or threads in a system. They determine how resources are allocated and utilized among the competing tasks. Different scheduling policies exist to control the order, priority, and allocation of resources for concurrent tasks.
Please Note:
Concurrency Policy will appear only when "Is Scheduled" is enabled.
If the job is scheduled, then the user has to activate it for the first time. Afterward, the job will automatically be activated each day at the scheduled time.
There are three Concurrency Policies available:
Allow: If a job is scheduled for a specific time and the first process is not completed before the next scheduled time, the next task will run in parallel with the previous tasks.
Forbid: If a job is scheduled for a specific time and the first process is not completed before the next scheduled time, the next task will wait until all the previous tasks are completed.
Replace: If a job is scheduled for a specific time and the first process is not completed before the next scheduled time, the previous task will be terminated and the new task will start processing.
Spark Configuration
Select a resource allocation option using the radio button. The given choices are:
Low
Medium
High
This feature is used to deploy the Job with high, medium, or low-end configurations according to the velocity and volume of data that the Job must handle.
Also, provide the resources for the Driver and Executor according to the requirement.
Alert: There are 2 options for sending an alert:
Success: On successful execution of the configured job, the alert will be sent to the selected channel.
Failure: On failure of the configured job, the alert will be sent to the selected channel.
Click the Save option to create the job.
A success message appears to confirm the creation of a new job.
The Job Editor page opens for the newly created job.
Please Note:
The Trigger By feature will not work if the selected Trigger By job is running in Development mode. It will only work when the selected Trigger By Job is activated.
By clicking the Save option, the user gets redirected to the job workflow editor.
This section provides details about the various categories of the task components which can be used in the Spark Job.
There are three categories of task components available:
The Job Editor Page provides the user with all the necessary options and components to add a task and eventually create a Job workflow.
Once the Job gets saved in the Job list, the user can add a Task to the canvas. The user can drag the required tasks to the canvas and configure them to create a Job workflow or dataflow.
The Job Editor appears, displaying the Task Palette containing the various components mentioned as Tasks.
Navigate to the Job List page.
It lists all the Jobs and displays the Job type (Spark Job or PySpark Job) in the Type column.
Select a Job from the displayed list.
Click the View icon for the Job.
Please Note: Generally, the user performs this step in continuation of the Job creation, but if the user has exited the Job Editor, the above steps help to access it again.
The Job Editor opens for the selected Job.
Drag and drop the new required task, make changes in the existing task’s meta information, or change the task configuration as per the requirement. (E.g., the DB Reader is dragged to the workspace in the below-given image):
Click on the dragged task icon.
The task-specific fields open, asking for the meta information of the dragged task.
Open the Meta Information tab and configure the required information for the dragged component.
Click the given icon to validate the connection.
Click the Save Task in Storage icon.
A notification message appears.
A dialog window opens to confirm the action of job activation.
Click the YES option to activate the job.
A success message appears confirming the activation of the job.
Once the job is activated, the user can see their job details while running the job by clicking on the View icon; the edit option for the job will be replaced by the View icon when the job is activated.
Please Note:
If the job is not running in Development mode, there will be no data in the preview tab of tasks.
The Status of the Job changes on the Job List page when it is running in Development mode or is activated.
Users can get a sample of the task data under the Preview Data tab provided for the tasks in the Job Workflows.
Navigate to the Job Editor page for a selected job.
Open a task from where you want to preview records.
Click the Preview Data tab to view the content.
Please Note:
Users can preview, download, and copy up to 10 data entries.
Click the Download icon to download the data in CSV, JSON, or Excel format.
Click on the Copy option to copy the data as a list of dictionaries.
Users can drag and drop column separators to adjust column widths, allowing for a personalized view of the data.
This feature helps accommodate various data lengths and user preferences.
The event data is displayed in a structured table format.
The table supports sorting and filtering to enhance data usability. The previewed data can be filtered based on the Latest, Beginning, and Timestamp options.
The Timestamp filter option redirects the user to select a timestamp from the Time Range window. The user can select a start and end date or choose from the available time ranges to apply and get a data preview.
The Toggle Log Panel displays the Logs and Advanced Logs tabs for the Job Workflows.
Navigate to the Job Editor page.
Click the Toggle Log Panel icon on the header.
A panel toggles open, displaying the collective logs of the Job under the Logs tab.
Select the Job Status tab to display the pod status of the complete Job.
Please Note: Any orphan task placed in the Job Editor workspace (a task that is not in use) will cause the entire Job to fail. In the image below, the highlighted DB Writer task is an orphan task; if the Job is activated, the Job will fail because the orphan DB Writer task is not getting any input. Please avoid using orphan tasks inside the Job Editor workspace.
Navigate to the Job Editor page.
The Job Version update button will display a red dot indicating that new updates are available for the selected job.
The Confirm dialog box appears.
Click the YES option.
After the job is upgraded, the Upgrade Job Version button gets disabled. It will display that the job is up to date and no updates are available.
Navigate to the page.
Pre-requisite: Before Creating the Data Sync Event, the user has to configure the Data Sync section under the page.
Click the Toggle Event Panel icon from the header.
The Events Panel appears, and the Toggle Event Panel icon gets changed as, suggesting that the event panel is displayed.
Only the configured drivers from the page get listed under the Create Data Sync wizard.
Click the Event Panel icon from the Pipeline Editor page.
Please go through the given link to configure the Alerts in Job:
Click the Activate Job icon to activate the job (It appears only after the newly created job gets successfully updated).
Jobs can be run in Development mode as well. The user can preview only 10 records in the Preview tab of a task if the job is running in Development mode. If any writer task is used in the job, it will write only 10 records to the table of the given database.
Please Note: Click the Delete icon from the Job Editor page to delete the selected job. The deleted job gets removed from the Job list.
A notification message appears stating that the version is upgraded.
String
Number
Date
DateTime
Float
Boolean
Job Version details
Displays the latest versions for the Jobs upgrade.
Displays Jobs logs and Job Status tab under Log panel.
Redirects to the Job Monitoring page
Development Mode
Runs the job in development mode.
Activate Job
Activates the current Job.
Update Job
Updates the current Job.
Edit Job
To edit the job name/ configurations.
Delete Job
Deletes the current Job.
Push Job
Pushes the selected job to Git.
Redirects to the List Job page.
Redirects to the Settings page.
Opens the Job in full screen.
Formats the Job tasks in an arranged manner.
Zooms in on the Job workspace.
Zooms out of the Job workspace.
This task is used to read the data from the following databases: MYSQL, MSSQL, Oracle, ClickHouse, Snowflake, PostgreSQL, Redshift.
Drag the DB reader task to the Workspace and click on it to open its configuration tabs. The Meta Information tab opens by default.
Host IP Address: Enter the Host IP Address for the selected driver.
Port: Enter the port for the given IP Address.
Database name: Enter the Database name.
Table name: Provide a single table name or multiple table names. If multiple table names are given, enter them separated by commas (,).
User name: Enter the user name for the provided database.
Password: Enter the password for the provided database.
Driver: Select the driver from the drop down. There are 7 drivers supported here: MYSQL, MSSQL, Oracle, ClickHouse, Snowflake, PostgreSQL, Redshift.
Fetch Size: Provide the maximum number of records to be processed in one execution cycle.
Create Partition: This option is used for performance enhancement. It creates a sequence of indexes. Once this option is selected, the operation will not be executed on the server.
Partition By: This option will appear once the Create Partition option is enabled. There are two options under it:
Auto Increment: The number of partitions will be incremented automatically.
Index: The number of partitions will be incremented based on the specified Partition column.
Query: Enter the Spark SQL query in this field for the given table or tables. Please refer to the below image for writing a query on multiple tables.
Please Note:
The ClickHouse driver in the Spark components will use HTTP Port and not the TCP port.
In the case of data from multiple tables (join queries), one can write the join query directly without specifying multiple tables, as only one of the Table and Query fields is required.
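As an illustration of the multi-table case, a join query of the following kind could be supplied in the Query field; the table and column names are hypothetical.

```python
# Hypothetical Spark SQL join query for the Query field when reading from
# multiple tables; table and column names are placeholders.
query = """
SELECT o.order_id,
       o.amount,
       c.customer_name
FROM   orders o
JOIN   customers c
       ON o.customer_id = c.customer_id
WHERE  o.order_date >= '2024-01-01'
"""
print(query)
```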
Please click the Save Task In Storage icon to save the configuration for the dragged reader task.
This task is used to read data from MongoDB collection.
Drag the MongoDB reader task to the Workspace and click on it to open its configuration tabs. The Meta Information tab opens by default.
Connection Type: Select the connection type from the drop-down:
Standard
SRV
Connection String
Port (*): Provide the Port number (It appears only with the Standard connection type).
Host IP Address (*): The IP address of the host.
Username (*): Provide a username.
Password (*): Provide a valid password to access the MongoDB.
Database Name (*): Provide the name of the database where you wish to write data.
Additional Parameters: Provide details of the additional parameters.
Cluster Shared: Enable this option to horizontally partition data across multiple servers.
Schema File Name: Upload Spark Schema file in JSON format.
Query: Provide a Mongo aggregation query in this field.
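For example, an aggregation pipeline of the following shape could be supplied in the Query field; the collection fields used here are hypothetical.

```python
# Hypothetical MongoDB aggregation pipeline of the kind supplied in the Query
# field; the fields "status", "amount", and "customer_id" are placeholders.
aggregation_query = [
    {"$match": {"status": "COMPLETED"}},                        # filter documents
    {"$group": {"_id": "$customer_id",
                "total_amount": {"$sum": "$amount"}}},          # aggregate per customer
    {"$sort": {"total_amount": -1}},                            # highest totals first
]
print(aggregation_query)
```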
Please Note: Please click the Save Task In Storage icon to save the configuration for the dragged reader task.
This task can read the data from the Network pool of Sandbox.
Drag the Sandbox reader task to the Workspace and click on it to open its configuration tabs. The Meta Information tab opens by default.
Storage Type: This field is pre-defined.
Sandbox File: Select the file name from the drop-down.
File Type: Select the file type from the drop-down.
There are five (5) types of file extensions available under it:
CSV: The Header and Infer Schema fields get displayed with CSV as the selected File Type. Enable the Header option to read the header of the file, and enable the Infer Schema option to infer the true schema of the columns in the CSV file.
JSON: The Multiline and Charset fields get displayed with JSON as the selected File Type. Check the Multiline option if there is any multiline string in the file.
PARQUET: No extra field gets displayed with PARQUET as the selected File Type.
AVRO: This File Type provides two drop-down menus.
Compression: Select an option out of the Deflate and Snappy options.
Compression Level: This field appears for the Deflate compression option. It provides 0 to 9 levels via a drop-down menu.
XML: Select this option to read an XML file. If this option is selected, the following fields will be displayed:
Query: Provide Spark SQL query in this field.
Please Note: Please click the Save Task In Storage icon to save the configuration for the dragged reader task.
This task reads the file from Amazon S3 bucket.
Please follow the below-mentioned steps to configure the meta information of the S3 reader task:
Drag the S3 reader task to the Workspace and click on it to open its configuration tabs. The Meta Information tab opens by default.
Bucket Name (*): Enter S3 bucket name.
Region (*): Provide the S3 region.
Access Key (*): The access key shared by AWS to log in.
Secret Key (*): The secret key shared by AWS to log in.
Table (*): Mention the table or object name which is to be read.
File Type (*): Select a file type from the drop-down menu (CSV, JSON, PARQUET, AVRO, and XML are the supported file types).
Limit: Set a limit for the number of records.
Query: Insert an SQL query (it supports queries containing a join statement as well).
Partition Columns: Provide a unique Key column name to partition data in Spark.
Please Note:
Please click the Save Task In Storage icon to save the configuration for the dragged reader task.
Once the file type is selected, multiple fields will appear. Follow the below steps for the different file types.
CSV: The Header and Infer Schema fields get displayed with CSV as the selected File Type. Enable the Header option to read the header of the file, and enable the Infer Schema option to infer the true schema of the columns in the CSV file.
JSON: The Multiline and Charset fields get displayed with JSON as the selected File Type. Check the Multiline option if there is any multiline string in the file.
PARQUET: No extra field gets displayed with PARQUET as the selected File Type.
AVRO: This File Type provides two drop-down menus.
Compression: Select an option out of the Deflate and Snappy options.
Compression Level: This field appears for the Deflate compression option. It provides 0 to 9 levels via a drop-down menu.
XML: Select this option to read an XML file. If this option is selected, the following fields will get displayed:
Infer schema: Enable this option to infer the true schema of the columns.
Path: Provide the path of the file.
Root Tag: Provide the root tag from the XML files.
Row Tags: Provide the row tags from the XML files.
Join Row Tags: Enable this option to join multiple row tags.
Elasticsearch is an open-source search and analytics engine built on top of the Apache Lucene library. It is designed to help users store, search, and analyze large volumes of data in real-time. Elasticsearch is a distributed, scalable system that can be used to index and search structured, semi-structured, and unstructured data.
This task is used to read data located in the Elasticsearch engine.
Drag the ES reader task to the Workspace and click on it to open its configuration tabs. The Meta Information tab opens by default.
Host IP Address: Enter the host IP Address for Elastic Search.
Port: Enter the port to connect with Elastic Search.
Index ID: Enter the Index ID to read a document in elastic search. In Elasticsearch, an index is a collection of documents that share similar characteristics, and each document within an index has a unique identifier known as the index ID. The index ID is a unique string that is automatically generated by Elasticsearch and is used to identify and retrieve a specific document from the index.
Resource Type: Provide the resource type. In Elasticsearch, a resource type is a way to group related documents together within an index. Resource types are defined at the time of index creation, and they provide a way to logically separate different types of documents that may be stored within the same index.
Is Date Rich True: Enable this option if any fields in the reading file contain date or time information. The "date rich" feature in Elasticsearch allows for advanced querying and filtering of documents based on date or time ranges, as well as date arithmetic operations.
Username: Enter the username for Elasticsearch.
Password: Enter the password for Elasticsearch.
Query: Provide a Spark SQL query.
Please Note: Please click the Save Task In Storage icon to save the configuration for the dragged reader task.
This task is used to read data from Azure blob container.
Drag the Azure Blob reader task to the Workspace and click on it to open its configuration tabs. The Meta Information tab opens by default.
Read using: There are three (3) options available under this tab:
Provide the following details:
Shared Access Signature: This is a URI that grants restricted access rights to Azure Storage resources.
Account Name: Provide the Azure account name.
Container: Provide the container name from where the file is located and which has to be read.
File type: There are five (5) types of file extensions available under it:
CSV: The Header and Infer Schema fields get displayed with CSV as the selected File Type. Enable the Header option to read the header of the file, and enable the Infer Schema option to infer the true schema of the columns in the CSV file.
JSON: The Multiline and Charset fields get displayed with JSON as the selected File Type. Check the Multiline option if there is any multiline string in the file.
PARQUET: No extra field gets displayed with PARQUET as the selected File Type.
AVRO: This File Type provides two drop-down menus.
Compression: Select an option out of the Deflate and Snappy options.
Compression Level: This field appears for the Deflate compression option. It provides 0 to 9 levels via a drop-down menu.
XML: Select this option to read an XML file. If this option is selected, the following fields will get displayed:
Infer schema: Enable this option to infer the true schema of the columns.
Path: Provide the path of the file.
Root Tag: Provide the root tag from the XML files.
Row Tags: Provide the row tags from the XML files.
Join Row Tags: Enable this option to join multiple row tags.
Path: This option will appear once the file type is selected. Enter the path where the selected file type is located.
Read Directory: Check in this box to read the specified directory.
Query: Provide Spark SQL query in this field.
Provide the following details:
Account Key: Enter the Azure account key. In Azure, an account key is a security credential that is used to authenticate access to storage resources, such as blobs, files, queues, or tables, in an Azure storage account.
Account Name: Provide the Azure account name.
Container: Provide the container name from where the blob is located. A container is a logical unit of storage in Azure Blob Storage that can hold blobs. It is similar to a directory or folder in a file system, and it can be used to organize and manage blobs.
File type: There are five (5) types of file extensions available under it:
CSV: The Header and Infer Schema fields get displayed with CSV as the selected File Type. Enable the Header option to read the header of the file, and enable the Infer Schema option to infer the true schema of the columns in the CSV file.
JSON: The Multiline and Charset fields get displayed with JSON as the selected File Type. Check the Multiline option if there is any multiline string in the file.
PARQUET: No extra field gets displayed with PARQUET as the selected File Type.
AVRO: This File Type provides two drop-down menus.
Compression: Select an option out of the Deflate and Snappy options.
Compression Level: This field appears for the Deflate compression option. It provides 0 to 9 levels via a drop-down menu.
XML: Select this option to read an XML file. If this option is selected, the following fields will get displayed:
Infer schema: Enable this option to infer the true schema of the columns.
Path: Provide the path of the file.
Root Tag: Provide the root tag from the XML files.
Row Tags: Provide the row tags from the XML files.
Join Row Tags: Enable this option to join multiple row tags.
Path: This option will appear once the file type is selected. Enter the path where the selected file type is located.
Read Directory: Check in this box to read the specified directory.
Query: Provide Spark SQL query in this field.
Provide the following details:
Client ID: Provide Azure Client ID. The client ID is the unique Application (client) ID assigned to your app by Azure AD when the app was registered.
Tenant ID: Provide the Azure Tenant ID. Tenant ID (also known as Directory ID) is a unique identifier that is assigned to an Azure AD tenant, which represents an organization or a developer account. It is used to identify the organization or developer account that the application is associated with.
Client Secret: Enter the Azure Client Secret. Client Secret (also known as Application Secret or App Secret) is a secure password or key that is used to authenticate an application to Azure AD.
Account Name: Provide the Azure account name.
Container: Provide the container name from where the blob is located. A container is a logical unit of storage in Azure Blob Storage that can hold blobs. It is similar to a directory or folder in a file system, and it can be used to organize and manage blobs.
Query: Provide Spark SQL query in this field.
File type: There are five (5) types of file extensions available under it:
CSV: The Header and Infer Schema fields get displayed with CSV as the selected File Type. Enable the Header option to read the header of the file, and enable the Infer Schema option to infer the true schema of the columns in the CSV file.
JSON: The Multiline and Charset fields get displayed with JSON as the selected File Type. Check the Multiline option if there is any multiline string in the file.
PARQUET: No extra field gets displayed with PARQUET as the selected File Type.
AVRO: This File Type provides two drop-down menus.
Compression: Select an option out of the Deflate and Snappy options.
Compression Level: This field appears for the Deflate compression option. It provides 0 to 9 levels via a drop-down menu.
XML: Select this option to read an XML file. If this option is selected, the following fields will get displayed:
Infer schema: Enable this option to infer the true schema of the columns.
Path: Provide the path of the file.
Root Tag: Provide the root tag from the XML files.
Row Tags: Provide the row tags from the XML files.
Join Row Tags: Enable this option to join multiple row tags.
Please Note: Please click the Save Task In Storage icon to save the configuration for the dragged reader task.
All the available Writer Task components for a Job are explained in this section.
Writers are a group of components that can write data to different databases and cloud storages.
There are eight (8) Writer tasks in Jobs. All the Writer tasks have the following tabs:
Meta Information: Configure the meta information in the same way as for the pipeline components.
Preview Data: Only ten random records can be previewed in this tab, and only when the task is running in Development mode.
Preview Schema: The Spark schema of the data is shown in this tab.
Logs: Logs of the task are displayed here.
HDFS stands for Hadoop Distributed File System. It is a distributed file system designed to store and manage large data sets in a reliable, fault-tolerant, and scalable way. HDFS is a core component of the Apache Hadoop ecosystem and is used by many big data applications.
This task writes the data in HDFS(Hadoop Distributed File System).
Drag the HDFS writer task to the Workspace and click on it to open the related configuration tabs for the same. The Meta Information tab opens by default.
Host IP Address: Enter the host IP address for HDFS.
Port: Enter the Port.
Table: Enter the table name where the data has to be written.
Zone: Enter the Zone for HDFS in which the data has to be written. Zone is a special directory whose contents will be transparently encrypted upon write and transparently decrypted upon read.
File Format: Select the file format in which the data has to be written:
CSV
JSON
PARQUET
AVRO
Save Mode: Select the save mode.
Schema file name: Upload the Spark schema file in JSON format.
Partition Columns: Provide a unique Key column name to partition data in Spark.
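For reference, a Spark schema file is the JSON representation of a Spark StructType. A minimal illustrative schema (the column names used here are assumptions, not part of the product configuration) could look like:
{"type": "struct", "fields": [{"name": "id", "type": "integer", "nullable": true, "metadata": {}}, {"name": "name", "type": "string", "nullable": true, "metadata": {}}]}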
Please Note: Please click the Save Task In Storage icon to save the configuration for the dragged writer task.
This task writes the data to MongoDB collection.
Drag the MongoDB writer task to the Workspace and click on it to open the related configuration tabs for the same. The Meta Information tab opens by default.
Connection Type: Select the connection type from the drop-down:
Standard
SRV
Connection String
Port (*): Provide the Port number (It appears only with the Standard connection type).
Host IP Address (*): The IP address of the host.
Username (*): Provide a username.
Password (*): Provide a valid password to access the MongoDB.
Database Name (*): Provide the name of the database where you wish to write data.
Additional Parameters: Provide details of the additional parameters.
Schema File Name: Upload Spark Schema file in JSON format.
Save Mode: Select the Save mode from the drop down.
Append: This operation adds the data to the collection.
Ignore: "Ignore" is an operation that skips the insertion of a record if a duplicate record already exists in the database. This means that the new record will not be added, and the database will remain unchanged. "Ignore" is useful when you want to prevent duplicate entries in a database.
Upsert: It is a combination of "update" and "insert". It is an operation that updates a record if it already exists in the database or inserts a new record if it does not exist. This means that "upsert" updates an existing record with new data or creates a new record if the record does not exist in the database.
Please Note: Please click the Save Task In Storage icon to save the configuration for the dragged writer task.
Elasticsearch is an open-source search and analytics engine built on top of the Apache Lucene library. It is designed to help users store, search, and analyze large volumes of data in real-time. Elasticsearch is a distributed, scalable system that can be used to index and search structured, semi-structured, and unstructured data.
This task is used to write the data in Elastic Search engine.
Drag the ES writer task to the Workspace and click on it to open the related configuration tabs for the same. The Meta Information tab opens by default.
Host IP Address: Enter the host IP Address for Elastic Search.
Port: Enter the port to connect with Elastic Search.
Index ID: Enter the Index ID to write a document in Elasticsearch. In Elasticsearch, an index is a collection of documents that share similar characteristics, and each document within an index has a unique identifier known as the index ID. The index ID is a unique string that is automatically generated by Elasticsearch and is used to identify and retrieve a specific document from the index.
Mapping ID: Provide the Mapping ID. In Elasticsearch, a mapping ID is a unique identifier for a mapping definition that defines the schema of the documents in an index. It is used to differentiate between different types of data within an index and to control how Elasticsearch indexes and searches data.
Resource Type: Provide the resource type. In Elasticsearch, a resource type is a way to group related documents together within an index. Resource types are defined at the time of index creation, and they provide a way to logically separate different types of documents that may be stored within the same index.
Username: Enter the username for Elasticsearch.
Password: Enter the password for Elasticsearch.
Schema File Name: Upload the Spark schema file in JSON format.
Save Mode: Select the Save mode from the drop down.
Append
Selected columns: The user can select specific columns, provide an alias name, and select the desired data type for each column.
Please Note: Please click the Save Task In Storage icon to save the configuration for the dragged writer task.
This task is used to write data in the following databases: MYSQL, MSSQL, Oracle, ClickHouse, Snowflake, PostgreSQL, Redshift.
Drag the DB writer task to the Workspace and click on it to open the related configuration tabs for the same. The Meta Information tab opens by default.
Host IP Address: Enter the Host IP Address for the selected driver.
Port: Enter the port for the given IP Address.
Database name: Enter the Database name.
Table name: Provide a single table name or multiple table names. If multiple table names are given, enter them separated by commas (,).
User name: Enter the user name for the provided database.
Password: Enter the password for the provided database.
Driver: Select the driver from the drop-down. Seven (7) drivers are supported here: MYSQL, MSSQL, Oracle, ClickHouse, Snowflake, PostgreSQL, Redshift.
Schema File Name: Upload spark schema file in JSON format.
Save Mode: Select the Save mode from the drop down.
Append
Overwrite
Query: Write the create table(DDL) query.
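As an illustration (the table and column names below are assumptions, not a documented sample), a create table (DDL) query for this field could look like:
CREATE TABLE sales_summary (id INT, region VARCHAR(50), total_amount DECIMAL(12,2));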
Please Note: Please click the Save Task In Storage icon to save the configuration for the dragged writer task.
Amazon Athena is an interactive query service that easily analyzes data directly in Amazon Simple Storage Service (Amazon S3) using standard SQL. With a few actions in the AWS Management Console, you can point Athena at your data stored in Amazon S3 and begin using standard SQL to run ad-hoc queries and get results in seconds.
Athena Query Executer task enables users to read data directly from the external table created in AWS Athena.
Please Note: Please go through the below given demonstration to configure Athena Query Executer in Jobs.
Region: Enter the region name where the bucket is located.
Access Key: Enter the AWS Access Key of the account that must be used.
Secret Key: Enter the AWS Secret Key of the account that must be used.
Table Name: Enter the name of the external table created in Athena.
Database Name: Name of the database in Athena in which the table has been created.
Limit: Enter the number of records to be read from the table.
Data Source: Enter the Data Source name configured in Athena. Data Source in Athena refers to your data's location, typically an S3 bucket.
Workgroup: Enter the Workgroup name configured in Athena. The Workgroup in Athena is a resource type to separate query execution and query history between Users, Teams, or Applications running under the same AWS account.
Query location: Enter the path where the results of the queries done in the Athena query editor are saved in the CSV format. Users can find this path under the Settings tab in the Athena query editor as Query Result Location.
Query: Enter the Spark SQL query.
Sample Spark SQL query that can be used in Athena Reader:
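As an illustration (the table name below is an assumption, since the documented sample is not reproduced here), a simple query of the following form can be used:
select * from sales_data limit 10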
Azure is a cloud computing platform and service. It provides a range of cloud services, including infrastructure as a service (IaaS), platform as a service (PaaS), and software as a service (SaaS) offerings, as well as tools for building, deploying, and managing applications in the cloud.
Azure Writer task is used to write the data in the Azure Blob Container.
Drag the Azure writer task to the Workspace and click on it to open the related configuration tabs for the same. The Meta Information tab opens by default.
Write using: There are three (3) options available under this tab:
Shared Access Signature
Secret Key
Principal Secret
Provide the following details:
Shared Access Signature: This is a URI that grants restricted access rights to Azure Storage resources.
Account Name: Provide the Azure account name.
Container: Provide the container name from where the blob is located. A container is a logical unit of storage in Azure Blob Storage that can hold blobs. It is similar to a directory or folder in a file system, and it can be used to organize and manage blobs.
Blob Name: Enter the Blob name. A blob is a type of object storage that is used to store unstructured data, such as text or binary data, like images or videos.
File Format: There are four (4) file types available under it; select the file format in which the data has to be written:
CSV
JSON
PARQUET
AVRO
Save Mode: Select the Save mode from the drop down.
Append
Overwrite
Schema File Name: Upload spark schema file in JSON format.
Account Key: Enter the Azure account key. In Azure, an account key is a security credential that is used to authenticate access to storage resources, such as blobs, files, queues, or tables, in an Azure storage account.
Account Name: Provide the Azure account name.
Container: Provide the container name from where the blob is located. A container is a logical unit of storage in Azure Blob Storage that can hold blobs. It is similar to a directory or folder in a file system, and it can be used to organize and manage blobs.
Blob Name: Enter the Blob name. A blob is a type of object storage that is used to store unstructured data, such as text or binary data, like images or videos.
File type: There are four (4) file types available under it:
CSV
JSON
PARQUET
AVRO
Schema File Name: Upload spark schema file in JSON format.
Save Mode: Select the Save mode from the drop down.
Append
Overwrite
Provide the following details:
Client ID: Provide Azure Client ID. The client ID is the unique Application (client) ID assigned to your app by Azure AD when the app was registered.
Tenant ID: Provide the Azure Tenant ID. Tenant ID (also known as Directory ID) is a unique identifier that is assigned to an Azure AD tenant, which represents an organization or a developer account. It is used to identify the organization or developer account that the application is associated with.
Client Secret: Enter the Azure Client Secret. Client Secret (also known as Application Secret or App Secret) is a secure password or key that is used to authenticate an application to Azure AD.
Account Name: Provide the Azure account name.
Container: Provide the container name from where the blob is located. A container is a logical unit of storage in Azure Blob Storage that can hold blobs. It is similar to a directory or folder in a file system, and it can be used to organize and manage blobs.
Blob Name: Enter the Blob name. A blob is a type of object storage that is used to store unstructured data, such as text or binary data, like images or videos.
File type: There are four (4) file types available under it:
CSV
JSON
PARQUET
AVRO
Save Mode: Select the Save mode from the drop down.
Append
Overwrite
Schema File Name: Upload spark schema file in JSON format.
Please Note: Please click the Save Task In Storage icon to save the configuration for the dragged writer task.
This task is used to write the data in Amazon S3 bucket.
Drag the S3 writer task to the Workspace and click on it to open the related configuration tabs for the same. The Meta Information tab opens by default.
Bucket Name (*): Enter S3 Bucket name.
Region (*): Provide S3 region.
Access Key (*): Access key shared by AWS to log in.
Secret Key (*): Secret key shared by AWS to log in.
Table (*): Mention the table or object name to which the data is to be written.
File Type (*): Select a file type from the drop-down menu (CSV, JSON, PARQUET, AVRO are the supported file types).
Save Mode: Select the Save mode from the drop down.
Append
Schema File Name: Upload spark schema file in JSON format.
Please Note: Please click the Save Task In Storage icon to save the configuration for the dragged writer task.
Write PySpark scripts and run them flawlessly in the Jobs.
This feature allows users to write their own PySpark script and run their script in the Jobs section of Data Pipeline module.
Before creating the PySpark Job, the user has to create a project in the Data Science Lab module under PySpark Environment. Please refer the below image for reference:
Please go through the below given demonstration to create and configure a PySpark Job.
Open the pipeline homepage and click on the Create option.
The new panel opens from the right-hand side. Click the Create button in the Job option.
Provide the following information:
Enter name: Enter the name for the job.
Job Description: Add description of the Job (It is an optional field).
Job Baseinfo: Select the PySpark Job option from the drop-down in Job Base Information.
Trigger By: The PySpark Job can be triggered by another Job or PySpark Job. The PySpark Job can be triggered in two scenarios from another job:
On Success: Select a job from the drop-down. Once the selected job runs successfully, it will trigger the PySpark Job.
On Failure: Select a job from the drop-down. Once the selected job fails, it will trigger the PySpark Job.
Is Schedule: Put a checkmark in the given box to schedule the new Job.
Spark config: Select the resources for the job.
Click the Save option to save the Job.
The PySpark Job gets saved, and it redirects the user to the Job workspace.
Once the PySpark Job is created, follow the below given steps to configure the Meta Information tab of the PySpark Job.
Project Name: Select the same Project using the drop-down menu where the concerned Notebook has been created.
Script Name: This field will list the exported Notebook names which are exported from the Data Science Lab module to Data Pipeline.
Please Note: The script written in the DS Lab module should be inside a function. Refer to the Export to Pipeline page for more details on how to export a PySpark script to the Data Pipeline module.
External Library: If any external libraries are used in the script, the user can mention them here. Multiple libraries can be mentioned by separating the names with commas (,).
Start Function: All the function names used in the script are listed here. Select the start function name to execute the Python script.
Script: The Exported script appears under this space.
Input Data: If any parameter has been given in the function, then the name of the parameter is provided as Key, and the value of the parameter has to be provided as Value in this field.
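For reference, below is a minimal sketch of an exported PySpark script, assuming a hypothetical start function start_job with a parameter input_path (the parameter would be supplied through the Input Data field, with input_path as the Key and its value as the Value):

from pyspark.sql import SparkSession

def start_job(input_path):
    # "input_path" is received from the Input Data field configured for the Job
    spark = SparkSession.builder.appName("sample_pyspark_job").getOrCreate()
    df = spark.read.csv(input_path, header=True, inferSchema=True)
    df.show(10)  # preview the first ten rows
    spark.stop()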
Please note: We currently support the following JDBC connectors:
MySQL
MSSQL
Oracle
MongoDB
PostgreSQL
ClickHouse
In Apache Kafka, a "producer" is a client application or program that is responsible for publishing (or writing) messages to a Kafka topic.
A Kafka producer sends messages to Kafka brokers, which are then distributed to the appropriate consumers based on the topic, partitioning, and other configurable parameters.
Drag the Kafka Producer task to the Workspace and click on it to open the related configuration tabs for the same. The Meta Information tab opens by default.
Topic Name: Specify the topic name where the user wants to produce data.
Security Type: Select the security type from the drop-down:
Plain Text
SSL
Is External: The user can produce the data to an external Kafka topic by enabling the 'Is External' option. The 'Bootstrap Server' and 'Config' fields are displayed after enabling the 'Is External' option.
Bootstrap Server: Enter the external bootstrap server details.
Config: Enter the configuration details.
Host Aliases: In Apache Kafka, a host alias (also known as a hostname alias) is an alternative name that can be used to refer to a Kafka broker in a cluster. Host aliases are useful when you need to refer to a broker using a name other than its actual hostname.
IP: Enter the IP.
Host Names: Enter the host names.
Please Note: Please click the Save Task In Storage icon to save the configuration for the dragged writer task.
This page aims to explain the various transformation options provided on the Jobs Editor page.
The following transformations are provided under the Transformations section.
Alter Columns
Select Columns
Date Formatter
Query
Filter
Formula
Join
Aggregation
Sort Task
The Alter Columns command is used to change the data type of a column in a table.
In the Alter Columns tab, add the name of the column whose data type needs to be changed.
Name (*): Column name.
Alias Name(*) : New column name.
Column Type(*): Specify datatype from dropdown.
Add New Column: Multiple columns can be added for desired modification in the datatype.
Helps to select particular columns from a table definition.
Name (*): Column name.
Alias Name(*) : New column name.
Column Type(*) : Specify datatype from dropdown.
Add New Column: Multiple columns can be added for the desired result
It helps in converting Date and Datetime columns to a desired format.
Name (*): Column name.
Input Format(*): The function supports a total of 61 different input formats.
Output Format(*): The format in which the output will be given.
Output Column Name(*): Name of the output column.
Add New Row: To insert a new row.
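For instance (an illustrative example, not taken from the product documentation), with Input Format dd-MM-yyyy and Output Format yyyy-MM-dd, a column value of 25-12-2023 would appear in the output column as 2023-12-25.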
The Query transformation allows you to write SQL (DML) queries such as Select queries and data view queries.
Query(*): Provide a valid query to transform data.
Table Name(*): Provide the table name.
Schema File Name: Upload Spark Schema file in JSON format.
Choose File: Upload a file from the system.
Please Note: Alter query will not work.
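As an illustration (the table and column names are assumptions), a query such as select product_category, sum(revenue) as total_revenue from table_name group by product_category can be provided in the Query field, with table_name entered in the Table Name field.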
The Filter transformation allows the user to filter table data based on different defined conditions.
Field Name(*): Provide field name.
Condition (*): 8 condition operations are available within this function.
Logical Condition(*) (AND/OR): Select AND/OR to combine multiple filter conditions.
Add New Column: Adds a new Column.
It gives computation results based on the selected formula type.
Field Name(*): Provide field name.
Formula Type(*): Select a Formula Type from the drop-down option.
Math (22 Math Operations)
String (16 String Operations )
Bitwise (3 Bitwise Operations)
Output Field Name(*): Provide the output field name.
Add New Column: Adds a new column.
It joins 2 tables based on the specified column conditions.
Join Type (*): Provides drop-down menu to choose a Join type.
The supported Join Types are:
Inner
Outer
Full
Full outer
Left outer
Left
Right outer
Right
Left semi
Left anti
Left Column(*): Conditional column from the left table.
Right Column(*) : Conditional column from right table.
Add New Column: Adds a new column.
An aggregate task performs a calculation on a set of values and returns a single value by using the Group By column.
Group By Columns(*): Provide a name for the group by column.
Field Name(*): Provide the field name.
Operation (*): 30 operations are available within this function.
Alias(*): Provide an alias name.
Add New Column: Adds a new column.
This transformation sorts all the data from a table based on the selected column and order.
Sort Key(*): Provide a sort key.
Order(*): Select an option out of Ascending or Descending.
Add New Column: Adds a new column.
This task writes data to the network pool of the Sandbox.
Drag the Sandbox writer task to the Workspace and click on it to open the related configuration tabs for the same. The Meta Information tab opens by default.
Storage Type: This field is pre-defined.
Sandbox File: Enter the file name.
File Type: Select the file type in which the data has to be written. There are 4 file types supported here:
CSV
JSON
Save Mode: Select the Save mode from the drop down.
Append
Overwrite
Schema File Name: Upload spark schema file in JSON format.
Please Note: Please click the Save Task In Storage icon to save the configuration for the dragged writer task.
The Script Executor job is designed to execute code snippets or scripts written in various programming languages such as Go, Julia, and Python. This job allows users to fetch code from their Git repository and execute it seamlessly.
Navigate to the Data Pipeline module homepage.
Open the pipeline homepage and click on the Create option.
The new panel opens from the right-hand side. Click the Create button in the Job option.
Enter a name for the new Job.
Describe the Job (Optional).
Job Baseinfo: Select Script Executer from the drop-down.
Trigger By: There are 2 options for triggering a job on the success or failure of another job:
Success Job: On successful execution of the selected job the current job will be triggered.
Failure Job: On failure of the selected job the current job will be triggered.
Is Scheduled?
A job can be scheduled for a particular timestamp; it will be triggered at that same timestamp every time.
The job must be scheduled according to UTC.
Docker Configuration: Select a resource allocation option using the radio button. The given choices are:
Low
Medium
High
Provide the resources required to run the Job in the Limit and Request sections.
Limit: Enter the maximum CPU and Memory required for the Job.
Request: Enter the CPU and Memory required for the job at the start.
Instances: Enter the number of instances for the Job.
Alert: Please refer to the Job Alerts page to configure alerts in the job.
Click the Save option to save the Job.
The Script Executer Job gets saved, and it will redirect the users to the Job Editor workspace.
Please go through the demonstration given-below to configure the Script Executor.
Git Config: Select an option from the drop-down.
Personal: If this option is selected, provide the following information:
Git URL: Enter the Git URL.
URL for Github: https://github.com
URL for Gitlab: https://gitlab.com
User Name: Enter the GIT username.
Token: Enter the Access token or API token for authentication and authorization when accessing the Git repository, commonly used for secure automated processes like code fetching and execution.
Branch: Specify the Git branch for code fetching.
Script Type: Select the script's language for execution from the drop down:
GO
Julia
Python
Start Script: Enter the script name (with extension) which has to be executed.
For example, if Python is selected as the Script type, then the script name will be in the following format: script_name.py.
If Golang is selected as the Script type, then the script name will be in the following format: script_name.go.
Start Function: Specify the initial function or method within the Start Script for execution, particularly relevant for languages like Python with reusable functions.
Repository: Provide the Git repository name.
Input Arguments: Allows users to provide input parameters or arguments needed for code execution, such as dynamic values, configuration settings, or user inputs affecting script behavior.
Admin: If this option is selected, then Git URL, User Name, Token & Branch fields have to be configured in the platform in order to use Script Executor.
In this option, the user has to provide the following fields:
Script Type
Start Script
Start Function
Repository
Input Arguments
Please Note: Follow the below given steps to configure GitLab/GitHub credentials in the Admin Settings in the platform:
Navigate to Admin >> Configurations >> Version Control.
From the first drop-down menu, select the Version.
Choose 'DsLabs' as the module from the drop-down.
Select either GitHub or GitLab based on the requirement for Git type.
Enter the host for the selected Git type.
Provide the token key associated with the Git account.
Select a Git project.
Choose the branch where the files are located.
After providing all the details correctly, click on 'Test,' and if the authentication is successful, an appropriate message will appear. Subsequently, click on the 'Save' option.
To complete the configuration, navigate to My Account >> Configuration. Enter the Git Token and Git Username, then save the changes.
Write Python scripts and run them flawlessly in the Jobs.
This feature allows users to write their own Python script and run their script in the Jobs section of Data Pipeline module.
Before creating the Python Job, the user has to create a project in the Data Science Lab module under Python Environment. Please refer the below image for reference:
After creating the Data Science project, the users need to activate it and create a Notebook where they can write their own Python script. Once the script is written, the user must save it and export it to be able to use it in the Python Jobs.
Navigate to the Data Pipeline module homepage.
Open the pipeline homepage and click on the Create option.
The new panel opens from the right-hand side. Click the Create button in the Job option.
Enter a name for the new Job.
Describe the Job (Optional).
Job Baseinfo: Select Python Job from the drop-down.
Trigger By: There are 2 options for triggering a job on the success or failure of another job:
Success Job: On successful execution of the selected job the current job will be triggered.
Failure Job: On failure of the selected job the current job will be triggered.
Is Scheduled?
A job can be scheduled for a particular timestamp; it will be triggered at that same timestamp every time.
The job must be scheduled according to UTC.
On demand: Check the "On demand" option to create an on-demand job. For more information on Python Job (On demand), check here.
Docker Configuration: Select a resource allocation option using the radio button. The given choices are:
Low
Medium
High
Provide the resources required to run the python Job in the limit and Request section.
Limit: Enter max CPU and Memory required for the Python Job.
Request: Enter the CPU and Memory required for the job at the start.
Instances: Enter the number of instances for the Python Job.
Alert: Please refer to the Job Alerts page to configure alerts in job.
Click the Save option to save the Python Job.
The Python Job gets saved, and it will redirect the users to the Job Editor workspace.
Check out the below given demonstration to configure a Python Job.
Once the Python Job is created, follow the below given steps to configure the Meta Information tab of the Python Job.
Project Name: Select the same Project using the drop-down menu where the Notebook has been created.
Script Name: This field will list the exported Notebook names which are exported from the Data Science Lab module to Data Pipeline.
External Library: If any external libraries are used in the script, the user can mention them here. Multiple libraries can be mentioned by separating the names with commas (,).
Start Function: Here, all the function names used in the script will be listed. Select the start function name to execute the python script.
Script: The Exported script appears under this space.
Input Data: If any parameter has been given in the function, then the name of the parameter is provided as Key, and the value of the parameter has to be provided as Value in this field.
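As a reference, below is a minimal sketch of a Python script exported from a DS Lab Notebook, assuming a hypothetical start function process_data with a parameter file_name (in the Input Data field, file_name would be entered as the Key and the actual file name as its Value):

def process_data(file_name):
    # "file_name" is received from the Input Data field of the Python Job
    print("Processing", file_name)
    # business logic goes here
    return "done"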
This feature enables users to export their written scripts from the DS Lab Notebook in order to use them in Pipelines and Jobs.
Prerequisite: The user needs to create a project under the DS Lab module before using this feature.
Follow the below-given steps to export a script from the DS Lab notebook:
Activate and open a project under the DS Lab module.
The user can create or open the previously created notebook where the scripts have been written.
Go to the Notebook options and click the Export option.
The Export to Pipeline/Git panel opens from the right side of the window; select Export to Pipeline option.
Select the scripts using the check boxes to export them.
Click the Next option.
Click the Validate icon to validate the selected scripts.
The user can also export External Libraries used in the project along with scripts.
The Libraries panel opens. Select the necessary libraries to be exported along with the scripts.
A notification message appears confirming the validation of the script.
After that, click on the Export to Pipeline option, and the selected scripts will be exported along with the chosen external libraries.
The user can use the exported scripts in the Pipeline, Jobs (Python Job, PySpark Job).
Please Note: Files with the .ipynb extension can be exported to the pipelines and jobs for further use, while those with the .py extension can only be utilized as utility files. Additional information on utility files can be found here: Utility
The Register feature enables users to create jobs directly from a DS Lab Notebook. Using this feature, users can create the following types of jobs:
Follow the below-given steps to create a Python Job using the register feature:
Create a project in the DSLab module under the Python environment. Refer to the below given image for guidance.
Once the project is created, create a Notebook and write the scripts. It supports scripts running through multiple code cells.
After writing the script, select the Register option from the Notebook options. Refer to the following image for reference.
The Register as Job panel will open on the right side of the window with two options:
Register as New: This will create a new Job.
Re-Register: If the Job is already created and the user wants to make changes to the scripts, it will update the existing Job with the latest changes.
Please go through the walkthrough below to register as a new Job.
After clicking on the Register option and selecting the Register as New option, the user needs to choose the desired notebook cell that should be executed with the Job. The user can select multiple cells from the notebook according to requirement and then click on the Next option.
Now, the user needs to validate the script of the selected cell by clicking on the Validate option. Additionally, the user can select the External Libraries used in the project by clicking on the External Libraries option next to the Validate option. The Next option won't be enabled until the user validates the script. After validating the script, click on the Next option.
Enter scheduler name: Name of the job. (It will create a scheduled job if the user is creating a Python Job).
Scheduler description: Description of the job.
Start function: Select the start function from the validated script to start the job from the given start function.
Job BaseInfo: Python (This field will be pre-selected if the DSLab project is created under the Python environment.)
Docker config: Select the desired configuration for the job and provide the resources (CPU & Memory) for executing the job.
Provide the resources required to run the Python Job in the Limit and Request section.
Limit: Enter the max CPU and Memory required for the Python Job.
Request: Enter the CPU and Memory required for the job at the start.
Instances: Enter the number of instances for the Python Job.
On demand: Check this option if a Python Job (On demand) needs to be created. In this scenario, the Job will not be scheduled.
Payload: This option will only appear if the On demand option is checked. Enter the payload in the form of a JSON Array containing JSON objects. For more details about the Python Job (On demand), refer to this link: Python Job(On demand)
Concurrency Policy: Select the desired concurrency policy. For more details about the Concurrency Policy, check this link: Concurrency Policy
Alert: This feature in the Job allows the users to send an alert message to the specified channel (Teams or Slack) in the event of either the success or failure of the configured Job. Users can also choose both success and failure options to send an alert for the configured Job. Check the following link to configure the Alert: Job Alerts
Click on the Save option to create a job.
This feature allows the user to update an existing job with the latest changes. Select the Re-register option if the Job is already created and the user wants to make changes to the scripts.
After selecting the Re-Register option, the system will display all previous versions of registered Jobs from the chosen notebook.
The user must choose the Job to be Re-Registered with the latest changes and proceed by clicking Next.
Validate the script of the selected cell by clicking on the Validate option. Additionally, the user can choose External Libraries by clicking on the option next to Validate. The Next option remains disabled until the script is validated.
After script validation, proceed by clicking Next.
Now, the user needs to provide all the required information for Re-registering the job, following the same steps outlined in the Register as New feature.
Finally, click on the Save option to complete the Re-Registration of the job.
Please Note: The user can Register as New or Re-Register a PySpark Job in the same way as Python Job as mentioned above. The only difference is that the user needs to create the project under PySpark environment in the DSLab module.
The on-demand Python job functionality allows you to initiate a Python job based on a payload using an API call at a desirable time.
Before creating the Python (On demand) Job, the user has to create a project in the Data Science Lab module under Python Environment. Please refer the below image for reference:
Please follow the below given steps to configure Python Job(On-demand):
Navigate to the Data Pipeline homepage.
Click on the Create Job icon.
The New Job dialog box appears redirecting the user to create a new Job.
Enter a name for the new Job.
Describe the Job(Optional).
Job Baseinfo: In this field, there are three options:
Spark Job
PySpark Job
Python Job
Select the Python Job option from Job Baseinfo.
Check-in On demand option as shown in the below image.
Docker Configuration: Select a resource allocation option using the radio button. The given choices are:
Low
Medium
High
Provide the resources required to run the python Job in the limit and Request section.
Limit: Enter max CPU and Memory required for the Python Job.
Request: Enter the CPU and Memory required for the job at the start.
Instances: Enter the number of instances for the Python Job.
The payload field will appear once the "On Demand" option is checked. Enter the payload in the form of a JSON array containing JSON objects.
Trigger By: There are 2 options for triggering a job on the success or failure of another job:
Success Job: On successful execution of the selected job the current job will be triggered.
Failure Job: On failure of the selected job the current job will be triggered.
Click the Save option to create the job.
A success message appears to confirm the creation of a new job.
The Job Editor page opens for the newly created job.
Based on the given number of instances, topics will be created, and the payload will be distributed across these topics. The logic for distributing the payload among the topics is as follows:
The number of data on each topic can be calculated as the ceiling value of the ratio between the payload size and the number of instances.
For example:
Payload Size: 10
Number of Topics=Number of Instances: 3
The number of records on each topic is calculated as the ceiling of the Payload Size divided by the Number of Topics: ceil(10 / 3) = 4.
In this case, each topic will hold the following number of records:
Topic 1: 4
Topic 2: 4
Topic 3: 2 (As there are only 2 records left)
In the On-Demand Job system, the naming convention for topics is based on the Job_Id, followed by an underscore (_) symbol, and successive numbers starting from 0. The numbers in the topic names start from 0 and go up to n-1, where n is the number of instances. For clarity, consider the following example:
Job_ID: job_13464363406493
Number of instances: 3
In this scenario, three topics will be created, and their names will follow the pattern: job_13464363406493_0, job_13464363406493_1, and job_13464363406493_2.
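The following minimal sketch (an illustration of the logic described above, not the product's internal implementation) reproduces the distribution and naming behaviour:

import math

def distribute_payload(job_id, payload, instances):
    # Each topic holds at most ceil(payload size / number of instances) records
    chunk_size = math.ceil(len(payload) / instances)
    topics = {}
    for i in range(instances):
        topic_name = f"{job_id}_{i}"  # e.g. job_13464363406493_0
        topics[topic_name] = payload[i * chunk_size:(i + 1) * chunk_size]
    return topics

# 10 records across 3 instances -> 4, 4, 2
print({name: len(records) for name, records in distribute_payload("job_13464363406493", list(range(10)), 3).items()})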
Please Note:
When writing a script in DsLab Notebook for an On-Demand Job, the first argument of the function in the script is expected to represent the payload when running the Python On-Demand Job. Please refer to the provided sample code.
job_payload is the payload provided when creating the job, sent from an API call, or ingested from the Job Trigger component in the pipeline.
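A minimal sketch of such a function is given below (the function name start_job is an assumption used only for illustration):

def start_job(job_payload):
    # The first argument receives the payload: a JSON Array (list) of JSON objects
    for record in job_payload:
        print("Processing record:", record)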
Once the Python (On demand) Job is created, follow the below given steps to configure the Meta Information tab of the Python Job:
Project Name: Select the same Project using the drop-down menu where the Notebook has been created.
Script Name: This field will list the exported Notebook names which are exported from the Data Science Lab module to Data Pipeline.
External Library: If any external libraries are used in the script the user can mention it here. The user can mention multiple libraries by giving comma (,) in between the names.
Start Function: Here, all the function names used in the script will be listed. Select the start function name to execute the python script.
Script: The Exported script appears under this space.
Input Data: If any parameter has been given in the function, then the name of the parameter is provided as Key, and value of the parameters has to be provided as value in this field.
Python (On demand) Job can be activated in the following ways:
Activating from UI
Activating from Job trigger component
For activating Python (On demand) job from UI, it is mandatory for the user to enter the payload in the Payload section. Payload has to be given in the form of a JSON Array containing JSON objects as shown in the below given image.
Once the user configures the Python (On demand) job, it can be activated using the activate icon on the Job Editor page.
Please go through the below given walk-through which will provide a step-by-step guide to facilitate the activation of the Python (On demand) Job as per user preferences.
Sample payload for Python Job (On demand):
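An illustrative payload in the required format, a JSON Array containing JSON objects (the keys and values here are assumptions, not the documented sample):
[{"file": "data_1.csv"}, {"file": "data_2.csv"}]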
The Python (On demand) job can be activated using the Job Trigger component in the pipeline. To configure this, the user has to set up their Python (On demand) job in the meta-information component within the pipeline. The in-event data of the Job Trigger component will then be utilized as a payload in the Python (On demand) Job.
Please go through the below given walk-through which will provide a step-by-step guide to facilitate the activation of the Python (On demand) Job through Job trigger component.
Please follow the below given steps to configure Job trigger component to activate the Python (On demand) job:
Create a pipeline that generates meaningful data to be sent to the out event, which will serve as the payload for the Python (On demand) job.
Connect the Job Trigger component to the event that holds the data to be used as payload in the Python (On demand) job.
Open the meta-information of the Job Trigger component and select the job from the drop-down menu that needs to be activated by the Job Trigger component.
The data from the previously connected event will be passed as JSON objects within a JSON Array and used as the payload for the Python (On demand) job. Please refer to the image below for reference:
In the provided image, the event contains a column named "output" with four different values: "jobs/ApportionedIdentifiers.csv", "jobs/accounnts.csv", "jobs/gluue.csv", and "jobs/census_2011.csv". The payload will be passed to the Python (On demand) job in the JSON format given below.
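Based on the values described above, the payload received by the Python (On demand) job would take a form similar to the following:
[{"output": "jobs/ApportionedIdentifiers.csv"}, {"output": "jobs/accounnts.csv"}, {"output": "jobs/gluue.csv"}, {"output": "jobs/census_2011.csv"}]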
The Alert feature in the job allows users to send an alert message to the specified channel (Teams or Slack) in the event of either the success or failure of the configured job. Users can also choose both success and failure options to send an alert for the configured job.
Webhook URL: Provide the Webhook URL of the selected channel group where the Alert message needs to be sent.
Type: Message Card. (This field will be Pre-filled)
Theme Color: Enter the hexadecimal color code for the ribbon color in the selected channel. Please refer to the image given at the bottom of this page for reference.
Sections: This tab contains the following fields:
Activity Title: This is the title of the alert to be sent on the Teams channel. Enter the Activity Title as per the requirement.
Activity Subtitle: Enter the Activity Subtitle. Please refer to the image given at the bottom of this page for reference.
Text: Enter the text message that should be sent along with the alert.
Webhook URL: Provide the Webhook URL of the selected channel group where the Alert message needs to be sent.
Attachments: This tab contains the following fields:
Title: This is the title of the alert to be sent on the selected channel. Enter the title as per the requirement.
Color: Enter the hexadecimal color code for the ribbon color in the Slack channel. Please refer to the image given at the bottom of this page for reference.
Text: Enter the text message that should be sent along with the alert.
Footer: The Footer typically refers to additional information or content appended at the end of a message in a Slack channel. This can include details like a signature, contact information, or any other supplementary information that you want to include with your message. Footers are often used to provide additional context to the message content.
Footer Icon: In Slack, the footer icon refers to an icon or image that is displayed at the bottom of a message or attachment. The footer icon can be a company logo, an application icon, or any other image that represents the entity responsible for the message. Enter image URL as the value of Footer icon.
Follow these steps to set the Footer icon in Slack:
Go to the desired image that has to be used as the footer icon.
Right-click on the image.
Select the 'Copy image address' to get the image URL.
Now, the obtained image URL can be used as the value for the footer icon in Slack.
Sample image URL for Footer icon:
Sample Hexadecimal Color code which can be used in Job Alert.
The Git Sync feature allows users to import files directly from their Git repository into a DS Lab project to be used in subsequent processes within Pipelines and Jobs. To use this feature, the user needs to configure their repository in their DS Lab project.
Prerequisites:
To configure GitLab/GitHub credentials, follow these steps in the Admin Settings:
Navigate to Admin >> Configurations >> Version Control.
From the first drop-down menu, select the Version.
Choose 'DsLabs' as the module from the drop-down.
Select either GitHub or GitLab based on the requirement for Git type.
Enter the host for the selected Git type.
Provide the token key associated with the Git account.
Select a Git project.
Choose the branch where the files are located.
After providing all the details correctly, click on 'Test,' and if the authentication is successful, an appropriate message will appear. Subsequently, click on the 'Save' option.
To complete the configuration, navigate to My Account >> Configuration. Enter the Git Token and Git Username, then save the changes.
Please follow the below-given steps to configure the Git Sync in the DS Lab project:
Navigate to the DS Lab module.
Click the Create Project to initiate a new project.
Enter all the required fields for the project.
Select the Git Repository and Git Branch.
Enable the Sync git repo at project creation option to gain access to all the files in the selected repository.
Click the Save option to create the project.
After creating the project, expand the Repo option in the Notebook tab to view all files present in the repository.
The Git Console option, accessible by clicking at the bottom of the page, allows the user to run Git commands directly. This feature enables the user to execute any Git commands as per their specific requirements.
After completing this process, users can export their scripts to the Data Pipeline module and register them as a job according to their specific requirements.
For instructions on exporting script to the pipeline and registering it as a job, please refer to the link provided below:
Please Note: Files with the .ipynb extension can be exported for use in pipelines and jobs, while those with the .py extension can only be utilized as utility files. Additional information on utility files can be found on the Utility page.
This page provides an overview and summary of the pipeline module, including details such as running status, types, number of pipelines and jobs, and resources used.
This page shows a summary of Pipelines in a graphical manner.
Please go through the below given demonstration for the Pipeline Overview page.
This page contains the following information of the Pipelines in the graphical format:
This section displays the total number of pipelines created, along with their count and percentage, in a tile format for the following running statuses:
Running: Displays the number and total percentage of running pipelines.
Success: Displays the number and total percentage of successfully executed pipelines.
Ignored: Displays the number and total percentage of pipelines where the failure of any component has been ignored.
Failed: Displays the number and total percentage of pipelines where failure has occurred.
Once the user clicks on any of the statuses, all the pipelines related to that particular category are listed along with the following options:
View: Redirects the user to the selected pipeline workspace.
Monitor: Redirects the user to the monitoring page for the selected pipeline.
It displays the total number and percentage of pipelines created for the following resource types in a graphical format:
Low
Medium
High
Once the user clicks on any of the resource types, all the pipelines related to that particular resource type are listed along with the following option:
View: Redirects the user to the selected pipeline workspace.
Please go through the below given demonstration for the Job Overview page.
This tab will open by default once the user navigates to the Pipeline and Job Overview page. It contains the following information of the Jobs in graphical format:
Job Status:
This section displays the total number of Jobs created, along with their count and percentage, in a tile format for the following running statuses:
Running: This section displays the number and total percentage of running jobs.
Success: This section displays the number and total percentage of successfully run jobs.
Interrupted: This section displays the number and total percentage of interrupted jobs.
Failed: This section displays the number and total percentage of failed jobs.
Once the user clicks on any of the statuses, all the jobs related to that particular category are listed along with the following options:
View: Redirects the user to the selected job workspace.
Monitor: Redirects the user to the monitoring page for the selected job.
Job Type:
It displays the total number and percentage of jobs created for the following categories in a graphical format:
Once the user clicks on any of the job types, all the jobs related to that particular job type are listed along with the following options:
View: Redirects the user to the selected job workspace.
Monitor: Redirects the user to the monitoring page for the selected job.
This feature enables users to upload their files (in .py format) to the Utility section of a DS Lab project. Subsequently, users can import these files as modules in a DS Lab Notebook, enabling the use of the uploaded utility functions in scripts.
Prerequisite: To use this feature, the user needs to create a project under the DS Lab module.
Check out the below-given video on how to use utility scripts in the DS Lab module.
Activate and open the project under DS Lab module.
Navigate to the Utility tab in the project.
Click on Add Scripts options.
After that, the user will find two options:
Import Utility: It enables users to upload their own file (.py format) where the script has been written.
Pull from Git: It enables users to pull their scripts directly from their Git repository. In order to use this feature, the user needs to configure their Git branch while creating the project in DS Lab. Detailed information on configuring Git in a DS Lab project can be found here: Git Sync
Utility Name: Enter the utility name.
Utility Description: Enter the description for the utility.
Utility Script: The users can upload their files in the .py format.
Click the Save option after entering all the required information on the Add utility script page, and the uploaded file will list under the Utility tab of the selected project.
Once the file is uploaded to the Utility tab, the user can edit the contents of the file by clicking on the Edit icon corresponding to the file name.
After making changes, the users can validate their script by clicking on the Validate icon on the top right.
Click the Update option to save the latest changes to the utility file.
If the user wants to delete a particular Utility, they can do so by clicking on the Delete icon.
The user can import the uploaded utility file as a module in the DS Lab notebook and use it accordingly.
In the above given image, it can be seen that the "employee.py" file is uploaded in the Utility tab. Now, this file is going to be imported into the DS Lab Notebook and further used in the script.
Use the below-given sample code for the reference and explore the Utility file related features yourself inside the DS Lab notebook.
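Below is a minimal sketch of such a Notebook script (the function name get_employee_details is an assumption; the actual functions depend on what employee.py defines):

import employee  # the utility file (employee.py) uploaded under the Utility tab

def start_job():
    # Hypothetical call into the utility module
    details = employee.get_employee_details("EMP001")
    print(details)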
In the above sample script, the utility file (employee.py) has been imported (import employee) and used in the script for further processing.
After completing this process, the user can export this script to the Data Pipeline module and register it as a job according to their specific requirements.
Please Note:
Refer to the below-given links to get instructions on exporting to the pipeline and registering it as a job:
To apply the changes made in the utility file and get the latest results, the user must restart the notebook's kernel after any modifications to the utility file.
Check out the given walk-through on how the Scheduler option works in the Data Pipeline.
The Scheduler List page opens, displaying all the registered schedulers in a pipeline.
It displays all the previous and next executions of the schedulers.
On the Scheduler List page, users will find the following details:
Scheduler Name: The name of the scheduler component as given in the pipeline.
Scheduler Time: The time set in the scheduler component.
Next Run Time: The next scheduled run time of the scheduler.
Pipeline Name: The name of the pipeline where the scheduler is configured and used. Clicking on this option will directly redirect the user to the selected pipeline.
This page describes how to delete a pipeline/Job or restore a deleted pipeline/Job using the Trash option.
Check out the given illustration on how to use the Trash option provided on the Pipeline homepage left menu panel.
Navigate to the Pipeline Homepage to access the left side menu panel.
Click the Trash icon.
The Trash List page opens, listing all the pipelines/jobs deleted by the logged-in user from their account.
The user gets two options that can be applied to the pipelines/jobs:
Delete
Restore
Please Note: Based on the selected option, the related action will be taken on the concerned pipeline/job.
Navigate to the Trash page.
Select a Pipeline.
Click the Delete icon for the selected Pipeline.
The Delete Pipeline/Job dialog box opens.
Click the Yes option.
The Delete Pipeline Confirmation dialog box opens.
Click the Delete Anyway option.
A notification message appears, and the Pipeline/Job gets permanently deleted for the user.
Please Note: The Trash page at present displays only those pipelines and jobs which have been deleted by the logged-in user from the Pipeline/Job Editor page.
Navigate to the Trash page.
Select a Pipeline.
Click the Restore icon for the selected Pipeline/Job.
The Recover Pipeline/Job dialog box opens.
Click the Yes option.
A notification message appears that the Pipeline/Job has been restored.
The Pipeline gets recovered and is listed on the Pipeline/Job List page.
The Data Channel & Cluster Events page presents a comprehensive list of all Broker Info, Consumer Info, Topic Info, Kafka Version, and all the events used in the pipeline. It allows users to flush/delete the events.
Go through the below-given demonstration for the Data Channel & Cluster Events page.
Navigate to the Pipeline Homepage.
Click the Data Channel & Cluster Events icon.
The Data Channel & Cluster Events page opens.
The list opens displaying the Data Channel & Cluster Events information.
The Data Channel includes the following information:
Broker Info: It will list all Kafka brokers and display the number of partitions used for each broker.
Consumer Info: It will display the number of active and rebalancing consumers.
Topic Info: It will display the number of topics.
Version information: It will display the Kafka version.
The Cluster Events section includes the following information:
On this page, all the pipelines will be listed along with the following details:
Pipeline Name: The name of the pipeline.
Number of Events: The number of Kafka events created in the selected pipeline.
Status: The running status of the pipeline, indicated by Green if active and Red otherwise.
Expand for Events: Click here to expand the selected row for a particular pipeline. This will list all Kafka events associated with the chosen pipeline along with the following information:
Name: Display the name of the Kafka event in the pipeline.
Event Name: Name of the Kafka event.
Partitions: Number of partitions in the Kafka event.
The user gets two options to apply to the listed Kafka events for the pipeline:
Flush All: This will flush all topic data in the selected pipeline.
Delete All: This will delete all topics in the selected pipeline.
Once the user clicks on the Event Name after expanding the row for the selected pipeline, the following information will be displayed on the new page for the selected Kafka Event:
This tab contains the following information for the selected Kafka Event:
Partitions: The number of partitions in the Kafka Event.
Replication Factor: Displays the replication factor of the Kafka topic. This refers to the number of copies of a Kafka topic's data maintained across different broker nodes within the Kafka cluster, ensuring high availability and fault tolerance of data.
Sync Replicas: Displays the number of in-sync replicas of the Kafka topic. In-sync replicas (ISRs) are a subset of replicas fully synchronized with the leader replica for a partition. These replicas have the same latest data as the leader and are capable of taking over as the leader if the current leader fails.
Segment Size: This shows the segment size of the Kafka topic. A segment is a smaller chunk of a partition log file. Segment size refers to the size of these log segments that Kafka uses to manage and store data within a partition. Kafka topics are divided into partitions, and each partition is further divided into segments.
Messages Count: Displays the number of messages in the Kafka topic.
Retention Period: Displays the retention period of the Kafka topic in hours. The retention period of a Kafka topic determines how long Kafka retains the messages in a topic before deleting them.
Additionally, this tab lists all the partition details along with their start and end offset, the number of messages in each partition, the number of replicas for each partition, and the size of the messages held by each partition.
This tab contains the following information for the selected Kafka Event:
Offset: Shows the offset number of the partition. An offset is a unique identifier assigned to each message within a partition.
Partitions: Displays the partition number where the offset belongs.
Time: It mentions the date and time of the message when it was stored at the offset.
Preview: This option helps the user to view and copy the message stored at the selected offset.
This tab shows details of consumers connected to Kafka Topic.
All the Jobs saved by a logged-in user get listed by using this option.
The List Jobs option opens the available Jobs List for the logged-in user. All the Jobs saved by a user are listed on this page. By clicking on a Job name, the Details tab on the right side of the page gets displayed with the basic details of the selected job.
Navigate to the Data Pipeline Homepage.
Click on the List Jobs option.
The List Jobs page opens displaying the created jobs.
Select a Job from the displayed list, and click on it.
This will open a panel containing three tabs:
The Job Details panel displays key metadata: the creator, last updater, last activation timestamp, and last deactivation timestamp. This information helps users track changes and manage their workflows effectively.
Tasks: Indicates the number of tasks used in the job.
Created: Indicates the name of the user who created the job (with the date and time stamp).
Last Activated: Indicates the name of the user who last activated the job (with the date and time stamp).
Last Deactivated: Indicates the name of the user who last deactivated the job (with the date and time stamp).
Cron Expression: A string representing the schedule that specifies when the job should run (a minimal example follows this list).
Trigger Interval: The interval at which the job is triggered (e.g., every 5 minutes).
Next Trigger: Date and time of the next scheduled trigger for the job.
Description: Description of the job provided by the user.
Total Allocated Min CPU: Total Minimum allocated CPU (in Cores)
Total Allocated Min Memory: Total Minimum allocated Memory (MB)
Total Allocated Max CPU: Total Maximum allocated CPU (in Cores)
Total Allocated Max Memory: Total Maximum allocated Memory (MB)
Total Allocated CPU: Total allocated CPU cores.
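As a minimal illustration (not the platform's internal scheduler), a cron expression such as */5 * * * * schedules a run every five minutes; the sketch below uses the third-party croniter package to compute the Next Trigger value from such an expression.

    from datetime import datetime
    from croniter import croniter

    # Hypothetical schedule: run every 5 minutes.
    cron_expression = "*/5 * * * *"
    schedule = croniter(cron_expression, datetime(2024, 1, 1, 10, 2))
    print(schedule.get_next(datetime))   # 2024-01-01 10:05:00 -> the "Next Trigger" value
    print(schedule.get_next(datetime))   # 2024-01-01 10:10:00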
Provides relevant information about the selected job's past runs, including success or failure status.
A Clear option is available to clear the job history.
Click on the Clear button from the History tab.
It will clear all the job run history and logs from the History tab and Sandbox location.
A Refresh icon is provided for refreshing the displayed job history.
In the List Jobs page, the user can view and download the pod logs for all instances by clicking on the View System Logs option in the History tab.
Once the user clicks on the View System Logs option, a drawer panel will open from the right side of the window. The user can select the instance for which the System logs have to be downloaded from the Select Hostname drop-down option.
Clear: It will clear all the job run history from the History tab.
Navigate to the History section for a Job.
Click the Clear option.
A confirmation dialog box appears.
Click the Yes option to apply the clear history.
The job history gets deleted.
If you reopen the History tab for the same job, it will be empty.
The Pin and Unpin feature allows users to prioritize and easily access specific jobs within a list. This functionality is beneficial for managing tasks, projects, or workflows efficiently. This feature is available on each job card on the List Jobs page.
Navigate to the List Jobs page.
Select the jobs that you wish to pin.
Click the Pin icon for the selected Job.
The job gets pinned to the list for easy access and appears at the top of the job list if it is the first pinned job.
You can use the pin icon to pin multiple jobs on the list.
Click the Unpin icon for a pinned job.
The selected job gets unpinned from the list.
The Actions section on the List Jobs page provides a set of actions that can be applied to a Job.
The user can search for a specific Job by using the Search Bar on the Job List.
By typing a name in the Search Bar all the existing jobs containing that word will be listed. E.g., By typing san all the existing Jobs with the word san in it get listed as displayed in the following image:
The Job List has been provided with a header bar containing the categories of the available jobs. The user can also customize the Job List by choosing available filter options in the header.
The Job List gets modified based on the selected category option from the header.
The Recent Runs status indication option provides an at-a-glance view of the five most recent job executions, allowing users to track performance in real time. The status of each listed job (successful, failed, or in progress) is displayed, enabling users to assess system health quickly. By highlighting any issues immediately, this feature allows for proactive troubleshooting and faster response times, helping ensure seamless workflows and minimizing downtime.
The Recent Run categorizes the recently run Jobs into the following statuses:
Succeeded: The job was completed successfully without any errors.
Interrupted: The job was stopped before completion, either by a user action or an external factor.
Failed: The job encountered an error that prevented it from completing successfully.
Running: The job is currently in progress.
The Job Run statuses are displayed with different color codes.
Succeeded: Green
Failed: Red
Interrupted: Yellow
Running: Blue
No Run: Grey
Navigate to the List Jobs page.
The Recent Runs section will be visible for all the listed jobs, indicating the statuses of the five most recent runs.
You may hover on a status icon to get more details under the tooltip.
Users can get a tooltip with additional details on the Job run status by hovering over the status icon in the Recent Run section.
Status: The status of the job run (Succeeded, Interrupted, Failed, Running).
Started At: The time when the job run was started.
Stopped At: This indicates when the job was stopped.
Completed At: The time when the job run was completed.
Please Note:
The user can open the Job Editor for the selected Job from the list by clicking the View icon.
The user can view and download logs only for successfully run or failed jobs. Logs for the interrupted jobs cannot be viewed or downloaded.
This page is for the Advanced configuration of the Data Pipeline. The following configurations are displayed on the Settings page:
Logger
Default Configuration
System Component Status
Data Sync
Job BaseInfo
List Components
This section gives the information and configuration details of the Kafka topic that is set for logging.
This page enables users to view or set the default resource configuration in Low, Medium, and High resource allocation types for pipelines and jobs.
To access the Default Configuration, click on the settings option on the pipeline homepage and select it from the settings page.
There will be two tabs on this page:
Pipeline: The Pipeline tab opens by default.
This tab shows the default configuration set for Spark and Docker components in different resource allocation types (Low, Medium, High) and in different invocation types such as Real-time and Batch.
Users can change this default configuration as per their requirement, and the same default resource configuration will be applied to newly added components when used in the pipeline.
Job: Navigate to this tab to view or set the default configuration for Jobs.
This tab shows the default configuration set for Spark and Docker Jobs in different resource allocation types (Low, Medium, High).
Users can change this default configuration as per their requirement, and the same default resource configuration will be applied to newly created jobs.
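The exact values are deployment-specific, but conceptually the default configuration maps each resource allocation type to CPU and memory limits. The sketch below is only a hypothetical illustration of such a mapping; the field names and numbers are invented and are not taken from the product.

    # Hypothetical default resource configuration, for illustration only.
    default_configuration = {
        "Low":    {"min_cpu": 0.1, "max_cpu": 0.5, "min_memory_mb": 256,  "max_memory_mb": 512},
        "Medium": {"min_cpu": 0.5, "max_cpu": 1.0, "min_memory_mb": 512,  "max_memory_mb": 1024},
        "High":   {"min_cpu": 1.0, "max_cpu": 2.0, "min_memory_mb": 1024, "max_memory_mb": 2048},
    }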
The System Component Status page under the Settings option provides monitoring capability for the health of the System Components.
System logs are essential for monitoring and troubleshooting system pods. They provide a record of system events, errors, and user activities. Here’s an overview of key concepts and types of system logs, along with best practices for managing them.
Navigate to the Logger Details page by clicking the Settings icon from the left menu bar.
Click on the System Component Status Tab. The user will navigate to the System Pod Details page.
Click on any one System Pod. The System Logs drawer will open.
System logs are files that record events and messages from the operating system, applications, and other components.
Search Bar: Users can search the relevant log entries in the search bar.
Time Range Filter: This filter allows users to specify a time frame to view system pod logs. This feature is essential for isolating issues, monitoring performance, or auditing activities over specific periods.
Filter Options: Users get two ways to specify the time range.
Predefined Ranges: Select from commonly used options such as the last few minutes, the last hour, or the last few days. E.g., the Last 24 hours is selected as a Time Range in the following image.
Custom Range: Users can specify a start and end date/time to filter logs using the Calendar.
Select a time range.
It will be displayed as the selected time range.
Click the Apply Time Range option.
The logs within the selected time range will be displayed. Logs may be displayed in a list format, showing timestamps, log levels, and relevant messages.
Log Formats: Logs can be in plain text format.
Log Levels: Logs are often categorized by severity (e.g., DEBUG, INFO, WARN, ERROR, CRITICAL).
Please Note:
Download Logs: You can download the log files of the system logs using the Download Logs icon.
Refresh: You can refresh the system log list using the Refresh option.
The Data Sync option in Settings is provided to configure the Data Sync feature globally. This way, the user can enable a DB connection and use the same connection in the pipeline workflow without using any extra resources.
Please Note: The supported drivers are:
MongoDB
Postgres
MySQL
MSSQL
Oracle
ClickHouse
Snowflake
Redshift
Go to the Settings page.
Click the Data Sync option.
The Data Sync page opens.
Click the Plus icon.
The Create Data Sync Connection dialog box appears.
Specify all the required connection details.
Click the Save option.
Please Note:
Please use the TCP port if you are using ClickHouse as a driver in Data Sync.
A success notification message appears on the top stating that the Data Sync setting has been created successfully.
Enable the action button to activate the Data Sync connection.
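The exact fields in the Create Data Sync Connection dialog depend on the selected driver, but a connection generally needs the driver, host, port, credentials, and database name. The sketch below is purely illustrative (the field names and values are hypothetical); note the native TCP port for ClickHouse, as called out above.

    # Hypothetical Data Sync connection details, for illustration only.
    data_sync_connection = {
        "driver": "ClickHouse",
        "host": "clickhouse.example.internal",
        "port": 9000,                 # ClickHouse native TCP port, not the HTTP port
        "username": "pipeline_user",
        "password": "********",
        "database": "analytics",
    }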
Go to the Settings page.
Click the Data Sync option.
The DB Sync Settings page opens.
Click the Edit Setting icon.
The Edit Settings dialog box opens.
Edit the connection details if required.
Click the Update option.
Go to the Settings page.
Click the Data Sync option.
The Data Sync Settings page opens.
Click on the Disconnect button to disconnect.
The Disconnect Setting confirmation dialog box appears.
Click the DISCONNECT option.
A success notification message appears on the top when DB Sync gets disconnected.
All the Pipelines containing the DB Sync Event component get listed under the Pipeline using the DB Sync section.
The user can see the following information:
Name: Name of the pipeline where Data Sync is used.
Running status: This indicates if the pipeline is active or deactivated.
Actions: The user can view the pipeline where Data Sync is used.
The Pipeline module supports the following types of jobs:
These jobs are configured in the Job BaseInfo page under the Settings menu.
Please Note: The Job BaseInfo has been created from the admin side, and the user is not supposed to create it in the settings menu.
To create a new Job BaseInfo, click on the Create Job BaseInfo option, as shown in the image below:
Once clicked, it will redirect to the page for creating a new Job BaseInfo and the user will be asked to fill in details in the following Tabs:
Basic Information Tab:
Name: Provide a Name for the Job BaseInfo.
Deployment type: Select the Deployment type from the drop-down.
Image Name: Enter the Image name for the Job BaseInfo.
Version: Specify the version.
isExposed: This option will be automatically filled as True once the deployment type is selected.
Job type: Select the job type from the drop-down.
Ports Tab:
Port Name: Enter the Port name.
Port Number: Enter the Port number.
Delete: The user can delete the port details by clicking on this option.
Add Port: The user can add the Port by clicking on this option.
Spark component information:
Main class: Enter the Main class for creating the Job BaseInfo.
Main application file: Enter the main application file.
Runtime Environment Type: Select from the drop-down. There are three options available: Scala, Python, and R.
Now, click on the Save option to create the Job BaseInfo.
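For illustration only, a Spark-type Job BaseInfo typically pairs a main class with a main application file and a runtime environment; the values below are hypothetical examples, not defaults shipped with the product.

    # Hypothetical Spark component information for a Job BaseInfo.
    main_class = "com.example.jobs.SalesAggregation"
    main_application_file = "local:///opt/spark/jars/sales-aggregation_2.12-1.0.jar"
    runtime_environment_type = "Scala"    # one of Scala, Python, or R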
Once the Job BaseInfo is created, the user can redirect to the List Job BaseInfo page by clicking on the List Job BaseInfo icon as shown in the below image:
List Job BaseInfo Page:
This page shows the list of all created Job BaseInfos, as shown in the below image:
The Name column represents the Name of the Job BaseInfo.
The Status column will display the status of the Job BaseInfo.
The Version column displays the version of the Job BaseInfo.
All the components being used in the pipeline are listed on this page.
Please Note: Go through the below-given walk-through for the List Components page.
Check out the steps in the below-shared walk-through to configure the custom component.
Please Note: You should have your Docker image created and stored in the Docker repository where all other pipeline components are present. A little help from DevOps will be required to push those images.
This page provides an overview of all the components used in the pipeline in a single place.
The Pipeline Overview feature enables users to view and download comprehensive information about all the components used in the selected pipeline on a single page. Users can access meta information, resource configuration, and other details directly from the pipeline overview page, streamlining the process of understanding and managing the components associated with the pipeline.
Check out the given demonstration to understand the Pipeline Overview page.
The user can access the Pipeline Overview page by clicking the Pipeline Overview icon as shown in the below given image:
Component List: All the components used in the pipeline will be listed here.
Creation Date: Date and time when the pipeline was created.
Download: The Download option enables users to download all the information related to the pipeline. This comprehensive download includes details such as all components used in the pipeline, meta information, resource configuration, component deployment type, and more.
Edit: The Edit option allows users to customize the downloaded information by excluding specific components as needed.
In the pipeline overview page, components are listed based on their hierarchical level, denoted by Levels 0 and 1. A component at level 1 indicates a dependency on the data from the preceding component, where the previous component serves as the Parent Component. For components without dependencies, designated as Level 0, their Parent Component is set to None, signifying no connection to any previous component in the pipeline.
The Pipeline Workflow Editor contains the Toolbar, the Component Panel, and the Right-side Panel, together with the Design Canvas, for the user to create a pipeline flow.
The Pipeline Workflow consists of three main elements:
Please Note: Please find the basic Workflow to ingest data using the Pipeline given below:
The above-given workflow shows the basic workflow to ingest data into a database using the Data Pipeline. It can be seen in the above workflow that data is read from a source using a reader component (DB Reader) and is then written to a destination location using a writer component (DB Writer).
In the successive sections, the user can find the detailed working of the pipeline workflow design and the several pipeline components.
The Pipeline Editor contains a Toolbar at the top right side of the page with various options to be applied to a Pipeline workflow.
The Toolbar can be expanded or some of the options can be hidden by using the Show Options or Hide Options icons.
Hide Options: By clicking this icon some of the Toolbar icons get hidden.
Show Options: By clicking this icon all the Toolbar icons get listed.
This feature helps the user to search a specific component across all the existing pipelines. The user can drag the required components to the pipeline editor to create a new pipeline workflow.
Click the Search Component in pipelines icon from the header panel of the Pipeline Editor.
The Object Browser window opens displaying all the existing pipeline workflows.
The user can search a component using the Search Component space.
The user gets prompt suggestions while searching for a component.
Once the component name is entered, the pipeline workflows containing the searched component get listed below.
The user can click the expand/ collapse icon to expand the component panel for the selected pipeline.
The user can drag a searched component from the Object Browser and drop it to the Pipeline Editor canvas.
The Test Suite module helps the developers to create a unit test for every component in the pipeline. For every test case, input data and expected output data can be uploaded; these are then compared with the actual output generated by the component.
Check out the below-given walk through to understand the Pipeline Testing functionality.
The Test suite provides the following comparisons:
Compare the number of rows generated with the given output.
Compare the number of columns with the given output.
Compare the actual data with the given output.
Validate the schema against the given schema.
Navigate to the Pipeline toolbar.
Click on Pipeline Testing.
The Test Framework page opens displaying details of the selected pipeline.
Search Component: A Search bar is provided to search all components associated with that pipeline. It helps to find a specific component by inserting the name in the Search Bar.
Component Panel: It displays all the components associated with that pipeline.
Create Test Case: A new test case can be created by clicking on the Create Test Case icon.
Click the Create Test Case icon.
The Test Case opens. Here, the user can create a new test case for the selected component from the component panel.
Test Name: Enter the test case name.
Test Description: Enter the description of the test case.
Input Data File: Upload the input data file. It is required for the transformation and writer components (a sample set of test-case files is sketched after this section).
Output Data File: Upload the expected output data file.
Schema File: Upload the schema of the expected output data.
Input Data Type: It supports the JSON type.
Assertion Method: It supports the equals assertion method.
Sort: It orders the column values in the Actual Output. The user can sort string and integer column values.
Comparison Logic: It contains four types of comparison logic:
Compare Number of Columns: It compares input data columns with output data columns.
Compare Data: It compares input data with output data.
Compare Number of Rows: It compares input data rows with output data rows.
Check Data Matches Schema: It checks the uploaded schema with expected output data.
Please Note:
Cross: This icon will close the create test case pop-up.
Cancel: This button will close the create test case pop-up.
Save: This button will save the test case and the test case will list in the Test Case list.
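To make the uploaded files concrete, here is a minimal, hypothetical sketch (in Python notation, with invented field names) of what an input record, the expected output after a simple type-cast transformation, and a schema for the expected output might contain. It is only an illustration of the idea, not the exact file format required by the product.

    # Hypothetical test-case artifacts, shown inline for illustration.
    input_data = [{"order_id": 101, "amount": "250"}]

    expected_output = [{"order_id": 101, "amount": 250.0}]   # e.g., after a type-cast transform

    expected_schema = {
        "type": "object",
        "properties": {
            "order_id": {"type": "integer"},
            "amount": {"type": "number"},
        },
        "required": ["order_id", "amount"],
    }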
Run Test Cases: It will run single and multiple test cases for the selected component by clicking on the Run Test Cases button.
Test Cases: It displays the created test cases list for the selected component.
It displays the following details:
Checkbox: The user can select multiple or single test cases while running the test cases.
Test Name: It displays the name of the test case.
Test Case Description: It displays the description of the test case.
Created Date: It displays the created date of the test case.
Delete Icon: The user can remove the test case by clicking on the delete icon.
The user can update the test case details under this tab.
Test Case Name: The user can change the test case name.
Test Description: The user can change the description of the test case.
Output Schema File: The user can change the schema of the expected output data by clicking the upload icon, view the schema by clicking the view icon, or remove the schema by clicking the cross icon.
Sort Column Name: The user can change the sorting column name.
Update: By clicking on the Update button, the user can update the test case details.
Last Updated Date: It displays the last updated date of the test case.
The user can check the existing input data by clicking on this tab. It contains the Shrink, Expand, Upload, and Remove icons.
Shrink: It shrinks the input data rows.
Expand: It expands the input data rows.
Upload: The user can upload an input data file by clicking on the Upload button.
Remove: The user can remove the input data file by clicking on the Remove button.
The user can check the existing expected output data under this tab. It contains the Shrink, Expand, Upload, and Remove icons.
Shrink: It shrinks the expected output data rows.
Expand: It expands the expected output data rows.
Upload: The user can upload an expected output data file by clicking on the Upload button.
Remove: The user can remove the expected output data file by clicking on the Remove button.
It displays the latest and previous reports of the selected test case.
Reports: It displays the latest report of each test case for the selected component, including the test case name, component version, comparison logic, and run date.
It displays the log details of the component.
It displays the component pods if the user runs a test case.
The Version Control feature has been provided for the user to maintain a version of the pipeline while the same pipeline undergoes further development and different enhancements.
The Push & Pull Pipeline from GIT features are present on the List Pipelines and the Pipeline Editor pages.
Navigate to the Pipeline Editor page for a Pipeline.
Click the Push & Pull Pipeline icon for the selected data pipeline.
The Push/Pull dialog box appears.
Provide a Commit Message (required) for the data pipeline version.
Select a Push Type out of the below-given choices to push the pipeline:
Version Control: For versioning of the pipeline in the same environment.
GIT Export (Migration): This is for pipeline migration. The pushed pipeline can be migrated to the destination environment from the migration window in the Admin Module.
Click the Save option.
A notification message appears to confirm the completion of the action.
Please Note:
The user also gets an option to Push the pipeline to GIT. This action will be considered as Pipeline Migration.
This feature is for pulling the previously moved versions of a pipeline that are committed by the user. This can help a user significantly to recover the lost pipelines or avoid unwanted modifications made to the pipeline.
Navigate to the Pipeline Editor page.
Select a data pipeline from the displayed list.
Click the Push & Pull Pipeline icon for the selected data pipeline.
Select Pull From VCS option.
The Push/Pull dialog box appears.
Select the data pipeline version by marking the given checkbox.
Click the Save option.
A confirmation message appears to assure the users that the concerned pipeline workflow has been imported.
Another confirmation message appears to assure the user that the concerned pipeline workflow has been pulled.
Please Note:
The pipeline that you pull will be changed to the selected version. Please make sure to manage the versions of the pipeline properly.
The List Pipelines option opens the available Pipeline List for the logged-in user. All the saved pipelines by a user are listed on this page. Please refer to the below image for reference:
Please Note:
If the logged-in user has Admin access, the user can see all the pipelines created by all users.
The non-admin users get to access the list of pipelines created by them or shared with them by the other users.
The Pin and Unpin feature allows users to prioritize and easily access specific pipelines within a list. This functionality is beneficial for managing tasks, projects, or workflows efficiently. This feature is available on each pipeline card in the list view.
Navigate to the List Pipelines page.
Select a Pipeline from the list that you wish to pin.
Click the Pin icon provided for the selected pipeline.
The Pipeline gets pinned to the list and it will be moved to the top of the list if it is the first pinned pipeline.
The user can pin multiple pipelines to the list.
Click the Unpin icon for a pinned pipeline.
The pipeline will be unpinned from the list.
Navigate to the Pipeline List page.
Select a data pipeline from the displayed list.
Click the Push & Pull Pipeline icon for the selected data pipeline.
Please Note: The Push & Pull Pipeline features are available on the List Pipeline and the Pipeline Editor pages.
The Push/ Pull drawer appears. It displays the name of the selected Pipeline. E.g., In the following image the Push/ Pull Restaurant Analysis WK5 heading, the Restaurant Analysis WK5 is the name of the selected Pipeline.
Provide a Commit Message (required) for the data pipeline version.
Select a Push Type out of the below-given choices to push the pipeline:
Version Control: For versioning of the pipeline in the same environment. In this case, the selected Push Type is Version Control.
GIT Export (Migration): This is for pipeline migration. The pushed pipeline can be migrated to the destination environment from the migration window in the Admin Module.
Click the Save option.
Based on the selected Push Type the pipeline gets moved to Git or VCS, and a notification message appears to confirm the completion of the action.
Check out the illustrations below on the Version Control and Pipeline Migration functionalities.
Version Control:
Pipeline Migration:
Please Note:
The pipeline pushed to the VCS using the Version Control option, can be pulled directly from the Pull Pipeline option.
The Pull feature helps users pull the previously moved versions of a pipeline from the VCS. Thus, it can help the users significantly to recover the lost pipelines or avoid unwanted modifications made to the pipeline.
Check out the walk-through on how to pull a pipeline version from the VCS.
Navigate to the Pipeline List page.
Select a data pipeline from the displayed list.
Click the Pull from GIT icon for the selected data pipeline.
The Push/ Pull drawer opens with the selected Pipeline name.
Select the data pipeline version by marking the given checkbox.
Click the Save option.
A notification appears to inform that the selected pipeline workflow is imported.
Please Note: The pipeline you pull will be changed to the selected version. Please make sure to manage the versions of the pipeline properly.
Clicking on the View icon will direct the user to the pipeline workflow editor page.
Navigate to the Pipeline List page.
Select a Pipeline from the list.
Click the View icon.
The Pipeline Editor page opens for that pipeline.
Please Note: The user can open the Pipeline Editor for the selected pipeline from the list by clicking the View icon or the Pipeline Workflow Editor icon on the Pipeline List page.
The user can search for a specific pipeline by using the Search bar on the Pipeline List. By typing a common name all the existing pipelines having that word will be listed. E.g., By typing the letters 'resta' all the existing pipelines with those letters get listed.
Click on the pipeline to view the information of the pipeline on the list pipeline page. Once clicked on the pipeline name, a menu will open on the right side of the screen showing all the details of the selected pipeline.
Please look at the demonstration to understand how the Pipeline details are displayed on the List Pipelines page.
By clicking on a pipeline name, the following details of that pipeline will appear below in two tabs:
The Pipeline Details panel displays key metadata: the creator, last updater, last activation timestamp, and last deactivation timestamp. This information helps users track changes and manage their workflows effectively.
Components: Number of components used in the pipeline.
Created: Indicates the Pipeline owner's name (with Date and Time stamp).
Updated: Name of the person who has updated the pipeline (with Date and Time stamp).
Last Activated: Displays the person's name who last activated the pipeline (with Date and Time stamp).
Last Deactivated: Displays the person's name who last deactivated the pipeline (with Date and Time stamp).
Description: This field will show the description of the pipeline if given by the pipeline owner.
Total Component Config
Total Allocated Max CPU: The maximum allocated CPU in cores.
Total Allocated Min CPU: The minimum allocated CPU in cores.
Total Allocated Max Memory: The maximum allocated memory in MB.
This section displays the components that have failed or been ignored in the selected pipeline workflow.
Users can click the Failed Components & Ignored Components options to display the failed or ignored component names. The information icon provided next to the names of the Failed or Ignored components will redirect the user to get more information about those components.
Navigate to the Failed and Ignored Components section for a pipeline.
Click the Ignore Failure Components icon provided for Failed Components.
The Confirm dialog box opens.
Use the Comment space to comment on the Failed component.
Click the Yes option to ignore the Failed Components.
Please Note: The Failed and Ignored Components tab will be empty if the pipeline flow is successful.
Check out the illustration on the Pipeline Component Configuration page.
Navigate to the Pipeline List page.
Select a Pipeline from the list.
Click the Pipeline Component Configuration icon.
The configuration page will list all the components used in the selected pipeline. The Basic Info tab opens by default.
Click the Configuration tab to access the pipeline components' configuration details. The user may modify the details and update it using the Save option.
Please Note:
The user can edit the basic or configuration details for all the pipeline components.
The Pipeline Component Configuration page also displays icons to access Pipeline Testing, Failure Analysis, Open Pipeline (Editor Workspace), and List Pipeline.
Users can share a pipeline with one or multiple users and user Groups using the Share Pipeline option.
Check out the following walk-through on how to share a pipeline with user/ user group and exclude the user from a shared pipeline.
Click the List Pipelines icon to open the Pipeline List.
Select a Pipeline from the pipeline list.
Click on the Share Pipeline icon.
The Share Pipeline window opens.
Select an option from the given choices: User (the default tab) and User Group or Exclude User (the Exclude User option can be chosen if the pipeline is already shared with a user/user group and you wish to exclude them from the privilege).
Select a user or user(s) from the displayed list of users (in case of the User Group(s) tab, it displays the names of the User Groups).
Click the arrow to move the selected User(s)/ User Group(s).
The selected user(s)/user group(s) get moved to the box given on the right (in case of the Exclude User option, the user moved to the right-side box loses access to the shared pipeline).
Click the Save option.
Privilege for the selected pipeline gets updated and the same gets communicated through a message.
By completing the steps mentioned above, the concerned pipeline gets successfully shared with the selected user/user group or the selected users can also be excluded from their privileges for the concerned pipeline.
Please Note:
An admin user can View, Edit/Modify, Share, and Delete a shared pipeline.
A non-admin user can View and Edit/Modify a shared Pipeline, but cannot share it further or delete it.
By clicking the Pipeline Monitor icon, the user will be redirected to the Pipeline Monitoring page.
Navigate to the List Pipelines page.
Select a Pipeline from the list.
Click the Pipeline Monitor icon.
The user gets redirected to the Pipeline Monitoring page of the selected Pipeline. The Monitor tab opens by default.
Click the expand icon to display the monitor page in detail.
The user may click on the Data Metrics and System Logs tabs to get more details on the pipeline.
Failure Analysis is a central failure-handling mechanism that helps the user identify the reason for a failure. The failure data of any pipeline is stored at a particular location (collection), from where the failed data can be queried in the Failure Analysis UI. It displays the failed records, cause, event time, and pipeline ID.
Check out the illustration to access the Failure Analysis for a selected Pipeline.
Navigate to the Pipeline List page.
Select a Pipeline with Failed status from the list.
Click on the Failure Analysis icon.
The user gets redirected to the Failure Analysis page.
The user can further get the failure analysis of the specific component that failed in the pipeline workflow.
Use the ellipsis icon to get redirected to the Monitoring and Data Metrics options for a component.
Navigate to the List Pipelines page.
Select a Pipeline from the list.
Click the Delete Pipeline icon for the pipeline.
The Confirm dialog box appears to confirm the action.
Click the YES option to confirm the deletion.
The selected Pipeline gets removed from the list.
The user can activate/deactivate the pipeline by clicking on the Activate/ Deactivate icon.
Click the List Pipelines icon to open the Pipeline list.
Select a Pipeline from the pipeline list.
Click on the Activate Pipeline icon.
The Confirmation dialog box opens to get the confirmation for the action.
Click the YES option.
A confirmation message appears.
The pipeline gets activated. The Activate Pipeline icon turns into Deactivate Pipeline.
The Status option displays UP and turns green in color.
Click the Deactivate Pipeline icon to deactivate the pipeline.
The Confirmation dialog box opens to get the confirmation for the action.
Click the YES option.
A confirmation message appears.
The pipeline gets deactivated. The Deactivate Pipeline icon turns into Activate Pipeline.
The same gets communicated through the color of the Status icon which displays OFF and turns Gray.
Click the External Libraries icon.
Monitor: Redirects the user to the Job Monitoring page for the selected job.
Updated: Indicates the user's name who updated the job (with the date and time stamp).
Total Allocated Memory: Total allocated memory in megabytes (MB).
Please Note: The Actions section also includes the , and options for a job to perform the said actions.
Please Note: The configured Driver from the Settings page provided for Data Sync can be accessed inside a while using it in a Pipeline.
The user can see the details of a Job BaseInfo by clicking on the View icon. Once clicked, the details of the Job BaseInfo provided while creating that Job are displayed.
There is a provision to create a component. At the top right corner, we have a Create Component icon. Clicking on this takes you to a different page where you can configure your component.
The pipeline pushed to the VCS using the Version Control option, can be pulled directly from the Pull Pipeline from GIT icon.
Refer Migrating Pipeline described as a part of the (under the Administration section) on how to pull an exported/migrated Pipeline version from the GIT.
The user also gets an option to Push the pipeline to GIT. This action will be considered as Pipeline Migration. The user needs to follow the steps given in the Admin module of the Platform to pull a version that has been pushed to GIT.
Total Allocated Min Memory: The minimum allocated memory in MB.
Please refer to this link for more details on the .
Please Note: Check out the page for more details.
Please Note: Refer to the for more details on this.
Push Job: Enables the user to push a Job to the VCS.
View: Redirects the user to the Job Editor page.
Share: Allows the user to share the selected job with other user(s) or user group(s).
Job Monitoring: Redirects the user to the Job Monitoring page.
Edit: Enables the user to edit any information about the job. This option will be disabled when the job is active.
Delete: Allows the user to delete the job. The deleted job will be moved to Trash.
Opens the Pipeline Overview page.
Redirects to the Pipeline Testing page.
Updates the version of the components.
Helps to search for a component across Pipelines through the Object Browser panel.
Allows pushing & pulling the pipeline to/from VCS or GIT.
Opens the Pipeline in full screen.
Opens or closes the Log panel.
Opens or closes the Event panel.
Activates or deactivates the Pipeline.
Updates the Pipeline.
Redirects to the Failure Analysis page.
Redirects to the Pipeline Monitoring page.
Deletes the Pipeline and moves it to the Trash folder.
Redirects to the Pipeline List page.
Redirects to the Settings page.
Redirects to the Pipeline Component Configuration page.
Redirects the user to the Failure Alert History window.
Formats the pipeline components in an arranged manner.
Zooms in on the pipeline workspace.
Zooms out of the pipeline workspace.
The user can access the Event Panel to create a new Event. We have two options in the Toggle Event Panel:
Private (Event/ Kafka Topic)
Data Sync
The user can create an Event (Kafka Topic) that can be used to connect two pipeline components.
Navigate to the Pipeline Editor page.
Click the Event Panel icon.
The Event panel opens.
Click the Add New Event icon.
The New Event dialog box opens.
Enable the Event Mapping option to map the Event.
Provide the required information.
Slide the given button to enable the event mapping.
Provide a display name for the event (A default name based on the pipeline name appears for the Event).
Select the Event Duration from the drop-down menu (It can be set from 4 to 168 hours as per the given options).
Number of partitions (you can choose from 1 to 50).
Number of outputs (you can choose from 1 to 3; the maximum number of outputs must not exceed the number of partitions). A sketch of the equivalent Kafka topic settings follows these steps.
Enable the Is Failover? option if you wish to create a failover Event.
Click the Add Event option to save the new Event.
A confirmation message appears.
The new Event gets created and added to the Event Panel.
Drag and drop the Event from the Event Panel to the workflow editor.
You can drag a pipeline component from the Component Panel.
Connect the dragged component to the dragged Event to create a pipeline flow of data.
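Under the hood an Event corresponds to a Kafka topic, so the dialog settings map naturally onto topic properties. Purely as an illustration (not the platform's own code), the sketch below creates a comparable topic with the third-party kafka-python package, using placeholder names and a 48-hour retention for the Event Duration.

    from kafka.admin import KafkaAdminClient, NewTopic

    # Placeholder broker address and event/topic name.
    admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
    topic = NewTopic(
        name="sample-pipeline-event",
        num_partitions=3,                                            # "Number of partitions" in the dialog
        replication_factor=1,
        topic_configs={"retention.ms": str(48 * 60 * 60 * 1000)},    # 48-hour Event Duration
    )
    admin.create_topics([topic])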
The user can directly read the data with the reader and write to a Data Sync.
The user can add a new Data Sync from the toggle Event Panel to the workflow editor by clicking on the '+' icon.
Specify the display name and connection ID, and click on Save.
Drag and drop the Data Sync from event panel to workflow editor.
Please Note: Refer to the Events [Kafka and Data Sync] page for more details on the DB Sync topic provided under the Connecting Components section of this document.
The Full Screen icon presents the Pipeline Editor page in the full screen.
Navigate to the Pipeline Workflow Editor page.
Click the Full Screen icon from the toolbar.
Clicking on the Update icon allows users to save the pipeline. It is recommended to update the pipeline every time you make changes in the workflow editor.
On a successful update of the pipeline, you get a notification as given below:
Please Note: On any Failures the users get a notification through the below-given error message.
Users can update the version of the used pipeline components through this icon.
This option allows us to update the Pipeline components to their latest versions.
Navigate to the Pipeline Editor page.
Click the Update Component Version icon. The icon will display a red dot indicating that an updated component version is available for the selected pipeline workflow.
A confirmation dialog box appears.
Click the YES option.
A notification message appears to confirm that the component version is updated.
The Update Component Version gets disabled to indicate that all the pipeline components are up to date.
The user can activate/deactivate the pipeline by clicking on the Activate/Deactivate icon as shown in the image below:
Activation will deploy all the components based on their respective invocation types. When the pipeline is deactivated all the components go down and will halt the process.
Please Note:
The user will get confirmation messages while clicking the Activate and Deactivate icon respectively.
The user can get the Activate and Deactivate options on the Pipeline List page as well.
The Toggle Log Panel displays the Logs and Component Status tabs for the Pipeline/Job Workflows.
Navigate to the Pipeline Editor page.
Make sure the Pipeline is in the active state (Activate the Pipeline).
Click the Log Panel icon on the Pipeline.
A Log panel toggles displaying the collective component logs of the pipeline/Job under the Logs tab.
Select the Component Status tab from the Log panel to display the status of the component containers. By selecting the Open All option, it will list all the components.
Select the Job Status tab from the Log panel to display the status of the pod of the selected Job.
This feature provides the capability to kill all Orphan Pods associated with any component in the pipeline/Jobs if they persist after deactivation. Orphan Processes are the processes that remain active in the backend even after deactivating the pipeline.
The user can access the Kill Orphan Processes option under the Component Status tab of the Log Panel.
The feature that allows users to configure all the components of a pipeline on a single page is available on the list pipeline page. With this feature, users no longer have to click on each individual component to configure it. By having all the relevant configuration options on a single page, this feature reduces the time and number of clicks required to configure the pipeline components.
Click the Pipeline Component Configuration icon from the header panel of the Pipeline Editor. The user can access this option either from the pipeline toolbar or from the List Pipelines page.
All the components used in the selected pipeline will be listed on the configuration page.
The following information is displayed at the top of the Pipeline Component Configuration page.
Pipeline: Name of the pipeline.
Status: It indicates the running status of the pipeline. 'True' indicates the Pipeline is active, while 'False' indicates inactivity.
Total Allocated Max CPU: Maximum allocated CPU in cores.
Total Allocated Min CPU: Minimum allocated CPU in cores.
Total Allocated Max Memory: Maximum allocated Memory in MB.
Total Allocated Min Memory: Minimum allocated Memory in MB.
The user will find two tabs on the Configuration Page:
Basic Info
Configuration
In the Basic Info tab, the user can configure basic information for the components such as invocation type, batch size, description, or intelligent scaling.
On the Configuration tab, the user can provide resources such as Memory and CPU to the components, as well as set the number of minimum and maximum instances for the components.
Arrange the components used in the Pipeline/Job Workflow.
This feature enables users to arrange the components/tasks used in the Pipeline/Job in a formatted manner.
Please see the given video to understand the Format Flowchart option.
Follow these steps to format the flowchart in the pipeline:
Go to the pipeline toolbar and click on Show Options to expand the toolbar.
Click on the Format Flowchart option to arrange the pipeline in the formatted way.
The Zoom In/Zoom Out feature enables users to adjust the pipeline workflow editor according to their comfort, providing the flexibility to zoom in or zoom out as needed.
Please go through the below given walk through for Zoom In/Zoom Out feature.
The user can delete their pipeline using this feature, accessible from either the List Pipelines page or the Pipeline Workflow Editor page.
Navigate to the Pipeline Workflow Editor page.
Click the Delete icon.
A dialog box opens to assure the deletion.
Select the YES option.
A notification message appears.
The user gets redirected to the Pipeline List page and the selected Pipeline gets removed from the Pipeline List.
Please Note: All the Pipelines that are deleted from the Pipeline Editor page get listed on the Trash page.
This feature will display the failure history of all the components used in the pipeline.
Click the Pipeline Failure Alert History icon from the header panel of the Pipeline Editor.
A panel window will open from the right side, displaying the failure history of the components used in the pipeline.
The Failure Analysis page allows users to analyze the reasons for the failure of the component used in the pipeline.
Check out the below given walk-through for failure analysis in the Pipeline Workflow editor canvas.
Navigate to the Pipeline Editor page.
Click the Failure Analysis icon.
The Failure Analysis page opens.
Search Component: A Search bar is provided to search all components associated with that pipeline. It helps to find a specific component by inserting the name in the Search Bar.
Component Panel: It displays all the components associated with that pipeline.
Filter: By default, the selected component instance Id will be displayed in the filter field. Records will be displayed based on the instance id of the selected component. It filters the failure data based on the applied filter.
Please Note: The filter format for some of the field value types is given below.
String: data.data_desc:"ignition"
Integer: data.data_id:35
Float: data.lng:95.83467601
Boolean: data.isActive:true
Project: By default, the pipeline_Id and _id fields are selected from the records. To exclude or include any other field, set that field to 0/1 (0 to exclude and 1 to include); only the included columns are displayed.
Please Note: For example, data.data_id:0, data.data_desc:1 excludes data.data_id and includes data.data_desc.
Sort: By default, records are displayed in descending order based on the “_id” field. Users can switch to ascending order by choosing the Ascending option.
Limit: By default, 10 records are displayed. Users can modify the record limit according to the requirement. The maximum limit is 1000.
Find: Clicking the Find button applies the filter, sort, and limit to the records and projects the selected fields.
Reset: If the user clicks the Reset button, all the fields are reset to their default values.
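Putting the options above together, a hypothetical set of Failure Analysis inputs might look like the following sketch (the values are invented; the projection uses 0 to exclude and 1 to include a field, and the sort is descending by default).

    # Hypothetical Failure Analysis query inputs, for illustration only.
    filter_value = "data.data_id:35"                      # filter on a field of the failed record
    project = {"data.data_id": 0, "data.data_desc": 1}    # 0 = exclude, 1 = include
    sort_order = "Descending"                             # default; can be switched to Ascending
    limit = 100                                           # up to a maximum of 1000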
Cause: The cause of the failure gets displayed by a click on any failed data.
The component failure is indicated by a red color flag in the Pipeline Workflow. The user gets redirected to the Failure Analysis page by clicking on the red flag.
Navigate to any Pipeline Editor page.
Create or access a Pipeline workflow and run it.
If any component fails while running the Pipeline workflow, a red color flag pops up on the top right side of the component.
Click the red flag to open the Failure Analysis page.
By clicking the ellipsis icon for the failed component from the Failure Analysis page, the user gets options to open Monitoring or Data Metrics page for the component.
The cause of the failure also gets highlighted in the log.
Activating a pipeline deploys its components.
The pipelines can be activated from two places: by clicking the Activate icon on the Pipeline List page or from the Workflow Editor toolbar.
Once the pipeline gets activated, all the pods are listed in the advanced logs.
Once the pipeline is activated, the components get deployed and the list of deployed components can be seen in the Advanced Logs panel.
The Pipeline Workflow Editor page of the Data Pipeline contains a Component Panel on the left side of the page. The Component Panel lists System and Custom components of the Data Pipeline.
The Component Panel displays a categorized list of all the pipeline components. The Components are majorly divided into two groups:
System
Custom
System Components- The pre-designed pipeline components are listed under the System tab.
Custom Components- The Custom tab lists all the customized pipeline components created by the user.
The Components get grouped based on the component type. E.g., all the reader components are provided under the Reader menu tab of the Component Palette.
A search bar is provided to search across the 50+ components. It helps to find a specific component by inserting the name in the Search Bar.
Navigate to the Search bar provided in the Component Palette.
Type in the given search bar.
The user gets prompt suggestions for the searched components. E.g., While searching hd, it lists HDFS Reader and HDFS Writer components.
Select an option from the prompted choices.
The searched component appears under the System tab. (E.g., HDFS Writer component as displayed in the below image).
The right-side panel on the Pipeline Editor page gets displayed for some of the Pipeline Toolbar options.
The options for which a panel appears on the right side of the Pipeline Workflow Editor page are listed below:
This feature allows users to update the older version of the component with the latest version in the pipeline.
Follow the below given steps to update the component version in the pipeline:
Navigate to the pipeline toolbar panel.
Click on the Update Component Version option.
After clicking on the option, all components with older versions in the pipeline will be updated to the latest version. A success message will appear stating, "Components Version Updated Successfully".
A message will appear stating, "All components are up to date, no updates available" if all components already have the latest version.
The Test Suite module helps the developers to create a unit test for every component in the pipeline. For every test case, input data and expected output data can be uploaded; these are then compared with the actual output generated by the component.
Check out the below-given walk through to understand the Pipeline Testing functionality.
The Test suite provides the following comparisons:
Compare the number of rows generated with the given output.
Compare the number of columns with the given output.
Compare the actual data with the given output.
Validate the schema against the given schema.
Navigate to the Pipeline Workflow Editor page. Click the Test Pipeline icon on the Header panel.
The Test Framework page opens displaying details of the selected pipeline.
Search Component: A Search bar is provided to search all components associated with that pipeline. It helps to find a specific component by inserting the name in the Search Bar.
Component Panel: It displays all the components associated with that pipeline.
Create Test Case: User can create a test case by clicking on the Create Test Case icon.
Click the Create Test Case icon.
The Test Case opens. Here, the user can create a new test case for the selected component from the component panel.
Test Name: Enter the test case name.
Test Description: Enter the description of the test case.
Input Data File: Upload the input data file. It is required for the transformation and writer components.
Output Data File: Upload the expected output data file.
Schema File: Upload the schema of expected output data.
Input Data Type: It supports the JSON type.
Assertion Method: It supports the equals assertion method.
Sort: It orders the column values in the Actual Output. The user can sort string and integer column values.
Comparison Logic: It contains four types of comparison logic:
Compare Number of Columns: It compares input data columns with output data columns.
Compare Data: It compares input data with output data.
Compare Number of Rows: It compares input data rows with output data rows.
Check Data Matches Schema: It checks the uploaded schema with expected output data.
Please Note:
Cross: This icon will close the create test case pop-up.
Cancel: This button will close the create test case pop-up.
Save: This button will save the test case and the test case will list in the Test Case list.
Run Test Cases: It will run single and multiple test cases for the selected component by clicking on the Run Test Cases button.
Test Cases: It displays the created test cases list for the selected component.
It displays the following details:
Checkbox: The user can select multiple or single test cases while running the test cases.
Test Name: It displays the name of the test case.
Test Case Description: It displays the description of the test case.
Created Date: It displays the created date of the test case.
Delete Icon: The user can remove the test case by clicking on the delete icon.
The user can update the test case details under this tab.
Test Case Name: The user can change the test case name.
Test Description: The user can change the description of the test case.
Output Schema File: The user can change the schema of the expected output data by clicking the upload icon, view the schema by clicking the view icon, or remove the schema by clicking the cross icon.
Sort Column Name: The user can change the sorting column name.
Update: By clicking on the Update button, the user can update the test case details.
Last Updated Date: It displays the last updated date of the test case.
The user can check the existing input data by clicking on this tab. It contains the Shrink, Expand, Upload, and Remove icons.
Shrink: It shrinks the input data rows.
Expand: It expands the input data rows.
Upload: The user can upload an input data file by clicking on the Upload button.
Remove: The user can remove the input data file by clicking on the Remove button.
The user can check the existing expected output data under this tab. It contains the Shrink, Expand, Upload, and Remove icons.
Shrink: It shrinks the expected output data rows.
Expand: It expands the expected output data rows.
Upload: The user can upload an expected output data file by clicking on the Upload button.
Remove: The user can remove the expected output data file by clicking on the Remove button.
It displays the latest and previous reports of the selected test case.
Reports: It displays the latest report of each test case for the selected component, including the test case name, component version, comparison logic, and run date.
It displays the log details of the component.
It displays the component pods if the user runs a test case.
This page explains how to monitor the Pipelines.
The user can monitor a pipeline together with all the components associated with the same by using the Pipeline Monitoring icon. The user gets information about Pipeline components, Status, Types, Last Activated (Date and Time), Last Deactivated (Date and Time), Total Allocated and Consumed CPU%, Total allocated and consumed memory, Number of Records, and Component logs all displayed on the same page.
Go through the below-given video to get a basic idea on the pipeline monitoring functionality.
Navigate to the Pipeline List page.
Click the Monitor icon.
Or
Navigate to the Pipeline Workflow Editor page.
Click the Pipeline Monitoring icon on the Header panel.
The Pipeline Monitoring page opens displaying the details of the selected pipeline.
The Pipeline Monitoring page displays the following information for the selected Pipeline:
Pipeline: Name of the pipeline.
Status: Running status of the pipeline. 'True' indicates the pipeline is active, while 'False' indicates inactivity.
Last Activated: Date and time when the pipeline was last activated.
Last Deactivated: Date and time when the pipeline was last deactivated.
Total Allocated CPU: Total allocated CPU in cores.
Total Allocated Memory: Total allocated memory in MB.
Total Consumed CPU: Total consumed CPU by the pipeline in cores.
Total Consumed Memory: Total consumed memory by the pipeline in MB.
Component Name: Name of the component used in the pipeline.
Running: The running status of the component, displayed as 'UP' if the component is running, otherwise 'OFF'.
Type: Invocation type of the component. It may be either Real-time or Batch.
Instances: Number of instances used in the component.
Last Processed Size: Size of the batch (in MB) that was last processed.
Last Processed Count: Number of processed records in the last batch.
Total Number of Records: Total number of records processed by the component.
Last Processed Time: Last processed time of the instance.
Host Name: Name of the instance of the selected component.
Min CPU Usage: Minimum CPU usage in cores by the instance.
Max CPU Usage: Maximum CPU usage in cores by the instance.
Min Memory Usage: Minimum memory usage in MB by the instance.
Max Memory Usage: Maximum memory usage in MB by the instance.
CPU Utilization: Total CPU utilization in cores by the instance.
Memory Utilization: Total memory utilization in MB by the instance.
There will be three tabs in the monitoring page.
Monitor: In this tab, it will display information such as the resources allocated, minimum/maximum resource consumption, instances provisioned, the number of records processed by each component, and their running status.
Data Metrics: Data Metrics will show the number of consumed records, processed records, failed records, and the corresponding failed percentage over a selected time window.
System Logs: In this tab, the user can see the pod logs of every component in the pipeline.
Once the user clicks on any instance, the page will expand to show the graphical representation of CPU usage, Memory usage and Records Per Process Execution over the given interval of time. For reference, please see the images given below:
Please Note: The Records Per Process Execution metric showcases the number of records processed from the previous Kafka Event. If the component is not linked to the Kafka Event, the displayed value will be 0.
The Monitor tab opens by default on the monitoring page.
If there are multiple instances for a single component, click on the drop-down icon.
Details for each instance will be displayed.
Monitoring page for Docker component in Real-Time
Monitoring page for Docker component in Batch:
Monitoring page for Spark Component:
Monitoring page for Spark Component - Driver:
Monitoring page for Spark Component - Executor:
If memory allocated to the component is less than required, then it will be displayed in red color.
Open the Data Metrics tab from the pipeline monitoring page.
It shows the Produced, Consumed, and Failed records for each component in the pipeline over the selected time range as bar charts, where each bar contains the data for one time interval (30 minutes by default).
Color terminology on the data metrics page:
Blue: Indicates the number of records successfully produced to the out event by the component.
Green: Indicates the number of records consumed from the previous connected Kafka event by the component.
Red: Indicates the number of records that failed while the component was processing them.
On hovering over a specific bar in the bar chart on the Data Metrics page, the following information is displayed:
Start: Window start time.
End: Window end time.
Processed: Number of processed records.
Produced: Number of records generated after processing.
Consumed: Number of records consumed from the previous event.
Failed: Number of records that failed during processing.
Failed Percentage: Percentage of failed records, calculated as the ratio of Failed to Processed records.
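For clarity, here is a minimal worked example of how the Failed Percentage above is derived; the numbers are purely illustrative:

```python
# Illustrative numbers only: 200 records processed in the window, 5 of them failed.
processed = 200
failed = 5

# Failed Percentage = Failed / Processed, expressed as a percentage.
failed_percentage = (failed / processed) * 100 if processed else 0.0
print(f"{failed_percentage:.1f}%")   # prints 2.5%
```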
The user can see the data metrics for all the components by enabling the Show all components option on the Data Metrics page. Please refer to the image below for reference.
Filter: The user can apply their custom filter using the Filter tab on the Data Metrics page.
Time Range & Interval:
Enter the Start date & End date for filtering the data.
Custom interval: Enter the time in minutes. Each bar in the bar chart on the data metrics page will contain data for this custom interval.
Click on Apply Time Range to apply the filter.
The user can also filter the data from the last 5 minutes to the last 2 days directly from the filter tab.
Clear: It will clear all the monitoring and data metrics logs for all the components in the pipeline.
Please Note: The Clear option does not display on the monitoring page if the pipeline is active.
Users can visualize the loaded data in the form of charts by clicking on the green icon.
Please go through the given walk-through for reference.
Once the user clicks on the green icon, the following page opens:
Produced v/s Consumed
This chart will display the number of records produced to the out event compared to the number of records taken from the previous event over the given time window.
Min v/s Max v/s Avg Elapsed Time
This chart displays the minimum, maximum, and average time taken (in milliseconds) to process a record over the given time window.
Min Elapsed Time: The minimum time taken (in milliseconds) to process a record and send it to the out event.
Max Elapsed Time: The maximum time taken (in milliseconds) to process a record and send it to the out event.
Average Elapsed Time: The average time taken (in milliseconds) to process a record and send it to the out event.
Failed Records
This chart will display the number of failed records during processing over the given time window for the selected component.
Consumed v/s Failed Records
This chart will display the ratio of records failed during processing by the component to the total number of records consumed by the component over the given time window.
The user can also analyze the failures for the selected component from the Data Metrics page by clicking on the Analyze Failure option. Please see the image below for reference.
In this tab, the user can see the pod logs of every component in the pipeline. The user can access this tab from the monitoring page.
The user can find the following options on the System Logs tab:
Selected Pod: The user can select the Pod for which they want to see the logs.
Date Filter: The user can apply the date filter to see the logs accordingly.
Refresh Logs: The user can refresh the logs for the selected pod.
Download Logs: The user can download the logs for the selected pod.
Please Note: The System Logs on the monitoring page will be displayed only when the pipeline is active.
HDFS stands for Hadoop Distributed File System. It is a distributed file system designed to store and manage large data sets in a reliable, fault-tolerant, and scalable way. HDFS is a core component of the Apache Hadoop ecosystem and is used by many big data applications.
This component reads files located in HDFS (Hadoop Distributed File System).
All component configurations are classified broadly into the following sections:
Meta Information
Host IP Address: Enter the host IP address for HDFS.
Port: Enter the Port.
Zone: Enter the Zone for HDFS. Zone is a special directory whose contents will be transparently encrypted upon write and transparently decrypted upon read.
File Type: Select the File Type from the drop down. The supported file types are:
CSV: The Header and Infer Schema fields get displayed with CSV as the selected File Type. Enable the Header option to read the header of the file, and enable the Infer Schema option to infer the true schema of the columns in the CSV file.
JSON: The Multiline and Charset fields get displayed with JSON as the selected File Type. Check the Multiline option if there is any multiline string in the file.
PARQUET: No extra field gets displayed with PARQUET as the selected File Type.
AVRO: This File Type provides two drop-down menus.
Compression: Select an option out of the Deflate and Snappy options.
Compression Level: This field appears for the Deflate compression option. It provides 0 to 9 levels via a drop-down menu.
XML: Select this option to read an XML file. If this option is selected, the following fields will get displayed:
Infer schema: Enable this option to get the true schema of the columns.
Path: Provide the path of the file.
Root Tag: Provide the root tag from the XML files.
Row Tags: Provide the row tags from the XML files.
Join Row Tags: Enable this option to join multiple row tags.
ORC: Select this option to read an ORC file. If this option is selected, the following fields will get displayed:
Push Down: In the ORC (Optimized Row Columnar) file format, "push down" typically refers to the ability to push predicate filters down to the storage layer for processing (a short PySpark sketch of this behaviour is given after this field list). There are two options:
True: When push down is set to True, it indicates that predicate filters can be pushed down to the ORC storage layer for filtering rows at the storage level. This can improve query performance by reducing the amount of data that needs to be read into memory for processing.
False: When push down is set to False, predicate filters are not pushed down to the ORC storage layer. Instead, filtering is performed after the data has been read into memory by the processing engine. This may result in more data being read and potentially slower query performance compared to when push down is enabled.
Path: Provide the path of the file.
Partition Columns: Provide a unique Key column name to partition data in Spark.
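As referenced above, here is a minimal PySpark sketch of the predicate push-down behaviour described for the ORC file type. It assumes a running Spark environment with access to HDFS; the path, column name, and the spark.sql.orc.filterPushdown setting are used here only for illustration and are not part of the component's configuration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orc-pushdown-demo").getOrCreate()

# Roughly equivalent to Push Down = True: let predicate filters be evaluated
# at the ORC storage layer so fewer rows are read into memory.
spark.conf.set("spark.sql.orc.filterPushdown", "true")

# Illustrative HDFS path and column name.
df = spark.read.orc("hdfs://namenode:9000/data/sales.orc")
high_value = df.filter(df.amount > 1000)   # this filter can be pushed down to the ORC reader
high_value.show()
```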
Readers are a group of components that can read data from different databases and cloud storage in both invocation types, i.e., Real-Time and Batch.
The MongoDB Reader component supports both deployment types: Spark and Docker.
Components are broadly classified into:
Component Panel
Expand a component group and select the required component.
Search using the search bar in the component panel.
Drag and drop the component to the Workflow Editor.
Any Pipeline System component can be easily dragged to the workflow. Using a System component in the Pipeline Workflow Editor includes the following steps:
Expand the component group and select the required component.
Search using the search bar in the component panel.
Drag and drop the component to the Workflow Editor.
Check out the illustration on how to use a system pipeline component.
This page describes the Basic Information tab provided for the pipeline components. This tab has to be configured for all the components.
The Invocation Type configuration decides the type of deployment of the component. There are two types of invocation:
Real-Time
Batch
When the component has the Real-Time invocation, it never goes down while the pipeline is active. This is for situations where you want to keep the component ready to consume data at all times.
When "Realtime" is selected as the invocation type, we have an additional option to scale up the component called "Intelligent Scaling."
Please refer to the following page to learn more about Intelligent Scaling.
Please Note: The First Component of the pipeline must be in real-time invocation.
When the component has the Batch invocation type, it needs a trigger from the previous event to initiate processing. Once the component finishes processing and there are no new events to process, the component goes down.
Batch invocation is really helpful for batch or scheduled operations where the data is not streaming or real-time.
Please Note: When the users select the Batch invocation type, they get an additional option of the Grace Period (in sec). This grace period is the time that the component will take to go down gracefully. The default value for Grace Period is 60 seconds and it can be configured by the user.
The pipeline components process data in micro-batches. The Batch Size defines the maximum number of records that you want to process in a single cycle of operation. This is really helpful if you want to control the number of records being processed by the component when the size of a single record is huge. You can configure it in the base config of the components.
The illustration below displays how to update the Batch Size configuration.
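To make the micro-batch idea concrete, here is a minimal Python sketch of processing records in chunks of the configured Batch Size; the helper names and the placeholder process_batch() are illustrative and not part of the product API.

```python
def iter_micro_batches(records, batch_size=10):   # 10 is the minimum Batch Size
    """Yield the incoming records in chunks of at most batch_size."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]

def process_batch(batch):
    # Placeholder for whatever work the component does with one micro-batch.
    print(f"processing {len(batch)} records")

records = list(range(25))   # e.g. 25 records waiting on the input event
for batch in iter_micro_batches(records, batch_size=10):
    process_batch(batch)    # processes 10, 10, and then 5 records
```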
We can create a failover Event and map it in the component's base configuration, so that if the component fails, it audits all the failure messages together with the data (if available) and the timestamp of the error.
Go through the illustration given below to understand Failover Events.
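As a rough illustration of what such an audit record could look like, the sketch below builds a failure message with the data (if available) and the error timestamp. The exact field names used by the platform are not documented here, so they are assumptions.

```python
import json
from datetime import datetime, timezone

# Hypothetical shape of the record sent to the failover Event when a component fails.
failure_record = {
    "error": "ValueError: could not parse field 'amount'",   # failure message
    "data": {"id": 42, "amount": "not-a-number"},            # offending record, if available
    "timestamp": datetime.now(timezone.utc).isoformat(),     # timestamp of the error
}
print(json.dumps(failure_record))
```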
A way to scale up the processing speed of components.
A feature that automatically scales your component up to the maximum number of instances to reduce data processing lag. It detects the need to scale up the components in case of higher data traffic.
Please Note: This feature is available for both Spark and Docker components and works only with the Real-Time invocation type.
All components have the Intelligent Scaling option, which is the ability of the system to dynamically adjust the scale or capacity of the component based on the current demand and available resources. It involves automatically optimizing the resources allocated to the component to ensure efficient and effective processing of tasks.
Please Note:
If you have selected the Intelligent Scaling option, make sure to give the component enough resources so that it can auto-scale based on the load.
Components scale up if the lag exceeds 60%, and the component pods automatically scale down once the lag drops below 10%. These lag percentages are configurable.
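The sketch below restates that scaling rule in code. Only the 60% / 10% thresholds come from the note above; the function name, the one-pod step size, and the way lag is measured are illustrative assumptions.

```python
def desired_replicas(current_replicas, lag_percent, max_replicas,
                     scale_up_at=60.0, scale_down_at=10.0):
    """Scale up while consumer lag exceeds 60%; scale back down once it drops below 10%."""
    if lag_percent > scale_up_at and current_replicas < max_replicas:
        return current_replicas + 1        # add one pod at a time (illustrative step size)
    if lag_percent < scale_down_at and current_replicas > 1:
        return current_replicas - 1
    return current_replicas

print(desired_replicas(current_replicas=1, lag_percent=75.0, max_replicas=4))  # -> 2
print(desired_replicas(current_replicas=3, lag_percent=5.0, max_replicas=4))   # -> 2
```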
This page explains how we can monitor a Job.
The user can use the Job Monitoring feature to track a Job and its associated tasks. On this page, the user can view details such as Job Status, Last Activated (Date and Time), Last Deactivated (Date and Time), Total Allocated and Consumed CPU, and Total Allocated and Consumed Memory, all presented together on the Job Monitoring page.
Please go through the below given walk-through on the Job monitoring function.
The user can access the Job Monitoring icon on the List Jobs and Job Workflow Editor pages.
Navigate to the List Jobs page.
The Job Monitoring icon can be seen for all the listed Jobs.
OR
Navigate to the Job Workflow Editor page.
The Job Monitoring icon is provided on the Header panel.
The Job Monitoring page opens displaying the details of resource usage for the selected job.
The Job Monitoring page displays the following information for the selected Job:
Job: Name of the Job.
Status: Running status of the Job. 'True' indicates the Job is active, while 'False' indicates inactivity.
Last Activated: Date and time when the job was last activated.
Last Deactivated: Date and time when the job was last deactivated.
Total Allocated CPU: Total allocated CPU in cores.
Total Allocated Memory: Total allocated memory in MB.
Total Consumed CPU: Total consumed CPU by the Job in cores.
Total Consumed Memory: Total consumed memory by the Job in MB.
Instance Name: Instance name of the Job (e.g., Driver and Executors for Spark and PySpark Jobs).
Last Processed Time: Last processed time of the instance.
Min CPU Usage: Minimum CPU usage in cores by the instance.
Max CPU Usage: Maximum CPU usage in cores by the instance.
Min Memory Usage: Minimum memory usage in MB by the instance.
Max Memory Usage: Maximum memory usage in MB by the instance.
CPU Utilization: Total CPU utilization in cores by the instance.
Memory Utilization: Total memory utilization in MB by the instance.
There are two tabs present on the Job Monitoring page:
Monitor: This tab shows all the resource allocation and consumption details for each task or instance in the Job.
System Logs: This tab shows the pod logs of the Job.
Please Note: The system logs on the monitoring page will be displayed only when the Job is active.
Once the user clicks on any instance, the page will expand to show the graphical representation of CPU and Memory usage over the given interval of time. For reference, please see the images below.
The images below display the Monitoring page for the Spark Job, with details of the Spark driver and executor.
Displaying the monitoring details of the Spark Job Driver
Displaying the monitoring details of the Spark Job Executor
The images below display the Monitoring page for the PySpark Job, with details of the PySpark driver and executor.
Displaying the monitoring details of the PySpark Job Driver
Displaying the monitoring details of the PySpark Job Executor
If the memory or cores allocated to the component are less than required, they are displayed in red, as shown in the image below.
Clear: It will clear all the monitoring details of the selected Job.
There is a resource configuration tab while configuring the components.
The Data Pipeline contains an option to configure the resources i.e., Memory & CPU for each component that gets deployed.
There are two component deployment types:
Docker
Spark
After the component and pipeline are saved, the component gets the default resource configuration of the pipeline, i.e., Low, Medium, or High. After the users save the pipeline, the Configuration tab becomes visible in the component. It contains the following settings:
There are Request and Limit configurations needed for the Docker components.
The users can see the CPU and Memory options to be configured.
CPU: This is the CPU config where we can specify the number of cores that we need to assign to the component.
Please Note: 1000 means 1 core in the configuration of docker components.
When we put 100 that means 0.1 core has been assigned to the component.
Memory: This option is to specify how much memory you want to dedicate to that specific component.
Please Note: 1024 means 1GB in the configuration of the docker components.
Instances: The number of instances is used for parallel processing. If the user gives N instances, that many pods will be deployed (see the illustrative values below).
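The snippet below shows illustrative Request/Limit values using the units described above (CPU in units where 1000 = 1 core, memory in MB where 1024 = 1 GB). The dictionary layout is only a sketch for the arithmetic, not the product's actual configuration format.

```python
# 0.1 core and 512 MB requested; capped at 1 core and 1 GB; two pods for parallel processing.
docker_component_resources = {
    "request": {"cpu": 100, "memory": 512},
    "limit": {"cpu": 1000, "memory": 1024},
    "instances": 2,
}

# Converting the CPU value back to cores for readability: 100 / 1000 = 0.1 core.
print(docker_component_resources["request"]["cpu"] / 1000, "core(s) requested")
```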
The Spark component has the option to provide the Partition Factor in the Basic Information tab. This is critical for parallel Spark jobs.
Please follow the given example to achieve it:
E.g., if the users need to run 10 parallel Spark processes to write the data and the input Kafka topic has 5 partitions, they will have to set the Partition Factor to 2 (i.e., 5 * 2 = 10 jobs). Also, to make it work, the number of cores * number of instances should be equal to 10: 2 cores * 5 instances = 10 jobs.
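The same arithmetic, written out as a quick sanity check (the numbers mirror the example above):

```python
kafka_partitions = 5     # partitions on the input Kafka topic
partition_factor = 2     # Partition Factor set in the Basic Information tab

parallel_jobs = kafka_partitions * partition_factor    # 5 * 2 = 10 parallel Spark processes

# To actually run 10 jobs in parallel, cores * instances must also equal 10.
cores = 2
instances = 5
assert cores * instances == parallel_jobs
print(parallel_jobs, "parallel Spark jobs")
```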
The configuration of the Spark Components is slightly different from the Docker components. When the spark components are deployed, there are two pods that come up:
Driver
Executor
Provide the Driver and Executor configurations separately.
Instances: The number of instances used for parallel processing. If we give N as the number of instances in the Executor configuration, N executor pods will get deployed.
Please Note: As of the current release, the minimum requirement to deploy a driver is 0.1 cores, and 1 core for the executor. This may change with upcoming versions of Spark.
The Connection Validation option helps the users validate the connection details of databases/cloud storage.
This option is available for all the components to validate the connection before deploying the components to avoid connection-related errors. This will also work with environment variables.
Check out a sample illustration of connection validation.
The S3 Reader component is designed to read and access data stored in an S3 bucket in AWS. It typically authenticates with S3 using AWS credentials, such as an access key ID and secret access key, to gain access to the S3 bucket and its contents.
All component configurations are classified broadly into the following sections:
Meta Information
Check out the below-given demonstration to configure the S3 component and use it in a pipeline workflow.
Navigate to the Data Pipeline Editor.
Expand the Reader section provided under the Component Pallet.
Drag and drop the S3 Reader component to the Workflow Editor.
Click on the dragged S3 Reader to get the component properties tabs.
It is the default tab to open for the component while configuring it.
Invocation Type: Select an invocation mode out of ‘Real-Time’ or ‘Batch’ using the drop-down menu.
Deployment Type: It displays the deployment type for the reader component. This field comes pre-selected.
Container Image Version: It displays the image version for the docker container. This field comes pre-selected.
Failover Event: Select a failover Event from the drop-down menu.
Batch Size (min 10): Provide the maximum number of records to be processed in one execution cycle (Min limit for this field is 10).
Open the ‘Meta Information’ tab and fill in all the connection-specific details for the S3 Reader.
Bucket Name (*): Enter AWS S3 Bucket Name.
Zone (*): Enter the S3 zone (e.g., us-west-2).
Access Key (*): Provide Access Key ID shared by AWS.
Secret Key (*): Provide Secret Access Key shared by AWS.
Table (*): Mention the Table or file name from S3 location which is to be read.
File Type (*): Select a file type from the drop-down menu (CSV, JSON, PARQUET, AVRO, XML and ORC are the supported file types)
Limit: Set a limit for the number of records to be read.
Query: Enter a Spark SQL query. Take inputDf as table name.
Access Key (*): Provide Access Key ID shared by AWS.
Secret Key (*): Provide Secret Access Key shared by AWS.
Table (*): Mention the Table or file name from S3 location which is to be read.
File Type (*): Select a file type from the drop-down menu (CSV, JSON, PARQUET, AVRO are the supported file types).
Limit: Set a limit for the number of records to be read.
Query: Enter a Spark SQL query. Take inputDf as table name.
Sample Spark SQL query for S3 Reader:
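A minimal example of the kind of query that can be entered in the Query field is sketched below. The reader exposes the data it reads as a table named inputDf (as noted in the Query field above); the column names are illustrative assumptions.

```python
# Hypothetical Spark SQL query for the S3 Reader's Query field.
query = """
    SELECT id, name, amount
    FROM inputDf
    WHERE amount > 100
    ORDER BY amount DESC
"""
```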
The Meta Information tab also provides a Selected Columns section. If the user wants to read only specific columns from the table instead of the complete table, this can be achieved using the Selected Columns section. Select the columns you want to read; if you want to change the name of a column, put the new name in the Alias Name field, otherwise keep the alias name the same as the column name, and then select a Column Type from the drop-down menu.
or
Use ‘Download Data’ and ‘Upload File’ options to select the desired columns.
Provide a unique key column name on which the data has been partitioned and is to be read.
Click the Save Component in Storage icon after doing all the configurations to save the reader component.
A notification message appears to inform about the component configuration success.
Please Note:
The (*) symbol indicates that the field is mandatory.
Either table or query must be specified for the data readers except for SFTP Reader.
Selected Columns- There should not be a data type mismatch in the Column Type for all the Reader components.
The Meta Information fields may vary based on the selected File Type.
All the possibilities are mentioned below:
CSV: The ‘Header’ and ‘Infer Schema’ fields get displayed with CSV as the selected File Type. Enable the Header option to read the header of the file, and enable the Infer Schema option to infer the true schema of the columns in the CSV file.
JSON: The ‘Multiline’ and ‘Charset’ fields get displayed with JSON as the selected File Type. Check the Multiline option if there is any multiline string in the file.
PARQUET: No extra field gets displayed with PARQUET as the selected File Type.
AVRO: This File Type provides two drop-down menus.
Compression: Select an option out of the ‘Deflate’ and ‘Snappy’ options.
Compression Level: This field appears for the Deflate compression option. It provides 0 to 9 levels via a drop-down menu.
XML: Select this option to read an XML file. If this option is selected, the following fields will get displayed:
Infer schema: Enable this option to get the true schema of the columns.
Path: Provide the path of the file.
Root Tag: Provide the root tag from the XML files.
Row Tags: Provide the row tags from the XML files.
Join Row Tags: Enable this option to join multiple row tags.
ORC: Select this option to read an ORC file. If this option is selected, the following fields will get displayed:
Push Down: In the ORC (Optimized Row Columnar) file format, "push down" typically refers to the ability to push predicate filters down to the storage layer for processing. There are two options:
True: When push down is set to True, it indicates that predicate filters can be pushed down to the ORC storage layer for filtering rows at the storage level. This can improve query performance by reducing the amount of data that needs to be read into memory for processing.
False: When push down is set to False, predicate filters are not pushed down to the ORC storage layer. Instead, filtering is performed after the data has been read into memory by the processing engine. This may result in more data being read and potentially slower query performance compared to when push down is enabled.
The DB Reader is a Spark-based reader that gives you the capability to read data from multiple database sources. The supported database sources are listed below:
All component configurations are classified broadly into the following sections:
Meta Information
Please follow the steps given in the demonstration to configure the DB Reader component.
Please Note:
The ClickHouse driver in the Spark components will use HTTP Port and not the TCP port.
In the case of data from multiple tables (join queries), one can write the join query directly without specifying multiple tables, as only one among table and query fields is required.
Table Name: Provide a single table name or multiple table names. If multiple table names are given, enter them separated by commas (,).
Fetch Size: Provide the maximum number of records to be processed in one execution cycle.
Create Partition: This is used for performance enhancement. It creates a sequence of indexes for partitioning. Once this option is selected, the operation is not executed on the server.
Partition By: This option appears once the Create Partition option is enabled. There are two options under it:
Auto Increment: The number of partitions will be incremented automatically.
Index: The number of partitions will be incremented based on the specified Partition column.
Query: Enter the Spark SQL query for the given table(s) in this field. It also supports queries containing join statements. Please refer to the image below for writing a query on multiple tables.
Enable SSL: Check this box to enable SSL for this component. The Enable SSL feature in the DB Reader component appears only for two drivers: PostgreSQL and ClickHouse.
Certificate Folder: This option appears when the Enable SSL field is checked. The user has to select from the drop-down the certificate folder that contains the files uploaded in the Admin Settings. Please refer to the images below for reference.
Sample Spark SQL query for DB Reader:
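A minimal example of a join query that could be entered in the Query field is sketched below; the table and column names are illustrative assumptions.

```python
# Hypothetical Spark SQL join query for the DB Reader's Query field.
query = """
    SELECT o.order_id, o.amount, c.customer_name
    FROM orders o
    JOIN customers c ON o.customer_id = c.customer_id
    WHERE o.amount > 100
"""
```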
Please Note: To use DB reader component with SSL, the user needs to upload the following files on the certificate upload page:
Certificate file (.pem format)
Key file (.key format)
GCS Reader component is typically designed to read data from Google Cloud Storage (GCS), a cloud-based object storage service provided by Google Cloud Platform. A GCS Reader can be a part of an application or system that needs to access data stored in GCS buckets. It allows you to retrieve, read, and process data from GCS, making it accessible for various use cases, such as data analysis, data processing, backups, and more.
GCS Reader pulls data from the GCS Monitor, so the first step is to implement GCS Monitor.
Note: The users can refer to the GCS Monitor section of this document for the details.
All component configurations are classified broadly into the following sections:
Meta Information
Navigate to the Pipeline Workflow Editor page for an existing pipeline workflow with GCS Monitor and Event component.
Open the Reader section of the Component Pallet.
Drag the GCS Reader to the Workflow Editor.
Click on the dragged GCS Reader component to get the component properties tabs below.
It is the default tab to open for the component while configuring it.
Invocation Type: Select an invocation mode from the ‘Real-Time’ or ‘Batch’ using the drop-down menu.
Deployment Type: It displays the deployment type for the reader component. This field comes pre-selected.
Container Image Version: It displays the image version for the docker container. This field comes pre-selected.
Failover Event: Select a failover Event from the drop-down menu.
Batch Size (min 10): Provide the maximum number of records to be processed in one execution cycle (the minimum limit for this field is 10).
Bucket Name: Enter the Bucket name for GCS Reader. A bucket is a top-level container for storing objects in GCS.
Directory Path: Enter the path where the file is located, which needs to be read.
File Name: Enter the file name.
Navigate to the Pipeline Workflow Editor page for an existing pipeline workflow with the PySpark GCS Reader and Event component.
OR
You may create a new pipeline with the mentioned components.
Open the Reader section of the Component Pallet.
Drag the PySpark GCS Reader to the Workflow Editor.
Click the dragged GCS Reader component to get the component properties tabs below.
Invocation Type: Select an invocation mode from the ‘Real-Time’ or ‘Batch’ using the drop-down menu.
Deployment Type: It displays the deployment type for the reader component. This field comes pre-selected.
Container Image Version: It displays the image version for the docker container. This field comes pre-selected.
Failover Event: Select a failover Event from the drop-down menu.
Batch Size (min 10): Provide the maximum number of records to be processed in one execution cycle (the minimum limit for this field is 10).
Secret File (*): Upload the JSON from the Google Cloud Storage.
Bucket Name (*): Enter the Bucket name for GCS Reader. A bucket is a top-level container for storing objects in GCS.
Path: Enter the path where the file is located, which needs to be read.
Read Directory: Disable this option to read a single file from the directory.
Limit: Set a limit for the number of records to be read.
File-Type: Select the File-Type from the drop-down.
File Type (*): Supported file formats are:
CSV: The Header, Multiline, and Infer Schema fields will be displayed with CSV as the selected File Type. Enable the Header option to read the header of the file, and enable the Infer Schema option to infer the true schema of the columns in the CSV file. Check the Multiline option if there is any multiline string in the file.
JSON: The Multiline and Charset fields are displayed with JSON as the selected File Type. Check the Multiline option if there is any multiline string in the file.
PARQUET: No extra field gets displayed with PARQUET as the selected File Type.
AVRO: This File Type provides two drop-down menus.
Compression: Select an option out of the Deflate and Snappy options.
Compression Level: This field appears for the Deflate compression option. It provides 0 to 9 levels via a drop-down menu.
XML: Select this option to read the XML file. If this option is selected, the following fields will be displayed:
Infer schema: Enable this option to get the true schema of the column.
Path: Provide the path of the file.
Root Tag: Provide the root tag from the XML files.
Row Tags: Provide the row tags from the XML files.
Join Row Tags: Enable this option to join multiple row tags.
Query: Enter the Spark SQL query.
Select the desired columns using the Download Data and Upload File options.
Or
The user can also use the Column Filter section to select columns.
Click the Save Component in Storage icon after doing all the configurations to save the reader component.
A notification message appears to inform about the component configuration success.
The SFTP Stream Reader is designed to read and access data from an SFTP server. It typically authenticates with the SFTP server using a username and password or SSH key-based authentication.
All component configurations are classified broadly into the following sections:
Meta Information
Please follow the demonstration to configure the component and its Meta Information.
Host: Enter the host.
Username: Enter username for SFTP stream reader.
Port: Provide the Port number.
Add File Name: Enable this option to get the file name along with the data.
Authentication: Select an authentication option using the drop-down list.
Password: Provide a password to authenticate the SFTP Stream Reader component.
PEM/PPK File: Choose a file to authenticate the SFTP Stream reader component. The user needs to upload a file if this authentication option has been selected.
Reader Path: Enter the path from where the file has to be read.
Channel: Select a channel option from the drop-down menu (the supported channel is SFTP).
File type: Select the file type from the drop-down:
CSV: The Header and Infer Schema fields get displayed with CSV as the selected File Type. Enable the Header option to read the header of the file, and enable the Infer Schema option to infer the true schema of the columns in the CSV file. Schema: If CSV is selected as the file type, paste the Spark schema of the CSV file in this field (a sample schema is given after this field list).
JSON: The Multiline and Charset fields get displayed with JSON as the selected File Type. Check the Multiline option if there is any multiline string in the file.
XML: Select this option to read an XML file. If this option is selected, the following fields will get displayed:
Infer schema: Enable this option to get the true schema of the columns.
Path: Provide the path of the file.
Root Tag: Provide the root tag from the XML files.
Row Tags: Provide the row tags from the XML files.
Join Row Tags: Enable this option to join multiple row tags.
File Metadata Topic: Enter Kafka Event Name where the reading file metadata has to be sent.
Column filter: Select the columns which you want to read. If you want to change the name of a column, put the new name in the alias name section; otherwise keep the alias name the same as the column name, and then select a Column Type from the drop-down menu.
Use Download Data and Upload File options to select the desired columns.
Upload File: The user can upload the existing system files (CSV, JSON) using the Upload File icon.
Download Data (Schema): Users can download the schema structure in JSON format by using the Download Data icon.
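For the Schema field mentioned under the CSV file type above, the sketch below shows one way to produce a Spark schema for a CSV file using PySpark, assuming the field accepts the JSON representation of a Spark schema; the column names are illustrative assumptions.

```python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Illustrative schema for a CSV file with two columns.
csv_schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
])

# StructType.json() returns the schema as a JSON string, which could then be
# pasted into the Schema field.
print(csv_schema.json())
```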
SFTP Reader is designed to read and access files stored on an SFTP server. SFTP readers typically authenticate with the SFTP server using a username and password or SSH key pair, which grants access to the files stored on the server.
All component configurations are classified broadly into the following sections:
Meta Information
Please follow the demonstration to configure the SFTP Reader and its meta information.
Please go through the below given steps to configure SFTP Reader component:
Host: Enter the host.
Username: Enter username for SFTP reader.
Port: Provide the Port number.
Dynamic Header: It can automatically detect the header row in a file and adjust the column names and number of columns as necessary.
Authentication: Select an authentication option using the drop-down list.
Password: Provide a password to authenticate the SFTP component
PEM/PPK File: Choose a file to authenticate the SFTP component. The user must upload a file if this authentication option is selected.
Reader Path: Enter the path from where the file has to be read.
Channel: Select a channel option from the drop-down menu (the supported channel is SFTP).
Column filter: Select the columns that you want to read. If you want to change the name of a column, put the new name in the alias name section; otherwise keep the alias name the same as the column name, and then select a Column Type from the drop-down menu.
Use the Download Data and Upload File options to select the desired columns.
Upload File: The user can upload the existing system files (CSV, JSON) using the Upload File icon.
Download Data (Schema): Users can download the schema structure in JSON format by using the Download Data icon.
An Elasticsearch reader component is designed to read and access data stored in an Elasticsearch index. Elasticsearch readers typically authenticate with Elasticsearch using username and password credentials, which grant access to the Elasticsearch cluster and its indexes.
All component configurations are classified broadly into the following sections:
Meta Information
Please follow the given demonstration to configure the component.
Host IP Address: Enter the host IP Address for Elastic Search.
Port: Enter the port to connect with Elastic Search.
Index ID: Enter the Index ID to read a document in Elasticsearch. In Elasticsearch, an index is a collection of documents that share similar characteristics, and each document within an index has a unique identifier known as the index ID. The index ID is a unique string that is automatically generated by Elasticsearch and is used to identify and retrieve a specific document from the index.
Resource Type: Provide the resource type. In Elasticsearch, a resource type is a way to group related documents together within an index. Resource types are defined at the time of index creation, and they provide a way to logically separate different types of documents that may be stored within the same index.
Is Date Rich True: Enable this option if any fields in the reading file contain date or time information. The date rich feature in Elasticsearch allows for advanced querying and filtering of documents based on date or time ranges, as well as date arithmetic operations.
Username: Enter the username for Elasticsearch.
Password: Enter the password for Elasticsearch.
Query: Provide a Spark SQL query.