The Homepage of the Data Pipeline module is the default landing page that opens when you select the Data Pipeline module from the Apps menu.
Navigate to the Platform homepage.
Click on the Apps menu to open it.
Click on the Data Pipeline module.
You get redirected to the Homepage of the Data Pipeline module.
The Homepage of the Data Pipeline module offers a left-side menu panel with various options to begin with the Data Pipeline module.
Various options provided on the left-side menu panel given on the Data Pipeline Homepage are as described below:
Home
Redirects to the Pipeline Homepage.
Displays options to create a new Pipeline or Job.
Redirects to the page listing all the Jobs accessible to the user.
Redirects to the page listing all the Pipelines accessible to the user.
Redirects to the Scheduler List page.
Redirects to the page containing the information on Data Channel & Cluster Events.
Redirects to the Trash page.
Redirects to the Settings page.
The Recently Visited section provides users with instant access to their most recent pipelines or jobs, displaying the last ten interactions. By eliminating the need to search through extensive menus or folders, this feature significantly improves workflow efficiency. Users can quickly resume work where they left off, ensuring a more seamless and intuitive experience on the pipeline homepage. It’s especially valuable for users managing multiple pipelines, as it reduces repetitive tasks, shortens navigation time, and increases overall productivity.
Navigate to the Data Pipeline Homepage.
Locate the Recently Visited section on the Homepage.
You may review the list of recent items to find the one you wish to access.
Select a Pipeline or Job from the list.
Click on the View icon of the desired pipeline or job.
You will be redirected to the Workflow Editor page of the selected Pipeline or Job.
An event-driven architecture uses events to trigger and communicate between decoupled services and is common in modern applications built with microservices.
The connecting components help to assemble various pipeline components and create a Pipeline Workflow. Just click and drag the component you want to use into the editor canvas. Connect the component output to a Kafka Event.
Once a Pipeline is created, the User Interface of the Data Pipeline provides a canvas for the user to build the data flow (Pipeline Workflow). The Pipeline assembling process can be divided into two parts as mentioned below:
Adding Components to the Canvas
Adding Connecting Components (Events) to create the Data flow/ Pipeline workflow
Each component inside a pipeline is fully decoupled. Each component acts as a producer and consumer of data. The design is based on event-driven process orchestration. To pass the output of one component to another, an intermediary event is needed. An event-driven architecture contains three items:
Event Producer [Components]
Event Stream [Event (Kafka topic / DB Sync)]
Event Consumer [Components]
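The producer–stream–consumer decoupling described above can be sketched in plain Python, using a `queue.Queue` as a stand-in for the intermediary event (illustrative only; in the platform the intermediary is a real Kafka topic or DB Sync):

```python
from queue import Queue

# The intermediary event: a stand-in for a Kafka topic.
event_stream = Queue()

def producer_component(records):
    """Event Producer: a component publishes its output to the event."""
    for record in records:
        event_stream.put(record)

def consumer_component():
    """Event Consumer: the next component reads from the same event."""
    consumed = []
    while not event_stream.empty():
        consumed.append(event_stream.get())
    return consumed

# The two components never call each other directly; they only share
# the event, which is what keeps them fully decoupled.
producer_component([{"id": 1}, {"id": 2}])
result = consumer_component()
```

Because neither function knows about the other, either side can be replaced or scaled independently, which is the point of event-driven orchestration.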
This section provides the steps involved in creating a new Pipeline flow.
Check out the below-given illustration on how to create a new pipeline.
Navigate to the Create Pipeline or Job interface.
Click the Create option provided for Pipeline.
The New Pipeline window opens asking for the basic information.
Enter a name for the new Pipeline.
Describe the Pipeline (Optional).
Select a resource allocation option using the radio button. The given choices are:
Low
Medium
High
Please Note: This feature is used to deploy the pipeline with high, medium, or low-end configurations according to the velocity and volume of data that the pipeline must handle. All the components saved in the pipeline are then allocated resources based on the selected Resource Allocation option depending on the component type (Spark and Docker).
Click the Save option to create the pipeline. By clicking the Save option, the user gets redirected to the pipeline workflow editor.
A success message appears to confirm the creation of a new pipeline.
The Pipeline Editor page opens for the newly created pipeline.
Resource allocation can be changed anytime by clicking the Edit icon near the pipeline name at the top left.
Create a new Pipeline or Job based on the selected create option.
The Pipeline Homepage offers the Create option to initiate Pipeline or Job creation based on the user's needs.
Navigate to the Pipeline List page.
Click the Create icon.
The Create Pipeline or Job user interface opens, prompting the user to create a new Pipeline or Job.
Please Note: Based on your selection, the next page opens. Refer to the Create a New Pipeline and Create a Job sections for details.
Adding components to a pipeline workflow.
Check out the below-given walk-through to add components to the Pipeline Workflow editor canvas.
The Component Palette is situated on the left side of the User Interface of the Pipeline Workflow. It has the System and Custom components tabs listing the various components.
The System components are displayed in the below given image:
Once the Pipeline gets saved in the pipeline list, the user can add components to the canvas. The user can drag the required components to the canvas and configure them to create a Pipeline workflow or Data flow.
Navigate to the existing data pipeline from the List Pipelines page.
Click the View icon for the pipeline.
The Pipeline Editor opens for the selected pipeline.
Drag and drop the new required components or make changes in the existing component’s meta information or change the component configuration (E.g., the DB Reader is dragged to the workspace in the below-given image).
Once dragged and dropped to the pipeline workspace, components can be directly connected to the nearest Kafka Event. To enable the auto-connect feature, the user needs to ensure that the Auto connect event on drag option is enabled, which is the default setting.
Click on the dragged component and configure the Basic Information tab, which opens by default.
Open the Meta Information tab, which is next to the Basic Information tab, and configure it.
Make sure to click the Save Component in Storage icon to update the component details and the pipeline to reflect the recent changes. (The user can drag and configure other required components to create the Pipeline Workflow.)
Click the Update Pipeline icon to save the changes.
A success message appears to assure that the pipeline has been successfully updated.
Click the Activate Pipeline icon to activate the pipeline (It appears only after the newly created pipeline gets successfully updated).
A dialog window opens to confirm the action of pipeline activation.
Click the YES option to activate the pipeline.
A success message appears confirming the activation of the pipeline.
Another success message appears to confirm that the pipeline has been updated.
The Status for the pipeline gets changed on the Pipeline List page.
Please Note:
Please refer to the page for more details on creating Kafka Events and connecting them to components using various methods.
Click the Delete icon from the Pipeline Editor page to delete the selected pipeline. The deleted Pipeline gets removed from the Pipeline list.
Refer to the section to get detailed information on each Pipeline Component.
All the available Reader Task components are included in this section.
Readers are a group of tasks that can read data from various databases and cloud storages. In Jobs, all the tasks run in real-time.
There are eight (8) Reader tasks in Jobs. All the Reader tasks contain the following tabs:
Meta Information: Configure the meta information in the same way as for pipeline components.
Preview Data: Up to ten (10) random records can be previewed in this tab, and only when the task is running in Development mode.
Preview Schema: The Spark schema of the data being read is shown in this tab.
Logs: Logs of the task are displayed here.
This section provides details about the various categories of the task components which can be used in the Spark Job.
There are three categories of task components available:
All the available Writer Task components for a Job are explained in this section.
Writers are a group of components that can write data to various databases and cloud storages.
There are eight (8) Writer tasks in Jobs. All the Writer tasks have the following tabs:
Meta Information: Configure the meta information in the same way as for pipeline components.
Preview Data: Up to ten (10) random records can be previewed in this tab, and only when the task is running in Development mode.
Preview Schema: The Spark schema of the data is shown in this tab.
Logs: Logs of the task are displayed here.
HDFS stands for Hadoop Distributed File System. It is a distributed file system designed to store and manage large data sets in a reliable, fault-tolerant, and scalable way. HDFS is a core component of the Apache Hadoop ecosystem and is used by many big data applications.
This task reads the file located in HDFS (Hadoop Distributed File System).
Drag the HDFS reader task to the Workspace and click on it to open the related configuration tabs for the same. The Meta Information tab opens by default.
Host IP Address: Enter the host IP address for HDFS.
Port: Enter the Port.
Zone: Enter the Zone for HDFS. Zone is a special directory whose contents will be transparently encrypted upon write and transparently decrypted upon read.
File Type: Select the File Type from the drop down. The supported file types are:
CSV: The Header and Infer Schema fields get displayed with CSV as the selected File Type. Enable the Header option to treat the first row of the file as a header, and enable the Infer Schema option to infer the true schema of the columns in the CSV file.
JSON: The Multiline and Charset fields get displayed with JSON as the selected File Type. Check the Multiline option if the file contains any multiline strings.
PARQUET: No extra field gets displayed with PARQUET as the selected File Type.
AVRO: This File Type provides two drop-down menus.
Compression: Select an option out of the Deflate and Snappy options.
Compression Level: This field appears for the Deflate compression option. It provides 0 to 9 levels via a drop-down menu.
XML: Select this option to read an XML file. If this option is selected, the following fields get displayed:
Infer Schema: Enable this option to infer the true schema of the columns.
Path: Provide the path of the file.
Root Tag: Provide the root tag from the XML files.
Row Tags: Provide the row tags from the XML files.
Join Row Tags: Enable this option to join multiple row tags.
Path: Provide the path of the file.
Partition Columns: Provide a unique Key column name to partition data in Spark.
Please Note: Please click the Save Task In Storage icon to save the configuration for the dragged reader task.
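The file-type fields above map naturally onto Spark-style reader options. A minimal sketch of that mapping (the option names mirror Spark's DataFrame reader; the helper function itself is hypothetical, not part of the product):

```python
def reader_options(file_type, **fields):
    """Build a Spark-style options dict for the selected File Type."""
    file_type = file_type.lower()
    if file_type == "csv":
        # Header and Infer Schema appear only for CSV.
        return {"header": fields.get("header", False),
                "inferSchema": fields.get("infer_schema", False)}
    if file_type == "json":
        # Multiline and Charset appear only for JSON.
        return {"multiLine": fields.get("multiline", False),
                "charset": fields.get("charset", "UTF-8")}
    if file_type == "parquet":
        return {}  # no extra fields for PARQUET
    if file_type == "avro":
        opts = {"compression": fields.get("compression", "snappy")}
        if opts["compression"] == "deflate":
            # Compression Level (0-9) appears only for Deflate.
            opts["compressionLevel"] = fields.get("level", 6)
        return opts
    raise ValueError(f"unsupported file type: {file_type}")

csv_opts = reader_options("CSV", header=True, infer_schema=True)
```

This is only meant to show which extra fields accompany each file type; the actual reads are configured through the Meta Information tab.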
This task is used to read data from MongoDB collection.
Drag the MongoDB reader task to the Workspace and click on it to open the related configuration tabs for the same. The Meta Information tab opens by default.
Connection Type: Select the connection type from the drop-down:
Standard
SRV
Connection String
Port (*): Provide the Port number (It appears only with the Standard connection type).
Host IP Address (*): The IP address of the host.
Username (*): Provide a username.
Password (*): Provide a valid password to access the MongoDB.
Database Name (*): Provide the name of the database where you wish to write data.
Additional Parameters: Provide details of the additional parameters.
Cluster Shared: Enable this option to horizontally partition data across multiple servers.
Schema File Name: Upload Spark Schema file in JSON format.
Query: Please provide Mongo Aggregation query in this field.
Please Note: Please click the Save Task In Storage icon to save the configuration for the dragged reader task.
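The Query field expects a MongoDB aggregation pipeline. To illustrate what such a pipeline does, the example below evaluates the equivalent of a `$match` followed by a `$group` stage by hand over a list of documents (plain Python for illustration; in the product the query runs against the configured collection):

```python
# A sample aggregation query, as it could be entered in the Query field:
# [{"$match": {"status": "active"}},
#  {"$group": {"_id": "$city", "count": {"$sum": 1}}}]

docs = [
    {"city": "Pune", "status": "active"},
    {"city": "Pune", "status": "inactive"},
    {"city": "Delhi", "status": "active"},
]

# $match stage: keep only the active documents.
matched = [d for d in docs if d["status"] == "active"]

# $group stage: count documents per city.
grouped = {}
for d in matched:
    grouped[d["city"]] = grouped.get(d["city"], 0) + 1
```

The reader task returns the documents produced by the last stage of the pipeline, so shaping the data at the source reduces downstream processing.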
This task is used to read the data from the following databases: MYSQL, MSSQL, Oracle, ClickHouse, Snowflake, PostgreSQL, Redshift.
Drag the DB reader task to the Workspace and click on it to open the related configuration tabs for the same. The Meta Information tab opens by default.
Host IP Address: Enter the Host IP Address for the selected driver.
Port: Enter the port for the given IP Address.
Database name: Enter the Database name.
Table name: Provide a single table name or multiple table names. If multiple table names are given, enter them separated by commas (,).
User name: Enter the user name for the provided database.
Password: Enter the password for the provided database.
Driver: Select the driver from the drop-down. There are seven (7) drivers supported here: MYSQL, MSSQL, Oracle, ClickHouse, Snowflake, PostgreSQL, Redshift.
Fetch Size: Provide the maximum number of records to be processed in one execution cycle.
Create Partition: This option is used for performance enhancement. It creates a sequence of indexes for partitioned reads. Once this option is selected, the operation is not executed on the server.
Partition By: This option will appear once create partition option is enabled. There are two options under it:
Auto Increment: The number of partitions will be incremented automatically.
Index: The number of partitions will be incremented based on the specified Partition column.
Query: Enter the Spark SQL query in this field for the given table(s). Please refer to the below image for a sample query on multiple tables.
Please Note:
The ClickHouse driver in the Spark components will use HTTP Port and not the TCP port.
In the case of data from multiple tables (join queries), one can write the join query directly without specifying multiple tables, as only one among table and query fields is required.
Please click the Save Task In Storage icon to save the configuration for the dragged reader task.
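When Create Partition is enabled with the Index option, reads are split along the specified partition column, much like Spark's JDBC `partitionColumn`/`lowerBound`/`upperBound`/`numPartitions` options. A rough sketch of how such boundaries can be derived (illustrative only, not the product's internal logic):

```python
def partition_bounds(lower, upper, num_partitions):
    """Split [lower, upper) into contiguous ranges, one per partition,
    mirroring how a JDBC reader divides an index column."""
    stride = (upper - lower) // num_partitions
    bounds = []
    start = lower
    for i in range(num_partitions):
        # The last partition absorbs any remainder.
        end = upper if i == num_partitions - 1 else start + stride
        bounds.append((start, end))
        start = end
    return bounds

# Ten thousand row ids split across 4 parallel reads:
ranges = partition_bounds(0, 10_000, 4)
```

Each range becomes an independent read, which is why a reasonably uniform key column gives the best performance gain.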
HDFS stands for Hadoop Distributed File System. It is a distributed file system designed to store and manage large data sets in a reliable, fault-tolerant, and scalable way. HDFS is a core component of the Apache Hadoop ecosystem and is used by many big data applications.
This task writes the data in HDFS (Hadoop Distributed File System).
Drag the HDFS writer task to the Workspace and click on it to open the related configuration tabs for the same. The Meta Information tab opens by default.
Host IP Address: Enter the host IP address for HDFS.
Port: Enter the Port.
Table: Enter the table name where the data has to be written.
Zone: Enter the Zone for HDFS in which the data has to be written. Zone is a special directory whose contents will be transparently encrypted upon write and transparently decrypted upon read.
File Format: Select the file format in which the data has to be written:
CSV
JSON
PARQUET
AVRO
Save Mode: Select the save mode.
Schema file name: Upload the Spark schema file in JSON format.
Partition Columns: Provide a unique Key column name to partition data in Spark.
Please Note: Please click the Save Task In Storage icon to save the configuration for the dragged writer task.
This section provides detailed information on the Jobs to make your data process faster.
Jobs are used for ingesting and transferring data from separate sources. The user can transform, unify, and cleanse data to make it suitable for analytics and business reporting without using Kafka topics, which makes the entire flow much faster.
Check out the given demonstration to understand how to create and activate a job.
Navigate to the Data Pipeline homepage.
Click on the Create icon.
Navigate to the Create Pipeline or Job interface.
The New Job dialog box appears, prompting the user to create a new Job.
Enter a name for the new Job.
Describe the Job (Optional).
Job Baseinfo: In this field, there are three options:
Trigger By: There are two options for triggering the current job based on the success or failure of another job:
Success Job: On successful execution of the selected job, the current job will be triggered.
Failure Job: On failure of the selected job, the current job will be triggered.
Is Scheduled?
A job can be scheduled for a particular timestamp; the job will then be triggered at that same timestamp every day.
Jobs must be scheduled according to UTC.
Concurrency Policy: Concurrency policy schedulers are responsible for managing the execution and scheduling of concurrent tasks or threads in a system. They determine how resources are allocated and utilized among the competing tasks. Different scheduling policies exist to control the order, priority, and allocation of resources for concurrent tasks.
Please Note:
Concurrency Policy will appear only when "Is Scheduled" is enabled.
If the job is scheduled, then the user has to activate it for the first time. Afterward, the job will automatically be activated each day at the scheduled time.
There are three Concurrency Policies available:
Allow: If a job is scheduled for a specific time and the first process is not completed before the next scheduled time, the next task will run in parallel with the previous tasks.
Forbid: If a job is scheduled for a specific time and the first process is not completed before the next scheduled time, the next task will wait until all the previous tasks are completed.
Replace: If a job is scheduled for a specific time and the first process is not completed before the next scheduled time, the previous task will be terminated and the new task will start processing.
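The three policies above boil down to a small decision at each scheduled tick. A sketch of that decision (illustrative, assuming a boolean flag that says whether the previous run is still in progress):

```python
def on_schedule_tick(policy, previous_still_running):
    """Decide what happens when the next scheduled time arrives."""
    if not previous_still_running:
        return "start new run"
    if policy == "Allow":
        return "start new run in parallel"
    if policy == "Forbid":
        return "wait for previous run to finish"
    if policy == "Replace":
        return "terminate previous run, start new run"
    raise ValueError(f"unknown policy: {policy}")

decision = on_schedule_tick("Replace", previous_still_running=True)
```

Choosing a policy is therefore a trade-off between throughput (Allow), ordering guarantees (Forbid), and freshness (Replace).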
Spark Configuration
Select a resource allocation option using the radio button. The given choices are:
Low
Medium
High
This feature is used to deploy the Job with high, medium, or low-end configurations according to the velocity and volume of data that the Job must handle.
Also, provide the resources for the Driver and Executor according to the requirement.
Alert: There are two options for sending an alert:
Success: On successful execution of the configured job, the alert will be sent to the selected channel.
Failure: On failure of the configured job, the alert will be sent to the selected channel.
Please go through the given link to configure the Alerts in Job: Job Alert
Click the Save option to create the job.
A success message appears to confirm the creation of a new job.
The Job Editor page opens for the newly created job.
Please Note:
The Trigger By feature will not work if the selected Trigger By job is running in Development mode; it works only when the selected Trigger By Job is activated.
By clicking the Save option, the user gets redirected to the job workflow editor.
A Kafka event refers to a single piece of data or message that is exchanged between producers and consumers in a Kafka messaging system. Kafka events are also known as records or messages. They typically consist of a key, a value, and metadata such as the topic and partition. Producers publish events to Kafka topics, and consumers subscribe to these topics to consume events. Kafka events are often used for real-time data streaming, messaging, and event-driven architectures.
Check out the given illustration to understand how to create a Kafka Event.
Navigate to the Pipeline Editor page.
Click on the Event Panel option in the pipeline toolbar and click the Add New Event icon.
The New Event window opens.
Provide a name for the new Event.
Select Event Duration: Select an option from the below-given options.
Short (4 hours)
Medium (8 hours)
Half Day (12 hours)
Full Day (24 hours)
Long (48 hours)
Week (168 hours)
Please Note: The event data gets erased after 7 days if no duration option is selected from the available options. The offsets expire as well.
No. of Partitions: Enter a value between 1 and 100. The default number of partitions is 3.
No. of Outputs: Define the number of outputs using this field.
Is Failover: Enable this option to create the event as the Failover Event. If a Failover Event is created, it must be mapped with a component to retrieve failed data from that component.
Click the "Add Event" option.
The Event will be created successfully, and the newly created Event is added to the Kafka Events tab in the Events panel.
Once the Kafka Event is created, the user can drag it to the pipeline workspace and connect it to any component.
On hovering over the event in the pipeline workspace, the user can see the following information for that event.
Event Name
Duration
Number of Partitions
The user can edit the following information of the Kafka Event after dragging it to the pipeline workspace:
Display Name
No. of Outputs
Is Failover
A Failover Event is designed to capture data that a component in the pipeline fails to process. In cases where a connected event's data cannot be processed successfully by a component, the failed data is sent to the Failover Event.
Follow these steps to map a Failover Event with the component in the pipeline:
Create a Failover Event following the provided steps.
Drag the Failover Event to the pipeline workspace.
Navigate to the Basic Information tab of the desired component where the Failover Event should be mapped.
From the drop-down, select the Failover Event.
Save the component configuration.
The Failover Event is now successfully mapped. If the component encounters processing failures with data from its preceding event, the failed data will be directed to the Failover Event.
The Failover Event holds the following keys along with the failed data:
Cause: Cause of failure.
eventTime: Date and Time at which the data gets failed.
When hovering over the Failover Event, the associated component in the pipeline will be highlighted. Refer to the image below for visual reference.
Please see the below-given video on how to map a Failover Event.
This feature automatically connects the Kafka/Data Sync Event to the component when dragged from the events panel. In order to use this feature, users need to ensure that the Auto connect components on drag option is enabled from the Events panel.
Please see the below-given illustrations on auto connecting a Kafka and Data Sync Event to a pipeline component.
This feature allows the user to directly connect a Kafka/Data Sync Event to a component by right-clicking on the component in the pipeline.
Follow these steps to directly connect a Kafka/Data Sync event to a component in the pipeline:
Right-click on the component in the pipeline.
Select the Add Kafka Event or Add Sync Event option.
The Create Kafka Event or Create Sync Event dialog box will open.
Enter the required details.
Click on the Add Kafka Event or Create Sync Event option.
The newly created Kafka/Data Sync Event will be directly connected to the component.
Please see the below-given illustrations on how to add a Kafka and Data Sync Event to a pipeline workflow.
This feature enables users to map a Kafka Event with another Kafka event. In this scenario, the mapped event will have the same data as its source event.
Please see the below-given illustrations on how to create mapped Kafka Event.
Follow these steps to create a mapped event in the pipeline:
Choose the Kafka event from the pipeline as the source event for which mapped events need to be created.
Open the events panel from the pipeline toolbar and select the "Add New Event" option.
In the Create Kafka Event dialog box, enable the option at the top to enable mapping for this event.
Enter the Source Event Name in the Event Name field and click the search icon to select the name of the source event from the suggestions.
The Event Duration and No. of Partitions will be automatically filled in, the same as the source event. Users can modify the No. of Outputs between 1 and 10 for the mapped event.
Click on the Map Kafka Event option to create the Mapped Event.
Please Note: The data of the Mapped Event itself cannot be directly flushed. To clear the data of a Mapped Event, the user needs to flush the Source Event to which it is mapped.
Please go through the below-given steps to check the meta information and download the data from the Kafka topic once the data has been sent to the Kafka event by the producer. The user can download the data from the Kafka event in CSV, Excel, and JSON format. The user will find the following tabs in the Kafka topic:
This tab displays information such as Display Name, Event Name, No. of Partitions, No. of Outputs, Event Duration, etc.
Navigate to the meta information tab to view details such as the Number Of Records in the Kafka topic, Approximate Data Size, Number of Partitions, Approximate Partition Size (in MB), Start and End Offset of partition, and Total Records in each partition.
Navigate to the Preview tab for a pipeline workflow using an Event component. Preferably select a pipeline workflow that has been activated to get data.
It will display the Data Preview.
Please Note:
Users can preview, download, and copy up to 100 data entries.
Click the Download icon to download the data in CSV, JSON, or Excel format.
Click on the Copy option to copy the data as a list of dictionaries.
Data Type icons are provided in each column header to indicate the data type. The icons help users quickly understand the contained data type in each column. The following icons are added:
String
Number
Date
DateTime
Float
Boolean
Users can drag and drop column separators to adjust column widths, allowing for a personalized view of the data.
This feature helps accommodate various data lengths and user preferences.
The event data is displayed in a structured table format.
The table supports sorting and filtering to enhance data usability. The previewed data can be filtered based on the Latest, Beginning, and Timestamp options.
The Timestamp filter option redirects the user to select a timestamp from the Time Range window. The user can either select a start and end date or choose from the available time ranges to apply and get a data preview.
Check out the illustration on sorting and filtering the Event Data Preview for a Pipeline workflow.
This tab holds the Spark schema of the data. Users can download the Spark schema of the data by clicking on the download option.
Kafka Events can be flushed to delete all present records. Flushing an Event retains the offsets of the topic by setting the start-offset value to the end-offset. Events can be flushed by using the Flush Event button beside the respective Event in the Event panel, and all Events can be flushed at once by using the Flush All button. This button is also present at the top of the Event panel.
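Flushing can be pictured in terms of partition offsets: the readable range of each partition collapses to zero because the start offset is moved up to the end offset. A minimal sketch of that bookkeeping (illustrative; real offsets are managed by Kafka):

```python
# Offsets per partition: {partition: [start_offset, end_offset]}
partitions = {0: [0, 120], 1: [5, 80], 2: [10, 10]}

def record_count(parts):
    """Readable records = sum of (end - start) across partitions."""
    return sum(end - start for start, end in parts.values())

def flush(parts):
    """Flush the event: set each start offset to its end offset."""
    for offsets in parts.values():
        offsets[0] = offsets[1]

before = record_count(partitions)   # records visible before the flush
flush(partitions)
after = record_count(partitions)    # zero records remain readable
```

New records produced after the flush still land at the end offset as usual, so the Event keeps working without being recreated.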
Check out the given illustration on how to Auto connect a Kafka Event to a pipeline component.
Drag any component to the pipeline workspace.
Configure all the parameters/fields of the dragged component.
Drag the Event from the Events Panel. Once the Kafka Event is dragged from the Events panel, it will automatically connect with the nearest component in the pipeline workspace. Please find the below-given walk-through for reference.
Click the Update Pipeline icon to save the pipeline workflow.
Data Sync Event in the Data Pipeline module is used to write the required data directly to any of the supported databases without using a Kafka Event and writer components in the pipeline. Please refer to the below image for reference:
It can be seen in the above image that the Data Sync Event directly writes the data read by the MongoDB reader component to the table of the configured database, without using a Kafka Event in between.
It does not need a Kafka Event to read the data. It can be connected with any component to read the data, and it writes it to the tables of the respective databases.
Pipeline complexity is reduced because a Kafka Event and a writer component are not needed in the pipeline.
Since writers are not used, resource consumption is low.
Once Data Sync is configured, multiple Data Sync Events can be created for the same configuration, and the data can be written to multiple tables.
Pre-requisite: Before Creating the Data Sync Event, the user has to configure the Data Sync section under the Settings page.
DB Sync Event enables direct write to the DB that helps in reducing the usage of additional compute resources like Writers in the Pipeline Workflow.
Please Note: The supported drivers for the Data Sync component are as listed below:
ClickHouse
MongoDB
MSSQL
MySQL
Oracle
PostgreSQL
Snowflake
Redshift
Check out the given video on how to create a Data Sync component and connect it with a Pipeline component.
Navigate to the Pipeline Editor page.
Click on the DB Sync tab.
Click on the Add New Data Sync (+) icon from the Toggle Event Panel.
The Create Data Sync window opens.
Provide a display name for the new Data Sync.
Select the Driver.
Put a checkmark in the Is Failover option to create a failover Data Sync. In this case, it is not enabled.
Click the Save option.
Please Note:
Only the configured drivers from the Settings page get listed under the Create Data Sync wizard.
The Data Sync component gets created as Failover Data Sync, if the Is Failover option is enabled while creating a Data Sync.
Drag and drop Data Sync Event to the workflow editor.
Click on the dragged Data Sync component.
The Basic Information tab appears with the following fields:
Display Name: Display name of the Data Sync
Event Name: Event name of the Data Sync
Table name: Specify the table name.
Driver: This field will be pre-selected.
Save Mode: Select save mode from the drop-down: Append or Upsert.
Composite Key: This field is optional. This field will only appear when upsert is selected as the Save Mode.
Click on the Save Data Sync icon to save the Data Sync information.
Once the Data Sync Event is dragged from the Events panel, it will automatically connect with the nearest component in the pipeline workflow.
Update and activate the pipeline.
Open the Logs tab to view whether the data gets written to the specified table.
Please Note:
In the Save mode, there are two available options.
Append
Upsert: One extra field will be displayed for upsert save mode i.e. Composite Key.
When the SQL component is set to Aggregate Query mode and connected to Data Sync, the data resulting from the query will not be written to the Data Sync Event. Please refer to the following image for a visual representation of the flow, and avoid using such a scenario.
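The two save modes, Append and Upsert with a Composite Key, can be sketched over an in-memory table, with the composite key deciding which existing rows an Upsert overwrites (illustrative only; the actual write goes to the configured database):

```python
def write(table, rows, mode, composite_key=None):
    """Apply Append or Upsert semantics to a list-of-dicts 'table'."""
    if mode == "Append":
        return table + rows                      # duplicates allowed
    if mode == "Upsert":
        key = lambda r: tuple(r[k] for k in composite_key)
        merged = {key(r): r for r in table}      # index existing rows by key
        for r in rows:
            merged[key(r)] = r                   # update if key exists, else insert
        return list(merged.values())
    raise ValueError(f"unknown save mode: {mode}")

target = [{"id": 1, "val": "old"}]
incoming = [{"id": 1, "val": "new"}, {"id": 2, "val": "x"}]

appended = write(target, incoming, "Append")
upserted = write(target, incoming, "Upsert", composite_key=["id"])
```

This is why the Composite Key field appears only for Upsert: without a key, there is no way to decide which existing row a new record should replace.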
The Events Panel appears on the right side of the Pipeline Workflow Editor page.
Click on the Data Sync tab.
Click on the Add New Data Sync (+) icon from the Toggle Event Panel.
The Create Data Sync dialog box opens.
Provide the Display Name.
Enable the Is Failover option.
Click the Save option.
The failover Data Sync will be created successfully.
Drag & drop the failover data sync to the pipeline workflow editor.
Click on the failover Data Sync and fill in the following field:
Table name: Enter the table name where the failed data has to be written.
Primary key: Enter the column name to be made as the primary key in the table. This field is optional.
Save mode: Select a save mode from the given choices. Select either Append or Upsert.
The failover Data Sync Event gets configured, and now the user can map it with any component in the pipeline workflow.
The image displays that the failover Data Sync Event is mapped with the SQL Component.
If the component fails, it will write all the data to the given table of the configured DB in the Failover Data Sync.
Please Note: There is no information available on UI that the failover Data Sync Event has been configured while hovering on the failover Data Sync Event.
This section focuses on the Configuration tab provided for any Pipeline component.
For each component that gets deployed, we have an option to configure the resources i.e., Memory and CPU.
We have two deployment types:
Docker
Spark
Go through the given illustration to understand how to configure a component using the Docker deployment type.
After we save the component and the pipeline, the component gets saved with the pipeline's default configuration, i.e., Low, Medium, or High. Once the pipeline is saved, the Configuration tab becomes visible in the component, offering the following settings.
For the Docker components, we have the Request and Limit configurations.
We can see the CPU and Memory options to be configured.
CPU: This is the CPU configuration where we can specify the number of cores that we need to assign to the component.
Please Note: In the configuration of Docker components, 1000 means 1 core; a value of 100 assigns 0.1 core to the component.
Memory: This option is to specify how much memory you want to dedicate to that specific component.
Please Note: 1024 means 1 GB in the configuration of the Docker components.
Instances: The number of instances is used for parallel processing. If N instances are given, that many pods get deployed.
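The unit conventions above (1000 CPU units = 1 core, 1024 memory units = 1 GB) can be sketched with a small hypothetical helper:

```python
# Hypothetical converters for the Docker component resource fields:
# 1000 CPU units equal 1 core, and 1024 memory units equal 1 GB.
def to_cores(cpu_units):
    return cpu_units / 1000

def to_gb(memory_units):
    return memory_units / 1024

print(to_cores(100))   # 0.1 core assigned to the component
print(to_gb(1024))     # 1.0 GB assigned to the component
```

So entering 100 in the CPU field assigns a tenth of a core, and 2048 in the Memory field assigns 2 GB.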
Go through the walk-through given below to understand the steps to configure a component with the Spark configuration type.
The Spark Components configuration is slightly different from the Docker components. When the Spark components are deployed, two pods come up:
Driver
Executor
Provide the Driver and Executor configurations separately.
Instances: The number of instances is used for parallel processing. If N instances are given in the Executor configuration, that many executor pods get deployed.
Please Note: As of the current release, the minimum requirement to deploy a driver is 0.1 core, and 1 core for the executor. This can change with upcoming versions of Spark.
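The separate Driver and Executor settings above can be pictured as a plain configuration object. This is an illustrative sketch, not the platform's actual schema; the driver and executor values follow the stated minimums, and the rest are assumptions:

```python
# Illustrative Spark component configuration: driver and executor are
# configured separately, and the executor "instances" value controls
# how many executor pods are deployed.
spark_config = {
    "driver":   {"cores": 0.1, "memory_gb": 1, "instances": 1},
    "executor": {"cores": 1,   "memory_gb": 2, "instances": 3},
}

# N executor instances -> N executor pods
executor_pods = spark_config["executor"]["instances"]
print(executor_pods)  # 3
```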
The Job Editor Page provides the user with all the necessary options and components to add a task and eventually create a Job workflow.
Once the Job gets saved in the Job list, the user can add Tasks to the canvas. The user can drag the required tasks to the canvas and configure them to create a Job workflow or dataflow.
The Job Editor appears, displaying the Task Palette containing various components mentioned as Tasks.
Navigate to the Job List page.
It lists all the Jobs and displays whether each is a Spark Job or a PySpark Job in the Type column.
Select a Job from the displayed list.
Click the View icon for the Job.
Please Note: Generally, the user performs this step in continuation of the Job creation, but if the user has exited the Job Editor, the above steps help to access it.
The Job Editor opens for the selected Job.
Drag and drop the newly required task, modify the existing task's meta information, or change the task configuration as required. (E.g., the DB Reader is dragged to the workspace in the image given below.)
Click on the dragged task icon.
The task-specific fields open, asking for the meta information of the dragged component.
Open the Meta Information tab and configure the required information for the dragged component.
Click the given icon to validate the connection.
Click the Save Task in Storage icon.
A notification message appears.
A dialog window opens to confirm the action of job activation.
Click the YES option to activate the job.
A success message appears confirming the activation of the job.
Once the job is activated, the user can see the job details while it runs by clicking the View icon; the Edit option for the job is replaced by the View icon when the job is activated.
Please Note:
If the job is not running in Development mode, there will be no data in the preview tab of tasks.
The Status of the Job changes on the Job List page when it is running in Development mode or is activated.
Users can get a sample of the task data under the Preview Data tab provided for the tasks in the Job Workflows.
Navigate to the Job Editor page for a selected job.
Open a task from where you want to preview records.
Click the Preview Data tab to view the content.
Please Note:
Users can preview, download, and copy up to 10 data entries.
Click the Download icon to download the data in CSV, JSON, or Excel format.
Click on the Copy option to copy the data as a list of dictionaries.
Users can drag and drop column separators to adjust column widths, allowing a personalized data view.
This feature helps accommodate various data lengths and user preferences.
The event data is displayed in a structured table format.
The table supports sorting and filtering to enhance data usability. The previewed data can be filtered based on the Latest, Beginning, and Timestamp options.
The Timestamp filter option redirects the user to select a timestamp from the Time Range window. The user can select a start and end date or choose from the available time ranges to apply and get a data preview.
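The Copy option mentioned above returns the previewed records as a list of dictionaries; a quick sketch of that shape (the column names here are hypothetical):

```python
# Copied preview data arrives as a list of dicts, one per record
# (column name -> value). The columns shown are illustrative only.
copied = [
    {"id": 1, "name": "Alice"},
    {"id": 2, "name": "Bob"},
]

# Each record is then usable as an ordinary Python dict:
names = [row["name"] for row in copied]
print(names)  # ['Alice', 'Bob']
```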
The Toggle Log Panel displays the Logs and Advanced Logs tabs for the Job Workflows.
Navigate to the Job Editor page.
Click the Toggle Log Panel icon on the header.
A panel toggles open, displaying the collective logs of the job under the Logs tab.
Select the Job Status tab to display the pod status of the complete Job.
Please Note: Any orphan task (a task placed in the Job Editor Workspace but not in use) causes the entire Job to fail, so avoid using orphan tasks in a Job. Please see the image below for reference: the highlighted DB Writer task is an orphan task, and if the Job is activated, it will fail because the orphan DB Writer task is not getting any input.
Navigate to the Job Editor page.
The Upgrade Job Version button displays a red dot, indicating that new updates are available for the selected job. Click the button; the Confirm dialog box appears.
Click the YES option.
After the job is upgraded, the Upgrade Job Version button gets disabled. It will display that the job is up to date and no updates are available.
Job Version details
Displays the latest version available for the Job upgrade.
Displays the Job's logs and the Job Status tab under the Log panel.
Redirects to the Job Monitoring page.
Development Mode
Runs the job in development mode.
Activate Job
Activates the current Job.
Update Job
Updates the current Job.
Edit Job
Edits the job name or configurations.
Delete Job
Deletes the current Job.
Push Job
Pushes the selected job to Git.
Redirects to the Job List page.
Redirects to the Settings page.
Opens the Job in full screen.
Formats the Job tasks in an arranged manner.
Zooms in on the Job workspace.
Zooms out of the Job workspace.
This task is used to read data from an Azure Blob container.
Drag the Azure Blob reader task to the Workspace and click on it to open the related configuration tabs for the same. The Meta Information tab opens by default.
Read using: There are three (3) options available under this tab: Shared Access Signature, Secret Key, and Principal Secret.
Provide the following details:
Shared Access Signature: This is a URI that grants restricted access rights to Azure Storage resources.
Account Name: Provide the Azure account name.
Container: Provide the container name from where the file is located and which has to be read.
File type: There are five (5) file types available under it:
CSV: The Header and Infer Schema fields get displayed with CSV as the selected File Type. Enable the Header option to get the header of the file being read, and enable the Infer Schema option to get the true schema of the columns in the CSV file.
JSON: The Multiline and Charset fields get displayed with JSON as the selected File Type. Enable the Multiline option if there is any multiline string in the file.
PARQUET: No extra field gets displayed with PARQUET as the selected File Type.
AVRO: This File Type provides two drop-down menus.
Compression: Select an option out of the Deflate and Snappy options.
Compression Level: This field appears for the Deflate compression option. It provides 0 to 9 levels via a drop-down menu.
XML: Select this option to read an XML file. If this option is selected, the following fields get displayed:
Infer schema: Enable this option to get the true schema of the columns.
Path: Provide the path of the file.
Root Tag: Provide the root tag from the XML files.
Row Tags: Provide the row tags from the XML files.
Join Row Tags: Enable this option to join multiple row tags.
Path: This option will appear once the file type is selected. Enter the path where the selected file type is located.
Read Directory: Check this box to read the specified directory.
Query: Provide Spark SQL query in this field.
Provide the following details:
Account Key: Enter the Azure account key. In Azure, an account key is a security credential that is used to authenticate access to storage resources, such as blobs, files, queues, or tables, in an Azure storage account.
Account Name: Provide the Azure account name.
Container: Provide the container name from where the blob is located. A container is a logical unit of storage in Azure Blob Storage that can hold blobs. It is similar to a directory or folder in a file system, and it can be used to organize and manage blobs.
File type: There are five (5) file types available under it:
CSV: The Header and Infer Schema fields get displayed with CSV as the selected File Type. Enable the Header option to get the header of the file being read, and enable the Infer Schema option to get the true schema of the columns in the CSV file.
JSON: The Multiline and Charset fields get displayed with JSON as the selected File Type. Enable the Multiline option if there is any multiline string in the file.
PARQUET: No extra field gets displayed with PARQUET as the selected File Type.
AVRO: This File Type provides two drop-down menus.
Compression: Select an option out of the Deflate and Snappy options.
Compression Level: This field appears for the Deflate compression option. It provides 0 to 9 levels via a drop-down menu.
XML: Select this option to read an XML file. If this option is selected, the following fields get displayed:
Infer schema: Enable this option to get the true schema of the columns.
Path: Provide the path of the file.
Root Tag: Provide the root tag from the XML files.
Row Tags: Provide the row tags from the XML files.
Join Row Tags: Enable this option to join multiple row tags.
Path: This option will appear once the file type is selected. Enter the path where the selected file type is located.
Read Directory: Check this box to read the specified directory.
Query: Provide Spark SQL query in this field.
Provide the following details:
Client ID: Provide Azure Client ID. The client ID is the unique Application (client) ID assigned to your app by Azure AD when the app was registered.
Tenant ID: Provide the Azure Tenant ID. Tenant ID (also known as Directory ID) is a unique identifier that is assigned to an Azure AD tenant, which represents an organization or a developer account. It is used to identify the organization or developer account that the application is associated with.
Client Secret: Enter the Azure Client Secret. Client Secret (also known as Application Secret or App Secret) is a secure password or key that is used to authenticate an application to Azure AD.
Account Name: Provide the Azure account name.
Container: Provide the container name from where the blob is located. A container is a logical unit of storage in Azure Blob Storage that can hold blobs. It is similar to a directory or folder in a file system, and it can be used to organize and manage blobs.
Query: Provide Spark SQL query in this field.
File type: There are five (5) file types available under it:
CSV: The Header and Infer Schema fields get displayed with CSV as the selected File Type. Enable the Header option to get the header of the file being read, and enable the Infer Schema option to get the true schema of the columns in the CSV file.
JSON: The Multiline and Charset fields get displayed with JSON as the selected File Type. Enable the Multiline option if there is any multiline string in the file.
PARQUET: No extra field gets displayed with PARQUET as the selected File Type.
AVRO: This File Type provides two drop-down menus.
Compression: Select an option out of the Deflate and Snappy options.
Compression Level: This field appears for the Deflate compression option. It provides 0 to 9 levels via a drop-down menu.
XML: Select this option to read an XML file. If this option is selected, the following fields get displayed:
Infer schema: Enable this option to get the true schema of the columns.
Path: Provide the path of the file.
Root Tag: Provide the root tag from the XML files.
Row Tags: Provide the row tags from the XML files.
Join Row Tags: Enable this option to join multiple row tags.
Please Note: Please click the Save Task In Storage icon to save the configuration for the dragged reader task.
Elasticsearch is an open-source search and analytics engine built on top of the Apache Lucene library. It is designed to help users store, search, and analyze large volumes of data in real-time. Elasticsearch is a distributed, scalable system that can be used to index and search structured, semi-structured, and unstructured data.
This task is used to read the data located in the Elasticsearch engine.
Drag the ES reader task to the Workspace and click on it to open the related configuration tabs for the same. The Meta Information tab opens by default.
Host IP Address: Enter the host IP Address for Elasticsearch.
Port: Enter the port to connect with Elasticsearch.
Index ID: Enter the Index ID to read a document in Elasticsearch. In Elasticsearch, an index is a collection of documents that share similar characteristics, and each document within an index has a unique identifier known as the index ID. The index ID is a unique string that is automatically generated by Elasticsearch and is used to identify and retrieve a specific document from the index.
Resource Type: Provide the resource type. In Elasticsearch, a resource type is a way to group related documents together within an index. Resource types are defined at the time of index creation, and they provide a way to logically separate different types of documents that may be stored within the same index.
Is Date Rich True: Enable this option if any fields in the reading file contain date or time information. The "date rich" feature in Elasticsearch allows for advanced querying and filtering of documents based on date or time ranges, as well as date arithmetic operations.
Username: Enter the username for Elasticsearch.
Password: Enter the password for Elasticsearch.
Query: Provide a Spark SQL query.
Please Note: Please click the Save Task In Storage icon to save the configuration for the dragged reader task.
This task reads the file from Amazon S3 bucket.
Please follow the steps below to configure the meta information of the S3 Reader task:
Drag the S3 reader task to the Workspace and click on it to open the related configuration tabs for the same. The Meta Information tab opens by default.
Bucket Name (*): Enter S3 bucket name.
Region (*): Provide the S3 region.
Access Key (*): Access key shared by AWS to log in.
Secret Key (*): Secret key shared by AWS to log in.
Table (*): Mention the Table or object name which is to be read.
File Type (*): Select a file type from the drop-down menu (CSV, JSON, PARQUET, AVRO, XML are the supported file types)
Limit: Set a limit for the number of records.
Query: Insert an SQL query (it supports queries containing a join statement as well).
Access Key (*): Access key shared by AWS to log in.
Secret Key (*): Secret key shared by AWS to log in.
Table (*): Mention the Table or object name which has to be read.
File Type (*): Select a file type from the drop-down menu (CSV, JSON, PARQUET, AVRO, XML are the supported file types)
Limit: Set a limit for the number of records.
Query: Insert an SQL query (it supports queries containing a join statement as well).
Provide a unique Key column name to partition data in Spark.
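As noted above, the Query field accepts a Spark SQL statement, including joins. A minimal illustrative example (the table and column names are hypothetical):

```python
# Hypothetical Spark SQL join query for the S3 Reader's Query field.
query = (
    "SELECT o.order_id, c.customer_name "
    "FROM orders o "
    "JOIN customers c ON o.customer_id = c.customer_id"
)
print(query)
```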
Please Note:
Please click the Save Task In Storage icon to save the configuration for the dragged reader task.
Once the file type is selected, additional fields will appear. Follow the steps below for the different file types.
CSV: The Header and Infer Schema fields get displayed with CSV as the selected File Type. Enable the Header option to get the header of the file being read, and enable the Infer Schema option to get the true schema of the columns in the CSV file.
JSON: The Multiline and Charset fields get displayed with JSON as the selected File Type. Enable the Multiline option if there is any multiline string in the file.
PARQUET: No extra field gets displayed with PARQUET as the selected File Type.
AVRO: This File Type provides two drop-down menus.
Compression: Select an option out of the Deflate and Snappy options.
Compression Level: This field appears for the Deflate compression option. It provides 0 to 9 levels via a drop-down menu.
XML: Select this option to read an XML file. If this option is selected, the following fields get displayed:
Infer schema: Enable this option to get the true schema of the columns.
Path: Provide the path of the file.
Root Tag: Provide the root tag from the XML files.
Row Tags: Provide the row tags from the XML files.
Join Row Tags: Enable this option to join multiple row tags.
Amazon Athena is an interactive query service that makes it easy to analyze data directly in Amazon Simple Storage Service (Amazon S3) using standard SQL. With a few actions in the AWS Management Console, you can point Athena at your data stored in Amazon S3 and begin using standard SQL to run ad-hoc queries and get results in seconds.
Athena Query Executer task enables users to read data directly from the external table created in AWS Athena.
Please Note: Please go through the demonstration given below to configure the Athena Query Executer in Jobs.
Region: Enter the region name where the bucket is located.
Access Key: Enter the AWS Access Key of the account that must be used.
Secret Key: Enter the AWS Secret Key of the account that must be used.
Table Name: Enter the name of the external table created in Athena.
Database Name: Name of the database in Athena in which the table has been created.
Limit: Enter the number of records to be read from the table.
Data Source: Enter the Data Source name configured in Athena. Data Source in Athena refers to your data's location, typically an S3 bucket.
Workgroup: Enter the Workgroup name configured in Athena. The Workgroup in Athena is a resource type to separate query execution and query history between Users, Teams, or Applications running under the same AWS account.
Query location: Enter the path where the results of the queries done in the Athena query editor are saved in the CSV format. Users can find this path under the Settings tab in the Athena query editor as Query Result Location.
Query: Enter the Spark SQL query.
Sample Spark SQL query that can be used in Athena Reader:
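A minimal example of such a query (the external table name "sales_data" is an assumption for illustration):

```python
# Hypothetical Spark SQL query for the Athena Query Executer's Query field.
query = "SELECT * FROM sales_data LIMIT 100"
print(query)
```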
This task reads data from the Network pool of the Sandbox.
Drag the Sandbox reader task to the Workspace and click on it to open the related configuration tabs for the same. The Meta Information tab opens by default.
Storage Type: This field is pre-defined.
Sandbox File: Select the file name from the drop-down.
File Type: Select the file type from the drop down.
There are five (5) file types available under it:
CSV: The Header and Infer Schema fields get displayed with CSV as the selected File Type. Enable the Header option to get the header of the file being read, and enable the Infer Schema option to get the true schema of the columns in the CSV file.
JSON: The Multiline and Charset fields get displayed with JSON as the selected File Type. Enable the Multiline option if there is any multiline string in the file.
PARQUET: No extra field gets displayed with PARQUET as the selected File Type.
AVRO: This File Type provides two drop-down menus.
Compression: Select an option out of the Deflate and Snappy options.
Compression Level: This field appears for the Deflate compression option. It provides 0 to 9 levels via a drop-down menu.
XML: Select this option to read an XML file.
Query: Provide Spark SQL query in this field.
Please Note: Please click the Save Task In Storage icon to save the configuration for the dragged reader task.
This task is used to write the data to an Amazon S3 bucket.
Drag the S3 writer task to the Workspace and click on it to open the related configuration tabs for the same. The Meta Information tab opens by default.
Bucket Name (*): Enter S3 Bucket name.
Region (*): Provide S3 region.
Access Key (*): Access key shared by AWS to log in.
Secret Key (*): Secret key shared by AWS to log in.
Table (*): Mention the Table or object name to which the data is to be written.
File Type (*): Select a file type from the drop-down menu (CSV, JSON, PARQUET, AVRO are the supported file types).
Save Mode: Select the Save mode from the drop down.
Append
Schema File Name: Upload spark schema file in JSON format.
Please Note: Please click the Save Task In Storage icon to save the configuration for the dragged writer task.
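The Spark schema file mentioned above is plain JSON. A minimal sketch of its expected shape (the field names and types are assumptions for illustration):

```python
import json

# Illustrative Spark StructType schema serialized as JSON; a file with
# this shape would be uploaded in the Schema File Name field.
schema = {
    "type": "struct",
    "fields": [
        {"name": "id",   "type": "integer", "nullable": True, "metadata": {}},
        {"name": "name", "type": "string",  "nullable": True, "metadata": {}},
    ],
}
print(json.dumps(schema, indent=2))
```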
This task is used to write data in the following databases: MYSQL, MSSQL, Oracle, ClickHouse, Snowflake, PostgreSQL, Redshift.
Drag the DB writer task to the Workspace and click on it to open the related configuration tabs for the same. The Meta Information tab opens by default.
Host IP Address: Enter the Host IP Address for the selected driver.
Port: Enter the port for the given IP Address.
Database name: Enter the Database name.
Table name: Provide a single table name or multiple table names. If multiple table names are given, enter them separated by commas (,).
User name: Enter the user name for the provided database.
Password: Enter the password for the provided database.
Driver: Select the driver from the drop-down. The following drivers are supported: MYSQL, MSSQL, Oracle, ClickHouse, Snowflake, PostgreSQL, and Redshift.
Schema File Name: Upload spark schema file in JSON format.
Save Mode: Select the Save mode from the drop down.
Append
Overwrite
Query: Write the create table(DDL) query.
Please Note: Please click the Save Task In Storage icon to save the configuration for the dragged writer task.
Elasticsearch is an open-source search and analytics engine built on top of the Apache Lucene library. It is designed to help users store, search, and analyze large volumes of data in real-time. Elasticsearch is a distributed, scalable system that can be used to index and search structured, semi-structured, and unstructured data.
This task is used to write the data to the Elasticsearch engine.
Drag the ES writer task to the Workspace and click on it to open the related configuration tabs for the same. The Meta Information tab opens by default.
Host IP Address: Enter the host IP Address for Elasticsearch.
Port: Enter the port to connect with Elasticsearch.
Index ID: Enter the Index ID to write a document in Elasticsearch. In Elasticsearch, an index is a collection of documents that share similar characteristics, and each document within an index has a unique identifier known as the index ID. The index ID is a unique string that is automatically generated by Elasticsearch and is used to identify and retrieve a specific document from the index.
Mapping ID: Provide the Mapping ID. In Elasticsearch, a mapping ID is a unique identifier for a mapping definition that defines the schema of the documents in an index. It is used to differentiate between different types of data within an index and to control how Elasticsearch indexes and searches data.
Resource Type: Provide the resource type. In Elasticsearch, a resource type is a way to group related documents together within an index. Resource types are defined at the time of index creation, and they provide a way to logically separate different types of documents that may be stored within the same index.
Username: Enter the username for Elasticsearch.
Password: Enter the password for Elasticsearch.
Schema File Name: Upload spark schema file in JSON format.
Save Mode: Select the Save mode from the drop down.
Append
Selected columns: The user can select the specific column, provide some alias name and select the desired data type of that column.
Please Note: Please click the Save Task In Storage icon to save the configuration for the dragged writer task.
Azure is a cloud computing platform and service. It provides a range of cloud services, including infrastructure as a service (IaaS), platform as a service (PaaS), and software as a service (SaaS) offerings, as well as tools for building, deploying, and managing applications in the cloud.
Azure Writer task is used to write the data in the Azure Blob Container.
Drag the Azure writer task to the Workspace and click on it to open the related configuration tabs for the same. The Meta Information tab opens by default.
Write using: There are three (3) options available under this tab:
Shared Access Signature:
Secret Key
Principal Secret
Provide the following details:
Shared Access Signature: This is a URI that grants restricted access rights to Azure Storage resources.
Account Name: Provide the Azure account name.
Container: Provide the container name from where the blob is located. A container is a logical unit of storage in Azure Blob Storage that can hold blobs. It is similar to a directory or folder in a file system, and it can be used to organize and manage blobs.
Blob Name: Enter the Blob name. A blob is a type of object storage that is used to store unstructured data, such as text or binary data, like images or videos.
File Format: There are four (4) file types available under it; select the file format in which the data has to be written:
CSV
JSON
PARQUET
AVRO
Save Mode: Select the Save mode from the drop down.
Append
Overwrite
Schema File Name: Upload spark schema file in JSON format.
Account Key: Enter the Azure account key. In Azure, an account key is a security credential that is used to authenticate access to storage resources, such as blobs, files, queues, or tables, in an Azure storage account.
Account Name: Provide the Azure account name.
Container: Provide the container name from where the blob is located. A container is a logical unit of storage in Azure Blob Storage that can hold blobs. It is similar to a directory or folder in a file system, and it can be used to organize and manage blobs.
Blob Name: Enter the Blob name. A blob is a type of object storage that is used to store unstructured data, such as text or binary data, like images or videos.
File type: There are four (4) file types available under it:
CSV
JSON
PARQUET
AVRO
Schema File Name: Upload spark schema file in JSON format.
Save Mode: Select the Save mode from the drop down.
Append
Overwrite
Provide the following details:
Client ID: Provide Azure Client ID. The client ID is the unique Application (client) ID assigned to your app by Azure AD when the app was registered.
Tenant ID: Provide the Azure Tenant ID. Tenant ID (also known as Directory ID) is a unique identifier that is assigned to an Azure AD tenant, which represents an organization or a developer account. It is used to identify the organization or developer account that the application is associated with.
Client Secret: Enter the Azure Client Secret. Client Secret (also known as Application Secret or App Secret) is a secure password or key that is used to authenticate an application to Azure AD.
Account Name: Provide the Azure account name.
Container: Provide the container name from where the blob is located. A container is a logical unit of storage in Azure Blob Storage that can hold blobs. It is similar to a directory or folder in a file system, and it can be used to organize and manage blobs.
Blob Name: Enter the Blob name. A blob is a type of object storage that is used to store unstructured data, such as text or binary data, like images or videos.
File type: There are four (4) file types available under it:
CSV
JSON
PARQUET
AVRO
Save Mode: Select the Save mode from the drop down.
Append
Overwrite
Schema File Name: Upload spark schema file in JSON format.
Please Note: Please click the Save Task In Storage icon to save the configuration for the dragged writer task.
Click the Toggle Event Panel icon from the header.
The Events Panel appears, and the Toggle Event Panel icon changes, suggesting that the event panel is displayed.
Click the Event Panel icon from the Pipeline Editor page.
Click the Activate Job icon to activate the job. (It appears only after the newly created job gets successfully updated.)
Jobs can be run in Development mode as well. The user can preview only 10 records in a task's Preview tab when the job is running in Development mode, and if any writer task is used in the job, it writes only 10 records to the table of the given database.
Please Note: Click the Delete icon from the Job Editor page to delete the selected job. The deleted job gets removed from the Job list.
A notification message appears stating that the version is upgraded.
This task writes the data to MongoDB collection.
Drag the MongoDB writer task to the Workspace and click on it to open the related configuration tabs for the same. The Meta Information tab opens by default.
Connection Type: Select the connection type from the drop-down:
Standard
SRV
Connection String
Port (*): Provide the Port number (It appears only with the Standard connection type).
Host IP Address (*): The IP address of the host.
Username (*): Provide a username.
Password (*): Provide a valid password to access the MongoDB.
Database Name (*): Provide the name of the database where you wish to write data.
Additional Parameters: Provide details of the additional parameters.
Schema File Name: Upload Spark Schema file in JSON format.
Save Mode: Select the Save mode from the drop down.
Append: This operation adds the data to the collection.
Ignore: "Ignore" is an operation that skips the insertion of a record if a duplicate record already exists in the database. This means that the new record will not be added, and the database will remain unchanged. "Ignore" is useful when you want to prevent duplicate entries in a database.
Upsert: It is a combination of "update" and "insert". It is an operation that updates a record if it already exists in the database or inserts a new record if it does not exist. This means that "upsert" updates an existing record with new data or creates a new record if the record does not exist in the database.
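The Ignore and Upsert save modes described above can be pictured with a plain dict standing in for a collection keyed by `_id`. This is a sketch of the semantics, not actual MongoDB code; the record fields are hypothetical:

```python
# Simulate the Ignore and Upsert save modes against a dict keyed by "_id".
def save(collection, record, mode):
    key = record["_id"]
    if mode == "ignore":
        if key not in collection:       # skip if a duplicate already exists
            collection[key] = record
    elif mode == "upsert":
        collection[key] = record        # update if present, insert otherwise

db = {1: {"_id": 1, "city": "Pune"}}
save(db, {"_id": 1, "city": "Mumbai"}, "ignore")  # duplicate -> unchanged
save(db, {"_id": 1, "city": "Mumbai"}, "upsert")  # existing  -> updated
save(db, {"_id": 2, "city": "Delhi"},  "upsert")  # missing   -> inserted
```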
Please Note: Please click the Save Task In Storage icon to save the configuration for the dragged writer task.
This task writes data to the Network pool of the Sandbox.
Drag the Sandbox writer task to the Workspace and click on it to open the related configuration tabs for the same. The Meta Information tab opens by default.
Storage Type: This field is pre-defined.
Sandbox File: Enter the file name.
File Type: Select the file type in which the data has to be written. The following file types are supported here:
CSV
JSON
Save Mode: Select the Save mode from the drop down.
Append
Overwrite
Schema File Name: Upload spark schema file in JSON format.
Please Note: Please click the Save Task In Storage icon to save the configuration for the dragged writer task.
In Apache Kafka, a "producer" is a client application or program that is responsible for publishing (or writing) messages to a Kafka topic.
A Kafka producer sends messages to Kafka brokers, which are then distributed to the appropriate consumers based on the topic, partitioning, and other configurable parameters.
Drag the Kafka Producer task to the Workspace and click on it to open the related configuration tabs for the same. The Meta Information tab opens by default.
Topic Name: Specify the topic name where the user wants to produce data.
Security Type: Select the security type from the drop-down:
Plain Text
SSL
Is External: Enable this option to produce data to an external Kafka topic. The Bootstrap Server and Config fields appear once this option is enabled.
Bootstrap Server: Enter the external bootstrap server details.
Config: Enter configuration details.
Host Aliases: In Apache Kafka, a host alias (also known as a hostname alias) is an alternative name that can be used to refer to a Kafka broker in a cluster. Host aliases are useful when you need to refer to a broker using a name other than its actual hostname.
IP: Enter the IP.
Host Names: Enter the host names.
Please Note: Please click the Save Task In Storage icon to save the configuration for the dragged writer task.
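The Meta Information fields above map naturally onto a standard Kafka producer configuration. The sketch below builds such a configuration and JSON-serializes a record using only the standard library; actually producing the message requires a Kafka client library and a reachable broker, and all field values here are hypothetical:

```python
import json

def build_producer_config(topic, security_type, is_external=False,
                          bootstrap_server=None, extra_config=None):
    """Map the Meta Information fields onto a producer config dict.

    Hypothetical helper; real sending requires a Kafka client library
    (given this config) and a running broker.
    """
    config = {
        "topic": topic,
        "security.protocol": "SSL" if security_type == "SSL" else "PLAINTEXT",
    }
    if is_external:
        # 'Bootstrap Server' and 'Config' only apply to external topics.
        config["bootstrap.servers"] = bootstrap_server
        config.update(extra_config or {})
    return config

cfg = build_producer_config(
    topic="orders",                              # hypothetical topic name
    security_type="SSL",
    is_external=True,
    bootstrap_server="kafka.example.com:9092",   # hypothetical broker
)

# Records are typically serialized to JSON bytes before being produced.
record = json.dumps({"order_id": 42, "status": "NEW"}).encode("utf-8")
```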
This page aims to explain the various transformation options provided on the Jobs Editor page.
The following transformations are provided under the Transformations section.
Alter Columns
Select Columns
Date Formatter
Query
Filter
Formula
Join
Aggregation
Sort Task
The Alter Columns command is used to change the data type of a column in a table.
In the Alter Columns tab, add the column whose data type needs to be changed.
Name (*): Column name.
Alias Name (*): New column name.
Column Type (*): Select the data type from the drop-down.
Add New Column: Multiple columns can be added for desired modification in the datatype.
This transformation helps to select particular columns from a table.
Name (*): Column name.
Alias Name (*): New column name.
Column Type (*): Select the data type from the drop-down.
Add New Column: Multiple columns can be added for the desired result.
It helps in converting Date and Datetime columns to a desired format.
Name (*): Column name.
Input Format (*): The format of the input column; the function supports a total of 61 different formats.
Output Format(*): The format in which the output will be given.
Output Column Name(*): Name of the output column.
Add New Row: To insert a new row.
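The input/output format pair behaves like standard date-pattern conversion. A quick Python sketch of the idea (the formats and values below are hypothetical examples, not the platform's format codes):

```python
from datetime import datetime

def format_date(value, input_format, output_format):
    # Parse the raw value using the input format, then render it in
    # the desired output format (the value of the Output Column).
    return datetime.strptime(value, input_format).strftime(output_format)

# e.g. convert "2023-01-15" (year-month-day) to "15/01/2023" (day/month/year)
converted = format_date("2023-01-15", "%Y-%m-%d", "%d/%m/%Y")
```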
The Query transformation allows you to write SQL (DML) queries such as Select queries and data view queries.
Query(*): Provide a valid query to transform data.
Table Name(*): Provide the table name.
Schema File Name: Upload Spark Schema file in JSON format.
Choose File: Upload a file from the system.
Please Note: Alter queries will not work.
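For example, the Query field typically holds a plain Select statement over the given table name, while DDL statements such as Alter are rejected. A hypothetical guard illustrating the restriction (not the platform's actual validation code):

```python
def is_supported_query(query):
    # Only DML-style read queries (e.g. SELECT) are allowed in the
    # Query transformation; ALTER (DDL) statements will not work.
    first_word = query.strip().split()[0].upper()
    return first_word == "SELECT"

assert is_supported_query("SELECT name, age FROM employees WHERE age > 30")
assert not is_supported_query("ALTER TABLE employees ADD COLUMN salary INT")
```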
The Filter transformation allows the user to filter table data based on different defined conditions.
Field Name(*): Provide field name.
Condition (*): Select a condition; 8 condition operations are available within this function.
Logical Condition (*): Select AND/OR to combine multiple filter conditions.
Add New Column: Adds a new Column.
It gives computation results based on the selected formula type.
Field Name(*): Provide field name.
Formula Type(*): Select a Formula Type from the drop-down option.
Math (22 Math Operations)
String (16 String Operations)
Bitwise (3 Bitwise Operations)
Output Field Name(*): Provide the output field name.
Add New Column: Adds a new column.
It joins two tables based on the specified column conditions.
Join Type (*): Provides drop-down menu to choose a Join type.
The supported Join Types are:
Inner
Outer
Full
Full outer
Left outer
Left
Right outer
Right
Left semi
Left anti
Left Column(*): Conditional column from the left table.
Right Column (*): Conditional column from the right table.
Add New Column: Adds a new column.
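The difference between the common join types can be sketched in plain Python over two small hypothetical tables joined on an id column (Spark performs the real join; this only illustrates the semantics):

```python
def inner_join(left, right, key):
    # Inner: keep only rows whose key exists in both tables.
    right_by_key = {r[key]: r for r in right}
    return [{**row, **right_by_key[row[key]]}
            for row in left if row[key] in right_by_key]

def left_outer_join(left, right, key):
    # Left outer: keep every left row; fill from right when the key matches.
    right_by_key = {r[key]: r for r in right}
    return [{**row, **right_by_key.get(row[key], {})} for row in left]

employees = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]
salaries  = [{"id": 1, "salary": 50000}]

print(inner_join(employees, salaries, "id"))       # only id 1 matches
print(left_outer_join(employees, salaries, "id"))  # Bob kept, no salary
```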
An aggregate task performs a calculation on a set of values and returns a single value by using the Group By column.
Group By Columns(*): Provide a name for the group by column.
Field Name(*): Provide the field name.
Operation (*): 30 operations are available within this function.
Alias(*): Provide an alias name.
Add New Column: Adds a new column.
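Conceptually, the aggregate task groups rows by the Group By column, applies the chosen operation to the field, and emits one row per group under the alias. A small sketch with hypothetical column names and a handful of the available operations:

```python
from collections import defaultdict

def aggregate(rows, group_by, field, operation, alias):
    # Group rows by the Group By column, reduce the chosen field with
    # the selected operation, and emit one row per group under the alias.
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_by]].append(row[field])
    ops = {"sum": sum, "min": min, "max": max,
           "avg": lambda v: sum(v) / len(v)}
    return [{group_by: k, alias: ops[operation](v)} for k, v in groups.items()]

sales = [{"region": "East", "amount": 100},
         {"region": "East", "amount": 200},
         {"region": "West", "amount": 50}]
result = aggregate(sales, "region", "amount", "sum", "total_amount")
# → [{'region': 'East', 'total_amount': 300}, {'region': 'West', 'total_amount': 50}]
```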
This transformation sorts all the data from a table based on the selected column and order.
Sort Key(*): Provide a sort key.
Order(*): Select an option out of Ascending or Descending.
Add New Column: Adds a new column.
Write PySpark scripts and run them flawlessly in the Jobs.
This feature allows users to write their own PySpark script and run it in the Jobs section of the Data Pipeline module.
Before creating the PySpark Job, the user has to create a project in the Data Science Lab module under the PySpark environment. Please refer to the below image for reference:
Please go through the below given demonstration to create and configure a PySpark Job.
Open the pipeline homepage and click on the Create option.
A new panel opens from the right-hand side. Click the Create button in the Job option.
Provide the following information:
Enter name: Enter the name for the job.
Job Description: Add description of the Job (It is an optional field).
Job Baseinfo: Select the PySpark Job option from the drop-down in Job Base Information.
Trigger By: The PySpark Job can be triggered by another Job or PySpark Job, in two scenarios:
On Success: Select a job from drop-down. Once the selected job is run successfully, it will trigger the PySpark Job.
On Failure: Select a job from the drop-down. Once the selected job fails, it will trigger the PySpark Job.
Is Schedule: Put a checkmark in the given box to schedule the new Job.
Spark config: Select resource for the job.
Click on Save option to save the Job.
The PySpark Job gets saved and it will redirect the user to the Job workspace.
Once the PySpark Job is created, follow the below given steps to configure the Meta Information tab of the PySpark Job.
Project Name: Select the same Project using the drop-down menu where the concerned Notebook has been created.
Script Name: This field will list the exported Notebook names which are exported from the Data Science Lab module to Data Pipeline.
Please Note: The script written in the DS Lab module should be inside a function. Refer to the Export to Pipeline page for more details on how to export a PySpark script to the Data Pipeline module.
External Library: If any external libraries are used in the script, the user can mention them here, separating multiple library names with commas (,).
Start Function: All the function names used in the script are listed here. Select the start function to execute the Python script.
Script: The exported script appears under this space.
Input Data: If the function takes any parameters, provide the parameter name as the key and the parameter's value as the value in this field.
Please Note: The following JDBC connectors are currently supported:
MySQL
MSSQL
Oracle
MongoDB
PostgreSQL
ClickHouse
Write Python scripts and run them flawlessly in the Jobs.
This feature allows users to write their own Python script and run it in the Jobs section of the Data Pipeline module.
Before creating the Python Job, the user has to create a project in the Data Science Lab module under the Python environment. Please refer to the below image for reference:
After creating the Data Science project, the users need to activate it and create a Notebook where they can write their own Python script. Once the script is written, the user must save it and export it to be able to use it in the Python Jobs.
Navigate to the Data Pipeline module homepage.
Open the pipeline homepage and click on the Create option.
A new panel opens from the right-hand side. Click the Create button in the Job option.
Enter a name for the new Job.
Describe the Job (Optional).
Job Baseinfo: Select Python Job from the drop-down.
Trigger By: There are two options for triggering a job on success or failure of another job:
Success Job: On successful execution of the selected job the current job will be triggered.
Failure Job: On failure of the selected job the current job will be triggered.
Is Scheduled?
A job can be scheduled for a particular timestamp. Every time at the same timestamp the job will be triggered.
Job must be scheduled according to UTC.
On demand: Check the "On demand" option to create an on-demand job. For more information on Python Job (On demand), check here.
Docker Configuration: Select a resource allocation option using the radio button. The given choices are:
Low
Medium
High
Provide the resources required to run the Python Job in the Limit and Request sections.
Limit: Enter max CPU and Memory required for the Python Job.
Request: Enter the CPU and Memory required for the job at the start.
Instances: Enter the number of instances for the Python Job.
Alert: Please refer to the Job Alerts page to configure alerts in job.
Click the Save option to save the Python Job.
The Python Job gets saved, and it will redirect the users to the Job Editor workspace.
Check out the below given demonstration to configure a Python Job.
Once the Python Job is created, follow the below given steps to configure the Meta Information tab of the Python Job.
Project Name: Select the same Project using the drop-down menu where the Notebook has been created.
Script Name: This field will list the exported Notebook names which are exported from the Data Science Lab module to Data Pipeline.
External Library: If any external libraries are used in the script, the user can mention them here, separating multiple library names with commas (,).
Start Function: All the function names used in the script are listed here. Select the start function to execute the Python script.
Script: The exported script appears under this space.
Input Data: If the function takes any parameters, provide the parameter name as the key and the parameter's value as the value in this field.
The Alert feature in the job allows users to send an alert message to the specified channel (Teams or Slack) in the event of either the success or failure of the configured job. Users can also choose both success and failure options to send an alert for the configured job.
Webhook URL: Provide the Webhook URL of the selected channel group where the Alert message needs to be sent.
Type: Message Card. (This field will be pre-filled.)
Theme Color: Enter the hexadecimal color code for the ribbon color in the selected channel. Please refer to the image given at the bottom of this page for reference.
Sections: This tab contains the following fields:
Activity Title: The title of the alert to be sent to the Teams channel. Enter the Activity Title as per the requirement.
Activity Subtitle: Enter the Activity Subtitle. Please refer to the image given at the bottom of this page for reference.
Text: Enter the text message which should be sent along with Alert.
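The Teams fields above correspond to a standard MessageCard JSON payload POSTed to the Webhook URL. A sketch of the payload such an alert would carry (all values are hypothetical):

```python
import json

# Hypothetical values; the Webhook URL comes from the Teams channel connector.
payload = {
    "@type": "MessageCard",                 # the pre-filled Type field
    "@context": "http://schema.org/extensions",
    "themeColor": "0076D7",                 # hexadecimal ribbon color
    "sections": [{
        "activityTitle": "Job Alert: daily_load",
        "activitySubtitle": "Status: Success",
        "text": "The configured job finished successfully.",
    }],
}

body = json.dumps(payload).encode("utf-8")
# POST `body` to the Webhook URL with Content-Type application/json,
# e.g. via urllib.request.Request(webhook_url, data=body, ...).
```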
Webhook URL: Provide the Webhook URL of the selected channel group where the Alert message needs to be sent.
Attachments: This tab contains the following fields:
Title: The title of the alert to be sent to the selected channel. Enter the title as per the requirement.
Color: Enter the hexadecimal color code for the ribbon color in the Slack channel. Please refer to the image given at the bottom of this page for reference.
Text: Enter the text message which should be sent along with Alert.
Footer: The footer refers to additional information appended at the end of a message in a Slack channel, such as a signature, contact information, or other supplementary details. Footers are often used to provide extra context for the message content.
Footer Icon: In Slack, the footer icon refers to an icon or image that is displayed at the bottom of a message or attachment. The footer icon can be a company logo, an application icon, or any other image that represents the entity responsible for the message. Enter image URL as the value of Footer icon.
Follow these steps to set the Footer icon in Slack:
Go to the desired image that has to be used as the footer icon.
Right-click on the image.
Select the 'Copy image address' to get the image URL.
Now, the obtained image URL can be used as the value for the footer icon in Slack.
Sample image URL for Footer icon:
Sample Hexadecimal Color code which can be used in Job Alert.
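Similarly, the Slack fields map onto Slack's classic attachments payload. A hypothetical example of the JSON that would be POSTed to the Slack Webhook URL (all values illustrative):

```python
import json

payload = {
    "attachments": [{
        "title": "Job Alert: daily_load",      # Title field
        "color": "#36a64f",                    # hexadecimal ribbon color
        "text": "The configured job failed.",  # Text field
        "footer": "Data Pipeline Alerts",      # Footer field
        "footer_icon": "https://example.com/icon.png",  # image URL
    }],
}

body = json.dumps(payload).encode("utf-8")
# POST `body` to the Slack Webhook URL with Content-Type application/json.
```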
The Script Executor job is designed to execute code snippets or scripts written in various programming languages such as Go, Julia, and Python. This job allows users to fetch code from their Git repository and execute it seamlessly.
Navigate to the Data Pipeline module homepage.
Open the pipeline homepage and click on the Create option.
The new panel opens from right hand side. Click on Create button in Job option.
Enter a name for the new Job.
Describe the Job (Optional).
Job Baseinfo: Select Script Executor from the drop-down.
Trigger By: There are two options for triggering a job on success or failure of another job:
Success Job: On successful execution of the selected job the current job will be triggered.
Failure Job: On failure of the selected job the current job will be triggered.
Is Scheduled?
A job can be scheduled for a particular timestamp. Every time at the same timestamp the job will be triggered.
Job must be scheduled according to UTC.
Docker Configuration: Select a resource allocation option using the radio button. The given choices are:
Low
Medium
High
Provide the resources required to run the job in the Limit and Request sections.
Limit: Enter the max CPU and Memory required for the job.
Request: Enter the CPU and Memory required for the job at the start.
Instances: Enter the number of instances for the job.
Click the Save option to save the job.
The Script Executor Job gets saved, and it will redirect the users to the Job Editor workspace.
Please go through the below given demonstration to configure the Script Executor.
Git Config: Select an option from the drop-down.
Personal: If this option is selected, provide the following information:
Git URL: Enter the Git URL.
User Name: Enter the GIT username.
Token: Enter the Access token or API token for authentication and authorization when accessing the Git repository, commonly used for secure automated processes like code fetching and execution.
Branch: Specify the Git branch for code fetching.
Script Type: Select the script's language for execution from the drop-down:
GO
Julia
Python
Start Script: Enter the script name (with extension) which has to be executed.
For example, if Python is selected as the Script type, then the script name will be in the following format: script_name.py.
If Golang is selected as the Script type, then the script name will be in the following format: script_name.go.
Start Function: Specify the initial function or method within the Start Script for execution, particularly relevant for languages like Python with reusable functions.
Repository: Provide the Git repository name.
Input Arguments: Allows users to provide input parameters or arguments needed for code execution, such as dynamic values, configuration settings, or user inputs affecting script behavior.
Admin: If this option is selected, then Git URL, User Name, Token & Branch fields have to be configured in the platform in order to use Script Executor.
In this option, the user has to provide the following fields:
Script Type
Start Script
Start Function
Repository
Input Arguments
Please Note: Follow the below given steps to configure GitLab/GitHub credentials in the Admin Settings in the platform:
Navigate to Admin >> Configurations >> Version Control.
From the first drop-down menu, select the Version.
Choose 'DsLabs' as the module from the drop-down.
Select either GitHub or GitLab based on the requirement for Git type.
Enter the host for the selected Git type.
Provide the token key associated with the Git account.
Select a Git project.
Choose the branch where the files are located.
After providing all the details correctly, click on 'Test,' and if the authentication is successful, an appropriate message will appear. Subsequently, click on the 'Save' option.
To complete the configuration, navigate to My Account >> Configuration. Enter the Git Token and Git Username, then save the changes.
The on-demand Python job functionality allows you to initiate a Python job based on a payload using an API call at the desired time.
Before creating the Python (On demand) Job, the user has to create a project in the Data Science Lab module under the Python environment. Please refer to the below image for reference:
Please follow the below given steps to configure a Python (On demand) Job:
Navigate to the Data Pipeline homepage.
Click on the Create Job icon.
The New Job dialog box appears redirecting the user to create a new Job.
Enter a name for the new Job.
Describe the Job (Optional).
Job Baseinfo: In this field, there are three options:
Spark Job
PySpark Job
Python Job
Select the Python Job option from Job Baseinfo.
Check the On demand option as shown in the below image.
Docker Configuration: Select a resource allocation option using the radio button. The given choices are:
Low
Medium
High
Provide the resources required to run the Python Job in the Limit and Request sections.
Limit: Enter max CPU and Memory required for the Python Job.
Request: Enter the CPU and Memory required for the job at the start.
Instances: Enter the number of instances for the Python Job.
The payload field will appear once the "On Demand" option is checked. Enter the payload in the form of a JSON array containing JSON objects.
Trigger By: There are two options for triggering a job on success or failure of another job:
Success Job: On successful execution of the selected job the current job will be triggered.
Failure Job: On failure of the selected job the current job will be triggered.
Click the Save option to create the job.
A success message appears to confirm the creation of a new job.
The Job Editor page opens for the newly created job.
Based on the given number of instances, topics will be created, and the payload will be distributed across these topics. The logic for distributing the payload among the topics is as follows:
The number of records on each topic can be calculated as the ceiling value of the ratio between the payload size and the number of instances.
For example:
Payload Size: 10
Number of Topics=Number of Instances: 3
The number of records on each topic is calculated as the ceiling of Payload Size divided by the Number of Topics: ceil(10 / 3) = 4.
In this case, each topic will hold the following number of records:
Topic 1: 4
Topic 2: 4
Topic 3: 2 (As there are only 2 records left)
In the On-Demand Job system, topics are named using the Job_Id, followed by an underscore (_) symbol and successive numbers starting from 0 up to n-1, where n is the number of instances. For clarity, consider the following example:
Job_ID: job_13464363406493
Number of instances: 3
In this scenario, three topics will be created, and their names will follow the pattern:
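The distribution and naming rules above can be sketched as follows (the Job_Id is taken from the example; the helper names are illustrative):

```python
import math

def distribute_payload(payload_size, instances):
    # Each topic holds at most ceil(payload_size / instances) records;
    # the last topic takes whatever remains.
    per_topic = math.ceil(payload_size / instances)
    counts, remaining = [], payload_size
    for _ in range(instances):
        take = min(per_topic, remaining)
        counts.append(take)
        remaining -= take
    return counts

def topic_names(job_id, instances):
    # Topics are named <Job_Id>_0 .. <Job_Id>_{n-1}.
    return [f"{job_id}_{i}" for i in range(instances)]

print(distribute_payload(10, 3))            # [4, 4, 2]
print(topic_names("job_13464363406493", 3))
# ['job_13464363406493_0', 'job_13464363406493_1', 'job_13464363406493_2']
```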
Please Note:
When writing a script in DsLab Notebook for an On-Demand Job, the first argument of the function in the script is expected to represent the payload when running the Python On-Demand Job. Please refer to the provided sample code.
job_payload is the payload provided when creating the job, sent from an API call, or ingested from the Job Trigger component in the pipeline.
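Since the referenced sample code is shown as an image, here is a comparable minimal sketch (the function name, field names, and processing are hypothetical) in which the first parameter receives the payload:

```python
def start_job(job_payload):
    # job_payload arrives as a list of JSON objects: the payload given
    # at job creation, sent via the API call, or taken from the Job
    # Trigger component's in-event data.
    results = []
    for record in job_payload:
        # Hypothetical processing: read a path field from each object.
        results.append(f"processing {record['path']}")
    return results

# Example invocation with a two-object payload:
output = start_job([{"path": "jobs/a.csv"}, {"path": "jobs/b.csv"}])
```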
Once the Python (On demand) Job is created, follow the below given steps to configure the Meta Information tab of the Python Job:
Project Name: Select the same Project using the drop-down menu where the Notebook has been created.
Script Name: This field will list the exported Notebook names which are exported from the Data Science Lab module to Data Pipeline.
External Library: If any external libraries are used in the script, the user can mention them here, separating multiple library names with commas (,).
Start Function: All the function names used in the script are listed here. Select the start function to execute the Python script.
Script: The exported script appears under this space.
Input Data: If the function takes any parameters, provide the parameter name as the key and the parameter's value as the value in this field.
Python (On demand) Job can be activated in the following ways:
Activating from UI
Activating from Job trigger component
For activating the Python (On demand) job from the UI, it is mandatory for the user to enter the payload in the Payload section. The payload has to be given in the form of a JSON array containing JSON objects, as shown in the below given image.
Once the user configures the Python (On demand) job, the job can be activated using the activate icon on the Job Editor page.
Please go through the below given walk-through which will provide a step-by-step guide to facilitate the activation of the Python (On demand) Job as per user preferences.
Sample payload for Python Job (On demand):
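For instance, a payload of the following shape (values are hypothetical) satisfies the JSON-array-of-objects requirement:

```python
import json

# A JSON array of JSON objects, as required by the Payload section.
sample_payload = json.loads("""
[
    {"file": "jobs/input_1.csv", "mode": "full"},
    {"file": "jobs/input_2.csv", "mode": "delta"}
]
""")
```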
The Python (On demand) job can be activated using the Job Trigger component in the pipeline. To configure this, the user has to set up their Python (On demand) job in the meta-information component within the pipeline. The in-event data of the Job Trigger component will then be utilized as a payload in the Python (On demand) Job.
Please go through the below given walk-through which will provide a step-by-step guide to facilitate the activation of the Python (On demand) Job through Job trigger component.
Please follow the below given steps to configure Job trigger component to activate the Python (On demand) job:
Create a pipeline that generates meaningful data to be sent to the out event, which will serve as the payload for the Python (On demand) job.
Connect the Job Trigger component to the event that holds the data to be used as payload in the Python (On demand) job.
Open the meta-information of the Job Trigger component and select the job from the drop-down menu that needs to be activated by the Job Trigger component.
The data from the previously connected event will be passed as JSON objects within a JSON Array and used as the payload for the Python (On demand) job. Please refer to the image below for reference:
In the provided image, the event contains a column named "output" with four different values: "jobs/ApportionedIdentifiers.csv", "jobs/accounnts.csv", "jobs/gluue.csv", and "jobs/census_2011.csv". The payload will be passed to the Python (On demand) job in the JSON format given below.
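Based on the four values listed above, the resulting payload takes this shape (one JSON object per event row, keyed by the column name):

```python
# Each row of the connected event becomes one JSON object in the array.
payload = [
    {"output": "jobs/ApportionedIdentifiers.csv"},
    {"output": "jobs/accounnts.csv"},
    {"output": "jobs/gluue.csv"},
    {"output": "jobs/census_2011.csv"},
]
```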
The Register Feature enables users to directly create jobs from DsLab notebook. Using this feature, users can create the following types of jobs:
Follow the below-given steps to create a Python Job using the register feature:
Create a project in the DSLab module under the Python environment. Refer to the below given image for guidance.
Once the project is created, create a Notebook and write the scripts. It supports scripts running through multiple code cells.
After writing the script, select the Register option from the Notebook options. Refer to the following image for reference.
The Register as Job panel will open on the right side of the window with two options:
Register as New: This will create a new Job.
Re-Register: If the Job is already created and the user wants to make changes to the scripts, it will update the existing Job with the latest changes.
Please go through the walkthrough below to register as a new Job.
After clicking on the Register option and selecting the Register as New option, the user needs to choose the desired notebook cell that should be executed with the Job. The user can select multiple cells from the notebook according to requirement and then click on the Next option.
Now, the user needs to validate the script of the selected cell by clicking on the Validate option. Additionally, the user can select the External Libraries used in the project by clicking on the External Libraries option next to the Validate option. The Next option won't be enabled until the user validates the script. After validating the script, click on the Next option.
Enter scheduler name: Name of the job. (It will create a scheduled job if the user is creating a Python Job).
Scheduler description: Description of the job.
Start function: Select the start function from the validated script to start the job from the given start function.
Job BaseInfo: Python (This field will be pre-selected if the DSLab project is created under the Python environment.)
Docker config: Select the desired configuration for the job and provide the resources (CPU & Memory) for executing the job.
Provide the resources required to run the Python Job in the Limit and Request section.
Limit: Enter the max CPU and Memory required for the Python Job.
Request: Enter the CPU and Memory required for the job at the start.
Instances: Enter the number of instances for the Python Job.
On demand: Check this option if a Python Job (On demand) needs to be created. In this scenario, the Job will not be scheduled.
Click on the Save option to create a job.
This feature allows the user to update an existing job with the latest changes. Select the Re-register option if the Job is already created and the user wants to make changes to the scripts.
After selecting the Re-Register option, the system will display all previous versions of registered Jobs from the chosen notebook.
The user must choose the Job to be Re-Registered with the latest changes and proceed by clicking Next.
Validate the script of the selected cell by clicking on the Validate option. Additionally, the user can choose External Libraries by clicking on the option next to Validate. The Next option remains disabled until the script is validated.
After script validation, proceed by clicking Next.
Now, the user needs to provide all the required information for Re-registering the job, following the same steps outlined in the Register as New feature.
Finally, click on the Save option to complete the Re-Registration of the job.
Please Note: The user can Register as New or Re-Register a PySpark Job in the same way as Python Job as mentioned above. The only difference is that the user needs to create the project under PySpark environment in the DSLab module.
This feature enables users to upload their files (in .py format) to the Utility section of a DS Lab project. Subsequently, it allows users to import these files as modules in a DS Lab notebook, enabling the use of the uploaded utility functions in scripts.
Prerequisite: To use this feature, the user needs to create a project under the DS Lab module.
Check out the below-given video on how to use utility scripts in the DS Lab module.
Activate and open the project under DS Lab module.
Navigate to the Utility tab in the project.
Click on Add Scripts options.
After that, the user will find two options:
Import Utility: It enables users to upload their own file (.py format) where the script has been written.
Utility Name: Enter the utility name.
Utility Description: Enter the description for the utility.
Utility Script: The users can upload their files in the .py format.
Click the Save option after entering all the required information on the Add utility script page, and the uploaded file will be listed under the Utility tab of the selected project.
Once the file is uploaded to the Utility tab, the user can edit the contents of the file by clicking on the Edit icon corresponding to the file name.
After making changes, the users can validate their script by clicking on the Validate icon on the top right.
Click the Update option to save the latest changes to the utility file.
If the user wants to delete a particular Utility, they can do so by clicking on the Delete icon.
The user can import the uploaded utility file as a module in the DS Lab notebook and use it accordingly.
In the above given image, it can be seen that the "employee.py" file is uploaded in the utility. Now, this file is going to be imported in the DS Lab notebook and further used in the script.
Use the below-given sample code for the reference and explore the Utility file related features yourself inside the DS Lab notebook.
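Since the original sample is shown as an image, here is a comparable hypothetical sketch: a utility file employee.py (with an illustrative full_name function) is written to disk, imported as a module, and then used from the notebook script. In the DS Lab notebook the uploaded utility is already on the import path; here the directory is added manually so the example is self-contained:

```python
import importlib
import pathlib
import sys
import tempfile

# Write a hypothetical utility file, standing in for the employee.py
# uploaded via the Utility tab.
utility_dir = tempfile.mkdtemp()
pathlib.Path(utility_dir, "employee.py").write_text(
    "def full_name(first, last):\n"
    "    return f'{first} {last}'\n"
)

# Make `import employee` resolvable, mimicking the notebook environment.
sys.path.insert(0, utility_dir)
employee = importlib.import_module("employee")

# Use the utility function in the notebook script.
name = employee.full_name("Ada", "Lovelace")
```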
In the above written sample script, the utility file (employee.py) has been imported (import employee) and used in the script for further processing.
After completing this process, the user can export this script to the Data Pipeline module and register it as a job according to their specific requirements.
Please Note:
Refer to the below-given links to get instructions on exporting to the pipeline and registering it as a job:
To apply the changes made in the utility file and get the latest results, the user must restart the notebook's kernel after any modifications to the utility file.
This feature enables users to export their written scripts from the DS Lab notebook in order to use them in Pipelines and Jobs.
Prerequisite: The user needs to create a project under the DS Lab module before using this feature.
Follow the below-given steps to export a script from the DS Lab notebook:
Activate and open a project under the DS Lab module.
The user can create or open the previously created notebook where the scripts have been written.
Go to the Notebook options and click the Export option.
The Export to Pipeline/Git panel opens from the right side of the window; select Export to Pipeline option.
Select the scripts using the check boxes to export them.
Click the Next option.
Click the Validate icon to validate the selected scripts.
The user can also export External Libraries used in the project along with scripts.
The Libraries panel opens. Select the necessary libraries to be exported along with the scripts.
A notification message appears confirming the validation of the script.
After that, click on the Export to Pipeline option, and the selected scripts will be exported along with the chosen external libraries.
This page shows a summary of Pipelines in a graphical manner.
Please go through the below given demonstration for the Pipeline Overview page.
This page contains the following information of the Pipelines in the graphical format:
This section displays the total number of pipelines created along with their count and percentage in a tile manner for the following different running statuses:
Running: Displays the number and total percentage of running pipelines.
Success: Displays the number and total percentage of successfully executed pipelines.
Ignored: Displays the number and total percentage of pipelines where the failure of any component has been ignored.
Failed: Displays the number and total percentage of pipelines where failure has occurred.
Once the user clicks on any of the statuses, it will list all the pipelines related to that particular category along with the following option:
View: Redirects the user to the selected pipeline workspace.
It displays the total number and percentage of pipelines created for the following resource types in a graphical format:
Low
Medium
High
Once the user clicks on any of the resource types, it will list all the pipelines related to that particular resource type along with the following option:
View: Redirects the user to the selected pipeline workspace.
Git Sync feature allows users to import files directly from their Git repository into the DS Lab project to be used in the subsequent processes within pipelines and jobs. For using this feature, the user needs to configure their repository in their DS Lab project.
Prerequisites:
To configure GitLab/GitHub credentials, follow these steps in the Admin Settings:
Navigate to Admin >> Configurations >> Version Control.
From the first drop-down menu, select the Version.
Choose 'DsLabs' as the module from the drop-down.
Select either GitHub or GitLab based on the requirement for Git type.
Enter the host for the selected Git type.
Provide the token key associated with the Git account.
Select a Git project.
Choose the branch where the files are located.
After providing all the details correctly, click on 'Test,' and if the authentication is successful, an appropriate message will appear. Subsequently, click on the 'Save' option.
To complete the configuration, navigate to My Account >> Configuration. Enter the Git Token and Git Username, then save the changes.
Please follow the below-given steps to configure the Git Sync in the DS Lab project:
Navigate to the DS Lab module.
Click the Create Project to initiate a new project.
Enter all the required fields for the project.
Select the Git Repository and Git Branch.
Enable the option Sync git repo at project creation to gain access to all the files in the selected repository.
Click the Save option to create the project.
After creating the project, expand the Repo option in the Notebook tab to view all files present in the repository.
The Git Console option, accessible by clicking at the bottom of the page, allows the user to run Git commands directly. This feature enables the user to execute any Git commands as per their specific requirements.
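As a rough illustration of what the Git Console does, the same kind of Git command can be invoked programmatically; this is only a sketch and assumes git is installed on the machine:

```python
import shutil
import subprocess

# Run a typical Git command and capture its output, similar to what
# the Git Console does when the user types a command.
if shutil.which("git"):
    out = subprocess.run(
        ["git", "--version"], capture_output=True, text=True
    ).stdout.strip()
else:
    out = "git not available"
print(out)
```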
After completing these steps, users can export their scripts to the Data Pipeline module and register them as jobs according to their specific requirements.
For instructions on exporting a script to the pipeline and registering it as a job, please refer to the link provided below:
This page provides an overview and summary of the pipeline module, including details such as running status, types, number of pipelines and jobs, and resources used.
Please go through the below-given demonstration for the Job Overview page.
This tab will open by default once the user navigates to the Pipeline and Job Overview page. It contains the following information of the Jobs in graphical format:
Job Status:
This section displays the count and percentage of Jobs, shown as tiles, for each of the following running statuses:
Running: Displays the number and total percentage of running jobs.
Success: Displays the number and total percentage of successfully run jobs.
Interrupted: Displays the number and total percentage of interrupted jobs.
Failed: Displays the number and total percentage of failed jobs.
Clicking on any of these statuses lists all the jobs in that category along with the following option:
View: Redirects the user to the selected job workspace.
Job Type:
It displays the total number and percentage of jobs created for the following categories in the graphical format:
Clicking on any of the job types lists all the jobs of that type along with the following option:
View: Redirects the user to the selected job workspace.
The List Jobs option opens the Jobs List for the logged-in user. All the Jobs saved by the user are listed on this page. Clicking a Job name opens the Details tab on the right side of the page, displaying the basic details of the selected job.
Navigate to the Data Pipeline Homepage.
Click on the List Jobs option.
The List Jobs page opens displaying the created jobs.
Select a Job from the displayed list, and click on it.
This will open a panel containing three tabs:
The Job Details panel displays key metadata: the creator, last updater, last activation timestamp, and last deactivation timestamp. This information helps users track changes and manage their workflows effectively.
Tasks: Indicates the number of tasks used in the job.
Created: Indicates the name of the user who created the job (with the date and time stamp).
Last Activated: Indicates the name of the user who last activated the job (with the date and time stamp).
Last Deactivated: Indicates the name of the user who last deactivated the job (with the date and time stamp).
Cron Expression: A string representing a schedule specifying when the job should run.
Trigger Interval: The interval at which the job is triggered (e.g., every 5 minutes).
Next Trigger: Date and time of the next scheduled trigger for the job.
Description: Description of the job provided by the user.
Total Allocated Min CPU: Total minimum allocated CPU (in cores).
Total Allocated Min Memory: Total minimum allocated memory (in MB).
Total Allocated Max CPU: Total maximum allocated CPU (in cores).
Total Allocated Max Memory: Total maximum allocated memory (in MB).
Total Allocated CPU: Total allocated CPU cores.
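The Cron Expression field above uses the standard five-field cron syntax (minute, hour, day of month, month, day of week); a small sketch parsing a hypothetical expression:

```python
# Hypothetical cron expression: run at minute 0 of every 6th hour.
expr = "0 */6 * * *"
minute, hour, day_of_month, month, day_of_week = expr.split()
print(minute, hour)  # 0 */6
```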
Provides relevant information about the selected job's past runs, including success or failure status.
A Clear option is available to clear the job history.
Click on the Clear button from the History tab.
It will clear all the job run history and logs from the History tab and Sandbox location.
A Refresh icon is provided for refreshing the displayed job history.
In the List Jobs page, the user can view and download the pod logs for all instances by clicking on the View System Logs option in the History tab.
Once the user clicks on the View System Logs option, a drawer panel will open from the right side of the window. The user can select the instance for which the System logs have to be downloaded from the Select Hostname drop-down option.
Clear: It will clear all the job run history from the History tab.
Navigate to the History section for a Job.
Click the Clear option.
A confirmation dialog box appears.
Click the Yes option to apply the clear history.
The job history gets deleted.
You may reopen the History tab for the same job to verify that the history has been deleted.
The Pin and Unpin feature allows users to prioritize and easily access specific jobs within a list. This functionality is beneficial for managing tasks, projects, or workflows efficiently. This feature is available on each job card on the List Jobs page.
Navigate to the List Jobs page.
Select the jobs that you wish to pin.
Click the Pin icon for the selected Job.
The job gets pinned to the list for easy access and appears at the top of the job list if it is the first pinned job.
You can use the pin icon to pin multiple jobs on the list.
Click the Unpin icon for a pinned job.
The selected job gets unpinned from the list.
Several actions can be applied to a Job from the Actions section on the List Jobs page.
The user can search for a specific Job by using the Search Bar on the Job List.
Typing a name in the Search Bar lists all the existing jobs containing that text. E.g., typing san lists all the existing Jobs with the word san in their names, as displayed in the following image:
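The search behavior described above can be sketched as a simple case-insensitive substring match (the job names below are hypothetical):

```python
# Hypothetical job names; the search keeps any name containing the query.
jobs = ["sanity-check", "sales-report", "san-diego-etl"]
query = "san"
matches = [name for name in jobs if query.lower() in name.lower()]
print(matches)  # ['sanity-check', 'san-diego-etl']
```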
The Job List has been provided with a header bar containing the categories of the available jobs. The user can also customize the Job List by choosing available filter options in the header.
The Job List gets modified based on the selected category option from the header.
The Recent Runs status indication option provides an at-a-glance view of the five most recent job executions, allowing users to track performance in real time. The status of each listed job (successful, failed, or in progress) is displayed, enabling users to assess system health quickly. By highlighting any issues immediately, this feature allows for proactive troubleshooting and faster response times, helping ensure seamless workflows and minimizing downtime.
The Recent Run categorizes the recently run Jobs into the following statuses:
Succeeded: The job was completed successfully without any errors.
Interrupted: The job was stopped before completion, either by a user action or an external factor.
Failed: The job encountered an error that prevented it from completing successfully.
Running: The job is currently in progress.
The Job Run statuses are displayed with different color codes.
Succeeded: Green
Failed: Red
Interrupted: Yellow
Running: Blue
No Run: Grey
Navigate to the List Jobs page.
The Recent Runs section will be visible for all the listed jobs, indicating the statuses of the five most recent runs.
You may hover on a status icon to get more details under the tooltip.
Users can get a tooltip with additional details on the Job run status by hovering over the status icon in the Recent Run section.
Status: The status of the job run (Succeeded, Interrupted, Failed, Running).
Started At: The time when the job run was started.
Stopped At: This indicates when the job was stopped.
Completed At: The time when the job run was completed.
Please Note:
The user can open the Job Editor for the selected Job from the list by clicking the View icon.
The user can view and download logs only for successfully run or failed jobs. Logs for the interrupted jobs cannot be viewed or downloaded.
Alert: Please refer to the page to configure alerts in job.
URL for Github:
URL for Gitlab:
Payload: This option will only appear if the On demand option is checked. Enter the payload in the form of a JSON Array containing JSON objects. For more details about the Python Job (On demand), refer to this link:
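For illustration, a payload of the required shape (a JSON array containing JSON objects) can be built and serialized as follows; the keys and values here are hypothetical:

```python
import json

# Hypothetical on-demand payload: a JSON array of JSON objects,
# one object per run parameter set.
payload = [
    {"region": "us-east", "batch_size": 100},
    {"region": "eu-west", "batch_size": 250},
]
serialized = json.dumps(payload)
parsed = json.loads(serialized)
print(serialized)
```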
Concurrency Policy: Select the desired concurrency policy. For more details about the Concurrency Policy, check this link:
Alert: This feature in the Job allows the users to send an alert message to the specified channel (Teams or Slack) in the event of either the success or failure of the configured Job. Users can also choose both success and failure options to send an alert for the configured Job. Check the following link to configure the Alert:
Pull from Git: It enables users to pull their scripts directly from their Git repository. In order to use this feature, the user needs to configure their Git branch while creating the project in DS Lab. Detailed information on configuring Git in a DS Lab project can be found here:
Click the External Libraries icon.
The user can use the exported scripts in the Pipelines and Jobs.
Please Note: Files with the .ipynb extension can be exported to the pipelines and jobs for further use, while those with the .py extension can only be utilized as utility files. Additional information on utility files can be found here:
Monitor: Redirects the user to the Monitoring page for the selected pipeline.
Monitor: Redirects the user to the Monitoring page for the selected job.
Updated: Indicates the name of the user who updated the job (with the date and time stamp).
Total Allocated Memory: Total allocated memory in megabytes (MB).
Please Note: The Actions section also includes the options listed below for a job to perform the said actions.
Push Job
Enables the user to push a Job to the VCS.
View
Share
Allows the user to share the selected job with other user(s) or usergroup(s).
Job Monitoring
Edit
This enables the user to edit any information about the job. This option will be disabled when the job is active.
Delete
Check out the given walk-through on how the Scheduler option works in the Data Pipeline.
The Scheduler List page opens displaying all the registered schedulers in a pipeline.
It displays all the previous and next executions of the scheduler.
On the Scheduler List page, users will find the following details:
Scheduler Name: The name of the scheduler component as given in the pipeline.
Scheduler Time: The time set in the scheduler component.
Next Run Time: The next scheduled run time of the scheduler.
Pipeline Name: The name of the pipeline where the scheduler is configured and used. Clicking on this option will directly redirect the user to the selected pipeline.
This page describes how to delete a pipeline/Job or restore a deleted pipeline/Job using the Trash option.
Check out the given illustration on how to use the Trash option provided on the Pipeline homepage left menu panel.
Navigate to the Pipeline Homepage to access the left side menu panel.
Click the Trash icon.
The Trash List page opens, listing all the pipelines/jobs deleted by the user from their account.
The user gets two options to be applied on the pipelines/jobs:
Delete
Restore
Please Note: Based on the selected option, the related action will be taken on the concerned pipeline/job.
Navigate to the Trash page.
Select a Pipeline.
Click the Delete icon for the selected Pipeline.
The Delete Pipeline/Job dialog box opens.
Click the Yes option.
The Delete Pipeline Confirmation dialog box opens.
Click the Delete Anyway option.
A notification message appears, and the Pipeline/Job gets permanently deleted for the user.
Please Note: At present, the Trash page displays only those pipelines and jobs that were deleted by the logged-in user from the Pipeline/Job Editor page.
Navigate to the Trash page.
Select a Pipeline.
Click the Restore icon for the selected Pipeline/Job.
The Recover Pipeline/Job dialog box opens.
Click the Yes option.
A notification message appears that the Pipeline/Job has been restored.
The Pipeline gets recovered and is listed on the Pipeline/Job list page.
The Data Channel & Cluster Events page presents a comprehensive list of all Broker Info, Consumer Info, Topic Info, Kafka Version, and all the events used in the pipeline. It allows users to flush/delete the events.
Go through the below-given demonstration for the Data Channel & Cluster Events page.
Navigate to the Pipeline Homepage.
Click the Data Channel & Cluster Events icon.
The Data Channel & Cluster Events page opens.
The list opens displaying the Data Channel & Cluster Events information.
The Data Channel includes the following information:
Broker Info: It will list all Kafka brokers and display the number of partitions used for each broker.
Consumer Info: It will display the number of active and rebalancing consumers.
Topic Info: It will display the number of topics.
Version information: It will display the Kafka version.
The Cluster Events page includes the following information:
On this page, all the pipelines will be listed along with the following details:
Pipeline Name: The name of the pipeline.
Number of Events: The number of Kafka events created in the selected pipeline.
Status: The running status of the pipeline, indicated by Green if active and Red otherwise.
Expand for Events: Click here to expand the selected row for a particular pipeline. This will list all Kafka events associated with the chosen pipeline along with the following information:
Name: Displays the name of the Kafka event in the pipeline.
Event Name: Name of the Kafka event.
Partitions: Number of partitions in the Kafka event.
The user gets two options to apply to the listed Kafka events for the pipeline:
Flush All: This will flush all topic data in the selected pipeline.
Delete All: This will delete all topics in the selected pipeline.
Once the user clicks on the Event Name after expanding the row for the selected pipeline, the following information will be displayed on the new page for the selected Kafka Event:
This tab contains the following information for the selected Kafka Event:
Partitions: The number of partitions in the Kafka Event.
Replication Factor: Displays the replication factor of the Kafka topic. This refers to the number of copies of a Kafka topic's data maintained across different broker nodes within the Kafka cluster, ensuring high availability and fault tolerance of data.
Sync Replicas: Displays the number of in-sync replicas of the Kafka topic. In-sync replicas (ISRs) are a subset of replicas fully synchronized with the leader replica for a partition. These replicas hold the same latest data as the leader and are capable of taking over as the leader if the current leader fails.
Segment Size: This shows the segment size of the Kafka topic. A segment is a smaller chunk of a partition log file. Segment size refers to the size of these log segments that Kafka uses to manage and store data within a partition. Kafka topics are divided into partitions, and each partition is further divided into segments.
Messages Count: Displays the number of messages in the Kafka topic.
Retention Period: Displays the retention period of the Kafka topic in hours. The retention period of a Kafka topic determines how long Kafka retains the messages in a topic before deleting them.
Additionally, this tab lists all the partition details along with their start and end offset, the number of messages in each partition, the number of replicas for each partition, and the size of the messages held by each partition.
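Two of the figures shown on this tab can be derived arithmetically: the message count of a partition from its start and end offsets (offsets are sequential message identifiers), and the retention period in hours from Kafka's millisecond-based retention.ms setting. A sketch with hypothetical values:

```python
# Hypothetical start/end offsets of a partition; the message count is
# simply the difference, since offsets are sequential ids.
start_offset, end_offset = 120, 570
message_count = end_offset - start_offset
print(message_count)  # 450

# The retention period shown in hours corresponds to Kafka's
# retention.ms topic config, which is expressed in milliseconds.
retention_hours = 168  # hypothetical value, e.g. 7 days
retention_ms = retention_hours * 60 * 60 * 1000
print(retention_ms)  # 604800000
```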
This tab contains the following information for the selected Kafka Event:
Offset: Shows the offset number of the partition. An offset is a unique identifier assigned to each message within a partition.
Partitions: Displays the partition number where the offset belongs.
Time: It mentions the date and time of the message when it was stored at the offset.
Preview: This option helps the user to view and copy the message stored at the selected offset.
This tab shows details of the consumers connected to the Kafka Topic.
The List Pipelines option opens the available Pipeline List for the logged-in user. All the saved pipelines by a user are listed on this page. Please refer to the below image for reference:
Please Note:
If the logged-in user has Admin access, the user can see all the pipelines created by all users.
The non-admin users get to access the list of pipelines created by them or shared with them by the other users.
The Pin and Unpin feature allows users to prioritize and easily access specific pipelines within a list. This functionality is beneficial for managing tasks, projects, or workflows efficiently. This feature is available on each pipeline card in the list view.
Navigate to the List Pipelines page.
Select a Pipeline from the list that you wish to pin.
Click the Pin icon provided for the selected pipeline.
The Pipeline gets pinned to the list and it will be moved to the top of the list if it is the first pinned pipeline.
The user can pin multiple pipelines to the list.
Click the Unpin icon for a pinned pipeline.
The pipeline will be unpinned from the list.
Navigate to the Pipeline List page.
Select a data pipeline from the displayed list.
Click the Push & Pull Pipeline icon for the selected data pipeline.
Please Note: The Push & Pull Pipeline features are available on the List Pipeline and the Pipeline Editor pages.
The Push/Pull drawer appears, displaying the name of the selected Pipeline. E.g., in the following image, the heading Push/Pull Restaurant Analysis WK5 contains Restaurant Analysis WK5 as the name of the selected Pipeline.
Provide a Commit Message (required) for the data pipeline version.
Select a Push Type out of the below-given choices to push the pipeline:
Version Control: For versioning of the pipeline in the same environment. In this case, the selected Push Type is Version Control.
GIT Export (Migration): This is for pipeline migration. The pushed pipeline can be migrated to the destination environment from the migration window in the Admin Module.
Click the Save option.
Based on the selected Push Type the pipeline gets moved to Git or VCS, and a notification message appears to confirm the completion of the action.
Check out the illustrations below on the Version Control and Pipeline Migration functionalities.
Version Control:
Pipeline Migration:
Please Note:
The pipeline pushed to the VCS using the Version Control option can be pulled directly from the Pull Pipeline option.
The Pull feature helps users pull the previously moved versions of a pipeline from the VCS. Thus, it can help the users significantly to recover the lost pipelines or avoid unwanted modifications made to the pipeline.
Check out the walk-through on how to pull a pipeline version from the VCS.
Navigate to the Pipeline List page.
Select a data pipeline from the displayed list.
Click the Pull from GIT icon for the selected data pipeline.
The Push/ Pull drawer opens with the selected Pipeline name.
Select the data pipeline version by marking the given checkbox.
Click the Save option.
A notification appears to inform that the selected pipeline workflow is imported.
Please Note: The pipeline you pull will be changed to the selected version. Please make sure to manage the versions of the pipeline properly.
Clicking on the View icon will direct the user to the pipeline workflow editor page.
Navigate to the Pipeline List page.
Select a Pipeline from the list.
Click the View icon.
The Pipeline Editor page opens for that pipeline.
Please Note: The user can open the Pipeline Editor for the selected pipeline from the list by clicking the View icon or the Pipeline Workflow Editor icon on the Pipeline List page.
The user can search for a specific pipeline by using the Search bar on the Pipeline List. Typing a name lists all the existing pipelines containing that text. E.g., typing the letters 'resta' lists all the existing pipelines with those letters.
Click on a pipeline name on the List Pipelines page to view its information. A panel will open on the right side of the screen showing all the details of the selected pipeline.
Please look at the demonstration to understand how the Pipeline details are displayed on the List Pipelines page.
By clicking on a pipeline name, the following details of that pipeline will appear below in two tabs:
The Pipeline Details panel displays key metadata: the creator, last updater, last activation timestamp, and last deactivation timestamp. This information helps users track changes and manage their workflows effectively.
Components: Number of components used in the pipeline.
Created: Indicates the Pipeline owner's name (with Date and Time stamp).
Updated: Name of the person who has updated the pipeline (with Date and Time stamp).
Last Activated: Displays the person's name who last activated the pipeline (with Date and Time stamp).
Last Deactivated: Displays the person's name who last deactivated the pipeline (with Date and Time stamp).
Description: This field will show the description of the pipeline if given by the pipeline owner.
Total Component Config
Total Allocated Max CPU: The maximum allocated CPU in cores.
Total Allocated Min CPU: The minimum allocated CPU in cores.
Total Allocated Max Memory: The maximum allocated memory in MB.
This section displays the components that have failed or been ignored in the selected pipeline workflow.
Users can click the Failed Components & Ignored Components options to display the failed or ignored component names. The information icon provided next to the names of the Failed or Ignored components will redirect the user to get more information about those components.
Navigate to the Failed and Ignored Components section for a pipeline.
Click the Ignore Failure Components icon provided for Failed Components.
The Confirm dialog box opens.
Use the Comment space to comment on the Failed component.
Click the Yes option to ignore the Failed Components.
Please Note: The Failed and Ignored Components tab will be empty if the pipeline flow is successful.
Check out the illustration on the Pipeline Component Configuration page.
Navigate to the Pipeline List page.
Select a Pipeline from the list.
Click the Pipeline Component Configuration icon.
The configuration page will list all the components used in the selected pipeline. The Basic Info tab opens by default.
Click the Configuration tab to access the pipeline components' configuration details. The user may modify the details and update it using the Save option.
Please Note:
The user can edit the basic or configuration details for all the pipeline components.
The Pipeline Component Configuration page also displays icons to access Pipeline Testing, Failure Analysis, Open Pipeline (Editor Workspace), and List Pipeline.
Users can share a pipeline with one or multiple users and user Groups using the Share Pipeline option.
Check out the following walk-through on how to share a pipeline with user/ user group and exclude the user from a shared pipeline.
Click the List Pipelines icon to open the Pipeline List.
Select a Pipeline from the pipeline list.
Click on the Share Pipeline icon.
The Share Pipeline window opens.
Select an option from the given choices: User (the default tab) and User Group or Exclude User (the Exclude User option can be chosen if the pipeline is already shared with a user/user group and you wish to exclude them from the privilege).
Select a user or user(s) from the displayed list of users (in case of the User Group(s) tab, it displays the names of the User Groups).
Click the arrow to move the selected User(s)/ User Group(s).
The selected user(s)/user group(s) get moved to the box given on the right (in case of the Exclude User option, the user moved to the right-side box loses access to the shared pipeline).
Click the Save option.
Privilege for the selected pipeline gets updated and the same gets communicated through a message.
By completing the steps mentioned above, the concerned pipeline gets successfully shared with the selected user/user group, or the selected users get excluded from their privileges for the concerned pipeline.
Please Note:
An admin user can View, Edit/Modify, Share, and Delete a shared pipeline.
A non-admin user can View and Edit/Modify a shared Pipeline but cannot share it further or delete it.
By clicking the Pipeline Monitor icon, the user will be redirected to the Pipeline Monitoring page.
Navigate to the List Pipelines page.
Select a Pipeline from the list.
Click the Pipeline Monitor icon.
The user gets redirected to the Pipeline Monitoring page of the selected Pipeline. The Monitor tab opens by default.
Click the expand icon to display the monitor page in detail.
The user may click on the Data Metrics and System Logs tabs to get more details on the pipeline.
Failure Analysis is a centralized failure-tracking mechanism that helps the user identify the reason for a failure. The failure records of any pipeline are stored at a particular location (collection), from where the failed data can be queried in the Failure Analysis UI. It displays the failed records, cause, event time, and pipeline ID.
Check out the illustration to access the Failure Analysis for a selected Pipeline.
Navigate to the Pipeline List page.
Select a Pipeline with Failed status from the list.
Click on the Failure Analysis icon.
The user gets redirected to the Failure Analysis page.
The user can further get the failure analysis of the specific component that failed in the pipeline workflow.
Use the ellipsis icon to get redirected to the Monitoring and Data Metrics options for a component.
Navigate to the List Pipelines page.
Select a Pipeline from the list.
Click the Delete Pipeline icon for the pipeline.
The Confirm dialog box appears to confirm the action.
Click the YES option to confirm the deletion.
The selected Pipeline gets removed from the list.
The user can activate/deactivate the pipeline by clicking on the Activate/ Deactivate icon.
Click the List Pipelines icon to open the Pipeline list.
Select a Pipeline from the pipeline list.
Click on the Activate Pipeline icon.
The Confirmation dialog box opens to get the confirmation for the action.
Click the YES option.
A confirmation message appears.
The pipeline gets activated. The Activate Pipeline icon turns into Deactivate Pipeline.
The Status option displays UP and turns green in color.
Click the Deactivate Pipeline icon to deactivate the pipeline.
The Confirmation dialog box opens to get the confirmation for the action.
Click the YES option.
A confirmation message appears.
The pipeline gets deactivated. The Deactivate Pipeline icon turns into Activate Pipeline.
The same gets communicated through the Status icon, which displays OFF and turns grey in color.
This page is for the advanced configuration of the Data Pipeline. The following configurations are displayed on the Settings page:
Logger
Default Configuration
System Component Status
Data Sync
Job BaseInfo
List Components
This displays the information and configuration details of the Kafka topic that is set for logging.
This page enables users to view or set the default resource configuration in Low, Medium, and High resource allocation types for pipelines and jobs.
To access the Default Configuration, click on the settings option on the pipeline homepage and select it from the settings page.
There will be two tabs on this page:
Pipeline: The Pipeline tab opens by default.
This tab shows the default configuration set for Spark and Docker components in different resource allocation types (Low, Medium, High) and in different invocation types such as Real-time and Batch.
Users can change this default configuration as per their requirement, and the same default resource configuration will be applied to newly added components when used in the pipeline.
Job: Navigate to this tab to view or set the default configuration for Jobs.
This tab shows the default configuration set for Spark and Docker Jobs in different resource allocation types (Low, Medium, High).
Users can change this default configuration as per their requirement, and the same default resource configuration will be applied to newly created jobs.
The System Component Status page under the Settings option provides monitoring capability for the health of the System Components.
System logs are essential for monitoring and troubleshooting system pods. They provide a record of system events, errors, and user activities. Here is an overview of key concepts and types of system logs, along with best practices for managing them.
Navigate to the Logger Details page by clicking the Settings icon from the left menu bar.
Click on the System Component Status Tab. The user will navigate to the System Pod Details page.
Click on any one System Pod. The System Logs drawer will open.
System logs are files that record events and messages from the operating system, applications, and other components.
Search Bar: Users can search the relevant log entries in the search bar.
Time Range Filter: This filter allows users to specify a time frame to view system pod logs. This feature is essential for isolating issues, monitoring performance, or auditing activities over specific periods.
Filter Options: Users get two ways to specify the time range.
Predefined Ranges: Select from the displayed options, which include commonly used ranges such as the last few minutes, the last hour, or the last few days. E.g., Last 24 hours is selected as the Time Range in the following image.
Custom Range: Users can specify a start and end date/time to filter logs using the Calendar.
Select a time range.
It will be displayed as the selected time range.
Click the Apply Time Range option.
The logs within the selected time range will be displayed. Logs may be displayed in a list format, showing timestamps, log levels, and relevant messages.
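A predefined range such as Last 24 hours resolves to a concrete start/end pair relative to the current time; a minimal sketch of that filtering logic:

```python
from datetime import datetime, timedelta

# Resolve the "Last 24 hours" predefined range to concrete bounds.
end_time = datetime.now()
start_time = end_time - timedelta(hours=24)

# A log entry is kept if its timestamp falls within the range
# (hypothetical entry from three hours ago).
log_time = end_time - timedelta(hours=3)
in_range = start_time <= log_time <= end_time
print(in_range)  # True
```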
Log Formats: Logs can be in plain text format.
Log Levels: Logs are often categorized by severity (e.g., DEBUG, INFO, WARN, ERROR, CRITICAL).
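The severity levels listed above are ordered, which is what makes level-based filtering possible; a sketch using Python's logging level constants (note that Python's logging module names the WARN level WARNING):

```python
import logging

# Hypothetical log entries as (level name, message) pairs.
records = [
    ("DEBUG", "payload parsed"),
    ("INFO", "job started"),
    ("ERROR", "task failed"),
]

# Keep only entries at or above a chosen severity threshold.
threshold = logging.WARNING
filtered = [(lvl, msg) for lvl, msg in records if getattr(logging, lvl) >= threshold]
print(filtered)  # [('ERROR', 'task failed')]
```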
Please Note:
Download Logs: You can download the log files of the system logs using the Download Logs icon.
Refresh: You can refresh the system log list using the Refresh option.
The Data Sync option in Settings is provided to configure the Data Sync feature globally. This way, the user can enable a DB connection and reuse the same connection in the pipeline workflow without using any extra resources.
Please Note: The supported drivers are:
MongoDB
Postgres
MySQL
MSSQL
Oracle
ClickHouse
Snowflake
Redshift
Go to the Settings page.
Click the Data Sync option.
The Data Sync page opens.
Click the Plus icon.
The Create Data Sync Connection dialog box appears.
Specify all the required connection details.
Click the Save option.
Please Note:
Please use the TCP port if you are using ClickHouse as a driver in Data Sync.
A success notification message appears at the top stating that the Data Sync Setting has been created successfully.
Enable the action button to activate the Data Sync connection.
Go to the Settings page.
Click the Data Sync option.
The DB Sync Settings page opens.
Click the Edit Setting icon.
The Edit Settings dialog box opens.
Edit the connection details if required.
Click the Update option.
Go to the Settings page.
Click the Data Sync option.
The Data Sync Settings page opens.
Click the Disconnect button.
The Disconnect Setting confirmation dialog box appears.
Click the DISCONNECT option.
A success notification message appears on the top when DB Sync gets disconnected.
All the pipelines containing the DB Sync Event component are listed under the Pipelines using DB Sync section.
The user can see the following information:
Name: Name of the pipeline where Data Sync is used.
Running status: Indicates whether the pipeline is active or deactivated.
Actions: The user can view the pipeline where Data Sync is used.
The Pipeline module supports the following types of jobs:
These jobs are configured in the Job BaseInfo page under the Settings menu.
Please Note: Job BaseInfo entries are created from the admin side; the user is not expected to create them from the Settings menu.
To create a new Job BaseInfo, click on the Create Job BaseInfo option, as shown in the image below:
Once clicked, you are redirected to the page for creating a new Job BaseInfo, where you are asked to fill in details under the following tabs:
Basic Information Tab:
Name: Provide a Name for the Job BaseInfo.
Deployment type: Select the Deployment type from the drop-down.
Image Name: Enter the Image name for the Job BaseInfo.
Version: Specify the version.
isExposed: This option is automatically filled as True once the deployment type is selected.
Job type: Select the job type from the drop-down.
Ports Tab:
Port Name: Enter the Port name.
Port Number: Enter the Port number.
Delete: The user can delete the port details by clicking on this option.
Add Port: The user can add the Port by clicking on this option.
Spark component information:
Main class: Enter the Main class for creating the Job BaseInfo.
Main application file: Enter the main application file.
Runtime Environment Type: Select from the drop-down. There are three options available: Scala, Python, and R.
Now, click on the Save option to create the Job BaseInfo.
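To relate the Spark component fields above to an actual Spark job, the sketch below assembles a spark-submit argument list from a main class and main application file. The helper function and the sample values are hypothetical, and the actual submission is handled by the platform.

```python
def spark_submit_args(main_class, app_file, runtime="Scala", extra_conf=None):
    """Build a spark-submit argument list from Job BaseInfo-style fields.

    For Scala jobs, the --class flag names the main class inside the
    application JAR; Python (PySpark) jobs submit the script directly.
    """
    args = ["spark-submit"]
    if runtime == "Scala" and main_class:
        args += ["--class", main_class]
    for key, value in (extra_conf or {}).items():
        args += ["--conf", f"{key}={value}"]
    args.append(app_file)
    return args

# Hypothetical main class and application file for illustration only.
args = spark_submit_args(
    "com.example.jobs.IngestJob",      # Main class
    "local:///opt/jobs/ingest.jar",    # Main application file
    runtime="Scala",
)
print(" ".join(args))
```

The Runtime Environment Type field corresponds to the `runtime` parameter here: a Scala job needs both the main class and the JAR, while a Python or R job is identified by its application file alone.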
Once the Job BaseInfo is created, the user can navigate to the List Job BaseInfo page by clicking the List Job BaseInfo icon, as shown in the image below:
List Job BaseInfo Page:
This page lists all the created Job BaseInfo entries, as shown in the image below:
The Name column represents the Name of the Job BaseInfo.
The Status column will display the status of the Job BaseInfo.
The Version column displays the version of the Job BaseInfo.
All the components being used in the pipeline are listed on this page.
Please Note: Please go through the walk-through below for the List Components page.
Check out the steps in the walk-through below to configure a custom component.
Please Note: You should have your Docker image created and stored in the Docker repository where all other pipeline components are present. Some help from DevOps will be required to push those images.
Redirects the user to the page.
Redirects user to the page.
Allows the user to delete the job. The deleted job will be moved to .
The user also gets an option to push the pipeline to GIT. This action is considered a Pipeline Migration. The user needs to follow the steps given in the Admin module of the Platform to pull a version that has been pushed to GIT.
Total Allocated Min Memory: The minimum allocated memory in MB.
Please refer to this link for more details on the .
Please Note: Check out the page for more details.
Please Note: Refer to the for more details on this.
Please Note: The configured Driver from the Settings page provided for Data Sync can be accessed inside a while using it in a Pipeline.
The user can view the details of a Job BaseInfo by clicking the View icon. Once clicked, the details provided while creating that Job BaseInfo are displayed.
There is also a provision to create a component: clicking the Create Component icon at the top right corner takes you to a separate page where you can configure your component.