Readers are a group of components that can read data from different databases and cloud storage services in both invocation types, i.e., Real-Time and Batch.
The MongoDB Reader component supports both deployment types: Spark and Docker.
Components are broadly classified into groups in the Component Panel.
Any Pipeline System component can easily be dragged to the workflow. Follow these steps to use a System component in the Pipeline Workflow editor:
Expand the component group and select a component.
Search using the search bar in the Component Panel.
Drag and drop the component to the Workflow Editor.
Check out the illustration on how to use a system pipeline component.
The MongoDB Writer component is designed to write data to a MongoDB collection.
All component configurations are classified broadly into the following sections:
Basic Information
Meta Information
Connection Validation
Follow the given steps in the demonstration to configure the Mongo (Spark) Writer component.
Please Note: ​In the Connection Type field, you can choose one of the three options: SRV, Standard, and Connection String.
Please Note: The fields marked as (*) are mandatory fields.
Connection Type: Select the connection type from the drop-down:
Standard
SRV
Connection String
Host IP Address (*): Provide the IP address of the MongoDB host.
Port(*): Port number (It appears only with the Standard Connection Type).
Username(*): Provide username.
Password(*): Provide a valid password to access the MongoDB.
Database Name(*): Provide the name of the database from where you wish to read data.
Collection Name(*): Provide the name of the collection.
Schema File Name: Upload Spark Schema file in JSON format.
Additional Parameters: Provide the additional parameters to connect with MongoDB. This field is optional.
Enable SSL: Check this box to enable SSL for this component. The MongoDB connection credentials will be different if this option is enabled.
Certificate Folder: This option appears when the Enable SSL field is checked. Select the certificate folder from the drop-down; it contains the files that have been uploaded in the Admin Settings for connecting to MongoDB with SSL. Please refer to the images given below for reference.
Save Mode: Select the Save Mode from the drop-down.
Append: This operation adds the data to the collection.
Ignore: This operation skips the insertion of a record if a duplicate record already exists in the collection; the new record is not added and the database remains unchanged. Ignore is useful when you want to prevent duplicate entries.
Upsert: A combination of the update and insert options. It updates a record if it already exists in the collection, or inserts a new record if it does not exist.
The Connection Validation option helps users validate the connection details of the DB/cloud storage.
This option is available for all the components to validate the connection before deploying them, to avoid connection-related errors. It also works with environment variables.
Check out a sample illustration of connection validation.
A way to scale up the processing speed of components.
A feature to scale your component to the max number of instances to reduce the data processing lag automatically. This feature detects the need to scale up the components in case of higher data traffic.
Please Note: This feature is available in both Spark and Docker components. It will only work with Real-Time as the invocation type.
All components have the Intelligent Scaling option, which is the ability of the system to dynamically adjust the scale or capacity of the component based on the current demand and available resources. It involves automatically optimizing the resources allocated to the component to ensure efficient and effective processing of tasks.
Please Note:
If you have selected the Intelligent Scaling option, make sure to give enough resources to the component so that it can auto-scale based on the load.
Components will scale up if there is a lag of more than 60%, and if the lag goes below 10%, the component pods will automatically scale down. These lag percentages are configurable.
The S3 Reader component is designed to read and access data stored in an S3 bucket in AWS. It typically authenticates with S3 using AWS credentials, such as an access key ID and secret access key, to gain access to the S3 bucket and its contents.
All component configurations are classified broadly into the following sections:
Meta Information
Check out the below-given demonstration to configure the S3 component and use it in a pipeline workflow.
Navigate to the Data Pipeline Editor.
Expand the Reader section provided under the Component Pallet.
Drag and drop the S3 Reader component to the Workflow Editor.
Click on the dragged S3 Reader to get the component properties tabs.
It is the default tab to open for the component while configuring it.
Invocation Type: Select an invocation mode out of ‘Real-Time’ or ‘Batch’ using the drop-down menu.
Deployment Type: It displays the deployment type for the reader component. This field comes pre-selected.
Container Image Version: It displays the image version for the docker container. This field comes pre-selected.
Failover Event: Select a failover Event from the drop-down menu.
Batch Size (min 10): Provide the maximum number of records to be processed in one execution cycle (Min limit for this field is 10).
Open the ‘Meta Information’ tab and fill in all the connection-specific details for the S3 Reader.
Bucket Name (*): Enter AWS S3 Bucket Name.
Zone (*): Enter the S3 Zone (e.g., us-west-2).
Access Key (*): Provide Access Key ID shared by AWS.
Secret Key (*): Provide Secret Access Key shared by AWS.
Table (*): Mention the Table or file name from S3 location which is to be read.
File Type (*): Select a file type from the drop-down menu (CSV, JSON, PARQUET, AVRO, XML and ORC are the supported file types)
Limit: Set a limit for the number of records to be read.
Query: Enter a Spark SQL query. Take inputDf as table name.
Access Key (*): Provide Access Key ID shared by AWS.
Secret Key (*): Provide Secret Access Key shared by AWS.
Table (*): Mention the Table or file name from S3 location which is to be read.
File Type (*): Select a file type from the drop-down menu (CSV, JSON, PARQUET, AVRO are the supported file types).
Limit: Set a limit for the number of records to be read.
Query: Enter a Spark SQL query. Take inputDf as table name.
Sample Spark SQL query for S3 Reader:
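For illustration, a minimal query of this kind, assuming inputDf as the table name (as noted in the Query field above) and hypothetical columns such as id and country, could look like:
SELECT id, country FROM inputDf WHERE country = 'India'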
There is also a Selected Columns section in the Meta Information tab, which lets the user read specific columns from the table instead of the complete table. Select the columns you want to read; if you want to change the name of a column, enter that name in the Alias Name field, otherwise keep the alias name the same as the column name, and then select a Column Type from the drop-down menu.
or
Use ‘Download Data’ and ‘Upload File’ options to select the desired columns.
Provide a unique key column name on which the data has been partitioned and which has to be read.
Click the Save Component in Storage icon after doing all the configurations to save the reader component.​
A notification message appears to inform about the component configuration success.
Please Note:
The (*) symbol indicates that the field is mandatory.
Either table or query must be specified for the data readers except for SFTP Reader.
Selected Columns- There should not be a data type mismatch in the Column Type for all the Reader components.
The Meta Information fields may vary based on the selected File Type.
All the possibilities are mentioned below:
CSV: The ‘Header’ and ‘Infer Schema’ fields get displayed with CSV as the selected File Type. Enable the Header option to get the header of the file being read, and enable the Infer Schema option to get the true schema of the columns in the CSV file.
JSON: The ‘Multiline’ and ‘Charset’ fields get displayed with JSON as the selected File Type. Check the Multiline option if there is any multiline string in the file.
PARQUET: No extra field gets displayed with PARQUET as the selected File Type.
AVRO: This File Type provides two drop-down menus.
Compression: Select an option out of the ‘Deflate’ and ‘Snappy’ options.
Compression Level: This field appears for the Deflate compression option. It provides 0 to 9 levels via a drop-down menu.​
XML: Select this option to read XML file. If this option is selected, the following fields will get displayed:
Infer schema: Enable this option to get true schema of the column.
Path: Provide the path of the file.
Root Tag: Provide the root tag from the XML files.
Row Tags: Provide the row tags from the XML files.
Join Row Tags: Enable this option to join multiple row tags.
ORC: Select this option to read ORC file. If this option is selected, the following fields will get displayed:
Push Down: In ORC (Optimized Row Columnar) file format, "push down" typically refers to the ability to push down predicate filters to the storage layer for processing. There will be two options in it:
True: When push down is set to True, it indicates that predicate filters can be pushed down to the ORC storage layer for filtering rows at the storage level. This can improve query performance by reducing the amount of data that needs to be read into memory for processing.
False: When push down is set to False, predicate filters are not pushed down to the ORC storage layer. Instead, filtering is performed after the data has been read into memory by the processing engine. This may result in more data being read and potentially slower query performance compared to when push down is enabled.
This page describes the Basic Information tab provided for the pipeline components. This tab has to be configured for all the components.
The Invocation Type configuration decides the type of deployment of the component. There are the following two types of invocation:
Real-Time
Batch
When the component has the Real-Time invocation type, it never goes down while the pipeline is active. This is for situations where you want to keep the component ready to consume data at all times.
When "Realtime" is selected as the invocation type, we have an additional option to scale up the component called "Intelligent Scaling."
Please refer to the following page to learn more about Intelligent Scaling.
Please Note: The first component of the pipeline must use the Real-Time invocation type.
When the component has the Batch invocation type, it needs a trigger from the previous event to initiate processing. Once the component finishes processing and there are no new events to process, it goes down.
These are really helpful in Batch or scheduled operations where the data is not streaming or real-time.
Please Note: When the users select the Batch invocation type, they get an additional option of the Grace Period (in sec). This grace period is the time that the component will take to go down gracefully. The default value for Grace Period is 60 seconds and it can be configured by the user.​
The pipeline components process the data in micro-batches. The batch size defines the maximum number of records that you want to process in a single cycle of operation; this is really helpful if you want to control the number of records being processed by the component when the unit record size is large. You can configure it in the base config of the components.
The below given illustration displays how to update the Batch Size configuration.
We can create a failover event and map it in the component base configuration, so that if the component fails it audits all the failure messages with the data (if available) and timestamp of the error.
Go through the illustration given below to understand the Failover Events.
The DB Reader is a Spark-based reader that gives you the capability to read data from multiple database sources. All the supported database sources are listed below:
All component configurations are classified broadly into the following sections:
Basic Information
Meta Information
Connection Validation
Please follow the steps given in the demonstration to configure the DB Reader component.
Please Note:
The ClickHouse driver in the Spark components will use HTTP Port and not the TCP port.
In the case of data from multiple tables (join queries), one can write the join query directly without specifying multiple tables, as only one among table and query fields is required.
Table name: Provide a single table name or multiple table names. If multiple table names have to be given, enter them separated by commas (,).
Fetch Size: Provide the maximum number of records to be processed in one execution cycle.
Create Partition: This option is used for performance enhancement. It creates a sequence of indexing. Once this option is selected, the operation will not execute on the server.
Partition By: This option will appear once create partition option is enabled. There are two options under it:
Auto Increment: The number of partitions will be incremented automatically.
Index: The number of partitions will be incremented based on the specified Partition column.
Query: Enter the Spark SQL query in this field for the given table(s). It supports queries containing join statements as well. Please refer to the image below for writing a query on multiple tables.
Enable SSL: Check this box to enable SSL for this component. The Enable SSL feature in the DB Reader component will appear only for two drivers: PostgreSQL and ClickHouse.
Certificate Folder: This option appears when the Enable SSL field is checked. Select the certificate folder from the drop-down; it contains the files that have been uploaded in the Admin Settings. Please refer to the images given below for reference.
Sample Spark SQL query for DB Reader:
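For illustration, a hypothetical join query across two tables (all table and column names below are placeholders) could look like:
SELECT e.emp_id, e.emp_name, d.dept_name FROM employees e JOIN departments d ON e.dept_id = d.dept_id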
Please Note: To use DB reader component with SSL, the user needs to upload the following files on the certificate upload page:
Certificate file (.pem format)
Key file (.key format)
The SFTP Stream Reader is designed to read and access data from an SFTP server. It typically authenticates with the SFTP server using a username and password or SSH key-based authentication.
All component configurations are classified broadly into the following sections:
Meta Information
Please follow the demonstration to configure the component and its Meta Information.
Host: Enter the host.
Username: Enter username for SFTP stream reader.
Port: Provide the Port number.
Add File Name: Enable this option to get the file name along with the data.
Authentication: Select an authentication option using the drop-down list.
Password: Provide a password to authenticate the SFTP Stream reader component
PEM/PPK File: Choose a file to authenticate the SFTP Stream reader component. The user needs to upload a file if this authentication option has been selected.
Reader Path: Enter the path from where the file has to be read.
Channel: Select a channel option from the drop-down menu (the supported channel is SFTP).
File type: Select the file type from the drop-down:
CSV: The Header and Infer Schema fields get displayed with CSV as the selected File Type. Enable the Header option to get the header of the file being read, and enable the Infer Schema option to get the true schema of the columns in the CSV file. Schema: If CSV is selected as the file type, paste the Spark schema of the CSV file in this field.
JSON: The Multiline and Charset fields get displayed with JSON as the selected File Type. Check the Multiline option if there is any multiline string in the file.
XML: Select this option to read XML file. If this option is selected, the following fields will get displayed:
Infer schema: Enable this option to get true schema of the column.
Path: Provide the path of the file.
Root Tag: Provide the root tag from the XML files.
Row Tags: Provide the row tags from the XML files.
Join Row Tags: Enable this option to join multiple row tags.
File Metadata Topic: Enter Kafka Event Name where the reading file metadata has to be sent.
Column filter: Select the columns which you want to read; if you want to change the name of a column, enter that name in the alias name section, otherwise keep the alias name the same as the column name, and then select a Column Type from the drop-down menu.
Use Download Data and Upload File options to select the desired columns.
Upload File: The user can upload the existing system files (CSV, JSON) using the Upload File icon.
Download Data (Schema): Users can download the schema structure in JSON format by using the Download Data icon.
A Resource Configuration tab is available while configuring the components.
The Data Pipeline contains an option to configure the resources i.e., Memory & CPU for each component that gets deployed.
There are two types of components-based deployment types:
Docker
Spark
After the component and the pipeline are saved, the component gets saved with the default configuration of the pipeline, i.e., Low, Medium, or High. After the users save the pipeline, the Configuration tab can be seen in the component. It covers the following:
There are Request and Limit configurations needed for the Docker components.
The users can see the CPU and Memory options to be configured.
CPU: This is the CPU config where we can specify the number of cores that we need to assign to the component.
Please Note: In the configuration of Docker components, 1000 means 1 core.
Entering 100 means that 0.1 core has been assigned to the component.
Memory: This option is to specify how much memory you want to dedicate to that specific component.
Please Note: 1024 means 1GB in the configuration of the docker components.
Instances: The number of instances is used for parallel processing. If the users give N as the number of instances, that many pods will be deployed.
Spark Component has the option to give the partition factor in the Basic Information tab. This is critical for parallel spark jobs.
Please follow the given example to achieve it:
E.g., if the users need to run 10 parallel Spark processes to write the data where the number of input Kafka topic partitions is 5, they will have to set the partition count to 2 (i.e., 5 * 2 = 10 jobs). Also, to make it work, the number of cores * number of instances should be equal to 10: 2 cores * 5 instances = 10 jobs.
The configuration of the Spark Components is slightly different from the Docker components. When the spark components are deployed, there are two pods that come up:
Driver
Executor
Provide the Driver and Executor configurations separately.
Instances: The number of instances used for parallel processing. If we give N as the number of instances in the Executor configuration N executor pods will get deployed.
Please Note: Till the current release, the minimum requirement to deploy a driver is 0.1 Cores and 1 core for the executor. It can change with the upcoming versions of Spark.
SFTP Reader is designed to read and access files stored on an SFTP server. SFTP readers typically authenticate with the SFTP server using a username and password or SSH key pair, which grants access to the files stored on the server.
All component configurations are classified broadly into the following sections:
Meta Information
Please follow the demonstration to configure the SFTP Reader and its meta information.
Please go through the steps given below to configure the SFTP Reader component:
Host: Enter the host.
Username: Enter username for SFTP reader.
Port: Provide the Port number.
Dynamic Header: It can automatically detect the header row in a file and adjust the column names and number of columns as necessary.
Authentication: Select an authentication option using the drop-down list.
Password: Provide a password to authenticate the SFTP component
PEM/PPK File: Choose a file to authenticate the SFTP component. The user must upload a file if this authentication option is selected.
Reader Path: Enter the path from where the file has to be read.
Channel: Select a channel option from the drop-down menu (the supported channel is SFTP).
Column filter: Select the columns that you want to read; if you want to change the name of a column, enter that name in the alias name section, otherwise keep the alias name the same as the column name, and then select a Column Type from the drop-down menu.
Use the Download Data and Upload File options to select the desired columns.
Upload File: The user can upload the existing system files (CSV, JSON) using the Upload File icon.
Download Data (Schema): Users can download the schema structure in JSON format by using the Download Data icon.
This page covers configuration details for the MongoDB Reader component.
A MongoDB reader is designed to read and access data stored in a MongoDB database. Mongo readers typically authenticate with MongoDB using a username and password or other authentication mechanisms supported by MongoDB.
All component configurations are classified broadly into the following sections:
Meta Information
Please follow the demonstration to configure the component.
MongoDB Reader reads the data from the specified collection of Mongo Database. It has an option to filter data using spark SQL query.
Drag & Drop the MongoDB Reader on the Workflow Editor.
Click on the dragged reader component to open the component properties tabs below.
It is the default tab to open for the MongoDB reader while configuring the component.
Select an Invocation type from the drop-down menu to confirm the running mode of the reader component. Select the Real-Time or Batch option from the drop-down menu.
Deployment Type: It displays the deployment type for the component. This field comes pre-selected.
Container Image Version: It displays the image version for the docker container. This field comes pre-selected.
Failover Event: Select a failover Event from the drop-down menu.
Batch Size: Provide the maximum number of records to be processed in one execution cycle.
Please Note: The fields marked as (*) are mandatory fields.
Connection Type: Select the connection type from the drop-down:
Standard
SRV
Connection String
Host IP Address (*): Provide the IP address of the MongoDB host.
Port(*): Port number (It appears only with the Standard Connection Type).
Username(*): Provide username.
Password(*): Provide a valid password to access the MongoDB.
Database Name(*): Provide the name of the database from where you wish to read data.
Collection Name(*): Provide the name of the collection.
Partition Column: Specify a unique column name whose value is a number.
Query: Enter a Spark SQL query. Take the mongo collection_name as the table name in Spark SQL query.
Limit: Set a limit for the number of records to be read from MongoDB collection.
Schema File Name: Upload Spark Schema file in JSON format.
Cluster Sharded: Enable this option if data has to be read from sharded clustered database. A sharded cluster in MongoDB is a distributed database architecture that allows for horizontal scaling and partitioning of data across multiple nodes or servers. The data is partitioned into smaller chunks, called shards, and distributed across multiple servers.
Additional Parameters: Provide the additional parameters to connect with MongoDB. This field is optional.
Enable SSL: Check this box to enable SSL for this component. The MongoDB connection credentials will be different if this option is enabled.
Certificate Folder: This option appears when the Enable SSL field is checked. Select the certificate folder from the drop-down; it contains the files that have been uploaded in the Admin Settings for connecting to MongoDB with SSL. Please refer to the images given below for reference.
Sample Spark SQL query for MongoDB Reader:
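For illustration, assuming a hypothetical collection named customers (used as the table name, as noted in the Query field above) with fields such as name and city, a query could look like:
SELECT name, city FROM customers WHERE city = 'Mumbai'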
Please Note: The Meta Information fields vary based on the selected Connection Type option.
The following images display the various possibilities of the Meta Information for the MongoDB Reader:
i. Meta Information Tab with Standard as Connection Type.
ii. Meta Information Tab with SRV as Connection Type.
iii. Meta Information Tab with Connection String as Connection Type.
Column Filter: The users can select specific columns from the table to read data instead of selecting the complete table; this can be achieved via the Column Filter section. Select the columns which you want to read; if you want to change the name of a column, enter that name in the alias name section, otherwise keep the alias name the same as the column name, and then select a Column Type from the drop-down menu.
or
Use the Download Data and Upload File options to select the desired columns.
1. Upload File: The user can upload the existing system files (CSV, JSON) using the Upload File icon (file size must be less than 2 MB).
2. Download Data (Schema): Users can download the schema structure in JSON format by using the Download Data icon.
After doing all the configurations click the Save Component in Storage icon provided in the reader configuration panel to save the component.
A notification message appears to inform about the component configuration success.​
An Elasticsearch reader component is designed to read and access data stored in an Elasticsearch index. Elasticsearch readers typically authenticate with Elasticsearch using username and password credentials, which grant access to the Elasticsearch cluster and its indexes.
All component configurations are classified broadly into the following sections:
Meta Information
Please follow the given demonstration to configure the component.
Host IP Address: Enter the host IP Address for Elastic Search.
Port: Enter the port to connect with Elastic Search.
Index ID: Enter the Index ID to read a document in Elasticsearch. In Elasticsearch, an index is a collection of documents that share similar characteristics, and each document within an index has a unique identifier known as the index ID. The index ID is a unique string that is automatically generated by Elasticsearch and is used to identify and retrieve a specific document from the index.
Resource Type: Provide the resource type. In Elasticsearch, a resource type is a way to group related documents together within an index. Resource types are defined at the time of index creation, and they provide a way to logically separate different types of documents that may be stored within the same index.
Is Date Rich True: Enable this option if any fields in the reading file contain date or time information. The date rich feature in Elasticsearch allows for advanced querying and filtering of documents based on date or time ranges, as well as date arithmetic operations.
Username: Enter the username for Elasticsearch.
Password: Enter the password for Elasticsearch.
Query: Provide a Spark SQL query.
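For illustration only, a hypothetical filter query is given below; the column names are placeholders, and the table reference assumes the component exposes the read data under an alias such as inputDf (verify the expected table reference for your deployment):
SELECT order_id, status FROM inputDf WHERE status = 'shipped'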
HDFS stands for Hadoop Distributed File System. It is a distributed file system designed to store and manage large data sets in a reliable, fault-tolerant, and scalable way. HDFS is a core component of the Apache Hadoop ecosystem and is used by many big data applications.
This component reads files located in HDFS (Hadoop Distributed File System).
All component configurations are classified broadly into three sections:
Meta Information
Host IP Address: Enter the host IP address for HDFS.
Port: Enter the Port.
Zone: Enter the Zone for HDFS. Zone is a special directory whose contents will be transparently encrypted upon write and transparently decrypted upon read.
File Type: Select the File Type from the drop down. The supported file types are:
CSV: The Header and Infer Schema fields get displayed with CSV as the selected File Type. Enable the Header option to get the header of the file being read, and enable the Infer Schema option to get the true schema of the columns in the CSV file.
JSON: The Multiline and Charset fields get displayed with JSON as the selected File Type. Check the Multiline option if there is any multiline string in the file.
PARQUET: No extra field gets displayed with PARQUET as the selected File Type.
AVRO: This File Type provides two drop-down menus.
Compression: Select an option out of the Deflate and Snappy options.
Compression Level: This field appears for the Deflate compression option. It provides 0 to 9 levels via a drop-down menu.
XML: Select this option to read XML file. If this option is selected, the following fields will get displayed:
Infer schema: Enable this option to get true schema of the column.
Path: Provide the path of the file.
Root Tag: Provide the root tag from the XML files.
Row Tags: Provide the row tags from the XML files.
Join Row Tags: Enable this option to join multiple row tags.
ORC: Select this option to read ORC file. If this option is selected, the following fields will get displayed:
Push Down: In ORC (Optimized Row Columnar) file format, "push down" typically refers to the ability to push down predicate filters to the storage layer for processing. There will be two options in it:
True: When push down is set to True, it indicates that predicate filters can be pushed down to the ORC storage layer for filtering rows at the storage level. This can improve query performance by reducing the amount of data that needs to be read into memory for processing.
False: When push down is set to False, predicate filters are not pushed down to the ORC storage layer. Instead, filtering is performed after the data has been read into memory by the processing engine. This may result in more data being read and potentially slower query performance compared to when push down is enabled.
Path: Provide the path of the file.
Partition Columns: Provide a unique Key column name to partition data in Spark.
GCS Reader component is typically designed to read data from Google Cloud Storage (GCS), a cloud-based object storage service provided by Google Cloud Platform. A GCS Reader can be a part of an application or system that needs to access data stored in GCS buckets. It allows you to retrieve, read, and process data from GCS, making it accessible for various use cases, such as data analysis, data processing, backups, and more.
GCS Reader pulls data from the GCS Monitor, so the first step is to implement the GCS Monitor.
Note: The users can refer to the relevant section of this document for the details.
All component configurations are classified broadly into the following sections:
Meta Information
Navigate to the Pipeline Workflow Editor page for an existing pipeline workflow with GCS Monitor and Event component.
Open the Reader section of the Component Pallet.
Drag the GCS Reader to the Workflow Editor.
Click on the dragged GCS Reader component to get the component properties tabs below.
It is the default tab to open for the component while configuring it.
Invocation Type: Select an invocation mode from the ‘Real-Time’ or ‘Batch’ using the drop-down menu.
Deployment Type: It displays the deployment type for the reader component. This field comes pre-selected.
Container Image Version: It displays the image version for the docker container. This field comes pre-selected.
Failover Event: Select a failover Event from the drop-down menu.
Batch Size (min 10): Provide the maximum number of records to be processed in one execution cycle (the minimum limit for this field is 10).
Bucket Name: Enter the Bucket name for GCS Reader. A bucket is a top-level container for storing objects in GCS.
Directory Path: Enter the path where the file is located, which needs to be read.
File Name: Enter the file name.
Navigate to the Pipeline Workflow Editor page for an existing pipeline workflow with the PySpark GCS Reader and Event component.
OR
You may create a new pipeline with the mentioned components.
Open the Reader section of the Component Pallet.
Drag the PySpark GCS Reader to the Workflow Editor.
Click the dragged GCS Reader component to get the component properties tabs below.
Invocation Type: Select an invocation mode from the ‘Real-Time’ or ‘Batch’ using the drop-down menu.
Deployment Type: It displays the deployment type for the reader component. This field comes pre-selected.
Container Image Version: It displays the image version for the docker container. This field comes pre-selected.
Failover Event: Select a failover Event from the drop-down menu.
Batch Size (min 10): Provide the maximum number of records to be processed in one execution cycle (the minimum limit for this field is 10).
Secret File (*): Upload the JSON from the Google Cloud Storage.
Bucket Name (*): Enter the Bucket name for GCS Reader. A bucket is a top-level container for storing objects in GCS.
Path: Enter the path where the file is located, which needs to be read.
Read Directory: Enable this option to read the entire directory; disable it to read a single file from the directory.
Limit: Set a limit for the number of records to be read.
File Type (*): Select the File Type from the drop-down. The supported file formats are:
CSV: The Header, Multiline, and Infer Schema fields will be displayed with CSV as the selected File Type. Enable the Header option to get the header of the file being read, and enable the Infer Schema option to get the true schema of the columns in the CSV file. Check the Multiline option if there is any multiline string in the file.
JSON: The Multiline and Charset fields are displayed with JSON as the selected File Type. Check the Multiline option if there is any multiline string in the file.
PARQUET: No extra field gets displayed with PARQUET as the selected File Type.
AVRO: This File Type provides two drop-down menus.
Compression: Select an option out of the Deflate and Snappy options.
Compression Level: This field appears for the Deflate compression option. It provides 0 to 9 levels via a drop-down menu.
XML: Select this option to read the XML file. If this option is selected, the following fields will be displayed:
Infer schema: Enable this option to get the true schema of the column.
Path: Provide the path of the file.
Root Tag: Provide the root tag from the XML files.
Row Tags: Provide the row tags from the XML files.
Join Row Tags: Enable this option to join multiple row tags.
Query: Enter the Spark SQL query (see the sample query after this list).
Select the desired columns using the Download Data and Upload File options.
Or
The user can also use the Column Filter section to select columns.
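As referenced in the Query field above, a minimal sample query is sketched below; the column names are placeholders, and the table reference assumes the read data is exposed as inputDf (verify the expected table alias for this component):
SELECT id, amount FROM inputDf WHERE amount > 1000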
Click the Save Component in Storage icon after doing all the configurations to save the reader component.
A notification message appears to inform about the component configuration success.
A MongoDB reader component is designed to read and access data stored in a MongoDB database. Mongo readers typically authenticate with MongoDB using a username and password or other authentication mechanisms supported by MongoDB.
This page covers the configuration steps for the MongoDB Reader. All component configurations are classified broadly into the following sections:
Meta Information
MongoDB Reader reads data from the specified database’s collection. It also has an option to filter the data using Mongo Query Language (MQL), which runs the MQL directly on the MongoDB server and pushes the data to the out event.
Check out the walk-through given below for the MongoDB Reader Lite.
Drag & drop the Mongo Reader component to the Workflow Editor.
Click on the dragged reader component.
The component properties tabs open below.
It is the default tab to open for the Mongodb Reader Lite while configuring the component.
Select an Invocation Type from the drop-down menu to confirm the running mode of the reader component. Select Real-Time or Batch from the drop-down menu.​
Deployment Type: It displays the deployment type for the component (This field comes pre-selected).
Container Image Version: It displays the image version for the docker container (This field comes pre-selected).
Failover Event: Select a failover Event from the drop-down menu.
Batch Size (min 10): Provide the maximum number of records to be processed in one execution cycle (Min limit for this field is 10).
Please Note: The Grace Period Field appears when the Batch is selected as the Invocation Type option in the Basic Information tab. You can now give a grace period for components to go down gracefully after that time by configuring this field.​
Open the Meta Information tab and fill in all the connection-specific details of MongoDB Reader Lite. The Meta Information tab opens with the below given fields:
Please Note: The Meta Information fields may vary based on the selected Connection Type option.
Please Note: The fields marked as (*) are mandatory fields.
Connection Type: Select either of the options out of Standard, SRV, and Connection String as connection types.
Port number (*): Provide the Port number (It appears only with the Standard connection type).
Host IP Address (*): The IP address of the host.
Username (*): Provide a username.
Password (*): Provide a valid password to access the MongoDB.
Database Name (*): Provide the name of the database where you wish to write data.
Collection Name (*): Provide the name of the collection.
Fetch Size: Specifies the number of documents to return in each batch of the response from the MongoDB collection. For example, if 1000 is given in the Fetch Size field, it will read 1000 records in one execution and process them further.
Additional Parameters: Provide details of the additional parameters.
Enable SSL: Check this box to enable SSL for this component. The MongoDB connection credentials will be different if this option is enabled.
The user needs to upload the following files on the certificate upload page:
Certificate file (.pem format)
Key file (.key format)
Certificate Folder: This option appears when the Enable SSL field is checked. Select the certificate folder from the drop-down; it contains the files that have been uploaded in the Admin Settings for connecting to MongoDB with SSL. Please refer to the images given below for reference.
Connection String (*): Provide a connection string (It appears only with the Connection String connection type).
Query: Provide a relevant Mongo (MQL) query to filter the data; it runs directly on the MongoDB server.
Meta Information tab with the Enable SSL field enabled:
After configuring the required configuration fields, click the Save Component in Storage icon provided in the reader configuration panel to save the component.
A confirmation message appears to notify the component properties are saved.
Click on the Update Pipeline icon to update the pipeline.
A confirmation message appears to inform the user.
Click on the Activate Pipeline icon.​
The Confirm dialog box appears to ask the user permission.
Click the YES option.
A confirmation message appears to inform that the pipeline has been activated.
Click on the Toggle Log Panel icon.
The Log Panel opens displaying the Logs and Advance Logs tabs.
Please Note:
The Pod logs for the components appear in the Advanced Logs tab.
The overall component logs will be displayed in the Logs tab.
A configured component will display some more tabs such as the Configuration, Logs, and Pod Logs tabs (as displayed below for the Mongodb Reader Lite component).
This tab shows the full description of the component.
Data writers specifically focus on the final stage of the pipeline, where the processed or transformed data is written to the target destination. This section explains all the supported Data Writers.
The Big Query Reader Component is designed for efficient data access and retrieval from Google Big Query, a robust data warehousing solution on Google Cloud. It enables applications to execute complex SQL queries and process large datasets seamlessly. This component simplifies data retrieval and processing, making it ideal for data analysis, reporting, and ETL workflows.
All component configurations are classified broadly into the following sections:
Navigate to the Data Pipeline Editor.
Expand the Reader section provided under the Component Pallet.
Drag and drop the Big Query Reader component to the Workflow Editor.
Click on the dragged Big Query Reader to get the component properties tabs.
It is the default tab to open for the component while configuring it.
Invocation Type: Select an invocation mode from the ‘Real-Time’ or ‘Batch’ using the drop-down menu.
Deployment Type: It displays the deployment type for the reader component. This field comes pre-selected.
Batch Size (min 1): Provide the maximum number of records to be processed in one execution cycle (Min limit for this field is 1).
Failover Event: Select a failover Event from the drop-down menu.
Container Image Version: It displays the image version for the docker container. This field comes pre-selected.
Open the Meta Information tab and fill in all the connection-specific details for the Big Query Reader.
Read using: The 'Service Account' option is available under this field, so select it.
Dataset Id: Mention the Dataset ID from Big Query which is to be read.
Table Id: Mention the Table ID from Big Query which is to be read.
Location (*): Mention the location according to your Project.
Limit: Set a limit for the number of records to be read.
Query: Enter an SQL Query.
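For illustration, a hypothetical query against a BigQuery table (the dataset, table, and column names below are placeholders) could look like:
SELECT name, country FROM my_dataset.my_table WHERE country = 'India' LIMIT 100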
A notification message appears to inform about the component configuration success.
ClickHouse reader is designed to read and access data stored in a ClickHouse database. ClickHouse readers typically authenticate with ClickHouse using a username and password or other authentication mechanisms supported by ClickHouse.
Along with the Spark driver in the RDBMS reader, there is a Docker reader that supports the TCP port.
All component configurations are classified broadly into the following sections:
Basic Information
Meta Information
Check out the given illustration to understand the configuration steps for the ClickHouse Reader component.
Host IP Address: Enter the Host IP Address.
Port: Enter the port for the given IP Address.
User name: Enter the user name for the provided database.
Password: Enter the password for the provided database.
Database name: Enter the Database name.
Table name: Provide a single table name or multiple table names. If multiple table names have to be given, enter them separated by commas (,).
Settings: An option that allows you to customize various configuration settings for a specific query.
Enable SSL: Enabling SSL with ClickHouse Reader involves configuring the reader to use the Secure Sockets Layer (SSL) protocol for secure communication between the reader and the ClickHouse server.
Query: Write SQL query to filter out desired data from ClickHouse Database.
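For illustration, a hypothetical filter query (the table and column names below are placeholders) could look like:
SELECT id, status FROM sales_orders WHERE status = 'OPEN'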
Please Note:
The Meta Information tab has an SSL field for the ClickHouse Reader component. The user needs to configure the SSL.
The ClickHouse Reader Docker component supports only the TCP port.
Amazon Athena is an interactive query service that makes it easy to analyze data directly in Amazon Simple Storage Service (Amazon S3) using standard SQL. With a few actions in the AWS Management Console, you can point Athena at your data stored in Amazon S3 and begin using standard SQL to run ad-hoc queries and get results in seconds.
Athena Query Executer component enables users to read data directly from the external table created in AWS Athena.
Please Note: Go through the demonstration given below to configure the Athena Query Executer component in the pipeline.
Region: Enter the region name where the bucket is located.
Access Key: Enter the AWS Access Key of the AWS account which has to be used.
Secret Key: Enter the AWS Secret Key of the AWS account which has to be used.
Table Name: Enter the name of the external table created in Athena.
Database Name: Name of the database in Athena in which the table has been created.
Limit: Enter the number of records to be read from the table.
Data Source: Enter the Data Source name configured in Athena. Data Source in Athena refers to the location where your data resides, typically an S3 bucket.
Workgroup: Enter the Workgroup name configured in Athena. The Workgroup in Athena is a resource type used to separate query execution and query history between Users, Teams, or Applications running under the same AWS account.
Query location: Enter the path where the results of queries done in the Athena query editor are saved in CSV format. You can find this path under the "Settings" tab in the Athena query editor in the AWS console, labeled as "Query Result Location".
Query: Enter the Spark SQL query.
Sample Spark SQL query that can be used in Athena Query Executer:
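For illustration, assuming a hypothetical external table named web_logs with columns such as url and status_code (prefix the database name if your setup requires it), a query could look like:
SELECT url, status_code FROM web_logs WHERE status_code = 500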
Azure Metadata Reader is designed to read and access metadata associated with Azure resources. Azure Metadata Readers typically authenticate with Azure using Azure Active Directory credentials or other authentication mechanisms supported by Azure.
All component configurations are classified broadly into the following sections:
Basic Information
Meta Information
Please Note: Go through the demonstration given below to configure the Azure Metadata Reader in the pipeline.
Please Note: Before starting to use the Azure Reader component, please follow the steps below to obtain the Azure credentials from the Azure Portal:
Accessing Azure Blob Storage: Shared Access Signature (SAS), Secret Key, and Principal Secret
This document outlines three methods for accessing Azure Blob Storage: Shared Access Signatures (SAS), Secret Keys, and Principal Secrets.
Understanding Security Levels:
Shared Access Signature (SAS): This is the recommended approach due to its temporary nature and fine-grained control over access permissions. SAS tokens can be revoked, limiting potential damage if compromised.
Secret Key: Secret keys grant full control over your storage account. Use them with caution and only for programmatic access. Consider storing them securely in Azure Key Vault and avoid hardcoding them in scripts.
Principal Secret: This applies to Azure Active Directory (Azure AD) application access. Similar to secret keys, use them cautiously and store them securely (e.g., Azure Key Vault).
1. Shared Access Signature (SAS):
Benefits:
Secure: Temporary and revocable, minimizing risks.
Granular Control: Define specific permissions (read, write, list, etc.) for each SAS token.
Steps to Generate an SAS Token:
Navigate to Azure Portal: Open the Azure portal (https://azure.microsoft.com/en-us/get-started/azure-portal) and log in with your credentials.
Access Blob Storage Account: Locate "Storage accounts" in the left menu and select your storage account.
Configure SAS Settings: Find and click on "Shared access signature" in the settings. Define the permissions, expiry date, and other parameters for your needs.
Generate SAS Token: Click on "Generate SAS and connection string" to create the SAS token.
Copy and Use SAS Token: Copy the generated SAS token. Use this token to securely access your Blob Storage resources in your code.
2. Secret Key:
Use with Caution:
High-Risk: Grants full control over your storage account.
Secure Storage: Store them securely in Azure Key Vault, never hardcode them in scripts.
Steps to Obtain Secret Key:
Navigate to Azure Portal: Open the Azure portal and log in.
Access Blob Storage Account: Locate and select your storage account.
View Secret Keys: Click on "Access keys" to view your storage account keys. Do not store these directly in code. Consider Azure Key Vault for secure storage.
3. Principal Secret (Azure AD Application):
Use for Application Access:
Grants access to your storage account through an Azure AD application.
Secure Storage: Store them securely in Azure Key Vault, never hardcode them in scripts.
Steps to Obtain Principal Secret:
Navigate to Azure AD Portal: Open the Azure AD portal (https://azure.microsoft.com/en-us/get-started/azure-portal) and log in with your credentials.
Access App Registrations: Locate "App registrations" in the left menu.
Select Your Application: Find and click on the application for which you want to obtain the principal secret.
Access Certificates & Secrets: Inside your application, go to "Certificates & secrets" in the settings menu.
Generate New Client Secret (Principal Secret):
Under "Client secrets," click on "New client secret."
Enter a description, select the expiry duration, and click "Add" to generate the new client secret.
Copy the generated client secret immediately as it will be hidden afterward.
Read Using: There are three authentication methods available to connect with Azure in the Azure Blob Reader Component:
Shared Access Signature
Secret Key
Principal Secret
Provide the following details:
Shared Access Signature: This is a URI that grants restricted access rights to Azure Storage resources.
Account Name: Provide the Azure account name.
Container: Provide the container name from where the file is located and which has to be read.
Path type: There are options available under it:
Null: If Null is selected as the Path Type, the component will read the metadata of all the blobs from the given container. The user does not need to fill the Blob Name field in this option.
Directory Path: Enter the directory path to read the metadata of files located in the specified directory. For example: employee/joining_year=2010/department=BI/designation=Analyst/
Blob Name: Specify the blob name to read the metadata from that particular blob.
Provide the following details:
Account Key: It is used to authorize access to data in your storage account via Shared Key authorization.
Account Name: Provide the Azure account name.
Container: Provide the container name from where the file is located and which has to be read.
Path type: There are options available under it:
Null: If Null is selected as the Path Type, the component will read the metadata of all the blobs from the given container. The user does not need to fill the Blob Name field in this option.
Directory Path: Enter the directory to read the metadata of files located in the specified directory. For example: employee/joining_year=2010/department=BI/designation=Analyst/
Blob Name: Specify the blob name to read the metadata from that particular blob.
Provide the following details:
Client ID: The client ID is the unique Application (client) ID assigned to your app by Azure AD when the app was registered.
Tenant ID: It is a globally unique identifier (GUID) that is different than your organization name or domain.
Client Secret: The client secret is the password of the service principal.
Account Name: Provide the Azure account name.
Container: Provide the container name from where the file is located and which has to be read.
Path type: There are options available under it:
Null: If Null is selected as the Path Type, the component will read the metadata of all the blobs from the given container. The user does not need to fill the Blob Name field in this option.
Directory Path: Enter the directory to read the metadata of files located in the specified directory. For example: employee/joining_year=2010/department=BI/designation=Analyst/
Blob Name: Specify the blob name to read the metadata from that particular blob.
Once the component runs successfully, it will send the following metadata to the output event:
Container: Name of the container where the blob is present.
Blob: Name of the blob present in the specified path.
blobLastModifiedDateAndTime: Date and time when the blob was last modified.
blobLength: Size of the blob.
A Sandbox reader is used to read and access data within a configured sandbox environment.
All component configurations are classified broadly into the following sections:
Basic Information
Meta Information
Before using the Sandbox Reader component for reading a file, the user needs to upload a file in Data Sandbox under the Data Center module.
Please go through the given walk-through for uploading the file in the Data Sandbox under the Data Center module.
Check out the given video on how to configure a Sandbox Reader component.
Navigate to the Data Pipeline Editor.
Expand the Readers section provided under the Component Pallet.
Drag and drop the Sandbox Reader component to the Workflow Editor.
Click on the dragged Sandbox Reader to get the component properties tabs.
It is the default tab to open for the component while configuring it.
Invocation Type: Select an invocation mode from the Real-Time or Batch options by using the drop-down menu.
Deployment Type: It displays the deployment type for the reader component. This field comes pre-selected.
Container Image Version: It displays the image version for the docker container. This field comes pre-selected.
Failover Event: Select a failover Event from the drop-down menu.
Batch Size (min 10): Provide the maximum number of records to be processed in one execution cycle (Minimum limit for this field is 10).
Storage Type: The user will find two options here:
Network: This option will be selected by default. In this option, the following fields will be displayed:
File Type: Select the type of the file to be read. Supported file types include CSV, JSON, PARQUET, AVRO, XML, and ORC.
Schema: Enter the Spark schema of the file in JSON format.
Sandbox Folder Path: Enter the Sandbox folder name where the data is stored in part files.
Limit: Enter the number of records to be read.
Platform: In this option, the following fields will be displayed:
File Type: Select the type of the file to be read. The supported file types are CSV, JSON, PARQUET, AVRO, XML, and ORC.
Sandbox Name: This field will display once the user selects the file type. It will show all the Sandbox names for the selected file type, and the user has to select the Sandbox name from the drop-down.
Sandbox File: This field displays the name of the sandbox file to be read. It will automatically fill when the user selects the sandbox name.
Limit: Enter the number of records to be read.
Query: Enter a spark SQL query. Take inputDf as a table name.
Column Filter: There is also a Column Filter section in the Meta Information tab, which lets the user read specific columns from the table instead of selecting the complete table. Select the columns that you want to read; if you want to change the name of a column, enter that name in the alias name section, otherwise keep the alias name the same as the column name, and then select a Column Type from the drop-down menu.
Use the Download Data and Upload File options to select the desired columns.
Upload File: The user can upload the existing system files (CSV, JSON) using the Upload File icon (file size must be less than 2 MB).
Download Data (Schema): Users can download the schema structure in JSON format by using the Download Data icon.
or
Use the Download Data and Upload File options to select the desired columns.
Partition Columns: To read a specific partition, enter the name of the partitioned column.
Sample Query for Sandbox Reader:
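For illustration, assuming inputDf as the table name (as noted in the Query field above) and hypothetical columns such as product and price, a query could look like:
SELECT product, price FROM inputDf WHERE price > 500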
Please Note:
The (*) symbol indicates that the field is mandatory.
Either table or query must be specified for the data readers except for SFTP Reader.
Column Filter- There should not be a data type mismatch in the Column Type for all the Reader components.
Fields in the Meta Information tab may vary based on the selected File Type. All the possibilities are mentioned below:
CSV: The following fields will display when CSV is selected as File Type:
Header: Enable the Header option to retrieve the header of the reading file.
Infer Schema: Enable the Infer Schema option to obtain the true schema of the columns in the CSV file.
Multiline: Enable the Multiline option to read multiline strings in the data.
Schema: This field is visible only when the Header option is enabled. Enter the Spark schema in JSON format in this field to filter out the bad records. To filter the bad records, the user needs to map the failover Kafka event in the Failover Event field of the Basic Information tab. (A sketch of how to produce such a schema JSON is given after this list.)
JSON: The Multiline and Charset fields are displayed when JSON is the selected File Type. Enable the Multiline option if the file contains any multiline strings.
PARQUET: No extra field gets displayed with PARQUET as the selected File Type.
AVRO: This File Type provides two drop-down menus.
Compression: Select an option out of the Deflate and Snappy options.
Compression Level: This field appears for the Deflate compression option. It provides 0 to 9 levels via a drop-down menu.
XML: Select this option to read the XML file. If this option is selected, the following fields will be displayed:
Infer schema: Enable this option to get the true schema of the column.
Path: Provide the path of the file.
Root Tag: Provide the root tag from the XML files.
Row Tags: Provide the row tags from the XML files.
Join Row Tags: Enable this option to join multiple row tags.
ORC: Select this option to read the ORC file. If this option is selected, the following fields will be displayed:
Push Down: In ORC (Optimized Row Columnar) file format, "push down" typically refers to the ability to push down predicate filters to the storage layer for processing. There will be two options in it:
True: When push-down is set to True, it indicates predicate filters can be pushed down to the ORC storage layer for filtering rows at the storage level. This can improve query performance by reducing the amount of data that needs to be read into memory for processing.
False: When push down is set to False, predicate filters are not pushed down to the ORC storage layer. Instead, filtering is performed after the data has been read into memory by the processing engine. This may result in more data being read and potentially slower query performance compared to when push-down is enabled.
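The Schema field above (and the Schema File Name fields used elsewhere in this document) expects a Spark schema expressed in JSON. One convenient way to produce that JSON is to define the schema with pyspark.sql.types and print schema.json(); the field names below are examples only.

```python
# Sketch: build a Spark schema in code and print the JSON representation
# expected by the Schema / Schema File Name fields. Field names are examples only.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

schema = StructType([
    StructField("customer_id", IntegerType(), True),
    StructField("country", StringType(), True),
    StructField("amount", DoubleType(), True),
])

# Paste the printed string into the Schema field, or save it to a .json file
# and upload it where a Schema File Name is expected.
print(schema.json())
```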
Azure Blob Reader is designed to read and access data stored in Azure Blob Storage. Azure Blob Readers typically authenticate with Azure Blob Storage using Azure Active Directory credentials or other authentication mechanisms supported by Azure.
This is a Docker-based component.
All component configurations are classified broadly into the following sections:
​​Basic Information​​
Meta Information
Please Note: Go through the demonstration given below to configure the Azure Blob Reader in the pipeline.
Please Note: Before starting to use the Azure Reader component, please follow the steps below to obtain the Azure credentials from the Azure Portal:
Accessing Azure Blob Storage: Shared Access Signature (SAS), Secret Key, and Principal Secret
This document outlines three methods for accessing Azure Blob Storage: Shared Access Signatures (SAS), Secret Keys, and Principal Secrets.
Understanding Security Levels:
Shared Access Signature (SAS): This is the recommended approach due to its temporary nature and fine-grained control over access permissions. SAS tokens can be revoked, limiting potential damage if compromised.
Secret Key: Secret keys grant full control over your storage account. Use them with caution and only for programmatic access. Consider storing them securely in Azure Key Vault and avoid hardcoding them in scripts.
Principal Secret: This applies to Azure Active Directory (Azure AD) application access. Similar to secret keys, use them cautiously and store them securely (e.g., Azure Key Vault).
1. Shared Access Signature (SAS):
Benefits:
Secure: Temporary and revocable, minimizing risks.
Granular Control: Define specific permissions (read, write, list, etc.) for each SAS token.
Steps to Generate an SAS Token:
Navigate to Azure Portal: Open the Azure portal (https://azure.microsoft.com/en-us/get-started/azure-portal) and log in with your credentials.
Access Blob Storage Account: Locate "Storage accounts" in the left menu and select your storage account.
Configure SAS Settings: Find and click on "Shared access signature" in the settings. Define the permissions, expiry date, and other parameters for your needs.
Generate SAS Token: Click on "Generate SAS and connection string" to create the SAS token.
Copy and Use SAS Token: Copy the generated SAS token. Use this token to securely access your Blob Storage resources in your code.
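For reference, a copied SAS token is typically used programmatically as shown in the hedged sketch below, which uses the Python azure-storage-blob package; the account name, container name, and token values are placeholders and this is not tied to the component's internal implementation.

```python
# Sketch: using a SAS token with the azure-storage-blob SDK.
# <account_name>, <container_name>, and <sas_token> are placeholders.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient(
    account_url="https://<account_name>.blob.core.windows.net",
    credential="<sas_token>",  # the SAS token copied from the portal
)
container = service.get_container_client("<container_name>")
for blob in container.list_blobs():
    print(blob.name)
```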
2. Secret Key:
Use with Caution:
High-Risk: Grants full control over your storage account.
Secure Storage: Store them securely in Azure Key Vault, never hardcode them in scripts.
Steps to Obtain Secret Key:
Navigate to Azure Portal: Open the Azure portal and log in.
Access Blob Storage Account: Locate and select your storage account.
View Secret Keys: Click on "Access keys" to view your storage account keys. Do not store these directly in code. Consider Azure Key Vault for secure storage.
3. Principal Secret (Azure AD Application):
Use for Application Access:
Grants access to your storage account through an Azure AD application.
Secure Storage: Store them securely in Azure Key Vault, never hardcode them in scripts.
Steps to Obtain Principal Secret:
Navigate to Azure AD Portal: Open the Azure AD portal (https://azure.microsoft.com/en-us/get-started/azure-portal) and log in with your credentials.
Access App Registrations: Locate "App registrations" in the left menu.
Select Your Application: Find and click on the application for which you want to obtain the principal secret.
Access Certificates & Secrets: Inside your application, go to "Certificates & secrets" in the settings menu.
Generate New Client Secret (Principal Secret):
Under "Client secrets," click on "New client secret."
Enter a description, select the expiry duration, and click "Add" to generate the new client secret.
Copy the generated client secret immediately as it will be hidden afterward.
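For reference, the Client ID, Tenant ID, and the client secret generated above are typically combined as follows when accessing Blob Storage from code (a hedged sketch using the azure-identity and azure-storage-blob packages; all bracketed values are placeholders).

```python
# Sketch: authenticating to Blob Storage with an Azure AD application (principal secret).
# All values in angle brackets are placeholders.
from azure.identity import ClientSecretCredential
from azure.storage.blob import BlobServiceClient

credential = ClientSecretCredential(
    tenant_id="<tenant_id>",
    client_id="<client_id>",
    client_secret="<client_secret>",  # the client secret generated above
)
service = BlobServiceClient(
    account_url="https://<account_name>.blob.core.windows.net",
    credential=credential,
)
print([container.name for container in service.list_containers()])
```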
Read Using: There are three authentication methods available to connect with Azure in the Azure Blob Reader Component:
Shared Access Signature
Secret Key
Principal Secret
Provide the following details:
Shared Access Signature: This is a URI that grants restricted access rights to Azure Storage resources.
Account Name: Provide the Azure account name.
Container: Provide the container name from where the file is located and which has to be read.
File Type: There are five (5) types of file extensions available:
CSV
JSON
PARQUET
AVRO
XML
Read Directory: This field will be checked by default. If this option is enabled, the component will read data from all the blobs present in the container.
Blob Name: This field will display only if the Read Directory field is disabled. Enter the specific name of the blob whose data has to be read.
Column Filter: Enter the column names here. Only the specified columns will be fetched from Azure Blob. In this field, the user needs to fill in the following information:
Source Field: Enter the name of the column from the blob. The user can add multiple columns by clicking on the "Add New Column" option.
Destination Field: Enter the alias name for the source field.
Column Type: Enter the data type of the column.
Upload: This option allows the user to upload a data file in CSV, JSON, or EXCEL format. The column names will be automatically fetched from the uploaded data file and filled out in the Name, Alias, and Column Type fields.
Download Data: This option will download the data filled in the Column Filter field in JSON format.
Delete Data: This option will clear all the information filled in the Column Filter field.
Provide the following details:
Account Key: Used to authorize access to data in your storage account via Shared Key authorization.
Account Name: Provide the Azure account name.
Container: Provide the container name from where the file is located and which has to be read.
File Type: There are five (5) types of file extensions available:
CSV
JSON
PARQUET
AVRO
XML
Read Directory: This field will be checked by default. If this option is enabled, the component will read data from all the blobs present in the container.
Blob Name: This field will display only if the Read Directory field is disabled. Enter the specific name of the blob whose data has to be read.
Column Filter: Enter the column names here. Only the specified columns will be fetched from Azure Blob. In this field, the user needs to fill in the following information:
Source Field: Enter the name of the column from the blob. The user can add multiple columns by clicking on the "Add New Column" option.
Destination Field: Enter the alias name for the source field.
Column Type: Enter the data type of the column.
Upload: This option allows the user to upload a data file in CSV, JSON, or EXCEL format. The column names will be automatically fetched from the uploaded data file and filled out in the Name, Alias, and Column Type fields.
Download Data: This option will download the data filled in the Column Filter field in JSON format.
Delete Data: This option will clear all the information filled in the Column Filter field.
Provide the following details:
Client ID: The unique Application (client) ID assigned to your app by Azure AD when the app was registered.
Tenant ID: A globally unique identifier (GUID) that is different from your organization name or domain.
Client Secret: The password of the service principal.
Account Name: Provide the Azure account name.
File Type: There are five (5) types of file extensions available:
CSV
JSON
PARQUET
AVRO
XML
Read Directory: This field will be checked by default. If this option is enabled, the component will read data from all the blobs present in the container.
Blob Name: This field will display only if the Read Directory field is disabled. Enter the specific name of the blob whose data has to be read.
Column Filter: Enter the column names here. Only the specified columns will be fetched from Azure Blob. In this field, the user needs to fill in the following information:
Source Field: Enter the name of the column from the blob. The user can add multiple columns by clicking on the "Add New Column" option.
Destination Field: Enter the alias name for the source field.
Column Type: Enter the data type of the column.
Upload: This option allows the user to upload a data file in CSV, JSON, or EXCEL format. The column names will be automatically fetched from the uploaded data file and filled out in the Name, Alias, and Column Type fields.
Download Data: This option will download the data filled in the Column Filter field in JSON format.
Delete Data: This option will clear all the information filled in the Column Filter field.
Note: The following fields will be displayed after selecting the following file types:
CSV: The Header and Infer Schema fields get displayed when CSV is the selected File Type. Enable the Header option to retrieve the header of the file being read, and enable the Infer Schema option to obtain the true schema of the columns in the CSV file.
JSON: The Multiline and Charset fields get displayed with JSON as the selected File Type.
Multiline: This option handles JSON files that contain records spanning multiple lines. Enabling this ensures the JSON parser reads multiline records correctly.
Charset: Specify the character set used in the JSON file. This defines the character encoding of the JSON file, such as UTF-8 or ISO-8859-1, ensuring correct interpretation of the file content.
PARQUET: No extra field gets displayed with PARQUET as the selected File Type.
AVRO: This File Type provides two drop-down menus.
Compression: Select an option out of the Deflate and Snappy options.
Deflate: A compression algorithm that balances between compression speed and compression ratio, often resulting in smaller file sizes.
Snappy: This compression type is selected by default. A fast compression and decompression algorithm developed by Google, optimized for speed rather than maximum compression ratio.
Compression Level: This field appears if Deflate compression is selected. It provides a drop-down menu with levels ranging from 0 to 9, indicating the compression intensity.
Azure Blob Reader is designed to read and access data stored in Azure Blob Storage. Azure Blob Readers typically authenticate with Azure Blob Storage using Azure Active Directory credentials or other authentication mechanisms supported by Azure. This is a Spark-based component.
All component configurations are classified broadly into the following sections:
​​Basic Information​​
Meta Information
Please Note: Go through the demonstration given below to configure the Azure Blob Reader in the pipeline.
Please Note: Before starting to use the Azure Reader component, please follow the steps below to obtain the Azure credentials from the Azure Portal:
Accessing Azure Blob Storage: Shared Access Signature (SAS), Secret Key, and Principal Secret
This document outlines three methods for accessing Azure Blob Storage: Shared Access Signatures (SAS), Secret Keys, and Principal Secrets.
Understanding Security Levels:
Shared Access Signature (SAS): This is the recommended approach due to its temporary nature and fine-grained control over access permissions. SAS tokens can be revoked, limiting potential damage if compromised.
Secret Key: Secret keys grant full control over your storage account. Use them with caution and only for programmatic access. Consider storing them securely in Azure Key Vault and avoid hardcoding them in scripts.
Principal Secret: This applies to Azure Active Directory (Azure AD) application access. Similar to secret keys, use them cautiously and store them securely (e.g., Azure Key Vault).
1. Shared Access Signature (SAS):
Benefits:
Secure: Temporary and revocable, minimizing risks.
Granular Control: Define specific permissions (read, write, list, etc.) for each SAS token.
Steps to Generate an SAS Token:
Navigate to Azure Portal: Open the Azure portal (https://azure.microsoft.com/en-us/get-started/azure-portal) and log in with your credentials.
Access Blob Storage Account: Locate "Storage accounts" in the left menu and select your storage account.
Configure SAS Settings: Find and click on "Shared access signature" in the settings. Define the permissions, expiry date, and other parameters for your needs.
Generate SAS Token: Click on "Generate SAS and connection string" to create the SAS token.
Copy and Use SAS Token: Copy the generated SAS token. Use this token to securely access your Blob Storage resources in your code.
2. Secret Key:
Use with Caution:
High-Risk: Grants full control over your storage account.
Secure Storage: Store them securely in Azure Key Vault, never hardcode them in scripts.
Steps to Obtain Secret Key:
Navigate to Azure Portal: Open the Azure portal and log in.
Access Blob Storage Account: Locate and select your storage account.
View Secret Keys: Click on "Access keys" to view your storage account keys. Do not store these directly in code. Consider Azure Key Vault for secure storage.
3. Principal Secret (Azure AD Application):
Use for Application Access:
Grants access to your storage account through an Azure AD application.
Secure Storage: Store them securely in Azure Key Vault, never hardcode them in scripts.
Steps to Obtain Principal Secret:
Navigate to Azure AD Portal: Open the Azure AD portal (https://azure.microsoft.com/en-us/get-started/azure-portal) and log in with your credentials.
Access App Registrations: Locate "App registrations" in the left menu.
Select Your Application: Find and click on the application for which you want to obtain the principal secret.
Access Certificates & Secrets: Inside your application, go to "Certificates & secrets" in the settings menu.
Generate New Client Secret (Principal Secret):
Under "Client secrets," click on "New client secret."
Enter a description, select the expiry duration, and click "Add" to generate the new client secret.
Copy the generated client secret immediately as it will be hidden afterward.
Read Using: There are three authentication methods available to connect with Azure in the Azure Blob Reader Component:
Shared Access Signature
Secret Key
Principal Secret
Provide the following details:
Shared Access Signature: This is a URI that grants restricted access rights to Azure Storage resources.
Account Name: Provide the Azure account name.
Container: Provide the container name from where the file is located and which has to be read.
File Type: There are five (5) types of file extensions available:
CSV
JSON
PARQUET
AVRO
XML
Read Directory: This field will be checked by default. If this option is enabled, the component will read data from all the blobs present in the container.
Blob Name: This field will display only if the Read Directory field is disabled. Enter the specific name of the blob whose data has to be read.
Limit: Enter a number to limit the number of records that has to be read by the component.
Column Filter: Enter the column names here. Only the specified columns will be fetched from Azure Blob. In this field, the user needs to fill in the following information:
Source Field: Enter the name of the column from the blob. The user can add multiple columns by clicking on the "Add New Column" option.
Destination Field: Enter the alias name for the source field.
Column Type: Enter the data type of the column.
Upload: This option allows the user to upload a data file in CSV, JSON, or EXCEL format. The column names will be automatically fetched from the uploaded data file and filled out in the Name, Alias, and Column Type fields.
Download Data: This option will download the data filled in the Column Filter field in JSON format.
Delete Data: This option will clear all the information filled in the Column Filter field.
Query: Enter a Spark SQL query in this field. Use inputDf as the table name.
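A short, hypothetical example of the kind of text you might enter in the Query field, where inputDf refers to the data read from the blob; the columns status and order_date are placeholders.

```python
# Hypothetical text for the Query field; status and order_date are placeholder columns.
query = """
SELECT *
FROM inputDf
WHERE status = 'COMPLETED'
  AND order_date >= '2024-01-01'
"""
```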
Provide the following details:
Account Key: Used to authorize access to data in your storage account via Shared Key authorization.
Account Name: Provide the Azure account name.
Container: Provide the container name from where the file is located and which has to be read.
File Type: There are five (5) types of file extensions available:
CSV
JSON
PARQUET
AVRO
XML
Read Directory: This field will be checked by default. If this option is enabled, the component will read data from all the blobs present in the container.
Blob Name: This field will display only if the Read Directory field is disabled. Enter the specific name of the blob whose data has to be read.
Limit: Enter a number to limit the number of records that has to be read by the component.
Column Filter: Enter the column names here. Only the specified columns will be fetched from Azure Blob. In this field, the user needs to fill in the following information:
Source Field: Enter the name of the column from the blob. The user can add multiple columns by clicking on the "Add New Column" option.
Destination Field: Enter the alias name for the source field.
Column Type: Enter the data type of the column.
Upload: This option allows the user to upload a data file in CSV, JSON, or EXCEL format. The column names will be automatically fetched from the uploaded data file and filled out in the Name, Alias, and Column Type fields.
Download Data: This option will download the data filled in the Column Filter field in JSON format.
Delete Data: This option will clear all the information filled in the Column Filter field.
Query: Enter a Spark SQL query in this field. Use inputDf as the table name.
Provide the following details:
Client ID: The unique Application (client) ID assigned to your app by Azure AD when the app was registered.
Tenant ID: A globally unique identifier (GUID) that is different from your organization name or domain.
Client Secret: The password of the service principal.
Account Name: Provide the Azure account name.
File Type: There are five (5) types of file extensions available:
CSV
JSON
PARQUET
AVRO
XML
Read Directory: This field will be checked by default. If this option is enabled, the component will read data from all the blobs present in the container.
Blob Name: This field will display only if the Read Directory field is disabled. Enter the specific name of the blob whose data has to be read.
Limit: Enter a number to limit the number of records that has to be read by the component.
Column Filter: Enter the column names here. Only the specified columns will be fetched from Azure Blob. In this field, the user needs to fill in the following information:
Source Field: Enter the name of the column from the blob. The user can add multiple columns by clicking on the "Add New Column" option.
Destination Field: Enter the alias name for the source field.
Column Type: Enter the data type of the column.
Upload: This option allows the user to upload a data file in CSV, JSON, or EXCEL format. The column names will be automatically fetched from the uploaded data file and filled out in the Name, Alias, and Column Type fields.
Download Data: This option will download the data filled in the Column Filter field in JSON format.
Delete Data: This option will clear all the information filled in the Column Filter field.
Query: Enter a Spark SQL query in this field. Use inputDf as the table name.
Note: The following fields will be displayed after selecting the following file types:
CSV: The Header and Infer Schema fields get displayed when CSV is the selected File Type. Enable the Header option to retrieve the header of the file being read, and enable the Infer Schema option to obtain the true schema of the columns in the CSV file.
JSON: The Multiline and Charset fields get displayed with JSON as the selected File Type.
Multiline: This option handles JSON files that contain records spanning multiple lines. Enabling this ensures the JSON parser reads multiline records correctly.
Charset: Specify the character set used in the JSON file. This defines the character encoding of the JSON file, such as UTF-8 or ISO-8859-1, ensuring correct interpretation of the file content.
PARQUET: No extra field gets displayed with PARQUET as the selected File Type.
AVRO: This File Type provides two drop-down menus.
Compression: Select an option out of the Deflate and Snappy options.
Deflate: A compression algorithm that balances between compression speed and compression ratio, often resulting in smaller file sizes.
Snappy: This compression type is selected by default. A fast compression and decompression algorithm developed by Google, optimized for speed rather than maximum compression ratio.
Compression Level: This field appears if Deflate compression is selected. It provides a drop-down menu with levels ranging from 0 to 9, indicating the compression intensity.
The S3 Writer is designed to write data to an S3 bucket in AWS. It typically authenticates with S3 using AWS credentials, such as an access key ID and secret access key, to gain access to the S3 bucket and its contents.
All component configurations are classified broadly into the following sections:
Basic Information
Meta Information
Check out the steps given in the demonstration to configure the S3 Writer component.
Bucket Name: Enter the S3 Bucket name.
Access Key: Provide the access key shared by AWS to log in.
Secret Key: Provide the secret key shared by AWS to log in.
Table: Mention the Table or object name where the data has to be written in the S3 location.
Region: Provide the S3 region where the Bucket is created.
File Type: Select a file type from the drop-down menu (CSV, JSON, PARQUET, AVRO, ORC are the supported file types).
Save Mode: Select the save mode from the drop-down menu:
Append: It will append the data at the specified S3 location.
Overwrite: It will overwrite the data at the specified S3 location.
Schema File Name: Upload a Spark schema file of the data which has to be written in JSON format.
Column Filter: Enter the column names here. Only the specified columns will be fetched from the data from the previous connected event to the S3 Writer. In this field, the user needs to fill in the following information:
Name: Enter the name of the column which has to be written from the previous event. The user can add multiple columns by clicking on the "Add New Column" option.
Alias: Enter the alias name for the selected column name.
Column Type: Enter the data type of the column.
Upload: This option allows the user to upload a data file in CSV, JSON, or EXCEL format. The column names will be automatically fetched from the uploaded data file and filled out in the Name, Alias, and Column Type fields.
Download Data: This option will download the data filled in the Column Filter field in JSON format.
Delete Data: This option will clear all the information filled in the Column Filter field.
Partition Columns: This feature enables users to partition the data when writing to the S3 bucket. Users can specify multiple columns for partitioning by clicking the "Add Column Name" option. For example, if the data is partitioned by a date column, a separate folder will be created for each unique date in the Amazon S3 bucket. The data storage might look like this:
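The sketch below illustrates the resulting layout; the bucket, path, and column names are placeholders, and the Spark write shown is only an illustration of the partitioning behaviour, not the component's internal code. Writing to s3a:// also assumes the hadoop-aws connector and AWS credentials are configured.

```python
# Sketch of a partitioned write and the resulting S3 layout (placeholder names).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-partition-layout-sketch").getOrCreate()
df = spark.createDataFrame(
    [("2024-01-01", 10.0), ("2024-01-02", 7.5)],
    ["date", "amount"],
)

# Partitioning by "date" creates one folder per unique date value.
# (Assumes the hadoop-aws connector and AWS credentials are configured for s3a://.)
df.write.mode("append").partitionBy("date").parquet("s3a://my-bucket/sales/")

# Resulting object keys (illustrative):
# s3://my-bucket/sales/date=2024-01-01/part-00000-...snappy.parquet
# s3://my-bucket/sales/date=2024-01-02/part-00000-...snappy.parquet
```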
HDFS stands for Hadoop Distributed File System. It is a distributed file system designed to store and manage large data sets in a reliable, fault-tolerant, and scalable way. HDFS is a core component of the Apache Hadoop ecosystem and is used by many big data applications.
This component writes the data in HDFS (Hadoop Distributed File System).
All component configurations are classified broadly into the following sections:
Basic Information
Meta Information
Follow the given steps in the demonstration to configure the HDFS Writer component.
Host IP Address: Enter the host IP address for HDFS.
Port: Enter the Port.
Table: Enter the table name where the data has to be written.
Zone: Enter the Zone for HDFS in which the data has to be written. Zone is a special directory whose contents will be transparently encrypted upon write and transparently decrypted upon read.
File Format: Select a file format in which the data has to be written.
CSV
JSON
PARQUET
AVRO
ORC
Save Mode: Select a Save Mode.
Schema file name: Upload Spark schema file in JSON format.
Partition Columns: Provide a unique Key column name to partition data in Spark.
The Elasticsearch Writer component is designed to write data to an Elasticsearch index. Elasticsearch writers typically authenticate with Elasticsearch using username and password credentials, which grant access to the Elasticsearch cluster and its indexes.
All component configurations are classified broadly into the following sections:
Basic Information
Meta Information
Follow the given steps in the demonstration to configure the ES Writer component.
Please follow the below mentioned steps to configure the meta information of ES writer:
Host IP Address: Enter the host IP address for Elasticsearch.
Port: Enter the port to connect with Elasticsearch.
Index ID: Enter the Index ID of the index where the document has to be written. In Elasticsearch, an index is a collection of documents that share similar characteristics, and each document within an index has a unique identifier known as the index ID. The index ID is a unique string that is automatically generated by Elasticsearch and is used to identify and retrieve a specific document from the index.
Mapping ID: Provide the Mapping ID. In Elasticsearch, a mapping ID is a unique identifier for a mapping definition that defines the schema of the documents in an index. It is used to differentiate between different types of data within an index and to control how Elasticsearch indexes and searches data.
Resource Type: Provide the resource type. In Elasticsearch, a resource type is a way to group related documents together within an index. Resource types are defined at the time of index creation, and they provide a way to logically separate different types of documents that may be stored within the same index.
Username: Enter the username for Elasticsearch.
Password: Enter the password for Elasticsearch.
Schema File Name: Upload the Spark schema file in JSON format.
Save Mode: Select the Save mode from the drop down.
Append
Selected Columns: The user can select specific columns, provide an alias name, and select the desired data type for each selected column.
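For orientation only, writing a Spark DataFrame to an Elasticsearch index with the open-source elasticsearch-hadoop (elasticsearch-spark) connector looks roughly like the sketch below; the host, credentials, and index name are placeholders, and this is not necessarily what the component runs internally.

```python
# Hedged sketch using the elasticsearch-spark connector (requires the connector JAR
# on the Spark classpath). All values in angle brackets are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("es-writer-sketch").getOrCreate()
df = spark.createDataFrame([(1, "alpha"), (2, "beta")], ["id", "name"])

(df.write
   .format("org.elasticsearch.spark.sql")
   .option("es.nodes", "<host_ip>")
   .option("es.port", "<port>")
   .option("es.net.http.auth.user", "<username>")
   .option("es.net.http.auth.pass", "<password>")
   .option("es.mapping.id", "id")   # analogous to the Mapping ID field above
   .mode("append")
   .save("<index_name>"))
```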
The DB Writer is a Spark-based writer component that gives you the capability to write data to multiple database sources.
All component configurations are classified broadly into the following sections:
Basic Information
Meta Information
Please check out the given demonstration to configure the component.
Please Note:
The ClickHouse driver in the Spark components will use the HTTP Port and not the TCP port.
It is always recommended to create the table before activating the pipeline, as an RDBMS enforces a strict schema and a missing table can result in errors.
When using the Redshift driver with a Boolean datatype in JDBC, the table is not created unless you pass the create table query. Alternatively, you can use a column filter to convert a Boolean value to a String for the desired operation.
Database name: Enter the Database name.
Table name: Provide a table name where the data has to be written.
Enable SSL: Check this box to enable SSL for this component. The Enable SSL feature appears only for three (3) drivers: MongoDB, PostgreSQL, and ClickHouse.
Certificate Folder: This option appears when the Enable SSL field is checked. The user has to select the certificate folder from the drop-down; it contains the files that have been uploaded to the Admin Settings. Please refer to the images given below for reference.
Schema File Name: Upload a Spark schema file of the data which has to be written in JSON format.
Save Mode: Select the save mode from the drop-down menu:
Append: It will append the data in the table.
Overwrite: It will overwrite the data in the table.
Upsert: This operation allows the users to insert a new record or update existing data into a table. For configuring this, we need to provide the Composite Key.
Sort Column: This field will appear only when Upsert is selected as Save mode. If there are multiple records with the same composite key but different values in the batch, the system identifies the record with the latest value based on the Sort column. The Sort column defines the ordering of records, and the record with the highest value in the sort column is considered the latest.
Column Filter: Enter the column names here. Only the specified columns will be fetched from the data from the previous connected event to the DB Writer. In this field, the user needs to fill in the following information:
Name: Enter the name of the column which has to be written from the previous event. The user can add multiple columns by clicking on the "Add New Column" option.
Alias: Enter the alias name for the selected column name.
Column Type: Enter the data type of the column.
Upload: This option allows the user to upload a data file in CSV, JSON, or EXCEL format. The column names will be automatically fetched from the uploaded data file and filled out in the Name, Alias, and Column Type fields.
Download Data: This option will download the data filled in the Column Filter field in JSON format.
Delete Data: This option will clear all the information filled in the Column Filter field.
Query: In this field, you can write a DDL statement for creating the table in the database where the in-event data has to be written. For example, see the sketch given below:
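A hypothetical CREATE TABLE statement you might paste into the Query field (PostgreSQL-style syntax; the table and column names are placeholders chosen for illustration and stored here as a string only for clarity).

```python
# Hypothetical DDL for the Query field (PostgreSQL syntax; placeholder names).
create_table_ddl = """
CREATE TABLE IF NOT EXISTS sales_summary (
    customer_id  INTEGER,
    country      VARCHAR(64),
    total_amount DOUBLE PRECISION,
    updated_at   TIMESTAMP
);
"""
```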
Please Note:
In DB Writer component, the Save Mode for ClickHouse driver is as follows:
Append: It will create a table in ClickHouse database with a table engine StripeLog.
Upsert: It will create a table in ClickHouse database with a table engine ReplacingMergeTree.
If the user is using Append as the Save mode in ClickHouse Writer (Docker component) and Data Sync (ClickHouse driver), it will create a table in the ClickHouse database with a table engine Memory.
Currently, the Sort column field is only available for the following drivers in the DB Writer: MSSQL, PostgreSQL, Oracle, Snowflake, and ClickHouse.
Upload JSON (*): Upload the credential file downloaded from Google BigQuery using the Upload icon. You may need to download a JSON file from BigQuery to upload it here.
Click the Save Component in Storage icon after completing the configuration to save the component.
We have given two different writers for writing data to MongoDB. The available deployment types are Spark and Docker.
These are the real-time/streaming components that ingest data or monitor changes in data objects from different sources into the pipeline.
These components utilize machine learning algorithms and techniques to analyze and model the data.
A Sandbox writer is used to write the data within a configured sandbox environment.
All component configurations are classified broadly into the following sections:
​​Basic Information​​
Meta Information
Check out the given Walk-through on the Sandbox Writer component.
Please follow the below mentioned steps to configure the Meta Information Tab of Sandbox Writer:
Storage Type: The user will find two options here:
Network: This option will be selected by default. In this mode, a folder corresponding to the Sandbox file name provided by the user will be created at the Sandbox location. Data will be written into part files within this folder, with each part file containing data based on the specified batch size.
Platform: If the user selects the "Platform" option, a single file containing the entire dataset will be created at the Sandbox location, using the Sandbox file name provided by the user.
Sandbox File: Enter the file name.
File Type: Select the file type in which the data has to be written. There are 4 file types supported here:
CSV
JSON
Text
ORC
Save Mode: Select the save mode from the drop-down menu:
Append: It will append the data at the Sandbox location.
Overwrite: It will overwrite the data at the Sandbox location.
Schema File Name: Upload a Spark schema file of the data which has to be written in JSON format.
Column Filter: Enter the column names here. Only the specified columns will be fetched from the data from the previous connected event to the Sandbox Writer. In this field, the user needs to fill in the following information:
Name: Enter the name of the column which has to be written from the previous event. The user can add multiple columns by clicking on the "Add New Column" option.
Alias: Enter the alias name for the selected column name. The column name given here will be written in the Sandbox file.
Column Type: Enter the data type of the column.
Upload: This option allows the user to upload a data file in CSV, JSON, or EXCEL format. The column names will be automatically fetched from the uploaded data file and filled out in the Name, Alias, and Column Type fields.
Download Data: This option will download the data filled in the Column Filter field in JSON format.
Delete Data: This option will clear all the information filled in the Column Filter field.
Along with the Spark driver in the RDBMS Writer, we have a Docker-based writer that supports the TCP port.
ClickHouse writer component is designed to write or store data in a ClickHouse database. ClickHouse writers typically authenticate with ClickHouse using a username and password or other authentication mechanisms supported by ClickHouse.
All component configurations are classified broadly into the following sections:
​​Basic Information​​
Meta Information
Please go through the given walk-through to understand the configuration steps for the ClickHouse Writer pipeline component.
Host IP Address: Enter the Host IP Address.
Port: Enter the port for the given IP Address.
User name: Enter the user name for the provided database.
Password: Enter the password for the provided database.
Database name: Enter the Database name.
Table name: Provide a single table name or multiple table names. If multiple table names have to be given, enter the table names separated by commas (,).
Settings: An option that allows you to customize various configuration settings for a specific query.
Enable SSL: Enabling SSL with ClickHouse writer involves configuring the writer to use the Secure Sockets Layer (SSL) protocol for secure communication between the writer and the ClickHouse server.
Save Mode: Select the Save mode from the drop down.
Column Filter: The Column Filter section in the Meta Information tab lets the user work with specific columns instead of the complete table. Select the columns you want to use; to rename a column, enter the new name in the Alias field, otherwise keep the alias the same as the column name, and then select a Column Type from the drop-down menu.
Use Download Data and Upload File options to select the desired columns.
Upload File: The user can upload the existing system files (CSV, JSON) using the Upload File icon (file size must be less than 2 MB).
Download Data: Users can download the schema structure in JSON format by using the Download Data icon.
Please Note:
ClickHouse Writer component supports only TCP port.
If the user is using Append as the Save mode in ClickHouse Writer (Docker component) and Data Sync (ClickHouse driver), it will create a table in the ClickHouse database with a table engine Memory.
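Since this writer uses only the TCP (native) port, the hand-written equivalent with the open-source clickhouse-driver package looks roughly like the sketch below; the host, credentials, and table names are placeholders, and this illustrates the protocol rather than the component's internals.

```python
# Hedged sketch: inserting rows over the ClickHouse native TCP port (default 9000)
# with the clickhouse-driver package. All bracketed values are placeholders.
from clickhouse_driver import Client

client = Client(
    host="<host_ip>",
    port=9000,
    user="<username>",
    password="<password>",
    database="<database_name>",
)
client.execute(
    "INSERT INTO <table_name> (id, name) VALUES",
    [(1, "alpha"), (2, "beta")],
)
```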
The Video Writer component is designed to write .mp4 format video to an SFTP location by combining the frames consumed using the Video Consumer component.
All component configurations are classified broadly into the following sections:
​​Basic Information​​
Meta Information
Please follow the given demonstration to configure the Video Writer component.
Please Note:
The Pipeline Testing Suite and Data Metrics options on the Monitoring Pipeline page are not available for this component.
The Video Writer component supports only the .mp4 file format. It writes video frame by frame to SFTP.
Drag & drop the Video Stream Consumer component to the Workflow Editor.
Click the dragged Video Stream Consumer component to open the component properties tabs.
It is the default tab to open for the component.
Invocation Type: Select an Invocation type from the drop-down menu to confirm the running mode of the component. Select ‘Real-Time’ or ‘Batch’ from the drop-down menu.
Deployment Type: It displays the deployment type for the component. This field comes pre-selected.
Batch Size (min 10): Provide the maximum number of records to be processed in one execution cycle (Min limit for this field is 10).
Failover Event: Select a failover Event from the drop-down menu.
Container Image Version: It displays the image version for the docker container. This field comes pre-selected.
Description: Description of the component. It is optional.
Please Note: If the selected Invocation Type option is Batch, then the Grace Period (in sec)* field appears; provide the grace period after which the component goes down gracefully.
Selecting Real-time as the Invocation Type option will display the Intelligent Scaling option.
Select the Meta Information tab and provide the mandatory fields to configure the dragged Video Stream Consumer component.
Host IP Address (*)- Provide IP or URL
The input in Host IP Address in the Meta Information tab changes based on the selection of the Channel. There are two options available:
Live: This allows writing the data to the desired location when live data is coming continuously.
Media File: It reads only stored video files and writes them to the desired SFTP location.
Username (*)- Provide username
Port (*)- Provide the Port number
Authentication- Select any one authentication option out of Password or PEM PPK File
Stream(*)- The supported streaming methods are Live and Media files.
Partition Time(*)- It defines the length of video the component will consume at once in seconds. This field will appear only if the LIVE option is selected in Stream field.
Writer Path (*)- Provide the desired path in SFTP location where the video has to be written.
File Name(*)- Provide a file name with the .mp4 format (e.g., sample_filename.mp4).
Frame Rate – Provide the rate of frames to be consumed.
Please Note: The fields for the Meta Information tab change based on the selection of the Authentication option.
When the Password authentication option is selected, a Password field is added to the Meta Information.
While choosing the PEM/PPK File authentication option, the user needs to select a file using the Choose File option.
Click the Save Component in Storage icon for the Video Writer component.
The message appears to notify that the component properties are saved.
The Video Writer component gets configured to pass the data in the Pipeline Workflow.
The PyMongo Writer component is designed to write the data to a MongoDB collection. It is a Docker-based component.
All component configurations are classified broadly into the following sections:
​​Basic Information​​
Meta Information
​Connection Validation​
Please follow the demonstration to configure the component.
The PyMongo Writer writes the data to the Mongo Database.
Drag & drop the PyMongo Writer component to the Pipeline Workflow Editor.
Click the dragged PyMongo Writer component to open the component properties tabs below.
It is the default tab to open for the PyMongo Writer while configuring the component.
Select an Invocation type from the drop-down menu to confirm the running mode of the component. Select ‘Real-Time’ or ‘Batch’ from the drop-down menu.
Deployment Type: It displays the deployment type for the component. This field comes preselected.
Container Image Version: It displays the image version for the docker container. This field comes pre-selected.
Failover Event: Select a failover Event from the drop-down menu.
Batch Size: Provide the maximum number of records to be processed in one execution cycle.
Open the Meta Information tab and configure all the connection-specific details for the PyMongo Writer.
Connection Type: Select either of the options out of ‘Standard’, ‘SRV’, and ‘Connection String’ connection types.
Port number(*): Provide the Port number (It appears only with the ‘Standard’ connection type).
Host IP Address(*): IP address of the host.
Username(*): Provide username.
Password(*): Provide a valid password to access the MongoDB.
Database Name(*): Provide the name of the database where you wish to write data.
Collection Name (*): Provide the name of the collection.
Save Mode: Select an option from the drop-down menu (the supported options are Upsert and Append).
Enable SSL: Check this box to enable the SSL feature for the PyMongo Writer.
Please Note: Credentials will be different if this option is enabled.
Composite Keys (*): This field appears only when the selected save mode is ‘Upsert’. The user can enter multiple composite keys separated by commas on which the 'Upsert' operation has to be done.
Additional Parameters: Provide details of the additional parameters.
Connection String (*): Provide a connection string.
The Meta Information fields vary based on the selected Connection Type option.
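For reference, the connection types correspond to the standard MongoDB URI forms; the hedged pymongo sketch below uses placeholder values and also illustrates an upsert keyed on a composite key, analogous to the Upsert save mode and Composite Keys field described above.

```python
# Hedged sketch with the pymongo package; all bracketed values are placeholders.
from pymongo import MongoClient

# SRV connection string form; a Standard connection would use
# "mongodb://<host_ip>:<port>/" instead.
client = MongoClient("mongodb+srv://<username>:<password>@<host>/<database_name>")
collection = client["<database_name>"]["<collection_name>"]

# Upsert on a composite key (analogous to the Composite Keys field):
collection.update_one(
    {"customer_id": 1, "country": "US"},   # composite key fields
    {"$set": {"amount": 120.0}},
    upsert=True,
)
```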
The user can select specific columns to change the column name or data type while writing to the collection. Type the name of the column to be modified in the Name field. To rename the column, enter the new name in the Alias field; otherwise, keep the alias the same as the column name. Then select the Column Type from the drop-down menu to set the data type into which that particular column should be converted. Once this is done, the selected columns will be written with the given column names and data types.
or
Use the Download Data and Upload File options to select the desired columns.
Upload File: The user can upload the existing system files (CSV, JSON) using the Upload File icon (file size must be less than 2 MB).
Download Data (Schema): Users can download the schema structure in JSON format by using the Download Data icon.
Click the Save Component in Storage icon for the PyMongo Writer component.
A message appears to notify the successful update of the component.
Click on the Activate Pipeline icon.
The pipeline will be activated and the PyMongo writer component will write the in-event data to the given MongoDB collection.
Azure Writer component is designed to write or store data in Microsoft Azure's storage services, such as Azure Blob Storage. Azure Writers typically authenticate with Azure using Azure Active Directory credentials or other authentication mechanisms supported by Azure.
All component configurations are classified broadly into the following sections:
​​Basic Information​​
Meta Information
​Connection Validation​
Please go through the demonstration to configure Azure Writer in the pipeline.
Please Note: Before starting to use the Azure Writer component, please follow the steps below to obtain the Azure credentials from the Azure Portal:
Accessing Azure Blob Storage: Shared Access Signature (SAS), Secret Key, and Principal Secret
This document outlines three methods for accessing Azure Blob Storage: Shared Access Signatures (SAS), Secret Keys, and Principal Secrets.
Understanding Security Levels:
Shared Access Signature (SAS): This approach is recommended due to its temporary nature and fine-grained control over access permissions. SAS tokens can be revoked, limiting potential damage if compromised.
Secret Key: Secret keys grant full control over your storage account. Use them with caution and only for programmatic access. Consider storing them securely in Azure Key Vault and avoid hardcoding them in scripts.
Principal Secret: This applies to Azure Active Directory (Azure AD) application access. Similar to secret keys, use them cautiously and store them securely (e.g., Azure Key Vault).
1. Shared Access Signature (SAS):
Benefits:
Secure: Temporary and revocable, minimizing risks.
Granular Control: Define specific permissions (read, write, list, etc.) for each SAS token.
Steps to Generate an SAS Token:
Navigate to Azure Portal: Open the Azure portal (https://azure.microsoft.com/en-us/get-started/azure-portal) and log in with your credentials.
Access Blob Storage Account: Locate Storage accounts from the left menu and select your storage account.
Configure SAS Settings: Find and click on "Shared access signature" in the settings. Define the permissions, expiry date, and other parameters for your needs.
Generate SAS Token: Click on "Generate SAS and connection string" to create the SAS token.
Copy and Use SAS Token: Copy the generated SAS token. Use this token to access your Blob Storage resources in your code securely.
2. Secret Key:
Use with Caution:
High-Risk: Grants full control over your storage account.
Secure Storage: Store them securely in Azure Key Vault, never hardcode them in scripts.
Steps to Obtain Secret Key:
Navigate to Azure Portal: Open the Azure portal and log in.
Access Blob Storage Account: Locate and select your storage account.
View Secret Keys: Click the Access keys to view your storage account keys. Do not store these directly in code. Consider Azure Key Vault for secure storage.
3. Principal Secret (Azure AD Application):
Use for Application Access:
Grants access to your storage account through an Azure AD application.
Secure Storage: Store them securely in Azure Key Vault, never hardcode them in scripts.
Steps to Obtain Principal Secret:
Navigate to Azure AD Portal: Open the Azure AD portal (https://azure.microsoft.com/en-us/get-started/azure-portal) and log in with your credentials.
Access App Registrations: Locate "App registrations" in the left menu.
Select Your Application: Find and click on the application for which you want to obtain the principal secret.
Access Certificates & Secrets: Go to Certificates & secrets in the Settings menu inside your application.
Generate New Client Secret (Principal Secret):
Click on the New client secret option under the Client secrets section.
Enter a description, select the expiry duration, and click the Add option to generate the new client secret.
Copy the generated client secret immediately as it will be hidden afterward.
Write Using: There are three authentication methods available to connect with Azure in the Azure Writer Component:
Shared Access Signature
Secret Key
Principal Secret
Provide the following details:
Shared Access Signature: This is a URI that grants restricted access rights to Azure Storage resources.
Account Name: Provide the Azure account name.
Container: Provide the container name from where the blob is located. A container is a logical unit of storage in Azure Blob Storage that can hold blobs. It is similar to a directory or folder in a file system, and it can be used to organize and manage blobs.
Blob Name: Enter the blob name. A blob is a type of object storage used to store unstructured data, such as text or binary data, like images or videos.
File Format: Four (4) file formats are available. Select the file format in which the data has to be written:
CSV
JSON
PARQUET
AVRO
Save Mode: Select the save mode from the drop-down menu:
Append: It will append the data in the blob.
Overwrite: It will overwrite the data in the blob.
Schema File Name: Upload a Spark schema file of the data that has to be written in JSON format.
Column Filter: Enter the column names here. Only the specified columns will be fetched from the data from the previous connected event to the Azure Writer. In this field, the user needs to fill in the following information:
Name: Enter the name of the column that has to be written from the previous event. The user can add multiple columns by clicking on the "Add New Column" option.
Alias: Enter the alias name for the selected column name.
Column Type: Enter the data type of the column.
Upload: This option allows the user to upload a data file in CSV, JSON, or EXCEL format. The column names will be automatically fetched from the uploaded data file and filled out in the Name, Alias, and Column Type fields.
Download Data: This option will download the data filled in the Column Filter field in the JSON format.
Delete Data: This option will clear all the information filled in the Column Filter field.
Partition Column: This feature enables users to partition the data when writing to Azure Blob. Users can specify multiple columns for partitioning by clicking the "Add Column Name" option.
Provide the following details:
Account Key: Enter the Azure account key. In Azure, an account key is a security credential that is used to authenticate access to storage resources, such as blobs, files, queues, or tables, in an Azure storage account.
Account Name: Provide the Azure account name.
Container: Provide the container name from where the blob is located. A container is a logical unit of storage in Azure Blob Storage that can hold blobs. It is similar to a directory or folder in a file system, and it can be used to organize and manage blobs.
Blob Name: Enter the Blob name. A blob is a type of object storage that is used to store unstructured data, such as text or binary data, like images or videos.
File Format: There are four (4) types of file extensions available:
CSV
JSON
PARQUET
AVRO
Save Mode: Select the save mode from the drop-down menu:
Append: It will append the data in the blob.
Overwrite: It will overwrite the data in the blob.
Schema File Name: Upload a Spark schema file of the data which has to be written in JSON format.
Column Filter: Enter the column names here. Only the specified columns will be fetched from the data from the previous connected event to the Azure Writer. In this field, the user needs to fill in the following information:
Name: Enter the name of the column which has to be written from the previous event. The user can add multiple columns by clicking on the "Add New Column" option.
Alias: Enter the alias name for the selected column name. The column name given here will be written in the container.
Column Type: Enter the data type of the column.
Upload: This option allows users to upload a data file in CSV, JSON, or EXCEL format. The column names will be automatically fetched from the uploaded data file and filled out in the Name, Alias, and Column Type fields.
Download Data: This option will download the data filled in the Column Filter field in the JSON format.
Delete Data: This option will clear all the information filled in the Column Filter field.
Partition Column: This feature enables users to partition the data when writing to Azure Blob. Users can specify multiple columns for partitioning by clicking the "Add Column Name" option.
Provide the following details:
Client ID: Provide Azure Client ID. The client ID is the unique Application (client) ID assigned to your app by Azure AD when the app was registered.
Tenant ID: Provide the Azure Tenant ID. Tenant ID (also known as Directory ID) is a unique identifier that is assigned to an Azure AD tenant and represents an organization or a developer account. It is used to identify the organization or developer account that the application is associated with.
Client Secret: Enter the Azure Client Secret. Client Secret (also known as Application Secret or App Secret) is a secure password or key that is used to authenticate an application to Azure AD.
Account Name: Provide the Azure account name.
Container: Provide the container name from where the blob is located. A container is a logical unit of storage in Azure Blob Storage that can hold blobs. It is similar to a directory or folder in a file system, and it can be used to organize and manage blobs.
Blob Name: Enter the Blob name. A blob is a type of object storage that is used to store unstructured data, such as text or binary data, like images or videos.
File Format: There are four (4) types of file extensions available under it:
CSV
JSON
PARQUET
AVRO
Save Mode: Select the save mode from the drop-down menu:
Append: It will append the data in the blob.
Overwrite: It will overwrite the data in the blob.
Schema File Name: Upload a Spark schema file of the data which has to be written in JSON format.
Column Filter: Enter the column names here. Only the specified columns will be fetched from the data from the previous connected event to the Azure Writer. In this field, the user needs to fill in the following information:
Name: Enter the column name that must be written from the previous event. The user can add multiple columns by clicking the Add New Column option.
Alias: Enter the alias name for the selected column name. The column name given here will be written in the container.
Column Type: Enter the data type of the column.
Upload: This option allows the user to upload a data file in CSV, JSON, or EXCEL format. The column names will be automatically fetched from the uploaded data file and filled out in the Name, Alias, and Column Type fields.
Download Data: This option will download the data filled in the Column Filter field in JSON format.
Delete Data: This option will clear all the information filled in the Column Filter field.
Partition Column: This feature enables users to partition the data when writing to Azure Blob. Users can specify multiple columns for partitioning by clicking the "Add Column Name" option.
The GCS Monitor continuously monitors a specific folder. When a new file is detected in the monitored folder, the GCS Monitor reads the file's name and triggers an event. Subsequently, the GCS Monitor copies the detected file to a designated location as defined and then removes it from the monitored folder. This process is repeated for each file that is found.
It is the default tab to open for the component while configuring it.
Invocation Type: Select an invocation mode as Real-Time.
Deployment Type: It displays the deployment type for the reader component. This field comes pre-selected.
Container Image Version: It displays the image version for the docker container. This field comes pre-selected.
Failover Event: Select a failover Event from the drop-down menu.
Batch Size (min 10): Provide the maximum number of records to be processed in one execution cycle (Min limit for this field is 10).
Bucket Name: Enter the source bucket name.
Directory Path: Fill in the monitor folder path using a forward-slash (/). For example, "monitor/".
Copy Directory Path: Specify the copy folder name where you want to copy the uploaded file. For example, "monitor_copy/".
Choose File: Upload a Service Account Key(s) file.
File Name: After the Service Account Key file is uploaded, the file name is auto-generated based on the uploaded file.
Copy Bucket Name: Fill in the destination bucket name where you need to copy the files.
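Conceptually, the copy-then-remove behaviour described above corresponds to the following google-cloud-storage operations; the bucket, folder, and file names are placeholders, and this is an illustration rather than the component's actual code.

```python
# Hedged sketch of the monitor's copy-then-remove behaviour using google-cloud-storage.
# All names are placeholders.
from google.cloud import storage

client = storage.Client.from_service_account_json("<service_account_key>.json")
source_bucket = client.bucket("<bucket_name>")
copy_bucket = client.bucket("<copy_bucket_name>")

blob = source_bucket.blob("monitor/<detected_file>.csv")
source_bucket.copy_blob(blob, copy_bucket, "monitor_copy/<detected_file>.csv")
blob.delete()  # remove the file from the monitored folder after copying
```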
The AutoML Runner is designed to automate the entire workflow of creating, training, and deploying machine learning models. It seamlessly integrates with the DS Lab module and allows for the importation of models into the pipeline.
All component configurations are classified broadly into the following sections:
Basic Information
Meta Information
Drag and drop the Auto ML Runner component to the Workflow Editor.
The Auto ML Runner requires input data from an Event and sends the processed data to another Event. Create two Events and drag them on the Workspace.
Connect the input and output events with the Auto ML Runner component as displayed below.
The data in the input event can come from any Ingestion or Reader component, a script from the DS Lab module, or a shared Event.
Click the Auto ML Runner component to get the component properties tabs below.
It is the default tab to open for the component.
Select an Invocation type from the drop-down menu to confirm the running mode of the component. Select either the Real-Time or the Batch option from the drop-down menu.
Please Note: If the selected Invocation Type option is Batch, the Grace Period (in sec)* field appears, allowing you to provide a grace period for the component to shut down gracefully after that time.
Deployment Type: It displays the deployment type for the component. This field comes pre-selected.
Container Image Version: It displays the image version for the docker container. This field comes pre-selected.
Failover Event: Select a failover Event from the drop-down menu.
Batch Size (min 10): Provide the maximum number of records to be processed in one execution cycle (Min limit for this field is 10).
Project Name: Name of the project where you have created your model in DS Lab Module.
Model Name: Name of the model saved in the project in the DS Lab module.
A success notification message appears when the component gets saved.
The Auto ML Runner component reads the data coming to the input event, runs the model and gives the output data with predicted columns to the output event.
The DSL (Data Science Lab) Runner is utilized to manage and execute data science experiments that have been created within the DS Lab module and imported into the pipeline.
All component configurations are classified broadly into the following sections:
Meta Information
Drag the DS Lab Runner component to the Pipeline Workflow canvas.
The DS Lab Model runner requires input data from an Event and sends the processed data to another Event. So, create two events and drag them onto the Workspace.
Connect the input and output events with the DS Lab Runner component as displayed below.
The data in the input event can come from any Ingestion or Reader component, a script from the DS Lab module, or a shared Event.
Click the DS Lab Model runner component to get the component properties tabs below.
It is the default tab to open for the component.
Select an Invocation type from the drop-down menu to confirm the running mode of the component. Select the Real-Time or Batch option from the drop-down menu.
Please Note: If the selected Invocation Type option is Batch, the Grace Period (in sec)* field appears, allowing you to provide a grace period for the component to shut down gracefully after that time.
Deployment Type: It displays the deployment type for the component. This field comes pre-selected.
Batch Size (min 10): Provide the maximum number of records to be processed in one execution cycle (Min limit for this field is 10).
Failover Event: Select a failover Event from the drop-down menu.
Container Image Version: It displays the image version for the docker container. This field comes pre-selected.
Description: Description of the component. It is optional.
Please Note: The DS Lab Runner contains two execution types in its Meta Information tab.
Please follow the demonstration to use the DS Lab Runner as a Model Runner.
Please follow the below steps to configure the Meta Information when the Model Runner is selected as execution type:
Project Name: Name of the project where you have created your model in DS Lab Module.
Model Name: Name of the saved model in Project under the DS Lab module.
Please follow the demonstration to configure the component when Script Runner is selected as the Execution Type.
Please follow the below-given steps to configure the Meta Information when the Script Runner is selected as Execution Type:
Function Input Type: Select the input type from the drop-down. There are two options in this field:
DataFrame
List of dictionary
Project Name: Provide the name of the Project that contains a model in the DS Lab Module.
Script Name: Select the script that has been exported from the notebook in the DS Lab module. The script written in the DS Lab module should be inside a function.
External Library: If any external libraries are used in the script, mention them here. Multiple libraries can be listed by separating their names with commas (,).
Start Function: Select the function name in which the script has been written.
Input Data: If the function takes any parameters, provide each parameter name as the Key and its value as the Value in this field.
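As a rough illustration of how these fields relate, a script exported from a DS Lab notebook is expected to be wrapped in a function; the function name goes in the Start Function field and its parameters are supplied through Input Data as key-value pairs. The function and parameter names below are hypothetical:

```python
# Hypothetical example of a DS Lab script wrapped in a function.
# "transform_sales" would be selected in the Start Function field, and the
# parameter "threshold" would be supplied via Input Data (key: threshold, value: 100).
import pandas as pd

def transform_sales(df, threshold=100):
    """Filter the incoming data and add a derived column."""
    df = pd.DataFrame(df) if not isinstance(df, pd.DataFrame) else df
    filtered = df[df["amount"] > threshold].copy()
    filtered["amount_with_tax"] = filtered["amount"] * 1.18
    return filtered
```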
A success notification message appears when the component gets saved.
The DS Lab Runner component reads the data coming to the input event, runs the model, and gives the output data with predicted columns to the output event.
Sqoop Executer is a tool designed to efficiently transfer data between Hadoop (Hive/HDFS) and structured data stores such as relational databases (e.g., MySQL, Oracle, SQL Server).
All component configurations are classified broadly into the following sections:
Meta Information
It is the default tab to open for the component while configuring it.
Invocation Type: Select an invocation mode out of ‘Real-Time’ or ‘Batch’ using the drop-down menu.
Deployment Type: It displays the deployment type for the reader component. This field comes pre-selected.
Container Image Version: It displays the image version for the docker container. This field comes pre-selected.
Failover Event: Select a failover Event from the drop-down menu.
Batch Size (min 10): Provide the maximum number of records to be processed in one execution cycle (Min limit for this field is 10).
Username: Enter the username for connecting to a relational database.
Host: Provide a host or IP address of the machine where your relational database server is running.
Port: Provide the port number (the default value for this field is 22).
Authentication: Select an authentication type from the drop-down:
Password: Enter the password.
PEM/PPK File: Choose a file and provide the file name if this authentication option is selected.
Command: Enter the relevant Sqoop command. In Apache Sqoop, a command is a specific action or operation that you perform using the Sqoop tool. Sqoop provides a set of commands to facilitate the transfer of data between Hadoop (or more generally, a Hadoop ecosystem component) and a relational database. These commands are used in Sqoop command-line operations to interact with databases, import data, export data, and perform various data transfer tasks.
Some of the common Sqoop commands include:
Import command: This command is used to import data from a relational database into Hadoop. You can specify source and target tables, database connection details, and various import options.
Export Command: This command is used to export data from Hadoop to a relational database. You can specify source and target tables, database connection details, and export options.
Eval Command: This command allows you to evaluate SQL queries and expressions without importing or exporting data. It's useful for testing SQL queries before running import/export commands.
List Databases Command: This command lists the available databases on the source database server.
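For illustration only, a typical import command that could be supplied in the Command field might look like the sketch below (the JDBC connection string, credentials, table, and target directory are placeholders; in practice the Sqoop Executer runs the command on the configured remote host):

```python
# Illustrative only: an example Sqoop import command. Connection string,
# credentials, table, and target directory are placeholders.
import subprocess

sqoop_import = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://db-host:3306/salesdb",
    "--username", "db_user",
    "--password", "db_password",
    "--table", "orders",
    "--target-dir", "/user/hadoop/orders",
    "--num-mappers", "1",
]

# Running it locally here is purely for illustration; the component executes
# the configured command over SSH on the provided host.
subprocess.run(sqoop_import, check=True)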
Click the Save Component in Storage icon.
Script: The exported script appears under this space. For more information about exporting the script from the DS Lab module, please refer to the DS Lab module documentation.
Click the Save Component in Storage icon.
The role of data producers is to ensure a continuous flow of data into the pipeline, providing the necessary raw material for subsequent processing and analysis.
The SFTP (Secure File Transfer Protocol) Monitor is designed to monitor, manage, and keep track of file transfers over SFTP servers.
All component configurations are classified broadly into the following sections:
​​Basic Information​​
Meta Information
Drag and drop the SFTP Monitor component, available under the Consumer section of the System components, to the Workflow Editor.
Click the dragged ingestion component to get the component properties tabs.
The Basic Information Tab is the default tab for the component.
Select the Invocation Type (at present only the Real-Time option is provided).
Deployment Type: It comes preselected based on the component.
Container Image Version: It comes preselected based on the component.
Failover Event: Select a failover event from the drop-down menu.
Batch Size (min 10): Provide the maximum number of records to be processed in one execution cycle (Min limit for this field is 10).
Configure the Meta Information tab for the dragged SFTP Monitor component.
Host: Provide the SFTP host IP address or URL.
Username: Provide the username if authentication is required.
Port: Provide the Port number
Authentication: Select an authentication option using the drop-down list.
Password: Provide a password to authenticate SFTP Monitor.
PEM/PPK File: Choose a file to authenticate the SFTP Monitor component. The user needs to upload a file if this authentication option has been selected.
Directory Path: Fill the monitor folder path using forward-slash (/). E.g., /home/monitor
Copy Directory Path: Fill in the copy folder name where you want to copy the uploaded file. E.g., /home/monitor_copy
Please Note: Do not use a nested directory structure in the Directory Path and Copy Directory Path fields; otherwise, the component will not behave as expected.
For example, do not configure the paths as follows: dirpath: home/monitor/data and copy-dir: home/monitor/data/copy_data
Channel: Select a channel option from the drop-down menu (the supported channel is SFTP).
Click the Save Component in Storage icon to save configured details of the SFTP Monitor component.
A notification message appears to confirm the same.
Please Note:
a. The SFTP Monitor component monitors the file coming to the monitored path and copies the file in the Copy Path location for SFTP Reader to read.
b. The SFTP Monitor component requires an Event to send output.​
c. The SFTP Monitor sends the file name to the out Event along with the file size, last modified time, and ingestion time (refer to the below image).
d. Only one SFTP monitor will read and move the file if multiple monitors are set up to monitor the same file path at the same time.
MQTT (Message Queuing Telemetry Transport) is a lightweight, publish-subscribe, machine-to-machine network protocol for message queue/message queuing services. It is designed for connections with remote locations that have devices with resource constraints or limited network bandwidth, such as in the Internet of Things.
All component configurations are classified broadly into the following sections:
​​Basic Information​​
Meta Information
​Connection Validation​
Follow the given demonstration to configure the MQTT component.
Host IP Address: Provide the IP Address of MQTT broker.
Username: Enter the username.
Port: Enter the port for the given IP address.
Authenticator: There are 2 options in it, select any one to authenticate.
Password: Enter the password to authenticate.
PEM/PPK File: Upload the PEM/PPK File to authenticate.
Quality of Service (QoS): Enter a value of 0, 1, or 2 (see the sketch after these fields). The Quality of Service (QoS) level is an agreement between the sender of a message and the receiver of a message that defines the guarantee of delivery for a specific message. There are 3 QoS levels in MQTT:
At most once (0): The minimal QoS level is zero. This service level guarantees a best-effort delivery. There is no guarantee of delivery. The recipient does not acknowledge receipt of the message and the message is not stored and re-transmitted by the sender.
At least once (1): QoS level 1 guarantees that a message is delivered at least one time to the receiver. The sender stores the message until it gets a PUBACK packet from the receiver that acknowledges receipt of the message.
Exactly once (2): QoS 2 is the highest level of service in MQTT. This level guarantees that each message is received only once by the intended recipients. QoS 2 is the safest and slowest quality of service level. The guarantee is provided by at least two request/response flows (a four-part handshake) between the sender and the receiver. The sender and receiver use the packet identifier of the original PUBLISH message to coordinate delivery of the message.
MQTT Topic: Enter the name of the MQTT topic to which the messages are published and from which they will be consumed.
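As a rough sketch of how the QoS level affects a subscription, the paho-mqtt client (assumed here purely for illustration; the broker address, credentials, and topic are placeholders) exposes the same 0/1/2 levels:

```python
# Minimal paho-mqtt subscriber sketch illustrating the QoS levels described above.
# Broker address, credentials, and topic are placeholders; the 1.x style constructor
# is used (in paho-mqtt 2.x, pass mqtt.CallbackAPIVersion.VERSION1 as the first argument).
import paho.mqtt.client as mqtt

def on_message(client, userdata, msg):
    print(f"Received on {msg.topic} (qos={msg.qos}): {msg.payload.decode()}")

client = mqtt.Client()
client.username_pw_set("mqtt_user", "mqtt_password")
client.on_message = on_message
client.connect("broker.example.com", 1883)

# qos=1 requests at-least-once delivery; use 0 or 2 for the other levels.
client.subscribe("sensors/temperature", qos=1)
client.loop_forever()
```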
Please Note: Kindly perform the following tasks to run a Pipeline workflow with the MQTT consumer component:
After configuring the component click the Save Component in Storage option for the component.
Update the Pipeline workflow and activate the pipeline to see the MQTT consumer working in a Pipeline Workflow. The user can get details through the Logs panel when the Pipeline workflow starts loading data.
EventHub subscriber typically consumes event data from an EventHub by creating an event processor client that reads the event data from the EventHub.
All component configurations are classified broadly into the following sections:
Meta Information
Follow the provided demonstration to configure the Eventhub Subscriber component.
There are two methods available for reading:
Connection String
Principal Secret
Connection String: It is a string of parameters that are used to establish a connection to an Azure EventHub
Consumer Group: It is a logical grouping of event consumers (subscribers) that read and process events from the same partition of an event hub.
EventHub Name: It refers to the specific Event Hub within the Event Hubs namespace to which data is being sent or received.
Checkpoint Location: It is a location in the event stream that represents the last event that has been successfully processed by the subscriber.
Enqueued time: It indicates the time when the event was added to the partition, which is typically the time when the event occurred or was generated.
Subscriber namespace: It is a logical entity that is used to group related subscribers and manage access control to EventHubs within the namespace.
Client ID: The ID of the Azure AD application that has been registered in the Azure portal and that will be used to authenticate the subscriber. This can be found in the Azure portal under the "App registrations" section.
Tenant ID: The ID of the Azure AD tenant that contains the Azure AD application and service principal that will be used to authenticate the subscriber.
Client secret: The secret value that is associated with the Azure AD application and that will be used to authenticate the subscriber.
Consumer group: It is a logical grouping of event consumers (subscribers) that read and process events from the same partition of an event hub.
EventHub Name: It refers to the specific Event Hub within the Event Hubs namespace to which data is being sent or received.
Checkpoint Location: It is a location in the event stream that represents the last event that has been successfully processed by the subscriber.
Enqueued time: It indicates the time when the event was added to the partition, which is typically the time when the event occurred or was generated.
Subscriber namespace: It is a logical entity that is used to group related subscribers and manage access control to EventHubs within the namespace.
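For orientation, the Connection String method maps onto the Azure SDK roughly as in the sketch below (the connection string, Event Hub name, and consumer group are placeholders; the component manages this internally):

```python
# Rough sketch of consuming events with a connection string, using azure-eventhub.
# The connection string, Event Hub name, and consumer group are placeholders.
from azure.eventhub import EventHubConsumerClient

client = EventHubConsumerClient.from_connection_string(
    conn_str="Endpoint=sb://<namespace>.servicebus.windows.net/;SharedAccessKeyName=...;SharedAccessKey=...",
    consumer_group="$Default",
    eventhub_name="my-eventhub",
)

def on_event(partition_context, event):
    print(f"Partition {partition_context.partition_id}: {event.body_as_str()}")
    # Persist progress; in production this requires a configured checkpoint store.
    partition_context.update_checkpoint(event)

with client:
    # starting_position="-1" reads from the beginning; a datetime can be passed
    # instead to start from a specific enqueued time.
    client.receive(on_event=on_event, starting_position="-1")
```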
The Video Stream Consumer is designed to consume .mp4 video, either from a real-time source or from a stored video in an SFTP location, in the form of frames.
All component configurations are classified broadly into the following sections:
​​Basic Information​​
Meta Information
Please follow the given demonstration to configure the Video Stream Consumer component.
Please Note:
Video Stream component supports only .mp4 file format. It reads/consumes video frame by frame.
The Testing Pipeline functionality and the Data Metrics option (from the Monitoring Pipeline functionality) are not available for this component.
Drag & drop the Video Stream Consumer component to the Workflow Editor.
Click the dragged Video Stream Consumer component to open the component properties tabs.
It is the default tab to open for the component.
Invocation Type: Select an Invocation type from the drop-down menu to confirm the running mode of the reader component. Select ‘Real-Time’ or ‘Batch’ from the drop-down menu.
Deployment Type: It displays the deployment type for the component. This field comes pre-selected.
Batch Size (min 10): Provide the maximum number of records to be processed in one execution cycle (Min limit for this field is 10).
Failover Event: Select a failover Event from the drop-down menu.
Container Image Version: It displays the image version for the docker container. This field comes pre-selected.
Description: Description of the component (It is optional).
Please Note: If the selected Invocation Type option is Batch, the Grace Period (in sec)* field appears, allowing you to provide a grace period for the component to shut down gracefully after that time.
Selecting Real-time as the Invocation Type option will display the Intelligent Scaling option.
Select the Meta Information tab and provide the mandatory fields to configure the dragged Video Stream Consumer component.
Host IP Address (*)- Provide IP or URL
The input in the Host IP Address field in the Meta Information tab changes based on the selection of Channel. There are two options available:
SFTP: It allows consuming stored videos from an SFTP location. Provide the SFTP connection details.
URL: It allows consuming live data from different sources such as cameras. Provide the connection details for the incoming live video.
Username (*)- Provide username
Port (*)- Provide Port number
Authentication- Select any one authentication option out of Password or PEM PPK File
Reader Path (*)- Provide reader path
Channel (*)- The supported channels are SFTP and URL
Resolution (*)- Select an option defining the video resolution out of the given options.
Frame Rate – Provide rate of frames to be consumed.
Please Note: The fields for the Meta Information tab change based on the selection of the Authentication option.
While using authentication option as Password it adds a password column in the Meta information.
While choosing the PEM/PPK File authentication option, the user needs to select a file using the Choose File option.
Please Note: With the SFTP channel, provide an IP address in the Host IP Address field; with the URL channel, provide a URL in the same field.
Click the Save Component in Storage icon for the Video Stream Consumer component.
A message appears to notify that the component properties are saved.
The Video Stream Consumer component gets configured to pass the data in the Pipeline Workflow.
Please Note: The Video Stream Consumer supports only the Video URL.
This component is used to fetch the tweets of any hashtag from Twitter.
All component configurations are classified broadly into the following sections:
Meta Information
​Connection Validation​
Follow the demonstration to configure the Twitter Scrapper component.
Configuring the meta information tab for Twitter Scrapper:
Consumer API Key: Provide the Consumer API Key for the Twitter Scrapper.
Consumer API Secret Key: This key acts as a password for this component.
Filter text: Enter the hashtag from where the Tweets are to be fetched.
Twitter Data Type: This field contains two options:
History: It will fetch all the past Tweets.
Real-time: It will fetch the real-time Tweets.
OPC UA (OPC Unified Architecture) is a communication protocol and standard used for collecting and transmitting data from industrial devices and systems to a data processing or analytics platform. OPC UA is commonly employed in data pipelines for handling data from industrial and manufacturing environments, making it an integral part of industrial data pipelines.
All component configurations are classified broadly into the following sections:
Meta Information
It is the default tab to open for the component while configuring it.
Invocation Type: Select an invocation mode out of ‘Real-Time’ or ‘Batch’ using the drop-down menu.
Deployment Type: It displays the deployment type for the reader component. This field comes pre-selected.
Container Image Version: It displays the image version for the docker container. This field comes pre-selected.
Failover Event: Select a failover Event from the drop-down menu.
Batch Size (min 10): Provide the maximum number of records to be processed in one execution cycle (Min limit for this field is 10).
URL: Provide URL link. In OPC UA (OPC Unified Architecture), a URL (Uniform Resource Locator) is used to specify the address or location of an OPC UA server or endpoint. URLs in OPC UA are typically used to establish connections to servers and access the services provided by those servers.
Message Security Mode: Select a message security mode from the drop-down menu (The supported options are ‘Sign’ and ‘SignAndEncrypt’).
Security Policy: Select a policy using the drop-down menu. Three types of security policies are supported:
Basic256: Basic256 is a security profile that provides encryption and signature capabilities for OPC UA communication. It uses a 256-bit encryption key. All messages exchanged between clients and servers are encrypted using a 256-bit encryption key, providing data confidentiality. Messages are digitally signed to ensure data integrity and authenticity. Signature algorithms ensure that the message has not been tampered with during transmission. Basic256 uses symmetric encryption, meaning both parties share the same secret key for encryption and decryption.
Basic256Sha256: Basic256Sha256 is an enhanced security profile that builds upon the features of Basic256. It offers stronger security by using SHA-256 cryptographic algorithms for key generation and message digests.
Basic128Rsa15: Basic128Rsa15 is a security profile that uses 128-bit encryption and RSA-15 key exchange. It is considered less secure compared to Basic256 and Basic256Sha256. Basic128Rsa15 uses 128-bit encryption for data confidentiality. It relies on the RSA-15 key exchange mechanism, which is considered less secure than newer RSA and elliptic curve methods.
Certificate File Name: This name gets reflected based on the Choose File option provided for the Certificate file.
Choose File: Browse a certificate file by using this option.
PEM File Name: This name gets reflected based on the Choose File option provided for the PEM file.
Choose File: Browse a PEM file by using this option.
Source Node: Enter the source node. The "Source Node" refers to the entity or component within the OPC UA server that is the source or originator of an event or notification. It represents the object or node that generates an event when a specific condition or state change occurs.
Event Node: Enter the event node. The "Event Node" refers to the specific node in the OPC UA AddressSpace that represents an event or notification that can be subscribed to by OPC UA clients. It is a node that defines the structure and properties of the event, including the event's name, severity, and other attributes.
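As a rough illustration of how these fields map onto an OPC UA client connection (the endpoint URL, security policy, and certificate/key file names are placeholders; the python-opcua library is assumed only for this sketch):

```python
# Illustrative OPC UA client connection using the python-opcua (freeopcua) library.
# Endpoint URL, security policy, and certificate/key file names are placeholders.
from opcua import Client

client = Client("opc.tcp://opcua-server.example.com:4840/freeopcua/server/")

# Security string format: "<SecurityPolicy>,<MessageSecurityMode>,<certificate>,<private key>"
client.set_security_string("Basic256Sha256,SignAndEncrypt,certificate.pem,private_key.pem")

try:
    client.connect()
    root = client.get_root_node()
    print("Connected, root node:", root)
finally:
    client.disconnect()
```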
RabbitMQ is an open-source message-broker software that enables communication between different applications or services. It implements the Advanced Message Queuing Protocol (AMQP) which is a standard protocol for messaging middleware. RabbitMQ is designed to handle large volumes of message traffic and to support multiple messaging patterns such as point-to-point, publish/subscribe, and request/reply. In a RabbitMQ system, messages are produced by a sender application and sent to a message queue. Consumers subscribe to the queue to receive messages and process them accordingly. RabbitMQ provides reliable message delivery, scalability, and fault tolerance through features such as message acknowledgement, durable queues, and clustering.
A RabbitMQ consumer is a client application or process that subscribes to a queue and receives messages in a push mode, using RabbitMQ client libraries and various subscription options.
All component configurations are classified broadly into the following sections:
​​Basic Information​​
Meta Information
​Connection Validation​
Follow the steps given in the demonstration to configure the Rabbit MQ Consumer component.
Host: Enter the host for RabbitMQ.
Port: Enter the port.
Username: Enter the username for RabbitMQ.
Password: Enter the password to authenticate with RabbitMQ consumer.
Queue: Provide queue for RabbitMQ consumer. A queue is a buffer that holds messages produced by publishers until they are consumed by subscribers. Queues are the basic building blocks of a RabbitMQ messaging system and are used to store messages that are waiting to be processed.
In AWS, SNS (Simple Notification Service) is a fully managed messaging service that enables you to send notifications and messages to distributed systems and components. SNS Monitor is a feature or functionality related to SNS that allows users to monitor the activity, health, and performance of their SNS topics and messages. It provides metrics and insights into the delivery status, throughput, success rates, and other relevant information about the messages sent through SNS topics.
All component configurations are classified broadly into the following sections:
Meta Information
​Connection Validation​
Access Key: Enter the AWS access key.
Secret Key: Enter the AWS secret key.
Region: Select the region of the SNS topic.
SQS URL: Enter the SQS URL obtained after creating an SQS queue, which will fetch the notification and send it to the out event if there is any modification in the S3 bucket.
Please Note:
Follow the below-given steps to set up monitoring for an S3 bucket using AWS SNS monitor:
Create an SNS topic in your AWS account.
Create an SQS queue that will subscribe to the SNS topic you created earlier.
After setting up the SQS queue, obtain the SQS URL associated with it.
With the SNS topic and SQS queue configured, you need to create an event notification for the S3 bucket that needs to be monitored.
This event notification will be configured to send notifications to the specified SNS topic.
Whenever there is a modification in the S3 bucket, the SNS topic will trigger notifications, which will be fetched by the SQS queue using its URL.
Finally, these notifications will be sent to the out Event, allowing you to monitor activity within the S3 bucket effectively.
Please go through the below given steps to create an SNS topic, SQS queue and Event Notification.
Sign in to the AWS console.
Navigate to the "Services" option or use the search option at the top and select "Simple Notification Service (SNS)".
Once redirected to the SNS page, go to the "Topics" option and click on "Create topic".
Enter a name and display name for the topic, and optionally, provide a description.
Click on "Create topic" to create the SNS topic.
Once the topic is created, go to the "Subscriptions" tab and click on "Create subscription".
Choose "Amazon SQS" as the protocol.
Select the desired SQS queue from the drop-down list or create a new queue if needed.
Click on "Create subscription" to link the SQS queue to the SNS topic.
After successfully creating the subscription, the SQS URL will be displayed. This URL can be used to receive messages from the SNS topic.
Sign in to the AWS console.
Navigate to the "Services" option or use the search option at the top and select "Simple Queue Service (SQS)".
Once redirected to the SQS page, where the list of available queues is displayed, click on "Create queue".
Enter the queue name and configure any required settings such as message retention period, visibility timeout, etc.
Click on the "Create queue" button to create the queue.
After successfully creating the queue, select the queue from the list.
In the queue details page, navigate to the "Queue Actions" dropdown menu and select "Subscribe to SNS topic".
Choose the SNS topic to which you want to subscribe from the dropdown menu.
Configure any required parameters such as filter policies and delivery retry policies.
Click on the "Subscribe" button to create the subscription.
Once the subscription is created successfully, the SQS URL will be displayed in the subscription details.
The obtained URL can be used in the SQS URL field of the AWS SNS Monitor component.
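Conceptually, the component polls that SQS URL for S3 event notifications, roughly as in the boto3 sketch below (the access key, secret key, region, and queue URL are placeholders):

```python
# Conceptual sketch of how S3 event notifications are fetched from the SQS queue.
# Access key, secret key, region, and queue URL are placeholders.
import boto3

sqs = boto3.client(
    "sqs",
    region_name="us-east-1",
    aws_access_key_id="<ACCESS_KEY>",
    aws_secret_access_key="<SECRET_KEY>",
)

queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/s3-event-queue"

response = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=10, WaitTimeSeconds=10)
for message in response.get("Messages", []):
    print(message["Body"])  # contains the SNS notification with the S3 event details
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])
```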
Once the SNS and SQS are configured, the user has to create an event notification in a bucket to monitor the activity of the S3 bucket using the AWS SNS monitor component in the pipeline. This involves setting up event notifications within an S3 bucket to trigger notifications whenever certain events occur, such as object creation, deletion, or modification. By configuring these event notifications, users can ensure that relevant events in the S3 bucket are captured and forwarded to the specified SNS topic for further processing or monitoring. This integration allows for seamless monitoring of S3 bucket activities using the AWS SNS monitor component within the pipeline.
Sign in to the AWS Management Console.
Navigate to the "Services" option and select "S3" from the list of available services.
Once redirected to the S3 dashboard, locate and click on the desired bucket for which you want to create an event notification.
In the bucket properties, navigate to the "Properties" tab.
Scroll down to the "Events" section and click on "Create event notification".
Provide a name for the event notification configuration.
Choose the events that the user wants to trigger notifications for, such as "All object create events", "All object delete events", or specific events based on prefixes or suffixes.
Specify the destination for the event notification. Select "Amazon SNS" as the destination type.
Choose the SNS topic to which the user wants to publish the event notifications.
Optionally, configure any additional settings such as filters based on object key prefixes or suffixes.
Review the configuration and click "Save" or "Create" to create the event notification.
Once saved, the event notification will be configured for the selected S3 bucket, and notifications will be sent to the specified SNS topic whenever the configured events occur within the bucket. Subsequently, these notifications will be fetched by the SQS URL subscribed to that SNS topic.
Go through the below-given demonstration to create an SNS topic, SQS queue, and Event Notification in AWS.
Please Note: The user should ensure that the AWS Bucket, SNS topic, and SQS topic are in the same region to create an event notification.
API ingestion and Webhooks are two methods used to receive data from a third-party service or system.
All component configurations are classified broadly into the following sections:
Meta Information
Follow the steps given in the demonstration to configure the API Ingestion component.
Ingestion Type: Select API Ingestion as the ingestion type from the drop-down (the available options are API Ingestion and Webhook).
Ingestion Id: It will be predefined in the component.
Ingestion Secret: It will be predefined in the component.
Once the pipeline gets saved, the Component Instance Id URL gets generated in the meta information tab of the component as shown in the above image.
Connect an out Event with the component and activate the pipeline.
Open the Postman tool (or any other tool where you want to configure the API/webhook endpoint).
Create a new request, select POST as the request method from the drop-down, and provide the generated Component Instance Id URL in the URL section of the Postman tool.
Navigate to the Headers section in the Postman tool and provide the Ingestion Id (key: ingestionId) and Ingestion Secret (key: ingestionSecret), which are pre-defined in the Meta Information of the API Ingestion component.
Navigate to the Body section in Postman, select the raw tab, and select the JSON option from the drop-down as the data type.
Now, enter the JSON data in the space provided and click the Send button.
The API Ingestion component will process the JSON data entered in the Postman tool and send it to the out Event.
Please refer to the below-given image to configure the Postman tool for the API Ingestion component:
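The same request can be made from any HTTP client. For instance, a hedged Python sketch equivalent to the Postman setup described above (the URL, Ingestion Id, Ingestion Secret, and payload are placeholders taken from the component's Meta Information):

```python
# Equivalent of the Postman setup described above, using the requests library.
# The Component Instance Id URL, ingestionId, and ingestionSecret are placeholders
# copied from the Meta Information tab of the API Ingestion component.
import requests

url = "https://<component-instance-id-url>"
headers = {
    "ingestionId": "<INGESTION_ID>",
    "ingestionSecret": "<INGESTION_SECRET>",
    "Content-Type": "application/json",
}
payload = [{"customer_id": 101, "order_total": 250.5}]

response = requests.post(url, json=payload, headers=headers)
print(response.status_code, response.text)
```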
Mongo ChangeStream allows applications to access real-time data changes without added complexity. Applications can use change streams to subscribe to all data changes on a single collection, a database, or an entire deployment, and immediately react to them. Because change streams use the aggregation framework, applications can also filter for specific changes.
All component configurations are classified broadly into the following sections:
Meta Information
Follow the given walk-through to configure the Mongo Change Stream component.
Connection type: Select the connection type from the drop-down menu and provide the required credentials.
Database name: Enter the database name.
Collection name: Enter the collection name from the given database.
Operation type: Select the operation type from the drop-down menu. There are four types of operations supported here: Insert, Update, Delete, and Replace.
Enable SSL: Check this box to enable SSL for this component. The MongoDB connection credentials will be different if this option is enabled.
Activate the pipeline and perform any of the above-given operation types on the Mongo collection.
Whatever operation has been done in the Mongo collection, the Mongo ChangeStream component will fetch that change and send it to the next Event in the Pipeline Workflow.
Certificate Folder: This option appears when the Enable SSL field is checked. The user has to select the certificate folder from the drop-down; it contains the files that have been uploaded to the Admin Settings for connecting to MongoDB with SSL. Please refer to the below-given images for reference.
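Under the hood this relies on MongoDB change streams. A minimal pymongo sketch of watching a collection for the supported operation types (the connection string, database, and collection names are placeholders) looks roughly like this:

```python
# Minimal sketch of a MongoDB change stream, as used conceptually by this component.
# Connection string, database, and collection names are placeholders.
from pymongo import MongoClient

client = MongoClient("mongodb://user:password@mongo-host:27017")
collection = client["sales_db"]["orders"]

pipeline = [{"$match": {"operationType": {"$in": ["insert", "update", "delete", "replace"]}}}]

with collection.watch(pipeline) as stream:
    for change in stream:
        # Each change document carries the operation type and the affected document.
        print(change["operationType"], change.get("fullDocument"))
```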
The Kafka Consumer component consumes the data from the given Kafka topic. It can consume the data from the same environment and external environment with CSV, JSON, XML, and Avro formats. This comes under the Consumer component group.
All component configurations are classified broadly into the following sections:
Meta Information
Check out the steps provided in the demonstration to configure the Kafka Consumer component.
Please Note: It currently supports SSL and Plaintext as Security types.
This component can also read data from external brokers with SSL as the security type and Host Aliases configured.
Click on the dragged Kafka Consumer component to get the component properties tabs.
Configure the Basic Information tab.
Select an Invocation type from the drop-down menu to confirm the running mode of the component. Select the Real-Time option from the drop-down menu.
Deployment Type: It displays the deployment type for the component. This field comes pre-selected.
Container Image Version: It displays the image version for the docker container. This field comes pre-selected.
Failover Event: Select a failover Event from the drop-down menu.
Batch Size (min 10): Provide the maximum number of records to be processed in one execution cycle (Min limit for this field is 10).
Enable Auto-Scaling: The component pods scale up automatically based on the given max instances if the component lag is more than 60%.
Topic Name: Specify the topic name that the user wants to consume data from Kafka.
Start From: The user will find four options here. Please refer to the bottom of the page for a detailed explanation along with an example.
Processed:
It represents the offset that has been successfully processed by the consumer.
This is the offset of the last record that has been successfully read and processed by the consumer.
By selecting this option, the consumer initiates data consumption from the point where it previously successfully processed, ensuring continuity in the consumption process.
Beginning:
It indicates the earliest available offset in a Kafka topic.
When a consumer starts reading from the beginning, it means it will read from the first offset available in the topic, effectively reading all messages from the start.
Latest:
It represents the offset at the end of the topic, indicating the latest available message.
When a consumer starts reading from the latest offset, it means it will only read new messages that are produced after the consumer starts.
Timestamp:
It refers to the timestamp associated with a message. Consumers can seek to a specific timestamp to read messages that were produced up to that timestamp.
To utilize this option, users are required to specify both the Start Time and End Time, indicating the range for which they intend to consume data. This allows consumers to retrieve messages within the defined time range for processing.
Is External: The user can consume external topic data from the external bootstrap server by enabling the Is External option. The Bootstrap Server and Config fields will display after enabling the Is External option.
Bootstrap Server: Enter external bootstrap details.
Config: Enter the configuration details of the external bootstrap server.
Input Record Type: It contains the following input record types:
CSV: The user can consume CSV data using this option. The Header and Separator fields will display if the user selects the CSV input record type.
Header: In this field, the user can enter the column names of the CSV data that is consumed from the Kafka topic.
Separator: In this field, the user can enter the separator, such as a comma (,), used in the CSV data.
JSON: The user can consume JSON data using this option.
XML: The user can consume XML data using this option.
AVRO: The user can consume Avro data using this option.
Security Type: It contains the following security types:
Plain Text: Choose the Plain Text option if the environment is without SSL.
Host Aliases: This option contains the following fields:
IP: Provide the IP address.
Host Names: Provide the Host Names.
SSL: Choose the SSL option if the environment uses SSL. It will display the following fields:
Trust Store Location: Provide the trust store path.
Trust Store Password: Provide the trust store password.
Key Store Location: Provide the key store path.
Key Store Password: Provide the key store password.
SSL Key Password: Provide the SSL key password.
Host Aliases: This option contains the following fields:
IP: Provide the IP.
Host Names: Provide the host names.
Please Note: The Host Aliases can be used with the SSL and Plain text Security types.
Processed:
If a consumer has successfully processed up to offset 2, it means it has processed all messages up to and including the one at offset 2 (timestamp 2024-02-27 01:00 PM). Now, the consumer will resume processing from offset 3 onwards.
Beginning:
If a consumer starts reading from the beginning, it will read messages starting from offset 0. It will process messages with timestamps from 2024-02-27 10:00 AM onward.
Latest:
If a consumer starts reading from the latest offset, it will only read new messages produced after the consumer starts. Let's say the consumer starts at timestamp 2024-02-27 02:00 PM; it will read only the message at offset 3.
Timestamp:
If a consumer seeks to a specific timestamp, for example, 2024-02-27 11:45 AM, it will read messages with offsets 2 and 3, effectively including the messages with timestamps 2024-02-27 01:00 PM and 2024-02-27 02:30 PM, while excluding the messages with timestamps 2024-02-27 10:00 AM and 2024-02-27 11:30 AM.
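For reference, the Timestamp option corresponds to seeking by timestamp in Kafka, which a client can do roughly as follows (this kafka-python sketch is illustrative only; the topic, broker address, and timestamps are placeholders and not the component's actual implementation):

```python
# Illustrative kafka-python sketch of the "Timestamp" start option: seek every
# partition to the first offset at or after a given start time.
from datetime import datetime
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(bootstrap_servers="kafka-broker:9092", enable_auto_commit=False)
partitions = [TopicPartition("orders-topic", p) for p in consumer.partitions_for_topic("orders-topic")]
consumer.assign(partitions)

start_ms = int(datetime(2024, 2, 27, 11, 45).timestamp() * 1000)
offsets = consumer.offsets_for_times({tp: start_ms for tp in partitions})

for tp, offset_and_ts in offsets.items():
    if offset_and_ts is not None:
        consumer.seek(tp, offset_and_ts.offset)  # start from the first message >= start time

for record in consumer:
    print(record.offset, record.timestamp, record.value)
```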
The EventHub Publisher leverages the scalability and throughput capabilities of Event Hubs to ensure efficient and reliable event transmission.
All component configurations are classified broadly into the following sections:
Meta Information
Please follow the steps given in the walk-through to configure the Eventhub Publisher component.
There are two methods available for connecting:
Connection String
Principal Secret
Connection String: It is a string of parameters that are used to establish a connection to an Azure EventHub.
Consumer Group: It is a logical grouping of event consumers (subscribers) that read and process events from the same partition of an event hub.
EventHub Name: It refers to the specific Event Hub within the Event Hubs namespace to which data is being sent or received.
Checkpoint Location: It is a location in the event stream that represents the last event that has been successfully processed by the subscriber.
Enqueued time: It indicates the time when the event was added to the partition, which is typically the time when the event occurred or was generated.
Publisher namespace: It is a logical entity that is used to group related publishers and manage access control to EventHubs within the namespace.
Client ID: The ID of the Azure AD application that has been registered in the Azure portal and that will be used to authenticate the publisher. This can be found in the Azure portal under the "App registrations" section.
Tenant ID: The ID of the Azure AD tenant that contains the Azure AD application and service principal that will be used to authenticate the publisher.
Client secret: The secret value that is associated with the Azure AD application and that will be used to authenticate the publisher.
Consumer Group: It is a logical grouping of event consumers that read and process events from the same partition of an event hub.
EventHub Name: It refers to the specific Event Hub within the Event Hubs namespace to which data is being sent or received.
Checkpoint Location: It is a location in the event stream that represents the last event that has been successfully processed by the publisher.
Enqueued time: It indicates the time when the event was added to the partition, which is typically the time when the event occurred or was generated.
Publisher namespace: It is a logical entity that is used to group related publishers and manage access control to EventHubs within the namespace.
The EventHub Publisher serves as a bridge between the transformed data within the pipeline and the Azure Event Hubs service. It ensures the efficient and reliable transmission of data.
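As with the subscriber, the Connection String method maps onto the Azure SDK roughly as in the sketch below (the connection string, Event Hub name, and payload are placeholders; the component handles this internally):

```python
# Rough sketch of publishing events with a connection string, using azure-eventhub.
# The connection string, Event Hub name, and payload are placeholders.
import json
from azure.eventhub import EventHubProducerClient, EventData

producer = EventHubProducerClient.from_connection_string(
    conn_str="Endpoint=sb://<namespace>.servicebus.windows.net/;SharedAccessKeyName=...;SharedAccessKey=...",
    eventhub_name="my-eventhub",
)

with producer:
    batch = producer.create_batch()
    batch.add(EventData(json.dumps({"customer_id": 101, "order_total": 250.5})))
    producer.send_batch(batch)
```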
EventGrid producer component is designed to publish events to Azure EventGrid, which is a fully-managed event routing service provided by Microsoft Azure.
All component configurations are classified broadly into the following sections:
Meta Information
Follow the demonstration to configure the EventGrid Producer component.
Topic endpoint: It is a unique endpoint provided by Azure EventGrid that an EventGrid producer component can use to publish events to a specific topic.
Topic Secret Key: It is a security token that is used to authenticate and authorize access to an Azure EventGrid topic by an EventGrid producer component.
A WebSocket producer component is a software component that is used to send data over a WebSocket connection.
All component configurations are classified broadly into the following sections:
Meta Information
Follow the steps given in the demonstration to configure the WebSocket component.
This component can be used to produce data to the internal WebSocket to consume live data. The WebSocket Producer helps the user to get the message received by the Kafka topic.
Steps to configure the component:
Drag & Drop the WebSocket Producer component on the Workflow Editor.
The producer component requires an input event (to get the data) and produces the data to the WebSocket location based on guid, ingestion Id, and ingestion Secret.
Create an Event and drag it to the Workspace.
Connect the input event (The data in the input event can come from any Reader, Consumer, or Shared event).
Click on the dragged WebSocket Producer component to open the component properties tabs below.
Basic Information: It is the default tab to open for the WebSocket Producer while configuring the component.
Invocation Type: Select an Invocation type from the drop-down menu to confirm the running mode of the WebSocket Producer component. The supported invocation type is ‘Real-Time’.
Deployment Type: It displays the deployment type for the component. This field comes pre-selected.
Container Image Version: It displays the image version for the docker container. This field comes pre-selected.
Failover Event: Select a failover event from the drop-down menu.
Batch Size (min 10): Provide the maximum number of records to be processed in one execution cycle (Min limit for this field is 10).
Intelligent Scaling: Component pods scale up automatically based on the given max instance if the component lag is more than 60% and the pod goes down if the component lag is less than 10%.
Open the Meta Information tab and configure the required fields:
GUID: It will be displayed after saving the component and updating the pipeline.
Ingestion Id: It will auto-generate with a new component.
Ingestion Secret: It will be auto-generated with a new component and regenerate after clicking on the Refresh Ingestion icon.
Click the Save Component in Storage icon provided in the WebSocket Producer configuration panel to save the component.
A message appears to notify the successful update of the component.
Click on the Update Pipeline icon to update the pipeline.
RabbitMQ producer plays a vital role in enabling reliable message-based communication and data flow within a data pipeline.
RabbitMQ is an open-source message-broker software that enables communication between different applications or services. It implements the Advanced Message Queuing Protocol (AMQP) which is a standard protocol for messaging middleware. RabbitMQ is designed to handle large volumes of message traffic and to support multiple messaging patterns such as point-to-point, publish/subscribe, and request/reply. In a RabbitMQ system, messages are produced by a sender application and sent to a message queue. Consumers subscribe to the queue to receive messages and process them accordingly. RabbitMQ provides reliable message delivery, scalability, and fault tolerance through features such as message acknowledgement, durable queues, and clustering.
In RabbitMQ, a producer is also referred to as a "publisher" because it publishes messages to a particular exchange. The exchange then routes the message to one or more queues, which can be consumed by one or more consumers (or "subscribers").
All component configurations are classified broadly into the following sections:
Meta Information
Host: Enter the host for RabbitMQ.
Port: Enter the port.
Username: Enter the username for RabbitMQ.
Password: Enter the password to authenticate with RabbitMQ Producer.
Queue: In RabbitMQ, a queue is a buffer that holds messages that are waiting to be processed by a consumer (or multiple consumers). In the context of a RabbitMQ producer, a queue is a destination where messages are sent for eventual consumption by one or more consumers.
Virtual host: Provide a virtual host. In RabbitMQ, a virtual host is a logical grouping of resources such as queues, exchanges, and bindings, which allows you to isolate and segregate different parts of your messaging system.
Exchange: Provide an Exchange. An exchange is a named entity in RabbitMQ that receives messages from producers and routes them to queues based on a set of rules called bindings. An exchange can have several types, including "direct", "fanout", "topic", and "headers", each of which defines a different set of routing rules.
Query Type: Select the query type from the drop-down. There are three(3) options available in it:
Classic: Classic queues are the most basic type of queue in RabbitMQ, and they work in a "first in, first out" (FIFO) manner. In classic queues, messages are stored on a single node, and consumers can retrieve messages from the head of the queue.
Stream: In stream queues, messages are stored across multiple nodes in a cluster, with each message being replicated across multiple nodes for fault tolerance. Stream queues allow for messages to be processed in parallel and can handle much higher message rates than classic queues.
Quorum: In quorum queues, messages are stored across multiple nodes in a cluster, with each message being replicated across a configurable number of nodes for fault tolerance. Quorum queues provide better performance than classic queues and better durability than stream queues.
Exchange Type: Select the exchange type from the drop-down. Four (4) exchange types are supported:
Direct: A direct exchange type in RabbitMQ is one of the four possible exchange types that can be used to route messages between producers and consumers. In a direct exchange, messages are routed to one or more queues based on an exact match between the routing key specified by the producer and the binding key used by the queue. That is, the routing key must match the binding key exactly for the message to be routed to the queue.
Fanout: A fanout exchange routes all messages it receives to all bound queues indiscriminately. That is, it broadcasts every message it receives to all connected consumers, regardless of any routing keys or binding keys.
Topic: This type of exchange routes messages to one or more queues based on a matching routing pattern, which can include wildcards. Two fields will be displayed when Direct, Fanout, or Topic is selected as the Exchange Type (see the sketch after these fields):
Bind Key: Provide the Bind key. The binding key is used on the consumer (queue) side to determine how messages are routed from an exchange to a specific queue.
Publish Key: Enter the Publish key. The Publish key is used by the producer (publisher) when sending a message to an exchange.
Header: This type of exchange routes messages based on header attributes instead of routing keys.
X-Match: This field will only appear if Header is selected as the Exchange Type. There are two options in it:
Any: When X-Match is set to any, the message will be delivered to a queue if it matches any of the header fields in the binding. This means that if a binding has multiple headers, the message will be delivered if it matches at least one of them.
All: When X-Match is set to all, the message will only be delivered to a queue if it matches all of the header fields in the binding. This means that if a binding has multiple headers, the message will only be delivered if it matches all of them.
Binding Headers: This field will only appear if Header is selected as Exchange type. Enter the Binding headers key and value. Binding headers are used to create a binding between an exchange and a queue based on header attributes. You can specify a set of headers in a binding, and only messages that have matching headers will be routed to the bound queue.
Publishing Headers: This field will only appear if Header is selected as Exchange type. Enter the Publishing headers key and value. Publishing headers are used to attach header attributes to messages when they are published to an exchange.
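To make the exchange, binding key, and publish (routing) key relationship concrete, here is a hedged pika sketch for a direct exchange (the host, credentials, exchange, queue, and key names are placeholders; the component performs the equivalent internally):

```python
# Hedged pika sketch of a direct exchange: the queue is bound with a binding key,
# and the producer publishes with a matching routing (publish) key.
# Host, credentials, exchange, queue, and key names are placeholders.
import pika

credentials = pika.PlainCredentials("rabbit_user", "rabbit_password")
connection = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq-host", credentials=credentials))
channel = connection.channel()

channel.exchange_declare(exchange="orders_exchange", exchange_type="direct")
channel.queue_declare(queue="orders_queue")
channel.queue_bind(queue="orders_queue", exchange="orders_exchange", routing_key="order.created")  # Bind Key

channel.basic_publish(
    exchange="orders_exchange",
    routing_key="order.created",  # Publish Key must match the Bind Key for a direct exchange
    body='{"customer_id": 101, "order_total": 250.5}',
)
connection.close()
```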
Click the Save Component in Storage icon to save the component.
Click the Save Pipeline icon to save the pipeline.
The Rule Splitter component is designed to split a set of data, based on given conditions, into smaller and more manageable subsets.
All component configurations are classified broadly into 3 section
​
Meta Information
​​
Follow the steps given in the demonstration to configure the Rule Splitter component.
Number of Outputs: The total number of sets you want to split your data into (1-7).
Event Relation
Out Event: Automatically mapped based on the number of outputs (make sure to connect this component to the same number of Events as the number of outputs).
Conditions: The set of rules based on which the split will happen.
Column Name: Provide the column name to apply the condition on.
Condition: Select the condition from the dropdown.
We have 8 supported conditions:
>(Greater than)
<(Less than)
>=(Greater than equal to)
<=(Less than equal to)
==(Equal to)
!=(Not equal to)
BETWEEN
LIKE
Value: Provide the value to compare against.
Datatype: Specify the datatype of the column you have provided.
Rule Condition: In case of multiple column conditions, select AND or OR from the drop-down.
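Conceptually, each output receives the rows that satisfy its conditions, much like applying filter expressions. A hedged PySpark sketch of a two-way split (the column names, values, and sample data are placeholders, not the component's actual implementation):

```python
# Conceptual PySpark sketch of a rule-based split into two outputs.
# Column names, values, and the sample DataFrame are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("rule-splitter-sketch").getOrCreate()
df = spark.createDataFrame(
    [(101, "IN", 250.5), (102, "US", 80.0), (103, "IN", 40.0)],
    ["customer_id", "country", "order_total"],
)

# Output 1: order_total > 100 AND country == 'IN'; Output 2: everything else.
condition = (F.col("order_total") > 100) & (F.col("country") == "IN")
output_1 = df.filter(condition)
output_2 = df.filter(~condition)

output_1.show()
output_2.show()
```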
Please Note: The user will not be able to copy and paste Rule Splitter component in the pipeline as the Copy option has been disabled for Rule Splitter.
SQL transformer applies SQL operations to transform and manipulate data, providing flexibility and expressiveness in data transformations within a data pipeline.
The SQL component serves as a bridge between the extracted data and the desired transformed data, leveraging the power of SQL queries and database systems to enable efficient data processing and manipulation.
It also provides an option of using aggregation functions on the complete streaming data processed by the component. The user can use SQL transformations on Spark data frames with the help of this component.
All component configurations are classified broadly into the following sections:
Meta Information
Follow the given steps in the demonstration to configure the SQL transformation component.
Please Note: The schema file that can be uploaded here is a JSON spark schema.
Query Type: There are two options available under this field:
Batch Query: When this option is selected, then there is no need to upload a schema file.
Aggregate Query: When this option is selected, it is mandatory to upload the spark schema file in JSON format of the in-event data.
Schema File name: Upload the spark schema file in JSON format when the Aggregate query is selected in the query type field.
Table Name: Provide the table name.
Query: Write an SQL query in this field.
Selected Columns: Select the column name from the table, and provide the alias name and the desired data type for that column.
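Conceptually, the component registers the incoming data under the given Table Name and runs the configured query with Spark SQL. A hedged sketch (the table name, columns, and query are placeholders):

```python
# Conceptual PySpark sketch of the SQL transformation: register the in-event data
# as a temporary view and run the configured query. Names and query are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-transform-sketch").getOrCreate()
df = spark.createDataFrame(
    [("IN", 250.5), ("US", 80.0), ("IN", 40.0)],
    ["country", "order_total"],
)

df.createOrReplaceTempView("orders")  # corresponds to the Table Name field
result = spark.sql("SELECT country, SUM(order_total) AS total_sales FROM orders GROUP BY country")
result.show()
```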
Please Note: When using Aggregate Query mode:
Data Writing:
When configured for Aggregate Query mode and connected to DB Sync, the SQL component will not write data to the DB Sync event.
Monitoring:
In Aggregate mode, monitoring details for the SQL component will not be available on the monitoring page.
Running Aggregate Queries Freshly:
If you set the SQL component to Aggregate Query mode and want to run it afresh, clearing the existing event data is recommended. To achieve this:
Copy the component.
Paste the copied component to create a fresh instance.
Running the copied component ensures the query runs without including aggregations from previous runs.
The File Splitter component is designed to split one or more files based on specified conditions.
All component configurations are classified broadly into the following sections:
Meta Information
Follow the given steps in the demonstration to configure the File Splitter component.
Split Type: The condition based on which the files are split. Five file split types are supported:
By File Format
By File Name
By RegExp
By Excel Sheet Name
By Excel Sheet Number
No. of Outputs: Select the total number of outputs (1-5).
Details: Mapping of each output to an out Event.
Out Event: Event/Topic selected automatically.
File Type: Select the right file format for each of the files (PDF, CSV, EXCEL, OTHERS).
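As a rough illustration of the By RegExp and By File Format ideas, routing files to outputs could look like the sketch below (the patterns, file names, and output labels are placeholders; the component handles this internally):

```python
# Conceptual sketch of routing files to outputs by regular expression or extension.
# Patterns, file names, and output labels are placeholders.
import re

routes = [
    (re.compile(r"^invoice_\d{4}\.pdf$"), "output_1"),   # By RegExp
    (re.compile(r".*\.csv$"), "output_2"),               # By File Format (CSV)
    (re.compile(r".*\.xlsx$"), "output_3"),              # By File Format (Excel)
]

def route(file_name: str) -> str:
    for pattern, output in routes:
        if pattern.match(file_name):
            return output
    return "output_others"

for name in ["invoice_2024.pdf", "sales.csv", "notes.txt"]:
    print(name, "->", route(name))
```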
Please Note: The user will not be able to copy and paste File Splitter component in the pipeline as the Copy option has been disabled for File Splitter.
The Kafka producer acts as a data source within the pipeline, generating and publishing messages to Kafka for subsequent processing and consumption.
The Kafka producer plays a crucial role in the Data Pipeline module enabling reliable and scalable data ingestion into the Kafka cluster, where messages can be processed, transformed, and consumed by downstream components or applications.
Kafka's distributed and fault-tolerant architecture allows for scalable and efficient data streaming, making it suitable for various real-time data processing and analytics use cases.
This component is to produce messages to internal/external Kafka topics.
All component configurations are classified broadly into the following sections:
Meta Information
Follow the given demonstration to configure the Kafka Producer component.
The Kafka Producer component consumes the data from the previous Event and produces it to a given Kafka topic. It can produce data to the same environment or an external environment in CSV, JSON, XML, and Avro formats. This data can be further consumed by a Kafka Consumer in the data pipeline.
Drag and drop the Kafka Producer Component to the Workflow Editor.
Click on the dragged Kafka Producer component to get the component properties tabs.
Configure the Basic Information tab.
Select an Invocation type from the drop-down menu to confirm the running mode of the component. Select ‘Real-Time’ from the drop-down menu.
Deployment Type: It displays the deployment type for the component. This field comes pre-selected.
Container Image Version: It displays the image version for the docker container. This field comes pre-selected.
Failover Event: Select a failover Event from the drop-down menu.
Batch Size (min 10): Provide the maximum number of records to be processed in one execution cycle (the minimum limit for this field is 10).
Click on the Meta Information tab and configure it by providing the required fields.
Topic Name: Specify the topic name where the user wants to produce data.
Is External: The user can produce data to an external Kafka topic by enabling the Is External option. The Bootstrap Server and Config fields appear after enabling this option.
Bootstrap Server: Enter external bootstrap details.
Config: Enter configuration details.
Input Record Type: It contains the following input record types:
CSV: The user can produce CSV data using this option. The Header and Separator fields appear when the CSV input record type is selected.
Header: Enter the column names of the CSV data to be produced to the Kafka topic.
Separator: Enter the separator, such as a comma (,), used in the CSV data.
JSON: User can produce JSON data using this option.
XML: User can produce XML data using this option.
AVRO: User can produce AVRO data using this option. ‘Registry’, ‘Subject’ and ‘Schema’ fields will display if user selects AVRO as the input record type.
Registry: Enter registry details.
Subject: Enter subject details.
Schema: Enter schema.
Host Aliases: In Apache Kafka, a host alias (also known as a hostname alias) is an alternative name that can be used to refer to a Kafka broker in a cluster. Host aliases are useful when you need to refer to a broker using a name other than its actual hostname.
IP: Enter the IP.
Host Names: Enter the host names.
After doing all the configurations, click the Save Component in Storage icon provided in the configuration panel to save the component.
A notification message appears to confirm that the component configuration has been saved.
The Synthetic Data Generator component is designed to generate the desired data by using the Draft07 schema of the data that needs to be generated.
The user can upload the data in CSV or XLSX format and it will generate the draft07 schema for the same data.
Check out steps to create and use the Synthetic Data Generator component in a Pipeline workflow.
Drag and drop the Synthetic Data Generator Component to the Workflow Editor.
Click on the dragged Synthetic Data Generator component to get the component properties tabs.
Configure the Basic Information tab.
Select an Invocation type from the drop-down menu to confirm the running mode of the component. Select the Real-Time option from the drop-down menu.
Deployment Type: It displays the deployment type for the component. This field comes pre-selected.
Container Image Version: It displays the image version for the docker container. This field comes pre-selected.
Failover Event: Select a failover Event from the drop-down menu.
Batch Size (min 10): Provide the maximum number of records to be processed in one execution cycle (the minimum limit for this field is 10).
Configure the following information:
Iteration: Number of iterations for producing the data.
Delay (sec): Delay between each iteration in seconds.
Batch Size: Number of records to be produced in each iteration.
Upload Sample File: Upload the file containing the data. CSV, Excel (XLSX), and JSON file formats are supported. Once the file is uploaded, the Draft07 schema for the uploaded file is generated in the Schema tab.
Schema: Draft07 schema will display under this tab in the editable format.
Upload Schema: The user can directly upload the draft07 schema in JSON format from here. Also, the user can directly paste the draft07 schema in the schema tab.
After doing all the configurations, click the Save Component in Storage icon provided in the configuration panel to save the component.
A notification message appears to confirm that the component configuration has been saved.
Please Note: Total number of generated records = Number of iterations × Batch size.
Please find a Sample Schema file given below for the users to explore the component.
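The sample file itself is not reproduced here; the snippet below is only an illustrative Draft07 schema (all field names are hypothetical) showing the kind of structure the component expects, combining the string, number, and format keywords described in the following sections:
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "age": { "type": "string", "enum": ["Young", "Middle", "Old"], "weights": [0.6, 0.2, 0.2] },
    "salary": { "type": "number", "minimum": 10000, "maximum": 90000 },
    "joining_date": { "type": "string", "format": "date", "minimum": "2022-01-01", "maximum": "2023-12-31" }
  }
}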
Please Note: Weights can be given in order to handle the bias across the data generated:
The sum of the weights should be exactly 1.
"age": { "type": "string", "enum": ["Young", "Middle","Old"], "weights":[0.6,0.2,0.2]}
Type: "string"
Properties:
maxLength
: Maximum length of the string.
minLength
: Minimum length of the string.
enum
: A list of values that the number can take.
weights
: Weights for each value in the enum list.
format
: Available formats include 'date', 'date-time', 'name', 'country', 'state', 'email', 'uri', and 'address'.
For 'date' and 'date-time' formats, the following properties can be set:
minimum
: Minimum date or date-time value.
maximum
: Maximum date or date-time value.
interval
: For 'date' format, the interval is the number of days. For 'date-time' format, the interval is the time difference in seconds.
occurrence
: Indicates how many times a date/date-time needs to repeat in the data. It should only be employed with the 'interval' and 'start' keyword.
A new format has been introduced for the string type: 'current_datetime'. This format generates records with the current date-time.
Type: "number"
Properties:
minimum
: The minimum value for the number.
maximum
: The maximum value for the number.
exclusiveMinimum
: Indicates whether the minimum value is exclusive.
exclusiveMaximum
: Indicates whether the maximum value is exclusive.
unique
: Determines if the field should generate unique values (True/False).
start
: Associated with unique values, this property determines the starting point for unique values.
enum
: A list of values that the number can take.
weights
: Weights for each value in the Enum list.
Type: "float"
Properties:
minimum
: The minimum float value.
maximum
: The maximum float value.
Please Note: Draft-07 schemas allow for the use of if-then-else conditions on fields, enabling complex validations and logical checks. Additionally, mathematical computations can be performed by specifying conditions within the schema.
Sample Draft-07 schema with if-then-else condition
Example: Here, the number3 value will be calculated based on the "$eval": "data.number1 + data.number2 * 2" condition.
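The original sample is not included here; a minimal sketch of the $eval usage described above (assuming number1 and number2 are plain number fields and number3 is derived from them) could look like:
{
  "type": "object",
  "properties": {
    "number1": { "type": "number", "minimum": 1, "maximum": 10 },
    "number2": { "type": "number", "minimum": 1, "maximum": 10 },
    "number3": { "type": "number", "$eval": "data.number1 + data.number2 * 2" }
  }
}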
Please Note: Conditional statements can also be applied to date and datetime columns using if-then-else. Please go through the schema given below for reference.
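The schema referred to here is described in the next paragraph; a rough reconstruction of it (the exact keyword placement may differ in the original attachment) is:
{
  "type": "object",
  "properties": {
    "task_start_date": { "type": "string", "format": "date" },
    "task_end_date": { "type": "string", "format": "date" }
  },
  "if": {
    "properties": {
      "task_start_date": { "type": "string", "format": "date" },
      "task_end_date": { "type": "string", "format": "date" }
    },
    "required": ["task_start_date", "task_end_date"]
  },
  "then": {
    "properties": {
      "task_end_date": { "minimum": "task_start_date" }
    }
  }
}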
The above given JSON schema defines an object with two properties, "task_end_date" and "task_start_date", both of which are expected to be strings in date format. The schema includes a conditional validation rule using the "if-then" structure. If both "task_end_date" and "task_start_date" are present and in date format, then an additional constraint is applied: the "task_end_date" must have a minimum value that is greater than or equal to the value of "task_start_date". This schema is useful for ensuring that task end dates are always set to a date that is on or after the task's start date when working with JSON data.
Data Pipeline module provides two types of scripting components to facilitate the users.
The Flatten JSON component takes complex JSON data structures and flattens them into a more simplified and tabular format.
All component configurations are classified broadly into the following sections:
Meta Information
Follow the given steps in the demonstration to configure the Flatten JSON component.
Column Filter: Enter column name to read and optionally specify an alias name and column type from the drop-down menu.
Use Download Data and Upload File options to select the desired columns.
Upload File: The user can upload the existing system files (CSV, JSON) using the Upload File icon (file size must be less than 2 MB).
Download Data: Users can download the schema structure in JSON format by using the Download Data icon.
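For intuition only (this is not the component's internal implementation), the following minimal Python sketch shows what flattening a nested JSON record means, using pandas.json_normalize; the record content is hypothetical:
import pandas as pd

nested = [{"id": 248, "dept": {"dept_id": 20, "dept_name": "data_science"}}]
flat = pd.json_normalize(nested)        # nested keys become flat columns: id, dept.dept_id, dept.dept_name
print(flat.to_dict(orient="records"))   # [{'id': 248, 'dept.dept_id': 20, 'dept.dept_name': 'data_science'}]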
A stored procedure is a named group of SQL statements that are precompiled and stored in a database. The Stored Procedure Runner component is designed to run such a pre-compiled set of instructions, which is stored in a database and can be executed by the database management system on demand.
All component configurations are classified broadly into the following sections:
​​Basic Information​​
Meta Information
Follow the given steps in the demonstration to configure the Stored Procedure Runner component.
Host IP Address: Enter the Host IP Address for the selected driver.
Port: Enter the port for the given IP Address.
User name: Enter the user name for the provided database.
Password: Enter the password for the provided database.
Database name: Enter the Database name.
Procedure name: Provide the stored procedure name.
Driver: Select the driver from the drop down. There are 4 drivers supported here: MYSQL, MSSQL, Oracle, PostgreSQL.
Input Parameters: These are values passed into the stored procedure from an external application or script (with name, value, and type).
Output Parameters: These are values returned by the stored procedure to the calling application or script (with name and type).
The Pandas query component is designed to filter the data by applying pandas query on it.
All component configurations are classified broadly into the following sections:
​​Basic Information​​
Meta Information
Follow the steps given in the demonstration to configure the Pandas Query component.
This component helps the users to get data as per the entered query.
Drag and Drop the Pandas Query component to the Workflow Editor.
The transformation component requires an input event (to get the data) and sends the data to an output event.
Create two Events and drag them to the Workspace.
Connect the input event and the output event to the component (The data in the input event can come from any Ingestion, Reader, or shared events).
Click the Pandas Query component to get the component properties tabs.
The Basic Information tab opens by default while clicking the dragged component.
Select an Invocation type from the drop-down menu to confirm the running mode of the Pandas Query component. Select ‘Real-Time’ or ‘Batch’ from the drop-down menu.
Deployment Type: It displays the deployment type for the component. This field comes pre-selected.
Container Image Version: It displays the image version for the docker container. This field comes pre-selected.
Failover Event: Select a failover Event from the drop-down menu.
Batch Size (min 10): Provide the maximum number of records to be processed in one execution cycle (Min limit for this field is 10).
Open the Meta Information tab and provide the connection-specific details.
Enter a Pandas query to fetch data from in-event.
Provide the Table Name.
Sample Pandas Query:
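A query along the following lines (assuming, as stated below, that the in-event data is exposed as a table named df) would do this:
df[(df['gender'] == 'Female') & (df['department'] == 'Sales')]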
In the above given Pandas Query, df is the table name that contains the data from the previous event. It will fetch all the rows having gender = 'Female' and department = 'Sales'.
Click the Save Component in Storage icon to save the component properties.
A Notification message appears to notify the successful update of the component.
Please Note: The samples of Pandas Query are given below together with the SQL query for the same statements.
The following pairs show an SQL query and its equivalent Pandas query:
SQL Query: select id from airports where ident = 'KLAX'
Pandas Query: airports[airports.ident == 'KLAX'].id
SQL Query: select * from airport_freq where airport_ident = 'KLAX' order by type
Pandas Query: airport_freq[airport_freq.airport_ident == 'KLAX'].sort_values('type')
SQL Query: select * from airports where iso_region = 'US-CA' and type = 'seaplane_base'
Pandas Query: airports[(airports.iso_region == 'US-CA') & (airports.type == 'seaplane_base')]
SQL Query: select type, count(*) from airports where iso_country = 'US' group by type having count(*) > 1000 order by count(*) desc
Pandas Query: airports[airports.iso_country == 'US'].groupby('type').filter(lambda g: len(g) > 1000).groupby('type').size().sort_values(ascending=False)
Enrich your data from the master table/collection in a few simple steps. This component helps users enrich the incoming data from an in-Event by querying a lookup (master) table in an RDBMS or MongoDB.
All component configurations are classified broadly into the following sections:
​​Basic Information​​
Meta Information
​Connection Validation​
Follow the steps given in the demonstration to configure the Enrichment component.
Please Note: If the selected driver is MongoDB, then write a Mongo Aggregation query in the Master table query field. Please refer to the demonstration given below to configure the Enrichment component for the MongoDB driver.
Drag and drop the Enrichment Component to the Workflow Editor.
Create two Events and drag them to the Workspace.
Connect the input event and the output event (The data in the input event can come from any Ingestion, Reader, or shared events).
Click the Enrichment Component to get the component properties tabs.
The Basic Information tab opens by default.
Select an Invocation type from the drop-down menu to confirm the running mode of the reader component. Select either Real-Time or Batch option from the drop-down menu.
Deployment Type: It displays the deployment type for the component. This field comes pre-selected.
Container Image Version: It displays the image version for the docker container. This field comes pre-selected.
Failover Event: Select a failover Event from the drop-down menu.
Batch Size: Provide the maximum number of records that you want to be processed in one execution cycle.
Open the Meta Information tab and fill in all the connection-specific details for the Enrichment Component.
Driver (*): Select Database type (MYSQL, MSSQL, Oracle, Postgres, MongoDB, ClickHouse, Snowflake)
Port (*): Host server port number
Host IP Address (*): IP Address
Username (*): Username for Authentication.
Password (*): Password for Authentication.
Database Name (*): Provide the Database name.
Enable SSL: Check this box to enable SSL for this component. The Enable SSL feature in the DB reader component appears only for three drivers: MongoDB, PostgreSQL, and ClickHouse.
Certificate Folder: This option appears when the Enable SSL field is checked. The user has to select the certificate folder from the drop-down; it contains the files that have been uploaded to the Admin Settings. Please refer to the images given below for reference.
Table Name: Provide the table name to read the data.
Refresh rate in Secs: The value of this field has to be provided in seconds. It refreshes the master table and fetches the changes in every cycle of the given value. For example, if Refresh rate value is given as 3600 seconds, it will refresh the master table in every 3600 seconds and fetch the changes (Default value for this field is 3600 seconds).
Please Note: The Refresh rate value can be changed according to your use-case.
Connection Type: This option will show if the user selects MongoDB from the Driver field. A User can configure the MongoDB driver via two connection types (Standard or SRV) that are explained below:
Standard - Port field does appear with the Standard connection type option.
SRV - Port field does not appear with the SRV connection type option.
Conditions: Select conditions type (Remove or Blank option).
Master table query: Write a Spark SQL query to get the data from the table which has been mentioned in the Table name field.
Query: Enter a Spark SQL query to join the master table and the data coming from the previous event. Take inputDf as the table name for the previous event data.
The users can select some specific columns from the table instead of reading the complete table; this can be achieved via the Selected Columns section. Select the columns that you want to read. If you want to change the name of a column, put that name in the Alias Name section; otherwise, keep the alias name the same as the column name. Then select a Column Type from the drop-down menu.
or
Use the Download Data and Upload File options to select the desired columns.
Upload File: The user can upload the existing system files (CSV, JSON) using the Upload File icon (file size must be less than 2 MB).
Download Data (Schema): Users can download the schema structure in JSON format by using the Download Data icon.
After doing all the configurations click the Save Component in Storage icon provided in the reader configuration panel to save the component.
A notification message appears to inform about the component configuration success.
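For reference, a typical Spark SQL query for the Query field could look like the following; the master table name and the join columns are purely illustrative:
SELECT i.*, m.customer_name FROM inputDf i JOIN master_customers m ON i.customer_id = m.customer_id
Here, inputDf refers to the previous event data, and master_customers is assumed to be the lookup table configured in the Table Name field.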
Please Note: The data of previous event is taken as inputDf as the table name in the query field of the Enrichment component as shown in the above query example.
All component configurations are classified broadly into the following sections:
Basic Information
Meta Information
The Data Loss Protection component in the pipeline is used to protect or mask the incoming data using several techniques so that important data is not exposed.
Please follow the steps provided in the demonstration to configure the Data Loss Protection component.
Column name: Enter the column name whose data has to be protected.
Rule type: Select the rule type to hide the data. There are four types of rules available by which the data can be protected.
Redaction: Redaction is a data masking technique that enables you to mask data by removing or substituting all or part of the field value.
Masking: By selecting this method, the data can be masked by the given character. Once this option is selected, the following value needs to be given:
Masking character: Enter the character by which the data will be masked.
Characters to ignore: Enter the character which should be ignored while masking the data.
Type: Select either Full or Partial for masking the data.
Hashing: Hashing uses a cryptographic function to transform data into a fixed-length value through a mathematical process. Once this option is selected, select the Hash type from the drop-down to protect the data. There are three options available under the Hash type (a small hashing and masking sketch is given after this list of rules):
sha 256
sha 384
sha 512
Date generalization: For this rule, select a column that contains only date values. There are four options under this rule:
Year
Month
Quarter
Week
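The following minimal Python sketch (for intuition only; it is not the component's implementation, and the sample value is hypothetical) shows what SHA-256 hashing and partial masking of a column value look like:
import hashlib

value = "9876543210"                                          # a sample value from the protected column
hashed = hashlib.sha256(value.encode("utf-8")).hexdigest()    # fixed-length SHA-256 digest
masked = "*" * (len(value) - 4) + value[-4:]                  # partial masking with '*', keeping the last 4 characters
print(hashed)
print(masked)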
The Data Preparation component allows users to run data preparation scripts on selected datasets. These datasets can be created from sources such as the sandbox or by using a data connector. With Data Preparation, you can easily apply a data preparation with a single click. This automates common data cleaning and transformation tasks, such as filtering, aggregation, mapping, and joining.
All component configurations are classified broadly into the following sections:
Basic Information
Meta Information
Follow the steps given in the demonstration to configure the Data Preparation component.
Select the Data Preparation from the Transformations group and drag it to the Pipeline Editor Workspace.
The user needs to connect the Data Preparation component with an In-Event and Out Event to create a Workflow as displayed below:
The following two options are provided under the Data Center Type field:
Data Set
Data Sandbox
Please Note: Based on the selected option for the Data Center Type field the configuration fields will appear for the Meta Information tab.
Navigate to the Meta Information tab.
Data Center Type: Select Data Set as the Data Center Type.
Data Set Name: Select a Data Set Name using the drop-down menu.
Preparation(s): The available Data Preparations will be listed under the Preparation(s) field for the selected Data Set. Select a Preparation using the checkbox. Once the preparation is selected, it will display the list of transformations done in that selected preparation. Please see the image given below for reference.
Navigate to the Meta Information tab.
Data Center Type: Select Data Sandbox as the Data Center Type.
Data Sandbox Name: Select a Data Sandbox Name using the drop-down menu.
Preparation(s): The available Data Preparations will be listed under the Preparation(s) field for the selected Data Sandbox. Select a Preparation using the checkbox. Once the preparation is selected, it will display the list of transformations done in that selected preparation. Please see the image given below for reference.
Please Note:
Once Meta Information is configured, the same transformation will be applied to the in-Event data which has been done while creating the Data Preparation. To ensure the same transformation is applied to the in-event data, the user must have used the same source data during the previous event where the preparation was conducted.
If the file is uploaded to the Data Sandbox by an Admin user, it will not be visible or listed in the Sandbox Name field of the Meta information for the Data Preparation component to non-admin users.
A success notification message appears when the component gets saved.
Save and Run the Pipeline workflow.
Please Note: Once the Pipeline workflow gets saved and activated, the related component logs will appear under the Logs panel. The Preview tab appears for the concerned component, displaying a preview of the data. The schema preview can be accessed under the Preview Schema tab.
The REST API component provides a way for applications to interact with a web-based service through a set of predefined operations or HTTP methods, such as GET, POST, PUT, PATCH, and DELETE.
All component configurations are classified broadly into the following sections:
Meta Information
Follow the steps given in the demonstration to configure the Rest API component.
Source Name: Provide the name of the source you would like to reference.
URL: It is a unique identifier that specifies the location of a resource on the web.
Request Type: The Request Type refers to the HTTP method or operation used to interact with a web service or resource. The most common HTTP methods used in REST APIs are:
GET: It retrieves data from a resource.
POST: It submits new data to a resource.
PUT: It updates existing data in a resource.
DELETE: It deletes data from a resource.
PATCH: It is to partially update a resource or entity on the server.
Query Params: These are additional parameters that can be passed along with the URL of a request to modify the data returned by an API endpoint.
Headers: These refer to the additional information sent in the HTTP request or response along with the actual data.
Authorization: It refers to the process of verifying that a user or application has the necessary permissions to access a particular resource or perform a particular action.
Iteration(s): It refers to the process of retrieving a collection of resources from a web service in a paginated manner, where each page contains a subset of the overall resource collection.
Delay(in sec): It refers to a period of time between when a request is made by a client and when a response is received from the server.
Body: It refers to the data or payload of a request or response message.
The MongoDB Aggregation component allows users to group and transform data from one or more MongoDB collections. The aggregation query pipeline consists of a series of stages that can be used to filter, group, project, sort, and transform data from MongoDB collections.
All component configurations are classified broadly into the following sections:
Basic Information
Meta Information
Connection Validation
Please follow the given demonstration to configure the Mongo Aggregation component in a pipeline workflow.
Connection Type: Select the connection type from the drop-down:
Standard
SRV
Connection String
Host IP Address (*): Hadoop IP address of the host.
Port(*): Port number (It appears only with the Standard Connection Type).
Username(*): Provide username.
Password(*): Provide a valid password to access the MongoDB.
Database Name(*): Provide the name of the database from where you wish to read data.
Collection Name(*): Provide the name of the collection.
Additional Parameters: Provide the additional parameters to connect with MongoDB. This field is optional.
Enable SSL: Check this box to enable SSL for this component. MongoDB connection credentials will be different if this option is enabled.
Certificate Folder: This option appears when the Enable SSL field is checked. The user has to select the certificate folder from the drop-down; it contains the files that have been uploaded to the Admin Settings for connecting MongoDB with SSL. Please refer to the images given below for reference.
Script: Write the Mongo Aggregation script in this field.
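An illustrative aggregation script (the stages are standard MongoDB aggregation stages, while the field names status, region, and sales_amount are hypothetical) that filters, groups, and sorts documents might look like:
[
  { "$match": { "status": "completed" } },
  { "$group": { "_id": "$region", "total_sales": { "$sum": "$sales_amount" } } },
  { "$sort": { "total_sales": -1 } }
]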
Schema Validator component lets the users create validation rules for their fields, such as allowed data types, value ranges, and nullability.
The Schema Validator component has two outputs. The first one is to pass the Schema Validated records successfully and the second one is to pass the Bad Record Event.
Check out the given demonstration to configure the Schema Validator component.
All component configurations are classified broadly into the following sections:
Meta Information
Please Note: Schema Validator will only work for flat JSON data, it will not work on array/list or nested JSON data.
Access the Schema Validator component from the Ingestion component.
Drag the Schema Validator component to the Pipeline workflow canvas and connect it with the required components.
Click the Schema Validator component.
It displays the component configuration tabs below:
Invocation Type: The user can select any one invocation type from the given choices.
Real-Time: If the Real-Time option is selected in invocation type, the component never goes down when the pipeline is active. This is for the situations when you want to keep the component ready all the time to consume data.
Batch: If Batch is selected as the invocation type, the component needs a trigger from the previous event to initiate the process. Once the process of the component is finished and there are no new Events to process, the component goes down.
Batch Size: The Pipeline components process the data in batches. This batch size is given to define the maximum number of records that you want to process in a single cycle of operation. This is really helpful if you want to control the number of records being processed by the component if the unit record size is huge.
Failover Event: The Failover Event can be mapped through the given field. If the component fails due to any glitch, all the data needed to perform the operation goes to this Event along with the failure cause and timestamp.
Intelligent Scaling: Enabling this option helps the component scale up to the maximum number of instances automatically, reducing the data processing backlog. This feature detects the need to scale up the component in case of higher data traffic.
Schema File Name: This specifies the name of uploaded Schema file.
Choose File: This option allows you to upload your schema file.
View Schema: This option allows you to view uploaded schema file.
Remove File: This option allows you to remove uploaded schema file.
Mode: Two choices are provided under the Meta Information of the Schema Validator to choose a mode:
Strict: Strict mode intends to prevent any unexpected behaviors or silently ignored mistakes in user schemas. It does not change any validation results compared with the specification, but it makes some schemas invalid and throws exception or logs warning (with strict: "log" option) in case any restriction is violated and sends them to bad record event.
Allow Schema Drift: Schema drift is the case where the used data sources often change metadata. Fields, columns, types etc. can be added, removed, or changed on the fly. It allows slight changes in the data schema.
Bad Records Event: The records that are rejected by the Schema Validator will automatically go to the Bad Records Event.
Please Note:
The Event component connected to the second node of the Schema Validator component automatically gets mapped as the Bad record Event in the workflow.
The user will not be able to copy and paste Schema Validator component in the pipeline as the Copy option has been disabled for Schema Validator.
Future Plan: The Schema Validator component will be able to process nested JSON data.
Click the Save Component in Storage icon.
Example of flat JSON data (supported): {'Emp_id': 248, 'Age': 20, 'city': 'Mumbai', 'Dept': 'Data_Science'}
Example of nested JSON data (not supported): {'Id': 248, 'name': 'smith', 'marks': [80, 85, 70, 90, 91], 'Dept': {'dept_id': 20, 'dept_name': 'data_science', 'role': 'software_engineer'}}
These components facilitate user notification on various channels like Teams, Slack and email based on their preferences. Notifications can be delivered for success, failure, or other relevant events.
Check out the given walk-through on how to pull the committed Python script from the VCS.
Navigate to the Python Script component configuration section and click on the Pull Script from VCS icon.
The Pull Script from VCS dialog box opens.
Select a specific version that you wish to Pull.
Click the Ok option.
A notification message appears to inform the user that the available versions of the script are getting pulled from the VCS.
The user gets another notification regarding the script getting pulled from the selected version by the user.
The final success notification message appears informing the users about the completion of the Pull action and the selected version of the script gets pulled.
Check out the given walk-through on how to push the committed Python script to the VCS.
Navigate to the Python Script component configuration section and click on the Push Script to VCS icon.
The Push Script to VCS dialog box opens.
Provide a commit message for the Script that you wish to push.
Click the Ok option.
A notification message appears informing the user that the Push to VCS has been started.
A success notification message appears informing the user that the Push to VCS action has been completed.
This component can be used to connect to a remote server/machine and run script files present there based on some events.
All component configurations are classified broadly into the following sections:
Basic Information
Meta Information
Please follow the given steps in the demonstration to use the Script Runner component in a pipeline workflow.
The Script Runner component is provided under the Scripting section of the Component Palette.
Drag and drop Script Runner Component to the Workflow Editor.
Open the dragged Script Runner component to open the component configuration tabs.
The Basic Information tab opens by default.
Invocation Type: Select an Invocation type from the drop-down menu to confirm the running mode of the script runner component. The supported invocation types are Real-Time and Batch.
Deployment Type: It displays the deployment type for the component. This field comes pre-selected.
Container Image Version: It displays the image version for the docker container. This field comes pre-selected.
Failover Event: Select a failover event from the drop-down menu.
Batch Size (min 10): Provide the maximum number of records to be processed in one execution cycle (Min limit for this field is 10).
Open the Meta Information tab and configure the required information.
Host: Host IP of the remote server/machine
Username: Username of the remote server/machine.
Port: Provide machine Port number.
Authentication: Select an authentication option from the drop-down menu.
Password: By selecting this option the user needs to pass the password.
PEM/PPK File: By selecting this option the user needs to pass the authentication file to connect to the server.
Script type: Choose the type of script file that you want to run from the SSH/PERL/command options.
File path: Path of the file that is stored at the remote server.
File Name: The script file that you want to execute.
Event File Location: This is the location of the file sent through the file monitor (non-mandatory).
Please Note: The displayed fields may vary based on the selected Authentication option.
Component Properties when the Authentication option is Password.
Component Properties when the Authentication option is PEM/PPK File.
Manual Arguments (Optional): These are the arguments to the parameter of the script that the user can provide manually.
Event Arguments (Optional): These are the arguments to the parameter coming from the previous event/Kafka topic.
Click the Save Component in Storage icon (A notification message appears to confirm the action completion).
The Script Runner component gets configured, and the notification message appears to inform the same.
Please Note: The component can connect to the remote machine using the details provided. It will pick the file from the location in that machine using the file name and file path respectively and finally execute the script after passing arguments (if any).
Limitations
a. It accepts only lists as input, i.e., the in-event data should be a list.
b. It sends data to the out-event only when the script produces output through a print statement; otherwise, there will be no data on the out-event.
c. The data produced from the script is of a list type.
Check out the given demonstrations to understand the configuration steps involved in the PySpark Script.
Please Note: Do not provide 'test' as the component name, and do not start the component name with 'test', in the Component Name field in the Meta Information of the Python Script component. The word 'test' is used at the backend for some development processes.
Component Name: Provide a name to the component. Please note that the component name should be without space and special characters. Use the underscore symbol to show space between words.
Start Function Name: It displays all the function names used in the PySpark script in a drop-down menu. Select one function name with which you want to start.
In Event Data Type: The user will find two options here:
DataFrame
List of Dictionary
External Libraries: The user can provide external PySpark libraries in the script. The user can enter multiple library names separated by commas.
Execution Type: Select the Type of Execution from the drop-down. There are two execution types supported:
Custom Script: The users can write their custom PySpark script in the Script field.
Script: The user can write their custom PySpark script in this field. Make sure the script contains at least one function. The user can also validate the script by clicking the Validate Script option in this field.
Start Function: Here, all the function names used in the script will be listed. Select the start function name to execute the PySpark script.
Input Data: If any parameter has been given in the function, then the parameter's name is provided as Key, and the value of the parameters has to be provided as a value in this field.
DSLab Script: In this execution type, the user can use the script exported from the DSLab notebook. The user needs to provide the following information if this option is selected as an Execution Type:
Project Name: Select the same Project using the drop-down menu where the Notebook has been created.
Script Name: This field will list the exported Notebook names from the Data Science Lab module to the Data Pipeline.
Start Function: All the function names used in the script will be listed here. Select the start function name to execute the PySpark script.
Input Data: If any parameter has been given in the function, then the parameter's name is provided as Key, and the value of the parameters has to be provided as a value in this field.
Pull script from VCS: It allows the user to pull the desired committed script from the VCS.
Push script to VCS: It allows the user to commit different versions of a script to the VCS.
A task can be scheduled to automatically execute at a given scheduler time.
All component configurations are classified broadly into the following sections:
Basic Information
Meta Information
Check out the walk-through on Scheduler to get an idea on how to configure & use it in a workflow.
Drag and drop the Scheduler component to the Workflow Editor.
Connect it with a reader or Data Loading component (Input event).
Click on the Scheduler component to get the configuration details.
The Basic Information tab opens by default.
Select an Invocation type from the drop-down menu to confirm the running mode of the reader component. The supported invocation type is Real-Time.
Deployment Type: It displays the deployment type for the component. This field comes pre-selected.
Container Image Version: It displays the image version for the docker container. This field comes pre-selected.
Failover Event: Select a failover event from the drop-down menu.
Batch Size (min 10): Provide the maximum number of records to be processed in one execution cycle (Min limit for this field is 10).
Open the Meta Information tab and provide the required details for the same.
Scheduler Name: Provide a name for the Scheduler.
Payload: Provide the Payload value in the predefined format.
Timezone: Select the timezone from the drop-down as per requirement.
Scheduler Time: Time generated based on user selection in Cron generator.
Cron Generator: A cron generator generates cron expression which is a string that represents a set of times at which a task will be executed. The user can set any time using the given units of time from Minutes to Year.
Please Note:
The different values for various time units will be as given below:
Hours (0-23)
Minutes (0-59)
Seconds (0-59)
Months in words (January-December)
Months in digits (1-12)
Weekday (Monday-Sunday)
Date of month (1-31); for February, the date of the month goes up to 29.
The supported units of time for generating a Cron expression are as given below:
WEEKLY - Scheduler can be scheduled for every given day of the week at specified time.
MONTHLY - Scheduler can be scheduled at every given date of a given month and specified time/given time interval.
During the nearest Weekday: By enabling this option the Pipeline will be scheduled during the nearest weekday.
There is another option where the scheduler can be scheduled at the first-last occurrence of a day (Monday-Sunday), starting at a given month at a specified time and ending after a given count of months. E.g., the second Monday of every month, starting in March and continuing for 3 months (count).
YEARLY - The Scheduler can be scheduled for a given day of any month at a specified time, executing every year.
During the nearest Weekday: By enabling this option the Pipeline will be scheduled during the nearest weekday.
The user has another option where the scheduler can be scheduled at the first-last occurrence of a day (Monday-Sunday) of a given month at a specified time.
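As an illustration, assuming a Quartz-style expression (seconds, minutes, hours, day of month, month, day of week, and an optional year), an expression such as 0 0 9 ? * MON * would trigger the scheduler every Monday at 09:00.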
The scheduled pipelines are listed together with the scheduler details. It displays the meta-information filled in the scheduler component for the respective pipeline. The page also contains information on how many times the pipeline has been triggered and when the next time the scheduled component will get deployed.
Check out the given walk-through on the Scheduler List Page.
All the scheduled workflows get displayed on the Scheduler List Page of the Data Pipeline.
The user can access the Scheduler icon from the left-side panel of the Pipeline landing page.
By clicking on a Pipeline the right-side panel describes how many times the Pipeline has been triggered.
The user gets redirected to the concerned pipeline by clicking the Pipeline name from the Scheduler List page.
The Python script component is designed to allow users to write their own custom Python scripts and run them in the pipeline. It also enables users to directly use scripts written in a DSLab notebook and run them in the pipeline.
Check out the given demonstrations to understand the configuration steps involved in the Python Script.
All component configurations are classified broadly into the following sections:
Meta Information
Please Note: Do not provide 'test' as the component name, and do not start the component name with 'test', in the Component Name field in the Meta Information of the Python Script component. The word 'test' is used at the backend for some development processes.
Component Name: Provide a name to the component. Please note that the component name should be without space and special characters. Use the underscore symbol to show space in between words.
Start Function Name: It displays all the function names used in the python script in a drop-down menu. Select one function name with which you want to start.
In Event Data Type: The user will find two options here:
DataFrame
List of Dictionary
External Libraries: The user can provide some external python library in order to use them in the script. The user can enter multiple library names separated by commas.
Execution Type: Select the Type of Execution from the drop-down. There are two execution types supported:
Custom Script: The user can write their own custom Python script in the Script field.
Script: The user can write their own custom Python script in this field. Make sure the script contains at least one function. The user can also validate the script by clicking the Validate Script option in this field.
Start Function: Here, all the function names used in the script will be listed. Select the start function name to execute the python script.
Input Data: If any parameter has been given in the function, then the name of the parameter is provided as Key, and value of the parameters has to be provided as value in this field.
DSLab Script: In this execution type, the user can use a script exported from a DSLab notebook. The user needs to provide the following information if this option is selected as the Execution Type:
Project Name: Select the same Project using the drop-down menu where the Notebook has been created.
Script Name: This field will list the exported Notebook names which are exported from the Data Science Lab module to Data Pipeline.
Start Function: Here, all the function names used in the script will be listed. Select the start function name to execute the python script.
Input Data: If any parameter has been given in the function, then the name of the parameter is provided as Key, and value of the parameters has to be provided as value in this field.
Pull script from VCS: It allows the user to pull the desired committed script from the VCS.
Push script to VCS: It allows the user to commit different versions of a script to the VCS.
Please Note: The below-given instructions should be followed while writing a Python script in the Data Pipeline:
If the script in the component is the same as the committed script, it won't be committed again. You can push any number of different scripts by giving different commit messages.
The versions of the committed scripts will be listed as V1, V2, and so on.
The Python script needs to be written inside a valid Python function. E.g., The entire code body should be inside the proper indentation of the function (Use 4 spaces per indentation level).
The Python script should have at least one main function. Multiple functions are acceptable, and one function can call another function.
It should be written above the calling function body (if the called function is an outer function).
It should be written above the calling statement (if the called function is an inner function).
Spaces are the preferred indentation method.
Do not use 'type' as the function argument as it is a predefined keyword.
The code in the core Python distribution should always use UTF-8.
Single-quoted strings and double-quoted strings are considered the same in Python.
All the packages used in the function need to import explicitly before writing the function.
The Python script should return data in the form of a DataFrame or List only. The form of data should be defined while writing the function.
If the user needs to use some external library, the user needs to mention the library name in the external libraries field. If the user wants to use multiple external libraries, the library names should be separated by a comma.
If the user needs to pass some external input to the main function, the Input Data field can be used. The key name should match the parameter name in the function, and the value can be set as per the requirement.
This feature enables the user to send data directly to the Kafka Event or data sync event connected to the component. Below is the command to configure Custom Kafka Producer in the script:
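The original snippet is not reproduced here; based on the parameter description further below, the call has the following shape (kaf_obj is assumed to be provided by the platform at runtime):
kaf_obj.kafka_produce(df, "@EVENT.OUTEVENT", "optional failure message")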
Please Note:
If using @EVENT.OUTEVENT as the Event_name, the Python Script component must be connected with the Kafka Event to send the data to the connected event.
If using a specific "Event_Name" in the custom Kafka producer, it is not mandatory to connect the Kafka event with the component. It will send data directly to that specified Kafka event.
The Python Script component must be used in real-time when using Custom Kafka Producer to send data to the Kafka topic. Using it in batch mode can result in improper functionality of the monitoring page and potential WebSocket issues.
The Python Component has a custom logger feature that allows users to write their own custom logs, which will be displayed in the logs panel. Please refer to the code below for the custom logger:
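Based on the description below, the custom logger call has the following shape (log_obj is assumed to be provided by the platform at runtime):
log_obj.info("your custom log message")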
Please Note: Using this feature, the user cannot get the logs which contain environment variables.
Sample Python code to produce data using custom producer and custom logger:
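The original sample is not reproduced here; the sketch below (assuming log_obj and kaf_obj are injected by the platform, and the message strings are illustrative) shows the pattern that the explanation below refers to:
def main(df, key1, key2, key3):
    # df: data from the previous event (List of Dictionary or DataFrame)
    log_obj.info("Received " + str(len(df)) + " records")   # custom log line shown in the Logs panel
    log_obj.info("Parameters: " + str(key1) + ", " + str(key2) + ", " + str(key3))
    # produce the data to the connected out event; the third argument is an optional message
    kaf_obj.kafka_produce(df, "@EVENT.OUTEVENT", "sample message")
    return df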
Here,
df: Previous event data in the form of List or DataFrame connected to the Python component.
key1, key2, key3: Any parameter passed to the function from the Input Data section of the metadata info of the Python script component.
log_obj.info(): It is for custom logging and takes a string message as input.
kaf_obj.kafka_produce(): It is for the custom Kafka producer and takes the following parameters:
df: Data to produce – pandas.DataFrame and List of Dict types are supported.
Event name: Any Kafka event name in string format. If @EVENT.OUTEVENT is given, it sends data to the connected out event. If @EVENT.FAILEVENT is given, it sends the data to the failover event connected with the Python script component.
Any Failed Message: A message in string format can be given to append to the output data. The same message will be appended to all rows of data (this field is optional).
Please Note: If the data is produced to a Failover Event using the custom Kafka producer, that data will not be considered failed data: it will not be listed on the Failure Analysis page, and it will be reflected in green as processed records on the Data Metrics page.
The Custom Python Script transform component supports 3 types of scripts in the Data Pipeline.
1. As Reader Component: If you don’t have any in Event then you can use no argument function. For Example,
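A minimal sketch of such a no-argument function (the returned records are hypothetical):
def generate_data():
    # no in-event: the function itself creates the data and returns a list of dictionaries
    return [{"id": 1, "name": "alpha"}, {"id": 2, "name": "beta"}]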
2. As Transformation Component: If you have data to execute some operation, then use the first argument as data or a list of dictionaries. For Example,
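A sketch of such a function (the filter condition is illustrative):
def transform(df):
    # df: list of dictionaries coming from the previous event
    return [row for row in df if row.get("status") == "active"]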
Here, df holds the data coming from the previous event as an argument to the parameter of the method.
3. Custom Argument with Data: If there is a custom argument along with the data frame (i.e., the data is coming from the previous event and a custom argument is passed to a parameter of the function), then df will hold the data from the previous event, and the second parameter (e.g., arg_range) can be given in the Input Data section of the component.
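A sketch of this pattern (arg_range stands for the second parameter mentioned above, supplied through the Input Data section):
def transform_with_arg(df, arg_range):
    # df: data from the previous event; arg_range: custom value from the Input Data section
    return [row for row in df if row.get("marks", 0) >= int(arg_range)]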
Please Note:
The Custom Kafka producer in batch mode will not trigger the next component if the actual Kafka event name is given in place of @EVENT.OUTEVENT/@EVENT.FAILEVENT.
Script: The exported script appears under this space. The user can also validate the script by clicking the Validate Script option in this field. For more information about exporting the script from the DSLab module, please refer to the DSLab module documentation.
It is possible for a Data Pipeline user to version-control Python scripts. The user can push a version of the Python script to the VCS and pull a version of the Python script from the VCS.
The Email component is designed to send an email in a specified format to one or multiple receivers.
All component configurations are classified broadly into the following sections:
​​Basic Information​​
Meta Information
Follow the given steps in the demonstration to configure the Email component.
Subject: Specify the subject of the email.
HTML Editor: Specify the body of the email.
Attachment: Whether to send the in-event data as an attachment (Yes/No).
Receivers: Specify the receivers' email addresses (if more than one, provide the emails separated by commas).
TLS: Checkbox to enable or disable Transport Layer Security.
Email User Name: The email username that has been configured with the SMTP server.
Encryption Type: Select the encryption method from the dropdown.
SSL
TLS
None
Email Password: Provide the password for the email that has been configured in SMTP server.
Enable SSL: Option to enable SSL.
Email From: Email address of the sender that has been configured with the SMTP server.
Email Port: Provide the SMTP port.
Disable Email Sending: Option to disable the sending of the email.
Email Input: The selected input fields will go to the out-event and can be used in the email body.
Please Note: The below given image represents a model email sent to the configured receivers.
The Job Trigger component is designed to facilitate users in triggering the Python(On demand) Job through the pipeline. To configure this, users need to set up their Python (On demand) job name in the meta-information of the Job Trigger component within the pipeline. The in-event data of the Job Trigger component will then be utilized as a payload in the Python(On demand) Job.
Please go through the below given demonstration to configure Job Trigger component in the pipeline.
Create a pipeline that generates meaningful data to be sent to the out event, which will serve as the payload for the Python (On demand) job.
Connect the Job Trigger component to the event that holds the data to be used as payload in the Python (On demand) job.
Open the meta-information of the Job Trigger component and select the Python(On demand) job from the drop-down menu that needs to be activated by the Job Trigger component.
Please Note: The data from the previously connected event will be passed as JSON objects within a JSON Array and used as the payload for the Python (On demand) Job. For more information on the Python(On demand) job and its functionality, please refer to this link: Python (On demand) Job.
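For example, if the connected event holds two records, the payload received by the Python (On demand) Job would look similar to the following (field names are illustrative):
[{"order_id": 101, "status": "shipped"}, {"order_id": 102, "status": "pending"}]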
Alert component is designed to send alerts to Microsoft Teams and Slack channels based on incoming records from an event source.
All component configurations are classified broadly into the following sections:
​​Basic Information​​
Meta Information
Please go through the below given walk through to configure alert component.
Webhook URL: Provide the Webhook URL of the selected channel group where the Alert message needs to be sent.
Threshold: This field specifies the number of records required from in-event to send an alert to the selected channel group. The component will send an alert to the channel group only after reaching this threshold value. For example, if the threshold value is 100, it will send an alert to the selected channel group once it receives 100 records from the in-event.
Time Interval: This is the time gap in seconds between sending alerts. It is calculated as the difference between the current time and the time when the last alert was sent.
Channel: Select the channel from the drop-down. There are two channels available:
Teams
Slack
Type: Message Card. (This field will be Pre-filled)
Theme Color: Enter the Hexadecimal color code for ribbon color in the selected channel. Please refer the image given at the bottom of this page for the reference.
Sections: In this tab, the following fields are there:
Activity Title: This is the title of the alert that has to be sent to the Teams channel. Enter the Activity Title as per the requirement.
Activity Subtitle: Enter the Activity Subtitle. Please refer to the image given at the bottom of this page for reference.
Text: Enter the text message that should be sent along with the Alert.
Facts: In this section, provide the Name (as per your choice) and the Value. The Value can be either '@data.column_name', where 'column_name' is the name of the column in the in-event data whose value will be sent along with the alert message, or a custom message of your own. Please refer to the image at the bottom of this page for reference.
Add New Facts: The user can add more facts by clicking on this option.
Add New Section: The user can add more Sections by clicking on this option.
Sample Hexadecimal Color code which can be used in Alerts Component
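For instance, a value such as 0076D7 gives a blue ribbon, while FF0000 gives a red one.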
Webhook URL: Provide the Webhook URL of the selected channel group where the Alert message needs to be sent.
Threshold: This field specifies the number of records required from in-event to send an alert to the selected channel group. The component will send an alert to the channel group only after reaching this threshold value. For example, if the threshold value is 100, it will send an alert to the selected channel group once it receives 100 records from the in-event.
Time Interval: This is the time gap in seconds between sending alerts. It is calculated as the difference between the current time and the time when the last alert was sent.
Attachments: In this tab, the following fields are there:
Title: This is the title of the alert that has to be sent to the selected channel. Enter the Title as per the requirement.
Color: Enter the Hexadecimal color code for ribbon color in the Slack channel. Please refer the image given at the bottom of this page for the reference.
Text: Enter the text message which should be sent along with Alert.
Fields: In this tab, the following fields are there:
Title: The "Title" is a bold, prominent text that serves as the header for a section or block of content within an attachment. It is used to provide context or a brief summary of the information presented in that section. Titles are usually displayed at the top of an attachment.
Value: The "Value" is the main content of a section or block in an attachment. It contains the detailed information related to the title. Values can include text, numbers, links, or other types of data. They are often displayed below the title and provide the core information for a given section.
Short: The "Short" parameter is a Boolean value (typically "true" or "false") that determines whether the title and value should be displayed in a compact or short format. When set to "true," the title and value are displayed side by side, making the content more concise and readable. When set to "false," the title and value are typically displayed in separate lines, providing a more detailed view.
Footer: The "Footer" typically refers to additional information or content appended at the end of a message in a Slack channel. This can include details like a signature, contact information, or any other supplementary information that you want to include with your message. Footers are often used to provide context or additional context to the message content.
Timestamp: The "Timestamp" in a Slack channel usually represents the date and time when a message was posted. It helps users understand when a message was sent. Enter the Timestamp value in seconds. For example. the timestamp value for October 17, 2023 will be 1697540874.
Footer Icon: In Slack, the footer icon refers to an icon or image that is displayed at the bottom of a message or attachment. The footer icon can be a company logo, an application icon, or any other image that represents the entity responsible for the message. Enter image URL as the value of Footer icon.
To set the Footer icon in Slack, follow these steps:
Go to the desired image that has to be used as the footer icon.
Right-click on the image.
Select 'Copy image address' to get the image URL.
Now, the obtained image URL can be used as the value for the footer icon in Slack.
Sample image URL for Footer icon:
Please Note: The second alert will be sent only if it satisfies both the threshold value and the time interval value. The Time Interval value is calculated as the difference between the current time and the time when the last alert was sent.