Synthetic Data Generator


The Synthetic Data Generator component generates synthetic data based on a Draft-07 JSON schema that describes the data to be produced.

The user can also upload sample data in CSV or XLSX format, and the component will generate the Draft-07 schema for that data, as illustrated below.
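
For illustration only (hypothetical column names; the schema actually inferred by the component may differ), a CSV with a text column name and a numeric column score could produce a Draft-07 fragment along these lines:

{
  "$schema": "schema",
  "type": "object",
  "properties": {
    "name": {
      "type": "string"
    },
    "score": {
      "type": "number"
    }
  }
}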

Follow the steps below to create and use the Synthetic Data Generator component in a Pipeline workflow.

Drag and Drop the Component

  • Drag and drop the Synthetic Data Generator Component to the Workflow Editor.

  • Click on the dragged Synthetic Data Generator component to get the component properties tabs.

Basic Information Tab

Configure the Basic Information tab.

  • Invocation Type: Select an invocation type from the drop-down menu to confirm the running mode of the component. Select the Real-Time option.

  • Deployment Type: It displays the deployment type for the component. This field comes pre-selected.

  • Container Image Version: It displays the image version for the docker container. This field comes pre-selected.

  • Failover Event: Select a failover Event from the drop-down menu.

  • Batch Size (min 10): Provide the maximum number of records to be processed in one execution cycle (the minimum value for this field is 10).

Meta Information Tab

Configure the following information:

  • Iteration: Number of iterations for producing the data.

  • Delay (sec): Delay between each iteration in seconds.

  • Batch Size: Number of records to be produced in each iteration.

  • Upload Sample File: Upload a file containing sample data. CSV, Excel (XLSX), and JSON file formats are supported. Once the file is uploaded, the Draft-07 schema for the uploaded data is generated in the Schema tab.

  • Schema: The Draft-07 schema is displayed under this tab in an editable format.

  • Upload Schema: The user can directly upload a Draft-07 schema in JSON format from here. Alternatively, the user can paste the Draft-07 schema directly into the Schema tab.

Saving the Component Configuration

  • After completing the configurations, click the Save Component in Storage icon provided in the configuration panel to save the component.

  • A notification message appears confirming that the component configuration has been saved.

Please Note: Total number of generated records = Number of iterations * Batch size. For example, 5 iterations with a batch size of 100 produce 500 records.

Sample Schema File

A sample schema file is provided below for users to explore the component.

{
  "$schema": "schema",
  "type": "object",
  "properties": {
    "number1": {
      "type": "number"
    },
    "number2": {
      "type": "number"
    },
    "number3": {
      "type": "number"
    },
    "Company": {
      "type": "string",
      "enum": ["NIKO RESOURCES LIMITED", "TCS", "Accenture", "ICICI Bank", "Cognizant", "HDFC Bank", "Infosys"]
    },
    "Lead Origin": {
      "type": "string",
      "enum": ["Campaign", "Walk-in", "Social Media", "Existing Account"]
    },
    "Lead Stage": {
      "type": "string",
      "enum": ["Contact", "Lead", "Prospect", "Opportunity"]
    },
    "Lead Score": {
      "type": "number",
      "minimum": 0,
      "maximum": 10
    },
    "Order Value": {
      "type": "number",
      "minimum": 0,
      "maximum": 10000000
    },
    "Average Time Per Visit": {
      "type": "number",
      "minimum": 1,
      "unique": true
    },
    "Last Activity Date": {
      "type": "string",
      "format": "date",
      "minimum": "2020-01-01",
      "maximum": "2023-01-01"
    },
    "Created On": {
      "type": "string",
      "format": "date",
      "minimum": "2020-01-01",
      "maximum": "2023-01-01"
    },
    "Modified On": {
      "type": "string",
      "format": "date",
      "minimum": "2020-01-01",
      "maximum": "2023-01-01"
    },
    "Lead Conversion Date": {
      "type": "string",
      "format": "date",
      "start": "2020-01-01",
      "interval": 365,
      "occurrence": 2
    },
    "Mobile Number": {
      "type": "string",
      "pattern": "^\\+?\\d{1,3}[-.\\s]?\\(?(\\d{1,3})\\)?[-.\\s]?\\d{1,4}[-.\\s]?\\d{1,4}$"
    },
    "Source Medium": {
      "type": "string",
      "enum": ["Website", "Direct Calls", "Referral"]
    },
    "Source Campaign": {
      "type": "string",
      "enum": ["Campaign A", "Campaign B", "Campaign C"]
    },
    "Email": {
      "type": "string",
      "format": "email"
    },
    "Last Activity": {
      "type": "string",
      "enum": ["Page Visited on Website", "Email Opened", "Unreachable", "Converted to Lead"]
    },
    "State": {
      "type": "string",
      "format": "state"
    },
    "Country": {
      "type": "string",
      "format": "country"
    },
    "Names": {
      "type": "string",
      "format": "name"
    },
    "Address": {
      "type": "string",
      "format": "address"
    },
    "Datetime_value": {
      "type": "string",
      "format": "Current_datetime"
    },
    "Specialization": {
      "type": "string"
    }
  },
  "required": [
    "number1",
    "number3",
    "Company",
    "Lead Origin",
    "Mobile Number",
    "Source Medium",
    "Source Campaign",
    "Email",
    "Lead Stage",
    "Lead Score",
    "Order Value",
    "Average Time Per Visit",
    "Last Activity",
    "Last Activity Date",
    "Created On",
    "Modified On",
    "Lead Conversion Date",
    "State",
    "Country",
    "Names",
    "Address",
    "Datetime_value",
    "Specialization"
  ],
  "if": {
    "properties": {
      "number1": {
        "type": "number"
      },
      "number2": {
        "type": "number"
      }
    }
  },
  "then": {
    "properties": {
      "number1": {
        "maximum": {
          "$data": "number2"
        }
      },
      "number3": {
        "calculation": {
          "$eval": "data.number1 + data.number2 * 2"
        }
      }
    }
  }
}
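
For orientation, here is a hypothetical record that a schema like the one above could produce (all values are illustrative only; actual generated values will vary on each run):

{
  "number1": 4,
  "number2": 7,
  "number3": 18,
  "Company": "TCS",
  "Lead Origin": "Campaign",
  "Lead Stage": "Lead",
  "Lead Score": 6,
  "Order Value": 250000,
  "Average Time Per Visit": 42,
  "Last Activity Date": "2021-06-14",
  "Created On": "2020-11-02",
  "Modified On": "2022-03-09",
  "Lead Conversion Date": "2020-01-01",
  "Mobile Number": "+91-982-555-0147",
  "Source Medium": "Website",
  "Source Campaign": "Campaign A",
  "Email": "lead.example@example.com",
  "Last Activity": "Email Opened",
  "State": "Gujarat",
  "Country": "India",
  "Names": "Asha Mehta",
  "Address": "12 MG Road, Ahmedabad",
  "Datetime_value": "2023-05-01 10:15:30",
  "Specialization": "Finance"
}

Note how number3 follows the "$eval" rule from the then block (4 + 7 * 2 = 18) and number1 does not exceed number2.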

Please Note: Weights can be given in order to handle bias across the generated data. The weights must add up to exactly 1.

"age": { "type": "string", "enum": ["Young", "Middle", "Old"], "weights": [0.6, 0.2, 0.2] }

Types and their properties

Type: "string"

Properties:

  • maxLength: Maximum length of the string.

  • minLength: Minimum length of the string.

  • enum: A list of values that the string can take.

  • weights: Weights for each value in the enum list.

  • format: Available formats include 'date', 'date-time', 'name', 'country', 'state', 'email', 'uri', and 'address'.

For 'date' and 'date-time' formats, the following properties can be set:

  • minimum: Minimum date or date-time value.

  • maximum: Maximum date or date-time value.

  • interval: For 'date' format, the interval is the number of days. For 'date-time' format, the interval is the time difference in seconds.

  • occurrence: Indicates how many times a date/date-time needs to repeat in the data. It should only be used together with the 'interval' and 'start' keywords.

A new format has been introduced for the string type: 'current_datetime'. This format generates records with the current date-time.
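
For illustration, a minimal fragment (with hypothetical field names) combining the string properties described above: a weighted enum and a bounded 'date' field.

{
  "Status": {
    "type": "string",
    "enum": ["Open", "Pending", "Closed"],
    "weights": [0.5, 0.3, 0.2]
  },
  "Signup Date": {
    "type": "string",
    "format": "date",
    "minimum": "2021-01-01",
    "maximum": "2022-12-31"
  }
}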

Type: "number"

Properties:

  • minimum: The minimum value for the number.

  • maximum: The maximum value for the number.

  • exclusiveMinimum: Indicates whether the minimum value is exclusive.

  • exclusiveMaximum: Indicates whether the maximum value is exclusive.

  • unique: Determines if the field should generate unique values (True/False).

  • start: Associated with unique values, this property determines the starting point for unique values.

  • enum: A list of values that the number can take.

  • weights: Weights for each value in the enum list (see the example below).
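
For illustration, a minimal fragment (with hypothetical field names) using the number properties above: a bounded value and a unique, incrementing identifier.

{
  "Quantity": {
    "type": "number",
    "minimum": 1,
    "maximum": 100,
    "exclusiveMaximum": true
  },
  "Order ID": {
    "type": "number",
    "unique": true,
    "start": 1000
  }
}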

Type: "float"

Properties:

  • minimum: The minimum float value.

  • maximum: The maximum float value.
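
For illustration, a minimal float field (hypothetical name) bounded by these properties:

{
  "Discount Rate": {
    "type": "float",
    "minimum": 0.0,
    "maximum": 0.5
  }
}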

Note: Draft-07 schemas allow for the use of if-then-else conditions on fields, enabling complex validations and logical checks. Additionally, mathematical computations can be performed by specifying conditions within the schema.

Example: Here the number3 value will be calculated based on the "$eval": "data.number1 + data.number2 * 2" condition.

"if": {
    "properties": {
      "number1": {
        "type": "number"
      },
      "number2": {
        "type": "number"
      }
    }
  },
  "then": {
    "properties": {
      "number1": {
        "maximum": {
          "$data": "number2"
        }
      },
      "number3": {
        "calculation": {
          "$eval": "data.number1 + data.number2 * 2"
        }
      }
    }
  }

[Image: Synthetic Data Generator component]
[Image: Meta Information for Schema Data Generator]