Synthetic Data Generator

The Synthetic Data Generator component creates artificial datasets from a JSON Schema (Draft-07) definition. It lets users generate realistic test data for pipelines without relying on sensitive or production datasets.

Key features:

- Upload CSV/XLSX/JSON sample files to automatically generate a Draft-07 schema.
- Directly edit or upload Draft-07 schema JSON.
- Configure iterations, delays, and batch sizes for continuous data generation.
- Support for advanced schema rules: if-then-else conditions, weights, and mathematical calculations.

Configuration Sections

The component configuration is divided into three sections:

- Basic Information
- Meta Information
- Resource Configuration

Basic Information Tab

The Basic Information tab defines execution parameters.

| Field | Description | Required |
| --- | --- | --- |
| Invocation Type | Select execution mode: Real-Time. | Yes |
| Deployment Type | Displays the deployment type (pre-selected). | Yes |
| Container Image Version | Displays the Docker image version (pre-selected). | Yes |
| Failover Event | Select a failover event. | Optional |
| Batch Size | Maximum number of records per cycle (minimum: 10). | Yes |

Meta Information Tab

The Meta Information tab defines schema and data generation parameters.

| Field | Description | Required |
| --- | --- | --- |
| Iteration | Number of iterations for producing synthetic data. | Yes |
| Delay (sec) | Delay between iterations in seconds. | Yes |
| Batch Size | Number of records generated per iteration. | Yes |
| Upload Sample File | Upload CSV/XLSX/JSON file to auto-generate a Draft-07 schema. | Optional |
| Schema | Displays the Draft-07 schema. Can be edited directly. | Yes |
| Upload Schema | Upload Draft-07 schema in JSON format. | Optional |
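
To give a feel for what deriving a schema from a sample file involves, here is a minimal Python sketch that infers a Draft-07 schema from CSV text. The helper name infer_schema, the sample columns, and the inference rule (a column is "number" only if every value parses as a number, otherwise "string") are assumptions made for illustration; the component's actual inference logic is not described here and may differ.

```python
import csv
import io


def _is_number(text) -> bool:
    """Return True if the cell parses as a number."""
    try:
        float(text)
        return True
    except (TypeError, ValueError):
        return False


def infer_schema(csv_text: str) -> dict:
    """Infer a minimal Draft-07 schema from CSV text (hypothetical helper)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    properties = {}
    for column in (rows[0].keys() if rows else []):
        values = [row[column] for row in rows]
        # Assumed rule: all-numeric columns become "number", the rest "string".
        is_numeric = all(_is_number(value) for value in values)
        properties[column] = {"type": "number" if is_numeric else "string"}
    return {
        "$schema": "http://json-schema.org/draft-07/schema#",
        "type": "object",
        "properties": properties,
    }


# Illustrative sample data, not taken from the product.
sample = "customer_id,name,balance\n1,Ana,10.5\n2,Raj,7\n"
schema = infer_schema(sample)
```

The resulting dictionary can be serialized with json.dumps and then refined by hand, for example by adding format, enum, or weights to individual properties.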

Draft-07 Schema Capabilities

Supported Data Types

String
- Properties: maxLength, minLength, enum, weights, format
- Formats: date, date-time, name, country, state, email, uri, address, current_datetime
Number
- Properties: minimum, maximum, exclusiveMinimum, exclusiveMaximum, unique, start, enum, weights
Float
- Properties: minimum, maximum
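
The following Python sketch illustrates how a few of the properties listed above could constrain a generated value. It is not the component's implementation: generate_value is a hypothetical helper, only minLength/maxLength and minimum/maximum are handled, the fallback defaults are arbitrary, and integer bounds are assumed for numbers.

```python
import random
import string


def generate_value(prop: dict, rng: random.Random):
    """Produce one value honoring a small subset of the schema properties."""
    if prop["type"] == "string":
        min_len = prop.get("minLength", 0)
        max_len = prop.get("maxLength", min_len + 10)  # arbitrary default
        length = rng.randint(min_len, max_len)
        return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
    if prop["type"] == "number":
        low = prop.get("minimum", 0)
        high = prop.get("maximum", low + 100)  # arbitrary default
        # randint is inclusive on both ends and expects integer bounds.
        return rng.randint(low, high)
    raise ValueError(f"unsupported type: {prop['type']!r}")


rng = random.Random(7)  # seeded so runs are repeatable
code = generate_value({"type": "string", "minLength": 3, "maxLength": 8}, rng)
quantity = generate_value({"type": "number", "minimum": 1, "maximum": 50}, rng)
```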

Conditional Rules (if-then-else)

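Draft-07 schemas support if/then/else keywords for applying logical conditions during validation and generation. The $data reference and the use of minimum on a date string in the example are extensions beyond the standard Draft-07 vocabulary.

Example: ensuring task_end_date is on or after task_start_date: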

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "task_start_date": { "type": "string", "format": "date" },
    "task_end_date": { "type": "string", "format": "date" }
  },
  "if": {
    "properties": {
      "task_end_date": { "type": "string", "format": "date" },
      "task_start_date": { "type": "string", "format": "date" }
    }
  },
  "then": {
    "properties": {
      "task_end_date": {
        "format": "date",
        "minimum": { "$data": "task_start_date" }
      }
    }
  }
}
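
One way to picture the effect of this rule on generated data is the Python sketch below, which draws a start date and then an end date that is never earlier. The function generate_task_dates and its parameters (earliest, span_days, max_duration_days) are hypothetical and chosen for illustration; how the component actually resolves the condition is not documented here.

```python
import random
from datetime import date, timedelta


def generate_task_dates(
    rng: random.Random,
    earliest: date = date(2024, 1, 1),
    span_days: int = 365,
    max_duration_days: int = 30,
) -> dict:
    """Return a record whose task_end_date is on or after task_start_date."""
    start = earliest + timedelta(days=rng.randrange(span_days))
    # A non-negative offset guarantees end >= start.
    end = start + timedelta(days=rng.randint(0, max_duration_days))
    return {
        "task_start_date": start.isoformat(),
        "task_end_date": end.isoformat(),
    }


records = [generate_task_dates(random.Random(seed)) for seed in range(100)]
```

Generating the dependent field from the field it refers to, rather than drawing both independently and rejecting invalid pairs, keeps every record valid on the first attempt.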

Weighted Values

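Weights bias generated values toward particular enum entries. Each entry in weights applies to the enum value at the same position, so in this example "Young" is expected to appear in roughly 60% of generated records.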

"age": {
  "type": "string",
  "enum": ["Young", "Middle", "Old"],
  "weights": [0.6, 0.2, 0.2]
}
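
Weighted selection of this kind can be sketched in Python with random.choices from the standard library. The helper sample_weighted is hypothetical, and the sketch assumes that each weight pairs with the enum value at the same index; it illustrates the idea rather than the component's internals.

```python
import random
from collections import Counter

age_property = {
    "type": "string",
    "enum": ["Young", "Middle", "Old"],
    "weights": [0.6, 0.2, 0.2],
}


def sample_weighted(prop: dict, rng: random.Random, k: int) -> list:
    """Draw k values from prop["enum"], biased by prop["weights"]."""
    return rng.choices(prop["enum"], weights=prop["weights"], k=k)


rng = random.Random(42)  # seeded so runs are repeatable
counts = Counter(sample_weighted(age_property, rng, 10_000))
```

Over many draws the share of each value approaches its weight, so counts["Young"] should land near 6,000 here while the other two values each land near 2,000.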

Computed Fields

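Derived values can be defined with a calculation rule. The $eval expression refers to other fields of the same record through the data. prefix, as in this example where number3 is computed from number1 and number2.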

"number3": {
  "calculation": {
    "$eval": "data.number1 + data.number2 * 2"
  }
}
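
To show how such an expression could be evaluated, here is a small Python sketch that parses the expression with the ast module and allows only arithmetic over data.<field> references. The evaluate function is hypothetical, and the sketch applies Python operator precedence (multiplication before addition); the component's own evaluator and its supported operators may differ.

```python
import ast
import operator

# Arithmetic operators the sketch accepts; anything else is rejected.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def evaluate(expression: str, data: dict):
    """Evaluate arithmetic over `data.<field>` references without eval()."""

    def walk(node):
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if (
            isinstance(node, ast.Attribute)
            and isinstance(node.value, ast.Name)
            and node.value.id == "data"
        ):
            return data[node.attr]  # look up the referenced field
        raise ValueError(f"unsupported expression element: {ast.dump(node)}")

    return walk(ast.parse(expression, mode="eval").body)


record = {"number1": 3, "number2": 4}
record["number3"] = evaluate("data.number1 + data.number2 * 2", record)
```

Walking the syntax tree instead of calling eval() means an expression that tries to call a function or import a module raises ValueError rather than executing.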

Saving the Component Configuration

1. Configure Basic Information and Meta Information.
2. Click Save Component (Storage icon).
3. A confirmation message appears after saving.
4. Activate the pipeline to begin generating synthetic data.

Example Workflow

1. Upload a sample CSV file containing customer records.
2. The system generates a Draft-07 schema automatically.
3. Configure:
   - Iteration = 10
   - Delay = 5 seconds
   - Batch Size = 100
4. Save the component and activate the pipeline.
5. The component generates 10 batches of 100 synthetic customer records each (1,000 records in total), waiting 5 seconds between batches, and feeds them downstream.
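
The interaction of Iteration, Delay, and Batch Size in this workflow can be sketched as a simple Python loop. The names generate_batches and make_record are hypothetical, and skipping the delay after the final iteration is an assumption; the sketch only mirrors the field descriptions given earlier.

```python
import time
from typing import Callable, Iterator


def generate_batches(
    iterations: int,
    delay_sec: float,
    batch_size: int,
    make_record: Callable[[int], dict],
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[list]:
    """Yield one batch of records per iteration, pausing between iterations."""
    for iteration in range(iterations):
        first_index = iteration * batch_size
        yield [make_record(first_index + offset) for offset in range(batch_size)]
        if iteration < iterations - 1:  # assumed: no pause after the last batch
            sleep(delay_sec)


# Settings from the workflow above; sleep is stubbed out so the demo runs instantly.
batches = list(
    generate_batches(
        iterations=10,
        delay_sec=5,
        batch_size=100,
        make_record=lambda i: {"customer_id": i},
        sleep=lambda _seconds: None,
    )
)
total_records = sum(len(batch) for batch in batches)
```

Passing the sleep function as a parameter keeps the loop testable without waiting for the real delay.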
