Synthetic Data Generator

The Synthetic Data Generator component generates data from a Draft07 schema that describes the data to be produced.

The user can also upload sample data in CSV or XLSX format, and the component generates the Draft07 schema for that data.
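To make the idea of deriving a schema from sample data concrete, here is a minimal sketch in Python. It is not the component's actual logic: the helper names `infer_type` and `infer_schema` are hypothetical, only CSV input is handled, and the type inference (number vs. string) is a deliberate simplification.

```python
import csv
import io
import json


def infer_type(values):
    """Return a JSON Schema type name for a column of raw CSV strings."""
    def is_number(text):
        try:
            float(text)
            return True
        except ValueError:
            return False

    non_empty = [v for v in values if v != ""]
    if non_empty and all(is_number(v) for v in non_empty):
        return "number"
    return "string"


def infer_schema(csv_text):
    """Build a minimal Draft07-style schema from CSV text with a header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    columns = list(reader.fieldnames or [])
    properties = {
        name: {"type": infer_type([row[name] for row in rows])}
        for name in columns
    }
    return {
        "$schema": "http://json-schema.org/draft-07/schema#",
        "type": "object",
        "properties": properties,
        "required": columns,
    }


# Tiny in-memory sample; a real file would be read from disk instead.
sample = "Company,Lead Score\nTCS,7\nInfosys,3.5\n"
schema = infer_schema(sample)
print(json.dumps(schema, indent=2))
```

A schema produced this way would normally still need manual refinement, for example adding `enum`, `minimum`, or `maximum` constraints.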

Follow the steps below to create and use the Synthetic Data Generator component in a Pipeline workflow.

Drag and Drop the Component

Drag and drop the Synthetic Data Generator Component to the Workflow Editor.

Click the dragged Synthetic Data Generator component to open the component properties tabs.

Basic Information Tab

Configure the Basic Information tab.

Invocation Type: Select an Invocation type from the drop-down menu to set the running mode of the component. Select the Real-Time option.
Deployment Type: It displays the deployment type for the component. This field comes pre-selected.
Container Image Version: It displays the image version for the docker container. This field comes pre-selected.
Failover Event: Select a failover Event from the drop-down menu.
Batch Size (min 10): Provide the maximum number of records to be processed in one execution cycle (the minimum limit for this field is 10).

Meta Information Tab

Configure the following information:

Iteration: Number of iterations for producing the data.
Delay (sec): Delay between each iteration in seconds.
Batch Size: Number of records to be produced in each iteration.
Upload Sample File: Upload the file containing sample data. CSV, Excel (XLSX), and JSON file formats are supported. Once the file is uploaded, the Draft07 schema for the uploaded file is generated in the Schema tab.
Schema: The Draft07 schema is displayed under this tab in an editable format.
Upload Schema: The user can upload the Draft07 schema in JSON format from here. Alternatively, the user can paste the Draft07 schema directly into the Schema tab.

Saving the Component Configuration

After completing all the configurations, click the Save Component in Storage icon provided in the configuration panel to save the component.

A notification message appears to confirm that the component configuration has been saved.

Please Note: Total number of generated records = Number of iterations * Batch Size
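The relationship between the Iteration, Delay (sec), and Batch Size settings can be sketched as a simple loop. This is only an illustration of the arithmetic in the note above, not the component's implementation; `run_generator` and `make_record` are hypothetical names, and the delay is set to 0 so the example finishes immediately.

```python
import time


def run_generator(iterations, batch_size, delay_sec, make_record):
    """Yield one batch per iteration, pausing delay_sec between iterations."""
    for i in range(iterations):
        yield [make_record() for _ in range(batch_size)]
        if i < iterations - 1:
            time.sleep(delay_sec)  # no pause is needed after the last batch


batches = list(
    run_generator(
        iterations=3,
        batch_size=10,
        delay_sec=0,
        make_record=lambda: {"Lead Score": 5},  # placeholder record
    )
)
total = sum(len(batch) for batch in batches)
print(total)  # 3 iterations * 10 records per batch = 30
```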

Sample Schema File

A sample schema file is given below for users to explore the component.

{
  "type": "object",
  "properties": {
    "Company": {
      "type": "string",
      "enum": ["NIKO RESOURCES LIMITED", "TCS","Accenture","ICICI Bank","Cognizant","HDFC Bank","Infosys"]
    },
    "Lead Origin": {
      "type": "string",
      "enum": ["Campaign", "Walk-in", "Social Media","Existing Account"]
    },
    "Mobile Number": {
      "type": "string",
      "pattern": "^\\+?\\d{1,3}[-.\\s]?\\(?(\\d{1,3})\\)?[-.\\s]?\\d{1,4}[-.\\s]?\\d{1,4}$"
    },
    "Lead Source": {
      "type": "string",
      "enum": ["Source A", "Source B", "Source C"]
    },
    "Source Medium": {
      "type": "string",
      "enum": ["Website", "Direct Calls", "Referral"]
    },
    "Source Campaign": {
      "type": "string",
      "enum": ["Campaign A", "Campaign B", "Campaign C"]
    },
    "Do Not Email": {
      "type": "string",
      "enum": ["Yes","No"]
    },
    "Do Not Call": {
      "type": "string",
      "enum": ["Yes","No"]
    },
    "Lead Stage": {
      "type": "string",
      "enum": ["Contact","Lead","Prospect","Opportunity"]
    },
    "Lead Score": {
      "type": "number",
      "minimum" : 0,
      "maximum" : 10
    },
    "Order Value": {
      "type": "number",
      "minimum" : 0,
      "maximum" : 10000000
    },
    "Engagement Score": {
      "type": "number",
      "minimum" : 0,
      "maximum" : 100
    },
    "TotalVisits": {
      "type": "number",
      "minimum" : 0,
      "maximum" : 10
    },
    "Average Time Per Visit": {
      "type": "number",
      "minimum" : 1,
      "maximum" : 50
    },
    "Last Activity": {
      "type": "string",
      "enum": ["Page Visited on Website","Email Opened","Unreachable","Converted to Lead"]      
    },
    "Last Activity Date": {
      "type": "string",
      "format" : "date",
      "minimum" : "2020-01-01",
      "maximum" : "2023-01-01"
    },
    "Created On": {
      "type": "string",
      "format" : "date",
      "minimum" : "2020-01-01",
      "maximum" : "2023-01-01"
    },
    "Modified On": {
      "type": "string",
      "format" : "date",
      "minimum" : "2020-01-01",
      "maximum" : "2023-01-01"
    },
    "Lead Conversion Date": {
      "type": "string",
      "format" : "date",
      "minimum" : "2020-01-01",
      "maximum" : "2023-01-01"
    },
    "State": {
      "type": "string",
      "enum": ["State A", "State B", "State C"]
    },
    "Country": {
      "type": "string",
      "enum": ["Country A", "Country B", "Country C"]
    },
    "Specialization": {
      "type": "string"
    }
  },
  "required": [
    "Company",
    "Lead Origin",
    "Mobile Number",
    "Lead Source",
    "Source Medium",
    "Source Campaign",
    "Do Not Email",
    "Do Not Call",
    "Lead Stage",
    "Lead Score",
    "Order Value",
    "Engagement Score",
    "TotalVisits",
    "Average Time Per Visit",
    "Last Activity",
    "Last Activity Date",
    "Created On",
    "Modified On",
    "Lead Conversion Date",
    "State",
    "Country",
    "Specialization"
  ]
}
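To show how the constraints in the sample schema could drive generation, here is a rough Python sketch that produces one record from a reduced copy of the schema. The functions `generate_value` and `generate_record` are hypothetical and do not reflect the component's internals. The sketch covers only `enum`, numeric `minimum`/`maximum`, and `date` ranges; properties with a `pattern` or with no constraints fall back to a placeholder string.

```python
import random
from datetime import date, timedelta


def generate_value(spec, rng):
    """Produce one value for a property spec of the kind used in the sample."""
    if "enum" in spec:
        return rng.choice(spec["enum"])
    if spec.get("type") == "number":
        return rng.uniform(spec.get("minimum", 0), spec.get("maximum", 1))
    if spec.get("format") == "date":
        start = date.fromisoformat(spec["minimum"])
        end = date.fromisoformat(spec["maximum"])
        offset = rng.randint(0, (end - start).days)
        return (start + timedelta(days=offset)).isoformat()
    # "pattern" and unconstrained strings are not handled in this sketch.
    return "sample-text"


def generate_record(schema, rng):
    """Generate one record with a value for every property in the schema."""
    return {
        name: generate_value(spec, rng)
        for name, spec in schema["properties"].items()
    }


# Reduced copy of the sample schema, kept short for illustration.
schema = {
    "type": "object",
    "properties": {
        "Lead Stage": {
            "type": "string",
            "enum": ["Contact", "Lead", "Prospect", "Opportunity"],
        },
        "Lead Score": {"type": "number", "minimum": 0, "maximum": 10},
        "Created On": {
            "type": "string",
            "format": "date",
            "minimum": "2020-01-01",
            "maximum": "2023-01-01",
        },
        "Specialization": {"type": "string"},
    },
}

record = generate_record(schema, random.Random(42))
print(record)
```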

Weights can be given to control the bias across the generated data:

The sum of the weights must be exactly 1.

"age": { "type": "string", "enum": ["Young", "Middle","Old"], "weights":[0.6,0.2,0.2]}

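The effect of weights can be illustrated with Python's standard `random.choices`. This is a sketch of weighted sampling in general, assuming each weight applies to the enum value at the same position; `pick_weighted` is a hypothetical helper, not part of the component.

```python
import math
import random


def pick_weighted(spec, rng):
    """Choose one enum value according to the spec's "weights" list."""
    values, weights = spec["enum"], spec["weights"]
    if len(values) != len(weights):
        raise ValueError("enum and weights must have the same length")
    if not math.isclose(sum(weights), 1.0):
        raise ValueError("weights must sum to 1")
    return rng.choices(values, weights=weights, k=1)[0]


age = {
    "type": "string",
    "enum": ["Young", "Middle", "Old"],
    "weights": [0.6, 0.2, 0.2],
}
rng = random.Random(0)
counts = {value: 0 for value in age["enum"]}
for _ in range(10_000):
    counts[pick_weighted(age, rng)] += 1
print(counts)  # "Young" should account for roughly 60% of the draws
```

The sum is compared with `math.isclose` rather than `==` because floating-point addition of values such as 0.6, 0.2, and 0.2 may not yield exactly 1.0.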
