# Synthetic Data Generator

The Synthetic Data Generator component generates synthetic data from a Draft-07 JSON schema that describes the data to be produced.

Users can also upload sample data in CSV or XLSX format, and the component generates the Draft-07 schema for that data.

{% hint style="success" %}
*Check out the steps to create and use the Synthetic Data Generator component in a Pipeline workflow.*
{% endhint %}

{% embed url="<https://files.gitbook.com/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fc39ZvXI46qjXzpN3rYAg%2Fuploads%2F0CzbdVhwiQRhO4gj5Tdl%2FSDG.mp4?alt=media&token=3e3b0de6-d9fe-4c4d-a667-08499d9bd288>" %}
***Synthetic Data Generator***
{% endembed %}

## Drag and Drop the Component <a href="#drag-and-drop-the-component" id="drag-and-drop-the-component"></a>

* Drag and drop the Synthetic Data Generator Component to the Workflow Editor.

<figure><img src="/files/qv2qixM9JwLmcJPCersE" alt=""><figcaption></figcaption></figure>

* Click on the dragged Synthetic Data Generator component to get the component properties tabs.

## Basic Information Tab <a href="#basic-information-tab" id="basic-information-tab"></a>

Configure the **Basic Information** tab.

* **Invocation Type**: Select an invocation type from the drop-down menu to set the running mode of the component. Select the **Real-Time** option.
* **Deployment Type**: Displays the deployment type for the component. This field comes pre-selected.
* **Container Image Version**: Displays the image version for the Docker container. This field comes pre-selected.
* **Failover Event**: Select a failover Event from the drop-down menu.
* **Batch Size (min 10)**: Provide the maximum number of records to be processed in one execution cycle (the minimum limit for this field is 10).

<figure><img src="/files/BKuGqxU2LSVFuDo1yt7f" alt=""><figcaption></figcaption></figure>

## Meta Information Tab

Configure the following information:

* **Iteration:** Number of iterations for producing the data.
* **Delay (sec):** Delay between iterations, in seconds.
* **Batch Size:** Number of records to be produced in each iteration.
* **Upload Sample File:** Upload a file containing sample data. CSV, Excel (XLSX), and JSON file formats are supported. Once the file is uploaded, the Draft-07 schema for the uploaded file is generated in the Schema tab.
* **Schema:** The Draft-07 schema is displayed under this tab in an editable format.
* **Upload Schema:** The user can directly upload the Draft-07 schema in JSON format from here. Alternatively, the user can paste the Draft-07 schema directly into the Schema tab.
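
The sample-file upload described above derives a schema from tabular data. The component's actual inference logic is not documented here, so the following is only a rough sketch of the idea, assuming that each CSV column becomes a property typed as `number` when all of its values parse as numbers and as `string` otherwise. The helper names `infer_type` and `schema_from_csv` are hypothetical and are not part of the product.

```python
import csv
import io
import json


def infer_type(values):
    """Return "number" if every non-empty value parses as a float, else "string"."""
    non_empty = [v for v in values if v not in ("", None)]
    if not non_empty:
        return "string"
    try:
        for v in non_empty:
            float(v)
    except ValueError:
        return "string"
    return "number"


def schema_from_csv(text: str) -> dict:
    """Build a minimal Draft-07 style schema from CSV text (illustrative only)."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    columns = list(reader.fieldnames or [])
    properties = {
        col: {"type": infer_type([row.get(col) for row in rows])}
        for col in columns
    }
    return {
        "$schema": "http://json-schema.org/draft-07/schema#",
        "type": "object",
        "properties": properties,
        "required": columns,
    }


# Hypothetical two-column sample file.
sample_csv = "Company,Lead Score\nTCS,7\nInfosys,9.5\n"
print(json.dumps(schema_from_csv(sample_csv), indent=2))
```

A schema produced this way is only a starting point. Constraints such as `minimum`, `enum`, or `format` would still be added by hand in the Schema tab.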

<figure><img src="/files/cRQfrJ6WKZXhFnXF8QvK" alt=""><figcaption><p><em><strong>Meta Information for Schema Data Generator</strong></em></p></figcaption></figure>

## Saving the Component Configuration <a href="#saving-the-component-configuration" id="saving-the-component-configuration"></a>

* After completing the configuration, click the ***Save Component in Storage*** icon in the configuration panel to save the component.

<figure><img src="/files/ULjpx5jPku2img9kcT61" alt=""><figcaption></figcaption></figure>

* A notification message appears confirming that the component configuration has been saved.

<figure><img src="/files/2K9gkIBD01Y0fYBVQfM0" alt=""><figcaption></figcaption></figure>

{% hint style="info" %}
*<mark style="color:green;">Please Note</mark>: **Total number of generated records** = **Number of iterations \* batch size***
{% endhint %}
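
As a quick check of the note above, the arithmetic can be sketched in a few lines. This assumes that "batch size" refers to the **Batch Size** field of the Meta Information tab. The function name `total_records` and the example values are illustrative only.

```python
def total_records(iterations: int, batch_size: int) -> int:
    """Total rows produced over a full run: iterations * batch size."""
    if iterations < 0 or batch_size < 0:
        raise ValueError("iterations and batch_size must be non-negative")
    return iterations * batch_size


# Hypothetical settings: 5 iterations of 100 records each.
print(total_records(5, 100))  # 500
```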

### Sample Schema File

A sample schema file is given below for users to explore the component.

```json
{
  "$schema": "schema",
  "type": "object",
  "properties": {
    "number1": {
      "type": "number"
    },
    "number2": {
      "type": "number"
    },
    "number3": {
      "type": "number"
    },
    "Company": {
      "type": "string",
      "enum": ["NIKO RESOURCES LIMITED", "TCS", "Accenture", "ICICI Bank", "Cognizant", "HDFC Bank", "Infosys"]
    },
    "Lead Origin": {
      "type": "string",
      "enum": ["Campaign", "Walk-in", "Social Media", "Existing Account"]
    },
    "Lead Stage": {
      "type": "string",
      "enum": ["Contact", "Lead", "Prospect", "Opportunity"]
    },
    "Lead Score": {
      "type": "number",
      "minimum": 0,
      "maximum": 10
    },
    "Order Value": {
      "type": "number",
      "minimum": 0,
      "maximum": 10000000
    },
    "Average Time Per Visit": {
      "type": "number",
      "minimum": 1,
      "unique": true
    },
    "Last Activity Date": {
      "type": "string",
      "format": "date",
      "minimum": "2020-01-01",
      "maximum": "2023-01-01"
    },
    "Created On": {
      "type": "string",
      "format": "date",
      "minimum": "2020-01-01",
      "maximum": "2023-01-01"
    },
    "Modified On": {
      "type": "string",
      "format": "date",
      "minimum": "2020-01-01",
      "maximum": "2023-01-01"
    },
    "Lead Conversion Date": {
      "type": "string",
      "format": "date",
      "start": "2020-01-01",
      "interval": 365,
      "occurrence": 2
    },
    "Mobile Number": {
      "type": "string",
      "pattern": "^\\+?\\d{1,3}[-.\\s]?\\(?(\\d{1,3})\\)?[-.\\s]?\\d{1,4}[-.\\s]?\\d{1,4}$"
    },
    "Source Medium": {
      "type": "string",
      "enum": ["Website", "Direct Calls", "Referral"]
    },
    "Source Campaign": {
      "type": "string",
      "enum": ["Campaign A", "Campaign B", "Campaign C"]
    },
    "Email": {
      "type": "string",
      "format": "email"
    },
    "Last Activity": {
      "type": "string",
      "enum": ["Page Visited on Website", "Email Opened", "Unreachable", "Converted to Lead"]
    },
    "State": {
      "type": "string",
      "format": "state"
    },
    "Country": {
      "type": "string",
      "format": "country"
    },
    "Names": {
      "type": "string",
      "format": "name"
    },
    "Address": {
      "type": "string",
      "format": "address"
    },
    "Datetime_value": {
      "type": "string",
      "format": "Current_datetime"
    },
    "Specialization": {
      "type": "string"
    }
  },
  "required": [
    "number1",
    "number3",
    "Company",
    "Lead Origin",
    "Mobile Number",
    "Source Medium",
    "Source Campaign",
    "Email",
    "Lead Stage",
    "Lead Score",
    "Order Value",
    "Average Time Per Visit",
    "Last Activity",
    "Last Activity Date",
    "Created On",
    "Modified On",
    "Lead Conversion Date",
    "State",
    "Country",
    "Names",
    "Address",
    "Datetime_value",
    "Specialization"
  ],
  "if": {
    "properties": {
      "number1": {
        "type": "number"
      },
      "number2": {
        "type": "number"
      }
    }
  },
  "then": {
    "properties": {
      "number1": {
        "maximum": {
          "$data": "number2"
        }
      },
      "number3": {
        "calculation": {
          "$eval": "data.number1 + data.number2 * 2"
        }
      }
    }
  }
}

```

{% hint style="info" %}
*<mark style="color:green;">Please Note:</mark> Weights can be assigned to the values of an enum to control the bias across the generated data:*

***The weights must sum to exactly 1.***

`"age": { "type": "string", "enum": ["Young", "Middle", "Old"], "weights": [0.6, 0.2, 0.2] }`
{% endhint %}
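
To illustrate what the weights express, here is a minimal sketch using Python's standard `random.choices`. It is not the component's implementation. It assumes that each weight is the probability of picking the enum value at the same position, and the helper name `pick_weighted` is hypothetical.

```python
import math
import random


def pick_weighted(values, weights, rng=random):
    """Pick one enum value; weights must pair up with values and sum to 1."""
    if len(values) != len(weights):
        raise ValueError("each enum value needs exactly one weight")
    if not math.isclose(sum(weights), 1.0, abs_tol=1e-9):
        raise ValueError("weights must sum to exactly 1")
    return rng.choices(values, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed only to make the sketch repeatable
sample = [
    pick_weighted(["Young", "Middle", "Old"], [0.6, 0.2, 0.2], rng)
    for _ in range(1000)
]
# Share of "Young" should land near the 0.6 weight.
print(sample.count("Young") / len(sample))
```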

## Types and their properties

**Type: "string"**

**Properties:**

* **`maxLength`:** Maximum length of the string.
* **`minLength`:** Minimum length of the string.
* **`enum`:** A list of values that the string can take.
* **`weights`:** Weights for each value in the enum list.
* **`format`:** Available formats include 'date', 'date-time', 'name', 'country', 'state', 'email', 'uri', and 'address'.

**For 'date' and 'date-time' formats, the following properties can be set:**

* **`minimum`:** Minimum date or date-time value.
* **`maximum`:** Maximum date or date-time value.
* **`interval`:** For 'date' format, the interval is the number of days. For 'date-time' format, the interval is the time difference in seconds.
* **`occurrence`:** Indicates how many times a date/date-time needs to repeat in the data. It should only be used together with the `interval` and `start` keywords.

A new format has been introduced for the string type: **'current\_datetime'**. This format generates records with the current date-time.
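
The interaction of `start`, `interval`, and `occurrence` for the 'date' format can be pictured with a small sketch. The exact behavior of the component is not specified on this page, so this assumes that each date is emitted `occurrence` times before advancing by `interval` days. The function name `date_series` and the `count` parameter are hypothetical.

```python
from datetime import date, timedelta


def date_series(start: str, interval_days: int, occurrence: int, count: int):
    """Return `count` ISO dates: each date repeats `occurrence` times,
    then the next date is `interval_days` later (assumed semantics)."""
    if occurrence < 1 or count < 0:
        raise ValueError("occurrence must be >= 1 and count must be >= 0")
    current = date.fromisoformat(start)
    out = []
    while len(out) < count:
        for _ in range(occurrence):
            if len(out) == count:
                break
            out.append(current.isoformat())
        current += timedelta(days=interval_days)
    return out


# Values taken from the "Lead Conversion Date" field of the sample schema.
print(date_series("2020-01-01", 365, 2, 5))
# ['2020-01-01', '2020-01-01', '2020-12-31', '2020-12-31', '2021-12-31']
```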

**Type: "number"**

**Properties:**

* **`minimum`:** The minimum value for the number.
* **`maximum`:** The maximum value for the number.
* **`exclusiveMinimum`:** Indicates whether the minimum value is exclusive.
* **`exclusiveMaximum`:** Indicates whether the maximum value is exclusive.
* **`unique`:** Determines if the field should generate unique values (True/False).
* **`start`:** Associated with unique values, this property determines the starting point for unique values.
* **`enum`:** A list of values that the number can take.
* **`weights`:** Weights for each value in the `enum` list.

**Type: "float"**

**Properties:**

* **`minimum`:** The minimum float value.
* **`maximum`:** The maximum float value.

{% hint style="info" %}
*<mark style="color:green;">Please Note:</mark> Draft-07 schemas allow for the use of **if-then-else** conditions on fields, enabling complex validations and logical checks. Additionally, mathematical computations can be performed by specifying conditions within the schema.*
{% endhint %}

**Sample Draft-07 schema with if-then-else condition**

```json
{
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "properties": {
        "country": {
            "type": "string",
            "enum": ["USA", "Canada", "UK", "Australia"]
        },
        "currency": {
            "type": "string"
        },
        "population": {
            "type": "integer"
        },
        "start_date": {
            "type": "string",
            "format": "date-time"
        },
        "end_date": {
            "type": "string",
            "format": "date-time"
        }
    },
    "if": {
        "properties": {
            "country": {
                "const": "USA"
            }
        }
    },
    "then": {
        "properties": {
            "currency": {
                "const": "USD"
            },
            "population": {
                "minimum": 33000000
            },
            "start_date": {
                "format": "date-time",
                "pattern": "^2023-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}$"
            },
            "end_date": {
                "format": "date-time",
                "pattern": "^2023-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}$",
                "minimum": { "$data": "start_date" }
            }
        }
    },
    "else": {
        "if": {
            "properties": {
                "country": {
                    "const": "UK"
                }
            }
        },
        "then": {
            "properties": {
                "currency": {
                    "const": "GBP"
                },
                "population": {
                    "minimum": 65000000
                }
            }
        },
        "else": {
            "if": {
                "properties": {
                    "country": {
                        "enum": ["Canada", "Australia"]
                    }
                }
            },
            "then": {
                "properties": {
                    "currency": {
                        "const": "CAD-AUD"
                    }
                }
            },
            "else": {
                "properties": {
                    "currency": {
                        "const": "Unknown"
                    }
                }
            }
        }
    }
}
```
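
To make the branching in the schema above easier to follow, here is a simplified sketch of how nested **if-then-else** blocks select the rules for a record. It is not a JSON Schema validator and not the component's code. It assumes conditions only use `const` or `enum`, and it treats a missing field as a non-match. The names `matches` and `resolve` are hypothetical.

```python
def matches(condition: dict, record: dict) -> bool:
    """True when every property in the `if` block matches via const or enum."""
    for field, rule in condition.get("properties", {}).items():
        value = record.get(field)
        if "const" in rule and value != rule["const"]:
            return False
        if "enum" in rule and value not in rule["enum"]:
            return False
    return True


def resolve(schema: dict, record: dict) -> dict:
    """Follow nested if/then/else and return the property rules that apply."""
    node = schema
    while "if" in node:
        branch = "then" if matches(node["if"], record) else "else"
        node = node.get(branch, {})
    return node.get("properties", {})


# Trimmed-down version of the currency rules from the schema above.
rules = {
    "if": {"properties": {"country": {"const": "USA"}}},
    "then": {"properties": {"currency": {"const": "USD"}}},
    "else": {
        "if": {"properties": {"country": {"const": "UK"}}},
        "then": {"properties": {"currency": {"const": "GBP"}}},
        "else": {"properties": {"currency": {"const": "Unknown"}}},
    },
}

print(resolve(rules, {"country": "UK"})["currency"]["const"])  # GBP
```

Each `else` holds the next `if`, so the first matching condition wins and the innermost `else` acts as the default.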

**Example:** Here, the value of `number3` is calculated from the **`"$eval": "data.number1 + data.number2 * 2"`** expression.

```json
"if": {
  "properties": {
    "number1": {
      "type": "number"
    },
    "number2": {
      "type": "number"
    }
  }
},
"then": {
  "properties": {
    "number1": {
      "maximum": {
        "$data": "number2"
      }
    },
    "number3": {
      "calculation": {
        "$eval": "data.number1 + data.number2 * 2"
      }
    }
  }
}
```
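
The effect of these two rules on a single record can be sketched as follows. This assumes that the `$eval` expression follows ordinary arithmetic precedence (multiplication before addition) and that `"maximum": {"$data": "number2"}` means `number1` never exceeds `number2`. The sketch enforces that by clamping, which may differ from how the component actually generates values. The function name `apply_calculation` is hypothetical.

```python
def apply_calculation(record: dict) -> dict:
    """Mimic the sample rules: keep number1 <= number2, then derive number3."""
    out = dict(record)
    # "maximum": {"$data": "number2"} -> number1 may not exceed number2
    out["number1"] = min(out["number1"], out["number2"])
    # "$eval": "data.number1 + data.number2 * 2" -> multiplication binds first
    out["number3"] = out["number1"] + out["number2"] * 2
    return out


print(apply_calculation({"number1": 3, "number2": 5}))
# {'number1': 3, 'number2': 5, 'number3': 13}
```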

{% hint style="info" %}
*<mark style="color:green;">Please Note:</mark> Conditional statements can also be applied to **date** and **datetime** columns using **if-then-else**. Refer to the schema given below.*
{% endhint %}

```json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "task_end_date": {
      "type": "string",
      "format": "date"
    },
    "task_start_date": {
      "type": "string",
      "format": "date"
    }
  },
  "if": {
    "properties": {
      "task_end_date": { "type": "string", "format": "date" },
      "task_start_date": { "type": "string", "format": "date" }
    }
  },
  "then": {
    "properties": {
      "task_end_date": { "format": "date", "minimum": { "$data": "task_start_date" } }
    }
  }
}
```

The JSON schema above defines an object with two properties, "**task\_end\_date**" and "**task\_start\_date**", both of which are expected to be strings in date format. The schema includes a conditional rule using the "**if-then**" structure: if both properties are present and in date format, an additional constraint is applied. Specifically, "**task\_end\_date**" must be greater than or equal to the value of "**task\_start\_date**". This ensures that a task's end date always falls on or after its start date.
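
The constraint described above can be checked on generated records with a short sketch. It assumes the dates are ISO formatted (`YYYY-MM-DD`), as in the sample schema's date values. The function name `end_not_before_start` is hypothetical.

```python
from datetime import date


def end_not_before_start(record: dict) -> bool:
    """True when task_end_date falls on or after task_start_date."""
    start = date.fromisoformat(record["task_start_date"])
    end = date.fromisoformat(record["task_end_date"])
    return end >= start


print(end_not_before_start(
    {"task_start_date": "2023-03-01", "task_end_date": "2023-03-15"}
))  # True
```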

