Data Preparation Landing Page

The user can access the Data Grid view of the selected dataset by clicking on the Data Preparation icon. The displayed data in the grid changes based on the number of transforms performed on it.

The Data Grid in the BDB Data Preparation displays the data. The data displayed in the grid is a sample from the actual data set or complete data based on the data volume.

Data Grid Header

The grid has a header that displays the column name, and column type from the selected dataset. Within this header, a column chart visually represents the data of the selected column.

Show Chart

The column chart can be shown from the Header using the Show Chart option provided on the top right side of the page. The Column chart remains hidden by default in the Data Grid.

Navigate to the Data Preparation workspace.
Click the Show Chart icon.

The column chart for each column will be displayed in the header.

Each Column Header contains a Drop-down icon. By clicking the Drop-down icon, a Context menu is displayed with some options to be applied to that column.

The following options are displayed while clicking on the Context Menu icon:

Rename column
Hide Column
Delete Column
Delete All Others
Duplicate Columns
Get Character Length
Change to String (appears only for Integer)

It also presents the data type of the column. It is analyzed based on the max match to any data type in the first 10K records. Consider that out of the 10000 rows sample, there are 9000 integers and 1000 string values, the selected data type is Integer. The 1000-string rows get detected as invalid rows.

The column header in the Data Grid displays the following information based on the column types:

Columns with Integer values- The Min and Max values
Columns with String values- Total unique count or no. of categories
Columns with Date values- Range of dates including the min-max date

Please Note: Repetitive Column Names Handling under the Data Preparation module is based on the file types (as explained below).

When opening an Excel file with repetitive names of the columns in the Data Preparation framework, the column names will be mentioned with _0, _1 suffix by default.
- For Example, if multiple columns with the name ID are present in an Excel file, the Data Preparation will read these columns as ID_0, ID_1, ID_2, and so on.
A CSV file handles such scenarios of repetitive column names by displaying no suffix for the first column and then progressively inserting .1, .2, and .3 suffixes for all the repetitive columns.
- For Example, if multiple columns with the name ID are present in a CSV file, the Data Preparation will read these columns as ID, ID.1, ID.2, and so on.

Settings

The Settings icon on the top of the Data Preparation workspace contains two actions to be applied to the data set rows.

Skip Rows
Total Rows

Skip Rows

The Skip Rows functionality will help the user to skip the selected records from the specified index. The user can limit the Data Preparation up to some no. of rows by using the Skip Rows option. The skipped rows will be excluded from the original dataset while applying the Data Preparation transformations.

Navigate to the Data Preparation Workspace.
Click the Settings icon.

The Skip & Total Rows drawer opens.
Provide the number till where you want to skip the data.
Click the Apply option.

The dataset grid will exhibit the dataset with the specified rows skipped.
- The Sample Size reflects the change after applying the Skip Row.
- The Info section also displays changed Count and Valid data.

Please Note:

The default value for Skip Rows functionality is 0.
After saving the Data Preparation with Skipped rows while re-opening the same Data Preparation the Skip rule gets removed.
The Skip Rows functionality does not support the Data Preparations created based on the Data Sets.
In cases where the dataset is extensive, containing more than 1000 rows, it's recommended to use skip row and total row functionalities together for better performance or efficiency.

Total Rows

By default, the grid always displays the first 1K rows of the dataset. The user can use the N Rows option to change or modify the limit of the Data set displayed in the Data Grid. The N Rows option is provided on the top of the Data Grid view, the user can change the view to 2K, 3K, 4K up to 5K using this option.

Navigate to the Data Preparation Workspace.
Click the Settings icon.

The Skip & Total Rows drawer opens with the Total Rows' default value.
Click the drop-down icon provided for Total Rows.

Select a number from the drop-down menu to display the data.

Click the Apply option.

The Sample Size number of rows will reflect the selected total number of rows.
The Count and Valid no. of data mentioned under the Info option will also display the changed total data.
Each page displays 200 records by default in the grid display. The pages will be added to the Grid display to accommodate the remaining rows based on the selected no. of the Total row.

Please Note: The Total Rows functionality is not supported for a Data Preparation created based on a Data Set.

Pagination

Pagination is implemented in the grid display of data. The tool displays 200 records on each page by default, by changing the Total Row count the no. of pages displayed for the data grid will change.

Data Types

The Data Grid header displays Data Types. Some of the supported Data Types are given below:

Integer
Double
String
Date
Timestamp
Long
Email
Boolean
Gender
URL

Key Metrics

Key Metrics are displayed at the bottom of the data grid to provide valuable insights into the dataset. These metrics offer essential contextual information, enabling users to make informed decisions, perform data profiling, and gain a deeper understanding of the dataset that is being prepared. This includes:

Column Count: The total number of columns in the dataset to quickly assess the complexity and scope of the data.
Row Count: The total number of rows in the dataset, providing an overview of the dataset's size and volume.
Data Type Count: The number of distinct data types present in the dataset, enabling users to understand the variety and diversity of data formats and structures.

Please Note:

The Data Preparation workspace supports more than the listed Data Types.
The user can edit the name for the Data Preparation using the Title bar.

Data Quality Bar

A Data Quality Bar appears in the header of the data grid. The Data Quality is indicated through color-coding by clicking on a particular column.

The Data Quality Bar displays three types of data using 3 different colors.

Dark Blue-Valid Data
Orange-Invalid Data
Light Blue- Blank Data

Please Note: These color-coded bars appear by clicking on a particular column.

Show/Hide Columns

This option allows the user to instantly hide or show the rows based on their need to derive meaningful insights from the displayed data.

Check out the illustration on how to use the Show/Hide option.

Navigate to the Grid view of any selected Data Preparation.
Click the Show/Hide Columns option at the top of the displayed grid view of the data.

The Show/Hide Columns drawer appears displaying the available columns from the selected Data Preparation.

Select the columns using the given checkboxes provided for those columns.
The selected columns will disappear from the Data Grid display.

Un-check the checkboxes for the same column(s).
The column(s) start reflecting in the Grid view.

Please Note: The Hide Columns option can be accessed from the menu icon provided for each column in the Data Grid display of the dataset.

Auto Prep

Auto Prep is an automated process to streamline and clean datasets by applying various data-cleaning techniques and transformations. These techniques may include handling missing values, standardizing data formats, removing duplicates, and performing other preprocessing tasks to ensure data is consistent, accurate, and ready for analysis.

Implementing Auto Prep can save time and effort compared to manual data cleaning, especially for large datasets with numerous variables. By automating the cleaning process, data scientists and analysts can focus more on analyzing insights and making informed decisions rather than spending excessive time on data preparation. All the applied steps are neatly mentioned under the Steps tab.

Please Note: Auto Prep will quickly be applied all over the dataset to clean the complete dataset.

A set of transforms is included in this process. The explanation of each Auto Prep transform is given below:

Cast to Types: It will check out each column of the dataset and if any mismatch between data & datatype is found, then it will try to convert it into the same or remove the particular data from that cell for the given column.
1. A string column contains a,b,c,d,3,r,f data. After applying the Cast to Types transform, the digit 3 will be considered a string datatype.
2. In another scenario, a column contains 1,2,3,4,abc,5,6,7 after performing this transformation the text 'abc' will be removed as it can’t be converted to an integer.
Remove Special Character from Metadata: This transform removes special characters & spaces from the metadata (column headers) and presents them with proper naming conventions which will be useful in the other scenarios.
Fill Empty Cells with Text: This transform provides or fills the empty cells of each column based on the data type. The transformation works in the following manner for the different data types:
1. The empty cells in a String column will get filled with NA.
2. The empty cells in a numerical column will be filled with 0.
3. The empty cells in a date column will be filled with the text "NaT".
Remove Special Character: This transform removes special characters such as @, #, %, _, dot, etc. if present in the given dataset.
Remove Accents: This transformation technique removes accents from the words and makes them proper.
Delete Rows with Empty Cell: The Delete Rows with Empty Cell deletes the rows with empty values from each cell (This is optional to use by the user as it will cause certain data loss from the complete data).
Delete Rows with Invalid Cell: This is to delete the rows with invalid values in each cell (This is optional to use by the user as it will cause certain data loss from the complete data).

Please Note:

The most important/ significant transform as a part of Auto Prep is Remove Special Character from Metadata which will be useful in the columns present in the dataset with no proper naming convention.
While applying Auto Prep other than the set of transforms provided under the Auto Prep, the trailing leading whitespaces from the dataset are removed if present.

Check out the given video on the Auto Prep functionality.

Steps to use the Auto Prep feature

Navigate to a dataset displayed in the Data Preparation framework.
Click the Auto Prep option from the top right menu panel.

The Transformations List window opens with the list of the suggested Data Preparations.

The user can modify the suggested list using checkmarks in the given checkboxes.
Click the Proceed option after selecting all the required data preparation options from the list.

Open the Steps tab.
All the selected data preparations are applied to the dataset. The Auto Prep entry gets registered as AUTO DATAPREP under the Steps tab.

The applied Transforms are listed below by clicking on the AUTO DATAPREP step from the Steps tab.

Please Note:

The saved Data Preparation using Auto Prep also appears under the Preparation List while opening the data sandbox file from the Data Sandbox List page.
The Remove Special Characters from Metadata will be disabled by default while using Auto Prep. This transform will be enabled only when the headers of the selected dataset contain special characters.
Once the Auto Prep feature has been applied to a Dataset it will be disabled for the dataset. The Auto Prep feature can be enabled in such scenarios only after deleting the applied AUTO DATAPREP step from the Steps tab.

Filter

The Filter feature allows the user to customize the display by selecting a specific column/ row or choosing a data type from the listed options.

Check out the illustration to understand the Filter functionality for Columns and Rows.

Navigate to the landing page of the selected Data Preparation.
Click the Filter icon provided on the top right side of the screen.

The Filter drawer window appears displaying the default view for the Filter functionality.

Filtering the Data Display

The user can filter the data display based on the following aspects:

Data Types: Select the data types from the available list based on which you wish to filter the data.

Column: Provide the name of a specific column to filter the view by that column. For example, the given image displays data filtered by the columns that contain the Sour letter in the titles (There is only one column with these letters in the given dataset).

Row: Provide the value of a specific row to filter data by that value. For example, the following image filters the data by the rows that contain the Indeed value.

Please Note:

The Filter dialog box will display all the applicable data types to the available categories of columns from the selected Data Preparation.
The Filter dialog window displays the data type options selected by default while opening it for the first time. The user can edit the choices after opening it.
Keep the data type option checked that can display multiple columns in the filtered view while applying the Column or Row filtering option.

Saving a Data Preparation

The user will get a Save option on the Data Preparation workspace screen to save the concerned Data Preparation.

Navigate to the Data Preparation workspace.
Provide a name for the Data Preparation or modify the existing name if needed. Or Apply a transform to the data.
Click the SAVE option.

A notification message appears to ensure the user.

The Data Preparation gets saved on the Data Preparation List page.

Please Note:

The Save option gets enabled only after one transform or Auto Prep is applied to the dataset.
If the user fails to save a Data Preparation after applying some transforms, and closes the Data Preparation workspace. It will still be saved with an auto-generated name under the Data Preparation List.

PreviousData Preparation Workspace NextProfile Tab

Last updated 9 months ago