Resource Provisioning

This section provides an overview of the Configuration tab, which is used to customize and parameterize each pipeline component.

Resource Configuration per Component

Each component in a pipeline can be configured with resource allocation settings to ensure optimal performance based on workload requirements. Once a pipeline and its associated components are saved, each component inherits the default pipeline configuration settings: Low, Medium, or High.

After the pipeline is saved, the Configuration tab becomes available within the component interface, enabling fine-grained tuning of compute and memory resources.

Two primary deployment types are supported: Docker and Spark, each offering distinct configuration parameters.

Deployment Types and Resource Configuration

Docker Deployment

When deploying a component using Docker, resource configuration can be specified under the Configuration tab.

  • CPU (Cores)

    • Defines the number of CPU cores allocated to the container.

    • Note: In Docker configuration, CPU is expressed in millicores: 1000 = 1 core, and 100 = 0.1 core.

  • Memory (RAM)

    • Specifies the memory allocated to the container.

    • Note: In Docker configuration, memory is expressed in MB: 1024 = 1 GB.

  • Instances

    • Defines the number of container instances deployed for parallel processing.

    • If N instances are specified, N pods are deployed.

  • Requests and Limits

    • Docker components support configuration of both resource requests (minimum guaranteed resources) and limits (maximum allowed resources), ensuring predictable performance and avoiding resource contention (see the sketch after this list).
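
To make the unit conventions concrete, the following is a minimal sketch of how a Docker component's resources might be assembled. The `build_resource_spec` helper and its field names are illustrative assumptions, not the platform's actual API; only the unit rules (1000 = 1 core, 1024 = 1 GB) come from the section above.

```python
# Illustrative sketch only: the helper and field names are hypothetical,
# not the BDB Platform's actual API. Unit conventions follow the text:
# CPU is in millicores (1000 = 1 core) and memory is in MB (1024 = 1 GB).

def build_resource_spec(request_cores: float, limit_cores: float,
                        request_mem_gb: float, limit_mem_gb: float,
                        instances: int) -> dict:
    """Convert human-friendly units into the platform's numeric fields."""
    to_millicores = lambda cores: int(cores * 1000)   # 0.1 core -> 100
    to_mb = lambda gb: int(gb * 1024)                 # 1 GB -> 1024

    return {
        "instances": instances,  # N instances -> N pods
        "requests": {            # minimum guaranteed resources
            "cpu": to_millicores(request_cores),
            "memory": to_mb(request_mem_gb),
        },
        "limits": {              # maximum allowed resources
            "cpu": to_millicores(limit_cores),
            "memory": to_mb(limit_mem_gb),
        },
    }

# Example: 0.1 core guaranteed, capped at 1 core, 0.5-1 GB RAM, 2 pods.
spec = build_resource_spec(0.1, 1.0, 0.5, 1.0, instances=2)
print(spec)
```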

Spark Deployment

When deploying a component using Apache Spark, resource allocation applies at both the driver and executor levels.

  • Driver Configuration

    • Driver CPU and Memory: Specifies the cores and memory allocated to the Spark driver, which manages the Spark context and coordinates job execution.

  • Executor Configuration

    • Executor CPU and Memory: Specifies the cores and memory allocated to Spark executors, which execute tasks in parallel.

    • Instances: Defines the number of executors. If N executors are configured, N executor pods are deployed.

  • Minimum Requirements

    • As of the current release, the minimum driver requirement is 0.1 cores.

    • The minimum executor requirement is 1 core.

    • These values may change in upcoming Spark versions.

Spark resource configuration enables fine-grained tuning for distributed workloads, maximizing scalability and efficiency.
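
For reference, the same driver/executor split maps naturally onto standard Spark configuration properties. The sketch below assumes a spark-submit-style configuration; the values are examples, and the way the Configuration tab surfaces these fields may differ.

```python
# Sketch of the driver/executor split using standard Spark properties.
# Values are examples; the platform may expose these fields differently.
spark_conf = {
    # Driver: manages the Spark context and coordinates job execution.
    "spark.driver.memory": "1g",
    # On Kubernetes, a fractional driver CPU (e.g., the 0.1-core minimum)
    # is requested via the pod-level setting:
    "spark.kubernetes.driver.request.cores": "0.1",

    # Executors: run tasks in parallel; N instances -> N executor pods.
    "spark.executor.instances": "3",
    "spark.executor.cores": "1",      # minimum executor requirement: 1 core
    "spark.executor.memory": "2g",
}

# Equivalent spark-submit flags, for reference:
#   spark-submit --conf spark.executor.instances=3 \
#                --conf spark.executor.cores=1 ...
```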

Comparison: Docker vs. Spark Resource Configuration

  • Resource Scope

    • Docker: Configured at the container level for each component.

    • Spark: Configured at the driver and executor levels for distributed workloads.

  • CPU Allocation

    • Docker: Defined in millicores; 1000 = 1 core, 100 = 0.1 core.

    • Spark: Driver CPU is allocated to the Spark driver for coordination; executor CPU is allocated per executor.

  • Memory Allocation

    • Docker: Defined in MB; 1024 = 1 GB.

    • Spark: Driver memory serves the Spark driver context; executor memory serves each executor's tasks.

  • Instances

    • Docker: Defines the number of pods deployed for parallelism; N instances = N pods.

    • Spark: Defines the number of executor pods; N executors = N pods.

  • Minimum Requirements

    • Docker: CPU 0.1 core; memory has no strict minimum and is defined by the user.

    • Spark: Driver minimum 0.1 cores; executor minimum 1 core.

  • Requests & Limits

    • Docker: Supports both resource requests (minimum guaranteed) and limits (maximum allowed).

    • Spark: Resource requests are defined via driver/executor configurations for fine-grained cluster control.

  • Parallel Processing

    • Docker: Achieved via multiple container instances (pods).

    • Spark: Achieved via executors, each handling tasks in parallel.

  • Optimization

    • Docker: Best for lightweight workloads and containerized tasks.

    • Spark: Best for distributed data processing and ML training.

Best Practice Recommendations

When to Use Docker Deployment

  • Choose Docker for lightweight workloads and containerized tasks, where a component's resource needs fit within one or a few container instances.

When to Use Spark Deployment

  • Choose Spark for distributed data processing and ML training, where work must be split across multiple executors running tasks in parallel.

General Best Practices

  • Set resource requests and limits explicitly so performance stays predictable and resource contention is avoided.

  • Size the number of instances (Docker) or executors (Spark) to the degree of parallelism the workload actually needs.

  • Start from the pipeline defaults (Low, Medium, High) and fine-tune each component in its Configuration tab.
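
The request-and-limit discipline above can also be checked mechanically. The helper below is hypothetical (not a platform feature); it simply rejects specs where a request exceeds its limit or the instance count is non-positive, using the same numeric units as the Docker sketch earlier.

```python
# Hypothetical pre-deployment check (not a platform feature): enforce the
# request <= limit discipline described above before a spec is applied.

def validate_resource_spec(spec: dict) -> list[str]:
    """Return a list of problems; an empty list means the spec looks sane."""
    problems = []
    if spec.get("instances", 0) < 1:
        problems.append("instances must be >= 1 (N instances -> N pods)")
    requests, limits = spec.get("requests", {}), spec.get("limits", {})
    for resource in ("cpu", "memory"):
        if requests.get(resource, 0) > limits.get(resource, float("inf")):
            problems.append(f"{resource} request exceeds its limit")
    return problems

# Example: a spec whose CPU request (2 cores) exceeds its limit (1 core).
bad_spec = {
    "instances": 2,
    "requests": {"cpu": 2000, "memory": 512},
    "limits": {"cpu": 1000, "memory": 1024},
}
print(validate_resource_spec(bad_spec))  # ['cpu request exceeds its limit']
```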

By applying these configurations and best practices, the BDB Platform ensures flexibility, cost efficiency, and scalability across diverse data and AI workloads.

Optimization Configuration Fields

Node Pool Selection

The Node Pool option provides control over the execution environment of individual components by specifying the compute node group where the component runs. A brief configuration sketch follows the list of benefits below.

Key Benefits

  1. Workload Isolation – Assign components to node pools based on criticality, performance, or security policies (e.g., secure node pools for sensitive workloads).

  2. Optimized Resource Allocation – Place compute-intensive tasks (e.g., ML training) on GPU-enabled pools, and lightweight tasks on cost-efficient pools.

  3. Cost Management – Assign non-critical workloads to low-cost node pools (e.g., spot instances) and reserve premium resources for high-priority jobs.

  4. Performance Optimization – Minimize latency by routing heavy compute workloads to high-throughput node pools.

  5. Environment-Specific Execution – Leverage custom node pools with preinstalled libraries and dependencies for specific workloads, reducing setup overhead.
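
In Kubernetes-backed environments, this kind of placement is typically expressed as a node-selector-style constraint. The sketch below is an assumption about how such an assignment might look; the `node_pool` key and the pool names are hypothetical, not the platform's actual schema.

```python
# Illustrative only: the "node_pool" key and pool names are hypothetical,
# not the platform's actual schema. The idea mirrors a Kubernetes
# nodeSelector: pin each component to a specific group of compute nodes.

component_configs = {
    "train_model": {"node_pool": "gpu-pool"},     # compute-intensive ML training
    "ingest_data": {"node_pool": "spot-pool"},    # non-critical, cost-efficient
    "score_pii":   {"node_pool": "secure-pool"},  # sensitive workload isolation
}

for name, cfg in component_configs.items():
    print(f"{name} -> runs on node pool '{cfg['node_pool']}'")
```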

Intelligent Scaling

Intelligent Scaling dynamically adjusts resource allocation based on workload demands and system performance metrics.

  • Enable Intelligent Scaling to optimize utilization and execution efficiency.

Recommendation: Enable this option only if maximum instances > minimum instances, ensuring elasticity while avoiding under-provisioning.
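
As a concrete reading of that recommendation, the small helper below (hypothetical, not a platform API) enables intelligent scaling only when there is headroom between the minimum and maximum instance counts.

```python
# Hypothetical helper (not a platform API): intelligent scaling only adds
# value when max_instances > min_instances, i.e., there is room to scale.

def should_enable_intelligent_scaling(min_instances: int, max_instances: int) -> bool:
    if min_instances < 1:
        raise ValueError("min_instances must be at least 1")
    return max_instances > min_instances

print(should_enable_intelligent_scaling(1, 4))  # True: room to scale out
print(should_enable_intelligent_scaling(3, 3))  # False: fixed capacity, no elasticity
```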

The BDB Platform provides a robust framework for managing system resources with its comprehensive configuration options. This framework ensures flexible, scalable, and cost-effective resource allocation, effectively supporting diverse and demanding analytics and machine learning workloads.