Data Pipeline
Data pipelines ingest and transfer data from disparate sources, then transform, unify, and cleanse it so that it is suitable for analytics and business reporting.
What is a Data Pipeline?
“It is a collection of procedures that are carried out either sequentially or concurrently when transporting data from one or more sources to a destination. Filtering, enriching, cleansing, aggregating, and even making inferences using AI/ML models may be part of these pipelines.”
Data pipelines are the backbone of the modern enterprise. They move, transform, and store data so that the enterprise can make decisions without delay. Some of these decisions are automated in real time via AI/ML models.
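The steps named in the definition above (filtering, cleansing, enriching, aggregating) can be illustrated with a minimal sketch in plain Python. It is not the product's API: the record fields, the sample data, and every function name are hypothetical and chosen only for illustration. Each stage is a small function, and the stages are chained sequentially.

```python
from collections import defaultdict

# Hypothetical input records; the field names are illustrative only.
RAW_ORDERS = [
    {"region": " north ", "amount": "120.50"},
    {"region": "south", "amount": "80"},
    {"region": "north", "amount": None},  # missing amount, dropped by the filter
    {"region": "South", "amount": "19.50"},
]


def filter_valid(records):
    """Filtering: drop records that have no amount."""
    return (r for r in records if r["amount"] is not None)


def cleanse(records):
    """Cleansing: normalise region names and parse amounts as floats."""
    for r in records:
        yield {"region": r["region"].strip().lower(), "amount": float(r["amount"])}


def enrich(records):
    """Enriching: add a derived flag to each record."""
    for r in records:
        yield {**r, "is_large": r["amount"] >= 100}


def aggregate(records):
    """Aggregating: total amount per region."""
    totals = defaultdict(float)
    for r in records:
        totals[r["region"]] += r["amount"]
    return dict(totals)


def run_pipeline(records):
    """Chain the stages sequentially, source to destination."""
    return aggregate(enrich(cleanse(filter_valid(records))))


if __name__ == "__main__":
    print(run_pipeline(RAW_ORDERS))
```

Because each stage is a generator, records flow through one at a time rather than being materialised between steps, which is one common way to keep a sequential pipeline memory-friendly.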
Readers: Your data repository serves as a reader. It could be a database, a file, or a SaaS application. Read Readers.
Connecting Components: The components that pull or receive data from your source act as event or connecting components. These Kafka-based messaging channels help create a data flow. Read Connecting Components.
Writers: The databases or data warehouses into which the pipelines load data. Read Writers.
Transforms: The series of transformation components that help to cleanse, enrich, and prepare data for smooth analytics. Read Transformations.
Ingestion: Ingestion components allow users to bring data into the pipeline from outside it. Users need to profile the data to decide what to extract through the various Ingestion APIs, based on its structure and how well it fits a business purpose. Read Ingestion.
ML: The Model Runner components allow users to consume, in a pipeline, models created in the R or Python workspace of the Data Science Workbench or saved models from the Data Science Lab. Read AI/ML.
Consumers: These are the real-time (streaming) components that ingest data into the pipeline or monitor data objects in different sources for changes. Read Consumers.
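How a reader, a connecting channel, a transform, and a writer fit together can be sketched as follows. This is an illustrative stand-in, not the platform's implementation: an in-process `queue.Queue` plays the role of the Kafka-based channel, a Python list plays the role of the target database, and all function and field names are hypothetical. The three stages run concurrently as threads.

```python
import queue
import threading

SENTINEL = object()  # marks the end of the stream on a channel


def reader(source, channel):
    """Reader: pull rows from the source and publish them to the channel."""
    for row in source:
        channel.put(row)
    channel.put(SENTINEL)


def transform(inbound, outbound):
    """Transform: cleanse each row from the inbound channel and forward it."""
    while True:
        row = inbound.get()
        if row is SENTINEL:
            outbound.put(SENTINEL)  # propagate end-of-stream downstream
            break
        outbound.put({"name": row["name"].strip().title(), "age": int(row["age"])})


def writer(channel, target):
    """Writer: load rows from the channel into the target store."""
    while True:
        row = channel.get()
        if row is SENTINEL:
            break
        target.append(row)


def run(source):
    """Wire reader -> channel -> transform -> channel -> writer and run them."""
    raw_events = queue.Queue()    # stand-in for a Kafka-based channel
    clean_events = queue.Queue()  # second channel between transform and writer
    target = []                   # stand-in for a database or warehouse table
    stages = [
        threading.Thread(target=reader, args=(source, raw_events)),
        threading.Thread(target=transform, args=(raw_events, clean_events)),
        threading.Thread(target=writer, args=(clean_events, target)),
    ]
    for stage in stages:
        stage.start()
    for stage in stages:
        stage.join()
    return target


if __name__ == "__main__":
    rows = [{"name": " ada lovelace ", "age": "36"}, {"name": "alan turing", "age": "41"}]
    print(run(rows))
```

Placing a channel between every pair of components decouples them: each stage only knows the channel it reads from and the one it writes to, so a stage can be swapped or scaled without touching its neighbours.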