A data pipeline is a set of processes to move data from one or more sources to a destination where it can be stored, processed, and analyzed.
Enterprises as well as tech companies run their businesses using data processing pipelines.
A few examples –
- E-commerce companies collect, validate, and update order data for inventory and shipping through data pipelines.
- Data from IoT sensors flows through data pipelines for monitoring and analysis.
- Real-time financial transactions travel through data pipelines for recording, verification, and compliance reporting.
- Log files are transmitted through pipelines to be collected, parsed, and visualized for troubleshooting and performance monitoring.
- User behavior/click-stream data passes through data pipelines to generate personalized product or content recommendations.
In this episode, let’s take a quick walk through the steps in the evolution of a data pipeline.
Real-world data pipelines can be complex, but the fundamental considerations behind their evolution tend to follow these steps.
Let’s dive in…
Step 1: Start – Move data from Source A to Destination B
This is typically where we start.
We get data from a source; the source (A) could be the structured data store of an application, structured data in a data warehouse, a flat file on a network drive, data in a NoSQL database, or a segment of an MPP data store (e.g., Hadoop).
We process the data in one processor (X). The processing could involve multiple steps, but they all happen within that single processor component.
Once processed, we send the data to the destination (B). Like the source (A), the destination (B) could also be a structured database, a flat file, or an MPP data store.
[Diagram: Source (A) → Processor (X) → Destination (B)]
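To make this concrete, here is a minimal sketch of Step 1 in Python, assuming a CSV flat file as the source and another flat file as the destination (the file names, fields, and transformation are purely illustrative):

```python
import csv

def process_record(record):
    # Illustrative transformation applied inside processor X
    record["amount"] = round(float(record["amount"]) * 1.1, 2)
    return record

def run_pipeline(source_path, destination_path):
    # Read from source A, process in X, write to destination B
    with open(source_path, newline="") as src:
        records = [process_record(row) for row in csv.DictReader(src)]

    if not records:
        return  # nothing to write

    with open(destination_path, "w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=records[0].keys())
        writer.writeheader()
        writer.writerows(records)

if __name__ == "__main__":
    run_pipeline("orders.csv", "processed_orders.csv")
```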
Step 2: Enter Complexity – Add more stages
Over a period of time, the processor becomes complex – it starts to include data pre-processing logic, validation rules, the actual processing algorithms, and post-processing steps.
A logical next step is to make the processor component modular by breaking it into a cohesive set of pre-processor, processor, and post-processor components. There could be more than three stages, but there will be at least these three processor components.
The processor stages can write data to a database or to flat files, from where the next processor stage pulls the data.
The data pipeline can be visualized as in the following diagram (note that the diagram is simplified to show three processors, with pre- and post-processing components; there can be more processors in the data pipeline).
[Diagram: Source (A) → Pre-processor → Processor → Post-processor → Destination (B)]
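A minimal sketch of this modular structure, with hypothetical stage functions chained in sequence (the stage logic is illustrative; in practice each stage could hand off through a database or flat file rather than in memory):

```python
def pre_process(records):
    # Pre-processor: validate and clean incoming records
    return [r for r in records if r.get("order_id")]

def process(records):
    # Processor: apply the core business logic
    for r in records:
        r["total"] = r["quantity"] * r["unit_price"]
    return records

def post_process(records):
    # Post-processor: enrich or format records for the destination
    for r in records:
        r["status"] = "PROCESSED"
    return records

def run_pipeline(records):
    # Chain the stages in order; each stage's output feeds the next
    for stage in (pre_process, process, post_process):
        records = stage(records)
    return records
```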
Step 3: Need Flexibility – Let’s Orchestrate them!
The next logical step is to make the processing flexible.
What this means is that we should be able to configure the processor components to follow conditional logic. If a data packet needs to go through fraud-processing logic, let it pass through processor component Y; if not, it should skip processor component Y and go straight to processor component Z.
As the data processing pipeline is reused in a number of scenarios, this flexibility becomes essential.
To implement this, we configure the meta-logic of processing outside the processors and let an external component orchestrate the flow through them. The external component manages the multi-step processing like a long-running process.
[Diagram: Workflow management component orchestrating the pre-processor, processor, and post-processor stages]
The workflow management component orchestrates the multi-step processing; the processing rules are administered within the workflow management component. (The directions of the arrows shown above are illustrative.)
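As a sketch of this idea, the conditional routing described above could look like the following, with the routing rules held outside the processors in a simple process map (the stage names, the fraud-check rule, and the needs_fraud_check flag are hypothetical):

```python
def fraud_check(packet):
    # Processor component Y: fraud-processing logic
    packet["fraud_checked"] = True
    return packet

def standard_processing(packet):
    # Processor component Z: regular processing
    packet["processed"] = True
    return packet

# The process map lives outside the processors: each entry is (stage, condition)
PROCESS_MAP = [
    (fraud_check, lambda p: p.get("needs_fraud_check", False)),
    (standard_processing, lambda p: True),
]

def orchestrate(packet):
    # External orchestrator: applies only the stages whose condition holds
    for stage, condition in PROCESS_MAP:
        if condition(packet):
            packet = stage(packet)
    return packet

print(orchestrate({"order_id": 1, "needs_fraud_check": True}))
```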
Step 4: Fast – Let’s have On-demand processing
If the processing is modular, reusable, and flexible, the consumers of the data processing pipeline will demand faster processing. It is at this stage that we would like to move to an event-driven processing pipeline.
Event-driven processing components communicate by sending and receiving events, most often through a messaging middleware. This allows components to operate asynchronously, meaning they can process events independently without waiting for immediate responses.
This asynchronous nature is well suited for real-time scenarios where data processing delays can lead to bottlenecks. The data processors generate events on completion, and the other data processors are notified by these completion events.
Event-driven processing introduces choreography among the data processing components, which is in stark contrast to the orchestration enabled by the workflow management component. There is a choice of using either or both. In most complex data pipelines, a fine balance is maintained between orchestration and choreography: processor-to-processor communication is managed through events, while the state of the data pipeline is managed by the process map (in the workflow management solution). A processor notifies its completion state to the workflow management component, and workflow management sends the configuration parameters to the next processor component.
[Diagram: Processors exchanging completion events through messaging middleware, with the workflow management component tracking pipeline state]
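Here is a minimal in-process sketch of that event-driven handoff, using Python's standard queue module as a stand-in for messaging middleware (in a real pipeline this would typically be a broker such as Kafka or RabbitMQ, and the packet contents are illustrative):

```python
import queue
import threading

events = queue.Queue()  # stand-in for the messaging middleware

def processor_a():
    # Processes a packet, then publishes a completion event instead of calling B directly
    packet = {"order_id": 42, "stage": "A_done"}
    events.put(packet)

def processor_b():
    # Subscribes to completion events and processes packets asynchronously
    packet = events.get()        # blocks until a completion event arrives
    packet["stage"] = "B_done"
    print("Processor B finished:", packet)
    events.task_done()

# Neither processor waits on a direct response from the other
threading.Thread(target=processor_b, daemon=True).start()
processor_a()
events.join()
```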
Step 5: High Performance – Let’s scale
Once all of the above are implemented, the data pipeline will work like a well-oiled machine – except for one problem.
The problem is associated with the memory and processing resources of the individual processors. Even though the processors are separated (making them modular, as in Step 2 above), the pipeline's data processing is only as fast (or as slow) as the slowest processor in the pipeline.
To overcome this scalability problem, the processors are scaled by running multiple instances of each processor so that data packets are consumed as soon as they are available (without waiting for a processor to pick them up). You can containerize the processors to reduce the overhead of individual instances, vertically scale the runtime, and run multiple instances as separate processes – all to increase scalability.
Typically, large-scale data pipelines have performance profiles that determine the SLA of data processing. The number of processor instances is the lever you can use to improve the SLA.
[Diagram: Multiple instances of each processor consuming data packets from shared queues]
Remember, the number of instances will not be the same for every processor. Instance counts are calculated so that the data pipeline maintains a steady flow of data packets. The upper and lower limits on the number of instances of each processor are set so that the processors are neither overloaded nor starved of data packets.
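As a rough sketch of this scaling idea, the snippet below runs several worker instances of one processor stage against a shared queue (the instance count and the work done per packet are illustrative; in production these instances would typically be separate containers consuming from a broker):

```python
import queue
import threading

NUM_INSTANCES = 4        # the lever: tune the instance count per processor to meet the SLA
packets = queue.Queue()

def slow_processor(instance_id):
    # One instance of the slowest stage; extra instances keep packets from piling up
    while True:
        packet = packets.get()
        if packet is None:       # sentinel tells this instance to shut down
            packets.task_done()
            break
        packet["processed_by"] = instance_id
        packets.task_done()

workers = [threading.Thread(target=slow_processor, args=(i,)) for i in range(NUM_INSTANCES)]
for w in workers:
    w.start()

for order_id in range(100):      # feed data packets into the shared queue
    packets.put({"order_id": order_id})
for _ in workers:                # one shutdown sentinel per instance
    packets.put(None)

packets.join()
```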
Conclusion
Efficient, well-designed data pipelines lead to better-informed strategic choices and quicker actions.
They can improve customer experience and provide opportunities to personalize services based on analyzed data.
They can also surface patterns that help mitigate risks and ensure compliance with regulations.
For modern enterprises, the applications of data pipelines are endless and their impact is significant.
Even though data pipelines differ in design, the considerations of complexity, flexibility, real-time/on-demand processing, and scalability are timeless principles applicable to all data pipelines.
The steps above give a pragmatic perspective on the non-functional aspects of a data pipeline.
Hope this is useful.
That’s all for this week.
Till next week…