In today’s episode, I’m going to share how to integrate innovative analytics solution patterns even in a traditional, large Banking enterprise.
You might be asking yourself, “What’s the big deal? It’s just building a new data platform for an enterprise”, but that’s not entirely true.
Building a green-field modern data platform from scratch is not difficult. Creating one for an enterprise is difficult, because the enterprise has a legacy, and there are many guardians of that legacy fiercely protecting their turf.
For these guardians in a traditional enterprise, innovative patterns mean disruptive, big-bang work that costs millions and puts jobs at risk.
But that’s not true.
We can bring in innovation through improvement, by making the existing solution more modular.
We can innovate by optimizing components. We can scale gradually. We can reduce cycle time.
And so on.
Here’s how to adopt 4 innovative patterns to enhance and optimize an enterprise data platform and make it more efficient.
This comes from a data platform implementation we recently completed for a Banking customer, as an enhancement to the existing solution platform built a few years back.
Collaborative data exploration
A traditional data platform doesn’t come with a data exploration mechanism. But we wanted to build one for the customer.
Notebooks are web-based, interactive data exploration tools. We used them for the customer because of their ease of use, and to distribute the machine learning work to data analysts and data scientists in a controlled (secured) environment.
After months of discussion, the customer was sold on the aspiration of using notebooks as a whiteboarding tool for data analysts, much like designers use whiteboarding tools for customer journey mapping. It would help them do rapid exploration and prototyping, and get a sense of what type of ML models would work.
We explicitly didn’t want to take the notebooks any further than data exploration; once ideas crystallized, we moved them out of notebooks and into the codebase so that they could be deployed in production for consumption.
We used the Google Colab notebook environment. Like Jupyter notebooks, it supports markdown for formatted text, images, and equations, but Colab is a Jupyter notebook on steroids because it has built-in access to GPUs and TPUs for accelerated computation.
Some cool features that we used:
- Integration with Google Drive and Google Cloud Storage (GCS) for storage and sharing of notebooks.
- Integration with ML libraries and frameworks like TensorFlow and PyTorch.
- Real-time collaboration with others through simultaneous editing and commenting.
You can access Google Colab here: https://colab.research.google.com/.
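As a rough illustration of the kind of exploration this enables (the file path and dataset below are hypothetical, not from the actual engagement), a single Colab cell can mount Drive, confirm whether a GPU runtime is attached, and start profiling a shared dataset:

```python
# Minimal Colab sketch: mount Google Drive for shared notebooks/data,
# confirm GPU availability, and take a first look at a dataset.
from google.colab import drive
import tensorflow as tf
import pandas as pd

drive.mount('/content/drive')                    # interactive auth prompt inside Colab
print(tf.config.list_physical_devices('GPU'))    # empty list if no GPU runtime is selected

# Hypothetical shared file, used only to illustrate quick profiling.
df = pd.read_csv('/content/drive/MyDrive/exploration/sample_customers.csv')
print(df.describe())
```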
Modular Dataset creation jobs
If you have worked on building an ML pipeline for a production system, you would know that we want to make the steps in the pipeline as modular as possible. In this project we created two cleanly separated steps in the ML pipeline, as two separate jobs:
- one job for creating the dataset
- another job for using the dataset in training
This approach enables the team to experiment with training different models on the same dataset (similar in spirit to the 12-factor separation of build and run: build once, run anywhere).
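Here is a minimal sketch of the split, with hypothetical names and the real work stubbed out; the only contract between the two jobs is the GCS URI of a dataset snapshot:

```python
# Sketch of the two-job split: independent entry points connected only by the
# GCS URI of a dataset snapshot.
import argparse

def create_dataset(snapshot_uri: str) -> str:
    # In the real job: run the BigQuery transformations and export the result to GCS.
    print(f"dataset snapshot written to {snapshot_uri}")
    return snapshot_uri

def train_model(snapshot_uri: str, model_name: str) -> None:
    # In the real job: download the snapshot and train the chosen model.
    # The same snapshot can be reused with any number of candidate models.
    print(f"training '{model_name}' on {snapshot_uri}")

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("job", choices=["create-dataset", "train"])
    parser.add_argument("--snapshot-uri", required=True)
    parser.add_argument("--model", default="baseline")
    args = parser.parse_args()
    if args.job == "create-dataset":
        create_dataset(args.snapshot_uri)
    else:
        train_model(args.snapshot_uri, args.model)
```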
We orchestrated several BigQuery SQL queries to create the data we needed to train the model(s), utilizing BigQuery’s powerful data manipulation functions for data transformations and cleansing.
As you would imagine, we leveraged BigQuery’s scalable infrastructure to process large datasets efficiently.
The more we studied the features of Google BigQuery, the more we were amazed by the number of platform services available for use:
- BigQuery’s APIs and client libraries to orchestrate data preparation programmatically.
- BigQuery’s native scheduling capabilities to schedule data preparation jobs
- Cloud Composer or Cloud Dataflow to automate data preparation workflows
- BigQuery’s native integration with other Google Cloud services (e.g. – GCS)
- BigQuery’s monitoring and logging features to monitor job progress and performance
After processing the data in BigQuery, we saved dataset snapshots to GCS.
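A minimal sketch of this orchestration using the BigQuery Python client, assuming placeholder project, dataset, table, and bucket names (the actual SQL and resources in the engagement were of course different):

```python
# Sketch: run a BigQuery transformation, materialize the result,
# and export a dated snapshot to GCS for the training job.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")   # assumes application-default credentials

sql = """
    SELECT customer_id, AVG(balance) AS avg_balance   -- illustrative transformation
    FROM `my-project.raw.transactions`
    GROUP BY customer_id
"""

destination = bigquery.TableReference.from_string("my-project.curated.training_dataset")
job_config = bigquery.QueryJobConfig(
    destination=destination,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
client.query(sql, job_config=job_config).result()   # run the transformation

# Export the curated table as a snapshot on GCS.
snapshot_uri = "gs://my-bucket/snapshots/training_dataset/2024-01-01/part-*.csv"
client.extract_table(destination, snapshot_uri).result()
```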
Model training jobs
As we had created two separate jobs in the ML pipeline, the datasets saved in the step above are used by the model training job.
It begins by downloading a snapshot of the data, and the job is submitted to Google AI Platform.
All jobs run as custom containers on Google’s AI Platform. This gives the team the flexibility to pick what type of instance to run training jobs on: Python code in a container with a Python runtime, Spark MLlib jobs, or GPU-backed deep learning jobs.
We used Makefile commands to create and submit jobs to the AI Platform (a sketch of the submission step follows the list):
- Build code and dependencies as a Python package or container image
- Package the data and code into a Cloud Storage bucket
- Submit the job to AI Platform using the gcloud command-line
- Monitor the logs & performance through AI Platform’s web console
- Utilize AI Platform’s automatic scaling to adjust resources based on workload.
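Under the hood, the submission step boils down to a gcloud call roughly like the one below; the project, image, region, and bucket names are placeholders, and the exact scale tier depends on the workload:

```python
# Sketch of the job-submission step the Makefile wraps: a custom-container
# training job submitted to AI Platform via the gcloud CLI.
import subprocess
from datetime import datetime

job_name = f"train_model_{datetime.utcnow():%Y%m%d_%H%M%S}"   # job names must be unique
subprocess.run(
    [
        "gcloud", "ai-platform", "jobs", "submit", "training", job_name,
        "--region=europe-west2",                                  # placeholder region
        "--master-image-uri=gcr.io/my-project/training:latest",   # image built by the Makefile
        "--scale-tier=BASIC_GPU",                                 # or CUSTOM for bigger machines
        "--",                                                     # everything after -- goes to the container
        "--snapshot-uri=gs://my-bucket/snapshots/training_dataset/2024-01-01/part-*.csv",
    ],
    check=True,
)
# Logs can then be tailed with: gcloud ai-platform jobs stream-logs <job_name>
```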
The final step of a model training job is to save the trained model by uploading it to a model registry.
This is a single place where all machine learning models are uploaded, and it enables the Bank to retrieve the models for inference across different value streams.
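The registry is sketched below as a versioned path on GCS; the bucket layout, model name, and file format are illustrative assumptions, and the Bank’s actual registry may differ:

```python
# Final step of a training job, sketched as uploading the serialized model to a
# versioned GCS path acting as the model registry (names are placeholders).
from google.cloud import storage

def register_model(local_path: str, model_name: str, version: str) -> str:
    client = storage.Client()
    bucket = client.bucket("my-model-registry")                        # placeholder bucket
    blob = bucket.blob(f"models/{model_name}/{version}/model.joblib")  # one path per model version
    blob.upload_from_filename(local_path)
    return f"gs://{bucket.name}/{blob.name}"                           # URI other value streams can resolve

registry_uri = register_model("model.joblib", "customer_preference", "v1")
print(f"model registered at {registry_uri}")
```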
Consumption of Models
This is one of the most important parts of an analytics solution, yet it is often overlooked by solution teams.
Unless the analytics output is integrated with business journeys (of customers or of back-end processes), data monetization remains only a potential outcome. It is very interesting how this was thought through in this program.
We had some models that we wanted to run on a schedule, for example every day or every week, for recurring updates of customer preferences. When the predictions are generated, the prediction scores are inserted back into BigQuery.
But a question remains: how do we trigger a specific action on a customer’s account? For example, how can we send a notification to a customer to expedite the next repayment if the predicted score is below the good-customer threshold?
This is where some innovative thinking led us to publish prediction results as a series of events onto the messaging middleware (Google Pub/Sub), so that consumer applications can pick up these events and trigger actions (a notification to the customer, or an alert on an operator dashboard).
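A minimal sketch of publishing one prediction as a business event on Pub/Sub; the project, topic, attribute, and field names are assumptions for illustration:

```python
# Sketch: publish a prediction score as an event so downstream applications
# (customer notifications, operator dashboards) can react to it.
import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "prediction-events")   # placeholder topic

event = {
    "customer_id": "12345",
    "model": "customer_preference",
    "score": 0.42,
    "below_threshold": True,   # e.g. below the "good customer" threshold, so an action is warranted
}
future = publisher.publish(
    topic_path,
    data=json.dumps(event).encode("utf-8"),
    event_type="customer.prediction.updated",   # message attribute consumers can filter on
)
future.result()   # block until Pub/Sub acknowledges the event
```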
TL;DR
We adopted 4 innovative patterns to create a modern data platform that delivers business outcomes to a Banking enterprise:
- Collaborative data exploration – for scanning data quickly and collaboratively, and agreeing on the algorithms and models to be deployed in production.
- Modular Dataset creation jobs – separating dataset creation from model training helped us scale data creation and apply multiple models to the same dataset independently, resulting in faster insights, insights at scale, and a wider variety of insights.
- Model training jobs – modular training jobs deployed on the cloud AI Platform simplified MLOps and built up a wide inventory of models for the Bank.
- Consumption of Models – the traditional way to operationalize ML insights is to expose them as a service. Another way, on demand and at a much more granular level, is to generate analytical insights as business events; customer-facing services and back-end processing workloads consume these events as required.
Modular components, optimized processes, and adoption of new technology can help evolve solutions incrementally.
Conclusion
This is a classic case of adopting new patterns in an enterprise.
Even though the final outcomes of the solution will only be seen in the coming months and quarters, it gives a blueprint for adopting new patterns in an enterprise data platform without disrupting business and tech operations.
A similar rationale can be followed for patterns in other solutions: integration patterns, API design patterns, deployment patterns, observability patterns.
Well, that’s all for today – 4 innovative patterns in a data platform for a large enterprise.
Till next week!