Canonical Data Platform: that was 2021

It’s that time of the year again: many folks are panic buying cans of windscreen de-icer spray and thermal underwear, bringing pine trees into the front room and preparing to enjoy an extended break with the family. So we thought to ourselves, what better time than now to take a look back at the year gone by on the Canonical Data Platform?

Data Lab – Charmed Kubeflow warms up

In March, Aymen examined running an AI data lab on-premises, and surveyed the benefits of this approach as well as the tools and drivers to do so. We learned about cost reduction, easy compliance with data governance norms, and the platform tooling needed to build a cloud-agnostic, on-premises-ready financial market forecasting platform for data-driven simulations and active trading.

In April, Rui took a look at how to set up production-ready AI model scoring infrastructure with Kubeflow’s KFServing engine. We learned how to set up auto-scaling rules for AI and ML scoring servers running TensorFlow, PyTorch, XGBoost, scikit-learn and ONNX workloads, went a little deeper on configuring canary and blue/green rollouts of new models and data preprocessing pipelines with Kubeflow, and took a little peek at model explainability too.

Also in April, Maciej took us through the process of deploying and configuring NVIDIA RAPIDS and NGC Containers on Ubuntu, as well as the design decisions behind his recommended setup.

In May, Rui went further into the world of model servers with an in-depth guide to ML model serving. We learned about different approaches to model serving, like embedding the AI/ML model in the app, exposing the AI/ML model as an API, or packaging the model as a library. We also learned about some of the complexities of managing and automating AI/ML at scale.
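Of those approaches, exposing the model as an API is perhaps the most common in production. As a minimal sketch using only Python’s standard library, with a stand-in scoring function in place of a real trained model (the endpoint path, payload shape and port here are illustrative assumptions, not taken from Rui’s guide):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict(features):
    """Stand-in 'model': a real deployment would load a trained
    TensorFlow/PyTorch/scikit-learn artifact here instead."""
    return {"score": sum(features) / max(len(features), 1)}

class ScoringHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read a JSON request body, e.g. {"features": [1.0, 2.0, 3.0]}
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        result = predict(payload["features"])
        body = json.dumps(result).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

# To serve: HTTPServer(("", 8080), ScoringHandler).serve_forever()
```

A client would then POST its feature vector as JSON and get a score back; swapping `predict` for a real model artifact is the only serving-specific change, which is exactly why this pattern decouples model training from application code.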

Also in May, I (Rob) took a deeper look at the differences, similarities and overlap between data lab, data hub and data lake, and attempted an in-depth answer to the question of what a data lab is and how to architect one. I examined the vocabulary and concepts of the data lab, the benefits of setting one up, blueprints and example technologies for a data lab using open source solution building blocks, and strategies for accelerating data lab initiatives as a part of business transformation programmes.

In July, Aymen had a go at building Kubeflow pipelines using Jupyter notebooks and KALE, and in this YouTube how-to guide we learned how to drive our pipelines using notebook annotations and Python, and automatically run them at scale on Kubernetes. All in a day’s work!

Data on Ubuntu – getting hot!

In May, I examined the advantages of dealing with a mountain of data by sharding PostgreSQL versus setting up a data lake in Let’s play: sharded big data PostgreSQL. We learned about the advantages of using GPUs to accelerate query times for data warehousing at scale; system consolidation versus Hadoop; and concurrent query benefits too. We also briefly examined sharding PostgreSQL deployments using the native postgres_fdw foreign data wrapper’s remote server feature.
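The native sharding mechanism mentioned above builds on PostgreSQL’s postgres_fdw foreign data wrapper, which lets a coordinator node attach tables on remote servers as partitions of a local table. As a hedged illustration (the server names, table layout and hash-partition scheme are assumptions, not taken from the original post), a small helper can assemble the coordinator-side SQL:

```python
def shard_attach_sql(shard_id: int, host: str, dbname: str) -> str:
    """Build the coordinator-side SQL to attach one remote shard via
    postgres_fdw. All object names are illustrative placeholders; the
    hash-partitioned parent table 'events' is assumed to exist."""
    return "\n".join([
        # postgres_fdw ships with PostgreSQL as a contrib extension.
        "CREATE EXTENSION IF NOT EXISTS postgres_fdw;",
        # Register the remote node and map the local user onto it.
        f"CREATE SERVER shard_{shard_id} FOREIGN DATA WRAPPER postgres_fdw"
        f" OPTIONS (host '{host}', dbname '{dbname}');",
        f"CREATE USER MAPPING FOR CURRENT_USER SERVER shard_{shard_id}"
        " OPTIONS (user 'app', password 'secret');",
        # Attach the remote table as one hash partition of the parent,
        # so queries against 'events' fan out to the shards.
        f"CREATE FOREIGN TABLE events_shard_{shard_id}"
        " PARTITION OF events"
        f" FOR VALUES WITH (MODULUS 4, REMAINDER {shard_id})"
        f" SERVER shard_{shard_id};",
    ])

print(shard_attach_sql(0, "10.0.0.11", "events_db"))
```

Running the generated statements once per shard gives the coordinator a single logical table whose partitions live on different machines, which is the essence of the sharded setup discussed in the post.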

In August, we had a lot of fun with a four-part blog series that got super hands-on with running Apache Spark on MicroK8s and Ubuntu Core, in the cloud, using all the things including LXD clustering, Jupyter notebooks and nested virtualization on Google Cloud Platform.

And after taking time to wish Linux a very happy 30th birthday, we took a good look at how DataOps teams can benefit from adopting the model-driven operations paradigm. We learned how teams often get lost in the plumbing and spend more time fixing broken things than focusing on productive work; and that by modelling, and eventually automating, their tasks, they can free up more time to get into a flow state.

Also in August, I had a look at Cloud PaaS through the lens of open source and the shared responsibility model, as well as some of the challenges and disadvantages PaaS presents versus open source software, and I expressed my vision of how PaaS delivered as open source software can offer the best of both worlds.

In September, we published 7 approaches to accelerating Apache Kafka on K8s, a white paper that takes an in-depth look at your options, and the tradeoffs, for building a high-volume, low-latency Kafka cluster.

In October, Ubuntu 21.10 arrived, complete with an Apache Cassandra snap – making Cassandra cluster deployment easier than ever. With 21.10, we announced the official ECR and Docker Hub Ubuntu LTS Docker container image collection, including images for PostgreSQL, MySQL, Redis, Memcached and Cassandra!

Also in October, I took another look at pet servers, and questioned whether developing a sophisticated infrastructure as code deployment automation solution is always the right approach to take for stateful systems.

Microsoft SQL Server is as Microsoft SQL Server does

In November, we announced the SQL Server on Ubuntu Pro for Azure solution, co-engineered by Canonical and Microsoft and published by Microsoft on the Azure portal. I delivered a review of the venerable history of SQL Server in SQL Server on Ubuntu Pro: bringing it all back home and we lifted up the bonnet and took a good look at the mechanics inside – SQL Server high availability clustering – in the webinar, Introducing SQL Server on Ubuntu Pro for Azure Part 1.

In December, we explored the value that the SQL Server on Ubuntu Pro for Azure solution brings to enterprise IT departments in the whitepaper Evaluating Microsoft SQL Server Options for Azure, and examined pricing models, support options, SLAs, licensing and compliance, patch management, hardening, accreditation and TCO for a range of different SQL Server on Azure options.

Data-driven – into 2022 and beyond

2022 will be a big year for the Canonical Data Platform, and we’re super excited about all the great things we’re going to share with you next year.

Until then, stay warm, grab a blanket and your laptop, and enjoy the quiet times and the space it gives you to hone your data and AI skills!