[Kubernetes Data Platform][Part 1]: Introduction
Why This Series?
As a Data Platform Engineer based in Vietnam, I have over five years of experience deploying Data Platforms across a range of technologies and environments, including the Hadoop ecosystem, public clouds (Azure, AWS), and on-premises Kubernetes clusters.
After Hortonworks Data Platform (HDP) was discontinued, building modern Data Platforms presented new challenges. I saw Kubernetes as a promising foundation for data-driven applications, and with that conviction I invested considerable effort into researching and successfully implementing a modern data platform on Kubernetes.
I’ve observed a scarcity of comprehensive documentation, tutorials, and hands-on training materials for this specific area. To address this gap, I’ve created this series to showcase the deployment of a production-ready Data Platform (Data Lakehouse) on Kubernetes.
This series serves a twofold purpose: to reinforce my own knowledge and to share insights with those interested in Data Platform deployment or the Data Platform Engineer role.
Data Lakehouse Architecture
A data platform is essential for organizations handling vast amounts of data. It provides a centralized infrastructure for data management, processing, and analysis, empowering businesses to extract valuable insights and make informed decisions.
A well-structured data platform, following a Data Lakehouse architecture, typically comprises three primary components:
1. Main Components
These are the essential elements that form the foundation of a data platform. They include:
- Storage: A system for storing and managing large volumes of data. Examples include HDFS, MinIO, and more.
- Metadata Store: A repository for storing and managing metadata about the data, such as its location, format, and structure. Examples include Hive Metastore, the Iceberg REST Catalog, and Nessie.
- Table Format: The way tables are organized on top of files in the storage system. Table formats such as Apache Iceberg, Hudi, Delta Lake, and Paimon track table-level metadata over file formats such as Parquet, ORC, and Avro.
- Data Processing Engine: A tool for processing and transforming data. Examples include Apache Spark, Trino/Presto, Apache Flink, and Apache Beam.
- Online Data Warehouse: A database that serves as a low-latency interface for applications to access and utilize data from the data platform. Examples include Apache Doris and StarRocks.
- Data Ingestion: The process of collecting data from various sources and loading it into the data platform. Examples include Apache Kafka, Airbyte, Apache SeaTunnel, Debezium, Flink CDC, and custom Python scripts.
- Data Orchestration: A system for managing and scheduling data pipelines, ensuring that data is processed and transformed in the correct order. Examples include Apache Airflow, Prefect, Dagster, Mage-ai, and Kestra.
- CI/CD (Optional): A Continuous Integration/Continuous Delivery system for the data platform itself. Examples include Jenkins and Argo CD.
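To make the layering concrete, here is a hypothetical Trino catalog definition that wires three of the components above together: Iceberg as the table format, Hive Metastore as the metadata store, and MinIO as the S3-compatible storage. Property names follow recent Trino releases; the service hostnames and credentials are placeholders, not values from this series.

```properties
# Hypothetical Trino catalog file (etc/catalog/lakehouse.properties).
# Table format: Iceberg; metadata store: Hive Metastore; storage: MinIO.
# Hostnames and credentials below are placeholders.
connector.name=iceberg
iceberg.catalog.type=hive_metastore
hive.metastore.uri=thrift://hive-metastore:9083
fs.native-s3.enabled=true
s3.endpoint=http://minio:9000
s3.region=us-east-1
s3.path-style-access=true
s3.aws-access-key=admin
s3.aws-secret-key=changeme
```

With this one file, `SELECT`s in Trino resolve table metadata through the metastore and read data files directly from MinIO, which is exactly the separation of concerns the component list above describes.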
2. Real-time Components
These components process and analyze data streams in real time, enabling organizations to gain immediate insights from data as it is generated. Examples include:
- Kafka: A distributed streaming platform for handling real-time data feeds.
- Flink: A stream processing framework for building real-time applications.
- Spark Streaming: A module of Apache Spark for real-time data processing.
- RisingWave: A Postgres-compatible streaming database designed for simple, cost-efficient processing, analysis, and management of real-time event streams.
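The engines above differ enormously in scale and guarantees, but they share a core pattern: aggregating an unbounded event stream over windows. As a toy illustration (standard library only, no real streaming engine), here is a tumbling-window count over an ordered event stream:

```python
from collections import Counter
from typing import Iterable, Iterator

def tumbling_window_counts(
    events: Iterable[tuple[int, str]], window_size: int
) -> Iterator[tuple[int, Counter]]:
    """Group ordered (timestamp, key) events into fixed, non-overlapping
    windows and emit a per-window count of keys -- the basic pattern behind
    streaming aggregations in Flink, Spark Streaming, or RisingWave."""
    current_window = None
    counts: Counter = Counter()
    for ts, key in events:
        window = ts - ts % window_size  # start of the window this event falls in
        if current_window is not None and window != current_window:
            yield current_window, counts  # window closed: emit its result
            counts = Counter()
        current_window = window
        counts[key] += 1
    if current_window is not None:
        yield current_window, counts  # flush the last open window

# Events arrive ordered by timestamp; tumbling windows of 10 time units.
events = [(1, "click"), (4, "view"), (12, "click"), (15, "click"), (23, "view")]
for start, counts in tumbling_window_counts(events, window_size=10):
    print(start, dict(counts))
# 0 {'click': 1, 'view': 1}
# 10 {'click': 2}
# 20 {'view': 1}
```

Real streaming engines apply the same idea with distributed state, fault tolerance, and handling for late and out-of-order events, which this sketch deliberately ignores.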
3. Applications
Data platforms serve as the foundation for various data-driven applications that utilize the processed and analyzed data. These applications can be categorized into:
- Data API: APIs that provide programmatic access to data for consumption by other applications or services. Examples include services built with Spring Boot and FastAPI.
- Dashboard and BI: Tools for creating interactive dashboards and visualizations to explore and analyze data. Examples include Power BI, Tableau, and Superset.
- Ad-hoc Analysis: Tools that let users perform exploratory data analysis without pre-defined queries. Examples include Jupyter Notebook and Google Data Studio (now Looker Studio).
- Machine Learning: Frameworks and tools for building and deploying machine learning models using the data stored in the data platform. Examples include TensorFlow, PyTorch, and scikit-learn.
- Data Applications: Web applications (e.g., Streamlit) that directly use data from the platform to provide specific functionality or services. Examples include recommendation engines, fraud detection systems, and customer segmentation tools.
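As a minimal sketch of the Data API idea, here is a toy HTTP service built only with the Python standard library. A real deployment would use Spring Boot or FastAPI as listed above, and would query the online data warehouse instead of the hard-coded in-memory dict used here:

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Toy "serving layer": a real Data API would query the online data
# warehouse (e.g. Doris/StarRocks) rather than this in-memory dict.
METRICS = {"daily_active_users": 1234, "orders_today": 87}

class DataAPIHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        key = self.path.strip("/")
        if key in METRICS:
            body = json.dumps({key: METRICS[key]}).encode()
            self.send_response(200)
        else:
            body = json.dumps({"error": "unknown metric"}).encode()
            self.send_response(404)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), DataAPIHandler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f"http://127.0.0.1:{server.server_port}/daily_active_users"
with urllib.request.urlopen(url) as resp:
    print(resp.read().decode())  # {"daily_active_users": 1234}
server.shutdown()
```

The point is the shape of the component, not the framework: a thin, read-only HTTP layer between consumers and the platform's serving store.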
What Next?
In the next chapter of this series, we’ll dive into the hands-on deployment of these key components:
Highly Available (HA) Kubernetes Cluster: We’ll explore setting up HA clusters using tools like kubeadm, k3s, k0s, and rke2.
Data Stack Deployment:
- Main Components: We’ll walk through implementing essential data stack components like MinIO, Hive Metastore, Spark, Trino, Airflow, and more on the Kubernetes platform.
- Real-time Components: Deploy Kafka, Flink, and RisingWave to enable real-time data processing capabilities within the Kubernetes environment.
- Application Components: We’ll cover integrating Data APIs (Spring Boot, FastAPI), BI tools (Superset), ad-hoc analysis tools (JupyterHub), web applications (Streamlit), and machine learning tools (MLflow, Kubeflow, etc.) into your Kubernetes environment.
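For the HA cluster piece, kind (listed in the prerequisites below) can simulate a multi-node control plane on a single machine. A hypothetical lab configuration might look like this; the node counts are illustrative, not values from a later chapter:

```yaml
# Hypothetical kind config for a small HA-style lab cluster:
# three control-plane nodes and two workers.
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: control-plane
  - role: control-plane
  - role: worker
  - role: worker
```

Create it with `kind create cluster --config ha-cluster.yaml`. This approximates an HA topology for practice; production HA additionally needs a load balancer in front of the API servers, which kubeadm-, k3s-, or rke2-based setups address.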
This series will provide a step-by-step guide, making it accessible for everyone. However, a basic understanding of Kubernetes resources and data stacks will be helpful in solidifying your knowledge.
Lab Environment Setup:
I recommend a Linux distribution such as Ubuntu or Arch for this series, as I’ve had success with both; I haven’t tested on Windows yet. Get ready to set up your lab environment on your personal laptop as follows:
Pre-requisites:
To ensure a smooth lab experience, please install the following tools on your machine:
- docker: https://www.docker.com/
- docker-compose: https://github.com/docker/compose
- sysbox runtime: https://github.com/nestybox/sysbox
- kind: https://kind.sigs.k8s.io/
- kubectl: https://kubernetes.io/docs/reference/kubectl/
- VirtualBox: https://www.virtualbox.org/wiki/Downloads
- Vagrant: https://developer.hashicorp.com/vagrant
- Multipass: https://multipass.run/docs/tutorial
- Helm: https://helm.sh/
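Once everything is installed, a quick way to confirm the CLI tools are on your PATH is a short Python check (sysbox and VirtualBox are installed differently, so they are omitted here):

```python
import shutil

# Prerequisite CLIs from the list above (sysbox and VirtualBox are not
# plain PATH-resolvable CLIs, so they are not checked here).
TOOLS = ["docker", "docker-compose", "kind", "kubectl", "vagrant", "multipass", "helm"]

def check_tools(tools):
    """Map each tool name to its resolved path, or None if not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}

for tool, path in check_tools(TOOLS).items():
    print(f"{tool:16s} {path or 'NOT FOUND'}")
```

Anything reported as NOT FOUND can be installed from the corresponding link above before moving on to the next chapter.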