[Kubernetes Data Platform][Part 1]: Introduction
Why This Series?
As a Data Platform Engineer based in Vietnam, I have over five years of experience deploying Data Platforms across a range of technologies and environments, including the Hadoop ecosystem, public clouds (Azure, AWS), and on-premises Kubernetes clusters.
After Hortonworks Data Platform (HDP) was discontinued, building modern Data Platforms presented new challenges. I saw Kubernetes as a promising foundation for data-driven applications, and with that conviction I invested considerable effort into researching and successfully implementing a modern data platform on Kubernetes.
I’ve observed a scarcity of comprehensive documentation, tutorials, and hands-on training materials for this specific area. To address this gap, I’ve created this series to showcase the deployment of a production-ready Data Platform (Data Lakehouse) on Kubernetes.
This series serves a twofold purpose: to reinforce my own knowledge and to share insights with those interested in Data Platform deployment or the Data Platform Engineer role.
Data Lakehouse Architecture
A data platform is essential for organizations handling vast amounts of data. It provides a centralized infrastructure for data management, processing, and analysis, empowering businesses to extract valuable insights and make informed decisions.
A well-structured data platform, following a Data Lakehouse architecture, typically comprises three primary components:
1. Main Components
These are the essential elements that form the foundation of a data platform. They include:
- Storage: A system for storing and managing large volumes of data. Examples include HDFS, MinIO, and more.
- Metadata Store: A repository for storing and managing metadata about the data, such as its location, format, and structure. Examples include Hive Metastore, the Iceberg REST Catalog, and Nessie.
- Table Format: The way tables are organized on top of files in the storage system. Table formats such as Apache Iceberg, Hudi, Delta Lake, and Paimon track table-level metadata over file formats such as Parquet, ORC, and Avro.
- Data Processing Engine: A tool for processing and transforming data. Examples include Apache Spark, Trino/Presto, Apache Flink, and Apache Beam.
- Online Data Warehouse: A database that serves as a low-latency interface for applications to access and utilize data from the data platform. Examples include Apache Doris and StarRocks.
- Data Ingestion: The process of collecting data from various sources and loading it into the data platform. Examples include Apache Kafka, Airbyte, Apache SeaTunnel, Debezium, Flink CDC, and custom Python scripts.
- Data Orchestration: A system for managing and scheduling data pipelines, ensuring that data is processed and transformed in the correct order. Examples include Apache Airflow, Prefect, Dagster, Mage-ai, and Kestra.
- CI/CD (Optional): A Continuous Integration/Continuous Delivery system for the data platform itself. Examples include Jenkins and Argo CD.
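To make the layering concrete, here is a hypothetical Trino catalog definition that wires three of the components above together: Iceberg as the table format, Hive Metastore as the metadata store, and MinIO as the S3-compatible storage. Property names follow recent Trino releases; the service hostnames and credentials are placeholders, not values from this series.

```properties
# Hypothetical Trino catalog file (etc/catalog/lakehouse.properties).
# Table format: Iceberg; metadata store: Hive Metastore; storage: MinIO.
# Hostnames and credentials below are placeholders.
connector.name=iceberg
iceberg.catalog.type=hive_metastore
hive.metastore.uri=thrift://hive-metastore:9083
fs.native-s3.enabled=true
s3.endpoint=http://minio:9000
s3.region=us-east-1
s3.path-style-access=true
s3.aws-access-key=admin
s3.aws-secret-key=changeme
```

With this one file, `SELECT`s in Trino resolve table metadata through the metastore and read data files directly from MinIO, which is exactly the separation of concerns the component list above describes.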
2. Real-time Components
These components process and analyze data streams in real time, enabling organizations to gain immediate insights from data as it is generated. Examples include:
- Kafka: A distributed streaming platform for handling real-time data feeds.
- Flink: A stream processing framework for building real-time applications.
- Spark Streaming: A module of Apache Spark for real-time data processing.
- RisingWave: A Postgres-compatible streaming database designed for simple, cost-efficient processing, analysis, and management of real-time event streams.
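The engines above differ enormously in scale and guarantees, but they share a core pattern: aggregating an unbounded event stream over windows. As a toy illustration (standard library only, no real streaming engine), here is a tumbling-window count over an ordered event stream:

```python
from collections import Counter
from typing import Iterable, Iterator

def tumbling_window_counts(
    events: Iterable[tuple[int, str]], window_size: int
) -> Iterator[tuple[int, Counter]]:
    """Group ordered (timestamp, key) events into fixed, non-overlapping
    windows and emit a per-window count of keys -- the basic pattern behind
    streaming aggregations in Flink, Spark Streaming, or RisingWave."""
    current_window = None
    counts: Counter = Counter()
    for ts, key in events:
        window = ts - ts % window_size  # start of the window this event falls in
        if current_window is not None and window != current_window:
            yield current_window, counts  # window closed: emit its result
            counts = Counter()
        current_window = window
        counts[key] += 1
    if current_window is not None:
        yield current_window, counts  # flush the last open window

# Events arrive ordered by timestamp; tumbling windows of 10 time units.
events = [(1, "click"), (4, "view"), (12, "click"), (15, "click"), (23, "view")]
for start, counts in tumbling_window_counts(events, window_size=10):
    print(start, dict(counts))
# 0 {'click': 1, 'view': 1}
# 10 {'click': 2}
# 20 {'view': 1}
```

Real streaming engines apply the same idea with distributed state, fault tolerance, and handling for late and out-of-order events, which this sketch deliberately ignores.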
3. Applications
Data platforms serve as the foundation for various data-driven applications that utilize the processed and analyzed data. These applications can be categorized into:
- Data API: APIs that provide programmatic access to data for consumption by other applications or services. Examples include services built with Spring Boot and FastAPI.
- Dashboard and BI: Tools for creating interactive dashboards and visualizations to explore and analyze data. Examples include Power BI, Tableau, and Superset.
- Ad-hoc Analysis: Tools that let users perform exploratory data analysis without pre-defined queries. Examples include Jupyter Notebook and Google Data Studio (now Looker Studio).
- Machine Learning: Frameworks and tools for building and deploying machine learning models using the data stored in the data platform. Examples include TensorFlow, PyTorch, and scikit-learn.
- Data Applications: Web applications (e.g., Streamlit) that directly use data from the platform to provide specific functionality or services. Examples include recommendation engines, fraud detection systems, and customer segmentation tools.
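As a minimal sketch of the Data API idea, here is a toy HTTP service built only with the Python standard library. A real deployment would use Spring Boot or FastAPI as listed above, and would query the online data warehouse instead of the hard-coded in-memory dict used here:

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Toy "serving layer": a real Data API would query the online data
# warehouse (e.g. Doris/StarRocks) rather than this in-memory dict.
METRICS = {"daily_active_users": 1234, "orders_today": 87}

class DataAPIHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        key = self.path.strip("/")
        if key in METRICS:
            body = json.dumps({key: METRICS[key]}).encode()
            self.send_response(200)
        else:
            body = json.dumps({"error": "unknown metric"}).encode()
            self.send_response(404)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), DataAPIHandler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f"http://127.0.0.1:{server.server_port}/daily_active_users"
with urllib.request.urlopen(url) as resp:
    print(resp.read().decode())  # {"daily_active_users": 1234}
server.shutdown()
```

The point is the shape of the component, not the framework: a thin, read-only HTTP layer between consumers and the platform's serving store.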
What Next?
In the next chapter of this series, we’ll dive into the hands-on deployment of these key components:
Highly Available (HA) Kubernetes Cluster: We’ll explore setting up HA clusters using tools like kubeadm, k3s, k0s, and rke2.
Data Stack Deployment:
- Main Components: We’ll walk through implementing essential data stack components like MinIO, Hive Metastore, Spark, Trino, Airflow, and more on the Kubernetes platform.
- Real-time Components: Deploy Kafka, Flink, and RisingWave to enable real-time data processing capabilities within the Kubernetes environment.
- Application Components: We’ll cover integrating Data APIs (Spring Boot, FastAPI), BI tools (Superset), ad-hoc analysis tools (JupyterHub), web applications (Streamlit), and machine learning tools (MLflow, Kubeflow, etc.) into your Kubernetes environment.
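For the HA cluster piece, kind (listed in the prerequisites below) can simulate a multi-node control plane on a single machine. A hypothetical lab configuration might look like this; the node counts are illustrative, not values from a later chapter:

```yaml
# Hypothetical kind config for a small HA-style lab cluster:
# three control-plane nodes and two workers.
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: control-plane
  - role: control-plane
  - role: worker
  - role: worker
```

Create it with `kind create cluster --config ha-cluster.yaml`. This approximates an HA topology for practice; production HA additionally needs a load balancer in front of the API servers, which kubeadm-, k3s-, or rke2-based setups address.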
This series will provide a step-by-step guide, making it accessible for everyone. However, a basic understanding of Kubernetes resources and data stacks will be helpful in solidifying your knowledge.
Lab Environment Setup:
I recommend a Linux distribution such as Ubuntu or Arch for this series, as I’ve had success with both; I haven’t tested on Windows yet. Get ready to set up your lab environment on your personal laptop as follows:
Pre-requisites:
To ensure a smooth lab experience, please install the following tools on your machine:
- docker: https://www.docker.com/
- docker-compose: https://github.com/docker/compose
- sysbox runtime: https://github.com/nestybox/sysbox
- kind: https://kind.sigs.k8s.io/
- kubectl: https://kubernetes.io/docs/reference/kubectl/
- VirtualBox: https://www.virtualbox.org/wiki/Downloads
- Vagrant: https://developer.hashicorp.com/vagrant
- Multipass: https://multipass.run/docs/tutorial
- Helm: https://helm.sh/
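Once everything is installed, a quick way to confirm the CLI tools are on your PATH is a short Python check (sysbox and VirtualBox are installed differently, so they are omitted here):

```python
import shutil

# Prerequisite CLIs from the list above (sysbox and VirtualBox are not
# plain PATH-resolvable CLIs, so they are not checked here).
TOOLS = ["docker", "docker-compose", "kind", "kubectl", "vagrant", "multipass", "helm"]

def check_tools(tools):
    """Map each tool name to its resolved path, or None if not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}

for tool, path in check_tools(TOOLS).items():
    print(f"{tool:16s} {path or 'NOT FOUND'}")
```

Anything reported as NOT FOUND can be installed from the corresponding link above before moving on to the next chapter.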