[Kubernetes Data Platform][Part 7][Main Components]: Data warehouse with Apache Doris

Viet_1846


We’ve nearly completed the core components of a data platform (a data lakehouse) built on MinIO, Hive Metastore, and Iceberg. The remaining challenge is letting other applications integrate with and use the data in this lakehouse. Trino and Spark might come to mind, but they aren’t ideal here: most application developers are unfamiliar with them, they can be awkward to integrate with existing BI tools, and they raise security concerns.

As a rising star in the real-time data warehouse domain, Apache Doris offers several compelling features. Here’s how it addresses our data platform’s needs:

  • Seamless Integration: Provides a MySQL interface and ANSI SQL compatibility, ensuring easy integration with BI tools. Offers an open data API for external compute engines like Spark, Flink, and ML/AI, and integrates smoothly with backend frameworks (Spring Boot, FastAPI, etc.). A minimal connection sketch follows this list.
  • Federated Querying: Enables querying data lakes (Hive, Iceberg, Hudi) and databases (MySQL, PostgreSQL) directly, eliminating the need to load data into data marts. This is similar to Trino’s functionality.
  • Robust Security: Implements Role-Based Access Control (RBAC) for granular permission management.
  • Real-time Capabilities: Supports both push-based and pull-based data ingestion, real-time upserts, appends, and pre-aggregation.
  • High Performance: Optimized for high-concurrency and high-throughput queries with columnar storage, MPP architecture, cost-based query optimizer, and vectorized execution engine.
  • Data Flexibility: Handles semi-structured data (arrays, maps, JSON) and offers automatic data type inference for JSON. Includes features for text search.
  • Scalability: Provides a distributed design for linear scalability, workload isolation, tiered storage, and supports both shared-nothing clusters and separation of storage and compute.
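
To make the first and third points concrete, here is a minimal sketch of connecting over the MySQL protocol and setting up RBAC. The host, role, and user names are placeholders for illustration, not values from the deployment below.

> mysql -h <doris-fe-host> -P 9030 -uadmin -p
mysql> CREATE ROLE analyst;
mysql> GRANT SELECT_PRIV ON internal.demo.* TO ROLE 'analyst';
mysql> CREATE USER 'bi_user'@'%' IDENTIFIED BY 'changeme' DEFAULT ROLE 'analyst';

Any MySQL-compatible driver or BI connector can use the same endpoint, which is what makes the integration story straightforward.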

DEPLOYMENT STEPS

  1. Initialize a Kubernetes cluster with Kind.
  2. Install the Nginx Ingress Controller, MinIO, Hive Metastore, Trino, and Apache Airflow on Kubernetes
  3. Trigger DAG: dbt_jaffle-shop-classic_example
  4. Install Apache Doris
  5. Configure and query the Iceberg catalog
  6. Destroy the Kind cluster

HANDS-ON STEPS

Reference Repository: https://github.com/viethqb/data-platform-notes/tree/main/doris/kubernetes

1. Initialize a Kubernetes cluster with Kind.

> cd ~/Documents
> git clone https://github.com/viethqb/data-platform-notes.git
> cd data-platform-notes/doris/kubernetes
> kind create cluster --name dev --config deployment/kind/kind-config.yaml
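
Before installing anything, it is worth confirming the cluster is reachable (Kind prefixes the kubectl context with kind-, so the context here is kind-dev):

> kubectl cluster-info --context kind-dev
> kubectl get nodes -owide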

2. Install Nginx Ingress Controller, MinIO, Hive Metastore, Trino, Apache Airflow on Kubernetes

Install Nginx Ingress Controller

> helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
> helm repo update
> helm upgrade --install ingress-nginx ingress-nginx/ingress-nginx --set controller.hostNetwork=true,controller.service.type="",controller.kind=DaemonSet --namespace ingress-nginx --version 4.10.1 --create-namespace --debug
> kubectl -n ingress-nginx get po -owide

Install MinIO

> helm repo add bitnami https://charts.bitnami.com/bitnami
> helm repo update
> helm upgrade --install minio -n minio -f deployment/minio/minio-values.yaml bitnami/minio --create-namespace --debug --version 14.6.0
> kubectl -n minio get po
> kubectl get no -owide
# NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
# dev-control-plane Ready control-plane 3m53s v1.30.0 172.18.0.2 <none> Debian GNU/Linux 12 (bookworm) 6.9.9-arch1-1 containerd://1.7.15
# dev-worker Ready <none> 3m11s v1.30.0 172.18.0.5 <none> Debian GNU/Linux 12 (bookworm) 6.9.9-arch1-1 containerd://1.7.15
# dev-worker2 Ready <none> 3m12s v1.30.0 172.18.0.3 <none> Debian GNU/Linux 12 (bookworm) 6.9.9-arch1-1 containerd://1.7.15
# dev-worker3 Ready <none> 3m11s v1.30.0 172.18.0.4 <none> Debian GNU/Linux 12 (bookworm) 6.9.9-arch1-1 containerd://1.7.15
# Add the following line to the end of /etc/hosts:
172.18.0.4 minio.lakehouse.local airflow.lakehouse.local
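
Optionally, verify the MinIO credentials and buckets from inside the cluster with the MinIO client (mc). The admin/password credentials and the in-cluster endpoint below are the same ones referenced later in the Airflow and Doris configuration.

> kubectl run mc-client --image=minio/mc -it --rm --restart=Never --namespace=minio --command -- /bin/sh
# inside the pod:
mc alias set local http://minio.minio.svc.cluster.local:9000 admin password
mc ls local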

Install Hive Metastore

# hive-metastore-postgresql
> helm repo add bitnami https://charts.bitnami.com/bitnami
> helm repo update
> helm upgrade --install metastore-db -n metastore -f deployment/hive/hive-metastore-postgres-values.yaml bitnami/postgresql --create-namespace --debug --version 15.4.2

# Hive metastore
# docker pull rtdl/hive-metastore:3.1.2
# kind load docker-image rtdl/hive-metastore:3.1.2 --name dev
> helm upgrade --install hive-metastore -n metastore -f deployment/hive/hive-metastore-values.yaml ../../charts/hive-metastore --create-namespace --debug
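
Wait for the metastore pods to become ready and note the service exposing the thrift endpoint (port 9083 is the Hive Metastore default); the Doris catalog configuration in step 5 points at this service.

> kubectl -n metastore get po
> kubectl -n metastore get svc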

Install Trino

> helm repo add trino https://trinodb.github.io/charts
> helm upgrade --install trino -n trino -f deployment/trino/trino-values.yaml trino/trino --create-namespace --debug --version 0.21.0
> kubectl -n trino get po
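
As a quick smoke test, run a non-interactive query against the coordinator with the trino CLI's --execute flag; you should see the lakehouse catalog defined in the values file used above.

> kubectl -n trino exec -it deployments/trino-coordinator -- trino --execute "SHOW CATALOGS"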

Install Apache Airflow

> helm repo add airflow https://airflow.apache.org/
> helm repo update
> helm upgrade --install airflow airflow/airflow -f deployment/airflow/airflow-values.yaml --namespace airflow --create-namespace --debug --version 1.13.1 --timeout 600s
> kubectl -n airflow get po

Access at http://airflow.lakehouse.local/connection/list/ ⇒ user: admin & password: admin

Configure the S3 and Kubernetes connections in the Airflow UI

Open the Airflow connections page at http://airflow.lakehouse.local/connection/list/ ⇒ add a new record for each of the two connections below:

Connection Id: s3_default
Connection Type: Amazon Web Services
AWS Access Key ID: admin
AWS Secret Access Key: password
Extra: {"endpoint_url": "http://minio.minio.svc.cluster.local:9000"}

Connection Id: kubernetes_default
Connection Type: Kubernetes Cluster Connection
In cluster configuration: yes
Disable SSL: yes
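
If you prefer the CLI to the UI, the same S3 connection can be created from inside the scheduler pod. This is a sketch: the airflow-scheduler deployment and scheduler container names are the chart defaults, and the kubernetes_default connection can be added the same way.

> kubectl -n airflow exec -it deployments/airflow-scheduler -c scheduler -- airflow connections add s3_default --conn-type aws --conn-login admin --conn-password password --conn-extra '{"endpoint_url": "http://minio.minio.svc.cluster.local:9000"}'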

3. Trigger DAG: dbt_jaffle-shop-classic_example

Create jaffle_shop schema:

> kubectl -n trino exec -it deployments/trino-coordinator -- trino
trino> CREATE SCHEMA lakehouse.jaffle_shop WITH (location = 's3a://lakehouse/jaffle_shop.db/');

Access at http://airflow.lakehouse.local/dags/dbt_jaffle-shop-classic_example/grid ⇒ Trigger
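
The DAG can also be triggered from the Airflow CLI instead of the UI (same scheduler pod as in the previous step):

> kubectl -n airflow exec -it deployments/airflow-scheduler -c scheduler -- airflow dags trigger dbt_jaffle-shop-classic_example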

Check tables from jaffle_shop schema:

> kubectl -n trino exec -it deployments/trino-coordinator -- trino
trino> show tables from lakehouse.jaffle_shop;

4. Install Apache Doris

Install doris-operator ⇒ Custom resource: dorisclusters.doris.selectdb.com

> helm repo add doris https://charts.selectdb.com/
> helm repo update
> helm upgrade --install operator doris/doris-operator --namespace doris --create-namespace --debug --version 1.6.0
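
Confirm the operator pod is running and that the CRD mentioned above has been registered before applying the cluster manifest:

> kubectl -n doris get po
> kubectl get crd dorisclusters.doris.selectdb.com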

doriscluster-sample-storageclass.yaml

# This YAML uses a StorageClass to provide persistent volumes for the FE and BE pods.
# It relies on the cluster's default StorageClass; when using a specific StorageClass, update the storageClassName fields below.
apiVersion: doris.selectdb.com/v1
kind: DorisCluster
metadata:
  labels:
    app.kubernetes.io/name: doriscluster
    app.kubernetes.io/instance: doriscluster-sample-storageclass
    app.kubernetes.io/part-of: doris-operator
  name: doriscluster-sample-storageclass1
  namespace: doris
spec:
  adminUser:
    name: root
    password: "12345678"
  feSpec:
    replicas: 3
    image: selectdb/doris.fe-ubuntu:2.1.2
    service:
      type: NodePort
      servicePorts:
        - nodePort: 31001
          targetPort: 8030
        - nodePort: 31002
          targetPort: 8040
        - nodePort: 31003
          targetPort: 9030
    # limits:
    #   cpu: 2
    #   memory: 4Gi
    # requests:
    #   cpu: 2
    #   memory: 4Gi
    persistentVolumes:
      - mountPath: /opt/apache-doris/fe/doris-meta
        name: fetest
        persistentVolumeClaimSpec:
          # when using a specific StorageClass, set storageClassName, for example:
          #storageClassName: openebs-jiva-csi-default
          accessModes:
            - ReadWriteOnce
          resources:
            # notice: if the storage size is less than 5Gi, the FE will not start normally.
            requests:
              storage: 10Gi
      - mountPath: /opt/apache-doris/fe/log
        name: felog
        persistentVolumeClaimSpec:
          # when using a specific StorageClass, set storageClassName, for example:
          #storageClassName: openebs-jiva-csi-default
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 10Gi
  beSpec:
    replicas: 3
    image: selectdb/doris.be-ubuntu:2.1.2
    # limits:
    #   cpu: 2
    #   memory: 4Gi
    # requests:
    #   cpu: 2
    #   memory: 4Gi
    persistentVolumes:
      - mountPath: /opt/apache-doris/be/storage
        name: betest
        persistentVolumeClaimSpec:
          # when using a specific StorageClass, set storageClassName, for example:
          #storageClassName: openebs-jiva-csi-default
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 10Gi
      - mountPath: /opt/apache-doris/be/log
        name: belog
        persistentVolumeClaimSpec:
          # when using a specific StorageClass, set storageClassName, for example:
          #storageClassName: openebs-jiva-csi-default
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 10Gi

Create the Doris cluster using the dorisclusters custom resource:

> kubectl apply -f deployment/doris/doriscluster-sample-storageclass.yaml
> kubectl -n doris get po -owide
> kubectl -n doris get dorisclusters.doris.selectdb.com

Change Doris password:

> kubectl run mysql-client --image=mysql:5.7 -it --rm --restart=Never --namespace=doris -- /bin/bash
mysql -uroot -P9030 -hdoriscluster-sample-storageclass1-fe-service
mysql> SET PASSWORD FOR 'root' = PASSWORD('12345678');
mysql> SET PASSWORD FOR 'admin' = PASSWORD('12345678');
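
While still connected, you can check that all three FE and BE replicas joined the cluster (SHOW FRONTENDS and SHOW BACKENDS are standard Doris statements; the Alive column should be true for every node):

mysql> SHOW FRONTENDS;
mysql> SHOW BACKENDS;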

5. Configure and query the Iceberg catalog

> kubectl -n doris delete po mysql-client
> kubectl -n metastore get svc
> kubectl -n minio get svc
> kubectl run mysql-client --image=mysql:5.7 -it --rm --restart=Never --namespace=doris -- /bin/bash
bash-4.2# mysql -uroot -P9030 -hdoriscluster-sample-storageclass1-fe-service -p
# The service DNS names did not resolve from Doris here, so the ClusterIPs from the get svc output above are used instead
mysql> CREATE CATALOG iceberg PROPERTIES (
    "type" = "iceberg",
    "iceberg.catalog.type" = "hms",
    "hive.metastore.uris" = "thrift://10.96.86.185:9083",
    "warehouse" = "s3://lakehouse",
    "s3.access_key" = "admin",
    "s3.secret_key" = "password",
    "s3.endpoint" = "http://10.96.130.201:9000",
    "s3.region" = "us-east-1"
);
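
The new catalog should now appear next to the built-in internal catalog; if tables created later in the metastore don't show up, the cached external metadata can be refreshed:

mysql> SHOW CATALOGS;
mysql> REFRESH CATALOG iceberg;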

Query data from Iceberg on Doris:

mysql> show databases from iceberg;
mysql> show tables from iceberg.jaffle_shop;
mysql> select * from iceberg.jaffle_shop.customers limit 5;

6. Destroy the Kind cluster

> kind delete cluster --name dev