[Kubernetes Data Platform][Part 8][Main Components]: Data warehouse with StarRocks

Viet_1846


StarRocks is a high-performance, massively parallel processing (MPP) SQL data warehouse optimized for big data analytics. Built on a significantly enhanced version of Apache Doris, it delivers exceptional speed, reliability, and ease of use.

StarRocks Architecture

StarRocks supports two main architectures: Shared-nothing and Shared-data. Each architecture has its own characteristics and is suitable for different use cases.

Shared-data Clusters

StarRocks Shared-data Architecture

Architecture: In this model, data is stored on a shared storage system (such as S3, Google Cloud Storage, Azure Blob Storage, MinIO), while compute nodes are only responsible for processing queries and computations.

Advantages:

  • Lower cost by leveraging the advantages of cheap object storage.
  • Independent scalability between compute and storage.
  • Better data recovery capabilities.

Disadvantages:

  • Performance may be lower than shared-nothing in some cases.
  • Dependent on the performance of the storage system.
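
In shared-data mode, the object storage backend is described by a storage volume. Besides the FE config file approach used later in this article (enable_load_volume_from_conf), a volume can also be created at runtime with SQL. The following is a minimal sketch; the volume name is illustrative and the endpoint and credentials mirror the MinIO setup from the hands-on section, so adjust them to your environment:

-- illustrative volume name; properties mirror the fe.conf used later in this article
CREATE STORAGE VOLUME minio_volume
TYPE = S3
LOCATIONS = ("s3://starrocks/")
PROPERTIES (
    "enabled" = "true",
    "aws.s3.endpoint" = "http://minio.minio.svc.cluster.local:9000",
    "aws.s3.region" = "us-east-1",
    "aws.s3.use_instance_profile" = "false",
    "aws.s3.access_key" = "admin",
    "aws.s3.secret_key" = "password"
);
SET minio_volume AS DEFAULT STORAGE VOLUME;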

Shared-nothing Clusters

StarRocks Shared-nothing Architecture

Architecture: In this model, each node in the cluster is independent and does not share any resources with other nodes. Each node has its own memory, CPU, and hard drive, and is responsible for storing, computing, and managing its own data.

Advantages:

  • Flexible scalability: Nodes can be easily added or removed to meet changing needs.
  • High fault tolerance: A failure on one node does not affect other nodes.
  • Good performance for heavy OLAP workloads.

Disadvantages:

  • Higher cost due to the requirement for more hardware resources.
  • More complex management.

DEPLOYMENT STEPS

  1. Initialize a Kubernetes cluster with Kind.
  2. Install Nginx Ingress Controller, MinIO, Hive Metastore, Trino, and Apache Airflow on Kubernetes.
  3. Trigger DAG: dbt_jaffle-shop-classic_example.
  4. Install StarRocks: I will deploy using the Shared-data cluster architecture.
  5. Configure and query the Iceberg catalog.
  6. Destroy the Kind cluster.

HANDS-ON STEPS

Reference Repository: https://github.com/viethqb/data-platform-notes/tree/main/starrocks

1. Initialize a Kubernetes cluster with Kind.

> cd ~/Documents
> git clone https://github.com/viethqb/data-platform-notes.git
> cd data-platform-notes/starrocks
> kind create cluster --name dev --config deployment/kind/kind-config.yaml
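
An optional sanity check before installing anything (Kind registers the cluster under the context name kind-dev):

> kubectl cluster-info --context kind-dev
> kubectl get nodes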

2. Install Nginx Ingress Controller, MinIO, Hive Metastore, Trino, and Apache Airflow on Kubernetes

Install Nginx Ingress Controller

> helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
> helm repo update
> helm upgrade --install ingress-nginx ingress-nginx/ingress-nginx --set controller.hostNetwork=true,controller.service.type="",controller.kind=DaemonSet --namespace ingress-nginx --version 4.10.1 --create-namespace --debug
> kubectl -n ingress-nginx get po -owide

Install MinIO

> helm repo add bitnami https://charts.bitnami.com/bitnami
> helm repo update
> helm upgrade --install minio -n minio -f deployment/minio/minio-values.yaml bitnami/minio --create-namespace --debug --version 14.6.0
> kubectl -n minio get po
> kubectl get no -owide
# NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
# dev-control-plane Ready control-plane 3m53s v1.30.0 172.18.0.2 <none> Debian GNU/Linux 12 (bookworm) 6.9.9-arch1-1 containerd://1.7.15
# dev-worker Ready <none> 3m11s v1.30.0 172.18.0.5 <none> Debian GNU/Linux 12 (bookworm) 6.9.9-arch1-1 containerd://1.7.15
# dev-worker2 Ready <none> 3m12s v1.30.0 172.18.0.3 <none> Debian GNU/Linux 12 (bookworm) 6.9.9-arch1-1 containerd://1.7.15
# dev-worker3 Ready <none> 3m11s v1.30.0 172.18.0.4 <none> Debian GNU/Linux 12 (bookworm) 6.9.9-arch1-1 containerd://1.7.15
# Add the following line to the end of /etc/hosts
172.18.0.4 minio.lakehouse.local airflow.lakehouse.local

Install Hive Metastore

# hive-metastore-postgresql
> helm repo add bitnami https://charts.bitnami.com/bitnami
> helm repo update
> helm upgrade --install metastore-db -n metastore -f deployment/hive/hive-metastore-postgres-values.yaml bitnami/postgresql --create-namespace --debug --version 15.4.2
# Hive metastore
# docker pull rtdl/hive-metastore:3.1.2
# kind load docker-image rtdl/hive-metastore:3.1.2 --name dev
> helm upgrade --install hive-metastore -n metastore -f deployment/hive/hive-metastore-values.yaml ../../charts/hive-metastore --create-namespace --debug

Install Trino

> helm repo add trino https://trinodb.github.io/charts
> helm upgrade --install trino -n trino -f deployment/trino/trino-values.yaml trino/trino --create-namespace --debug --version 0.21.0
> kubectl -n trino get po

Install Apache Airflow

> helm repo add airflow https://airflow.apache.org/
> helm repo update
> helm upgrade --install airflow airflow/airflow -f deployment/airflow/airflow-values.yaml --namespace airflow --create-namespace --debug --version 1.13.1 --timeout 600s
> kubectl -n airflow get po

Access at http://airflow.lakehouse.local/connection/list/ ⇒ user: admin & password: admin

Configure the S3 and Kubernetes connections in the Airflow UI

Access Airflow Connection at http://airflow.lakehouse.local/connection/list/ ⇒ add new record

Connection Id: s3_default
Connection Type: Amazon Web Services
AWS Access Key ID: admin
AWS Secret Access Key: password
Extra: {"endpoint_url": "http://minio.minio.svc.cluster.local:9000"}

Connection Id: kubernetes_default
Connection Type: Kubernetes Cluster Connection
In cluster configuration: yes
Disable SSL: yes
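
If you prefer not to click through the UI, the same S3 connection can also be created with the Airflow CLI from inside a webserver pod. This is a sketch assuming the default chart deployment name airflow-webserver; the kubernetes_default connection can be added the same way or via the UI as above:

> kubectl -n airflow exec -it deploy/airflow-webserver -- airflow connections add s3_default \
    --conn-type aws \
    --conn-login admin \
    --conn-password password \
    --conn-extra '{"endpoint_url": "http://minio.minio.svc.cluster.local:9000"}'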

3. Trigger DAG: dbt_jaffle-shop-classic_example

Create jaffle_shop schema:

> kubectl -n trino exec -it deployments/trino-coordinator -- trino
trino> CREATE SCHEMA lakehouse.jaffle_shop WITH (location = 's3a://lakehouse/jaffle_shop.db/');

Access at http://airflow.lakehouse.local/dags/dbt_jaffle-shop-classic_example/grid ⇒ Trigger

Check tables from jaffle_shop schema:

> kubectl -n trino exec -it deployments/trino-coordinator -- trino
trino> show tables from lakehouse.jaffle_shop;

4. Install StarRocks

Docs: https://github.com/StarRocks/starrocks-kubernetes-operator/tree/main/doc

Install starrocks-kubernetes-operator ⇒ Custom resource: starrocksclusters.starrocks.com

> kubectl apply -f https://raw.githubusercontent.com/StarRocks/starrocks-kubernetes-operator/main/deploy/starrocks.com_starrocksclusters.yaml
> kubectl apply -f https://raw.githubusercontent.com/StarRocks/starrocks-kubernetes-operator/main/deploy/operator.yaml
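
Before creating the cluster, it is worth confirming that the CRD is registered and the operator pod is running (the operator.yaml manifest deploys the operator into the starrocks namespace by default):

> kubectl get crd starrocksclusters.starrocks.com
> kubectl -n starrocks get po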

Create a StarRocks cluster with an empty root password:

shared_data_mode.yaml

apiVersion: starrocks.com/v1
kind: StarRocksCluster
metadata:
  name: starrockscluster
  namespace: starrocks
spec:
  starRocksFeSpec:
    image: starrocks/fe-ubuntu:3.3-latest
    replicas: 3
    # limits:
    #   cpu: 2
    #   memory: 4Gi
    # requests:
    #   cpu: 2
    #   memory: 4Gi
    # feEnvVars:
    #   - name: "MYSQL_PWD"
    #     valueFrom:
    #       secretKeyRef:
    #         name: rootcredential
    #         key: password
    service:
      type: NodePort # export fe service
      ports:
        - name: query # fill the name from the fe service ports
          nodePort: 32755
          port: 9030
          containerPort: 9030
    storageVolumes:
      - name: fe-storage-meta
        # storageClassName: "nfs-client" # you can remove this line if you want to use the default storage class
        storageSize: 10Gi # the size of storage volume for metadata
        mountPath: /opt/starrocks/fe/meta # the path of metadata
      - name: fe-storage-log
        # storageClassName: "nfs-client" # you can remove this line if you want to use the default storage class
        storageSize: 1Gi # the size of storage volume for log
        mountPath: /opt/starrocks/fe/log # the path of log
    configMapInfo:
      configMapName: starrockscluster-fe-cm
      resolveKey: fe.conf
  starRocksCnSpec:
    image: starrocks/cn-ubuntu:3.3-latest
    replicas: 1
    # limits:
    #   cpu: 2
    #   memory: 4Gi
    # requests:
    #   cpu: 2
    #   memory: 4Gi
    # cnEnvVars:
    #   - name: "MYSQL_PWD"
    #     valueFrom:
    #       secretKeyRef:
    #         name: rootcredential
    #         key: password
    configMapInfo:
      configMapName: starrockscluster-cn-cm
      resolveKey: cn.conf
    storageVolumes:
      - name: cn-storage-data
        # storageClassName: "nfs-client" # you can remove this line if you want to use the default storage class
        storageSize: 10Gi # the size of storage volume for data
        mountPath: /opt/starrocks/cn/storage # the path of data
      - name: cn-storage-log
        # storageClassName: "nfs-client" # you can remove this line if you want to use the default storage class
        storageSize: 1Gi # the size of storage volume for log
        mountPath: /opt/starrocks/cn/log # the path of log
  starRocksFeProxySpec:
    replicas: 1
    # limits:
    #   cpu: 1
    #   memory: 2Gi
    # requests:
    #   cpu: 1
    #   memory: 2Gi
    service:
      type: NodePort # export fe proxy service
      ports:
        - name: http-port # fill the name from the fe proxy service ports
          containerPort: 8080
          nodePort: 30180 # The range of valid ports is 30000-32767
          port: 8080
    resolver: "kube-dns.kube-system.svc.cluster.local" # this is the default dns server.
---
# fe config
apiVersion: v1
kind: ConfigMap
metadata:
  name: starrockscluster-fe-cm
  namespace: starrocks
  labels:
    cluster: starrockscluster
data:
  fe.conf: |
    LOG_DIR = ${STARROCKS_HOME}/log
    DATE = "$(date +%Y%m%d-%H%M%S)"
    JAVA_OPTS="-Dlog4j2.formatMsgNoLookups=true -Xmx8192m -XX:+UseMembar -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=7 -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSClassUnloadingEnabled -XX:-CMSParallelRemarkEnabled -XX:CMSInitiatingOccupancyFraction=80 -XX:SoftRefLRUPolicyMSPerMB=0 -Xloggc:${LOG_DIR}/fe.gc.log.$DATE"
    JAVA_OPTS_FOR_JDK_9="-Dlog4j2.formatMsgNoLookups=true -Xmx8192m -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=7 -XX:+CMSClassUnloadingEnabled -XX:-CMSParallelRemarkEnabled -XX:CMSInitiatingOccupancyFraction=80 -XX:SoftRefLRUPolicyMSPerMB=0 -Xlog:gc*:${LOG_DIR}/fe.gc.log.$DATE:time"
    JAVA_OPTS_FOR_JDK_11="-Dlog4j2.formatMsgNoLookups=true -Xmx8192m -XX:+UseG1GC -Xlog:gc*:${LOG_DIR}/fe.gc.log.$DATE:time"
    http_port = 8030
    rpc_port = 9020
    query_port = 9030
    edit_log_port = 9010
    mysql_service_nio_enabled = true
    sys_log_level = INFO
    run_mode = shared_data
    cloud_native_meta_port = 6090
    # Whether StarRocks may create the default storage volume from the storage-related properties in the FE config file
    enable_load_volume_from_conf = true
    aws_s3_path = starrocks
    aws_s3_region = us-east-1
    aws_s3_endpoint = http://minio.minio.svc.cluster.local:9000
    aws_s3_access_key = admin
    aws_s3_secret_key = password
---
# cn config
apiVersion: v1
kind: ConfigMap
metadata:
  name: starrockscluster-cn-cm
  namespace: starrocks
  labels:
    cluster: starrockscluster
data:
  cn.conf: |
    sys_log_level = INFO
    # ports for admin, web, heartbeat service
    thrift_port = 9060
    webserver_port = 8040
    heartbeat_service_port = 9050
    brpc_port = 8060

Create the rootcredential secret (used later) and apply the manifest; the root password is still empty at this stage:

> kubectl -n starrocks create secret generic rootcredential --from-literal=password=mysql_password
> kubectl apply -f deployment/starrocks/shared_data_mode.yaml
> kubectl -n starrocks get po -owide
> kubectl -n starrocks get starrocksclusters.starrocks.com
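
Once all pods are Running, a quick check from the SQL side confirms that the FE followers and the CN have registered (the root password is still empty at this point):

> mysql --host=172.18.0.4 --port=32755 --user=root
MySQL [(none)]> SHOW FRONTENDS;
MySQL [(none)]> SHOW COMPUTE NODES;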

Change the StarRocks root password:

# Connect to nodeport
> mysql --host=172.18.0.4 --port=32755 --user=root
MySQL [(none)]> SET PASSWORD FOR 'root' = PASSWORD('mysql_password');
MySQL [(none)]> quit

shared_data_mode_with_root_password.yaml

apiVersion: starrocks.com/v1
kind: StarRocksCluster
metadata:
  name: starrockscluster
  namespace: starrocks
spec:
  starRocksFeSpec:
    image: starrocks/fe-ubuntu:3.3-latest
    replicas: 3
    # limits:
    #   cpu: 2
    #   memory: 4Gi
    # requests:
    #   cpu: 2
    #   memory: 4Gi
    feEnvVars:
      - name: "MYSQL_PWD"
        valueFrom:
          secretKeyRef:
            name: rootcredential
            key: password
    service:
      type: NodePort # export fe service
      ports:
        - name: query # fill the name from the fe service ports
          nodePort: 32755
          port: 9030
          containerPort: 9030
    storageVolumes:
      - name: fe-storage-meta
        # storageClassName: "nfs-client" # you can remove this line if you want to use the default storage class
        storageSize: 10Gi # the size of storage volume for metadata
        mountPath: /opt/starrocks/fe/meta # the path of metadata
      - name: fe-storage-log
        # storageClassName: "nfs-client" # you can remove this line if you want to use the default storage class
        storageSize: 1Gi # the size of storage volume for log
        mountPath: /opt/starrocks/fe/log # the path of log
    configMapInfo:
      configMapName: starrockscluster-fe-cm
      resolveKey: fe.conf
  starRocksCnSpec:
    image: starrocks/cn-ubuntu:3.3-latest
    replicas: 1
    # limits:
    #   cpu: 2
    #   memory: 4Gi
    # requests:
    #   cpu: 2
    #   memory: 4Gi
    cnEnvVars:
      - name: "MYSQL_PWD"
        valueFrom:
          secretKeyRef:
            name: rootcredential
            key: password
    configMapInfo:
      configMapName: starrockscluster-cn-cm
      resolveKey: cn.conf
    storageVolumes:
      - name: cn-storage-data
        # storageClassName: "nfs-client" # you can remove this line if you want to use the default storage class
        storageSize: 10Gi # the size of storage volume for data
        mountPath: /opt/starrocks/cn/storage # the path of data
      - name: cn-storage-log
        # storageClassName: "nfs-client" # you can remove this line if you want to use the default storage class
        storageSize: 1Gi # the size of storage volume for log
        mountPath: /opt/starrocks/cn/log # the path of log
  starRocksFeProxySpec:
    replicas: 1
    # limits:
    #   cpu: 1
    #   memory: 2Gi
    # requests:
    #   cpu: 1
    #   memory: 2Gi
    service:
      type: NodePort # export fe proxy service
      ports:
        - name: http-port # fill the name from the fe proxy service ports
          containerPort: 8080
          nodePort: 30180 # The range of valid ports is 30000-32767
          port: 8080
    resolver: "kube-dns.kube-system.svc.cluster.local" # this is the default dns server.
---
# fe config
apiVersion: v1
kind: ConfigMap
metadata:
  name: starrockscluster-fe-cm
  namespace: starrocks
  labels:
    cluster: starrockscluster
data:
  fe.conf: |
    LOG_DIR = ${STARROCKS_HOME}/log
    DATE = "$(date +%Y%m%d-%H%M%S)"
    JAVA_OPTS="-Dlog4j2.formatMsgNoLookups=true -Xmx8192m -XX:+UseMembar -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=7 -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSClassUnloadingEnabled -XX:-CMSParallelRemarkEnabled -XX:CMSInitiatingOccupancyFraction=80 -XX:SoftRefLRUPolicyMSPerMB=0 -Xloggc:${LOG_DIR}/fe.gc.log.$DATE"
    JAVA_OPTS_FOR_JDK_9="-Dlog4j2.formatMsgNoLookups=true -Xmx8192m -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=7 -XX:+CMSClassUnloadingEnabled -XX:-CMSParallelRemarkEnabled -XX:CMSInitiatingOccupancyFraction=80 -XX:SoftRefLRUPolicyMSPerMB=0 -Xlog:gc*:${LOG_DIR}/fe.gc.log.$DATE:time"
    JAVA_OPTS_FOR_JDK_11="-Dlog4j2.formatMsgNoLookups=true -Xmx8192m -XX:+UseG1GC -Xlog:gc*:${LOG_DIR}/fe.gc.log.$DATE:time"
    http_port = 8030
    rpc_port = 9020
    query_port = 9030
    edit_log_port = 9010
    mysql_service_nio_enabled = true
    sys_log_level = INFO
    run_mode = shared_data
    cloud_native_meta_port = 6090
    # Whether StarRocks may create the default storage volume from the storage-related properties in the FE config file
    enable_load_volume_from_conf = true
    aws_s3_path = starrocks
    aws_s3_region = us-east-1
    aws_s3_endpoint = http://minio.minio.svc.cluster.local:9000
    aws_s3_access_key = admin
    aws_s3_secret_key = password
---
# cn config
apiVersion: v1
kind: ConfigMap
metadata:
  name: starrockscluster-cn-cm
  namespace: starrocks
  labels:
    cluster: starrockscluster
data:
  cn.conf: |
    sys_log_level = INFO
    # ports for admin, web, heartbeat service
    thrift_port = 9060
    webserver_port = 8040
    heartbeat_service_port = 9050
    brpc_port = 8060

Update the StarRocks cluster to use the root password:

> kubectl apply -f deployment/starrocks/shared_data_mode_with_root_password.yaml

Test StarRocks shared_data_mode

# Connect to nodeport
> mysql --host=172.18.0.4 --port=32755 --user=root --password --ssl=FALSE
MySQL [(none)]> create database test;
MySQL [(none)]> create table test.tbl1(id int, name varchar(10));
MySQL [(none)]> select * from test.tbl1;
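
Optionally insert a row as well, so that a data file actually lands in the bucket:

MySQL [(none)]> insert into test.tbl1 values (1, 'starrocks');
MySQL [(none)]> select * from test.tbl1;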

Data is saved in MinIO:

5. Configure and query the Iceberg catalog

> mysql --host=172.18.0.4 --port=32755 --user=root --password --ssl=FALSE
MySQL [(none)]> CREATE EXTERNAL CATALOG lakehouse PROPERTIES (
"type" = "iceberg",
"iceberg.catalog.type" = "hive",
"hive.metastore.uris" = "thrift://hive-metastore.metastore.svc.cluster.local:9083",
"aws.s3.enable_ssl" = "false",
"aws.s3.enable_path_style_access" = "true",
"aws.s3.endpoint" = "http://minio.minio.svc.cluster.local:9000",
"aws.s3.access_key" = "admin",
"aws.s3.secret_key" = "password"
);

MySQL [(none)]> show databases from lakehouse;
MySQL [(none)]> show tables from lakehouse.jaffle_shop;
MySQL [(none)]> select * from lakehouse.jaffle_shop.customers limit 10;
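
If you will run many queries against this catalog, you can also switch the session to it instead of fully qualifying every table name (SET CATALOG is available in StarRocks 3.x):

MySQL [(none)]> SET CATALOG lakehouse;
MySQL [(none)]> USE jaffle_shop;
MySQL [(none)]> select count(*) from customers;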

For comparison, the same Iceberg data can also be queried from the Apache Doris cluster covered earlier in this series, where the catalog was named iceberg:

mysql> show databases from iceberg;
mysql> show tables from iceberg.jaffle_shop;
mysql> select * from iceberg.jaffle_shop.customers limit 5;

6. Destroy the Kind cluster

> kind delete cluster --name dev

Conclusion

This is the final installment in our series on the Core Components for a Data Platform (Data Lakehouse). In this part, we learned how to deploy:

  • Object Storage with MinIO
  • Metadata Store with Hive Metastore
  • Table Format with Apache Iceberg
  • Processing with Trino, Spark Operator, and Spark Connect Server
  • Airflow to manage ETL workflows using Trino + dbt, Spark Operator, and Spark Connect Server.
  • Apache Doris and StarRocks as interfaces for applications that consume data from the Lakehouse.

As you may have noticed, this series has primarily focused on the practical aspects of deploying these technologies in a production environment, rather than exploring the underlying reasons, technical details, or optimal use cases. To gain a deeper understanding of these technologies, you’ll need to consult other resources.

In reality, you may not need to use all of the technologies mentioned above, or you may need to substitute some of them depending on the specific requirements of your project and the resources you have available.

In the next part, we will deploy services that support real-time processing, such as Kafka and RisingWave.