[Kubernetes Data Platform][Part 8][Main Components]: Data warehouse with StarRocks

Viet_1846


StarRocks is a high-performance, massively parallel processing (MPP) SQL data warehouse optimized for big data analytics. Built on a significantly enhanced version of Apache Doris, it delivers exceptional speed, reliability, and ease of use.

StarRocks Architecture

StarRocks supports two main architectures: Shared-nothing and Shared-data. Each architecture has its own characteristics and is suitable for different use cases.

Shared-data Clusters

StarRocks Shared-data Architecture

Architecture: In this model, data is stored on a shared storage system (such as S3, Google Cloud Storage, Azure Blob Storage, MinIO), while compute nodes are only responsible for processing queries and computations.

Advantages:

  • Lower cost by leveraging the advantages of cheap object storage.
  • Independent scalability between compute and storage.
  • Better data recovery capabilities.

Disadvantages:

  • Performance may be lower than shared-nothing in some cases.
  • Dependent on the performance of the storage system.
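
In shared-data mode, the object storage backend is described by a storage volume. Besides the FE config file approach used later in this article (enable_load_volume_from_conf), a volume can also be created at runtime with SQL. The following is a minimal sketch; the volume name is illustrative and the endpoint and credentials mirror the MinIO setup from the hands-on section, so adjust them to your environment:

-- illustrative volume name; properties mirror the fe.conf used later in this article
CREATE STORAGE VOLUME minio_volume
TYPE = S3
LOCATIONS = ("s3://starrocks/")
PROPERTIES (
    "enabled" = "true",
    "aws.s3.endpoint" = "http://minio.minio.svc.cluster.local:9000",
    "aws.s3.region" = "us-east-1",
    "aws.s3.use_instance_profile" = "false",
    "aws.s3.access_key" = "admin",
    "aws.s3.secret_key" = "password"
);
SET minio_volume AS DEFAULT STORAGE VOLUME;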

Shared-nothing Clusters

StarRocks Shared-nothing Architecture

Architecture: In this model, each node in the cluster is independent and does not share any resources with other nodes. Each node has its own memory, CPU, and hard drive, and is responsible for storing, computing, and managing its own data.

Advantages:

  • Flexible scalability: Nodes can be easily added or removed to meet changing needs.
  • High fault tolerance: A failure on one node does not affect other nodes.
  • Good performance for heavy OLAP workloads.

Disadvantages:

  • Higher cost due to the requirement for more hardware resources.
  • More complex management.

DEPLOYMENT STEPS

  1. Initialize a Kubernetes cluster with Kind.
  2. Install Nginx Ingress Controller, MinIO, Hive Metastore, Trino, and Apache Airflow on Kubernetes.
  3. Trigger DAG: dbt_jaffle-shop-classic_example.
  4. Install StarRocks: I will deploy using the Shared-data cluster architecture.
  5. Configure and query the Iceberg catalog.
  6. Destroy the Kind cluster.

HANDS-ON STEPS

Reference Repository: https://github.com/viethqb/data-platform-notes/tree/main/starrocks

1. Initialize a Kubernetes cluster with Kind.

> cd ~/Documents
> git clone https://github.com/viethqb/data-platform-notes.git
> cd data-platform-notes/starrocks
> kind create cluster --name dev --config deployment/kind/kind-config.yaml
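
An optional sanity check before installing anything (Kind registers the cluster under the context name kind-dev):

> kubectl cluster-info --context kind-dev
> kubectl get nodes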

2. Install Nginx Ingress Controller, MinIO, Hive Metastore, Trino, and Apache Airflow on Kubernetes

Install Nginx Ingress Controller

> helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
> helm repo update
> helm upgrade --install ingress-nginx ingress-nginx/ingress-nginx --set controller.hostNetwork=true,controller.service.type="",controller.kind=DaemonSet --namespace ingress-nginx --version 4.10.1 --create-namespace --debug
> kubectl -n ingress-nginx get po -owide

Install MinIO

> helm repo add bitnami https://charts.bitnami.com/bitnami
> helm repo update
> helm upgrade --install minio -n minio -f deployment/minio/minio-values.yaml bitnami/minio --create-namespace --debug --version 14.6.0
> kubectl -n minio get po
> kubectl get no -owide
# NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
# dev-control-plane Ready control-plane 3m53s v1.30.0 172.18.0.2 <none> Debian GNU/Linux 12 (bookworm) 6.9.9-arch1-1 containerd://1.7.15
# dev-worker Ready <none> 3m11s v1.30.0 172.18.0.5 <none> Debian GNU/Linux 12 (bookworm) 6.9.9-arch1-1 containerd://1.7.15
# dev-worker2 Ready <none> 3m12s v1.30.0 172.18.0.3 <none> Debian GNU/Linux 12 (bookworm) 6.9.9-arch1-1 containerd://1.7.15
# dev-worker3 Ready <none> 3m11s v1.30.0 172.18.0.4 <none> Debian GNU/Linux 12 (bookworm) 6.9.9-arch1-1 containerd://1.7.15
# Add the following line to the end of /etc/hosts
172.18.0.4 minio.lakehouse.local airflow.lakehouse.local

Install Hive Metastore

# hive-metastore-postgresql
> helm repo add bitnami https://charts.bitnami.com/bitnami
> helm repo update
> helm upgrade --install metastore-db -n metastore -f deployment/hive/hive-metastore-postgres-values.yaml bitnami/postgresql --create-namespace --debug --version 15.4.2
# Hive metastore
# docker pull rtdl/hive-metastore:3.1.2
# kind load docker-image rtdl/hive-metastore:3.1.2 --name dev
> helm upgrade --install hive-metastore -n metastore -f deployment/hive/hive-metastore-values.yaml ../../charts/hive-metastore --create-namespace --debug

Install Trino

> helm repo add trino https://trinodb.github.io/charts
> helm upgrade --install trino -n trino -f deployment/trino/trino-values.yaml trino/trino --create-namespace --debug --version 0.21.0
> kubectl -n trino get po

Install Apache Airflow

> helm repo add airflow https://airflow.apache.org/
> helm repo update
> helm upgrade --install airflow airflow/airflow -f deployment/airflow/airflow-values.yaml --namespace airflow --create-namespace --debug --version 1.13.1 --timeout 600s
> kubectl -n airflow get po

Access at http://airflow.lakehouse.local/connection/list/ ⇒ user: admin & password: admin

Configure the S3 and Kubernetes connections in the Airflow UI

Access Airflow Connection at http://airflow.lakehouse.local/connection/list/ ⇒ add new record

Connection Id: s3_default
Connection Type: Amazon Web Services
AWS Access Key ID: admin
AWS Secret Access Key: password
Extra: {"endpoint_url": "http://minio.minio.svc.cluster.local:9000"}

Connection Id: kubernetes_default
Connection Type: Kubernetes Cluster Connection
In cluster configuration: yes
Disable SSL: yes
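
If you prefer not to click through the UI, the same S3 connection can also be created with the Airflow CLI from inside a webserver pod. This is a sketch assuming the default chart deployment name airflow-webserver; the kubernetes_default connection can be added the same way or via the UI as above:

> kubectl -n airflow exec -it deploy/airflow-webserver -- airflow connections add s3_default \
    --conn-type aws \
    --conn-login admin \
    --conn-password password \
    --conn-extra '{"endpoint_url": "http://minio.minio.svc.cluster.local:9000"}'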

3. Trigger DAG: dbt_jaffle-shop-classic_example

Create jaffle_shop schema:

> kubectl -n trino exec -it deployments/trino-coordinator -- trino
trino> CREATE SCHEMA lakehouse.jaffle_shop WITH (location = 's3a://lakehouse/jaffle_shop.db/');

Access at http://airflow.lakehouse.local/dags/dbt_jaffle-shop-classic_example/grid ⇒ Trigger

Check tables from jaffle_shop schema:

> kubectl -n trino exec -it deployments/trino-coordinator -- trino
trino> show tables from lakehouse.jaffle_shop;

4. Install StarRocks

Docs: https://github.com/StarRocks/starrocks-kubernetes-operator/tree/main/doc

Install starrocks-kubernetes-operator ⇒ Custom resource: starrocksclusters.starrocks.com

> kubectl apply -f https://raw.githubusercontent.com/StarRocks/starrocks-kubernetes-operator/main/deploy/starrocks.com_starrocksclusters.yaml
> kubectl apply -f https://raw.githubusercontent.com/StarRocks/starrocks-kubernetes-operator/main/deploy/operator.yaml
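
Before creating the cluster, it is worth confirming that the CRD is registered and the operator pod is running (the operator.yaml manifest deploys the operator into the starrocks namespace by default):

> kubectl get crd starrocksclusters.starrocks.com
> kubectl -n starrocks get po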

Create a StarRocks cluster with an empty root password:

shared_data_mode.yaml

apiVersion: starrocks.com/v1
kind: StarRocksCluster
metadata:
  name: starrockscluster
  namespace: starrocks
spec:
  starRocksFeSpec:
    image: starrocks/fe-ubuntu:3.3-latest
    replicas: 3
    # limits:
    #   cpu: 2
    #   memory: 4Gi
    # requests:
    #   cpu: 2
    #   memory: 4Gi
    # feEnvVars:
    #   - name: "MYSQL_PWD"
    #     valueFrom:
    #       secretKeyRef:
    #         name: rootcredential
    #         key: password
    service:
      type: NodePort # export fe service
      ports:
        - name: query # fill the name from the fe service ports
          nodePort: 32755
          port: 9030
          containerPort: 9030
    storageVolumes:
      - name: fe-storage-meta
        # storageClassName: "nfs-client" # you can remove this line if you want to use the default storage class
        storageSize: 10Gi # the size of storage volume for metadata
        mountPath: /opt/starrocks/fe/meta # the path of metadata
      - name: fe-storage-log
        # storageClassName: "nfs-client" # you can remove this line if you want to use the default storage class
        storageSize: 1Gi # the size of storage volume for log
        mountPath: /opt/starrocks/fe/log # the path of log
    configMapInfo:
      configMapName: starrockscluster-fe-cm
      resolveKey: fe.conf
  starRocksCnSpec:
    image: starrocks/cn-ubuntu:3.3-latest
    replicas: 1
    # limits:
    #   cpu: 2
    #   memory: 4Gi
    # requests:
    #   cpu: 2
    #   memory: 4Gi
    # cnEnvVars:
    #   - name: "MYSQL_PWD"
    #     valueFrom:
    #       secretKeyRef:
    #         name: rootcredential
    #         key: password
    configMapInfo:
      configMapName: starrockscluster-cn-cm
      resolveKey: cn.conf
    storageVolumes:
      - name: cn-storage-data
        # storageClassName: "nfs-client" # you can remove this line if you want to use the default storage class
        storageSize: 10Gi # the size of storage volume for data
        mountPath: /opt/starrocks/cn/storage # the path of data
      - name: cn-storage-log
        # storageClassName: "nfs-client" # you can remove this line if you want to use the default storage class
        storageSize: 1Gi # the size of storage volume for log
        mountPath: /opt/starrocks/cn/log # the path of log
  starRocksFeProxySpec:
    replicas: 1
    # limits:
    #   cpu: 1
    #   memory: 2Gi
    # requests:
    #   cpu: 1
    #   memory: 2Gi
    service:
      type: NodePort # export fe proxy service
      ports:
        - name: http-port # fill the name from the fe proxy service ports
          containerPort: 8080
          nodePort: 30180 # The range of valid ports is 30000-32767
          port: 8080
    resolver: "kube-dns.kube-system.svc.cluster.local" # this is the default dns server.
---
# fe config
apiVersion: v1
kind: ConfigMap
metadata:
  name: starrockscluster-fe-cm
  namespace: starrocks
  labels:
    cluster: starrockscluster
data:
  fe.conf: |
    LOG_DIR = ${STARROCKS_HOME}/log
    DATE = "$(date +%Y%m%d-%H%M%S)"
    JAVA_OPTS="-Dlog4j2.formatMsgNoLookups=true -Xmx8192m -XX:+UseMembar -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=7 -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSClassUnloadingEnabled -XX:-CMSParallelRemarkEnabled -XX:CMSInitiatingOccupancyFraction=80 -XX:SoftRefLRUPolicyMSPerMB=0 -Xloggc:${LOG_DIR}/fe.gc.log.$DATE"
    JAVA_OPTS_FOR_JDK_9="-Dlog4j2.formatMsgNoLookups=true -Xmx8192m -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=7 -XX:+CMSClassUnloadingEnabled -XX:-CMSParallelRemarkEnabled -XX:CMSInitiatingOccupancyFraction=80 -XX:SoftRefLRUPolicyMSPerMB=0 -Xlog:gc*:${LOG_DIR}/fe.gc.log.$DATE:time"
    JAVA_OPTS_FOR_JDK_11="-Dlog4j2.formatMsgNoLookups=true -Xmx8192m -XX:+UseG1GC -Xlog:gc*:${LOG_DIR}/fe.gc.log.$DATE:time"
    http_port = 8030
    rpc_port = 9020
    query_port = 9030
    edit_log_port = 9010
    mysql_service_nio_enabled = true
    sys_log_level = INFO
    run_mode = shared_data
    cloud_native_meta_port = 6090
    # Whether StarRocks may create the default storage volume from the storage-related properties in the FE config file
    enable_load_volume_from_conf = true
    aws_s3_path = starrocks
    aws_s3_region = us-east-1
    aws_s3_endpoint = http://minio.minio.svc.cluster.local:9000
    aws_s3_access_key = admin
    aws_s3_secret_key = password
---
# cn config
apiVersion: v1
kind: ConfigMap
metadata:
  name: starrockscluster-cn-cm
  namespace: starrocks
  labels:
    cluster: starrockscluster
data:
  cn.conf: |
    sys_log_level = INFO
    # ports for admin, web, heartbeat service
    thrift_port = 9060
    webserver_port = 8040
    heartbeat_service_port = 9050
    brpc_port = 8060

Create the rootcredential secret (used later) and apply the manifest; the root password is still empty at this stage:

> kubectl -n starrocks create secret generic rootcredential --from-literal=password=mysql_password
> kubectl apply -f deployment/starrocks/shared_data_mode.yaml
> kubectl -n starrocks get po -owide
> kubectl -n starrocks get starrocksclusters.starrocks.com
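
Once all pods are Running, a quick check from the SQL side confirms that the FE followers and the CN have registered (the root password is still empty at this point):

> mysql --host=172.18.0.4 --port=32755 --user=root
MySQL [(none)]> SHOW FRONTENDS;
MySQL [(none)]> SHOW COMPUTE NODES;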

Change the StarRocks root password:

# Connect to nodeport
> mysql --host=172.18.0.4 --port=32755 --user=root
MySQL [(none)]> SET PASSWORD FOR 'root' = PASSWORD('mysql_password');
MySQL [(none)]> quit

shared_data_mode_with_root_password.yaml

apiVersion: starrocks.com/v1
kind: StarRocksCluster
metadata:
  name: starrockscluster
  namespace: starrocks
spec:
  starRocksFeSpec:
    image: starrocks/fe-ubuntu:3.3-latest
    replicas: 3
    # limits:
    #   cpu: 2
    #   memory: 4Gi
    # requests:
    #   cpu: 2
    #   memory: 4Gi
    feEnvVars:
      - name: "MYSQL_PWD"
        valueFrom:
          secretKeyRef:
            name: rootcredential
            key: password
    service:
      type: NodePort # export fe service
      ports:
        - name: query # fill the name from the fe service ports
          nodePort: 32755
          port: 9030
          containerPort: 9030
    storageVolumes:
      - name: fe-storage-meta
        # storageClassName: "nfs-client" # you can remove this line if you want to use the default storage class
        storageSize: 10Gi # the size of storage volume for metadata
        mountPath: /opt/starrocks/fe/meta # the path of metadata
      - name: fe-storage-log
        # storageClassName: "nfs-client" # you can remove this line if you want to use the default storage class
        storageSize: 1Gi # the size of storage volume for log
        mountPath: /opt/starrocks/fe/log # the path of log
    configMapInfo:
      configMapName: starrockscluster-fe-cm
      resolveKey: fe.conf
  starRocksCnSpec:
    image: starrocks/cn-ubuntu:3.3-latest
    replicas: 1
    # limits:
    #   cpu: 2
    #   memory: 4Gi
    # requests:
    #   cpu: 2
    #   memory: 4Gi
    cnEnvVars:
      - name: "MYSQL_PWD"
        valueFrom:
          secretKeyRef:
            name: rootcredential
            key: password
    configMapInfo:
      configMapName: starrockscluster-cn-cm
      resolveKey: cn.conf
    storageVolumes:
      - name: cn-storage-data
        # storageClassName: "nfs-client" # you can remove this line if you want to use the default storage class
        storageSize: 10Gi # the size of storage volume for data
        mountPath: /opt/starrocks/cn/storage # the path of data
      - name: cn-storage-log
        # storageClassName: "nfs-client" # you can remove this line if you want to use the default storage class
        storageSize: 1Gi # the size of storage volume for log
        mountPath: /opt/starrocks/cn/log # the path of log
  starRocksFeProxySpec:
    replicas: 1
    # limits:
    #   cpu: 1
    #   memory: 2Gi
    # requests:
    #   cpu: 1
    #   memory: 2Gi
    service:
      type: NodePort # export fe proxy service
      ports:
        - name: http-port # fill the name from the fe proxy service ports
          containerPort: 8080
          nodePort: 30180 # The range of valid ports is 30000-32767
          port: 8080
    resolver: "kube-dns.kube-system.svc.cluster.local" # this is the default dns server.
---
# fe config
apiVersion: v1
kind: ConfigMap
metadata:
  name: starrockscluster-fe-cm
  namespace: starrocks
  labels:
    cluster: starrockscluster
data:
  fe.conf: |
    LOG_DIR = ${STARROCKS_HOME}/log
    DATE = "$(date +%Y%m%d-%H%M%S)"
    JAVA_OPTS="-Dlog4j2.formatMsgNoLookups=true -Xmx8192m -XX:+UseMembar -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=7 -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSClassUnloadingEnabled -XX:-CMSParallelRemarkEnabled -XX:CMSInitiatingOccupancyFraction=80 -XX:SoftRefLRUPolicyMSPerMB=0 -Xloggc:${LOG_DIR}/fe.gc.log.$DATE"
    JAVA_OPTS_FOR_JDK_9="-Dlog4j2.formatMsgNoLookups=true -Xmx8192m -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=7 -XX:+CMSClassUnloadingEnabled -XX:-CMSParallelRemarkEnabled -XX:CMSInitiatingOccupancyFraction=80 -XX:SoftRefLRUPolicyMSPerMB=0 -Xlog:gc*:${LOG_DIR}/fe.gc.log.$DATE:time"
    JAVA_OPTS_FOR_JDK_11="-Dlog4j2.formatMsgNoLookups=true -Xmx8192m -XX:+UseG1GC -Xlog:gc*:${LOG_DIR}/fe.gc.log.$DATE:time"
    http_port = 8030
    rpc_port = 9020
    query_port = 9030
    edit_log_port = 9010
    mysql_service_nio_enabled = true
    sys_log_level = INFO
    run_mode = shared_data
    cloud_native_meta_port = 6090
    # Whether StarRocks may create the default storage volume from the storage-related properties in the FE config file
    enable_load_volume_from_conf = true
    aws_s3_path = starrocks
    aws_s3_region = us-east-1
    aws_s3_endpoint = http://minio.minio.svc.cluster.local:9000
    aws_s3_access_key = admin
    aws_s3_secret_key = password
---
# cn config
apiVersion: v1
kind: ConfigMap
metadata:
  name: starrockscluster-cn-cm
  namespace: starrocks
  labels:
    cluster: starrockscluster
data:
  cn.conf: |
    sys_log_level = INFO
    # ports for admin, web, heartbeat service
    thrift_port = 9060
    webserver_port = 8040
    heartbeat_service_port = 9050
    brpc_port = 8060

Update the StarRocks cluster to use the root password:

> kubectl apply -f deployment/starrocks/shared_data_mode_with_root_password.yaml

Test StarRocks shared_data_mode

# Connect to nodeport
> mysql --host=172.18.0.4 --port=32755 --user=root --password --ssl=FALSE
MySQL [(none)]> create database test;
MySQL [(none)]> create table test.tbl1(id int, name varchar(10));
MySQL [(none)]> select * from test.tbl1;
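
Optionally insert a row as well, so that a data file actually lands in the bucket:

MySQL [(none)]> insert into test.tbl1 values (1, 'starrocks');
MySQL [(none)]> select * from test.tbl1;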

Data is saved in MinIO:

5. Configure and query the Iceberg catalog

> mysql --host=172.18.0.4 --port=32755 --user=root --password --ssl=FALSE
MySQL [(none)]> CREATE EXTERNAL CATALOG lakehouse PROPERTIES (
"type" = "iceberg",
"iceberg.catalog.type" = "hive",
"hive.metastore.uris" = "thrift://hive-metastore.metastore.svc.cluster.local:9083",
"aws.s3.enable_ssl" = "false",
"aws.s3.enable_path_style_access" = "true",
"aws.s3.endpoint" = "http://minio.minio.svc.cluster.local:9000",
"aws.s3.access_key" = "admin",
"aws.s3.secret_key" = "password"
);

MySQL [(none)]> show databases from lakehouse;
MySQL [(none)]> show tables from lakehouse.jaffle_shop;
MySQL [(none)]> select * from lakehouse.jaffle_shop.customers limit 10;
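
If you will run many queries against this catalog, you can also switch the session to it instead of fully qualifying every table name (SET CATALOG is available in StarRocks 3.x):

MySQL [(none)]> SET CATALOG lakehouse;
MySQL [(none)]> USE jaffle_shop;
MySQL [(none)]> select count(*) from customers;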

For comparison, the same Iceberg data can also be queried from the Apache Doris cluster covered earlier in this series, where the catalog was named iceberg:

mysql> show databases from iceberg;
mysql> show tables from iceberg.jaffle_shop;
mysql> select * from iceberg.jaffle_shop.customers limit 5;

6. Destroy the Kind cluster

> kind delete cluster --name dev

Conclusion

This is the final installment in our series on the Core Components for a Data Platform (Data Lakehouse). In this part, we learned how to deploy:

  • Object Storage with MinIO
  • Metadata Store with Hive Metastore
  • Table Format with Apache Iceberg
  • Processing with Trino, Spark Operator, and Spark Connect Server
  • Airflow to manage ETL workflows using Trino + dbt, Spark Operator, and Spark Connect Server.
  • Apache Doris and StarRocks as interfaces for applications that consume data from the Lakehouse.

As you may have noticed, this series has primarily focused on the practical aspects of deploying these technologies in a production environment, rather than exploring the underlying reasons, technical details, or optimal use cases. To gain a deeper understanding of these technologies, you’ll need to consult other resources.

In reality, you may not need to use all of the technologies mentioned above, or you may need to substitute some of them depending on the specific requirements of your project and the resources you have available.

In the next part, we will deploy services that support real-time processing, such as Kafka and RisingWave.