[Kubernetes Data Platform][Part 4][Main Components]: Install Hive Metastore, Trino on Kubernetes

Viet_1846


As we know, a Data Platform consists of four main components (see Part 1 for a review):

  • Storage: We will use MinIO. In Part 3, we learned how to deploy a multi-node, multi-disk MinIO cluster. In this article, we will deploy MinIO on Kubernetes to save time.
  • Metastore: We will use Hive Metastore. In this article, we will also install it to run on Kubernetes.
  • Table Format: We will choose Apache Iceberg.
  • Processing Engine: We will choose Trino.

In this article, we will deploy the data stack on a Kubernetes cluster created using Kind (Kubernetes in Docker).

Note: We are using Kind for the dev environment. For production, we should choose one of the methods presented in Part 2.1 to Part 2.4.

New Terms:

Hive Metastore: This will be a familiar tool if you have ever worked with the Hadoop ecosystem. The Hive Metastore (HMS) is a central repository of metadata for Hive tables and partitions, stored in a relational database, and it provides clients (including Hive, Impala, and Spark) access to this information through the metastore service API. It has become a building block for data lakes that draw on the diverse world of open-source software, such as Apache Spark and Presto. In fact, a whole ecosystem of tools, open source and otherwise, is built around the Hive Metastore.

Apache Iceberg: Iceberg is a high-performance format for large analytic tables. Iceberg brings the reliability and simplicity of SQL tables to big data, while making it possible for engines like Spark, Trino, Flink, Presto, Hive, and Impala to safely work with the same tables at the same time.
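
A concrete benefit of Iceberg's snapshot-based design is time travel: every write produces a new table snapshot, and readers always see a consistent snapshot, which is what lets multiple engines work on the same table safely. As a small, hedged illustration (lakehouse.raw.join_table is an Iceberg table we create in step 6 below, and the snapshot ID here is only a placeholder you would copy from the table's snapshot history), Trino can query a table as of an earlier snapshot or point in time:

trino> select * from lakehouse.raw.join_table FOR VERSION AS OF 1234567890123456789;
trino> select * from lakehouse.raw.join_table FOR TIMESTAMP AS OF TIMESTAMP '2024-01-01 00:00:00 UTC';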

Trino: Trino (formerly known as Presto) is an open-source distributed SQL query engine designed to query large datasets distributed across one or more heterogeneous data sources.

Key features of Trino include:

  • Standard SQL support: Trino adheres to the ANSI SQL standard, allowing you to use familiar SQL queries to query your data.
  • Fast query performance: Trino utilizes a distributed architecture and advanced query processing techniques to deliver fast query performance, even for large datasets.
  • High scalability: Trino scales horizontally by adding worker nodes as data volumes and query concurrency grow.
  • Support for multiple data sources: Trino can query data from a variety of sources, including data warehouses, data lakes, relational databases, cloud storage systems, and more.
  • Ease of use: Trino is easy to install, configure, and use.

DEPLOYMENT STEPS

1. Initialize a Kubernetes cluster with Kind.

2. Install Nginx Ingress Controller.

3. Install MinIO.

4. Install data sources:

  • Install MySQL.
  • Install PostgreSQL.
  • Upload a sample Parquet file to MinIO.

5. Install Hive Metastore:

  • Install PostgreSQL.
  • Install Hive Metastore.

6. Install Trino and test some features:

  • Read a Parquet file on MinIO.
  • Read a table on MySQL and PostgreSQL.
  • Join two tables across two databases.
  • Write to MinIO in Iceberg format.
  • Explore the Trino Web UI.

7. Destroy the Kind cluster.

HANDS-ON STEPS

Reference Repository: https://github.com/viethqb/data-platform-notes/tree/main/trino

1. Initialize a Kubernetes cluster with Kind

> cd ~/Documents
> git clone https://github.com/viethqb/data-platform-notes.git
> cd data-platform-notes/trino
> kind create cluster --name dev --config deployment/kind/kind-config.yaml
> kubectl get no -owide
Kind Kubernetes Cluster

2. Install Nginx Ingress Controller

# Install ingress nginx
root@master-1:~# helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
root@master-1:~# helm repo update
root@master-1:~# helm upgrade --install ingress-nginx ingress-nginx/ingress-nginx --set controller.hostNetwork=true,controller.service.type="",controller.kind=DaemonSet --namespace ingress-nginx --version 4.10.1 --create-namespace --debug
root@master-1:~# kubectl -n ingress-nginx get all
root@master-1:~# kubectl get IngressClass

3. Install MinIO on Kubernetes

> helm repo add bitnami https://charts.bitnami.com/bitnami
> helm repo update
> helm upgrade --install minio -n minio -f deployment/minio/minio-values.yaml bitnami/minio --create-namespace --debug --version 14.6.0
> kubectl -n minio get po -owide
> kubectl -n minio get ing
> kubectl get no -owide
> sudo vim /etc/hosts
# Add the following lines to the end of the /etc/hosts
# 172.18.0.5 minio.lakehouse.local

Access MinIO at http://minio.lakehouse.local/ in your web browser (user: admin, password: password).
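
If you have the MinIO client (mc) installed, you can also verify access from the command line. This is a hedged sketch assuming the Bitnami chart defaults (a minio service exposing the S3 API on port 9000); the local alias name is arbitrary:

> kubectl -n minio port-forward svc/minio 9000:9000
> mc alias set local http://localhost:9000 admin password
> mc ls local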

4. Install data sources

Install MySQL on Kubernetes

> kubectl apply -f deployment/mysql/namespace.yaml
> kubectl apply -f deployment/mysql/deployment.yaml
> kubectl apply -f deployment/mysql/serrvice.yaml
# user: root / password: debezium

Check MySQL data

#check mysql data
> kubectl -n data-source exec -it deployments/mysql -- bash
bash-4.4# mysql -uroot -P3306 -hmysql.data-source -p
mysql> show databases;
mysql> show tables in inventory;
mysql> select * from inventory.customers;

Install PostgreSQL on Kubernetes

> kubectl apply -f deployment/postgres/postgres.yaml
# user/password: data_engineer/password
# psql -h postgres.data-source -p 5432 -U data_engineer -d data_engineer

> kubectl apply -f deployment/postgres/postgresql-client.yml
> kubectl -n data-source exec -it postgresql-client -- sh
~ $ psql -h postgres.data-source -p 5432 -U data_engineer -d data_engineer
data_engineer=# \dt inventory.*
data_engineer=# select * from inventory.customers;

Upload the Parquet file using the MinIO Console

Access MinIO at http://minio.lakehouse.local/browser/lakehouse in your web browser => create a new path: raw/yellow_tripdata/y=2009/m=01 => upload the file yellow_tripdata_2009-01.parquet
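
Alternatively, if you configured the mc alias above, the same upload can be done from the command line (the y=2009/m=01 prefix matches the partition layout used by the table we create later):

> mc cp yellow_tripdata_2009-01.parquet local/lakehouse/raw/yellow_tripdata/y=2009/m=01/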

5. Install Hive Metastore on Kubernetes

Install hive-metastore-postgresql

> helm repo add bitnami https://charts.bitnami.com/bitnami
> helm repo update
> helm upgrade --install metastore-db -n metastore -f deployment/hive/hive-metastore-postgres-values.yaml bitnami/postgresql --create-namespace --debug --version 15.4.2
> kubectl -n metastore get all
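
Optionally, confirm that the metastore database is accepting connections. The pod name below follows the Bitnami convention <release>-postgresql-0, so adjust it if your release name differs:

# Pod name assumed from the Bitnami chart naming convention
> kubectl -n metastore exec -it metastore-db-postgresql-0 -- pg_isready -h 127.0.0.1 -p 5432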

Install Hive Metastore

# Optional: pre-pull the metastore image and load it into the Kind cluster
# docker pull rtdl/hive-metastore:3.1.2
# kind load docker-image rtdl/hive-metastore:3.1.2 --name dev
> helm upgrade --install hive-metastore -n metastore -f deployment/hive/hive-metastore-values.yaml ../charts/hive-metastore --create-namespace --debug
> kubectl -n metastore get all
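
The part worth double-checking is that the metastore exposes its Thrift endpoint on port 9083, because that is what the hive.metastore.uri in the Trino catalogs below points at:

> kubectl -n metastore get svc
# Expect a hive-metastore service with port 9083 (the exact name depends on the chart release)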

6. Install Trino on Kubernetes

trino-values.yaml

additionalCatalogs:
  lakehouse: |
    connector.name=iceberg
    hive.metastore.uri=thrift://hive-metastore.metastore.svc.cluster.local:9083
    hive.s3.path-style-access=true
    hive.s3.endpoint=http://minio.minio.svc.cluster.local:9000
    hive.s3.aws-access-key=admin
    hive.s3.aws-secret-key=password
    iceberg.register-table-procedure.enabled=true
    iceberg.unique-table-location=true
    iceberg.hive-catalog-name=minio
  minio: |
    connector.name=hive
    hive.metastore.uri=thrift://hive-metastore.metastore.svc.cluster.local:9083
    hive.s3.path-style-access=true
    hive.s3.endpoint=http://minio.minio.svc.cluster.local:9000
    hive.s3.aws-access-key=admin
    hive.s3.aws-secret-key=password
    hive.metastore.thrift.client.connect-timeout=3m
    hive.metastore.thrift.client.read-timeout=3m
    hive.s3.connect-timeout=3m
    hive.s3.socket-timeout=3m
    hive.non-managed-table-writes-enabled=true
    hive.storage-format=ORC
    hive.partition-projection-enabled=true
  postgresql: |
    connector.name=postgresql
    connection-url=jdbc:postgresql://postgres.data-source.svc.cluster.local:5432/data_engineer
    connection-user=data_engineer
    connection-password=password
  mysql: |
    connector.name=mysql
    connection-url=jdbc:mysql://mysql.data-source.svc.cluster.local:3306?zeroDateTimeBehavior=convertToNull
    connection-user=root
    connection-password=debezium
    case-insensitive-name-matching=true
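
This values file defines four catalogs: lakehouse (the Iceberg connector, storing table data on MinIO and table metadata in the Hive Metastore), minio (the Hive connector, used here to expose the raw Parquet files already in the bucket), and postgresql and mysql (JDBC connectors pointing at the data sources installed in step 4). The S3 endpoint, access key, and secret key must match your MinIO deployment.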

Install Trino on Kubernetes

Trino has many configuration options. To view all the values supported by the Helm chart, run: helm show values trino/trino > trino-values.yaml

Then, modify the trino-values.yaml file to suit your specific needs.

> helm repo add trino https://trinodb.github.io/charts
> helm repo update
> helm upgrade --install trino -n trino -f deployment/trino/trino-values.yaml trino/trino --create-namespace --debug --version 0.21.0
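
Wait for the coordinator and worker pods to be Ready before connecting:

> kubectl -n trino get po -owide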

Read a Parquet file on MinIO

Create the raw schema and an external table over the uploaded Parquet files using the Hive connector (the lakehouse and minio catalogs share the same Hive Metastore, so the schema created below is visible from both):

> kubectl -n trino exec -it deployments/trino-coordinator -- trino
trino> CREATE SCHEMA lakehouse.raw WITH (location = 's3a://lakehouse/raw/');
trino> CREATE TABLE IF NOT EXISTS minio.raw.yellow_tripdata (
    vendor_name varchar,
    Trip_Pickup_DateTime varchar,
    Trip_Dropoff_DateTime varchar,
    Passenger_Count bigint,
    Trip_Distance double,
    Start_Lon double,
    Start_Lat double,
    Rate_Code double,
    store_and_forward double,
    End_Lon double,
    End_Lat double,
    Payment_Type varchar,
    Fare_Amt double,
    surcharge double,
    mta_tax double,
    Tip_Amt double,
    Tolls_Amt double,
    Total_Amt double,
    y varchar,
    m varchar
)
WITH (
    format = 'PARQUET',
    external_location = 's3a://lakehouse/raw/yellow_tripdata',
    partitioned_by = ARRAY['y', 'm']
);

trino> CALL minio.system.sync_partition_metadata('raw', 'yellow_tripdata', 'FULL');
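
The sync_partition_metadata procedure scans the external location and registers the y=.../m=... directories as partitions in the Hive Metastore. You can list what was registered through the hidden $partitions table exposed by the Hive connector:

trino> select * from minio.raw."yellow_tripdata$partitions";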

Read Hive table

trino> select * from minio.raw.yellow_tripdata limit 10;

Read a table on MySQL

Show information about the mysql catalog

trino> show catalogs;
trino> show schemas from mysql;
trino> show tables from mysql.inventory;

Read data from mysql catalog

trino> select * from mysql.inventory.customers;

Read a table on PostgreSQL

trino> show schemas from postgresql;
trino> show tables from postgresql.inventory;
trino> select * from postgresql.inventory.orders;

Join two tables across two databases

trino> select t1.*, t2.name, t2.description, t2.weight from postgresql.inventory.orders t1
left join mysql.inventory.products t2 on t1.product_id = t2.id;

Write to MinIO in Iceberg format

PostgreSQL + MySQL to Iceberg

trino> create table lakehouse.raw.join_table as
select t1.*, t2.name, t2.description, t2.weight from postgresql.inventory.orders t1
left join mysql.inventory.products t2 on t1.product_id = t2.id;

trino> select * from lakehouse.raw.join_table;

join_table on MinIO

Parquet to Iceberg

Create Iceberg table

trino> CREATE TABLE IF NOT EXISTS lakehouse.raw.yellow_tripdata_trino_iceberg (
    vendor_name varchar,
    Trip_Pickup_DateTime varchar,
    Trip_Dropoff_DateTime varchar,
    Passenger_Count bigint,
    Trip_Distance double,
    Start_Lon double,
    Start_Lat double,
    Rate_Code double,
    store_and_forward double,
    End_Lon double,
    End_Lat double,
    Payment_Type varchar,
    Fare_Amt double,
    surcharge double,
    mta_tax double,
    Tip_Amt double,
    Tolls_Amt double,
    Total_Amt double,
    y varchar,
    m varchar
)
WITH (
    format = 'PARQUET',
    format_version = 2,
    partitioning = ARRAY['y', 'm']
);

Insert data into the Iceberg table

trino> insert into lakehouse.raw.yellow_tripdata_trino_iceberg
select * from lakehouse.raw.yellow_tripdata;
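
The Iceberg connector also exposes metadata tables, which are useful for confirming that the insert created a snapshot and for checking row counts:

trino> select snapshot_id, committed_at, operation from lakehouse.raw."yellow_tripdata_trino_iceberg$snapshots";
trino> select count(*) from lakehouse.raw.yellow_tripdata_trino_iceberg;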

Trino Web UI

Trino provides a web-based user interface (UI) for monitoring a Trino cluster and managing queries. The Web UI is accessible on the coordinator via HTTP or HTTPS, using the corresponding port number specified in the coordinator Config properties. It can be configured with Web UI properties.

> kubectl -n trino port-forward services/trino 9000:8080
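
Then open http://localhost:9000 in your browser. With the chart's default configuration (no authentication), the Web UI login form accepts any username.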

7. Destroy the Kind cluster.

> kind delete cluster --name dev
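
If you added the minio.lakehouse.local entry to /etc/hosts earlier, you can remove it as well.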

Conclusion

In this article, we have successfully deployed Trino and integrated it with Hive Metastore, MinIO, MySQL, and PostgreSQL.

We have demonstrated Trino’s capabilities by:

  • Reading a Parquet file from MinIO
  • Reading tables from MySQL and PostgreSQL
  • Joining two tables across different databases: one in MySQL, the other in PostgreSQL
  • Writing data to MinIO in Iceberg format

I believe you now have a good understanding of how Trino works.

When implementing Trino in a production environment, you may need to adjust configurations to optimize performance based on your specific workload and available resources.

My latest data platform also leverages Trino as the core processing engine. Combining Trino with dbt and Airflow creates a powerful ETL solution. I will delve deeper into this architecture in upcoming articles.