[Kubernetes Data Platform][Part 4][Main Components]: Install Hive Metastore, Trino on Kubernetes

Viet_1846


As we know, a Data Platform consists of four main components (see Part 1 for a review):

  • Storage: We will use MinIO. In Part 3, we learned how to deploy a multi-node, multi-disk MinIO cluster. In this article, we will deploy MinIO on Kubernetes to save time.
  • Metastore: We will use Hive Metastore. In this article, we will also install it to run on Kubernetes.
  • Table Format: We will choose Apache Iceberg.
  • Processing Engine: We will choose Trino.

In this article, we will deploy the data stack on a Kubernetes cluster created using Kind (Kubernetes in Docker).

Note: We are using Kind for the dev environment. For production, we should choose one of the methods presented in Part 2.1 to Part 2.4.

New Terms:

Hive Metastore: This will be a familiar tool if you have ever worked with the Hadoop ecosystem. The Hive Metastore (HMS) is a central repository of metadata for Hive tables and partitions, stored in a relational database, and it provides clients (including Hive, Impala, and Spark) access to this information through the metastore service API. It has become a building block for data lakes that draw on the diverse world of open-source software, such as Apache Spark and Presto. In fact, a whole ecosystem of tools, open source and otherwise, is built around the Hive Metastore.

Apache Iceberg: Iceberg is a high-performance format for large analytic tables. Iceberg brings the reliability and simplicity of SQL tables to big data, while making it possible for engines like Spark, Trino, Flink, Presto, Hive, and Impala to safely work with the same tables at the same time.
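
A concrete benefit of Iceberg's snapshot-based design is time travel: every write produces a new table snapshot, and readers always see a consistent snapshot, which is what lets multiple engines work on the same table safely. As a small, hedged illustration (lakehouse.raw.join_table is an Iceberg table we create in step 6 below, and the snapshot ID here is only a placeholder you would copy from the table's snapshot history), Trino can query a table as of an earlier snapshot or point in time:

trino> select * from lakehouse.raw.join_table FOR VERSION AS OF 1234567890123456789;
trino> select * from lakehouse.raw.join_table FOR TIMESTAMP AS OF TIMESTAMP '2024-01-01 00:00:00 UTC';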

Trino: Trino (formerly known as Presto) is an open-source distributed SQL query engine designed to query large datasets distributed across one or more heterogeneous data sources.

Key features of Trino include:

  • Standard SQL support: Trino adheres to the ANSI SQL standard, allowing you to use familiar SQL queries to query your data.
  • Fast query performance: Trino utilizes a distributed architecture and advanced query processing techniques to deliver fast query performance, even for large datasets.
  • High scalability: Trino scales horizontally by adding worker nodes as data volumes and query concurrency grow.
  • Support for multiple data sources: Trino can query data from a variety of sources, including data warehouses, data lakes, relational databases, cloud storage systems, and more.
  • Ease of use: Trino is easy to install, configure, and use.

DEPLOYMENT STEPS

1. Initialize a Kubernetes cluster with Kind.

2. Install Nginx Ingress Controller.

3. Install MinIO.

4. Install data sources:

  • Install MySQL.
  • Install PostgreSQL.
  • Upload a sample Parquet file to MinIO.

5. Install Hive Metastore:

  • Install PostgreSQL.
  • Install Hive Metastore.

6. Install Trino and test some features:

  • Read a Parquet file on MinIO.
  • Read a table on MySQL and PostgreSQL.
  • Join two tables across two databases.
  • Write to MinIO in Iceberg format.
  • Explore the Trino Web UI.

7. Destroy the Kind cluster.

HANDS-ON STEPS

Reference Repository: https://github.com/viethqb/data-platform-notes/tree/main/trino

1. Initialize a Kubernetes cluster with Kind

> cd ~/Documents
> git clone https://github.com/viethqb/data-platform-notes.git
> cd data-platform-notes/trino
> kind create cluster --name dev --config deployment/kind/kind-config.yaml
> kubectl get no -owide
Kind Kubernetes Cluster

2. Install Nginx Ingress Controller

# Install ingress nginx
root@master-1:~# helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
root@master-1:~# helm repo update
root@master-1:~# helm upgrade --install ingress-nginx ingress-nginx/ingress-nginx --set controller.hostNetwork=true,controller.service.type="",controller.kind=DaemonSet --namespace ingress-nginx --version 4.10.1 --create-namespace --debug
root@master-1:~# kubectl -n ingress-nginx get all
root@master-1:~# kubectl get IngressClass

3. Install MinIO on Kubernetes

> helm repo add bitnami https://charts.bitnami.com/bitnami
> helm repo update
> helm upgrade --install minio -n minio -f deployment/minio/minio-values.yaml bitnami/minio --create-namespace --debug --version 14.6.0
> kubectl -n minio get po -owide
> kubectl -n minio get ing
> kubectl get no -owide
> sudo vim /etc/hosts
# Add the following lines to the end of the /etc/hosts
# 172.18.0.5 minio.lakehouse.local

Access MinIO at http://minio.lakehouse.local/ in your web browser (user: admin, password: password).
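
If you have the MinIO client (mc) installed, you can also verify access from the command line. This is a hedged sketch assuming the Bitnami chart defaults (a minio service exposing the S3 API on port 9000); the local alias name is arbitrary:

> kubectl -n minio port-forward svc/minio 9000:9000
> mc alias set local http://localhost:9000 admin password
> mc ls local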

4. Install data sources

Install MySQL on Kubernetes

> kubectl apply -f deployment/mysql/namespace.yaml
> kubectl apply -f deployment/mysql/deployment.yaml
> kubectl apply -f deployment/mysql/serrvice.yaml
# user: root / password: debezium

Check MySQL data

#check mysql data
> kubectl -n data-source exec -it deployments/mysql -- bash
bash-4.4# mysql -uroot -P3306 -hmysql.data-source -p
mysql> show databases;
mysql> show tables in inventory;
mysql> select * from inventory.customers;

Install PostgreSQL on Kubernetes

> kubectl apply -f deployment/postgres/postgres.yaml
# user/password: data_engineer/password
# psql -h postgres.data-source -p 5432 -U data_engineer -d data_engineer

> kubectl apply -f deployment/postgres/postgresql-client.yml
> kubectl -n data-source exec -it postgresql-client -- sh
~ $ psql -h postgres.data-source -p 5432 -U data_engineer -d data_engineer
data_engineer=# \dt inventory.*
data_engineer=# select * from inventory.customers;

Upload the Parquet file using the MinIO Console

Access MinIO at http://minio.lakehouse.local/browser/lakehouse in your web browser => create a new path: raw/yellow_tripdata/y=2009/m=01 => upload the file yellow_tripdata_2009-01.parquet
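
Alternatively, if you configured the mc alias above, the same upload can be done from the command line (the y=2009/m=01 prefix matches the partition layout used by the table we create later):

> mc cp yellow_tripdata_2009-01.parquet local/lakehouse/raw/yellow_tripdata/y=2009/m=01/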

5. Install Hive Metastore on Kubernetes

Install hive-metastore-postgresql

> helm repo add bitnami https://charts.bitnami.com/bitnami
> helm repo update
> helm upgrade --install metastore-db -n metastore -f deployment/hive/hive-metastore-postgres-values.yaml bitnami/postgresql --create-namespace --debug --version 15.4.2
> kubectl -n metastore get all
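
Optionally, confirm that the metastore database is accepting connections. The pod name below follows the Bitnami convention <release>-postgresql-0, so adjust it if your release name differs:

# Pod name assumed from the Bitnami chart naming convention
> kubectl -n metastore exec -it metastore-db-postgresql-0 -- pg_isready -h 127.0.0.1 -p 5432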

Install Hive Metastore

# Optional: pre-pull the metastore image and load it into the Kind cluster
# docker pull rtdl/hive-metastore:3.1.2
# kind load docker-image rtdl/hive-metastore:3.1.2 --name dev
> helm upgrade --install hive-metastore -n metastore -f deployment/hive/hive-metastore-values.yaml ../charts/hive-metastore --create-namespace --debug
> kubectl -n metastore get all
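
The part worth double-checking is that the metastore exposes its Thrift endpoint on port 9083, because that is what the hive.metastore.uri in the Trino catalogs below points at:

> kubectl -n metastore get svc
# Expect a hive-metastore service with port 9083 (the exact name depends on the chart release)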

6. Install Trino on Kubernetes

trino-values.yaml

additionalCatalogs:
  lakehouse: |
    connector.name=iceberg
    hive.metastore.uri=thrift://hive-metastore.metastore.svc.cluster.local:9083
    hive.s3.path-style-access=true
    hive.s3.endpoint=http://minio.minio.svc.cluster.local:9000
    hive.s3.aws-access-key=admin
    hive.s3.aws-secret-key=password
    iceberg.register-table-procedure.enabled=true
    iceberg.unique-table-location=true
    iceberg.hive-catalog-name=minio
  minio: |
    connector.name=hive
    hive.metastore.uri=thrift://hive-metastore.metastore.svc.cluster.local:9083
    hive.s3.path-style-access=true
    hive.s3.endpoint=http://minio.minio.svc.cluster.local:9000
    hive.s3.aws-access-key=admin
    hive.s3.aws-secret-key=password
    hive.metastore.thrift.client.connect-timeout=3m
    hive.metastore.thrift.client.read-timeout=3m
    hive.s3.connect-timeout=3m
    hive.s3.socket-timeout=3m
    hive.non-managed-table-writes-enabled=true
    hive.storage-format=ORC
    hive.partition-projection-enabled=true
  postgresql: |
    connector.name=postgresql
    connection-url=jdbc:postgresql://postgres.data-source.svc.cluster.local:5432/data_engineer
    connection-user=data_engineer
    connection-password=password
  mysql: |
    connector.name=mysql
    connection-url=jdbc:mysql://mysql.data-source.svc.cluster.local:3306?zeroDateTimeBehavior=convertToNull
    connection-user=root
    connection-password=debezium
    case-insensitive-name-matching=true
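
This values file defines four catalogs: lakehouse (the Iceberg connector, storing table data on MinIO and table metadata in the Hive Metastore), minio (the Hive connector, used here to expose the raw Parquet files already in the bucket), and postgresql and mysql (JDBC connectors pointing at the data sources installed in step 4). The S3 endpoint, access key, and secret key must match your MinIO deployment.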

Install Trino on Kubernetes

Trino has many configuration options. To view all the values supported by the Helm chart, run: helm show values trino/trino > trino-values.yaml

Then, modify the trino-values.yaml file to suit your specific needs.

> helm repo add trino https://trinodb.github.io/charts
> helm repo update
> helm upgrade --install trino -n trino -f deployment/trino/trino-values.yaml trino/trino --create-namespace --debug --version 0.21.0
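
Wait for the coordinator and worker pods to be Ready before connecting:

> kubectl -n trino get po -owide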

Read a Parquet file on MinIO

Create the raw schema and an external table over the uploaded Parquet files using the Hive connector (the lakehouse and minio catalogs share the same Hive Metastore, so the schema created below is visible from both):

> kubectl -n trino exec -it deployments/trino-coordinator -- trino
trino> CREATE SCHEMA lakehouse.raw WITH (location = 's3a://lakehouse/raw/');
trino> CREATE TABLE IF NOT EXISTS minio.raw.yellow_tripdata (
    vendor_name varchar,
    Trip_Pickup_DateTime varchar,
    Trip_Dropoff_DateTime varchar,
    Passenger_Count bigint,
    Trip_Distance double,
    Start_Lon double,
    Start_Lat double,
    Rate_Code double,
    store_and_forward double,
    End_Lon double,
    End_Lat double,
    Payment_Type varchar,
    Fare_Amt double,
    surcharge double,
    mta_tax double,
    Tip_Amt double,
    Tolls_Amt double,
    Total_Amt double,
    y varchar,
    m varchar
)
WITH (
    format = 'PARQUET',
    external_location = 's3a://lakehouse/raw/yellow_tripdata',
    partitioned_by = ARRAY['y', 'm']
);

trino> CALL minio.system.sync_partition_metadata('raw', 'yellow_tripdata', 'FULL');
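
The sync_partition_metadata procedure scans the external location and registers the y=.../m=... directories as partitions in the Hive Metastore. You can list what was registered through the hidden $partitions table exposed by the Hive connector:

trino> select * from minio.raw."yellow_tripdata$partitions";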

Read Hive table

trino> select * from minio.raw.yellow_tripdata limit 10;

Read a table on MySQL

Show information about the mysql catalog

trino> show catalogs;
trino> show schemas from mysql;
trino> show tables from mysql.inventory;

Read data from mysql catalog

trino> select * from mysql.inventory.customers;

Read a table on PostgreSQL

trino> show schemas from postgresql;
trino> show tables from postgresql.inventory;
trino> select * from postgresql.inventory.orders;

Join two tables across two databases

trino> select t1.*, t2.name, t2.description, t2.weight from postgresql.inventory.orders t1
left join mysql.inventory.products t2 on t1.product_id = t2.id;

Write to MinIO in Iceberg format

PostgreSQL + MySQL to Iceberg

trino> create table lakehouse.raw.join_table as
select t1.*, t2.name, t2.description, t2.weight from postgresql.inventory.orders t1
left join mysql.inventory.products t2 on t1.product_id = t2.id;

trino> select * from lakehouse.raw.join_table;

join_table on MinIO

Parquet to Iceberg

Create Iceberg table

trino> CREATE TABLE IF NOT EXISTS lakehouse.raw.yellow_tripdata_trino_iceberg (
    vendor_name varchar,
    Trip_Pickup_DateTime varchar,
    Trip_Dropoff_DateTime varchar,
    Passenger_Count bigint,
    Trip_Distance double,
    Start_Lon double,
    Start_Lat double,
    Rate_Code double,
    store_and_forward double,
    End_Lon double,
    End_Lat double,
    Payment_Type varchar,
    Fare_Amt double,
    surcharge double,
    mta_tax double,
    Tip_Amt double,
    Tolls_Amt double,
    Total_Amt double,
    y varchar,
    m varchar
)
WITH (
    format = 'PARQUET',
    format_version = 2,
    partitioning = ARRAY['y', 'm']
);

Insert data into the Iceberg table

trino> insert into lakehouse.raw.yellow_tripdata_trino_iceberg
select * from lakehouse.raw.yellow_tripdata;
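
The Iceberg connector also exposes metadata tables, which are useful for confirming that the insert created a snapshot and for checking row counts:

trino> select snapshot_id, committed_at, operation from lakehouse.raw."yellow_tripdata_trino_iceberg$snapshots";
trino> select count(*) from lakehouse.raw.yellow_tripdata_trino_iceberg;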

Trino Web UI

Trino provides a web-based user interface (UI) for monitoring a Trino cluster and managing queries. The Web UI is accessible on the coordinator via HTTP or HTTPS, using the corresponding port number specified in the coordinator Config properties. It can be configured with Web UI properties.

> kubectl -n trino port-forward services/trino 9000:8080
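
Then open http://localhost:9000 in your browser. With the chart's default configuration (no authentication), the Web UI login form accepts any username.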

7. Destroy the Kind cluster.

> kind delete cluster --name dev
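
If you added the minio.lakehouse.local entry to /etc/hosts earlier, you can remove it as well.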

Conclusion

In this article, we have successfully deployed Trino and integrated it with Hive Metastore, MinIO, MySQL, and PostgreSQL.

We have demonstrated Trino’s capabilities by:

  • Reading a Parquet file from MinIO
  • Reading tables from MySQL and PostgreSQL
  • Joining two tables across different databases: one in MySQL, the other in PostgreSQL
  • Writing data to MinIO in Iceberg format

I believe you now have a good understanding of how Trino works.

When implementing Trino in a production environment, you may need to adjust configurations to optimize performance based on your specific workload and available resources.

My latest data platform also leverages Trino as the core processing engine. Combining Trino with dbt and Airflow creates a powerful ETL solution. I will delve deeper into this architecture in upcoming articles.