Why deploy to your own cloud? #
Running ParkerDB in your own cloud gives you full control over your data and infrastructure.
The main benefits are:
- Isolation from other users, for consistent performance and security.
- No data privacy concerns: no data or credentials are ever sent to us.
- Better performance, since the data stays close to your applications.
- Full control over the network topology and server hardware.
- No cloud provider lock-in. You can move the data to another cloud provider at any time.
How to deploy on your own cloud? #
You can deploy ParkerDB on your own cloud with the following steps:
- Start a few ParkerDB instances on your cloud.
  - The total number of instances depends on the data size and query rate.
- Configure your load balancer to randomly distribute the queries to the ParkerDB instances.
Step-by-step guide #
0. Prerequisites #
- Email “support at parkerdb dot com” to get a 30-day temporary license file, providing the following information:
  - Your company name
  - Contact email
- Prepare the AWS credentials with S3 read access to the data files (see the verification example after this list).
  - The credentials are only used to list and read the data files.
  - The credentials are never sent to us.
- A few servers with Docker installed to run the ParkerDB instances.
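Before moving on, it can help to verify that the credentials can actually list the data files. A minimal check, assuming the AWS CLI is installed (the bucket path below is a hypothetical placeholder):
export AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY_ID
export AWS_SECRET_ACCESS_KEY=YOUR_SECRET_ACCESS_KEY
export AWS_DEFAULT_REGION=us-west-2
# List one of the table directories; replace the path with one of your own data paths
aws s3 ls s3://your-bucket/path/to/table1/
If the command prints the partition directories, the credentials have the read access ParkerDB needs.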
1. Prepare a configuration file #
Create a config.yaml file with the following content:
cluster_name: YOUR_CLUSTER_NAME
# Update this version if this configuration is updated
version: 1
tables:
  - name: Table1
    name_in_url: t1  # used in the query url
    primary_key: the_primary_key_column_name
    hive_directory_pattern: s3://your-bucket/path/to/table1/{date}/{hour}/
For the Hive directory layout, the date partition and hour partition should use the YYYY-MM-DD and HH formats respectively.
Here is the common placeholder syntax (an expansion example follows the table):
| Placeholder | Format |
|---|---|
| date | YYYY-MM-DD |
| hour | HH |
| year | YYYY |
| month | MM |
| day | DD |
| ds | YYYYMMDD |
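As an illustration, the hive_directory_pattern shown above would resolve to hourly directories such as the following (hypothetical dates and hours):
s3://your-bucket/path/to/table1/2024-05-01/08/
s3://your-bucket/path/to/table1/2024-05-01/09/
s3://your-bucket/path/to/table1/2024-05-02/00/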
Note: support for Iceberg, Delta Lake, and Hudi is planned but not yet available.
2. Start ParkerDB instances #
Start a few ParkerDB instances with the following command:
export HTTP_PORT=38250
export GRPC_PORT=37275
docker run \
-e AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY_ID \
-e AWS_SECRET_ACCESS_KEY=YOUR_SECRET_ACCESS_KEY \
-e AWS_DEFAULT_REGION=us-west-2 \
-e AWS_ENDPOINT_URL=object_storage_host:object_storage_port \
-e PARKER_GRPC_PORT=$GRPC_PORT \
-e PARKER_HTTP_PORT=$HTTP_PORT \
-e PARKER_HOST=$(hostname -i) \
-v /path/to/config.yaml:/etc/parkerdb/config.yaml \
-v /path/to/license.json:/etc/parkerdb/license.json \
-p ${HTTP_PORT}:8250 \
-p ${GRPC_PORT}:7275 \
parkerdb/parker
Note: for better security, you can put the AWS credentials in a file and use the --env-file option to pass them to the container.
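For example, a sketch of that approach (the file name aws.env is arbitrary):
# aws.env - keep it readable only by the deploy user, e.g. chmod 600 aws.env
AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY=YOUR_SECRET_ACCESS_KEY
AWS_DEFAULT_REGION=us-west-2
# Add AWS_ENDPOINT_URL=... here if you use a custom object storage endpoint
Then start the container with --env-file instead of passing the credentials with individual -e flags:
docker run \
--env-file ./aws.env \
-e PARKER_GRPC_PORT=$GRPC_PORT \
-e PARKER_HTTP_PORT=$HTTP_PORT \
-e PARKER_HOST=$(hostname -i) \
-v /path/to/config.yaml:/etc/parkerdb/config.yaml \
-v /path/to/license.json:/etc/parkerdb/license.json \
-p ${HTTP_PORT}:8250 \
-p ${GRPC_PORT}:7275 \
parkerdb/parker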
3. (Optional) Configure your load balancer #
Here are the health check endpoints:
- The HTTP health check endpoint is http://host:38250/healthz.
- The gRPC health check endpoint is host:37275.
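To check an instance directly, you can probe these endpoints by hand. A minimal sketch, assuming curl is available and, for the gRPC side, that the endpoint implements the standard grpc.health.v1 health protocol so the grpc-health-probe tool can be used:
# HTTP health check; a healthy instance returns HTTP 200
curl -i http://host:38250/healthz
# gRPC health check with grpc-health-probe
grpc_health_probe -addr=host:37275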
Configure your load balancer to randomly distribute the queries to the ParkerDB instances.
For example, you can use the following configuration file to run a load balancer with Nginx:
events {}  # required in a standalone nginx.conf

http {
    upstream grpc_backend {
        server parker-instance1:37275;
        server parker-instance2:37275;
        server parker-instance3:37275;
    }

    server {
        listen 7275 http2;

        # gRPC service requests
        location / {
            grpc_pass grpc://grpc_backend;
        }
    }
}
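If you also want to load-balance the HTTP endpoint through the same Nginx instance, a similar plain-HTTP upstream can be added. A sketch, assuming the instances expose their HTTP port on 38250 as in the docker run example above; these blocks go inside the same http { ... } context:
    upstream http_backend {
        server parker-instance1:38250;
        server parker-instance2:38250;
        server parker-instance3:38250;
    }

    server {
        # Port clients use to reach the HTTP API through the load balancer
        listen 8250;

        location / {
            proxy_pass http://http_backend;
        }
    }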