Why deploy to your own cloud? #
Running ParkerDB in your own cloud gives you full control over your data and infrastructure.
The main benefits are:
- Isolation from other users, for consistent performance and security.
- No data privacy concerns: no data or credentials are ever sent to us.
- Better performance, since the data stays close to your applications.
- Full control over the network topology and server hardware.
- No cloud provider lock-in. You can move the data to another cloud provider at any time.
How to deploy on your own cloud? #
You can deploy ParkerDB on your own cloud with the following steps:
- Start a few ParkerDB instances on your cloud.
  - The total number of instances depends on the data size and query rate.
- Configure your load balancer to randomly distribute the queries to the ParkerDB instances.
Step-by-step guide #
0. Prerequisites #
- Email “support at parkerdb dot com” to get a 30-day temporary license file, providing the following information:
  - Your company name
  - Contact email
- Prepare the AWS credentials with S3 read access to the data files (see the verification example after this list).
  - The credentials are only used to list and read the data files.
  - The credentials are never sent to us.
- A few servers with Docker installed to run the ParkerDB instances.
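Before moving on, it can help to verify that the credentials can actually list the data files. A minimal check, assuming the AWS CLI is installed (the bucket path below is a hypothetical placeholder):
export AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY_ID
export AWS_SECRET_ACCESS_KEY=YOUR_SECRET_ACCESS_KEY
export AWS_DEFAULT_REGION=us-west-2
# List one of the table directories; replace the path with one of your own data paths
aws s3 ls s3://your-bucket/path/to/table1/
If the command prints the partition directories, the credentials have the read access ParkerDB needs.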
1. Prepare a configuration file #
Create a config.yaml file with the following content:
cluster_name: YOUR_CLUSTER_NAME
# Update this version if this configuration is updated
version: 1
tables:
  - name: Table1
    name_in_url: t1  # used in the query url
    primary_key: the_primary_key_column_name
    hive_directory_pattern: s3://your-bucket/path/to/table1/{date}/{hour}/
For the Hive directory layout, the date partition and hour partition should use the YYYY-MM-DD and HH formats respectively.
Here is the common placeholder syntax (an expansion example follows the table):
| Placeholder | Format |
|---|---|
| date | YYYY-MM-DD |
| hour | HH |
| year | YYYY |
| month | MM |
| day | DD |
| ds | YYYYMMDD |
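As an illustration, the hive_directory_pattern shown above would resolve to hourly directories such as the following (hypothetical dates and hours):
s3://your-bucket/path/to/table1/2024-05-01/08/
s3://your-bucket/path/to/table1/2024-05-01/09/
s3://your-bucket/path/to/table1/2024-05-02/00/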
Note: support for Iceberg, Delta Lake, and Hudi is planned but not yet available.
2. Start ParkerDB instances #
Start a few ParkerDB instances with the following command:
export HTTP_PORT=38250
export GRPC_PORT=37275
docker run \
-e AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY_ID \
-e AWS_SECRET_ACCESS_KEY=YOUR_SECRET_ACCESS_KEY \
-e AWS_DEFAULT_REGION=us-west-2 \
-e AWS_ENDPOINT_URL=object_storage_host:object_storage_port \
-e PARKER_GRPC_PORT=$GRPC_PORT \
-e PARKER_HTTP_PORT=$HTTP_PORT \
-e PARKER_HOST=$(hostname -i) \
-v /path/to/config.yaml:/etc/parkerdb/config.yaml \
-v /path/to/license.json:/etc/parkerdb/license.json \
-p ${HTTP_PORT}:8250 \
-p ${GRPC_PORT}:7275 \
parkerdb/parker
Note: for better security, you can put the AWS credentials in a file and use the --env-file option to pass them to the container.
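For example, a sketch of that approach (the file name aws.env is arbitrary):
# aws.env - keep it readable only by the deploy user, e.g. chmod 600 aws.env
AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY=YOUR_SECRET_ACCESS_KEY
AWS_DEFAULT_REGION=us-west-2
# Add AWS_ENDPOINT_URL=... here if you use a custom object storage endpoint
Then start the container with --env-file instead of passing the credentials with individual -e flags:
docker run \
--env-file ./aws.env \
-e PARKER_GRPC_PORT=$GRPC_PORT \
-e PARKER_HTTP_PORT=$HTTP_PORT \
-e PARKER_HOST=$(hostname -i) \
-v /path/to/config.yaml:/etc/parkerdb/config.yaml \
-v /path/to/license.json:/etc/parkerdb/license.json \
-p ${HTTP_PORT}:8250 \
-p ${GRPC_PORT}:7275 \
parkerdb/parker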
3. (Optional) Configure your load balancer #
Here are the health check endpoints:
- The HTTP health check endpoint is http://host:38250/healthz.
- The gRPC health check endpoint is host:37275.
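To check an instance directly, you can probe these endpoints by hand. A minimal sketch, assuming curl is available and, for the gRPC side, that the endpoint implements the standard grpc.health.v1 health protocol so the grpc-health-probe tool can be used:
# HTTP health check; a healthy instance returns HTTP 200
curl -i http://host:38250/healthz
# gRPC health check with grpc-health-probe
grpc_health_probe -addr=host:37275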
Configure your load balancer to randomly distribute the queries to the ParkerDB instances.
For example, you can use the following configuration file to run a load balancer with Nginx:
events {}  # required in a standalone nginx.conf

http {
    upstream grpc_backend {
        server parker-instance1:37275;
        server parker-instance2:37275;
        server parker-instance3:37275;
    }

    server {
        listen 7275 http2;

        # gRPC service requests
        location / {
            grpc_pass grpc://grpc_backend;
        }
    }
}
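If you also want to load-balance the HTTP endpoint through the same Nginx instance, a similar plain-HTTP upstream can be added. A sketch, assuming the instances expose their HTTP port on 38250 as in the docker run example above; these blocks go inside the same http { ... } context:
    upstream http_backend {
        server parker-instance1:38250;
        server parker-instance2:38250;
        server parker-instance3:38250;
    }

    server {
        # Port clients use to reach the HTTP API through the load balancer
        listen 8250;

        location / {
            proxy_pass http://http_backend;
        }
    }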