Bring Your Own Cloud

Deploy to your own cloud? #

Running ParkerDB in your own cloud gives you full control over your data and your infrastructure.

The obvious benefits are:

  • Isolation from other users, which ensures the best performance and security.
  • No data privacy concerns: no data or credentials are sent to us.
  • Better performance, since the data sits closer to your applications.
  • Full control over the network topology and server hardware.
  • No cloud provider lock-in: you can move the data to another cloud provider at any time.

How to deploy on your own cloud? #

You can deploy ParkerDB on your own cloud with the following steps:

  • Start a few ParkerDB instances on your cloud.
    • The total number of instances depends on the data size and query rate.
  • Configure your load balancer to distribute queries across the ParkerDB instances.

Step-by-step guide #

0. Prerequisites #

  • Email “support at parkerdb dot com” to get a 30-day temporary license file, providing the following information:
    • Your company name
    • Contact email
  • Prepare the AWS credentials with S3 read access to the data files (a quick way to verify the access is shown after this list).
    • The credentials are only used to list and read the data files.
    • The credentials are never sent to us.
  • A few servers with Docker installed to run the ParkerDB instances.
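
To sanity-check the S3 access before starting any instances, you can list a table's data directory with the AWS CLI (a minimal sketch; the bucket path is a placeholder and the AWS CLI is assumed to be installed):

export AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY_ID
export AWS_SECRET_ACCESS_KEY=YOUR_SECRET_ACCESS_KEY
aws s3 ls s3://your-bucket/path/to/table1/ --region us-west-2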

1. Prepare a configuration file #

Create a config.yaml file with the following content:

cluster_name: YOUR_CLUSTER_NAME
# Update this version if this configuration is updated
version: 1
tables:
  - name: Table1
    name_in_url: t1                                  # used in the query url
    primary_key: the_primary_key_column_name
    hive_directory_pattern: s3://your-bucket/path/to/table1/{date}/{hour}/

For the Hive directory layout, the date and hour partitions should use the YYYY-MM-DD and HH formats, respectively. The supported placeholders are:

Placeholder   Description
date          YYYY-MM-DD
hour          HH
year          YYYY
month         MM
day           DD
ds            YYYYMMDD
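
For example, with the hive_directory_pattern from the configuration above, the partition for 2024-05-01 at hour 13 (illustrative values) resolves to the following S3 directory:

s3://your-bucket/path/to/table1/{date}/{hour}/
  -> s3://your-bucket/path/to/table1/2024-05-01/13/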

TODO: Add support for Iceberg, Delta Lake, and Hudi.

2. Start ParkerDB instances #

Start a few ParkerDB instances with the following command:

# Host ports to publish for the HTTP and gRPC endpoints
export HTTP_PORT=38250
export GRPC_PORT=37275
docker run \
  -e AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY_ID \
  -e AWS_SECRET_ACCESS_KEY=YOUR_SECRET_ACCESS_KEY \
  -e AWS_DEFAULT_REGION=us-west-2 \
  -e AWS_ENDPOINT_URL=object_storage_host:object_storage_port \
  -e PARKER_GRPC_PORT=$GRPC_PORT \
  -e PARKER_HTTP_PORT=$HTTP_PORT \
  -e PARKER_HOST=$(hostname -i) \
  -v /path/to/config.yaml:/etc/parkerdb/config.yaml \
  -v /path/to/license.json:/etc/parkerdb/license.json \
  -p ${HTTP_PORT}:8250 \
  -p ${GRPC_PORT}:7275 \
  parkerdb/parker

Note: for better security, you can put the AWS credentials in a file and pass them to docker run with the --env-file option, as shown below.
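
For example (a minimal sketch; the file name aws.env is arbitrary, and the values are the same placeholders as above):

cat > aws.env <<'EOF'
AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY=YOUR_SECRET_ACCESS_KEY
AWS_DEFAULT_REGION=us-west-2
AWS_ENDPOINT_URL=object_storage_host:object_storage_port
EOF
chmod 600 aws.env

docker run --env-file aws.env \
  -e PARKER_GRPC_PORT=$GRPC_PORT \
  -e PARKER_HTTP_PORT=$HTTP_PORT \
  -e PARKER_HOST=$(hostname -i) \
  -v /path/to/config.yaml:/etc/parkerdb/config.yaml \
  -v /path/to/license.json:/etc/parkerdb/license.json \
  -p ${HTTP_PORT}:8250 \
  -p ${GRPC_PORT}:7275 \
  parkerdb/parker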

3. (Optional) Configure your load balancer #

Here are the health check endpoints:

  • The HTTP health check endpoint is http://host:38250/healthz (see the curl example below).
  • The gRPC health check endpoint is host:37275.
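
For example, you can check that an instance is healthy before adding it to the load balancer (parker-instance1 is a placeholder hostname; curl -f exits with a non-zero status on HTTP error responses):

curl -f http://parker-instance1:38250/healthz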

Configure your load balancer to distribute queries across the ParkerDB instances.

For example, you can use the following Nginx configuration to load-balance gRPC queries across three instances:

# Minimal nginx.conf; the events block is required even if left empty
events {}

http {
    upstream grpc_backend {
        # ParkerDB gRPC endpoints
        server parker-instance1:37275;
        server parker-instance2:37275;
        server parker-instance3:37275;
    }

    server {
        listen 7275 http2;
        # Forward gRPC requests to the ParkerDB instances
        location / {
            grpc_pass grpc://grpc_backend;
        }
    }
}
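
The configuration above only load-balances gRPC traffic. If your clients use the HTTP endpoint instead, a similar upstream can be added inside the same http block (a minimal sketch; the instance names match the example above):

    upstream http_backend {
        server parker-instance1:38250;
        server parker-instance2:38250;
        server parker-instance3:38250;
    }

    server {
        listen 8250;
        location / {
            proxy_pass http://http_backend;
        }
    }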