ParkerDB #

Looking to deliver processed data to your customers with ultra-low latency after crunching it in your data warehouse?

Traditional reverse ETL workflows can be complex and resource-intensive, often involving multiple steps:

  1. Read data from the data warehouse using a Spark job.
  2. Send the data to a Kafka topic to manage update throttling.
  3. Consume data from Kafka and store it in a database or Cassandra cluster.
  4. Access the data via a caching layer to reduce latency.

Each of these steps adds complexity, potential points of failure, and significant overhead.

ParkerDB simplifies this process by providing fast, scalable point lookups on big data tables, without the need for an extensive ETL pipeline.

Key Benefits #

  • Ultra-low latency and high concurrency.
    • A single moderate server can handle 20,000 queries per second with P99 latency under 1ms, all without a cache layer.
  • Fast and efficient data publishing.
    • Simply export your data warehouse tables in Parquet format for seamless integration.
  • Horizontal scalability.
    • Scale effortlessly by adding more servers to accommodate growing query volumes and larger tables.

Why is ParkerDB so fast? #

If you sort your warehouse table by the primary key and save it in Parquet format, ParkerDB builds its indexes in memory and serves queries with O(1) disk access.
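To make that concrete, here is a minimal sketch of the lookup path under that assumption; it is not ParkerDB's actual implementation, and the RowGroupEntry fields are hypothetical. Because the files are sorted by the primary key, one small in-memory entry per row group (its key range and byte range) is enough to find the single row group that can contain a key, so only one disk read is needed per lookup.

// A minimal sketch of the lookup path, not ParkerDB's actual code.
// Hypothetical per-row-group metadata kept in memory.
case class RowGroupEntry(minKey: Long, maxKey: Long, file: String, offset: Long, length: Long)

object SparseIndexSketch {
  // Binary search over row-group key ranges; `index` is sorted by minKey,
  // which holds because the table itself is sorted by the primary key.
  def locate(index: IndexedSeq[RowGroupEntry], key: Long): Option[RowGroupEntry] = {
    var lo = 0
    var hi = index.length - 1
    while (lo <= hi) {
      val mid = (lo + hi) / 2
      val rg  = index(mid)
      if (key < rg.minKey) hi = mid - 1
      else if (key > rg.maxKey) lo = mid + 1
      else return Some(rg) // exactly one candidate row group => one disk access
    }
    None
  }

  def main(args: Array[String]): Unit = {
    val index = IndexedSeq(
      RowGroupEntry(1, 1000, "part-00000.parquet", 4, 1 << 20),
      RowGroupEntry(1001, 2000, "part-00000.parquet", (1 << 20) + 4, 1 << 20),
      RowGroupEntry(2001, 3000, "part-00001.parquet", 4, 1 << 20)
    )
    println(locate(index, 1500)) // Some(RowGroupEntry(1001, 2000, part-00000.parquet, ...))
  }
}

The in-memory search is logarithmic, but the expensive part of a point lookup, the disk access, stays constant: exactly one row group is read per query.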

Try it out #

Prepare the parquet files #

Suppose your table has a primary key column id. Sort the table by id and write it out as Parquet, for example:

// Load the data into a DataFrame (replace with your source)
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/path/to/your/input.csv")

// Sort the DataFrame by 'id'
val sortedDf = df.orderBy("id")

// Write the sorted DataFrame to Parquet format
sortedDf.write.mode("overwrite").parquet("/path/to/output_directory")
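If your source is already a table registered in Spark's catalog rather than a CSV file, the same sorted Parquet layout can be produced directly. The table name below is only a placeholder.

// Export a catalog table as sorted Parquet; "analytics.customer_profile"
// is a placeholder table name.
import org.apache.spark.sql.functions.col

val warehouseDf = spark.table("analytics.customer_profile")

warehouseDf
  .repartitionByRange(col("id"))   // each output file covers a disjoint id range
  .sortWithinPartitions("id")      // rows inside every file stay sorted by id
  .write.mode("overwrite")
  .parquet("/path/to/output_directory")

This produces the same layout as orderBy("id"): Spark implements a global sort as a range repartition followed by a sort within each partition.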

Run the ParkerDB preview #

Mount the output directory into the container and expose port 8250:


# Mount the local directory with Parquet files into the container,
# map the container's port 8250 to the host, and start the ParkerDB
# preview image with id as the index column.
docker run \
    -v /path/to/output_directory:/data \
    -p 8250:8250 \
    parkerdb/parkerdb-preview \
    parker --index-column=id

Query the data #

In a new terminal, query the data with the following curl command:

curl "http://localhost:8250/query?id=1"

Benchmark #

To benchmark ParkerDB, you can use the parkbench tool. It sends a large number of queries to the ParkerDB server and measures latency and throughput. The query ids are read from a CSV file and randomized, so there is no caching effect.

Install #

Download the latest release from the releases page and decompress it.

Prepare the CSV file #

The input CSV of ids can be exported directly from the running ParkerDB preview server:

$ curl -o ids.csv http://localhost:8250/export_ids

Run #

$ parkbench -httpAddress localhost:8250 -csv ids.csv -idColumn id -concurrency 20
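For reference, the sketch below shows roughly what such a benchmark does internally; it is not parkbench's source, and it assumes ids.csv has a header row with the id in the first column. It shuffles the ids so repeated keys cannot help the server, fires concurrent GET requests, and reports throughput and P99 latency.

// A rough sketch of a point-lookup benchmark, not parkbench itself.
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}
import java.util.concurrent.Executors
import scala.io.Source
import scala.jdk.CollectionConverters._
import scala.util.Random

object MiniBench {
  def main(args: Array[String]): Unit = {
    // Assumes ids.csv has a header row and the id in the first column.
    val ids = Random.shuffle(
      Source.fromFile("ids.csv").getLines().drop(1).map(_.split(",")(0)).toVector
    )

    val client    = HttpClient.newHttpClient()
    val pool      = Executors.newFixedThreadPool(20) // concurrency = 20
    val latencies = new java.util.concurrent.ConcurrentLinkedQueue[Long]()

    val wallStart = System.nanoTime()
    val tasks = ids.map { id =>
      pool.submit(new Runnable {
        def run(): Unit = {
          val req = HttpRequest.newBuilder(URI.create(s"http://localhost:8250/query?id=$id")).build()
          val t0  = System.nanoTime()
          client.send(req, HttpResponse.BodyHandlers.ofString())
          latencies.add(System.nanoTime() - t0)
        }
      })
    }
    tasks.foreach(_.get())
    pool.shutdown()

    val wallSeconds = (System.nanoTime() - wallStart) / 1e9
    val sorted      = latencies.asScala.toArray.sorted
    val p99Ms       = sorted(math.min(sorted.length * 99 / 100, sorted.length - 1)) / 1e6
    println(f"queries: ${sorted.length}, throughput: ${sorted.length / wallSeconds}%.0f qps, P99: $p99Ms%.2f ms")
  }
}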

How can ParkerDB help you? #

The above is the preview version of ParkerDB, which is limited to a single server, a single table, and static data.

With the production version, we can help you achieve the following:

  • Data Partitioning and Replication: For large datasets, we help you partition the data and distribute it across multiple servers.
  • Multiple Tables: Refresh each table's data on its own schedule.
  • Ultra Low Latency: Bring-your-own-cloud (BYOC) deployment reduces network latency as much as possible.
  • Horizontal Scalability: Scale the query service horizontally with multiple servers.
  • Data Security: All your data files and credentials stay on your own servers.
  • Auto Update: Automated data publishing and query service management.

Contact us #

Email: support at parkerdb.com