Load Data at GiB/s

Paint Slowly Dripping

A sad fact — I have spent a significant amount of my life waiting for data, either mine or someone elses, to be loaded into a database. And it’s often not much fun, because of a baseline anxiety around whether it will finish on time, or even whether it will finish at all. But it doesn’t have to be that way — with the right setup, even enormous data loads can be fast — dare I say thrilling.

In this article I’ll demonstrate how to load more than one terabyte of data into SingleStore’s managed service in under 20 minutes. Since it’s after work, I’ll put my customer hat on and use an ordinary trial account with a setup fully available for anyone to follow along. Sign-up today!

⚠️💲⚠️ Warning: If you run this from within the same cloud provider as the SingleStore Cluster, you might be charged a cross-AZ data transfer fee. For 1 TiB, this fee could exceed $10 (it did for me!). To avoid the fee, and actually achieve better performance, take a look at the S3 Trampoline technique in my next article. Another way to avoid the fee is to run the client outside of AWS on a client with a fast unmetered connection.

Why Is Loading Data Quickly So Hard?

The overall transfer progresses at the speed of the slowest component involved. I’ve seen slow data loads due to:

Source
1. Slow disk
2. Insufficient CPU
3. Source software unable to saturate available resources
Network
1. Insufficient throughput
2. High latency
3. Lossy
Target
1. Slow disk
2. Insufficient CPU
3. Target software unable to saturate available resources
Protocol
1. Too much overhead
2. Too sensitive to latency

Any one of these problems can reduce the overall load throughput to a low level. All components must be of proportionate size and work together to achieve peak performance.

Push or Pull?

SingleStore supports two main ways to load data:

Pushing the data, through SQL write statements.
Pulling the data, through SingleStore Pipelines

When pushing, a SQL client connects to an aggregator node in a SingleStore cluster. It issues a write command in SQL which is then forwarded from the aggregator to the appropriate leaf nodes.

When pulling, the SingleStore leaf nodes initiate the connection to some data source, such as S3 or a Kafka topic. The leaf nodes download the data directly.

Generally, using Pipelines is the best way to get data into SingleStore. However, when your data is not accessible from the SingleStore cluster, or when it’s in an unusual format, or dynamically-generated, you’ll need a push-based approach.

In this article I’ll be using dynamically-generated data from a test program, so it can’t be loaded through Pipelines. However just because it’s so cool, I’ve included an example of loading data from S3 using Pipelines at the end of the article.

The Setup

The actual size of data depends on its representation. To make sure the “at least one terabyte” goal of the article is completely and inarguably met, I will transfer at least 2^31 rows (~2 billion), each containing (among other fields) 512 bytes of incompressible random data. That means the actual data transferred and stored can be no less than one terabyte.

In practice the data transferred will be greater, as I will add an additional documentID field, there is protocol overhead, etc. Let’s say the actual amount transferred will definitely be less than 2 TiB. To load 2 TiB in under 20 minutes, the data rate will need to be around 13.7 Gbps, or ~1.7 GiB/s. This is the base rate I’ll aim for with every link in the chain.

Source

My home internet is not anywhere close to 13.7 Gbps, so the source machine will be in AWS. AWS has some nice network-optimized instances, and for this test I’ll use a c5n.9xlarge. I was pleased to discover this instance type with its crazy network bandwidth:

Stat	Value
Model	c5n.9xlarge
vCPU	36
RAM	96 GiB
Network	50 Gbps

With an efficient program, I should be able to generate and push 13.7 Gbps from this machine.

Network

The network path itself will all be within the AWS Oregon data center and so will be low-latency with more than enough bandwidth.

Target

As a target I’ll use SingleStore’s S-16 instance size. This has 128 vCPUs, roughly proportionate 4x my source system, and 1 TiB of memory, large enough to almost hold the entire data set in RAM. I’ll trust that SingleStore’s managed service network has enough bandwidth to handle the load.

This instance size seems expensive when my customer hat is on, but since the load will complete in under 20 minutes, I can suspend it when done after only a few credits from my free trial have been consumed.

Protocol

The most efficient way to push bulk data into SingleStore is with a mechanism called LOAD DATA LOCAL INFILE. This protocol is highly optimized in SingleStore and is also more efficient over-the-network than individual insert statements. SingleStore supports a number of formats for LOAD DATA LOCAL INFILE. The CSV format in particular is easy to write, efficient, and flexible.

You can generally make LOAD DATA LOCAL INFILE push data from STDIN or a file, but it’s also possible with many clients to push data from an in-proc source, which is what I’ll do in this case.

Dry Run - Load 610 MiB into SingleStore Free Tier in 10 seconds.

The ‘megapush’ program I wrote for this test generates random data and pushes it into a SingleStore database using LOAD DATA LOCAL INFILE.

This for example will load about 610 MiB of random payload, across 10 million rows, into a table called “pushed”:

./main \
  8 \
  myendpoint.aws-virginia-6.svc.singlestore.com \
  3333 \
  jason-tmp \
  SrO1P3kCjumi1ZtRTlzT2cCqrsvnOVFY \
  jtdb \
  pushed \
  10000000 \
  64

Connected to SingleStore 8.7.3
Creating target table
Generating test data locally (filling buffers)
32/32
Start time:  2024-07-09 16:49:56
 610 /  610 MiB 4m20s
End time:  2024-07-09 16:54:15
Elapsed time:  4m19.579773848s
done

Uhg — four minutes is not ten seconds. The slow link in this case is my home XFinity. Here’s the same program run on a c5n.9xlarge in the same data center as the SingleStore Free Tier:

sudo yum install -y git
sudo yum install -y golang
git clone https://github.com/jasonthorsness/megapush.git
cd megapush
go build main.go
(same command as above)

Connected to SingleStore 8.7.3
Creating target table
Generating test data locally (filling buffers)
32/32
Start time:  2024-07-09 23:54:36
 610 /  610 MiB 10s
End time:  2024-07-09 23:54:46
Elapsed time:  9.95351574s
done

That’s more like it! Less than 10 seconds. Now let’s try with some real power.

And Now For The Real Deal

I provisioned a fresh c5a.9xlarge and S-16 in AWS Oregon, and ran the test.

./main \
  64 \
  svc-be85035f-64c7-4ab9-b44b-f19867e070dc-dml.aws-oregon-3.svc.singlestore.com \
  3306 \
  admin \
  aFtD0TcrolJUkmFQj3IUacjEURqm2K3J \
  test \
  pushed \
  2147483648 \
  512

The 2 billion rows / 1 TiB of test data loaded in just 18 minutes 21 seconds at around ~1 GiB/s.

Connected to SingleStore 8.7.4
Creating target table
Generating test data locally (filling buffers)
256/256
Start time:  2024-07-09 23:04:23
1024 / 1024 GiB
End time:  2024-07-09 23:22:44
Elapsed time:  18m21.450507735s
done

Overall this is a “long coffee break” sort of wait, rather than the “hope for the best and come back in a few days” waits that haunt my past. Hopefully this setup provides a useful reference for others looking to load data quickly into SingleStore.

Appendix: SingleStore Pipelines Example

I mentioned earlier I would include an example of pulling data using Pipelines. Here is the SQL needed to read bulk stock data from the S3 compatible API at polygon.io. Execute this and your data will be available almost immediately:

CREATE TABLE stocks_min(
  localTS AS CONVERT_TZ(
    FROM_UNIXTIME(window_start / 1000000000),'UTC','America/New_York')
    PERSISTED DATETIME(6) NOT NULL,
  localDate AS localTS PERSISTED DATE NOT NULL,
  ticker LONGTEXT NOT NULL,
  volume BIGINT NOT NULL,
  open DOUBLE NOT NULL,
  close DOUBLE NOT NULL,
  high DOUBLE NOT NULL,
  low DOUBLE NOT NULL,
  window_start BIGINT NOT NULL,
  transactions BIGINT NOT NULL,
  INDEX (ticker),
  SORT KEY (localDate, ticker, localTS),
  SHARD KEY(ticker));

CREATE PIPELINE stocks_min_pipeline_2024 AS
LOAD DATA S3 's3://flatfiles/us_stocks_sip/minute_aggs_v1/2024/*/*.csv.gz'
CONFIG '{"region":"us-east-1", "endpoint_url": "https://files.polygon.io"}'
CREDENTIALS '{"aws_access_key_id": "ACCESS_KEY_ID",
               "aws_secret_access_key": "SECRET_ACCESS_KEY"}'
INTO TABLE stocks_min
FIELDS TERMINATED BY ',' IGNORE 1 LINES;
START PIPELINE stocks_min_pipeline_2024;

If your data is already in a supported format and location, pulling it with SingleStore Pipelines is the way to go.