
Benchmark: the fastest way to import data into PostgreSQL


There are many ways to load data into PostgreSQL. I will benchmark each of them, but only cover loading data into a "remote" PostgreSQL server, to get close to a real production environment. So I am not testing direct imports from a file that already sits on the PostgreSQL server itself.

The ELT (Python) host and the PostgreSQL host are separate servers with identical specs, connected over a private network in the same region.

Dedicated CPU server specs:

  • CPU: 2 vCores (AMD EPYC-Milan Processor, 2.3 GHz)
  • Memory: 8 GB
  • Disk: SSD (average I/O speed: 1228.8 MB/s)

ELT:

  • Python: 3.13.5
  • Packages: psycopg(3.2.10)

Postgresql:

  • Docker: PostgreSQL 18.0 (trixie)

For the test dataset I will use public data from https://excelbianalytics.com/wp/downloads-18-sample-csv-files-data-sets-for-testing-sales, with 1k, 10k, 100k, 1m, and 5m rows.

Standard Insert Statement

Our first test uses a standard insert statement. This also shows how fast PostgreSQL is when used for daily OLTP operations. I also enabled the "pipeline" feature of psycopg 3 as a form of batch commit, to reduce the number of round trips and commit events in PostgreSQL.

sql = "INSERT INTO sales VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"
async with conn.pipeline():
    for row in rows:
        await cur.execute(sql, row)

*Times in seconds.

| Rows | 1st | 2nd | 3rd | 4th | 5th | Avg | Min | Max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1k | 0.14 | 0.14 | 0.16 | 0.14 | 0.14 | 0.14 | 0.14 | 0.16 |
| 10k | 0.94 | 0.92 | 1.11 | 0.92 | 0.93 | 0.96 | 0.92 | 1.11 |
| 100k | 9.03 | 9.03 | 9.50 | 9.08 | 8.86 | 9.10 | 8.86 | 9.50 |
| 1m | 92.36 | 89.95 | 92.89 | 87.49 | 90.38 | 90.61 | 87.49 | 92.89 |
| 5m | 453.05 | 449.69 | 455.90 | 441.97 | 444.86 | 449.09 | 441.97 | 455.90 |

Insert using Executemany

With the executemany method we can insert a batch of rows with a single insert statement. This is also called a bulk insert; in the source code I build a chunk of 1k rows at a time.

sql = "INSERT INTO sales VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"
await cur.executemany(sql, rows)  # rows: a chunk of up to 1k tuples

*Times in seconds.

| Rows | 1st | 2nd | 3rd | 4th | 5th | Avg | Min | Max | Speedup vs insert |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1k | 0.18 | 0.12 | 0.12 | 0.18 | 0.15 | 0.15 | 0.12 | 0.18 | 0.94x |
| 10k | 0.92 | 0.75 | 0.76 | 0.83 | 0.77 | 0.81 | 0.75 | 0.92 | 1.20x |
| 100k | 7.00 | 7.11 | 7.35 | 7.27 | 7.13 | 7.17 | 7.00 | 7.35 | 1.27x |
| 1m | 70.37 | 71.18 | 72.03 | 71.52 | 70.80 | 71.18 | 70.37 | 72.03 | 1.27x |
| 5m | 361.02 | 357.25 | 356.23 | 359.93 | 355.79 | 358.05 | 355.79 | 361.02 | 1.25x |

Insert using Copy

This method is typically used in ELT batch processing; it relies on PostgreSQL's COPY feature to load data quickly from a string or bytes object. The fastest variant uses the "binary" format, but it has a disadvantage: the data types and lengths you provide must exactly match the destination table schema, or the load will fail. Decimal values also lose their precision (e.g. a source value of 1.00 (decimal) becomes 1 (integer) in the binary object), which again makes the load fail. So I suggest using the "text" CSV format with the SQL COPY command, and double-quoting string fields.

Batch Copy
*Times in seconds.

| Rows | 1st | 2nd | 3rd | 4th | 5th | Avg | Min | Max | Speedup vs insert | Speedup vs executemany |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1k | 0.04 | 0.04 | 0.03 | 0.03 | 0.04 | 0.04 | 0.03 | 0.04 | 4.03x | 4.30x |
| 10k | 0.14 | 0.14 | 0.14 | 0.14 | 0.14 | 0.14 | 0.14 | 0.14 | 6.90x | 5.76x |
| 100k | 1.17 | 1.16 | 1.15 | 1.14 | 1.16 | 1.16 | 1.14 | 1.17 | 7.87x | 6.21x |
| 1m | 11.44 | 11.07 | 11.64 | 11.41 | 11.08 | 11.33 | 11.07 | 11.64 | 8.00x | 6.28x |
| 5m | 57.93 | 57.34 | 56.16 | 58.44 | 56.51 | 57.28 | 56.16 | 58.44 | 7.84x | 6.25x |

Copy whole data
*Times in seconds.

| Rows | 1st | 2nd | 3rd | 4th | 5th | Avg | Min | Max | Speedup vs insert | Speedup vs executemany |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1k | 0.03 | 0.03 | 0.03 | 0.03 | 0.03 | 0.03 | 0.03 | 0.03 | 5.27x | 4.93x |
| 10k | 0.07 | 0.07 | 0.07 | 0.07 | 0.07 | 0.07 | 0.07 | 0.07 | 12.00x | 4.36x |
| 100k | 0.23 | 0.24 | 0.28 | 0.24 | 0.31 | 0.26 | 0.23 | 0.31 | 27.41x | 34.79x |
| 1m | 1.82 | 1.84 | 1.80 | 1.79 | 1.83 | 1.81 | 1.79 | 1.84 | 39.24x | 49.95x |
| 5m | 9.17 | 9.28 | 9.33 | 9.11 | 8.72 | 9.12 | 8.72 | 9.33 | 39.25x | 49.23x |

Verdict

Each data-load method I tested has its own benefits and purpose. The standard insert statement still performs well for small loads (e.g. < 1k rows). But when you run ELT from multiple production tables, it loads data more slowly and uses more PostgreSQL resources.

If your data is larger than 1k rows, or your ELT code seems to run slower than expected, I encourage you to use executemany to make it faster than a standard insert statement.

The last option is SQL COPY, which is much faster than executemany even for the same number of rows. And with AWS Redshift you don't have to read a local CSV file like in regular PostgreSQL: just pass an AWS S3 file path to the COPY command and Redshift will read the file directly from S3 and load the data into the table.

Another key point when loading big data is to create chunked batches, so database resources are used efficiently for writing and committing into PostgreSQL. Set your ELT to transfer at most 10k rows per batch, so that neither the source database nor the data warehouse gets too heavy when running multiple ELT jobs.

The source code for this test is on my GitHub.
Thanks for reading!