Everyone knows that if you’ve got big data, you need Apache Hadoop, right? It’s an affordable, horizontally scalable, clustered data processing platform ideal for data warehousing use cases. And it knocks the socks off classic relational database management systems like PostgreSQL that can barely keep up when playing with a terabyte of data, let alone a petabyte. Right? Well, maybe. Let’s look at PostgreSQL again and see what it can do.
How about a nice GPU?
Graphics Processing Units (GPUs) are the cornerstone of the Artificial Intelligence (AI) revolution. GPUs excel at parallel arithmetic operations and deliver a great deal of compute power at relatively low cost, making them ideal for compute-intensive (as opposed to I/O-intensive) operations like training artificial neural networks. They’re also pretty good for compute-intensive data processing workloads like data warehousing and analytics. When combined with vectorized operations and direct I/O libraries for NVMe SSDs, the sheer core density of a modern server-grade GPU can deliver roughly as much horsepower in a single 4U server chassis as a sizeable cluster of ordinary Hadoop servers.
What does this mean for PostgreSQL? Well, PostgreSQL is highly extensible by design, and one notable extension is HeteroDB’s PG-Strom. PG-Strom is an open source extension that enables PostgreSQL to take advantage of NVIDIA GPUs to accelerate complex, large-scale analytical queries. HeteroDB reckons PG-Strom can enable a single, scaled-up PostgreSQL database server with multiple GPUs installed to handle up to around 100TB of data with a query throughput of around 40GB/s.
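If you want to try it, the setup is a two-step dance: load the PG-Strom library at server start, then create the extension in your database. Here’s a minimal sketch following the install flow in the PG-Strom documentation (hardware-specific tuning omitted):

```sql
-- Load PG-Strom at server startup. shared_preload_libraries only takes
-- effect at startup, so restart PostgreSQL after running this.
ALTER SYSTEM SET shared_preload_libraries = '$libdir/pg_strom';

-- After the restart, register the extension in the target database.
CREATE EXTENSION pg_strom;
```

Once it’s loaded, the planner can offload qualifying scans, joins and aggregations to the GPU; in EXPLAIN output these show up as GPU-accelerated plan nodes such as GpuScan and GpuJoin.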
We’re not going to play the benchmarks game, but performance claims like these are pretty impressive compared with Hadoop, which would typically need around 28 servers banded together in a cluster to handle a similar workload, based on a typical setup of 20TB of storage media per host and 3x HDFS replication plus overhead. 28 decent-sized pieces of iron (modern Hadoop deployments typically expect 128-256GB of RAM and 16-32 CPU cores per host) are likely to come with a total cost of ownership somewhat higher than a single scaled-up box with a couple of NVIDIA Tesla GPUs, whether rented from your favourite compute cloud or purchased and run on-premises.
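For the back-of-the-envelope maths behind that figure: 100TB of logical data under 3x HDFS replication is 300TB of raw storage, and once you allow the customary headroom for shuffle space, temporary files and OS overhead (often 40-50% of capacity), you need somewhere in the region of 550TB of media, which at 20TB per host works out to roughly 28 machines before a single query has been run.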
Two can play that game
I hear what you’re saying: “I’ve got a bit north of a petabyte of data, not a bit south of 100 terabytes, and one PostgreSQL server just isn’t going to cut it for me, no matter how impressive the stats. And besides, the box might crash, and then what? My entire reporting and analytics infrastructure would be down.” Don’t worry: PostgreSQL has you covered. PostgreSQL 11 and later have fairly impressive native support for sharding through the table partitioning and foreign data wrapper features. Basically, you declare a partitioned table using the partitioning strategy of your choice (RANGE, LIST or HASH), then delegate its partitions to tables on remote PostgreSQL servers via the postgres_fdw foreign data wrapper.
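Here’s a minimal sketch of what that looks like in practice. The server, table and column names are hypothetical, and the partition tables are assumed to already exist on the shards:

```sql
-- On the head node: register a shard and map local credentials to it.
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

CREATE SERVER shard1 FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'shard1.internal', dbname 'warehouse');

CREATE USER MAPPING FOR CURRENT_USER SERVER shard1
    OPTIONS (user 'warehouse', password 'secret');

-- The parent table is hash-partitioned on device_id; each partition is
-- a foreign table living on a different shard (only one shown here).
CREATE TABLE measurements (
    device_id   bigint      NOT NULL,
    recorded_at timestamptz NOT NULL,
    payload     jsonb
) PARTITION BY HASH (device_id);

CREATE FOREIGN TABLE measurements_p0
    PARTITION OF measurements
    FOR VALUES WITH (MODULUS 4, REMAINDER 0)
    SERVER shard1
    OPTIONS (schema_name 'public', table_name 'measurements_p0');
```

Repeat the foreign table declaration for the remaining remainders, pointing each at its own shard, and the head node routes rows and queries to the right server automatically.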
One benefit of this approach is that clients don’t need to know anything about the location of the data; they simply query the table on the head node, which runs subqueries against the remote partitions and gathers the results, using predicate pushdown and other optimisations to execute the queries as efficiently as possible. This applies equally to INSERT and UPDATE statements, cross-table joins, correlated subqueries and so on. Another benefit is that all statements, including INSERT and UPDATE, are fully ACID (atomic, consistent, isolated and durable), even for distributed queries spanning multiple sharded database servers. That delivers a level of data immediacy that is difficult to achieve with solutions like Apache Hive.
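To see this in action against the hypothetical measurements table above, you can ask the planner what it actually ships to the shards:

```sql
-- Optional (PostgreSQL 11+): let the planner push the GROUP BY down to
-- each partition, so shards aggregate locally and only small partial
-- results travel back to the head node.
SET enable_partitionwise_aggregate = on;

-- With VERBOSE, each Foreign Scan node prints the "Remote SQL" it sends
-- to its shard; the WHERE clause is evaluated remotely, not centrally.
EXPLAIN (VERBOSE, COSTS OFF)
SELECT device_id, count(*)
FROM measurements
WHERE recorded_at >= TIMESTAMPTZ '2024-01-01'
GROUP BY device_id;
```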
“Yes, yes”, I hear you cry, “but incoming write operations would hit that head node too heavily, and running analytical read queries from the same node just isn’t feasible for my needs”. Well, PostgreSQL has some very nice replication capabilities too. With streaming replication, you can set up one or more replica servers. Replicas can optionally be synchronous – in other words, a COMMIT doesn’t return until the transaction has been confirmed by the participating replicas. With a combination of synchronous and asynchronous replication, perhaps including cascading relays, you should be able to deploy a configuration that fits your needs, where write operations go to the head node but read queries run on the replicas.
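As a sketch, with hypothetical host and replica names, synchronous commit against one of two standbys might be configured like this (the standbys themselves would first be cloned from the primary, e.g. with pg_basebackup):

```sql
-- On the primary: a COMMIT doesn't return until at least one of the two
-- named standbys confirms the transaction; the other stays asynchronous.
ALTER SYSTEM SET synchronous_standby_names = 'FIRST 1 (replica1, replica2)';
SELECT pg_reload_conf();

-- On each standby (PostgreSQL 12+): where to stream WAL from. The
-- application_name must match an entry in synchronous_standby_names.
-- (Reloadable since PostgreSQL 13; older versions need a restart.)
ALTER SYSTEM SET primary_conninfo =
    'host=head.internal user=replicator application_name=replica1';
SELECT pg_reload_conf();
```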
What’s more, in a high-concurrency setting, a well-tuned PostgreSQL setup can typically outperform data lake engines like Apache Hive and Spark SQL, which can start to struggle once more than a dozen or so queries are running concurrently.
Too Complicated By Half?
“Sounds complex to set up and maintain”, you say. Well, there are quite a few parameters to configure, and some significant considerations around high availability. Compared with the literally thousands of configuration and tuning parameters you need to understand in order to keep a Hadoop cluster stable and performant, though, a sharded PostgreSQL setup is considerably less complex.
“OK, but Hadoop has blob store connectivity, meaning I can store and query my data on a low-cost storage system like Amazon S3, Azure Blob Storage, or Google Cloud Storage.” Well, here you’ve got me. If you need a data warehousing solution that runs at the lowest possible storage cost, then you might indeed be better off with Apache Hive, Presto/Trino, or something similar. Furthermore, if you don’t need your data system to be always online, an ephemeral, cloud-based Apache Hadoop or Spark environment backed by your cloud provider’s blob storage platform might also make sound business sense for you.
However, if your goal is excellent performance and high availability, coupled with flexible scalability, all at a balanced price point, then a well-architected, sharded, big data PostgreSQL cluster running HeteroDB’s PG-Strom GPU extension might well present a good solution for you.