Postgres In Production with Patroni & TimescaleDB — Our Experiences

Nilesh Jayanandana · 6 min read · Sep 4, 2019

By Nilesh Jayanandana and Rovin Shanila

Postgres is an excellent database solution, and recently the Platformer team had the opportunity to implement one in production for one of our larger clients. We thought we would write down our experiences with the stack and the decision-making process behind it, in the hope that this use case helps someone out there make better decisions.

Setting up the Cluster

Here are the requirements:

  • The database should be highly available.
  • The database should automatically fail over to a replica in case of a failure.
  • The database grows at about 50–80 GB per month and contains time series data.
  • Initial disk space would be 2 TB, to be expanded over the next couple of years.
  • Database audit logs should be enabled.

Let’s take these one by one.

Postgres Version

Unless your cluster's stakeholders have a specific requirement for the Postgres version, always go for the latest one. We used Postgres 11.

High Availability and Automatic Failover

By doing a bit of research, we found that Stolon and Patroni are the best tools out there for high availability and automatic failover. Both are excellent and have their own pros and cons: Stolon ships with its own proxy, while Patroni can be fronted by HAProxy or NGINX. We opted for Patroni over Stolon due to its simplicity and ease of use.

The reason to use a tool like Patroni or Stolon is the ability to fail over automatically from the master to a replica in case of a server issue.

Patroni is an open-source wrapper written in Python that handles tasks like replication, backups, and restores.

The above image explains the architecture we implemented. We used HAProxy as our on-prem load balancer and configured it to listen on ports 5000 and 5001 for write and read traffic respectively.

Patroni exposes a REST API that reports each node's current role (and the cluster's dynamic configuration). We used the /master endpoint as the health check that identifies the current master node and the /replica endpoint to identify the replicas.
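
As a quick sanity check, you can hit these endpoints directly. A minimal sketch, assuming Patroni's REST API is listening on its default port 8008 on each node:

# On the current master, /master returns HTTP 200; on a replica it returns 503.
curl -s -o /dev/null -w "%{http_code}\n" http://<ip-address-1>:8008/master
# On a replica, /replica returns HTTP 200; on the master it returns 503.
curl -s -o /dev/null -w "%{http_code}\n" http://<ip-address-1>:8008/replica

These are the same endpoints the HAProxy health checks below probe with option httpchk.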

If anyone is interested in learning about HAProxy, please read the article here on setting up and getting started with it. Here are the specific listen blocks we used in our HAProxy configuration.

listen postgres_write
    bind *:5000
    option httpchk OPTIONS /master
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server postgresql1 <ip-address-1>:5432 maxconn 100 check port 8008
    server postgresql2 <ip-address-2>:5432 maxconn 100 check port 8008
    server postgresql3 <ip-address-3>:5432 maxconn 100 check port 8008

listen postgres_read
    bind *:5001
    option httpchk OPTIONS /replica
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server postgresql1 <ip-address-1>:5432 maxconn 100 check port 8008
    server postgresql2 <ip-address-2>:5432 maxconn 100 check port 8008
    server postgresql3 <ip-address-3>:5432 maxconn 100 check port 8008

Apart from HAProxy, we also need a DCS (Distributed Configuration Store) such as etcd, Consul, or ZooKeeper so that Patroni can reliably run leader elections and perform automatic failover.
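
Once the DCS and Patroni are running, it is worth verifying that leader election actually works before going any further. A minimal sketch, assuming a three-node etcd cluster and a Patroni config at /etc/patroni/patroni.yml (adjust endpoints and paths to your setup):

# Check that the DCS itself is healthy (etcd v3 API).
ETCDCTL_API=3 etcdctl --endpoints=http://<ip-address-1>:2379,http://<ip-address-2>:2379,http://<ip-address-3>:2379 endpoint health
# Ask Patroni which member currently holds the leader role.
patronictl -c /etc/patroni/patroni.yml list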

Time Series Data and Table Partitioning

Postgres has no native way to automatically create new table partitions as a table grows; it has to be done manually. There are a few extensions that automate this, such as pg_partman, but we ran into a few issues with it and it did not deliver the performance we needed.

For IoT time series data, there is an excellent extension called TimescaleDB. You can read all about it here. TimescaleDB solved the problems we had: automatic table partitioning, consistent insert performance as the tables got bigger, and so on. However, we noticed significant slowness on updates to tables configured as TimescaleDB hypertables. This is mentioned in the docs, and since IoT time series data is usually not updated heavily, it was acceptable for us.
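
For completeness, converting a table into a TimescaleDB hypertable is a one-off step per table. A minimal sketch, assuming a hypothetical database named iot and an empty sensor_data table with a timestamp column called time (a table that already holds data needs migrate_data => true):

# Enable the extension once per database.
sudo -u postgres psql -d iot -c "CREATE EXTENSION IF NOT EXISTS timescaledb;"
# Turn the table into a hypertable, chunked by time.
sudo -u postgres psql -d iot -c "SELECT create_hypertable('sensor_data', 'time', chunk_time_interval => INTERVAL '7 days');"

After that, applications keep inserting into sensor_data as usual and TimescaleDB creates and routes to chunks automatically.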

Performance Tuning

TimescaleDB has an excellent tool called timescaledb-tune, which allowed us to tune Postgres for the best performance on the infrastructure specs we had. PGTune (here) also helped us get an idea of how our postgresql.conf file should look.
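
A minimal sketch of how it can be invoked, with placeholder hardware values and a Debian-style config path; adjust both to your installation:

# Suggest postgresql.conf settings for the given memory and CPU count.
timescaledb-tune --memory=32GB --cpus=8 --conf-path=/etc/postgresql/11/main/postgresql.conf
# Review the suggested changes, then restart Postgres (through Patroni, not systemd) to apply them.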

Tablespaces and Flexible Disk Sizes

We decided to mount multiple disks into the system and use tablespaces in Postgres to write data to those disks. Tablespaces allow us to write table or database data to a specific location in the Linux filesystem. However, simply attaching physical disks, mounting them, and assigning tablespaces to them was not going to cut it, because their sizes would be fixed. Instead, we used LVM (Logical Volume Manager) in Linux: we pooled all the physical disks into a volume group, carved it into any number of logical volumes, and mounted those in the filesystem. This gave us flexible disk mounts whose sizes can be changed as needed. You can read more about LVM here.
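
Roughly, the flow looks like this. A sketch under stated assumptions: two example disks /dev/sdb and /dev/sdc, a mount point /data/pg_ts1, and a hypothetical tablespace named ts_data1; adapt device names and sizes to your environment:

# Pool the physical disks into one volume group.
sudo pvcreate /dev/sdb /dev/sdc
sudo vgcreate pgdata_vg /dev/sdb /dev/sdc
# Carve out a logical volume, format it, and mount it (it can be grown later with lvextend).
sudo lvcreate -L 500G -n pg_ts1 pgdata_vg
sudo mkfs.ext4 /dev/pgdata_vg/pg_ts1
sudo mkdir -p /data/pg_ts1 && sudo mount /dev/pgdata_vg/pg_ts1 /data/pg_ts1
sudo chown postgres:postgres /data/pg_ts1
# Point a tablespace at the mount and place a table on it.
sudo -u postgres psql -c "CREATE TABLESPACE ts_data1 LOCATION '/data/pg_ts1';"
sudo -u postgres psql -d iot -c "CREATE TABLE sensor_archive (LIKE sensor_data INCLUDING ALL) TABLESPACE ts_data1;"

Keep in mind that with streaming replication the same tablespace paths have to exist on every node in the cluster.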

Monitoring and Logging

Prometheus was our main monitoring and alerting tool. We installed node-exporter on all the VMs and configured Prometheus to scrape metrics at regular intervals. We also used postgres-exporter to extract Postgres-level metrics, and the pg_stat_statements extension to measure and monitor query performance in our databases. We enabled logging by following this excellent article by the TablePlus folks here.
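
Enabling pg_stat_statements takes a config change plus an extension. A minimal sketch, assuming Postgres 11 (where the timing column is still called total_time) and the hypothetical iot database from earlier:

# In postgresql.conf (or the Patroni-managed parameters), preload the module and restart Postgres:
#   shared_preload_libraries = 'pg_stat_statements,timescaledb'
# Then create the extension and look at the most expensive queries.
sudo -u postgres psql -d iot -c "CREATE EXTENSION IF NOT EXISTS pg_stat_statements;"
sudo -u postgres psql -d iot -c "SELECT calls, total_time, mean_time, left(query, 60) AS query FROM pg_stat_statements ORDER BY total_time DESC LIMIT 10;"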

Backup and Restore

We do daily VM backups and a weekly pg_basebackup to recover from major failures. We advise you to chaos-engineer your cluster before going into production: break some things and restore from your backups just to make sure they actually work. Barman is another great tool you can use for DR purposes.
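
For the weekly base backup, a minimal sketch of the kind of command involved, assuming a replication user named replicator and a backup volume mounted at /backups (both placeholders):

# Take a full, compressed base backup from the current master through HAProxy's write port.
pg_basebackup -h <haproxy-ip> -p 5000 -U replicator -D /backups/base_$(date +%F) -Ft -z -Xs -P

Restores are the part worth rehearsing: restore a backup onto a scratch VM and confirm the instance starts and the data is readable before you trust the routine.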

Troubleshooting and Issues

WAL fills up Disk

One of the most common issues we encountered was WAL filling up the disk. Even if you have set max_wal_size in your configuration, WAL will pile up while a replica is down until it comes back up. So leave plenty of headroom on the disk where pg_wal is stored, because if it fills up completely, the Postgres process will go down. As a safety measure, we recommend keeping a couple of dummy files on the disk that holds the pg_wal folder; if it fills up, you can delete those dummy files to buy time. The following command creates a 10 GB dummy file.

sudo dd if=/dev/zero of=/root/ONLY_DELETE_THIS_DUMMY_FILE_IN_A_POSTGRES_EMERGENCY bs=1MB count=10000

If you don’t have a dummy file and your disk has filled up with pg_wal, you are in a bit of a fix. What you need to do then is mount a disk with sufficient space, move the pg_wal folder to that disk, create a Linux symlink from the original pg_wal path to the moved folder, and start the database. This will temporarily get your database up, but you still have to find and fix the real reason your WAL is growing out of control. We have only seen this happen when one of the replicas goes down.
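
For reference, the emergency procedure looks roughly like this. A sketch only, assuming a Debian-style data directory at /var/lib/postgresql/11/main and an emergency disk mounted at /mnt/emergency:

# Stop Patroni/Postgres first so nothing writes to pg_wal while it is being moved.
sudo systemctl stop patroni
# Move pg_wal to the emergency disk and leave a symlink in its place.
sudo mv /var/lib/postgresql/11/main/pg_wal /mnt/emergency/pg_wal
sudo ln -s /mnt/emergency/pg_wal /var/lib/postgresql/11/main/pg_wal
sudo chown -h postgres:postgres /var/lib/postgresql/11/main/pg_wal
# Bring the node back up, then fix the root cause of the WAL build-up.
sudo systemctl start patroni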

Split Brain

Here’s another funny scenario. If you haven’t disabled the postgres systemd service on the nodes running Patroni, you may run into a split-brain issue. It happens like this: node 1 is the master, node 1's Patroni goes down for some reason, and node 2 is elected master; meanwhile the systemd postgres service kicks in on node 1, which therefore also acts as a master for a short period. You now have a master-master situation, which we call split brain. Because of this, node 1, the original master, ends up on a different timeline. (Recall the timeline explanation by the Ancient One in Avengers: Endgame.) So when node 1 later rejoins as a replica, its timeline does not match that of the new master on node 2, and the replica crashes. As a solution we tried pg_rewind, but it didn't give the results we expected, so we had to do a patronictl reinit, which re-initialized the replica database.
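
In practice the fix had two parts: make sure systemd can never start Postgres behind Patroni's back, and rebuild the broken replica. A sketch, assuming a Debian-style postgresql unit name and a hypothetical Patroni scope (cluster name) of pg-cluster:

# Patroni must be the only thing that starts Postgres on these nodes, so keep the distro service disabled.
sudo systemctl disable postgresql
# Re-initialize the broken replica from the current master (this wipes its data directory and re-clones it).
patronictl -c /etc/patroni/patroni.yml reinit pg-cluster postgresql1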

Conclusion

Postgres is an excellent database choice, and using Patroni to orchestrate it is fairly easy compared to other HA solutions for databases like MySQL. That being said, if your company can afford EnterpriseDB (the enterprise version of Postgres), go for it; it will make your life as a database administrator much easier.
