postgresqlpatronietcdhigh-availabilityhaproxyansible

Building a PostgreSQL High-Availability (HA) Cluster with Patroni

A hands-on Patroni + etcd + HAProxy guide written so that even first-timers to database redundancy can follow along — from the core concepts through automatic failover, synchronous replication, and air-gapped installs.

Data DynamicsJune 22, 202627 min read

If you only have a single database server, the moment that server stops, your whole service stops with it. This article explains — step by step, in plain prose — how to build database redundancy, from the underlying concepts to an actual deployment, so that even readers who have never dealt with redundancy before can follow it to the end. The focus is less on memorizing commands and more on understanding why we do things this way. Take your time reading through it.

📦 Source code (public GitHub repo): github.com/DataDynamics-OSS/patroni-postgresql-ha

The entire Ansible playbook and the air-gapped install scripts covered in this article are all open source. You can clone it and follow along right away.
git clone https://github.com/DataDynamics-OSS/patroni-postgresql-ha.git

The setup described here is published as the single open-source repository above, and one Ansible playbook installs the entire cluster.

What are we trying to build?

We want to scale the database out to several machines so that even if one of them dies, the rest immediately take over its role. This is commonly called High Availability (HA), or simply redundancy.

But standing up several servers does not, by itself, make redundancy complete. Only when you can answer the following questions do you really have usable redundancy:

At this very moment, can everyone agree on who the real leader (the Primary that accepts writes) is?
When the leader suddenly dies, is the next leader chosen automatically?
Can the application keep connecting without changing its address even when the leader changes?

To answer these three questions, this project uses Ansible to install and configure a set of proven open-source tools in one shot. The tools are:

Tool	One-line role
PostgreSQL	The database engine itself, where the actual data lives
Patroni	The cluster controller that decides which node is the leader (Primary) and automatically elects a new one when it dies
etcd	A distributed key-value store (DCS) that holds cluster state. The current leader is recorded here
PgBouncer	A "connection broker" that pools many connections and forwards them efficiently
HAProxy	A "traffic cop" that routes incoming requests to the current leader
Keepalived	An "address manager" that attaches a fixed connection address (VIP) to a live server

The big picture

Let's start with the overall design. When the application connects to a single fixed address (the VIP) to read and write data, behind that address several components cooperate to always forward requests to the live leader.

Loading diagram…

There are two things to remember from this diagram. First, who the leader is is not decided among the PostgreSQL nodes themselves, but through the distributed key-value store called etcd. Second, the application always connects to the same VIP address, even when the leader changes. HAProxy takes care of the routing.

What happens when something fails

Saying "automatic failover" in words alone doesn't quite land, so let's walk through the moment the leader server suddenly dies, step by step.

Loading diagram…

The key point is that without anyone having to wake up at 3 a.m. and intervene by hand, a new leader is usually chosen and traffic shifts automatically within a few seconds to a few tens of seconds.

There are several ways to make PostgreSQL redundant besides Patroni. Even so, this project chose Patroni for clear reasons. Let's unpack them, starting with what matters most from a beginner's perspective.

Truly automatic failover that requires no human intervention. Older approaches (for example, running repmgr manually, or hand-written scripts) often still required a person to step in when a failure occurred. Patroni uses a consensus store — etcd — as the referee, and safely elects a new leader without any human involvement.
It structurally prevents the "both sides insist they are the leader" accident (split-brain). The scariest accident in redundancy is when two servers simultaneously claim "I am the leader" and each accepts writes. The data diverges, and recovery becomes extremely difficult. Through etcd's quorum (majority consensus), Patroni enforces the rule that "only the single node holding the leader key in the store is the leader," so this accident is structurally hard to occur.
Configuration is applied consistently across the whole cluster. When you edit PostgreSQL settings separately on each node, they drift apart and cause incidents. Patroni stores the important settings as a single copy in etcd and applies them identically to all nodes. Changing settings is a single patronictl edit-config away.
It ships with an operational tool (patronictl) and a status REST API out of the box. You can immediately answer questions like "who is the leader right now?" or "how much replication lag is there?", and tools like HAProxy can query health status through this REST API (/primary, /replica). Our HAProxy configuration takes advantage of exactly this.
It fits cloud-native environments well and has an active community. Several PostgreSQL operators on Kubernetes use Patroni internally, so it has become close to a de facto standard. There is plenty of material, which makes problems easy to solve by searching.

In short, the four reasons for choosing Patroni are that it is automatic, safe, standard, and easy to operate.

Software version requirements

When versions don't line up, they trip you up in small places. The tables below list the versions this playbook validates and recommends. Most can be changed in group_vars/all.yml.

Control node (your PC/server that runs Ansible)

Software	Recommended version	Notes
ansible-core	2.15 or newer	uses `assert` and recent filters
Python (control node)	3.9 or newer	for running Ansible

Required Ansible collections (install all at once with requirements.yml):

Collection	Recommended version
community.general	8.0 or newer
community.postgresql	3.0 or newer
ansible.posix	1.5 or newer

Target nodes (the servers where the cluster is actually installed)

Software	Recommended version	Why this version
PostgreSQL	16 (PGDG repo)	changeable via `postgresql_version` in `all.yml` (14, 15, 17, etc.)
Patroni	3.x (latest pip)	requires Python 3.7+. Supports the etcd v3 API
etcd	3.5.x	uses the v3 API. `etcd_version` in `all.yml`
PgBouncer	1.21 or newer recommended	a version line that fully supports `scram-sha-256` authentication
HAProxy	2.x or newer	uses the `http-check expect status` syntax
Keepalived	2.x or newer	VRRP-based VIP management
OS	Debian/Ubuntu or RHEL/Rocky	both families supported
Python (target nodes)	3.9 or newer	runs Patroni and Ansible modules

If you want to change versions, edit the postgresql_version and etcd_version values in group_vars/all.yml. PgBouncer, HAProxy, and Keepalived use the versions from the OS package repositories.

How many nodes should you run?

This is the part beginners get confused about most, so it gets its own section. The key point is that etcd and PostgreSQL have different requirements.

For etcd, "quorum (a majority)" is everything. Even if one node dies, a majority must remain alive for the store to work. That's why it must be an odd number, at least 3 nodes. Two nodes is more dangerous than a single server, because losing just one breaks the majority. This playbook's pre-deployment check (preflight) refuses to even start if etcd isn't an odd number of 3 or more.
PostgreSQL (Patroni) nodes do not vote among themselves. Since etcd decides the leader, the DB can fail over safely and automatically with just 2 nodes (1 leader + 1 replica). That said, with only 2 DB nodes, you'll briefly have zero replicas right after a failover, leaving the cluster exposed. Running 3 nodes keeps a replica around even after a failover — redundancy is maintained, and you can also distribute read load.

In other words, the minimum condition for "safe automatic failover" is 3 etcd nodes, and 2 or more DB nodes. For production, we recommend 3 DB nodes as well.

Two ways to lay it out

Because the groups in this playbook are independent of each other, the same code can build either of two topologies.

Layout A — Separated (default: 3 dedicated etcd + 2 or more DB)

etcd is lightweight, so put it on three small dedicated servers, and keep only as many expensive DB servers as you need. This approach safely secures quorum while saving on DB licenses or resources, and it is exactly the default in inventory/hosts.yml.

Layout B — Collocated (everything installed together on 3 servers)

This is the layout to use when you want to keep things simple with just three servers. You put etcd, PostgreSQL, Patroni, PgBouncer, HAProxy, and Keepalived all on each of the 3 servers. It is the most common and resource-efficient layout.

Loading diagram…

To build this collocated layout (Layout B), write inventory/hosts.yml as below. The key is to register the same 3 servers in both the etcd group and the patroni_cluster group.

all:
  vars:
    ansible_user: root
  children:
    etcd:                      # all 3 also serve as etcd (quorum 3)
      hosts:
        pg-node-1: { ansible_host: 10.0.0.11, node_ip: 10.0.0.11 }
        pg-node-2: { ansible_host: 10.0.0.12, node_ip: 10.0.0.12 }
        pg-node-3: { ansible_host: 10.0.0.13, node_ip: 10.0.0.13 }
 
    patroni_cluster:           # all 3 are DB nodes
      hosts:
        pg-node-1: { ansible_host: 10.0.0.11, node_ip: 10.0.0.11, patroni_name: pg-node-1 }
        pg-node-2: { ansible_host: 10.0.0.12, node_ip: 10.0.0.12, patroni_name: pg-node-2 }
        pg-node-3: { ansible_host: 10.0.0.13, node_ip: 10.0.0.13, patroni_name: pg-node-3 }
 
    pgbouncer:                 # same as the DB nodes
      hosts: { pg-node-1: {}, pg-node-2: {}, pg-node-3: {} }
 
    haproxy:                   # same as DB nodes + Keepalived priorities
      hosts:
        pg-node-1: { keepalived_state: MASTER, keepalived_priority: 110 }
        pg-node-2: { keepalived_state: BACKUP, keepalived_priority: 105 }
        pg-node-3: { keepalived_state: BACKUP, keepalived_priority: 100 }

Prerequisites

Before you start deploying in earnest, you need to have the following in place.

Servers ready. For Layout A, 3 nodes for etcd + 2 or more for the DB; for Layout B, 3 nodes. Every server must be reachable from the control node via SSH key and have sudo privileges.

Network/firewall open. The following ports must be open between the nodes.

Port	Purpose
2379, 2380	etcd (client / peer)
5432	PostgreSQL
6432	PgBouncer
8008	Patroni REST API (HAProxy health check)
5000, 5001	HAProxy (write / read)
7000	HAProxy statistics page
VRRP (protocol 112)	Keepalived VIP communication

Install Ansible and the collections on the control node.

# Install Ansible (example)
python3 -m pip install "ansible-core>=2.15"
 
# Install the collections this project uses
ansible-galaxy collection install -r requirements.yml

Editing the configuration files

Before deploying, adjust a few places to fit your environment. Just follow along in order.

(1) Inventory — the server list and IPs

Open inventory/hosts.yml and change each server's ansible_host and node_ip to its real IP. To go with the collocated layout, rewrite it as in the example above. To add more DB nodes, just add entries in the same format (the file contains a commented-out pg-node-3 example).

(2) Global variables — VIP, versions, tuning

Open group_vars/all.yml and, at a minimum, you must set the following two to match your environment.

cluster_vip: 10.0.0.10     # the fixed address the application connects to (an IP nobody is using yet)
vip_interface: eth0        # the network card name to attach that VIP to (check with `ip a`)

(3) Allowed connection IPs — `pg_hba.conf`

You decide which IPs may connect to PostgreSQL with the postgresql_allowed_cidrs list. Only the ranges listed here are allowed to connect (replication connections between DB nodes are allowed automatically, so you don't need to list them).

postgresql_auth_method: scram-sha-256
postgresql_allowed_cidrs:
  - 127.0.0.1/32
  - 10.0.0.0/8           # in production, narrow this to your real application IP range
  # - 203.0.113.25/32    # use /32 to allow a specific server only

After changing this value, running the single line below applies it immediately at runtime, without restarting PostgreSQL (Patroni rewrites pg_hba.conf and reloads).

ansible-playbook site.yml --tags patroni

(4) Tuning profile — a `postgresql.conf` that fits your scale

The default (postgresql.conf) is a "one-size-fits-all T-shirt," so it doesn't fit any server perfectly. That's why this project provides per-scale profiles based on the DB server's RAM. You just pick one in all.yml.

postgresql_tuning_profile: minimal   # minimal | small | medium | large | xl

Profile	Recommended DB server RAM	`shared_buffers`	`max_connections`	Notes
`minimal`	≤ 4GB (test/lab)	512MB	200	default. Boots safely even on small VMs
`small`	16GB	4GB	200	small-scale production
`medium`	32GB	8GB	400	general production
`large`	64GB	16GB	600	large-scale (+ huge_pages)
`xl`	128GB	32GB	800	very large-scale (+ huge_pages)

This profile table is based on the PostgreSQL tuning guide by cluster size. Memory, WAL, autovacuum, logging, and parallelism parameters are all set together per scale. Applying a profile with large shared_buffers to a server with little RAM will fail to boot, so be sure to pick the profile that matches the server's RAM (when in doubt, start with minimal).

If you want to change only max_connections, you can force it independently of the profile.

postgresql_max_connections: 500      # leave empty to use the profile's value

To override only specific parameters directly, use postgresql_parameters_override (the highest priority).

postgresql_parameters_override:
  shared_buffers: "6GB"
  work_mem: "20MB"

Order of application: common base → selected profile → forced max_connections value → override. For how to apply changed tuning to a running cluster, see the "Day-to-day operations" section below.

(5) Synchronous replication — when you need zero data loss

The default is asynchronous replication. Since the leader doesn't wait for the replica's acknowledgment, it's fast, but if the leader suddenly dies, the last few transactions that haven't been replicated yet can be lost. For data where you can't afford to lose a single record — like money or orders — turn on synchronous replication.

patroni_synchronous_mode: true          # turn on synchronous replication (default false)
patroni_synchronous_mode_strict: false  # see the explanation below
patroni_synchronous_node_count: 1       # number of standbys to confirm synchronously

Each option means the following.

Option	Meaning	Trade-off
`patroni_synchronous_mode`	when `true`, the leader waits until the commit is reflected on a synchronous standby	less loss↑ / write latency↑
`patroni_synchronous_mode_strict`	when `true`, rejects writes when there is no synchronous standby at all (absolute zero-loss guarantee)	consistency↑ / availability↓
`patroni_synchronous_node_count`	number of standbys to confirm synchronously (e.g. `2` in a 3-node cluster)	higher value → stronger guarantee↑ / latency↑

The core trade-off in one line: synchronous replication gives up "speed" to gain "safety (zero loss)." In particular, strict=true stops the leader's writes when all synchronous standbys die (sacrificing availability), so turn it on only when zero loss is truly an absolute requirement. Patroni manages synchronous_standby_names automatically, so you don't need to touch the PostgreSQL settings directly.

⚠️ If a synchronous standby is slow, the leader's writes slow down with it. So it's safer to keep synchronous_node_count smaller than the number of live replicas (with 3 nodes, 1 out of the 2 replicas excluding the leader is a reasonable default).

Turning synchronous replication on and off for an already-running cluster is possible without a restart (the apply-tuning.yml below reflects this setting too).

(6) Passwords — always encrypt with Vault

all.yml contains placeholder passwords of the form ChangeMe_.... In production you must replace them with real passwords encrypted via Vault, as below.

cp group_vars/vault.yml.example group_vars/vault.yml
# after editing the values in vault.yml to your real passwords,
ansible-vault encrypt group_vars/vault.yml

Deploying

Now you're all set. Deployment is a single command.

# First check that you can reach the servers
ansible all -m ping
 
# Deploy the entire cluster (add --ask-vault-pass if you used Vault)
ansible-playbook site.yml --ask-vault-pass

Deployment proceeds automatically in the following order. Each stage is tagged, so you can also re-run just part of it.

preflight — first checks things like whether etcd is an odd number of 3 or more and whether there are 2 or more DB nodes. If a condition isn't met, it stops here, preventing the accident of deploying a misconfigured setup.
common — sets up base packages, time synchronization, and host registration on all nodes.
etcd — brings up the distributed key-value store cluster one node at a time, safely.
postgresql — installs the PostgreSQL binaries (Patroni handles cluster initialization).
patroni — installs Patroni, bootstraps the cluster, then creates the application DB and user on the leader.
pgbouncer — installs the connection pooler on each node.
haproxy — configures HAProxy and Keepalived (VIP).

To run only a specific stage, use tags.

ansible-playbook site.yml --tags etcd            # etcd stage only
ansible-playbook site.yml --tags patroni,pgbouncer
ansible-playbook site.yml --limit pg-node-2      # a specific node only

Checking that it worked

When deployment finishes, check whether the cluster is healthy. The operations playbook below shows the output of patronictl list.

ansible-playbook playbooks/cluster-status.yml

Or you can log into any DB node and check it directly.

patronictl -c /etc/patroni/patroni.yml list

If one node shows as Leader and the rest as Replica in the output, with a small Lag, it's healthy. You can also see which backends are alive with your own eyes on the HAProxy statistics page (http://<node-IP>:7000/).

Connecting from the application

The application always connects via the VIP only, not to individual DB server IPs. That way, even when the leader changes, there's no need to change connection settings.

Purpose	Connection address	Port
Read/write (current leader)	VIP `10.0.0.10`	`5000`
Read-only (replica distribution)	VIP `10.0.0.10`	`5001`

# Example write connection (automatically routed to the leader)
psql "host=10.0.0.10 port=5000 dbname=appdb user=appuser"

During deployment, appuser (the application user) is created, and appdb is created with that user as its owner. The username, DB name, and ownership can be changed via the app_db_* variables in all.yml.

Day-to-day operations

# View cluster status
ansible-playbook playbooks/cluster-status.yml
 
# Planned switchover — when you deliberately move the leader to another node, e.g. for maintenance (close to zero downtime)
ansible-playbook playbooks/switchover.yml -e target_leader=pg-node-2

Applying tuning to a "running" cluster

After changing the profile, max_connections, or the synchronous replication settings in all.yml, run apply-tuning.yml to apply them to an already-running cluster. This playbook reflects the values into etcd via patronictl edit-config, and Patroni reloads immediately.

# Apply the profile/values set in all.yml as-is
ansible-playbook playbooks/apply-tuning.yml
 
# Try different values just once (override on the command line)
ansible-playbook playbooks/apply-tuning.yml -e postgresql_tuning_profile=medium
ansible-playbook playbooks/apply-tuning.yml -e postgresql_max_connections=500
 
# Turn on synchronous replication without a restart (use false to turn off)
ansible-playbook playbooks/apply-tuning.yml -e patroni_synchronous_mode=true

One important point here: not every parameter applies with a reload alone.

Applied immediately by reload: work_mem, effective_cache_size, random_page_cost, autovacuum and logging related, etc. → apply-tuning.yml alone is enough.
Requires a restart: shared_buffers, max_connections, wal_buffers, huge_pages, max_worker_processes, etc. When you change these, they show up in the Pending restart column of patronictl list, and you must run the restart playbook below once more.

Restarting PostgreSQL (rolling)

# Safe rolling restart in the order replicas first → leader last
ansible-playbook playbooks/restart-postgresql.yml
 
# Restart only the nodes marked 'Pending restart'
ansible-playbook playbooks/restart-postgresql.yml -e only_pending=true
 
# Apply tuning and restart in one go (handy when you changed values that need a restart)
ansible-playbook playbooks/apply-tuning.yml -e restart_after_apply=true

Commands you run directly on a node

patronictl -c /etc/patroni/patroni.yml list            # check status
patronictl -c /etc/patroni/patroni.yml failover         # emergency failover (manual)
patronictl -c /etc/patroni/patroni.yml restart <cluster> <node>   # restart a specific node
patronictl -c /etc/patroni/patroni.yml edit-config      # change cluster-wide settings

Note: cluster-wide PostgreSQL settings (memory, WAL, etc.) are changed not by editing a node's patroni.yml directly, but via patronictl edit-config (= apply-tuning.yml). That's what keeps them applied consistently across all nodes.

Troubleshooting

Below are the problems you'll commonly run into during deployment or operations, organized by symptom along with their causes. Most are solved quickly once you know "where to look first." The first places to look are patronictl list and journalctl -u patroni.

No leader gets elected / `patronictl list` is empty

This is almost always an etcd problem. If Patroni can't write the leader key to etcd, no one can become the leader.

# Check etcd status and members (from a DB node)
ETCDCTL_API=3 etcdctl --endpoints=http://127.0.0.1:2379 endpoint health
ETCDCTL_API=3 etcdctl --endpoints=http://127.0.0.1:2379 member list
journalctl -u etcd -n 50 --no-pager

Check that etcd hasn't lost its quorum (majority). At least 2 of 3 must be alive.
If the clocks between nodes drift apart, lease-expiry decisions get thrown off → check with chronyc tracking.
Check whether ports 2379 and 2380 are blocked by the firewall.

The Patroni service won't start

journalctl -u patroni -n 80 --no-pager
sudo -u postgres patroni --validate-config /etc/patroni/patroni.yml

If shared_buffers is set larger than the server's RAM, PostgreSQL won't come up at all. Lower the tuning profile to match the server's RAM (drop to minimal to confirm).
Check whether the bin_dir path differs from the actual install path (postgresql_bin_dir_* in all.yml).
Check that the data directory permissions are postgres:postgres / 0700.

A replica can't keep up (Lag keeps growing)

patronictl -c /etc/patroni/patroni.yml list      # check the Lag column
# Check replication status on the leader
psql -U postgres -c "SELECT client_addr, state, sync_state, replay_lag FROM pg_stat_replication;"

Check whether the replica's disk I/O or network bandwidth is insufficient.
If replication broke because WAL ran short, increase max_wal_size and wal_keep_size (move up a profile).
For a broken replica, use patronictl reinit <cluster> <node> to have it re-replicate from the leader.

Connection refused (`no pg_hba.conf entry` error)

This is because the client IP isn't in postgresql_allowed_cidrs.

# Check the currently applied rules
psql -U postgres -c "SELECT type, database, user_name, address, auth_method FROM pg_hba_file_rules;"

Add the client range to postgresql_allowed_cidrs in all.yml and run ansible-playbook site.yml --tags patroni to apply it without a restart.
If it's an authentication failure, check postgresql_auth_method (default scram-sha-256) and whether the client driver supports SCRAM.

Can't connect via the VIP / connections don't move after a failover

ip a | grep '<VIP>'                              # which node the VIP is attached to
systemctl status haproxy keepalived              # status of the two services
curl -s http://127.0.0.1:7000/                   # HAProxy statistics page (text)

If the VIP isn't on any node, it's a Keepalived problem → journalctl -u keepalived.
Check that vip_interface in all.yml matches the actual NIC name (ip a).
If HAProxy can't find the leader, check the Patroni REST (8008) health-check path: curl -s http://<node>:8008/primary (returns 200 if it's the leader).

I raised `max_connections` but it isn't taking effect

max_connections is a parameter that requires a restart. Changing only the value with apply-tuning.yml just marks it Pending restart without applying it → run restart-postgresql.yml. To confirm it applied:

psql -U postgres -c "SHOW max_connections;"
psql -U postgres -c "SELECT name, setting, pending_restart FROM pg_settings WHERE name='max_connections';"

After turning on synchronous replication, writes stall or become very slow

With patroni_synchronous_mode=true (especially strict=true), if there aren't enough synchronous standbys, the leader can't confirm commits, so writes wait.

# Check who the current synchronous standby is / the sync settings
psql -U postgres -c "SHOW synchronous_standby_names;"
psql -U postgres -c "SELECT application_name, sync_state FROM pg_stat_replication;"
patronictl -c /etc/patroni/patroni.yml list   # check the Sync Standby display

If synchronous_node_count is larger than the number of live replicas, it can never be satisfied → lower the value.
It also happens when a replica is slow or disconnected (see "A replica can't keep up" above).
If availability is more urgent than zero loss, turn off strict, or briefly turn off synchronous replication to switch to async: ansible-playbook playbooks/apply-tuning.yml -e patroni_synchronous_mode=false

Queries to confirm tuning "really took effect"

-- Did the value actually apply?
SELECT name, setting, unit, source, pending_restart
FROM pg_settings
WHERE name IN ('shared_buffers','work_mem','max_connections','max_wal_size','random_page_cost');
 
-- Are checkpoints too frequent? (a high checkpoints_req is a signal to raise max_wal_size)
SELECT checkpoints_timed, checkpoints_req FROM pg_stat_bgwriter;
 
-- Are current connections close to max_connections?
SELECT count(*) AS conns, current_setting('max_connections') AS max FROM pg_stat_activity;

Security checklist

Finally, the things to verify before going to production.

Did you replace all the ChangeMe_... passwords in group_vars/all.yml with Vault values?
In the pg_hba rules of patroni.yml, did you narrow 0.0.0.0/0 to your real application IP range?
Did you consider applying TLS to etcd, the Patroni REST API, and inter-node replication traffic?
Did you confirm that vault.yml is excluded from the repository by .gitignore?

Air-gapped (offline) installation — RHEL 9 / Python 3.9

In environments cut off from the internet (finance, public sector, internal networks), the nodes can't reach external repositories, GitHub, or PyPI. So you gather all the needed install files in advance somewhere with internet access, then move that bundle into the air-gapped network to install. This project automates that process with two scripts, based on RHEL 9's system Python 3.9.

Version-compatibility note: RHEL 9's system Python is 3.9. ansible-core 2.16 and newer require Python 3.10+ on the control node, so in this environment you must use ansible-core 2.15.x. The bundle-build script pins to this range (>=2.15,<2.16) when downloading.

The whole picture

Loading diagram…

Step 1 — Build the bundle (on a RHEL 9 host with internet)

The build host must have the BaseOS/AppStream, EPEL, and PGDG (PostgreSQL) repositories enabled (to fetch postgresql16-server, pgbouncer, etc.).

# Match the versions to group_vars/all.yml (default PG16, etcd 3.5.16)
./scripts/airgap-build-bundle.sh
# Or specify versions
PG_VER=16 ETCD_VER=3.5.16 ./scripts/airgap-build-bundle.sh

What the script does:

RPM — gathers packages including dependencies (--resolve --alldeps) and generates local repository metadata with createrepo_c
pip wheel — downloads the control-node set (ansible-core 2.15.x) and the target set (patroni[etcd3], psycopg2-binary, python-etcd) separately (downloaded on RHEL 9 / py3.9 for compatibility)
etcd binary tarball download
Ansible collections (community.general, community.postgresql, ansible.posix) download
Bundles all of the above plus airgap-install.sh and a manifest into patroni-airgap-bundle-*.tar.gz

Step 2 — Move the bundle

Copy the generated patroni-airgap-bundle-*.tar.gz to the air-gapped network's control node and all target nodes, then extract it on each node.

tar xzf patroni-airgap-bundle-*.tar.gz && cd airgap-bundle

Step 3 — Offline install per node

# Control node: install ansible-core + collections (offline) into a Python 3.9 venv
sudo ./airgap-install.sh --role control
 
# Each DB/etcd node: place the local dnf repo + wheelhouse + etcd binary
sudo ./airgap-install.sh --role target
 
# If one node is both control and target (collocated)
sudo ./airgap-install.sh --role both

The install script places the following on each node (paths match offline_* in all.yml).

Resource	Location	Purpose
local dnf repo	`/etc/yum.repos.d/patroni-airgap.repo` → `/opt/patroni-airgap/rpms`	offline package install
pip wheelhouse	`/opt/patroni-airgap/wheels`	offline install of Patroni, etc.
etcd binary	`/opt/patroni-airgap/etcd/`	offline etcd install
ansible venv	`/opt/patroni-airgap/venv` (control only)	run playbooks within the air-gapped network

Step 4 — Deploy with `offline_mode`

On the control node, activate the venv and deploy as usual with offline_mode=true. When this is on, the roles skip internet operations (adding the PGDG repo, downloading etcd, installing pip from the internet) and use only the local resources placed above.

source /opt/patroni-airgap/venv/bin/activate
ansible all -m ping
ansible-playbook site.yml -e offline_mode=true --ask-vault-pass

If passing offline_mode with -e every time is tedious, you can fix it as offline_mode: true in group_vars/all.yml. Then the air-gapped install proceeds with just ansible-playbook site.yml.

How it works, in brief

When offline_mode: true, each role differs as follows.

Role	Online (default)	Air-gapped (offline_mode)
common / postgresql / pgbouncer / haproxy	dnf install from the system and PGDG repos	uses only the local repo with `--disablerepo='*' --enablerepo=patroni-airgap`
postgresql	adds the PGDG repo RPM + disables the module	skipped (the local repo provides the packages)
etcd	downloads the tarball from GitHub	uses the local file `offline_etcd_tarball`
patroni	pip install from PyPI	`pip --no-index --find-links <wheelhouse>`

Wrapping up

If you've followed along this far, you now have a PostgreSQL cluster that picks itself back up even when one node dies. To recap the key points one more time:

The leader decision is made not by PostgreSQL but by the distributed key-value store (DCS) called etcd — which is what structurally blocks split-brain.
The application always connects through a single VIP — no need to change connection settings even when a failover happens.
etcd should be an odd number of 3, and the DB 2 or more (3 recommended for production) — the minimum conditions for quorum and redundancy.
Tuning and synchronous replication can be changed while running — though some parameters require a rolling restart.

Take your time and practice one step at a time. You can find the full playbook and air-gap scripts at DataDynamics-OSS/patroni-postgresql-ha.

What are we trying to build?

The big picture

What happens when something fails

Why we recommend Patroni specifically

Software version requirements

Control node (your PC/server that runs Ansible)

Target nodes (the servers where the cluster is actually installed)

How many nodes should you run?

Two ways to lay it out

Layout A — Separated (default: 3 dedicated etcd + 2 or more DB)

Layout B — Collocated (everything installed together on 3 servers)

Prerequisites

Editing the configuration files

(1) Inventory — the server list and IPs

(2) Global variables — VIP, versions, tuning

(3) Allowed connection IPs — pg_hba.conf

(4) Tuning profile — a postgresql.conf that fits your scale

(5) Synchronous replication — when you need zero data loss

(6) Passwords — always encrypt with Vault

Deploying

Checking that it worked

Connecting from the application

Day-to-day operations

Applying tuning to a "running" cluster

Restarting PostgreSQL (rolling)

Commands you run directly on a node

Troubleshooting

No leader gets elected / patronictl list is empty

The Patroni service won't start

A replica can't keep up (Lag keeps growing)

Connection refused (no pg_hba.conf entry error)

Can't connect via the VIP / connections don't move after a failover

I raised max_connections but it isn't taking effect

After turning on synchronous replication, writes stall or become very slow

Queries to confirm tuning "really took effect"

Security checklist

Air-gapped (offline) installation — RHEL 9 / Python 3.9

The whole picture

Step 1 — Build the bundle (on a RHEL 9 host with internet)

Step 2 — Move the bundle

Step 3 — Offline install per node

Step 4 — Deploy with offline_mode

How it works, in brief

Wrapping up

(3) Allowed connection IPs — `pg_hba.conf`

(4) Tuning profile — a `postgresql.conf` that fits your scale

No leader gets elected / `patronictl list` is empty

Connection refused (`no pg_hba.conf entry` error)

I raised `max_connections` but it isn't taking effect

Step 4 — Deploy with `offline_mode`