Clustering

Clustering enables a multi-node deployment against a shared repository. The content (database) and blob storage are shared, while each node keeps its own search index and temporary area. Nodes coordinate through database leases (locks) and a journal that replays transactions across nodes.

Configuration

Repository

# <repository>/etc/repository.yml
cluster:
  enabled: true
  nodeId: node-1   # optional; unique per node

Override these settings per node with the environment variable CMS_CLUSTER_NODE_ID, or with the framework properties org.mintjams.jcr.cluster.nodeId and org.mintjams.jcr.cluster.enabled. If nodeId is omitted, the host name is used (or a random ID).
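The fallback chain for the node ID can be pictured with a small sketch. This is not the product's code: the class and method names are hypothetical, and the precedence between the environment variable and the framework property is an assumption made only for illustration.

```java
import java.util.UUID;

// Hypothetical illustration of node ID resolution; not the actual implementation.
final class NodeIdSketch {

    /**
     * Returns the first non-blank candidate. The order
     * (env var, framework property, repository.yml, host name) is ASSUMED.
     * If nothing is available, a random ID is generated.
     */
    static String resolve(String envValue, String frameworkProperty,
                          String yamlValue, String hostName) {
        String[] candidates = {envValue, frameworkProperty, yamlValue, hostName};
        for (String candidate : candidates) {
            if (candidate != null && !candidate.isBlank()) {
                return candidate.trim();
            }
        }
        return UUID.randomUUID().toString();
    }

    public static void main(String[] args) {
        // Reads the documented environment variable and framework property name.
        String id = resolve(
                System.getenv("CMS_CLUSTER_NODE_ID"),
                System.getProperty("org.mintjams.jcr.cluster.nodeId"),
                null,   // value from repository.yml would go here
                null);  // host name lookup omitted in this sketch
        System.out.println(id);
    }
}
```

Whatever the real precedence is, the resulting ID must be unique per node, since each node registers under it.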

Workspace (shared storage)

# <workspace>/etc/jcr/jcr.yml
datasource:
  jdbcURL: jdbc:postgresql://db:5432/jcr_${workspace.name}
  username: jcr
  password: secret
  driverClassName: org.postgresql.Driver
blobstore:
  type: fs
  directory: /mnt/shared/cms/blobs/${workspace.name}
search:
  indexPath: /var/lib/cms/search/${workspace.name}   # node-local fast storage

Variables such as ${repository.home}, ${workspace.name} and ${cluster.nodeId} are substituted. The search index is kept per node and rebuilt automatically from content if empty.
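To make the substitution concrete, here is a minimal sketch of how such placeholders could be expanded. The class name and the choice to leave unknown variables untouched are assumptions; the actual resolver may behave differently.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of ${name} substitution; not the actual resolver.
final class PlaceholderSketch {

    // Matches ${...} and captures the variable name between the braces.
    private static final Pattern VARIABLE = Pattern.compile("\\$\\{([^}]+)\\}");

    static String resolve(String template, Map<String, String> variables) {
        Matcher matcher = VARIABLE.matcher(template);
        StringBuilder out = new StringBuilder();
        while (matcher.find()) {
            String value = variables.get(matcher.group(1));
            // ASSUMPTION: unknown variables are left as written.
            String replacement = (value != null) ? value : matcher.group(0);
            matcher.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> variables = Map.of("workspace.name", "web");
        System.out.println(resolve("/mnt/shared/cms/blobs/${workspace.name}", variables));
    }
}
```

With a workspace named web (a made-up name), the same jcr.yml yields a distinct database URL, blob directory and index path, which is why the file can be identical on all nodes.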

Where persistent state lives

State | Standalone (default) | Clustered
Content, ACLs, journal | embedded H2 | shared DB (e.g. PostgreSQL), one DB per workspace
Blobs (binaries) | local files | shared storage (NFS, etc.)
Full-text search index | local | node-local

Files that must be identical on every node

The following "identity files" must be identical across all nodes. They are auto-generated on first boot, so do not let the second and later nodes regenerate them; copy them from the first node instead:

  • secrets/secret-key.yml (encryption key for stored secrets)
  • etc/boot.id (repository identifier; used to derive keys for masked values)
  • etc/idp-keystore.p12 / etc/sp-keystore.p12 (SAML keys)
  • etc/idp.yml / etc/saml2.yml

The recommended approach is to put the repository directory on shared storage, so that etc/ and secrets/ are shared automatically. Because the temporary directory (tmp/) is wiped at startup, each node in a cluster automatically uses its own tmp/nodes/<nodeId> directory, which must not be shared between nodes.

Journal & coordination

Every transaction is recorded in a journal, and each node's poller (every 2 seconds) replays transactions from other nodes. This makes cache invalidation, index updates and OSGi events (Camel route redeployment, CMS events, SSE/GraphQL subscriptions) cluster-aware.
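The replay loop can be modeled in a few lines. The following is an in-memory sketch only: the class names, the revision counter and the list standing in for the shared journal table are all assumptions, and the actual schema and replay actions are not described here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model of journal replay; not the actual implementation.
final class JournalSketch {

    /** One committed transaction as recorded in the journal. */
    static final class Entry {
        final long revision;
        final String originNodeId;
        final String change;

        Entry(long revision, String originNodeId, String change) {
            this.revision = revision;
            this.originNodeId = originNodeId;
            this.change = change;
        }
    }

    /** Stand-in for the journal kept in the shared database. */
    static final class Journal {
        private final List<Entry> entries = new ArrayList<>();

        synchronized long append(String originNodeId, String change) {
            long revision = entries.size() + 1;
            entries.add(new Entry(revision, originNodeId, change));
            return revision;
        }

        synchronized List<Entry> readAfter(long revision) {
            List<Entry> result = new ArrayList<>();
            for (Entry entry : entries) {
                if (entry.revision > revision) {
                    result.add(entry);
                }
            }
            return result;
        }
    }

    /** Per-node poller; a real node would run pollOnce() on a timer (every 2 seconds). */
    static final class Poller {
        private final String nodeId;
        private final Journal journal;
        private long lastSeenRevision = 0;
        final List<String> replayed = new ArrayList<>();

        Poller(String nodeId, Journal journal) {
            this.nodeId = nodeId;
            this.journal = journal;
        }

        void pollOnce() {
            for (Entry entry : journal.readAfter(lastSeenRevision)) {
                // Only changes from other nodes are replayed; local ones are already applied.
                if (!entry.originNodeId.equals(nodeId)) {
                    replayed.add(entry.change); // e.g. invalidate cache, update index
                }
                lastSeenRevision = entry.revision;
            }
        }
    }
}
```

Tracking the last seen revision is what keeps a second poll from replaying the same transaction twice.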

Coordination tables are created automatically:

  • jcr_cluster_nodes — node registry; refreshes last_heartbeat every 30s
  • jcr_cluster_locks — lease locks (with TTL, so a crash never blocks indefinitely)
  • jcr_cluster_signals — a signal bus for short-lived control notifications

Single-node work — workspace startup, blob cleanup, content deployment — is serialized with leases.
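The lease semantics (one holder at a time, with automatic expiry) can be sketched as follows. This is an in-memory stand-in for jcr_cluster_locks: the class, method names and row layout are assumptions, and the real table is updated through the database rather than a synchronized map.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory model of TTL-based lease locks; not the actual implementation.
final class LeaseTableSketch {

    private static final class Row {
        final String owner;
        final long expiresAtMillis;

        Row(String owner, long expiresAtMillis) {
            this.owner = owner;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<String, Row> rows = new HashMap<>();

    /** Grants the lease unless another node holds an unexpired one. */
    synchronized boolean tryAcquire(String name, String nodeId, long ttlMillis, long nowMillis) {
        Row current = rows.get(name);
        boolean heldByOther = current != null
                && !current.owner.equals(nodeId)
                && current.expiresAtMillis > nowMillis;
        if (heldByOther) {
            return false;
        }
        rows.put(name, new Row(nodeId, nowMillis + ttlMillis));
        return true;
    }

    /** Releases the lease early; only the current owner may do so. */
    synchronized void release(String name, String nodeId) {
        Row current = rows.get(name);
        if (current != null && current.owner.equals(nodeId)) {
            rows.remove(name);
        }
    }
}
```

Because expiry is decided by comparing timestamps, a crashed holder stops blocking others once its TTL passes. In a model like this, nodes also need to agree on the current time.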

Procedure (overview)

  1. Provision a PostgreSQL database per workspace for JCR (and one for BPM if used)
  2. Install the PostgreSQL JDBC driver bundle into Felix
  3. Put the repository directory on shared storage (at minimum, share blobstore.directory across nodes)
  4. Configure each workspace's jcr.yml#datasource (and bpm.yml#jdbcURL if needed) identically on all nodes
  5. Share the identity files across nodes (on first boot, start a single node alone)
  6. Set cluster.enabled to true and give each node a unique nodeId. Keep node clocks NTP-synchronized
  7. Place the nodes behind a load balancer (sticky sessions recommended)

Coordination from application code

The script API can run a piece of work on exactly one node in the cluster.

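// Try to take the cluster-wide lease named "nightly-report".
// The second argument (600000) is presumably the lease TTL in milliseconds.
def lease = cluster.tryLock("nightly-report", 600000)
if (lease != null) {   // null means the lease was not granted
    try {
        // ... runs on exactly one node ...
    } finally {
        lease.close()   // release the lease instead of waiting for it to expire
    }
}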

cluster.isClusterEnabled(), cluster.nodeId and cluster.listMembers() are also available. In standalone mode the lock is granted immediately and the same code runs unchanged.

Monitoring

Use the GraphQL cluster query (admin), or the Cluster card in the Dashboard Operations section, to review each node's heartbeat (liveness). A node that stays silent for three heartbeat intervals (3 × 30s, about 90s) is logged as a warning.
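The liveness rule is simple arithmetic over the last heartbeat timestamp. A minimal sketch, assuming the 30-second interval and the three-interval threshold stated in this document; the class name and the inclusive boundary at exactly 90 seconds are assumptions.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical illustration of the stale-heartbeat check; not the actual implementation.
final class HeartbeatSketch {

    static final Duration HEARTBEAT_INTERVAL = Duration.ofSeconds(30);
    static final int MISSED_INTERVALS_BEFORE_WARNING = 3;

    /** True when the node has been silent for at least three heartbeat intervals. */
    static boolean isSilent(Instant lastHeartbeat, Instant now) {
        Duration silence = Duration.between(lastHeartbeat, now);
        Duration threshold = HEARTBEAT_INTERVAL.multipliedBy(MISSED_INTERVALS_BEFORE_WARNING);
        return silence.compareTo(threshold) >= 0; // ASSUMPTION: boundary is inclusive
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        System.out.println(isSilent(now.minusSeconds(45), now));
        System.out.println(isSilent(now.minusSeconds(120), now));
    }
}
```

A similar check could be run by an external monitor against the last_heartbeat values returned by the cluster query.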

Cautions

  • Clock skew breaks the stability window (10s). NTP synchronization is required.
  • External databases and blob storage are not auto-managed. Cleaning up the DB/blobs after deleting a workspace, and clearing the DB before recreating one of the same name, are manual steps.
  • The search index is per-node and not replicated (it rebuilds automatically when empty).