Proxmox Cluster Architecture (Current State)

This document describes the TorresVault Proxmox cluster as it exists today. It covers hardware, networking, storage, workloads, and backup/restore, and is intended as the authoritative reference for how virtualization is implemented in TorresVault 2.0.

Future redesigns (new NAS, X570D4U, Mini PC cluster, etc.) will be documented separately on the roadmap page.


1. High-Level Overview

The Proxmox environment is a 2-node cluster with a QDevice, running on older but solid Intel desktop platforms with expanded SATA and NIC capacity.

High-level logical view:

The design intentionally does not use shared storage for HA; instead, VMs are pinned to nodes and protected via image-based backups to PBS.
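The quorum arithmetic behind the QDevice is worth making explicit; a minimal sketch in plain shell (no Proxmox commands involved):

```shell
# Each full Proxmox node carries one corosync vote; the external
# QDevice contributes a third vote as a tie-breaker.
node_votes=2
qdevice_votes=1
total=$((node_votes + qdevice_votes))

# Quorum is a strict majority of the total votes.
quorum=$((total / 2 + 1))

echo "total votes: $total, quorum: $quorum"
# With 3 votes and a quorum of 2, one node plus the QDevice stays
# quorate if the other node fails -- the reason a 2-node cluster
# needs the QDevice at all.
```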


2. Physical Hosts & Hardware

2.1 pve1

Role: General compute node, many core services

Primary role summary:

2.2 pve2

Role: General compute node, media & application workloads

Primary role summary:

2.3 QDevice

3. Network Design

The Proxmox cluster uses:

UniFi VLANs exist on the network side (stark_user, stark_IOT, guest, IOT+, Torres Family Lights); for now, Proxmox sees mostly the flat LAN plus specific lab networks for testing.

3.1 pve1 Network Interfaces

From the Proxmox UI:

Design notes:

3.2 pve2 Network Interfaces

Design notes:

3.3 Logical Topology Diagram (Text)

This separation keeps corosync and cluster traffic off the main LAN and avoids cluster instability if the LAN becomes noisy.
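As an illustrative sketch of how such a separation is expressed, a two-link `corosync.conf` nodelist might look like this (the node addresses and the dedicated subnet here are placeholders, not the actual cluster values):

```
# /etc/pve/corosync.conf (illustrative fragment -- addresses are
# placeholders, not the real cluster configuration)
nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.1    # dedicated cluster link
    ring1_addr: 192.168.1.11  # LAN fallback link
  }
  node {
    name: pve2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.10.10.2
    ring1_addr: 192.168.1.12
  }
}
```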


4. Storage Architecture (Current)

There are three main storage layers:

4.1 Local Storage on pve1

Typical Proxmox storages (names as shown in the UI):

Backed by:

4.2 Local Storage on pve2

pve2's local storage is likewise backed by multiple 1 TB Seagate ST91000640NS disks via the same combination of controllers.

4.3 TrueNAS VM

A virtualized TrueNAS instance provides network storage for the system, including the `pbs-main` datastore used by PBS.

Over time, this VM may be migrated to a dedicated physical NAS, but for now it is virtualized.

4.4 Proxmox Backup Server (PBS) VM

Important backup rule:

This prevents:

PBS instead focuses on backing up critical application VMs only.


5. Workload Layout

The current cluster runs a mix of core services and lab workloads. VM IDs/names:

5.1 VMs on pve1

5.2 VMs on pve2

These assignments are not HA; VMs are pinned to nodes and protected by PBS backups.


6. Backup & Restore Strategy

Backups are handled by the PBS VM (ID 105), writing into datastore `pbs-main` hosted on TrueNAS.
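As a sketch of how such a job is declared, a PBS-targeted entry in `/etc/pve/jobs.cfg` looks roughly like this (the schedule, VM IDs, storage name, and retention values are illustrative, not the live configuration):

```
# /etc/pve/jobs.cfg (illustrative fragment -- values are placeholders)
vzdump: backup-pbs-main
        schedule daily
        storage pbs-main
        vmid 101,102,103
        mode snapshot
        enabled 1
        prune-backups keep-daily=7,keep-weekly=4
```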

Key points:

6.1 Rationale for Exclusions

Backing up the TrueNAS or PBS VMs through PBS would write backup data into the same existing datastore that holds it, creating a circular dependency.

Instead, PBS backup jobs focus on stateless or easily rebuildable VMs where immutable data is stored externally (e.g., on TrueNAS, Nextcloud data, or other locations).

6.2 Restore Process (Operational Runbook)

Scenario: single VM failure

1. Identify affected VM in Proxmox UI.
2. In PBS UI:
   * Go to **Datastore → pbs-main → Content → VM group**.
   * Select the latest successful backup.
3. Choose **Restore**:
   * Target node: original host (or alternate host if needed)
   * Disk storage: appropriate local storage (`local-lvm`, `apps-pool`, etc.)
4. Start VM in Proxmox and validate:
   * Application health checks (web UI, API, etc.)
   * Network connectivity (LAN, DNS, etc.)
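The UI steps above map onto the `qmrestore` CLI as well. A dry-run sketch (the VM ID, snapshot timestamp, and storage name are hypothetical; the `run` wrapper echoes commands instead of executing them, so the script is safe to run anywhere):

```shell
# Restoring a VM from PBS on the command line. The VM ID, archive
# timestamp, and target storage below are illustrative placeholders.
VMID=104
STORAGE=local-lvm
ARCHIVE="pbs-main:backup/vm/${VMID}/2024-01-01T00:00:00Z"

# Echo instead of executing, so this sketch is safe to run anywhere.
run() { echo "+ $*"; }

run qmrestore "$ARCHIVE" "$VMID" --storage "$STORAGE"
run qm start "$VMID"
```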

Scenario: node loss (pve1 or pve2)

1. Replace/fix hardware and reinstall Proxmox VE.
2. Rejoin node to `torres-cluster`.
3. Recreate necessary storages pointing at local disks.
4. From PBS, restore VMs to the rebuilt node using the procedure above.

Scenario: PBS VM lost but TrueNAS datastore intact

1. Recreate PBS VM from Proxmox template.
2. Reattach existing `pbs-main` datastore on TrueNAS.
3. PBS will rediscover existing backups.
4. Resume normal operations.
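Reattaching amounts to pointing the new PBS instance at the existing directory; the resulting `/etc/proxmox-backup/datastore.cfg` entry would look roughly like this (the mount path is a placeholder):

```
# /etc/proxmox-backup/datastore.cfg (illustrative fragment)
datastore: pbs-main
        path /mnt/truenas/pbs-main
        gc-schedule daily
```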

7. Monitoring & Observability

Monitoring in TorresVault is layered:

Operational practice:


8. Operational Procedures (Day-to-Day)

8.1 Adding a New VM

1. Decide which node (pve1 vs pve2) based on workload:
   * Storage-heavy → whichever has more free disk
   * Media or GPU heavy (later) → pve2
2. Create VM in Proxmox:
   * Attach to `vmbr0` for LAN access
   * Store disks on `local-lvm`, `apps-pool`, or `VM-pool`
3. Install OS and configure network.
4. In **PBS**, add VM to an existing or new backup group.
5. Verify first backup completes successfully.
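The creation steps above can be sketched on the CLI with `qm create` (the VM ID, name, memory/disk sizes, and storage are hypothetical; commands are echoed rather than executed):

```shell
# CLI equivalent of the VM-creation steps above. The VM ID, name,
# sizes, and storage are illustrative placeholders.
VMID=110

# Echo instead of executing, so this sketch is safe to run anywhere.
run() { echo "+ $*"; }

run qm create "$VMID" --name new-app-vm --memory 4096 --cores 2 \
    --net0 "virtio,bridge=vmbr0" \
    --scsi0 "local-lvm:32"
run qm start "$VMID"
```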

8.2 Maintenance & Patching

Proxmox nodes:

1. Live-migrate or gracefully shut down VMs on the target node if needed.
2. `apt update && apt full-upgrade` on the node (via console or SSH).
3. Reboot node.
4. Verify:
   * Corosync quorum healthy
   * VMs auto-started where expected

PBS & TrueNAS:

8.3 Power-Down / Power-Up Order

Planned maintenance that requires full stack shutdown:

Shutdown order:

1. Application VMs (web, Nextcloud, Immich, Jellyfin, etc.)
2. Monitoring VMs (Kuma, Prometheus)
3. PBS VM
4. TrueNAS VM
5. pve2 node
6. pve1 node (last Proxmox node)
7. Network gear / UPS if necessary

Power-up order:

1. Network gear & UPS
2. pve1 and pve2 nodes
3. TrueNAS VM
4. PBS VM
5. Core apps (web, Nextcloud, Immich, Jellyfin, n8n, NPM, wiki)
6. Monitoring stack (Kuma, Prometheus/Grafana)

This order ensures that storage is ready before PBS, and PBS is ready before any dependent VMs (if any use backup features such as guest-initiated restore).
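The shutdown sequence can be encoded as a small script so the order lives in one place. A sketch (the PBS VM ID is from this document; all other VM IDs are illustrative placeholders, and commands are echoed rather than executed):

```shell
# Shutdown order encoded as data. The PBS VM ID (105) is from this
# document; all other VM IDs are illustrative placeholders.
APP_VMS="101 102 103"      # web, Nextcloud, Immich, Jellyfin, ...
MONITORING_VMS="108 109"   # Kuma, Prometheus
PBS_VM=105
TRUENAS_VM=100

# Echo instead of executing, so this sketch is safe to run anywhere.
run() { echo "+ $*"; }

# Apps first, then monitoring, then PBS, then TrueNAS (storage last).
for vmid in $APP_VMS $MONITORING_VMS $PBS_VM $TRUENAS_VM; do
  run qm shutdown "$vmid" --timeout 120
done

# Finally the nodes themselves: pve2 first, then pve1.
run shutdown -h now   # run on pve2, then on pve1
```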


9. Risks, Constraints & Known Limitations

These are acceptable for a home lab / prosumer environment but are captured here explicitly for future planning.


10. Future Improvements (Pointer to Roadmap)

The following items are out of scope for this document but are tracked on the roadmap:

See: roadmap (to be created).