Proxmox Cluster Architecture (Current State)
This document describes the current TorresVault Proxmox cluster as it exists today. It focuses on hardware, networking, storage, workloads, and backup/restore, and is intended as the authoritative reference for how virtualization is implemented in TorresVault 2.0 (current state).
Future redesigns (new NAS, X570D4U, Mini PC cluster, etc.) will be documented separately on the Proxmox / TorresVault 2.0 Roadmap page.
1. High-Level Overview
The Proxmox environment is a 2-node cluster with a qdevice, running on older but solid Intel desktop platforms with expanded SATA and NIC capacity.
- Cluster name: `torres-cluster`
- Hypervisor: Proxmox VE 9.x
- Nodes: `pve1`, `pve2`
- Quorum helper: Raspberry Pi running `corosync-qdevice`
- Backup server: Proxmox Backup Server (PBS) VM
- Storage backend: Local SATA disks per node; TrueNAS VM providing backup storage
High-level logical view:
- Compute layer: pve1, pve2
- Storage layer: local disks per node, plus TrueNAS VM used as backup target
- Backup layer: PBS VM writing to TrueNAS (`pbs-main` datastore)
- Monitoring layer: Prometheus + Grafana, Kuma, Proxmox built-ins
The design intentionally does not use shared storage for HA; instead, VMs are pinned to nodes and protected via image-based backups to PBS.
2. Physical Hosts & Hardware
2.1 pve1
Role: General compute node, many core services
- CPU: Intel Core i5-2500 @ 3.30 GHz
- 4 cores / 4 threads, 1 socket
- RAM: 32 GB DDR3L 1600 MHz
- 4 × 8 GB Timetec DDR3L (PC3L-12800) UDIMM kit
- Motherboard / Chipset: Older Intel desktop platform
- Disk controllers:
- Onboard Intel SATA controller (RAID mode)
- ASMedia ASM1064 SATA controller
- GLOTRENDS SA3112-C 12-port PCIe x1 SATA expansion card
- Disk inventory (approximate):
- Several 1 TB WDC WD1003FBYX enterprise HDDs
- Several 1 TB Seagate ST91000640NS HDDs
- System / boot disk plus ~10–12 × 1 TB data disks
- Network interfaces:
- Onboard Intel 82579LM 1 GbE
- Intel I350 quad-port 1 GbE PCIe NIC
- Installed OS: Proxmox VE 9.x (legacy BIOS)
- Kernel example: 6.14.x-pve
Primary role summary:
- Runs web, monitoring, automation, PBS, TrueNAS and various lab VMs
- Acts as one half of the Proxmox cluster
- Provides local LVM/ZFS storage for its own VMs
2.2 pve2
Role: General compute node, media & application workloads
- CPU: Intel Core i5-4570 @ 3.20 GHz
- 4 cores / 4 threads, 1 socket
- RAM: 32 GB DDR3L 1600 MHz
- Same Timetec 4 × 8 GB kit as pve1
- Disk controllers:
- Intel 9-Series SATA controller (AHCI)
- ASMedia ASM1064 SATA controller
- GLOTRENDS SA3112-C 12-port PCIe x1 SATA expansion card
- Disk inventory (approximate):
- Multiple 1 TB Seagate ST91000640NS HDDs
- System disk plus ~10–12 × 1 TB data disks
- Network interfaces:
- Intel I350 quad-port 1 GbE (matching pve1)
- Installed OS: Proxmox VE 9.x (EFI boot)
Primary role summary:
- Runs general apps, including media and image workloads (Nextcloud, Immich, Jellyfin, etc.)
- Acts as second node of Proxmox cluster
- Mirrors pve1's storage pattern with local disks only
2.3 QDevice
- Hardware: Raspberry Pi (dedicated qdevice)
- Software: `corosync-qdevice`
- Purpose: provides voting/quorum for a 2-node cluster, preventing split-brain.
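Attaching a Raspberry Pi qdevice to a two-node cluster typically looks like the sketch below. This is illustrative only: the Pi's LAN IP is still TBD in this document, so `<qdevice-ip>` is a placeholder.

```shell
# On the Raspberry Pi:
apt install corosync-qnetd

# On both Proxmox nodes:
apt install corosync-qdevice

# From one node, register the qdevice with the cluster:
pvecm qdevice setup <qdevice-ip>

# Confirm the extra quorum vote appears:
pvecm status
```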
3. Network Design
The Proxmox cluster uses:
- Main LAN: `192.168.1.0/24`
- Gateway: UCG Max at `192.168.1.1`
- Cluster link: dedicated point-to-point /30 network:
- pve1: `10.10.10.1/30`
- pve2: `10.10.10.2/30`
UniFi VLANs exist on the network side (stark_user, stark_IOT, guest, IOT+, Torres Family Lights); for now, Proxmox sees mostly the flat LAN plus specific lab networks for testing.
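The dedicated /30 cluster link would typically show up in `/etc/pve/corosync.conf` roughly as in the excerpt below. The ring addresses are taken from this document; node IDs and the rest of the file are assumptions and should be checked against the live file.

```
# /etc/pve/corosync.conf (excerpt, illustrative -- verify against the live file)
nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.1
  }
  node {
    name: pve2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.10.10.2
  }
}
```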
3.1 pve1 Network Interfaces
From the Proxmox UI:
- bond0 – Linux bond
- Mode: active-backup
- Slaves: `enp1s0f1`, `enp1s0f2`
- No IP address (used as bridge slave)
- eno1 – onboard NIC
- Not active, not autostart (reserved / spare)
- enp1s0f0
- IP: `10.10.10.1/30`
- Usage: dedicated cluster interconnect to pve2
- enp1s0f1 / enp1s0f2
- Members of `bond0`
- enp1s0f3
- Currently unused
- vmbr0 – Linux bridge
- IP: `192.168.1.150/24`
- Gateway: `192.168.1.1` (UCG Max)
- Bridge port: `bond0`
- All LAN-facing VMs attach here
Design notes:
- Two NIC ports for LAN via bond (`bond0` → `vmbr0`) for basic redundancy.
- One NIC port for cluster link (`enp1s0f0`).
- One NIC still available (`enp1s0f3`) for future use (e.g., storage VLAN or DMZ).
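On disk, the pve1 layout above would look roughly like this `/etc/network/interfaces` sketch (using the ifupdown2 stack Proxmox ships by default). Option values such as `bond-miimon` are typical defaults, not confirmed from the host:

```
# /etc/network/interfaces on pve1 (illustrative sketch)
auto bond0
iface bond0 inet manual
    bond-slaves enp1s0f1 enp1s0f2
    bond-mode active-backup
    bond-miimon 100

auto enp1s0f0
iface enp1s0f0 inet static
    address 10.10.10.1/30

auto vmbr0
iface vmbr0 inet static
    address 192.168.1.150/24
    gateway 192.168.1.1
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0
```

The pve2 file would be symmetric, with `enp2s0fX` names and `192.168.1.151` / `10.10.10.2`.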
3.2 pve2 Network Interfaces
- bond0 – Linux bond
- Mode: active-backup
- Slaves: `enp2s0f1`, `enp2s0f2`
- eno1 – onboard NIC
- Not active
- enp2s0f0
- IP: `10.10.10.2/30`
- Usage: dedicated cluster interconnect to pve1
- enp2s0f1 / enp2s0f2
- Members of `bond0`
- enp2s0f3
- Currently unused
- vmbr0
- IP: `192.168.1.151/24`
- Gateway: `192.168.1.1`
- Bridge port: `bond0`
Design notes:
- Symmetric layout with pve1 to make VM migration and cabling easier.
- Cluster traffic is physically separated from LAN.
3.3 Logical Topology Diagram (Text)
- LAN 192.168.1.0/24
- UCG Max (192.168.1.1)
- pve1 (192.168.1.150, vmbr0 on bond0)
- pve2 (192.168.1.151, vmbr0 on bond0)
- Other LAN clients / services
- Cluster link 10.10.10.0/30
- pve1 → `enp1s0f0` → `10.10.10.1`
- pve2 → `enp2s0f0` → `10.10.10.2`
- Single direct cable between nodes
- Quorum
- Raspberry Pi qdevice on LAN (IP TBD), reachable from both nodes
This separation keeps corosync and other cluster traffic off the main LAN and avoids cluster instability if the LAN becomes noisy.
4. Storage Architecture (Current)
There are three main storage layers:
- Local node storage (per-node disks, LVM/ZFS)
- TrueNAS VM – used as backup target
- Proxmox Backup Server (PBS) – used for image-based backups
4.1 Local Storage on pve1
Typical Proxmox storages (names as shown in the UI):
- `local (pve1)` – boot disk, ISOs, templates
- `local-lvm (pve1)` – LVM-thin for VM disks
- `VM-pool (pve1)` – additional pool for VMs (local disks)
- `PBS (pve1)` – smaller storage local to the PBS VM (e.g., for metadata or staging)
Backed by:
- WDC WD1003FBYX and Seagate ST91000640NS disks on Intel/ASMedia/GLOTRENDS SATA controllers
- No hardware RAID; uses Proxmox's software stack and/or ZFS/LVM
4.2 Local Storage on pve2
- `local (pve2)` – boot, ISOs, templates
- `local-lvm (pve2)` – VM disks
- `apps-pool (pve2)` – main pool for application VMs (Nextcloud, Immich, Jellyfin, etc.)
Also backed by multiple 1 TB Seagate ST91000640NS disks via the same combination of controllers.
4.3 TrueNAS VM
- VM ID: 108 (`truenas`)
- Node: pve1
- Purpose: Provides network file storage and backup target for PBS
- Storage role: Backing store for PBS datastore `pbs-main`
- Backups: TrueNAS is NOT backed up by PBS.
- Reason: TrueNAS holds the backups; recursively backing it up is inefficient and can overload the system.
Over time, this VM may be migrated to a dedicated physical NAS, but for now it is virtualized.
4.4 Proxmox Backup Server (PBS) VM
- VM ID: 105 (`pbs`)
- Node: pve1
- OS: Proxmox Backup Server 4.x
- CPU: 3 vCPUs
- RAM: 4 GB
- Main datastore: `pbs-main`
- Size: ~5.38 TB
- Used: ~0.9 TB
- Backed by TrueNAS storage
Important backup rule:
- PBS does not back up itself.
- The PBS VM is excluded from nightly backup jobs.
- TrueNAS (backup storage) is also excluded from PBS backups.
This prevents:
- Storage thrashing / self-backup loops
- Catastrophic performance impact from PBS trying to back up its own datastore
PBS instead focuses on backing up critical application VMs only.
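In current Proxmox VE, the nightly job and its exclusions live in `/etc/pve/jobs.cfg`. A sketch of what such an entry could look like is below; the job id, schedule, and the PVE-side storage name for the PBS datastore are assumptions and should be confirmed under Datacenter → Backup:

```
# /etc/pve/jobs.cfg (illustrative excerpt)
vzdump: backup-nightly
    schedule 02:00
    storage pbs-main
    all 1
    exclude 105,108
    mode snapshot
    enabled 1
```

The `exclude 105,108` line is what keeps the PBS and TrueNAS VMs out of the nightly run.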
5. Workload Layout
The current cluster runs a mix of core services and lab workloads. VM IDs/names:
5.1 VMs on pve1
- 100 – `web`
- Role: front-end web / landing page (e.g., torresvault.com)
- 101 – `Kuma`
- Role: uptime / service monitoring
- 105 – `pbs`
- Role: Proxmox Backup Server VM
- 106 – `n8n`
- Role: automation / workflow engine
- 107 – `npm`
- Role: Nginx Proxy Manager (reverse proxy)
- 108 – `truenas`
- Role: storage VM / backup target
- 110 – `Prometheus`
- Role: metrics + Grafana stack
- 111 – `iperf-vlan1` (pve1 local test network)
- 112 – `iperf-vlan10`
- 113 – `iperf-vlan20`
- 114 – `iperf-vlan1`
- Role: lab VMs for VLAN and bandwidth testing
- 115 – `portainer-mgmt`
- Role: container management for other hosts
- 116 – `wiki`
- Role: DokuWiki instance hosting TorresVault documentation
5.2 VMs on pve2
- 102 – `next`
- Role: Nextcloud services
- 103 – `immich`
- Role: photo / media backup
- 104 – `jellyfin`
- Role: media server
- 109 – `RDPjump`
- Role: jump host / remote access box
- Role: jump host / remote access box
These assignments are not HA; VMs are pinned to nodes and protected by PBS backups.
6. Backup & Restore Strategy
Backups are handled by the PBS VM (ID 105), writing into datastore `pbs-main` hosted on TrueNAS.
Key points:
- Backup scope:
- Critical service VMs:
- web (100)
- Kuma (101)
- next (102)
- immich (103)
- jellyfin (104)
- n8n (106)
- npm (107)
- Prometheus (110)
- portainer-mgmt (115)
- wiki (116)
- other lab VMs as needed
- Excluded:
- `truenas` VM (108)
- `pbs` VM (105) itself
- Datastore: `pbs-main`
- Size ~5.38 TB; currently lightly used
- Retention policy: configured in PBS; typical pattern (can be tuned):
- e.g., 7 daily, 4 weekly, 12 monthly (confirm in PBS UI)
- Verification:
- PBS supports scheduled verify jobs to detect bit-rot
- Prune jobs:
- Automated prune / garbage collection jobs run regularly to reclaim space
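The retention pattern above maps directly onto PBS keep options. A hedged CLI sketch for a single backup group is below; the repository string is a placeholder (the PBS VM's address is not recorded here), and the counts mirror the pattern above but must be confirmed in the PBS UI:

```shell
# Illustrative only; normally this is configured as a scheduled prune job
# in the PBS UI rather than run by hand.
proxmox-backup-client prune vm/100 \
  --repository root@pam@<pbs-ip>:pbs-main \
  --keep-daily 7 --keep-weekly 4 --keep-monthly 12
```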
6.1 Rationale for Exclusions
- TrueNAS holds the backup data; backing it up with PBS would:
- Recursively back up the PBS datastore
- Multiply IO load
- Risk saturating disks and disrupting backups
- PBS VM is the backup system itself:
- Backing PBS to its own datastore is logically unsound
- If PBS is lost, it can be rebuilt from Proxmox templates and reattached to the existing datastore
Instead, PBS backup jobs focus on stateless or easily rebuildable VMs whose important data is stored externally (e.g., on TrueNAS, in Nextcloud data, or other locations).
6.2 Restore Process (Operational Runbook)
Scenario: single VM failure
1. Identify the affected VM in the Proxmox UI.
2. In the PBS UI, go to **Datastore → pbs-main → Content → VM group** and select the latest successful backup.
3. Choose **Restore**:
   - Target node: original host (or an alternate host if needed)
   - Disk storage: appropriate local storage (`local-lvm`, `apps-pool`, etc.)
4. Start the VM in Proxmox and validate:
   - Application health checks (web UI, API, etc.)
   - Network connectivity (LAN, DNS, etc.)
Scenario: node loss (pve1 or pve2)
1. Replace/fix hardware and reinstall Proxmox VE.
2. Rejoin the node to `torres-cluster`.
3. Recreate the necessary storages pointing at local disks.
4. From PBS, restore VMs to the rebuilt node using the procedure above.
Scenario: PBS VM lost but TrueNAS datastore intact
1. Recreate the PBS VM from a Proxmox template.
2. Reattach the existing `pbs-main` datastore on TrueNAS.
3. PBS will rediscover the existing backups.
4. Resume normal operations.
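Reattaching the datastore on a rebuilt PBS amounts to recreating its entry in `/etc/proxmox-backup/datastore.cfg` pointing at the existing chunk directory. A sketch is below; the mount path is an assumption, so substitute the path where the TrueNAS-backed storage is actually mounted:

```
# /etc/proxmox-backup/datastore.cfg (illustrative excerpt)
datastore: pbs-main
    path /mnt/truenas/pbs-main
    comment Main backup datastore on TrueNAS
```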
7. Monitoring & Observability
Monitoring in TorresVault is layered:
- Proxmox Node Metrics
- Built-in graphs: CPU, RAM, I/O, network, load average per node and per VM
- Prometheus VM (110)
- Scrapes metrics from nodes and services
- Grafana dashboards provide historical views
- Kuma VM (101)
- Synthetic checks / uptime monitoring for key services
- PBS Analytics
- Shows backup / prune / verify job history and datastore usage
- TrueNAS UI
- Disk health, pool status
Operational practice:
- Use Kuma for "is it up?"
- Use Proxmox + Grafana for "how is it behaving?"
- Use the PBS UI for backup health.
8. Operational Procedures (Day-to-Day)
8.1 Adding a New VM
1. Decide which node (pve1 vs pve2) based on workload:
   - Storage-heavy → whichever has more free disk
   - Media or GPU heavy (later) → pve2
2. Create the VM in Proxmox:
   - Attach to `vmbr0` for LAN access
   - Store disks on `local-lvm`, `apps-pool`, or `VM-pool`
3. Install the OS and configure the network.
4. In **PBS**, add the VM to an existing or new backup group.
5. Verify that the first backup completes successfully.
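The creation step can also be done from the node's shell. The sketch below is illustrative only: VMID 117, the name, the ISO filename, and the sizes are examples, not existing VMs or files in this cluster.

```shell
qm create 117 \
  --name new-app \
  --memory 4096 --cores 2 \
  --ostype l26 --scsihw virtio-scsi-pci \
  --net0 virtio,bridge=vmbr0 \
  --scsi0 local-lvm:32 \
  --ide2 local:iso/debian-12-netinst.iso,media=cdrom
```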
8.2 Maintenance & Patching
Proxmox nodes:
1. Live-migrate or gracefully shut down VMs on the target node if needed.
2. `apt update && apt full-upgrade` on the node (via console or SSH).
3. Reboot the node.
4. Verify:
   - Corosync quorum is healthy
   - VMs auto-started where expected
PBS & TrueNAS:
- Update during low-traffic windows (overnight).
- Confirm backups resume successfully after upgrades.
8.3 Power-Down / Power-Up Order
Planned maintenance that requires full stack shutdown:
Shutdown order:
1. Application VMs (web, Nextcloud, Immich, Jellyfin, etc.)
2. Monitoring VMs (Kuma, Prometheus)
3. PBS VM
4. TrueNAS VM
5. pve2 node
6. pve1 node (last Proxmox node)
7. Network gear / UPS if necessary
Power-up order:
1. Network gear & UPS
2. pve1 and pve2 nodes
3. TrueNAS VM
4. PBS VM
5. Core apps (web, Nextcloud, Immich, Jellyfin, n8n, NPM, wiki)
6. Monitoring stack (Kuma, Prometheus/Grafana)
This order ensures that storage is ready before PBS, and that PBS is ready before any dependent VMs (if any use backup features such as guest-initiated restore).
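Much of the power-up order can be encoded in the VM configs themselves via `onboot` and `startup`. An excerpt for the TrueNAS VM is sketched below; the `order`/`up` values are assumptions (the `up` delay, in seconds, holds back the next VM in the sequence), and the same pattern would apply to the PBS and application VMs with higher `order` values:

```
# /etc/pve/qemu-server/108.conf (excerpt, illustrative)
# Equivalent to: qm set 108 --onboot 1 --startup order=1,up=120
onboot: 1
startup: order=1,up=120
```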
9. Risks, Constraints & Known Limitations
- No shared storage / HA:
- VMs are pinned to nodes. If a node fails, VMs require restore or manual migration.
- Older hardware:
- CPUs and DDR3L era platforms limit performance and efficiency.
- Local disks only:
- Mix of older 1 TB HDDs; no SSD-only tiers for high IOPS workloads.
- PBS & TrueNAS both virtualized on pve1:
- Concentrates backup and storage responsibility on a single compute node.
- Limited RAM (32 GB per node):
- Constrains number of memory-heavy workloads.
These are acceptable for a home lab / prosumer environment but are captured here explicitly for future planning.
10. Future Improvements (Pointer to Roadmap)
The following items are out of scope for this document but are tracked on the roadmap:
- Dedicated NAS / Proxmox hybrid with ASRock Rack X570D4U and 16 Γ 6 TB SAS
- Standalone physical TrueNAS or SCALE box
- Additional Proxmox node or Mini-PC cluster for Kubernetes
- 10 GbE or faster interconnect between nodes
- Storage tiers (NVMe → SSD → HDD → PBS)
- Better separation of roles:
- PBS on dedicated hardware
- TrueNAS on physical host
- Proxmox nodes focused on compute
See: Proxmox / TorresVault 2.0 Roadmap (to be created).
