===== Proxmox Cluster =====

This document describes the **current TorresVault Proxmox cluster** as it exists today.
It focuses on hardware, networking, storage, backups, and day-to-day operations, and is the
authoritative reference for how virtualization is implemented in TorresVault 2.0 (current state).

Future redesigns and planned improvements are tracked separately on the [[torresvault: roadmap page.

----
==== 1. High-Level Overview ====

The Proxmox environment is a **2-node cluster with a qdevice**, running on older but solid Intel
desktop platforms with expanded SATA and NIC capacity.

* **Cluster name:** `torres-cluster`
* **Hypervisor:** Proxmox VE 9.x
* **Nodes:** `pve1`, `pve2`
* **Quorum helper:** Raspberry Pi running `corosync-qdevice`
* **Backup server:** Proxmox Backup Server (PBS) VM
* **Storage backend:** Local SATA disks per node; TrueNAS VM providing backup storage

High-level logical view:

* **Compute layer:** pve1, pve2
* **Storage layer:** local disks per node, plus TrueNAS VM used as backup target
* **Backup layer:** PBS VM writing to TrueNAS (`pbs-main` datastore)
* **Monitoring layer:** Prometheus + Grafana, Kuma, Proxmox built-ins

The design intentionally **does not use shared storage for HA**; instead, VMs are pinned to nodes
and protected via **image-based backups to PBS**.

----
| + | |||
| + | ==== 2. Physical Hosts & Hardware ==== | ||
| + | |||
| + | === 2.1 pve1 === | ||
| + | |||
| + | **Role:** General compute node, many core services | ||
| + | |||
| + | * **CPU:** Intel Core i5-2500 @ 3.30 GHz | ||
| + | * 4 cores / 4 threads, 1 socket | ||
| + | * **RAM:** 32 GB DDR3L 1600 MHz | ||
| + | * 4 × 8 GB Timetec DDR3L (PC3L-12800) UDIMM kit | ||
| + | * **Motherboard / Chipset:** Older Intel desktop platform | ||
| + | * **Disk controllers: | ||
| + | * Onboard Intel SATA controller (RAID mode) | ||
| + | * ASMedia ASM1064 SATA controller | ||
| + | * **GLOTRENDS SA3112-C 12-port PCIe x1 SATA expansion card** | ||
| + | * **Disk inventory (approximate): | ||
| + | * Several **1 TB WDC WD1003FBYX** enterprise HDDs | ||
| + | * Several **1 TB Seagate ST91000640NS** HDDs | ||
| + | * System / boot disk plus ~10–12 × 1 TB data disks | ||
| + | * **Network interfaces: | ||
| + | * Onboard Intel 82579LM 1 GbE | ||
| + | * **Intel I350 quad-port 1 GbE** PCIe NIC | ||
| + | * **Installed OS:** Proxmox VE 9.x (legacy BIOS) | ||
| + | * **Kernel example:** 6.14.x-pve | ||
| + | |||
| + | Primary role summary: | ||
| + | |||
| + | * Runs web, monitoring, automation, PBS, TrueNAS and various lab VMs | ||
| + | * Acts as one half of the Proxmox cluster | ||
| + | * Provides local LVM/ZFS storage for its own VMs | ||
| + | |||
| + | --- | ||
| + | |||
| + | === 2.2 pve2 === | ||
| + | |||
| + | **Role:** General compute node, media & application workloads | ||
| + | |||
| + | * **CPU:** Intel Core i5-4570 @ 3.20 GHz | ||
| + | * 4 cores / 4 threads, 1 socket | ||
| + | * **RAM:** 32 GB DDR3L 1600 MHz | ||
| + | * Same Timetec 4 × 8 GB kit as pve1 | ||
| + | * **Disk controllers: | ||
| + | * Intel 9-Series SATA controller (AHCI) | ||
| + | * ASMedia ASM1064 SATA controller | ||
| + | * **GLOTRENDS SA3112-C 12-port PCIe x1 SATA expansion card** | ||
| + | * **Disk inventory (approximate): | ||
| + | * Multiple **1 TB Seagate ST91000640NS** HDDs | ||
| + | * System disk plus ~10–12 × 1 TB data disks | ||
| + | * **Network interfaces: | ||
| + | * Intel I350 quad-port 1 GbE (matching pve1) | ||
| + | * **Installed OS:** Proxmox VE 9.x (EFI boot) | ||
| + | |||
| + | Primary role summary: | ||
| + | |||
| + | * Runs general apps, including media and image workloads (Nextcloud, Immich, Jellyfin, etc.) | ||
| + | * Acts as second node of Proxmox cluster | ||
| + | * Mirrors pve1’s storage pattern with local disks only | ||
| + | |||
| + | --- | ||
| + | |||
| + | === 2.3 QDevice === | ||
| + | |||
| + | * Hardware: **Raspberry Pi** (dedicated qdevice) | ||
| + | * Software: `corosync-qdevice` | ||
| + | * Purpose: provides voting/ | ||
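
The quorum arithmetic behind the qdevice can be sketched as follows. This is a simplified model of corosync's majority voting (the helper function and vote counts are illustrative, not taken from the live cluster config):

```python
# Simplified model of corosync majority voting (illustrative, not the real API).
def quorate(votes_present: int, total_votes: int) -> bool:
    """A cluster has quorum when strictly more than half of all votes are present."""
    return votes_present > total_votes // 2

# Without the qdevice: 2 total votes. Losing one node leaves 1 of 2 votes.
assert not quorate(votes_present=1, total_votes=2)   # no quorum: cluster freezes

# With the qdevice: 3 total votes (pve1 + pve2 + qdevice).
assert quorate(votes_present=2, total_votes=3)       # one node + qdevice: still quorate
assert not quorate(votes_present=1, total_votes=3)   # isolated single node: no quorum
```

This is exactly why the Pi exists: it breaks the 1-vs-1 tie that a plain 2-node cluster suffers when either node goes down.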
| + | |||
| + | --- | ||
| + | |||
| + | ==== 3. Network Design ==== | ||
| + | |||
| + | The Proxmox cluster uses: | ||
| + | |||
| + | * **Main LAN:** `192.168.1.0/ | ||
| + | * Gateway: **UCG Max** at `192.168.1.1` | ||
| + | * **Cluster link:** dedicated point-to-point /30 network: | ||
| + | * pve1: `10.10.10.1/ | ||
| + | * pve2: `10.10.10.2/ | ||
| + | |||
| + | UniFi VLANs exist on the network side (stark_user, | ||
| + | for now, Proxmox sees mostly the flat LAN plus specific lab networks for testing. | ||
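
The /30 on the cluster link leaves exactly two usable host addresses, which is why it fits a point-to-point interconnect with nothing else able to join. A quick check with Python's standard `ipaddress` module:

```python
import ipaddress

# A /30 contains 4 addresses: network, two hosts, broadcast.
link = ipaddress.ip_network("10.10.10.0/30")
hosts = [str(h) for h in link.hosts()]
print(hosts)  # exactly pve1 and pve2
```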
| + | |||
| + | --- | ||
| + | |||
| + | === 3.1 pve1 Network Interfaces === | ||
| + | |||
| + | From the Proxmox UI: | ||
| + | |||
| + | * **bond0** – Linux bond | ||
| + | * Mode: active-backup | ||
| + | * Slaves: `enp1s0f1`, `enp1s0f2` | ||
| + | * No IP address (used as bridge slave) | ||
| + | * **eno1** – onboard NIC | ||
| + | * Not active, not autostart (reserved / spare) | ||
| + | * **enp1s0f0** | ||
| + | * IP: `10.10.10.1/ | ||
| + | * Usage: dedicated **cluster interconnect** to pve2 | ||
| + | * **enp1s0f1 / enp1s0f2** | ||
| + | * Members of `bond0` | ||
| + | * **enp1s0f3** | ||
| + | * Currently unused | ||
| + | * **vmbr0** – Linux bridge | ||
| + | * IP: `192.168.1.150/ | ||
| + | * Gateway: `192.168.1.1` (UCG Max) | ||
| + | * Bridge port: `bond0` | ||
| + | * All LAN-facing VMs attach here | ||
| + | |||
| + | Design notes: | ||
| + | |||
| + | * Two NIC ports for LAN via bond (`bond0` → `vmbr0`) for basic redundancy. | ||
| + | * One NIC port for cluster link (`enp1s0f0`). | ||
| + | * One NIC still available (`enp1s0f3`) for future use (e.g., storage VLAN or DMZ). | ||
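
A hedged sketch of what `/etc/network/interfaces` on pve1 likely looks like for this layout. This is paraphrased from the UI description above, not copied from the host; option values such as `bond-miimon` are assumptions:

```
# Sketch only -- verify against the real /etc/network/interfaces on pve1.
auto bond0
iface bond0 inet manual
    bond-slaves enp1s0f1 enp1s0f2
    bond-mode active-backup
    bond-miimon 100

auto enp1s0f0
iface enp1s0f0 inet static
    address 10.10.10.1/30

auto vmbr0
iface vmbr0 inet static
    address 192.168.1.150/24
    gateway 192.168.1.1
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0
```

pve2's file would be symmetric with `enp2s0fX` names and `.151`/`.2` addresses.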
| + | |||
| + | --- | ||
| + | |||
| + | === 3.2 pve2 Network Interfaces === | ||
| + | |||
| + | * **bond0** – Linux bond | ||
| + | * Mode: active-backup | ||
| + | * Slaves: `enp2s0f1`, `enp2s0f2` | ||
| + | * **eno1** – onboard NIC | ||
| + | * Not active | ||
| + | * **enp2s0f0** | ||
| + | * IP: `10.10.10.2/ | ||
| + | * Usage: dedicated **cluster interconnect** to pve1 | ||
| + | * **enp2s0f1 / enp2s0f2** | ||
| + | * Members of `bond0` | ||
| + | * **enp2s0f3** | ||
| + | * Currently unused | ||
| + | * **vmbr0** | ||
| + | * IP: `192.168.1.151/ | ||
| + | * Gateway: `192.168.1.1` | ||
| + | * Bridge port: `bond0` | ||
| + | |||
| + | Design notes: | ||
| + | |||
| + | * Symmetric layout with pve1 to make VM migration and cabling easier. | ||
| + | * Cluster traffic is physically separated from LAN. | ||
| + | |||
| + | --- | ||
| + | |||
| + | === 3.3 Logical Topology Diagram (Text) === | ||
| + | |||
| + | * **LAN 192.168.1.0/ | ||
| + | * UCG Max (192.168.1.1) | ||
| + | * pve1 (192.168.1.150, | ||
| + | * pve2 (192.168.1.151, | ||
| + | * Other LAN clients / services | ||
| + | |||
| + | * **Cluster link 10.10.10.0/ | ||
| + | * pve1 – `enp1s0f0` → `10.10.10.1` | ||
| + | * pve2 – `enp2s0f0` → `10.10.10.2` | ||
| + | * Single direct cable between nodes | ||
| + | |||
| + | * **Quorum** | ||
| + | * Raspberry Pi qdevice on LAN (IP TBD), reachable from both nodes | ||
| + | |||
| + | This separation keeps corosync and cluster traffic off the main LAN and avoids | ||
| + | cluster instability if LAN becomes noisy. | ||
| + | |||
| + | ---- | ||
| + | |||
| + | ==== 4. Storage Architecture (Current) ==== | ||
| + | |||
| + | There are **three main storage layers**: | ||
| + | |||
| + | * **Local node storage** (per-node disks, LVM/ZFS) | ||
| + | * **TrueNAS VM** – used as backup target | ||
| + | * **Proxmox Backup Server (PBS)** – used for image-based backups | ||
| + | |||
| + | --- | ||
| + | |||
| + | === 4.1 Local Storage on pve1 === | ||
| + | |||
| + | Typical Proxmox storages (names as shown in the UI): | ||
| + | |||
| + | * `local (pve1)` – boot disk, ISOs, templates | ||
| + | * `local-lvm (pve1)` – LVM-thin for VM disks | ||
| + | * `VM-pool (pve1)` – additional pool for VMs (local disks) | ||
| + | * `PBS (pve1)` – smaller storage local to the PBS VM (e.g., for metadata or staging) | ||
| + | |||
| + | Backed by: | ||
| + | |||
| + | * WDC WD1003FBYX and Seagate ST91000640NS disks on Intel/ | ||
| + | * No hardware RAID; uses Proxmox’s software stack and/or ZFS/LVM | ||
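
For reference, storages like these live in `/etc/pve/storage.cfg`. The sketch below shows the general shape only; the `local` and `local-lvm` entries follow Proxmox defaults, while the thin-pool and volume-group names for `VM-pool` are assumptions, not values read from the node:

```
# Sketch -- thinpool/vgname for VM-pool are assumed, check storage.cfg on pve1.
dir: local
    path /var/lib/vz
    content iso,vztmpl,backup

lvmthin: local-lvm
    thinpool data
    vgname pve
    content images,rootdir

lvmthin: VM-pool
    thinpool vmpool
    vgname vmvg
    content images,rootdir
    nodes pve1
```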
| + | |||
| + | --- | ||
| + | |||
| + | === 4.2 Local Storage on pve2 === | ||
| + | |||
| + | * `local (pve2)` – boot, ISOs, templates | ||
| + | * `local-lvm (pve2)` – VM disks | ||
| + | * `apps-pool (pve2)` – main pool for application VMs (Nextcloud, Immich, Jellyfin, etc.) | ||
| + | |||
| + | Also backed by multiple 1 TB Seagate ST91000640NS disks via the same combination of controllers. | ||
| + | |||
| + | --- | ||
| + | |||
| + | === 4.3 TrueNAS VM === | ||
| + | |||
| + | * **VM ID:** 108 (`truenas`) | ||
| + | * **Node:** pve1 | ||
| + | * **Purpose: | ||
| + | * **Storage role:** Backing store for **PBS datastore `pbs-main`** | ||
| + | * **Backups: | ||
| + | * Reason: TrueNAS holds the backups; recursively backing it up is inefficient and can overload | ||
| + | the system. | ||
| + | |||
| + | Over time, this VM may be migrated to a **dedicated physical NAS**, but for now it is virtualized. | ||
| + | |||
| + | --- | ||
| + | |||
| + | === 4.4 Proxmox Backup Server (PBS) VM === | ||
| + | |||
| + | * **VM ID:** 105 (`pbs`) | ||
| + | * **Node:** pve1 | ||
| + | * **OS:** Proxmox Backup Server 4.x | ||
| + | * **CPU:** 3 vCPUs | ||
| + | * **RAM:** 4 GB | ||
| + | * **Main datastore: | ||
| + | * Size: ~5.38 TB | ||
| + | * Used: ~0.9 TB | ||
| + | * Backed by TrueNAS storage | ||
| + | |||
| + | Important backup rule: | ||
| + | |||
| + | * **PBS does not back up itself.** | ||
| + | * The PBS VM is excluded from nightly backup jobs. | ||
| + | * **TrueNAS (backup storage) is also excluded** from PBS backups. | ||
| + | |||
| + | This prevents: | ||
| + | |||
| + | * Storage thrashing / self-backup loops | ||
| + | * Catastrophic performance impact from PBS trying to back up its own datastore | ||
| + | |||
| + | PBS instead focuses on backing up **critical application VMs** only. | ||
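
The "back up everything except PBS and TrueNAS" rule can be expressed as a single `vzdump` job in `/etc/pve/jobs.cfg`. This is a sketch: the job id, schedule, and the PVE-side storage name are assumptions; the actual values live under Datacenter → Backup:

```
# Sketch -- job id, schedule, and storage name are assumed.
vzdump: backup-nightly
    schedule 02:00
    storage pbs-main
    all 1
    exclude 105,108
    mode snapshot
    enabled 1
```

`all 1` with `exclude 105,108` means new VMs are backed up by default, and the PBS (105) and TrueNAS (108) VMs stay out of the job without needing per-VM opt-outs.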
| + | |||
| + | ---- | ||
| + | |||
| + | ==== 5. Workload Layout ==== | ||
| + | |||
| + | The current cluster runs a mix of core services and lab workloads. VM IDs/ | ||
| + | |||
| + | === 5.1 VMs on pve1 === | ||
| + | |||
| + | * **100 – `web`** | ||
| + | * Role: front-end web / landing page (e.g., torresvault.com) | ||
| + | * **101 – `Kuma`** | ||
| + | * Role: uptime / service monitoring | ||
| + | * **105 – `pbs`** | ||
| + | * Role: Proxmox Backup Server VM | ||
| + | * **106 – `n8n`** | ||
| + | * Role: automation / workflow engine | ||
| + | * **107 – `npm`** | ||
| + | * Role: Nginx Proxy Manager (reverse proxy) | ||
| + | * **108 – `truenas`** | ||
| + | * Role: storage VM / backup target | ||
| + | * **110 – `Prometheus`** | ||
| + | * Role: metrics + Grafana stack | ||
| + | * **112 – `iperf-vlan10`** | ||
| + | * **113 – `iperf-vlan20`** | ||
| + | * **114 – `iperf-vlan1`** | ||
| + | * Role: lab VMs for VLAN and bandwidth testing | ||
| + | * **115 – `portainer-mgmt`** | ||
| + | * Role: container management for other hosts | ||
| + | * **116 – `wiki`** | ||
| + | * Role: DokuWiki instance hosting TorresVault documentation | ||
| + | * **111 – `iperf-vlan1` (pve1 local test network)** | ||
| + | |||
| + | --- | ||
| + | |||
| + | === 5.2 VMs on pve2 === | ||
| + | |||
| + | * **102 – `next`** | ||
| + | * Role: Nextcloud services | ||
| + | * **103 – `immich`** | ||
| + | * Role: photo / media backup | ||
| + | * **104 – `jellyfin`** | ||
| + | * Role: media server | ||
| + | * **109 – `RDPjump`** | ||
| + | * Role: jump host / remote access box | ||
| + | |||
| + | These assignments are **not HA**; VMs are pinned to nodes and protected by PBS backups. | ||
| + | |||
| + | ---- | ||
| + | |||
| + | ==== 6. Backup & Restore Strategy ==== | ||
| + | |||
| + | Backups are handled by the **PBS VM (ID 105)**, writing into datastore `pbs-main` | ||
| + | hosted on TrueNAS. | ||
| + | |||
| + | Key points: | ||
| + | |||
| + | * **Backup scope:** | ||
| + | * Critical service VMs: | ||
| + | * web (100) | ||
| + | * Kuma (101) | ||
| + | * next (102) | ||
| + | * immich (103) | ||
| + | * jellyfin (104) | ||
| + | * n8n (106) | ||
| + | * npm (107) | ||
| + | * Prometheus (110) | ||
| + | * portainer-mgmt (115) | ||
| + | * wiki (116) | ||
| + | * other lab VMs as needed | ||
| + | * **Excluded: | ||
| + | * `truenas` VM (108) | ||
| + | * `pbs` VM (105) itself | ||
| + | * **Datastore: | ||
| + | * Size ~5.38 TB; currently lightly used | ||
| + | * **Retention policy:** configured in PBS; typical pattern (can be tuned): | ||
| + | * e.g., 7 daily, 4 weekly, 12 monthly (confirm in PBS UI) | ||
| + | * **Verification: | ||
| + | * PBS supports scheduled **verify jobs** to detect bit-rot | ||
| + | * **Prune jobs:** | ||
| + | * Automated prune / garbage collection jobs run regularly to reclaim space | ||
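
The effect of a "7 daily, 4 weekly, 12 monthly" pattern can be sanity-checked with a small simulation. This is a simplified model of grouping-based retention (keep the newest snapshot in each of the most recent N days / ISO weeks / months), not the exact PBS prune algorithm:

```python
from datetime import date, timedelta

def prune_keep(snapshot_dates, keep_daily=7, keep_weekly=4, keep_monthly=12):
    """Return the set of snapshot dates a grouping-based policy would keep."""
    snaps = sorted(snapshot_dates, reverse=True)  # newest first
    keep = set()

    def mark(key_fn, limit):
        kept_keys = set()
        for d in snaps:
            k = key_fn(d)
            if k in kept_keys or len(kept_keys) == limit:
                continue
            kept_keys.add(k)          # first hit = newest snapshot in this period
            keep.add(d)

    mark(lambda d: d, keep_daily)                     # one per day
    mark(lambda d: d.isocalendar()[:2], keep_weekly)  # one per ISO week
    mark(lambda d: (d.year, d.month), keep_monthly)   # one per month
    return keep

# 23 consecutive daily snapshots: 2026-01-01 .. 2026-01-23.
snaps = [date(2026, 1, 1) + timedelta(days=i) for i in range(23)]
kept = prune_keep(snaps)
print(sorted(kept))
```

With daily backups, the rules overlap heavily (the newest snapshot satisfies all three), so the retained count grows much more slowly than 7 + 4 + 12.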
| + | |||
| + | --- | ||
| + | |||
| + | === 6.1 Rationale for Exclusions === | ||
| + | |||
| + | * **TrueNAS** holds the backup data; backing it up with PBS would: | ||
| + | * Recursively back up the PBS datastore | ||
| + | * Multiply IO load | ||
| + | * Risk saturating disks and disrupting backups | ||
| + | |||
| + | * **PBS VM** is the backup system itself: | ||
| + | * Backing PBS to its own datastore is logically unsound | ||
| + | * If PBS is lost, it can be rebuilt from Proxmox templates and reattached to | ||
| + | existing datastore | ||
| + | |||
| + | Instead, PBS backup jobs focus on **stateless or easily rebuildable VMs** where immutable data | ||
| + | is stored externally (e.g., on TrueNAS, Nextcloud data, or other locations). | ||
| + | |||
| + | --- | ||
| + | |||
| + | === 6.2 Restore Process (Operational Runbook) === | ||
| + | |||
| + | **Scenario: single VM failure** | ||
| + | |||
| + | 1. Identify affected VM in Proxmox UI. | ||
| + | 2. In PBS UI: | ||
| + | * Go to **Datastore → pbs-main → Content → VM group**. | ||
| + | * Select the latest successful backup. | ||
| + | 3. Choose **Restore**: | ||
| + | * Target node: original host (or alternate host if needed) | ||
| + | * Disk storage: appropriate local storage (`local-lvm`, | ||
| + | 4. Start VM in Proxmox and validate: | ||
| + | * Application health checks (web UI, API, etc.) | ||
| + | * Network connectivity (LAN, DNS, etc.) | ||
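
The connectivity part of step 4 can be scripted. A minimal reachability probe using only the standard library (`port_open` is a hypothetical helper, and the commented example target is illustrative, not a documented endpoint):

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example usage after a restore (illustrative target):
# port_open("192.168.1.150", 8006)  # is the Proxmox UI reachable?
```

Looping this over the restored VM's service ports gives a quick pass/fail before deeper application checks.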
| + | |||
| + | **Scenario: node loss (pve1 or pve2)** | ||
| + | |||
| + | 1. Replace/fix hardware and reinstall Proxmox VE. | ||
| + | 2. Rejoin node to `torres-cluster`. | ||
| + | 3. Recreate necessary storages pointing at local disks. | ||
| + | 4. From PBS, restore VMs to the rebuilt node using the procedure above. | ||
| + | |||
| + | **Scenario: PBS VM lost but TrueNAS datastore intact** | ||
| + | |||
| + | 1. Recreate PBS VM from Proxmox template. | ||
| + | 2. Reattach existing `pbs-main` datastore on TrueNAS. | ||
| + | 3. PBS will rediscover existing backups. | ||
| + | 4. Resume normal operations. | ||
| + | |||
| + | ---- | ||
| + | |||
| + | ==== 7. Monitoring & Observability ==== | ||
| + | |||
| + | Monitoring in TorresVault is layered: | ||
| + | |||
| + | * **Proxmox Node Metrics** | ||
| + | * Built-in graphs: CPU, RAM, I/O, network, load average per node and per VM | ||
| + | * **Prometheus VM (110)** | ||
| + | * Scrapes metrics from nodes and services | ||
| + | * Grafana dashboards provide historical views | ||
| + | * **Kuma VM (101)** | ||
| + | * Synthetic checks / uptime monitoring for key services | ||
| + | * **PBS Analytics** | ||
| + | * Shows backup / prune / verify job history and datastore usage | ||
| + | * **TrueNAS UI** | ||
| + | * Disk health, pool status | ||
| + | |||
| + | Operational practice: | ||
| + | |||
| + | * Use Kuma for **“is it up?”** | ||
| + | * Use Proxmox + Grafana for **“how is it behaving? | ||
| + | * Use PBS UI for **backup health**. | ||
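
If the Prometheus VM scrapes the nodes via the common community `prometheus-pve-exporter`, the scrape job in `prometheus.yml` looks roughly like this. Everything here is an assumption (exporter placement, port 9221, module name); it only illustrates the shape of the config:

```yaml
# Sketch -- targets, port, and module name are assumed, not taken from VM 110.
scrape_configs:
  - job_name: "pve"
    metrics_path: /pve
    params:
      module: [default]
    static_configs:
      - targets:
          - 192.168.1.150:9221
          - 192.168.1.151:9221
```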
| + | |||
| + | ---- | ||
| + | |||
| + | ==== 8. Operational Procedures (Day-to-Day) ==== | ||
| + | |||
| + | === 8.1 Adding a New VM === | ||
| + | |||
| + | 1. Decide which node (pve1 vs pve2) based on workload: | ||
| + | * Storage-heavy → whichever has more free disk | ||
| + | * Media or GPU heavy (later) → pve2 | ||
| + | 2. Create VM in Proxmox: | ||
| + | * Attach to `vmbr0` for LAN access | ||
| + | * Store disks on `local-lvm`, | ||
| + | 3. Install OS and configure network. | ||
| + | 4. In **PBS**, add VM to an existing or new backup group. | ||
| + | 5. Verify first backup completes successfully. | ||
| + | |||
| + | --- | ||
| + | |||
| + | === 8.2 Maintenance & Patching === | ||
| + | |||
| + | **Proxmox nodes:** | ||
| + | |||
| + | 1. Live-migrate or gracefully shut down VMs on the target node if needed. | ||
| + | 2. `apt update && apt full-upgrade` on the node (via console or SSH). | ||
| + | 3. Reboot node. | ||
| + | 4. Verify: | ||
| + | * Corosync quorum healthy | ||
| + | * VMs auto-started where expected | ||
| + | |||
| + | **PBS & TrueNAS: | ||
| + | |||
| + | * Update during low-traffic windows (overnight). | ||
| + | * Confirm backups resume successfully after upgrades. | ||
| + | |||
| + | --- | ||
| + | |||
| + | === 8.3 Power-Down / Power-Up Order === | ||
| + | |||
| + | Planned maintenance that requires full stack shutdown: | ||
| + | |||
| + | **Shutdown order:** | ||
| + | |||
| + | 1. Application VMs (web, Nextcloud, Immich, Jellyfin, etc.) | ||
| + | 2. Monitoring VMs (Kuma, Prometheus) | ||
| + | 3. PBS VM | ||
| + | 4. TrueNAS VM | ||
| + | 5. pve2 node | ||
| + | 6. pve1 node (last Proxmox node) | ||
| + | 7. Network gear / UPS if necessary | ||
| + | |||
| + | **Power-up order:** | ||
| + | |||
| + | 1. Network gear & UPS | ||
| + | 2. pve1 and pve2 nodes | ||
| + | 3. TrueNAS VM | ||
| + | 4. PBS VM | ||
| + | 5. Core apps (web, Nextcloud, Immich, Jellyfin, n8n, NPM, wiki) | ||
| + | 6. Monitoring stack (Kuma, Prometheus/ | ||
| + | |||
| + | This order ensures that **storage is ready before PBS**, and **PBS is ready before | ||
| + | depending VMs** (if any use backup features like guest-initiated restore). | ||
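
The dependency reasoning behind these orderings can be encoded and checked mechanically. A small sketch; the layer names and the dependency map are my own modeling of the text above, not configuration pulled from the cluster:

```python
# Power-up dependencies: each layer lists what must already be running.
DEPENDS_ON = {
    "pve-nodes":  {"network"},
    "truenas":    {"pve-nodes"},
    "pbs":        {"truenas"},      # datastore pbs-main lives on TrueNAS
    "apps":       {"pve-nodes"},
    "monitoring": {"apps"},         # monitoring observes the apps
}

# The documented power-up sequence, collapsed into layers.
POWER_UP = ["network", "pve-nodes", "truenas", "pbs", "apps", "monitoring"]

def order_ok(order, deps):
    """True if every layer appears after all of its dependencies."""
    pos = {name: i for i, name in enumerate(order)}
    return all(pos[dep] < pos[layer] for layer, ds in deps.items() for dep in ds)

assert order_ok(POWER_UP, DEPENDS_ON)
# Powering up in the shutdown sequence would violate the dependencies:
assert not order_ok(list(reversed(POWER_UP)), DEPENDS_ON)
```

Swapping any two dependent layers (e.g., PBS before TrueNAS) makes the check fail, which matches the failure mode the runbook is guarding against.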
| + | |||
| + | ---- | ||
| + | |||
| + | ==== 9. Risks, Constraints & Known Limitations ==== | ||
| + | |||
| + | * **No shared storage / HA:** | ||
| + | * VMs are pinned to nodes. If a node fails, VMs require restore or manual migration. | ||
| + | * **Older hardware: | ||
| + | * CPUs and DDR3L era platforms limit performance and efficiency. | ||
| + | * **Local disks only: | ||
| + | * Mix of older 1 TB HDDs; no SSD-only tiers for high IOPS workloads. | ||
| + | * **PBS & TrueNAS both virtualized on pve1: | ||
| + | * Concentrates backup and storage responsibility on a single compute node. | ||
| + | * **Limited RAM (32 GB per node): | ||
| + | * Constrains number of memory-heavy workloads. | ||
| + | |||
| + | These are acceptable for a home lab / prosumer environment but are captured here | ||
| + | explicitly for future planning. | ||
| + | |||
| + | ---- | ||
| + | |||
| + | ==== 10. Future Improvements (Pointer to Roadmap) ==== | ||
| + | |||
| + | The following items are **out of scope for this document** but are tracked on the roadmap: | ||
| + | |||
| + | * Dedicated **NAS / Proxmox hybrid** with ASRock Rack X570D4U and 16 × 6 TB SAS | ||
| + | * Standalone physical **TrueNAS or SCALE** box | ||
| + | * Additional **Proxmox node** or Mini-PC cluster for Kubernetes | ||
| + | * 10 GbE or faster interconnect between nodes | ||
| + | * Storage tiers (NVMe → SSD → HDD → PBS) | ||
| + | * Better separation of roles: | ||
| + | * PBS on dedicated hardware | ||
| + | * TrueNAS on physical host | ||
| + | * Proxmox nodes focused on compute | ||
| + | |||
| + | See: [[torresvault: | ||
| + | |||
| + | ---- | ||
torresvault/proxmox/cluster.1769193448.txt.gz · Last modified: by nathna
