===== Proxmox Cluster =====

This document describes the **current TorresVault Proxmox cluster** as it exists today.
It focuses on hardware, networking, storage, backups, and day-to-day operations, and is the
authoritative reference for how virtualization is implemented in TorresVault 2.0 (current state).

Future redesigns and planned improvements are tracked separately on the [[torresvault: roadmap page.

----
==== 1. High-Level Overview ====

The Proxmox environment is a **2-node cluster with a qdevice**, running on older but solid Intel
desktop platforms with expanded SATA and NIC capacity.

* **Cluster name:** `torres-cluster`
* **Hypervisor:** Proxmox VE 9.x
* **Nodes:** `pve1`, `pve2`
* **Quorum helper:** Raspberry Pi running `corosync-qdevice`
* **Backup server:** Proxmox Backup Server (PBS) VM
* **Storage backend:** Local SATA disks per node; TrueNAS VM providing backup storage

High-level logical view:

* **Compute layer:** pve1, pve2
* **Storage layer:** local disks per node, plus TrueNAS VM used as backup target
* **Backup layer:** PBS VM writing to TrueNAS (`pbs-main` datastore)
* **Monitoring layer:** Prometheus + Grafana, Kuma, Proxmox built-ins

The design intentionally **does not use shared storage for HA**; instead, VMs are pinned to nodes
and protected via **image-based backups to PBS**.

----
| + | |||
| + | ==== 2. Physical Hosts & Hardware ==== | ||
| + | |||
| + | === 2.1 pve1 === | ||
| + | |||
| + | **Role:** General compute node, many core services | ||
| + | |||
| + | * **CPU:** Intel Core i5-2500 @ 3.30 GHz | ||
| + | * 4 cores / 4 threads, 1 socket | ||
| + | * **RAM:** 32 GB DDR3L 1600 MHz | ||
| + | * 4 × 8 GB Timetec DDR3L (PC3L-12800) UDIMM kit | ||
| + | * **Motherboard / Chipset:** Older Intel desktop platform | ||
| + | * **Disk controllers: | ||
| + | * Onboard Intel SATA controller (RAID mode) | ||
| + | * ASMedia ASM1064 SATA controller | ||
| + | * **GLOTRENDS SA3112-C 12-port PCIe x1 SATA expansion card** | ||
| + | * **Disk inventory (approximate): | ||
| + | * Several **1 TB WDC WD1003FBYX** enterprise HDDs | ||
| + | * Several **1 TB Seagate ST91000640NS** HDDs | ||
| + | * System / boot disk plus ~10–12 × 1 TB data disks | ||
| + | * **Network interfaces: | ||
| + | * Onboard Intel 82579LM 1 GbE | ||
| + | * **Intel I350 quad-port 1 GbE** PCIe NIC | ||
| + | * **Installed OS:** Proxmox VE 9.x (legacy BIOS) | ||
| + | * **Kernel example:** 6.14.x-pve | ||
| + | |||
| + | Primary role summary: | ||
| + | |||
| + | * Runs web, monitoring, automation, PBS, TrueNAS and various lab VMs | ||
| + | * Acts as one half of the Proxmox cluster | ||
| + | * Provides local LVM/ZFS storage for its own VMs | ||
| + | |||
| + | --- | ||
| + | |||
| + | === 2.2 pve2 === | ||
| + | |||
| + | **Role:** General compute node, media & application workloads | ||
| + | |||
| + | * **CPU:** Intel Core i5-4570 @ 3.20 GHz | ||
| + | * 4 cores / 4 threads, 1 socket | ||
| + | * **RAM:** 32 GB DDR3L 1600 MHz | ||
| + | * Same Timetec 4 × 8 GB kit as pve1 | ||
| + | * **Disk controllers: | ||
| + | * Intel 9-Series SATA controller (AHCI) | ||
| + | * ASMedia ASM1064 SATA controller | ||
| + | * **GLOTRENDS SA3112-C 12-port PCIe x1 SATA expansion card** | ||
| + | * **Disk inventory (approximate): | ||
| + | * Multiple **1 TB Seagate ST91000640NS** HDDs | ||
| + | * System disk plus ~10–12 × 1 TB data disks | ||
| + | * **Network interfaces: | ||
| + | * Intel I350 quad-port 1 GbE (matching pve1) | ||
| + | * **Installed OS:** Proxmox VE 9.x (EFI boot) | ||
| + | |||
| + | Primary role summary: | ||
| + | |||
| + | * Runs general apps, including media and image workloads (Nextcloud, Immich, Jellyfin, etc.) | ||
| + | * Acts as second node of Proxmox cluster | ||
| + | * Mirrors pve1’s storage pattern with local disks only | ||
| + | |||
| + | --- | ||
| + | |||
| + | === 2.3 QDevice === | ||
| + | |||
| + | * Hardware: **Raspberry Pi** (dedicated qdevice) | ||
| + | * Software: `corosync-qdevice` | ||
| + | * Purpose: provides voting/ | ||
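
The quorum arithmetic behind the qdevice can be sketched as follows. This is a simplified model of corosync's majority voting (the helper function and vote counts are illustrative, not taken from the live cluster config):

```python
# Simplified model of corosync majority voting (illustrative, not the real API).
def quorate(votes_present: int, total_votes: int) -> bool:
    """A cluster has quorum when strictly more than half of all votes are present."""
    return votes_present > total_votes // 2

# Without the qdevice: 2 total votes. Losing one node leaves 1 of 2 votes.
assert not quorate(votes_present=1, total_votes=2)   # no quorum: cluster freezes

# With the qdevice: 3 total votes (pve1 + pve2 + qdevice).
assert quorate(votes_present=2, total_votes=3)       # one node + qdevice: still quorate
assert not quorate(votes_present=1, total_votes=3)   # isolated single node: no quorum
```

This is exactly why the Pi exists: it breaks the 1-vs-1 tie that a plain 2-node cluster suffers when either node goes down.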
| + | |||
| + | --- | ||
| + | |||
| + | ==== 3. Network Design ==== | ||
| + | |||
| + | The Proxmox cluster uses: | ||
| + | |||
| + | * **Main LAN:** `192.168.1.0/ | ||
| + | * Gateway: **UCG Max** at `192.168.1.1` | ||
| + | * **Cluster link:** dedicated point-to-point /30 network: | ||
| + | * pve1: `10.10.10.1/ | ||
| + | * pve2: `10.10.10.2/ | ||
| + | |||
| + | UniFi VLANs exist on the network side (stark_user, | ||
| + | for now, Proxmox sees mostly the flat LAN plus specific lab networks for testing. | ||
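
The /30 on the cluster link leaves exactly two usable host addresses, which is why it fits a point-to-point interconnect with nothing else able to join. A quick check with Python's standard `ipaddress` module:

```python
import ipaddress

# A /30 contains 4 addresses: network, two hosts, broadcast.
link = ipaddress.ip_network("10.10.10.0/30")
hosts = [str(h) for h in link.hosts()]
print(hosts)  # exactly pve1 and pve2
```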
| + | |||
| + | --- | ||
| + | |||
| + | === 3.1 pve1 Network Interfaces === | ||
| + | |||
| + | From the Proxmox UI: | ||
| + | |||
| + | * **bond0** – Linux bond | ||
| + | * Mode: active-backup | ||
| + | * Slaves: `enp1s0f1`, `enp1s0f2` | ||
| + | * No IP address (used as bridge slave) | ||
| + | * **eno1** – onboard NIC | ||
| + | * Not active, not autostart (reserved / spare) | ||
| + | * **enp1s0f0** | ||
| + | * IP: `10.10.10.1/ | ||
| + | * Usage: dedicated **cluster interconnect** to pve2 | ||
| + | * **enp1s0f1 / enp1s0f2** | ||
| + | * Members of `bond0` | ||
| + | * **enp1s0f3** | ||
| + | * Currently unused | ||
| + | * **vmbr0** – Linux bridge | ||
| + | * IP: `192.168.1.150/ | ||
| + | * Gateway: `192.168.1.1` (UCG Max) | ||
| + | * Bridge port: `bond0` | ||
| + | * All LAN-facing VMs attach here | ||
| + | |||
| + | Design notes: | ||
| + | |||
| + | * Two NIC ports for LAN via bond (`bond0` → `vmbr0`) for basic redundancy. | ||
| + | * One NIC port for cluster link (`enp1s0f0`). | ||
| + | * One NIC still available (`enp1s0f3`) for future use (e.g., storage VLAN or DMZ). | ||
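
A hedged sketch of what `/etc/network/interfaces` on pve1 likely looks like for this layout. This is paraphrased from the UI description above, not copied from the host; option values such as `bond-miimon` are assumptions:

```
# Sketch only -- verify against the real /etc/network/interfaces on pve1.
auto bond0
iface bond0 inet manual
    bond-slaves enp1s0f1 enp1s0f2
    bond-mode active-backup
    bond-miimon 100

auto enp1s0f0
iface enp1s0f0 inet static
    address 10.10.10.1/30

auto vmbr0
iface vmbr0 inet static
    address 192.168.1.150/24
    gateway 192.168.1.1
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0
```

pve2's file would be symmetric with `enp2s0fX` names and `.151`/`.2` addresses.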
| + | |||
| + | --- | ||
| + | |||
| + | === 3.2 pve2 Network Interfaces === | ||
| + | |||
| + | * **bond0** – Linux bond | ||
| + | * Mode: active-backup | ||
| + | * Slaves: `enp2s0f1`, `enp2s0f2` | ||
| + | * **eno1** – onboard NIC | ||
| + | * Not active | ||
| + | * **enp2s0f0** | ||
| + | * IP: `10.10.10.2/ | ||
| + | * Usage: dedicated **cluster interconnect** to pve1 | ||
| + | * **enp2s0f1 / enp2s0f2** | ||
| + | * Members of `bond0` | ||
| + | * **enp2s0f3** | ||
| + | * Currently unused | ||
| + | * **vmbr0** | ||
| + | * IP: `192.168.1.151/ | ||
| + | * Gateway: `192.168.1.1` | ||
| + | * Bridge port: `bond0` | ||
| + | |||
| + | Design notes: | ||
| + | |||
| + | * Symmetric layout with pve1 to make VM migration and cabling easier. | ||
| + | * Cluster traffic is physically separated from LAN. | ||
| + | |||
| + | --- | ||
| + | |||
| + | === 3.3 Logical Topology Diagram (Text) === | ||
| + | |||
| + | * **LAN 192.168.1.0/ | ||
| + | * UCG Max (192.168.1.1) | ||
| + | * pve1 (192.168.1.150, | ||
| + | * pve2 (192.168.1.151, | ||
| + | * Other LAN clients / services | ||
| + | |||
| + | * **Cluster link 10.10.10.0/ | ||
| + | * pve1 – `enp1s0f0` → `10.10.10.1` | ||
| + | * pve2 – `enp2s0f0` → `10.10.10.2` | ||
| + | * Single direct cable between nodes | ||
| + | |||
| + | * **Quorum** | ||
| + | * Raspberry Pi qdevice on LAN (IP TBD), reachable from both nodes | ||
| + | |||
| + | This separation keeps corosync and cluster traffic off the main LAN and avoids | ||
| + | cluster instability if LAN becomes noisy. | ||
| + | |||
| + | ---- | ||
| + | |||
| + | ==== 4. Storage Architecture (Current) ==== | ||
| + | |||
| + | There are **three main storage layers**: | ||
| + | |||
| + | * **Local node storage** (per-node disks, LVM/ZFS) | ||
| + | * **TrueNAS VM** – used as backup target | ||
| + | * **Proxmox Backup Server (PBS)** – used for image-based backups | ||
| + | |||
| + | --- | ||
| + | |||
| + | === 4.1 Local Storage on pve1 === | ||
| + | |||
| + | Typical Proxmox storages (names as shown in the UI): | ||
| + | |||
| + | * `local (pve1)` – boot disk, ISOs, templates | ||
| + | * `local-lvm (pve1)` – LVM-thin for VM disks | ||
| + | * `VM-pool (pve1)` – additional pool for VMs (local disks) | ||
| + | * `PBS (pve1)` – smaller storage local to the PBS VM (e.g., for metadata or staging) | ||
| + | |||
| + | Backed by: | ||
| + | |||
| + | * WDC WD1003FBYX and Seagate ST91000640NS disks on Intel/ | ||
| + | * No hardware RAID; uses Proxmox’s software stack and/or ZFS/LVM | ||
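
For reference, storages like these live in `/etc/pve/storage.cfg`. The sketch below shows the general shape only; the `local` and `local-lvm` entries follow Proxmox defaults, while the thin-pool and volume-group names for `VM-pool` are assumptions, not values read from the node:

```
# Sketch -- thinpool/vgname for VM-pool are assumed, check storage.cfg on pve1.
dir: local
    path /var/lib/vz
    content iso,vztmpl,backup

lvmthin: local-lvm
    thinpool data
    vgname pve
    content images,rootdir

lvmthin: VM-pool
    thinpool vmpool
    vgname vmvg
    content images,rootdir
    nodes pve1
```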
| + | |||
| + | --- | ||
| + | |||
| + | === 4.2 Local Storage on pve2 === | ||
| + | |||
| + | * `local (pve2)` – boot, ISOs, templates | ||
| + | * `local-lvm (pve2)` – VM disks | ||
| + | * `apps-pool (pve2)` – main pool for application VMs (Nextcloud, Immich, Jellyfin, etc.) | ||
| + | |||
| + | Also backed by multiple 1 TB Seagate ST91000640NS disks via the same combination of controllers. | ||
| + | |||
| + | --- | ||
| + | |||
| + | === 4.3 TrueNAS VM === | ||
| + | |||
| + | * **VM ID:** 108 (`truenas`) | ||
| + | * **Node:** pve1 | ||
| + | * **Purpose: | ||
| + | * **Storage role:** Backing store for **PBS datastore `pbs-main`** | ||
| + | * **Backups: | ||
| + | * Reason: TrueNAS holds the backups; recursively backing it up is inefficient and can overload | ||
| + | the system. | ||
| + | |||
| + | Over time, this VM may be migrated to a **dedicated physical NAS**, but for now it is virtualized. | ||
| + | |||
| + | --- | ||
| + | |||
| + | === 4.4 Proxmox Backup Server (PBS) VM === | ||
| + | |||
| + | * **VM ID:** 105 (`pbs`) | ||
| + | * **Node:** pve1 | ||
| + | * **OS:** Proxmox Backup Server 4.x | ||
| + | * **CPU:** 3 vCPUs | ||
| + | * **RAM:** 4 GB | ||
| + | * **Main datastore: | ||
| + | * Size: ~5.38 TB | ||
| + | * Used: ~0.9 TB | ||
| + | * Backed by TrueNAS storage | ||
| + | |||
| + | Important backup rule: | ||
| + | |||
| + | * **PBS does not back up itself.** | ||
| + | * The PBS VM is excluded from nightly backup jobs. | ||
| + | * **TrueNAS (backup storage) is also excluded** from PBS backups. | ||
| + | |||
| + | This prevents: | ||
| + | |||
| + | * Storage thrashing / self-backup loops | ||
| + | * Catastrophic performance impact from PBS trying to back up its own datastore | ||
| + | |||
| + | PBS instead focuses on backing up **critical application VMs** only. | ||
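
The "back up everything except PBS and TrueNAS" rule can be expressed as a single `vzdump` job in `/etc/pve/jobs.cfg`. This is a sketch: the job id, schedule, and the PVE-side storage name are assumptions; the actual values live under Datacenter → Backup:

```
# Sketch -- job id, schedule, and storage name are assumed.
vzdump: backup-nightly
    schedule 02:00
    storage pbs-main
    all 1
    exclude 105,108
    mode snapshot
    enabled 1
```

`all 1` with `exclude 105,108` means new VMs are backed up by default, and the PBS (105) and TrueNAS (108) VMs stay out of the job without needing per-VM opt-outs.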
| + | |||
| + | ---- | ||
| + | |||
| + | ==== 5. Workload Layout ==== | ||
| + | |||
| + | The current cluster runs a mix of core services and lab workloads. VM IDs/ | ||
| + | |||
| + | === 5.1 VMs on pve1 === | ||
| + | |||
| + | * **100 – `web`** | ||
| + | * Role: front-end web / landing page (e.g., torresvault.com) | ||
| + | * **101 – `Kuma`** | ||
| + | * Role: uptime / service monitoring | ||
| + | * **105 – `pbs`** | ||
| + | * Role: Proxmox Backup Server VM | ||
| + | * **106 – `n8n`** | ||
| + | * Role: automation / workflow engine | ||
| + | * **107 – `npm`** | ||
| + | * Role: Nginx Proxy Manager (reverse proxy) | ||
| + | * **108 – `truenas`** | ||
| + | * Role: storage VM / backup target | ||
| + | * **110 – `Prometheus`** | ||
| + | * Role: metrics + Grafana stack | ||
| + | * **112 – `iperf-vlan10`** | ||
| + | * **113 – `iperf-vlan20`** | ||
| + | * **114 – `iperf-vlan1`** | ||
| + | * Role: lab VMs for VLAN and bandwidth testing | ||
| + | * **115 – `portainer-mgmt`** | ||
| + | * Role: container management for other hosts | ||
| + | * **116 – `wiki`** | ||
| + | * Role: DokuWiki instance hosting TorresVault documentation | ||
| + | * **111 – `iperf-vlan1` (pve1 local test network)** | ||
| + | |||
| + | --- | ||
| + | |||
| + | === 5.2 VMs on pve2 === | ||
| + | |||
| + | * **102 – `next`** | ||
| + | * Role: Nextcloud services | ||
| + | * **103 – `immich`** | ||
| + | * Role: photo / media backup | ||
| + | * **104 – `jellyfin`** | ||
| + | * Role: media server | ||
| + | * **109 – `RDPjump`** | ||
| + | * Role: jump host / remote access box | ||
| + | |||
| + | These assignments are **not HA**; VMs are pinned to nodes and protected by PBS backups. | ||
| + | |||
| + | ---- | ||
| + | |||
| + | ==== 6. Backup & Restore Strategy ==== | ||
| + | |||
| + | Backups are handled by the **PBS VM (ID 105)**, writing into datastore `pbs-main` | ||
| + | hosted on TrueNAS. | ||
| + | |||
| + | Key points: | ||
| + | |||
| + | * **Backup scope:** | ||
| + | * Critical service VMs: | ||
| + | * web (100) | ||
| + | * Kuma (101) | ||
| + | * next (102) | ||
| + | * immich (103) | ||
| + | * jellyfin (104) | ||
| + | * n8n (106) | ||
| + | * npm (107) | ||
| + | * Prometheus (110) | ||
| + | * portainer-mgmt (115) | ||
| + | * wiki (116) | ||
| + | * other lab VMs as needed | ||
| + | * **Excluded: | ||
| + | * `truenas` VM (108) | ||
| + | * `pbs` VM (105) itself | ||
| + | * **Datastore: | ||
| + | * Size ~5.38 TB; currently lightly used | ||
| + | * **Retention policy:** configured in PBS; typical pattern (can be tuned): | ||
| + | * e.g., 7 daily, 4 weekly, 12 monthly (confirm in PBS UI) | ||
| + | * **Verification: | ||
| + | * PBS supports scheduled **verify jobs** to detect bit-rot | ||
| + | * **Prune jobs:** | ||
| + | * Automated prune / garbage collection jobs run regularly to reclaim space | ||
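
The effect of a "7 daily, 4 weekly, 12 monthly" pattern can be sanity-checked with a small simulation. This is a simplified model of grouping-based retention (keep the newest snapshot in each of the most recent N days / ISO weeks / months), not the exact PBS prune algorithm:

```python
from datetime import date, timedelta

def prune_keep(snapshot_dates, keep_daily=7, keep_weekly=4, keep_monthly=12):
    """Return the set of snapshot dates a grouping-based policy would keep."""
    snaps = sorted(snapshot_dates, reverse=True)  # newest first
    keep = set()

    def mark(key_fn, limit):
        kept_keys = set()
        for d in snaps:
            k = key_fn(d)
            if k in kept_keys or len(kept_keys) == limit:
                continue
            kept_keys.add(k)          # first hit = newest snapshot in this period
            keep.add(d)

    mark(lambda d: d, keep_daily)                     # one per day
    mark(lambda d: d.isocalendar()[:2], keep_weekly)  # one per ISO week
    mark(lambda d: (d.year, d.month), keep_monthly)   # one per month
    return keep

# 23 consecutive daily snapshots: 2026-01-01 .. 2026-01-23.
snaps = [date(2026, 1, 1) + timedelta(days=i) for i in range(23)]
kept = prune_keep(snaps)
print(sorted(kept))
```

With daily backups, the rules overlap heavily (the newest snapshot satisfies all three), so the retained count grows much more slowly than 7 + 4 + 12.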
| + | |||
| + | --- | ||
| + | |||
| + | === 6.1 Rationale for Exclusions === | ||
| + | |||
| + | * **TrueNAS** holds the backup data; backing it up with PBS would: | ||
| + | * Recursively back up the PBS datastore | ||
| + | * Multiply IO load | ||
| + | * Risk saturating disks and disrupting backups | ||
| + | |||
| + | * **PBS VM** is the backup system itself: | ||
| + | * Backing PBS to its own datastore is logically unsound | ||
| + | * If PBS is lost, it can be rebuilt from Proxmox templates and reattached to | ||
| + | existing datastore | ||
| + | |||
| + | Instead, PBS backup jobs focus on **stateless or easily rebuildable VMs** where immutable data | ||
| + | is stored externally (e.g., on TrueNAS, Nextcloud data, or other locations). | ||
| + | |||
| + | --- | ||
| + | |||
| + | === 6.2 Restore Process (Operational Runbook) === | ||
| + | |||
| + | **Scenario: single VM failure** | ||
| + | |||
| + | 1. Identify affected VM in Proxmox UI. | ||
| + | 2. In PBS UI: | ||
| + | * Go to **Datastore → pbs-main → Content → VM group**. | ||
| + | * Select the latest successful backup. | ||
| + | 3. Choose **Restore**: | ||
| + | * Target node: original host (or alternate host if needed) | ||
| + | * Disk storage: appropriate local storage (`local-lvm`, | ||
| + | 4. Start VM in Proxmox and validate: | ||
| + | * Application health checks (web UI, API, etc.) | ||
| + | * Network connectivity (LAN, DNS, etc.) | ||
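
The connectivity part of step 4 can be scripted. A minimal reachability probe using only the standard library (`port_open` is a hypothetical helper, and the commented example target is illustrative, not a documented endpoint):

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example usage after a restore (illustrative target):
# port_open("192.168.1.150", 8006)  # is the Proxmox UI reachable?
```

Looping this over the restored VM's service ports gives a quick pass/fail before deeper application checks.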
| + | |||
| + | **Scenario: node loss (pve1 or pve2)** | ||
| + | |||
| + | 1. Replace/fix hardware and reinstall Proxmox VE. | ||
| + | 2. Rejoin node to `torres-cluster`. | ||
| + | 3. Recreate necessary storages pointing at local disks. | ||
| + | 4. From PBS, restore VMs to the rebuilt node using the procedure above. | ||
| + | |||
| + | **Scenario: PBS VM lost but TrueNAS datastore intact** | ||
| + | |||
| + | 1. Recreate PBS VM from Proxmox template. | ||
| + | 2. Reattach existing `pbs-main` datastore on TrueNAS. | ||
| + | 3. PBS will rediscover existing backups. | ||
| + | 4. Resume normal operations. | ||
| + | |||
| + | ---- | ||
| + | |||
| + | ==== 7. Monitoring & Observability ==== | ||
| + | |||
| + | Monitoring in TorresVault is layered: | ||
| + | |||
| + | * **Proxmox Node Metrics** | ||
| + | * Built-in graphs: CPU, RAM, I/O, network, load average per node and per VM | ||
| + | * **Prometheus VM (110)** | ||
| + | * Scrapes metrics from nodes and services | ||
| + | * Grafana dashboards provide historical views | ||
| + | * **Kuma VM (101)** | ||
| + | * Synthetic checks / uptime monitoring for key services | ||
| + | * **PBS Analytics** | ||
| + | * Shows backup / prune / verify job history and datastore usage | ||
| + | * **TrueNAS UI** | ||
| + | * Disk health, pool status | ||
| + | |||
| + | Operational practice: | ||
| + | |||
| + | * Use Kuma for **“is it up?”** | ||
| + | * Use Proxmox + Grafana for **“how is it behaving? | ||
| + | * Use PBS UI for **backup health**. | ||
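
If the Prometheus VM scrapes the nodes via the common community `prometheus-pve-exporter`, the scrape job in `prometheus.yml` looks roughly like this. Everything here is an assumption (exporter placement, port 9221, module name); it only illustrates the shape of the config:

```yaml
# Sketch -- targets, port, and module name are assumed, not taken from VM 110.
scrape_configs:
  - job_name: "pve"
    metrics_path: /pve
    params:
      module: [default]
    static_configs:
      - targets:
          - 192.168.1.150:9221
          - 192.168.1.151:9221
```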
| + | |||
| + | ---- | ||
| + | |||
| + | ==== 8. Operational Procedures (Day-to-Day) ==== | ||
| + | |||
| + | === 8.1 Adding a New VM === | ||
| + | |||
| + | 1. Decide which node (pve1 vs pve2) based on workload: | ||
| + | * Storage-heavy → whichever has more free disk | ||
| + | * Media or GPU heavy (later) → pve2 | ||
| + | 2. Create VM in Proxmox: | ||
| + | * Attach to `vmbr0` for LAN access | ||
| + | * Store disks on `local-lvm`, | ||
| + | 3. Install OS and configure network. | ||
| + | 4. In **PBS**, add VM to an existing or new backup group. | ||
| + | 5. Verify first backup completes successfully. | ||
| + | |||
| + | --- | ||
| + | |||
| + | === 8.2 Maintenance & Patching === | ||
| + | |||
| + | **Proxmox nodes:** | ||
| + | |||
| + | 1. Live-migrate or gracefully shut down VMs on the target node if needed. | ||
| + | 2. `apt update && apt full-upgrade` on the node (via console or SSH). | ||
| + | 3. Reboot node. | ||
| + | 4. Verify: | ||
| + | * Corosync quorum healthy | ||
| + | * VMs auto-started where expected | ||
| + | |||
| + | **PBS & TrueNAS: | ||
| + | |||
| + | * Update during low-traffic windows (overnight). | ||
| + | * Confirm backups resume successfully after upgrades. | ||
| + | |||
| + | --- | ||
| + | |||
| + | === 8.3 Power-Down / Power-Up Order === | ||
| + | |||
| + | Planned maintenance that requires full stack shutdown: | ||
| + | |||
| + | **Shutdown order:** | ||
| + | |||
| + | 1. Application VMs (web, Nextcloud, Immich, Jellyfin, etc.) | ||
| + | 2. Monitoring VMs (Kuma, Prometheus) | ||
| + | 3. PBS VM | ||
| + | 4. TrueNAS VM | ||
| + | 5. pve2 node | ||
| + | 6. pve1 node (last Proxmox node) | ||
| + | 7. Network gear / UPS if necessary | ||
| + | |||
| + | **Power-up order:** | ||
| + | |||
| + | 1. Network gear & UPS | ||
| + | 2. pve1 and pve2 nodes | ||
| + | 3. TrueNAS VM | ||
| + | 4. PBS VM | ||
| + | 5. Core apps (web, Nextcloud, Immich, Jellyfin, n8n, NPM, wiki) | ||
| + | 6. Monitoring stack (Kuma, Prometheus/ | ||
| + | |||
| + | This order ensures that **storage is ready before PBS**, and **PBS is ready before | ||
| + | depending VMs** (if any use backup features like guest-initiated restore). | ||
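
The dependency reasoning behind these orderings can be encoded and checked mechanically. A small sketch; the layer names and the dependency map are my own modeling of the text above, not configuration pulled from the cluster:

```python
# Power-up dependencies: each layer lists what must already be running.
DEPENDS_ON = {
    "pve-nodes":  {"network"},
    "truenas":    {"pve-nodes"},
    "pbs":        {"truenas"},      # datastore pbs-main lives on TrueNAS
    "apps":       {"pve-nodes"},
    "monitoring": {"apps"},         # monitoring observes the apps
}

# The documented power-up sequence, collapsed into layers.
POWER_UP = ["network", "pve-nodes", "truenas", "pbs", "apps", "monitoring"]

def order_ok(order, deps):
    """True if every layer appears after all of its dependencies."""
    pos = {name: i for i, name in enumerate(order)}
    return all(pos[dep] < pos[layer] for layer, ds in deps.items() for dep in ds)

assert order_ok(POWER_UP, DEPENDS_ON)
# Powering up in the shutdown sequence would violate the dependencies:
assert not order_ok(list(reversed(POWER_UP)), DEPENDS_ON)
```

Swapping any two dependent layers (e.g., PBS before TrueNAS) makes the check fail, which matches the failure mode the runbook is guarding against.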
| + | |||
| + | ---- | ||
| + | |||
| + | ==== 9. Risks, Constraints & Known Limitations ==== | ||
| + | |||
| + | * **No shared storage / HA:** | ||
| + | * VMs are pinned to nodes. If a node fails, VMs require restore or manual migration. | ||
| + | * **Older hardware: | ||
| + | * CPUs and DDR3L era platforms limit performance and efficiency. | ||
| + | * **Local disks only: | ||
| + | * Mix of older 1 TB HDDs; no SSD-only tiers for high IOPS workloads. | ||
| + | * **PBS & TrueNAS both virtualized on pve1: | ||
| + | * Concentrates backup and storage responsibility on a single compute node. | ||
| + | * **Limited RAM (32 GB per node): | ||
| + | * Constrains number of memory-heavy workloads. | ||
| + | |||
| + | These are acceptable for a home lab / prosumer environment but are captured here | ||
| + | explicitly for future planning. | ||
| + | |||
| + | ---- | ||
| + | |||
| + | ==== 10. Future Improvements (Pointer to Roadmap) ==== | ||
| + | |||
| + | The following items are **out of scope for this document** but are tracked on the roadmap: | ||
| + | |||
| + | * Dedicated **NAS / Proxmox hybrid** with ASRock Rack X570D4U and 16 × 6 TB SAS | ||
| + | * Standalone physical **TrueNAS or SCALE** box | ||
| + | * Additional **Proxmox node** or Mini-PC cluster for Kubernetes | ||
| + | * 10 GbE or faster interconnect between nodes | ||
| + | * Storage tiers (NVMe → SSD → HDD → PBS) | ||
| + | * Better separation of roles: | ||
| + | * PBS on dedicated hardware | ||
| + | * TrueNAS on physical host | ||
| + | * Proxmox nodes focused on compute | ||
| + | |||
| + | See: [[torresvault: | ||
| + | |||
| + | ---- | ||
torresvault/proxmox/cluster.1769193448.txt.gz · Last modified: by nathna
