torresvault:proxmox:cluster (created 2026/01/23 13:37; last modified 2026/01/23 14:40 by nathna)
===== Proxmox Cluster Architecture (Current State) =====
  
This document describes the **current TorresVault Proxmox cluster** as it exists today.
It focuses on hardware, networking, storage, workloads, and backup/restore, and is intended as the
authoritative reference for how virtualization is implemented in TorresVault 2.0 (current state).

Future redesigns (new NAS, X570D4U, Mini PC cluster, etc.) will be documented separately
on the [[torresvault:todo:roadmap]] page.

----

==== 1. High-Level Overview ====

The Proxmox environment is a **2-node cluster with a qdevice**, running on older but solid Intel
desktop platforms with expanded SATA and NIC capacity.

  * **Cluster name:** `torres-cluster`
  * **Hypervisor:** Proxmox VE 9.x
  * **Nodes:** `pve1`, `pve2`
  * **Quorum helper:** Raspberry Pi running `corosync-qdevice`
  * **Backup server:** Proxmox Backup Server (PBS) VM
  * **Storage backend:** Local SATA disks per node; TrueNAS VM providing backup storage

High-level logical view:

  * **Compute layer:** pve1, pve2
  * **Storage layer:** local disks per node, plus TrueNAS VM used as backup target
  * **Backup layer:** PBS VM writing to TrueNAS (`pbs-main` datastore)
  * **Monitoring layer:** Prometheus + Grafana, Kuma, Proxmox built-ins

The design intentionally **does not use shared storage for HA**; instead, VMs are pinned to nodes
and protected via **image-based backups to PBS**.

----

==== 2. Physical Hosts & Hardware ====

=== 2.1 pve1 ===

**Role:** General compute node, many core services

  * **CPU:** Intel Core i5-2500 @ 3.30 GHz
    * 4 cores / 4 threads, 1 socket
  * **RAM:** 32 GB DDR3L 1600 MHz
    * 4 × 8 GB Timetec DDR3L (PC3L-12800) UDIMM kit
  * **Motherboard / Chipset:** Older Intel desktop platform
  * **Disk controllers:**
    * Onboard Intel SATA controller (RAID mode)
    * ASMedia ASM1064 SATA controller
    * **GLOTRENDS SA3112-C 12-port PCIe x1 SATA expansion card**
  * **Disk inventory (approximate):**
    * Several **1 TB WDC WD1003FBYX** enterprise HDDs
    * Several **1 TB Seagate ST91000640NS** HDDs
    * System / boot disk plus ~10–12 × 1 TB data disks
  * **Network interfaces:**
    * Onboard Intel 82579LM 1 GbE
    * **Intel I350 quad-port 1 GbE** PCIe NIC
  * **Installed OS:** Proxmox VE 9.x (legacy BIOS)
  * **Kernel example:** 6.14.x-pve

Primary role summary:

  * Runs web, monitoring, automation, PBS, TrueNAS, and various lab VMs
  * Acts as one half of the Proxmox cluster
  * Provides local LVM/ZFS storage for its own VMs

----

=== 2.2 pve2 ===

**Role:** General compute node, media & application workloads

  * **CPU:** Intel Core i5-4570 @ 3.20 GHz
    * 4 cores / 4 threads, 1 socket
  * **RAM:** 32 GB DDR3L 1600 MHz
    * Same Timetec 4 × 8 GB kit as pve1
  * **Disk controllers:**
    * Intel 9-Series SATA controller (AHCI)
    * ASMedia ASM1064 SATA controller
    * **GLOTRENDS SA3112-C 12-port PCIe x1 SATA expansion card**
  * **Disk inventory (approximate):**
    * Multiple **1 TB Seagate ST91000640NS** HDDs
    * System disk plus ~10–12 × 1 TB data disks
  * **Network interfaces:**
    * Intel I350 quad-port 1 GbE (matching pve1)
  * **Installed OS:** Proxmox VE 9.x (EFI boot)

Primary role summary:

  * Runs general apps, including media and image workloads (Nextcloud, Immich, Jellyfin, etc.)
  * Acts as the second node of the Proxmox cluster
  * Mirrors pve1's storage pattern with local disks only

----

=== 2.3 QDevice ===

  * Hardware: **Raspberry Pi** (dedicated qdevice)
  * Software: `corosync-qdevice`
  * Purpose: provides a third quorum vote for the **2-node cluster**, preventing split-brain.
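For reference, registering a qdevice like this is a two-step process. A hedged sketch follows (the commands are wrapped in functions so nothing runs on paste, and the Pi's address is a placeholder, not the real one):

```shell
# Server side, on the Raspberry Pi (Debian-based): run the qnetd daemon.
setup_qnetd_on_pi() {
  apt install -y corosync-qnetd
}

# Client side, run once on either Proxmox node: installs the qdevice
# client and registers the Pi, giving the cluster a third vote.
setup_qdevice_on_node() {
  local pi_addr="192.0.2.10"   # placeholder; use the Pi's real LAN IP
  apt install -y corosync-qdevice
  pvecm qdevice setup "$pi_addr"
  pvecm status                 # expect "Quorate: Yes" with a Qdevice vote
}
```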

----

==== 3. Network Design ====

The Proxmox cluster uses:

  * **Main LAN:** `192.168.1.0/24`
    * Gateway: **UCG Max** at `192.168.1.1`
  * **Cluster link:** dedicated point-to-point /30 network:
    * pve1: `10.10.10.1/30`
    * pve2: `10.10.10.2/30`

UniFi VLANs exist on the network side (stark_user, stark_IOT, guest, IOT+, Torres Family Lights);
for now, Proxmox sees mostly the flat LAN plus specific lab networks for testing.

----

=== 3.1 pve1 Network Interfaces ===

From the Proxmox UI:

  * **bond0** – Linux bond
    * Mode: active-backup
    * Slaves: `enp1s0f1`, `enp1s0f2`
    * No IP address (used as bridge slave)
  * **eno1** – onboard NIC
    * Not active, not autostart (reserved / spare)
  * **enp1s0f0**
    * IP: `10.10.10.1/30`
    * Usage: dedicated **cluster interconnect** to pve2
  * **enp1s0f1 / enp1s0f2**
    * Members of `bond0`
  * **enp1s0f3**
    * Currently unused
  * **vmbr0** – Linux bridge
    * IP: `192.168.1.150/24`
    * Gateway: `192.168.1.1` (UCG Max)
    * Bridge port: `bond0`
    * All LAN-facing VMs attach here

Design notes:

  * Two NIC ports for LAN via bond (`bond0` → `vmbr0`) for basic redundancy.
  * One NIC port for the cluster link (`enp1s0f0`).
  * One NIC port still available (`enp1s0f3`) for future use (e.g., storage VLAN or DMZ).
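The pve1 layout above corresponds roughly to this `/etc/network/interfaces` fragment. This is a reconstruction from the UI values shown above, not a copy of the live file; bond options beyond the mode are typical defaults:

```
auto bond0
iface bond0 inet manual
    bond-slaves enp1s0f1 enp1s0f2
    bond-mode active-backup
    bond-miimon 100

auto enp1s0f0
iface enp1s0f0 inet static
    address 10.10.10.1/30
    # dedicated cluster interconnect to pve2 (10.10.10.2)

auto vmbr0
iface vmbr0 inet static
    address 192.168.1.150/24
    gateway 192.168.1.1
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0
```

pve2 mirrors this with `enp2s0fX` names and `192.168.1.151` / `10.10.10.2`.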

----

=== 3.2 pve2 Network Interfaces ===

  * **bond0** – Linux bond
    * Mode: active-backup
    * Slaves: `enp2s0f1`, `enp2s0f2`
  * **eno1** – onboard NIC
    * Not active
  * **enp2s0f0**
    * IP: `10.10.10.2/30`
    * Usage: dedicated **cluster interconnect** to pve1
  * **enp2s0f1 / enp2s0f2**
    * Members of `bond0`
  * **enp2s0f3**
    * Currently unused
  * **vmbr0**
    * IP: `192.168.1.151/24`
    * Gateway: `192.168.1.1`
    * Bridge port: `bond0`

Design notes:

  * Symmetric layout with pve1 to make VM migration and cabling easier.
  * Cluster traffic is physically separated from the LAN.

----

=== 3.3 Logical Topology Diagram (Text) ===

  * **LAN 192.168.1.0/24**
    * UCG Max (192.168.1.1)
    * pve1 (192.168.1.150, vmbr0 on bond0)
    * pve2 (192.168.1.151, vmbr0 on bond0)
    * Other LAN clients / services

  * **Cluster link 10.10.10.0/30**
    * pve1 – `enp1s0f0` → `10.10.10.1`
    * pve2 – `enp2s0f0` → `10.10.10.2`
    * Single direct cable between nodes

  * **Quorum**
    * Raspberry Pi qdevice on LAN (IP TBD), reachable from both nodes

This separation keeps corosync and cluster traffic off the main LAN and avoids
cluster instability if the LAN becomes noisy.
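On the corosync side, the /30 addresses above end up as `ring0` in `/etc/pve/corosync.conf`. A representative fragment, reconstructed from the topology (the qdevice host is a placeholder, and the fragment assumes the qdevice has already been registered):

```
nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.1
  }
  node {
    name: pve2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.10.10.2
  }
}

quorum {
  provider: corosync_votequorum
  device {
    model: net
    net {
      algorithm: ffsplit
      host: <qdevice-ip>
      tls: on
    }
    votes: 1
  }
}
```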

----

==== 4. Storage Architecture (Current) ====

There are **three main storage layers**:

  * **Local node storage** (per-node disks, LVM/ZFS)
  * **TrueNAS VM** – used as backup target
  * **Proxmox Backup Server (PBS)** – used for image-based backups

----

=== 4.1 Local Storage on pve1 ===

Typical Proxmox storages (names as shown in the UI):

  * `local (pve1)` – boot disk, ISOs, templates
  * `local-lvm (pve1)` – LVM-thin for VM disks
  * `VM-pool (pve1)` – additional pool for VMs (local disks)
  * `PBS (pve1)` – smaller storage local to the PBS VM (e.g., for metadata or staging)

Backed by:

  * WDC WD1003FBYX and Seagate ST91000640NS disks on Intel/ASMedia/GLOTRENDS SATA controllers
  * No hardware RAID; uses Proxmox's software stack and/or ZFS/LVM
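In `/etc/pve/storage.cfg`, the pve1 entries above look roughly like this. This is a sketch only: the volume group, thin-pool names, and content types are assumptions inferred from the UI labels, not copied from the live file:

```
dir: local
        path /var/lib/vz
        content iso,vztmpl,backup

lvmthin: local-lvm
        thinpool data
        vgname pve
        content images,rootdir

lvmthin: VM-pool
        thinpool VM-pool
        vgname VM-pool
        content images,rootdir
        nodes pve1
```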

----

=== 4.2 Local Storage on pve2 ===

  * `local (pve2)` – boot, ISOs, templates
  * `local-lvm (pve2)` – VM disks
  * `apps-pool (pve2)` – main pool for application VMs (Nextcloud, Immich, Jellyfin, etc.)

Also backed by multiple 1 TB Seagate ST91000640NS disks via the same combination of controllers.

----

=== 4.3 TrueNAS VM ===

  * **VM ID:** 108 (`truenas`)
  * **Node:** pve1
  * **Purpose:** Provides network file storage and a **backup target** for PBS
  * **Storage role:** Backing store for the **PBS datastore `pbs-main`**
  * **Backups:** **TrueNAS is NOT backed up by PBS.**
    * Reason: TrueNAS holds the backups; recursively backing it up is inefficient and can overload
      the system.

Over time, this VM may be migrated to a **dedicated physical NAS**, but for now it is virtualized.

----

=== 4.4 Proxmox Backup Server (PBS) VM ===

  * **VM ID:** 105 (`pbs`)
  * **Node:** pve1
  * **OS:** Proxmox Backup Server 4.x
  * **CPU:** 3 vCPUs
  * **RAM:** 4 GB
  * **Main datastore:** `pbs-main`
    * Size: ~5.38 TB
    * Used: ~0.9 TB
    * Backed by TrueNAS storage

Important backup rules:

  * **PBS does not back up itself.**
    * The PBS VM is excluded from nightly backup jobs.
  * **TrueNAS (backup storage) is also excluded** from PBS backups.

This prevents:

  * Storage thrashing / self-backup loops
  * Catastrophic performance impact from PBS trying to back up its own datastore

PBS instead focuses on backing up **critical application VMs** only.
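On the Proxmox side, `pbs-main` is consumed as a storage of type `pbs`. The `/etc/pve/storage.cfg` entry looks roughly like this (server address, API user, and fingerprint are placeholders; the fingerprint comes from the PBS dashboard):

```
pbs: pbs-main
        server <pbs-vm-ip>
        datastore pbs-main
        username backup@pbs
        fingerprint <sha256-fingerprint-from-pbs-dashboard>
        content backup
```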

----

==== 5. Workload Layout ====

The current cluster runs a mix of core services and lab workloads. VM IDs/names:

=== 5.1 VMs on pve1 ===

  * **100 – `web`**
    * Role: front-end web / landing page (e.g., torresvault.com)
  * **101 – `Kuma`**
    * Role: uptime / service monitoring
  * **105 – `pbs`**
    * Role: Proxmox Backup Server VM
  * **106 – `n8n`**
    * Role: automation / workflow engine
  * **107 – `npm`**
    * Role: Nginx Proxy Manager (reverse proxy)
  * **108 – `truenas`**
    * Role: storage VM / backup target
  * **110 – `Prometheus`**
    * Role: metrics + Grafana stack
  * **111 – `iperf-vlan1`** (pve1 local test network)
  * **112 – `iperf-vlan10`**
  * **113 – `iperf-vlan20`**
  * **114 – `iperf-vlan1`**
    * Role: lab VMs for VLAN and bandwidth testing
  * **115 – `portainer-mgmt`**
    * Role: container management for other hosts
  * **116 – `wiki`**
    * Role: DokuWiki instance hosting TorresVault documentation

----

=== 5.2 VMs on pve2 ===

  * **102 – `next`**
    * Role: Nextcloud services
  * **103 – `immich`**
    * Role: photo / media backup
  * **104 – `jellyfin`**
    * Role: media server
  * **109 – `RDPjump`**
    * Role: jump host / remote access box

These assignments are **not HA**; VMs are pinned to nodes and protected by PBS backups.

----

==== 6. Backup & Restore Strategy ====

Backups are handled by the **PBS VM (ID 105)**, writing into datastore `pbs-main`
hosted on TrueNAS.

Key points:

  * **Backup scope:**
    * Critical service VMs:
      * web (100)
      * Kuma (101)
      * next (102)
      * immich (103)
      * jellyfin (104)
      * n8n (106)
      * npm (107)
      * Prometheus (110)
      * portainer-mgmt (115)
      * wiki (116)
      * other lab VMs as needed
    * **Excluded:**
      * `truenas` VM (108)
      * `pbs` VM (105) itself
  * **Datastore:** `pbs-main`
    * Size ~5.38 TB; currently lightly used
  * **Retention policy:** configured in PBS; typical pattern (can be tuned):
    * e.g., 7 daily, 4 weekly, 12 monthly (confirm in PBS UI)
  * **Verification:**
    * PBS supports scheduled **verify jobs** to detect bit-rot
  * **Prune jobs:**
    * Automated prune / garbage-collection jobs run regularly to reclaim space
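A nightly job with these exclusions and that retention pattern could be expressed in `/etc/pve/jobs.cfg` roughly as follows. The job id, schedule, and keep counts here are assumptions to be confirmed against the actual Datacenter → Backup settings, not the live config:

```
vzdump: backup-nightly
        schedule 02:00
        all 1
        exclude 105,108
        storage pbs-main
        mode snapshot
        prune-backups keep-daily=7,keep-weekly=4,keep-monthly=12
        enabled 1
```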

----

=== 6.1 Rationale for Exclusions ===

  * **TrueNAS** holds the backup data; backing it up with PBS would:
    * Recursively back up the PBS datastore
    * Multiply I/O load
    * Risk saturating disks and disrupting backups

  * The **PBS VM** is the backup system itself:
    * Backing up PBS to its own datastore is logically unsound
    * If PBS is lost, it can be rebuilt from Proxmox templates and reattached to the
      existing datastore

Instead, PBS backup jobs focus on **stateless or easily rebuildable VMs** whose immutable data
is stored externally (e.g., on TrueNAS, Nextcloud data, or other locations).

----

=== 6.2 Restore Process (Operational Runbook) ===

**Scenario: single VM failure**

  1. Identify the affected VM in the Proxmox UI.
  2. In the PBS UI:
     * Go to **Datastore → pbs-main → Content → VM group**.
     * Select the latest successful backup.
  3. Choose **Restore**:
     * Target node: original host (or an alternate host if needed)
     * Disk storage: appropriate local storage (`local-lvm`, `apps-pool`, etc.)
  4. Start the VM in Proxmox and validate:
     * Application health checks (web UI, API, etc.)
     * Network connectivity (LAN, DNS, etc.)
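The same restore can be driven from the CLI on the target node. A sketch with illustrative IDs (wrapped in a function so nothing runs on paste; the archive volid shown is example output, not a real snapshot):

```shell
# Restore one VM from the PBS-backed storage "pbs-main".
restore_vm_from_pbs() {
  local vmid="102"                 # illustrative: the Nextcloud VM
  local target_storage="apps-pool" # local-lvm / apps-pool / VM-pool as appropriate

  # 1. List available archives for this guest and pick the newest good one.
  pvesm list pbs-main --vmid "$vmid"

  # 2. Restore the chosen archive (volid below is illustrative).
  qmrestore "pbs-main:backup/vm/${vmid}/2026-01-20T02:00:00Z" "$vmid" \
    --storage "$target_storage"

  # 3. Boot and validate.
  qm start "$vmid"
}
```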

**Scenario: node loss (pve1 or pve2)**

  1. Replace/fix the hardware and reinstall Proxmox VE.
  2. Rejoin the node to `torres-cluster`.
  3. Recreate the necessary storages pointing at local disks.
  4. From PBS, restore VMs to the rebuilt node using the procedure above.
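Step 2 is a `pvecm add` from the rebuilt node against the surviving one. A sketch assuming pve2 was rebuilt and pve1 survived (wrapped in a function so nothing runs on paste):

```shell
# Run on the rebuilt node; --link0 keeps corosync on the dedicated /30 link.
rejoin_cluster() {
  pvecm add 10.10.10.1 --link0 10.10.10.2   # example: pve2 rebuilt, pve1 survived
  pvecm status                              # confirm quorum afterwards
}
```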

**Scenario: PBS VM lost but TrueNAS datastore intact**

  1. Recreate the PBS VM from a Proxmox template.
  2. Reattach the existing `pbs-main` datastore on TrueNAS.
  3. PBS will rediscover the existing backups.
  4. Resume normal operations.
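Reattaching means pointing the fresh PBS install at the existing chunk store rather than initializing a new one. Conceptually, `/etc/proxmox-backup/datastore.cfg` ends up with an entry like the following (the mount path is an assumption; use wherever the TrueNAS share is mounted), after which a garbage-collection or verify run should confirm the old snapshots are visible:

```
datastore: pbs-main
        path /mnt/truenas/pbs-main
```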

----

==== 7. Monitoring & Observability ====

Monitoring in TorresVault is layered:

  * **Proxmox node metrics**
    * Built-in graphs: CPU, RAM, I/O, network, and load average per node and per VM
  * **Prometheus VM (110)**
    * Scrapes metrics from nodes and services
    * Grafana dashboards provide historical views
  * **Kuma VM (101)**
    * Synthetic checks / uptime monitoring for key services
  * **PBS analytics**
    * Shows backup / prune / verify job history and datastore usage
  * **TrueNAS UI**
    * Disk health, pool status

Operational practice:

  * Use Kuma for **“is it up?”**
  * Use Proxmox + Grafana for **“how is it behaving?”**
  * Use the PBS UI for **backup health**.

----

==== 8. Operational Procedures (Day-to-Day) ====

=== 8.1 Adding a New VM ===

  1. Decide which node (pve1 vs pve2) based on workload:
     * Storage-heavy → whichever has more free disk
     * Media- or GPU-heavy (later) → pve2
  2. Create the VM in Proxmox:
     * Attach to `vmbr0` for LAN access
     * Store disks on `local-lvm`, `apps-pool`, or `VM-pool`
  3. Install the OS and configure the network.
  4. In **PBS**, add the VM to an existing or new backup group.
  5. Verify the first backup completes successfully.
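Step 2 from the CLI might look like this. The VM ID, name, ISO filename, and sizes are illustrative, and the whole thing is wrapped in a function so nothing runs on paste:

```shell
# Create a small LAN-attached VM with its disk on local-lvm.
create_example_vm() {
  qm create 120 \
    --name example-vm \
    --memory 4096 --cores 2 \
    --net0 virtio,bridge=vmbr0 \
    --scsihw virtio-scsi-single \
    --scsi0 local-lvm:32 \
    --ide2 "local:iso/debian-12.iso,media=cdrom" \
    --boot "order=scsi0;ide2" \
    --ostype l26
}
```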

----

=== 8.2 Maintenance & Patching ===

**Proxmox nodes:**

  1. Live-migrate or gracefully shut down VMs on the target node if needed.
  2. Run `apt update && apt full-upgrade` on the node (via console or SSH).
  3. Reboot the node.
  4. Verify:
     * Corosync quorum is healthy
     * VMs auto-started where expected

**PBS & TrueNAS:**

  * Update during low-traffic windows (overnight).
  * Confirm backups resume successfully after upgrades.

----

=== 8.3 Power-Down / Power-Up Order ===

For planned maintenance that requires a full stack shutdown:

**Shutdown order:**

  1. Application VMs (web, Nextcloud, Immich, Jellyfin, etc.)
  2. Monitoring VMs (Kuma, Prometheus)
  3. PBS VM
  4. TrueNAS VM
  5. pve2 node
  6. pve1 node (last Proxmox node)
  7. Network gear / UPS if necessary

**Power-up order:**

  1. Network gear & UPS
  2. pve1 and pve2 nodes
  3. TrueNAS VM
  4. PBS VM
  5. Core apps (web, Nextcloud, Immich, Jellyfin, n8n, NPM, wiki)
  6. Monitoring stack (Kuma, Prometheus/Grafana)

This order ensures that **storage is ready before PBS**, and **PBS is ready before
dependent VMs** (if any use backup features like guest-initiated restore).
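The shutdown sequence can be scripted. A hedged sketch (VM IDs taken from the workload layout in section 5; `DRY_RUN=1`, the default, only prints the plan, and the `qm` calls assume the script runs where all listed guests are managed):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Shutdown plan per section 8.3: application VMs first, then monitoring,
# then PBS, and TrueNAS last. IDs are from section 5.
SHUTDOWN_ORDER=(100 102 103 104 106 107 109 116 101 110 105 108)

DRY_RUN="${DRY_RUN:-1}"

for vmid in "${SHUTDOWN_ORDER[@]}"; do
  if [ "$DRY_RUN" = "1" ]; then
    echo "would shut down VM $vmid"
  else
    # Graceful guest shutdown; waits up to 120 s per guest before giving up.
    qm shutdown "$vmid" --timeout 120
  fi
done
```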

----

==== 9. Risks, Constraints & Known Limitations ====

  * **No shared storage / HA:**
    * VMs are pinned to nodes. If a node fails, its VMs require a restore or manual migration.
  * **Older hardware:**
    * CPUs and DDR3L-era platforms limit performance and efficiency.
  * **Local disks only:**
    * Mix of older 1 TB HDDs; no SSD-only tiers for high-IOPS workloads.
  * **PBS & TrueNAS both virtualized on pve1:**
    * Concentrates backup and storage responsibility on a single compute node.
  * **Limited RAM (32 GB per node):**
    * Constrains the number of memory-heavy workloads.

These are acceptable for a home lab / prosumer environment but are captured here
explicitly for future planning.

----

==== 10. Future Improvements (Pointer to Roadmap) ====

The following items are **out of scope for this document** but are tracked on the roadmap:

  * Dedicated **NAS / Proxmox hybrid** with ASRock Rack X570D4U and 16 × 6 TB SAS drives
  * Standalone physical **TrueNAS or SCALE** box
  * An additional **Proxmox node** or a Mini-PC cluster for Kubernetes
  * 10 GbE or faster interconnect between nodes
  * Storage tiers (NVMe → SSD → HDD → PBS)
  * Better separation of roles:
    * PBS on dedicated hardware
    * TrueNAS on a physical host
    * Proxmox nodes focused on compute

See: [[torresvault:todo:roadmap]] (to be created).

----