====== proxmox:cluster ======

//Current revision: 2026/02/23 13:50, edited by 192.168.1.189. Replaces the 2026/01/23 18:15 revision created by nathna.//
====== TorresVault 2.0: Proxmox Architecture (Current State) ======

This document describes the **current**, single-node Proxmox environment for TorresVault 2.0.

This page replaces and supersedes all references to:

  * ''pve1''
  * ''pve2''
  * The old 2-node cluster
  * The Raspberry Pi qdevice
  * All Intel-based legacy hardware

All of that hardware has been decommissioned.

The sole hypervisor is now:

==== ▶ PVE-NAS (192.168.1.153) ====

Running on **enterprise-grade Ryzen hardware** with **TrueNAS virtualized via HBA passthrough**, and acting as the consolidated hypervisor and storage host.

Future expansions (backup NAS, mini-PC cluster, GPU with Jarvis, Flex 10G, etc.) will be documented on a separate roadmap page.

----
====== 1. High-Level Overview ======

==== Hypervisor Platform ====

  * **Single-node design (no cluster)**
  * System name: **''PVE-NAS''**
  * Management IP: **192.168.1.153**
  * IPMI: **192.168.1.145**

==== Storage Layer (under TrueNAS VM) ====

  * 8 × Samsung PM863 **1.92 TB enterprise SSDs** passed directly to TrueNAS via HBA
  * TrueNAS manages all storage pools
  * PVE-NAS uses:
    * NVMe mirror → Proxmox OS
    * 1.9 TB SSDs → VM storage
  * ZFS replication & snapshots inside TrueNAS
  * PBS nightly backups

==== Backup Layer ====

  * **PBS VM** on PVE-NAS
  * Writes into the **pbs-main** datastore on TrueNAS

==== Workload Layer ====

Core services:

  * Immich
  * Nextcloud
  * Jellyfin
  * NPM reverse proxy
  * Prometheus / Grafana
  * Kuma
  * Wiki
  * n8n automations

==== Automation Layer ====

  * Home Assistant
  * BLE tracking
  * FPP (192.168.60.55)
  * WLED (including the car warning system)

This is currently the entire virtualization footprint for TorresVault 2.0.

----
====== 2. Physical Host: PVE-NAS ======

==== Hardware Summary ====

^ Component ^ Details ^
| **Motherboard** | ASRock Rack **X570D4U-2L2T** |
| **CPU** | AMD Ryzen 7 5700G — 8 cores / 16 threads |
| **RAM** | **64 GiB DDR4 ECC** |
| **Boot** | 2 × NVMe SSD (ZFS mirror) |
| **VM Storage** | 2 × Samsung PM863 1.92 TB SSD (Proxmox local storage) |
| **HBA** | 1 × LSI IT-mode HBA (passthrough) |
| **TrueNAS Pool Drives** | 8 × Samsung PM863 1.92 TB SSD (full passthrough) |
| **Networking** | 2 × 1 GbE + 2 × 10 GbE (Intel X550 NICs) |
| **IPMI** | 192.168.1.145 |

This is now your **single most powerful and consolidated host** in TorresVault.

----
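The HBA passthrough mentioned above is configured at the VM level. A minimal, illustrative sketch of the relevant line in the TrueNAS VM's config; the hostpci index and PCI address here are placeholders, not the real values on PVE-NAS:

<code>
# /etc/pve/qemu-server/108.conf (illustrative excerpt; PCI address is a placeholder)
# Find the real HBA address first with:  lspci -nn | grep -i -e LSI -e SAS
hostpci0: 0000:01:00.0
</code>

Passing the whole HBA (rather than individual disks) lets TrueNAS see all eight PM863s natively, including SMART data.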
====== 3. Network Design ======

Proxmox sees only the main LAN and the storage networks you define.

==== Management & LAN ====

^ Interface ^ IP ^ Purpose ^
| **vmbr0** | 192.168.1.153 | Main LAN bridge & VM network |
| **eno1 / eno2** | (bridged) | 1 GbE LAN & VM connectivity |
| **ens1f0 / ens1f1** | (available) | Dual 10 GbE for a future storage network / Flex 10G |

==== VLANs ====

Available VLANs (all managed via UniFi):

  * VLAN10 – User
  * VLAN20 – IoT
  * VLAN30 – Guest
  * VLAN50 – IoT+
  * VLAN60 – Lighting

==== IPMI ====

  * 192.168.1.145 (always available even if Proxmox is offline)

----
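The bridge layout above corresponds roughly to the sketch below. This is an assumption-laden example, not a copy of the live file: the gateway address is carried over from the previous revision's LAN design, only ''eno1'' is shown as the bridge port (bridging both NICs without a bond risks a switch loop), and the VLAN-aware settings only apply if VM-level VLAN tagging is wanted.

<code>
# /etc/network/interfaces (sketch; the live file on PVE-NAS is authoritative)
auto eno1
iface eno1 inet manual

auto vmbr0
iface vmbr0 inet static
        address 192.168.1.153/24
        gateway 192.168.1.1
        bridge-ports eno1
        bridge-stp off
        bridge-fd 0
        # optional, only if VMs tag VLANs directly:
        bridge-vlan-aware yes
        bridge-vids 10,20,30,50,60
</code>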
====== 4. Storage Architecture (Current) ======

There are three main storage components:

----

===== 4.1 Proxmox Local Storage (OS + VM disks) =====

^ Storage ^ Description ^ Backed By ^
| Proxmox OS / boot | Root filesystem, ISOs, templates | 2 × NVMe SSD (ZFS mirror) |
| VM storage | Local VM disks | 2 × Samsung PM863 1.92 TB SSD |

----

===== 4.2 TrueNAS VM (ID 108) =====

^ Component ^ Details ^
| **Disks** | 8 × PM863 1.92 TB SSDs via HBA passthrough |
| **Role** | All primary storage for Immich, Nextcloud, Jellyfin, and the PBS datastore |
| **IP** | 192.168.1.108 |
| **Pools** | Defined and managed inside TrueNAS |

TrueNAS acts as your **central storage authority**.

----
===== 4.3 Proxmox Backup Server VM (ID 105) =====

^ Component ^ Details ^
| **Datastore** | ''pbs-main'' |
| **Backed By** | TrueNAS |
| **Backed Up?** | **NO** (PBS never backs up itself) |

PBS backs up:

  * Immich
  * Nextcloud
  * Jellyfin
  * Web / Wiki
  * Prometheus / Kuma
  * n8n
  * NPM

**Excluded:**

  * PBS (cannot back itself up)
  * TrueNAS VM (contains the backup datastore)

----
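The "back up everything except PBS and TrueNAS" rule can be expressed as a single Datacenter backup job. A sketch of what that looks like in ''/etc/pve/jobs.cfg''; the job ID, schedule, and storage name are assumptions here (in practice, configure this via Datacenter → Backup in the UI):

<code>
vzdump: backup-nightly
        schedule 02:00
        all 1
        exclude 105,108
        storage pbs-main
        mode snapshot
        enabled 1
</code>

''all 1'' plus ''exclude'' means any newly created VM is backed up by default, which avoids silently unprotected guests.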
====== 5. Workload Layout ======

==== VMs Hosted on PVE-NAS ====

^ VM ID ^ Name ^ Purpose ^
| 100 | web | TorresVault home page |
| 101 | Kuma | Uptime monitoring |
| 102 | next | Nextcloud |
| 103 | immich | Photo/video backup |
| 104 | jellyfin | Media server |
| 105 | pbs | Backup server |
| 106 | n8n | Automations |
| 107 | npm | Reverse proxy |
| 108 | truenas | Core storage |
| 110 | Prometheus | Monitoring |
| 116 | wiki | DokuWiki |
| 112/113/114 | iperf-vlan* | VLAN / bandwidth lab VMs |

Everything is now consolidated on this single host.

----
====== 6. Backup Strategy ======

==== Nightly PBS Backup Jobs ====

Backed up nightly:

  * Core services (web, Nextcloud, Immich, Jellyfin)
  * Monitoring stack
  * Wiki
  * n8n
  * NPM
  * Portainer
  * All iperf lab images

==== Excluded ====

  * **TrueNAS VM** (contains the datastore)
  * **PBS VM** (cannot back itself up)
  * **VMs with external data stores** (e.g., Nextcloud files on TrueNAS)

==== Restore Flow ====

  - In PBS: pick a snapshot
  - Restore to local-lvm or ZFS
  - Boot the VM
  - Validate with service health checks

----
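The same restore flow can be driven from the PVE-NAS shell. A hedged sketch; the storage name, VM ID, and snapshot timestamp below are illustrative, so list the real ones first:

<code bash>
# List available backups on the PBS-backed storage (name assumed to be 'pbs-main')
pvesm list pbs-main

# Restore a chosen snapshot as VM 103, placing disks on local-lvm
qmrestore pbs-main:backup/vm/103/2026-02-20T02:00:00Z 103 --storage local-lvm

# Boot and verify
qm start 103
qm status 103
</code>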
====== 7. Monitoring ======

The monitoring stack includes:

==== Node-level ====

  * Proxmox UI (graphs)
  * ZFS ARC graphs
  * IO delay graphs

==== Service-level ====

  * Prometheus (metrics)
  * Grafana dashboards
  * Kuma for ping/HTTP checks

==== Storage-level ====

  * TrueNAS SMART monitoring
  * PBS datastore usage
  * Verification/prune job history

==== Network-level ====

  * HA sensors & automations

----
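If the Prometheus VM pulls node metrics through a PVE exporter (for example ''prometheus-pve-exporter'', whose conventional port is 9221 — an assumption here, since this page doesn't say which exporter is in use), the scrape job would look roughly like:

<code>
# prometheus.yml (sketch; exporter host/port and module name are assumptions)
scrape_configs:
  - job_name: 'pve'
    static_configs:
      - targets: ['192.168.1.153']      # PVE-NAS, rewritten to __param_target below
    metrics_path: /pve
    params:
      module: [default]
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 'prometheus-vm:9221'   # hypothetical exporter address
</code>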
====== 8. Operations ======

==== Power-Down Order ====

  - Apps (Immich, Nextcloud, NPM, web, wiki)
  - Monitoring (Kuma, Prometheus)
  - PBS
  - TrueNAS
  - PVE-NAS

==== Power-Up Order ====

  - PVE-NAS
  - TrueNAS
  - PBS
  - Core apps
  - Monitoring

This order ensures storage is ready before PBS, and PBS is ready before the apps that depend on it.

==== Updating Proxmox ====

<code bash>
apt update && apt full-upgrade
reboot
</code>

----
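The power-up order above can be enforced automatically with Proxmox's per-VM startup ordering (lower ''order'' boots first; ''up'' adds a delay before the next VM starts). A sketch of the relevant config lines; the delay values are guesses to tune:

<code>
# Excerpts from /etc/pve/qemu-server/*.conf (sketch)

# 108.conf -- TrueNAS boots first; give pools/shares time to come up
onboot: 1
startup: order=1,up=120

# 105.conf -- PBS after storage is available
onboot: 1
startup: order=2,up=30

# app VMs (e.g. 102.conf, 103.conf) next; monitoring last
onboot: 1
startup: order=3
</code>

The same values can be set per VM in the UI under Options → Start/Shutdown order, or via ''qm set <vmid> --startup order=N,up=S''.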
====== 9. Risks & Constraints ======

  * **Single-node setup** (no HA)
  * **TrueNAS + PBS + all VMs on the same hardware** = consolidated risk
  * **No shared storage**
  * **Heavy workloads can spike RAM** (currently ~65% used at steady state)

These are acceptable trade-offs for home lab usage.

----
====== 10. Future ======

The following items are out of scope for this document and tracked on the roadmap:

  * Dedicated backup NAS
  * Add the **UM890 Pro mini-PC cluster**
  * Add the **Jarvis AI GPU node**
  * Scale out the TrueNAS pool to 10–11 SSDs
  * Offload PBS to dedicated hardware
  * Flex 10G / 10 GbE storage networking

The roadmap page will detail this further.

----

proxmox/cluster.txt · Last modified: by 192.168.1.189