Proxmox Cluster Architecture (Current State)

This document describes the TorresVault Proxmox cluster as it exists today. It covers hardware, networking, storage, workloads, and backup/restore, and is intended as the authoritative reference for how virtualization is implemented in TorresVault 2.0.

Future redesigns (new NAS, X570D4U, Mini PC cluster, etc.) will be documented separately on the roadmap page.


1. High-Level Overview

The Proxmox environment is a 2-node cluster with a QDevice, running on older but solid Intel desktop platforms with expanded SATA and NIC capacity.

High-level logical view:

The design intentionally does not use shared storage for HA; instead, VMs are pinned to nodes and protected via image-based backups to PBS.
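The quorum arithmetic behind the QDevice is worth making explicit; a minimal sketch in plain shell (no Proxmox commands involved):

```shell
# Each full Proxmox node carries one corosync vote; the external
# QDevice contributes a third vote as a tie-breaker.
node_votes=2
qdevice_votes=1
total=$((node_votes + qdevice_votes))

# Quorum is a strict majority of the total votes.
quorum=$((total / 2 + 1))

echo "total votes: $total, quorum: $quorum"
# With 3 votes and a quorum of 2, one node plus the QDevice stays
# quorate if the other node fails -- the reason a 2-node cluster
# needs the QDevice at all.
```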


2. Physical Hosts & Hardware

2.1 pve1

Role: General compute node, many core services

Primary role summary:

2.2 pve2

Role: General compute node, media & application workloads

Primary role summary:

2.3 QDevice

3. Network Design

The Proxmox cluster uses:

UniFi VLANs exist on the network side (stark_user, stark_IOT, guest, IOT+, Torres Family Lights); for now, Proxmox sees mostly the flat LAN plus specific lab networks for testing.

3.1 pve1 Network Interfaces

From the Proxmox UI:

Design notes:

3.2 pve2 Network Interfaces

Design notes:

3.3 Logical Topology Diagram (Text)

This separation keeps corosync and cluster traffic off the main LAN and avoids cluster instability if the LAN becomes noisy.
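As an illustrative sketch of how such a separation is expressed, a two-link `corosync.conf` nodelist might look like this (the node addresses and the dedicated subnet here are placeholders, not the actual cluster values):

```
# /etc/pve/corosync.conf (illustrative fragment -- addresses are
# placeholders, not the real cluster configuration)
nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.1    # dedicated cluster link
    ring1_addr: 192.168.1.11  # LAN fallback link
  }
  node {
    name: pve2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.10.10.2
    ring1_addr: 192.168.1.12
  }
}
```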


4. Storage Architecture (Current)

There are three main storage layers:

4.1 Local Storage on pve1

Typical Proxmox storages (names as shown in the UI):

Backed by:

4.2 Local Storage on pve2

pve2's local storage is likewise backed by multiple 1 TB Seagate ST91000640NS disks via the same combination of controllers.

4.3 TrueNAS VM

A virtualized TrueNAS instance provides network storage for the system, including the `pbs-main` datastore used by PBS.

Over time, this VM may be migrated to a dedicated physical NAS, but for now it is virtualized.

4.4 Proxmox Backup Server (PBS) VM

Important backup rule:

This prevents:

PBS instead focuses on backing up critical application VMs only.


5. Workload Layout

The current cluster runs a mix of core services and lab workloads. VM IDs/names:

5.1 VMs on pve1

5.2 VMs on pve2

These assignments are not HA; VMs are pinned to nodes and protected by PBS backups.


6. Backup & Restore Strategy

Backups are handled by the PBS VM (ID 105), writing into datastore `pbs-main` hosted on TrueNAS.
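As a sketch of how such a job is declared, a PBS-targeted entry in `/etc/pve/jobs.cfg` looks roughly like this (the schedule, VM IDs, storage name, and retention values are illustrative, not the live configuration):

```
# /etc/pve/jobs.cfg (illustrative fragment -- values are placeholders)
vzdump: backup-pbs-main
        schedule daily
        storage pbs-main
        vmid 101,102,103
        mode snapshot
        enabled 1
        prune-backups keep-daily=7,keep-weekly=4
```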

Key points:

6.1 Rationale for Exclusions

Backing up the TrueNAS or PBS VMs through PBS would write backup data into the same existing datastore that holds it, creating a circular dependency.

Instead, PBS backup jobs focus on stateless or easily rebuildable VMs where immutable data is stored externally (e.g., on TrueNAS, Nextcloud data, or other locations).

6.2 Restore Process (Operational Runbook)

Scenario: single VM failure

1. Identify affected VM in Proxmox UI.
2. In PBS UI:
   * Go to **Datastore → pbs-main → Content → VM group**.
   * Select the latest successful backup.
3. Choose **Restore**:
   * Target node: original host (or alternate host if needed)
   * Disk storage: appropriate local storage (`local-lvm`, `apps-pool`, etc.)
4. Start VM in Proxmox and validate:
   * Application health checks (web UI, API, etc.)
   * Network connectivity (LAN, DNS, etc.)
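The UI steps above map onto the `qmrestore` CLI as well. A dry-run sketch (the VM ID, snapshot timestamp, and storage name are hypothetical; the `run` wrapper echoes commands instead of executing them, so the script is safe to run anywhere):

```shell
# Restoring a VM from PBS on the command line. The VM ID, archive
# timestamp, and target storage below are illustrative placeholders.
VMID=104
STORAGE=local-lvm
ARCHIVE="pbs-main:backup/vm/${VMID}/2024-01-01T00:00:00Z"

# Echo instead of executing, so this sketch is safe to run anywhere.
run() { echo "+ $*"; }

run qmrestore "$ARCHIVE" "$VMID" --storage "$STORAGE"
run qm start "$VMID"
```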

Scenario: node loss (pve1 or pve2)

1. Replace/fix hardware and reinstall Proxmox VE.
2. Rejoin node to `torres-cluster`.
3. Recreate necessary storages pointing at local disks.
4. From PBS, restore VMs to the rebuilt node using the procedure above.

Scenario: PBS VM lost but TrueNAS datastore intact

1. Recreate PBS VM from Proxmox template.
2. Reattach existing `pbs-main` datastore on TrueNAS.
3. PBS will rediscover existing backups.
4. Resume normal operations.
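Reattaching amounts to pointing the new PBS instance at the existing directory; the resulting `/etc/proxmox-backup/datastore.cfg` entry would look roughly like this (the mount path is a placeholder):

```
# /etc/proxmox-backup/datastore.cfg (illustrative fragment)
datastore: pbs-main
        path /mnt/truenas/pbs-main
        gc-schedule daily
```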

7. Monitoring & Observability

Monitoring in TorresVault is layered:

Operational practice:


8. Operational Procedures (Day-to-Day)

8.1 Adding a New VM

1. Decide which node (pve1 vs pve2) based on workload:
   * Storage-heavy → whichever has more free disk
   * Media or GPU heavy (later) → pve2
2. Create VM in Proxmox:
   * Attach to `vmbr0` for LAN access
   * Store disks on `local-lvm`, `apps-pool`, or `VM-pool`
3. Install OS and configure network.
4. In **PBS**, add VM to an existing or new backup group.
5. Verify first backup completes successfully.
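The creation steps above can be sketched on the CLI with `qm create` (the VM ID, name, memory/disk sizes, and storage are hypothetical; commands are echoed rather than executed):

```shell
# CLI equivalent of the VM-creation steps above. The VM ID, name,
# sizes, and storage are illustrative placeholders.
VMID=110

# Echo instead of executing, so this sketch is safe to run anywhere.
run() { echo "+ $*"; }

run qm create "$VMID" --name new-app-vm --memory 4096 --cores 2 \
    --net0 "virtio,bridge=vmbr0" \
    --scsi0 "local-lvm:32"
run qm start "$VMID"
```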

8.2 Maintenance & Patching

Proxmox nodes:

1. Live-migrate or gracefully shut down VMs on the target node if needed.
2. `apt update && apt full-upgrade` on the node (via console or SSH).
3. Reboot node.
4. Verify:
   * Corosync quorum healthy
   * VMs auto-started where expected

PBS & TrueNAS:

8.3 Power-Down / Power-Up Order

Planned maintenance that requires full stack shutdown:

Shutdown order:

1. Application VMs (web, Nextcloud, Immich, Jellyfin, etc.)
2. Monitoring VMs (Kuma, Prometheus)
3. PBS VM
4. TrueNAS VM
5. pve2 node
6. pve1 node (last Proxmox node)
7. Network gear / UPS if necessary

Power-up order:

1. Network gear & UPS
2. pve1 and pve2 nodes
3. TrueNAS VM
4. PBS VM
5. Core apps (web, Nextcloud, Immich, Jellyfin, n8n, NPM, wiki)
6. Monitoring stack (Kuma, Prometheus/Grafana)

This order ensures that storage is ready before PBS, and PBS is ready before any dependent VMs (if any use backup features such as guest-initiated restore).
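The shutdown sequence can be encoded as a small script so the order lives in one place. A sketch (the PBS VM ID is from this document; all other VM IDs are illustrative placeholders, and commands are echoed rather than executed):

```shell
# Shutdown order encoded as data. The PBS VM ID (105) is from this
# document; all other VM IDs are illustrative placeholders.
APP_VMS="101 102 103"      # web, Nextcloud, Immich, Jellyfin, ...
MONITORING_VMS="108 109"   # Kuma, Prometheus
PBS_VM=105
TRUENAS_VM=100

# Echo instead of executing, so this sketch is safe to run anywhere.
run() { echo "+ $*"; }

# Apps first, then monitoring, then PBS, then TrueNAS (storage last).
for vmid in $APP_VMS $MONITORING_VMS $PBS_VM $TRUENAS_VM; do
  run qm shutdown "$vmid" --timeout 120
done

# Finally the nodes themselves: pve2 first, then pve1.
run shutdown -h now   # run on pve2, then on pve1
```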


9. Risks, Constraints & Known Limitations

These are acceptable for a home lab / prosumer environment but are captured here explicitly for future planning.


10. Future Improvements (Pointer to Roadmap)

The following items are out of scope for this document but are tracked on the roadmap:

See: roadmap (to be created).