= Hosted Services =
{| class="wikitable"
|+ On gpubox
|-
! Hostname:Port !! Description
|-
| [https://gpubox.local:8006/ gpubox.local:8006] || Proxmox admin
|-
| [http://dockerhost.local:3000/ dockerhost.local:3000] || Open WebUI (to play with LLMs)
|-
| [https://ipmi-compute-2-171.local/ ipmi-compute-2-171.local] || IPMI
|}


= gpubox Setup =

== Bare Metal Configuration ==

=== IPMI Setup ===
* Access IPMI via <code>ipmitool</code> with hostname <code>ipmi-compute-2-171.local</code>


Example commands:
<pre>
$ ipmitool -H ipmi-compute-2-171.local -U ADMIN -P pwd power status
Chassis Power is on
$ ipmitool -H ipmi-compute-2-171.local  -U ADMIN -P pwd dcmi power reading
[shows electrical power presently being consumed by system]
</pre>
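
A few more standard <code>ipmitool</code> subcommands that come in handy (same host and credentials as above):
<pre>
$ ipmitool -H ipmi-compute-2-171.local -U ADMIN -P pwd power cycle      # hard power-cycle the box
$ ipmitool -H ipmi-compute-2-171.local -U ADMIN -P pwd sensor list      # temperatures, fans, voltages
$ ipmitool -I lanplus -H ipmi-compute-2-171.local -U ADMIN -P pwd sol activate   # serial-over-LAN console
</pre>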


=== Debian 13 and Proxmox 9 Installation ===
* Debian 13 install: https://cdimage.debian.org/debian-cd/current/amd64/iso-cd/debian-13.3.0-amd64-netinst.iso
* Proxmox VE on that: https://pve.proxmox.com/wiki/Install_Proxmox_VE_on_Debian_13_Trixie
* Let it use DHCP to grab an IP address; I changed it later when I set up vmbr0
* Hostname: gpubox
* Proxmox web UI available at: https://gpubox.local:8006
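
Roughly what the linked Proxmox wiki page walks through (a sketch from memory — follow the wiki page for the exact repo line and signing key; newer guides use the deb822 <code>.sources</code> format instead):
<pre>
# Add the Proxmox VE no-subscription repo for Debian 13 "trixie"
echo "deb [arch=amd64] http://download.proxmox.com/debian/pve trixie pve-no-subscription" \
  > /etc/apt/sources.list.d/pve-install-repo.list
wget https://enterprise.proxmox.com/debian/proxmox-release-trixie.gpg \
  -O /etc/apt/trusted.gpg.d/proxmox-release-trixie.gpg
apt update && apt full-upgrade
apt install proxmox-ve postfix open-iscsi
</pre>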


== GPU Passthrough Configuration ==

=== BIOS/UEFI ===
* Enable VT-d (Virtualization Technology for Directed I/O) in BIOS on gpubox
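
With VT-d on in firmware, the kernel also needs the IOMMU enabled on its command line (Intel assumed here; a generic check, not recorded from this box):
<pre>
# /etc/default/grub:
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt"
# then: update-grub && reboot

# Verify after reboot -- should mention DMAR tables / IOMMU enabled:
dmesg | grep -e DMAR -e IOMMU
</pre>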
=== Identify GPUs ===
Use the vendor ID (<code>10de</code>) and device ID (<code>1b06</code>) to identify GPUs. Both video cards and their associated audio devices will show up.


<pre>
$ lspci -nnk | grep -A 3 'VGA'
04:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti Rev. A] [10de:1e07] (rev a1)
        Subsystem: NVIDIA Corporation Device [10de:12a4]
        Kernel modules: nvidiafb, nouveau
04:00.1 Audio device [0403]: NVIDIA Corporation TU102 High Definition Audio Controller [10de:10f7] (rev a1)
--
05:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] [10de:1b06] (rev a1)
        Subsystem: PNY Device [196e:1213]
        Kernel modules: nvidiafb, nouveau
05:00.1 Audio device [0403]: NVIDIA Corporation GP102 HDMI Audio Controller [10de:10ef] (rev a1)
--
08:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] [10de:1b06] (rev a1)
        Subsystem: Micro-Star International Co., Ltd. [MSI] Device [1462:3609]
        Kernel modules: nvidiafb, nouveau
08:00.1 Audio device [0403]: NVIDIA Corporation GP102 HDMI Audio Controller [10de:10ef] (rev a1)
--
09:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] [10de:1b06] (rev a1)
        Subsystem: ZOTAC International (MCO) Ltd. Device [19da:1470]
        Kernel modules: nvidiafb, nouveau
09:00.1 Audio device [0403]: NVIDIA Corporation GP102 HDMI Audio Controller [10de:10ef] (rev a1)
--
0c:00.0 VGA compatible controller [0300]: ASPEED Technology, Inc. ASPEED Graphics Family [1a03:2000] (rev 30)
        Subsystem: Super Micro Computer Inc Device [15d9:0892]
        Kernel driver in use: ast
        Kernel modules: ast
--
84:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] [10de:1e04] (rev a1)
        Subsystem: Gigabyte Technology Co., Ltd Device [1458:37c4]
        Kernel modules: nvidiafb, nouveau
84:00.1 Audio device [0403]: NVIDIA Corporation TU102 High Definition Audio Controller [10de:10f7] (rev a1)
--
85:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1080] [10de:1b80] (rev a1)
        Subsystem: eVga.com. Corp. Device [3842:6180]
        Kernel modules: nvidiafb, nouveau
85:00.1 Audio device [0403]: NVIDIA Corporation GP104 High Definition Audio Controller [10de:10f0] (rev a1)
--
88:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] [10de:1b06] (rev a1)
        Subsystem: ZOTAC International (MCO) Ltd. Device [19da:1470]
        Kernel modules: nvidiafb, nouveau
88:00.1 Audio device [0403]: NVIDIA Corporation GP102 HDMI Audio Controller [10de:10ef] (rev a1)
--
89:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] [10de:1b06] (rev a1)
        Subsystem: Micro-Star International Co., Ltd. [MSI] Device [1462:3609]
        Kernel modules: nvidiafb, nouveau
89:00.1 Audio device [0403]: NVIDIA Corporation GP102 HDMI Audio Controller [10de:10ef] (rev a1)
</pre>
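
Before passing a card through, check that it (and its audio function) lands in a sane IOMMU group — ideally each <code>XX:00.0</code>/<code>XX:00.1</code> pair shares a group with nothing else in it (generic check):
<pre>
find /sys/kernel/iommu_groups/ -type l
# prints one line per device, e.g. /sys/kernel/iommu_groups/24/devices/0000:05:00.0
</pre>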


=== VFIO Modules ===
* Create <code>/etc/modules-load.d/vfio.conf</code> with:
<pre>
vfio
vfio_iommu_type1
vfio_pci
vfio_virqfd
</pre>
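
To make <code>vfio-pci</code> claim the NVIDIA functions at boot instead of nouveau, the usual Proxmox recipe is to pin them by vendor:device ID (a sketch — IDs taken from the lspci output above; list only the ones you actually pass through):
<pre>
# /etc/modprobe.d/vfio.conf
options vfio-pci ids=10de:1b06,10de:10ef,10de:1e07,10de:10f7
softdep nouveau pre: vfio-pci

# rebuild the initramfs so this applies at boot, then reboot:
update-initramfs -u -k all
</pre>
Note: on newer kernels (including Debian 13's) the <code>vfio_virqfd</code> functionality was folded into the core <code>vfio</code> module, so that line in <code>vfio.conf</code> may fail to load; if it does, it can be dropped.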


== NVIDIA Drivers on Host ==
We did have to install some drivers on gpubox, but we installed them again later on the VMs. This confuses me.
=== Edit /etc/apt/sources.list ===
<pre>
sed -i 's/main/main non-free contrib/g' /etc/apt/sources.list
apt update
apt install -y nvidia-driver nvidia-kernel-dkms
</pre>
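
A quick sanity check afterwards — and a likely explanation for the host-vs-VM driver confusion above: a card bound to <code>vfio-pci</code> is invisible to the host's nvidia driver, so the host install only matters for cards the host keeps for itself:
<pre>
nvidia-smi                        # lists GPUs the host driver can see
lspci -nnk | grep -A 3 NVIDIA     # "Kernel driver in use:" shows nvidia vs. vfio-pci per card
</pre>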


== VM Templates & Cloning ==

=== Template VM ===
* Upload <code>debian-13.3.0-amd64-netinst.iso</code> to storage through the Proxmox web UI
* Create a minimal Debian 13 template
** <code>apt install -y ufw fail2ban curl git zsh sudo net-tools</code>
** <code>sudo apt update && sudo apt full-upgrade -y</code>
* Make a user called <code>deb</code> with sudo
* Convert to template (Proxmox: VM > Convert to Template) with name: <code>debian13-template</code>
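
The same conversion from the Proxmox CLI, if you'd rather not click through the web UI (<code>qm</code> ships with Proxmox; VM ID 9000 is just an example):
<pre>
qm set 9000 --name debian13-template
qm template 9000    # marks VM 9000 as a template
</pre>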


=== Clone VMs ===
* Clone the template for <code>ollama-2080</code> and future VMs that will house AI models
* '''Pass GPU''': In VM settings, go to "Hardware" > "PCI" > "Raw" and select the GPU (use <code>lspci</code> IDs)
* Don't forget to change the new cloned VM's hostname
** <code>sudo hostnamectl set-hostname ollama2080 --static</code>
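
CLI sketch of the clone-and-passthrough steps (IDs are examples: 9000 = the template from above, 101 = the new VM; 05:00 is one of the 1080 Ti addresses from the lspci output):
<pre>
qm clone 9000 101 --name ollama-2080 --full
qm set 101 --machine q35                # pcie=1 below requires the q35 machine type
qm set 101 --hostpci0 05:00,pcie=1      # passes both 05:00.0 (GPU) and 05:00.1 (audio)
</pre>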


== Specific VM Configurations ==

=== ollama-2080 ===
* Install nvidia drivers: https://www.xda-developers.com/nvidia-stopped-supporting-my-gpu-so-i-started-self-hosting-llms-with-it/
** Pin the driver version so you don't have to re-run the nvidia installer every time the kernel gets updated
* Install ollama with <code>curl -fsSL https://ollama.com/install.sh | sh</code>
* Use ollama to pull and run deepseek-r1:8b
* <code>sudo ufw allow from 10.0.0.0/24 to any port 11434 proto tcp</code>
* Verify: http://ollama.local:11434/ should show the message <code>Ollama is running.</code>
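
For that URL to answer from other machines, ollama has to listen on more than localhost. The usual systemd override is sketched below (standard ollama setup, assumed rather than recorded here), followed by pulling the model:
<pre>
# /etc/systemd/system/ollama.service.d/override.conf:
#   [Service]
#   Environment="OLLAMA_HOST=0.0.0.0"
sudo systemctl daemon-reload && sudo systemctl restart ollama

ollama pull deepseek-r1:8b
ollama run deepseek-r1:8b "hello"
</pre>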


=== imgtotext ===
* Install ollama as above
* <code>ollama run hf.co/noctrex/ZwZ-8B-GGUF:Q8_0</code> from the page https://huggingface.co/noctrex/ZwZ-8B-GGUF (I pressed the image-to-text tag and looked at trending models)
* http://imgtotext.local:11434/ should show ollama is running
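
To actually send an image to the model, ollama's generate endpoint takes base64-encoded images (standard ollama REST API; <code>cat.jpg</code> is a made-up example file):
<pre>
curl http://imgtotext.local:11434/api/generate -d "{
  \"model\": \"hf.co/noctrex/ZwZ-8B-GGUF:Q8_0\",
  \"prompt\": \"Describe this image.\",
  \"images\": [\"$(base64 -w0 cat.jpg)\"],
  \"stream\": false
}"
</pre>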


=== dockerhost ===
* Install Docker:
<pre>
apt install -y docker.io
systemctl enable --now docker
</pre>
* Add user <code>docker</code> to do docker stuff. Do NOT give <code>docker</code> sudo.

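Concretely, something like this (a sketch; creates a <code>docker</code> login user and adds it to the <code>docker</code> group so it can use the daemon socket without sudo):
<pre>
adduser docker                 # interactive; sets a password, grants no sudo
usermod -aG docker docker      # group membership grants access to /var/run/docker.sock
</pre>
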
==== Install openwebui ====
* As the docker user, make directories <code>~/git/openwebui</code>
* Make a docker compose file at <code>~/git/openwebui/docker-compose.yaml</code>:
<pre>
services:
  open-webui:
    build:
      context: .
      dockerfile: Dockerfile
    image: ghcr.io/open-webui/open-webui:${WEBUI_DOCKER_TAG-main}
    container_name: open-webui
    volumes:
      - open-webui:/app/backend/data
    ports:
      - ${OPEN_WEBUI_PORT-3000}:8080
    environment:
      - 'OLLAMA_BASE_URL=http://ollama-2080.local:11434'
      - 'WEBUI_SECRET_KEY=secretkeyhere'
    extra_hosts:
      - host.docker.internal:host-gateway
    restart: unless-stopped
 
volumes:
  open-webui: {}
</pre>
* Eventually, we'll check this in to git.
* In <code>~/git/openwebui</code>, run <code>docker compose up</code>
** Note: newer docker uses <code>docker compose</code>, not <code>docker-compose</code>
* I had to do some hole-punching in ufw to get open-webui to see ollama2080
* Useful commands:
<pre>
sudo ss -plnt   # Lists ports this machine is listening on
ip -4 a         # Get this machine's IP address on the local network
</pre>
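The exact ufw rules weren't written down; a plausible reconstruction (an assumption — adjust to your network):
<pre>
# on dockerhost: let the LAN reach Open WebUI
sudo ufw allow from 10.0.0.0/24 to any port 3000 proto tcp
# on ollama2080: the port 11434 allow from the ollama-2080 section covers it
</pre>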
 
=== ai-conductor ===
* TBD
 
=== If you're using a 1080 Ti or 1080 ===
<pre>sudo apt purge "*nvidia*"
sudo apt autoremove --purge
</pre>
then reboot.
 
== Key Commands ==
 
<pre>
# Check GPU visibility in host
lspci -k | grep -A 2 "VGA"

# Verify VFIO modules loaded
lsmod | grep vfio

# Test NVIDIA driver
nvidia-smi  # Should show GPU details

# Clone a template in Proxmox
qm clone <source_VM_ID> <new_VM_ID> --name "ollama-2080"
</pre>
 
== Troubleshooting ==


* '''GPU Not Visible''': Ensure VT-d is enabled in BIOS and the GPU is listed in <code>lspci</code>
* '''Driver Issues''': Reinstall <code>nvidia-driver</code> and reboot
* '''Permission Errors''': Add user to <code>docker</code> and <code>kvm</code> groups
