From Sudo Room
Revision as of 21:33, 16 February 2026


gpubox Setup Guide

Bare Metal Configuration

IPMI Setup

  • Access IPMI via ipmitool, using hostname ipmi-compute-2-171.local

Example commands:

$ ipmitool -H ipmi-compute-2-171.local -U ADMIN -P pwd power status 
Chassis Power is on
$ ipmitool -H ipmi-compute-2-171.local  -U ADMIN -P pwd dcmi power reading
[shows electrical power presently being consumed by system]
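Other ipmitool subcommands that tend to be useful on this kind of board, sketched with the same hostname and credentials as above (power cycle is disruptive — use it deliberately):

```
$ ipmitool -H ipmi-compute-2-171.local -U ADMIN -P pwd chassis power cycle
$ ipmitool -H ipmi-compute-2-171.local -U ADMIN -P pwd sel list      # System Event Log
$ ipmitool -H ipmi-compute-2-171.local -U ADMIN -P pwd sensor list   # temps, fans, voltages
```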

Debian 13 and Proxmox 8 Installation

GPU Passthrough Configuration

BIOS/UEFI

  • Enable VT-d (Virtualization Technology for Directed I/O) in BIOS on gpubox

Identify GPUs

Use the vendor ID (`10de` for NVIDIA) and device ID (e.g. `1b06` for the GTX 1080 Ti) to identify GPUs. Each video card and its associated audio device will show up.

$ lspci -nnk | grep -A 3 'VGA'
04:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti Rev. A] [10de:1e07] (rev a1)
        Subsystem: NVIDIA Corporation Device [10de:12a4]
        Kernel modules: nvidiafb, nouveau
04:00.1 Audio device [0403]: NVIDIA Corporation TU102 High Definition Audio Controller [10de:10f7] (rev a1)
--
05:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] [10de:1b06] (rev a1)
        Subsystem: PNY Device [196e:1213]
        Kernel modules: nvidiafb, nouveau
05:00.1 Audio device [0403]: NVIDIA Corporation GP102 HDMI Audio Controller [10de:10ef] (rev a1)
--
08:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] [10de:1b06] (rev a1)
        Subsystem: Micro-Star International Co., Ltd. [MSI] Device [1462:3609]
        Kernel modules: nvidiafb, nouveau
08:00.1 Audio device [0403]: NVIDIA Corporation GP102 HDMI Audio Controller [10de:10ef] (rev a1)
--
09:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] [10de:1b06] (rev a1)
        Subsystem: ZOTAC International (MCO) Ltd. Device [19da:1470]
        Kernel modules: nvidiafb, nouveau
09:00.1 Audio device [0403]: NVIDIA Corporation GP102 HDMI Audio Controller [10de:10ef] (rev a1)
--
0c:00.0 VGA compatible controller [0300]: ASPEED Technology, Inc. ASPEED Graphics Family [1a03:2000] (rev 30)
        Subsystem: Super Micro Computer Inc Device [15d9:0892]
        Kernel driver in use: ast
        Kernel modules: ast
--
84:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] [10de:1e04] (rev a1)
        Subsystem: Gigabyte Technology Co., Ltd Device [1458:37c4]
        Kernel modules: nvidiafb, nouveau
84:00.1 Audio device [0403]: NVIDIA Corporation TU102 High Definition Audio Controller [10de:10f7] (rev a1)
--
85:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1080] [10de:1b80] (rev a1)
        Subsystem: eVga.com. Corp. Device [3842:6180]
        Kernel modules: nvidiafb, nouveau
85:00.1 Audio device [0403]: NVIDIA Corporation GP104 High Definition Audio Controller [10de:10f0] (rev a1)
--
88:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] [10de:1b06] (rev a1)
        Subsystem: ZOTAC International (MCO) Ltd. Device [19da:1470]
        Kernel modules: nvidiafb, nouveau
88:00.1 Audio device [0403]: NVIDIA Corporation GP102 HDMI Audio Controller [10de:10ef] (rev a1)
--
89:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] [10de:1b06] (rev a1)
        Subsystem: Micro-Star International Co., Ltd. [MSI] Device [1462:3609]
        Kernel modules: nvidiafb, nouveau
89:00.1 Audio device [0403]: NVIDIA Corporation GP102 HDMI Audio Controller [10de:10ef] (rev a1)
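The `[vendor:device]` pairs in this output are exactly what vfio-pci wants later. A small sketch of extracting them with standard tools — `/tmp/lspci-sample.txt` here is a two-line subset of the output above standing in for piping `lspci -nn` directly on the real host:

```shell
# Collect the NVIDIA [vendor:device] pairs from saved `lspci -nn` output
# into a comma-separated list suitable for a vfio-pci ids= option.
cat > /tmp/lspci-sample.txt <<'EOF'
05:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] [10de:1b06] (rev a1)
05:00.1 Audio device [0403]: NVIDIA Corporation GP102 HDMI Audio Controller [10de:10ef] (rev a1)
EOF
# Match the bracketed IDs, strip the brackets, dedupe, join with commas.
grep -o '\[10de:[0-9a-f]*\]' /tmp/lspci-sample.txt | tr -d '[]' | sort -u | paste -sd, -
# -> 10de:10ef,10de:1b06
```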

VFIO Modules

  • Create `/etc/modules-load.d/vfio.conf` with:
vfio
vfio_iommu_type1
vfio_pci
vfio_virqfd
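Loading the modules alone does not bind the GPUs to vfio-pci. One common approach is a modprobe config fragment — a sketch, with the IDs here being the 1080 Ti video/audio pair from the lspci output above (adjust for the cards you actually pass through):

```
# /etc/modprobe.d/vfio.conf
# Claim these vendor:device IDs for vfio-pci at boot (1080 Ti + its audio function)
options vfio-pci ids=10de:1b06,10de:10ef
# Make sure nouveau doesn't grab the cards first
softdep nouveau pre: vfio-pci
```

After editing, run update-initramfs -u and reboot; lspci -nnk should then show "Kernel driver in use: vfio-pci" for those devices.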

NVIDIA Drivers on Host

We installed NVIDIA drivers on gpubox itself, and then installed them again later inside the VMs. It is not yet clear to us whether the host-side install is actually needed when the GPUs are passed through.

Edit /etc/apt/sources.list

sed -i 's/main/main non-free contrib/g' /etc/apt/sources.list
apt update
apt install -y nvidia-driver nvidia-kernel-dkms

VM Templates & Cloning

Template VM

  • Upload debian-13.3.0-amd64-netinst.iso to storage through the Proxmox web UI
  • Create a minimal Debian 13 template
    • apt install -y ufw fail2ban curl git zsh sudo
    • sudo apt update && sudo apt full-upgrade -y
  • Make a user called deb with sudo
  • Convert to template (Proxmox: VM > Convert to Template) with name: debian13-template

Clone VMs

  • Clone the template for ollama-2080 and future VMs that will house AI models
  • **Pass GPU**: In VM settings, go to "Hardware" > "PCI" > "Raw" and select the GPU (use lspci IDs)
  • Don't forget to change the new cloned VM's hostname
    • sudo hostnamectl set-hostname ollama2080 --static
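The web-UI steps above can also be done from the Proxmox host shell. A sketch — the template VM ID 9000 and new VM ID 201 are placeholder assumptions, and 05:00 is one of the 1080 Ti addresses from the lspci output:

```
# Full clone of the template
qm clone 9000 201 --name ollama-2080 --full
# Attach the GPU (05:00.0 and its audio function 05:00.1) as a PCIe device
qm set 201 -hostpci0 0000:05:00,pcie=1
qm start 201
```

Note that pcie=1 requires the VM to use the q35 machine type.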

Specific VM Configurations

ollama-2080

  • Install ollama with curl -fsSL https://ollama.com/install.sh | sh
  • Use ollama to pull and run DeepSeek-R1 8B
  • Verify: http://ollama.local:11434/ should show the message Ollama is running.
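The pull/run step, sketched — deepseek-r1:8b is the tag the Ollama library uses for the 8B distill, adjust if it has changed:

```
$ ollama pull deepseek-r1:8b
$ ollama run deepseek-r1:8b "Say hello in one sentence."
$ curl http://ollama.local:11434/    # should print: Ollama is running
```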

dockerhost

  • Install Docker:

```
apt install -y docker.io
systemctl enable --now docker
```

  • Add user docker to do docker stuff. Do NOT give docker sudo.

Install openwebui

  • As the docker user, make directories ~/git/openwebui
  • Make a docker compose file at ~/git/openwebui/docker-compose.yaml:

services:
  open-webui:
    build:
      context: .
      dockerfile: Dockerfile
    image: ghcr.io/open-webui/open-webui:${WEBUI_DOCKER_TAG-main}
    container_name: open-webui
    volumes:
      - open-webui:/app/backend/data
    ports:
      - ${OPEN_WEBUI_PORT-3000}:8080
    environment:
      - 'OLLAMA_BASE_URL=http://ollama-2080.local:11434'
      - 'WEBUI_SECRET_KEY=secretkeyhere'
    extra_hosts:
      - host.docker.internal:host-gateway
    restart: unless-stopped

volumes:
  open-webui: {}

    • Eventually, we'll check this in to git.
  • In ~/git/openwebui, run docker compose up
    • Note: newer docker uses docker compose, not docker-compose
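The ${WEBUI_DOCKER_TAG-main} and ${OPEN_WEBUI_PORT-3000} references in the compose file fall back to main and 3000 when unset; docker compose also reads overrides from a .env file next to the compose file. A sketch (the values are just the defaults spelled out):

```
# ~/git/openwebui/.env — read automatically by docker compose
WEBUI_DOCKER_TAG=main
OPEN_WEBUI_PORT=3000
```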
  • Useful commands
    • sudo ss -plnt # Shows ports this machine is listening on
    • ip -4 a # Get this machine's IP address on the local network

ai-conductor

  • TBD

Key Commands

```
# Check GPU visibility on the host
lspci -k | grep -A 2 "VGA"

# Verify VFIO modules loaded
lsmod | grep vfio

# Test NVIDIA driver
nvidia-smi  # Should show GPU details

# Clone a template in Proxmox
qm clone <source_VM_ID> <new_VM_ID> --name "ollama-2080"
```

Troubleshooting

  • **GPU Not Visible**: Ensure VT-d is enabled in BIOS and the GPU is listed in `lspci`
  • **Driver Issues**: Reinstall `nvidia-driver` and reboot
  • **Permission Errors**: Add user to `docker` and `kvm` groups