= Gpubox =
''From Sudo Room''
= Hosted Services =
{| class="wikitable"
|+ On gpubox
|-
! Hostname:Port !! Description
|-
| [https://gpubox.local:8006/ gpubox.local:8006] || Proxmox admin
|-
| [http://dockerhost.local:3000/ dockerhost.local:3000] || Open WebUI (to play with LLMs)
|-
| [https://ipmi-compute-2-171.local/ ipmi-compute-2-171.local] || IPMI
|}
 
= gpubox Setup =


== Bare Metal Configuration ==


=== IPMI Setup ===
* Access IPMI via <code>ipmitool</code> with hostname <code>ipmi-compute-2-171.local</code>


Example commands:
<pre>
$ ipmitool -H ipmi-compute-2-171.local -U ADMIN -P pwd power status
Chassis Power is on
$ ipmitool -H ipmi-compute-2-171.local  -U ADMIN -P pwd dcmi power reading
[shows electrical power presently being consumed by system]
</pre>
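Beyond power status, a few other standard <code>ipmitool</code> subcommands are handy against the same BMC (a sketch using the host and credentials shown above; nothing here is gpubox-specific):

```shell
# Hard power-cycle the chassis remotely (use with care)
ipmitool -H ipmi-compute-2-171.local -U ADMIN -P pwd power cycle

# Read the System Event Log (ECC errors, thermal events, etc.)
ipmitool -H ipmi-compute-2-171.local -U ADMIN -P pwd sel list

# Dump sensor readings: temperatures, fan speeds, voltages
ipmitool -H ipmi-compute-2-171.local -U ADMIN -P pwd sensor
```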


=== Debian 13 and Proxmox 8 Installation ===
* Debian 13 install: https://cdimage.debian.org/debian-cd/current/amd64/iso-cd/debian-13.3.0-amd64-netinst.iso
* Proxmox VE on top of that: https://pve.proxmox.com/wiki/Install_Proxmox_VE_on_Debian_13_Trixie
* Let it use DHCP to grab an IP address; I changed it later when I set up vmbr0
* Hostname: gpubox
* Proxmox web UI available at https://gpubox.local:8006


== GPU Passthrough Configuration ==


=== BIOS/UEFI ===
* Enable VT-d (Virtualization Technology for Directed I/O) in the BIOS on gpubox


=== Identify GPUs ===
Use the vendor ID (<code>10de</code>) and device ID (e.g. <code>1b06</code>) to identify GPUs. Both the video cards and their associated audio devices will show up.
<pre>
$ lspci -nnk | grep -A 3 'VGA'
04:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti Rev. A] [10de:1e07] (rev a1)
        Subsystem: NVIDIA Corporation Device [10de:12a4]
        Kernel modules: nvidiafb, nouveau
04:00.1 Audio device [0403]: NVIDIA Corporation TU102 High Definition Audio Controller [10de:10f7] (rev a1)
--
05:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] [10de:1b06] (rev a1)
        Subsystem: PNY Device [196e:1213]
        Kernel modules: nvidiafb, nouveau
05:00.1 Audio device [0403]: NVIDIA Corporation GP102 HDMI Audio Controller [10de:10ef] (rev a1)
--
08:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] [10de:1b06] (rev a1)
        Subsystem: Micro-Star International Co., Ltd. [MSI] Device [1462:3609]
        Kernel modules: nvidiafb, nouveau
08:00.1 Audio device [0403]: NVIDIA Corporation GP102 HDMI Audio Controller [10de:10ef] (rev a1)
--
09:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] [10de:1b06] (rev a1)
        Subsystem: ZOTAC International (MCO) Ltd. Device [19da:1470]
        Kernel modules: nvidiafb, nouveau
09:00.1 Audio device [0403]: NVIDIA Corporation GP102 HDMI Audio Controller [10de:10ef] (rev a1)
--
0c:00.0 VGA compatible controller [0300]: ASPEED Technology, Inc. ASPEED Graphics Family [1a03:2000] (rev 30)
        Subsystem: Super Micro Computer Inc Device [15d9:0892]
        Kernel driver in use: ast
        Kernel modules: ast
--
84:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] [10de:1e04] (rev a1)
        Subsystem: Gigabyte Technology Co., Ltd Device [1458:37c4]
        Kernel modules: nvidiafb, nouveau
84:00.1 Audio device [0403]: NVIDIA Corporation TU102 High Definition Audio Controller [10de:10f7] (rev a1)
--
85:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1080] [10de:1b80] (rev a1)
        Subsystem: eVga.com. Corp. Device [3842:6180]
        Kernel modules: nvidiafb, nouveau
85:00.1 Audio device [0403]: NVIDIA Corporation GP104 High Definition Audio Controller [10de:10f0] (rev a1)
--
88:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] [10de:1b06] (rev a1)
        Subsystem: ZOTAC International (MCO) Ltd. Device [19da:1470]
        Kernel modules: nvidiafb, nouveau
88:00.1 Audio device [0403]: NVIDIA Corporation GP102 HDMI Audio Controller [10de:10ef] (rev a1)
--
89:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] [10de:1b06] (rev a1)
        Subsystem: Micro-Star International Co., Ltd. [MSI] Device [1462:3609]
        Kernel modules: nvidiafb, nouveau
89:00.1 Audio device [0403]: NVIDIA Corporation GP102 HDMI Audio Controller [10de:10ef] (rev a1)
</pre>
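What the VFIO step below actually needs from this listing is the set of <code>vendor:device</code> pairs. A small pipeline can extract them; this sketch replays a few lines of the output above through a heredoc, but on the real host you would pipe <code>lspci -nn</code> in directly:

```shell
# Collect unique NVIDIA vendor:device IDs in the comma-separated
# form that vfio-pci's ids= option expects.
lspci_sample() {
cat <<'EOF'
05:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] [10de:1b06] (rev a1)
05:00.1 Audio device [0403]: NVIDIA Corporation GP102 HDMI Audio Controller [10de:10ef] (rev a1)
84:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] [10de:1e04] (rev a1)
EOF
}
# On the real host, replace lspci_sample with: lspci -nn
lspci_sample | grep -o '\[10de:[0-9a-f]*\]' | tr -d '[]' | sort -u | paste -sd, -
# → 10de:10ef,10de:1b06,10de:1e04
```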


=== VFIO Modules ===
* Create <code>/etc/modules-load.d/vfio.conf</code> with:
<pre>
vfio
vfio_iommu_type1
vfio_pci
vfio_virqfd
</pre>
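These notes don't record it, but Proxmox passthrough guides typically also pin the GPU IDs to <code>vfio-pci</code> via a modprobe option so host drivers never claim them at boot. A hypothetical sketch, using IDs from the lspci listing above (note this keeps the host's own NVIDIA driver off those cards, which may conflict with the host-driver section below):

```shell
# Hypothetical: bind these vendor:device IDs to vfio-pci at boot
echo 'options vfio-pci ids=10de:1b06,10de:10ef' > /etc/modprobe.d/vfio.conf
# Rebuild the initramfs so the binding applies early in boot
update-initramfs -u -k all
```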


== NVIDIA Drivers on Host ==
We did have to install some drivers on gpubox, but we installed them again later on the VMs. This confuses me.


=== Edit /etc/apt/sources.list ===
<pre>
sed -i 's/main/main non-free contrib/g' /etc/apt/sources.list
apt update
apt install -y nvidia-driver nvidia-kernel-dkms
</pre>


== VM Templates & Cloning ==


=== Template VM ===
* Upload <code>debian-13.3.0-amd64-netinst.iso</code> to storage through the Proxmox web UI
* Create a minimal Debian 13 template
** <code>apt install -y ufw fail2ban curl git zsh sudo net-tools</code>
** <code>sudo apt update && sudo apt full-upgrade -y</code>
* Make a user called <code>deb</code> with sudo
* Convert to template (Proxmox: VM > Convert to Template) with name <code>debian13-template</code>


=== Clone VMs ===
* Clone the template for <code>ollama-2080</code> and future VMs that will house AI models
* '''Pass GPU''': In VM settings, go to "Hardware" > "PCI" > "Raw" and select the GPU (use <code>lspci</code> IDs)
* Don't forget to change the new cloned VM's hostname:
** <code>sudo hostnamectl set-hostname ollama2080 --static</code>
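The GPU attachment can also be done from the Proxmox CLI instead of the web UI. A sketch, where the VM ID <code>101</code> and the PCI address are hypothetical placeholders to be replaced with values from the lspci listing:

```shell
# Attach the whole device at 05:00 (video + audio function) to VM 101
# as a PCIe device. VM ID and address are placeholders.
qm set 101 -hostpci0 0000:05:00,pcie=1
```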


== Specific VM Configurations ==


=== ollama-2080 ===
* Install NVIDIA drivers: https://www.xda-developers.com/nvidia-stopped-supporting-my-gpu-so-i-started-self-hosting-llms-with-it/
** Pin the driver version so you don't have to re-run the NVIDIA installer every time the kernel gets updated
* Install ollama with <code>curl -fsSL https://ollama.com/install.sh | sh</code>
* Use ollama to pull and run deepseek-r1:8b
* <code>sudo ufw allow from 10.0.0.0/24 to any port 11434 proto tcp</code>
* Verify: http://ollama.local:11434/ should show the message "Ollama is running."
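Beyond checking the root URL, you can exercise the model through ollama's HTTP API. A sketch, with hostname and model taken from the steps above:

```shell
# Request a single non-streaming completion from deepseek-r1:8b
curl -s http://ollama.local:11434/api/generate \
  -d '{"model": "deepseek-r1:8b", "prompt": "Say hi in one word.", "stream": false}'
```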
 
=== imgtotext ===
* Install ollama as above
* <code>ollama run hf.co/noctrex/ZwZ-8B-GGUF:Q8_0</code> from the page https://huggingface.co/noctrex/ZwZ-8B-GGUF (I pressed the image-to-text tag and looked at trending models)
* http://imgtotext.local:11434/ should show that ollama is running


=== dockerhost ===
* Install Docker:
<pre>
apt install -y docker.io
systemctl enable --now docker
</pre>
* Add user <code>docker</code> to do docker stuff. Do NOT give <code>docker</code> sudo.
 
==== Install openwebui ====
* As the docker user, make the directory <code>~/git/openwebui</code>
* Make a docker compose file at <code>~/git/openwebui/docker-compose.yaml</code> (eventually, we'll check this in to git):
<pre>
services:
  open-webui:
    build:
      context: .
      dockerfile: Dockerfile
    image: ghcr.io/open-webui/open-webui:${WEBUI_DOCKER_TAG-main}
    container_name: open-webui
    volumes:
      - open-webui:/app/backend/data
    ports:
      - ${OPEN_WEBUI_PORT-3000}:8080
    environment:
      - 'OLLAMA_BASE_URL=http://ollama-2080.local:11434'
      - 'WEBUI_SECRET_KEY=secretkeyhere'
    extra_hosts:
      - host.docker.internal:host-gateway
    restart: unless-stopped

volumes:
  open-webui: {}
</pre>
* In <code>~/git/openwebui</code>, run <code>docker compose up</code>
** Note: newer docker uses <code>docker compose</code>, not <code>docker-compose</code>
* I had to do some hole-punching in ufw to get open-webui to see ollama2080
* Useful commands:
<pre>
sudo ss -plnt  # List the ports this machine is listening on
ip -4 a        # Get this machine's IP address on the local network
</pre>


=== ai-conductor ===
* TBD
 
=== If you're using a 1080 Ti or 1080 ===
<pre>sudo apt purge "*nvidia*"
sudo apt autoremove --purge
</pre>
then reboot.


''Latest revision as of 22:58, 16 February 2026''

== Key Commands ==
<pre>
# Check GPU visibility on the host
lspci -k | grep -A 2 "VGA"

# Verify VFIO modules are loaded
lsmod | grep vfio

# Test the NVIDIA driver
nvidia-smi  # Should show GPU details

# Clone a template in Proxmox
qm clone <source_VM_ID> <new_VM_ID> --name "ollama-2080"
</pre>

== Troubleshooting ==
* '''GPU Not Visible''': Ensure VT-d is enabled in BIOS and the GPU is listed in <code>lspci</code> output
* '''Driver Issues''': Reinstall <code>nvidia-driver</code> and reboot
* '''Permission Errors''': Add your user to the <code>docker</code> and <code>kvm</code> groups