Gpubox

From Sudo Room

Latest revision as of 22:58, 16 February 2026

Hosted Services

On gpubox

  Hostname:Port                      Description
  https://gpubox.local:8006/         Proxmox admin
  http://dockerhost.local:3000/      Open WebUI (to play with LLMs)
  https://ipmi-compute-2-171.local/  IPMI

gpubox Setup

Bare Metal Configuration

IPMI Setup

  • Access IPMI via ipmitool with hostname ipmi-compute-2-171.local

Example commands:

$ ipmitool -H ipmi-compute-2-171.local -U ADMIN -P pwd power status
Chassis Power is on
$ ipmitool -H ipmi-compute-2-171.local -U ADMIN -P pwd dcmi power reading
[shows electrical power presently being consumed by the system]
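
A few other ipmitool subcommands tend to be useful on this class of box, with the same -H/-U/-P flags as above. A sketch; these are standard ipmitool subcommands, but the output varies by BMC:

```shell
ipmitool -H ipmi-compute-2-171.local -U ADMIN -P pwd chassis power cycle   # hard power cycle
ipmitool -H ipmi-compute-2-171.local -U ADMIN -P pwd sel list              # system event log
ipmitool -H ipmi-compute-2-171.local -U ADMIN -P pwd sensor                # temps, fans, voltages
```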

Debian 13 and Proxmox 8 Installation

GPU Passthrough Configuration

BIOS/UEFI

  • Enable VT-d (Virtualization Technology for Directed I/O) in BIOS on gpubox

Identify GPUs

Use the NVIDIA vendor ID (`10de`) together with the device ID (e.g. `1b06` for the GTX 1080 Ti) to identify GPUs. Each card shows up as two PCI functions: the video controller and its associated audio device.

$ lspci -nnk | grep -A 3 'VGA'
04:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti Rev. A] [10de:1e07] (rev a1)
        Subsystem: NVIDIA Corporation Device [10de:12a4]
        Kernel modules: nvidiafb, nouveau
04:00.1 Audio device [0403]: NVIDIA Corporation TU102 High Definition Audio Controller [10de:10f7] (rev a1)
--
05:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] [10de:1b06] (rev a1)
        Subsystem: PNY Device [196e:1213]
        Kernel modules: nvidiafb, nouveau
05:00.1 Audio device [0403]: NVIDIA Corporation GP102 HDMI Audio Controller [10de:10ef] (rev a1)
--
08:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] [10de:1b06] (rev a1)
        Subsystem: Micro-Star International Co., Ltd. [MSI] Device [1462:3609]
        Kernel modules: nvidiafb, nouveau
08:00.1 Audio device [0403]: NVIDIA Corporation GP102 HDMI Audio Controller [10de:10ef] (rev a1)
--
09:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] [10de:1b06] (rev a1)
        Subsystem: ZOTAC International (MCO) Ltd. Device [19da:1470]
        Kernel modules: nvidiafb, nouveau
09:00.1 Audio device [0403]: NVIDIA Corporation GP102 HDMI Audio Controller [10de:10ef] (rev a1)
--
0c:00.0 VGA compatible controller [0300]: ASPEED Technology, Inc. ASPEED Graphics Family [1a03:2000] (rev 30)
        Subsystem: Super Micro Computer Inc Device [15d9:0892]
        Kernel driver in use: ast
        Kernel modules: ast
--
84:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] [10de:1e04] (rev a1)
        Subsystem: Gigabyte Technology Co., Ltd Device [1458:37c4]
        Kernel modules: nvidiafb, nouveau
84:00.1 Audio device [0403]: NVIDIA Corporation TU102 High Definition Audio Controller [10de:10f7] (rev a1)
--
85:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1080] [10de:1b80] (rev a1)
        Subsystem: eVga.com. Corp. Device [3842:6180]
        Kernel modules: nvidiafb, nouveau
85:00.1 Audio device [0403]: NVIDIA Corporation GP104 High Definition Audio Controller [10de:10f0] (rev a1)
--
88:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] [10de:1b06] (rev a1)
        Subsystem: ZOTAC International (MCO) Ltd. Device [19da:1470]
        Kernel modules: nvidiafb, nouveau
88:00.1 Audio device [0403]: NVIDIA Corporation GP102 HDMI Audio Controller [10de:10ef] (rev a1)
--
89:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] [10de:1b06] (rev a1)
        Subsystem: Micro-Star International Co., Ltd. [MSI] Device [1462:3609]
        Kernel modules: nvidiafb, nouveau
89:00.1 Audio device [0403]: NVIDIA Corporation GP102 HDMI Audio Controller [10de:10ef] (rev a1)
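
To turn a listing like this into the comma-separated ids= string that vfio-pci wants, something along these lines works. A sketch: the heredoc embeds a two-card excerpt of the output above; on the real host you would pipe `lspci -nn` in directly instead:

```shell
# Sample lspci -nn output (excerpt of the listing above).
cat <<'EOF' > /tmp/lspci-sample.txt
05:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] [10de:1b06] (rev a1)
05:00.1 Audio device [0403]: NVIDIA Corporation GP102 HDMI Audio Controller [10de:10ef] (rev a1)
84:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] [10de:1e04] (rev a1)
84:00.1 Audio device [0403]: NVIDIA Corporation TU102 High Definition Audio Controller [10de:10f7] (rev a1)
EOF

# Grab every [10de:xxxx] vendor:device ID, strip brackets, dedupe, join with commas.
grep -o '\[10de:[0-9a-f]\{4\}\]' /tmp/lspci-sample.txt \
  | tr -d '[]' | sort -u | paste -s -d ',' -
# -> 10de:10ef,10de:10f7,10de:1b06,10de:1e04
```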

VFIO Modules

  • Create `/etc/modules-load.d/vfio.conf` with:
vfio
vfio_iommu_type1
vfio_pci
vfio_virqfd
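
On Proxmox hosts it is also common to bind the passthrough cards to vfio-pci by ID via a modprobe options file, then rebuild the initramfs (`update-initramfs -u -k all`) and reboot. A sketch using the 1080 Ti video and audio IDs from the lspci listing; adjust the list to the cards you actually pass through:

```
# /etc/modprobe.d/vfio.conf
# Example IDs: GTX 1080 Ti video (10de:1b06) and its HDMI audio (10de:10ef)
options vfio-pci ids=10de:1b06,10de:10ef
```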

NVIDIA Drivers on Host

We did install NVIDIA drivers on gpubox itself, and then installed them again later inside the VMs; it is not clear whether the host-side install is actually needed for passthrough.

Edit /etc/apt/sources.list

sed -i 's/main/main non-free contrib/g' /etc/apt/sources.list
apt update
apt install -y nvidia-driver nvidia-kernel-dkms

VM Templates & Cloning

Template VM

  • Upload debian-13.3.0-amd64-netinst.iso to storage through the proxmox web ui
  • Create a minimal Debian 13 template
    • apt install -y ufw fail2ban curl git zsh sudo net-tools
    • sudo apt update && sudo apt full-upgrade -y
  • Make a user called deb with sudo
  • Convert to template (Proxmox: VM > Convert to Template) with the name debian13-template

Clone VMs

  • Clone the template for ollama-2080 and future VMs that will house AI models
  • **Pass GPU**: In VM settings, go to "Hardware" > "PCI" > "Raw" and select the GPU (use lspci IDs)
  • Don't forget to change the new cloned VM's hostname
    • sudo hostnamectl set-hostname ollama2080 --static
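
The same clone-and-passthrough steps can be done from the Proxmox host shell with qm. A sketch, where 9000 and 101 are illustrative VM IDs and 04:00 is one of the GPU addresses from the lspci listing (pcie=1 assumes a q35 machine type):

```shell
# Full clone of the Debian template, then attach the GPU (both PCI functions)
qm clone 9000 101 --name ollama-2080 --full
qm set 101 --hostpci0 04:00,pcie=1
qm start 101
```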

Specific VM Configurations

ollama-2080

  • Install nvidia drivers: https://www.xda-developers.com/nvidia-stopped-supporting-my-gpu-so-i-started-self-hosting-llms-with-it/
    • Pin the driver version so you don't have to re-run the nvidia installer every time the kernel gets updated
  • Install ollama with

curl -fsSL https://ollama.com/install.sh | sh

  • Use ollama to pull and run deepseek-r1:8b
  • Open the API port to the LAN:

sudo ufw allow from 10.0.0.0/24 to any port 11434 proto tcp

  • Verify: http://ollama.local:11434/ should show the message "Ollama is running"

imgtotext

  • Install ollama as above
  • Run

ollama run hf.co/noctrex/ZwZ-8B-GGUF:Q8_0

    from https://huggingface.co/noctrex/ZwZ-8B-GGUF (found via the image-to-text tag and trending models)
  • Verify: http://imgtotext.local:11434/ should show "Ollama is running"

dockerhost

  • Install Docker:

apt install -y docker.io
systemctl enable --now docker

  • Add a user docker to do docker stuff. Do NOT give docker sudo.

Install openwebui

  • As the docker user, make directories ~/git/openwebui
  • Make a docker compose file at ~/git/openwebui/docker-compose.yaml:

services:
  open-webui:
    build:
      context: .
      dockerfile: Dockerfile
    image: ghcr.io/open-webui/open-webui:${WEBUI_DOCKER_TAG-main}
    container_name: open-webui
    volumes:
      - open-webui:/app/backend/data
    ports:
      - ${OPEN_WEBUI_PORT-3000}:8080
    environment:
      - 'OLLAMA_BASE_URL=http://ollama-2080.local:11434'
      - 'WEBUI_SECRET_KEY=secretkeyhere'
    extra_hosts:
      - host.docker.internal:host-gateway
    restart: unless-stopped

volumes:
  open-webui: {}

  • Eventually, we'll check this in to git.
  • In ~/git/openwebui, run docker compose up
    • Note: newer docker uses docker compose, not docker-compose
  • I had to do some hole-punching in ufw to get open-webui to see ollama2080
  • Useful commands:

sudo ss -plnt   # Lists ports this machine is listening on
ip -4 a         # Get this machine's IP address on the local network
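
The hole-punching is on the ollama-2080 side: its firewall has to allow dockerhost to reach the API port. A sketch, assuming the LAN is 10.0.0.0/24:

```shell
# On ollama-2080: allow LAN hosts (including dockerhost) to reach the ollama API
sudo ufw allow from 10.0.0.0/24 to any port 11434 proto tcp
sudo ufw status numbered   # confirm the rule is present
```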

ai-conductor

  • TBD

If you're using a 1080 Ti or 1080

sudo apt purge "*nvidia*"
sudo apt autoremove --purge

then reboot.

Key Commands

  1. Check GPU visibility in host

lspci -k | grep -A 2 "VGA"

  2. Verify VFIO modules loaded

lsmod | grep vfio

  3. Test NVIDIA driver

nvidia-smi   # Should show GPU details

  4. Clone a template in Proxmox

qm clone <source_VM_ID> <new_VM_ID> --name "ollama-2080"

Troubleshooting

  • **GPU Not Visible**: Ensure VT-d is enabled in BIOS and the GPU is listed in `lspci`
  • **Driver Issues**: Reinstall `nvidia-driver` and reboot
  • **Permission Errors**: Add user to `docker` and `kvm` groups