= Hosted Services =
{| class="wikitable"
|+ On gpubox
|-
! Hostname:Port !! Description
|-
| [https://gpubox.local:8006/ gpubox.local:8006] || Proxmox admin
|-
| [http://dockerhost.local:3000/ dockerhost.local:3000] || Open WebUI (to play with LLMs)
|-
| [https://ipmi-compute-2-171.local/ ipmi-compute-2-171.local] || IPMI
|}
= gpubox Setup =
== Bare Metal Configuration ==
=== IPMI Setup ===
* Access IPMI via <code>ipmitool</code> with hostname <code>ipmi-compute-2-171.local</code>
Example commands:
<pre>
$ ipmitool -H ipmi-compute-2-171.local -U ADMIN -P pwd power status
Chassis Power is on
$ ipmitool -H ipmi-compute-2-171.local -U ADMIN -P pwd dcmi power reading
[shows electrical power presently being consumed by the system]
</pre>
=== Debian 13 and Proxmox VE 9 Installation ===
* Debian 13 install: https://cdimage.debian.org/debian-cd/current/amd64/iso-cd/debian-13.3.0-amd64-netinst.iso
* Proxmox VE on top of that: https://pve.proxmox.com/wiki/Install_Proxmox_VE_on_Debian_13_Trixie
* Let the installer use DHCP to grab an IP address; I changed it later when I set up vmbr0
* Hostname: gpubox
* Proxmox web UI available at https://gpubox.local:8006
== GPU Passthrough Configuration ==
=== BIOS/UEFI ===
* Enable VT-d (Virtualization Technology for Directed I/O) in the BIOS on gpubox
=== Identify GPUs ===
Use the vendor ID (<code>10de</code> for NVIDIA) and the device IDs (e.g. <code>1b06</code> for the GTX 1080 Ti) to identify the GPUs. Each video card shows up together with its associated audio device.
<pre>
$ lspci -nnk | grep -A 3 'VGA'
04:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti Rev. A] [10de:1e07] (rev a1)
	Subsystem: NVIDIA Corporation Device [10de:12a4]
	Kernel modules: nvidiafb, nouveau
04:00.1 Audio device [0403]: NVIDIA Corporation TU102 High Definition Audio Controller [10de:10f7] (rev a1)
--
05:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] [10de:1b06] (rev a1)
	Subsystem: PNY Device [196e:1213]
	Kernel modules: nvidiafb, nouveau
05:00.1 Audio device [0403]: NVIDIA Corporation GP102 HDMI Audio Controller [10de:10ef] (rev a1)
--
08:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] [10de:1b06] (rev a1)
	Subsystem: Micro-Star International Co., Ltd. [MSI] Device [1462:3609]
	Kernel modules: nvidiafb, nouveau
08:00.1 Audio device [0403]: NVIDIA Corporation GP102 HDMI Audio Controller [10de:10ef] (rev a1)
--
09:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] [10de:1b06] (rev a1)
	Subsystem: ZOTAC International (MCO) Ltd. Device [19da:1470]
	Kernel modules: nvidiafb, nouveau
09:00.1 Audio device [0403]: NVIDIA Corporation GP102 HDMI Audio Controller [10de:10ef] (rev a1)
--
0c:00.0 VGA compatible controller [0300]: ASPEED Technology, Inc. ASPEED Graphics Family [1a03:2000] (rev 30)
	Subsystem: Super Micro Computer Inc Device [15d9:0892]
	Kernel driver in use: ast
	Kernel modules: ast
--
84:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] [10de:1e04] (rev a1)
	Subsystem: Gigabyte Technology Co., Ltd Device [1458:37c4]
	Kernel modules: nvidiafb, nouveau
84:00.1 Audio device [0403]: NVIDIA Corporation TU102 High Definition Audio Controller [10de:10f7] (rev a1)
--
85:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1080] [10de:1b80] (rev a1)
	Subsystem: eVga.com. Corp. Device [3842:6180]
	Kernel modules: nvidiafb, nouveau
85:00.1 Audio device [0403]: NVIDIA Corporation GP104 High Definition Audio Controller [10de:10f0] (rev a1)
--
88:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] [10de:1b06] (rev a1)
	Subsystem: ZOTAC International (MCO) Ltd. Device [19da:1470]
	Kernel modules: nvidiafb, nouveau
88:00.1 Audio device [0403]: NVIDIA Corporation GP102 HDMI Audio Controller [10de:10ef] (rev a1)
--
89:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] [10de:1b06] (rev a1)
	Subsystem: Micro-Star International Co., Ltd. [MSI] Device [1462:3609]
	Kernel modules: nvidiafb, nouveau
89:00.1 Audio device [0403]: NVIDIA Corporation GP102 HDMI Audio Controller [10de:10ef] (rev a1)
</pre>
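When setting up passthrough it helps to reduce output like the above to just the PCI addresses and vendor:device IDs. A quick sketch with standard tools (the two sample lines are copied from the listing; on a live system you would pipe <code>lspci -nn</code> straight in instead of using the saved variable):

```shell
# Two sample lines captured from `lspci -nnk` above
lspci_out='04:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti Rev. A] [10de:1e07] (rev a1)
05:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] [10de:1b06] (rev a1)'

# Keep only NVIDIA VGA functions; print "address vendor:device"
printf '%s\n' "$lspci_out" |
  grep 'VGA.*NVIDIA' |
  sed -E 's/^([0-9a-f:.]+) .*\[(10de:[0-9a-f]{4})\].*/\1 \2/'
# -> 04:00.0 10de:1e07
# -> 05:00.0 10de:1b06
```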
=== VFIO Modules ===
* Create <code>/etc/modules-load.d/vfio.conf</code> with:
<pre>
vfio
vfio_iommu_type1
vfio_pci
vfio_virqfd
</pre>
* Note: on recent kernels (roughly 6.2+), <code>vfio_virqfd</code> has been folded into the core <code>vfio</code> module and can be omitted.
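Loading the modules alone is not enough for passthrough: the IOMMU also has to be enabled on the kernel command line, and the GPUs should be pinned to vfio-pci so host drivers never claim them. A sketch under assumptions — device IDs taken from the lspci listing above, Debian-default GRUB paths, and an Intel CPU:

```shell
# Enable the IOMMU at boot (use amd_iommu=on instead on AMD systems)
sudo sed -i 's/^GRUB_CMDLINE_LINUX_DEFAULT="quiet"/GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt"/' /etc/default/grub
sudo update-grub

# Pin the NVIDIA video + audio functions to vfio-pci before host drivers bind
echo 'options vfio-pci ids=10de:1e07,10de:1e04,10de:1b06,10de:1b80,10de:10f7,10de:10ef,10de:10f0' |
  sudo tee /etc/modprobe.d/vfio.conf
sudo update-initramfs -u

# Reboot, then check each GPU shows "Kernel driver in use: vfio-pci"
# in `lspci -nnk`
```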
== NVIDIA Drivers on Host ==
We installed some drivers on gpubox itself, then installed them again later inside the VMs. It is not obvious to me that the host install is actually needed when the GPUs are passed through.
=== Edit /etc/apt/sources.list ===
<pre>
sed -i 's/main/main non-free contrib/g' /etc/apt/sources.list
apt update
apt install -y nvidia-driver nvidia-kernel-dkms
</pre>
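A caveat: a default Debian 13 netinst writes deb822-style sources rather than a one-line <code>/etc/apt/sources.list</code>, in which case the sed above finds nothing to change. A sketch of the equivalent edit (the file path is the trixie default; adjust if yours differs):

```shell
# Enable contrib/non-free in deb822-style sources on Debian 13
sudo sed -i 's/^Components: main.*/Components: main contrib non-free non-free-firmware/' \
  /etc/apt/sources.list.d/debian.sources
sudo apt update
```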
== VM Templates & Cloning ==
=== Template VM ===
* Upload <code>debian-13.3.0-amd64-netinst.iso</code> to storage through the Proxmox web UI
* Create a minimal Debian 13 template
** <code>apt install -y ufw fail2ban curl git zsh sudo net-tools</code>
** <code>sudo apt update && sudo apt full-upgrade -y</code>
* Make a user called <code>deb</code> with sudo
* Convert to template (Proxmox: VM > Convert to Template) with the name <code>debian13-template</code>
=== Clone VMs ===
* Clone the template for <code>ollama-2080</code> and future VMs that will house AI models
* '''Pass GPU''': in the VM's settings, go to "Hardware" > "PCI" > "Raw" and select the GPU (use the <code>lspci</code> IDs)
* Don't forget to change the cloned VM's hostname
** <code>sudo hostnamectl set-hostname ollama2080 --static</code>
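The clone-and-passthrough steps above can also be done from the Proxmox host shell with <code>qm</code>. A sketch — the VM IDs and the PCI address 04:00 are placeholders from this setup, and <code>pcie=1</code> assumes the VM uses the q35 machine type:

```shell
# Full-clone the template (assumed VM ID 9000) into a new VM 101
qm clone 9000 101 --name ollama-2080 --full

# Pass through both functions of the GPU at 04:00 (video + HDMI audio)
qm set 101 --hostpci0 0000:04:00,pcie=1
qm start 101
```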
== Specific VM Configurations ==
=== ollama-2080 ===
* Install NVIDIA drivers: https://www.xda-developers.com/nvidia-stopped-supporting-my-gpu-so-i-started-self-hosting-llms-with-it/
** Pin the driver version so you don't have to re-run the NVIDIA installer every time the kernel gets updated
* Install Ollama with <code>curl -fsSL https://ollama.com/install.sh | sh</code>
* Use Ollama to pull and run deepseek-r1:8b
* <code>sudo ufw allow from 10.0.0.0/24 to any port 11434 proto tcp</code>
* Verify: http://ollama.local:11434/ should show the message "Ollama is running"
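If port 11434 is reachable only from the VM itself, note that by default the Ollama server listens on 127.0.0.1, so the ufw rule alone doesn't expose it to other VMs. One way to change that is a systemd override — a sketch, assuming the install script created <code>ollama.service</code> (it normally does):

```shell
# Make Ollama listen on all interfaces, not just localhost
sudo mkdir -p /etc/systemd/system/ollama.service.d
printf '[Service]\nEnvironment="OLLAMA_HOST=0.0.0.0:11434"\n' |
  sudo tee /etc/systemd/system/ollama.service.d/override.conf
sudo systemctl daemon-reload
sudo systemctl restart ollama
```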
=== imgtotext ===
* Install Ollama as above
* <code>ollama run hf.co/noctrex/ZwZ-8B-GGUF:Q8_0</code>, from the page https://huggingface.co/noctrex/ZwZ-8B-GGUF (I clicked the image-to-text tag and looked at trending models)
* http://imgtotext.local:11434/ should show that Ollama is running
=== dockerhost ===
* Install Docker:
<pre>
apt install -y docker.io
systemctl enable --now docker
</pre>
* Add a user <code>docker</code> to do Docker work. Do NOT give <code>docker</code> sudo.
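A sketch of creating that user — the name comes from the bullet above; membership in the <code>docker</code> group (which the package creates) is what lets it talk to the daemon, and it deliberately gets no sudo:

```shell
# Create the unprivileged operator user and grant Docker daemon access
sudo useradd -m -s /bin/bash docker
sudo usermod -aG docker docker
```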
==== Install openwebui ====
* As the docker user, make the directory <code>~/git/openwebui</code>
* Make a Docker Compose file at <code>~/git/openwebui/docker-compose.yaml</code>:
<pre>
services:
  open-webui:
    build:
      context: .
      dockerfile: Dockerfile
    image: ghcr.io/open-webui/open-webui:${WEBUI_DOCKER_TAG-main}
    container_name: open-webui
    volumes:
      - open-webui:/app/backend/data
    ports:
      - ${OPEN_WEBUI_PORT-3000}:8080
    environment:
      - 'OLLAMA_BASE_URL=http://ollama-2080.local:11434'
      - 'WEBUI_SECRET_KEY=secretkeyhere'
    extra_hosts:
      - host.docker.internal:host-gateway
    restart: unless-stopped

volumes:
  open-webui: {}
</pre>
** Eventually, we'll check this in to git.
* In <code>~/git/openwebui</code>, run <code>docker compose up</code>
** Note: newer Docker uses <code>docker compose</code>, not <code>docker-compose</code>
* I had to do some hole-punching in ufw to get open-webui to see ollama2080
* Useful commands:
<pre>
sudo ss -plnt   # List the ports this machine is listening on
ip -4 a         # Get this machine's IP address on the local network
</pre>
=== ai-conductor ===
* TBD
=== If you're using a 1080 Ti or 1080 ===
<pre>
sudo apt purge "*nvidia*"
sudo apt autoremove --purge
</pre>
then reboot.
== Key Commands ==
<pre>
# Check GPU visibility in host
lspci -k | grep -A 2 "VGA"

# Verify VFIO modules loaded
lsmod | grep vfio

# Test NVIDIA driver
nvidia-smi  # Should show GPU details

# Clone a template in Proxmox
qm clone <source_VM_ID> <new_VM_ID> --name "ollama-2080"
</pre>
== Troubleshooting ==
* '''GPU Not Visible''': ensure VT-d is enabled in the BIOS and the GPU is listed in <code>lspci</code>
* '''Driver Issues''': reinstall <code>nvidia-driver</code> and reboot
* '''Permission Errors''': add the user to the <code>docker</code> and <code>kvm</code> groups
Latest revision as of 22:58, 16 February 2026