Self-hosted video transcription that runs entirely on your own GPU. Upload a video, get an accurate transcript. No cloud, no API keys, no data leaving your machine.
Built on faster-whisper with a minimal FastAPI + Alpine.js web UI. Designed for personal use over Tailscale/VPN — upload from any device, process on the machine with the GPU.
- Local only — Whisper runs on your GPU, videos never leave the machine
- Web UI — upload, queue, monitor progress live, read + download transcripts
- Dark / light theme — Linear/Vercel-inspired minimal interface, dark by default
- FIFO queue with cancellation support (one job at a time, GPU-bound)
- Live progress via SSE — real-time updates as segments are transcribed
- Full-text search across all past transcripts (SQLite FTS5; see the sketch after this list)
- Editable names — rename transcripts after the fact
- Keep or delete the source video per job
- Auto language detection — primarily Spanish + English, others supported
- Hallucination cleanup — deterministic post-processing strips the typical `no no no no` / `Yes. Yes. Yes.` Whisper artifacts
- CLI fallback — `transcribe.py` still works standalone for scripted use
- No Node.js — frontend is a single Jinja2 template + Alpine.js + Tailwind via CDN
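
Search is plain SQLite FTS5 under the hood. As a rough sketch of the idea (the database file name and the `transcripts_fts` schema below are made up for illustration and are not the actual schema in `webapp/database.py`):

```python
import sqlite3

# Minimal sketch only: "transcripts.db" and transcripts_fts are illustrative names.
con = sqlite3.connect("transcripts.db")
con.execute("PRAGMA journal_mode=WAL")  # same journal mode the app uses

# One FTS5 virtual table indexing the transcript name and its full text.
con.execute("CREATE VIRTUAL TABLE IF NOT EXISTS transcripts_fts USING fts5(name, body)")

con.execute(
    "INSERT INTO transcripts_fts (name, body) VALUES (?, ?)",
    ("standup-2024-05-01", "we talked about the release schedule and the demo"),
)
con.commit()

# MATCH runs the full-text query; snippet() returns a highlighted excerpt.
rows = con.execute(
    """
    SELECT name, snippet(transcripts_fts, 1, '[', ']', ' ... ', 8)
    FROM transcripts_fts
    WHERE transcripts_fts MATCH ?
    ORDER BY rank
    """,
    ("release",),
).fetchall()
print(rows)  # each hit comes back with the matched term wrapped in [brackets]
```

The `snippet()` helper is how ranked excerpts can be pulled straight out of SQLite without any extra search infrastructure.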
Cloud transcription services are fast and convenient, but they have real downsides: you send your audio to someone else's servers, pay per minute, and trust them with recordings you may not want floating around. For personal content (meetings, voice notes, OBS recordings) a local GPU can transcribe faster than real-time with large-v3 quality — you just need a reasonable UI around it.
This project wraps the standard faster-whisper workflow in a small webapp so you can upload from your laptop/phone over your VPN, queue jobs, and get searchable transcripts without ever hitting a third-party API.
- Windows 11 with WSL2 (Ubuntu distro)
- NVIDIA GPU with drivers + CUDA (WSL2 uses the Windows driver via NVIDIA's CUDA passthrough)
- Python 3.12 inside WSL (`sudo apt install python3.12 python3.12-venv`)
- Git inside WSL
- Tailscale (optional, for access from other devices)
Note: the launcher targets WSL2 on Windows, but the backend is plain FastAPI — it runs on any Linux with CUDA. Swap the `.bat` for a shell script if you want to run it natively.
All commands run inside WSL (not CMD/PowerShell).
```bash
cd /mnt/c/Development
git clone https://github.com/Mykle23/transcriptvideo.git
cd transcriptvideo
python3 -m venv venv
source venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -r requirements.txt
```

First install pulls ~2 GB (PyTorch + CUDA + ctranslate2). Takes several minutes.
Quick CUDA sanity check:

```bash
python -c "from faster_whisper import WhisperModel; WhisperModel('tiny', device='cuda'); print('CUDA OK')"
```

The first real run with large-v3 pulls ~3 GB. Pre-download it now to avoid waiting later:
```bash
HF_HUB_ENABLE_HF_TRANSFER=0 python -c "from faster_whisper import WhisperModel; WhisperModel('large-v3', device='cuda', compute_type='float16')"
```

Always pass `HF_HUB_ENABLE_HF_TRANSFER=0`. Without it, the parallel downloader can crash WiFi on some PCs (see Troubleshooting).
Run the test suite:

```bash
pytest tests/ -v
```

21 tests should pass.
Double-click start-webapp.bat from the Windows desktop.
- Opens `http://localhost:8000` in your browser
- Launches the FastAPI server inside WSL
- Loads the Whisper model into the GPU (~10 s once cached)
If the browser shows a connection error, wait a few seconds and refresh (the server needs a moment to start).
Access from other devices (via Tailscale):
http://<tailscale-ip-of-the-pc>:8000
The server binds to 0.0.0.0:8000, so any device on the same Tailscale net can reach it.
The original CLI still works without the webapp:

```bash
source venv/bin/activate
python transcribe.py "my_video.mp4"
```

Videos must live under `videos/`. Output goes to `transcriptions/<video-name>/transcription.txt`.
Project layout:

```
transcriptvideo/
├── transcribe.py # standalone CLI (legacy, still works)
├── start-webapp.bat # Windows launcher (WSL + uvicorn + browser)
├── videos/ # .mp4 files (ignored by git)
├── transcriptions/ # output folder per job (ignored by git)
│ └── <name>/
│ └── transcription.txt
├── webapp/
│ ├── app.py # FastAPI app: routes + lifespan + SSE
│ ├── config.py # paths and constants
│ ├── database.py # SQLite + FTS5 search
│ ├── transcriber.py # transcription + hallucination cleanup
│ ├── worker.py # background job processor (FIFO, GPU-bound)
│ ├── templates/index.html # single-page UI (Alpine.js + Tailwind)
│ └── static/app.js # frontend logic (SSE, upload, search, theme)
├── tests/
│ ├── conftest.py
│ ├── test_api.py # FastAPI endpoint tests
│ ├── test_database.py # DB CRUD + FTS5
│ ├── test_integration.py # full upload-to-delete flow
│ └── test_transcriber.py # segment cleanup logic
├── requirements.txt
├── LICENSE
└── README.md
```
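
`webapp/worker.py` above is the piece that serializes jobs onto the GPU. A rough sketch of that pattern: a single consumer thread, a FIFO queue, and cooperative cancellation checked between segments. The names and API here are illustrative, not the real worker code:

```python
import queue
import threading

jobs = queue.Queue()                               # FIFO: jobs run in submission order
cancel_flags: dict[str, threading.Event] = {}      # per-job cancellation flags

def submit(job_id: str, video_path: str) -> None:
    cancel_flags[job_id] = threading.Event()
    jobs.put((job_id, video_path))

def cancel(job_id: str) -> None:
    if job_id in cancel_flags:
        cancel_flags[job_id].set()

def fake_transcribe(path: str):
    # Stand-in for the faster-whisper segment generator.
    yield from (f"segment {i} of {path}" for i in range(3))

def worker_loop() -> None:
    while True:
        job_id, video_path = jobs.get()            # blocks until a job arrives
        flag = cancel_flags[job_id]
        for segment in fake_transcribe(video_path):
            if flag.is_set():                      # cancellation checked between segments
                break
            print(f"[{job_id}] {segment}")
        jobs.task_done()

threading.Thread(target=worker_loop, daemon=True).start()
submit("job-1", "videos/demo.mp4")
jobs.join()
```

One job at a time is deliberate: the GPU can only serve one large-v3 transcription efficiently, so a single consumer keeps memory and scheduling simple.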
| Layer | Tech |
|---|---|
| Transcription | faster-whisper, large-v3, CUDA float16 |
| Backend | FastAPI, uvicorn, SQLite (WAL + FTS5) |
| Frontend | Alpine.js + Tailwind CSS via CDN (no Node.js, no build) |
| Live progress | Server-Sent Events (SSE) |
| Runtime | WSL2 Ubuntu + NVIDIA GPU |
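
Live progress is plain SSE: the server streams `data: ...` lines over a long-lived HTTP response and the browser reads them with `EventSource`, no WebSockets required. A minimal FastAPI sketch of such an endpoint (the route path and payload shape are assumptions for illustration, not the app's actual API):

```python
import asyncio
import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.get("/api/jobs/{job_id}/events")
async def job_events(job_id: str):
    async def stream():
        for done in range(0, 101, 20):                 # fake progress updates
            payload = {"job_id": job_id, "percent": done}
            yield f"data: {json.dumps(payload)}\n\n"   # SSE wire format: data line + blank line
            await asyncio.sleep(1)

    return StreamingResponse(stream(), media_type="text/event-stream")
```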
Symptom: when the .bat launches, WiFi drops 1-2 min later (no networks visible in Windows).
Cause: faster-whisper pulls the model via huggingface_hub, which enables hf_transfer by default — an aggressive Rust downloader using many parallel connections. On some PCs (Realtek-based WiFi adapters common in WSL2 setups), this crashes the Windows network stack.
Fix: the start-webapp.bat already forces HF_HUB_ENABLE_HF_TRANSFER=0 (serial download — slower but stable). If you download the model manually, export the same variable:
```bash
export HF_HUB_ENABLE_HF_TRANSFER=0
python -c "from faster_whisper import WhisperModel; WhisperModel('large-v3', device='cuda', compute_type='float16')"
```

If the cache is corrupted or incomplete (an `.incomplete` blob left behind), wipe it and retry:

```bash
wsl bash -c "rm -rf /root/.cache/huggingface/hub/models--Systran--faster-whisper-large-v3"
```

Relaunch the `.bat` — it will re-download cleanly.
If `/` returns 500 but `/api/jobs` works, you have an outdated Starlette. The code uses the modern `templates.TemplateResponse(request, "index.html")` signature. Upgrade:
```bash
python -m pip install --upgrade starlette fastapi
```
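
For reference, the modern Starlette signature takes the request as the first positional argument. A minimal sketch of a route using it (not the project's actual handler):

```python
from fastapi import FastAPI, Request
from fastapi.templating import Jinja2Templates

app = FastAPI()
templates = Jinja2Templates(directory="webapp/templates")

@app.get("/")
async def index(request: Request):
    # Modern signature: request first. Older Starlette expected
    # TemplateResponse("index.html", {"request": request}) instead.
    return templates.TemplateResponse(request, "index.html")
```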
A few notes:

- The Whisper model loads once at server startup and is reused across all jobs.
- Language is auto-detected per video.
- Post-processing removes typical Whisper hallucinations: segments where one word dominates >75% (`no no no no ...`) collapse to that word, and runs of 4+ consecutive identical short segments (`Yes. Yes. Yes. Yes.`) merge into one (see the sketch after these notes).
- Existing transcriptions already on disk (created with the CLI) are auto-imported into the DB on first launch.
- No Node.js or build step required for the frontend.
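
For the curious, the cleanup boils down to two small heuristics. A rough sketch using the thresholds described above (75% single-word dominance, runs of 4 or more identical segments); the function names and details are illustrative, not the actual `webapp/transcriber.py` code:

```python
from collections import Counter

def collapse_dominant_word(text: str, threshold: float = 0.75) -> str:
    # If one word makes up more than `threshold` of the segment, keep just that word.
    words = text.lower().split()
    if len(words) < 4:
        return text
    word, count = Counter(words).most_common(1)[0]
    return word if count / len(words) > threshold else text

def merge_repeated_segments(segments: list[str], min_run: int = 4) -> list[str]:
    # Collapse runs of `min_run` or more identical segments into a single copy.
    cleaned: list[str] = []
    i = 0
    while i < len(segments):
        j = i
        while j < len(segments) and segments[j] == segments[i]:
            j += 1
        run = j - i
        cleaned.extend([segments[i]] if run >= min_run else segments[i:j])
        i = j
    return cleaned

segments = ["Yes.", "Yes.", "Yes.", "Yes.", "Yes.", "no no no no no no"]
segments = [collapse_dominant_word(s) for s in segments]
print(merge_repeated_segments(segments))   # ['Yes.', 'no']
```

Because both rules are deterministic, re-running the same video produces the same cleaned transcript.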
MIT — use it, fork it, ship it.
Spanish version: `docs/README.es.md`
