How to Install Whisper for Real-Time Speech-to-Text on an Ubuntu 24.04 GPU-Enabled Server

Introduction
Instantly converting spoken words into text unlocks a wide range of possibilities—from real-time captions and meeting notes to accessibility enhancements and smart assistant integration. OpenAI’s Whisper model stands out for its robust multilingual speech recognition, offering high precision and responsiveness when paired with a GPU-accelerated Ubuntu 24.04 environment. This tutorial walks you through the process of configuring and running Whisper on an Ubuntu 24.04 server with GPU support for real-time speech-to-text functionality.
Prerequisites
Before you start, ensure you have:
- An Ubuntu 24.04 server with an NVIDIA GPU.
- A non-root user account with sudo privileges.
- NVIDIA GPU drivers installed and functioning properly.
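Before proceeding, it is worth confirming the driver stack is in place. Running nvidia-smi directly shows the full driver and GPU report; as a minimal scripted sanity check, you can at least verify the tool is on the PATH (it ships with the NVIDIA driver, so its absence usually means the driver is not installed):

```python
import shutil

# nvidia-smi is installed alongside the NVIDIA driver; if it cannot be
# found on the PATH, the driver is most likely missing.
print("NVIDIA driver tools found:", shutil.which("nvidia-smi") is not None)
```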
Step 1: Install Required System Packages
sudo apt update
sudo apt install -y ffmpeg python3 python3-venv python3-pip portaudio19-dev
Step 2: Set Up a Python Virtual Environment
python3 -m venv whisper-env
source whisper-env/bin/activate
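With the environment activated, every pip install below is isolated inside whisper-env rather than touching the system Python. If you ever need to confirm you are inside the venv, a quick check:

```python
import sys

# In an activated virtual environment, sys.prefix points at the venv
# directory, while sys.base_prefix points at the system interpreter.
print("Inside a virtual environment:", sys.prefix != sys.base_prefix)
```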
Step 3: Upgrade pip and Install Whisper
pip install --upgrade pip
pip install git+https://github.com/openai/whisper.git
Step 4: Install PyTorch with GPU (CUDA) Support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
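After the install finishes, confirm PyTorch can actually see the GPU, since Whisper will silently fall back to the CPU otherwise. A small check (guarded so it also runs in environments where PyTorch is not installed):

```python
try:
    import torch
    # True means Whisper can run on the GPU; False means CPU fallback.
    print("CUDA available:", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("GPU:", torch.cuda.get_device_name(0))
except ImportError:
    print("PyTorch is not installed in this environment")
```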
Step 5: Add Supporting Python Libraries
pip install sounddevice numpy gradio
- sounddevice – microphone access
- numpy – numerical processing
- gradio – creates the web interface
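Whisper resamples all input to 16 kHz mono internally, so any audio file the Gradio components hand over will work. As a small illustration of the kind of numerical processing numpy handles in audio pipelines, computing the RMS level of a hypothetical one-second buffer of silence:

```python
import numpy as np

SAMPLE_RATE = 16000  # Whisper works on 16 kHz audio internally

# Hypothetical one-second buffer of silence (all zeros)
buffer = np.zeros(SAMPLE_RATE, dtype=np.float32)

# Root-mean-square level of the signal; 0.0 for pure silence
rms = float(np.sqrt(np.mean(buffer ** 2)))
print("RMS level:", rms)  # → 0.0
```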
Step 6: Build the Speech-to-Text Application
Create and edit a Python script:
nano app.py
Paste the following code into app.py:
import datetime

import gradio as gr
import whisper

# Load the Whisper model; it runs on the GPU automatically when CUDA is
# available. Swap "base" for "small", "medium", or "large" for higher accuracy.
model = whisper.load_model("base")

LOG_FILE = "error_logs.txt"

def log_error(error_message):
    # Append a timestamped entry to the error log
    with open(LOG_FILE, "a") as f:
        timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        f.write(f"[{timestamp}] {error_message}\n")

def transcribe_audio(audio):
    try:
        if audio is None:
            return "❌ No audio received. Please record or upload an audio file."
        result = model.transcribe(audio)
        return result["text"]
    except Exception as e:
        log_error(str(e))
        return f"⚠️ Error occurred: {e}"

with gr.Blocks() as demo:
    gr.Markdown("## 🎙️ Whisper Real-Time Speech-to-Text on Ubuntu 24.04 GPU Server")
    input_method = gr.Radio(
        ["Upload Audio File", "Record via Microphone"],
        label="Choose Input Method",
        value="Upload Audio File",
    )
    upload_audio = gr.Audio(type="filepath", label="📂 Upload Audio File", visible=True)
    record_audio = gr.Audio(type="filepath", label="🎤 Record via Microphone", visible=False)
    output_text = gr.Textbox(label="📝 Transcribed Text")

    def toggle_inputs(method):
        # Show only the input component matching the selected method
        return (
            gr.update(visible=(method == "Upload Audio File")),
            gr.update(visible=(method == "Record via Microphone")),
        )

    input_method.change(fn=toggle_inputs, inputs=input_method, outputs=[upload_audio, record_audio])

    def handle_transcription(upload, record, method):
        audio = upload if method == "Upload Audio File" else record
        return transcribe_audio(audio)

    transcribe_button = gr.Button("🔍 Transcribe")
    transcribe_button.click(
        fn=handle_transcription,
        inputs=[upload_audio, record_audio, input_method],
        outputs=output_text,
    )

demo.launch(share=True)
Step 7: Run the Application
python3 app.py
You’ll see output like:
* Running on local URL: http://127.0.0.1:7860
* Running on public URL: https://your-unique-link.gradio.live
Step 8: Use the Web Interface
- Open the public Gradio URL shown in the terminal.
- Select “Record via Microphone” or “Upload Audio File.”
- Click the microphone icon to begin recording, then stop it.
- Click “Transcribe” to process the audio and view the text output.
Conclusion
You’ve now built a real-time speech-to-text system using OpenAI’s Whisper on Ubuntu 24.04 with GPU acceleration and a simple web interface. Feel free to enhance the model, customize the UI, or integrate it into larger systems.