How to Install Whisper for Real-Time Speech-to-Text on an Ubuntu 24.04 GPU-Enabled Server

Introduction
Instantly converting spoken words into text unlocks a wide range of possibilities—from real-time captions and meeting notes to accessibility enhancements and smart assistant integration. OpenAI’s Whisper model stands out for its robust multilingual speech recognition, offering high precision and responsiveness when paired with a GPU-accelerated Ubuntu 24.04 environment. This tutorial walks you through the process of configuring and running Whisper on an Ubuntu 24.04 server with GPU support for real-time speech-to-text functionality.
Prerequisites
Before you start, ensure you have:
- An Ubuntu 24.04 server with an NVIDIA GPU.
- A non-root user account with sudo privileges.
- NVIDIA GPU drivers installed and functioning properly.
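Before proceeding, it is worth confirming the driver stack is in place. Running nvidia-smi directly shows the full driver and GPU report; as a minimal scripted sanity check, you can at least verify the tool is on the PATH (it ships with the NVIDIA driver, so its absence usually means the driver is not installed):

```python
import shutil

# nvidia-smi is installed alongside the NVIDIA driver; if it cannot be
# found on the PATH, the driver is most likely missing.
print("NVIDIA driver tools found:", shutil.which("nvidia-smi") is not None)
```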
Step 1: Install Required System Packages
sudo apt update
sudo apt install -y ffmpeg python3 python3-venv python3-pip portaudio19-dev
Step 2: Set Up a Python Virtual Environment
python3 -m venv whisper-env
source whisper-env/bin/activate
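With the environment activated, every pip install below is isolated inside whisper-env rather than touching the system Python. If you ever need to confirm you are inside the venv, a quick check:

```python
import sys

# In an activated virtual environment, sys.prefix points at the venv
# directory, while sys.base_prefix points at the system interpreter.
print("Inside a virtual environment:", sys.prefix != sys.base_prefix)
```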
Step 3: Upgrade pip and Install Whisper
pip install --upgrade pip
pip install git+https://github.com/openai/whisper.git
Step 4: Install PyTorch with GPU (CUDA) Support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
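After the install finishes, confirm PyTorch can actually see the GPU, since Whisper will silently fall back to the CPU otherwise. A small check (guarded so it also runs in environments where PyTorch is not installed):

```python
try:
    import torch
    # True means Whisper can run on the GPU; False means CPU fallback.
    print("CUDA available:", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("GPU:", torch.cuda.get_device_name(0))
except ImportError:
    print("PyTorch is not installed in this environment")
```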
Step 5: Add Supporting Python Libraries
pip install sounddevice numpy gradio
- sounddevice – microphone access
- numpy – numerical processing
- gradio – creates the web interface
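Whisper resamples all input to 16 kHz mono internally, so any audio file the Gradio components hand over will work. As a small illustration of the kind of numerical processing numpy handles in audio pipelines, computing the RMS level of a hypothetical one-second buffer of silence:

```python
import numpy as np

SAMPLE_RATE = 16000  # Whisper works on 16 kHz audio internally

# Hypothetical one-second buffer of silence (all zeros)
buffer = np.zeros(SAMPLE_RATE, dtype=np.float32)

# Root-mean-square level of the signal; 0.0 for pure silence
rms = float(np.sqrt(np.mean(buffer ** 2)))
print("RMS level:", rms)  # → 0.0
```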
Step 6: Build the Speech-to-Text Application
Create and edit a Python script:
nano app.py
Paste the following code into app.py:
import datetime

import gradio as gr
import whisper

# Load the Whisper model; it runs on the GPU automatically when CUDA is
# available. Swap "base" for "small", "medium", or "large" for higher accuracy.
model = whisper.load_model("base")

LOG_FILE = "error_logs.txt"

def log_error(error_message):
    # Append a timestamped entry to the error log
    with open(LOG_FILE, "a") as f:
        timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        f.write(f"[{timestamp}] {error_message}\n")

def transcribe_audio(audio):
    try:
        if audio is None:
            return "❌ No audio received. Please record or upload an audio file."
        result = model.transcribe(audio)
        return result["text"]
    except Exception as e:
        log_error(str(e))
        return f"⚠️ Error occurred: {e}"

with gr.Blocks() as demo:
    gr.Markdown("## 🎙️ Whisper Real-Time Speech-to-Text on Ubuntu 24.04 GPU Server")
    input_method = gr.Radio(
        ["Upload Audio File", "Record via Microphone"],
        label="Choose Input Method",
        value="Upload Audio File",
    )
    upload_audio = gr.Audio(type="filepath", label="📂 Upload Audio File", visible=True)
    record_audio = gr.Audio(type="filepath", label="🎤 Record via Microphone", visible=False)
    output_text = gr.Textbox(label="📝 Transcribed Text")

    def toggle_inputs(method):
        # Show only the input component matching the selected method
        return (
            gr.update(visible=(method == "Upload Audio File")),
            gr.update(visible=(method == "Record via Microphone")),
        )

    input_method.change(fn=toggle_inputs, inputs=input_method, outputs=[upload_audio, record_audio])

    def handle_transcription(upload, record, method):
        audio = upload if method == "Upload Audio File" else record
        return transcribe_audio(audio)

    transcribe_button = gr.Button("🔍 Transcribe")
    transcribe_button.click(
        fn=handle_transcription,
        inputs=[upload_audio, record_audio, input_method],
        outputs=output_text,
    )

demo.launch(share=True)
Step 7: Run the Application
python3 app.py
You’ll see output like:
* Running on local URL: http://127.0.0.1:7860
* Running on public URL: https://your-unique-link.gradio.live
Step 8: Use the Web Interface
- Open the public Gradio URL shown in the terminal.
- Select “Record via Microphone” or “Upload Audio File.”
- Click the microphone icon to begin recording, then stop it.
- Click “Transcribe” to process the audio and view the text output.
Conclusion
You’ve now built a real-time speech-to-text system using OpenAI’s Whisper on Ubuntu 24.04 with GPU acceleration and a simple web interface. Feel free to enhance the model, customize the UI, or integrate it into larger systems.