OpenClaw Voice Assistant: Build Your Own Jarvis

15 min · Tutorial

---
title: "OpenClaw Voice Assistant: Build Your Own Jarvis"
date: "2026-02-20"
description: "Build a voice-controlled AI assistant with OpenClaw. This tutorial covers speech recognition, natural language processing, voice synthesis, and creating custom voice commands."
category: "Tutorial"
author: "OpenClaw Team"
tags: ["voice", "assistant", "speech", "tutorial"]
readTime: "15 min"
---

Imagine walking into your home office and saying "Hey Claw, summarize my emails and brief me on today's agenda" — and having an AI assistant that actually responds intelligently. That's exactly what you'll build in this tutorial. Using OpenClaw as the AI brain and open-source speech tools, you can have your own Jarvis running on your local machine in under an hour.

What You'll Build

By the end of this tutorial, you'll have a fully functional voice assistant that:

  • Listens for a custom wake word without consuming cloud resources
  • Transcribes your speech with OpenAI Whisper running locally
  • Processes natural language through OpenClaw to generate smart responses
  • Reads responses aloud using text-to-speech synthesis
  • Handles ambient noise and partial utterances gracefully
  • Executes custom voice command workflows like sending emails or searching the web

Prerequisites

Before you begin, make sure you have:

  • OpenClaw installed (installation guide)
  • Python 3.10+ with pip
  • A working microphone connected to your machine
  • At least 4 GB of RAM (8 GB recommended for Whisper medium model)
  • An OpenClaw API key configured
  • Basic familiarity with Python and the command line

Install the required Python packages up front:

pip install openai-whisper pyaudio pyttsx3 pvporcupine sounddevice numpy scipy openclaw-sdk

On macOS you may also need:

brew install portaudio ffmpeg

Step 1: Set Up Speech-to-Text with Whisper

Whisper is OpenAI's open-source speech recognition model. Running it locally means your voice never leaves your machine — an important privacy consideration.

# whisper_transcriber.py
import whisper
import sounddevice as sd
import numpy as np

class WhisperTranscriber:
    def __init__(self, model_size="base"):
        # Options: tiny, base, small, medium, large
        # 'base' is a good balance of speed and accuracy
        print(f"Loading Whisper {model_size} model...")
        self.model = whisper.load_model(model_size)
        self.sample_rate = 16000  # Whisper expects 16 kHz mono audio

    def record_audio(self, duration=5):
        """Record a fixed-duration clip from the default microphone."""
        print("Listening...")
        audio = sd.rec(
            int(duration * self.sample_rate),
            samplerate=self.sample_rate,
            channels=1,
            dtype="float32"
        )
        sd.wait()  # Block until the recording finishes
        return audio.flatten()

    def transcribe(self, audio_array):
        """Transcribe a numpy audio array to text."""
        result = self.model.transcribe(
            audio_array,
            language="en",
            fp16=False  # Set True if you have a CUDA GPU
        )
        return result["text"].strip()

The base model requires around 150 MB of disk space and runs comfortably on a CPU. If you have a dedicated GPU, switch to medium for significantly better accuracy on accented speech or technical terms.
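Whisper's transcribe() accepts float32 audio normalized to [-1.0, 1.0], which is what the recorder above produces. If your capture path yields signed 16-bit PCM instead (as the wake-word stream in the next step does), rescale before transcribing. A minimal sketch, with a helper name of my own:

```python
# pcm_utils.py -- hypothetical helper, not part of the tutorial's files.
# Whisper expects float32 samples in [-1.0, 1.0]; int16 PCM spans
# [-32768, 32767], so dividing by 32768 maps it into range.

def int16_to_float32(samples):
    """Convert a sequence of int16 PCM samples to floats in [-1.0, 1.0]."""
    return [s / 32768.0 for s in samples]

print(int16_to_float32([0, 16384, -32768]))  # [0.0, 0.5, -1.0]
```

In practice you would do this vectorized with numpy (`audio.astype("float32") / 32768.0`) rather than in a Python loop.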

Step 2: Implement Wake Word Detection

Polling the microphone constantly and running Whisper on silence is wasteful. A lightweight wake word engine lets your assistant sleep until you need it.

# wake_word.py
import pvporcupine
import sounddevice as sd

class WakeWordDetector:
    def __init__(self, access_key, keywords=None):
        # Porcupine offers free tier for personal use
        # Default keywords include "hey siri", "alexa", "jarvis", etc.
        self.porcupine = pvporcupine.create(
            access_key=access_key,
            keywords=keywords or ["jarvis"]
        )
        self.frame_length = self.porcupine.frame_length
        self.sample_rate = self.porcupine.sample_rate

    def listen_for_wake_word(self):
        """Block until the wake word is detected."""
        print("Waiting for wake word...")
        with sd.InputStream(
            samplerate=self.sample_rate,
            channels=1,
            dtype="int16",
            blocksize=self.frame_length
        ) as stream:
            while True:
                audio_frame, _ = stream.read(self.frame_length)
                result = self.porcupine.process(audio_frame.flatten())
                if result >= 0:
                    print("Wake word detected!")
                    return True

    def cleanup(self):
        self.porcupine.delete()

Porcupine offers a free tier that supports custom wake words. Sign up at their developer portal to get an access_key. Alternatively, you can use the open-source openwakeword library for a fully offline solution.
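One practical detail: Porcupine's process() wants exactly frame_length samples per call (512 at 16 kHz for the default models), which is why the stream above sets blocksize to match. If your audio source delivers arbitrarily sized chunks instead, buffer and re-slice them. A sketch under that assumption, with class and method names of my own:

```python
# frame_buffer.py -- illustrative sketch; the names are mine, not Porcupine API.
# Accumulates incoming samples and yields fixed-size frames suitable for
# a wake-word engine that requires an exact frame length per call.

class FrameBuffer:
    def __init__(self, frame_length=512):
        self.frame_length = frame_length
        self._buf = []

    def push(self, chunk):
        """Append a chunk of samples; yield complete frames as they fill."""
        self._buf.extend(chunk)
        while len(self._buf) >= self.frame_length:
            frame = self._buf[:self.frame_length]
            del self._buf[:self.frame_length]
            yield frame

buf = FrameBuffer(frame_length=4)
frames = list(buf.push([1, 2, 3, 4, 5, 6]))
print(frames)  # [[1, 2, 3, 4]] -- two samples stay buffered for the next push
```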

Step 3: Connect to OpenClaw for Intelligence

OpenClaw handles the "thinking" part — interpreting intent, maintaining conversation context, and generating useful responses.

# openclaw_brain.py
from openclaw import OpenClaw, ConversationMemory

class AssistantBrain:
    def __init__(self, api_key):
        self.client = OpenClaw(api_key=api_key)
        self.memory = ConversationMemory(max_turns=10)

        self.system_prompt = """
        You are a helpful voice assistant called Jarvis.
        Keep your responses concise and natural for spoken delivery —
        avoid markdown formatting, bullet points, and code blocks.
        Use plain conversational language.
        If you need to list items, speak them naturally: "First... Second... Third..."
        """

    def process(self, user_input: str) -> str:
        """Process user input and return a voice-friendly response."""
        self.memory.add_user_message(user_input)

        response = self.client.chat(
            model="claude-3-sonnet",
            system=self.system_prompt,
            messages=self.memory.get_messages(),
            max_tokens=300  # Keep responses short for voice
        )

        assistant_reply = response.content
        self.memory.add_assistant_message(assistant_reply)
        return assistant_reply

    def reset_context(self):
        """Clear conversation history for a fresh session."""
        self.memory.clear()

Notice the max_tokens=300 limit. Voice responses should be short — nobody wants to listen to a three-minute monologue from their assistant. If you need longer outputs (like a full email draft), have the assistant save it to a file and notify you verbally.
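The system prompt asks for plain spoken language, but models occasionally emit markdown anyway, and a TTS engine will happily read asterisks and backticks aloud. A defensive sketch (the helper name is my own) that sanitizes replies before speaking them:

```python
# speech_text.py -- hypothetical helper for voice-friendly output.
# Strips common markdown syntax so the TTS engine reads clean prose.
import re

def strip_markdown(text: str) -> str:
    """Remove common markdown syntax before sending text to TTS."""
    text = re.sub(r"```.*?```", "", text, flags=re.DOTALL)       # fenced code
    text = re.sub(r"[*_`#]+", "", text)                          # emphasis, headings
    text = re.sub(r"^\s*[-*]\s+", "", text, flags=re.MULTILINE)  # bullet markers
    return re.sub(r"\s+", " ", text).strip()                     # collapse whitespace

print(strip_markdown("**First**, check `inbox`.\n- item one"))
# First, check inbox. item one
```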

Step 4: Add Text-to-Speech Output

# tts_engine.py
import pyttsx3
import threading

class TTSEngine:
    def __init__(self, rate=175, volume=0.9, voice_index=0):
        self.engine = pyttsx3.init()
        self.engine.setProperty("rate", rate)      # Words per minute
        self.engine.setProperty("volume", volume)  # 0.0 to 1.0

        # Select voice (0 = default, try 1 for female voice on most systems)
        voices = self.engine.getProperty("voices")
        if voice_index < len(voices):
            self.engine.setProperty("voice", voices[voice_index].id)

        self._lock = threading.Lock()

    def speak(self, text: str):
        """Speak text synchronously."""
        with self._lock:
            self.engine.say(text)
            self.engine.runAndWait()

    def speak_async(self, text: str):
        """Speak text without blocking the main thread."""
        thread = threading.Thread(target=self.speak, args=(text,), daemon=True)
        thread.start()
        return thread

For higher-quality voices, consider using the elevenlabs or google-cloud-texttospeech library instead of pyttsx3. The trade-off is latency — cloud TTS adds 300-800ms per request.
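Whichever engine you use, perceived latency drops if you feed it one sentence at a time: audio starts as soon as the first sentence is ready instead of after the whole reply. A simple regex splitter sketch (not a full sentence tokenizer):

```python
# sentence_splitter.py -- illustrative; a crude splitter, my own sketch.
# Splitting replies into sentences lets TTS begin speaking the first
# sentence while the rest are still queued.
import re

def split_sentences(text: str):
    """Split on sentence-ending punctuation, keeping the punctuation."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

print(split_sentences("You have three emails. Two are urgent! Shall I read them?"))
# ['You have three emails.', 'Two are urgent!', 'Shall I read them?']
```

Feed each element to speak() in turn; the gap between sentences is barely audible at normal speaking rates.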

Step 5: Build Custom Voice Command Workflows

Voice commands become truly powerful when they trigger multi-step workflows via OpenClaw's automation engine.

# voice_commands.yaml
commands:
  check_emails:
    trigger_phrases:
      - "check my emails"
      - "any new emails"
      - "summarize inbox"
    workflow: email_triage
    response_template: "You have {count} unread emails. {summary}"

  morning_brief:
    trigger_phrases:
      - "morning brief"
      - "good morning"
      - "start my day"
    workflow: daily_briefing
    response_template: "Good morning! Here's your briefing: {content}"

  set_reminder:
    trigger_phrases:
      - "remind me to"
      - "set a reminder"
    workflow: create_reminder
    extract_params:
      - name: task
        type: text
      - name: time
        type: datetime
The router loads this file at startup and fuzzy-matches each transcript against the trigger phrases:

# command_router.py
import yaml
from openclaw import WorkflowRunner
from difflib import SequenceMatcher

class CommandRouter:
    def __init__(self, commands_file="voice_commands.yaml"):
        with open(commands_file) as f:
            self.commands = yaml.safe_load(f)["commands"]
        self.workflow_runner = WorkflowRunner()

    def find_matching_command(self, text: str, threshold=0.6):
        """Find the best matching command using fuzzy matching."""
        text_lower = text.lower()
        best_match = None
        best_score = 0

        for cmd_name, cmd_config in self.commands.items():
            for phrase in cmd_config["trigger_phrases"]:
                score = SequenceMatcher(None, text_lower, phrase).ratio()
                if score > best_score and score >= threshold:
                    best_score = score
                    best_match = (cmd_name, cmd_config)

        return best_match

    def execute(self, text: str):
        """Route voice input to the appropriate workflow."""
        match = self.find_matching_command(text)
        if match:
            cmd_name, cmd_config = match
            result = self.workflow_runner.run(cmd_config["workflow"])
            return cmd_config["response_template"].format(**result)
        return None  # Fall through to general AI conversation
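To get a feel for why 0.6 is a reasonable starting threshold, here is how SequenceMatcher scores a few transcripts against a trigger phrase. Exact matches score 1.0, near-misses from transcription errors stay high, and unrelated commands fall well below the cutoff:

```python
# Quick illustration of difflib.SequenceMatcher ratios for fuzzy matching.
from difflib import SequenceMatcher

def score(a: str, b: str) -> float:
    """Similarity ratio between two phrases, case-insensitive."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

print(round(score("check my emails", "check my emails"), 2))  # 1.0
print(round(score("check my email", "check my emails"), 2))   # 0.97
print(score("play some jazz", "check my emails") < 0.6)       # True
```

If you see false positives in practice, raise the threshold toward 0.75; if Whisper's occasional mis-transcriptions cause misses, lower it slightly instead.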

Step 6: Handle Ambient Noise

A voice assistant that mishears "play jazz" as "place jazz order" is more frustrating than helpful. Implement noise gating to filter out background audio.

# noise_filter.py
import numpy as np
import sounddevice as sd

class NoiseGatedRecorder:
    def __init__(
        self,
        sample_rate=16000,
        silence_threshold=0.015,
        min_speech_duration=0.5,
        max_duration=8.0,
        silence_timeout=1.2
    ):
        self.sample_rate = sample_rate
        self.silence_threshold = silence_threshold
        self.min_speech_frames = int(min_speech_duration * sample_rate)
        self.max_frames = int(max_duration * sample_rate)
        self.silence_frames = int(silence_timeout * sample_rate)

    def record_utterance(self):
        """Record a single utterance, stopping on silence."""
        frames = []
        silence_counter = 0
        speaking = False

        with sd.InputStream(samplerate=self.sample_rate, channels=1, dtype="float32") as stream:
            while len(frames) < self.max_frames:
                chunk, _ = stream.read(512)
                chunk = chunk.flatten()
                frames.extend(chunk)

                rms = np.sqrt(np.mean(chunk ** 2))

                if rms > self.silence_threshold:
                    speaking = True
                    silence_counter = 0
                elif speaking:
                    silence_counter += len(chunk)
                    if silence_counter > self.silence_frames:
                        break  # Silence detected after speech — done

        audio = np.array(frames, dtype="float32")

        # Reject recordings that are too short
        if len(audio) < self.min_speech_frames:
            return None

        return audio

The silence_threshold value (0.015) works well in a quiet home office. In noisier environments, raise it to 0.03-0.05 or implement adaptive thresholding based on background RMS measurements taken at startup.
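The adaptive variant can be sketched in a few lines: sample ambient audio for a second at startup, compute its RMS, and set the gate a safe margin above it, never below a sane floor. Pure Python here for clarity (the real recorder would compute RMS with numpy on captured audio); the margin and floor values are my own starting points:

```python
# calibrate.py -- sketch of adaptive thresholding from background noise.
import math

def ambient_rms(samples):
    """Root-mean-square level of a calibration recording."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def adaptive_threshold(samples, margin=3.0, floor=0.015):
    """Gate set a margin above background noise, never below the floor."""
    return max(floor, margin * ambient_rms(samples))

quiet_room = [0.002, -0.003, 0.001, -0.002]
print(adaptive_threshold(quiet_room))  # falls back to the 0.015 floor
```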

Step 7: Assemble the Full Assistant and Optimize Latency

# main.py
import os
from wake_word import WakeWordDetector
from whisper_transcriber import WhisperTranscriber
from noise_filter import NoiseGatedRecorder
from openclaw_brain import AssistantBrain
from tts_engine import TTSEngine
from command_router import CommandRouter

def main():
    # Initialize all components
    wake_detector = WakeWordDetector(
        access_key=os.environ["PORCUPINE_ACCESS_KEY"]
    )
    recorder = NoiseGatedRecorder()
    transcriber = WhisperTranscriber(model_size="base")
    brain = AssistantBrain(api_key=os.environ["OPENCLAW_API_KEY"])
    tts = TTSEngine(rate=175)
    router = CommandRouter("voice_commands.yaml")

    tts.speak("Jarvis online. Say 'Jarvis' to activate.")

    try:
        while True:
            # Phase 1: Detect wake word (low CPU usage)
            wake_detector.listen_for_wake_word()
            tts.speak_async("Yes?")

            # Phase 2: Record the actual command
            audio = recorder.record_utterance()
            if audio is None:
                continue

            # Phase 3: Transcribe speech to text
            text = transcriber.transcribe(audio)
            if not text:
                continue
            print(f"You said: {text}")

            # Phase 4: Route to workflow or AI
            workflow_response = router.execute(text)
            if workflow_response:
                response = workflow_response
            else:
                response = brain.process(text)

            # Phase 5: Speak the response
            print(f"Jarvis: {response}")
            tts.speak(response)

    except KeyboardInterrupt:
        tts.speak("Shutting down. Goodbye.")
        wake_detector.cleanup()

if __name__ == "__main__":
    main()

Latency Optimization Tips

The biggest latency killer is Whisper transcription. Here are ways to cut response time:

| Optimization | Latency Saving | Trade-off |
|---|---|---|
| Use tiny Whisper model | ~400 ms | Slightly lower accuracy |
| Pre-load Whisper at startup | ~1.5 s on first use | Higher memory usage |
| Stream TTS while generating | ~600 ms perceived | More complex code |
| Use claude-3-haiku model | ~300 ms | Shorter context window |
| Cache common responses | ~800 ms | Stale data risk |

With all optimizations applied, end-to-end latency from end of speech to start of response should be under 2 seconds on a modern laptop.
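Before optimizing, measure: wrap each phase of the pipeline so you can see where the latency budget actually goes. A small measurement aid (my own sketch) using time.perf_counter:

```python
# timing.py -- a measurement sketch; wrap each pipeline phase with timed().
import time
from contextlib import contextmanager

@contextmanager
def timed(label, timings):
    """Record the wall-clock duration of a pipeline phase into timings."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start

timings = {}
with timed("transcribe", timings):
    time.sleep(0.05)  # stand-in for transcriber.transcribe(audio)
print(f"transcribe took {timings['transcribe'] * 1000:.0f} ms")
```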

Common Pitfalls to Avoid

  1. Running Whisper on every audio chunk: Only transcribe after detecting meaningful speech above the noise gate threshold.
  2. Blocking the main thread on TTS: Use speak_async so the assistant can listen again while speaking short confirmations.
  3. No timeout on recordings: Always set max_duration — a dropped microphone otherwise causes an infinite recording.
  4. Single-threaded architecture: Wake word detection, recording, and TTS should each be in separate threads for responsiveness.

Next Steps

Now that your voice assistant is running, extend it further: connect more workflows like the email and reminder examples above, train a custom wake word with Porcupine, or swap in a higher-quality cloud TTS voice.

Your Jarvis is just getting started. Every workflow you connect makes it smarter and saves you more time.
