OpenClaw Voice Assistant: Build Your Own Jarvis

15 min · Tutorial

---
title: "OpenClaw Voice Assistant: Build Your Own Jarvis"
date: "2026-02-20"
description: "Build a voice-controlled AI assistant with OpenClaw. This tutorial covers speech recognition, natural language processing, voice synthesis, and creating custom voice commands."
category: "Tutorial"
author: "OpenClaw Team"
tags: ["voice", "assistant", "speech", "tutorial"]
readTime: "15 min"
---

Imagine walking into your home office and saying "Hey Claw, summarize my emails and brief me on today's agenda" — and having an AI assistant that actually responds intelligently. That's exactly what you'll build in this tutorial. Using OpenClaw as the AI brain and open-source speech tools, you can have your own Jarvis running on your local machine in under an hour.

What You'll Build

By the end of this tutorial, you'll have a fully functional voice assistant that:

  • Listens for a custom wake word without consuming cloud resources
  • Transcribes your speech with OpenAI Whisper running locally
  • Processes natural language through OpenClaw to generate smart responses
  • Reads responses aloud using text-to-speech synthesis
  • Handles ambient noise and partial utterances gracefully
  • Executes custom voice command workflows like sending emails or searching the web

Prerequisites

Before you begin, make sure you have:

  • OpenClaw installed (installation guide)
  • Python 3.10+ with pip
  • A working microphone connected to your machine
  • At least 4 GB of RAM (8 GB recommended for Whisper medium model)
  • An OpenClaw API key configured
  • Basic familiarity with Python and the command line

Install the required Python packages up front:

pip install openai-whisper pyaudio pyttsx3 pvporcupine sounddevice numpy scipy openclaw-sdk

On macOS you may also need:

brew install portaudio ffmpeg

Step 1: Set Up Speech-to-Text with Whisper

Whisper is OpenAI's open-source speech recognition model. Running it locally means your voice never leaves your machine — an important privacy consideration.

# whisper_transcriber.py
import whisper
import sounddevice as sd
import numpy as np

class WhisperTranscriber:
    def __init__(self, model_size="base"):
        # Options: tiny, base, small, medium, large
        # 'base' is a good balance of speed and accuracy
        print(f"Loading Whisper {model_size} model...")
        self.model = whisper.load_model(model_size)
        self.sample_rate = 16000  # Whisper expects 16 kHz mono audio

    def record_audio(self, duration=5):
        """Record a fixed-duration clip from the default microphone."""
        print("Listening...")
        audio = sd.rec(
            int(duration * self.sample_rate),
            samplerate=self.sample_rate,
            channels=1,
            dtype="float32"
        )
        sd.wait()  # Block until the recording finishes
        return audio.flatten()

    def transcribe(self, audio_array):
        """Transcribe a numpy audio array to text."""
        result = self.model.transcribe(
            audio_array,
            language="en",
            fp16=False  # Set True if you have a CUDA GPU
        )
        return result["text"].strip()

The base model requires around 150 MB of disk space and runs comfortably on a CPU. If you have a dedicated GPU, switch to medium for significantly better accuracy on accented speech or technical terms.
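Whisper's transcribe() accepts float32 audio normalized to [-1.0, 1.0], which is what the recorder above produces. If your capture path yields signed 16-bit PCM instead (as the wake-word stream in the next step does), rescale before transcribing. A minimal sketch, with a helper name of my own:

```python
# pcm_utils.py -- hypothetical helper, not part of the tutorial's files.
# Whisper expects float32 samples in [-1.0, 1.0]; int16 PCM spans
# [-32768, 32767], so dividing by 32768 maps it into range.

def int16_to_float32(samples):
    """Convert a sequence of int16 PCM samples to floats in [-1.0, 1.0]."""
    return [s / 32768.0 for s in samples]

print(int16_to_float32([0, 16384, -32768]))  # [0.0, 0.5, -1.0]
```

In practice you would do this vectorized with numpy (`audio.astype("float32") / 32768.0`) rather than in a Python loop.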

Step 2: Implement Wake Word Detection

Polling the microphone constantly and running Whisper on silence is wasteful. A lightweight wake word engine lets your assistant sleep until you need it.

# wake_word.py
import pvporcupine
import sounddevice as sd

class WakeWordDetector:
    def __init__(self, access_key, keywords=None):
        # Porcupine offers free tier for personal use
        # Default keywords include "hey siri", "alexa", "jarvis", etc.
        self.porcupine = pvporcupine.create(
            access_key=access_key,
            keywords=keywords or ["jarvis"]
        )
        self.frame_length = self.porcupine.frame_length
        self.sample_rate = self.porcupine.sample_rate

    def listen_for_wake_word(self):
        """Block until the wake word is detected."""
        print("Waiting for wake word...")
        with sd.InputStream(
            samplerate=self.sample_rate,
            channels=1,
            dtype="int16",
            blocksize=self.frame_length
        ) as stream:
            while True:
                audio_frame, _ = stream.read(self.frame_length)
                result = self.porcupine.process(audio_frame.flatten())
                if result >= 0:
                    print("Wake word detected!")
                    return True

    def cleanup(self):
        self.porcupine.delete()

Porcupine offers a free tier that supports custom wake words. Sign up at their developer portal to get an access_key. Alternatively, you can use the open-source openwakeword library for a fully offline solution.
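One practical detail: Porcupine's process() wants exactly frame_length samples per call (512 at 16 kHz for the default models), which is why the stream above sets blocksize to match. If your audio source delivers arbitrarily sized chunks instead, buffer and re-slice them. A sketch under that assumption, with class and method names of my own:

```python
# frame_buffer.py -- illustrative sketch; the names are mine, not Porcupine API.
# Accumulates incoming samples and yields fixed-size frames suitable for
# a wake-word engine that requires an exact frame length per call.

class FrameBuffer:
    def __init__(self, frame_length=512):
        self.frame_length = frame_length
        self._buf = []

    def push(self, chunk):
        """Append a chunk of samples; yield complete frames as they fill."""
        self._buf.extend(chunk)
        while len(self._buf) >= self.frame_length:
            frame = self._buf[:self.frame_length]
            del self._buf[:self.frame_length]
            yield frame

buf = FrameBuffer(frame_length=4)
frames = list(buf.push([1, 2, 3, 4, 5, 6]))
print(frames)  # [[1, 2, 3, 4]] -- two samples stay buffered for the next push
```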

Step 3: Connect to OpenClaw for Intelligence

OpenClaw handles the "thinking" part — interpreting intent, maintaining conversation context, and generating useful responses.

# openclaw_brain.py
from openclaw import OpenClaw, ConversationMemory

class AssistantBrain:
    def __init__(self, api_key):
        self.client = OpenClaw(api_key=api_key)
        self.memory = ConversationMemory(max_turns=10)

        self.system_prompt = """
        You are a helpful voice assistant called Jarvis.
        Keep your responses concise and natural for spoken delivery —
        avoid markdown formatting, bullet points, and code blocks.
        Use plain conversational language.
        If you need to list items, speak them naturally: "First... Second... Third..."
        """

    def process(self, user_input: str) -> str:
        """Process user input and return a voice-friendly response."""
        self.memory.add_user_message(user_input)

        response = self.client.chat(
            model="claude-3-sonnet",
            system=self.system_prompt,
            messages=self.memory.get_messages(),
            max_tokens=300  # Keep responses short for voice
        )

        assistant_reply = response.content
        self.memory.add_assistant_message(assistant_reply)
        return assistant_reply

    def reset_context(self):
        """Clear conversation history for a fresh session."""
        self.memory.clear()

Notice the max_tokens=300 limit. Voice responses should be short — nobody wants to listen to a three-minute monologue from their assistant. If you need longer outputs (like a full email draft), have the assistant save it to a file and notify you verbally.
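The system prompt asks for plain spoken language, but models occasionally emit markdown anyway, and a TTS engine will happily read asterisks and backticks aloud. A defensive sketch (the helper name is my own) that sanitizes replies before speaking them:

```python
# speech_text.py -- hypothetical helper for voice-friendly output.
# Strips common markdown syntax so the TTS engine reads clean prose.
import re

def strip_markdown(text: str) -> str:
    """Remove common markdown syntax before sending text to TTS."""
    text = re.sub(r"```.*?```", "", text, flags=re.DOTALL)       # fenced code
    text = re.sub(r"[*_`#]+", "", text)                          # emphasis, headings
    text = re.sub(r"^\s*[-*]\s+", "", text, flags=re.MULTILINE)  # bullet markers
    return re.sub(r"\s+", " ", text).strip()                     # collapse whitespace

print(strip_markdown("**First**, check `inbox`.\n- item one"))
# First, check inbox. item one
```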

Step 4: Add Text-to-Speech Output

# tts_engine.py
import pyttsx3
import threading

class TTSEngine:
    def __init__(self, rate=175, volume=0.9, voice_index=0):
        self.engine = pyttsx3.init()
        self.engine.setProperty("rate", rate)      # Words per minute
        self.engine.setProperty("volume", volume)  # 0.0 to 1.0

        # Select voice (0 = default, try 1 for female voice on most systems)
        voices = self.engine.getProperty("voices")
        if voice_index < len(voices):
            self.engine.setProperty("voice", voices[voice_index].id)

        self._lock = threading.Lock()

    def speak(self, text: str):
        """Speak text synchronously."""
        with self._lock:
            self.engine.say(text)
            self.engine.runAndWait()

    def speak_async(self, text: str):
        """Speak text without blocking the main thread."""
        thread = threading.Thread(target=self.speak, args=(text,), daemon=True)
        thread.start()
        return thread

For higher-quality voices, consider using the elevenlabs or google-cloud-texttospeech library instead of pyttsx3. The trade-off is latency — cloud TTS adds 300-800ms per request.
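Whichever engine you use, perceived latency drops if you feed it one sentence at a time: audio starts as soon as the first sentence is ready instead of after the whole reply. A simple regex splitter sketch (not a full sentence tokenizer):

```python
# sentence_splitter.py -- illustrative; a crude splitter, my own sketch.
# Splitting replies into sentences lets TTS begin speaking the first
# sentence while the rest are still queued.
import re

def split_sentences(text: str):
    """Split on sentence-ending punctuation, keeping the punctuation."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

print(split_sentences("You have three emails. Two are urgent! Shall I read them?"))
# ['You have three emails.', 'Two are urgent!', 'Shall I read them?']
```

Feed each element to speak() in turn; the gap between sentences is barely audible at normal speaking rates.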

Step 5: Build Custom Voice Command Workflows

Voice commands become truly powerful when they trigger multi-step workflows via OpenClaw's automation engine.

# voice_commands.yaml
commands:
  check_emails:
    trigger_phrases:
      - "check my emails"
      - "any new emails"
      - "summarize inbox"
    workflow: email_triage
    response_template: "You have {count} unread emails. {summary}"

  morning_brief:
    trigger_phrases:
      - "morning brief"
      - "good morning"
      - "start my day"
    workflow: daily_briefing
    response_template: "Good morning! Here's your briefing: {content}"

  set_reminder:
    trigger_phrases:
      - "remind me to"
      - "set a reminder"
    workflow: create_reminder
    extract_params:
      - name: task
        type: text
      - name: time
        type: datetime
The router loads this file at startup and fuzzy-matches each transcript against the trigger phrases:

# command_router.py
import yaml
from openclaw import WorkflowRunner
from difflib import SequenceMatcher

class CommandRouter:
    def __init__(self, commands_file="voice_commands.yaml"):
        with open(commands_file) as f:
            self.commands = yaml.safe_load(f)["commands"]
        self.workflow_runner = WorkflowRunner()

    def find_matching_command(self, text: str, threshold=0.6):
        """Find the best matching command using fuzzy matching."""
        text_lower = text.lower()
        best_match = None
        best_score = 0

        for cmd_name, cmd_config in self.commands.items():
            for phrase in cmd_config["trigger_phrases"]:
                score = SequenceMatcher(None, text_lower, phrase).ratio()
                if score > best_score and score >= threshold:
                    best_score = score
                    best_match = (cmd_name, cmd_config)

        return best_match

    def execute(self, text: str):
        """Route voice input to the appropriate workflow."""
        match = self.find_matching_command(text)
        if match:
            cmd_name, cmd_config = match
            result = self.workflow_runner.run(cmd_config["workflow"])
            return cmd_config["response_template"].format(**result)
        return None  # Fall through to general AI conversation
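To get a feel for why 0.6 is a reasonable starting threshold, here is how SequenceMatcher scores a few transcripts against a trigger phrase. Exact matches score 1.0, near-misses from transcription errors stay high, and unrelated commands fall well below the cutoff:

```python
# Quick illustration of difflib.SequenceMatcher ratios for fuzzy matching.
from difflib import SequenceMatcher

def score(a: str, b: str) -> float:
    """Similarity ratio between two phrases, case-insensitive."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

print(round(score("check my emails", "check my emails"), 2))  # 1.0
print(round(score("check my email", "check my emails"), 2))   # 0.97
print(score("play some jazz", "check my emails") < 0.6)       # True
```

If you see false positives in practice, raise the threshold toward 0.75; if Whisper's occasional mis-transcriptions cause misses, lower it slightly instead.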

Step 6: Handle Ambient Noise

A voice assistant that mishears "play jazz" as "place jazz order" is more frustrating than helpful. Implement noise gating to filter out background audio.

# noise_filter.py
import numpy as np
import sounddevice as sd

class NoiseGatedRecorder:
    def __init__(
        self,
        sample_rate=16000,
        silence_threshold=0.015,
        min_speech_duration=0.5,
        max_duration=8.0,
        silence_timeout=1.2
    ):
        self.sample_rate = sample_rate
        self.silence_threshold = silence_threshold
        self.min_speech_frames = int(min_speech_duration * sample_rate)
        self.max_frames = int(max_duration * sample_rate)
        self.silence_frames = int(silence_timeout * sample_rate)

    def record_utterance(self):
        """Record a single utterance, stopping on silence."""
        frames = []
        silence_counter = 0
        speaking = False

        with sd.InputStream(samplerate=self.sample_rate, channels=1, dtype="float32") as stream:
            while len(frames) < self.max_frames:
                chunk, _ = stream.read(512)
                chunk = chunk.flatten()
                frames.extend(chunk)

                rms = np.sqrt(np.mean(chunk ** 2))

                if rms > self.silence_threshold:
                    speaking = True
                    silence_counter = 0
                elif speaking:
                    silence_counter += len(chunk)
                    if silence_counter > self.silence_frames:
                        break  # Silence detected after speech — done

        audio = np.array(frames, dtype="float32")

        # Reject recordings that are too short
        if len(audio) < self.min_speech_frames:
            return None

        return audio

The silence_threshold value (0.015) works well in a quiet home office. In noisier environments, raise it to 0.03-0.05 or implement adaptive thresholding based on background RMS measurements taken at startup.
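The adaptive variant can be sketched in a few lines: sample ambient audio for a second at startup, compute its RMS, and set the gate a safe margin above it, never below a sane floor. Pure Python here for clarity (the real recorder would compute RMS with numpy on captured audio); the margin and floor values are my own starting points:

```python
# calibrate.py -- sketch of adaptive thresholding from background noise.
import math

def ambient_rms(samples):
    """Root-mean-square level of a calibration recording."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def adaptive_threshold(samples, margin=3.0, floor=0.015):
    """Gate set a margin above background noise, never below the floor."""
    return max(floor, margin * ambient_rms(samples))

quiet_room = [0.002, -0.003, 0.001, -0.002]
print(adaptive_threshold(quiet_room))  # falls back to the 0.015 floor
```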

Step 7: Assemble the Full Assistant and Optimize Latency

# main.py
import os
from wake_word import WakeWordDetector
from whisper_transcriber import WhisperTranscriber
from noise_filter import NoiseGatedRecorder
from openclaw_brain import AssistantBrain
from tts_engine import TTSEngine
from command_router import CommandRouter

def main():
    # Initialize all components
    wake_detector = WakeWordDetector(
        access_key=os.environ["PORCUPINE_ACCESS_KEY"]
    )
    recorder = NoiseGatedRecorder()
    transcriber = WhisperTranscriber(model_size="base")
    brain = AssistantBrain(api_key=os.environ["OPENCLAW_API_KEY"])
    tts = TTSEngine(rate=175)
    router = CommandRouter("voice_commands.yaml")

    tts.speak("Jarvis online. Say 'Jarvis' to activate.")

    try:
        while True:
            # Phase 1: Detect wake word (low CPU usage)
            wake_detector.listen_for_wake_word()
            tts.speak_async("Yes?")

            # Phase 2: Record the actual command
            audio = recorder.record_utterance()
            if audio is None:
                continue

            # Phase 3: Transcribe speech to text
            text = transcriber.transcribe(audio)
            if not text:
                continue
            print(f"You said: {text}")

            # Phase 4: Route to workflow or AI
            workflow_response = router.execute(text)
            if workflow_response:
                response = workflow_response
            else:
                response = brain.process(text)

            # Phase 5: Speak the response
            print(f"Jarvis: {response}")
            tts.speak(response)

    except KeyboardInterrupt:
        tts.speak("Shutting down. Goodbye.")
        wake_detector.cleanup()

if __name__ == "__main__":
    main()

Latency Optimization Tips

The biggest latency killer is Whisper transcription. Here are ways to cut response time:

| Optimization | Latency Saving | Trade-off |
|---|---|---|
| Use tiny Whisper model | ~400 ms | Slightly lower accuracy |
| Pre-load Whisper at startup | ~1.5 s on first use | Higher memory usage |
| Stream TTS while generating | ~600 ms perceived | More complex code |
| Use claude-3-haiku model | ~300 ms | Shorter context window |
| Cache common responses | ~800 ms | Stale data risk |

With all optimizations applied, end-to-end latency from end of speech to start of response should be under 2 seconds on a modern laptop.
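Before optimizing, measure: wrap each phase of the pipeline so you can see where the latency budget actually goes. A small measurement aid (my own sketch) using time.perf_counter:

```python
# timing.py -- a measurement sketch; wrap each pipeline phase with timed().
import time
from contextlib import contextmanager

@contextmanager
def timed(label, timings):
    """Record the wall-clock duration of a pipeline phase into timings."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start

timings = {}
with timed("transcribe", timings):
    time.sleep(0.05)  # stand-in for transcriber.transcribe(audio)
print(f"transcribe took {timings['transcribe'] * 1000:.0f} ms")
```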

Common Pitfalls to Avoid

  1. Running Whisper on every audio chunk: Only transcribe after detecting meaningful speech above the noise gate threshold.
  2. Blocking the main thread on TTS: Use speak_async so the assistant can listen again while speaking short confirmations.
  3. No timeout on recordings: Always set max_duration — a dropped microphone otherwise causes an infinite recording.
  4. Single-threaded architecture: Wake word detection, recording, and TTS should each be in separate threads for responsiveness.

Next Steps

Now that your voice assistant is running, extend it further: connect more workflows like the email and reminder examples above, train a custom wake word with Porcupine, or swap in a higher-quality cloud TTS voice.

Your Jarvis is just getting started. Every workflow you connect makes it smarter and saves you more time.
