---
title: "OpenClaw Voice Assistant: Build Your Own Jarvis"
date: "2026-02-20"
description: "Build a voice-controlled AI assistant with OpenClaw. This tutorial covers speech recognition, natural language processing, voice synthesis, and creating custom voice commands."
category: "Tutorial"
author: "OpenClaw Team"
tags: ["voice", "assistant", "speech", "tutorial"]
readTime: "15 min"
---
Imagine walking into your home office and saying "Hey Claw, summarize my emails and brief me on today's agenda" — and having an AI assistant that actually responds intelligently. That's exactly what you'll build in this tutorial. Using OpenClaw as the AI brain and open-source speech tools, you can have your own Jarvis running on your local machine in under an hour.
## What You'll Build
By the end of this tutorial, you'll have a fully functional voice assistant that:
- Listens for a custom wake word without consuming cloud resources
- Transcribes your speech with OpenAI Whisper running locally
- Processes natural language through OpenClaw to generate smart responses
- Reads responses aloud using text-to-speech synthesis
- Handles ambient noise and partial utterances gracefully
- Executes custom voice command workflows like sending emails or searching the web
## Prerequisites
Before you begin, make sure you have:
- OpenClaw installed (installation guide)
- Python 3.10+ with `pip`
- A working microphone connected to your machine
- At least 4 GB of RAM (8 GB recommended for Whisper medium model)
- An OpenClaw API key configured
- Basic familiarity with Python and the command line
Install the required Python packages up front:
```bash
pip install openai-whisper pyaudio pyttsx3 pvporcupine sounddevice numpy scipy openclaw-sdk
```
On macOS you may also need:
```bash
brew install portaudio ffmpeg
```
## Step 1: Set Up Speech-to-Text with Whisper
Whisper is OpenAI's open-source speech recognition model. Running it locally means your voice never leaves your machine — an important privacy consideration.
```python
# whisper_transcriber.py
import whisper
import sounddevice as sd


class WhisperTranscriber:
    def __init__(self, model_size="base"):
        # Options: tiny, base, small, medium, large
        # 'base' is a good balance of speed and accuracy
        print(f"Loading Whisper {model_size} model...")
        self.model = whisper.load_model(model_size)
        self.sample_rate = 16000  # Whisper expects 16 kHz audio

    def record_audio(self, duration=5):
        """Record a fixed-duration clip from the default microphone.

        (Silence-based early stopping is handled by the noise-gated
        recorder in Step 6.)
        """
        print("Listening...")
        audio = sd.rec(
            int(duration * self.sample_rate),
            samplerate=self.sample_rate,
            channels=1,
            dtype="float32",
        )
        sd.wait()  # Block until the recording is complete
        return audio.flatten()

    def transcribe(self, audio_array):
        """Transcribe a numpy audio array to text."""
        result = self.model.transcribe(
            audio_array,
            language="en",
            fp16=False,  # Set True if you have a CUDA GPU
        )
        return result["text"].strip()
```
The `base` model requires around 150 MB of disk space and runs comfortably on a CPU. If you have a dedicated GPU, switch to `medium` for significantly better accuracy on accented speech or technical terms.
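A note on sample formats: Whisper expects float32 audio in the [-1.0, 1.0] range, while the wake-word stream in the next step captures int16 PCM. If you ever feed int16 mic audio into `transcribe`, convert it first. A minimal sketch:

```python
import numpy as np

def int16_to_float32(audio: np.ndarray) -> np.ndarray:
    """Convert int16 PCM samples to the float32 [-1, 1] range Whisper expects."""
    # int16 spans [-32768, 32767], so dividing by 32768 lands in [-1, 1)
    return audio.astype(np.float32) / 32768.0

samples = np.array([0, 16384, -32768, 32767], dtype=np.int16)
normalized = int16_to_float32(samples)
```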
## Step 2: Implement Wake Word Detection
Polling the microphone constantly and running Whisper on silence is wasteful. A lightweight wake word engine lets your assistant sleep until you need it.
```python
# wake_word.py
import pvporcupine
import sounddevice as sd


class WakeWordDetector:
    def __init__(self, access_key, keywords=None):
        # Porcupine offers a free tier for personal use.
        # Built-in keywords include "hey siri", "alexa", "jarvis", etc.
        self.porcupine = pvporcupine.create(
            access_key=access_key,
            keywords=keywords or ["jarvis"],
        )
        self.frame_length = self.porcupine.frame_length
        self.sample_rate = self.porcupine.sample_rate

    def listen_for_wake_word(self):
        """Block until the wake word is detected."""
        print("Waiting for wake word...")
        with sd.InputStream(
            samplerate=self.sample_rate,
            channels=1,
            dtype="int16",
            blocksize=self.frame_length,
        ) as stream:
            while True:
                audio_frame, _ = stream.read(self.frame_length)
                result = self.porcupine.process(audio_frame.flatten())
                if result >= 0:
                    print("Wake word detected!")
                    return True

    def cleanup(self):
        self.porcupine.delete()
```
Porcupine offers a free tier that supports custom wake words. Sign up at their developer portal to get an `access_key`. Alternatively, you can use the open-source `openwakeword` library for a fully offline solution.
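One practical detail: Porcupine only accepts audio in exact `frame_length` chunks (512 samples at its 16 kHz rate). The `blocksize` argument above handles that, but if your audio source delivers buffers of arbitrary size, you need to rechunk them. A hypothetical helper (`frame_chunks` is not part of `pvporcupine`; the 512 default is an assumption matching Porcupine's usual frame length):

```python
from typing import Iterator, List

def frame_chunks(samples: List[int], frame_length: int = 512) -> Iterator[List[int]]:
    """Yield fixed-size frames suitable for porcupine.process().

    Trailing samples that don't fill a full frame are dropped here;
    in a real stream you would carry them over into the next read.
    """
    for start in range(0, len(samples) - frame_length + 1, frame_length):
        yield samples[start:start + frame_length]

# 1300 samples yields two full 512-sample frames; the remainder is dropped
frames = list(frame_chunks(list(range(1300)), frame_length=512))
```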
## Step 3: Connect to OpenClaw for Intelligence
OpenClaw handles the "thinking" part — interpreting intent, maintaining conversation context, and generating useful responses.
```python
# openclaw_brain.py
from openclaw import OpenClaw, ConversationMemory


class AssistantBrain:
    def __init__(self, api_key):
        self.client = OpenClaw(api_key=api_key)
        self.memory = ConversationMemory(max_turns=10)
        self.system_prompt = """
        You are a helpful voice assistant called Jarvis.
        Keep your responses concise and natural for spoken delivery —
        avoid markdown formatting, bullet points, and code blocks.
        Use plain conversational language.
        If you need to list items, speak them naturally: "First... Second... Third..."
        """

    def process(self, user_input: str) -> str:
        """Process user input and return a voice-friendly response."""
        self.memory.add_user_message(user_input)
        response = self.client.chat(
            model="claude-3-sonnet",
            system=self.system_prompt,
            messages=self.memory.get_messages(),
            max_tokens=300,  # Keep responses short for voice
        )
        assistant_reply = response.content
        self.memory.add_assistant_message(assistant_reply)
        return assistant_reply

    def reset_context(self):
        """Clear conversation history for a fresh session."""
        self.memory.clear()
```
Notice the `max_tokens=300` limit. Voice responses should be short — nobody wants to listen to a three-minute monologue from their assistant. If you need longer outputs (like a full email draft), have the assistant save it to a file and notify you verbally.
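The `ConversationMemory(max_turns=10)` cap keeps the prompt from growing without bound. If you want a rough mental model of what such a class does (this is a stand-in sketch, not OpenClaw's actual implementation), a deque with a fixed length covers the essentials:

```python
from collections import deque

class SimpleMemory:
    """Keep only the most recent N user/assistant turns."""

    def __init__(self, max_turns: int = 10):
        # Each turn is a (user, assistant) pair; old turns fall off the left.
        self.turns = deque(maxlen=max_turns)
        self._pending_user = None

    def add_user_message(self, text: str):
        self._pending_user = text

    def add_assistant_message(self, text: str):
        self.turns.append((self._pending_user, text))
        self._pending_user = None

    def get_messages(self):
        """Flatten stored turns into a chat-style message list."""
        messages = []
        for user, assistant in self.turns:
            messages.append({"role": "user", "content": user})
            messages.append({"role": "assistant", "content": assistant})
        if self._pending_user is not None:
            messages.append({"role": "user", "content": self._pending_user})
        return messages

memory = SimpleMemory(max_turns=2)
for i in range(3):
    memory.add_user_message(f"question {i}")
    memory.add_assistant_message(f"answer {i}")
messages = memory.get_messages()  # turn 0 has been evicted
```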
## Step 4: Add Text-to-Speech Output
```python
# tts_engine.py
import pyttsx3
import threading


class TTSEngine:
    def __init__(self, rate=175, volume=0.9, voice_index=0):
        self.engine = pyttsx3.init()
        self.engine.setProperty("rate", rate)      # Words per minute
        self.engine.setProperty("volume", volume)  # 0.0 to 1.0
        # Select voice (0 = default; try 1 for a female voice on most systems)
        voices = self.engine.getProperty("voices")
        if voice_index < len(voices):
            self.engine.setProperty("voice", voices[voice_index].id)
        self._lock = threading.Lock()

    def speak(self, text: str):
        """Speak text synchronously."""
        with self._lock:
            self.engine.say(text)
            self.engine.runAndWait()

    def speak_async(self, text: str):
        """Speak text without blocking the main thread."""
        thread = threading.Thread(target=self.speak, args=(text,), daemon=True)
        thread.start()
        return thread
```
For higher-quality voices, consider using the `elevenlabs` or `google-cloud-texttospeech` library instead of `pyttsx3`. The trade-off is latency — cloud TTS adds 300-800 ms per request.
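Whichever backend you choose, you can cut perceived latency by splitting the response into sentences and starting playback of the first one while the rest is still being synthesized. A simple splitter sketch (the regex is an assumption: it handles common terminal punctuation, not abbreviations or every edge case):

```python
import re

def split_sentences(text: str) -> list[str]:
    """Split text on sentence-ending punctuation so TTS can start early."""
    # Split after ., !, or ? followed by whitespace; punctuation stays attached.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

chunks = split_sentences(
    "You have three meetings today. The first is at 9 AM! Shall I read the details?"
)
```

Feed each chunk to `speak` on the TTS thread in order; the lock in `TTSEngine` already serializes them, so the first sentence plays while later ones wait their turn.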
## Step 5: Build Custom Voice Command Workflows
Voice commands become truly powerful when they trigger multi-step workflows via OpenClaw's automation engine.
```yaml
# voice_commands.yaml
commands:
  check_emails:
    trigger_phrases:
      - "check my emails"
      - "any new emails"
      - "summarize inbox"
    workflow: email_triage
    response_template: "You have {count} unread emails. {summary}"

  morning_brief:
    trigger_phrases:
      - "morning brief"
      - "good morning"
      - "start my day"
    workflow: daily_briefing
    response_template: "Good morning! Here's your briefing: {content}"

  set_reminder:
    trigger_phrases:
      - "remind me to"
      - "set a reminder"
    workflow: create_reminder
    extract_params:
      - name: task
        type: text
      - name: time
        type: datetime
```
```python
# command_router.py
from difflib import SequenceMatcher

import yaml
from openclaw import WorkflowRunner


class CommandRouter:
    def __init__(self, commands_file="voice_commands.yaml"):
        with open(commands_file) as f:
            self.commands = yaml.safe_load(f)["commands"]
        self.workflow_runner = WorkflowRunner()

    def find_matching_command(self, text: str, threshold=0.6):
        """Find the best matching command using fuzzy matching."""
        text_lower = text.lower()
        best_match = None
        best_score = 0
        for cmd_name, cmd_config in self.commands.items():
            for phrase in cmd_config["trigger_phrases"]:
                score = SequenceMatcher(None, text_lower, phrase).ratio()
                if score > best_score and score >= threshold:
                    best_score = score
                    best_match = (cmd_name, cmd_config)
        return best_match

    def execute(self, text: str):
        """Route voice input to the appropriate workflow."""
        match = self.find_matching_command(text)
        if match:
            cmd_name, cmd_config = match
            result = self.workflow_runner.run(cmd_config["workflow"])
            return cmd_config["response_template"].format(**result)
        return None  # Fall through to general AI conversation
```
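To get a feel for how the 0.6 threshold behaves, you can score a few inputs directly with `difflib.SequenceMatcher`, the same scoring the router uses:

```python
from difflib import SequenceMatcher

def score(text: str, phrase: str) -> float:
    """Similarity ratio in [0, 1], as computed in find_matching_command."""
    return SequenceMatcher(None, text.lower(), phrase).ratio()

exact = score("Check my emails", "check my emails")      # identical after lowering
close = score("check my email", "check my emails")       # one character off
unrelated = score("play some jazz", "check my emails")   # should fall below 0.6
```

Near-misses like a dropped plural still clear the threshold comfortably, while unrelated phrases fall through to the general AI conversation path.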
## Step 6: Handle Ambient Noise
A voice assistant that mishears "play jazz" as "place jazz order" is more frustrating than helpful. Implement noise gating to filter out background audio.
```python
# noise_filter.py
import numpy as np
import sounddevice as sd


class NoiseGatedRecorder:
    def __init__(
        self,
        sample_rate=16000,
        silence_threshold=0.015,
        min_speech_duration=0.5,
        max_duration=8.0,
        silence_timeout=1.2,
    ):
        self.sample_rate = sample_rate
        self.silence_threshold = silence_threshold
        self.min_speech_frames = int(min_speech_duration * sample_rate)
        self.max_frames = int(max_duration * sample_rate)
        self.silence_frames = int(silence_timeout * sample_rate)

    def record_utterance(self):
        """Record a single utterance, stopping on silence."""
        frames = []
        silence_counter = 0
        speaking = False
        with sd.InputStream(
            samplerate=self.sample_rate, channels=1, dtype="float32"
        ) as stream:
            while len(frames) < self.max_frames:
                chunk, _ = stream.read(512)
                chunk = chunk.flatten()
                frames.extend(chunk)
                rms = np.sqrt(np.mean(chunk ** 2))
                if rms > self.silence_threshold:
                    speaking = True
                    silence_counter = 0
                elif speaking:
                    silence_counter += len(chunk)
                    if silence_counter > self.silence_frames:
                        break  # Silence detected after speech — done
        audio = np.array(frames, dtype="float32")
        # Reject recordings that are too short
        if len(audio) < self.min_speech_frames:
            return None
        return audio
```
The `silence_threshold` value (0.015) works well in a quiet home office. In noisier environments, raise it to 0.03-0.05 or implement adaptive thresholding based on background RMS measurements taken at startup.
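Adaptive thresholding is straightforward: sample a second or two of ambient audio at startup, measure its RMS, and set the gate a fixed multiple above it. A sketch, where the 3x margin and the 0.01 floor are assumptions to tune for your room:

```python
import numpy as np

def calibrate_threshold(background: np.ndarray, margin: float = 3.0,
                        floor: float = 0.01) -> float:
    """Set the noise gate a fixed multiple above ambient RMS."""
    ambient_rms = float(np.sqrt(np.mean(background ** 2)))
    # Never drop below a small floor, or a dead-silent room gates nothing.
    return max(ambient_rms * margin, floor)

# Simulate one second of quiet-room noise with RMS around 0.005
rng = np.random.default_rng(42)
background = rng.normal(0.0, 0.005, size=16000).astype("float32")
threshold = calibrate_threshold(background)
```

Pass the result as `silence_threshold` when constructing `NoiseGatedRecorder`, and recalibrate whenever the assistant moves to a new environment.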
## Step 7: Assemble the Full Assistant and Optimize Latency
```python
# main.py
import os

from wake_word import WakeWordDetector
from whisper_transcriber import WhisperTranscriber
from noise_filter import NoiseGatedRecorder
from openclaw_brain import AssistantBrain
from tts_engine import TTSEngine
from command_router import CommandRouter


def main():
    # Initialize all components
    wake_detector = WakeWordDetector(
        access_key=os.environ["PORCUPINE_ACCESS_KEY"]
    )
    recorder = NoiseGatedRecorder()
    transcriber = WhisperTranscriber(model_size="base")
    brain = AssistantBrain(api_key=os.environ["OPENCLAW_API_KEY"])
    tts = TTSEngine(rate=175)
    router = CommandRouter("voice_commands.yaml")

    tts.speak("Jarvis online. Say 'Jarvis' to activate.")
    try:
        while True:
            # Phase 1: Detect wake word (low CPU usage)
            wake_detector.listen_for_wake_word()
            tts.speak_async("Yes?")

            # Phase 2: Record the actual command
            audio = recorder.record_utterance()
            if audio is None:
                continue

            # Phase 3: Transcribe speech to text
            text = transcriber.transcribe(audio)
            if not text:
                continue
            print(f"You said: {text}")

            # Phase 4: Route to workflow or AI
            workflow_response = router.execute(text)
            if workflow_response:
                response = workflow_response
            else:
                response = brain.process(text)

            # Phase 5: Speak the response
            print(f"Jarvis: {response}")
            tts.speak(response)
    except KeyboardInterrupt:
        tts.speak("Shutting down. Goodbye.")
    finally:
        # Release the Porcupine handle even on unexpected errors
        wake_detector.cleanup()


if __name__ == "__main__":
    main()
```
### Latency Optimization Tips
The biggest latency killer is Whisper transcription. Here are ways to cut response time:
| Optimization | Latency saving | Trade-off |
|---|---|---|
| Use the `tiny` Whisper model | ~400 ms | Slightly lower accuracy |
| Pre-load Whisper at startup | ~1.5 s on first use | Higher memory usage |
| Stream TTS while generating | ~600 ms perceived | More complex code |
| Use the `claude-3-haiku` model | ~300 ms | Less capable responses |
| Cache common responses | ~800 ms | Stale data risk |
With all optimizations applied, end-to-end latency from end of speech to start of response should be under 2 seconds on a modern laptop.
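Caching common responses (the last row of the table) can be as simple as a normalized-text dictionary in front of the brain, with a TTL to bound staleness. A sketch; anything time-sensitive, like the morning brief, should bypass it:

```python
import time

class ResponseCache:
    """Cache brain responses for repeated phrasings, with a TTL to limit staleness."""

    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl = ttl_seconds
        self._store = {}  # normalized text -> (timestamp, response)

    @staticmethod
    def _normalize(text: str) -> str:
        # Lowercase and strip punctuation so "What time is it?" matches "what time is it"
        return "".join(c for c in text.lower() if c.isalnum() or c.isspace()).strip()

    def get(self, text: str):
        entry = self._store.get(self._normalize(text))
        if entry is None:
            return None
        timestamp, response = entry
        if time.monotonic() - timestamp > self.ttl:
            del self._store[self._normalize(text)]  # expired
            return None
        return response

    def put(self, text: str, response: str):
        self._store[self._normalize(text)] = (time.monotonic(), response)

cache = ResponseCache(ttl_seconds=300)
cache.put("What time is it?", "It's ten past nine.")
hit = cache.get("what time is it")    # punctuation and case are ignored
miss = cache.get("check my emails")   # never cached
```

In the main loop, check the cache before calling `brain.process` and `put` the reply afterwards, skipping commands whose answers change between invocations.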
## Common Pitfalls to Avoid
- Running Whisper on every audio chunk: Only transcribe after detecting meaningful speech above the noise gate threshold.
- Blocking the main thread on TTS: Use `speak_async` so the assistant can listen again while speaking short confirmations.
- No timeout on recordings: Always set `max_duration` — a dropped microphone otherwise causes an infinite recording.
- Single-threaded architecture: Wake word detection, recording, and TTS should each run in separate threads for responsiveness.
## Next Steps
Now that your voice assistant is running, extend it further:
- Build a multi-agent system to give Jarvis access to research and writing agents
- Schedule AI tasks with cron jobs to run morning briefings automatically
- Explore 5 OpenClaw automations for ready-made workflows to plug into your voice commands
Your Jarvis is just getting started. Every workflow you connect makes it smarter and saves you more time.