We’re past the point where voice can be treated as just another feature.
For more than a decade, the smart home has operated under a flawed assumption: that voice is optional. It’s not. As homes grow more complex and connected, voice is the only interface that aligns with how people actually live.
Traditional interfaces don’t scale: touchscreens fail when your hands are full, apps demand too much attention, and remotes are always missing when you need them. Voice is the only input that works across rooms, contexts, and users, provided it works reliably.
And yet, we’re still tethered to physical buttons and remote controls, because we don’t fully trust voice interfaces. They miss commands, struggle in noisy environments, and break the moment connectivity becomes unstable. That’s not a UI flaw. It’s an architectural one.
To replace the light switch, voice needs to be always available, always accurate, and always in context. That means rethinking where intelligence lives and how decisions are made.
Hybrid Voice AI architecture is not an incremental upgrade; it’s an engineering breakthrough that transforms the smart home from a scattered set of reactive gadgets into a cohesive, proactive system. By separating real-time, on-device reflexes from deep, cloud-based reasoning, this architecture is designed to make voice a trusted, primary interface, every time, in every room.
Making voice work in the real world
The flaw in current voice technology isn’t a lack of data; it’s a lack of clarity
Real homes are acoustically chaotic. They’re full of overlapping conversations, background music, household noise, and hard surfaces that introduce echo and reverb. Users speak from different rooms, distances, and angles. Commands are often ambiguous or incomplete. These aren’t edge cases. They’re the default operating conditions.
Current cloud-only models are powerful but slow, while legacy on-device models are fast but dim-witted. Neither alone can deliver the “Star Trek” experience users crave. To achieve the reliability users treat as non-negotiable, we need a system that mimics the human brain’s ability to process reflexes locally and complex thoughts deeply.
In that context, today’s voice interfaces consistently fall short. Not because of a lack of data or model size, but because of fundamental architecture-level decisions about where processing happens, how quickly systems respond, and how they handle failure.
A symbiotic two-tier architecture
The innovation lies in splitting the intelligence. By decoupling immediate execution from deep reasoning, we create a system that is both instant and intelligent.
- The Reflex Layer – Edge AI (Instant Response):
- Definition: Think of this as the smart home’s autonomic nervous system.
- Innovation: A high-performance, always-on small language model (SLM) embedded directly on the device’s silicon.
- Function: Handles the “here and now.” Commands like “Lights on” or “Volume down” are processed locally with near-zero latency.
- Impact: Delivers absolute privacy and instant responsiveness. No data leaves the room, and the experience feels as immediate as flipping a physical switch.
- The Reasoning Layer – Cloud AI (Intelligent Coordination):
- Definition: This acts as the system’s prefrontal cortex—responsible for reasoning.
- Innovation: Leverages large language models (LLMs) to manage long-term state, memory, and complex logic across devices and use cases.
- Function: Handles the “what if” and “what next.” It manages household routines, coordinates multiple devices, and draws inferences from incomplete inputs (e.g., “Order dinner for whoever is home tonight.”)
- Impact: Enables devices to go beyond command execution—they begin to understand intent, anticipate user needs, and adapt over time (Figure 1).

Figure 1 A hybrid voice stack routes audio through on-device perception (AEC, spatial analysis, separation, intent gating) and escalates only complex requests to cloud reasoning. (Source: Kardome)
Differentiation for the decade ahead
For OEMs and Tier 1 suppliers, architecture, not features, is emerging as the defining battleground for the next generation of smart home systems.
The market is saturated with devices that can set timers, play music, or toggle lights. These capabilities are now commodity. What will set future systems apart is their ability to demonstrate true Auditory Intelligence—to perceive, localize, and interpret human speech reliably, even in noisy, multi-speaker, real-world environments.
By integrating spatial hearing AI and cognition technologies into a hybrid architecture, manufacturers can go beyond individual product features and instead build the auditory nervous system of the modern home.
We are past the era of voice assistants that require users to repeat themselves or speak in rigid syntax. Hybrid Voice AI enables a different class of experience—one where technology is felt, but rarely seen.

Figure 2 Spatial processing turns a mixed audio scene (TV + two speakers + reverb) into separated target streams suitable for intent detection and command execution. (Source: Kardome)
What “reflex vs. reasoning” means
In a production voice system, “hybrid” isn’t simply “ASR on-device and an LLM in the cloud.” It’s a routing architecture with a continuously running perception pipeline that decides:
- Is anyone speaking?
- Who is speaking (and where)?
- Is it directed at the device?
- Can we execute locally, or do we need cloud reasoning?
A practical edge “reflex” stack typically includes:
- Acoustic front end (always-on): microphone capture → gain control / denoise → echo cancellation (to remove the device’s own playback).
- Spatial scene analysis: estimate how many sources exist and where they are relative to the device (near/far, left/right, different rooms).
- Source separation + target selection: isolate the intended speaker stream(s) and suppress competing sources (TV, music, second speaker).
- Speech activity detection + endpointing: stable detection of speech start/stop to avoid clipped commands and reduce false triggers.
- Device-directed intent gating (SLM): a lightweight model answers: “Is this speech for the device?” using spatial cues + conversational flow + linguistic signals.
- Execution vs. escalation:
- Local path: deterministic actions and short commands (“lights on,” “stop,” “volume down”) with minimal latency.
- Cloud path: long-horizon reasoning, multi-device planning, and tasks requiring external knowledge—only when needed.
The engineering advantage is that the system can stay fast and predictable for everyday commands while still enabling deeper capabilities when appropriate.
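The execution-vs-escalation step can be sketched as a small router. The following is a minimal illustration under assumed names: the `Utterance` fields, the 0.8 gate threshold, and the `LOCAL_COMMANDS` set are all hypothetical, not a production implementation.

```python
from dataclasses import dataclass

# Illustrative sketch of the execute-vs-escalate decision.
# All names, thresholds, and the LOCAL_COMMANDS set are assumptions
# for exposition, not a real product API.

LOCAL_COMMANDS = {"lights on", "lights off", "stop", "volume up", "volume down"}

@dataclass
class Utterance:
    text: str                 # transcript from the on-device recognizer
    device_directed_p: float  # SLM gate: P(speech is for the device)
    same_room: bool           # from spatial scene analysis

def route(utt: Utterance, gate_threshold: float = 0.8) -> str:
    """Return 'ignore', 'local', or 'cloud' for one endpointed utterance."""
    # 1. Intent gating: drop speech not directed at the device.
    if utt.device_directed_p < gate_threshold or not utt.same_room:
        return "ignore"
    # 2. Reflex path: short, deterministic commands run on-device.
    if utt.text.lower().strip() in LOCAL_COMMANDS:
        return "local"
    # 3. Everything else escalates to cloud reasoning.
    return "cloud"
```

The key design property is that the default outcome is "ignore": the cloud is never consulted unless the perception pipeline has already decided the speech is directed at the device and exceeds local capability.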
Why spatial audio is the “make or break” layer
Most failures in today’s voice assistants begin before language: the system is fed garbage audio (mixed speakers, reverberation, background media), then asked to “understand” it. Hybrid architectures push the hard work earlier: fix the audio scene first, then do language.
Spatial processing matters because it enables three foundational capabilities:
- Localization: determine where speech is coming from and whether it’s in the same room.
- Separation: isolate a voice even with overlapping speakers and media noise.
- Attribution: reduce wrong-room actions and improve “who said what” reliability.
This is also where direction of arrival (DOA)-only approaches struggle in real homes: reflective surfaces create strong echoes and multiple delayed arrivals. A “flat” directional estimate can become unstable under reverb, causing separation and attribution errors. A more robust approach treats each source as having a unique spatial signature (an “acoustic fingerprint”) and uses that signature to stabilize separation and tracking over time.
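To make the DOA fragility concrete, here is a minimal GCC-PHAT time-difference-of-arrival sketch in NumPy, a classic two-microphone building block (not Kardome’s method). Under reverb, the whitened correlation develops multiple peaks from delayed reflections, which is exactly the instability described above.

```python
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    """Estimate the time difference of arrival (seconds) of `sig` vs `ref`.

    Classic GCC-PHAT: whiten the cross-spectrum so only phase (i.e. delay)
    information remains, then find the correlation peak. A textbook sketch,
    not a production front end.
    """
    n = len(sig) + len(ref)                      # zero-pad to avoid wraparound
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    r = SIG * np.conj(REF)
    r /= np.abs(r) + 1e-12                       # PHAT weighting: keep phase only
    cc = np.fft.irfft(r, n=n)
    max_shift = n // 2
    if max_tau is not None:                      # limit search to physical delays
        max_shift = min(int(fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = int(np.argmax(np.abs(cc))) - max_shift
    return shift / fs
```

In an anechoic setting the peak is sharp and the estimate is stable; in a reverberant room, each reflection contributes its own delayed copy of the phase term, so `argmax` can jump between peaks frame to frame. Spatial-signature approaches avoid committing to a single flat delay estimate.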
Latency, offline behavior, failure modes
If voice is going to replace physical controls, reliability can’t be an aspiration—it has to be engineered with explicit budgets and test matrices.
Latency budget
Humans pause roughly 200 ms between conversational turns, while cloud round trips often land in the 1–3 second range: good enough for Q&A, not good enough for control.
The reflex path should therefore be designed so the most common commands complete without waiting on the network.
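One way to make that budget explicit is a per-stage table that engineering can test against in CI. The stage names and millisecond figures below are illustrative assumptions, not measured numbers.

```python
# Illustrative latency budget for the local reflex path.
# Stage names and millisecond figures are assumptions for exposition,
# not measurements of any particular system.
REFLEX_BUDGET_MS = {
    "capture + AEC":       20,
    "separation + gating": 60,
    "on-device ASR":       80,
    "command execution":   30,
}

def within_turn_budget(budget_ms: dict, target_ms: int = 200) -> bool:
    """Check that the end-to-end reflex path fits inside a conversational turn."""
    return sum(budget_ms.values()) <= target_ms
```

The point of writing the budget down is that any stage regression (say, a heavier separation model) immediately shows up as a failed check rather than as a vague feeling that the assistant "got slower."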
Offline and “brownout” modes
Define tiers of capability that remain functional without connectivity:
- Tier A (must work offline): lights, volume, stop/quiet, timers, basic routines.
- Tier B (cloud-required): deep reasoning, external services.
This avoids a binary “voice works / voice is dead” experience and increases user trust.
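The tiering can be expressed as a simple capability map consulted at routing time. The command lists and return labels here are examples for illustration, not a product spec.

```python
# Illustrative capability tiers for offline / "brownout" behavior.
# The capability names are examples, not a product spec.
TIER_A = {"lights", "volume", "stop", "quiet", "timer", "routine.basic"}  # must work offline
TIER_B = {"reasoning", "external.search", "delivery.order"}               # cloud-required

def handle(capability: str, online: bool) -> str:
    """Decide what happens to a request given current connectivity."""
    if capability in TIER_A:
        return "execute-local"          # never depends on the network
    if capability in TIER_B:
        # Degrade gracefully: tell the user, don't silently fail.
        return "execute-cloud" if online else "defer-with-feedback"
    return "reject"
```

The "defer-with-feedback" path matters as much as the happy path: a device that says "I'll order dinner when I'm back online" preserves trust in a way that silence does not.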
Failure modes that must be tested (not treated as edge cases)
- overlapping speakers (barge-in, crosstalk)
- competing media (TV/music)
- far-field speech + occlusion (speaker in hallway / adjacent room)
- changing echo paths (content and volume changes)
- reverberant rooms (kitchen tile, open-plan living spaces)
Metrics that map to trust (beyond WER):
- end-to-end command success rate by scenario class
- false accept / false reject rates for device-directed intent gating
- speaker attribution / room attribution accuracy
- P95 latency (not just average) for Tier A commands
- recovery time after connectivity loss
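Several of these metrics are straightforward to compute from logged trials. A minimal sketch follows, using nearest-rank P95 and an assumed log format of `(device_directed, accepted)` pairs for the intent gate.

```python
import math

def p95(latencies_ms):
    """95th-percentile latency (nearest-rank): the number users feel on bad runs."""
    s = sorted(latencies_ms)
    rank = max(math.ceil(0.95 * len(s)) - 1, 0)
    return s[rank]

def gate_rates(trials):
    """False accept / false reject rates for device-directed intent gating.

    trials: list of (device_directed: bool, accepted: bool) tuples,
    an assumed log format for illustration.
    """
    fa = sum(1 for d, a in trials if a and not d)   # acted on non-directed speech
    fr = sum(1 for d, a in trials if d and not a)   # ignored directed speech
    neg = sum(1 for d, _ in trials if not d) or 1
    pos = sum(1 for d, _ in trials if d) or 1
    return fa / neg, fr / pos
```

Reporting P95 rather than the mean is deliberate: a system that averages 150 ms but spikes to 2 s on every tenth command will not be trusted to replace a light switch.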
Why privacy and economics often improve in a hybrid design
A counterintuitive benefit of edge-first reflex layers is that they can be more private and more cost-stable than cloud-streaming approaches—because a large fraction of everyday interactions can be processed locally, and the cloud is invoked only when deeper reasoning is necessary.
On the economics side, cloud inference costs scale with usage, while edge compute is amortized with silicon volume and can reduce the need for continuous cloud processing for trivial requests.
One example of this architectural direction is Kardome, which focuses on combining spatial hearing (to separate and localize voices) with an on-device context-aware SLM (to decide whether speech is directed at the system), escalating to the cloud only when deeper reasoning is needed.

Dr. Alon Slapak is the co-founder and CTO of Kardome, a voice AI startup pioneering Spatial Hearing and Cognition AI technology that enables seamless, natural voice interaction in real-world noisy environments. He holds a Ph.D. from Tel Aviv University and brings deep expertise in acoustics, signal processing, and machine learning. Alon and co-founder and CEO Dr. Dani Cherkassky launched Kardome out of a shared passion for solving end-user frustrations with voice devices, combining their expertise in acoustics and advanced machine learning to build leading-edge voice user interface technology. Kardome has raised $10M in Series A funding.