—·
A practitioner’s rubric to compare Xiaomi’s MiMo miclaw device agent against OpenAI, Claude, and Gemini tool stacks, with eval cases you can run.
On a smartphone, the difference between “an assistant” and “a device controller” shows up the first time you say, “turn the lights on”—and it either does it correctly or gets stuck, wrong, or blocked. On March 6, 2026, Xiaomi began limited testing of Xiaomi miclaw, describing it as a smartphone system-level AI assistant that can perform device and smart-home actions after the user grants permission. (gizmochina.com)
That “action execution” shift matters because success can no longer be measured as a good-sounding response. It becomes an engineering question: did the system take the correct action, under the correct permissions, within an acceptable latency and cost envelope—wired through Xiaomi’s Mi Home smart-device layer? (news.cgtn.com, gizmochina.com)
Xiaomi’s MiMo v2 lineup is being positioned as the model layer behind this behavior. Xiaomi’s MiMo-V2-Flash repo and technical write-up describe a Mixture-of-Experts (MoE) architecture (meaning only a subset of “expert” sub-models are active per token), with 309B total parameters and 15B active, and claims around long context up to 256K. (github.com, arxiv.org) The model also documents “Multi-Token Prediction (MTP),” used to accelerate decoding by predicting multiple future tokens. (github.com, arxiv.org)
But the MiMo miclaw story is still a system story: a fast model, a tool interface to devices, and a permissions/guardrails layer. Evaluation has to cover all three, not just model IQ.
You can compare “agentic tooling” across families without trusting capability slogans by measuring five execution properties—directly aligned to device-control workflows miclaw is being tested for. (You can run these tests in staging with a permission-gated harness.)
Tool-calling reliability asks: when the model decides to use a tool, does it produce a schema-correct call, and does the orchestration layer execute it deterministically?
Tool-calling fails for reasons unrelated to “intelligence.” Tool APIs require specific structured outputs; OpenAI’s documentation notes that Structured Outputs can enforce JSON schema alignment when strict: true is enabled. (help.openai.com) Anthropic’s tool-use docs similarly warn about formatting mismatches such as “tool_use ids… found without tool_result blocks.” (docs.anthropic.com)
For MiMo miclaw specifically, public documentation is limited, but reports describe that miclaw can control supported smart-home devices via Mi Home and system-level functions only after user permission. (gizmochina.com) Your reliability tests should therefore include both “happy path” tool calls and “schema boundary” prompts (ambiguous device names, partially specified times, and multiple possible actions).
If tool calls fail 1–3% of the time, long-horizon device automation degrades sharply—because retries compound latency and permission prompts. Tool-schema correctness should be a gating metric before you test deeper “reasoning.”
Multi-step task decomposition checks whether the agent breaks a request into the right sequence of tool calls and validations (read state → choose action → verify outcome → continue or stop).
Agent stacks differ even when they share similar LLM core capability. In production orchestration, teams often want explicit planning steps with tool-based verification. Anthropic’s docs for tool use describe modes where the model decides whether to call tools and how tool results feed back for continuation. (docs.anthropic.com) OpenAI-style stacks commonly use schema enforcement and tool execution tracing to keep decomposition grounded.
For Chinese device-controlling agents, decomposition also has to translate natural language into device-level semantics—for example, mapping “dim the lamp to cozy warmth” into brightness plus color temperature ranges. Xiaomi’s Mi Home automation behavior is documented in Xiaomi’s own privacy and IoT materials at the platform level (including which permissions are required for device scanning and interactions). (trust.mi.com)
To make decomposition testing comparable across vendors, don’t just verify that multiple steps happened—measure whether verification loops happened for the right reasons. Implement a decomposition score with three measurable components:
Concretely, run a fixed task set where expected tool sequences are deterministic. For each task, log tool calls and state snapshots, then compute:
Latencies aren’t just model speed. Orchestrated agents add overhead for planning, multiple tool invocations, and potentially reranking or self-checks.
MiMo-V2-Flash’s technical report describes decoding acceleration using MTP and reports acceptance-length behavior and decoding speedups (documented as up to 2.6x decoding speedup in the paper’s reported results section). (arxiv.org) Interpret this carefully: end-to-end latency depends on your serving stack and the number of tool calls, not tokens/sec alone.
On the OpenAI side, tool calling patterns often separate model generation from tool execution; the latency budget becomes: (LLM planning time) + (tool execution time) + (second LLM continuation). OpenAI’s function calling documentation highlights structured outputs, which indirectly reduce wasted cycles from malformed calls. (help.openai.com)
Measure latency at the orchestration layer: p50 and p95 for (1) single-tool actions and (2) 3–6 step device sequences. A “fast” model can lose to a slower one if it needs more retries due to tool-call errors.
The most operationally painful failures are permission denials. When a device agent is blocked (user denies, OS denies, API denies), the correct behavior is not “keep trying” forever. It should fail safely with an actionable explanation—and ideally switch to an alternative permitted action.
Claude’s developer docs describe explicit user-input handling patterns: Claude requests user input when it needs permission to use a tool, and it can return a denial result when permission is not granted. (platform.claude.com) Tool-use docs also note errors when the tool sequence is not formatted correctly, which can resemble “failure under denial” in logs. (docs.anthropic.com)
For Xiaomi miclaw, public reporting states that it can control devices and system tools “provided the user allows it,” with sensitive information handled locally via an “edge-cloud privacy computing” description. (gizmochina.com, gizmochina.com) That implies permission-gated design—and it’s exactly the behavior you should hard-test.
A parallel signal comes from the broader OpenClaw craze. Security coverage reports China has issued warnings around OpenClaw adoption risks, including that improper installation and the agent’s autonomous operation with high system permissions can increase potential impact of misuse. (techradar.com) This isn’t miclaw-specific evidence, but it reinforces the general permissions + autonomous actions failure model your test plan must address.
Permission failures have shapes, so your harness should grade them. Build a denial matrix:
Then score three outcomes for each run:
Finally, watch for the “permission retry loop” pattern: repeated tool calls to the same denied endpoint without changing parameters. Log tool-call attempts per denied event and set a hard threshold (e.g., max 1 retry).
Multilingual grounding measures whether the agent correctly maps commands in different languages to the right entities and action parameters.
Evaluation must include localized phrasing variations (synonyms, slang, transliterations) and device naming mismatches. MiMo-V2-Flash claims long context and is positioned for reasoning and coding and “agentic foundation” uses in Xiaomi’s materials. (github.com, arxiv.org) But multilingual command accuracy isn’t guaranteed by context length alone. You need command-level tests such as: “turn the living room lamp to 30% in Japanese,” “set a ‘warm mode’ in English,” and then verify that the agent calls the correct device and uses correct numeric ranges.
For non-specialists, multilingual grounding means the model “attaches” words in a language to real device controls (device IDs, brightness, temperature), rather than describing them vaguely.
Run the same action sets in at least two languages your target users will use, and score on parameter correctness plus device selection.
MiMo-V2-Flash isn’t presented as a chatbot-only model; the MiMo-V2-Flash repo positions it as an efficient foundation for reasoning/coding/agentic use and includes a Multi-Token Prediction module described in the documentation. (github.com) The arXiv technical report discusses speculative decoding via MTP and provides reported decoding speedup figures. (arxiv.org)
Xiaomi miclaw is reported as a system-level app in closed beta with more than 50 capabilities, including reading/writing text messages and files and controlling smart home devices through Mi Home when permitted. (news.cgtn.com, gizmochina.com) Because direct implementation details aren’t fully public, the test plan should focus on black-box behavior: tool call outputs, action results, and permission prompts.
Use these numeric anchors to help justify engineering budget and scope:
Because these are agent workflows, don’t overfit to the numbers. They describe generation efficiency under specific conditions; your orchestration latency includes tool execution and user permission flows.
For miclaw-like device agents, first production risks are tool schema errors, wrong device targeting, and “stuck loops” after denial. Speedups only become meaningful after you have correctness and stop conditions.
You can’t directly compare internal architectures across Xiaomi miclaw and “OpenAI-class,” “Claude-class,” and “Gemini-class” models from public sources alone. But you can compare the tool orchestration contract they expose and the typical operational failure modes.
OpenAI’s function calling documentation emphasizes structured outputs and schema strictness to avoid argument drift. Structured Outputs with strict: true aims to guarantee JSON schema alignment. (help.openai.com) When implemented properly, this reduces reliability failures attributable to malformed arguments, which can be a dominant error source in device automation.
In practice, deliberately use near-miss device identifiers (“living rm lamp” vs “living room lamp”) and check whether strict schema enforcement prevents the agent from calling wrong tool arguments—or instead increases “denied action” outcomes that require clarification.
Anthropic provides tool use implementation guidance and clarifies message sequencing requirements for tool calls and tool results. (docs.anthropic.com) Claude’s Agent SDK documentation also describes how the system requests user input when it needs permission and how denial results should be handled. (platform.claude.com)
Test implication: for denied actions, Claude-like stacks often provide a structured pathway for “ask user question” or “permission result deny.” Measure whether the agent converts “no” into a specific next step—such as “I can’t change system settings without permission, but I can suggest a permitted shortcut.”
Public developer forum threads show practitioners running into tool calling quality/performance issues with Gemini function calling—often requiring debugging by simplifying tool sets and adding arguments incrementally. (discuss.ai.google.dev, discuss.ai.google.dev) These are anecdotal reports, not controlled benchmarks—but they reinforce a practical reality: tool reliability depends heavily on the orchestration wrapper and tool schema design.
In production device agents, the “agent” value shows up in the execution contract. Invest in tool schemas and permission handling more than you invest in chasing model names.
Case patterns reveal consistent failure modes—use them to refine your evaluation.
Entity: Xiaomi miclaw (MiMo-based system-level smartphone agent)
Outcome: limited internal testing/closed beta started after announcement and described as permission-gated for device control and Mi Home integration. (gizmochina.com, news.cgtn.com)
Timeline: announced and limited testing reported around March 6, 2026. (gizmochina.com)
What it teaches: permission gating and local handling claims help, but they’re not sufficient. You still need end-to-end logs of what tools were attempted, what was denied, and what actually changed.
Entity: OpenClaw, the autonomous agent popular in the China market
Outcome: security authorities warned about risks of installing improperly and highlighted how agents operating with high system permissions can increase impact of misuse. (techradar.com)
Timeline: warnings reported mid-March 2026 (e.g., March 13–15 coverage). (techradar.com)
What it teaches: treat denial handling and tool verification as security primitives. “The agent did something” isn’t the same as “the agent did the correct thing safely.”
Entity: OpenClaw in government environments
Outcome: reporting says China warned state enterprises/agencies not to install OpenClaw on office computers and referenced security guidelines and trustworthiness standards. (tomshardware.com)
Timeline: reported around March 13, 2026. (tomshardware.com)
What it teaches: environment context (workplace vs personal phone) changes the risk calculus. Failure-mode tests should be environment-specific.
Entity: Xiaomi and Huawei initiatives around AI agents
Outcome: reporting frames a broader deployment wave and mentions miclaw beginning limited testing, while describing system-level capabilities and user-permission model. (caixinglobal.com)
Timeline: published March 12, 2026. (caixinglobal.com)
What it teaches: when multiple vendors ship system-level agents quickly, tool reliability gaps become visible in the wild. Differentiation comes from evaluation discipline—not integration speed.
In production, expect a capability race to outpace execution verification. Your testing harness is the counterweight.
The right model-agent choice depends on workload horizon.
Chat-with-tools means the agent calls tools mostly to enrich answers (search, database lookups, summarization), while the model’s main job is to respond. In this case, tool reliability mostly affects “answer correctness,” not physical changes.
Pick a stack that supports schema strictness and good tool result parsing. OpenAI’s Structured Outputs guidance is relevant because it reduces malformed tool arguments. (help.openai.com) Claude’s tool-use documentation helps ensure the correct sequencing contract. (docs.anthropic.com)
Optimize for correctness and fast iteration. Your biggest risk is hidden tool-call formatting issues that quietly reduce answer quality.
Autonomous micro-actions are short sequences with limited scope: “turn on desk lamp,” “set a timer,” “add a reminder based on a message.” Here, you need decomposition, verification, and denial handling.
This is where miclaw’s positioning is most relevant: it is described as a system-level agent capable of reading/writing content and controlling smart-home devices when permitted. (gizmochina.com, news.cgtn.com)
Run scenario-based tests with forced permission denial and verify that “no” turns into a safe stop plus a helpful alternative.
Long-horizon automation is the hardest category: multi-day plans, state tracking, chained actions across devices, and occasional re-planning when the world changes.
Here, model speed and long-context support become operationally meaningful—but only if orchestration can keep audit logs and stop conditions. MiMo-V2-Flash’s 256K context claim and decoding acceleration are directly relevant to long planning windows, while the paper describes speculative decoding speedups. (github.com, arxiv.org) Still, the “agent for device control” system must handle tool denial and state mismatch without compounding errors.
Treat tool execution as a workflow with checkpoints. Require “read-back verification” after each state-changing action.
If you’re evaluating miclaw-like device-controlling agents against OpenAI-, Claude-, and Gemini-class tool stacks, don’t argue about model quality—deploy a permission-denied device control harness that scores: tool-calling reliability, multi-step decomposition, latency/cost under orchestration, denied-action failure modes, and multilingual command accuracy.
Require your product team to implement audit-ready tool traces and a least-privilege tool allowlist before expanding from micro-actions to long-horizon automation. Put this under your engineering quality gate, owned by the CTO/Head of Platform Engineering, not by the model provider. The evidence basis is pragmatic: tool-calling contracts and permission gates are documented (OpenAI schema strictness, Anthropic tool sequencing and permission handling), and broader ecosystem experience around autonomous agents highlights how permission and system access amplify risk. (help.openai.com, docs.anthropic.com, platform.claude.com, techradar.com)
By the time you can reliably stop, explain, and audit permission denials, your agent stops being “autonomous”—and starts being trustworthy.
Forecast (next 90 days): By June 2026, expect most teams integrating device-control agents to shift emphasis from “agent prompt quality” to orchestration correctness: stricter tool schemas, better permission-denied stop behavior, and more deterministic verification steps. The reason is operational: system-level agents are already entering limited testing and deployment cycles, while ecosystem warnings around high-permission autonomous tooling are pushing implementers toward safer execution patterns. (gizmochina.com, techradar.com)
Xiaomi miclaw turns MiMo’s reasoning into phone and smart-home execution, but device-controlling agents live or die by permissions, verification, and audit-ready tool reliability.
Xiaomi’s MiMo v2 lineup is pushing Chinese agent systems from chatbot “reasoning” to tool-using behavior by prioritizing throughput, multimodal input, and device control.
Xiaomi’s MiMo-V2 lineup is being packaged for reasoning-to-action workflows. The decisive battleground now is tool orchestration, reliability, and auditable device control.