All Stories
—
·
All Stories
PULSE.

Multilingual editorial — AI-curated intelligence on tech, business & the world.

Topics

  • Space Exploration
  • Artificial Intelligence
  • Health & Nutrition
  • Sustainability
  • Energy Storage
  • Space Technology
  • Sports Technology
  • Interior Design
  • Remote Work
  • Architecture & Design
  • Transportation
  • Ocean Conservation
  • Space & Exploration
  • Digital Mental Health
  • AI in Science
  • Financial Literacy
  • Wearable Technology
  • Creative Arts
  • Esports & Gaming
  • Sustainable Transportation

Browse

  • All Topics

© 2026 Pulse Latellu. All rights reserved.

AI-generated. Made by Latellu

PULSE.

All content is AI-generated and may contain inaccuracies. Please verify independently.

Articles

Trending Topics

Public Policy & Regulation
Cybersecurity
Energy Transition
Infrastructure
AI & Machine Learning
Agentic AI

Browse by Category

Space ExplorationArtificial IntelligenceHealth & NutritionSustainabilityEnergy StorageSpace TechnologySports TechnologyInterior DesignRemote WorkArchitecture & DesignTransportationOcean ConservationSpace & ExplorationDigital Mental HealthAI in ScienceFinancial LiteracyWearable TechnologyCreative ArtsEsports & GamingSustainable Transportation
Bahasa IndonesiaIDEnglishEN日本語JA

All content is AI-generated and may contain inaccuracies. Please verify independently.

All Articles

Browse Topics

Space ExplorationArtificial IntelligenceHealth & NutritionSustainabilityEnergy StorageSpace TechnologySports TechnologyInterior DesignRemote WorkArchitecture & DesignTransportationOcean ConservationSpace & ExplorationDigital Mental HealthAI in ScienceFinancial LiteracyWearable TechnologyCreative ArtsEsports & GamingSustainable Transportation

Language & Settings

Bahasa IndonesiaEnglish日本語
All Stories
AI & Machine Learning—March 20, 2026·15 min read

MiMo miclaw Device Control Benchmark: Tool Reliability, Denied Actions, Multilingual Commands

A practitioner’s rubric to compare Xiaomi’s MiMo miclaw device agent against OpenAI, Claude, and Gemini tool stacks, with eval cases you can run.

Sources

  • gizmochina.com
  • news.cgtn.com
  • github.com
  • arxiv.org
  • help.openai.com
  • docs.anthropic.com
  • platform.claude.com
  • techradar.com
  • caixinglobal.com
All Stories

In This Article

  • MiMi miclaw Device Control Benchmark: Tool Reliability, Denied Actions, Multilingual Commands
  • The miclaw bet: device action, not chat
  • Build an engineering rubric for execution
  • Rubric item 1: tool-calling reliability
  • Rubric item 2: multi-step decomposition
  • Rubric item 3: latency and cost under orchestration
  • Rubric item 4: denied actions and permission failures
  • Rubric item 5: multilingual grounding and command accuracy
  • What to test first in a miclaw-like benchmark
  • Quantitative anchor points you can operationalize
  • Cross-family comparison: tool stacks in practice
  • OpenAI-style stacks: schema strictness helps
  • Claude-style stacks: explicit permission gates
  • Gemini-style stacks: orchestration drives outcomes
  • Four real-world agent-phone lessons
  • Case 1: Xiaomi miclaw limited testing begins
  • Case 2: OpenClaw adoption triggers security warnings
  • Case 3: China’s agent-tool crackdown on government machines
  • Case 4: Xiaomi and Huawei “rush to deploy” system agents
  • Choose what to build by workload horizon
  • Workload A: chat-with-tools
  • Workload B: autonomous micro-actions
  • Workload C: long-horizon device automation
  • Conclusion: build a harness that can’t lie
  • Implementation policy for practitioners
  • Forecast for the next 90 days

MiMi miclaw Device Control Benchmark: Tool Reliability, Denied Actions, Multilingual Commands

On a smartphone, the difference between “an assistant” and “a device controller” shows up the first time you say, “turn the lights on”—and it either does it correctly or gets stuck, wrong, or blocked. On March 6, 2026, Xiaomi began limited testing of Xiaomi miclaw, describing it as a smartphone system-level AI assistant that can perform device and smart-home actions after the user grants permission. (gizmochina.com)

That “action execution” shift matters because success can no longer be measured as a good-sounding response. It becomes an engineering question: did the system take the correct action, under the correct permissions, within an acceptable latency and cost envelope—wired through Xiaomi’s Mi Home smart-device layer? (news.cgtn.com, gizmochina.com)

The miclaw bet: device action, not chat

Xiaomi’s MiMo v2 lineup is being positioned as the model layer behind this behavior. Xiaomi’s MiMo-V2-Flash repo and technical write-up describe a Mixture-of-Experts (MoE) architecture (meaning only a subset of “expert” sub-models are active per token), with 309B total parameters and 15B active, and claims around long context up to 256K. (github.com, arxiv.org) The model also documents “Multi-Token Prediction (MTP),” used to accelerate decoding by predicting multiple future tokens. (github.com, arxiv.org)

But the MiMo miclaw story is still a system story: a fast model, a tool interface to devices, and a permissions/guardrails layer. Evaluation has to cover all three, not just model IQ.

Build an engineering rubric for execution

You can compare “agentic tooling” across families without trusting capability slogans by measuring five execution properties—directly aligned to device-control workflows miclaw is being tested for. (You can run these tests in staging with a permission-gated harness.)

Rubric item 1: tool-calling reliability

Tool-calling reliability asks: when the model decides to use a tool, does it produce a schema-correct call, and does the orchestration layer execute it deterministically?

Tool-calling fails for reasons unrelated to “intelligence.” Tool APIs require specific structured outputs; OpenAI’s documentation notes that Structured Outputs can enforce JSON schema alignment when strict: true is enabled. (help.openai.com) Anthropic’s tool-use docs similarly warn about formatting mismatches such as “tool_use ids… found without tool_result blocks.” (docs.anthropic.com)

For MiMo miclaw specifically, public documentation is limited, but reports describe that miclaw can control supported smart-home devices via Mi Home and system-level functions only after user permission. (gizmochina.com) Your reliability tests should therefore include both “happy path” tool calls and “schema boundary” prompts (ambiguous device names, partially specified times, and multiple possible actions).

If tool calls fail 1–3% of the time, long-horizon device automation degrades sharply—because retries compound latency and permission prompts. Tool-schema correctness should be a gating metric before you test deeper “reasoning.”

Rubric item 2: multi-step decomposition

Multi-step task decomposition checks whether the agent breaks a request into the right sequence of tool calls and validations (read state → choose action → verify outcome → continue or stop).

Agent stacks differ even when they share similar LLM core capability. In production orchestration, teams often want explicit planning steps with tool-based verification. Anthropic’s docs for tool use describe modes where the model decides whether to call tools and how tool results feed back for continuation. (docs.anthropic.com) OpenAI-style stacks commonly use schema enforcement and tool execution tracing to keep decomposition grounded.

For Chinese device-controlling agents, decomposition also has to translate natural language into device-level semantics—for example, mapping “dim the lamp to cozy warmth” into brightness plus color temperature ranges. Xiaomi’s Mi Home automation behavior is documented in Xiaomi’s own privacy and IoT materials at the platform level (including which permissions are required for device scanning and interactions). (trust.mi.com)

To make decomposition testing comparable across vendors, don’t just verify that multiple steps happened—measure whether verification loops happened for the right reasons. Implement a decomposition score with three measurable components:

  1. State-read coverage: Did the agent call a “get device state”/query tool before a state-changing action when the task requires it? (e.g., for “set brightness to 20%,” require a prior read of current brightness if the system expects relative changes).
  2. Stop-condition integrity: After the verification step, did the agent stop when the target was reached—or did it continue “chasing” the state? Penalize extra tool calls that don’t change the end state.
  3. Error recovery shape: When you intentionally inject a mismatch (e.g., set lamp to 50% just before verification), does the agent re-plan only the minimal remaining steps, or restart the whole chain?

Concretely, run a fixed task set where expected tool sequences are deterministic. For each task, log tool calls and state snapshots, then compute:

  • Decomposition success rate: % of runs where the final device state matches the target within tolerance (e.g., ±5% brightness; ±100K color temperature).
  • Verification efficiency: median number of tool calls per successful run, excluding idempotent reads.
  • Chase rate: % of runs where verification was attempted but the agent still issued an additional state-change after confirmation.

Rubric item 3: latency and cost under orchestration

Latencies aren’t just model speed. Orchestrated agents add overhead for planning, multiple tool invocations, and potentially reranking or self-checks.

MiMo-V2-Flash’s technical report describes decoding acceleration using MTP and reports acceptance-length behavior and decoding speedups (documented as up to 2.6x decoding speedup in the paper’s reported results section). (arxiv.org) Interpret this carefully: end-to-end latency depends on your serving stack and the number of tool calls, not tokens/sec alone.

On the OpenAI side, tool calling patterns often separate model generation from tool execution; the latency budget becomes: (LLM planning time) + (tool execution time) + (second LLM continuation). OpenAI’s function calling documentation highlights structured outputs, which indirectly reduce wasted cycles from malformed calls. (help.openai.com)

Measure latency at the orchestration layer: p50 and p95 for (1) single-tool actions and (2) 3–6 step device sequences. A “fast” model can lose to a slower one if it needs more retries due to tool-call errors.

Rubric item 4: denied actions and permission failures

The most operationally painful failures are permission denials. When a device agent is blocked (user denies, OS denies, API denies), the correct behavior is not “keep trying” forever. It should fail safely with an actionable explanation—and ideally switch to an alternative permitted action.

Claude’s developer docs describe explicit user-input handling patterns: Claude requests user input when it needs permission to use a tool, and it can return a denial result when permission is not granted. (platform.claude.com) Tool-use docs also note errors when the tool sequence is not formatted correctly, which can resemble “failure under denial” in logs. (docs.anthropic.com)

For Xiaomi miclaw, public reporting states that it can control devices and system tools “provided the user allows it,” with sensitive information handled locally via an “edge-cloud privacy computing” description. (gizmochina.com, gizmochina.com) That implies permission-gated design—and it’s exactly the behavior you should hard-test.

A parallel signal comes from the broader OpenClaw craze. Security coverage reports China has issued warnings around OpenClaw adoption risks, including that improper installation and the agent’s autonomous operation with high system permissions can increase potential impact of misuse. (techradar.com) This isn’t miclaw-specific evidence, but it reinforces the general permissions + autonomous actions failure model your test plan must address.

Permission failures have shapes, so your harness should grade them. Build a denial matrix:

  • Denial timing: deny at tool request time (before execution) vs deny at tool result time (after execution would have happened).
  • Denial granularity: deny “device control” but allow “read state,” deny “system write” but allow “calendar read,” etc.
  • User intent mismatch: deny the specific requested action (e.g., “turn off system notification sounds”) while allowing a close alternative (“mute notifications,” if permitted).

Then score three outcomes for each run:

  1. Safe-stop correctness: % of runs where no prohibited state change occurs (verified by device state diff).
  2. Denial transparency: % of runs where the user-facing message names the constraint in a non-generic way (“permission denied for system sound settings”), rather than a vague failure.
  3. Fallback quality: % of runs where the agent offers or executes a permitted alternative that still satisfies the user’s intent. Define allowed substitutes per task in your test spec.

Finally, watch for the “permission retry loop” pattern: repeated tool calls to the same denied endpoint without changing parameters. Log tool-call attempts per denied event and set a hard threshold (e.g., max 1 retry).

Rubric item 5: multilingual grounding and command accuracy

Multilingual grounding measures whether the agent correctly maps commands in different languages to the right entities and action parameters.

Evaluation must include localized phrasing variations (synonyms, slang, transliterations) and device naming mismatches. MiMo-V2-Flash claims long context and is positioned for reasoning and coding and “agentic foundation” uses in Xiaomi’s materials. (github.com, arxiv.org) But multilingual command accuracy isn’t guaranteed by context length alone. You need command-level tests such as: “turn the living room lamp to 30% in Japanese,” “set a ‘warm mode’ in English,” and then verify that the agent calls the correct device and uses correct numeric ranges.

For non-specialists, multilingual grounding means the model “attaches” words in a language to real device controls (device IDs, brightness, temperature), rather than describing them vaguely.

Run the same action sets in at least two languages your target users will use, and score on parameter correctness plus device selection.

What to test first in a miclaw-like benchmark

MiMo-V2-Flash isn’t presented as a chatbot-only model; the MiMo-V2-Flash repo positions it as an efficient foundation for reasoning/coding/agentic use and includes a Multi-Token Prediction module described in the documentation. (github.com) The arXiv technical report discusses speculative decoding via MTP and provides reported decoding speedup figures. (arxiv.org)

Xiaomi miclaw is reported as a system-level app in closed beta with more than 50 capabilities, including reading/writing text messages and files and controlling smart home devices through Mi Home when permitted. (news.cgtn.com, gizmochina.com) Because direct implementation details aren’t fully public, the test plan should focus on black-box behavior: tool call outputs, action results, and permission prompts.

Quantitative anchor points you can operationalize

Use these numeric anchors to help justify engineering budget and scope:

  1. 256K context is claimed for MiMo-V2-Flash in the model materials and technical report. (github.com, arxiv.org) In device agents, long context matters when you keep multi-day schedules, automation history, and device state snapshots.
  2. 309B total / 15B active parameters are documented in the MiMo-V2-Flash repo, reflecting sparsity for efficiency. (github.com)
  3. The MiMo-V2-Flash technical report reports up to 2.6x decoding speedup in its experimental results. (arxiv.org) Translate this into fewer LLM tokens consumed per action, and faster end-to-end completion when orchestration is dominated by generation.

Because these are agent workflows, don’t overfit to the numbers. They describe generation efficiency under specific conditions; your orchestration latency includes tool execution and user permission flows.

For miclaw-like device agents, first production risks are tool schema errors, wrong device targeting, and “stuck loops” after denial. Speedups only become meaningful after you have correctness and stop conditions.

Cross-family comparison: tool stacks in practice

You can’t directly compare internal architectures across Xiaomi miclaw and “OpenAI-class,” “Claude-class,” and “Gemini-class” models from public sources alone. But you can compare the tool orchestration contract they expose and the typical operational failure modes.

OpenAI-style stacks: schema strictness helps

OpenAI’s function calling documentation emphasizes structured outputs and schema strictness to avoid argument drift. Structured Outputs with strict: true aims to guarantee JSON schema alignment. (help.openai.com) When implemented properly, this reduces reliability failures attributable to malformed arguments, which can be a dominant error source in device automation.

In practice, deliberately use near-miss device identifiers (“living rm lamp” vs “living room lamp”) and check whether strict schema enforcement prevents the agent from calling wrong tool arguments—or instead increases “denied action” outcomes that require clarification.

Claude-style stacks: explicit permission gates

Anthropic provides tool use implementation guidance and clarifies message sequencing requirements for tool calls and tool results. (docs.anthropic.com) Claude’s Agent SDK documentation also describes how the system requests user input when it needs permission and how denial results should be handled. (platform.claude.com)

Test implication: for denied actions, Claude-like stacks often provide a structured pathway for “ask user question” or “permission result deny.” Measure whether the agent converts “no” into a specific next step—such as “I can’t change system settings without permission, but I can suggest a permitted shortcut.”

Gemini-style stacks: orchestration drives outcomes

Public developer forum threads show practitioners running into tool calling quality/performance issues with Gemini function calling—often requiring debugging by simplifying tool sets and adding arguments incrementally. (discuss.ai.google.dev, discuss.ai.google.dev) These are anecdotal reports, not controlled benchmarks—but they reinforce a practical reality: tool reliability depends heavily on the orchestration wrapper and tool schema design.

In production device agents, the “agent” value shows up in the execution contract. Invest in tool schemas and permission handling more than you invest in chasing model names.

Four real-world agent-phone lessons

Case patterns reveal consistent failure modes—use them to refine your evaluation.

Case 1: Xiaomi miclaw limited testing begins

Entity: Xiaomi miclaw (MiMo-based system-level smartphone agent)
Outcome: limited internal testing/closed beta started after announcement and described as permission-gated for device control and Mi Home integration. (gizmochina.com, news.cgtn.com)
Timeline: announced and limited testing reported around March 6, 2026. (gizmochina.com)

What it teaches: permission gating and local handling claims help, but they’re not sufficient. You still need end-to-end logs of what tools were attempted, what was denied, and what actually changed.

Case 2: OpenClaw adoption triggers security warnings

Entity: OpenClaw, the autonomous agent popular in the China market
Outcome: security authorities warned about risks of installing improperly and highlighted how agents operating with high system permissions can increase impact of misuse. (techradar.com)
Timeline: warnings reported mid-March 2026 (e.g., March 13–15 coverage). (techradar.com)

What it teaches: treat denial handling and tool verification as security primitives. “The agent did something” isn’t the same as “the agent did the correct thing safely.”

Case 3: China’s agent-tool crackdown on government machines

Entity: OpenClaw in government environments
Outcome: reporting says China warned state enterprises/agencies not to install OpenClaw on office computers and referenced security guidelines and trustworthiness standards. (tomshardware.com)
Timeline: reported around March 13, 2026. (tomshardware.com)

What it teaches: environment context (workplace vs personal phone) changes the risk calculus. Failure-mode tests should be environment-specific.

Case 4: Xiaomi and Huawei “rush to deploy” system agents

Entity: Xiaomi and Huawei initiatives around AI agents
Outcome: reporting frames a broader deployment wave and mentions miclaw beginning limited testing, while describing system-level capabilities and user-permission model. (caixinglobal.com)
Timeline: published March 12, 2026. (caixinglobal.com)

What it teaches: when multiple vendors ship system-level agents quickly, tool reliability gaps become visible in the wild. Differentiation comes from evaluation discipline—not integration speed.

In production, expect a capability race to outpace execution verification. Your testing harness is the counterweight.

Choose what to build by workload horizon

The right model-agent choice depends on workload horizon.

Workload A: chat-with-tools

Chat-with-tools means the agent calls tools mostly to enrich answers (search, database lookups, summarization), while the model’s main job is to respond. In this case, tool reliability mostly affects “answer correctness,” not physical changes.

Pick a stack that supports schema strictness and good tool result parsing. OpenAI’s Structured Outputs guidance is relevant because it reduces malformed tool arguments. (help.openai.com) Claude’s tool-use documentation helps ensure the correct sequencing contract. (docs.anthropic.com)

Optimize for correctness and fast iteration. Your biggest risk is hidden tool-call formatting issues that quietly reduce answer quality.

Workload B: autonomous micro-actions

Autonomous micro-actions are short sequences with limited scope: “turn on desk lamp,” “set a timer,” “add a reminder based on a message.” Here, you need decomposition, verification, and denial handling.

This is where miclaw’s positioning is most relevant: it is described as a system-level agent capable of reading/writing content and controlling smart-home devices when permitted. (gizmochina.com, news.cgtn.com)

Run scenario-based tests with forced permission denial and verify that “no” turns into a safe stop plus a helpful alternative.

Workload C: long-horizon device automation

Long-horizon automation is the hardest category: multi-day plans, state tracking, chained actions across devices, and occasional re-planning when the world changes.

Here, model speed and long-context support become operationally meaningful—but only if orchestration can keep audit logs and stop conditions. MiMo-V2-Flash’s 256K context claim and decoding acceleration are directly relevant to long planning windows, while the paper describes speculative decoding speedups. (github.com, arxiv.org) Still, the “agent for device control” system must handle tool denial and state mismatch without compounding errors.

Treat tool execution as a workflow with checkpoints. Require “read-back verification” after each state-changing action.

Conclusion: build a harness that can’t lie

If you’re evaluating miclaw-like device-controlling agents against OpenAI-, Claude-, and Gemini-class tool stacks, don’t argue about model quality—deploy a permission-denied device control harness that scores: tool-calling reliability, multi-step decomposition, latency/cost under orchestration, denied-action failure modes, and multilingual command accuracy.

Implementation policy for practitioners

Require your product team to implement audit-ready tool traces and a least-privilege tool allowlist before expanding from micro-actions to long-horizon automation. Put this under your engineering quality gate, owned by the CTO/Head of Platform Engineering, not by the model provider. The evidence basis is pragmatic: tool-calling contracts and permission gates are documented (OpenAI schema strictness, Anthropic tool sequencing and permission handling), and broader ecosystem experience around autonomous agents highlights how permission and system access amplify risk. (help.openai.com, docs.anthropic.com, platform.claude.com, techradar.com)

By the time you can reliably stop, explain, and audit permission denials, your agent stops being “autonomous”—and starts being trustworthy.

Forecast for the next 90 days

Forecast (next 90 days): By June 2026, expect most teams integrating device-control agents to shift emphasis from “agent prompt quality” to orchestration correctness: stricter tool schemas, better permission-denied stop behavior, and more deterministic verification steps. The reason is operational: system-level agents are already entering limited testing and deployment cycles, while ecosystem warnings around high-permission autonomous tooling are pushing implementers toward safer execution patterns. (gizmochina.com, techradar.com)

Keep Reading

Cybersecurity

From Tokens to Control Loops: Xiaomi miclaw and the Reliability Bottlenecks of Device-Controlling MiMo Agent Models

Xiaomi miclaw turns MiMo’s reasoning into phone and smart-home execution, but device-controlling agents live or die by permissions, verification, and audit-ready tool reliability.

March 20, 2026·15 min read
AI & Machine Learning

Xiaomi’s MiMo v2 Turns Agent Capability into Execution: 256K Context, 150 Tokens/Second, and the Latency Math Behind Tool Use

Xiaomi’s MiMo v2 lineup is pushing Chinese agent systems from chatbot “reasoning” to tool-using behavior by prioritizing throughput, multimodal input, and device control.

March 20, 2026
·
13 min read
AI & Machine Learning

MiMo V2’s Execution Pipeline Is the Real “Agent Boom”: Reasoning to Device Control, Orchestration to Failure Modes

Xiaomi’s MiMo-V2 lineup is being packaged for reasoning-to-action workflows. The decisive battleground now is tool orchestration, reliability, and auditable device control.

March 20, 2026·13 min read