Khenda Resources Radar

Recent humanoid-robotics research from arXiv: manipulation, VLA models, imitation learning, locomotion, and more. https://www.khendarobotics.com/resources 2026-06-19T10:12:35.799Z

MemoryWAM: Efficient World Action Modeling with Persistent Memory arxiv:2606.20562 2026-06-18T17:59:51.000Z 2026-06-18T17:59:51.000Z Sizhe Yang Juncheng Mu Tianming Wei Chenhao Lu Xiaofan Li Linning Xu

Robust robotic manipulation in the real world requires not only an understanding of the current observation, but also memory and dynamics modeling. World action models (WAMs) possess these capabilities by jointly modeling visual foresight and actions conditioned on both current and historical observations, making them a promising paradigm for robotic manipulation. However, existing WAMs face a fundamental trade-off: methods with efficient inference typically condition only on a bounded window of recent observations and therefore struggle in non-Markovian environments, whereas methods that pre…

Generating Robot Hands from Human Demonstrations arxiv:2606.20549 2026-06-18T17:57:21.000Z 2026-06-18T17:57:21.000Z Sha Yi Nicklas Hansen Xueqian Bai Carmelo Sferrazza Michael T. Tolley Xiaolong Wang

Robot learning has advanced rapidly in learning control, but learning the physical body of a robot remains much more difficult because jointly searching over design and control creates a very large combinatorial problem. Here, we present a data-driven framework for generating robot hands from human demonstrations. Instead of learning a complex controller together with each candidate design, we generate robot hand designs using the same simple control policy used after fabrication: matching fingertip positions through inverse kinematics. Using more than 4 million frames of human fingertip moti…

Slow Brain, Fast Planner: Latency-Resilient VLM-Augmented Urban Navigation arxiv:2606.20458 2026-06-18T16:40:07.000Z 2026-06-18T16:40:07.000Z Zhenghao "Mark" Peng Honglin He Quanyi Li Yukai Ma Bolei Zhou

Learning-based planners for sidewalk navigation can generate diverse candidate trajectories in real time, yet their scoring functions often fail to select the best trajectory in challenging situations, outputting trajectories that make the mobile robot drive onto grass, toward pedestrians, or in the wrong direction, even when better candidates exist in the same set. We call this the trajectory scoring gap: in real-world sidewalk navigation, the gap between an anchor-based planner's top choice and the best possible candidate is substantial, likely due to limited high-level scene understanding…

TaCauchy: An Extensible FEM Framework for Vision-Based Tactile Simulation arxiv:2606.20426 2026-06-18T16:08:45.000Z 2026-06-18T16:08:45.000Z Hengfei Zhao Yifan Xie Junhao Gong Yue Sun Kai Zhu Weihua He

Vision-based tactile sensors require high-fidelity simulation for reinforcement learning, yet existing approaches struggle to provide accurate mechanical stress fields within GPU-accelerated robotics platforms. We present TaCauchy, an extensible Finite Element Method (FEM) framework that integrates rigorous physics-based force computation into Isaac Sim. Built on the Unified Incremental Potential Contact (UIPC) solver, TaCauchy directly computes Cauchy stress tensors from hyperelastic constitutive laws and projects them onto contact surfaces to obtain traction forces and pressure distribution…
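
The stress-to-traction step described above can be illustrated with a minimal sketch. It assumes a compressible neo-Hookean material, which is only one possible hyperelastic law and is not confirmed as the one TaCauchy uses. The function names and the parameters `mu` and `lam` are hypothetical. The projection itself is Cauchy's formula, t = σ n.

```python
import numpy as np

def neo_hookean_cauchy_stress(F, mu, lam):
    """Cauchy stress for a compressible neo-Hookean solid (assumed material law).

    F is the 3x3 deformation gradient; mu and lam are Lame-type parameters.
    """
    J = np.linalg.det(F)          # volume ratio
    b = F @ F.T                   # left Cauchy-Green tensor
    I = np.eye(3)
    return (mu / J) * (b - I) + (lam * np.log(J) / J) * I

def traction_and_pressure(sigma, normal):
    """Project a Cauchy stress tensor onto a surface with the given normal."""
    n = normal / np.linalg.norm(normal)
    t = sigma @ n                 # traction vector (Cauchy's formula)
    pressure = -float(t @ n)      # positive when the surface is compressed
    shear = t + pressure * n      # tangential remainder of the traction
    return t, pressure, shear

# Example: 10% uniaxial compression along x, surface normal along x.
F = np.diag([0.9, 1.0, 1.0])
sigma = neo_hookean_cauchy_stress(F, mu=1.0e5, lam=4.0e5)
t, p, shear = traction_and_pressure(sigma, np.array([1.0, 0.0, 0.0]))
```

In this example the stress is diagonal, so the traction is purely normal and the shear part is zero.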

CoLI: A Reproducible Platform for Continuum Robot Learning via Monolithic 3D Printing and Isomorphic Teleoperation arxiv:2606.20389 2026-06-18T15:45:10.000Z 2026-06-18T15:45:10.000Z Ziyuan Tang Chenxi Xiao

Continuum robots offer strong potential for manipulation tasks due to their high degrees of freedom, compliant structures, and operational safety. However, their adoption in both research and practical applications has been hindered by reproducibility issues arising from complex fabrication and assembly processes, challenging kinematic modeling, and a lack of intuitive control interfaces. To address these challenges, we present a novel open-source continuum robot design. The platform features a simplified fabrication pipeline enabled by multi-material 3D printing, allowing the arm to be fabri…

Co-VLA: Coordination-Aware Structured Action Modeling for Dual-Arm Vision-Language-Action Systems arxiv:2606.20285 2026-06-18T14:28:37.000Z 2026-06-18T14:28:37.000Z Yandong Wang Jiaqian Yu Xiongfeng Peng Lu Xu Yamin Mao Weiming Li

Vision-language-action (VLA) models show strong capabilities in single and dual-arm robotic manipulation. Prior works show coordinated bimanual behaviors can emerge from end-to-end learning, leveraging large vision-language backbones with continuous action prediction. However, as bimanual tasks become tightly coupled and execution constraints become critical, implicit coordination alone is insufficient to ensure reliable, interpretable, and stable behavior. In this work, we propose Co-VLA, a coordination-aware bimanual manipulation framework introducing explicit structural priors into VLA mod…

Lagrange: An Open-Vocabulary, Energy-Based Sparse Framework for Generalized End-to-End Driving arxiv:2606.20274 2026-06-18T14:18:01.000Z 2026-06-18T14:18:01.000Z Shihao Ji HongXi Li Zihui Song Mingyu Li

Scaling end-to-end autonomous driving to complex, open-world environments requires perceptual models that generalize to anomalous scenarios and planners that produce kinematically valid trajectories. Existing paradigms face a distinct dichotomy between representational efficiency and generalization capacity. Dense models (e.g., occupancy networks), while geometrically robust, incur critical computational bottlenecks and struggle with high-level semantic reasoning. Conversely, sparse, query-based planners are efficient but reliant on closed-set definitions, rendering them vulnerable to out-of-…

Finetuning Vision-Language-Action Models Requires Fewer Layers Than You Think arxiv:2606.20246 2026-06-18T13:57:12.000Z 2026-06-18T13:57:12.000Z Gia-Binh Nguyen Trong-Bao Ho Thien-Loc Ha Khoa Vo Philip Lund Møller Quang T. Nguyen

Vision-Language-Action (VLA) models pre-trained on massive video-robot datasets have revolutionized robotic manipulation, yet their multi-billion parameter architectures impose prohibitive computational burdens during downstream fine-tuning and real-time inference. In this work, we reveal a highly non-trivial architectural characteristic of these continuous control foundation policies (e.g., pi_0, GR00T-N1.5): despite being trained on diverse physical trajectories, they exhibit severe layer-wise representational redundancy. To exploit this, we introduce a structural compression pipeline that…

Belt-Finger: An Affordable Soft Belt-Driven Gripper for Dexterous In-Hand Manipulation arxiv:2606.20193 2026-06-18T13:07:04.000Z 2026-06-18T13:07:04.000Z Boya Zhang Andreas Zell Georg Martius

Parallel-jaw grippers are the default manipulator choice in robotics because they are simple, robust, and inexpensive. Their limited in-hand mobility, however, often forces large arm motions and restricts dexterous manipulation in confined workspaces. We present a parallel-gripper upgrade: a double-soft-belt-based finger module that preserves standard opening/closing while adding three in-hand degrees of freedom (DoF): translation, pitch, and roll. The mechanism is deliberately kept simple and engineered for inexpensive manufacturing and straightforward integration, preserving the reliability…

Frequency-Aware Flow Matching for Continuous and Consistent Robotic Action Generation arxiv:2606.20135 2026-06-18T11:58:30.000Z 2026-06-18T11:58:30.000Z Jianing Guo Fangzheng Chen Zihao Mao Wong Lik Hang Kenny Zhenhong Wu Yu Li

Flow matching has emerged as a standard paradigm for robotic manipulation owing to its strong expressive power for modelling complex, multimodal action distributions, alongside similar approaches like diffusion policy. However, existing methods rely on discretized action chunks, making them brittle to demonstrations collected at heterogeneous control frequencies and prone to temporally inconsistent actions that degrade control stability. In this paper, we propose Frequency-Aware Flow Matching (FAFM), which outputs continuous, temporally consistent actions. To handle heterogeneous frequency in…

Pose6DAug: Physically Plausible Multi-view Object Swapping for Robot Data Augmentation arxiv:2606.20118 2026-06-18T11:41:25.000Z 2026-06-18T11:41:25.000Z Jonghoon Lee Seong Hyeon Park Byungwoo Jeon Minha Lee Jinwoo Shin

Vision-language-action (VLA) policies have shown strong potential for general-purpose manipulation, yet they often fail on novel, out-of-distribution objects whose appearance or geometry deviates from the training distribution. The standard remedy is to collect multi-view teleoperation data for every failure case, but this scales poorly in both cost and time. We introduce Pose6DAug, a failure-driven data augmentation framework that turns a policy's own successful episodes into targeted demonstrations for its failure modes, without any new data collection. Our key insight is that each successf…

EventVLA: Event-Driven Visual Evidence Memory for Long-Horizon Vision-Language-Action Policies arxiv:2606.20092 2026-06-18T11:11:37.000Z 2026-06-18T11:11:37.000Z Ganlin Yang Zhangzheng Tu Yuqiang Yang Sitong Mao Junyi Dong Tianxing Chen

Memory remains a critical bottleneck for long-horizon robotic manipulation, as standard Vision-Language-Action (VLA) policies often fail when task-relevant cues become occluded or unobservable over time. While existing memory-augmented methods utilize historical context, they either suffer from severe information bottlenecks, incur high latency via decoupled dual systems, or rely on unselective buffers that accumulate massive visual redundancies. To address these limitations, we introduce EventVLA, an end-to-end framework founded on the concept of sparse visual evidence memory that comprises…

VFILC: Accurate Frequency Extrapolations in Imitation Learning via Sampling Frequency ILC arxiv:2606.20056 2026-06-18T10:28:28.000Z 2026-06-18T10:28:28.000Z Nozomu Masuya Toshiaki Tsuji Sho Sakaino

Conventional neural network (NN)-based imitation learning methods for variable-speed motion either restricted their scope to interpolated speeds, or generated unpredictable motions when extrapolating beyond trained velocity ranges. Variable-frequency imitation learning (VFIL) enabled extrapolations of speeds by linking the NN model's sampling frequency to the motion frequency, whereas its open-loop configuration caused frequency errors, especially in the extrapolated high-frequency settings. This study proposes variable-frequency imitation learning with iterative learning control (VFILC) base…

MirrorDuo: Reflection-Consistent Visuomotor Learning from Mirrored Demonstration Pairs arxiv:2606.20048 2026-06-18T10:23:16.000Z 2026-06-18T10:23:16.000Z Zheyu Zhuang Ruiyu Wang Giovanni Luca Marchetti Florian T. Pokorny Danica Kragic

Image-based behaviour cloning leverages demonstrations captured from ubiquitous RGB cameras. However, it remains constrained by the cost of collecting diverse demos, especially for generalizing across workspace variations. We propose MirrorDuo, a reflection-based formulation that operates on image, proprioception, and full 6-DoF end-effector action tuples, generating a mirrored counterpart for each original demonstration, effectively achieving "collect one, get one for free". It can be applied as a data augmentation strategy for existing learning pipelines, such as standard behaviour cloning…
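
As a rough illustration of what mirroring an image and a 6-DoF pose involves, the sketch below reflects across the plane x = 0. The mirror plane, the function names, and the assumption that a horizontal image flip matches that plane (which requires the camera's optical axis to lie in it) are all hypothetical and are not taken from MirrorDuo.

```python
import numpy as np

# Reflection across the plane x = 0 (assumed mirror plane).
M = np.diag([-1.0, 1.0, 1.0])

def mirror_image(img):
    """Flip an (H, W) or (H, W, C) image along its width axis."""
    return img[:, ::-1].copy()

def mirror_pose(position, rotation):
    """Reflect a position vector and a 3x3 rotation matrix.

    Conjugating by M keeps the result a proper rotation (det = +1),
    whereas M @ rotation alone would have det = -1.
    """
    return M @ position, M @ rotation @ M

# Example: a 30-degree rotation about z becomes a -30-degree rotation.
theta = np.deg2rad(30.0)
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0,            0.0,           1.0]])
p_m, R_m = mirror_pose(np.array([1.0, 2.0, 3.0]), Rz)
```

Applying the same reflection to every image, proprioceptive state, and action in a demonstration gives a second, mirrored demonstration.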

Tri-Info: Generalizable, Interpretable Failure Prediction for VLA Models via Information Theory arxiv:2606.19998 2026-06-18T09:34:22.000Z 2026-06-18T09:34:22.000Z Jinghan Yang Yunchao Zhang Wang Yuan Haolun Wan Jiaming Zhang Zhengyang Hu

Vision-Language-Action (VLA) models are increasingly deployed across diverse tasks, yet they remain black boxes whose physical interactions can cause irreversible harm, making generalizable and interpretable failure detection essential. We observe that successful and failed rollouts carry systematically different information-theoretic signatures. Building on this, we formalize VLA control as a closed-loop information pipeline and derive the Triple Information-theoretic (Tri-Info) signals that capture whether actions remain diverse, temporally consistent, and coupled to state transitions. Acro…

Evaluation of Augmented Reality-based Intuitive Interface for Robot-Assisted Transesophageal Echocardiography: A User Study arxiv:2606.19971 2026-06-18T09:10:59.000Z 2026-06-18T09:10:59.000Z Xiu Zhang Matteo Di Mauro Sofia Breschi Angela Peloso Emiliano Votta Arianna Menciassi

TransEsophageal Echocardiography (TEE) is essential for diagnosing and guiding Structural Heart Disease (SHD) interventions. However, manual TEE manipulation demands significant operator expertise, is physically demanding, and exposes clinicians to radiation when performed alongside fluoroscopy. Robotic-assisted TEE systems have been introduced to improve probe handling and reduce operator fatigue, yet the design of intuitive and effective user interfaces remains an open challenge. This study presents and evaluates a model-enhanced, Augmented Reality (AR)-based intuitive interface for robot-a…

PhysDrift: Bridging the Embodiment Gap in Humanoid Co-Speech Motion Generation arxiv:2606.19935 2026-06-18T08:31:58.000Z 2026-06-18T08:31:58.000Z Zhangzhao Liang Xiaofen Xing Mingyue Yang Wenlve Zhou Xiangmin Xu

Humanoid robots require co-speech motions that are not only expressive and speech-aligned, but also physically executable under embodiment constraints. Existing co-speech generation pipelines are predominantly human-centric: motions are first generated in human-body representations such as SMPL-X and subsequently retargeted to humanoid robots. In this work, we identify a fundamental embodiment gap in this paradigm, where the mismatch between human motion manifolds and humanoid embodiment constraints disrupts embodiment consistency during motion transfer and physical execution. Through extensi…

SWAP: Symmetric Equivariant World-Model for Agile Robot Parkour arxiv:2606.19928 2026-06-18T08:28:30.000Z 2026-06-18T08:28:30.000Z Kaixin Lan Ze Wang Hongyi Li Lei Jiang Chaojie Fu Chengkai Su

While latent world models enable the proactive predictions required for extreme parkour, their purely data-driven nature forces them to redundantly encode left-right symmetric interactions as independent patterns. This inflates the learning burden and hinders the capture of geometric regularities, restricting the latent space's efficiency for downstream policies. To address this, we propose SWAP, an end-to-end equivariant symmetric world model. This framework embeds symmetry directly into both the world model and the actor-critic networks. In real-world tests, the robot leaps across a 2.13 m…

Co-policy: Responsive Human-Robot Co-Creation for Musical Performances arxiv:2606.19914 2026-06-18T08:08:16.000Z 2026-06-18T08:08:16.000Z Xuetao Li Wenke Huang Mang Ye Zijian Liu Jinhua Xie Jifeng Xuan

Art has long stood as a pivotal expression of human creativity. Embodied artificial intelligence offers a route for generative models to participate in that creativity through physical action rather than disembodied digital content. In robotic music co-creation, it is challenging to connect semantic musical understanding with real-time and physically executable performance. We present Co-policy, a framework for human-robot musical co-creation that separates semantic intent grounding, constrained musical variation, and visuomotor execution. To ground musical semantics, Co-policy uses pre-infer…

One-to-Two Acting: A Novel Framework for Single-arm Agent Action Expansion to Dual Arms arxiv:2606.19897 2026-06-18T07:53:42.000Z 2026-06-18T07:53:42.000Z Youbin Yao Nieqin Cao Mingyan Li Yan Ding Fuqiang Gu Chao Chen

Dual-arm manipulation can improve throughput via parallel execution, but collecting bimanual demonstrations for training is costly and difficult. We present ExS2D, a hierarchical action expansion framework that enables dual-arm manipulation from single-arm supervision. ExS2D first generates structured subtasks from textual instructions while explicitly capturing temporal precedence. It then grounds each subtask into executable actions through subtask-guided action mapping in observation. Finally, precedence-aware action allocation and synchronized planning are performed by a multimodal large…

EquiVLA: A General Framework for Rotationally Equivariant Vision-Language-Action Models arxiv:2606.19784 2026-06-18T04:36:57.000Z 2026-06-18T04:36:57.000Z Thien-Loc Ha Quang-Tan Nguyen Trong-Bao Ho Long Dinh Minh Duc Nguyen Gia-Binh Nguyen

Vision-Language-Action (VLA) models have emerged as a powerful paradigm for generalist robot manipulation, yet they lack geometric inductive biases: policies trained at specific orientations require substantially more data to generalize across rotational configurations. We present EquiVLA, the first general framework for end-to-end SO(2)-equivariant VLA models, applicable to any architecture coupling a frozen vision-language backbone with a flow-matching Diffusion Transformer action head. EquiVLA introduces EquiPerceptor, which produces approximately $\ma…

Start Right, Arrive Right: Asynchronous Execution via Initial Noise Selection arxiv:2606.19774 2026-06-18T04:14:11.000Z 2026-06-18T04:14:11.000Z Trong-Bao Ho Quang-Tan Nguyen Thien-Loc Ha Gia-Binh Nguyen Viet-Thanh Nguyen Long Dinh

Action chunking enables robot policies to produce temporally coherent behavior, but generating multi-step action sequences with flow-based policies incurs latency that is incompatible with real-time control. Under asynchronous execution, the robot continues executing the current chunk while the next one is generated, causing even minor delays to create inconsistencies at chunk boundaries. Existing methods address this problem by steering generation toward the already executed action prefix. We instead show that prefix consistency can be achieved by selecting an appropriate initial noise befor…

Data Standards for Humanoid Robotics: The Missing Infrastructure for Physical AI arxiv:2606.19769 2026-06-18T04:10:16.000Z 2026-06-18T04:10:16.000Z Shaoshan Liu Xiugong Qin Xuan Wu Xuan Xia Ning Ding Jialu Liu

The scalability of humanoid robots will depend not only on models and hardware, but also on whether physical experience can accumulate across robots, tasks, organizations, and time. Drawing on the authors' work in developing ISO/WD 26264-1, Humanoid robot datasets -- Part 1: General requirements, within ISO/TC 299/WG 16, this article argues that data standards are becoming foundational infrastructure for Physical AI. We develop three insights. First, humanoid robot data is embodied interaction data, not a collection of isolated digital samples; a useful dataset must preserve the relationship…

Temporal Self-Imitation Learning arxiv:2606.19752 2026-06-18T03:31:56.000Z 2026-06-18T03:31:56.000Z Yinsen Jia Boyuan Chen

Long-horizon robot manipulation policies trained with reward shaping can still exploit dense rewards through inefficient interaction, while rare efficient behaviors may be forgotten during training. We argue that temporal efficiency itself provides a powerful and underutilized source of self-supervision for reinforcement learning. We introduce Temporal Self-Imitation Learning (TSIL), a reinforcement learning framework that mines temporally efficient successful trajectories generated during learning and converts them into reusable supervision for future policy improvement. TSIL progressively r…

Bidirectional Tutoring for Developmental Motor Learning in Robots: Co-Developed Interaction Dynamics Support Stable Learning arxiv:2606.19728 2026-06-18T02:51:29.000Z 2026-06-18T02:51:29.000Z Rui Fukushima Jun Tani

Infants are well known to develop their motor skills through dense interaction with caregivers. Although such social interaction is crucial for human development, motor-skill learning in robots is often treated as a unidirectional process in which robots passively receive demonstrations from tutors. This overlooks a key property of social interaction: it is inherently bidirectional, with tutor and learner dynamically adapting to each other. In such interactions, the robot's past experiences may function as prior constraints that shape the dynamics of their co-developed trajectories. We hypoth…

DF-ExpEnse: Diffusion Filtered Exploration for Sample Efficient Finetuning arxiv:2606.19656 2026-06-17T23:40:45.000Z 2026-06-17T23:40:45.000Z Calvin Luo Chen Sun Shuran Song

A natural recipe for intelligent robotic decision-making is initializing from pretrained generative control policies, which have summarized offline experience, and adapting them to self-collected online experience. We present DF-ExpEnse, an exploration technique that improves the quality of online experience collection, thus increasing finetuning sample-efficiency. DF-ExpEnse leverages the multimodal modeling capabilities of the generative control policy to create an expressive and tractably evaluatable candidate set. It then utilizes an ensemble of critics to identify the action that best ba…

CTS-MoE: Implicit Terrain Adaptation via Mixture-of-Experts for Perceptive Locomotion arxiv:2606.19633 2026-06-17T22:25:49.000Z 2026-06-17T22:25:49.000Z Francisco Affonso Matheus P. Angarola Ana Luiza Mineiro Aditya Potnis Marcelo Becker Girish Chowdhary

Perceptive legged locomotion over discontinuous terrain (e.g., stairs, gaps, and obstacles) requires adaptive behavior, as a single conservative gait cannot produce the anticipatory maneuvers needed for abrupt topology changes. Cast as multi-task reinforcement learning, this problem introduces a tension between sharing and separation. Tasks use a common locomotion base but have conflicting rewards, so a policy must share behavior while avoiding value interference. Prior work addresses only one side, with monolithic policies sacrificing specialization and hierarchical sub-policies sacrificing…

Fail-RAG: A Retrieval Augmented Generation Informed Framework for Robot Failure Identification arxiv:2606.19598 2026-06-17T21:02:47.000Z 2026-06-17T21:02:47.000Z Ameya Salvi Jie Hu

Industry automation is witnessing an evolution in robotics driven by both technological breakthroughs and societal changes: progress towards generalist robots, embodied and physical artificial intelligence (AI), and an increasing labor shortage in manufacturing. An intelligent autonomous robot needs not only to act according to planned motions but also to react to unexpected events. In this study, we focus on such unexpected events in warehouses where robots are used for material handling. Specifically, we refer to any unexpected events as failures and develop methods to detect robot operations…

One Demo is Worth a Thousand Trajectories: Action-View Augmentation for Visuomotor Policies arxiv:2606.19586 2026-06-17T20:41:13.000Z 2026-06-17T20:41:13.000Z Chuer Pan Litian Liang Dominik Bauer Eric Cousineau Benjamin Burchfiel Siyuan Feng

Visuomotor policies for manipulation have demonstrated remarkable potential in modeling complex robotic behaviors, yet minor alterations in the robot's initial configuration and unseen obstacles easily lead to out-of-distribution observations. Without extensive data collection effort, these result in catastrophic execution failures. In this work, we introduce an effective data augmentation framework that generates visually realistic fisheye image sequences and corresponding physically feasible action trajectories from real-world eye-in-hand demonstrations, captured with a portable parallel gr…

SCAN-Planner: Spatial Collision-Aware Local Planning for Route-Guided Long-Range Quadruped Navigation arxiv:2606.19555 2026-06-17T19:55:09.000Z 2026-06-17T19:55:09.000Z Han Zheng Zhe Chen Yiwen Fu Ming Yang Tong Qin

Quadruped robots are increasingly expected to navigate through narrow passages, cluttered indoor scenes, and large-scale 3D unstructured environments. Existing local planners commonly approximate the robot using isotropic geometric inflation or rely on planar and elevation-map representations, leading to conservative motion in tight spaces and limited reasoning about overhanging structures. This letter presents SCAN-Planner, a spatial collision-aware local planning framework for long-range quadruped navigation. A yaw-aware twin-cylinder footprint is used to model the elongated robot body, ena…

ImageWAM: Do World Action Models Really Need Video Generation, or Just Image Editing? arxiv:2606.19531 2026-06-17T19:25:28.000Z 2026-06-17T19:25:28.000Z Yuyang Zhang Wenyao Zhang Zekun Qi He Zhang Haitao Lin Jingbo Zhang

World Action Models (WAMs) commonly rely on video generation to bridge visual world modeling and robot control. However, video-based WAMs face three coupled limitations: dense multi-frame future tokens make inference costly, full video prediction spends capacity on action-irrelevant temporal and appearance details, and long-horizon future imagination may introduce errors that mislead action prediction. These issues raise a simple question: do world action models really need video generation? We propose ImageWAM, a simple WAM framework that repurposes pretrained image editing models for robot…

Proprioceptive Invariant State Estimation for Humanoid Robots on Non-Inertial Ground arxiv:2606.19512 2026-06-17T18:53:48.000Z 2026-06-17T18:53:48.000Z Falak Mandali Zijian He Yan Gu

This paper presents an invariant extended Kalman filtering (InEKF) approach for real-time state estimation of humanoid robots operating on non-inertial ground using only onboard proprioceptive sensing. The proposed approach estimates the robot's base position and velocity relative to the moving ground frame without requiring direct measurements of ground motion or externally mounted sensors. By exploiting kinematic constraints at the stance foot through foot-mounted IMUs, the filter accounts for ground-induced nonlinearities in the process and measurement models while remaining fully proprioc…

Simulating Robotic Locomotion in Sand: Resistive Force Theory in an Open-Source Physics Engine arxiv:2606.19504 2026-06-17T18:41:52.000Z 2026-06-17T18:41:52.000Z Ryan Walker Brown Laura K. Treers Kathryn A. Daltorio

Recent advancements in Resistive Force Theory (RFT) enable approximation of ground reaction forces for locomotion in sand without the computational expense of modeling interactions with individual grains. However, these tools have been absent from the 3D physics engines commonly used for robot simulation. We explore whether resistive force approximations are sufficient, when integrated with standard dynamics calculations, to provide a stable substrate for a freely walking robot. To determine this, we implement 3D Granular Resistive Force Theory (3D RFT) in the MuJoCo physics simulation engine. We verify…

3D-DLP: Self-Supervised 3D Object-Centric Scene Representation Learning arxiv:2606.19451 2026-06-17T18:00:08.000Z 2026-06-17T18:00:08.000Z Ellina Zhang Madhaven Iyengar Amir Zadeh Chuan Li Deepak Pathak David Held

We introduce 3D-DLP, a self-supervised object-centric representation learning model that decomposes scene-level RGB-D or voxel observations into a set of 3D latent particles. Building on the Deep Latent Particles (DLP) framework, each particle encodes disentangled attributes, including 3D keypoint position, bounding box dimensions, and appearance features, and represents a distinct entity in the scene. The model learns interpretable per-particle segmentation maps through an end-to-end self-supervised reconstruction objective. We demonstrate on both simulated and real-world datasets that the l…

Zero-Shot Long-Horizon Dexterous Manipulation via Multi-View 3D-Grounded VLM Reasoning arxiv:2606.19340 2026-06-17T17:59:56.000Z 2026-06-17T17:59:56.000Z Jisoo Kim Sangwon Baik Taeksoo Kim Sungjoo Kim Junyoung Lee Mingi Choi

We present a zero-shot framework for long-horizon dexterous manipulation that grounds language instructions into executable 3D task plans from calibrated multi-view RGB images. Rather than training an end-to-end policy, our system uses a vision-language model (VLM) to produce reference-frame task grounding and primitive-level 2D keypoints, then lifts them into 3D via multi-view fusion. This lifting combines triangulation of view-wise VLM groundings with reference-view ray voting, which searches along a semantic camera ray for geometrically consistent candidates across neighboring views. The r…
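
The triangulation part of the 2D-to-3D lifting can be sketched with the standard linear (DLT) method. This is a generic textbook formulation, not the paper's implementation, and it does not cover the ray-voting step. The helper names and the toy camera matrices are hypothetical.

```python
import numpy as np

def project(P, X):
    """Project a 3D point X with a 3x4 camera matrix P to pixel coordinates."""
    x = P @ np.append(X, 1.0)
    return x[0] / x[2], x[1] / x[2]

def triangulate(projections, pixels):
    """Linear (DLT) triangulation of one keypoint seen in two or more views.

    projections: list of 3x4 camera matrices from calibration.
    pixels: list of (u, v) keypoint locations, one per view.
    """
    rows = []
    for P, (u, v) in zip(projections, pixels):
        rows.append(u * P[2] - P[0])   # each view contributes two
        rows.append(v * P[2] - P[1])   # linear constraints on X
    A = np.stack(rows)
    _, _, vt = np.linalg.svd(A)
    X_h = vt[-1]                       # least-squares null vector of A
    return X_h[:3] / X_h[3]

# Toy setup: two identity-intrinsics cameras one unit apart along x.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_true = np.array([0.2, -0.1, 4.0])
X_est = triangulate([P1, P2], [project(P1, X_true), project(P2, X_true)])
```

With noisy keypoints from a VLM, the recovered point is a least-squares compromise across views, so adding views generally makes the estimate more stable.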

Do as I Do: Dexterous Manipulation Data from Everyday Human Videos arxiv:2606.19333 2026-06-17T17:57:34.000Z 2026-06-17T17:57:34.000Z Bhawna Paliwal Haritheja Etukuru William Liang Pieter Abbeel Nur Muhammad Mahi Shafiullah Jitendra Malik

How can we scalably generate data for robotic manipulation, especially on human-like platforms such as dexterous multi-fingered hands? Learning from human videos has recently emerged as a likely answer to this question. However, difficulties in estimating hand-object interaction and crossing the human-to-robot embodiment gap have hindered the adoption of abundant monocular RGB-only human videos as the primary source of robot manipulation data. In this work, we present DO AS I DO, an algorithm to reconstruct and retarget monocular RGB human videos to multi-fingered dexterous robotic hands. DO…

Modeling Branches for Active Manipulation using Iterative Parameter Estimation arxiv:2606.19314 2026-06-17T17:37:06.000Z 2026-06-17T17:37:06.000Z Madhav Rijal Rashik Shrestha Trevor Smith Yu Gu

This study presents a method for modeling diverse plant branches by iteratively estimating material parameters to support delicate branch manipulation. Branch manipulation is necessary in agricultural robotics for plant repositioning, stabilizing, and clearing visual obstructions in dense foliage. The proposed method builds a tetrahedral branch model from point-cloud data and simulates its behavior using the finite element method. Using real observed deformation data, it iteratively estimates branch parameters and then computes an optimal path with a deformation-aware motion planner to move a…

Does VLA Even Know the Basics? Measuring Commonsense and World Knowledge Retention in Vision-Language-Action Models arxiv:2606.19297 2026-06-17T17:20:46.000Z 2026-06-17T17:20:46.000Z Nikita Kachaev Andrey Moskalenko Matvey Skripkin Nikita Kurlaev Daria Pugacheva Albina Burlova

Embodied Vision-Language-Action (VLA) models are typically obtained by fine-tuning powerful pretrained VLMs on robotics data, yet it is unclear how much commonsense and factual knowledge they retain after adaptation. Failures on knowledge-sensitive tasks are ambiguous, conflating missing knowledge with poor generalization of low-level control. We introduce Act2Answer, a lightweight protocol that adapts VLM knowledge benchmarks to VLA evaluation by requiring agents to answer through action. Each question becomes a short tabletop episode where the agent performs a single object-placement action…

Shape Sensing of Continuum Robots using Direct Laser Writing arxiv:2606.19265 2026-06-17T16:41:09.000Z 2026-06-17T16:41:09.000Z Amber K. Rothe Nidhi Malhotra Jaydev P. Desai

Continuum robots offer a promising approach for minimally invasive and natural-orifice surgical procedures due to their inherent compliance and dexterity. However, this flexibility also makes estimating the current shape of the robot challenging. Several approaches have been used to reconstruct the shape of these robots, including imaging, optical sensing, magnetic sensing, and resistive sensing. Strain sensors fabricated using direct laser writing (DLW) could provide an alternative sensing method. This technique involves using a laser to induce carbonization of certain polymers to create gra…

Seeing Through Occlusion: Deterministic Arm Kinematic Correction for Robot Teleoperation arxiv:2606.19240 2026-06-17T16:20:10.000Z 2026-06-17T16:20:10.000Z Thomas M. Kwok Nicholas Koenig Yue Hu

Markerless, single-RGB-D-camera motion capture provides a low-cost and non-invasive alternative to conventional marker-based systems for robot teleoperation; however, depth estimation often degrades in the presence of self-occlusion, particularly during upper-limb motion. This paper presents an Arm Kinematic Correction (AKC) method that improves depth estimation by enforcing geometric constraints based on constant arm lengths. The proposed approach reconstructs occluded joint depths by leveraging wrist positions and predefined arm lengths via a deterministic formulation based on the Pythagore…
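
The constant-arm-length constraint mentioned in the AKC abstract above can be illustrated with a minimal sketch. This is not the paper's formulation, which is truncated here. It assumes the lateral (x, y) position of the occluded joint is already known in metric camera coordinates and that the sign of the depth offset is supplied by the caller. The function name `reconstruct_depth` and its parameters are hypothetical.

```python
import math


def reconstruct_depth(wrist_xyz, joint_xy, arm_length, toward_camera=False):
    """Estimate the depth (z) of an occluded joint from a visible wrist.

    Assumption (illustrative only): the segment between wrist and joint has a
    fixed, known length, so the depth offset follows from the Pythagorean
    theorem once the lateral offset is known.
    """
    wx, wy, wz = wrist_xyz
    jx, jy = joint_xy
    lateral_sq = (jx - wx) ** 2 + (jy - wy) ** 2
    # Clamp at zero: measurement noise can make the lateral offset exceed
    # the arm length, which would otherwise give a negative radicand.
    depth_offset = math.sqrt(max(arm_length ** 2 - lateral_sq, 0.0))
    return wz - depth_offset if toward_camera else wz + depth_offset


if __name__ == "__main__":
    # Wrist 1.0 m from the camera, joint 0.3 m to the side, 0.5 m segment.
    print(reconstruct_depth((0.0, 0.0, 1.0), (0.3, 0.0), 0.5))
```

The square root has two solutions (joint nearer to or farther from the camera than the wrist). The sketch leaves that choice to the `toward_camera` flag rather than guessing how the method resolves it.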

Mobile Pedipulation for Object Sliding via Hierarchical Control on a Wheeled Bipedal Robot arxiv:2606.19233 2026-06-17T16:09:02.000Z 2026-06-17T16:09:02.000Z Yue Qin Yulun Zhuang Zelin Shen Yanran Ding

In this letter, we present a hierarchical control framework that enables wheeled bipedal robots to perform planar object sliding tasks with their wheeled legs. The proposed approach formulates a nonlinear model predictive controller (NMPC) based on a reduced-order three rigid bodies (TRB) dynamical model that explicitly accounts for the hip roll degree of freedom and multiple wheel-environment contact modes, which is essential for lateral stepping and pedipulation tasks. Within this framework, the NMPC simultaneously regulates robot locomotion and interaction forces, allowing the robot to sta…

Invertible Neural Network Adapter for One-Step Flow Matching in Robot Manipulation arxiv:2606.19194 2026-06-17T15:35:27.000Z 2026-06-17T15:35:27.000Z Yu Zhang Kangyi Ji Yongxiang Zou Rongtao Xu Feng Zheng Long Cheng

This paper presents an invertible neural network adapter for general robotic manipulation, designed to generate precise high-dimensional actions conditioned on multimodal observations, including visual, linguistic, and proprioceptive inputs, through a one-step denoising process. Built upon a flow-matching formulation, the proposed adapter effectively constrains the action generation trajectory within an invertible latent space, thereby enabling efficient and high-quality dexterous action synthesis with only a single inference step. Compared with conventional iterative flow-matching policies,…

Learning to Annotate Delayed and False AEB Events: A Practical System for Extreme Class Imbalance and Asymmetric Label Noise arxiv:2606.19186 2026-06-17T15:27:15.000Z 2026-06-18T01:51:31.000Z Mengxiang Hao Xin Jiang Xinghao Huang Wenliang Su Zhiteng Wang Junjie Rao

Autonomous Emergency Braking (AEB) optimization relies on accurately annotated real-world trigger events, particularly rare but critical delayed and false AEB triggers that expose system deficiencies. However, these minority samples comprise less than 5% of thousands of daily triggers, making manual annotation prohibitively expensive at scale. We present the first automated AEB annotation framework to address this problem. During development, we identified two fundamental challenges that severely impair delayed/false trigger annotation accuracy: (1) Extreme class imbalance where delayed/false…

HT-Bench: Benchmarking and Learning Dexterous Full-Hand Tactile Representations with Egocentric Vision arxiv:2606.19161 2026-06-17T15:01:30.000Z 2026-06-17T15:01:30.000Z Yuzhe Huang Jiaping Wu Jiaming Jiang Hezhe Lin Aikebaier Aierken Yunlong Wang

Establishing a universal benchmark for tactile representation learning in robotic manipulation remains challenging due to the diversity of tactile sensor designs, data formats, and robot embodiments. Rather than seeking to establish such a benchmark, we explore a scalable and promising direction for future development: egocentric vision paired with full-hand tactile data. To this end, we introduce HT-Bench, a large-scale multi-task benchmark for dexterous full-hand tactile sensing, comprising 10M RGB frames and 7.8M tactile frames collected across 226 tasks. HT-Bench evaluates tactile representa…

ReSiReg: Towards Spatially Consistent Semantics in Language-Conditioned Robotic Tasks arxiv:2606.19088 2026-06-17T13:58:06.000Z 2026-06-17T13:58:06.000Z Simon Schwaiger David Seyser Alessandro Scherl Wilfried Wöber Gerald Steinbauer-Wagner

Vision-Language Models (VLMs) enable robots to follow open-language instructions. However, dense VLM embeddings have been shown to be noisy and to lack spatial consistency. This is problematic for robotic applications, which require simultaneous reasoning over semantics and 3D space. We examine spatial structure across recent VLMs and propose ReSiReg, a feature reconstruction method that uses spatially consistent VLM intermediates to improve dense language-grounded retrieval. ReSiReg clusters intermediates into visual prototypes, derives their language descriptors, and reconstructs each patch as a so…

Sensor Configuration Matters: A Systematic Evaluation of Multimodal SLAM on Quadruped Robots arxiv:2606.19067 2026-06-17T13:41:07.000Z 2026-06-17T13:41:07.000Z Roberto Corlito Fabian Schmidt Nils Seibert Markus Enzweiler Abhinav Valada Arne Roennau

Autonomous navigation of quadrupedal robots in diverse environments fundamentally relies on resilient Simultaneous Localization and Mapping (SLAM). While visual-inertial SLAM has matured across wheeled, handheld, and aerial platforms, a critical evaluation gap remains regarding how hardware-level sensor configurations affect performance under the aggressive dynamics of legged locomotion. Quadrupeds introduce distinct embodiment-induced sensory challenges, including foot-impact shocks, high-frequency mechanical vibrations, and rapid angular rotations, which degrade standard perception pipeline…

Mem-World: Memory-Augmented Action-Conditioned World Models for Persistent Robot Manipulation arxiv:2606.18960 2026-06-17T11:42:00.000Z 2026-06-18T07:33:11.000Z Zirui Zheng Jiaqian Yu Xiongfeng Peng Jun Shi Mingyi Li Chao Zhang

Action-conditioned world models have emerged as a promising paradigm for robot learning, offering a scalable alternative to costly real-world experimentation by generating action-consistent video rollouts. However, persistent world modeling remains challenging in manipulation: frequent end-effector occlusions and rapid wrist-camera motion make the current observation insufficient for predicting future views, causing models to forget or hallucinate scene details seen in earlier frames. Existing memory retrieval strategies often fail to identify informative history in dynamic manipulation scena…

TactSpace: Learning a Physics-enriched Shared Latent Space for Tactile Sim-to-Real Transfer arxiv:2606.18959 2026-06-17T11:41:27.000Z 2026-06-17T11:41:27.000Z Arunim Joarder Arjun Bhardwaj René Zurbrügg Mayank Mittal Florin Püntener Sira Bielefeldt

Tactile sensing provides direct measurements of contact interactions that are essential for robotic manipulation. However, current simulators lack the fidelity to faithfully model the complex deformation and transduction mechanics of tactile sensors, severely hindering sim-to-real transfer in robot learning pipelines. To address this challenge, we propose a multi-modal representation learning framework that aligns heterogeneous tactile modalities within a shared latent space, eliminating the need for accurate raw-signal simulation while preserving relevant contact information. Our approach em…

Motion-Focused Latent Action Enables Cross-Embodiment VLA Training from Human EgoVideos arxiv:2606.18955 2026-06-17T11:37:59.000Z 2026-06-17T11:37:59.000Z Runze Xu Yiluo Zhang Jian Wang Yu Wang Jincheng Yu

Training generalist Vision-Language-Action (VLA) models typically requires massive, diverse robotic datasets with high-fidelity action annotations. While egocentric human manipulation videos are abundant and capture significant environmental diversity, the absence of action labels makes them difficult to use in conventional training paradigms. To address this, we propose a latent-action-based framework designed to extract general action priors from unlabeled human videos. The architecture features a Hybrid Disentangled VQ-VAE that decouples motion dynamics from environmental backgrounds throug…

Object-Centric Residual RL for Zero-Shot Sim-to-Real VLA Enhancement arxiv:2606.18953 2026-06-17T11:36:54.000Z 2026-06-17T11:36:54.000Z Kinam Kim Namiko Saito Heecheol Kim Katsushi Ikeuchi Jaegul Choo Yasuyuki Matsushita

Vision-Language-Action (VLA) models can generalize across diverse manipulation tasks, but their imitation-learning-based policies remain brittle in precise physical interactions due to compounding execution errors. Can a reinforcement learning policy trained purely in simulation improve the robustness of real-world VLAs zero-shot? Residual RL, which learns a corrective policy on top of a frozen VLA, offers a natural framework, but existing approaches face a fundamental sim-to-real dilemma: privileged-state methods require lossy distillation for deployment; image-based methods suffer from the…
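
The general residual-RL idea named in the abstract above, a corrective policy added on top of a frozen base policy, can be sketched as follows. This shows only the generic composition step, not this paper's object-centric method. The function name `compose_action`, the `scale` factor, and the clipping `limit` are assumptions made for illustration.

```python
def compose_action(base_action, residual, scale=0.1, limit=1.0):
    """Combine a frozen base policy's action with a learned correction.

    Assumptions (illustrative only): actions are normalized to
    [-limit, limit], and the residual is scaled down so the corrective
    policy can only nudge the base action rather than override it.
    """
    if len(base_action) != len(residual):
        raise ValueError("base_action and residual must have the same length")
    combined = []
    for base, delta in zip(base_action, residual):
        value = base + scale * delta
        # Clip to the assumed valid action range.
        combined.append(max(-limit, min(limit, value)))
    return combined


if __name__ == "__main__":
    # The second component would leave the range and is clipped to -1.0.
    print(compose_action([0.5, -0.95], [1.0, -1.0]))
```

Keeping the base policy frozen means only the residual term is trained, which is what makes the correction cheap to learn in simulation.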