By Genesis Team

May 7, 2026

GENE-26.5: Advancing Robotic Manipulation to Human Level

Today we introduce GENE-26.5, our first robotic foundation model system and the initial public release in the GENE family.

GENE-26.5 is designed to push general-purpose robotic manipulation towards human-level capability. Across a set of long-horizon and contact-rich tasks, including cooking, lab automation, solving a Rubik’s cube, making a smoothie, wire harnessing, multi-object grasping, and piano playing, the system demonstrates a broad set of dexterous skills using the same model, hardware platform, data strategy, and control stack.

These tasks are not designed as isolated demos. They are designed to test a broader question: can a robot interact with the physical world with the precision, timing, coordination, and adaptability required for real work?

At Genesis, we believe the path to useful general-purpose robots begins with manipulation.

💡 TL;DR

  • Manipulation is the most important yet unsolved problem in robotics
  • Human-level dexterity and capability are closer than they appear
  • Solving manipulation requires full-stack system thinking beyond a pure model-centric perspective
  • Human-centric data coupled with human-centric hardware solutions is the scaling path for pre-training
  • High-fidelity simulation is the ultimate accelerator for model iteration

Manipulation: why and how

Manipulation is the most valuable problem in robotics because it turns intelligence into useful work. Most physical labor is not about moving through the world, but about transforming it.

It is also the hardest unsolved problem. Navigation formulates the world as “obstacles and free space” and avoids contact. Locomotion and whole-body control use contact for support: the ground is stable, the pattern repeats, and errors are often recoverable. Manipulation is different. Contact is the task. The robot must understand the world, predict and reason about the outcomes of interactions, and interact with unknown objects under uncertainty in shape, weight, friction, and dynamics, using precise force and timing. Errors compound over long horizons, and a few millimeters can decide success or failure.

At Genesis, we view manipulation as the core problem in robotics. If a robot can reliably and intelligently control physical interactions with the world, everything else becomes support.

A system problem, not just an AI problem

One of our early observations is that robotic manipulation is difficult to solve as a pure model training problem. Robotics is inherently more complex than digital AI: it demands tight coordination between sensors, actuators, control, data, and the model itself. Limitations in any one layer propagate throughout the entire system and ultimately constrain overall performance. Therefore, building capable and reliable robots requires not only optimizing individual components, but excelling across the entire stack.

Interestingly, when the system is designed jointly from the ground up, many challenges that seem difficult from a model-centric perspective can instead be addressed more fundamentally in other layers.

Data is the clearest example. The scarcity of high-quality demonstration data remains one of the primary bottlenecks in robotics. Human interaction data is the richest and most scalable source of real-world supervision. It naturally captures the diversity of workflows and environments robots ultimately need to generalize across.

However, existing approaches operate under a fundamental trade-off between scale and fidelity. Egocentric human video collected in the wild scales well but suffers from noise, occlusion, and limited observability. Interfaces such as teleoperation and hand-held gripper devices provide richer signals, but require dedicated operators, controlled collection setups, and workflows organized around data collection rather than the task itself. As a result, robotics datasets remain limited not only in scale, but also in the diversity of natural real-world interaction they contain.

At the root of this challenge is the embodiment gap between humans and robots. Human hands generate rich interaction data naturally, but those motions do not directly transfer to robot hardware.

The status quo

As a result, researchers in robotics often have to settle for compromised solutions due to deeper system-level constraints. Hardware limitations restrict the range of feasible interactions, which in turn shapes data collection strategies, despite the fact that the world is built around tools designed for human hands. Non-idealities in the control middleware, including latency, controller dynamics, transmission error, backlash, and actuator inaccuracy, introduce discrepancies between commanded actions and actual system states; this pushes model training toward teleoperation signals as supervision, which implicitly encode robot-specific artifacts. The evaluation bottleneck further slows progress: one robot, one human evaluator, one trial at a time; minutes per trial, operator-days per checkpoint.

But do we really have to compromise?

No.

Rather than optimizing within these constraints, we take a step back and consider the full system jointly. If the goal is human-level manipulation, every layer must support it:

  1. Bridge the embodiment gap: Minimize the gap at the hardware level rather than compensating for it in modeling and algorithms. This means not only using a high-DoF hand, but designing one that closely matches the human hand in size, kinematic structure, degrees of freedom, and soft-contact dynamics.
  2. Capture high-fidelity data on the job: Data collection interfaces should preserve natural human behavior while improving observability and precision, making it possible to collect high-quality demonstrations within real workflows.
  3. Optimize control: Minimize latency and tracking error so the model can learn from broader supervision beyond robot-specific teleoperation signals.
  4. Build a robotics-native model: The model must scale across language, vision, proprioception, tactile sensing, and action at the frequency and dimensionality required for complex manipulation.
  5. Scale evaluation: Evaluation must match the diversity, efficiency, reproducibility, and throughput required for foundation-model-scale iteration. You can only improve what you can measure.

GENE is not just a model, but a holistic system defined by these principles. GENE-26.5 is the first release in this direction.

Advancing manipulation capability to near-human level

Human-level manipulation extends beyond simply relocating objects from point A to point B. It is the art of composing contact in space and time. A truly intelligent robot should be capable of synthesizing diverse modes of interacting with the world on the fly: using one finger to push, two fingers to pinch, three fingers to stabilize, four to re-orient, the full hand and palm for a power grasp, tools to extend its reach, and seamless handovers between hands.

To reason about this frontier, we evaluate robotic manipulation along five core axes:

  • Spatial Precision: Where interaction occurs, and how accurately contacts, objects, or tools must be placed positionally.
  • Temporal Composition: When and how fast actions must be executed to produce the desired dynamics over time.
  • Contact Richness: The number and diversity of simultaneous contacts, from single-point touches to full-hand and multi-object interactions.
  • Contact Coordination: The degree to which multiple contacts must be synchronized to act as a coherent behavior.
  • Tool-Mediated Interaction: The ability to extend capability beyond the robot’s physical body by using intermediate objects both as designed and in novel but physically reasonable ways.

Contact richness measures how many and how diverse contacts are used, while contact coordination measures how tightly they must be synchronized. This framework guides how we design and evaluate tasks. Rather than optimizing for isolated demonstrations, we select tasks that stress different combinations of these axes across real-world settings.

What GENE-26.5 can do

We evaluate GENE-26.5 on a diverse set of real-world tasks spanning household, laboratory, and industrial workflows. All tasks are executed at 1× real-world speed by a single model with shared weights. (The one exception is piano playing, which is included to test the capability of our control system, and for fun.)

The table below maps each task to a different combination of these axes:

Cooking

A four-minute, long-horizon task in an unsimplified, real-world setting with more than 20 subtasks. The robot performs single-handed egg cracking, demonstrating delicate force control and inter-finger coordination; coordinated bimanual manipulation, using one hand to reorient a tomato while the other performs precise cutting; and both direct and indirect tool use across tools including a towel, salt mill, whisk, knife, spatula, and frying pan. A representative moment: when transferring diced tomatoes, the robot reorients the knife against the cutting board for support and completes the transfer through coordinated bimanual motion.

Cooking | Autonomous 1x speed

Lab pipetting

A high-precision laboratory workflow involving pipetting, liquid transfer, tube sealing, and centrifuge loading. The robot grasps a pipette in the correct pose, inserts it into a tip, transfers liquid from a beaker into a tube, ejects the tip, seals the tube, opens the centrifuge by actuating the small “open” button with a subtle nudge, and places the tube into the rotor. This requires millimeter-level precision, tool use, fine-motor coordination (e.g., screwing on a 1 cm cap), and dexterous in-hand re-grasping to reposition the pipette for hanging it back onto the rack.

Lab pipetting | Autonomous 1x speed

Solving a Rubik’s Cube

Rubik’s Cube solving has been a challenging benchmark for robotic manipulation. The task requires fine-grained control under the geometric and kinematic constraints imposed by the cube itself: each rotation must be precise, while the object must remain stable across successive moves. In a bimanual setting, the challenge becomes even more demanding, requiring tight coordination across arms, hands, and multiple fingers.

Prior work on robotic Rubik’s Cube solving without specially designed mechanical fixtures remains limited, with OpenAI’s single-handed robotic system (2019) as a key milestone. Following this line, we use an external solver to generate closed-loop action commands on the fly, translate them into language instructions, and execute them with the model.
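For concreteness, here is a minimal sketch of how such an orchestration loop could be wired up. It uses the open-source kociemba two-phase solver as the external solver; the cube-state reader, the policy interface, and the exact instruction phrasing are illustrative placeholders rather than the actual GENE interfaces.

```python
# Hypothetical closed-loop orchestration: an external solver produces moves,
# each move is translated into a language instruction, and the model executes it.
import kociemba  # off-the-shelf two-phase solver; any external solver would do

SOLVED = "UUUUUUUUURRRRRRRRRFFFFFFFFFDDDDDDDDDLLLLLLLLLBBBBBBBBB"
FACE_NAMES = {"U": "top", "D": "bottom", "L": "left",
              "R": "right", "F": "front", "B": "back"}

def move_to_instruction(move: str) -> str:
    """Translate a solver move such as "R'" or "U2" into a language instruction."""
    face = FACE_NAMES[move[0]]
    if move.endswith("2"):
        return f"rotate the {face} face 180 degrees"
    direction = "counterclockwise" if move.endswith("'") else "clockwise"
    return f"rotate the {face} face 90 degrees {direction}"

def solve_cube(policy, read_cube_state):
    """`policy` and `read_cube_state` are hypothetical stand-ins for the model and perception."""
    while True:
        state = read_cube_state()              # 54-character facelet string
        if state == SOLVED:
            break
        moves = kociemba.solve(state).split()  # re-plan from the latest observation
        policy.execute(move_to_instruction(moves[0]))  # execute one move, then re-observe
```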

To the best of our knowledge, this is the first time a general-purpose bimanual robotic system can solve a Rubik’s Cube.

Solving a Rubik’s Cube | Autonomous 1x speed

Making a smoothie

A long-horizon, language-instructed task that involves preparing a smoothie from raw ingredients. It requires careful handling of diverse material states, including rigid objects, deformables, and liquids.

Making a smoothie | Autonomous 1x speed

Smoothie straw flip

A follow-up task that stresses handling a straw and its plastic cover, both extremely fragile and translucent items. It ends with an in-hand flipping motion to re-orient the straw in the right direction, requiring complex, synchronized coordination among multiple fingers of one hand.

Smoothie straw flip | Autonomous 1x speed

Multi-object grasping

Designed to demonstrate what becomes possible when pick-and-place is enabled by a highly dexterous hand, and the efficiency gains over gripper systems. The robot simultaneously grasps four objects of different sizes using four distinct grasp types with a single hand, and sorts them into corresponding bins.

Picking up multiple objects | Autonomous 1x speed

Wire harnessing

A holy-grail task in the automotive industry, requiring precise handling of soft, highly deformable objects like cables and tape. The robot coordinates both hands to bundle cables, hang them on stands, and wrap them in tape.

Wire harnessing | Autonomous 1x speed

Control Stack Stress Test: Piano Playing

While part of the GENE model family, the policy used here is trained separately via reinforcement learning in simulation, guided by human demonstrations. This task is designed specifically to validate the high-speed, accurate tracking capabilities of our control stack. We test it on two clips: Ferris Wheel and Rush E.

Playing piano (Ferris Wheel) | Autonomous 1x speed

Playing piano (Rush E clip) | Autonomous 1x speed

For most of the challenging skills in our demo task collection, GENE requires less than one hour of task-specific robot data, which translates to fewer than 200 episodes for skills under 20 seconds in duration.

The scaling path for manipulation

Over 80% of physical labor is manipulation. Almost none of it has ever been recorded.

Human-centric data is one of the most important scalable sources for scaling manipulation intelligence in the real world. The challenge lies not in volume alone, but in capturing data that preserves the richness of human interaction while maximizing its extractability and usability for robotic systems. GENE is built around a scaling path towards human-level capability: pre-training on diverse human demonstrations, aligning with a small amount of robot data, and continuously improving through feedback from the real world and simulation.

Genesis Hand 1.0

A model for dexterous manipulation needs a capable physical interface for expressing rich contact. Genesis Hand 1.0 is designed with this principle in mind. It is a highly dexterous, direct-drive robotic hand engineered to achieve a true 1:1 size match with the human hand. It features 20 active, back-drivable degrees of freedom and is covered in soft material across the palm and fingers to mimic the soft-contact physics of human skin. This biomimetic design allows us to map human hand motions directly to the robotic hand, effectively eliminating the need for complex retargeting algorithms and resulting in near-lossless information transfer from human demonstrations.

Genesis Hand 1.0 | Left: comparison with a human hand | Right: in motion

GENE-26.5 currently runs on a hardware platform that is already highly dexterous, but still leaves room for further reducing the embodiment gap to the human hand. Genesis Hand 1.0 represents the next step in our hardware roadmap and will serve as an important platform for continued iteration of the GENE system.

Hardware is not downstream of the model; it is what makes the right data scalable.

Human-centric data engine

The world’s most valuable physical expertise lives in the tacit knowledge of human hands: the intuition of an assembly worker, the precision of a lab technician, the speed of a kitchen line. Accessing this knowledge at scale requires solving a fundamental constraint: data must be captured without interrupting the work it comes from. If collection changes behavior, it limits both scale and fidelity.

Our data engine combines three complementary sources that together span the quality–quantity Pareto frontier:

  • Glove data captures high-fidelity hand motion and tactile signals
  • Egocentric video captures natural behavior and real-world task diversity
  • Third-person video provides internet-scale coverage of physical interaction

To capture high-fidelity interaction, our data collection glove uses EMF-based finger tracking and dense tactile sensing across the hand. This interface can be shared seamlessly between human and robotic hands, preserving consistency across data collection and deployment. The glove is designed to be minimally invasive, integrating into existing workflows so that real work becomes data collection with minimal friction.
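As a rough illustration of what one synchronized frame of such glove data might contain, here is a hypothetical record schema; the field names, shapes, and units are assumptions for exposition, not the actual Genesis data format.

```python
# Hypothetical glove data frame; fields and shapes are illustrative only.
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class GloveFrame:
    timestamp_ns: int               # capture time in nanoseconds
    joint_angles: np.ndarray        # (20,) finger joint angles from EMF tracking, radians
    wrist_pose: np.ndarray          # (7,) wrist position + orientation quaternion
    tactile: np.ndarray             # (num_taxels,) pressure readings across palm and fingers
    ego_image_path: str             # path to the synchronized egocentric camera frame
    language: Optional[str] = None  # optional task or step annotation
```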

A robotics-native foundation model

Our goal is to learn a unified model that absorbs scale across heterogeneous inputs and outputs: language, vision, proprioception, tactile sensing, and action. We model a joint distribution over trajectories using flow matching to capture inherently multimodal futures while preserving coupled temporal dynamics (a minimal sketch of this objective follows the list below), with the following goals in mind:

  1. Scalable training on heterogeneous, partially observed data: the model trains on egocentric streams (vision, hand state, language), glove data (vision, language, refined hand state, tactile), robot data (controls), and Internet language and video data, without requiring explicit alignment.
  2. A unified model for all tasks: control, generative simulation, state estimation, inverse dynamics, goal inference, rendering, and value estimation arise as conditional queries on this joint distribution, with missing modalities inferred through denoising.
  3. Flexible incorporation of priors from pre-trained models as a way to import scale: Vision-Language models (VLMs) encode intent and semantic representations, while World Models (action-conditioned video-generation models, in our definition) capture temporal and physical dynamics. The joint distribution can leverage both.
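To make the training objective concrete, the sketch below shows the generic flow-matching loss in its simplest, rectified-flow style form: a velocity field is trained to transport noise samples to ground-truth trajectory chunks along straight-line paths, conditioned on whatever context is observed. The network, dimensions, and conditioning are placeholders, not the GENE architecture.

```python
# Minimal flow-matching sketch; all dimensions and the tiny MLP are placeholders.
import torch
import torch.nn as nn

ACTION_DIM, HORIZON, COND_DIM = 26, 50, 512  # assumed sizes for illustration

class VelocityField(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(ACTION_DIM * HORIZON + COND_DIM + 1, 1024),
            nn.GELU(),
            nn.Linear(1024, ACTION_DIM * HORIZON),
        )

    def forward(self, x_t, t, cond):
        # x_t: (B, HORIZON * ACTION_DIM), t: (B, 1), cond: (B, COND_DIM)
        return self.net(torch.cat([x_t, t, cond], dim=-1))

def flow_matching_loss(model, x1, cond):
    """x1: flattened ground-truth trajectory chunk, (B, HORIZON * ACTION_DIM)."""
    x0 = torch.randn_like(x1)          # noise sample
    t = torch.rand(x1.shape[0], 1)     # random time in [0, 1]
    x_t = (1 - t) * x0 + t * x1        # straight-line interpolation path
    target_v = x1 - x0                 # constant velocity along that path
    return ((model(x_t, t, cond) - target_v) ** 2).mean()

# At inference, trajectories are generated by integrating the learned velocity
# field from noise, conditioned on the observed modalities; unobserved ones can
# be treated as part of the state being denoised.
```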

Scaling towards instant deployment

In collaboration with partners, we have collected over 200,000 hours of data across these modalities. With scaling, we aim for instant deployment: a robot that can enter a new environment and begin performing useful work immediately, with minimal data collection and manual tuning. Achieving this in practice requires rethinking how we build and evaluate models.

We frame instant deployment as the convergence of efficient and effective task-specific fine-tuning: when adaptation is necessary, it should require minimal data, time, and human effort. In the limit, when task-specific effort approaches zero, deployment becomes virtually instantaneous with zero-shot generalization.

For pre-training, we begin with open-loop evaluation to study scaling behavior. As shown in the diagram above, increasing model size and compute consistently reduces validation loss, with larger models achieving lower asymptotic error. This aligns with well-established scaling laws in foundation model training: larger models possess greater capacity and continue to benefit from additional compute and data.
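As a rough illustration of the analysis behind such curves, the snippet below fits a saturating power law, L(C) = a * C^(-b) + c, to a set of (compute, validation loss) measurements; the constant c captures the asymptotic error. The functional form and initial guesses are generic assumptions, and no numbers from our runs appear here.

```python
# Generic power-law fit for loss-versus-compute curves; the inputs must be
# supplied from actual training runs, none are embedded here.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(compute, a, b, c):
    # c is the asymptotic (irreducible) loss; b sets how fast loss falls with compute
    return a * np.power(compute, -b) + c

def fit_scaling_curve(compute, val_loss):
    params, _ = curve_fit(scaling_law, np.asarray(compute), np.asarray(val_loss),
                          p0=(1.0, 0.1, 0.5), maxfev=10000)
    return dict(zip(("a", "b", "c"), params))
```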

However, open-loop metrics alone are insufficient for robotics; closed-loop performance, where actions influence future observations, is a far more meaningful indicator of capability.

We therefore rely heavily on simulation for closed-loop evaluation. At Genesis, we have been pushing the realism boundary of simulation for over a year. Compared to model evaluation in the real world, simulation-based model development is far more controllable, scalable, and reproducible. Prior work running sim-based evaluation for robotic foundation models requires co-training on real-world and simulation data. Thanks to the unprecedented realism of the latest version of Genesis World, we are able to run scalable, reproducible, and systematic evaluation of our model with zero simulation data in training.

We constructed extensive simulated evaluations spanning a wide variety of tasks with common skills, with diverse variations in lighting, backgrounds, object properties, scene configurations, task instructions, and more. In the following plot, each data point represents 200 evaluation setups and over 150 hours of robot execution time; the whole plot would require 2,700 human-robot hours if evaluated in the real world instead. Simulation enables us to conduct extensive evaluations that informatively assess foundation model capabilities at a scale that would be infeasible in the real world. The key finding is clear: scaling pre-training data leads to stronger zero-shot generalization under these extensive closed-loop evaluations.

In our upcoming release, we will share an exciting update on Genesis World and how we establish a strong correlation between model evaluation in simulation and in the real world.

Finally, for task-specific fine-tuning, we ground our evaluation in the real world. We curate novel tasks fully excluded from pre-training and evaluate them in an ultra-low-data regime that reflects instant-deployment constraints. These include an internally defined suite of tasks, each with ~20–30 minutes of data, as well as more complex tasks, such as those shown in our demo videos. This setup allows us to rigorously measure how efficiently models can adapt. Beyond the gains observed in zero-shot generalization, increased pre-training data scale also significantly improves fine-tuning performance: models adapt faster, require less data, and achieve higher final performance.

Together, these results highlight a clear trend: scaling data and compute improves both generalization and adaptation efficiency.

Low latency, high-fidelity control

An AI-controlled robotic system is inherently hierarchical, spanning multiple layers from model outputs to intermediate control signal processing, low-level PID controllers, and ultimately motor-level FOC actuation. Across these layers, the system accumulates latency, tracking error, controller artifacts, and actuation non-idealities, all of which widen the gap between what the model intends and what the robot actually executes.

In teleoperation-based systems, latency and tracking error are often implicitly captured in the training signal, so the model learns under the same robot-specific dynamics profile it will encounter at deployment. However, when training from non-robot data — such as human motion — this assumption no longer holds. The training data does not reflect the dynamics characteristics of the physical system, creating a mismatch between training and execution. While artificial delays or system noise can be introduced during training, accurately modeling real-world dynamics is difficult, as they are state-dependent and vary with robot configuration, velocity, load, contact condition, controller gains, transmission behavior, and actuator dynamics at each moment.

To reduce this mismatch at the source, we replaced the vendor-supplied controller on our bimanual robotic arms with our own control middleware, redesigned for low latency, high tracking fidelity, and deterministic execution. The system uses a high-performance impedance controller and achieves end-to-end latency as low as 3 ms under tuned settings. It runs both arms through a single EtherCAT Y-slave network, uses a PREEMPT_RT kernel with isolated CPU cores for real-time control threads, uses KickCAT as the EtherCAT master with Distributed Clocks support, runs at 500 Hz, and supports both position and impedance control with position and velocity targets.
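To make this layer concrete, here is a minimal sketch of a fixed-rate joint-space impedance loop of the kind described above; the bus interface, gain and target sources, and gravity-compensation hook are hypothetical stand-ins for the actual middleware, and a real deployment would rely on PREEMPT_RT scheduling and EtherCAT Distributed Clocks rather than a user-space sleep.

```python
# Hypothetical 500 Hz joint-space impedance loop; `bus`, `get_targets`, and
# `gravity_torque` are illustrative placeholders, not the Genesis middleware API.
import time

RATE_HZ = 500
DT = 1.0 / RATE_HZ

def impedance_torque(q, dq, q_des, dq_des, Kp, Kd, tau_g):
    # Spring-damper pull toward the target plus gravity compensation.
    # q, dq, q_des, dq_des are joint vectors; Kp, Kd are gain matrices (numpy arrays).
    return Kp @ (q_des - q) + Kd @ (dq_des - dq) + tau_g

def control_loop(bus, get_targets, gravity_torque):
    next_tick = time.perf_counter()
    while True:
        q, dq = bus.read_joint_state()         # read joint positions and velocities
        q_des, dq_des, Kp, Kd = get_targets()  # latest position/velocity targets and gains
        bus.send_torque(impedance_torque(q, dq, q_des, dq_des, Kp, Kd, gravity_torque(q)))
        next_tick += DT                        # fixed-rate pacing
        remaining = next_tick - time.perf_counter()
        if remaining > 0:
            time.sleep(remaining)
```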

The plots above compare our control middleware with the default off-the-shelf controller provided by the arm supplier. When tracking a 15 cm-diameter circle over 4 seconds, the default controller produces an average tracking error of approximately 20 mm, while ours reduces it to approximately 2 mm, an order-of-magnitude improvement. In a single-joint sinusoidal tracking benchmark under impedance mode, the default controller exhibits roughly 80 ms of delay, while ours responds within 9 ms, which can be further reduced to approximately 3 ms with tuned gains.

This is what full-stack robotics looks like in practice for us. We own the communication layer (KickCAT, UDP, PREEMPT_RT scheduling, and core isolation), the controllers (impedance control, PID, and trapezoid profiling), and the interfaces above them. Maintaining coherence across the stack allows improvements in one layer to translate directly to others. This is crucial for closing the human-to-robot gap: by minimizing latency, tracking error, and hidden controller artifacts, we make it possible for GENE to learn from human motion rather than relying exclusively on robot-specific teleoperation signals.

Conclusion

We believe GENE-26.5 is an early but important step toward human-level robotic manipulation. Manipulation capability does not emerge from model training alone. It requires a coherent system: hardware that can express rich contact, data collection that preserves human interaction, control that minimizes the gap between intention and execution, models that absorb multimodal supervision at scale, and evaluation infrastructure that makes iteration scalable and reproducible.

This release reflects our conviction that the path to general-purpose robots starts with manipulation, and that manipulation must be solved as a full-stack problem. GENE-26.5 is just a beginning, but it establishes a foundation to scale: human data, capable hardware, high-fidelity control, realistic simulation, rich evaluation, and fast feedback from the real world.

More to come.

Citation

If you found this work useful in your research, consider citing it as:

@article{genesis2026gene265,
  author  = {Genesis AI Team},
  title   = {GENE-26.5: Advancing Robotic Manipulation to Human Level},
  journal = {Genesis AI Blog},
  month   = {May},
  year    = {2026},
  url     = {https://genesis.ai/blog/gene-26-5-advancing-robotic-manipulation-to-human-level},
}
