Accurate sim physics data
SDG data for multiple demonstrations
Too few demonstrations (and camera errors found later)
Data
Real world data
Representing the real world
Final integration with real life data
Synthetic Data generation
World models
Training
World model + VLA
Accurate Scene physics and contacts
Control input
SDG from IRL data
Physical Data collection
World model training
Point cloud object identifier - latent transform
Current approach
What if...
Learned simulator world models pipeline
Training performance locally
Analytical simulator world models pipeline
Scene input

What alternatives are there for a more efficient process?

How do we ensure the robot won't make any irreversible mistakes?

SDG

Profit...

Object detection + recorded states

Scene digital twin

Train VLA model (SmolVLA)

One shot for N frames

To react fast to changes in the environment, the number of frames should be low. Constant recalculation, even with flow matching, is slow - it could cause jitter in the system
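A quick sketch of the tradeoff. The numbers are illustrative assumptions (30 Hz control loop, 80 ms flow-matching inference), not measurements:

```python
# Sketch: how the action-chunk size N affects worst-case reaction time.
# control_hz and inference_ms are assumed placeholder values.

def worst_case_reaction_ms(n_frames: int, control_hz: float, inference_ms: float) -> float:
    """If the policy commits to N frames per inference call, a scene change
    right after a call is only reacted to after the whole chunk plays out
    plus the next inference."""
    chunk_ms = n_frames / control_hz * 1000.0
    return chunk_ms + inference_ms

for n in (1, 8, 32):
    print(n, worst_case_reaction_ms(n, control_hz=30.0, inference_ms=80.0))
```

At N=32 the worst case is over a second of blind execution, which is where the jitter and slow reactions come from.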

Pasted image 20260222014943.png

Expensive to do at scale
Manual scene randomization
Prone to overfitting to a specific scene (light, surroundings, noise, etc.)

Convert the camera outputs together with the original HDF5 into LeRobot format

Robot states and camera feeds directly

Batch 32 - 10 GB VRAM

Pasted image 20260221115645.png
Pasted image 20260222013742.png
Timeline 4 1.mp4

Work in progress :)

Particle graph meshes
Point cloud with drawn math edges (springs and links) between neighbours to form a net. Graph Neural Network
For very flexible items the material edges tend to break and need to be recalculated constantly, delaying sim time
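A minimal sketch of the edge-building step that has to be re-run whenever edges break (pure NumPy, brute-force k-NN):

```python
import numpy as np

def knn_edges(points, k=3):
    """Connect each point to its k nearest neighbours with undirected edges,
    as a graph net would before message passing. Brute-force O(N^2)
    distances - fine for a sketch, not for large clouds."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # no self-edges
    nbrs = np.argsort(d, axis=1)[:, :k]  # k closest per point
    edges = set()
    for i, row in enumerate(nbrs):
        for j in row:
            edges.add((min(i, int(j)), max(i, int(j))))
    return edges
```

For a deformable this is the function that ends up being called every few steps, which is where the sim-time delay comes from.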

We need an actual 3D representation (Either physics and/or visuals)

point cloud
Plain xyz points, no connections.
The system learns the relations - Could be upgraded with point cloud segmentation

Semantic keypoints. They give more info, but you need to understand the semantics of each object

3D Gaussian splat - Non-rigid ICP
Very complete, dense and accurate, photorealistic. Doesn't force a rigid mesh;
represents the world as millions of ellipsoids with volume, scale, opacity, light, etc. Very good visuals.
Too complex for simulations
Too resource heavy, prone to drift

Real world representation

3D bounding boxes with multi-camera... but that still doesn't represent anything more than the central coordinate of the clump

Video segmentation and depth camera for position 😃

Material point method
Hybrid: uses the point cloud for visualization, but a background grid to calculate the physics. (Isaac Sim works like this)
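A toy sketch of the particle-to-grid (P2G) transfer that makes MPM hybrid. A real MPM implementation spreads each particle over neighbouring nodes with B-spline weights; this is nearest-node only:

```python
import numpy as np

def p2g_mass(particles, grid_shape, cell):
    """Scatter unit particle masses onto the background grid
    (nearest-node variant, a simplification of real MPM weights).
    Physics (momentum, stress) is then solved on the grid and
    transferred back to the particles (G2P)."""
    grid = np.zeros(grid_shape)
    idx = np.clip(np.round(particles / cell).astype(int),
                  0, np.array(grid_shape) - 1)
    for i in idx:
        grid[tuple(i)] += 1.0  # one unit of mass per particle
    return grid
```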

Multi view rgb-encoder
No need to generate a 3D space; the model learns the 3D relationships (OpenVLA works like this)
It is a black box - no way to know if the model understands the 3D structures, folds, etc.

Timeline 3.mp4

Replicating the same physics across different simulators isn't easy either

Pasted image 20260222012852.png
Pasted image 20260222012929.png
Pasted image 20260222012807.png
Pasted image 20260222012828.png
Pasted image 20260222012749.png
Pasted image 20260220062936.png

What if we wanted to make our own 3D perception system to accommodate the flexible objects or materials in our scenes? Or even fine-tune an existing one?

Translating robot models and joint configs isn't fun

Pasted image 20260222002834.png

We can change lighting, material textures and colors, plus additional clutter and movement through the scene

4 cameras and physics re-run when replaying episodes

20fps with no physics/collisions and 3 cameras with lower quality

At 15 fps per camera, that's faster than real time - with no manual scene resets
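Rough arithmetic behind that claim, with assumed numbers (episode length and control rate are placeholders):

```python
def regen_hours(n_episodes, ep_seconds, record_hz, render_fps):
    """Wall-clock hours to re-render a dataset when the sim emits
    `render_fps` frames per second per camera and episodes were
    recorded at `record_hz` control steps per second."""
    frames_per_ep = ep_seconds * record_hz
    return n_episodes * frames_per_ep / render_fps / 3600.0
```

E.g. 30 one-minute episodes recorded at 10 Hz re-render in about 20 minutes per camera, and with no manual resets the whole run is unattended.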

Pasted image 20260222005315.png
Pasted image 20260222005231.png
Pasted image 20260222005251.png
Pasted image 20260222005222.png
Pasted image 20260222012759.png
Pasted image 20260222012818.png
Pasted image 20260222012741.png
Pasted image 20260222012842.png
Pasted image 20260222005342.png
Pasted image 20260222005334.png
Pasted image 20260222005351.png
Pasted image 20260222005323.png
Pasted image 20260222005241.png
Pasted image 20260222005305.png
Pasted image 20260222012913.png

hdf5 files with no cameras.
Just a control replay with all scene states

Timeline 2.mp4
Pasted image 20260221235739.png
Pasted image 20260222000109.png

Mujoco SIM

Timeline 1.mp4
Pasted image 20260221235722.png

Robot model (XML from URDF)

YAML scene configuration

General purpose Clothes Manipulation with semantic keypoints - https://arxiv.org/pdf/2408.08160v3

Segmented point cloud

Extract only the depth pixels of the desired object and represent them in a point cloud.
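A minimal sketch of that step under the standard pinhole model (fx, fy, cx, cy are the camera intrinsics):

```python
import numpy as np

def masked_point_cloud(depth, mask, fx, fy, cx, cy):
    """Back-project only the masked depth pixels into a camera-frame
    point cloud with the pinhole model:
    X = (u - cx) * Z / fx,  Y = (v - cy) * Z / fy."""
    v, u = np.nonzero(mask)            # pixel coords of the object only
    z = depth[v, u]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)  # (N, 3) xyz points
```

The mask would come from the video segmentation step; everything outside it is simply never lifted to 3D.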

Nvidia Cosmos Predict 2.5 - Video approach, but can understand point clouds with converters - https://github.com/nvidia-cosmos/cosmos-predict2.5

Particle Grid Neural dynamics - https://kywind.github.io/pgnd

The same as DINO but for 3D. Takes a raw point cloud and gives a latent map by detecting corners, etc.

RGB encoder for latent feature map

If we use the same pipeline as in the Mujoco simulation - Isaac Sim for SDG - could we extract the very accurate object from a Gaussian splat and use that as the asset for SDG?

Doing that for every frame, we can get a time-based USD file or Alembic (.abc) and use it in Isaac with the recorded robot states.

SuGaR - Surface Aligned Gaussian Splatting for 3D mesh reconstruction - https://arxiv.org/pdf/2311.12775

3DGS generates a whole scene. We could use segmentation algorithms (SAM + DINOv2) to extract only the items we want and run 3DGS on those.

If only we could extract only the resulting mesh from a 3DGS...

3DGS gives us ellipsoids with many properties, which include light and other already defined parameters depending on the scene.

We can extract the ground truth of deformables easily

We can simulate RGB-D sensors, although they are rendered as ideal, so some noise and artifacts need to be integrated.
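One hedged way to degrade the ideal renders: depth-proportional Gaussian noise plus random invalid pixels. The rates here are placeholders and would need calibrating against the real camera:

```python
import numpy as np

def noisy_depth(depth, rng, sigma_frac=0.01, dropout=0.02):
    """Degrade an ideal rendered depth map toward a real RGB-D sensor:
    Gaussian noise proportional to depth, plus random dropout pixels.
    sigma_frac and dropout are assumed placeholder rates."""
    noisy = depth + rng.normal(0.0, sigma_frac, depth.shape) * depth
    holes = rng.random(depth.shape) < dropout
    noisy[holes] = 0.0  # RGB-D cameras report 0 for invalid returns
    return noisy
```

Real sensors also have structured artifacts (edge flicker, holes at depth discontinuities, material-dependent dropout), so this is only the first-order model.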

With ground truth and current perception we can generate lots of close-to-real data extremely fast

This process would also work for generating semantic keypoints databases for deformables

Isaac sim digital twin

State classifier (is the state desirable?) WMPC

MPC

Scene replication in Genesis

SDG data

Execute action

Recalculate VLA output

Sensor readings

Artificial noise + artifacts

Latent space transform / Object identifier training

Ground truths of object surfaces and properties

Latent space transform

World model

World model state input

World model training

Latent space transform

Robot state inputs

Camera inputs

Robot state inputs

Point cloud generation

3D Gaussian splat

Robot state inputs

Camera inputs

3D Mesh Reconstruction per frame

Time based USD file

Point cloud generation

Camera inputs

VLA output for N frames

World model prediction for N frames

A world model equivalent could be just a very accurate simulator

Genesis

Linear Blend Skinning (LBS) for visualization into a Gaussian splat. But physics runs without viz
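LBS itself is simple: v' = Σᵢ wᵢ (Rᵢ v + tᵢ), a weighted blend of bone transforms per vertex. A NumPy sketch, used only to drag the visual (splat) vertices along with the physics state:

```python
import numpy as np

def lbs(vertices, weights, transforms):
    """Linear Blend Skinning.
    vertices:   (N, 3) rest-pose positions
    weights:    (N, B) per-vertex bone weights (rows sum to 1)
    transforms: (B, 4, 4) homogeneous bone transforms
    Returns (N, 3) skinned positions."""
    homo = np.concatenate([vertices, np.ones((len(vertices), 1))], axis=1)
    # Apply every bone transform to every vertex: (N, B, 3)
    per_bone = np.einsum('bij,nj->nbi', transforms, homo)[..., :3]
    # Blend per vertex with its bone weights
    return np.einsum('nb,nbi->ni', weights, per_bone)
```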

Built for camera occlusion

Very good with occlusion, but requires heavy vision models to extract the grids (segmentation)

The point cloud points are mapped to 3D grids.
It calculates all the points in the grid; if the object moves, the grid flows, and if anything is covered, the model still understands it in the grid.

Built for multi-material interactions.

Perceives point clouds as individual objects with properties (rigid, soft, etc.), focusing only on the objects to be manipulated. The end effector is also perceived separately

Calculates point-to-point collisions depending on the properties of each point.

Very good at complex multimaterial interactions
Vulnerable to occlusion

Universal robot control

Using the point cloud, each point is a 3D pixel. The whole scene is processed as a whole (no individual properties or objects)

The flow of the whole scene is what is calculated: how every single pixel affects the others.

Robot agnostic, but extremely computationally heavy.

Raw point cloud

Encoder to latent space (Like pointNet)
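The PointNet trick in a few lines: a shared per-point MLP (one linear+ReLU layer here) followed by symmetric max pooling, so the latent doesn't depend on point order. The weights are stand-ins:

```python
import numpy as np

def pointnet_encode(points, W, b):
    """PointNet-style encoder sketch. The same MLP is applied to every
    point, then max-pooled over the set, making the latent invariant
    to point ordering. W (3, D) and b (D,) are stand-in weights."""
    feats = np.maximum(points @ W + b, 0.0)  # shared per-point MLP
    return feats.max(axis=0)                 # symmetric pooling -> latent
```

Shuffling the input rows leaves the latent unchanged, which is exactly what lets the world model consume raw, unordered clouds.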

World model

Predicted Scene State according to previous state and action state

Voxel grids
Preserves the density of the scene; very structured and good for 3D convolutional neural networks.
Too heavy/wasteful: you compute over empty space, and interaction between bodies is too expensive.
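The waste is easy to quantify: voxelize a cloud and check what fraction of cells a dense 3D CNN would process for nothing:

```python
import numpy as np

def occupancy(points, grid_shape, cell):
    """Voxelize a point cloud and report the fraction of the grid that is
    actually occupied - a dense 3D CNN still pays for all the empty rest."""
    grid = np.zeros(grid_shape, dtype=bool)
    idx = np.clip((points / cell).astype(int), 0, np.array(grid_shape) - 1)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = True
    return grid, grid.mean()
```

Typical tabletop scenes occupy only a small fraction of the grid, so most of the convolution budget goes to empty cells.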

That's limited to rigid objects only. A shirt or a rope would probably fail to be represented properly

RGB - text encoder

Point map conversion into grasping maps. Useful for the moment the robot is interacting with the object, but doesn't give a representation of the object in space

How do we get the numerical representation of an object in a scene?

No sim-to-real pain
Trained in absolute ground truth

Failed resulting model as of now, with smolVLA_base as the initial model

What happened?
The Mujoco and Isaac Sim links for the camera reference point are different. So when trying to link the camera in SDG, it didn't find the prim and gave a blank image, which made all of the gripper camera frames useless.
Also, and most importantly, 30 episodes is too few. And doing 5 SDG variations probably overfits the model.

Raw point cloud

Point cloud segmentation

Particles as MPM particles with corresponding properties

MPM processing

Analytical simulated Scene State

Home