Too few demonstrations
(And camera errors found later)
Data
Real world data
Representing the real world
Final integration with real life data
Synthetic data generation (SDG)
World models
Training
World model + VLA
Accurate Scene physics and contacts
Control input
SDG from IRL data
Physical Data collection
World model training
Point cloud object identifier - latent transform
Current approach
What if...
Learned simulator world models pipeline
Training performance locally
Analytical simulator world models pipeline
Scene input
What alternatives are there for a more efficient process?
How do we ensure the robot won't make any irreversible mistakes?
SDG
Profit...
Object detection + recorded states
Scene digital twin
Train VLA model (SmolVLA)
One shot for N frames
To be able to react fast to changes in the environment, the number of frames should be low. Constant recalculation, even with flow matching, is slow - it could cause jitter in the system
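A rough back-of-envelope for this trade-off: with chunked actions, the worst case is that the scene changes right after inference starts, so the robot finishes the stale chunk plus one inference before it can react. All numbers below are illustrative assumptions, not measurements from this project.

```python
# Sketch: reaction latency vs. action-chunk length (hypothetical numbers).

def worst_case_reaction_s(chunk_len: int, control_hz: float, inference_s: float) -> float:
    """Worst case: execute the whole stale chunk, then run one inference."""
    return chunk_len / control_hz + inference_s

# Assumed: 30 Hz control loop, 150 ms flow-matching inference.
for n in (5, 30, 100):
    print(n, round(worst_case_reaction_s(n, 30.0, 0.15), 3))
```

With these assumed numbers, a 100-frame chunk already means multi-second worst-case latency, which is where the jitter concern comes from.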
Expensive to do at scale
Manual scene randomization
Prone to overfitting to a specific scene (light, surroundings, noise, etc.)
Convert the camera outputs, together with the original HDF5, into LeRobot format
Robot states and camera feeds directly
Batch size 32 - 10 GB VRAM
Work in progress :)
Particle graph meshes
Point cloud with mathematical edges (springs and links) drawn between neighbours to form a net. Graph neural network
For very flexible items the material edges tend to break and need to be recalculated constantly, delaying sim time
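The edge-building step above can be sketched as a k-nearest-neighbour graph where each edge stores its rest length as the spring's target length. Function names are hypothetical; a real GNN simulator would also rebuild these edges whenever springs overstretch, which is the recalculation cost noted above.

```python
# Minimal sketch: spring graph from a point cloud via k-nearest neighbours.
import math

def knn_edges(points, k=2):
    edges = {}
    for i, p in enumerate(points):
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        for d, j in dists[:k]:
            edges[tuple(sorted((i, j)))] = d  # undirected edge, rest length = d
    return edges

# Four corners of a unit square: each connects to its two adjacent corners.
cloud = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
print(knn_edges(cloud, k=2))
```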
We need an actual 3D representation (Either physics and/or visuals)
point cloud
Plain xyz points, no connections.
The system learns the relations - Could be upgraded with point cloud segmentation
Semantic keypoints. Give more info, but you need to understand the semantics of each object
3d gaussian splat - Non rigid ICP
Very complete, dense, accurate, and photorealistic. Doesn't force a rigid mesh;
represents the world as millions of ellipsoids with volume, scale, opacity, light, etc. Very good visuals.
Too complex for simulations
Too resource-heavy, prone to drift
Real world representation
3D bounding boxes with multi-camera... but that still doesn't represent anything more than the central coordinate of the clump
Video segmentation and depth camera for position 😃
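Getting a position from a segmentation mask plus a depth camera reduces to standard pinhole back-projection. The intrinsics below (focal length, principal point) are made-up example values, not calibration from this setup.

```python
# Sketch: back-project a masked pixel with metric depth into camera coordinates.

def pixel_to_3d(u, v, depth, fx, fy, cx, cy):
    """Pinhole model: (u, v) in pixels + depth in metres -> (x, y, z)."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)

# Example: principal point (320, 240), 600 px focal length, object 0.5 m away.
print(pixel_to_3d(380, 240, 0.5, 600.0, 600.0, 320.0, 240.0))
```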
Material point method
Hybrid: uses the point cloud for visualization but a background grid to calculate the physics (Isaac Sim works like this)
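The core of MPM is the particle-to-grid (P2G) transfer: each particle scatters its quantities onto nearby grid nodes and physics is computed on the grid, not the raw points. A toy 1D sketch with nearest-node weights only; real MPM uses B-spline kernels and 3D grids.

```python
# Toy 1D particle-to-grid mass transfer (nearest node only, illustrative).

def p2g_mass(particles, grid_size, dx=1.0):
    grid = [0.0] * grid_size
    for x, mass in particles:
        i = int(round(x / dx))        # nearest grid node index
        if 0 <= i < grid_size:
            grid[i] += mass           # scatter particle mass onto the grid
    return grid

print(p2g_mass([(0.2, 1.0), (0.6, 1.0), (2.9, 2.0)], grid_size=4))
```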
Multi view rgb-encoder
No need to generate a 3D space; the model learns the 3D relationships (OpenVLA works like this)
It is a black box: no way to know if the model understands the 3D structures, folds, etc.
Replicating the same physics across different simulators isn't fun either
What if we wanted to make our own 3D perception system to accommodate the flexible objects or materials in our scenes? Or even fine-tune an existing one?
Translating robot models and joint configs isn't fun
We can change lighting, material textures, and colors, and add clutter and movement through the scene
4 cameras and physics re-run when replaying episodes
20 fps with no physics/collisions and 3 cameras at lower quality
At 15 fps per camera, that's faster than real time - with no manual scene resets
HDF5 files with no cameras.
Just a control replay with all scene states
The same as DINO but for 3D: takes a raw point cloud and produces a latent map by detecting corners, etc.
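The key property of point-cloud encoders in the PointNet family is permutation invariance: a shared per-point feature map followed by a symmetric pooling, so the latent doesn't depend on point ordering. A toy sketch with fixed hand-written features instead of learned weights.

```python
# Order-invariant point-cloud encoder sketch (PointNet-style, toy features).

def encode(points):
    # Shared per-point feature map (here: fixed toy functions, not learned).
    feats = [(x + y + z, x * x + y * y + z * z) for x, y, z in points]
    # Symmetric max-pooling makes the latent permutation invariant.
    return tuple(max(f[i] for f in feats) for i in range(2))

cloud = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
assert encode(cloud) == encode(list(reversed(cloud)))  # same latent, any order
print(encode(cloud))
```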
RGB encoder for latent feature map
If we use the same pipeline as in MuJoCo for simulation and Isaac Sim for SDG - could we extract the very accurate object from a Gaussian splat and use that as the asset for SDG?
Doing that for every frame, we can get a time-based USD file or Alembic (.abc) and use it in Isaac Sim with the recorded robot states.
Latent space transform / Object identifier training
Ground truths of object surfaces and properties
Latent space transform
World model
World model state input
World model training
Latent space transform
Robot state inputs
Camera inputs
Robot state inputs
Point cloud generation
3D Gaussian splat
Robot state inputs
Camera inputs
3D Mesh Reconstruction per frame
Time based USD file
Point cloud generation
Camera inputs
VLA output for N frames
World model prediction for N frames
A world model equivalent could be just a very accurate simulator
Genesis
Linear Blend Skinning (LBS) for visualization into the Gaussian splat, but physics run without visualization
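LBS moves each splat/vertex by a weighted blend of bone transforms, v' = Σᵢ wᵢ (Rᵢ v + tᵢ). A 2D sketch with translation-only bones to keep it short; the values are made up for illustration.

```python
# Linear Blend Skinning sketch (2D, translation-only bones).

def lbs(vertex, bones, weights):
    """Blend each bone's transform of the vertex by its skinning weight."""
    vx, vy = vertex
    out_x = out_y = 0.0
    for (tx, ty), w in zip(bones, weights):
        out_x += w * (vx + tx)
        out_y += w * (vy + ty)
    return (out_x, out_y)

# A vertex skinned half to a static bone, half to a bone moved +2 in x.
print(lbs((1.0, 0.0), [(0.0, 0.0), (2.0, 0.0)], [0.5, 0.5]))
```

The same blend applied per splat is what drags the Gaussian splat along with the physics state, while the physics itself runs on the underlying particles.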
Built for camera occlusion
Very good with occlusion; requires heavy vision models (segmentation) to extract the grids
The point cloud's points are mapped to 3D grids
It calculates all the points in the grid; if the object moves, the grid flows, and if anything is covered, the model still tracks it in the grid.
Perceives the point cloud as individual objects with properties (rigid, soft, etc.), focusing only on the objects to be manipulated. The end effector is also perceived separately.
Calculates point-to-point collisions depending on the properties of each point.
Very good at complex multimaterial interactions
Vulnerable to occlusion
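Property-aware point-to-point contact can be sketched as: each point carries a radius and a stiffness, a pair is "in contact" when their spheres overlap, and the response scales with the softer material. The property values and the min-stiffness rule are illustrative assumptions, not this project's actual contact model.

```python
# Sketch: per-point contact force from per-point properties (toy model).
import math

def contact_force(p, q):
    (pos_p, r_p, k_p), (pos_q, r_q, k_q) = p, q
    d = math.dist(pos_p, pos_q)
    overlap = (r_p + r_q) - d
    if overlap <= 0:
        return 0.0                      # spheres don't touch
    return min(k_p, k_q) * overlap      # softer material dominates response

rigid = ((0.0, 0.0, 0.0), 0.5, 1000.0)  # (position, radius, stiffness)
soft  = ((0.8, 0.0, 0.0), 0.5, 10.0)
print(contact_force(rigid, soft))
```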
Using the point cloud, each point is a 3D pixel. The whole scene is processed as a whole (no individual properties or objects)
The flow of the whole scene is what is calculated: how every single pixel affects every other.
Robot agnostic, but extremely computationally heavy.
Raw point cloud
Encoder to latent space (like PointNet)
World model
Predicted Scene State according to previous state and action state
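The pipeline above can be sketched as a latent transition model: next latent = f(current latent, action), rolled out over an action sequence. Here f is a fixed linear map purely for illustration; the real world model would be a learned network over encoded scene states.

```python
# Latent world-model rollout sketch: z' = A z + B a with toy A = 0.9 I, B = I.

def world_model_step(z, a):
    """One predicted transition in latent space (fixed toy dynamics)."""
    return [0.9 * zi + ai for zi, ai in zip(z, a)]

def rollout(z0, actions):
    """Predict a latent trajectory from an initial state and action sequence."""
    z, traj = z0, [z0]
    for a in actions:
        z = world_model_step(z, a)
        traj.append(z)
    return traj

print(rollout([1.0, 0.0], [[0.0, 1.0], [0.0, 0.0]]))
```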
Voxel grids
Preserves the density of the scene; very structured and good for 3D convolutional neural networks
Too heavy/wasteful: you calculate empty space, and interaction between bodies is too heavy
Limited to rigid objects only; a shirt or a rope would probably fail to be represented properly
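The empty-space waste is easy to see in a toy voxelization: most cells in the occupancy grid end up unoccupied but still have to be stored and convolved over.

```python
# Sketch: voxelize a point cloud into an occupancy grid (unit cells).

def voxelize(points, res, cell=1.0):
    grid = [[[0] * res for _ in range(res)] for _ in range(res)]
    for x, y, z in points:
        grid[int(x // cell)][int(y // cell)][int(z // cell)] = 1
    return grid

cloud = [(0.2, 0.1, 0.3), (0.4, 0.2, 0.1), (3.5, 3.5, 3.5)]
grid = voxelize(cloud, res=4)
occupied = sum(v for plane in grid for row in plane for v in row)
print(occupied, 4 ** 3)  # occupied cells vs. cells stored
```

Here 3 points fill only 2 of the 64 stored cells; at realistic resolutions the ratio is far worse, which is the "calculating empty space" cost.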
RGB - text encoder
Point maps converted into grasping maps. Useful for the moment the robot is interacting with the object, but doesn't give a representation of the object in space.
How do we get the numerical representation of an object in a scene?
No sim-to-real pain
Trained in absolute ground truth
Failed resulting model as of now, with smolVLA_base as the initial model
What happened?
The MuJoCo and Isaac Sim links for the camera reference point are different, so when trying to link the camera in SDG it didn't find the prim and gave a blank image, which made all of the gripper camera frames useless.
Also, and most importantly, 30 episodes is too few, and doing 5 SDG variations probably overfits the model.
Raw point cloud
Cloud point segmentation
Particles as MPM particles with corresponding properties