Dexterous Generalist Policies with Force Feedback

2025/05/26

I’ve been reading a lot about generalist robotic policies recently. A few days ago, I wrote about Physical Intelligence’s FAST, the tokenizer for their transformer-based policy model $\pi_0$, but I’ve also been looking at a lot of the other open source work, especially work on turning language into controls (for instance, RT-1, RT-2, RT-X, and Octo; all of these are coauthored by Chelsea Finn, a co-founder of Physical Intelligence, who as far as I can tell has worked on approximately every open source generalist model).

But one thing has really surprised me as I’ve read about these policies: they consistently ignore tactile feedback. I haven’t found any papers that train a robotics foundation model on touch, and only the Octo paper and the FuSe paper from January fine-tune a foundation model using touch. Why is it so common to ignore tactile feedback?

Octo mentions tactile feedback and FuSe elaborates with some more experiments, but it’s worth noting that I think they’re actually the same fine-tune: FuSe is a fine-tune of Octo published by a Berkeley lab that collaborated on Octo, and the Octo paper specifically discusses the “Berkeley peg insertion” task (the only one that involved tactile fine-tuning). There’s also TLA, which replaces vision with tactile sensing entirely; it’s an interesting experiment that lets the authors reuse the weights of open models, but throwing away vision seems like a bad idea. So I round to about 1.5 attempts that I know of to train a tactile-vision-language-action model. This seems very strange: if you look at experiments like this one, it’s very clear that humans rely heavily on tactile feedback to perform dexterous tasks. It’s probably extremely easy for you to pull a single dollar bill out of your wallet or open a tupperware container without looking, once the objects you need are already in your hands. Videos like the one I linked suggest that you could not do the same thing nearly as easily without your sense of touch.

Also, there are already papers suggesting that touch is extremely useful for learning robotic manipulation. Here’s one showing that robots with touch can perform dexterous tasks like swiping credit cards and plugging in flash drives at a much higher success rate than robots without it (I believe these are task-specific policies, not a generalist model). Here’s another showing the same thing for tasks like opening an AirPods case or screwing on a nut, with even lower success rates for pure vision. Maybe my favorite is one that gets high success rates on a peg insertion task with 0.4 mm tolerances (cheating somewhat with a conveniently positioned camera).

None of this means that a good vision policy isn’t really important! A good vision policy gets you close enough to touch the right object, for one thing. Vision also frequently tells you when you’ve failed at something, and how, which lets you try again. I think this is the main reason the current generalist models work as well as they do: they can easily see when they miss a grasp or something and just retry that part of the action. So just to be clear: I think vision is important too; touch just seems underrated.

So if touch is underrated, the question is why. It would be ridiculous to think that the people at the generalist robotics labs don’t know it’s important; tactile sensing even shows up in the Octo paper, on exactly the task where touch matters most (I spent a summer working on low-tolerance peg insertion; as far as I know, it is universally acknowledged that you need tactile sensing to solve it reliably).

I also don’t think it’s a data problem. It’s true that most large robotics datasets don’t include force or tactile data, but Isaac Lab and MuJoCo both model contact dynamics that you could use to gather simulated data*, and you could use a two-stage training process: first train on your much larger volume of non-force data, then fine-tune on a mixture of synthetic data and real-world force-rich data. And that’s just the simplest approach; you could try much fancier things, like inferring the force data (or simulation parameters) from video and then adding the force data, simulated or inferred, to your large datasets.
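
For the simulated-data piece, here’s a rough sketch of what pulling contact forces out of MuJoCo’s Python bindings could look like. The scene XML and geom names are made up for illustration; a real pipeline would attach this to an actual robot model and dataset format.

```python
# Sketch: extracting per-contact forces from a MuJoCo rollout so they can be
# logged alongside the usual proprioception/vision observations.
# The scene XML and geom names below are placeholders, not from any real dataset.
import mujoco
import numpy as np

XML = """
<mujoco>
  <worldbody>
    <geom name="floor" type="plane" size="1 1 0.1"/>
    <body name="block" pos="0 0 0.3">
      <freejoint/>
      <geom name="block_geom" type="box" size="0.05 0.05 0.05" mass="0.1"/>
    </body>
  </worldbody>
</mujoco>
"""

model = mujoco.MjModel.from_xml_string(XML)
data = mujoco.MjData(model)

def contact_forces(model, data):
    """Return a list of (geom1, geom2, force_in_contact_frame) for active contacts."""
    forces = []
    for i in range(data.ncon):
        con = data.contact[i]
        wrench = np.zeros(6)
        # mj_contactForce fills [fx, fy, fz, tx, ty, tz] in the contact frame,
        # where the first axis is the contact normal.
        mujoco.mj_contactForce(model, data, i, wrench)
        g1 = mujoco.mj_id2name(model, mujoco.mjtObj.mjOBJ_GEOM, con.geom1)
        g2 = mujoco.mj_id2name(model, mujoco.mjtObj.mjOBJ_GEOM, con.geom2)
        forces.append((g1, g2, wrench[:3]))
    return forces

force_log = []
for _ in range(500):
    mujoco.mj_step(model, data)
    force_log.append(contact_forces(model, data))

# force_log could then be packed into the observation stream the same way
# proprioception is, as a coarse stand-in for real force sensing.
```

The caveat from the footnote applies here: this gives you net normal and friction forces per contact, not the dense tactile images a GelSight-style sensor would produce.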

It doesn’t make sense as a flexibility problem either: you might want your generalist model to require only an absolute minimum number of sensors, but you can always train different output heads for forceful and force-less settings. Anecdotally, I very seldom do much reasoning on my tactile feedback, so I don’t need it as an input to the part of my neural net that contains the bulk of the parameters.
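
To make the separate-heads idea concrete, here’s a toy sketch in PyTorch (the sizes and names are invented, and this isn’t how any published model is actually structured): a shared trunk holds the bulk of the parameters, and a small optional head consumes force features only when they exist.

```python
# Toy sketch of the "shared trunk, optional force head" idea: the big
# sensor-agnostic trunk trains on all data, while a small head that also
# sees tactile/force features is only used where that data exists.
# All sizes and names are illustrative.
import torch
import torch.nn as nn

class GeneralistPolicy(nn.Module):
    def __init__(self, obs_dim=512, force_dim=32, hidden=1024, action_dim=7):
        super().__init__()
        # Bulk of the parameters: vision/language/proprioception only.
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
        )
        # Default head for force-less deployments.
        self.head_no_force = nn.Linear(hidden, action_dim)
        # Small extra head that additionally consumes force/tactile features.
        self.head_force = nn.Linear(hidden + force_dim, action_dim)

    def forward(self, obs_emb, force_feat=None):
        h = self.trunk(obs_emb)
        if force_feat is None:
            return self.head_no_force(h)
        return self.head_force(torch.cat([h, force_feat], dim=-1))

policy = GeneralistPolicy()
obs = torch.randn(4, 512)                      # fused vision/language embeddings
print(policy(obs).shape)                       # force-less robot: torch.Size([4, 7])
print(policy(obs, torch.randn(4, 32)).shape)   # force-equipped robot: torch.Size([4, 7])
```

The point is that supporting touch doesn’t have to mean requiring it: the force head can be fine-tuned on the smaller force-rich dataset while the trunk stays shared across every embodiment.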

So there are only a couple more explanations I can think of, and the one I keep coming back to is the maturity of the software/hardware ecosystem: tactile sensors aren’t commodity hardware the way cameras are, and, as the footnote below notes, the major simulators don’t support dense tactile sensing either.

So, in conclusion, I have no idea why things are the way they are; ecosystem maturity is probably my best guess. If you’re someone from one of the generalist robotics labs, I would love to hear from you.


* As I mentioned above, I think you could get simple contact dynamics like normal and frictional forces out of MuJoCo and Isaac, but they don’t support dense tactile sensing.