I’ve been reading a lot about generalist robotic policies recently. A few days ago, I wrote about Physical Intelligence’s FAST, the tokenizer for their transformer-based policy model $\pi_0$. But I’ve also been looking at a lot of the other open source work, especially work on turning language into controls (for instance, RT-1, RT-2, RT-X, and Octo - all of which are coauthored by Chelsea Finn, a co-founder of Physical Intelligence who, as far as I can tell, has worked on approximately every open source generalist model).
But one thing has really surprised me as I’ve read about these policies: they consistently ignore tactile feedback. I haven’t found any papers that train a robotics foundation model on touch, and only the Octo paper and the FuSe paper from January fine-tune a foundation model using touch. Why is it so common to ignore tactile feedback?
Octo mentions tactile feedback and FuSe elaborates on some more experiments with it, but it’s worth saying that I think they’re actually the same fine-tune: FuSe is a fine-tune of Octo published by a lab at Berkeley that collaborated on Octo, and the Octo paper specifically talks about the “Berkeley peg insertion” task (which is the only one that involved the tactile fine-tuning). There’s also TLA, which replaces vision with tactile sensing - an interesting experiment that lets the authors reuse the weights of open models, but dropping vision entirely seems like a bad idea. So I round to about 1.5 attempts that I know of to train a tactile-vision-language-action model. This seems very strange: if you look at experiments like this one, it seems very clear that humans rely heavily on tactile feedback to perform dexterous tasks. It’s probably extremely easy for you to pull a single dollar bill out of your wallet or open a Tupperware container without looking, once you already have the objects you need in your hands. Videos like the one I linked suggest that you could not do the same thing nearly as easily without feeling.
Also, there are already papers suggesting that touch is extremely useful in learning robotic manipulation. Here’s one showing that robots with touch can perform dexterous tasks like swiping credit cards and plugging in flash drives at a much higher success rate than robots without it (I think these are task-specific policies, not a generalist model). Here’s another one showing the same thing for tasks like opening an AirPods case or screwing on a nut, with even lower success rates for pure vision. Maybe my favorite is one that gets high success rates on a peg insertion task with tolerances of 0.4mm (cheating somewhat with a conveniently positioned camera).
To be fair, these papers also show that having a good vision policy is really important! A good vision policy lets you get close enough to touch the right object, for one thing. Vision also frequently lets you know when you’ve failed at something, and how, which lets you try again. I think this is the main reason the current generalist models work as well as they do: they can easily see when they miss a grasp or something and just retry that part of the action. So just to be clear: I think vision is important too, it just seems like touch is underrated.
If touch is underrated, the question is still why. It would be ridiculous to think that the people at the generalist robotics labs don’t know it’s important; it’s even in the Octo paper, on the exact task where touch matters most (I spent a summer working on low-tolerance peg insertion; as far as I know, it is universally acknowledged that you need tactile sensing to solve it reliably).
I also don’t think it’s a data problem. It’s true that most large robotics datasets don’t have force or tactile data, but Isaac Lab and MuJoCo both have contact dynamics that you could use to gather simulated data*, and you could try a two-step training process where you first pretrain on your much larger volume of non-force data and then fine-tune on a mixture of synthetic and real-world force-rich data. And that’s just the simplest approach; you could try much fancier things, like inferring force data (or simulation parameters) from video and then folding the simulated or inferred forces into your large datasets.
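To make the simulated-data idea concrete, here’s a minimal sketch (using MuJoCo’s Python bindings, with a toy scene I made up for illustration) of how you could read per-timestep normal and frictional contact forces out of the simulator and log them alongside whatever observations and actions you’re already recording. A real pipeline would swap in an actual manipulation scene and a proper dataset format, but the contact readout itself really is this small.

```python
# Minimal sketch: step a MuJoCo scene and record per-step contact forces.
# The XML is a stand-in toy scene (a box dropping onto a plane), not a real
# manipulation setup.
import mujoco
import numpy as np

XML = """
<mujoco>
  <worldbody>
    <geom name="floor" type="plane" size="1 1 0.1"/>
    <body name="block" pos="0 0 0.2">
      <freejoint/>
      <geom name="block_geom" type="box" size="0.05 0.05 0.05"/>
    </body>
  </worldbody>
</mujoco>
"""

model = mujoco.MjModel.from_xml_string(XML)
data = mujoco.MjData(model)

log = []  # per-step contact records to merge into a training dataset
for step in range(500):
    mujoco.mj_step(model, data)
    forces = []
    for i in range(data.ncon):
        con = data.contact[i]
        wrench = np.zeros(6)
        # Force/torque in the contact frame; the first component is the
        # normal force, the next two are tangential (frictional) components.
        mujoco.mj_contactForce(model, data, i, wrench)
        forces.append({
            "geoms": (con.geom1, con.geom2),
            "pos": con.pos.copy(),
            "normal_force": wrench[0],
            "friction_force": wrench[1:3].copy(),
        })
    log.append(forces)
```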
It doesn’t make sense as a flexibility problem either: you might want your generalist model to require only an absolute minimum number of sensors, but you can always train different output heads for forceful and forceless settings. Anecdotally, I very seldom do reasoning on my tactile feedback, so I don’t need it as an input to the part of my neural net that contains the bulk of the parameters.
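As a sketch of what I mean (all the names, sizes, and the MLP trunk here are made up for illustration; the real generalist models are transformers over tokenized observations and actions), here’s the routing idea in PyTorch: a shared trunk holds almost all of the parameters, a tiny tactile encoder is fused in only when tactile input exists, and robots without touch sensors simply never use the tactile head.

```python
# Rough sketch of the "separate heads" idea. Dimensions and module names are
# placeholders, not anything from a real generalist policy.
import torch
import torch.nn as nn

class TwoHeadPolicy(nn.Module):
    def __init__(self, obs_dim=512, tactile_dim=64, action_dim=7):
        super().__init__()
        # Big shared trunk: where the bulk of the parameters live.
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim, 1024), nn.GELU(),
            nn.Linear(1024, 1024), nn.GELU(),
        )
        # Small tactile encoder, only used by the forceful head.
        self.tactile_encoder = nn.Sequential(
            nn.Linear(tactile_dim, 128), nn.GELU(),
        )
        self.head_forceless = nn.Linear(1024, action_dim)
        self.head_forceful = nn.Linear(1024 + 128, action_dim)

    def forward(self, obs_features, tactile=None):
        z = self.trunk(obs_features)
        if tactile is None:
            return self.head_forceless(z)          # no touch sensors available
        t = self.tactile_encoder(tactile)
        return self.head_forceful(torch.cat([z, t], dim=-1))

policy = TwoHeadPolicy()
vision_only = policy(torch.randn(1, 512))                     # forceless robot
with_touch = policy(torch.randn(1, 512), torch.randn(1, 64))  # tactile-equipped robot
```

And this plays nicely with the two-step training idea above: when fine-tuning for a touch-equipped robot, you could keep the trunk mostly frozen and let the tactile encoder and forceful head absorb the force-rich data.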
So there are only a few other explanations I can think of:
- Dense tactile sensing hardware/software: the labs are convinced that dense tactile sensing is what actually matters, and dense tactile sensing is still somewhat in its infancy in terms of both hardware and software. It seems to me like the best dense tactile sensors are the (very clever) GelSight sensors, which use a camera behind a gel pad to see visually which parts of the gel are being deformed and by how much. Classic robotics simulation software like Isaac Lab and MuJoCo doesn’t support dense tactile sensors, but there are papers/projects like DiffTactile that do. So maybe the labs are waiting for that ecosystem to become more mature? This doesn’t quite make sense to me because it doesn’t seem like it would be that hard to push things forward yourself (and if you kept it closed-source, you could actually gain a competitive advantage by doing this). So maybe the answer is that the labs are currently partway through work on their closed-source contact dynamics software and we’ll see cool results from it soon?
- It’s actually trivial: the labs are convinced that all the hard problems to solve are covered by vision, language, and action. Tactile sensing will be trivially easy to fine-tune into the models at any time (as it seems to have been for Octo) and will have a predictable impact on performance. Given this, it’s just a waste of parameters to train it into research models that are targeting the actually important issues. This doesn’t make sense to me because very few startups are in the business of ignoring low-hanging fruit, and it’s really silly to imagine that you can perfectly predict the impact of tactile sensing on performance.
- The labs are actually doing this but they’re keeping it secret. But why would you maintain two research pipelines and show people the results from the worse one?
- Tactile sensing is actually not helpful. I would be shocked if this were true - at the very least it should help substantially with things like missed grasps - but I guess it’s possible.
- The EMH (efficient-market hypothesis) is false. Labs all suspect tactile sensing would be helpful, all believe that it would give them an edge to start gathering tactile data, etc., but none of them have gotten around to it yet because there are lots of things to work on.
So, in conclusion, I have no idea why things are the way they are. The immaturity of the software/hardware ecosystem is probably my best guess? If you’re someone from one of the generalist robotics labs, I would love to hear from you.
* As I mention in the hardware/software bullet above, I think you could get simple contact quantities like normal and frictional forces from MuJoCo and Isaac, but they don’t support dense tactile sensing.