Carrying and pouring from open liquid containers without spillage using robot manipulation.
In this article, I will present one of the research problems I am working on and how I plan to tackle it. Shoot me a message if you have questions!
One of the research topics I am looking into is robot safety, especially during manipulation. For mobile robots, safety is largely a collision avoidance problem where classical methods such as MPC work well, and even methods like HJ reachability analysis can be applied directly without too many modifications. Manipulation, on the other hand, is a different case! In manipulation tasks, you interact with entities (humans, objects, surfaces, etc.) whose states are not so easy to define, and the safe vs. failure sets are even harder to represent.
Sketch of safe vs. unsafe set
So, this is one of my main areas of research, and to evaluate various tools and methodologies I need simple tasks with not-so-simple failure cases. One such task is carrying a glass full of water and pouring it into another glass.
Sketch of the problem
This task poses the following challenges, which need to be addressed:
For this experiment, I will be using the Franka Emika Panda Powertool (Figure *). This robot arm was chosen for a number of reasons. Firstly, having a redundant joint (7 DoF) allows for more dexterity, which is useful when handling liquids. Moreover, it has integrated torque sensors, which allow for more comprehensive algorithm design since they can help estimate the liquid left in the container during pouring. The quick data acquisition on the platform is quite useful for training Imitation Learning (IL) policies. It also has an easy ROS2 interface, which is always a bonus! *
In the experiment, an RGB sensor (Intel RealSense) will be used; it will be attached to the end-effector and move along with it. This lets the robot capture information about the quantity of water in the glass from multiple angles, rather than from a strictly top-down/single-angle view. As mentioned, the torque sensors will be used alongside the RGB sensor.
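As a rough illustration of the torque-sensing idea, here is a minimal Python sketch that estimates the liquid remaining in the glass from a gravity-compensated external force reading at the end-effector while the glass is held still. How that reading is obtained, and the empty-glass mass, are assumptions for the sketch, not part of any robot API:

```python
# Sketch: estimate remaining liquid mass from the sensed external wrench.
# Assumes a gravity-compensated downward force reading at the end-effector;
# the glass mass below is an assumed, pre-measured value.
G = 9.81            # gravitational acceleration [m/s^2]
GLASS_MASS = 0.20   # empty glass mass [kg] (assumed known from weighing)

def remaining_liquid_mass(f_ext_z: float) -> float:
    """f_ext_z: downward external force [N] at the end-effector while the
    glass is held statically. Returns the estimated liquid mass [kg]."""
    total_mass = f_ext_z / G              # F = m * g for a static payload
    return max(total_mass - GLASS_MASS, 0.0)

print(remaining_liquid_mass(4.4))         # ~0.25 kg of liquid left
```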
As for the workspace, a simple desk setup will be used. The two glasses will be placed randomly in the workspace, and the robot has to pick up the filled glass and empty it into the empty one. The liquid color, desk color, lighting, and initial amount of liquid in the glass will change between trials to gauge the robustness of the algorithms.
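As a small sketch, the per-trial randomization could be sampled like this; the specific options and ranges below are illustrative assumptions, not finalized settings:

```python
# Sketch of per-trial condition sampling for the robustness evaluation.
# All choices and ranges are illustrative placeholders.
import random

def sample_trial():
    return {
        "liquid_color": random.choice(["red", "green", "blue", "clear"]),
        "desk_color": random.choice(["white", "wood", "black"]),
        "lighting_lux": random.uniform(200, 800),    # indoor lighting range
        "fill_fraction": random.uniform(0.3, 0.9),   # initial liquid level
        "glass_positions": [
            (random.uniform(0.3, 0.6),               # x [m] in the workspace
             random.uniform(-0.25, 0.25))            # y [m] in the workspace
            for _ in range(2)
        ],
    }

print(sample_trial())
```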
For the majority of the experiment (until we are convinced our algorithm works), we will use M&Ms inside the glass instead of liquid, to minimize excessive spillage while still maintaining a somewhat similar pouring effect to water. M&Ms exhibit granular flow, which can be thought of as a discretized version of liquid flow. Even though the "flow" of the two differs, there are enough similarities for the robot to generalize to liquids.
The major simulators used in robotics, like MuJoCo * and Isaac Sim *, support neither high-fidelity fluid simulation nor accurate modelling of granular flow. We will still try basic simulations on these platforms, but it does not look promising.
Another direction we could look into is integrating high-fidelity fluid solvers (like ANSYS Fluent *) into these simulators. Isaac, through its Warp module, offers this capability, but without further testing I cannot say whether it would be beneficial. Furthermore, the time commitment to learn ANSYS Fluent, integrate it with Isaac, and set up the simulation environment would be large. Additionally, this method would be computationally expensive, as the simulation would be solving the nonlinear Navier-Stokes PDEs for a large number of particles at each timestep.
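As an aside, to give a flavor of Warp's programming model, here is a toy kernel that ballistically integrates a set of particles each timestep. This is not a fluid or granular solver, just an illustration of the kind of per-particle kernels Warp provides; the particle count and timestep are arbitrary:

```python
# Toy NVIDIA Warp example: per-particle ballistic integration under gravity.
# NOT a fluid/granular solver; only illustrates Warp's kernel model.
import warp as wp

wp.init()

@wp.kernel
def integrate(pos: wp.array(dtype=wp.vec3),
              vel: wp.array(dtype=wp.vec3),
              dt: float):
    tid = wp.tid()                                        # one thread per particle
    vel[tid] = vel[tid] + wp.vec3(0.0, -9.81, 0.0) * dt   # apply gravity
    pos[tid] = pos[tid] + vel[tid] * dt                   # explicit Euler step

n = 10_000                                 # arbitrary particle count
pos = wp.zeros(n, dtype=wp.vec3)
vel = wp.zeros(n, dtype=wp.vec3)
for _ in range(100):                       # 100 timesteps of 1 ms each
    wp.launch(integrate, dim=n, inputs=[pos, vel, 0.001])
```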
Hence, the simulation route is still under consideration for this case study.
To achieve our task of transferring M&Ms/water from one container to another, we can consider either the traditional robotics pipeline or a learning-based one.
In a traditional pipeline, you have separate and modular components with human-understandable inputs and outputs. These components are usually the following:
- Perception: sensing the scene and estimating its state
- Planning: computing a trajectory or sequence of actions
- Control: tracking that trajectory with low-level controllers
Usually, there would be additional components (like sensor drivers and world modelling), but they can be grouped into the three above. A typical pipeline is depicted in Figure *.
For our task, the pipeline components would be structured as sketched below:
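As a hypothetical illustration (all class and method names below are placeholders, not an existing API), the modular pipeline for this task might be wired together like this:

```python
# Hypothetical modular pipeline for the pouring task.
# Class/method names and dummy values are illustrative placeholders.

class Perception:
    def locate_glasses(self, rgb_frame):
        """Detect both glasses and estimate the liquid level (e.g. color
        segmentation + classical vision). Dummy values stand in here."""
        source_pose = (0.45, -0.10, 0.02)   # x, y, z [m] of the filled glass
        target_pose = (0.50, 0.15, 0.02)    # x, y, z [m] of the empty glass
        fill_estimate = 0.8                 # fraction of the glass filled
        return source_pose, target_pose, fill_estimate

class Planner:
    def plan_pour(self, source, target, fill):
        """Compute a grasp, a transport path, and a tilt profile that crudely
        bounds the pour rate; returned as a waypoint list (placeholder)."""
        return [("grasp", source), ("move", target), ("tilt", 90.0 * fill)]

class Controller:
    def execute(self, trajectory):
        """Track the trajectory with the arm's low-level controller; printing
        stands in here to keep the sketch self-contained."""
        for step in trajectory:
            print("executing:", step)

# One trial, end to end (rgb_frame would come from the wrist camera):
perception, planner, controller = Perception(), Planner(), Controller()
src, dst, fill = perception.locate_glasses(rgb_frame=None)
controller.execute(planner.plan_pour(src, dst, fill))
```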
The aforementioned pipeline has some severe drawbacks. Firstly, it is quite brittle to scene parameters, like the amount of liquid in the container, lighting, and background color/texture. Furthermore, the pipeline does not easily generalize to other tasks; for each new task, the components require significant modification and hand-engineering. It is also hard to implement complex safety specifications (like spillage prevention): the way it is done in this pipeline is very crude, and improving it would require significant hand-engineering (again, a lack of generalizability).
Hence, even though having separate modular components is good for human interpretability, the approach lacks generalizability and requires significant expert knowledge (and intervention).
Another way to approach the problem is to combine one or more of these components and learn the task trajectories (and safety specs) directly from demonstrations, rather than having to specify them mathematically, which can be computationally expensive and brittle.
Usually for this approach, the perception and planning modules are combined to create what is called a visuomotor policy. A visuomotor policy takes in visual observations (RGB frames) and directly outputs coordinates (or velocities) in end-effector space, or joint angles; these values are then passed to the low-level controllers on the robot arm. A policy can also be made to output motor torque values directly, but this is rare in practice due to safety concerns.
The policy (a state-to-action mapping) is represented by a neural network with a pretrained perception backbone, trained using Imitation Learning, specifically Behavior Cloning (BC). BC essentially turns control into a supervised learning problem: given a history of observations, the policy outputs control action(s), which are compared against a human demonstrator's actions (from a recorded trajectory), and the policy weights are adjusted based on the reconstruction loss. You can find a simple training diagram in the Figure below.
Training Sketch
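In code, a BC update is just supervised regression on the demonstrated actions. Below is a minimal PyTorch sketch, assuming a `policy` network that maps observation batches to action batches and a dataloader over demonstration (observation, action) pairs; the hyperparameters are placeholders:

```python
# Minimal Behavior Cloning training loop (PyTorch sketch).
# Epochs, batch source, and learning rate are illustrative assumptions.
import torch
import torch.nn as nn

def train_bc(policy, demo_loader, epochs=50, lr=1e-4):
    """policy: network mapping observation batch -> action batch.
    demo_loader: yields (obs, expert_action) pairs from demonstrations."""
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)
    loss_fn = nn.MSELoss()                     # reconstruction loss on actions
    for _ in range(epochs):
        for obs, expert_action in demo_loader:
            pred_action = policy(obs)          # policy forward pass
            loss = loss_fn(pred_action, expert_action)
            optimizer.zero_grad()
            loss.backward()                    # adjust policy weights
            optimizer.step()
    return policy
```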
Training usually uses 50 to 100 human-demonstrated trajectories, after which the policy is good enough to reliably perform the task.
To collect demonstration trajectories, a simple keyboard teleop interface can be created in ROS (or some other realtime framework), where the observations from the camera and the commands sent to the robot arm (joint velocities or end-effector velocities) are recorded in a ROS Bag (or a CSV file), which can then be used to train the network.
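Below is a minimal rclpy sketch of such a teleop publisher that also logs the commands to a CSV file. The command topic name and key bindings are assumptions, and a real setup would additionally record the camera stream (e.g. into the same ROS Bag) for training:

```python
# Minimal keyboard teleop + command logger (rclpy sketch, Unix terminals only).
# The /ee_velocity_cmd topic and key bindings are hypothetical placeholders.
import sys, termios, tty, csv, time
import rclpy
from rclpy.node import Node
from geometry_msgs.msg import Twist

KEY_BINDINGS = {  # key -> (vx, vy, vz) end-effector velocity [m/s]
    'w': (0.05, 0.0, 0.0), 's': (-0.05, 0.0, 0.0),
    'a': (0.0, 0.05, 0.0), 'd': (0.0, -0.05, 0.0),
    'q': (0.0, 0.0, 0.05), 'e': (0.0, 0.0, -0.05),
}

def read_key():
    """Read a single keypress from stdin without waiting for Enter."""
    fd = sys.stdin.fileno()
    old = termios.tcgetattr(fd)
    try:
        tty.setraw(fd)
        return sys.stdin.read(1)
    finally:
        termios.tcsetattr(fd, termios.TCSADRAIN, old)

def main():
    rclpy.init()
    node = Node('demo_teleop')
    pub = node.create_publisher(Twist, '/ee_velocity_cmd', 10)
    with open('demo_log.csv', 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['stamp', 'vx', 'vy', 'vz'])
        while True:
            key = read_key()
            if key == '\x03':                  # Ctrl-C ends the demonstration
                break
            vx, vy, vz = KEY_BINDINGS.get(key, (0.0, 0.0, 0.0))
            msg = Twist()
            msg.linear.x, msg.linear.y, msg.linear.z = vx, vy, vz
            pub.publish(msg)                   # command the arm
            writer.writerow([time.time(), vx, vy, vz])  # log for training
    node.destroy_node()
    rclpy.shutdown()

if __name__ == '__main__':
    main()
```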
There are some engineering hacks and tips to keep in mind while collecting data *
In the end, if the policy is not working well, it is usually because we do not have enough data. There is a common misconception that a BC policy can never do better than the demonstrator. This is actually false: especially with scale, it is possible to generalize (a bit) and do better than the demonstrator.
The policy architecture has two parts: an encoder and a decoder (see Figure *). The encoder (also called the "backbone") takes the high-dimensional RGB data and turns it into a lower-dimensional latent representation. This latent representation is fed into the decoder to output an action (or action sequence).
For the encoder, we are planning to use a ResNet-18 backbone, as it is the research standard and pretrained models are available.
For the decoder, we have two choices: Diffusion models and Action Chunking Transformers (ACT).
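To make the encoder-decoder split concrete, here is a minimal PyTorch sketch with a pretrained ResNet-18 encoder; in place of a diffusion or ACT decoder, a plain MLP head predicts a short action chunk (the action dimension and chunk length are illustrative assumptions):

```python
# Minimal visuomotor policy sketch: ResNet-18 encoder + MLP action head.
# The MLP stands in for the diffusion/ACT decoders discussed above;
# action_dim and chunk_len are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights

class VisuomotorPolicy(nn.Module):
    def __init__(self, action_dim=7, chunk_len=8):
        super().__init__()
        # Pretrained backbone; drop the classification head to expose
        # the 512-dim latent representation.
        self.encoder = resnet18(weights=ResNet18_Weights.DEFAULT)
        self.encoder.fc = nn.Identity()
        self.decoder = nn.Sequential(            # simple stand-in decoder
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, action_dim * chunk_len),
        )
        self.action_dim, self.chunk_len = action_dim, chunk_len

    def forward(self, rgb):                      # rgb: (B, 3, 224, 224)
        latent = self.encoder(rgb)               # (B, 512) latent
        actions = self.decoder(latent)           # (B, action_dim * chunk_len)
        return actions.view(-1, self.chunk_len, self.action_dim)

policy = VisuomotorPolicy()
obs = torch.randn(1, 3, 224, 224)                # dummy camera frame
print(policy(obs).shape)                          # torch.Size([1, 8, 7])
```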
franka_ros2 and libfranka