Carrying and pouring from open liquid containers without spillage using robot manipulation.
In this article, I will present one of the research problems I am working on and how I plan to tackle it. Shoot me a message if you have questions!
One of the research topics I am looking into is robot safety, especially during manipulation. For mobile robots, safety is largely a collision avoidance problem where classical methods such as MPC work well, and even methods like HJ reachability analysis can be applied directly without too many modifications. Manipulation, on the other hand, is a different case! In manipulation tasks, you interact with entities (humans, objects, surfaces, etc.) whose states are not so easy to define, and the safe vs. failure sets are even harder to represent.
Sketch of safe vs. unsafe set
So, this is one of my main areas of research, and to evaluate various tools and methodologies I need simple tasks with not-so-simple failure cases. One such task is carrying a glass full of water and pouring it into another glass.
Sketch of the problem
This task poses the following challenges, which need to be addressed:
For this experiment, I will be using the Franka Emika Panda Powertool (Figure *). This robot arm was chosen for a number of reasons. Firstly, having a redundant joint (7 DoF) allows for more dexterity, which is useful when handling liquids. Moreover, it has integrated torque sensors, which allow for more comprehensive algorithm design since they can help estimate the liquid left in the container during pouring. The quick data acquisition on the platform is quite useful for training Imitation Learning (IL) policies. It also has an easy ROS2 interface, which is always a bonus! *
In the experiment, an RGB sensor (Intel RealSense) will be used; it will be attached to the end-effector and move along with it. This lets the robot capture information about the quantity of water in the glass from multiple angles, rather than from a strictly top-down/single-angle view. As mentioned, the torque sensors will be used alongside the RGB sensor.
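As a rough illustration of the torque-sensing idea, here is a minimal Python sketch that estimates the liquid remaining in the glass from a gravity-compensated external force reading at the end-effector while the glass is held still. How that reading is obtained, and the empty-glass mass, are assumptions for the sketch, not part of any robot API:

```python
# Sketch: estimate remaining liquid mass from the sensed external wrench.
# Assumes a gravity-compensated downward force reading at the end-effector;
# the glass mass below is an assumed, pre-measured value.
G = 9.81            # gravitational acceleration [m/s^2]
GLASS_MASS = 0.20   # empty glass mass [kg] (assumed known from weighing)

def remaining_liquid_mass(f_ext_z: float) -> float:
    """f_ext_z: downward external force [N] at the end-effector while the
    glass is held statically. Returns the estimated liquid mass [kg]."""
    total_mass = f_ext_z / G              # F = m * g for a static payload
    return max(total_mass - GLASS_MASS, 0.0)

print(remaining_liquid_mass(4.4))         # ~0.25 kg of liquid left
```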
As for the workspace, a simple desk setup will be used. The two glasses will be placed randomly in the workspace, and the robot has to pick up the filled glass and empty it into the empty one. The liquid color, desk color, lighting, and initial amount of liquid in the glass will change between trials to gauge the robustness of the algorithms.
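As a small sketch, the per-trial randomization could be sampled like this; the specific options and ranges below are illustrative assumptions, not finalized settings:

```python
# Sketch of per-trial condition sampling for the robustness evaluation.
# All choices and ranges are illustrative placeholders.
import random

def sample_trial():
    return {
        "liquid_color": random.choice(["red", "green", "blue", "clear"]),
        "desk_color": random.choice(["white", "wood", "black"]),
        "lighting_lux": random.uniform(200, 800),    # indoor lighting range
        "fill_fraction": random.uniform(0.3, 0.9),   # initial liquid level
        "glass_positions": [
            (random.uniform(0.3, 0.6),               # x [m] in the workspace
             random.uniform(-0.25, 0.25))            # y [m] in the workspace
            for _ in range(2)
        ],
    }

print(sample_trial())
```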
For the majority of the experiment (until we are convinced our algorithm works), we will use M&Ms inside the glass instead of liquid, to minimize excessive spillage while still maintaining a somewhat similar pouring effect to water. M&Ms exhibit granular flow, which can be thought of as a discretized version of liquid flow. Even though the "flow" of the two differs, there are enough similarities for the robot to generalize to liquids.
The major simulators used in robotics, like MuJoCo * and Isaac Sim *, support neither high-fidelity fluid simulation nor accurate modelling of granular flow. We will still try basic simulations on these platforms, but it does not look promising.
Another direction we could look into is integrating high-fidelity fluid solvers (like ANSYS Fluent *) into these simulators. Isaac, through its Warp module, offers this capability, but without further testing I cannot say whether it would be beneficial. Furthermore, the time commitment to learn ANSYS Fluent, integrate it with Isaac, and set up the simulation environment would be large. Additionally, this method would be computationally expensive, as the simulation would be solving the nonlinear Navier-Stokes PDEs for a large number of particles at each timestep.
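As an aside, to give a flavor of Warp's programming model, here is a toy kernel that ballistically integrates a set of particles each timestep. This is not a fluid or granular solver, just an illustration of the kind of per-particle kernels Warp provides; the particle count and timestep are arbitrary:

```python
# Toy NVIDIA Warp example: per-particle ballistic integration under gravity.
# NOT a fluid/granular solver; only illustrates Warp's kernel model.
import warp as wp

wp.init()

@wp.kernel
def integrate(pos: wp.array(dtype=wp.vec3),
              vel: wp.array(dtype=wp.vec3),
              dt: float):
    tid = wp.tid()                                        # one thread per particle
    vel[tid] = vel[tid] + wp.vec3(0.0, -9.81, 0.0) * dt   # apply gravity
    pos[tid] = pos[tid] + vel[tid] * dt                   # explicit Euler step

n = 10_000                                 # arbitrary particle count
pos = wp.zeros(n, dtype=wp.vec3)
vel = wp.zeros(n, dtype=wp.vec3)
for _ in range(100):                       # 100 timesteps of 1 ms each
    wp.launch(integrate, dim=n, inputs=[pos, vel, 0.001])
```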
Hence, the simulation route is still under consideration for this case study.
To achieve our task of transferring M&Ms/water from one container to another, we can consider either the traditional robotics pipeline or a learning-based one.
In a traditional pipeline, you have separate and modular components with human-understandable inputs and outputs. These components are usually the following:
- Perception: sensing the scene and estimating its state
- Planning: computing a trajectory or sequence of actions
- Control: tracking that trajectory with low-level controllers
Usually, there would be additional components (like sensor drivers and world modelling), but they can be grouped into the three above. A typical pipeline is depicted in Figure *.
For our task, the pipeline components would be structured as sketched below:
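As a hypothetical illustration (all class and method names below are placeholders, not an existing API), the modular pipeline for this task might be wired together like this:

```python
# Hypothetical modular pipeline for the pouring task.
# Class/method names and dummy values are illustrative placeholders.

class Perception:
    def locate_glasses(self, rgb_frame):
        """Detect both glasses and estimate the liquid level (e.g. color
        segmentation + classical vision). Dummy values stand in here."""
        source_pose = (0.45, -0.10, 0.02)   # x, y, z [m] of the filled glass
        target_pose = (0.50, 0.15, 0.02)    # x, y, z [m] of the empty glass
        fill_estimate = 0.8                 # fraction of the glass filled
        return source_pose, target_pose, fill_estimate

class Planner:
    def plan_pour(self, source, target, fill):
        """Compute a grasp, a transport path, and a tilt profile that crudely
        bounds the pour rate; returned as a waypoint list (placeholder)."""
        return [("grasp", source), ("move", target), ("tilt", 90.0 * fill)]

class Controller:
    def execute(self, trajectory):
        """Track the trajectory with the arm's low-level controller; printing
        stands in here to keep the sketch self-contained."""
        for step in trajectory:
            print("executing:", step)

# One trial, end to end (rgb_frame would come from the wrist camera):
perception, planner, controller = Perception(), Planner(), Controller()
src, dst, fill = perception.locate_glasses(rgb_frame=None)
controller.execute(planner.plan_pour(src, dst, fill))
```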
The aforementioned pipeline has some severe drawbacks. Firstly, it is quite brittle to scene parameters, like the amount of liquid in the container, lighting, and background color/texture. Furthermore, the pipeline does not easily generalize to other tasks; for each new task, the components require significant modification and hand-engineering. It is also hard to implement complex safety specifications (like spillage prevention): the way it is done in this pipeline is very crude, and improving it would require significant hand-engineering (again, a lack of generalizability).
Hence, even though having separate modular components is good for human interpretability, the approach lacks generalizability and requires significant expert knowledge (and intervention).
Another way to approach the problem is to combine one or more of these components and learn the task trajectories (and safety specs) directly from demonstrations, rather than having to specify them mathematically, which can be computationally expensive and brittle.
Usually for this approach, the perception and planning modules are combined to create what is called a visuomotor policy. A visuomotor policy takes in visual observations (RGB frames) and directly outputs coordinates (or velocities) in end-effector space, or joint angles; these values are then passed to the low-level controllers on the robot arm. A policy can also be made to output motor torque values directly, but this is rare in practice due to safety concerns.
The policy (a state-to-action mapping) is represented by a neural network with a pretrained perception backbone, trained using Imitation Learning, specifically Behavior Cloning (BC). BC essentially turns control into a supervised learning problem: given a history of observations, the policy outputs control action(s), which are compared against a human demonstrator's actions (from a recorded trajectory), and the policy weights are adjusted based on the reconstruction loss. You can find a simple training diagram in the Figure below.
Training Sketch
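In code, a BC update is just supervised regression on the demonstrated actions. Below is a minimal PyTorch sketch, assuming a `policy` network that maps observation batches to action batches and a dataloader over demonstration (observation, action) pairs; the hyperparameters are placeholders:

```python
# Minimal Behavior Cloning training loop (PyTorch sketch).
# Epochs, batch source, and learning rate are illustrative assumptions.
import torch
import torch.nn as nn

def train_bc(policy, demo_loader, epochs=50, lr=1e-4):
    """policy: network mapping observation batch -> action batch.
    demo_loader: yields (obs, expert_action) pairs from demonstrations."""
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)
    loss_fn = nn.MSELoss()                     # reconstruction loss on actions
    for _ in range(epochs):
        for obs, expert_action in demo_loader:
            pred_action = policy(obs)          # policy forward pass
            loss = loss_fn(pred_action, expert_action)
            optimizer.zero_grad()
            loss.backward()                    # adjust policy weights
            optimizer.step()
    return policy
```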
Training usually uses 50 to 100 human-demonstrated trajectories, after which the policy is good enough to reliably perform the task.
To collect demonstration trajectories, a simple keyboard teleop interface can be created in ROS (or some other realtime framework), where the observations from the camera and the commands sent to the robot arm (joint velocities or end-effector velocities) are recorded in a ROS Bag (or a CSV file), which can then be used to train the network.
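Below is a minimal rclpy sketch of such a teleop publisher that also logs the commands to a CSV file. The command topic name and key bindings are assumptions, and a real setup would additionally record the camera stream (e.g. into the same ROS Bag) for training:

```python
# Minimal keyboard teleop + command logger (rclpy sketch, Unix terminals only).
# The /ee_velocity_cmd topic and key bindings are hypothetical placeholders.
import sys, termios, tty, csv, time
import rclpy
from rclpy.node import Node
from geometry_msgs.msg import Twist

KEY_BINDINGS = {  # key -> (vx, vy, vz) end-effector velocity [m/s]
    'w': (0.05, 0.0, 0.0), 's': (-0.05, 0.0, 0.0),
    'a': (0.0, 0.05, 0.0), 'd': (0.0, -0.05, 0.0),
    'q': (0.0, 0.0, 0.05), 'e': (0.0, 0.0, -0.05),
}

def read_key():
    """Read a single keypress from stdin without waiting for Enter."""
    fd = sys.stdin.fileno()
    old = termios.tcgetattr(fd)
    try:
        tty.setraw(fd)
        return sys.stdin.read(1)
    finally:
        termios.tcsetattr(fd, termios.TCSADRAIN, old)

def main():
    rclpy.init()
    node = Node('demo_teleop')
    pub = node.create_publisher(Twist, '/ee_velocity_cmd', 10)
    with open('demo_log.csv', 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['stamp', 'vx', 'vy', 'vz'])
        while True:
            key = read_key()
            if key == '\x03':                  # Ctrl-C ends the demonstration
                break
            vx, vy, vz = KEY_BINDINGS.get(key, (0.0, 0.0, 0.0))
            msg = Twist()
            msg.linear.x, msg.linear.y, msg.linear.z = vx, vy, vz
            pub.publish(msg)                   # command the arm
            writer.writerow([time.time(), vx, vy, vz])  # log for training
    node.destroy_node()
    rclpy.shutdown()

if __name__ == '__main__':
    main()
```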
There are some engineering hacks and tips to keep in mind while collecting data *
In the end, if the policy is not working well, it is usually because we do not have enough data. There is a common misconception that a BC policy can never do better than the demonstrator. This is actually false: especially with scale, it is possible to generalize (a bit) and do better than the demonstrator.
The policy architecture has two parts: an encoder and a decoder (see Figure *). The encoder (also called the "backbone") takes the high-dimensional RGB data and turns it into a lower-dimensional latent representation. This latent representation is fed into the decoder to output an action (or action sequence).
For the encoder, we are planning to use a ResNet-18 backbone, as it is the research standard and pretrained models are available.
For the decoder, we have two choices: Diffusion models and Action Chunking Transformers (ACT).
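To make the encoder-decoder split concrete, here is a minimal PyTorch sketch with a pretrained ResNet-18 encoder; in place of a diffusion or ACT decoder, a plain MLP head predicts a short action chunk (the action dimension and chunk length are illustrative assumptions):

```python
# Minimal visuomotor policy sketch: ResNet-18 encoder + MLP action head.
# The MLP stands in for the diffusion/ACT decoders discussed above;
# action_dim and chunk_len are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights

class VisuomotorPolicy(nn.Module):
    def __init__(self, action_dim=7, chunk_len=8):
        super().__init__()
        # Pretrained backbone; drop the classification head to expose
        # the 512-dim latent representation.
        self.encoder = resnet18(weights=ResNet18_Weights.DEFAULT)
        self.encoder.fc = nn.Identity()
        self.decoder = nn.Sequential(            # simple stand-in decoder
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, action_dim * chunk_len),
        )
        self.action_dim, self.chunk_len = action_dim, chunk_len

    def forward(self, rgb):                      # rgb: (B, 3, 224, 224)
        latent = self.encoder(rgb)               # (B, 512) latent
        actions = self.decoder(latent)           # (B, action_dim * chunk_len)
        return actions.view(-1, self.chunk_len, self.action_dim)

policy = VisuomotorPolicy()
obs = torch.randn(1, 3, 224, 224)                # dummy camera frame
print(policy(obs).shape)                          # torch.Size([1, 8, 7])
```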
franka_ros2 and libfranka