Alex Bewley

Publications

Scene-Graph ViT: End-to-End Open-Vocabulary Visual Relationship Detection

A simple and highly efficient decoder-free architecture for open-vocabulary visual relationship detection. Our model consists of a Transformer-based image encoder that represents objects as tokens and models their relationships implicitly. To extract relationship information, we introduce an attention mechanism that selects object pairs likely to form a relationship.

Learning to Learn Faster from Human Feedback with Language Model Predictive Control

When human language inputs are observations and robot code outputs are actions, training a large language model (LLM) to complete previous interactions can be viewed as training a transition dynamics model, which can be combined with classic robotics techniques such as model predictive control (MPC) to discover shorter paths to success. This gives rise to Language Model Predictive Control (LMPC), a framework that fine-tunes PaLM 2 to improve its teachability on 78 tasks across 5 robot embodiments, improving non-expert teaching success rates on unseen tasks by 26.9% while reducing the average number of human corrections from 2.4 to 1.9. Experiments show that LMPC also produces strong meta-learners, improving the success rate of in-context learning of new tasks on unseen robot embodiments and APIs by 31.5%.

Open X-Embodiment: Robotic Learning Datasets and RT-X Models

Winner of best conference paper award at ICRA 2024 (DLR photo)!

We assemble a dataset from 22 different robots collected through a collaboration between 21 institutions, demonstrating 527 skills (160,266 tasks). We show that a high-capacity model trained on this data, which we call RT-X, exhibits positive transfer and improves the capabilities of multiple robots by leveraging experience from other platforms.

Robots That Can See: Leveraging Human Pose for Trajectory Prediction

The proposed Human Scene Transformer observes past human positions, head orientations, and 3D skeletal keypoints using onboard robot sensors to predict future human trajectories. The model captures the inherent uncertainty in these predictions, attains state-of-the-art performance on common prediction benchmarks, and won the ICCV 2023 challenge for end-to-end Human Trajectory Forecasting.

Agile Catching with Whole-Body MPC and Blackbox Policy Learning

This work studies the challenging task of robot catching by presenting the relative merits of two fundamentally different solution strategies: (i) Model Predictive Control using accelerated constrained trajectory optimization, and (ii) Reinforcement Learning using zeroth-order optimization.

Robotic Table Tennis: A Case Study into a High Speed Learning System

This work details the design of a robotic research platform composed of a highly optimized perception subsystem, a high-speed low-latency robot controller, a simulation paradigm that can prevent damage in the real world and also train policies for zero-shot transfer, and automated real world environment resets that enable autonomous training and evaluation on physical robots.

Video OWL-ViT: Temporally-Consistent Open-World Localization in Video

We show successful transfer of open-world models by building on the OWL-ViT open-vocabulary detection model and adapting it to video by adding a transformer decoder. The decoder propagates object representations recurrently through time by using the output tokens for one frame as the object queries for the next.
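
The recurrence described above can be sketched in a few lines. This is only an illustration of the data flow: `track_video`, `decode`, and the toy decoder in the usage note are hypothetical names, and a real decoder would be a transformer operating on image features rather than a simple function.

```python
def track_video(frames, initial_queries, decode):
    """Propagate object tokens through time.

    `decode(frame, queries)` is a stand-in (assumption) for a decoder that
    maps the current frame and a set of object queries to output tokens.
    """
    queries = initial_queries
    outputs = []
    for frame in frames:
        tokens = decode(frame, queries)  # output tokens for this frame
        outputs.append(tokens)
        queries = tokens                 # reused as queries for the next frame
    return outputs


def toy_decode(frame, queries):
    # Toy decoder for demonstration only: adds the frame value to each query.
    return [q + frame for q in queries]
```

Calling `track_video([1, 2, 3], [0, 10], toy_decode)` shows each frame's output feeding the next step, which is what lets an object keep a consistent slot over time.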

i-sim2real: Reinforcement Learning of Robotic Policies in Tight Human-Robot Interaction Loops

Sim-to-real transfer is a powerful paradigm for robotic reinforcement learning. The ability to train policies in simulation enables safe exploration and large-scale data collection quickly at low cost. i-S2R bootstraps from a simple model of human behaviour and alternates between training in simulation and deploying in the real world. In each iteration, both the human behaviour model and the policy are refined, leading to longer rallies.

Local Metrics for Multi-Object Tracking

Local metrics provide an intuitive mechanism to explicitly specify the trade-off between detection and association for evaluating object trackers.

RSN: Range Sparse Net for Efficient, Accurate LiDAR 3D Object Detection

Range Sparse Net (RSN) is a simple, efficient, and accurate framework for real-time 3D object detection using long-range LiDAR. Lightweight 2D convolutions on dense range images select a significantly smaller set of foreground points, enabling the later sparse convolutions in RSN to operate efficiently. RSN runs at more than 60 frames per second on a 150 m × 150 m detection region on the Waymo Open Dataset (WOD) while being more accurate than previously published detectors.

Range Conditioned Dilated Convolutions for Scale Invariant 3D Object Detection

A novel 3D object detection framework that processes LiDAR data directly on its native range image representation. To overcome scale sensitivity in this perspective view, a range-conditioned dilation (RCD) layer is proposed to dynamically adjust a continuous dilation rate as a function of the measured range. Unparalleled performance is achieved at long range detection when combined with a second stage refinement.
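
As a rough illustration of a dilation rate that varies continuously with range, the sketch below assumes an inverse relationship (nearby objects cover more pixels in the range image, so they get a larger dilation). The function name, the constants, and the clipping bounds are all hypothetical and are not taken from the paper.

```python
def dilation_for_range(range_m, nominal_range_m=20.0, nominal_dilation=2.0,
                       min_dilation=0.5, max_dilation=6.0):
    """Return a continuous dilation rate for a measured range in metres.

    Assumed rule (not from the paper): dilation is inversely proportional
    to range, then clipped to a fixed interval.
    """
    if range_m <= 0:
        raise ValueError("range must be positive")
    raw = nominal_dilation * nominal_range_m / range_m
    return max(min_dilation, min(max_dilation, raw))
```

Under these assumed constants, a point at 20 m gets a dilation of 2.0 and a point at 40 m gets 1.0, so the receptive field covers a similar physical extent at both distances.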

Large Scale Outdoor Scene Reconstruction and Correction with Vision

The BOR2G system developed at the Oxford Robotics Institute fuses data from multiple sensor modalities (cameras, lidars, or both) and regularizes the resulting 3D model. We use a compressed 3D data structure which allows us to operate over a large scale. A learned correction mechanism takes the global context of the reconstruction and adjusts the constructed mesh, addressing pathological errors.

Learning to Drive from Simulation without Real World Labels

A method for transferring a vision-based lane following driving policy from simulation to operation on a rural road without any real-world labels. Our approach leverages recent advances in image-to-image translation to achieve domain transfer while jointly learning a single-camera control policy from simulation control labels.

Dropout Distillation for Efficiently Estimating Model Confidence

An efficient way to output better calibrated uncertainty scores from neural networks. The resulting Distilled Dropout Networks make standard (non-Bayesian) neural networks more introspective by adding a new training loss.

Learning to Drive in a Day with Deep Reinforcement Learning

This work demonstrates model-free deep reinforcement learning on an autonomous car in the real world. With a handful of exploration and optimisation steps performed on the single onboard NVIDIA DRIVE PX2, our model-free algorithm learnt to follow its lane without any prior map.

Neural Stethoscopes: Unifying Analytic, Auxiliary and Adversarial Network Probing

This work unifies auxiliary tasks, adversarial information removal and side tasks analysis with a single multi-task learning framework we call neural stethoscopes. Neural stethoscopes are then used to interrogate specific visual cues a network learns in the context of intuitive physics. Furthermore, we are able to actively de-bias network predictions as well as enhance performance via suitable auxiliary and adversarial stethoscope losses.

Deep Cosine Metric Learning for Person Re-Identification

This work presents a method for learning a feature embedding where the cosine similarity is effectively optimised through a simple re-parametrization of the conventional softmax classification regime. At test time, the final classification layer can be stripped off the network, facilitating nearest neighbour queries on unseen individuals using the cosine similarity metric.
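
The test-time query can be illustrated with a small sketch. It assumes embeddings are already available as plain lists of floats; the function names and the dictionary-based gallery are hypothetical and stand in for whatever the trained network produces.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine of the angle between two embeddings."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def nearest_identity(query, gallery):
    """Return the gallery key whose embedding is most similar to the query.

    `gallery` maps an identity label to its embedding (assumed layout).
    """
    return max(gallery, key=lambda k: cosine_similarity(query, gallery[k]))
```

Because the similarity depends only on direction, identities never seen during training can be added to the gallery without retraining a classifier.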

Incremental Adversarial Domain Adaptation for Continually Changing Environments

Continuous appearance shifts such as changes in weather and lighting conditions can impact the performance of deployed machine learning models. Unsupervised domain adaptation aims to address this challenge, though current approaches do not utilise the continuity of the occurring shifts. This work presents an adversarial approach for lifelong, incremental domain adaptation which benefits from unsupervised alignment to a series of sub-domains which successively diverge from the labelled source domain.

Meshed Up: Learnt Error Correction in 3D Reconstructions

Dense reconstructions often contain errors that prior work has so far minimised using high quality sensors and regularising the output. Nevertheless, errors still persist. This paper proposes a machine learning technique to identify errors in three dimensional (3D) meshes. Beyond simply identifying errors, our method quantifies both the magnitude and the direction of depth estimate errors when viewing the scene.

Hierarchical Attentive Recurrent Tracking

Inspired by how the human visual cortex employs spatial attention and separate “where” and “what” processing pathways to actively suppress irrelevant visual features, this work develops a hierarchical attentive recurrent model for single object tracking in videos.

DeepSORT: Simple Online and Realtime Tracking with a Deep Association Metric

Building on the success of the SORT tracking framework, this work extends the location based tracker with appearance based association optimised via metric learning on a deep neural network.

Addressing Appearance Change in Outdoor Robotics with Adversarial Domain Adaptation

Appearance changes due to weather and seasonal conditions represent a strong impediment to the robust implementation of machine learning systems in outdoor robotics. This work develops a framework for applying adversarial techniques to adapt popular, state-of-the-art network architectures with the additional objective to be invariant across conditions.

What Makes a Place? Building Bespoke Place Dependent Object Detectors for Robotics

This paper is about enabling robots to improve their perceptual performance through repeated use in their operating environment, creating local expert detectors fitted to the places through which a robot moves.

Vision based Detection and Tracking in Dynamic Environments with Minimal Supervision

My PhD thesis in the format of thesis-by-publication, composed mainly of papers completed between 2013 and 2016. Submitted in late 2016, accepted in 2017 and finally published publicly in 2018.

SORT: Simple Online and Realtime Tracking

This work presents a fast, yet simple, technique for updating trajectory estimates within an online multiple object tracking framework. Furthermore, the impact of detection quality on tracking is highlighted by achieving state-of-the-art performance on a recent tracking benchmark.
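
The summary above does not spell out the association step, so the following is only a minimal sketch of frame-to-frame association, assuming tracks and detections are matched by bounding-box overlap (IoU). The greedy matcher is a simplification chosen for brevity, not the published method, and all names and the `min_iou` threshold are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def greedy_associate(tracks, detections, min_iou=0.3):
    """Match predicted track boxes to detections, best overlap first.

    Returns a list of (track_index, detection_index) pairs. Greedy matching
    is used here instead of an optimal assignment solver.
    """
    pairs = sorted(
        ((iou(t, d), ti, di)
         for ti, t in enumerate(tracks)
         for di, d in enumerate(detections)),
        reverse=True,
    )
    matches, used_tracks, used_dets = [], set(), set()
    for score, ti, di in pairs:
        if score < min_iou:
            break  # remaining pairs overlap too little to be the same object
        if ti in used_tracks or di in used_dets:
            continue
        matches.append((ti, di))
        used_tracks.add(ti)
        used_dets.add(di)
    return matches
```

Unmatched detections would typically start new tracks and unmatched tracks would eventually be dropped; that bookkeeping is left out of this sketch.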

Background Modelling with Applications to Visual Object Detection in an Open Pit Mine

This work investigates the use of appearance based object detection in an open pit mine. Various forms of background modelling techniques are explored for adapting a pretrained detector to the novel environment.

ALExTRAC: Affinity Learning by Exploring Temporal Reinforcement within Association Chains

This paper presents a self-supervised approach for learning to associate object detections in a video sequence as often required in tracking-by-detection systems.

Fine-Grained Classification via Mixture of Deep Convolutional Neural Networks

A novel deep convolutional neural network (DCNN) architecture is proposed for fine-grained image classification. This architecture, called MixDCNN, combines the output of several DCNNs within a mixture model framework and is shown to outperform other methods.

From ImageNet to Mining: Adapting Visual Object Detection with Minimal Supervision

A background modeling approach to reducing the false positive rate of a pre-trained object detector for use in an open-pit mining environment.

Fine-Grained Bird Species Recognition via Hierarchical Subset Learning

This paper presents a novel method to improve fine-grained classification based on hierarchical subset learning. First a similarity tree is formed where classes with strong visual correlations are grouped into subsets. An expert local classifier with strong discriminative power to distinguish visually similar classes is then learnt for each subset.

Online Self-Supervised Multi-Instance Segmentation of Dynamic Objects

A training-free method for detecting and tracking moving objects is presented and evaluated with video footage from a moving camera.

Advantages of Exploiting Projection Structure for Segmenting Dense 3D Point Clouds

A simple, yet efficient method for finding nearest neighbours in projected 3D point clouds is presented with applications towards object segmentation.
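
One way to picture the idea, assuming the point cloud is stored as an organised grid that mirrors the sensor's projection: candidate neighbours are taken from adjacent grid cells and then checked by 3D distance, which avoids a search over the whole cloud. The grid layout, function name, and parameters below are hypothetical.

```python
import math


def grid_neighbours(points, row, col, radius_m, window=1):
    """Find 3D neighbours of a point by scanning nearby cells of its 2D grid.

    `points[r][c]` is an (x, y, z) tuple or None for a missing return
    (assumed layout). Only cells within `window` of (row, col) are checked.
    """
    centre = points[row][col]
    if centre is None:
        return []
    found = []
    for r in range(max(0, row - window), min(len(points), row + window + 1)):
        for c in range(max(0, col - window),
                       min(len(points[r]), col + window + 1)):
            candidate = points[r][c]
            if (r, c) == (row, col) or candidate is None:
                continue
            if math.dist(centre, candidate) <= radius_m:
                found.append((r, c))
    return found
```

The cost per query depends only on the window size, not on the number of points in the cloud.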

Development of a Dragline In-Bucket Bulk Density Monitor

This paper details the implementation and trialling of a prototype in-bucket bulk density monitor on a production dragline.

Real-Time Volume Estimation of a Dragline Payload

This paper presents a method for measuring the in-bucket payload volume on a dragline excavator for the purpose of estimating material bulk density in real-time.
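
The relationship between the measured volume and bulk density is simple arithmetic: density is mass divided by volume. The sketch below assumes the payload mass comes from a separate measurement, which the summary above does not describe; the function name and units are illustrative.

```python
def bulk_density(payload_mass_kg, payload_volume_m3):
    """Bulk density in kg/m^3 from payload mass and in-bucket volume.

    The mass is assumed to be supplied by some other measurement.
    """
    if payload_volume_m3 <= 0:
        raise ValueError("volume must be positive")
    return payload_mass_kg / payload_volume_m3
```

For example, a 90,000 kg payload occupying 50 cubic metres gives 1,800 kg per cubic metre.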