Human-Robot Visual Collaboration

Human Data and Prediction Pipeline Developed in the Paper:
Asking for Help with the Right Question by Predicting Human Visual Performance

Hong (Herbert) Cai and Yasamin Mostofi
(Presented at Robotics: Science and Systems 2016)

When referring to this site, please refer to its DOI address:

We are making publicly available our data and code for predicting human visual performance. More specifically, this page contains the download link and documentation for the human data and machine learning-based prediction pipeline developed in our paper. For more details on how we collected our human data and trained a machine learning pipeline, please refer to the project page and the paper.

If you have any questions or comments regarding the data and/or the trained models, please contact Herbert Cai.

  • Credits and Usage

    The code/data are owned by UCSB and can be used for academic purposes only.

    If you use this data and/or code in your work, please refer readers to this data and code site at its DOI address, and also cite the following paper:

    title={Asking for Help with the Right Question by Predicting Human Visual Performance},
    author={H. Cai and Y. Mostofi},
    booktitle={Proceedings of Robotics: Science and Systems},

    1. Data Set

    The data set contains a total of 3000 images collected from NOAA, SUN, and PASCAL VOC. Each image contains at least one human. The images fall into three categories: 1) easy images, where the human presence is obvious, 2) images where the human is in a cluttered environment, and 3) images where the human appears small (due to distance). We have also manually darkened the images to simulate nighttime scenarios.
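
    The darkening step can be illustrated with a minimal sketch. This page does not specify how the images were darkened, so the uniform brightness scaling below, the function name darken, and the factor value are all assumptions made for illustration only.

```python
def darken(pixels, factor=0.3):
    """Scale every channel of an 8-bit RGB image by `factor`.

    `pixels` is a list of rows, each row a list of (R, G, B) tuples.
    Uniform scaling is an assumed stand-in for the darkening used on
    the actual data set, which is not documented on this page.
    """
    if not 0.0 <= factor <= 1.0:
        raise ValueError("factor must be in [0, 1]")
    return [
        [tuple(int(channel * factor) for channel in pixel) for pixel in row]
        for row in pixels
    ]


# Tiny 2x2 example image (values are made up).
image = [
    [(200, 180, 160), (255, 255, 255)],
    [(0, 0, 0), (100, 50, 10)],
]
dark = darken(image, factor=0.5)
print(dark[0][0])
```

    A smaller factor gives a darker image; a factor of 1.0 returns the pixel values unchanged.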

    2. Training and Validation Data

    We used 2400 (80%) of the images for training and 600 (20%) for validation.
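
    The 80/20 split can be reproduced in spirit with a short sketch. How the original split was drawn (for example, randomly or per category) is not stated here, so the seeded random shuffle, the function name split_dataset, and the file names below are assumptions.

```python
import random


def split_dataset(items, train_fraction=0.8, seed=0):
    """Shuffle `items` with a fixed seed and split them into two lists."""
    items = list(items)
    random.Random(seed).shuffle(items)  # seeded so the split is repeatable
    n_train = int(round(len(items) * train_fraction))
    return items[:n_train], items[n_train:]


# Hypothetical file names standing in for the 3000 images.
image_names = ["img_%04d.jpg" % i for i in range(3000)]
train, val = split_dataset(image_names)
print(len(train), len(val))
```

    With 3000 items and a training fraction of 0.8, this yields 2400 training images and 600 validation images, matching the counts above.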

    3. Trained Caffe Model for Human Performance Prediction

    This is the human performance predictor model. It takes as input a 256×256 image and outputs the probability that a person is able to find the human in the image.
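
    One hypothetical way to use the predicted probabilities is sketched below: given a limited number of questions, ask the human only about the images they are most likely to get right. This is an illustrative rule, not the decision procedure from the paper; the function select_queries, its parameters, and the example values are assumptions.

```python
def select_queries(predicted_success, budget, min_probability=0.5):
    """Pick up to `budget` images with the highest predicted success.

    `predicted_success` maps an image name to the predicted probability
    that a person can find the human in that image. Images below
    `min_probability` are never selected.
    """
    candidates = [
        (name, p) for name, p in predicted_success.items() if p >= min_probability
    ]
    # Highest predicted human performance first.
    candidates.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in candidates[:budget]]


# Made-up predictor outputs for four images.
predictions = {"scene_a": 0.92, "scene_b": 0.35, "scene_c": 0.71, "scene_d": 0.55}
chosen = select_queries(predictions, budget=2)
print(chosen)
```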

    4. Pre-trained Model Used for Initialization

    This model, which is based on the AlexNet architecture, takes as input a 256×256 image and performs a binary classification indicating whether a human is present in the image. It serves as the initialization for training our human performance predictor network.