First Place Solution: Three-stage Visual Relationship Detector
Yichao Lu*, Cheng Chang*, Himanshu Rai*
Layer6 AI
First Stage (Object Detection -Partial Weight Transfer)
Second Stage (Spatio-Semantic Model)
➢ The goal of the aggregation model is to learn to aggregate the spatial, semantic, and visual features to generate the final prediction.
➢ We use a GBM that takes the predictions from both second-stage models, in addition to the spatial and semantic features, as its input, and it outperforms both second-stage models. Experimental results on the validation set demonstrate that the GBM-based aggregation model outperforms averaging the two second-stage models.
Third Stage (Aggregation Model)
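The third stage can be pictured as simple feature stacking: the scores from the two second-stage models are concatenated with the spatial and semantic features and fed to a GBM. The sketch below is illustrative only; the function and argument names are not from the authors' code.

```python
def build_aggregation_features(gbm_score, cnn_score, spatial_feats, semantic_feats):
    """Assemble the input vector for the third-stage aggregation GBM.

    gbm_score / cnn_score: relationship scores from the two second-stage
    models; spatial_feats / semantic_feats: the raw feature lists that
    were also fed to the spatio-semantic model. Names are hypothetical.
    """
    return [gbm_score, cnn_score] + list(spatial_feats) + list(semantic_feats)
```

The resulting vector would then be one training row for a gradient boosting library (e.g. LightGBM), with the ground-truth relationship label as the target.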
[Figure: overview of the model pipeline. The input image is passed through the detection model to obtain objects (e.g. boy, football); spatial, semantic, and visual features are extracted and fed to the two GBMs and the CNN, whose outputs are aggregated into the final prediction (e.g. boy "hits" football).]
Overview of Model Pipeline

Introduction
➢ Object Detection: Ensemble of detectors trained using our Partial Weight Transfer Approach to detect objects belonging to one of the 57 categories.
➢ Relationship detection: GBMs and CNN based model to predict relationships.
➢ Given an image, the goal is to detect objects and their relationships.
➢ The relationships include human-object relationships, object-object relationships, and object-attribute relationships.
➢ We propose a three-stage model that consists of object detection models as the first stage, a spatio-semantic GBM model and a CNN model as the second stage, and a GBM-based aggregation model as the third stage.
➢ Spatio-Semantic Model: We use a tree-based Gradient Boosting Machine (GBM) to model the spatial relationship between pairs of objects (e.g. the Euclidean distance between the centers of the objects) as well as their semantic relationship (e.g. the probability that a pair of objects are related).
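The spatial features can be computed directly from the detected bounding boxes. The sketch below shows the Euclidean distance between box centers mentioned above, plus intersection-over-union as one plausible additional cue; the function name and the IoU feature are assumptions, not the authors' exact feature set.

```python
import math

def spatial_features(box_a, box_b):
    """Illustrative spatial features for a subject/object box pair.

    Boxes are (x1, y1, x2, y2). Only the center distance comes from the
    poster; IoU is a commonly used extra spatial feature.
    """
    ax, ay = (box_a[0] + box_a[2]) / 2.0, (box_a[1] + box_a[3]) / 2.0
    bx, by = (box_b[0] + box_b[2]) / 2.0, (box_b[1] + box_b[3]) / 2.0

    # Euclidean distance between the centers of the two boxes.
    center_dist = math.hypot(ax - bx, ay - by)

    # Intersection-over-union of the two boxes.
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    iou = inter / (area_a + area_b - inter) if inter > 0 else 0.0

    return {"center_dist": center_dist, "iou": iou}
```

Features like these are exactly the kind of tabular input that tree-based GBMs handle well, which motivates the choice of model for this stage.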
[Figure: Partial Weight Transfer. Classifier weights for classes mapped between the COCO model (CAR, DOG, GIRL, MAN, PERSON, …) and the Open Images model (CAR, BOY, WOMAN, DOG, …) are transferred.]
➢ In order to speed up training and achieve higher mAP, we propose Partial Weight Transfer (PWT).
➢ We mapped classes between COCO and Open Images.
➢ We transfer the classifier head weights for the mapped classes, as well as the backbone and regression weights.
➢ We used an ensemble of SOTA models such as Cascade R-CNN and HRNet, combined via weighted NMS.
➢ For the "is" relationship, we formed all possible subject-attribute relationships and used an object detector with PWT.
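The classifier-head part of PWT can be sketched as copying weight rows for mapped classes and freshly initializing the rest. Everything below (function name, dict-of-rows representation, Gaussian init) is an illustrative assumption; in practice this would operate on the detector's state dict, with backbone and regression weights copied wholesale.

```python
import random

def partial_weight_transfer(coco_head, oi_classes, class_map, dim):
    """Sketch of Partial Weight Transfer for a classifier head.

    coco_head: {coco_class_name: weight_row} from the pretrained COCO model.
    class_map: Open Images class -> matching COCO class (hand-made mapping).
    Rows for mapped classes are copied; unmapped classes get a small
    random initialization. All names here are hypothetical.
    """
    head = {}
    for cls in oi_classes:
        if cls in class_map:
            # Transfer the pretrained weights for the mapped class.
            head[cls] = list(coco_head[class_map[cls]])
        else:
            # No COCO counterpart: initialize from scratch.
            head[cls] = [random.gauss(0.0, 0.01) for _ in range(dim)]
    return head
```

Starting the mapped classes from pretrained weights is what lets training converge faster and reach higher mAP than training the Open Images head from scratch.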
Conclusion and Leaderboard
➢ We generalized across the public and private leaderboards, with about a 5% lead over the second-place team.
➢ Our PWT approach allowed us to train high-performance models at low computational cost in a small amount of time.
Second Stage (Visual Model)
➢ Visual Model: The convolutional neural network based visual model aims to provide visual features to solve relationships that cannot be inferred from spatial and semantic features alone.
➢ The spatio-semantic model is good at predicting relationships that are largely determined by spatial features (e.g. man "on" horse, Fig 1), while a visual model is required to predict relationships that are highly visual (e.g. woman "holds" microphone, Fig 2).
[Fig 1: man "on" horse] [Fig 2: woman "holds" microphone]