Grounding Physical Concepts of Objects and Events Through Dynamic Visual Reasoning


Zhenfang Chen      Jiayuan Mao      Jiajun Wu     
Kwan-Yee K. Wong      Joshua B. Tenenbaum       Chuang Gan



Abstract:


We study the problem of dynamic visual reasoning on raw videos. This is a challenging problem; current state-of-the-art models often require dense supervision on physical object properties and events from simulation, which is impractical to obtain in real life. In this paper, we present the Dynamic Concept Learner (DCL), a unified framework that grounds physical objects and events from video and language. DCL first adopts a trajectory extractor to track each object over time and to represent it as a latent, object-centric feature vector. Building upon this object-centric representation, DCL learns to approximate the dynamic interactions among objects using graph networks. DCL further incorporates a semantic parser to parse questions into semantic programs and, finally, a program executor to run the programs to answer the questions, leveraging the learned dynamics model. After training, DCL can detect and associate objects across frames, ground visual properties and physical events, understand the causal relationships between events, make future and counterfactual predictions, and leverage these extracted representations for answering queries. DCL achieves state-of-the-art performance on CLEVRER, a challenging causal video reasoning dataset, even without using ground-truth attributes and collision labels from simulations for training. We further test DCL on a newly proposed video-retrieval and event-localization dataset derived from CLEVRER, showing its strong generalization capacity.


Qualitative Results




Visualization of Concept Learning on CLEVRER

We visualize the dynamic concepts learned by DCL on CLEVRER. We parse the scenes by densely quantizing the concepts in each frame. Extracted object trajectories and predicted collision, in, and out events are marked in green, red, blue, and yellow, respectively.

Visualization of Concept Learning on BLOCK TOWERS

We visualize the dynamic concepts learned by DCL on BLOCK TOWERS. We parse the scenes by densely quantizing the color and falling concepts in each video. The bounding box of each extracted object trajectory is drawn in the object's predicted color. Falling objects are labeled "falling" on top of their bounding boxes.

Qualitative Results on CLEVRER-QA

Question 1: What is the color of the last object to collide with the metal sphere?
Predicted Answer: Brown.   (Ground-truth Answer: Brown.)

(a). Descriptive question sample. Extracted object trajectories and predicted collision, in, and out events are marked in green, red, blue, and yellow, respectively.
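To make the step-by-step reasoning behind a descriptive question concrete, here is a minimal sketch of how a symbolic program for Question 1 could be executed over an already parsed scene. Everything in it is an assumption made for illustration: the scene, the object attributes, the collision frames, and the operation names (`filter_objects`, `filter_collisions`, `last`, `partner`) are hypothetical and are not taken from the DCL code. The sketch also uses hand-written discrete labels, whereas DCL works from learned object-centric features.

```python
from dataclasses import dataclass


@dataclass
class Obj:
    idx: int
    color: str
    material: str
    shape: str


@dataclass
class Collision:
    frame: int  # frame at which the collision is detected
    a: int      # index of the first object involved
    b: int      # index of the second object involved


def filter_objects(objects, **attrs):
    """Keep objects whose attributes match all given key/value pairs."""
    return [o for o in objects
            if all(getattr(o, k) == v for k, v in attrs.items())]


def unique(items):
    """Return the single element of a list, or fail if it is ambiguous."""
    if len(items) != 1:
        raise ValueError(f"expected exactly one item, got {len(items)}")
    return items[0]


def filter_collisions(collisions, obj):
    """Keep collisions that involve the given object."""
    return [c for c in collisions if obj.idx in (c.a, c.b)]


def last(events):
    """Return the event with the largest frame index."""
    return max(events, key=lambda e: e.frame)


def partner(collision, obj):
    """Return the index of the other object in a collision."""
    return collision.b if collision.a == obj.idx else collision.a


def answer_last_collider_color(objects, collisions):
    # "What is the color of the last object to collide with the metal sphere?"
    target = unique(filter_objects(objects, material="metal", shape="sphere"))
    event = last(filter_collisions(collisions, target))
    other = partner(event, target)
    return next(o.color for o in objects if o.idx == other)


# Hypothetical parsed scene (invented for this example).
objects = [
    Obj(0, "gray", "metal", "sphere"),
    Obj(1, "brown", "rubber", "cube"),
    Obj(2, "green", "metal", "cube"),
]
collisions = [Collision(30, 0, 2), Collision(75, 1, 0)]

answer = answer_last_collider_color(objects, collisions)
print(answer)
```

Each operation consumes the output of the previous one, so the intermediate results (the selected sphere, its collisions, the last of them) can be inspected when an answer is wrong.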

Question 2: Which of the following is responsible for the collision between the yellow cube and the cylinder?
Choice 1: The collision between the green metal object and the metal sphere. (Predicted: Wrong)   (GT.: Wrong)
Choice 2: The green metal cube's colliding with the yellow cube. (Predicted: Correct.)   (GT.: Correct.)
Choice 3: The green metal object's entering the scene. (Predicted: Correct.)   (GT.: Correct.)

(b). Explanatory question sample. Extracted object trajectories and predicted collision, in, and out events are marked in green, red, blue, and yellow, respectively.

Question 3: What will happen next?
Choice 1: The cylinder collides with the blue sphere. (Predicted: Correct.)   (GT.: Correct.)
Choice 2: The cylinder and the red object collide. (Predicted: Wrong)   (GT.: Wrong)

(c). Predictive question sample. Objects and events predicted in the future scenes are marked with black boxes. Extracted object trajectories and predicted collision, in, and out events are marked in green, red, blue, and yellow, respectively.

Question 4: Without the red sphere, which of the following will happen?
Choice 1: The cylinder collides with the blue object. (Predicted: Wrong)   (GT.: Wrong)
Choice 2: The yellow object and the cylinder collide. (Predicted: Correct.)   (GT.: Correct.)

Original Video.


Counterfactual Video.

(d). Counterfactual question sample.

Qualitative Results on CLEVRER-Grounding

Query: The collision that happens after the blue sphere exits the scene.
Query: The cube enters the scene before the rubber sphere enters the scene.
Query: The object that collides with the brown cube.
We visualize typical examples of CLEVRER-Grounding. The query expressions are shown on top of the videos, and the spatio-temporal localization results in the videos are bounded with green boxes. DCL can explicitly ground object and event concepts, analyze temporal structures, and understand the complex logic needed to localize the target event or object.

Qualitative Results on CLEVRER-Retrieval

Query expression: A video that contains a collision that happens before the green rubber cube enters the scene.
 
Top 1
 
Top 2
 
Top 3

Top 4
We visualize a typical example of CLEVRER-Retrieval. The four top-ranked gallery videos are shown. DCL can explicitly ground object and event concepts, analyze their relations, and perform step-by-step reasoning to retrieve the positive gallery videos.
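One way to picture retrieval with grounded events is to score every gallery video against the query and sort by that score. The sketch below does this for the query above. It is an illustration under stated assumptions, not DCL's scoring method: the `Event` record, the confidence values, the gallery contents, and the scoring rule (the best product of a collision confidence and a later "in" confidence) are all hypothetical.

```python
from typing import NamedTuple


class Event(NamedTuple):
    kind: str           # "collision" or "in"
    frame: int          # frame at which the event occurs
    conf: float         # hypothetical detection confidence in [0, 1]
    subject: str = ""   # description of the entering object, if any


def query_score(events):
    """Score 'a collision that happens before the green rubber cube enters'."""
    entries = [e for e in events
               if e.kind == "in" and e.subject == "green rubber cube"]
    hits = [c.conf * e.conf
            for c in events if c.kind == "collision"
            for e in entries if c.frame < e.frame]
    # A video with no matching pair of events scores zero.
    return max(hits, default=0.0)


def rank_videos(gallery):
    """Return video names sorted from best to worst match."""
    return sorted(gallery, key=lambda name: query_score(gallery[name]),
                  reverse=True)


# Hypothetical gallery (invented for this example).
gallery = {
    "video_a": [Event("collision", 20, 0.9),
                Event("in", 50, 0.8, "green rubber cube")],
    "video_b": [Event("in", 10, 0.9, "green rubber cube"),
                Event("collision", 40, 0.95)],
    "video_c": [Event("collision", 15, 0.6),
                Event("in", 30, 0.7, "green rubber cube")],
}

ranking = rank_videos(gallery)
print(ranking)
```

In this toy gallery, `video_b` ranks last because its only collision happens after the cube has entered, so the temporal constraint in the query is never satisfied.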

Qualitative Results on BLOCK TOWERS


Q.: Are there any falling green objects?   
A.: No.   (GT.: No)
Q.: How many falling blocks are there?   
A.: 2.   (GT.: 2)
Q.: What is the color of the block at the top?   
A.: Blue.   (GT.: Blue)
Qualitative Results on BLOCK TOWERS. Questions, predicted answers, and ground-truth answers are marked with Q., A., and GT., respectively. The bounding box of each extracted object trajectory is drawn in the target object's predicted color. Falling objects are labeled "falling" on top of their bounding boxes.

Paper


Grounding Physical Concepts of Objects and Events Through Dynamic Visual Reasoning
Zhenfang Chen, Jiayuan Mao, Jiajun Wu, Kwan-Yee K. Wong, Joshua B. Tenenbaum, and Chuang Gan
[Paper] [Code] [BibTeX]