Yu Kong, Yunde Jia and Yun Fu

Abstract. In this paper, we present a novel approach for human interaction recognition from videos. We introduce high-level descriptions called interactive phrases to express binary semantic motion relationships between interacting people. Interactive phrases naturally exploit human knowledge to describe interactions and allow us to construct a more descriptive model for recognizing human interactions. We propose a novel hierarchical model to encode interactive phrases based on the latent SVM framework, where interactive phrases are treated as latent variables. The interdependencies between interactive phrases are explicitly captured in the model to deal with motion ambiguity and partial occlusion in the interactions. We evaluate our method on a newly collected BIT-Interaction dataset and the UT-Interaction dataset. Promising results demonstrate the effectiveness of the proposed method.


Fig. 1. Framework of our interactive phrase method.

Attribute Model

We utilize motion attributes to describe individual actions [16], e.g. “arm raising up motion”, “leg stepping backward motion”, etc. In an interaction, the two people share the same attribute vocabulary but take different attribute values. These motion attributes can be inferred from low-level motion features such as spatiotemporal interest points [18] (Fig. 2).
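The idea of a shared attribute vocabulary with per-person values can be sketched as follows. This is only an illustration, not the attribute model from the paper: it assumes each person is summarized by a small histogram of quantized motion features, and it scores every attribute with an independent linear classifier instead of the structured SVM used in the paper. The attribute names, weights, biases and histograms are all made up for the example.

```python
# Hypothetical attribute vocabulary shared by both people (names are illustrative).
ATTRIBUTES = ["arm_raising_up", "leg_stepping_backward", "torso_bending"]

def attribute_scores(histogram, weights, biases):
    """Score each attribute with a linear model over a feature histogram."""
    return [
        sum(w * h for w, h in zip(w_row, histogram)) + b
        for w_row, b in zip(weights, biases)
    ]

def detect_attributes(histogram, weights, biases):
    """Binarize the scores: 1 if the attribute is judged present, else 0."""
    return [1 if s > 0 else 0 for s in attribute_scores(histogram, weights, biases)]

# Toy parameters (assumed): one row per attribute, one column per visual word.
WEIGHTS = [[1.0, -1.0, 0.0, 0.0],
           [0.0, 1.0, -1.0, 0.0],
           [0.0, 0.0, 1.0, -1.0]]
BIASES = [-0.1, -0.1, -0.1]

# Same vocabulary and classifiers for both people, different feature histograms.
person_1 = detect_attributes([0.7, 0.1, 0.1, 0.1], WEIGHTS, BIASES)  # [1, 0, 0]
person_2 = detect_attributes([0.1, 0.6, 0.1, 0.2], WEIGHTS, BIASES)  # [0, 1, 0]
```

Both people are described over the same three attributes, but the resulting binary vectors differ because their motion features differ.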


Fig. 2. Framework of detecting motion attributes from videos.

Interaction Model

The interaction model recognizes human interactions based on the interactive phrases.

Interactive phrases encode human knowledge about motion relationships between people. The phrases are built on attributes of two interacting people and utilized to describe their co-occurrence relationships.
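As a rough illustration of how a phrase can be built on the attributes of two people, the sketch below treats each phrase as a binary variable that is on when one attribute of each person occurs together. This is a deliberate simplification: in the paper the phrases are latent variables inferred by the model, not a fixed logical AND. The phrase names and attribute names are hypothetical.

```python
# Hypothetical phrase definitions: each phrase links one attribute of person 1
# to one attribute of person 2 (names are illustrative, not from the paper).
PHRASES = {
    "both_arms_raised": ("arm_raising_up", "arm_raising_up"),
    "one_advances_one_retreats": ("leg_stepping_forward", "leg_stepping_backward"),
}

def phrase_values(attrs_1, attrs_2, phrases=PHRASES):
    """Return a binary value per phrase: 1 only if both linked attributes co-occur."""
    return {
        name: int(attrs_1.get(a1, 0) == 1 and attrs_2.get(a2, 0) == 1)
        for name, (a1, a2) in phrases.items()
    }

# Binary attribute values for each person (assumed detector outputs).
attrs_1 = {"arm_raising_up": 1, "leg_stepping_forward": 1}
attrs_2 = {"arm_raising_up": 1, "leg_stepping_backward": 0}

values = phrase_values(attrs_1, attrs_2)
# values == {"both_arms_raised": 1, "one_advances_one_retreats": 0}
```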


Fig. 3. The unary, pairwise and global interaction potentials in the interaction model.
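The three kinds of potentials named in Fig. 3 can be combined into a single score per interaction class, and the phrases can then be treated as hidden binary variables that are maximized over. The sketch below assumes a tiny model with two phrases and fixed numeric potentials; in the actual model the potentials would depend on video features and learned weights, and inference would not need brute-force enumeration. The data layout, function names and all numbers are assumptions made for this example.

```python
from itertools import product

def score(phrases, unary, pairwise, global_w):
    """Sum of unary, pairwise, and global potentials for one class label."""
    s = sum(unary[j][p] for j, p in enumerate(phrases))
    s += sum(w[phrases[j]][phrases[k]] for (j, k), w in pairwise.items())
    s += global_w
    return s

def infer(models):
    """Brute-force search over class labels and binary phrase assignments."""
    best = None
    for label, m in models.items():
        n = len(m["unary"])
        for phrases in product((0, 1), repeat=n):
            s = score(phrases, m["unary"], m["pairwise"], m["global"])
            if best is None or s > best[0]:
                best = (s, label, phrases)
    return best

# Toy potentials (assumed): unary[j][value], pairwise[(j, k)][value_j][value_k].
MODELS = {
    "handshake": {
        "unary": [[0.0, 2.0], [0.5, 0.0]],
        "pairwise": {(0, 1): [[0.0, 0.0], [1.0, -1.0]]},
        "global": 0.5,
    },
    "push": {
        "unary": [[1.0, 0.0], [0.0, 1.0]],
        "pairwise": {(0, 1): [[0.0, 0.5], [0.0, 0.0]]},
        "global": 0.0,
    },
}

best_score, best_label, best_phrases = infer(MODELS)
# best_label == "handshake", best_phrases == (1, 0), best_score == 4.0
```

The pairwise term is what lets one phrase support or suppress another, which is how interdependencies between phrases can compensate for an ambiguous or occluded observation.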

Learning and Inference

Attribute model: structured SVM formulation


Interaction model: latent SVM formulation
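For orientation, a generic multi-class latent SVM objective is written out below. It is the standard textbook form and may differ from the exact formulation used in the paper, which is not reproduced in this text. The symbols are assumptions for this sketch: $x_i$ is a training video, $y_i$ its interaction label, $h$ the latent interactive phrases, $\Phi$ a joint feature map, $\Delta$ a loss between labels, and $C$ a trade-off constant.

```latex
\min_{w,\;\xi}\quad \frac{1}{2}\,\lVert w \rVert^{2} + C \sum_{i} \xi_i
\qquad \text{s.t.}\quad
\max_{h}\, w^{\top}\Phi(x_i, h, y_i) \;-\; \max_{h'}\, w^{\top}\Phi(x_i, h', y)
\;\ge\; \Delta(y, y_i) - \xi_i \quad \forall i,\ \forall y,
\qquad \xi_i \ge 0 .
```

At test time, the predicted label under this form is the $y$ that maximizes $\max_{h} w^{\top}\Phi(x, h, y)$, i.e. the class whose best phrase assignment scores highest.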




Datasets: We evaluate our method on the BIT-Interaction dataset and the UT-Interaction dataset.

Fig. 4. Example frames of the BIT-Interaction dataset. This dataset consists of 8 classes of human interactions: bow, boxing, handshake, high-five, hug, kick, pat, and push.


Fig. 5. Example frames of the UT-Interaction dataset. This dataset consists of 6 classes of human interactions: handshake, hug, kick, point, punch and push.

Results on the BIT dataset


Fig. 6. Results of our method on the BIT-Interaction dataset. In (b), correctly recognized examples are in the first two rows and misclassifications are in the last row.

Results on the UT dataset


Fig. 9. Results of our method on the UT-Interaction dataset. In (b), correctly recognized examples are in the first three columns and misclassifications are in the last column.


Table 3. Recognition accuracy (%) of methods on the UT-Interaction dataset.


[1] Yu Kong, Yunde Jia and Yun Fu. Learning Human Interaction by Interactive Phrases. ECCV 2012. [PDF] [Poster] [PPT] [BIT-Interaction Dataset]