Yu Kong, Yunde Jia and Yun Fu
Abstract. In this paper, we present a novel approach for human interaction recognition from videos. We introduce high-level descriptions called interactive phrases to express binary semantic motion relationships between interacting people. Interactive phrases naturally exploit human knowledge to describe interactions and allow us to construct a more descriptive model for recognizing human interactions. We propose a novel hierarchical model to encode interactive phrases based on the latent SVM framework, where interactive phrases are treated as latent variables. The interdependencies between interactive phrases are explicitly captured in the model to deal with motion ambiguity and partial occlusion in the interactions. We evaluate our method on the newly collected BIT-Interaction dataset and the UT-Interaction dataset. Promising results demonstrate the effectiveness of the proposed method.
Fig. 1. Framework of our interactive phrase method.
We utilize motion attributes to describe individual actions, e.g., "arm raising up motion" and "leg stepping backward motion". In an interaction, both interacting people share the same attribute vocabulary but take different attribute values. These motion attributes can be inferred from low-level motion features (Fig. 2), for example, spatiotemporal interest points.
Fig. 2. Framework of detecting motion attributes from videos.
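As a hedged illustration of this detection pipeline (a minimal sketch under our own assumptions, not the authors' implementation; all function and variable names below are hypothetical), each motion attribute can be treated as an independent binary classifier over a bag-of-words histogram of spatiotemporal interest point descriptors:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Hypothetical sketch: one binary classifier per motion attribute
# (e.g., "arm raising up motion"), trained on bag-of-words histograms
# of spatiotemporal interest point (STIP) descriptors per person.

def bow_histogram(descriptors, codebook):
    """Quantize STIP descriptors against a learned codebook and
    return an L1-normalized bag-of-words histogram."""
    # Assign each descriptor to its nearest codeword.
    dists = np.linalg.norm(
        descriptors[:, None, :] - codebook[None, :, :], axis=2)
    words = dists.argmin(axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / max(hist.sum(), 1.0)

def train_attribute_classifiers(features, attribute_labels):
    """Train one linear SVM per attribute.
    features: (n_videos, n_words) BoW histograms
    attribute_labels: (n_videos, n_attributes) binary matrix."""
    return [LinearSVC(C=1.0).fit(features, attribute_labels[:, j])
            for j in range(attribute_labels.shape[1])]

def predict_attributes(classifiers, feature):
    """Return the binary attribute vector for one person's features."""
    return np.array([clf.predict(feature[None, :])[0]
                     for clf in classifiers])
```

In practice, the real-valued attribute scores (rather than hard 0/1 predictions) would be the more natural input to the interaction model's unary potentials.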
We recognize human interactions based on the inferred interactive phrases.
Interactive phrases encode human knowledge about motion relationships between interacting people. Each phrase is built on the attributes of the two people and describes the co-occurrence relationships between those attributes.
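To make this concrete, a hedged sketch of one way to formalize a phrase (our notation, not necessarily the paper's): with $a^{(1)}, a^{(2)} \in \{0,1\}^M$ the binary attribute vectors of the two people, a phrase $p_k$ can be viewed as an indicator of a specific attribute co-occurrence,

\[ p_k = \mathbb{1}\big[\, a^{(1)}_{m_k} = 1 \,\wedge\, a^{(2)}_{n_k} = 1 \,\big], \]

e.g., "one person's arm raises up while the other person's leg steps backward" pairs an arm attribute of person 1 with a leg attribute of person 2.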
Fig. 3. The unary, pairwise and global interaction potentials in the interaction model.
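In the spirit of Fig. 3, a hedged sketch (our notation, not necessarily the paper's exact parameterization) of how the interaction score over a video $x$, class label $y$, and latent phrase variables $p = (p_1, \dots, p_K)$ may decompose:

\[ F(x, y, p; w) = \sum_{j=1}^{K} w_j^{u\top} \phi_u(x, p_j) + \sum_{(j,k) \in \mathcal{E}} w_{jk}^{p\top} \phi_p(p_j, p_k) + w^{g\top} \phi_g(p, y), \]

where the unary potentials $\phi_u$ tie each phrase to attribute evidence from the two people, the pairwise potentials $\phi_p$ encode interdependencies between phrases over an edge set $\mathcal{E}$ (useful under motion ambiguity and partial occlusion), and the global potential $\phi_g$ links the full phrase configuration to the interaction class.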
Attribute model: structured SVM formulation
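As a hedged sketch of what a structured SVM formulation for the attribute model could look like (our notation; the margin constraints and loss are generic, not copied from the paper):

\[ \min_{w,\, \xi \ge 0} \; \frac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{s.t.} \quad w^\top \Psi(x_i, a_i) - w^\top \Psi(x_i, a) \ge \Delta(a_i, a) - \xi_i \quad \forall a \ne a_i, \]

where $\Psi(x, a)$ is a joint feature map over low-level motion features and an attribute configuration $a$, and $\Delta$ is a label loss such as the Hamming distance between attribute vectors.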
Interaction model: latent SVM formulation
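A hedged sketch of the latent SVM view (our notation): interactive phrases enter as latent variables that are maximized out at inference time,

\[ \hat{y} = \arg\max_{y} \max_{p} F(x, y, p; w), \]

and learning minimizes a regularized hinge loss that is non-convex in $w$ because of the inner maximization,

\[ \min_w \; \frac{1}{2}\|w\|^2 + C \sum_i \max\Big(0,\; 1 + \max_{y \ne y_i,\, p} F(x_i, y, p; w) - \max_{p} F(x_i, y_i, p; w)\Big), \]

typically solved by alternating between imputing the latent phrases and updating $w$ (a CCCP-style procedure).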
Datasets: We evaluate our method on the BIT-Interaction dataset and the UT-Interaction dataset.
Fig. 4. Example frames of the BIT-Interaction dataset. This dataset consists of 8 classes of human interactions: bow, boxing, handshake, high-five, hug, kick, pat, and push.
Fig. 5. Example frames of the UT-Interaction dataset. This dataset consists of 6 classes of human interactions: handshake, hug, kick, point, punch, and push.
Results on the BIT dataset
Fig. 6. Results of our method on the BIT-Interaction dataset. In (b), correctly recognized examples are shown in the first two rows and misclassifications in the last row.
Results on the UT dataset
Fig. 9. Results of our method on the UT-Interaction dataset. In (b), correctly recognized examples are shown in the first three columns and misclassifications in the last column.