We present a purely algorithmic method for detecting when two hands are visually merged (e.g., clasped) and for tracking their positions by propagating tracking information from anchor frames, using single-camera video without depth information. We demonstrate and evaluate on a manually labeled dataset of 698 images of a single speaker, selected primarily for clasped hands, containing 1301 annotated left and right hands. On the task of recognizing clasped hands, our method achieves higher recall than the baseline (0.66 vs. 0.53) without sacrificing precision (0.65 for both). We also evaluate its tracking efficacy through its ability to improve the performance of a naive hand labeling heuristic, yielding an F-score of 0.59 vs. the baseline's 0.48.
John R. Zhang, John R. Kender
2013 IEEE International Conference on Image Processing