Paper
Efficient Visual Recognition
Published Jul 10, 2020 · Li Liu, M. Pietikäinen, Jie Qin
International Journal of Computer Vision
4 Citations · 0 Influential Citations
Abstract
Visual recognition is the ability to recognize and localize visual categories such as faces, persons, objects, scenes, places, attributes, human expressions, emotions, actions and gestures, as well as object relations and interactions in images or videos. In other words, it is the ability to answer the basic and important question “What is Where”, which is crucial for answering advanced reasoning questions such as: What is happening? What will happen next? What should I do? Visual recognition is the cornerstone of computer vision: almost any vision task fundamentally relies on the ability to recognize and localize such visual categories. Visual recognition thus touches many areas of artificial intelligence and information retrieval, such as image search, data mining, question answering, autonomous driving, medical diagnosis, robotics and many others. The recent revival of interest in artificial neural networks, in particular deep learning, has brought tremendous progress in various computer vision problems (including visual recognition) and in a broad range of fields beyond computer vision, such as speech recognition and language translation. Early deep learning work in 2006 focused on the MNIST digit image classification problem and achieved state-of-the-art results. Later, in 2012, a Deep Convolutional Neural Network (DCNN) named AlexNet achieved a significant breakthrough in object recognition on the large-scale ImageNet dataset, which arguably reignited the field of artificial neural networks and triggered the recent revolution in artificial intelligence. Since then, the research focus in visual recognition has moved away from feature engineering toward feature learning. Recent advances in representation learning, especially deep learning, have opened up the possibility of visual recognition at “large scale” and “in the wild”, and many visual recognition algorithms have been made into products.
Although visual recognition has made significant progress, especially in the past several years, there is a continued need for vigorous research to solve the many challenging problems on the way to highly efficient visual recognition, including achieving energy efficiency and label/sample efficiency. On the one hand, the high accuracy of various visual recognition tasks depends heavily on large-scale Deep Neural Networks (DNNs), which require ultra-high-performance processors (e.g., GPUs) with high computation capability. However, we are in the post-Moore's-Law era, and energy-efficient sensing and computing is vital at all levels, from the smallest sensor like the chip to ultra-high-performance processors and systems like the cloud. In addition, with the ubiquity of mobile devices such as smartphones, Internet of Things (IoT) devices and wearable devices, which have very limited computing-related resources (e.g., power, memory, storage, CPUs, and bandwidth), recognizing efficiently on such devices is as critical as recognizing accurately. Therefore, there is a pressing need for computationally efficient algorithms that enable such devices to support a wide range of computer vision tasks. Edge intelligence is important for enabling ubiquitous artificial intelligence over the next decade. On the other hand, the high accuracy of various visual recognition tasks depends heavily on massive labeled datasets, which are painstakingly labeled by numerous workers or specialists. However, labeling instances is difficult,
Communicated by Li Liu, Matti Pietikäinen, Jie Qin, Wanli Ouyang, Luc Van Gool.