Article
Computer Science, Artificial Intelligence
Ziliang Ren, Qieshi Zhang, Jun Cheng, Fusheng Hao, Xiangyang Gao
Summary: This paper proposes a novel approach to multimodal human action recognition that compresses RGB-D sequences into dynamic images and designs SC-ConvNets to learn complementary features from the different modalities. Experimental results demonstrate excellent recognition performance across multiple datasets.
Article
Computer Science, Artificial Intelligence
Jiaming Wang, Zhenfeng Shao, Xiao Huang, Tao Lu, Ruiqian Zhang, Xianwei Lv
Summary: The study introduces a novel parameter-free spatial-temporal pooling block (STP) for action recognition in videos. STP efficiently discards non-informative frames, learns spatial and temporal weights, and uses a new loss function that pushes the model to learn from sparse, discriminative frames, ultimately outperforming several state-of-the-art methods in action classification.
Article
Mathematics
Qingxia Li, Dali Gao, Qieshi Zhang, Wenhong Wei, Ziliang Ren
Summary: This paper proposes a method to improve action recognition performance by constructing dynamic images and designing an interactive learning dual-ConvNet (ILD-ConvNet). The visual dynamic images capture spatial-temporal information via the rank pooling method, and the construction is extended to depth sequences to obtain richer multi-modal spatial-temporal information. The proposed ILD-ConvNet achieves competitive recognition accuracy on the NTU RGB+D 120 and PKU-MMD datasets.
Article
Computer Science, Information Systems
Ziliang Ren, Qieshi Zhang, Xiangyang Gao, Pengyi Hao, Jun Cheng
Summary: The paper introduces a multi-modality learning approach for human action recognition that uses bidirectional rank pooling to obtain spatial-temporal information from RGB and depth images, and designs an effective ConvNet architecture based on a multi-modality hierarchical fusion strategy. The proposed method achieves state-of-the-art results on multiple datasets.
MULTIMEDIA TOOLS AND APPLICATIONS
(2021)
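Several of the entries above rely on rank pooling to compress a video into a single "dynamic image". As a minimal illustration (not any one paper's implementation), the approximate variant reduces to a fixed temporal weighting of the frames; the sketch below assumes numpy and a hypothetical `frames` array of shape (T, H, W, C), using the simplified coefficient alpha_t = 2t - T - 1:

```python
import numpy as np

def approximate_dynamic_image(frames):
    """Collapse a video into one dynamic image via approximate rank pooling.

    frames: array of shape (T, H, W, C).
    Uses the simplified weighting alpha_t = 2t - T - 1 (t = 1..T), which
    emphasises later frames and down-weights earlier ones.
    """
    T = frames.shape[0]
    # Temporal coefficients; they sum to zero, so a static video maps to
    # an all-zero dynamic image (only motion survives the pooling).
    alphas = 2.0 * np.arange(1, T + 1) - T - 1
    # Contract the time axis: weighted sum of frames -> (H, W, C).
    return np.tensordot(alphas, frames.astype(np.float64), axes=1)
```

Because the coefficients sum to zero, static content cancels and the resulting single image encodes the temporal evolution, which is why it can be fed to an ordinary 2D ConvNet.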
Article
Computer Science, Artificial Intelligence
Dengdi Sun, Zhixiang Su, Zhuanlian Ding, Bin Luo
Summary: The proposed action recognition model based on a multi-view temporal attention mechanism effectively captures and utilizes motion information present in image frames and optical flows. Experimental results demonstrate that the method outperforms existing techniques in action recognition, showcasing the effectiveness of introducing temporal attention and multi-view fusion approaches.
COGNITIVE COMPUTATION
(2022)
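Temporal attention, as used in the entry above, boils down to scoring each frame, normalising the scores over time, and taking a weighted average of the frame features. The sketch below is a generic single-view version under assumed shapes (per-frame features of shape (T, D) and a learned scoring vector `w`), not the paper's multi-view model:

```python
import numpy as np

def temporal_attention_pool(features, w):
    """Attention-weighted temporal pooling of per-frame features.

    features: array of shape (T, D), one feature vector per frame.
    w: array of shape (D,), a (here hypothetical) learned scoring vector.
    Returns a single (D,) video descriptor.
    """
    # Score each frame, then softmax over the time axis
    # (subtracting the max for numerical stability).
    scores = features @ w
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Weighted average of frame features: informative frames dominate.
    return weights @ features
```

In a full model, `w` would be trained end-to-end with the recognition loss so that frames carrying motion evidence receive higher weights.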
Article
Multidisciplinary Sciences
Guoan Yang, Yong Yang, Zhengzhi Lu, Junjie Yang, Deyang Liu, Chuanbo Zhou, Zien Fan
Summary: This study addresses the tendency of deep learning-based action recognition models to focus only on short-term motions, causing misjudgments of actions composed of multiple sub-processes. It proposes a Spatial-Temporal Attention Temporal Segment Networks (STA-TSN) model that incorporates a soft attention mechanism to adaptively focus on key spatial and temporal features. By combining a multi-scale spatial focus feature enhancement strategy with a deep learning-based key-frame exploration module, the model captures long-term information and key frames more effectively, achieving superior results over existing methods on public datasets.
Article
Computer Science, Artificial Intelligence
Fei Wang, Guorui Wang, Yuxuan Du, Zhenquan He, Yong Jiang
Summary: This paper introduces a two-stage temporal proposal algorithm for action detection in long untrimmed videos. The algorithm uses a novel prior-minor watershed and sliding-window approach in the first stage, and an extended context pooling (ECP) and temporal context regression network in the second stage to improve the precision of action localization. Results on three large-scale benchmarks demonstrate that the proposed method outperforms state-of-the-art approaches and runs efficiently on a GPU.
INTERNATIONAL JOURNAL OF MACHINE LEARNING AND CYBERNETICS
(2021)
Article
Computer Science, Artificial Intelligence
Zhenwei Wang, Wei Dong, Bingbing Zhang, Jianxin Zhang, Xiangdong Liu, Bin Liu, Qiang Zhang
Summary: For video action recognition, the proposed GSoANet integrates a GSoAM at the end of the network to aggregate spatio-temporal features: GSoAM decomposes input features into low-dimensional vectors before aggregating the video's spatio-temporal features. The network also adopts ConvNeXt as its backbone to improve accuracy at a lower computational cost.
NEURAL PROCESSING LETTERS
(2023)
Article
Computer Science, Artificial Intelligence
Qilong Wang, Qiyao Hu, Zilin Gao, Peihua Li, Qinghua Hu
Summary: This article proposes an adaptive multi-granularity spatio-temporal network (AMS-Net) for effectively handling complex scale variations in videos. The network efficiently captures both subtle variations in visual tempos and larger-scale spatio-temporal dynamics, achieving state-of-the-art performance on action recognition tasks.
IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS
(2023)
Article
Computer Science, Artificial Intelligence
Huigang Zhang, Liuan Wang, Jun Sun
Summary: Action recognition is a popular area of computer vision research, focusing on identifying human actions in videos. Existing methods rely on visual features within the videos, but lack the ability to represent general knowledge of actions beyond the video. This study presents a novel spatio-temporal knowledge module (STKM) that combines external knowledge with visual features, leading to improved recognition results. Experimental results demonstrate the robustness and generalization ability of STKM.
IET COMPUTER VISION
(2023)
Article
Engineering, Electrical & Electronic
Kun Liu, Wu Liu, Huadong Ma, Mingkui Tan, Chuang Gan
Summary: A new real-time convolutional architecture, T-C3D, is proposed for action representation and combined with deep compression techniques to accelerate model deployment. The method achieves a 5.4% improvement in accuracy and roughly 2x faster inference than state-of-the-art real-time methods, with a model size of less than 5 MB.
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY
(2021)
Article
Computer Science, Information Systems
Lijun He, Miao Zhang, Sijin Zhang, Liejun Wang, Fan Li
Summary: With the wide deployment of Internet of Things monitoring terminals, a tremendous volume of video is accumulating. This paper introduces a method that works in the compressed domain to efficiently extract video information and recognize actions of different durations based on multiscale temporal features. Experimental results show that the proposed algorithm achieves a good balance between accuracy and computational complexity.
IEEE INTERNET OF THINGS JOURNAL
(2022)
Article
Computer Science, Artificial Intelligence
Jun Tang, Baodi Liu, Wenhui Guo, Yanjiang Wang
Summary: This paper introduces a skeleton-based action recognition method that effectively exploits feature distributions by incorporating Fisher vector encoding into graph convolutional networks (GCNs). A temporal enhanced Fisher vector (TEFV) encoding algorithm is proposed to capture both fine-grained spatial configurations and temporal dynamics. Performance is further improved by combining the TEFV model with the GCN model in a two-stream framework.
COMPLEX & INTELLIGENT SYSTEMS
(2023)
Article
Computer Science, Information Systems
Peng Dou, Ying Zeng, Zhuoqun Wang, Haifeng Hu
Summary: Recent action localization works learn in a weakly supervised manner to avoid the expensive cost of human labeling. To better separate weakly discriminative foreground action segments from background ones, and to model the relationships between different actions, we propose multiple temporal pooling (MTP) mechanisms that leverage more effective information and generate different class activation sequences (CASs). Our method achieves excellent results on the THUMOS14 and ActivityNet1.2 datasets.
ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS
(2023)
Article
Computer Science, Information Systems
Md Moniruzzaman, Zhaozheng Yin, Zhihai He, Ruwen Qin, Ming C. Leu
Summary: This study introduces a simple yet effective network model for human action recognition from trimmed and untrimmed videos. By introducing an attentional pooling mechanism and a video-segment attention model, the network can emphasize action-critical features in videos and learn attention weights even without precise temporal annotations. Experimental results on multiple datasets demonstrate the superior performance of the network compared to state-of-the-art methods.
IEEE TRANSACTIONS ON MULTIMEDIA
(2022)
Article
Computer Science, Artificial Intelligence
Yu Liu, Tinne Tuytelaars
Summary: Discovering novel visual categories from unlabeled images is crucial for intelligent vision systems, and we propose a residual-tuning approach to overcome the trade-off between preserving features learned on labeled data and adapting features to unlabeled data. Our method achieves consistent and considerable gains on benchmark tests, reducing the performance gap to the fully supervised setting.
IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS
(2023)
Article
Agriculture, Multidisciplinary
Tim Van De Looverbosch, Jiaqi He, Astrid Tempelaere, Klaas Kelchtermans, Pieter Verboven, Tinne Tuytelaars, Jan Sijbers, Bart Nicolai
Summary: X-ray radiography has been investigated as a technique for internal quality inspection of pears in storage, with multiple deep anomaly detection methods showing effectiveness in detecting pears with internal cavity and browning disorders. The best performing methods were found to be on par with a state-of-the-art multisensor disorder detection method.
COMPUTERS AND ELECTRONICS IN AGRICULTURE
(2022)
Article
Computer Science, Artificial Intelligence
Eli Verwimp, Kuo Yang, Sarah Parisot, Lanqing Hong, Steven McDonagh, Eduardo Perez-Pellitero, Matthias De Lange, Tinne Tuytelaars
Summary: This paper introduces CLAD, a new Continual Learning benchmark for Autonomous Driving, focusing on object classification and object detection. The benchmark builds on SODA10M, a large-scale autonomous-driving dataset. Existing continual learning benchmarks are reviewed and discussed, showing that most represent extreme cases. An online classification benchmark, CLAD-C, and a domain-incremental continual object detection benchmark, CLAD-D, are introduced, and their inherent difficulties and challenges are examined through a survey of the top three participants in a CLAD-challenge workshop at ICCV 2021. Finally, possible pathways to improve the current state of continual learning and promising directions for future research are discussed.
Article
Agronomy
Astrid Tempelaere, Tim Van De Looverbosch, Klaas Kelchtermans, Pieter Verboven, Tinne Tuytelaars, Bart Nicolai
Summary: This study proposes a method to generate synthetic CT images using a conditional GAN (cGAN) to overcome the challenge of obtaining large annotated datasets. The performance of the predictor was evaluated quantitatively and visually, showing that the cGAN effectively generates CT images of healthy and defective fruit based on annotations.
POSTHARVEST BIOLOGY AND TECHNOLOGY
(2023)
Proceedings Paper
Computer Science, Artificial Intelligence
Thomas Verelst, Paul K. Rubenstein, Marcin Eichner, Tinne Tuytelaars, Maxim Berman
Summary: Multi-label image classification is more practical for real-world scenarios than single-label classification due to the presence of multiple objects in natural images. However, annotating every object of interest is time-consuming and expensive. In this study, we propose an Expected Negative loss to train multi-label classifiers using datasets where each image is annotated with a single positive label. To handle the uncertainty of other classes, we generate a set of expected negative labels based on prediction consistency. Additionally, we introduce a novel spatial consistency loss to improve supervision by maintaining consistent spatial feature maps for each training image. Our experiments on various datasets demonstrate the effectiveness of the Expected Negative loss in combination with consistency and spatial consistency losses, and we achieve improved multi-label classification mAP on ImageNet-1K using the ReaL multi-label validation set.
2023 IEEE/CVF WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV)
(2023)
Proceedings Paper
Computer Science, Artificial Intelligence
Abhishek Jha, Soroush Seifi, Tinne Tuytelaars
Summary: In active visual exploration, it is crucial to sample informative local observations for modeling global context. This paper proposes the use of vision transformers instead of CNNs for such agents and introduces a transformer-based active visual sampling model called SimGlim. The model utilizes the transformer's self-attention architecture to predict the best next location based on the current observable environment. Experimental results demonstrate the effectiveness of the proposed method in image reconstruction and comparisons against existing methods are provided. Ablation studies are also conducted to analyze the importance of design choices in the overall architecture.
2023 IEEE/CVF WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV)
(2023)
Proceedings Paper
Computer Science, Artificial Intelligence
Abhishek Jha, Badri Patro, Luc Van Gool, Tinne Tuytelaars
Summary: This paper proposes a novel regularization method called COB to improve the information content of the joint space in visual question answering models. It reduces redundancy by minimizing the correlation between learned feature components, disentangling semantic concepts. The model aligns the joint space with the answer embedding space and shows improved accuracy on VQA datasets.
2023 IEEE/CVF WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV)
(2023)
Proceedings Paper
Computer Science, Artificial Intelligence
Tim Lebailly, Tinne Tuytelaars
Summary: The downstream accuracy of self-supervised methods depends on the proxy task and the quality of the gradients extracted during training, and incorporating local cues into the proxy task can improve accuracy on downstream tasks. We propose a geometric approach for matching local representations in self-distillation, which outperforms similarity-based matching, especially in low-data regimes, where similarity-based matching is even detrimental compared to a baseline without local self-distillation.
2023 IEEE/CVF WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV)
(2023)
Proceedings Paper
Computer Science, Artificial Intelligence
Tingyu Qu, Tinne Tuytelaars, Marie-Francine Moens
Summary: This paper revisits the weakly supervised cross-modal face-name alignment task and proposes SECLA and SECLA-B models. These models use appropriate loss functions to learn the alignments between names and faces in a neural network setting. SECLA maximizes the similarity scores between faces and names in a weakly supervised fashion, while SECLA-B learns to align names and faces from easy to hard cases, further improving the performance.
2023 IEEE/CVF WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV)
(2023)
Proceedings Paper
Computer Science, Artificial Intelligence
Thomas Stegmuller, Tim Lebailly, Behzad Bozorgtabar, Tinne Tuytelaars, Jean-Philippe Thiran
Summary: In this paper, we propose a method for learning dense visual representations without labels by discovering and segmenting the semantics of views through an online clustering mechanism. The resulting method is highly generalizable and does not require cumbersome pre-processing steps.
2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR
(2023)
Proceedings Paper
Computer Science, Artificial Intelligence
J. Osstyn, F. Danckaers, A. Van Haver, J. Oramas, M. Vanhees, J. Sijbers
Summary: This article presents a fully automated algorithm for the reduction of displaced fractures, which is robust and closely resembles the manual reductions by surgeons.
2023 IEEE 20TH INTERNATIONAL SYMPOSIUM ON BIOMEDICAL IMAGING, ISBI
(2023)
Proceedings Paper
Computer Science, Information Systems
Adrien Bibal, Tassadit Bouadi, Benoit Frenay, Luis Galarraga, Jose Oramas
Summary: Recent technological advances rely on accurate decision support systems, whose complexity causes a lack of transparency that can lead to issues of trust and bias in decision-making; this has sparked the emergence of interpretable and explainable AI to address the problem.
PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT, CIKM 2022
(2022)
Article
Computer Science, Artificial Intelligence
Matthias De Lange, Rahaf Aljundi, Marc Masana, Sarah Parisot, Xu Jia, Ales Leonardis, Greg Slabaugh, Tinne Tuytelaars
Summary: This article surveys continual learning with artificial neural networks, focusing on task-incremental classification. It proposes a new framework for continually evaluating the stability-plasticity trade-off of the network and experimentally compares 11 state-of-the-art continual learning methods, evaluating their strengths and weaknesses across different benchmark datasets.
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE
(2022)
Article
Computer Science, Artificial Intelligence
Thanh-Son Nguyen, Basura Fernando
Summary: In this paper, a regularization-based image paragraph generation method is proposed. A novel multimodal encoding generator (MEG) is introduced to generate effective multimodal encoding that captures individual sentence, visual, and paragraph-sequential information. The generated encoding is utilized to regularize a paragraph generation model, leading to improved results in all evaluation metrics for the captioning model. The proposed MEG model, along with reinforcement learning optimization, achieves state-of-the-art results on the Stanford paragraph dataset. Extensive empirical analysis demonstrates the capabilities of MEG encoding, where qualitative visualization and multimodal sentence/image retrieval tasks show that MEG captures semantic and meaningful textual and visual information.
IEEE TRANSACTIONS ON IMAGE PROCESSING
(2022)
Proceedings Paper
Computer Science, Artificial Intelligence
Akash Singh, Tom de Schepper, Kevin Mets, Peter Hellinckx, Jose Oramas, Steven Latre
Summary: In recent years, there has been increasing interest in multi-label, multi-class video action recognition. This paper proposes a method that learns to reason over the semantic concept of objects and actions using relational networks. The empirical results show that artificial neural networks benefit from pretraining, relational inductive biases, and unordered set-based latent representations in action recognition tasks.
PROCEEDINGS OF THE 17TH INTERNATIONAL JOINT CONFERENCE ON COMPUTER VISION, IMAGING AND COMPUTER GRAPHICS THEORY AND APPLICATIONS (VISAPP), VOL 5
(2022)