
Details

Video Moment Retrieval With Noisy Labels (indexed in SCI-EXPANDED and EI)   Times cited: 37

Document type: Journal article

English title: Video Moment Retrieval With Noisy Labels

Authors: Pan, Wenwen[1] Zhao, Zhou[1] Huang, Wencan[1] Zhang, Zhu[1] Fu, Liyong[2,3] Pan, Zhigeng[1] Yu, Jun[4] Wu, Fei[1]

First author: Pan, Wenwen

Corresponding author: Fu, LY[1]

Affiliations: [1] Zhejiang Univ, Coll Comp Sci, Hangzhou 310027, Peoples R China; [2] Chinese Acad Forestry, Inst Forest Resource Informat Tech, Beijing 100091, Peoples R China; [3] Natl Forestry & Grassland Adm, Key Lab Forest Management & Growth Modeling, Beijing 100091, Peoples R China; [4] Hangzhou Dianzi Univ, Coll Comp Sci, Hangzhou 310018, Peoples R China

Year: 2024

Volume: 35

Issue: 5

Pages: 6779-6791

Journal: IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS

Indexed in: EI (accession no. 20224613111246); Scopus (accession no. 2-s2.0-85141576892); WOS: SCI-EXPANDED (accession no. WOS:001214608800075)

Funding: This work was supported in part by the 14th Five-Year Plan Pioneering Project of High Technology Plan of the National Department of Technology under Grant 2021YFD2200405, in part by the National Key Research and Development Project of China under Grant 2017YFB1002803, in part by the Zhejiang Natural Science Foundation under Grant LR19F020006, in part by the Key Laboratory Foundation of Information Perception and Systems for Public Security of Ministry of Industry and Information Technology of the People's Republic of China (MIIT) (Nanjing University of Science and Technology) under Grant 210094 and Grant 202001, and in part by the National Natural Science Foundation of China under Grant 62072397 and Grant 61836002.

Language: English

Keywords: Noise measurement; Annotations; Training; Manuals; Feature extraction; Deep learning; Task analysis; Co-teaching; feature pyramid network; multilevel losses; noisy label learning; video moment retrieval (VMR)

Abstract: Video moment retrieval (VMR) aims to localize the target moment in an untrimmed video according to a given natural language query. Existing algorithms typically rely on clean annotations to train their models. However, manual annotation may introduce considerable noise, so video moment retrieval models may not be well trained in practice. In this article, we present a simple yet effective bottom-up video moment retrieval framework that is end-to-end and robust to training with noisy labels. Specifically, we extract multimodal features with syntactic graph convolutional networks and multihead attention layers, and fuse them with cross gates and a bilinear approach. Feature pyramid networks are then constructed to encode rich scene relationships and capture high-level semantics. Furthermore, to mitigate the effects of noisy annotations, we devise multilevel losses at two levels: a frame-level loss that improves noise tolerance and an instance-level loss that reduces the adverse effects of negative instances. At the frame level, we adopt Gaussian smoothing to treat noisy labels as soft labels through partial fitting. At the instance level, we exploit a pair of structurally identical models that teach each other during iterations. The resulting robust video moment retrieval model significantly outperforms state-of-the-art approaches in experiments on the standard public datasets ActivityCaption and textually annotated cooking scene (TACoS). We also evaluate the proposed approach under different manual annotation noises to further demonstrate its effectiveness.

