
Visual Automatic Localization Method Based on Multi-level Video Transformer
Abstract In industrial automated production lines, equipment anomaly detection directly determines machining quality, and a vision system composed of a robotic arm and an industrial camera mounted at its end can effectively monitor such anomalies. In this paper, a six-axis robotic arm carrying an industrial camera images the workpiece surface, acquiring a video sequence that transitions from blurry to clear and back to blurry; the clearest frame is then selected to guide the focusing distance required in automated machining, enabling focus-anomaly correction and thereby automatic localization. A video classification model based on a multi-level video Transformer, the Multi-level Video Transformer (MLVT), is proposed for high-level semantic video representation learning and is used to select the clearest frame in the sequence. First, a token partitioning method with multiple receptive fields, Multi-level Tokenization (MLT), is proposed; it divides the raw video data into token sequences at four levels (2D patch, 3D patch, frame, and clip), which, after positional encodings are added, are fed into the Multi-level Encoder (MLE) for attention computation. To mitigate the computational cost and slow convergence caused by multi-level tokens, the MLE introduces Level-wise Learnable Attention (LWLA), a level-wise deformable attention mechanism that computes feature similarity in a learnable manner in place of global attention. The three versions of the model achieve classification accuracies of 87.2%, 88.6%, and 88.9% on the video dataset of this paper, outperforming mainstream video Transformers of comparable parameter scale; the method effectively accomplishes the task of selecting the clearest frame from a video sequence and provides a strong guarantee for the performance of downstream vision tasks.

Objective This study investigates the application of a six-axis robotic arm equipped with a high-resolution industrial camera to capture precise images of workpiece surfaces. The setup is designed to acquire a video sequence illustrating the transition of image clarity: starting blurry, reaching optimal clarity, and then reverting to blurry. The primary goal is to select the clearest frame from this sequence, which determines the precise focusing distance required for automated machining. The camera is mounted on the robotic arm, which controls the camera's downward trajectory. As the camera descends, it records the shifting focus on the workpiece surface, from out-of-focus (blurry) to in-focus (clear) and back to out-of-focus. This fluctuation is crucial, because blurry images can significantly impair subsequent tasks, particularly the deep learning-based intelligent recognition systems used in modern manufacturing: blurry input may cause inaccurate feature recognition, degrading the quality and precision of automated operations. An effective video processing methodology addresses these challenges. Algorithms analyze the sequences captured by the industrial camera and identify the frame with optimal clarity and sharpness; this frame serves as critical feedback for adjusting the robotic arm, ensuring that the camera aligns with the position where the focal length is accurately calibrated to the workpiece's surface. This guarantees high-quality captured images and improves the overall efficiency of the machining process. By automating focus adjustment based on the clearest image, the system significantly reduces human error and enhances output consistency, which is crucial in high-precision manufacturing environments. In addition, integrating this technology into existing industrial setups is expected to streamline operations, decrease waste, and improve the speed and accuracy of production cycles. This study describes the technological integration, the challenges addressed, and the substantial enhancements to automated machining processes facilitated by this approach.
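For context on the frame-selection problem itself, the sketch below shows a minimal classical baseline that scores every frame with a hand-crafted sharpness measure (variance of the Laplacian) and returns the index of the sharpest frame. This is purely illustrative: it is not the Transformer-based method the paper proposes, and the function names and the use of OpenCV are assumptions of this sketch.

```python
# Illustrative classical baseline for clearest-frame selection.
# NOT the paper's MLVT method: it ranks frames by the variance of the
# Laplacian, a common hand-crafted sharpness measure.
import cv2
import numpy as np

def laplacian_sharpness(frame_bgr: np.ndarray) -> float:
    """Variance of the Laplacian: higher means sharper (more edge energy)."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    return float(cv2.Laplacian(gray, cv2.CV_64F).var())

def clearest_frame_index(video_path: str) -> int:
    """Scan a blurry-clear-blurry sequence; return the sharpest frame's index."""
    cap = cv2.VideoCapture(video_path)
    best_idx, best_score, idx = -1, -np.inf, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        score = laplacian_sharpness(frame)
        if score > best_score:
            best_idx, best_score = idx, score
        idx += 1
    cap.release()
    return best_idx
```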
Methods This study introduces a video classification model based on a multi-level video Transformer, the Multi-level Video Transformer (MLVT), designed for high-level semantic video representation learning. The model is developed to identify the clearest frame within a video sequence, a pivotal step for enhancing automated machining precision. The methodology begins with a novel token segmentation approach named Multi-level Tokenization (MLT), which divides the original video data into token sequences across four levels, 2D patch, 3D patch, frame, and clip, capturing a comprehensive range of spatial and temporal detail. After tokenization, positional encodings are applied to the tokens to preserve sequence order, which is crucial for processing time-dependent data. The tokens are then input into the newly developed Multi-level Encoder (MLE) for attention computation. At the core of the MLE are two attention modules, Level-wise Learnable Attention (LWLA) and Multi-level Cross Attention (MLCA), each stacked multiple times to deepen learning and integrate features more effectively. LWLA employs a deformable attention mechanism as a replacement for global attention; it computes feature similarity more flexibly and efficiently, reducing computational cost and mitigating the slow convergence commonly associated with global attention over long token sequences. MLCA, in contrast, transcends level boundaries by conducting global attention across the entire token sequence, fostering deeper integration of features at all levels. This integration is further enhanced by classification tokens introduced in the MLCA layer, which develop concurrently with the global tokens and are maintained throughout the processing stages. After multiple passes through the MLE blocks, these tokens are fed into a multi-layer perceptron for the final classification prediction.
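To make the four-level token layout concrete, here is a minimal multi-level tokenizer sketch in PyTorch. Every design choice in it, the convolutional patch embeddings, the average pooling used for the frame and clip tokens, the embedding width, and all names, is an assumption for illustration, not the authors' published implementation.

```python
# Minimal sketch of multi-level tokenization (MLT) in the spirit of the
# abstract: one token stream per level (2D patch, 3D patch, frame, clip).
# All layers, shapes, and names are assumptions, not the paper's code.
import torch
import torch.nn as nn

class MultiLevelTokenizer(nn.Module):
    def __init__(self, in_ch=3, dim=192, patch=16, tube_t=2):
        super().__init__()
        # 2D-patch tokens: 16x16 spatial patches taken frame by frame.
        self.patch2d = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        # 3D-patch tokens: spatio-temporal tubes spanning tube_t frames.
        self.patch3d = nn.Conv3d(in_ch, dim, kernel_size=(tube_t, patch, patch),
                                 stride=(tube_t, patch, patch))
        # Frame/clip tokens: global average pooling + linear projection,
        # a crude stand-in for whatever embedding the paper actually uses.
        self.frame_proj = nn.Linear(in_ch, dim)
        self.clip_proj = nn.Linear(in_ch, dim)

    def forward(self, video):            # video: (B, C, T, H, W); T % tube_t == 0,
        B, C, T, H, W = video.shape      # H and W divisible by `patch`
        frames = video.transpose(1, 2).reshape(B * T, C, H, W)
        tok2d = self.patch2d(frames).flatten(2).transpose(1, 2)   # (B*T, N, dim)
        tok2d = tok2d.reshape(B, T * tok2d.shape[1], -1)          # (B, T*N, dim)
        tok3d = self.patch3d(video).flatten(2).transpose(1, 2)    # (B, N3d, dim)
        tok_frame = self.frame_proj(video.mean(dim=(3, 4)).transpose(1, 2))  # (B, T, dim)
        tok_clip = self.clip_proj(video.mean(dim=(2, 3, 4))).unsqueeze(1)    # (B, 1, dim)
        # One flat sequence over all four levels; positional encodings would
        # be added here before the tokens enter the multi-level encoder (MLE).
        return torch.cat([tok2d, tok3d, tok_frame, tok_clip], dim=1)

# Example: an 8-frame 224x224 RGB clip yields 8*196 + 4*196 + 8 + 1 = 2361 tokens.
tokens = MultiLevelTokenizer()(torch.randn(2, 3, 8, 224, 224))
print(tokens.shape)  # torch.Size([2, 2361, 192])
```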
The empirical validation of this model uses data gathered on site with a robotic arm carrying an industrial camera, forming a unique dataset. The operational setup positions the camera vertically, facing downward above metal samples. The data collection protocol starts with the camera at a height where the sample appears blurry because the distance exceeds the focal length; the camera is then gradually lowered until the clearest possible image is captured, and the descent continues until blurriness re-emerges. This methodical movement generates a video sequence of "blurry - gradually clear - clearest - gradually blurry - blurry" images. From each sequence, the clearest image is selected and its index recorded as the ground-truth label, yielding the dataset used for training and testing the proposed model. This comprehensive approach ensures precise frame selection and contributes to the reliability and efficiency of automated processes that depend on accurate visual data.

Results and Discussions The empirical assessment of the Multi-level Video Transformer demonstrated encouraging outcomes across its three variants, which achieved classification accuracies of 87.2%, 88.6%, and 88.9% on the custom video dataset. These results signify substantial advancements in video processing for precision tasks. Compared with mainstream video Transformers of comparable parameter size, the proposed models display superior performance, highlighting the effectiveness of the specialized approach to selecting the clearest frame from video sequences. The precision of the proposed models facilitates identification of the sharpest frames and reduces potential errors in subsequent automated tasks that depend critically on image clarity. By attaining such high classification accuracies, the Multi-level Video Transformer affirms its robustness and reliability, establishing a strong benchmark for video classification tasks in industrial applications. In addition, these results offer compelling evidence that the proposed methodological innovations, namely MLT, the deformable attention mechanism, and cross-level attention integration, significantly enhance model performance. These advancements are particularly advantageous for tasks requiring detailed and precise frame selection, which are crucial in many industrial and manufacturing environments.

Conclusions The Multi-level Video Transformer meets and surpasses current standards for video classification in this setting, marking a significant advance over the compared technologies. The model lays the groundwork for more effective automated systems capable of operating with heightened accuracy and reduced human intervention, which will be particularly transformative in sectors where precision is critical, and it provides a robust foundation for further research and development in intelligent automation and machine learning applications in visual data processing.
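As a final illustration, the labeling protocol described in Methods, where each recorded blurry-clear-blurry sequence is paired with the index of its clearest frame, could be wrapped as a training dataset along the lines of the hypothetical sketch below; the file layout and all names are invented for the example.

```python
# Hypothetical sketch of the dataset implied by the labeling protocol:
# each sample is a video sequence plus the index of its clearest frame.
import json
import torch
from torch.utils.data import Dataset

class ClearestFrameDataset(Dataset):
    """Yields (video_tensor, clearest_frame_index) pairs for classification."""
    def __init__(self, index_file: str):
        # index_file: JSON list of {"video": <path to .pt tensor>, "label": int}
        with open(index_file) as f:
            self.items = json.load(f)

    def __len__(self) -> int:
        return len(self.items)

    def __getitem__(self, i):
        item = self.items[i]
        video = torch.load(item["video"])   # pre-extracted (C, T, H, W) tensor
        return video, item["label"]         # label = clearest frame's index
```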
Authors ZOU Qiping; LI Botao; CHEN Saian; GUO Xi; ZHANG Taohong (Key Laboratory of AI and Information Processing (Hechi University), Education Department of Guangxi Zhuang Autonomous Region, Hechi 546300, China; School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing 100083, China; Beijing Key Laboratory of Knowledge Engineering for Materials Science, Beijing 100083, China)
Source Advanced Engineering Sciences (《工程科学与技术》), 2024, No. 6, pp. 34-43 (10 pages); indexed by EI, CAS, CSCD, and the Peking University Chinese Core Journals list
Funding Science and Technology Innovation 2030 Major Project of the Ministry of Science and Technology (2020AAA0108703); Guangxi Universities Key Laboratory of AI and Information Processing Fund (2022GXZDSY007)
Keywords video Transformer; video classification; visual automatic localization; deformable attention