High-Precision Feature Pair Selection Techniques in Monocular Visual Odometry Using Transformer Models
- Abstract
- Monocular Visual Odometry (VO) estimates the position and attitude changes of a moving platform using a single camera. Its accuracy is limited, however, by the inherent depth ambiguity of monocular images and by the difficulty of extracting reliable feature points. This study proposes a method that improves monocular VO by combining the classical SIFT (Scale-Invariant Feature Transform) feature extractor with the self-attention mechanism of the Transformer. Robust feature points are extracted with SIFT, and correspondences between consecutive image pairs are established with the FLANN matcher. The matched feature-point pairs are fed to a Transformer model, whose self-attention mechanism selects the matches that are effective for pose estimation. This reduces the error introduced by incorrect matches and improves the accuracy of position estimation. Experiments used the KITTI Odometry dataset; straight-motion segments were isolated to evaluate the potential for 6-DOF pose estimation, and forward-distance estimation was then performed. In the preprocessing phase, sequences of 10 images were constructed and feature coordinates were normalized for Transformer training. This research presents a novel approach to selecting effective matching points with the attention mechanism and demonstrates the potential of Transformer-based methods for autonomous driving and robot vision systems.
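- The front end described in the abstract (SIFT extraction, FLANN matching between consecutive frames, coordinate normalization) can be illustrated in code. The following is a minimal sketch assuming OpenCV's standard SIFT and FLANN APIs; the function name `match_frame_pair` and its parameters are illustrative assumptions, not the thesis implementation:

```python
# Minimal sketch of the pipeline's front end, as summarized in the abstract.
# Assumes OpenCV >= 4.4 (cv2.SIFT_create); names here are illustrative only.
import cv2
import numpy as np

def match_frame_pair(img1, img2, ratio=0.75):
    """Extract SIFT features from two consecutive grayscale frames and
    return normalized coordinates of FLANN-matched keypoint pairs."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)

    # FLANN with a KD-tree index (algorithm=1), the usual choice for
    # SIFT's floating-point descriptors.
    flann = cv2.FlannBasedMatcher(dict(algorithm=1, trees=5), dict(checks=50))
    knn = flann.knnMatch(des1, des2, k=2)

    # Lowe's ratio test to discard ambiguous matches.
    good = []
    for pair in knn:
        if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance:
            good.append(pair[0])

    h, w = img1.shape[:2]
    pairs = np.array(
        [[*kp1[m.queryIdx].pt, *kp2[m.trainIdx].pt] for m in good],
        dtype=np.float32,
    )
    # Normalize pixel coordinates to [0, 1] so the Transformer receives
    # scale-independent inputs (the abstract mentions normalization).
    pairs /= np.array([w, h, w, h], dtype=np.float32)
    return pairs  # shape: (num_matches, 4) -> (x1, y1, x2, y2)

# Example usage on two consecutive KITTI frames (paths are placeholders):
# f1 = cv2.imread("000000.png", cv2.IMREAD_GRAYSCALE)
# f2 = cv2.imread("000001.png", cv2.IMREAD_GRAYSCALE)
# matched_pairs = match_frame_pair(f1, f2)
```

- Each row of the returned array is one matched pair; a set of such rows from a sequence forms the token set over which the Transformer's self-attention can weigh reliable against unreliable matches, per the approach described above.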
- Author(s)
- 신건우
- Issued Date
- 2025
- Type
- Thesis
- URI
- https://scholar.gist.ac.kr/handle/local/19341
- Alternative Author(s)
- Gunwoo Shin
- Department
- Graduate School, AI Graduate School
- Advisor
- Lee, Yong-Gu
- Table Of Contents
- Abstract
List of Contents
List of Tables
List of Figures
I. Introduction
1.1 Background
1.2 Understanding of the Monocular Camera
1.2.1 Properties of the Monocular Camera Environment
1.2.2 Difficulties of Monocular Visual Odometry
1.3 Visual Odometry and Autonomous Driving
1.3.1 Visual Odometry as a Localization Technology
1.3.2 Development into SLAM
1.4 The Need to Improve Monocular Visual Odometry
1.4.1 Limitations of MLPs
1.4.2 Importance of Sequential Data
1.5 Emergence and Applications of the Transformer
1.5.1 Overview of the Transformer
1.5.2 Transformers in the Field of Computer Vision
1.6 Research Purpose and Contributions
1.7 Structure of the Paper
II. Related Works
2.1 Existing Methods for Monocular Visual Odometry
2.1.1 Feature-Based Visual Odometry
2.1.2 Direct Methods
2.1.3 Deep Learning-Based Visual Odometry
2.2 Role and Applications of the Attention Mechanism
2.2.1 Overview of the Attention Mechanism
2.2.2 Applications of Attention Mechanisms in Computer Vision
2.2.3 Attention Mechanisms in Visual Odometry
2.3 Applications of Transformers in Computer Vision
2.3.1 Vision Transformer (ViT)
2.3.2 Transformer-Based Object Detection and Segmentation
2.3.3 Transformer Applications to Visual Sequence Data
2.4 Research Trends in Transformer-Based Visual Odometry
2.5 Limitations of Related Work and Contributions of This Study
III. Data Preparation
3.1 KITTI Odometry Dataset
3.2 Quaternion Transformation of Pose Information and 6-DOF Odometry
3.2.1 Representation of Pose Information
3.2.2 Transformation to Quaternions
3.2.3 Objective of the Model
3.3 Preprocessing of Data Sequences
3.3.1 Composition of Image Sequences
3.3.2 Feature Matching and Storage
3.3.3 Data Splitting
3.4 Summary
IV. Method
4.1 Model Architecture
4.1.1 Matching Embedding
4.1.2 Positional Encoding
4.1.3 Transformer Layer
4.2 Loss Function
V. Experiments and Results
5.1 Experimental Setup
5.2 Training Loss and Results
5.3 Experimental Conclusions
VI. Conclusion and Future Work
References
Appendix
- Degree
- Master
Appears in Collections:
- Department of AI Convergence > 3. Theses(Master)