Practical approaches to apply Human Pose Estimation for real-world applications
- Author(s)
- LEE SANGHYUB
- Type
- Thesis
- Degree
- Doctor
- Department
- College of Information and Computing, Department of AI Convergence (Culture Technology Program)
- Advisor
- Hong, Jin-Hyuk
- Abstract
- Human Pose Estimation (HPE) has emerged as a vital area of research in computer vision, aiming to estimate human body configurations from visual input. This dissertation addresses the challenges of HPE by proposing a set of practical methodologies focused on efficiency, robustness, and real-world applicability. The research is structured around three interconnected studies, each tackling a distinct aspect of pose estimation while contributing to the broader objective of developing scalable, high-performing HPE systems.
The first study centers on multi-view 3D HPE, introducing a lightweight, markerless skeleton tracking algorithm that resolves self-occlusion, a persistent challenge in pose estimation. It does so by merging pose candidates derived from multiple RGB-D sensors using a combination of DBSCAN clustering and Kalman filtering. Because it avoids heavy deep learning models, the algorithm is suitable for real-time applications in resource-constrained environments. Experimental evaluations confirm its superiority in tracking limb joints under occlusion, underscoring the importance of sensor fusion and spatial redundancy. However, the approach still requires careful sensor installation, and its accuracy may degrade because of the inherent limitations of depth information, such as susceptibility to environmental interference (e.g., sunlight or reflective surfaces), suboptimal sensor placement, or simultaneous tracking of multiple individuals.
Building on the insights from the first study, the second study moves from controlled sensor environments to consumer-grade RGB videos, applying 3D human reconstruction (HR) techniques to dance education. The resulting system, DanceSculpt (DS), demonstrates how HR can deliver multi-angle visualizations of human movement without the limitations of depth-based systems, such as IR interference and complex calibration. By leveraging 3D avatars reconstructed from monocular input, DS gives learners detailed visual feedback, improving their understanding of posture, timing, and spatial formation. The successful application of this HR method to dance learning points to its potential in other motion-intensive educational contexts, linking it directly to the goals of the first study while expanding its practical relevance. Nonetheless, DS relies on a top-down approach, so inference time grows in proportion to the number of detected individuals, which makes real-time performance a challenge. Its inability to reconstruct poses accurately under severe occlusion, or when most body parts are not visible, also remains a significant limitation.
The third study builds on the system-level insights from the first two by introducing an advanced yet efficient 2D multi-person pose estimation framework that improves performance within a one-stage architecture. Rather than generalizing the earlier findings, this study focuses on achieving additional performance gains through a more effective integration of instance-centric attention mechanisms. InstaPose, developed in this stage, incorporates a novel Instance-Centric Keypoint Attention (ICKA) mechanism within a DETR-based transformer model. This design directly addresses a key limitation observed in earlier approaches, namely insufficient interaction between instance and keypoint queries, by enhancing contextual coherence and spatial precision. Extensive experiments on the MS COCO and CrowdPose datasets validate the framework's effectiveness, demonstrating its superiority in crowded scenes with minimal parameter overhead. The performance gains reflect lessons learned from both the robust merging strategies of the first study and the reconstruction-based feedback system of the second. However, InstaPose still inherits the limitations of transformer-based architectures, including relatively high computational cost and training complexity. The framework also has not been extended to 3D HPE, which limits its use in scenarios requiring full spatial understanding.
Together, these three studies form a cohesive research trajectory that moves progressively from multi-sensor integration to high-level model architecture. They highlight how HPE solutions can be adapted across varying input modalities and use cases, from high-precision tracking systems to educational and real-time applications. This dissertation contributes to the ongoing evolution of HPE by showing that accuracy, efficiency, and scalability are not mutually exclusive but can be realized together through thoughtful system design and cross-domain insight. The proposed approaches hold promise for a wide range of applications in which understanding and interpreting human motion is essential, including interactive learning, health monitoring, and sports analysis.
- URI
- https://scholar.gist.ac.kr/handle/local/31944
- Fulltext
- http://gist.dcollection.net/common/orgView/200000884269
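The abstract's first study merges per-joint pose candidates from several RGB-D sensors with DBSCAN clustering and Kalman filtering. The following is a minimal sketch of that general idea, not the dissertation's implementation. It makes several assumptions that are not in the record: all candidates are already expressed in a shared world coordinate frame, the largest cluster's centroid is taken as the merged joint position, each axis is smoothed independently with a random-walk Kalman filter, and every name and parameter value (`dbscan`, `merge_candidates`, `JointTracker`, `eps`, `min_pts`, `q`, `r`) is hypothetical.

```python
import math


def dbscan(points, eps, min_pts):
    """Minimal DBSCAN. Returns one label per point; -1 marks noise."""
    labels = [None] * len(points)

    def neighbors(i):
        # Indices of all points within eps of point i (includes i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs_j = neighbors(j)
            if len(nbrs_j) >= min_pts:
                queue.extend(nbrs_j)  # j is a core point, keep expanding
    return labels


def merge_candidates(candidates, eps=0.1, min_pts=2):
    """Centroid of the largest cluster, or None if every candidate is noise."""
    labels = dbscan(candidates, eps, min_pts)
    clusters = {}
    for point, label in zip(candidates, labels):
        if label != -1:
            clusters.setdefault(label, []).append(point)
    if not clusters:
        return None
    best = max(clusters.values(), key=len)
    return tuple(sum(axis) / len(best) for axis in zip(*best))


class ScalarKalman:
    """One-dimensional Kalman filter with a random-walk motion model (assumed)."""

    def __init__(self, x0, p0=1.0, q=1e-3, r=1e-2):
        self.x, self.p, self.q, self.r = x0, p0, q, r

    def update(self, z):
        self.p += self.q  # predict: state unchanged, uncertainty grows
        if z is not None:  # correct only when a measurement is available
            k = self.p / (self.p + self.r)
            self.x += k * (z - self.x)
            self.p *= 1.0 - k
        return self.x


class JointTracker:
    """Smooths one joint's merged 3D position with one filter per axis."""

    def __init__(self, position):
        self.filters = [ScalarKalman(v) for v in position]

    def update(self, measurement):
        if measurement is None:
            measurement = (None,) * len(self.filters)
        return tuple(f.update(z) for f, z in zip(self.filters, measurement))


# Three sensors roughly agree on a wrist position (metres); a fourth is an outlier.
candidates = [(0.50, 1.00, 2.00), (0.52, 1.01, 1.98),
              (0.51, 0.99, 2.01), (1.50, 0.20, 2.90)]
labels = dbscan(candidates, eps=0.1, min_pts=2)
merged = merge_candidates(candidates)
tracker = JointTracker(merged)
smoothed = tracker.update(merged)
held = tracker.update(None)  # no valid cluster this frame: hold the estimate
```

In this sketch, clustering rejects the candidate that disagrees with the others (for example, a joint misplaced by a sensor whose view is occluded), and the filter carries the estimate through frames in which no cluster forms. A constant-velocity motion model would be a natural alternative to the random-walk assumption used here.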
Items in Repository are protected by copyright, with all rights reserved, unless otherwise indicated.