Towards Interpretability of GPT-Style Models in Step-by-Step Games Through First-Order Logic
- Author(s)
- Klea Lena Kovacec
- Type
- Thesis
- Degree
- Master
- Department
- Department of AI Convergence, College of Information and Computing
- Advisor
- Kim, Sundong
- Abstract
- Understanding how transformer-based large language models make decisions remains an ongoing challenge in artificial intelligence. While these models achieve impressive performance, their internal workings and reasoning processes remain opaque black boxes, especially at the attention level. In this thesis, I explore mechanistic interpretability methods and attempt to combine their findings with First-Order Logic, proposing a framework that could systematically characterize strategic reasoning in game-playing transformer models and express it formally. I focus on models trained to play games such as Othello and chess, which provide a controlled domain where the rules and optimal strategies are fully known, making them ideal for interpreting the transformer's reasoning at the attention level. I synthesize insights from circuit-level interpretability, probing methodologies, neuro-symbolic systems with First-Order Logic, and the emergence of world models to identify computational pathways, detect encoded strategy heuristics, and translate attention patterns into explicit First-Order Logic formulas.
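The translation step described in the abstract can be illustrated with a minimal sketch. This is not the thesis's actual method; the threshold, the `Attends` predicate, the square names, and the toy attention matrix below are all hypothetical choices made for illustration, assuming only that salient attention weights between board positions are read off as ground FOL facts.

```python
# Hypothetical sketch: turning one attention head's pattern into
# First-Order Logic-style facts by thresholding its weights.
# The predicate name, threshold, squares, and matrix are made up.

THRESHOLD = 0.5  # assumed cutoff for a "salient" attention edge

def attention_to_fol(attn, squares, head="h0", threshold=THRESHOLD):
    """Emit Attends(head, from_sq, to_sq) facts for weights above threshold."""
    facts = []
    for i, row in enumerate(attn):
        for j, weight in enumerate(row):
            if weight >= threshold:
                facts.append(f"Attends({head}, {squares[i]}, {squares[j]})")
    return facts

# Toy 3x3 attention pattern over three Othello squares.
squares = ["d3", "c4", "e5"]
attn = [
    [0.1, 0.7, 0.2],
    [0.6, 0.1, 0.3],
    [0.2, 0.2, 0.6],
]
print(attention_to_fol(attn, squares))
```

Under these assumptions, the toy matrix yields facts such as `Attends(h0, d3, c4)`; a downstream symbolic layer could then quantify over such facts to express candidate strategy rules.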
- URI
- https://scholar.gist.ac.kr/handle/local/33853
- Fulltext
- http://gist.dcollection.net/common/orgView/200000953752
- Access & License
-
- File List
-
Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.