Towards Interpretability of GPT-Style Models in Step-by-Step Games Through First-Order Logic

Author(s)
Klea Lena Kovacec
Type
Thesis
Degree
Master
Department
Department of AI Convergence, College of Information and Computing
Advisor
Kim, Sundong
Abstract
Understanding how transformer-based large language models make decisions remains a continuing challenge in artificial intelligence. While these models achieve impressive performance, their internal workings and reasoning processes remain opaque black boxes, especially at the attention level. In this thesis, I explore mechanistic interpretability methods and attempt to combine their findings with First-Order Logic, proposing a framework that could systematically characterize strategic reasoning in game-playing transformer models and express it formally. I focus on models trained to play games such as Othello and chess, which provide a controlled domain where rules and optimal strategies are fully known, making them ideal for interpreting the transformer's reasoning at the attention level. I synthesize insights from circuit-level interpretability, probing methodologies, neuro-symbolic systems with First-Order Logic, and the emergence of world models to identify computational pathways, detect encoded strategy heuristics, and translate attention patterns into explicit First-Order Logic formulas.
URI
https://scholar.gist.ac.kr/handle/local/33853
Fulltext
http://gist.dcollection.net/common/orgView/200000953752
Access & License
  • Access type: Open
File List
  • No related files exist.

Items in Repository are protected by copyright, with all rights reserved, unless otherwise indicated.