parser

컴파일러 구조

Lexical analysis, Syntax analysis, Semantic analysis, Intermediate code generation, Optimization, Code generation

Context Free Grammar

문맥자유문법. 문맥에 따라 토큰의 의미가 달라지지 않는다는 것. 실제 프로그래밍 언어는 이 CFG의 subset이다. 'C bnf'로 검색하면 C언어의 BNF표기(CFG의 한가지 표현법)를 볼 수 있다. 그것만으로는 실제 C언어로는 의미상 말이 안되고 컴파일안되는 코드도 accept되는 CFG이다. 사실 모든 문자를 하나씩 tokenize하고 BNF가 A:=xA (x는 임의의 문자) 인 간단한 문법도 C언어를 Accept한다. 하지만 저 BNF로 만든 syntax tree는 프로그래밍 언어를 만드는데 있어 아무런 도움이 되지 않는다. 그러므로 최대한 compact하게 BNF를 만드는것도 중요한것 같다. 여기서부터 엔지니어링 영역인것 같고, 이후 semantic부여 과정이 추가적으로 들어가는데 semantic은 이론적으로 완벽한 방법은 없어서 이것도 엔지니어링 영역이라고 알고 있다.

Pushdown automata

CFG를 인식할 수 있는, DFA와는 다르게 각 노드마다 stack이 존재해서 그거의 top으로 다음 transition을 처리할 수 있는 오토마타. CFG를 처리할 수 있음. Deterministic pushdown automata는 deterministic CFG를 처리할 수 있음. deterministic CFG는 unambigious CFG의 subset이다. https://en.wikipedia.org/wiki/Deterministic_context-free_language https://en.wikipedia.org/wiki/Deterministic_context-free_grammar

추측

LL(k)나 LR(k)는 pushdown automata를 만들어서 처리하는 파서인가?

LL(k)는 Left부터 처리하며 다음 k개의 토큰을 pop없이 볼 수 있을때의 Top-down parsing.

LR(k)는 Left부터 처리하며 다음 k개의 토큰을 pop없이 볼 수 있을때의 shift-reduce parsing.

shift-reduce는 그 동작이 stack의 상태를 보고 reduce하는데, 그게 Right-most derivation을 사용한것과 동일한 결과이게 된다. 그래서 LR파서라고 부름. 아마도 shift-reduce는 LR(k)말고도 다른 아종들까지 다 합친 그룹을 말하고, 그중에서 LR(0)가 제일 기본형태인듯?

LL로 topdown 내려오면서 백트랙을 하지 않기 때문에 선형시간, LR도 stack에서 pop되는게 amortized라서 선형시간이다. 둘 다 deterministic CFG를 deterministic pushdown automata를 만들어서 처리하는것같다. LL(k)에서 k개 보고 결정하는게 아니라 그냥 하나도 안보고(=LL(0)) 가능한 모든 경우의 수를 백트래킹(시도해보고 안되면 retrieve)하면 naive recursive descent parser이다.

LR같은건 deterministic만 처리 가능하고, 일반적인 CFG를 처리하기 위해서는 다른 파서들이 있긴 한데 느린거같다. "Some methods which can parse arbitrary context-free languages (e.g., Cocke-Younger-Kasami, Earley, GLR) have worst-case performance of O(n^3) time." 근데 보통 프로그래밍 language설계할때 최대한 unambigious하게 설계하고 ambigiousity가 있더라도 token 한두개 look ahead로 결정가능하게 만드니까, 선형에 처리가능한 LR이 practical하게는 중요한것 같다.

While LR(k) grammars have equal generative power for all k≥1, the case of LR(0) grammars is slightly different. A language L is said to have the prefix property if no word in L is a proper prefix of another word in L.[12] A language L has an LR(0) grammar if and only if L is a deterministic context-free language with the prefix property.[13] As a consequence, a language L is deterministic context-free if and only if L$ has an LR(0) grammar, where "$" is not a symbol of L's alphabet.[14] LR(0)면 deterministic CFG만 처리 가능하고, LR(k>=1)은 모두 같은 처리능력을 가진다는듯. 그래서 LR(1)을 많이 쓰는듯.

LL(0)는 어떤 의미가 있나? look-ahead가 0이므로 현재 상태에서 다음토큰에 따라 derivation할 경우가 1개이면 그쪽으로 가고, 아니라면 바로 conflict이다. look-ahead가 없으니까. 이때 conflict를 해결하기 위해 backtracking으로 다 해보는게 recursive descent parser이고 그냥 아무거나 하나 내려가면 LL(0) parser의 한 구현체가 되는듯? Every LL(k) grammar is also an LR(k) grammar. 라는 위키내용이 있는걸 보면 그냥 LR파서가 짱인거같다.