### Introduction

##### Why do Parsing?
• Recovers "who did what to whom" information from a sentence.
• Reveals the linguistic structure of a sentence in terms of syntax.
• Parsed output can serve as input for various downstream machine learning tasks.
##### What Makes Parsing Hard?
• Ambiguity of sentence structure (which may be inherited from ambiguity in POS tagging) or of word segmentation.
• e.g. "The man saw the girl with a telescope." $→$ who has the telescope?
• e.g. Chinese characters $→$ how to split them into tokens?
• Infinitely many possible sentences.
• The parser may have to handle a sentence that has never been written before.

### What is Parsing?

Mapping the input space of sentences $X$ to an output space of syntactic representations $Y$.

• Input $X$
• We generally assume that an input sentence $x ∈ X$ consists of a sequence of tokens $x_1, x_2, ⋯ , x_n$.
• We also generally assume that the tokenization of the sentence is not part of the parsing task and that tokenization has been done properly (e.g. annotated, delimited, etc.) beforehand.
• Output $Y$
• The definition of what counts as a "syntactic representation" $y ∈ Y$ differs across various linguistic theories in terms of what properties or relations should be represented.
• Phrase Structure (= Constituent Structure) is a historically famous representation type which is naturally induced by context-free grammars (CFGs) and is widely used in annotation schemes for treebanks.
• In this case, a sentence is recursively decomposed into smaller segments that are categorized into phrases (e.g. noun phrase NP, verb phrase VP, etc.) according to their internal structure.
• Dependency Structure is a more recently popular representation type and works especially well for languages with free or flexible word order (e.g. Czech).
• In this case, a sentence is analyzed by connecting its words with binary asymmetrical dependency relations (e.g. head, dependent) while also categorizing words into functional roles (e.g. subject, object, etc.).
• Mapping
• The historic notion of parsing was intimately tied to the notion of grammaticality. Under this view, the mapping is more of a relation than a function: it relates a sentence to its possible representations, so an ambiguous sentence maps to more than one representation, and an ungrammatical sentence maps to no representation at all.
• Nowadays, we define the mapping as a function that returns the contextually most appropriate representation (among the various possible representations) of the input sentence. To do so, the parser must be a mathematical model that can (1) generate possible representations for a given input sentence and (2) numerically score those representations and return the one with the highest score. We define the model as $M = (\text"GEN", \text"EVAL")$ where,
• $\text"GEN"$ is the generative component that returns a set of candidate representations $\{y_1, y_2, ⋯ , y_k\}$ for an input $x$ such that $\text"GEN"(x) ⊆ Y$ for $x ∈ X$
• $\text"EVAL"$ is the evaluative component that ranks candidate representations via a scoring scheme such that $\text"EVAL"(y) ∈ ℝ$ for $y ∈ \text"GEN"(x)$
• Then, the best representation $y↖*$ can be found by solving the following equation:

$$y↖* = {argmax}↙{y ∈ \text"GEN"(x)} \ \text"EVAL"(y)$$
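The $M = (\text"GEN", \text"EVAL")$ decomposition above can be sketched in a few lines of code. This is a toy illustration only: the candidate representations and the fixed scores below are made-up assumptions, not the output of a real parser.

```python
# Toy sketch of a parsing model M = (GEN, EVAL).
# Candidates and scores are illustrative assumptions.

def GEN(x):
    """Return candidate representations for sentence x.
    Here: the two attachment readings of the telescope example."""
    return [
        ("vp-attach", "the seeing was done with a telescope"),
        ("np-attach", "the girl has a telescope"),
    ]

def EVAL(y):
    """Assign a real-valued score to a candidate.
    A real EVAL would be a learned scoring model; these
    numbers are made up for the example."""
    scores = {"vp-attach": 0.7, "np-attach": 0.3}
    return scores[y[0]]

def parse(x):
    """Return y* = argmax over GEN(x) of EVAL(y)."""
    return max(GEN(x), key=EVAL)

best = parse("The man saw the girl with a telescope .")
```

Any concrete parser in the sections below can be seen as one particular choice of GEN (the derivation procedure) paired with one particular choice of EVAL (the scoring scheme).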

### Main Types of Parsing Models $M = (\text"GEN", \text"EVAL")$

• Phrase Structure (= Constituency) Parsing
• $\text"GEN"$: Derivation of phrase structure representations of the sentence. Can be solved by either a top-down approach or a bottom-up approach.
• CKY Algorithm
• one of the earliest recognition and parsing algorithms
• a bottom-up dynamic programming approach, which solves the problem for the span $[i, j]$ by solving the sub-spans $[i, k]$ and $[k, j]$, where $i < k < j$ and $0 ≤ i, j ≤ {|x|}$, given that $x$ is the input sentence and $i, j, k$ are positions between the words of $x$
• standard version can only recognize CFGs in Chomsky normal form (i.e. with binary branching rules)
• recovers the parse tree through backtracking
• requires time complexity of $O(n^3)$ and space complexity of $O(n^2)$
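The bottom-up chart filling described above can be sketched as a CKY recognizer. The toy CNF grammar and lexicon here are assumptions for illustration:

```python
from collections import defaultdict

# Toy grammar in Chomsky normal form (binary rules only) and a
# toy lexicon -- both made up for illustration.
GRAMMAR = {
    ("NP", "VP"): {"S"},
    ("V", "NP"): {"VP"},
    ("Det", "N"): {"NP"},
}
LEXICON = {"the": {"Det"}, "man": {"N"}, "girl": {"N"}, "saw": {"V"}}

def cky_recognize(words, start="S"):
    n = len(words)
    # chart[(i, j)] = set of nonterminals deriving words[i:j]
    chart = defaultdict(set)
    for i, w in enumerate(words):
        chart[(i, i + 1)] = set(LEXICON.get(w, ()))
    # fill the chart bottom-up over increasing span lengths:
    # span [i, j] is solved from sub-spans [i, k] and [k, j], i < k < j
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):
                for B in chart[(i, k)]:
                    for C in chart[(k, j)]:
                        chart[(i, j)] |= GRAMMAR.get((B, C), set())
    return start in chart[(0, n)]
```

The three nested span loops give the $O(n^3)$ time bound, and the chart over $O(n^2)$ spans gives the space bound; recovering the actual tree would additionally require storing backpointers at each chart cell.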
• Earley's Algorithm
• a chart parser that combines bottom-up completion with top-down prediction
• maintains sets of dotted grammar rules (states) that reflect what the parser has seen so far; three operations (predictor, scanner, completer) extend these states
• requires time complexity of $O(n^3)$ in the worst case, but performs better on particular sub-classes of grammars (e.g. $O(n^2)$ for unambiguous grammars)
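A compact sketch of Earley's algorithm, showing the predictor, scanner, and completer operating on dotted-rule states; the grammar and lexicon are made up for illustration:

```python
# Earley recognizer sketch. A state is a dotted rule
# (lhs, rhs, dot, origin); grammar and lexicon are toy assumptions.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "N"]],
    "VP": [["V", "NP"], ["V"]],
}
LEXICON = {"the": "Det", "man": "N", "girl": "N", "saw": "V"}

def earley_recognize(words, start="S"):
    n = len(words)
    chart = [[] for _ in range(n + 1)]

    def add(state, k):
        if state not in chart[k]:
            chart[k].append(state)

    add((start + "'", (start,), 0, 0), 0)  # augmented start rule
    for k in range(n + 1):
        i = 0
        while i < len(chart[k]):
            lhs, rhs, dot, origin = chart[k][i]
            if dot < len(rhs):
                sym = rhs[dot]
                if sym in GRAMMAR:
                    # predictor: top-down expansion of the expected nonterminal
                    for prod in GRAMMAR[sym]:
                        add((sym, tuple(prod), 0, k), k)
                elif k < n and LEXICON.get(words[k]) == sym:
                    # scanner: consume the next input token
                    add((lhs, rhs, dot + 1, origin), k + 1)
            else:
                # completer: bottom-up propagation of a finished constituent
                for lhs2, rhs2, dot2, origin2 in chart[origin]:
                    if dot2 < len(rhs2) and rhs2[dot2] == lhs:
                        add((lhs2, rhs2, dot2 + 1, origin2), k)
            i += 1
    return (start + "'", (start,), 1, 0) in chart[n]
```

Note how the predictor injects top-down expectations into the chart while the completer propagates finished constituents bottom-up, which is exactly the mixed control strategy described above.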
• Shift-Reduce Algorithm
• a deterministic transition-based process whose actions depend on the parsing state
• the parsing state is defined by the stack, which holds the partial parse trees already built, and the buffer, which holds the words that have not been parsed yet
• transition action can either be a "shift" or a "reduce"
• requires time complexity of $O(n)$ as long as no backtracking is needed
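A minimal shift-reduce sketch over stack and buffer. The reduction rules and the greedy "reduce whenever possible" control strategy are toy assumptions; a real parser chooses between shift and reduce using a table-driven or learned oracle:

```python
# Toy shift-reduce parser: rules and lexicon are made-up assumptions.
RULES = {
    ("Det", "N"): "NP",
    ("V", "NP"): "VP",
    ("NP", "VP"): "S",
}
LEXICON = {"the": "Det", "man": "N", "girl": "N", "saw": "V"}

def shift_reduce(words):
    stack = []                                  # partial parse trees already built
    buffer = [(LEXICON[w], w) for w in words]   # words not yet parsed
    while buffer or len(stack) > 1:
        top2 = tuple(t[0] for t in stack[-2:])
        if len(stack) >= 2 and top2 in RULES:
            # reduce: replace the top two trees with their parent node
            right = stack.pop()
            left = stack.pop()
            stack.append((RULES[top2], left, right))
        elif buffer:
            # shift: move the next word from the buffer onto the stack
            stack.append(buffer.pop(0))
        else:
            raise ValueError("no action applicable: parse fails")
    return stack[0]
```

Because every word is shifted once and every reduce consumes material already built, the number of transitions is linear in the sentence length, which is where the $O(n)$ bound comes from.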
• $\text"EVAL"$: PARSEVAL metrics
• precision: the fraction of constituents in the candidate parse that are correct (i.e. also appear in the gold standard)
• recall: the fraction of constituents in the gold standard that are recovered in the candidate parse
• F score: the harmonic mean of precision and recall, where $$F = 2 · {\text"precision" · \text"recall"} / {\text"precision" + \text"recall"}$$
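The PARSEVAL metrics above can be computed by comparing labeled spans. Here each constituent is represented as a (label, start, end) triple; the example constituent sets are made up:

```python
# PARSEVAL-style scoring sketch over labeled spans (label, start, end).

def parseval(candidate, gold):
    correct = len(set(candidate) & set(gold))
    precision = correct / len(candidate)
    recall = correct / len(gold)
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

# Made-up example: the candidate gets one NP span wrong.
gold = {("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("NP", 3, 5)}
cand = {("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("NP", 3, 4)}
p, r, f = parseval(cand, gold)  # p = r = f = 0.75
```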
• Dependency Parsing
• $\text"GEN"$: Derivation of dependency representations of the sentence. Can be solved by either a transition-based approach or a graph-based approach.
• Transition-based Approaches
• Greedy Classifier-based Parsing
• Beam Search
• Structured Learning
• Dynamic Oracles
• Extensions: Non-projective Parsing, Joint Parsing with POS Tagging
• Graph-based Approaches
• Maximum Spanning Tree (MST) Parsing
• $\text"EVAL"$: Attachment scores
• Unlabeled Attachment Score (UAS): the percentage of words that are assigned the correct head
• Labeled Attachment Score (LAS): the percentage of words that are assigned both the correct head and the correct dependency relation label
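Both attachment scores can be sketched by representing each dependency parse as one (head, label) pair per word; the example trees below are made up:

```python
# Attachment-score sketch: predicted and gold are parallel lists of
# (head index, dependency label) pairs, one per word.

def attachment_scores(predicted, gold):
    n = len(gold)
    # UAS: correct head only; LAS: correct head AND correct label
    uas = sum(p[0] == g[0] for p, g in zip(predicted, gold)) / n
    las = sum(p == g for p, g in zip(predicted, gold)) / n
    return uas, las

# Made-up example: all heads correct, one label wrong.
gold = [(2, "det"), (3, "nsubj"), (0, "root"), (5, "det"), (3, "obj")]
pred = [(2, "det"), (3, "nsubj"), (0, "root"), (5, "det"), (3, "iobj")]
uas, las = attachment_scores(pred, gold)  # uas = 1.0, las = 0.8
```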