Basics of Parsing in NLP


Why do Parsing?
  • Gives "who did what to whom" info of sentence.
  • Gives linguistic structure in terms of syntax.
  • Parsed result can be used for various machine learning tasks.
What Makes Parsing Hard?
  • Ambiguity of sentence structure (which could have been inherited from the ambiguity of POS tagging) or word segmentation.
    • e.g. "The man saw the girl with a telescope." $→$ who is having the telescope?
    • e.g. Chinese characters $→$ how to split into tokens?
  • Infinite possible number of sentences.
  • May have to do it on a sentence that have never been written before.

What is Parsing?

Mapping the input space of sentences $X$ to an output space of syntactic representations $Y$.

  • Input $X$
    • We generally assume that an input sentence $x ∈ X$ is a sentence consisting of a sequence of $x_1, x_2, ⋯ , x_n$ tokens.
    • We also generally assume that the tokenization of the sentence is not part of the parsing task and that tokenization has been done properly (e.g. annotated, delimited, etc.) beforehand.
  • Output $Y$
    • The definition of what counts as a "syntactic representation" $y ∈ Y$ differs across various linguistic theories in terms of what properties or relations should be represented.
    • Phrase Structure (= Constituent Structure) is a historically famous representation type which are naturally induced by CFGs and is widely used in annotation schemes for treebanks.
      • In this case, a sentence is recursively decomposed into smaller segments that are categorized into phrases (e.g. noun phrase NP, verb phrase VP, etc.) according to their internal structure.
    • Dependency Structure is a more recently famous representation type and works especially well for languages with free or flexible word order (e.g. Czech).
      • In this case, a sentence is analyzed by connecting its words with binary asymmetrical dependency relations (e.g. head, dependent) while also categorizing words into functional roles (e.g. subject, object, etc.).
  • Mapping
    • The historic notion of parsing was intimately tied to the notion of grammaticality. This implied mapping to be more of a relation than a function, which would focus on relating a sentence to a possible representation. In which case, an ambiguous sentence would be mapped to more than one representation, and a grammatically incorrect sentence would be mapped to no representation.
    • Nowadays, we define mapping to be a function that returns the contextually most appropriate representation (among the various possible representations) of the input sentence. In order to do so, we need the parser to be a mathematical model which should be able to (1) generate possible representations for a given input sentence and (2) return the best representation that has the highest score after numerically scoring the possible representations. By defining the model as $M = (\text"GEN", \text"EVAL")$ where,
      • $\text"GEN"$ is the generative component that returns a set of candidate representations $\{y_1, y_2, ⋯ , y_k\}$ for an input $x$ such that $\text"GEN"(x) ⊆ Y$ for $x ∈ X$
      • $\text"EVAL"$ is the evaluative component that ranks candidate representations via a scoring scheme such that $\text"EVAL"(y) ∈ ℝ$ for $y ∈ \text"GEN"(x)$
    • Then, the best representation $y↖*$ can be found by solving the following equation:

$$y↖* = {argmax}↙{y ∈ \text"GEN"(x)} \ \text"EVAL"(y)$$

Main Types of Parsing Models $M = (\text"GEN", \text"EVAL")$

  • Phrase Structure (= Constituency) Parsing
    • $\text"GEN"$: Derivation of phrase structure representations of the sentence. Can be solved by either a top-down approach or a bottom-up approach.
      • CKY Algorithm
        • one of the earliest recognition and parsing algorithms
        • a bottom-up dynamic programming approach, which solves the problem for $[i, j]$ by solving $[i, k]$ and $[k, j]$, where $i < j < k$ and $0 ≤ i,j ≤ {|x|}$, given that $x$ is the input sentence and $i,j,k$ are the indices of the words in $x$
        • standard version can only recognize CFGs with binary branching rules
        • recovers the parse tree through backtracking
        • requires time complexity of $O(n^3)$ and space complexity of $O(n^2)$
      • Earley's Algorithm
        • a bottom-up parsing guided by top-down predictions
        • maintains sets of dotted grammar rules (e.g. predictor, completer, scanner) reflect what the parser has seen so far
        • requires time complexity of $O(n^3)$ but performs better on particular sub-classes
      • Shift-Reduce Algorithm
        • deterministic transition based process depending on the parsing state
        • parsing state is defined by the stack, that holds parse trees already been built, and the buffer, that holds the words which have not been parsed yet
        • transition action can either be a "shift" or a "reduce"
        • requires time complexity of $O(n)$ without the backtracking part
    • $\text"EVAL"$: PARSEVAL metrics
      • precision: the number of correct constituents out of the number of constituents in the candidate parse
      • recall: the number of correct constituents out of the number of constituents in the gold standard
      • F score: the harmonic mean of precision and recall, where $$F = 2 · {\text"precision" · \text"recall"} / {\text"precision" + \text"recall"}$$
  • Dependency Parsing
    • $\text"GEN"$: Derivation of dependency representations of the sentence. Can be solved by either a transition-based approach or a graph-based approach.
      • Greedy Classifier-based Parsing
      • Beam Search
      • Structured Learning
      • Dynamic Oracles
      • Non-projective Parsing
      • Joint Parsing with POS Tagging
      • MST
    • $\text"EVAL"$: Attachment scores
      • Unlabeled Attachment Score (UAS)
      • Labeled Attachment Score (LAS)


Open Comments