A few days ago I was working on pattern mining over huge files and ended up with millions of patterns, of widely varying length (from 2 to 150). Now I am looking into regex-generation algorithms and came across 'grammar induction', a topic we touched on back in university. But there is much more to it.
Grammar induction
Grammar induction, also known as grammatical inference or syntactic pattern recognition, is the process in machine learning of learning a formal grammar (usually as a collection of rewrite rules or productions, or alternatively as a finite state machine or automaton) from a set of observations. There is now a rich literature on learning different types of grammars and automata, under various learning models and using various methodologies, so it is worth going back to the books and reading up on them.
Grammatical inference[1] has often focused on the problem of learning finite state machines of various types (see induction of regular languages below), since efficient algorithms for this problem have existed since the 1980s. A recent textbook, de la Higuera (2010)[1], covers the theory of grammatical inference of regular languages and finite state automata. More recently these approaches have been extended to the inference of context-free grammars and richer formalisms, such as multiple context-free grammars and parallel multiple context-free grammars. Other classes of grammars for which grammatical inference has been studied are contextual grammars and pattern languages. Here is a summary of the main approaches and related language classes:
- Grammatical inference by genetic algorithms[2]
- Grammatical inference by greedy algorithms
- Context-free grammar generating algorithms
- Lempel-Ziv-Welch algorithm[3]
- Sequitur
- Distributional Learning algorithms
- Context-free grammars languages
- Mildly context-sensitive languages
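Of the algorithms above, Lempel-Ziv-Welch[3] is the simplest to experiment with: while encoding, it implicitly learns a dictionary of repeated substrings, which is the same intuition behind grammar-based methods like Sequitur. A minimal sketch in Python (the function name and dictionary seeding are my own choices, not a reference implementation):

```python
def lzw_encode(data: str):
    """Sketch of LZW encoding: grow a dictionary of seen substrings
    and emit one code per longest known prefix."""
    # Seed the dictionary with the single characters of the input.
    codes = {ch: i for i, ch in enumerate(sorted(set(data)))}
    result, current = [], ""
    for ch in data:
        if current + ch in codes:
            current += ch                     # extend the longest known match
        else:
            result.append(codes[current])     # emit code for the match so far
            codes[current + ch] = len(codes)  # learn a new phrase
            current = ch
    if current:
        result.append(codes[current])
    return result

print(lzw_encode("abababab"))  # → [0, 1, 2, 4, 1]
```

Eight input characters compress to five codes; on highly repetitive pattern files, the learned phrase dictionary itself is a crude grammar for the input.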
Induction of regular languages
Induction of regular languages refers to the task of learning a formal description (e.g. a grammar) of a regular language from a given set of example strings. Language identification in the limit[4] is a formal model for inductive inference. A regular language is defined as a (finite or infinite) set of strings that can be described by one of the mathematical formalisms "finite automaton", "regular grammar", or "regular expression", all of which have the same expressive power. A regular expression can be one of the following:
- ∅ (denoting the empty set of strings),
- ε (denoting the singleton set containing just the empty string),
- a (where a is any character in Σ; denoting the singleton set just containing the single-character string a),
- r+s (where r and s are, in turn, simpler regular expressions; denoting their set's union)
- r⋅s (denoting the set of all possible concatenations of strings from r 's and s 's set),
- r+ (denoting the set of n-fold repetitions of strings from r 's set, for any n≥1), or
- r* (similarly denoting the set of n-fold repetitions, but also including the empty string, seen as 0-fold repetition).
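These constructors are small enough that a matcher can be written directly from the definitions. One compact textbook technique (Brzozowski derivatives — my choice of method, not mentioned in the sources above) repeatedly rewrites the regex with respect to each input character; r+ can be encoded as r⋅r*. An illustrative sketch, with regexes as nested tuples:

```python
# Regexes as tuples: ("empty",), ("eps",), ("char", c),
# ("union", r, s), ("cat", r, s), ("star", r).
EMPTY, EPS = ("empty",), ("eps",)

def nullable(r):
    """Does r's language contain the empty string?"""
    tag = r[0]
    if tag == "eps" or tag == "star":
        return True
    if tag == "union":
        return nullable(r[1]) or nullable(r[2])
    if tag == "cat":
        return nullable(r[1]) and nullable(r[2])
    return False  # "empty" and "char"

def deriv(r, c):
    """Derivative of r w.r.t. character c: the strings of r that
    start with c, with that leading c removed."""
    tag = r[0]
    if tag in ("empty", "eps"):
        return EMPTY
    if tag == "char":
        return EPS if r[1] == c else EMPTY
    if tag == "union":
        return ("union", deriv(r[1], c), deriv(r[2], c))
    if tag == "cat":
        d = ("cat", deriv(r[1], c), r[2])
        # If r[1] can match empty, c may also start a match of r[2].
        return ("union", d, deriv(r[2], c)) if nullable(r[1]) else d
    if tag == "star":
        return ("cat", deriv(r[1], c), r)

def matches(r, s):
    for c in s:
        r = deriv(r, c)
    return nullable(r)

# (a+b)*⋅a : strings over {a, b} ending in a
re_ = ("cat", ("star", ("union", ("char", "a"), ("char", "b"))), ("char", "a"))
print(matches(re_, "abba"))  # → True
print(matches(re_, "abb"))   # → False
```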
The largest and the smallest sets containing the given strings are called the trivial overgeneralization (all of Σ*) and the trivial undergeneralization (just the sample strings themselves), respectively; a useful learner must land somewhere in between.
Brill[5] introduced reduced regular expressions. A reduced regular expression can be
- a (where a is any character in Σ; denoting the singleton set just containing the single-character string a),
- ¬a (denoting any other single character in Σ except a),
- • (denoting any single character in Σ)
- a*, (¬a)*, or •* (denoting arbitrarily many, possibly zero, repetitions of characters from the set of a, ¬a, or •, respectively), or
- r⋅s (where r and s are, in turn, simpler reduced regular expressions; denoting the set of all possible concatenations of strings from r 's and s 's set).
Given an input set of strings, Brill builds, step by step, a tree with each branch labeled by a reduced regular expression accepting a prefix of some input strings, and each node labeled with the set of lengths of accepted prefixes[5]. His aim is to learn correction rules for English spelling errors rather than to settle theoretical questions about the learnability of language classes; consequently, he uses heuristics to prune the tree build-up, leading to a considerable improvement in run time.
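Because a reduced regular expression is just a flat concatenation of atoms (no nesting, no union), matching it is much simpler than matching a general regex. A sketch of a matcher, assuming my own representation of an atom as a `(char, negated, starred)` triple with `char=None` standing for `•` (this is illustrative, not Brill's code):

```python
def atom_matches(atom, ch):
    """Does a single character satisfy the atom a, ¬a, or •?"""
    want, negated, _ = atom
    if want is None:                      # '•' : any character
        return True
    return (ch != want) if negated else (ch == want)

def reduced_match(atoms, s):
    """Match a flat sequence of (char, negated, starred) atoms against s."""
    positions = {0}                       # indices into s reachable so far
    for atom in atoms:
        _, _, starred = atom
        nxt = set()
        for i in positions:
            if starred:
                nxt.add(i)                # zero repetitions allowed
                j = i
                while j < len(s) and atom_matches(atom, s[j]):
                    j += 1
                    nxt.add(j)            # each further repetition
            elif i < len(s) and atom_matches(atom, s[i]):
                nxt.add(i + 1)
        positions = nxt
    return len(s) in positions            # must consume the whole string

# •* ⋅ i ⋅ (¬e)* : strings containing an 'i' that is never followed by 'e'
rx = [(None, False, True), ("i", False, False), ("e", True, True)]
print(reduced_match(rx, "receipt"))  # → True
print(reduced_match(rx, "recieve"))  # → False
```

The example pattern echoes Brill's spelling-correction motivation: it separates "i before e" misspellings like "recieve" from correct forms like "receipt".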
[1] de la Higuera, Colin (2010). Grammatical Inference: Learning Automata and Grammars. Cambridge: Cambridge University Press.
[2] Dupont, Pierre (1994). "Regular grammatical inference from positive and negative samples by genetic search: the GIG method". Grammatical Inference and Applications. Springer Berlin Heidelberg. 236–245.
[3] Batista, Leonardo Vidal; Meira, Moab Mariz (2004). "Texture classification using the Lempel-Ziv-Welch algorithm". Advances in Artificial Intelligence – SBIA 2004. Springer Berlin Heidelberg. 444–453.
[4] Gold, E. Mark (1967). "Language identification in the limit". Information and Control 10 (5): 447–474.
[5] Brill, Eric (2000). "Pattern-Based Disambiguation for Natural Language Processing". Proc. EMNLP/VLC.