An n-gram is a contiguous sequence of n items from a given sequence of text or speech. Depending on the application, the items can be syllables, letters, words, or base pairs. n-grams are also sometimes called shingles.
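As a quick illustration (a minimal sketch in plain Python; the sentence and variable names are just examples), the word-level bigrams (n = 2) of a short phrase are produced by pairing each word with the word that follows it:

sentence = "to be or not to be".split()

# pair each word with its successor to form the bigrams (n = 2)
bigrams = list(zip(sentence, sentence[1:]))
print(bigrams)
# ==> [('to', 'be'), ('be', 'or'), ('or', 'not'), ('not', 'to'), ('to', 'be')]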
Tokenization
My first post was mainly about tokenization, so only a short example is given here.
from nltk.tokenize import RegexpTokenizer

# keep only letters, apostrophes and backticks: numbers are skipped and
# contractions such as "I'm" remain single tokens
tokenizer = RegexpTokenizer("[a-zA-Z'`]+")
print(tokenizer.tokenize("I am Madhuka Udantha, I'm going to write 2blog posts"))
# ==> ['I', 'am', 'Madhuka', 'Udantha', "I'm", 'going', 'to', 'write', 'blog', 'posts']
Generating N-grams from the tokens
nltk.util.ngrams(sequence, n, pad_left=False, pad_right=False, left_pad_symbol=None, right_pad_symbol=None)
- sequence – the source data to be converted into n-grams (sequence or iterable)
- n – the degree of the n-grams (int)
- pad_left – whether the n-grams should be left-padded (bool)
- pad_right – whether the n-grams should be right-padded (bool)
- left_pad_symbol / right_pad_symbol – the symbol to use for padding (default is None; older NLTK releases used a single pad_symbol parameter instead)
from nltk.util import ngrams

print(list(ngrams([1, 2, 3, 4, 5], 3)))
# ==> [(1, 2, 3), (2, 3, 4), (3, 4, 5)]
print(list(ngrams([1, 2, 3, 4, 5], 2, pad_right=True)))
# ==> [(1, 2), (2, 3), (3, 4), (4, 5), (5, None)]
print(list(ngrams([1, 2, 3, 4, 5], 2, pad_right=True, right_pad_symbol="END")))
# ==> [(1, 2), (2, 3), (3, 4), (4, 5), (5, 'END')]
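Putting the two steps together (a small sketch; the variable names are my own), the tokenizer output can be fed straight into ngrams to get word-level bigrams:

from nltk.tokenize import RegexpTokenizer
from nltk.util import ngrams

tokenizer = RegexpTokenizer("[a-zA-Z'`]+")
tokens = tokenizer.tokenize("I am Madhuka Udantha, I'm going to write 2blog posts")

# bigrams (n = 2) over the token sequence
print(list(ngrams(tokens, 2)))
# ==> [('I', 'am'), ('am', 'Madhuka'), ('Madhuka', 'Udantha'), ('Udantha', "I'm"),
#      ("I'm", 'going'), ('going', 'to'), ('to', 'write'), ('write', 'blog'), ('blog', 'posts')]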
Counting N-gram occurrences
# n-grams produced in the previous step (here: bigrams over the tokens)
ngrams_list = list(ngrams(tokens, 2))

ngrams_statistics = {}

for ngram in ngrams_list:
    if ngram not in ngrams_statistics:
        ngrams_statistics[ngram] = 1
    else:
        ngrams_statistics[ngram] += 1
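The same counting can be done in one line with collections.Counter from the standard library; this is just an alternative to the manual dictionary above:

from collections import Counter

ngrams_statistics = Counter(ngrams_list)
print(ngrams_statistics.most_common(3))  # the three most frequent n-grams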
Sorting
# sort the (ngram, count) pairs by occurrence count, most frequent first
ngrams_statistics_sorted = sorted(ngrams_statistics.items(), key=lambda item: item[1], reverse=True)
print(ngrams_statistics_sorted)