Monday, May 11, 2015

NLTK tutorial–03 (n-gram)

An n-gram is a contiguous sequence of n items from a given sequence of text or speech. The items can be syllables, letters, words or base pairs according to the application. n-grams may also be called shingles.
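To make the definition concrete, here is a minimal pure-Python sketch that slides a window of size n over a sequence. The helper name `simple_ngrams` is made up for this illustration and is not part of NLTK; the sample inputs are also invented.

```python
def simple_ngrams(sequence, n):
    """Return every contiguous run of n items from sequence as a tuple."""
    items = list(sequence)
    # One n-gram starts at each index that still has n items from there on.
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


# Word bigrams: the items are words.
word_bigrams = simple_ngrams("to be or not to be".split(), 2)
print(word_bigrams)

# Character trigrams: the items are letters.
char_trigrams = simple_ngrams("nltk", 3)
print(char_trigrams)
```

The same function covers both cases because it only cares about the order of the items, not what they are.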

Tokenization

Tokenization was the main topic of my first post in this series.

from nltk.tokenize import RegexpTokenizer

# Match runs of letters, apostrophes and backticks; digits are skipped
tokenizer = RegexpTokenizer("[a-zA-Z'`]+")
print(tokenizer.tokenize("I am Madhuka Udantha, I'm going to write 2blog posts"))
#==>['I', 'am', 'Madhuka', 'Udantha', "I'm", 'going', 'to', 'write', 'blog', 'posts']

Generating n-grams from the tokens


nltk.util.ngrams(sequence, n, pad_left=False, pad_right=False, pad_symbol=None)



  • sequence – the source data to be converted into n-grams (sequence or iterator)

  • n – the degree of the n-grams (int)

  • pad_left – whether the n-grams should be left-padded (bool)

  • pad_right – whether the n-grams should be right-padded (bool)

  • pad_symbol – the symbol to use for padding (any type; default is None)

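from nltk.util import ngrams

print(list(ngrams([1,2,3,4,5], 3)))
#==>[(1, 2, 3), (2, 3, 4), (3, 4, 5)]
print(list(ngrams([1,2,3,4,5], 2, pad_right=True)))
#==>[(1, 2), (2, 3), (3, 4), (4, 5), (5, None)]
# Newer NLTK releases name this argument right_pad_symbol instead of pad_symbol
print(list(ngrams([1,2,3,4,5], 2, pad_right=True, pad_symbol="END")))
#==>[(1, 2), (2, 3), (3, 4), (4, 5), (5, 'END')]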
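To see what the padding options do without relying on NLTK, here is a rough pure-Python stand-in. The function `padded_ngrams` is hypothetical, and it assumes that each padded side receives n - 1 copies of the pad symbol, so that every original item can appear in the first (or last) position of some n-gram.

```python
from itertools import chain, repeat


def padded_ngrams(sequence, n, pad_left=False, pad_right=False, pad_symbol=None):
    """Hypothetical stand-in that mimics the padding behaviour described above."""
    items = iter(sequence)
    if pad_left:
        # Prepend n - 1 pad symbols (assumption stated in the lead-in).
        items = chain(repeat(pad_symbol, n - 1), items)
    if pad_right:
        # Append n - 1 pad symbols.
        items = chain(items, repeat(pad_symbol, n - 1))
    items = list(items)
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


both_sides = padded_ngrams([1, 2, 3], 2, pad_left=True, pad_right=True, pad_symbol="PAD")
print(both_sides)

right_trigrams = padded_ngrams([1, 2, 3], 3, pad_right=True, pad_symbol="END")
print(right_trigrams)
```

Padding is mostly useful when the start or end of a sentence should count as context of its own, for example in a language model.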



Counting the occurrences of each n-gram


# Assumes tokenizer and ngrams from the snippets above; bigrams (n=2) are just an example
tokens = tokenizer.tokenize("I am Madhuka Udantha, I'm going to write 2blog posts")
ngrams_statistics = {}

for ngram in ngrams(tokens, 2):
    if ngram not in ngrams_statistics:
        # First time this n-gram is seen
        ngrams_statistics[ngram] = 1
    else:
        ngrams_statistics[ngram] += 1
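The same tally can also be sketched with `collections.Counter` from the standard library, which does the bookkeeping for you. This is an alternative to the manual dictionary, not the approach used in this post; the token list is made up, and the bigrams are built with plain `zip` so the snippet runs without NLTK.

```python
from collections import Counter

# Made-up token list for illustration.
tokens = ["I", "am", "going", "to", "write", "and", "I", "am", "going", "to", "post"]

# zip(tokens, tokens[1:]) pairs each token with its successor, i.e. the bigrams.
bigrams = list(zip(tokens, tokens[1:]))

# Counter tallies each distinct bigram in a single pass.
bigram_counts = Counter(bigrams)

print(bigram_counts[("I", "am")])
print(bigram_counts[("write", "and")])
```

A `Counter` is a dictionary subclass, so it can be used anywhere the hand-built dictionary would be.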

Sorting


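# Sort by count, most frequent first (without key=, the pairs would be ordered by n-gram instead)
ngrams_statistics_sorted = sorted(ngrams_statistics.items(), key=lambda item: item[1], reverse=True)
print(ngrams_statistics_sorted)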
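A common goal at this point is to rank n-grams by how often they occur, with a predictable order among n-grams that share a count. Here is a minimal sketch of that, written for Python 3; the counts are made up and simply stand in for the `ngrams_statistics` dictionary.

```python
# Made-up counts, standing in for the ngrams_statistics dictionary.
ngrams_statistics = {
    ("going", "to"): 3,
    ("I", "am"): 3,
    ("to", "write"): 1,
    ("am", "going"): 2,
}

# Sort by count, highest first; ties fall back to the n-gram itself, ascending.
by_count = sorted(ngrams_statistics.items(), key=lambda item: (-item[1], item[0]))

for ngram, count in by_count:
    print(ngram, count)
```

Negating the count in the key keeps a single ascending sort while still putting the most frequent n-grams first.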
