An n-gram is a contiguous sequence of n items from a given sequence of text or speech. Depending on the application, the items can be syllables, letters, words, or base pairs. n-grams are also sometimes called shingles.
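As a quick illustration (a minimal sketch in plain Python; the sentence and variable names are just examples), the word-level bigrams (n = 2) of a short phrase are produced by pairing each word with the word that follows it:

sentence = "to be or not to be".split()

# pair each word with its successor to form the bigrams (n = 2)
bigrams = list(zip(sentence, sentence[1:]))
print(bigrams)
# ==> [('to', 'be'), ('be', 'or'), ('or', 'not'), ('not', 'to'), ('to', 'be')]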
Tokenization
My first post was mainly about tokenization, so only a short example is given here.
from nltk.tokenize import RegexpTokenizer

# keep only letters, apostrophes and backticks: numbers are skipped and
# contractions such as "I'm" remain single tokens
tokenizer = RegexpTokenizer("[a-zA-Z'`]+")
print(tokenizer.tokenize("I am Madhuka Udantha, I'm going to write 2blog posts"))
# ==> ['I', 'am', 'Madhuka', 'Udantha', "I'm", 'going', 'to', 'write', 'blog', 'posts']
Generating N-grams from the tokens
nltk.util.ngrams(sequence, n, pad_left=False, pad_right=False, left_pad_symbol=None, right_pad_symbol=None)
- sequence – the source data to be converted into n-grams (sequence or iterable)
- n – the degree of the n-grams (int)
- pad_left – whether the n-grams should be left-padded (bool)
- pad_right – whether the n-grams should be right-padded (bool)
- left_pad_symbol / right_pad_symbol – the symbol to use for padding (default is None; older NLTK releases used a single pad_symbol parameter instead)
from nltk.util import ngrams

print(list(ngrams([1, 2, 3, 4, 5], 3)))
# ==> [(1, 2, 3), (2, 3, 4), (3, 4, 5)]
print(list(ngrams([1, 2, 3, 4, 5], 2, pad_right=True)))
# ==> [(1, 2), (2, 3), (3, 4), (4, 5), (5, None)]
print(list(ngrams([1, 2, 3, 4, 5], 2, pad_right=True, right_pad_symbol="END")))
# ==> [(1, 2), (2, 3), (3, 4), (4, 5), (5, 'END')]
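Putting the two steps together (a small sketch; the variable names are my own), the tokenizer output can be fed straight into ngrams to get word-level bigrams:

from nltk.tokenize import RegexpTokenizer
from nltk.util import ngrams

tokenizer = RegexpTokenizer("[a-zA-Z'`]+")
tokens = tokenizer.tokenize("I am Madhuka Udantha, I'm going to write 2blog posts")

# bigrams (n = 2) over the token sequence
print(list(ngrams(tokens, 2)))
# ==> [('I', 'am'), ('am', 'Madhuka'), ('Madhuka', 'Udantha'), ('Udantha', "I'm"),
#      ("I'm", 'going'), ('going', 'to'), ('to', 'write'), ('write', 'blog'), ('blog', 'posts')]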
Counting N-gram occurrences
# n-grams produced in the previous step (here: bigrams over the tokens)
ngrams_list = list(ngrams(tokens, 2))

ngrams_statistics = {}

for ngram in ngrams_list:
    if ngram not in ngrams_statistics:
        ngrams_statistics[ngram] = 1
    else:
        ngrams_statistics[ngram] += 1
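The same counting can be done in one line with collections.Counter from the standard library; this is just an alternative to the manual dictionary above:

from collections import Counter

ngrams_statistics = Counter(ngrams_list)
print(ngrams_statistics.most_common(3))  # the three most frequent n-grams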
Sorting
# sort the (ngram, count) pairs by occurrence count, most frequent first
ngrams_statistics_sorted = sorted(ngrams_statistics.items(), key=lambda item: item[1], reverse=True)
print(ngrams_statistics_sorted)