Sunday, May 10, 2015

NLTK tutorial–02 (Texts as Lists of Words / Frequency words)

The previous post covered installing NLTK and searching text with its basic functions. This post focuses on 'Texts as Lists of Words', since a text is nothing more than a sequence of words and punctuation. Frequency distributions are visited at the end of this post.

sent1 = ['Today', 'I', 'call', 'James', '.']

len(sent1) --> 5

  • Concatenation combines lists into a single list. We can concatenate sentences to build up a text.
    text1 = sent1 + sent2
  • Index into the text to find the word at a given position (indexes start from zero).
    text1[12]
  • We can do the converse: given a word, find the index of its first occurrence.
    text1.index('call')
  • Slicing the text (by convention, m:n means elements m…n-1).
    text1[165:198]
    • NOTE
      If we accidentally use an index that is too large, we get an error: 'IndexError: list index out of range'
  • Sorting
    noun_phrase = text5[1:6]
    sorted(noun_phrase)

NOTE
Remember that capitalized words appear before lowercase words in sorted lists
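
The list operations above can be sketched in plain Python. Note these sent1/sent2 are small made-up sentences for illustration, not the NLTK book's built-in ones:

```python
# Illustrative sentences (invented for this example)
sent1 = ['Today', 'I', 'call', 'James', '.']
sent2 = ['He', 'answers', '.']

text = sent1 + sent2           # concatenation -> one longer list
print(text[2])                 # indexing: 'call'
print(text.index('call'))      # first occurrence: 2
print(text[1:4])               # slicing: ['I', 'call', 'James']
print(sorted(text[:4]))        # capitals sort first: ['I', 'James', 'Today', 'call']
```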


Strings

Here are a few ways to play with strings in Python. These are very basic, but useful to know when you are working with NLP.
name = 'Madhuka'
name[0] --> 'M'
name[:5] --> 'Madhu'
name * 2 --> 'MadhukaMadhuka'
name + '.' --> 'Madhuka.'

Splitting and joining
' '.join(['NLTK', 'Python']) --> 'NLTK Python'
'NLTK Python'.split() --> ['NLTK', 'Python']

 

Frequency Distributions

The words of a text are distributed with different frequencies, and NLTK provides built-in support for counting them. Let's use a FreqDist to find the 50 most frequent words in a text/book.

Let's check the frequency distribution of 'The Book of Genesis':

from nltk.book import *

fdist1 = FreqDist(text3)
print(fdist1)
print(fdist1.most_common(50))

Here is the frequency distribution of text3 ('The Book of Genesis'):




Long words


Listing words that are more than 12 characters long: for each word w in the vocabulary V, we check whether len(w) is greater than 12.


from nltk.book import *

V = set(text3)
long_words = [w for w in V if len(w) > 12]
print(sorted(long_words))

Here are all words from text3 that are longer than 8 characters and occur more than 10 times (note that fdist3 must be built from text3 first):


fdist3 = FreqDist(text3)
print(sorted(w for w in set(text3) if len(w) > 8 and fdist3[w] > 10))



Collocation


A collocation is a sequence of words that occur together unusually often. Thus red wine is a collocation, whereas the wine is not. To get a handle on collocations, we start off by extracting from a text a list of word pairs, also known as bigrams. This is easily accomplished with the function bigrams():


In particular, we want to find bigrams that occur more often than we would expect based on the frequency of the individual words. The collocations() function does this for us.


from nltk.book import *

phrase = text3[:5]
print("===Bigrams===")
print(list(bigrams(phrase)))
print("===Collocations===")
text3.collocations()  # prints the collocations itself

Here is the output of the sample code:




 



  • fdist = FreqDist(samples)
    create a frequency distribution containing the given samples

  • fdist[sample] += 1
    increment the count for this sample

  • fdist['monstrous']
    count of the number of times a given sample occurred

  • fdist.freq('monstrous')
    frequency of a given sample

  • fdist.N()
    total number of samples

  • fdist.most_common(n)
    the n most common samples and their frequencies

  • for sample in fdist:
    iterate over the samples

  • fdist.max()
    sample with the greatest count

  • fdist.tabulate()
    tabulate the frequency distribution

  • fdist.plot()
    graphical plot of the frequency distribution

  • fdist.plot(cumulative=True)
    cumulative plot of the frequency distribution

  • fdist1 |= fdist2
    update fdist1 with counts from fdist2

  • fdist1 < fdist2
    test if samples in fdist1 occur less frequently than in fdist2
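
A few of the functions above can be tried together. This is a minimal sketch using a small hand-made word list (invented for illustration, so no corpus download is needed):

```python
from nltk import FreqDist  # requires NLTK to be installed

# Small hand-made sample, just for illustration
words = ['the', 'cat', 'sat', 'on', 'the', 'mat', 'the', 'end']
fdist = FreqDist(words)

print(fdist['the'])          # count of 'the': 3
print(fdist.N())             # total number of samples: 8
print(fdist.freq('the'))     # 3/8 = 0.375
print(fdist.max())           # sample with the greatest count: 'the'
print(fdist.most_common(2))  # top-2 samples with their counts
```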
