Sunday, May 10, 2015

NLTK tutorial–02 (Texts as Lists of Words / Frequency words)

The previous post covered installing NLTK and searching text with its basic functions. This post focuses on 'Texts as Lists of Words', since a text is nothing more than a sequence of words and punctuation. Frequency distributions are visited at the end of this post.

sent1 = ['Today', 'I', 'call', 'James', '.']

len(sent1) --> 5

  • Concatenation combines lists into a single list. We can concatenate sentences to build up a text.
    text1 = sent1 + sent2
  • Index into the text to find the word at a given position (indexes start from zero).
    text1[12]
  • We can do the converse: given a word, find the index of its first occurrence.
    text1.index('call')
  • Slicing the text (by convention, m:n means elements m…n-1).
    text1[165:198]
    • NOTE
      If we accidentally use an index that is too large, we get an error: 'IndexError: list index out of range'
  • Sorting
    noun_phrase = text5[1:6]
    sorted(noun_phrase)

NOTE
Remember that capitalized words appear before lowercase words in sorted lists
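
The list operations above can be sketched in plain Python. Note these sent1/sent2 are small made-up sentences for illustration, not the NLTK book's built-in ones:

```python
# Illustrative sentences (invented for this example)
sent1 = ['Today', 'I', 'call', 'James', '.']
sent2 = ['He', 'answers', '.']

text = sent1 + sent2           # concatenation -> one longer list
print(text[2])                 # indexing: 'call'
print(text.index('call'))      # first occurrence: 2
print(text[1:4])               # slicing: ['I', 'call', 'James']
print(sorted(text[:4]))        # capitals sort first: ['I', 'James', 'Today', 'call']
```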


Strings

Here are a few ways to play with strings in Python. These are very basic, but useful to know when you are working with NLP.
name = 'Madhuka'
name[0] --> 'M'
name[:5] --> 'Madhu'
name * 2 --> 'MadhukaMadhuka'
name + '.' --> 'Madhuka.'

Splitting and joining
' '.join(['NLTK', 'Python']) --> 'NLTK Python'
'NLTK Python'.split() --> ['NLTK', 'Python']

 

Frequency Distributions

The words of a text are distributed with different frequencies, and NLTK provides built-in support for counting them. Let's use a FreqDist to find the 50 most frequent words in a text/book.

Let's check the frequency distribution of 'The Book of Genesis':

from nltk.book import *

fdist1 = FreqDist(text3)
print(fdist1)
print(fdist1.most_common(50))

Here is the frequency distribution of text3 ('The Book of Genesis'):




Long words


Listing words that are more than 12 characters long: for each word w in the vocabulary V, we check whether len(w) is greater than 12.


from nltk.book import *

V = set(text3)
long_words = [w for w in V if len(w) > 12]
print(sorted(long_words))

Here are all words from text3 that are longer than 8 characters and occur more than 10 times (note that fdist3 must be built from text3 first):


fdist3 = FreqDist(text3)
print(sorted(w for w in set(text3) if len(w) > 8 and fdist3[w] > 10))



Collocation


A collocation is a sequence of words that occur together unusually often. Thus red wine is a collocation, whereas the wine is not. To get a handle on collocations, we start off by extracting from a text a list of word pairs, also known as bigrams. This is easily accomplished with the function bigrams():


In particular, we want to find bigrams that occur more often than we would expect based on the frequency of the individual words. The collocations() function does this for us.


from nltk.book import *

phrase = text3[:5]
print("===Bigrams===")
print(list(bigrams(phrase)))
print("===Collocations===")
text3.collocations()  # prints the collocations itself

Here is the output of the sample code:




 



  • fdist = FreqDist(samples)
    create a frequency distribution containing the given samples

  • fdist[sample] += 1
    increment the count for this sample

  • fdist['monstrous']
    count of the number of times a given sample occurred

  • fdist.freq('monstrous')
    frequency of a given sample

  • fdist.N()
    total number of samples

  • fdist.most_common(n)
    the n most common samples and their frequencies

  • for sample in fdist:
    iterate over the samples

  • fdist.max()
    sample with the greatest count

  • fdist.tabulate()
    tabulate the frequency distribution

  • fdist.plot()
    graphical plot of the frequency distribution

  • fdist.plot(cumulative=True)
    cumulative plot of the frequency distribution

  • fdist1 |= fdist2
    update fdist1 with counts from fdist2

  • fdist1 < fdist2
    test if samples in fdist1 occur less frequently than in fdist2
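
A few of the functions above can be tried together. This is a minimal sketch using a small hand-made word list (invented for illustration, so no corpus download is needed):

```python
from nltk import FreqDist  # requires NLTK to be installed

# Small hand-made sample, just for illustration
words = ['the', 'cat', 'sat', 'on', 'the', 'mat', 'the', 'end']
fdist = FreqDist(words)

print(fdist['the'])          # count of 'the': 3
print(fdist.N())             # total number of samples: 8
print(fdist.freq('the'))     # 3/8 = 0.375
print(fdist.max())           # sample with the greatest count: 'the'
print(fdist.most_common(2))  # top-2 samples with their counts
```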
