The previous post covered installing NLTK, an introduction to it, and searching text with NLTK's basic functions. This post focuses on 'Texts as Lists of Words', since a text is nothing more than a sequence of words and punctuation. Frequency distributions are also visited at the end of this post.
sent1 = ['Today', 'I', 'call', 'James', '.']
len(sent1) --> 5
- Concatenation combines lists into a single list. We can concatenate sentences to build up a text:
text1 = sent1 + sent2
- Indexing: given a position, find the word at that index (indexes start from zero):
text1[12]
- We can also do the converse: given a word, find the index where it first occurs:
text1.index('call')
- Slicing the text (by convention, m:n means elements m...n-1):
text1[165:198]
NOTE
If we accidentally use an index that is too large, we get an error: 'IndexError: list index out of range'
- Sorting:
noun_phrase = text5[1:6]
sorted(noun_phrase)
NOTE
Remember that capitalized words appear before lowercase words in sorted lists
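The list operations above can be tried on plain Python lists without any corpus. A minimal runnable sketch, using the example sentence from earlier plus a made-up sent2:

```python
# Two example sentences as lists of words (sent2 is invented for illustration).
sent1 = ['Today', 'I', 'call', 'James', '.']
sent2 = ['He', 'answers', '.']

text1 = sent1 + sent2           # concatenation builds a longer text
print(text1[2])                 # indexing -> 'call'
print(text1.index('call'))      # first occurrence -> 2
print(text1[1:4])               # slicing -> ['I', 'call', 'James']
print(sorted(sent2))            # punctuation and capitals sort before lowercase
```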
Strings
A few ways to play with strings in Python. These are very basic, but useful to know when you are working with NLP.
name = 'Madhuka'
name[0] --> 'M'
name[:5] --> 'Madhu'
name * 2 --> 'MadhukaMadhuka'
name + '.' --> 'Madhuka.'
Splitting and join
' '.join(['NLTK', 'Python']) --> 'NLTK Python'
'NLTK Python'.split() --> ['NLTK', 'Python']
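split() and join() are inverses of each other, so a word list round-trips through a string and back. A quick sketch:

```python
# Joining a word list with spaces, then splitting again,
# returns the original list.
words = ['NLTK', 'Python']
joined = ' '.join(words)        # 'NLTK Python'
print(joined)
print(joined.split() == words)  # True
```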
Frequency Distributions
A text contains words with varying frequencies, and NLTK provides built-in support for frequency distributions. Let's use a FreqDist to find the 50 most frequent words in a text/book.
Let's check the frequency distribution of 'The Book of Genesis' (text3):
from nltk.book import *

fdist1 = FreqDist(text3)
print(fdist1)
print(fdist1.most_common(50))
Here is the frequency distribution of text3 ('The Book of Genesis'):
Long words
Listing words that are more than 12 characters long: for each word w in the vocabulary V, we check whether len(w) is greater than 12.
from nltk.book import *

V = set(text3)
long_words = [w for w in V if len(w) > 12]
print(sorted(long_words))
Here are all words from text3 that are longer than 8 characters and occur more than 10 times:
from nltk.book import *

fdist3 = FreqDist(text3)
print(sorted(w for w in set(text3) if len(w) > 8 and fdist3[w] > 10))
Collocation
A collocation is a sequence of words that occur together unusually often. Thus red wine is a collocation, whereas the wine is not. To get a handle on collocations, we start off by extracting from a text a list of word pairs, also known as bigrams. This is easily accomplished with the function bigrams():
In particular, we want to find bigrams that occur more often than we would expect based on the frequency of the individual words. The collocations() function does this for us:
from nltk.book import *

phrase = text3[:5]
print("===Bigrams===")
print(list(bigrams(phrase)))
print("===Collocations===")
text3.collocations()
Here is the output of the sample code:
- fdist = FreqDist(samples) : create a frequency distribution containing the given samples
- fdist[sample] += 1 : increment the count for this sample
- fdist['monstrous'] : count of the number of times a given sample occurred
- fdist.freq('monstrous') : frequency of a given sample
- fdist.N() : total number of samples
- fdist.most_common(n) : the n most common samples and their frequencies
- for sample in fdist: : iterate over the samples
- fdist.max() : sample with the greatest count
- fdist.tabulate() : tabulate the frequency distribution
- fdist.plot() : graphical plot of the frequency distribution
- fdist.plot(cumulative=True) : cumulative plot of the frequency distribution
- fdist1 |= fdist2 : update fdist1 with counts from fdist2
- fdist1 < fdist2 : test if samples in fdist1 occur less frequently than in fdist2
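Several of the operations above can be tried on a small hand-made sample list, with no corpus download needed (this sketch assumes NLTK is installed; the sample words are invented for illustration):

```python
from nltk import FreqDist

# A tiny made-up sample to exercise the FreqDist operations above.
samples = ['the', 'cat', 'sat', 'on', 'the', 'mat', 'the']
fdist = FreqDist(samples)

print(fdist['the'])          # count of 'the' -> 3
print(fdist.N())             # total number of samples -> 7
print(fdist.max())           # sample with the greatest count -> 'the'
print(fdist.freq('the'))     # relative frequency -> 3/7
print(fdist.most_common(2))  # the two most common samples with counts
```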