Madhuka: Natural Language Toolkit (NLTK) sample and tutorial

What is NLTK?

Natural Language Toolkit (NLTK) is a leading platform for building Python programs to work with human language data (Natural Language Processing). It is accompanied by a book that explains the underlying concepts behind the language processing tasks supported by the toolkit. NLTK is intended to support research and teaching in NLP or closely related areas, including empirical linguistics, cognitive science, artificial intelligence, information retrieval, and machine learning.

Library contains

Lexical analysis: Word and text tokenizer
n-gram and collocations
Part-of-speech tagger
Tree model and text chunker
Named-entity recognition

Download and Install

1. Download and install NLTK on Windows, for example with pip install nltk.

2. Once NLTK is installed, start the Python interpreter and install the data required for the rest of this tutorial.

import nltk
nltk.download()

The data consists of about 30 compressed files requiring about 100 MB of disk space. If disk space or network bandwidth is limited, you can download only the packages you need.
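As a rough sketch of downloading only part of the data: passing an identifier to nltk.download() fetches that package or collection without opening the interactive chooser. The helper name download_book_data is made up for illustration, and "book" is assumed as the default because it is the collection behind the nltk.book examples used below.

```python
def download_book_data(package_id="book"):
    """Download one NLTK data package or collection instead of everything.

    "book" is an assumed default: it is the collection that
    `from nltk.book import *` relies on. Pass another identifier
    to fetch something else.
    """
    import nltk  # imported inside the helper so it can be defined before NLTK is needed

    # With an identifier, nltk.download() fetches only the named
    # package or collection rather than showing the chooser.
    return nltk.download(package_id)
```

Calling download_book_data() once from the interpreter should be enough; later sessions reuse the files already on disk.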

Once the data is downloaded to your machine, you can load some of it using the Python interpreter.

from nltk.book import *

Basic Operations on Text

from __future__ import division  # true division on Python 2; harmless on Python 3
from nltk.book import *

# Printing a text shows its name
print(text3)
# Length of the text from start to finish, counting words and punctuation symbols
print('Length of text: ' + str(len(text3)))

# The vocabulary of a text is the set of distinct tokens it uses
# print(sorted(set(text3)))
print('Number of distinct tokens: ' + str(len(set(text3))))

# Lexical richness: distinct tokens divided by total tokens
def lexical_richness(text):
    return len(set(text)) / len(text)

# Percentage of the text taken up by a specific word
def percentage(word, text):
    return 100 * text.count(word) / len(text)

print('Lexical richness of the text: ' + str(lexical_richness(text3)))
print('Percentage: ' + str(percentage('God', text3)))

Here we use text3, "The Book of Genesis", to try out NLTK features. The code sample above shows:

The name of the text

The length of the text from start to end, counting words and punctuation symbols

The number of distinct tokens in the text. (A token is the technical name for a sequence of characters treated as a unit, such as a word or a punctuation symbol. The vocabulary of a text is the set of tokens that it uses, since in a set all duplicates are collapsed together.)

A measure of the lexical richness of the text (the number of distinct words divided by the total number of words)

How often a word occurs in the text (the percentage of the text taken up by a specific word)
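To make the arithmetic concrete, here is a small sketch that applies the same measures to a hand-made token list instead of text3, so it runs without the NLTK data. The token list is invented for illustration, and a plain Python list stands in for an NLTK text, which supports the same len(), set() and count() operations. It assumes Python 3, where / is already true division.

```python
# Hypothetical token list standing in for an NLTK text
tokens = ["In", "the", "beginning", "God", "created", "the",
          "heaven", "and", "the", "earth", "."]

def lexical_richness(text):
    # Distinct tokens divided by total tokens
    return len(set(text)) / len(text)

def percentage(word, text):
    # Share of the text taken up by one word, in percent
    return 100 * text.count(word) / len(text)

total = len(tokens)           # 11 tokens, punctuation included
distinct = len(set(tokens))   # 9 distinct tokens, since "the" occurs three times
richness = lexical_richness(tokens)   # 9 / 11
share = percentage("the", tokens)     # 100 * 3 / 11

print("Tokens:", total, "Distinct:", distinct)
print("Lexical richness:", round(richness, 3))
print("Percentage of 'the':", round(share, 2))
```

A richness close to 1 means few repeated tokens; long texts such as Genesis repeat words often, so their value is much lower than in this toy example.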

Note
In Python 2, start the script with from __future__ import division so that / performs true (floating-point) division instead of integer division.

Output of the above code snippet

Searching Text

count(word) - counts how many times the word occurs in the text

concordance(word) - shows every occurrence of a given word, together with some context

similar(word) - lists other words that appear in a similar range of contexts

common_contexts([words]) - shows the contexts that are shared by two or more words

from nltk.book import *

# Name of the text
print(text3)

# Count the occurrences of a word in the text
print("===Count===")
print(text3.count("Adam"))

# concordance() shows every occurrence of a given word, together with some context.
# Here we search for 'Adam' in 'The Book of Genesis'.
# It prints its own results, so wrapping it in print() would only add a stray "None".
print("===Concordance===")
text3.concordance("Adam")

# similar() lists other words that appear in a similar range of contexts
print("===Similar===")
text3.similar("Adam")

# common_contexts() shows the contexts shared by two or more words
print("===Common Contexts===")
text3.common_contexts(["Adam", "Noah"])

Output of the above code sample
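To show roughly what concordance and common_contexts compute, here is a simplified pure-Python sketch. It is not NLTK's implementation: the function names mirror the methods only for readability, the width parameter and the sample sentence are made up, and a context is taken to be just the previous and next token.

```python
def concordance(tokens, word, width=3):
    """Return each occurrence of `word` with up to `width` tokens on each side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == word.lower():
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append(" ".join(left + [tok] + right))
    return lines

def contexts(tokens, word):
    """Set of (previous token, next token) pairs around each occurrence of `word`."""
    found = set()
    for i in range(1, len(tokens) - 1):
        if tokens[i].lower() == word.lower():
            found.add((tokens[i - 1].lower(), tokens[i + 1].lower()))
    return found

def common_contexts(tokens, words):
    """Contexts shared by every word in `words`."""
    shared = None
    for w in words:
        c = contexts(tokens, w)
        shared = c if shared is None else shared & c
    return shared or set()

# Made-up sample sentence, split on whitespace
tokens = "And Adam called his wife Eve . And Noah called his sons .".split()

adam_lines = concordance(tokens, "Adam", width=2)
shared = common_contexts(tokens, ["Adam", "Noah"])
print(adam_lines)
print(shared)
```

Both "Adam" and "Noah" appear between "and" and "called" in the sample, so that pair is the one shared context.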

Next, we plot how words are distributed over the text, that is, where words such as "God", "Adam", "Eve", "Noah", "Abram", "Sarah", "Joseph", "Shem" and "Isaac" occur in the book. A dispersion plot shows the position of each occurrence (plotting requires Matplotlib to be installed).

text3.dispersion_plot(["God", "Adam", "Eve", "Noah", "Abram", "Sarah", "Joseph", "Shem", "Isaac"])
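The data behind a dispersion plot is just the list of token positions at which each word occurs. The sketch below computes those positions in plain Python and draws a crude text strip in place of the graphical plot; the helper names and the sample sentence are invented for illustration and are not part of NLTK.

```python
def dispersion_offsets(tokens, words):
    """Map each word to the list of token positions where it occurs."""
    offsets = {w: [] for w in words}
    for position, tok in enumerate(tokens):
        if tok in offsets:
            offsets[tok].append(position)
    return offsets

def strip(positions, length):
    """Render positions as a text strip: '|' at an occurrence, '.' elsewhere."""
    marked = set(positions)
    return "".join("|" if i in marked else "." for i in range(length))

# Made-up sample sentence, split on whitespace
tokens = "God made Adam and Eve and God spoke to Noah".split()
words = ["God", "Adam", "Noah"]
offsets = dispersion_offsets(tokens, words)

for word in words:
    print(word.ljust(5), strip(offsets[word], len(tokens)), offsets[word])
```

Each row corresponds to one word and each column to one token position, which is the same layout the graphical plot uses across the whole book.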

References

[1] Bird, Steven; Klein, Ewan; Loper, Edward (2009). Natural Language Processing with Python. O'Reilly Media Inc. ISBN 0-596-51649-5.

Madhuka

Sunday, May 10, 2015

Natural Language Toolkit (NLTK) sample and tutorial - 01
