Sunday, May 10, 2015

Natural Language Toolkit (NLTK) sample and tutorial - 01

What is NLTK?

Natural Language Toolkit (NLTK) is a leading platform for building Python programs to work with human language data (Natural Language Processing). It is accompanied by a book that explains the underlying concepts behind the language processing tasks supported by the toolkit. NLTK is intended to support research and teaching in NLP or closely related areas, including empirical linguistics, cognitive science, artificial intelligence, information retrieval, and machine learning.

The library contains:

  • Lexical analysis: Word and text tokenizer
  • n-gram and collocations
  • Part-of-speech tagger
  • Tree model and Text chunker for capturing
  • Named-entity recognition

Download and Install

1. On Windows, you can download NLTK from here.

2. Once NLTK is installed, start the Python interpreter and install the data required for the rest of the work.

import nltk
nltk.download()



The data consists of about 30 compressed files requiring about 100 MB of disk space. If disk space or network bandwidth is limited, you can select only the packages you need.
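
As a rough sketch of picking only what you need, `nltk.download()` also accepts a package or collection identifier, so the graphical downloader can be skipped. The helper name `download_packages` and its `downloader` parameter are hypothetical, introduced here only so the function can be exercised without network access; the identifier "book" is the collection used by `nltk.book`.

```python
def download_packages(package_ids, downloader=None):
    """Download each NLTK package or collection by identifier.

    `downloader` is a hypothetical hook for illustration: when it is
    omitted, the real nltk.download function is used.
    """
    if downloader is None:
        import nltk  # imported lazily so the helper can be defined without NLTK
        downloader = nltk.download
    # nltk.download(identifier) returns True on success, False otherwise
    return {package_id: downloader(package_id) for package_id in package_ids}


# Typical use (needs NLTK installed and a network connection):
# download_packages(["book"])
```

Downloading by identifier is handy on machines without a display, where the interactive window cannot open.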


Once the data is downloaded to your machine, you can load some of it using the Python interpreter.


from nltk.book import *



Basic Operations on Text



# The __future__ imports let the same script run on Python 2 and Python 3
from __future__ import division, print_function
from nltk.book import *

# Printing a text shows its name
print(text3)

# Length of the text from start to finish, counting words and punctuation symbols
print('Length of Text: ' + str(len(text3)))

# The vocabulary of a text is the set of its tokens (duplicates collapsed)
# print(sorted(set(text3)))
print('Number of distinct tokens: ' + str(len(set(text3))))

# Lexical richness: distinct tokens divided by total tokens
def lexical_richness(text):
    return len(set(text)) / len(text)

# Percentage of the text taken up by a specific word
def percentage(word, text):
    return 100 * text.count(word) / len(text)

print('Lexical richness of the text: ' + str(lexical_richness(text3)))
print('Percentage: ' + str(percentage('God', text3)))


We use text3, "The Book of Genesis", to try out NLTK's features. The code sample above shows:



  • The name of the text

  • The length of the text from start to end

  • The number of distinct tokens in the text. (A token is the technical name for a sequence of characters, such as a word or punctuation symbol, that is treated as a unit. The vocabulary of a text is the set of tokens it uses, since in a set all duplicates are collapsed together.)

  • A measure of the lexical richness of the text (the number of distinct words divided by the total number of words)

  • How often a word occurs in the text (the percentage of the text taken up by a specific word)

Note
In Python 2, start the script with from __future__ import division so that / performs floating-point division instead of integer division.
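
To see what the two helper functions compute without loading the NLTK corpora, here is a minimal sketch on a hand-made token list. The list `tokens` is an assumption for illustration (the opening verse of Genesis split by hand, not produced by NLTK), and the snippet is written for Python 3, where / is already true division.

```python
# Hand-made token list for illustration (not loaded from NLTK).
tokens = ["In", "the", "beginning", "God", "created",
          "the", "heaven", "and", "the", "earth", "."]


def lexical_richness(text):
    """Distinct tokens divided by total tokens."""
    return len(set(text)) / len(text)


def percentage(word, text):
    """Share of the text taken up by `word`, in percent."""
    return 100 * text.count(word) / len(text)


total = len(tokens)          # counts every token, punctuation included
distinct = len(set(tokens))  # duplicates such as the repeated "the" collapse
richness = lexical_richness(tokens)
the_share = percentage("the", tokens)

print("Total tokens:", total)
print("Distinct tokens:", distinct)
print("Lexical richness:", round(richness, 3))
print("Percentage taken up by 'the':", round(the_share, 2))
```

Here 11 tokens collapse to 9 distinct ones, so the richness is 9/11, and "the" accounts for 3 of the 11 tokens.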


[image: output of the code snippet above]


Searching Text



  • count(word) - counts how many times the word occurs in the text

  • concordance(word) - shows every occurrence of a given word, together with some context

  • similar(word) - lists other words that appear in a similar range of contexts

  • common_contexts([words]) - shows the contexts shared by two or more words

from __future__ import print_function
from nltk.book import *

# Name of the text
print(text3)

# Count occurrences of a word in the text
print("===Count===")
print(text3.count("Adam"))

# concordance() shows every occurrence of a word with some context
# (here 'Adam' in 'The Book of Genesis'); it prints directly and returns None
print("===Concordance===")
text3.concordance("Adam")

# similar() prints other words that appear in similar contexts
print("===Similar===")
text3.similar("Adam")

# common_contexts() prints the contexts shared by two or more words
print("===Common Contexts===")
text3.common_contexts(["Adam", "Noah"])

[image: output of the code sample]
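
To make the idea behind the concordance and common-contexts views concrete, here is a simplified pure-Python sketch. It is not NLTK's implementation: the function names `concordance_lines`, `contexts` and `shared_contexts`, the `width` parameter, and the made-up token list are all assumptions for illustration. A "context" is taken to be the pair of tokens immediately before and after a word.

```python
def concordance_lines(tokens, word, width=3):
    """Return each occurrence of `word` with up to `width` tokens on either side."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == word.lower():
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append(" ".join(left + [token] + right))
    return lines


def contexts(tokens, word):
    """Return the set of (previous token, next token) pairs around `word`."""
    found = set()
    for i, token in enumerate(tokens):
        if token.lower() == word.lower():
            prev_tok = tokens[i - 1].lower() if i > 0 else "*START*"
            next_tok = tokens[i + 1].lower() if i + 1 < len(tokens) else "*END*"
            found.add((prev_tok, next_tok))
    return found


def shared_contexts(tokens, words):
    """Return the contexts in which every word in `words` occurs."""
    context_sets = [contexts(tokens, w) for w in words]
    if not context_sets:
        return set()
    return set.intersection(*context_sets)


# Made-up token list for illustration (not taken from text3).
tokens = "And Adam lived many years . And Noah lived many years .".split()

print(tokens.count("Adam"))
print(concordance_lines(tokens, "Adam", width=2))
print(shared_contexts(tokens, ["Adam", "Noah"]))
```

In this toy text both names occur between "and" and "lived", so that pair is their one shared context.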


Next, we plot how words are distributed over the text. A dispersion plot shows where words such as "God", "Adam", "Eve", "Noah", "Abram", "Sarah", "Joseph", "Shem" and "Isaac" occur in the book.



# dispersion_plot() needs matplotlib to draw the figure
text3.dispersion_plot(["God","Adam", "Eve", "Noah", "Abram","Sarah", "Joseph", "Shem", "Isaac"])


[image: dispersion plot of the words above]
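
The data behind a dispersion plot is simply the position of each occurrence of a word, counted in tokens from the start of the text. The following sketch computes those positions and draws a crude text version of the plot. The helper `word_offsets` and the made-up sentence are assumptions for illustration and are not part of NLTK.

```python
def word_offsets(tokens, targets):
    """Map each target word to the list of token positions where it occurs."""
    offsets = {target: [] for target in targets}
    for position, token in enumerate(tokens):
        if token in offsets:
            offsets[token].append(position)
    return offsets


# Made-up token list for illustration (not taken from text3).
tokens = "God made Adam . Adam knew Eve . God spoke to Noah .".split()
targets = ["God", "Adam", "Eve", "Noah"]
offsets = word_offsets(tokens, targets)

# One row per word: '|' marks a position where the word occurs, '.' elsewhere.
for word in targets:
    row = "".join("|" if i in offsets[word] else "." for i in range(len(tokens)))
    print("{:>5} {}".format(word, row))
```

Each row plays the role of one horizontal stripe in the plot, with a mark wherever the word appears.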


References


[1] Bird, Steven; Klein, Ewan; Loper, Edward (2009). Natural Language Processing with Python. O'Reilly Media Inc. ISBN 0-596-51649-5.
