Suraj Pattar

Robotics AI Research Engineer / Ph.D. in AI and Robotics / Data Scientist / 3D Printing Enthusiast

Comparing the Efficiency of Parsers in NLTK

Keywords: Natural Language Processing, NLTK, Parsing, Tokenize

Project Description

This was an assignment for my AI course at the University of Genoa. The task was to download some electronic books from Project Gutenberg, find the longest sentence in each, and analyze what syntactic construction(s) were responsible for such long sentences.

We made use of various parsers, including NLTK, Stanford CoreNLP, and OpenNLP. We also made use of several large grammars, namely the CT, ATIS, and PT grammars. In addition, we tried to build our own grammar using NLTK's built-in functions.

In the end, we were able to scan the books for long sentences and find the longest sentence using the built-in functions of the gutenberg object in NLTK. However, we were unsuccessful in building the syntax trees of these long sentences and determining the syntactic construction(s) responsible for them. We encountered the following obstacles and drawbacks with NLTK:

  • The functions of the gutenberg object can mistake a set of sentences for a single sentence.
  • There was no grammar available in NLTK at the time with a lexicon large enough to cover the regular vocabulary used in books. (There are other alternatives used in production that seem more promising, e.g. spaCy, SyntaxNet, Stanford NLP.)
  • Building a grammar manually is very inefficient, and NLTK provides no automated models for learning a grammar from data.
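The last point is easy to see in practice. A hand-built grammar in NLTK must enumerate every word in its lexicon, as in this toy sketch (the grammar rules and vocabulary here are made up for illustration):

```python
import nltk

# A tiny hand-written CFG: every terminal word must be listed explicitly,
# which is why scaling this approach to a whole book is impractical.
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the' | 'a'
N -> 'dog' | 'cat'
V -> 'chased' | 'saw'
""")

parser = nltk.ChartParser(grammar)
sentence = 'the dog chased a cat'.split()
for tree in parser.parse(sentence):
    print(tree)  # prints the parse tree, e.g. (S (NP ...) (VP ...))
```

Any sentence containing a word outside the six listed terminals gets no parse at all, so covering the vocabulary of even one Gutenberg book would require thousands of lexical rules.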