Natural language processing, or NLP for short, is best described as “AI for speech and text.” The magic behind voice commands, speech and text translation, sentiment analysis, text summarization, and many other linguistic applications and analyses, natural language processing has been improved dramatically through deep learning.
The Python language provides a convenient front-end to all varieties of machine learning including NLP. In fact, there is an embarrassment of NLP riches to choose from in the Python ecosystem. In this article we’ll explore each of the NLP libraries available for Python—their use cases, their strengths, their weaknesses, and their general level of popularity.
Note that some of these libraries provide higher-level versions of the same functionality exposed by others, making that functionality easier to use at the cost of some precision or performance. You’ll want to choose a library well-suited both to your level of expertise and to the nature of the project.
The CoreNLP library — a product of Stanford University — was built to be a production-ready natural language processing solution, capable of delivering NLP predictions and analyses at scale. CoreNLP is written in Java, but multiple Python packages and APIs are available for it, including a native Python NLP library called Stanza.
CoreNLP includes a broad range of language tools—grammar tagging, named entity recognition, parsing, sentiment analysis, and plenty more. It was designed to be human language agnostic, and currently supports Arabic, Chinese, French, German, and Spanish in addition to English (with Russian, Swedish, and Danish support available from third parties). CoreNLP also includes a web API server, a convenient way to serve predictions without too much additional work.
The easiest place to start with CoreNLP’s Python wrappers is Stanza, the reference implementation created by the Stanford NLP Group. In addition to being well-documented, Stanza is also maintained regularly; many of the other Python libraries for CoreNLP were not updated for some time.
CoreNLP also supports the use of NLTK, a major Python NLP library discussed below. As of version 3.2.3, NLTK includes interfaces to CoreNLP in its parser. Just be sure to use the correct API.
The obvious downside of CoreNLP is that you’ll need some familiarity with Java to get it up and running, but that’s nothing a careful reading of the documentation can’t achieve. Another hurdle could be CoreNLP’s licensing. The whole toolkit is licensed under the GPLv3, meaning any use in proprietary software that you distribute to others will require a commercial license.
Gensim does just two things, but does them exceedingly well. Its focus is statistical semantics—analyzing documents for their structure, then scoring other documents based on their similarity.
Gensim can work with very large bodies of text by streaming documents to its analysis engine and performing unsupervised learning on them incrementally. It can create multiple types of models, each suited to different scenarios: Word2Vec, Doc2Vec, FastText, and Latent Dirichlet Allocation.
Gensim’s detailed documentation includes tutorials and how-to guides that explain key concepts and illustrate them with hands-on examples. Common recipes are also available on the Gensim GitHub repo.
The latest version, Gensim 4, supports Python 3 only but brings major optimizations to common algorithms such as Word2Vec, a less complex OOP model, and many other modernizations.
The Natural Language Toolkit, or NLTK for short, is among the best-known and most powerful of the Python natural language processing libraries. Many corpora (data sets) and trained models are available to use with NLTK out of the box, so you can start experimenting with NLTK right away.
As the documentation states, NLTK provides a wide variety of tools for working with text: “classification, tokenization, stemming, tagging, parsing, and semantic reasoning.” It can also work with some third-party tools to enhance its functionality, such as the Stanford Tagger, TADM, and MEGAM.
Keep in mind that NLTK was created by and for an academic research audience. It was not designed to serve NLP models in a production environment. The documentation is also somewhat sparse; even the how-tos are thin. Also, there is no 64-bit binary; you’ll need to install the 32-bit edition of Python to use it. Finally, NLTK is not the fastest library either, but it can be sped up with parallel processing.
If you are determined to leverage what’s inside NLTK, you might start instead with TextBlob (discussed below).
If all you need to do is scrape a popular website and analyze what you find, reach for Pattern. This natural language processing library is far smaller and narrower than other libraries covered here, but that also means it’s focused on doing one common job really well.
Pattern comes with built-ins for scraping a number of popular web services and sources (Google, Wikipedia, Twitter, Facebook, generic RSS, etc.), all of which are available as Python modules (e.g.,
from pattern.web import Twitter). You don’t have to reinvent the wheels for getting data from those sites, with all of their individual quirks. You can then perform a variety of common NLP operations on the data, such as sentiment analysis.
Pattern exposes some of its lower-level functionality, allowing you to to use NLP functions, n-gram search, vectors, and graphs directly if you like. It also has a built-in helper library for working with common databases (MySQL, SQLite, and MongoDB in the future), making it easy to work with tabular data stored from previous sessions or obtained from third parties.
Polyglot, as the name implies, enables natural language processing applications that deal with multiple languages at once.
The NLP features in Polyglot echo what’s found in other NLP libraries: tokenization, named entity recognition, part-of-speech tagging, sentiment analysis, word embeddings, etc. For each of these operations, Polyglot provides models that work with the needed languages.
Note that Polyglot’s language support differs greatly from feature to feature. For instance, the language detection system supports almost 200 languages, tokenization supports 165 languages (largely because it uses the Unicode Text Segmentation algorithm), and sentiment analysis supports 136 languages, while part-of-speech tagging supports only 16.
PyNLPI (pronounced “pineapple”) has only a basic roster of natural language processing functions, but it has some truly useful data-conversion and data-processsing features for NLP data formats.
Most of the NLP functions in PyNLPI are for basic jobs like tokenization or n-gram extraction, along with some statistical functions useful in NLP like Levenshtein distance between strings or Markov chains. Those functions are implemented in pure Python for convenience, so they’re unlikely to have production-level performance.
But PyNLPI shines for working with some of the more exotic data types and formats that have sprung up in the NLP space. PyNLPI can read and process GIZA, Moses++, SoNaR, Taggerdata, and TiMBL data formats, and devotes an entire module to working with FoLiA, the XML document format used to annotate language resources like corpora (bodies of text used for translation or other analysis).
You’ll want to reach for PyNLPI whenever you’re dealing with those data types.
SpaCy, which taps Python for convenience and Cython for speed, is billed as “industrial-strength natural language processing.” Its creators claim it compares favorably to NLTK, CoreNLP, and other competitors in terms of speed, model size, and accuracy. SpaCy contains models for multiple languages, although only 16 of the 64 supported have full data pipelines available for them.
SpaCy includes most every feature found in those competing frameworks: speech tagging, dependency parsing, named entity recognition, tokenization, sentence segmentation, rule-based match operations, word vectors, and tons more. SpaCy also includes optimizations for GPU operations—both for accelerating computation, and for storing data on the GPU to avoid copying.
The documentation for SpaCy is excellent. A setup wizard generates command-line installation actions for Windows, Linux, and macOS and for different Python environments (pip, conda, etc.) as well. Language models install as Python packages, so they can be tracked as part of an application’s dependency list.
The latest version of the framework, SpaCy 3.0, provides many upgrades. In addition to using the Ray framework for performing distributed training on multiple machines, it offers a new transformer-based pipeline system for better accuracy, a new training system and workflow configuration model, end-to-end workflow managament, and a good deal more.
TextBlob is a friendly front-end to the Pattern and NLTK libraries, wrapping both of those libraries in high-level, easy-to-use interfaces. With TextBlob, you spend less time struggling with the intricacies of Pattern and NLTK and more time getting results.
TextBlob smooths the way by leveraging native Python objects and syntax. The quickstart examples show how texts to be processed are simply treated as strings, and common NLP methods like part-of-speech tagging are available as methods on those string objects.
Another advantage of TextBlob is you can “lift the hood” and alter its functionality as you grow more confident. Many default components, like the sentiment analysis system or the tokenizer, can be swapped out as needed. You can also create high-level objects that combine components—this sentiment analyzer, that classifier, etc.—and re-use them with minimal effort. This way, you can prototype something quickly with TextBlob, then refine it later.
Copyright © 2021 IDG Communications, Inc.