If we think it’s hard to learn a new language, imagine the challenges hardware and software engineers face when using CPUs and GPUs to process extensive language data. Natural language processing (NLP) attempts to bridge this gap between language and computing.
Recently, MIT announced that its researchers have devised an NLP system that pays attention to more relevant keywords in speech instead of giving equal weight—and computing power—to all words in a sentence.
According to the research team, this development illustrates not only the vital role of NLP algorithms in software but also the need for robust processors to handle the massive amounts of data computing involved in modern natural language processing systems.
What is Natural Language Processing?
Human language is difficult to process and analyze because of its many redundancies, such as adverbs, articles, and prepositions. NLP works to simplify and translate human language into a form that a computer can understand.
The trouble with the many redundancies in human speech is that machines often rely on them, too, when determining whether an input sentence carries positive or negative sentiment. To identify which parts, or “bits,” are unnecessary, NLP uses an attention mechanism to shrink a string of data without losing its meaning. For example, the term “failed program” is shortened, or pruned, through the attention mechanism and analyzed as “faild prgrm.”
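To make the attention mechanism concrete, here is a minimal NumPy sketch (not MIT's implementation) of scaled dot-product attention. Each row of the resulting probability matrix sums to 1, and the column sums indicate how much attention each token receives overall—exactly the kind of signal a pruning scheme can act on:

```python
import numpy as np

def attention_probs(Q, K):
    """Scaled dot-product attention probabilities: softmax(Q K^T / sqrt(d))."""
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d)
    logits -= logits.max(axis=-1, keepdims=True)   # subtract row max for numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 tokens, embedding dimension 8
K = rng.normal(size=(4, 8))
P = attention_probs(Q, K)     # (4, 4); each row sums to 1
importance = P.sum(axis=0)    # cumulative attention each token receives
```

Tokens with a small cumulative score in `importance` contribute little to the result and are natural candidates for pruning.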
Current NLP systems fall short of handling multiple branches of data with complicated movements and low arithmetic intensity, leading to heavy memory access and slower calculations.
Researchers at MIT say they have designed a system called SpAtten that can eliminate unnecessary data from real-time calculations to focus on keywords and anticipate the next words to follow in a sentence without jeopardizing performance, efficiency, and memory access.
A sentence analysis model used in MIT’s SpAtten can eliminate insignificant parts of the language while still maintaining the word’s essence to determine a positive or negative result. Image used courtesy of MIT
SpAtten’s Software and Hardware Architecture
When large volumes of data funnel into a single source, processing bottlenecks are common. Current NLP attention mechanisms run up against limits in power, processing speed, and computation.
MIT’s SpAtten uses three algorithmic optimizations to reduce computation and memory access while improving overall performance: cascade token pruning, cascade head pruning, and progressive quantization for attention inputs.
Cascade pruning is a technique that eliminates unnecessary data bits from calculations in real-time without delays. A token is a keyword found in a sentence, while a head refers to a branch of computation the attention mechanism follows to determine future words. Each of these algorithms is input-dependent, adapting to every inputted instance.
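A rough sketch of the token-pruning idea (the function name, keep ratio, and toy attention matrix are illustrative assumptions, not MIT's actual design): tokens are ranked by the cumulative attention they receive, and the lowest-ranked ones are dropped from all subsequent layers.

```python
import numpy as np

def cascade_token_prune(tokens, attn_probs, keep_ratio=0.5):
    """Keep the tokens that attract the most attention; drop the rest.

    attn_probs is an (n, n) attention matrix whose column j is the
    attention paid to token j. Once pruned, a token is excluded from
    all later layers -- the "cascade" part of cascade pruning.
    """
    importance = attn_probs.sum(axis=0)            # cumulative attention per token
    k = max(1, int(len(tokens) * keep_ratio))
    keep = np.sort(np.argsort(importance)[-k:])    # top-k tokens, original order kept
    return [tokens[i] for i in keep]

tokens = ["the", "program", "has", "failed"]
attn = np.array([
    [0.05, 0.45, 0.05, 0.45],
    [0.05, 0.45, 0.05, 0.45],
    [0.05, 0.45, 0.05, 0.45],
    [0.05, 0.45, 0.05, 0.45],
])  # content words dominate the attention budget
print(cascade_token_prune(tokens, attn))  # ['program', 'failed']
```

Because the decision depends on the attention matrix computed for each input, the pruning is input-dependent rather than fixed ahead of time.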
SpAtten’s architecture allows for DRAM access to be reduced by a factor of 10 and computation reduced by a factor of 3.8. Image used courtesy of MIT
MIT researchers added a parallel, fully pipelined hardware architecture to complement the software and support real-time cascade pruning. SpAtten’s hardware makes better use of memory bandwidth through on-chip bit-width converters that split fetched values into most significant bits (MSBs) and least significant bits (LSBs).
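A toy illustration of the bit-width-converter idea (the 4/4 split below is an arbitrary assumption, not SpAtten's actual widths): an 8-bit value is divided into an MSB half that is fetched first for a coarse estimate, and an LSB half that is fetched only when full precision is needed.

```python
import numpy as np

MSB_BITS, TOTAL_BITS = 4, 8            # illustrative split sizes
LSB_BITS = TOTAL_BITS - MSB_BITS

def split_bits(x):
    """Split unsigned 8-bit values into MSB and LSB halves."""
    msb = x >> LSB_BITS
    lsb = x & ((1 << LSB_BITS) - 1)
    return msb, lsb

def approx_from_msb(msb):
    """Coarse value from the MSBs alone -- enough for a first-pass estimate."""
    return msb << LSB_BITS

def reconstruct(msb, lsb):
    """Full precision once the LSBs have also been fetched."""
    return (msb << LSB_BITS) | lsb

x = np.array([200, 37, 255, 0], dtype=np.uint8)
msb, lsb = split_bits(x)
assert np.array_equal(reconstruct(msb, lsb), x)   # lossless round trip
```

Fetching only MSBs halves memory traffic for the values that turn out not to need full precision, which is the progressive-quantization payoff.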
An on-chip SRAM holds pruned-down tokens that can be reused across multiple queries, freeing up memory on the CPU.
Comparisons to Google’s BERT and OpenAI’s GPT-2
Two different task types define current NLP models: discriminative and generative. Discriminative tasks include sentence-level classification and regression, which help the model decide whether certain data passes or fails. Generative tasks handle the distribution of analyzed data and allow the model to predict the next word in a sequence.
Google’s BERT is a discriminative task-based model that needs to summarize a given input to make predictions. As inputs grow larger and more complex, these models experience higher latency.
An image of Google’s BERT in use, helping search engines provide better results. Image used courtesy of Google
An example of a generative task-based model is OpenAI’s GPT-2, which runs into latency and performance issues as it summarizes input information to generate new tokens.
Both models can only process a single token at a time rather than a full sentence, so the attention mechanism accounts for roughly 50% of total latency. MIT’s SpAtten combines NLP algorithms with external hardware specifically designed for the attention mechanism. This combination alleviates the large power consumption that standard CPUs face when running GPT-2 or BERT.
A breakdown of speedup over GPU for SpAtten. Image used courtesy of MIT
MIT explains that SpAtten achieves lower latency thanks to its parallelized software and hardware and its pipelined datapath. The model can also reduce computation and DRAM access through cascade pruning and progressive quantization, which lowers energy usage.
Can SpAtten Be Used for Mobile Devices?
MIT concedes that SpAtten (in its current state) would not suit small-form-factor IoT devices because of its power consumption; SpAtten consumes 8.3 W of power, and most mobile devices do not exceed 5 W of consumption.
However, by shortening the input via cascade pruning, developers can work with smaller models with shorter processing times and lower power consumption. This goal of lower power consumption may be the next step in realizing interactive dialog for mobile applications while still upholding a stronger attention mechanism.
Though NLP has come a long way from where it began, it is still far from perfect. The key to advancing this technology—especially in terms of power and latency—lies in improving both the software and hardware in tandem, according to MIT.