SANTA CLARA, CALIF.—“The revolution we’ve seen for text will be coming to images,” renowned computer scientist Andrew Ng asserted in a keynote talk he gave at the recent AI Hardware Summit here.
Ng demonstrated a technique he called “visual prompting,” using Landing.ai’s user interface to prompt an AI agent to recognize objects in images by scribbling on the object with his mouse pointer. In just a few moments on stage, he prompted the agent to recognize a dog and to count cells in images of a petri dish.
“At [computer vision conference] CVPR, there was something in the air in computer vision, in the way that three years ago there was something in the air at NLP conferences,” Ng told the audience. “Progress has been driven by large transformer networks. This is true for text with LLMs [large language models] and is increasingly true for vision, training increasingly with unlabeled data … and scaling up model size is helping these [vision] models generalize.”
Ng told EE Times afterward that the trends now playing out for LLMs will begin to appear in vision as large transformer networks, in the form of large vision models (LVMs), become more mainstream.
“Yes, we are seeing a lot of excitement on LVMs, but the technology for LVMs is not yet mature,” he said.
While it is easy to generate and understand text tokens, and text is linear (one token follows another), understanding images with attention is less straightforward. Patches of an image can be taken as tokens, but in what order do the patches belong? Which patches do you hide, and which do you predict? And what happens for video, which adds another dimension of complexity?
“In the text realm, there were encoder and decoder architectures, but eventually, most people coalesced around decoder-only architectures,” Ng said. “There’s a bunch of decisions you make, and [LVMs] are at an earlier stage of making those decisions.”
One unanswered question is: Where will the data for training large-scale LVMs come from? The largest text-generation LLMs famously rely on a huge corpus of the internet for training. The internet can provide a huge amount of unlabeled, unstructured training data. A small amount of labeled data may then be used for fine-tuning and instruction-tuning.
Vision AI has typically required labeled data for training, but this may not always be the case, Ng said.
Techniques where parts of images are hidden and the neural network has to fill in the gaps can work to train vision networks on unlabeled data.
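The questions above about patches as tokens, and about which patches to hide and which to predict, can be made concrete with a toy sketch. The Python below is only an illustration under stated assumptions, not a method described by Ng or used by Landing.ai: the helper names `patchify` and `random_mask`, the row-major patch ordering and the 50% mask ratio are all hypothetical choices made for the example.

```python
import random

def patchify(image, patch):
    """Split a 2D image (a list of rows) into non-overlapping patch x patch
    tiles, returned in row-major order as flat lists of pixel values."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tokens.append([image[r][c]
                           for r in range(top, top + patch)
                           for c in range(left, left + patch)])
    return tokens

def random_mask(num_tokens, mask_ratio, seed=0):
    """Pick which patch indices to hide; the remaining ones stay visible."""
    rng = random.Random(seed)
    num_hidden = int(num_tokens * mask_ratio)
    hidden = sorted(rng.sample(range(num_tokens), num_hidden))
    hidden_set = set(hidden)
    visible = [i for i in range(num_tokens) if i not in hidden_set]
    return visible, hidden

# Toy 8x8 grayscale "image" whose pixel value encodes its position.
image = [[row * 8 + col for col in range(8)] for row in range(8)]
tokens = patchify(image, patch=4)  # 4 tokens of 16 pixels each

# Assumption: hide half the patches, chosen at random.
visible, hidden = random_mask(len(tokens), mask_ratio=0.5)

# A network would receive only the visible patches (with their positions)
# and be trained to reconstruct the pixels of the hidden ones, so no
# human-written labels are needed.
model_input = [(i, tokens[i]) for i in visible]
targets = [(i, tokens[i]) for i in hidden]
```

The row-major scan is just one possible answer to the ordering question. A real system would also need to settle the patch size, the masking strategy and how position is encoded, which are the kinds of decisions Ng described as still open.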
Another route might be synthetic data, though so far it has proved too expensive to have text-generation AIs produce the trillions of text tokens required to train a ChatGPT-sized model.
“If you want a model to mimic the style of a specific LLM, it could do that with millions of tokens, maybe even hundreds of thousands, so that’s more feasible,” Ng said.
With transformers dominating language AI and coming to vision AI, does Ng think transformers will eventually become the de facto neural network architecture for all forms of AI?
“No, I don’t think so,” he said. “Transformers are a fantastic tool in our tool chest, but I don’t think they are our only tool.”
Ng pointed out that while generative AI has done wonders for the masses of available unstructured data, it hasn’t done anything for our ability to process structured data, where there are useful insights to be gained for today’s applications. Structured data—perhaps columns of numbers in a spreadsheet—are not suited to transformers and will continue to require their own approach to AI.
The current trend for LLMs is that the bigger they are, the better they are at generalizing. But how big can LLMs get? Is there a practical limit?
“I don’t think we’ve exhausted scaling up as a recipe,” Ng said. “But it’s getting hard enough that I think there are other paths to innovation as well.”
Ng said that, in many use cases, a 13-billion–parameter model will work just as well as a 175-billion–parameter model, and for something straightforward like grammar checking, a 3-billion–parameter model running on a laptop may suffice.
One billion parameters might be enough for basic text processing like sentiment classification, which could run on a mobile device, while tens of billions of parameters are required for “decent amounts of knowledge about the world,” and hundreds of billions of parameters for more complex reasoning.
“There is one possible future where we’ll see more applications running at the edge,” he said. “We’ll fall back to cloud when you’re doing a really complex task that does really need a 100-billion–parameter model, but I think a lot of the tasks could be run with more modest-sized models.”
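A back-of-envelope calculation shows why parameter count maps so directly onto where a model can run. The sketch below multiplies parameter count by bytes per parameter to estimate the memory needed just to hold the weights. The precisions shown (16-bit and 4-bit) are assumptions chosen for illustration, not figures Ng gave, and the estimate ignores activations and other runtime overhead.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Memory needed to store the weights alone, in gigabytes (1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

# Model sizes mentioned in the article.
model_sizes = {
    "1B": 1_000_000_000,
    "3B": 3_000_000_000,
    "13B": 13_000_000_000,
    "175B": 175_000_000_000,
}

# Assumed storage precisions: 2 bytes (16-bit) and 0.5 bytes (4-bit).
for name, params in model_sizes.items():
    fp16 = weight_memory_gb(params, 2)
    int4 = weight_memory_gb(params, 0.5)
    print(f"{name:>4}: {fp16:6.1f} GB at 16-bit, {int4:6.1f} GB at 4-bit")
```

Under these assumptions, a 3-billion-parameter model needs about 6 GB for its weights at 16-bit precision, which is within reach of a laptop, while a 175-billion-parameter model needs about 350 GB, which is not.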
Transformers and the attention mechanism they are based on were invented six years ago, but hardware makers have so far taken only tentative steps to specialize their accelerators for this important workload.
Have we reached the point where the architecture of the transformer is beginning to mature, or should we expect more evolution of this workload going forward?
“It’s difficult [to know],” he said. “The original paper is from 2017. … I’d be slightly disappointed if this is the final architecture, but I’m also willing to be shocked. … [Attention] works so well. Biological and digital brains are very different, but in biological intelligence, it feels like our brains are a collection of stuff that evolution jammed together—but it works well enough. Neural networks worked well enough before transformers. And think how long the x86 architecture has lasted!”