Advance aims to accelerate AI learning

For state-of-the-art artificial intelligence applications to work efficiently on a computer, the algorithm, software and hardware need to work collaboratively. Here’s a conceptual diagram of the high-performance deep-learning accelerator that UW-Madison computer engineers Jing Li and Jialiang Zhang demonstrated. They improved algorithmic mapping and the use of memory resources through a 2-D data-sharing architecture between processing elements and local memory.

Artificial intelligence holds great promise in solving some of the world’s biggest challenges in healthcare, energy, security and other areas.

It can help healthcare providers develop the right drug to treat an individual person’s illness. It is key to technologies in self-driving cars and is embedded into consumer electronics like smartphones (think: Siri). Essentially, it is all around us, transforming the way businesses operate and how people interact with the world in everyday life—for example, enhancing our online search-engine experience, translating text into other languages, recognizing faces in photos and videos, and more.

The computer platform itself—the hardware and software—plays a major role in enabling ubiquitous artificial intelligence applications.

And engineers at the University of Wisconsin-Madison have developed a method that will enable artificial intelligence applications to learn faster, with less energy.

The method centers around the researchers’ discovery that a key performance bottleneck in “learning” using the deep-learning neural network—one type of artificial intelligence—is how quickly information can be read from or stored in the processor’s own memory. “Artificial intelligence is the next big wave in computing, but our computers are not designed for artificial intelligence,” says Jing Li, the Dugald C. Jackson Faculty Scholar and an assistant professor of electrical and computer engineering at UW-Madison.

In other words, even state-of-the-art machine-learning algorithms—the “soul” of artificial intelligence—cannot make the best use of their hardware resources.

Currently, they are out of sync. “For this process to work efficiently, different aspects of the computer—software, hardware and algorithms—need to work collaboratively,” says Li.

The solution she and graduate student Jialiang Zhang developed aims to bring them back into balance.

For their research, they used a state-of-the-art reconfigurable integrated circuit known as a field-programmable gate array, or FPGA: essentially, a DIY computer chip they could program to suit their needs. Companies including Microsoft, Amazon, Baidu and Alibaba have recently deployed this promising post-central-processing-unit (post-CPU) technology in their cloud and data centers for everything from web-search ranking to machine-learning applications such as image and speech recognition.

To program the FPGA, Li and Zhang employed the popular programming framework OpenCL, which was recently enabled for FPGAs and greatly reduces programming time and complexity compared with the lower-level hardware description languages used in traditional tools.

OpenCL is the interface that allows programmers to develop kernels, which are individual tasks that run on the FPGA. Kernels are the heart of the computer, performing the “heavy-lifting” jobs, such as data-processing and memory tasks, using the computer’s resources.

What Li and Zhang found is that in implementing a convolutional neural network (one of the dominant machine-learning algorithms) on a state-of-the-art FPGA, the bottleneck is the on-chip memory bandwidth. That bottleneck is caused by a mismatch between the FPGA’s available memory resources and the resources required by the kernels.
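To see why limited memory bandwidth can leave computational resources idle, consider a rough back-of-the-envelope model. The sketch below uses the widely known "roofline" estimate, in which achievable throughput is the smaller of the chip's peak compute rate and the rate at which memory can feed it. It is an illustration only, not the researchers' analysis; the function names and all numbers are hypothetical.

```python
def attainable_throughput(peak_ops_per_s, bandwidth_bytes_per_s, ops_per_byte):
    """Roofline estimate: throughput is capped by compute or by memory traffic.

    ops_per_byte is how many operations a kernel performs for each byte
    it moves to or from memory (its arithmetic intensity).
    """
    memory_bound = bandwidth_bytes_per_s * ops_per_byte
    return min(peak_ops_per_s, memory_bound)


def compute_utilization(peak_ops_per_s, bandwidth_bytes_per_s, ops_per_byte):
    """Fraction of the peak compute rate that the kernel can actually use."""
    achieved = attainable_throughput(
        peak_ops_per_s, bandwidth_bytes_per_s, ops_per_byte
    )
    return achieved / peak_ops_per_s


if __name__ == "__main__":
    # Hypothetical chip: 1e12 operations/s peak, 1e11 bytes/s of memory bandwidth.
    peak, bandwidth = 1e12, 1e11
    for intensity in (2, 5, 20):
        share = compute_utilization(peak, bandwidth, intensity)
        print(f"{intensity} ops per byte: {share:.0%} of peak compute in use")
```

In this toy model, a kernel that performs only a few operations per byte moved stays memory-bound, so most of the arithmetic units sit idle no matter how many of them the chip has.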

Consequently, on-chip computational resources are underused, which limits overall system performance. “We invented techniques to address the bottlenecks by developing a 2D multicast interconnection between processing elements and local memory to effectively improve the memory use,” says Li.
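The general idea behind sharing data across a two-dimensional array of processing elements can be sketched with a simple count of memory reads. The example below is a simplified assumption, not Li and Zhang's actual design: it imagines a grid in which the processing element at row i and column j multiplies a[i] by b[j], and it compares each element fetching its own operands against reading every operand once and multicasting it along a row or column. The function names are hypothetical.

```python
def private_fetch(a, b):
    """Every processing element reads both of its operands from memory itself."""
    reads = 0
    grid = []
    for x in a:
        row = []
        for y in b:
            reads += 2  # one read for x and one for y, repeated per element
            row.append(x * y)
        grid.append(row)
    return grid, reads


def multicast_2d(a, b):
    """Each operand is read once, then shared along a whole row or column."""
    reads = len(a) + len(b)  # one read per row operand and per column operand
    grid = [[x * y for y in b] for x in a]
    return grid, reads


if __name__ == "__main__":
    a = list(range(1, 17))  # operands shared along 16 rows
    b = list(range(1, 17))  # operands shared along 16 columns
    _, private_reads = private_fetch(a, b)
    _, shared_reads = multicast_2d(a, b)
    print(f"private fetch: {private_reads} reads, 2D multicast: {shared_reads} reads")
```

Both functions produce the same products, but in this model the number of memory reads grows with the sum of the grid's dimensions rather than with their product, which is the kind of saving that eases pressure on memory bandwidth.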

She believes their demonstration achieves the best power efficiency and performance density compared to existing work—including that of industry giants.

Li and Zhang published news of this advance in the Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays.

Author: Renee Meiller