Mastering Portability in AI’s Hardware Revolution


By
Andrew Younge, R&D Manager, Scalable Computer Architectures, Sandia National Laboratories 

10.13.2023


In an era when artificial intelligence (AI) and machine learning (ML) are capturing the media limelight, high-performance computing (HPC) is often an unsung hero on two levels. It drives groundbreaking research that is revolutionizing fields like healthcare and climate science, and it is a critical testing ground for cutting-edge technologies and computing techniques.

As specialized hardware becomes increasingly common in both HPC and industry settings, performance portability has surfaced as a key challenge. Performance portability is what allows applications to perform efficiently across a variety of computing systems. It’s crucial not just for scientific research but also for businesses that need to adapt quickly to new technologies and environments. For enterprises relying on performance-intensive AI or ML workloads, the need for performance portability will only grow. In this article, we’ll go beyond the technical jargon to examine how performance portability could be a game changer for your business, enabling faster innovation while keeping costs in check.

Tackling Performance Portability from the Front Lines

At Sandia National Laboratories, we function as an R&D unit collectively aimed at shaping the future of computing. We’re incubating technologies that not only meet the needs of national objectives but can also power the next wave of business innovation, from AI-driven analytics to real-time financial modeling.

Andrew Younge between rows of racks in the Astra supercomputer, the first petascale-class Arm system deployed under the Sandia Vanguard program.

My journey into this domain began with an interest in distributed systems, virtualization and containers, and energy efficiency in HPC. As an R&D manager and research scientist at Sandia, working among some of the brightest and most dedicated researchers, I have a solid vantage point on the challenges and opportunities in advanced computing environments. My previous research on scalable system software and virtualization placed a heavy emphasis on software portability. Through it all, our team has come to realize that the conventional “one-size-fits-all” strategy is no longer tenable in today’s heterogeneous compute environments. Evolution is necessary.

What Is Performance Portability?

For many years, research organizations and enterprises primarily relied on general-purpose computing machines. However, the waning momentum of Dennard scaling and the encroaching limitations of Moore’s law have prompted a diversification and specialization in software workloads. In turn, this has reshaped the hardware landscape. Technology providers now offer an array of accelerators, CPUs, interconnects, and more, each uniquely tailored for specific tasks. For instance, GPUs excel in parallel data processing, while TPUs are geared towards machine learning tasks. In this evolving ecosystem, the focus isn’t just on “performance portability” but also on achieving “cross-platform efficiency” and “hardware-agnostic performance.” These related concepts underscore the importance of ensuring that software not only runs but runs efficiently across diverse computing architectures.

An agile software approach is essential in this age of specialized, often disparate hardware. Traditionally, tailoring software to a new system could mean rewriting millions of lines of code, a process that is both unsustainable and costly in time and human resources. This is where performance portability comes to the rescue. It allows software to maintain high levels of performance across diverse architectures without extensive re-engineering. In today’s complex computing environments, cross-platform efficiency isn’t merely a ‘nice-to-have’; it has become an imperative for meeting the computational demands of the future.

Innovations in Performance Portability: What to Watch For

The pursuit of performance portability hinges on software reusability, which, in turn, is deeply connected to how we architect applications for diverse computing environments. The Vanguard Program currently stands as Sandia’s primary avenue for innovation in the realm of portability across diverse hardware architectures.

Born as an extension of Sandia’s Advanced Architecture Testbed, Vanguard aims to mitigate the risks of integrating untested technologies by identifying and addressing gaps in both hardware and software ecosystems. It serves as an essential link, connecting small-scale, node- or rack-level testbeds with large-scale systems that are ready for deployment. This program does more than just test emerging technologies; it aligns directly with the goal of performance portability.

Through the evaluation of real-world production workloads, Vanguard makes it easier to adapt software codes for new, diverse platforms, ensuring that they perform efficiently across different computing architectures. For technology vendors, Vanguard also expands the array of viable technology options, fostering competition and thus driving advancements in hardware-agnostic performance.

Sandia and others across the Department of Energy (DOE) have made significant strides in this realm through the Kokkos EcoSystem, a suite of tools specifically designed to enhance cross-platform portability in applications written in C++. Think of Kokkos as a universal translator, ensuring that software performs effectively on the various platforms it runs on. When developing parallel programs, developers face a plethora of choices influenced by target architectures and other variables. Kokkos offers “guardrails” in the form of patterns, policies, and spaces that guide application development teams in algorithm creation. With Kokkos Core, these algorithms and data structures can be automatically mapped to different architectures, whether they are CPU-based systems using backends such as OpenMP, hardware built around NVIDIA or AMD GPUs, or even custom accelerators. Essentially, Kokkos helps standardize good coding practices that can readily adapt to diverse architectures.
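To make the idea of patterns and spaces concrete, here is a toy sketch in plain standard C++. It is not the Kokkos API: the names SerialSpace, ThreadsSpace, parallel_for, and axpy below are hypothetical stand-ins chosen for illustration, and the "backends" are just a serial loop and a handful of std::thread workers. What it tries to show is the shape of the approach: the application kernel is written once against a parallel pattern, and the choice of where it runs is passed in separately.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical "execution space" tags (illustrative only, not Kokkos types).
struct SerialSpace {};
struct ThreadsSpace { unsigned num_threads; };

// The "parallel for" pattern, serial backend: visit every index in order.
template <class Body>
void parallel_for(SerialSpace, std::size_t n, Body body) {
  for (std::size_t i = 0; i < n; ++i) body(i);
}

// The same pattern, threaded backend: split [0, n) into contiguous chunks
// and give each chunk to its own worker thread.
template <class Body>
void parallel_for(ThreadsSpace space, std::size_t n, Body body) {
  const std::size_t workers = std::max<std::size_t>(1, space.num_threads);
  const std::size_t chunk = (n + workers - 1) / workers;
  std::vector<std::thread> pool;
  for (std::size_t w = 0; w < workers; ++w) {
    const std::size_t begin = std::min(n, w * chunk);
    const std::size_t end = std::min(n, begin + chunk);
    if (begin == end) break;  // no work left for remaining workers
    pool.emplace_back([=] {
      for (std::size_t i = begin; i < end; ++i) body(i);
    });
  }
  for (auto& t : pool) t.join();
}

// Application kernel, written once: y = a * x + y.
// Only the Space argument decides how the loop is executed.
template <class Space>
void axpy(Space space, double a, const std::vector<double>& x,
          std::vector<double>& y) {
  assert(x.size() == y.size());
  const double* xp = x.data();
  double* yp = y.data();
  // Each index writes only its own element, so iterations are independent.
  parallel_for(space, y.size(),
               [=](std::size_t i) { yp[i] = a * xp[i] + yp[i]; });
}
```

Calling axpy(SerialSpace{}, 2.0, x, y) and axpy(ThreadsSpace{4}, 2.0, x, y) runs the same kernel on two "backends" without touching the kernel itself. A real portability layer such as Kokkos goes much further, for example by also managing where data lives and how it is laid out in memory, which this sketch deliberately leaves out.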

Within the broader framework of the Sandia Vanguard Program, we employ a variety of specialized tools to achieve our objectives. Containers are another tool proving highly useful, and they have become an invaluable resource for streamlining the porting process. Transitioning even simple code to new architectures can consume significant time and resources. Containers help alleviate this burden by enabling the creation of a ‘manifest,’ a set of instructions capturing the key steps and specific library versions required to build and optimize an application. This not only conserves time and financial resources but also enables other teams to leverage existing expertise, expediting their progress and reducing the trial-and-error phase. These tools serve as tactical solutions within Vanguard’s strategic mission to ensure efficient application deployment across a multitude of computing architectures.

Amid the complexities of large-scale computing environments, teams thrive when collaboration is seamless, and a big part of our goal in using and evaluating these tools is to make teamwork easier. This is a cornerstone of our efforts, because collective work is what opens new avenues of innovation.

Progress Made, Opportunities Ahead

While we’ve made significant strides in performance portability, we view the path ahead as filled with opportunities for innovation. For our dedicated and highly skilled engineering team, porting to a new architecture is an exciting challenge that keeps us engaged, pushes the boundaries of innovation, and fosters a culture of continuous learning and improvement. At Sandia, we are assembling expert teams for targeted efforts that create opportunities for growth, not just for us but for the industry at large. These challenges also invite collaboration, as Sandia actively partners across industry, government, and academia to pioneer novel solutions together.

And we’re not alone. The challenge of performance portability is universal, affecting HPC and business sectors alike. As industries evolve, the importance of adaptable software will only grow. By investing in performance portability, both scientific research and businesses stand to gain, making future technology migrations more efficient and less costly. At Sandia, we are actively considering how we can share our experiences and methodologies with the wider industry.

SC23: A Forum for Progress

One of the places where the HPC community gathers to discuss these challenges and explore solutions collaboratively is the SC Conference, held this year in Denver, CO, the week of November 12-17. The SC Conference isn’t just an HPC get-together; it’s a meeting of minds from various sectors, focused on solving universal computational challenges. It’s where we discuss not just the future of HPC but also its immediate relevance to business innovation and agility. Events like this serve as platforms for fostering partnerships, demonstrating Sandia’s collaborative approach to tackling industry-wide computational challenges.

Personally, I invest a lot of time at SC conferences digging into the latest industry trends and tracking innovations that could significantly aid our mission at Sandia and beyond. This year at SC23, I’m part of the steering committee for the CANOPIE HPC workshop, where the focus is on cutting-edge container technologies, virtualization, and operating system software supporting HPC.

Given its potential for improving enterprise agility and helping to control costs, performance portability should be on the radar of every tech leader or business decision-maker overseeing performance-intensive applications that need to take advantage of cutting-edge hardware.

I encourage you to join us at SC23 to be part of this vital conversation that will shape not only the future of computing but also your enterprise’s competitiveness in an increasingly complex and dynamic landscape. The imperative for performance portability is a rallying cry for both the scientific and business communities to come together and drive the computational capabilities of tomorrow.


