Machine Learning-inspired High-performance and Energy-efficient Heterogeneous Manycore Chip Design

Machine Learning-inspired High-performance and Energy-efficient Heterogeneous Manycore Chip Design
Author: Wonje Choi
Publisher:
Total Pages: 134
Release: 2018
Genre:
ISBN:

In this dissertation, we undertake above-mentioned problems of designing efficient heterogenous manycore architectures. First, we propose a hybrid Network-on-Chip architecture consisting of both wireline and wireless links that can seamlessly handle the varied traffic requirements that arise in heterogeneous manycore platforms. Second, we develop a machine learning-based multi-objective optimization (MOO) algorithm that learns an evaluation function and guides the search toward optimal designs in heterogeneous manycore systems. Finally, we propose architecture-independent imitation learning-based methodology for dynamic VFI control in heterogeneous manycore systems to address power and thermal issues.

Towards Heterogeneous Multi-core Systems-on-Chip for Edge Machine Learning

Towards Heterogeneous Multi-core Systems-on-Chip for Edge Machine Learning
Author: Vikram Jain
Publisher: Springer Nature
Total Pages: 199
Release: 2023-09-15
Genre: Technology & Engineering
ISBN: 3031382307

This book explores and motivates the need for building homogeneous and heterogeneous multi-core systems for machine learning to enable flexibility and energy-efficiency. Coverage focuses on a key aspect of the challenges of (extreme-)edge-computing, i.e., design of energy-efficient and flexible hardware architectures, and hardware-software co-optimization strategies to enable early design space exploration of hardware architectures. The authors investigate possible design solutions for building single-core specialized hardware accelerators for machine learning and motivates the need for building homogeneous and heterogeneous multi-core systems to enable flexibility and energy-efficiency. The advantages of scaling to heterogeneous multi-core systems are shown through the implementation of multiple test chips and architectural optimizations.

Machine Learning-Enabled Vertically Integrated Heterogeneous Manycore Systems for Big-Data Analytics

Machine Learning-Enabled Vertically Integrated Heterogeneous Manycore Systems for Big-Data Analytics
Author: Biresh Kumar Joardar
Publisher:
Total Pages: 101
Release: 2020
Genre: Big data
ISBN:

The rising use of deep learning and other big-data algorithms has led to an increasing demand for hardware platforms that are computationally powerful, yet energy-efficient. Heterogeneous manycore architectures that integrate multiple types of cores on a single chip present a promising direction in this regard. However, designing these new architectures often involves optimizing multiple conflicting objectives (e.g., performance, power, thermal, reliability, etc.) due to the presence of a mix of computing elements and communication methodologies; each with a different requirement for high-performance. This has made the design, and evaluation of new architectures an increasingly challenging problem. Machine Learning algorithms are a promising solution to this problem and should be investigated further. This dissertation focuses on the design of high-performance and energy efficient architectures for big-data applications, enabled by data-driven machine learning algorithms. As an example, we consider heterogeneous manycore architectures with CPUs, GPUs, and Resistive Random-Access Memory (ReRAMs) as the choice of hardware platform in this work. The disparate nature of these processing elements introduces conflicting design requirements that need to be satisfied simultaneously. In addition, novel design techniques like Processing-in-memory and 3D integration introduces additional design constraints (like temperature, noise, etc.) that need to be considered in the design process. Moreover, the on-chip traffic pattern exhibited by different big-data applications (like many-to-few-to-many in CPU/GPU-based manycore architectures) need to be incorporated in the design process for optimal power-performance trade-off. However, optimizing all these objectives simultaneously leads to an exponential increase in the design space of possible architectures. Existing optimization algorithms do not scale well to such large design spaces and often require more time to reach a good solution. In this work, we highlight the efficacy of machine learning algorithms for efficiently designing a suitable heterogeneous manycore architecture. For large design space exploration problems, the proposed machine learning algorithm can find good solutions in significantly less amount of time than exiting state-of-the-art counterparts.On overall, this work focuses on the design challenges of high-performance and energy efficient architectures for big-data applications, and proposes machine learning algorithms capable of addressing these challenges.

Hardware Accelerators for Machine Learning: From 3D Manycore to Processing-in-Memory Architectures

Hardware Accelerators for Machine Learning: From 3D Manycore to Processing-in-Memory Architectures
Author: Aqeeb Iqbal Arka
Publisher:
Total Pages: 0
Release: 2022
Genre: Machine learning
ISBN:

Big data applications such as - deep learning and graph analytics require hardware platforms that are energy-efficient yet computationally powerful. 3D manycore architectures are the key to efficiently executing such compute- and data-intensive applications. Through silicon via (TSV)-based 3D manycore system is a promising solution in this direction as it enables integration of disparate heterogeneous computing cores on a single system. Recent industry trends show the viability of 3D integration in real products (e.g., Intel Lakefield SoC Architecture, the AMD Radeon R9 Fury X graphics card, and Xilinx Virtex-7 2000T/H580T, etc.). However, the achievable performance of conventional through-silicon-via (TSV)-based 3D systems is ultimately bottlenecked by the horizontal wires (wires in each planar die). Moreover, current TSV 3D architectures suffer from thermal limitations. Hence, TSV-based architectures do not realize the full potential of 3D integration. Monolithic 3D (M3D) integration, a breakthrough technology to achieve "More Moore and More Than Moore," and opens up the possibility of designing cores and associated network routers using multiple layers by utilizing monolithic inter-tier vias (MIVs) and hence, reducing the effective wire length. Compared to TSV-based 3D ICs, M3D offers the "true" benefits of vertical dimension for system integration: the size of a MIV used in M3D is over 100x smaller than a TSV. However, designing these new architectures often involves optimizingmultiple conflicting objectives (e.g., performance, thermal, etc.) due to thepresence of a mix of computing elements and communication methodologies; each with a different requirement for high performance. To overcome the difficult optimization challenges due to the large design space and complex interactions among the heterogeneous components (CPU, GPU, Last Level Cache, etc.) in an M3D-based manycore chip, Machine Learning algorithms can be explored as a promising solution to this problem and. The first part of this dissertation focuses on the design of high-performance and energy-efficient architectures for big-data applications, enabled by M3D vertical integration and data-driven machine learning algorithms. As an example, we consider heterogeneous manycore architectures with CPUs, GPUs, and Cache as the choice of hardware platform in this part of the work. The disparate nature of these processing elements introduces conflicting design requirements that need to be satisfied simultaneously. Moreover, the on-chip traffic pattern exhibited by different big-data applications (like many-to-few-to-many in CPU/GPU-based manycore architectures) need to be incorporated in the design process for optimal power-performance trade-off. In this dissertation, we first design a M3D-enabled heterogeneous manycore architecture and we demonstrate the efficacy of machine learning algorithms for efficiently exploring a large design space. For large design space exploration problems, the proposed machine learning algorithm can find good solutions in significantly less amount of time than exiting state-of-the-art counterparts. However, the M3D-enabled heterogeneous manycore architecture is still limited by the inherent memory bandwidth bottlenecks of traditional von-Neumann architectures. As a result, later in this dissertation, we focus on Processing-in-Memory (PIM) architectures tailor-made to accelerate deep learning applications such as Graph Neural Networks (GNNs) as such architectures can achieve massive data parallelism and do not suffer from memory bandwidth-related issues. We choose GNNs as an example workload as GNNs are more complex compared to traditional deep learning applications as they simultaneously exhibit attributes of both deep learning and graph computations. Hence, it is both compute- and data-intensive in nature. The high amount of data movement required by GNN computation poses a challenge to conventional von-Neuman architectures (such as CPUs, GPUs, and heterogeneous system-on-chips (SoCs)) as they have limited memory bandwidth. Hence, we propose the use of PIM-based non-volatile memory such as Resistive Random Access Memory (ReRAM). We leverage the efficient matrix operations enabled by ReRAMs and design manycore architectures that can facilitate the unique computation and communication needs of large-scale GNN training. We then exploit various techniques such as regularization methods to further accelerate GNN training ReRAM-based manycore systems. Finally, we streamline the GNN training process by reducing the amount of redundant information in both the GNN model and the input graph.Overall, this work focuses on the design challenges of high-performance and energy-efficient manycore architectures for machine learning applications. We propose novel architectures that use M3D or ReRAM-based PIM architectures to accelerate such applications. Moreover, we focus on hardware/software co-design to ensure the best possible performance.

Towards Energy Efficient and Reliable 3D Manycore Chip Enabled by Machine Learning

Towards Energy Efficient and Reliable 3D Manycore Chip Enabled by Machine Learning
Author: Sourav Das
Publisher:
Total Pages: 200
Release: 2018
Genre:
ISBN:

Finally, we summarize our contributions and outline some promising directions for future work based on the findings of this work. Future work includes incorporating machine learning approaches for on-chip security analysis and development of online mitigation techniques against external attacks.

Heterogeneous Multicore Processor Technologies for Embedded Systems

Heterogeneous Multicore Processor Technologies for Embedded Systems
Author: Kunio Uchiyama
Publisher: Springer Science & Business Media
Total Pages: 234
Release: 2012-04-23
Genre: Technology & Engineering
ISBN: 1461402840

To satisfy the higher requirements of digitally converged embedded systems, this book describes heterogeneous multicore technology that uses various kinds of low-power embedded processor cores on a single chip. With this technology, heterogeneous parallelism can be implemented on an SoC, and greater flexibility and superior performance per watt can then be achieved. This book defines the heterogeneous multicore architecture and explains in detail several embedded processor cores including CPU cores and special-purpose processor cores that achieve highly arithmetic-level parallelism. The authors developed three multicore chips (called RP-1, RP-2, and RP-X) according to the defined architecture with the introduced processor cores. The chip implementations, software environments, and applications running on the chips are also explained in the book. Provides readers an overview and practical discussion of heterogeneous multicore technologies from both a hardware and software point of view; Discusses a new, high-performance and energy efficient approach to designing SoCs for digitally converged, embedded systems; Covers hardware issues such as architecture and chip implementation, as well as software issues such as compilers, operating systems, and application programs; Describes three chips developed according to the defined heterogeneous multicore architecture, including chip implementations, software environments, and working applications.

Embedded Machine Learning for Cyber-Physical, IoT, and Edge Computing

Embedded Machine Learning for Cyber-Physical, IoT, and Edge Computing
Author: Sudeep Pasricha
Publisher: Springer Nature
Total Pages: 481
Release: 2023-10-09
Genre: Technology & Engineering
ISBN: 3031399323

This book presents recent advances towards the goal of enabling efficient implementation of machine learning models on resource-constrained systems, covering different application domains. The focus is on presenting interesting and new use cases of applying machine learning to innovative application domains, exploring the efficient hardware design of efficient machine learning accelerators, memory optimization techniques, illustrating model compression and neural architecture search techniques for energy-efficient and fast execution on resource-constrained hardware platforms, and understanding hardware-software codesign techniques for achieving even greater energy, reliability, and performance benefits. Discusses efficient implementation of machine learning in embedded, CPS, IoT, and edge computing; Offers comprehensive coverage of hardware design, software design, and hardware/software co-design and co-optimization; Describes real applications to demonstrate how embedded, CPS, IoT, and edge applications benefit from machine learning.

Designing Heterogeneous Many-core Processors to Provide High Performance Under Limited Chip Power Budget

Designing Heterogeneous Many-core Processors to Provide High Performance Under Limited Chip Power Budget
Author: Dong Hyuk Woo
Publisher:
Total Pages:
Release: 2010
Genre: Heterogeneous computing
ISBN:

This thesis describes the efficient design of a future many-core processor that can provide higher performance under the limited chip power budget. To achieve such a goal, this thesis first develops an analytical framework within which computer architects can estimate achievable performance improvement of different many-core architectures given the same power budget. From this study, this thesis found that a future many-core processor needs (1) energy-efficient parallel cores and (2) a high-performance sequential core. Based on these observations, this thesis proposes an energy-efficient broad-purpose acceleration layer that can be snapped on top of a conventional general-purpose processor. In addition to such an energy-efficient parallel cores, this thesis also proposes different architectural techniques to further boost the performance of sequential computation while those parallel cores are idle. In particular, this thesis develops low-cost architectural techniques to enhance the memory performance of a host core by utilizing those idle parallel cores. This idea is evaluated in two different system architectures: one with the aforementioned acceleration layer and the other with an emerging integrated CPU and GPU chip.

An Open-Source Research Platform for Heterogeneous Systems on Chip

An Open-Source Research Platform for Heterogeneous Systems on Chip
Author: Andreas Dominic Kurth
Publisher: BoD – Books on Demand
Total Pages: 282
Release: 2022-10-05
Genre: Science
ISBN: 3866287747

Heterogeneous systems on chip (HeSoCs) combine general-purpose, feature-rich multi-core host processors with domain-specific programmable many-core accelerators (PMCAs) to unite versatility with energy efficiency and peak performance. By virtue of their heterogeneity, HeSoCs hold the promise of increasing performance and energy efficiency compared to homogeneous multiprocessors, because applications can be executed on hardware that is designed for them. However, this heterogeneity also increases system complexity substantially. This thesis presents the first research platform for HeSoCs where all components, from accelerator cores to application programming interface, are available under permissive open-source licenses. We begin by identifying the hardware and software components that are required in HeSoCs and by designing a representative hardware and software architecture. We then design, implement, and evaluate four critical HeSoC components that have not been discussed in research at the level required for an open-source implementation: First, we present a modular, topology-agnostic, high-performance on-chip communication platform, which adheres to a state-of-the-art industry-standard protocol. We show that the platform can be used to build high-bandwidth (e.g., 2.5 GHz and 1024 bit data width) end-to-end communication fabrics with high degrees of concurrency (e.g., up to 256 independent concurrent transactions). Second, we present a modular and efficient solution for implementing atomic memory operations in highly-scalable many-core processors, which demonstrates near-optimal linear throughput scaling for various synthetic and real-world workloads and requires only 0.5 kGE per core. Third, we present a hardware-software solution for shared virtual memory that avoids the majority of translation lookaside buffer misses with prefetching, supports parallel burst transfers without additional buffers, and can be scaled with the workload and number of parallel processors. Our work improves accelerator performance for memory-intensive kernels by up to 4×. Fourth, we present a software toolchain for mixed-data-model heterogeneous compilation and OpenMP offloading. Our work enables transparent memory sharing between a 64-bit host processor and a 32-bit accelerator at overheads below 0.7 % compared to 32-bit-only execution. Finally, we combine our contributions to a research platform for state-of-the-art HeSoCs and demonstrate its performance and flexibility.

Energy-efficient Computing with Fine-grained Many-core Systems

Energy-efficient Computing with Fine-grained Many-core Systems
Author: Bin Liu
Publisher:
Total Pages:
Release: 2016
Genre:
ISBN: 9781369615579

For the past half century, Moore's Law has been the fundamental driver of high-performance computing. The continued CMOS technology scaling doubles the transistor density of VLSI systems and had provided a predictable 40% performance improvement of single-core processors for every 18 to 24 months. However, as Dennard Scaling ends, the era of scaling frequency and performance without increasing power density is over. Since 2005, the semiconductor industry shifted to multi-core and many-core processors in order to sustain the proportional scaling of performance along with transistor count increases. One of the critical challenges for many-core system design is to reduce the power dissipation and improve the energy efficiency of the chip. Researchers are eager to seek innovative low power architectures and techniques to relieve the ``dark silicon" problem and effectively convert transistors to performance. To demonstrate that many-core processors with network-on-chip interconnects is a promising architecture for high-performance energy-efficient computing, 16 Advanced Encryption Standard (AES) engines are proposed on a fine-grained many-core system by exploring different granularities of data-level and task-level parallelism. The smallest design utilizes only six cores for offline key expansion and eight cores for online key expansion, while the largest requires 107 cores and 137 cores, respectively. In comparison with published AES cipher implementations on general purpose processors, the designs have has 3.5--15.6 times higher throughput per unit of chip area and 8.2--18.1 times higher energy efficiency. Moreover, the design shows 2.0 times higher throughput than the TI DSP C6201, and 3.3 times higher throughput per unit of chip area and 2.9 times higher energy efficiency than the GeForce 8800 GTX. Next, a scalable joint local and global dynamic voltage and frequency scaling (DVFS) scheme is proposed to further improve the energy efficiency for many-core systems by monitoring on-line workload variations. The local algorithms selects the voltage and frequency pair for each individual core based on its FIFO occupancy and stall information, while the global algorithm tunes the global voltage supplies based on the workload of all active processors. To demonstrate the effectiveness of the proposed solution, a suite of benchmarks are tested on a many-core globally asynchronous locally synchronous (GALS) platform. The experiment results show that the proposed approach can achieve near-optimal power saving under performance constraints. Different local algorithms are compared in terms of power saving, voltage switching frequency and response delay to workload variation. The impact of the number of voltage supplies and global voltage tuning resolution on the global algorithm is also investigated. To further improve the energy efficiency beyond traditional DVFS, core scaling is proposed by introducing an extra dimension beyond supply voltage and clock frequency scaling. This dissertation addresses the problem of minimizing the power dissipation of many-core systems under performance constraints by choosing an appropriate number of active cores and per-core voltage/frequency levels. A genetic algorithm based solution is proposed to solve the problem. Experiments with real applications show that (1) dynamically scaling the number of active cores can improve the energy efficiency by 5% to 42% compared with per-core DVFS for different performance requirements; (2) core scaling favors systems with more global voltage supplies and high-performance leaky process when the performance requirement is loose, while it favors systems with fewer global voltage supplies and low-power less-leaky process when the performance requirement is tight; (3) increasing the number of global voltage supplies or leakage ratio can reduce the optimal core count by 22% and 50%, respectively.