Breakthrough streaming applications such as virtual reality, augmented reality, autonomous vehicles, and multimedia demand high-performance and power-efficient computing. In response to this ever-increasing demand, manufacturers look beyond the parallelism available in Chip Multi-Processors (CMPs) and toward application-specific designs. In this regard, ACCelerator (ACC)-based heterogeneous CMPs (ACMPs) have emerged as a promising platform. An ACMP combines application-specific hardware ACCs with General-Purpose Processor(s) (GPPs) on a single chip. The ACCs are customized to provide high-performance and power-efficient computing for specific compute-intensive functions, while the GPP(s) run the remaining functions and control the whole system. In ACMP platforms, ACCs achieve their performance and power benefits at the expense of reduced flexibility and generality for running different workloads. Therefore, manufacturers must integrate several ACCs to target a diverse set of workloads within a given application domain.
However, our observations show that conventional ACMP architectures with many ACCs have scalability limitations. The ACCs' benefits in processing power can be overshadowed by bottlenecks on the shared resources: the processor core(s), the communication fabric/DMA, and the on-chip memory. These resource bottlenecks stem primarily from the ACCs' data access and orchestration load. Because the semantics for communicating with ACCs are only loosely defined, and because conventional ACMPs rely on general platform architectures, the bottlenecks hamper performance.
This dissertation explores and alleviates the scalability limitations of ACMPs. To this end, the dissertation first proposes an analytical model to holistically explore how bottlenecks emerge on the shared resources as the number of ACCs increases. Afterward, it proposes ACMPerf, an analytical model that captures the impact of the resource bottlenecks on the achievable ACC benefits.
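As a rough illustration of how a shared resource can cap the benefit of adding ACCs, the following is a minimal sketch of a saturation bound. It is not ACMPerf; the function name, its parameters, and the linear load assumption are all hypothetical and chosen only to show the qualitative effect described above.

```python
def effective_speedup(num_accs, acc_speedup, shared_capacity, load_per_acc):
    """Toy bound on aggregate speedup (hypothetical model, not ACMPerf).

    Assumptions, for illustration only:
      - each ACC contributes `acc_speedup` when it is not stalled;
      - each ACC places `load_per_acc` units of load on a shared resource
        (processor, fabric/DMA, or on-chip memory);
      - the shared resource can serve at most `shared_capacity` units.
    """
    ideal = num_accs * acc_speedup
    demand = num_accs * load_per_acc
    if demand <= shared_capacity:
        # The shared resource keeps up, so the ACCs deliver their full benefit.
        return ideal
    # The shared resource is saturated: scale the benefit by the served fraction.
    return ideal * shared_capacity / demand


if __name__ == "__main__":
    for n in (2, 4, 8, 16, 32):
        print(n, effective_speedup(n, acc_speedup=10.0,
                                   shared_capacity=8.0, load_per_acc=1.0))
```

Under these assumed values, the aggregate speedup grows with the number of ACCs only until the shared resource saturates, after which additional ACCs yield no further gain.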
Then, to open a path toward more scalable integration of ACCs, the dissertation identifies and formalizes ACC communication semantics. The semantics describe four primary aspects: data access, synchronization, data granularity, and data marshalling. Building on the identified ACC communication semantics, and improving upon conventional ACMP architectures, the dissertation proposes a novel architecture of Transparent Self-Synchronizing ACCs (TSS). TSS efficiently realizes the identified communication semantics for the direct ACC-to-ACC connections that often occur in streaming applications. TSS adds autonomy to ACCs so that they locally handle the semantic aspects of data granularity, data marshalling, and synchronization. It also exploits a local interconnect among ACCs to address the semantic aspect of data access. Because TSS gives ACCs the autonomy to self-synchronize and self-orchestrate one another independently of the processor, it enables the finest data granularity and thereby reduces the pressure on the shared memory. The local, reconfigurable interconnect carries data directly among ACCs without occupying the DMA or the communication fabric. By reducing the overhead of direct ACC-to-ACC connections, TSS delivers more of the ACCs' benefits than conventional ACMP architectures do: up to 130x higher throughput and 209x lower energy, as a result of up to a 78x reduction in the load imposed on the shared resources.
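The source of the load reduction can be sketched with a simple count of shared-resource operations for a chain of ACCs. This is an illustrative sketch under stated assumptions, not the dissertation's evaluation: the function names, the assumption of one read and one write per stage per data block, and the assumption of one processor synchronization per stage per block are all hypothetical, and the resulting ratios are not meant to reproduce the reported figures.

```python
def shared_transfers(num_stages, num_blocks, direct):
    """Count block transfers that cross the shared fabric/DMA and memory.

    Assumed (hypothetical) cost model:
      - conventional: every ACC stage reads its input block from shared
        memory and writes its output block back (2 transfers per stage);
      - direct ACC-to-ACC: only the first read and the last write use the
        shared resources; intermediate blocks travel over a local interconnect.
    """
    if num_stages < 1 or num_blocks < 0:
        raise ValueError("need at least one stage and a non-negative block count")
    if direct:
        return 2 * num_blocks
    return 2 * num_stages * num_blocks


def processor_sync_events(num_stages, num_blocks, direct):
    """Count synchronization events handled by the processor.

    Assumed: the processor orchestrates every stage for every block in the
    conventional case, while self-synchronizing ACCs need none in steady state.
    """
    if num_stages < 1 or num_blocks < 0:
        raise ValueError("need at least one stage and a non-negative block count")
    return 0 if direct else num_stages * num_blocks


if __name__ == "__main__":
    stages, blocks = 4, 100
    for direct in (False, True):
        label = "direct ACC-to-ACC" if direct else "conventional"
        print(label,
              shared_transfers(stages, blocks, direct),
              processor_sync_events(stages, blocks, direct))
```

In this toy model the shared-resource traffic of the direct configuration is independent of the chain length, which is the qualitative behavior the architecture aims for.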
Electrical and Computer Engineering, Northeastern University
Novel Architecture for Streaming Applications