Deep Learning Workload Performance Auto-Optimizer

EasyChair Preprint 2817
4 pages • Date: February 29, 2020

Abstract

The industry has seen a wave of new domain-specific accelerators purpose-built for deep learning workloads. To obtain real-world performance close to the theoretical peak of these accelerators, the tensor layout and workload distribution must be optimized jointly with the accelerator instruction set, communication fabric, and memory architecture. In this paper, we introduce a general methodology for automating hardware architecture and software co-optimization for domain-specific accelerators. Applying this methodology to the Intel® Nervana™ Neural Network Processor for Training (Intel® Nervana™ NNP-T), we achieved state-of-the-art (SOTA) deep-learning microbenchmark performance on convolution benchmarks. A generic convolution context distribution algorithm, developed from the auto-optimizer results for ResNet50, is also discussed in this paper.

Keyphrases: domain-specific accelerator, deep learning, hardware-software co-optimization, locality, parallelism
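The abstract does not describe the optimizer's internals, but the general methodology it names, searching over candidate tensor layouts and workload distributions and scoring each against a model of the accelerator, can be illustrated with a minimal sketch. Everything below is a hypothetical illustration, not the paper's implementation: the candidate layout and distribution names, the `estimate_cycles` stub cost model, and the exhaustive search structure are all assumptions made for exposition.

```python
import itertools

# Hypothetical sketch of the co-optimization loop suggested by the abstract:
# enumerate (tensor layout, workload distribution) configurations and pick
# the one with the lowest estimated cost. The cost numbers are made up.

LAYOUTS = ["NCHW", "NHWC", "blocked_NCHWc16"]                       # candidate tensor layouts (illustrative)
DISTRIBUTIONS = ["split_batch", "split_channels", "split_spatial"]  # candidate workload splits (illustrative)


def estimate_cycles(layout: str, distribution: str, num_cores: int) -> float:
    """Stub cost model. A real model would account for the accelerator's
    instruction set, communication fabric, and memory architecture, as the
    abstract notes; these constants are placeholders only."""
    compute = {"NCHW": 1.00, "NHWC": 0.95, "blocked_NCHWc16": 0.80}[layout]
    comm = {"split_batch": 0.02, "split_channels": 0.05, "split_spatial": 0.08}[distribution]
    # Compute time shrinks with more cores; communication overhead grows.
    return compute / num_cores + comm * (num_cores - 1)


def auto_optimize(num_cores: int) -> tuple[str, str]:
    """Return the (layout, distribution) pair with the lowest estimated cost."""
    return min(
        itertools.product(LAYOUTS, DISTRIBUTIONS),
        key=lambda cfg: estimate_cycles(*cfg, num_cores),
    )


if __name__ == "__main__":
    print("best configuration:", auto_optimize(num_cores=4))
```

An exhaustive search like this is only tractable for small configuration spaces; a production auto-optimizer for a target like the NNP-T would presumably prune the space or use a smarter search, but the structure of the loop, a candidate space plus a hardware-aware cost model, stays the same.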