Similarity search, or nearest neighbor search, is the task of retrieving the set of vectors in a (vector) database that are most similar to a provided query vector. It has long been a key kernel for many applications, and it is becoming especially important now that modern neural networks and machine learning models represent the semantics of images, videos, and documents as high-dimensional vectors called embeddings. Finding the set of embeddings most similar to a provided query embedding is now a critical operation for modern recommender systems and semantic search engines. Since exhaustively searching for the most similar vectors among billions is prohibitively expensive, approximate nearest neighbor search (ANNS) is often utilized in real-world use cases. Unfortunately, we find that utilizing server-class CPUs and GPUs for the ANNS task leads to suboptimal performance and energy efficiency. To address these limitations, we propose a specialized architecture named ANNA (Approximate Nearest Neighbor search Accelerator), which is compatible with state-of-the-art ANNS algorithms such as Google ScaNN and Facebook Faiss. By combining the benefits of a specialized dataflow pipeline and efficient data reuse, ANNA achieves multiple orders of magnitude higher energy efficiency, 2.3-61.6× higher throughput, and 24.0-620.8× lower latency than a conventional CPU or GPU for both million- and billion-scale datasets.
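As a point of reference, the sketch below shows how such an approximate search might look in software using Faiss (one of the libraries named above). The dataset, dimensionality, and index parameters are illustrative choices, not the configuration evaluated in the paper.

```python
# Minimal ANNS sketch with Faiss (illustrative parameters, not the paper's setup).
import numpy as np
import faiss  # pip install faiss-cpu

d = 128                                             # embedding dimensionality
xb = np.random.rand(100_000, d).astype("float32")   # database vectors
xq = np.random.rand(10, d).astype("float32")        # query vectors

# IVF-PQ: a coarse quantizer partitions the space; product quantization
# compresses the vectors so billions of entries can fit in memory.
nlist, m = 1024, 16                  # number of partitions, PQ sub-quantizers
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)  # 8 bits per sub-code
index.train(xb)
index.add(xb)

index.nprobe = 32                    # partitions scanned per query (recall/speed knob)
distances, ids = index.search(xq, 10)  # top-10 approximate neighbors per query
```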
2021
KCC
Large-scale Data Parallel Processing on Many-core Systems
Yejin Lee,
Seung-Jun Cha,
Dongwoo Kim
In Communications of the Korean Institute of Information Scientists and Engineers, Korea Information Science Society
2021
This article presents Genesis (genome analysis), a framework to efficiently and flexibly accelerate the generic data manipulation operations that have become performance bottlenecks in the genomic data processing pipeline, utilizing FPGAs-as-a-service. Genesis conceptualizes genomic data as a very large relational database and uses extended SQL as a domain-specific language to construct data manipulation queries. To accelerate the queries, we designed a Genesis hardware library of efficient coarse-grained primitives that can be composed into a specialized dataflow architecture. This approach explores a systematic and scalable methodology to expedite domain-specific end-to-end accelerated system development and deployment.
ISCA
ELSA: Hardware-Software Co-design for Efficient, Lightweight Self-Attention Mechanism in Neural Networks
*Tae Jun Ham,
*Yejin Lee,
Seong Hoon Seo,
Soosung Kim,
Hyunji Choi,
Sung Jun Jung,
Jae W. Lee
* These authors contributed equally to this work.
In Proceedings of the ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA)
2021
The self-attention mechanism is rapidly emerging as one of the most important primitives in neural networks (NNs) for its ability to identify the relations within input entities. Self-attention-oriented NN models such as Google Transformer and its variants have established the state-of-the-art on a very wide range of natural language processing tasks, and many other self-attention-oriented models are achieving competitive results in computer vision and recommender systems as well. Unfortunately, despite its great benefits, the self-attention mechanism is an expensive operation whose cost increases quadratically with the number of input entities that it processes, and thus accounts for a significant portion of the inference runtime. Thus, this paper presents ELSA (Efficient, Lightweight Self-Attention), a hardware-software co-designed solution to substantially reduce the runtime as well as the energy spent on the self-attention mechanism. Specifically, based on the intuition that not all relations are equal, we devise a novel approximation scheme that significantly reduces the amount of computation by efficiently filtering out relations that are unlikely to affect the final output. With specialized hardware for this approximate self-attention mechanism, ELSA achieves a geomean speedup of 58.1× as well as over three orders of magnitude improvement in energy efficiency compared to a GPU on self-attention computation in modern NN models, while maintaining less than 1% loss in the accuracy metric.
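For intuition only, the sketch below mimics the filter-then-attend idea in plain NumPy: a cheap score screens out unlikely relations before exact attention runs on the survivors. The thresholding rule and the use of an exact dot product as the "cheap" estimate are stand-ins, not ELSA's actual approximation scheme or hardware.

```python
# Sketch of approximation-then-filter self-attention (illustrative only).
import numpy as np

def filtered_attention(q, K, V, threshold):
    """q: (d,), K/V: (n, d). Attend only to keys whose estimated score passes threshold."""
    approx = K @ q                        # stand-in for a cheap relevance estimate
    keep = approx >= threshold            # filter out unpromising relations
    if not keep.any():                    # degenerate case: fall back to full attention
        keep[:] = True
    scores = (K[keep] @ q) / np.sqrt(q.shape[0])
    weights = np.exp(scores - scores.max())   # numerically stable softmax
    weights /= weights.sum()
    return weights @ V[keep]              # attention output over surviving keys only

rng = np.random.default_rng(0)
n, d = 64, 32
K, V = rng.normal(size=(n, d)), rng.normal(size=(n, d))
out = filtered_attention(rng.normal(size=d), K, V, threshold=0.0)
```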
ASPLOS
MERCI: Efficient Embedding Reduction on Commodity Hardware via Sub-Query Memoization
Yejin Lee,
Seong Hoon Seo,
Hyunji Choi,
Hyoung Uk Sul,
Soosung Kim,
Jae W. Lee,
Tae Jun Ham
In Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)
2021
Deep neural networks (DNNs) with embedding layers are widely adopted to capture complex relationships among entities within a dataset. Embedding layers aggregate multiple embeddings (dense vectors used to represent the complicated nature of a data feature) into a single embedding; this operation is called embedding reduction. Embedding reduction spends a significant portion of its runtime on reading embeddings from memory and thus is known to be heavily memory-bandwidth-bound. Recent works attempt to accelerate this critical operation, but they often require either hardware modifications or emerging memory technologies, which makes them hard to deploy on commodity hardware. Thus, we propose MERCI, Memoization for Embedding Reduction with ClusterIng, a novel memoization framework for efficient embedding reduction. MERCI provides a mechanism for memoizing partial aggregations of correlated embeddings and retrieving the memoized partial results at a low cost. MERCI substantially reduces the number of memory accesses by 44% (29%), leading to a 102% (74%) throughput improvement on real machines and 40.2% (28.6%) energy savings at the expense of 8× (1×) additional memory usage.
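As a rough software illustration of sub-query memoization (the cluster contents and sizes below are hypothetical, and MERCI's actual clustering algorithm differs), precomputed partial sums of correlated embeddings can replace multiple table reads with one:

```python
# Sketch of memoized embedding reduction (illustrative, not MERCI's algorithm).
import numpy as np

rng = np.random.default_rng(0)
table = rng.normal(size=(1000, 16))     # embedding table: 1000 vectors, dim 16

# Assume offline analysis found these correlated ID groups (hypothetical).
clusters = {frozenset({3, 7, 42}): None, frozenset({5, 9}): None}
for ids in clusters:
    clusters[ids] = table[list(ids)].sum(axis=0)   # memoized partial sums

def reduce_embeddings(query_ids):
    """Sum the embeddings for query_ids, reusing memoized partials when possible."""
    remaining = set(query_ids)
    total = np.zeros(table.shape[1])
    for ids, partial in clusters.items():
        if ids <= remaining:             # the whole cluster appears in this query
            total += partial             # one memory read instead of len(ids) reads
            remaining -= ids
    if remaining:                        # fetch the uncovered embeddings normally
        total += table[list(remaining)].sum(axis=0)
    return total

out = reduce_embeddings([3, 7, 42, 100, 5, 9])
```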
2020
ISCA
Genesis: A Hardware Acceleration Framework for Genomic Data Analysis
Tae Jun Ham,
David Bruns-Smith,
Brendan Sweeney,
Yejin Lee,
Seong Hoon Seo,
U Gyeong Song,
Young H. Oh,
Krste Asanovic,
Jae W. Lee,
Lisa Wu Wills
In Proceedings of the ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA)
2020
Selected for inclusion in IEEE Micro - Special Issue on Top Picks from the 2020 Computer Architecture Conferences
In this paper, we describe our vision to accelerate algorithms in the domain of genomic data analysis by proposing a framework called Genesis (genome analysis) that contains an interface and an implementation of a system that processes genomic data efficiently. This framework can be deployed in the cloud and exploit the FPGAs-as-a-service paradigm to provide cost-efficient secondary DNA analysis. We propose conceptualizing genomic reads and associated read attributes as a very large relational database and using extended SQL as a domain-specific language to construct queries that form various data manipulation operations. To accelerate such queries, we design a Genesis hardware library which consists of primitive hardware modules that can be composed to construct a dataflow architecture specialized for those queries. As a proof of concept for the Genesis framework, we present the architecture and the hardware implementation of several genomic analysis stages in the secondary analysis pipeline corresponding to the best-known software analysis toolkit, the GATK4 workflow proposed by the Broad Institute. We walk through the construction of genomic data analysis operations using a sequence of SQL-style queries and show how Genesis hardware library modules can be utilized to construct the hardware pipelines designed to accelerate such queries. We exploit parallelism and data reuse by utilizing a dataflow architecture along with on-chip scratchpads and non-blocking APIs to manage the accelerators, allowing concurrent execution of the accelerator and the host. Our accelerated system deployed on a cloud FPGA performs up to 19.3× better than GATK4 running on a commodity multi-core Xeon server and achieves up to 15× cost savings. We believe that if a software algorithm can be mapped onto a hardware library to utilize the underlying accelerator(s) using an already-standardized software interface such as SQL, while allowing the efficient mapping of such an interface to primitive hardware modules as we have demonstrated here, it will expedite the acceleration of domain-specific algorithms and allow easy adaptation to algorithm changes.
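To give a flavor of the reads-as-a-relational-table view, the toy sketch below uses standard SQL via Python's sqlite3. Genesis defines its own extended SQL and executes queries on FPGA dataflow pipelines; the schema and query here are invented for illustration.

```python
# Toy illustration of the "reads as a relational table" idea via sqlite3.
# The schema and query are hypothetical, not Genesis's extended SQL.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE reads (
    name TEXT, chrom TEXT, pos INTEGER, flags INTEGER, seq TEXT)""")
conn.executemany("INSERT INTO reads VALUES (?,?,?,?,?)", [
    ("r1", "chr1", 100, 0, "ACGT"),
    ("r2", "chr1", 100, 0, "ACGT"),   # same start position: duplicate candidate
    ("r3", "chr1", 250, 0, "TTGA"),
])

# A duplicate-marking-style stage expressed as a query: group reads by
# alignment position and report positions shared by multiple reads
# (cf. the MarkDuplicates stage in the GATK4 workflow).
for row in conn.execute("""
    SELECT chrom, pos, COUNT(*) AS n
    FROM reads GROUP BY chrom, pos HAVING COUNT(*) > 1
    ORDER BY chrom, pos"""):
    print(row)   # -> ('chr1', 100, 2)
```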
ASPLOS
IIU: Specialized Architecture for Inverted Index Search
Jun Heo,
Jaeyeon Won,
Yejin Lee,
Shivam Bharuka,
Jaeyoung Jang,
Tae Jun Ham,
Jae W. Lee
In Proceedings of the 25th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)
2020
The inverted index serves as a fundamental data structure for efficient search across various applications such as full-text search engines, document analytics, and other information retrieval systems. The storage requirements and query loads for these structures have been growing at a rapid rate. Thus, an ideal indexing system should maintain a small index size with a low query processing time. Previous works have mainly focused on using CPUs and GPUs to exploit query parallelism while utilizing state-of-the-art compression schemes to fit the index in memory. However, scaling parallelism to maximally utilize memory bandwidth on these architectures remains challenging. In this work, we present IIU, a novel inverted index processing unit, to optimize query performance while maintaining a low memory overhead for index storage. To this end, we co-design the indexing scheme and hardware accelerator so that the accelerator can process highly compressed inverted indexes at a high throughput. In addition, IIU provides flexible interconnects between modules to take advantage of both intra- and inter-query parallelism. Our evaluation using a cycle-level simulator demonstrates that IIU provides an average of 13.8× query latency reduction and 5.4× throughput improvement across different query types, while reducing the average energy consumption by 18.6×, compared to Apache Lucene, a production-grade full-text search framework.
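For background, the sketch below illustrates the basic software form of what IIU accelerates: delta-encoded posting lists and a conjunctive (AND) query that intersects them. The encoding and index contents are illustrative; IIU's compression scheme and hardware pipeline differ.

```python
# Sketch of inverted index query processing with delta-encoded posting lists.
from typing import Dict, List

def delta_encode(doc_ids: List[int]) -> List[int]:
    """Store gaps between sorted doc IDs; small gaps compress well."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def delta_decode(gaps: List[int]) -> List[int]:
    out, cur = [], 0
    for g in gaps:
        cur += g
        out.append(cur)
    return out

# Hypothetical index: term -> compressed posting list of matching doc IDs.
index: Dict[str, List[int]] = {
    "inverted": delta_encode([1, 4, 9, 30]),
    "index": delta_encode([4, 9, 17, 30, 55]),
}

def conjunctive_query(terms: List[str]) -> List[int]:
    """AND query: intersect the posting lists of all terms."""
    postings = [set(delta_decode(index[t])) for t in terms]
    return sorted(set.intersection(*postings))

print(conjunctive_query(["inverted", "index"]))  # -> [4, 9, 30]
```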
2019
MICRO
Charon: Specialized Near-Memory Processing Architecture for Clearing Dead Objects in Memory
Jaeyoung Jang,
Jun Heo,
Yejin Lee,
Jaeyeon Won,
Seonghak Kim,
Sung Jun Jung,
Hakbeom Jang,
Tae Jun Ham,
Jae W. Lee
In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)
2019
Garbage collection (GC) is a standard feature for high-productivity programming, saving a programmer from many nasty memory-related bugs. However, these productivity benefits come with a cost in terms of application throughput, worst-case latency, and energy consumption. Since the first introduction of GC by the Lisp programming language in the 1950s, a myriad of hardware and software techniques have been proposed to reduce this cost. While the idea of accelerating GC in hardware is appealing, its impact has been very limited due to narrow coverage, lack of flexibility, intrusive system changes, and significant hardware cost. Even with specialized hardware, GC performance is eventually limited by the memory bandwidth bottleneck. Fortunately, emerging 3D stacked DRAM technologies shed new light on this decades-old problem by enabling efficient near-memory processing with ample memory bandwidth. Thus, we propose Charon, the first 3D stacked memory-based GC accelerator. Through a detailed performance analysis of the HotSpot JVM, we derive a set of key algorithmic primitives based on their GC time coverage and implementation complexity in hardware. Then we devise a specialized processing unit to substantially improve their memory-level parallelism and throughput at a low hardware cost. Our evaluation of Charon with the full-production HotSpot JVM running two big data analytics frameworks, Spark and GraphChi, demonstrates a 3.29× geomean speedup and 60.7% energy savings for GC over the baseline 8-core out-of-order processor.
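For intuition about the kind of memory-bound work involved, the sketch below shows a software mark phase of a tracing GC: a pointer-chasing traversal whose successors are discovered only after each load, which is what makes near-memory processing with ample bandwidth attractive. This is a generic illustration, not one of Charon's derived primitives.

```python
# Generic mark phase of a tracing GC (illustrative; Charon implements its
# primitives in hardware near 3D-stacked DRAM, not in software).
class Obj:
    def __init__(self, *refs):
        self.refs = list(refs)   # outgoing references to other objects
        self.marked = False

def mark(roots):
    """Traverse the object graph from the roots; unmarked objects are dead."""
    stack = list(roots)
    while stack:
        obj = stack.pop()
        if obj.marked:
            continue
        obj.marked = True        # each visit is a dependent memory access
        stack.extend(obj.refs)   # successors are known only after the load

c = Obj()
b = Obj(c)
a = Obj(b)
dead = Obj()                     # unreachable from the roots
mark([a])
print(a.marked, c.marked, dead.marked)  # -> True True False
```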
ITC-CSCC
Performance Analysis of Convolutional Neural Networks on Manycore Platforms
Jaeyoung Jang,
Yejin Lee,
Jae W. Lee
In The 34th International Technical Conference on Circuits/Systems, Computers and Communications (ITC-CSCC)
2019