Similarity search, or nearest neighbor search, is the task of retrieving the set of vectors in a (vector) database that are most similar to a provided query vector. It has long been a key kernel for many applications, and it is becoming especially important now that modern neural networks and machine learning models represent the semantics of images, videos, and documents as high-dimensional vectors called embeddings. Finding the set of embeddings most similar to a provided query embedding is now a critical operation for modern recommender systems and semantic search engines. Since exhaustively searching for the most similar vectors among billions is prohibitively expensive, approximate nearest neighbor search (ANNS) is often utilized in real-world use cases. Unfortunately, we find that utilizing server-class CPUs and GPUs for the ANNS task leads to suboptimal performance and energy efficiency. To address these limitations, we propose a specialized architecture named ANNA (Approximate Nearest Neighbor search Accelerator), which is compatible with state-of-the-art ANNS algorithms such as Google ScaNN and Facebook Faiss. By combining the benefits of a specialized dataflow pipeline and efficient data reuse, ANNA achieves multiple orders of magnitude higher energy efficiency, 2.3-61.6× higher throughput, and 24.0-620.8× lower latency than a conventional CPU or GPU for both million- and billion-scale datasets.
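As a point of reference, the sketch below shows how such an approximate search might look in software using Faiss (one of the libraries named above). The dataset, dimensionality, and index parameters are illustrative choices, not the configuration evaluated in the paper.

```python
# Minimal ANNS sketch with Faiss (illustrative parameters, not the paper's setup).
import numpy as np
import faiss  # pip install faiss-cpu

d = 128                                             # embedding dimensionality
xb = np.random.rand(100_000, d).astype("float32")   # database vectors
xq = np.random.rand(10, d).astype("float32")        # query vectors

# IVF-PQ: a coarse quantizer partitions the space; product quantization
# compresses the vectors so billions of entries can fit in memory.
nlist, m = 1024, 16                  # number of partitions, PQ sub-quantizers
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)  # 8 bits per sub-code
index.train(xb)
index.add(xb)

index.nprobe = 32                    # partitions scanned per query (recall/speed knob)
distances, ids = index.search(xq, 10)  # top-10 approximate neighbors per query
```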
2021
KCC
Large-scale Data Parallel Processing on Many-core Systems
Yejin Lee,
Seung-Jun Cha,
Dongwoo Kim
In Communications of the Korean Institute of Information Scientists and Engineers, Korea Information Science Society
2021
This article presents Genesis (genome analysis), a framework to efficiently and flexibly accelerate the generic data manipulation operations that have become performance bottlenecks in the genomic data processing pipeline, utilizing FPGAs-as-a-service. Genesis conceptualizes genomic data as a very large relational database and uses extended SQL as a domain-specific language to construct data manipulation queries. To accelerate the queries, we designed a Genesis hardware library of efficient coarse-grained primitives that can be composed into a specialized dataflow architecture. This approach explores a systematic and scalable methodology to expedite domain-specific end-to-end accelerated system development and deployment.
ISCA
ELSA: Hardware-Software Co-design for Efficient, Lightweight Self-Attention Mechanism in Neural Networks
*Tae Jun Ham,
*Yejin Lee,
Seong Hoon Seo,
Soosung Kim,
Hyunji Choi,
Sung Jun Jung,
Jae W. Lee
* These authors contributed equally to this work.
In Proceedings of the ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA)
2021
The self-attention mechanism is rapidly emerging as one of the most important primitives in neural networks (NNs) for its ability to identify the relations within input entities. Self-attention-oriented NN models such as Google Transformer and its variants have established the state-of-the-art on a very wide range of natural language processing tasks, and many other self-attention-oriented models are achieving competitive results in computer vision and recommender systems as well. Unfortunately, despite its great benefits, the self-attention mechanism is an expensive operation whose cost increases quadratically with the number of input entities that it processes, and thus accounts for a significant portion of the inference runtime. Thus, this paper presents ELSA (Efficient, Lightweight Self-Attention), a hardware-software co-designed solution to substantially reduce the runtime as well as the energy spent on the self-attention mechanism. Specifically, based on the intuition that not all relations are equal, we devise a novel approximation scheme that significantly reduces the amount of computation by efficiently filtering out relations that are unlikely to affect the final output. With specialized hardware for this approximate self-attention mechanism, ELSA achieves a geomean speedup of 58.1× as well as over three orders of magnitude improvement in energy efficiency compared to a GPU on self-attention computation in modern NN models, while maintaining less than 1% loss in the accuracy metric.
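For intuition only, the sketch below mimics the filter-then-attend idea in plain NumPy: a cheap score screens out unlikely relations before exact attention runs on the survivors. The thresholding rule and the use of an exact dot product as the "cheap" estimate are stand-ins, not ELSA's actual approximation scheme or hardware.

```python
# Sketch of approximation-then-filter self-attention (illustrative only).
import numpy as np

def filtered_attention(q, K, V, threshold):
    """q: (d,), K/V: (n, d). Attend only to keys whose estimated score passes threshold."""
    approx = K @ q                        # stand-in for a cheap relevance estimate
    keep = approx >= threshold            # filter out unpromising relations
    if not keep.any():                    # degenerate case: fall back to full attention
        keep[:] = True
    scores = (K[keep] @ q) / np.sqrt(q.shape[0])
    weights = np.exp(scores - scores.max())   # numerically stable softmax
    weights /= weights.sum()
    return weights @ V[keep]              # attention output over surviving keys only

rng = np.random.default_rng(0)
n, d = 64, 32
K, V = rng.normal(size=(n, d)), rng.normal(size=(n, d))
out = filtered_attention(rng.normal(size=d), K, V, threshold=0.0)
```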
ASPLOS
MERCI: Efficient Embedding Reduction on Commodity Hardware via Sub-Query Memoization
Yejin Lee,
Seong Hoon Seo,
Hyunji Choi,
Hyoung Uk Sul,
Soosung Kim,
Jae W. Lee,
Tae Jun Ham
In Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)
2021
Deep neural networks (DNNs) with embedding layers are widely adopted to capture complex relationships among entities within a dataset. Embedding layers aggregate multiple embeddings (dense vectors used to represent the complicated nature of a data feature) into a single embedding; this operation is called embedding reduction. Embedding reduction spends a significant portion of its runtime on reading embeddings from memory and thus is known to be heavily memory-bandwidth-bound. Recent works attempt to accelerate this critical operation, but they often require either hardware modifications or emerging memory technologies, which makes them hard to deploy on commodity hardware. Thus, we propose MERCI, Memoization for Embedding Reduction with ClusterIng, a novel memoization framework for efficient embedding reduction. MERCI provides a mechanism for memoizing partial aggregations of correlated embeddings and retrieving the memoized partial results at a low cost. MERCI substantially reduces the number of memory accesses by 44% (29%), leading to a 102% (74%) throughput improvement on real machines and 40.2% (28.6%) energy savings at the expense of 8× (1×) additional memory usage.
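As a rough software illustration of sub-query memoization (the cluster contents and sizes below are hypothetical, and MERCI's actual clustering algorithm differs), precomputed partial sums of correlated embeddings can replace multiple table reads with one:

```python
# Sketch of memoized embedding reduction (illustrative, not MERCI's algorithm).
import numpy as np

rng = np.random.default_rng(0)
table = rng.normal(size=(1000, 16))     # embedding table: 1000 vectors, dim 16

# Assume offline analysis found these correlated ID groups (hypothetical).
clusters = {frozenset({3, 7, 42}): None, frozenset({5, 9}): None}
for ids in clusters:
    clusters[ids] = table[list(ids)].sum(axis=0)   # memoized partial sums

def reduce_embeddings(query_ids):
    """Sum the embeddings for query_ids, reusing memoized partials when possible."""
    remaining = set(query_ids)
    total = np.zeros(table.shape[1])
    for ids, partial in clusters.items():
        if ids <= remaining:             # the whole cluster appears in this query
            total += partial             # one memory read instead of len(ids) reads
            remaining -= ids
    if remaining:                        # fetch the uncovered embeddings normally
        total += table[list(remaining)].sum(axis=0)
    return total

out = reduce_embeddings([3, 7, 42, 100, 5, 9])
```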
2020
ISCA
Genesis: A Hardware Acceleration Framework for Genomic Data Analysis
Tae Jun Ham,
David Bruns-Smith,
Brendan Sweeney,
Yejin Lee,
Seong Hoon Seo,
U Gyeong Song,
Young H. Oh,
Krste Asanovic,
Jae W. Lee,
Lisa Wu Wills
In Proceedings of the ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA)
2020
Selected for inclusion in IEEE Micro - Special Issue on Top Picks from the 2020 Computer Architecture Conferences
In this paper, we describe our vision to accelerate algorithms in the domain of genomic data analysis by proposing a framework called Genesis (genome analysis) that contains an interface and an implementation of a system that processes genomic data efficiently. This framework can be deployed in the cloud and exploit the FPGAs-as-a-service paradigm to provide cost-efficient secondary DNA analysis. We propose conceptualizing genomic reads and associated read attributes as a very large relational database and using extended SQL as a domain-specific language to construct queries that form various data manipulation operations. To accelerate such queries, we design a Genesis hardware library which consists of primitive hardware modules that can be composed to construct a dataflow architecture specialized for those queries. As a proof of concept for the Genesis framework, we present the architecture and the hardware implementation of several genomic analysis stages in the secondary analysis pipeline corresponding to the best-known software analysis toolkit, the GATK4 workflow proposed by the Broad Institute. We walk through the construction of genomic data analysis operations using a sequence of SQL-style queries and show how Genesis hardware library modules can be utilized to construct the hardware pipelines designed to accelerate such queries. We exploit parallelism and data reuse by utilizing a dataflow architecture along with on-chip scratchpads and non-blocking APIs to manage the accelerators, allowing concurrent execution of the accelerator and the host. Our accelerated system deployed on a cloud FPGA performs up to 19.3× better than GATK4 running on a commodity multi-core Xeon server and achieves up to 15× cost savings. We believe that if a software algorithm can be mapped onto a hardware library to utilize the underlying accelerator(s) using an already-standardized software interface such as SQL, while allowing the efficient mapping of such an interface to primitive hardware modules as we have demonstrated here, it will expedite the acceleration of domain-specific algorithms and allow easy adaptation to algorithm changes.
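To give a flavor of the reads-as-a-relational-table view, the toy sketch below uses standard SQL via Python's sqlite3. Genesis defines its own extended SQL and executes queries on FPGA dataflow pipelines; the schema and query here are invented for illustration.

```python
# Toy illustration of the "reads as a relational table" idea via sqlite3.
# The schema and query are hypothetical, not Genesis's extended SQL.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE reads (
    name TEXT, chrom TEXT, pos INTEGER, flags INTEGER, seq TEXT)""")
conn.executemany("INSERT INTO reads VALUES (?,?,?,?,?)", [
    ("r1", "chr1", 100, 0, "ACGT"),
    ("r2", "chr1", 100, 0, "ACGT"),   # same start position: duplicate candidate
    ("r3", "chr1", 250, 0, "TTGA"),
])

# A duplicate-marking-style stage expressed as a query: group reads by
# alignment position and report positions shared by multiple reads
# (cf. the MarkDuplicates stage in the GATK4 workflow).
for row in conn.execute("""
    SELECT chrom, pos, COUNT(*) AS n
    FROM reads GROUP BY chrom, pos HAVING COUNT(*) > 1
    ORDER BY chrom, pos"""):
    print(row)   # -> ('chr1', 100, 2)
```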
ASPLOS
IIU: Specialized Architecture for Inverted Index Search
Jun Heo,
Jaeyeon Won,
Yejin Lee,
Shivam Bharuka,
Jaeyoung Jang,
Tae Jun Ham,
Jae W. Lee
In Proceedings of the 25th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)
2020
The inverted index serves as a fundamental data structure for efficient search across various applications such as full-text search engines, document analytics, and other information retrieval systems. The storage requirements and query loads for these structures have been growing at a rapid rate. Thus, an ideal indexing system should maintain a small index size with a low query processing time. Previous works have mainly focused on using CPUs and GPUs to exploit query parallelism while utilizing state-of-the-art compression schemes to fit the index in memory. However, scaling parallelism to maximally utilize memory bandwidth on these architectures remains challenging. In this work, we present IIU, a novel inverted index processing unit, to optimize query performance while maintaining a low memory overhead for index storage. To this end, we co-design the indexing scheme and hardware accelerator so that the accelerator can process highly compressed inverted indexes at a high throughput. In addition, IIU provides flexible interconnects between modules to take advantage of both intra- and inter-query parallelism. Our evaluation using a cycle-level simulator demonstrates that IIU provides an average of 13.8× query latency reduction and 5.4× throughput improvement across different query types, while reducing the average energy consumption by 18.6×, compared to Apache Lucene, a production-grade full-text search framework.
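For background, the sketch below illustrates the basic software form of what IIU accelerates: delta-encoded posting lists and a conjunctive (AND) query that intersects them. The encoding and index contents are illustrative; IIU's compression scheme and hardware pipeline differ.

```python
# Sketch of inverted index query processing with delta-encoded posting lists.
from typing import Dict, List

def delta_encode(doc_ids: List[int]) -> List[int]:
    """Store gaps between sorted doc IDs; small gaps compress well."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def delta_decode(gaps: List[int]) -> List[int]:
    out, cur = [], 0
    for g in gaps:
        cur += g
        out.append(cur)
    return out

# Hypothetical index: term -> compressed posting list of matching doc IDs.
index: Dict[str, List[int]] = {
    "inverted": delta_encode([1, 4, 9, 30]),
    "index": delta_encode([4, 9, 17, 30, 55]),
}

def conjunctive_query(terms: List[str]) -> List[int]:
    """AND query: intersect the posting lists of all terms."""
    postings = [set(delta_decode(index[t])) for t in terms]
    return sorted(set.intersection(*postings))

print(conjunctive_query(["inverted", "index"]))  # -> [4, 9, 30]
```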
2019
MICRO
Charon: Specialized Near-Memory Processing Architecture for Clearing Dead Objects in Memory
Jaeyoung Jang,
Jun Heo,
Yejin Lee,
Jaeyeon Won,
Seonghak Kim,
Sung Jun Jung,
Hakbeom Jang,
Tae Jun Ham,
Jae W. Lee
In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)
2019
Garbage collection (GC) is a standard feature for high-productivity programming, saving a programmer from many nasty memory-related bugs. However, these productivity benefits come with a cost in terms of application throughput, worst-case latency, and energy consumption. Since the first introduction of GC by the Lisp programming language in the 1950s, a myriad of hardware and software techniques have been proposed to reduce this cost. While the idea of accelerating GC in hardware is appealing, its impact has been very limited due to narrow coverage, lack of flexibility, intrusive system changes, and significant hardware cost. Even with specialized hardware, GC performance is eventually limited by the memory bandwidth bottleneck. Fortunately, emerging 3D stacked DRAM technologies shed new light on this decades-old problem by enabling efficient near-memory processing with ample memory bandwidth. Thus, we propose Charon, the first 3D stacked memory-based GC accelerator. Through a detailed performance analysis of the HotSpot JVM, we derive a set of key algorithmic primitives based on their GC time coverage and implementation complexity in hardware. Then we devise a specialized processing unit to substantially improve their memory-level parallelism and throughput at a low hardware cost. Our evaluation of Charon with the full-production HotSpot JVM running two big data analytics frameworks, Spark and GraphChi, demonstrates a 3.29× geomean speedup and 60.7% energy savings for GC over the baseline 8-core out-of-order processor.
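For intuition about the kind of memory-bound work involved, the sketch below shows a software mark phase of a tracing GC: a pointer-chasing traversal whose successors are discovered only after each load, which is what makes near-memory processing with ample bandwidth attractive. This is a generic illustration, not one of Charon's derived primitives.

```python
# Generic mark phase of a tracing GC (illustrative; Charon implements its
# primitives in hardware near 3D-stacked DRAM, not in software).
class Obj:
    def __init__(self, *refs):
        self.refs = list(refs)   # outgoing references to other objects
        self.marked = False

def mark(roots):
    """Traverse the object graph from the roots; unmarked objects are dead."""
    stack = list(roots)
    while stack:
        obj = stack.pop()
        if obj.marked:
            continue
        obj.marked = True        # each visit is a dependent memory access
        stack.extend(obj.refs)   # successors are known only after the load

c = Obj()
b = Obj(c)
a = Obj(b)
dead = Obj()                     # unreachable from the roots
mark([a])
print(a.marked, c.marked, dead.marked)  # -> True True False
```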
ITC-CSCC
Performance Analysis of Convolutional Neural Networks on Manycore Platforms
Jaeyoung Jang,
Yejin Lee,
Jae W. Lee
In The 34th International Technical Conference on Circuits/Systems, Computers and Communications (ITC-CSCC)
2019