My research focuses on efficient deep learning algorithms and systems, particularly for Large Language Models and Generative AI. I am committed to leveraging AI to develop reliable real-world applications.
Q-Diffusion: Quantizing Diffusion Models
Xiuyu Li, Kurt Keutzer
International Conference on Computer Vision (ICCV), 2023
Diffusion models have achieved great success in image synthesis through iterative noise estimation using deep neural networks. However, the slow inference, high memory consumption, and computational intensity of the noise estimation model hinder the efficient adoption of diffusion models. Although post-training quantization (PTQ) is considered a go-to compression method for other tasks, it does not work out-of-the-box on diffusion models. We propose a novel PTQ method specifically tailored to the unique multi-timestep pipeline and model architecture of diffusion models,
which compresses the noise estimation network to accelerate the generation process. We identify the key difficulty of diffusion model quantization as the changing output distributions of noise estimation networks over
multiple time steps and the bimodal activation distribution of the shortcut layers within the noise estimation network. We tackle these challenges with time step-aware calibration and shortcut-splitting quantization in this work.
Experimental results show that our proposed method is able to quantize full-precision unconditional diffusion models into 4-bit while maintaining comparable performance (small FID change of at most 2.34 compared to >100 for traditional PTQ) in a training-free manner.
Our approach can also be applied to text-guided image generation, where for the first time we can run Stable Diffusion with 4-bit weights while maintaining high generation quality.
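The two techniques named above lend themselves to a compact illustration. Below is a minimal Python sketch, not the paper's implementation: it uses a toy uniform symmetric 4-bit quantizer and simulated activations (all names, sizes, and distributions are assumptions) to show why pooling calibration samples across time steps, and splitting a shortcut concatenation into separately calibrated halves, both reduce quantization error.

```python
import numpy as np

def calibrate_range(samples):
    """Pick a symmetric clipping range covering the pooled samples."""
    return float(np.abs(samples).max())

def quantize(x, max_abs, n_bits=4):
    """Toy uniform symmetric fake-quantization to n_bits."""
    levels = 2 ** (n_bits - 1) - 1
    scale = max_abs / levels
    return np.clip(np.round(x / scale), -levels - 1, levels) * scale

rng = np.random.default_rng(0)

# (1) Time step-aware calibration: the activation scale drifts with the
# time step t (simulated here), so the clipping range is calibrated on
# samples pooled uniformly over t rather than at a single step.
acts = {t: rng.normal(0, 0.5 + 0.05 * t, size=4096) for t in range(0, 50, 7)}
single_step_max = calibrate_range(acts[0])  # calibrated at t=0 only
pooled_max = calibrate_range(np.concatenate(list(acts.values())))
late = acts[49]
print("t=0 calibration MSE:", np.mean((quantize(late, single_step_max) - late) ** 2))
print("time-aware MSE:     ", np.mean((quantize(late, pooled_max) - late) ** 2))

# (2) Shortcut splitting: concatenating a deep feature (small range) with a
# skipped input (large range) yields a bimodal distribution; quantizing the
# two halves with separate ranges avoids drowning the small-range half.
deep, skip = rng.normal(0, 0.1, 512), rng.normal(0, 3.0, 512)
concat = np.concatenate([deep, skip])
joint = quantize(concat, calibrate_range(concat))
split = np.concatenate([quantize(deep, calibrate_range(deep)),
                        quantize(skip, calibrate_range(skip))])
print("joint-range MSE:", np.mean((joint - concat) ** 2))
print("split-range MSE:", np.mean((split - concat) ** 2))
```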
SqueezeLLM: Dense-and-Sparse Quantization
Sehoon Kim*, Michael W. Mahoney, Kurt Keutzer
Preprint, 2023
Generative Large Language Models (LLMs) have demonstrated remarkable results for a wide range of tasks. However, deploying these models for inference has been a significant challenge due to their unprecedented resource requirements.
This has forced existing deployment frameworks to use multi-GPU inference pipelines, which are often complex and costly, or to use smaller and less performant models. In this work, we demonstrate that the main bottleneck for generative inference with LLMs is memory bandwidth,
rather than compute, specifically for single batch inference. While quantization has emerged as a promising solution by representing model weights with reduced precision, previous efforts have often resulted in notable performance degradation.
To address this, we introduce SqueezeLLM, a post-training quantization framework that not only enables lossless compression to ultra-low precisions as low as 3 bits, but also achieves higher quantization performance under the same memory constraint.
Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and
(ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format. When applied to the LLaMA models, our 3-bit quantization significantly reduces the perplexity gap from the FP16 baseline by up to 2.1x as compared to
the state-of-the-art methods with the same memory requirement. Furthermore, when deployed on an A6000 GPU, our quantized models achieve up to 2.3x speedup compared to the baseline. Our code is open-sourced and available online.
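As a rough illustration of ideas (i) and (ii), the sketch below (an assumption-laden toy, not the SqueezeLLM code) quantizes a synthetic weight vector with a sensitivity-weighted 1-D k-means codebook (k=8, i.e., 3 bits) after peeling the largest ~0.5% of weights into a full-precision sparse part; the random `sens` vector merely stands in for the second-order information the paper uses.

```python
import numpy as np

def weighted_kmeans_1d(w, sens, k=8, iters=25):
    """Sensitivity-weighted 1-D k-means; the k centroids form the codebook."""
    centers = np.quantile(w, np.linspace(0, 1, k))
    for _ in range(iters):
        assign = np.abs(w[:, None] - centers[None, :]).argmin(axis=1)
        for j in range(k):
            m = assign == j
            if m.any():
                centers[j] = np.average(w[m], weights=sens[m])
    return centers[assign]  # each weight snaps to its centroid

def dense_and_sparse(w, outlier_frac=0.005):
    """Split w into a dense part (to be quantized) and sparse FP outliers."""
    thresh = np.quantile(np.abs(w), 1 - outlier_frac)
    sparse = np.where(np.abs(w) > thresh, w, 0.0)
    return w - sparse, sparse

rng = np.random.default_rng(0)
w = rng.standard_normal(4096)
w[rng.choice(4096, 8, replace=False)] *= 25      # inject a few outliers
sens = rng.uniform(0.1, 1.0, 4096)               # stand-in for 2nd-order info

dense, sparse = dense_and_sparse(w)
w_hat = weighted_kmeans_1d(dense, sens, k=8) + sparse   # 3-bit codebook
naive = weighted_kmeans_1d(w, np.ones_like(w), k=8)     # no outlier split
print("dense-and-sparse MSE:", np.mean((w_hat - w) ** 2))
print("naive 3-bit MSE:     ", np.mean((naive - w) ** 2))
```

Pulling the outliers out first keeps the codebook's dynamic range tight, which is why the decomposed version reconstructs far more accurately at the same bit budget.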
TorchSparse++: Efficient Training and Inference Framework for Sparse Convolution on GPUs
Haotian Tang*, Song Han
International Symposium on Microarchitecture (MICRO), 2023
Sparse convolution plays a pivotal role in emerging workloads, including point cloud processing in AR/VR, autonomous driving,
and graph understanding in recommendation systems. Since the computation pattern is sparse and irregular, specialized high-performance
kernels are required. Existing GPU libraries offer two dataflow types for sparse convolution. The gather-GEMM-scatter dataflow is easy
to implement but not optimal in performance, while dataflows with overlapped computation and memory access (e.g., implicit GEMM) are
highly performant but have very high engineering costs. In this paper, we introduce TorchSparse++, a new GPU library that achieves the
best of both worlds. We create a highly efficient Sparse Kernel Generator that generates performant sparse convolution kernels at less
than one-tenth of the engineering cost of the current state-of-the-art system. On top of this, we design the Sparse Autotuner, which extends
the design space of existing sparse convolution libraries and searches for the best dataflow configurations for training and inference workloads.
Consequently, TorchSparse++ achieves 2.9x, 3.3x, 2.2x and 1.7x measured end-to-end speedup on an NVIDIA A100 GPU over state-of-the-art
MinkowskiEngine, SpConv 1.2, TorchSparse and SpConv v2 in inference; and is 1.2-1.3x faster than SpConv v2 in mixed precision training across
seven representative autonomous driving benchmarks. It also seamlessly supports graph convolutions, achieving 2.6-7.6x faster inference speed
compared with state-of-the-art graph deep learning libraries.
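For readers unfamiliar with the gather-GEMM-scatter dataflow mentioned above, here is a minimal NumPy sketch of its structure, assuming a precomputed kernel map (per-offset input/output index pairs, a hypothetical layout chosen for illustration); real libraries fuse and tune these stages on the GPU, which is exactly the engineering cost the Sparse Kernel Generator targets.

```python
import numpy as np

def sparse_conv_gather_gemm_scatter(feats, weights, kernel_map, n_out):
    """feats: (N_in, C_in); weights: (K, C_in, C_out);
    kernel_map[k]: (in_idx, out_idx) index pairs active for kernel offset k."""
    out = np.zeros((n_out, weights.shape[2]), dtype=feats.dtype)
    for k, (in_idx, out_idx) in enumerate(kernel_map):
        gathered = feats[in_idx]            # gather rows active at this offset
        partial = gathered @ weights[k]     # dense GEMM on the gathered rows
        np.add.at(out, out_idx, partial)    # scatter-accumulate into outputs
    return out

# Toy example: 5 input points, 4 output points, 3 kernel offsets.
rng = np.random.default_rng(0)
feats = rng.standard_normal((5, 8)).astype(np.float32)
weights = rng.standard_normal((3, 8, 16)).astype(np.float32)
kernel_map = [(np.array([0, 1, 2]), np.array([0, 1, 2])),
              (np.array([1, 3]), np.array([0, 3])),
              (np.array([4]), np.array([2]))]
print(sparse_conv_gather_gemm_scatter(feats, weights, kernel_map, 4).shape)
```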
TorchSparse: Efficient Point Cloud Inference Engine
Haotian Tang*, Song Han
Conference on Machine Learning and Systems (MLSys), 2022
Deep learning on point clouds has received increased attention thanks to its wide applications in AR/VR and
autonomous driving. These applications require low latency and high accuracy to provide real-time user experience
and ensure user safety. Unlike conventional dense workloads, the sparse and irregular nature of point clouds
poses severe challenges to running sparse CNNs efficiently on general-purpose hardware, and existing
sparse acceleration techniques for 2D images do not translate to 3D point clouds. In this paper, we introduce
TorchSparse, a high-performance point cloud inference engine that accelerates the sparse convolution computation
on GPUs. TorchSparse directly optimizes the two bottlenecks of sparse convolution: data movement and
irregular computation. It optimizes the data orchestration by quantization and fused locality-aware memory
access, reducing the memory movement cost by 2.7×. It also adopts adaptive MM grouping to trade computation
for better regularity, achieving 1.4-1.5× speedup for matrix multiplication. Evaluated on seven representative
models across three benchmark datasets, TorchSparse achieves 1.6× and 1.5× measured end-to-end speedup over
the state-of-the-art MinkowskiEngine and SpConv, respectively.
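The adaptive MM grouping idea, trading a little redundant computation for regular batched GEMMs, can be sketched as follows; the grouping policy and padding scheme here are illustrative assumptions, not TorchSparse internals.

```python
import numpy as np

def grouped_matmul(mats, weights, group):
    """Pad every matrix in `group` to the max row count, then batch-multiply."""
    rows = max(mats[g].shape[0] for g in group)
    padded = np.stack([np.pad(mats[g], ((0, rows - mats[g].shape[0]), (0, 0)))
                       for g in group])
    w = np.stack([weights[g] for g in group])
    out = padded @ w                  # one batched GEMM; padding wastes some FLOPs
    return {g: out[i, :mats[g].shape[0]] for i, g in enumerate(group)}

rng = np.random.default_rng(0)
mats = [rng.standard_normal((n, 8)) for n in (100, 90, 7)]   # irregular workloads
weights = [rng.standard_normal((8, 16)) for _ in mats]
# Bucket similar sizes together; the tiny one runs alone to limit padding waste.
results = {**grouped_matmul(mats, weights, [0, 1]),
           **grouped_matmul(mats, weights, [2])}
print({g: out.shape for g, out in results.items()})
```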
Data Isotopes for Data Provenance in DNNs
Emily Wenger, Ben Y. Zhao, Vitaly Shmatikov
Privacy Enhancing Technologies Symposium (PETS), 2024
Today, creators of data-hungry deep neural networks (DNNs) scour the Internet for training fodder, leaving users with
little control over or knowledge of when their data is appropriated for model training. To empower users to counteract
unwanted data use, we design, implement and evaluate a practical system that enables users to detect if their data was
used to train a DNN model. We show how users can create special data points we call isotopes, which introduce “spurious features” into DNNs during training. With only query access to a trained model, and with no knowledge of the model training process or control of the data labels, a user can apply statistical hypothesis testing to detect if a model has learned the spurious features
associated with their isotopes by training on the user’s data. This effectively turns DNNs’ vulnerability to memorization and spurious
correlations into a tool for data provenance. Our results confirm efficacy in multiple settings, detecting and distinguishing between
hundreds of isotopes with high accuracy. We further show that our system works on public ML-as-a-service platforms and larger ImageNet-scale models, can use physical objects instead of digital marks, and remains generally robust against several adaptive countermeasures.
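A minimal sketch of the detection step, under stand-in assumptions (the `model_prob_of_class` function simulates query access, and the mark is a single coordinate): query the model on probes with and without the isotope mark, then run a one-sided t-test on the probability it assigns to the user's class.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

def model_prob_of_class(x, learned_isotope=True):
    """Stand-in for query access: P(user's class | x). If the model trained
    on the user's isotopes, the mark (encoded in x[:, 0]) boosts this score."""
    base = rng.uniform(0.05, 0.15, size=len(x))
    boost = 0.3 if learned_isotope else 0.0
    return np.clip(base + boost * x[:, 0], 0.0, 1.0)

probes = rng.uniform(size=(200, 16))
marked, unmarked = probes.copy(), probes.copy()
marked[:, 0], unmarked[:, 0] = 1.0, 0.0          # add / remove the mark

for trained_on_isotopes in (True, False):
    p_m = model_prob_of_class(marked, trained_on_isotopes)
    p_u = model_prob_of_class(unmarked, trained_on_isotopes)
    t, p_value = ttest_ind(p_m, p_u, alternative="greater")
    print(f"trained_on_isotopes={trained_on_isotopes}: t={t:.2f}, p={p_value:.3g}")
# A small p-value rejects the null "no spurious feature", flagging data use.
```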
The ArtBench Dataset: Benchmarking Generative Models with Artworks
Peiyuan Liao*, Kurt Keutzer
Preprint, 2022
We introduce ArtBench-10, the first class-balanced, high-quality, cleanly annotated, and standardized dataset for benchmarking artwork generation.
It comprises 60,000 images of artwork from 10 distinctive artistic styles, with 5,000 training images and 1,000 testing images per style. ArtBench-10
has several advantages over previous artwork datasets. Firstly, it is class-balanced, whereas most previous artwork datasets suffer from long-tail class distributions. Secondly, the images are of high quality with clean annotations. Thirdly, ArtBench-10 is created with standardized data collection, annotation, filtering, and preprocessing procedures. We provide three versions of the dataset at different resolutions (32x32, 256x256, and the original image size), formatted so that they are easy to incorporate into popular machine learning frameworks. We also conduct extensive benchmarking experiments using representative
image synthesis models with ArtBench-10 and present in-depth analysis. The dataset is available at this https URL
under a Fair Use license.
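If the 256x256 version is laid out as one folder per style (the path and layout below are assumptions for illustration, not the dataset's documented format), loading it with standard tooling is straightforward, e.g. via torchvision:

```python
import torch
from torchvision import datasets, transforms

# "artbench-10/train" is a hypothetical local path with one subfolder per style.
artbench = datasets.ImageFolder("artbench-10/train",
                                transform=transforms.ToTensor())
loader = torch.utils.data.DataLoader(artbench, batch_size=64, shuffle=True)

images, labels = next(iter(loader))   # labels index the 10 styles
print(images.shape, labels[:8])
```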
Large Scale Learning on Non-Homophilous Graphs: New Benchmarks and Strong Simple Methods
Derek Lim*, Sijia Linda Huang, Ser-Nam Lim
Advances in Neural Information Processing Systems (NeurIPS), 2021
Previous version: New Benchmarks for Learning on Non-Homophilous Graphs, The Web Conference (WWW) Workshop on Graph Learning Benchmarks, 2021
[workshop code & datasets]
[code & datasets]
Many widely used datasets for graph machine learning tasks have generally been homophilous, where nodes with similar labels connect to each other.
Recently, new Graph Neural Networks (GNNs) have been developed that move beyond the homophily regime; however, their evaluation has often been conducted
on small graphs with limited application domains. We collect and introduce diverse non-homophilous datasets from a variety of application areas that
have up to 384x more nodes and 1398x more edges than prior datasets. We further show that existing scalable graph learning and graph minibatching techniques
lead to performance degradation on these non-homophilous datasets, thus highlighting the need for further work on scalable non-homophilous methods. To address
these concerns, we introduce LINKX, a strong, simple method that admits straightforward minibatch training and inference. Extensive experimental results with
representative simple methods and GNNs across our proposed datasets show that LINKX achieves state-of-the-art performance for learning on non-homophilous graphs.
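A minimal PyTorch sketch of LINKX's structure as described: one MLP embeds each node's adjacency row, another embeds its features, and a final MLP combines them. Layer sizes and the dense adjacency input are simplifications for illustration.

```python
import torch
import torch.nn as nn

class LINKX(nn.Module):
    """Sketch: MLP on adjacency rows + MLP on features, fused by a final MLP."""
    def __init__(self, n_nodes, n_feats, hidden, n_classes):
        super().__init__()
        self.mlp_a = nn.Sequential(nn.Linear(n_nodes, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden))
        self.mlp_x = nn.Sequential(nn.Linear(n_feats, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden))
        self.combine = nn.Linear(2 * hidden, hidden)
        self.mlp_f = nn.Sequential(nn.ReLU(), nn.Linear(hidden, n_classes))

    def forward(self, adj_rows, x):
        h_a, h_x = self.mlp_a(adj_rows), self.mlp_x(x)
        h = self.combine(torch.cat([h_a, h_x], dim=-1)) + h_a + h_x
        return self.mlp_f(h)          # logits; no message passing involved

n = 100
adj_rows = (torch.rand(n, n) < 0.05).float()   # toy dense adjacency rows
x = torch.randn(n, 32)
print(LINKX(n, 32, 64, 5)(adj_rows, x).shape)  # torch.Size([100, 5])
```

Because each node needs only its own adjacency row and feature vector, rows can be sampled independently for minibatch training, which is what makes the method scale.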
GARNET: Reduced-Rank Topology Learning for Robust and Scalable Graph Neural Networks
Chenhui Deng, Zhiru Zhang
Learning on Graphs Conference (LoG), 2022, Spotlight
Graph neural networks (GNNs) have been increasingly deployed in various applications that involve learning on non-Euclidean data.
However, recent studies show that GNNs are vulnerable to graph adversarial attacks. Although there are several defense methods to
improve GNN robustness by eliminating adversarial components, they may also impair the underlying clean graph structure that
contributes to GNN training. In addition, few of those defense models can scale to large graphs due to their high computational
complexity and memory usage. In this paper, we propose GARNET, a scalable spectral method to boost the adversarial robustness of
GNN models. GARNET first leverages weighted spectral embedding to construct a base graph, which is not only resistant to adversarial
attacks but also contains critical (clean) graph structure for GNN training. Next, GARNET further refines the base graph by pruning
additional non-critical edges based on a probabilistic graphical model. GARNET has been evaluated on various datasets, including a large
graph with millions of nodes. Our extensive experiment results show that GARNET achieves adversarial accuracy improvement and runtime
speedup over state-of-the-art GNN (defense) models by up to 13.27% and 14.7x, respectively.
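A minimal sketch of the base-graph construction idea, with the normalization, eigenvalue weighting, and kNN rule as illustrative assumptions rather than GARNET's exact procedure: embed nodes with a low-rank (top-r) spectral decomposition of the perturbed graph, then connect nearest neighbors in that embedding space.

```python
import numpy as np

def reduced_rank_base_graph(adj, r=8, k=5):
    """Build a kNN base graph from an eigenvalue-weighted top-r embedding."""
    deg = adj.sum(1)
    d_inv_sqrt = np.where(deg > 0, deg, 1.0) ** -0.5
    d_inv_sqrt[deg == 0] = 0.0
    norm_adj = d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    vals, vecs = np.linalg.eigh(norm_adj)            # symmetric eigendecomposition
    idx = np.argsort(-np.abs(vals))[:r]              # keep top-r spectral components
    emb = vecs[:, idx] * np.abs(vals[idx])           # weighted spectral embedding
    dists = np.linalg.norm(emb[:, None] - emb[None, :], axis=-1)
    np.fill_diagonal(dists, np.inf)
    base = np.zeros_like(adj)
    for i, nbrs in enumerate(np.argsort(dists, axis=1)[:, :k]):
        base[i, nbrs] = base[nbrs, i] = 1.0          # symmetric kNN edges
    return base

rng = np.random.default_rng(0)
a = (rng.random((60, 60)) < 0.1).astype(float)
a = np.triu(a, 1)
a = a + a.T                                          # toy undirected graph
print(int(reduced_rank_base_graph(a).sum() / 2), "edges in the base graph")
```

Because adversarial edges tend to concentrate in the high-frequency spectral components, a reconstruction from only the top-r components filters much of the perturbation while retaining the dominant clean structure.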