Junda Chen 2026-02-01 20:03:11 -08:00
parent c0d6bac60f
commit cefe6c8b13
2 changed files with 33 additions and 26 deletions


@@ -20,23 +20,27 @@ This course focuses on the design of end-to-end large language model (LLM) syste
The course can be more accurately divided into three parts (with several additional guest lectures):
Part 1. Foundations: modern deep learning and computational representations
- Modern deep learning and computation graphs (framework and system fundamentals)
- Automatic differentiation and an overview of ML system architectures
- Tensor formats, in-depth matrix multiplication, and hardware accelerators
Part 2. Systems and performance optimization: from GPU kernels to compilation and memory
- GPUs and CUDA (including basic performance models)
- GPU matrix multiplication and operator-level compilation
- Triton programming, graph optimization, and compilation
- Memory management (including practical issues and techniques in training and inference)
- Quantization methods and system-level deployment
Part 3. LLM systems: training and inference
- Parallelization strategies: model parallelism, collective communication, intra-/inter-op parallelism, and auto-parallelization
- LLM fundamentals: Transformers, Attention, and MoE
- LLM training optimizations (e.g., FlashAttention-style techniques)
- LLM inference: continuous batching, paged attention, disaggregated prefill/decoding
- Scaling laws
(Guest lectures cover topics such as ML compilers, LLM pretraining and open science, fast inference, and tool use and agents, serving as complementary extensions.)


@@ -22,23 +22,26 @@
The course can be more accurately divided into three parts (plus several guest lectures):
Part 1. Foundations: modern deep learning and computational representations
- Modern DL and computational graphs (computational graph / framework fundamentals)
- Autodiff and an overview of ML system architecture
- Tensor formats, MatMul in depth, and hardware accelerators
Part 2. Systems and performance optimization: from GPU kernels to compilation and memory
- GPUs & CUDA (including basic performance models)
- GPU MatMul and operator compilation
- Triton programming, graph optimization & compilation
- Memory (including memory issues and techniques in training/inference)
- Quantization (quantization methods and system deployment)
Part 3. LLM systems: training and inference
- Parallelization strategies (model parallelism, collective communication, intra-/inter-op, auto-parallelization)
- LLM fundamentals (Transformer, Attention, MoE)
- LLM training optimizations (FlashAttention, etc.)
- LLM inference (continuous batching, paged attention, disaggregated prefill/decoding)
- Scaling law
(Guest lectures: ML compilers, LLM pretraining/open science, fast inference, tool use & agents, etc., as supplements and extensions.)