Junda Chen 2026-02-01 20:03:11 -08:00
parent c0d6bac60f
commit cefe6c8b13
2 changed files with 33 additions and 26 deletions


@@ -24,6 +24,8 @@ Part 1. Foundations: modern deep learning and computational representations
- Automatic differentiation and an overview of ML system architectures
- Tensor formats, in-depth matrix multiplication, and hardware accelerators
Part 2. Systems and performance optimization: from GPU kernels to compilation and memory
- GPUs and CUDA (including basic performance models)
- GPU matrix multiplication and operator-level compilation
@@ -31,6 +33,7 @@ Part 2. Systems and performance optimization: from GPU kernels to compilation an
- Memory management (including practical issues and techniques in training and inference)
- Quantization methods and system-level deployment
Part 3. LLM systems: training and inference
- Parallelization strategies: model parallelism, collective communication, intra-/inter-op parallelism, and auto-parallelization
- LLM fundamentals: Transformers, Attention, and MoE
@@ -38,6 +41,7 @@ Part 3. LLM systems: training and inference
- LLM inference: continuous batching, paged attention, disaggregated prefill/decoding
- Scaling laws
(Guest lectures cover topics such as ML compilers, LLM pretraining and open science, fast inference, and tool use and agents, serving as complementary extensions.)
The defining characteristic of CSE234 is its strong focus on LLM systems as the core application setting. The course emphasizes real-world system design trade-offs and engineering constraints, rather than remaining at the level of algorithms or API usage. Assignments often require students to directly confront performance bottlenecks (such as memory bandwidth limitations, communication overheads, and kernel fusion) and address them through Triton or system-level optimizations. Overall, the learning experience is fairly intensive: a solid background in systems and parallel computing is important. For self-study, it is strongly recommended to prepare CUDA, parallel programming, and core systems knowledge in advance; otherwise, the learning curve becomes noticeably steep in the later parts of the course, especially around LLM optimization and inference.
That said, once the pace is manageable, the course offers strong long-term value for those pursuing work in LLM infrastructure, ML systems, or AI compilers.


@@ -26,6 +26,7 @@ Part 1. Foundations: modern deep learning and computational representations
- Autodiff and an overview of ML system architectures
- Tensor formats, MatMul in depth, and hardware accelerators
Part 2. Systems and performance optimization: from GPU kernels to compilation and memory
- GPUs & CUDA (including basic performance models)
- GPU MatMul and operator compilation
@@ -33,6 +34,7 @@ Part 2. Systems and performance optimization: from GPU kernels to compilation and memory
- Memory (including memory issues and techniques in training and inference)
- Quantization (methods and system-level deployment)
Part 3. LLM systems: training and inference
- Parallelization strategies: model parallelism, collective communication, intra-/inter-op, auto-parallelization
- LLM fundamentals: Transformer, Attention, MoE
@@ -40,6 +42,7 @@ Part 3. LLM systems: training and inference
- LLM inference: continuous batching, paged attention, disaggregated prefill/decoding
- Scaling law
(Guest lectures: ML compiler, LLM pretraining/open science, fast inference, tool use & agents, and so on, as supplements and extensions.)
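The defining feature of CSE234 is its strong focus on LLM systems as the core application setting, emphasizing the trade-offs and engineering constraints of real system design rather than staying at the level of algorithms or API usage. Assignments typically require confronting performance bottlenecks directly (such as memory bandwidth, communication overhead, and kernel fusion) and resolving them with Triton or system-level optimizations, which is very helpful for understanding why certain LLM systems are designed the way they are. The learning experience is fairly hardcore overall, and the course demands a strong background in systems and parallel computing early on. For self-study, it is advisable to fill in CUDA/parallel programming and basic systems knowledge beforehand; otherwise the learning curve will feel noticeably steep in the second half (especially the material on LLM optimization and inference). Once you keep up with the pace, however, the course offers strong long-term value for students working on LLM Infra / ML Systems / AI Compiler.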