mirror of https://github.com/PKUFlyingPig/cs-self-learning.git (synced 2026-06-22 09:27:22 +08:00)

update

parent c0d6bac60f
commit cefe6c8b13

2 changed files with 33 additions and 26 deletions
@@ -24,6 +24,8 @@ Part 1. Foundations: modern deep learning and computational representations
- Automatic differentiation and an overview of ML system architectures
- Tensor formats, in-depth matrix multiplication, and hardware accelerators



Part 2. Systems and performance optimization: from GPU kernels to compilation and memory
- GPUs and CUDA (including basic performance models)
- GPU matrix multiplication and operator-level compilation
@@ -31,6 +33,7 @@ Part 2. Systems and performance optimization: from GPU kernels to compilation and memory
- Memory management (including practical issues and techniques in training and inference)
- Quantization methods and system-level deployment


Part 3. LLM systems: training and inference
- Parallelization strategies: model parallelism, collective communication, intra-/inter-op parallelism, and auto-parallelization
- LLM fundamentals: Transformers, Attention, and MoE
@@ -38,6 +41,7 @@ Part 3. LLM systems: training and inference
- LLM inference: continuous batching, paged attention, disaggregated prefill/decoding
- Scaling laws


(Guest lectures cover topics such as ML compilers, LLM pretraining and open science, fast inference, and tool use and agents, serving as complementary extensions.)

The defining characteristic of CSE234 is its strong focus on LLM systems as the core application setting. The course emphasizes real-world system design trade-offs and engineering constraints rather than stopping at algorithms or API usage. Assignments often require students to work directly on performance issues such as memory bandwidth limits, communication overhead, and kernel fusion, and to address them with Triton or system-level optimizations. Overall, the learning experience is fairly intensive, and a solid background in systems and parallel computing is important. For self-study, it is strongly recommended to learn CUDA, parallel programming, and core systems knowledge in advance; otherwise the learning curve becomes noticeably steep in the later parts of the course, especially around LLM optimization and inference. That said, once you have caught up with the pace, the course offers strong long-term value for those pursuing work in LLM infrastructure, ML systems, or AI compilers.
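To make the memory-bandwidth and kernel-fusion point concrete, here is a minimal roofline-style sketch in Python. It is not taken from the course materials: the hardware figures (`PEAK_FLOPS`, `PEAK_BANDWIDTH`) are illustrative assumptions rather than the specs of any real GPU, and the helper names are hypothetical. It only estimates how much memory traffic a chain of elementwise operations generates with and without fusion.

```python
# Roofline-style estimate: an operation is memory-bound when its arithmetic
# intensity (FLOPs per byte moved) is below the machine balance.
# All hardware numbers below are assumed values for illustration only.

PEAK_FLOPS = 100e12      # assumed peak compute throughput, FLOP/s
PEAK_BANDWIDTH = 1e12    # assumed memory bandwidth, bytes/s
BYTES_PER_ELEM = 4       # float32


def attainable_flops(flops, bytes_moved):
    """Roofline bound: min(peak compute, arithmetic intensity * bandwidth)."""
    intensity = flops / bytes_moved
    return min(PEAK_FLOPS, intensity * PEAK_BANDWIDTH)


def elementwise_chain(n, num_ops, fused):
    """Return (FLOPs, bytes moved) for num_ops unary elementwise ops on n elements.

    Unfused: every op reads its input from memory and writes its output back.
    Fused: the input is read once and only the final output is written.
    """
    flops = float(n * num_ops)
    passes = 1 if fused else num_ops
    bytes_moved = float(passes * 2 * n * BYTES_PER_ELEM)
    return flops, bytes_moved


def estimated_time(flops, bytes_moved):
    """Lower-bound runtime in seconds under the roofline model."""
    return flops / attainable_flops(flops, bytes_moved)


if __name__ == "__main__":
    n, num_ops = 1 << 20, 4
    t_unfused = estimated_time(*elementwise_chain(n, num_ops, fused=False))
    t_fused = estimated_time(*elementwise_chain(n, num_ops, fused=True))
    print(f"unfused: {t_unfused:.2e} s, fused: {t_fused:.2e} s")
```

Under these assumptions both variants are memory-bound, so the estimated runtime scales with the bytes moved, and fusing the four operations cuts the traffic (and the estimate) by the number of passes removed. Real kernels also depend on caches, launch overhead, and occupancy, which this sketch ignores.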
@@ -26,6 +26,7 @@ Part 1. Foundations: modern deep learning and computational representations
- Autodiff and an overview of ML system architecture
- Tensor formats, MatMul in depth, and hardware accelerators


Part 2. Systems and performance optimization: from GPU kernels to compilation and memory
- GPUs & CUDA (including basic performance models)
- GPU MatMul and operator compilation
@@ -33,6 +34,7 @@ Part 2. Systems and performance optimization: from GPU kernels to compilation and memory
- Memory (including memory issues and techniques in training/inference)
- Quantization (quantization methods and system deployment)


Part 3. LLM systems: training and inference
- Parallelization strategies: model parallelism, collective communication, intra-/inter-op, auto-parallelization
- LLM fundamentals: Transformer, Attention, MoE
@@ -40,6 +42,7 @@ Part 3. LLM systems: training and inference
- LLM inference: continuous batching, paged attention, disaggregated prefill/decoding
- Scaling law


(Guest lectures: ML compiler, LLM pretraining/open science, fast inference, tool use & agents, and so on, as supplements and extensions.)

The most distinctive feature of CSE234 is its strong focus on LLMs (LLM systems) as the core application setting, emphasizing the trade-offs and engineering constraints of real system design rather than stopping at algorithms or API usage. Assignments usually require facing performance bottlenecks directly (memory bandwidth, communication overhead, kernel fusion, and so on) and resolving them with Triton or system-level optimizations, which is very helpful for understanding why certain LLM system designs look the way they do today. The learning experience is fairly hardcore overall, and the course expects a strong background in systems and parallel computing early on. For self-study, it is advisable to fill in CUDA/parallel programming and basic systems knowledge beforehand; otherwise the learning curve will feel noticeably steep in the second half, especially in the material on LLM optimization and inference. Once you keep up with the pace, however, the course has strong long-term value for those working in LLM Infra / ML Systems / AI Compiler directions.