From 46817604f081fdd179f74bff780fdda5313f03d6 Mon Sep 17 00:00:00 2001
From: Junda Chen <32371474+GindaChen@users.noreply.github.com>
Date: Sat, 31 Jan 2026 22:55:53 -0800
Subject: [PATCH] update extended materials

---
 docs/机器学习系统/CSE234.en.md | 32 +++++++++++++++++++++++---------
 docs/机器学习系统/CSE234.md | 16 +++++++++++++++-
 2 files changed, 38 insertions(+), 10 deletions(-)

diff --git a/docs/机器学习系统/CSE234.en.md b/docs/机器学习系统/CSE234.en.md
index e62d844e..239d43d6 100644
--- a/docs/机器学习系统/CSE234.en.md
+++ b/docs/机器学习系统/CSE234.en.md
@@ -10,26 +10,26 @@
-This course focuses on the design of end-to-end large language model (LLM) systems, serving as an introductory yet comprehensive guide to building efficient LLM systems in practice.
+This course focuses on the design of end-to-end large language model (LLM) systems, serving as an introductory course to building efficient LLM systems in practice.
-The course can be more accurately divided into three main parts (with several additional guest lectures):
+The course can be more accurately divided into three parts (with several additional guest lectures):
 1. Foundations: modern deep learning and computational representations
-    - Modern deep learning and computation graphs (framework and system basics)
-    - Automatic differentiation and ML system architecture overview
+    - Modern deep learning and computation graphs (framework and system fundamentals)
+    - Automatic differentiation and an overview of ML system architectures
     - Tensor formats, in-depth matrix multiplication, and hardware accelerators
 2. Systems and performance optimization: from GPU kernels to compilation and memory
     - GPUs and CUDA (including basic performance models)
     - GPU matrix multiplication and operator-level compilation
     - Triton programming, graph optimization, and compilation
-    - Memory management in training and inference
-    - Quantization techniques and system-level deployment
+    - Memory management (including practical issues and techniques in training and inference)
+    - Quantization methods and system-level deployment
 3. LLM systems: training and inference
     - Parallelization strategies: model parallelism, collective communication, intra-/inter-op parallelism, and auto-parallelization
@@ -40,7 +40,17 @@ The course can be more accurately divided into three main parts (with several ad
 (Guest lectures cover topics such as ML compilers, LLM pretraining and open science, fast inference, and tool use and agents, serving as complementary extensions.)
-The defining characteristic of CSE234 is its strong focus on LLM systems as the core application setting. The course emphasizes real-world system design trade-offs and engineering constraints, rather than remaining at the level of algorithms or API usage. Assignments often require students to directly confront performance bottlenecks—such as memory bandwidth limitations, communication overheads, and kernel fusion—and address them through Triton or system-level optimizations. This makes the course particularly helpful for understanding *why modern LLM systems are designed the way they are*. The learning experience is overall quite intensive: a solid background in systems and parallel computing is important, and for self-study it is strongly recommended to prepare CUDA, parallel programming, and core systems knowledge in advance. Otherwise, the learning curve becomes noticeably steep in the later parts of the course, especially around LLM optimization and inference. That said, once the pace is manageable, the course offers substantial long-term value for those pursuing work in LLM infrastructure, ML systems, or AI compilers.
+The defining characteristic of CSE234 is its strong focus on LLM systems as the core application setting. The course emphasizes real-world system design trade-offs and engineering constraints, rather than remaining at the level of algorithms or API usage. Assignments often require students to directly confront performance bottlenecks—such as memory bandwidth limitations, communication overheads, and kernel fusion—and address them through Triton or system-level optimizations. Overall, the learning experience is fairly intensive: a solid background in systems and parallel computing is important. For self-study, it is strongly recommended to prepare CUDA, parallel programming, and core systems knowledge in advance; otherwise, the learning curve becomes noticeably steep in the later parts of the course, especially around LLM optimization and inference. That said, once the pace is manageable, the course offers strong long-term value for those pursuing work in LLM infrastructure, ML systems, or AI compilers.
+
+## Recommended Learning Path
+
+The course itself is relatively well-structured and progressive. However, for students without prior experience in systems and parallel computing, the transition into the second part of the course may feel somewhat steep. A key aspect of this course is spending significant time implementing and optimizing systems in practice. Therefore, it is highly recommended to explore relevant open-source projects on GitHub while reading papers, and to implement related systems or kernels hands-on to deepen understanding.
+
+- Foundations: consider studying alongside open-source projects such as [micrograd](https://github.com/karpathy/micrograd)
+- Systems & performance optimization and LLM systems: consider pairing with projects such as [nanoGPT](https://github.com/karpathy/nanoGPT) and [nano-vllm](https://github.com/GeeeekExplorer/nano-vllm)
+
+The course website itself provides a curated list of additional references and materials, which can be found here:
+[Book-related documentation and courses](https://hao-ai-lab.github.io/cse234-w25/resources/#book-related-documentation-and-courses)
 ## Course Resources
@@ -51,4 +61,8 @@ The defining characteristic of CSE234 is its strong focus on LLM systems as the
 ## Resource Summary
-All course materials are released in open-source form. However, the online grading infrastructure and reference solutions for assignments have not been made public.
\ No newline at end of file
+All course materials are released in open-source form. However, the online grading infrastructure and reference solutions for assignments have not been made public.
+
+## Additional Resources / Further Reading
+
+- [GPUMode](https://www.youtube.com/@GPUMODE): offers in-depth explanations of GPU kernels and systems. Topics referenced in the course—such as [DistServe](https://www.youtube.com/watch?v=tIPDwUepXcA), [FlashAttention](https://www.youtube.com/watch?v=VPslgC9piIw), and [Triton](https://www.youtube.com/watch?v=njgow_zaJMw)—all have excellent extended talks available.
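\ No newline at end of file
diff --git a/docs/机器学习系统/CSE234.md b/docs/机器学习系统/CSE234.md
index c88cdc78..b6441216 100644
--- a/docs/机器学习系统/CSE234.md
+++ b/docs/机器学习系统/CSE234.md
@@ -45,6 +45,16 @@
 The defining characteristic of CSE234 is its strong focus on LLM systems as the core application setting, emphasizing the trade-offs and engineering constraints of real system design rather than staying at the level of algorithms or API usage. Assignments usually require confronting performance bottlenecks directly (such as memory bandwidth, communication overhead, and kernel fusion) and resolving them with Triton or system-level optimizations, which is very helpful for understanding "why certain LLM systems are designed the way they are." The learning experience is hardcore overall and demands a strong background in systems and parallel computing up front. For self-study, it is advisable to fill in CUDA/parallel programming and basic systems knowledge in advance; otherwise the learning curve in the second half (especially the LLM optimization and inference material) feels noticeably steep. Once you keep up with the pace, however, the course offers strong long-term value for those working in LLM Infra / ML Systems / AI Compiler.
+## Recommended Learning Path
+
+The course itself is fairly progressive, but students without a background in systems and parallel computing may find the second part somewhat steep. The core of the course is spending a lot of time implementing and optimizing systems hands-on, so while reading papers it is a good idea to look for related open-source projects on GitHub and implement the relevant systems or kernels yourself to deepen understanding.
+
+- Foundations: study alongside open-source projects such as [micrograd](https://github.com/karpathy/micrograd)
+- Systems and performance optimization & LLM systems: pair with open-source projects such as [nanoGPT](https://github.com/karpathy/nanoGPT) and [nano-vllm](https://github.com/GeeeekExplorer/nano-vllm)
+
+The course page itself provides some background material and resources; see: [Book related documentation and courses](https://hao-ai-lab.github.io/cse234-w25/resources/#book-related-documentation-and-courses)
+
+
 ## Course Resources
 - Course website: https://hao-ai-lab.github.io/cse234-w25/
@@ -54,4 +64,8 @@ The defining characteristic of CSE234 is its strong focus on LLM systems as the
 ## Resource Summary
-All course content has been released in open-source form, but the online grading system and the assignment reference solutions have not yet been open-sourced.
\ No newline at end of file
+All course content has been released in open-source form, but the online grading system and the assignment reference solutions have not yet been open-sourced.
+
+## Additional Resources / Further Reading
+
+- [GPUMode](https://www.youtube.com/@GPUMODE): offers many in-depth explanations of GPU kernels and systems. Topics mentioned in the course, including [DistServe](https://www.youtube.com/watch?v=tIPDwUepXcA), [FlashAttention](https://www.youtube.com/watch?v=VPslgC9piIw), and [Triton](https://www.youtube.com/watch?v=njgow_zaJMw), all have good extended talks there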