From b5d06f3f4ca166c60a671d9c565f2623e4389632 Mon Sep 17 00:00:00 2001
From: Yinmin Zhong
Date: Sun, 8 Jun 2025 00:05:00 +0800
Subject: [PATCH] update cmu11-868

---
 docs/深度生成模型/大语言模型/CMU11-868.en.md | 49 ++++++++++++--------
 docs/深度生成模型/大语言模型/CMU11-868.md | 37 ++++++++-------
 2 files changed, 51 insertions(+), 35 deletions(-)

diff --git a/docs/深度生成模型/大语言模型/CMU11-868.en.md b/docs/深度生成模型/大语言模型/CMU11-868.en.md
index f68b1471..ea940a2b 100644
--- a/docs/深度生成模型/大语言模型/CMU11-868.en.md
+++ b/docs/深度生成模型/大语言模型/CMU11-868.en.md
@@ -1,29 +1,40 @@
-# CMU 11868: Large Language Model System
+# CMU 11-868: Large Language Model Systems (LLM Systems)
 
 ## Course Overview
 
-- University: Carnegie Mellon University (CMU)
-- Prerequisites: Basic knowledge of deep learning
-- Programming Language: Python, CUDA
-- Difficulty: 🌟🌟🌟🌟
-- Class Hour: 60 hours
+- University: Carnegie Mellon University
+- Prerequisites: Strongly recommended to have taken Deep Learning (11-785) or Advanced NLP (11-611 or 11-711)
+- Programming Language: Python
+- Course Difficulty: 🌟🌟🌟🌟
+- Estimated Workload: 120 hours
 
-In recent years, the progress of artificial intelligence has benefited greatly from the rapid development of large language models (LLMs) and other generative methods. These models are usually huge in scale (e.g., GPT-3 has 175 billion parameters), so it is crucial to develop scalable LLM Systems.
-In this course, students will learn the core skills of designing LLMs at the system level.
-One of the major differences between this course and other similar courses is that there are quite a few GPU acceleration technologies involved in this course. The course will introduce the famous [FlashAttention](https://llmsystem.github.io/llmsystem2024spring/assets/files/Group-FlashAttention-0b70d553037a7729dd2a9af5e23d8b3e.pdf), and the experiments also require you to implement some operators to accelerate training.
-Overall, the course is very suitable for students who are interested in the system design of large language models.
+This graduate-level course focuses on the full stack of large language model (LLM) systems — from algorithms to engineering. The curriculum covers, but is not limited to:
 
-This course requires you to have a certain amount of preparation for deep learning and is not suitable for complete beginners. You can see the prerequisites in the [FAQ](https://llmsystem.github.io/llmsystem2024spring/docs/FAQ).
-The experiments are generally challenging, and the main contents are as follows:
+1. **GPU Programming and Automatic Differentiation**: Master CUDA kernel calls, fundamentals of parallel programming, and deep learning framework design.
+2. **Model Training and Distributed Systems**: Learn efficient training algorithms, communication optimizations (e.g., ZeRO, FlashAttention), and distributed training frameworks like DDP, GPipe, and Megatron-LM.
+3. **Model Compression and Acceleration**: Study quantization (GPTQ), sparsity (MoE), compiler technologies (JAX, Triton), and inference-time serving systems (vLLM, CacheGen).
+4. **Cutting-Edge Topics and Systems Practice**: Includes retrieval-augmented generation (RAG), multimodal LLMs, RLHF systems, and end-to-end deployment, monitoring, and maintenance.
 
-1. Assignment1: Automatic differentiation framework + handwritten CUDA operator + basic neural network construction
-2. Assignmant2: GPT2 model construction
-3. Assignment3: Optimize model training speed by optimizing Softmax and LayerNorm operators written in CUDA
-4. Assignment4: Distributed model training, which may not be easy to configure the environment for self-study
+Compared to similar courses, this one stands out for its **tight integration with recent papers and open-source implementations** (hands-on work expanding CUDA support in the miniTorch framework), a **project-driven assignment structure** (five programming assignments + a final project), and **guest lectures from industry experts**, offering students real-world insights into LLM engineering challenges and solutions.
 
-Like many other high-quality courses, the slides and assignments of this course are open-source, with quite detailed local test code, suitable for self-study.
+**Self-Study Tips**:
+
+- Set up a CUDA-compatible environment in advance (NVIDIA GPU + CUDA Toolkit + PyTorch).
+- Review fundamentals of parallel computing and deep learning (autograd, tensor operations).
+- Carefully read the assigned papers and slides before each lecture, and follow the assignments to extend the miniTorch framework from pure Python to real CUDA kernels.
+
+This course assumes a solid understanding of deep learning and is **not suitable for complete beginners**. See the [FAQ](https://llmsystem.github.io/llmsystem2024spring/docs/FAQ) for more on prerequisites.
+
+The assignments are fairly challenging and include:
+
+1. **Assignment 1**: Implement an autograd framework + custom CUDA ops + basic neural networks
+2. **Assignment 2**: Build a GPT2 model from scratch
+3. **Assignment 3**: Accelerate training with custom CUDA kernels for Softmax and LayerNorm
+4. **Assignment 4**: Implement distributed model training (difficult to configure independently for self-study)
 
 ## Course Resources
 
-- Course Website: [https://llmsystem.github.io](https://llmsystem.github.io)
-- Assignments:
\ No newline at end of file
+- Course Website:
+- Syllabus:
+- Assignments:
+- Course Texts: Selected research papers + selected chapters from *Programming Massively Parallel Processors (4th Edition)*
diff --git a/docs/深度生成模型/大语言模型/CMU11-868.md b/docs/深度生成模型/大语言模型/CMU11-868.md
index 64c36f8f..c257f55c 100644
--- a/docs/深度生成模型/大语言模型/CMU11-868.md
+++ b/docs/深度生成模型/大语言模型/CMU11-868.md
@@ -1,20 +1,27 @@
-# CMU 11868: Large Language Model System
+# CMU 11-868: Large Language Model Systems (LLM Systems)
 
 ## 课程简介
 
-- 所属大学:CMU
-- 先修要求:深度学习基础知识
-- 编程语言:Python, CUDA
-- 课程难度:🌟🌟🌟🌟
-- 预计学时:60 小时
+- 所属大学:Carnegie Mellon University
+- 先修要求:强烈建议已修读 Deep Learning (11785) 或 Advanced NLP (11-611 或 11-711)
+- 编程语言:Python
+- 课程难度:🌟🌟🌟🌟
+- 预计学时:120 学时
 
+该课程面向研究生开设,聚焦“从算法到工程”的大语言模型系统构建全过程。课程内容包括但不限于:
 
+1. **GPU 编程与自动微分**:掌握 CUDA kernel 调用、并行编程基础,以及深度学习框架设计原理。
+2. **模型训练与分布式系统**:学习高效的训练算法、通信优化(ZeRO、FlashAttention)、分布式训练框架(DDP、GPipe、Megatron-LM)。
+3. **模型压缩与加速**:量化(GPTQ)、稀疏化(MoE)、编译技术(JAX、Triton)、以及推理时的服务化设计(vLLM、CacheGen)。
+4. **前沿技术与系统实践**:涵盖检索增强生成(RAG)、多模态 LLM、RLHF 系统,以及端到端的在线维护和监控。
 
-近年来,人工智能的进步在很大程度上得益于大型语言模型(LLMs)及其他生成式方法的快速发展。这些模型通常规模巨大(例如 GPT-3 有 1750 亿参数),因此开发可扩展的 LLM System 变得至关重要。
-在这门课程中,学生将在系统层面学习设计 LLM 的核心技能。
-和其他类似课程一个较大的区别是本课程中涉及到了相当多的 GPU 加速技术,课程会介绍著名的 [FlashAttention](https://llmsystem.github.io/llmsystem2024spring/assets/files/Group-FlashAttention-0b70d553037a7729dd2a9af5e23d8b3e.pdf), 实验也要求你实现一些算子来加速训练。
-此外, 课程还涉及一些系统上的加速技术,如 [PagedAttention](https://llmsystem.github.io/llmsystem2024spring/assets/files/Group-vLLM-presentation-8fab23dec42abb93f4075b63f1cc9e83.pptx) 和分布式训练。总体来说非常适合对于大模型在系统设计层面技术感兴趣的同学。
+与同类课程相比,本课程的优势在于**紧密结合最新论文与开源实现**(通过 miniTorch 框架动手扩展 CUDA 支持);**项目驱动**的作业体系(五次编程作业 + 期末大项目);以及**工业嘉宾讲座**,能让学生近距离了解真实世界中 LLM 工程实践的挑战与解决方案。
 
+**自学建议**:
+
+- 提前配置好支持 CUDA 的开发环境(NVIDIA GPU + CUDA Toolkit + PyTorch)。
+- 复习并行计算和深度学习基础(自动微分、张量运算)。
+- 阅读每次课前指定的论文与幻灯片,跟着作业把 miniTorch 框架从纯 Python 拓展到真实 CUDA 内核。
 
 该课程要求你对深度学习有一定的预备知识,不适合纯小白入手,可见 [FAQ](https://llmsystem.github.io/llmsystem2024spring/docs/FAQ) 的先修要求。
 实验总体来说是有难度的,主要内容如下:
@@ -24,11 +31,9 @@
 3. Assignment3: 通过手写 CUDA 的 Softmax 和 LayerNorm 算子优化模型训练速度
 4. Assignment4: 分布式模型训练,自学的话可能不太好配置环境
 
-
-和众多优质课程一样,该课程幻灯片和作业都是开源的,有相当详尽的本地测试代码,适合自学。
-
-
 ## 课程资源
 
-- 课程网站:
-- 课程作业:
\ No newline at end of file
+- 课程网站:
+- 课程大纲:
+- 课程作业:
+- 课程教材:精选论文 + 《Programming Massively Parallel Processors, 4th Ed》 部分章节