jmaczan/tiny-vllm: Build your own high performance LLM inference engine in C++ and CUDA - a smaller version of vLLM

You're going to build a high performance LLM inference engine with C++ and CUDA: tiny-vllm, a younger and smaller sibling of vLLM. We will learn a lot along the way, make mistakes, and derive the ideas and maths from scratch.

This repository consists of two things:

1. the full source code of the inference server
2. a course where I lead you through the process of implementing the engine

Feel invited to use it as a learning tool on your learning path. If you are a lecturer, feel welcome to use it as a teaching resource at your university.

The inference engine consists of:

- loading a real LLM from Safetensors (Llama 3.2 1B Instruct)
- a full LLM forward pass (prefill + decode)
- all computation done in CUDA kernels
- online softmax, FlashAttention-like

Make yourself a hot beverage and let's begin.

- Intro: LLM, vLLM, models, inference servers
- Technical prerequisites
- Safetensors and your model
- How floating-point numbers work and why we use bfloat16
- Single token inference
- CUDA kernel engineering - embeddings
- RMSNorm and parallel reduction in CUDA
- Residual connections
- The column-major to row-major transposition trick
- Feed forward network
- Paged Attention CUDA kernel