GitHub - jmaczan/tiny-vllm: Build your own high performance LLM inference engine in C++ and CUDA - a smaller version of vLLM · GitHub

You're going to build a high-performance LLM inference engine in C++ and CUDA: tiny-vllm, a younger and smaller sibling of vLLM

We will learn a lot along the way, make mistakes, and derive the ideas and maths from scratch

This repository contains two things: 1. the full source code of the inference server and 2. a course in which I lead you through implementing the engine. Use it as a learning tool on your own path or, if you are a lecturer, as a teaching resource at your university

The inference engine covers:

- loading a real LLM from Safetensors (Llama 3.2 1B Instruct)
- a full LLM forward pass (prefill + decode)
- all computation in CUDA kernels
- online softmax, FlashAttention-like
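To give a feel for the last item, here is a minimal CPU-side sketch of online softmax in plain C++. It is not the repository's CUDA kernel; the names `SoftmaxStats`, `online_softmax_stats` and `online_softmax` are hypothetical, and the sketch assumes a non-empty input of finite values. The idea is to keep a running maximum and a running sum of exponentials in a single pass, rescaling the sum whenever the maximum changes, which is the same bookkeeping that FlashAttention-style kernels rely on.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Running statistics for a numerically stable softmax (hypothetical helper type).
struct SoftmaxStats {
    float max;  // largest value seen so far
    float sum;  // sum of exp(x - max) over the values seen so far
};

// One pass over the input: update max and sum together.
// Assumes all inputs are finite.
SoftmaxStats online_softmax_stats(const std::vector<float>& logits) {
    SoftmaxStats s{-std::numeric_limits<float>::infinity(), 0.0f};
    for (float x : logits) {
        const float new_max = std::max(s.max, x);
        // Rescale the old sum to the new maximum, then add the new term.
        s.sum = s.sum * std::exp(s.max - new_max) + std::exp(x - new_max);
        s.max = new_max;
    }
    return s;
}

// Second pass: normalize each element with the final statistics.
std::vector<float> online_softmax(const std::vector<float>& logits) {
    const SoftmaxStats s = online_softmax_stats(logits);
    std::vector<float> out(logits.size());
    for (std::size_t i = 0; i < logits.size(); ++i) {
        out[i] = std::exp(logits[i] - s.max) / s.sum;
    }
    return out;
}
```

Calling `online_softmax({1.0f, 2.0f, 3.0f})` returns three probabilities that sum to 1. Large inputs such as `{1000.0f, 1000.0f}` do not overflow, because every exponent is taken relative to the running maximum.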

Make yourself a hot beverage and let's begin

Intro: LLM, vLLM, models, inference servers

Technical prerequisites

Safetensors and your model

How floating-point numbers work and why we use bfloat16

Single token inference

CUDA kernel engineering - embeddings

RMSNorm and parallel reduction in CUDA

Residual connections

The column-major to row-major transposition trick

Feed forward network

Paged Attention CUDA kernel

Intro: LLM, vLLM, models, inference servers
