SGLang

From Wikipedia, the free encyclopedia
Open-source framework for large language model inference
SGLang
Developer LMSYS
Initial release January 17, 2024
Written in Python, Rust, CUDA, C++
Type Large language model inference engine
License Apache License 2.0
Website sglang.io
Repository github.com/sgl-project/sglang

SGLang (short for Structured Generation Language) is an open-source framework for programming and serving large language models and multimodal models. It was introduced by researchers affiliated with LMSYS[1] and other institutions as a system combining a Python-embedded language for structured generation with a runtime for high-throughput inference.[2] [3] [4]

The project is designed for low-latency, high-throughput inference workloads. Its documentation describes support for structured outputs, speculative decoding, continuous batching, quantization, and OpenAI-compatible APIs.[5]

History

SGLang was publicly introduced in January 2024 by researchers affiliated with Stanford, UC Berkeley, Texas A&M, and Shanghai Jiao Tong University.[2] Its academic description later appeared in the proceedings of NeurIPS 2024.[3] In January 2026, TechCrunch reported that contributors associated with the project had formed the startup RadixArk to commercialize services around SGLang while continuing its open-source development.[6] [7]

Architecture

According to the NeurIPS paper, SGLang consists of two main components: a front-end language embedded in Python and a back-end runtime for executing language model programs efficiently.[3] The front end provides primitives for generation, selection, and parallel control flow, while the runtime uses a set of optimizations intended to reduce repeated computation and improve throughput.[3]
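
The shape of a program built from such primitives can be illustrated with a toy sketch. The code below is not SGLang's API: the `ToyState` class, its `append`, `gen`, `select`, and `fork` methods, and the `stub_model` function are hypothetical names written for this example, and the "model" is a deterministic stub rather than a language model. It only shows how a prompt state might be extended with generated text, constrained to one of several choices, and split into branches that run in parallel.

```python
from concurrent.futures import ThreadPoolExecutor


def stub_model(prompt):
    """Hypothetical stand-in for a language model: returns a canned reply."""
    return f"<reply to {len(prompt)} chars>"


class ToyState:
    """Hypothetical prompt state for illustration; not part of SGLang."""

    def __init__(self, text=""):
        self.text = text   # prompt accumulated so far
        self.vars = {}     # named outputs captured by gen/select

    def append(self, chunk):
        self.text += chunk
        return self

    def gen(self, name):
        # Generation: continue the prompt and store the result under `name`.
        out = stub_model(self.text)
        self.vars[name] = out
        self.text += out
        return self

    def select(self, name, choices):
        # Selection: restrict the output to one of a fixed set of choices.
        # A real system would score the choices with the model; this stub
        # picks the shortest one so the example stays deterministic.
        pick = min(choices, key=len)
        self.vars[name] = pick
        self.text += pick
        return self

    def fork(self, n):
        # Parallel control flow: copy the state so branches share a prefix.
        return [ToyState(self.text) for _ in range(n)]


def run_demo():
    root = ToyState("Question: is water wet?\n")
    root.append("Answer (yes/no): ").select("verdict", ["yes", "no"])
    branches = root.fork(2)

    def expand(pair):
        i, branch = pair
        branch.append(f"\nReason {i + 1}: ").gen("reason")
        return branch.vars["reason"]

    # Each branch is expanded independently, here on a thread pool.
    with ThreadPoolExecutor() as pool:
        reasons = list(pool.map(expand, enumerate(branches)))
    return {"verdict": root.vars["verdict"], "reasons": reasons, "prefix": root.text}
```

Because every branch returned by `fork` starts from the same prompt text, a runtime that caches work for shared prefixes has an opportunity to avoid recomputing it for each branch.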

Among the techniques described by the project are RadixAttention for reusing key–value cache state across multiple generation calls, compressed finite-state machines for faster constrained decoding, and speculative execution for API-based models.[3] The current documentation also describes support for serving both language models and multimodal models across a range of hardware back ends.[5]
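
The idea of reusing cached state across calls that share a prompt prefix can be sketched with a small prefix tree. The following is a simplified illustration under stated assumptions, not SGLang's implementation: `PrefixCache` and its methods are hypothetical names, the structure is an uncompressed trie with one token per node rather than a radix tree, the stored "KV" value is a placeholder tuple, and there is no eviction policy or GPU memory management.

```python
class _Node:
    def __init__(self):
        self.children = {}  # token id -> _Node
        self.kv = None      # placeholder for cached per-token key-value state


class PrefixCache:
    """Hypothetical toy prefix cache; one trie node per token."""

    def __init__(self):
        self.root = _Node()

    def match_prefix(self, tokens):
        """Return how many leading tokens already have cached state."""
        node, matched = self.root, 0
        for tok in tokens:
            nxt = node.children.get(tok)
            if nxt is None:
                break
            node, matched = nxt, matched + 1
        return matched

    def insert(self, tokens):
        """Cache state for every token; return how many were newly computed."""
        node, computed = self.root, 0
        for tok in tokens:
            if tok not in node.children:
                child = _Node()
                child.kv = ("kv-for", tok)  # stand-in for real tensors
                node.children[tok] = child
                computed += 1
            node = node.children[tok]
        return computed


def demo():
    cache = PrefixCache()
    system = [1, 2, 3, 4]           # made-up token ids for a shared prompt
    first = system + [10, 11]
    second = system + [20, 21, 22]

    computed_first = cache.insert(first)     # nothing cached yet
    reused = cache.match_prefix(second)      # shared prefix is found
    computed_second = cache.insert(second)   # only the new suffix is added
    return computed_first, reused, computed_second
```

In this sketch the second request finds the four shared prompt tokens already cached and only adds state for its three new tokens, which is the kind of saving that prefix reuse is meant to provide.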

References

  1. ^ "LMSYS". GitHub. GitHub, Inc. Retrieved April 22, 2026.
  2. ^ a b "Fast and Expressive LLM Inference with RadixAttention and SGLang". LMSYS Org. January 17, 2024. Retrieved April 19, 2026.
  3. ^ a b c d e Zheng, Lianmin; Yin, Liangsheng; Xie, Zhiqiang; Sun, Chuyue; Huang, Jeff; Yu, Cody Hao; Cao, Shiyi; Kozyrakis, Christos; Stoica, Ion; Gonzalez, Joseph E.; Barrett, Clark; Sheng, Ying (2024). SGLang: Efficient Execution of Structured Language Model Programs (PDF). Advances in Neural Information Processing Systems 37. Retrieved April 19, 2026.
  4. ^ "SGLang". UC Berkeley Sky Computing Lab. April 25, 2024. Retrieved April 22, 2026.
  5. ^ a b "SGLang Documentation". SGLang. Retrieved April 19, 2026.
  6. ^ Hu, Krystal (January 21, 2026). "Sources: Project SGLang spins out as RadixArk with $400M valuation as inference market explodes". TechCrunch. Retrieved April 19, 2026.
  7. ^ R, Vignesh (January 23, 2026). "From Berkeley lab to $400M startup: SGLang becomes RadixArk". TFN. Retrieved April 22, 2026.