
This is a BentoML example project, showing you how to serve and deploy open-source Large Language Models using vLLM, a high-throughput and memory-efficient inference engine.
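
To ground the idea, here is a minimal sketch of what such a Service can look like. The model ID, engine arguments, and endpoint name are illustrative assumptions, not the exact code from this project:

```python
import uuid
from typing import AsyncGenerator

import bentoml

# Assumed model; swap in any vLLM-supported model you want to serve.
MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"


@bentoml.service(resources={"gpu": 1}, traffic={"timeout": 300})
class VLLM:
    def __init__(self) -> None:
        from vllm import AsyncEngineArgs, AsyncLLMEngine

        # Build the async vLLM engine once, when the Service starts.
        self.engine = AsyncLLMEngine.from_engine_args(
            AsyncEngineArgs(model=MODEL_ID, max_model_len=2048)
        )

    @bentoml.api
    async def generate(
        self, prompt: str = "Explain vLLM briefly.", max_tokens: int = 256
    ) -> AsyncGenerator[str, None]:
        from vllm import SamplingParams

        # Stream text back incrementally as vLLM produces it.
        stream = self.engine.generate(
            prompt,
            SamplingParams(max_tokens=max_tokens),
            request_id=uuid.uuid4().hex,
        )
        cursor = 0
        async for request_output in stream:
            text = request_output.outputs[0].text
            yield text[cursor:]  # emit only the newly generated suffix
            cursor = len(text)
```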

💡 This example serves as a basis for advanced code customization, such as custom models, inference logic, or vLLM options. For simple LLM hosting with an OpenAI-compatible endpoint that requires no code, see OpenLLM.

This Service uses the @openai_endpoints decorator to set up OpenAI-compatible endpoints (chat/completions and completions). This means your client can interact with the backend Service (in this case, the VLLM class) as if it were communicating directly with OpenAI's API. The decorator does not affect your BentoML Service code, and you can use it for other LLMs as well.
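
As a rough sketch, the decorator stacks on top of the Service definition shown above. The import path and the model_id parameter follow this example project's helper module and should be treated as assumptions rather than a stable BentoML API:

```python
from bentovllm_openai.utils import openai_endpoints  # helper assumed to ship with this example

import bentoml


@openai_endpoints(model_id=MODEL_ID)  # adds OpenAI-style /v1/chat/completions and /v1/completions routes
@bentoml.service(resources={"gpu": 1}, traffic={"timeout": 300})
class VLLM:
    ...  # same engine setup and generate API as in the sketch above
```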

Note: If your Service is deployed with protected endpoints on BentoCloud, you need to set the environment variable OPENAI_API_KEY to your BentoCloud API key first.
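
A standard OpenAI Python client can then point at the deployment. The base URL below is a hypothetical placeholder for your own deployment's address:

```python
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://my-vllm-service.bentoml.ai/v1",  # hypothetical deployment URL
    api_key=os.environ["OPENAI_API_KEY"],  # your BentoCloud API key for protected endpoints
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # assumed model id served by the Service
    messages=[{"role": "user", "content": "What is vLLM?"}],
)
print(response.choices[0].message.content)
```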
