What is vLLM?
vLLM is a fast and memory-efficient inference engine for large language models. It provides:
- High-throughput serving with PagedAttention
- OpenAI-compatible API for easy integration
- Support for popular models including fine-tuned versions
- Efficient memory management for better resource utilization
Why Use vLLM for Serving?
- Cost-effective: Run models locally without API costs
- Privacy: Keep your data and models on your infrastructure
- Speed: Optimized inference with batching and caching
- Compatibility: Drop-in replacement for OpenAI API calls
- Control: Full control over model deployment and scaling
Prerequisites
Before starting, ensure you have:
- Your fine-tuned model checkpoints. Optionally, you can upload them to Hugging Face (following our upload guide)
- Python 3.8 or higher
- CUDA-compatible GPU (recommended for better performance)
- Sufficient GPU memory for your model size
Installation and Setup
1. Install vLLM
Install vLLM using pip. For GPU support, make sure you have the appropriate CUDA drivers installed:
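A typical installation looks like the sketch below; on supported systems the `vllm` package pulls in a CUDA-enabled PyTorch build automatically:

```bash
# Install vLLM from PyPI
pip install vllm

# Optional: confirm the installation worked
python -c "import vllm; print(vllm.__version__)"
```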
If you encounter installation issues, check the vLLM installation guide for your specific system configuration.
2. Verify Your Model Access
If you uploaded your model to Hugging Face, ensure the repository is accessible. For private repositories, make sure you're logged in:
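One common way is the Hugging Face CLI; the token value below is a placeholder:

```bash
# Interactive login stores your Hugging Face token locally
huggingface-cli login

# Alternatively, export a token for non-interactive environments (placeholder value)
export HF_TOKEN=hf_xxxxxxxxxxxxxxxx
```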
Serving Full Fine-Tuned Models
Full fine-tuned models contain all the updated parameters and can be served directly with vLLM.
1. Start the vLLM Server
Launch your fine-tuned model as an OpenAI-compatible API server:
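A minimal launch sketch is shown below; recent vLLM releases expose the `vllm serve` command (older releases use `python -m vllm.entrypoints.openai.api_server --model ...`), and the host/port flags are optional examples:

```bash
# Serve the model behind an OpenAI-compatible API on port 8000
vllm serve your-username/your-model-name-full \
  --host 0.0.0.0 \
  --port 8000
```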
Replace `your-username/your-model-name-full` with either your local model path or your actual Hugging Face model repository name from the upload guide. For more options, see the vLLM documentation.
2. Test Your API
Once the server starts, test it with a simple API call:
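For example, with curl against the default endpoint (assuming the server is running on localhost:8000):

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "your-username/your-model-name-full",
        "messages": [{"role": "user", "content": "Hello! What can you do?"}]
      }'
```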
Serving LoRA Fine-Tuned Models
LoRA models require the base model plus the adapter weights. vLLM supports LoRA serving with some additional configuration.
1. Start vLLM with LoRA Support
For LoRA models, specify both the base model and the LoRA adapter:
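A sketch of the launch command, assuming an adapter repository named `your-username/your-model-name-lora` and an adapter name of `my-lora` (both placeholders):

```bash
# Serve the base model and register the LoRA adapter under the name "my-lora"
vllm serve Qwen/Qwen2.5-1.5B \
  --enable-lora \
  --lora-modules my-lora=your-username/your-model-name-lora \
  --port 8000
```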
Replace `Qwen/Qwen2.5-1.5B` with the appropriate base model ID from the model mapping table in the upload guide.
2. Test LoRA Model API
Test your LoRA model by specifying the LoRA name in the API call:
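For example, using the adapter name registered above (`my-lora` in this sketch) as the model field:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "my-lora",
        "messages": [{"role": "user", "content": "Hello from my LoRA model!"}]
      }'
```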
Using Your API in Applications
Once your model is serving, you can use it exactly like OpenAI's API in your applications.
Python Example
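A minimal sketch using the official openai Python package pointed at the local vLLM server; the model name and prompt are placeholders:

```python
from openai import OpenAI

# Point the client at the vLLM server; the API key is ignored unless
# the server was started with --api-key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="your-username/your-model-name-full",  # or your LoRA name, e.g. "my-lora"
    messages=[{"role": "user", "content": "Summarize what you were fine-tuned for."}],
)
print(response.choices[0].message.content)
```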
JavaScript Example
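A corresponding sketch with the official openai Node.js package (same placeholders as above):

```javascript
import OpenAI from "openai";

// Point the client at the vLLM server; the API key is ignored unless
// the server was started with --api-key.
const client = new OpenAI({
  baseURL: "http://localhost:8000/v1",
  apiKey: "not-needed",
});

const response = await client.chat.completions.create({
  model: "your-username/your-model-name-full", // or your LoRA name, e.g. "my-lora"
  messages: [{ role: "user", content: "Summarize what you were fine-tuned for." }],
});
console.log(response.choices[0].message.content);
```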
What's Next?
Now that your model is serving as an OpenAI-compatible API:
- Scale your deployment with load balancers and multiple instances
- Monitor performance with logging and metrics collection
- Integrate with applications using the familiar OpenAI API interface
- Experiment with different models from your Prem Studio fine-tuning jobs
Remember to keep your models updated by re-uploading improved versions to Hugging Face and restarting your vLLM servers with the new model versions.