Getting Started with vLLM
Introduction
In this guide, we will walk through how to get started with vLLM. We'll cover the essential steps of setting up your system for vLLM, explore its installation options, use Mistral as our base model, and finally show you how to combine vLLM with LangChain so you can build advanced language applications with ease.
Machine Specs for the tutorial
Hardware Specifications:
GPU Information:
GPU Type: Nvidia T4
Number of GPUs: 2
GPU Memory: 16 GB GDDR6 (per GPU)
Google Compute Engine Machine Type:
Type: n1-standard-4
vCPUs: 4
Cores: 2
Memory: 15 GB
Disk Information:
Disk Type: Balanced Persistent Disk
Disk Size: 150 GB
Software Specifications:
Operating System: Debian 11
CUDA version: 12.1
Python version: 3.10
What is vLLM?
vLLM is an advanced framework designed to enhance the deployment and performance of Large Language Models (LLMs) by focusing on high-throughput, low-latency inference. The framework supports a range of models, allowing flexibility and ease in deploying LLMs across various platforms. vLLM is particularly beneficial for applications requiring quick, efficient language processing, and its simple server setup and compatibility with the OpenAI API protocol add to its user-friendliness. The framework's remarkable inference speed, even including model loading times, positions vLLM as a powerful tool for leveraging LLM capabilities in a wide array of practical applications.
Setting up vLLM on a virtual machine
There are various ways to set up vLLM. In this tutorial we will first install it directly on a virtual machine, and then use Docker as a second way to set it up.
1. Prepare a virtual environment (optional)
Although optional, creating a dedicated conda environment is recommended for better management of dependencies and packages. Execute:
# (Optional) Create a new conda environment.
conda create -n myenv python=3.9 -y
conda activate myenv
2. Install vLLM
Ensure your system is equipped with CUDA 12.1 to leverage GPU acceleration, which enhances vLLM's performance.
pip install vllm
3. Install the Ray framework
Install the Ray framework to unlock the power of distributed inference. This framework is adept at partitioning model weights among multiple GPUs, ensuring efficient and scalable model inference.
pip install ray
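As a quick sanity check (a minimal sketch, not a required step), you can confirm that Ray detects both GPUs before moving on:
import ray
# Start a local Ray instance and report the resources it detects.
ray.init()
print(ray.cluster_resources())  # expect an entry like {'GPU': 2.0, ...}
ray.shutdown()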
Test using offline batch inference
We will test the setup through offline batch inference to ensure the vLLM setup is properly configured and operational. This method allows you to process data in batches without the need for an active server connection.
1. Initialize the LLM
Begin by importing the LLM class from the vllm package. Initialize the LLM object with your chosen model, in this case mistralai/Mistral-7B-v0.1.
Setting tensor_parallel_size to 2 means that the model's computation is split across 2 GPUs. This allows vLLM to serve larger models or to speed up computation through parallel processing, and it's particularly useful when the model is too large to fit into the memory of a single GPU.
The dtype parameter specifies the data type of the model's weights. In this case, "float16" indicates that the model uses 16-bit floating-point numbers. Using float16 (also known as half precision) can significantly reduce memory usage and speed up computation.
from vllm import LLM
llm = LLM("mistralai/Mistral-7B-v0.1", tensor_parallel_size=2, dtype="float16")
2. Generate Output:
Use the generate method of the LLM object to input your query and receive the model's output.
output = llm.generate("What is deep learning?")
# generate() returns a list of RequestOutput objects; print the generated text.
print(output[0].outputs[0].text)
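Offline inference is most useful when you batch several prompts together. Below is a minimal sketch (the prompts and sampling values are illustrative) that passes a list of prompts along with vLLM's SamplingParams:
from vllm import SamplingParams
# Illustrative prompts and sampling settings; adjust them to your own workload.
prompts = [
    "Explain deep learning in one sentence.",
    "List two use cases for large language models.",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)
# llm is the LLM object created above; outputs come back in the same order as the prompts.
outputs = llm.generate(prompts, sampling_params)
for out in outputs:
    print(out.prompt, "->", out.outputs[0].text)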
Test using online inference
Online inference is a dynamic way to interact with the vLLM model in real time. It allows you to send prompts and receive responses on the fly.
1. Start the vLLM server
Launch the vLLM server with the necessary configurations to host your model. Use the following command to start the server.
python -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-7B-v0.1 --tensor-parallel-size=2 --dtype="float16"
2. Send a request to the vLLM endpoint
response=$(curl -H "Content-Type: application/json" \
-d '{
"model": "mistralai/Mistral-7B-v0.1",
"prompt": "Explain to me what is Deep learning ?",
"max_tokens": 300,
"temperature": 0
}' \
-s -w '\nTime: %{time_total}s\n' \
http://localhost:8000/v1/completions)
echo "$response"
Setting up vLLM using Docker
Deploying vLLM using Docker simplifies the process of setting up and managing your vLLM environment.
1. Launch the Docker container for vLLM
Use the Docker command below to initiate the container. This command pulls the latest vLLM image, configures the environment to utilize NVIDIA GPUs, and maps the necessary ports and volumes for your application.
sudo docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model mistralai/Mistral-7B-v0.1 \
--tensor-parallel-size=2 \
--dtype="float16"
--runtime nvidia: Specifies that the container should use the NVIDIA runtime, enabling GPU support.
--gpus all: Allocates all available GPUs to the container.
-v ~/.cache/huggingface:/root/.cache/huggingface: Mounts a volume to preserve the model cache between container runs. This avoids redundant downloads of the model.
-p 8000:8000: Maps port 8000 of the host to port 8000 of the container, ensuring that the vLLM service can be accessed externally.
--ipc=host: Sets the IPC mode to 'host', solving shared memory limitations which are common in deep learning applications.
vllm/vllm-openai:latest: Pulls the latest vLLM image from the Docker repository.
Model configuration parameters (--model, --tensor-parallel-size, --dtype): Specify the model, the parallelization settings, and the data type for model weights.
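Once the container is running, you can check that the server is up and the model is loaded. Here is a minimal sketch using the requests library against the OpenAI-compatible /v1/models endpoint:
import requests
# Ask the running vLLM server which models it is serving.
resp = requests.get("http://localhost:8000/v1/models")
resp.raise_for_status()
for model in resp.json().get("data", []):
    print(model["id"])  # expect: mistralai/Mistral-7B-v0.1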
Integrate Langchain with vLLM
LangChain provides a rich suite of tools and functionalities designed to augment the power of language models like Mistral. Here's how you can integrate LangChain with vLLM.
Start by incorporating LangChain into your environment, then import the VLLMOpenAI class from the langchain_community.llms module.
pip install langchain langchain-community
from langchain_community.llms import VLLMOpenAI
llm = VLLMOpenAI(
openai_api_key="EMPTY",
openai_api_base="http://localhost:8000/v1",
model_name="mistralai/Mistral-7B-v0.1",
max_tokens="300",
model_kwargs={"stop": ["."]},
)
print(llm.invoke("what is deep learning"))
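From here you can plug the vLLM-backed model into LangChain's usual building blocks. A minimal sketch (assuming a LangChain version that supports pipe-style composition) pairing it with a prompt template:
from langchain_core.prompts import PromptTemplate
# A simple prompt template chained to the vLLM-backed model defined above.
prompt = PromptTemplate.from_template("Answer in one short paragraph: {question}")
chain = prompt | llm
print(chain.invoke({"question": "What is deep learning?"}))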
Display vLLM Metrics
Monitoring the performance and health of your vLLM instance is crucial for maintaining optimal operation and troubleshooting potential issues. The vLLM metrics endpoint provides a comprehensive overview of the system's performance indicators. Here's how to retrieve and interpret the metrics:
curl http://localhost:8000/metrics
Explanation of some of the returned metrics
requests_total_counter: Indicates the total number of requests received, categorized by method and path.
responses_total_counter: Reflects the total number of responses sent, allowing you to gauge the server's responsiveness.
status_codes_counter: Shows the distribution of response status codes, helping you identify potential errors or issues with request handling.
vllm:avg_generation_throughput_toks_per_s: Measures the average generation throughput in tokens per second, offering a sense of the model's performance and efficiency.
An example of a real metrics response:
# HELP exceptions_total_counter Total number of requested which generated an exception
# TYPE exceptions_total_counter counter
# HELP requests_total_counter Total number of requests received
# TYPE requests_total_counter counter
requests_total_counter{method="POST",path="/v1/completions"} 8
requests_total_counter{method="POST",path="/generate"} 2
requests_total_counter{method="POST",path="/v1/"} 2
requests_total_counter{method="POST",path="/"} 3
requests_total_counter{method="GET",path="/v1/metrics"} 1
# HELP responses_total_counter Total number of responses sent
# TYPE responses_total_counter counter
responses_total_counter{method="POST",path="/v1/completions"} 8
responses_total_counter{method="POST",path="/generate"} 2
responses_total_counter{method="POST",path="/v1/"} 2
responses_total_counter{method="POST",path="/"} 3
responses_total_counter{method="GET",path="/v1/metrics"} 1
# HELP status_codes_counter Total number of response status codes
# TYPE status_codes_counter counter
status_codes_counter{method="POST",path="/v1/completions",status_code="200"} 5
status_codes_counter{method="POST",path="/generate",status_code="404"} 2
status_codes_counter{method="POST",path="/v1/completions",status_code="HTTPStatus.BAD_REQUEST"} 3
status_codes_counter{method="POST",path="/v1/",status_code="404"} 2
status_codes_counter{method="POST",path="/",status_code="404"} 3
status_codes_counter{method="GET",path="/v1/metrics",status_code="404"} 1
# HELP vllm:avg_generation_throughput_toks_per_s Average generation throughput in tokens/s.
# TYPE vllm:avg_generation_throughput_toks_per_s gauge
vllm:avg_generation_throughput_toks_per_s{model_name="mistralai/Mistral-7B-v0.1"} 0.0
# HELP vllm:avg_prompt_throughput_toks_per_s Average prefill throughput in tokens/s.
# TYPE vllm:avg_prompt_throughput_toks_per_s gauge
vllm:avg_prompt_throughput_toks_per_s{model_name="mistralai/Mistral-7B-v0.1"} 0.0
# HELP vllm:cpu_cache_usage_perc CPU KV-cache usage. 1 means 100 percent usage.
# TYPE vllm:cpu_cache_usage_perc gauge
vllm:cpu_cache_usage_perc{model_name="mistralai/Mistral-7B-v0.1"} 0.0
# HELP vllm:gpu_cache_usage_perc GPU KV-cache usage. 1 means 100 percent usage.
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="mistralai/Mistral-7B-v0.1"} 0.0
# HELP vllm:num_requests_running Number of requests that is currently running for inference.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="mistralai/Mistral-7B-v0.1"} 0
# HELP vllm:num_requests_swapped Number requests swapped to CPU.
# TYPE vllm:num_requests_swapped gauge
vllm:num_requests_swapped{model_name="mistralai/Mistral-7B-v0.1"} 0
# HELP vllm:num_requests_waiting Number of requests waiting to be processed.
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="mistralai/Mistral-7B-v0.1"} 0
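The output is in the standard Prometheus text format, so it is easy to scrape programmatically. A minimal sketch (using the requests library; the vllm: prefix filter is just an illustrative choice) that prints only the vLLM-specific metrics:
import requests
# Fetch the Prometheus-format metrics and keep only the vLLM-specific lines.
metrics = requests.get("http://localhost:8000/metrics").text
for line in metrics.splitlines():
    if line.startswith("vllm:"):
        print(line)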
References
https://docs.vllm.ai/en/latest/
https://python.langchain.com/docs/integrations/llms/vllm