# Arguments

## CLI, environment variables and .env

The arguments for launching happy_vLLM are all defined in `utils_args.py`. They can be set via three methods which are, in order of priority:

- directly from the CLI (for example, using the entrypoint: `happy-vllm --model path_to_model`)
- from environment variables (for example, `export MODEL="path_to_model"` and then using the entrypoint `happy-vllm`)
- from a `.env` file located where you use the `happy-vllm` entrypoint (see the `.env.example` provided for a complete description)
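
For instance, a minimal `.env` file might look like the following sketch (the values are purely illustrative; see `.env.example` for the full set of variables):

```bash
# Illustrative .env file, read from the directory where happy-vllm is launched
MODEL="path_to_model"
HOST="127.0.0.1"
PORT=5000
```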
Note that in the CLI, the arguments are lowercase with hyphens (`-`), whereas in the environment variables and the `.env` file, they are uppercase with underscores (`_`). So, for example, the argument `max-model-len` in the CLI corresponds to `MAX_MODEL_LEN` in the environment variables or the `.env` file.
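
The following two invocations are thus equivalent (a sketch, assuming `path_to_model` points to a valid model):

```bash
# CLI form: lowercase with hyphens
happy-vllm --model path_to_model --max-model-len 4096

# Environment-variable form: uppercase with underscores
export MODEL="path_to_model"
export MAX_MODEL_LEN=4096
happy-vllm
```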
The order of priority means that an argument passed via the CLI will always overwrite the same argument defined in the environment variables or in a `.env` file. Similarly, environment variables will always trump the variables from a `.env` file.
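
For example (an illustrative sketch; the port values are arbitrary):

```bash
# Suppose the .env file in the current directory contains: PORT=5000
export PORT=8000         # the environment variable trumps the .env value
happy-vllm --port 9000   # the CLI wins: the server listens on port 9000
```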
## Arguments definition

### Application arguments

Here is a list of arguments useful for the application (they all have default values and are optional):
- `host`: The name of the host (default value is `127.0.0.1`)
- `port`: The port number (default value is `5000`)
- `model-name`: The name of the model which will be given by the `/v1/info` and `/v1/models` endpoints. Knowing the name of the model is important to be able to use the `/v1/completions` and `/v1/chat/completions` endpoints (default value is `?`)
- `app-name`: The name of the application (default value is `happy_vllm`)
- `api-endpoint-prefix`: The prefix added to all the API endpoints (default value is no prefix)
- `explicit-errors`: If `False`, the message displayed when a 500 error is encountered will be `Internal Server Error`. If `True`, the message displayed will be more explicit and give information on the underlying error. The `True` setting is not recommended in a production setting (default value is `False`)
- `allow-credentials`: CORS setting (default value is `False`)
- `allowed-origins`: CORS setting (default value is `["*"]`)
- `allowed-methods`: CORS setting (default value is `["*"]`)
- `allowed-headers`: CORS setting (default value is `["*"]`)
- `uvicorn-log-level`: The log level of uvicorn (default value is `info`)
- `ssl-keyfile`: Uvicorn setting, the file path to the SSL key file (default value is `None`)
- `ssl-certfile`: Uvicorn setting, the file path to the SSL cert file (default value is `None`)
- `ssl-ca-certs`: Uvicorn setting, the CA certificates file (default value is `None`)
- `ssl-cert-reqs`: Uvicorn setting, whether a client certificate is required (see the stdlib `ssl` module) (default value is `0`)
- `root-path`: The FastAPI root path (default value is `None`)
- `lora-modules`: LoRA module configurations in the format `name=path`
- `chat-template`: The file path to the chat template, or the template in single-line form, for the specified model (see the vLLM documentation for more details). Useful for the `/v1/chat/completions` endpoint
- `chat-template-content-format`: The format to render message content within a chat template. `string` will render the content as a string (example: `"Hello World"`); `openai` will render the content as a list of dictionaries, similar to the OpenAI schema (example: `[{"type": "text", "text": "Hello world!"}]`) (default value is `auto`)
- `response-role`: The role name to return if `request.add_generation_prompt=true`. Useful for the `/v1/chat/completions` endpoint
- `with-launch-arguments`: Whether the route `/v1/launch_arguments` gives the launch arguments or an empty JSON (default value is `False`)
- `return-tokens-as-token-ids`: When `--max-logprobs` is specified, represents single tokens as strings of the form `token_id:{token_id}` so that tokens which are not JSON-encodable can be identified (default value is `False`)
- `max-log-len`: Max number of prompt characters or prompt ID numbers printed in the log (default value is unlimited)
- `prompt-adapters`: Prompt adapter configurations in the format `name=path`. Multiple adapters can be specified (default value is `None`)
- `disable-frontend-multiprocessing`: If specified, will run the OpenAI frontend server in the same process as the model serving engine (default value is `False`)
- `enable-request-id-headers`: If specified, the API server will add the `X-Request-Id` header to responses. Caution: this hurts performance at high QPS (default value is `False`)
- `enable-auto-tool-choice`: Enable auto tool choice for supported models. Use `--tool-call-parser` to specify which parser to use (default value is `False`)
- `tool-call-parser`: Select the tool call parser depending on the model that you're using. This is used to parse the model-generated tool calls. Required for `--enable-auto-tool-choice` (default value is `None`; only `mistral` and `hermes` are allowed)
- `tool-parser-plugin`: The tool parser plugin used to parse model-generated tool calls into the OpenAI API format; the names registered in this plugin can be used in `--tool-call-parser` (default value is `""`)
- `disable-fastapi-docs`: Disable FastAPI's OpenAPI schema, Swagger UI, and ReDoc endpoints (default value is `False`)
- `enable-prompt-tokens-details`: If set to `True`, enables `prompt_tokens_details` in the usage (default value is `False`)
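
As an illustration, here is a sketch of a launch combining several of these application arguments (all values are placeholders):

```bash
happy-vllm --model path_to_model \
           --model-name my-model \
           --host 0.0.0.0 \
           --port 8000 \
           --uvicorn-log-level debug
```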
### Model arguments

All the usual vLLM arguments for the model and for vLLM itself are usable. They all have default values (defined by vLLM) and are optional. The exhaustive list can be found in the vLLM source code or in the vLLM documentation. Here are some of them:
- `model`: The path to the model or the Hugging Face repository
- `dtype`: The data type for model weights and activations
- `max-model-len`: The model context length. If unspecified, it will be automatically derived from the model
- `gpu-memory-utilization`: The fraction of GPU memory to be used for the model executor, which can range from 0 to 1. If unspecified, the default value of 0.9 will be used
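
For example, a launch specifying a few of these model arguments might look like this (a sketch; the repository name and values are illustrative):

```bash
happy-vllm --model mistralai/Mistral-7B-Instruct-v0.2 \
           --dtype bfloat16 \
           --max-model-len 8192 \
           --gpu-memory-utilization 0.8
```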