**demos/continuous_batching/README.md** (outdated diff)

> ## Server Deployment
>
> :::{dropdown} **Deploying with Docker**

**Review comment:** dropdown should be dropped.
**Pull request overview**

This PR updates OVMS demo documentation around continuous batching (LLM/VLM) and demo navigation, aiming to modernize titles, adjust the docs ToC structure, and add cross-links to NPU-accelerated variants.

Changes:
- Renames/rephrases top-level demo page titles and reorganizes `toctree` entries for generative demos.
- Updates the continuous batching LLM README with new deployment and request examples (Docker + Windows).
- Adds an NPU-acceleration section/link from the continuous batching VLM demo.

**Reviewed changes**

Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| demos/continuous_batching/vlm/README.md | Adds a toctree entry and an NPU-acceleration section/link for VLM demos. |
| demos/continuous_batching/speculative_decoding/README.md | Retitles the speculative decoding demo page. |
| demos/continuous_batching/README.md | Refactors the continuous batching LLM demo instructions, examples, and local ToC entries. |
| demos/README.md | Reorders/updates the demos landing page toctree to emphasize key generative demos. |
**Comments suppressed due to low confidence (1)**

demos/continuous_batching/README.md:118

The Windows Command Prompt cURL example still uses `meta-llama/Meta-Llama-3-8B-Instruct` while the Linux/PowerShell examples on this page use `Qwen3-30B-A3B-Instruct-2507-int4-ov`. This inconsistency will lead to copy/paste failures unless both models are actually loaded under those names.
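A sketch of what the aligned CMD example could look like, assuming `Qwen3-30B-A3B-Instruct-2507-int4-ov` is the name the server actually registers (the prompt and token limit are copied from the other examples on the page; note cmd.exe requires escaped double quotes in the JSON body):

```bat
curl http://localhost:8000/v3/chat/completions -H "Content-Type: application/json" -d "{\"model\": \"Qwen3-30B-A3B-Instruct-2507-int4-ov\", \"max_tokens\": 30, \"temperature\": 0, \"stream\": false, \"messages\": [{\"role\": \"system\", \"content\": \"You are a helpful assistant.\"}, {\"role\": \"user\", \"content\": \"If 1=3 2=3 3=5 4=4 5=4 Then, 6=?\"}]}"
```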
> It can be applied using the commands below:
> ```bash
> docker run -d --rm -p 8000:8000 --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) -v $(pwd)/models:/workspace:ro openvino/model_server:latest-gpu --rest_port 8000 --config_path /workspace/config.json
> ```
> :::
>
> :::{dropdown} **Deploying on Bare Metal**
>
> Assuming you have unpacked model server package, make sure to:
>
> - **On Windows**: run `setupvars` script
> - **On Linux**: set `LD_LIBRARY_PATH` and `PATH` environment variables
>
> as mentioned in [deployment guide](../../docs/deploying_server_baremetal.md), in every new shell that will start OpenVINO Model Server.
>
> Depending on how you prepared models in the first step of this demo, they are deployed to either CPU or GPU (it's defined in `graph.pbtxt`). If you run on GPU make sure to have appropriate drivers installed, so the device is accessible for the model server.
>
> ```bat
> ovms --rest_port 8000 --config_path ./models/config.json
> set MOE_USE_MICRO_GEMM_PREFILL=0
> ovms.exe --model_repository_path /models --source_model OpenVINO/Qwen3-30B-A3B-Instruct-2507-int4-ov --task text_generation --target_device GPU --tool_parser hermes3 --rest_port 8000 --model_name Qwen3-30B-A3B-Instruct-2507-int4-ov
> ```
**Review comment:** In the "GPU baremetal on Windows11" subsection, the surrounding explanation talks about adding extra Docker parameters, which doesn't apply to a bare-metal `ovms.exe` invocation. Please adjust this paragraph so it's Windows bare-metal specific (and move the Docker guidance under the Docker option).
**Review comment:** The Windows bare-metal example uses a Unix-style model repository path (`--model_repository_path /models`). On Windows this should be a Windows path (e.g., a relative `.\models` or an absolute `C:\...`) so the command works when copy/pasted.
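A hedged sketch of the corrected Windows invocation; the flag values are copied from the quoted hunk above, and the `.\models` path is an assumption about where the user exported the models:

```bat
ovms.exe --model_repository_path .\models --source_model OpenVINO/Qwen3-30B-A3B-Instruct-2507-int4-ov --task text_generation --target_device GPU --tool_parser hermes3 --rest_port 8000 --model_name Qwen3-30B-A3B-Instruct-2507-int4-ov
```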
````diff
@@ -166,7 +93,7 @@ curl http://localhost:8000/v3/chat/completions \
       },
       {
         "role": "user",
-        "content": "What is OpenVINO?"
+        "content": "If 1=3 2=3 3=5 4=4 5=4 Then, 6=?"
       }
     ]
   }'| jq .
@@ -179,12 +106,12 @@ Windows Powershell
 (Invoke-WebRequest -Uri "http://localhost:8000/v3/chat/completions" `
 -Method POST `
 -Headers @{ "Content-Type" = "application/json" } `
--Body '{"model": "meta-llama/Meta-Llama-3-8B-Instruct", "max_tokens": 30, "temperature": 0, "stream": false, "messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What are the 3 main tourist attractions in Paris?"}]}').Content
+-Body '{"model": "Qwen3-30B-A3B-Instruct-2507-int4-ov", "max_tokens": 30, "temperature": 0, "stream": false, "messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "If 1=3 2=3 3=5 4=4 5=4 Then, 6=?"}]}').Content
 ```
 
 Windows Command Prompt
 ```bat
````
**Review comment:** The examples mix `max_completion_tokens` (Linux) and `max_tokens` (PowerShell/CMD) for the same `/v3/chat/completions` call. Since users will copy/paste these, please standardize on a single parameter name across platforms (and keep it consistent with what OVMS documents/supports for this endpoint).
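If the page standardizes on `max_tokens` (a sketch; pick whichever name OVMS actually documents for this endpoint), the Linux example would carry the same field as the PowerShell/CMD ones:

```bash
curl http://localhost:8000/v3/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-30B-A3B-Instruct-2507-int4-ov",
    "max_tokens": 30,
    "temperature": 0,
    "stream": false,
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "If 1=3 2=3 3=5 4=4 5=4 Then, 6=?"}
    ]
  }' | jq .
```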
> ## VLM models deployment with NPU acceleration
>
> Check [VLM usage with NPU acceleration](../../vlm_npu/README.md)
**Review comment:** The relative link to the NPU VLM demo resolves to `demos/continuous_batching/vlm_npu/README.md`, but the actual file is `demos/vlm_npu/README.md`. Update the link target so it correctly points to the existing `demos/vlm_npu/README.md` from this directory (e.g., by going up one more level).
> Running this command starts the container with CPU only target device:
> ```bash
> docker run -it -p 8000:8000 --rm -e MOE_USE_MICRO_GEMM_PREFILL=0 --user $(id -u):$(id -g) -v $(pwd)/models:/models/:rw openvino/model_server:weekly --model_repository_path /models --source_model OpenVINO/Qwen3-30B-A3B-Instruct-2507-int4-ov --task text_generation --target_device GPU --tool_parser hermes3 --rest_port 8000 --model_name Qwen3-30B-A3B-Instruct-2507-int4-ov
> ```
> **GPU**
> **GPU baremetal on Windows11**
>
> In case you want to use GPU device to run the generation, add extra docker parameters `--device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1)` to `docker run` command, use the image with GPU support. Export the models with precision matching the GPU capacity and adjust pipeline configuration.

**Review comment:** This text references deploying models "prepared … in the previous step", but the "Model preparation" section was removed from this page. Update the wording (or reintroduce the missing step) so readers aren't pointed to a nonexistent section.

**Review comment:** This section is titled "CPU Docker on Ubuntu24" and says it starts the container with a CPU-only target, but the command uses `--target_device GPU` (and a GPU-focused int4 model). Either change the command to actually use CPU, or rename/retarget this section to GPU and ensure the image/flags match.
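If the section is meant to stay CPU-only, a sketch of a matching command under that assumption (same model, image tag, and flags as the quoted hunk, with the target device switched to CPU and the GPU-only environment variable removed; whether the int4 model is appropriate for CPU is a separate question for the authors):

```bash
docker run -it -p 8000:8000 --rm --user $(id -u):$(id -g) -v $(pwd)/models:/models/:rw openvino/model_server:weekly --model_repository_path /models --source_model OpenVINO/Qwen3-30B-A3B-Instruct-2507-int4-ov --task text_generation --target_device CPU --tool_parser hermes3 --rest_port 8000 --model_name Qwen3-30B-A3B-Instruct-2507-int4-ov
```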
> ::::
>
> ## Benchmarking text generation with high concurrency
> ## Check how to use AI agents with MCP servers and language models

**Review comment:** Do we need a separate section just for one line with a link?
🛠 Summary
JIRA/Issue if applicable.
Describe the changes.
🧪 Checklist