**demos/continuous_batching/README.md** (outdated diff)

> ## Server Deployment
>
> :::{dropdown} **Deploying with Docker**

**Review comment:** dropdown should be dropped.
**Pull request overview**

This PR updates OVMS demo documentation around continuous batching (LLM/VLM) and demo navigation, aiming to modernize titles, adjust the docs ToC structure, and add cross-links to NPU-accelerated variants.

Changes:
- Renames/rephrases top-level demo page titles and reorganizes `toctree` entries for generative demos.
- Updates the continuous batching LLM README with new deployment and request examples (Docker + Windows).
- Adds an NPU-acceleration section/link from the continuous batching VLM demo.

**Reviewed changes**

Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| demos/continuous_batching/vlm/README.md | Adds a toctree entry and an NPU-acceleration section/link for VLM demos. |
| demos/continuous_batching/speculative_decoding/README.md | Retitles the speculative decoding demo page. |
| demos/continuous_batching/README.md | Refactors the continuous batching LLM demo instructions, examples, and local ToC entries. |
| demos/README.md | Reorders/updates the demos landing page toctree to emphasize key generative demos. |
**Comments suppressed due to low confidence (1)**

demos/continuous_batching/README.md:118

The Windows Command Prompt cURL example still uses `meta-llama/Meta-Llama-3-8B-Instruct` while the Linux/PowerShell examples on this page use `Qwen3-30B-A3B-Instruct-2507-int4-ov`. This inconsistency will lead to copy/paste failures unless both models are actually loaded under those names.
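A sketch of what the aligned CMD example could look like, assuming `Qwen3-30B-A3B-Instruct-2507-int4-ov` is the name the server actually registers (the prompt and token limit are copied from the other examples on the page; note cmd.exe requires escaped double quotes in the JSON body):

```bat
curl http://localhost:8000/v3/chat/completions -H "Content-Type: application/json" -d "{\"model\": \"Qwen3-30B-A3B-Instruct-2507-int4-ov\", \"max_tokens\": 30, \"temperature\": 0, \"stream\": false, \"messages\": [{\"role\": \"system\", \"content\": \"You are a helpful assistant.\"}, {\"role\": \"user\", \"content\": \"If 1=3 2=3 3=5 4=4 5=4 Then, 6=?\"}]}"
```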
> It can be applied using the commands below:
> ```bash
> docker run -d --rm -p 8000:8000 --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) -v $(pwd)/models:/workspace:ro openvino/model_server:latest-gpu --rest_port 8000 --config_path /workspace/config.json
> ```
> :::
>
> :::{dropdown} **Deploying on Bare Metal**
>
> Assuming you have unpacked model server package, make sure to:
>
> - **On Windows**: run `setupvars` script
> - **On Linux**: set `LD_LIBRARY_PATH` and `PATH` environment variables
>
> as mentioned in [deployment guide](../../docs/deploying_server_baremetal.md), in every new shell that will start OpenVINO Model Server.
>
> Depending on how you prepared models in the first step of this demo, they are deployed to either CPU or GPU (it's defined in `graph.pbtxt`). If you run on GPU make sure to have appropriate drivers installed, so the device is accessible for the model server.
>
> ```bat
> ovms --rest_port 8000 --config_path ./models/config.json
> set MOE_USE_MICRO_GEMM_PREFILL=0
> ovms.exe --model_repository_path /models --source_model OpenVINO/Qwen3-30B-A3B-Instruct-2507-int4-ov --task text_generation --target_device GPU --tool_parser hermes3 --rest_port 8000 --model_name Qwen3-30B-A3B-Instruct-2507-int4-ov
> ```
**Review comment:** In the "GPU baremetal on Windows11" subsection, the surrounding explanation talks about adding extra Docker parameters, which doesn't apply to a bare-metal `ovms.exe` invocation. Please adjust this paragraph so it's Windows bare-metal specific (and move the Docker guidance under the Docker option).
**Review comment:** The Windows bare-metal example uses a Unix-style model repository path (`--model_repository_path /models`). On Windows this should be a Windows path (e.g., a relative `.\models` or an absolute `C:\...`) so the command works when copy/pasted.
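A hedged sketch of the corrected Windows invocation; the flag values are copied from the quoted hunk above, and the `.\models` path is an assumption about where the user exported the models:

```bat
ovms.exe --model_repository_path .\models --source_model OpenVINO/Qwen3-30B-A3B-Instruct-2507-int4-ov --task text_generation --target_device GPU --tool_parser hermes3 --rest_port 8000 --model_name Qwen3-30B-A3B-Instruct-2507-int4-ov
```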
````diff
@@ -166,7 +93,7 @@ curl http://localhost:8000/v3/chat/completions \
       },
       {
         "role": "user",
-        "content": "What is OpenVINO?"
+        "content": "If 1=3 2=3 3=5 4=4 5=4 Then, 6=?"
       }
     ]
   }'| jq .
@@ -179,12 +106,12 @@ Windows Powershell
 (Invoke-WebRequest -Uri "http://localhost:8000/v3/chat/completions" `
 -Method POST `
 -Headers @{ "Content-Type" = "application/json" } `
--Body '{"model": "meta-llama/Meta-Llama-3-8B-Instruct", "max_tokens": 30, "temperature": 0, "stream": false, "messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What are the 3 main tourist attractions in Paris?"}]}').Content
+-Body '{"model": "Qwen3-30B-A3B-Instruct-2507-int4-ov", "max_tokens": 30, "temperature": 0, "stream": false, "messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "If 1=3 2=3 3=5 4=4 5=4 Then, 6=?"}]}').Content
 ```
 
 Windows Command Prompt
 ```bat
````
**Review comment:** The examples mix `max_completion_tokens` (Linux) and `max_tokens` (PowerShell/CMD) for the same `/v3/chat/completions` call. Since users will copy/paste these, please standardize on a single parameter name across platforms (and keep it consistent with what OVMS documents/supports for this endpoint).
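If the page standardizes on `max_tokens` (a sketch; pick whichever name OVMS actually documents for this endpoint), the Linux example would carry the same field as the PowerShell/CMD ones:

```bash
curl http://localhost:8000/v3/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-30B-A3B-Instruct-2507-int4-ov",
    "max_tokens": 30,
    "temperature": 0,
    "stream": false,
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "If 1=3 2=3 3=5 4=4 5=4 Then, 6=?"}
    ]
  }' | jq .
```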
> ## VLM models deployment with NPU acceleration
>
> Check [VLM usage with NPU acceleration](../../vlm_npu/README.md)
**Review comment:** The relative link to the NPU VLM demo resolves to `demos/continuous_batching/vlm_npu/README.md`, but the actual file is `demos/vlm_npu/README.md`. Update the link target so it correctly points to the existing `demos/vlm_npu/README.md` from this directory (e.g., by going up one more level).
> Running this command starts the container with CPU only target device:
> ```bash
> docker run -it -p 8000:8000 --rm -e MOE_USE_MICRO_GEMM_PREFILL=0 --user $(id -u):$(id -g) -v $(pwd)/models:/models/:rw openvino/model_server:weekly --model_repository_path /models --source_model OpenVINO/Qwen3-30B-A3B-Instruct-2507-int4-ov --task text_generation --target_device GPU --tool_parser hermes3 --rest_port 8000 --model_name Qwen3-30B-A3B-Instruct-2507-int4-ov
> ```
> **GPU**
> **GPU baremetal on Windows11**
>
> In case you want to use GPU device to run the generation, add extra docker parameters `--device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1)` to `docker run` command, use the image with GPU support. Export the models with precision matching the GPU capacity and adjust pipeline configuration.

**Review comment:** This text references deploying models "prepared … in the previous step", but the "Model preparation" section was removed from this page. Update the wording (or reintroduce the missing step) so readers aren't pointed to a nonexistent section.

**Review comment:** This section is titled "CPU Docker on Ubuntu24" and says it starts the container with a CPU-only target, but the command uses `--target_device GPU` (and a GPU-focused int4 model). Either change the command to actually use CPU, or rename/retarget this section to GPU and ensure the image/flags match.
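If the section is meant to stay CPU-only, a sketch of a matching command under that assumption (same model, image tag, and flags as the quoted hunk, with the target device switched to CPU and the GPU-only environment variable removed; whether the int4 model is appropriate for CPU is a separate question for the authors):

```bash
docker run -it -p 8000:8000 --rm --user $(id -u):$(id -g) -v $(pwd)/models:/models/:rw openvino/model_server:weekly --model_repository_path /models --source_model OpenVINO/Qwen3-30B-A3B-Instruct-2507-int4-ov --task text_generation --target_device CPU --tool_parser hermes3 --rest_port 8000 --model_name Qwen3-30B-A3B-Instruct-2507-int4-ov
```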
> ::::
>
> ## Benchmarking text generation with high concurrency
> ## Check how to use AI agents with MCP servers and language models

**Review comment:** Do we need a separate section just for one line with a link?
🛠 Summary
JIRA/Issue if applicable.
Describe the changes.
🧪 Checklist