
llm demo update #4048

Open

dtrawins wants to merge 5 commits into main from llm-demo

Conversation

@dtrawins (Collaborator) commented Mar 9, 2026

🛠 Summary

JIRA/Issue if applicable.
Describe the changes.

🧪 Checklist

  • Unit tests added.
  • The documentation updated.
  • Change follows security best practices.

@dtrawins dtrawins requested review from mzegla and przepeck March 9, 2026 09:57

## Server Deployment

:::{dropdown} **Deploying with Docker**
Collaborator

dropdown should be dropped

@dtrawins dtrawins marked this pull request as ready for review March 9, 2026 13:08
Copilot AI review requested due to automatic review settings March 9, 2026 13:08
Copilot AI left a comment


Pull request overview

This PR updates OVMS demo documentation around continuous batching (LLM/VLM) and demo navigation, aiming to modernize titles, adjust the docs ToC structure, and add cross-links to NPU-accelerated variants.

Changes:

  • Renames/rephrases top-level demo page titles and reorganizes toctree entries for generative demos.
  • Updates the continuous batching LLM README with new deployment and request examples (Docker + Windows).
  • Adds an NPU-acceleration section/link from the continuous batching VLM demo.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| demos/continuous_batching/vlm/README.md | Adds a toctree entry and an NPU-acceleration section/link for VLM demos. |
| demos/continuous_batching/speculative_decoding/README.md | Retitles the speculative decoding demo page. |
| demos/continuous_batching/README.md | Refactors the continuous batching LLM demo instructions, examples, and local ToC entries. |
| demos/README.md | Reorders/updates the demos landing page toctree to emphasize key generative demos. |
Comments suppressed due to low confidence (1)

demos/continuous_batching/README.md:118

  • The Windows Command Prompt cURL example still uses meta-llama/Meta-Llama-3-8B-Instruct while the Linux/PowerShell examples on this page use Qwen3-30B-A3B-Instruct-2507-int4-ov. This inconsistency will lead to copy/paste failures unless both models are actually loaded under those names.
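The suppressed comment boils down to using one model name in every platform's snippet. A minimal shell sketch of that idea (the model name is taken from this page's PowerShell example, and the endpoint from the quoted curl commands; both are assumptions about the final doc, not verified against a running server):

```shell
# Keep a single model name so every platform's snippet stays copy/paste-safe.
MODEL="Qwen3-30B-A3B-Instruct-2507-int4-ov"
PAYLOAD=$(cat <<EOF
{"model": "$MODEL", "max_tokens": 30, "temperature": 0, "stream": false,
 "messages": [{"role": "user", "content": "If 1=3 2=3 3=5 4=4 5=4 Then, 6=?"}]}
EOF
)
echo "$PAYLOAD"
# curl -s http://localhost:8000/v3/chat/completions \
#   -H "Content-Type: application/json" -d "$PAYLOAD" | jq .
```

Generating every platform's request body from one variable means a model rename only has to be made once.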


Comment on lines 43 to 47
It can be applied using the commands below:
```bash
docker run -d --rm -p 8000:8000 --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) -v $(pwd)/models:/workspace:ro openvino/model_server:latest-gpu --rest_port 8000 --config_path /workspace/config.json
```
:::

:::{dropdown} **Deploying on Bare Metal**

Assuming you have unpacked the model server package, make sure to:

- **On Windows**: run `setupvars` script
- **On Linux**: set `LD_LIBRARY_PATH` and `PATH` environment variables

as mentioned in [deployment guide](../../docs/deploying_server_baremetal.md), in every new shell that will start OpenVINO Model Server.

Depending on how you prepared models in the first step of this demo, they are deployed to either CPU or GPU (it's defined in `graph.pbtxt`). If you run on GPU make sure to have appropriate drivers installed, so the device is accessible for the model server.

```bat
ovms --rest_port 8000 --config_path ./models/config.json
set MOE_USE_MICRO_GEMM_PREFILL=0
ovms.exe --model_repository_path /models --source_model OpenVINO/Qwen3-30B-A3B-Instruct-2507-int4-ov --task text_generation --target_device GPU --tool_parser hermes3 --rest_port 8000 --model_name Qwen3-30B-A3B-Instruct-2507-int4-ov
```

Copilot AI Mar 9, 2026


In the “GPU baremetal on Windows11” subsection, the surrounding explanation talks about adding extra Docker parameters, which doesn’t apply to a bare-metal ovms.exe invocation. Please adjust this paragraph so it’s Windows bare-metal specific (and move Docker guidance under the Docker option).
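If the Docker guidance is moved under the Docker option as suggested, the GPU flags already quoted on this page would land there roughly as follows (a sketch only; the image tag and config path are taken from the quoted Docker example, and the device paths assume a standard Linux DRM setup):

```bash
# Sketch: GPU flags under the Docker option, per the review comment.
# --device exposes the GPU render node; --group-add adds the container user
# to the render group so the device is accessible without running as root.
docker run -d --rm -p 8000:8000 \
  --device /dev/dri \
  --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) \
  -v $(pwd)/models:/workspace:ro \
  openvino/model_server:latest-gpu \
  --rest_port 8000 --config_path /workspace/config.json
```

The bare-metal Windows subsection would then only describe the `ovms.exe` invocation, with no Docker parameters mentioned.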

Comment on lines +48 to +50
:::



Copilot AI Mar 9, 2026


The Windows bare-metal example uses a Unix-style model repository path (`--model_repository_path /models`). On Windows this should be a Windows path (e.g., a relative `.\models` or an absolute `C:\...`) so the command works when copy/pasted.
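The Unix-vs-Windows path distinction can be checked without a Windows box via Python's pure path classes (illustrative only; the `C:\ovms\models` path is a made-up example, not a path from this repo):

```shell
# /models has a root but no drive letter, so Windows treats it as relative to
# the current drive rather than fully absolute; C:\ovms\models is unambiguous.
python3 - <<'EOF'
from pathlib import PurePosixPath, PureWindowsPath
print(PurePosixPath("/models").is_absolute())            # POSIX: absolute
print(PureWindowsPath("/models").is_absolute())          # Windows: not absolute
print(PureWindowsPath(r"C:\ovms\models").is_absolute())  # Windows: absolute
EOF
```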

Comment on lines 89 to 113
@@ -166,7 +93,7 @@ curl http://localhost:8000/v3/chat/completions \
},
{
"role": "user",
"content": "What is OpenVINO?"
"content": "If 1=3 2=3 3=5 4=4 5=4 Then, 6=?"
}
]
}'| jq .
@@ -179,12 +106,12 @@ Windows Powershell
(Invoke-WebRequest -Uri "http://localhost:8000/v3/chat/completions" `
-Method POST `
-Headers @{ "Content-Type" = "application/json" } `
-Body '{"model": "meta-llama/Meta-Llama-3-8B-Instruct", "max_tokens": 30, "temperature": 0, "stream": false, "messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What are the 3 main tourist attractions in Paris?"}]}').Content
-Body '{"model": "Qwen3-30B-A3B-Instruct-2507-int4-ov", "max_tokens": 30, "temperature": 0, "stream": false, "messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "If 1=3 2=3 3=5 4=4 5=4 Then, 6=?"}]}').Content
```


Copilot AI Mar 9, 2026


The examples mix max_completion_tokens (Linux) and max_tokens (PowerShell/CMD) for the same /v3/chat/completions call. Since users will copy/paste these, please standardize on a single parameter name across platforms (and keep it consistent with what OVMS documents/supports for this endpoint).
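One way to keep the parameter consistent across platforms is to store the body once and reuse it from curl, PowerShell, and CMD (a sketch; the `request.json` file name and the choice of `max_tokens` over `max_completion_tokens` are assumptions for illustration, not a statement of what OVMS documents):

```shell
# One request body, shared by every platform's example, so a parameter rename
# (max_tokens vs max_completion_tokens) only has to happen in one place.
cat > request.json <<'EOF'
{"model": "Qwen3-30B-A3B-Instruct-2507-int4-ov",
 "max_tokens": 30, "temperature": 0, "stream": false,
 "messages": [{"role": "system", "content": "You are a helpful assistant."},
              {"role": "user", "content": "If 1=3 2=3 3=5 4=4 5=4 Then, 6=?"}]}
EOF
python3 -m json.tool request.json   # sanity-check that the body parses
# curl -s http://localhost:8000/v3/chat/completions \
#   -H "Content-Type: application/json" -d @request.json
```

PowerShell's `Invoke-WebRequest -Body (Get-Content request.json -Raw)` and CMD's `curl -d @request.json` can consume the same file.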


## VLM models deployment with NPU acceleration

Check [VLM usage with NPU acceleration](../../vlm_npu/README.md)

Copilot AI Mar 9, 2026


The relative link to the NPU VLM demo resolves to demos/continuous_batching/vlm_npu/README.md, but the actual file is demos/vlm_npu/README.md. Update the link target so it correctly points to the existing demos/vlm_npu/README.md from this directory (e.g., by going up one more level).

Comment on lines 35 to 36
Running this command starts the container with CPU only target device:
```bash

Copilot AI Mar 9, 2026


This text references deploying models “prepared … in the previous step”, but the “Model preparation” section was removed from this page. Update the wording (or reintroduce the missing step) so readers aren’t pointed to a nonexistent section.

Comment on lines +37 to 42
docker run -it -p 8000:8000 --rm -e MOE_USE_MICRO_GEMM_PREFILL=0 --user $(id -u):$(id -g) -v $(pwd)/models:/models/:rw openvino/model_server:weekly --model_repository_path /models --source_model OpenVINO/Qwen3-30B-A3B-Instruct-2507-int4-ov --task text_generation --target_device GPU --tool_parser hermes3 --rest_port 8000 --model_name Qwen3-30B-A3B-Instruct-2507-int4-ov
```
**GPU**
**GPU baremetal on Windows11**

If you want to use a GPU device to run the generation, add the extra docker parameters `--device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1)`
to the `docker run` command and use the image with GPU support. Export the models with a precision matching the GPU capacity and adjust the pipeline configuration.

Copilot AI Mar 9, 2026


This section is titled “CPU Docker on Ubuntu24” and says it starts the container with a CPU-only target, but the command uses --target_device GPU (and a GPU-focused int4 model). Either change the command to actually use CPU, or rename/retarget this section to GPU and ensure the image/flags match.
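If the section is meant to stay CPU-only, one possible fix is the quoted command with the GPU-specific parts removed (a hedged sketch, not a verified OVMS invocation; the flags are taken verbatim from the quoted diff, with `--target_device` switched to CPU and the GPU-oriented environment variable dropped):

```bash
# CPU-only variant of the quoted command, so the section title and the
# command agree: no --device/--group-add flags and an explicit CPU target.
docker run -it -p 8000:8000 --rm --user $(id -u):$(id -g) \
  -v $(pwd)/models:/models/:rw openvino/model_server:weekly \
  --model_repository_path /models \
  --source_model OpenVINO/Qwen3-30B-A3B-Instruct-2507-int4-ov \
  --task text_generation --target_device CPU --tool_parser hermes3 \
  --rest_port 8000 --model_name Qwen3-30B-A3B-Instruct-2507-int4-ov
```

Alternatively, the section could be retitled to GPU and kept as-is, provided the image and flags then match a GPU deployment.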

::::

## Benchmarking text generation with high concurrency
## Check how to use AI agents with MCP servers and language models
Collaborator

Do we need a separate section just for one line with a link?

@dtrawins changed the title from "draf of llm demo update" to "llm demo update" Mar 9, 2026


4 participants