Fix W&B run reuse across ART models #618

Open
Kovbo wants to merge 2 commits into main from fix/wandb-multi-run
Collaborator

@Kovbo Kovbo commented Mar 16, 2026

This PR fixes a client-side W&B run leakage bug in ART.

Before the change in src/art/model.py, ART would call wandb.init(...) without telling W&B to create a fresh run. In a single Python process, if one ART model had already opened a run, W&B could return that existing active run for the next model. The result was that metrics from model B, model C, and so on could be logged into model A’s W&B run.

That matters most for the serverless case: one client process can create and manage multiple training jobs, potentially across multiple GPUs. Those jobs need separate W&B runs. The fix makes ART open a distinct W&B run per model by passing reinit="create_new" to wandb.init(...), and it defines metrics on that specific run object instead of on module-global W&B state. As a result, metrics stay attached to the correct run name instead of being silently merged into the first one.
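For reviewers who want the shape of the change without opening the diff, here is a minimal sketch of the pattern described above. It is not the code in src/art/model.py: the helper name `open_run_for_model`, the `init` parameter (added only so the sketch can be exercised without a W&B account), and the metric names `train/step` and `train/*` are all assumptions for illustration. Only `wandb.init(...)`, `reinit="create_new"`, and defining metrics on the returned run object come from the description.

```python
from typing import Any, Callable, Optional


def open_run_for_model(
    project: str,
    name: str,
    init: Optional[Callable[..., Any]] = None,
) -> Any:
    """Open a dedicated W&B run for one model (hypothetical helper)."""
    if init is None:
        # Imported lazily so the sketch can be read and tested without wandb.
        import wandb

        init = wandb.init

    # reinit="create_new" asks W&B for a fresh run even if another run
    # is already active in this Python process.
    run = init(project=project, name=name, reinit="create_new")

    # Define metrics on this run object, not via module-level wandb calls,
    # so the definitions cannot land on a different active run.
    # The metric names below are illustrative assumptions.
    run.define_metric("train/step")
    run.define_metric("train/*", step_metric="train/step")
    return run
```

Under these assumptions, calling `open_run_for_model("my-project", "model-a")` and then `open_run_for_model("my-project", "model-b")` in the same process should yield two separate run objects, and each model would log through its own object with `run.log(...)` rather than `wandb.log(...)`.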

Collaborator

@vivekkalyan vivekkalyan left a comment

thanks for catching this!


3 participants