WIP: Source Links Fix #440
Conversation
License Check Results 🚀 The license check job ran with the Bazel command: `bazel run --lockfile_mode=error //src:license-check`
The created documentation from the pull request is available at: docu-html
```diff
@@ -0,0 +1,23 @@
# *******************************************************************************
```
I probably want to delete this file to make everything easier
Pull request overview
This PR reworks how “source code links” are generated and injected into Sphinx needs, aiming to support both local builds (derive Git hash from the current repo) and combo/ref-integration builds (use per-module metadata such as repo URL + commit hash).
Changes:
- Add metadata support (module name / repo URL / commit hash) to NeedLink JSON handling and Bazel CLI generation/merge scripts.
- Introduce grouping of needs by module and a new module-grouped cache JSON.
- Update Sphinx extension flow to read the new caches and generate GitHub links using either git-derived or metadata-derived refs.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 15 comments.
Summary per file:
| File | Description |
|---|---|
| src/helper_lib/additional_functions.py | Changes GitHub link generation API to require module metadata. |
| src/extensions/score_source_code_linker/needlinks.py | Extends NeedLink with metadata fields; adds new metadata-aware JSON format/load path. |
| src/extensions/score_source_code_linker/need_source_links.py | Moves group_by_need into this module. |
| src/extensions/score_source_code_linker/module_source_links.py | New module-grouped cache format and grouping logic. |
| src/extensions/score_source_code_linker/metadata.py | New TypedDict + TypeGuard for metadata records. |
| src/extensions/score_source_code_linker/generate_source_code_links_json.py | Refactors extraction helper signature and logging (currently inconsistent call sites). |
| src/extensions/score_source_code_linker/__init__.py | Adds module-linker stage and changes injection to use module-grouped cache + metadata-based link generation. |
| scripts_bazel/merge_sourcelinks.py | Merges per-module sourcelinks and enriches with known-good repo/hash metadata. |
| scripts_bazel/generate_sourcelinks_cli.py | Emits sourcelinks JSON with a leading metadata dict and uses updated extraction helper signature. |
| docs.bzl | Adds optional known_good wiring into the sourcelinks merge genrule and the public docs() macro. |
```python
data = json.load(f)
metadata = data[0]
if metadata["module_name"] and metadata["module_name"] != "local_module":
    hash, repo = parse_info_from_known_good(
        known_good_json=args.known_good, module_name=metadata["module_name"]
    )
```
--known_good is optional, but when a module has a non-empty metadata["module_name"] this code unconditionally calls parse_info_from_known_good(known_good_json=args.known_good, ...). If --known_good is omitted, args.known_good is None and open(None, ...) will fail. Either make --known_good required when external modules are present, or guard this branch and emit a clear error explaining how to provide the known-good file.
```python
print("===THIS IS MODULE NAME WE LOOK FOR===========")
print(module_name)
with open(known_good_json, "r") as f:
    kg_json = json.load(f)
    for category in kg_json["modules"].values():
        print("===THIS IS CATEGORY=========")
        print(category)
        if module_name in category:
            print("===THIS IS MODULE NAME INSIDe CATEGORY===========")
            print(module_name)
```
This helper contains multiple print() debug statements that will pollute build logs (and potentially leak large JSON fragments via print(category)). Prefer using the configured logger at DEBUG level (or remove these statements entirely) once the behavior is validated.
Suggested change:

```diff
-print("===THIS IS MODULE NAME WE LOOK FOR===========")
-print(module_name)
-with open(known_good_json, "r") as f:
-    kg_json = json.load(f)
-    for category in kg_json["modules"].values():
-        print("===THIS IS CATEGORY=========")
-        print(category)
-        if module_name in category:
-            print("===THIS IS MODULE NAME INSIDe CATEGORY===========")
-            print(module_name)
+logger.debug("Looking up module name %r in known_good_json %s", module_name, known_good_json)
+with open(known_good_json, "r") as f:
+    kg_json = json.load(f)
+    for category in kg_json["modules"].values():
+        logger.debug("Inspecting category while searching for module %r: %s", module_name, category)
+        if module_name in category:
+            logger.debug("Found module name %r inside category", module_name)
```
```python
print("======== THIs IS PATH we PARSIGN FOr MODULE NAME")
print(path)
if str(path).startswith("external/"):
    module_raw = str(path).removeprefix("external/")
    filepath_split = str(module_raw).split("/", maxsplit=1)
    module_name = str(filepath_split[0].removesuffix("+"))
    return module_name
return "local_module"
```
parse_module_name_from_path currently uses print() for debugging. This will add noise to Bazel build output; use the module logger at DEBUG level or remove these statements.
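A print-free version of the helper might look like this. This is a sketch reconstructed from the quoted snippet: the function name matches the one mentioned above, but the surrounding module and logger setup are illustrative.

```python
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def parse_module_name_from_path(path: Path) -> str:
    """Derive a Bazel module name from a source path, logging at DEBUG instead of print()."""
    logger.debug("Parsing module name from path: %s", path)
    if str(path).startswith("external/"):
        # "external/<module>+/pkg/file.py" -> "<module>"
        module_raw = str(path).removeprefix("external/")
        module_name = module_raw.split("/", maxsplit=1)[0].removesuffix("+")
        logger.debug("Resolved external module name: %r", module_name)
        return module_name
    return "local_module"
```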
```python
store_source_code_links_with_metadata_json(
    file=args.output, metadata=metadata, needlist=all_need_references
)
```
This switches the generated JSON format from a plain list of NeedLinks to a list whose first element is a metadata dict. Any existing consumers/tests that expect the old schema will now fail. Consider either keeping the old format as default (with an opt-in flag for metadata), or updating all in-repo consumers and tests in the same PR to avoid a partially-migrated state.
```python
gruped_cache_exists = grouped_cache.exists()
# TODO this cache should be done via Bazel
if not gruped_cache_exists or not app.config.skip_rescanning_via_source_code_linker:
```
Typo in variable name: gruped_cache_exists should be grouped_cache_exists for readability and consistency with grouped_cache.
```python
    metadata: moduleInfo,
    link: NeedLink | DataForTestLink | DataOfTestCase | None = None,
) -> str:
    if link is None:
        link = DefaultNeedLink()
    if not metadata.hash:
        # Local path (//:docs)
        return get_github_link_from_git(link)
    # Ref-Integration path (//:docs_combo..)
    return get_github_link_from_json(metadata, link)
```
get_github_link now requires a metadata argument, but there are existing call sites in the repo that still call get_github_link(link) (e.g. in tests and xml_parser.py). This is a breaking API change that will raise TypeError. Consider keeping backwards compatibility by making metadata optional with a sensible default (e.g. infer local/git behavior when metadata is omitted) or providing a new function name for the metadata-based behavior.
Suggested change:

```diff
-    metadata: moduleInfo,
-    link: NeedLink | DataForTestLink | DataOfTestCase | None = None,
-) -> str:
-    if link is None:
-        link = DefaultNeedLink()
-    if not metadata.hash:
-        # Local path (//:docs)
-        return get_github_link_from_git(link)
-    # Ref-Integration path (//:docs_combo..)
-    return get_github_link_from_json(metadata, link)
+    metadata: moduleInfo
+    | NeedLink
+    | DataForTestLink
+    | DataOfTestCase
+    | None = None,
+    link: NeedLink | DataForTestLink | DataOfTestCase | None = None,
+) -> str:
+    """
+    Return a GitHub link for the given location.
+    Backwards compatible calling conventions:
+    - New style: get_github_link(metadata, link)
+    - Old style: get_github_link(link)  # metadata omitted
+    """
+    # Distinguish between new-style and old-style calls.
+    if isinstance(metadata, moduleInfo):
+        actual_metadata: moduleInfo | None = metadata
+        actual_link = link
+    else:
+        # Old-style: first argument is actually the link (or None)
+        actual_metadata = None
+        # If both metadata and link are provided but metadata is not a moduleInfo,
+        # prefer the explicitly provided link argument.
+        actual_link = link if link is not None else metadata
+    if actual_link is None:
+        actual_link = DefaultNeedLink()
+    # If no metadata is available or metadata.hash is falsy, fall back to git-based link.
+    if actual_metadata is None or not getattr(actual_metadata, "hash", None):
+        # Local path (//:docs)
+        return get_github_link_from_git(actual_link)
+    # Ref-Integration path (//:docs_combo..)
+    return get_github_link_from_json(actual_metadata, actual_link)
```
```python
    Returns a nested dictionary structure with 'CodeLink' and 'TestLink' categories.
    Example output:
    {
        "need": "<need_id>",
        "module_name": <module_name>,
        "hash": <git hash>,
        "url": <github base url>,
        "links": {
            "CodeLinks": [NeedLink, NeedLink, ...],
            "TestLinks": [testlink, testlink, ...]
```
The group_by_need docstring/documented example includes top-level module_name/hash/url fields, but SourceCodeLinks only contains need and links and this function does not populate metadata. Please update the docstring/example to match the actual returned structure to avoid confusion for future callers.
Suggested change:

```diff
-    Returns a nested dictionary structure with 'CodeLink' and 'TestLink' categories.
-    Example output:
-    {
-        "need": "<need_id>",
-        "module_name": <module_name>,
-        "hash": <git hash>,
-        "url": <github base url>,
-        "links": {
-            "CodeLinks": [NeedLink, NeedLink, ...],
-            "TestLinks": [testlink, testlink, ...]
+    Returns a list of SourceCodeLinks objects, each containing a 'need' ID and
+    associated 'CodeLinks' and 'TestLinks'.
+    Example output element:
+    {
+        "need": "<need_id>",
+        "links": {
+            "CodeLinks": [NeedLink, NeedLink, ...],
+            "TestLinks": [DataForTestLink, DataForTestLink, ...]
```
```python
    def to_dict_full(self) -> dict[str, str | Path]:
        return asdict(self)

    def to_dict_without_metadata(self) -> dict[str, str | Path]:
        d = asdict(self)
        d.pop("module_name", None)
        d.pop("hash", None)
        d.pop("url", None)
        return d
```
The return type annotations for to_dict_full() / to_dict_without_metadata() are too narrow: asdict(self) includes an int (line) and the dict may contain str, int, and Path values. Update the return type to reflect the actual value types (or use dict[str, object]).
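A corrected sketch of the annotations might look like the following. The trimmed `NeedLink` fields here are assumptions for illustration; the real dataclass in the PR has more fields.

```python
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class NeedLink:
    # Trimmed to the fields relevant here; the real class carries more.
    need: str
    file: Path
    line: int
    module_name: str = ""
    hash: str = ""
    url: str = ""

    def to_dict_full(self) -> dict[str, object]:
        # asdict() mixes str, int and Path values, so dict[str, object] is the honest type.
        return asdict(self)

    def to_dict_without_metadata(self) -> dict[str, object]:
        d = asdict(self)
        for key in ("module_name", "hash", "url"):
            d.pop(key, None)
        return d
```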
```python
@dataclass
class moduleInfo:
    name: str
    hash: str
    url: str
```
moduleInfo is a class but is named in lowerCamelCase. This is inconsistent with the rest of this package (e.g. NeedLink, ModuleSourceLinks) and makes type usage harder to read. Rename it to ModuleInfo (and update imports/usages).
```python
def store_source_code_links_json(file: Path, needlist: list[NeedLink]) -> None:
    """
    Writes a JSON array:
        [ meta_dict, needlink1, needlink2, ... ]

    meta_dict must include:
        module_name, hash, url
    """
    file.parent.mkdir(exist_ok=True)
    with open(file, "w", encoding="utf-8") as f:
        json.dump(needlist, f, cls=NeedLinkEncoder, indent=2, ensure_ascii=False)
```
The docstring for store_source_code_links_json claims it writes [meta_dict, needlink1, ...], but the implementation writes only needlist. This is misleading for callers and makes it unclear which on-disk format they should expect. Either update the docstring to match the actual output, or change the function to include metadata (and keep the non-metadata format under a different name for backwards compatibility).
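One way to resolve the mismatch, as the comment suggests, is to keep the legacy writer honest and put the metadata-prefixed schema under a separate name. This is a simplified sketch using plain dicts instead of `NeedLink`/`NeedLinkEncoder`.

```python
import json
from pathlib import Path


def store_source_code_links_json(file: Path, needlist: list[dict]) -> None:
    """Write the legacy schema: a plain JSON array of needlinks."""
    file.parent.mkdir(parents=True, exist_ok=True)
    file.write_text(json.dumps(needlist, indent=2, ensure_ascii=False), encoding="utf-8")


def store_source_code_links_with_metadata_json(
    file: Path, metadata: dict, needlist: list[dict]
) -> None:
    """Write the new schema described by the docstring: [meta_dict, needlink1, ...]."""
    store_source_code_links_json(file, [metadata, *needlist])
```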
Still in progress, do not merge.
It works, but the tests need fixing and the code needs a lot of cleanup.
📌 Description
🚨 Impact Analysis
✅ Checklist