Conversation
Auto-sync is disabled for ready-for-review pull requests in this repository. Workflows must be run manually.
…time Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove unused `_alloc_device_tensor` helper from tests
- Add test for rank > 5 (6D tensor) to verify upper bound validation
- Add NULL check for `PyMem_Malloc` in `prepare_tensor_map_arg`

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move the replace_address() demonstration into its own self-contained example (tma_replace_address.py) so each file covers a single concept. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pull request overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.
Comments suppressed due to low confidence (2)
cuda_core/pixi.toml:67
Removing the `cu12` environment from this subproject can break the repository's top-level `pixi run -e cu12 test` workflow, which runs `pixi run --manifest-path cuda_core test` under the propagated `PIXI_ENVIRONMENT_NAME=cu12`. If cu12 testing is still expected at the workspace level, consider keeping a solvable `cu12` environment here (e.g., using conda-forge `cuda-bindings`/`cuda-version` constraints instead of the path dependency) or updating the workspace test tasks to avoid selecting a missing environment.
```toml
# NOTE: cu12 environment is intentionally omitted because the path dependency
# to ../cuda_bindings (v13.1) makes it unsolvable locally. For cu12 testing,
# use conda-forge packages or CI workflows.
[environments]
default = { features = [
    "cu13",
    "test",
    "cython-tests",
], solve-group = "default" }
cu13 = { features = ["cu13", "test", "cython-tests"], solve-group = "default" }
```
cuda_core/cuda/core/_tensor_map.pyx:461
`c_pixel_box_lower`/`c_pixel_box_upper` are declared as fixed-size `int[3]`, but only the first `n_spatial` entries are written. If the driver implementation reads all 3 entries (the API supports up to 3 spatial dims), the remaining uninitialized values can make encoding nondeterministic. Initialize the full arrays (e.g., set all 3 entries to 0 first) before filling the active elements.
```cython
cdef uint64_t[5] c_global_dim
cdef uint64_t[4] c_global_strides
cdef uint32_t[5] c_element_strides
cdef int[3] c_pixel_box_lower  # max 3 spatial dims (rank 5 - 2)
cdef int[3] c_pixel_box_upper
cdef int i_c
for i_c in range(rank):
    c_global_dim[i_c] = <uint64_t>shape[rank - 1 - i_c]
    c_element_strides[i_c] = <uint32_t>element_strides[rank - 1 - i_c]
for i_c in range(rank - 1):
    c_global_strides[i_c] = <uint64_t>byte_strides[rank - 2 - i_c]
# Reverse spatial dimensions for lower/upper corners
for i_c in range(n_spatial):
    c_pixel_box_lower[i_c] = <int>pixel_box_lower_corner[n_spatial - 1 - i_c]
    c_pixel_box_upper[i_c] = <int>pixel_box_upper_corner[n_spatial - 1 - i_c]
```
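The zero-initialize-then-fill pattern the comment recommends can be sketched in plain Python with `ctypes` (a hypothetical helper for illustration, not the actual `_tensor_map.pyx` code):

```python
import ctypes

MAX_SPATIAL_DIMS = 3  # the TMA im2col API supports at most 3 spatial dims

def fill_pixel_box(corner, n_spatial):
    """Build a fixed-size int[3] with deterministic contents.

    ctypes arrays are zero-initialized on creation, so entries past
    n_spatial are guaranteed to be 0 rather than stale stack memory.
    """
    box = (ctypes.c_int * MAX_SPATIAL_DIMS)()
    for i in range(n_spatial):
        # Reverse to innermost-first ordering, mirroring the Cython loop.
        box[i] = corner[n_spatial - 1 - i]
    return box

box = fill_pixel_box([4, 8], 2)
print(list(box))  # [8, 4, 0] -- the unused third entry is a deterministic 0
```

In the Cython code the equivalent fix is a one-line `memset`-style loop over all 3 entries before the `n_spatial` loop runs.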
```cython
view = _get_validated_view(tensor)
desc._source_ref = tensor
```
TensorMapDescriptor stores _source_ref = tensor, but when tensor is a DLPack producer the pointer/metadata lifetime is governed by the DLPack capsule returned by __dlpack__(). Since the temporary StridedMemoryView (which holds the capsule and calls the deleter in __dealloc__) is not retained, the capsule can be released immediately, potentially invalidating globalAddress for exporters where the capsule owns the backing allocation. Store a strong reference to the StridedMemoryView (or at least its metadata capsule) instead of (or in addition to) the original tensor object.
```cython
cdef TensorMapDescriptor desc = cls.__new__(cls)
```

```cython
view = _get_validated_view(tensor)
desc._source_ref = tensor
```
Same lifetime issue as from_tiled(): _source_ref = tensor does not necessarily keep the DLPack capsule alive, so the temporary StridedMemoryView (and its capsule) may be destroyed right after descriptor creation. Keep a reference to the StridedMemoryView/capsule to ensure the exported memory remains valid for the descriptor’s lifetime.
Suggested change:

```diff
-desc._source_ref = tensor
+# Keep both the original tensor and the StridedMemoryView alive so that
+# any underlying DLPack capsule remains valid for the descriptor's lifetime.
+desc._source_ref = (tensor, view)
```
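The lifetime hazard described in both comments can be reproduced with a toy stand-in for the capsule-owning view (a sketch; `Capsule` and `Descriptor` below are hypothetical, not the real cuda.core types, and the determinism relies on CPython's reference counting):

```python
import weakref

released = []

class Capsule:
    """Stand-in for a StridedMemoryView holding a DLPack capsule whose
    deleter frees the exported allocation."""
    def __init__(self):
        weakref.finalize(self, released.append, "capsule deleter ran")

class Descriptor:
    """Mimics the suggested fix: retain both the tensor and the view."""
    def __init__(self, tensor, view):
        self._source_ref = (tensor, view)

tensor = object()
view = Capsule()
desc = Descriptor(tensor, view)

del view                 # local name dropped, but desc still holds the view
assert released == []    # the exported memory is still valid
del desc                 # only now may the capsule deleter run
```

Retaining only `tensor` (the original code) would let `del view` run the deleter immediately, leaving `globalAddress` dangling.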
```cython
# Update the source reference only after the driver call succeeds,
# so we don't drop the old tensor (risking a dangling pointer in the
# CUtensorMap struct) if the call fails.
self._source_ref = tensor
```
replace_address() updates _source_ref = tensor, but it also creates a temporary StridedMemoryView that may own a DLPack capsule. If that capsule is released when the view is GC’d, the new address can become invalid for exporters where the capsule owns the memory. Consider storing the validated StridedMemoryView (or its capsule) on the descriptor after a successful replace, similar to descriptor creation.
Suggested change:

```diff
 self._source_ref = tensor
+# Also retain the validated StridedMemoryView to ensure that any
+# owning DLPack capsule (and thus the underlying memory) stays alive
+# for as long as this descriptor uses its address.
+self._view_ref = view
```
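A minimal sketch of the commit-after-success pattern combined with the suggested `_view_ref`, in plain Python (names are illustrative; `driver_call` stands in for the actual driver call, which the source does not name):

```python
class Desc:
    def __init__(self, tensor, view):
        self._source_ref = tensor
        self._view_ref = view

    def replace_address(self, tensor, view, driver_call):
        # Call the driver first: if it raises, the old references are
        # untouched and the CUtensorMap still points at live memory.
        driver_call()
        # Commit both references only on success, so the capsule-owning
        # view stays alive for as long as the descriptor uses its address.
        self._source_ref = tensor
        self._view_ref = view

old_t, old_v = object(), object()
d = Desc(old_t, old_v)

def failing_call():
    raise RuntimeError("driver error")

try:
    d.replace_address(object(), object(), failing_call)
except RuntimeError:
    pass
assert d._source_ref is old_t and d._view_ref is old_v  # old refs preserved

new_t, new_v = object(), object()
d.replace_address(new_t, new_v, lambda: None)
assert d._source_ref is new_t and d._view_ref is new_v  # committed on success
```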
```cython
def from_im2col_wide(cls, tensor, pixel_box_lower_corner_width, pixel_box_upper_corner_width,
                     channels_per_pixel, pixels_per_column, *,
                     element_strides=None,
                     data_type=None,
                     interleave=TensorMapInterleave.NONE,
                     mode=TensorMapIm2ColWideMode.W,
                     swizzle=TensorMapSwizzle.SWIZZLE_128B,
                     l2_promotion=TensorMapL2Promotion.NONE,
                     oob_fill=TensorMapOOBFill.NONE):
    """Create an im2col-wide TMA descriptor from a tensor object.
```
`from_im2col_wide()` and `TensorMapIm2ColWideMode` rely on `cydriver.CU_TENSOR_MAP_IM2COL_WIDE_MODE_*` and `cydriver.cuTensorMapEncodeIm2colWide`. Those symbols may not exist in older cuda-bindings 12.x builds (depending on the minor version the bindings were generated from), which would cause build/import failures for cu12 wheels. Consider compile-time gating this feature (similar to other `CUDA_CORE_BUILD_MAJOR` guards) and/or tightening the minimum supported cuda-bindings version for cu12 if this API is required.
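The shape of such a guard can be sketched at runtime in Python (a hedged illustration only: `cydriver` below is an empty stand-in namespace, and the review comment actually suggests a compile-time `CUDA_CORE_BUILD_MAJOR`-style guard rather than this `hasattr` check):

```python
import types

# Stand-in for cuda.bindings.cydriver from an older 12.x build that
# lacks the im2col-wide symbols.
cydriver = types.SimpleNamespace()

HAVE_IM2COL_WIDE = hasattr(cydriver, "cuTensorMapEncodeIm2colWide")

def from_im2col_wide(*args, **kwargs):
    if not HAVE_IM2COL_WIDE:
        # Fail with a clear message instead of a build/import error.
        raise NotImplementedError(
            "from_im2col_wide() requires a cuda-bindings build that "
            "provides cuTensorMapEncodeIm2colWide"
        )
    ...  # real encoding would go here

print(HAVE_IM2COL_WIDE)  # False with the empty stand-in namespace
```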
/ok to test |
/ok to test |
/ok to test |
Summary
- `TensorMapDescriptor` Cython class wrapping the CUDA driver's `CUtensorMap` for Hopper+ TMA (Tensor Memory Accelerator) bulk data movement
- `from_tiled()` and `from_im2col()` class methods, with automatic dtype inference, stride computation, and validation
- `TensorMapDescriptor` as a first-class kernel argument in `_kernel_arg_handler.pyx`
- Tests (`test_tensor_map.py`) and an example (`tma_tensor_map.py`)

Closes #199
Closes #200