
Add TMA TensorMapDescriptor support#1687

Open
rparolin wants to merge 11 commits into NVIDIA:main from rparolin:rparolin/tma_feature

Conversation


@rparolin rparolin commented Feb 24, 2026

Summary

  • Add TensorMapDescriptor Cython class wrapping the CUDA driver's CUtensorMap for Hopper+ TMA (Tensor Memory Accelerator) bulk data movement
  • Support tiled and im2col descriptor creation via from_tiled() and from_im2col() class methods, with automatic dtype inference, stride computation, and validation
  • Integrate TensorMapDescriptor as a first-class kernel argument in _kernel_arg_handler.pyx
  • Add comprehensive tests (test_tensor_map.py) and an example (tma_tensor_map.py)

Closes #199
Closes #200
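The "automatic stride computation" mentioned in the summary amounts to ordinary C-contiguous stride arithmetic. A minimal sketch of that arithmetic (the helper name is hypothetical, not the PR's actual code):

```python
def row_major_byte_strides(shape, itemsize):
    # Contiguous row-major (C-order) byte strides: the innermost
    # dimension advances by itemsize bytes, and each outer dimension
    # by the byte extent of everything nested inside it.
    strides = [0] * len(shape)
    acc = itemsize
    for i in range(len(shape) - 1, -1, -1):
        strides[i] = acc
        acc *= shape[i]
    return strides

# A (4, 8, 16) float32 tensor: 4-byte elements.
assert row_major_byte_strides((4, 8, 16), 4) == [512, 64, 4]
```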


copy-pr-bot bot commented Feb 24, 2026

Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.



rparolin and others added 2 commits February 24, 2026 15:55
…time

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove unused _alloc_device_tensor helper from tests
- Add test for rank > 5 (6D tensor) to verify upper bound validation
- Add NULL check for PyMem_Malloc in prepare_tensor_map_arg

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@rparolin rparolin self-assigned this Feb 25, 2026
@rparolin rparolin added this to the cuda.core v0.7.0 milestone Feb 25, 2026
Move the replace_address() demonstration into its own self-contained
example (tma_replace_address.py) so each file covers a single concept.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (2)

cuda_core/pixi.toml:67

  • Removing the cu12 environment from this subproject can break the repository’s top-level pixi run -e cu12 test workflow, which runs pixi run --manifest-path cuda_core test under the propagated PIXI_ENVIRONMENT_NAME=cu12. If cu12 testing is still expected at the workspace level, consider keeping a solvable cu12 environment here (e.g., using conda-forge cuda-bindings/cuda-version constraints instead of the path dependency) or updating the workspace test tasks to avoid selecting a missing environment.

# NOTE: cu12 environment is intentionally omitted because the path dependency
# to ../cuda_bindings (v13.1) makes it unsolvable locally. For cu12 testing,
# use conda-forge packages or CI workflows.
[environments]
default = { features = [
    "cu13",
    "test",
    "cython-tests",
], solve-group = "default" }
cu13 = { features = ["cu13", "test", "cython-tests"], solve-group = "default" }

cuda_core/cuda/core/_tensor_map.pyx:461

  • c_pixel_box_lower / c_pixel_box_upper are declared as fixed-size int[3] but only the first n_spatial entries are written. If the driver implementation reads all 3 entries (the API supports up to 3 spatial dims), the remaining uninitialized values can make encoding nondeterministic. Initialize the full arrays (e.g., set all 3 to 0 first) before filling the active elements.
        cdef uint64_t[5] c_global_dim
        cdef uint64_t[4] c_global_strides
        cdef uint32_t[5] c_element_strides
        cdef int[3] c_pixel_box_lower  # max 3 spatial dims (rank 5 - 2)
        cdef int[3] c_pixel_box_upper
        cdef int i_c

        for i_c in range(rank):
            c_global_dim[i_c] = <uint64_t>shape[rank - 1 - i_c]
            c_element_strides[i_c] = <uint32_t>element_strides[rank - 1 - i_c]

        for i_c in range(rank - 1):
            c_global_strides[i_c] = <uint64_t>byte_strides[rank - 2 - i_c]

        # Reverse spatial dimensions for lower/upper corners
        for i_c in range(n_spatial):
            c_pixel_box_lower[i_c] = <int>pixel_box_lower_corner[n_spatial - 1 - i_c]
            c_pixel_box_upper[i_c] = <int>pixel_box_upper_corner[n_spatial - 1 - i_c]
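The reviewer's suggested fix, zero-initializing the full fixed-size arrays before filling only the active entries, can be sketched in plain Python (lists stand in for the Cython `int[3]` buffers; the function name is illustrative):

```python
MAX_SPATIAL = 3  # CUtensorMap supports at most 3 spatial dims (rank 5 - 2)

def encode_pixel_box(corner, n_spatial):
    # Zero-initialize all MAX_SPATIAL slots so the driver never reads
    # stale values, then fill the active entries in reversed order,
    # matching the snippet's fastest-varying-dimension-first convention.
    out = [0] * MAX_SPATIAL
    for i in range(n_spatial):
        out[i] = int(corner[n_spatial - 1 - i])
    return out

assert encode_pixel_box([-1, -2], 2) == [-2, -1, 0]
```

Unlike a `cdef int[3]` stack array, a Python list is always initialized, so the `[0] * MAX_SPATIAL` line is the part that would need to be mirrored in the Cython code (e.g. a loop setting all three entries to 0 before the reversal loop).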



Comment on lines +270 to +272
view = _get_validated_view(tensor)
desc._source_ref = tensor


Copilot AI Feb 25, 2026


TensorMapDescriptor stores _source_ref = tensor, but when tensor is a DLPack producer the pointer/metadata lifetime is governed by the DLPack capsule returned by __dlpack__(). Since the temporary StridedMemoryView (which holds the capsule and calls the deleter in __dealloc__) is not retained, the capsule can be released immediately, potentially invalidating globalAddress for exporters where the capsule owns the backing allocation. Store a strong reference to the StridedMemoryView (or at least its metadata capsule) instead of (or in addition to) the original tensor object.
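The hazard the reviewer describes can be reproduced with a toy stand-in (all class names here are hypothetical, not the PR's types): when the capsule owner, not the producer object, controls the allocation's lifetime, keeping only the producer is not enough.

```python
class FakeCapsule:
    # Stand-in for a DLPack capsule whose deleter frees the allocation.
    def __init__(self, log):
        self.log = log
        log.append("alloc")
    def __del__(self):
        self.log.append("free")

class FakeView:
    # Stand-in for StridedMemoryView: it owns the capsule.
    def __init__(self, capsule):
        self.capsule = capsule

producer = object()               # the original tensor object
log = []
view = FakeView(FakeCapsule(log))
source_ref = producer             # descriptor keeps only the producer...
del view                          # ...so the capsule is freed immediately
assert log == ["alloc", "free"]

log = []
view = FakeView(FakeCapsule(log))
source_ref = (producer, view)     # reviewer's fix: retain the view too
del view
assert log == ["alloc"]           # capsule stays alive via source_ref
```

This relies on CPython's immediate refcount-based destruction, which is the same mechanism that would release a real DLPack capsule held only by a temporary view.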

cdef TensorMapDescriptor desc = cls.__new__(cls)

view = _get_validated_view(tensor)
desc._source_ref = tensor

Copilot AI Feb 25, 2026


Same lifetime issue as from_tiled(): _source_ref = tensor does not necessarily keep the DLPack capsule alive, so the temporary StridedMemoryView (and its capsule) may be destroyed right after descriptor creation. Keep a reference to the StridedMemoryView/capsule to ensure the exported memory remains valid for the descriptor’s lifetime.

Suggested change
-desc._source_ref = tensor
+# Keep both the original tensor and the StridedMemoryView alive so that
+# any underlying DLPack capsule remains valid for the descriptor's lifetime.
+desc._source_ref = (tensor, view)

# Update the source reference only after the driver call succeeds,
# so we don't drop the old tensor (risking a dangling pointer in the
# CUtensorMap struct) if the call fails.
self._source_ref = tensor

Copilot AI Feb 25, 2026


replace_address() updates _source_ref = tensor, but it also creates a temporary StridedMemoryView that may own a DLPack capsule. If that capsule is released when the view is GC’d, the new address can become invalid for exporters where the capsule owns the memory. Consider storing the validated StridedMemoryView (or its capsule) on the descriptor after a successful replace, similar to descriptor creation.

Suggested change
-self._source_ref = tensor
+self._source_ref = tensor
+# Also retain the validated StridedMemoryView to ensure that any
+# owning DLPack capsule (and thus the underlying memory) stays alive
+# for as long as this descriptor uses its address.
+self._view_ref = view

Comment on lines +503 to +512
def from_im2col_wide(cls, tensor, pixel_box_lower_corner_width, pixel_box_upper_corner_width,
channels_per_pixel, pixels_per_column, *,
element_strides=None,
data_type=None,
interleave=TensorMapInterleave.NONE,
mode=TensorMapIm2ColWideMode.W,
swizzle=TensorMapSwizzle.SWIZZLE_128B,
l2_promotion=TensorMapL2Promotion.NONE,
oob_fill=TensorMapOOBFill.NONE):
"""Create an im2col-wide TMA descriptor from a tensor object.

Copilot AI Feb 25, 2026


from_im2col_wide() and TensorMapIm2ColWideMode rely on cydriver.CU_TENSOR_MAP_IM2COL_WIDE_MODE_* and cydriver.cuTensorMapEncodeIm2colWide. Those symbols may not exist in older cuda-bindings 12.x builds (depending on the minor version the bindings were generated from), which would cause build/import failures for cu12 wheels. Consider compile-time gating this feature (similar to other CUDA_CORE_BUILD_MAJOR guards) and/or tightening the minimum supported cuda-bindings version for cu12 if this API is required.
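One way to gate on symbol availability at runtime (illustrative only; the PR may instead use compile-time CUDA_CORE_BUILD_MAJOR guards as the reviewer suggests, and the module/attribute names in the commented-out line are assumptions):

```python
import importlib

def has_symbol(module_name, attr):
    # True only if the module imports cleanly AND exposes the named
    # attribute; older bindings may lack newer driver entry points.
    try:
        mod = importlib.import_module(module_name)
    except ImportError:
        return False
    return hasattr(mod, attr)

# e.g. only expose the wide im2col API when the encoder is present:
# HAVE_IM2COL_WIDE = has_symbol("cuda.bindings.driver",
#                               "cuTensorMapEncodeIm2colWide")
assert has_symbol("math", "sqrt")
```

Note this pattern only works for Python-level bindings; the `cydriver` symbols the comment mentions are resolved at Cython compile time, which is why a compile-time guard is the more direct fix there.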

@rparolin

/ok to test


@rparolin

/ok to test

@rparolin

/ok to test


Development

Successfully merging this pull request may close these issues.

  • Design the TensorMap object
  • EPIC: Support TMA descriptor
