
Add TMA TensorMapDescriptor support#1687

Open
rparolin wants to merge 11 commits into NVIDIA:main from rparolin:rparolin/tma_feature

Conversation


@rparolin rparolin commented Feb 24, 2026

Summary

  • Add TensorMapDescriptor Cython class wrapping the CUDA driver's CUtensorMap for Hopper+ TMA (Tensor Memory Accelerator) bulk data movement
  • Support tiled and im2col descriptor creation via from_tiled() and from_im2col() class methods, with automatic dtype inference, stride computation, and validation
  • Integrate TensorMapDescriptor as a first-class kernel argument in _kernel_arg_handler.pyx
  • Add comprehensive tests (test_tensor_map.py) and an example (tma_tensor_map.py)

Closes #199
Closes #200
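The "automatic stride computation" mentioned in the summary amounts to ordinary C-contiguous stride arithmetic. A minimal sketch of that arithmetic (the helper name is hypothetical, not the PR's actual code):

```python
def row_major_byte_strides(shape, itemsize):
    # Contiguous row-major (C-order) byte strides: the innermost
    # dimension advances by itemsize bytes, and each outer dimension
    # by the byte extent of everything nested inside it.
    strides = [0] * len(shape)
    acc = itemsize
    for i in range(len(shape) - 1, -1, -1):
        strides[i] = acc
        acc *= shape[i]
    return strides

# A (4, 8, 16) float32 tensor: 4-byte elements.
assert row_major_byte_strides((4, 8, 16), 4) == [512, 64, 4]
```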


copy-pr-bot bot commented Feb 24, 2026

Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.



rparolin and others added 2 commits February 24, 2026 15:55
…time

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove unused _alloc_device_tensor helper from tests
- Add test for rank > 5 (6D tensor) to verify upper bound validation
- Add NULL check for PyMem_Malloc in prepare_tensor_map_arg

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@rparolin rparolin self-assigned this Feb 25, 2026
@rparolin rparolin added this to the cuda.core v0.7.0 milestone Feb 25, 2026
Move the replace_address() demonstration into its own self-contained
example (tma_replace_address.py) so each file covers a single concept.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (2)

cuda_core/pixi.toml:67

  • Removing the cu12 environment from this subproject can break the repository’s top-level pixi run -e cu12 test workflow, which runs pixi run --manifest-path cuda_core test under the propagated PIXI_ENVIRONMENT_NAME=cu12. If cu12 testing is still expected at the workspace level, consider keeping a solvable cu12 environment here (e.g., using conda-forge cuda-bindings/cuda-version constraints instead of the path dependency) or updating the workspace test tasks to avoid selecting a missing environment.

# NOTE: cu12 environment is intentionally omitted because the path dependency
# to ../cuda_bindings (v13.1) makes it unsolvable locally. For cu12 testing,
# use conda-forge packages or CI workflows.
[environments]
default = { features = [
    "cu13",
    "test",
    "cython-tests",
], solve-group = "default" }
cu13 = { features = ["cu13", "test", "cython-tests"], solve-group = "default" }

cuda_core/cuda/core/_tensor_map.pyx:461

  • c_pixel_box_lower / c_pixel_box_upper are declared as fixed-size int[3] but only the first n_spatial entries are written. If the driver implementation reads all 3 entries (the API supports up to 3 spatial dims), the remaining uninitialized values can make encoding nondeterministic. Initialize the full arrays (e.g., set all 3 to 0 first) before filling the active elements.
        cdef uint64_t[5] c_global_dim
        cdef uint64_t[4] c_global_strides
        cdef uint32_t[5] c_element_strides
        cdef int[3] c_pixel_box_lower  # max 3 spatial dims (rank 5 - 2)
        cdef int[3] c_pixel_box_upper
        cdef int i_c

        for i_c in range(rank):
            c_global_dim[i_c] = <uint64_t>shape[rank - 1 - i_c]
            c_element_strides[i_c] = <uint32_t>element_strides[rank - 1 - i_c]

        for i_c in range(rank - 1):
            c_global_strides[i_c] = <uint64_t>byte_strides[rank - 2 - i_c]

        # Reverse spatial dimensions for lower/upper corners
        for i_c in range(n_spatial):
            c_pixel_box_lower[i_c] = <int>pixel_box_lower_corner[n_spatial - 1 - i_c]
            c_pixel_box_upper[i_c] = <int>pixel_box_upper_corner[n_spatial - 1 - i_c]
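The reviewer's suggested fix, zero-initializing the full fixed-size arrays before filling only the active entries, can be sketched in plain Python (lists stand in for the Cython `int[3]` buffers; the function name is illustrative):

```python
MAX_SPATIAL = 3  # CUtensorMap supports at most 3 spatial dims (rank 5 - 2)

def encode_pixel_box(corner, n_spatial):
    # Zero-initialize all MAX_SPATIAL slots so the driver never reads
    # stale values, then fill the active entries in reversed order,
    # matching the snippet's fastest-varying-dimension-first convention.
    out = [0] * MAX_SPATIAL
    for i in range(n_spatial):
        out[i] = int(corner[n_spatial - 1 - i])
    return out

assert encode_pixel_box([-1, -2], 2) == [-2, -1, 0]
```

Unlike a `cdef int[3]` stack array, a Python list is always initialized, so the `[0] * MAX_SPATIAL` line is the part that would need to be mirrored in the Cython code (e.g. a loop setting all three entries to 0 before the reversal loop).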



Comment on lines +270 to +272
view = _get_validated_view(tensor)
desc._source_ref = tensor


Copilot AI Feb 25, 2026


TensorMapDescriptor stores _source_ref = tensor, but when tensor is a DLPack producer the pointer/metadata lifetime is governed by the DLPack capsule returned by __dlpack__(). Since the temporary StridedMemoryView (which holds the capsule and calls the deleter in __dealloc__) is not retained, the capsule can be released immediately, potentially invalidating globalAddress for exporters where the capsule owns the backing allocation. Store a strong reference to the StridedMemoryView (or at least its metadata capsule) instead of (or in addition to) the original tensor object.
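The hazard the reviewer describes can be reproduced with a toy stand-in (all class names here are hypothetical, not the PR's types): when the capsule owner, not the producer object, controls the allocation's lifetime, keeping only the producer is not enough.

```python
class FakeCapsule:
    # Stand-in for a DLPack capsule whose deleter frees the allocation.
    def __init__(self, log):
        self.log = log
        log.append("alloc")
    def __del__(self):
        self.log.append("free")

class FakeView:
    # Stand-in for StridedMemoryView: it owns the capsule.
    def __init__(self, capsule):
        self.capsule = capsule

producer = object()               # the original tensor object
log = []
view = FakeView(FakeCapsule(log))
source_ref = producer             # descriptor keeps only the producer...
del view                          # ...so the capsule is freed immediately
assert log == ["alloc", "free"]

log = []
view = FakeView(FakeCapsule(log))
source_ref = (producer, view)     # reviewer's fix: retain the view too
del view
assert log == ["alloc"]           # capsule stays alive via source_ref
```

This relies on CPython's immediate refcount-based destruction, which is the same mechanism that would release a real DLPack capsule held only by a temporary view.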

cdef TensorMapDescriptor desc = cls.__new__(cls)

view = _get_validated_view(tensor)
desc._source_ref = tensor

Copilot AI Feb 25, 2026


Same lifetime issue as from_tiled(): _source_ref = tensor does not necessarily keep the DLPack capsule alive, so the temporary StridedMemoryView (and its capsule) may be destroyed right after descriptor creation. Keep a reference to the StridedMemoryView/capsule to ensure the exported memory remains valid for the descriptor’s lifetime.

Suggested change
-desc._source_ref = tensor
+# Keep both the original tensor and the StridedMemoryView alive so that
+# any underlying DLPack capsule remains valid for the descriptor's lifetime.
+desc._source_ref = (tensor, view)

# Update the source reference only after the driver call succeeds,
# so we don't drop the old tensor (risking a dangling pointer in the
# CUtensorMap struct) if the call fails.
self._source_ref = tensor

Copilot AI Feb 25, 2026


replace_address() updates _source_ref = tensor, but it also creates a temporary StridedMemoryView that may own a DLPack capsule. If that capsule is released when the view is GC’d, the new address can become invalid for exporters where the capsule owns the memory. Consider storing the validated StridedMemoryView (or its capsule) on the descriptor after a successful replace, similar to descriptor creation.

Suggested change
-self._source_ref = tensor
+self._source_ref = tensor
+# Also retain the validated StridedMemoryView to ensure that any
+# owning DLPack capsule (and thus the underlying memory) stays alive
+# for as long as this descriptor uses its address.
+self._view_ref = view

Comment on lines +503 to +512
def from_im2col_wide(cls, tensor, pixel_box_lower_corner_width, pixel_box_upper_corner_width,
channels_per_pixel, pixels_per_column, *,
element_strides=None,
data_type=None,
interleave=TensorMapInterleave.NONE,
mode=TensorMapIm2ColWideMode.W,
swizzle=TensorMapSwizzle.SWIZZLE_128B,
l2_promotion=TensorMapL2Promotion.NONE,
oob_fill=TensorMapOOBFill.NONE):
"""Create an im2col-wide TMA descriptor from a tensor object.

Copilot AI Feb 25, 2026


from_im2col_wide() and TensorMapIm2ColWideMode rely on cydriver.CU_TENSOR_MAP_IM2COL_WIDE_MODE_* and cydriver.cuTensorMapEncodeIm2colWide. Those symbols may not exist in older cuda-bindings 12.x builds (depending on the minor version the bindings were generated from), which would cause build/import failures for cu12 wheels. Consider compile-time gating this feature (similar to other CUDA_CORE_BUILD_MAJOR guards) and/or tightening the minimum supported cuda-bindings version for cu12 if this API is required.
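One way to gate on symbol availability at runtime (illustrative only; the PR may instead use compile-time CUDA_CORE_BUILD_MAJOR guards as the reviewer suggests, and the module/attribute names in the commented-out line are assumptions):

```python
import importlib

def has_symbol(module_name, attr):
    # True only if the module imports cleanly AND exposes the named
    # attribute; older bindings may lack newer driver entry points.
    try:
        mod = importlib.import_module(module_name)
    except ImportError:
        return False
    return hasattr(mod, attr)

# e.g. only expose the wide im2col API when the encoder is present:
# HAVE_IM2COL_WIDE = has_symbol("cuda.bindings.driver",
#                               "cuTensorMapEncodeIm2colWide")
assert has_symbol("math", "sqrt")
```

Note this pattern only works for Python-level bindings; the `cydriver` symbols the comment mentions are resolved at Cython compile time, which is why a compile-time guard is the more direct fix there.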

@rparolin

/ok to test


@rparolin

/ok to test

@rparolin

/ok to test


Development

Successfully merging this pull request may close these issues.

  • Design the TensorMap object
  • EPIC: Support TMA descriptor
