This repository was archived by the owner on Feb 24, 2026. It is now read-only.

[Feature Request] Lower Vectorized Loop Pass should be enhanced to adapt layout inference #258

@LeiWang1999

Description

When using TileLang, layout inference may implicitly apply layout transformations such as swizzling or padding. This approach is efficient and powerful, but it can sometimes cause vectorization to crash.

Consider dequantize GEMM on Volta:

for v in T.vectorized(0, 8):
    index = i * threads * local_size + tx * local_size + v
    vi = index // block_K
    vj = index % block_K
    B_dequantize_shared[vi, vj] = B_dequantize_local[v]

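On Volta, applying a swizzle changes the shared-memory layout so that only groups of 4 elements remain contiguous, instead of 8. The swizzle is meant to improve memory coalescing and data locality, but it also means the 8-wide vectorized store in the loop over `v` no longer maps to a single contiguous access, so it cannot be lowered as written.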
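To make the contiguity argument concrete, here is a minimal pure-Python sketch of how a pass could determine the widest vector that a layout still supports. The XOR swizzle in `swizzled_offset` is a toy layout chosen for illustration, not the layout TileLang actually infers on Volta, and both function names are hypothetical.

```python
def swizzled_offset(row, col, block_k=32, group=4):
    """Toy XOR swizzle (an assumption, not TileLang's real layout):
    permute groups of `group` elements within each row."""
    swizzled_group = (col // group) ^ (row % (block_k // group))
    return row * block_k + swizzled_group * group + col % group


def max_contiguous_width(offset_fn, rows, block_k, requested):
    """Return the largest power-of-two width <= requested such that every
    aligned run of that many logical columns maps to consecutive offsets."""
    width = requested
    while width > 1:
        contiguous = all(
            offset_fn(r, c + k) == offset_fn(r, c) + k
            for r in range(rows)
            for c in range(0, block_k, width)
            for k in range(width)
        )
        if contiguous:
            return width
        width //= 2
    return 1


# An 8-wide request degrades to 4 under the toy swizzle,
# while an unswizzled row-major layout keeps the full width.
print(max_contiguous_width(swizzled_offset, rows=8, block_k=32, requested=8))
print(max_contiguous_width(lambda r, c: r * 32 + c, rows=8, block_k=32, requested=8))
```

A real pass would presumably derive this width symbolically from the inferred layout rather than by enumeration; the brute-force check here only illustrates the condition being tested.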

We should enhance the lower vectorized loop pass to automatically convert the vectorized loop into:

for ov in T.serial(0, local_size // 4):
    for iv in T.vectorized(0, 4):
        index = (
            i * threads * local_size
            + tx * local_size
            + ov * 4
            + iv
        )
        vi = index // block_K
        vj = index % block_K
        B_dequantize_shared[vi, vj] = B_dequantize_local[ov * 4 + iv]
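The rewrite should not change which elements are stored where; it only regroups them into narrower vectors. The following plain-Python sketch checks this by enumerating the stores of both loop forms. The helper names and the concrete values (`threads=128`, `local_size=8`, `block_k=32`) are assumptions for illustration only.

```python
def original_stores(i, tx, threads, local_size, block_k):
    """Enumerate ((vi, vj), local_index) pairs for the single vectorized loop."""
    stores = []
    for v in range(local_size):
        index = i * threads * local_size + tx * local_size + v
        stores.append(((index // block_k, index % block_k), v))
    return stores


def split_stores(i, tx, threads, local_size, block_k, vec=4):
    """Enumerate the same pairs for the serial outer / vectorized inner form."""
    assert local_size % vec == 0, "local_size must be a multiple of vec"
    stores = []
    for ov in range(local_size // vec):      # serial outer loop
        for iv in range(vec):                # vectorized inner loop
            index = i * threads * local_size + tx * local_size + ov * vec + iv
            stores.append(((index // block_k, index % block_k), ov * vec + iv))
    return stores


# Both forms issue identical stores in identical order for every thread.
for i in range(2):
    for tx in range(128):
        assert original_stores(i, tx, 128, 8, 32) == split_stores(i, tx, 128, 8, 32)
print("split loop matches original loop")
```

The split is only valid when `local_size` is a multiple of the inner width, which the assertion in `split_stores` makes explicit.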

Labels: enhancement (New feature or request)
