[Feature Request] Lower Vectorized Loop Pass should be enhanced to adapt layout inference #258

In TileLang, layout transformations such as `swizzling` or `padding` are applied implicitly. This approach is efficient and powerful, but it can sometimes cause vectorization to crash.

Consider dequantize GEMM on Volta:

```python
for v in T.vectorized(0, 8):
    index = i * threads * local_size + tx * local_size + v
    vi = index // block_K
    vj = index % block_K
    B_dequantize_shared[vi, vj] = B_dequantize_local[v]
```

On Volta, applying a swizzle changes the memory layout so that it aligns with groups of 4 elements instead of 8. The swizzle improves memory coalescing and data locality, but the 8-wide vectorized loop above no longer matches the layout.
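To illustrate the mismatch, here is a minimal plain-Python sketch (not TileLang code) of how a pass could derive the usable vector width from a layout. `toy_swizzle`, `identity_layout`, and `max_vector_width` are hypothetical names invented for this example, and the XOR pattern is an assumed stand-in, not the actual Volta swizzle.

```python
def toy_swizzle(index, block_k=32):
    """Hypothetical swizzle: on odd rows, swap the two 4-element halves
    of every 8-element group. NOT the real Volta layout."""
    row, col = divmod(index, block_k)
    return row * block_k + (col ^ (4 * (row % 2)))


def identity_layout(index):
    """Layout with no swizzle applied."""
    return index


def max_vector_width(layout, num_elements, requested):
    """Largest power-of-two width <= requested such that every aligned
    group of `width` logical elements stays contiguous after `layout`."""
    width = requested
    while width > 1:
        contiguous = all(
            layout(start + k) == layout(start) + k
            for start in range(0, num_elements, width)
            for k in range(width)
        )
        if contiguous:
            return width
        width //= 2  # fall back to the next smaller power of two
    return 1


# Two rows of 32 elements each, requesting an 8-wide vector access.
print(max_vector_width(identity_layout, 64, 8))
print(max_vector_width(toy_swizzle, 64, 8))
```

With the assumed swizzle, only groups of 4 stay contiguous, so the requested width of 8 has to be reduced to 4.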

The lower vectorize pass should be enhanced to automatically convert the vectorized loop into:

```python
for ov in T.serial(0, local_size // 4):
    for iv in T.vectorized(0, 4):
        index = (
            i * threads * local_size
            + tx * local_size
            + ov * 4
            + iv
        )
        vi = index // block_K
        vj = index % block_K
        B_dequantize_shared[vi, vj] = B_dequantize_local[ov * 4 + iv]
```
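As a sanity check on the proposed rewrite, the following plain-Python sketch (not TileLang code) enumerates the accesses of both loop forms and confirms they touch the same elements in the same order. The helper names and the concrete values for `threads`, `local_size`, and `block_k` are assumptions chosen for illustration.

```python
def original_accesses(i, tx, threads, local_size, block_k):
    """(vi, vj, local index) triples for the single vectorized loop."""
    accesses = []
    for v in range(local_size):
        index = i * threads * local_size + tx * local_size + v
        accesses.append((index // block_k, index % block_k, v))
    return accesses


def split_accesses(i, tx, threads, local_size, block_k, vec=4):
    """Same triples for the serial outer loop plus `vec`-wide inner loop."""
    if local_size % vec != 0:
        raise ValueError("local_size must be a multiple of vec")
    accesses = []
    for ov in range(local_size // vec):
        for iv in range(vec):
            index = i * threads * local_size + tx * local_size + ov * vec + iv
            accesses.append((index // block_k, index % block_k, ov * vec + iv))
    return accesses


# Assumed example sizes: 128 threads, 8 elements per thread, block_K = 32.
for i in range(2):
    for tx in range(128):
        assert original_accesses(i, tx, 128, 8, 32) == split_accesses(i, tx, 128, 8, 32)
print("split loop matches the original access pattern")
```

The split only changes how the accesses are grouped into vector operations, so the pass can apply it without altering what is written to `B_dequantize_shared`.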
