A step-by-step GEMM optimization series: each installment introduces one optimization concept and compares the resulting performance.
CUTLASS uses an abstract Layout concept to express the mapping rules from logical indices to physical indices.
The CUTLASS repository ships many examples to learn from; this time, 13_two_tensor_op_fusion is introduced.
Warp-level GEMMs may be implemented either by Tensor Cores issuing mma.sync or wmma instructions, or by thread-level matrix computations issued to CUDA cores. wmma is a CUDA C++ API for using Tensor Cores; if you want to use Tensor Cores through mma.sync, you must write inline PTX via asm.
I always wondered why CUTLASS provides many kinds of GEMM implementations instead of just one. In my opinion, the best GEMM implementation differs from situation to situation, and that is what distinguishes CUTLASS from cuBLAS: you can build your own customized GEMM implementation to get the best performance for your case.
CUTLASS 3.0 introduces a new core library, CuTe, to describe and manipulate layouts of threads and data.
learn cutlass is a series of tutorials for learning CUTLASS by reading its examples and source code.
CUTLASS is a header-only template library; reading it cold, you can easily get lost in the templates.