learn-cutlass-5

CUTLASS uses abstract layouts to express the mapping from logical indices to physical indices.
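To make the idea concrete, here is a minimal sketch in plain C++ of a layout as a function object that turns a logical (row, column) coordinate into a physical offset. The names `Coord`, `RowMajorLayout`, `ColumnMajorLayout`, and `load` are hypothetical and chosen for illustration; they are not the CUTLASS types, although the approach is assumed to be similar in spirit.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical logical coordinate of a matrix element (not a CUTLASS type).
struct Coord {
    std::size_t row;
    std::size_t col;
};

// Row-major: elements of one row are adjacent in memory.
struct RowMajorLayout {
    std::size_t ld;  // leading dimension: elements between consecutive rows
    std::size_t operator()(Coord c) const { return c.row * ld + c.col; }
};

// Column-major: elements of one column are adjacent in memory.
struct ColumnMajorLayout {
    std::size_t ld;  // leading dimension: elements between consecutive columns
    std::size_t operator()(Coord c) const { return c.col * ld + c.row; }
};

// Generic code only needs "something callable on a Coord", so the same
// function works with either layout.
template <typename Layout>
float load(const float* data, Layout layout, Coord c) {
    return data[layout(c)];
}
```

For a 3x4 matrix, `RowMajorLayout{4}` maps coordinate (1, 2) to offset 6, while `ColumnMajorLayout{3}` maps the same coordinate to offset 7. Code written against `load` does not need to know which one is in use.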

Read more

learn-cutlass-3

Warp-level GEMMs may be implemented either by Tensor Cores issuing mma.sync or wmma instructions, or by thread-level matrix computations issued to CUDA cores. WMMA is a CUDA C++ API for using Tensor Cores; if you want to use Tensor Cores through mma.sync, you must write PTX with inline asm.

Read more

learn-cutlass-2

I always wondered why CUTLASS provides many implementations of GEMM instead of just one. In my opinion, the best GEMM implementation differs from one situation to another, and that is what sets CUTLASS apart from cuBLAS: you can build your own customized GEMM implementation to get the best performance.

Read more

learn-cutlass-1

CUTLASS 3.0 introduces a new library, CuTe, to describe and manipulate tensors of threads and data.
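As a rough illustration of how one abstraction can describe both threads and data, the sketch below models a 2-D layout as a shape plus a stride per mode. `Layout2D` is a hypothetical plain C++ type written for this example, not the CuTe API; it only assumes the convention that a linear index is split into a coordinate with the first mode varying fastest.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical 2-D layout: a shape and a stride per mode (not the CuTe API).
struct Layout2D {
    std::size_t shape[2];
    std::size_t stride[2];

    // Coordinate -> offset: inner product of the coordinate with the strides.
    std::size_t operator()(std::size_t i, std::size_t j) const {
        return i * stride[0] + j * stride[1];
    }

    // Linear index -> offset: split the index into a coordinate with the
    // first mode varying fastest, then apply the strides.
    std::size_t operator()(std::size_t idx) const {
        return (*this)(idx % shape[0], (idx / shape[0]) % shape[1]);
    }
};
```

With shape (4, 8) and stride (8, 1), index 5 becomes coordinate (1, 1) and then offset 9. With stride (1, 4) the same index maps to offset 5. The same type can be read as "which element does thread 5 own" or "where does element 5 live in memory", depending on what the index stands for.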

Read more

learn-cutlass-0

learn-cutlass is a series of tutorials for learning CUTLASS by reading its examples and source code.

CUTLASS is a header-only template library. After reading its source, you will be lost in templates.

Read more