learn-cutlass-1
CUTLASS 3.0 introduces a new library, CuTe, to describe and manipulate tensors of threads and data.
Different types of GEMM
TYPE of GEMM | BITS of DATA | TYPE of DATA |
---|---|---|
HGEMM | 16 | floating-point number |
SGEMM | 32 | floating-point number |
DGEMM | 64 | floating-point number |
IGEMM | 8 or 16 or 32 or 64 | integer |
RowMajorInterleaved
ColumnMajorInterleaved
RowMajorInterleaved is a layout that confused me. I didn’t know what Interleaved meant, so I created an example to figure it out.
If tensor A is a simple RowMajor, the output should be this:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
In my opinion, Interleaved means it will iterate along shape(1) in steps of size Interleave and then iterate along shape(0).
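One way to make this concrete is to write down the offset computation by hand. The sketch below assumes the mapping is offset = (row / Interleave) * stride + column * Interleave + row % Interleave, which is my reading of cutlass/layout/matrix.h. The names interleaved_offset, packed_stride and store_interleaved are hypothetical, and the code does not depend on CUTLASS.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Offset of logical element (row, col) in a row-major interleaved layout.
// Assumption: this mirrors the mapping used by CUTLASS's RowMajorInterleaved.
template <int Interleave>
constexpr std::int64_t interleaved_offset(int row, int col, int stride) {
  return std::int64_t(row / Interleave) * stride   // which group of rows
         + std::int64_t(col) * Interleave          // column inside the group
         + row % Interleave;                       // row inside the group
}

// Stride of a tightly packed layout: one group of rows spans
// columns * Interleave elements.
template <int Interleave>
constexpr int packed_stride(int columns) {
  return columns * Interleave;
}

// Stores the logical matrix 1, 2, 3, ... (numbered row by row) and returns
// the resulting linear memory, so the interleaving becomes visible.
template <int Interleave>
std::vector<int> store_interleaved(int rows, int cols) {
  assert(rows % Interleave == 0);  // keep this sketch free of padding
  std::vector<int> memory(static_cast<std::size_t>(rows) * cols);
  int value = 1;
  for (int r = 0; r < rows; ++r) {
    for (int c = 0; c < cols; ++c) {
      memory[interleaved_offset<Interleave>(r, c, packed_stride<Interleave>(cols))] = value++;
    }
  }
  return memory;
}

// With Interleave = 2, element (1, 0) sits directly after element (0, 0).
static_assert(interleaved_offset<2>(1, 0, packed_stride<2>(4)) == 1, "");
static_assert(interleaved_offset<2>(0, 1, packed_stride<2>(4)) == 2, "");
```

Under that assumption, store_interleaved<2>(4, 4) places rows 0 and 1 in the first eight elements, alternating between the two rows, and rows 2 and 3 in the last eight.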
Another thing to keep in mind is that Interleaved may cause the matrix to be padded.
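A rough sketch of where the padding comes from: rows are stored in groups of Interleave, so the row count is effectively rounded up to a multiple of Interleave. The helper name interleaved_capacity is hypothetical, and the formula is an assumption based on how a packed interleaved layout sizes its storage.

```cpp
#include <cassert>
#include <cstdint>

// Number of elements a packed row-major interleaved matrix occupies.
// Assumption: rows are rounded up to a whole number of groups.
template <int Interleave>
constexpr std::int64_t interleaved_capacity(int rows, int cols) {
  return std::int64_t((rows + Interleave - 1) / Interleave)  // groups, rounded up
         * cols * Interleave;                                // elements per group
}

// A 4x4 matrix needs no padding with Interleave = 2.
static_assert(interleaved_capacity<2>(4, 4) == 16, "");
// A 3x4 matrix has 12 elements but occupies 16: the 3 rows are padded to 4.
static_assert(interleaved_capacity<2>(3, 4) == 16, "");
// With Interleave = 4, a 5x3 matrix is padded from 5 rows to 8.
static_assert(interleaved_capacity<4>(5, 3) == 24, "");
```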
typename in C++
In cutlass, you will see typename everywhere. Obviously, you can use typename when declaring a template parameter. But it has another use:
The following is generated by chatgpt
template<typename T>
In C++, the typename keyword is used to specify that a dependent name is a type. A dependent name is a name that depends on a template parameter, and the compiler cannot determine whether the name refers to a type or some other entity until the template is instantiated with concrete types.
Consider the above example.
Here, T::iterator is a dependent name because it depends on the template parameter T. Without the typename keyword, the compiler would not be able to determine whether T::iterator is a type or a member variable until T is instantiated with a concrete type. The typename keyword tells the compiler that T::iterator is a type.
Note that typename is only required when a dependent name is used as a type. If the name does not depend on a template parameter, typename is unnecessary, and if the name does not refer to a type, typename must not be used.
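A minimal sketch of a dependent type name, to make the rule above concrete. The helper first_element is hypothetical and only serves as an illustration.

```cpp
#include <cassert>
#include <list>
#include <vector>

// Hypothetical helper: returns a copy of the first element of a container.
// T::value_type in the return type depends on T, so it needs `typename`.
template <typename T>
typename T::value_type first_element(const T& container) {
  // T::const_iterator also depends on T; `typename` tells the compiler
  // that it names a type rather than a static member.
  typename T::const_iterator it = container.begin();
  assert(it != container.end());  // the container must not be empty
  return *it;
}
```

The same function works for std::vector<int> and std::list<char>, because the compiler only looks up T::const_iterator once T is known.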
ThreadblockSwizzle
The following is generated by chatgpt
In Cutlass, a ThreadblockSwizzle is a function that maps the blockIdx (block index) of the CUDA grid to the tile of the GEMM (General Matrix Multiply) problem that the threadblock computes. Changing this mapping changes the order in which output tiles are assigned to threadblocks, so that threadblocks running at the same time tend to work on neighbouring tiles and can reuse the same parts of A and B from cache.
A ThreadblockSwizzle does not change how threads are organized inside a threadblock; it only decides which tile each threadblock works on. Cutlass provides several such functions, for example GemmIdentityThreadblockSwizzle and GemmBatchedIdentityThreadblockSwizzle.
Overall, a suitable ThreadblockSwizzle can improve the performance of matrix-multiplication operations on GPUs, particularly for large problems where data reuse between threadblocks matters.
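To illustrate the idea of remapping block indices to tiles, here is a toy host-side sketch. It is not one of the swizzle functions shipped with Cutlass; TileCoord, swizzled_tile and the grouping scheme are assumptions made up for this example.

```cpp
#include <algorithm>
#include <cassert>

// Tile position in the output matrix, measured in tiles (not elements).
struct TileCoord {
  int m;  // tile row
  int n;  // tile column
};

// Maps a linear block index to a tile. Tile rows are split into bands of
// `group_m` rows; consecutive blocks walk down a band before moving to the
// next tile column, so neighbouring block indices share tiles of A and B.
inline TileCoord swizzled_tile(int block_id, int tiles_m, int tiles_n, int group_m) {
  assert(group_m > 0 && block_id >= 0 && block_id < tiles_m * tiles_n);
  int blocks_per_band = group_m * tiles_n;
  int first_m = (block_id / blocks_per_band) * group_m;   // first tile row of the band
  int band_rows = std::min(group_m, tiles_m - first_m);   // the last band may be shorter
  int local = block_id % blocks_per_band;                 // index inside the band
  return TileCoord{first_m + local % band_rows, local / band_rows};
}
```

With a 4x4 grid of tiles and group_m = 2, blocks 0, 1, 2, 3 map to tiles (0,0), (1,0), (0,1), (1,1) instead of walking along a single tile row.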
Comparison between typedef and using
The following is generated by chatgpt
In C++, both typedef and using can be used to create aliases for types. Here are some key differences between the two:
Syntax
typedef <existing_type> <new_type_alias>;
Scope
typedef: A type alias created using typedef is visible in the scope where it is declared (namespace, class, or block).
using: A type alias created using using follows exactly the same scoping rules. An alias declaration has the same semantics as the equivalent typedef.
Functionality
typedef: typedef can only be used to create aliases for types. It cannot be used to create alias templates.
using: using can be used to create both type aliases and alias templates. This makes it more versatile than typedef.
Readability
typedef: The new name is buried inside the declaration, which is hard to read for function pointers and arrays, e.g. typedef void (*Handler)(int);
using: The new name comes first, followed by = and the type, e.g. using Handler = void (*)(int);
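A short sketch comparing the two forms side by side. All alias names below are hypothetical and chosen only for illustration.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <type_traits>
#include <vector>

// Both declarations name exactly the same type.
typedef std::vector<int> IntVecTypedef;
using IntVecUsing = std::vector<int>;
static_assert(std::is_same<IntVecTypedef, IntVecUsing>::value, "");

// Function pointer aliases: the `using` form puts the new name first.
typedef void (*HandlerTypedef)(int);
using HandlerUsing = void (*)(int);
static_assert(std::is_same<HandlerTypedef, HandlerUsing>::value, "");

// Alias template: only possible with `using`.
template <typename Value>
using StringMap = std::map<std::string, Value>;

// With typedef, the usual workaround is a nested type inside a class template.
template <typename Value>
struct StringMapOf {
  typedef std::map<std::string, Value> type;
};
static_assert(std::is_same<StringMap<int>, StringMapOf<int>::type>::value, "");
```

The alias template can be used directly as StringMap<int>, while the typedef workaround needs StringMapOf<int>::type (and typename in front of it when Value is itself a template parameter).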
05_batched_gemm
This example demonstrates how to use cutlass to compute a batched strided gemm in two different ways:
- By specifying pointers to the first matrices of the batch and the stride between the consecutive matrices of the batch (this is called a strided batched gemm).
- By copying pointers to all matrices of the batch to the device memory (this is called an array gemm).
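The difference between the two ways can be sketched with a naive CPU reference. This is not the Cutlass API: gemm_one, gemm_strided_batched and gemm_array are hypothetical names, and the sketch assumes tightly packed row-major matrices for simplicity.

```cpp
#include <cassert>
#include <cstddef>

// C = A * B for a single problem; A is m x k, B is k x n, C is m x n.
inline void gemm_one(const float* A, const float* B, float* C, int m, int n, int k) {
  assert(A != nullptr && B != nullptr && C != nullptr);
  for (int i = 0; i < m; ++i) {
    for (int j = 0; j < n; ++j) {
      float acc = 0.0f;
      for (int p = 0; p < k; ++p) acc += A[i * k + p] * B[p * n + j];
      C[i * n + j] = acc;
    }
  }
}

// Strided batched gemm: one base pointer per operand plus the distance
// (in elements) between consecutive matrices of the batch.
inline void gemm_strided_batched(const float* A, std::ptrdiff_t stride_a,
                                 const float* B, std::ptrdiff_t stride_b,
                                 float* C, std::ptrdiff_t stride_c,
                                 int batch, int m, int n, int k) {
  for (int b = 0; b < batch; ++b) {
    gemm_one(A + b * stride_a, B + b * stride_b, C + b * stride_c, m, n, k);
  }
}

// Array gemm: one pointer per matrix, so the matrices may live anywhere.
inline void gemm_array(const float* const* A, const float* const* B, float* const* C,
                       int batch, int m, int n, int k) {
  for (int b = 0; b < batch; ++b) {
    gemm_one(A[b], B[b], C[b], m, n, k);
  }
}
```

The strided form needs only three pointers and three strides, while the array form needs three pointer arrays, which is why the example has to copy those arrays to device memory first.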
The example simply calls two APIs, cutlass::gemm::device::GemmArray and cutlass::gemm::device::GemmBatched. So I think it is time to read the source of cutlass.
GemmArray
Let’s take GemmArray as an example.
// GemmArray is defined in following file
See, it is not very complicated. The class GemmArray is just built from many template parameters (the context of the class) and overloads operator() to launch cutlass::Kernel. Then the question is: what is cutlass::Kernel?
It is just a kernel template. So the important part is the Operator of cutlass::Kernel, which here is cutlass::gemm::kernel::GemmArray.
So operator() is the core of a class/struct in cutlass, and everything else is the context of that class/struct.
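The pattern can be sketched in plain host C++. This is only an analogue of a generic kernel template, not the actual Cutlass source: kernel_entry and ScaleOperator are hypothetical, and a real kernel would be a __global__ function that uses GPU shared memory.

```cpp
#include <cassert>

// Hypothetical operator: Params and SharedStorage are the "context",
// operator() holds the actual work.
struct ScaleOperator {
  struct Params {
    float* data;
    int count;
    float factor;
  };
  struct SharedStorage {
    float scratch;  // stands in for GPU shared memory
  };

  void operator()(const Params& params, SharedStorage& storage) const {
    assert(params.data != nullptr);
    storage.scratch = params.factor;
    for (int i = 0; i < params.count; ++i) params.data[i] *= storage.scratch;
  }
};

// Host-side analogue of a generic kernel entry point: it knows nothing about
// the computation and only forwards the context to Operator::operator().
template <typename Operator>
void kernel_entry(typename Operator::Params params) {
  typename Operator::SharedStorage shared_storage{};
  Operator op;
  op(params, shared_storage);
}
```

Calling kernel_entry<ScaleOperator>(params) runs the operator on its parameters. Swapping in a different Operator type changes the computation without touching the entry point.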