learn-cutlass-1

In cutlass 3.0, it introduces a new library, Cute, to describe and manipulate tensors of threads and data.

Different types of GEMM

TYPE of GEMM BITS of DATA TYPE of DATA
HGEMM 16 floating-point number
SGEMM 32 floating-point number
DGEMM 64 floating-point number
IGEMM 8 or 16 or 32 or 64 integer

RowMajorInterleaved

ColumnMajorInterleaved

1
2
#include "cutlass/layout/matrix.h"
template<int Interleave> struct cutlass::layout::RowMajorInterleaved<Interleave>;

RowMajorInterleaved is a layout which confused me. I didn’t know the meaning of Interleaved.So I create an example to figure it out.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36

#include <iostream>
#include <cstdio>

// Defines cutlass::layout::RowMajorInterleave
#include "cutlass/layout/matrix.h"

// Defines cutlass::HostTensor<>
#include "cutlass/util/host_tensor.h"

// Defines cutlass::MatrixCoord
#include "cutlass/matrix_coord.h"

#define M 4
#define N 4

int main(){
cutlass::HostTensor<int,cutlass::layout::RowMajorInterleaved<2> > A(cutlass::MatrixCoord(M,N));

int num = 0;
for(int i=0;i<M;i++)
for(int j=0;j<N;j++){
A.at({i,j}) = ++num;
}

int *A_ = A.host_data();
for(int i=0;i<A.capacity();i++){
printf("%3d ",A_[i]);
// if((i+1)%N==0)printf("\n");
}
/**
* output:
* 1 5 2 6 3 7 4 8 9 13 10 14 11 15 12 16
*
*/
}

If tensor A is a simple RowMajor, the output should be this

1
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

In my opinion, Interleaved means it will iterate in shape(1) with size Interleave and then iterate in shape(0).
Other things need to mind is Interleaved may cause padding of a matrix, like

1
2
3
4
5
6
7
8
9
10
11
#define M 3
#define N 3
cutlass::HostTensor<int,cutlass::layout::RowMajorInterleaved<2> > A(cutlass::MatrixCoord(M,N));
int num = 0;
for(int i=0;i<M;i++)
for(int j=0;j<N;j++){
A.at({i,j}) = ++num;
}
/**
* the element in A should be
* 1 4 2 5 3 6 7 0 8 0 9 0

typename in C++

In cutlass, you will see typename everywhere. Obviously, you can use typename when building template. But it has other usage, such as

The following is generated by chatgpt

1
2
3
4
5
template<typename T>
void foo() {
typename T::iterator it;
// ...
}

In C++, the typename keyword is used to specify that a dependent name is a type. A dependent name is a name that depends on a template parameter, and the compiler cannot determine whether the name refers to a type or some other entity until the template is instantiated with concrete types.

Consider the above example.

Here, T::iterator is a dependent name because it depends on the template parameter T. Without the typename keyword, the compiler would not be able to determine whether T::iterator is a type or a member variable until T is instantiated with a concrete type. The typename keyword tells the compiler that T::iterator is a type.

Note that typename is only needed when using a dependent name as a type. If the name does not depend on a template parameter, or if it is not used as a type, the typename keyword should not be used.

ThreadblockSwizzle

The following is generated by chatgpt

In Cutlass, ThreadblockSwizzle is a feature that allows for different threadblock configurations to be used when performing matrix-multiplication operations. ThreadblockSwizzle can be used to optimize the performance of GEMM (General Matrix Multiply) operations on GPUs, by mapping the threadblocks to the data in a way that maximizes data reuse and minimizes thread divergence.

ThreadblockSwizzle functions in Cutlass can be used to map the blockIdx (block index) of the CUDA grid to GEMM problems. These functions implement different strategies for organizing the threads within a threadblock, such as tiling, padding, and transposing, in order to improve the efficiency of memory access patterns and to maximize the utilization of GPU resources.

Overall, ThreadblockSwizzle in Cutlass is a powerful feature that can significantly improve the performance of matrix-multiplication operations on GPUs, particularly for large-scale problems.

Comparison between typedef and using

The following is generated by chatgpt

In C++, both typedef and using can be used to create aliases for types. Here are some key differences between the two:

Syntax

1
2
typedef <existing_type> <new_type_alias>;
using <new_type_alias> = <existing_type>;

Scope

typedef: A type alias created using typedef has the same scope as the original type. This means that it is accessible anywhere that the original type is accessible.

using: A type alias created using using has a narrower scope than the original type. It is only accessible within the block in which it is defined.

Functionality

typedef: typedef can only be used to create aliases for types. It cannot be used to create template aliases.

using: using can be used to create both type aliases and template aliases. This makes it more versatile than typedef.

Readability

typedef: typedef can only be used to create aliases for types. It cannot be used to create template aliases.

using: using can be used to create both type aliases and template aliases. This makes it more versatile than typedef.

05_batched_gemm

This example demonstrates how to use cutlass to compute a batched strided gemm in two different ways:

  • By specifying pointers to the first matrices of the batch and the stride between the consecutive matrices of the batch (this is called a strided batched gemm).
  • By copying pointers to all matrices of the batch to the device memory (this is called an array gemm).

In the example, it simply calls two APIs which is cutlass::gemm::device::GemmArray and cutlass::gemm::device::GemmBatched. So I think it is time to read the source of cutlass.

GemmArray

Let’s take GemmArray as an example.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
// GemmArray is defined in following file
#include "cutlass/gemm/device/gemm_array.h"

// simplified defination of GemmArray
temlate<
typename ElementA_,
typename LayoutA_,
typename ElementB_,
typename LayoutB_,
typename ElementC_,
typename LayoutC_
//...
>
class GemmArray{
public:
// ignore some detailed attribute and functions
using GemmKernel = kernel::GemmArray<typename DefaultGemmKernel::Mma, typename DefaultGemmKernel::Epilogue, ThreadblockSwizzle>;

Status run(cudaStream_t stream = nullptr) {
// ignore some detailed codes
cutlass::Kernel<GemmKernel><<<grid, block, smem_size, stream>>>(params_);
}

// overload operator () for calling gemm_op(...)
Status operator()(cudaStream_t stream = nullptr) {
return run(stream);
}

};

See, it is not very complicated. The class GemmArray is just built with many templates(the context of a class) and overloads operator () to call cutlass::Kernel. Then the question is coming. What is cutlass:Kernal?

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
#include "cutlass/device_kernel.h"

/// Generic CUTLASS kernel template.
template <typename Operator>
__global__
void Kernel(typename Operator::Params params) {
// Dynamic shared memory base pointer
extern __shared__ int SharedStorageBase[];

// Declare pointer to dynamic shared memory.
typename Operator::SharedStorage *shared_storage =
reinterpret_cast<typename Operator::SharedStorage *>(SharedStorageBase);

Operator op;

op(params, *shared_storage);
};

It is just a kernel template. So the important is Opearator of cutlass::Kernal which stands for cutlass::gemm::kernel::GemmArray.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
#include "cutlass/gemm/kernel/gemm_array.h"

template <
typename Mma_, ///! Threadblock-scoped matrix multiply-accumulate
typename Epilogue_, ///! Epilogue
typename ThreadblockSwizzle_ ///! Threadblock swizzling function
>
struct GemmArray{
// ignore some detailed attribute and functions
CUTLASS_DEVICE
void operator()(Params const &params, SharedStorage &shared_storage) {
// codes run on device
}
};

So operator() is the core of class/struct in cutlass. And all the others are the context of that class/struct.

Author

TianYu GUO

Posted on

2023-03-21

Updated on

2024-10-28

Licensed under

Comments