Tensor Parallelism Demo (Matrix Multiplication)
In Large Language Models, the weight matrices can be too large to fit on a single GPU. Tensor Parallelism splits these matrices across multiple GPUs.
In this demo, we simulate Column Parallelism: the weight matrix $W$ is split column-wise, so GPU 1 computes the first half of the output columns and GPU 2 computes the second half. Finally, the partial results are concatenated along the column dimension (an All-Gather).
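This works because matrix multiplication distributes over the column blocks of $W$:

$$Y = X W = X \begin{bmatrix} W_1 & W_2 \end{bmatrix} = \begin{bmatrix} X W_1 & X W_2 \end{bmatrix}$$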
Full problem: Input $X$ (4x4) × Weights $W$ (4x4)

GPU 1: copy of $X$ × split $W_1$ (4x2) = partial $Y_1$ (4x2)
GPU 2: copy of $X$ × split $W_2$ (4x2) = partial $Y_2$ (4x2)

Gather results ⬇ Final output $Y$ (4x4)
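Below is a minimal NumPy sketch of what the demo is doing (NumPy and the random example values are assumptions for illustration; the actual demo may be implemented differently). The two GPUs are simulated by slicing $W$ into column halves, computing the partial products independently, and concatenating the results.

```python
import numpy as np

# Example 4x4 input and 4x4 weight matrix (random values for illustration).
np.random.seed(0)
X = np.random.randn(4, 4)
W = np.random.randn(4, 4)

# Column Parallelism: split W column-wise into two 4x2 shards,
# one per simulated GPU.
W1, W2 = np.split(W, 2, axis=1)   # W1 -> "GPU 1", W2 -> "GPU 2"

# Each "GPU" holds a full copy of X and computes its partial output.
Y1 = X @ W1   # Partial Y_1 (4x2)
Y2 = X @ W2   # Partial Y_2 (4x2)

# All-Gather: concatenate the partial outputs along the column dimension.
Y = np.concatenate([Y1, Y2], axis=1)   # Final output Y (4x4)

# Sanity check: the sharded result matches the unsharded matmul.
assert np.allclose(Y, X @ W)
print(Y)
```

The key property is that neither simulated GPU ever needs the other's shard of $W$; only the final concatenation (the All-Gather) requires communication.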