
In this talk, we introduce a new PyTorch-based architecture for TensorRT-LLM that significantly enhances user experience and developer velocity, making it easier to build custom models, integrate new kernels, and extend runtime functionality, all while delivering state-of-the-art performance on NVIDIA GPUs.
Concrete examples will illustrate how the flexibility of this PyTorch-based architecture enables new customizations to be added quickly without sacrificing state-of-the-art performance.