Summary

We investigate NVIDIA's Triton (formerly TensorRT) Inference Server as a way of hosting Transformer language models. The post is roughly divided into two parts: (i) instructions for setting up your own inference server,…
Most neural architectures for machine translation use an encoder-decoder model built from either convolutional or recurrent layers. The encoder layers map the input sequence to a latent representation, and the decoder, in turn, uses this latent representation to generate the target sequence.
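To make the encoder-decoder idea concrete, here is a minimal recurrent sketch in PyTorch; the vocabulary and layer sizes are placeholders rather than values from the post, and a real system would add attention and beam-search decoding on top of this skeleton.

```python
# Minimal encoder-decoder sketch (hypothetical sizes, PyTorch assumed).
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab=8000, tgt_vocab=8000, emb=256, hidden=512):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        # Encoder: maps the source tokens to a latent representation.
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        # Decoder: starts from the latent state and produces target-side states.
        self.decoder = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src, tgt):
        _, latent = self.encoder(self.src_emb(src))       # latent: (1, batch, hidden)
        dec_out, _ = self.decoder(self.tgt_emb(tgt), latent)
        return self.out(dec_out)                          # logits over the target vocabulary

model = Seq2Seq()
src = torch.randint(0, 8000, (2, 10))   # dummy source batch
tgt = torch.randint(0, 8000, (2, 12))   # dummy (shifted) target batch
logits = model(src, tgt)                # shape: (2, 12, 8000)
```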