Power Text-Generation Applications with Mistral NeMo 12B Running on a Single GPU | NVIDIA Technical Blog (2024)

NVIDIA collaborated with Mistral to co-build the next-generation language model that achieves leading performance across benchmarks in its class.

With a growing number of language models purpose-built for select tasks, NVIDIA Research and Mistral AI combined forces to offer a versatile, open language model that’s performant and runs on a single GPU.

This post explores the benefits of Mistral NeMo, training and inference optimizations, applicability for various use cases, and the ease of deployment with NVIDIA NIM.

Mistral NeMo 12B

Mistral NeMo is a 12B-parameter, text decoder-only, dense transformer model trained on 131K multilingual vocabulary size. It delivers leading accuracy on popular benchmarks across common sense reasoning, world knowledge, coding, math, and multilingual and multi-turn chat tasks.

ModelContext WindowHellaSwag (0-shot)Winograd (0-shot)NaturalQ (5-shot)TriviaQA (5-shot)MMLU (5-shot)OpenBookQA (0-shot)CommonSenseQA (0-shot)TruthfulQA (0-shot)MBPP (pass@1 3-shots)
Mistral NeMo 12B128k83.5%76.8%31.2%73.8%68.0%60.6%70.4%50.3%61.8%
Gemma 2 9B8k80.1%74.0%29.8%71.3%71.5%50.8%60.8%46.6%56.0%
Llama 3 8B8k80.6%73.5%28.2%61.0%62.3%56.4%66.7%43.0%57.2%

Supporting 128K context length, the model has enhanced understanding and the capability to process extensive and complex information, leading to more coherent, accurate, and contextually relevant outputs.

Mistral NeMo is trained on Mistral’s proprietary dataset that includes a large proportion of multilingual and code data, which enables better feature learning, reduced bias, and an improved ability to handle diverse and complex scenarios.

Optimized training

The model is trained using NVIDIA Megatron-LM, an open-source, PyTorch-based library with a collection of GPU-optimized techniques, cutting-edge system-level innovations, and modular APIs for training models at large scale.

Megatron-LM, part of NVIDIA NeMo, offers the core building blocks for the distributed training of text: multimodal and mixture of experts (MoE) models natively built into the library:

  • Attention mechanisms
  • Transformer blocks and layers
  • Normalization layers
  • Embedding techniques
  • Activation recomputation
  • Distributed checkpointing

Optimized inference

Mistral NeMo is optimized with TensorRT-LLM engines for higher inference performance. TensorRT-LLM compiles the models into TensorRT engines, from model layers into optimized CUDA kernels using pattern matching and fusion, to maximize inference performance. These engines are executed by the TensorRT-LLM runtime, which includes several optimizations:

Inference in FP8 precision is also supported by using NVIDIA TensorRT-Model-Optimizer. Using post training quantization (PTQ) on NVIDIA Hopper and NVIDIA Ada GPUs, you can optimize and reduce model complexity by creating smaller models with lower memory footprint, without sacrificing accuracy.

This model fits on a single GPU, improving compute efficiency, lowering compute cost, and enhancing security and privacy. Run the model on a range of commercial use cases including summarizing long documents, classification, multi-turn conversations, language translation, and code generation.

NVIDIA NIM

The Mistral NeMo model is packaged as an NVIDIA NIM inference microservice to streamline and accelerate the deployment of generative AI models across NVIDIA accelerated infrastructure anywhere, including cloud, data center, and workstations.

NIM uses inference optimization engines, industry-standard APIs, and prebuilt containers to provide high-throughput AI inference that scales with demand. It supports a wide range of generative AI models across domains including speech, image, video, healthcare, and more.

NVIDIA NIM delivers best-in-class throughput, enabling enterprises to generate tokens up to 5x faster. For generative AI applications, token processing is the key performance metric, and increased token throughput directly translates to higher revenue for enterprises.

Mistral NeMo has an open Apache 2.0 permissive license that provides enterprises the flexibility to customize the model and integrate it into their commercial applications.

Here’s how to test the Mistral NeMo model for programming tasks.

Coding copilot

Coding is a popular use case supported by the model to enhance developer productivity with AI-powered code suggestions. A coding copilot offers not only inline code suggestions but also generates code, documentation, comments, unit tests, and error fixes, all of which can be easily accessed through industry-standard APIs.

Power Text-Generation Applications with Mistral NeMo 12B Running on a Single GPU | NVIDIA Technical Blog (1)

Here’s the syntactically and functionally correct code generated by Mistral NeMo with an English language prompt. You can paste the following in your development environment. To generate other code samples, see the NVIDIA API Catalog.

import torchimport torch.nn as nnimport torch.nn.functional as F# Define the neural networkclass Net(nn.Module): def __init__(self): super(Net, self).__init__() # Convolutional layers self.conv1 = nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1) self.conv2 = nn.Conv2d(16, 32, kernel_size=3, stride=1, padding=1) # Fully connected layer self.fc = nn.Linear(32 * 8 * 8, 10) # Assuming input image size is 32x32 def forward(self, x): # Add convolutional layers with ReLU activation and max pooling x = F.relu(self.conv1(x)) x = F.max_pool2d(x, 2) x = F.relu(self.conv2(x)) x = F.max_pool2d(x, 2) # Flatten the tensor before passing it to the fully connected layer x = x.view(-1, 32 * 8 * 8) # Add fully connected layer with log softmax for multi-class classification x = self.fc(x) output = F.log_softmax(x, dim=1) return output# Create an instance of the neural networknet = Net()# Print the model architectureprint(net)# Test the forward pass with a dummy inputdummy_input = torch.randn(1, 3, 32, 32) # Batch size of 1, 3 channels, 32x32 image sizeoutput = net(dummy_input)print("Test output:\n", output)

You may also want to fine-tune the model with your domain data to generate higher-accuracy responses. NVIDIA offers tools to align the model for your use case.

Model customization

The instruction-tuned variant of the Mistral NeMo model offers strong performance amongst similarly sized LLMs across several benchmarks such as MT Bench, MixEval-Hard, IFEval-v5, and WildBench.

You can further customize it for your specific needs using NVIDIA NeMo, an end-to-end platform for developing custom generative AI, anywhere.

NeMo offers state-of-the-art fine-tuning and alignment support with parameter-efficient fine-tuning (PEFT) techniques, including p-tuning, low-rank adaption (LoRA), and its quantized version (QLoRA). These techniques are useful for creating custom models without requiring a lot of computing power.

NeMo also supports supervised fine-tuning (SFT) and alignment techniques such as reinforcement learning from human feedback (RLHF), direct preference optimization (DPO), and NeMo SteerLM. These techniques enable further steering the model responses and aligning them with human preferences, making the LLMs ready to integrate into custom applications.

Get started

To experience Mistral NeMo NIM microservice, see the Artificial Intelligence solution page. You will also find popular models, such as Llama 3.1 405B, Mixtral 8X22B, and Gemma 2B.

With free NVIDIA cloud credits, you can start testing the model at scale and build a proof of concept (POC) by connecting your application to the NVIDIA-hosted API endpoint running on a fully accelerated stack.

Power Text-Generation Applications with Mistral NeMo 12B Running on a Single GPU | NVIDIA Technical Blog (2024)
Top Articles
Flamin' Hot Cheetos Puffs: Exactly How Spicy Are They? | Gopuff
Where Do Flaming Hot Cheetos Place On Scoville Scale? - Spicy Quest
Spasa Parish
Rentals for rent in Maastricht
159R Bus Schedule Pdf
Sallisaw Bin Store
Black Adam Showtimes Near Maya Cinemas Delano
Espn Transfer Portal Basketball
Pollen Levels Richmond
11 Best Sites Like The Chive For Funny Pictures and Memes
Things to do in Wichita Falls on weekends 12-15 September
Craigslist Pets Huntsville Alabama
Paulette Goddard | American Actress, Modern Times, Charlie Chaplin
What's the Difference Between Halal and Haram Meat & Food?
R/Skinwalker
Rugged Gentleman Barber Shop Martinsburg Wv
Jennifer Lenzini Leaving Ktiv
Justified - Streams, Episodenguide und News zur Serie
Epay. Medstarhealth.org
Olde Kegg Bar & Grill Portage Menu
Cubilabras
Half Inning In Which The Home Team Bats Crossword
Amazing Lash Bay Colony
Juego Friv Poki
Dirt Devil Ud70181 Parts Diagram
Truist Bank Open Saturday
Water Leaks in Your Car When It Rains? Common Causes & Fixes
What’s Closing at Disney World? A Complete Guide
New from Simply So Good - Cherry Apricot Slab Pie
Drys Pharmacy
Ohio State Football Wiki
Find Words Containing Specific Letters | WordFinder®
FirstLight Power to Acquire Leading Canadian Renewable Operator and Developer Hydromega Services Inc. - FirstLight
Webmail.unt.edu
2024-25 ITH Season Preview: USC Trojans
Metro By T Mobile Sign In
Restored Republic December 1 2022
12 30 Pacific Time
Jami Lafay Gofundme
Stellaris Resolution
Wi Dept Of Regulation & Licensing
Pick N Pull Near Me [Locator Map + Guide + FAQ]
Crystal Westbrooks Nipple
Ice Hockey Dboard
Über 60 Prozent Rabatt auf E-Bikes: Aldi reduziert sämtliche Pedelecs stark im Preis - nur noch für kurze Zeit
Wie blocke ich einen Bot aus Boardman/USA - sellerforum.de
Infinity Pool Showtimes Near Maya Cinemas Bakersfield
Dermpathdiagnostics Com Pay Invoice
How To Use Price Chopper Points At Quiktrip
Maria Butina Bikini
Busted Newspaper Zapata Tx
Latest Posts
Article information

Author: Lakeisha Bayer VM

Last Updated:

Views: 5920

Rating: 4.9 / 5 (49 voted)

Reviews: 80% of readers found this page helpful

Author information

Name: Lakeisha Bayer VM

Birthday: 1997-10-17

Address: Suite 835 34136 Adrian Mountains, Floydton, UT 81036

Phone: +3571527672278

Job: Manufacturing Agent

Hobby: Skimboarding, Photography, Roller skating, Knife making, Paintball, Embroidery, Gunsmithing

Introduction: My name is Lakeisha Bayer VM, I am a brainy, kind, enchanting, healthy, lovely, clean, witty person who loves writing and wants to share my knowledge and understanding with you.