High-Quality Text Data Curation for Fine-Tuning LLMs Using Synthetic Data Generation Pipelines

Views: 733 | Uploaded: 2 months ago

NVIDIA Developer

About this video

In this step-by-step tutorial, you will learn how to curate high-quality text data for fine-tuning LLMs using synthetic data generation (SDG) pipelines in NVIDIA NeMo Curator.

NeMo Curator improves generative AI model accuracy by processing text, image, and video data at scale for training and customization. It also provides prebuilt pipelines for generating synthetic data to customize and evaluate generative AI systems.

This video walks you through installing NeMo Curator, downloading and loading a sample dataset from Hugging Face, and augmenting this dataset with high-quality data generated using the SDG pipeline.

You’ll see practical demonstrations of using NeMo Curator built-in functions to identify, filter, and remove URLs, Unicode characters, and semantically duplicate data. Synthetic data is then generated, and low-quality samples are removed based on scores from the reward model, so the resulting dataset is high quality and ready for fine-tuning generative AI models.
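To make the cleaning and filtering steps concrete, here is a minimal plain-Python sketch of the same idea. It does not use the NeMo Curator API: every function name below is hypothetical, the deduplication is exact-match rather than semantic, and `toy_score` is only a stand-in for a real reward model score.

```python
import re
import unicodedata

# Assumption: a simple pattern for http(s) and www links; real pipelines may use stricter rules.
URL_RE = re.compile(r"https?://\S+|www\.\S+")


def clean_text(text: str) -> str:
    """Strip URLs, normalize Unicode to NFKC, and collapse whitespace."""
    text = URL_RE.sub("", text)
    text = unicodedata.normalize("NFKC", text)
    return " ".join(text.split())


def dedup_exact(records):
    """Keep the first occurrence of each text, compared case-insensitively.

    This is exact-match deduplication, a simplification of the
    semantic deduplication described in the video.
    """
    seen = set()
    kept = []
    for rec in records:
        key = rec.casefold()
        if key not in seen:
            seen.add(key)
            kept.append(rec)
    return kept


def filter_by_score(records, score_fn, threshold):
    """Keep only records whose score is at or above the threshold."""
    return [rec for rec in records if score_fn(rec) >= threshold]


def toy_score(text: str) -> int:
    """Hypothetical stand-in for a reward model: scores by word count."""
    return len(text.split())


raw = [
    "Reset your password at https://example.com/reset today.",
    "Reset your password at   today.",
    "\uff28\uff45\uff4c\uff4c\uff4f there, how can I help you?",  # full-width "Hello"
    "ok",
]

cleaned = [clean_text(t) for t in raw]
unique = dedup_exact(cleaned)
curated = filter_by_score(unique, toy_score, threshold=3)

for row in curated:
    print(row)
```

In a real pipeline the scoring function would call a reward model, and the threshold would be tuned on a held-out sample rather than fixed by hand.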

📥 Access the tutorial:

📖 Learn more about NeMo Curator:

⭐️ Don’t forget to star the NeMo Curator GitHub repository to receive regular updates on newly released features and tutorials and to contribute your code to the repository:

00:00 - Introduction
01:17 - Prerequisites
01:33 - Diving Into the Code
01:56 - Run the Data Curation Pipeline
02:25 - Filtering and Cleaning
03:13 - Synthetic Data Generation
04:25 - Results
