Instruction tuning and preference alignment are essential techniques for adapting Large Language Models (LLMs) to specific tasks. Traditionally, this involves a multi-stage process: (1) Supervised Fine-Tuning (SFT) on instructions to adapt the model to the target domain, followed by (2) preference alignment methods like Reinforcement Learning from Human Feedback (RLHF) or Direct Preference Optimization (DPO) to increase the likelihood of generating preferred responses over rejected ones.
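To make the second stage concrete, below is a minimal sketch of the DPO objective in PyTorch. This is not the code of any specific paper or library; the function name `dpo_loss` and the toy log-probability values are illustrative. The idea is that DPO widens the log-likelihood margin between the preferred (chosen) and rejected responses, measured relative to a frozen reference model, which is typically the SFT checkpoint.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss sketch (hypothetical helper, for illustration only).

    Each argument is a tensor of summed log-probabilities of the chosen /
    rejected response under the policy being trained or under the frozen
    reference model (typically the SFT checkpoint).
    """
    # Log-ratio of policy to reference for each response
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # DPO pushes the chosen log-ratio above the rejected one,
    # scaled by the temperature-like coefficient beta
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()

# Toy usage: a batch of 2 preference pairs with made-up log-prob values
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -11.0]),
                torch.tensor([-12.5, -9.8]), torch.tensor([-13.5, -10.5]))
print(loss.item())
```

In practice this loss would be computed over batches of (prompt, chosen, rejected) triples produced by the SFT model, with the reference model kept frozen so the policy does not drift too far from it.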