

ORPO


Instruction tuning and preference alignment are essential techniques for adapting Large Language Models (LLMs) to specific tasks. Traditionally, this involves a multi-stage process: (1) Supervised Fine-Tuning (SFT) on instructions to adapt the model to the target domain, followed by (2) preference alignment with methods such as Reinforcement Learning from Human Feedback (RLHF) or Direct Preference Optimization (DPO) to increase the likelihood of generating preferred responses over rejected ones.
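
To make stage (2) concrete, the sketch below shows the pairwise DPO objective in PyTorch: it pushes the policy's log-probability margin between the chosen and rejected response above the reference model's margin. This is a minimal illustration, not the paper's code; the function name `dpo_loss`, the argument names, and the toy inputs are assumptions for demonstration.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Pairwise DPO loss.

    Each argument is a batch of summed token log-probabilities
    log pi(y|x) of the chosen or rejected response under either
    the trainable policy or the frozen reference model.
    """
    # Log-probability margin between chosen and rejected responses
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps

    # -log sigmoid(beta * (policy margin - reference margin)), averaged over the batch
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Toy usage with random log-probabilities for a batch of 4 preference pairs
policy_chosen = torch.randn(4)
policy_rejected = torch.randn(4)
ref_chosen = torch.randn(4)
ref_rejected = torch.randn(4)
print(dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected))
```

In practice these log-probabilities come from forward passes of the fine-tuned policy and a frozen copy of the SFT model over a dataset of (prompt, chosen, rejected) triples; the reference margin keeps the policy from drifting too far from the SFT model while it learns the preference.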

 
