The Era of Real-World Human Interaction: RL from User Conversations

  1. Three key properties of human interaction
  2. Reference

Three key properties of human interaction

Here is the original text from the paper on the properties of human interaction:

Contextual grounding — arises within the flow of ongoing tasks or conversations, directly tied to the
user’s situational needs and the model’s prior outputs, while being shaped by personalized knowledge of
the user’s profile, history, and preferences;

Evolving distribution — reflects goals that shift, environments that change, and preferences that adapt
over time, thereby providing supervision that is temporally relevant and aligned with the real distribution of human needs and priorities;

Diverse supervision signals — appears in both explicit high-bandwidth signals beyond scalar rewards
(e.g., corrections or clarifications) and implicit cues (e.g., disengagement or frustration), and may include
style and role assignments, emotional tone, or even adversarial inputs such as jailbreak attempts, which
require careful handling, but also offer valuable information.

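Intuitively,
1. Summary of the three key properties of human interaction (data) with a bot
(i). Grounded in context. Any task or conversation unfolds within an ongoing flow of context. The back-and-forth between a person and the bot is mutually dependent and tightly coupled; what is coupled is [the person's (situational) needs] and [the reply the bot has just given]. Throughout this process, the human-side context is always tied to the user's profile, the interaction history between the person and the bot, and the person's preferences.
(ii). Continuously evolving goals. A person's goals keep changing, and so does the environment, so a supervision signal is relevant to the time at which it is given and must be continually re-aligned with the person's needs and profile.
(iii). Diverse supervision signals. In contrast to math-style question answering, which uses a single explicit scalar reward, human-bot interaction carries many kinds of supervision signal. They appear as explicit signals that go beyond scalar rewards (for example, the user explicitly asking for a "correction" or a "clarification") and as implicit cues (for example, "disengagement" or showing "frustration"). They may also include style and role assignments, emotional tone, and even adversarial inputs such as "jailbreak attempts" (called "攻略", roughly "winning the character over", in some role-play scenarios), which require careful handling but also provide valuable information.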
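
To make the three properties a bit more concrete, here is a minimal sketch of how a single logged interaction might be represented. Everything in it is an assumption for illustration and does not come from the paper: the names `SignalKind`, `InteractionRecord`, `recency_weight`, and `needs_careful_handling` are hypothetical, and the exponential half-life decay is just one simple way to express "temporally relevant" supervision.

```python
from dataclasses import dataclass
from enum import Enum


class SignalKind(Enum):
    """Hypothetical labels for the signal types named in the quoted text."""
    CORRECTION = "correction"
    CLARIFICATION = "clarification"
    DISENGAGEMENT = "disengagement"
    FRUSTRATION = "frustration"
    STYLE_OR_ROLE = "style_or_role"
    JAILBREAK_ATTEMPT = "jailbreak_attempt"


@dataclass
class InteractionRecord:
    # (i) Contextual grounding: the signal only makes sense together with
    # the user's profile, the prior turns, and the reply it responds to.
    user_profile: dict
    history: list
    model_output: str
    user_followup: str
    # (ii) Evolving distribution: keep the time so older signals can be discounted.
    timestamp: float
    # (iii) Diverse supervision signals: a label instead of a scalar reward.
    signal: SignalKind


def recency_weight(record: InteractionRecord, now: float, half_life: float) -> float:
    """Assumed discounting rule: the weight halves every `half_life` seconds."""
    age = max(0.0, now - record.timestamp)
    return 0.5 ** (age / half_life)


def needs_careful_handling(record: InteractionRecord) -> bool:
    """Flag adversarial inputs for separate treatment rather than dropping them."""
    return record.signal is SignalKind.JAILBREAK_ATTEMPT


# Illustrative record; all values are made up.
record = InteractionRecord(
    user_profile={"language": "en"},
    history=["user: summarize this article", "bot: (a long summary)"],
    model_output="(a long summary)",
    user_followup="Too long, keep it to three bullet points.",
    timestamp=0.0,
    signal=SignalKind.CORRECTION,
)
```

With this sketch, a correction that is exactly one half-life old would count half as much as a fresh one, and a record labeled `JAILBREAK_ATTEMPT` would be routed to a separate handling path instead of being used directly as feedback.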

Reference

[1]. The Era of Real-World Human Interaction: RL from User Conversations.


Please credit the source when reposting: goldandrabbit.github.io