RLCHF features-based variant models individual preferences with evaluator characteristics, enabling aggregation across diverse groups
The second RLCHF variant proposed by Conitzer et al. (2024) takes a different approach: instead of aggregating rankings directly, it builds individual preference models that incorporate evaluator characteristics (demographics, values, context). These models can then be combined across groups, enabling context-sensitive preference aggregation.
This approach allows the system to learn: "People with characteristic X tend to prefer response type Y in context Z." Aggregation then happens by weighting or combining these learned preference functions according to a social choice rule, rather than aggregating raw rankings.
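A minimal sketch of what such a feature-conditioned preference model could look like, assuming evaluator characteristics are encoded as a feature vector fed alongside the response. The architecture, dimensions, and Bradley-Terry pairwise loss are illustrative assumptions, not details from Conitzer et al. (2024):

```python
import torch
import torch.nn as nn

class FeatureConditionedRewardModel(nn.Module):
    """Reward model conditioned on evaluator characteristics.

    Illustrative sketch: architecture and dimensions are assumptions,
    not specified by Conitzer et al. (2024).
    """

    def __init__(self, response_dim: int, evaluator_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(response_dim + evaluator_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, response: torch.Tensor, evaluator: torch.Tensor) -> torch.Tensor:
        # Conditioning on evaluator features lets the learned reward vary
        # with who is judging: "characteristic X prefers response type Y".
        return self.net(torch.cat([response, evaluator], dim=-1)).squeeze(-1)

def pairwise_loss(model, preferred, rejected, evaluator):
    # Standard Bradley-Terry preference loss, except each comparison
    # carries the feature vector of the evaluator who made it.
    margin = model(preferred, evaluator) - model(rejected, evaluator)
    return -torch.nn.functional.logsigmoid(margin).mean()
```

Because the evaluator features are an input rather than something averaged out at data-collection time, the per-group preference structure survives into the trained model.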
The key advantage: this variant can handle preference heterogeneity more flexibly than the aggregated rankings variant. It can adapt aggregation based on context, represent minority preferences explicitly, and enable "what would group X prefer?" queries.
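Continuing the sketch above, a "what would group X prefer?" query amounts to scoring candidate responses with that group's feature vector (the values below are hypothetical placeholders):

```python
import torch

# Score two candidate responses for a group whose characteristics are
# encoded in `group_x_feats` (hypothetical random values for illustration).
model = FeatureConditionedRewardModel(response_dim=8, evaluator_dim=4)
candidates = torch.randn(2, 8)               # embeddings of two responses
group_x_feats = torch.randn(4).expand(2, 4)  # same group features for each
scores = model(candidates, group_x_feats)
preferred = scores.argmax().item()           # group X's preferred candidate
```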
Evidence
- Conitzer et al. (2024) describe this as the second RLCHF variant
- The paper notes this approach "incorporates evaluator characteristics" and enables "aggregation across diverse groups"
- This connects to the broader literature on personalized and pluralistic AI systems
Comparison to Aggregated Rankings Variant
Where the aggregated rankings variant collapses preferences into a single collective ranking before training, the features-based variant preserves preference structure throughout. This allows (sketched in code after the list):
- Context-dependent aggregation (different social choice rules for different situations)
- Explicit representation of minority preferences
- Transparency about which groups prefer which responses
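A sketch of what context-dependent aggregation could look like, assuming per-group predicted rewards from models like the one above. The specific rules and weighting scheme are illustrative assumptions; the paper does not prescribe them:

```python
def aggregate(group_rewards: dict[str, float],
              group_weights: dict[str, float],
              rule: str = "utilitarian") -> float:
    """Combine per-group predicted rewards under a social choice rule."""
    if rule == "utilitarian":
        # Weighted average: tracks the (weighted) majority.
        return sum(group_weights[g] * r for g, r in group_rewards.items())
    if rule == "egalitarian":
        # Maximin: the collective score is the worst-off group's score,
        # which keeps minority preferences from being averaged away.
        return min(group_rewards.values())
    raise ValueError(f"unknown rule: {rule!r}")

# Different contexts can apply different rules to the same learned models:
rewards = {"group_a": 0.9, "group_b": 0.2}
weights = {"group_a": 0.7, "group_b": 0.3}
aggregate(rewards, weights, rule="utilitarian")  # 0.69: majority-leaning
aggregate(rewards, weights, rule="egalitarian")  # 0.2: minority-protective
```

The design point: the social choice rule is a swappable post-hoc decision rather than something baked into the training data, which is what makes the transparency and minority-representation properties in the list above possible.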
The tradeoff: higher complexity and potential for misuse (e.g., demographic profiling, value discrimination).
Relationship to Existing Work
This approach is conceptually similar to [[modeling preference sensitivity as a learned distribution rather than a fixed scalar resolves DPO diversity failures without demographic labels or explicit user modeling]], but more explicit about incorporating evaluator features. Both recognize that preference heterogeneity is structural, not noise.
The features-based variant also connects to [[community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules]]—both emphasize that different communities have different legitimate preferences that should be represented rather than averaged away.
---
Relevant Notes:
- [[modeling preference sensitivity as a learned distribution rather than a fixed scalar resolves DPO diversity failures without demographic labels or explicit user modeling]]
- [[community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules]]
- [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]]
Topics:
- domains/ai-alignment/_map
- core/mechanisms/_map
- foundations/collective-intelligence/_map