Haptic: AI Training and Optimisation using Human Feedback

Abstract

Haptic is a technological solution designed to overhaul the training and optimization processes of Large Language Models (LLMs) and assorted AI networks. By leveraging a high-quality decentralized human feedback infrastructure, Haptic bridges AI training procedures with human cognition. The primary objective of this system is to streamline the collection of high-quality feedback and training data, ensuring that human contributors are appropriately rewarded for their valuable input. This paper aims to provide a comprehensive exploration of Haptic's unique methodology, its system overview, and its potential applications in various domains.

Introduction

Large Language Models (LLMs) have introduced a transformative shift in the organization and execution of white-collar workflows and creative processes. They have become instrumental in tasks ranging from drafting complex documents and composing professional emails to generating comprehensive reports, automating customer service, and summarizing voluminous text data. This profound influence of LLMs has led to a substantial uptick in workplace efficiency across various sectors.

However, as is the case with any technological innovation, LLMs come with their own set of challenges. Modern LLMs have billions of parameters and operate as black boxes, offering limited visibility into which parameters can be adjusted to improve performance.

Figure: LLM model parameter size is on a constant uptrend, making retraining processes complex.

Another prevalent issue is that the inferences generated by LLMs often do not align with user expectations. This discrepancy can be attributed to various factors such as misinterpretation of the user's input, a lack of adequate training data, or biases inherent in the model's predictions. To counteract these challenges, Haptic employs a unique approach: Reinforcement Learning from Human Feedback (RLHF). This paper delves into the workings of Haptic's RLHF approach and explains how it can help AI models align better with intricate human values and expectations.

Why does RLHF work?

Computer scientist and natural language processing researcher Yoav Goldberg has an excellent note on three hypotheses for why RLHF works.

  1. Diversity hypothesis: during supervised fine-tuning (SFT), the model’s output is expected to somewhat match the demonstrated responses. For example, given the prompt “what’s an example of a language?”, if the demonstrated response is “Spanish” and the model’s response is “Java”, the model’s response might be marked as wrong even though both answers are valid. RLHF, which rewards any response humans prefer rather than one fixed demonstration, preserves this diversity.
  2. Negative feedback hypothesis: demonstration only gives the model positive signals (e.g. only showing the model good responses), not negative signals (e.g. showing the model what bad responses look like). RL allows us to show models negative signals.
  3. Hallucination hypothesis: RLHF is supposed to help with hallucination. However, the InstructGPT paper shows that RLHF actually made hallucination worse. Even so, RLHF improved other aspects, and overall, human labellers prefer the RLHF model over the SFT-only model.
Figure: Users prefer responses from LLM models retrained using RLHF.

Methodology

The Haptic system comprises three fundamental stages:

  1. Pretraining a language model
  2. Gathering data and training a preference model
  3. Fine-tuning the language model with reinforcement learning

Figure: System architecture for the reinforcement learning process using human feedback.

Pretraining

Pretraining is the crucial first step of Reinforcement Learning from Human Feedback (RLHF) and lays the foundation for the rest of the process. It involves training a language model on a large corpus of text data, usually scraped from the internet, using standard techniques such as next-token prediction. This stage aims to equip the model with a broad understanding of language structure, syntax, and basic world knowledge.

When a language model is pretrained, it learns to generate coherent and contextually appropriate text by predicting the next word in a sentence based on the preceding words. This allows the model to understand and generate text that is syntactically correct and semantically relevant. The large-scale data used in pretraining ensures that the model is exposed to a wide range of topics, styles, and contexts, which greatly enhances its versatility and adaptability.
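As a concrete illustration, the minimal sketch below shows a single next-token-prediction training step in Python using PyTorch and the Hugging Face transformers library. The "gpt2" checkpoint and the one-sentence batch are placeholders for illustration, not Haptic's actual pretraining setup.

```python
# Minimal next-token-prediction training step (illustrative sketch only).
# Assumes PyTorch and Hugging Face `transformers`; "gpt2" is a stand-in model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

text = "Haptic bridges AI training procedures with human cognition."
batch = tokenizer(text, return_tensors="pt")

# For causal LM pretraining the labels are the input ids themselves;
# the library shifts them internally so each position predicts the next token.
outputs = model(**batch, labels=batch["input_ids"])
loss = outputs.loss          # cross-entropy over next-token predictions
loss.backward()
optimizer.step()
optimizer.zero_grad()
```

Scaled across a web-sized corpus, repeating this step is what gives the pretrained model its broad coverage of topics, styles, and contexts.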

However, pretraining alone is not sufficient to ensure that the model's outputs align precisely with human values or specific task requirements. This is where RLHF comes in. The pretrained model serves as a basis upon which further fine-tuning can be performed using human feedback, allowing the model to refine its understanding and generation of text to better meet the needs and preferences of users. In this sense, pretraining can be seen as providing a strong starting point from which RLHF can build.

This setup and process underpin every LLM that we interact with today:

  1. DeepMind has documented using models up to its 280-billion-parameter Gopher and applying RLHF to improve its capabilities.
  2. Anthropic used transformer models from 10 million to 52 billion parameters for this task. They generated their initial language model for reinforcement learning by distilling an original LM on context clues for their “helpful, honest, and harmless” criteria.
  3. OpenAI used a smaller version of GPT-3 for its first popular RLHF model, InstructGPT.

OpenAI showed that outputs from the 1.3B-parameter InstructGPT model are preferred to outputs from the 175B-parameter GPT-3, demonstrating that a fine-tuned model can produce far superior results than raw scale alone. The performance of these models also depends heavily on the training data they are built around: a model trained to serve subject-specific information performs well on the limited repository of questions associated with that subject, and such narrower models are computationally much cheaper to train and retrain.

All the companies mentioned above likely use much larger models in their most recent RLHF-powered products. Evidently, the core requirement for starting the RLHF process is a model that responds well to diverse instructions. In general, there is no clear answer on which model is the best starting point for RLHF. Thus, in our implementation at Haptic, we remain model agnostic, supporting models such as Claude, ChatGPT, Grok, BARD, Monai, and other popular LLMs that receive interest from the community.

Pretraining is the most resource-intensive phase. For the InstructGPT model, pretraining takes up 98% of the overall compute and data resources. You can think of SFT and RLHF as unlocking capabilities that the pretrained model already has but that are hard for users to access via prompting alone. The mathematical formulation of this step is simple, as summarized below:

| Aspect | Details |
| --- | --- |
| ML Task | Language modeling |
| Training Data | High-quality data in the format of (prompt, response) |
| Data Scale | 100,000+ (prompt, response) pairs. Example datasets: OpenAssistant (161,000 messages in 10,000 conversations, approximately 88,000 pairs); dialogue-fine-tuned Gopher (~5 billion tokens, estimated to be on the order of 10M messages). Future: Haptic contributors are expected to create 50k+ pairs via the human feedback network |
| Model Input and Output | Input: prompt. Output: response for this prompt |
| Loss Function | Cross entropy, with only the tokens in the response counted towards the loss (see the sketch after this table) |
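To make the loss-function row concrete, the sketch below shows one common way of counting only response tokens towards the cross-entropy loss: prompt positions are assigned the ignore index so they contribute nothing. The helper names and unbatched tensor shapes are illustrative assumptions, not Haptic's production code.

```python
# Sketch of response-only cross-entropy for SFT on (prompt, response) pairs.
# Assumes PyTorch; tensors are unbatched for simplicity.
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # positions with this label are excluded from the loss

def sft_labels(prompt_ids: torch.Tensor, response_ids: torch.Tensor) -> torch.Tensor:
    """Build labels where prompt tokens are masked out so that only the
    response tokens count towards the loss."""
    prompt_mask = torch.full_like(prompt_ids, IGNORE_INDEX)
    return torch.cat([prompt_mask, response_ids])

def sft_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Next-token cross entropy: shift so position t predicts t+1, then
    ignore every position labelled IGNORE_INDEX (the prompt).
    logits: (seq_len, vocab_size); labels: (seq_len,)."""
    shifted_logits = logits[:-1, :]
    shifted_labels = labels[1:]
    return F.cross_entropy(shifted_logits, shifted_labels,
                           ignore_index=IGNORE_INDEX)
```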

Preference model training

Preference modeling in the Reinforcement Learning from Human Feedback (RLHF) process involves creating a scoring function to evaluate the quality of text generated by a model based on human preferences. Data for this process is gathered from a set of prompt-generation pairs, created by choosing prompts from a specific dataset and generating responses using the language model. Human feedback providers rank these responses, forming a dataset that reflects human preferences across various prompts and responses. A ranking system, rather than absolute scores, is used to capture the relative preference between different responses, reducing variance and subjectivity in the feedback data.

The preference model that we ultimately use was chosen after testing multiple reward functions in the fine-tuning phase, steering the model towards generating responses that align more closely with human preferences. This model is crucial to the RLHF process, and Haptic's contribution in this field lies largely in creating a preference model that aligns with human preferences while avoiding the overfitting that can affect Large Language Models (LLMs).

Haptic uses two setups to model this reward: a model built from scratch on preference data, and a fine-tuned language model. The training dataset of prompt-generation pairs is created by selecting prompts from a predefined dataset. Human feedback providers then rank the text outputs generated by the LLMs. Rankings help us compare model outputs and build the more granular datasets that LLM teams require for updating their models. Haptic employs a number of closed-source ranking methods, assigned to different users in a randomized way, with the idea that aggregated results are more reliable and can be used to produce a scalar reward signal for training.
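One common way of turning such pairwise rankings into a scalar reward signal is a Bradley–Terry style ranking loss, in the style popularized by InstructGPT-era reward models. The sketch below is a simplified illustration; `reward_model` is a hypothetical stand-in for whichever backbone is being fine-tuned.

```python
# Sketch: training a preference (reward) model from ranked response pairs.
# `reward_model` is a hypothetical module that maps (prompt + response)
# token ids to a single scalar score.
import torch.nn.functional as F

def preference_loss(reward_model, chosen_ids, rejected_ids):
    """Pairwise ranking loss: push the score of the human-preferred
    response above the score of the rejected one."""
    r_chosen = reward_model(chosen_ids)      # scalar reward for preferred text
    r_rejected = reward_model(rejected_ids)  # scalar reward for rejected text
    # -log sigmoid(r_chosen - r_rejected), as in Bradley-Terry preference modelling
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```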

Figure: Methodology followed in the creation of the preference model.

Fine-tuning

Training a language model with the help of reinforcement learning was once believed to be impossible, but advances over the past six years have made it a reality. This is achieved by adjusting parameters of the initial Large Language Model (LLM) using Proximal Policy Optimization (PPO), a policy-gradient RL algorithm.

Some LLM parameters are frozen because of the high computational demand of adjusting a full model with billions of parameters. How many parameters to freeze is left to the developers of the LLM, as updating every layer can be costly and impractical.

Formulating the reinforcement learning problem involves the language model accepting a prompt and subsequently producing a text sequence. In more technical terms, the model's action space comprises all tokens in the language model's vocabulary, while its observation space is the distribution of possible input token sequences, which is exceedingly vast given the length of text prompts.

The reward function plays a critical role when the system merges responses from all models into one RLHF workflow. Given a prompt (A), a text response (B) is generated by the current iteration of the fine-tuned policy. This text, combined with the original prompt, is passed to the preference model, which returns a scalar value signaling the preference. The RL policy's per-token probability distributions are also compared with those of the initial model to calculate a penalty for the difference.

Most academic papers employ Kullback–Leibler (KL) divergence to calculate this penalty between per-token distributions. Such a formulation keeps the policy from veering too far from the initial pretrained model with each batch, ensuring the model infers coherent text snippets related to the original prompt. Haptic has instead been experimenting with the clipped surrogate objective developed by the OpenAI team for PPO:

$$
L^{CLIP}(\theta) = \hat{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\left(r_t(\theta),\,1-\varepsilon,\,1+\varepsilon\right)\hat{A}_t\right)\right]
$$

Where:

$\theta$ — the policy parameters

$\hat{E}_t$ — the empirical expectation over timesteps

$r_t(\theta)$ — the ratio of the probability of the action under the new and old policies

$\hat{A}_t$ — the estimated advantage at timestep $t$

$\varepsilon$ — a clipping hyperparameter, usually set to 0.1 or 0.2

The objective function formulated above enables us to perform a trust-region update that is compatible with Stochastic Gradient Descent. It simplifies the algorithm by eliminating the KL penalty and the need for adaptive updates, resulting in a more straightforward implementation.
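The sketch below restates the clipped objective in code and also shows one common way of folding a per-token KL penalty against the frozen initial model into the reward. Function names and the coefficient `beta` are illustrative assumptions, not Haptic's internal implementation.

```python
# Sketch of the clipped PPO surrogate objective with a per-token KL penalty.
# All tensors hold per-token log-probabilities / advantages; shapes are illustrative.
import torch

def ppo_clip_loss(new_logprobs, old_logprobs, advantages, eps=0.2):
    """L^CLIP: min(r_t * A_t, clip(r_t, 1-eps, 1+eps) * A_t), maximised,
    so its negation is returned as a loss to minimise."""
    ratio = torch.exp(new_logprobs - old_logprobs)              # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()

def kl_shaped_reward(preference_score, policy_logprobs, ref_logprobs, beta=0.02):
    """Preference-model score minus a per-token KL penalty that keeps the
    policy close to the frozen pretrained model (the alternative formulation
    mentioned above)."""
    kl_penalty = beta * (policy_logprobs - ref_logprobs).sum()
    return preference_score - kl_penalty
```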

Proximal Policy Optimization (PPO) fine-tuning is widespread in the industry, used with OpenAI's GPT-2 and GPT-3, DeepMind's Gopher, and Anthropic's models. These models have demonstrated significant improvements in their ability to generate coherent and contextually appropriate responses after fine-tuning with PPO. Haptic’s goal is to make these improvements accessible to all LLM developers and teams.

System Overview

Haptic's system follows a simple yet effective process. It starts with LLM providers connecting their models to the Haptic front-end by giving access to their API endpoints. Users can then access these LLMs and their response-generation services via Haptic. Comparative feedback on the same category of questions from various users is encouraged, and an underlying scoring mechanism evaluates the contribution of each user to the network. Once sufficient data and feedback responses are gathered, Haptic assists in completing one iteration of LLM parameter retraining.
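The names below are a hypothetical illustration of the kind of record such a pipeline might store for each comparison; nothing here reflects a finalized Haptic schema.

```python
# Hypothetical shape of a single comparative-feedback record collected
# through the Haptic front-end (illustrative only, not a finalized schema).
from dataclasses import dataclass
from typing import List

@dataclass
class ComparisonRecord:
    prompt: str                # the user-visible prompt
    model_ids: List[str]       # which connected LLM endpoints produced responses
    responses: List[str]       # one response per model
    ranking: List[int]         # indices of `responses`, best first, as ranked by the user
    contributor_id: str        # pseudonymous id of the feedback provider
    contributor_score: float   # scoring-mechanism weight applied to this contribution
```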

Figure: HapticAI system architecture and incentivised staking.

Revenue collected from the retraining of partner models over successive iterations will be directed back to token stakers who participate in the feedback process. Staked tokens serve to build a pseudo-reputation system, a role that teams like OpenAI filled by researching their labellers' educational backgrounds. That manner of user segmentation is generally not possible in decentralized networks, so Haptic uses a staking mechanism and the HapticScore to assign reputation to users.
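As an illustration of how staking and a reputation score might weight feedback, the sketch below aggregates pairwise votes by a combined stake/HapticScore weight. The weighting formula is purely an assumption for illustration; Haptic's actual HapticScore logic is not specified here.

```python
# Hypothetical stake- and reputation-weighted aggregation of pairwise votes.
# The weighting formula (stake * score) is an illustrative assumption.
from typing import List, Tuple

def aggregate_preference(votes: List[Tuple[int, float, float]]) -> float:
    """Each vote is (preferred_response, staked_tokens, haptic_score), where
    preferred_response is 0 or 1. Returns the weighted fraction of weight
    favouring response 0, usable as a soft preference label."""
    total_weight = 0.0
    weight_for_zero = 0.0
    for preferred, stake, score in votes:
        weight = stake * score          # assumed weighting: stake times reputation
        total_weight += weight
        if preferred == 0:
            weight_for_zero += weight
    return weight_for_zero / total_weight if total_weight > 0 else 0.5
```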

RLHF upgrades

Reinforcement Learning from Human Feedback has certain limitations due to its human-centric problem domain. Human preference data can be costly and non-deterministic, and the performance of RLHF is tied to the quality of its human annotations, which can introduce significant variance into the training data.

To counter this, crypto-economics can be employed to accelerate iteration loops.

Currently, the available datasets for RLHF on large language models are limited. However, RLHF has potential for improvement with many unexplored design options that could help its progress. Several strategies for optimizing the RLHF system are being considered:

  1. Pre-training gradients: collaborate with other LLM projects to incorporate additional pre-training gradients into the PPO update rule for fine-tuning parameters (see the sketch after this list).
  2. Meta-learning: use additional, underexplored machine learning algorithms and techniques to accelerate the learning process.
  3. Hierarchical RL: break the learning problem down into a hierarchy of simpler problems to make the learning process more manageable and step-wise.
  4. Transfer learning: train the RLHF system on one task and then transfer the learned knowledge to another related task.
  5. Inverse reinforcement learning: learn the reward function directly from observed behavior, which is useful for retraining when direct feedback is limited or unavailable.
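For the first option, a minimal sketch of mixing a pretraining language-modelling term into the PPO loss (in the spirit of InstructGPT's "PPO-ptx" objective) is shown below. The names `ppo_loss`, `pretrain_lm_loss`, and `ptx_coef` are hypothetical, and the default coefficient is a placeholder to be tuned, not a Haptic-specified value.

```python
# Sketch: adding a pretraining-gradient term to the PPO update (PPO-ptx style).
# `ppo_loss` and `pretrain_lm_loss` are assumed to be computed elsewhere
# (see the earlier sketches); `ptx_coef` is a hypothetical mixing coefficient.
def combined_update(ppo_loss, pretrain_lm_loss, ptx_coef=1.0):
    """Total loss = PPO surrogate loss + coefficient * language-modelling loss
    on batches drawn from the original pretraining distribution."""
    return ppo_loss + ptx_coef * pretrain_lm_loss
```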

In the future, RLHF could evolve by iteratively updating the reward model and the policy. As the RL policy updates, users can continue to rank these outputs against the model's earlier versions. Most research papers have yet to discuss implementing this operation, as the mode of deployment needed to collect this type of data only works for dialogue agents with access to an engaged user base. Anthropic discusses this option as Iterated Online RLHF (see the original paper), where iterations of the policy are included in the ELO ranking system across models. This introduces complex dynamics of the policy and reward model evolving, which presents an intriguing and open research question.

Potential Applications

Haptic's methodology has vast potential applicability across several domains. These include LLM models (general and topic specific), AI bias mitigation, AI art/creativity, emotional AI, medical AI, and AI for accessibility. In each of these domains, RLHF can provide valuable insights and improvements, ensuring that AI models align more closely with the expectations and values of their human users.

Conclusion

Haptic represents a significant step towards improving the way AI models are trained and optimized. It aims to make feedback datasets and the training process easily available to all. By harnessing the power of RLHF and decentralizing the process of feedback collection, Haptic promises to unlock new avenues for accuracy, efficiency, and innovation in the development of future AI models. As we move towards an AI-dominated future, systems like Haptic will play a crucial role in ensuring that AI models are not just technically sophisticated, but also imbued with the nuances and subtleties of human cognition.

References