ChatGPT, a language model from OpenAI, has been making waves in the tech industry since its release in November 2022. The model is fine-tuned from a model in the GPT-3.5 series, is a sibling of InstructGPT, and is designed with a strong focus on interactive conversations.

ChatGPT generates text that closely resembles human writing and takes the context of a conversation into account. It also avoids many of the shortcomings of earlier systems such as Microsoft's Tay and Meta's Galactica, which has made it a reference point for end-to-end conversational systems.

The GPT (Generative Pre-trained Transformer) series consists of language models based on the Transformer architecture. It is developed by OpenAI, a company based in San Francisco. When it was released in 2020, GPT-3 was the largest language model ever trained, with 175 billion parameters; it has since been surpassed by larger models, including BLOOM with its 176 billion parameters.

Large language models (LLMs) are usually trained on a very large volume of sample text in different languages and domains. GPT-3 was trained on hundreds of billions of words from Common Crawl, WebText2, Books1/2, and English Wikipedia.

Generative Model

GPT-3 is classified as a generative model, which means that it is trained primarily to predict the next token (roughly, the next word or word fragment) at the end of the input text. The same kind of autocompletion mechanism is found in search engines and in email clients such as Outlook.

GPT-3 has been cited many times for its ability to generate texts that are extremely close to what a journalist or an author is capable of doing. Just give it the beginning of a sentence, and it will complete the rest of the paragraph or article word-by-word.
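The word-by-word completion described above can be illustrated with a deliberately tiny stand-in. The sketch below is not how GPT-3 works internally: it assumes a simple bigram counter in place of a Transformer, and the names `train_bigram`, `complete`, and the toy corpus are hypothetical. It only shows the loop of repeatedly predicting and appending the most likely next word.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count, for each word, which words follow it in the corpus."""
    follows = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for current, nxt in zip(words, words[1:]):
            follows[current][nxt] += 1
    return follows

def complete(follows, prompt, max_new_words=5):
    """Greedily append the most frequent next word, one word at a time."""
    words = prompt.lower().split()
    for _ in range(max_new_words):
        candidates = follows.get(words[-1])
        if not candidates:
            break  # the last word was never followed by anything in training
        words.append(candidates.most_common(1)[0][0])
    return " ".join(words)

# Toy training corpus (hypothetical data, for illustration only).
corpus = [
    "a cat sat on the mat",
    "a dog sat on the mat",
    "a cat slept on the couch",
    "the cat sat down",
]
model = train_bigram(corpus)
completion = complete(model, "a cat")
print(completion)  # a cat sat on the mat
```

A real language model replaces the count table with a neural network that conditions on the whole preceding text, but the generation loop has the same shape.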

By extension, the model has demonstrated that it is capable of handling a large number of language processing tasks, such as translation, answering questions, and filling in missing words in a text.

However, this training strategy can lead to a misalignment of the language model on more complex tasks: a model that is trained only to predict the next word (or a masked word) in a text sequence is not necessarily learning higher-level representations of its meaning.

As a result, the model struggles to generalize to tasks or contexts that require a deeper understanding of language.
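To make the source of this misalignment concrete, the next-token objective can be written as an average negative log-likelihood. The sketch below is a minimal illustration with hypothetical probabilities and a hypothetical function name, `next_token_loss`. Note that nothing in the quantity being minimized refers to helpfulness or truthfulness, only to how likely the observed next token was.

```python
import math

def next_token_loss(predicted_probs, target_tokens):
    """Average negative log-likelihood of the true next tokens."""
    total = 0.0
    for probs, target in zip(predicted_probs, target_tokens):
        total += -math.log(probs[target])
    return total / len(target_tokens)

# Hypothetical model predictions at two positions of a sentence.
predicted = [
    {"sat": 0.7, "ran": 0.2, "is": 0.1},  # after "the cat"
    {"on": 0.5, "down": 0.5},             # after "the cat sat"
]
targets = ["sat", "on"]  # the words that actually came next

loss = next_token_loss(predicted, targets)
print(round(loss, 4))  # 0.5249
```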

Researchers and developers are working on various approaches to address the alignment problem in Large Language Models. ChatGPT is based on a model from the GPT-3.5 series, but has been further trained using human feedback to guide the learning process, with the specific goal of mitigating the model's misalignment issues.

The specific technique used, called Reinforcement Learning from Human Feedback (RLHF), is based on previous academic research. OpenAI had previously applied it to InstructGPT, and ChatGPT brought the technique to a widely used production model.

The method of RLHF

Supervised Fine-Tuning

Supervised Fine-Tuning (SFT) is a method used to fine-tune pre-trained language models, such as GPT-3, on a specific task or set of tasks.

It involves training the model on a smaller dataset, referred to as the demonstration dataset, that is curated by human labelers to align the model's output with human expectations and values.

The process of SFT can be broken down into the following steps:

Select a list of prompts

The first step is to select a list of prompts, or input sentences, that the model will be trained on. These prompts should be relevant to the specific task or set of tasks that the model will be used for.

Collect demonstration data

A group of human labelers are then asked to provide the expected output for each prompt in the list. This creates the demonstration dataset, which is a collection of input-output pairs.

Fine-tune the pre-trained model

The demonstration dataset is then used to fine-tune the pre-trained model using a supervised learning approach. The model is trained to predict the correct output for each input in the demonstration dataset.

Evaluate the fine-tuned model

The final step is to evaluate the performance of the fine-tuned model on a separate test dataset to ensure that it is aligned with human expectations and values.

In the case of ChatGPT, SFT was used to fine-tune a GPT-3.5 model for interactive conversation. According to OpenAI, the demonstration data came from two sources: (1) conversations written by human AI trainers, who played both the user and the AI assistant, and (2) the InstructGPT dataset, converted into a dialogue format.

For example, one of the prompts used in the demonstration dataset could be "What is the capital of France?" and the expected output response would be "Paris".

During the fine-tuning process, the pre-trained model was trained on this demonstration dataset to predict the correct response for each prompt.

This allowed the model to learn the specific language and context of interactive conversation, and thus produce more natural and coherent responses.

The fine-tuned model was then evaluated on a separate test dataset to ensure that it was aligned with human expectations and values.
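The supervised fine-tuning steps above can be sketched in a few lines. This is an illustration under stated assumptions, not OpenAI's pipeline: whitespace tokenization, the `<eos>` marker, the helper names `build_sft_example` and `masked_loss`, and the probabilities are all hypothetical. A common convention, assumed here, is to compute the loss only on the response tokens, so the model learns to produce the demonstration rather than to reproduce the prompt.

```python
import math

def build_sft_example(prompt, response, eos="<eos>"):
    """Turn one demonstration pair into tokens plus a loss mask."""
    prompt_tokens = prompt.split()
    response_tokens = response.split() + [eos]
    tokens = prompt_tokens + response_tokens
    # 0 = prompt token (ignored by the loss), 1 = response token (trained on)
    loss_mask = [0] * len(prompt_tokens) + [1] * len(response_tokens)
    return tokens, loss_mask

def masked_loss(token_probs, loss_mask):
    """Average negative log-likelihood over the response tokens only."""
    losses = [-math.log(p) for p, m in zip(token_probs, loss_mask) if m == 1]
    return sum(losses) / len(losses)

tokens, mask = build_sft_example("What is the capital of France?", "Paris")
print(tokens)  # ['What', 'is', 'the', 'capital', 'of', 'France?', 'Paris', '<eos>']
print(mask)    # [0, 0, 0, 0, 0, 0, 1, 1]

# Hypothetical probabilities the model assigned to each token.
probs = [0.1] * 6 + [0.5, 0.5]
sft_loss = masked_loss(probs, mask)  # only the last two tokens count
```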

Mimic Human Preferences

"Mimic Human Preferences" is a step in the Reinforcement Learning from Human Feedback (RLHF) method used to train ChatGPT. The goal of this step is to incorporate human feedback into the training process, in order to improve the alignment of the model with human values and expectations.

This step can be broken down as follows:

Collect comparison data

The first step is to collect a large amount of data for comparison. This data is generated by the SFT model (the supervised fine-tuning model) and consists of output responses for a set of prompts.

Human labelers vote on the SFT model outputs

A group of human labelers is asked to vote on the SFT model outputs. They are given several output responses for a given prompt and are asked to rank them from best to worst, according to how well each response aligns with human expectations.

Create the reward model

The votes are used to create a dataset of comparison data. This dataset is then used to train a new model, referred to as the reward model (RM). The RM is trained to assign a score to an output response, such that responses preferred by the labelers receive higher scores.

Use the reward model to fine-tune the SFT model

The RM is used to fine-tune the SFT model. The SFT model is updated in a way that maximizes the reward predicted by the RM. This results in a new model, referred to as the policy model.

In ChatGPT, this process was used to improve the alignment of the model with human values and expectations. For example, if the SFT model generated an output response that was racist or xenophobic, the human labelers would vote against it, and the RM would learn to predict that this response is not preferred by humans.
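The reward-model training described above can be sketched as follows. The sketch assumes that each labeler's feedback is recorded as an ordering of responses from best to worst, and it uses the pairwise ranking loss described in the InstructGPT paper; whether ChatGPT uses exactly this form has not been confirmed. The function names and scores are hypothetical.

```python
import math
from itertools import combinations

def ranking_to_pairs(ranked_responses):
    """Expand a best-to-worst ranking into (preferred, rejected) pairs."""
    # combinations() keeps the input order, so the first item of each
    # pair is always the one the labeler ranked higher.
    return list(combinations(ranked_responses, 2))

def pairwise_loss(score_preferred, score_rejected):
    """-log(sigmoid(score_preferred - score_rejected))."""
    diff = score_preferred - score_rejected
    return math.log(1.0 + math.exp(-diff))

pairs = ranking_to_pairs(["A", "B", "C"])
print(pairs)  # [('A', 'B'), ('A', 'C'), ('B', 'C')]

# Hypothetical reward-model scores for a preferred and a rejected response.
agrees = pairwise_loss(2.0, 0.0)     # RM agrees with the labeler: low loss
disagrees = pairwise_loss(0.0, 2.0)  # RM disagrees: high loss
```

Minimizing this loss pushes the score of the preferred response above the score of the rejected one, which is what allows the trained reward model to stand in for the labelers later.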

It is important to note that this step can be iterated continuously by collecting more comparison data on the current best policy model, which is used to train a new reward model and then a new policy.

This allows for a continuous improvement of the model's alignment with human preferences.

Proximal Policy Optimization

Proximal Policy Optimization (PPO) is a reinforcement learning algorithm that is used to optimize the performance of a policy model. The goal of PPO is to find the policy that maximizes the expected reward while limiting how far each update moves the policy away from the previous one.

PPO can be broken down into the following steps:

Collect data on the current policy

Collect data on the current policy by running it in the environment and recording the observations, actions, and rewards.

Estimate the value function

Use the collected data to estimate the value function, which is a prediction of the expected cumulative reward from each state.

Update the policy

Use the value function to update the policy, by adjusting the parameters of the policy model to increase the expected reward while keeping the new policy close to the previous one.

Repeat steps 1-3

Repeat steps 1-3 until the policy converges to an optimal solution.
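The policy update in the steps above is usually expressed through PPO's clipped surrogate objective. The sketch below evaluates that objective for a single action; it is a simplified illustration, not a full training loop. The one-step advantage estimate, the clip range of 0.2, and all names and numbers are assumptions chosen for the example.

```python
import math

def advantage(observed_return, value_estimate):
    """How much better the outcome was than the value function predicted."""
    return observed_return - value_estimate

def clipped_objective(logp_new, logp_old, adv, clip_eps=0.2):
    """PPO clipped surrogate objective for one action (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)  # new probability / old probability
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # Taking the minimum removes the incentive to push the ratio
    # outside the range [1 - clip_eps, 1 + clip_eps].
    return min(ratio * adv, clipped_ratio * adv)

adv = advantage(observed_return=1.0, value_estimate=0.4)

# Policy unchanged: the ratio is 1, so the objective equals the advantage.
unchanged = clipped_objective(math.log(0.5), math.log(0.5), adv)

# Policy raised the action's probability from 0.5 to 0.75 (ratio 1.5):
# the gain is capped as if the ratio were only 1.2.
large_step = clipped_objective(math.log(0.75), math.log(0.5), adv)
```

The clipping is what makes the update "proximal": once the new policy has moved far enough from the old one, further movement in the same direction earns no additional objective value.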

In ChatGPT, Proximal Policy Optimization was used in the final step of the RLHF training process. The policy model was initialized from the SFT model, and PPO adjusted its parameters to increase the expected reward, as determined by the reward model trained in the second step.

For example, if the model generated the output "The cat sat on the couch" in response to the prompt "Write a sentence about a cat sitting", and the reward model scored that output highly, the PPO algorithm would adjust the parameters of the model to increase the likelihood of generating similar outputs in the future.
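One detail worth illustrating is how the reward can be shaped so that the policy does not drift too far from the SFT model while chasing reward-model scores. The InstructGPT paper describes a KL penalty against the SFT model for this purpose; the sketch below assumes ChatGPT does something similar, which OpenAI has not confirmed. The function name, the coefficient `beta`, and the numbers are hypothetical.

```python
def shaped_reward(rm_score, logp_policy, logp_sft, beta=0.02):
    """Reward-model score minus a penalty for drifting from the SFT model.

    logp_policy and logp_sft are the log-probabilities that the current
    policy and the SFT model assign to the same generated response.
    """
    drift = logp_policy - logp_sft  # sample-based estimate of the KL term
    return rm_score - beta * drift

# The policy still agrees with the SFT model: no penalty is applied.
no_drift = shaped_reward(rm_score=1.0, logp_policy=-3.0, logp_sft=-3.0)

# The policy now finds the response far more likely than the SFT model did,
# so part of the reward is taken away.
with_drift = shaped_reward(rm_score=1.0, logp_policy=-0.5, logp_sft=-2.5, beta=0.1)
```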

How much data was used in the training process?

The training process for ChatGPT and other large language models like GPT-3 involves using vast amounts of text data from the internet.

The exact amount of data used to train ChatGPT has not been publicly disclosed by OpenAI, but it is known that GPT-3 was trained on hundreds of billions of words from Common Crawl, WebText2, Books1/2, and English Wikipedia.

GPT-3's training data also included examples of programs written in CSS, JSX, Python, and other languages.

The large volume of sample text used in the training process is crucial for the model to learn the statistical structure of language, such as common word sequences and patterns of word usage. 

This allows the model to generate more natural and fluent text, and is an essential step in the pre-training phase of every language model.

The amount of training data goes hand in hand with the size of the model: GPT-3, the largest language model at the time of its release, has 175 billion parameters and reportedly requires about 800 GB of memory to train.
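The 800 GB figure can be put in context with back-of-the-envelope arithmetic on the parameter count. The sketch below covers only the memory needed to hold the weights themselves, at two common numeric precisions; the helper name is hypothetical. Training typically needs considerably more than this, because gradients and optimizer state are stored alongside the weights.

```python
def parameter_memory_gb(num_parameters, bytes_per_parameter):
    """Memory needed to store the weights alone, in gigabytes (10^9 bytes)."""
    return num_parameters * bytes_per_parameter / 1e9

GPT3_PARAMETERS = 175e9  # 175 billion parameters, as stated above

fp16_gb = parameter_memory_gb(GPT3_PARAMETERS, 2)  # 16-bit floats
fp32_gb = parameter_memory_gb(GPT3_PARAMETERS, 4)  # 32-bit floats

print(fp16_gb)  # 350.0
print(fp32_gb)  # 700.0
```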

In the case of ChatGPT, the training process involved not only the use of large amounts of text data, but also the use of human feedback through the Reinforcement Learning from Human Feedback technique.

This allowed the model to learn from human preferences and values, further improving its ability to generate text that aligns with human expectations.

Advantages

One of the major advantages of ChatGPT is its ability to generate text that is similar to that of a human being. This is achieved through the use of a large language model (LLM) from the GPT-3 family, which is based on transformer technology.

GPT-3 has been trained on a vast amount of text data from the internet, which has allowed it to learn the statistical structure of language and generate more natural and fluent text.

Another advantage of ChatGPT is its ability to take into account the context of a conversation. It also improves on its predecessor, GPT-3, which sometimes produced output that was not consistent with human expectations or desirable values.

This is addressed by the use of a technique called Reinforcement Learning from Human Feedback (RLHF) in the training of ChatGPT.

This technique uses human feedback in the training loop to minimize harmful, untruthful, and/or biased outputs, making ChatGPT more aligned with human values and expectations.

One of the most notable advantages of ChatGPT is its ability to handle a wide range of language processing tasks. This includes not only text generation but also tasks such as translation and answering questions.

This is possible due to the large volume of data that the GPT-3 model has been trained on, which has allowed it to learn a wide range of language patterns and structures.

Additionally, ChatGPT is designed with a strong focus on interactive conversations, making it well-suited for use in a wide range of applications, such as customer service chatbots, virtual assistants, and more.
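For conversational applications like these, one common way to give a language model access to the context of a conversation is to place the earlier turns in the prompt. The sketch below illustrates that general idea only; how ChatGPT formats and truncates its context has not been published. The speaker labels, the word budget, and the function name `build_prompt` are hypothetical.

```python
def build_prompt(history, user_message, max_words=50):
    """Join earlier turns and the new message into a single prompt."""
    turns = list(history) + [("User", user_message)]
    lines = [f"{speaker}: {text}" for speaker, text in turns]
    # Drop the oldest turns until the prompt fits the budget,
    # always keeping at least the latest message.
    while len(lines) > 1 and sum(len(line.split()) for line in lines) > max_words:
        lines.pop(0)
    lines.append("Assistant:")  # the model continues from here
    return "\n".join(lines)

history = [
    ("User", "What is the capital of France?"),
    ("Assistant", "Paris."),
]
prompt = build_prompt(history, "And of Italy?")
print(prompt)
# User: What is the capital of France?
# Assistant: Paris.
# User: And of Italy?
# Assistant:
```

With the first two turns included, the follow-up "And of Italy?" can be interpreted as a question about a capital city; with a tighter budget the oldest turns are dropped first and that context is lost.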

Another advantage of ChatGPT is its ability to learn from human feedback. Through RLHF, its outputs are adjusted to better align with human values and expectations. This makes it a more robust model and helps to mitigate the misalignment issues that have been seen in previous models such as GPT-3.

Limitations

Despite the advantages of the RLHF method, there are still some limitations that need to be addressed. One of the main limitations is that the method relies on human feedback, which can be expensive and time-consuming to collect. Additionally, the method may not be suitable for all types of tasks and applications, as it requires a certain level of human supervision and intervention. Finally, the method may not be suitable for tasks that require a deep understanding of language, such as common sense reasoning or natural language inference.

Human feedback is also subjective. Different labelers may have different opinions on the quality of the generated text, and this can lead to inconsistencies in the training data.
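The inconsistency between labelers can be quantified. A minimal sketch, assuming each labeler picks one preferred response for the same prompt, is to measure the fraction of labeler pairs that made the same choice. The function name and the votes below are hypothetical, and more refined agreement statistics exist.

```python
from itertools import combinations

def pairwise_agreement(votes):
    """Fraction of labeler pairs that chose the same response."""
    pairs = list(combinations(votes, 2))
    agreeing = sum(1 for a, b in pairs if a == b)
    return agreeing / len(pairs)

# Four labelers vote on the best response to the same prompt.
votes = ["A", "A", "B", "A"]
agreement = pairwise_agreement(votes)
print(agreement)  # 0.5
```

Even with three of four labelers choosing the same response, only half of the labeler pairs agree, which gives a sense of how noisy comparison data can be.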

Another limitation is that a reward model only reflects the kinds of prompts and tasks for which preferences were collected. ChatGPT's training targeted interactive conversation; applying RLHF to other language generation tasks, such as text summarization or machine translation, requires preference data collected for those tasks.

Despite these limitations, the use of RLHF in the training of ChatGPT has shown promising results in improving the alignment of the model with human values and expectations. The ability to generate text that is similar to that of a human being while taking into account the context of a conversation is a significant step forward in the development of conversational AI systems.

It's important to note that OpenAI has not released the details of the inner workings of ChatGPT, and further research is needed to fully understand its capabilities and limitations. However, it is clear that the use of RLHF in the training of ChatGPT represents a significant improvement over its predecessor GPT-3 and sets a new standard for conversational AI systems. As the field of AI continues to evolve, we can expect to see further advancements in the alignment of language models with human values and expectations.

Conclusion

In conclusion, ChatGPT is a powerful conversational AI model that was developed by OpenAI as an extension of its predecessor GPT-3. The model is based on the Generative Pre-trained Transformer (GPT) series of models and utilizes a combination of supervised learning and reinforcement learning techniques.

The primary technique used in ChatGPT is Reinforcement Learning from Human Feedback (RLHF), which uses human feedback to minimize harmful, untruthful, and/or biased outputs.

The training process of ChatGPT involves three distinct steps: Supervised Fine-Tuning, Mimic Human Preferences, and Proximal Policy Optimization.

A model's capability refers to its ability to perform a specific task or set of tasks, while its alignment concerns the gap between what we actually want the model to do and what it is being trained to do.

One of the major advantages of ChatGPT is its ability to generate text that is similar to that of a human being and its ability to take into account the context of a conversation. This makes it a powerful tool for a wide range of applications, such as customer service, language translation, and question answering.

The training process of ChatGPT utilizes a large amount of text data from the internet, with OpenAI using Common Crawl, WebText2, Books1/2, and Wikipedia in English. Additionally, it has been trained on examples of programs coded in CSS, JSX, Python, etc.

However, it is important to note that ChatGPT has some limitations, such as difficulty with idiomatic expressions, sarcasm, and figurative language. Additionally, its training data ends at a cutoff date, so it cannot recall events that occurred after that date.

Despite its limitations, ChatGPT represents a significant advancement in the field of AI and natural language processing. It is a powerful tool for a wide range of applications, and we can expect to see it continue to be developed and refined in the future.