GPT-J-6B: Exploring Approaches for News Snippet Generation and Evaluation
About the model
GPT-J-6B is a 6-billion-parameter, autoregressive text generation model by EleutherAI, trained on The Pile using Mesh Transformer JAX (Ben Wang and Aran Komatsuzaki). We opted to work with GPT-J-6B-8bit: a quantized GPT-J-6B with 8-bit weights by Hivemind, which enables scalable and cost-efficient fine-tuning with LoRA and 8-bit Adam.
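To give an impression of what such a setup looks like in code, here is a minimal sketch using the Hugging Face transformers, bitsandbytes and peft libraries. The original Hivemind notebook ships its own 8-bit patch and hand-written LoRA layers, so treat the library calls below as an illustrative alternative rather than the exact code we used.

```python
# Illustrative sketch: load GPT-J-6B with 8-bit weights and attach LoRA adapters.
# Uses transformers + bitsandbytes + peft as a stand-in for Hivemind's custom
# 8-bit patch; not the exact setup from the original notebook.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "EleutherAI/gpt-j-6b"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# 8-bit weights keep the 6B parameters within a single (Colab-class) GPU.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,
    device_map="auto",
)

# Small trainable LoRA matrices on the attention projections; the frozen
# 8-bit base weights are never updated.
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a tiny fraction of parameters is trainable
```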
Why start with GPT-J?
To answer that question, one has to compare GPT-J with the most powerful model currently on the market: GPT-3. The largest GPT-3 model, GPT-3 Davinci, has been trained on more data and has almost 30 times more parameters than its open-source alternative GPT-J. GPT-J generally performs better than the smaller GPT-3 models Ada and Babbage, is almost on par with GPT-3 Curie, but does not quite reach GPT-3 Davinci. That being said, GPT-J is a good alternative to OpenAI’s GPT-3, as it is open source, free and less resource intensive.
Approaches
Task adaptation approach | Ability to solve the task | Observations |
---|---|---|
One Shot Prompting | ❌ insufficient | Model has weak German prompting and generation capabilities, as it was mainly trained on English data (see the prompt sketch below the table) |
Prompt Tuning | ❌ none | Outputs are even worse; the 8-bit patch and LoRA adapters complicate the approach |
Fine-tuning with adapters for headline generation | ✅ headline | Outputs for headlines improved at the cost of losing all other prompting capabilities |
Multitask fine-tuning for headline and teaser generation with adapters | ✅ headline, teaser | Outputs for headline and teaser improved at the cost of losing all other prompting capabilities |
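To make the one-shot prompting row above concrete, here is a sketch of the kind of prompt we fed to the model, reusing the `model` and `tokenizer` from the sketch above. The German instruction wording and the placeholders are illustrative, not our exact prompt.

```python
# One-shot prompt: a single worked example (article + headline) followed by
# the article we actually want a headline for. Wording is illustrative.
one_shot_prompt = (
    "Artikel: <Beispielartikel>\n"
    "Überschrift: <Beispielüberschrift>\n"
    "\n"
    "Artikel: <neuer Artikel, für den eine Überschrift generiert werden soll>\n"
    "Überschrift:"
)

inputs = tokenizer(one_shot_prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=32, do_sample=True, top_p=0.9)
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```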
Fine-tuning vs. Multitask Fine-tuning
The difference lies in the dataset. Take our 1k dataset as an example. For fine-tuning, we have 1,000 data points, each consisting of a text and a title. One task is trained: to generate a title for a news article. For multitask fine-tuning, we have 2 x 1,000 data points, consisting of a text and title as well as a text and teaser for each news article. Thus, two tasks are trained: to generate a title and a teaser for a news article.
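As a rough sketch, the two dataset layouts can be built from the same articles like this; the field names and German prompt phrases are placeholders for our actual preprocessing:

```python
# Illustrative construction of the two training sets from the same articles.
# Field names and prompt phrases are placeholders, not our exact preprocessing.
articles = [
    {"text": "…", "title": "…", "teaser": "…"},
    # … 1,000 articles in the 1k dataset
]

# Fine-tuning: one task, 1,000 training examples (text -> title).
finetune_examples = [
    f"{a['text']}\n\nÜberschrift: {a['title']}" for a in articles
]

# Multitask fine-tuning: two tasks, 2 x 1,000 training examples
# (text -> title and text -> teaser for every article).
multitask_examples = [
    f"{a['text']}\n\nÜberschrift: {a['title']}" for a in articles
] + [
    f"{a['text']}\n\nTeaser: {a['teaser']}" for a in articles
]
```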
The 10k dataset
Fine-tuning for a single task does not scale well for large language models, as a separate large model must be trained and hosted for each task. While multitask fine-tuning requires only one model to solve multiple similar tasks, it is still inferior to the prompting paradigm: the model loses its general prompting capabilities and becomes locked to the set of tasks it was trained on.
Get your hands on the code
How to fine-tune GPT-J-6B-8bit for News Headline Generation, Code
Objective | Hardware | Training Platform | Dataset-Size | Time | Cost |
---|---|---|---|---|---|
Causal Language Modeling | GPU with > 12 GB VRAM | Google Colab (free) | 1k | unknown | free |
How to multitask fine-tune GPT-J-6B-8bit for News Headline and Teaser Generation, Code
Objective | Hardware | Training Platform | Dataset-Size | Time | Cost |
---|---|---|---|---|---|
Causal Language Modeling | A100 SXM4 | Google Colab | 1k | 27h 42min | $37.00 |
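Under the hood, both notebooks train with a causal language modeling objective: the article, the prompt phrase and the target snippet are fed through the model, and the input ids double as the labels. The loop below is a heavily simplified stand-in for the linked notebooks, using bitsandbytes' 8-bit Adam and the `model`, `tokenizer` and `multitask_examples` from the sketches above; hyperparameters are illustrative.

```python
# Simplified causal-LM training loop with 8-bit Adam; a stand-in for the
# linked notebooks, not a verbatim excerpt.
import bitsandbytes as bnb

optimizer = bnb.optim.Adam8bit(
    (p for p in model.parameters() if p.requires_grad), lr=1e-5
)

model.train()
for epoch in range(1):
    for example in multitask_examples:
        batch = tokenizer(
            example, return_tensors="pt", truncation=True, max_length=1024
        ).to(model.device)
        # Causal LM objective: the labels are the input ids themselves.
        out = model(**batch, labels=batch["input_ids"])
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```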
You can also load our fine-tuned GPT-J model gptj-title-teaser-10k from Hugging Face. For more information, see: snipaid/gptj-title-teaser-10k.
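A minimal inference sketch is shown below. The prompt phrasing is an assumption on our part, and depending on how the 8-bit weights are stored, additional loading code may be needed; consult the snipaid/gptj-title-teaser-10k model card for the exact prompt format and loading instructions.

```python
# Minimal inference sketch for the fine-tuned model. Prompt phrasing is an
# assumption; check the model card for the exact format and loading code.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "snipaid/gptj-title-teaser-10k"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", trust_remote_code=True
)

article = "…"  # full German news article text
prompt = f"{article}\n\nÜberschrift:"  # use "Teaser:" to generate a teaser instead

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```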
Quality Assessment
To assess quality, we conducted a qualitative evaluation of gptj-title-teaser-10k in the form of an expert review with two experts. Both experts had access to a sample of 30 news articles with the full texts and the actual and generated titles. Their task was to independently assign one of three labels to each generated output:
- Good: factually correct, with decent style and grammar
- Bad: poor style or grammar
- Fake: the headline does not align with the facts given in the news article
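To turn the raw expert labels into numbers, it is enough to count the label shares per expert and check how often the two experts agree. The label lists below are made-up placeholders, not our actual annotations.

```python
# Aggregating the expert review: label shares per expert and raw agreement.
# The label lists are made-up placeholders, not our actual annotations.
from collections import Counter

expert_a = ["good", "bad", "fake", "good", "bad"]   # 30 labels in practice
expert_b = ["good", "bad", "bad", "good", "fake"]

for name, labels in [("Expert A", expert_a), ("Expert B", expert_b)]:
    counts = Counter(labels)
    print(name, dict(counts), f"good: {counts['good'] / len(labels):.0%}")

# Raw agreement: fraction of samples where both experts chose the same label.
agreement = sum(a == b for a, b in zip(expert_a, expert_b)) / len(expert_a)
print(f"raw agreement: {agreement:.0%}")
```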
Since the percentage of generated snippets classified as good is well below 50% and scalability of this approach is an issue, we decided to explore further models and approaches. Stay tuned…
TL;DR: We evaluated the following approaches for adapting GPT-J to the snippet generation task: One-Shot Prompting, Prompt Tuning, Model Fine-tuning with Adapters and Multitask Fine-tuning with Adapters. Prompting and Prompt Tuning did not prove successful. Fine-tuning improved the model’s snippet generation capabilities; however, both quality and scalability remain a concern with this approach.
“GPT-J-6B: Exploring Approaches for News Snippet Generation and Evaluation” is part nine of our series News Snippet Generation, a learning journey on open source large language models and how to assist journalists in generating headlines and teasers in German with AI.