A Pile of Computers, Files, Notebooks and Phones

About the model

GPT-J-6B is a 6-billion-parameter, autoregressive text generation model by EleutherAI, trained on The Pile using Mesh Transformer JAX (Ben Wang and Aran Komatsuzaki). We opted to work with GPT-J-6B-8bit: a version of GPT-J-6B quantized to 8-bit weights by Hivemind that enables scalable and cost-efficient fine-tuning with LoRA and 8-bit Adam.
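
For orientation, here is a minimal setup sketch. The original notebook uses Hivemind's hand-patched 8-bit GPT-J, whereas the snippet below approximates it with transformers' built-in 8-bit loading, peft's LoRA adapters and bitsandbytes' 8-bit Adam; the LoRA hyperparameters and target modules are illustrative assumptions, not the exact configuration from the notebook.

```python
import bitsandbytes as bnb
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load GPT-J-6B with 8-bit weights (stand-in for Hivemind's hand-patched GPT-J-6B-8bit).
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B",
    load_in_8bit=True,   # quantize the linear layers to 8 bit via bitsandbytes
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")

# Freeze the 8-bit base model and attach small trainable LoRA adapters.
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=8, lora_alpha=32, lora_dropout=0.05,    # illustrative hyperparameters
    target_modules=["q_proj", "v_proj"],      # illustrative choice of attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# 8-bit Adam keeps the optimizer state small enough for a single GPU.
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-5)
```

With this setup only the adapter weights receive gradients, which is what makes fine-tuning a 6-billion-parameter model feasible on a single GPU.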

Why start with GPT-J?

To answer that question, one has to compare GPT-J with the most powerful model currently on the market: GPT-3. The largest GPT-3 model, GPT-3 Davinci, was trained on more data and has almost 30 times more parameters than its open-source alternative GPT-J. GPT-J generally performs better than the smaller GPT-3 variants Ada and Babbage, is almost on par with GPT-3 Curie, but does not quite reach GPT-3 Davinci. That being said, GPT-J is a good alternative to OpenAI’s GPT-3, as it is open source, free and less resource-intensive.

Approaches

| Task adaptation approach | Ability to solve the task | Observations |
| --- | --- | --- |
| One Shot Prompting | ❌ insufficient | The model has weak German prompting and generation capabilities, as it was mainly trained on English data (see the prompt sketch below the table). |
| Prompt Tuning | ❌ none | Outputs are even worse; the 8-bit patch and LoRA adapters complicate the approach. |
| Fine-tuning with adapters for headline generation | ✅ headline | Outputs for headlines improved, at the cost of losing all other prompting capabilities. |
| Multitask fine-tuning for headline and teaser generation with adapters | ✅ headline, teaser | Outputs for headlines and teasers improved, at the cost of losing all other prompting capabilities. |
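
To make the one-shot row concrete: a single worked example is prepended to the new article, and the model is asked to continue with a headline. The prompt wording and generation parameters below are illustrative assumptions, reusing the model and tokenizer from the setup sketch above.

```python
article_text = "..."  # full German news article text goes here

# One worked example (article + headline), then the new article and an open headline slot.
prompt = (
    "Artikel: Die Feuerwehr hat am Montag einen Brand in einer Lagerhalle gelöscht.\n"
    "Überschrift: Feuerwehr löscht Lagerhallenbrand\n\n"
    f"Artikel: {article_text}\n"
    "Überschrift:"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(
    **inputs,
    max_new_tokens=32,
    do_sample=True,
    top_p=0.9,
    pad_token_id=tokenizer.eos_token_id,
)
# Decode only the newly generated tokens after the prompt.
headline = tokenizer.decode(
    output_ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
```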

Fine-Tuning vs. Multitask Fine-Tuning

Process of Fine-Tuning and Multitask Fine-Tuning

The difference lies in the dataset. Consider our 1k dataset as an example. For fine-tuning, we have 1,000 data points, each consisting of a text and a title, and a single task is trained: generating a title for a news article. For multitask fine-tuning, we have 2 × 1,000 data points, consisting of a text and title as well as a text and teaser for each news article. Thus, two tasks are trained: generating a title and generating a teaser for a news article.
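
In code, the difference is simply how the training examples are laid out before tokenization. The prompt template and field names below are illustrative assumptions, not the exact format of our dataset.

```python
# Single-task fine-tuning: one training example per article (title only).
def make_title_example(article):
    return f"[Text]: {article['text']}\n[Titel]: {article['title']}"

# Multitask fine-tuning: two training examples per article (title and teaser),
# so a 1k article set yields 2 x 1,000 data points.
def make_multitask_examples(article):
    return [
        f"[Text]: {article['text']}\n[Titel]: {article['title']}",
        f"[Text]: {article['text']}\n[Teaser]: {article['teaser']}",
    ]

articles = [...]  # list of dicts with "text", "title" and "teaser" keys
finetune_data = [make_title_example(a) for a in articles]
multitask_data = [ex for a in articles for ex in make_multitask_examples(a)]
```

The teaser examples reuse the same article texts, which is why the multitask dataset doubles in size without requiring any additional articles.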

The 10k dataset

Insights into the 10k dataset

Fine-tuning for a single task does not scale well for large language models, as a separate large model must be trained and hosted for each task. Multitask fine-tuning requires only one model to solve multiple similar tasks, but it is still inferior to the prompting paradigm: the model loses its general prompting capabilities and becomes locked to the set of tasks it was trained on.

Get your hands on the code

How to finetune GPT-J-6B-8bit for News Headline Generation, Code

| Objective | Hardware | Training Platform | Dataset Size | Time | Cost |
| --- | --- | --- | --- | --- | --- |
| Causal Language Modeling | GPU with RAM > 12 GB | Google Colab (free) | 1k | unknown | free |
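
The objective in the table, causal language modeling, means the model learns to predict each next token of the concatenated text-plus-title example; when the labels equal the input ids, the Hugging Face model computes this loss internally. Below is a minimal training-step sketch, reusing the model, tokenizer and optimizer from the setup sketch above and the examples built earlier; batching, learning-rate scheduling and gradient checkpointing are omitted, and the hyperparameters are illustrative.

```python
model.train()
for epoch in range(3):                          # illustrative epoch count
    for example in finetune_data:               # examples built as in the earlier sketch
        batch = tokenizer(
            example, return_tensors="pt", truncation=True, max_length=1024
        ).to(model.device)
        # For causal language modeling the labels are the input ids;
        # the model shifts them internally to predict the next token.
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```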

How to multitask fine-tune GPT-J-6B-8bit for News Headline and Teaser Generation, Code

| Objective | Hardware | Training Platform | Dataset Size | Time | Cost |
| --- | --- | --- | --- | --- | --- |
| Causal Language Modeling | A100 SXM4 | Google Colab (free) | 1k | 27h 42min | $37.00 |

You can also load our fine-tuned GPT-J model gptj-title-teaser-10k from Hugging Face. For more information, see snipaid/gptj-title-teaser-10k.
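
A rough generation sketch follows; the model card has the authoritative loading code and prompt template, so both the plain from_pretrained call and the `[Text]`/`[Titel]`/`[Teaser]` format below are illustrative assumptions.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative loading; see the model card for the exact code and prompt format.
tokenizer = AutoTokenizer.from_pretrained("snipaid/gptj-title-teaser-10k")
model = AutoModelForCausalLM.from_pretrained(
    "snipaid/gptj-title-teaser-10k", device_map="auto"
)

article_text = "..."  # full German news article text goes here

# Generate a title and a teaser for the same article.
for task in ("Titel", "Teaser"):
    prompt = f"[Text]: {article_text}\n[{task}]:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_p=0.9)
    snippet = tokenizer.decode(
        output_ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    print(task, snippet, sep=": ")
```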

Quality Assessment

To assess quality, we conducted a qualitative evaluation of gptj-title-teaser-10k in the form of an expert review with two experts. Both experts had access to a sample of 30 news articles with the full texts as well as the actual and generated titles. Their task was to independently label the generated outputs with one of three labels:

  • Good: factually accurate, with decent style and grammar
  • Bad: poor style and/or grammar
  • Fake: the headline does not align with the facts given in the news article
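
The evaluation itself boils down to counting labels: each expert assigns one of the three labels to every generated snippet, and we report the share of each label per expert. A tiny sketch with hypothetical labels (not our actual results):

```python
from collections import Counter

# Hypothetical label lists, one entry per generated snippet and expert (not real results).
expert_labels = {
    "Expert 1": ["good", "bad", "fake", "good", "bad"],
    "Expert 2": ["good", "bad", "bad", "good", "fake"],
}

for expert, labels in expert_labels.items():
    counts = Counter(labels)
    # Percentage of snippets per label for this expert.
    shares = {label: round(100 * counts[label] / len(labels)) for label in ("good", "bad", "fake")}
    print(expert, shares)
```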

Expert Evaluation gptj-title-teaser-10k

Since the percentage of generated snippets classified as good is well below 50% and scalability of this approach is an issue, we decided to explore further models and approaches. Stay tuned…

TL;DR: We evaluated the following approaches for adapting the GPT-J model to the snippet generation task: One-Shot Prompting, Prompt Tuning, Fine-tuning with Adapters and Multitask Fine-tuning with Adapters. Prompting and prompt tuning did not prove successful. Fine-tuning improved the model’s snippet generation capabilities; however, both quality and scalability are a concern with this approach.

“GPT-J-6B: Exploring Approaches for News Snippet Generation and Evaluation” is part nine of our series News Snippet Generation. A Learning Journey on Open Source Large Language Models and how to assist journalists in generating headlines and teasers in German with AI.

