[Image: A hedgehog reading a newspaper, with news snippets all around]

With German instruct models finally available, we decided to attempt one last model adaptation for the task of news snippet generation in German.

The idea: improve snippet generation capabilities by means of continued multitask prompted fine-tuning. This approach requires news article and snippet data. For the news data, several public news datasets are available; for copyright reasons, we decided to build on an existing one, MLSUM. Getting the news snippet data is the more challenging part.

Augmenting a German News Dataset with Snippet Data

MLSUM contains text, title and teaser for news articles from the German newspaper “Süddeutsche Zeitung”. To add further news snippets, we engineered a composite prompt that generates a SERP snippet, keywords and a tweet for each news article with GPT-3.5.

The composite prompt:

"Google SERP Snippet und Tweet Generation auf Deutsch  
Generiere Title Tag und Meta Description für den folgenden Artikel, die den Leser ansprechen.  
Title Tag: Kurztitel in drei Worten. Meta Description: Was gibt es Neues in fünfzehn Worten!  
Liste als zweites 10 SEO-Keywords getrennt mit Komma.  
Generiere als drittes einen Tweet zu dem Artikel."

Note: This composite prompt proved useful for us in two ways: it reduces the number of GPT requests, and it keeps the outputs within the length limits for title tag, meta description and tweet.

In this manner, we collected SERP title tag and meta description, keywords and a tweet for a sample of ~500 news articles from MLSUM.
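The augmentation loop can be pictured along these lines. The following is a minimal sketch, not our exact pipeline: it assumes the openai Python client, the Hugging Face "mlsum" dataset and the gpt-3.5-turbo model name; splitting the completion into separate snippet columns happens in a later parsing step.

```python
# Minimal sketch of the augmentation loop (assumptions: openai client,
# Hugging Face "mlsum" dataset, model name "gpt-3.5-turbo").
from datasets import load_dataset
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

COMPOSITE_PROMPT = """Google SERP Snippet und Tweet Generation auf Deutsch
Generiere Title Tag und Meta Description für den folgenden Artikel, die den Leser ansprechen.
Title Tag: Kurztitel in drei Worten. Meta Description: Was gibt es Neues in fünfzehn Worten!
Liste als zweites 10 SEO-Keywords getrennt mit Komma.
Generiere als drittes einen Tweet zu dem Artikel."""

# German split of MLSUM; only a handful of articles for illustration.
articles = load_dataset("mlsum", "de", split="train").select(range(5))

augmented = []
for article in articles:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user",
                   "content": f"{COMPOSITE_PROMPT}\n\nArtikel:\n{article['text']}"}],
        temperature=0.7,
    )
    # One completion contains title tag, meta description, keywords and tweet;
    # splitting it into separate columns is a later parsing step.
    augmented.append({"text": article["text"],
                      "raw_snippets": response.choices[0].message.content})
```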

Let’s see a short example of what a news article with snippets looks like.

MLSUM Text, Title and Teaser:

Text: For copyright reasons, please follow this link to see the article's full text at “Süddeutsche Zeitung”.
Title: NRW-Landtag vor Auflösung - Schöne Neuwahlen allerseits!
Teaser: Plötzlich steht in NRW eine Neuwahl an. Das ist gut so, denn sie wird ziemlich sicher klare Verhältnisse schaffen. Außerdem zeigt sie, dass Politik manchmal unplanbar ist. Und das ist doch irgendwie beruhigend - auch wenn der Zeitpunkt für zwei Parteien denkbar schlecht ist.

Generated Title-Tag, Meta-Description, Keywords and Tweet:

Keywords: NRW, Minderheitsregierung, Neuwahlen, Landtag, Hannelore Kraft, Sylvia Löhrmann, Demokratie, Bildungskonsens, Taktiererei, Mehrheiten
Title tag: Juristische Einschätzungspanne
Meta description: Das Ende der rot-grünen Minderheitsregierung in NRW und was es bedeutet.
Tweet: Das Ende der rot-grünen Minderheitsregierung in NRW - eine juristische Einschätzungspanne führt zu Neuwahlen im bevölkerungsreichsten Bundesland. Was bedeutet das für die Demokratie und den Bildungskonsens? #NRW #Minderheitsregierung #Neuwahlen

The full dataset can be found here.

From Augmented News Dataset to Instruct Data

For instruction fine-tuning we need to transform the data into prompts consisting of instruction, input and expected output, where the instruction is a textual description of the task.

This is what the prompt templates of German Alpaca and IGEL look like.

German Alpaca:

Nachfolgend ist eine Anweisung, die eine Aufgabe beschreibt, zusammen mit einer Eingabe, die weiteren Kontext liefert. Schreiben Sie eine Antwort, die die Aufgabe angemessen erfüllt.

### Anweisung:
{instruction}

### Eingabe:
{input}

### Antwort:

(In English, the preamble reads: "Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the task.")

IGEL:

### Anweisung:
{input}

### Antwort:
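As a small illustration of how an example might be rendered into these two layouts, consider the sketch below. The field names and the way instruction and article text are merged for IGEL are assumptions, not the exact training code.

```python
# Sketch: render one (instruction, input) pair into the two prompt layouts above.
ALPACA_DE_TEMPLATE = (
    "Nachfolgend ist eine Anweisung, die eine Aufgabe beschreibt, zusammen mit "
    "einer Eingabe, die weiteren Kontext liefert. Schreiben Sie eine Antwort, "
    "die die Aufgabe angemessen erfüllt.\n\n"
    "### Anweisung:\n{instruction}\n\n"
    "### Eingabe:\n{input}\n\n"
    "### Antwort:\n"
)

IGEL_TEMPLATE = "### Anweisung:\n{input}\n\n### Antwort:\n"

def to_alpaca_prompt(example: dict) -> str:
    return ALPACA_DE_TEMPLATE.format(instruction=example["instruction"],
                                     input=example["input"])

def to_igel_prompt(example: dict) -> str:
    # Assumption: for IGEL, instruction and article text are merged into one input block.
    return IGEL_TEMPLATE.format(input=f"{example['instruction']}\n{example['input']}")
```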

We now know how to structure the data. The augmented MLSUM dataset provides inputs and outputs. What we are still missing is a set of instructions. So again, we generate a set of instructions for each of the snippet generation tasks with GPT-3.5.

Getting instructions:

  1. Define a seed set of five possible instructions for the task.
  2. Prompt GPT-3.5 for more instructions. This is the prompt we used: “Eine KI soll zu Zeitungsartikeln generieren. Hier sind einige Beispielprompts: [the five seed instructions] Welche weiteren Prompts fallen dir ein?” (In English: "An AI is supposed to generate ... for newspaper articles. Here are some example prompts: ... What further prompts can you think of?")
  3. Ask for more: “Fallen dir noch weitere ein?” or “Fallen dir noch weitere Variationen ein?” ("Can you think of any more?" / "Can you think of further variations?"). A sketch of this collection loop follows below.
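The following sketch shows how such a loop might look in Python; it assumes the openai client, and the seed instructions shown are placeholders, not our actual seed set.

```python
# Sketch of the instruction-collection loop: one initial chat turn with GPT-3.5,
# followed by "ask for more" turns. The seed instructions are placeholders.
from openai import OpenAI

client = OpenAI()

seed_instructions = [
    "Schreibe einen Tweet zu folgendem Zeitungsartikel.",   # placeholder seeds,
    "Formuliere einen kurzen Tweet über diesen Artikel.",   # not the real seed set
]

initial_prompt = ("Eine KI soll zu Zeitungsartikeln generieren. "
                  "Hier sind einige Beispielprompts:\n"
                  + "\n".join(seed_instructions)
                  + "\nWelche weiteren Prompts fallen dir ein?")

def ask(messages: list) -> str:
    """Send the running conversation to GPT-3.5 and append its answer."""
    response = client.chat.completions.create(model="gpt-3.5-turbo", messages=messages)
    answer = response.choices[0].message.content
    messages.append({"role": "assistant", "content": answer})
    return answer

def extract_prompts(answer: str) -> list:
    """Naively split a numbered or bulleted answer into individual prompts."""
    return [line.strip("-•0123456789. ").strip()
            for line in answer.splitlines() if line.strip()]

messages = [{"role": "user", "content": initial_prompt}]
collected = list(seed_instructions) + extract_prompts(ask(messages))

for follow_up in ("Fallen dir noch weitere ein?", "Fallen dir noch weitere Variationen ein?"):
    messages.append({"role": "user", "content": follow_up})
    collected.extend(extract_prompts(ask(messages)))
```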

In this manner, we generate 25 prompts per snippet type. With those we assemble an instruction dataset. Notice that each row in the augmented MLSUM dataset contains a news article along with multiple snippets; we need to split this up. For each snippet of each news article, the instruction dataset contains a single row composed of instruction, input and output.

A schematic sketch of the transformation could look like this.

Instruction: one of the generated instructions for the task, selected at random
Input: the full text of the news article
Output: the snippet text
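A sketch of this explode step, with assumed column names for the augmented dataset, might look like this:

```python
# Sketch: one augmented article row (article text + six snippets) becomes one
# instruction example per snippet. Column names are assumptions about the layout.
import random

SNIPPET_TYPES = ["title", "teaser", "title_tag", "meta_description", "keywords", "tweet"]

def explode(row: dict, instructions_by_type: dict) -> list:
    examples = []
    for snippet_type in SNIPPET_TYPES:
        examples.append({
            # one of the ~25 generated instructions for this task, picked at random
            "instruction": random.choice(instructions_by_type[snippet_type]),
            "input": row["text"],          # full text of the news article
            "output": row[snippet_type],   # the snippet itself
        })
    return examples

# instruction_dataset = [ex for row in augmented_mlsum for ex in explode(row, instructions_by_type)]
```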

The full news snippet instruction dataset can be found here.

Finetuning & Results

With continued instruction fine-tuning on news snippet data we should be able to increase news snippet generation capabilities. At the same time we would like to keep the model's general prompting capabilities. This raises the question of how many data points are needed for the fine-tuning.

Instruction fine-tuned models have proven capable of solving unseen tasks with just a few examples (one-/few-shot prompting). Exploring the 52k Alpaca instruct dataset, we find around 500 instruction examples per task, e.g. for coding or writing a poem. So we assume that somewhere between a few and several hundred data points will be sufficient.

We intend to repeat the fine-tuning with gradually increasing amounts of training data to see the difference.

Steps for gradual increase:

  • 10 articles - 60 snippet examples
  • 100 articles - 600 snippet examples
  • 500 articles - 3,000 snippet examples
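When building these subsets, all six snippet examples of an article should end up in the same subset. A small sketch, assuming each instruction example carries an article_id field (an assumed name, not our exact code):

```python
# Sketch: sample the training subsets by article so that all six snippet
# examples of an article stay together. "article_id" is an assumed field.
from collections import defaultdict

def gradual_subsets(instruction_examples: list) -> dict:
    by_article = defaultdict(list)
    for example in instruction_examples:
        by_article[example["article_id"]].append(example)

    article_ids = sorted(by_article)
    subsets = {}
    for n_articles in (10, 100, 500):
        chosen = article_ids[:n_articles]
        subsets[f"snip-igel-{n_articles}"] = [ex for aid in chosen for ex in by_article[aid]]
    return subsets
```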

Code and a short tutorial for the fine-tuning can be found here.

Our approach continues fine-tuning the IGEL LoRA adapters. With the steps for gradual increase, this results in three adapters: snip-igel-10, snip-igel-100, and snip-igel-500.
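A minimal sketch of what continuing from an existing IGEL LoRA adapter can look like with transformers and peft is shown below. The model and adapter IDs, the data file name and the hyperparameters are assumptions, not our exact training setup; the linked tutorial has the full code.

```python
# Minimal sketch of continued LoRA fine-tuning with transformers + peft.
# IDs, file names and hyperparameters are assumptions, not the exact setup.
from datasets import load_dataset
from peft import PeftModel
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base_id = "malteos/bloom-6b4-clp-german"      # IGEL's German base model (assumed ID)
adapter_id = "philschmid/instruct-igel-001"   # existing IGEL LoRA adapter (assumed ID)

tokenizer = AutoTokenizer.from_pretrained(base_id)
base_model = AutoModelForCausalLM.from_pretrained(base_id)
# Load the existing adapter and mark it trainable so fine-tuning continues from it.
model = PeftModel.from_pretrained(base_model, adapter_id, is_trainable=True)

# Instruction examples already rendered into the IGEL prompt format (prompt + answer)
# and stored in a JSON file with a "text" column (assumed layout).
dataset = load_dataset("json", data_files="snip_instructions_500.json", split="train")
dataset = dataset.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=1024),
                      remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="snip-igel-500", num_train_epochs=3,
                           per_device_train_batch_size=1, learning_rate=1e-4),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("snip-igel-500")  # saves only the updated LoRA adapter weights
```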

Evaluation of the Approach

We see an improvement in snippet generation capabilities with the snip-igel-10 and snip-igel-100 adapters.

To assess the quality of snip-igel-500, we conducted a qualitative evaluation in the form of an expert review. The reviewer had access to a sample of 30 news articles with the full text and the generated snippets. The task was to label each generated snippet independently with one of three labels:

  • Good: factually accurate, with decent style and grammar
  • Bad: poor style or grammar
  • Fake: does not align with the facts given in the news article

Diagrams: Expert review for snip-igel in comparison with the expert review of gptj-title-teaser

The 30-article sample is the same as the one used for the gptj-title-teaser-10k evaluation, so we can compare directly: the continued multitask instruction fine-tuned IGEL model (snip-igel-500) performs 20% better than the multitask fine-tuned GPT-J model (gptj-title-teaser-10k) and seems less prone to hallucination, while requiring only a third of the training data.

Note: With snip-igel-500 we did not fine-tune for summarization. SERP and summary are the snippet types with the lowest scores in the expert evaluation. To improve on that, we iterated on the dataset and the fine-tuning once more, with improved SERP data and additional summary data. The result of this second fine-tuning is snip-igel-500-v2. With this we can finally provide an open source LLM capable of title, teaser, summary, keyword, SERP and tweet generation for news text in German.

TL;DR: Augmenting news data with GPT-generated news snippets allows for continued multitask instruction fine-tuning of IGEL LoRA adapters. This approach supports news snippet generation in German and yields decent generation capabilities across six different snippet types with relatively little data.

“SNIP-IGEL: Snippet Generation for News in German Language through Continued Multitask Prompted Fine Tuning” is part twelve of our series News Snippet Generation, a learning journey on open source large language models and how to assist journalists in generating headlines and teasers in German with AI.

