Abstract
The proliferation of online misinformation undermines societal cohesion and democratic principles. Effectively combating this issue relies on developing automatic classifiers, which require training data to achieve high classification accuracy. However, while English-language resources are abundant, other languages are often neglected, creating a critical gap in our ability to address misinformation globally. Furthermore, this lack of data in languages other than English hinders progress in social sciences such as psychology and linguistics. In response, we present GERMA, a corpus comprising over 230,000 German news articles (more than 130 million tokens) gathered from 30 websites classified as “untrustworthy” by professional fact-checkers. GERMA serves as an openly accessible repository, providing a wealth of text- and website-level data for testing hypotheses and developing automated detection algorithms. Beyond articles, GERMA includes supplementary data such as titles, publication dates, and semantic measures like keywords, topics, and lexical features. Moreover, GERMA offers domain-specific metadata, such as website quality evaluation based on factors like bias, factuality, credibility, and transparency. Higher-level metadata incorporates various metrics related to website traffic, offering a valuable tool for the analysis of online user behavior. GERMA represents a comprehensive resource for research in untrustworthy news detection, supporting qualitative and quantitative investigations in the German language.
Zusammenfassung
Die Verbreitung von Online-Desinformation untergräbt den gesellschaftlichen Zusammenhalt und die Demokratie. Automatische Klassifizierer können dieser Desinformation effektiv begegnen, sind jedoch auf Trainingsdaten angewiesen. Während englischsprachige Ressourcen reichlich vorhanden sind, werden andere Sprachen oft vernachlässigt, was eine Lücke in der globalen Desinformationsbekämpfung schafft. Auch behindert dieser Datenmangel in anderen Sprachen als Englisch den Fortschritt in den Sozialwissenschaften wie Psychologie und Linguistik. Um diese Lücke zu schließen, präsentieren wir GERMA, ein Korpus mit über 230.000 deutschen Nachrichtenartikeln (über 130 Millionen Token) von als “unglaubwürdig” eingestuften Websites (N = 30). GERMA dient als frei zugängliches Repositorium für Text- und Website-Daten, die für Hypothesentests und automatische Erkennung verwendet werden können. Zudem umfasst GERMA Metadaten wie Titel, Veröffentlichungsdaten, Schlüsselwörter, Themen und Website-Qualitätsfaktoren, die wertvolle Einblicke in das Online-Nutzerverhalten liefern. GERMA stellt somit eine umfassende Ressource für die Forschung zur Erkennung unzuverlässiger Nachrichten in deutscher Sprache dar.
1 Introduction
The World Economic Forum’s latest Global Risks Report identifies misinformation as the most significant short-term global risk, capable of disrupting elections and exacerbating societal polarization (https://www.weforum.org/publications/global-risks-report-2024/). The reasons are multiple: on the one hand, the structure of social media rewards users for sharing information that attracts others’ attention, which often translates to misinformation sharing (Ceylan et al. 2023); on the other, generative AI minimizes the effort needed to craft fake information, allowing almost anyone to create an array of falsified content, ranging from articles to deepfakes (Ferrara 2024). The risk is that citizens will be exposed to a variety of conflicting information, causing societal paralysis amid serious issues – COVID-related misinformation cost Canada at least 2,800 lives and 300 million dollars (Himelfarb et al. 2023) – and the erosion of democratic processes based on deliberation (Farrell and Schneier 2018; Lasser et al. 2023).
There are two main ways to detect misinformation online. The first consists of deploying experts to read and judge news articles as trustworthy or fake news. This is often the ideal process but requires time for thorough fact-checking (Hassan et al. 2015). The second way involves using large language models (LLMs) that are specifically trained to scan an article and classify it as fake or genuine based mostly on textual characteristics. These models are faster and cheaper than human classification but usually require a vast amount of data on which they can be trained for precise classification. The issue is that the majority of datasets of fake and genuine news available online are in English (D’Ulizia et al. 2021), despite misinformation being a problem everywhere.
In Germany, for example, fake news has been connected with derogatory content towards immigrants (Humprecht 2019), and a propensity to believe misinformation has been correlated with a tendency to vote for far right parties (Leuker et al. 2022). The availability of an open-access dataset of fake news content in German would substantially encourage research into the role of misinformation in the German context.
This is why we created GERMA (the GErman coRpus of MisinformAtion), a comprehensive corpus of news articles generated by sources classified as untrustworthy by professional fact-checkers. Despite not being the first corpus of its kind, as other scholars have published German fake news corpora (Mattern et al. 2021; Rieger et al. 2023; Vogel and Jiang 2019), there are two aspects that differentiate GERMA from previous efforts. First, we made the choice to consider news reliability based on the source rather than articles per se. While this means that we cannot specifically treat the material in GERMA as fake news but rather as news from unreliable sources, it allows for a greater variety of content and a generous sample size. This is crucial because fact-checking every single article would be rather expensive and time-consuming. GERMA includes over 230,000 news articles, making it the largest dataset currently available for German-language misinformation research. Second, GERMA comes not only with texts but also with a variety of corollary measures that are useful for scholars. For example, it includes keywords, topics, dates, and lexical features, as well as incoming traffic metadata of the source domains. By focusing on source reliability and providing a rich dataset with associated measures, GERMA serves as a valuable resource for researchers aiming to enhance their comprehension of the complex dynamics surrounding misinformation.
2 Methods
In the following, we describe the methodology we employed to build GERMA and to extract the rich set of (meta)data from it. Like other similarly built corpora of misinformation (e.g., Carrella et al. 2023; Miani et al. 2021), GERMA was compiled from sources, instead of articles. This means that we first identified sources that deliver misinformation and then extracted individual articles from these sources. Such a method has been shown to be reliable in extracting conspiracy versus mainstream documents (Hemm et al. 2024; Mompelat et al. 2022). Once the list of sources was created, we downloaded all articles allowed for scraping from each domain, and then cleaned the corpus. Next, we extracted a set of complementary measures capable of defining the semantic content for each document: keywords, topics, and lexical features. We also provide, for each website, a set of metadata about the quality of the information shared, and the behavior of the users reaching such websites.
2.1 Building GERMA
2.1.1 Selecting sources
We started by building a list of sources, consisting of domains that deliver misinformation, by leveraging the work of Lin et al. (2023). They statistically combined six sets of expert ratings on news reliability using a technique called principal component analysis. This analysis identified a single underlying dimension (the first principal component) that captured the most variance in these expert ratings. Focusing on German-language sources, we identified those scoring lower than 0.60 on this combined score (N = 131) and selected them for further analysis. Notably, this threshold was chosen because it aligns with the score that NewsGuard (2020), a professional fact-checking database that provides trustworthiness indexes for thousands of news domains, considers “untrustworthy” for a domain. This choice is further supported by the strong correlation between Lin’s principal component score and the NewsGuard ratings (r = 0.76). Additionally, we observed that some of these domains identified by Lin et al. (2023) as untrustworthy were instead rated as “trustworthy” or “satire” by NewsGuard. This increased the risk of including potentially reliable sources as well as satirical news in our corpus. Therefore, to ensure the integrity of a corpus constructed only from unreliable sources, we compared these domains with the scores provided by NewsGuard and excluded those rated as trustworthy or satire (where possible, as not all domains identified by Lin et al. had been evaluated by NewsGuard). Consequently, we were left with a total of 61 domains considered untrustworthy, which we used for scraping (i.e., downloading articles in human-readable text format).
2.1.2 Data collection
The articles were collected from January 2023 to April 2023 using two distinct Python packages: trafilatura (Barbaresi 2021) and news-please (Hamborg et al. 2017). Both packages facilitate web scraping of texts from sitemaps rather than individual domains. A sitemap is a file that enumerates the visible or whitelisted URLs for a particular site, indicating where machines can search for content. In practice, this means that by providing the packages with the primary news domain (e.g., www.bild.de), they will automatically search for all the content they are permitted to scrape beneath that main domain. It is worth noting that not all of the 61 domains we intended to scrape actually permitted content scraping, and that websites can actively choose to hide content from their sitemaps. In the end, this procedure yielded 382,412 articles from 30 different domains.
2.1.3 Corpus cleaning
We cleaned the corpus in two phases. In the first phase, we applied a cleaning pipeline commonly used in other misinformation-related corpora (Carrella et al. 2023; Miani et al. 2021). This process involved: (i) removing duplicate texts; (ii) selecting texts within a word count range of 100 to 10,000 words (counted via white-space tokenization[1]); and (iii) removing non-German documents. We achieved the latter by selecting texts where the percentage of German stop words (N = 231; obtained from Benoit et al. 2018) exceeded 13 % of the entire text (a threshold chosen after visual inspection). In the second, unplanned phase, we removed a total of 21,607 documents associated with domains protected by a paywall. These documents exhibited high similarity due to repeated boilerplate terms not captured by our initial cleaning process.[2] Extracting keywords via TF-IDF (term frequency–inverse document frequency; see 2.2.1 for further details) revealed a high frequency of boilerplate terms, prompting us to remove these documents. Specifically, we removed documents (N = 19,301) from ef-magazin.de where the highest TF-IDF term was abonnent ‘subscriber’, documents (N = 1,465) from qs24.tv where the highest TF-IDF term was qs24, and documents (N = 841) where the highest TF-IDF term was cookies.
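The stop-word-based language filter in step (iii) can be sketched in a few lines of Python. This is a schematic re-implementation, not our actual pipeline: the stop-word set below is a tiny illustrative subset, not the full 231-word list from Benoit et al. (2018).

```python
# Hypothetical re-implementation of the language filter: keep a text only if
# German stop words make up more than 13% of its white-space tokens.
GERMAN_STOPWORDS = {"der", "die", "das", "und", "ist", "nicht", "ein", "zu", "in", "mit"}

def stopword_ratio(text: str, stopwords=GERMAN_STOPWORDS) -> float:
    """Share of white-space tokens that are stop words."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(t in stopwords for t in tokens) / len(tokens)

def is_probably_german(text: str, threshold: float = 0.13) -> bool:
    """Apply the 13% threshold chosen after visual inspection."""
    return stopword_ratio(text) > threshold

print(is_probably_german("Das ist ein Satz und er ist nicht lang"))
print(is_probably_german("This is an English sentence about news"))
```

With a realistic stop-word list, texts in other languages rarely exceed the threshold, which makes this a cheap first-pass language filter.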
2.2 Extracting GERMA’s features
2.2.1 Keywords
Following an established procedure (Carrella et al. 2023), we extracted a set of keywords from each document by computing the TF-IDF, a technique that assesses the relevance of a term to a document within a corpus. For each term in a document, the TF-IDF score is computed by multiplying the number of times the term appears in the document (term frequency) by the inverse document frequency of the term in the corpus. This results in high scores for terms that are frequent within a document, balanced by a penalty for terms that are common across the entire corpus (e.g., stop words). In other words, it allows us to extract terms that are characteristic of a specific document. TF-IDF was computed using the function dfm_tfidf from the R package quanteda (Benoit et al. 2018). Keywords were defined as words with the highest TF-IDF score per document.
Keyword extraction followed this pipeline: (i) we removed two frequently occurring terms (prozent and sciencefiles) because they were part of boilerplate texts that were not extracted automatically via the Python packages; (ii) we removed stop words (N = 603, from the R package unine; Savoy 2009);[3] (iii) we stemmed terms via the function tokens_wordstem from the package quanteda, setting the language to German; (iv) we selected the top 10,000 most frequent terms; (v) we built the TF-IDF matrix (via the quanteda function dfm_tfidf) and selected, for each document, the 10 terms with the highest TF-IDF scores. In Table 1, we list the 30 most frequent keywords extracted using TF-IDF, along with the number of articles in which each keyword appeared among the top 10 TF-IDF keywords.
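For illustration, the TF-IDF scoring behind this pipeline can be sketched in plain Python (our actual computation used quanteda’s dfm_tfidf in R). The toy corpus and the choice of raw counts for term frequency with a base-10 log inverse document frequency are illustrative assumptions, not GERMA’s exact settings.

```python
import math
from collections import Counter

def tfidf_keywords(docs: list[list[str]], top_n: int = 3) -> list[list[str]]:
    """Return the top_n highest-TF-IDF terms per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    keywords = []
    for doc in docs:
        tf = Counter(doc)
        # TF-IDF = term frequency * log10(N / document frequency)
        scores = {t: tf[t] * math.log10(n_docs / df[t]) for t in tf}
        ranked = sorted(scores, key=scores.get, reverse=True)
        keywords.append(ranked[:top_n])
    return keywords

# Made-up, already-stemmed mini-corpus; not real GERMA documents.
docs = [
    ["impfung", "impfung", "regierung", "medien"],
    ["ukraine", "krieg", "regierung", "medien"],
    ["ukraine", "sanktionen", "energie", "medien"],
]
print(tfidf_keywords(docs, top_n=2))
```

Note how medien, which occurs in every document, receives a score of zero and is never selected as a keyword: this is the corpus-level penalty described above.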
Table 1: Top 30 keywords extracted via TF-IDF and their occurrence in GERMA. The “Frequency” column represents the number of articles in which the keyword appears among the top 10 TF-IDF keywords. For example, the keyword ukrain is among the top 10 TF-IDF keywords in 2,568 articles.
Rank | Keyword | Frequency |
---|---|---|
1 | Ukrain | 2,568 |
2 | Iran | 1,343 |
3 | Israel | 1,248 |
4 | Islam | 1,211 |
5 | Eu | 1,074 |
6 | Trump | 1,058 |
7 | Afd | 1,044 |
8 | Compact | 842 |
9 | Merkel | 823 |
10 | Bid | 803 |
11 | Lauterbach | 714 |
12 | Impfstoff | 691 |
13 | Schul | 661 |
14 | Scholz | 638 |
15 | Kind | 627 |
16 | Impfpflicht | 622 |
17 | Speicher | 604 |
18 | Taliban | 604 |
19 | Nato | 581 |
20 | Grun | 567 |
21 | Baerbock | 534 |
22 | Rki | 534 |
23 | Macron | 526 |
24 | Russland | 513 |
25 | Russisch | 507 |
26 | Habeck | 505 |
27 | Impfung | 491 |
28 | Putin | 475 |
29 | China | 473 |
30 | Johnson | 459 |
2.2.2 Topics
Topics were extracted using latent Dirichlet allocation (LDA; Blei et al. 2003). LDA is an unsupervised probabilistic machine learning model capable of identifying co-occurring word patterns and extracting the underlying topic distribution for each document. Unlike keywords, topics offer a fine-grained indexing of semantic content. To extract topics, we preprocessed the corpus by partially following the pipeline we used to determine keywords. Specifically, we tokenized the corpus via the quanteda function tokens (removing punctuation, symbols, numbers, URLs, and separators). We removed stop words, but we reintroduced two specific terms – prozent and sciencefiles – that we had previously excluded when building the TF-IDF keywords. It is important to note that this reintroduction was not crucial for generating LDA topics. We stemmed and selected the 10,000 most frequent terms.
Extracting LDA topics requires researchers to set the number of topics desired, k: if a fine-grained resolution is required, then a large number of topics is better; if the number of topics is small, these topics become more general (Colin and Murdock 2020). Using the topicmodels R package (Grün and Hornik 2011), we extracted topics using three different topic resolutions, setting k at 20, 100, and 200 topics, hence obtaining a total of 320 different topics. We labeled each topic using their top 10 words which, taken together, summarize the topic’s content (Nguyen et al. 2020).
Topics are useful for both analytical and exploratory purposes. To facilitate users’ exploration of topic trends over time, we provide a PDF file (see Section 3.5) where we averaged the topic’s gamma values (the probability a topic is part of a document) for each month, obtaining a time series of gamma values. This method has been shown to reliably associate LDA topics with real-world events such as disease outbreaks, deaths of significant figures, and wars (Lansdall-Welfare et al. 2017; Mayor and Miani 2023; Miani et al. 2021).
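The monthly aggregation behind these time series can be sketched as follows: average each document’s gamma value per calendar month. This is a schematic Python version of the procedure; the dates and gamma values below are made-up illustrations, not values from GERMA.

```python
from collections import defaultdict
from datetime import date

def monthly_mean_gamma(records: list[tuple[date, float]]) -> dict[str, float]:
    """Average gamma values per calendar month (keys like '2016-11')."""
    buckets = defaultdict(list)
    for d, gamma in records:
        buckets[d.strftime("%Y-%m")].append(gamma)
    return {month: sum(v) / len(v) for month, v in sorted(buckets.items())}

# Toy (document date, gamma) pairs for a single hypothetical topic.
records = [
    (date(2016, 11, 3), 0.40),
    (date(2016, 11, 21), 0.60),
    (date(2016, 12, 5), 0.10),
]
print(monthly_mean_gamma(records))  # {'2016-11': 0.5, '2016-12': 0.1}
```

Plotting the resulting monthly means over the corpus timespan yields time series of the kind shown in the PDF, where peaks can be matched against real-world events.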
2.2.3 Lexical features
Lexical features (N = 97, including word count) were extracted with the Linguistic Inquiry and Word Count (LIWC, version 2022; Boyd et al. 2022), relying on the most recent German translation (Meier et al. 2018). LIWC is a widely used stand-alone application that extracts psychologically meaningful features from texts (Tausczik and Pennebaker 2010), including in German (Pachucki et al. 2022). LIWC analyzes texts and checks whether words are included in predefined categories (e.g., negative and positive emotions, social ties, and so on). If a word matches a category, the value associated with that category increases. Unlike topic modeling (in which each document’s topic probabilities sum to one), LIWC categories are expressed as percentages of words in a document belonging to a category. This means there can be overlap if a word appears in multiple categories. For example, the category “anxiety” (composed of words such as anxious, avoid, and insecure) is also a subgroup of the category of “negative emotions”.
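LIWC itself is proprietary, but its counting logic can be illustrated schematically: a category’s score is the percentage of a document’s words that match the category’s word list, so overlapping word lists yield overlapping scores. The two toy categories below are invented stand-ins, not LIWC’s actual dictionaries.

```python
# Invented mini-dictionaries for illustration only; "anxiety" is a subset of
# "negemo", mirroring how LIWC's anxiety category nests inside negative emotions.
CATEGORIES = {
    "negemo": {"angry", "afraid", "insecure", "hate"},
    "anxiety": {"afraid", "insecure"},
}

def category_scores(text: str) -> dict[str, float]:
    """Score each category as the percentage of matching words in the text."""
    words = text.lower().split()
    return {
        cat: 100 * sum(w in wordlist for w in words) / len(words)
        for cat, wordlist in CATEGORIES.items()
    }

print(category_scores("they are afraid and insecure and angry"))
```

Because every anxiety word is also a negative-emotion word, the anxiety score can never exceed the negemo score in this sketch, which is the overlap behavior described above.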
2.2.4 Domain metadata
For all domains, we provide a measure of information quality, defined as an aggregated measure of bias, factuality, credibility, and transparency, where higher scores correspond to higher quality domains (Lin et al. 2023). For each website, we also extracted a set of metrics related to the domain’s incoming traffic in October 2023 on SimilarWeb (https://www.similarweb.com/corp/ourdata/). These metrics include monthly visits, visit duration, bounce rate (the percentage of visitors who leave after visiting only one page), and pages visited. Incoming traffic is further partitioned into direct traffic (reaching the website by typing the URL on the web browser or recalling it from bookmarks), traffic from search engines (e.g., using Google), referrals (when a website is reached through another website), and traffic from social media (e.g., a post on Facebook or Twitter). These metrics provide insight into the behavior of users accessing misinformation domains.
3 GERMA’s features
GERMA is a large corpus (132,330,548 tokens) containing 237,848 documents, spanning more than 30 years (see Figure 1), with an average word count of 556.37 (SD = 649.94; range: 101–9,953). It is built from 30 domains classified as delivering misinformation by Lin et al. (2023) and NewsGuard (2020). In Table 2, we list the domains used to build GERMA, including their document counts and domain quality as outlined by Lin et al. (2023). Besides the texts themselves, GERMA comes with a rich set of data and metadata. GERMA is freely available at the OSF repository at https://osf.io/3bthj. In the following, we describe the datasets stored in the repository to illustrate the richness of GERMA.

Figure 1: Distribution of documents by date (for those with dates; N = 130,381) aggregated by year. From 2007, each year contains at least 1,000 documents with dates.
Table 2: List of the domains included in GERMA, along with their corresponding quality ratings and the total number of articles for each domain. Domain quality is assessed based on factors such as bias, factuality, credibility, and transparency, as outlined by Lin et al. (2023).
Rank | Domain | No. of articles | Domain quality |
---|---|---|---|
1 | berlinertageszeitung.de | 85,459 | 0.56 |
2 | pi-news.net | 35,225 | 0.46 |
3 | de.rt.com | 19,527 | 0.33 |
4 | freiewelt.net | 15,336 | 0.48 |
5 | dieunbestechlichen.com | 13,776 | 0.35 |
6 | journalistenwatch.com | 11,433 | 0.59 |
7 | compact-online.de | 9,264 | 0.36 |
8 | neopresse.com | 8,039 | 0.46 |
9 | sciencefiles.org | 7,292 | 0.30 |
10 | rubikon.news | 6,821 | 0.60 |
11 | connectiv.events | 5,806 | 0.45 |
12 | apolut.net | 5,341 | 0.37 |
13 | liebeisstleben.net | 2,802 | 0.24 |
14 | deutschlandkurier.de | 2,006 | 0.25 |
15 | derwaechter.net | 1,871 | 0.24 |
16 | contra24.online | 1,780 | 0.36 |
17 | heftig.de | 1,411 | 0.40 |
18 | qs24.tv | 1,272 | 0.57 |
19 | indexexpurgatorius.wordpress.com | 903 | 0.28 |
20 | altermedzentrum.com | 681 | 0.22 |
21 | unsere-natur.net | 557 | 0.35 |
22 | naturstoff-medizin.de | 540 | 0.49 |
23 | impfkritik.de | 143 | 0.38 |
24 | swprs.org | 137 | 0.18 |
25 | n23.tv | 134 | 0.32 |
26 | 2020news.de | 111 | 0.37 |
27 | ef-magazin.de | 107 | 0.38 |
28 | corona-solution.com | 45 | 0.34 |
29 | top20radio.tv | 28 | 0.35 |
30 | marialourdesblog.com | 1 | 0.33 |
| | Sum = 237,848 | Mean = 0.38 |
3.1 The corpus
The GERMA corpus is stored in the CSV file GERMA.csv (size: 1.07 GB). This data frame contains 237,848 rows, with each row representing a unique document. The data frame has eight columns, which are as follows:
doc_id: unique document identifier
url: the URL of the document (missing N = 5,970)
date: the date of writing or uploading the document (missing N = 107,467)
title: the title of the document
text: the text of the document
word_count: the number of words in the document
keywords: the top 10 TF-IDF words associated with the document
website: the domain from which the document is extracted (the unique identifier linking to the website dataset described in Section 3.2)
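Because the website column is shared across files, the document table can be joined with the website-level metadata in a few lines. The sketch below uses Python’s csv module on made-up miniature versions of GERMA.csv and GERMA_websites.csv; the inline rows are illustrative, not real corpus records.

```python
import csv
import io

# Hypothetical miniatures of the two released files (headers follow the
# column descriptions in the paper; values are invented).
germa_csv = io.StringIO(
    "doc_id,website,word_count\n"
    "d001,pi-news.net,534\n"
    "d002,de.rt.com,812\n"
)
websites_csv = io.StringIO(
    "website,credibility\n"
    "pi-news.net,0.46\n"
    "de.rt.com,0.33\n"
)

# Index the website table by its unique identifier, then attach the
# website-level credibility score to each document row.
sites = {row["website"]: row for row in csv.DictReader(websites_csv)}
docs = [
    {**row, "credibility": sites[row["website"]]["credibility"]}
    for row in csv.DictReader(germa_csv)
]
print(docs[0]["credibility"])  # '0.46'
```

The same join works unchanged on the full files (e.g., via pandas for the 1.07 GB corpus file), since every document’s website value appears exactly once in GERMA_websites.csv.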
3.2 GERMA’s domains
The website information for the GERMA corpus is stored in the CSV file GERMA_websites.csv (size: 3 KB). This data frame contains 30 rows, with each row representing a unique website. The data frame has 10 columns, each describing a different feature of these domains. The columns are as follows:
website: the source of documents in GERMA, unique identifier of domains
credibility: website quality obtained from Lin et al. (2023), theoretically ranging from 0 to 1 (low to high quality)
monthly_visits: the number of visits (incoming traffic) the website receives within a month
visit_duration: average time spent (in seconds) on the website for each visit
bounce_rate: percentage of traffic generated by users who enter the site, take no action, and leave after visiting only one page
pages_per_visit: average number of pages visited in each visit
traffic_{direct|referrals|search|social}: percentage of traffic reaching the domain directly (users typing the URL into their browser, or recalling it from a saved bookmark or any link from outside the browser), from referrals (users clicking on a URL on another web page), from search (users using a search engine query, e.g., in Google), and from social media (e.g., users arriving from a link in Facebook, Reddit, Twitter, or YouTube)
3.3 Lexical features
The lexical features of the GERMA corpus are stored in the CSV file GERMA_LIWC.csv (size: 102.8 MB). This data frame contains 237,848 rows, with each row corresponding to a unique document in the corpus. The data frame has 97 columns, each representing a distinct lexical feature extracted using the LIWC tool. These features are expressed as percentages, with the exception of certain derived measures such as Analytic, Clout, and others, as detailed in Boyd et al. (2022). To facilitate easy identification and linkage with other corpus data, each document’s unique identifier is stored as the row name in this data frame.
3.4 LDA topics: gamma values
The LDA topic modeling results for the GERMA corpus are stored in the CSV file GERMA_LDA_gamma.csv (size: 1.48 GB). This data frame encompasses 237,848 rows, each corresponding to a unique document in the corpus. It contains 320 columns, representing topics derived from three different LDA models with varying resolutions: 20, 100, and 200 topics. Each column contains the gamma value, which represents the topic proportion for a given document-topic pair. To facilitate easy identification and linkage with other corpus data, each document’s unique identifier is stored as the row name in this data frame. The topics are identified using a naming convention that combines the k value (indicating the number of topics in the model: 20, 100, or 200) with an arbitrary topic number ranging from 1 to k. For instance, the topic labeled k200_116 represents the 116th topic within the set with resolution k = 200 (the top 10 features for each topic are listed in the file GERMA_LDA_features.csv). This comprehensive structure allows for multi-resolution topic analysis across the entire corpus, enabling researchers to examine thematic patterns at different levels of granularity and to compare topic distributions across documents.
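The naming convention for topic columns can be unpacked with a small helper; this parser is our own sketch for convenience, not part of the released files.

```python
import re

def parse_topic_id(name: str) -> tuple[int, int]:
    """Return (k, topic_number) for a column name like 'k200_116'."""
    m = re.fullmatch(r"k(\d+)_(\d+)", name)
    if m is None:
        raise ValueError(f"not a topic column: {name!r}")
    return int(m.group(1)), int(m.group(2))

print(parse_topic_id("k200_116"))  # (200, 116)
print(parse_topic_id("k20_004"))   # (20, 4)
```

Such a helper makes it straightforward to, for example, select only the columns of one resolution (all names with k = 100) before comparing topic distributions across documents.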
3.5 LDA topics: over time
The temporal evolution of topic distributions in the GERMA corpus is visualized in the PDF file GERMA_LDA_overtime.pdf (size: 920 KB). This comprehensive document spans 320 pages, each dedicated to a single topic from the three LDA models with resolutions of 20, 100, and 200 topics. Each page presents a plot depicting the topic’s gamma values over time, covering the period from 2007 to 2023. The starting year of 2007 was selected because, from that year onward, there are at least 1,000 documents with dates per year. These visualizations allow researchers to observe how the prominence of each topic has fluctuated throughout the corpus’s timespan, providing insights into temporal trends and shifts in thematic focus. To aid in topic interpretation, the top 10 terms associated with each topic are included and can be easily located within the PDF (for an example, see Figure 3). This resource offers a valuable tool for analyzing the dynamic nature of discourse within the GERMA corpus, enabling researchers to identify emerging themes, track the evolution of discussions, and contextualize content shifts within the broader temporal framework of the dataset.
4 How to use GERMA
GERMA represents a rich source of data for better understanding the content of misinformation. Being for the most part composed of texts with additional metadata and lexical features, GERMA is conceptualized as a turnkey resource. Researchers can test hypotheses, replicate results, further extract features, or build classification and predictive models from GERMA.
By using texts, researchers can replicate previous analyses by extracting from GERMA co-occurrence of terms, measures of document similarity and textual cohesion, and relationships between nominal compound constituents and syntactical elements (Fleckenstein 2024; Miani et al. 2022, 2024; Samory and Mitra 2018). By leveraging the lexical features provided through LIWC, researchers can explore the psychological dimensions of German misinformation. This allows them to replicate findings from English-language studies, which show that the language of misinformation – especially conspiracy theories – is often rooted in social identification and exclusion, filled with negative emotions, and focused on themes of power, dominance, and aggression, frequently challenging the official narrative (Fong et al. 2021; Klein et al. 2019; Miani et al. 2021).
As illustrated in Figure 2, low-quality documents tend to use, on average, more words (indicated by the LIWC feature WC) and longer sentences (as reflected by WPS). These documents also frequently focus on themes of tribalism and social inclusion or exclusion (as indicated by features such as they, we, pronoun, and social), employ a rhetoric of questioning (evidenced by interrog and QMark), and use causal language (reflected by cause). Different sets of psycholinguistic measures can be further extracted from the texts such as word norms for concreteness or age of acquisition, with their German translations (Birchenough et al. 2016; Brysbaert et al. 2013; Charbonnier and Wartena 2020; Kuperman et al. 2012).

Figure 2: The 20 highest and 20 lowest β coefficients (on the y-axis; all p < 0.001, Bonferroni corrected) from regressions predicting LIWC lexical features (on the x-axis) by websites’ credibility scores. The credibility scores have been inverted to improve interpretability: positive values indicate features positively associated with lower credibility. Note that, due to the large sample, the error bars are smaller than the dots representing the coefficients, hence not visible.
Keywords, topics, and lexical features can serve as variables of interest in diachronic analyses. For instance, researchers can examine fluctuations of a specific topic of interest over time. As shown in Figure 3, we averaged the gamma values, which indicate the “presence” of a topic in the articles, for the topic “us_usa_trump_prasident_china” across each month from 2007 to 2024. The figure reveals peaks corresponding to the two most recent presidential elections at the time of writing, specifically in late 2016 to early 2017, and late 2020 to early 2021.

Figure 3: LDA topic gamma values (y-axis) over time (x-axis). Topic k20_004 relates to the United States. The 10 most relevant words for the topic are displayed in decreasing order above the plot (English translation: ‘US, USA, Trump, President, China, Govern, Biden, Chinese, Iran’). The plot reveals peaks corresponding to the US presidential elections in late 2016 and late 2020, as well as a peak in June 2009, which may be associated with former US president Barack Obama’s visit to Germany earlier that month.
Moreover, these same elements – keywords, topics, and lexical features – can be utilized not only as dependent variables but also to further subset GERMA prior to analyses. This would be parallel, for example, to a study focused on specific subsets of LOCO (the Language of Conspiracy corpus; Miani et al. 2021), which revealed distinct conceptualizations of health-related content between conspiracy documents, framed in terms of beliefs, and mainstream documents, focused on scientific evidence (Reiter-Haas 2023), paralleling analyses of political communication on Twitter (Lasser et al. 2023). Another such investigation showed that the conspiratorial language characteristic of documents explicitly mentioning conspiracies (identified by searching for the pattern “conspir*” in the text), marked by lexical features associated with crime, terrorism, deception, and stealing, also increases in documents that do not explicitly mention conspiracies, and that documents relying on prototypical conspiratorial language are shared more on Facebook (Miani et al. 2021). Using keywords and topics, other works have shown that documents delivering misinformation exhibit greater interconnectedness (Carrella et al. 2023; Miani et al. 2022), providing empirical support for an overarching worldview (Lewandowsky et al. 2018).
Using the URLs associated with each document, researchers can extract HTML data in order to analyze web markup features as previous work has done on fake news (Castelo et al. 2019). Websites’ information could be used to analyze – or compare with other datasets – the behavior of users reaching misinformation websites. Previous work analyzing those patterns suggests that such behavior has elements of confirmation bias and in particular selective exposure to misinformation (Cardenal et al. 2019; Carrella et al. 2023; Miani et al. 2021). Lastly, efforts could also be made to develop annotation schemes and tools capable of automatically identifying misinformation language (Diab et al. 2024; Fort et al. 2023; Mompelat et al. 2022).
5 Conclusions
In this paper, we introduce GERMA, a freely accessible corpus of over 230,000 articles collected from German news websites classified as unreliable by fact-checkers. The corpus joins a group of similar existing corpora (Mattern et al. 2021; Rieger et al. 2023; Vogel and Jiang 2019), but it stands out both for its larger number of articles and for the additional features available, which allow GERMA to be used for linguistic, sociological, and psychological studies.
Features such as topics and keywords associated with individual articles allow researchers to extract specific linguistic characteristics of each topic and make comparisons. Most of the articles also have a publication date, which allows for diachronic studies to examine the linguistic evolution of each topic. Lexical features (specifically, LIWC scores) allow for a variety of sociological and psychological studies. Finally, GERMA also contains domain-specific features such as the type(s) of news typically shared by a specific source, as well as data on the incoming traffic for a domain, which can be used to study digital community behavior.
One limitation of our dataset is that the news it contains is classified as “untrustworthy” based on the reliability of the sources that produced it. This means that the articles cannot be automatically labeled as “fake news” without further verification; researchers need to verify individual articles when working with these sources. While an article-based approach would provide more precision, we believe that a domain-based approach – where domains are classified with an aggregate score from various misinformation datasets and fact-checkers – strikes a suitable balance between data quality and quantity.
Another limitation concerns the aggregate score itself. Although this score is based on multiple expert ratings, each expert uses their own criteria for judgment. Additionally, raters such as NewsGuard often derive reliability scores from a combination of factors, including credibility, transparency, and error correction. Therefore, we advise users not to compare the reliability scores of domains in our dataset directly, as a lower or higher score does not necessarily track the likelihood of sharing misinformation.
Despite these limitations, we consider GERMA an important resource for misinformation studies, not only for the quantity of data it offers but also for the additional measures it includes, which allow for different types of studies. GERMA also adds a resource in a language other than English, a language that is over-represented among the datasets available for training classifiers. We believe GERMA’s unique features make it a valuable tool for researchers working to combat the growing problem of misinformation.
Funding source: H2020 European Research Council
Award Identifier / Grant number: 101020961
Funding source: Volkswagen Foundation
Funding source: Swiss National Science Foundation
Award Identifier / Grant number: 214293
Acknowledgement
For the purpose of open access, the authors have applied a Creative Commons Attribution (CC BY) license to any Author Accepted Manuscript version arising from this submission.
Research funding: This work is part of a project that has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program under grant agreement No. 101020961 (PRODEMINFO). FC is supported by the Volkswagen Foundation (“Reclaiming individual autonomy and democratic discourse online”). AM is supported by the Swiss National Science Foundation (SNSF, project number 214293, “In/coherent worldviews”).
References
Barbaresi, Adrien. 2021. Trafilatura: A web scraping library and command-line tool for text discovery and extraction. In Heng Ji, Jong C. Park & Rui Xia (eds.), Proceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing: System demonstrations, 122–131. Bangkok, Thailand: Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.acl-demo.15.
Benoit, Kenneth, Kohei Watanabe, Haiyan Wang, Paul Nulty, Adam Obeng, Stefan Müller & Akitaka Matsuo. 2018. Quanteda: An R package for the quantitative analysis of textual data. Journal of Open Source Software 3(30). 774. https://doi.org/10.21105/joss.00774.
Birchenough, Julia M. H., Robert Davies & Vincent Connelly. 2016. Rated age-of-acquisition norms for over 3,200 German words. Behavior Research Methods 49(2). 484–501. https://doi.org/10.3758/s13428-016-0718-0.
Blei, David M., Andrew Y. Ng & Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research 3. 993–1022.
Boyd, Ryan L., Ashwini Ashokkumar, Sarah Seraj & James W. Pennebaker. 2022. The development and psychometric properties of LIWC-22. Austin, TX: University of Texas at Austin.
Brysbaert, Marc, Amy Beth Warriner & Victor Kuperman. 2013. Concreteness ratings for 40 thousand generally known English word lemmas. Behavior Research Methods 46(3). 904–911. https://doi.org/10.3758/s13428-013-0403-5.
Cardenal, Ana S., Carlos Aguilar-Paredes, Carol Galais & Mario Pérez-Montoro. 2019. Digital technologies and selective exposure: How choice and filter bubbles shape news media exposure. International Journal of Press/Politics 24(4). 465–486. https://doi.org/10.1177/1940161219862988.
Carrella, Fabio, Alessandro Miani & Stephan Lewandowsky. 2023. Irma: The 335-million-word Italian coRpus for studying MisinformAtion. In Proceedings of the 17th conference of the European chapter of the association for computational linguistics, 2339–2349. Dubrovnik, Croatia: Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.eacl-main.171.
Castelo, Sonia, Thais Almeida, Anas Elghafari, Aécio Santos, Kien Pham, Eduardo Nakamura & Juliana Freire. 2019. A topic-agnostic approach for identifying fake news pages. In Companion proceedings of the 2019 world wide web conference (WWW ’19). San Francisco, USA: ACM. https://doi.org/10.1145/3308560.3316739.
Ceylan, Gizem, Ian A. Anderson & Wendy Wood. 2023. Sharing of misinformation is habitual, not just lazy or biased. Proceedings of the National Academy of Sciences 120(4). https://doi.org/10.1073/pnas.2216614120.
Charbonnier, Jean & Christian Wartena. 2020. Predicting the concreteness of German words. In Sarah Ebling, Don Tuggener, Manuela Hürlimann, Mark Cieliebak & Martin Volk (eds.), Proceedings of the 5th swiss text analytics conference (SwissText) & 16th conference on natural language processing (KONVENS). https://ceur-ws.org/Vol-2624/ (accessed 26 December 2024).
Allen, Colin & Jamie Murdock. 2020. LDA topic modeling: Contexts for the history and philosophy of science. In Grant Ramsey & Andreas de Block (eds.), Dynamics of science: Computational frontiers in history and philosophy of science. Pittsburgh, PA: Pittsburgh University Press.
D’Ulizia, Arianna, Maria Chiara Caschera, Fernando Ferri & Patrizia Grifoni. 2021. Fake news detection: A survey of evaluation datasets. PeerJ Computer Science 7. e518. https://doi.org/10.7717/peerj-cs.518.
Diab, Ahmad, Rr. Nefriana & Yu-Ru Lin. 2024. Classifying conspiratorial narratives at scale: False alarms and erroneous connections. Proceedings of the International AAAI Conference on Web and Social Media 18. 340–353. https://doi.org/10.1609/icwsm.v18i1.31318.
Farrell, Henry John & Bruce Schneier. 2018. Common-knowledge attacks on democracy. Berkman Klein Center Research Publication no. 2018–7. Available at SSRN: https://ssrn.com/abstract=3273111. https://doi.org/10.2139/ssrn.3273111.
Ferrara, Emilio. 2024. GenAI against humanity: Nefarious applications of generative artificial intelligence and large language models. Journal of Computational Social Science 7. 549–569. https://doi.org/10.1007/s42001-024-00250-1.
Fleckenstein, Kristen. 2024. Representations of gender in conspiracy theories: A corpus-assisted critical discourse analysis. Critical Discourse Studies. Advance online publication. 1–17. https://doi.org/10.1080/17405904.2024.2334263.
Fong, Amos, Jon Roozenbeek, Danielle Goldwert, Steven Rathje & Sander van der Linden. 2021. The language of conspiracy: A psychological analysis of speech used by conspiracy theorists and their followers on twitter. Group Processes & Intergroup Relations 24(4). 606–623. https://doi.org/10.1177/1368430220987596.
Fort, Matthew, Zuoyu Tian, Elizabeth Gabel, Nina Georgiades, Noah Sauer, Daniel Dakota & Sandra Kübler. 2023. Bigfoot in big tech: Detecting out of domain conspiracy theories. In Proceedings of the 14th international conference on recent advances in natural language processing, 353–363. Varna, Bulgaria: INCOMA. https://aclanthology.org/2023.ranlp-1.40 (accessed 26 December 2024). https://doi.org/10.26615/978-954-452-092-2_040.
Grün, Bettina & Kurt Hornik. 2011. Topicmodels: An R package for fitting topic models. Journal of Statistical Software 40(13). https://doi.org/10.18637/jss.v040.i13.
Hamborg, Felix, Norman Meuschke, Corinna Breitinger & Bela Gipp. 2017. News-please: A generic news crawler and extractor. In Maria Gäde (ed.), Everything changes, everything stays the same: Understanding information spaces. Proceedings of the 15th international symposium of information science (ISI 2017), Berlin, Germany, 13th–15th March 2017, 218–223. Glückstadt: Verlag Werner Hülsbusch.
Hassan, Naeemul, Bill Adair, James T. Hamilton, Chengkai Li, Mark Tremayne, Jun Yang & Cong Yu. 2015. The quest to automate fact-checking. In Proceedings of the 2015 computation + journalism symposium. http://cj2015.brown.columbia.edu/papers/automate-fact-checking.pdf (accessed 26 December 2024).
Hemm, Ashley, Sandra Kübler, Michelle Seelig, John Funchion, Manohar Murthi, Kamal Premaratne, Daniel Verdear & Stefan Wuchty. 2024. Are you serious? Handling disagreement when annotating conspiracy theory texts. In Proceedings of the 18th linguistic annotation workshop (LAW-XVIII), 124–132. St. Julians, Malta: Association for Computational Linguistics. https://aclanthology.org/2024.law-1.12 (accessed 26 December 2024).
Himelfarb, Alex, Andreas Boecker, Marie-Eve Carignan, Timothy Caulfield, Jean-François Cliche, Jaigris Hodson, Ojistoh Horn, Akwatu Khenti, Stephan Lewandowsky, Noni MacDonald, Philip Mai, Sachiko Ozawa & Joanna Sterling. 2023. Fault lines: The expert panel on the socioeconomic impacts of science and health misinformation. Ottawa: Council of Canadian Academies. https://cca-reports.ca/reports/the-socioeconomic-impacts-of-health-and-science-misinformation/ (accessed 26 December 2024).
Humprecht, Edda. 2019. Where “fake news” flourishes: A comparison across four Western democracies. Information, Communication & Society 22(13). 1973–1988. https://doi.org/10.1080/1369118x.2018.1474241.
Klein, Colin, Peter Clutton & Adam G. Dunn. 2019. Pathways to conspiracy: The social and linguistic precursors of involvement in Reddit’s conspiracy theory forum. PLoS One 14(11). e0225098. https://doi.org/10.1371/journal.pone.0225098.
Kuperman, Victor, Hans Stadthagen-Gonzalez & Marc Brysbaert. 2012. Age-of-acquisition ratings for 30,000 English words. Behavior Research Methods 44(4). 978–990. https://doi.org/10.3758/s13428-012-0210-4.
Lansdall-Welfare, Thomas, Saatviga Sudhahar, James Thompson, Justin Lewis, Nello Cristianini, Amy Gregor, Boon Low, Toby Atkin-Wright, Malcolm Dobson & Richard Callison. 2017. Content analysis of 150 years of British periodicals. Proceedings of the National Academy of Sciences 114(4). https://doi.org/10.1073/pnas.1606380114.
Lasser, Jana, Segun T. Aroyehun, Fabio Carrella, Almog Simchon, David Garcia & Stephan Lewandowsky. 2023. From alternative conceptions of honesty to alternative facts in communications by US politicians. Nature Human Behaviour 7(12). 2140–2151. https://doi.org/10.1038/s41562-023-01691-w.
Leuker, Christina, Lukas Maximilian Eggeling, Nadine Fleischhut, John Gubernath, Ksenija Gumenik, Shahar Hechtlinger, Anastasia Kozyreva, Larissa Samaan & Ralph Hertwig. 2022. Misinformation in Germany during the COVID-19 pandemic: A cross-sectional survey on citizens’ perceptions and individual differences in the belief in false information. European Journal of Health Communication 3(2). 13–39. https://doi.org/10.47368/ejhc.2022.202.
Lewandowsky, Stephan, John Cook & Elisabeth Lloyd. 2018. The “Alice in Wonderland” mechanics of the rejection of (climate) science: Simulating coherence by conspiracism. Synthese 195(1). 175–196. https://doi.org/10.1007/s11229-016-1198-6.
Lin, Hause, Jana Lasser, Stephan Lewandowsky, Rocky Cole, Andrew Gully, David G. Rand & Gordon Pennycook. 2023. High level of correspondence across different news domain quality rating sets. PNAS Nexus 2(9). https://doi.org/10.1093/pnasnexus/pgad286.
Mattern, Justus, Yu Qiao, Elma Kerz, Daniel Wiechmann & Markus Strohmaier. 2021. FANG-COVID: A new large-scale benchmark dataset for fake news detection in German. In Rami Aly, Christos Christodoulopoulos, Oana Cocarascu, Zhijiang Guo, Arpit Mittal, Michael Schlichtkrull, James Thorne & Andreas Vlachos (eds.), Proceedings of the fourth workshop on fact extraction and verification (fever), 78–91. Dominican Republic: Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.fever-1.9.
Mayor, Eric & Alessandro Miani. 2023. A topic models analysis of the news coverage of the Omicron variant in the United Kingdom press. BMC Public Health 23(1). https://doi.org/10.1186/s12889-023-16444-7.
Meier, Tabea, Ryan L. Boyd, James W. Pennebaker, Matthias R. Mehl, Mike Martin, Markus Wolf & Andrea B. Horn. 2018. “LIWC auf Deutsch”: The development, psychometrics, and introduction of DE-LIWC2015. PsyArXiv. https://doi.org/10.31234/osf.io/uq8zt.
Miani, Alessandro, Thomas Hills & Adrian Bangerter. 2021. Loco: The 88-million-word language of conspiracy corpus. Behavior Research Methods 54(4). 1794–1817. https://doi.org/10.3758/s13428-021-01698-z (accessed 13 August 2022).
Miani, Alessandro, Thomas Hills & Adrian Bangerter. 2022. Interconnectedness and (in)coherence as a signature of conspiracy worldviews. Science Advances 8(43). https://doi.org/10.1126/sciadv.abq3668.
Miani, Alessandro, Lonneke van der Plas & Adrian Bangerter. 2024. Loose and tight: Creative formation but rigid use of nominal compounds in conspiracist texts. Journal of Creative Behavior 58(1). 114–127. https://doi.org/10.1002/jocb.633.
Mompelat, Ludovic, Zuoyu Tian, Amanda Kessler, Matthew Luettgen, Aaryana Rajanala, Sandra Kübler & Michelle Seelig. 2022. How “loco” is the LOCO corpus? Annotating the language of conspiracy theories. In Proceedings of the 16th linguistic annotation workshop (LAW-XVI) within LREC2022, 111–119. Marseille: European Language Resource Association. https://aclanthology.org/2022.law-1.14.
NewsGuard, Inc. 2020. Rating process and criteria. Internet Archive. https://web.archive.org/web/20200630151704/https://www.newsguardtech.com/ratings/rating-process-criteria/ (accessed 20 April 2022).
Nguyen, Dong, Maria Liakata, Simon DeDeo, Jacob Eisenstein, David Mimno, Rebekah Tromble & Jane Winters. 2020. How we do things with words: Analyzing text as social and cultural data. Frontiers in Artificial Intelligence 3. 62. https://doi.org/10.3389/frai.2020.00062 (accessed 4 July 2022).
Pachucki, Christoph, Reinhard Grohs & Ursula Scholl-Grissemann. 2022. Is nothing like before? COVID-19–evoked changes to tourism destination social media communication. Journal of Destination Marketing & Management 23. 100692. https://doi.org/10.1016/j.jdmm.2022.100692.
Reiter-Haas, Markus. 2023. Exploration of framing biases in polarized online content consumption. In Companion proceedings of the ACM web conference 2023, 560–564. Austin, TX: ACM. https://doi.org/10.1145/3543873.3587534.
Rieger, Jonas, Nico Hornig, Jonathan Flossdorf, Henrik Müller, Stephan Mündges, Carsten Jentsch, Jörg Rahnenführer & Christina Elmer. 2023. Debunking disinformation with GADMO: A topic modeling analysis of a comprehensive corpus of German-language fact-checks. In Proceedings of the 4th conference on language, data and knowledge, 520–531. Vienna: NOVA CLUNL. https://aclanthology.org/2023.ldk-1.
Samory, Mattia & Tanushree Mitra. 2018. “The government spies using our webcams”: The language of conspiracy theories in online discussions. Proceedings of the ACM on Human-Computer Interaction 2. 1–24. https://doi.org/10.1145/3274421.
Savoy, Jacques. 2009. A stemming procedure and stopword list for general French corpora. Journal of the American Society for Information Science 50. 944–952. https://doi.org/10.1002/(SICI)1097-4571(1999)50:10<944::AID-ASI9>3.3.CO;2-H.
Tausczik, Yla R. & James W. Pennebaker. 2010. The psychological meaning of words: LIWC and computerized text analysis methods. Journal of Language and Social Psychology 29(1). 24–54. https://doi.org/10.1177/0261927X09351676 (accessed 4 July 2022).
Vogel, Inna & Peter Jiang. 2019. Fake news detection with the new German dataset “GermanFakeNC”. In International conference on theory and practice of digital libraries, 288–295. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-030-30760-8_25.
© 2025 the author(s), published by De Gruyter, Berlin/Boston
This work is licensed under the Creative Commons Attribution 4.0 International License.