Datasets, models, and tools released by the group. Most live on Hugging Face or GitHub; see the linked publication for context.
FoQA
active The first dedicated question-answering dataset for Faroese, with
extractive and generative variants. Built through translation and
adaptation of existing QA benchmarks combined with native-speaker
validation, and accompanied by baseline evaluations of multilingual
and Faroese-tuned language models. Released alongside the
RESOURCEFUL 2025 paper.
GameQA
maintenance A gamified mobile-app platform for crowdsourcing multi-domain
question-answering datasets. Presented in the EACL 2023 System
Demonstrations track and used to collect the RUQuAD-1 reading-comprehension
dataset for Icelandic.
Hotter and Colder
maintenance An Icelandic sentiment corpus of blog comments annotated with nuanced
labels for sentiment, emotions, toxicity, sarcasm, hate speech,
sympathy, and related categories. Distributed via the University of
Iceland; the companion methodology paper appeared at NoDaLiDa /
Baltic-HLT 2025.
IceBERT family
maintenance A family of Icelandic BERT-style language models pre-trained on the
Icelandic Crawled Corpus. Released alongside the LREC 2022 paper with
Miðeind, the IceBERT checkpoints have become the foundation for much
of the subsequent Icelandic NLP work in our group and elsewhere.
MazeEval
active A benchmark for testing sequential decision-making in language models
through navigation tasks in procedurally generated mazes. Described in a
single-author preprint, with the benchmark released alongside the paper.
MIM-GOLD-EL
maintenance An entity-linking corpus for Icelandic built on top of the MIM-GOLD
named-entity collection. Released with the University of Iceland and
introduced at LREC 2022; the companion methodology paper appeared at
the Dataset Creation for Lower-Resourced Languages workshop.
Icelandic WinoGrande
maintenance An Icelandic adaptation of the WinoGrande commonsense-reasoning
benchmark, built to evaluate Icelandic language models on pronoun
disambiguation and world-knowledge tasks. Released by Miðeind ehf.
NQiI - Natural Questions in Icelandic
maintenance An open-domain question-answering dataset for Icelandic, adapted from
the English Natural Questions benchmark. Released in two versions
(v1.0 at LREC 2022 and an updated v1.1) and distributed through the
University of Iceland for evaluating Icelandic QA systems.
RUQuAD-1
maintenance The Reykjavik University Question-Answering Dataset for Icelandic, a
SQuAD-style reading-comprehension benchmark collected through the
GameQA mobile platform. Distributed via the University of Iceland.