Datasets, models, and tools released by the group. Most live on Hugging Face or GitHub; see the linked publication for context.
FoQA
Status: active. The first dedicated question-answering dataset for Faroese, with extractive and generative variants. Built through translation and adaptation of existing QA benchmarks combined with native-speaker validation, and accompanied by baseline evaluations of multilingual and Faroese-tuned language models. Released alongside the RESOURCEFUL 2025 paper.
GameQA
Status: maintenance. A gamified mobile-app platform for crowdsourcing multi-domain question-answering datasets. Demonstrated at EACL 2023 System Demonstrations and used to collect the RUQuAD-1 reading-comprehension dataset for Icelandic.
Hotter and Colder
Status: maintenance. An Icelandic sentiment corpus of blog comments annotated with nuanced labels for sentiment, emotions, toxicity, sarcasm, hate speech, sympathy, and related categories. Distributed via the University of Iceland; the companion methodology paper appeared at NoDaLiDa / Baltic-HLT 2025.
IceBERT family
Status: maintenance. A family of Icelandic BERT-style language models pre-trained on the Icelandic Crawled Corpus. Released alongside the LREC 2022 paper with Miðeind, the IceBERT checkpoints have become the foundation for much of the subsequent Icelandic NLP work in our group and elsewhere.
Icelandic WinoGrande
Status: maintenance. An Icelandic adaptation of the WinoGrande commonsense-reasoning benchmark, built to evaluate Icelandic language models on pronoun disambiguation and world-knowledge tasks. Released by Miðeind ehf.
MIM-GOLD-EL
Status: maintenance. An entity-linking corpus for Icelandic built on top of the MIM-GOLD named-entity collection. Released with the University of Iceland and introduced at LREC 2022; the companion methodology paper appeared at the Dataset Creation for Lower-Resourced Languages workshop.
MazeEval
Status: active. A benchmark for testing sequential decision-making in language models through navigation tasks in procedurally generated mazes. Described in a single-author preprint; the benchmark is released alongside the paper.
NQiI - Natural Questions in Icelandic
Status: maintenance. An open-domain question-answering dataset for Icelandic, adapted from the English Natural Questions benchmark. Released in two versions (v1.0 at LREC 2022 and an updated v1.1) and distributed through the University of Iceland for evaluating Icelandic QA systems.
RUQuAD-1
Status: maintenance. The Reykjavik University Question-Answering Dataset for Icelandic, a SQuAD-style reading-comprehension benchmark collected through the GameQA mobile platform. Distributed via the University of Iceland.
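For readers unfamiliar with the term, the sketch below shows roughly what a "SQuAD-style" record looks like and how it is typically checked. It is a minimal illustration, not the RUQuAD-1 schema itself: the field names follow the layout used by the Hugging Face `squad` dataset, the example record is invented, and the helper functions `answer_spans_are_consistent` and `exact_match` are hypothetical names introduced here. Consult the dataset's own documentation for its actual fields.

```python
# Hypothetical SQuAD-style record. Field names mirror the Hugging Face `squad`
# layout; the text is invented for illustration and is NOT taken from RUQuAD-1.
record = {
    "id": "example-0001",
    "context": "Reykjavík er höfuðborg Íslands.",
    "question": "Hver er höfuðborg Íslands?",
    "answers": {"text": ["Reykjavík"], "answer_start": [0]},
}


def answer_spans_are_consistent(rec):
    """Check that each gold answer appears in the context at its stated character offset."""
    context = rec["context"]
    answers = rec["answers"]
    return all(
        context[start:start + len(text)] == text
        for text, start in zip(answers["text"], answers["answer_start"])
    )


def exact_match(prediction, rec):
    """Return True if the prediction equals any gold answer, ignoring case and extra whitespace."""
    def normalise(s):
        return " ".join(s.lower().split())

    return any(normalise(prediction) == normalise(gold) for gold in rec["answers"]["text"])


if __name__ == "__main__":
    print(answer_spans_are_consistent(record))
    print(exact_match("reykjavík", record))
```

Extractive datasets of this kind store the answer as a character span inside the passage, which is why a span-consistency check like the one above is a common first sanity test before any model evaluation.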