Fine-tuning BERT Models on Demand for Information Systems Explained Using Training Data from Pre-modern Arabic

Thomas Asselborn, Sylvia Melzer, Said Aljoumani, Magnus Bender, Florian Marwitz, Konrad Hirschler, Ralf Möller

Publication: Contribution to book/anthology/report › Conference contribution in proceedings › Research › peer-reviewed

Abstract

Humanities scholars can use Large Language Models (LLMs) to simplify text analysis and pattern recognition. Fine-tuning LLMs for specific humanities tasks can be challenging because training data is limited. However, a growing number of information systems in the humanities hold research data that can be used for this purpose. This article outlines how to fine-tune Bidirectional Encoder Representations from Transformers (BERT) models using pre-modern Arabic data available in an information system. We also introduce the Humanities Aligned Chatbot (ChatHA), which offers user-friendly interaction with the fine-tuned model and lowers the barriers to applying LLMs in the humanities. As a result, all research data archived in a research data repository can be used to fine-tune models in a short time without requiring IT expertise. Additionally, users can chat with ChatHA, which provides them with more precise answers. This success is also attributed to the availability of well-structured data in canonical form, which enables us to define the mapping of entity types to labels precisely. In addition, we use a manifest file, which serves as the cornerstone for structuring and organizing training data, to automate the Fine-tuning on Demand (FToD) process. Our results show that the FToD process can be completed in just a few minutes using a sample dataset and BERT. The FToD process identified names of people, places, or dates written in pre-modern Arabic that the pre-trained model could not recognise.
Original language: English
Title: CEUR Workshop Proceedings
Number of pages: 14
Volume: 3580
Publisher: CEUR-WS.org
Publication date: 7 Dec 2023
Pages: 38-51
Status: Published - 7 Dec 2023
Published externally: Yes
Event: 3rd Workshop on Humanities-Centred Artificial Intelligence: 46th German Conference on Artificial Intelligence - Wilhelminenhof campus of the Hochschule für Technik und Wirtschaft Berlin, Germany
Duration: 26 Sep 2023 to 26 Sep 2023

Conference

Conference: 3rd Workshop on Humanities-Centred Artificial Intelligence: 46th German Conference on Artificial Intelligence
Location: Wilhelminenhof campus of the Hochschule für Technik und Wirtschaft Berlin
Country/Territory: Germany
Period: 26/09/2023 to 26/09/2023
Name: CEUR Workshop Proceedings
Volume: 3580
ISSN: 1613-0073
