BeLLM - A Belarusian LLM

“Language is the road map of a culture. It tells you where its people come from and where they are going.” - Rita Mae Brown

Introduction

The beLLM project pioneers the development of the first Belarusian Large Language Model, inspired by recent advances in NLP support for underrepresented languages. The project aims to create a model that understands and generates Belarusian text, drawing on the rich literary heritage of Belarus. It ran from December 2023 to January 2024.

Progress

Project Setup

The project started with an architecture design inspired by GPT-2 and nanoGPT. I studied the existing implementations and adapted them to Belarusian (vocabulary, tokenization, etc.); a minimal tokenization sketch follows below. Data collection and preprocessing were addressed through manual curation of a dataset, and computational resources were provided by the servers of the Mathematics Department at FU Berlin.
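
Since beLLM follows nanoGPT, a character-level vocabulary built directly from the corpus is the natural starting point. The sketch below is a minimal, illustrative version of that idea; the corpus file name and the exact tokenization scheme are assumptions for illustration, not taken from the project.

```python
# Minimal character-level tokenizer in the nanoGPT style.
# "bel_corpus.txt" is an illustrative name for the assembled corpus file.
with open("bel_corpus.txt", encoding="utf-8") as f:
    text = f.read()

# The vocabulary is every distinct character in the corpus, which
# naturally covers the Belarusian Cyrillic letters (і, ў, э, ...).
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

def encode(s: str) -> list[int]:
    """Map a string to a list of integer token ids."""
    return [stoi[c] for c in s]

def decode(ids: list[int]) -> str:
    """Map token ids back to a string."""
    return "".join(itos[i] for i in ids)

print(f"vocab size: {len(chars)}")
print(decode(encode("мова")))  # round-trips to "мова"
```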

Data Collection and Preparation

I manually curated and preprocessed a dataset of Belarusian poems and prose, drawn from the works of notable Belarusian authors and comprising over 9.5 million characters. A minimal preprocessing sketch follows the author list below.

Some of the authors included in the dataset:

  • Maxim Tank (Максім Танк)
  • Yanka Kupala (Янка Купала)
  • Yakub Kolas (Якуб Колас)
  • Maxim Bogdanovich (Максім Багдановіч)
  • Vasil Bykau (Васіль Быкаў)
  • Francishak Bagushevich (Францішак Багушэвіч)
  • Yanka Bryl (Янка Брыль)
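
The sketch below illustrates the kind of preprocessing involved: per-work text files are lightly normalized and concatenated into a single corpus file, and the total character count is reported. The directory layout and file names are assumptions for illustration, not the actual project structure.

```python
from pathlib import Path

# Illustrative layout: one UTF-8 .txt file per curated work, grouped by
# author under data/authors/ (not the actual project structure).
corpus_dir = Path("data/authors")

texts = []
for path in sorted(corpus_dir.rglob("*.txt")):
    raw = path.read_text(encoding="utf-8")
    # Light normalization: unify line endings and trim stray whitespace
    # so individual works concatenate cleanly.
    texts.append(raw.replace("\r\n", "\n").strip())

corpus = "\n\n".join(texts)
Path("bel_corpus.txt").write_text(corpus, encoding="utf-8")

# For the curated dataset described above, this comes to over 9.5 million characters.
print(f"total characters: {len(corpus):,}")
```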

Model Training

The training was conducted with PyTorch on a GPU-accelerated server. The model was trained on the curated dataset for multiple epochs, optimizing the loss to improve its language generation capabilities. Total training time was approximately 4 hours, and the final model size is 125 MB.
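
To make the training setup concrete, here is a minimal PyTorch training loop in the nanoGPT spirit. The hyperparameters are placeholders rather than the values used for beLLM, and the tiny bigram model only stands in for the GPT-2-style Transformer so that the loop runs end to end; it reuses `encode`, `text`, and `chars` from the tokenizer sketch above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder hyperparameters -- illustrative, not the beLLM settings.
block_size = 256          # context length in characters
batch_size = 64
learning_rate = 3e-4
max_steps = 5000
device = "cuda" if torch.cuda.is_available() else "cpu"

# Full corpus as one tensor of token ids (encode/text from the sketch above).
data = torch.tensor(encode(text), dtype=torch.long)

def get_batch() -> tuple[torch.Tensor, torch.Tensor]:
    """Sample a random batch of (input, next-character target) windows."""
    ix = torch.randint(len(data) - block_size - 1, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + 1 + block_size] for i in ix])
    return x.to(device), y.to(device)

class BigramLM(nn.Module):
    """Tiny stand-in model; beLLM itself uses a GPT-2-style Transformer,
    but any module mapping (B, T) ids to (B, T, vocab) logits fits this loop."""
    def __init__(self, vocab_size: int):
        super().__init__()
        self.logits_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        return self.logits_table(idx)          # (B, T, vocab_size)

model = BigramLM(len(chars)).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for step in range(max_steps):
    xb, yb = get_batch()
    logits = model(xb)
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), yb.view(-1))
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    if step % 500 == 0:
        print(f"step {step}: loss {loss.item():.3f}")
```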

Results

The trained model generates coherent and contextually appropriate Belarusian text, demonstrating promising capabilities in language generation for an underrepresented language. Generated samples and notes on model performance can be viewed here. The model is also available on the Hugging Face model hub: beLLM on Hugging Face.
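
For completeness, here is a minimal sketch of how text can be sampled from a character-level model of this kind, reusing `encode`, `decode`, `model`, and `device` from the sketches above. The prompt, temperature, and sampling scheme are illustrative assumptions, not the project's actual inference script.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(prompt: str, max_new_tokens: int = 200,
             temperature: float = 0.8, block_size: int = 256) -> str:
    """Autoregressive character-by-character sampling with temperature."""
    model.eval()
    idx = torch.tensor([encode(prompt)], dtype=torch.long, device=device)
    for _ in range(max_new_tokens):
        logits = model(idx[:, -block_size:])      # crop to the context window
        logits = logits[:, -1, :] / temperature   # logits for the next character
        probs = F.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_id], dim=1)
    return decode(idx[0].tolist())

# "Мой родны кут..." -- the opening words of Yakub Kolas's "Новая зямля".
print(generate("Мой родны кут"))
```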

Technologies Learned/Used

Throughout this project, I deepened my expertise in:

  • PyTorch for model training and development
  • Python for general programming
  • NumPy for numerical computations
  • Pydantic for data validation

Conclusion

beLLM represents a significant step towards enriching the NLP tools available for Belarusian, offering insights and possibilities for future linguistic and cultural studies.
