Book Genre Classification with BERT

This model was trained and published primarily for pedagogical purposes. It was neither extensively engineered nor optimized for performance. The notebook used to train this model can be found here.

This model is a fine-tuned version of distilbert-base-cased, trained on samples from the Despina/project_gutenberg dataset. It takes five-sentence text excerpts as input and predicts the genre of the book they were extracted from, among:

  • 0: adventure stories
  • 1: children's stories
  • 2: detective and mystery stories
  • 3: science fiction
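
As a minimal sketch of how these label ids might be used at inference time, the snippet below maps raw class scores back to genre names. It assumes the model id `noepsl/distilbert-book-genre-classification` and the Hugging Face `transformers` and `torch` libraries; the names `ID2LABEL`, `decode_logits` and `classify_excerpt` are illustrative and not part of the published model.

```python
import math

# Label ids as listed above.
ID2LABEL = {
    0: "adventure stories",
    1: "children's stories",
    2: "detective and mystery stories",
    3: "science fiction",
}


def decode_logits(logits):
    """Turn four raw class scores into (genre, probability) using softmax."""
    if len(logits) != len(ID2LABEL):
        raise ValueError("expected one score per genre")
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]


def classify_excerpt(text, model_id="noepsl/distilbert-book-genre-classification"):
    """Assumed usage: requires `transformers`, `torch` and network access."""
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)
    inputs = tokenizer(text, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()
    return decode_logits(logits)
```

Since the model was trained on five-sentence excerpts, inputs of a similar length are likely to give the most representative predictions.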

Training Details

The data used for this experiment is a corpus of five-sentence chunks extracted from fiction novels sourced from Project Gutenberg. It is a filtered version of the dataset introduced in Christou & Tsoumakas (2025); the original dataset is available at Despina/project_gutenberg.
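
To illustrate what a five-sentence chunk looks like in practice, here is a rough sketch of one way to cut a text into such excerpts. How the original dataset segments sentences, and whether its chunks overlap, is not stated here, so the naive regex splitter and the non-overlapping grouping below are assumptions, and the function names are hypothetical.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def five_sentence_chunks(text, size=5):
    """Group consecutive sentences into non-overlapping chunks of `size`.

    Trailing sentences that do not fill a whole chunk are dropped.
    """
    sentences = split_sentences(text)
    full = len(sentences) - len(sentences) % size
    return [" ".join(sentences[i:i + size]) for i in range(0, full, size)]
```

A dedicated sentence tokenizer would handle abbreviations and quoted dialogue better than this regex.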

Each excerpt is associated with exactly one of four genres: adventure stories, children's stories, detective and mystery stories, or science fiction.

Note that the samples taken from the original dataset were purposefully filtered to be balanced across the four genres mentioned above. Moreover, the sampling script tried, as far as possible, to impose parity between the (binary) inferred genders of the authors. Even so, the resulting dataset still leans heavily towards 'male'-written books for 'adventure stories' (88.5%) and 'science fiction' (92.7%). You can find, and reuse, the script used to filter and sample the data here.
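
The genre balancing described above can be approximated by downsampling every genre to the size of the smallest one. The sketch below is not the sampling script referenced above: it ignores author gender entirely, and the function name and the `(text, genre)` input format are assumptions made for illustration.

```python
import random
from collections import defaultdict


def balance_by_genre(samples, seed=0):
    """Downsample every genre to the size of the smallest one.

    `samples` is a list of (text, genre) pairs (hypothetical format).
    """
    by_genre = defaultdict(list)
    for text, genre in samples:
        by_genre[genre].append((text, genre))
    if not by_genre:
        return []
    target = min(len(items) for items in by_genre.values())
    rng = random.Random(seed)  # seeded for reproducibility
    balanced = []
    for genre in sorted(by_genre):
        balanced.extend(rng.sample(by_genre[genre], target))
    rng.shuffle(balanced)
    return balanced
```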

The filtered dataset used to train the model contains ~19.5K samples, split into train (15,606), validation (1,975) and test (1,954) sets. Note that the splits were made at the book level to avoid significant data leakage between the training and evaluation sets (i.e. excerpts from one book never appear in two different splits).
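
A book-level split can be sketched as follows: whole books, rather than individual excerpts, are assigned to a split, so no book contributes to more than one. The `book_id` key, the function name and the 10% validation/test fractions are assumptions (the fractions loosely mirror the reported split sizes), and they apply to books, so the resulting excerpt counts only roughly follow them.

```python
import random


def split_by_book(samples, val_frac=0.1, test_frac=0.1, seed=0):
    """Assign whole books to train/validation/test splits.

    `samples` is a list of dicts carrying a hypothetical "book_id" key.
    """
    book_ids = sorted({s["book_id"] for s in samples})
    random.Random(seed).shuffle(book_ids)
    n_test = round(len(book_ids) * test_frac)
    n_val = round(len(book_ids) * val_frac)
    test_books = set(book_ids[:n_test])
    val_books = set(book_ids[n_test:n_test + n_val])

    splits = {"train": [], "validation": [], "test": []}
    for s in samples:
        if s["book_id"] in test_books:
            splits["test"].append(s)
        elif s["book_id"] in val_books:
            splits["validation"].append(s)
        else:
            splits["train"].append(s)
    return splits
```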

Performance

The performance of the fine-tuned model on the test set is as follows:

| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Adventure Stories | 0.58 | 0.47 | 0.52 | 444 |
| Children's Stories | 0.68 | 0.80 | 0.74 | 490 |
| Detective and Mystery Stories | 0.59 | 0.70 | 0.64 | 440 |
| Science Fiction | 0.83 | 0.71 | 0.76 | 580 |
| Accuracy | | | 0.68 | 1954 |
| Macro Avg | 0.67 | 0.67 | 0.66 | 1954 |
| Weighted Avg | 0.68 | 0.68 | 0.67 | 1954 |
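
For readers unfamiliar with the two averaging rows, the sketch below recomputes them from the per-class figures in the table above. The macro average is the unweighted mean over classes, while the weighted average weights each class by its support. Because the inputs are already rounded to two decimals, the results match the reported rows only up to rounding; the helper names are illustrative.

```python
# (precision, recall, f1, support) per class, copied from the table above.
REPORT = {
    "adventure stories": (0.58, 0.47, 0.52, 444),
    "children's stories": (0.68, 0.80, 0.74, 490),
    "detective and mystery stories": (0.59, 0.70, 0.64, 440),
    "science fiction": (0.83, 0.71, 0.76, 580),
}

PRECISION, RECALL, F1, SUPPORT = range(4)


def macro_avg(report, column):
    """Unweighted mean of one metric over all classes."""
    values = [row[column] for row in report.values()]
    return sum(values) / len(values)


def weighted_avg(report, column):
    """Mean of one metric over all classes, weighted by class support."""
    total = sum(row[SUPPORT] for row in report.values())
    return sum(row[column] * row[SUPPORT] for row in report.values()) / total


total_support = sum(row[SUPPORT] for row in REPORT.values())
```

Support-weighted recall is, by construction, the overall accuracy, which is why those two rows agree.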

Computational Resources

The model was trained for 4 epochs on an NVIDIA RTX6000 GPU, which took about 6.5 minutes.

Model size: 65.8M parameters (Safetensors, tensor type F32).