Book Genre Classification with BERT
This model was primarily trained and published for pedagogical purposes. It was neither extensively engineered nor optimized for performance. The notebook used to train this model can be found here.
This model is a fine-tuned version of distilbert-base-cased, trained on samples from the Despina/project_gutenberg dataset. It takes five-sentence textual excerpts as input and predicts the genre of the book they were extracted from, among:
- 0: adventure stories
- 1: children's stories
- 2: detective and mystery stories
- 3: science fiction
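As a minimal sketch of how the model could be queried, the snippet below uses the `text-classification` pipeline from the `transformers` library. It assumes the checkpoint is available on the Hub under the id `noepsl/distilbert-book-genre-classification` and that, if the checkpoint does not store custom label names, the pipeline returns generic labels such as `LABEL_2`. The helper names `resolve_label` and `classify_excerpt` are illustrative and not part of the original notebook.

```python
# Label ids as listed in this model card.
ID2LABEL = {
    0: "adventure stories",
    1: "children's stories",
    2: "detective and mystery stories",
    3: "science fiction",
}


def resolve_label(raw_label: str) -> str:
    """Map a pipeline label to a genre name.

    Assumption: without custom label names in the checkpoint config,
    the pipeline returns generic names of the form "LABEL_<id>".
    """
    if raw_label.startswith("LABEL_"):
        return ID2LABEL[int(raw_label.split("_")[1])]
    return raw_label


def classify_excerpt(text, model_id="noepsl/distilbert-book-genre-classification"):
    """Return (genre, score) for one excerpt (hypothetical helper)."""
    # Imported lazily so the label mapping above works without transformers.
    from transformers import pipeline

    classifier = pipeline("text-classification", model=model_id)
    # truncation=True guards against excerpts longer than the model's input limit.
    prediction = classifier(text, truncation=True)[0]
    return resolve_label(prediction["label"]), prediction["score"]
```

Calling `classify_excerpt(excerpt)` on a five-sentence excerpt downloads the checkpoint on first use and returns the predicted genre together with its score. Since the model was trained on five-sentence chunks, inputs of a similar length are likely to work best.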
Training Details
The data used for this experiment is a corpus of five-sentence chunks extracted from fiction novels sourced from Project Gutenberg. It is a filtered version of the dataset introduced in Christou & Tsoumakas (2025) (you can find the original dataset at Despina/project_gutenberg).
Each excerpt is associated with exactly one of four genres: adventure stories, children's stories, detective and mystery stories, and science fiction.
Note that the samples taken from the original dataset were purposefully filtered to be balanced across the four genres mentioned above. Moreover, the sampling script tried, as much as possible, to impose parity between the (binary) inferred genders of the authors. The resulting dataset nevertheless leans heavily towards 'male'-written books for 'adventure stories' (88.5%) and 'science fiction' (92.7%). You can find, and reuse, the script used to filter and sample the data here.
The filtered dataset used to train the model contains ~19.5K samples, further split into train (15,606), validation (1,975) and test (1,954) sets. Note that the splits were made at the book level in order to avoid significant data leakage between training and evaluation sets (i.e. excerpts from one book cannot appear in two different splits).
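The book-level split described above can be sketched as follows. This is an illustration only, not the actual sampling script: the `book_id` field name, the 80/10/10 ratios, and the `split_by_book` helper are assumptions made for the example.

```python
import random
from collections import defaultdict


def split_by_book(samples, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split excerpts so that each book lands in exactly one split.

    `samples` is a list of dicts with a hypothetical "book_id" key.
    The ratios apply to the number of books, not the number of excerpts.
    """
    # Group excerpts by the book they come from.
    by_book = defaultdict(list)
    for sample in samples:
        by_book[sample["book_id"]].append(sample)

    # Shuffle the books (not the excerpts) reproducibly.
    book_ids = sorted(by_book)
    random.Random(seed).shuffle(book_ids)

    n_train = int(ratios[0] * len(book_ids))
    n_val = int(ratios[1] * len(book_ids))
    groups = {
        "train": book_ids[:n_train],
        "validation": book_ids[n_train:n_train + n_val],
        "test": book_ids[n_train + n_val:],
    }
    # Expand each group of books back into its excerpts.
    return {
        name: [s for book in ids for s in by_book[book]]
        for name, ids in groups.items()
    }
```

Because whole books are assigned to a split, the resulting excerpt counts only approximate the requested ratios when books contribute different numbers of excerpts.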
Performance
The fine-tuned model achieves the following performance on the test set:
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Adventure Stories | 0.58 | 0.47 | 0.52 | 444 |
| Children's Stories | 0.68 | 0.80 | 0.74 | 490 |
| Detective and Mystery Stories | 0.59 | 0.70 | 0.64 | 440 |
| Science Fiction | 0.83 | 0.71 | 0.76 | 580 |
| Accuracy | | | 0.68 | 1954 |
| Macro Avg | 0.67 | 0.67 | 0.66 | 1954 |
| Weighted Avg | 0.68 | 0.68 | 0.67 | 1954 |
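To make the relationship between the per-class rows and the two averaged rows explicit, here is a small sketch that recomputes them from the rounded values in the table. Because the inputs are already rounded to two decimals, recomputed figures can differ from the reported ones by about 0.01. The variable names are illustrative.

```python
# (precision, recall, support) per class, copied from the table above.
report = {
    "adventure stories": (0.58, 0.47, 444),
    "children's stories": (0.68, 0.80, 490),
    "detective and mystery stories": (0.59, 0.70, 440),
    "science fiction": (0.83, 0.71, 580),
}


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


total = sum(support for _, _, support in report.values())
per_class_f1 = {name: f1(p, r) for name, (p, r, _) in report.items()}

# Macro average: every class counts equally.
macro_precision = sum(p for p, _, _ in report.values()) / len(report)
macro_recall = sum(r for _, r, _ in report.values()) / len(report)

# Weighted average: every class counts in proportion to its support.
weighted_precision = sum(p * s for p, _, s in report.values()) / total
weighted_recall = sum(r * s for _, r, s in report.values()) / total

for name, score in per_class_f1.items():
    print(f"{name}: F1 = {score:.2f}")
print(f"macro precision = {macro_precision:.2f}, macro recall = {macro_recall:.2f}")
print(f"weighted precision = {weighted_precision:.2f}, weighted recall = {weighted_recall:.2f}")
```

In single-label classification, the support-weighted recall equals the overall accuracy, which is why both appear as 0.68 in the table.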
Computational Resources
The model was trained for 4 epochs on an NVIDIA RTX6000 GPU, which took about 6.5 minutes.
Model tree for noepsl/distilbert-book-genre-classification
Base model
distilbert/distilbert-base-cased