About

introduction

when working with a large amount of books reading every book and trying to figure out what genre it is can be a time consuming task thats why we need to find a way to automate the process of genre labeling.

Can we judge a book buy its summury/plot?

the problem: knowing what genres and subgenres a book falls under

aim: to automate and accelerate the process of labeling the genres of books

goal: build a high accuracy model that predicted the genre of a given book

imputes: to build a precise multilabel genre prediction model to better organize and categorize my home library :P

Dataset

i scraped the data from the abbjad website

the dataset has 4160 rows

attribute	description
Book_title	title of the book
Author	author name
cover_url	book cover image url
genres	list of genres that a book falls under
descriptions	book description
cover_name	string name of the cover image
description_length	the word count of the description
genre_count	number of genres that a book has

Assumptions

my intial assumptions was that it’s easy to build a highly accurate model that can predict the genres of a given book

The steps you took to find a solution

after web scraping the data i cleaned the dataset and applied nlp preprocess steps such as :
- removing white spaces,numbers,spiecal characters and diacritics
- removing stop words
- applying feature extraction method (TF-IDF)

building the model:

while researching for this project i’ve stumbled upon a lot of similar projects and resources but the majority only allowed a single genre per book and worked with a small set of basic genres. but with this project i wanted to work with a variety of genres and subgenres. thereby classifying my problem as a Multilabel Classification problem.

the resources for Multilabel Classification is scarse. but while reading a lot of documentation i stumbled upon the MultiLabelBinarizer package which makes Multilabel Classification easier to apply. i applied th MultiLabelBinarizer() on the genre feature which allowed the values to be vectorized. after that i’ve found that logistic regression and one-vs-rest works best.

resources:
- which features extraction method to choose?
- How to Solve a Multi Class Classification Problem with Python?
-video on Multi-Label Text Classification with Scikit-MultiLearn in Python

conclusion

it is possible to predict book genre based on the book plot/summary and we might be able to achieve a higher accuracy if we have a bigger dataset.

Outlook for future development

for this project i hope to continue building on the model to increase accuracy and to continue developing the computer vision aspect of the project. i would also like to find websites with a larger book achive to webscrape and use.

Limitations & problems with your solution

computaional power
time managment