Book Rating Prediction
Introduction
For readers and publishers, the ability to forecast a book’s success is extremely important. We propose a machine learning model that gauges a book’s success by predicting its rating. The problem has been addressed in [1], where the authors used count-based features such as TF-IDF, character n-grams, and writing density to predict book ratings. In [2], the authors used the pre-trained Universal Sentence Encoder (USE) to train a CNN model and generate book ratings. In both [1] and [2], the authors predict book success on the basis of user reviews and Goodreads ratings. This method does not work for unpublished books, which have no reviews or ratings yet. Hence, we tackle the problem using only features of the book itself.
For our solution, we have used the CMU Book Summary Dataset from Kaggle and Google Books API. The dataset consists of plot summaries of 16,559 books. The data points contain features such as author and genre sourced from Freebase and GoodReads. We collected the ratings information from Google Books API and integrated it with the CMU dataset.
Problem Definition
Over 300,000 books are published in the US yearly. While books from celebrated authors are marketed well and gain fame, most publications have no ratings, leaving readers no reliable way to discover them. With no ratings, readers often resort to gauging a book’s quality from its title and short summary. To alleviate this problem, we propose a robust system that generates ratings on a scale of 5. We approach the prediction task both as a classification problem and as a regression problem, with features such as title, author, summary, given genre, and clustered genre. The clustered genres of a book were generated with an unsupervised clustering algorithm and analyzed for their efficacy.
Data Collection
CMU Book Summary Dataset
We are using the CMU Book Summary Dataset from Kaggle. This dataset contains plot summaries for 16,559 books extracted from Wikipedia and aligned metadata from Freebase, including book author, title, and genre. The plot summaries have word counts of thousands of words.
Book ratings from Google Books API
For each data point in the CMU Book Summary Dataset, we obtained the book ratings and publisher using the Google Books API. For instance, https://www.googleapis.com/books/v1/volumes?q=thesilentpatient is the query for “The Silent Patient” by Alex Michaelides. From the JSON response, we take the features we need for each book: book name, publisher, rating, number of ratings, and description.
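The query-and-extract step can be sketched as follows. This is a minimal illustration rather than the project's actual collection script: the helper names are hypothetical, the sample response is hand-written placeholder data, and the field names (`volumeInfo`, `averageRating`, `ratingsCount`) are the ones the Google Books volumes endpoint is assumed to return.

```python
import json
from urllib.parse import urlencode

API_URL = "https://www.googleapis.com/books/v1/volumes"


def build_query_url(title):
    """Return a volumes search URL for a book title (hypothetical helper)."""
    return f"{API_URL}?{urlencode({'q': title})}"


def extract_features(response):
    """Pull the fields used in this report out of a parsed search response."""
    rows = []
    for item in response.get("items", []):
        info = item.get("volumeInfo", {})
        rows.append({
            "title": info.get("title"),
            "publisher": info.get("publisher"),
            "rating": info.get("averageRating"),      # may be missing
            "num_ratings": info.get("ratingsCount"),  # may be missing
            "description": info.get("description"),
        })
    return rows


# Hand-written sample shaped like an API response (placeholder, not real data).
sample = json.loads("""
{"items": [{"volumeInfo": {"title": "Example Book",
                           "publisher": "Example House",
                           "averageRating": 4.0,
                           "ratingsCount": 12,
                           "description": "A placeholder description."}}]}
""")

url = build_query_url("the silent patient")
rows = extract_features(sample)
```

Using `.get` throughout keeps the extraction from failing on volumes that have no rating or publisher, which is common for less popular books.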
Our final dataset consists of summary data from the CMU Book Summary Dataset and additional features obtained from the Google Books API. A given book can have multiple publishers, and each publisher’s edition has its own number of ratings. When rolling the data up to book level, we used an average weighted by the number of ratings to get the book’s rating. We also applied an outlier treatment that removes books with a total number of ratings less than or equal to 5.
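A minimal sketch of the roll-up described above, assuming each edition contributes its average rating weighted by its number of ratings. The function name and the toy rows are illustrative only.

```python
from collections import defaultdict

MIN_RATINGS = 5  # books with this many total ratings or fewer are dropped


def roll_up(editions):
    """Aggregate (title, avg_rating, num_ratings) rows, one per publisher edition."""
    weighted_sum = defaultdict(float)
    total_count = defaultdict(int)
    for title, rating, count in editions:
        weighted_sum[title] += rating * count
        total_count[title] += count
    return {
        title: (weighted_sum[title] / total_count[title], total_count[title])
        for title in total_count
        if total_count[title] > MIN_RATINGS
    }


# Toy rows, not real data: two editions of Book A and one of Book B.
editions = [
    ("Book A", 4.0, 30),
    ("Book A", 3.0, 10),
    ("Book B", 5.0, 2),
]
books = roll_up(editions)
```

Here Book A ends up with a rating of 3.75 over 40 ratings, while Book B is removed because it has only 2 ratings.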
Methodology
Generating Book Summary Embeddings:
To convert the book summaries to vector embeddings, we explored multiple encoding techniques such as TF-IDF, Word2Vec, and Universal Sentence Encoder (USE). For the clustering, we analyzed the performance of Word2Vec and USE embeddings with various unsupervised algorithms. We finalized USE as it encodes the entire book summary without requiring additional preprocessing. For each book summary, USE generates a vector embedding of 512 features.
Clustering:
To cluster the book summaries, we clustered the corresponding vector embeddings of the books. We compared several unsupervised techniques such as KMeans Clustering, Hierarchical Clustering, and Gaussian Mixture Model. We used Silhouette Score to assess the clustering performance.
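The comparison can be sketched with scikit-learn as below. The data here is a synthetic stand-in (three well-separated 2-D blobs) rather than the 512-dimensional summary embeddings, and the choice of three clusters is an assumption made only for the illustration.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic stand-in for summary embeddings: three well-separated blobs.
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

models = {
    "kmeans": KMeans(n_clusters=3, n_init=10, random_state=0),
    "hierarchical": AgglomerativeClustering(n_clusters=3),
    "gmm": GaussianMixture(n_components=3, random_state=0),
}

# Fit each model and score its labels with the same metric.
scores = {}
for name, model in models.items():
    labels = model.fit_predict(X)
    scores[name] = silhouette_score(X, labels)
```

The Silhouette Score ranges from -1 to 1, so it gives one comparable number per algorithm regardless of how each algorithm defines a cluster internally.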
Rating prediction and classification:
We converted book summaries to embeddings using Universal Sentence Encoder. To generate clusters, we explored unsupervised learning algorithms like K-Means and Hierarchical Clustering. We created one-hot vectors for genre and publisher and combined them with the number of ratings and a dimension-reduced text embedding to form our predictive features. We explored both classification and regression methodologies. As a regression task, we developed a feedforward network with ReLU activations and several Dropout layers in between. The model was trained with Mean Absolute Error as the performance metric, since rating differences can be less than 1 and such differences shrink further when squared. We used keras and sklearn to facilitate most of this, and used tensorflow_hub to get the encodings of our summaries. Other regression approaches such as Linear Regression and RandomForest Regression were also explored. We also approached the classification problem via a RandomForest classifier, XGBoost, and a GradientBoost classifier.
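The reasoning behind choosing Mean Absolute Error can be illustrated with a small worked example. The ratings below are made up for the illustration; every prediction is off by exactly half a star.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |true - predicted|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average of (true - predicted) squared."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up ratings: each prediction misses by half a star.
y_true = [4.0, 3.5, 4.5, 3.0]
y_pred = [3.5, 4.0, 4.0, 3.5]

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
```

With errors below 1, squaring makes each error smaller (0.5 becomes 0.25), so the squared loss reports a smaller value than the absolute loss for the same half-star miss.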
Results and Discussion
Preprocessing
- Text Cleaning - We dropped data points where the CMU book summary was not available. As the book summary is the core feature required for our task, we cannot impute missing values.
- One Hot encoding of relevant categorical variables - We use one-hot encoding to represent categorical classes like genre and publisher. A given book can belong to multiple genres or multiple publishers, so we first generate maps of book title to genres and book title to publishers. At the same time, we track the frequency of each genre/publisher to find the top 15 to one-hot encode, bucketing the rest into an "other" category.
Then we create a dataframe of unique titles and query our maps of book titles to genres/publishers to set the one-hot columns. Overall, this gives a binary representation of which genres/publishers a book belongs to. We chose 15 because, for both categorical variables, this is where the frequency of the publishers/genres dropped off, so the top 15 should carry the most relevant information. This is a tradeoff between the dimensionality of the dataset and the representation of the information.
- Stemming and Lemmatization - We cleaned the text summary with the nltk toolkit, using standard methods: punctuation and stopword removal, lemmatizing with the WordNet Lemmatizer, and stemming with the Porter Stemmer.
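The top-k encoding with an "other" bucket from the one-hot step above can be sketched as follows. The function name, the toy genre map, and k = 2 are illustrative assumptions; the report uses the top 15.

```python
from collections import Counter


def top_k_multi_hot(book_to_labels, k):
    """Encode each book as 0/1 flags over the k most frequent labels plus 'other'."""
    counts = Counter(
        label for labels in book_to_labels.values() for label in labels
    )
    top = [label for label, _ in counts.most_common(k)]
    columns = top + ["other"]
    rows = {}
    for title, labels in book_to_labels.items():
        row = [int(label in labels) for label in top]
        # Flag 'other' when the book has at least one label outside the top k.
        row.append(int(any(label not in top for label in labels)))
        rows[title] = row
    return columns, rows


# Toy map of title to genres, not real data.
book_to_genres = {
    "Book A": {"Fantasy", "Fiction"},
    "Book B": {"Fiction", "Mystery"},
    "Book C": {"Fiction", "Fantasy", "Steampunk"},
}
columns, rows = top_k_multi_hot(book_to_genres, k=2)
```

Because a book may carry several genres, a row can have more than one flag set, so strictly speaking this is a multi-hot rather than a one-hot vector.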
Clustering
- Word2Vec Clustering -
Our initial approach was to use an unsupervised clustering algorithm to identify genres. We cleaned the dataset with the nltk lemmatizer, stop-word removal, and other standard NLP practices. We then used gensim’s bigram and phrases generator and passed the result to Word2Vec to train a model on the corpus. With this model, we can take the summary of each book and get a vector that represents its embedding in Word2Vec. We then ran algorithms like K-Means clustering and measured scores such as inertia and Silhouette Score.
We visualize the lower dimension projections of the encodings using PCA.
As seen, we tried various numbers of clusters from k = 5 to k = 175. As we increased the number of clusters, scores declined. We looked at certain peaks such as k = 17, which had a Silhouette Score of 0.425 and inertia of around 700, and compared its clusters with the true labels shown in the table below.
We see that the genres are all over the place. Part of the problem is that we have 210 different labels and each book has multiple genre labels, so the true labels are not one-hot encoded. These clusters can therefore represent a new type of genre: for example, cluster 2 is more about steampunk, cluster 6 is about mystery/suspense, and cluster 7 is more about fiction, specifically science and speculative fiction. The labelled books were printed to show that some books from the same franchise fall in the same cluster, but our clustering algorithm does not make all books within a cluster equally related. This makes sense, as the silhouette score of 0.425 is not very strong for clustering, and, as seen in the elbow method graphs, the silhouette and inertia scores are very noisy. To reduce the noise and to visualize the data, we first ran PCA and reduced the data to 2 components.
In 2 dimensions the data is very condensed and concentrated, which makes sense: the Word2Vec model was 100-dimensional, so visualizing the dataset is very hard. To give an idea, the figures on the right show the K-Means labels and compare them with some of the true labels. As seen, K-Means has clustered the data by finding dense regions, but some true genres, such as fantasy, are spread all over the 2D graph, whereas sociology books are concentrated in one section. This approach does well in finding such concentrated genres, but it fails for more general genres. We also ran K-Means on the dataset reduced to 2 components, but the Silhouette Scores and inertia were very noisy.
Then we ran PCA with n_components set to 7. One of the main weaknesses of K-Means and similar algorithms is that they do not work well with high-dimensional data, while in very low dimensions, as seen, the data did not visualize well.
Here we see a small increase in Silhouette Score, which approaches 0.5, and this also trains faster. Let us look at the cluster labels for k = 17.
This is similar to the high-dimensional K-Means clustering, but the true labels are more mixed. We looked at the books individually: for example, “Zooman Sams” and “Zen Ties” are fictional children’s books, so they are placed correctly, but “Young Pioneers”, although its title and parts of its summary suggest a young audience, turned out on further research to be more of a short novel. Clearly, this representation does not give consistent results and the clusters differ from the true genre labels, but to some extent it does find similarity between the book summaries.
- USE Encoder - We analyzed the encodings generated by the USE model by computing correlation metrics between the summaries. For instance, the two books with the highest correlation metric are:
Chapterhouse: Dune
Summary
The situation is desperate for the Bene Gesserit as they find themselves the targets of the Honored Matres, whose conquest of the Old Empire is almost complete. The Matres are seeking to assimilate the technology and developed methods of the Bene Gesserit and exterminate the Sisterhood itself. Now in command of the Bene Gesserit, Mother Superior Darwi Odrade continues to develop her drastic, secret plan to overcome the Honored Matres. The Bene Gesserit are also terraforming the planet Chapterhouse to accommodate the all-important sandworms, whose native planet Dune had been destroyed by the Matres. Sheeana, in charge of the project, expects sandworms to appear soon. The Honored Matres have also destroyed the entire Bene Tleilax civilization, with Tleilaxu Master Scytale the only one of his kind left alive. In Bene Gesserit captivity, Scytale possesses the Tleilaxu secret of ghola production, which he has reluctantly traded for the Sisterhood's protection. The first ghola produced is that of their recently-deceased military genius, Miles Teg. The Bene Gesserit have two other prisoners on Chapterhouse: the latest Duncan Idaho ghola, and former Honored Matre Murbella, whom they have accepted as a novice despite their suspicion that she intends to escape back to the Honored Matres. Lampadas, a center for Bene Gesserit education, has been destroyed by the Honored Matres. The planet's Chancellor, Reverend Mother Lucilla, manages to escape carrying the shared-minds of millions of Reverend Mothers. Lucilla is forced to land on Gammu where she seeks refuge with an underground group of Jews. The Rabbi gives Lucilla sanctuary, but to save his organization he must deliver her to the Matres. Before doing so, he reveals Rebecca, a "wild" Reverend Mother who has gained her Other Memory without Bene Gesserit training. Lucilla shares minds with Rebecca, who promises to take the memories of Lampadas safely back to the Sisterhood.
Lucilla is then "betrayed", and taken before the Great Honored Matre Dama, who tries to persuade her to join the Honored Matres, preserving her life in exchange for Bene Gesserit secrets. Lucilla refuses, and Dama ultimately kills her. Back on Chapterhouse, Odrade confronts Duncan and forces him to admit that he is a Mentat, proving that he retains the memories of his many ghola lives. He does not reveal his mysterious visions of two people. Meanwhile, Murbella collapses under the pressure of Bene Gesserit training, giving in to "word weapons" that the Bene Gesserit had planted to undermine her earlier Honored Matre identity. Murbella realizes that she wants to be Bene Gesserit. Odrade believes that the Bene Gesserit made a mistake in fearing emotion, and that in order to evolve, the Bene Gesserit must learn to accept emotions. Odrade permits Duncan to watch Murbella undergo the spice agony, making him the first man ever to do so. Murbella survives the ordeal and becomes a Reverend Mother. Odrade then confronts Sheeana, discovering that Duncan and Sheeana have been allied together for some time. Sheeana does not reveal that they have been considering the option of reawakening Teg's memory through Imprinting, nor does Odrade discover that Sheeana has the keys to Duncan's no-ship prison. Odrade continues molding Scytale, with Sheeana showing him a baby sandworm, the Bene Gesserit's own long term supply of spice, and destroying Scytale's main bargaining card. Finally, Teg is awakened by Sheeana using imprinting techniques. Odrade appoints him again as Bashar of the military forces of the Sisterhood for the assault on the Honored Matres. Odrade next calls a meeting of all the Bene Gesserit, announcing her plan to attack the Honored Matres. She tells them that this attack will be led by Teg. She also announces candidates to succeed her as Mother Superior; she will share her memories with Murbella and Sheeana before she leaves.
Odrade then goes to meet the Great Honored Matre. Under cover of Odrade's diplomacy, the Bene Gesserit forces under Teg attack Gammu with tremendous force. Teg uses his secret ability to see no-ships to secure control of the system. Survivors of the attack flee to Junction, and Teg follows them there and carries all with him. Victory for the Bene Gesserit seems inevitable. In the midst of this battle, the Jews (including Rebecca with her precious memories) take refuge with the Bene Gesserit fleet. Logno, chief advisor to Dama, assassinates Dama with poison and assumes control of the Honored Matres. Her first act surprises Odrade greatly. Too late Odrade and Teg realize they have fallen into a trap, and the Honored Matres use a mysterious weapon to turn defeat into victory, as well as capturing Odrade. Murbella saves as much of the Bene Gesserit force as she can and they begin to withdraw to Chapterhouse. Odrade, however, had planned for the possible failure of the Bene Gesserit attack and left Murbella instructions for a last desperate gamble. Murbella pilots a small craft down to the surface, announcing herself as an Honored Matre who, in the confusion, has managed to escape the Bene Gesserit with all their secrets. She arrives on the planet and is taken to the Great Honored Matre. Unable to control her anger, Logno attacks but is killed by Murbella. Awed by her physical prowess, the remaining Honored Matres are forced to accept her as their new leader. Odrade is also killed in the melee and Murbella shares memories with her, thereby also becoming Reverend Mother Superior. Murbella's ascension to leadership is not accepted as victory by all the Bene Gesserit. Some flee Chapterhouse, notably Sheeana, who has a vision of her own, and is joined by Duncan. The two escape in the giant no-ship, with Scytale, Teg and the Jews. Murbella recognizes their plan at the last minute, but is powerless to stop them.
Watching this escape with interest are Daniel and Marty, the observers Duncan had been having visions of. The story ends on a cliffhanger with several questions left unanswered regarding the merging of the Honored Matres and Bene Gesserit, the fates of those on the escaped no-ship (including the role of Scytale, the development of Idaho and Teg, and the role of the Jews), the identity of the god-like characters in the book's final chapter and the ultimate mystery of what chased the Honored Matres back into the Old Empire.
Heretics of Dune
Summary
Much has changed in the millennium and a half since the death of the God Emperor. Sandworms have reappeared on Arrakis (now called Rakis) and renewed the flow of the all-important spice melange to the galaxy. With Leto's death, a hugely complex economic system built on spice collapsed, resulting in trillions leaving known space in a great Scattering. A new civilization has risen, with three dominant powers: the Ixians, whose no-ships are capable of piloting between the stars and are invisible to outside detection; the Bene Tleilax, who have learned to manufacture spice in their axlotl tanks and have created a new breed of Face Dancers; and the Bene Gesserit, a matriarchal order of subtle political manipulators who possess superhuman abilities. However, people from the Scattering are returning with their own peculiar powers. The most powerful of these forces are the Honored Matres, a violent society of women bred and trained for combat and the sexual control of men. On Rakis, a girl called Sheeana has been discovered who can control the giant worms. The Bene Gesserit intends to use a Tleilaxu-provided Duncan Idaho ghola to gain control of this sandrider, and the religious forces of mankind who they know will ultimately worship her. The Sisterhood have subtly been altering the gholas to bring their physical reflexes up to modern standards. The Bene Gesserit leader, Mother Superior Taraza, brings Miles Teg to guard the new Idaho. Taraza also sends Reverend Mother Darwi Odrade to take command of the Bene Gesserit keep on Rakis. Odrade is a loose cannon; she does not obey normal Bene Gesserit prohibitions about love, and is also Teg's biological daughter. Bene Gesserit Imprinter Lucilla is also sent by Taraza to bind Idaho's loyalty to the Sisterhood with her sexual talents. However, Lucilla must deal with Reverend Mother Schwangyu, head of the ghola project but also the leader of a faction within the Bene Gesserit who feel gholas are a danger. 
Above the planet Gammu, Taraza is captured and held hostage by the Honored Matres aboard an Ixian no-ship. The Honored Matres insist Taraza invite Teg to the ship, hoping to gain control of the ghola project. Teg manages to turn the tables on the Matres, and rescues the Mother Superior and her party. An attack is then made on Sheeana on Rakis, which is prevented by the intervention of the Bene Gesserit. Odrade starts training Sheeana as a Bene Gesserit. At about the same time an attempt is made on the life of Idaho, but Teg is able to defeat it. Teg flees with Duncan and Lucilla into the countryside. In an ancient Harkonnen no-globe, Teg proceeds to awaken Idaho's original memories, but does so before Lucilla can imprint Duncan and thus tie him to the Sisterhood. In the meantime, Taraza has been searching for Teg and his party, and finally establishes contact. During the operation, however, Teg and his companions are ambushed. Teg is captured while Lucilla and Duncan escape. Teg is tortured by a T-Probe, but under pressure discovers a new ability: he is able to speed up his physical and mental reactions, which enables him to escape. At the same time, Idaho is ambushed and taken hostage. Taraza arranges a meeting with the Tleilaxu Master Waff, who is soon forced to tell her what he knows about the Honored Matres. When pressed on the issue of Idaho, he also admits that the Bene Tleilax have conditioned their own agenda into him. As the meeting draws to a close, Taraza accidentally divines that Waff is a Zensunni, giving the Bene Gesserit a lever to understand their ancient competitor. She and Odrade meet Waff again on Rakis. He tries to assassinate Taraza but Odrade convinces him that the Sisterhood shares the religious beliefs of the Bene Tleilax. Taraza offers full alliance with them against the onslaught of forces out of the Scattering. This agreement causes consternation among the Bene Gesserit, but Odrade realizes that Taraza's plan is to destroy Rakis. 
By destroying the planet, the Bene Gesserit would be dependent on the Tleilaxu for the spice, ensuring an alliance. Lucilla arrives at a Bene Gesserit safe house to discover it has been taken over by Honored Matres, who have Idaho as their captive. As Lucilla infiltrates it, the young Honored Matre Murbella proceeds to seduce the captured Idaho with the Honored Matre imprinting method. However, hidden Tleilaxu conditioning kicks in, and Duncan responds with an equal technique, one that overwhelms Murbella. Lucilla takes advantage of Murbella's exhaustion to knock her unconscious and rescue Duncan. The Honored Matres attack Rakis, killing Taraza. Odrade becomes temporary leader of the Bene Gesserit before escaping with Sheeana into the desert on a worm. Teg also goes to a supposed safe house, only to discover the Honored Matres. He unleashes himself upon the complex, before capturing a no-ship and locating Duncan and Lucilla. They and the captured Murbella are taken to Rakis with him. When they arrive, Teg finds Odrade and Sheeana and their giant worm. He loads them all up in his no-ship, finally leading his troops out on a last suicidal defense of Rakis, designed to attract the rage of the Honored Matres. The Honored Matres attack Rakis, destroying the planet and the sandworms except for the one the Bene Gesserit escape with. They drown the worm in a mixture of water and spice, turning it into sandtrout which will turn the secret Bene Gesserit planet Chapterhouse into another Dune.
The two books are part of the same series by the author Frank Herbert. Thus, they are expected to be highly correlated.
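Finding the most correlated pair of summaries can be sketched as below. The report does not say which correlation measure was used, so Pearson correlation between embedding vectors is an assumption here, and the three short vectors are toy stand-ins for 512-dimensional USE embeddings.

```python
import numpy as np


def most_correlated_pair(embeddings):
    """Return (i, j, r) for the two rows with the highest Pearson correlation."""
    corr = np.corrcoef(embeddings)   # each row is one book's embedding
    np.fill_diagonal(corr, -np.inf)  # ignore each book's correlation with itself
    i, j = np.unravel_index(np.argmax(corr), corr.shape)
    return int(min(i, j)), int(max(i, j)), float(corr[i, j])


# Toy embeddings, not real data.
emb = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.1, 6.0, 8.2],  # nearly a scaled copy of row 0
    [4.0, 1.0, 3.0, 2.0],
])
i, j, r = most_correlated_pair(emb)
```

In this toy case rows 0 and 1 are returned, mirroring how two books from the same series surface as the top pair.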
- USE Embeddings Clustering -
Traditional clustering algorithms such as KMeans and GMM yield a low silhouette score. Hence, we attempted the clustering on a lower dimension projection of the 512-dimension encodings. We used PCA for dimensionality reduction and found that 182 components capture 90.13% of the total variance and 454 components capture 99.7% of the total variance.
We visualize the lower dimension projections of the encodings using t-SNE
- Dimensionality reduction is done for the embedding with PCA to avoid the curse of dimensionality. The pretrained embedding from tensorflow_hub outputs a vector of size 512, while our dataset itself has only ~11,000 entries, which makes the neural network difficult to train as is. By using PCA, we compress the 512 dimensions to 20 and are able to train our model effectively.
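The number of PCA components needed to reach a target share of variance can be read off the cumulative explained-variance curve. The sketch below computes that curve from the singular values of the centered data; the function name is hypothetical and the data is synthetic, with per-feature scales chosen only so that variance decays quickly.

```python
import numpy as np


def components_for_variance(X, target):
    """Smallest number of principal components whose variance share >= target."""
    Xc = X - X.mean(axis=0)  # PCA operates on centered data
    singular_values = np.linalg.svd(Xc, compute_uv=False)
    ratios = singular_values ** 2 / np.sum(singular_values ** 2)
    cumulative = np.cumsum(ratios)
    k = int(np.searchsorted(cumulative, target) + 1)
    return k, cumulative


rng = np.random.default_rng(0)
# Synthetic stand-in for embeddings: 10 features with decaying scale.
scales = np.array([10, 5, 2, 1, 1, 0.5, 0.5, 0.2, 0.2, 0.1])
X = rng.normal(size=(500, 10)) * scales

k, cumulative = components_for_variance(X, 0.90)
```

The same curve makes the tradeoff visible: a high target keeps most of the information but many dimensions, while a small fixed number of components such as 20 trades retained variance for a model that is easier to train.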
Rating Prediction
Classification
Random Forest Classifier
Treating a book rating greater than or equal to 4 as the success criterion, a random forest classification model was trained with 10 estimators and a maximum depth of 10. The accuracy on the train dataset is 75.6%, while the accuracy on the test dataset is 71.76%. The data is imbalanced, with ~28% of books flagged as a success in both the test and train datasets. The predictions on the test set are almost all zeros, and the recall score is 0.004. Experimenting with the number of estimators and the depth yielded a model with a higher accuracy of 99.8% on train data but similar performance on test data, and the recall score did not improve much either. To reduce overfitting, a random forest classification model with 5 estimators and a max depth of 20 was trained, reaching an accuracy of 91.7% on the training and 67.7% on the test dataset, with a recall score of 0.15 on test data. Below is the crosstab of predicted vs. actual success flag on the test data.
Actual success flag | Predicted 0 | Predicted 1 | Total Count |
---|---|---|---|
0 | 1422 | 195 | 1617 |
1 | 527 | 94 | 621 |
Total Count | 1949 | 289 | 2238 |
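As a worked check, the headline metrics can be recomputed directly from the four cells of the crosstab above. The comparison against a majority-class baseline (always predicting 0) is an addition for context and is not part of the original analysis.

```python
# Cells of the crosstab above (test data).
tn, fp = 1422, 195  # actual 0: predicted 0, predicted 1
fn, tp = 527, 94    # actual 1: predicted 0, predicted 1
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
recall = tp / (tp + fn)     # share of successful books that were found
precision = tp / (tp + fp)  # share of predicted successes that were right

# Accuracy of a model that always predicts "not successful".
majority_baseline = (tn + fp) / total
```

The values come out to roughly 0.677 accuracy and 0.151 recall, matching the figures reported in the text. The always-zero baseline reaches about 0.723, which is why accuracy alone is a weak signal on this imbalanced data and recall is reported alongside it.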
A multi-class random forest classification model with 25 estimators was also tried, with the ratings rounded off to the nearest integer. This model yielded an accuracy of 99.6% on training and 78.3% on test data, with a macro-averaged recall score of 0.25.
Below is the cross tab of predicted vs actual.
Actual book rating class | Predicted 3 | Predicted 4 | Total Count |
---|---|---|---|
2 | 0 | 10 | 10 |
3 | 6 | 434 | 440 |
4 | 13 | 1747 | 1760 |
5 | 0 | 28 | 28 |
Total Count | 22 | 2223 | 2245 |
Another multi-class classification model was also tried, where the ratings were transformed to integers from 1 to 10. For this, each rating is multiplied by a factor of two before being rounded off to the nearest integer. As most ratings become 3 or 4 when rounded off, we expected this transformation to distribute the ratings across more classes. The random forest classification model with this transformation did not give better results: the accuracy on test data was just 42%.
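The effect of the transformation can be seen on a handful of made-up ratings. Round-half-up is assumed here, since the report does not state the rounding rule; note that Python's built-in `round` uses round-half-to-even instead, which is why it is not used below.

```python
import math


def to_five_classes(rating):
    """Round a 1-5 rating to the nearest integer (half rounds up)."""
    return math.floor(rating + 0.5)


def to_ten_classes(rating):
    """Double the rating, then round to the nearest integer (half rounds up)."""
    return math.floor(rating * 2 + 0.5)


# Made-up ratings in the range where most of the data lies.
ratings = [3.4, 3.6, 3.9, 4.2, 4.4]
coarse = [to_five_classes(r) for r in ratings]
fine = [to_ten_classes(r) for r in ratings]
```

On these values the five-class scheme produces only two distinct classes (3 and 4), while the ten-class scheme produces three (7, 8, and 9), which is the spreading effect the transformation was meant to achieve.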
Support Vector Machines
We also tried an SVM classifier to classify whether a book would be successful (book rating >= 4), using rbf and other kernels to see if we would get better results. But the model failed to predict any successful books.
GradientBoost
We tried a gradient boost model to classify whether a book would be successful (book rating >= 4). With a learning rate of 0.01, the accuracy was 84% on the train data and 72% on the test data. But this model also could not predict the successful books, despite tweaking the learning rate.
XGBoost
XGBoost was also tried to classify whether a book would be successful (book rating >= 4). The accuracy is 99.9% on the training dataset and 69.5% on the test data, with a recall score of 0.11.
Actual success flag | Predicted 0 | Predicted 1 | Total Count |
---|---|---|---|
0 | 1487 | 130 | 1617 |
1 | 553 | 68 | 621 |
Total Count | 2040 | 198 | 2238 |
Regression
Random forest regression
Below is the histogram of the ratings data. As we can see, most ratings are between 3.5-4.5.
Random forest regression was tried with the number of estimators ranging from 10 to 20. The fit was best with 20 estimators, where the mean absolute error was 0.1157 on training data and 0.293 on test data. The R-squared computed from predicted and actual ratings was 83% on the train data but very low, at -3.5%, on the test dataset. The fitted line of predicted vs. actual ratings also has a slope far below 1.
Apart from the number of ratings, we have no original continuous variables such as the number of books sold (the encodings are only a vectorized form of the summary column), and the majority of ratings fall in the range 3-4.5. As a result, the regression model cannot predict ratings accurately to the decimal. Hence, to predict a book’s success, we tried classification models.
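A negative R-squared on test data has a concrete meaning: the model's predictions are worse than simply predicting the mean rating for every book. The small example below, with made-up ratings, shows both cases using the standard definition R² = 1 - SS_res / SS_tot.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up ratings, not real data.
y_true = [3.0, 3.5, 4.0, 4.5]

mean_pred = [3.75] * 4           # always predict the mean rating
off_pred = [4.0, 4.0, 3.5, 3.5]  # predictions that trend the wrong way

r2_mean = r2_score(y_true, mean_pred)
r2_off = r2_score(y_true, off_pred)
```

Predicting the mean gives exactly 0, and the second set of predictions gives -1, so a slightly negative test value indicates a model that performs about as well as the constant-mean predictor.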
Linear regression:
Predicting ratings can be modeled as a regression task. Using the publisher, num_ratings, genres, and encoded_summary we created a 53 feature matrix. The encoded summary was reduced to 20 components.
The general split for the models below is 85% training and 15% test, with 10% of the training data held out for validation where needed. Results are reported on the test set, which is unseen by the model.
The 85% training portion was used to build the Linear Regression model.
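A minimal sketch of assembling the 53-feature matrix and splitting it is given below. The block widths are an assumption that happens to add up to 53 (15 top categories plus one "other" bucket for each of publisher and genre, one num_ratings column, and 20 PCA components); the data is random and the dataset size of 200 is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n_books = 200  # arbitrary size for the illustration

# Assumed block widths: 16 + 16 + 1 + 20 = 53 columns.
publisher = rng.integers(0, 2, size=(n_books, 16))
genres = rng.integers(0, 2, size=(n_books, 16))
num_ratings = rng.integers(6, 1000, size=(n_books, 1))
encoded_summary = rng.normal(size=(n_books, 20))

X = np.hstack([publisher, genres, num_ratings, encoded_summary])
y = rng.uniform(1.0, 5.0, size=n_books)  # random stand-in for ratings

# Shuffle once, then cut 15% for test and 10% of the remainder for validation.
order = rng.permutation(n_books)
n_test = int(round(0.15 * n_books))
test_idx, train_idx = order[:n_test], order[n_test:]
n_val = int(round(0.10 * len(train_idx)))
val_idx, fit_idx = train_idx[:n_val], train_idx[n_val:]

X_fit, y_fit = X[fit_idx], y[fit_idx]
X_val, y_val = X[val_idx], y[val_idx]
X_test, y_test = X[test_idx], y[test_idx]
```

Splitting by shuffled indices keeps the three subsets disjoint, so the test rows are never seen during fitting or validation.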
In this graph we see the predicted ratings are just between 3.5 and 4 and we do not really find any correlation. Below is a table of statistics to show the result of the regression task.
Metric | Score | Optimal Metric Score |
---|---|---|
R2 | 0.022827991479860588 | 1.0 |
Explained Variance | 0.0231768929303382 | 1.0 |
Root Mean Squared Error | 0.5007558913663868 | 0 |
D2 Pinball Score | 0.01893850443208711 | 1.0 |
Mean Absolute Error | 0.367700680464321 | 0 |
The scores are poor, and the model does not predict well. The R2 value shows almost no correlation between predictions and true ratings, and the mean absolute error may be significant given the narrow range of ratings. This is when we transition to Neural Networks.
Neural Networks
Simple Linear Layers with ReLU Activation Functions Architecture
Different architectures were tried with varying numbers of layers, dropout values, etc. Below are the figures for training epochs versus mean absolute error. We use Mean Absolute Error since Mean Squared Error would be too small, as our ratings differ by decimals.
We see the error converge between 0.35 and 0.37, meaning predictions are off by roughly a third of a star on average. We trained for more epochs (tested up to 50), but the model started overfitting, which we want to avoid.
We are seeing more varied predictions for Neural Networks and it is just not tightly bound in one range. Below are the regression statistics for this method.
Metric | Score | Optimal Metric Score |
---|---|---|
R2 | 0.011508081888894961 | 1.0 |
Explained Variance | 0.01601532318292065 | 1.0 |
Root Mean Squared Error | 0.5036480072156625 | 0 |
D2 Pinball Score | 0.01598347067234218 | 1.0 |
Mean Absolute Error | 0.3680082235991418 | 0 |
We see that it is performing similar to Linear Regression and worse in some statistics.
So we analyzed the data and realized the imbalance of positive reviews, interchangeable words between three and five star reviews, and other features suggesting the dataset might be noisy. We tried to undersample the overrepresented ratings so there is more balance. We trained these models again.
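One way to undersample the overrepresented ratings is sketched below. The report does not describe the exact procedure, so both the half-star binning and the rule of capping every bin at the size of the smallest bin are assumptions, and the ratings are made up.

```python
import random
from collections import defaultdict


def undersample_by_bin(ratings, bin_width=0.5, seed=0):
    """Return sorted indices keeping an equal number of items per rating bin."""
    bins = defaultdict(list)
    for idx, rating in enumerate(ratings):
        bins[int(rating // bin_width)].append(idx)
    cap = min(len(members) for members in bins.values())  # smallest bin size
    rng = random.Random(seed)
    kept = []
    for members in bins.values():
        kept.extend(rng.sample(members, cap))
    return sorted(kept)


# Made-up ratings: most fall between 3.5 and 4.5, as in the dataset.
ratings = [2.1, 3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.6]
kept = undersample_by_bin(ratings)
```

Capping at the smallest bin balances the classes exactly but discards many rows, which is consistent with the smaller number of points seen after undersampling.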
Compared with the previous Linear Regression plot, there are fewer points due to undersampling, but the model’s predictions are no longer confined to the range between 3.5 and 4.
Metric | Score | Optimal Metric Score |
---|---|---|
R2 | 0.08645945280489531 | 1.0 |
Explained Variance | 0.017508793582832194 | 1.0 |
Root Mean Squared Error | 0.775356074265067 | 0 |
D2 Pinball Score | 0.017508793582832194 | 1.0 |
Mean Absolute Error | 0.6510761955804188 | 0 |
Here we see an improvement over the previous linear regression in R2, while Explained Variance and the D2 Pinball Score are slightly lower and the errors went up. This suggests the model captures somewhat more of the relationship between the input and the ratings, but its predictions are farther from the true values. We then tried our Neural Network on this data.
Using this new data on Neural Network we see that the predictions are still between 2 and 4. It could have been more varied.
This model seemed to converge to around 0.6 on Mean Absolute Error. It was trained for 100 epochs.
Metric | Score | Optimal Metric Score |
---|---|---|
R2 | -0.30756505408803236 | 1.0 |
Explained Variance | 0.0253402541669836 | 1.0 |
Root Mean Squared Error | 0.927617497498877 | 0 |
D2 Pinball Score | -0.20175904705460113 | 1.0 |
Mean Absolute Error | 0.65761955804188 | 0 |
Here we see degradation in the majority of statistics, so this approach did not work well. This is possibly because undersampling reduces the number of data points, making it harder for models to train. We tried several other architectures, such as adding an embedding layer with a vocabulary size of 10000 followed by an LSTM, but they complicated the model and did not yield fruitful results. Currently, the first neural network and linear regression performed best for prediction; however, Linear Regression on the second (undersampled) dataset had the highest Pearson correlation coefficient.
This suggests a couple of questions. Are the dimensions of the data too high? We have tried to experiment with PCA and played with several different components but relative to the dataset, this could be the case. Is the data too noisy or hard to learn? It is important to know that using summary and other features based on the book may make it hard to predict because we are not actually analyzing the content of the book. Is this the right architecture? We should maybe look at more LSTM and transformer based architecture for specific summary tasks and maybe that can yield greater accuracy with minimal error.
Summary of the methods explored
Model | Performance metric | Test performance |
---|---|---|
Random Forest Regression | MAE | 0.29 |
Linear regression | MAE | 0.37 |
Neural Networks | MAE | 0.37 |
Random Forest Classifier | Accuracy | 0.71 |
GradientBoost | Accuracy | 0.72 |
XGBoost | Accuracy | 0.69 |
Conclusion
Word2Vec
Using Word2Vec to generate embeddings for text and running K-Means, we saw a Silhouette Score of 0.425 with somewhat discernible genres. After trying different numbers of PCA components, our silhouette score was still similar, but there was a lack of consistency in the results.
Rating Prediction
We explored the problem statement from a regression perspective and a classification perspective. As a classification task, we modeled a book as successful if its average rating is greater than or equal to 4. The RandomForest and GradientBoost classifiers reached test accuracies of 71% and 72% respectively, although recall on successful books stayed low. As a regression task, the RandomForest Regressor performed the best with a mean absolute error of 0.29. The model works best in predicting the ratings of books in the 3 to 5 star range but tends to give more generous ratings to books that were actually 1 or 2 stars. We speculate this is due to the subjective nature of books and readers’ bias: while it may be easy to pick out an amazing book by factors such as a detailed summary, a large number of ratings, and high-quality publishers, these same factors can also occasionally accompany a book that many people just do not like for a variety of reasons. As the saying goes, “don’t judge a book by its cover” (or summary, in our case). Despite this, we think our model is fairly good at identifying top-notch books and correctly placing average books, and hence can be used for these purposes in a personal or commercial setting.
References
[1] Maharjan et al., A Multi-task Approach to Predict Likability of Books
[2] Khalifa et al., Book Success Prediction with Pretrained Sentence Embeddings and Readability Scores
[3] Shi et al., Neural Abstractive Text Summarization with Sequence-to-Sequence Models
[4] Rudolph Flesch. 1948. A new readability yardstick. Journal of Applied Psychology, 32(3):221.
[5] G. Harry McLaughlin. 1969. SMOG grading: a new readability formula. Journal of Reading, 12(8):639-646.
Contribution Table
Task | Contributor |
---|---|
Introduction | Prathudhar Bagathi, Vishnu Jaganathan |
Problem | Rakesh Arwini, Manav Agrawal |
Data Collection | Rakesh Arwini, Deepak Gouda, Vishnu Jaganathan |
Methodology | Manav Agrawal, Deepak Gouda, Prathudhar Bagathi |
Results and Discussion | All |
Gantt Chart | Manav Agrawal |
Project report compilation | All |
All the members of the team contributed equally