- Objective
- Motivation
- The Data
- Exploring the Data
- Logistic Regression
- Using Fewer Features and Applying Other Models
- Other Experiments
- Next Steps
Every year in November, writers all around the world participate in National Novel Writing Month (NaNoWriMo) and try to write 50,000 words of a novel within 30 days. They track their word count progress on the NaNoWriMo website where they may also donate to the writing cause, join 'Regions' for writing camaraderie, and display the summary of their novel in progress. Those writers who write 50,000 words before the end of November are declared 'Winners'.
My goal is to create a machine learning model that can predict whether a participating writer will be a NaNoWriMo 'winner' using data from the site.
I love writing and I enjoy participating in NaNoWriMo. This idea stems from another personal project: creating my own Word Count Tracker that would track how much I write over time, similar to the cumulative word count graph displayed on each writer's novel profile every NaNoWriMo.
I wanted to take it a step further and also visualize the aggregate word count progress of a region and the whole site.
Visualizing writing progress can motivate one to write more and reach one's writing goals! I hope this predictive model may help other writers and future NaNoWriMo participants improve their writing strategies, keep writing, and finish their novels.
The data I will use to construct this model is user data and novel data from the website. This includes usernames, novel titles, word count, and 'Winner' labels.
Some NaNoWriMo vocabulary to understand:
Writer - A NaNoWriMo.org user that is participating in a current NaNoWriMo contest.
Win - When a writer reaches the 50,000 word count goal for their novel and validates this word count with the NaNoWriMo website.
Word Count/Word Count Submission - The number of words recorded as written, for a novel or for a single submission to that novel.
Submission - The act of updating the word count for a novel. During a contest, if there is no update for a novel on a given day, the word count submission for that novel is recorded as 0 and the total word count for the novel remains the same. NOTE: A writer may update the word count for their novel multiple times a day. The site will not record the updates until the end of the day. The aggregate of these updates is the submission.
Contest - A NaNoWriMo event. That is, when the NaNoWriMo site opens and writers may create a novel profile and begin writing and adding submissions.
Donation/Donor - If a user makes a monetary donation to the NaNoWriMo organization and their mission, they are marked as a 'donor' on the site. NaNoWriMo does not disclose the amount the user donated, just that they are a donor. NOTE: A user may donate without being a writer. But for the purposes of this project, those users don't exist in this data set :)
Municipal Liaison - Taken from the NaNoWriMo website: "Municipal Liaisons (MLs) are volunteers who add a vibrant, real-world aspect to NaNoWriMo festivities all over the world." These writers are particularly involved NaNoWriMo users :D
Sponsorship - Writers may have their novels sponsored, with the sponsor money going to further the NaNoWriMo mission.
Novel - A writer's 'entry' in the NaNoWriMo contest - the thing they commit to writing during the contest. NOTE: 'Novels' may not actually be novels. Writers may choose to write memoirs, non-fiction, movie scripts, etc.
I created a script utilizing the site's Word Count API to get word count submission history.
The trouble is, the NaNoWriMo API, as far as I know, only returns data from the most recent contest, in this case, November 2015. This was not enough to build a very interesting model.
Other data I wanted to incorporate in the model include a user's past daily word count averages, number of novels started, novel synopses, and whether or not they've donated to the NaNoWriMo cause.
Luckily, all the data I wanted was available on the NaNoWriMo website, but I wasn't about to click through 500+ user profiles manually entering information into a spreadsheet to get all of it.
I used Kimono Labs to scrape most of the qualitative user data, including usernames, whether they're a donor or a volunteer Municipal Liaison for the site, whether their novels are sponsored, and the names of all their past novels. I was also able to get some quantitative data, such as how long they've been a NaNoWriMo member, their lifetime word count, and the years they've participated.
Below is a snapshot of Kimono Labs point-and-click interface to capture the data from a NaNoWriMo profile page.
However, I wasn't able to get the word count data from past NaNoWriMos using Kimono Labs. That data is presented on each novel's stats page as a bar graph rendered by JavaScript. Kimono can't parse JavaScript.
I researched a few different ways to parse JavaScript using Python, but then I realized I only needed a single line of the JavaScript code that stored the data points for the graph. I read the HTML document for each novel profile page as a regular text document and grabbed the line I needed.
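To illustrate the approach, here's a minimal sketch: read the page's HTML as plain text, find the single line containing the graph's data variable, and pull the numbers out of it. The `var data` marker and the sample HTML below are hypothetical; the real stats page names its variable differently and the page would need to be fetched first.

```python
import re

def extract_submissions(html_text, marker="var data"):
    """Scan raw HTML for the one JavaScript line holding the word count
    graph's data points and pull out the numbers. The 'var data' marker
    is a hypothetical example of what that line might contain."""
    for line in html_text.splitlines():
        if marker in line:
            return [int(n) for n in re.findall(r"\d+", line)]
    return None

# A made-up page fragment standing in for a novel's stats page.
page = "<html><script>\nvar data = [1667, 0, 3200];\n</script></html>"
daily_counts = extract_submissions(page)  # [1667, 0, 3200]
```

This avoids any JavaScript execution entirely: the data points are already sitting in the page source as literals, so plain text scanning is enough.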
I also wanted to extract novel synopses and excerpts, but I ran into some difficulties using Kimono to grab the large amount of text from each novel profile page. I decided it was time to switch tools.
With Beautiful Soup it was really easy to navigate the HTML structure of the novel profile page, and to find the tags and attributes for the text data I needed.
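A sketch of what that looks like with Beautiful Soup, using a made-up snippet of HTML; the `synopsis` class name here is illustrative, since the actual tags and attributes on a novel profile page were found by inspecting the real pages.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment of a novel profile page.
html = """
<div class="panel">
  <h3>Synopsis</h3>
  <div class="synopsis"><p>Hitoshi is appointed the youngest Judge...</p></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# Navigate straight to the tag holding the text we want.
synopsis = soup.find("div", class_="synopsis").get_text(strip=True)
```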
With all the data I needed, the next step was to process and aggregate all the information for analysis.
The following are descriptions of the iPython scripts used to scrape data.
- GetCurrentContestStats - Utilizes the NaNoWriMo API to get data from the most recent contest
- ScrapeNovelSynopses - Uses Beautiful Soup to scrape each novel's synopsis
- ScrapeNovelSynopses - Uses Beautiful Soup to scrape each novel's excerpt
- ScrapeWCSubmissions - Parses the HTML file for a JavaScript variable that contains information about daily word count submissions for each novel
| Data | Description | Source |
|---|---|---|
| User Names | A list of writer's usernames | Hand collected |
| Novel Pages | A list of novels by the selected writers | Kimono Labs API |
| Novel WC Info | Word count stats for each of the novels | The ScrapeWCSubmissions script |
| Novel Names, Urls, Dates | The novels with their respective NaNoWriMo page urls and the date they were entered into a NaNoWriMo contest | Kimono Labs API |
| Novel Meta Data | Contains more information about the novels | Kimono Labs API |
| Basic User Profile Information | A writer's username, their lifetime word count, how long they have been a NaNoWriMo member | Kimono Labs API |
| Fact Sheets | Various information a writer could share about their age, occupation, location, hobbies, sponsorship, or role as a Municipal Liaison for NaNoWriMo | Kimono Labs API |
| Participation Information | The past years a writer has participated in NaNoWriMo and whether they were winners or donors in that year | Kimono Labs API |
After scraping all the data, the task at hand was to aggregate the information.
I had the following information on each of the novels of each writer.
- Novel Meta Data - Contains the name of the novel, the writer, the genre, the final word count, daily average word count, and whether or not it was a winning novel
- Novel Word Count Info - Basic statistics calculated for each novel
I merged these files on the novel name and also appended each novel's synopsis and excerpt to create a novel_data.csv file.
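The merge step can be sketched with pandas roughly like this, on a toy version of the two tables; the column names mirror the data description below, but the rows are illustrative:

```python
import pandas as pd

# Miniature stand-ins for the two scraped tables.
meta = pd.DataFrame({
    "Novel Name": ["Finding Fortunato", "The Residency"],
    "Genre": ["Literary", "Literary"],
    "Winner": [1, 1],
})
wc_stats = pd.DataFrame({
    "Novel Name": ["Finding Fortunato", "The Residency"],
    "Final Word Count": [50603, 50425],
})

# Join the metadata and word count statistics on the novel name.
novel_data = meta.merge(wc_stats, on="Novel Name")
# Saving for the next stage would be:
#   novel_data.to_csv("novel_data.csv", index=False)
```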
There is also a great deal of information in the text data for each novel: the genre, synopsis, and excerpt. I hypothesize that if a writer is well-prepared for NaNoWriMo, they will have a clear genre chosen for their novel, and their novel profile will have a well-written synopsis and excerpt, signs that their novel idea is fleshed out and that they've done some planning before the contest starts.
From the text data, I extracted numeric features such as the number of words, unique words, paragraphs, and sentences in each synopsis and excerpt. I also calculated a reading score for the synopsis and excerpt, and classified the genre of each novel as standard (fits into the usual novel genres such as Fiction, Historical, Young Adult) or non-standard (the novel hasn't been given a genre yet, the genre is more obscure, or it's a combination of different genres).
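A rough sketch of that kind of text-feature extraction, assuming the reading score is a Flesch-style formula; the syllable counter here is a naive vowel-group heuristic, so the score is only approximate and the function as a whole is illustrative rather than the project's actual script:

```python
import re

def text_features(text):
    """Crude counts of the kind extracted from each synopsis/excerpt."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Naive syllable estimate: count vowel groups, at least 1 per word.
    syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower())))
                    for w in words)
    score = 0.0
    if words and sentences:
        # Flesch Reading Ease formula.
        score = (206.835 - 1.015 * len(words) / len(sentences)
                 - 84.6 * syllables / len(words))
    return {
        "num words": len(words),
        "num uniques": len({w.lower() for w in words}),
        "num sentences": len(sentences),
        "fk score": round(score, 2),
    }

feats = text_features("The cat sat. The cat ran.")
```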
I then appended this data to the other novel data in another novel_features file.
In addition to their novels, I had the following raw data about each writer:
- Basic User Profile Data - A writer's username, their lifetime word count, and how long they have been a NaNoWriMo member
- Fact Sheets - Various information a writer could share about their age, occupation, location, hobbies, sponsorship, or role as a Municipal Liaison for NaNoWriMo
- Participation Data - The past years a writer has participated in NaNoWriMo and whether they were winners or donors in that year
After a bit of cleaning, I merged the data in these files by writers' usernames.
Now, I needed to somehow aggregate the major novel data for each writer and merge it with the other writer data.
There were two different ways I aggregated the data. In one way I took typical averages of the novel word count statistics. In the other, I excluded novels created in the most current NaNoWriMo contest (November 2015). I wanted to use these novels as the target of my predictions. That is, I wanted to use the writers' past novels up to November 2014 to predict whether the novels of November 2015 would be 'winning novels' for the writer. Thus, there are two similarly named 'user_summary' files.
For the user_summary file, certain statistics (eg. Expected Final Word Count, Expected Daily Average) take into account data from NaNoWriMo November 2015. The other file with '_no2015' appended to the file name has the November 2015 information excluded from those statistics.
The following are descriptions of the iPython scripts used to clean and process the raw data.
- FactSheetParser - Parses the raw Fact Sheets data
- ParseMemberLength - Cleans member length data in the raw Basic User Profile Data
- AppendParticipationData/AppendParticipationData_negate2015 - Two similar scripts that parse the raw Participation Data and append the results to other writer data (Basic Info, Fact Sheets)
- AggregateNovelStatsData/AggregateNovelStatsData_negate2015 - Two similar scripts that aggregate novel word count statistics and append the results to other writer data (Basic Info, Fact Sheets)
- AggregateFinalandDailyAvgs/AggregateFinalandDailyAvgs_negate2015 - Two similar scripts that aggregate the final word count and daily averages of novels and append the results to other writer data (Basic Info, Fact Sheets)
- CalculateTextFeaturesandReadingScore - Classifies a novel's genre as standard or nonstandard, and extracts the number of words, unique words, sentences, paragraphs, and the reading score of each novel synopsis
- CalculateReadingScoreExcerpts - Extracts the number of words, unique words, sentences, paragraphs, and the reading score of each novel excerpt
Contains basic profile information about each writer and their past NaNoWriMo statistics.
There are 501 rows and 41 columns.
The data may be found here.
Writer Name - The writer's NaNoWriMo username
Member Length - The number of years a writer has been a NaNoWriMo user
LifetimeWordCount - The total number of words a writer has written over all NaNoWriMo contests
url - The url to the writer's profile on NaNoWriMo.org
Age - The age of the writer
Birthday - The birthday of the writer
Favorite books or authors - The writer's recorded favorite books or authors
Favorite noveling music - The writer's favorite music to listen to while writing
Hobbies - The writer's recorded hobbies
Location - The location from where the writer is writing
Occupation - The writer's recorded occupation
Primary Role - If the writer is a "Municipal Liaison" for NaNoWriMo, it is recorded here
Sponsorship URL - If the writer's novel is sponsored, a sponsorship url is recorded here
Expected Final Word Count - The average of the final word count for all a writer's novels
Expected Daily Average - The average of the daily average word count for all a writer's novels
CURRENT WINNER - Indicates whether the writer is a winner of the "current" or "next" NaNoWriMo (November 2015)
Current Donor - Indicates whether the writer is a donor of the "current" or "next" NaNoWriMo (November 2015)
Wins - The number of past wins for a writer. Wins cannot be greater than Participated.
Donations - The number of past donations for a writer. Donations cannot be greater than Participated.
Participated - The number of past NaNoWriMo contests in which the writer was a participant
Consecutive Donor - The maximum number of consecutive contests for which the writer has donated
Consecutive Wins - The maximum number of consecutive contests for which the writer has won
Consecutive Part - The maximum number of consecutive contests for which the writer has participated
Part Years - A list of years for which the writer has participated in NaNoWriMo
Win Years - A list of years for which the writer has won
Donor Years - A list of years for which the writer has donated
Num Novels - The number of novels which a writer has entered into NaNoWriMo
Expected Num Submissions - The average, over all a writer's novels, of the number of word count submissions entered for a novel
Expected Avg Submission - The average, over all a writer's novels, of the average number of words entered in all word count submissions for a novel
Expected Min Submission - The average, over all a writer's novels, of the minimum number of words entered in all word count submissions for a novel
Expected Min Day - The average day (from 1-30), over all contests a writer participated, on which the writer entered the minimum number of words
Expected Max Submission - The average, over all a writer's novels, of the maximum number of words entered in all word count submissions for a novel
Expected Max Day - The average day (from 1-30), over all contests a writer participated, on which the writer entered the maximum number of words
Expected Std Submissions - The average, over all a writer's novels, of the standard deviation of the number of words entered for all word count submissions for a novel
Expected Consec Subs - The average, over all a writer's novels, of the number of consecutive submissions (at least 2 submissions in a row) entered for a novel
FW Total - For the current NaNoWriMo, the total word count of a novel in the first week of the contest
FW Sub - For the current NaNoWriMo, the number of word count submissions to a novel in the first week of the contest
FH Total - For the current NaNoWriMo, the total word count of a novel written in the first half of the contest
FH Sub - For the current NaNoWriMo, the number of word count submissions to a novel in the first half of the contest
SH Total - For the current NaNoWriMo, the total word count of a novel written in the second half of the contest
SH Sub - For the current NaNoWriMo, the number of word count submissions to a novel in the second half of the contest
Contains basic profile information about each novel and its word count statistics.
There are 2122 rows and 9 columns.
The data may be found here.
Writer Name - The writer of the novel
Novel Name - The title of the novel
Genre - The genre of the novel
Final Word Count - The final recorded word count for the novel
Daily Average - The average recorded word count of the novel over the 30 day period of its contest
Winner - Indicates whether the novel is a winning novel (reached 50,000 words) during its contest
Synopses - The novel's synopsis
url - The url of the novel's stats page
Novel Date - The date of the contest for which the novel was written
Excerpt - The novel's excerpt
Contains numeric data representing each novel's genre, synopsis, and excerpt.
There are 2122 rows and 23 columns.
The data may be found here.
Note: There are some columns that are duplicates from the novel_data file, so they will not be redefined here.
has genre - 0 if the novel has no given genre, 1 otherwise.
standard genre - 1 if the novel's given genre is one of the following "usual" genres: __ . 0 otherwise.
has_synopses - 0 if the novel has no synopsis, 1 otherwise.
num words - The number of words in a novel's synopsis.
num uniques - The number of unique words in a novel's synopsis.
num sentences - The number of sentences in a novel's synopsis.
paragraphs - The number of paragraphs in a novel's synopsis.
fk score - The Flesch-Kincaid score of the novel's synopsis.
has excerpt - 0 if the novel has no excerpt, 1 otherwise.
num words excerpt - The number of words in a novel's excerpt.
num uniques excerpt - The number of unique words in a novel's excerpt.
num sentences excerpt - The number of sentences in a novel's excerpt.
paragraphs excerpt - The number of paragraphs in a novel's excerpt.
fk score excerpt - The Flesch-Kincaid score of the novel's excerpt.
After I had constructed the data set, I proceeded with exploring the data with Python and matplotlib visualizations.
| Writer Name | Member Length | LifetimeWordCount | url | Age | Birthday | Favorite books or authors | Favorite noveling music | Hobbies | Location | ... | Expected Max Submission | Expected Max Day | Expected Std Submissions | Expected Consec Subs | FW Total | FW Sub | FH Total | FH Sub | SH Total | SH Sub | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Nicaless | 2 | 50919 | http://nanowrimo.org/participants/nicaless | 24 | December 20 | Ursula Le Guin, J.K. | Classical, Musicals | Reading, Video Games, Blogging, Learning | San Francisco, CA | ... | 24935.0 | 28.000000 | 6235.712933 | 12.000000 | 6689 | 6 | 12486 | 9 | 11743 | 3 |
| 1 | Rachel B. Moore | 10 | 478090 | http://nanowrimo.org/participants/rachel-b-moore | NaN | NaN | 2666, Unaccustomed Earth, Exit Music, Crazy Lo... | Belle and Sebastian, Elliott Smith, PJ Harvey,... | Reading, volunteering, knitting, listening to ... | San Francisco | ... | 3809.0 | 9.000000 | 1002.295167 | 6.800000 | 16722 | 7 | 24086 | 14 | 26517 | 14 |
| 2 | abookishbabe | 1 | 0 | http://nanowrimo.org/participants/abookishbabe | NaN | April 2 | Colleen Hoover, Veronica Roth, Jennifer Niven,... | Tori Kelley | Reading (DUH), Day dreaming, Going to Disneyla... | Sacramento, CA | ... | NaN | NaN | NaN | NaN | 28632 | 1 | 29299 | 2 | 0 | 0 |
| 3 | alexabexis | 11 | 475500 | http://nanowrimo.org/participants/alexabexis | NaN | NaN | NaN | Three Goddesses playlist Florence + the Machin... | drawing, reading, movies & TV shows, comics, p... | New York City | ... | 2325.0 | 8.545455 | 570.626795 | 8.090909 | 25360 | 7 | 38034 | 12 | 40766 | 9 |
| 4 | AllYellowFlowers | 3 | 30428 | http://nanowrimo.org/participants/AllYellowFlo... | NaN | NaN | Lolita, Jesus' Son, Ask the | the sound of the coffeemaker | cryptozoology | Allston | ... | 2054.5 | 4.500000 | 538.273315 | 21.000000 | 1800 | 5 | 5300 | 10 | 5700 | 9 |
Wins and Losses for NaNoWriMo 2015
There are 219 winners and 282 nonwinners out of 501 writers, roughly a 3:4 ratio of winners to nonwinners. At first glance, winning is almost a coin toss: about 44% of writers won, so one has close to a 50/50 chance of guessing correctly whether or not a writer is a NaNoWriMo winner.
Lifetime Word Count vs Member Length
Few writers have written more than 1,000,000 words (or 20 winning NaNoWriMo novels) over the course of their NaNoWriMo lifetime. The density of nonwinners for NaNoWriMo 2015 decreases as Member Length increases, and a higher Lifetime Word Count indicates a higher likelihood of winning. It makes sense that the longer one has been writing (Member Length) and the more words one has written (Lifetime Word Count), the more likely one is to reach the NaNoWriMo writing goal.
Expected Avg Submission vs Expected Daily Average
It almost looks like there are clusters. If a writer's Expected Daily Average >= Expected Avg Submission, the writer is more likely to win. It's worth noting that the minimum daily average needed to win a NaNoWriMo contest is about 1,666 words (50,000 words / 30 days).
Number of Wins vs Number of times participated
It looks like there may be possible clusters here as well. Writers who have already had more than 5 wins are very likely to win again. Also, writers who have participated more than 5-10 times have better chances of winning as well.
Expected Daily Average vs Expected Num Submissions
Many writers seem to cluster around an Expected Daily Average of 1500-2000. 1,666 is the minimum daily average to win a NaNoWriMo contest. The higher an Expected Daily Average, the more likely a writer is to win the upcoming contest.
Also interesting is how the density of nonwinners decreases as Expected Num Submissions increases, so higher Expected Num Submissions may also be indicative of winning.
Distribution Word Count Submissions in early weeks of a contest
I wanted to look retrospectively at the latest NaNoWriMo contest and see how winners can be predicted as early as the first week or two of a contest.
As expected, writers who submit more often in the early weeks are more likely to win.
Average First Week Submissions vs Expected Daily Average
Additionally, writers whose daily average in the first week is equal to or greater than the Expected Daily Average of their past novels are more likely to win.
Does being a Municipal Liaison or having a novel sponsored have an effect on winning?
Municipal Liaisons, which I've flagged with a binary variable (1 if the writer is an ML, 0 otherwise), are a small fraction of the total NaNoWriMo writer population, but the majority of these MLs turn out to be winners at the end of the month.
Likewise, very few writers have sponsors for their novel.
The ratio of winners to nonwinners for those with sponsors is 2:1. The ratio of winners to nonwinners for those who are Municipal Liaisons is almost 6:1. It definitely seems like one is more likely to win if they are a Municipal Liaison or if their novel is sponsored!
| Writer Name | Novel Name | Genre | Final Word Count | Daily Average | Winner | Synopses | url | Novel Date | Excerpt | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Nicaless | Novel: Lauren's Birthday | Genre: Young Adult | 24229 | 807 | 0 | \n<p></p>\n | http://nanowrimo.org/participants/nicaless/nov... | November 2015 | \n<p></p>\n |
| 1 | Nicaless | Novel: A Mystery in the Kingdom of Aermon | Genre: Fantasy | 50919 | 1,697 | 1 | \n<p>Hitoshi is appointed the youngest Judge a... | http://nanowrimo.org/participants/nicaless/nov... | November 2014 | \n<p>This story, funnily enough, started out a... |
| 2 | Rachel B. Moore | Novel: Finding Fortunato | Genre: Literary | 50603 | 1,686 | 1 | \n<p>Sam and Anna Gold and their newly adoptiv... | http://nanowrimo.org/participants/rachel-b-moo... | November 2015 | \n<p></p>\n |
| 3 | Rachel B. Moore | Novel: The Residency | Genre: Literary | 50425 | 1,680 | 1 | \n<p>It's every writer's dream - an all-expens... | http://nanowrimo.org/participants/rachel-b-moo... | November 2014 | \n<p></p>\n |
| 4 | Rachel B. Moore | Novel: The Jew From Fortunato | Genre: Literary Fiction | 41447 | 1,381 | 0 | \n<p>20-something Andre Levinsky is a fish out... | http://nanowrimo.org/participants/rachel-b-moo... | November 2013 | \n<p></p>\n |
Overall Wins and Losses
The total number of novels in this sample is 2123: 1333 winning and 790 nonwinning, a 63/37 split. It's interesting that there are more winning novels than nonwinning novels, even though for the most recent NaNoWriMo there are more nonwinning writers than winning writers. But this makes sense: writers who write more novels are more likely to have their novels reach the 50,000 word goal.
Text Features
| Winner | Novel Date | has genre | standard genre | has_synopses | num words | num uniques | num sentences | paragraphs | fk score | has excerpt | num words excerpt | num uniques excerpt | num sentences excerpt | paragraphs excerpt | fk score excerpt | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | November 2015 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0.00 | 0 | 0 | 0 | 0 | 0 | 0.00 |
| 1 | 1 | November 2014 | 1 | 1 | 1 | 44 | 42 | 3 | 1 | 65.73 | 1 | 132 | 96 | 13 | 7 | 78.25 |
| 2 | 1 | November 2015 | 1 | 1 | 1 | 153 | 109 | 7 | 4 | 58.62 | 0 | 0 | 0 | 0 | 0 | 0.00 |
| 3 | 1 | November 2014 | 1 | 1 | 1 | 59 | 51 | 4 | 3 | 65.73 | 0 | 0 | 0 | 0 | 0 | 0.00 |
| 4 | 0 | November 2013 | 1 | 0 | 1 | 124 | 93 | 4 | 1 | 56.93 | 0 | 0 | 0 | 0 | 0 | 0.00 |
The average length of a synopsis is 50 words, or about a few good sentences. This is likely skewed by the fact that 729 novels, more than a third, don't have a synopsis at all. There are few novels with synopses longer than 100 words, but as synopses get longer, it seems more likely that they belong to a winning novel.
```python
from scipy.stats import ttest_ind

ttest_ind(winlose['fk score'].get_group(0), winlose['fk score'].get_group(1))
# Ttest_indResult(statistic=-1.4376558464994371, pvalue=0.1506792394358735)
```
The Flesch-Kincaid reading scores look approximately normally distributed for this sample of novel synopses, for both winners and nonwinners. In a t-test comparing the two groups, the resulting p-value is greater than 10%, so I cannot reject the null hypothesis that winning and non-winning novels have equal mean Flesch-Kincaid scores. The Flesch-Kincaid score of a novel's synopsis is unlikely to be indicative of a winning novel.
Trying to plot the reading score of synopses against the length of synopses produces this jumbled mess. It may be hard to predict winning novels with these features...
As the variable I want to predict is binary (1 if a writer is a winner, 0 if otherwise) I decided to use a logistic regression as my prediction model.
After extracting only the numerical columns from the writer data and replacing any NaN entries (entries belonging to new writers who don't have data from past NaNoWriMos) with 0, I applied a Standard Scaler to normalize the data. I then performed an 80/20 split: 400 observations for training and 101 for testing.
Cross-Validation Score
I created 10 folds of the training data to train, test, and cross-validate a Logistic Regression model. The average cross-validation score was .86, a promising indication that the model predicts a writer's winning or non-winning outcome well.
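The preprocessing and cross-validation steps above can be sketched with scikit-learn as follows, using a synthetic stand-in for the 501-writer feature matrix (in the real data, NaNs for new writers are filled with 0 before scaling):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the writer feature matrix (501 rows).
X, y = make_classification(n_samples=501, n_features=30, random_state=0)

# Normalize, then do the 80/20 split: 400 training, 101 test rows.
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# 10-fold cross-validation of a Logistic Regression on the training set.
model = LogisticRegression()
scores = cross_val_score(model, X_train, y_train, cv=10)
mean_cv_score = scores.mean()
```

On the real feature matrix this recipe produced the .86 average score quoted above; on the synthetic data the number will differ.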
Confusion Matrix and Classification Report
After cross-validating on just the training data, I re-trained the model on the entire training data set and then used the model to predict the outcomes for the writers in the test data set. Comparing the model's predictions with the actual outcomes, I obtained the following confusion matrix and classification report.
| Actual 0 | Actual 1 | |
|---|---|---|
| Predicted 0 | 51 | 4 |
| Predicted 1 | 0 | 46 |
| Precision | Recall | F1-Score | Support | |
|---|---|---|---|---|
| 0 | 1.00 | 0.93 | 0.96 | 55 |
| 1 | 0.92 | 1.00 | 0.96 | 46 |
| avg/total | 0.96 | 0.96 | 0.96 | 101 |
| Actual Class 0 | Actual Class 1 | |
|---|---|---|
| Predicted Class 0 | 46 | 9 |
| Predicted Class 1 | 8 | 38 |

| Precision | Recall | F1-Score | Support | |
|---|---|---|---|---|
| 0 | 0.85 | 0.84 | 0.84 | 55 |
| 1 | 0.81 | 0.83 | 0.82 | 46 |
| avg/total | 0.83 | 0.83 | 0.83 | 101 |
Only 4 winners were misclassified as non-winners. The Logistic Regression correctly identified the winners and nonwinners in the test data with about 83% accuracy, as illustrated by its precision, recall, and F1-scores.
ROC Curve
In plotting the the ROC curve for the model, I found the area under the curve was about .9, pretty close to an ideal area of 1.
It seems like it's a pretty good model!
There are a lot of features in this data set, so I used Principal Components Analysis to decompose the data and easily visualize where the winners and non-winners fall on a two-dimensional plane.
Above are the first and second principal components of the training data set, colored by the winners and nonwinners.
Above is how the Logistic Regression splits the decomposed test data. Comparing it with the actual results of the test data below, the Logistic Regression did very well generalizing the data and sorting out the winners and nonwinners of NaNoWriMo.
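The two-dimensional projection behind these plots can be reproduced along these lines, with synthetic data standing in for the writer features:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled writer feature matrix.
X, y = make_classification(n_samples=501, n_features=30, random_state=0)
X2 = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))

# X2[:, 0] and X2[:, 1] are the first two principal components;
# scatter-plotting them colored by y gives figures like the ones above:
#   plt.scatter(X2[:, 0], X2[:, 1], c=y)
```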
Pleased with the results of the Logistic Regression model, I then similarly trained a Decision Tree on the features to compare the two methods.
It also performed very well in predicting winners and nonwinners, achieving similar scores for cross-validation, precision, and recall.
| Actual 0 | Actual 1 | |
|---|---|---|
| Predicted 0 | 46 | 9 |
| Predicted 1 | 7 | 39 |
| Precision | Recall | F1-Score | Support | |
|---|---|---|---|---|
| 0 | .87 | 0.84 | 0.85 | 55 |
| 1 | 0.81 | 0.85 | 0.83 | 46 |
| avg/total | 0.84 | 0.85 | 0.84 | 101 |
The Decision Tree found the following features to be the most important.
| 0 | 1 | |
|---|---|---|
| 23 | FH Total | 0.712847 |
| 1 | LifetimeWordCount | 0.057194 |
| 24 | FH Sub | 0.045414 |
| 14 | Expected Avg Submission | 0.039117 |
| 11 | Consecutive Part | 0.036241 |
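A ranking like the table above comes from a fitted tree's `feature_importances_` attribute; a minimal sketch on synthetic data:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training features.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Rank features by importance and keep the top five, as in the table.
importances = (pd.Series(tree.feature_importances_)
                 .sort_values(ascending=False)
                 .head(5))
```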
FH Total - the total word count of a writer's novel submitted in the first half of the contest - is the most predictive feature of winning by a long shot, but this is a metric collected after the current contest has started. For next steps, I want to build a model with just the information I have from past contests.
I excluded the features relevant to the current contest - the number of words and submissions accounted in the first week, first two weeks, or second two weeks. I then re-applied the Logistic Regression model.
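Dropping the current-contest columns is straightforward in pandas; here is a toy frame with the relevant column names from the data description (the rows are illustrative):

```python
import pandas as pd

# Toy frame with the current-contest columns named in the data description.
df = pd.DataFrame({
    "LifetimeWordCount": [50919, 478090],
    "FW Total": [6689, 16722], "FW Sub": [6, 7],
    "FH Total": [12486, 24086], "FH Sub": [9, 14],
    "SH Total": [11743, 26517], "SH Sub": [3, 14],
})

# Exclude everything measured after the current contest began.
current = ["FW Total", "FW Sub", "FH Total", "FH Sub", "SH Total", "SH Sub"]
past_only = df.drop(columns=current)
```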
| Actual 0 | Actual 1 | |
|---|---|---|
| Predicted 0 | 48 | 7 |
| Predicted 1 | 22 | 24 |
| Precision | Recall | F1-Score | Support | |
|---|---|---|---|---|
| 0 | 0.69 | 0.87 | 0.77 | 55 |
| 1 | 0.77 | 0.52 | 0.62 | 46 |
| avg/total | 0.73 | 0.71 | 0.70 | 101 |
This model's accuracy is about 10 percentage points lower than that of the previous model, which included the current contest data.
I then compared the results against other models.
| Actual 0 | Actual 1 | |
|---|---|---|
| Predicted 0 | 48 | 7 |
| Predicted 1 | 26 | 20 |
| Precision | Recall | F1-Score | Support | |
|---|---|---|---|---|
| 0 | 0.65 | 0.87 | 0.74 | 55 |
| 1 | 0.74 | 0.43 | 0.55 | 46 |
| avg/total | 0.69 | 0.67 | 0.65 | 101 |
Naive Bayes is not as accurate as Logistic Regression in this case.
| Actual 0 | Actual 1 | |
|---|---|---|
| Predicted 0 | 49 | 6 |
| Predicted 1 | 20 | 26 |
| Precision | Recall | F1-Score | Support | |
|---|---|---|---|---|
| 0 | 0.71 | 0.89 | 0.79 | 55 |
| 1 | 0.81 | 0.57 | 0.67 | 46 |
| avg/total | 0.76 | 0.74 | 0.73 | 101 |
This Support Vector Machine does a little bit better than the Logistic Regression.
| Actual 0 | Actual 1 | |
|---|---|---|
| Predicted 0 | 42 | 13 |
| Predicted 1 | 22 | 24 |
| Precision | Recall | F1-Score | Support | |
|---|---|---|---|---|
| 0 | 0.66 | 0.76 | 0.71 | 55 |
| 1 | 0.85 | 0.52 | 0.58 | 46 |
| avg/total | 0.65 | 0.65 | 0.65 | 101 |
The Decision Tree did not do as well this time without data from the current contest.
| 0 | 1 | |
|---|---|---|
| 3 | Expected Final Word Count | 0.356964 |
| 13 | Expected Num Submissions | 0.126865 |
| 1 | LifetimeWordCount | 0.124095 |
| 0 | Member Length | 0.070148 |
| 2 | Age | 0.067985 |
This time, the most important feature is Expected Final Word Count, or a writer's average final word count over all his or her past NaNoWriMos.
I also trained a Random Forest on the data, which yielded the following results.
| Actual 0 | Actual 1 | |
|---|---|---|
| Predicted 0 | 48 | 7 |
| Predicted 1 | 18 | 28 |
| Precision | Recall | F1-Score | Support | |
|---|---|---|---|---|
| 0 | 0.73 | 0.87 | 0.79 | 55 |
| 1 | 0.80 | 0.61 | 0.69 | 46 |
| avg/total | 0.76 | 0.75 | 0.75 | 101 |
Random Forests and Support Vector Machines do best in predicting winners and nonwinners when excluding data from the current contest.
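The model comparison can be reproduced with a loop like the following; synthetic data stands in for the scraped features here, so the scores won't match the tables above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the past-contests-only feature matrix.
X, y = make_classification(n_samples=501, n_features=25, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "logistic regression": LogisticRegression(),
    "naive bayes": GaussianNB(),
    "svm": SVC(),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(random_state=0),
}
# Fit each model on the training split and score it on the held-out split.
results = {name: accuracy_score(y_te, m.fit(X_tr, y_tr).predict(X_te))
           for name, m in models.items()}
```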
While it's possible to predict the outcome of a contest with a good degree of accuracy using past data alone, including data from the first few weeks after a contest starts improves accuracy greatly.
Interestingly, many non-winners were predicted to win by this second model. Their past NaNoWriMo data suggested they would win again in the coming NaNoWriMo, but they fell short in the first few weeks of the contest, which affected their final outcome.
I wanted to attempt to predict which novels will be winning novels based on what little I know about them: their genre, synopsis, and excerpt.
| | Winner | Novel Date | has genre | standard genre | has_synopses | num words | num uniques | num sentences | paragraphs | fk score | has excerpt | num words excerpt | num uniques excerpt | num sentences excerpt | paragraphs excerpt | fk score excerpt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | November 2015 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0.00 | 0 | 0 | 0 | 0 | 0 | 0.00 |
| 1 | 1 | November 2014 | 1 | 1 | 1 | 44 | 42 | 3 | 1 | 65.73 | 1 | 132 | 96 | 13 | 7 | 78.25 |
| 2 | 1 | November 2015 | 1 | 1 | 1 | 153 | 109 | 7 | 4 | 58.62 | 0 | 0 | 0 | 0 | 0 | 0.00 |
| 3 | 1 | November 2014 | 1 | 1 | 1 | 59 | 51 | 4 | 3 | 65.73 | 0 | 0 | 0 | 0 | 0 | 0.00 |
| 4 | 0 | November 2013 | 1 | 0 | 1 | 124 | 93 | 4 | 1 | 56.93 | 0 | 0 | 0 | 0 | 0 | 0.00 |
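The counting columns in that table (num words, num uniques, num sentences) can be derived from the raw synopsis text with simple tokenization; the fk score presumably comes from a readability formula such as Flesch-Kincaid, which I omit here. A sketch, assuming only that missing synopses arrive as empty strings:

```python
import re

def synopsis_features(text):
    """Derive counting features like those in the table above.
    (The fk score there likely comes from a readability library;
    this sketch computes only the word/sentence counts.)"""
    if not text:
        return {"has_synopsis": 0, "num_words": 0,
                "num_uniques": 0, "num_sentences": 0}
    words = re.findall(r"[A-Za-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {"has_synopsis": 1,
            "num_words": len(words),
            "num_uniques": len(set(words)),
            "num_sentences": len(sentences)}

print(synopsis_features("A lost heir. A dying kingdom. One impossible quest."))
# → {'has_synopsis': 1, 'num_words': 9, 'num_uniques': 8, 'num_sentences': 3}
```

The same function applied to the excerpt column produces the `* excerpt` features.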
However, the results for my Logistic Regression were lackluster.
Cross-Validation Score
| | Actual 0 | Actual 1 |
|---|---|---|
| Predicted 0 | 1 | 158 |
| Predicted 1 | 0 | 266 |
| | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| 0 | 1.00 | 0.01 | 0.01 | 159 |
| 1 | 0.63 | 1.00 | 0.77 | 266 |
| avg/total | 0.77 | 0.63 | 0.49 | 425 |
The Logistic Regression did not do much better than guessing, and other models yielded similar results.
Maybe it just doesn't make sense to predict whether a novel wins based only on its synopsis or excerpt. Don't judge a book by its cover, I guess.
I've tried classifying writers by whether or not they've "won" the next NaNoWriMo contest, but that sort of dampens the spirit of NaNoWriMo. It's not just about winning, after all. I wanted to see what other groupings emerge by clustering writers with K-Means.
It looks like a k of 5 produces the best silhouette score, so the data is best fitted into 5 clusters.
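A silhouette sweep like this can be sketched with scikit-learn; here I use synthetic blob data (three obvious clusters) rather than the real writer features, so the best k comes out as 3 instead of 5:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the writer features: three well-separated blobs.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in (0, 4, 8)])

# Fit K-Means for a range of k and score each clustering.
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)  # → 3 for this toy data
```

The silhouette score rewards tight, well-separated clusters, so the k that maximizes it is a reasonable choice when there is no ground-truth label.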
While I could not create a very accurate model for predicting whether or not a novel will win based on its synopsis or excerpt, I still wanted to do something interesting with all the novel data I had. So I started to build a simple recommendation system that, given a writer's NaNoWriMo username, suggests new genres for the writer to try based on what they've written in the past.
Here's a subset of the list of writers and all the genres they've written in past NaNoWriMos.
| | Writer Name | Genres |
|---|---|---|
| 0 | Nicaless | Fantasy, Young Adult |
| 1 | Rachel B. Moore | Literary, Literary Fiction |
| 2 | abookishbabe | Young Adult |
| 3 | alexabexis | Romance, Horror/Supernatural, Horror & Superna... |
| 4 | AllYellowFlowers | Literary, Literary Fiction |
Below is a function that calculates the Jaccard similarity between two comma-separated lists of genres.
def jaccard(a, b):
    # Missing genre lists come through as NaN, which is a float.
    if isinstance(a, float) or isinstance(b, float):
        return 0
    a = set(a.split(", "))
    b = set(b.split(", "))
    intersect = a.intersection(b)
    union = a.union(b)
    return float(len(intersect)) / len(union)
nicaless_genres = writer_genres['Genres'][writer_genres['Writer Name'] == "Nicaless"].values[0]
abookishbabe_genres = writer_genres['Genres'][writer_genres['Writer Name'] == "abookishbabe"].values[0]
jaccard(nicaless_genres, abookishbabe_genres)
0.5
The above score means that the Jaccard similarity between the two writers' genre lists is .5. In other words, half of all the genres written between the two writers are shared.
I then created a function called getSimilar that uses the jaccard function to calculate the similarity between a given writer's list of genres and all other writers' genres, and returns a set of suggested genres drawn from the top ten closest writers.
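The actual getSimilar isn't shown above, but a sketch of it might look like the following. This is a hypothetical reconstruction: the toy `writer_genres` frame here is a tiny stand-in for the real one, so its suggestions won't match the real outputs produced from the full dataset.

```python
import pandas as pd

def jaccard(a, b):
    # Missing genre lists come through as NaN (a float) in pandas.
    if isinstance(a, float) or isinstance(b, float):
        return 0
    a, b = set(a.split(", ")), set(b.split(", "))
    return len(a & b) / len(a | b)

# Toy stand-in for the writer_genres DataFrame shown earlier.
writer_genres = pd.DataFrame({
    "Writer Name": ["Nicaless", "abookishbabe", "alexabexis", "AllYellowFlowers"],
    "Genres": ["Fantasy, Young Adult", "Young Adult",
               "Romance, Horror/Supernatural", "Literary, Literary Fiction"],
})

def getSimilar(name, top_n=10):
    """Suggest genres drawn from the writers whose genre sets are
    most similar to `name`'s, excluding genres already written."""
    own = writer_genres.loc[writer_genres["Writer Name"] == name, "Genres"].values[0]
    others = writer_genres[writer_genres["Writer Name"] != name].copy()
    others["score"] = others["Genres"].apply(lambda g: jaccard(own, g))
    closest = others.nlargest(top_n, "score")
    suggested = set(", ".join(closest["Genres"]).split(", ")) - set(own.split(", "))
    print("I suggest you try writing for the following genres:")
    return suggested

print(getSimilar("Nicaless"))
```

The `'nan'` entries in the real outputs below come from writers whose genre field was missing, a cleanup item for a later iteration.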
getSimilar("Nicaless")
I suggest you try writing for the following genres:
{'Romance', 'Science Fiction', 'Young Adult & Youth', 'nan'}
getSimilar("Trillian Anderson")
I suggest you try writing for the following genres:
{'Fanfiction', 'Non-Fiction', 'Romance', 'Science Fiction', 'Steampunk', 'Thriller/Suspense', 'Young Adult', 'nan'}
getSimilar("AmberMeyer")
I suggest you try writing for the following genres:
{'Fantasy', 'Science Fiction', 'Young Adult'}
getSimilar("Brandon Sanderson")
I suggest you try writing for the following genres:
{'Romance', 'Science Fiction', 'Young Adult & Youth', 'nan'}
Cool! Looks like I have a lot in common with what Brandon Sanderson writes based on our recommendations!
Of course, this recommender only works for writers already in my list of writers and their known past-written genres, but I'm hoping it's a list that will continue to expand so that I can then evaluate the effectiveness of the recommender and make improvements.
I thoroughly enjoyed diving into the NaNoWriMo data and exploring this intersection between my two passions: data science and writing.
Some possible next steps for this project include:
- Collecting more data to see how well the models perform in predicting outcomes for new writers not currently in my data set
- Performing more feature engineering on the data points excluding the current contest data to see how I can boost model scores
- Figuring out what are the defining features in the 5 unique clusters of writers discovered from KMeans Clustering
- Predicting a final word count instead of just a binary win/lose outcome
- Building out the genre recommender
- Exploring what features might be better predictors of winning novels
- Are novels in certain genres more likely to win than others?