diff --git a/posts/2014-02-Fanfiction-Graphs-PageRank/index.html b/posts/2014-02-Fanfiction-Graphs-PageRank/index.html
index e1e1462..70d062a 100644
--- a/posts/2014-02-Fanfiction-Graphs-PageRank/index.html
+++ b/posts/2014-02-Fanfiction-Graphs-PageRank/index.html
@@ -105,13 +105,13 @@

Fanfiction.net, Graphs, and PageRank: Oh My!

This blog post will explore the structure of the relationships between stories on fanfiction.net by constructing visualizations like the above, and much much larger ones. It will also provide story recommendations for many of the top users of fanfiction.net.

Introduction

-

Fanfiction is a wide-spread phenomenon where fans of different works write derivative stories. This ranges from young children writing their first stories about their favorite fictional characters, to professional-quality stories written by aspiring novelists. Many such stories are posted to websites where they are read by a large audience and commented on. The largest such website is fanficiton.net.

+

Fanfiction is a widespread phenomenon where fans of different works write derivative stories. This ranges from young children writing their first stories about their favourite fictional characters, to professional-quality stories written by aspiring novelists. Many such stories are posted to websites where they are read by a large audience and commented on. The largest such website is fanfiction.net.

The sheer amount of fanfiction out there is rather staggering. The total number of stories on fanfiction.net exceeds six million. Harry Potter stories account for around 14% of these, followed by Naruto (around 7%) and Twilight (around 4%) (FFN Research). The majority of these stories have very little in the way of readership, but popular stories can have a large number of readers.

Some research was done into the demographics of fanfiction.net users and other topics by FFN Research. They found that 78% of fanfiction.net authors who joined in 2010 identified as female. Further, around 80% of users who report their age are between 13 and 17.

A lot of other interesting research and analysis has been done on the blogs Destination: Toast! and TOASTYSTATS.

In this post, we will examine the relationships between different Harry Potter stories on fanfiction.net. We will create visualizations, experiment with the application of Google’s PageRank algorithm, and finally construct a crude recommendation tool. We will also discuss a number of directions for future exploration.

Basic Methods

-

In addition to allowing users to post stories they write, fanfiction.net allows authors to “favorite” stories they like. Looking at which stories tend to be favorited by the same users gives us a way to understand connections between stories.

+

In addition to allowing users to post stories they write, fanfiction.net allows authors to “favourite” stories they like. Looking at which stories tend to be favourited by the same users gives us a way to understand connections between stories.

@@ -188,7 +188,7 @@

Large Graph visualizations
-Graph of Harry Potter Fanfiction, colored by language
+Graph of Harry Potter Fanfiction, coloured by language
@@ -198,7 +198,7 @@

Large Graph visualizations
-Graph of Harry Potter Fanfiction, colored by ship
+Graph of Harry Potter Fanfiction, coloured by ship
@@ -210,7 +210,7 @@

Large Graph visualizations
-Graph of Harry Potter Fanfiction, colored by slash
+Graph of Harry Potter Fanfiction, coloured by slash
@@ -237,7 +237,7 @@

Large Graph Visualizations
-Graph of top Naruto fanfiction, colored by language
+Graph of top Naruto fanfiction, coloured by language
@@ -247,7 +247,7 @@

Large Graph Visualizations
-Graph of top Naruto fanfiction, colored by ship
+Graph of top Naruto fanfiction, coloured by ship
@@ -264,11 +264,11 @@

Large Graph Visualizations
-

We can color it by language:

+

We can colour it by language:

-Graph of top Twilight fanfiction, colored by language
+Graph of top Twilight fanfiction, coloured by language
@@ -278,7 +278,7 @@

Large Graph Visualizations
-Graph of top Twilight fanfiction, colored by ship
+Graph of top Twilight fanfiction, coloured by ship
@@ -356,7 +356,7 @@

PageRank

More -

One neat thing we can do is give nodes on our graphs a size based on their PageRank. (We can also color nodes based on the first three components of the singular value decomposition of the adjacency matrix.)

+

One neat thing we can do is give nodes on our graphs a size based on their PageRank. (We can also colour nodes based on the first three components of the singular value decomposition of the adjacency matrix.)

@@ -368,8 +368,8 @@

Story Recommendation

This problem is called collaborative filtering, and is a well-established area. Unfortunately, it isn’t something I’m terribly knowledgeable about, so I took a relatively naive approach: sum over the preferences of all users, weighted by how similar their preferences are to the user you are trying to predict.

Specifically, we give each story, \(s\), a rank \(R_u(s)\), for a user \(u\). If the rank is high, we think \(u\) is likely to like \(s\).

\[R_u(s) = \sum_{v\in F_s \setminus \{u\}} \left(\frac{|S(u)\cap S(v)|}{20+|S(v)|}\right)^2\]

-

where \(F_s\) is the set of users who favorited \(s\) and \(S(u)\) is the stories favorited by the user \(u\).

-

For example, we can make recommendations for S’TarKan, the author of the most favorited Harry Potter story on fanfiction.net:

+

where \(F_s\) is the set of users who favourited \(s\) and \(S(u)\) is the set of stories favourited by the user \(u\).

+

For example, we can make recommendations for S’TarKan, the author of the most favourited Harry Potter story on fanfiction.net:

-

A * denotes that this is already one of the users favorite stories or one of their own stories. We can exclude their favorite stories, and their own stories:

+

A * denotes that this is already one of the user’s favourite stories or one of their own stories. We can exclude their favourite stories, and their own stories:

  • Make A Wish (0.949)
@@ -444,7 +444,7 @@

    Conclusion

    In light of all this, I’d like to reflect on a few things.

    Big Data: A year ago, I was very dismissive of “big data” as a buzzword. Primarily, it seems to be thrown around by business people who don’t really understand much. But one thing I’ve learned in explorations of data like this one and working in machine learning, is that there is something very powerful about larger amounts of data. There’s something very qualitatively different. The fanfiction data I used was actually quite small, only a few hundred users, because of how I limited the amount I downloaded, but I think it still demonstrates the sorts of things that become possible as you have larger amounts of data. (To be honest, a much more compelling example is the progress that’s been made in computer vision using ImageNet… But this still influenced my views.)

    Digital Humanities: Digital humanities also seems to be a bit of a buzzword. But I hope this provides a simple example of the power that can come from applying a little bit of math and computer science to humanities problems.

    -

    Metdata and Privacy: In this essay, we looked analyzed stories by looking at whether they were favorited by the same users. There’s a natural “dual” to this: analyzing users by looking at whether they favorited the same stories. This would give us a graph of connections between users and allow us to find clusters of users. But what if you use other forms of metdata? For example, we now know that the US government has metdata on who phones who. It seems very likely that many companies and governments have information on where your cellphone is as a function of time. All this can construct a graph of society. I can’t really fathom how much one must be able to learn about someone from that. (And how easy it would be to misinterpret.)

    +

    Metadata and Privacy: In this essay, we analyzed stories by looking at whether they were favourited by the same users. There’s a natural “dual” to this: analyzing users by looking at whether they favourited the same stories. This would give us a graph of connections between users and allow us to find clusters of users. But what if you use other forms of metadata? For example, we now know that the US government has metadata on who phones who. It seems very likely that many companies and governments have information on where your cellphone is as a function of time. All this can construct a graph of society. I can’t really fathom how much one must be able to learn about someone from that. (And how easy it would be to misinterpret.)

    Fanfiction Websites: I think there’s a lot of potential for fanfiction websites to better serve their users based on the techniques outlined here. I’d be really thrilled to see fanfiction.net or Archive Of Our Own adopt some of these ideas. Imagine being able to list a handful of stories in some category you’re interested in and discover others? Or get good recommendations? The ideas are all pretty straightforward once you think of them. I’d be very happy to talk to the groups behind different fanfiction websites and provide some help or share example code.

    Deep Learning and NLP: Recently, there’s been some really cool results in applying Deep Learning to Natural Language Processing. One would need a lot more data than I collected, and it would take more effort, but I bet one could do some really interesting things here.

    Resources: In principle, I’d really like to share my code and make it easy for people to replicate the work I described here. However, I think that would be really rude to fanfiction.net because it could result in lots of people scraping their website, and it seems likely many would remove my rate limiter. An alternative would be to share my extracted metadata, but, again, I think it would be really rude to do that without fanfiction.net’s permission, and possibly a violation of their terms of service. So, in the end, I’m not sharing any resources. That said, all of this can be done pretty easily.

diff --git a/posts/2014-07-FFN-Graphs-Vis/index.html b/posts/2014-07-FFN-Graphs-Vis/index.html
index 4b09e58..f0026bc 100644
--- a/posts/2014-07-FFN-Graphs-Vis/index.html
+++ b/posts/2014-07-FFN-Graphs-Vis/index.html
@@ -95,12 +95,12 @@

    Fanfiction, Graphs, and PageRank


-

On a website called fanfiction.net, users write millions of stories about their favorite stories. They have diverse opinions about them. They love some stories, and hate others. The opinions are noisy, and it’s hard to see the big picture.

+

On a website called fanfiction.net, users write millions of stories about their favourite stories. They have diverse opinions about them. They love some stories, and hate others. The opinions are noisy, and it’s hard to see the big picture.

With tools from mathematics and some helpful software, however, we can visualize the underlying structure.

-Graph of Harry Potter Fanfiction, colored by ship
+Graph of Harry Potter Fanfiction, coloured by ship
@@ -113,12 +113,12 @@

Fanfiction, Graphs, and PageRank

Story Recommendations: Harry Potter, Naruto, Twilight

And of course, you might skim below to see the pretty pictures!

Introduction

-

Fanfiction is a wide-spread phenomenon where fans of different works write derivative stories. This ranges from young children writing their first stories about their favorite fictional characters, to professional-quality stories written by aspiring novelists. Many such stories are posted to websites where they are read by a large audience and commented on. The largest such website is fanficiton.net.

+

Fanfiction is a widespread phenomenon where fans of different works write derivative stories. This ranges from young children writing their first stories about their favourite fictional characters, to professional-quality stories written by aspiring novelists. Many such stories are posted to websites where they are read by a large audience and commented on. The largest such website is fanfiction.net.

The sheer amount of fanfiction out there is rather staggering. The total number of stories on fanfiction.net exceeds six million. Harry Potter stories account for around 14% of these, followed by Naruto (around 7%) and Twilight (around 4%) (FFN Research). The majority of these stories have very little in the way of readership, but popular stories can have a large number of readers.

Some research was done into the demographics of fanfiction.net users and other topics by FFN Research. They found that 78% of fanfiction.net authors who joined in 2010 identified as female. Further, around 80% of users who report their age are between 13 and 17.

A lot of other interesting research and analysis has been done on the blogs Destination: Toast! and TOASTYSTATS.

Basic Methods

-

In addition to allowing users to post stories they write, fanfiction.net allows authors to “favorite” stories they like. Looking at which stories tend to be favorited by the same users gives us a way to understand connections between stories.

+

In addition to allowing users to post stories they write, fanfiction.net allows authors to “favourite” stories they like. Looking at which stories tend to be favourited by the same users gives us a way to understand connections between stories.

@@ -134,7 +134,7 @@

Basic Methods

In order to ensure compliance with these terms, the author intentionally built significant rate limiting into the scraper and took care to minimize the load put on fanfiction.net. While the issue of academic analysis was not mentioned, it was not excluded and fanfiction.net’s operators have not previously objected to similar academic work. Further, this work could be the preliminary research needed for someone to build a good fanfiction search engine.

Another section of the terms of service prohibits collecting personally identifiable information, which they define to include usernames. As such, I have deliberately discarded all such information and don’t use it. (Though, I note that several search engines do – try searching for an author’s name on any major search engine.) I do refer to some usernames in this post, but that was done entirely by hand.

-

In collecting data, since we are only looking at a subset of users, it is important to be wary of sampling bias. For example, if we sampled authors starting from the favorites of a particular author, or from those who had contributed stories to a community, we might get a very skewed perspective of the stories on fanfiction.net. The author considered a number of approaches, but concluded the fairest approach would be to use the authors of the most reviewed stories on fanfiction.net. This is a bias, but it should bias us towards the most interesting and important parts of the graph.

+

In collecting data, since we are only looking at a subset of users, it is important to be wary of sampling bias. For example, if we sampled authors starting from the favourites of a particular author, or from those who had contributed stories to a community, we might get a very skewed perspective of the stories on fanfiction.net. The author considered a number of approaches, but concluded the fairest approach would be to use the authors of the most reviewed stories on fanfiction.net. This is a bias, but it should bias us towards the most interesting and important parts of the graph.

Graph Construction

A graph, in the context of mathematics, is a collection of objects called vertices joined by connections called edges. For example, cities can be thought of as the vertices of a graph, connected by different highways and roads (the edges).

@@ -147,17 +147,17 @@

Graph Construction

A weighted graph is a graph where some edges are “stronger” than others. For example, some cities are connected by giant 6-lane highways, while others are connected by gravel roads. Larger weights represent stronger connections and smaller weights represent weaker ones. A weight of zero is the same thing as having no connection at all.

-

We will be interpreting fanfiction as a weighted graph, where edges represent a “connection” between stories. We will be using as our weights for edges the probability that someone will like both stories, given that they like one. That is, \(W_{a, b} = \frac{|F_a \cap F_b|}{|F_a \cup F_b|}\) where \(F_s\) is the users who favorited the story \(s\).

+

We will be interpreting fanfiction as a weighted graph, where edges represent a “connection” between stories. As our edge weights, we will use the probability that someone likes both stories, given that they like at least one of them. That is, \(W_{a, b} = \frac{|F_a \cap F_b|}{|F_a \cup F_b|}\) where \(F_s\) is the set of users who favourited the story \(s\).

There are lots of other possibilities, some resulting in directed graphs:

    -
  • (directed) The probability that someone who favorites \(a\) will favorite \(b\): \(W_{a\to b} = \frac{|F_a \cap F_b|}{|F_a|}\)
  • -
  • The probability that someone who favorites \(a\) favorites \(b\) times the probability that someone who favorites \(b\) favorites \(a\): \(W_{a,b} = \frac{|F_a \cap F_b|^2}{|F_a| * |F_b|}\)
  • -
  • The lesser of the probability that someone who favorites \(a\) favorites \(b\) and the probability that someone who favorites \(b\) favorites \(a\): \(W_{a,b} = \min\left(\frac{|F_a \cap F_b|}{|F_a|}, \frac{|F_a \cap F_b|}{|F_b|} \right)\)
  • +
  • (directed) The probability that someone who favourites \(a\) will favourite \(b\): \(W_{a\to b} = \frac{|F_a \cap F_b|}{|F_a|}\)
  • +
  • The probability that someone who favourites \(a\) favourites \(b\) times the probability that someone who favourites \(b\) favourites \(a\): \(W_{a,b} = \frac{|F_a \cap F_b|^2}{|F_a| * |F_b|}\)
  • +
  • The lesser of the probability that someone who favourites \(a\) favourites \(b\) and the probability that someone who favourites \(b\) favourites \(a\): \(W_{a,b} = \min\left(\frac{|F_a \cap F_b|}{|F_a|}, \frac{|F_a \cap F_b|}{|F_b|} \right)\)

Our experience was that it didn’t matter too much for the results, for large graphs.

(It’s worth noting that many of these could easily generalize to higher-dimensional edges for a weighted hyper-graph.)
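The weight definitions above are easy to try out on toy data. The following is only a minimal sketch, not the code used for this post: the function names and the sample sets of users are made up for illustration, and each story is represented simply by the set of users who favourited it.

```python
def jaccard_weight(fans_a, fans_b):
    """Undirected weight: |F_a ∩ F_b| / |F_a ∪ F_b|."""
    union = fans_a | fans_b
    if not union:
        return 0.0  # no fans at all: treat as no connection
    return len(fans_a & fans_b) / len(union)


def directed_weight(fans_a, fans_b):
    """Directed weight a -> b: |F_a ∩ F_b| / |F_a|."""
    if not fans_a:
        return 0.0
    return len(fans_a & fans_b) / len(fans_a)


def product_weight(fans_a, fans_b):
    """Product of both directed weights: |F_a ∩ F_b|^2 / (|F_a| * |F_b|)."""
    return directed_weight(fans_a, fans_b) * directed_weight(fans_b, fans_a)


def min_weight(fans_a, fans_b):
    """The lesser of the two directed weights."""
    return min(directed_weight(fans_a, fans_b), directed_weight(fans_b, fans_a))


# Hypothetical toy data: users who favourited two stories.
fans_a = {"u1", "u2", "u3"}
fans_b = {"u2", "u3", "u4", "u5"}

print(jaccard_weight(fans_a, fans_b))   # 2 shared users out of 5 in total
print(directed_weight(fans_a, fans_b))  # 2 of the 3 fans of a also like b
print(product_weight(fans_a, fans_b))
print(min_weight(fans_a, fans_b))
```

All four weights are zero exactly when the two stories share no fans, which is consistent with a weight of zero meaning no connection.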

-

In our selected weight definition, \(W_{a, b} = \frac{|F_a \cap F_b|}{|F_a \cup F_b|}\), we give equal weight to the preferences of all users. But there’s a lot of variance between users: some favorite everything under the sun, while others very selectively favorite stories they really like. If we give the users who favorite thousands of stories the same weight as users who favorite ten, the users who favorite thousands dominate everything (and aren’t a very good signal).

-

Instead, we give each user \(u\) a weight of \(\frac{1}{20+n(u)}\) where \(n(u)\) denotes the number of stories \(u\) has favorited. This results in a measure on the space of users, \(\mu(S) = \sum_{u \in S} \frac{1}{20+n(u)}\), and the equation for our weights becomes \(W_{a, b} = \frac{\mu(F_a \cap F_b)}{\mu(F_a \cup F_b)}\).

+

In our selected weight definition, \(W_{a, b} = \frac{|F_a \cap F_b|}{|F_a \cup F_b|}\), we give equal weight to the preferences of all users. But there’s a lot of variance between users: some favourite everything under the sun, while others very selectively favourite stories they really like. If we give the users who favourite thousands of stories the same weight as users who favourite ten, the users who favourite thousands dominate everything (and aren’t a very good signal).

+

Instead, we give each user \(u\) a weight of \(\frac{1}{20+n(u)}\) where \(n(u)\) denotes the number of stories \(u\) has favourited. This results in a measure on the space of users, \(\mu(S) = \sum_{u \in S} \frac{1}{20+n(u)}\), and the equation for our weights becomes \(W_{a, b} = \frac{\mu(F_a \cap F_b)}{\mu(F_a \cup F_b)}\).
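To make the effect of the user weighting concrete, here is a rough sketch of the measure \(\mu\) and the resulting edge weight. It is an illustration under assumptions rather than the original implementation: the names `user_weight`, `measure` and `weighted_jaccard`, and the toy favourite counts, are hypothetical.

```python
def user_weight(n_favourites, smoothing=20):
    """Weight of a single user: 1 / (20 + n(u))."""
    return 1.0 / (smoothing + n_favourites)


def measure(users, n_favourites):
    """mu(S): sum of the user weights over a set of users S."""
    return sum(user_weight(n_favourites[u]) for u in users)


def weighted_jaccard(fans_a, fans_b, n_favourites):
    """W_{a,b} = mu(F_a ∩ F_b) / mu(F_a ∪ F_b)."""
    denominator = measure(fans_a | fans_b, n_favourites)
    if denominator == 0:
        return 0.0
    return measure(fans_a & fans_b, n_favourites) / denominator


# Hypothetical toy data: number of stories each user has favourited.
n_favourites = {"u1": 0, "u2": 20, "u3": 980}
fans_a = {"u1", "u2", "u3"}
fans_b = {"u2", "u3"}

weight = weighted_jaccard(fans_a, fans_b, n_favourites)
print(weight)
```

In this toy example the unweighted formula would give 2/3, but the weighted version comes out lower, because the selective user `u1` (who only favourited story a) counts for much more than `u3`, who favourites nearly everything.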

Applying these techniques to a couple of the top Harry Potter stories, we get the following graph (using graphviz):

Small labeled graph of top Harry Potter stories
@@ -178,7 +178,7 @@

Graph Construction

A quick Google search reveals that this triangular clique consists of the “Dark Prince Trilogy” by Kurinoone. Each story is more strongly linked to its immediate predecessor or successor than the first and third stories are to each other.

Large Graph visualizations for Harry Potter

If we use different tools, we can visualize much larger graphs.

-

We consider the top 2,000 most reviewed Harry Potter stories and their authors. Based on the author’s favorite lists, we construct a weighted graph, with the stories as nodes (edge weights are calculated as above).

+

We consider the top 2,000 most reviewed Harry Potter stories and their authors. Based on the authors’ favourite lists, we construct a weighted graph, with the stories as nodes (edge weights are calculated as above).

We then prune the graph’s edges, keeping the top 8,000 most strongly weighted edges. We also prune the nodes, keeping only those with at least one edge. This leaves us with a graph of 1,623 nodes and 8,000 edges.
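The pruning step can be sketched in a few lines. This is a hedged illustration, not the original code: it assumes the graph is held as a dictionary from (story, story) pairs to weights, and `prune_graph` and the toy edges are hypothetical names and data.

```python
import heapq


def prune_graph(weights, max_edges):
    """Keep the max_edges most strongly weighted edges, then keep only
    the nodes that still have at least one edge.

    weights: dict mapping (story_a, story_b) -> edge weight.
    Returns (nodes, edges) for the pruned graph.
    """
    # Zero-weight edges are the same as no connection, so drop them first.
    candidates = ((pair, w) for pair, w in weights.items() if w > 0)
    strongest = heapq.nlargest(max_edges, candidates, key=lambda item: item[1])
    edges = dict(strongest)
    nodes = {story for pair in edges for story in pair}
    return nodes, edges


# Hypothetical toy graph with four edges; keep the two strongest.
toy_weights = {
    ("a", "b"): 0.9,
    ("b", "c"): 0.5,
    ("c", "d"): 0.1,
    ("a", "d"): 0.05,
}
nodes, edges = prune_graph(toy_weights, max_edges=2)
print(sorted(nodes), sorted(edges))
```

Here story `d` disappears from the pruned graph because both of its edges were among the weakest, which mirrors how 2,000 stories shrink to 1,623 nodes above.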

We then load this graph into the graph visualization tool Gephi. We lay out the graph using the OpenOrd and ForceAtlas2 layout algorithms. (OpenOrd was particularly good at extracting clusters. Beyond that, this was largely a matter of aesthetic taste.)

@@ -195,7 +195,7 @@

Large Graph visualizations
-Graph of Harry Potter Fanfiction, colored by language
+Graph of Harry Potter Fanfiction, coloured by language
@@ -205,7 +205,7 @@

Large Graph visualizations
-Graph of Harry Potter Fanfiction, colored by ship
+Graph of Harry Potter Fanfiction, coloured by ship
@@ -217,7 +217,7 @@

Large Graph visualizations
-Graph of Harry Potter Fanfiction, colored by slash
+Graph of Harry Potter Fanfiction, coloured by slash
@@ -244,7 +244,7 @@

Large Graph Visualizations
-Graph of top Naruto fanfiction, colored by language
+Graph of top Naruto fanfiction, coloured by language
@@ -254,7 +254,7 @@

Large Graph Visualizations
-Graph of top Naruto fanfiction, colored by ship
+Graph of top Naruto fanfiction, coloured by ship
@@ -271,11 +271,11 @@

Large Graph Visualizations
-

We can color it by language:

+

We can colour it by language:

-Graph of top Twilight fanfiction, colored by language
+Graph of top Twilight fanfiction, coloured by language
@@ -285,7 +285,7 @@

Large Graph Visualizations
-Graph of top Twilight fanfiction, colored by ship
+Graph of top Twilight fanfiction, coloured by ship
@@ -295,7 +295,7 @@

Large Graph Visualizations

You can also explore an interactive graph of Naruto fanfiction and of Twilight fanfiction.

PageRank

What are the best fanfics on fanfiction.net? How can we identify them?

-

A naive approach would be to select the most favorited or reviewed stories. But people’s quality of taste varies. A more sophisticated approach is Google’s PageRank algorithm which is used to determine which web pages are of high quality.

+

A naive approach would be to select the most favourited or reviewed stories. But people’s quality of taste varies. A more sophisticated approach is Google’s PageRank algorithm which is used to determine which web pages are of high quality.

A normal vote gives equal weight to every voter. But some voters are better qualified to decide than others. In PageRank, we recalculate the votes again and again, giving each “person’s” vote a weight based on how many votes they received in the previous step.

In the case of the Internet, we interpret a website linking to another website as that website voting for the one it links to. Similarly, we can apply it to fanfiction by interpreting story A as “voting” for a story B with a weight of the probability that a user who likes A also likes B.
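The repeated re-voting can be written as a short power iteration. What follows is a generic textbook-style sketch, not the code behind the rankings in this post: the damping factor of 0.85, the fixed iteration count, the even spreading of votes from stories with no outgoing edges, and the toy vote weights are all assumptions made for the example.

```python
def pagerank(votes, damping=0.85, iterations=100):
    """Power-iteration PageRank on a weighted, directed graph.

    votes[a][b] is the weight with which story a "votes" for story b,
    e.g. the probability that a user who likes a also likes b.
    Returns a dict mapping each story to its rank; ranks sum to 1.
    """
    nodes = set(votes) | {b for targets in votes.values() for b in targets}
    if not nodes:
        return {}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every story gets a small baseline share of the total rank.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for a in nodes:
            targets = votes.get(a, {})
            total = sum(targets.values())
            if total == 0:
                # A story that votes for nobody spreads its rank evenly.
                for node in nodes:
                    new_rank[node] += damping * rank[a] / n
            else:
                # Split a's rank among its targets, in proportion to weight.
                for b, weight in targets.items():
                    new_rank[b] += damping * rank[a] * weight / total
        rank = new_rank
    return rank


# Hypothetical toy graph: a and b both vote for c, and c votes for a.
toy_votes = {"a": {"c": 1.0}, "b": {"c": 1.0}, "c": {"a": 0.5}}
ranks = pagerank(toy_votes)
for story, score in sorted(ranks.items(), key=lambda kv: kv[1], reverse=True):
    print(story, round(score, 3))
```

In the toy graph, `c` ends up on top because it collects votes from two stories, and `a` comes second because its single vote comes from the highly ranked `c`.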

Harry Potter top stories by PageRank:

@@ -364,7 +364,7 @@

PageRank

More -

One neat thing we can do is give nodes on our graphs a size based on their PageRank. (We can also color nodes based on the first three components of the singular value decomposition of the adjacency matrix.)

+

One neat thing we can do is give nodes on our graphs a size based on their PageRank. (We can also colour nodes based on the first three components of the singular value decomposition of the adjacency matrix.)

@@ -376,8 +376,8 @@

Story Recommendation

This problem is called collaborative filtering, and is a well-established area. Unfortunately, it isn’t something I’m terribly knowledgeable about, so I took a relatively naive approach: sum over the preferences of all users, weighted by how similar their preferences are to the user you are trying to predict.

Specifically, we give each story, \(s\), a rank \(R_u(s)\), for a user \(u\). If the rank is high, we think \(u\) is likely to like \(s\).

\[R_u(s) = \sum_{v\in F_s \setminus \{u\}} \left(\frac{|S(u)\cap S(v)|}{20+|S(v)|}\right)^2\]

-

where \(F_s\) is the set of users who favorited \(s\) and \(S(u)\) is the stories favorited by the user \(u\).

-

For example, we can make recommendations for S’TarKan, the author of the most favorited Harry Potter story on fanfiction.net:

+

where \(F_s\) is the set of users who favourited \(s\) and \(S(u)\) is the set of stories favourited by the user \(u\).
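As a rough illustration of the formula for \(R_u(s)\), the sketch below scores stories for one user from a dictionary of favourite lists. It is an assumption-laden example rather than the original code: `recommend`, the `exclude_known` flag and the toy users are hypothetical, and since the toy data has no authorship information it can only exclude the user's favourites, not their own stories.

```python
from collections import defaultdict


def recommend(user, favourites, smoothing=20, exclude_known=True):
    """Rank stories for `user` by R_u(s).

    favourites: dict mapping each user v to S(v), the set of stories
    they favourited. Returns (story, score) pairs, best first.
    """
    mine = favourites[user]
    scores = defaultdict(float)
    for other, theirs in favourites.items():
        if other == user:
            continue  # the sum runs over F_s without u
        # Squared similarity term: (|S(u) ∩ S(v)| / (20 + |S(v)|))^2
        similarity = (len(mine & theirs) / (smoothing + len(theirs))) ** 2
        if similarity == 0:
            continue
        # v contributes this term to every story s with v in F_s.
        for story in theirs:
            scores[story] += similarity
    if exclude_known:
        scores = {s: r for s, r in scores.items() if s not in mine}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy data.
toy_favourites = {
    "u": {"s1", "s2"},
    "v": {"s1", "s2", "s3"},
    "w": {"s2", "s4"},
    "x": {"s5"},
}
recommendations = recommend("u", toy_favourites)
print(recommendations)
```

Looping over users rather than over stories gives the same sums, but only touches each favourite list once. In the toy data, `s3` ranks above `s4` because `v` shares two favourites with `u` while `w` shares only one, and `s5` never appears because `x` has nothing in common with `u`.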

+

For example, we can make recommendations for S’TarKan, the author of the most favourited Harry Potter story on fanfiction.net:

-

A * denotes that this is already one of the users favorite stories or one of their own stories. We can exclude their favorite stories, and their own stories:

+

A * denotes that this is already one of the user’s favourite stories or one of their own stories. We can exclude their favourite stories, and their own stories:

  • Make A Wish (0.949)
@@ -452,7 +452,7 @@

    Conclusion

    In light of all this, I’d like to reflect on a few things.

    Big Data: A year ago, I was very dismissive of “big data” as a buzzword. Primarily, it seems to be thrown around by business people who don’t really understand much. But one thing I’ve learned in explorations of data like this one and working in machine learning, is that there is something very powerful about larger amounts of data. There’s something very qualitatively different. The fanfiction data I used was actually quite small, only a few hundred users, because of how I limited the amount I downloaded, but I think it still demonstrates the sorts of things that become possible as you have larger amounts of data. (To be honest, a much more compelling example is the progress that’s been made in computer vision using ImageNet… But this still influenced my views.)

    Digital Humanities: Digital humanities also seems to be a bit of a buzzword. But I hope this provides a simple example of the power that can come from applying a little bit of math and computer science to humanities problems.

    -

    Metadata and Privacy: In this essay, we analyzed stories by looking at whether they were favorited by the same users. There’s a natural “dual” to this: analyzing users by looking at whether they favorited the same stories. This would give us a graph of connections between users and allow us to find clusters of users. But what if you use other forms of metadata? For example, we now know that the US government has metadata on who phones who. It seems very likely that many companies and governments have information on where your cellphone is as a function of time. All this can construct a graph of society. I can’t really fathom how much one must be able to learn about someone from that. (And how easy it would be to misinterpret.)

    +

    Metadata and Privacy: In this essay, we analyzed stories by looking at whether they were favourited by the same users. There’s a natural “dual” to this: analyzing users by looking at whether they favourited the same stories. This would give us a graph of connections between users and allow us to find clusters of users. But what if you use other forms of metadata? For example, we now know that the US government has metadata on who phones who. It seems very likely that many companies and governments have information on where your cellphone is as a function of time. All this can construct a graph of society. I can’t really fathom how much one must be able to learn about someone from that. (And how easy it would be to misinterpret.)

    Fanfiction Websites: I think there’s a lot of potential for fanfiction websites to better serve their users based on the techniques outlined here. I’d be really thrilled to see fanfiction.net or Archive Of Our Own adopt some of these ideas. Imagine being able to list a handful of stories in some category you’re interested in and discover others? Or get good recommendations? The ideas are all pretty straightforward once you think of them. I’d be very happy to talk to the groups behind different fanfiction websites and provide some help or share example code.

    Deep Learning and NLP: Recently, there’s been some really cool results in applying Deep Learning to Natural Language Processing. One would need a lot more data than I collected, and it would take more effort, but I bet one could do some really interesting things here.

    t-SNE: t-Distributed Stochastic Neighbor Embedding is an algorithm for visualizing the structure of high-dimensional data. It would be a much simpler approach to understanding the structure of fanfiction than the graph-based one I used here, and would probably give much better results. If I were starting again, I would use it.