Recommendation Engine for FF.net

Discussion in 'Fanfic Discussion' started by JordanL, Nov 24, 2010.

  1. pfeil

    pfeil First Year

    Joined:
    Jun 19, 2008
    Messages:
    24
    Very cool!

    I wonder if there's some way to find the sweet spot between the fics that are on gazillions of lists that I've already heard of and the fics that are so uncommon as to be irrelevant to the search.

    Also, what was your justification for going the dedicated server route instead of GAE, AEC2, MSA, or similar? I've been looking at starting a FFnet-scraping site of my own, and GAE looked quite attractive -- particularly since it's free initially -- assuming I can figure out how to make an efficient non-relational data model.
     
  2. Militis

    Militis Supreme Mugwump

    Joined:
    Jun 24, 2008
    Messages:
    1,683
    Location:
    Online
    I know what GAE is, but what are the other two?

    Google App Engine, for those who don't.
     
  3. pfeil

    pfeil First Year

    Joined:
    Jun 19, 2008
    Messages:
    24
  4. JordanL

    JordanL Third Year

    Joined:
    Jan 17, 2009
    Messages:
    98
    Several reasons actually, none of which you have to agree with.

    - Storing information like this 'in the cloud' releases a large amount of control over the site and the data.
    - If I released this to the public, I had always intended it to be a fully-fledged website.
    - I wanted creative control over the aesthetics.
    - I wanted to have complete control over the data structure and such to ensure speed. When I got this dedicated server average query time was still at 6 seconds. By using smart server administration and data structure I've gotten the average query time down to under half a second.

    Again, you don't have to agree with any of them, and you certainly don't have to contribute to the costs if you don't feel like it.

    I imagine if the cost becomes too great for me personally I'll set up some kind of membership system that will give additional benefits, and if that doesn't do it, I'll probably release the pseudo-code for the site and take it offline.

    The largest problem with using cloud solutions is that the data itself HAS to be so highly relational that cloud storage is slow and cumbersome and would greatly deteriorate the ability to use the tool.

    ---------- Post automerged at 12:23 PM ---------- Previous post was at 02:19 AM ----------

    Alright, I got a much-requested feature implemented: filtering common favorites.

    So, on the basic search you will now see an option called "Tailor Results". This feature will change the sorting to sort by how unique a result is to your particular search, instead of how many common favorites are found.

    The actual formula is: (num of authors with story favorited in results)^2/(total number of favorites story has)

    An example:

    If result A has 600 people in the database that have it favorited, but only 5 have it favorited in your results, the new formula will calculate a score of 5*5/600, or 0.04166...

    Previously, its score would have been 5.

    How does this affect results? Well, if you put in a story ID and only 6 of the people in your results have favorited a given result, it would previously have appeared low in the list, giving way to stories like CED, which may have 30. Now, however, say this example result only has 8 people who have it favorited, period... that means 75% of the people who have it favorited are counted in your results, giving a VERY high score of 4.5 (6*6/8), whereas CED would have a score somewhere around 0.8, since only a small portion of the total number of people with it favorited appear in your results.

    You can still see results the traditional way by selecting no for the tailored option, or leaving it blank.
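
A minimal Python sketch of the formula above, reproducing the two worked examples from this post. The site itself runs on PHP, and the function and argument names here are made up for illustration, so treat this as a restatement of the arithmetic rather than the site's actual code.

```python
def tailored_score(favorited_in_results: int, total_favorites: int) -> float:
    """Score = (authors in results who favorited the story)^2 / (all favorites of the story)."""
    if total_favorites <= 0:
        raise ValueError("total_favorites must be positive")
    return favorited_in_results ** 2 / total_favorites


def traditional_score(favorited_in_results: int) -> int:
    """The old ranking: just the number of common favorites."""
    return favorited_in_results


if __name__ == "__main__":
    # Result A: 5 of 600 favoriters appear in the results.
    print(tailored_score(5, 600))
    # Niche story: 6 of 8 favoriters appear in the results.
    print(tailored_score(6, 8))
```

Squaring the numerator keeps the raw overlap relevant: a story with a single favorite that happens to match would score 1, not outrank a story with 6 of 8 matches.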
     
  5. Rocag

    Rocag Third Year

    Joined:
    Aug 1, 2010
    Messages:
    96
    Gender:
    Male
    Awesome! I very much appreciate the changes.
     
  6. The Arid Legion

    The Arid Legion Professor

    Joined:
    Oct 6, 2010
    Messages:
    420
    Wow, this is just :awesome. I wish I had a paypal account. Although I noticed that you can only choose one fandom at a time. Is there any possible way that you could allow us to select multiple fandoms in one search? If so that would be great.
     
  7. JordanL

    JordanL Third Year

    Joined:
    Jan 17, 2009
    Messages:
    98
    I could... but it's likely your search would take 6-12 seconds.

    EDIT:

    And you don't actually need a paypal account to donate. ;) Just a card.
     
  8. The Arid Legion

    The Arid Legion Professor

    Joined:
    Oct 6, 2010
    Messages:
    420
    A non-South African card. The only things that accept those are ATMs.
     
  9. JordanL

    JordanL Third Year

    Joined:
    Jan 17, 2009
    Messages:
    98
    Oh, lol. Yeah. That would make it difficult.
     
  10. pfeil

    pfeil First Year

    Joined:
    Jun 19, 2008
    Messages:
    24
    I'm more curious than worried about "agreeing" or not.

    Very true. (Though I'll argue that for a site that scrapes everything, so long as you have the code -- you do use a VCS, right? -- you're only ever a week away from having it back again if you just have a semi-recent export of which favourites lists have been imported, which you could have the site email to you every week.)

    These I'm confused by. Since GAE, for example, lets you run arbitrary Python or Java code to produce the HTML for the site, I don't see how you couldn't do whatever you want, aesthetically.

    All by tweaking the SQL server config? Impressive.

    As for "the data itself HAS to be so highly relational", how is that determined? I always thought relational data modelling was just one way of representing data -- or multiple, if you take into account the different normal forms.

    Thinking about how to represent things in GAE's DataStore has been an interesting exercise, actually. Rather a challenge, since I'm used to thinking relational, but the list properties and such have made some things surprisingly nice.

    :cool:
     
  11. JordanL

    JordanL Third Year

    Joined:
    Jan 17, 2009
    Messages:
    98
    Well, stats have been turned back on, and are updated on a rolling basis.

    Actually a combination of several things:

    - Improving the queries to better use indexes
    - Moving to a custom-compiled version of MySQL 5.5.7 from MySQL 5.1.63
    - Moving from PHP 5.1.X to 5.3.X
    - Customizing the SQL server settings to better utilize JOINs and temporary tables
    - Doing some usage analysis to determine what indexes provide the best performance.

    The data as I'm collecting it should never be stored in any non-relational way. My current database would be in excess of 2 GB if it were not stored relationally, and several of the filtering options would be much more difficult to implement. Creating temporary tables would be... next to impossible for some of the queries.
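
To make "stored relationally" concrete, here is a small sketch of how a favorites table and a common-favorites query could look. Everything in it is an assumption for illustration: the table and column names are invented, the data is made up, and it uses Python's built-in sqlite3 instead of the MySQL and PHP stack described above.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with one row per (user, favorited story)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE favorites ("
        " user_id INTEGER NOT NULL,"
        " story_id INTEGER NOT NULL,"
        " PRIMARY KEY (user_id, story_id))"
    )
    # Second index so lookups that start from a story do not scan the table.
    conn.execute("CREATE INDEX idx_favorites_story ON favorites (story_id, user_id)")
    conn.executemany(
        "INSERT INTO favorites (user_id, story_id) VALUES (?, ?)",
        [(1, 100), (1, 200), (1, 300), (2, 100), (2, 200),
         (3, 100), (4, 200), (4, 300)],
    )
    return conn


def co_favorites(conn: sqlite3.Connection, story_id: int) -> list:
    """Stories favorited by users who also favorited story_id, most shared first."""
    return conn.execute(
        "SELECT other.story_id, COUNT(*) AS common"
        " FROM favorites AS seed"
        " JOIN favorites AS other ON other.user_id = seed.user_id"
        " WHERE seed.story_id = ? AND other.story_id != ?"
        " GROUP BY other.story_id"
        " ORDER BY common DESC, other.story_id ASC",
        (story_id, story_id),
    ).fetchall()


if __name__ == "__main__":
    print(co_favorites(build_demo_db(), 100))
```

The self-join is the kind of query that benefits from the index and JOIN tuning listed above; each favorite is stored once as a pair of IDs rather than as a copied list per user.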

    Setting aside that Python and Java are both terrible tools for the purpose of this website, you do not get complete aesthetic control over GAE sites, to my knowledge. This might have changed since I last looked, but when I checked (not for this site) the ability was still limited.

    Additionally, using my logs to model usage, all three distributed options would cost me more per month than my dedicated server currently is. EC2 would cost almost twice as much.
     
    Last edited: Dec 21, 2010
  12. pfeil

    pfeil First Year

    Joined:
    Jun 19, 2008
    Messages:
    24
    That, of course, is likely the most important consideration anyways.
     
  13. JordanL

    JordanL Third Year

    Joined:
    Jan 17, 2009
    Messages:
    98
    You know, I've noticed something about the "Tailored" results.

    It gives you stories, generally, that are at about the same popularity level. If you put in CED or Time Braid or Reload, you get most of the other really popular stories.

    But if you put in a good story that's only partially known, you tend to get other good stories that are only partially known.

    That's very interesting, considering all I'm doing is a brief stats analysis on references...

    Anyway, I have another background feature running now. I have indexed some 6,000 new profiles by spidering (yes, ACTUALLY spidering) fanfiction.net, instead of the "put an id in here" box. 12 times per hour, 10 of them get processed (about 2,900 per day) and actually added to the index of favorites.

    This is to prevent flooding the fanfiction.net servers and possibly causing an inadvertent DoS attack, which I'm sure they wouldn't appreciate and which would draw their attention.

    What this means is that from now on, in addition to any profiles that are added by users of the search engine, it will update itself with up to 120 new profiles an hour. I've already added almost 1,000 new profiles to the index today.
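
A rough sketch of the throttling described above: a queue of discovered profile IDs, from which a scheduled job takes at most 10 per run, 12 runs per hour. The names and the use of an in-memory deque are assumptions for illustration; how the real site schedules and stores its queue is not stated in this thread.

```python
from collections import deque

BATCH_SIZE = 10       # profiles processed per run (from the post)
RUNS_PER_HOUR = 12    # one run every five minutes (from the post)


def take_batch(queue: deque, batch_size: int = BATCH_SIZE) -> list:
    """Remove and return up to batch_size profile IDs from the front of the queue."""
    batch = []
    while queue and len(batch) < batch_size:
        batch.append(queue.popleft())
    return batch


def max_profiles_per_day(batch_size: int = BATCH_SIZE,
                         runs_per_hour: int = RUNS_PER_HOUR) -> int:
    """Upper bound on profiles fetched per day at this rate."""
    return batch_size * runs_per_hour * 24


if __name__ == "__main__":
    pending = deque(range(1, 26))    # 25 hypothetical profile IDs
    print(take_batch(pending))       # first run takes 10
    print(len(pending))              # 15 left for later runs
    print(max_profiles_per_day())    # 10 * 12 * 24
```

Capping the batch size rather than the queue means a burst of newly discovered profiles only lengthens the backlog; it never raises the request rate against fanfiction.net.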

    Unfortunately, the sample has a bias of origin, since I only spider one section (i.e. Harry Potter) at a time. So if you guys want to see stories from other fandoms, please continue adding profiles from there. :)

    Again, just looking for little things to help make everything work better. After I added the fandom filter back in, and added the tailored results, I decided that the next best thing would be to increase the size of the database.

    The hosting bill is coming up soon, so if anyone else wants to help, that would be appreciated, although I don't think it is strictly an emergency situation this month.

    What features should be next?
     
  14. Militis

    Militis Supreme Mugwump

    Joined:
    Jun 24, 2008
    Messages:
    1,683
    Location:
    Online
    I guess this is as good a place as any to ask...Does FFN still put multiple accounts on the same ID? (I.E.: When I went to nonjon's profile, I couldn't use this link because it would take me to someone else's profile...I had to use this one instead.) I can't remember how long ago it was that this happened, or if it ever stopped happening.
     
  15. Sesc

    Sesc Slytherin at Heart Moderator

    Joined:
    Dec 20, 2007
    Messages:
    6,216
    Gender:
    Male
    Location:
    Blocksberg, Germany
    Multiple accounts? I was under the impression those were simply two different ways to reach the same account. A third one in this instance would be http://www.fanfiction.net/~Nonjon

    The first link works fine for me too.
     
  16. Militis

    Militis Supreme Mugwump

    Joined:
    Jun 24, 2008
    Messages:
    1,683
    Location:
    Online
    Yeah, the first link works now - so I was assuming they fixed it. I remember when the first link used to send you to Joe Shmoe's profile, not nonjon's, and that the only ways to get to nonjon's were the second link or your link.
     
  17. pfeil

    pfeil First Year

    Joined:
    Jun 19, 2008
    Messages:
    24
    I'm very, very skeptical. That user id is most likely a primary key and used in all sorts of foreign key references, so uniqueness ought to be enforced way down at the database level, and if it weren't, things would have been more than just slightly broken.

    Especially since the string suffixes are completely ignored by FFnet, so there's no reason changing it would affect anything at all.
     
  18. JordanL

    JordanL Third Year

    Joined:
    Jan 17, 2009
    Messages:
    98
    The ID numbers are unique to the user at all times. It's the tilde addresses that can change.
     
  19. JordanL

    JordanL Third Year

    Joined:
    Jan 17, 2009
    Messages:
    98
    An update on the funding status for the site (since it was asked in another forum).

    There is a single ad at the top of every page. So far, the ads are providing approximately 8% of what I need every month to pay for the server.

    The split for January's hosting costs was approximately:

    8% Advertising revenue (of which none will be received until February because of billing cycles)
    31% Donations
    61% Me

    Unfortunately I may only be able to keep that situation going for another two or three months. That's enough time for me to see if there's enough usage to figure out something else, or to find more donators, or to create some kind of paid membership or something.

    Beyond that I may have to shut the site down and perhaps tell ff.net how I did it, and maybe hope they implement something similar in the next few years.

    But for the next few months at least the site will be up. Advertising is definitely not the way to do it. Only about 60% of people who visit the site are seeing an ad (thanks, AdBlocker...) and only 0.21% of those who see one click it. That's only about 0.12% of all people who visit the website clicking an ad. On average, only about one in every five thousand web pages the site serves results in an ad click, each of which is producing $0.87 in revenue on average. That essentially means I need about 80 MB of bandwidth and one hour of processing time per dollar generated by advertising. To pay for the hosting costs on advertising alone I'd need about 18 GB of bandwidth and 10 days of processing time every month.
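
The percentages above can be checked with a few lines of Python. The input figures are the ones quoted in this post; the function names are made up for illustration, and the results are only as precise as the rounded inputs.

```python
SHARE_SEEING_AD = 0.60      # about 60% of visitors see an ad
CLICK_RATE = 0.0021         # 0.21% of those who see an ad click it
PAGES_PER_CLICK = 5000      # about one click per five thousand pages served
REVENUE_PER_CLICK = 0.87    # dollars per click, on average


def visitor_click_share(share_seeing_ad: float, click_rate: float) -> float:
    """Fraction of all visitors who end up clicking an ad."""
    return share_seeing_ad * click_rate


def revenue_per_page(revenue_per_click: float, pages_per_click: int) -> float:
    """Average advertising revenue, in dollars, per page served."""
    return revenue_per_click / pages_per_click


if __name__ == "__main__":
    print(f"{visitor_click_share(SHARE_SEEING_AD, CLICK_RATE):.3%} of visitors click an ad")
    print(f"${revenue_per_page(REVENUE_PER_CLICK, PAGES_PER_CLICK):.6f} earned per page served")
```

With these rounded inputs the click share comes out at roughly 0.126%, in line with the "about 0.12%" quoted above, and each page served earns a small fraction of a cent.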

    In short, advertising isn't going to pay for the site, and unless I figure something else out soon, this will be a successful but short-lived experiment.

    Thank you to all the people who donated and really made this next month possible. It was especially tight for me to cover the whole cost, as I had to visit my sister across the country, got stuck in an airport without food for several days, and had the holidays all this month. :)
     
  20. JordanL

    JordanL Third Year

    Joined:
    Jan 17, 2009
    Messages:
    98