Kuro5hin.org: technology and culture, from the trenches

A Collaborative Filtering Recommender System for K5

By jacoplane in Meta
Mon Jul 08, 2002 at 11:15:14 AM EST
Tags: Scoop (all tags)

I'd like to propose a recommender system to help users find content on K5 that could be of interest to them. The system I have in mind uses collaborative filtering to achieve this. Another reason I feel this system would be useful is that it could help preserve some of the excellent older content that exists on K5.

To many new users entering the kuro5hin community, things can seem a bit intimidating at first. Many of them are unaware of the classic content that has been created for K5. The problem, of course, is that these people only look at what is on the front page, and maybe what is on the section pages. There have been some attempts to create a kind of community-edited guide to K5. One such attempt is Ko4ting, a wiki that, amongst other things, contains a K5 hall of fame.

This hall of fame lists what users think of as some of the best classic stories that have featured on k5. However, I was thinking, people have different tastes, so a story which one person likes is not necessarily a story that others like too. I feel the best way to help users find content they like on k5 is through the use of a collaborative filtering based recommender system.

It is probably a good idea to explain what these terms mean. A recommender system is basically a system that can be used to recommend content to users. The user specifies some content that they like, and some content that they dislike. This information is then used to find content which is likely to appeal to the user. The collaborative filtering approach to recommending basically means that if a thousand users who liked story A also liked story B, then it is probably a good idea to recommend story B to fans of story A.
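To make the "users who liked story A also liked story B" idea concrete, here is a minimal sketch in Python. The `likes` data and the `also_liked` function are made up for illustration; nothing here reflects how Scoop or Amazon actually store or process votes.

```python
from collections import Counter

def also_liked(likes, story, top_n=3):
    """Rank other stories by how many fans of `story` also liked them.

    `likes` maps a user name to the set of stories that user liked
    (a hypothetical data layout, not K5's real schema).
    """
    counts = Counter()
    for liked in likes.values():
        if story in liked:
            # Every other story this fan liked gets one co-occurrence vote.
            counts.update(s for s in liked if s != story)
    return [s for s, _ in counts.most_common(top_n)]

likes = {
    "ann": {"A", "B"},
    "bob": {"A", "B", "C"},
    "cat": {"A", "B", "C"},
    "dan": {"C", "D"},
}
print(also_liked(likes, "A"))  # B first (3 shared fans), then C (2)
```

With a thousand users instead of four, the same counting gives the "fans of A also liked B" list.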

If you want to see a collaborative filtering recommender system in action, I suggest you go have a look at Amazon.com. Type in the name of your favourite CD, and it will suggest other CDs to you. I find that the recommendations it makes are usually quite good (it certainly helps me when I'm trying to find new music to download :P). For example, when I asked Amazon about music similar to Aphex Twin, the list it produced contained some new stuff I did not know about and now certainly enjoy.

I think that such a system could be implemented for K5 stories relatively easily. Users would be asked to specify some stories that they liked, and some they disliked. The information required here is rather like the ratings used in the story queue. Of course, the system works better when the user provides more detailed information regarding their interests, so ideally the system would be integrated into K5 and would allow users to specify their vote on every story. This information is then stored in the user profile. When a user asks for recommendations, the system compares different user profiles to find new stories that might be of interest to this user. Another possibility is that a user requests stories "like" the current story. One thing to note is that stories that are liked by a large part of the user population are more likely to be recommended to other users, so the better stories automatically float to the surface.
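As a rough sketch of the 'stories "like" the current story' request, one option (my assumption, not something the proposal prescribes) is to treat each story as a vector of user votes and compare stories by cosine similarity. The `profiles` layout, the +1/-1 vote values and all function names are hypothetical.

```python
from math import sqrt

def story_vector(profiles, story):
    """Collect every user's vote on `story` as a {user: vote} mapping."""
    return {user: votes[story] for user, votes in profiles.items() if story in votes}

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def stories_like(profiles, story, top_n=5):
    """Return the stories whose vote patterns are closest to `story`."""
    target = story_vector(profiles, story)
    others = {s for votes in profiles.values() for s in votes} - {story}
    # Sort by descending similarity; break ties alphabetically.
    ranked = sorted(others, key=lambda s: (-cosine(target, story_vector(profiles, s)), s))
    return ranked[:top_n]

# Hypothetical user profiles: +1 means liked, -1 means disliked.
profiles = {
    "ann": {"A": 1, "B": 1, "C": -1},
    "bob": {"A": 1, "B": 1},
    "cat": {"A": -1, "C": 1},
}
print(stories_like(profiles, "A"))  # B ranks above C
```

Comparing stories rather than users means the similarities could be computed ahead of time, though whether that matters at K5's size is an open question.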

Enabling it to do content-based recommending too could further enhance this system. Collaborative filtering systems don't usually look at what the actual content is. But by looking at the words in a story, filtering away all the most common words in English, one can get a grasp of what this story is about. However, this would be a nice addition for later. Collaborative filtering should be a good thing to start with.
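A minimal sketch of the content-based step, assuming a tiny hand-picked stop-word list (a real one would be much longer). The `STOP_WORDS` set and the `keywords` function are illustrative names only.

```python
import re
from collections import Counter

# A deliberately tiny, hypothetical list of common English words to ignore.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it",
              "that", "for", "on", "with", "as", "this"}

def keywords(text, top_n=5):
    """Return the most frequent words in `text` after dropping stop words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return [w for w, _ in counts.most_common(top_n)]

print(keywords("The DMCA and the DeCSS case: the DMCA is a law", top_n=1))
```

Stories that share many of these keywords could then be treated as being about similar things.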

Some people might argue that these kinds of automated systems are not a good way to recommend content to users, and that the best filters for quality are, in fact, other humans. However, I feel that these two ideas do not have to be mutually exclusive. For example, Amazon.com combines collaborative filtering with user-made lists in its Listmania system. Humans make these lists, and then collaborative filtering is used to suggest lists that contain content users have said they like.

I'm wondering what the k5 community thinks of this idea. I personally think that this system could help preserve some of the older content on k5, which doesn't happen very well currently.





A Collaborative Filtering Recommender System for K5 | 42 comments (37 topical, 5 editorial, 0 hidden)
Just release the database... (3.66 / 3) (#6)
by dipierro on Sun Jul 07, 2002 at 03:21:00 PM EST

Let us download the database of votes to accept or dump stories and then anyone who comes up with a crazy scheme to implement a "recommender system" can just download the database and implement it.

i still say my way works better;) (none / 0) (#7)
by infinitera on Sun Jul 07, 2002 at 03:26:25 PM EST

Users who hotlisted this story also hotlisted [...]

That would seriously make my day. Votes in and of themselves are mostly meaningless, especially if k5 is working as designed, and people are voting up quality things, not necessarily things that interest them.

[ Parent ]
privacy issues here? [nt] (none / 0) (#16)
by Meatbomb on Mon Jul 08, 2002 at 02:36:29 AM EST


Good News for Liberal Democracy!

[ Parent ]
It's already public (5.00 / 1) (#18)
by dipierro on Mon Jul 08, 2002 at 04:47:52 AM EST

so no.

[ Parent ]
1st let's get the search engine working again (4.66 / 3) (#8)
by semaphore on Sun Jul 07, 2002 at 05:30:55 PM EST

I'm Sorry!

There's a problem with the database, and story and comment searches are causing it to lock up badly. We're trying to fix it, but for now, I have to disable these searches. A Google search is likely to find what you're looking for. The other types of searches work normally. Only "Stories" and "Comments" are

"you want enlightenment? stare into the sun."

This is very interesting. (3.00 / 1) (#9)
by Farq Q. Fenderson on Sun Jul 07, 2002 at 05:40:34 PM EST

I think it should be discussed and expanded upon. I don't think that we should just consider this particular model, but explore possibilities.

For example, "domains" (which are essentially keywords, see below) could be created, and articles could involve any number of particular domains. Each domain that has a certain number of individual votes could become a 'key domain' for that story, which would cause the story to score higher for searches on that domain. I'm referring to domains as such because "key keyword" is too awkward.

Domains could be completely arbitrary words, and as such a search on them could be a normal search:

"DMCA DeCSS 2600"

could, in addition to searching for the strings in the text, search upon the domains and key domains. Key domains would add a high number of points, normal domains would add a medium number and text within the body would add a low number. So suddenly we have scoring and ranking that seems to work okay.

Also, domains could be added/strengthened if the searcher finds what they're looking for and confirms it. So when someone clicks on a link in the search results, they have the option to confirm the find as satisfactory, thus strengthening the link against all of the search terms. Perhaps they could even rank it, 1 to 5.

But this is all just speculation.
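For what it's worth, the domain scoring described above could be sketched roughly like this. The point values, the vote threshold and the story layout are all assumptions picked for illustration; the comment only says "high", "medium" and "low".

```python
# Hypothetical weights and threshold; the proposal does not fix any numbers.
KEY_POINTS, DOMAIN_POINTS, TEXT_POINTS = 10, 5, 1
KEY_THRESHOLD = 3  # votes needed before a domain becomes a 'key domain'

def score(story, terms):
    """Score a story against search terms using domains plus body text."""
    body_words = set(story["body"].lower().split())
    total = 0
    for term in (t.lower() for t in terms):
        votes = story["domains"].get(term, 0)
        if votes >= KEY_THRESHOLD:
            total += KEY_POINTS      # key domain: high score
        elif votes > 0:
            total += DOMAIN_POINTS   # normal domain: medium score
        if term in body_words:
            total += TEXT_POINTS     # plain text match: low score
    return total

def confirm(story, terms):
    """A searcher confirms the find: strengthen every search term's domain."""
    for term in (t.lower() for t in terms):
        story["domains"][term] = story["domains"].get(term, 0) + 1

story = {"body": "The DMCA was used against 2600",
         "domains": {"dmca": 3, "decss": 1}}
print(score(story, ["DMCA", "DeCSS", "2600"]))
```

Calling `confirm` enough times pushes a normal domain over the threshold, which is the strengthening effect described above.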

farq will not be coming back

My Two Cents (3.00 / 1) (#10)
by jgk on Sun Jul 07, 2002 at 06:21:18 PM EST

I would like to see an ability to compare the opinions of other users.

I would like to find out which people like which sorts of stories. For example, if 50 of 1000 users all enjoyed stories A, B, C and D (which I didn't enjoy), I would like to know what they thought of story E, as well as knowing what everyone, or just one person, thought of story E, obviously.

Gore Vidal is cool.
one nudge (4.66 / 3) (#11)
by athagon on Sun Jul 07, 2002 at 06:30:20 PM EST

Sounds interesting, but if it were implemented, I would have one feature request. Namely, a constant-tuning system. In other words, in my theoretical "Other stories you might enjoy" box, "+" and "-" buttons next to each story link to enable me to tell the system whether or not I found those stories to be of interest.

automatic (2.50 / 2) (#15)
by Meatbomb on Mon Jul 08, 2002 at 02:32:12 AM EST

After you read a story, it is in your "stories i have read" column, not "stories you might like". It would then have your score, and the machine will now be able to more finely tune your recommendations.


Good News for Liberal Democracy!

[ Parent ]
It would need to be a bit more than that (4.00 / 1) (#19)
by salsaman on Mon Jul 08, 2002 at 06:13:00 AM EST

What if you read a story, and hated it ? It would be better to have a system as the author suggested, where you could vote from say -5 (I really don't like this story) to +5 (I liked it a lot).

[ Parent ]
I'll Write It (4.80 / 5) (#12)
by Carnage4Life on Sun Jul 07, 2002 at 09:23:34 PM EST

If enough people are interested in this idea and can point me to papers on collaborative filtering (that don't infringe any patents) I could probably knock up something similar to the K5 user information page.

Of course, I have to find time to reinstall a database management system on my machine first.

Not sure about patents, but (5.00 / 3) (#21)
by jacoplane on Mon Jul 08, 2002 at 06:37:24 AM EST

here are some sites regarding collaborative filtering.

[ Parent ]
Fast Algorithm to Cluster High Dimensional Baskets (4.50 / 2) (#37)
by Baldrson on Tue Jul 09, 2002 at 04:18:36 PM EST

You might look at the general area of expectation maximization algorithms for statistical imputation of missing data. I've looked around a bit and found a few tempting citations such as "A Fast Algorithm to Cluster High Dimensional Basket Data" and its predecessor "SQLEM: Fast Clustering in SQL using the EM Algorithm" but I'm not going to recommend that you necessarily do those things.

Other approaches involve doing eigenvector analysis and hyperdimensional attraction through subspaces constrained by observations.

I've been working with a genuine statistical mechanic on the eigenvector analysis approach for the last few months precisely for the K5 type of message rating scheme and if we come up with something and you're still interested I'll see if we can knock something out that will spec the algorithm to drive guessed relative ratings from existing ratings.

-------- Empty the Cities --------

[ Parent ]

Eigentaste (none / 0) (#38)
by Baldrson on Tue Jul 09, 2002 at 04:50:04 PM EST

Eigentaste is pretty close to the eigenvector analysis approach over which we've been vacillating for the few months we've been thinking about the relativity problem with ratings.

-------- Empty the Cities --------

[ Parent ]

Cool (none / 0) (#41)
by Carnage4Life on Thu Jul 11, 2002 at 01:28:29 AM EST

I gave the paper a glance and it looked interesting. Whenever you guys are done and have something specced out just give me a holla via my website or in my K5 diary.

[ Parent ]
No interest to me. (4.00 / 2) (#13)
by ti dave on Sun Jul 07, 2002 at 09:35:21 PM EST

Variety is the spice of life.

I read damn near every Diary, and though there are some turds in the lot, I've found many informative and entertaining threads and stories by reading it all.

I don't think k5 would be improved by further Balkanization.

"If you dial," Iran said, eyes open and watching, "for greater venom, then I'll dial the same."

I like this story (3.50 / 2) (#14)
by ennui on Sun Jul 07, 2002 at 10:30:56 PM EST

I hereby request more stories like it.

"You can get a lot more done with a kind word and a gun, than with a kind word alone." -- Al Capone
There's a place to put story requests (4.00 / 1) (#28)
by sebpaquet on Mon Jul 08, 2002 at 03:48:02 PM EST

Seb's Open Research - Pointers and thoughts on the evolution of knowledge sharing and scholarly communication.
[ Parent ]
Oh good (3.50 / 2) (#33)
by ennui on Mon Jul 08, 2002 at 05:43:54 PM EST

I hereby request more places to request stories about stories about requesting stories and places to put them.

"You can get a lot more done with a kind word and a gun, than with a kind word alone." -- Al Capone
[ Parent ]
For that... (none / 0) (#39)
by Kaki Nix Sain on Tue Jul 09, 2002 at 09:08:23 PM EST

... we are going to need a special committee to work on where to put such places, another to look at what they should look like, another to decide on the content, and another to oversee them, crush all thought, get paid more, and so forth.

[ Parent ]

memetics (3.00 / 1) (#17)
by idea poet on Mon Jul 08, 2002 at 04:17:35 AM EST

This story is once again proof for me of the existence of meme like transmissions across the globe. This morning, whilst waking up, I was contemplating submitting a meta-story with exactly the same suggestion.

I just wanted to propose a one pager, somewhere on the K5 site, that summarises what each section on K5 is for. We often see comments to posts that ask for a story to be resectioned. It would be handy to have a reference guide.

In fact, a new-user welcome pack email with similar information, including community guides, would be a great addition to the K5 experience and would hopefully greatly improve the signal to noise ratio.

A newbie's guide exists - please help build it! (5.00 / 1) (#29)
by sebpaquet on Mon Jul 08, 2002 at 03:55:26 PM EST

See K5 Newbie's Guide on Ko4ting.

Sectioning tips are most welcome.
Seb's Open Research - Pointers and thoughts on the evolution of knowledge sharing and scholarly communication.
[ Parent ]

Algorithm suggestion (none / 0) (#20)
by salsaman on Mon Jul 08, 2002 at 06:33:48 AM EST

For each story a user (A) reads, imagine that they can give it a score of -100 to +100.

OK, loop through each other user (B). For each story that user A has rated, subtract B's score (unrated counts as 0). This gives you an 'affinity' score between user A and user B. An affinity of 0 means that A and B have rated all their stories exactly the same, a higher number means they have rated differently. We divide each score by the number of stories rated by A, and take the |abs| so we get an 'average affinity'.

Next we subtract or add this average affinity from all of B's other ratings. If B rated a story above 0, we subtract the aa. If B rated a story below 0, we add the aa.

We repeat this process for all users B, and at the end we should have a grand total for each story not rated by A.

Then we simply select the ten stories with the highest totals.

Users could set a multiplying factor for the average affinity. A higher multiplier would mean that stories liked by people who were closest to A would come up more. A lower value would make popular stories liked by more people come up.

Slight correction (none / 0) (#22)
by salsaman on Mon Jul 08, 2002 at 06:38:53 AM EST

When you subtract B's score from A's, take the |abs| then, rather than later on. Otherwise of course, -100 +100 and +100 -100 would cancel out.
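Here is one literal reading of the algorithm in Python, with the absolute value taken per story as corrected above. The `ratings` layout, the function names and the decision to leave a rating of exactly 0 unadjusted are my assumptions; treat it as a sketch, not a tuned implementation.

```python
def affinity(a, b):
    """Average absolute rating difference over the stories A rated.

    0 means A and B rated those stories identically; unrated counts as 0.
    """
    if not a:
        return 0.0
    return sum(abs(score - b.get(story, 0)) for story, score in a.items()) / len(a)

def recommend(ratings, user, top_n=10, multiplier=1.0):
    """Total B's affinity-adjusted ratings for every story A has not rated."""
    a = ratings[user]
    totals = {}
    for other, b in ratings.items():
        if other == user:
            continue
        aa = multiplier * affinity(a, b)
        for story, score in b.items():
            if story in a:
                continue
            if score > 0:
                score -= aa   # pull positive ratings down by the affinity
            elif score < 0:
                score += aa   # pull negative ratings up by the affinity
            totals[story] = totals.get(story, 0.0) + score
    return sorted(totals, key=totals.get, reverse=True)[:top_n]

# Hypothetical ratings on the -100..+100 scale.
ratings = {
    "a": {"s1": 100, "s2": -50},
    "b": {"s1": 100, "s2": -50, "s3": 80},
    "c": {"s1": -100, "s2": 50, "s4": 90},
}
print(recommend(ratings, "a"))  # s3 (from like-minded b) beats s4
```

Note that a large affinity can push a positive rating below zero, as happens to s4 here (90 - 150 = -60). The description does not say whether that should be clamped, so the sketch leaves it alone.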

[ Parent ]
How about.... (none / 0) (#27)
by Elkor on Mon Jul 08, 2002 at 02:54:16 PM EST

Instead of from -100 to 100, just use the ranking set currently in place for the votes. The rank goes from -1 to +2 (+2 being for FP votes).

Compare User A (you) against user B.

Every time they agree on a vote, the "correlation" goes up by 2. Every time they disagree, take the absolute value of the difference and subtract that from their score.

So: If they both vote +1, FP then the correlation goes up by 2.
If one votes +1 and the other votes Abstain, the correlation goes down by 1.

If one votes +1, FP and the other votes -1, then the correlation goes down 3.

If user A doesn't have a chance to vote on a story in the queue, it can be filtered according to the correlation scores of other people that did vote on it. If thurler and I have high correlation scores, then any story that he voted +1 on will appear higher in my list than any story he voted -1 on.
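A small sketch of that correlation rule, assuming votes are stored as -1 (dump), 0 (abstain), +1 (section) and +2 (front page), and that only stories both users voted on are compared. The weighting in `queue_score` is one guess at how the filtering could work; all names are hypothetical.

```python
def correlation(a_votes, b_votes):
    """+2 for each identical vote, minus the absolute difference otherwise."""
    total = 0
    for story in a_votes.keys() & b_votes.keys():
        diff = abs(a_votes[story] - b_votes[story])
        total += 2 if diff == 0 else -diff
    return total

def queue_score(correlations, story_votes):
    """Weight each voter's vote on a story by my correlation with them."""
    return sum(correlations.get(user, 0) * vote for user, vote in story_votes.items())

print(correlation({"x": 2}, {"x": 2}))   # both +1, FP: up by 2
print(correlation({"x": 1}, {"x": 0}))   # +1 vs abstain: down by 1
print(correlation({"x": 2}, {"x": -1}))  # +1, FP vs -1: down by 3
```

Sorting the queue by `queue_score` puts stories that high-correlation users voted up ahead of ones they voted down.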

This actually doesn't have anything to do with maintaining old content, but it seemed better than giving users another thing to vote on.


"I won't tell you how to love God if you don't tell me how to love myself."
-Margo Eve
[ Parent ]
Too much granularity; scaling problems (none / 0) (#42)
by mcphee on Sat Jul 20, 2002 at 02:22:37 PM EST

The basic idea seems nice enough, but I think that a -100 to +100 scale is way too fine-grained to be useful. What in the world is the difference between a +78 and a +79? Keeping it simple (1-5 or 1-7) would seem preferable. (There's considerable research on this question of scale granularity. Anyone familiar with it care to comment?)

I would also worry slightly about the scaling of this idea, since it's O(N^2) in the number of users. I'm sure Amazon doesn't do O(N^2) algorithms on their user database, and I doubt K5 wants to either.

[ Parent ]

How would this be different... (4.00 / 1) (#23)
by bobpence on Mon Jul 08, 2002 at 07:20:03 AM EST

... from the Citizen Tracking Cards (a.k.a. grocery store club/affinity/membership cards) that have everyone in this story up in arms?

I mean, I don't like it when someone purporting to be affiliated with my credit card company calls offering me a $99 yearly membership in a bogus discount club (30 days free!); but I do like the 20% off coupon that Borders sent me through Akamai. (I liked the one they mailed me recently, too.)

So do K5ers like being profiled or not? More than twice as many are voting this story up as down, even though there is no mechanism suggested to opt-out of the story recommendation system. Is it a matter of convenience that determines when we want our data mined? Do you like seeing the Amazon suggestions, but hate getting emails about your inadequacies?
"Interesting. No wait, the other thing: tedious." - Bender

Who is watching, and who has control (5.00 / 2) (#24)
by rusty on Mon Jul 08, 2002 at 09:43:32 AM EST

I think the difference is who is doing the watching, and what information you choose to reveal about yourself to that person or organization. Here we're talking about some way for readers to explicitly say "I like/dislike this story" and also explicitly request the system to keep track of this information and use it to suggest other stories you'd like. This information is only tied to you personally by an email address (which I think everyone trusts me to keep secret) which may or may not even be traceable to you personally, and by any information which you chose to reveal publicly in comments or stories.

Your other example is a case where you reveal personal information, typically in the process of (and specifically for the purpose of) making a purchase, and the organization you gave that information to then turns around and considers it their "property" to sell or trade on the open market, to their "partners" who you never had any intention of doing business with, but had no option of withholding your information from. That is an entirely different ball of fish, I think.

So limited opt-in profiling with clear benefits and rules is good, profiling that is done invisibly and without any consent is bad.

Not the real rusty
[ Parent ]

An automatic version (5.00 / 3) (#25)
by tmoertel on Mon Jul 08, 2002 at 10:30:03 AM EST

A few months ago, I posted to my diary a definition for a system that automatically finds interesting information on K5. It deduces interest based on threading so that users don't need to say "I like this" repeatedly to earn decent recommendations.

Maybe it's time to revisit the issue.

My blog | LectroTest

[ Disagree? Reply. ]

It is! [nt] (none / 0) (#30)
by sebpaquet on Mon Jul 08, 2002 at 04:06:03 PM EST

Seb's Open Research - Pointers and thoughts on the evolution of knowledge sharing and scholarly communication.
[ Parent ]
3 comments (3.00 / 2) (#26)
by tps12 on Mon Jul 08, 2002 at 11:18:51 AM EST

  1. I don't think k5 is big enough right now for this kind of thing to be useful or feasible. Amazon has a huge data pool from which to draw, orders of magnitude beyond k5's. The hard-core k5 users to which these kinds of features might appeal probably read most of what's posted here anyway.
  2. Blasphemer! You claim there is music like Aphex Twin's!
  3. I'm the straw that broke the camel's back.

Ama5on.com (4.00 / 2) (#31)
by marktaw on Mon Jul 08, 2002 at 04:18:00 PM EST

Joking aside, this sounds like another attempt to, generically speaking:

1) automate an otherwise manual process i.e. using a computer to create community and group knowledge
2) find new ways to categorize data
3) exercise our statistical/decision making/data mining minds
4) err.. invade... err.. kuro5hin... err... oh forget it

If we're determined to do this, why don't we use the already public voting mechanism?

_Members who voted for/against this article also voted for/against the following articles:_

All the back end processing would be the same as the above proposed.

Also a "one year ago today" and "most popular articles" as determined by views if they're available or number of comments if they're not.

Another collaborative filtering system (3.00 / 1) (#32)
by EricBoyd on Mon Jul 08, 2002 at 04:33:17 PM EST

is StumbleUpon. Choose your interests, rate some websites, and let it show you stuff you'll like.

not reliable with IE6 (none / 0) (#35)
by tbc on Tue Jul 09, 2002 at 03:51:07 AM EST

It causes my Win2K system to max out the CPU. The problem went away when I uninstalled it.

[ Parent ]
I really want to see this succeed... (none / 0) (#40)
by tbc on Thu Jul 11, 2002 at 01:02:59 AM EST

Eric and I have been corresponding. (Nice working with someone who has a clue.)

[ Parent ]
search (1.50 / 8) (#34)
by dirvish on Mon Jul 08, 2002 at 06:57:49 PM EST

Try the search link you dumbass.

Technical Certification Blog, Anti Spam Blog
Finding interests (4.50 / 2) (#36)
by Beltza on Tue Jul 09, 2002 at 10:54:38 AM EST

Could this system also be used to find users with the same interests? Could it give me a list of persons who have read the same articles I have and rated the same articles up or down?


