Kuro5hin.org: technology and culture, from the trenches
Opening the K5 Database

By recursive in Meta
Fri Jan 19, 2001 at 11:21:48 AM EST
Tags: Kuro5hin.org

The publishing model of kuro5hin.org (k5) is driven by its users, who publish, rate, and comment on articles. Users also develop and maintain the software that runs the k5 web site. That software focuses on the moderation and publishing of a set of current articles through an HTTP server. Old articles that have disappeared from the front page can still be searched, but mining data from k5 is difficult in general. Opening the underlying k5 database, at least partially, would introduce a second interface to the content and make a great number of third-party services possible.


Articles and comments posted on k5 are interesting not only in themselves, but also for the many relations they form within the k5 community. Access to this meta-data would make new kinds of services possible without implementing them directly in Scoop, the publishing system:

  • Statistics: number of users, most active users, average article length, average rating for articles on the front page vs. in sections.
  • Identification of user groups: given a user, find all users with similar voting patterns.
  • Uncovering the mojo mafia: is there a group of people who constantly moderate each other's articles up?
  • Syndication: provide selected headlines or articles in a different context to promote k5.
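
To make the "similar voting patterns" idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the layout of the vote data, the user names, and the choice of simple agreement (the fraction of shared stories on which two users cast the same vote) as the similarity measure. It is not how Scoop stores or compares votes.

```python
from typing import Dict, List, Tuple

# Hypothetical data layout: user -> {story_id: vote}, with votes of +1 or -1.
Votes = Dict[str, Dict[int, int]]


def agreement(a: Dict[int, int], b: Dict[int, int]) -> float:
    """Fraction of the stories both users voted on where the votes match."""
    shared = a.keys() & b.keys()
    if not shared:
        return 0.0
    same = sum(1 for story in shared if a[story] == b[story])
    return same / len(shared)


def similar_users(votes: Votes, user: str, min_shared: int = 2) -> List[Tuple[str, float]]:
    """Rank other users by agreement with `user`, skipping pairs with little overlap."""
    mine = votes[user]
    ranked = []
    for other, theirs in votes.items():
        if other == user or len(mine.keys() & theirs.keys()) < min_shared:
            continue
        ranked.append((other, agreement(mine, theirs)))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Made-up example votes.
votes = {
    "alice": {1: 1, 2: 1, 3: -1},
    "bob": {1: 1, 2: 1, 3: -1},
    "carol": {1: -1, 2: 1, 3: 1},
    "dave": {9: 1},
}
print(similar_users(votes, "alice"))
```

The `min_shared` cut-off is there because two users who overlap on a single story would otherwise look like perfect matches; where to set it is a judgment call.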

All this information can be derived from the data stored in the k5 MySQL database. A MySQL database can be accessed remotely using a MySQL shell or from the many programming languages that provide MySQL bindings, such as Perl or PHP. These languages would likely be used to provide some of the services mentioned above. The result would be a number of k5-associated sites providing k5-related services that are implemented and maintained independently of k5. Since the MySQL interface already exists, no changes to existing software at k5 would be required.
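
As a rough sketch of what such a third-party query could look like, the following Python example computes a simple statistic and looks for pairs of users who rate each other up. It is only an illustration under stated assumptions: the tables `comments` and `ratings` and their columns are hypothetical and not the actual Scoop schema, and Python's built-in `sqlite3` module stands in for a MySQL binding so that the example runs without a server.

```python
import sqlite3

# Stand-in for the k5 database. Table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE comments (id INTEGER PRIMARY KEY, author TEXT);
    CREATE TABLE ratings  (comment_id INTEGER, rater TEXT, rating INTEGER);
    INSERT INTO comments VALUES (1, 'ann'), (2, 'ben'), (3, 'cat');
    INSERT INTO ratings VALUES
        (1, 'ben', 5), (2, 'ann', 5),   -- ann and ben rate each other up
        (3, 'ann', 2), (1, 'cat', 3);
""")


def average_rating_by_author(db):
    """Average rating received, per comment author."""
    rows = db.execute("""
        SELECT c.author, AVG(r.rating)
        FROM comments c JOIN ratings r ON r.comment_id = c.id
        GROUP BY c.author ORDER BY c.author
    """)
    return rows.fetchall()


def reciprocal_raters(db, threshold=4):
    """Pairs of users who each rated a comment of the other at or above `threshold`."""
    rows = db.execute("""
        SELECT DISTINCT r1.rater, c1.author
        FROM ratings r1
        JOIN comments c1 ON c1.id = r1.comment_id
        JOIN ratings r2 ON r2.rater = c1.author
        JOIN comments c2 ON c2.id = r2.comment_id AND c2.author = r1.rater
        WHERE r1.rating >= ? AND r2.rating >= ? AND r1.rater < c1.author
        ORDER BY r1.rater
    """, (threshold, threshold))
    return rows.fetchall()


print(average_rating_by_author(conn))
print(reciprocal_raters(conn))
```

A pair showing up in `reciprocal_raters` proves nothing by itself; like-minded users will rate each other up for perfectly honest reasons. The four-way join is also exactly the kind of query that could load a live server, which is worth keeping in mind.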

The k5 database contains sensitive data, like users' real email addresses and their passwords. These, of course, should not be made readable from the outside. Fortunately, MySQL offers fine-grained access control at the level of users, hosts, databases, tables, and columns. Which information should be made available is a matter for a separate discussion.

An open database allows anyone with access to run complex queries that might harm the functionality of k5. If this became an issue, a second database, synchronized nightly with the primary one, would be a possible solution. Read access would then be granted only to the second database, where a high load would not matter. This would rule out real-time services that rely on up-to-the-minute information. Most of the services mentioned, however, explore the general structure of the k5 community, which changes much more slowly.
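
The combination of a restricted view and a separate, periodically refreshed copy could be sketched as follows. This is an assumption-laden illustration, not a description of any existing k5 tooling: the `users` table, its columns, and the `PUBLIC_COLUMNS` whitelist are hypothetical, and `sqlite3` again stands in for MySQL. The point is only that sensitive columns never reach the public copy.

```python
import sqlite3

# Hypothetical whitelist: only these columns are copied to the public replica.
PUBLIC_COLUMNS = {"users": ["uid", "nickname"]}


def build_snapshot(primary, replica):
    """Rebuild the replica from the whitelisted columns of the primary database."""
    for table, columns in PUBLIC_COLUMNS.items():
        cols = ", ".join(columns)
        replica.execute(f"DROP TABLE IF EXISTS {table}")
        replica.execute(f"CREATE TABLE {table} ({cols})")
        rows = primary.execute(f"SELECT {cols} FROM {table}").fetchall()
        marks = ", ".join("?" for _ in columns)
        replica.executemany(f"INSERT INTO {table} VALUES ({marks})", rows)
    replica.commit()


# Made-up primary database containing sensitive columns.
primary = sqlite3.connect(":memory:")
primary.execute("CREATE TABLE users (uid INTEGER, nickname TEXT, email TEXT, passwd TEXT)")
primary.execute("INSERT INTO users VALUES (1, 'recursive', 'r@example.org', 'secret')")

replica = sqlite3.connect(":memory:")
build_snapshot(primary, replica)
print(replica.execute("SELECT * FROM users").fetchall())
```

Run from a nightly job, a script along these lines would keep heavy third-party queries away from the server that runs the site, at the cost of data that is up to a day old.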

Opening the k5 database would unlock its full potential: new services would become available without having to implement them in Scoop. The K5 Associate Program that this would trigger seems to be unique at the moment. However, given the trend towards exchanging information using XML, we will probably find such a program on a server near us real soon.

Poll
Read access to the k5 database for ...
o k5 friends and family 28%
o me 9%
o rusty 62%

Votes: 53

Opening the K5 Database | 24 comments (23 topical, 1 editorial, 0 hidden)
database server (3.90 / 10) (#1)
by Defect on Fri Jan 19, 2001 at 09:10:39 AM EST

opening any site up to the level i understand you want is pretty much screaming "i want to die." Opening up a database where queries could be shot at it left and right is almost suicide. Throwing complex multiple join select queries from several computers would kill the server, or at least bog it down to the point of it being useless as a web server, regardless of whether or not a couple of restrictions were in place. (it would still tie up the computer regardless of whether or not it was a separate database, unless you sync'ed it with another server somewhere, and then i think that's getting way out of control for something that shouldn't be done in the first place)

I think you're missing something about scoop, in that it wasn't meant to enable such data retrieval. I think it works fine right now, and if you need specific information, ask one of the admins.
defect - jso - joseth || a link
Yep (4.00 / 2) (#4)
by goonie on Fri Jan 19, 2001 at 09:32:28 AM EST

I agree that making this kind of database querying live on K5 for all and sundry would a) have privacy implications, and b) allow all sorts of nasty "DoS in a table join" attacks.

However, making a snapshot of the database available and letting a small group of people (including some statisticians) loose on it would probably reveal all sorts of interesting snippets of information. I know I'd be interested in what sort of patterns people could find.

[ Parent ]

Close database in an open community? (none / 0) (#20)
by recursive on Fri Jan 19, 2001 at 02:14:35 PM EST

opening any site up to the level i understand you want is pretty much screaming "i want to die." Opening up a database where queries could be shot at it left and right is almost suicide.
Do these technical concerns mean that only very few people should have access to the data that was provided by the community? I agree that a totally open database bears too many risks, technically but also with respect to privacy and copyright. But a totally closed database also seems contradictory to the otherwise open approach of k5.

-- My other car is a cdr.


[ Parent ]
Prelim thoughts (3.75 / 4) (#2)
by slaytanic killer on Fri Jan 19, 2001 at 09:15:25 AM EST

Very interesting idea. But keep in mind that the point of an interface is to limit something's power of expressiveness. The power of things like object-oriented programming is in their restrictions.

I understand we can currently do this (very inefficiently) using spiders, and there can be great potential to improve K5 using different views on the data, such as skins and more informative interfaces. But at the same time, these changes would make it easier for bots to operate here, since K5 is optimized for humans reading it and not programs. The more regular and standard something is, the easier it is to program a machine that can navigate it.

And plus, things may become too predigested. Statistics can be gathered from users (and who really cares about comment mods, really?), leading them to be easily pigeonholed. Numbers can be the worst form of abstraction. Huge fragmentation can result.

But who knows? This can be an amazing idea. Or it can spell death for Kuro5hin. Or evolution into something completely unforeseen.

Use, cost, and maybe retribution? (4.33 / 3) (#3)
by duxup on Fri Jan 19, 2001 at 09:26:34 AM EST

My first concern is how this data would be used. Much of the data suggested is statistical, yet everyone knows you can make statistics say anything you want to. So of what use would these be?

My second is cost. How much would a second synchronized database cost? Can/will K5's sponsors absorb that cost? What if syndication becomes so popular that the cost becomes prohibitive?

My third (and most worrying) concern is about your comment on the "mojo mafia." How do you identify such a group? I have a friend who frequently lurks on K5, never posts, often finds my comments insightful and moderates me up frequently. Is this just a shell account created by me? Can you tell the difference? Isn't it common for like-minded individuals to see eye to eye and often vote on articles and moderate accordingly?

Let's then say you identify some members of this "mojo mafia", what then? Do you propose some sort of punishment? If so I'll take a bit of a previous comment I posted when Rusty made changes so that we could see who moderates us (something I agreed with):

"One concern I'd like to voice is that many users have been calling for some sort of retribution directed at people who abuse the rating system. I'm very concerned with such a plan. I don't believe that we can reliably yet identify moderation abuse. I know now we will be able to see who moderated what comment at what level, but I do not believe that alone is enough to prove any kind of abuse.

Many people on k5, including myself, came here from Slashdot, and have very strong feelings about how moderation and moderation abuse should be handled. I worry that moderation is becoming almost as big a discussion point as the site's regular topics. So much emphasis is being placed on how we're moderated that we're in danger of starting a moderation witch hunt where the solution to our current problems is worse than the problems themselves."

Examples (none / 0) (#7)
by recursive on Fri Jan 19, 2001 at 10:25:44 AM EST

Let's then say you identify some members of this "mojo mafia", what then? Do you propose some sort of punishment? If so I'll take a bit of a previous comment I posted when Rusty made changes so that we could see who moderates us (something I agreed with)
The examples given are just that: examples. I have absolutely no interest in punishing anybody for anything on k5. Most of the services mentioned there could be implemented using the existing (HTTP) interface, although it would be much more difficult. My main point was that an open database offers an opportunity for value-added services -- whatever those might be. In addition, I also feel that opening selected parts of the database would give a better balance of power between the current maintainers and the community.

DoS attacks are of course a real concern. One way to go would be explicit database accounts and a policy on how they should be used. Any violation of this would lead to closing the account.

-- My other car is a cdr.


[ Parent ]
Mojo for the goose is mojo for the gander (none / 0) (#8)
by the trinidad kid on Fri Jan 19, 2001 at 10:47:20 AM EST

The core premise of Kuro5hin is that there are 2 classes of user - us (have a sign-on, subscribe to the community, produce the site, etc, etc) and them (don't have a sign-on, read 'our' output). The overwhelming worry then is some 'deviant' us-es misrepresenting the 'real' us-es to 'them'.

Implicit in this model is social coercion of some sort - the voting/moderation system - and the focus is on noise reduction.

It is clear (from pop sociology) that if there developed internal pressure on site to change tack (say from technology and culture to culture after technology) the degree and nature of that coercion would increase dramatically with bitterly opposed camps emerging. Noise would go from being poor quality to opinions I disagree with.

Another option is to filter out noise of either type rather than suppress it. The way this would work would be to allow individual users to 'train' the system as to whose opinions they were interested in and whose opinions they never wanted to see again, with a web of trust (ie partly trust people that people I trust trust and vice versa [try saying that when in drink]) and ranking by rating.

The downside of that is that several communities superimposed on each other might develop, with little or no communication between them, and the notion that there is an 'us' that publishes coherently to 'them' disappears. The major loss would be in the presentation of the community to the new user, who would find no coherence (or even quality) until they had trained the system to their preference. The 'them' experience would be completely destroyed. However, as there is a promotion vector from 'them' to 'us' to grow and develop the site, eventually the 'us' experience would decay also.

However if the core system (the producer, the 'us') was merely a data repository for a series of display sites (the reader, the 'them') and the transition from reader to producer was mediated in some other way then it might be possible to have Kuro5hin serving a variety of (potentially antagonistic) communities with content being developed and promoted to a series of external content sites...

(None of the above is to be taken to mean that I personally recommend any particular course of action as I quite like K5 as it is...)

[ Parent ]
two purposes for opening the database (3.00 / 1) (#6)
by Jim Madison on Fri Jan 19, 2001 at 10:12:23 AM EST

Great idea; here are my thoughts. There are two reasons to open the database: (1) to allow for analysis and (2) to open new syndication possibilities. These cases have different requirements.

In the first case, it doesn't matter if the data is timely or comprehensive. An analytical database could be all the activity from Nov-Dec and that may be enough to track repeat usage, etc. Or it might only have the activity of 100 randomly selected users over the last year.

In the second case, having access to the latest content matters a lot and perhaps even having write permission (e.g., posting comments directly to the database). I think this question is pretty tricky in fact, but very important. For example, I run Quorum.org, a political community site that has quite a lot of philosophical and technical similarities with K5. I'd love to share the politics conversations across the two sites, whereas the tech discussions would not be as relevant to our users. Perhaps some of our users would prefer the K5 interface to read and interact with our content or some of the k5 folks would want to read this content through our interface. In any event, this could become possible if the database--or access to some aspects of the database--were open.

We're looking at it from the perspective of creating "mini-quorums". It would be as if anybody could create a high-level tab like "freedom & politics" that users could choose from, with special attention to try to get locally-oriented communities.

I look forward to seeing the guts of k5!

Got democracy? Try e-thePeople.org.

Interesting idea, but... (4.66 / 3) (#9)
by Luke Francl on Fri Jan 19, 2001 at 10:59:23 AM EST

Instead of providing read access to the database (which could slow it down considerably), why not just provide XML feeds to the data which is of interest? This would open up the database for users who don't speak MySQL.

After all, isn't that what XML is for?

XML vs. DB access (none / 0) (#10)
by recursive on Fri Jan 19, 2001 at 11:13:08 AM EST

why not just provide XML feeds to the data which is of int[e]rest?

When an XML feed is specialized for a certain aspect it requires work for each of these aspects on the server side. And when it is very general (i.e. includes big parts of the database) it performs much worse than raw access to the database. XML is probably good for real-time access to the actual set of articles, but not for the more subtle interactions between a huge number of older articles.


-- My other car is a cdr.


[ Parent ]
Yes, but... (4.00 / 1) (#16)
by sab39 on Fri Jan 19, 2001 at 12:12:25 PM EST

As others have said, in general the set of services requiring up-to-the-minute information is distinct from the set of services requiring large datasets. A number of XML reports could be made available, ranging from the RSS summary of current stories, through "all recent stories in all sections" and "all comments in storyid=foo", up to the "big-mother-file-of-everything-that-ever-happens". The first couple of files should probably be produced "up-to-the-minute" (generated on demand, much as the content of the site is now), but the last would probably be fine produced on a monthly basis for statistical analysis.

The first kind of file would probably be used directly as XML - either through XSLT or through something like an RSS parser. The big-mother file would usually be dumped into another database (using XML means that there isn't even any need for this other database to be MySQL - it could be Postgres or Oracle) and analyzed using tools on that database, with no load on the k5 server at all.

I do like this idea, because it allows for the possibility of experimentation by third parties. For example, if I wanted to create a "k5-a-like" that used an advogato-based trust metric rooted at myself that would tell me whose comments I should read, that could be done (and realtime) based on the first kind of files. Statistical analysis as discussed by the original poster would be possible from the big-mother-file.

Stuart.
--
"Forty-two" -- Deep Thought
"Quinze" -- Amélie

[ Parent ]
Boy that would be great ... (4.00 / 1) (#11)
by tetsuo on Fri Jan 19, 2001 at 11:18:24 AM EST

... if we could all always act like adults here.

But we can't. I know that, you know that. We've all seen allegations flung far, false and true. We've seen grudges and bitter spats.

I fear opening up the DB's would expose several "cabals"; groups which tend to think the same, post the same. There's enough retribution already, I can't imagine wanting more.

Let Me Hit The Delete Key (4.80 / 5) (#12)
by Seumas on Fri Jan 19, 2001 at 11:18:39 AM EST

If something like this were done, I would have a strong desire to have a link which lets me easily delete my selected posts -- or even all of them. Many of us contribute and participate in K5 with the understanding that the data will be used solely on this site without any other granted rights (in fact, this is basically what the FAQ claims). I may or may not have much of a problem with making my comments as publicly available as, say, a Usenet post -- but I would still like the option to remove my comments. I do offer a lot of comments here that I would not be so likely to offer should they be in the full view of every bozo on the internet. I enjoy and respect most of the people and their replies/discussions on this site, which is why I take liberties in some of the discussions whereas in a more public arena, I may be more tight-lipped.

Providing access to the backend so you can generate interfaces for it is one thing -- as long as that interface is limited to some sort of search capability. Anything beyond that (say, syndicated 'columns' on other sites from our database and contributions) and K5 would lose some of its credibility.
--
I just read K5 for the articles.

No. (none / 0) (#15)
by Signal 11 on Fri Jan 19, 2001 at 12:07:34 PM EST

No, you can't just delete a post. Perhaps adding a retraction statement to it, with an optional explanation.. but something that's posted stays posted, even if we regret it later.


--
Society needs therapy. It's having
trouble accepting itself.
[ Parent ]
I Disagree (5.00 / 3) (#19)
by Seumas on Fri Jan 19, 2001 at 02:06:22 PM EST

I disagree. If the statements in the FAQ are no longer true in reality, then one should have every right to remove one's comments. For example, the FAQ states that messages still belong to the author but may be used and reproduced in other areas, all within K5. If they were instead made available through some sort of external service or syndicated display (a la the BBC slashbox on Slashdot, for example), then one should be able to back out of the "deal", just as K5 would have backed out of the "deal" it agreed to in its FAQ.

Now, one could argue "but the information would still be available only on K5! The other sites are just linking to / borrowing segments from / making available to a broader audience -- the information!". Yet there is a great difference between suggesting that your articles, comments and diary entries are for the use of the K5 membership and will only be used on this site -- and saying "your stuff will only be used on this site, but we'll have templates and searches and other forms of access to our database to anyone in the world who wants to build their own site or service around our database".

If one party (K5 in the example) were to change the agreement, the other party (posters and submitters) should be able to physically remove all copies of the information they provided under the original agreement.
--
I just read K5 for the articles.
[ Parent ]

Nice and dandy BUT ... (3.00 / 1) (#13)
by Dries on Fri Jan 19, 2001 at 11:23:29 AM EST

It all sounds nice and dandy to me, BUT what services are you talking about? I admit that the idea of opening the database sounds exciting but could you name me 5 services that are as exciting? What can you do with the data except for displaying it on a mirror-like site or in a minimalistic version?

So face the facts and question yourself: is there a "market" for K5 services? And if not, what is the purpose of opening the database as such?

It all sounds very exciting but maybe it isn't that funky or useful at all...


-- Dries
XML output (4.00 / 1) (#14)
by CaseyB on Fri Jan 19, 2001 at 11:28:55 AM EST

Allowing connections straight into the database is probably not smart, mostly because of the security implications.

But an XML interface, even if it were of a very simple "URL request returns XML output" variety (as opposed to a full blown "submit xml query, get xml response") would be incredibly useful. I'd love to work on a customized kuro5hin client that parsed and displayed the stories in any way I wanted.
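
For illustration only, the simple "returns XML output" variety might look roughly like the sketch below, written in Python with the standard `xml.etree.ElementTree` module. The element and attribute names (`stories`, `story`, `title`, `author`, `id`, `section`) and both function names are invented for this example; no such k5 feed format is defined anywhere in this discussion.

```python
import xml.etree.ElementTree as ET


def stories_to_xml(stories):
    """Server side: serialize a list of story dicts into a small XML document."""
    root = ET.Element("stories")
    for story in stories:
        node = ET.SubElement(root, "story", id=str(story["id"]), section=story["section"])
        ET.SubElement(node, "title").text = story["title"]
        ET.SubElement(node, "author").text = story["author"]
    return ET.tostring(root, encoding="unicode")


def titles_from_xml(document):
    """Client side: pull the story titles back out of the feed."""
    return [story.findtext("title") for story in ET.fromstring(document).iter("story")]


# Made-up feed content based on this story's own metadata.
feed = stories_to_xml([
    {"id": 1, "section": "Meta", "title": "Opening the K5 Database", "author": "recursive"},
])
print(feed)
print(titles_from_xml(feed))
```

A custom client would only need the second half: fetch the document from a URL, parse it, and display the stories however it likes.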

discussions on #k5 (none / 0) (#17)
by phunbalanced on Fri Jan 19, 2001 at 12:21:10 PM EST

we talked about this there. Basically we thought the best bet would be an hourly updated xml transport. that would solve any security / load issues. ( or maybe that's just what I left the conversation with )

Interesting idea. (4.66 / 6) (#18)
by rusty on Fri Jan 19, 2001 at 01:36:09 PM EST

I'm of several minds about this. Ok, there are a few points brought up that concern me. The performance thing of course-- no way would I open the live DB to the public, even with extremely limited SELECT privileges only. It would be way too easy to DoS the real site with a few repeated queries. But assume that we just replicated it off to another box at intervals instead.

The second big concern I'd have is the copyright thing. The FAQ explicitly says that your comments are submitted for one-time publication on the page you posted them on here at K5 (and any associated search pages here), and that's it. It seems to me that opening the database to anyone is tantamount to granting rights to anyone to get and display that content any way they like, which are rights I basically don't have.

The argument could be made that if you post something on the web, it's bound to be spidered, cached, and otherwise widely used without explicit permission anyway, and this may be so. But I don't feel like that's any kind of ethical defense for me, you know? The question is, Do *I* have the right to grant anyone else rights to use all this data any way they like? I don't think I do. So that would be an issue.

I'm not actually opposed to the idea, despite the foregoing. I like using Scoop to interface the DB, but data is data, and I know at least a few people who could do interesting stuff with the K5 data. Mainly, I just think it needs a little more thought, especially in the field of copyright issues.

____
Not the real rusty

copyright issues (5.00 / 2) (#21)
by Delirium on Sun Jan 21, 2001 at 05:12:59 AM EST

Well, I don't think the comments copyright is really an issue relating to opening the database. The comments can already be mined as it is now by a few Perl scripts crawling kuro5hin.org. Sure, opening the database would make this easier, but it wouldn't actually open up any information that isn't already open. The copyright issues would remain the same - while you can crawl k5 with some scripts and repost the comments elsewhere, this would be a copyright violation. Same for reposting comments from an opened database.

So, in short, I don't think granting access to the database changes the copyright situation - as it is now, republication of comments is technically feasible, but forbidden. In the case of an opened database, you'd have the exact same situation, it's just that someone violating copyright could make do with easier-to-write Perl scripts. So I don't think this is enough to justify keeping the db closed.

[ Parent ]

Copyright (3.00 / 3) (#22)
by QuoteMstr on Sun Jan 21, 2001 at 09:37:21 AM EST

The database would allow them to do nothing they could not do by parsing HTML --- it would just allow them to do it *much* less painfully.

[ Parent ]
Hellooo Data Miners (2.00 / 1) (#23)
by yonderboy on Mon Jan 22, 2001 at 03:19:26 AM EST

Data mining. It's a BadThing(tm).

Opening the k5 DB would only screw everyone in the DB. It's a bad idea.

Open the database via nntp? (4.00 / 1) (#24)
by mumble on Mon Jan 22, 2001 at 07:30:35 AM EST

People are all suggesting xml and related web-based interfaces. What about nntp? Surely it is the perfect system to distribute comments. If not nntp, a very similar thing could be done with a mailing list....

Set it up with an nntp server sitting on kuro5hin.org:119, and then those wishing to download just hook up to it and suck down what interests them. Presumably the comments would not be distributed onward, and they most certainly would not be distributed to the rest of usenet. Though hopefully there would be third party web interfaces to the k5 database. Otherwise, what is the point?

The great thing is, this would allow people to choose other interfaces to k5 (either a newsreader of their choice, or some other interface.) The down side is, k5 might lose some community feel to it, or there might be problems with regard to advertising (maybe add a text advert to the bottom of posts?)

To help sort the different types of posts that would be coming out of the k5 newsserver, you would need some custom headers. Here's my guess:

X-K5-story: story_title:id
X-K5-accepted: (front-page | section:section_name | pending | no)
X-K5-comment: comment_id
X-K5-vote: (story | comment) { has id:vote in body of text}

Have a separate "newsgroup" for each k5 section, including a newsgroup called "everything" that is always cross-posted to.

The admin then chooses the nntp gateway to be either:
read-only,
read-write, no story or comment voting.
read-write, story and comment voting supported.
and depending on level of abuse, you can drop the third and second options.

Comments are posted by using kuro5hin.org:119 directly. And have the appropriate headers added by the k5 server before they are re-distributed to both those reading from nntp and http.
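
As a rough illustration of the server adding those headers, here is a sketch in Python using the standard `email.message` module to build one article. The newsgroup names (`k5.meta`, `k5.everything`), the function name, and the example address are all made up; only the `X-K5-story` and `X-K5-comment` headers come from the guess above.

```python
from email.message import EmailMessage


def comment_to_article(section, story_title, story_id, comment_id, author, body):
    """Wrap a k5 comment in a news article carrying the proposed X-K5 headers."""
    msg = EmailMessage()
    msg["From"] = author
    # Cross-post to the section's group and to the catch-all "everything" group.
    msg["Newsgroups"] = f"k5.{section},k5.everything"
    msg["Subject"] = f"Re: {story_title}"
    msg["X-K5-story"] = f"{story_title}:{story_id}"
    msg["X-K5-comment"] = str(comment_id)
    msg.set_content(body)
    return msg


# Made-up example values.
article = comment_to_article(
    section="meta",
    story_title="Opening the K5 Database",
    story_id=42,
    comment_id=24,
    author="mumble <mumble@example.invalid>",
    body="Open the database via nntp?",
)
print(article)
```

Since a story title may itself contain a colon, a reader of `X-K5-story` would have to split on the last colon to recover the id, which is one of the details still to pin down.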


As far as plain text comments go, usenet scales very well. So the nntp gateway could also be a very convenient way to connect several different websites with scoop engines that overlap on some stories and comments, but not others.

There are more details to pin down, but what do people think?

-----
stats for a better tomorrow
bitcoin: 1GsfkeggHSqbcVGS3GSJnwaCu6FYwF73fR
"They must know I'm here. The half and half jug is missing" - MDC.
"I've grown weary of googling the solutions to my many problems" - MDC.

All trademarks and copyrights on this page are owned by their respective companies. The Rest © 2000 - Present Kuro5hin.org Inc.
See our legalese page for copyright policies. Please also read our Privacy Policy.
Kuro5hin.org is powered by Free Software, including Apache, Perl, and Linux, The Scoop Engine that runs this site is freely available, under the terms of the GPL.