The MDC k5 index

By mumble in Meta
Sat Feb 11, 2012 at 04:43:50 PM EST
Tags: MDC, moar Crawdocs?, Crawfordology, k5 index, MDC Corpus, rusty, shell scripts (all tags)

The original text: http://pastebin.com/FKQfhCuv
With all this Crawforditis, Crawdocs, and Crawfordology around this place, I decided to contribute my 5c.
I thought it might be interesting to slurp down all of Crawford's posts to k5.
And so was born:
$ ./dump-user.sh

Usage: ./dump-user.sh c|d|s|all count|all user [index] [silent]

First parameter:
c for comments only
d for diaries only
s for stories only
all for comments + diaries + stories.

Second parameter:
count is the number of posts to fetch, or all for all of them.

Then come the k5 username, an optional index-only flag, and an optional silent-URLs flag.

Let's do some examples to show how this all works:


First up, Rusty, with 2 each of comments, diaries, and stories:
======
$ ./dump-user.sh all 2 rusty
Selected types: comments diaries stories
rusty: comments: 15963, diaries: 412, stories: 223
rawuser = rusty, type = comments, count = 2, pages = 1, k5type = comment_by, grepswitch = -C3, extractfn = extract-comment-data.sh
http://www.kuro5hin.org/?op=search&offset=0&old_count=30&search=Search&count=30&type=comment_by&section=&string=rusty
Found 2 comments.
1: Downloading: http://www.kuro5hin.org/comments/2012/1/2/04942/47617/5#5 using k5id = 2012-1-2-04942-47617-5#5
2: Downloading: http://www.kuro5hin.org/comments/2011/12/23/17501/965/43#43 using k5id = 2011-12-23-17501-965-43#43
-----
rawuser = rusty, type = diaries, count = 2, pages = 1, k5type = diary_by, grepswitch = -B2, extractfn = extract-story-data.sh
http://www.kuro5hin.org/?op=search&offset=0&old_count=30&search=Search&count=30&type=diary_by&section=&string=rusty
Found 2 diaries.
1: Downloading: http://www.kuro5hin.org/story/2011/12/23/17501/965 using k5id = 2011-12-23-17501-965
2: Downloading: http://www.kuro5hin.org/story/2011/12/15/10813/898 using k5id = 2011-12-15-10813-898
-----
rawuser = rusty, type = stories, count = 2, pages = 1, k5type = author, grepswitch = -B2, extractfn = extract-story-data.sh
http://www.kuro5hin.org/?op=search&offset=0&old_count=30&search=Search&count=30&type=author&section=&string=rusty
Found 2 stories.
1: Downloading: http://www.kuro5hin.org/story/2003/6/27/132927/446 using k5id = 2003-6-27-132927-446
2: Downloading: http://www.kuro5hin.org/story/2003/4/14/102135/324 using k5id = 2003-4-14-102135-324
-----
======
Next, 5 Rusty comments:
======
$ ./dump-user.sh c 5 rusty
Selected types: comments
rusty: comments: 15963, diaries: 412, stories: 223
rawuser = rusty, type = comments, count = 5, pages = 1, k5type = comment_by, grepswitch = -C3, extractfn = extract-comment-data.sh
http://www.kuro5hin.org/?op=search&offset=0&old_count=30&search=Search&count=30&type=comment_by&section=&string=rusty
Found 5 comments.
1: Skipping: http://www.kuro5hin.org/comments/2012/1/2/04942/47617/5#5
2: Skipping: http://www.kuro5hin.org/comments/2011/12/23/17501/965/43#43
3: Downloading: http://www.kuro5hin.org/comments/2011/12/29/215841/46/4#4 using k5id = 2011-12-29-215841-46-4#4
4: Downloading: http://www.kuro5hin.org/comments/2011/12/23/17501/965/41#41 using k5id = 2011-12-23-17501-965-41#41
5: Downloading: http://www.kuro5hin.org/comments/2011/12/23/17501/965/40#40 using k5id = 2011-12-23-17501-965-40#40
-----
======
The interesting thing about this example is that it checks whether a post is already in the archive before downloading. If it is, it just skips it.
And we have just downloaded (1) and (2) in the previous example, so that is why they are skipped here.
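The check is just a filename test on the archived copy, the same one slurp-index.sh (listed at the bottom of this post) uses:
======
# The archive check, as it also appears in slurp-index.sh below:
if [ -f "$rawuser/$type/$k5id.html" ] ; then
    echo "$i: Skipping: $URL"
else
    echo "$i: Downloading: $URL using k5id = $k5id"
    wget "$URL" -q -O "$rawuser/$type/$k5id.html"
fi
======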

Next, an example showing index-only mode.
Let's index all of MDC Protector's posts.
======
$ ./dump-user.sh all all "MDC Protector" x x
Selected types: comments diaries stories
Mode: index-only
Mode: silent URLs
MDC Protector: comments: 90, diaries: 13, stories: 1
rawuser = MDC%20Protector, type = comments, count = 90, pages = 3, k5type = comment_by, grepswitch = -C3, extractfn = extract-comment-data.sh
http://www.kuro5hin.org/?op=search&offset=0&old_count=30&search=Search&count=30&type=comment_by&section=&string=MDC%20Protector
http://www.kuro5hin.org/?op=search&offset=0&old_count=30&count=30&next=Next+Page+%3E%3E&type=comment_by&section=&string=MDC%20Protector
http://www.kuro5hin.org/?op=search&offset=30&old_count=30&count=30&next=Next+Page+%3E%3E&type=comment_by&section=&string=MDC%20Protector
Found 89 comments.
-----
rawuser = MDC%20Protector, type = diaries, count = 13, pages = 1, k5type = diary_by, grepswitch = -B2, extractfn = extract-story-data.sh
http://www.kuro5hin.org/?op=search&offset=0&old_count=30&search=Search&count=30&type=diary_by&section=&string=MDC%20Protector
Found 13 diaries.
-----
rawuser = MDC%20Protector, type = stories, count = 1, pages = 1, k5type = author, grepswitch = -B2, extractfn = extract-story-data.sh
http://www.kuro5hin.org/?op=search&offset=0&old_count=30&search=Search&count=30&type=author&section=&string=MDC%20Protector
Found 1 stories.
-----
======
Index-only mode means the script builds an index of all the posts but doesn't download them.
If you later want to download them, I wrote a trimmed-down version of ./dump-user.sh, called ./slurp-index.sh, that downloads every post listed in an index.
Note in the above example that ./dump-user.sh extracts the tallies from http://www.kuro5hin.org/user/$rawuser
Then, since we told it to index all posts rather than giving a count, it uses the triple (90, 13, 1) to work out how many pages to request from k5.
Basically, it is fully automated.
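For the curious, the page count is just a ceiling division by the 30 results a k5 search page holds (that's the count=30 in the URLs above); the arithmetic looks like this, however dump-user.sh actually spells it:
======
# pages = ceil(count / 30); k5 search pages hold 30 results each.
count=90
pages=$(( (count + 29) / 30 ))
echo $pages    # 3, matching the MDC Protector run above
======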

Here is what the index files look like:
For index-comments.txt, the fields are:
URL, k5id, date+time, score, reply count, comment title
For index-diaries.txt and index-stories.txt:
URL, k5id, date+time, comment count, diary title
A tab character separates each field, which makes the files easy to process with cut -f.
======
$ head -5 MDC%20Protector/index-comments.txt
http://www.kuro5hin.org/comments/2012/2/6/22319/40838/12#12 2012-2-6-22319-40838-12#12 02/06/2012 11:34:17 PM EST [none / 1] Replies: 0 I'M ON IT
http://www.kuro5hin.org/comments/2012/2/6/22319/40838/11#11 2012-2-6-22319-40838-11#11 02/06/2012 11:32:05 PM EST [none / 1] Replies: 0 YOU'RE A FUCKING FAGGOT
http://www.kuro5hin.org/comments/2012/2/6/22319/40838/9#9 2012-2-6-22319-40838-9#9 02/06/2012 11:26:59 PM EST [none / 1] Replies: 1 SRSLY ARE YOU GOING TO FOLLOW UP
http://www.kuro5hin.org/comments/2012/2/6/22319/40838/8#8 2012-2-6-22319-40838-8#8 02/06/2012 11:26:20 PM EST [none / 1] Replies: 0 SO LET'S TAKE INVENTORY
http://www.kuro5hin.org/comments/2012/2/6/221236/7594/8#8 2012-2-6-221236-7594-8#8 02/06/2012 11:07:37 PM EST [none / 0] Replies: 0 HAVE YOU FOUND THE KITTEN KATANA PIC YET? $
======
======
$ head -2 MDC%20Protector/index-diaries.txt
http://www.kuro5hin.org/story/2012/2/6/221236/7594 2012-2-6-221236-7594 02/06/2012 10:12:36 PM EST 10 comments EVERY TIME A VIOLENT GANG BANGER (I.E. A COP) DIES, I CELEBRATE
http://www.kuro5hin.org/story/2012/2/5/111644/7639 2012-2-5-111644-7639 02/05/2012 11:16:44 AM EST 16 comments HEY REMEMBER WHEN DIPSHIT TWEETSYGALORE CALLED THE FBI ON ME
======
Linky: MDC%20Protector-k5-index.zip: http://www.2shared.com/file/kSOwnlnZ/MDC20Protector-k5-index.html

Now, a quick example of slurp-index.sh:
$ ./slurp-index.sh

Usage: ./slurp-index.sh c|d|s|all user

======
$ ./slurp-index.sh s "MDC Protector"
Selected types: stories
MDC Protector: comments: 90, diaries: 13, stories: 1
Found 1 stories.
1: Downloading: http://www.kuro5hin.org/story/2012/1/29/12141/1610 using k5id = 2012-1-29-12141-1610
-----
======

Now, for something more serious! Let's index all of Rusty's posts:
Note that even though the user page says 15963 comments, 412 diaries, and 223 stories, k5's search will only give us some of them. Not sure why.
======
$ ./dump-user.sh all all rusty x x
Selected types: comments diaries stories
Mode: index-only
Mode: silent URLs
rusty: comments: 15963, diaries: 412, stories: 223
rawuser = rusty, type = comments, count = 15963, pages = 533, k5type = comment_by, grepswitch = -C3, extractfn = extract-comment-data.sh
http://www.kuro5hin.org/?op=search&offset=0&old_count=30&search=Search&count=30&type=comment_by&section=&string=rusty
http://www.kuro5hin.org/?op=search&offset=0&old_count=30&count=30&next=Next+Page+%3E%3E&type=comment_by&section=&string=rusty
http://www.kuro5hin.org/?op=search&offset=30&old_count=30&count=30&next=Next+Page+%3E%3E&type=comment_by&section=&string=rusty
http://www.kuro5hin.org/?op=search&offset=60&old_count=30&count=30&next=Next+Page+%3E%3E&type=comment_by&section=&string=rusty
http://www.kuro5hin.org/?op=search&offset=90&old_count=30&count=30&next=Next+Page+%3E%3E&type=comment_by&section=&string=rusty
http://www.kuro5hin.org/?op=search&offset=120&old_count=30&count=30&next=Next+Page+%3E%3E&type=comment_by&section=&string=rusty
http://www.kuro5hin.org/?op=search&offset=150&old_count=30&count=30&next=Next+Page+%3E%3E&type=comment_by&section=&string=rusty
http://www.kuro5hin.org/?op=search&offset=180&old_count=30&count=30&next=Next+Page+%3E%3E&type=comment_by&section=&string=rusty
http://www.kuro5hin.org/?op=search&offset=210&old_count=30&count=30&next=Next+Page+%3E%3E&type=comment_by&section=&string=rusty
http://www.kuro5hin.org/?op=search&offset=240&old_count=30&count=30&next=Next+Page+%3E%3E&type=comment_by&section=&string=rusty
http://www.kuro5hin.org/?op=search&offset=270&old_count=30&count=30&next=Next+Page+%3E%3E&type=comment_by&section=&string=rusty
http://www.kuro5hin.org/?op=search&offset=300&old_count=30&count=30&next=Next+Page+%3E%3E&type=comment_by&section=&string=rusty
http://www.kuro5hin.org/?op=search&offset=330&old_count=30&count=30&next=Next+Page+%3E%3E&type=comment_by&section=&string=rusty
http://www.kuro5hin.org/?op=search&offset=360&old_count=30&count=30&next=Next+Page+%3E%3E&type=comment_by&section=&string=rusty
http://www.kuro5hin.org/?op=search&offset=390&old_count=30&count=30&next=Next+Page+%3E%3E&type=comment_by&section=&string=rusty
http://www.kuro5hin.org/?op=search&offset=420&old_count=30&count=30&next=Next+Page+%3E%3E&type=comment_by&section=&string=rusty
http://www.kuro5hin.org/?op=search&offset=450&old_count=30&count=30&next=Next+Page+%3E%3E&type=comment_by&section=&string=rusty
Trying again ...
http://www.kuro5hin.org/?op=search&offset=480&old_count=30&count=30&next=Next+Page+%3E%3E&type=comment_by&section=&string=rusty
Trying again ...
http://www.kuro5hin.org/?op=search&offset=510&old_count=30&count=30&next=Next+Page+%3E%3E&type=comment_by&section=&string=rusty
Trying again ...

Too many missed pages. bailout-threshold = 3 exceeded. Exiting ...

Found 478 comments.
-----
rawuser = rusty, type = diaries, count = 412, pages = 14, k5type = diary_by, grepswitch = -B2, extractfn = extract-story-data.sh
http://www.kuro5hin.org/?op=search&offset=0&old_count=30&search=Search&count=30&type=diary_by&section=&string=rusty
http://www.kuro5hin.org/?op=search&offset=0&old_count=30&count=30&next=Next+Page+%3E%3E&type=diary_by&section=&string=rusty
Trying again ...
http://www.kuro5hin.org/?op=search&offset=30&old_count=30&count=30&next=Next+Page+%3E%3E&type=diary_by&section=&string=rusty
Trying again ...
http://www.kuro5hin.org/?op=search&offset=60&old_count=30&count=30&next=Next+Page+%3E%3E&type=diary_by&section=&string=rusty
Trying again ...

Too many missed pages. bailout-threshold = 3 exceeded. Exiting ...

Found 13 diaries.
-----
rawuser = rusty, type = stories, count = 223, pages = 8, k5type = author, grepswitch = -B2, extractfn = extract-story-data.sh
http://www.kuro5hin.org/?op=search&offset=0&old_count=30&search=Search&count=30&type=author&section=&string=rusty
http://www.kuro5hin.org/?op=search&offset=0&old_count=30&count=30&next=Next+Page+%3E%3E&type=author&section=&string=rusty
Trying again ...
http://www.kuro5hin.org/?op=search&offset=30&old_count=30&count=30&next=Next+Page+%3E%3E&type=author&section=&string=rusty
Trying again ...
http://www.kuro5hin.org/?op=search&offset=60&old_count=30&count=30&next=Next+Page+%3E%3E&type=author&section=&string=rusty
Trying again ...

Too many missed pages. bailout-threshold = 3 exceeded. Exiting ...

Found 5 stories.
-----
======
So there you have it. k5 gave us only: 478 comments, 13 diaries, 5 stories.
Linky: rusty-k5-index.zip: http://www.2shared.com/file/4WMHU6FR/rusty-k5-index.html

Now on to the main event. Let's index all of Crawford's posts, from the accounts we know about (post a comment if you know one I missed).
I had to slightly edit my code to handle the fact that some of Crawford's accounts have a deleted user page.
eg: http://www.kuro5hin.org/user/MichaelCrawford reports: "Sorry, I can't seem to find that user"
But if we go here: http://www.kuro5hin.org/?op=search&offset=0&old_count=30&type=comment_by&section=&string=MichaelCrawford&search=Search&count=30
we still see plenty of posts.
Aside from commenting out an "exit" line the script runs when it detects a missing user page, it just means giving a big count value rather than using the "all" option.
10000 was more than enough, and the code knows when to stop anyway - that is what the whole bailout-threshold thing is about.
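Since dump-user.sh itself isn't published below, here is a hedged sketch of its page loop and bailout logic, reconstructed from the output above; the structure, variable names, and both helper functions are my guesses, not the real internals:
======
#!/bin/bash
# Hedged reconstruction of dump-user.sh's page loop; not the real script.
k5type=comment_by ; rawuser=rusty ; count=15963      # example values from the rusty run
bailout_threshold=3

build_search_url() {
    # Mirrors the next-page search URLs printed in the transcripts above.
    echo "http://www.kuro5hin.org/?op=search&offset=$1&old_count=30&count=30&next=Next+Page+%3E%3E&type=$k5type&section=&string=$rawuser"
}

page_has_results() {
    # Stand-in check: the real script presumably greps for result markup;
    # here a non-empty download simply counts as a good page.
    [ -s "$1" ]
}

missed=0
offset=0
while [ "$offset" -lt "$count" ] ; do
    url=$(build_search_url "$offset")
    echo "$url"
    wget -q -O page.html "$url"
    if page_has_results page.html ; then
        missed=0
        # ... extract the post URLs from page.html and append them to the index ...
    else
        echo "Trying again ..."
        missed=$(($missed + 1))
        if [ "$missed" -ge "$bailout_threshold" ] ; then
            echo ""
            echo "Too many missed pages. bailout-threshold = $bailout_threshold exceeded. Exiting ..."
            break
        fi
    fi
    offset=$(($offset + 30))
done
======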

So I went ahead with the indexing of his posts.
Now I have data for these accounts:
$ cat Crawford-accounts.txt
Zombie Jesus Christ
Repeatible Hairstyle
GoingWare
Rippit the Ogg Frog
Michael Crawford
MichaelCrawford
Michael David Crawford
Jesus h Bar Christ
Jonathan Swift

All that is left is to merge that into one place. I could have done it by hand, but why do that when you can write a script to do it :).
(turns out writing the script was a good idea - saved me lots of effort).
$ ./merge-users.sh

Usage: ./merge-users.sh user-list.txt destination-folder

Which leaves us with:
======
$ ./merge-users.sh Crawford-accounts.txt Crawford-archive
Crawford-accounts.txt:
----------
comments diaries stories user
----------
1983 137 0 Zombie Jesus Christ
867 92 0 Repeatible Hairstyle
1 0 0 GoingWare
8 0 6 Rippit the Ogg Frog
0 0 0 Michael Crawford
3563 122 69 MichaelCrawford
1197 52 0 Michael David Crawford
0 0 0 Jesus h Bar Christ
632 0 0 Jonathan Swift
----------
8251 403 75
----------
======
Linky: Crawford-k5-index.zip: http://www.2shared.com/file/Tt6OiPXr/Crawford-k5-index.html
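merge-users.sh isn't published either, but the tallying half of it is simple; here is a rough sketch under my assumptions (I'm guessing the per-type numbers come from counting lines in each account's index files, and I've left out the part that moves each account's files into the destination folder):
======
#!/bin/bash
# Rough sketch of the tallying half of merge-users.sh; not the real script.
list="$1"

count_lines() {
    # 0 if the index file is missing (e.g. an account with no stories)
    if [ -f "$1" ] ; then wc -l < "$1" ; else echo 0 ; fi
}

tc=0 ; td=0 ; ts=0
echo "comments diaries stories user"
while read -r user ; do
    rawuser=$(echo "$user" | sed 's/ /%20/g')
    c=$(count_lines "$rawuser/index-comments.txt")
    d=$(count_lines "$rawuser/index-diaries.txt")
    s=$(count_lines "$rawuser/index-stories.txt")
    echo "$c $d $s $user"
    tc=$(($tc + $c)) ; td=$(($td + $d)) ; ts=$(($ts + $s))
done < "$list"
echo "$tc $td $ts"
======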

In the end I decided not to download all 8729 of his posts; I'm satisfied with just indexing them.
But if you want to, go ahead (code provided below).
Indeed, if more than one person wants to download them all, maybe it should be done once and the resulting tar.gz shared around, so k5 only gets hit for those ~8700 posts once.
So I may have created the MDC-k5-index, but I'll leave it to some later time, or someone else, to create the full MDC-k5-Corpus.

Now, if you want to slurp down the posts for rusty or MDC Protector, unzip the corresponding index and run:
./slurp-index.sh d rusty # d means just his diaries. d is just an example, you choose what you want.
./slurp-index.sh all "MDC Protector" # all means comments + diaries + stories.

For Crawford there is the extra directory to deal with.
So unzip, then cd Crawford-archive, then pick the account you are interested in, and then:
../slurp-index.sh all "Rippit the Ogg Frog" # this is a good account to start with as it only has 14 posts.
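Or, if you want every Crawford account in one go, loop over the account list (this assumes Crawford-accounts.txt ends up next to Crawford-archive after unzipping; adjust the path if yours lands elsewhere):
======
cd Crawford-archive
while read -r account ; do
    ../slurp-index.sh all "$account"
done < ../Crawford-accounts.txt
======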

And that, folks, is about it.
- Note that when the code does download a post body, it gets stored with all its k5 crap still surrounding the post.
So if you want to be left with just the content of the post, you will need to write a cleaning script.
- My comment indexing code had a bug that occurred 0.19% of the time. As an "ethical engineer" I was obligated to fix that before releasing the Crawford index.
Anyway, I wrote some code to isolate the problem, and quickly fixed it. Turns out I had lazy grep patterns that matched more than intended.

Here is the slurp-index.sh code (public domain of course!).
I'm not releasing dump-user.sh and the related code at this time, though I may take requests to index a user.
=========
$ cat slurp-index.sh
#!/bin/bash
# bash is required for the $(< file) construct below.

if [ $# -lt 2 ] ; then
    echo -e "\nUsage: ./slurp-index.sh c|d|s|all user\n"
    exit 1
fi

# Seconds to wait between downloads, so we go easy on k5.
sleeptime=15s

if [ "$1" = "c" ] ; then
    types="comments"
elif [ "$1" = "d" ] ; then
    types="diaries"
elif [ "$1" = "s" ] ; then
    types="stories"
elif [ "$1" = "all" ] ; then
    types="comments diaries stories"
else
    echo "Error: unrecognized posting type."
    exit 1
fi
echo "Selected types: $types"

user="$2"
rawuser=$(echo "$user" | sed 's/ /%20/g')
if [ ! -f "$rawuser/tallies.txt" ] ; then
    echo "Error: user $user not found. Exiting."
    exit 1
fi

# Let's log what we are doing.
echo "$(date +"%Y-%m-%d %r") ./slurp-index.sh $1 $2" >> "$rawuser/log.txt"

# NB: set really bugs out if the file doesn't exist, or is empty!!!
# tallies.txt holds three numbers: comment, diary, and story counts.
set -- $(< "$rawuser/tallies.txt")
commentcount=$1
diarycount=$2
storycount=$3
echo "$user: comments: $commentcount, diaries: $diarycount, stories: $storycount"

for type in $types ; do
    mkdir -p "$rawuser/$type"
    # OK. Let's loop through our index-$type.txt file.
    echo "Found $(cat "$rawuser/index-$type.txt" | wc -l) $type."
    i=0
    while read -r URL k5id rest ; do
        i=$(($i + 1))
        if [ -f "$rawuser/$type/$k5id.html" ] ; then
            # We already have it, so don't download it.
            echo "$i: Skipping: $URL"
        else
            echo "$i: Downloading: $URL using k5id = $k5id"
            # Hrmm... need a way to check for k5 empty pages!
            # Also need code here to clean up k5's html.
            # Maybe also need a tweak to $URL so we don't download comments attached to stories or diaries.
            wget "$URL" -q -O "$rawuser/$type/$k5id.html"
            sleep $sleeptime
        fi
    done < "$rawuser/index-$type.txt"
    echo "-----"
done
=========
tl;dr version:
Here are the index files for the posts of MDC Protector, rusty, Crawford, and trane.

For index-comments.txt, the fields are:
URL, k5id, date+time, score, reply count, comment title
For index-diaries.txt and index-stories.txt:
URL, k5id, date+time, comment count, diary title
A tab character separates each field.

This means you can use cut -f to select different columns.
eg: $ cut -f1 trane/index-comments.txt
to see just the URLs.
eg: $ cut -f6 trane/index-comments.txt
to see just the comment titles.
eg: $ cut -f4,5 trane/index-comments.txt
to see scores and reply count.
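And because it is all plain tab-separated text, the usual pipelines apply too; my own extra example, not output from the scripts:
eg: $ cut -f4 trane/index-comments.txt | sort | uniq -c | sort -rn
to see how often each comment score turns up.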

Links:
MDC%20Protector-k5-index.zip http://www.2shared.com/file/kSOwnlnZ/MDC20Protector-k5-index.html
rusty-k5-index.zip http://www.2shared.com/file/4WMHU6FR/rusty-k5-index.html
Crawford-k5-index.zip http://www.2shared.com/file/Tt6OiPXr/Crawford-k5-index.html
trane-k5-index.zip http://www.2shared.com/file/xnuXSZZ6/trane-k5-index.html

I will possibly do requests.
Just provide username, comments or diaries or stories or all three, and the number of posts you want.
