
How to post code to K5 -- the easy way!
By tmoertel in Meta
Sun Oct 07, 2001 at 12:18:08 PM EST
Tags: Kuro5hin.org

Ever try including code in your posts? It's a pain. Somehow the code always manages to get mangled before landing on the page. Well, no more, thanks to this nifty little Perl script.

The idea is simple. Write your post as you normally would, using K5's subset of HTML, but enclose your code snippets in PRE elements. K5 doesn't support PRE, but the following script converts PRE elements (and their contents) into K5-friendly HTML. Just process your posts with the script before posting to K5. (I write my posts in Emacs, and it's easy to convert code regions via shell-command-on-region.)

For example, if you write the following:

One can compute the Fibonacci Series as defined in <pre>fibs</pre>, below:

<pre>

fibs :: [Integer]
fibs@(_:fibs') =
    1 : 1 : zipWith (+) fibs fibs'

</pre>
the script will convert it into
One can compute the Fibonacci Series as defined in <tt>fibs</tt>, below:

<tt><br>
<br>
fibs :: [Integer]<br>
fibs@(_:fibs') =<br>
&nbsp;&nbsp;&nbsp; 1 : 1 : zipWith (+) fibs fibs'<br>
<br>
</tt>
which, when posted to K5, will look like
One can compute the Fibonacci Series as defined in fibs, below:

fibs :: [Integer]
fibs@(_:fibs') =
    1 : 1 : zipWith (+) fibs fibs'

Thus are the magical characters and spacing of your source code protected on the dangerous journey from your screen to those of your readers.

Note that I treat PRE as an inline element rather than as the block element it normally is in HTML. Inline treatment allows you to mix code into a sentence, as shown in the example above. Also note that the lines of the converted code are allowed to wrap. This ensures that K5's column width doesn't bloat up to accommodate any obnoxiously long lines in the code. Nevertheless, line breaks are preserved; widening the browser window will show the original breaks, and copy-and-paste will transfer them truly.

The script is also handy for ASCII art, such as this illustration demonstrating that the script works hard to ensure that your code can survive the round trip from your computer to K5 and back to a reader's computer:

     M A G I C A L   R O U N D - T R I P   P R O P E R T Y

     +----------+                            +----------+
     | INPUT =  |                            | OUTPUT = |
 ,-> | original | --> [weblog-filter.pl] --> | k5-ready | ---.
|    | code     |                            | code     |     |
|    +----------+                            +----------+     |
|                                                             | [POST]
|                                                             |
|                                            +----------+     |
|                                            | Your     |     |
 `------- [copy-and-paste from browser] ---- | code on  | <--'
                                             | K5.org ! |
                                             +----------+

The code
Here's the code (which was run on itself to yield the K5-compatible version you see below). It's nothing special. Tweak as you desire.

#!/usr/bin/perl -w
#
# weblog-filter.pl - filter code and code-containing HTML for posting to K5
#
# $Id: weblog-filter.pl,v 1.2 2001/10/07 00:55:02 thor Exp $
#
# Tom Moertel <tk5 /a/t/ moertel.com>
#
# Filters a block of code or HTML possibly mixed with code into
# something suitable for posting to kuro5hin.org, which doesn't have
# support for PRE and other elements that are useful for presenting
# code in HTML.
#
# If your input text contains PRE blocks, only they will be processed;
# otherwise, the entire text will be processed.

use strict;

sub fixup($) {

    local $_ = $_[0];

    # escape characters that could be confused for HTML
    # character-entity references (note that order is important here)

    s/&/&amp;/g;                   # escape & chars
    s/</&lt;/g;                    # escape < chars

    # perform markup conversions

    s/(\r\n?|\n)/<br>$1/sg;        # lineends    -> <br>
    s/\t/    /g;                   # tabs        -> 4 spaces
    s/ (?= )/&nbsp;/g;             # space runs  -> &nbsp;
    s/^ /&nbsp;/mg;                # leading sp  -> &nbsp;

    # package it all up inside of TT

    return "<tt>$_</tt>";
}

# MAIN

my $hits = 0;
undef $/;  # read input in one big chunk
$_ = <>;
s{<pre>(.*?)</pre>}{$hits++,fixup($1)}gsie;
$_ = fixup($_) unless $hits;
print;
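
The round-trip property illustrated in the ASCII art can be checked with a small sketch. It is not part of weblog-filter.pl: `unfix` is a hypothetical helper that only approximates what copy-and-paste from a browser gives back, and `fixup` here is a copy of the script's substitutions (written without the prototype) so that the block runs on its own.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Copy of the story's conversion, so this sketch is self-contained.
sub fixup {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;              # escape & first, so later entities survive
    $text =~ s/</&lt;/g;               # escape <
    $text =~ s/(\r\n?|\n)/<br>$1/g;    # line ends  -> <br> plus the original line end
    $text =~ s/\t/    /g;              # tabs       -> 4 spaces
    $text =~ s/ (?= )/&nbsp;/g;        # space runs -> &nbsp; (last space stays plain)
    $text =~ s/^ /&nbsp;/mg;           # leading sp -> &nbsp;
    return "<tt>$text</tt>";
}

# Hypothetical inverse (an assumption, not in the original script):
# undo the markup roughly the way a browser's copy-and-paste would.
sub unfix {
    my ($html) = @_;
    $html =~ s{\A<tt>}{};              # drop the TT wrapper
    $html =~ s{</tt>\z}{};
    $html =~ s/<br>//g;                # the original line end still follows each <br>
    $html =~ s/&nbsp;/ /g;
    $html =~ s/&lt;/</g;
    $html =~ s/&amp;/&/g;              # unescape & last, mirroring fixup's order
    return $html;
}

# Sample input with the characters that usually get mangled.
my $sample = <<'END';
if (a < b && c) {
    puts("x  y");
}
END

print fixup($sample), "\n";
print "round trip ok\n" if unfix(fixup($sample)) eq $sample;
```

Because `fixup` turns each tab into four spaces, only tab-free input comes back byte for byte; input containing tabs comes back with spaces in their place.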

Is there a better way?
If there is, I sure would like to know about it. Are there any plans to put something like this into Scoop? (Or is it already there?) Don't be shy.

Poll
How do you post code on K5?
I don't. 51%
I just paste the code, hit Submit, and pray. 3%
I painstakingly hand-edit the markup for the code. 20%
I wrote my own darn script, thank you very much. 6%
Gingerly. 17%

Votes: 58

 How to post code to K5 -- the easy way! | 30 comments (29 topical, 1 editorial, 0 hidden)
 Why TT? (3.80 / 5) (#1) by fluffy grue on Sat Oct 06, 2001 at 11:47:06 PM EST

Why not use CODE? By the way, Scoop used to just support PRE outright, but it got removed because too many people were abusing it (not that lack of PRE makes it any more difficult, as a certain diary by "cunt" shows). Also, I think the use of the variable name "$hits" is kind of... unfortunate. :)

--"Is not a quine" is not a quine. I have a master's degree in science!
 Because not all preformatted text is code (4.50 / 2) (#6) by tmoertel on Sun Oct 07, 2001 at 01:23:52 AM EST

The W3C defines the TT element for rendering "teletype or monospaced text," whereas CODE designates "fragments of computer code." Because the script is useful for posting more than just code (as the ASCII art illustrates), I used the more general element.

"Also, I think the use of the variable name '$hits' is kind of... unfortunate. :)"

Indeed! I code in Emacs using font-lock mode (and Andale Mono as the typeface), so between the color coding and the matter-of-fact nature of the type, I was oblivious to the unfortunate symbolic coincidence that occurred when the variable identifier (hits) was juxtaposed with the scalar designator ($). Sadly, a quick grep(1) through my code library revealed that this wasn't my first use of the offending combination. Thanks for opening my eyes. ;-)

--My blog | LectroTest [ Disagree? Reply. ]
 I've done that (\$hits) (none / 0) (#12) by hurstdog on Sun Oct 07, 2001 at 11:46:29 AM EST

There are places in Scoop where I was using variables for permission stuff, and I named them stuff like $perm and $perm_section. Whups :)
 Tough Decisions (4.60 / 5) (#2) by Blarney on Sun Oct 07, 2001 at 12:07:41 AM EST

Congratulations! You've managed to cleverly work around K5's PRE-removing feature. Your little script inserts <br> and &nbsp; to produce something that appears just the same way that it would if K5 allowed PRE. Whatever reason that K5 management had to remove PRE is now a reason to find some way to disallow your style of preformatted text. Are we going to lose &nbsp; and <br> now?

Myself, I think restricting PRE on a community-moderated site like K5 is a big, fat waste of time. Any garbage or ASCII art is going to get voted into oblivion anyway. Maybe PREvention is needed to keep the diary pages from turning into glop, but there is no need to have it for the stories.

Your little script here has reopened the issue of PRE.
 Abuse (4.33 / 3) (#3) by panner on Sun Oct 07, 2001 at 12:34:49 AM EST

There seems to be a tendency for some people to abuse PRE, and it's easy for people to accidentally abuse it. Since it no longer lets the browser do the word-wrapping, it's of course easy to abuse on purpose (but multiple other cases have shown that it's just as easy to put in a long string with no spaces in it).

The big thing is people who just wrap their comment/diary/story in PRE so that they don't have to bother with splitting paragraphs up themselves. But then they write a few lines too long and it messes up the entire page.

You're right, poorly formatted stories are either voted down or edited, but stories don't make up most of what's posted on k5. Diaries can be abused, and comments even more so. Comments can't be edited by admins to fix the problems, so it's either leave it or delete it. At least now, if a comment is screwing up the page, then the abuse is on purpose, and it can be deleted.

As for using PRE in a story, usually you can just mail help@k5 and get an editor to put
 Keep an honest man honest (5.00 / 2) (#10) by Blarney on Sun Oct 07, 2001 at 04:30:22 AM EST

Yeah, PREvention is good to prevent Joe or Jill lazybones from wrapping a plain-ASCII article, and it won't stop shit like cunt's diary, so it keeps honest men honest. Good fences, good neighbors, all that good stuff. I agree!

However, it seems to me that the proper way to prevent people from sticking in 300-character lines is algorithmic. Line length is the main issue here, so it should be addressed programmatically: long lines should be broken before output.
 fold -s (4.00 / 1) (#14) by fluffy grue on Sun Oct 07, 2001 at 01:28:54 PM EST

`fold -s` does the Right Thing. It splits the line at whitespace if possible, and if not, it splits the offending word. Simply piping the output of Scoop through `fold -s` would break purposeful abuses and do a good job of fixing accidental abuses (such as PRE-formatted comments without linebreaks), and wouldn't affect (legitimate) non-PRE text at all (since outside of PRE, HTML considers all whitespace to be the same). Well, >80-character URLs might get borked, but there's always makeashorterlink.com... ;)

--"Is not a quine" is not a quine. I have a master's degree in science! [ Hug Your Trikuare ]
 The script doesn't suffer from PRE's problems (5.00 / 2) (#18) by tmoertel on Sun Oct 07, 2001 at 10:29:41 PM EST

Blarney wrote:

"Your little script inserts <br> and &nbsp; to produce something that appears just the same way that it would if K5 allowed PRE."

And panner wrote:

"Since [the PRE element] no longer lets the browser do the word-wrapping, ..."

I should point out that my script converts PRE elements into markup that does not prevent browsers from word-wrapping. (I mentioned this in my story, but it seems that a few people overlooked this tidbit.) Much of the potential for abuse, especially accidental abuse, has thus been eliminated.

--My blog | LectroTest [ Disagree? Reply. ]
 The easy fix (3.75 / 4) (#5) by fluffy grue on Sun Oct 07, 2001 at 01:20:25 AM EST

By the way, I just sent an email to rusty suggesting an easy fix which will both allow PRE again and prevent other formatting/wrapping abuses, without having to resort to ugly hacks such as randomly inserting spaces into strings. Simply pipe the output of Scoop through `fold -s`. This way, PRE will be forced to wrap at 80 characters, long words will be split at 80 characters, and everyone is happy, except the format-manglers who have completely lost the ability to format-mangle (since not even long strings of Ms will work), and it doesn't even matter which element is being abused (so even story titles and the like will be safe). Plus, the format change would be retroactive, and wouldn't require modifying any existing stuff in the database.

Rusty hasn't replied to me yet, but I figured I'd raise the issue here, since it seemed an appropriate place.

--"Is not a quine" is not a quine. I have a master's degree in science!
 My preference... (4.75 / 4) (#7) by tmoertel on Sun Oct 07, 2001 at 01:54:05 AM EST

"Simply pipe the output of Scoop through fold -s"

Hmm... That method doesn't seem to work when the overly long lines contain no whitespace. For example, try:

    perl -e'print "x"x200' | fold -s

(Tested with the version of fold in textutils-2.0e.6 on RHL.)

In any case, hard-breaking the text at 80 columns would damage long lines of code. I'd opt for something along the lines of my script (which allows for wrapping at spaces while preserving the line-endings in the code) with the addition of logic to break any non-whitespace sequence greater than 80 characters in length. Even then, I'd place a backslash at the break to make interpretation and re-assembly easier.

This approach would provide the abuse-deterring benefits of fold as well as the insert-random-spaces method, all while preserving the ability for legitimate users to post code. After all, couldn't we use a little more code on K5 these days?

--My blog | LectroTest [ Disagree? Reply. ]
 My mistake re. fold: It does wrap . . . (5.00 / 2) (#8) by tmoertel on Sun Oct 07, 2001 at 03:20:53 AM EST

. . . sequences w/o whitespace, but I didn't notice it doing so in my tests because of my terminal width. Repeating the tests with a smaller width (via the -w switch) revealed my error. Sorry for the mistake. If fold would just mark its breaks with backslashes, it would be perfect.

--My blog | LectroTest [ Disagree? Reply. ]
 backslashes (5.00 / 2) (#9) by fluffy grue on Sun Oct 07, 2001 at 03:48:43 AM EST

Backslashes suck for if you've got, say, a really long URL. If your code really needs backslashes, then you should be breaking up the lines yourself to begin with.

--"Is not a quine" is not a quine. I have a master's degree in science! [ Hug Your Trikuare ]
 Once more on backslashes (none / 0) (#13) by tmoertel on Sun Oct 07, 2001 at 01:19:19 PM EST

"Backslashes suck for if you've got, say, a really long URL."

But, then again, so do spaces. Inserting any foreign characters into a URI will damage it. However, backslashes are a fairly common form of escaping that will be properly interpreted by most shells and more than a few common programming languages. So breaking with backslashes lets us preserve the ability to copy-and-paste forcibly broken URIs and code in many cases.

If you break a URI with a backslash, for example, it works fine if you paste it into a shell. Try pasting this link behind echo (or, more realistically, wget or curl) on the command line, and see what you get:

http://www.ellium.com/~thor/\
hangman/cheating-hangman.pdf

If we must damage content by forcibly breaking lines in the middle of a non-whitespace sequence, we should at least minimize the damage. Again, I'd say that the optimal solution is something like this:

Long lines are not forcibly broken; rather, they are allowed to wrap on whitespace (this is what my script does).

In rare cases where a non-whitespace sequence of more than 80 characters is encountered, the sequence is forcibly broken and a backslash inserted at the break.

--My blog | LectroTest [ Disagree? Reply. ]
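
The forced-break rule proposed in this comment can be sketched in a few lines of Perl. This is only an illustration under stated assumptions, not code from weblog-filter.pl or from Scoop: the names `break_long_runs` and `rejoin` are hypothetical, the break marker is a backslash followed by a newline, and the limit defaults to 80. Wrapping on whitespace is left to the browser, as in the story's script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: break any run of more than $max non-whitespace
# characters by inserting a backslash and a newline after every $max
# characters of the run. Runs of $max characters or fewer are untouched.
sub break_long_runs {
    my ($text, $max) = @_;
    $max = 80 unless defined $max;
    # Match $max non-whitespace characters only when more of the run follows.
    $text =~ s/(\S{$max})(?=\S)/$1\\\n/g;
    return $text;
}

# Hypothetical inverse: drop each backslash-newline pair, which is what a
# shell does with a line continuation.
sub rejoin {
    my ($text) = @_;
    $text =~ s/\\\n//g;
    return $text;
}

# The URL from the comment, broken at 28 characters to keep the demo short.
my $url    = 'http://www.ellium.com/~thor/hangman/cheating-hangman.pdf';
my $broken = break_long_runs($url, 28);
print "$broken\n";
print "rejoined ok\n" if rejoin($broken) eq $url;
```

The visible backslash is the point of the design: the alteration is not silent, and text that already contains a backslash at a line end would be the one case where `rejoin` cannot tell original content from an inserted break.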
 Yes... (none / 0) (#16) by fluffy grue on Sun Oct 07, 2001 at 09:26:37 PM EST

But fixing spaces in a URL is much easier than fixing backslashes.

--"Is not a quine" is not a quine. I have a master's degree in science! [ Hug Your Trikuare ]
 How are spaces easier to fix? (none / 0) (#17) by tmoertel on Sun Oct 07, 2001 at 10:19:09 PM EST

"But fixing spaces in a URL is much easier than fixing backslashes."

This is the part I don't understand. How are spaces easier to fix? It seems like the opposite is true. With backslashes there often isn't a need to "fix" anything: Just paste the text -- backslashes and all -- into your shell, editor, etc., and the backslashes take care of themselves. In the case when the destination isn't smart enough to automatically unescape backslashes, removing them by hand isn't any more difficult than removing spaces by hand.

So, on the one hand, backslashes are easier than spaces, and on the other hand, backslashes are no more cumbersome than spaces. Therefore, it would appear that backslashes have the advantage.

--My blog | LectroTest [ Disagree? Reply. ]
 Why easier (none / 0) (#19) by fluffy grue on Sun Oct 07, 2001 at 10:38:12 PM EST

Because, in the rare case that there's a split URL, you just go into the navigation bar, hit end, ctrl-left, backspace, ctrl-left, backspace, etc., and that'll repair the single spaces which are in it, assuming the browser even gives a damn about whitespace anyway.

For an example, copy-paste this URL into a browser's location field:

http://trikuare.cx/fluffyporcupine/

Then for a comparison, copy-paste this URL into a browser's location field:

http:\//tri\kuare\.cx/f\luffy\porcu\pine/

Most webbrowsers are able to filter out the whitespace in the first one. NO webbrowsers are able to filter out the backslashes in the second one!

--"Is not a quine" is not a quine. I have a master's degree in science! [ Hug Your Trikuare ]
 Doesn't seem to work . . . (none / 0) (#21) by tmoertel on Sun Oct 07, 2001 at 11:57:44 PM EST

"Most webbrowsers are able to filter out the whitespace in the [space-separated] one."

What browsers do this? Neither IE 5.5 nor Mozilla 0.9.4 was able to handle the space-split URL, regardless of whether the URL was in a link's HREF or copy-and-pasted into the browser's Location field. (Moreover, neither allowed me to paste more than the first line of the broken link. It would appear that both spaces and backslashes are equally worthless methods for breaking URLs.)

However, I was able to take the backslash-separated URL, copy it, type "lynx " into the command line, and paste the URL to go straight to the site. I couldn't do that with the space-separated version.

Thus it would appear that the advantage of using spaces to break URLs is dubious. Given that backslashes are advantageous in many other circumstances (such as when breaking code), I must still prefer backslashes as the general-purpose means for breaking long sequences of non-whitespace characters.

--My blog | LectroTest [ Disagree? Reply. ]
 Okay (none / 0) (#23) by fluffy grue on Mon Oct 08, 2001 at 01:53:08 AM EST

Maybe the split code just shouldn't split stuff inside HTML entities then. Easy enough of a fix. Still, I don't think extra printable characters should be added in. It's ugly, and, once again, if you have code which is longer than 80 lines, you really should learn how to make readable code.

--"Is not a quine" is not a quine. I have a master's degree in science! [ Hug Your Trikuare ]
 Oh yeah (none / 0) (#20) by fluffy grue on Sun Oct 07, 2001 at 10:42:39 PM EST

Also, try these two links, which is how the output would be put out using just whitespace vs. using backslashes: Which one works? Which one doesn't? Granted, on the first one under Netscape 4.77 it doesn't work right off, but it's VERY easy to fix in the browser, while the second one takes a lot more work/thought/etc. And anyway, URLs which are long enough to get split at 80 characters are probably fake to begin with!

--"Is not a quine" is not a quine. I have a master's degree in science! [ Hug Your Trikuare ]
 +1. FP (2.60 / 5) (#11) by Vladinator on Sun Oct 07, 2001 at 07:34:25 AM EST

 Excellent! I love good troll tools. --LRSE Hosting
 Everything2.com has some code formatting tools (none / 0) (#15) by pin0cchio on Sun Oct 07, 2001 at 02:14:35 PM EST

 If you run your code through some tools on Everything 2 (namely E2 Source Code Formatter and then Wharfinger's Linebreaker), you'll get pseudo-PREformatted text, with < and > automatically converted to character entities and lines broken properly. Note that this breaks indentation, but unless you're using Python or some other language that treats indentation as syntax, you can always run your code through GNU indent (for C code) or Emacs's indent-region function (for code in Java language, C, Scheme, or any other language that has an Emacs mode) to restore nice-looking indentation. lj65
 It'll be fixed (5.00 / 2) (#22) by panner on Mon Oct 08, 2001 at 12:11:12 AM EST

I just put in some code to do splitting of long lines, and to word wrap PRE tags. I'll commit this later on (this will go in along with a somewhat revamped HTML checker and a spell checker), and it might take a while to get to k5, but it'll be there.

Both are admin configured. By default, though, it splits if there are more than 100 non-whitespace characters in a row. So if someone posts a diary with 500 letters, it'll insert a newline every 100 letters. From my testing, this doesn't affect newlines in links (both moz .9.4 and w3m ignore the newline), and outside of a link it'll just be a space (easy to fix when you paste it in).

As for PRE, it uses Text::Wrap to word wrap at (by default) 100 columns. Note that this won't wrap if there are no newlines, but that case is caught by the other filter, so it's no problem. (Okay, browsing the man page for Text::Wrap, I see that it will break long lines by itself, but it won't need to :). Anyway, when some PRE text is wrapped, it'll just have a newline inserted at a word boundary. This is done before saving into the DB, so if it wraps at the wrong place in a story, an admin could probably fix it up for you.

-- Keith Smiley
Get it right, for God's sake. Pigs can work out how to use a joystick, and people still can't do this!
 Newlines seem to break links (none / 0) (#24) by tmoertel on Mon Oct 08, 2001 at 11:31:35 AM EST

"From my testing, this doesn't affect newlines in links (both moz .9.4 and w3m ignore the newline)"

Can you describe exactly how you tested this? I couldn't get either Moz 0.9.4 (Win32) or IE 5.5 to accept newline-broken links.

--My blog | LectroTest [ Disagree? Reply. ]
 Testing (none / 0) (#26) by panner on Mon Oct 08, 2001 at 03:03:13 PM EST

My comment on that is irrelevant, since I realized something this morning about that, but I'll explain it anyway :) I just made a quick test page that had a link broken half-way through, like so: I tried that in both moz and w3m, both of which followed it fine.

But this morning I remembered that the HTML parser strips newlines from within tags as it runs, and the breaking of lines is done before this, so the HTML checker will fix it on its own.

-- Keith Smiley
Get it right, for God's sake. Pigs can work out how to use a joystick, and people still can't do this!
 A few suggestions (4.66 / 3) (#25) by tmoertel on Mon Oct 08, 2001 at 11:57:41 AM EST

"I just put in some code ... to word wrap PRE tags. [...] Anyway, when some PRE text is wrapped, it'll just have a newline inserted at a word boundary."

Please don't handle PRE elements by reformatting the text inside of them. There are two good reasons for this:

PRE elements (usually) prevent browsers from wrapping lines. Regardless of where such lines are wrapped by K5's HTML handler, be it 100 columns or the more common 80, when they land on the page, they are going to be wide and bloat up the page they land on. Consider an 80-column PRE block appearing in a leaf comment in Nested display mode. It's going to throw off the display. The browser should decide where to wrap the lines at render time, not K5 at post time.

Except for abusers, people use PRE to preserve the integrity of text, code, machine-generated/machine-readable information, etc. Changing it in any way damages the information. We want readers to be able to copy-and-paste K5's version of the information and have the result be identical to what the poster intended. Information should survive the round trip to and from K5, undamaged and unchanged.

My script gets both of these things right: It allows the browser to wrap lines and ensures that what a reader sees (and copy-and-pastes) is identical to what the poster wrote.

"This is done before saving into the DB, so if it wraps at the wrong place in a story, an admin could probably fix it up for you."

There's no need for admin intervention if you use the code from my script. Hard line breaks are exactly as the poster intended, but the browser is free to wrap lines earlier if the column is too narrow.

Suggestions
Please consider the following suggestions:

Create a new pseudo-element LITERAL. (Don't use PRE.) Most people who resort to PRE just want their content preserved exactly. Yet, that isn't what PRE does. The W3C's definition of PRE allows it to contain markup, which is to be interpreted. Give people what they want; hence, the LITERAL element.

Process the text inside of LITERAL elements the way my script does, with the following change. Break any non-whitespace sequence of greater than N characters, except for HREF attributes inside of A elements, which should be left alone.

There is some debate about how best to break non-whitespace sequences. I think that a backslash followed by a hard linebreak (i.e., <br> markup) is best. Others think that a lone hard linebreak is best. My argument for the former is this: If we are forced to alter content (i.e., damage it), we shouldn't do so silently. The backslash makes our editing visible. Also, many shells and programming languages consider a backslash-linebreak combination to be a line-continuation indicator, and they will undo our damage automatically. The argument for the lone linebreak is that it works better for URLs, but none of my tests showed this to be the case.

If we're going to put PRE-like functionality into K5, let's do it right.

--My blog | LectroTest [ Disagree? Reply. ]
 Updated version of the PRE-filtering code (5.00 / 1) (#27) by tmoertel on Tue Apr 23, 2002 at 12:14:27 PM EST

 I have created an updated version of the PRE-filtering code that accommodates the recent addition of long-sequence breaking to Scoop. Please use this version instead of the one posted in the main text of the story. Otherwise, highly-indented code may not post correctly. --My blog | LectroTest [ Disagree? Reply. ]
My code formatting program (none / 0) (#28) by codemonkey_uk on Mon Apr 29, 2002 at 10:54:12 AM EST

I wrote this 'C' program in order to post 'FPC4.cpp'. It's designed for posting C/C++ code; it does whitespace, line breaks, entities, and italicises comments.

`---Thad`
"The most savage controversies are those about matters as to which there is no good evidence either way." - Bertrand Russell
 An alternative (3.00 / 1) (#29) by fluffy grue on Sun May 12, 2002 at 03:47:31 PM EST

Start to post a comment in 'auto-format' mode
Paste your code in and hit 'preview'
View the page source and copy the resulting HTML

--"#kuro5hin [is like] a daycare center [where] the babysitter had been viciously murdered." -- CaptainObvious (we
 this doesn't work (none / 0) (#30) by codemonkey_uk on Tue Dec 10, 2002 at 05:15:22 AM EST

As you so clearly demonstrated here!

`---Thad`
"The most savage controversies are those about matters as to which there is no good evidence either way." - Bertrand Russell