Converting my old content to markdown

So I have just converted all the content on this blog to markdown. It was rather painful. I had really old content in here, going as far back as 2005, and it had gone through about 3 distinct markup filters over the years, most of which were irregular and changed according to the position of the sun, the drupal.org releases and wind speed. Now it's all markdown. This involved patience, drush and 3 hours of wasted time.

Now, the fact that Markdown picked up speed has always been a little strange to me. The syntax isn't particularly complete, which leads to non-standard extensions like markdown-extra popping up, with the inevitable variations from one language implementation to the next. GitHub, for example, has its own flavor of the famous markup. Finally, Drupal's filters are kind of klunky: the usual < url > autolink markup doesn't work. So things are a little weird, but Markdown seems to be here to stay, or anyway it's the only markup I have seen supported reliably across multiple CMSes and sites. One has to wonder why we are still stuck with plain old HTML on Drupal.org...

The actual conversion

The conversion was rather annoying. I had to track down all those formats, which mostly meant converting a wiki-like syntax from the freelinking module to markdown. (It's actually more complicated than that, because there was also the simplewiki filter, but let's ignore that: there were only a few of those and I just did them by hand.)

In the end, I arrived at the following script:

<?php

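// Callback for preg_replace_callback(): turns one [[target]] or
// [[target|label]] wiki link into a markdown link.
// $match[1] is the target, $match[2] the optional label.
function wiki2mdwn($match) {
  $orig = $match[0];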
  if (count($match) > 2) {
    $mdwn = "[" . $match[2] . "](" . $match[1] . ")";
  } else {
    $mdwn = "[" . $match[1] . "](" . $match[1] . ")"; # hack: drupal fails on 
  }
  print "$orig\t=>\t$mdwn\n";
  return $mdwn;
}

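// Current node revisions in input format 1 that still contain [[...]] links.
// LIMIT 1 handles a single node per run; remove it to convert them all.
$q = db_query("select node.nid, format, FROM_UNIXTIME(created) AS c, body, teaser, node.title from node_revisions inner join node on node.vid = node_revisions.vid where format = 1 AND ( teaser like '%[[%' OR body like '%[[%' ) order by created LIMIT 1;");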

while ($row = db_fetch_object($q)) {
  print $row->nid . " | " . $row->format . " | " . $row->c . " | " . $row->title . "\n";
  $node = null;
  foreach (array('teaser', 'body') as $part) {
    print "checking $part... ";
    $newpart = preg_replace_callback('/\[\[([^]\|]*)(?:\|([^]]*))?\]\]/', 'wiki2mdwn', $row->$part);
    if ($newpart != $row->$part) {
      print "replacement... ";
      if (is_null($node)) {
        $node = node_load($row->nid);
        print "node loaded... ";
      }
      $node->$part = $newpart;
    }
  }
  if (!is_null($node)) {
    node_save($node);
    print "node {$node->nid} saved... ";
  }
  print "\n";
}

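// Same thing for comments, which are updated directly in the database.
// Again, LIMIT 1 handles a single comment per run.
$q = db_query("SELECT nid, cid,FROM_UNIXTIME(timestamp),format, subject, comment FROM comments WHERE format = 1 AND comment LIKE '%[[%' ORDER BY cid LIMIT 1;");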

while ($row = db_fetch_object($q)) {
  print "checking comment {$row->cid} in node {$row->nid} with subject {$row->subject}... ";
  $newcom = preg_replace_callback('/\[\[([^]\|]*)(?:\|([^]]*))?\]\]/', 'wiki2mdwn', $row->comment);
  print "\nsaving... ";
  db_query("UPDATE comments SET comment = '%s' WHERE cid = %d", $newcom, $row->cid);
  print "comment {$row->cid} in node {$row->nid} saved.\n";
}

Yes. This is klunky and ugly. But it works. If you have more than, say, 200 nodes or comments to convert, I would strongly recommend optimizing this into SQL directly, but I was worried I would break stuff, so I preferred going through preg_replace_callback() rather than plain SQL.

Oh, and by the way, this is a drush snippet, for those who don't know about that (rather old) drush feature. :) To run it, you basically dump it in a file and run it:

drush @anarcat.koumbit.org wiki2mdwn.php

Notice how I use a drush alias there - this one is automatically created by the Aegir instance this site lives on. Time saver.
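
If you want to see what the regex does before letting it loose on a real database, here is a minimal standalone sketch. It reuses the pattern from the script above on a plain string, with no Drupal or drush involved. The function names freelink_to_markdown() and convert_freelinks() and the sample text are made up for illustration, and it assumes PHP 7 or later (for the `??` operator and the type hints), which is much newer than what the original script ran on.

```php
<?php
// Standalone sketch: same pattern as the conversion script, applied to a
// plain string. Function names and sample text are hypothetical.

// Matches [[target]] or [[target|label]].
const FREELINK_RE = '/\[\[([^]\|]*)(?:\|([^]]*))?\]\]/';

// Callback: build a markdown link from one match.
function freelink_to_markdown(array $match): string {
  $target = $match[1];
  // $match[2] is only set when the optional |label part matched;
  // otherwise the target doubles as the link text.
  $label = $match[2] ?? $target;
  return "[" . $label . "](" . $target . ")";
}

// Convert every wiki-style link found in $text.
function convert_freelinks(string $text): string {
  return preg_replace_callback(FREELINK_RE, 'freelink_to_markdown', $text);
}

$sample = "See [[http://example.com/|the example site]] or [[http://example.org/]].";
print convert_freelinks($sample) . "\n";
```

Running it with the php command line should print the sample sentence with both links rewritten as `[label](target)`. Text without any `[[...]]` is returned unchanged, which is the same property the script relies on to decide whether a node needs saving.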

So long and annoying, but at long last done!

Comments

#1 dasjo : why would you want to convert

Why would you want to convert existing content to markdown if the main advantage, to me, is the enhanced writing experience? Why not just skip that, write new content using markdown and optionally convert legacy content to HTML if it wasn't already?


#2 anarcat : That's a good point, i didn't

That's a good point, I didn't actually think of that. I guess it was a tad harder to hook into the filter system than to just do simple regex replacements and node_save(), which I am more familiar with. The possibility of easily editing previous content is also attractive.

I was already writing new content in markdown, though...


#3 juan_g : Markdown, HTML, Org-mode...

Now that it's all Markdown, if you later wish to convert it to other formats, there is the Pandoc document converter (to HTML, ODT, LaTeX, PDF, MediaWiki...). My favorite conversion is to Emacs Org-mode. ;)


#4 anarcat : I know right! It's awesome.

I know, right! It's awesome. At long last, standards! Or somewhat standard, anyway.