At the moment I'm creating a WordPress plugin for manipulating external links in content, for example to add rel=nofollow or icons to a link (see https://github.com/robogeek/wp-nofollow).
That means I've been reviewing both WordPress plugins and Drupal modules with similar functionality, to see how others have solved these same problems. Most use regular expressions (PHP's preg_* functions) to match text, and PHP's str_replace to make changes.
I can think of several potential bugs with this. For example, str_replace replaces every occurrence of the search string, so if the same text appears twice in the content, it will also change the copy you never meant to touch.
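To make that concern concrete, here is a small sketch. The sample markup and variable names are made up for illustration and are not taken from any of the plugins I reviewed. It matches only the first link with a regular expression, then rewrites it with str_replace, and the identical second link gets changed too.

```php
<?php
// Hypothetical content in which the same link appears twice.
$content = '<p><a href="http://example.com/">one</a> and '
         . '<a href="http://example.com/">two</a></p>';

// Match only the first opening <a> tag ...
preg_match('/<a [^>]*>/', $content, $m);
$tag = $m[0];

// ... then try to rewrite just that one tag.
$replacement = str_replace('<a ', '<a rel="nofollow" ', $tag);

// str_replace() has no idea which occurrence was matched; it replaces all of them.
$result = str_replace($tag, $replacement, $content);

echo substr_count($result, 'rel="nofollow"'), "\n"; // prints 2
```

The regular expression found one link, but both ended up modified, because the second step works on raw text rather than on the element that was matched.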
The improved technique I'm recommending is to use PHP's DOMDocument class instead. You use that class to parse the $content variable, and then you have all the DOM API calls you'd want for manipulating the markup. Kudos to the https://github.com/whyte624/wordpress-favicon-links/ plugin for teaching me this trick.
The outline of your processing filter goes like so:
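function xyzzy_links_the_content($content)
{
    try {
        $html = new DOMDocument('1.0', 'UTF-8');
        // The meta tag makes libxml read $content as UTF-8; @ hides warnings about imperfect markup.
        @$html->loadHTML('<meta http-equiv="content-type" content="text/html; charset=utf-8">' . $content);
        // ... process the $html DOM object
        // Return only what is inside <body>; a bare saveHTML() also emits the doctype, <html> and <head>.
        $body = $html->getElementsByTagName('body')->item(0);
        if ($body === null) {
            return $content;
        }
        $out = '';
        foreach ($body->childNodes as $node) {
            $out .= $html->saveHTML($node);
        }
        return $out;
    } catch (Exception $e) {
        // Fall back to the original content if anything goes wrong.
        return $content;
    }
}
add_filter('the_content', 'xyzzy_links_the_content');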
With this you have a properly parsed DOM object, and because the meta tag declares UTF-8 you don't have to worry about the parser guessing the wrong encoding. You're manipulating objects, and when you're done they are serialized back to HTML.
If your processing needs to inspect all "a" tags:
foreach ($html->getElementsByTagName('a') as $a) {
    // ... process each link
}
If you want to add an attribute to a specific link, like target=_blank:
$a->setAttribute('target', '_blank');
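Putting the loop and setAttribute together, here is a rough sketch of marking external links. The function name xyzzy_mark_external_links and the $site_host parameter are hypothetical names of my own; in a real plugin you would probably derive the host from something like parse_url(home_url(), PHP_URL_HOST). It serializes only the children of the body element, since a bare saveHTML() returns a whole document including the doctype and the html, head and body wrappers.

```php
<?php
// Hypothetical helper: adds rel="nofollow" and target="_blank" to links
// whose host differs from $site_host. All names here are illustrative.
function xyzzy_mark_external_links($content, $site_host)
{
    $html = new DOMDocument('1.0', 'UTF-8');
    @$html->loadHTML('<meta http-equiv="content-type" content="text/html; charset=utf-8">' . $content);

    foreach ($html->getElementsByTagName('a') as $a) {
        $host = parse_url($a->getAttribute('href'), PHP_URL_HOST);
        // Relative or unparseable links have no host; treat them as internal.
        if (!$host || strcasecmp($host, $site_host) === 0) {
            continue;
        }
        // Note: this overwrites any rel value the link already had.
        $a->setAttribute('rel', 'nofollow');
        $a->setAttribute('target', '_blank');
    }

    // Serialize only what is inside <body>, not the wrapper the parser adds.
    $body = $html->getElementsByTagName('body')->item(0);
    $out = '';
    if ($body !== null) {
        foreach ($body->childNodes as $node) {
            $out .= $html->saveHTML($node);
        }
    }
    return $out;
}

$sample = '<p><a href="/about">About</a> <a href="http://example.org/x">Elsewhere</a></p>';
$marked = xyzzy_mark_external_links($sample, 'example.com');
echo $marked, "\n";
```

Only the second link in the sample should pick up the new attributes, and there is no string matching involved, so a link that appears twice is simply visited twice as two separate elements.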
Basically with a DOM object you're free to make any HTML manipulation you want.
To learn more about DOMDocument, see http://php.net/manual/en/class.domdocument.php