How to find words with doubled letters in HTML text with a regexp

This is not a trivial task, first of all because parsing HTML with regex is hazardous. With all the usual disclaimers about doing so, here is a regex for the job:

(?s)(?:<body>|\G)(?:.(?!</body>))*?\K\b\w*(\w)\1\w*\b

In Perl:

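# In list context, m//g with a capture group returns the Group 1 captures (the doubled letters), not the words, so collect the whole match $& instead
push @result, $& while $subject =~ m%(?s)(?:<body>|\G)(?:.(?!</body>))*?\K\b\w*(\w)\1\w*\b%g;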
  • (?s) allows the dot to match newlines
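  • (?:<body>|\G) matches <body> or the ending position of the previous match. Note that \G also matches at the very start of the string before any match has been made, so a doubled-letter word that precedes <body> (in the <head>, say) can match too; writing \G(?!\A) instead of \G rules that out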
  • (?:.(?!</body>))*? lazily matches chars that are not followed by the closing </body> tag
  • \K tells the engine to drop what has been matched so far from the returned match
  • \b\w*(\w)\1\w*\b matches a whole word (delimited by the \b word boundaries) made of some optional chars \w*, then one captured char (\w) immediately followed by itself, as referenced by the Group 1 backreference \1, and more optional chars \w*
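
To see the pieces above working together, here is a minimal runnable sketch. The sample HTML is made up for illustration. The sketch also deviates from the pattern above in one place: it uses \G(?!\A) rather than a bare \G, on the assumption that words before <body> should be skipped (a bare \G also matches at the very start of the string).

```perl
use strict;
use warnings;

# Hypothetical sample document: "Hello" sits in the <head>, and "too" sits
# after </body>, so neither should be reported.
my $subject = <<'HTML';
<html><head><title>Hello</title></head>
<body>
<p>The bookkeeper will see a llama.</p>
</body>
<!-- footer: too late -->
</html>
HTML

# Same pattern as in the text, except that \G is guarded with (?!\A) so it
# cannot anchor the very first attempt at the start of the string.
my $re = qr%(?s)(?:<body>|\G(?!\A))(?:.(?!</body>))*?\K\b\w*(\w)\1\w*\b%;

# Scalar-context //g walks the string one match at a time; thanks to \K,
# $& holds only the word itself, not the text skipped before it.
my @result;
while ($subject =~ /$re/g) {
    push @result, $&;
}

print join(", ", @result), "\n";
```

On this sample the script should print bookkeeper, will, see, llama. With a bare \G, "Hello" from the <title> would be reported as well.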

If you only want to allow letters (no digits and underscores), replace all the \w with [a-z] and replace (?s) with (?is) to make it case-insensitive.
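
As a rough illustration of that substitution, the sketch below runs both variants over the same made-up string. The helper doubled_words and the sample text are hypothetical names chosen for this example; the sample starts directly at <body>, so nothing precedes the tag.

```perl
use strict;
use warnings;

# Hypothetical sample; it begins at <body>, so no text comes before the tag.
my $subject = '<body>Aaron read book 1001 in room_22 on a hill.</body>';

# Collect every whole match ($&) of a compiled pattern in the given text.
sub doubled_words {
    my ($re, $text) = @_;
    my @found;
    push @found, $& while $text =~ /$re/g;
    return @found;
}

# Original pattern: \w also accepts digits and underscores, case-sensitive.
my $any_word_char = qr%(?s)(?:<body>|\G)(?:.(?!</body>))*?\K\b\w*(\w)\1\w*\b%;

# Letters-only variant: every \w replaced by [a-z], and (?s) by (?is).
my $letters_only = qr%(?is)(?:<body>|\G)(?:.(?!</body>))*?\K\b[a-z]*([a-z])\1[a-z]*\b%;

my @with_w       = doubled_words($any_word_char, $subject);
my @with_letters = doubled_words($letters_only,  $subject);

print "\\w version:   @with_w\n";
print "letters only: @with_letters\n";
```

The \w version should report book, 1001, room_22 and hill, while the letters-only version should report Aaron, book and hill. "Aaron" appears only in the second list because the backreference matches case-insensitively under (?i), and "room_22" drops out because \b still treats the underscore and digits as word characters, so no boundary follows "room".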


