HowProgOne: How to find words with doubled letters in HTML text with a regexp

This is not a trivial task, chiefly because parsing HTML with a regex is hazardous. With all the usual disclaimers about doing so, here's a regex for the job:

(?s)(?:<body>|\G)(?:.(?!</body>))*?\K\b\w*(\w)\1\w*\b

In Perl:

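# List-context m//g would return only the Group 1 captures (the doubled letters), so push the whole match $& in a loop
push @result, $& while $subject =~ m%(?s)(?:<body>|\G)(?:.(?!</body>))*?\K\b\w*(\w)\1\w*\b%g;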

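(?s) allows the dot to match newlines
(?:<body>|\G) matches <body> or the ending position of the previous match. On the first attempt \G also matches at the very start of the string, so text before <body> is scanned too; write \G(?!\A) if that matters
(?:.(?!</body>))*? lazily matches chars that are not followed by the closing </body> tag. The lookahead is checked after each char rather than before it, so a scan that starts exactly at </body> can slip past it; (?:(?!</body>).)*? closes that gap
\K tells the engine to drop what has been matched so far from the returned match
\b\w*(\w)\1\w*\b matches a word (delimited by \b boundaries) made of some optional chars \w*, then one captured char (\w) immediately followed by itself, as referenced by the Group 1 backreference \1, and more optional chars \w*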
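
As a rough sketch of how these pieces fit together, the pattern can be spelled out with the /x modifier so that each part carries its own comment. The find_doubled helper and the sample HTML are made up for illustration. The sketch also departs from the pattern above in two ways, both assumptions of this example rather than part of the original regex: \G is followed by (?!\A) so that it cannot match at the very start of the string, and the lookahead is placed before the dot so the scan cannot step over </body>.

```perl
use strict;
use warnings;

# Hypothetical helper: returns every word inside <body> that contains a doubled character.
sub find_doubled {
    my ($subject) = @_;
    my @words;
    while ($subject =~ m{
        (?: <body> | \G (?! \A ) )   # start at <body>, or where the previous match ended
        (?: (?! </body> ) . )*?      # lazily skip chars, never stepping onto </body>
        \K                           # discard everything matched so far
        \b \w* (\w) \1 \w* \b        # a word containing a doubled character
    }gsx) {
        push @words, $&;             # $& holds only the text matched after \K
    }
    return @words;
}

# Made-up sample: one doubled word before <body>, three inside, one after </body>.
my $html = "<html><head><title>Hello</title></head>\n"
         . "<body><p>A good book is\nworth keeping.</p></body>\n"
         . "<p>ignored footer</p></html>";

print join(", ", find_doubled($html)), "\n";
```

With this sample the script should print good, book, keeping: "Hello" in the title and "footer" after </body> are left out. The loop is in scalar context, so each iteration resumes at pos($subject), which is what \G relies on.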

If you only want to allow letters (no digits or underscores), replace every \w with [a-z] and replace (?s) with (?is) to make the match case-insensitive.
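
To make the difference concrete, here is a small comparison of the two variants on a made-up string. The collect helper and the sample input are hypothetical; the patterns are the ones given above, with only the \w and (?s) substitutions applied.

```perl
use strict;
use warnings;

# Hypothetical helper: collect whole matches ($&) for a given compiled pattern.
sub collect {
    my ($re, $subject) = @_;
    my @found;
    push @found, $& while $subject =~ /$re/g;
    return @found;
}

# The pattern as given, and the letters-only, case-insensitive variant.
my $any_word_char = qr%(?s)(?:<body>|\G)(?:.(?!</body>))*?\K\b\w*(\w)\1\w*\b%;
my $letters_only  = qr%(?is)(?:<body>|\G)(?:.(?!</body>))*?\K\b[a-z]*([a-z])\1[a-z]*\b%;

# Made-up sample input for illustration.
my $subject = "<body>Aardvark x11 a__b Mississippi.</body>";

print join(" ", collect($any_word_char, $subject)), "\n";   # x11 a__b Mississippi
print join(" ", collect($letters_only,  $subject)), "\n";   # Aardvark Mississippi
```

The \w version accepts the doubled digit in "x11" and the doubled underscore in "a__b", but not "Aardvark", because "A" and "a" differ in case. Under (?is) the backreference \1 is compared case-insensitively as well, so the letters-only version picks up "Aardvark" and drops the other two.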

