This is not a trivial task, first of all because parsing HTML with regex is hazardous. With all the usual disclaimers about doing so, here's a regex for the job:
(?s)(?:<body>|\G)(?:.(?!</body>))*?\K\b\w*(\w)\1\w*\b
See the demo.
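In Perl (in list context, a /g match with a capture group returns the Group 1 captures rather than the whole matches, so collect $& in a loop instead):
push @result, $& while $subject =~ m%(?s)(?:<body>|\G)(?:.(?!</body>))*?\K\b\w*(\w)\1\w*\b%g;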
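A minimal, self-contained sketch of how the pattern might be used. The sample document in $subject is made up for illustration and is not from the question.

```perl
use strict;
use warnings;

# Hypothetical sample document (an assumption, for illustration only).
my $subject = <<'HTML';
<html>
<body>
<p>A happy little kitten sat on the mat.</p>
</body>
</html>
HTML

my @result;
# Scalar-context /g loop: each iteration resumes where the last match ended.
# $& holds the whole match, which is only the text after \K.
while ($subject =~ m%(?s)(?:<body>|\G)(?:.(?!</body>))*?\K\b\w*(\w)\1\w*\b%g) {
    push @result, $&;
}

print join(", ", @result), "\n";   # prints: happy, little, kitten
```

The loop stops on its own once no further doubled-letter word can be reached before the closing </body> tag.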
(?s) allows the dot to match newlines.
(?:<body>|\G) matches <body> or the ending position of the previous match.
(?:.(?!</body>))*? lazily matches chars that are not followed by the closing </body> tag.
\K tells the engine to drop what has been matched so far from the returned match.
\b\w*(\w)\1\w*\b matches a word (delimited by \b boundaries) made of some optional chars \w*, then one captured char (\w) immediately followed by itself, as referenced by the Group 1 back-reference \1, and more optional chars \w*.
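The breakdown above can also be written into the pattern itself. The sketch below uses Perl's /x flag so that each piece carries its own comment. The variable names and the sample string are assumptions made for illustration.

```perl
use strict;
use warnings;

# Same pattern as above, with /s replacing the inline (?s) and /x allowing
# whitespace and comments inside the regex.
my $doubled_letter_word = qr{
    (?: <body> | \G )         # start at <body>, or where the previous match ended
    (?: . (?! </body> ) )*?   # lazily skip chars that are not followed by </body>
    \K                        # drop everything matched so far from the result
    \b \w* (\w) \1 \w* \b     # a whole word containing one char immediately repeated
}xs;

# Hypothetical input, for illustration only.
my $subject = "<body>see the moon and the balloon</body>";

my @words;
push @words, $& while $subject =~ /$doubled_letter_word/g;

print "@words\n";   # prints: see moon balloon
```

One caveat, as far as I can tell: before any match has been made, \G also matches at the very start of the string, so words that appear ahead of <body> can be picked up too. A common guard is to write \G(?!\A) instead of the bare \G.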
If you only want to allow letters (no digits and underscores), replace all the \w with [a-z] and replace (?s) with (?is) to make it case-insensitive.
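A short sketch of that letters-only variant, assuming a made-up input in which doubled digits appear alongside doubled letters:

```perl
use strict;
use warnings;

# Hypothetical input: "1001" and "2200" contain doubled digits,
# "Aaron" contains a doubled letter in mixed case.
my $subject = "<body>Aaron booked room 1001 and 2200</body>";

my @words;
# \w is replaced by [a-z], and (?s) by (?is) so that "A" and "a" count as the same letter.
push @words, $& while $subject =~ m%(?is)(?:<body>|\G)(?:.(?!</body>))*?\K\b[a-z]*([a-z])\1[a-z]*\b%g;

print "@words\n";   # prints: Aaron booked room
```

With the original \w version the numbers would match and "Aaron" would not, because the back-reference is case-sensitive there.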