This is not a trivial task, not least because parsing HTML with regex is hazardous. With all the usual disclaimers about doing so, here is a regex for the job:
(?s)(?:<body>|\G)(?:.(?!</body>))*?\K\b\w*(\w)\1\w*\b
In Perl:
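# In list context, m//g with a capture group returns the captured letters,
# not the words, so collect the whole match ($&) in a loop instead.
push @result, $& while $subject =~ m%(?s)(?:<body>|\G)(?:.(?!</body>))*?\K\b\w*(\w)\1\w*\b%g;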
(?s) allows the dot to match newlines.
(?:<body>|\G) matches <body> or the ending position of the previous match. Note that \G also matches at the very start of the subject, so text before <body> is searched as well; use \G(?!\A) in place of \G to rule that out.
(?:.(?!</body>))*? lazily matches characters that are not followed by the closing </body> tag.
\K tells the engine to drop what has been matched so far from the returned match.
\b\w*(\w)\1\w*\b matches a word (delimited by \b word boundaries) made of some optional characters \w*, then one captured character (\w), immediately followed by itself as referenced by the Group 1 backreference \1, and more optional characters \w*.
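For comparison, here is a minimal sketch of the same idea in Python. It is not a port of the regex above: Python's standard `re` module supports neither \G nor \K, so this swaps in a two-step approach (isolate the body first, then apply the word pattern). The function name `doubled_words_in_body` and the sample string are made up for illustration, and like the regex above it assumes a bare <body> tag with no attributes.

```python
import re

# Same word pattern as in the answer: a run of word characters containing
# one character (group 1) that is immediately repeated.
DOUBLED = re.compile(r"\b\w*(\w)\1\w*\b")

def doubled_words_in_body(html):
    """Return the words with a doubled character found between <body> and </body>."""
    # Step 1: isolate the body; re.S plays the role of (?s).
    body = re.search(r"<body>(.*?)</body>", html, re.S)
    if body is None:
        return []
    # Step 2: group(0) is the whole word, group(1) only the doubled character.
    return [m.group(0) for m in DOUBLED.finditer(body.group(1))]

sample = (
    "<html><head><title>Hello</title></head>\n"
    "<body>\n<p>A good book is all you need</p>\n</body></html>"
)
print(doubled_words_in_body(sample))  # ['good', 'book', 'all', 'need']
```

"Hello" in the title is skipped because the search only runs on the text captured between the body tags.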
If you only want to allow letters (no digits or underscores), replace every \w with [a-z] and replace (?s) with (?is) to make the match case-insensitive.
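As a rough illustration of the letters-only variant, the sketch below applies just the word part of the pattern in Python, with `re.I` standing in for the (?i) flag. The sample text is invented for the example.

```python
import re

# Letters-only variant: every \w replaced by [a-z], matched case-insensitively.
LETTERS_ONLY = re.compile(r"\b[a-z]*([a-z])\1[a-z]*\b", re.I)

text = "Aardvark v11 x22y LLama moon"
print([m.group(0) for m in LETTERS_ONLY.finditer(text)])
# ['Aardvark', 'LLama', 'moon']
```

With \w, "v11" and "x22y" would also match because of the doubled digits; the [a-z] class excludes them, and the case-insensitive flag lets "Aa" count as a doubled letter.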