Find and Replace With Wildcards But Stop At Specific Character

JohnF1 · August 11, 2022, 5:58am

I have a spreadsheet with a column that contains only HTML.

I am trying to remove hyperlink HTML anchor tags ONLY.

2)The only solution I found was Find/Replace and use the formula from cell A3

How do I get it to start at <a but stop at the first </ a> tag that it runs into? In the attached example, I assume cell D1 would be achieved after running a modified version of the find/replace twice, so that it removes 1 anchor each time it’s ran.
ExampleDoc.ods (10.1 KB)

mikekaganski · August 11, 2022, 6:14am

I assume that your problem is that your <a.*</a> regex is “greedy”, i.e. it will match everything from the first <a to the last </a>, including all intermediate </a>…<a.

That is the property of the * operator; and the regular expression documentation provides you the non-greedy alternative: *?.

Since <a.* also matches e.g. <area>, you need to indicate that you only need those as that are the whole words (using \b metacharacter).

So: <a\b.*?</a>

Note that you may use REGEX spreadsheet function to replace all occurrences in the text at once, and this way not have to use Find & Replace:

=REGEX(B2;"<a\b.*?</a>";"";"g")

Also note that the regex does not handle nested a tags correctly. If this nesting is possible, then you would need to use negative look-ahead assertions (like (?:.(?!<a\b)) in place of .), and several passes.