Deleting html codes in spreadsheet

spmcg77 · May 14, 2018, 8:46am

I want to replace all instances of html code in a spreadsheet column (it’s output of student text from Moodle).

I found how to use regular expressions in search and replace, but if I do a search for <.*> more cells are completely deleted than should be. For example, the following cell gets blanked out.

<p>I think during war , because reb background like fire.</p><p>I think poor ,because this child look gaunt.</p><p>&nbsp;</p>

However, the first sentence in this one remains

I think it is in the evening because the sun is set.

<p><span style="font-size: 1rem; -webkit-text-size-adjust: 100%;">I think we are in the desert because there are only two cactuses</span></p>

I don’t know programming well, but I don’t see any unbalanced <>s. I guess this will be obvious to those who know this stuff. I’ve tried everything I can think of.

Thanks for any help.

(Slightly edited by @Lupp in the hope to get a better view of the question intended.)

Lupp · May 14, 2018, 11:31am

“Removing html” will be more complex
Do you actually want to remove every cell content being a html paragraph object?

dajare · May 14, 2018, 1:57pm

Even with @Lupp’s helpful edit, it is still not clear (a) what your raw text looks like; or (b) what you want to achieve. I assume for (b), that you simply want to remove all HTML tags and have the text strings left intact – is that correct? (A “before/after” example of what you want to see would help.)

Also, if you could upload/attach a bit of a sample document (use the little “paperclip” icon in the edit window) with some suitably anonymized data, that would help enormously.

JohnSUN · May 14, 2018, 9:47am

It seems to me that you are looking for this regular expression

<[^>]*>

(Left angle bracket - any number of characters that are not the right angle bracket - right angle bracket)

Lupp · May 14, 2018, 12:23pm

I simply added an answer to a different interpretation of the question. In fact I don’t feel sure at all.
However, the code-piece   from the question would result in   if only the lead-in and the lead-out of the tag are removed. That wouldn’t make much sense. An actual replacement with a no-break-space would be needed. (What that special character should achieve being the only character of a paragraph is again unclear to me.)

mikekaganski · May 14, 2018, 1:34pm

I’m afraid that both proposals by @JohnSUN and @Lupp can easily remove unpaired starting/ending tags… e.g., <[^>]*> will remove as many starting tags as there will happen after , but will stop at first closing (EDIT: sorry - I’m wrong: this will simply remove all tags); and <p[^>]*>.* will end up at the very last , because greedy regex search will skip all previous in .* (and no, I don’t have a proposal myself)

dajare · May 14, 2018, 1:50pm

I’m delighted to have an opportunity to share a link to my favourite answer on StackOverflow which happens to address the issue of using regex and html. It is (justly) famous, so perhaps it is old news to those participating in this Q&A.

mikekaganski · May 14, 2018, 2:03pm

@dajare: great! (and was a news for me )

Lupp · May 14, 2018, 3:35pm

@mikekaganski: If there may be more than one html paragraphs mixed with different content in a cell you can make the second asterisk in my Regex “lazy” by appending a question mark: <p[^>]*>.*?

Lupp · May 14, 2018, 12:17pm

Referring tp my comment above:

To remove complete html paragraphs you may search for the regex <p[^>]*>.* and replace the findings woith nothing (empty string).

Is this what you wanted?

Edit 1:
Referring to my second comment on the answer by @JohnSUN (there answering to a comment by @mikekaganski ), I would suggest to also take in account the RegEx <p[^>]*>.*? if more than one html-paragraphs can occur in a single cell.
Example attached.

This again if the intention is to completely remove the html-paragraphs and not just the lead-in and lead-out.

Lupp · May 14, 2018, 4:05pm

The impossibility to parse [X]HTML with Regex may not apply to regular subsets of HTML. And a way to switch off greediness may help.
(I didn’t know about that switch half an hour ago. Found it concerning ? and ?? here (much recommended!), tried it with *? and it worked with our ICU Regex engine.