Ask Your Question
0

Deleting html codes in spreadsheet

asked 2018-05-14 10:46:18 +0200

spmcg77 gravatar image

updated 2018-05-14 13:24:14 +0200

Lupp gravatar image

I want to replace all instances of html code in a spreadsheet column (it's output of student text from Moodle).

I found how to use regular expressions in search and replace, but if I do a search for <.*> more cells are completely deleted than should be. For example, the following cell gets blanked out.

<p>I think during war , because reb background like fire.</p><p>I think poor ,because this child look gaunt.</p><p>&nbsp;</p>

However, the first sentence in this one remains

I think it is in the evening because the sun is set.

<p><span style="font-size: 1rem; -webkit-text-size-adjust: 100%;">I think we are in the desert because there are only two cactuses</span></p>

I don't know programming well, but I don't see any unbalanced <>s. I guess this will be obvious to those who know this stuff. I've tried everything I can think of.

Thanks for any help.

(Slightly edited by @Lupp in the hope to get a better view of the question intended.)

edit retag flag offensive close merge delete

Comments

"Removing html" will be more complex
Do you actually want to remove every cell content being a html paragraph object?

Lupp gravatar imageLupp ( 2018-05-14 13:31:37 +0200 )edit

Even with @Lupp's helpful edit, it is still not clear (a) what your raw text looks like; or (b) what you want to achieve. I assume for (b), that you simply want to remove all HTML tags and have the text strings left intact -- is that correct? (A "before/after" example of what you want to see would help.)

Also, if you could upload/attach a bit of a sample document (use the little "paperclip" icon in the edit window) with some suitably anonymized data, that would help enormously.

David gravatar imageDavid ( 2018-05-14 15:57:24 +0200 )edit

2 Answers

Sort by » oldest newest most voted
0

answered 2018-05-14 11:47:08 +0200

JohnSUN gravatar image

It seems to me that you are looking for this regular expression

<[^>]*>

(Left angle bracket - any number of characters that are not the right angle bracket - right angle bracket)

edit flag offensive delete link more

Comments

I simply added an answer to a different interpretation of the question. In fact I don't feel sure at all.
However, the code-piece <p>&nbsp;</p> from the question would result in &nbsp; if only the lead-in and the lead-out of the tag are removed. That wouldn't make much sense. An actual replacement with a no-break-space would be needed. (What that special character should achieve being the only character of a paragraph is again unclear to me.)

Lupp gravatar imageLupp ( 2018-05-14 14:23:49 +0200 )edit

I'm afraid that both proposals by @JohnSUN and @Lupp can easily remove unpaired starting/ending tags... e.g., <[^>]*> will remove as many starting tags as there will happen after <p>, but will stop at first closing (EDIT: sorry - I'm wrong: this will simply remove all tags); and <p[^>]*>.*</p> will end up at the very last </p>, because greedy regex search will skip all previous in .* (and no, I don't have a proposal myself)

Mike Kaganski gravatar imageMike Kaganski ( 2018-05-14 15:34:34 +0200 )edit
3

I'm delighted to have an opportunity to share a link to my favourite answer on StackOverflow which happens to address the issue of using regex and html. It is (justly) famous, so perhaps it is old news to those participating in this Q&A. :)

David gravatar imageDavid ( 2018-05-14 15:50:33 +0200 )edit

@David: great! (and was a news for me :) )

Mike Kaganski gravatar imageMike Kaganski ( 2018-05-14 16:03:09 +0200 )edit

@Mike Kaganski: If there may be more than one html paragraphs mixed with different content in a cell you can make the second asterisk in my Regex "lazy" by appending a question mark: <p[^>]*>.*?</p>

Lupp gravatar imageLupp ( 2018-05-14 17:35:27 +0200 )edit
0

answered 2018-05-14 14:17:42 +0200

Lupp gravatar image

updated 2018-05-14 18:07:46 +0200

Referring tp my comment above:

To remove complete html paragraphs you may search for the regex <p[^>]*>.*</p> and replace the findings woith nothing (empty string).

Is this what you wanted?

Edit 1:
Referring to my second comment on the answer by @JohnSUN (there answering to a comment by @Mike Kaganski ), I would suggest to also take in account the RegEx <p[^>]*>.*?</p> if more than one html-paragraphs can occur in a single cell.
Example attached.

This again if the intention is to completely remove the html-paragraphs and not just the lead-in and lead-out.

edit flag offensive delete link more

Comments

The impossibility to parse [X]HTML with Regex may not apply to regular subsets of HTML. And a way to switch off greediness may help.
(I didn't know about that switch half an hour ago. Found it concerning ? and ??here (much recommended!), tried it with *? and it worked with our ICU Regex engine.

Lupp gravatar imageLupp ( 2018-05-14 18:05:02 +0200 )edit
Login/Signup to Answer

Question Tools

1 follower

Stats

Asked: 2018-05-14 10:46:18 +0200

Seen: 49 times

Last updated: May 14 '18