Characters turn into question marks when opening HTML file

euuuuuu · June 5, 2018, 7:59pm

I saved an HTML file using Google Chrome, and when I open that file in Writer, some characters (the accented ones and, sometimes, the character that follows them) appear as question marks (actual question marks, not �). For example: “rep?lica” instead of “república”.

If I open this same file using Chrome, the text is normal.
This happens if I download using the option “Webpage, HTML only”; but the text appears normally if I download using the option “Webpage, Complete”.
I downloaded other web pages with this same option and Writer shows accented characters normally.

I’m trying to download this page: https://www.planalto.gov.br/ccivil_03/constituicao/constituicao.htm

I tried to put the downloaded file as an attachment, but the site doesn’t allow html attachments.

The system and LO are in Portuguese.
Version: 6.0.4.2 (x64)
Build ID: 9b0d9b32d5dcda91d2f1a96dc04c645c450872bf
CPU threads: 4; OS: Windows 10.0; UI render: default;
Locale: pt-BR (pt_BR); Calc: CL

gabix · June 6, 2018, 9:41am

I cannot reproduce:

Opened the web page by the link in the original posting in Mozilla Firefox 60 on Windows 7.
The letters with diacritics were replaced with Cyrillic letters. It appeared that the stupid ape(s) who created the page did not add an encoding declaration.
So, I forced a Western European encoding.
Saved as a full HTML page.
Opened in LO Writer/Web 6.0.4.2. No problem with the letters.

Thus, the problem might be about Chrome.

euuuuuu · June 6, 2018, 11:01am

But you saved “full HTML page”, and I don’t want that. When I download the page using Firefox and the option “Web page, HTML only” the problem also happens.

How do I force Western European encoding? Does that solve the problem when downloading using the option “Web page, HTML only”?

mikekaganski · June 6, 2018, 3:13pm

The “Webpage, HTML only” option in Chrome saves a verbatim copy of the web page, just as it is received from the server. On the other hand, “Webpage, Complete” saves the result of the browser’s processing, including its guess of the codepage (which it gets right in this case, it seems), and it writes everything it guessed to the resulting file.

LibreOffice doesn’t try to guess the charset of the HTML. It uses the meta information from the file, if present, which is not the case for this web page. When that data is absent, it probably defaults to UTF-8 (I haven’t actually checked, but that’s irrelevant). What is relevant is that this is an example of the “garbage in, garbage out” principle.
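
To check whether a saved file carries any charset meta information at all, something like the following Python sketch may help. It is only an illustration: `declared_charset` and `scan_limit` are made-up names, the regular expression is a rough approximation rather than what LibreOffice or any browser actually does, and it looks only at `<meta>` tags (not at HTTP headers or a byte order mark).

```python
import re

# Rough pattern for <meta charset="..."> and
# <meta http-equiv="Content-Type" content="text/html; charset=...">.
META_CHARSET = re.compile(
    rb'<meta[^>]*charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)',
    re.IGNORECASE,
)

def declared_charset(html_bytes, scan_limit=4096):
    """Return the charset named in a <meta> tag near the top of the file, or None."""
    match = META_CHARSET.search(html_bytes[:scan_limit])
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()
```

If this returns `None` for the “HTML only” download but a charset name for the “Complete” download, that would be consistent with the explanation above.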

euuuuuu · June 7, 2018, 12:38am

Thanks for the explanation.

petermau · June 6, 2018, 2:37pm

The HTML page source says that it is using the Windows-1252 character set, not the international standard Unicode or its subset ISO-8859-1. This is causing the incorrect display of the data. It is in fact the reverse of the � problem. The page should either be saved using Unicode, as used by the Internet or LibreOffice or imported into LibreOffice specifying the data is in Windows-1252 format. I am surprised that Gov.Br is not using Unicode.
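
One possible way to get a Unicode copy of the saved file, outside both Chrome and LibreOffice, is to re-encode it with a small script. The Python sketch below assumes the bytes really are Windows-1252 and that putting a `<meta charset="utf-8">` tag right after `<head>` is enough for the importer to pick it up; `reencode_to_utf8` and the file names in the comments are hypothetical.

```python
import re

def reencode_to_utf8(html_bytes, source_encoding="windows-1252"):
    """Decode HTML bytes from source_encoding and return UTF-8 bytes
    with an explicit charset declaration added."""
    # Raises UnicodeDecodeError if a byte is undefined in source_encoding.
    text = html_bytes.decode(source_encoding)
    meta = '<meta charset="utf-8">'
    # Match <head> or <head ...attributes>, but not <header>.
    head = re.search(r"<head(\s[^>]*)?>", text, re.IGNORECASE)
    if head:
        text = text[:head.end()] + meta + text[head.end():]
    else:
        # No <head> tag found: put the declaration at the very start.
        text = meta + text
    return text.encode("utf-8")

# Possible usage (file names are placeholders):
# from pathlib import Path
# data = Path("constituicao.htm").read_bytes()
# Path("constituicao-utf8.htm").write_bytes(reencode_to_utf8(data))
```

The script does not remove any charset declaration that is already in the file; it only places the new one first, so an existing conflicting tag may still need to be edited by hand.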

euuuuuu · June 7, 2018, 12:41am

Thanks for the explanation! But could you explain how I can save the page “using Unicode, as used by the Internet or LibreOffice or imported into LibreOffice specifying the data is in Windows-1252 format”? When I open the file in LO there is no option to choose the format, and there is none when I save the page in Chrome either.