Is it possible in Libre Office to create word/phrase index based on a word-list?

Question edited following advice from @ajlittoz.

I have already searched the extension manager, but didn’t come across such an extension. Maybe I missed it. Could I please be guided?

I would like a plug-in/extension that would create a thorough index of words in a document, except the weak words (including weak verbs):

a, an, am, is, the, this, that, these, those, are, was, were, be, being, been;

verbs (a list from the common dictionary)

and prepositions:

about, above, across, after, against, among, around, at, before, behind, below, beside, between, by, down, during, for, from, in, inside, into, near, of, off, on, out, over, through, to, toward, under, up, with, aboard, along, amid, as, beneath, beyond, but, concerning, considering, despite, except, following, like, minus, next, onto, opposite, outside, past, per, plus, regarding, round, save, since, than, till, underneath, unlike, until, upon, versus, via, within, without, etc.,

and the places of their occurrences page-wise, para-wise, and others. Also, user-defined phrases.

The package information for swish-e states:

SWISH-Enhanced is a fast, powerful, flexible, and easy to use system for indexing collections of HTML Web pages, or any XML or text files like Open Office Documents, Open Document files, emails, and so on.

Following advice from @ajlittoz, it now appears that the question should be rephrased to invite an answer that serves the ultimate objective. Thank you, Mr. @ajlittoz!

Following the 1st Edit by Mr. @ajlittoz, who has been very patient in listening to my requirements and has helped me rephrase and clarify further (for which much admiration is due him), and who posted below:

index all words except a list of “noise” words, there is no automatic solution.

I would have to edit my question a little bit more:

Coming back to the edited question, my requirement is that I should be able to have a separate list of words to be excluded from indexing (I had mistakenly written “included” earlier).
My error was an inadvertent typo which I had overlooked, Mr. @ajlittoz; I sincerely apologise for it. I have already given a list of words to be excluded, which should explain my needs. That such a possibility is still not a reality should inspire our wonderful community of programmers and team leaders, and perhaps hint at a direction for them to work towards a solution, should it be considered a good tool for the future.

An Exclusion List should be an option to create a better secondary indexing tool. The primary indexing tool is just fine.

Thank you Mr. @ajlittoz, for your 2nd edit:

… you can prepare a concordance file

Unfortunately, Mr. @ajlittoz, what I need instead is an anti-concordance (call it an exclusion or discordance) file, not a concordance file. I apologise that I wasn’t able to make myself clearer; please forgive my limitations in explaining my needs.

The main issue remains: to have an index of all words used, except the weak verbs, prepositions, conjunctions, et al.

I am thinking aloud: isn’t there any package that could list, in a second file, all the words (or phrases) used in a file, along with the number of times those words (or phrases) occur?

[Added after your comment]
Yes, I understood the single-line command, and thank you for such a quick reply. There is also Python code for counting word frequencies, here: Counting Word Frequency in a File Using Python, with the code:

import re

frequency = {}
# Read the whole document and lower-case it
document_text = open('test.txt', 'r')
text_string = document_text.read().lower()
document_text.close()
# Match words of 3 to 15 ASCII letters
match_pattern = re.findall(r'\b[a-z]{3,15}\b', text_string)
for word in match_pattern:
    count = frequency.get(word, 0)
    frequency[word] = count + 1
for word in frequency:
    print(word, frequency[word])

So these frequency lines could also be combined with your suggested command line. I could create a script file to do so.
But we also have to eliminate the list of unwanted words. Let us take some time off and contemplate this.

I end with a big Thank You, Mr. @ajlittoz, for your staying with me throughout this query and helping me with inputs. Best wishes.

When you explicitly mark the words you include in the index, there is no need for an exclusion list because you simply don’t mark the unwanted words.

There is however a feature I had forgotten about for semi-automatic index construction: concordance file.

Unfortunately, you can’t use an anti-concordance or exclusion list because Writer doesn’t index anything by default. Therefore, even if the notion of anti-concordance existed, excluding something from the empty set still returns the empty set.

What you can do, if you have the skill for it, is export your document as plain text (*.txt) and extract the words from it. You then expunge the noise words from this list and you nearly have your concordance file. Here is a one-line command for bash (Linux):

cat exported.txt | grep -o -E '\w+' | tr '[A-Z]' '[a-z]' | sort | uniq -c >preconcordance.txt

(This command is a single line even if it displays on two lines here)

EDIT The command above provides the word frequency.

That’s a cool script, but it will mangle words containing accented letters, won’t it?

Oh! Thank you, Mr. ajlittoz! This frequency factor would indeed be great. Thank you indeed. I have accordingly edited the question based on your inputs.

@floris_v: “words” are defined by the pattern \w+ in the grep step. It is likely that grep does not take the locale into account, but you can extend the “letter” class, e.g. [A-Za-zàâéèêÀÂÉÈÊ]+

@ajlittoz: correct.
@bkpsusmitaa: many years ago I wrote code to extract all words in a text document that didn’t occur in a kind of anti-concordance list (a list with correctly spelled words, so that you could spot anything out of the ordinary). It should be fairly easy to write code that filters the concordance list produced by ajlittoz’s script against a list with your weak words. Step 1: read the list with weak words. Step 2: read the word list, and save everything not in the weak words list in a new file.
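The two steps floris_v describes could be sketched in Python. This is a minimal sketch under assumptions: the function names are invented, and the input is expected to be a plain word list (one word per line, without the frequency counts); all file names are placeholders, not an existing tool.

```python
# Minimal sketch of the two-step filter described above.
# All file and function names are placeholders.

def filter_weak_words(words, weak_words):
    """Return the entries of `words` that are not in `weak_words`
    (case-insensitive comparison)."""
    weak = {w.lower() for w in weak_words}
    return [w for w in words if w.lower() not in weak]

def filter_word_list(word_file, weak_file, out_file):
    # Step 1: read the list of weak ("noise") words, one per line
    with open(weak_file) as f:
        weak_words = f.read().split()
    # Step 2: read the word list and save everything not in the weak list
    with open(word_file) as f:
        words = f.read().split()
    with open(out_file, 'w') as f:
        f.write('\n'.join(filter_weak_words(words, weak_words)) + '\n')
```

Called, for instance, as filter_word_list('wordlist.txt', 'weakwords.txt', 'filtered.txt'); the output file could then be reworked into a concordance file.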

I am also the administrator of the LXR general-purpose cross-reference utility. I use both SWISH-E and Glimpse.

As far as I know, SWISH-E can only index HTML documents. It may work on XML, like the internal Writer format. But you need to be aware of the limitations of SWISH-E (this is why I rather recommend Glimpse in LXR):

  • SWISH-E will only give a reference to the file (web page) containing the sought terms

    In my understanding, SWISH-E is rather a tool to find information in a bunch of web pages and you then use more traditional search tools to locate the terms in the suggested page.

  • it gives a “relevance score”, roughly proportional to the number of occurrences in the file

  • it does not give more accurate location information such as line number (contrary to Glimpse)

  • it has no notion of the structure and semantics of ODF (the internal Writer format)

Since SWISH-E is open-source, you can implement a scanner for ODF and plug it into SWISH-E if you have the skill for it.

With regard to location information, be aware it is “dynamic”: any edit in the document may change page number, line number or para number within the page. Therefore the utility/extension needs to be periodically relaunched to avoid stale data.

Perhaps a simpler approach would be to squat the index feature by stuffing into it the words of the document filtered out of the “noise” words.

EDIT 1

If you want to index only some words in your document, you only need to mark them:

  • select the word (or group of words you consider being the index key)

  • Insert>TOC & Index>Index Entry

  • check Apply to all similar texts to avoid having to do it everywhere

    Note that Apply to all similar texts is valid only for existing text. When you later add the same terms, you’ll need to index them individually.

If you want to index all words except a list of “noise” words, there is no automatic solution.

The Writer Alphabetical Index is a traditional index. I am not aware of any extension for KWIC (keyword in context) or KWOC (keyword out of context) indexes.

EDIT 2

If your list of words or expressions is fairly limited (“closed”, to be understood as defined a priori and not liable to be augmented in the text), you can prepare a concordance file. How this file is made is explained in the built-in help at concordance files;definition.

The concordance file is designated when you Insert>TOC & Index>TOC, Index & Bibliography by ticking the check box.

To show the community your question has been answered, click the ✓ next to the correct answer, and “upvote” by clicking on the ^ arrow of any helpful answers. These are the mechanisms for communicating the quality of the Q&A on this site. Thanks!

In case you need clarification, edit your question (not an answer which is reserved for solutions) or comment the relevant answer.

Thank you Mr. ajlittoz! I have rephrased the question based on your suggestion. So kindly accept my admiration for helping me rephrase my query.

Thank you Mr. ajlittoz! I have rephrased the question-body based on your inputs so kindly provided in Edit 2. Thank you for staying with me.

One query, Mr. @ajlittoz: would it be possible in LibreOffice to have an add-on that creates a trimmed temporary text file with all the formatting stripped, creates a word list, lets me manually eliminate all unwanted words, and turns the result into a concordance file? It might then also be possible to include phrases in the concordance file.

To create an unformatted copy of your document File>Save a Copy and select Text (.txt) from the filter menu.

Creation of the word list can be done with the command I gave. Make it a script for easy invocation.

You can put anything in a concordance file: words, groups of words, sentences, … Look at the built-in help for more information on the format.
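For illustration, here is what a small concordance file might look like. Per the built-in help, each line holds semicolon-separated fields (search term, alternative entry, 1st key, 2nd key, and match-case / word-only flags), and lines starting with # are comments; please verify the exact field layout against the help, as the entries below are invented examples:

```
# Search term;Alternative entry;1st key;2nd key;Match case;Word only
Boston;Boston;Cities;;0;0
index;indexes;Documents;Structure;0;1
```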

Apologies for coming back so late to the post. Covid-19 created a lot of uncertainties for many. But I just wanted to thank you once more for your kind support.
[Please consider the following deleted, as the script works quite well: Since you are so competent in Python, could we plan an anti-concordance list in another, longer way? Could we use this methodology:
A script that creates a file recording the words encountered in the document (simple, actually: the list could be created by replacing spaces, commas, colons, semicolons, full stops, etc., with line breaks).
Then run another program to create a file with the number of times each word occurred in the above file.
Then make a customised concordance list from those word lists?]

I am not competent at all in Python.

I thought about a shell script similar to the one I gave in another comment. This command is valid under Linux for a variety of shells.

Thank you for replying so quickly. I tested the script and it works quite well for any text file. Surely it is quite competent at finding words and their frequency of occurrence, and helps form a customised concordance list. So I think I should correct my earlier post.

Dear Mr. ajlittoz,
Because of your continued support and the ease of using python, I was able to edit the python script to make it write the word index and frequency on to a text file.
The script, named file.py, is as follows:

print('python test.py')
print('index file within the program')
import sys
import re

frequency = {}
# Read the document to be indexed and lower-case it
document_text = open('ToBeEditedFile.txt', 'r')
text_string = document_text.read().lower()
document_text.close()
# Match words of 3 to 15 ASCII letters
match_pattern = re.findall(r'\b[a-z]{3,15}\b', text_string)

for word in match_pattern:
    count = frequency.get(word, 0)
    frequency[word] = count + 1

# Redirect standard output to the list file
orig_stdout = sys.stdout
f = open('List.txt', 'a')
sys.stdout = f

for word in frequency:
    print(word, frequency[word])

sys.stdout = orig_stdout
f.close()

I hope that someone with special programming capabilities will visit this thread and improve the script, adding this functionality and other related tools to LibreOffice Writer.