Question edited following advice from @ajlittoz.
I have already searched the extension manager but did not come across such an extension. Maybe I missed it. Could I please be guided?
I would like a plug-in / extension that would create a thorough index of the words in a document, except the weak words (including weak verbs):
a, an, am, is, the, this, that, these,
those, are, was, were, be, being,
been;
verbs (a list from the common dictionary)
and prepositions:
about, above, across, after, against,
among, around, at, before, behind,
below, beside, between, by, down,
during, for, from, in, inside, into,
near, of, off, on, out, over, through,
to, toward, under, up, with, aboard,
along, amid, as, beneath, beyond, but,
concerning, considering, despite,
except, following, like, minus, next,
onto, opposite, outside, past, per,
plus, regarding, round, save, since,
than, till, underneath, unlike, until,
upon, versus, via, within, without,
etc.,
together with the places where they occur (page-wise, paragraph-wise, and so on). It should also index user-defined phrases.
The package information for swish-e says:
SWISH-Enhanced is a fast, powerful,
flexible, and easy to use system for
indexing collections of HTML Web
pages, or any XML or text files like
Open Office Documents, Open Document
files, emails, and so on.
Following advice from @ajlittoz, it now appears that the question should be re-phrased to invite an answer that is required for the ultimate objective. Thank you, Mr. @ajlittoz!
Following the 1st edit by Mr. @ajlittoz, who has been very patient in listening to my requirements and has helped me rephrase and clarify them further (for which he is due much admiration), quoted below:
index all words except a list of
“noise” words, there is no automatic
solution.
I would have to edit my question a little bit more:
Coming back to the edited question, my requirement is that I should be able to have a separate list of words to be included (sic, excluded) for indexing.
My error was an inadvertent typo that I had overlooked, Mr. @ajlittoz, and I sincerely apologise for it. I have already given a list of words to be excluded, which should explain my needs. That such a possibility is not yet a reality should inspire our wonderful community of programmers and team leaders, and perhaps hint at a direction for them to work towards a solution, if it is considered a good tool for the future.
An Exclusion List should be an option to create a better secondary indexing tool. The primary indexing tool is just fine.
Thank you Mr. @ajlittoz, for your 2nd edit:
… you can prepare a concordance file
…
Unfortunately, Mr. @ajlittoz, what I need instead is an anti-concordance file (it could be called an exclusion or discordance file), not a concordance file. I apologise that I was not able to make myself clearer. Please forgive my limitations in explaining my needs more clearly.
The main issue remains: to have an index of all words used, except the weak verbs, prepositions, conjunctions, and the like.
I am thinking aloud: isn’t there any package that could take all the words (phrases) used in one file and list them in another file, along with the number of times those words (phrases) occur?
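To make the requirement concrete, here is a minimal Python sketch of such an index, working on plain text exported from the document. It is only an illustration of the idea, not an existing extension: the name build_index, the short EXCLUDED set, and the assumption that paragraphs are separated by blank lines are all mine. Page numbers are not available in plain text, so only paragraph numbers are recorded.

```python
import re
from collections import defaultdict

# Hypothetical exclusion list; in practice it would hold the full list of weak words.
EXCLUDED = {"a", "an", "the", "is", "of", "in", "on", "to"}

def build_index(text, excluded=EXCLUDED):
    """Map each non-excluded word to the sorted paragraph numbers where it occurs."""
    index = defaultdict(set)
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    for number, paragraph in enumerate(paragraphs, start=1):
        for word in re.findall(r"[a-z]+", paragraph.lower()):
            if word not in excluded:
                index[word].add(number)
    # Return a plain dict with the words in alphabetical order.
    return {word: sorted(numbers) for word, numbers in sorted(index.items())}

sample = "The cat sat on the mat.\n\nA dog chased the cat."
for word, paragraphs in build_index(sample).items():
    print(word, paragraphs)
```

With the sample text, "cat" is reported for paragraphs 1 and 2, while "the", "a" and "on" do not appear in the index at all.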
[Added after your comment]
Yes, I understood the single-line code, and thank you for such a quick reply. There is also Python code for counting word frequencies, in "Counting Word Frequency in a File Using Python", with the code (adjusted here for Python 3):
import re

# Count how often each word of 3 to 15 letters occurs in test.txt
frequency = {}
with open('test.txt', 'r') as document_text:
    text_string = document_text.read().lower()
match_pattern = re.findall(r'\b[a-z]{3,15}\b', text_string)
for word in match_pattern:
    count = frequency.get(word, 0)
    frequency[word] = count + 1
# Print each word followed by its count
for word in frequency:
    print(word, frequency[word])
So these frequency lines could also be added to your suggested code line. I could create a script file to do so.
But we also have to eliminate the unwanted words. Let us take some time to think this over.
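As a first attempt at that elimination, the frequency count could skip every word found in an exclusion list. This is a rough sketch under my own assumptions: the function name count_words is hypothetical, and the exclusion list is shown as an inline string standing in for a text file with one word per line.

```python
import re
from collections import Counter

def count_words(text, excluded):
    """Count words of 3 to 15 letters, skipping anything in the exclusion set."""
    words = re.findall(r"\b[a-z]{3,15}\b", text.lower())
    return Counter(word for word in words if word not in excluded)

# Hypothetical exclusion list, one entry per line as it might appear in a text file.
exclusion_text = "the\nthis\nthat\nabout\nwith\n"
excluded = set(exclusion_text.split())

sample = "This index lists the words that matter, with the index counting words."
# most_common() yields the words from the most frequent to the least frequent.
for word, count in count_words(sample, excluded).most_common():
    print(word, count)
```

Because the pattern only matches words of three letters or more, one- and two-letter words such as "a", "an", "is" or "of" are dropped even without being listed; the exclusion list only needs the longer unwanted words.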
I end with a big Thank You, Mr. @ajlittoz, for your staying with me throughout this query and helping me with inputs. Best wishes.