I want to create a spellchecker for the Bengali language. How can I do that?

I’m from Bangladesh and a native Bengali speaker. I use LibreOffice to write my official documents. I noticed that there is no spellchecker for the Bengali language, so I decided to create one, but I don’t know how. Can someone help me find out how to create a spellchecker?

There is a dictionary editing tool called Proofing Tool GUI, by Marco Pinto, but I’ve never used it.

Thanks. With this tool I can create the dictionary .dic and .aff files. Any idea how I can pack them into a LibreOffice extension?

You can always take an existing dictionary extension, unpack it (it’s a zip file with a modified file extension) and replace all the files in it. You also need to edit the files description.xml and dictionaries.xcu so that they point to the new files, set the language, etc.
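The unpack, replace and repack steps can be sketched in Python with the standard zipfile module. This is only a rough sketch, not an official packaging tool: the function name repack_extension and the file names in the usage note are made up for illustration, and the edited contents of description.xml and dictionaries.xcu still have to be prepared by hand.

```python
import zipfile


def repack_extension(src_oxt, dst_oxt, replacements, drop=()):
    """Copy an existing .oxt to a new one, swapping in new files.

    replacements: maps a member name inside the archive to its new
                  content as bytes (existing members are overwritten,
                  unknown names are added).
    drop:         member names from the old extension to leave out,
                  e.g. the old .dic/.aff files.
    src_oxt and dst_oxt must be different paths.
    """
    skip = set(drop) | set(replacements)
    with zipfile.ZipFile(src_oxt) as src, \
            zipfile.ZipFile(dst_oxt, "w", zipfile.ZIP_DEFLATED) as dst:
        for name in src.namelist():
            # Skip replaced/dropped members and bare directory entries.
            if name in skip or name.endswith("/"):
                continue
            dst.writestr(name, src.read(name))
        for name, data in replacements.items():
            dst.writestr(name, data)
```

For example, with hypothetical file names, one could call repack_extension("dict-en.oxt", "dict-bn.oxt", {"bn_BD.dic": dic_bytes, "bn_BD.aff": aff_bytes, "dictionaries.xcu": xcu_bytes, "description.xml": desc_bytes}, drop=["en_US.dic", "en_US.aff"]).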

I don’t think there is any simple guidance on creating spellcheck dictionaries for LibreOffice. You might want to consult Hunspell documentation, but it is full of technicalities. So, here are some quick tips:

  1. Dictionaries are packaged as extensions, which are ZIP archives (the .zip suffix is replaced with .oxt for convenience, but this is not strictly required). So you can use any archive program for the task.

  2. Extensions have a certain structure, so, as RGB-es points out, download a couple of existing extensions to understand it.

  3. Dictionary/affix files are plain-text files, so you can use virtually any text editor to edit them. As I understand it, Bengali uses a complex script (the Bengali script, not Devanagari), so the text editor you use must support UTF-8; that is the only requirement. XML syntax highlighting is a plus when you edit the XML files (an extension must include at least three of them).

  4. For a totally new language, I would start from a mere word list, i.e. simply collect all words and their forms in the .dic file and leave the .aff file empty except for one line with the encoding declaration:

    SET UTF-8

If Bengali is not a highly inflected language, that’s enough. Otherwise, you should consider building word paradigms (this will reduce the dictionary size and make it easier to add new words), but that’s for later, once you begin to understand the syntax.
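The word-list approach from tip 4 can be sketched as a small Python script. It relies on the Hunspell convention that the first line of a .dic file holds the (approximate) number of entries, followed by one word per line. The function name and the file names are made up for illustration, and the sketch does no Unicode normalisation, which a real Bengali word list may well need.

```python
from pathlib import Path


def write_wordlist_dictionary(words, dic_path, aff_path):
    """Write a flag-free Hunspell .dic file and a minimal .aff file.

    words: any iterable of strings; blanks and duplicates are dropped.
    Returns the number of entries written.
    """
    unique = sorted({w.strip() for w in words if w.strip()})
    # First line of a .dic file is the entry count, then one word per line.
    lines = [str(len(unique)), *unique]
    Path(dic_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    # The .aff file carries only the encoding declaration for now.
    Path(aff_path).write_text("SET UTF-8\n", encoding="utf-8")
    return len(unique)
```

With a plain word list in a file, one word per line, the call could look like write_wordlist_dictionary(open("words.txt", encoding="utf-8"), "bn_BD.dic", "bn_BD.aff"), where all three file names are placeholders.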

The .dic and .aff files can be created using a plain text editor - I use Emacs. Hunspell is supposed to support ‘morphologically complex languages, such as Hungarian’, but it has trouble with languages like Sanskrit and Pali, and that’s without considering sandhi between words. For compound affixes, I ended up writing a program to merge repetitive application of affixes. Hunspell can handle one prefix plus one suffix on a word, or two suffixes on a word, but fails at three suffixes.
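To illustrate the general idea of merging affixes (this is not the program mentioned above, just a simplified sketch), the Python snippet below collapses a chain of suffix layers so that no more than two levels remain, by concatenating every combination of the trailing layers into single suffixes. It assumes that suffixes simply concatenate, with no stripping or sandhi, which is rarely true for real languages; the function name and the sample suffixes are invented.

```python
from itertools import product


def merge_suffix_layers(layers, max_levels=2):
    """Collapse trailing suffix layers so at most max_levels remain.

    layers: list of lists of suffix strings, innermost layer first.
    Layers from position max_levels - 1 onward are merged into one
    layer containing every concatenated combination.
    """
    if len(layers) <= max_levels:
        return [list(layer) for layer in layers]
    keep = [list(layer) for layer in layers[:max_levels - 1]]
    tail = layers[max_levels - 1:]
    merged = ["".join(parts) for parts in product(*tail)]
    return keep + [merged]


# Three layers become two: the second and third are merged.
print(merge_suffix_layers([["a", "b"], ["c"], ["d", "e"]]))
# prints [['a', 'b'], ['cd', 'ce']]
```

The merged layers could then be written out as ordinary suffix rules, at the cost of a larger affix table.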

As to packaging, if you can sort out Bengali affixes, you can probably do as RGB-es suggests. There is a packaging tool on GitHub, silnrsi/oxttools (tools for creating language support oxt extensions for LibreOffice); however, I couldn’t get it to install.