mirror of
https://github.com/Ponce/slackbuilds
synced 2024-11-25 10:03:03 +01:00
d4c3076cb0
Signed-off-by: Willy Sudiarto Raharjo <willysr@slackbuilds.org>
380 lines
12 KiB
Text
380 lines
12 KiB
Text
|
||
Format for StarDict dictionary files
|
||
------------------------------------
|
||
|
||
StarDict homepage: http://stardict.sourceforge.net
|
||
|
||
{0}. Number and Byte-order Conventions
|
||
When you record the numbers that identify sizes, offsets, etc., you
|
||
should use 32-bit numbers, such as you might represent with a glong.
|
||
|
||
In order to make StarDict work on different platforms, these numbers
|
||
must be in network byte order. You can ensure the correct byte order
|
||
by using the g_htonl() function when creating dictionary files.
|
||
Conversely, you should use g_ntohl() when reading dictionary files.
|
||
|
||
Strings should be encoded in UTF-8.
|
||
|
||
|
||
|
||
{1}. Files
|
||
|
||
Every dictionary consists of three files:
|
||
|
||
(1). somedict.ifo
|
||
(2). somedict.idx or somedict.idx.gz
|
||
(3). somedict.dict or somedict.dict.dz
|
||
|
||
You can use gzip -9 to compress the .idx file. If the .idx file are not
|
||
compressed, the loading can be fast and save memory when using, compress it
|
||
will make the .idx file load into memory and make the quering fast when using.
|
||
|
||
You can use dictzip to compress the .dict file.
|
||
"dictzip" uses the same compression algorithm and file format as does gzip,
|
||
but provides a table that can be used to randomly access compressed blocks
|
||
in the file. The use of 50-64kB blocks for compression typically degrades
|
||
compression by less than 10%, while maintaining acceptable random access
|
||
capabilities for all data in the file. As an added benefit, files
|
||
compressed with dictzip can be decompressed with gunzip.
|
||
For more information about dictzip, refer to DICT project, please see:
|
||
http://www.dict.org
|
||
|
||
Stardict will search for the .ifo file, then open the .idx or
|
||
.idx.gz file and the .dict.dz or .dict file which is in the same directory and
|
||
has the same base name.
|
||
|
||
|
||
|
||
{2}. The ".ifo" file's format.
|
||
|
||
The .ifo file has the following format:
|
||
|
||
StarDict's dict ifo file
|
||
version=2.4.2
|
||
[options]
|
||
|
||
Note that the current "version" string must be "2.4.2". If it's not,
|
||
then StarDict will refuse to read the file.
|
||
|
||
[options]
|
||
---------
|
||
|
||
In the example above, [options] expands to any of the following lines
|
||
specifying information about the dictionary. Each option is a keyword
|
||
followed by an equal sign, then the value of that option, then a
|
||
newline. The options may be appear in any order.
|
||
|
||
Note that the dictionary must have at least a bookname, a wordcount and a
|
||
idxfilesize, or the load will fail. All other information is optional. All
|
||
strings should be encoded in UTF-8.
|
||
|
||
Available options:
|
||
|
||
bookname= // required
|
||
wordcount= // required
|
||
idxfilesize= // required
|
||
author=
|
||
email=
|
||
website=
|
||
description=
|
||
date=
|
||
sametypesequence= // very important.
|
||
|
||
|
||
wordcount is the count of word entries in .idx file, it must be right.
|
||
|
||
idxfilesize is the size(in bytes) of the .idx file, even the .idx is compressed
|
||
to a .idx.gz file, this entry must record the original .idx file's size, and it
|
||
must be right too. The .gz file don't contain its original size information,
|
||
but knowing the original size can speed up the extraction to memory, as you
|
||
don't need to call realloc() for many times.
|
||
|
||
|
||
The "sametypesequence" option is described in further detail below.
|
||
|
||
***
|
||
|
||
sametypesequence
|
||
|
||
You should first familiarize yourself with the .dict file format
|
||
described in the next section so that you can understand what effect
|
||
this option has on the .dict file.
|
||
|
||
If the sametypesequence option is set, it tells StarDict that each
|
||
word's data in the .dict file will have the same sequence of datatypes.
|
||
In this case, we expect a .dict file that's been optimized in two
|
||
ways: the type identifiers should be omitted, and the size marker for
|
||
the last data entry of each word should be omitted.
|
||
|
||
Let's consider some concrete examples of the sametypesequence option.
|
||
|
||
Suppose that a dictionary records many .wav files, and so sets:
|
||
sametypesequence=W
|
||
In this case, each word's entry in the .dict file consists solely of a
|
||
wav file. In the .dict file, you would leave out the 'W' character
|
||
before each entry, and you would also omit the 32-bit integer at the
|
||
front of each .wav entry that would normally give the entry's length.
|
||
You can do this since the length is known from the information in the
|
||
idx file.
|
||
|
||
As another example, suppose a dictionary contains phonetic information
|
||
and a meaning for each word. The sametypesequence option for this
|
||
dictionary would be:
|
||
sametypesequence=tm
|
||
Once again, you can omit the 't' and 'm' characters before each data
|
||
entry in the .dict file. In addition, you should omit the terminating
|
||
'\0' for the 'm' entry for each word in the .dict file, as the length
|
||
of the meaning string can be inferred from the length of the phonetic
|
||
string (still indicated by a terminating '\0') and the length of the
|
||
entire word entry (listed in the .idx file).
|
||
|
||
So for cases where the last data entry for each word normally requires
|
||
a terminating '\0' character, you should omit this character in the
|
||
dict file. And for cases where the last data entry for each word
|
||
normally requires an initial 32-bit number giving the length of the
|
||
field (such as WAV and PNG entries), you must omit this number in the
|
||
dictionary.
|
||
|
||
Every dictionary should try to use the sametypesequence feature to
|
||
save disk space.
|
||
|
||
***
|
||
|
||
|
||
|
||
{3}. The ".idx" file's format.
|
||
|
||
The .idx file is just a word list.
|
||
|
||
The word list is a sorted list of word entries.
|
||
|
||
Each entry in the word list contains three fields, one after the other:
|
||
|
||
word_str; // a utf-8 string terminated by '\0'.
|
||
word_data_offset; // word data's offset in .dict file
|
||
word_data_size; // word data's total size in .dict file
|
||
|
||
word_str gives the string representing this word. It's the string
|
||
that is "looked up" by the StarDict.
|
||
|
||
word_data_offset and word_data_size should both be 32-bit numbers in
|
||
network byte order.
|
||
|
||
No two entries should have the same "word_str". In other words,
|
||
(strcmp(s1, s2) != 0).
|
||
|
||
The length of "word_str" should be less than 256. In other words,
|
||
(strlen(word) < 256).
|
||
|
||
The word list must be sorted by calling stardict_strcmp() on the "word_str"
|
||
fields. If the word list order is wrong, StarDict will fail to function
|
||
correctly!
|
||
|
||
============
|
||
gint stardict_strcmp(const gchar *s1, const gchar *s2)
|
||
{
|
||
gint a;
|
||
a = g_ascii_strcasecmp(s1, s2);
|
||
if (a == 0)
|
||
return strcmp(s1, s2);
|
||
else
|
||
return a;
|
||
}
|
||
============
|
||
|
||
g_ascii_strcasecmp() is a glib function:
|
||
|
||
Unlike the BSD strcasecmp() function, this only recognizes standard
|
||
ASCII letters and ignores the locale, treating all non-ASCII characters
|
||
as if they are not letters.
|
||
|
||
stardict_strcmp() works fine with English characters, but the other
|
||
locale characters' sorting is not so good. There should be a _strcmp
|
||
function which handles the utf-8 string sorting better. If you know
|
||
one, email me :)
|
||
|
||
g_utf8_collate()? This is a locale-dependent funcition. So if you look
|
||
up Chinese characters while in the Chinese locale, it works fine. But
|
||
if you are in some other locale then the lookup will fail, as the
|
||
order is not the same as in the Chinese locale (which was used when
|
||
creating the dictionary).
|
||
|
||
g_utf8_to_ucs4() then do comparing? This sounds like a good solution, but..
|
||
|
||
The complete solution can be found in "Unicode Technical Standard #10: Unicode
|
||
Collation Algorithm", http://www.unicode.org/reports/tr10/
|
||
|
||
I hope glib will provide a locale-independent g_utf8_collate() soon.
|
||
http://bugzilla.gnome.org/show_bug.cgi?id=112798
|
||
|
||
|
||
|
||
{4}. The ".dict" file's format.
|
||
|
||
The .dict file is a pure data sequence, as the offset and size of each
|
||
word is recorded in the corresponding .idx file.
|
||
|
||
If the "sametypesequence" option is not used in the .ifo file, then
|
||
the .dict file has fields in the following order:
|
||
|
||
==============
|
||
word_1_data_1_type; // a single char identifying the data type
|
||
word_1_data_1_data; // the data
|
||
word_1_data_2_type;
|
||
word_1_data_2_data;
|
||
...... // the number of data entries for each word is determined by
|
||
// word_data_size in .idx file
|
||
word_2_data_1_type;
|
||
word_2_data_1_data;
|
||
......
|
||
==============
|
||
|
||
It's important to note that each field in each word indicates its
|
||
own length, as described below. The number of possible fields per
|
||
word is also not fixed, and is determined by simply reading data until
|
||
you've read word_data_size bytes for that word.
|
||
|
||
|
||
Suppose the "sametypesequence" option is used in the .idx file, and
|
||
the option is set like this:
|
||
|
||
sametypesequence=tm
|
||
|
||
Then the .dict file will look like this:
|
||
|
||
==============
|
||
word_1_data_1_data
|
||
word_1_data_2_data
|
||
word_2_data_1_data
|
||
word_2_data_2_data
|
||
......
|
||
==============
|
||
|
||
The first data entry for each word will have a terminating '\0', but
|
||
the second entry will not have a terminating '\0'. The omissions of
|
||
the type chars and of the last field's size information are the
|
||
optimizations required by the "sametypesequence" option described
|
||
above.
|
||
|
||
|
||
Type identifiers
|
||
----------------
|
||
|
||
Here are the single-character type identifiers that may be used with
|
||
the "sametypesequence" option in the .idx file, or may appear in the
|
||
dict file itself if the "sametypesequence" option is not used.
|
||
|
||
Lower-case characters signify that a field's size is determined by a
|
||
terminating '\0', while upper-case characters indicate that the data
|
||
begins with a 32-bit integer that gives the length of the data field.
|
||
|
||
'm'
|
||
Word's pure text meaning.
|
||
The data should be a utf-8 string ending with '\0'.
|
||
|
||
'l'
|
||
Word's pure text meaning.
|
||
The data is NOT a utf-8 string, but is instead a string in locale
|
||
encoding, ending with '\0'. Sometimes using this type will save disk
|
||
space, but its use is discouraged.
|
||
|
||
'g'
|
||
A utf-8 string which is marked up with the Pango text markup language.
|
||
For more information about this markup language, See the "Pango
|
||
Reference Manual."
|
||
You might have it installed locally at:
|
||
file:///usr/share/gtk-doc/html/pango/PangoMarkupFormat.html
|
||
|
||
't'
|
||
English phonetic string.
|
||
The data should be a utf-8 string ending with '\0'.
|
||
|
||
Here are some utf-8 phonetic characters:
|
||
θʃŋʧðʒæıʌʊɒɛəɑɜɔˌˈːˑ
|
||
æɑɒʌәєŋvθðʃʒːɡˏˊˋ
|
||
|
||
'y'
|
||
Chinese YinBiao.
|
||
The data should be a utf-8 string ending with '\0'.
|
||
|
||
|
||
'W'
|
||
wav file.
|
||
The data begins with a network byte-ordered glong to identify the wav
|
||
file's size, immediately followed by the file's content.
|
||
|
||
'P'
|
||
png file.
|
||
The data begins with a network byte-ordered glong to identify the png
|
||
file's size, immediately followed by the file's content.
|
||
|
||
'X'
|
||
this type identifier is reserved for experimental extensions.
|
||
|
||
|
||
|
||
{5}. Tree Dictionary
|
||
|
||
The tree dictionary support is used for information viewing, etc.
|
||
|
||
A tree dictionary contains three file: sometreedict.ifo, sometreedict.tdx.gz
|
||
and sometreedict.dict.dz.
|
||
|
||
It is better to compress the .tdx file, as it is always load into memory.
|
||
|
||
The .ifo file has the following format:
|
||
|
||
StarDict's treedict ifo file
|
||
version=2.4.2
|
||
[options]
|
||
|
||
Available options:
|
||
|
||
bookname= // required
|
||
tdxfilesize= // required
|
||
wordcount=
|
||
author=
|
||
email=
|
||
website=
|
||
description=
|
||
date=
|
||
sametypesequence=
|
||
|
||
wordcount is only used for info view in the dict manage dialog, so it is not
|
||
important in tree dictionary.
|
||
|
||
The .tdx file is just the word list.
|
||
|
||
-----------
|
||
|
||
The word list is a tree list of word entries.
|
||
|
||
Each entry in the word list contains four fields, one after the other:
|
||
word_str; // a utf-8 string terminated by '\0'.
|
||
word_data_offset; // word data's offset in .dict file
|
||
word_data_size; // word data's total size in .dict file. it can be 0.
|
||
word_subentry_count; //have many sub word this entry has, 0 means none.
|
||
|
||
Subentry is immidiately followed by its parent entry. This make the order is
|
||
just as when a tree list with all its nodes extended, then sort from top to
|
||
bottom.
|
||
|
||
The .dict file's format is the same as the normal dictionary.
|
||
|
||
|
||
|
||
{6}. More information.
|
||
|
||
You can read "src/lib.cpp", "src/dictmanagedlg.cpp" and
|
||
"src/tools/*.cpp" for more information.
|
||
|
||
If you have any questions, email me. :)
|
||
|
||
Thanks to Will Robinson <wsr23@stanford.edu> for cleaning up this file's
|
||
English.
|
||
|
||
Hu Zheng <huzheng_001@163.com>
|
||
http://forlinux.yeah.net
|
||
|
||
2003.11.11
|
||
|