Format for StarDict dictionary files
------------------------------------

StarDict homepage: http://stardict.sourceforge.net

{0}. Number and Byte-order Conventions
When you record the numbers that identify sizes, offsets, etc., you
should use 32-bit numbers. A glong fits only where a C long is 32 bits
wide; a fixed-width type such as guint32 is the portable choice.

In order to make StarDict work on different platforms, these numbers
must be in network byte order. You can ensure the correct byte order
by using the g_htonl() function when creating dictionary files.
Conversely, you should use g_ntohl() when reading dictionary files.

Strings should be encoded in UTF-8.

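If glib is not at hand, the same conversion can be done with shifts.
The following is only a minimal sketch in plain C. The names put_be32()
and get_be32() are made up for this example and are not part of
StarDict or glib.

```c
#include <stdint.h>

/* Store a 32-bit value in network (big-endian) byte order. */
void put_be32(unsigned char *out, uint32_t value)
{
    out[0] = (unsigned char)(value >> 24);
    out[1] = (unsigned char)(value >> 16);
    out[2] = (unsigned char)(value >> 8);
    out[3] = (unsigned char)value;
}

/* Read a 32-bit value that was stored in network byte order. */
uint32_t get_be32(const unsigned char *in)
{
    return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
           ((uint32_t)in[2] << 8) | (uint32_t)in[3];
}
```

For example, put_be32(buf, 0x01020304) stores the bytes 01 02 03 04 in
that order on any platform, and get_be32(buf) returns the value again.
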
{1}. Files

Every dictionary consists of three files:

(1). somedict.ifo
(2). somedict.idx or somedict.idx.gz
(3). somedict.dict or somedict.dict.dz

You can use gzip -9 to compress the .idx file. An uncompressed .idx
file loads quickly and uses less memory at run time. A compressed one
is loaded into memory in full, which makes queries fast.

You can use dictzip to compress the .dict file.
"dictzip" uses the same compression algorithm and file format as gzip,
but provides a table that can be used to randomly access compressed blocks
in the file. The use of 50-64kB blocks for compression typically degrades
compression by less than 10%, while maintaining acceptable random access
capabilities for all data in the file. As an added benefit, files
compressed with dictzip can be decompressed with gunzip.
For more information about dictzip, refer to the DICT project:
http://www.dict.org

StarDict will search for the .ifo file, then open the .idx or .idx.gz
file and the .dict.dz or .dict file that are in the same directory and
have the same base name.

{2}. The ".ifo" file's format.

The .ifo file has the following format:

StarDict's dict ifo file
version=2.4.2
[options]

Note that the current "version" string must be "2.4.2". If it's not,
then StarDict will refuse to read the file.

[options]
---------

In the example above, [options] expands to any of the following lines
specifying information about the dictionary. Each option is a keyword
followed by an equal sign, then the value of that option, then a
newline. The options may appear in any order.

Note that the dictionary must have at least a bookname, a wordcount and an
idxfilesize, or the load will fail. All other information is optional. All
strings should be encoded in UTF-8.

Available options:

    bookname=          // required
    wordcount=         // required
    idxfilesize=       // required
    author=
    email=
    website=
    description=
    date=
    sametypesequence=  // very important.

wordcount is the number of word entries in the .idx file. It must be correct.

idxfilesize is the size (in bytes) of the .idx file. Even if the .idx file
is compressed to a .idx.gz file, this entry must record the original .idx
file's size, and it must be correct too. The original size is not readily
available while a .gz stream is being decompressed (the gzip trailer stores
it only at the very end of the file, modulo 2^32), but knowing it in advance
speeds up the extraction to memory, because the buffer can be allocated
once instead of being grown with repeated realloc() calls.

The "sametypesequence" option is described in further detail below.

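As an illustration of the keyword=value layout, here is a small sketch
that splits one option line and checks whether a keyword is one of the
three required ones. The function names ifo_split_line() and
ifo_is_required() are hypothetical, and StarDict's own loader may well
work differently.

```c
#include <stddef.h>
#include <string.h>

/* Split one "keyword=value" line from an .ifo file.
 * Returns 1 on success, 0 if the line has no '=' or a part does not fit.
 * A trailing "\n" or "\r\n" is not copied into the value. */
int ifo_split_line(const char *line, char *key, size_t key_size,
                   char *value, size_t value_size)
{
    const char *eq = strchr(line, '=');
    size_t key_len, value_len;

    if (eq == NULL)
        return 0;
    key_len = (size_t)(eq - line);
    value_len = strcspn(eq + 1, "\r\n");
    if (key_len == 0 || key_len >= key_size || value_len >= value_size)
        return 0;
    memcpy(key, line, key_len);
    key[key_len] = '\0';
    memcpy(value, eq + 1, value_len);
    value[value_len] = '\0';
    return 1;
}

/* The three options without which the load fails. */
int ifo_is_required(const char *key)
{
    return strcmp(key, "bookname") == 0 ||
           strcmp(key, "wordcount") == 0 ||
           strcmp(key, "idxfilesize") == 0;
}
```

The first line of the file ("StarDict's dict ifo file") contains no
equal sign, so ifo_split_line() returns 0 for it and a caller has to
handle that line separately.
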
***

sametypesequence

You should first familiarize yourself with the .dict file format
described in section {4} so that you can understand what effect
this option has on the .dict file.

If the sametypesequence option is set, it tells StarDict that each
word's data in the .dict file will have the same sequence of datatypes.
In this case, we expect a .dict file that's been optimized in two
ways: the type identifiers should be omitted, and the size marker for
the last data entry of each word should be omitted.

Let's consider some concrete examples of the sametypesequence option.

Suppose that a dictionary records many .wav files, and so sets:
    sametypesequence=W
In this case, each word's entry in the .dict file consists solely of a
wav file. In the .dict file, you would leave out the 'W' character
before each entry, and you would also omit the 32-bit integer at the
front of each .wav entry that would normally give the entry's length.
You can do this since the length is known from the information in the
.idx file.

As another example, suppose a dictionary contains phonetic information
and a meaning for each word. The sametypesequence option for this
dictionary would be:
    sametypesequence=tm
Once again, you can omit the 't' and 'm' characters before each data
entry in the .dict file. In addition, you should omit the terminating
'\0' for the 'm' entry for each word in the .dict file, as the length
of the meaning string can be inferred from the length of the phonetic
string (still indicated by a terminating '\0') and the length of the
entire word entry (listed in the .idx file).

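The length arithmetic for the sametypesequence=tm case can be sketched
as follows. This is only an illustration: struct tm_entry and
split_tm_entry() are names invented here, and the caller is assumed to
have already read word_data_size bytes starting at word_data_offset.

```c
#include <stddef.h>
#include <string.h>

/* One word's data when sametypesequence=tm. The pointers refer into the
 * caller's buffer. The meaning has no '\0', so always use meaning_len. */
struct tm_entry {
    const char *phonetic;     /* 't' field, '\0'-terminated */
    size_t      phonetic_len; /* length without the '\0' */
    const char *meaning;      /* 'm' field, terminator omitted */
    size_t      meaning_len;
};

/* Returns 1 on success, 0 if no '\0' ends the phonetic string. */
int split_tm_entry(const char *data, size_t size, struct tm_entry *out)
{
    const char *nul = memchr(data, '\0', size);

    if (nul == NULL)
        return 0;
    out->phonetic = data;
    out->phonetic_len = (size_t)(nul - data);
    out->meaning = nul + 1;
    /* Whatever follows the phonetic string and its '\0' is the meaning. */
    out->meaning_len = size - out->phonetic_len - 1;
    return 1;
}
```
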
So for cases where the last data entry for each word normally requires
a terminating '\0' character, you should omit this character in the
.dict file. And for cases where the last data entry for each word
normally requires an initial 32-bit number giving the length of the
field (such as WAV and PNG entries), you must omit this number in the
dictionary.

Every dictionary should try to use the sametypesequence feature to
save disk space.

***

{3}. The ".idx" file's format.

The .idx file is just a word list.

The word list is a sorted list of word entries.

Each entry in the word list contains three fields, one after the other:

    word_str;          // a utf-8 string terminated by '\0'.
    word_data_offset;  // word data's offset in .dict file
    word_data_size;    // word data's total size in .dict file

word_str gives the string representing this word. It's the string
that is "looked up" by StarDict.

word_data_offset and word_data_size should both be 32-bit numbers in
network byte order.

No two entries should have the same "word_str". In other words,
(strcmp(s1, s2) != 0).

The length of "word_str" should be less than 256. In other words,
(strlen(word) < 256).

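To make the entry layout concrete, here is a sketch of how one entry
could be read from an uncompressed .idx file that is already in memory.
struct idx_entry, idx_be32() and idx_parse_entry() are hypothetical
names used only for this example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct idx_entry {
    const char *word;   /* word_str, points into the caller's buffer */
    uint32_t    offset; /* word_data_offset */
    uint32_t    size;   /* word_data_size */
};

/* Decode a 32-bit number stored in network byte order. */
uint32_t idx_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Parse the entry that starts at buf[pos]. Returns the position of the
 * next entry, or 0 if the entry is truncated or word_str is too long. */
size_t idx_parse_entry(const unsigned char *buf, size_t len, size_t pos,
                       struct idx_entry *out)
{
    const unsigned char *nul;
    size_t word_len;

    if (pos >= len)
        return 0;
    nul = memchr(buf + pos, '\0', len - pos);
    if (nul == NULL)
        return 0;
    word_len = (size_t)(nul - (buf + pos));
    /* Enforce strlen(word) < 256 and require the two 32-bit numbers. */
    if (word_len >= 256 || len - pos - word_len - 1 < 8)
        return 0;
    out->word = (const char *)(buf + pos);
    out->offset = idx_be32(nul + 1);
    out->size = idx_be32(nul + 5);
    return pos + word_len + 1 + 8;
}
```

Calling idx_parse_entry() in a loop, starting at position 0 and feeding
each return value back in, visits the entries in file order. The number
of successful calls should then equal wordcount from the .ifo file.
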
The word list must be sorted by calling stardict_strcmp() on the "word_str"
fields. If the word list order is wrong, StarDict will fail to function
correctly!

============
#include <string.h>
#include <glib.h>

gint stardict_strcmp(const gchar *s1, const gchar *s2)
{
    gint a;

    /* First compare while ignoring the case of ASCII letters. */
    a = g_ascii_strcasecmp(s1, s2);
    if (a == 0)
        /* Equal apart from case: fall back to a byte-wise comparison. */
        return strcmp(s1, s2);
    else
        return a;
}
============

g_ascii_strcasecmp() is a glib function:

Unlike the BSD strcasecmp() function, this only recognizes standard
ASCII letters and ignores the locale, treating all non-ASCII characters
as if they are not letters.

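For tools that do not link against glib, the same ordering can be
approximated in plain C. The sketch below is an assumption-laden
stand-in, not StarDict's code: ascii_lower(), ascii_strcasecmp(),
plain_stardict_strcmp() and compare_words() are names invented here,
and ascii_strcasecmp() only imitates the documented behaviour of
g_ascii_strcasecmp() (folding A-Z and nothing else).

```c
#include <stdlib.h>
#include <string.h>

/* Lower-case an ASCII letter; every other byte is returned unchanged. */
int ascii_lower(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') ? c + ('a' - 'A') : c;
}

/* Case-insensitive comparison that folds ASCII letters only. */
int ascii_strcasecmp(const char *s1, const char *s2)
{
    while (*s1 && *s2) {
        int c1 = ascii_lower((unsigned char)*s1);
        int c2 = ascii_lower((unsigned char)*s2);
        if (c1 != c2)
            return c1 - c2;
        s1++;
        s2++;
    }
    return (int)(unsigned char)*s1 - (int)(unsigned char)*s2;
}

/* Same two-step rule as stardict_strcmp(), without glib. */
int plain_stardict_strcmp(const char *s1, const char *s2)
{
    int a = ascii_strcasecmp(s1, s2);
    return a != 0 ? a : strcmp(s1, s2);
}

/* qsort() adapter for an array of char pointers. */
int compare_words(const void *a, const void *b)
{
    return plain_stardict_strcmp(*(const char *const *)a,
                                 *(const char *const *)b);
}
```

With qsort(words, n, sizeof words[0], compare_words), words that differ
only in case end up next to each other, and the byte-wise strcmp() step
decides their relative order.
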
stardict_strcmp() works fine with English characters, but it does not
sort characters from other locales as well. There should be a _strcmp
function which handles utf-8 string sorting better. If you know
one, email me :)

g_utf8_collate()? This is a locale-dependent function. So if you look
up Chinese characters while in the Chinese locale, it works fine. But
if you are in some other locale then the lookup will fail, as the
order is not the same as in the Chinese locale (which was used when
creating the dictionary).

g_utf8_to_ucs4() and then comparing? This sounds like a good solution, but..

The complete solution can be found in "Unicode Technical Standard #10: Unicode
Collation Algorithm", http://www.unicode.org/reports/tr10/

I hope glib will provide a locale-independent g_utf8_collate() soon.
http://bugzilla.gnome.org/show_bug.cgi?id=112798

{4}. The ".dict" file's format.

The .dict file is a pure data sequence, as the offset and size of each
word are recorded in the corresponding .idx file.

If the "sametypesequence" option is not used in the .ifo file, then
the .dict file has fields in the following order:

==============
word_1_data_1_type; // a single char identifying the data type
word_1_data_1_data; // the data
word_1_data_2_type;
word_1_data_2_data;
...... // the number of data entries for each word is determined by
       // word_data_size in .idx file
word_2_data_1_type;
word_2_data_1_data;
......
==============

It's important to note that each field in each word indicates its
own length, as described below. The number of possible fields per
word is also not fixed, and is determined by simply reading data until
you've read word_data_size bytes for that word.

Suppose the "sametypesequence" option is used in the .ifo file, and
the option is set like this:

sametypesequence=tm

Then the .dict file will look like this:

==============
word_1_data_1_data
word_1_data_2_data
word_2_data_1_data
word_2_data_2_data
......
==============

The first data entry for each word will have a terminating '\0', but
the second entry will not have a terminating '\0'. The omissions of
the type chars and of the last field's size information are the
optimizations required by the "sametypesequence" option described
above.

Type identifiers
----------------

Here are the single-character type identifiers that may be used with
the "sametypesequence" option in the .ifo file, or may appear in the
.dict file itself if the "sametypesequence" option is not used.

Lower-case characters signify that a field's size is determined by a
terminating '\0', while upper-case characters indicate that the data
begins with a 32-bit integer that gives the length of the data field.

'm'
Word's pure text meaning.
The data should be a utf-8 string ending with '\0'.

'l'
Word's pure text meaning.
The data is NOT a utf-8 string, but is instead a string in locale
encoding, ending with '\0'. Sometimes using this type will save disk
space, but its use is discouraged.

'g'
A utf-8 string which is marked up with the Pango text markup language.
For more information about this markup language, see the "Pango
Reference Manual."
You might have it installed locally at:
file:///usr/share/gtk-doc/html/pango/PangoMarkupFormat.html

't'
English phonetic string.
The data should be a utf-8 string ending with '\0'.

Here are some utf-8 phonetic characters:
θʃŋʧðʒæıʌʊɒɛəɑɜɔˌˈːˑ
æɑɒʌәєŋvθðʃʒːɡˏˊˋ

'y'
Chinese YinBiao.
The data should be a utf-8 string ending with '\0'.

'W'
wav file.
The data begins with a 32-bit number in network byte order that gives
the wav file's size, immediately followed by the file's content.

'P'
png file.
The data begins with a 32-bit number in network byte order that gives
the png file's size, immediately followed by the file's content.

'X'
This type identifier is reserved for experimental extensions.

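The lower-case/upper-case rule is enough to step through one word's
data when sametypesequence is not used. The sketch below shows one way
to do it. field_fn and walk_fields() are hypothetical names, and the
code assumes that any type char that is not an upper-case ASCII letter
introduces a '\0'-terminated field.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Called once per field: type char, payload pointer, payload length. */
typedef void (*field_fn)(char type, const unsigned char *data, size_t len,
                         void *user);

/* Walk one word's data (word_data_size bytes, type chars present).
 * fn may be NULL. Returns the number of fields, or -1 if malformed. */
int walk_fields(const unsigned char *data, size_t size, field_fn fn,
                void *user)
{
    size_t pos = 0;
    int count = 0;

    while (pos < size) {
        char type = (char)data[pos++];
        size_t len;

        if (type >= 'A' && type <= 'Z') {
            /* Upper case: 32-bit length in network byte order first. */
            uint32_t n;
            if (size - pos < 4)
                return -1;
            n = ((uint32_t)data[pos] << 24) | ((uint32_t)data[pos + 1] << 16) |
                ((uint32_t)data[pos + 2] << 8) | (uint32_t)data[pos + 3];
            pos += 4;
            if (size - pos < n)
                return -1;
            len = n;
            if (fn != NULL)
                fn(type, data + pos, len, user);
            pos += len;
        } else {
            /* Lower case: the data runs up to a terminating '\0'. */
            const unsigned char *nul = memchr(data + pos, '\0', size - pos);
            if (nul == NULL)
                return -1;
            len = (size_t)(nul - (data + pos));
            if (fn != NULL)
                fn(type, data + pos, len, user);
            pos += len + 1;
        }
        count++;
    }
    return count;
}
```

A word whose data is 'm' "hi" '\0' followed by 'W', the length 2 and
two bytes of sound data is reported as two fields. Cutting the buffer
short by one byte makes walk_fields() return -1 instead of reading past
the end.
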
{5}. Tree Dictionary

The tree dictionary support is used for information viewing, etc.

A tree dictionary contains three files: sometreedict.ifo, sometreedict.tdx.gz
and sometreedict.dict.dz.

It is better to compress the .tdx file, as it is always loaded into memory.

The .ifo file has the following format:

StarDict's treedict ifo file
version=2.4.2
[options]

Available options:

    bookname=          // required
    tdxfilesize=       // required
    wordcount=
    author=
    email=
    website=
    description=
    date=
    sametypesequence=

wordcount is only used for the info view in the dict manage dialog, so it
is not important in a tree dictionary.

The .tdx file is just the word list.

-----------

The word list is a tree list of word entries.

Each entry in the word list contains four fields, one after the other:
    word_str;             // a utf-8 string terminated by '\0'.
    word_data_offset;     // word data's offset in .dict file
    word_data_size;       // word data's total size in .dict file. It can be 0.
    word_subentry_count;  // how many subentries this entry has; 0 means none.

Each entry is immediately followed by its subentries. The resulting order
is that of a tree list with all of its nodes expanded, read from top to
bottom.

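Because every entry is followed directly by its subentries, a recursive
reader can rebuild the tree in one pass. The sketch below only counts
entries. Note the assumptions: the text above does not give the widths
of the three numeric fields, so the code assumes that each is a 32-bit
number in network byte order, as in the .idx file. tdx_be32() and
tdx_walk() are names made up for this example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Decode a 32-bit number stored in network byte order. */
uint32_t tdx_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Read the entry at *pos together with all of its descendants.
 * Advances *pos past them and adds the number of entries to *total.
 * Returns 1 on success, 0 if the buffer ends too early. */
int tdx_walk(const unsigned char *buf, size_t len, size_t *pos,
             size_t *total)
{
    const unsigned char *nul;
    uint32_t children, i;

    if (*pos >= len)
        return 0;
    nul = memchr(buf + *pos, '\0', len - *pos);
    /* Need the '\0' plus three assumed 32-bit fields (12 bytes). */
    if (nul == NULL || (size_t)(buf + len - nul) < 1 + 12)
        return 0;
    children = tdx_be32(nul + 9); /* word_subentry_count */
    *pos = (size_t)(nul - buf) + 1 + 12;
    (*total)++;
    for (i = 0; i < children; i++) {
        if (!tdx_walk(buf, len, pos, total))
            return 0;
    }
    return 1;
}
```

The recursion depth equals the depth of the tree, so a reader meant for
untrusted files would probably want a depth limit as well.
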
The .dict file's format is the same as the normal dictionary.

{6}. More information.

You can read "src/lib.cpp", "src/dictmanagedlg.cpp" and
"src/tools/*.cpp" for more information.

If you have any questions, email me. :)

Thanks to Will Robinson <wsr23@stanford.edu> for cleaning up this file's
English.

Hu Zheng <huzheng_001@163.com>
http://forlinux.yeah.net

2003.11.11