Friday, September 28, 2007

The progress of CJK functions

Recently, at OOoCon 2007, I gave a talk about the CJK (Chinese, Japanese and Korean) functions that I have been working on. I think it is worth sharing the progress with everyone. If you have any pressing issues with CJK functions, please let me know.

Here are the main CJK functions.

CJK functions that have been done:

  • Text grid enhancement

  • CJK font relevant stuff

  • MS Word compatibility options enhancement

CJK functions that I am working on:

  • Character/Line measurement unit and ruler

  • Paragraph style default settings for CJK

CJK functions that I will work on:

  • Punctuation compress

  • Bullets and numbering enhancement

OK, I will summarize each CJK function so that you can learn more about it. :-)

Text grid enhancement (i76247)

One of the main CJK functions is the text grid, which is widely used in CJK environments. For example, Chinese government documents must use a text grid with 22 lines per page and 28 characters per line.

However, there are two types of paper mode for the text grid. One is the "squared" paper mode, which OOo currently supports; the other is the "Standard" paper mode, which most CJK versions of office suites (including MS Word) support.

What is the difference between these two types of paper mode?

In "squared" paper mode, as shown in Figure 1, the page is divided into a fixed number of lines, and each line is divided into square cells. The number of lines per page depends on the line height (i.e., the sum of the grid base and ruby height), and the number of Asian characters per line also depends on the line height.


Figure 1. Squared paper mode

In this type of paper mode, if we change the "Lines per page" setting in the "Text Grid" tab page, the type area of the page changes. This type of paper mode is used only in limited cases. Most CJK users are accustomed to the "Standard" paper mode.

Figure 2 illustrates the "Standard" paper mode. As we can see, the number of lines per page depends on the base text size, while the number of characters per line depends on the character width. Ruby text is no longer available. Moreover, changing the lines per page does not change the type area.


Figure 2. Standard paper mode
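
To make the difference concrete, here is a minimal Python sketch of one way to read the two modes. The function names and the point values used with them are hypothetical and chosen only for illustration; this is not OOo's layout code.

```python
def squared_type_area_height(lines_per_page, base_pt, ruby_pt):
    """Squared mode (as read here): every line is base + ruby high,
    so the type area height follows from the number of lines."""
    return lines_per_page * (base_pt + ruby_pt)


def standard_line_pitch(type_area_height_pt, lines_per_page):
    """Standard mode (as read here): the type area stays fixed,
    so only the line pitch follows from the number of lines."""
    return type_area_height_pt / lines_per_page
```

With an assumed 16pt grid base and 8pt ruby, going from 22 to 20 lines in squared mode shrinks the type area from 528pt to 480pt. In standard mode the same change leaves an assumed 660pt type area untouched and only widens the line pitch from 30pt to 33pt.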

Both types of paper mode are now supported in ooo-build. To ensure that only one type of paper mode is used throughout a document, a global setting option is provided in the Writer tab page (Tools -> Options -> OOo Writer -> General), as shown in Figure 3. The code has also been submitted upstream and is currently under QA.


Figure 3. Global setting option for text grid

CJK font relevant stuff

There are several CJK font issues in OOo.

Asian font list box (i73003)

Currently, the Asian font list box in the character property dialog lists all available fonts, even those that are not Asian fonts. It would be better to list only the available CJK fonts.


Figure 4. Asian font list box

CJK mess font (i73003)

The issue in current OOo is that the selected Western font is applied to CJK text even if that font does not support CJK languages. So I extended the use of fontconfig to check whether the selected font supports CJK languages before applying it to CJK text.


Figure 5. Mess font
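
The check described above can be sketched in Python as follows. This is only an illustration of the idea, not the actual patch: the font names and the language table are made up, and in a real build the language set for each font would come from fontconfig rather than a hand-written dictionary. The lowercase language tags are also an assumption.

```python
# Language tags treated as "CJK" in this sketch (assumed set).
CJK_LANGS = {"zh-cn", "zh-tw", "zh-hk", "ja", "ko"}

# Hypothetical table: font family -> languages the font covers.
FONT_LANGS = {
    "Example Sans": {"en", "fr", "de"},
    "Example Hei": {"en", "zh-cn", "zh-tw", "ja", "ko"},
}


def supports_cjk(font_name, table=FONT_LANGS):
    """True if the font covers at least one CJK language."""
    return bool(table.get(font_name, set()) & CJK_LANGS)


def font_for_cjk_text(selected, current_cjk_font, table=FONT_LANGS):
    """Apply the selected font to CJK text only if it supports CJK;
    otherwise keep the font the CJK text already uses."""
    return selected if supports_cjk(selected, table) else current_cjk_font
```

With this table, selecting "Example Sans" leaves the CJK text in its current font, while selecting "Example Hei" is applied as requested.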


Chinese font size (i54603)

China has its own units for measuring font size, such as "五号". There is, of course, a conversion map between the Chinese font size units and the Western ones.



Figure 6. Chinese font size
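
As an illustration of such a conversion map, here is a small Python sketch. The point values are the ones commonly used by word processors for these size names; they are not taken from the patch itself, so treat the table as an assumption to verify. The helper names are hypothetical.

```python
# Chinese size name -> size in points (commonly used values, assumed here).
CHINESE_FONT_SIZES_PT = {
    "初号": 42.0, "小初": 36.0,
    "一号": 26.0, "小一": 24.0,
    "二号": 22.0, "小二": 18.0,
    "三号": 16.0, "小三": 15.0,
    "四号": 14.0, "小四": 12.0,
    "五号": 10.5, "小五": 9.0,
    "六号": 7.5, "小六": 6.5,
    "七号": 5.5, "八号": 5.0,
}

# Reverse lookup: points -> Chinese size name.
POINTS_TO_NAME = {pt: name for name, pt in CHINESE_FONT_SIZES_PT.items()}


def to_points(name):
    """Convert a Chinese size name to points; raises KeyError if unknown."""
    return CHINESE_FONT_SIZES_PT[name]


def to_chinese_name(points):
    """Return the Chinese size name for an exact point size, else None."""
    return POINTS_TO_NAME.get(float(points))
```

A font size box could show "五号" to the user while storing 10.5pt internally, and fall back to the plain point value when no Chinese name matches.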

Font substitute (i54603)

The use of the fontconfig library is extended to find a more suitable font when the desired font is missing. The patch was initiated by Caolan; we improved its support for Chinese.

Microsoft Word compatibility options enhancement (i78591)

As we know, in MS Word, there are quite a lot of compatibility options which are used to control layout for different versions.

Currently, OOo supports only a few of these compatibility options. Most of them are not handled in the WW8 filter, which may cause layout differences in a .DOC -> OOo -> .DOC round trip.

Of course, this is a common issue not just for CJK.

Actually, it is not necessary for OOo to handle all the compatibility options. But to improve interoperability with MS Word, one alternative is to store the unhandled compatibility options in the document model when importing a .DOC document, and write them back out when exporting to .DOC again.
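
The round-trip idea can be sketched in Python like this. It is only a toy model: the option names, the `HANDLED` set, and the dictionary-based "document model" are all hypothetical and do not reflect the real WW8 filter.

```python
# Options the application actually interprets (hypothetical names).
HANDLED = {"use_printer_metrics", "add_spacing_between_paragraphs"}


def import_doc(raw_options):
    """Split the options read from a .DOC file into those we handle
    and those we only carry along unchanged."""
    handled = {k: v for k, v in raw_options.items() if k in HANDLED}
    preserved = {k: v for k, v in raw_options.items() if k not in HANDLED}
    return {"handled": handled, "preserved": preserved}


def export_doc(model):
    """Write back both the handled options and the preserved ones."""
    out = dict(model["preserved"])
    out.update(model["handled"])
    return out
```

The point of the design is that an option the application does not understand survives the import and export untouched, so MS Word sees the same settings it wrote.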

Character/Line measurement unit and ruler (i72655)

The character/line measurement unit and ruler is another important CJK function that OOo does not support yet.

As shown in Figure 7 from MS Word 2003, the character unit is used to measure paragraph indentation, and the line unit is used to measure paragraph spacing.


Figure 7. Character/line measurement unit

Also, as shown in Figure 8 from MS Word 2003, the horizontal ruler can be measured in characters, and the vertical ruler can be measured in lines. The character/line ruler is typically used in documents with a text grid.


Figure 8. Character/line ruler
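
A minimal Python sketch of what these units mean in practice is shown below. It assumes that one "character" equals the character pitch and one "line" equals the line pitch of the text grid; the function names and sample values are hypothetical.

```python
def indent_in_points(n_chars, char_pitch_pt):
    """Indentation given in characters, converted to points."""
    return n_chars * char_pitch_pt


def spacing_in_points(n_lines, line_pitch_pt):
    """Paragraph spacing given in lines, converted to points."""
    return n_lines * line_pitch_pt
```

For example, an indent of 2 characters at a 10.5pt pitch is 21pt, and a spacing of 0.5 lines at a 16pt line pitch is 8pt. The benefit of the unit is that both values follow automatically when the font size or grid changes.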

Paragraph style default settings for CJK (i54320)

Currently, the paragraph style default settings are designed for Western users, and some default values of the paragraph properties do not match the habits of CJK users.

For example, the default size for Western fonts is 12pt, while it is 10.5pt for CJK. The current default tab spacing is 1.25cm, which is too large for Chinese writing; about 0.74cm would be better.
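
The 0.74cm value happens to match the width of two 10.5pt characters, although the post does not say that this is how it was derived. The short Python check below shows the arithmetic; the function name is hypothetical.

```python
PT_PER_INCH = 72.0
CM_PER_INCH = 2.54


def chars_to_cm(n_chars, font_size_pt):
    """Width of n full-width characters of the given size, in cm."""
    return n_chars * font_size_pt / PT_PER_INCH * CM_PER_INCH
```

Two characters at 10.5pt are 21pt, which is 21 / 72 inch, or roughly 0.74cm.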

As shown in Figure 9, the current default settings for Asian typography are not suitable for CJK users either.


Figure 9. Asian typography

Another example is the default value of the automatic text indent, which specifies the space left at the start of the first line of a paragraph. The current default value is one character, while two characters would be better for Chinese writing, as shown in the figure.

Punctuation compress

As shown in Figure 10, if two Chinese punctuation marks are adjacent, the first one should be compressed to occupy only half the width of a character instead of the full width.


Figure 10. Punctuation compress
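
The rule can be sketched in Python as follows. This is a simplified illustration, not a layout implementation: it assumes every character is one full-width cell, and the punctuation set is a small hand-picked sample.

```python
# A small sample of full-width punctuation (assumed set, not exhaustive).
CJK_PUNCT = set("，。、；：！？（）《》「」『』【】")


def char_widths(text):
    """Return the width of each character in units of one full-width cell.
    A punctuation mark directly followed by another punctuation mark
    is compressed to half width."""
    widths = []
    for i, ch in enumerate(text):
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch in CJK_PUNCT and nxt in CJK_PUNCT:
            widths.append(0.5)
        else:
            widths.append(1.0)
    return widths
```

In the string "（一）、二", only the closing bracket is followed by another punctuation mark, so it is the only character reduced to half width.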

Bullets and numbering (i70031, i69855)

I have heard many CJK users complain that bullets and numbering do not work well.

For example, OOo can automatically recognize only Arabic numerals and Latin letters. Most CJK numbers cannot be recognized automatically.
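
As a rough idea of what recognizing CJK numbers could involve, here is a Python sketch that detects a paragraph starting with a Chinese numeral from 1 to 99 followed by a separator. The pattern, the separators and the function names are assumptions made for illustration; this is not how OOo's autocorrection is implemented.

```python
import re

DIGITS = {"一": 1, "二": 2, "三": 3, "四": 4, "五": 5,
          "六": 6, "七": 7, "八": 8, "九": 9}

# One to three numeral characters followed by a list separator.
NUMBERING = re.compile(r"^([一二三四五六七八九十]{1,3})[、．.]")


def chinese_to_int(numeral):
    """Convert a Chinese numeral in the range 1..99 to an int.
    Raises KeyError for malformed input."""
    if "十" in numeral:
        tens, _, ones = numeral.partition("十")
        return (DIGITS[tens] if tens else 1) * 10 + (DIGITS[ones] if ones else 0)
    return DIGITS[numeral]


def detect_numbering(paragraph):
    """Return the list number a paragraph starts with, or None."""
    match = NUMBERING.match(paragraph)
    if not match:
        return None
    try:
        return chinese_to_int(match.group(1))
    except KeyError:
        return None
```

A paragraph such as "三、背景" would be detected as item 3, after which the next paragraph could be numbered "四、" automatically.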

Finally, some important CJK functions are listed on the following wiki page: http://wiki.services.openoffice.org/wiki/CJK_Group
If you have any ideas or suggestions about CJK functions, you are welcome to discuss them on the wiki.