Friday, September 28, 2007

The progress of CJK functions

At OOoCon 2007 I recently gave a talk about the CJK (Chinese, Japanese and Korean) functions that I have been working on. I think it is worth sharing the progress with everyone. If you have any pressing issue with CJK functions, please let me know.

Here are the main CJK functions.

CJK functions that have been done:

  • Text grid enhancement

  • CJK font related work

  • MS Word compatibility options enhancement

CJK functions that I am working on:

  • Character/Line measurement unit and ruler

  • Paragraph style default settings for CJK

CJK functions that I will work on:

  • Punctuation compression

  • Bullets and numbering enhancement

OK, I will summarize each CJK function so that you can learn more about it. :-)

Text grid enhancement (i76247)

One of the main CJK functions is the text grid, which is widely used in CJK environments. For example, Chinese government documents must use a text grid with 22 lines per page and 28 characters per line.

However, there are two types of paper mode for the text grid. One is the "squared" paper mode, which OOo currently supports; the other is the "standard" paper mode, which most CJK versions of office suites (including MS Word) support.

What is the difference between these two types of paper mode?

For "squared" paper mode, as shown in Figure 1, the page is divided into a fixed number of lines, and each line is divided into square cells. The number of lines per page depends on the line height (i.e., the sum of grid base and ruby height), and the number of Asian characters per line also depends on the line height.


Figure 1. Squared paper mode

In this type of paper mode, if we change the "Lines per page" setting in the "Text Grid" tab page, the type area of the page changes. This type of paper mode is only used in limited cases. Most CJK users are used to the “standard” paper mode.

Figure 2 illustrates the "standard" paper mode. As we can see, the number of lines per page depends on the base text size, while the number of characters per line depends on the character width. Ruby text is no longer available. Moreover, changing the lines per page does not change the type area.


Figure 2. Standard paper mode

Now, both types of paper mode are supported in ooo-build. To ensure that only one type of paper mode is used in the whole document, a global setting option is provided in the Writer tab page (Tools -> Options -> OOo Writer -> General), as shown in Figure 3. The code has also been submitted upstream and is currently under QA.


Figure 3. Global setting option for text grid

CJK font related work

There are some issues with CJK fonts in OOo.

Asian font list box ( i73003 )

Currently, the Asian font list box in the character property dialog of OOo lists all available fonts, even those that are not Asian fonts. It would be better to list only the available CJK fonts.


Figure 4. Asian font list box

CJK mess font ( i73003 )

The issue in current OOo is that the selected Western font is applied to CJK text even if that font doesn't support CJK languages. So I extended the use of fontconfig to check whether the selected font supports CJK before applying it to CJK text.


Figure 5. Mess font


Chinese font size (i54603 )

China has its own units for measuring font size, such as "五号" (10.5pt). Of course, there is a conversion map between the Chinese font size units and the Western ones.
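As a rough illustration of such a conversion map, here is a minimal Python sketch. The point values are the ones conventionally listed in Chinese word processors; they are not taken from the OOo patch, and the helper names are made up for this example.

```python
# Conventional Chinese font size names and their point equivalents.
# Assumption: this is the commonly published table, not the table
# used by the actual OOo patch.
CHINESE_FONT_SIZES_PT = {
    "初号": 42.0, "小初": 36.0,
    "一号": 26.0, "小一": 24.0,
    "二号": 22.0, "小二": 18.0,
    "三号": 16.0, "小三": 15.0,
    "四号": 14.0, "小四": 12.0,
    "五号": 10.5, "小五": 9.0,
    "六号": 7.5, "小六": 6.5,
    "七号": 5.5, "八号": 5.0,
}


def chinese_size_to_pt(name):
    """Return the point size for a Chinese size name (KeyError if unknown)."""
    return CHINESE_FONT_SIZES_PT[name]


def pt_to_chinese_size(pt):
    """Return the Chinese size name for an exact point size, or None."""
    for name, value in CHINESE_FONT_SIZES_PT.items():
        if value == pt:
            return name
    return None


print(chinese_size_to_pt("五号"))
print(pt_to_chinese_size(12.0))
```

A font size box could show the Chinese name when `pt_to_chinese_size` finds a match and fall back to the plain point value otherwise.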



Figure 6. Chinese font size

Font substitution (i54603 )

The use of the fontconfig library is extended to find a more suitable font when the desired font is missing. The patch was initiated by Caolan. We improved its support for Chinese.

Microsoft Word compatibility options enhancement (i78591 )

As we know, MS Word has quite a lot of compatibility options, which are used to control layout for different versions.

Currently, OOo supports only a few of these compatibility options. Most of them are not handled in the WW8 filter, which may cause layout differences in a .DOC -> OOo -> .DOC round trip.

Of course, this is a common issue, not just a CJK one.

Actually, it is not necessary for OOo to handle all the compatibility options. But to improve interoperability with MS Word, one alternative is to store the unhandled compatibility options in the document model when importing a .DOC document, and to write them out again when exporting to .DOC.
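The idea of carrying unhandled options through a round trip can be sketched in a few lines of Python. This is only an illustration of the approach, not the WW8 filter code: the option names, the `DocumentModel` class and both functions are hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical names for the options the layout engine understands.
HANDLED_OPTIONS = {"use_printer_metrics", "add_paragraph_spacing"}


@dataclass
class DocumentModel:
    # Options that influence layout inside the application.
    layout_options: dict = field(default_factory=dict)
    # Options that are not understood but must survive a round trip.
    preserved_options: dict = field(default_factory=dict)


def import_compat_options(raw_options):
    """Split imported options into handled and merely preserved ones."""
    doc = DocumentModel()
    for name, value in raw_options.items():
        if name in HANDLED_OPTIONS:
            doc.layout_options[name] = value
        else:
            doc.preserved_options[name] = value
    return doc


def export_compat_options(doc):
    """Write back both handled and preserved options on export."""
    merged = dict(doc.preserved_options)
    merged.update(doc.layout_options)
    return merged


original = {"use_printer_metrics": True, "some_unknown_option": False}
doc = import_compat_options(original)
print(export_compat_options(doc) == original)
```

The point of the design is that the exporter never has to know what a preserved option means; it only has to hand it back unchanged.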

Character/Line measurement unit and ruler (i72655 )

Character/line measurement units and rulers are another important CJK function that OOo doesn't support yet.

As shown in Figure 7 from MS Word 2003, the character unit is used to measure paragraph indentation, and the line unit is used to measure paragraph spacing.


Figure 7. Character/line measurement unit

Also, as shown in Figure 8 from MS Word 2003, the horizontal ruler can be measured in characters, and the vertical ruler can be measured in lines. A character/line ruler is always used in a document with a text grid.


Figure 8. Character/line ruler

Paragraph style default settings for CJK (i54320 )

Currently, the paragraph style default settings are designed for Western users, and some default values of the paragraph properties don't fit the habits of CJK users.

For example, the default font size for Western text is 12pt, while it is 10.5pt for CJK. The current default tab spacing is 1.25cm, which is too large for Chinese text; about 0.74cm would be better.
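One plausible reading of the 0.74cm figure (this is my assumption, not something stated above) is that it equals two full-width characters at the CJK default size of 10.5pt. A small Python check of that arithmetic, with made-up helper names:

```python
PT_PER_INCH = 72.0
CM_PER_INCH = 2.54


def pt_to_cm(pt):
    """Convert typographic points to centimetres."""
    return pt / PT_PER_INCH * CM_PER_INCH


def chars_to_cm(n_chars, font_size_pt):
    """Width of n full-width characters, assuming each is one em wide."""
    return pt_to_cm(n_chars * font_size_pt)


# Two characters at 10.5pt: 21pt = 21/72 inch.
print(round(chars_to_cm(2, 10.5), 2))
```

The same helpers show why 1.25cm feels too wide: at 10.5pt it is a bit more than three characters, which does not line up with a character grid.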

As shown in Figure 9, the current default settings of Asian typography are not suitable for CJK users either.


Figure 9. Asian typography

Another example is the default value of automatic text indent. The automatic text indent specifies the space left at the start of the first line of a paragraph. The current default value is one character, while two characters is better for Chinese text, as shown in the figure.

Punctuation compression

As shown in Figure 10, if two Chinese punctuation marks are adjacent, the first one should be compressed to occupy only half the width of a character instead of the full width.


Figure 10. Punctuation compression

Bullets and numbering (i70031, i69855)

I have heard many CJK users complain that bullets and numbering do not work well.

For example, OOo can only automatically recognize Arabic numerals and alphabetic letters. Most CJK numerals are not recognized automatically.
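To make the missing feature concrete, here is a small Python sketch that recognizes a leading Chinese numeral such as "三、" as a list number. It is not how OOo's autocorrection works; the function names, the accepted separators and the 1 to 99 range are all assumptions chosen to keep the example short.

```python
import re

DIGITS = {"一": 1, "二": 2, "三": 3, "四": 4, "五": 5,
          "六": 6, "七": 7, "八": 8, "九": 9}

# A run of Chinese numerals followed by a list separator.
PREFIX = re.compile(r"^([一二三四五六七八九十]+)[、.．]\s*")


def parse_cjk_number(text):
    """Parse a Chinese numeral in the range 1..99 (ValueError otherwise)."""
    tens, sep, ones = text.partition("十")
    if not sep:
        if text in DIGITS:
            return DIGITS[text]
        raise ValueError(text)
    if (tens and tens not in DIGITS) or (ones and ones not in DIGITS):
        raise ValueError(text)
    return (DIGITS[tens] if tens else 1) * 10 + (DIGITS[ones] if ones else 0)


def detect_list_prefix(line):
    """Return (number, rest_of_line) if the line starts a numbered item."""
    match = PREFIX.match(line)
    if match is None:
        return None
    try:
        number = parse_cjk_number(match.group(1))
    except ValueError:
        return None
    return number, line[match.end():]


print(detect_list_prefix("三、文字网格"))
print(detect_list_prefix("Hello"))
```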

Finally, some important CJK functions are listed on the following wiki page: http://wiki.services.openoffice.org/wiki/CJK_Group
If you have any good ideas or suggestions about CJK functions, you are welcome to discuss them on the wiki.

Friday, June 29, 2007

UOF Import Filter for OpenOffice.Org

UOF (Uniform Office Format) is an emerging standard that is being developed by the Chinese Office Software Work Group (COSWG), led by the China Electronics Standard Institute (CESI), the Ministry of Information Industry (MII), major suppliers of Chinese office software suites, and other academic institutions.

This week is hackfest week at Novell. I am hacking on a UOF (Chinese Office File Format) import filter for OpenOffice.org. This filter is an external component based on the ODF-UOF Converter.

An extension has been developed so that OpenOffice.Org is able to open UOF text documents. Here are some screenshots.






Although some features such as paragraph styles and tables are already supported, there is still a lot of work to do. I will continue to work on them when I get time.

Monday, May 28, 2007

UOF–OpenXML Translator

I was surprised by the news that Microsoft announced a UOF-OpenXML translator project with China. The goal of the UOF–OpenXML translator:

"As part of Microsoft’s continued commitment to interoperability, Microsoft decided to work with CHINA Electronics Standardization Institute, Beijing Information Technology Institute, one of the co-creators of the UOF Chinese standard , Beihang University of Beijing and with other partners to create a Translator between UOF and Open XML and provide interoperability between the two formats in both directions. Microsoft is funding and providing technical architectural guidance for the development of the translator that will benefit millions of people who live in China."

In a sense, this shows how important China is to Microsoft's future strategy.
It is time to do something about UOF plug-in for OpenOffice.org, isn't it?

The UOF-OpenXML project is available at http://uof-translator.sourceforge.net.
The UOF-ODF project is available at http://odf-to-uof.sourceforge.net.


Tuesday, April 24, 2007

Issues wrt. text grid in Issuezilla

The prototype of the text grid enhancement is available and the patch is under review now. Anybody who is interested in it can check out the patch from the cws cjksp1.

Today, I happened to find that quite a few text grid issues are open in Issuezilla. I will look at them one by one and try to fix them. :)

Some issues wrt. text grid in issuezilla:

i53425: more flexibility in the grid layout
i73011: Chinese layout incorrectly with text grid
i40768: CJK:Register-true not activated for frames when importing; Grid layout
i53464: Word table reformatted shorter on import due to Grid layout
i15251: "Snap to Grid" doesn't work correctly
i29543: WW8: paragraph with "snap to grid" invisible in table
i15424: baseline in grid
i55461: snap to grid doesn't work correctly inside frame
i54864: WW8: Different treatment of tables in page due to active CJK Grid
i72657: A single line on Word got converted to two lines on Writer due to Text Grid
i68204: option "Print grid" becomes inactive when reopening Page dialog
i35684: [Text Grid] Vertical alignment for different font sizes differ from MS Word
i49214: CJK: Line spacing interpretation is different between Word and Writer due to text grid
i24195: WW Import: Cell height is too big when opened in OOo 1.1.1a (due to CJK-Grid)
i56820: Ruby in vertical text in grid wrongly moves original text sideways

Please let me know if you have any questions about text grid layout. :)

Thursday, March 22, 2007

User scenarios of text grid

The following user scenarios of text grid are under investigation.

1. When a user creates a text document, the default paper mode of the text grid is read from the user preference settings. The user can switch the default paper mode in the text document options tab page (Tools --> Options --> Text Document --> General) by clicking the check box "Use squared paper mode for text grid".

2. When an MS Word 97/2000 file (.doc) is imported, the default paper mode is treated as "standard paper mode".

3. When a text document from a previous version of OO.org (.odt, .sxw) is imported, the default paper mode is treated as "squared paper mode".

4. When a user is editing a text document with a text grid:
If the user switches the paper mode from "squared" to "standard", the following behavior is used:
* line height = Max base text size + Max ruby text size
* lines per page = type area height / line height
* Max ruby text size = 0

If the user switches the paper mode from "standard" to "squared", the following behavior is used:
* line height = type area height / lines per page
* Max base text size = line height * 2 / 3
* Max ruby text size = line height / 3
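The two conversions in scenario 4 can be written down as a small Python sketch. The function and key names are made up for illustration; rounding the number of lines down to a whole number and leaving the base text size unchanged when switching to "standard" are my assumptions, since the rules above do not say.

```python
def squared_to_standard(type_area_height, base_text_size, ruby_text_size):
    """Switch from "squared" to "standard" paper mode."""
    # The line height absorbs the former ruby height.
    line_height = base_text_size + ruby_text_size
    # Assumption: only whole lines fit, so round down.
    lines_per_page = int(type_area_height // line_height)
    return {
        "line_height": line_height,
        "lines_per_page": lines_per_page,
        "base_text_size": base_text_size,  # assumption: unchanged
        "ruby_text_size": 0.0,             # no ruby in standard mode
    }


def standard_to_squared(type_area_height, lines_per_page):
    """Switch from "standard" to "squared" paper mode."""
    line_height = type_area_height / lines_per_page
    return {
        "line_height": line_height,
        "lines_per_page": lines_per_page,
        "base_text_size": line_height * 2 / 3,
        "ruby_text_size": line_height / 3,
    }


# Illustrative numbers only: a 660pt type area with 20pt base and 10pt ruby.
print(squared_to_standard(660.0, 20.0, 10.0))
print(standard_to_squared(660.0, 22))
```

With these example numbers the two conversions are consistent with each other: 22 lines of 30pt either way, and the 2/3 and 1/3 split restores the 20pt base and 10pt ruby sizes.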

Tuesday, March 20, 2007

Two kinds of text grid layout

As mentioned before, there are two kinds of text grid layout used by CJK users. One is "squared page mode", the other is "standard (rectangle) page mode".

Per the text grid proposal approved by the ODF TC, the style:layout-grid-standard-mode property is added to specify which kind of text grid is used for the document.

To ensure that either “squared mode” or “standard mode” is used for the whole document, the style:layout-grid-standard-mode property can only be set for the default style of the “page-layout”. When the style:layout-grid-standard-mode attribute appears inside a style:page-layout definition, the attribute MUST be ignored.

A global setting selects which kind of text grid layout is used for the whole document (Tools-->Options-->OpenOffice.Org Writer-->General).


When "Use squared page mode for text grid" is checked, the original tab page of text grid is used.


When "Use squared page mode for text grid" is unchecked, the following tab page of text grid is used.

Tuesday, March 13, 2007

Text grid in MS Word 97 binary file format

In the MS Word 97 binary file format, there are three SPRMs that deal with the grid. They are:

Name               sprm    Property          Size  Description
sprmSDxtCharSpace  0x7030  Sep.dxtCharSpace  long  Specifies the grid width
sprmSDyaLinePitch  0x9031  Sep.dyaLinePitch  long  Specifies the grid height
sprmSClm           0x5032  Sep.clm           long  Specifies the grid type


sprmSClm has four values, which correspond to four types of grid.

Name      Value  Grid type
sprmSClm  0      No grid
          1      Specify line and character grid
          2      Specify line grid only
          3      Text snaps to character grid
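For reference, the opcodes and grid type values listed above could be captured in Python roughly as follows. The constant and enum names are made up for this sketch. The `sprm_group` helper relies on my understanding of the published Word binary format, in which bits 10 to 12 of an opcode identify the property group and 4 means section properties; treat that as an assumption.

```python
from enum import IntEnum

# Opcodes as listed in the post.
SPRM_S_DXT_CHAR_SPACE = 0x7030  # grid width
SPRM_S_DYA_LINE_PITCH = 0x9031  # grid height
SPRM_S_CLM = 0x5032             # grid type


class GridType(IntEnum):
    """Values of sprmSClm as listed in the post."""
    NO_GRID = 0
    LINES_AND_CHARS = 1
    LINES_ONLY = 2
    SNAP_TO_CHARS = 3


def sprm_group(opcode):
    """Extract the property group from bits 10..12 of an opcode.

    Assumption: 4 denotes section properties.
    """
    return (opcode >> 10) & 0x7


for opcode in (SPRM_S_DXT_CHAR_SPACE, SPRM_S_DYA_LINE_PITCH, SPRM_S_CLM):
    print(hex(opcode), sprm_group(opcode))
print(GridType(2).name)
```

All three opcodes land in the same group, which matches the "Sep." (section property) prefix in the Property column.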


Monday, March 12, 2007

Text grid enhancement in MS Office 2007

Today, I downloaded and installed a Chinese trial version of MS Office 2007. I was surprised to find that “square page mode” is also supported in MS Word 2007; it was not supported in previous versions.

The screenshots below show the menu entry for the “square mode” setting. (Maybe this function is disabled by default in non-CJK versions.)

Setting tab page:

Squared page (20 ×20):

If “square page mode” is enabled, the page setting menu is disabled and the “standard page mode” setting is not allowed, which avoids mixing page modes in a document.

Monday, February 26, 2007

Text doesn't snap to grid

I continued to investigate the layout behavior when the text doesn't snap to the grid.

For East Asian text, no more than one Asian character is displayed within a single cell. The space between two Asian characters is (grid width - font height).

For Western text, the space between two Western characters is (grid width - font height)/2.

The following diagram illustrates this algorithm in brief.



Of course, this only specifies what happens if the grid width is greater than the font height.
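The spacing rule above is simple enough to state as a Python function. This is only a restatement of the two formulas under the stated precondition; the function name and the error for the uncovered case are my own choices.

```python
def char_gap(grid_width, font_height, is_asian):
    """Extra space between two characters when text does not snap to the grid."""
    if grid_width <= font_height:
        # The rule only covers a grid that is wider than the font height.
        raise ValueError("grid width must be greater than font height")
    extra = grid_width - font_height
    return extra if is_asian else extra / 2


# Illustrative numbers: a 24pt grid with a 21pt font.
print(char_gap(24.0, 21.0, is_asian=True))
print(char_gap(24.0, 21.0, is_asian=False))
```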

Friday, February 16, 2007

Some knowledge about text grid

I just checked the Office OpenXML file format and tried to find the difference between the two types of text grid.
  • linesAndChars (Line and Character Grid): Specifies that the parent section shall have both the additional line pitch and character pitch added to each line and character within it (as specified on the docGrid element (§2.6.5)) in order to maintain a specific number of lines per page and characters per line. When this value is set, the input specified via the user interface may be allowed in exact number of line/character pitch units.
  • snapToChars (Character Grid Only): Specifies that the parent section shall have both the additional line pitch and character pitch added to each line and character within it (as specified on the docGrid element (§2.6.5)) in order to maintain a specific number of lines per page and characters per line. When this value is set, the input specified via the user interface may be restricted to the number of lines per page and characters per line, with the consumer or producer translating this information based on the current font data to get the resulting line and character pitch values.
The OpenXML file format only specifies the difference in the user interface between these two types of text grid; it doesn't point out the difference in layout behavior, especially for Asian text and Western text.

It seems that the author of the specification doesn't know the essential difference between these two types of text grid. :-)
Currently, I am investigating the Document Grid specified in CSS3, as mentioned by Florian.
http://www.w3.org/TR/2003/CR-css3-text-20030514/#document-grid

Thursday, February 15, 2007

I investigated the behavior of "Text snaps to character grid".
* If the grid type is grid (lines and characters) and the "snap-to-characters" attribute is true, each Asian character is centered in a single cell, while non-Asian text is centered within as many cells as required.

* If the grid type is grid (lines and characters) and the "snap-to-characters" attribute is false, the Asian character is not centered in the cell, but what is the behavior for non-Asian text?

Below are screenshots from MS Word.


Figure 1. Text snaps to character grid

Figure 2. Specify line and character grid

It seems that there is special behavior for non-Asian text when the text doesn't snap to the character grid.
Where can I get some clues?

Wednesday, February 14, 2007

Text Grid Enhancement

It is my pleasure to announce a new enhancement feature (Text Grid) for Writer documents. I am currently working on this feature.

* I created a new page in the OpenOffice wiki to ask for input. Here is the link: http://wiki.services.openoffice.org/wiki/Text_grid

* Below are the two kinds of text grid layout.







Figure 1. Squared page mode


Figure 2. Rectangle page mode

1) "Squared paper mode". It is used by ODF.
For this kind of grid layout, the page is divided into a fixed number of lines (lines per page). The lines are divided into square cells (characters per line). The number of lines per page depends on the line height (i.e., the sum of grid base and ruby height), and the number of characters per line also depends on the line height.

In "Squared paper mode", the “Characters per line” setting has the highest priority and determines the line height; the “Lines per page” setting has the second priority and determines the height of the type area.

"Squared paper mode" is used to simulate “squared paper”, a specific kind of paper used 20 years ago (before personal computers were widely used for text processing). Now, “squared paper” is only used in very limited cases, and most CJK users don't use it anymore.

2) "Rectangle paper mode". It is used by OpenXML, UOF (Chinese Office File Format) and other CJK office suites (e.g., EIOFFICE, WPS).
For this kind of grid layout, the page is divided into a fixed number of lines (lines per page), while the lines are divided into rectangular cells (characters per line). A ruby grid is not specified in this paper mode, so the line height is the grid base height, and the number of characters per line depends on the grid base width, not the grid base height.

In "Rectangle paper mode", the type area has the highest priority and, together with the “Lines per page” setting, which has the second priority, determines the line height. The “Characters per line” setting only determines the width of the characters and cannot influence the line height in any way.

"Rectangle paper mode" is widely used by CJK users.
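The different priorities in the two paper modes can be sketched in Python as follows. This follows the description above, not the actual Writer layout code: the function names are invented, the squared cell side is taken to equal the line height, and the example numbers are arbitrary.

```python
def squared_mode(type_area_width, chars_per_line, lines_per_page):
    """Squared paper mode: characters per line drives everything."""
    # Square cells: the cell side follows from the characters per line.
    cell = type_area_width / chars_per_line
    return {
        "char_width": cell,
        "line_height": cell,
        # Lines per page then decides how tall the type area becomes.
        "type_area_height": cell * lines_per_page,
    }


def rectangle_mode(type_area_width, type_area_height,
                   chars_per_line, lines_per_page):
    """Rectangle paper mode: the type area is fixed."""
    return {
        "char_width": type_area_width / chars_per_line,
        "line_height": type_area_height / lines_per_page,
        "type_area_height": type_area_height,
    }


# Arbitrary example: 28 characters per line and 22 lines per page.
print(squared_mode(420.0, 28, 22))
print(rectangle_mode(420.0, 660.0, 28, 22))
```

In the squared case the type area height changes with the settings (330 here), while in the rectangle case it stays at 660 and only the cell shape changes.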

* Snap to characters

We add a “snap to characters” attribute to specify whether Asian characters are centered in the grid cell when the grid type is set to “Grid (lines and characters)”. The default value of “snap to characters” is “true”.

This additional attribute also improves compatibility with MS Word. Below is the mapping of grid types between MS Word 2003 and OpenOffice.org:

Word 2003                              OpenOffice.org 2.1.0
1. No grid                             A. No grid
2. Specify line grid only (Default)    B. Grid (lines only) (Default)
3. Specify line and character grid     C. Grid (lines and characters), “snap-to-characters” is false.
4. Text snaps to character grid        D. Grid (lines and characters), “snap-to-characters” is true.
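The mapping above can be expressed as a lookup table, which also makes it easy to check that it is reversible. This Python sketch uses the UI labels from the mapping as keys; the dictionary and function names are hypothetical, and `None` stands for "the snap-to-characters attribute does not apply".

```python
# Word 2003 grid type -> (OpenOffice.org grid type, snap-to-characters).
WORD_TO_OOO = {
    "No grid": ("No grid", None),
    "Specify line grid only": ("Grid (lines only)", None),
    "Specify line and character grid": ("Grid (lines and characters)", False),
    "Text snaps to character grid": ("Grid (lines and characters)", True),
}

# Reverse table for export back to Word.
OOO_TO_WORD = {ooo: word for word, ooo in WORD_TO_OOO.items()}


def word_to_ooo(word_grid_type):
    """Map a Word 2003 grid type to the OpenOffice.org setting pair."""
    return WORD_TO_OOO[word_grid_type]


def ooo_to_word(grid_type, snap_to_chars=None):
    """Map an OpenOffice.org setting pair back to the Word 2003 grid type."""
    return OOO_TO_WORD[(grid_type, snap_to_chars)]


print(word_to_ooo("Text snaps to character grid"))
print(ooo_to_word("Grid (lines and characters)", False))
```

Without the extra attribute, rows 3 and 4 would collapse into the same OpenOffice.org grid type and the reverse lookup would be ambiguous.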


Figure 3. No grid

Figure 4 . Grid (lines only)


Figure 5. Grid (lines and characters), “snap-to-characters” is false


Figure 6. Grid (lines and characters), “snap-to-characters” is true

As shown in Figure 5, when "snap-to-chars" is false, the Asian characters are not centered in the cell, while the Western characters ("OpenOffice.Org") are still centered in the cell. This is not accurate; it is still work in progress.

Valentine's Day Evening

On the evening of Valentine's Day, the other staff have left work. Only I am in the office, creating a blog to record what I am doing every day. Quite interesting.
Florian helped me create it. Thank you, Florian!