src/bidi.c: Describe the design of reordering engine in the commentary.

author: Eli Zaretskii 2014-04-06 18:56:01 +0300
committer: Eli Zaretskii 2014-04-06 18:56:01 +0300
commit: aa5ccb01a59901cb15a25995b70a7f49d2b03b57 (patch)
tree: 93091dc421f965adca9bd28da45b90fa81cd27ae
parent: c8e7f832ead47a5b790beb21da471074dbbacd68 (diff)
download: emacs-aa5ccb01a59901cb15a25995b70a7f49d2b03b57.tar.gz
emacs-aa5ccb01a59901cb15a25995b70a7f49d2b03b57.zip
1 files changed, 186 insertions, 4 deletions
diff --git a/src/bidi.c b/src/bidi.c
index b96cc24bbd1..53c2dad1b6b 100644
--- a/src/bidi.c
+++ b/src/bidi.c
@@ -22,9 +22,16 @@ along with GNU Emacs.  If not, see <http://www.gnu.org/licenses/>.  */
   A sequential implementation of the Unicode Bidirectional algorithm,
   (UBA) as per UAX#9, a part of the Unicode Standard.
-   Unlike the reference and most other implementations, this one is
+   Unlike the Reference Implementation and most other implementations,
-   designed to be called once for every character in the buffer or
+   this one is designed to be called once for every character in the
-   string.
+   buffer or string.  That way, we can leave intact the design of the
+   Emacs display engine, whereby an iterator object is used to
+   traverse buffer or string text character by character, and generate
+   the necessary data for displaying each character in 'struct glyph'
+   objects.  (See xdisp.c for the details of that iteration.)  The
+   functions on this file replace the original linear iteration in the
+   logical order of the text with a non-linear iteration in the visual
+   order, i.e. in the order characters should be shown on display.
   The main entry point is bidi_move_to_visually_next.  Each time it
   is called, it finds the next character in the visual order, and
@@ -52,7 +59,182 @@ along with GNU Emacs.  If not, see <http://www.gnu.org/licenses/>.  */
   A note about references to UAX#9 rules: if the reference says
   something like "X9/Retaining", it means that you need to refer to
   rule X9 and to its modifications described in the "Implementation
-   Notes" section of UAX#9, under "Retaining Format Codes".  */
+   Notes" section of UAX#9, under "Retaining Format Codes".
+   Here's the overview of the design of the reordering engine
+   implemented by this file.
+   Basic implementation structure
+   ------------------------------
+   The sequential processing steps described by UAX#9 are implemented
+   as recursive levels of processing, all of which examine the next
+   character in the logical order.  This hierarchy of processing looks
+   as follows, from the innermost (deepest) to the outermost level,
+   omitting some subroutines used by each level:
+     bidi_fetch_char         -- fetch next character
+     bidi_resolve_explicit   -- resolve explicit levels and directions
+     bidi_resolve_weak       -- resolve weak types
+     bidi_resolve_neutral    -- resolve neutral types
+     bidi_level_of_next_char -- resolve implicit levels
+   Each level calls the level below it, and works on the result
+   returned by the lower level, including all of its sub-levels.
+   Unlike all the levels below it, bidi_level_of_next_char can return
+   the information about either the next or previous character in the
+   logical order, depending on the current direction of scanning the
+   buffer or string.  For the next character, it calls all the levels
+   below it; for the previous character, it uses the cache, described
+   below.
+   Thus, the result of calling bidi_level_of_next_char is the resolved
+   level of the next or the previous character in the logical order.
+   Based on this information, the function bidi_move_to_visually_next
+   finds the next character in the visual order and updates the
+   direction in which the buffer is scanned, either forward or
+   backward, to find the next character to be displayed.  (Text is
+   scanned backwards when it needs to be reversed for display, i.e. if
+   the visual order is the inverse of the logical order.)  This
+   implements the last, reordering steps of the UBA, by successively
+   calling bidi_level_of_next_char until the character of the required
+   embedding level is found; the scan direction is dynamically updated
+   as a side effect.  See the commentary before the 'while' loop in
+   bidi_move_to_visually_next, for the details.
+   Fetching characters
+   -------------------
+   In a nutshell, fetching the next character boils down to calling
+   STRING_CHAR_AND_LENGTH, passing it the address of a buffer or
+   string position.  See bidi_fetch_char.  However, if the next
+   character is "covered" by a display property of some kind,
+   bidi_fetch_char returns the u+FFFC "object replacement character"
+   that represents the entire run of text covered by the display
+   property.  (The ch_len and nchars members of 'struct bidi_it'
+   reflect the length in bytes and characters of that text.)  This is
+   so we reorder text on both sides of the display property as
+   appropriate for an image or embedded string.  Similarly, text
+   covered by a display spec of the form '(space ...)', is replaced
+   with the u+2029 paragraph separator character, so such display
+   specs produce the same effect as a TAB under UBA.  Both these
+   special characters are not actually displayed -- the display
+   property is displayed instead -- but just used to compute the
+   embedding level of the surrounding text so as to produce the
+   required effect.
+   Bidi iterator states
+   --------------------
+   The UBA is highly context dependent in some of its parts,
+   i.e. results of processing a character can generally depend on
+   characters very far away.  The UAX#9 description of the UBA
+   prescribes a stateful processing of each character, whereby the
+   results of this processing depend on various state variables, such
+   as the current embedding level, level stack, and directional
+   override status.  In addition, the UAX#9 description includes many
+   passages like this (from rule W2 in this case):
+     Search backward from each instance of a European number until the
+     first strong type (R, L, AL, or sos) is found. If an AL is found,
+     change the type of the European number to Arabic number.
+   To support this, we use a bidi iterator object, 'struct bidi_it',
+   which is a sub-structure of 'struct it' used by xdisp.c (see
+   dispextern.h for the definition of both of these structures).  The
+   bidi iterator holds the entire state of the iteration required by
+   the UBA, and is updated as the text is traversed.  In particular,
+   the embedding level of the current character being resolved is
+   recorded in the iterator state.  To avoid costly searches backward
+   in support of rules like W2 above, the necessary character types
+   are also recorded in the iterator state as they are found during
+   the forward scan, and then used when such rules need to be applied.
+   (Forward scans cannot be avoided in this way; they need to be
+   performed at least once, and the results recorded in the iterator
+   state, to be reused until the forward scan oversteps the recorded
+   position.)
+   In this manner, the iterator state acts as a mini-cache of
+   contextual information required for resolving the level of the
+   current character by various UBA rules.
+   Caching of bidi iterator states
+   -------------------------------
+   As described above, the reordering engine uses the information
+   recorded in the bidi iterator state in order to resolve the
+   embedding level of the current character.  When the reordering
+   engine needs to process the next character in the logical order, it
+   fetches it and applies to it all the UBA levels, updating the
+   iterator state as it goes.  But when the buffer or string is
+   scanned backwards, i.e. in the reverse order of buffer/string
+   positions, the scanned characters were already processed during the
+   preceding forward scan (see bidi_find_other_level_edge).  To avoid
+   costly re-processing of characters that were already processed
+   during the forward scan, the iterator states computed while
+   scanning forward are cached.
+   The cache is just a linear array of 'struct bidi_it' objects, which
+   is dynamically allocated and reallocated as needed, since the size
+   of the cache depends on the text being processed.  We only need the
+   cache while processing embedded levels higher than the base
+   paragraph embedding level, because these higher levels require
+   changes in scan direction.  Therefore, as soon as we are back to
+   the base embedding level, we can free the cache; see the calls to
+   bidi_cache_reset and bidi_cache_shrink, for the conditions to do
+   this.
+   The cache maintains the index of the next unused cache slot -- this
+   is where the next iterator state will be cached.  The function
+   bidi_cache_iterator_state saves an instance of the state in the
+   cache and increments the unused slot index.  The companion function
+   bidi_cache_find looks up a cached state that corresponds to a given
+   buffer/string position.  All of the cached states must correspond
+   1:1 to the buffer or string region whose processing they reflect;
+   bidi.c will abort if it finds cache slots that violate this 1:1
+   correspondence.
+   When the parent iterator 'struct it' is pushed (see push_it in
+   xdisp.c) to pause the current iteration and start iterating over a
+   different object (e.g., a 'display' string that covers some buffer
+   text), the bidi iterator cache needs to be "pushed" as well, so
+   that a new empty cache could be used while iterating over the new
+   object.  Later, when the new object is exhausted, and xdisp.c calls
+   pop_it, we need to "pop" the bidi cache as well and return to the
+   original cache.  See bidi_push_it and bidi_pop_it for how this is
+   done.
+   Some functions of the display engine save copies of 'struct it' in
+   local variables, and restore them later.  For examples, see
+   pos_visible_p and move_it_in_display_line_to in xdisp.c, and
+   window_scroll_pixel_based in window.c.  When this happens, we need
+   to save and restore the bidi cache as well, because conceptually
+   the cache is part of the 'struct it' state, and needs to be in
+   perfect sync with the portion of the buffer/string that is being
+   processed.  This saving and restoring of the cache state is handled
+   by bidi_shelve_cache and bidi_unshelve_cache, and the helper macros
+   SAVE_IT and RESTORE_IT defined on xdisp.c.
+   Note that, because reordering is implemented below the level in
+   xdisp.c that breaks glyphs into screen lines, we are violating
+   paragraph 3.4 of UAX#9. which mandates that line breaking shall be
+   done before reordering each screen line separately.  However,
+   following UAX#9 to the letter in this matter goes against the basic
+   design of the Emacs display engine, and so we choose here this
+   minor deviation from the UBA letter in preference to redesign of
+   the display engine.  The effect of this is only seen in continued
+   lines that are broken into screen lines in the middle of a run
+   whose direction is opposite to the paragraph's base direction.
+   Important design and implementation note: when the code needs to
+   scan far ahead, be sure to avoid such scans as much as possible
+   when the buffer/string doesn't contain any RTL characters.  Users
+   of left-to-right scripts will never forgive you if you introduce
+   some slow-down due to bidi in situations that don't involve any
+   bidirectional text.  See the large comment near the beginning of
+   bidi_resolve_neutral, for one situation where such shortcut was
+   necessary.  */
 #include <config.h>
 #include <stdio.h>
author	Eli Zaretskii	2014-04-06 18:56:01 +0300
committer	Eli Zaretskii	2014-04-06 18:56:01 +0300
commit	aa5ccb01a59901cb15a25995b70a7f49d2b03b57 (patch)
tree	93091dc421f965adca9bd28da45b90fa81cd27ae
parent	c8e7f832ead47a5b790beb21da471074dbbacd68 (diff)
download	emacs-aa5ccb01a59901cb15a25995b70a7f49d2b03b57.tar.gz emacs-aa5ccb01a59901cb15a25995b70a7f49d2b03b57.zip

diff --git a/src/bidi.c b/src/bidi.c index b96cc24bbd1..53c2dad1b6b 100644 --- a/src/bidi.c +++ b/src/bidi.c
@@ -22,9 +22,16 @@ along with GNU Emacs. If not, see <http://www.gnu.org/licenses/>. */
22	A sequential implementation of the Unicode Bidirectional algorithm,	22	A sequential implementation of the Unicode Bidirectional algorithm,
23	(UBA) as per UAX#9, a part of the Unicode Standard.	23	(UBA) as per UAX#9, a part of the Unicode Standard.
24		24
25	Unlike the reference and most other implementations, this one is	25	Unlike the Reference Implementation and most other implementations,
26	designed to be called once for every character in the buffer or	26	this one is designed to be called once for every character in the
27	string.	27	buffer or string. That way, we can leave intact the design of the
		28	Emacs display engine, whereby an iterator object is used to
		29	traverse buffer or string text character by character, and generate
		30	the necessary data for displaying each character in 'struct glyph'
		31	objects. (See xdisp.c for the details of that iteration.) The
		32	functions on this file replace the original linear iteration in the
		33	logical order of the text with a non-linear iteration in the visual
		34	order, i.e. in the order characters should be shown on display.
28		35
29	The main entry point is bidi_move_to_visually_next. Each time it	36	The main entry point is bidi_move_to_visually_next. Each time it
30	is called, it finds the next character in the visual order, and	37	is called, it finds the next character in the visual order, and
@@ -52,7 +59,182 @@ along with GNU Emacs. If not, see <http://www.gnu.org/licenses/>. */
52	A note about references to UAX#9 rules: if the reference says	59	A note about references to UAX#9 rules: if the reference says
53	something like "X9/Retaining", it means that you need to refer to	60	something like "X9/Retaining", it means that you need to refer to
54	rule X9 and to its modifications described in the "Implementation	61	rule X9 and to its modifications described in the "Implementation
55	Notes" section of UAX#9, under "Retaining Format Codes". */	62	Notes" section of UAX#9, under "Retaining Format Codes".
		63
		64	Here's the overview of the design of the reordering engine
		65	implemented by this file.
		66
		67	Basic implementation structure
		68	------------------------------
		69
		70	The sequential processing steps described by UAX#9 are implemented
		71	as recursive levels of processing, all of which examine the next
		72	character in the logical order. This hierarchy of processing looks
		73	as follows, from the innermost (deepest) to the outermost level,
		74	omitting some subroutines used by each level:
		75
		76	bidi_fetch_char -- fetch next character
		77	bidi_resolve_explicit -- resolve explicit levels and directions
		78	bidi_resolve_weak -- resolve weak types
		79	bidi_resolve_neutral -- resolve neutral types
		80	bidi_level_of_next_char -- resolve implicit levels
		81
		82	Each level calls the level below it, and works on the result
		83	returned by the lower level, including all of its sub-levels.
		84
		85	Unlike all the levels below it, bidi_level_of_next_char can return
		86	the information about either the next or previous character in the
		87	logical order, depending on the current direction of scanning the
		88	buffer or string. For the next character, it calls all the levels
		89	below it; for the previous character, it uses the cache, described
		90	below.
		91
		92	Thus, the result of calling bidi_level_of_next_char is the resolved
		93	level of the next or the previous character in the logical order.
		94	Based on this information, the function bidi_move_to_visually_next
		95	finds the next character in the visual order and updates the
		96	direction in which the buffer is scanned, either forward or
		97	backward, to find the next character to be displayed. (Text is
		98	scanned backwards when it needs to be reversed for display, i.e. if
		99	the visual order is the inverse of the logical order.) This
		100	implements the last, reordering steps of the UBA, by successively
		101	calling bidi_level_of_next_char until the character of the required
		102	embedding level is found; the scan direction is dynamically updated
		103	as a side effect. See the commentary before the 'while' loop in
		104	bidi_move_to_visually_next, for the details.
		105
		106	Fetching characters
		107	-------------------
		108
		109	In a nutshell, fetching the next character boils down to calling
		110	STRING_CHAR_AND_LENGTH, passing it the address of a buffer or
		111	string position. See bidi_fetch_char. However, if the next
		112	character is "covered" by a display property of some kind,
		113	bidi_fetch_char returns the u+FFFC "object replacement character"
		114	that represents the entire run of text covered by the display
		115	property. (The ch_len and nchars members of 'struct bidi_it'
		116	reflect the length in bytes and characters of that text.) This is
		117	so we reorder text on both sides of the display property as
		118	appropriate for an image or embedded string. Similarly, text
		119	covered by a display spec of the form '(space ...)', is replaced
		120	with the u+2029 paragraph separator character, so such display
		121	specs produce the same effect as a TAB under UBA. Both these
		122	special characters are not actually displayed -- the display
		123	property is displayed instead -- but just used to compute the
		124	embedding level of the surrounding text so as to produce the
		125	required effect.
		126
		127	Bidi iterator states
		128	--------------------
		129
		130	The UBA is highly context dependent in some of its parts,
		131	i.e. results of processing a character can generally depend on
		132	characters very far away. The UAX#9 description of the UBA
		133	prescribes a stateful processing of each character, whereby the
		134	results of this processing depend on various state variables, such
		135	as the current embedding level, level stack, and directional
		136	override status. In addition, the UAX#9 description includes many
		137	passages like this (from rule W2 in this case):
		138
		139	Search backward from each instance of a European number until the
		140	first strong type (R, L, AL, or sos) is found. If an AL is found,
		141	change the type of the European number to Arabic number.
		142
		143	To support this, we use a bidi iterator object, 'struct bidi_it',
		144	which is a sub-structure of 'struct it' used by xdisp.c (see
		145	dispextern.h for the definition of both of these structures). The
		146	bidi iterator holds the entire state of the iteration required by
		147	the UBA, and is updated as the text is traversed. In particular,
		148	the embedding level of the current character being resolved is
		149	recorded in the iterator state. To avoid costly searches backward
		150	in support of rules like W2 above, the necessary character types
		151	are also recorded in the iterator state as they are found during
		152	the forward scan, and then used when such rules need to be applied.
		153	(Forward scans cannot be avoided in this way; they need to be
		154	performed at least once, and the results recorded in the iterator
		155	state, to be reused until the forward scan oversteps the recorded
		156	position.)
		157
		158	In this manner, the iterator state acts as a mini-cache of
		159	contextual information required for resolving the level of the
		160	current character by various UBA rules.
		161
		162	Caching of bidi iterator states
		163	-------------------------------
		164
		165	As described above, the reordering engine uses the information
		166	recorded in the bidi iterator state in order to resolve the
		167	embedding level of the current character. When the reordering
		168	engine needs to process the next character in the logical order, it
		169	fetches it and applies to it all the UBA levels, updating the
		170	iterator state as it goes. But when the buffer or string is
		171	scanned backwards, i.e. in the reverse order of buffer/string
		172	positions, the scanned characters were already processed during the
		173	preceding forward scan (see bidi_find_other_level_edge). To avoid
		174	costly re-processing of characters that were already processed
		175	during the forward scan, the iterator states computed while
		176	scanning forward are cached.
		177
		178	The cache is just a linear array of 'struct bidi_it' objects, which
		179	is dynamically allocated and reallocated as needed, since the size
		180	of the cache depends on the text being processed. We only need the
		181	cache while processing embedded levels higher than the base
		182	paragraph embedding level, because these higher levels require
		183	changes in scan direction. Therefore, as soon as we are back to
		184	the base embedding level, we can free the cache; see the calls to
		185	bidi_cache_reset and bidi_cache_shrink, for the conditions to do
		186	this.
		187
		188	The cache maintains the index of the next unused cache slot -- this
		189	is where the next iterator state will be cached. The function
		190	bidi_cache_iterator_state saves an instance of the state in the
		191	cache and increments the unused slot index. The companion function
		192	bidi_cache_find looks up a cached state that corresponds to a given
		193	buffer/string position. All of the cached states must correspond
		194	1:1 to the buffer or string region whose processing they reflect;
		195	bidi.c will abort if it finds cache slots that violate this 1:1
		196	correspondence.
		197
		198	When the parent iterator 'struct it' is pushed (see push_it in
		199	xdisp.c) to pause the current iteration and start iterating over a
		200	different object (e.g., a 'display' string that covers some buffer
		201	text), the bidi iterator cache needs to be "pushed" as well, so
		202	that a new empty cache could be used while iterating over the new
		203	object. Later, when the new object is exhausted, and xdisp.c calls
		204	pop_it, we need to "pop" the bidi cache as well and return to the
		205	original cache. See bidi_push_it and bidi_pop_it for how this is
		206	done.
		207
		208	Some functions of the display engine save copies of 'struct it' in
		209	local variables, and restore them later. For examples, see
		210	pos_visible_p and move_it_in_display_line_to in xdisp.c, and
		211	window_scroll_pixel_based in window.c. When this happens, we need
		212	to save and restore the bidi cache as well, because conceptually
		213	the cache is part of the 'struct it' state, and needs to be in
		214	perfect sync with the portion of the buffer/string that is being
		215	processed. This saving and restoring of the cache state is handled
		216	by bidi_shelve_cache and bidi_unshelve_cache, and the helper macros
		217	SAVE_IT and RESTORE_IT defined on xdisp.c.
		218
		219	Note that, because reordering is implemented below the level in
		220	xdisp.c that breaks glyphs into screen lines, we are violating
		221	paragraph 3.4 of UAX#9. which mandates that line breaking shall be
		222	done before reordering each screen line separately. However,
		223	following UAX#9 to the letter in this matter goes against the basic
		224	design of the Emacs display engine, and so we choose here this
		225	minor deviation from the UBA letter in preference to redesign of
		226	the display engine. The effect of this is only seen in continued
		227	lines that are broken into screen lines in the middle of a run
		228	whose direction is opposite to the paragraph's base direction.
		229
		230	Important design and implementation note: when the code needs to
		231	scan far ahead, be sure to avoid such scans as much as possible
		232	when the buffer/string doesn't contain any RTL characters. Users
		233	of left-to-right scripts will never forgive you if you introduce
		234	some slow-down due to bidi in situations that don't involve any
		235	bidirectional text. See the large comment near the beginning of
		236	bidi_resolve_neutral, for one situation where such shortcut was
		237	necessary. */
56		238
57	#include <config.h>	239	#include <config.h>
58	#include <stdio.h>	240	#include <stdio.h>