| author | Eli Zaretskii | 2014-04-06 18:56:01 +0300 |
|---|---|---|
| committer | Eli Zaretskii | 2014-04-06 18:56:01 +0300 |
| commit | aa5ccb01a59901cb15a25995b70a7f49d2b03b57 (patch) | |
| tree | 93091dc421f965adca9bd28da45b90fa81cd27ae | |
| parent | c8e7f832ead47a5b790beb21da471074dbbacd68 (diff) | |
| download | emacs-aa5ccb01a59901cb15a25995b70a7f49d2b03b57.tar.gz emacs-aa5ccb01a59901cb15a25995b70a7f49d2b03b57.zip | |
src/bidi.c: Describe the design of reordering engine in the commentary.
| -rw-r--r-- | src/bidi.c | 190 |
1 files changed, 186 insertions, 4 deletions
diff --git a/src/bidi.c b/src/bidi.c index b96cc24bbd1..53c2dad1b6b 100644 --- a/src/bidi.c +++ b/src/bidi.c | |||
| @@ -22,9 +22,16 @@ along with GNU Emacs. If not, see <http://www.gnu.org/licenses/>. */ | |||
| 22 | A sequential implementation of the Unicode Bidirectional algorithm, | 22 | A sequential implementation of the Unicode Bidirectional algorithm, |
| 23 | (UBA) as per UAX#9, a part of the Unicode Standard. | 23 | (UBA) as per UAX#9, a part of the Unicode Standard. |
| 24 | 24 | ||
| 25 | Unlike the reference and most other implementations, this one is | 25 | Unlike the Reference Implementation and most other implementations, |
| 26 | designed to be called once for every character in the buffer or | 26 | this one is designed to be called once for every character in the |
| 27 | string. | 27 | buffer or string. That way, we can leave intact the design of the |
| 28 | Emacs display engine, whereby an iterator object is used to | ||
| 29 | traverse buffer or string text character by character, and generate | ||
| 30 | the necessary data for displaying each character in 'struct glyph' | ||
| 31 | objects. (See xdisp.c for the details of that iteration.) The | ||
| 32 | functions in this file replace the original linear iteration in the | ||
| 33 | logical order of the text with a non-linear iteration in the visual | ||
| 34 | order, i.e. in the order characters should be shown on display. | ||
| 28 | 35 | ||
| 29 | The main entry point is bidi_move_to_visually_next. Each time it | 36 | The main entry point is bidi_move_to_visually_next. Each time it |
| 30 | is called, it finds the next character in the visual order, and | 37 | is called, it finds the next character in the visual order, and |
| @@ -52,7 +59,182 @@ along with GNU Emacs. If not, see <http://www.gnu.org/licenses/>. */ | |||
| 52 | A note about references to UAX#9 rules: if the reference says | 59 | A note about references to UAX#9 rules: if the reference says |
| 53 | something like "X9/Retaining", it means that you need to refer to | 60 | something like "X9/Retaining", it means that you need to refer to |
| 54 | rule X9 and to its modifications described in the "Implementation | 61 | rule X9 and to its modifications described in the "Implementation |
| 55 | Notes" section of UAX#9, under "Retaining Format Codes". */ | 62 | Notes" section of UAX#9, under "Retaining Format Codes". |
| 63 | |||
| 64 | Here's the overview of the design of the reordering engine | ||
| 65 | implemented by this file. | ||
| 66 | |||
| 67 | Basic implementation structure | ||
| 68 | ------------------------------ | ||
| 69 | |||
| 70 | The sequential processing steps described by UAX#9 are implemented | ||
| 71 | as recursive levels of processing, all of which examine the next | ||
| 72 | character in the logical order. This hierarchy of processing looks | ||
| 73 | as follows, from the innermost (deepest) to the outermost level, | ||
| 74 | omitting some subroutines used by each level: | ||
| 75 | |||
| 76 | bidi_fetch_char -- fetch next character | ||
| 77 | bidi_resolve_explicit -- resolve explicit levels and directions | ||
| 78 | bidi_resolve_weak -- resolve weak types | ||
| 79 | bidi_resolve_neutral -- resolve neutral types | ||
| 80 | bidi_level_of_next_char -- resolve implicit levels | ||
| 81 | |||
| 82 | Each level calls the level below it, and works on the result | ||
| 83 | returned by the lower level, including all of its sub-levels. | ||
| 84 | |||
| 85 | Unlike all the levels below it, bidi_level_of_next_char can return | ||
| 86 | the information about either the next or previous character in the | ||
| 87 | logical order, depending on the current direction of scanning the | ||
| 88 | buffer or string. For the next character, it calls all the levels | ||
| 89 | below it; for the previous character, it uses the cache, described | ||
| 90 | below. | ||
| 91 | |||
| 92 | Thus, the result of calling bidi_level_of_next_char is the resolved | ||
| 93 | level of the next or the previous character in the logical order. | ||
| 94 | Based on this information, the function bidi_move_to_visually_next | ||
| 95 | finds the next character in the visual order and updates the | ||
| 96 | direction in which the buffer is scanned, either forward or | ||
| 97 | backward, to find the next character to be displayed. (Text is | ||
| 98 | scanned backwards when it needs to be reversed for display, i.e. if | ||
| 99 | the visual order is the inverse of the logical order.) This | ||
| 100 | implements the last, reordering steps of the UBA, by successively | ||
| 101 | calling bidi_level_of_next_char until the character of the required | ||
| 102 | embedding level is found; the scan direction is dynamically updated | ||
| 103 | as a side effect. See the commentary before the 'while' loop in | ||
| 104 | bidi_move_to_visually_next, for the details. | ||
| 105 | |||
| 106 | Fetching characters | ||
| 107 | ------------------- | ||
| 108 | |||
| 109 | In a nutshell, fetching the next character boils down to calling | ||
| 110 | STRING_CHAR_AND_LENGTH, passing it the address of a buffer or | ||
| 111 | string position. See bidi_fetch_char. However, if the next | ||
| 112 | character is "covered" by a display property of some kind, | ||
| 113 | bidi_fetch_char returns the U+FFFC "object replacement character" | ||
| 114 | that represents the entire run of text covered by the display | ||
| 115 | property. (The ch_len and nchars members of 'struct bidi_it' | ||
| 116 | reflect the length in bytes and characters of that text.) This is | ||
| 117 | so we reorder text on both sides of the display property as | ||
| 118 | appropriate for an image or embedded string. Similarly, text | ||
| 119 | covered by a display spec of the form '(space ...)', is replaced | ||
| 120 | with the U+2029 paragraph separator character, so such display | ||
| 121 | specs produce the same effect as a TAB under UBA. Both these | ||
| 122 | special characters are not actually displayed -- the display | ||
| 123 | property is displayed instead -- but just used to compute the | ||
| 124 | embedding level of the surrounding text so as to produce the | ||
| 125 | required effect. | ||
| 126 | |||
| 127 | Bidi iterator states | ||
| 128 | -------------------- | ||
| 129 | |||
| 130 | The UBA is highly context dependent in some of its parts, | ||
| 131 | i.e. results of processing a character can generally depend on | ||
| 132 | characters very far away. The UAX#9 description of the UBA | ||
| 133 | prescribes a stateful processing of each character, whereby the | ||
| 134 | results of this processing depend on various state variables, such | ||
| 135 | as the current embedding level, level stack, and directional | ||
| 136 | override status. In addition, the UAX#9 description includes many | ||
| 137 | passages like this (from rule W2 in this case): | ||
| 138 | |||
| 139 | Search backward from each instance of a European number until the | ||
| 140 | first strong type (R, L, AL, or sos) is found. If an AL is found, | ||
| 141 | change the type of the European number to Arabic number. | ||
| 142 | |||
| 143 | To support this, we use a bidi iterator object, 'struct bidi_it', | ||
| 144 | which is a sub-structure of 'struct it' used by xdisp.c (see | ||
| 145 | dispextern.h for the definition of both of these structures). The | ||
| 146 | bidi iterator holds the entire state of the iteration required by | ||
| 147 | the UBA, and is updated as the text is traversed. In particular, | ||
| 148 | the embedding level of the current character being resolved is | ||
| 149 | recorded in the iterator state. To avoid costly searches backward | ||
| 150 | in support of rules like W2 above, the necessary character types | ||
| 151 | are also recorded in the iterator state as they are found during | ||
| 152 | the forward scan, and then used when such rules need to be applied. | ||
| 153 | (Forward scans cannot be avoided in this way; they need to be | ||
| 154 | performed at least once, and the results recorded in the iterator | ||
| 155 | state, to be reused until the forward scan oversteps the recorded | ||
| 156 | position.) | ||
| 157 | |||
| 158 | In this manner, the iterator state acts as a mini-cache of | ||
| 159 | contextual information required for resolving the level of the | ||
| 160 | current character by various UBA rules. | ||
| 161 | |||
| 162 | Caching of bidi iterator states | ||
| 163 | ------------------------------- | ||
| 164 | |||
| 165 | As described above, the reordering engine uses the information | ||
| 166 | recorded in the bidi iterator state in order to resolve the | ||
| 167 | embedding level of the current character. When the reordering | ||
| 168 | engine needs to process the next character in the logical order, it | ||
| 169 | fetches it and applies to it all the UBA levels, updating the | ||
| 170 | iterator state as it goes. But when the buffer or string is | ||
| 171 | scanned backwards, i.e. in the reverse order of buffer/string | ||
| 172 | positions, the scanned characters were already processed during the | ||
| 173 | preceding forward scan (see bidi_find_other_level_edge). To avoid | ||
| 174 | costly re-processing of characters that were already processed | ||
| 175 | during the forward scan, the iterator states computed while | ||
| 176 | scanning forward are cached. | ||
| 177 | |||
| 178 | The cache is just a linear array of 'struct bidi_it' objects, which | ||
| 179 | is dynamically allocated and reallocated as needed, since the size | ||
| 180 | of the cache depends on the text being processed. We only need the | ||
| 181 | cache while processing embedded levels higher than the base | ||
| 182 | paragraph embedding level, because these higher levels require | ||
| 183 | changes in scan direction. Therefore, as soon as we are back to | ||
| 184 | the base embedding level, we can free the cache; see the calls to | ||
| 185 | bidi_cache_reset and bidi_cache_shrink, for the conditions to do | ||
| 186 | this. | ||
| 187 | |||
| 188 | The cache maintains the index of the next unused cache slot -- this | ||
| 189 | is where the next iterator state will be cached. The function | ||
| 190 | bidi_cache_iterator_state saves an instance of the state in the | ||
| 191 | cache and increments the unused slot index. The companion function | ||
| 192 | bidi_cache_find looks up a cached state that corresponds to a given | ||
| 193 | buffer/string position. All of the cached states must correspond | ||
| 194 | 1:1 to the buffer or string region whose processing they reflect; | ||
| 195 | bidi.c will abort if it finds cache slots that violate this 1:1 | ||
| 196 | correspondence. | ||
| 197 | |||
| 198 | When the parent iterator 'struct it' is pushed (see push_it in | ||
| 199 | xdisp.c) to pause the current iteration and start iterating over a | ||
| 200 | different object (e.g., a 'display' string that covers some buffer | ||
| 201 | text), the bidi iterator cache needs to be "pushed" as well, so | ||
| 202 | that a new empty cache could be used while iterating over the new | ||
| 203 | object. Later, when the new object is exhausted, and xdisp.c calls | ||
| 204 | pop_it, we need to "pop" the bidi cache as well and return to the | ||
| 205 | original cache. See bidi_push_it and bidi_pop_it for how this is | ||
| 206 | done. | ||
| 207 | |||
| 208 | Some functions of the display engine save copies of 'struct it' in | ||
| 209 | local variables, and restore them later. For examples, see | ||
| 210 | pos_visible_p and move_it_in_display_line_to in xdisp.c, and | ||
| 211 | window_scroll_pixel_based in window.c. When this happens, we need | ||
| 212 | to save and restore the bidi cache as well, because conceptually | ||
| 213 | the cache is part of the 'struct it' state, and needs to be in | ||
| 214 | perfect sync with the portion of the buffer/string that is being | ||
| 215 | processed. This saving and restoring of the cache state is handled | ||
| 216 | by bidi_shelve_cache and bidi_unshelve_cache, and the helper macros | ||
| 217 | SAVE_IT and RESTORE_IT defined in xdisp.c. | ||
| 218 | |||
| 219 | Note that, because reordering is implemented below the level in | ||
| 220 | xdisp.c that breaks glyphs into screen lines, we are violating | ||
| 221 | paragraph 3.4 of UAX#9, which mandates that line breaking shall be | ||
| 222 | done before reordering each screen line separately. However, | ||
| 223 | following UAX#9 to the letter in this matter goes against the basic | ||
| 224 | design of the Emacs display engine, and so we choose here this | ||
| 225 | minor deviation from the UBA letter in preference to redesign of | ||
| 226 | the display engine. The effect of this is only seen in continued | ||
| 227 | lines that are broken into screen lines in the middle of a run | ||
| 228 | whose direction is opposite to the paragraph's base direction. | ||
| 229 | |||
| 230 | Important design and implementation note: when the code needs to | ||
| 231 | scan far ahead, be sure to avoid such scans as much as possible | ||
| 232 | when the buffer/string doesn't contain any RTL characters. Users | ||
| 233 | of left-to-right scripts will never forgive you if you introduce | ||
| 234 | some slow-down due to bidi in situations that don't involve any | ||
| 235 | bidirectional text. See the large comment near the beginning of | ||
| 236 | bidi_resolve_neutral, for one situation where such a shortcut was | ||
| 237 | necessary. */ | ||
| 56 | 238 | ||
| 57 | #include <config.h> | 239 | #include <config.h> |
| 58 | #include <stdio.h> | 240 | #include <stdio.h> |
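
The "Caching of bidi iterator states" commentary added by this commit describes a linear array of iterator states, an index of the next unused slot, a save operation that advances that index, and a lookup by buffer/string position. A minimal sketch of that idea follows. It is not the code in bidi.c: `toy_state`, `toy_cache`, `toy_cache_save`, `toy_cache_find`, and `toy_cache_reset` are hypothetical names invented for illustration, the state is pared down to two fields, and the sketch assumes every cached state covers exactly one character (the real engine also handles runs covered by display properties).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, pared-down stand-in for 'struct bidi_it'.  The real
   structure (see dispextern.h) has many more members.  */
struct toy_state
{
  ptrdiff_t charpos;	/* logical position this state describes */
  int level;		/* resolved embedding level at that position */
};

/* Hypothetical cache: a linear array plus the index of the next
   unused slot, as the commentary describes.  */
struct toy_cache
{
  struct toy_state *slots;	/* grown on demand with realloc */
  ptrdiff_t size;		/* number of allocated slots */
  ptrdiff_t next_unused;	/* where the next state will be stored */
};

/* Store STATE in the next unused slot and advance the index.
   Return 0 on success, -1 if memory could not be allocated.  */
static int
toy_cache_save (struct toy_cache *c, struct toy_state state)
{
  /* Simplifying assumption: cached states describe consecutive
     positions, so the slots correspond 1:1 to the cached region.  */
  if (c->next_unused > 0)
    assert (state.charpos == c->slots[c->next_unused - 1].charpos + 1);

  if (c->next_unused == c->size)
    {
      ptrdiff_t new_size = c->size ? 2 * c->size : 16;
      struct toy_state *p = realloc (c->slots, new_size * sizeof *p);
      if (!p)
	return -1;
      c->slots = p;
      c->size = new_size;
    }
  c->slots[c->next_unused++] = state;
  return 0;
}

/* Return the cached state for CHARPOS, or NULL if that position is
   not cached.  Because of the 1:1 correspondence assumed above,
   index arithmetic is enough; no search is needed.  */
static const struct toy_state *
toy_cache_find (const struct toy_cache *c, ptrdiff_t charpos)
{
  if (c->next_unused == 0)
    return NULL;
  ptrdiff_t first = c->slots[0].charpos;
  if (charpos < first || charpos >= first + c->next_unused)
    return NULL;
  return &c->slots[charpos - first];
}

/* Forget all cached states, e.g. once the scan is back at the base
   embedding level.  The array itself is kept for reuse.  */
static void
toy_cache_reset (struct toy_cache *c)
{
  c->next_unused = 0;
}
```

In this sketch a forward scan would call `toy_cache_save` for each resolved character, and a later backward scan over the same run would call `toy_cache_find` instead of re-resolving the characters. Keeping the slots contiguous is a design choice of the sketch that makes lookup constant-time; it is only one possible way to honor the 1:1 correspondence the commentary requires.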