3 Vectorized monomial operations

The PoSSo library [14] already uses parts of this idea to speed up monomial additions, assignments, and tests for equalities, but has not (yet?) carried these ideas over to the much more time-critical monomial comparisons and divisibility tests.

The details of this technique are based on the following lemma whose proof is straightforward and, therefore, omitted:

Lemma 1: Let $s_e, m \in \textbf{N}$ , and $\alpha = (\alpha_0, \ldots \alpha_{m-1}), \beta = (\beta_0, \ldots \beta_{m-1}) \in M_m$ with $\alpha_i, \beta_i < 2^{{s_e}-1}$ . Furthermore, let $s_w = s_e \, m$ and $a, b , \widehat{a}, \widehat{b}, \gamma_0, \ldots \gamma_{m-1}, \delta_0 \ldots \delta_{m-1} \in \textbf{N}$ with $\gamma_i, \delta_i < 2^{s_e}$ such that

Lemma 1 shows how the monomial operations on the right-hand sides of (1) - (4) can be ``vectorized'', such that, on a 2's complement machine with word-size s_w, the checks (resp. operations) on the left hand sides of (1) - (4) can be performed in at most three single machine instructions (see the source code examples below for further details).

However, before Lemma 1 can be applied in an actual implementation, some technical difficulties have to be overcome:

Firstly, stricter bounds on the values of the single exponents have to be assured (i.e., the exponent values need to be less than 2^{s_e
-1} instead of 2^s_e).

Secondly, the condition $s_w = s_e \, m$ implies that the total length of the exponent vector has to be a multiple of the word-size which requires that $(-n)\!\!\pmod m$ ``unnecessary'' exponents (whose value is set to and always kept at 0) might have to be added to the exponent vector .

Thirdly, and most importantly, the order and arrangement of the exponents within the exponent vector has to be adjusted, depending on the monomial ordering and on the endianess of the used machine. On big-endian machines, the order of the exponents has to be reversed for reverse lexicographical orderings whereas on little-endian machines, the order of the exponents has to be reversed for lexicographical orderings. In practice, this fact can be hidden behind appropriate macro (or inline) definitions for accessing single exponents. In our implementation, we used a global variable called pVarOffSet and implemented the exponent access macro as follows:

Provided that n is the number of variables of the current ring then we set the value of pVarOffSet as follows:

Some source code fragments can probably explain it best: Figure 1 shows (somewhat simplified) versions of our implementation of the vectorized monomial operations. Some explanatory remarks are in order:

n_w is a global variable denoting the length of the exponent vectors in machine words (i.e., if s_w is the size of a machine word, s_e the size of an exponent, n the number of variables, then $n_w = \lceil n\,s_e/s_w \rceil$ ).

pLexSgn and pOrdSgn (used in MonComp) are global variables which are used to appropriately manipulate the return value of the comparison routine and whose values are set as follows:

Notice that MonAdd works on three monomials and it is most often used as a ``hidden'' initializer (or, assignment), since monomial additions are the ``natural source'' of most new monomials.

Our actual implementation contains various, tediously to describe, but more or less obvious, optimizations (like loop unrolling, use of pointer arithmetic, replacement of multiplications by bit operations, etc). We apply, furthermore, the idea of ``vectorized operations'' to monomial assignments and equality tests, too. However, we shall not describe here the details of these routines, since their implementation is more or less obvious and they have less of an impact on the running time than the basic monomial operations.

Table 1: Detailed timings for vectorized monomial operations: SINGULAR compiled with gcc version 2.7.2 for a Pentium Pro 200 (32 bit, little-endian) running Linux and average of timings for a HP C160 (32 bit, big-endian) running HP-UX 10.20 and a DEC Alpha (64 bit, little endian) running Linux.

Example	t_n,s	%mon	%comp	%div	%add	$\displaystyle \frac{t_{n,s}}{t_s}$	$\displaystyle \frac{t_{n,c}}{t_c}$	$\displaystyle \frac{t_s}{t_c}$	$\displaystyle \frac{t_{1.0}}{t_{n,s}}$	$\displaystyle \frac{t_{1.0}}{t_c}$
ecyclic 6	3.9	79.4	68.9	1.6	8.8	1.7	2.0	1.3	1.3	2.8
homog cyclic 7	125.8	72.5	33.6	11.8	27.1	1.5	1.9	1.3	1.3	2.6
homog cyclic 6	0.9	69.5	30.5	13.3	25.8	1.6	1.9	1.3	1.3	2.6
katsura 7	8.8	71.5	41.5	3.7	26.3	1.6	1.9	1.2	1.3	2.6
rcyclic 15	53.2	76.2	29.6	26.0	20.6	1.6	2.0	1.4	1.2	2.5
rcyclic 14	22.9	76.0	28.1	28.5	19.4	1.5	1.9	1.3	1.2	2.5
rcyclic 16	106.4	78.2	26.5	30.7	21.0	1.6	2.0	1.4	1.2	2.5
rcyclic 19	740.0	80.3	29.5	29.9	20.9	1.5	1.9	1.4	1.2	2.5
rcyclic 18	396.5	80.0	29.4	29.2	21.4	1.5	1.9	1.4	1.2	2.4
rcyclic 13	10.2	74.8	27.6	26.6	20.7	1.5	1.8	1.3	1.2	2.4
rcyclic 12	4.3	74.7	28.4	29.8	16.5	1.6	1.9	1.2	1.2	2.4
rcyclic 11	1.6	67.4	22.9	24.7	19.8	1.6	1.8	1.2	1.2	2.4
bjork 8	4.0	67.8	32.0	13.2	22.5	1.5	1.8	1.2	1.3	2.3
rcyclic 17	210.9	78.2	26.5	30.3	21.4	1.5	1.8	1.3	1.2	2.3
gerhard 1	1.7	68.7	48.1	4.5	16.0	1.3	1.5	1.2	1.5	2.3
katsura 8	94.3	73.1	45.1	3.5	24.5	1.5	1.7	1.2	1.3	2.3
homog 2mat3	126.0	81.9	77.5	1.9	2.5	1.3	1.5	1.4	1.2	2.2
cyclic 7	190.0	66.9	31.0	11.2	24.7	1.4	1.8	1.3	1.2	2.2
rcyclic 10	0.5	53.7	16.2	20.0	17.5	1.6	1.7	1.1	1.2	2.2
cyclic 6	0.7	61.6	26.8	16.3	18.6	1.4	1.7	1.2	1.2	2.1
homog gonnet	125.4	76.8	43.0	31.5	2.2	1.5	1.7	1.2	1.1	2.1
gerhard 2	13.8	72.3	54.8	3.2	14.3	1.3	1.4	1.1	1.4	2.1
homog alex 2	12.9	71.7	58.2	4.8	8.8	1.3	1.4	1.1	1.4	2.0
schwarz 11	131.6	65.4	30.6	18.4	16.5	1.3	1.6	1.3	1.2	2.0
2mat3	123.1	80.2	75.7	1.9	2.6	1.2	1.4	1.4	1.2	2.0
schwarz 10	21.9	62.5	28.2	18.4	15.9	1.3	1.5	1.2	1.2	2.0
cohn2	9.7	64.4	29.7	4.1	30.6	1.4	1.5	1.1	1.2	1.9
gerhard 3	24.4	67.1	45.5	13.1	8.6	1.3	1.3	1.1	1.3	1.8
symmetric 6	60.6	68.0	25.6	14.5	28.0	1.4	1.4	1.1	1.2	1.8
schwarz 9	3.3	55.5	23.0	20.0	12.5	1.3	1.4	1.1	1.2	1.8
alex 2	7.1	63.6	52.1	3.3	8.2	1.3	1.4	1.1	1.2	1.8
schwarz 8	0.6	50.6	24.7	14.6	11.2	1.3	1.5	1.1	1.1	1.7
ecyclic 7	1076.4	86.4	84.2	0.3	2.0	1.2	1.4	1.2	1.1	1.6
alex 3	2.0	65.9	53.5	7.7	4.7	1.3	1.3	1.1	1.2	1.6
homog alex 3	2.0	66.2	47.2	14.8	4.1	1.2	1.2	1.1	1.3	1.6
gonnet	1.2	43.8	6.2	34.2	3.4	1.1	1.3	1.2	1.1	1.5
averages
Pentium Pro	103.3	69.8	38.4	15.6	15.8	1.4	1.6	1.2	1.2	2.2
HP C160	125.4	69.0	41.7	17.9	9.5	1.2	1.5	1.3	1.8	2.9
DEC Alpha	221.4	69.8	37.7	16.1	16.0	1.7	1.7	1.2	1.1	2.3

So much for the theory, now let us look at some actual timings: Table 1 shows various timings illustrating the effects of the vectorized monomial operations described in this section. In the first column, we list the used examples -- details about those can be found in the Appendix. All GB computations were done using the degree reverse lexicographical ordering (dp) over the coefficient field Z/32003. We measured the following times (all in seconds):

Firstly, they support our assertion that for GB computations over finite fields of small characteristic, the running time is largely dominated by basic monomial operations (see column %mon). However, in which of the (three) basic monomial operations the most time is spent varies very much from example to example (compare, e.g., line ``gonnet'' with line ``ecyclic 6'').

Secondly, and most importantly: the impact of the vectorized monomial operations is quite substantial (see columns t_n,s/t_s, t_n,c/t_c which show the speedup gained by vectorized operations). As expected, the larger the ratio m = s_w/s_e (i.e, the number of exponents packed in one machine word), the more speedup is usually gained (compare column t_n,s/t_s and t_n,c/t_c). However, notice that we cannot conclude a direct correlation between the percentage of time spent in monomial operations and the efficiency gains of vectorized operations. This is due to the fact that the number of inner loop iterations (and, hence, reduction of inner loop iterations) for comparisons and divisibility tests is not constant, but depends on the input monomials.

Thirdly: As we would expect, the more exponents are encoded into one machine word, the faster the GB computation is accomplished (see the t_s/t_c column). This has two main reasons: first, more exponents are handled by one machine operation; and second, less memory is used, and therefore, the memory performance increased (e.g., number of cache misses is reduced). However, we also need to keep in mind that the more exponent are encoded into one word, the smaller are the upper bounds on the value of a single exponents. This is especially crucial for computations with char exponents, since these require that an exponent may not be larger than 127. Ideally, we would like to ``dynamically'' switch from one exponent size to the next larger one, whenever it becomes necessary. With SINGULAR, we cannot do this yet, at the moment, but we intend to implement this feature in one of the next major upgrades.

Fourthly: We would like to point out that the only difference between the SINGULAR versions resulting in t_1.0 and t_n,s is that monomial comparisons for the different orderings are not reduced to one inlined routine but instead are realized by function pointers (and, therefore, each monomial comparisons requires a function call). Hence, column t_1.0/t_n,s shows that already the in-place realization of monomial comparisons results in a considerable efficiency gain which by far compensates the additional overhead incurred by the additional indirection for accessing one (single) exponent value (i.e., the overhead incurred by the pGetExp macro shown above).

Last, but not least, the combination of all these factors into our newest and fastest version of SINGULAR, leads to a rather significant improvement (see column t_1.0/t_c) over the ``old'' SINGULAR version 1.0.

$\displaystyle b - a \!\!\pmod{2^{s_w}} < 2^{s_w-1}$	$\textstyle \mbox{ iff }$	$\displaystyle \mbox{Lex}(\alpha, \beta) = 1$	(1)
$\displaystyle \widehat{b} - \widehat{a} \!\!\pmod{2^{s_w}} < 2^{s_w-1}$	$\textstyle \mbox{ iff }$	$\displaystyle \mbox{RevLex}(\alpha, \beta) = -1$	(2)
$\displaystyle (\gamma_0, \ldots, \gamma_{m-1})$	=	$\displaystyle \alpha + \beta$	(3)
$\displaystyle \forall \, 0 \leq i < m: \delta_i < 2^{s_e -1}$	$\textstyle \mbox{ iff }$	$\displaystyle \alpha \,\vert \,\beta$	(4)

	type of machine
type of ordering	big-endian	little-endian
lexicographical	0	n-1
reverse lexicographical	n-1	0