Logical synthesis of general hardware operations

Theory

Multi-digit operations have a “local” and “non-local” component.

The “local” component uses digit \(i\) of the inputs to influence digit \(i\) of the output.

The “non-local” component uses non-\(i\) digits of the inputs to influence digit \(i\) of the output. The pptrees library works under the assumption that this non-locality only goes in one direction, for example, that only digits less significant than \(i\) influence the output. Operations that are non-local in both directions exist, and these must be handled by stacking two trees in series, one for each direction.

The pptrees library assumes that all strictly non-local components can be phrased in terms of if statements (nested ones, if needed). A formal proof can most likely be derived from group cohomology theory, which I do not know sufficiently.

The local and non-local components can be combined into the final result through a simple if-then statement:

if the non-local component modifies the final result then return a modified local result else return the original local result

This if-then statement, or multiplexer logic gate, can have multiple branches, depending on how many ways the final result may be modified in. A formal approach to this problem can most likely be derived from group cohomology theory, which I do not know sufficiently.

Because everything so far is composed of if-then statements, it can all associate.

if cond1:
  if cond2:
    if cond3:
      if cond4:
        do_this

is equivalent to

if cond1 and cond2:
  if cond3 and cond4:
    do_this

This is just one way in which if-then statements can associate.

Note that this notebook will use the short-hand of \(s ?\; a : b\) to mean

if s then a else b

The recipe for any operation

  1. Make sure that bit \(i\) of the result only depends on bits \(0\) through \(i\) of the input.

If this is not true, manually split the operation up into two parts, one that depends on bits \(0\) through \(i\) and another that depends on bits \(i+1\) through \(n\).

  1. Figure out what part is local (output at bit \(i\) only depends on bit \(i\) of the inputs).

  1. Obtain an expression for the final result at bit \(i\) using the template

if, the non-local component is X
  then, the final result is something other than the local result, to be determined
  otherwise, the final result is the local result
  1. Find an operation that answers the question “is the non-local component X”?

Define this operation as an if-statement, or series of nested if-statements.

  1. Repeat steps 1 through 5 for this new operation if the formula has any non-local terms that are not itself.

For example, the following formula would not need more work, since it’s only written in terms of itself and local aspects:

\(f_i\) = if \(s_i\) then \(g_i\) else \(f_{i-1}\), where \(s_i\) and \(g_i\) are local.

While the following formula would need to be broken apart:

\(f_i\) = if \(s_i\) then \(g_i\) else \(f_{i-1}\), where \(g_i\) is non-local.

  1. Use if-statement association rules to associate this operation.

This is explained more in the binary addition theory section.

Binary addition

Theory (skip if this is your first read?)

Just addition

Let us define the operands of addition as the digit-vectors \(a\), \(b\), and the result as the digit-vector \(s\).

The local component of addition at digit \(i\) is

\(s_i^L = a_i \oplus b_i = p_i\)

The non-local component of addition at digit \(i\) can be expressed as

\(s^{nL}_i = c_i\)

Where \(c\) stands for “carry”.

The final result of addition is \(s_i = s^{nL}_i ?\; \overline{s}^L_i : s^L_i\), or in other words \(c_i ?\; \overline{p}_i : p_i\)

If there’s no carry, keep the local component. If there’s a carry, negate the local component, since negating is the same thing as adding 1.

Carry-generation (non-local aspect of addition)

Let us define the operands of carry-generation as the digit-vectors \(a\), \(b\), and the result as the digit-vector \(c\).

There is no local component of the carry, it only depends on previous digits.

The non-local component of addition at digit \(i\) in plain English is:

If the previous pair of digits adds up to 0 (that is, \(\overline{a\oplus b}\)), there is a carry out of them if both of the digits are 1.

Otherwise, the pair of digits adds up to 1. Then, the carry coming out of them is the same as the carry going into them. Whether or not they overflow depends on whether or not the previous pair of digits also overflowed.

\(c_i = \overline{p}_{i-1} ?\; g_{i-1} : c_{i-1}\)

Where \(p_i = a_i \oplus b_i\) from before, while \(g_i = a_i \wedge b_i\).

Since only \(c_0\) is known a priori, all \(c_i\) must be expressed in terms of it.

For example, \(c_5 = \overline{p}_4 ?\; g_4 : \left(\overline{p}_3 ?\; g_3 : \left(\overline{p}_2 ?\; g_2 : \left(\overline{p}_1 ?\; g_1 : \left(c_0\right)\right)\right)\right)\)

This is where associativity comes into play.

These if-statements

if not p_4:
  g_4
elif not p_3:
  g_3
elif not p_2:
  g_2
elif not p_1:
  g_1
else:
  c_0

are equivalent to

if not p_4 or not p_3:
  not p_4 ? g_4 : g_3
elif not p_2 or not p_1:
  not p_2 ? g_2 : g_1
else:
  c_0

Long story short, the general rule is the following.

Instead of \(c_5\), call it \(c_{5:0}\). The weird \(\overline{p}_2 ?\; g_2 : g_1\) thing? Call it \(c_{3:2}\). \(g_1\)? Call it \(c_{2:2}\). \(g_2\)? Call it \(c_{3:3}\).

The whole (\(\overline{p}_3\) or \(\overline{p}_2\) or \(\overline{p}_1\))? Call it \(\overline{p}_{3:1}\).

The recurrence then becomes

\(c_{i:k} = \overline{p}_{i-1:j-1} ?\; c_{i:j} : c_{j:k}\)

We must also then keep track of \(p_{i:k}\):

\(\overline{p}_{i:k} = \overline{p}_{i:j} + \overline{p}_{j:k}\)

Or to put it all together in terms of some operator, ■

\((c_{i:k}, \overline{p}_{i-1:k-1}) = (c_{i:j}, \overline{p}_{i-1:j-1}) ■ (c_{j:k}, \overline{p}_{j-1:k-1}) = (\overline{p}_{i-1:j-1} ?\; c_{i:j} : c_{j:k}\;,\; \overline{p}_{i-1:j-1} + \overline{p}_{j-1:k-1})\)

Or even shorter

\((C, \overline{P}) ■ (C', \overline{P'}) = (\overline{P} \;?\; C : C', \overline{P} + \overline{P'})\)

Sorry about the off-by-one issue, that’s just carry-generation for ya.

In general though, this is how if statements associate. This applies for all operations that are non-local in a single direction.

For carry-generation in particular, the previous statement can be simplfied. Expanding out the boolean logic, we have: \(c_{i:k} = \overline{p}_{i-1:j-1} ?\; c_{i:j} : c_{j:k} = \overline{p}_{i-1:j-1} c_{i:j} + p_{i-1:j-1}c_{j:k}\)

However, \(\overline{p}_{i-1:j-1}\) and \(c_{i:j}\) are mutually exclusive. The formula is thus reduced to:

\((C, P) ■ (C', P') = (C + PC', PP')\)

The boolean algebra works out, even though it’s not shown here.

Using the recipe

  1. Bit \(i\) of the sum only depends on bits 0 through \(i\) of the input.

  1. \(s^L_i = a_i\oplus b_i = p_i\) is the local aspect.

if the carry into bit i, henceforth called c_i, is 1
  then, the final result is the inverse of the local result
  otherwise, the final result is the local result

In other words, \(s_i = c_i ?\; \overline{p}_i : p_i\)

  1. \(c_i\) can be defined as

if the local sum of the i-1 pair of input bits is 0
  then, c_i is 1 if both the bits are 1 [1 + 1 = 0 with an overflow]
  otherwise, the local sum of the i-1 pair would be 1,
    so c_i happens only if c_i-1 happens

In other words, \(c_i = \overline{p}_i ?\; g_i : c_{i-1}\)

Where \(p_i = a_i \oplus b_i\) and \(g_i = a_i \wedge b_i\)

  1. \(c_i\) has no other non-local aspects, so it is fine.

  1. The previous section shows how step 4’s definition of \(c_i\) becomes

\((C, \overline{P}) ■ (C', \overline{P'}) = (\overline{P} \;?\; C : C', \overline{P} + \overline{P'})\)

and can then be simplified to

\((C, P) ■ (C', P') = (C + PC', PP')\)

Putting it all together

The pptrees library defines an AdderTree as an ExpressionTree with the following nodes:

node_defs = {
    "pre": "ppa_pre",
    "root": "ppa_post",
    "cocycle": "ppa_cocycle",
    "buffer": "ppa_buffer",
    "lspine_pre": "ppa_lspine_pre",
    "lspine": "ppa_lspine",
    "small_root": "ppa_small_root",
    "small_pre": "ppa_lspine_pre_simple",
}

A couple of short-hands were introduced in the recipe: \(p_i = a_i \oplus b_i\) and \(g_i = a_i \wedge b_i\). This is “pre-processing” logic. These form the leafs of the tree. They are represented by the “ppa_pre” node.

Step 3 of the recipe obtains an expression for the final result.

This is “post-processing” logic. This forms the root of the tree. It is represented by the “ppa_post” node.

Step 2 of the recipe describes the local aspect of the operation.

This is represented by the “ppa_lspine_pre” node.

Step 6 of the recipe describes the non-local aspect of the operation.

This is represented by the “ppa_cocycle” node.

The “ppa_buffer” node is used for fan-out decoupling.

Since “ppa_cocycle” node operates on tuples of size 2, so does “ppa_buffer”.

“ppa_lspine” takes advantage of the fact that “cocycle”, “root”, and “lspine_pre” can all be seen as if-then statements. That means they can all associate. This is a concept that has not been introduced thus far, because it’s just a cherry on top that can be discussed separately.

“ppa_small_root” and “ppa_small_pre” account for the fact that a width-1 circuit doesn’t form a binary tree. There’s only one leaf, so it’s just the root node and 1 child.

Perhaps the software should be able to handle this on its own, somehow, but it is currently not able.

Possible expansion: addition-based comparator

Discussion

A comparator compares two numbers, gauging whether one is greater, less than, or equal to the other.

Checking whether a number is greater than another can be done via addition. Just subtract B from A. If the result is negative, B > A. Otherwise, not (B > A).

Subtracting two numbers via two’s complement addition is straight-forward:

  1. Invert \(B\) into \(\overline{B}\)

  2. Add \(A + \overline{B} + 1\).

The “+1” can be handled by adding an extra digit to the addition operation.

That digit can be real, in which case both A and B would have a least-significant \(1\) appended to them.

That digit could instead be virtual: an extra leaf is added to the carry-chain tree, with the leaf output being a constant \(1\) instead of \(a_{-1} \wedge b_{-1}\)

Finally, binary addition of two \(n\)-bit numbers has an \(n\)-bit result.

But comparators only need 1 bit of result. So the structure necessary isn’t \(n\) trees, but rather only 1 tree.

The specific bit that needs to be checked depends on whether the numbers are signed or unsigned.

For unsigned numbers, A - B can be performed using two’s complement subtraction. This will require both the “+1” padding trick, as well as an extra sign bit on the most-significant side.

B > A if the operation overflows into the sign bits. But since the sign bits are both 0, we don’t need to calculate the actual sum, just the carry.

The answer is “B > A = \(c_{n+1}\)”, the result of a \(n+1\) bit carry generation.

For signed numbers, A - B can be performed using two’s complement subtraction. This will require the “+1” padding trick.

But that can only work if \(A\) and \(B\) are both positive or negative. If the two numbers are of opposite polarity, some magic is required.

Using the recipe (unsigned numbers)

The recipe doesn’t need to be used per se. This is just the carry-generation portion of an adder.

Putting it all together

The tree would just be a variant on the AdderTree.

The main differences are:

  • Signed numbers should be handled in some way.

  • lspine_pre and lspine can be omitted, making them use the normal nodes. This changes it from an addition circuit to a carry-computation circuit.

  • The software should invert one of the inputs on its own, in a modified pre-processing node, instead of forcing the user to do it in HDL.

  • The software should pad the least-significant bit on its own, to account for the +1. This can be done by increasing the width by 1 and throwing in an extra pre node. The extra pre node should perform the operation assign gout = 1.

  • The tree should always be balanced, because that’s just the best option if there’s only one tree. Left-balanced vs right-balanced shouldn’t make a difference.

  • The circuit should be a forest consisting of one tree, because writing out HDL from Forests is tested, but writing out HDL from Trees is not. This is just a cautionary note due to the beta version of the software.

Analysis (unsigned numbers)

The circuit is a single tree, of width \(n+1\).

The tree is a carry-generation tree, where the cocycle nodes are an AOI and NAND cell each.

Note that this circuit ONLY calculates B > A, not also A > B nor B = A.

Area

A single tree of width \(n+1\) has \(n\) nodes. The pre-processing logic leads to \(n+1\) leafs, since the 0th is special (+1).

Each tree node is basically an AOI and a NAND. Each tree leaf is basically two NANDs.

The total area is thus roughly 3n NANDs + n AOIs.

A complete adder-based circuit would have roughly 2n + 2n lg(n) NANDs, 2n lg(n) AOIs, and n XORs.

The area savings over a complete adder would thus be roughly:

  • n XOR gates

  • 2n lg(n) - n AOI gates

  • 2n lg(n) - n NAND gates

For a 64-bit word size, this is roughly an 85% area savings.

Delay

The critical path through the tree requires \(\lceil lg(n+1)\rceil\) AOI nodes. Pre-processing add one NAND node. There is no fan-out or wire tracks.

The total delay is thus roughly \(1 + \lceil lg(n+1)\rceil\)

A complete adder-based circuit would have the same critical path.

However, that critical path would be fraught with fan-out and tracks.

For a 64-bit word size, that represents a roughly 20-25% difference in delay.

Possible expansion: non addition-based comparator

Discussion

This typically takes the form of adding a constant +1 to increment, or adding a constant +4 to the program counter.

The benefit of knowing the constant value, \(B\), being added is:

  • Anywhere where \(b_i = 0\), \(g_i = a_i \wedge b_i = 0\), while \(p_i = a_i \oplus b_i = a_i\)

  • Anywhere where \(b_i = 1\), \(g_i = a_i \wedge b_i = a_i\), while \(p_i = a_i \oplus b_i = \overline{a}_i\)

Let’s call \(G = (B > A)\), \(L = (B < A)\), and \(E = (A == B)\)

But wait. If we look at the most significant bits of \(B\) and \(A\), and \(B > A\), then it is certain that \(B > A\).

So the non-local component doesn’t come from the less significant bits like it does for addition or carry-generation. It comes from the more significant bits.

What can be done about this?

Easy: just reverse the bit string.

Using the recipe

  1. \(G\) and \(L\) only get affected by more-significant bits, never by less-significant bits.

Reversing the inputs into the tree can make the data flow into its standard direction that is assumed by pptrees.

  1. The local aspects are

\(G_i^L = \overline{a}_ib_i\)

\(L_i^L = a_i\overline{b}_i\)

Explanation: the only way \(b_i > a_i\) is if \(b_i = 1\) and \(a_i = 0\).

if the non-local component is (not E)
  then, the final result is not the local result
  otherwise, the final result is the local result

For both \(G\) and \(L\) the situation is the same. Even if, locally, \(G\) is true, that only matters if the non-local \(E\) is false.

That is to say, the less-significant digits only matter if the more-significant digits are all equals.

The math then is

\(G_i = \overline{E}_{i+1} ?\; G_{i+1} : G^L_{i}\)

\(L_i = \overline{E}_{i+1} ?\; L_{i+1} : L^L_{i}\)

  1. \(E_i\) can be defined as

if (not G_i) and (not L_i)
  then, E
  otherwise, (not E)

In other words, \(E_i = \overline{G}_i\overline{L}_i ? \; 1 : 0\)

Or to simplify it, \(E_i = \overline{G}_i\overline{L}_i\)

  1. At first glance, in step 3, it seems like \(E\) is going to need its own separate tree.

But \(E\) only depends on \(G\) and \(L\).

That is why we are computing both \(G\) and \(L\) at the same time. Because they’re both needed anyway.

  1. Let’s just go through the math, even if you already see the answer.

\((G, L, \overline{E}) ■ (G', L', \overline{E'}) = (\overline{E'} \;?\; G' : G, \overline{E'} \;?\; L' : L, \overline{E} + \overline{E'})\)

expanding this once:

\((G, L, \overline{E}) ■ (G', L', \overline{E'}) = (\overline{E'}G' + E'G, \overline{E'}L' + E'L, \overline{E} + \overline{E'})\)

substituting the definition of \(E\), into just one part for brevity:

\((G,\;,\;) ■ (G',\;,\;) = ((G'+L')G' + \overline{G'}\overline{L'}G,\;,\;) = (G' + \overline{G'}\overline{L'}G,\;,\;) = (G' + \overline{L'}G,\;,\;)\)

It turns out that we no longer need \(E\) at all, so we can eliminate it from our consideration.

\((G,L) ■ (G',L') = (G' + \overline{L'}G, L' + \overline{G'}L)\)

Putting it all together

The tree would inherit AdderTree.

The main differences from AdderTree are:

  • New nodes would need to be defined for pre-processing a constant bit of 1, pre-processing a constant bit of 0, and the two modified ■ nodes.

  • The optimize_nodes() method would first call its super(), then it’d do additional things:

  1. Somehow take in info about where the constant has 1’s and 0’s, and morph the pre-processing nodes.

  2. Keep track of whether there’s a chain of 0’s on-going (can be done through the nodes’ leafs attributes, which is a binary-encoded list of all the leafs that feed into a certain node), and then morphs the ■ node accordingly.

Analysis (unsigned numbers)

The circuit is a single tree, of width \(n\).

The cocycle (main node) of the tree is composed of two AOI cells, possibly with some inverters, but smart synthesis can eliminate the need for inverters.

The pre-processing nodes seem to be two NAND2B cells, but in order to enable smart synthesis of the tree, it’s better to make them two NAND cells and an inverter.

Note that this circuit calculates both B > A and A > B. This is in contrast to the previous section, which cannot calculate both of these values. Thus in a normal application, this circuit on its own would suffice, while the circuit from the previous section would need to be duplicated. Take this into account when considering the values presented below.

Area

A single tree of width \(n\) has \(n\) nodes. The pre-processing logic leads to \(n+1\) leafs.

Each tree node is basically two AOIs. Each tree leaf is basically two NANDs and an inverter.

The total area is thus roughly 2n NANDs + 2n AOIs + n INVs.

For a 64-bit word size, this is about 50% larger than the previous section, but still about 80-85% smaller than a complete adder-based circuit.

Delay

The critical path through the tree requires \(\lceil lg(n)\rceil\) AOI nodes. Pre-processing adds one NAND node and one INV node. There is no fan-out or wire tracks.

The total delay is thus roughly \(2 + \lceil lg(n)\rceil\).

This is slightly faster than the previous section for word-size that is a power of 2, and slightly slower for other word sizes.

It remains an estimated 20-25% faster than a complete adder-based circuit.

Delay

The critical path through the tree requires \(\lceil lg(n+1)\rceil\) AOI nodes. Pre-processing add one NAND node. There is no fan-out.

The total delay is thus roughly \(1 + \lceil lg(n+1)\rceil\)

Possible expansion: adding a constant value

Discussion

This typically takes the form of adding a constant +1 to increment, or adding a constant +4 to the program counter.

The benefit of knowing the constant value, \(B\), being added is:

  • Anywhere where \(b_i = 0\), \(g_i = a_i \wedge b_i = 0\), while \(p_i = a_i \oplus b_i = a_i\)

  • Anywhere where \(b_i = 1\), \(g_i = a_i \wedge b_i = a_i\), while \(p_i = a_i \oplus b_i = \overline{a}_i\)

This allows for the pre-processing logic at those specific spots to be optimized.

But the \(b_i = 0 \implies g_i = 0\) is interesting. Continuous chains of \(b_i = 0\) lead to continuous chains of \(g_i = 0\). And continuous chains of \(g_i = 0\) means large sub-blocks of \(C_{i:j} = 0\).

To show this, consider:

\(C_{i:i} = g_{i-1}\)

\(C_{i:i-1} = C_{i:i} + P_{i-1:i-1} C_{i-1:i-1}\)

If both those corresponding \(g\) bits are 0, due to the constant value, then the entire \(C_{i:i-1}\) block is 0. Entire \(C_{i:j}\) blocks can be hard-coded to 0 in this way.

So the pre-processing logic optimization is clear.

But this tree, cocycle node optimization.

First off, the previous cell leads to the conclusion that \(C_{i:j}\) is 0 if \(g_{i-1}\) through \(g_{j-1}\) are all zero. So if such a chain is encountered in the tree logic, the normal cocycle node of

\((C, P) ■ (C', P') = (C + PC', PP')\)

can be reduced to

\((, P) ■ (, P') = (, PP')\)

This saves area and power for sure. Probably delay, it’s hard to tell?

But once the chain of zeroes ends, and these weird nodes must combine with the rest of the tree, what do we do?

Say, for example, we have the addition \(A + 0000\;0100\)

There’s a chain of 5 zeroes in the front. Let’s look at that. We handle it as above, simplifying ■. How do we combine it with the stuff on the right?

In other words, what is

\((, P) ■ (C', P')\)

Well, we know that missing \(C\) is 0. So it’s just

\((0, P) ■ (C', P') = (PC', PP')\)

A similar effect can occur for chain of 1’s, but that’s a topic for another day.

Using the recipe

The recipe doesn’t need to be used per se.

Yes, the ■ operator changes. But it can be thought of as just a synthesis optimization, something to be done after the tree structure is generated.

Putting it all together

The tree would inherit AdderTree.

The main differences from AdderTree are:

  • New nodes would need to be defined for pre-processing a constant bit of 1, pre-processing a constant bit of 0, and the two modified ■ nodes.

  • The optimize_nodes() method would first call its super(), then it’d do additional things:

  1. Somehow take in info about where the constant has 1’s and 0’s, and morph the pre-processing nodes.

  2. Keep track of whether there’s a chain of 0’s on-going (can be done through the nodes’ leafs attributes, which is a binary-encoded list of all the leafs that feed into a certain node), and then morphs the ■ node accordingly.