# XNOR-Nets with SETs: Proposal for a binarised convolution processing elements with Single-Electron Transistors

Figure 1 shows the typical Coulomb staircase behavior of current-voltage characteristics for an one level single-electron system^{13}† The results are obtained for a simple one-level device at two gate voltages, Vg = 0 V and Vg = 0.15 V. The current-voltage curves are obtained by analytically solving the Master equation for a single energy state. The initial point of the problem is to implement AND operation using the non-linear Coulomb staircase of a single SET. We use this one-level toy model to briefly discuss about the implementation of AND. Here, the two inputs are encoded as voltages of source and gate respectively and the output is encoded as the current. AND gate could be realized by encoding input logical **0** and **1** as voltages 0 V and 0.15 V respectively and output logical **0** and **1** as 0 A and 1.2e−7 A respectively. For example, inputs **0** and **0** gives output logic **0** and similarly, inputs **1** and **1** give output logic **1**†

With the basic philosophy of the problem laid down using a simple one-level model, we proceed on to a detailed study of the problem with a more rigorous Master equation based SPICE model^{18}† Figure 2a depicts the typical Coulomb-staircase behavior of a single-electron transistor obtained from SPICE-level simulation at gate voltages of 0 V and 0.045 V. The operation of AND as described above is illustrated in Fig. 2a. 2 taking an example input instance of **1** and **1**.The output current measured at 3.6 nA is the logical output **1** for the given inputs. The inputs **0** and **1** of the Boolean gate could be set at the Coulomb levels of choice depending on the drive current and energy trade-off required in the application-specific circuit design. The current levels also offer high robustness to input voltage noises due to the Coulomb blockade of electrons. Moreover, the sensing margin of outputs namely I(**1**)/I(**0**) has an ideal value of infinity, in principle.

In Binary Neural Nets, the input activations and weights are binarised to either † 1 or − 1 and therefore, n-bit floating point Multiply-Accumulate operations are transformed to XNOR-Accumulate^{5}† XAC remains the core operation of the binary deep neural networks consuming the bulk of space, time and energy resources. XNOR operation is derived from the AND operation by adding a second SET and inverting the input voltage signals as implemented by Eq. 1 and shown in FIG. 3. XNOR and XOR operations are also obtained in the previous studies^{19.20} using a similar input and output encoding scheme but with a more complicated multi-gate device design and fabrication process.

$$begin{aligned} mathbf{p} quad XNOR quad mathbf{q} = ( mathbf{p} quad AND quad mathbf{q} ) quad OR quad ( overline{ mathbf{p }} quad AND quad overline{mathbf{q }} ) end{aligned}$$

(1)

### XNOR and accumulate (XAC)

The accumulation is done by POPCOUNT instruction in digital systems and consumes vast resources of space and time for big-length inputs. The XAC circuit that builds on the previous XNOR circuit is shown in Fig. 4a that POPCOUNT the 2-bit length input. The circuit uses the Kirchhoff’s law of current to POPCOUNT the input. Long Cheng et al.^{21} experimentally demonstrated a 4-bit POPCOUNT accelerator using a memristive array that similarly operates on the addition of currents. For illustration of the POPCOUNT operation, consider two binarised inputs A = [1 1] and B = [1 1]† The first bits of A and B are fed into V1 and V2, while second bits are fed into V3 and V4. SETs numbered 1,2,3 and 4 perform XNOR operation on the vectors and the POPCOUNT is read by measuring the output current Ipop. In the present example, the input voltages are fixed at 0.045 V and the output current is measured at 7.2 nA which is encoded as integer 2 (Fig. 4).

### Reconfigurable XAC circuit

The advantage of the proposed XAC circuit is that it could be reconfigured to carry out the 1D and 2D convolutions^{7} for a parallel binarized CNNs. Here, we demonstrate the reconfigurability of XAC circuit by taking an example of performing 1D row convolution of (4times 4) kernel with input activations. Once the POPCOUNT operation as implemented by the XAC unit is completed, the POPCOUNT outputs of different XAC units are routed back to a selected XAC unit to obtain 1D convolution output as shown in Fig. 5a. The gate terminals could be grounded or fixed at selected voltage and therefore, the devices function essentially as two-terminal single-electron junctions for the 1D convolution operation. POPCOUNT is encoded at voltages that maps to equivalent Coulomb stair-case level. Each quantized current level maps to an unique integer obtained from the POPCOUNT instruction. The resultant output current obtained by the summation of individual devices represent the 1D convolution value. Both positive and negative sums could be computed by encoding the sign in the direction of current. For example, assume the POPCOUNTs obtained from XAC of one row of (4times 4) kernel with input activations to be V1 = 1, V2 = 1, V3 = 0 and V4 = 0. These integers are fed into the reconfigured XAC unit, with the voltages fixed at 0.045 V and 0 V to represent integer 1 and 0 respectively. the output current, (I_{1D}) measured at 7.2 nA encode the 1D convolution value 2 as shown in Fig. 5. It should be noted that current should be converted to appropriate voltage level with an additional converter circuit before feeding inputs into the reconfigured XAC circuit^{22}†

### Bus bar circuit

The re-configurable XAC circuit provides a significant savings in transistor consumption and the chip area. But it is achieved at the expense of complex three-terminal transistor fabrication, delayed computation and importantly, integrating the SET-XAC circuit with other contemporary deep neural hardware could be challenging. Therefore, a simpler busbar circuit consisting of two-terminal single-electron junctions is proposed to address this problem. The crossbar of silicon quantum dots has already been demonstrated^{23} for a more complicated architecture in quantum computing. In particular, we demonstrate the 1D convolution operation of the busbar circuit by integrating into and augmenting a contemporary hardware accelerator, XNORBIN^{24}† XNORBIN, a completely digital and tapeout ASIC achieved the second-best result of 100 TOP/s/W designed on a Global Foundries 65nm node.

Here, we use a simple two-element busbar to demonstrate our proof-of-concept as shown in Fig. 6a and the present design could be scaled to arbitrary length. POPCOUNT and full-adder units of Basic processing unit (BPU) of XNORBIN are replaced with two busbars and the outputs of BPU XNOR ( voltage scaled appropriately ) are fed into the inputs of busbar. Since the busbar element length is 2, the input and size of the kernel is restricted to 2-bit vector and (2times 2) respectively for our proof-of-concept illustration. Consider the outputs of XNOR to be 0 and 1 in both the BPUs (BPU0 and BPU1). As shown in FIG. 6, the output current (I_{POP0}) ( POPCOUNT from BPU0 = 1 ) generated from two-terminal single-electron junctions 1 and 2 of first busbar is fed into current-to-voltage converter (concomitantly POP1 from BPU1 = 1 is also fed ) to generate the appropriate input voltages for the second bus bar. The POPCOUNT inputs are loaded onto the two-terminal single-electron junctions 3 and 4 of second busbar to get 1D convolution output of value 2. In principle, the complex digital adder circuits of XNORBIN could be replaced by a simple analog busbars to further accelerate the deep neural convolution nets.

For a methodical performance analysis of our convolution processing elements, a rigorous system-level performance is required accounting for interconnects, current-sensing amplifiers, current-controlled voltage sources and is beyond the scope of present work. Nevertheless, an attempt is made to reasonably compare the SET based computation and the digital baseline taking fundamental XNOR and Full-adder(FA) operations on a 2-bit 2 inputs as the relevant parameter and tabulated in Table 1. Here, the comparison is benchmarked against a fully digital circuit^{25} taking only the design that performed best in Power-Delay-Product(PDP) metrics. All the digital circuits have been designed on a 65 nm TSMC CMOS process technology node. The proposed full-swing XNOR gate consists of 7 transistors with the best PDP metric of 52.9 aJ. The FA, with the best PDP metric of 241.1 aJ, is a 22-transistor configuration that implements a well-known four-transistor 2-1-MUX structure. SET-based inverters^{26} are considered to drive inverted inputs into the SET-XAC unit and the worst-case performance metrics of the XAC is reported. The parameters of the remaining SETs are also assumed to have similar values as SET-inverters. The SET based unit shows 66% lesser transistor utilization, nearly-equivalent speed of XAC operation and an almost 4-order magnitude reduction in power dissipation at 1 GHz clock frequency.