# **SUB-PICO JOULE SWITCHING**

# HIGH-SPEED RELIABLE CMOS CIRCUITS ARE FEASIBLE

V. Beiu

College of Information Technology, United Arab Emirates University, Al Ain, UAE

J. Nyathi

Electrical Engineering and Computer Science, Washington State University, Pullman, USA

S. Aunet

Department of Informatics, University of Oslo, Norway

Emails: vbeiu@uaeu.ac.ae, jabu@eecs.wsu.edu, sa@ifi.uio.no

### **ABSTRACT**

The desire for high transistor densities and faster devices has resulted in aggressive CMOS scaling. The improved switching speeds have led to increased dynamic power dissipation, while the smaller and denser devices dissipate even more because of increased leakage currents. Besides this power dissipation challenge, the smaller CMOS devices are expected to have a much higher rate of soft/transient errors. In this paper we address these three issues (low power, performance, and reliability) together, and demonstrate that high-performance circuits can be operated reliably at (ultra) low switching energies. We show simulations supporting the claim that multi-GHz frequencies are sustainable at energy consumptions in the femto joule (per switch per transistor) range.

Keywords: Low-power, low voltage, subthreshold, performance/speed, optimal biasing, reliability.

## 1. INTRODUCTION

Device scaling has brought about very tight power budgets, while high performance (speed) remains a necessary metric to pursue. Scaling also permits the reduction of the power supply and device threshold voltages, but these lead to increased leakage currents.

The power dissipation of VLSI circuits is considered to consist of three components: dynamic, short circuit, and static. The dynamic power component has been well addressed over time and continues to be studied for future technologies. Techniques such as clock gating [1], variable power supply voltage [2], [3], and optimal power supply and threshold voltage scaling [4] have been devised, and variations of these are widely used to reduce the dynamic power component (well known examples are Transmeta's Crusoe processor and the Foxton technology for the Itanium chips). A number of logic styles have also been developed to combat the dynamic power component. Differential logic gates are one such family [5], and have seen further improvements such as charge recycling, which reduces power consumption while also shortening switching times (some of these are covered in [6]).

To limit the energy and power increase in future CMOS technology generations, the supply voltage ($V_{DD}$) will have to be (continually) scaled. Along with $V_{DD}$ scaling, the threshold voltage ($V_{th}$) of MOS devices will have to be scaled to sustain the traditional 30% gate delay reduction [7]. One of the challenges brought about by $V_{DD}$ and $V_{th}$ scaling is the rapid increase in subthreshold leakage. Should the present scaling trends continue, the subthreshold leakage power is expected to become a prominent part of the total dissipated power. Sakurai [8] depicts this well by showing how leakage power per device will continually increase with scaling, while dynamic power per device decreases. This realization has led to considerable efforts to manage leakage power [9]–[11]. The second component, the short circuit power, is normally estimated at about 10% of the dynamic power. Furthermore, if subthreshold power supply voltages are used, the short circuit power component could be eliminated, since both devices will never be on simultaneously.

Low power for scaled electronic circuits has mostly been pursued by reducing *static power*. This is widely achieved by eliminating the direct path from the power supply to the ground rail, using methods such as multiple threshold CMOS (MTCMOS) [11], variable threshold CMOS (VTCMOS) [12], [13], dynamic threshold CMOS (DTMOS) [14], etc.

## 2. SUB-PICO (FEMTO) JOULE DESIGN

VLSI designs have traditionally been classified as either high-performance (referring to high-end, high-speed designs) or low-power (referring to designs that operate at low power and low speeds). We expect that in the nanometer regime low power and medium to high speeds can be achieved simultaneously by systems designed using innovative techniques.

Attaining high speed with low *dynamic power* dissipation at any given technology node can be achieved by keeping the operating frequency $f_{CLK}$ constant and reducing the supply voltage $V_{DD}$, as can be deduced from the expression for *dynamic power*: $\alpha C_L V_{DD} V_{swing} f_{CLK}$. Reducing $V_{DD}$ lowers the dissipated power, but the accompanying decrease in speed limits how far it can be pushed. There is therefore an "ideal" $V_{DD}$ when trading performance for power. An approach for determining an optimal $V_{DD}$ is reported in [13], and shows optimal $V_{DD}$ values 3 to 4 times larger than $V_{th}$ (see also [15]). If one also considers *short circuit power* (neglected in the previous two analyses), an interesting $V_{DD}$ value is around $V_{tn} + V_{tp} \approx 2V_{th}$, which would in principle eliminate the short circuit power for the standard CMOS logic style.
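The dynamic power expression and the short circuit condition above can be sketched numerically. This is a minimal illustration, not part of the reported simulations: the function names and all numeric values (activity factor, load capacitance, clock frequency, threshold voltages) are hypothetical, and the swing is assumed to be rail-to-rail ($V_{swing} = V_{DD}$).

```python
def dynamic_power(alpha, c_load, v_dd, v_swing, f_clk):
    """Dynamic power in watts: alpha * C_L * V_DD * V_swing * f_CLK."""
    return alpha * c_load * v_dd * v_swing * f_clk


def short_circuit_path_possible(v_dd, v_tn, v_tp_abs):
    """True if both inverter transistors can be above threshold at once.

    With V_DD <= V_tn + |V_tp| no input voltage turns on the nMOS and the
    pMOS simultaneously, so the (above-threshold) short circuit current
    is in principle eliminated.
    """
    return v_dd > v_tn + v_tp_abs


# Hypothetical values, for illustration only.
ALPHA = 0.2      # activity factor
C_LOAD = 1e-15   # load capacitance: 1 fF
F_CLK = 1e9      # clock frequency: 1 GHz

# Rail-to-rail swing assumed, so V_swing = V_DD in both cases.
p_high = dynamic_power(ALPHA, C_LOAD, 1.8, 1.8, F_CLK)
p_low = dynamic_power(ALPHA, C_LOAD, 0.6, 0.6, F_CLK)

print(f"P_dyn at 1.8 V: {p_high * 1e6:.3f} uW")
print(f"P_dyn at 0.6 V: {p_low * 1e6:.3f} uW")
print(f"reduction factor: {p_high / p_low:.1f}")
```

With a rail-to-rail swing the dependence on $V_{DD}$ is quadratic, so a threefold supply reduction at constant $f_{CLK}$ gives a ninefold dynamic power reduction in this sketch.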

For testing such ideas, we have used five-stage ring oscillators. Table 1 summarizes our simulation results for two technology nodes in both standard CMOS and pseudo-nMOS. It has to be mentioned that pseudo-nMOS is a logic style that does not benefit from the scaling of  $V_{DD}$  to about  $V_{tn} + V_{tp}$ , as short circuit power is not eliminated in this case. Besides delays and average currents, we also report the power-delay-products (PDPs) and the energy-delay-products (EDPs).

|        | $W_p$ (nm) | $W_n$ (nm) | Delay (ns) | I (µA) | Power (µW) | PDP (fJ) | EDP (fJ·ns) |
|--------|------------|------------|------------|--------|------------|----------|-------------|
| **180 nm, $V_{DD}$ = 600 mV** | | | | | | | |
| CMOS   | 2340       | 900        | 3.84       | 3      | 1.8        | 6.91     | 26.53       |
| Pseudo | 1250       | 900        | 1.58       | 5      | 3.0        | 4.74     | 7.49        |
| **120 nm, $V_{DD}$ = 600 mV** | | | | | | | |
| CMOS   | 450        | 180        | 0.4        | 14.5   | 8.7        | 3.48     | 1.39        |
| Pseudo | 800        | 800        | 0.2        | 165    | 99         | 19.8     | 3.96        |

**Table 1:** PDPs and EDPs for five-stage ring oscillators operating at approximately  $V_{tn} + V_{tp}$ .

The results confirm that sub-pico (femto) joule PDPs are achievable at quite high speeds. In particular, at the 120 nm node, speeds as high as 5 GHz are attained (i.e., 40 ps per inverter, 0.2 ns / 5 inverters) for a PDP below 4 fJ per inverter (19.8 fJ / 5 inverters). Figure 1 shows the simulation results.
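The derived columns of Table 1 can be reproduced from the delay and current columns alone. The sketch below assumes that power is $V_{DD} \cdot I$ for the whole ring and, following the way the text quotes 5 GHz for a 0.2 ns delay, that the frequency is the reciprocal of the reported total delay; the function and key names are illustrative.

```python
V_DD = 0.6   # volts, as listed in Table 1
STAGES = 5   # inverters in the ring

# (node, style): (total delay in ns, average current in uA), from Table 1
TABLE1 = {
    ("180 nm", "CMOS"): (3.84, 3.0),
    ("180 nm", "Pseudo"): (1.58, 5.0),
    ("120 nm", "CMOS"): (0.4, 14.5),
    ("120 nm", "Pseudo"): (0.2, 165.0),
}


def metrics(delay_ns, current_ua, v_dd=V_DD, stages=STAGES):
    """Derive power, PDP and EDP from delay and average current."""
    power_uw = v_dd * current_ua      # V * uA = uW
    pdp_fj = power_uw * delay_ns      # uW * ns = fJ
    edp_fj_ns = pdp_fj * delay_ns     # fJ * ns
    return {
        "power_uw": power_uw,
        "pdp_fj": pdp_fj,
        "edp_fj_ns": edp_fj_ns,
        "pdp_per_inverter_fj": pdp_fj / stages,
        "delay_per_inverter_ps": delay_ns * 1000.0 / stages,
        "freq_ghz": 1.0 / delay_ns,   # convention used in the text
    }


for (node, style), (delay_ns, current_ua) in TABLE1.items():
    m = metrics(delay_ns, current_ua)
    print(f"{node} {style}: P = {m['power_uw']:.1f} uW, "
          f"PDP = {m['pdp_fj']:.2f} fJ, EDP = {m['edp_fj_ns']:.2f} fJ*ns")
```

For the 120 nm pseudo-nMOS row this gives 19.8 fJ in total, or 3.96 fJ per inverter, which is the "below 4 fJ" figure quoted above.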

## 3. SUB-FEMTO JOULE DESIGN (SUBTHRESHOLD)

The aggressive scaling of $V_{DD}$ has led to subthreshold designs, i.e., $V_{DD} < V_{th}$. Practical subthreshold designs (such as the one presented in [16]) allow for ultra low power, but normally run at (very) slow speeds. Such designs are suitable for portable (battery operated) systems. A few subthreshold circuits have been shown to achieve medium speeds [17], [18]. Besides the low speed, another problem with subthreshold designs is that they are sensitive to variations and noise (due to the ultra low $V_{DD}$), which raises concerns about their reliability. Recent publications show a push towards minimizing the leakage currents and gaining speed by using various forms of body biasing [19]–[21]. Reconfiguring the function of elementary gates by varying the body bias has also recently been proposed [22].



**Figure 1:** Simulation results for a five-stage pseudo-nMOS ring oscillator in 120 nm CMOS at  $V_{DD} = 600$  mV. The delay is 40 ps per pseudo-nMOS inverter, but the DC current is significant.

Performance issues with subthreshold voltages at current technology nodes are under active consideration, and our group has been studying some of them at technology nodes ranging from 250 nm down to 70 nm. These studies have covered standard CMOS, pseudo-nMOS, and output-wired-inverters [22]–[25] (see also [26]). Five-stage ring oscillator circuits have been used in this paper. Ring oscillators serve here simply to evaluate the different technology nodes; an extensive analysis of various styles of logic gates and data path circuits powered by subthreshold voltages is clearly needed to gain a proper sense of the power-delay(-reliability) tradeoffs. The ring oscillator simulations show that the delay decreases with each technology node, and we can only expect it to decrease further at smaller nodes (recent results for the 70 nm node are reported in [23]–[25]). Table 2 presents simulation results for three technology nodes and three circuit styles: standard CMOS, pseudo-nMOS, and pseudo-nMOS with swapped body biasing.

**Table 2:** PDPs and EDPs for five-stage ring oscillators operating in subthreshold ( $V_{DD} < V_{th}$ ).

|         | $W_p$ (nm) | $W_n$ (nm) | Delay (ns) | I (nA) | Power (nW) | PDP (fJ) | EDP (fJ·ns) |
|---------|------------|------------|------------|--------|------------|----------|-------------|
| **250 nm, $V_{DD}$ = 450 mV** | | | | | | | |
| CMOS    | 3900       | 1500       | 297        | 286    | 25.7       | 7.64     | 2269        |
| Pseudo  | 1250       | 1500       | 183        | 480    | 43.2       | 7.90     | 1445        |
| Swapped | 1500       | 1500       | 147        | 2800   | 252        | 36.9     | 5424        |
| **180 nm, $V_{DD}$ = 450 mV** | | | | | | | |
| CMOS    | 3375       | 1080       | 177        | 271    | 24.4       | 4.30     | 761         |
| Pseudo  | 900        | 1080       | 75.5       | 688    | 62         | 4.68     | 353         |
| Swapped | 1080       | 1080       | 62.4       | 3055   | 275        | 17.2     | 1073        |
| **120 nm, $V_{DD}$ = 300 mV** | | | | | | | |
| CMOS    | 450        | 780        | 4.30       | 2600   | 156        | 0.67     | 2.88        |
| Pseudo  | 450        | 780        | 2.50       | 5100   | 306        | 0.77     | 1.93        |
| Swapped | 450        | 780        | 2.40       | 5450   | 327        | 0.78     | 1.87        |

Swapping the body terminals refers to a configuration that ties the nMOS device's bulk terminal to the most positive voltage ( $V_{DD}$ ), while that of the pMOS is tied to the most negative one (GND). This approach was presented in [27], but only for standard CMOS designs, while pseudo-nMOS with swapped body bias seems able to achieve even higher speeds. The delay is reduced when using this approach, but at the 250 nm and 180 nm nodes the PDP increases roughly four- to five-fold. This shows that the PDP metric favors designs with higher delays, or equivalently those that operate at low frequencies. Table 2 also shows that standard CMOS ring oscillators have the lowest PDP at every technology node, although the margin over plain pseudo-nMOS is small. For standard CMOS, the delay improves by 40% from the 250 nm node to the 180 nm node, and by 98% to the 120 nm node. If this performance improvement were to hold true for the subsequent technology nodes, subthreshold operation would see a PDP improvement of about 90% for a similar change in technology node. It has to be mentioned that wire delays and quantum effects will become dominant, and when properly accounted for could introduce potential shortfalls for subthreshold design (these effects are yet to be investigated). Figure 2 shows some of the simulation results.
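The percentages quoted for Table 2 can be checked with a few lines of arithmetic. The sketch below is a reading aid with illustrative names; it assumes that the Power column is $V_{DD} \cdot I / 5$, i.e., power per inverter, which is an inference from the tabulated numbers rather than something stated for the table itself.

```python
STAGES = 5  # inverters in the ring

# node: (V_DD in V, {style: (delay in ns, current in nA)}), from Table 2
TABLE2 = {
    "250 nm": (0.45, {"CMOS": (297, 286), "Pseudo": (183, 480),
                      "Swapped": (147, 2800)}),
    "180 nm": (0.45, {"CMOS": (177, 271), "Pseudo": (75.5, 688),
                      "Swapped": (62.4, 3055)}),
    "120 nm": (0.30, {"CMOS": (4.30, 2600), "Pseudo": (2.50, 5100),
                      "Swapped": (2.40, 5450)}),
}


def power_per_inverter_nw(v_dd, current_na):
    """Assumed convention: total power V_DD * I split over the stages."""
    return v_dd * current_na / STAGES


def pdp_fj(node, style):
    """PDP in fJ for one Table 2 entry (nW * ns = aJ, hence the / 1000)."""
    v_dd, styles = TABLE2[node]
    delay_ns, current_na = styles[style]
    return power_per_inverter_nw(v_dd, current_na) * delay_ns / 1000.0


def improvement_pct(old, new):
    """Relative reduction from old to new, in percent."""
    return 100.0 * (old - new) / old


delay_250 = TABLE2["250 nm"][1]["CMOS"][0]
delay_180 = TABLE2["180 nm"][1]["CMOS"][0]
delay_120 = TABLE2["120 nm"][1]["CMOS"][0]

print(f"delay 250 -> 180 nm: {improvement_pct(delay_250, delay_180):.1f}%")
print(f"delay 250 -> 120 nm: {improvement_pct(delay_250, delay_120):.1f}%")
print(f"PDP 250 -> 120 nm: "
      f"{improvement_pct(pdp_fj('250 nm', 'CMOS'), pdp_fj('120 nm', 'CMOS')):.1f}%")
print(f"swapped/pseudo PDP ratio at 250 nm: "
      f"{pdp_fj('250 nm', 'Swapped') / pdp_fj('250 nm', 'Pseudo'):.2f}")
```

Under the stated assumption, the recomputed PDPs agree with the tabulated ones to within rounding.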



**Figure 2:** Simulation results for a five-stage standard CMOS ring oscillator in 120 nm operating in subthreshold ( $V_{DD} = 300 \text{ mV}$ ) show a total delay of 2.96 ns (*i.e.*, 600 ps per inverter). The swapped body bias solution achieves 2.14 ns (*i.e.*, 430 ps per inverter).

All the simulations for the different five-stage ring oscillators in 250 nm, 180 nm, and 120 nm can be seen in a compact form in Figure 3. The delay and the power consumption are reported *per inverter*, and correspond to an activity factor $\alpha = 20\%$ (as there are 5 inverters in a ring). The three oblique lines represent constant PDPs of 1 fJ, 0.1 fJ, and 0.01 fJ.



**Figure 3:** The delays and the power consumptions per inverter:  $\Delta = 180$  nm standard CMOS;  $\Box = 180$  nm pseudo-nMOS; \* = 120 nm standard CMOS; o = 120 nm pseudo-nMOS.
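A short worked relation, added as a reading aid, for the constant-PDP lines of Figure 3. It assumes the figure plots power against delay on logarithmic axes, which the description of the lines as oblique suggests but the text does not state; $P$ and $t_d$ denote the per-inverter power and delay.

$$
\mathrm{PDP} = P \cdot t_d \quad\Longrightarrow\quad \log P = \log \mathrm{PDP} - \log t_d
$$

Under that assumption each constant-PDP line has slope $-1$, and a tenfold reduction in PDP shifts the line down by one decade. For example, a point at $t_d = 1$ ns on the 1 fJ line corresponds to $P = 1\ \mu\mathrm{W}$, since $1\ \mu\mathrm{W} \times 1\ \mathrm{ns} = 1\ \mathrm{fJ}$.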

Transistor sizing presents a challenge when currents through the nMOS and pMOS devices have to be matched. Current trends in standard CMOS are such that making the ratio of the pMOS to nMOS equal to  $\beta$  results in equal drive capability. This does not hold true in subthreshold CMOS—especially when also considering higher performances and robustness—and calls for a deep understanding of the behavior of the devices as they constantly operate with voltages below their threshold voltages. Methods or algorithms to perform device sizing are only now starting to be considered [28], [29].

Figure 4 has plots that show how inverter outputs (designed for symmetric switching) vary with temperature. The first plot shows how the output signal disperses as temperature varies from 27°C to 127°C for an inverter using the "standard" transistor sizing approach (*i.e.*, the pMOS being about  $\beta$  times larger than the nMOS). The second instance shows a much more robust inverter achieved through a more tedious transistor sizing, but which clearly proves to be a much better fit for the subthreshold regime [29]. The two transistors are excited separately by two slow ramps going in opposite directions:  $V_{inn}$  and  $V_{inp}$  (see Fig. 4), and the output is plotted versus temperature. For symmetric switching, *i.e.*, equal rise and fall times (which also reduces power and increases speed), the two ramps and the output should intersect at  $V_{in} = V_{out} = V_{DD}/2$ .





**Figure 4:** The influence of sizing (for symmetric switching) on environmental (temperature) sensitivity variation. The upper drawing shows the simulations for a classically sized standard CMOS inverter (*i.e.*, pMOS is  $\beta$  times larger), while the bottom one has a larger pMOS.

Device behavior in subthreshold might not follow a linear trend (in terms of leakage currents and drive capability) with scaling, and we are still experimenting with these devices at the 70 nm technology node in order to provide a more comprehensive analysis. One way of dealing with such problems is to use body biasing. Classical adaptive body biasing techniques [19]–[22] require that the fourth terminals of the n- and p-devices be tied to: (i) the input signal; (ii) another transistor; or (iii) another (small) inverter. They tend to reduce speed due to the larger capacitances.

Table 3 lists results for a five-stage standard CMOS ring oscillator with appropriately sized devices (i.e., pMOS is $\beta$ times larger). The effects of body biasing on delay have been explored with the objective of pinpointing the bias combination for the p- and n-wells that produces optimal PDP and EDP operating points in subthreshold. The same approach has just recently been advocated for normal $V_{DD}$ operation [30]. Varying the substrate bias of both the p- and the n-type devices shows that tying the bulk of the n-type devices to 300 mV and that of the p-type devices to 50 mV produces the lowest EDP. It follows that by dynamically changing the body bias the circuit can be optimized (e.g., run faster and/or dissipate less).

**Table 3:** Power, PDPs and EDPs for a five-stage standard CMOS ring oscillator ( $W_n = 1080 \text{ nm}$ ,  $W_p = 3375 \text{ nm}$ ) in 180 nm at  $V_{DD} = 350 \text{ mV}$  and at different body biasing voltages.

| Optimal body bias | $V_n$ (mV) | $V_p$ (mV) | Delay (ns) | I (nA) | Power (nW) | PDP (fJ) | EDP (fJ·ns) |
|-------------------|------------|------------|------------|--------|------------|----------|-------------|
| PDP               | 0          | 350        | 1260       | 11.94  | 4.18       | 5.27     | 6634.6      |
|                   | 50         | 300        | 899.4      | 16.89  | 5.91       | 5.32     | 4782.0      |
|                   | 100        | 250        | 659.6      | 23.14  | 8.10       | 5.34     | 3523.4      |
|                   | 150        | 200        | 496.1      | 30.99  | 10.85      | 5.38     | 2669.1      |
|                   | 200        | 150        | 381.5      | 40.66  | 14.23      | 5.43     | 2071.3      |
|                   | 250        | 100        | 299.6      | 52.92  | 18.52      | 5.55     | 1663.1      |
| EDP               | 300        | 50         | 238.5      | 72.10  | 25.23      | 6.02     | 1436.0      |
|                   | 350        | 0          | 191.4      | 126.5  | 44.27      | 8.48     | 1622.5      |
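The selection of the PDP-optimal and EDP-optimal bias points in Table 3 amounts to a minimum search over the sweep. The sketch below recomputes power, PDP and EDP from the tabulated delay and current, assuming power is $V_{DD} \cdot I$ for the whole ring; the function and key names are illustrative and not taken from the simulation setup.

```python
V_DD = 0.35  # volts, as stated in the Table 3 caption

# (V_n in mV, V_p in mV, delay in ns, current in nA), from Table 3
SWEEP = [
    (0, 350, 1260.0, 11.94),
    (50, 300, 899.4, 16.89),
    (100, 250, 659.6, 23.14),
    (150, 200, 496.1, 30.99),
    (200, 150, 381.5, 40.66),
    (250, 100, 299.6, 52.92),
    (300, 50, 238.5, 72.10),
    (350, 0, 191.4, 126.5),
]


def evaluate(v_n_mv, v_p_mv, delay_ns, current_na, v_dd=V_DD):
    """Return the derived metrics for one body-bias setting."""
    power_nw = v_dd * current_na           # V * nA = nW
    pdp_fj = power_nw * delay_ns / 1000.0  # nW * ns = aJ, converted to fJ
    edp_fj_ns = pdp_fj * delay_ns
    return {
        "v_n_mv": v_n_mv,
        "v_p_mv": v_p_mv,
        "power_nw": power_nw,
        "pdp_fj": pdp_fj,
        "edp_fj_ns": edp_fj_ns,
    }


results = [evaluate(*row) for row in SWEEP]
best_pdp = min(results, key=lambda r: r["pdp_fj"])
best_edp = min(results, key=lambda r: r["edp_fj_ns"])

print(f"lowest PDP: V_n = {best_pdp['v_n_mv']} mV, V_p = {best_pdp['v_p_mv']} mV, "
      f"PDP = {best_pdp['pdp_fj']:.2f} fJ")
print(f"lowest EDP: V_n = {best_edp['v_n_mv']} mV, V_p = {best_edp['v_p_mv']} mV, "
      f"EDP = {best_edp['edp_fj_ns']:.1f} fJ*ns")
```

The search picks the same two rows that the table marks as PDP-optimal and EDP-optimal, which illustrates that the two metrics call for different bias points.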

## 4. BRIDGING THE SPEED GAP (BETWEEN FEMTO AND SUB-FEMTO)

The simulations presented above are a proof of concept, since ring oscillators do not quite encompass the actual data path (critical) delays (unless serial and/or systolic architectures become the norm [24], [25]). They do, however, show the sub-femto joule switching energy achieved when operating in subthreshold. One of our goals is to bridge the speed gap between operation above and below threshold (*i.e.*, between the sub-pico/femto and the sub-femto designs), while retaining the low-power edge of the subthreshold region. It is customary to consider that these are conflicting metrics.

In fact, what we are attempting is to improve the delays of circuits operating in subthreshold and, simultaneously, to reduce the power of the same circuits when operated above threshold. To this end, simulations at the 250, 180 and 120 nm nodes have been performed employing an enabling technique that strives to keep the capacitive load at a minimum. The inverter devices (nMOS and pMOS transistors) were kept at minimum size while their bulk voltages were varied in sync with $V_{DD}$. For optimally chosen bulk bias voltages, this technique achieves optimal operation (i.e., the fastest speed and/or the lowest power) both above and below threshold, without the need to resize the transistors. Besides bridging the two regions in a simple way, this technique reduces power and increases speed (as minimum sized transistors lead to minimum capacitances). Using the bulk voltages as knobs allows the bulk terminals to be driven by voltages that provide increased leakage currents and thus improved speed.

Five-stage ring oscillators using the above mentioned approach have been simulated, and the results show an excellent compromise between speed and energy. To give the approach credence, we have designed and simulated standard CMOS ring oscillators that have the $W_p/W_n$ ratios optimized for symmetric switching (for different $V_{DD}$ values). These results have been used for comparison purposes. Preliminary results on the idea of maintaining minimum sized devices and optimally controlling the bulk terminals are extremely promising. They show not only speed improvements, but also significant power reductions. Figure 5 shows results at the 180 nm technology node only. [Remark: There has not been any aggressive effort to optimize this novel enabling approach, since our goal was mainly to make a preliminary evaluation of its potential.] The results we have obtained show gains for all of the metrics.



**Figure 5:** Performance comparisons of five-stage ring oscillators at the 180 nm node: optimally sized devices (dotted lines) and minimum sized devices with optimal control voltages for the bulk terminals (continuous lines). The horizontal axes represent  $V_{DD}$  (from about  $V_{th}/2$  to about  $3V_{th}$ ).

Finally, the performance gains (*i.e.*, how many times our approach is better than an optimally sized solution at the same $V_{DD}$) are presented in a compact form in Fig. 6. At about $V_{DD} = V_{th}$, switching delays are improved by over 6 times, PDP by over 20 times, and EDP by over 60 times. Similar simulations have been performed at the 250 nm and 120 nm technology nodes, and the results obtained by keeping the devices at their minimum sizes while optimally changing the bulk terminal voltages show similar trends.



**Figure 6:** Performance gain factors. Controlling the bulk terminal voltages leads to improved speeds, as well as reduced P, PDP, and EDP (when compared to optimized standard CMOS circuit designs).

## 5. ENHANCING RELIABILITY

The design approach we have proposed would foster (ultra) low power and medium to high performance. Still, the fact that we rely on subthreshold operation raises reliability concerns. To achieve high defect/fault-tolerance, an original enhanced von Neumann multiplexing technique has recently been proposed [31]. It allows for very small redundancy factors, and can be easily integrated with novel architectural concepts [32]. The technique [33] uses majority (MAJ) gates, which can be implemented in many different ways, *e.g.*, standard CMOS, pseudo-nMOS, output-wired-inverters, or by simply short-circuiting the outputs [23]–[25], [33]. The approach we have presented in this paper integrates seamlessly with such techniques. Future work should further explore reliability issues when very few circuits/devices placed in parallel are used.

### CONCLUDING REMARKS

In this paper we have analyzed and compared several schemes for attaining low power and medium to high speeds while optimally operating both above and below threshold. Work in progress has only examined inverters, ring oscillators and NAND gates, and we have reported on simulations of some of these circuits (particularly ring oscillators) at different technology nodes. It must be noted that ring oscillators fall short in exposing design issues, unless serial and/or systolic architectures are used.

Ring oscillators in 120 nm CMOS, operating at voltages approximately double the threshold voltage of the devices, can achieve very high speeds (up to 5 GHz) with PDPs in the femto joule range (as low as 4 fJ per inverter) if pseudo-nMOS inverters are used. Given the fact that increasing speed translates into an increase in power as technology scales, we have shown that power supply voltages can be scaled down to below threshold voltages, while reasonable speeds can still be attained by a judicious selection of the logic style in conjunction with body biasing schemes. As an example, in subthreshold, the same five-stage ring oscillator runs at 0.48 GHz achieving a PDP of less than 0.2 fJ per inverter (in this case swapped body biasing has been used to boost the speed of the pseudo-nMOS inverters).

We have also introduced an enabling scheme for bridging the speed gap between operation above threshold (strong inversion) and below threshold (subthreshold). It improves the delay of the circuit when operated in subthreshold, and reduces the power consumption when the circuit is operated above threshold. The method relies on minimally sized transistors and uses the bulk voltages for optimizing the functioning of the circuit. By varying the bulk voltages together with the power supply voltage, the circuit can operate optimally at any power supply voltage (*i.e.*, both above and below threshold). The reported results are for standard CMOS only and are preliminary, but significant in that they provide insights into the potential of the method, which shows gains for all the metrics: speed, power, power-delay-product, and energy-delay-product.

Finally, another method makes it possible to lessen the effect of noise due to variations (environmental and process), and can increase the overall reliability by using redundancy (at the device and/or circuit level).

### **ACKNOWLEDGEMENT**

The authors would like to thank B. Oelmann from the Electronics Design Division, Mid Sweden University, for helping with some of the very early simulations in 180 nm.

## **REFERENCES**

- [1] M. Donno, A. Ivaldi, L. Benini, and E. Macii, "Clock-Tree Power Optimization Based on RTL Clock-Gating," *Proc. Design Autom. Conf.*, Jun. 2003, pp. 622-627.
- [2] H. Li, C.-Y. Cher, T. N. Vijaykumar, and K. Roy, "VSV: L2-Miss-Driven Variable Supply-Voltage Scaling for Low Power," *Proc. Intl. Symp. Microarch.*, Dec. 2003, pp. 19-28.
- [3] T. Kuroda, K. Suzuki, S. Mita, T. Fujita, F. Yamane, F. Sano, A. Chiba, Y. Watanabe, K. Matsuda, T. Maeda, T. Sakurai, and T. Furuyama, "Variable Supply-Voltage Scheme for Low-Power High-Speed CMOS Digital Design," *IEEE J. Solid-State Circ.*, Vol. 33, No. 3, Mar. 1998, pp. 454-462.
- [4] A. P. Chandrakasan, "Voltage Reduction Techniques for Portable Systems," *Proc. Intl. ASIC Conf.*, Sep. 1997, pp. 3-6.
- [5] V. Beiu, "Low Power Differential Conductance-Based Logic Gate and Method of Operation Thereof," *US Patent 6580296*, 17 Jun. 2003.
- [6] J. Nyathi, V. Beiu, S. Tatapudi, and D. Betowski, "A Charge Recycling Differential Noise Immune Perceptron," *Proc. Intl. Joint Conf. Neural Networks*, Jul. 2004, pp. 1995-2000.
- [7] The International Technology Roadmap for Semiconductors, 2004. Available: http://public.itrs.net/
- [8] T. Sakurai, "Perspectives of Low-Power VLSI's," *IEICE Trans. Electr.*, Vol. E87-C, No. 4, Apr. 2004, pp. 429-436 [see also "Perspectives on Power-Aware Electronics," *Proc. Intl. Solid-State Circ. Conf.*, Vol. 1, Feb. 2003, pp. 26-29].
- [9] S. M. S. Kang, "Elements of Low Power Design for Integrated Systems," *Proc. Intl. Symp. Low Power Electr. Design*, Aug. 2003, pp. 205-210.
- [10] T. Kuroda, T. Fujita, F. Hatori, and T. Sakurai, "Variable Threshold-Voltage CMOS Technology," *IEICE Trans. Electr.*, Vol. E83-C, No. 11, Nov. 2000, pp. 1705-1715.
- [11] S. Shigematsu, T. Hatano, Y. Tanabe, and S. Mutoh, "Low-Power High-Speed 1-V LSI Using a 0.25-µm MTCMOS/SIMOX Technique," *Proc. Intl. ASIC Conf.*, Sep. 1998, pp. 103-107.

- [12] T. Inukai, T. Hiramoto, and T. Sakurai, "Variable Threshold Voltage CMOS (VTCMOS) in Series Connected Circuits," *Proc. Intl. Symp. Low Power Electr. Design*, Aug. 2001, pp. 201-206.
- [13] T. Kuroda, T. Fujita, S. Mita, T. Nagamatsu, S. Yoshioka, K. Suzuki, F. Sano, M. Norishima, M. Murota, M. Kako, M. Kinugawa, M. Kakumu, and T. Sakurai, "A 0.9-V, 150-MHz, 10-mW, 4mm<sup>2</sup>, 2-D Discrete Cosine Transform Core Processor with Variable Threshold-Voltage (VT) Scheme," *IEEE J. Solid-State Circ.*, Vol. 31, No. 11, Nov. 1996, pp. 1770-1779.
- [14] F. Assaderaghi, "DTMOS: Its Derivatives and Variations, and Their Potential Applications," *Proc. Intl. Conf. Microelectr.*, Nov. 2000, pp. 9-10.
- [15] M. R. Stan, "Low-Power CMOS with Subvolt Supply Voltages," *IEEE Trans. VLSI Sys.*, Vol. 9, No. 2, Apr. 2001, pp. 394-400.
- [16] C. H. Kim, H. Soeleman, and K. Roy, "Ultra-Low-Power DLMS Adaptive Filter for Hearing Aid Applications," *IEEE Trans. VLSI Sys.*, Vol. 11, No. 6, Dec. 2003, pp. 1058-1067.
- [17] A. Shibata, T. Matsuoka, S. Kakimoto, H. Kotaki, M. Nakano, K. Adachi, K. Ohta, and N. Hashizume, "Ultra Low Power Supply Voltage (0.3 V) Operation with Extreme High Speed Using Bulk Dynamic Threshold Voltage MOSFET (B-DTMOS) with Advanced Fast-Signal-Transmission Shallow Well," *Proc. Symp. VLSI Tech.*, Jun. 1998, pp. 76-77.
- [18] V. Svilan, M. Matsui, and J. B. Burr, "Energy-Efficient 32 x 32-bit Multiplier in Tunable Near-Zero Threshold CMOS," *Proc. Intl. Symp. Low Power Electr. Design*, Jul. 2000, pp. 268-272.
- [19] K. Ishibashi, T. Yamashita, Y. Arima, I. Minematsu, and T. Fujimoto, "A 9μW 50MHz 32b Adder Using a Self-Adjusted Forward Body Bias in SoCs," *Proc. Intl. Solid-State Circ. Conf.*, Feb. 2003, pp. 116-118.
- [20] T. Kawahara, M. Horiguchi, Y. Kawajiri, G. Kitsukawa, T. Kure, and M. Aoki, "Subthreshold Current Reduction for Decoded-Driver by Self-Reverse Biasing," *IEEE J. Solid-State Circ.*, Vol. 28, No. 11, Nov. 1993, pp. 1136-1144.
- [21] C. H. Kim, J.-J. Kim, S. Mukhopadhyay, and K. Roy, "A Forward Body-Bias Low-Leakage SRAM Cache: Device and Architecture Considerations," *Proc. Intl. Symp. Low Power Electr. Design*, Aug. 2003, pp. 6-9.
- [22] S. Aunet, B. Oelmann, S. Abdalla, and Y. Berg, "Reconfigurable Subthreshold CMOS Perceptron," *Proc. Intl. Joint Conf. Neural Networks*, Jul. 2004, pp. 1983-1988.
- [23] V. Beiu, S. Aunet, R. R. Rydberg III, A. Djupdal, and J. Nyathi, "The Vanishing Majority Gate: Trading Power and Speed for Reliability," *Proc. Intl. Work. Design & Test Defect-Tolerant Nanoscale Arch.*, May 2005. Available: http://www.eecs.wsu.edu/~vbeiu/Publications/2005%20NanoArch.pdf
- [24] V. Beiu, A. Djupdal, and S. Aunet, "Ultra Low Power Neural Inspired Addition: When Serial Might Outperform Parallel Architectures," *Intl. Work-conf. Artif. Neural Networks*, Jun. 2005, pp. 486-493.
- [25] V. Beiu, S. Aunet, J. Nyathi, R. R. Rydberg III, and A. Djupdal, "On the Advantages of Serial Architectures for Low-Power Reliable Computations," *Proc. Intl. Conf. App.-specific Sys. Arch. & Proc.*, Jul. 2005, pp. 276-281.
- [26] V. Beiu, and U. Rückert, "Emerging Brain Inspired Nano Architectures," *World Scientific Press*, accepted (in progress), 2006.
- [27] S. Narendra, J. Tschanz, J. Hofsheier, B. Bloechel, S. Vangal, Y. Hoskote, S. Tang, D. Somasekhar, A. Keshavarzi, V. Erraguntla, G. Dermer, N. Borkar, S. Borkar, and V. De, "Ultra-Low Voltage Circuits and Processor in 180nm to 90nm Technologies with a Swapped-Body Biasing Technique," *Proc. Intl. Solid-State Circ. Conf.*, Feb. 2004, Vol. 1, pp. 156-157, pp. 511-518.
- [28] B. H. Calhoun, A. Wang, and A. Chandrakasan, "Device Sizing for Minimum Energy Operation in Subthreshold Circuits," *Proc. Custom IC Conf.*, Oct. 2004, pp. 95-98.
- [29] J. Nyathi, V. Beiu, and S. Aunet, "Femto Joule Switching: Review of Low Energy Design Approaches for the Nano Era," *Nano and Giga Challenges in Microelectronics*, Krakow, Poland, Sep. 2004, invited presentation.
- [30] Q.-W. Kuo, V. Sharma, and C. C.-P. Chen, "Substrate-Bias Optimized 0.18um 2.5GHz 32-bit Adder with Post-Manufacture Tunable Clock," *Proc. Intl. Symp. VLSI Design, Autom. & Test*, Hsinchu, Taiwan, Apr. 2005.
- [31] S. Roy, and V. Beiu, "Majority Multiplexing—Economical Redundant Fault-Tolerant Designs for Nanoarchitectures," *IEEE Trans. Nanotech.*, Vol. 4, No. 4, Jul. 2005, pp. 441-451 [short version in *Proc. Intl. Conf. Nanotech.*, Aug. 2004, pp. 589-592].
- [32] V. Beiu, "A Novel Highly Reliable Low Power Architecture: When von Neumann Augments Kolmogorov," *Proc. Intl. Conf. App.-specific Sys. Arch. & Proc.*, Sep. 2004, pp. 167-177.
- [33] S. Aunet, and M. Hartmann, "Real-time Reconfigurable Linear Threshold Elements and Some Applications to Neural Hardware," *Proc. Intl. Conf. Evolvable Sys.*, Mar. 2003, pp. 365-376.