Article Open Access

Bit slicing approaches for variability aware ReRAM CIM macros

Christopher Bengel, Leon Dixius, Rainer Waser, Dirk J. Wouters and Stephan Menzel
Published/Copyright: April 21, 2023

Abstract

Computation-in-Memory accelerators based on resistive switching devices represent a promising approach to realizing future information processing systems. These architectures promise orders of magnitude lower energy consumption for certain tasks, while also achieving higher throughput than other special purpose hardware such as GPUs, due to their analog nature of computation. Due to device variability issues, however, a single resistive switching cell usually does not achieve the resolution required for the considered applications. To overcome this challenge, many of the proposed architectures use an approach called bit slicing, where multiple low-resolution components are combined to realize higher resolution blocks. In this paper, we present an analog accelerator architecture on the circuit level, which can be used to perform Vector-Matrix-Multiplications or Matrix-Matrix-Multiplications. The architecture consists of the 1T1R crossbar array, the optimized select circuitry and an ADC. The components are designed to handle the variability of the resistive switching cells, which is captured by our experimentally verified, physics-based compact model. We then use this architecture to compare different bit slicing approaches and discuss their tradeoffs.

1 Introduction

The dominance and commercial success of machine learning algorithms for the processing of images, speech and video signals [1] over the last ten years has largely been enabled by the utilization of Graphics Processing Units (GPUs) during training [1, 2]. These successes have further led to the development of Application Specific Integrated Circuits (ASICs) specifically targeted at machine learning workloads. Examples of such chips are the Tensor Processing Units from Google [3] or Hanguang from Alibaba [4], which further improve the efficiency of the hardware for machine learning algorithms. These performance benefits have come at the cost of exponentially increasing energy consumption for training and inference [5]. During the training phase of a machine learning algorithm, the parameters of a computational model, such as a multilayer neural network, are adapted to produce distinguishable mappings of different training inputs to output categories. The goal of the training phase is to enable the network to independently match unseen inputs to the trained output categories during the inference phase with as high an accuracy as possible.

Accelerators for machine learning algorithms can also be realized based on memristive devices such as filamentary Valence Change Mechanism (VCM) cells [6], [7], [8]. VCM cells are two-terminal, non-volatile, redox-based resistive switching random access memories, in which the resistive switching is based on the movement of oxygen vacancies. Filamentary switching VCM cells consist of an electronically active electrode (AE), a mixed ionic-electronic conducting layer and an ohmic counter electrode. The AE is characterized by a high work function barrier from the metal to the oxide, while the ohmic electrode forms a low work function interface. The height of the work function barrier corresponds to the difficulty of transporting electrons across the interface. The oxygen vacancies are moved near the AE via an electric field. If they accumulate at the AE interface and form a filament, the resistance of the cell is reduced. If they are removed from the AE interface, the filament is pushed back and the cell resistance increases [6, 9]. These devices are also called memristive devices [10]. They can be switched between at least one high resistive state (HRS) and one low resistive state (LRS), although in the context of machine learning the equivalent conductances are often used: the low conductive state (LCS) for the HRS and the high conductive state (HCS) for the LRS. Using conductances has the advantage that they are directly proportional to the current values. The transition from an LCS towards an HCS is called a SET operation; the opposite direction is termed a RESET operation. VCM cells can also switch in a more gradual fashion in the RESET/SET direction by controlling the maximum voltage/current. For machine learning accelerator applications, VCM devices are often integrated together with N-type metal-oxide-semiconductor (NMOS) transistors in 1T1R crossbar arrays to prevent sneak path currents [11] and to allow for precise programming of the devices without program disturb of half-selected devices [12]. Recently, major semiconductor companies such as TSMC [13], Intel [14] and Infineon [15] have demonstrated the feasibility of realizing large-scale, reliable memories based on filamentary VCM systems.
One of the main selling points of memristive devices is their ability to unite computation and memory in a single physical location, alleviating the limitations of von Neumann architectures [16]. Because of this advantage, many architectures have been proposed that enable Computation-in-Memory (CIM) for the acceleration of machine learning algorithms. These architectures often focus on the central Vector-Matrix-Multiplication (VMM) or Matrix-Matrix-Multiplication (MMM) operations to improve the energy efficiency of the inference phase of neural networks [7, 8, 17, 18]. It should be noted that VMM and MMM operations are not only required for machine learning, but also, for example, in scientific computing [19, 20]. System level analyses have already shown the benefits of memristive matrix operation accelerators, in particular regarding energy consumption, with improvements ranging from 10× to 1000× [17, 21, 22]. In this work, we propose and investigate such a VCM-based accelerator at the circuit level. The required theoretical background as well as the different bit slicing techniques are introduced in Section 2. In Section 3, the designed circuits are discussed in more detail, the bit slicing approaches are compared and their variability tolerance is discussed. Section 4 summarizes the paper and gives an outlook on future work.

2 Simulation setup and architecture

Figure 1 shows the block level architecture, organized around the 1T1R crossbar array. The input vectors are applied serially to a shift register and then in parallel to the 1T1R array via the Wordline (WL) drivers. The Sourcelines (SL) are connected to the read drivers or RESET drivers via the SL select circuits. Similarly, the SET drivers and the ADC stage are connected to the Bitlines (BL) via the BL select circuit. For the simulation of the VCM devices we used the physically motivated compact model JART VCM v1b Readvar [23], [24], [25], [26], which can describe the programming variability between different devices (device-to-device variability), between different cycles in a single device (cycle-to-cycle variability) and between different read operations. Additionally, it can describe different types of filamentary VCM devices such as ZrOx [24, 26], HfOx [27, 28], TaOx [29] and SrTiO3 [30]. The parameter set we are using was fitted to ZrOx devices to describe the reliability properties relating to read disturb, read noise and programming variability for binary VMM [24]. We use the parameter set shown in Table 1 of [24] for the deterministic and variability parameters. Programming variability describes the variation of the conductance states during switching. It can be reduced via program verify algorithms [31]. Read disturb and read noise affect the programmed conductance levels over time. Read noise is an undirected and random process [26], while read disturb depends on the voltage polarity and amplitude. For the ZrOx devices we have shown that read disturb does not represent a problem when reading in the RESET direction and keeping the read voltage below 500 mV [24]. This recommendation is followed here. The transistors in the arrays and all other circuits are modeled using a commercially available 180 nm CMOS technology.
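As a point of reference for the variability terms used above, the following Python sketch illustrates, in a greatly simplified way and not using the JART VCM v1b Readvar model itself, how programming variability and read noise can be emulated as relative spreads around target conductances. All spread values and function names are illustrative assumptions; in the actual simulations these effects are produced by the physical compact model and the transistor-level circuits.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Toy stand-in for a variability-aware cell model (NOT the JART VCM v1b Readvar
# model): programmed conductances scatter around their target (programming
# variability) and each read adds a small relative perturbation (read noise).
G_HCS, G_LCS = 215e-6, 10.4e-6   # approximate 1T1R conductances in S (Section 3.2)

def program_cells(targets, rel_sigma=0.05):
    """Return programmed conductances with an assumed device-to-device spread."""
    targets = np.asarray(targets)
    return targets * rng.normal(1.0, rel_sigma, size=targets.shape)

def read_column(g_cells, inputs, v_read=0.2, rel_read_noise=0.01):
    """Bitline current of one crossbar column for a binary input vector."""
    g_noisy = g_cells * rng.normal(1.0, rel_read_noise, size=g_cells.shape)
    return v_read * np.sum(np.asarray(inputs) * g_noisy)

g = program_cells([G_HCS if b else G_LCS for b in (1, 0, 0, 1)])
print(read_column(g, inputs=(1, 1, 1, 1)))   # analog dot-product current in A
```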

Figure 1: Block level analog VCM architecture. Bit slicing is investigated in the N·M 1T1R crossbar array (blue rectangle) or in the ADC stage (red rectangle).

Table 1:

Configuration of BL/SL select circuits for the different operations in the selected column.

| Signal name | Write      | Read | VMM |
|-------------|------------|------|-----|
| WRITE       | 1          | 0    | 1   |
| READ        | 0          | 1    | 0   |
| MAC CONTROL | 0          | 0    | 1   |
| SET/RESET   | 1/0 or 0/1 | 0/0  | 0/0 |
| MAC SELECT  | 0          | 0    | 1   |
| SELECT      | 1          | 1    | 1   |

To perform a VMM operation, a row of the matrix is mapped to a column of the crossbar. In the case of only binary weights, this means that ‘0’ is represented by the LCS and ‘1’ is represented by the HCS. In the case of multilevel weights, proportionally mapping the matrix values using equidistant conductance spacing is the easiest mapping strategy. The crossbar is enabled to perform certain computational operations via the periphery. To perform VMMs, the periphery consists of SET and RESET drivers for the programming, digital to analog converters (DACs) to apply the input vectors and analog to digital converters (ADCs) to convert the analog result of the VMM into a digital representation. While the DACs are often realized as buffers that can only apply binary voltages to the crossbar (0 V or VDD = 5 V), the ADCs are more complex and therefore occupy a significant part of the total accelerator area and consume a lot of energy. For example, in [8] the ADCs consumed 58 % of the computing tile's power and occupied 31 % of the total area, and in [32] they required more than 75 % of both energy and power. Consequently, the optimization of ADC designs is a major focus, with a wide range of concepts being explored [8, 33, 34]. Due to the high precision requirements of VMMs for the training of deep neural networks (DNNs) or for scientific computing applications, a technique called bit slicing is often used in memristive hardware accelerators. Bit slicing combines multiple lower resolution components to realize higher resolution blocks. During each evaluation cycle, a dot product operation is computed in our proposed architecture. For a complete VMM or MMM, these dot product results have to be shifted to the correct bit positions and added via digital CMOS circuits [20, 35]. For the ADC to be able to convert all possible dot product results of the VMM to digital signals without any loss of information, it needs a minimum resolution of [8]:

(1) $B_{\mathrm{ADC}} = \begin{cases} B_W + B_{in} + \log_2(N), & \text{if } B_W > 1,\ B_{in} > 1 \\ B_W + B_{in} + \log_2(N) - 1, & \text{otherwise.} \end{cases}$

In (1), B_W is the number of bits per weight, B_in is the number of bits per input and N is the number of rows in the crossbar being read out. Bit slicing can be applied to the weights (weight slicing), the input voltages (input slicing) and the number of rows that are evaluated by the ADC (ADC slicing). Input slicing can be performed by breaking down a high-resolution input voltage into multiple equal pulses applied one after another at 1-Bit resolution (Pulse Frequency Modulation). Another approach for input slicing would be Pulse Width Modulation, where the proportion of ‘On’ time to the total pulse period is varied. In our architecture, we only consider the 1-Bit input case. Weight slicing can be realized in two different ways, either directly in the array or in the peripheral circuitry. In the direct approach, a multibit matrix value is mapped to different numbers of columns, depending on the resolution of the 1T1R cells. This approach is shown in equation (2)

(2) $\begin{pmatrix} 3 & 2 & 1 & 2 \end{pmatrix} = \begin{pmatrix} 3 \\ 2 \\ 1 \\ 2 \end{pmatrix}_2 = \begin{pmatrix} 1 \\ 1 \\ 1 \\ 1 \end{pmatrix}_1 + \begin{pmatrix} 1 \\ 1 \\ 0 \\ 1 \end{pmatrix}_1 + \begin{pmatrix} 1 \\ 0 \\ 0 \\ 0 \end{pmatrix}_1$

where the matrix on the left side represents the first row of the numerical matrix, which is mapped to the columns of the crossbar. In the case of 2-Bit resolution devices (possible values from ‘0’ to ‘3’), one column is required, as indicated by the single column vector in the middle. If the devices only have a resolution of 1-Bit (right side), three columns of the crossbar are required. In the middle and on the right side, each row represents an individual 1T1R cell. Another approach to weight slicing has been shown in an analog fashion in [33], where the individual columns are weighted differently through the ADC. Alternatively, weight slicing can be done after the ADC stage, in the digital periphery, via shift and add operations [20]. From equation (2) it can already be seen that higher resolution devices save considerable space, i.e. they increase the information density of the array. It should be noted that the conductance levels should be linearly spaced. Performing the weight slicing by increasing the resolution of the 1T1R elements has the additional advantage that the periphery does not have to be modified. However, it must then be designed such that it can handle the increased resolution according to equation (1). Lastly, bit slicing can also be applied to the ADC by adapting the number of rows that are evaluated in a given cycle. While bit slicing is conventionally used to realize higher precision components from lower precision components, it can also be used to reduce the requirements for the ADC.
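To make the two relations above concrete, the following Python sketch evaluates equation (1) and performs a unary (thermometer-style) weight slicing of the kind illustrated in equation (2). The function names and the ceiling of the base-2 logarithm are our own illustrative choices, not part of the referenced designs.

```python
import math

def adc_resolution(b_w: int, b_in: int, n_rows: int) -> int:
    """Minimum ADC resolution in bits according to equation (1)."""
    b_adc = b_w + b_in + math.ceil(math.log2(n_rows))
    if b_w > 1 and b_in > 1:
        return b_adc
    return b_adc - 1

def slice_weights_unary(weights, levels_per_cell: int):
    """Split a row of multibit weights into unary 1-bit column slices:
    slice k holds a '1' wherever the weight is at least k."""
    max_level = levels_per_cell - 1          # e.g. 3 for 2-bit cells
    return [[1 if w >= k else 0 for w in weights]
            for k in range(1, max_level + 1)]

if __name__ == "__main__":
    print(adc_resolution(b_w=2, b_in=1, n_rows=16))   # -> 6 according to equation (1)
    print(slice_weights_unary([3, 2, 1, 2], levels_per_cell=4))
    # -> [[1, 1, 1, 1], [1, 1, 0, 1], [1, 0, 0, 0]]  (three 1-bit columns)
```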

3 Results

3.1 Peripheral circuits

Figure 2 shows the selection circuits (a) as well as the ADC (b). The select circuits for the BL and SL are structurally the same, but are connected to different control signals and peripheral blocks, marked in orange for the BL and in blue for the SL. The left branch uses a transmission gate to connect the RESET/SET drivers to the BL/SL, respectively. Those drivers are required to program the VCM cells. At the opposite side of the active driver, the BL/SL is pulled to ground via a low ohmic NMOS transistor. The right branch of the select circuit is connected to the read driver/sensing stage for the BL/SL side. The inset in gray shows the internal structure of the single stage differential amplifier (DA). To increase the precision in the read path, we have removed the transmission gates there, avoiding the IR drop resulting from their series resistance. Instead, power gating is used to set this path high ohmic. Power gating is achieved via a PMOS transistor between the supply voltage and the supply voltage pin of the sensing stage (right side of Figure 2(a)). By removing the supply voltage, the buffers in the read drivers are set high ohmic. The ADC is likewise set high ohmic when its supply voltage is removed (compare Figure 2(b)). The ADC circuit is inspired by [33, 36]: the sensing stage from [33] is combined with the voltage controlled oscillator from [36].

Figure 2: Circuit blocks of the Bitline and Sourceline select circuits (a) and the ADC stage (b). In (a) signal names without curly braces are used in the BL and SL select circuits. For the signals in curly brackets the orange/blue colored signals are applied in the BL/SL select circuit. The gray inset shows the internal structure of the differential amplifier (DA).

A VMM is performed by first turning on the read drivers. This is done by setting the ‘MAC Control’ signal and the ‘READ’ signal to ‘1’. The read drivers are implemented as digital buffers and apply a voltage of 2.2 V via the SL select circuit to the 1T1R crossbar. The ADC keeps the voltage at the output of the crossbar at 2 V, resulting in a voltage drop of 0.2 V across the 1T1R elements. Then, the logical input vector is read in and applied via the Wordlines (WL1 … WLN). The Wordline drivers (or input DACs) are realized as digital buffers, either applying 0 V or VDD (5 V) to the transistor gates. This reduces the required driving power for the WL buffers, as they only have to drive a certain number of transistor gates, depending on the number of selected columns. The input vector is applied serially and fed through a shift register, to be applied in parallel to the WL drivers (see Figure 1). The correct column is chosen via the ‘SELECT’ signal, which is generated via a demultiplexer fed by another shift register. In this way, single columns, multiple columns or all columns can be selected for an evaluation cycle. The maximum number of columns that can be activated at the same time depends on several factors, e.g. the layout size of the ADC or the requirements of the application. For the highest throughput, all columns should be active at the same time, requiring one ADC per column. Our chosen approach allows for complete flexibility in the number of activated columns. If multiple columns share a single ADC, e.g. every four columns have one ADC, the ‘MAC Select’ signal can be used to switch between these four columns. This approach is useful if the columns are to be weighted differently in the digital periphery: in the first cycle all first bit positions (2^0) can be evaluated, in the second cycle all second bit positions (2^1), and so on. Via the ‘MAC Select’ signal one can also switch between different numbers of columns to be evaluated. After all components have been activated, they are kept active for the evaluation time, which can also be chosen flexibly depending on the required accuracy. Programming of cells is performed individually, via the left side path of Figure 2(a). To perform a write operation, first the ‘WRITE’ signal is set to one and the column is chosen via the ‘SELECT’ signal. As for the VMM, the correct row is chosen via the logical input vector by applying a ‘1’ only to the row of the cell to be programmed and a ‘0’ to all other rows. Depending on the switching direction, the SET/RESET driver is connected at the BL/SL side while the SL/BL is pulled to ground via an NMOS transistor. The configuration of the periphery for the different operation modes is summarized in Table 1. The signals printed in bold are global signals that are only required once for all columns.

The full ADC, shown in Figure 2(b), can be split into three parts, namely the voltage stabilization (sensing) stage (marked in blue in Figure 2), the ring oscillator stage and the counter stage. The sensing stage clamps the BL voltage to the reference voltage VRef = 2 V during the VMM operation to achieve a constant voltage drop across the 1T1R crossbar, which increases the resolution of the ADC [33]. With the read driver applying 2.2 V and VRef set to 2 V, 0.2 V drops over the 1T1R cells. This is the same voltage configuration at which the conductance levels are verified, which prevents the I-V nonlinearity of the VCM cells from affecting the results. The differential amplifier ensures that the resulting current levels are spaced linearly. N1 mirrors the current from N0 into the voltage controlled oscillator (VCO) stage, where it is converted by RSupp into the supply voltage of the VCO, VDD,VCO. Thus, the VCO oscillates at different frequencies, depending on its supply voltage. To improve the counting of the pulses, the output signal of the VCO is buffered through an inverter chain. As the ADC converts an amplitude encoded signal (the crossbar output current) into a time encoded signal (a pulse train to be counted), higher accuracies can be achieved by increasing the conversion time [36]. The time for which the pulses are counted depends on the required precision, as will be shown in the next section.
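The behavior of this conversion chain can be summarized in a few lines of Python. The numerical values for RSupp and the VCO gain below are purely illustrative placeholders; the real current-to-frequency characteristic follows from the transistor-level design in [33, 36].

```python
# Behavioral sketch of the VCO-based ADC path (illustrative numbers only).
V_SUPPLY = 5.0        # supply voltage in V
R_SUPP   = 20e3       # assumed value of RSupp in ohms
F_PER_V  = 200e6      # assumed VCO gain in Hz per volt of VDD,VCO

def adc_pulse_count(i_bl: float, t_conv: float) -> int:
    """Convert a mirrored bitline current into a pulse count.

    A smaller bitline current drops less voltage over RSupp, which raises
    VDD,VCO, increases the oscillation frequency and therefore yields more
    pulses within the conversion time t_conv (in seconds).
    """
    vdd_vco = V_SUPPLY - i_bl * R_SUPP
    frequency = max(vdd_vco, 0.0) * F_PER_V
    return int(frequency * t_conv)

# Doubling the conversion time doubles the pulse count and hence the
# separation between adjacent dot-product levels, as shown in Figure 3.
print(adc_pulse_count(i_bl=50e-6, t_conv=260e-9))
print(adc_pulse_count(i_bl=50e-6, t_conv=520e-9))
```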

3.2 Basic performance characteristics

In order to study the effects of the different approaches to bit slicing, a baseline case first needs to be defined. Here, we choose a simulation in which 16 rows with binary devices are read out with a single 4-Bit ADC. The 16 rows are chosen so as not to put too high requirements on the devices and the ADC. For the chosen array transistors (W = 10 μm, L = 500 nm) and the device states Ndisc,LRS/HRS = 0.904·10^26 m^-3/0.148·10^23 m^-3, the resulting resistances of the 1T1R structure are 4.65 kΩ/95.97 kΩ, corresponding to conductances of 214.95 μS/10.42 μS. In the JART VCM v1b model the switching filament is simplified into a well conducting oxygen vacancy reservoir, the plug region, and the variable resistance disc region. The state variable of the model (Ndisc) can then be used to set the resistance of the devices. In the 1-Bit case the HRS represents the logical ‘0’ and the LRS represents the logical ‘1’. For 2-Bit the HRS still represents level ‘0’, while the LRS represents level ‘3’. Both values were evaluated at a read voltage of 0.2 V in the RESET direction and a gate voltage of 5 V. For the investigation of 2-Bit devices we add two additional levels using proportional conductance mapping [37] at 6.78 kΩ (=147.4 μS) for a ‘2’ and 12.74 kΩ (=78.47 μS) for a ‘1’. This leads to an HRS/LRS ratio of around 20. For this case, Figure 3(a) shows the number of generated pulses as a function of the number of devices in the HRS and for different conversion times. In this simulation, all WL inputs are at VDD. A higher number of HRS devices represents smaller numerical values stored in the matrix and results in a larger number of generated pulses, as the smaller crossbar current (IBL) is mirrored from N0 to N1 and results in a smaller voltage drop over RSupp. This in turn increases the supply voltage of the VCO (VDD,VCO), which increases its oscillation frequency. If the conversion time is increased, the number of generated pulses increases linearly. This makes distinguishing the different levels easier, as the differences between adjacent levels increase linearly as well. The black solid lines at the right side of Figure 3(a) indicate the required counter length, which increases with increasing conversion times. Therefore, there exists a tradeoff between accuracy and variability tolerance on the one side and energy consumption and area on the other side, as longer counters consume more energy and require more space, but enable longer conversion times. Figure 3(b) shows the minimum difference between the numbers of generated pulses over the conversion time for the same inputs as in (a). As expected from (a), the difference increases nearly linearly from two at a conversion time of 160 ns to 12 at a conversion time of 760 ns. The conversion time consists of a setup time until the VCO reaches its correct frequency and the actual counting time. Based on the results from Figure 3 we can see how the ADC can be traded off between low precision, low energy consumption and small area on the one hand and high precision, high energy consumption and large area on the other. In this way, the operational parameters and design choices can follow different optimization routes depending, for example, on the device technology or on the timing, energy or accuracy requirements of the application. From these initial simulations we define our baseline case as a conversion time of 260 ns for a minimum guaranteed pulse difference of four. Apart from tolerating more device variability, the guaranteed minimum pulse difference also allows the ADC to distinguish between more dot products.
Those dot products can be realized by using multilevel devices with additional conductance states between the minimum and maximum values. It should be noted that the worst case for 1-Bit devices is observed for the dot products (VDD 0 … 0) ⋅ (LRS HRS … HRS)^T = 1 and (VDD VDD … VDD) ⋅ (HRS HRS … HRS)^T = 0. In this case the ADC has to differentiate between 16 selected HRS states (=95.97 kΩ/16 ≈ 6 kΩ) and one selected LRS (4.65 kΩ). For this case, the ADC produces a difference in the number of generated pulses of only one in the baseline case (260 ns, 16 binary devices). To guarantee perfect precision, this case needs to be distinguished. However, this worst case will occur with a very low probability. To mitigate this difficulty, higher HRS/LRS ratios are required or fewer devices can be read out at the same time.
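The following short calculation reproduces this worst case with the resistance values given above; it is only a sanity check of the numbers, not part of the simulation flow.

```python
# Worked check of the 1-bit worst case: sixteen selected HRS cells versus a
# single selected LRS cell (values from Section 3.2).
R_LRS = 4.65e3           # ohms, 1T1R in the LRS
R_HRS = 95.97e3          # ohms, 1T1R in the HRS
V_READ = 0.2             # volts across the 1T1R cells

r_sixteen_hrs = R_HRS / 16            # ~6 kOhm equivalent resistance
i_zero = V_READ / r_sixteen_hrs       # dot product '0': all 16 inputs high, all HRS
i_one  = V_READ / R_LRS               # dot product '1': one input high, one LRS

print(f"{i_zero*1e6:.1f} uA vs {i_one*1e6:.1f} uA "
      f"(ratio {i_one/i_zero:.2f})")   # the ADC must resolve this small gap
```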

Figure 3: Generated number of pulses as a function of the number of devices in the HRS for different conversion times (a). The required counter length is indicated on the right side of (a). (b) Shows the minimum difference in generated pulses for different conversion times.

3.3 Comparing different bit slicing cases

In our analysis we compared two different types of bit slicing, namely weight slicing between 1-Bit and 2-Bit devices and ADC slicing, where either 16, 8 or 4 rows are read out at the same time. The 1-Bit/16 rows case was previously discussed as the baseline case with a conversion time of 260 ns, giving a guaranteed pulse difference of four in Figure 3. The three other bit slicing cases are then simulated for conversion times long enough to guarantee the same pulse difference, i.e. the same accuracy or noise resilience. In the case of 2-Bit devices this requires the counter to run for a longer time than in the case of 1-Bit devices, as the additional levels are added in between the 1-Bit levels. The longer run time of the counter can also require a bigger counter circuit, as indicated in Figure 3(a). Both factors increase the energy consumption, but this is compensated by the higher number of computations performed. In the case of using only 8 or 4 rows, the counter can be made smaller, as the maximum number of pulses to be counted is smaller. However, to achieve the same number of bit operations, the computation has to be repeated twice for 8 rows and four times for 4 rows. Under the requirement of the same precision, the conversion time is increased to 760 ns for the 2-Bit/16 rows case and decreased to 130 ns for 1-Bit/8 rows and to 65 ns for 1-Bit/4 rows. The required counter width is 8-Bit in the baseline case, 9-Bit in the 2-Bit/16 rows case, 7-Bit in the 1-Bit/8 rows case and 5-Bit in the 1-Bit/4 rows case. The average energy consumption of the different components and the average total energy per bit are shown in Figure 4. For the total energy per bit, the energies of the three components are added together and then divided by the number of bit operations. This number is 32 for 2-Bit/16 rows, 16 for 1-Bit/16 rows, 8 for 1-Bit/8 rows and 4 for 1-Bit/4 rows. From the results it can be seen that the counter and select circuits are the largest energy consumers. As the select circuits are static during the dot product evaluation, this energy is mostly consumed by the counter. The VCO and sensing stage are the second largest consumers, and the crossbar consumes the least amount of energy. Additionally, the results suggest that in the given architecture it is more energy efficient to perform many small dot product operations with 1-Bit devices, as those can be finished in a much shorter time and with smaller counters.
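The accounting behind this energy-per-bit comparison can be summarized as follows. The component energies themselves are results of the circuit simulations and are therefore left as function arguments rather than reproduced here; the function names are our own.

```python
# Energy-per-bit bookkeeping used for the comparison in Figure 4 (sketch of
# the accounting only; component energies come from the circuit simulations).
def bit_operations(rows: int, bits_per_weight: int, bits_per_input: int = 1) -> int:
    """Number of 1-bit operations in one dot-product evaluation."""
    return rows * bits_per_weight * bits_per_input

def energy_per_bit(e_crossbar: float, e_vco_sense: float, e_counter_select: float,
                   rows: int, bits_per_weight: int) -> float:
    """Total component energy divided by the number of bit operations."""
    total = e_crossbar + e_vco_sense + e_counter_select
    return total / bit_operations(rows, bits_per_weight)

print(bit_operations(16, 2))   # 32 bit operations for the 2-Bit/16 rows case
print(bit_operations(4, 1))    # 4 bit operations for the 1-Bit/4 rows case
```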

Figure 4: Average energy consumption of the different circuit components for the different bit slicing approaches.

3.4 Variability tolerance

VCM devices show different types of variability, such as device-to-device (d2d) variability, cycle-to-cycle (c2c) variability and read-to-read variability [38]. For the considered use case of VMMs for machine learning accelerators, the most important variability effects are read disturb, read noise and programming variability [24]. Regarding the impact of read disturb, we showed in [24] that reading in the RESET direction is preferable to the SET direction and that read voltages of up to 0.5 V over single devices are possible in the case of binary devices. As the read voltage here is 0.2 V in the RESET direction, read disturb will likely not be an issue. Regarding read noise, we calculated its magnitude at the different conductance levels. Table 2 summarizes the VCM device conductances of the four considered levels and the minimum, median and maximum values of the read noise and of the number of oxygen vacancies at these levels. The minimum and maximum numbers are based on the extreme cases for the filament volumes, as described below in more detail. With the chosen transistor dimensions (W = 10 μm, L = 500 nm) the conductance of the 1T1R structure is almost completely determined by the resistance of the VCM cell. The JART VCM v1b Readvar model from [26] is then used to simulate the minimum, median and maximum amount of ΔG/G at the four different conductance levels, under the same read conditions as described above. The read variability model first discretizes the number of oxygen vacancies in the disc and then changes this number as described in [26]. As mentioned above, the filamentary oxide in the JART VCM v1b model consists of the plug region and the disc region, where the oxygen vacancy concentration in the disc is modulated during switching. For the discretization, the disc volume is multiplied with the oxygen vacancy concentration at the different levels. The worst case noise at a given vacancy concentration is then obtained for the smallest filament volume, as this gives the smallest number of oxygen vacancies. Adding or removing a single vacancy then has a larger influence on the conductance than for larger disc volumes. The rightmost column in Table 2 shows the floored number of vacancies at the different conductance levels for the different filament volumes. The minimum number of vacancies leads to the maximum ΔG/G. As can be seen from the results, the worst-case noise can be very significant at the lower conductance levels (ΔG/G of up to 1). The number of vacancies in this case is only two. However, most devices will have filament volumes close to the median case, where the read noise lies between 5 % at the smallest conductance level and 0.8 % at the highest conductance level. These read noise levels are then the best case accuracies of the different levels for the median devices. To program multilevel devices, write verify algorithms are usually used [31, 39]. Even if it were possible to program the levels with 100 % accuracy, the intrinsic read noise would still lead to some uncertainty at the different conductance levels. With the read noise results we can reconsider the 1-Bit worst case from Section 3.2, between 16 selected HRS or level 0 states (=16·10.44 μS = 167.04 μS) and one selected LRS or level 3 device (223.1 μS). In the JART VCM v1b Readvar model the read noise module can change the number of vacancies by two at most.
In the worst case, all the devices at level 0 gain two additional vacancies, increasing their conductance, while the single level 3 device has two vacancies removed, decreasing its conductance. In this case, the equivalent conductance is increased by 10 % for the level 0 devices and decreased by 1.6 % for the level 3 device. The two conductance levels are then 183.74 μS and 219.53 μS. With these worst case conductances including read noise, we can calculate the programming accuracy at which the two levels will overlap, making a distinction impossible. This is calculated as 219.53 μS/183.74 μS = 1.19, meaning that the program verify algorithms have to achieve errors smaller than 9.5 %. With the given ADC, two neighboring conductance states can in principle always be resolved if the conversion time is increased. In practice, it would again be a better solution to use higher conductance ratios or to relax the accuracy requirements and allow for some errors in the VMM. With regard to the bit slicing approaches, higher device resolutions make the differentiation much more difficult, as the additional levels are inserted between the 1-Bit levels. Then the conductances have to be spread out over a larger range or fewer rows have to be read out at the same time. Reading out fewer rows at a time does not directly affect the accuracy, due to the linear conductance spacing. However, it improves the energy consumption, as shown in Section 3.3.
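The overlap estimate can be retraced with the numbers from Table 2 and the text; the short script below is only a worked check of this arithmetic.

```python
# Worked version of the worst-case overlap estimate from Section 3.4
# (conductance values taken from Table 2 and the surrounding text).
g_level0 = 16 * 10.44e-6                 # sixteen level-0 cells in parallel (S)
g_level3 = 223.1e-6                      # one level-3 cell (S)

g_level0_noisy = g_level0 * 1.10         # +10 % from two added vacancies per cell
g_level3_noisy = g_level3 * (1 - 0.016)  # -1.6 % from two removed vacancies

print(f"{g_level0_noisy*1e6:.2f} uS vs {g_level3_noisy*1e6:.2f} uS")
ratio = g_level3_noisy / g_level0_noisy
print(f"ratio = {ratio:.2f}")            # ~1.19: about half of this ~19 % gap,
                                         # i.e. roughly 9.5 %, is the error budget
                                         # left for the write-verify algorithm
```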

Table 2:

Minimum, median and maximum read noise impact for the considered conductance levels.

| Level | GVCM [μS] | Min/med/max ΔG/G  | Max/med/min # of vacancies |
|-------|-----------|-------------------|----------------------------|
| 0     | 10.44     | 0.01/0.05/1       | 139/33/2                   |
| 1     | 79.54     | 0.004/0.015/0.3   | 459/110/6                  |
| 2     | 151.2     | 0.0024/0.01/0.19  | 672/161/10                 |
| 3     | 223.1     | 0.0018/0.008/0.14 | 848/203/12                 |

4 Conclusion

In this work, we have presented a CIM analog core architecture based on filamentary VCM devices. We have identified areas where different bit slicing techniques can be applied and shown the associated energy and latency tradeoffs between them. Understanding these tradeoffs is important for future design space exploration. With our proposed ADC concept, it is possible to trade off accuracy for latency and energy consumption, which is especially interesting for machine learning or neuromorphic computing algorithms in which the accuracy of an individual VMM might not directly affect the accuracy of the algorithm. Furthermore, we have discussed the intrinsic reliability issues of VCM devices and their impact on the architecture. Due to the intrinsic read noise of VCM devices, the precise programming of the weights is challenging, especially at high resistance levels. To ensure no overlap between neighboring dot products, the write verify algorithms have to be sufficiently precise.


Corresponding author: Stephan Menzel, Forschungszentrum Jülich, Peter Grünberg Institut 7, D-52428 Jülich, Germany, E-mail:

Christopher Bengel and Leon Dixius contributed equally.


About the authors

Christopher Bengel

Christopher Bengel received the B.Sc. and M.Sc. degrees in electrical engineering from RWTH Aachen University in 2016 and 2018, respectively. He is currently pursuing the Ph.D. degree with IWE 2 RWTH Aachen, with a focus on modeling resistive switching devices for Computation-in-Memory and neuromorphic computing.

Leon Dixius

Leon Dixius was born in Trier, Germany, in 1999. He received his B.Sc. degree in electrical engineering from RWTH Aachen University in 2021, where he is currently pursuing his M.Sc. degree in electrical engineering.

Rainer Waser

Prof. Dr. Rainer Waser (M’79) received the Ph.D. degree in physical chemistry from the University of Darmstadt, Darmstadt, Germany, in 1984. In 1992, he joined the Faculty of Electrical Engineering and Information Technology, RWTH Aachen University, as a Professor. He is the Director of the Peter Grünberg Institute 7 & 10, Forschungszentrum Jülich. He received the Gottfried Wilhelm Leibniz Preis in 2014.

Dirk J. Wouters

Dr. Dirk J. Wouters received the master’s and Ph.D. degrees in electrical engineering from the University of Leuven, in 1982 and 1989, respectively. In 2014, he joined the Institute of Electronic Materials, RWTH Aachen University, where he focuses on the research of metal-oxide-based RRAM and is the leader of the neuromorphic computing group.

Stephan Menzel

Dr. Stephan Menzel (M’12) was born in Bremen, Germany. He received the Diploma degree and the Ph.D. degree (summa cum laude) in electrical engineering from RWTH Aachen University in 2005 and 2012, respectively. He is currently a Senior Researcher with Peter-Grünberg-Institut 7, Forschungszentrum Jülich GmbH where he is leading the Simulation Group.

  1. Author contributions: All the authors have accepted responsibility for the entire content of this submitted manuscript and approved submission.

  2. Research funding: This work was in part funded by the German Research Foundation (DFG) under Grant No. SFB 917, in part by the Federal Ministry of Education and Research (BMBF, Germany) in the Project NEUROTEC (Project Nos. 16ME0398K and 16ME0399) and it is based on the Jülich Aachen Research Alliance (JARA-FIT).

  3. Conflict of interest statement: The authors declare no conflicts of interest regarding this article.

References

[1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, pp. 436–444, 2015. https://doi.org/10.1038/nature14539.

[2] V. Sze, Y. Chen, J. Emer, A. Suleiman, and Z. Zhang, “Hardware for machine learning: challenges and opportunities,” in 2018 IEEE Custom Integrated Circuits Conference (CICC), 2018, pp. 1–8. https://doi.org/10.1109/CICC.2018.8357072.

[3] N. P. Jouppi, C. Young, N. Patil, et al., “In-datacenter performance analysis of a tensor processing unit,” SIGARCH Comput. Archit. News, vol. 45, pp. 1–12, 2017. https://doi.org/10.1145/3140659.3080246.

[4] Y. Jiao, L. Han, and X. Long, “Hanguang 800 NPU – the ultimate AI inference solution for data centers,” in 2020 IEEE Hot Chips 32 Symposium (HCS), 2020, pp. 1–29. https://doi.org/10.1109/HCS49909.2020.9220619.

[5] N. C. Thompson, K. H. Greenewald, K. Lee, and G. F. Manso, “The computational limits of deep learning,” arXiv:2007.05558v2, 2020.

[6] R. Dittmann, S. Menzel, and R. Waser, “Nanoionic memristive phenomena in metal oxides: the valence change mechanism,” Adv. Phys., vol. 70, no. 2, pp. 155–349, 2022. https://doi.org/10.1080/00018732.2022.2084006.

[7] P. Chi, S. Li, C. Xu, et al., “PRIME: a novel processing-in-memory architecture for neural network computation in ReRAM-based main memory,” in 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), 2016, pp. 27–39. https://doi.org/10.1109/ISCA.2016.13.

[8] A. Shafiee, A. Nag, N. Muralimanohar, et al., “ISAAC: a convolutional neural network accelerator with in-situ analog arithmetic in crossbars,” in 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), 2016, pp. 14–26. https://doi.org/10.1109/ISCA.2016.12.

[9] R. Waser, R. Dittmann, G. Staikov, and K. Szot, “Redox-based resistive switching memories – nanoionic mechanisms, prospects, and challenges,” Adv. Mater., vol. 21, nos. 25–26, pp. 2632–2663, 2009. https://doi.org/10.1002/adma.200900375.

[10] D. B. Strukov, G. S. Snider, D. R. Stewart, and R. S. Williams, “The missing memristor found,” Nature, vol. 453, no. 7191, pp. 80–83, 2008. https://doi.org/10.1038/nature06932.

[11] M. A. Zidan, H. A. H. Fahmy, M. M. Hussain, and K. N. Salama, “Memristor-based memory: the sneak paths problem and solutions,” Microelectron. J., vol. 44, no. 2, pp. 176–183, 2013. https://doi.org/10.1016/j.mejo.2012.10.001.

[12] H. Kim, M. R. Mahmoodi, H. Nili, and D. B. Strukov, “4K-memristor analog-grade passive crossbar circuit,” Nat. Commun., vol. 12, no. 1, pp. 5198/1–5198/11, 2021. https://doi.org/10.1038/s41467-021-25455-0.

[13] C. Chou, Z. Lin, C. Lai, et al., “A 22 nm 96KX144 RRAM macro with a self-tracking reference and a low ripple charge pump to achieve a configurable read window and a wide operating voltage range,” in 2020 IEEE Symposium on VLSI Circuits, 2020, pp. 1–2. https://doi.org/10.1109/VLSICircuits18222.2020.9163014.

[14] P. Jain, U. Arslan, M. Sekhar, et al., “13.2 A 3.6 Mb 10.1 Mb/mm2 embedded non-volatile ReRAM macro in 22 nm FinFET technology with adaptive forming/set/reset schemes yielding down to 0.5 V with sensing time of 5 ns at 0.7 V,” in 2019 IEEE International Solid-State Circuits Conference (ISSCC), 2019, pp. 212–214. https://doi.org/10.1109/ISSCC.2019.8662393.

[15] N. Kopperberg, S. Wiefels, K. Hofmann, et al., “Endurance of 2 Mbit based BEOL integrated ReRAM,” IEEE Access, vol. 10, pp. 122696–122705, 2022. https://doi.org/10.1109/access.2022.3223657.

[16] M. Horowitz, “Computing’s energy problem (and what we can do about it),” in 2014 IEEE International Solid-State Circuits Conference (ISSCC), 2014, pp. 10–14. https://doi.org/10.1109/ISSCC.2014.6757323.

[17] A. Ankit, I. E. Hajj, S. R. Chalamalasetti, et al., “PANTHER: a programmable architecture for neural network training harnessing energy-efficient ReRAM,” IEEE Trans. Comput., vol. 69, pp. 1128–1142, 2020. https://doi.org/10.1109/tc.2020.2998456.

[18] C. Li, J. Ignowski, X. Sheng, et al., “CMOS-integrated nanoscale memristive crossbars for CNN and optimization acceleration,” in 2020 IEEE International Memory Workshop (IMW), 2020, pp. 1–4. https://doi.org/10.1109/IMW48823.2020.9108112.

[19] J. J. Dongarra, J. D. Croz, S. Hammarling, and I. S. Duff, “A set of level 3 basic linear algebra subprograms,” ACM Trans. Math. Softw., vol. 16, pp. 1–17, 1990. https://doi.org/10.1145/77626.79170.

[20] B. Feinberg, U. K. R. Vengalam, N. Whitehair, S. Wang, and E. Ipek, “Enabling scientific computing on memristive accelerators,” in 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), 2018, pp. 367–382. https://doi.org/10.1109/ISCA.2018.00039.

[21] A. Ankit, I. E. Hajj, S. R. Chalamalasetti, et al., “A programmable ultra-efficient memristor-based accelerator for machine learning inference,” in Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, 2019, pp. 715–731.

[22] W. Wan, R. Kubendran, C. Schaefer, et al., “A compute-in-memory chip based on resistive random-access memory,” Nature, vol. 608, no. 7923, pp. 504–512, 2022. https://doi.org/10.1038/s41586-022-04992-8.

[23] JART, “Juelich Aachen resistive switching tools (JART),” Tech. Rep., IWE 2, RWTH Aachen, 2019.

[24] C. Bengel, J. Mohr, S. Wiefels, et al., “Reliability aspects of binary vector-matrix-multiplications using ReRAM devices,” Neuromorph. Comput. Eng., vol. 2, no. 3, p. 034001, 2022. https://doi.org/10.1088/2634-4386/ac6d04.

[25] C. Bengel, A. Siemon, F. Cüppers, et al., “Variability-aware modeling of filamentary oxide based bipolar resistive switching cells using SPICE level compact models,” IEEE Trans. Circuits Syst. I: Regul. Pap., vol. 67, no. 12, pp. 4618–4630, 2020. https://doi.org/10.1109/tcsi.2020.3018502.

[26] S. Wiefels, C. Bengel, N. Kopperberg, K. Zhang, R. Waser, and S. Menzel, “HRS instability in oxide based bipolar resistive switching cells,” IEEE Trans. Electron Devices, vol. 67, no. 10, pp. 4208–4215, 2020. https://doi.org/10.1109/ted.2020.3018096.

[27] F. Cüppers, S. Menzel, C. Bengel, et al., “Exploiting the switching dynamics of HfO2-based ReRAM devices for reliable analog memristive behavior,” APL Mater., vol. 7, no. 9, pp. 91105/1–91105/9, 2019. https://doi.org/10.1063/1.5108654.

[28] C. Bengel, F. Cüppers, M. Payvand, et al., “Utilizing the switching stochasticity of HfO2/TiOx-based ReRAM devices and the concept of multiple devices for the classification of overlapping and noisy patterns,” Front. Neurosci., vol. 15, p. 621, 2021. https://doi.org/10.3389/fnins.2021.661856.

[29] C. Bengel, A. Siemon, V. Rana, and S. Menzel, “Implementation of multinary Lukasiewicz logic using memristive devices,” in 2021 IEEE International Symposium on Circuits and Systems (ISCAS), Daegu, Korea, 22–28 May 2021. https://doi.org/10.1109/ISCAS51556.2021.9401367.

[30] C. L. Torre, “Physics-based compact modeling of valence-change-based resistive switching devices,” Ph.D. thesis, RWTH Aachen, 2019.

[31] E. Perez, M. K. Mahadevaiah, E. P. Quesada, and C. Wenger, “Variability and energy consumption tradeoffs in multilevel programming of RRAM arrays,” IEEE Trans. Electron Devices, vol. 68, pp. 2693–2698, 2021. https://doi.org/10.1109/ted.2021.3072868.

[32] P. Yao, H. Wu, B. Gao, et al., “Fully hardware-implemented memristor convolutional neural network,” Nature, vol. 577, no. 7792, pp. 641–646, 2020. https://doi.org/10.1038/s41586-020-1942-4.

[33] A. Singh, M. A. Lebdeh, A. Gebregiorgis, R. Bishnoi, R. V. Joshi, and S. Hamdioui, “SRIF: scalable and reliable integrate and fire circuit ADC for memristor-based CIM architectures,” IEEE Trans. Circuits Syst. I: Regul. Pap., vol. 68, pp. 1–14, 2021. https://doi.org/10.1109/tcsi.2021.3061214.

[34] J. M. Hung, C. X. Xue, H. Y. Kao, et al., “A four-megabit compute-in-memory macro with eight-bit precision based on CMOS and resistive random-access memory for AI edge devices,” Nat. Electron., vol. 4, no. 12, p. 921+, 2021. https://doi.org/10.1038/s41928-021-00676-9.

[35] M. Zahedi, M. Mayahinia, M. A. Lebdeh, S. Wong, and S. Hamdioui, “Efficient organization of digital periphery to support integer datatype for memristor-based CIM,” in 2020 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), 2020, pp. 216–221. https://doi.org/10.1109/ISVLSI49217.2020.00047.

[36] M. Mayahinia, A. Singh, C. Bengel, et al., “A novel voltage controlled oscillation based ADC design for computation-in-memory using emerging ReRAMs,” J. Emerg. Technol. Comput. Syst., vol. 18, no. 2, pp. 1–25, 2022. https://doi.org/10.1145/3451212.

[37] T. P. Xiao, B. Feinberg, C. H. Bennett, et al., “An accurate, error-tolerant, and energy-efficient neural network inference engine based on SONOS analog memory,” IEEE Trans. Circuits Syst. I: Regul. Pap., vol. 69, pp. 1480–1493, 2022. https://doi.org/10.1109/tcsi.2021.3134313.

[38] S. Wiefels, “Reliability aspects in resistively switching valence change memory cells,” Ph.D. thesis, RWTH Aachen, 2021.

[39] B. Q. Le, A. Grossi, E. Vianello, et al., “Resistive RAM with multiple bits per cell: array-level demonstration of 3 bits per cell,” IEEE Trans. Electron Devices, vol. 66, no. 1, pp. 641–646, 2019. https://doi.org/10.1109/ted.2018.2879788.

Received: 2023-04-07
Accepted: 2023-04-09
Published Online: 2023-04-21
Published in Print: 2023-05-25

© 2023 the author(s), published by De Gruyter, Berlin/Boston

This work is licensed under the Creative Commons Attribution 4.0 International License.
