By leveraging the physical properties of PCM devices, analog AI chips overcome the von Neumann bottleneck by performing computation in the same place where the data is stored. Because the data does not have to move, tasks can be performed in a fraction of the time and with much less energy.
For example, moving 64 bits of data from DRAM to CPU consumes 1-2nJ, which is 10,000–2,000,000 times more energy than is dissipated in a PCM device performing a multiplication operation (1-100fJ).
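The quoted ratio can be checked with a few lines of arithmetic. The sketch below uses only the energy figures given above; the constant and function names are illustrative, not taken from any IBM tooling.

```python
# Energy figures quoted in the text, expressed in joules.
DRAM_MOVE_J = (1e-9, 2e-9)        # moving 64 bits from DRAM to CPU: 1-2 nJ
PCM_MULTIPLY_J = (1e-15, 100e-15)  # one multiplication in a PCM device: 1-100 fJ


def energy_ratio_bounds(move, multiply):
    """Return the (smallest, largest) ratio of data-movement energy to multiply energy."""
    smallest = min(move) / max(multiply)  # cheapest move vs. costliest multiply
    largest = max(move) / min(multiply)   # costliest move vs. cheapest multiply
    return smallest, largest


low, high = energy_ratio_bounds(DRAM_MOVE_J, PCM_MULTIPLY_J)
print(f"Data movement costs roughly {low:,.0f}x to {high:,.0f}x more energy")
```

The lower bound pairs the cheapest data movement with the most expensive multiplication, and the upper bound does the reverse, which is where the 10,000 and 2,000,000 figures come from.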
When scaled across billions of operations, these energy savings are immense. PCM devices also show promise as a more reliable data store: PCM consumes no power when the devices are inactive, and the data is retained for up to 10 years even when the power supply is turned off.
With phase-change memory, an electrical pulse is applied to the material, changing the conductance of the device by switching the material between amorphous and crystalline phases. A low electrical pulse will make the PCM device more crystalline (less resistance). A high electrical pulse will make the device more amorphous (more resistance). Instead of recording a 0 or 1 like in the digital world, the PCM device records its state as a continuum of values between the two. This value is called a synaptic weight. These weights are stored in the physical atomic configuration of each PCM in a non-volatile way (the weights are retained when the power supply is turned off).
When PCM devices are arranged in a crossbar configuration, it’s possible to perform an analog matrix-vector multiplication in a single time step, exploiting the multi-level storage capability of the devices and Kirchhoff’s circuit laws.
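To make the idea concrete, here is a minimal software sketch of an idealized crossbar. It assumes perfectly linear devices and ignores wire resistance and noise; the function name and the example values are hypothetical. Each device contributes a current equal to its conductance times the applied voltage (Ohm's law), and the currents on a shared column wire add up (Kirchhoff's current law), which together yield a matrix-vector product.

```python
def crossbar_mvm(conductances, voltages):
    """Idealized crossbar read-out.

    conductances[i][j] is the conductance (siemens) of the device joining
    input row i to output column j; voltages[i] is the voltage on row i.
    Returns the list of column currents (amperes).
    """
    n_cols = len(conductances[0])
    currents = [0.0] * n_cols
    for row, v in zip(conductances, voltages):
        for j, g in enumerate(row):
            # Ohm's law gives the device current g * v; Kirchhoff's current
            # law sums the currents of all devices sharing column j.
            currents[j] += g * v
    return currents


# Illustrative values only: a 3 x 2 array with microsiemens-scale devices.
G = [[1e-6, 2e-6],
     [3e-6, 4e-6],
     [5e-6, 6e-6]]
V = [0.1, 0.2, 0.3]
I = crossbar_mvm(G, V)
```

In hardware the nested loops disappear: every device conducts at once, so all the multiplications and additions happen in the same time step.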
In deep learning inference, data propagation through multiple layers of a neural network involves a sequence of matrix multiplications, as each layer can be represented as a matrix of synaptic weights. On the Fusion chip, these weights are stored in the conductance states of PCM devices. The devices are arranged in crossbar arrays, creating an artificial neural network where all matrix multiplications are performed in-place in an analog manner. This structure allows inference to be performed using little energy with high areal density of synapses.
The arrays on the Fusion chip directly relate to the synapse layers of the neural network. Each synapse layer consists of two arrays, one encoding the positive part of the synaptic weight and the other encoding the negative part.
The first set of arrays, each made of 784 x 250 PCM devices, corresponds to the synapses connecting the 784 input neurons to the 250 hidden neurons of the neural network. The second set of arrays, each made of 250 x 10 PCM devices, corresponds to the synapses connecting the 250 hidden neurons to the 10 output neurons. The output layer consists of 10 output neurons representing the 10 digits (0 to 9) classified by the neural network.
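As a quick sanity check on scale, the array sizes above imply the following device count. This is a back-of-the-envelope sketch: the names are illustrative, and it counts only the synaptic devices described here, not any other circuitry on the chip.

```python
# (inputs, outputs) for each synapse layer, as described in the text.
LAYER_SHAPES = [(784, 250), (250, 10)]
ARRAYS_PER_LAYER = 2  # one positive and one negative array per layer


def count_devices(shapes, arrays_per_layer):
    """Total PCM devices needed to hold every synaptic weight."""
    return sum(rows * cols * arrays_per_layer for rows, cols in shapes)


total_devices = count_devices(LAYER_SHAPES, ARRAYS_PER_LAYER)
print(total_devices)
```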
Each synaptic weight of the neural network is encoded as the difference in conductance values of two corresponding PCM devices, one in the positive array and one in the negative array. The calculation of the synapse layer output is done by subtracting the output of the negative array from the output of the positive array.
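The differential encoding can be illustrated in software. The text only states that a weight is the difference of two conductances; the particular split used below (one device carries the magnitude, the other stays at zero) is an assumption chosen for simplicity, and the normalized units and function names are hypothetical.

```python
def encode_weight(weight, g_max=1.0):
    """Split a signed weight into two non-negative conductances (g_pos, g_neg).

    Assumed scheme: the magnitude goes on one device and the other is left
    at zero. Values are in normalized units, capped at g_max.
    """
    if abs(weight) > g_max:
        raise ValueError("weight exceeds the representable conductance range")
    if weight >= 0:
        return weight, 0.0
    return 0.0, -weight


def decode_weight(g_pos, g_neg):
    """Recover the signed weight as the difference of the two conductances."""
    return g_pos - g_neg


g_pos, g_neg = encode_weight(-0.25)
restored = decode_weight(g_pos, g_neg)
```

Two devices are needed because a physical conductance cannot be negative, while trained neural network weights usually take both signs.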
To illustrate the early capabilities of analog AI, we encoded a neural network directly onto an IBM Fusion chip. The neural network was trained to recognize a number drawn on the screen in real time.
While conventional computers solved this classic challenge long ago, we successfully demonstrated that the Fusion chip could dynamically interpret vector inputs from the numbers drawn on the screen.
After a number is drawn in a 28px by 28px input area, each pixel is assigned a value depending on its shade of gray. The pixels are transformed into a vector made of 784 values which are then converted into voltages. These voltages are then fed into the first set of arrays.
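The input conversion can be sketched as follows. The text does not give the gray-level range, the pixel ordering, or the voltage scale, so this example assumes 8-bit grayscale values (0 to 255), row-by-row flattening, a linear mapping, and a hypothetical maximum read voltage `v_max`.

```python
def image_to_voltages(image, v_max=0.2):
    """Flatten a 28 x 28 grayscale image into 784 input voltages.

    image: 28 rows of 28 integers in 0..255 (assumed 8-bit gray levels).
    v_max: hypothetical voltage, in volts, assigned to the brightest pixel.
    """
    if len(image) != 28 or any(len(row) != 28 for row in image):
        raise ValueError("expected a 28 x 28 image")
    # Row-major flattening, then a linear map from gray level to voltage.
    return [pixel / 255 * v_max for row in image for pixel in row]


# A blank canvas with one fully lit pixel in the first row, second column.
canvas = [[0] * 28 for _ in range(28)]
canvas[0][1] = 255
voltages = image_to_voltages(canvas)
```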
The first set of arrays is made of 784 x 250 PCM devices arranged in a crossbar configuration and represents the first layer of the neural network. The 784 input voltages from the input layer are fed into both the positive and the negative array, which together encode the synaptic weights. The output current is then obtained by subtracting the current of the negative array from the current of the positive array.
Output currents from the first set of arrays go through a nonlinear activation function to obtain the final output of the 250 hidden neurons. These outputs are fed into a second set of arrays: one positive, one negative, each made of 250 x 10 PCM devices.
The 10 output currents from the second set of arrays are subtracted to obtain the output of the second synapse layer. This output then goes through another nonlinear activation function to obtain the final output of the neural network, and the system’s classification of the number that was drawn.
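The whole pipeline described above can be summarized in a small software model. This is a sketch, not the chip's implementation: the source does not name the activation function, so a sigmoid is assumed here, and the helper names and the tiny 2-2-2 example network are hypothetical stand-ins for the 784-250-10 network.

```python
import math


def array_output(conductances, inputs):
    """Output of one crossbar array: weighted sum of the inputs per column."""
    outputs = [0.0] * len(conductances[0])
    for row, x in zip(conductances, inputs):
        for j, g in enumerate(row):
            outputs[j] += g * x
    return outputs


def synapse_layer(g_pos, g_neg, inputs):
    """Subtract the negative array's output from the positive array's output."""
    pos = array_output(g_pos, inputs)
    neg = array_output(g_neg, inputs)
    return [p - n for p, n in zip(pos, neg)]


def sigmoid(x):
    """Assumed nonlinear activation; the source does not specify which is used."""
    return 1.0 / (1.0 + math.exp(-x))


def classify(inputs, layer1, layer2, activation=sigmoid):
    """Run both synapse layers and return the index of the strongest output."""
    hidden = [activation(v) for v in synapse_layer(*layer1, inputs)]
    scores = [activation(v) for v in synapse_layer(*layer2, hidden)]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy network in normalized units; each layer is a (positive, negative) pair.
layer1 = ([[1.0, 0.0], [0.0, 1.0]], [[0.0, 0.0], [0.0, 0.0]])
layer2 = ([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
predicted = classify([1.0, 0.0], layer1, layer2)
```

With the real network, `inputs` would hold the 784 pixel voltages and the returned index would be the recognized digit.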
The Fusion chip is a test prototype with which only a single PCM device can be accessed at a time. Future chips will allow us to perform the matrix multiplications all in parallel in a single time step, greatly improving the speed and energy efficiency.