Our analysis of the evolution of memory begins in the dynamic memory era, that is, with dynamic random access memory (DRAM).

Without going into technology specifics such as the structure of a memory cell, the distinguishing characteristics of DRAM versus SRAM (static RAM) are basically twofold: (1) the full address is usually presented to SRAM just once, while it is multiplexed to DRAM, first the row and then the column; (2) DRAM also needs to be refreshed periodically to maintain the integrity of stored data.

This memory family kicks off with Fast Page Mode (FPM) DRAM. In the early days, memories used 5V technology and were asynchronous: the devices did not require a clock input to synchronize commands and I/O. Data access time was around 35 ns from the moment the memory controller (located in the CPU or chipset) supplied the row address, and 13 ns from the column address. Right from the first implementations, once the row address had been supplied, it was possible to vary the column so that data arrived every 13 ns. A further improvement came in 1995 in the form of extended data output (EDO) DRAM, which simply held the read data stable until the falling edge of CAS# in the next cycle, rather than putting the outputs into high impedance at the rising edge of CAS#. This allowed the CAS# precharge time (tCP) to overlap with the data output, which shortened the bursts from X-3-3-3 cycles of the front side bus (FSB) to X-2-2-2 cycles. This simple improvement enabled a ten percent increase in performance at the same price. This was the Pentium era, with a 133-200 MHz internal clock and a 66 MHz FSB.
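
The burst patterns translate directly into a sustained transfer rate while a row stays open. The following is a minimal Python sketch, assuming one data transfer per burst beat and ignoring the initial X cycles; the function name is hypothetical and chosen only for illustration.

```python
def sustained_rate_mts(fsb_mhz, cycles_per_beat):
    """Transfers per microsecond (MT/s) when one beat arrives every
    `cycles_per_beat` cycles of a bus clocked at `fsb_mhz`."""
    return fsb_mhz / cycles_per_beat

# 66 MHz FSB, as in the Pentium-era systems described above.
print(sustained_rate_mts(66, 3))  # FPM, X-3-3-3 -> 22.0
print(sustained_rate_mts(66, 2))  # EDO, X-2-2-2 -> 33.0
```

These are peak in-page figures; the first access of each burst (the X) still pays the full row and column latency.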

The next step for the technology is synchronous DRAM (SDRAM). Among the changes introduced with this type of memory we have: (1) the signals reaching the device include a 100-133 MHz clock (PC100 and PC133), hence the name, (2) the power supply voltage is reduced to +3.3V, marking the beginning of a continuous reduction, (3) signaling is LVTTL, (4) read and write accesses are burst-oriented, with the burst length and other operating parameters programmed during initialization[i], and (5) the device is organized into four internal banks. As in the case of FPM and EDO, accesses begin with the row: an ACTIVATE command is registered first, followed by a READ or WRITE command. The address bits registered at the same time as the ACTIVATE command are used to select the bank and row to be accessed. The address bits registered at the same time as the READ or WRITE command are used to select the bank and the starting column location for the burst access. For the PC133 specification, access time is 30 ns from row activation (tRCD + CL x tCK) and 15 ns (CL x tCK) from registration of the READ command until the first burst beat becomes available; the following three beats arrive at a rate of one per clock cycle: X-1-1-1. On top of this, once a row in a bank is open, any column in that row can be accessed without having to wait for the row to be reopened; the burst under these conditions is 2-1-1-1, compared with 4-1-1-1. However, by far the biggest advance over the previous EDO memories has less to do with latency than with the possibility of initiating a second access in another bank while the previous one is still in progress. Thus, bursts can be placed back-to-back: X-1-1-1-1-1-1-1 compared with X-2-2-2-X-2-2-2, and at the same time the clock frequency increases from 66 to 133 MHz. By the year 2000, this technology had completely replaced EDO.
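
To make the burst notation concrete, the sketch below converts a pattern such as 4-1-1-1 into beat arrival times. It is an illustration only: the helper name is hypothetical, and the 7.5 ns clock period is an assumed nominal value for a 133 MHz clock.

```python
def beat_arrival_ns(pattern, tck_ns):
    """Cumulative arrival time (ns) of each beat for a pattern like '4-1-1-1'."""
    arrivals = []
    elapsed_cycles = 0
    for cycles in pattern.split("-"):
        elapsed_cycles += int(cycles)
        arrivals.append(elapsed_cycles * tck_ns)
    return arrivals

TCK_PC133_NS = 7.5  # assumption: nominal period of a 133 MHz clock

print(beat_arrival_ns("4-1-1-1", TCK_PC133_NS))  # row closed: [30.0, 37.5, 45.0, 52.5]
print(beat_arrival_ns("2-1-1-1", TCK_PC133_NS))  # row open:   [15.0, 22.5, 30.0, 37.5]
```

The first entries reproduce the 30 ns and 15 ns access times quoted for PC133.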

The next improvement came in the form of Double Data Rate (DDR): (1) the power supply voltage is reduced to +2.5V, (2) signaling is now SSTL and continues to be so right through DDR3, (3) device capacity increases to 1 Gb, (4) the clock becomes differential, (5) each byte/nibble is accompanied by a co-directional data strobe (DQS) used as a clock to capture the data and, hence the name, (6) data is captured on both edges of that strobe. This innovation doubles the amount of information transferred in each clock cycle. The voltage reduction and other improvements allow the clock frequency to increase to 167 MHz (although there were 200 MHz versions powered at +2.6V). Although the access time to the first data beat is still 30 ns from the row and 15 ns from the column, subsequent beats are received every 3 ns (tCK/2 @ 167 MHz). Thus, all of the information contained in a burst is received in 42 ns (30 ns + 4 x 3 ns), or 27 ns (15 ns + 4 x 3 ns) if the row is already open.

The evolution continues with DDR2: (1) the power supply voltage is reduced to +1.8V, (2) device capacity increases to 2 Gb[ii], (3) the number of banks is doubled to eight, (4) the DQS strobes become differential, and (5) dynamically activated on-die termination resistance (ODT) is included on the data lines to improve signal integrity. The clock frequency increases to 533 MHz. Row and column access times to the first burst data beat vary little, being 26.25 ns and 13.125 ns respectively, but subsequent beats arrive far faster, one every 0.94 ns (tCK/2 @ 533 MHz). The entire 4-beat burst is transferred in 30 ns from row activation, which is the worst case.
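
The burst totals quoted for DDR and DDR2 follow from a simple sum. The sketch below assumes that the total counts one half clock period for each of the four beats on top of the latency to the first beat; the function name is hypothetical.

```python
def burst_time_ns(first_beat_ns, clock_mhz, beats=4):
    """Time to complete a burst: latency to the first beat plus one
    half clock period (double data rate) for each beat."""
    beat_period_ns = 1000.0 / clock_mhz / 2
    return first_beat_ns + beats * beat_period_ns

print(round(burst_time_ns(30.0, 167), 1))   # DDR from row activation  -> 42.0
print(round(burst_time_ns(26.25, 533), 1))  # DDR2 from row activation -> 30.0
```

Note how the latency to the first beat dominates the total: raising the clock shrinks only the second term.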

The next generation, and we are now nearing the present, is DDR3(L): (1) the power supply voltage is reduced to +1.5V, and even +1.35V in the low-power version, (2) capacity ranges from 4 Gb (+1.5V) to 8 Gb (+1.35V), (3) the number of banks remains the same, (4) the number of ODT termination values increases from three to five, (5) you can choose between two different memory driver strengths and, most importantly, (6) the bus routing paradigm between the DRAM and the memory controller changes. We pass from a symmetrical tree-type topology for command/address/control signals, with static skew control between them and the data bus, to a “fly-by” topology for command/address/control and clock (CK) lines, in which the relationship between the DQS strobe and the clock (CK) is de-skewed at the DRAM. The controller, aided by the memory, must implement this process during the so-called Write Leveling initialization phase. This is when the controller adjusts each byte’s DQS strobe delay in submultiples of the clock period until it is aligned with the clock signal. During each step of the process, the DDR3 memory samples the clock signal at the rising edge of DQS and returns the sampled value on the least significant bit of the octet/nibble. The process ends when the controller sees the returned CK value change from 0 to 1. The corresponding delay is the value that de-skews the trace length mismatch between ADD/CMD/CTL/CK and the corresponding octet/nibble. The new topology allows the operating frequency to double to 1066 MHz, so that the access times to the first data beat are 26.22 ns from the row (13.09 ns tRCD plus 13.13 ns CL) and 13.13 ns from the column, while the latency to each following beat is reduced to 0.469 ns. Thus the burst transfer takes 28.1 ns from the row and 15 ns from the column.
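
The Write Leveling search can be sketched as a small simulation. This is a simplified model under stated assumptions, not a controller implementation: CK is an ideal 50% duty-cycle square wave, the skew, clock period and step size are hypothetical values in picoseconds, and all names are illustrative.

```python
def ck_level(t_ps, tck_ps):
    """Ideal clock: high during the first half of each period, low during the second."""
    return 1 if (t_ps % tck_ps) < tck_ps // 2 else 0

def write_level(ck_skew_ps, tck_ps, step_ps):
    """Increase the DQS delay step by step until the CK value sampled by the
    DRAM at the DQS rising edge changes from 0 to 1. Returns the step index,
    or None if no transition is seen within two clock periods."""
    previous = None
    for step in range(2 * tck_ps // step_ps):
        dqs_edge_ps = step * step_ps
        # CK reaches this DRAM ck_skew_ps late because of the fly-by routing.
        sample = ck_level(dqs_edge_ps - ck_skew_ps, tck_ps)
        if previous == 0 and sample == 1:
            return step
        previous = sample
    return None

step = write_level(ck_skew_ps=330, tck_ps=1000, step_ps=25)
print(step, step * 25)  # 14 350: first delay at or beyond the 330 ps skew
```

The search stops on the 0-to-1 transition rather than on the first sampled 1, so it still finds the clock edge when the initial sample happens to land in the high phase.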

Finally we come to the latest step in the DRAM evolution, DDR4: (1) the power supply voltage is reduced once again to +1.2V, (2) signaling changes to POD, (3) capacity increases to 16 Gb, (4) the number of banks is doubled to 16[iii], (5) frequency is increased to 1600 MHz, and as a result, (6) we see an increase in the ODT values (with up to seven possible values). Performance increases proportionally with the increase in clock frequency.

In view of the calculated access times, which are always around 30 ns from the row and 15 ns from the column, you could be forgiven for thinking that performance has failed to increase significantly over time. However, that perception does not do justice to reality: the controller usually keeps several banks active (up to sixteen with DDR4), so although the aforementioned latency remains, the controller can schedule the accesses so that the bursts are placed back-to-back, achieving a throughput two orders of magnitude higher than FPM and EDO. Let’s take an example: suppose the program flow requires the activation of one row followed by another and so on, and that, as a result, the controller activates the first row in cycle N, the next one in cycle N+2 and so on. With DDR4-3200 we would have the first data beat available in cycle N+44, the second in N+44.5, the third in N+45 and the fourth and last of the first burst in N+45.5. The first beat corresponding to the N+2 activation would appear in N+46, the second in N+46.5 and so on. As you can see, the throughput is one data beat every 0.5 x tCK, with tCK being the inverse of 1600 MHz (625 ps), which expressed in transfers per second is 3200 MT/s, compared to the 22 and 33 MT/s data rates obtained with FPM and EDO, respectively.
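
The back-to-back schedule in this example can be reproduced with a few lines of Python. The sketch takes the figures from the text (activations two cycles apart, first beat 44 cycles after activation, four beats per burst, one beat every half cycle); the function and parameter names are hypothetical.

```python
def beat_schedule(activations, act_spacing=2, first_data=44, beats=4, beat_period=0.5):
    """Cycle offsets (relative to N) at which each data beat arrives."""
    times = []
    for i in range(activations):
        start = i * act_spacing + first_data
        times.extend(start + b * beat_period for b in range(beats))
    return times

schedule = beat_schedule(2)
print(schedule)  # [44.0, 44.5, 45.0, 45.5, 46.0, 46.5, 47.0, 47.5]

# Every gap is half a clock period, so the data bus never idles.
gaps = {later - earlier for earlier, later in zip(schedule, schedule[1:])}
print(gaps)  # {0.5}

clock_mhz = 1600
print(clock_mhz / 0.5)  # 3200.0 MT/s
```

The 44-cycle latency shifts the whole schedule but does not appear in the throughput, which depends only on the spacing between beats.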

Teldat devices have not stayed on the sidelines of this evolution. The N+ used FPM DRAM to operate at 33 MHz; the ATLAS200, ATLAS 250 and ATLAS 150 used SDR SDRAM to operate at 50 MHz, 66 MHz and 100 MHz, respectively, depending on the version; the ATLAS160 and ATLAS360 use DDR2 at 200 MHz (400 MT/s) and 266 MHz (533 MT/s); the ATLAS6x inaugurated the use of DDR3 at 333 MHz (666 MT/s), and the more modern iM8 and i70 routers use the same memory type (DDR3) to achieve 1600 MT/s transfers.

[i] Burst lengths of 4 are considered in the text.

[ii] Only single-die parts are considered.

[iii] The sixteen banks are actually organized into four groups of four banks. New timing restrictions, which depend on whether accesses fall in the same bank group or in different ones, have implications for controller design.


About the author

Manuel Sanchez
Manuel Sánchez González-Pola, Telecommunications Engineer, is part of Teldat’s R&D Department. Within this department he works as a Project Manager in the Hardware team.  
