Clocks Getting Skewed Up

The clock network is complex and critical to performance, but it is often treated as an afterthought. Getting it wrong can ruin your chip.

At a logical level, synchronous designs are very simple and the clock just happens. But the clocking network is possibly the most complex in a chip, and it’s fraught with the most problems at the physical level.

To some, the clock is the AC power supply of the chip. To others, it is an analog network almost beyond analysis. Ironically, there are no languages to describe clocking, few tools to automate its design, and a patchwork of analysis systems. In addition, few significant advances have been made in the way clocks are distributed, even as distribution becomes an increasingly complex problem.

“The clocking signals are the lifelines of chips,” says Pradeep Thiagarajan, principal product manager at Siemens EDA. “They pretty much dictate the functionality of all blocks within an IC, directly or indirectly, whether analog or digital, and synthesizing a clock frequency and its derivatives is an art of its own. However, distributing them reliably into, across, and out of a packaged chip can be a bear, and herein lies many complications. The main challenge lies in ensuring that the wave profile of any clock signal does not deteriorate beyond expectations while en route to the recipient circuit or location. By wave profile I mean amplitude, frequency, slew rate, duty cycle, and its application-specific jitter.”

With an increasing number of chips approaching the reticle limit, it becomes impossible to operate a chip from a single synchronous clock. “If you can’t get across the chip in one clock cycle, you have to look at things as being locally synchronous, but have longer distance runs that are either asynchronously clocked, where you go through synchronizers, or you use methodologies from the good old CPU times: you build a low-skew clock grid,” says Michael Frank, fellow and system architect at Arteris IP. “The problem is that clocks take power, and you have big rebuffered trees that feed a large number of flops.”

Power-frugal designs tend to have large numbers of clock domains. “A Bluetooth chip probably has hundreds of clock regions,” says Marc Swinnen, product marketing director for the semiconductor business unit at Ansys. “They divide it up into lots of clocks so they can switch off things on a very atomic level. Big desktop-driven, CPU-type things, they’re all just one big clock region. They don’t even bother. The number of clock regions goes up as you get more and more concerned about power.”

But CPUs have a number of different problems. “If you run a big area synchronously, you’re feeding from one point and you have to distribute this in a tree that’s balanced,” says Arteris’ Frank. “Then you are trying to make sure that everybody who talks to each other with logic signals sits somewhere at the same level of hierarchy in that tree. That causes a lot of problems, especially if you take into account that the clock by itself is not a fixed thing. You might have dynamic voltage and frequency scaling (DVFS), and you have to test these things as well. You may have a scan process that uses the clocks to move and shift everything through a big shift register, and now you have to meet setup and hold times for different logic paths between the flops.”

Clocks are pervasive. “You have to think about the clock at a block level, right next to the flops, and you have to think about it at the core level, then you have to think about it at a cluster of cores and you have to think about it if you have a NoC or some sort of interconnect, and then of course any sort of IOs and memories,” says Mo Faisal, president and CEO of Movellus. “Every single layer of the SoC, every single level of the hierarchy, it’s a big deal and it makes an impact. Every picosecond counts, and people have some techniques at their disposal, but it’s very limited. The architectures being used are actually quite outdated. Little innovation has happened in the architectures of clock networks in the last 20-plus years.”

Defining clocks
Clocking impacts every stage of the chip development process, from architectural definition to physical analysis to test. But for most companies, there are no clearly defined methodologies or people charged with overseeing its design.

“It all starts from a clock architecture specification, which is ad hoc,” says Paras Mal Jain, R&D director for CDC and SDC constraint products at Synopsys. “There is no defined language for clocks. Every customer is doing it differently. One company will do it using CSV files, another will do it using XML. They are all error-prone. There is no one owner for the clock network. From the clock architect spec, they will likely use their own mechanisms to generate constraints. These flow into multiple tools. They will go to verification, they will go to implementation, sign-off, and downstream tools. Nobody’s tracking this, and at each stage, changes are being made to the clock. Nobody’s making sure these clock network changes are actually in sync with the clock architecture.”

Some companies do have clock architects. “If you look at how many clock architects there are in the industry, the number is very limited,” says Movellus’ Faisal. “It’s between 50 and 100 total. These are the clock architects who know how to do high-performance clock distribution structures.”

These are very special people. “It is a career of its own,” adds Siemens’ Thiagarajan. “Distributing the clocks requires someone to think the normal way and out-of-the-box to assess all kinds of issues. It’s definitely a high-focus realm to invest effort into, ensuring that the distribution happens well within the chip and outside the chip. It requires knowledge of both front-end and back-end tasks. A holistic approach is required when deciding what kind of structures, or repeaters, you want to use on the chip. It depends on the signal requirements for integrity, how far you want it to go, and how robust you need it to be.”

Most companies have no such person. “For CPU designs, clocks are considered fairly early on,” says Frank. “In other designs, which develop relatively limited IP, this is one of those problems that usually gets procrastinated until physical design hits the fan. Then you do clock tree synthesis (CTS). There are tools for helping you there, and you then try to close timing based on clocks that you have inserted.”

If you don’t start with a clock specification, you still have to generate the constraints. “The tree is a set of sub-trees and there are constraints not only from top to bottom, but also from branch to branch,” says Ansys’ Swinnen. “Sometimes you have multi-cycle paths. You could make a simple constraint and say the whole tree has to have 100 picosecond skew. That would be great and really make your life easy, but that also would give you a huge, very heavyweight tree and consume a lot of power and a lot of area. To fix that, you have to break it up and say, ‘This section of the logic doesn’t really need this tight of a skew, and this section of the logic really needs to be much shorter in latency.’ Then a CTS tool has to manage all these constraints and give you a minimal power structure that achieves those skew and latency requirements.”
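The trade-off Swinnen describes, one tight global skew target versus looser per-section targets, can be illustrated with a minimal Python sketch. The group names, arrival times, and limits below are hypothetical values chosen for illustration, and the check is far simpler than what a real CTS tool does.

```python
from dataclasses import dataclass


@dataclass
class GroupConstraint:
    max_skew_ps: float     # allowed spread of clock arrival times within the group
    max_latency_ps: float  # allowed worst-case insertion delay from the clock root


def check_group(arrivals_ps, constraint):
    """Return (skew, latency, ok) for one group of clock sink arrival times."""
    skew = max(arrivals_ps) - min(arrivals_ps)
    latency = max(arrivals_ps)
    ok = skew <= constraint.max_skew_ps and latency <= constraint.max_latency_ps
    return skew, latency, ok


# Hypothetical arrival times (ps from the clock root) for two sub-trees.
arrivals = {
    "datapath": [410.0, 435.0, 450.0, 428.0],
    "control": [620.0, 790.0, 705.0],
}

# Hypothetical per-section targets: tight skew and short latency where needed,
# relaxed targets elsewhere.
constraints = {
    "datapath": GroupConstraint(max_skew_ps=50.0, max_latency_ps=500.0),
    "control": GroupConstraint(max_skew_ps=200.0, max_latency_ps=900.0),
}

report = {name: check_group(arrivals[name], constraints[name]) for name in arrivals}

# Skew measured across every sink, as a single whole-tree constraint would see it.
all_arrivals = [t for group in arrivals.values() for t in group]
global_skew = max(all_arrivals) - min(all_arrivals)
```

In this made-up example both groups meet their own targets, while the skew across all sinks is 380 ps. A single 100 ps whole-tree constraint would force the tool to balance sinks that never need to be balanced against each other.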

Relying on these tools has a downside. “If you look at how clock tree synthesis works, you optimize it at design time, but that’s the solution you get,” says Faisal. “In reality, your chip runs in all kinds of different environments. If you’re shipping billions of units, you’re going to have all kinds of silicon skews. With CTS, or any other tool-based technique, you can optimize it for ‘t equals zero.’ After that, everything by definition is sub-optimal.”

Added complications
Other common design techniques complicate things further. “In the early days there were only a couple of modes, functional and test,” says Synopsys’ Jain. “Now there may be dozens of functional modes. Closing timing with all the modes is a big challenge. In the past, companies would make serial runs to verify each mode. That is not scalable. They want EDA tools to be more efficient and do multimode analysis.”

But it is not just modes. “The analog designer tends to focus predominantly on functional modes,” says Thiagarajan. “But on top of that, you’ve got timing corners and reliability testing. It requires a comprehensive level of verification for each mode, and scenarios that you may not see at just an IP or a block level. You only start experiencing them once you expand the scope and look at it using a much bigger cross-section view, or even a chip-level view.”

New levels of complexity are being added, too. “In a monolithic chip, your power network and clock networks are contiguous,” says Faisal. “They’re literally physically connected. In a chiplet-based system, that is no longer the case. Your power supply network is also fragmented. Every chiplet has its own, and then your clock network is all over the place and you no longer have the ability to have a well-defined relationship. This is why the industry now needs to invent all kinds of interfaces — bunch of wires, AIB, and most recently UCIe — to move data from one chiplet to the other. You actually have lost your time base. You don’t know what time it is on different chips.”

Proposed solutions are taking different approaches. “With 2.5D, some people are looking into having clock generation external to the package and feeding from that point, a controlled clock to each chiplet,” says Frank. “Another option is to treat this as an asynchronous boundary. (UCIe is an example.) If you look at designs that build PHYs, clock recovery is everything. You’re putting a lot of effort and a lot of power into regenerating the clocks. A third solution is source synchronous clocking — HBM is an example — where you deliver the clock with the data to build independent tasks, like a localized clock domain that is only the interface. From there, you figure out how to get into the internal clock domains.”

Unintended couplings
Unintended couplings are a growing problem these days. “Clocks and power are tightly related,” says Preeti Gupta, head of PowerArtist product management at Ansys. “Simple things like turning on or off a power domain can cause a major current spike, which can lead to a voltage drop, which in turn can impact timing. Both power gating and clock gating need to be very carefully considered.”

Power variations directly impact the clock. “Clock ticks are supposed to come at the same time,” says Swinnen. “When you have built your clock network, the skew is what it is. But if you monitor the arrival time at a particular flip flop, some ticks will arrive a little early and some will arrive a little late, and it may appear random. It’s called clock jitter. The main cause is dynamic voltage drop. If a bunch of gates switch in a certain area, they pull down the voltage locally. The clock running through the area suddenly slows down a bit. But at the next clock edge, they are not switching, so it comes through a little faster. Depending upon the switching behavior, the clock jitters. You have to take that into account in your setup and hold checks.”
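To make the last point concrete, here is a simplified Python sketch of how jitter can enter setup and hold checks as a clock uncertainty term. The formulas are the usual textbook single-cycle checks. The function names and all numeric values are assumptions for illustration, not figures from any tool or from the people quoted here.

```python
def setup_slack_ps(period, clk_to_q_max, logic_max, setup, skew, uncertainty):
    """Setup slack: data must arrive before the next capture edge.

    skew is capture-clock arrival minus launch-clock arrival.
    uncertainty models jitter that can pull the capture edge earlier.
    """
    required = period + skew - uncertainty - setup
    arrival = clk_to_q_max + logic_max
    return required - arrival


def hold_slack_ps(clk_to_q_min, logic_min, hold, skew, uncertainty):
    """Hold slack: data must stay stable after the capture edge it races against.

    uncertainty models variation that can push the capture edge later.
    """
    arrival = clk_to_q_min + logic_min
    required = skew + uncertainty + hold
    return arrival - required


# Hypothetical path at 1 GHz (all values in picoseconds).
setup_no_jitter = setup_slack_ps(1000, 80, 820, 40, skew=0, uncertainty=0)
setup_with_jitter = setup_slack_ps(1000, 80, 820, 40, skew=0, uncertainty=75)
hold_no_jitter = hold_slack_ps(50, 30, 20, skew=20, uncertainty=0)
hold_with_jitter = hold_slack_ps(50, 30, 20, skew=20, uncertainty=50)
```

With these assumed numbers, a path that passes setup with 60 ps of slack fails by 15 ps once 75 ps of uncertainty is budgeted, and a hold path with 40 ps of slack fails by 10 ps under 50 ps of uncertainty.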

Some problems can remain hidden. “If you have a data path from one power domain to another, one voltage domain to another, and I’m doing traditional verification, I will not see any issue with this,” says Jain. “It looks like the same clock is going to both the receiver and sender, but because they are different voltage domains, you can have a clock domain crossing (CDC) issue. Your clocks may be switching at different times because of power domains and different voltage domains, so you have to really worry about a different kind of verification to handle voltage domains.”

Rising frequencies also can have an impact. “At a gigahertz, you do not have too much opportunity to mitigate power surges because power doesn’t travel very fast,” says Frank. “This is because of the inductance of the power grid, and it is why people started splitting chips into power domains and feeding them with a lot of pins. You see packages that have hundreds of power supply and ground pins. There’s not much you can do if you want to have gigahertz clock frequencies, and in the worst case, everybody’s toggling within the first 200 picoseconds of that gigahertz.”

Power results in heat, and that also can have an impact. “Temperature variations have to be managed as this whole system heats up,” adds Frank. “As the voltage varies, there is noise, wiggles, and shakes. It means you have to think about how much margin you need and how you close timing on your signals.”

Margin directly equates to opportunity lost in performance, power or area.

This, and many other factors, increase the complexity. “Timing closure, variation, skew — it becomes a very complex, multi-dimensional, multi-surface problem, especially if you add DVFS,” says Faisal. “If somebody is running their SoC from 0.5 to 1 volt, they’re going to have a big challenge closing their setup at 0.5 volts, and then closing their hold at 1 volt. One of the big challenges is your clock skews and clock variation are going to be very different, possibly many times different, in those corners.”

Implementing clocks
There are several techniques being considered for clock network implementation. “In advanced finFET process nodes, the trend has been shifting from single-ended distribution schemes to differential schemes for supply noise immunity,” says Thiagarajan. “Differential structures have the ability to operate at lower voltages, which can lead to reduced power utilization and EMI emissions. However, it does increase analog content, and that will need custom layout to address mismatch and precision of differential blocks and associated current reference structures. Differential clocking has pros and cons, especially with newer process technologies. They are intended to increase your circuit density, but it does increase signal congestion.”

Movellus has been developing more adaptive clocking networks. “By adding intelligent clock networks (see figure 1), you can optimize the clock dynamically, as it’s being used in the environment, and for the operating conditions it is in,” says Faisal. “It requires the ability to sense the operating condition and know you are running at 0.5 volt, or 1 volt, and then automatically correct for any on-chip variation and skew. It can adapt to temperature and voltage and correct the alignment of the clocks across the whole chip. Then your timing problem becomes a lot simpler.”

Fig. 1: Smart Clocking Networks. Source: Movellus

Most people continue to use older techniques. “For long-distance communication, people resort to things like source synchronous clocks or completely asynchronous transactions,” says Frank. “Source synchronous is when you send the signal from a source, you generate the clock at the source, and drive it out from the source of your signals. One example is a data bus where you wire it with the clock, so it flies in the same direction, eventually hitting the destination of the data. Since the clock travels with the data, it’s much easier to control the skew. You then latch the data with its own clock and treat that as either a mesochronous (same frequency but different phase) or an asynchronous interface.”

Significant care has to be taken at the interfaces. “High-speed interfaces are increasing the complexity of CDC analysis,” says Jain. “We need to consider both mesochronous and plesiochronous interfaces. Traditional synchronizers will not work. Plesiochronous is when the clocks are not completely synchronous, with the frequencies slightly mismatching. And there is a slight difference in the phase, but they are not totally asynchronous either.”
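A small frequency mismatch means the phase between the two clocks is never fixed. The following Python sketch, using assumed example numbers, estimates how quickly two nominally equal clocks drift by one full cycle relative to each other. The function names are hypothetical.

```python
def cycles_per_slip(ppm_offset):
    """Clock cycles until the two clocks have drifted apart by one full period."""
    return 1e6 / ppm_offset


def seconds_per_slip(freq_hz, ppm_offset):
    """Wall-clock time until the drift reaches one full period."""
    delta_hz = freq_hz * ppm_offset / 1e6  # absolute frequency difference
    return 1.0 / delta_hz


# Assumed example: two 1 GHz clocks that differ by 100 parts per million.
cycles = cycles_per_slip(100)          # 10,000 cycles
seconds = seconds_per_slip(1e9, 100)   # about 10 microseconds
```

In this example the relative phase sweeps through every possible alignment roughly every 10 microseconds, so the receiving clock edge regularly lands close to a data transition. An interface that assumes a stable phase relationship cannot rely on that assumption here.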

Many pitfalls await. “This is why physical design schedules have grown,” says Frank. “I know about a couple of chips that had issues because of not properly doing the clocking. It is something you would expect people have learned by now. But people still have problems when using multiple clock domains, and the domains are running at the same clock frequency. If you’re missing the clock edge, that’s no big deal. But if you’re getting close to each other, you will be exercising metastability problems every single clock cycle. And since this is a statistical thing, you end up getting your average time to failure to a very short time.”
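The statistical behavior Frank describes is commonly estimated with the textbook synchronizer formula MTBF = exp(t_r / tau) / (T_w * f_clk * f_data), where t_r is the time allowed for a flop to resolve, tau is its resolution time constant, T_w is its metastability window, and f_data is the rate of potentially unsafe data transitions. The Python sketch below applies that formula. Every device parameter in it is a made-up illustrative value, not data for any real process or library.

```python
import math


def synchronizer_mtbf_s(resolve_time_s, tau_s, window_s, f_clk_hz, f_data_hz):
    """Mean time between synchronization failures, in seconds."""
    return math.exp(resolve_time_s / tau_s) / (window_s * f_clk_hz * f_data_hz)


# Hypothetical flop characteristics.
TAU_S = 20e-12      # resolution time constant
WINDOW_S = 20e-12   # metastability window

# Two 1 GHz domains whose edges sit close together, so data can change
# near the capture edge on every cycle.
F_CLK_HZ = 1e9
F_DATA_HZ = 1e9

# Little time left to resolve before the value is used.
mtbf_short = synchronizer_mtbf_s(0.2e-9, TAU_S, WINDOW_S, F_CLK_HZ, F_DATA_HZ)

# A full extra cycle of resolution time, as a second synchronizer stage provides.
mtbf_long = synchronizer_mtbf_s(1.0e-9, TAU_S, WINDOW_S, F_CLK_HZ, F_DATA_HZ)
```

Under these assumptions the first case fails on average in well under a second, while the second case lasts many orders of magnitude longer. The exponential dependence on resolution time is why a high event rate combined with little resolving time drives the average time to failure so low.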

Conclusion
Clocks are hard. They are not the perfect logical signals that first appear in your architecture diagrams or in RTL. They have to be treated with respect, and dealing with them should not be a physical implementation issue. Process technologies, device sizes, and new packaging methods add layers of complexity, and analysis tools are stretched to their limit. The earlier you start looking at your clocks, the less likely you are to be skewed.


