A version of this post originally appeared on the IBM Cloud blog.
Michelangelo spent four years painting the ceiling of the Sistine Chapel, covering a total of 5,000 square feet. Imagine if Michelangelo could have cloned himself into hundreds of equally talented artists, each working in parallel on a small tile of the ceiling. An army of Michelangelos could have finished within weeks!
When we “burst” workloads onto the cloud, the concept is the same. As one example, we discuss an optical proximity correction (OPC) workload. An engineer starts with an intended semiconductor design, which goes through a computationally intensive process of shape manipulation (Figure 1). The altered design is created on a photomask, then transferred through a lithographic process onto a silicon wafer for multiple levels, and the final chip ends up in an IBM server (or in your cell phone).
By breaking up the design into millions of tiles and then running OPC on all of the tiles in parallel, one can take a slow step in the semiconductor manufacturing process and speed it up by an order of magnitude. OPC run times are usually limited by the number of cores available on a farm. OPC runs that might take over a week using ~2,000 cores can be scaled to 10,000 cores or more in the cloud, with a tremendous runtime improvement. Considering that a typical semiconductor device may have well over 50 mask layers (all of which need OPC), reducing the OPC run time of each layer translates directly to critical time-to-market gains.
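The tiling idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `split_into_tiles`, `correct_tile` and `run_all_tiles` are hypothetical names, the per-tile "correction" is a placeholder that returns the tile area, and a local thread pool stands in for a cluster of worker cores. A production OPC tool would also have to account for the neighborhood of each tile, which this sketch ignores.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterator, List, Tuple

# (x0, y0, x1, y1) in arbitrary layout units
Tile = Tuple[int, int, int, int]


def split_into_tiles(width: int, height: int, tile_size: int) -> Iterator[Tile]:
    """Yield non-overlapping tiles that together cover a width x height layout."""
    for y0 in range(0, height, tile_size):
        for x0 in range(0, width, tile_size):
            yield (x0, y0, min(x0 + tile_size, width), min(y0 + tile_size, height))


def correct_tile(tile: Tile) -> int:
    """Hypothetical stand-in for per-tile OPC: it simply returns the tile area."""
    x0, y0, x1, y1 = tile
    return (x1 - x0) * (y1 - y0)


def run_all_tiles(width: int, height: int, tile_size: int, workers: int) -> List[int]:
    """Process every tile independently; results come back in tile order."""
    tiles = list(split_into_tiles(width, height, tile_size))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(correct_tile, tiles))


if __name__ == "__main__":
    areas = run_all_tiles(width=1000, height=600, tile_size=250, workers=4)
    print(len(areas), sum(areas))
```

Because no tile depends on the result of another, adding workers shortens the wall-clock time without changing the answer, which is the property that makes this kind of workload a natural fit for bursting.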
The testchip vehicle that we’ve used for this work has been discussed in a prior IBM announcement. It is a state-of-the-art EUV lithography mask with a minimum pitch of 28 nm (indicating minimum 14 nm wide lines), and it supports logic scaling below the 5 nm technology node. Figure 1 illustrates a simple example of OPC on both a rectangle and “T-shaped” design.
To achieve a robust design on the silicon wafer, we must distort or “correct” the images on the photomask. The local environment or “proximity” of the design plays an important role as well, hence the appropriate name: optical proximity correction (OPC). Figure 1 shows a simple example of two shapes, but a semiconductor design may have up to trillions of shapes, all of which need a correction specific to their environment. For all of these reasons, this workload is computationally intensive, embarrassingly parallel, and a perfect example for cloud bursting.
The OPC correction flow consists of three steps: retargeting, OPC and verification. Figure 2 illustrates the OPC flow sequence and main purpose of each step. For each of these steps, we have utilized the Synopsys Proteus suite of tools. During the first step of retargeting, input design shapes are modified to offset the effects of downstream process biases created during reactive ion etch, wet processes, liner cleans, etc., or to offset weaknesses in a photolithography process.
The second step in the OPC flow is the actual optical proximity correction on target layers that include shapes from Step 1. The simulation engine uses mathematical models from the illumination source, photo resist, flare map, etc. to perform iterative calculations that optimize the mask shape for final printing.
Finally, the third step is verification. Typically, verification includes both optical simulation and rule-based checking. This demo includes the rule-based portion, or mask rule checking (MRC). It performs several geometric measurements on all shapes produced from Step 2 to ensure that the generated mask shapes conform to mask-making requirements. The MRC portion of these recipes contains some input-output limitations and does not typically require the full complement of worker nodes to complete its tasks. While the input-output portions of the run are parallel, they do not typically scale as linearly as the main OPC step.
IBM’s cloud infrastructure provides a rich set of application programming interfaces, as well as command-line tools, a web-based interface and Terraform plugins, to provision infrastructure. For this work, we chose to leverage Terraform and Ansible to provision a robust Linux cluster capable of running OPC workloads spanning three data centers and up to 11,400 worker cores in total.
To simulate a hybrid cloud infrastructure, we chose a multicloud approach in which a virtual private cloud was created in our US-South region to house license servers. This virtual private cloud in US-South was then connected to a second virtual private cloud in IBM’s Great Britain data centers (EU-GB region). The configuration is similar to what might be used in a typical customer burst scenario where the license servers in US-South represent an on-prem data center, and the Linux cluster in EU-GB represents burst compute capacity, as shown in Figure 3. In the case of a true burst run, the IBM Cloud Transit Gateway connection in Figure 3 would be replaced by a secure IBM Cloud Direct Link network connection.
To automate the provisioning of this infrastructure, we used the IBM Terraform plugin to provision the license servers, storage clusters, head node and worker nodes. In this case, we attached the Transit Gateway using the IBM Cloud web user interface, but this step could be completed as part of the Terraform run.
Within EU-GB, IBM Cloud has three distinct zones to provide high availability, fault tolerance and flexible capacity. In our current demonstration, OPC runs were deployed across computing resources in each of these three zones to achieve the desired scale and also test application performance distributed across three availability zones. In addition, we chose to host a shared filesystem service in London 2, as shown in Figure 3.
IBM Cloud virtual servers enable network traffic separation and isolation by attaching multiple network interfaces per VSI. We leveraged this capability and configured our OPC cluster with three distinct networks. One network is used to connect all the storage nodes to license servers in Dallas, as shown by the yellow line. Another network is used to connect all the compute nodes across all three zones, as shown by the dark black line. A third network is used to connect the compute nodes with the storage nodes, as shown by the green and purple lines in Figure 3. This architecture ensures that the traffic to I/O servers is isolated from the traffic among compute nodes, and it also provides dedicated 2 x 16 Gbps bandwidth outside of each VSI: one 16 Gbps network for compute communication and another 16 Gbps network for all storage operations.
In this work, we explore scaling runs for a challenging beyond-5 nm EUV thin-wire level using a Linux cluster that spans three availability zones. The layout that we chose consists of a 172 mm² chiplet that contains thin-wire design and macro content. This EUV run featured full flare compensation and was taken from a chiplet currently running in IBM’s Albany NanoTech wafer facility. Ground-breaking lithography results from this chiplet and others were featured recently in a press release about the team in Albany.
For this work, we explored scaling of the OPC portion of the recipe over a range of core counts from 2,000 physical cores up to 11,400 physical cores, as shown in Figure 4. The plot shows 1,000/number of cores versus the OPC running time in minutes for the 172 mm² thin-wire chiplet.
The OPC portion of the recipe is expected to scale linearly throughout a broad range of core counts and is a good indication of scalable performance of cluster infrastructure. Figure 4 clearly shows a linear response to scaling over this range and is remarkable for a number of reasons.
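To make "linear" concrete: if the work divides evenly, run time should be a straight line in 1,000/cores, which is why the plot uses that axis. The sketch below fits such a line with ordinary least squares and reports R² as a rough linearity check. The core counts span the range discussed here, but the run times are synthetic values generated from an assumed slope and intercept; they are not the measured results from Figure 4, and `fit_line` is a hypothetical helper.

```python
from typing import List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept; also returns R^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    core_counts = [2000, 4000, 6000, 8000, 11400]
    xs = [1000.0 / cores for cores in core_counts]
    # SYNTHETIC run times in minutes (assumed slope 400, assumed fixed cost 5),
    # used only to exercise the fit; replace with measured values.
    ys = [400.0 * x + 5.0 for x in xs]
    slope, intercept, r_squared = fit_line(xs, ys)
    print(f"slope={slope:.1f} intercept={intercept:.1f} R^2={r_squared:.4f}")
```

With real measurements, an R² close to 1 and a small intercept would suggest that nearly all of the run time is shrinking in proportion to the added cores, while a large intercept would point to a fixed cost that extra cores do not remove.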
First, this run was implemented using fully virtualized server instances (VSIs) across three availability zones. As with many cloud infrastructures, networking is top-notch, but there are physical limits to how far latency can be reduced across data centers. In these runs, we did not observe any evidence of latency issues when running across three data centers.
Second, these results were obtained using a scratch filesystem built over stock VSIs using GlusterFS. GlusterFS performed well under this use case and did not impact scaling over a range of worker nodes from 2,000 to 11,400 physical cores. OPC, as a rule, is a hub-and-spoke application that requires a head node to communicate in a many-to-one configuration with multiple server processes reading and writing to the same shared directories. If shared filesystem latency were an issue, one would expect to see deviations from linear scaling across this range. As an example, with kernel-based NFS approaches, one can often see deviations from linear scaling in the range above 3,000 physical cores. Often, these limitations are overcome by resorting to Ganesha parallel NFS or other parallel NFS approaches, which were not needed with GlusterFS.
Third, in the scaling results shown in Figure 4, we do not observe indications of network latency on the application network between the head node and worker processes. In some cases, bottlenecks in large runs can be observed in which the main OPC process fails to get messages from worker processes in a predefined time. If this happens, often a tile has to be reprocessed, which will manifest itself as a deviation from linear scaling in OPC. We see no evidence, either in the scaling plots (Figure 4) or in the software logs, of a problem with communication between the main process and the workers.
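The link between missed replies and lost scaling can be shown with a toy model. The sketch below is not how any particular OPC tool schedules work; `dispatch_with_retries` and `replied_in_time` are hypothetical names, and the "deadline" is simulated by a callback rather than a real network timeout. It simply requeues any tile whose reply is late and counts how many attempts each tile needed.

```python
from collections import deque
from typing import Callable, Dict, Iterable


def dispatch_with_retries(
    tiles: Iterable[int],
    replied_in_time: Callable[[int, int], bool],
    max_attempts: int = 3,
) -> Dict[int, int]:
    """Return the number of attempts per tile, requeueing tiles whose reply was late.

    replied_in_time(tile, attempt) is a hypothetical callback that stands in
    for "the head node heard back from the worker before the deadline".
    """
    attempts: Dict[int, int] = {}
    queue = deque(tiles)
    while queue:
        tile = queue.popleft()
        attempts[tile] = attempts.get(tile, 0) + 1
        if replied_in_time(tile, attempts[tile]):
            continue  # tile finished, nothing more to do
        if attempts[tile] >= max_attempts:
            raise RuntimeError(f"tile {tile} failed after {max_attempts} attempts")
        queue.append(tile)  # late reply: the tile is processed again
    return attempts


if __name__ == "__main__":
    late_on_first_try = {3, 7}  # illustrative: two of ten tiles miss the deadline once
    attempts = dispatch_with_retries(
        range(10), lambda tile, attempt: not (tile in late_on_first_try and attempt == 1)
    )
    total = sum(attempts.values())
    print(f"{total} tile runs for {len(attempts)} tiles")
```

Every requeued tile is work that is done twice, so the total number of tile runs grows beyond the number of tiles. If late replies became more frequent at higher core counts, that extra work would bend the scaling curve away from a straight line.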
In summary, we have scaled a state-of-the-art EUV OPC run to 11,400 cores and demonstrated a linear run time reduction. Furthermore, we were able to achieve this without latency issues across data centers, without shared file latency issues and with no observed network (worker communication) latency. For future work, we plan to push these scaling results even higher. The primary questions are: How many clones of Michelangelo are possible? And what’s the shortest amount of time we can achieve for painting our version of the Sistine Chapel?