A logic partitioning framework and implementation optimizations for 3-dimensional integrated circuits
(USC Thesis Other)
A LOGIC PARTITIONING FRAMEWORK AND IMPLEMENTATION OPTIMIZATIONS FOR 3-DIMENSIONAL INTEGRATED CIRCUITS

by GOPI NEELA

A Dissertation Presented to the FACULTY OF THE GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (ELECTRICAL ENGINEERING)

May 2015

Copyright 2015 Gopi Neela

To my parents... Shri. Neela Shambaiah and Shrimathi. Karuna

Acknowledgments

The pursuit of a doctoral degree is an opportunity for an individual to attain the highest levels of knowledge and thinking available in formal education. Though the rewards are personal, this dissertation would not have been possible without the inspiration and support of many individuals.

It is with immense gratitude that I acknowledge the support and help of my advisor Dr. Jeffrey Draper. I am honored to have worked with him, and could not have asked for a better guide. The freedom he provided me to pursue my research interests and ideas is remarkable. His mentorship extends beyond academic research; he has been a role model for a perfect personal and professional balance in life.

I am very grateful to my dissertation and qualifier committee members, Dr. Sandeep Gupta, Dr. Aiichiro Nakano, Dr. Viktor Prasanna, and Dr. Alice Parker, for providing valuable suggestions and constructive feedback for my research. I would especially like to thank Dr. Sandeep Gupta for his guidance in the early days of my graduate studies.

My special thanks to Jeff Sondeen of the Information Sciences Institute of USC for his indispensable help with design tools for building chip layouts, and for all the useful discussions that contributed to my dissertation. I would like to extend my thanks to my colleagues and staff at the Information Sciences Institute, and also to the academic advisors of the Electrical Engineering department at USC.

I am thankful to each one of my friends for their companionship, which made it possible to sustain the long and challenging life of doctoral studies away from my family.
Finally, and most importantly, I am indebted to my parents Karuna and Shambaiah for their unconditional love and the sacrifices they made for me to accomplish this dream. Their hard work has always been an inspiration for me. I express my gratitude to my brothers-in-law, Subrahmanyam and Kishore, and my sisters Swapna and Swarna, for sharing my responsibilities while I am away from home. I am thankful to my beloved wife, Pravallika, for her support during the final, yet critical days of my graduate life.

Contents

Dedication
Acknowledgments
List of Tables
List of Figures
Abstract
1 About Research
1.1 3-Dimensional Integrated Circuits
1.2 Motivation
1.3 Contributions
1.4 Dissertation Outline
2 3-Dimensional Integrated Circuits
2.1 About 3DIC
2.2 Configurations
2.2.1 Bonding Methods
2.2.2 Homogeneous and Heterogeneous 3DICs
2.2.3 Assembly Options
2.3 Motivation for 3DIC
2.3.1 Advantages
2.4 Challenges
2.5 3DIC Literature
2.5.1 Architectural Exploration
2.5.2 3DIC Modeling
2.5.3 3DIC Tools
3 Design for 3D
3.1 Design Flow
3.1.1 “Design for 3D” Flow
3.1.2 3DIC as an Implementation Platform
3.1.3 Comparison
3.2 Single Precision Floating-Point Unit 3DIC
3.2.1 3DIC FPU
3.2.2 Technology
3.2.3 Implementation
3.2.4 Results
3.3 Energy-Efficient 3DIC Multiplier
3.3.1 Architecture
3.3.2 3DIC Multiplier
3.3.3 Implementation
3.3.4 Simulation Environment and Results
3.4 Observations
4 Logic Partitioning Principles
4.1 Introduction
4.2 Importance of Partitioning
4.2.1 2DIC vs. 3DIC Logic Partitioning
4.2.2 Related Work
4.3 Logic Partitioning Principles
4.4 Analysis
4.4.1 Delay Optimization
4.4.2 Fanout
4.4.3 Advantages
4.5 Demonstration
4.5.1 3DIC Technology
4.5.2 Implementation
4.6 Results
4.7 Conclusions
5 3DIC Average Wire Length Model
5.1 Introduction
5.2 Motivation
5.3 Tier-by-Tier Approach
5.3.1 3DIC Average Wire Length
5.4 Impact of TSV Instances
5.4.1 Gates Interconnects Distribution: I_i(l)
5.4.2 TSV Wiring Penalty
5.4.3 Tier Area and Number of Gates
5.5 Upper Bound
5.6 Experiments and Results
5.7 Conclusions
6 3DIC Logic Partitioning Framework
6.1 Introduction
6.2 Motivation
6.3 Data-Flow Model
6.3.1 Graph Representation
6.3.2 Ranks and Degrees
6.3.3 Design Characteristics
6.4 Optimization
6.4.1 Area-Constrained Optimization
6.4.2 3D-vias Bounded Procedure
6.4.3 Interconnect Count
6.4.4 Computation Reduction Techniques
6.5 3DIC Floating-Point Unit
6.5.1 Application of the Framework
6.5.2 3DIC Implementation
6.5.3 Results
6.6 3DIC Processor
6.6.1 Application of the Framework
6.6.2 3DIC Implementation
6.6.3 Results
6.7 Conclusions
7 Inter-Tier Signals to 3D-Vias Assignment Techniques
7.1 Introduction
7.2 Background
7.3 Design Flow
7.4 Assignment Techniques
7.4.1 Inter-Tier Signal Source/Sink Location Estimation
7.4.2 Assignment Procedures
7.4.3 Analysis
7.4.4 Architecture-Driven Manual Assignment
7.5 Demonstration Designs
7.5.1 Implementation
7.6 Results
7.7 Optimal Techniques with Path Control
7.8 Optimization
7.8.1 Formulation as Assignment Problem
7.8.2 Cost Function
7.8.3 Solution
7.8.4 Path Control
7.9 The Impact of TSV
7.9.1 Assignment of Inter-Tier Signals to TSVs
7.10 Results
7.11 Conclusions
8 Congestion-Aware Assignment Techniques
8.1 Introduction
8.2 Motivation and Related Work
8.3 Assignment Procedure
8.3.1 Cost Function
8.4 Congestion-Aware Techniques
8.4.1 Global Weight Function
8.4.2 Local Weight Function
8.4.3 Solution
8.5 Demonstration Designs
8.6 Results
8.7 Conclusions
9 Conclusions
9.1 Summary
9.2 Future Prospects
References

List of Tables

3.1 Results: 2DIC FPU vs. 3DIC FPU
3.2 Results: 2DIC FPU vs. 3DIC FPU tier-1 and tier-2
3.3 Net switching power (NS) and total power (TP) of all designs
3.4 Comparisons between various designs
4.1 Summary of impact of design partitioning granularity on 3DIC
4.2 Results: S2C based 3DIC FPU vs. C2S based 3DIC FPU vs. 2DIC FPU
4.3 Results: Tier-wise comparison for S2C and C2S methods
5.1 First experiment, configuration-1 results
5.2 First experiment, configuration-2 results
5.3 First experiment, configuration-3 results
5.4 Experiment results for a four-tier F2B bonded 3DIC
6.1 Results: 3DIC FPU vs. 2DIC FPU
6.2 Area of nodes in processor graph expressed as percentage of total area
6.3 Results: 3DIC vs. 2DIC processor
7.1 Results comparison between the techniques for 3DIC multiplier
7.2 3DIC multiplier power consumption
7.3 Results comparison between the techniques for 3DIC FPU
7.4 3DIC FPU power consumption
7.5 3DIC multiplier: Results comparison between the techniques
7.6 3DIC FPU: Results comparison between the techniques
8.1 Congestion-aware technique vs. others for FPU
8.2 Congestion-aware technique vs. others for multiplier

List of Figures

2.1 3DIC structure
2.2 3DIC bonding techniques: clockwise from top-right (i) face-to-face, (ii) back-to-back, and (iii) face-to-back
3.1 3DIC design flow options
3.2 5-stage pipelined FPU with a 2-stage multiplier unit
3.3 FPU
3.4 3DIC technology as explained by the MOSIS
3.5 Multiplier pipeline stage-1
3.6 Stage-2: 32×64 operation
3.7 Stage-2: 64×64 operation
3.8 3DIC multiplier block diagram
3.9 Stage-1 of the symmetric 2DIC design
3.10 Switching energy savings
3.11 Bondpoints: blue-signals, white-unused, red-VDD, yellow-Ground
4.1 Design partitioning techniques
4.2 Example datapath in C2C method
4.3 3DIC FPU layouts
5.1 3DIC bonding techniques, left to right: F2F, F2B, B2B
5.2 Sockets arrangement: gates and TSVs (gray colored)
5.3 Partial circle of radius (l1)
5.4 Illustration for estimating f(l)
5.5 Graph showing degrading 3DIC AWL as TSV size increases
6.1 Graph representation of radix-2 butterfly (FFT) structure
6.2 Graph representation of radix-2 butterfly (FFT) structure at lower granularity than that shown in Figure 6.1
6.3 Graph representation of FPU design, at pipeline stage granularity
6.4 5-Stage pipelined single-precision FPU
6.5 FPU data-flow model with ranks and degrees
6.6 FPU graph optimized for 3DIC design
6.7 Block diagram of the processor
6.8 Data-flow model of the processor
6.9 Processor data-flow model with ranks and degrees
6.10 Processor graph optimized for 3DIC design using only ranks and degrees
6.11 3DIC processor graph obtained using area-constrained approach
7.1 3DIC bonding techniques: clockwise from top-right (i) face-to-face, (ii) back-to-back, and (iii) face-to-back
7.2 3DIC design flow
7.3 Bondpoints connected to power rings and stripes are excluded in the list of available bondpoints
7.4 Multiple sinks illustration
7.5 Midpoint assignment illustration
7.6 3DIC multiplier block diagram
7.7 3DIC FPU block diagram
7.8 Comparison of proposed techniques against manual assignment
7.9 TSV placement. (a): Restricted to fixed regular sites (black-bordered squares). (b): Placed anywhere. Black solid squares represent TSVs.
7.10 Percentage improvement in total and average net length for proposed techniques compared to architecture-driven manual assignment
8.1 3DIC design flow
8.2 Tiers quantized into smaller rectangles to capture congestion and contention information within neighborhood of inter-tier signals
8.3 Multiplier 3DIC block diagram
8.4 FPU 3DIC block diagram
8.5 Histograms of local weights of layer-1, W1(i), of FPU
8.6 Histograms of local weights of layer-1, W1(i), of multiplier

Abstract

A three dimensional integrated circuit (3DIC) is formed by vertically stacking multiple active devices with high-bandwidth vertical interconnect. 3DIC technology achieves higher-density integration, better scalability, and improved performance, as compared to conventional IC technology. While several challenges exist to achieve volume manufacturing, this dissertation addresses two principal challenge areas identified from an experimental logic-on-logic stacked 3DIC design exercise: logic partitioning and implementation techniques, especially concerning inter-tier signals.
For building a 3DIC chip, it is crucial to have a design logic partitioning framework that provides guidelines to both designers and tools. As part of developing such a framework, S2C, C2S, and hybrid (partition at input or output of a sequential cell) design partitioning techniques are introduced to smartly partition a design. These techniques significantly reduce the search space for optimal partitioning. A formal framework for 3DIC logic partitioning requires a determination of an upper bound on the number of 3D-vias, as one cannot assume high availability of 3D-vias, especially in the case of through-silicon-vias (TSVs). A model based on Rent's rule is developed to estimate 3DIC average wire length, which serves as a parameter to obtain an upper bound on the number of 3D-vias. A tier-by-tier hierarchical wire length distribution estimation approach is used, as the Rent's parameters used in modeling may not be the same for all tiers. The proposed model accommodates variable TSV dimensions and different bonding techniques. Results show that a slight difference in the Rent's exponent (about 0.01) between tiers, and different TSV sizes, have significant effects on 3DIC wire length distribution.

Then, a logic partitioning framework is proposed, which can adapt to various partitioning granularities and work with limited amounts of data to enable efficient partitioning even at initial stages of design. This framework uses a data-flow graph model to optimize the 3DIC design for reduced communication between the logic nodes. Multiple optimization algorithms are introduced that fine-tune the 3DIC depending on the design characteristics available to the framework. The framework is demonstrated by partitioning a processor design and a floating-point unit design for 3DIC implementation. The 3DIC processor achieved 15.85% reduced total wire length, 18.68% reduced power consumption, and 16.35% smaller cell area compared to that of a 2DIC implementation.

In a 3DIC, tiers are interconnected using TSVs or top-metal layer bondpoints (micro-bumps), called 3D-vias. Similar to I/O pins in a traditional chip, the locations of 3D-vias carrying inter-tier signals have a significant effect on logic placement, making the assignment crucial in 3DIC implementations. Unlike I/O signals, these inter-tier signals are very large in number, and manual assignment (like assigning I/O signals to pins) is impractical. Hence, optimal assignment techniques suitable for automation are introduced which result in minimum wire length for connecting inter-tier signals, while providing path control by prioritizing the critical paths and limiting the longest path. The proposed techniques successfully automated the assignment process, and achieved up to 9.4% lower total wire length and 10.4% shorter average wire length compared to an architecture-driven manual assignment. Furthermore, congestion-aware techniques are introduced to address congestion and wiring requirements in the 3D-via assignment step. These congestion-aware methods provide globally and locally optimized assignment techniques. Results obtained from layouts show that the designs with higher relative congestion between interconnecting tiers benefit significantly from the proposed congestion-aware techniques.

Chapter 1
About Research

1.1 3-Dimensional Integrated Circuits

Traditional CMOS scaling as projected by Moore's law [1] is arguably in its final stages. Developing the process technology and building a foundry for the next technology node is becoming very expensive [2–6]. Also, wires are dominating the power and delay of digital circuits [7]. An emerging technology called three dimensional integrated circuits (3DIC), in which multiple active tiers are stacked vertically, is a potential solution to the current challenges in the semiconductor industry. 3DIC technology has many advantages including high device density, small form factor, reduced wire length, low power, and heterogeneous stacking [3,8–13]. It has the potential to further extend Moore's trend of integration [14].
However, 3DIC faces some obstacles like complex design space exploration, higher operating temperatures, lack of CAD/EDA tools support, absence of design standards, testing, and manufacturing challenges [3,15–18].

1.2 Motivation

As 3DIC technology is an appealing prospect to overcome the limitations of conventional integrated circuit technology, it has captured the attention of many researchers around the world. There are two philosophies for 3DIC design: one which considers 3DIC just as an implementation platform, and another, called design for 3D, which aims to make the most of the third dimension by specifically exploiting 3DIC characteristics during design development. In the former, partitioning is executed after logic synthesis and during the place and route stages using automated algorithms. In design for 3D, logic in each tier is determined by design, and logic blocks are not moved across tiers in later stages of the design flow. As a result, the majority of research related to 3DIC logic partitioning either uses an automated procedure that simply partitions an existing 2DIC design in the back-end after logic synthesis [19–22] or provides a 3DIC design of a particular functional unit by splitting the unit across multiple tiers (design for 3D) [23–29], but there are no generic techniques that guide designers as well as tools for logic partitioning.

In traditional 2DIC, logic partitioning is extensively used in the back-end of the design flow. One of the major differences between 2DIC and 3DIC logic partitioning is that in a 2DIC, a design (or netlist) is partitioned only in the back-end [30], whereas in a 3DIC, a design is partitioned either in the front-end or in the back-end. Front-end partitioning allows 3DICs to be partitioned at various abstraction levels, including the architectural and functional levels [9]. Back-end partitioning allows only circuit-level or synthesized-netlist-level partitioning, as in the case of 2DIC.
One might extrapolate that a 3DIC may similarly be partitioned only in the back-end, allowing users to reuse current 2DIC partitioning techniques, but the partitioning and implementation optimizations become challenging for large designs, necessitating architectural/functional partitioning for general-case optimized 3DICs. Such partitioning simplifies tier-level testing and debugging, which are required to increase the yield of 3DIC chips. The problem size is significantly smaller for partitioning at the architecture/functional level, making it more likely to achieve an optimal partitioning. Also, knowing exactly in which tier specific logic blocks reside helps testability and floor-planning. Moreover, the same front-end partitioning can potentially be used for any technology node. The presence of additional parameters, including the third dimension, through-silicon-vias (TSVs), multiple bonding configurations, tier-level testability, and thermal issues, exacerbates the 3DIC partitioning problem. Traditional 2DIC partitioning algorithms with some modifications may be used in the back-end for 3DIC partitioning, but to fully exploit the third dimension and build efficient 3DICs, 3DIC-specific logic partitioning methods are essential. These observations raise the following questions:

What quantifiable characteristics enable optimal yet efficient 3DIC logic partitioning? Can a logic partitioning framework be developed which works at various abstraction levels, especially at the architecture/functional level in the front-end?

These questions motivated the development of a formal framework for logic partitioning in 3DIC. Initially, design partitioning principles are proposed, which constrain partitioning of logic to occur only at the input or output of a sequential cell. These techniques can be used by a designer or can be integrated into partitioning tools, significantly reducing the search space.
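The effect of restricting cuts to sequential-cell boundaries can be illustrated with a small sketch. The Python fragment below is a hypothetical illustration, not the dissertation's tool: the `Net` type, the `cut_candidates` function, and the toy netlist are names introduced here, and the sketch assumes that S2C permits cuts on nets driven by a sequential cell, C2S permits cuts on nets feeding a sequential cell, and hybrid permits either.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Net:
    driver: str  # cell that drives the net
    sink: str    # cell that receives the net


def cut_candidates(nets, sequential, mode="hybrid"):
    """Return the nets on which a tier boundary is allowed.

    Assumed reading of the modes (illustrative, not taken from the source):
      S2C    - net leaves the output of a sequential cell
      C2S    - net enters the input of a sequential cell
      hybrid - either of the above
    """
    if mode not in ("S2C", "C2S", "hybrid"):
        raise ValueError(f"unknown mode: {mode}")
    allowed = []
    for net in nets:
        at_seq_output = net.driver in sequential
        at_seq_input = net.sink in sequential
        if mode == "S2C" and at_seq_output:
            allowed.append(net)
        elif mode == "C2S" and at_seq_input:
            allowed.append(net)
        elif mode == "hybrid" and (at_seq_output or at_seq_input):
            allowed.append(net)
    return allowed


# Toy two-stage pipeline: flip-flops ff_* separated by combinational gates.
sequential = {"ff_a", "ff_b", "ff_c"}
nets = [
    Net("ff_a", "and1"), Net("and1", "or1"), Net("or1", "ff_b"),
    Net("ff_b", "xor1"), Net("xor1", "and2"), Net("and2", "ff_c"),
]

for mode in ("S2C", "C2S", "hybrid"):
    print(mode, len(cut_candidates(nets, sequential, mode)), "of", len(nets))
```

In this toy netlist the two purely combinational nets are never eligible for a cut, which is the sense in which the candidate set shrinks; how large the reduction is in a real design depends on its ratio of sequential to combinational nets.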
Then a data-flow model of a design is developed to achieve an efficient partitioning, by identifying the characteristics that assist in logic partitioning. The salient feature of this framework is its ability to work at various partitioning granularities and with a limited amount of data from the designer. The proposed framework enables analysis of logic-on-logic stacked 3DIC design in the earliest stages of a design cycle. Multiple algorithms are introduced that optimize the 3DIC design depending on the design characteristics available to the framework.

Though 3DIC offers great interconnect bandwidth between bonding tiers, it is very important to know “how many is too many?”, especially in cases of face-to-back or back-to-back bonded tiers, where TSVs are used for vertical interconnects. The vertical interconnects between the bonding tiers are called 3D-vias. In the case of a face-to-face bonded 3DIC, the bondpoints (micro-bumps) in the top metal layer of both tiers are used as 3D-vias, while in other forms of bonding, TSVs are used as 3D-vias. Before partitioning a design, it is useful to determine an a priori estimate of an upper bound on the number of vertical interconnects, as one cannot assume a high availability of 3D-vias, particularly in the case of TSVs. TSVs are significantly larger than typical standard cells [31,32] and in aggregate occupy a large area in the active layer. There is therefore a threshold beyond which adding more 3D-vias might result in lower performance in a 3DIC than in a 2DIC implementation. Another research-worthy question is thus:

What is the upper bound on the number of 3D-vias in a 3DIC, such that the performance of the resulting 3DIC is at least as good as that of the 2DIC implementation?

To address this, a 3DIC average wire length model based on Rent's rule [33] is developed, which serves as a performance measure to determine an upper bound on the number of 3D-vias for a given circuit and 3DIC configuration.

In a 2DIC implementation, the locations of I/O pins influence the standard cell placement. Similarly, in a 3DIC, along with I/O pins, the locations of 3D-vias carrying inter-tier signals affect standard cell placement in a tier, making this assignment crucial. Due to the sheer number of 3D-vias, possibly several thousand in designs like processors, it is impractical to manually assign inter-tier signals to 3D-vias. Automated methods are therefore required to assign inter-tier signals to 3D-vias, which raises the following questions:

Is it possible to automate the inter-tier signals to 3D-vias assignment procedure? If so, is it possible to establish an approach for achieving a provably optimum solution?

In search of answers to the above questions, four heuristics were initially introduced that successfully automated the assignment procedure. Then, they were enhanced to optimal assignment techniques which result in minimum wire length for interconnecting tiers, while providing control over paths by prioritizing the critical paths and limiting the longest path. Furthermore, congestion-aware techniques were introduced to address the congestion and wiring requirements in the 3D-via assignment step. These congestion-aware methods provide globally and locally optimized assignment techniques.

1.3 Contributions

The contributions of this dissertation are as follows:

• Quantified the potential benefits of logic-on-logic stacked 3DIC and analyzed challenges in 3DIC implementation using current CAD tools for even small designs [34] (Chapter 3).

• Introduced logic-on-logic partitioning techniques for 3DICs: sequential-to-combinational (S2C), combinational-to-sequential (C2S), and hybrid (partition at input or output of a sequential cell, a mix of S2C and C2S) [35]. These techniques partition a design with respect to sequential and combinational signals. These approaches are independent of the size of the design and the targeted number of layers for the 3DIC, and are thus scalable.
Also, these techniques, when integrated into partitioning tools, can reduce the execution time of such tools and are more likely to achieve a near-optimal solution, as the solution space is significantly reduced (Chapter 4).

• Introduced an asymmetric adaptive-precision 3DIC multiplier designed using the proposed partitioning principles, which saves energy by exploiting the frequent presence of low-precision operands [36]. Depending on the precision of the operands, the multiplier dynamically switches between three modes of operation, 32×32, 64×64, or asymmetric 32×64, to reduce energy consumption. It is designed for 3DIC implementation. Additionally, introduced a two-tier 3DIC floating-point multiplier built using an adaptive-precision 53-bit multiplier [37] (Chapter 3).

• Developed a tier-by-tier hierarchical wire length distribution estimation method which can be applied for cases of different Rent's parameters for each tier [38]. The model can handle variable TSV dimensions and different bonding techniques. Moreover, this model can be applied to a 3DIC with any number of tiers. The hierarchical model allows the flexibility of having unequal numbers of vertical interconnects between different pairs of adjacent tiers. It also provides an upper bound estimate on the number of TSVs, above which the resulting average wire length degrades (Chapter 5).

• Introduced a formal logic partitioning framework that enables analysis of logic-on-logic stacked 3DIC design in the earliest stages of a design cycle and also supports higher abstraction levels (architecture/functional) [39]. The proposed framework uses a data-flow graph model of the design, with nodes representing logic blocks and edges representing communication between the logic blocks. The optimization goal of the framework is to reduce the overall communication length in the design.
A simulated-annealing approach was shown to automate the optimization process; however, the procedure presented can also be applied by a designer, depending on the number of nodes and edges present in the graph. Computation reduction techniques for the algorithms are also proposed. Multiple optimization algorithms are introduced that fine-tune the 3DIC depending on the design characteristics available in the framework. A processor and a floating-point unit were partitioned using the proposed framework for 3DIC implementation, to demonstrate the application of the framework (Chapter 6).

• Proposed four new inter-tier signal assignment approaches: three nearest-neighbor approaches (NN1, NN2, and NNC) and a midpoint approach to automate assigning inter-tier signals to bondpoints [40]. These 3D-vias, bondpoints or TSVs, influence the cell placement in the layout of a tier, and are hence crucial. Then, the four heuristics are enhanced to optimal assignment techniques, RT1, RT2, RTC, and midpoint, which result in minimum wire length for connecting inter-tier signals, while providing control over paths by prioritizing the critical paths and limiting the longest path [41] (Chapter 7).

• Introduced congestion- and neighborhood-aware inter-tier signal to 3D-via assignment techniques to address the impact of wire congestion, total wire requirements, and the number of nearby inter-tier signals in the 3D-via assignment step [42]. Due to the presence of TSVs, wire congestion is a more severe problem in 3DICs, and the results demonstrated that 3DIC designs with higher relative congestion between the interconnecting tiers can benefit significantly from the proposed congestion-aware techniques (Chapter 8).

1.4 Dissertation Outline

The rest of the dissertation is organized as follows. Chapter 2 provides an introduction to 3DIC technology, and a brief literature survey on research in 3DIC.
Chapter 3 describes two variations of 3DIC design flows, along with the 3DIC design and implementation details of a single-precision floating-point unit and a 64-bit energy-efficient multiplier. Chapter 4 presents an overview of the importance of logic partitioning in 3DIC and the proposed logic partitioning principles. Chapter 5 presents a tier-level hierarchical 3DIC average wire length estimation model. Chapter 6 presents the formal framework for logic partitioning in 3DIC. Chapter 7 introduces techniques for assigning inter-tier signals to 3D-vias in a 3DIC, followed by Chapter 8, which presents congestion-aware techniques for assignment of inter-tier signals to 3D-vias. Finally, Chapter 9 concludes this dissertation with a summary and insights into the future of 3DIC technology.

Chapter 2
3-Dimensional Integrated Circuits

2.1 About 3DIC

In a three-dimensional integrated circuit (3DIC), multiple active layers are stacked vertically to make a single chip (Figure 2.1). The active devices in different tiers are interconnected using vertical vias (or 3D-vias). Depending on the type of 3DIC, these 3D-vias can be either top-metal-layer bondpoints (micro-bumps) or through-silicon vias (TSVs). As transistor sizes scale down, physical limitations make manufacturing, testing, and building reliable chips increasingly expensive and challenging [2–6,43]. 3DIC is an attractive option, as it can achieve higher device density with lower average interconnect length without requiring further scaling of the transistor size [8–14].

Figure 2.1: 3DIC structure

This chapter provides an overview of 3DIC technology. The following sections describe various types of 3DICs based on bonding type, stacked materials, and assembly process. The motivation for 3DIC technology and its challenges are discussed, followed by an overview of 3DIC literature.

2.2 Configurations

A short introduction to various 3DIC structures is required to better understand the motivation for 3DIC technology and its limitations.
The various configurations of 3DIC are classified based on bonding technique, type of stacked materials, and assembly process, as explained below.

2.2.1 Bonding Methods

Any two tiers in a 3DIC can be bonded in three different ways: face-to-face (F2F), face-to-back (F2B), or back-to-back (B2B), as shown in Figure 2.2. In an F2F bonded 3DIC, the two tiers are bonded using top-metal-layer bondpoints. One of the tiers is flipped and bonded to the other. In an F2B bonded 3DIC, the vertical interconnects between the bonding tiers are formed using TSVs. A TSV is connected to the first metal layer of one tier (back), tunnels through that tier’s substrate body (silicon), and connects to a landing pad on the top metal layer of the other tier (face). In B2B bonding, a TSV tunnels through the bodies of both tiers and connects to metal layer one on both tiers. Among the three bonding options, F2F bonding provides the highest density of 3D-vias without affecting the active area. However, like B2B, it is limited to two tiers; any 3DIC with more than two tiers requires F2B bonding.

Figure 2.2: 3DIC bonding techniques: clockwise from top-right (i) face-to-face, (ii) back-to-back, and (iii) face-to-back

2.2.2 Homogeneous and Heterogeneous 3DICs

A 3DIC formed by stacking the same materials for all tiers is called a homogeneous 3DIC. For example, a CPU core stacked on top of another CPU core is a homogeneous 3DIC. If dissimilar materials are stacked, it is called a heterogeneous 3DIC. Examples include an analog device stacked on a digital device and a DRAM stacked on a CPU. Heterogeneous 3DICs have the potential to be a cost-effective option for system-on-chip (SoC) implementation, while homogeneous 3DICs can achieve higher device density for a particular design (microprocessors, etc.) without scaling the transistor size.

2.2.3 Assembly Options

There are three options to assemble tiers in a 3DIC: wafer-to-wafer (W2W), die-to-die (D2D), and die-to-wafer (D2W).
In W2W assembly, hundreds or thousands of 3DICs can be manufactured at once, making it potentially the most cost-effective 3DIC manufacturing process [11]. However, W2W may only be practical for homogeneous stacking, because the stacking process requires high temperatures and heterogeneous materials may have mismatched thermal coefficients. The main advantage of the D2D and D2W assembly options is that only known-good dies (KGD) need be stacked, which improves 3DIC yield. Each option has its advantages, and hence a thorough exploration of the options is required to determine the best assembly approach for any particular 3DIC implementation.

One can build an F2F bonded, homogeneous, W2W assembled 3DIC, which implies that similar materials are stacked at wafer level by flipping one of the tiers and bonding it on top of the other tier. The research presented in this dissertation is applicable to any type of 3DIC bonding or assembly technique.

2.3 Motivation for 3DIC

The exponential growth of the semiconductor industry has been one of the main enablers of the technology boom of the past decades. Each generation, more transistors are integrated into a single chip to provide more functionality, resulting in ever-increasing demand for higher device density, higher bandwidth, and lower power. Over the years, traditional IC technology (2DIC) has met these demands by means of device scaling. However, sustaining the integration trend cost-effectively in an era of slower transistor scaling requires a new solution. 3DIC technology is perhaps the best current solution to the physical limitations concerning device scaling and interconnects. The following are additional motivating factors for the semiconductor industry and academia to actively pursue research in the field of 3DIC.

2.3.1 Advantages

Higher device density at smaller footprint: The most obvious advantage of a 3DIC is its ability to pack billions of transistors at a very high density (transistors per unit chip footprint area) into a single chip.
Depending on the total number of tiers in the 3DIC, the chip footprint can be several times smaller than that of a 2DIC. This is a very desirable feature, especially for mobile technologies.

Shorter Interconnects: Theoretical estimation of 3DIC wire length characteristics showed that 3DIC indeed provides a smaller wire length distribution, with the largest effect on the longest paths. 3DIC is expected to reduce total interconnect lengths and average net lengths [13,44,45]. As the technology node scales down, wire delay and the power consumed by wires are becoming a significant portion of the circuit delay and the total power consumption of the chip, respectively [7]. The shorter wires enabled by a 3DIC approach can provide significant improvements in power/energy and delay, owing to the smaller capacitance of shorter interconnects. Additionally, smaller wire capacitance will reduce the noise caused by concurrent switching events.

Heterogeneous Integration: The ability to stack dissimilar materials is one of the most attractive features of 3DIC. Heterogeneous 3DIC is a cost- and performance-effective option for SoC implementation [16,46]. All the components (analog and digital, possibly even manufactured at different technology nodes) can be integrated into the same chip with high-density interconnects. A DRAM chip stacked on a processor core has the potential to significantly reduce the performance gap between the processor and DRAM, i.e., to help combat the memory wall problem.

Cost: Manufacturing reliable chips at technology nodes such as 22nm and 14nm is very expensive, due to the high cost of developing IC process technology [2–6]. 3DIC can continue the Moore integration trend [1] by increasing the number of tiers in a single chip [14]. Additionally, each tier of a 3DIC is smaller than a 2DIC implementation of the chip. As yield depends on the area of the active layer and improves as that area shrinks, it is possible to achieve higher yield [11].
Additionally, if any non-critical tiers are present in a chip, they can be manufactured at an older technology node to reduce manufacturing costs.

Design Space: Exciting new architectures and high-performance designs are possible with the increase in design space due to the addition of the third dimension [47–50].

Chip Security: A less pronounced advantage of 3DIC is its ability to combat reverse engineering and protect intellectual property from infringement [51]. Sensitive logic blocks can be partitioned across multiple tiers to obscure the functionality and make it complex to reverse engineer. Also, each tier can be manufactured at a different third-party vendor and assembled at a different location.

2.4 Challenges

Like any emerging technology, 3DIC also faces some challenges, as described below.

Complex Design Space Exploration: The presence of a third dimension, various bonding techniques, the ability to stack heterogeneous materials, and high-bandwidth vertical interconnects result in a huge number of possibilities in the 3DIC design space. While having many options might be an advantage, exploring such a vast space and finding an optimum solution is a complex and challenging task.

Absence of CAD Tools and Design Standards: Currently there are no industry-standard CAD tools to build 3DIC chips; however, a wide range of research is being pursued in this area. Also, there are no design standards or design rules that are specific to 3DICs. As stated in [16], 3DIC will never become “mainstream” unless 3DICs can be designed and produced in a cost-effective way, with sufficient turnaround time. This will be possible only with a robust and well-defined 3DIC ecosystem that includes design standards and EDA/CAD tools.

Thermal Management: Heat dissipation is one of the biggest obstacles for 3DIC. Increased device density and tight integration of tiers lead to a higher probability of hot spots within a 3DIC [15]. Heat dissipated by the signal activity in internal tiers (farther from the heat sink) gets trapped inside, resulting in high operating temperatures.
Sophisticated thermal/temperature control and cooling techniques are essential for the survival and large-scale adoption of 3DIC.

Testing: To achieve good yield and build reliable chips, 3DICs should be tested at various stages of the manufacturing process. It is very desirable to test each tier (wafer or die) before stacking it into a 3DIC. Testing a thinned wafer or die is extremely challenging, as it is very complex to determine the faults caused during thinning and bonding of a 3DIC [16,17]. Design-for-test, i.e., inserting optimal scan chains across tiers, is an ongoing research activity [18].

Manufacturing: 3DIC requires the additional manufacturing steps of wafer/die thinning and bonding to assemble tiers into a single chip [3]. These steps incur additional costs, and any faults occurring during these processing steps will affect the overall yield and performance of the resulting 3DIC. The 3DIC process technology should mature over time to become cost-effective, i.e., the potential cost benefits presented above should exceed the cost of the additional steps.

2.5 3DIC Literature

An overview of 3DIC literature in the three main fields of research related to 3DICs (architecture, modeling, and tools) is presented in the following sections.

2.5.1 Architectural Exploration

Various types of 3DIC-specific processor-memory organizations have been explored. Examples include memory on processor [12,48–50] and a single processor core partitioned across tiers [12,52,53]. In [50], a two-tier 3DIC prototype with embedded DRAM stacked over processor-like logic is demonstrated, while [48] and [49] presented two-tier 3DIC multi-core systems with an SRAM memory stack. The authors in [53] introduced several micro-architecture techniques where a core is split across tiers to control hot spots. Significant research has also been carried out on porting specific functional blocks such as instruction schedulers, register files, FFT, and TCAM to 3DICs [23–25,27,28].
Apart from microprocessors and related areas, researchers have also explored 3DIC-inspired architectures for field-programmable gate arrays (FPGAs) [54–56] and image sensors [57,58]. Both FPGAs and image sensors are very wire-oriented devices and require high-bandwidth interconnections, which the 3DIC platform offers.

2.5.2 3DIC Modeling

Several a priori system analysis models based on Rent’s rule [33] have been developed for 3DICs [44,59–61]. These models demonstrated a significant decrease in interconnect length in 3DICs. Apart from the wire length distribution, system performance metrics such as clock frequency, chip area, and power dissipation are also estimated using these models. A cost-driven 3DIC design flow is presented in [62] to guide the design space exploration for 3DICs toward a cost-effective direction. The authors in [63] presented a flow for 3DIC system-level exploration, useful for path-finding studies, which enables users to explore the trade-offs between different stacking and partitioning schemes. Models for analyzing the thermal behavior of 3DICs have been introduced to assist packaging and cooling strategies and to estimate overall system costs [64].

2.5.3 3DIC Tools

A variety of partitioning, floorplanning, placement, and routing techniques have been developed to cater to the unique properties of 3DICs. A flat place-and-route tool called PR3D is presented in [19]; it simultaneously partitions a design and places it across the given number of tiers. In contrast, a hierarchical approach has also been explored, where a design is first partitioned into tiers and then placed [20,65]. Once a design is partitioned, cells move only within a tier during placement, not between tiers. Thermal-aware placement techniques are used to reduce the peak temperature and avoid hot spots [66,67]. There are several other methods, such as TSV-aware placement [31] and TSV and cell co-placement [31,68], each trying to optimize a physical design for a specific objective.
Chapter 3
Design for 3D

3.1 Design Flow

The design of an integrated circuit (IC) starts with a high-level specification of the target functionality of the chip. Then a step-by-step procedure, typically referred to as a “design flow”, is followed to build the chip. Over time, the design flow for building a traditional CMOS digital IC (2DIC) has become well established, with a great deal of automation by means of standard cell design. A similar design flow is also required for building 3DICs. Figure 3.1 shows two possible design flows for a 3DIC that are widely used in the literature. The choice between these two flows depends on the answer to the following question.

Should the back-end 3DIC implementation technology be relevant to the activities of a front-end designer or architect?

If the answer is no, the design flow in Figure 3.1(a) suits, where 3DIC is used merely as an implementation platform. The designer or architect may or may not have any knowledge of the structure of the target 3DIC. The design development and logic synthesis are executed in the same manner as in 2DIC. At some stage during place and route, the design is partitioned across tiers automatically.

If the answer is yes, the chip is designed for 3DIC implementation, which means that the architecture may be inspired by the 3DIC technology and tries to take full advantage of the vertical integration. In this case the design flow shown in Figure 3.1(b) is followed, which is referred to as a “design for 3D” flow. Here, the partitioning is done by design, i.e., once the high-level description of the chip is done, logic is fixed to a tier. Inter-tier swapping or moving of logic blocks is not permitted in later stages.

Figure 3.1: 3DIC design flow options: (a) alternate 3DIC design flow, (b) “design for 3D” flow

In a nutshell, the design for 3D flow aims to fully exploit 3DIC technology, whereas the alternative works for porting any design (existing or future) to 3DIC without significant (or any) redesign effort. The following sections define, compare, and contrast both flows.

3.1.1 “Design for 3D” Flow

1. The design is developed for 3DIC implementation from the top. Each tier is obtained by design, and no automated partitioning techniques are required.
2. Each tier is synthesized separately, using its HDL description.
3. Post-synthesis timing analysis and logic verification are executed to validate the architecture.
4. Similar to 2DIC, chip I/O is planned. Additionally, inter-tier signals are assigned to available 3D-vias.
5. Cell placement is guided by I/O pins, 3D-vias (inter-tier signals), and the floorplan. Current 2DIC tools can accomplish the required task. However, a 3DIC-specific placement tool that operates simultaneously on multiple (or all) tiers and optimizes the cell placement to avoid hot spots is preferred. It should be noted that only intra-tier cell movement is allowed.
6. Layout is then extracted, timing is analyzed, and logic is verified.

3.1.2 3DIC as an Implementation Platform

1. A behavioral or functional description in an HDL is developed.
2. Logic is synthesized.
3. Post-synthesis timing analysis and logic verification are executed to validate the architecture.
4. Similar to 2DIC, chip I/O is planned.
5. In this step the design is partitioned across tiers before or during place and route. In [19] a flat netlist is used and the cells are moved within and across tiers during placement. This aims at a globally optimal placement but may not be practical for large designs like chip multiprocessors, where some floorplanning is needed to guide the tools. In contrast, as explained in [20], a hierarchical framework can be used, where a design is first partitioned into tiers, and blocks can only be moved within a tier during floorplanning. Inter-tier signals are assigned to 3D-vias, and all the tiers are placed and routed.
6. Layout is then extracted, timing is analyzed, and logic is verified.

In both design flows, if any validation fails at any stage, the flow should be rolled back to earlier stages and the necessary modifications should be made.

3.1.3 Comparison

Many 3DIC-inspired architectures/designs are possible with the design for 3D flow, where the designer has complete control over partitioning logic across the tiers.
Such designs are not possible with the alternate design flow. Using 3DIC just as an implementation technology has benefits, such as potentially reduced design time due to less user intervention in 3DIC-related issues, but making the most of 3DIC requires 3DIC-inspired architectures. Additionally, such designs can be developed to combat the challenges of 3DIC technology. Also, the design for 3D flow works better for heterogeneous stacking.

A design can be partitioned at various levels, depending on the granularity of the blocks that are to be stacked: cores, functional units, logic gates, or transistors [9]. Controlling partitioning granularity by means of automated partitioning is very difficult. In a given design, some functional units may benefit from being split across tiers, and some may have to be placed entirely in one tier [52]. Such unit-by-unit granularity management is readily possible with a design for 3D flow. Also, a flat placement of an entire chip, as suggested in [19], is not practical for large designs. Even 2DICs follow a guided floorplan and a hierarchical/modular approach for cell placement.

Tier-level testing, verification, and debugging are essential for a 3DIC. A design for 3D flow may be favorable, as it is easier to test and debug at the architectural level than at the circuit level. When a design is partitioned automatically, it is unclear what logic resides in which tier, and tier-level testing becomes more complex.

Architectural exploration and more human involvement in design for 3D may result in a longer design time. In design for 3D, logic optimization can also be limited for designs partitioned at logic-gate granularity (intra-block splitting). However, most current tools, with some improvements, are sufficient for the design for 3D flow, whereas the alternate design flow requires novel place-and-route tools.

Though the design for 3D flow is favored and used for building all 3DIC designs presented in this dissertation, the research contributions are useful irrespective of which flow is used.
The subsequent sections provide 3DIC design and implementation details of two designs: a single-precision floating-point unit and a 64-bit multiplier. These 3DIC design exercises provided several insights (Section 3.4) that inspired the research presented in this dissertation.

3.2 Single Precision Floating-Point Unit 3DIC

The design selected for exploring logic-on-logic 3DIC implementation is an existing single-precision floating-point unit (FPU) that has previously been implemented on a number of chips, dating back to the Data-Intensive Architecture (DIVA) processing-in-memory chips. The FPU is designed based on Taylor-series expansion with squaring and cubing units. It has a 5-stage pipelined architecture with a single 2-stage pipelined multiplier, which is used to execute all multiplication operations, including those required in the powering units [69]. Figure 3.2 presents the block diagram of the FPU. For all instructions other than division, processing occurs in a linear pipeline fashion. Thus, the latency is 5 cycles and an instruction can be issued every cycle. The division operation is executed in a non-linear pipeline fashion and has a latency of 12 cycles. The instruction that follows a division can be issued after either 5 or 8 cycles, depending on the instruction type.

Figure 3.2: 5-stage pipelined FPU with a 2-stage multiplier unit

3.2.1 3DIC FPU

The aim of this task is to build a 3DIC using the tools and technology that are available today. The 3DIC FPU is designed for 3DIC implementation by manually partitioning the original FPU module into two tiers. The following sections discuss the 3DIC FPU design and its implementation details.

Figure 3.3: FPU: (a) FPU division operation data flow, (b) 3DIC FPU block diagram

Number of Tiers

The primary step in designing a 3DIC is to set the number of tiers. Given the 2-tier constraint of the MOSIS service, it is a straightforward choice to split the design into two tiers.
Even without that constraint, the relatively small size of the FPU design (compared to a multicore processor) and the fact that this is a first attempt make it beneficial to analyze the challenges of implementing the design across two tiers before trying to extrapolate to more tiers.

3DIC Design

Figure 3.3(a) shows the data flow of the FPU division operation. As mentioned earlier, it is an irregular pipeline: some signals from stages 4 and 5 of the pipeline drive the logic in stage-2. This observation led to the partition shown in Figure 3.3(b), where stage-2 is placed in tier-2, above stages 4 and 5, which are placed in tier-1. The multiplier module used in the FPU design is a Synopsys DesignWare component and hence cannot be split across the two tiers; therefore stage-3 is placed in tier-2, along with stage-2. Finally, stage-1 is placed in the same tier as stage-5, so that all I/O signals interface with tier-1; otherwise either the input or the output signals (depending on which tier interfaces with I/O) would have to traverse the stack to reach the I/O pads, which could increase the delay. Also, it should be noted that all the signals, including the clock signal, are routed to tier-2 through tier-1.

3.2.2 Technology

Now that an overview of the FPU design has been given, this section describes the specific 3DIC technology targeted.

Tezzaron 3DIC Technology

Tezzaron uses a wafer-to-wafer bonding technology called FaStack to build 3DIC chips [70,71]. A design is partitioned into several tiers, each of which is built on a separate wafer using a standard fabrication process, and the wafers are then bonded at wafer level. The vertical connections between face-to-face bonded wafers are made using bondpoints formed on the top metal layers of both wafers. During the bonding process, the wafers are aligned to a precision of 0.5 micron [71] to enable accurate bonding of the contacts. Once the wafers are bonded together, the stack is thinned and diced into 3DICs.
MOSIS 3DIC Package

The 3DIC FPU is implemented using the MOSIS 3DIC package, which uses the Tezzaron-GlobalFoundries two-tier 130nm fabrication process. This package is limited to two tiers and uses face-to-face bonding. The interconnects between the two tiers are made using M6 metal layer micro-bumps (also called bondpoints, see Figure 3.4). The input/output and power signals are interfaced with pads using TSVs, called Super-Vias [72]. Tezzaron’s super-vias require a very thin wafer (approximately 5 µm); hence the top tier, on which the pads are located, is thinned.

Figure 3.4: 3DIC technology as explained by MOSIS

3.2.3 Implementation

Once the FPU design is partitioned, each tier is synthesized separately using Synopsys Design Compiler, targeting the Tezzaron-GlobalFoundries 130nm standard cell libraries. Then both tiers are verified together as a 3D design for timing violations, using Synopsys PrimeTime. Finally, place-and-route (PAR) of each tier is performed using Cadence First Encounter. Before placement, inter-tier signals must be assigned to bondpoints. This assignment is carried out in a greedy fashion, by assigning each inter-tier signal a bondpoint as close as possible to the cell that sources or sinks it in tier-1. Then, a mirrored version of the assignment is used for PAR of tier-2.

A crucial part of the layout process is clock tree synthesis (CTS). Like synthesis and PAR, CTS is executed independently on each tier. The problem in the 3DIC is that the clock signal reaches tier-2 later than tier-1, as it traverses to tier-2 via a bondpoint. So, to balance the clock trees, the insertion delays of both clock trees are observed, and then the clock tree of the tier with the lower insertion delay (in this case tier-1) is re-synthesized to force it to have the same insertion delay as the slower one.

Final timing verification and logic equivalence checking are then performed. The design is analyzed for timing violations using Synopsys PrimeTime. Cadence Conformal is used for a logical equivalence check of the 3DIC design against the original 2DIC FPU HDL source code. Furthermore, a post-layout, SDF back-annotated simulation is used to validate the logical correctness of the 3DIC design.

3.2.4 Results

It is important to analyze the results to understand the capability of 3DIC technology. Table 3.1 compares the 2DIC FPU with the 3DIC FPU. Table 3.2 shows the results for the 2DIC FPU and for tiers 1 and 2 of the 3DIC FPU. The 2DIC FPU was implemented using the same standard cell library as the 3DIC FPU design.

Table 3.1: Results: 2DIC FPU vs. 3DIC FPU

                  2DIC FPU          3DIC FPU          Improvement
Clock Period      6.83 ns           6.63 ns           3%
Footprint         400 x 400 um^2    306 x 306 um^2    41.5%

Table 3.2: Results: 2DIC FPU vs. 3DIC FPU tier-1 and tier-2

                                  2DIC FPU    Tier-1    Tier-2
Average Connection Length (um)    10.37       12.51     9.70
Standard Deviation (um)           14.80       14.75     9.97
Power (mW)                        9.954       4.639     6.08
Cell Area (um^2)                  131200      62736     79590

The 3DIC FPU shows a speedup of 3% and an impressive 41.5% reduction in chip footprint. The smaller footprint (or chip size) is very desirable for mobile devices. In spite of dividing the design into two tiers, the combined cell area of the two tiers is only about 8.5% more than that of the 2DIC FPU design. Overall, the average connection length (ACL) and power consumption are similar in both designs. However, better tool support for assigning vertical interconnects could significantly reduce the ACL and hence the power consumption. The higher standard deviation of interconnect lengths in the 2DIC FPU, especially when compared to tier-2, shows the presence of longer interconnects in the 2DIC FPU. Also, the 3DIC FPU has a relatively large number of vertical interconnects for its small size, because of which more wiring is required to route signals to bondpoints, contributing to the increased ACL.
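The percentages quoted above follow directly from the entries of Tables 3.1 and 3.2. As a quick check, the minimal Python sketch below recomputes them; the variable and function names are illustrative and are not part of the original design flow.

```python
# Values copied from Table 3.1 and Table 3.2; names are illustrative only.
CLOCK_NS = {"2DIC": 6.83, "3DIC": 6.63}
FOOTPRINT_SIDE_UM = {"2DIC": 400, "3DIC": 306}
CELL_AREA_UM2 = {"2DIC": 131200, "tier1": 62736, "tier2": 79590}


def percent_reduction(old, new):
    """Relative decrease from old to new, in percent."""
    return 100.0 * (old - new) / old


def percent_increase(old, new):
    """Relative increase from old to new, in percent."""
    return 100.0 * (new - old) / old


# Clock period improvement of the 3DIC FPU over the 2DIC FPU.
clock_gain = percent_reduction(CLOCK_NS["2DIC"], CLOCK_NS["3DIC"])

# Footprint is a square, so the area is the side length squared.
footprint_gain = percent_reduction(
    FOOTPRINT_SIDE_UM["2DIC"] ** 2, FOOTPRINT_SIDE_UM["3DIC"] ** 2
)

# Combined cell area of both tiers relative to the 2DIC cell area.
area_overhead = percent_increase(
    CELL_AREA_UM2["2DIC"], CELL_AREA_UM2["tier1"] + CELL_AREA_UM2["tier2"]
)

if __name__ == "__main__":
    print(f"clock period reduction: {clock_gain:.1f}%")
    print(f"footprint reduction:    {footprint_gain:.1f}%")
    print(f"cell area overhead:     {area_overhead:.1f}%")
```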
Hence, a larger design such as a multicore processor can better exploit vertical stacking, especially at an advanced technology node like 22nm, where interconnects dominate the delay and power consumption of the overall design.

3.3 Energy-Efficient 3DIC Multiplier

Energy efficiency is among the most important goals of current semiconductor research. The key reasons are (i) the recent explosion of mobile computing and (ii) the push for exascale computing. Both technologies demand higher energy efficiency, the former for longer battery life and the latter for low operational cost. Low-energy logic blocks such as multipliers are essential building blocks for such systems. Multipliers, signed or unsigned, are very important functional units in both integer and floating-point computational units. These units are critical components of any high-performance microprocessor (CPU, GPU, etc.). A tremendous amount of research has been dedicated to optimizing multipliers for low power, high speed, and area efficiency [73–75], and efforts to improve multipliers further continue. The design presented in this section is an input-data-dependent 64-bit multiplier that asymmetrically adapts to operand precisions to save energy.

The multiplier architecture is also inspired by 3DIC technology. It is designed for 3DIC implementation to further enhance the performance of the multiplier. 3DIC is a very promising technology with many advantages (refer to Section 2.3.1). However, 3DIC suffers from some challenges, including thermal issues and a lack of CAD tools. Hence, the 3DIC multiplier is designed for efficient 3DIC implementation following the “design for 3D” approach, to offset some of the obstacles of 3DIC technology.

Importance of Operand Precision

Most modern microprocessors performing 64-bit operations have a native 64-bit multiplier unit for multiplication instructions. However, they still support 32-bit or even half-word (16-bit) multiplication instructions.
These lower-operand-width multiplications are often executed using the 64-bit multiplier or, in some cases, using an additional multiplier unit. The processor typically relies on software to indicate, through the instruction type, whether the multiplication is a 64-bit operation or not. Over the years, researchers have identified that most input operands of 64-bit multiplication instructions have small inherent precision. In [73] the authors analyzed the operand values of multiplication instructions across the SPEC benchmark suite and concluded that in nearly 58% of the 64-bit multiplication instructions both operand values have a precision of 32 bits or less. Moreover, the authors in [74,75] did a similar analysis and showed that most of the time the input operands are small as well as positive. Although an N-bit multiplier produces a 2N-bit product, many architectures return only the lower N bits of the product and an overflow flag [76], because most of the time the least significant N bits are sufficient to represent the product. For N=64, this implies that when the precision of one input operand is greater than 32 bits, it is highly probable that the precision of the other is less than 32 bits; otherwise there would be many overflows. Therefore, most 64-bit multiplications require 32×32 or 32×64 multiplication, and rarely 64×64. These findings inspired the design of the asymmetric adaptive-precision multiplier.

3.3.1 Architecture

The motivation for the multiplier design is twofold: to take advantage of narrow-width operands to save energy, and to design for 3DIC implementation. The architectural details are given in this section.

The main idea is to dynamically detect the precision of each operand and perform one of the following operations: (i) 32×32 multiplication, if both operand precisions are less than or equal to 32 bits; (ii) 32×64 or 64×32 asymmetric multiplication, if only one of the operand precisions is less than or equal to 32 bits; and (iii) 64×64 multiplication otherwise.
This functionality is achieved by building a 64-bit multiplier from four 32-bit multipliers. Such a hierarchical design is also well suited for 3DIC implementation, which is discussed in Section 3.3.2. The multiplier design functions in two halves. The first half detects the operand precisions and selectively updates the inputs of only the required 32-bit multipliers. As a result, no dynamic energy is consumed by the other 32-bit multipliers, whose inputs remain unchanged. The second half of the design combines the partial products generated by the four 32-bit multipliers, based on the type of operation to be performed: 32×32, 32×64, 64×32, or 64×64.

Additionally, to achieve high throughput, the multiplier is pipelined into two stages. Owing to the structure of the multiplier design, it is a natural choice to designate the first half of the design as pipeline stage-1 and the other half as stage-2. Nevertheless, if required, the design can be further pipelined to match the specifications of the target system into which it is integrated. The following sections focus on unsigned multiplication because, most of the time, multiplications are of positive numbers [74,75], and because FPUs use a sign-magnitude multiplier, which is essentially an unsigned multiplier. However, sections 3.3.1 and 3.3.4 present the details of extending the design to a signed multiplier and its results, respectively. Henceforth, Ah and Bh represent the higher halves, i.e., the 32 most significant bits (MSB), of A and B, respectively. Similarly, Al and Bl represent the lower halves, i.e., the 32 least significant bits (LSB), of A and B, respectively. Now that a top-level idea of the design has been presented, the following sections discuss each stage in detail.

Figure 3.5: Multiplier pipeline stage-1

Stage-1

Figure 3.5 shows stage-1 of the pipelined design. The preprocessing logic includes two 32-bit zero detectors with outputs Za and Zb, respectively.
These two signals control the multiplier's mode of operation as follows:

Za  Zb  :  Mode of Operation
0   0   :  64×64
0   1   :  64×32 with M1 = Ah×Bl
1   0   :  32×64 with M1 = Al×Bh
1   1   :  32×32

It should be noted that both 32×64 and 64×32 operations require only two 32-bit multipliers to compute the output. In these cases, signal Za controls the input selection for the 32-bit multiplier that computes M1. This allows the logic to remain the same in the pipeline stages for both cases of asymmetric multiplication, i.e., 32×64 and 64×32. Henceforth, 32×64 is used to represent both cases. The E1 and E2 signals enable the stage-1 input pipeline registers only when a new computation is required, avoiding unwanted activity to save energy. Additionally, the outputs of the idle 32-bit multipliers, which drive the input pipeline registers of stage-2, also remain unchanged, further increasing the energy savings.

Stage-2

This half is responsible for combining the four partial products, M0, M1, M2, and M3, generated by the 32-bit multipliers to produce the required Product. Based on the mode of operation, stage-2 functions as follows:

1. 32×32 Multiplication: Product = {64'd0, M0}.

2. Asymmetric 32×64 Multiplication: Figure 3.6 illustrates the combination of partial products that produces the output.
   Product = {32'd0, M1 + {32'd0, M0[63:32]}, M0[31:0]}

3. 64×64 Multiplication: Figure 3.7 illustrates the multiplication operation, where an array of 64 3:2 compressors reduces M1, M2, and {M3[31:0], M0[63:32]} to generate Sum and Carry streams. Then, a 64-bit adder is used to further reduce the partial products as shown in Figure 3.7, generating A_sum and C_out, the sum and carry outputs, respectively. The Product is obtained as follows:
   Product = {M3[30:0] + C_out, A_sum, Sum[0], M0[31:0]}

Figure 3.6: Stage-2: 32×64 operation

Figure 3.7: Stage-2: 64×64 operation

Analysis

The ability to dynamically adapt multiplier precision comes with an area overhead. The asymmetric multiplier requires two 32-bit zero detectors, multiplexers to control inputs, and additional stage-1 registers.
Most of this overhead occurs in the preprocessing logic. As indicated in [73], a good portion of this overhead (the zero detectors and multiplexers in the preprocessing logic) may already be present in the microprocessor and can be modified to assist the multiplier in the processor architecture. However, the results and analysis are based on a design that includes this overhead. So, when this multiplier is integrated into a typical microprocessor, the area penalty can be minimized, further improving the energy efficiency.

As mentioned earlier, the multiplier design can be extended to handle 2's complement based signed multiplication operations. This requires a few additional logic blocks in the preprocessing logic that convert the 2's complement number to an unsigned number and keep track of the sign. Given that the probability of negative operands is low [74,75], these computations are required infrequently, hence the dynamic energy penalty could be low. Additionally, another logic block is required that converts the product generated in stage-2 to 2's complement notation. This is required only when one of the inputs is negative. Again, the chances of this additional computation are low.

Furthermore, the same architecture can be used for building an FPU. A 53-bit unsigned multiplier, required for a double-precision FPU, can be designed using four 27-bit multipliers in a manner similar to the 64-bit multiplier [37]. All single-precision operations, which require a 24-bit multiplier, will be computed using the 27×27 mode of operation. The double-precision multiplication operations will be executed in one of the three modes, depending on the values of the mantissas. To further enhance the 53-bit multiplier, it can be built using one 24×24, two 24×29, and one 29×29 multipliers. This architecture will be able to dynamically switch between the following modes: 24×24, 24×29, 29×29, 24×53, 29×53, and 53×53, depending on the input operands and the FPU precision, thus resulting in high energy efficiency.
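The mode selection and stage-2 combination described in this section can be summarized with a small behavioral model. The Python sketch below is only an illustration under stated assumptions, not the dissertation's RTL: the function names are hypothetical, the assignment of the two cross terms to M1 and M2 in 64×64 mode is assumed, and the 64×64 case is computed arithmetically rather than through the compressor array and adder of Figure 3.7.

```python
MASK32 = (1 << 32) - 1


def split_halves(x):
    """Return the (high, low) 32-bit halves of a 64-bit unsigned operand."""
    return (x >> 32) & MASK32, x & MASK32


def adaptive_multiply(a, b):
    """Behavioral model of the asymmetric adaptive-precision multiply.

    Returns (product, mode, active), where `active` names the 32-bit
    multipliers whose inputs would be updated for this operation.
    """
    ah, al = split_halves(a)
    bh, bl = split_halves(b)
    za = ah == 0  # zero detector on the upper half of A
    zb = bh == 0  # zero detector on the upper half of B

    m0 = al * bl  # the lower 32x32 product is needed in every mode
    if za and zb:
        # 32x32 mode: Product = {64'd0, M0}
        return m0, "32x32", ["M0"]
    if za or zb:
        # Asymmetric mode: Za selects the inputs of the M1 multiplier.
        m1 = al * bh if za else ah * bl
        mode = "32x64" if za else "64x32"
        # Product = {32'd0, M1 + {32'd0, M0[63:32]}, M0[31:0]}
        product = ((m1 + (m0 >> 32)) << 32) | (m0 & MASK32)
        return product, mode, ["M0", "M1"]
    # 64x64 mode: all four partial products are needed.
    # Which cross term goes to M1 and which to M2 is an assumption here.
    m1, m2, m3 = ah * bl, al * bh, ah * bh
    product = m0 + ((m1 + m2) << 32) + (m3 << 64)
    return product, "64x64", ["M0", "M1", "M2", "M3"]
```

In this model only the multipliers listed in `active` see new inputs, which mirrors how the idle 32-bit multipliers consume no dynamic energy in the narrower modes.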
Clock-Gating: All the enable signals in the multiplier architecture can be used as control signals for clock-gating. The enable signals help in blocking unwanted data from appearing at the flip-flops, but if these signals are used for clock-gating, the additional dynamic energy savings in the flip-flop cells would result in a more energy-efficient design.

Figure 3.8: 3DIC multiplier block diagram

3.3.2 3DIC Multiplier

The architecture of the multiplier presented in the previous section is inspired by 3DIC technology. The hierarchical architecture is designed not only to take advantage of narrow-width operands, but also to exploit the benefits of the third dimension. In the following sections, the 3DIC design methodology is explained, along with the details on how the design addresses the challenges of 3DIC.

The fundamental step in building a 3DIC is to determine the number of tiers. Given that the size of the multiplier design is relatively small when compared to designs like multi-core processors, a two-tier 3DIC is adequate. Also, this multiplier design may not have extremely long interconnects that could benefit from having a higher number of tiers.

Design partitioning is the most crucial step in designing for a 3DIC. It affects the design flow and performance of the target 3DIC. The design is split as follows (see Figure 3.8): Tier-2, comprising the three 32-bit multipliers that compute M1, M2, and M3; Tier-1, containing the rest of the logic. This choice of design partitioning is consistent with the design partitioning techniques discussed in Section 4.3.

Design for 3D

As stated in the previous section, the structure of the 64-bit multiplier simplified the crucial design partitioning task for building a 3DIC. Similarly, there are several other factors that influenced the architecture of the multiplier design, with the aim of making it amenable to 3DIC implementation.

One of the major roadblocks in 3DIC technology is thermal management. It is very challenging to control the operating temperature in the tiers that are far from the heat sink.
One way to avoid thermal hot spots is to minimize the logic activity in the tiers that are far from the heat sink. By virtue of this 3DIC design, the activity in tier-2 (farther from the heat sink) is kept to a minimum. Tier-2 is fully active only in the 64×64 mode of operation, which is rare. Most of the time the multiplier operates in 32×32 mode, and the 32-bit multiplier that computes in this mode is located in tier-1, the one closest to the heat sink. Therefore, by design, the probability of thermal hot spots is greatly reduced.

Clock tree synthesis across multiple tiers is one of the challenging tasks for tools to accomplish. Given that the 3DIC multiplier has purely combinational logic in tier-2, a clock signal is not required in tier-2. Therefore, the clock tree synthesis problem is reduced to a single tier (tier-1), which current tools can handle seamlessly. Moreover, the clock tree distribution spans a smaller area when compared to a corresponding 2DIC implementation. As a result, fewer clock buffers are required to distribute and balance the clock across the chip, thereby further enhancing the energy efficiency.

3.3.3 Implementation

To make a fair comparison and show the advantages of the asymmetric multiplier, the following multiplier alternatives were implemented for experimental analysis.

1. Baseline Design: A two-stage pipelined, 64-bit unsigned Synopsys DesignWare multiplier component that uses an efficient tree-based architecture.

2. Asymmetric 2DIC Design (A2DIC): Traditional 2DIC implementation of the asymmetric multiplier. For an unbiased comparison, the 32-bit multipliers are implemented using Synopsys DesignWare single-stage unsigned multiplier components that have the same architecture as that of the baseline. So, any improvements on the baseline design can also be applied to these 32-bit multipliers.

3. Symmetric 2DIC Design (S2DIC): To demonstrate the advantages of supporting an asymmetric operation, a symmetric variant of the multiplier design is implemented. This version does not support the 32×64 mode of operation. Its stage-1 architecture is shown in Figure 3.9. The preprocessing logic now updates either the inputs of all 32-bit multipliers (64×64 mode) or just those of the lower 32-bit multiplier (32×32 mode). In the 64×64 mode of operation, the second stage executes in the same fashion as that of the asymmetric version to generate the Product output, as shown in Figure 3.7.

4. 3DIC Design: 3DIC version of the asymmetric multiplier; its implementation details are given in Section 3.3.3.

Figure 3.9: Stage-1 of the symmetric 2DIC design

All 2DIC designs are implemented using a standard VLSI design flow with the help of the Cadence and Synopsys suites of CAD tools. The designs are synthesized to 45nm IBM SOI technology standard cell libraries, and placed-and-routed with a target clock period of 1ns.

3DIC Implementation

The 3DIC design is targeted to the same 45nm technology standard cell libraries that are used for building the 2DIC counterparts. First, the 2DIC asymmetric multiplier design RTL code is manually partitioned into two tiers, as shown in Figure 3.8. Then, each tier is synthesized independently, and the tiers are verified together for any timing violations. Finally, a layout is obtained using Cadence First Encounter, with both tiers bonded face-to-face.

The inter-tier signals are assigned to available bondpoints manually, as the current tools are incapable of automatically determining the assignment. There are no standard design rules to guide tools in laying out these bondpoints at the 45nm technology node. Hence, the required rules (related to the size and pitch of bondpoints) are borrowed from Tezzaron's 130nm 3DIC technology. Once the bondpoint grid is determined, each inter-tier signal is assigned a bondpoint manually. Then the tiers are independently placed and routed to build the layouts required for the 3DIC. Like the 2DIC designs, the 3DIC design is also built for a target clock period of 1ns. The post-route netlists are then analyzed together for timing verification.
As a final verification step, Cadence Conformal is used for a logical equivalence check of the 3DIC design against the original 2DIC design HDL source code.

After all the designs are implemented, a post-route simulation environment is developed as explained in the subsequent section. Then, each design is thoroughly tested to generate various results that are analyzed in Section 3.3.4.

3.3.4 Simulation Environment and Results

An SDF back-annotated simulation of the post-route netlists is performed using the Cadence NCsim simulator. The simulation trace is captured in a value-change-dump (VCD) file. Then, power consumption is analyzed using the Synopsys PrimeTime-PX tool, which takes the SPEF RC parasitics file (generated by the Cadence First Encounter layout tool) and the VCD file as inputs. The former is used for back-annotating the post-route netlist, while the latter provides the actual signal transitions. Using the signal activity, accurate average power consumption is measured, which includes net switching power, cell internal power, and leakage power. Then total energy consumption is obtained by multiplying the average power consumption by the total simulation time.

Once such an accurate simulation environment is established, realistic test cases are required to generate meaningful results. Each test case is composed of 100 thousand (100,000) test vectors, with each vector containing two 64-bit inputs. To emulate a practical workload, these test cases are generated using a random number generator that has a distribution similar to the one presented in [73]. About 58% of the test vectors have both input operands with a precision of 32 bits or less. However, this distribution alone does not specify how many test vectors have one input with operand precision greater than 32-bit while the other operand precision is less than 32-bit. This is very important for demonstrating the usefulness of the asymmetric mode of operation. So, the following criterion is used: the remaining 42% of the test vectors have either one or both input operands with precision greater than 32-bit.
Let us define P as follows: P is the fraction of the above-defined 42% of test vectors that have one input with operand precision less than 32-bit. Hence, P × 42% of the total test vectors in a test case execute in the 32×64 mode of operation. Using this criterion, P is swept from 1/10 to 9/10, in steps of 1/10, to generate 9 different test cases, each containing 100,000 test vectors. Each design, including the baseline, is then simulated with these 9 test cases, and the results obtained are presented in the following section. While it is difficult to specify a P that applies to all workloads, for most workloads the P value is quite high because most programs running on 64-bit machines rarely encounter overflow exceptions [76], implying that when one of the input operand precisions is greater than 32-bit for a multiply operation, the other must be less than 32-bit most of the time. These instructions will be executed in asymmetric mode.

Results

Table 3.3 shows the average switching and total power consumption measured using the VCD files obtained by simulating the designs at a clock frequency of 1GHz. Clearly, all the proposed variants outperform the baseline design in terms of power. Using these values, total energy consumption is calculated. The A2DIC, 3DIC, and S2DIC designs consume up to 33.4%, 31.7%, and 24.6% less energy than the baseline design, respectively. Figure 3.10 shows the plot of the percentage of switching (dynamic) energy saved when compared to the baseline design versus the P value. These results show that the asymmetric mode of operation is very fruitful, as A2DIC provides nearly 9% more energy savings than S2DIC.

Table 3.3: Net switching power (NS) and total power (TP) of all designs

        Baseline      A2DIC         S2DIC         3DIC
P       NS    TP      NS    TP      NS    TP      NS    TP
1/10    17.4  39.1    12.3  29.9    12.2  29.5    12.0  31.0
2/10    17.1  38.6    11.9  29.1    12.1  29.3    11.6  30.1
3/10    16.8  38.0    11.4  28.2    12.0  29.1    11.1  29.2
4/10    16.5  37.4    11.0  27.4    11.8  28.8    10.7  28.3
5/10    16.0  36.5    10.3  26.1    11.6  28.3    10.0  26.9
6/10    15.7  35.9     9.9  25.3    11.4  27.9     9.6  26.0
7/10    15.4  35.2     9.4  24.4    11.1  27.5     9.1  25.1
8/10    15.0  34.4     8.9  23.4    10.9  26.9     8.6  24.0
9/10    14.7  33.8     8.5  22.5    10.7  26.5     8.2  23.1

P = (Asymmetric operations / Operations that are not 32×32). Power units: mW

Figure 3.10: Switching energy savings

Table 3.4: Comparisons between various designs

Design    Footprint   Density  Cell Area  ACL    Leakage
          (µm × µm)   (%)      (µm²)      (µm)   (mW)
Baseline  174×168     76       22216      6.613  2.96
A2DIC     188×184     74       25598      5.223  3.28
S2DIC     185×181     75       25114      5.571  3.27
3DIC      136×136     71       26264      4.318  3.57

Table 3.4 presents the comparisons between various designs with respect to footprint, cell density, cell area, average connection length (ACL), and leakage power. Predictably, the leakage power increases with cell area, but the difference is marginal. As expected, the 3DIC demonstrates its ability to achieve a high device density through a smaller footprint. The 3DIC is 36.7%, 46.5%, and 44.8% smaller in terms of chip size than the baseline, A2DIC, and S2DIC designs, respectively.

Figure 3.11: Bondpoints: blue-signals, white-unused, red-VDD, yellow-Ground

As stated above, the 3DIC chip is almost half the size of the A2DIC chip, but it has about 2.6% more cell area. The ACL of the 3DIC is 17.3% smaller than that of A2DIC. When compared to A2DIC, the 3DIC consumes 2.9% less switching energy, but uses 3.1% more overall energy, due to larger cell internal power dissipation. Energy efficiency can be heavily boosted by a more mature and standardized 3DIC technology. The size and pitch used for the bondpoints are those of Tezzaron's 130nm technology. As shown in Figure 3.11, the bondpoint density is quite low. The inter-tier interconnects are widespread across the chip (blue bondpoints in Figure 3.11), causing signals to travel long distances. For the 3DIC layout, the place-and-route tool inserted buffers or resized (enlarged) the cells that drive the bondpoints to satisfy timing constraints, thereby reducing the ACL as well as the wire load capacitance, and hence reducing switching energy.
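The percentages quoted in this section can be cross-checked against Tables 3.3 and 3.4 with a few lines of arithmetic. The Python sketch below is illustrative only: the dictionary and function names are hypothetical, and it assumes that every design is simulated for the same total time, so that energy ratios equal average-power ratios, and that the 3DIC versus A2DIC figures are taken over the sum of the nine test cases.

```python
# Total power (mW) from Table 3.3, ordered P = 1/10 ... 9/10.
TOTAL_POWER = {
    "Baseline": [39.1, 38.6, 38.0, 37.4, 36.5, 35.9, 35.2, 34.4, 33.8],
    "A2DIC": [29.9, 29.1, 28.2, 27.4, 26.1, 25.3, 24.4, 23.4, 22.5],
    "S2DIC": [29.5, 29.3, 29.1, 28.8, 28.3, 27.9, 27.5, 26.9, 26.5],
    "3DIC": [31.0, 30.1, 29.2, 28.3, 26.9, 26.0, 25.1, 24.0, 23.1],
}
# Net switching power (mW) from Table 3.3 for the two asymmetric designs.
SWITCHING_POWER = {
    "A2DIC": [12.3, 11.9, 11.4, 11.0, 10.3, 9.9, 9.4, 8.9, 8.5],
    "3DIC": [12.0, 11.6, 11.1, 10.7, 10.0, 9.6, 9.1, 8.6, 8.2],
}
# Footprint (um x um) from Table 3.4.
FOOTPRINT = {
    "Baseline": (174, 168),
    "A2DIC": (188, 184),
    "S2DIC": (185, 181),
    "3DIC": (136, 136),
}


def max_energy_saving_pct(design):
    """Largest saving over the baseline across the nine P values."""
    pairs = zip(TOTAL_POWER["Baseline"], TOTAL_POWER[design])
    return max(100.0 * (base - power) / base for base, power in pairs)


def footprint_reduction_pct(reference, design="3DIC"):
    """How much smaller the footprint of `design` is than `reference`."""
    ref_w, ref_h = FOOTPRINT[reference]
    w, h = FOOTPRINT[design]
    return 100.0 * (1.0 - (w * h) / (ref_w * ref_h))


def relative_change_pct(table, design, reference):
    """Change of `design` versus `reference`, summed over all test cases."""
    return 100.0 * (sum(table[design]) / sum(table[reference]) - 1.0)


for name in ("A2DIC", "3DIC", "S2DIC"):
    print(name, round(max_energy_saving_pct(name), 1))
for name in ("Baseline", "A2DIC", "S2DIC"):
    print("3DIC vs", name, round(footprint_reduction_pct(name), 1))
```

Under these assumptions the script reproduces the 33.4%, 31.7%, and 24.6% energy savings and the 36.7%, 46.5%, and 44.8% footprint reductions, and `relative_change_pct` gives about -2.9% switching power and +3.1% total power for the 3DIC relative to A2DIC.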
However, the buffer insertion and cell upsizing increased the cell area required for the 3DIC layout, causing cell internal energy consumption to increase. This is evident from the fact that the cell area of the post-synthesis 3DIC is slightly smaller (<1%) than that of A2DIC, while the post-route cell area is 2.6% larger. The penalty for the increased cell area outweighed the savings achieved by the reduced ACL, causing the overall energy consumption to increase. However, as the technology evolves, the bondpoint size and pitch will scale down [77], enhancing the 3DIC performance.

Additionally, signed variants of the baseline design and the asymmetric multiplier are implemented. Results show that the signed asymmetric 2DIC design consumes about 25.0% and 30.4% less energy than the signed baseline design for P = 5/10 and 9/10, respectively. If only the energy consumed in switching is considered, the savings increase to 33.1% and 40.3%, respectively. However, the signed A2DIC has 26.9% more area than the baseline design.

3.4 Observations

Based on the 3DIC design and implementation exercises discussed in this chapter, two main areas in the 3DIC design flow are identified as needing further development, and they motivated the remainder of the dissertation research:

- Logic partitioning
- Implementation techniques for assigning inter-tier signals to 3D-vias

Logic Partitioning in 3DIC

Logic partitioning is arguably the most crucial step in building a 3DIC chip, especially when a design for 3D approach is pursued (refer to Section 4.2). Whether partitioning is performed by design or by CAD tools, it is important to have an efficient design partitioning framework that provides guidelines to both designers and tools. Such a formal framework will use the characteristics of the design (functionality, estimated area, etc.) and the parameters of a 3DIC technology (number of tiers, TSV dimensions, bonding techniques, etc.) to provide important guidelines for partitioning a design. As part of developing such a framework, design partitioning techniques are introduced in Chapter 4. Logic partitioning in 3DIC requires an upper bound on the number of 3D-vias, as one cannot assume a high availability of 3D-vias, especially in the case of TSVs. Unlike 2DIC approaches, where min-cut is often used for logic partitioning to reduce communication between the partitions, in a 3DIC min-cut is not necessarily optimum, and the communication between the 3DIC partitions is limited by an upper bound. Such a bound will be a useful constraint in the partitioning framework. This upper bound is obtained by comparing the performance of the resulting 3DIC against a baseline 2DIC. Hence, a 3DIC average wire length model based on Rent's rule is developed, which serves as a performance measure to obtain an upper bound for a given circuit and 3DIC configuration. The 3DIC average wire length model used to obtain an upper bound on the number of 3D-vias is presented in Chapter 5. Finally, a formal framework for logic partitioning is introduced in Chapter 6, which is developed to have the following features: the ability to work at various partitioning granularities, with a limited amount of data from the designer, and even at the initial stages of a 3DIC chip design.

Implementation Techniques

Traditionally, chip designers manually assign I/O signals to pins. These I/O pins influence the floorplan and cell placement of the chip. Similarly, in a 3DIC, along with I/O pins, the locations of 3D-vias (TSVs or bondpoints) carrying inter-tier signals have a significant effect on standard cell placement in a tier. This assignment of inter-tier signals to 3D-vias is crucial, as it guides the cell placement, which in turn affects the overall performance of the 3DIC. In the above exercises, a greedy approach was used for assigning inter-tier signals to 3D-vias in the 3DIC FPU design. The assignment was carried out manually in the case of the 64-bit multiplier. The former approach is not an optimal method, and the latter is a very tedious and unpredictable task.
The manual assignment (similar to a chip I/O plan) would be impractical for large designs with hundreds of millions or billions of transistors, as the resulting inter-tier signals are typically very large in number. Hence, it is vital to find automated and optimized implementation techniques for assigning inter-tier signals to 3D-vias. Chapter 7 presents the research that was inspired by this observation; it introduces several new assignment heuristics, followed by optimal assignment techniques that result in minimum wire length for interconnecting tiers. As the presence of TSVs makes congestion a more severe problem in 3DIC as compared to 2DIC, congestion-aware optimal assignment techniques are introduced in Chapter 8.

Chapter 4

Logic Partitioning Principles

4.1 Introduction

Over time, with transistor sizes shrinking each generation, designers have continued to add a tremendous amount of functionality into a single chip, increasing its size. As a result, the cost of manufacturing has continued to increase due to reduced yield, given that yield is inversely proportional to chip size. Homogeneous 3D integration can be used to overcome this challenge. In a logic-on-logic stacked 3DIC, a single design is partitioned into several active tiers that are stacked together. Each tier will be smaller than the corresponding conventional 2-dimensional chip, and hence has a higher probability of being fault-free. Also, a logic-on-logic stacked 3DIC can be manufactured using wafer-to-wafer assembly, which is potentially the most cost-effective 3DIC fabrication process, because wafer-level processing can build hundreds or thousands of devices at once [11]. Additionally, chip security can be enhanced using logic-on-logic 3DIC [51]. These attributes are very desirable, especially in defense, biomedical, and mobile technologies.
Given that logic-on-logic stacking has great capacity to overcome some of the challenges that are present in the semiconductor industry, several researchers are focusing on this field [9,12,19–21,23–28,78]. One of the major steps in building a logic-on-logic stacked 3DIC is partitioning the design across tiers, which is the emphasis of the research presented in this chapter.

4.2 Importance of Partitioning

How and when a design is partitioned has a significant influence on the 3DIC end product. A design can be partitioned at various levels, ranging from coarse to fine grained, depending on the granularity of the blocks that are to be stacked: core-level, functional unit blocks (FUBs), logic gates, and transistors [9]. Table 4.1 summarizes the potential benefits of 3DIC based on the partitioning granularity, and the impact on reuse of 2DIC design blocks and current CAD tools [9].

Table 4.1: Summary of impact of design partitioning granularity on 3DIC

Granularity | Potential benefits | Redesign effort
Entire core | Low: Power and performance of individual components unchanged. Some benefit in reducing footprint of clock and power networks. | Low: Reuse existing 2DIC design.
FUBs | Medium: Reduced latency and power of global routes provides simultaneous performance improvement with power reduction. | Medium: Must re-floorplan and re-time paths. Existing 2DIC FUBs can be reused.
Logic gates | High: Reduced latency/power of global, semi-global, and local routes. Further area reduction due to compact layout and resizing opportunities. | High: Need new 3D circuit designs, methodologies, and layout tools. Reuse existing 2DIC standard cell libraries.
Transistors | High: Possible further reductions in area, latency, and power. Transistor size relative to 3D-via pitch makes gains unlikely except for large, complex gates. | Extreme: Almost no reuse.

Additionally, a design can be partitioned at various stages of the design flow, and this influences the type of tools and methodologies required to build a 3DIC. In the design for 3D flow, a design is partitioned by design before logic synthesis.
Other design flows are also possible where a design is partitioned after logic synthesis, i.e., during place-and-route (PAR), using automated algorithms.

4.2.1 2DIC vs. 3DIC Logic Partitioning

One of the major differences between 2DIC and 3DIC logic partitioning is that in a traditional 2DIC, the design (or netlist) is partitioned only in the back-end [30], whereas in 3DIC, a design is partitioned either in the front-end or in the back-end. The front-end partitioning allows 3DICs to be partitioned at various abstraction levels including architectural and functional levels [9]. The back-end partitioning allows only a circuit level or synthesized netlist level partitioning in the case of 2DIC. One might extrapolate that 3DIC may also be similarly partitioned only in the back-end to allow users to use current 2DIC partitioning techniques, but the partitioning and implementation optimizations become challenging for large designs, necessitating architectural/functional partitioning for general-case optimized 3DICs. Such partitioning simplifies tier-level testing and debugging, which are required to increase the yield of 3DIC chips. The problem size is significantly smaller for partitioning at an architectural/functional level, making it more likely to achieve an optimal partitioning. Also, knowing exactly in which tier specific logic blocks reside helps with testability and floorplanning. Additionally, the same front-end partitioning can potentially be used for any technology node (for example, 32nm or 22nm), avoiding redesign efforts.

The essence of netlist partitioning in 2DIC is to divide a system specification into clusters such that the number of inter-cluster connections is minimized [30]. In contrast, a min-cut partitioning, i.e., minimizing inter-tier connections, can greatly impair the advantage of 3DICs in total wire length [13]. Typically, 3DIC partitioning algorithms focus on optimizing other criteria like minimum wire length, reduced chances of hot spots, lower area/footprint, etc.
The presence of additional parameters including the third dimension, through-silicon-vias (TSVs), multiple bonding configurations, tier-level testability, and thermal issues exacerbates the 3DIC partitioning problem. Traditional 2DIC partitioning algorithms with some modifications may be used in the back-end for 3DIC partitioning, but to fully exploit the third dimension and build efficient 3DICs, 3DIC-specific logic partitioning methods are essential. This observation is an inspiration for the partitioning framework presented in this dissertation.

4.2.2 Related Work

In the bulk of the existing literature on logic-on-logic stacked 3DICs, the logic partitioning is design-specific [23–25,27,28], where a functional unit is redesigned for 3DIC implementation. All the prior work provides good designs and demonstrates the potential benefits possible with 3DIC design, but the techniques used for logic partitioning are applicable only to a specific design. The work presented in this chapter introduces logic partitioning techniques that can be applied to any design. In contrast to partitioning by design, a good amount of research has been dedicated to automating the partitioning of a design across multiple tiers [19,20] during PAR. In this case, the logic-on-logic stacked 3DIC is used only as an implementation platform, i.e., the logic design and architecture of the chip are the same as those in 2DIC. As explained in later sections, the proposed techniques can be integrated into these automated logic partitioning techniques to achieve an efficient 3DIC design. Potentially, this makes them run faster and converge closer to an optimum value, due to the reduced search space for finding an optimum solution.

4.3 Logic Partitioning Principles

The primary step in logic-on-logic 3DIC design is to determine the number of tiers into which the design will be partitioned and how to partition among the tiers. The number of tiers is a conclusion a designer makes after exploring the 3DIC design space.
The discussion presented here does not make any assumptions about the number of partitions, as the proposed methods are independent of the number of tiers. Moreover, the partitioning techniques can be employed by the designer at the top-most level, i.e., the front-end, or by the tools at the back-end. Due to a lack of 3DIC tool support, the demonstration design is partitioned manually at the RTL level. The following sections describe the design partitioning methods.

Figure 4.1 shows three possible partitioning methods in the proposed approach:

1. Combinational to Sequential (C2S): In this method, each signal traversing a vertical interconnect is an output of combinational logic driving the data input of a sequential element in the neighboring tier. This partitioning method is shown in Figure 4.1(a), where the inter-tier signals are generated as outputs in tier-i and connect to inputs in its adjacent tier-j.

2. Sequential to Combinational (S2C): This partitioning method is presented in Figure 4.1(b), where the outputs of sequential elements, i.e., the Q-outputs of the flip-flops of a tier (tier-i), tunnel through tiers and drive combinational logic in another tier (tier-j).

3. Combinational to Combinational (C2C): In the C2C partitioning approach (shown in Figure 4.1(c)), the design can be partitioned anywhere in the combinational logic, i.e., this method splits combinational logic across the tiers. Hence, the intermediate signals of combinational logic in tier-i will traverse the vertical interconnects to the adjacent tier-j.

The fourth possible method, sequential to sequential, is not considered, as such paths are atypical for most designs: buffers or other delay elements are almost always inserted to avoid potential hold violations due to clock skew. Hence, the sequential to sequential case can be ignored.
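The taxonomy above can be expressed as a simple classification of each net that would cross tiers. The following Python sketch is a hypothetical illustration, not a tool from the dissertation: the "comb"/"seq" labels, the function names, and the dictionary-based netlist representation are all assumptions made for the example.

```python
CROSSING_TYPES = {
    ("comb", "seq"): "C2S",   # combinational output drives a flip-flop data input
    ("seq", "comb"): "S2C",   # flip-flop Q output drives combinational logic
    ("comb", "comb"): "C2C",  # cut inside a combinational cloud
    ("seq", "seq"): "S2S",    # atypical, ignored in the text
}


def classify_crossing(driver_kind, remote_sink_kinds):
    """Classify one inter-tier net.

    driver_kind: "comb" or "seq", the cell driving the net in tier-i.
    remote_sink_kinds: kinds of the cells the net drives in tier-j.
    """
    sinks = set(remote_sink_kinds)
    if driver_kind not in ("comb", "seq") or not sinks:
        raise ValueError("need a valid driver and at least one sink")
    if not sinks <= {"comb", "seq"}:
        raise ValueError("sink kinds must be 'comb' or 'seq'")
    if len(sinks) > 1:
        return "mixed"  # drives both kinds of cells in the other tier
    return CROSSING_TYPES[(driver_kind, sinks.pop())]


def candidate_cuts(nets):
    """Keep only crossings of the two pursued kinds, C2S and S2C.

    nets: mapping of net name -> (driver_kind, remote_sink_kinds).
    """
    result = {}
    for name, (driver_kind, sink_kinds) in nets.items():
        kind = classify_crossing(driver_kind, sink_kinds)
        if kind in ("C2S", "S2C"):
            result[name] = kind
    return result
```

For instance, a net driven by an adder output that feeds only a pipeline register in the other tier would be reported as C2S, while a cut through the middle of the adder would be reported as C2C and dropped from the candidate list.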
(a) Combinational to sequential
(b) Sequential to combinational
(c) Combinational to combinational
Figure 4.1: Design partitioning techniques

When compared to the other two methods, C2C has the following major disadvantages:

(1) Larger search space: In the other two methods, the search is concentrated on either the input or the output nets of sequential nodes (let us call these nets sequential nets). These sequential nets are a fraction of the total number of nets in a typical design. For example, a custom USC 32-bit RISC processor [79] has 1834 sequential nets compared to a total of 18645 nets (excluding I/O ports), and the FPU design used as a demonstration design in this chapter has 630 sequential nets out of 1700 total nets. Thus the search space in the C2C approach would be several times larger than that of the other cases.

(2) Increased number of vertical interconnects: In general, when a combinational logic block is divided into two parts, the interconnects between the two parts will most likely outnumber the inputs or outputs of the block, which are connected to sequential elements.

(3) Harder optimization problem: Consider the scenario shown in Figure 4.2. The optimization problem of minimizing the path delay is constrained by the large capacitance (C_via) of the inter-tier via (say, a 3D-via). As a result, the logical effort [80] optimization problem is Minimize[D_1 + D_2]. The path electrical effort for D_1 is (C_via + C_X)/C_in and that of D_2 is C_Load/C_X. Clearly, the presence of a 3D-via in the datapath exacerbates the optimization problem.

Figure 4.2: Example datapath in C2C method

Given these drawbacks, the C2C method is less promising than the other two approaches. It becomes even more challenging when the design is partitioned manually for building a custom 3DIC. Hence, the later sections discuss only the first two methods, C2S and S2C.

4.4 Analysis

Now that an overview of the partitioning techniques has been given, this section provides a more detailed analysis.
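To illustrate why a 3D-via in the middle of a combinational path makes the C2C sizing problem harder, the two electrical efforts given above can be placed in the standard logical effort delay expression, d = g·h + p per stage. The Python sketch below is an assumption-laden example rather than the dissertation's analysis: the logical efforts g1 and g2, the parasitic delays p1 and p2, the function names, and the capacitance values are hypothetical, and C_in and C_Load are treated as fixed.

```python
import math


def path_delay(c_x, c_in, c_via, c_load, g1=1.0, g2=1.0, p1=1.0, p2=1.0):
    """D_1 + D_2 in units of tau for the two-stage path of Figure 4.2.

    Stage 1 (input capacitance c_in) drives the 3D-via plus the input
    capacitance c_x of stage 2; stage 2 drives c_load.
    """
    d1 = g1 * (c_via + c_x) / c_in + p1
    d2 = g2 * c_load / c_x + p2
    return d1 + d2


def best_c_x(c_in, c_load, g1=1.0, g2=1.0):
    """Stage-2 input capacitance that minimizes D_1 + D_2.

    Setting the derivative g1/c_in - g2*c_load/c_x**2 to zero gives the
    expression below; note that it does not depend on c_via.
    """
    return math.sqrt(g2 * c_load * c_in / g1)


# Hypothetical capacitances, expressed in units of a minimum inverter input.
c_in, c_load, c_via = 1.0, 16.0, 10.0
c_x = best_c_x(c_in, c_load)
print(c_x)                                    # 4.0
print(path_delay(c_x, c_in, 0.0, c_load))     # 10.0 without the via
print(path_delay(c_x, c_in, c_via, c_load))   # 20.0 with the via
```

Under these assumptions the via adds a fixed term g1·C_via/C_in that resizing the second stage cannot remove; only a larger driving stage reduces it, which is consistent with the observation that the via capacitance constrains the optimization.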
4.4.1 Delay Optimization

Consider a situation in the C2S and the S2C methods that is similar to the C2C method shown in Figure 4.2. The wire capacitance C_via of the 3D-via will be seen at the output net of the combinational logic in the C2S method, and at the input net in the S2C method. Hence, the datapaths in the C2S and S2C methods are not constrained like the one in the C2C method, resulting in a simpler logical effort optimization problem in these cases. However, in the S2C method the sequential unit driving the 3D-via should be relatively stronger than that in the C2S method. Therefore, a design partitioned using the S2C method could be larger in size, whereas with the C2S method the combinational logic will be sized appropriately to drive the highly capacitive 3D-via. Hence the C2S method may have an advantage over the S2C method.

4.4.2 Fanout

Given the methodology of the C2S approach, the fanout of a 3D-via is always one, whereas in the S2C approach the fanout of a 3D-via is design-dependent. Thus, in the S2C method, the capacitive load of a 3D-via could be exploited. Due to the high capacitance of a 3D-via, a signal routed through a 3D-via will be more capable of driving a higher fanout, since it will already be buffered to drive the higher capacitance of the 3D-via. Hence, during partitioning using the S2C method, an additional optimization criterion can be added to choose sequential nets that have a higher fanout as favorable nets for partitioning.

4.4.3 Advantages

The main advantage of the C2S, S2C, or hybrid methods (in a hybrid method, a partition can be made at the input or output net of a sequential element) is the significant reduction in the size of the search space for the optimal partitioning choice. As in the C2S and the S2C methods, the search space in a hybrid approach is a function of the number of sequential elements. Moreover, these methods will not only make the partitioning algorithms faster, but will also make tools converge to a near-optimum solution.
As the search space is reduced, algorithms like simulated annealing tend to converge closer to an optimal solution [81].

Another benefit of partitioning around a sequential element is that these approaches allow a pipelined functional-unit block (FUB), for example a multi-stage execution unit like a multiplier, to be split across tiers without any need for redesigning the FUB. Partitioning at a finer granularity than a FUB has shown a high potential to achieve better performance gains by using the third dimension [9]. Moreover, it is also possible to use current standard-cell designs and CAD tools for building an efficient 3DIC. Therefore, these partitioning methods would be cost and time effective, as designing new 3DIC-specific CAD tools and standard-cell libraries from scratch will be costly and time consuming. These low-cost and quick-to-product attributes are very desirable for an emerging technology like 3DIC.

Further, these logic partitioning principles complement the design-for-3D paradigm, where a designer can divide the design into several tiers seamlessly; e.g., these techniques can also be used at the RTL level to build useful 3DICs.

4.5 Demonstration

In this section, the 3DIC technology and implementation details of the demonstration design, a single-precision FPU, are discussed.

4.5.1 3DIC Technology

The 3DIC FPU designs are implemented using the MOSIS 3DIC package based on the Tezzaron-GlobalFoundries two-tier 130nm fabrication process. Tezzaron uses a wafer-to-wafer bonding technology to build 3D chips [71]. This package is limited to two tiers, and hence it uses face-to-face bonding, where the vertical interconnects are made using M6 metal layer bondpoints. In other cases the vertical interconnects are made using TSVs. The input/output and power signals are interfaced with pads using TSVs, called Super-Vias [72]. Tezzaron's super-vias require a very thin wafer (approximately 5 µm), hence the top tier, on which the pads are located, is thinned.
4.5.2 Implementation

First, the FPU design is manually partitioned into two tiers. Two variants of the 3DIC FPU are implemented: one using the C2S method, and the other using the S2C method. Once the FPU design is partitioned, each tier is synthesized independently using Synopsys Design Compiler, targeting the design to the Tezzaron-GlobalFoundries 130nm standard cell libraries. Initially, the slower tier is identified, as it will become the bottleneck when both tiers are stacked together. Then the faster tier is re-synthesized using the clock speed of the slower tier, to minimize the cell area. Both tiers are then verified together as a 3D design for any timing violations using Synopsys PrimeTime, and the timing reports are analyzed to optimize the design for faster speed. Finally, the layout is obtained using Cadence First Encounter. The PAR tool uses the design rules specified by Tezzaron's physical design kit. First, one of the tiers is placed and routed; then the other tier is placed and routed using the mirrored version of the first tier's assignment of inter-tier signals to bondpoints as a reference. The layouts of the S2C-based 3DIC FPU are shown in Figure 4.3(a), in which the arrows illustrate the mirrored locations of some of the inter-tier signals, as an example. The post-layout design is then analyzed for any timing violations using Synopsys PrimeTime. As a final verification step, Cadence Conformal is used for a logical equivalence check of the 3DIC FPU designs against the original FPU source code.

Figure 4.3: 3DIC FPU layouts: (a) 3DIC FPU, with bump color key: signals in blue, VDD in red, VSS in yellow; (b) post-CTS layout of tier-2

4.6 Results

The results presented in this section are the outcome of the implementation experiments conducted using the FPU design. Table 4.2 shows the clock period and footprint comparisons of the 3DIC FPU designs with a 2DIC FPU. The S2C and C2S variants of the 3DIC FPU are about 7% and 3% faster than the 2DIC FPU, respectively. Both have a 41.5% smaller footprint than the 2D version. Clearly, both 3DIC FPU designs outperform the 2DIC FPU design, even though the FPU is a relatively small design when compared to designs like multi-core processors. These results indicate that larger designs can achieve greater performance improvements by using logic-on-logic stacking, when compared to the traditional 2DIC design approach. Moreover, better PAR techniques can further enhance 3DIC performance.

Table 4.2: Results: S2C-based 3DIC FPU vs. C2S-based 3DIC FPU vs. 2DIC FPU

                  S2C 3DIC FPU      C2S 3DIC FPU      2DIC FPU
    Clock Period  6.37 ns           6.61 ns           6.83 ns
    Footprint     306 x 306 µm²     306 x 306 µm²     400 x 400 µm²

Table 4.3: Results: Tier-wise comparison for the S2C and C2S methods

                                S2C 3DIC FPU          C2S 3DIC FPU
                                Tier-1     Tier-2     Tier-1     Tier-2
    Average Wire Length (µm)    10.800     12.200     12.500     9.700
    Power (mW)                  4.586      6.256      4.639      6.08
    Cell Area (µm²)             81463      73972      62736      79590

Table 4.3 presents the tier-wise comparisons between the S2C and C2S techniques. According to the data shown in Tables 4.2 and 4.3, the S2C 3DIC FPU is slightly faster (about 3.6%) than the C2S variant but has more cell area (about 9.2%). Hence both partitioning techniques (a hybrid of both is also possible) should be considered, so that the method that performs best for the given specifications can be selected.

Furthermore, the power numbers (calculated using a primary input activity factor of 0.2) show the importance of reducing the average connection length in integrated circuits. Even though tier-1 of the S2C variant has about 30% more cell area than tier-1 of the C2S 3DIC FPU, it has about 1.2% less dynamic power consumption, most likely because tier-1 of the C2S variant has a greater average connection length than tier-1 of the S2C version. The same is true for tier-2 of both designs.

4.7 Conclusions

The S2C, C2S, and hybrid (partition at the input or output of a sequential cell, a mix of S2C and C2S) techniques provide useful heuristics to partition any design, and when integrated into automatic partitioning tools the search space is greatly reduced.
However, to provide a concrete solution for logic partitioning, more work needs to be done. Such a solution should be able to guide both designers and tools to achieve efficient logic partitioning. Given that design partitioning is a very crucial step, a formal framework that identifies the quantifiable characteristics of a design to find an optimum partitioning is presented in Chapter 6.

Before partitioning a design, it is useful to have an a priori estimate of an upper bound on the number of 3D-vias, as one cannot assume a high availability of 3D-vias, especially in the case of TSVs. In order to provide an upper bound, a 3DIC average wire length model based on Rent's rule is presented in the following chapter, which compares the resultant 3DIC average wire length against a corresponding estimated 2DIC average wire length to achieve an upper bound.

Chapter 5
3DIC Average Wire Length Model

5.1 Introduction

In a three dimensional integrated circuit (3DIC), multiple active-logic tiers are stacked vertically to make a single chip. The active devices in different tiers are interconnected using vertical vias (or 3D-vias). A face-to-face (F2F) bonded 3DIC uses bondpoints (micro-bumps) to interconnect the tiers, while through-silicon-vias (TSVs) are used as 3D-vias that tunnel through the active layer of a tier in face-to-back (F2B) and back-to-back (B2B) bonding, as shown in Figure 5.1 [77]. The size of a TSV is several times larger than a typical gate [31]; as a result, TSVs may negatively impact interconnect lengths if present in large numbers.

Figure 5.1: 3DIC bonding techniques, left to right: F2F, F2B, B2B

A good amount of literature exists on estimating 3DIC wire length distribution [32,82,83], as such modeling is useful for 3DIC design exploration. However, some simplifications were assumed that limit the applicability of this prior work. The authors in [82,83] did not consider the size of a TSV. Similarly, the work presented in [32] assumes a fixed size for TSVs (4 times a typical gate).
Though [32] introduced a TSV-aware wire length distribution estimation method, the work made simplifying assumptions in its use of parameters with Rent's rule [33], a formula often used for estimating the wire length distribution of logic circuits. The most notable assumption was that all the tiers of a 3DIC have the same Rent's coefficient ($k$) and exponent ($p$).

To provide a more robust model, this chapter introduces a new tier-level hierarchical 3DIC average wire length estimation method that considers various 3DIC technology parameters. Similar to the models presented in [32,82,83], the proposed method is also built on the wire length distribution estimation technique presented by Davis et al. [84] for a traditional 2DIC, based on Rent's rule. However, since the Rent's rule parameters may not be the same for all the tiers of a 3DIC, the proposed method uses a tier-by-tier approach, where the wire length distribution of each tier is estimated independently and then combined to achieve an average wire length estimate of an entire 3DIC. The proposed model is also applicable to variable TSV sizes. Additionally, the type of 3DIC bonding used, F2B or B2B, is considered for extended applicability and more accurate results. In addition to wire length estimates, the model provides an upper bound on the number of TSVs for optimizing average wire length.

Section 5.2 presents the motivation. Sections 5.3 and 5.4 discuss the approach and the impact of TSVs on 3DIC average wire length estimations. Section 5.5 gives the method to obtain an upper bound on the number of TSVs. Finally, the results are presented in Section 5.6, followed by conclusions.

5.2 Motivation

One of the most attractive features of 3DIC technology is its ability to stack diverse circuitry, like memory on logic or network-on-chip over processing cores, where all tiers may not have the same Rent's parameters.
Additionally, a 2DIC architecture may be modified [52] to take full advantage of a 3DIC by following a design-for-3D approach, where the Rent's parameters can potentially be different for each tier. Hence a tier-by-tier hierarchical approach is suitable, where the wire length distribution of each tier is estimated separately, and then combined along with the distribution of inter-tier interconnect lengths. The hierarchical model allows the flexibility of having unequal numbers of vertical interconnects between different pairs of adjacent tiers, as each tier's wire length distribution is estimated separately. For example, if a 3DIC has three tiers bonded F2B, one pair of adjacent tiers can have a different number of TSVs between them than the other pair of adjacent tiers. The model presented in [84] can be used for estimating the wire length distribution of a tier with no TSVs, but an extended method is needed to address the impact of TSVs, which is presented in Section 5.4 of this chapter.

Though 3DIC offers great interconnect bandwidth between bonded tiers, it is very important to know the bounds of its applicability to practical designs. In a F2F 3DIC, the size and pitch of the bondpoints (top-metal layer micro-bumps) determine the maximum number of 3D-vias physically possible per unit area. Given that bondpoints do not physically affect the active silicon area, the upper bound on 3D-vias between F2F bonded tiers can be calculated using the dimensions of the bondpoints, the area of the design, and the arrangement (layout) of the bondpoints (staggered or grid). The real challenge is to estimate the upper bound on 3D-vias for F2B or B2B configurations, where TSVs are used as 3D-vias, which preclude the placement of circuits in the corresponding active silicon area. Partitioning a design across multiple tiers has the potential to reduce the average wire length, but with the addition of TSVs the active area per tier increases, which in turn may increase the total and average wire length, potentially resulting in a negative effect on performance. Hence there will be a tipping point where any further addition of TSVs between the bonded tiers may result in worse performance in the 3DIC than in the 2DIC implementation.

The aspect ratio of TSVs is critical for achieving acceptable yield, and due to wafer/die thinning issues, the result is a TSV that is large compared to gates. As transistor sizes scale down, gates become smaller, but the size of TSVs may not scale down at the same rate. Hence, the average wire length estimation model must accommodate variable relative TSV sizes to allow for this unknown. First, a tier-level 3DIC average wire length estimation framework using a Rent's rule [33] based approach is presented. This framework takes into account technological parameters of TSVs and bonding methods for better estimation of average wire length and TSV bounds.

5.3 Tier-by-Tier Approach

The idea is to estimate the average wire length of a 3DIC using Rent's rule. The authors in [33] observed an empirical relationship between the number of input and output (I/O) terminals $T$ and the number of gates $N$ in a random logic network, called Rent's rule, which is expressed as:

$$T = kN^{p} \tag{5.1}$$

where the parameters $k$ and $p$ are known as the Rent coefficient and Rent exponent, respectively. For a given circuit, the net length statistics (distribution, average, etc.) of a 2DIC implementation can be estimated using Rent's rule [84,85]. For a logic circuit with $N$ gates, the number of interconnects of length $l$, $I(l)$, is estimated [84]. Using $I(l)$, the average wire length is calculated as follows:

$$AW_{2D} = \frac{\text{Total Wire Length}}{\text{Total Number of Nets}} = \frac{\sum_{l=1}^{l_{max}} l\, I(l)}{\sum_{l=1}^{l_{max}} I(l)} \tag{5.2}$$

where $l_{max}$ is the length of the longest net possible in the circuit, which is measured in units of gate pitches and is equal to $2(\sqrt{N} - 1)$. Using an approach similar to the techniques presented by Davis et al. [84], the wire length distribution of interconnects between the gates (only gates) of a tier in a 3DIC is estimated. However, the presence of TSVs complicates the estimation, as presented in Section 5.4.
5.3.1 3DIC Average Wire Length

In a multi-tier 3DIC, for the $i$-th tier the distribution of the interconnects between its gates, $I_i(l)$, is estimated. (Accounting for wiring between gates and TSVs is described later.) Then, the average wire length of a multi-tier 3DIC is given by:

$$AW_{3D} = \frac{\left[\sum_{i=1}^{N_{tiers}} \sum_{l=1}^{L_i} l\, I_i(l)\right] + \left[\sum_{i=1}^{N_{tiers}-1} P(i)\right]}{\left[\sum_{i=1}^{N_{tiers}} \sum_{l=1}^{L_i} I_i(l)\right] + fo \left[\sum_{i=1}^{N_{tiers}-1} n(i)\right]} \tag{5.3}$$

In the above equation the numerator represents the total wire length of the 3DIC, and the denominator gives the total number of wires, where:

$N_{tiers}$: Number of tiers in the 3DIC.
$L_i$: Length of the longest net in the $i$-th tier.
$n(i)$: Number of signals between the $i$-th and $(i+1)$-th tiers.
$P(i)$: Wire length penalty for bonding the $i$-th and $(i+1)$-th tiers, i.e., the wire length required to interconnect the two tiers.

5.4 Impact of TSV Instances

Consider a 3DIC tier with $N$ gates and $n$ TSVs, as shown in Figure 5.2. Each grid cell in Figure 5.2 represents a socket, defined as a unit of area that can accommodate a single gate. Depending on whether a socket is occupied by a gate or a TSV, the socket is referred to as a gate-socket or TSV-socket, respectively. Assume the area occupied by a TSV is $t^2$, i.e., $t \times t$ TSV-sockets. The value of $t$ is determined by the target 3DIC fabrication technology. The total area of the tier, $A_{tier}$, is given by the following equation.

Figure 5.2: Sockets arrangement: gates and TSVs (gray colored)

$$A_{tier} = N + n t^2 \tag{5.4}$$

The maximum possible Manhattan net length $l_{max}$ and the dimensions ($L \times L$) of this tier are given by:

$$l_{max} = 2\left(\sqrt{A_{tier}} - 1\right) \tag{5.5}$$

$$L = \sqrt{A_{tier}} \tag{5.6}$$

In Figure 5.2, $q$ is the pitch of the TSVs (the spacing between any two neighboring TSVs). There are $n$ TSVs, each of size $t^2$, in the tier. Assuming a uniform distribution of TSVs across the tier, the pitch $q$ is given by:

$$q = \sqrt{A_{tier}/n} \tag{5.7}$$

Distributions other than uniform can also be used, and $q$ can be modified accordingly along with other related derived parameters. Note that in practice $q$ must be constrained by the minimum pitch imposed by technology design rules.
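The relations above can be sketched in Python as follows. This is an illustrative sketch only: the function names, the per-tier dictionary form of $I_i(l)$, and the toy numbers are assumptions, and `fanout` stands for the average fan-out $fo$ that appears in Equation 5.3.

```python
import math


def tier_geometry(n_gates, n_tsvs, t):
    """Tier area, side length, longest net, and TSV pitch (Eqs. 5.4 to 5.7).

    All quantities are in units of gate sockets / gate pitches.
    A uniform TSV distribution is assumed, as in the text.
    """
    area = n_gates + n_tsvs * t ** 2            # Eq. 5.4
    side = math.sqrt(area)                      # Eq. 5.6
    l_max = 2 * (side - 1)                      # Eq. 5.5
    # Eq. 5.7; with no TSVs the pitch is treated as infinite (sketch choice).
    pitch = math.sqrt(area / n_tsvs) if n_tsvs else math.inf
    return area, side, l_max, pitch


def average_wire_length_3d(tier_distributions, penalties, signals, fanout):
    """3DIC average wire length (Eq. 5.3).

    tier_distributions: one dict per tier mapping l -> I_i(l)
    penalties:          P(i) for each pair of adjacent tiers
    signals:            n(i) for each pair of adjacent tiers
    fanout:             average fan-out fo
    """
    total_length = sum(
        l * count for dist in tier_distributions for l, count in dist.items()
    ) + sum(penalties)
    total_wires = sum(
        count for dist in tier_distributions for count in dist.values()
    ) + fanout * sum(signals)
    return total_length / total_wires


if __name__ == "__main__":
    print(tier_geometry(9600, 100, 2))
    # Toy two-tier example, not data from the dissertation.
    print(average_wire_length_3d([{1: 4, 2: 2}, {1: 2, 3: 2}], [20.0], [2], 3))
```

For the toy geometry (9600 gates, 100 TSVs, $t = 2$), the tier area is 10000 sockets, so $L = 100$, $l_{max} = 198$, and $q = 10$ gate pitches.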
As mentioned in the previous section, first the discrete wire length distribution $I_i(l)$ of the $i$-th tier is estimated for the nets between the gates, as

$$I_i(l) = I_{exp3D}(l)\, M_{3D}(l) \tag{5.8}$$

where $I_{exp3D}(l)$ is the expected number of interconnects between gate-socket pairs separated by a Manhattan distance $l$ in the $i$-th tier, and $M_{3D}(l)$ is the number of gate pairs that are a distance $l$ apart in the $i$-th tier. $I_i(l)$ does not include interconnects to TSVs. The additional wiring required for these interconnects is discussed in Section 5.4.2.

5.4.1 Gates Interconnects Distribution: $I_i(l)$

This section presents the technique for estimating the discrete wire length distribution of interconnects between the gates of a tier. First, $I_{exp3D}(l)$ is presented, followed by $M_{3D}(l)$.

$I_{exp3D}(l)$ Estimation

The technique proposed by Davis et al. [84], based on the principle of conservation of I/O terminals, is modified to estimate $I_{exp3D}(l)$. The expected number of interconnects of length $l$ between a pair of gates is determined using Rent's rule. As shown in Figure 5.3, the gates are grouped into three sets: $A$, $B$, and $C$. $A$ contains the gate under inspection. $C$ represents the set of all gates that are at a distance of $l$ from the gate in $A$. All the gates that are at a distance less than $l$ comprise set $B$. Let $N_A$, $N_B$, and $N_C$ be the number of gate sockets in sets $A$, $B$, and $C$, respectively. Using Rent's rule, the number of point-to-point interconnects between $A$ and $C$ is given by $I_{A\,to\,C}$ [84].

$$I_{A\,to\,C} = \alpha k \left[ (N_A + N_B)^p - (N_B)^p + (N_B + N_C)^p - (N_A + N_B + N_C)^p \right] \tag{5.9}$$

The factor $\alpha$ used in the above equation is the fraction of terminals that are sinks. It is expressed in terms of the average fan-out $fo$ of the system [86], as $\alpha = fo/(fo+1)$.

Figure 5.3: Partial circle of radius $(l-1)$

As shown in Figure 5.3, set $B$ contains the sockets covered by the partial Manhattan circle of radius $(l-1)$ from $A$, and the number of sockets in set $B$ is equal to $l(l-1)$, which is also the area of set $B$ in gate pitch units. For a 3DIC tier, some of these $l(l-1)$ sockets could be TSV sockets. For a tier in a 3DIC that has $n$ TSVs, each $t^2$ in area, assuming a uniform distribution of TSVs across the tier as shown in Figure 5.2, the fraction of TSV sockets $f_{TSV}$ in a randomly chosen contiguous group (an area) of sockets is given by:

$$f_{TSV} = \frac{\text{Total Area of TSVs}}{A_{tier}} = \frac{n t^2}{N + n t^2} \tag{5.10}$$

As set $B$ is a contiguous group of sockets, the number of gates in $B$, $N_B$, is given by:

$$N_B = l(l-1)(1 - f_{TSV}) = l(l-1)\left(\frac{N}{N + n t^2}\right) \tag{5.11}$$

The fraction $N/(N + n t^2)$ in the above equation is represented by $g$, giving:

$$g = N/(N + n t^2) \tag{5.12}$$

$$N_B = g\, l(l-1) \tag{5.13}$$

From Figure 5.3 it can be observed that the sum of the $N_C$ and $N_B$ values for distance $l$ from $A$ is equal to $N_B$ applied for a distance $(l+1)$ from $A$. In other words, $N_B + N_C$ is equal to the number of gate sockets enclosed in the Manhattan semicircle of radius $l$ from $A$. Hence,

$$N_C + N_B = N_B \;(\text{for } l+1)$$
$$N_C + N_B = g\, l(l+1) \tag{5.14}$$

The number of gates in set $A$ is one. Substituting $N_A = 1$, (5.13), and (5.14) into (5.9) gives the expected number of interconnects $I_p(l)$ from the center gate to all periphery gates in a Manhattan semicircle of radius $l$.

$$I_p(l) = \alpha k \left[ (1 + g\,l(l-1))^p - (g\,l(l-1))^p + (g\,l(l+1))^p - (1 + g\,l(l+1))^p \right] \tag{5.15}$$

Similar to Davis et al. [84], using the partial Manhattan circle approximation, the average number of interconnects between gates separated by length $l$ in a given Manhattan semicircle is obtained by dividing $I_p(l)$ by the number of gates on the periphery, $2l\,f(l)$. There are $2l$ sockets on the periphery of the Manhattan semicircle. The fraction of these sockets occupied by gates is given by the function $f(l)$.

To understand $f(l)$, consider the scenario shown in Figure 5.4. The gray sockets are TSV sockets, and the value shown in each of these sockets is $m = (l \bmod q)$, where $l$ is the distance from the TSV socket labeled $A$, and recall that $q$ is the TSV pitch. First, $f(l)$ for $A$ is derived. The sockets under the cross-line are at a distance $l = (q + 1)$ from $A$. As $l$ changes, $m$ goes from $0$ to $(t-1)$, and the number of sockets of a particular TSV that are on the periphery of radius $l$ from $A$ increases linearly from $1$ to $t$. Then, for $m$ values $t$ to $(2t-1)$, the count reduces one-by-one to zero. Figure 5.4 shows only a quarter Manhattan circle of radius $l$, which encounters $(\lfloor l/q \rfloor + 1)$ TSVs, but a Manhattan semicircle will encounter $(2\lfloor l/q \rfloor + 1)$ TSVs over a perimeter of $2l$. As the TSVs are present at regular intervals (a distance $q$ apart) on the periphery, the fraction of sockets on the periphery that are TSV sockets can be estimated by dividing the total number of TSV sockets on the Manhattan semicircle by the perimeter ($2l$) for a given $m$. Subtracting this fraction from 1 yields the fraction of sockets that are gates, which is given by $f(l)$.

Figure 5.4: Illustration for estimating $f(l)$

$$f(l) = \begin{cases} 1 - \dfrac{\left(2\lfloor l/q \rfloor + 1\right)(m+1)}{2l} & 0 \le m < t \\[2ex] 1 - \dfrac{\left(2\lfloor l/q \rfloor + 1\right)(2t - m - 1)}{2l} & t \le m < 2t \\[2ex] 1 & \text{otherwise} \end{cases} \tag{5.16}$$

where $m = (l \bmod q)$. Now, consider gate sockets $X$ and $Y$ in Figure 5.4. The fraction of gate sockets on the periphery for $X$ or $Y$ is the same as that of $A$, with only one minor difference: the regions for $f(l)$ change to $(1 \le m < t+1)$ and $(t+1 \le m < 2t+1)$. Hence, by using the same $f(l)$ for $X$ and $Y$, a shift in counting occurs. The number of gate sockets on the perimeter for radius $l$ to $l+t$ is slightly under-counted, but for radius $l+t+1$ to $l+2t$ the number of gates is slightly over-counted. As $t \ll l$, on average the error in counting is negligible. Hence, $f(l)$ is approximated to be the same for all sockets and is dependent only on $l$ for a given 3DIC configuration. Thus, the expected number of interconnects between gate pairs separated by a length $l$ in the $i$-th tier is estimated as

$$I_{exp3D}(l) = \frac{\alpha k}{2l\, f(l)} \left[ (1 + g\,l(l-1))^p - (g\,l(l-1))^p + (g\,l(l+1))^p - (1 + g\,l(l+1))^p \right] \tag{5.17}$$

Now that $I_{exp3D}(l)$ is estimated, the complete wire length distribution of interconnects between the gates of a tier can be derived using the number of gate pairs $M_{3D}(l)$ separated by a length of $l$ in a 3DIC tier.
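The estimate of $I_{exp3D}(l)$ can be sketched in Python as below, following Equations 5.12, 5.16, and 5.17 as reconstructed from the surrounding definitions. The function names are hypothetical, and the guard against a non-positive $f(l)$ (which the piecewise formula can produce at very small $l$, for example $l = 1$ with $t = 2$) is an assumption of this sketch rather than something stated in the text.

```python
def gate_fraction(n_gates, n_tsvs, t):
    """g = N / (N + n t^2), the fraction of sockets that hold gates (Eq. 5.12)."""
    return n_gates / (n_gates + n_tsvs * t ** 2)


def periphery_gate_fraction(l, q, t):
    """f(l): fraction of the 2l periphery sockets occupied by gates (Eq. 5.16)."""
    m = l % q
    tsvs_encountered = 2 * (l // q) + 1
    if m < t:
        tsv_sockets = tsvs_encountered * (m + 1)
    elif m < 2 * t:
        tsv_sockets = tsvs_encountered * (2 * t - m - 1)
    else:
        return 1.0
    return 1.0 - tsv_sockets / (2 * l)


def expected_interconnects(l, k, p, fanout, g, q, t):
    """I_exp3D(l): expected interconnects between a gate pair at distance l (Eq. 5.17)."""
    alpha = fanout / (fanout + 1)          # fraction of terminals that are sinks
    f = periphery_gate_fraction(l, q, t)
    if f <= 0:
        # Sketch-only guard: the model divides by 2 l f(l).
        raise ValueError("f(l) is not positive for this l, q, t")
    n_b = g * l * (l - 1)                  # Eq. 5.13
    n_bc = g * l * (l + 1)                 # Eq. 5.14
    bracket = (1 + n_b) ** p - n_b ** p + n_bc ** p - (1 + n_bc) ** p
    return alpha * k * bracket / (2 * l * f)


if __name__ == "__main__":
    g = gate_fraction(9600, 100, 2)
    print(g)
    print(periphery_gate_fraction(11, 10, 2))
    print(expected_interconnects(5, 4, 0.75, 3, g, 10, 2))
```

As a sanity check on the design, setting $g = 1$ and choosing $q$ so large that no TSV falls on the periphery makes $f(l) = 1$, and the expression then has the same form as Equation 5.15 divided by $2l$, i.e., the case of a tier with no TSVs.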
$M_{3D}(l)$ Derivation

To represent $M_{3D}(l)$ mathematically, a function $\theta_{3D}(i,j,l)$ is defined, which gives the number of gates that are at a distance $l$ from the socket $(i,j)$. This excludes all the gate sockets that are above $(i,j)$ and to the left of $(i,j)$ in the $i$-th row, to avoid duplicate counting of gate pairs. $\theta_{3D}(i,j,l)$ is equal to the number of gate sockets in set $C$ (see Figure 5.3), i.e., excluding sockets occupied by TSVs.

Summation of $\theta_{3D}(i,j,l)$ over the entire $L \times L$ square array of sockets gives the total number of gate-to-gate and TSV-to-gate pairs that are separated by a distance of $l$ (Equation 5.18). $M(l)$ does not include TSV-to-TSV pairs.

$$M(l) = \sum_{i=1}^{L} \sum_{j=1}^{L} \theta_{3D}(i,j,l) \tag{5.18}$$

The number of TSV-to-gate pairs that are separated by length $l$, termed $M_T(l)$, should be removed from $M(l)$ to attain the gate-to-gate pairs that are separated by distance $l$. As mentioned in the earlier section, each TSV occupies $t^2$ sockets, and TSVs are separated by a pitch $q$. Without loss of generality, assume a TSV is located at the topmost corner of the socket grid. Therefore, $M_T(l)$ is

$$M_T(l) = \sum_{x=1}^{t} \sum_{y=1}^{t} \sum_{i=0}^{L/q-1} \sum_{j=0}^{L/q-1} \theta_{3D}(qi + x,\; qj + y,\; l) \tag{5.19}$$

As $t \ll L$, $M_T(l)$ is approximated as

$$M_T(l) = t^2 \sum_{i=0}^{L/q-1} \sum_{j=0}^{L/q-1} \theta_{3D}(qi + 1,\; qj + 1,\; l) \tag{5.20}$$

$$\therefore\; M_{3D}(l) = M(l) - M_T(l) \tag{5.21}$$

$M(l)$ is computed using the function $\theta(i,j,l)$ defined by Davis et al. in [84]. This function gives all the sockets that are on the periphery of a partial Manhattan circle of radius $l$, i.e., it is equal to the number of sockets in set $C$ (Figure 5.3). In other words, $\theta(i,j,l)$ is equal to $\theta_{3D}(i,j,l)$ plus those sockets that are occupied by TSVs and are located at a distance $l$ from $(i,j)$. From [84], $\theta(i,j,l)$ is given by

$$\begin{aligned} \theta(i,j,l) ={}& (l+1)\,u_0(l+1) + (l-1)\,u_0(l-1) \\ &- (l-L+j)\,u_0(l-L+j) - (l-L+i)\,u_0(l-L+i) \\ &+ 2(l-2L+j+i-1)\,u_0(l-2L+j+i-1) \\ &- (l-L-1+j)\,u_0(l-L-1+j) - (l-L-1+i)\,u_0(l-L-1+i) \end{aligned} \tag{5.22}$$

Using the function $f(l)$ given in Equation 5.16 and the function $\theta(i,j,l)$, $M(l)$ can be approximated as

$$M(l) = f(l) \sum_{i=1}^{L} \sum_{j=1}^{L} \theta(i,j,l) \tag{5.23}$$

Evaluating (5.23) using (5.22) exactly gives

Region I: $1 \le l < L$

$$M(l) = f(l) \left[ \frac{l^3}{3} - 2l^2 L + \frac{1}{3}\, l\,(6L^2 - 1) \right]$$

Region II: $L \le l < (2L-2)$

$$M(l) = f(l) \left[ -\frac{l^3}{3} + 2l^2 L - \frac{1}{3}\, l\,(12L^2 - 1) + \frac{2}{3}\, L(2L-1)(2L+1) \right] \tag{5.24}$$

Similar to $M(l)$, $M_T(l)$, defined in Equation 5.20, changes to

$$M_T(l) = t^2 f(l)\, M'(l) \tag{5.25}$$

where:

$$M'(l) = \sum_{i=0}^{L/q-1} \sum_{j=0}^{L/q-1} \theta(qi + 1,\; qj + 1,\; l) \tag{5.26}$$

Evaluating $M'(l)$ gives

Region I: $1 \le l < L$

$$M'(l) = \frac{2lL^2}{q^2} + \frac{q}{3}\left(\frac{l+1}{q}\right)\left(\frac{l+1}{q} - 1\right)\left(\frac{l+1}{q} - 2\right) - L\left[\left(\frac{l+1}{q}\right)\left(\frac{l+1}{q} + 1\right) + \left(\frac{l}{q}\right)\left(\frac{l}{q} + 1\right)\right]$$

Region II: $L \le l < (2L-2)$

$$\begin{aligned} M'(l) ={}& \frac{2L^2}{q^2}\,(L - l + q - 1) - \left(\frac{2L-l-1}{3q}\right)\left(\frac{2L-l-1}{q} + 1\right)(l - 2L - 2q + 1) \\ &+ \left(\frac{L}{3q}\right)\left(\frac{L}{q} + 1\right)(3l - 4L - 2q + 3) + \left(\frac{L}{3q}\right)\left(\frac{L}{q} - 1\right)(3l - 2L - 2q + 3) \end{aligned} \tag{5.27}$$

Using (5.16), (5.24), (5.25), and (5.27) in (5.21) gives $M_{3D}(l)$.

5.4.2 TSV Wiring Penalty

Every TSV has a source in one tier and one or more sinks in the other tier. Generally, placement tools aim to place gates that are connected to a terminal (in this case a TSV) very close to the terminal. For example, consider a TSV that is located at one of the corners of the tier: it is highly unlikely that a gate connected to this TSV is located in the opposite corner of the tier. Assume that on average, a gate connected to a TSV is at distance $d_{gt}$ from the TSV, which can be determined based on the behavior of the interconnects in real circuits or can be swept across a range for a sensitivity study. Alternatively, the average wire length of a corresponding 2DIC can be used as $d_{gt}$, as in the experiments presented in Section 5.6.
Considering a source in one tier and an average fan-out of $fo$ in the other tier for every TSV, the wiring required for interconnecting two tiers using TSVs is estimated as

$$P = n\left[h + (fo + 1)\, d_{gt}\right] \tag{5.28}$$

where $h$ is the height of the TSV in gate pitches and is determined by the technology and bonding style (F2B or B2B) used to fabricate the 3DIC. For F2F bonded tiers, where bondpoints are used instead of TSVs, $h$ is zero. $P$ is the wire length required for the $(n \times fo)$ interconnects that span the two tiers. The factor of $(n \times fo)$ interconnects is considered in Equation 5.3 to accurately estimate the 3DIC average wire length.

The wiring requirement $P$ assumes that a TSV has a capacitance per unit length similar to that of a typical metal wire in a 2DIC. For a 3DIC versus 2DIC comparison, when TSVs are used in the 3DIC, a simple average wire length comparison may not reflect the delay and power performance of the 3DIC, as TSVs generally have a much higher capacitance per unit length than a metal wire in a 2DIC. To address this issue a parameter $\gamma_{TSV}$ is introduced, which is the ratio of TSV capacitance per unit length to metal wire capacitance per unit length. $P$ is modified as follows, where $\gamma_{TSV}\, h$ gives the equivalent metal wire length of a TSV compared to a typical metal wire in a 2DIC.

$$P = n\left[\gamma_{TSV}\, h + (fo + 1)\, d_{gt}\right] \tag{5.29}$$

5.4.3 Tier Area and Number of Gates

This section illustrates the method to compute the tier area and the number of gates in a tier, based on the type of bonding. For simplicity a two-tier 3DIC is considered, but this method is applicable to a 3DIC with any number of tiers. Assume that there are a total of $N_{Total}$ gates in the circuit.

F2B Bonded 3DIC

In a F2B bonded 3DIC, TSVs occupy active area only in the tier that is bonded on its back side. Therefore the total area of the two tiers combined is $N_{Total} + n t^2$. Hence,

$$A_{tier} = (N_{Total} + n t^2)/2 \tag{5.30}$$

Assuming a symmetric footprint for both tiers, for the tier with TSVs (back-bonded), there are $N_{Back}$ gates given by

$$N_{Back} = (N_{Total} - n t^2)/2 \tag{5.31}$$

For the tier without TSVs (face bonded), the entire active area is occupied by gates.
The number of gates in the face tier, $N_{Face}$, is given by:

$$N_{Face} = (N_{Total} + n t^2)/2 \tag{5.32}$$

As the number of TSVs or the size of the TSV is increased, more of the active area of the back tier will be occupied by TSVs if a symmetric footprint is assumed for both tiers. In such a scenario, due to design rules for minimum TSV pitch, a symmetric footprint may not be possible. In this case an asymmetric footprint for the tiers is inevitable, unless $n$ or $t$ is reduced (or both). For the third experiment presented in the results section (Section 5.6), such a scenario is addressed by distributing gates equally among the two tiers to guarantee a meaningful pitch between TSVs.

B2B Bonded 3DIC

In a B2B bonded 3DIC, TSVs occupy active area in both tiers, making the total aggregate area of the two tiers $N_{Total} + 2 n t^2$. Therefore,

$$A_{tier} = (N_{Total})/2 + n t^2 \tag{5.33}$$

In both tiers there are $N$ gates, which is given by:

$$N = N_{Total}/2 \tag{5.34}$$

Using the above expressions for tier area and number of gates in a tier, $I_i(l)$ can be estimated for all the tiers. Then Equation 5.3 is used to compute the 3DIC average wire length.

5.5 Upper Bound

Based on the number of tiers and the type of bonding, the area and number of gates in the $i$-th tier are computed. Then $I_i(l)$ can be estimated for each tier using TSV dimensions based on the target 3DIC fabrication technology. The average wire length is then computed for $n$ values swept over a large range to find the value of $n$ where the 3DIC average wire length is more than that of a corresponding 2DIC, indicating that an upper bound has been reached. The average wire length for a 2DIC is estimated using the techniques presented in [84]. This upper bound on the number of TSVs will be a useful parameter for the logic partitioning framework presented in Chapter 6.

5.6 Experiments and Results

Four different experiments were conducted to demonstrate the utility of the proposed model. For the first three experiments, a 2-tier 3DIC configuration is shown to better understand the results and for brevity.
The fourth experiment presents results for a four-tier 3DIC. For simplicity, all the tiers in this experiment are assumed to be F2B bonded. However, the proposed model is applicable to a 3DIC with any number of tiers, and for any bonding configuration. A vast number of configurations are possible for four or more tiers. For example, a 4-tier 3DIC can be configured as (i) 4 tiers bonded F2B, (ii) two pairs of tiers bonded F2F, then the pairs bonded B2B, etc. The height of the TSVs is set to 20 and 40 gate pitches for F2B and B2B bonding, respectively.

In the first experiment, three configurations are considered with 1, 10, and 100 million gates. The Rent's parameters were set as $k = 4$ and $p = 0.75$, and an average fanout of $fo = 3$ was used. These values were used to draw comparisons with prior work [84]. The three configurations were analyzed for both bonding techniques and three different TSV sizes ($t$). For all 18 combinations of system parameters, the upper bound on TSVs was computed (rounded to the nearest 100). Tables 5.1, 5.2, and 5.3 show the results obtained.

Table 5.1: First experiment, configuration-1 results
$N_{Total}$: number of gates; $n$: number of TSVs; $t$: $\sqrt{\text{TSV area}}$; $N_{TSV}$: TSV upper bound; AWL: average wire length; % Better: percentage improvement in AWL vs. 2DIC

                  N_Total = 1e+6    For n = 1e+4
    Bonding  t    N_TSV             AWL      % Better   AWL Different-p
    F2B      2    71300             14.96    13.17      14.51
    B2B      2    36900             15.31    11.17      14.83
    F2B      3    40700             15.29    11.25      14.85
    B2B      3    19600             15.97    7.33       15.47
    F2B      4    25600             15.73    8.72       15.31
    B2B      4    11800             16.85    2.23       16.32

Table 5.2: First experiment, configuration-2 results
$N_{Total}$: number of gates; $n$: number of TSVs; $t$: $\sqrt{\text{TSV area}}$; $N_{TSV}$: TSV upper bound; AWL: average wire length; % Better: percentage improvement in AWL vs. 2DIC

                  N_Total = 1e+7    For n = 1e+5
    Bonding  t    N_TSV             AWL      % Better   AWL Different-p
    F2B      2    788700            26.36    13.44      25.34
    B2B      2    406000            26.92    11.61      25.84
    F2B      3    430300            26.95    11.50      25.95
    B2B      3    205200            28.10    7.74       26.97
    F2B      4    264700            27.73    8.94       26.77
    B2B      4    121400            29.67    2.58       28.48
The tables also show the average wire length (AWL) obtained for 10000, 100000, and 1 million TSVs, respectively, for the three configurations. The column % Better shows the percentage improvement in average wire length compared to a corresponding 2DIC implementation. As expected, F2B achieved better results than B2B, since less active silicon area is affected by TSVs in F2B bonded tiers (1.9 to 2.2 times higher upper bound among the experimental configurations).

The second experiment demonstrates the importance of different $p$ values for each tier. The column AWL Different-p in Tables 5.1, 5.2, and 5.3 shows the average wire length when the Rent's parameter $p$ for one of the tiers is changed to 0.74 (reduced by 0.01). In F2B the reduced $p$ is used for the tier with TSVs. The average wire length is reduced by up to 5% (and on average 4%) when compared to the results from the first experiment. Clearly, any slight difference in $p$ for the tiers can significantly affect the average wire length of a 3DIC.

Table 5.3: First experiment, configuration-3 results
$N_{Total}$: number of gates; $n$: number of TSVs; $t$: $\sqrt{\text{TSV area}}$; $N_{TSV}$: TSV upper bound; AWL: average wire length; % Better: percentage improvement in AWL vs. 2DIC

                  N_Total = 1e+8    For n = 1e+6
    Bonding  t    N_TSV             AWL      % Better   AWL Different-p
    F2B      2    8380800           46.63    13.59      44.41
    B2B      2    4296100           47.57    11.85      45.23
    F2B      3    4450400           47.68    11.64      45.51
    B2B      3    2109700           49.67    7.96       47.22
    F2B      4    2699500           49.07    9.07       46.98
    B2B      4    1232300           52.47    2.78       49.88

Figure 5.5: Graph showing degrading 3DIC AWL as TSV size increases

The third experiment illustrates the sensitivity to TSV size. The baseline configuration for the analysis was: 20 million gates, 400 thousand TSVs, $k = 4$, $p = 0.8$, and $fo = 3$. The TSV size $t$ was swept from 1 to 5. Figure 5.5 shows the average wire length versus TSV size. The steep fall in performance compared to 2DIC as $t$ increases indicates that TSV size becomes extremely crucial in circuits with large numbers of interconnects. The fall is steeper for $p = 0.8$ than for $p = 0.75$ in the first experiment (see Tables 5.1, 5.2, and 5.3).

The fourth experiment illustrates the applicability of the proposed model to a four-tier 3DIC, an example of a 3DIC with more than two tiers. The baseline configuration is the same as that of the third experiment. The TSV size $t$ was swept from 2 to 10, in steps of 2. Table 5.4 shows the average wire length and the comparison to the baseline 2DIC. As mentioned above, all tiers are F2B bonded and the TSVs are assumed to be distributed equally between the tiers (one-third to each adjacent pair). Compared to the F2B bonded two-tier case in the third experiment, a four-tier 3DIC with the same parameters is more tolerant to TSV size, as the logic gates are spread over a smaller footprint in the four-tier case than in a corresponding two-tier 3DIC.

Table 5.4: Experiment results for a four-tier F2B bonded 3DIC
$N_{Total}$: number of gates; $n$: number of TSVs; $t$: $\sqrt{\text{TSV area}}$; AWL: average wire length; % Better: percentage improvement in AWL vs. 2DIC

    N_Total = 2e+7, for n = 4e+5
    Bonding  t     AWL      % Better
    F2B      2     41.65    29.10
    F2B      4     45.72    22.17
    F2B      6     51.59    12.18
    F2B      8     58.58    0.28
    F2B      10    66.25    -12.78

Given that 3DIC is an emerging technology, data from real 3DIC implementations is not available. Hence, comparisons of the model to real-world implementations have not yet been performed. However, good fidelity between the model and measured results is expected, given that the model is derived from a proven 2DIC model [84].

5.7 Conclusions

A tier-level hierarchical approach based on Rent's rule is introduced for estimating 3DIC average wire length. The discrete wire length distribution of each tier is estimated independently to provide the ability to handle tiers with different Rent's parameters. Additionally, the proposed method is applicable to variable TSV dimensions and to both face-to-back and back-to-back bonding techniques. Moreover, the hierarchical model allows the flexibility of having unequal numbers of vertical interconnects between different pairs of adjacent tiers. Using the proposed model, an important modeling factor, the upper bound on the number of TSVs, is attained.
This bound indicates the maximum number of TSVs possible in the design where the resulting average wire length is no greater than that of a corresponding 2DIC. Several experiments have successfully demonstrated the application of the proposed model. Results show that a slight difference in the Rent's exponent (about 0.01) between tiers and different TSV sizes have significant effects on 3DIC average wire length. Additionally, as expected, the F2B bonding method resulted in a higher upper bound (1.9x to 2.2x) than B2B bonding for a two-tier 3DIC.

The upper bound on 3D-vias, i.e., on bondpoints or TSVs between interconnecting tiers, serves as an important constraint while partitioning the logic across tiers in a 3DIC. Unlike in 2DIC, where min-cut is often used for logic partitioning to reduce communication between the partitions, min-cut is not necessarily optimum in a 3DIC, and the communication between the 3DIC partitions is limited by an upper bound. If the Rent's parameters of a design are available, the upper bound achieved using the proposed average wire length estimation model can be integrated into the formal framework for 3DIC logic partitioning presented in the following chapter.

Chapter 6
3DIC Logic Partitioning Framework

6.1 Introduction

The benefits offered by 3DIC over conventional 2DIC technology have captured widespread interest in the IC research community. There are two types of philosophies for 3DIC design: one which considers 3DIC just as an implementation platform, and another which aims to make the most of the third dimension by specifically exploiting 3DIC characteristics during the design development. In the former, partitioning is executed after logic synthesis and during place and route stages using automated algorithms. In design for 3D, logic in each tier is determined by design, and logic blocks are not moved across tiers in later stages of the design flow.
As a result, the majority of research related to 3DIC logic partitioning uses an automated procedure that simply partitions an existing 2DIC design in the back-end after logic synthesis [19], or provides a 3DIC design of a particular functional unit by splitting the unit across multiple tiers (design for 3D) [23,25,27–29]. In contrast, this chapter presents a formal framework for logic partitioning in 3DIC, which can adapt to various partitioning granularities and work with limited amounts of information to enable efficient partitioning even at initial stages of the chip design. This is achieved by modeling the design as a data-flow graph with nodes representing logic blocks, and edges representing communication between the logic blocks. The framework optimizes the 3DIC design for reduced communication between the logic nodes. Multiple optimization algorithms are introduced to fine-tune the resultant 3DIC based on the design characteristics available in the framework. These algorithms constrain the area of a tier and bound the interconnects between tiers, to further optimize the 3DIC.

The remainder of the chapter is organized as follows. Section 6.2 presents the motivation for the proposed research, followed by Section 6.3, which discusses the data-flow model used in the framework. Section 6.4 presents the optimization methodology. Sections 6.5 and 6.6 demonstrate application of the framework to a floating-point unit (FPU) and a processor, respectively. Finally, Section 6.7 concludes the chapter.

6.2 Motivation

In logic-on-logic stacked 3DIC, the partitioning granularity has a significant effect on the performance, cost, and design time of the chip [9]. There are many ways in which a design can be partitioned, ranging from coarse- to fine-grained, depending on the granularity of the blocks that are to be stacked: core-level, functional-unit blocks (FUBs), logic gates, and transistors [9]. Stacking FUBs across multiple tiers reduces inter-block communication length, while partitioning a FUB across multiple tiers may also reduce intra-block communication length, as noted in [52].
Much work has been done to demonstrate the usefulness of logic-on-logic stacked 3DICs. The authors in [23,25,27] showed the benefits of splitting microprocessor functional components like a register file, instruction scheduler, etc., across multiple tiers. Additionally, the authors in [28,29] presented the improvements in 3DIC designs of TCAM, FIFO and FFT functional blocks. Similarly, most of the research related to logic partitioning provides insights into potential benefits of designing a FUB for 3DIC implementation, but there is no formal model with the ability to partition any given circuit at an architectural/functional level to achieve an efficient 3DIC design.

The aim of this work is to build a partitioning framework at the highest level of the design cycle, which can partition a design at any granularity, and result in an efficient 3DIC implementation even with limited information about the design. Therefore, a data-flow model is developed which uses a graph representation of the circuit to achieve a 3DIC design that is optimized for reduced communication length. However, with the addition of more information like the area of each logic node, the number of interconnects between logic nodes, technology related parameters, etc., additional objectives like minimizing footprint and bounding 3D-vias can be included in the advanced algorithms of the framework. The following section discusses the graph representation used in the framework, followed by the details of the proposed framework.

6.3 Data-Flow Model

A data-flow graph model is used in the framework to represent a design, which aids in logic partitioning for building a 3DIC. First the graph representation is described, followed by various characteristics of the data-flow model.

6.3.1 Graph Representation

A circuit can be represented as a graph with nodes symbolizing logic blocks and edges representing the communication between the logic blocks. This graph has some reference nodes: start and end nodes, which are typically inputs and outputs of the circuit.
The logic blocks or nodes can be realized at various granularities: logic gates, computational blocks (like adder, multiplier, etc.), pipeline stages, or building blocks of a hierarchical design. This graph representation is used to extract basic, but important characteristics of the design, which helps in formulating a framework for logic partitioning. The graph representation is explained below with some examples, followed by the data-flow model in Section 6.3.2.

Figure 6.1: Graph representation of radix-2 butterfly (FFT) structure

Figure 6.2: Graph representation of radix-2 butterfly (FFT) structure at lower granularity than that shown in Figure 6.1

To illustrate the impact of differing granularities, Figures 6.1 and 6.2 show the graph representations of a radix-2 butterfly structure (4-point FFT) at different granularities. In all the graphs, the solid circles represent reference nodes: inputs (light-gray) and outputs (dark-gray) of the design. An edge between two nodes implies communication between the two stages. In Figure 6.1, each node (other than inputs and outputs) represents a small group of functional units (multipliers and adders, in this case), while the graph in Figure 6.2 represents the butterfly structure at a lower granularity, where each node is a single arithmetic operation unit: multiplier or adder. Clearly, for a target design the graph representation should be at the granularity level intended for partitioning for building a 3DIC.

Figure 6.3: Graph representation of FPU design, at pipeline stage granularity

Another example in Figure 6.3 shows the graph representation of a single-precision FPU (presented in Section 3.2). This graph is derived at a different granularity than in the FFT examples shown above: each node represents a pipeline stage of the FPU, which is a five-stage pipelined design with stages 4 and 5 interconnecting with stage 2. In the figure the labels within a node indicate the pipeline stage they are representing (for example, stage 3 is indicated by S3).
Such a graph representation of a design is very easy to develop and serves as a powerful tool for the logic partitioning framework. The model detailed in the following section leverages these circuit representation concepts.

6.3.2 Ranks and Degrees

The aim of the proposed framework is to give the designer the ability to explore logic partitioning at the highest level (initial stages) of the design cycle to build an efficient 3DIC. To achieve this, the circuit model should be able to provide an analysis even with a limited amount of information about the circuit, i.e., without lower-level data like area, power, etc. Usually such information is not available at the initial stage of the design cycle. As a data-flow graph can capture the high-level architecture, the graph representation explained in the previous section is well-suited for the framework. The high-level characteristics of a design are modeled using ranks and degrees, which are used to optimize the 3DIC for reduced communication. The terms rank and degree are defined as follows:

1. The rank of a node is defined as the number of edges in the longest path from the node to any start node, typically the input node. The longest path is the one with the maximum number of edges among all the possible paths from the node to any start node. Global input signals like clock, reset, enable, etc., are not included in the set of start nodes.

2. The communication degree of an edge is the difference between the ranks of its nodes.

Imagine the start nodes to be at the top of a tree structure, where the rank of each processing node represents the depth of the node from the top. The node rank indicates where the corresponding logic block might be placed relative to the start nodes, for example, input pads in a chip. This modeling is based on the observation that the cell placement in chips is guided by the locations of I/O pins. The higher the rank of a node, the higher the probability of it being farther from the I/O pins.
Hence, ranks give relative positioning of logic nodes from the reference start nodes, and the difference in ranks between two logic nodes gives the possible distance between them, which is captured by degrees. In essence, the ranks identify relative positions of logic nodes with respect to reference nodes, i.e., inputs, and edge degrees are proportional to the communication length between the nodes they connect.

6.3.3 Design Characteristics

The ranks of nodes and degrees of edges in the data-flow model are the minimum required characteristics of a design to build a 3DIC using the proposed framework. However, additional design characteristics, presented below, are identified that enable further optimization of the 3DIC design.

1. Area: The area of each node can be fairly accurately modeled by synthesizing the RTL code of each node. If a placed-and-routed version of the node is available, an even more accurate area projection can be provided. If the HDL representation of the design contains a hierarchy, preferably similar to the partitioning granularity, the area can be easily estimated using a synthesis tool such as Synopsys Design Compiler (DC). In DC, the command report_area -hierarchy reports the area of each individual instantiation in the design at all levels of the hierarchy.

2. Bounds on 3D-vias: The number of 3D-vias between two interconnecting tiers is an important constraint on partitioning, especially when the 3D-vias are realized using TSVs. TSVs are several times bigger than a typical standard cell and in aggregate occupy a significant portion of the active area. Hence they are available in limited numbers, and the upper bound on TSVs is an important parameter for partitioning. The upper bound on TSVs can be obtained using the model presented in Chapter 5. However, sometimes manufacturing limitations may enforce a different bound than that given by the theoretical models.

3. Interconnect count: Interconnect count refers to the number of interconnects between two communicating nodes.
In the data-flow model, irrespective of the number of signals between two nodes, their communication is represented only by a single edge. With the addition of interconnect count, each edge in the graph will have an additional weight representing the number of signals along with the degree of communication. The interconnect count can be obtained by parsing the module port connections in the HDL code implementations of the nodes.

4. Power: Given leakage power is directly proportional to cell area, the leakage component of power dissipation of each node can be modeled using the cell area estimate. To obtain an accurate estimate of dynamic power consumption, the post-synthesis netlist of a node is simulated using a realistic workload, and the value change dump of the simulation is captured. Using these actual signal transitions the dynamic power can be estimated using industry standard tools like Synopsys PrimeTime-PX.

5. Activity factor: Similar to power, the activity factor of each node can be obtained by simulating the post-synthesis netlist using a target workload.

6. Technology parameters: These are the parameters that are obtained directly from the 3DIC technology used for the chip fabrication, which include the height of a TSV, maximum number of tiers possible, type of bonding used (face-to-face, face-to-back, back-to-back), etc. These parameters help in accurately modeling the communication degree of 3D-vias (i.e., an edge between two tiers).

6.4 Optimization

The intuition is that if two communicating nodes are far away from each other, moving one of them along the third dimension could reduce the wire length required for the signal interconnects. Hence, when two communicating nodes are placed in different tiers, the degree of the edge connecting them is changed to the number of tiers between them. So, the optimization problem for logic partitioning is to embed the nodes in a 3D structure (multiple tiers) such that the sum of the degrees of all edges is minimized, for a given number of tiers.
Let the data-flow graph be given by $G$:

$$G = (V, E) \tag{6.1}$$

where $V$ is the set of $N$ nodes, and $E$ is the set of $M$ edges:

$$V = \{n_1, n_2, \ldots, n_N\} \tag{6.2}$$

$$E = \{e_1, e_2, \ldots, e_M\} \tag{6.3}$$

Each edge $e_i$ is a connection between two nodes in $V$. Let functions $R(n_i)$ and $D(e_i)$ represent the rank of a node and degree of an edge, respectively. Therefore, the optimization problem is expressed as:

$$\text{Minimize} \left[ \sum_{e_i \in E_{3D}} D(e_i) \right] \tag{6.4}$$

where $E_{3D}$ is the set of edges in the 3D embedding of the data-flow graph, which is obtained by moving the logic nodes across the tiers.

First, the number of tiers in the target 3DIC is set. However, the number of tiers can be swept across a range to obtain the optimum number of tiers. Once the number of tiers is set, nodes are moved across tiers to explore the possible design alternatives for the 3DIC. Each time a node is moved to a different tier, the ranks of nodes are recomputed. Each inter-tier edge (edge connecting nodes in different tiers) represents an inter-tier signal. These signals are either inputs or outputs to the respective tiers; hence, they become reference nodes. The ranks in each tier will be computed with respect to these inter-tier start nodes. For the purpose of computing ranks the data-flow graph also includes the direction of the communication, i.e., a directed graph is used. Ranks of nodes are computed by recursively back-tracking to the input nodes to find the longest path from a node to any input. Once the ranks are evaluated, the degrees are obtained for all edges. All chip start and end nodes are confined to tier-0 (or the base tier). The start nodes have a rank of zero, so every edge that connects a start node and a logic node has a degree equal to the logic node's rank. All inter-tier edges have a degree equal to the difference in tier number between the nodes they connect. The ranks of end nodes are computed with respect to start nodes in a similar fashion to that of logic nodes, but end nodes are fixed to tier-0.
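The rank and degree rules described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the framework's actual implementation: the function name `ranks_and_cost` is hypothetical, feedback edges are passed separately so that the rank recursion terminates (they are excluded from ranking but still contribute a degree), and the degree of an inter-tier edge is taken as the absolute difference in tier index. The example graph is an assumed five-stage pipeline with feedback from stages 4 and 5 to stage 2.

```python
from functools import lru_cache

def ranks_and_cost(forward_edges, feedback_edges, tier, start_nodes):
    """Return ({node: rank}, total degree) for a given tier assignment.

    Assumption: forward_edges form an acyclic directed graph; feedback
    edges are kept out of the rank recursion but are counted in the cost.
    """
    preds = {}
    for u, v in forward_edges:
        preds.setdefault(v, []).append(u)

    @lru_cache(maxsize=None)
    def rank(n):
        if n in start_nodes:
            return 0
        best = 0
        for p in preds.get(n, []):
            # An input arriving from another tier acts as a rank-0
            # reference (start) node of the receiving tier.
            src = rank(p) if tier[p] == tier[n] else 0
            best = max(best, src + 1)
        return best

    def degree(u, v):
        if tier[u] != tier[v]:
            return abs(tier[u] - tier[v])   # inter-tier edge
        return abs(rank(u) - rank(v))       # intra-tier edge

    edges = list(forward_edges) + list(feedback_edges)
    nodes = {n for e in edges for n in e}
    cost = sum(degree(u, v) for u, v in edges)
    return {n: rank(n) for n in nodes}, cost

# Assumed example: five pipeline stages, feedback from S4 and S5 to S2.
forward = [("In", "S1"), ("S1", "S2"), ("S2", "S3"),
           ("S3", "S4"), ("S4", "S5"), ("S5", "Out")]
feedback = [("S4", "S2"), ("S5", "S2")]
starts = frozenset({"In"})
flat = {n: 0 for n in ("In", "S1", "S2", "S3", "S4", "S5", "Out")}
stacked = dict(flat, S2=1, S3=1)            # S2 and S3 moved to tier-1

_, cost_2d = ranks_and_cost(forward, feedback, flat, starts)
_, cost_3d = ranks_and_cost(forward, feedback, stacked, starts)
print(cost_2d, cost_3d)                     # 11 8
```

With every node in one tier the two feedback edges contribute degrees of 2 and 3, giving a total of 11; once S2 and S3 are placed in a second tier, every one of the eight edges has a degree of one.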
The cost, $C$, of a configuration is obtained by adding the degrees of all edges:

$$C = \left[ \sum_{e_i \in E} D(e_i) \right] \tag{6.5}$$

The above procedure can be applied by a designer manually (depending on the number of nodes and edges in the design) or it can be automated using a simulated-annealing algorithm, which is a well-known heuristic for solving such combinatorial optimization problems [87]. The basic procedure is to accept all moves that reduce the cost, i.e., the sum of the degrees of all the edges. Moves that increase the cost are accepted with a probability that falls as the increase in cost grows. The acceptance probability for the moves which increase cost is controlled by a temperature parameter $T$. The higher the $T$ value, the greater the acceptance probability, which is given by $\exp(-\Delta C / T)$, where $\Delta C$ is the cost increase incurred by the new move. Initially, $T$ is set to a high value to accept most of the moves, and then gradually decreased to reduce the chances of accepting a move that results in higher cost. Finally, $T$ is reduced to a very low value, allowing only moves that result in lower cost to be accepted, and thus converging to a solution that results in lower cost.

Algorithm 1 Simulated-Annealing Procedure
Input: G - Data-flow graph of the design
Input: M - Number of moves to attempt
Output: G_3D - Data-flow graph of the 3DIC design
Initialize: loopCount, T
Generate random configuration: G_3D
Evaluate cost: C = Σ_{e_i ∈ E_3D} D(e_i)
while (TERMINATE(loopCount, M, C) == FALSE) do
    Make a move (perturbation)
    Evaluate change in cost: ΔC
    if (ΔC < 0) then
        Accept the new configuration
        Update G_3D and C
    else
        Accept with a probability of e^(−ΔC/T)
        Update G_3D and C, if accepted
    end if
    loopCount = loopCount + 1
    SCHEDULE(T, loopCount)
end while
return G_3D

In Algorithm 1, the function TERMINATE will control the termination of the procedure based on the loop count and the cost of the current configuration. If the loopCount reaches the preset maximum moves $M$ or the current cost $C$ is the minimum possible cost, the procedure will be terminated.
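Algorithm 1 can be sketched in Python as follows. This is a minimal illustration rather than the author's implementation: the function name `anneal`, the geometric cooling schedule, the default parameter values, and the single-node move are assumptions chosen here. `cost_fn` is any caller-supplied function returning the total edge degree of a tier assignment, and nodes pinned to tier-0 (start and end nodes) are assumed to be handled inside `cost_fn` rather than listed in `nodes`. The toy cost function uses fixed intra-tier degrees purely to keep the example short; the model in the text recomputes ranks after every move.

```python
import math
import random

def anneal(nodes, num_tiers, cost_fn, max_moves, min_cost,
           t0=10.0, alpha=0.99, seed=0):
    """Simulated-annealing sketch; requires num_tiers >= 2."""
    rng = random.Random(seed)
    config = {n: rng.randrange(num_tiers) for n in nodes}  # random start
    cost = cost_fn(config)
    temp = t0
    for _ in range(max_moves):
        if cost <= min_cost:                # TERMINATE: minimum reached
            break
        node = rng.choice(nodes)            # perturbation: move one node
        old_tier = config[node]
        config[node] = rng.choice(
            [t for t in range(num_tiers) if t != old_tier])
        delta = cost_fn(config) - cost
        # Always accept improvements; accept a worse (or equal) move
        # with probability exp(-delta / temp).
        if delta < 0 or rng.random() < math.exp(-delta / temp):
            cost += delta
        else:
            config[node] = old_tier         # reject: undo the move
        temp = max(temp * alpha, 1e-9)      # SCHEDULE: geometric cooling
    return config, cost

# Toy cost (assumption): fixed intra-tier degrees, inter-tier degree
# equal to the difference in tier index.
toy_edges = {("a", "b"): 1, ("b", "c"): 1, ("c", "a"): 2}

def toy_cost(cfg):
    return sum(abs(cfg[u] - cfg[v]) if cfg[u] != cfg[v] else d
               for (u, v), d in toy_edges.items())

config, cost = anneal(["a", "b", "c"], 2, toy_cost,
                      max_moves=200, min_cost=len(toy_edges))
print(config, cost)
```

Passing the number of edges as `min_cost` mirrors the termination condition in the text; a looser target cost can be passed instead when the theoretical minimum is not reachable.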
The minimum possible cost of the system is equal to the number of edges in the graph, i.e., the degree of every edge is equal to one. However, it may not be possible to achieve the theoretically possible minimum cost for all designs; hence the termination condition can be set to a target cost relative to the initial cost of the design (for example, a 30% reduction compared to the initial graph). The SCHEDULE function gradually decreases the temperature $T$ based on the loopCount to reduce the probability of accepting a move that results in increased cost.

The automated procedure presented above is well-suited for logic partitioning at lower granularity (or partitioning in the back-end), but when the framework is used at a higher granularity the number of nodes is significantly smaller. In such a scenario a brute-force approach may be used to obtain an optimum solution (say, for up to 20 nodes). The two demonstration designs presented in Sections 6.5 and 6.6 have 5 and 11 nodes, respectively. Hence, at the architectural level where the number of nodes is small, a designer can use the proposed framework to design a 3DIC without the help of automation algorithms. Now that a basic algorithm is presented, the following sections present advanced algorithms of the framework that use the additional characteristics of a design to further optimize the logic partitioning.

6.4.1 Area-Constrained Optimization

In area-constrained optimization, the user will set the maximum allowed area of a tier. For example, in the case of a two-tier 3DIC, a designer may determine that a tier can have a maximum of 60% of total logic in terms of area. The area of each node is required to enable area-constrained optimization. The area of a node can be an absolute or normalized value obtained from the synthesis of a design, or estimated for each node based on the properties of each node. Algorithm 2 shows the procedure for area-constrained optimization. In Algorithm 2, $G_{3D}$ is initialized to a random configuration.
Compared to Algorithm 1 in the previous section, Algorithm 2 differs in the generation of random perturbations. In the latter a guided perturbation is used, where the further the area of a tier exceeds the user-defined maximum area for a tier, the higher the probability of moving a node from that tier. Using such a guided perturbation will eventually bring the areas of the tiers under the user-defined maximum, after which every tier will have an equal chance in the perturbation step. Once a new configuration is obtained, the rest of the algorithm functions similarly to Algorithm 1.

Algorithm 2 Area-Constrained Procedure
Input: G - Data-flow graph of the design
Input: M - Number of moves to attempt
Input: A_MAX - Maximum allowed area per tier
Output: G_3D - Data-flow graph of the 3DIC design
Initialize: loopCount, T
Generate random configuration: G_3D
Evaluate cost: C = Σ_{e_i ∈ E_3D} D(e_i)
while (TERMINATE(loopCount, M, C) == FALSE) do
    GUIDED_PERTURBATION(A_MAX)
    Evaluate change in cost: ΔC
    if (ΔC < 0) then
        Accept the new configuration
        Update G_3D and C
    else
        Accept with a probability of e^(−ΔC/T)
        Update G_3D and C, if accepted
    end if
    loopCount = loopCount + 1
    SCHEDULE(T, loopCount)
end while
return G_3D

6.4.2 3D-vias Bounded Procedure

Similar to area, a user can set limits on the number of interconnects between tiers in a 3DIC.
If the user has Rent's rule parameters (Rent's exponent and coefficient) [33] for the design, the upper bound on the number of TSVs can be estimated using the 3DIC average wire length model presented in Chapter 5, which also uses the knowledge of the 3DIC technology for estimation. Additionally, the number of interconnects between the connected nodes is also required for this enhancement.

Algorithm 3 3D-via Bounded Optimization
Input: G - Data-flow graph of the design
Input: M - Number of moves to attempt
Input: N_MAX - Upper bound on 3D-vias
Output: G_3D - Data-flow graph of the 3DIC design
Initialize: loopCount, T
Generate random configuration: G_3D
Evaluate cost: C = Σ_{e_i ∈ E_3D} D(e_i)
while (TERMINATE(loopCount, M, C) == FALSE) do
    Make a move (perturbation)
    Compute n: 3D-vias in new configuration
    if (n > N_MAX) then
        Discard new configuration
    else
        Evaluate change in cost: ΔC
        if (ΔC < 0) then
            Accept the new configuration
            Update G_3D and C
        else
            Accept with a probability of e^(−ΔC/T)
            Update G_3D and C, if accepted
        end if
    end if
    loopCount = loopCount + 1
    SCHEDULE(T, loopCount)
end while
return G_3D

To bound the interconnects between two tiers, a different approach than area-constrained optimization is used. In the area-constrained case, moving a node from the tier with higher area will reduce its area and increase the area of the other tier, and thus eventually balance the tiers in terms of area. In contrast, moving a node with a higher number of 3D-vias from one tier to another cannot guarantee a reduction of the 3D-via count, as the count depends on the number of interconnects the node has with other nodes in the tier. Hence, in a 3D-via bounded optimization, a move is discarded if the number of 3D-vias in the new configuration is higher than the user-defined upper bound. Algorithm 3 shows the 3D-via bounded procedure, where $N_{MAX}$ represents the upper bound on the number of 3D-vias. The algorithm assumes a global upper bound, i.e., the total number of 3D-vias in the entire 3DIC cannot exceed $N_{MAX}$.
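The discard test of the 3D-via bounded procedure can be sketched as follows. This is a minimal sketch under assumptions that are not stated in the text: the names `count_3d_vias` and `move_allowed` are hypothetical, interconnect counts are supplied as a dictionary keyed by node pairs, and a signal that spans several tiers is assumed to consume one 3D-via per tier boundary it crosses. The node names and signal counts are made up for illustration.

```python
def count_3d_vias(signal_counts, tier):
    """Total 3D-vias for a tier assignment.

    signal_counts: {(u, v): number of signals between nodes u and v}
    tier:          {node: tier index}
    Assumption: one via per signal per tier boundary crossed.
    """
    return sum(n * abs(tier[u] - tier[v])
               for (u, v), n in signal_counts.items())

def move_allowed(signal_counts, tier, node, new_tier, n_max):
    """True if moving `node` to `new_tier` keeps the global bound."""
    trial = dict(tier)          # leave the caller's assignment untouched
    trial[node] = new_tier
    return count_3d_vias(signal_counts, trial) <= n_max

# Hypothetical three-node design, all nodes initially in tier-0.
signals = {("A", "B"): 64, ("B", "C"): 32, ("A", "C"): 8}
tiers = {"A": 0, "B": 0, "C": 0}

print(move_allowed(signals, tiers, "C", 1, n_max=50))  # True  (40 vias)
print(move_allowed(signals, tiers, "B", 1, n_max=50))  # False (96 vias)
```

In an annealing loop, a move for which `move_allowed` returns False would be undone before the cost change is evaluated.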
However, each pair of adjacent tiers can have its own bound, and the comparison will be done accordingly to accept or discard a move.

6.4.3 Interconnect Count

As mentioned in Section 6.3.3, the knowledge of interconnects between nodes can enhance the data-flow model of a design. Let $w(e_i)$ be the number of interconnects between the nodes of edge $e_i$. The impact of interconnect count is incorporated by modifying the cost function as:

$$C_{NEW} = \left[ \sum_{e_i \in E} w(e_i)\, D(e_i) \right] \tag{6.6}$$

$C_{NEW}$ is the new cost function, which multiplies the degree of communication between the nodes with the number of interconnects between them. Hence, the cost of an edge in the graph is not only proportional to the degree (i.e., length of the wire) but also to the number of interconnects (i.e., number of wires) between its nodes.

Apart from the above enhancements, the 3DIC design can be optimized for thermal activity to better manage heat dissipation in the internal tiers of the 3DIC design. This can be achieved by using the dynamic power dissipation and signal-transition activity of the nodes. Such an optimization will aim to place high-activity nodes in tiers closer to the heat sink and avoid placing two nodes that dissipate high dynamic power in adjacent tiers to reduce the probability of hot-spots.

6.4.4 Computation Reduction Techniques

The number of computations (or moves) required to achieve a near-optimum solution could be very high, especially when the graph is represented at a lower granularity (many nodes and edges). Hence the following techniques are proposed that help in reducing the computational complexity of the simulated-annealing based procedures.

Node Grouping

The idea here is to group nodes to reduce the number of nodes and edges in the data-flow graph, so that whenever a new configuration is obtained by moving nodes, all the nodes in the group are moved collectively. Once the nodes are grouped, the group acts as a single node; the ranks and degrees of the graph should be computed accordingly.
Nodes are grouped before applying the optimization procedure and will remain grouped throughout the process. Good candidates for node grouping are nodes for which all edges between them have unit degree, i.e., closely located nodes.

Partitioning at Sequential Elements

Another way to reduce the computation is to restrict the locations at which the design can be partitioned. Here the partitioning is restricted to edges for which at least one of the vertices is a sequential element. For example, if an edge connects two nodes which are purely combinational logic, it is marked as “not-suitable” to be an inter-tier edge. This constraint is inspired by partitioning principles presented in Chapter 4, where it was proposed that partitioning the design around sequential elements is advantageous over partitioning between combinational logic. It is better to have each signal between the tiers either sourced or sunk by a sequential unit in one of the tiers, rather than having both ends as part of a combinational logic cone. As a result, the number of edges where the design can be partitioned is reduced, resulting in less computation.

The framework along with computation reduction techniques can be used to explore partitioning alternatives for a 3DIC design. The demonstrations presented in the following sections show how the framework can be used by designers as a guideline for partitioning when working with designs that are represented by relatively small numbers of nodes. For larger designs, those with many nodes and edges in the data-flow graph model, the simulated-annealing procedure will automate the optimization process.

6.5 3DIC Floating-Point Unit

To better understand the proposed framework, its application is demonstrated using an existing single-precision FPU that has been formerly implemented on a number of chips, dating back to the Data-Intensive Architecture (DIVA) processing-in-memory chips.

Figure 6.4: 5-Stage pipelined single-precision FPU
It has a 5-stage non-linear pipelined architecture with a single 2-stage pipelined multiplier, which is used to execute all multiplication operations, including those required in the powering units [69].

Figure 6.4 presents the block diagram of the FPU. For all instructions other than division, processing occurs in a linear pipeline fashion. Thus, the latency is 5 cycles, and an instruction can be issued at every cycle. The division operation is executed in a non-linear pipeline fashion with feedback from stages 4 and 5 to stage 2, and has a latency of 12 cycles. An instruction that follows a division can be issued after either 5 or 8 cycles, depending on the instruction type.

6.5.1 Application of the Framework

The data-flow graph shown in Figure 6.3 represents the FPU design. Figure 6.5 shows the same graph with ranks (inside nodes) and degrees (on edges). When this graph is optimized for two tiers, one of the possible solutions is shown in Figure 6.6. Stages 2 and 3 are in tier-1 and the rest of the stages are in tier-0. The total degree of communication of the design is reduced from 11 to 8. As there are eight edges in the graph, the total degree of 8 is the minimum cost possible; hence there is no need to explore design partitioning for more than two tiers. The partitioning shown in Figure 6.6 was used for a 3DIC implementation of the design.

Figure 6.5: FPU data-flow model with ranks and degrees

Figure 6.6: FPU graph optimized for 3DIC design

6.5.2 3DIC Implementation

The FPU design RTL code is manually partitioned into two tiers according to the configuration obtained from application of the model (shown in Figure 6.6). Each tier is synthesized independently using Synopsys Design Compiler, targeting the design to the Tezzaron-GlobalFoundries 130nm standard cell libraries. Initially, the slower tier is identified, as it will become the bottleneck when both tiers are stacked together. Then the faster tier is re-synthesized using the clock speed of the slower tier to minimize the cell area.
Both tiers are then verified together as a 3DIC design for any timing violations using Synopsys PrimeTime, and timing reports are analyzed to optimize the design for faster speed. Finally, the layout is obtained using Cadence First Encounter, which used the design rules specified by Tezzaron's physical design kit to place 3D-vias (bondpoints). One of the tiers is first placed and routed; then the other tier is placed and routed using the mirrored version of the first tier's assignment of inter-tier signals to 3D-vias as a reference. The post-layout design is then analyzed for any timing violations using Synopsys PrimeTime. As a final verification step, Cadence Conformal is used for a logical equivalence check of the 3DIC FPU design against the original 2DIC FPU HDL source code.

Table 6.1: Results: 3DIC FPU vs. 2DIC FPU

                3DIC FPU          2DIC FPU
Clock period    6.37 ns           6.83 ns
Footprint       306 x 306 µm²     400 x 400 µm²

6.5.3 Results

The results presented in this section are the outcome of the implementation experiment conducted using the FPU design. Table 6.1 shows the clock period and footprint comparisons of the 3DIC FPU design with a 2DIC implementation of the FPU, which is also targeted to the same standard cell libraries and built using a standard VLSI design flow. The 3DIC FPU is about 7% faster than the 2DIC FPU. The 3DIC version also has a 41.5% smaller footprint than the 2DIC version. Clearly, the 3DIC FPU design performs better than the 2DIC FPU design, even though the FPU is a relatively small design when compared to designs like multi-core processors. Larger designs are likely to achieve greater performance improvements by using logic-on-logic stacking, when compared to the traditional 2DIC design approach. Hence, a processor, which is much larger than the FPU, is used in the following section to verify the proposed framework.

6.6 3DIC Processor

The block diagram of the processor design is shown in Figure 6.7. It has a classic five-stage pipeline architecture with a 64-bit datapath. This processor has a 32-entry register file and dedicated instruction and data memories.
The instruction set architecture (ISA) of the processor is similar to that of a MIPS ISA [88]. This processor is an order of magnitude bigger than the FPU design (about 14 times in terms of cell area) described in the previous section.

Figure 6.7: Block diagram of the processor

Figure 6.8: Data-flow model of the processor

Among all the blocks in Figure 6.7, only the ‘PC Logic’ block interacts with the external environment. However, both memory modules can be directly accessed by an external direct memory access (DMA) control machine. During DMA the processor is disabled; hence, the DMA interface related IO pins to memories do not have an impact on the performance of the processor. Therefore, the IO pins interfacing with the external DMA controller are excluded as reference nodes, as they are not ‘design characteristics’ of the processor. The data-flow model of the processor is shown in Figure 6.8. The blocks in Figure 6.7 and their corresponding nodes in Figure 6.8 are represented by the names N1, N2, and so on through N11.

Figure 6.9: Processor data-flow model with ranks and degrees

Figure 6.10: Processor graph optimized for 3DIC design using only ranks and degrees

6.6.1 Application of the Framework

As a first step, the minimum required characteristics of the design are obtained by evaluating the ranks of the nodes and then the degrees of the edges in the data-flow graph model (Figure 6.8) of the processor. Figure 6.9 shows the same graph with ranks (inside nodes) and degrees (on edges). Then the nodes are moved across two tiers to optimize the processor for 3DIC implementation using Algorithm 1. Figure 6.10 shows one of the possible solutions. As the total degree of communication of the design is reduced to the number of edges in the model (minimum cost possible), there is no need to make additional moves or explore a larger number of tiers.

Table 6.2: Area of nodes in processor graph expressed as percentage of total area

Node        N1    N2    N3    N4   N5    N6   N7    N8    N9   N10   N11
Area (%)    0.1   33.5  0.2   10   0.4   1    0.3   0.1   54   0.3   0.1

Figure 6.11: 3DIC processor graph obtained using area-constrained approach
Next, the area of all the nodes, obtained from the synthesis of the RTL description of the processor, is added to the framework. The area of each node is expressed as a percentage of total area of the design and is shown in Table 6.2. The solution obtained using only ranks and degrees has about 88.5% of total logic in one tier, resulting in a huge disparity between the areas of the two tiers. This solution might achieve performance improvement, but will not show the one obvious advantage of 3DIC design: smaller chip footprint. Hence, using the area characteristics of the nodes, the processor is optimized using an area-constrained approach resulting in the graph shown in Figure 6.11. In this solution, tiers 0 and 1 have about 45% and 55% of the logic area, respectively. Due to the huge spread among the areas of the nodes of the processor, ranging from 0.1% to 54%, the disparity between the areas of the two tiers cannot be further reduced while also resulting in a minimum degree of communication. An interconnect-bounded optimization was not executed on this graph because the number of inter-tier signals in Figure 6.11 is 521, whereas 5717 bondpoints are available for inter-tier signals in the F2F configuration (excluding those reserved for VDD and ground). The following sections present the 3DIC implementation details and the results compared to the 2DIC implementation.

6.6.2 3DIC Implementation

Both the 2DIC and 3DIC processor designs are targeted to the NanGate 45nm open cell library [89]. A standard 2DIC VLSI design flow was used for the 2DIC implementation. For the 3DIC implementation, first, Verilog RTL code was developed for each partition (tier). Then, each tier was synthesized independently, and verified together for any timing violations. Finally, a layout was obtained using Cadence First Encounter with an F2F bonding scheme. Given that there are no standard design rules to guide tools to lay out bondpoints at 45nm technology, the required rules (related to the size and pitch of bondpoints) were borrowed from Tezzaron's 130nm 3DIC technology.
Once the bondpoints grid was determined, the inter-tier signals were assigned to available bondpoints using the midpoint algorithm described in Section 7.7. There are a total of 7081 bondpoints available between the two F2F bonding tiers, of which 1364 were used by the power grid to interconnect power rails to tier-2. The remaining 5717 are available for interconnecting the 521 inter-tier signals. Due to the large availability of 3D-vias for this design, as mentioned above, a 3D-via bounded optimization was not used to further optimize the 3DIC processor design. Finally, the tiers were independently placed and routed to build the layouts required for 3DIC, and the post-layout netlists were then analyzed together for timing verification. Both the 2DIC and 3DIC designs have a resulting clock period of 6.2ns.

Table 6.3: Results: 3DIC vs. 2DIC processor

                            3DIC      2DIC      3DIC vs. 2DIC
Cell Area (µm²)             239384    286166    16.35%
Total Wire Length (mm)      1388      1649      15.85%
Average Wire Length (µm)    11.97     11.69     -2.40%
Dynamic Power (mW)          43.50     53.30     18.39%
Leakage Power (mW)          2.58      3.36      23.29%
Total Power (mW)            46.08     56.66     18.68%

6.6.3 Results

Table 6.3 shows the results obtained from the layouts of the 3DIC and 2DIC implementations of the processor. The power values are estimated using the Synopsys PrimeTime-PX tool. The post-layout netlists were back-annotated using SPEF parasitics for accurate estimation. An activity factor of 0.3 was used on primary inputs to measure the power consumption. Both designs were clocked at 6.2ns. Clearly, the 3DIC has performed significantly better than the corresponding 2DIC implementation. Though the 3DIC achieved 15.85% shorter total wire length, its average wire length is slightly more than that of the 2DIC. This could be because the 2DIC has a higher number of cells, especially buffers used for delay optimization (about 3.5% more buffers are present in the 2DIC netlist compared to the netlists of both 3DIC tiers combined). Furthermore, the chip footprint is reduced from 665 × 665 µm² in the 2DIC to 437 × 437 µm² in the 3DIC.
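The percentages in the last column of Table 6.3, the footprint reduction, and the bondpoint budget of Section 6.6.2 can be re-derived from the raw values. The following Python sketch does that arithmetic. The helper name `reduction_pct` and the dictionary layout are illustrative choices and not part of the original design flow, and the footprint percentage it prints is derived here rather than quoted from the text.

```python
def reduction_pct(baseline, improved):
    """Percent reduction of `improved` relative to `baseline` (hypothetical helper)."""
    return 100.0 * (baseline - improved) / baseline


# (3DIC, 2DIC) value pairs copied from Table 6.3.
table_6_3 = {
    "cell area (um^2)": (239384, 286166),
    "total wire length (mm)": (1388, 1649),
    "average wire length (um)": (11.97, 11.69),
    "total power (mW)": (46.08, 56.66),
}

for metric, (v3d, v2d) in table_6_3.items():
    print(f"{metric}: {reduction_pct(v2d, v3d):.2f}% reduction")

# Footprint areas from the side lengths quoted in the text (um).
footprint_reduction = reduction_pct(665 * 665, 437 * 437)
print(f"footprint: {footprint_reduction:.1f}% reduction")

# Bondpoint budget: total bondpoints minus those reserved for the power grid.
available_bondpoints = 7081 - 1364
print(f"521 inter-tier signals use {100.0 * 521 / available_bondpoints:.1f}% "
      f"of {available_bondpoints} available bondpoints")
```

A negative value, as for the average wire length, means the 3DIC figure is larger than the 2DIC one. Small differences from the tabulated percentages are presumably due to rounding of the reported raw values.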
Results show a clear improvement for the 3DIC in terms of total wire length, power, cell area, and footprint compared to the 2DIC. These results are obtained at a 45nm technology node, with much manual work in the back-end of the design flow. As the technology scales and a mature 3DIC design flow becomes available, stronger performance gains are expected for 3DIC designs. These demonstrations show that the proposed framework results in an efficient 3DIC design even with limited amounts of data and also when used at a higher abstraction level, like the architecture level.

6.7 Conclusions

A formal logic partitioning framework was introduced that enables analysis of logic-on-logic stacked 3DIC design in the earliest stages of a design cycle. The framework supports any partitioning granularity (core-level, functional-unit, gate, etc.). The proposed framework uses a data-flow graph representation of the design, with minimum information about the logic blocks represented by the graph nodes. The optimization goal of the framework is to reduce the overall communication length in the design. A simulated-annealing approach was shown to automate the optimization process; however, the procedure presented can also be applied by a designer, depending on the number of nodes and edges present in the graph. Multiple optimization algorithms are introduced to fine-tune the resultant 3DIC based on the design characteristics available in the framework. The framework was applied to a single-precision floating-point unit, using a data-flow graph where each node represents a pipeline stage. The demonstration FPU 3DIC design yielded a 7% faster design compared to a corresponding 2DIC implementation, when optimized using only the ranks and degrees of the graph representation. Additionally, the framework was used to partition a processor design using the area of the various logic blocks along with ranks and degrees in the data-flow model of the processor.
Results show 15.85% reduced total wire length, 16.35% smaller cell area, and about 18.68% reduced total power consumption in the 3DIC processor implementation when compared to that of a 2DIC implementation.

Chapter 7
Inter-Tier Signals to 3D-Vias Assignment Techniques

7.1 Introduction

Logic-on-logic stacked 3DIC is an attractive solution to overcome several challenges faced by the current IC technology for fabricating homogeneous circuits. The logic tiers in a 3DIC are bonded in one of the following three ways: face-to-face (F2F), face-to-back (F2B), or back-to-back (B2B), as shown in Figure 7.1. These methods are used irrespective of whether the tiers are stacked wafer-to-wafer, die-to-die, or wafer-to-die. The active devices in different tiers are interconnected using vertical connections called 3D-vias. Based on the type of 3DIC, these 3D-vias could be either top metal layer bondpoints (micro-bumps) or through-silicon-vias (TSVs).

In a 2DIC implementation the locations of I/O pins influence the standard cell placement. Similarly, in a 3DIC, irrespective of the type of bonding, the locations of TSVs or bondpoints (3D-vias) carrying inter-tier signals, along with the I/O pins, have an effect on standard cell placement in a tier. This assignment of inter-tier signals to 3D-vias is crucial, as it guides the cell placement, which in turn affects the overall performance of the 3DIC. In a 2DIC the I/O signals are manually assigned to I/O pins by a designer based on the board interface and the design specifications. Unlike I/O signals, these inter-tier signals are very large in number (for designs like microprocessors), which makes manual assignment an extremely difficult task and creates a demand for automated techniques. This assignment problem is the crux of the research presented in this chapter.

Figure 7.1: 3DIC bonding techniques: clockwise from top-right (i) face-to-face, (ii) back-to-back, and (iii) face-to-back
In an F2F configuration the vertical interconnects between the two tiers are formed by bonding top-metal layer bondpoints [77]. In contrast, in the other two approaches a through-silicon-via (TSV) is required to form a vertical connection. Unlike TSVs, the bondpoints in the F2F configuration do not affect the active layer density and are also smaller in size. When compared to other alternatives, the F2F configuration provides the highest interconnect density between the bonding tiers, making it ideal for logic-on-logic stacked 3DICs. Therefore, the case of F2F 3DICs is investigated first, and several automated assignment techniques are introduced. While the preliminary results showed that the proposed techniques can automate the assignment and yield good performance, later sections introduce formal methods that address all three configurations of 3DIC stacking and achieve an optimum assignment.

7.2 Background

As explained in Section 3.2, an experimental 3DIC was implemented using the MOSIS two-tier 3DIC package based on the Tezzaron-GlobalFoundries 130nm fabrication technology. The demonstration design is an existing 5-stage pipelined single-precision floating-point unit (FPU). Two 3DIC variants of this FPU design were built for demonstrating the logic partitioning techniques introduced in Chapter 4. In these exercises the problem of assigning an inter-tier signal to a bondpoint was identified, and a greedy approach was used.

For the asymmetric adaptive-precision energy-efficient 3DIC multiplier discussed in Section 3.3, an architecture-driven manual assignment was used to assign inter-tier signals to bondpoints. It is observed that even for a small design like the 64-bit multiplier (compared to designs like multi-core processors), which has only a few hundred inter-tier signals, the manual assignment method is an arduous task. Also, manual assignment was iterated multiple times before achieving satisfactory results.
The room for improvement left by the greedy approach, as well as the laboriousness and unpredictable nature of the manual approach, calls for automated and sophisticated techniques for assigning inter-tier signals to bondpoints. These experiences inspired the work presented in this chapter.

7.3 Design Flow

The 3DIC design flow (Figure 7.2) is presented in this section to assist in understanding the problem and proposed techniques described in the following sections. This is the same design flow that is discussed in Chapter 3, which is based on the “design for 3D” approach. In design for 3D, 3DIC technology is seen as a design platform right from the earliest stages of the chip design, i.e., at the top-most level or architectural level (before HDL/RTL code), in contrast to design flows presented in [19,23,78], where 3DIC technology is used only as an implementation platform. In a design for 3D flow, the design is partitioned (by-design) into tiers before synthesis, whereas in others the design is partitioned after synthesis, i.e., before or during place and route. However, a similar design flow was used in [21], where a 3DIC is designed using off-the-shelf CAD tools. In [52] a custom 3D floor plan was developed for each tier separately to implement an F2F bonded 3DIC of an iA32 processor.

Figure 7.2: 3DIC design flow

The work presented here is related to the fourth step of the 3DIC design flow shown in Figure 7.2, where the inter-tier signals are assigned to bondpoints of F2F tiers. The authors in [21] introduced a method to assign inter-tier signals to bondpoints, which is compared against the proposed techniques in Section 7.4.3. Once the assignment is determined, standard cell placement tools use these assigned bondpoints along with I/O pins (if any) to guide cell placement in a tier. The routing tools will implement the required interconnects between cells and bondpoints similar to any other interconnects. Finally, 3DIC-supported tools can be used for parasitic extraction of each tier to perform final analysis.
7.4 Assignment Techniques

All the assignment techniques are two-step procedures. First, the locations of all cells that either source or sink an inter-tier signal are estimated based on a preliminary placement exercise. Using these cell locations and the available bondpoint locations, inter-tier signals are assigned to bondpoints. The available bondpoints exclude those reserved for VDD and ground to connect power rails between the tiers. As shown in Figure 7.3, bondpoints are reserved based on the power plan of the design. Once the inter-tier signals to bondpoints assignment is determined, it is then used in both tiers to perform a final place-and-route (PAR).

7.4.1 Inter-Tier Signal Source/Sink Location Estimation

This is a common first step in all the techniques. The synthesized netlist of a tier is modified to set all inter-tier I/Os as wires (nets). Then the tier is placed without any optimization to preserve the logic (cells) associated with inter-tier signals. However, chip I/O signals are still used to guide placement along with the floorplan (if any). Once this preliminary layout is executed, cells connected to inter-tier signals are identified. For each inter-tier signal, the source/sink location is approximated to be the center of its associated cell. Each inter-tier signal will have two reference locations: one is the source cell in one of the tiers, and the other is the sink cell in the other tier.

Figure 7.3: Bondpoints connected to power rings and stripes are excluded in the list of available bondpoints

In a tier, an input inter-tier signal can potentially have a fan-out greater than one. Hence, the number of sinks for such a signal is more than one, resulting in multiple choices for a sink location. For example, say there are four sinks as shown in Figure 7.4. In this case any point on the minimum rectilinear Steiner tree connecting all the sinks can be used as the reference sink location.
Alternatively, the center of the smallest rectangle enclosing all the sinks (gray dashed box in Figure 7.4) may be used, or simply, a cell can be randomly chosen from the fan-out as the reference sink.

Figure 7.4: Multiple sinks illustration

7.4.2 Assignment Procedures

Using the inter-tier signals' source/sink cell locations in both the tiers, the inter-tier signals are assigned to available bondpoints in the following different ways.

Nearest-Neighbor

In nearest-neighbor (NN), an inter-tier signal is assigned to an available bondpoint which is closest to its source/sink cell. This approach is further classified as:

Tier-1 Reference: Tier-1 or bottom-tier is used as the reference. For each inter-tier signal the reference source/sink location is taken from the preliminary placement of tier-1 to make the assignment. Using this assignment, a final PAR is executed on both tiers.

Tier-2 Reference: In this case, the tier-2 or top-tier is used as the reference for source/sink cell locations to make the assignment. This alternative requires a temporary placement of tier-2 only.

Combined: In this approach, only the source cell location information from both the tiers is used for assignment. Every inter-tier signal has exactly one source cell, and it could be in either one of the tiers. For each inter-tier signal its source cell location is taken from the tier in which it is placed. When compared to the other two alternatives, this avoids the confusion of which sink cell to select for an input inter-tier signal that has fan-out greater than one. However, a preliminary placement of both tiers is required to estimate source cell locations.

Henceforth, the above three alternatives are referred to as NN1, NN2, and NNC, respectively. After the reference source/sink locations are estimated, Algorithm 4 is used to make the assignment, assuming N inter-tier signals and M bondpoints.
Algorithm 4 NN Assignment Procedure
Input: B - List of M available bondpoint locations
Input: C - List of N source/sink cell locations
Output: A - Assignment list
Require: N ≤ M
for each location i in C do
    j = FindClosestBondpoint(i, B);
    Mark j in B as assigned;
    Append (i, j) assignment to A;
end for
return A

The FindClosestBondpoint() function computes the wire length required (Manhattan distance) to connect a reference source/sink cell to a bondpoint, for each available bondpoint, and returns the one with the minimum value.

Midpoint

In the NN approach the closest bondpoint is assigned to an inter-tier signal's source or sink cell in a greedy fashion, so the assignment is favorable to either tier-1 or tier-2. In the midpoint approach, the algorithm aims to assign a bondpoint that is midway between the source cell in one tier and the sink cell in the other tier (closer to the midpoint of source and sink). Also, the bondpoint may interconnect the source and sink cells with minimum Manhattan wire length. This approach is motivated by the observation that the cell placement for inter-tier signals is also somewhat influenced by related logic cell placement in addition to the bondpoint, and it attempts to forge a compromise.

Figure 7.5: Midpoint assignment illustration

Consider the scenario given in Figure 7.5, where the solid circles represent source cells and solid squares represent sink cells. The hollow circles are the available bondpoints for interconnecting the source and sink cells. For the inter-tier signal A in Figure 7.5, any bondpoint enclosed within the red box (narrow dashes) will interconnect the source and sink of A with minimum Manhattan length. Similarly, for B any bondpoint within the blue box (wide dashes) is an ideal candidate for assignment. Also, the smaller the box in which the source-sink pair is enclosed, the smaller the number of candidate bondpoints that are on a minimum length connecting path. Hence, the source-sink pairs of all inter-tier signals are sorted in increasing order of Manhattan distance between the source and sink cells.
A preliminary placement of both tiers is used to estimate source and sink cell locations. Then these source-sink pairs are sorted using the function SortIncreasingOrderOfDistance(), as shown below in Algorithm 5. As the distance in the third dimension between the source and sink in every pair is the same, it is ignored in computing the distance. Finally, the bondpoints are assigned to inter-tier signals in this order of increasing distances.

Algorithm 5 Midpoint Assignment Procedure
Input: B - List of M available bondpoint locations
Input: CM - List of N entries, each entry containing a pair of source and sink cell locations, one of them from tier-1 and the other from tier-2
Output: A - Assignment list
Require: N ≤ M
for each i in CM do
    Compute Manhattan distance between source and sink
end for
CM ← SortIncreasingOrderOfDistance(CM);
for each location i in CM do
    j = FindMinimumLengthConnectingBondpoint(i, B);
    Mark j in B as assigned;
    Append (i, j) assignment to A;
end for
return A

The FindMinimumLengthConnectingBondpoint() function computes the wire length required (Manhattan distance) to connect source, bondpoint, and sink, for each available bondpoint, and returns the one with minimum length.

7.4.3 Analysis

In the first step of the NNC and midpoint approaches, a temporary place and route is required on both the tiers, whereas NN1 and NN2 require preliminary placement only in one of the tiers, tier-1 and tier-2, respectively. Compared to the others, the midpoint algorithm requires an additional sorting step, making it computationally more expensive. However, all the algorithms have the same asymptotic computational complexity. The second step in each technique has a computational complexity of O(NM). In the midpoint approach sorting requires O(N log N) operations, but the overall algorithm is dominated by the O(NM) computations for assigning a bondpoint (because N ≤ M). In the midpoint approach the order of assignment is fixed due to sorting, but in NN the order is random and multiple outputs are possible.
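The two greedy procedures of Section 7.4.2, together with the bounding-box option for a multi-sink reference location from Section 7.4.1, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the tool flow used in this work: locations are plain (x, y) tuples, all function names are hypothetical, signals are visited in list order for NN, and ties between equally good bondpoints go to the lowest index.

```python
def manhattan(p, q):
    """Manhattan (rectilinear) distance between two (x, y) points."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def bounding_box_center(sinks):
    """Reference sink for a multi-fan-out signal: center of the smallest
    rectangle enclosing all sinks (one of the options in Section 7.4.1)."""
    xs = [x for x, _ in sinks]
    ys = [y for _, y in sinks]
    return ((min(xs) + max(xs)) / 2, (min(ys) + max(ys)) / 2)


def _greedy(cost, order, num_bondpoints):
    """Shared greedy loop: visit signals in `order` and give each one the
    cheapest bondpoint that is still free (ties: lowest index, an assumption)."""
    if len(order) > num_bondpoints:
        raise ValueError("need N <= M")
    free = set(range(num_bondpoints))
    assignment = {}
    for i in order:
        j = min(free, key=lambda k: (cost(i, k), k))
        free.remove(j)          # mark bondpoint j as assigned
        assignment[i] = j
    return assignment


def nn_assign(cells, bondpoints):
    """Algorithm 4 sketch: each reference cell takes its closest free bondpoint."""
    def cost(i, k):
        return manhattan(cells[i], bondpoints[k])
    return _greedy(cost, list(range(len(cells))), len(bondpoints))


def midpoint_assign(pairs, bondpoints):
    """Algorithm 5 sketch: source-sink pairs are served in increasing order of
    their Manhattan distance; cost is source -> bondpoint -> sink length."""
    order = sorted(range(len(pairs)), key=lambda i: manhattan(*pairs[i]))

    def cost(i, k):
        src, snk = pairs[i]
        return manhattan(src, bondpoints[k]) + manhattan(bondpoints[k], snk)
    return _greedy(cost, order, len(bondpoints))


bondpoints = [(0, 0), (2, 0), (4, 0), (2, 2)]
pairs = [((0, 0), (4, 0)), ((2, 1), (2, 2))]   # (source, sink) per signal
print(nn_assign([src for src, _ in pairs], bondpoints))
print(midpoint_assign(pairs, bondpoints))
```

In this toy instance the midpoint procedure serves the second signal first and gives it the bondpoint at (2, 2), which lies on its source-sink path, whereas the nearest-neighbor procedure, which looks only at source cells, gives it the bondpoint at (2, 0). Each call to `min` scans the free bondpoints once, which matches the O(NM) cost of the second step.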
In contrast to the technique presented in [21], the NNC and midpoint approaches use source/sink cell locations from both tiers to find the inter-tier signals to bondpoints assignment. Also, the source/sink cell locations are identified using a different method (Section 7.4.1) compared to that in [21], where a no-I/O placement was used.

7.4.4 Architecture-Driven Manual Assignment

Apart from the proposed techniques, an architecture-driven manual assignment is also used. This approach is used for comparison with the proposed techniques, and is impractical for large designs. It would be an extremely tedious task to manually assign tens of thousands (or more) of inter-tier signals to bondpoints in logic-on-logic stacked designs like chip multiprocessors.

7.5 Demonstration Designs

Generally, benchmark circuits similar to MCNC or ISCAS are used to test the effectiveness of new techniques. Given that these circuits are not designed for 3DIC, and are also small, these suites of circuits are not used for analyzing the proposed techniques. Instead, two designs, a 64-bit multiplier and a single-precision FPU, which were built (or modified) for 3DIC implementation, were used as demonstration designs. It should be noted that these are still small in size when compared to designs like multi-core processors, which have higher potential to gain from a 3DIC design. The following sections present an overview of the demonstration designs to help in understanding the results and analysis.

Figure 7.6: 3DIC multiplier block diagram

64-bit Multiplier

The 64-bit multiplier exploits the highly frequent occurrence of low-precision operands by dynamically adapting to asymmetric precision to save energy (Section 3.3). Depending on input operands, it functions in three modes: 32×32, 64×64, or asymmetric 32×64/64×32. It is a 1GHz two-stage pipelined design, built using four 32-bit multipliers. The pre-processing logic enables only the required number of 32-bit multipliers to execute a given multiplication operation, to save dynamic energy.
Figure 7.6 shows the block diagram of the 64-bit multiplier 3DIC. The design partitioning is consistent with the hybrid partitioning principle presented in Section 4.3. All the inter-tier signals are outputs of sequential units in tier-1 and are inputs to combinational units in tier-2. Tier-2 consists purely of combinational logic in the form of three 32-bit multipliers.

Single-Precision FPU

The single-precision FPU is an existing design that has a 5-stage pipelined architecture with a single 2-stage pipelined multiplier, which is used to execute all multiplication operations, including those required in division [69]. For all instructions other than division, processing occurs in a linear pipeline fashion. Thus, the latency is 5 cycles and an instruction can be issued at every cycle. The division operation is executed in a non-linear pipeline fashion and has a latency of 12 cycles. Figure 7.7 shows the block diagram of the FPU 3DIC. It is partitioned using the S2C technique discussed in Section 4.3. Each inter-tier signal is an output of a sequential unit in one tier and an input to a combinational unit in the other tier.

Figure 7.7: 3DIC FPU block diagram

7.5.1 Implementation

The 3DIC designs are targeted to the NanGate 45nm open cell library [89]. First, Verilog RTL code is developed for each partition (tier). Then, each tier is synthesized independently, and verified together for any timing violations. Finally, a layout is obtained using Cadence First Encounter, with both tiers bonding face-to-face.

Given that there are no standard design rules to guide tools to lay out bondpoints at 45nm technology, the required rules (related to the size and pitch of bondpoints) are borrowed from Tezzaron's 130nm 3DIC technology. Once the bondpoints grid is determined, each inter-tier signal is assigned to a bondpoint. Then the tiers are independently placed and routed to build the layouts required for 3DIC. The multiplier and FPU 3DICs
The multiplier and FPU 3DICs 120 Table7.1: Resultscomparisonbetweenthetechniquesfor3DICmultiplier TotalNetLength AverageNetLength TotalCellArea (mm) (m) (m 2 ) NN1 184.81 10.035 29790 NN2 209.40 11.467 28700 NNC 195.73 10.338 29613 Midpoint 188.35 9.919 29434 Manual 201.37 11.064 28621 Table7.2: 3DICmultiplierpowerconsumption DynamicPower(mW) TotalPower(mW) NN1 4.522 4.8954 NN2 4.387 4.7346 NNC 4.472 4.8321 Midpoint 4.413 4.7684 Manual 4.334 4.6805 have a clock period of 5.4ns and 4.8ns, respectively. The post-layout netlists are then analyzedtogetherfortimingverification. 7.6 Results Tables 7.1, 7.2, 7.3, and 7.3 show the various results of multiplier and FPU 3DICs (respectively)forallthefourtechniquesandthearchitecture-drivenmanualassignment method. The total net length and average net length values are calculated based on the netlengthstatisticsgeneratedbyCadenceFirstEncounterforeachtierofa3DICdesign. The total cell area is the combined cell area of the layouts of both the tiers of a 3DIC. An average power analysis was carried out using Synopsys PrimeTime-PX. The power is estimated using post-layout netlists and SPEF parasitics, of both the tiers, with an activity factor of 0.3 on all inputs. The total power is comprised of leakage power and 121 Table7.3: Resultscomparisonbetweenthetechniquesfor3DICFPU TotalNetLength AverageNetLength TotalCellArea (mm) (m) (m 2 ) NN1 91.81 8.669 18543 NN2 92.29 8.794 18476 NNC 94.65 9.014 18421 Midpoint 88.52 8.530 18321 Manual 97.85 9.280 18644 Table7.4: 3DICFPUpowerconsumption DynamicPower(mW) TotalPower(mW) NN1 1.0172 1.2078 NN2 1.0486 1.2379 NNC 1.0227 1.2113 Midpoint 1.0325 1.2203 Manual 1.0270 1.2209 dynamic power (switching power + cell internal power). The graphs in Figures 7.8 and 7.8(d) show the comparisons of the four techniques against manual assignment. The resultsarenormalizedwithrespecttothoseobtainedbythemanualassignmentmethod. 
Results show that the proposed techniques generally have better net length statistics than manual assignment. The midpoint approach yielded more consistent results than the others, with net lengths up to about 10% lower than manual assignment. These designs were implemented using 45nm technology. Hence, in smaller technology nodes (like 32nm, 22nm, and so on), where wires have an even higher impact on chip performance, the proposed techniques are more likely to produce better results.

Comparatively, the manual assignment multiplier 3DIC has lower cell area and power consumption than the others. In the multiplier 3DIC design all the inputs to tier-2 (operands to three 32-bit multipliers, see Figure 7.6) are concentrated in the upper half of the floorplan and all tier-2 outputs (outputs of three 32-bit multipliers) are present in the lower half. Due to this very regular structure, though it is very tedious to assign bondpoints manually, it is efficient. In the case of the FPU, the higher complexity of the design, owing to its irregular pipeline design, causes the manual assignment results to generally lag behind those of the proposed techniques, even after multiple trial-and-error iterations of manual assignments. Extrapolating, it can be expected that larger designs like processors will have higher complexity than the FPU, and the techniques could yield better results than manual assignment, with the additional bonus of being automated. Manual assignment is analogous to custom design, which is expected to perform better than automated techniques, but results show that the proposed techniques can perform better.

Figure 7.8: Comparison of proposed techniques against manual assignment (panels (a) through (d))

7.7 Optimal Techniques with Path Control

These preliminary results indicate the midpoint approach as the forerunner among the four techniques, but all of them can successfully accomplish the assignment task. However, these techniques require the following improvements to provide a formal solution to the assignment problem.

1. Achieve an optimum assignment, i.e., propose techniques that result in minimum wire length for interconnecting inter-tier signals.
2. Provide path-controllability to prioritize critical paths, and limit the connection length.
3. Extend the techniques to address all three bonding methods of 3DICs, i.e., address the impact of TSVs on assignment techniques.

In the subsequent sections, first, several techniques are presented which provide an optimum solution along with path control, followed by the impact of TSVs on the assignment techniques.

7.8 Optimization

Similar to the heuristics presented above, the optimal techniques are also two-step procedures. In the first step reference locations are obtained as explained in Section 7.4.1. In all the techniques, once the reference locations for inter-tier signals are estimated, using the 3D-via locations, an assignment can be found that results in minimum total wire length for interconnecting inter-tier signals and 3D-vias. The optimum assignment is achieved by formulating the required task as the classic assignment problem [90–93], as explained in the following sections.

7.8.1 Formulation as Assignment Problem

The formal definition of the assignment problem is as follows:

Given: Two sets $A$ and $B$, containing $N$ and $M$ elements, respectively.
$C(i,j)$ - Cost of assigning $i$ to $j$, $i \in A$ and $j \in B$

Optimization:

\[ \text{Minimize} \quad \sum_{i \in A} \sum_{j \in B} x_{ij}\, C(i,j) \tag{7.1} \]

Such that:

\[ \sum_{j \in B} x_{ij} = 1 \quad \forall\, i \in A \tag{7.2} \]

\[ \sum_{i \in A} x_{ij} = 0 \text{ or } 1 \quad \forall\, j \in B \tag{7.3} \]

\[ x_{ij} = 0 \text{ or } 1 \quad \forall\, i \in A,\ j \in B \tag{7.4} \]

The variable $x_{ij}$ represents the assignment of $i$ to $j$: $x_{ij}$ is 1 if assigned; 0 otherwise. If $N = M$ the problem is called a balanced assignment problem; otherwise it is an unbalanced assignment problem.

The inter-tier signals assignment to 3D-vias can be formulated as an unbalanced assignment problem (typically $N < M$). Let the sets $A$ and $B$ represent the sets of inter-tier signals and 3D-vias, respectively. The cost $C(i,j)$ is the wire length (Manhattan distance) required to interconnect a signal $i$ in $A$ to a 3D-via $j$ in $B$.
The condition $N \le M$ has to be satisfied to interconnect all inter-tier signals between the tiers. The cost of the assignment is the total wire length required for the assignment. Hence, the optimization goal of the assignment problem is to obtain minimum total wire length, as shown in Expression 7.1.

7.8.2 Cost Function

For the RT1, RT2, and RTC methods, an element $i$ of set $A$ is the source or sink location of an inter-tier signal. The cost $C(i,j)$ is the Manhattan distance between signal $i$ and 3D-via $j$. For the midpoint approach, each element $i$ in set $A$ is a pair of locations (the source and sink reference locations of an inter-tier signal). Here the cost $C(i,j)$ is the Manhattan wire length required to connect the source cell of $i$ to the sink cell of $i$ through the 3D-via $j$. Only the Manhattan wire length in the horizontal (x-y) plane is used as the cost: irrespective of which $j$ is assigned to which $i$, all inter-tier signals have to traverse the same amount of vertical wire (including bondpoints or TSVs), so the various assignments differ only in the wire length used in the x-y plane. It should be noted that, for the same reason, a reduction in wire length is directly proportional to a reduction in the capacitance of the wires and, hence, in delay and power.

7.8.3 Solution

Munkres' algorithm can be used to solve the assignment problem in polynomial time while resulting in minimum cost, i.e., a provably optimal solution [90,91]. The inter-tier signals to 3D-vias assignment problem is unbalanced. In such a situation dummy elements are introduced in set $A$ to balance both sets. These dummy elements in $A$ have zero cost of assignment to any element in $B$. Thus the problem is balanced, and then Munkres' method is used to find the optimal assignment, which interconnects inter-tier signals with minimum wire length through 3D-vias.
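The formulation can be exercised with a small pure-Python sketch. It is not the implementation used in this work: it uses the potentials-based (shortest augmenting path) variant of the Hungarian method rather than the matrix-reduction form usually associated with Munkres, and it accepts $N \le M$ directly, which gives the same optimum as padding $A$ with zero-cost dummy elements. The last two helpers anticipate the thresholding described in Section 7.8.4. All function names are hypothetical, and `BIG` is a finite stand-in for the infinite cost, chosen here so that integer arithmetic stays exact.

```python
BIG = 10**9  # finite stand-in for the "infinite" cost of a forbidden pairing


def midpoint_cost_matrix(pairs, vias):
    """C(i, j): x-y Manhattan length from source i through via j to sink i."""
    def dist(p, q):
        return abs(p[0] - q[0]) + abs(p[1] - q[1])
    return [[dist(src, v) + dist(v, snk) for v in vias] for src, snk in pairs]


def hungarian(cost):
    """Minimum-cost assignment of every row to a distinct column (N <= M).
    Returns a list whose entry i is the column assigned to row i."""
    n, m = len(cost), len(cost[0])
    if n > m:
        raise ValueError("need N <= M")
    inf = float("inf")
    u = [0] * (n + 1)      # row potentials
    v = [0] * (m + 1)      # column potentials
    p = [0] * (m + 1)      # p[j]: row matched to column j (1-based, 0 = free)
    way = [0] * (m + 1)    # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:          # reached a free column
                break
        while j0:                   # flip matches back along the path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    rows_to_cols = [None] * n
    for j in range(1, m + 1):
        if p[j]:
            rows_to_cols[p[j] - 1] = j - 1
    return rows_to_cols


def apply_thresholds(cost, critical, t_cp, t_lp):
    """Replace costs above T_cp (critical rows) or T_lp (other rows) with BIG."""
    return [[BIG if c > (t_cp if i in critical else t_lp) else c for c in row]
            for i, row in enumerate(cost)]


def assign_with_path_control(cost, critical, t_cp, t_lp):
    """Return (assignment, total length), or None when the thresholds leave
    no feasible assignment (the total cost reaches the BIG penalty)."""
    limited = apply_thresholds(cost, critical, t_cp, t_lp)
    rows_to_cols = hungarian(limited)
    total = sum(limited[i][j] for i, j in enumerate(rows_to_cols))
    return None if total >= BIG else (rows_to_cols, total)


cost = [[1, 2, 9],
        [1, 5, 9]]
print(hungarian(cost))
print(assign_with_path_control(cost, critical={0}, t_cp=1, t_lp=8))
```

In this toy instance the unconstrained optimum pairs signal 0 with via 1 and signal 1 with via 0 for a total length of 3, whereas a greedy row-by-row choice would cost 6. Marking signal 0 as critical with a threshold of 1 forces it onto via 0 and raises the total to 6, and tightening that threshold to 0 leaves no feasible assignment, in which case the thresholds would have to be relaxed.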
In a balanced assignment problem, the condition in Equation 7.3 changes to $\sum_{i \in A} x_{ij} = 1 \ \forall\, j \in B$.

7.8.4 Path Control

The above assignment problem aims to minimize the total wire length required to interconnect inter-tier signals but does not provide any control over the length of the longest interconnecting path. Also, the critical paths do not have any priority in the assignment. Though the solution results in an optimum assignment, it is possible that one of the interconnects is so long that it becomes a bottleneck for overall chip performance. Additionally, if an inter-tier signal is part of a known critical path, it should be given priority in the assignment procedure to keep the critical path delay below the target value. Both these path controls can be achieved by modifying the cost values C(i, j). For a signal i in A which is part of a known critical path, if the cost (i.e., wire length) of assigning it to j in B is more than a threshold cost value, say $T_{cp}$, then C(i, j) is changed to a very high cost (infinity). This ensures that the optimal assignment does not contain the assignment of i to j. Similarly, to prohibit very long interconnects, a threshold value, say $T_{lp}$, can be used in the same manner as $T_{cp}$.

Therefore, for i on a critical path:
\[ C(i, j) = \begin{cases} \infty & \text{if } C(i, j) > T_{cp} \\ C(i, j) & \text{otherwise} \end{cases} \tag{7.5} \]
For all other i (which do not have a critical path threshold):
\[ C(i, j) = \begin{cases} \infty & \text{if } C(i, j) > T_{lp} \\ C(i, j) & \text{otherwise} \end{cases} \tag{7.6} \]

Several different threshold values could be used to constrain the assignment. An infeasible solution is implied if the total cost of the solution is infinity. In this case, the threshold values should be increased in steps until a finite total cost is produced, i.e., a feasible solution is obtained.

7.9 The Impact of TSV

In the case of F2F bonding it is assumed that the bondpoints are located on a fixed grid (locations are fixed). Because bondpoints do not conflict with the active area, and they have small size and pitch (a few microns), having a fixed grid (locations) is acceptable.
But in F2B and B2B bonding, TSVs are required and assuming a fixed grid may not be efficient, as they collectively occupy a large active silicon area and are available only in a limited number. Because of these factors, the presence of TSVs complicates the process of determining the inter-tier signals to 3D-vias (TSVs) assignment.

In the first step of the procedure, a placeable cell (dummy cell) is used to place TSVs along with logic cells. A dummy cell representing a TSV has the same dimensions as those of a TSV. This trial placement with dummy cells is executed in the tiers that have TSVs (i.e., TSVs going through the substrate body). In F2B bonding, the tier with back-side bonding will have TSVs, whereas in B2B both tiers will have these dummy cells, reserving space for TSVs. These dummy cells provide candidate reference locations for an inter-tier signal interconnected using a TSV. In F2B bonding, the TSV landing pads are located on the top metal layer, similar to bondpoints in F2F bonding. Hence, the reference location estimation procedure in the face tier is the same as that described in Section 7.4.1. Once the reference locations are estimated for all inter-tier signals in each tier, a location is determined for TSVs to interconnect signals between the tiers.

7.9.1 Assignment of Inter-Tier Signals to TSVs

Figure 7.9: TSV placement. (a): Restricted to fixed regular sites (black-bordered squares). (b): Placed anywhere. Black solid squares represent TSVs.

As shown in Figure 7.9, there are two ways in which TSVs can be placed in a tier. In the first, placement of TSVs is restricted to fixed sites: the black-bordered hollow squares in Figure 7.9(a) represent the sites that are allowed for TSV placement. Figure 7.9(b) shows the other way, where TSVs can be placed anywhere, but each TSV has to satisfy design rules such as minimum distance between TSVs, size, dimensions, etc. The example in Figure 7.9(a) shows TSVs placed with restriction, whereas Figure 7.9(b) shows unrestricted placement.
The squares with colored stripes represent the area around a TSV in which another TSV is not allowed, i.e., they enforce the minimum distance rule (pitch) between any two TSVs. In the restricted (fixed-sites) placement case, the assignment procedure introduced in Section 7.8 is used, where set A contains reference locations, and set B contains the allowed TSV sites.

In the unrestricted (floating) case, though a TSV can be placed anywhere, it should be aligned to a grid, similar to standard cells, contacts, etc. In Figure 7.9(b), the grid with gray lines represents the possible sites for placing a TSV. Let set B contain all these M sites where a TSV can be placed. As a result, one additional condition needs to be satisfied by the optimization problem presented in Section 7.8. This condition makes sure that if a TSV at location j is assigned to an inter-tier signal i (i.e., $x_{ij} = 1$), then none of the possible TSV sites within the neighborhood of j is assigned to any other signal. Therefore, for each j in B that is assigned to a signal, the following condition should be satisfied:

\[ \sum_{k \in N_j} \sum_{i \in A} x_{ik} = 0 \quad \forall\, i \in A,\ j \in B,\ k \in N_j \tag{7.7} \]

where $N_j$ is the set of sites for TSV placement within the neighborhood of j, which should not be assigned to any other signal if j is already assigned. Squares with colored stripes in Figure 7.9 represent $N_j$ of the site j, and $j \notin N_j$.

7.10 Results

The same multiplier and FPU designs presented in Section 7.5 are used for verifying the optimal techniques. The results obtained from the layouts of the multiplier and FPU 3DICs are presented in Tables 7.5 and 7.6, respectively. The total and average net length values are calculated based on the net length statistics generated by Cadence First Encounter for each tier. The total cell area is the combined cell area of the layouts of both the tiers.
As shown in Figure 7.10, all the proposed techniques yielded better results than the manual assignment, with up to 9.4% and 10.4% lower total and average wire length, respectively, with negligible increase in total cell area.

Table 7.5: 3DIC multiplier: Results comparison between the techniques

Technique   Total Net Length     Average Net Length   Total Cell Area
            mm       % less      µm       % less      µm²      % less
RT1         182.48   9.4         9.91     10.4        28973    -1.2
RT2         184.7    8.3         10.10    8.7         28773    -0.5
RTC         184.48   8.4         10.11    8.6         28714    -0.3
Midpoint    185.74   7.8         10.15    8.2         28739    -0.4
Manual      201.37               11.06                28621

Table 7.6: 3DIC FPU: Results comparison between the techniques

Technique   Total Net Length     Average Net Length   Total Cell Area
            mm       % less      µm       % less      µm²      % less
RT1         90.35    7.7         8.66     6.7         18347    1.6
RT2         90.06    8.0         8.69     6.4         18267    2.0
RTC         92.23    5.7         8.90     4.1         18276    2.0
Midpoint    93.73    4.2         8.55     7.8         18921    -1.5
Manual      97.85                9.28                 18644

Figure 7.10: Percentage improvement in total and average net length for proposed techniques compared to architecture-driven manual assignment

In comparison to the nearest-neighbor methods presented in Section 7.4, the 3DICs built using the RT1, RT2, and RTC techniques exhibit better wire length statistics with smaller chip area (RT2 has about 12% lower total and average net length compared to NN2). The sort-based greedy midpoint approach in Section 7.4 achieved results close to those of the midpoint optimization approach because the ratio of the number of available 3D-vias to the number of inter-tier signals is quite high for the studied designs (FPU: 1.4, Multiplier: 1.9). For more complex designs, where this ratio is closer to 1 and contention in the assignment problem is likely to increase, the proposed optimum midpoint approach is expected to perform better than the greedy midpoint approach.

7.11 Conclusions

This chapter addresses the research problems involved with assigning inter-tier signals to 3D-vias. First, several heuristics were introduced, which successfully automated the assignment process.
Then optimum techniques to assign inter-tier signals to 3D-vias were introduced that interconnect signals between tiers with minimum wire length, while providing control over paths by prioritizing the critical paths and limiting the longest path. Two 3DIC demonstration designs (a multiplier and an FPU) were built using the proposed techniques along with an architecture-driven manual assignment. The proposed techniques successfully automate the assignment procedure. Results show that 3DICs built using the techniques achieved better interconnect statistics than those using a manual assignment, with up to 9.4% lower total net length and up to 10.4% shorter average net length. The optimum techniques performed significantly better than the heuristics, with up to 12% lower total and average net length.

Though the optimum techniques achieve minimum wire length for interconnecting inter-tier signals, none of the techniques address the impact of wiring congestion, total wiring requirements, and the number of near-by inter-tier signals in the 3D-via assignment step. The congestion problem in 3DICs is exacerbated by the presence of TSVs; hence the following chapter discusses congestion- and neighborhood-aware optimized assignment techniques.

Chapter 8
Congestion-Aware Assignment Techniques

8.1 Introduction

In a 3DIC, the assignment of inter-tier signals to 3D-vias (bondpoints or TSVs) is crucial, as it guides the cell placement, which in turn affects the overall performance of the 3DIC. In a 2DIC the I/O signals are manually assigned to I/O pins by a designer based on the board interface and the design objectives. Unlike I/O signals, inter-tier signals are typically very large in number; hence, manually assigning inter-tier signals to 3D-vias is impractical, and automated techniques are required. In the previous chapter several techniques to automate the assignment procedure were presented, but those techniques did not consider the wiring requirements, congestion, and presence of other near-by inter-tier signals in the 3D-via assignment step.
Congestion can make routing difficult and potentially deteriorate the chip performance [94], and the congestion problem is relatively more severe in 3DICs [95,96]. Furthermore, TSVs, due to their large size and typically sparse placement, can contribute to routing congestion [97]. Apart from causing routability issues, congestion in 3DICs can exacerbate thermal issues [95]. Thus many steps of the 3DIC design process could benefit from congestion-aware methodologies [95,96]. Hence, congestion- and neighborhood-aware techniques for inter-tier signals to 3D-vias assignment are introduced in this chapter.

This chapter is outlined as follows. The following section presents the motivation. Section 8.3 gives a brief background for the assignment procedure, followed by the proposed techniques in Section 8.4. The demonstration designs are discussed in Section 8.5. Finally, the results are presented in Section 8.6, followed by conclusions.

8.2 Motivation and Related Work

A 'design for 3D' design flow is shown for 3DIC implementations in Figure 8.1, where the chip is partitioned (by design) into tiers before synthesis. Alternatively, in other flows the design is partitioned after synthesis, i.e., before or during place and route, using an automatic partitioning tool. In either approach, once the logic is partitioned, the cell placement in each tier is influenced by the locations of 3D-vias interconnecting inter-tier signals, along with I/O pins and floorplans (if any). The contributions of this chapter are related to the fifth step of the 3DIC design flow shown in Figure 8.1, where the inter-tier signals are assigned to 3D-vias.

Figure 8.1: 3DIC design flow

In Section 7.4 several heuristics are presented, which automated the assignment process as well as addressed some of the limitations of prior work [21,29]. In [29] a manual assignment was used, as the number of 3D-vias in the design was very small. The authors in [21] presented an assignment method that uses information from only one tier.
Then, the assignment techniques are further enhanced in Section 7.8 by introducing optimal methods that result in minimum wire length for interconnecting inter-tier signals. These techniques also feature path-controllability by providing an ability to prioritize critical paths and limit the connection length. However, none of the techniques address the impact of wiring congestion, total wiring requirements, and the number of near-by inter-tier signals in the 3D-via assignment step. Hence congestion- and neighborhood-aware optimal assignment techniques are proposed in this chapter. The following section provides the necessary background for understanding the proposed work.

8.3 Assignment Procedure

The assignment procedure consists of two steps, similar to the techniques presented in Chapter 7. First, the locations of all cells that either source or sink an inter-tier signal are estimated based on a preliminary place-and-route exercise. Then, using these source/sink cells as reference locations, along with the locations of 3D-vias, an inter-tier signal is assigned to a 3D-via using optimization criteria.

Step-1: Finding Reference Locations

The synthesized netlist of a tier is modified to set all inter-tier signals as wires (nets). Then each tier is placed, guided by I/O signals and floorplan (if any). Once this preliminary layout is obtained, all the cells that are connected to the inter-tier signals are identified. The center of a source/sink cell of an inter-tier signal is approximated as its reference location in a tier. Each inter-tier signal will have two reference locations: one is the source cell in one of the tiers, and the other is the sink cell in the other tier.

In a tier, an input inter-tier signal can potentially have a fan-out greater than one. Hence, the number of sinks for such a signal is more than one, resulting in multiple choices for a sink location. In this case any point on the minimum rectilinear Steiner tree connecting all the sinks can be used as the reference sink location.
Alternatively, the center of the smallest rectangle enclosing all the sinks may be used, or simply, a cell can be randomly chosen from the fan-out as a reference sink, and its center can be used as the reference location for the inter-tier signal.

Step-2: Optimization

The optimum assignment is achieved by formulating the required task as the classic assignment problem [90,91]. The formal definition of the assignment problem is as follows:

Given: Two sets A and B, containing N inter-tier signals and M 3D-via locations, respectively.
C(i, j): cost of assigning i to j, i ∈ A and j ∈ B

Optimization:
\[ \text{Minimize} \quad \sum_{i \in A} \sum_{j \in B} x_{ij}\, C(i, j) \tag{8.1} \]
Such that:
\[ \sum_{j \in B} x_{ij} = 1 \quad \forall\, i \in A \tag{8.2} \]
\[ \sum_{i \in A} x_{ij} = 0 \text{ or } 1 \quad \forall\, j \in B \tag{8.3} \]
\[ x_{ij} = 0 \text{ or } 1 \quad \forall\, i \in A,\ j \in B \tag{8.4} \]

The variable $x_{ij}$ represents the assignment of i to j: $x_{ij}$ is 1 if i is assigned to j, and 0 otherwise. If N = M the problem is called a balanced assignment problem; otherwise it is an unbalanced assignment problem.

The assignment of inter-tier signals to 3D-vias can be formulated as an unbalanced assignment problem (typically N < M). Let the sets A and B represent the sets of inter-tier signals and 3D-vias, respectively. Each element i in set A is a pair of locations (source and sink reference locations in interconnecting tiers). The function C(i, j) represents the cost of connecting the source cell of i to the sink cell of i through the 3D-via j. Hence, the optimization goal is to minimize the total cost of the assignment, as shown in Expression 8.1.

8.3.1 Cost Function

In Section 7.8, the midpoint algorithm aims to assign a 3D-via that is located midway between the two reference locations. The cost C(i, j) of the midpoint technique is equal to the Manhattan wire length required to connect the source cell of i to the sink cell of i through the 3D-via j. Here equal weight is given to the amount of wire, i.e., the cost required to connect the source in one tier to the 3D-via and the sink in the other tier to the same 3D-via, without any regard to congestion, total wire required in a tier, or number of near-by inter-tier signals.
Hence, a congestion- and neighborhood-aware cost function is introduced to address the limitations of prior work.

8.4 Congestion-Aware Techniques

The impact of wire congestion and the presence of other near-by inter-tier signals on assignment decisions is incorporated by developing a complex, tier- and inter-tier-signal-dependent weighted-average cost function. Let $d1_{ij}$ and $d2_{ij}$ represent the Manhattan distances from the reference location of signal i in tier-1 to the 3D-via j and from the reference location of signal i in tier-2 to the 3D-via j, respectively. The midpoint algorithm uses equal weights in the cost function C(i, j). Therefore, for the midpoint technique the cost is equal to $(d1_{ij} + d2_{ij})/2$, or simply the sum of both distances $d1_{ij}$ and $d2_{ij}$. The new cost function $C_N(i, j)$ is defined as:

\[ C_N(i, j) = W1(i) \cdot d1_{ij} + W2(i) \cdot d2_{ij} \tag{8.5} \]

where W1(i) and W2(i) are weight functions in tiers 1 and 2, respectively. As $C_N(i, j)$ is a weighted-average cost function, W1(i) and W2(i) satisfy the following constraints:

\[ W2(i) = 1 - W1(i) \tag{8.6} \]
\[ 0 < W1(i) < 1 \tag{8.7} \]

The efficiency of the resultant assignment depends on the quality of the cost function, making it the most critical part of the entire assignment procedure. In the weighted-average cost function $C_N(i, j)$, the weights are modified based on the properties of the tiers and inter-tier signals. Two approaches are explored for these weight functions W1(i) and W2(i): global and local, as explained below.

8.4.1 Global Weight Function

In the global weights method, the entire tier is assigned the same weight, i.e., W1(i) and W2(i) are independent of the inter-tier signal i and only depend on characteristics of their respective tiers, reducing the functions to just W1 and W2. The global weight functions strategy is a simple one, and the weights can be easily computed by observing the total number of free tracks available for routing at the end of the trial-layout exercise used in Section 8.3. In Cadence Encounter this information is obtained by parsing the output of the dumpCongestArea command.
The fewer tracks available for routing in a tier compared to the other tier, the higher the weight that should be given to it. The available tracks for routing in tiers 1 and 2 are given by T1 and T2, respectively, and the weights are then:

\[ W1 = \frac{T2}{T1 + T2} \tag{8.8} \]
\[ W2 = \frac{T1}{T1 + T2} \tag{8.9} \]

The above equations satisfy the expressions in (8.6) and (8.7). The global weight approach clearly has a disadvantage. Suppose the congestion within a tier is not uniform and areas of concentrated congestion exist. In such cases, it is possible that a tier will have low overall congestion, but the areas of concentrated congestion could have significant impacts on critical paths. To avoid such unfair weights, a location-dependent cost function is required, as explained in the following section.

8.4.2 Local Weight Function

To generate local weight functions W1(i) and W2(i) that depend on both local tier characteristics and inter-tier signals, each tier is quantized into smaller rectangles as shown in Figure 8.2. In both tiers, each of these P × Q quanta has two parameters:
1. Remaining tracks available for routing inter-tier signals after trial layout
2. Number of inter-tier signals with a reference location enclosed in the quantum

Figure 8.2: Tiers quantized into smaller rectangles to capture congestion and contention information within the neighborhood of inter-tier signals

The remaining tracks represent the available resources for routing inter-tier signals. When there is high congestion, fewer resources are available. For assigning a 3D-via, a measure of the available resources alone is not sufficient, as it would not capture the contention for the available resources. For example, say for an inter-tier signal i about 100 tracks are available for connecting to a 3D-via j in tier-1, but only 50 tracks are available for routing from j to i's reference in tier-2. It might appear that tier-2 must be given higher weight due to the presence of fewer resources.
However, it could be that there is more than one inter-tier signal in the neighborhood of i in tier-1, potentially competing for those 100 tracks, and none in the neighborhood of i in tier-2. Due to contention, fewer resources are available per candidate in tier-1, demanding a higher weight for wire cost in tier-1. Hence, the number of other inter-tier signals within the neighborhood of an inter-tier signal is also required to capture both congestion and contention.

These two parameters for each quantum are obtained from the layout of a tier after the preliminary place-and-route stage. This is the same layout that is used to find reference locations for inter-tier signals in Section 8.3. Typically, the router tools divide the layout into smaller grids called G-cells. Each G-cell has vertical (V) and horizontal (H) capacity, which is the number of vertical and horizontal wiring tracks available for routing. In Cadence First Encounter, the remaining available tracks of all G-cells, at any stage of the design flow, can be obtained using the dumpCongestArea command. The dimensions of these G-cells are extremely small compared to those of a tier. The G-cells are about 1.7 µm², which is very small compared to the total area of the FPU and multiplier designs presented in Section 8.5 (about 13000 µm² and 22000 µm², respectively). This granularity yields a very high resolution quantization for generating congestion-dependent local weights. If a lower quantization resolution, i.e., a larger quantum size, is desired, multiple G-cells can be grouped to form a single quantum.

For each quantum, the number of inter-tier signals enclosed in its area can be obtained using the coordinates of the reference locations of inter-tier signals and those of the quantum. Using the two parameters for each quantum, weights are derived for each i using the following procedure.

In Figure 8.2 the circles represent reference locations of inter-tier signals in tier-1, and diamonds represent those in tier-2.
The blue circle and the red diamond are references for inter-tier signal i. First, the quantum that contains i's reference location is identified in each tier: $(x_1, y_1)$ and $(x_2, y_2)$ in tiers 1 and 2, respectively (gray shaded quanta in Figure 8.2). Then in both tiers, the available tracks ($N_T(i)$) and number of inter-tier signals ($N_{ITS}(i)$) present in the smallest rectangle enclosing both reference quanta are computed as follows:

\[ N1_T(i) = \sum_{p = x_1}^{x_2} \sum_{q = y_1}^{y_2} P1_T(p, q) \tag{8.10} \]
\[ N1_{ITS}(i) = \sum_{p = x_1}^{x_2} \sum_{q = y_1}^{y_2} P1_{ITS}(p, q) \tag{8.11} \]

Functions $P1_T(p, q)$ and $P1_{ITS}(p, q)$ give the two parameters, the number of available tracks and inter-tier signals enclosed, respectively, for the (p, q) quantum in tier-1. Similarly, $N2_T(i)$ and $N2_{ITS}(i)$ for tier-2 are obtained. The quantities $N1_T(i)/N1_{ITS}(i)$ and $N2_T(i)/N2_{ITS}(i)$ are the tracks available per inter-tier signal in the neighborhood of i in tiers 1 and 2, respectively. The higher the value, the lower the weight. Therefore, the local weight functions W1(i) and W2(i) are given by

\[ W1(i) = \frac{\dfrac{N2_T(i)}{N2_{ITS}(i)}}{\dfrac{N1_T(i)}{N1_{ITS}(i)} + \dfrac{N2_T(i)}{N2_{ITS}(i)}} \tag{8.12} \]
\[ W2(i) = 1 - W1(i) \tag{8.13} \]

8.4.3 Solution

Using the cost function $C_N(i, j)$ defined in Equation 8.5, an M × N cost matrix is developed for finding assignments using the optimization problem defined in Equation 8.1. The assignment problem can be solved in polynomial time using Munkres' algorithm, resulting in minimum cost, i.e., a provably optimal solution [90,91]. The inter-tier signals to 3D-vias assignment problem is unbalanced, as typically N < M. In such a scenario dummy elements are introduced in set A to balance both sets. These dummy elements in A have zero cost of assignment to any element in B. Thus the problem is balanced, and Munkres' method is then used to find the optimum assignment.

In a balanced assignment problem, the condition in Equation 8.3 changes to $\sum_{i \in A} x_{ij} = 1 \ \forall\, j \in B$.

Figure 8.3: Multiplier 3DIC block diagram
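The weight computations of Sections 8.4.1 and 8.4.2 can be sketched in a few lines. This is an illustrative sketch only: the grids are assumed to be nested lists indexed as `grid[p][q]`, the rectangle bounds are ordered with `min`/`max` so that either reference quantum may come first, and all function names and numeric values are hypothetical. It also assumes the track counts in the rectangle are not all zero.

```python
def rect_sum(grid, q1, q2):
    """Sum grid[p][q] over the smallest rectangle enclosing quanta q1 and q2."""
    (x1, y1), (x2, y2) = q1, q2
    return sum(
        grid[p][q]
        for p in range(min(x1, x2), max(x1, x2) + 1)
        for q in range(min(y1, y2), max(y1, y2) + 1)
    )


def global_weights(t1, t2):
    """Eqs. 8.8-8.9: the tier with fewer free tracks gets the larger weight."""
    w1 = t2 / (t1 + t2)
    return w1, 1.0 - w1


def local_weights(tracks1, its1, tracks2, its2, q1, q2):
    """Eqs. 8.10-8.13 for one inter-tier signal.

    q1 and q2 are the quanta holding the signal's reference locations in
    tier-1 and tier-2. Each signal count is at least 1 because the signal's
    own reference location lies inside the rectangle in each tier.
    """
    per_signal_1 = rect_sum(tracks1, q1, q2) / rect_sum(its1, q1, q2)
    per_signal_2 = rect_sum(tracks2, q1, q2) / rect_sum(its2, q1, q2)
    w1 = per_signal_2 / (per_signal_1 + per_signal_2)
    return w1, 1.0 - w1


def weighted_cost(w1, w2, d1, d2):
    """Eq. 8.5: C_N(i, j) = W1(i) * d1_ij + W2(i) * d2_ij."""
    return w1 * d1 + w2 * d2


# Hypothetical 2 x 2 quantization of both tiers.
tracks1 = [[10, 10], [10, 10]]   # free tracks per quantum, tier-1
its1 = [[1, 1], [1, 1]]          # inter-tier signal references, tier-1
tracks2 = [[15, 15], [15, 15]]   # free tracks per quantum, tier-2
its2 = [[0, 1], [0, 1]]          # inter-tier signal references, tier-2

w1, w2 = local_weights(tracks1, its1, tracks2, its2, (0, 0), (1, 1))
cost = weighted_cost(w1, w2, d1=4, d2=8)
print(w1, w2, cost)
```

In this example tier-1 offers 10 tracks per signal and tier-2 offers 30, so wire in tier-1 is weighted three times as heavily as wire in tier-2. Filling a matrix with `weighted_cost` for every signal and 3D-via pair gives the input to the assignment solver.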
8.5 Demonstration Designs

Two designs, a 64-bit multiplier and a single-precision FPU, that were built for 3DIC implementation were used to evaluate the techniques. Although benchmark circuits such as ISCAS or MCNC are generally used to test the effectiveness of new techniques, these circuits are not designed for 3DIC and are also small, so the multiplier and FPU designs are more suitable. It should also be noted that these designs are still small in size when compared to designs like processors, which have higher potential to gain from a 3DIC design approach.

The 64-bit multiplier is a 1 GHz two-stage pipelined design, built using four 32-bit multipliers (see Section 3.3 for more details). Based on input data, the pre-processing logic enables only the required number of 32-bit multipliers to execute a given multiplication operation, to save dynamic energy. Figure 8.3 shows the block diagram of the multiplier 3DIC design. The design partitioning is consistent with the hybrid partitioning principle presented in Section 4.3.

The single-precision FPU has a five-stage pipelined architecture with a two-stage pipelined multiplier (see Section 3.2 for more details). It uses a non-linear pipeline for executing division operations, with data feedback in the pipeline. Figure 8.4 shows the block diagram of the FPU 3DIC design. It is partitioned using the S2C technique discussed in Section 4.3.

Figure 8.4: FPU 3DIC block diagram

8.6 Results

3DIC implementations of the above two designs were targeted to the NanGate 45 nm open cell library [89]. Six 3DIC designs were placed and routed, three each for the multiplier and the FPU, using the proposed congestion-aware local weights function technique, the midpoint algorithm, and an architecture-driven manual assignment. The midpoint and manual assignment variants were implemented for comparison with the proposed work. The three 3DIC variants of the multiplier and FPU have a clock period of 4 ns and 4.5 ns, respectively.
The manual method was iterated several times to achieve a good assignment. In each iteration the assignment was carefully modified, based on the 3DIC layout results, to improve the wire length statistics. All these 3DICs are face-to-face bonded, and the design rules required for 3D-vias (related to the size and pitch of top-metal layer bondpoints) were modeled from Tezzaron's 130 nm 3DIC technology. All the results presented below are extracted from the post place-and-route layout of the demonstration designs.

Figure 8.5: Histograms of local weights of layer-1, W1(i), of FPU

Figure 8.6: Histograms of local weights of layer-1, W1(i), of multiplier

Figures 8.5 and 8.6 show the histograms of the local weights generated for each inter-tier signal pair for the FPU and multiplier designs, respectively. The FPU and multiplier designs have 351 and 352 inter-tier signals between the two tiers, respectively. The local weights in the histograms are with respect to tier-1, i.e., W1(i) in Equation 8.12. The FPU design has a near-uniform frequency across all the bins in the histogram. A higher local weight W1(i), i.e., close to a value of 1, represents higher relative congestion in tier-1 compared to that in tier-2 within the rectangle enclosing the inter-tier signal i, and a weight closer to 0 indicates otherwise. The FPU design has significantly more inter-tier signal pairs in contrasting zones in terms of relative congestion, as many of the signals have weights closer to 0 or 1 instead of 0.5. A local weight closer to 0.5 is obtained when both the tiers have similar resources per inter-tier signal in the neighborhood of the candidate inter-tier signal. Clearly the FPU has higher congestion/contention for wire tracks compared to that of the multiplier, as most values in the multiplier are close to 0.5. As a result the FPU is expected to benefit more from the proposed congestion- and neighborhood-aware assignment techniques.
In short, designs which have higher numbers of local weights in the extreme zones, i.e., close to 0 or 1, are more likely to benefit significantly from the proposed congestion- and neighborhood-aware method than those with many local weights closer to 0.5 (both tiers with the same weight).

Table 8.1: Congestion-aware technique vs. others for FPU

Technique          Total Wire Length    Avg. Wire Length     Total Cell Area
                   mm       % less      µm       % less      µm²      % more
Manual             97.85                9.28                 18644
Midpoint           93.73    4.2         8.55     7.8         18921    1.5
Congestion-aware   90.98    7.0         8.59     7.4         18687    0.2

Table 8.2: Congestion-aware technique vs. others for multiplier

Technique          Total Wire Length    Avg. Wire Length     Total Cell Area
                   mm       % less      µm       % less      µm²      % more
Manual             201.37               11.06                28621
Midpoint           185.74   7.8         10.15    8.2         28739    0.4
Congestion-aware   190.75   5.3         10.41    5.9         28694    0.2

Tables 8.1 and 8.2 show the results of the FPU and multiplier 3DIC designs compared against the architecture-driven manual assignment. Both the midpoint algorithm and the congestion- and neighborhood-aware technique not only automated the assignment process but also performed significantly better than the manual assignment method. As expected, the FPU better demonstrates the benefit of the proposed technique, yielding better overall wire length characteristics for that case. In both cases designs implemented with the proposed technique have lower area compared to that of the midpoint algorithm. However, all three implementations have similar area, with negligible difference. As relative congestion between interconnecting tiers increases, the proposed technique is expected to outperform the midpoint algorithm. Thus these results show that the proposed local weights technique is highly recommended for designs that suffer high relative congestion between the interconnecting tiers.

8.7 Conclusions

Congestion-aware optimal techniques for assigning inter-tier signals to 3D-vias in a 3DIC were developed.
These techniques use a cost function based on congestion and wiring requirements of other near-by inter-tier signals in the 3D-via assignment step. The proposed techniques are optimized to achieve minimum cost (reduced congestion) while also achieving minimum wire length. Two 3DIC design layouts were implemented using the proposed techniques along with a midpoint algorithm and an architecture-driven manual assignment. Results show the proposed methods achieved lower total and average wire length (up to 7%) when compared to an architecture-driven manual method. 3DIC designs that have higher relative congestion between interconnecting tiers can benefit significantly from the congestion-aware techniques.

Chapter 9
Conclusions

A three-dimensional integrated circuit (3DIC) is formed by vertically stacking multiple active devices in a chip using high-bandwidth vertical interconnect. Diminishing returns from CMOS transistor scaling, increasing interconnect delay, and the need for high device density and high energy efficiency are pushing the semiconductor industry in the direction of 3DIC technology. Perhaps 3DIC technology is the best current solution to the physical limitations concerning device scaling and interconnects. It is an attractive option as it can achieve higher device density with lower average interconnect length without the requirement for further scaling the transistor size. However, 3DIC faces some obstacles including complex design space exploration, lack of automation tools support, and higher operating temperatures. The benefits offered by the 3DIC technology and the challenges it faces ahead inspired the research presented in this dissertation. The subsequent section provides a summary of the research and contributions of this dissertation, followed by insights into the future of 3DIC technology.

9.1 Summary

The principal theme of this dissertation is to enable mainstream adoption of 3DIC, especially logic-on-logic partitioning in 3DIC.
First, to better understand the 3DIC technology and its limitations, experimental logic-on-logic stacked 3DICs were built following a design-for-3D approach. Two challenging areas requiring improvement were identified: logic partitioning and implementation techniques, especially concerning assignment of inter-tier signals to vertical vias.

Logic partitioning is potentially the most crucial step in building a 3DIC chip. Developing a formal framework to provide guidelines to both designers and tools for efficient logic partitioning is a crucial step to enable mainstream 3DIC. As part of building such a framework, logic-on-logic partitioning principles, S2C, C2S, and hybrid (partition at input or output of a sequential cell, a mix of S2C and C2S) are introduced. These techniques are independent of the size of the design and the targeted number of layers for the 3DIC, and are thus scalable. Also, these techniques, when integrated into partitioning tools, can speed up tool performance and are more likely to achieve a near-optimal solution, as the solution space is significantly reduced. Using these partitioning techniques, an energy-efficient two-tier 3DIC multiplier and a multi-mode 3DIC floating point multiplier were designed.

A tier-by-tier hierarchical wire length distribution estimation model was also developed. This method can be applied for cases of different Rent's parameters for each tier, and it can handle variable TSV dimensions and different bonding techniques. Moreover, this model can be applied to a 3DIC with any number of tiers. The hierarchical model allows the user the flexibility of having unequal numbers of vertical interconnects between different pairs of adjacent tiers. It also provides an upper bound estimate on the number of TSVs, above which the resulting average wire length degrades. This upper bound is an important parameter for logic partitioning and hence contributes to the framework.
Results show that a slight difference in the Rent exponent (about 0.01) between tiers and different TSV sizes have significant effects on 3DIC average wire length. Additionally, as expected, the F2B bonding method resulted in a higher upper bound (1.9x to 2.2x) than B2B bonding for a two-tier 3DIC.

Additionally, a framework for logic partitioning is introduced, which enables analysis of logic-on-logic stacked 3DIC design in the earliest stages of a design cycle and also supports higher abstraction levels (architecture/functional). The proposed framework uses a data-flow graph model of the design, with nodes representing logic blocks and edges representing communication between the logic blocks. The optimization goal of the framework is to reduce the overall communication length in the design. A simulated-annealing approach was shown to automate the optimization process; however, the procedure presented can also be applied by a designer, depending on the number of nodes and edges present in the graph. Computation reduction techniques for algorithms are also proposed. Multiple optimization algorithms are introduced that fine-tune the 3DIC depending on the design characteristics available in the framework. A processor and a floating-point unit are partitioned using the proposed framework for 3DIC implementation to demonstrate the application of the framework. The FPU 3DIC design yielded a 7% faster design compared to a corresponding 2DIC implementation, when optimized using only the ranks and degrees of the graph representation. The 3DIC processor is designed using the area of the various logic blocks along with ranks and degrees in the data-flow model of the processor. Results show 15.85% reduced total wire length, 16.35% smaller cell area, and about 18.68% reduced total power consumption in the 3DIC processor implementation when compared to that of a 2DIC implementation.

Finally, 3DIC implementation techniques are introduced to assign inter-tier signals to available 3D-vias.
The locations of 3D-vias, i.e., bondpoints or TSVs, influence the layout of a tier and are hence very crucial. Four new approaches to automate assigning inter-tier signals to bondpoints are introduced: three nearest-neighbor approaches (NN1, NN2, and NNC) and a midpoint approach. Then, the four heuristics are enhanced to formulate optimal assignment techniques, RT1, RT2, RTC, and midpoint, which result in minimum wire length for interconnecting tiers, while providing control over paths by prioritizing the critical paths and limiting the longest path. Results show that 3DICs built using the optimal techniques achieved better interconnect statistics than using a manual assignment, with up to 9.4% lower total net length and up to 10.4% shorter average net length. The optimal techniques performed significantly better than the heuristics, with up to 12% lower total and average net length. The presence of TSVs contributes to congestion, due to their large size and typically sparse placement. Therefore, to address the impact of wiring congestion, total wiring requirements, and the number of nearby inter-tier signals in the 3D-via assignment step, congestion- and neighborhood-aware techniques for the assignment of inter-tier signals to 3D-vias are proposed. Results show the proposed methods achieved lower total and average wire length (up to 7%) when compared to an architecture-driven manual method. 3DIC designs that have higher relative congestion between interconnecting tiers can benefit significantly from the congestion-aware assignment techniques.

9.2 Future Prospects

Challenges faced during the course of the research presented in this dissertation and the results achieved provide several avenues for further research in 3DIC. The following areas are identified.

The logic partitioning framework presented in this dissertation can be extended to provide multi-objective optimizations for typically conflicting objectives like speed and power consumption.
Another interesting optimization trade-off involves optimizing for a faster design while trying to reduce the operating temperature. Such multi-objective optimizations might require extensive thermal/activity modeling of individual nodes (or logic blocks) of the design.

One of the challenges in developing 3DIC designs is to quantitatively verify the proposed improvements. This is due to the lack of benchmarks tailored for 3DIC-specific validation. Also, often a design had to be implemented as both a traditional 2DIC and a 3DIC, and then compared to validate the 3DIC design. The availability of advanced tools and a standardized flow for 2DIC development skews the results in favor of 2DIC. Developing benchmark designs or a platform for validating research in the 3DIC area would be a very useful contribution to the 3DIC community. This platform could be a software simulator which makes similar assumptions and uses accurate models for both 2DIC and 3DIC implementations, and generates results based on the design provided and the 3DIC configuration.

Reliability of 3DIC is another domain with exciting research opportunities. The following are the research areas that need development for making reliable 3DIC chips.

Tier-level testability
TSV reliability - whether to use a single large TSV or multiple thinner TSVs for driving a single signal, i.e., finding the optimum number of TSVs for reliable interconnections based on the TSV dimensions, material, etc.
Reliable operation of active devices in the inner tiers of a 3DIC as operating temperature varies

Other areas of 3DIC including packaging, advanced cooling mechanisms, manufacturing (bonding alignment strategies, wafer alignment) process, and yield optimization are also good prospects for research in 3DIC.

The enthusiasm of the ever-expanding 3DIC research community with the same principal theme as this dissertation - to enable mainstream adoption of 3DIC technology - gives confidence that three-dimensional integration will prosper.
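The data-flow-graph partitioning summarized in Section 9.1 can be illustrated with a small sketch. This is a toy model under stated assumptions, not the dissertation's algorithm: each logic block occupies one slot in a one-dimensional row on one tier, the length of an edge is the slot distance plus a fixed `via_cost` per tier crossed, and simulated annealing swaps two blocks at a time. The names `comm_length`, `anneal_partition`, and `via_cost`, and all parameter values, are hypothetical.

```python
import math
import random


def comm_length(edges, pos, via_cost=1):
    """Total weighted communication length of a placement.

    edges: iterable of (u, v, weight); pos: node -> (tier, slot).
    Assumed metric: slot distance plus via_cost per tier crossed.
    """
    total = 0
    for u, v, w in edges:
        (tu, su), (tv, sv) = pos[u], pos[v]
        total += w * (abs(su - sv) + via_cost * abs(tu - tv))
    return total


def anneal_partition(nodes, edges, tiers=2, steps=5000,
                     t0=5.0, alpha=0.999, seed=0):
    """Simulated annealing over block swaps; returns (best placement, cost)."""
    rng = random.Random(seed)
    slots_per_tier = math.ceil(len(nodes) / tiers)
    # Initial placement: fill tier 0 left to right, then tier 1, and so on.
    pos = {n: divmod(i, slots_per_tier) for i, n in enumerate(nodes)}
    cost = comm_length(edges, pos)
    best_pos, best_cost = dict(pos), cost
    temp = t0
    for _ in range(steps):
        a, b = rng.sample(nodes, 2)
        pos[a], pos[b] = pos[b], pos[a]        # propose: swap two blocks
        new_cost = comm_length(edges, pos)
        delta = new_cost - cost
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            cost = new_cost                    # accept the move
            if cost < best_cost:
                best_pos, best_cost = dict(pos), cost
        else:
            pos[a], pos[b] = pos[b], pos[a]    # reject: undo the swap
        temp *= alpha                          # geometric cooling
    return best_pos, best_cost


if __name__ == "__main__":
    blocks = ["A", "D", "B", "E", "C", "F"]
    links = [("A", "D", 3), ("B", "E", 3), ("C", "F", 3), ("A", "B", 1)]
    placement, length = anneal_partition(blocks, links)
    print(length, placement)
```

Because the best placement seen so far is tracked separately, the returned cost can never exceed that of the initial placement. A realistic cost function would also need the rank, degree, and area information that the framework is described as using.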
Abstract
A three-dimensional integrated circuit (3DIC) is formed by vertically stacking multiple active devices with high-bandwidth vertical interconnect. 3DIC technology achieves higher-density integration, better scalability, and improved performance compared to conventional IC technology. While several challenges exist to achieve volume manufacturing, this dissertation addresses two principal challenge areas identified from an experimental logic-on-logic stacked 3DIC design exercise: logic partitioning and implementation techniques, especially concerning inter-tier signals.

For building a 3DIC chip, it is crucial to have a design logic partitioning framework that provides guidelines to both designers and tools. As part of developing such a framework, S2C, C2S, and hybrid (partition at input or output of a sequential cell) design partitioning techniques are introduced to smartly partition a design. These techniques significantly reduce the search space for optimal partitioning. A formal framework for 3DIC logic partitioning requires a determination of an upper bound on the number of 3D-vias, as one cannot assume high availability of 3D-vias, especially in the case of through-silicon vias (TSVs). A model based on Rent's rule is developed to estimate 3DIC average wire length, which serves as a parameter to obtain an upper bound on the number of 3D-vias. A tier-by-tier hierarchical wire length distribution estimation approach is used, as the Rent parameters used in modeling may not be the same for all tiers. The proposed model accommodates variable TSV dimensions and different bonding techniques. Results show that a slight difference in the Rent exponent (about 0.01) between tiers and different TSV sizes have significant effects on 3DIC wire length distribution.
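To give a feel for why a Rent exponent difference of about 0.01 matters, the sketch below evaluates Rent's rule in its standard form, T = k * N^p, for two hypothetical tiers. It only shows how terminal count scales with the exponent and is not the tier-by-tier wire length model developed in the dissertation. The function names, gate count, and coefficient are assumptions chosen for illustration.

```python
def rent_terminals(n_gates: int, k: float, p: float) -> float:
    """Rent's rule: expected external terminal count T = k * N**p."""
    if n_gates <= 0:
        raise ValueError("n_gates must be positive")
    return k * n_gates ** p


def terminal_ratio(n_gates: int, p_a: float, p_b: float) -> float:
    """T_b / T_a for two tiers with equal k and N but different exponents."""
    return n_gates ** (p_b - p_a)


if __name__ == "__main__":
    n = 1_000_000            # hypothetical gate count per tier
    k = 4.0                  # hypothetical Rent coefficient
    for p in (0.60, 0.61):   # hypothetical exponents differing by 0.01
        print(p, round(rent_terminals(n, k, p)))
    print(terminal_ratio(n, 0.60, 0.61))
```

For a million-gate tier the ratio is 10^0.06, so the tier with the larger exponent needs roughly 15% more terminals. The gap grows with gate count.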
Then, a logic partitioning framework is proposed, which can adapt to various partitioning granularities and work with limited amounts of data to enable efficient partitioning even at initial stages of design. This framework uses a data-flow graph model to optimize the 3DIC design for reduced communication between the logic nodes. Multiple optimization algorithms are introduced that fine-tune the 3DIC depending on the design characteristics available to the framework. The framework is demonstrated by partitioning a processor design and a floating-point unit design for 3DIC implementation. The 3DIC processor achieved 15.85% reduced total wire length, 18.68% reduced power consumption, and 16.35% smaller cell area compared to that of a 2DIC implementation.

In a 3DIC, tiers are interconnected using TSVs or top-metal layer bondpoints (micro-bumps), called 3D-vias. Similar to I/O pins in a traditional chip, the locations of 3D-vias carrying inter-tier signals have a significant effect on logic placement, making the assignment crucial in 3DIC implementations. Unlike I/O signals, these inter-tier signals are very large in number, and manual assignment (like assigning I/O signals to pins) is impractical. Hence, optimal assignment techniques suitable for automation are introduced, which result in minimum wire length for connecting inter-tier signals while providing path control by prioritizing the critical paths and limiting the longest path. The proposed techniques successfully automated the assignment process and achieved up to 9.4% lower total wire length and 10.4% shorter average wire length compared to an architecture-driven manual assignment. Furthermore, congestion-aware techniques are introduced to address congestion and wiring requirements in the 3D-via assignment step. These congestion-aware methods provide globally and locally optimized assignment techniques.
Results obtained from layouts show that the designs with higher relative congestion between interconnecting tiers benefit significantly from the proposed congestion-aware techniques.
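The gap between a nearest-neighbor heuristic and a minimum-wire-length assignment can be seen on a tiny example. The sketch below is illustrative only and does not reproduce the NN1/NN2/NNC or RT1/RT2/RTC techniques. It assumes Manhattan wire length, models each inter-tier signal as a (driver, sink) pin pair, and finds the exact optimum by exhaustive search, which is feasible only for a handful of signals. A polynomial-time assignment solver such as the Hungarian method would be needed at realistic sizes. Path control and congestion are left out, and all function names are hypothetical.

```python
from itertools import permutations

Point = tuple[int, int]


def manhattan(a: Point, b: Point) -> int:
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def route_cost(signal, via: Point) -> int:
    """Wire length of one inter-tier signal routed through `via`."""
    driver, sink = signal
    return manhattan(driver, via) + manhattan(via, sink)


def total_length(signals, vias, choice) -> int:
    """Sum of wire lengths when signal i uses via choice[i]."""
    return sum(route_cost(s, vias[v]) for s, v in zip(signals, choice))


def greedy_nearest(signals, vias):
    """Heuristic: each signal, in input order, takes its cheapest free via."""
    free = set(range(len(vias)))
    choice = []
    for sig in signals:
        best = min(free, key=lambda v: (route_cost(sig, vias[v]), v))
        free.remove(best)
        choice.append(best)
    return choice


def optimal_bruteforce(signals, vias):
    """Exact minimum-total-length assignment by exhaustive search."""
    candidates = permutations(range(len(vias)), len(signals))
    return list(min(candidates,
                    key=lambda perm: total_length(signals, vias, perm)))


if __name__ == "__main__":
    # Hypothetical pin and via coordinates on a shared grid.
    vias = [(2, 0), (10, 0)]
    signals = [((3, 0), (4, 0)), ((1, 0), (0, 0))]
    for name, solver in (("greedy", greedy_nearest),
                         ("optimal", optimal_bruteforce)):
        choice = solver(signals, vias)
        print(name, choice, total_length(signals, vias, choice))
```

In this instance the greedy pass gives the first signal the nearby via and forces the second onto the distant one, for a total length of 22. The exhaustive search swaps them for a total of 16.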
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Thermal analysis and multiobjective optimization for three dimensional integrated circuits
Power efficient design of SRAM arrays and optimal design of signal and power distribution networks in VLSI circuits
Improving efficiency to advance resilient computing
Compiler and runtime support for hybrid arithmetic and logic processing of neural networks
Defect-tolerance framework for general purpose processors
High level design for yield via redundancy in low yield environments
Synchronization and timing techniques based on statistical random sampling
Electronic design automation algorithms for physical design and optimization of single flux quantum logic circuits
Variation-aware circuit and chip level power optimization in digital VLSI systems
Dynamic packet fragmentation for increased virtual channel utilization and fault tolerance in on-chip routers
Designing efficient algorithms and developing suitable software tools to support logic synthesis of superconducting single flux quantum circuits
Architecture design and algorithmic optimizations for accelerating graph analytics on FPGA
A framework for soft error tolerant SRAM design
Optimal redundancy design for CMOS and post‐CMOS technologies
Advanced cell design and reconfigurable circuits for single flux quantum technology
Silicon photonics integrated circuits for analog and digital optical signal processing
Power optimization of asynchronous pipelines using conditioning and reconditioning based on a three-valued logic model
Trustworthiness of integrated circuits: a new testing framework for hardware Trojans
Automatic conversion from flip-flop to 3-phase latch-based designs
Stochastic dynamic power and thermal management techniques for multicore systems
Asset Metadata
Creator
Neela, Gopi (author)
Core Title
A logic partitioning framework and implementation optimizations for 3-dimensional integrated circuits
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Electrical Engineering
Publication Date
01/24/2015
Defense Date
11/24/2014
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
3D stacking, 3DIC, assignment techniques, bondpoints, integrated circuits, interconnects, logic partitioning, OAI-PMH Harvest, optimization, TSV
Format
application/pdf (imt)
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Draper, Jeffrey (committee chair), Gupta, Sandeep K. (committee member), Nakano, Aiichiro (committee member)
Creator Email
gneela@usc.edu,neelagopi@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c3-524437
Unique identifier
UC11297591
Identifier
etd-NeelaGopi-3127.pdf (filename),usctheses-c3-524437 (legacy record id)
Legacy Identifier
etd-NeelaGopi-3127.pdf
Dmrecord
524437
Document Type
Dissertation
Rights
Neela, Gopi
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA