Explicit data graph execution, or EDGE, is a type of instruction set architecture (ISA) which intends to improve computing performance compared to common processors like the Intel x86 line. EDGE combines many individual instructions into a larger group known as a "hyperblock". Hyperblocks are designed to be able to easily run in parallel.

While the parallelism of modern CPU designs generally plateaus at about eight internal units and one to four "cores", EDGE designs aim to support hundreds of internal units and offer processing speeds hundreds of times greater than existing designs. Major development of the EDGE concept was led by the University of Texas at Austin under DARPA's Polymorphous Computing Architectures program, with the stated goal of producing a single-chip CPU design with 1 TFLOPS performance by 2012, a goal that had yet to be realized as of 2018.[1]

Traditional designs

Almost all computer programs consist of a series of instructions that convert data from one form to another. Most instructions require several internal steps to complete an operation. Over time, the relative performance and cost of the different steps have changed dramatically, resulting in several major shifts in ISA design.

CISC to RISC

In the 1960s memory was relatively expensive, and CPU designers produced instruction sets that densely encoded instructions and data in order to make better use of this resource. For instance, the "add A to B to produce C" instruction would be provided in many different forms that gathered A and B from different places: main memory, indexes, or registers. Providing these different instructions allowed the programmer to select whichever instruction took up the least possible room in memory, reducing the program's footprint and leaving more room for data. For instance, the MOS 6502 has eight instructions (opcodes) for performing addition, differing only in where they collect their operands.[2]

Actually making these instructions work required circuitry in the CPU, which was a significant limitation in early designs and forced designers to select just those instructions that were really needed. In 1964, IBM introduced its System/360 series, which used microcode to allow a single expansive instruction set architecture (ISA) to run across a wide variety of machines by implementing more or fewer instructions in hardware depending on the need.[3] This allowed the 360's ISA to be expansive, and it became the paragon of computer design in the 1960s and 70s, the so-called orthogonal design. This style of memory access, with its wide variety of modes, led to instruction sets with hundreds of different instructions, a style known today as CISC (Complex Instruction Set Computing).

In 1975 IBM started a project to develop a telephone switch that required performance about three times that of their fastest contemporary computers. To reach this goal, the development team began to study the massive amount of performance data IBM had collected over the last decade. This study demonstrated that the complex ISA was in fact a significant problem; because only the most basic instructions were guaranteed to be implemented in hardware, compilers ignored the more complex ones that only ran in hardware on certain machines. As a result, the vast majority of a program's time was being spent in only five instructions. Further, even when the program called one of those five instructions, the microcode required a finite time to decode it, even if it was just to call the internal hardware. On faster machines, this overhead was considerable.[4]

Their work, known at the time as the IBM 801, eventually led to the RISC (Reduced Instruction Set Computing) concept. Microcode was removed, and only the most basic versions of any given instruction were put into the CPU. Any more complex code was left to the compiler. The removal of so much circuitry, about one-third of the transistors in the Motorola 68000 for instance, allowed the CPU to include more registers, which had a direct impact on performance. By the mid-1980s, further developed versions of these basic concepts were delivering performance as much as 10 times that of the fastest CISC designs, in spite of using less-developed fabrication.[4]

Internal parallelism

In the 1990s the chip design and fabrication process grew to the point where it was possible to build a commodity processor with every potential feature built into it. Units that were previously on separate chips, like floating point units and memory management units, could now be combined onto the same die, producing all-in-one designs. This allowed different types of instructions to be executed at the same time, improving overall system performance. In the later 1990s, single instruction, multiple data (SIMD) units were also added, and more recently, AI accelerators.

While these additions improve overall system performance, they do not improve the performance of programs that primarily operate on basic logic and integer math, which is the majority of programs (one of the outcomes of Amdahl's law). To improve performance on these tasks, CPU designs started adding internal parallelism, becoming "superscalar". In any program there are instructions that work on unrelated data, so by adding more functional units these instructions can be run at the same time. A new portion of the CPU, the scheduler, looks for these independent instructions and feeds them into the units, taking their outputs and re-ordering them so that externally they appear to have run in succession.
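
The decision the scheduler makes is essentially a dependency check between register operands. The following is a minimal sketch in Python of that idea, not the logic of any real CPU: two instructions can issue together only if neither reads a register the other writes and they do not write the same register (the instruction format here is invented for illustration).

```python
from dataclasses import dataclass

@dataclass
class Instr:
    op: str
    dest: str            # register written
    srcs: tuple          # registers read

def independent(a: Instr, b: Instr) -> bool:
    """True if a and b could be dispatched in the same cycle."""
    raw = a.dest in b.srcs or b.dest in a.srcs   # read-after-write hazard
    waw = a.dest == b.dest                       # write-after-write hazard
    return not (raw or waw)

# r3 = r1 + r2 and r6 = r4 * r5 touch disjoint registers, so a superscalar
# core with two free ALUs could run them simultaneously.
print(independent(Instr("add", "r3", ("r1", "r2")),
                  Instr("mul", "r6", ("r4", "r5"))))   # True
```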

The amount of parallelism that can be extracted in superscalar designs is limited by the number of instructions that the scheduler can examine for interdependencies. Examining a greater number of instructions improves the chance of finding an instruction that can be run in parallel, but only at the cost of increasing the complexity of the scheduler itself. Despite massive efforts, CPU designs using classic RISC or CISC ISAs plateaued by the late 2000s. Intel's Haswell designs of 2013 have a total of eight dispatch units,[5] and adding more significantly complicates the design and increases power demands.[6]

Additional performance can be wrung from systems by examining the instructions to find ones that operate on different types of data and adding units dedicated to that sort of data; this led to the introduction of on-board floating point units in the 1980s and 90s and, more recently, single instruction, multiple data (SIMD) units. The drawback to this approach is that it makes the CPU less generic; feeding the CPU with a program that uses almost all floating point instructions, for instance, will bog down the FPUs while the other units sit idle.

A more recent problem in modern CPU designs is the delay in talking to the registers. In general terms the size of the CPU die has remained largely the same over time, while the size of the units within the CPU has grown much smaller as more and more units were added. That means that the relative distance between any one functional unit and the global register file has grown over time. Once introduced in order to avoid delays in talking to main memory, the global register file has itself become a delay that is worth avoiding.

A new ISA?

Just as the growing delays in talking to memory, even as its price fell, suggested a radical change from CISC to RISC, designers are considering whether the problems of scaling parallelism and the increasing delays in talking to registers demand another switch in basic ISA.

Among the ways to introduce a new ISA are the very long instruction word (VLIW) architectures, typified by the Itanium. VLIW moves the scheduler logic out of the CPU and into the compiler, where it has much more memory and longer timelines to examine the instruction stream. This static placement, static issue execution model works well when all delays are known, but in the presence of cache latencies, filling instruction words has proven to be a difficult challenge for the compiler.[7] An instruction that might take five cycles if the data is in the cache could take hundreds if it is not, but the compiler has no way to know whether that data will be in the cache at runtime – that's determined by overall system load and other factors that have nothing to do with the program being compiled.

The key performance bottleneck in traditional designs is that the data and the instructions that operate on them are theoretically scattered about memory. Memory performance dominates overall performance, and classic dynamic placement, dynamic issue designs seem to have reached the limit of their performance capabilities. VLIW uses a static placement, static issue model, but has proven difficult to master because the runtime behavior of programs is difficult to predict and properly schedule in advance.

EDGE

Theory

EDGE architectures are a new class of ISAs based on a static placement, dynamic issue design. EDGE systems compile source code into a form consisting of statically allocated hyperblocks containing many individual instructions, hundreds or thousands. These hyperblocks are then scheduled dynamically by the CPU. EDGE thus combines the advantages of the VLIW concept of looking for independent data at compile time with the superscalar RISC concept of executing the instructions when the data for them becomes available.

In the vast majority of real-world programs, the linkage of data and instructions is both obvious and explicit. Programs are divided into small blocks referred to as subroutines, procedures or methods (depending on the era and the programming language being used) which generally have well-defined entry and exit points where data is passed in or out. This information is lost as the high-level language is converted into the processor's much simpler ISA. But this information is so useful that modern compilers have generalized the concept as the "basic block", attempting to identify these blocks within programs while they optimize memory access through the registers. A block of instructions does not have control statements but can have predicated instructions. The dataflow graph is encoded using these blocks, by specifying the flow of data from one block of instructions to another, or to some storage area.

The basic idea of EDGE is to directly support and operate on these blocks at the ISA level. Since basic blocks access memory in well-defined ways, the processor can load up related blocks and schedule them so that the output of one block feeds directly into the one that will consume its data. This eliminates the need for a global register file, and simplifies the compiler's task in scheduling access to the registers by the program as a whole – instead, each basic block is given its own local registers and the compiler optimizes access within the block, a much simpler task.
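
As a rough illustration of "static placement, dynamic issue", the sketch below (in Python, with data structures invented for this example rather than taken from any EDGE implementation) represents each compiled block by the values it consumes and produces; the hardware's only runtime job is to fire a block once all of its inputs exist, passing its outputs directly to the blocks that need them.

```python
from dataclasses import dataclass, field

@dataclass
class Block:
    name: str
    inputs: set                    # values the block needs before it can fire
    outputs: set                   # values it produces for downstream blocks
    instructions: list = field(default_factory=list)   # left opaque here

def ready(block: Block, available: set) -> bool:
    """Dynamic issue: a block may be dispatched once every input is present."""
    return block.inputs <= available

# The compiler did the static placement; at runtime we only track values.
blocks = [
    Block("load_xy", inputs=set(),       outputs={"x", "y"}),
    Block("compute", inputs={"x", "y"},  outputs={"z"}),
    Block("store_z", inputs={"z"},       outputs=set()),
]

available = set()
for b in blocks:                   # trivially in order for this tiny example
    if ready(b, available):
        available |= b.outputs     # "execute" the block, publishing its outputs
        print("fired", b.name)
```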

EDGE systems bear a strong resemblance to the dataflow languages of the 1960s and 1970s, which saw renewed interest in the 1990s. Dataflow computers execute programs according to the "dataflow firing rule", which stipulates that an instruction may execute at any time after its operands are available. Due to the isolation of data, similar to EDGE, dataflow languages are inherently parallel, and interest in them followed the more general interest in massive parallelism as a solution to general computing problems. Studies based on existing CPU technology at the time demonstrated that it would be difficult for a dataflow machine to keep enough data near the CPU to be widely parallel, and it is precisely this bottleneck that modern fabrication techniques can solve by placing hundreds of CPUs and their memory on a single die.

Another reason that dataflow systems never became popular is that compilers of the era found it difficult to work with common imperative languages like C++. Instead, most dataflow systems used dedicated languages like Prograph, which limited their commercial interest. A decade of compiler research has eliminated many of these problems, and a key difference between dataflow and EDGE approaches is that EDGE designs intend to work with commonly used languages.

CPUs

An EDGE-based CPU would consist of one or more small block engines with their own local registers; realistic designs might have hundreds of these units. The units are interconnected to each other using dedicated inter-block communication links. Due to the information encoded into the block by the compiler, the scheduler can examine an entire block to see if its inputs are available and send it into an engine for execution – there is no need to examine the individual instructions within.

With a small increase in complexity, the scheduler can examine multiple blocks to see if the outputs of one are fed in as the inputs of another, and place these blocks on units that reduce their inter-unit communications delays. If a modern CPU examines a thousand instructions for potential parallelism, the same complexity in EDGE allows it to examine a thousand hyperblocks, each one consisting of hundreds of instructions. This gives the scheduler considerably better scope at no additional cost. It is this pattern of operation that gives the concept its name; the "graph" is the string of blocks connected by the data flowing between them.
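
One way to picture the placement step is as a search for a nearby free engine for each dependent block. The short Python sketch below assumes a hypothetical 4 by 4 grid of block engines and a Manhattan-distance cost model, neither of which is taken from a real EDGE design; it simply shows the scheduler preferring an engine adjacent to the producer so the inter-unit link stays short.

```python
from itertools import product

def comm_cost(u, v):
    """Manhattan distance between two engines on a hypothetical 4x4 grid."""
    return abs(u[0] - v[0]) + abs(u[1] - v[1])

def place_consumer(producer_unit, free_units):
    """Pick the free engine closest to the producer for a dependent block."""
    return min(free_units, key=lambda u: comm_cost(producer_unit, u))

# The producer block already runs on engine (0, 0); place its consumer.
free = [u for u in product(range(4), range(4)) if u != (0, 0)]
print(place_consumer((0, 0), free))   # (0, 1): an adjacent engine
```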

Another advantage of the EDGE concept is that it is massively scalable. A low-end design could consist of a single block engine with a stub scheduler that simply sends in blocks as they are called by the program. An EDGE processor intended for desktop use would instead include hundreds of block engines. Critically, all that changes between these designs is the physical layout of the chip and private information that is known only by the scheduler; a program written for the single-unit machine would run without any changes on the desktop version, albeit thousands of times faster. Power scaling is likewise dramatically improved and simplified; block engines can be turned on or off as required with a linear effect on power consumption.

Perhaps the greatest advantage of the EDGE concept is that it is suitable for running any sort of data load. Unlike modern CPU designs, where different portions of the CPU are dedicated to different sorts of data, an EDGE CPU would normally consist of a single type of ALU-like unit. A desktop user running several different programs at the same time would get just as much parallelism as a scientific user feeding in a single program using floating point only; in both cases the scheduler would simply load every block it could into the units. At a low level the performance of the individual block engines would not match that of a dedicated FPU, for instance, but it would attempt to overwhelm any such advantage through massive parallelism.

Implementations

TRIPS

The University of Texas at Austin developed an EDGE ISA known as TRIPS. In order to simplify the microarchitecture of a CPU designed to run it, the TRIPS ISA imposes several well-defined constraints on each TRIPS hyperblock. Each hyperblock must (see the sketch after this list):

  • have at most 128 instructions,
  • issue at most 32 loads and/or stores,
  • issue at most 32 register bank reads and/or writes,
  • have one branch decision, used to indicate the end of a block.
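
These limits are the kind of property a compiler can verify mechanically when it forms hyperblocks. The following is a hedged sketch of such a check in Python; the field names are invented for illustration and do not reflect the real TRIPS toolchain's data structures.

```python
from dataclasses import dataclass

@dataclass
class Hyperblock:
    instruction_count: int
    loads_and_stores: int
    register_accesses: int   # register bank reads plus writes
    branch_decisions: int

def fits_trips_limits(b: Hyperblock) -> bool:
    """Check a block against the four TRIPS hyperblock constraints above."""
    return (b.instruction_count <= 128
            and b.loads_and_stores <= 32
            and b.register_accesses <= 32
            and b.branch_decisions == 1)

print(fits_trips_limits(Hyperblock(120, 10, 8, 1)))   # True
print(fits_trips_limits(Hyperblock(200, 10, 8, 1)))   # False: too many instructions
```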

The TRIPS compiler statically bundles instructions into hyperblocks, but also statically compiles these blocks to run on particular ALUs. This means that TRIPS programs have some dependency on the precise implementation they are compiled for.

In 2003 the team produced a sample TRIPS prototype with sixteen block engines in a 4 by 4 grid, along with a megabyte of local cache and transfer memory. A single-chip version of TRIPS, fabbed by IBM in Canada using a 130 nm process, contains two such "grid engines" along with shared level-2 cache and various support systems. Four such chips and a gigabyte of RAM are placed together on a daughter card for experimentation.

The TRIPS team had set an ultimate goal of producing a single-chip implementation capable of running at a sustained performance of 1 TFLOPS, about 50 times the performance of high-end commodity CPUs available in 2008 (the dual-core Xeon 5160 provides about 17 GFLOPS).

CASH

CMU's CASH is a compiler that produces an intermediate code called "Pegasus".[8] CASH and TRIPS are very similar in concept, but CASH is not targeted to produce output for a specific architecture, and therefore has no hard limits on the block layout.

WaveScalar

The University of Washington's WaveScalar architecture is substantially similar to EDGE, but does not statically place instructions within its "waves". Instead, special instructions (phi and rho) mark the boundaries of the waves and allow scheduling.[9]

References

Citations

  1. ^ University of Texas at Austin, "TRIPS: One Trillion Calculations per Second by 2012"
  2. ^ Pickens, John (17 October 2020). "NMOS 6502 Opcodes".
  3. ^ Shirriff, Ken. "Simulating the IBM 360/50 mainframe from its microcode".
  4. ^ a b Cocke, John; Markstein, Victoria (January 1990). "The evolution of RISC technology at IBM" (PDF). IBM Journal of Research and Development. 34 (1): 4–11. doi:10.1147/rd.341.0004.
  5. ^ Shimpi, Anand Lal (5 October 2012). "Intel's Haswell Architecture Analyzed: Building a New PC and a New Intel". AnandTech.
  6. ^ Tseng, Francis; Patt, Yale (June 2008). "Achieving Out-of-Order Performance with Almost In-Order Complexity". ACM SIGARCH Computer Architecture News. 36 (3): 3–12. doi:10.1145/1394608.1382169.
  7. ^ W. Havanki, S. Banerjia, and T. Conte. "Treegion scheduling for wide-issue processors", in Proceedings of the Fourth International Symposium on High-Performance Computer Architectures, January 1998, pp. 266–276
  8. ^ "Phoenix Project"
  9. ^ "The WaveScalar ISA"
