Selim ZOGHLAMI*, Raphael DAVID*, Stéphane GUYETANT* and Daniel ETIEMBLE**
* CEA LIST, Embedded Computing Laboratory
** LRI – Computer Science Lab
The processors that are used in embedded systems must fulfil a set of constraints: program execution time, power consumption, chip size, code size and so on. In this paper, we focus on the design of Application Specific Instruction-set Processors, and more precisely on an efficient methodology for the Design Space Exploration of an ASIP for the audio and speech domain. Using this methodology, we designed a high performance ASIP achieving over 13 GOPS/mm² with a 350 MHz clock frequency in a low-power 65-nm TSMC technology. The development time was less than two man-months.
The Design Space Exploration of an ASIP (Application Specific Instruction-set Processor) can be very complex due to the large number of design parameters. In our design case study, we focus only on some key architectural features like the pipeline depth, the number of registers, the implementation of special operations, the number of instructions that can be executed simultaneously and so on. Finding the best compromise for the values of all these parameters is not straightforward and we need a specific design methodology to meet all requirements.
In figure 1, we present different approaches that can be used to find the best trade-off. To keep the figure readable, we only consider two design parameters P1 and P2, which could be for example the pipeline depth and the number of instructions that are executed simultaneously.
Figure 1: Different approaches to find the optimal values of two design parameters
(a) The exhaustive search considers all the possible values of each parameter. Due to the large number of parameters, it is impossible to evaluate each point of the design space and compare it to all the other ones. Heuristic search techniques should be used, leading to suboptimal solutions.
(b) In order to avoid a heuristic search stopping at a local optimum, a second technique called random sampling is presented here. It consists in choosing the couples of parameters randomly, but again there is no guarantee of converging towards an acceptable result.
(c) With the guided-search approach, the designer starts with a preliminary choice of two parameters, and iterates around it step by step until finding an acceptable trade-off. This approach avoids inconsistent or incompatible values for the different parameters and represents the best design solution when the starting point is well chosen.
(d) Many other approaches could also be considered, such as genetic algorithms, machine learning based searches, and so on.
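The difference between approaches (b) and (c) can be illustrated with a minimal Python sketch. The cost function, the parameter ranges and the starting point are hypothetical and serve only the illustration; in the actual study each design point is evaluated by hand, not by a closed-form function.

```python
import random

def cost(p1, p2):
    # Hypothetical cost surface (lower is better); not taken from the paper.
    return (p1 - 5) ** 2 + (p2 - 3) ** 2

def random_sampling(n_samples, p1_range, p2_range, seed=0):
    """Approach (b): evaluate randomly chosen (P1, P2) couples, keep the best."""
    rng = random.Random(seed)
    points = [(rng.choice(p1_range), rng.choice(p2_range))
              for _ in range(n_samples)]
    return min(points, key=lambda p: cost(*p))

def guided_search(start, p1_range, p2_range):
    """Approach (c): start from a preliminary choice and move step by step
    to the best neighbouring point until no neighbour improves the cost."""
    current = start
    while True:
        p1, p2 = current
        neighbours = [(p1 + d1, p2 + d2)
                      for d1, d2 in ((1, 0), (-1, 0), (0, 1), (0, -1))
                      if p1 + d1 in p1_range and p2 + d2 in p2_range]
        best = min(neighbours, key=lambda p: cost(*p))
        if cost(*best) >= cost(*current):
            return current
        current = best

if __name__ == "__main__":
    values = range(1, 9)
    print("random sampling:", random_sampling(5, values, values))
    print("guided search:  ", guided_search((2, 2), values, values))
```

On a well-behaved cost surface the guided search reaches the optimum in a few steps, whereas the quality of random sampling depends on the number of samples drawn. On a surface with several local optima, the result of the guided search depends on the starting point, which is why the choice of that point matters.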
For our design, we use the guided search of parameters. First, we determine the most important features of our architecture. Then, we use a design tool to quantify these features and the rest of the architecture. What we propose here is thus a design methodology based on a guided search of parameters. The paper continues with the presentation of that design methodology, then the architecture is detailed. The results and the validation of the designed processor follow. Finally, future works are introduced.
2. DESIGN METHODOLOGY
Our aim is to find a good compromise between the time-to-design and the performance of an ASIP for a specific application domain.
2.1 Our benchmarks
For our case study, we chose the Audio and Speech Standards as a specific and widely used area of embedded systems. Several audio and speech standards with different encoding techniques are available, from lossless to lossy coding. Table 1 summarizes the set of benchmarks that we used for the Audio ASIP Design. Most of these benchmarks come from MediaBench. They cover both different coding techniques and some key features like bit-rates and computing complexities. More details on audio coding techniques are given in the audio and speech coding references listed at the end of this paper.
Table 1: Audio Applications Benchmark
2.2 Benchmark Profiling and Analysis
The selected benchmarks have been profiled using GPROF, the GNU profiler. The outputs of the profiler include the call graphs and the hotspots, i.e. the most time consuming functions. For our audio-speech benchmarks, we identified 14 hotspot functions such as the search for the best codebook filter parameters from the CELP (Code Excited Linear Prediction) standard or the MP3 (MPEG-1 Audio, Part 3, Layer 3) Modified Discrete Cosine Transform. Those hotspots take over 66% of the overall execution time. With this analysis of the hotspots, we cover all audio needs. Their limited number makes the manual analysis feasible. The hotspots can also be analysed to determine the architectural features that could improve the execution. For example, we can analyse the register and accumulator needs, the data-path widths, and so on. For instance, table 2 presents the number of registers that would be needed for an efficient execution of each audio-speech hotspot. These needs were identified from the evaluation of the lifetime of variables in the execution graph.
Table 2: Estimated register needs of audio-speech hotspots
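As an illustration of how register needs can be derived from variable lifetimes, the following Python sketch counts the maximum number of simultaneously live variables. The live ranges in the example are hypothetical and do not come from the profiled hotspots; the sketch only shows the principle, not the exact procedure used for table 2.

```python
def max_live_variables(live_ranges):
    """Return the peak number of simultaneously live variables.

    live_ranges maps a variable name to (first_definition_cycle,
    last_use_cycle), both inclusive. The peak is a lower bound on the
    number of registers needed to avoid spilling.
    """
    events = []
    for start, end in live_ranges.values():
        events.append((start, 1))      # variable becomes live
        events.append((end + 1, -1))   # variable is dead after its last use
    live = peak = 0
    # Sorting puts releases (-1) before allocations (+1) within a cycle,
    # so a register freed at a cycle can be reused in that same cycle.
    for _, delta in sorted(events):
        live += delta
        peak = max(peak, live)
    return peak

if __name__ == "__main__":
    # Hypothetical live ranges for a small multiply-accumulate kernel.
    ranges = {"acc": (0, 5), "x": (1, 2), "coef": (1, 2), "tmp": (3, 4)}
    print(max_live_variables(ranges))
```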
We have also identified some specific code features that could be accelerated by specific hardware features such as a pre-arithmetic shift. Our benchmarks also use loops intensively, for which optimizing both loop conditional branches and computation conditional branches is fundamental.
2.3 Architecture Sizing
2.3.1 Basic assumptions for the initial version of the architecture
The initial version of the architecture that we used is now presented. It uses a typical RISC (Reduced Instruction-Set Computer) instruction set architecture with 1-instruction delayed branches and conditional code flags (CC flags) for conditional branches (like the SPARC ISA). The ISA (Instruction-Set Architecture) is implemented with a typical 5-stage pipeline for both the scalar version and the n-way superscalar or VLIW (Very Long Instruction Word) versions. Some features reduce the number of executed instructions for both the scalar and the n-way versions. The number of CC flags is such a feature; it is presented in the next section. One other fundamental feature is the number of instructions that the hardware can execute simultaneously, i.e. the value of n for the n-way approach. It will be discussed in a subsequent section.
2.3.2 Conditional Code Flags Sizing
As previously mentioned, loops are common in our benchmarks and they combine a loop branch and one (or several) computation branches within the loop. Generally, the result of an entire loop computation is scaled at the end of the loop. So we need a flag for the loop branch and another one for the conditional result scaling. Having one or several CC flags impacts the overall performance of the loop.
Table 3: Evaluation of Conditional Code Flags Implementation
In the example shown in table 3, implementing two different CC flags saves one cycle per loop iteration. With only one CC flag, there is no way to fill the delay slot after the loop branch, as the CMOV instruction must follow the first SUBCC while the JMP CC must follow the second instance of the SUBCC. With two different CC flags, the CMOV instruction can be moved into the branch delay slot, removing the NOP that was needed in the previous case.
The gain can rapidly grow with n-way architectures. In this situation, the loop branch condition SUBCC1 can be evaluated in the same cycle as another instruction. In table 4, with a 2-way architecture, we save one more cycle per iteration.
Table 4: Evaluation of Conditional Code Flags Implementation for a 2-ILP architecture
The impact of the number of CC flags can be evaluated by a metric called "instruction utilization rate (IUR)", defined as the number of useful instructions over the overall number of instructions (which includes useful and NOP instructions). This instruction utilization rate can also be defined as 1 − NOP percentage. In table 4, if the first M instructions perfectly fit on the architecture, leading to N/2 cycles and zero NOP, an evaluation of that metric for both implementations leads to:
Using several conditional code flags increases the performance and uses the capabilities of the architecture more efficiently. The chip area cost is relatively small and there is no issue for the instruction-set coding. Obviously, the results that are shown in table 4 are based on a simple 5-stage pipeline like the MIPS R2000 one. Other pipelines could lead to other results. For instance, the pipeline of the Alpha 21164 had 2 execution stages (EX1 and EX2): the evaluation of the condition was executed during the EX1 stage, while the conditional branch was executed during EX2. In that case, both the instruction setting the condition and the conditional branch can be scheduled in the same clock cycle, removing a lot of NOP instructions in table 4. Using other pipelines will be investigated in future works.
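The instruction utilization rate defined above can be computed from a schedule as in the following Python sketch. The 2-way schedule used in the example is hypothetical: it does not reproduce the content of table 3 or table 4, and the mnemonics only borrow the names used in the text.

```python
NOP = "NOP"

def instruction_utilization_rate(schedule):
    """IUR = useful instructions / (useful + NOP instructions).

    schedule is a list of cycles; each cycle is a tuple holding one
    mnemonic per way of the architecture.
    """
    slots = [op for cycle in schedule for op in cycle]
    useful = sum(1 for op in slots if op != NOP)
    return useful / len(slots)

if __name__ == "__main__":
    # Hypothetical loop body scheduled on a 2-way machine (5 cycles, 10 slots).
    schedule = [
        ("LD", "LD"),
        ("MUL", NOP),
        ("ADD", "SUBCC"),
        ("JMP", NOP),
        ("CMOV", NOP),   # instruction placed in the branch delay slot
    ]
    iur = instruction_utilization_rate(schedule)
    print(f"IUR = {iur:.2f}, NOP percentage = {1 - iur:.2f}")
```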
2.3.3 N-way architectures
The aim of the article is to present a design methodology based on a well chosen driving parameter. We focus on the number of instructions executed simultaneously as the driving parameter. The main objective is to find the most efficient architecture that is able to exploit the ILP (Instruction Level Parallelism) that exists in the benchmarks with the minimal set of resources, i.e. the best silicon efficiency.
We need to find the best triplet: n-way (Nbways), instruction utilization rate (Tuse) and chip area. The processor frequency and the resulting Nbop/sec are obtained from the design for each different n-way architecture.
As no compiler is available for each evaluated architecture, the only way to find the best triplet (n-way, instruction utilization rate, chip area) is to manually schedule operations in execution kernels according to each architecture. The assembly code of the identified hotspots has been written, and the corresponding execution time (in clock cycles), based on the data dependencies, and the instruction utilization rates have been computed for different parallel architectures (2, 3, 4, 6 and 8-way architectures). We considered two types of data-paths: homogeneous data-paths have the same processing resources on every way, while heterogeneous architectures have specific processing resources for each way of the data-path.
For our audio-speech benchmarks, on homogeneous data-paths, the instruction utilization rate is 87% for a 2-way VLIW, 74% for 3-way, 54% for 4-way and less than 36% for the other architectures. Obviously, the hotspot loops of the audio applications do not have enough ILP to efficiently exploit 6 or 8-way architectures. The instruction utilization rate on heterogeneous architectures is 87% for a 2-way, 72% for 3-way and 52% for 4-way architectures, as shown in figure 2.
Heterogeneous data-paths allow a significant saving in design area. At the same time, the utilization rates of homogeneous and heterogeneous data-paths are quite similar. So, taking silicon efficiency as the main metric, the use of parallel architectures with heterogeneous processing resources is very attractive. We will only consider heterogeneous 2, 3 and 4-way architectures in the rest of the paper.
Figure 2: Instruction utilization rates for n-way architectures for audio benchmarks
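Since the instruction utilization rate is the fraction of issue slots holding useful instructions, multiplying it by the number of ways gives the average number of useful instructions per cycle. The following Python sketch applies this to the rates quoted above. The product itself is not reported in the paper and is derived here only as an illustration.

```python
# Instruction utilization rates quoted in the text, per number of ways.
HOMOGENEOUS = {2: 0.87, 3: 0.74, 4: 0.54}
HETEROGENEOUS = {2: 0.87, 3: 0.72, 4: 0.52}

def useful_instructions_per_cycle(rates):
    """Average useful instructions per cycle: n_ways * utilization rate."""
    return {n: n * rate for n, rate in rates.items()}

if __name__ == "__main__":
    for name, rates in (("homogeneous", HOMOGENEOUS),
                        ("heterogeneous", HETEROGENEOUS)):
        for n, ipc in useful_instructions_per_cycle(rates).items():
            print(f"{name:13s} {n}-way: {ipc:.2f} useful instructions/cycle")
```

With these rates, going from 3 to 4 ways does not increase the useful throughput, while the hardware cost keeps growing. This is consistent with restricting the study to small values of n.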
The second step is to select the degree of parallelism in the architecture. This step needs an estimation of the evolution of the hardware complexity when resources are duplicated. From a RISC processor area distribution, we estimate the chip area of each parallel architecture according to the following assumptions:
The evolution of the hardware complexity of different architectural features is also estimated. For example, we consider that the program memory access cost is proportional to the number of fetched instructions per clock cycle. When n increases with n-way architectures, the decoder complexity increases, but many operations share their decoding. Thus, we assume that the decoder area increases proportionally to the square root of n. Bypassing and communication mechanisms are also assumed to increase according to the same law.
As the register file and the execution units represent about 3/4 of the overall chip area, we carried out specific investigations to estimate their evolution more precisely when n increases. For the register file, a set of gate-level syntheses based on a 2R/1W RF description has been done. This study shows an increase of 50% when doubling the number of RF ports, an increase of 100% with a 6R/3W RF and an increase factor of over 2.5 for an 8R/4W RF versus the original 2R/1W one. In table 5, we present the hardware complexity evolution of n-way processors relative to the RISC area complexity.
Table 5: Hardware complexity evolution for n-way architectures with heterogeneous data-paths relative to RISC processor
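The scaling laws stated above can be combined into a rough relative-area estimate, as in the following Python sketch. Only the register-file factors and the square-root law come from the text. The area fractions of the RISC baseline, the scaling factor of the execution units and the assumption that the remaining logic does not grow are all hypothetical placeholders, so the printed values do not reproduce table 5.

```python
import math

# Register-file growth relative to a 2R/1W file, as reported in the text
# (1-way: 2R/1W, 2-way: 4R/2W, 3-way: 6R/3W, 4-way: 8R/4W).
# The 8R/4W figure is given as "over 2.5", so 2.5 is a lower bound.
RF_FACTOR = {1: 1.0, 2: 1.5, 3: 2.0, 4: 2.5}

# HYPOTHETICAL area split of the RISC baseline. The text only states that
# the register file and the execution units take about 3/4 of the area.
HYPOTHETICAL_FRACTIONS = {
    "rf": 0.40,
    "exec": 0.35,
    "decoder_bypass": 0.15,
    "other": 0.10,
}

def sqrt_law(n):
    """Decoder, bypass and communication growth assumed in the text."""
    return math.sqrt(n)

def relative_area(n, fractions, exec_factor):
    """Area of an n-way design relative to the RISC baseline.

    exec_factor is a caller-supplied (hypothetical) growth factor for the
    execution units; "other" logic is assumed not to grow with n.
    """
    return (fractions["rf"] * RF_FACTOR[n]
            + fractions["exec"] * exec_factor
            + fractions["decoder_bypass"] * sqrt_law(n)
            + fractions["other"])

if __name__ == "__main__":
    # Hypothetical execution-unit growth for heterogeneous data-paths.
    for n, exec_factor in ((1, 1.0), (2, 1.5), (3, 2.0), (4, 2.5)):
        area = relative_area(n, HYPOTHETICAL_FRACTIONS, exec_factor)
        print(f"{n}-way: {area:.2f} x RISC area")
```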
Having evaluated each of the parameters presented in equation 3, we can compare the different n-way architectures with the scalar implementation (i.e. leaving aside the Nbop/sec, which is not yet known). For our case study, 2 and 3-way architectures represent a good compromise for audio-speech applications.
2.4 Development Tool
The Synopsys Processor Designer is an automatic design tool based on the ADL (Architecture Description Language) LISA 2.0. It provides efficient design feedback to modify and optimize the architecture. From a behavioral description of the operations, several architectures (RISC, DSP, VLIW) can be implemented. Also, an architecture debugger gives complete visibility of the parameters at execution time: registers, contents of the different memories, instruction opcodes, pipeline stages, stalls and flushes, loop iterations, current pipeline signals, and so on. It allows a micro-step execution of the LISA instructions that is neither cycle-accurate nor instruction-accurate but "LISA-line-accurate".
This tool is used to size a design criterion and to rapidly evaluate its impact on the main system. The development flow and the tool features used are presented in figure 3. From the starting point defined previously, this tool is used during the guided search process described in figure 1 (c) of section 1.
Figure 3: Audio Processor Design with Synopsys
3. ARCHITECTURE OVERVIEW
A block diagram of the designed processor is presented in figure 4.
This figure shows a five-stage pipeline architecture: Instruction Fetch (FE), Decode (DE), Execution (EX), Memory or second execution stage (MEM) and RF Writeback (WB). It is an n-way structure with three separate data-paths. The distribution of the operators among the data-paths was derived from the analysis of the applications and their computational patterns. The estimated instruction utilization rate gives an indication of the soundness of this choice. This distribution is given below:
All these data-paths are 32-bit signed except the multiplication. The multiplier takes 16-bit operands and explicit signs in order to support wider signed and unsigned multiplications in software. The 16×16→32-bit multiplication is done in the MEM (or EX2) stage so that its critical path (i.e. the processor critical path) is not extended by the data hazard resolution. The multiplier result can be used with a one-cycle latency.
The instruction-set coding is 96-bit wide, with mainly two source operands and one register result (Opcode destreg, src1reg, src2reg-or-imm). The second operand can be a register or an immediate value, mostly 14-bit wide. The Register File includes 32 32-bit registers. It is fully accessible by the three data-paths: it includes 6 Read and 3 Write ports.
The branch and jump unit is not represented in this figure. The corresponding instructions are implemented by the decoder and the result is given back to the fetch stage. Branches and jumps are delayed by one clock cycle, which means that the delay slot must be filled by a useful instruction or a NOP. As already presented in the example of Conditional Code Flags Implementation, a conditional move is implemented, which writes either the first or the second operand to a register according to the state of the CC flags. This technique replaces conditional branches by conditional transfers. Its use increases performance because the data-paths remain available and the wait for the condition evaluation is removed.
The Load/Store Unit allows data memory access. It has 4 access modes: ".W" to manipulate word-type data, ".H" for signed half-word data, ".UH" for unsigned half-word data and ".B" for 8-bit data. All these accesses are done in the MEM stage, which implies a one-cycle latency before the loaded results can be used.
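The four access modes of the Load/Store Unit can be modelled as in the following Python sketch. The byte order (little-endian) and the zero-extension of ".B" loads are assumptions made for the illustration; the text specifies neither, and the function is a behavioural model, not the LISA description of the unit.

```python
def load(memory, address, mode):
    """Behavioural model of a load in one of the four access modes.

    memory is a bytes-like object. Little-endian byte order is ASSUMED.
    """
    if mode == ".W":    # 32-bit word, signed like the data-paths
        return int.from_bytes(memory[address:address + 4], "little", signed=True)
    if mode == ".H":    # signed half-word, sign-extended to 32 bits
        return int.from_bytes(memory[address:address + 2], "little", signed=True)
    if mode == ".UH":   # unsigned half-word, zero-extended
        return int.from_bytes(memory[address:address + 2], "little", signed=False)
    if mode == ".B":    # 8-bit data; zero-extension is an ASSUMPTION
        return memory[address]
    raise ValueError(f"unknown access mode: {mode}")

if __name__ == "__main__":
    mem = bytes([0xFF, 0xFF, 0x00, 0x00])
    for mode in (".W", ".H", ".UH", ".B"):
        print(mode, load(mem, 0, mode))
```

The same 16-bit pattern 0xFFFF is read as -1 in ".H" mode and as 65535 in ".UH" mode, which is the distinction the two half-word modes exist to make.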
Figure 4: Architecture Overview
4. RESULTS AND VALIDATION
4.1 Architecture Design
The VHDL RTL of the designed 3-way VLIW ASIP has been generated using the Synopsys Processor Designer tool. The RTL has then been synthesized at gate level using Design Compiler from Synopsys, targeting a 65-nm Low Power TSMC technology. Under a minimum time constraint of 2.8 ns, the overall chip area is about 0.07 mm², with more than 45% dedicated to the Register File and 13% to the decoder. The validation process consists in running the profiled applications and evaluating the processor performance in terms of Silicon Efficiency. Figure 5 summarizes the overall design flow from the application benchmarks to the ASIP performance evaluation.
Figure 5: Methodology Design Flow
First, we select a set of benchmarks from the application domain, which we profile and analyze. Then, we look for the best triplet: number of instructions executed in parallel, instruction utilization rate and chip area. For this, we evaluate how the assembly code of the different benchmark kernels executes on each n-way architecture and we evaluate the execution time and the instruction utilization rate. Third, we use a design tool to size an efficient processor. We iterate the process until we meet our requirements. Finally, we validate the designed processor with a gate-level synthesis and we execute the considered hotspot kernels.
As no compiler was available, the assembly code of three hotspots was manually written and optimized to validate the ASIP architecture. The hotspots of the profiled applications were executed on the processor, leading to an instruction utilization rate of 86%. Note that only three of the 14 hotspot functions were manually written to evaluate our processor. They only represent about 20% of the overall execution time. The Silicon Efficiency of a processor is given by:
The silicon efficiency of the designed ASIP is then:
The designed 3-way processor delivers about 13 GOPS/mm². The development lasted a couple of months. Its clock speed is about 357 MHz and it efficiently executes GSM (Global System for Mobile communications), CELP, ADPCM (Adaptive Differential Pulse Code Modulation) and MP3 applications.
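The reported figure can be checked with a short Python sketch. It assumes that the silicon efficiency is the number of useful operations per second divided by the chip area, with the operation rate taken as Nbways × Tuse × clock frequency. This form is an assumption based on the quantities named in section 2.3.3, not a formula quoted from the paper. With the values given in the text, it reproduces the order of magnitude stated above.

```python
def silicon_efficiency(n_ways, utilization, freq_hz, area_mm2):
    """Useful operations per second per mm2 (ASSUMED definition):
    n_ways * utilization * clock frequency / chip area."""
    return n_ways * utilization * freq_hz / area_mm2

if __name__ == "__main__":
    # Values quoted in the text for the designed 3-way ASIP.
    se = silicon_efficiency(n_ways=3, utilization=0.86,
                            freq_hz=357e6, area_mm2=0.07)
    print(f"{se / 1e9:.1f} GOPS/mm2")
```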
4.2 Performance Analysis
The Synopsys Processor Designer allows a fast generation of other Audio ASIP versions based on the designed one. The aim of the presented design methodology is to show that the design parameters were correctly sized. A small modification of one of them leads to quite different results. In an earlier example, we showed the impact of implementing two different conditional code flags. Now we consider the impact of executing fewer instructions in parallel.
Few modifications of the ADL description are needed to design 2-way VLIW and RISC implementations. Evaluating their performance with the audio benchmarks leads to different results in silicon efficiency, as presented in figure 6.
Figure 6: Normalized silicon efficiencies achieved by different n-way processors
For the three evaluated hotspot functions, we clearly observe that n-way architectures are better than scalar ones. At the beginning of the study, taking only these 3 hotspots, the 3-way architecture was only 0.78 times as efficient as the 2-way one in terms of Silicon Efficiency. But for all the hotspots, the two versions were quite similar. The results given in figure 6 after implementation refer only to the execution of the three hotspots. So if we assume that the evolution from the preliminary results to the results after implementation will be the same for all the hotspot functions, then we expect the 3-way processor to be 1.23 times better than the 2-way one, and even more so compared with the scalar implementation.
SPARC v8 is an instruction set for RISC processors including load/store, arithmetic, logic and shift instructions and everything necessary for running a large range of applications. We chose the Leon3 implementation of the SPARC v8 ISA as the reference for the achieved ASIP performance. The Leon3 has a seven-stage pipeline with a Harvard architecture (with separate Program and Data memories). It includes a hardware multiplier/divider and a 3-port Register File. The required register file contains 32 registers organized in windows. The three validation functions are executed on it and its RTL implementation is synthesized at gate level with the same Low-power TSMC library. The overall design size is about 0.035 mm² with a clock speed of 357 MHz. In table 6 we compare the results of the designed 3-way ASIP with those of the Leon3 processor running the audio applications. With the described design methodology, the audio-speech 3-way ASIP is about 70% more efficient than the Leon3 processor.
Table 6: Audio ASIP vs Leon3 Silicon Efficiencies
5. CONCLUSION AND FUTURE WORK
The applied methodology allowed a fast Design Space Exploration and an efficient sizing of the key parameters. Our methodology has several limitations:
The designed VLIW ASIP was very efficient in terms of performance. But its Silicon Efficiency was unfortunately reduced by its chip area. We noticed that the Register File took over 45% of the overall area. In future works, we will focus on reducing the overall system silicon cost.
Karlheinz Brandenburg, Oliver Kunz, and Akihiko Sugiyama. MPEG-4 natural audio coding. Signal Processing: Image Communication, 15:423–444, 2000.
M. Budagavi and J.D. Gibson. Speech coding in mobile radio communications. Proceedings of the IEEE, 86(7):1402–1412, July 1998.
Andres Vega Garcia. Mécanismes de contrôle pour la transmission de l'audio sur l'Internet. PhD thesis, Nice-Sophia Antipolis University, October 1996.
A.S. Spanias. Speech coding: a tutorial review. Proceedings of the IEEE, 82(10):1541–1582, October 1994.
N. Pinckney, T. Barr, M. Dayringer, M. McKnett, Nan Jiang, C. Nygaard, D. Money Harris, J. Stanley, and B. Phillips. A MIPS R2000 implementation. Pages 102–107, June 2008.
P. Bannon and J. Keller. Internal architecture of Alpha 21164 microprocessor. In Compcon '95, 'Technologies for the Information Superhighway', Digest of Papers, pages 79–87, Mar 1995.
Karl Van Rompaey, Diederik Verkest, Ivo Bolsens, and Hugo De Man. CoWare – a design environment for heterogeneous hardware/software systems. EURO-DAC, pages 252–257, 1996.