# Computer <br> Organization and Structure 

Bing-Yu Chen
National Taiwan University

## Textbook

$\square$ D. A. Patterson,
J. L. Hennessy.

Computer
Organization \& Design: The Hardware/Software Interface, 5th. ed., Morgan Kaufmann, 2013.


## Reference

- J. L. Hennessy,
D. A. Patterson.

Computer Architecture: A Quantitative Approach, 5th. ed., Morgan Kaufmann, 2011.


## Reference

$\square$ R. H. Katz,
G. Borriello.

Contemporary
Logic Design, 2nd ed., Prentice Hall, 2004.


SECOND EDITION

RANDY H. KATZ • GAETANO BORRIELLO

## Pre-requirements

$\square$ Binary Digital Systems

- Introduction to Computer


## Requirements

$\square$ Participants

- Homework
- maybe five or six times
with some small programs
- Examinations
- twice


## Why and What is the course ?

$\square$ This is the only Computer Hardware related course in IM department.
$\square$ The contents will cover

- Logic Design
$\square$ one or two weeks (maybe)
- Assembly Language
- two or three weeks (maybe)
- Computer Architecture
$\square$ the rest weeks


## The Computer Revolution

$\square$ Progress in computer technology

- Underpinned by Moore's Law*
$\square$ Makes novel applications feasible
- Computers in automobiles
- Cell phones
- Human genome project
- World Wide Web
- Search Engines
$\square$ Computers are pervasive


## Classes of Computers

$\square$ Personal computers

- General purpose, variety of software
- Subject to cost/performance tradeoff
$\square$ Server computers
- Network based
- High capacity, performance, reliability
- Range from small servers to building sized


## Classes of Computers

$\square$ Supercomputers

- High-end scientific and engineering calculations
- Highest capability but represent a small fraction of the overall computer market
$\square$ Embedded computers
- Hidden as components of systems
- Stringent power/performance/cost constraints


## History of Intel ${ }^{\circledR}$ CPU

$\square 1978 \quad 8086 / 8088 \quad 5-10 \mathrm{MHz}$
$\square 198280286$
$6-25 \mathrm{MHz}$
$\square 1985$ Intel $386^{\mathrm{TM}} \quad 16-33 \mathrm{MHz}$
$\square 1989$ Intel $486^{\text {TM }}$ DX $25-50 \mathrm{MHz}$
$\square 1993$ Pentium ${ }^{\circledR} \quad 60-233 \mathrm{MHz}$
$\square 1997$ Pentium ${ }^{\circledR}$ II $\quad 233-450 \mathrm{MHz}$
$\square 1999$ Pentium ${ }^{\circledR}$ III $450 \mathrm{M}-1.4 \mathrm{GHz}$
$\square 2000$ Pentium ${ }^{\circledR} 4 \quad 1.4-3.8 \mathrm{GHz}$

## The Processor Market



## The PostPC Era



## The PostPC Era

$\square$ Personal Mobile Device (PMD)

- Battery operated
- Connects to the Internet
- Hundreds of dollars
- Smart phones, tablets, electronic glasses
$\square$ Cloud computing
- Warehouse Scale Computers (WSC)
- Software as a Service (SaaS)
- Portion of software run on a PMD and a portion run in the Cloud
- Amazon and Google


## Understanding Performance

- Algorithm
- Determines number of operations executed
$\square$ Programming language, compiler, architecture
- Determine number of machine instructions executed per operation
$\square$ Processor and memory system
- Determine how fast instructions are executed
$\square$ I/O system (including OS)
- Determines how fast I/O operations are executed


## Eight Great Ideas

$\square$ Design for Moore's Law
$\square$ Use abstraction to simplify design
$\square$ Make the common case fast
$\square$ Performance via parallelism
$\square$ Performance via pipelining
COMMON CASE FAST
$\square$ Performance via prediction
$\square$ Hierarchy of memories
$\square$ Dependability via redundancy
PBEDICTION

## Below Your Program

$\square$ Application software

- Written in high-level language
$\square$ System software
- Compiler: translates HLL code to machine code
- Operating System: service code
$\square$ Handling input/output
$\square$ Managing memory and storage
$\square$ Scheduling tasks \& sharing resources
$\square$ Hardware
- Processor, memory, I/O controllers


## Levels of Program Code

swap(int v[], int k) \{int temp;

```
    temp=v[k];
    v[k]=v[k+1];
    v[k+1]=temp;
```

\}
(C compiler)
swap:
mull $\$ 2, \$ 5,4$
add $\$ 2, \$ 4, \$ 2$
lw \$15, 0(\$2)
lw \$16,4(\$2)
sw $\$ 16,0(\$ 2)$
sw \$15,4(\$2)
jr \$31


00000000101000010000000000011000 00000000000110000001100000100001 10001100011000100000000000000000 10001100111100100000000000000100 10101100111100100000000000000000 10101100011000100000000000000100 00000011111000000000000000001000

## Components of a Computer

$\square$ Same components for all kinds of computer

- Desktop, server, embedded
$\square$ Input/output includes
- User-interface devices
- Display, keyboard, mouse
- Storage devices
- Hard disk, CD/DVD, flash Network adapters
- For communicating with other computers


## Anatomy of a Computer



## Anatomy of a Mouse

$\square$ Optical mouse

- LED illuminates desktop
- Small low-res camera
- Basic image processor
$\square$ Looks for $x, y$ movement
- Buttons \& wheel
$\square$ Supersedes roller-ball mechanical mouse



## Touchscreen

$\square$ PostPC device
$\square$ Supersedes keyboard and mouse
$\square$ Resistive and Capacitive types

- Most tablets, smart phones use capacitive
- Capacitive allows multiple touches simultaneously



## Through the Looking Glass

$\square$ LCD screen: picture elements (pixels) - Mirrors content of frame buffer memory

Frame buffer


Raster scan CRT display


## Opening the Box



## Opening the Box



## Opening the Box



## Inside the Processor (CPU)

$\square$ Datapath: performs operations on data
$\square$ Control: sequences datapath, memory, ...
$\square$ Cache memory

- Small fast SRAM memory for immediate access to data


## Inside the Processor

## $\square$ AMD Barcelona: 4 processor cores




## Inside the Processor

## $\square$ Apple A5



## Abstractions

$\square$ Abstraction helps us deal with complexity

- Hide lower-level detail
$\square$ Instruction set architecture (ISA)
- The hardware/software interface
$\square$ Application binary interface
- The ISA plus system software interface
$\square$ Implementation
- The details underlying and interface


## A Safe Place for Data

$\square$ Volatile main memory

- Loses instructions and data when
 power off
$\square$ Non-volatile secondary memory - Magnetic disk
- Flash memory
- Optical disk (CDROM, DVD)



## Networks

$\square$ Communication, resource sharing, nonlocal access
$\square$ Local area network (LAN): Ethernet
$\square$ Wide area network (WAN): the Internet
$\square$ Wireless network: WiFi, Bluetooth


## Digital Systems



Digital: only assumes discrete values


Analog:
values vary over a broad range continuously

## Digital Binary Systems

$\square$ Two discrete values:

| yes | on | 5 volts | current <br> flowing | magnetized <br> North | $" 1 "$ |
| :--- | :--- | :--- | :--- | :--- | :--- |
| no | off | 0 volts | no current <br> flowing | magnetized <br> South | $" 0 "$ |

## Advantage of Binary Systems

$\square$ rigorous mathematical foundation based on logic
$\square$ IF the garage door is open
$\square$ AND the car is running
$\square$ THEN the car can be backed out of the garage

## Boolean Algebra \& Logical Operators

$\square$ Algebra: variables, values, operations
$\square$ Values: 0 and 1
$\square$ Operations: AND, OR, NOT

| $X$ | $Y$ | $X$ AND $Y$ |
| :---: | :---: | :---: |
| 0 | 0 | 0 |
| 0 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 1 |


| $X$ | $Y$ | $X$ OR $Y$ |
| :---: | :---: | :---: |
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 1 |


| $X$ | NOT $X$ |
| :---: | :---: |
| 0 | 1 |
| 1 | 0 |

## Combinational vs. Sequential Logic

$\square$ Combinational logic: without a memory

- no feedback among inputs and outputs
- outputs are a pure function of the inputs
$\square$ Sequential logic: with a memory
- inputs and outputs overlap
- outputs depend on inputs and the entire history of execution
- ex. add in elementary school


## Synchronous vs. Asynchronous

 System$\square$ Synchronous system

- period reference signal, the clock, causes the storage elements to accept new values and to change state
$\square$ Asynchronous system
- no single indication of when to change state


## Technology Trends

$\square$ Electronics technology continues to evolve

- Increased capacity and performance
- Reduced cost


| Year | Technology | Relative performance/cost |
| :--- | :--- | :---: |
| 1951 | Vacuum tube | 1 |
| 1965 | Transistor | 35 |
| 1975 | Integrated circuit (IC) | 900 |
| 1995 | Very large scale IC (VLSI) | $2,400,000$ |
| 2013 | Ultra large scale IC | $250,000,000,000$ |

## Defining Performance

## $\square$ Which airplane has the best performance?






## Response Time and Throughput

$\square$ Response time

- How long it takes to do a task
$\square$ Throughput
- Total work done per unit time
- e.g., tasks/transactions/... per hour
$\square$ How are response time and throughput affected by
- Replacing the processor with a faster version?
- Adding more processors?
$\square$ We'll focus on response time for now...


## Relative Performance

$\square$ Define Performance $=1$ /Execution Time
$\square$ "X is $n$ time faster than $Y$ "

Performance $_{\mathrm{X}}$ / Performance $_{\mathrm{Y}}$
$=$ Execution Time $_{\mathrm{Y}} /$ Execution Time $_{\mathrm{X}}=n$
$\square$ Example: time taken to run a program - 10s on $A, 15$ s on B
$\square$ how much faster is $A$ than $B$ ? $\Rightarrow \frac{15}{10}=1.5$

## Measuring Execution Time

- Elapsed time
- Total response time, including all aspects $\square$ Processing, I/O, OS overhead, idle time
- Determines system performance
$\square$ CPU time
- Time spent processing a given job
$\square$ Discounts I/O time, other jobs' shares
- Comprises user CPU time and system CPU time
- Different programs are affected differently by CPU and system performance


## CPU Clocking

$\square$ Operation of digital hardware governed by a constant-rate clock

$\square$ Clock period: duration of a clock cycle - e.g., 250 ps $=0.25 \mathrm{~ns}=250 \times 10^{-12}$ s

- Clock frequency (rate): cycles per second - e.g., $4.0 \mathrm{GHz}=4000 \mathrm{MHz}=4.0 \times 10^{9} \mathrm{~Hz}$


## CPU Time

CPU Time $=$ CPU Clock Cycles $\times$ Clock Cycle Time

$$
=\frac{\text { CPU Clock Cycles }}{\text { Clock Rate }}
$$

$\square$ Performance improved by

- Reducing number of clock cycles
- Increasing clock rate
- Hardware designer must often trade off clock rate against cycle count


## CPU Time Example

- Computer A: 2GHz clock, 10s CPU time
$\square$ Designing Computer B
- Aim for 6 s CPU time
- Can do faster clock, but causes $1.2 \times$ clock cycles
$\square$ How fast must Computer B clock be?
Clock Rate $_{\mathrm{B}}=\frac{\text { Clock Cycles }_{\mathrm{B}}}{\text { CPU Time }_{\mathrm{B}}}=\frac{1.2 \times{\text { Clock } \text { Cycles }_{\mathrm{A}}}_{6 \mathrm{~s}}^{6}}{\text { ( }}$
Clock Cycles $_{\mathrm{A}}=$ CPU Time ${ }_{\mathrm{A}} \times$ Clock Rate $_{\mathrm{A}}$

$$
=10 \mathrm{~s} \times 2 \mathrm{GHz}=20 \times 10^{9}
$$

Clock Rate ${ }_{\mathrm{B}}=\frac{1.2 \times 20 \times 10^{9}}{6 \mathrm{~s}}=\frac{24 \times 10^{9}}{6 \mathrm{~s}}=4 \mathrm{GHz}$

## Instruction Count and CPI

Clock Cycles $=$ Instruction Count $\times$ Cycles per Instruciton
CPU Time $=$ Instruction Count $\times \mathrm{CPI} \times$ Clock Cycle Time

$$
=\frac{\text { Instruction Count } \times \text { CPI }}{\text { Clock Rate }}
$$

$\square$ Instruction Count for a program

- Determined by program, ISA and compiler
$\square$ Average cycles per instruction
- Determined by CPU hardware
- If different instructions have different CPI
$\square$ Average CPI affected by instruction mix


## CPI Example

$\square$ Computer A: Cycle Time $=250 \mathrm{ps}, \mathrm{CPI}=2.0$
$\square$ Computer B: Cycle Time $=500 \mathrm{ps}, \mathrm{CPI}=1.2$
$\square$ Same ISA
$\square$ Which is faster, and by how much?
CPU Time $A=$ Instruction Count $\times$ CPI $_{A} \times$ Cycle Time $_{A}$

$$
=\mathrm{I} \times 2.0 \times 250 \mathrm{ps}=\mathrm{I} \times 500 \mathrm{ps} \longleftarrow \mathrm{~A} \text { is faster } \ldots
$$

CPU Time ${ }_{B}=$ Instruction Count $\times$ CPI $_{\text {B }} \times$ Cycle Time $_{\text {B }}$

$$
=\mathrm{I} \times 1.2 \times 500 \mathrm{ps}=\mathrm{I} \times 600 \mathrm{ps}
$$

$\frac{\text { CPU Time }_{B}}{\text { CPU Time }_{A}}=\frac{I \times 600 \mathrm{ps}}{I \times 500 \mathrm{ps}}=1.2 \longleftarrow \ldots$ by this much

## CPI in More Detail

$\square$ If different instruction classes take different numbers of cycles

$$
\text { Clock Cycles }=\sum_{i=1}^{n}\left(\mathrm{CPI}_{i} \times \text { Instruction Count }_{i}\right)
$$

- Weighted average CPI

$$
\mathrm{CPI}=\frac{\text { Clock Cycles }}{\text { Insturction Count }}=\sum_{i=1}^{n}(\mathrm{CPI}_{i} \times \underbrace{\frac{\text { Instruction Count }}{i} \text { Instruction Count }}_{\text {Relative frequency }})
$$

## CPI Example

$\square$ Alternative compiled code sequences using instructions in classes A, B, C

| Class | A | B | C |
| :--- | :---: | :---: | :---: |
| CPI for class | 1 | 2 | 3 |
| IC in sequence 1 | 2 | 1 | 2 |
| IC in sequence 2 | 4 | 1 | 1 |

- Sequence 1: IC = 5 . Sequence 2: IC = 6
- Clock Cycles
$=2 \times 1+1 \times 2+2 \times 3$
$=10$
- Avg. $\mathrm{CPI}=10 / 5=2.0$
- Clock Cycles
$=4 \times 1+1 \times 2+1 \times 3$
$=9$
- Avg. $\mathrm{CPI}=9 / 6=1.5$


## Performance Summary

$$
\text { CPU Time }=\frac{\text { Instructions }}{\text { Program }} \times \frac{\text { Colck Cycles }}{\text { Instruction }} \times \frac{\text { Seconds }}{\text { Colck Cycle }}
$$

$\square$ Performance depends on

- Algorithm: affects IC, possibly CPI
- Programming language: affects IC, CPI
- Compiler: affects IC, CPI
- Instruction set architecture: affects IC, CPI, $\mathrm{T}_{\mathrm{c}}$


## MIPS as a Performance Metric

$\square$ MIPS: Millions of Instructions Per Second

- Doesn't account for
$\square$ Differences in ISAs between computers
$\square$ Differences in complexity between instructions

$$
\begin{aligned}
\text { MIPS } & =\frac{\text { Instruction Count }}{\text { Execution Time } \times 10^{6}} \\
& =\frac{\text { Instruction Count }}{\frac{\text { Instruction Count } \times \mathrm{CPI}}{\text { Clock Rate }} \times 10^{6}}=\frac{\text { Clock Rate }}{\mathrm{CPI} \times 10^{6}}
\end{aligned}
$$

- CPI varies between programs on a given CPU


## Amdahl's Law

$\square$ Improving an aspect of a computer and expecting a proportional improvement in overall performance

$$
T_{\text {improved }}=\frac{T_{\text {affected }}}{\text { improvement factor }}+T_{\text {unaffected }}
$$

$\square$ Example: multiply accounts for $80 \mathrm{~s} / 100 \mathrm{~s}$

- How much improvement in multiply performance to get $5 \times$ overall?

$$
20=\frac{80}{n}+20 \longleftarrow \text { Can't be done! }
$$

$\square$ Corollary: make the common case fast

## Power Trends



Power $=$ Capacitive load $\times$ Voltage $^{2} \times$ Frequency $^{\text {a }}$
 $\times 1000$

## Reducing Power

$\square$ Suppose a new CPU has

- $85 \%$ of capacitive load of old CPU
- $15 \%$ voltage and $15 \%$ frequency reduction

$$
\frac{P_{\text {new }}}{P_{\text {old }}}=\frac{C_{\text {old }} \times 0.85 \times\left(V_{\text {old }} \times 0.85\right)^{2} \times F_{\text {old }} \times 0.85}{C_{\text {old }} \times V_{\text {old }}{ }^{2} \times F_{\text {old }}}=0.85^{4}=0.52
$$

$\square$ The power wall

- We can't reduce voltage further
- We can't remove more heat
$\square$ How else can we improve performance?


## Uniprocessor Performance



## Multiprocessors

$\square$ Multicore microprocessors

- More than one processor per chip
$\square$ Requires explicitly parallel programming
- Compare with instruction level parallelism
$\square$ Hardware executes multiple instructions at once
$\square$ Hidden from the programmer
- Hard to do
$\square$ Programming for performance
$\square$ Load balancing
$\square$ Optimizing communication and synchronization


## Low Power at Idle

- Look back at i7 power benchmark - At 100\% load: 258W
- At 50\% load: 170W (66\%)
- At 10\% load: 121W (47\%)
$\square$ Google data center
- Mostly operates at $10 \%-50 \%$ load
- At 100\% load less than $1 \%$ of the time
$\square$ Consider designing processors to make power proportional to load

