## ELE617E <br> Lectures

# Prof. Dr. Müștak E. Yalçın <br> Istanbul Technical University 

mustak.yalcin@itu.edu.tr

## Folding

Folding is a technique to reduce the silicon area by time-multiplexing many algorithm operations into single functional units (such as adders and multipliers).
reduce hardware resources (\# multipliers, \# adders), at the expense of reduced throughput, by mapping multiple operations on the same HW resource.

- Folding is not: from MIMO back to SISO.
- Typically: given required throughput (latency) of a function (application), apply folding or unfolding (along with other transformations) to achieve the required throughput at minimal hardware costs.


## Folding

$$
y(n)=a(n)+b(n)+c(n)
$$



$$
T_{\mathrm{clk}} \geq 2 T_{\mathrm{add}}
$$

Lets time-multiplexed on a single pipelined adder !


## Folding



| Cycle | Adder Input <br> (left) | Adder Input <br> (top) | System Output |
| :---: | :---: | :---: | :---: |
| 0 | $a(0)$ | $b(0)$ | - |
| 1 | $a(0)+b(0)$ | $c(0)$ | - |
| 2 | $a(1)$ | $b(1)$ | $a(0)+b(0)+c(0)$ |
| 3 | $a(1)+b(1)$ | $c(1)$ | - |
| 4 | $a(2)$ | $b(2)$ | $a(1)+b(1)+c(1)$ |
| 5 | $a(2)+b(2)$ | $c(2)$ | - |

## Folding



| Cycle | Adder Input <br> (left) | Adder Input <br> (top) | System Output |
| :---: | :---: | :---: | :---: |
| 0 | $a(0)$ | $b(0)$ | - |
| 1 | $a(0)+b(0)$ | $c(0)$ | - |
| 2 | $a(1)$ | $b(1)$ | $a(0)+b(0)+c(0)$ |
| 3 | $a(1)+b(1)$ | $c(1)$ | - |
| 4 | $a(2)$ | $b(2)$ | $a(1)+b(1)+c(1)$ |
| 5 | $a(2)+b(2)$ | $c(2)$ | - |

One output sample is produced every 2 clock cycles !

## Folding Transformation

A systematic technique for designing control circuits for HW where alg. operations are time-multiplexed on a single functional unit.

- $N$ folding factor: the number of operations folded to a single functional unit.
- I: iteration cycle
- $u, v$ folding orders: $0 \leq u, v \leq N-1$
- $H_{u}$ : the functional unit that executes the node $U$.
- $P_{U}$ : pipelining stages of the functional unit of $H_{U}$
- $w(e)$ : delays of edge $U \xrightarrow{e} V$
- $N I+v$ : is the time units at which $I-t h$ iteration of the $v$ is scheduled.
- Folding Set: An ordered set of $N$ operations executed by the same functional unit. $S_{1}=\left\{A_{1}, 0, A_{2}\right\}$ is for folding order $N=3 . A_{1}$ has a folding order of 0 and $A_{2}$ of 2 and are respectively denoted by $\left(S_{1} \mid 0\right)$ and $\left(S_{1} \mid 2\right)$.


## Folding Transformation

- $N$ folding factor: the number of op. folded to a single func. unit.
- I: iteration cycle
- $u, v$ folding orders: $0 \leq u, v \leq N-1$
- $H_{u}$ : the functional unit that executes the node $U$.
- $P_{U}$ : pipelining stages of the functional unit of $H_{U}$
- $w(e)$ : delays of edge $U \xrightarrow{e} U$
- $N I+v$ : is the time units at which /th iteration of the $V$ is scheduled.
- Folding Set: An ordered set of $N$ operations executed by the same functional unit.



## Folding Transformation


the $l$-th iteration of $U$ is used by $(I+w(e))$-th iteration of node $V$

$H_{U}$ 's output is available at $N I+u+P_{U}$ and The node $V$ is executed at $N(I+w(e))+v$.
So, the result should be stored for

$$
\begin{aligned}
D_{F}(U \longrightarrow V) & =[N(I+w(e))+v]-\left[N I+P_{u}+u\right] \\
& =N w(e)-P_{u}+v-u
\end{aligned}
$$

independent of $I$.

## Folding Transformation

Folding a retimed biquad filter by $\mathrm{N}=4$.


Folding Sets : $S_{1}=\{4,2,3,1\}$ and $S_{2}=\{5,8,6,7\}$, Addition time $=$ 1 u.t., Multiplication time $=2$ u.t., 1 stage pipelined adder and 2 stage pipelined multiplier(i.e., $P_{A}=1$ and $P_{M}=2$ )

## Folding Transformation



## Folding Transformation



$$
\begin{aligned}
D_{F}(1 \rightarrow 2) & =N w(e)-P_{u}+v-u \\
& =N w(e)-P_{u}+v-u \\
& =4(1)-1+1-3=1 \\
D_{F}(1 \rightarrow 8) & =4(2)-1+1-3=5 \\
D_{F}(8 \rightarrow 4) & =4(1)-2+0-1=1
\end{aligned}
$$



Obtain $D_{F}$ for $N=4$


$$
\begin{aligned}
D_{F}(6 \rightarrow 4) & =N w(e)-P_{u}+v-u \\
& =4(0)-2+0-2=-4 \\
D_{F}(7 \rightarrow 3) & =4(0)-2+2-3=-3
\end{aligned}
$$

A folded architecture is realizable if and only if all delays $D_{F}(U \rightarrow V)$ are non-negative

Retiming for folding

- All $D_{F}^{\prime}(U \rightarrow V)$ of the retimed graph $G^{\prime}$ are non-negative
- $N w_{r}(e)-P_{U}+v-u \geq 0$ where $w_{r}(e)=w(e)+r(V)-r(U)$ So

$$
\begin{aligned}
& N(w(e)+r(V)-r(U))-P_{U}+v-u \geq 0 \\
& N(r(V)-r(U))+N\left(w(e)-P_{U}+v-u\right. \geq 0 \\
& N(r(V)-r(U))+D_{F}(U \rightarrow V) \geq 0 \\
& r(U)-r(V) \leq\left\lfloor\frac{D_{F}(U \rightarrow V)}{N}\right\rfloor
\end{aligned}
$$

to determine if a solution exists and to a solution if one indeed exists (like before :( ).

$D_{F}(7 \rightarrow 3)=-3$ then $r(7)-r(3) \leq\left\lfloor\frac{-3}{4}\right\rfloor=-1 \odot$ Folyd.Warshall.m

| Inf | Inf | 0 | Inf | Inf | Inf | Inf | Inf | Inf |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| -1 | Inf | Inf | 0 | Inf | Inf | Inf | Inf | Inf |
| Inf | Inf | Inf | Inf | 0 | Inf | -1 | Inf | Inf |
| Inf | Inf | Inf | Inf | Inf | -1 | Inf | -1 | Inf |
| 0 | Inf | Inf | Inf | Inf | Inf | Inf | Inf | Inf |
| 0 | Inf | Inf | Inf | Inf | Inf | Inf | Inf | Inf |
| 1 | Inf | Inf | Inf | Inf | Inf | Inf | Inf | Inf |
| 1 | Inf | Inf | Inf | Inf | Inf | Inf | Inf | Inf |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

> Folyd_Warshall(w)
ans $=$

| 0 | Inf | 0 | Inf | 0 | Inf | -1 | Inf | Inf | Inf |
| ---: | :---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| -1 | Inf | -1 | 0 | -1 | -1 | -2 | -1 | Inf | Inf |
| 0 | Inf | 0 | Inf | 0 | Inf | -1 | Inf | Inf | Inf |
| -1 | Inf | -1 | Inf | -1 | -1 | -2 | -1 | Inf | Inf |
| 0 | Inf | 0 | Inf | 0 | Inf | -1 | Inf | Inf | Inf |
| 0 | Inf | 0 | Inf | 0 | Inf | -1 | Inf | Inf | Inf |
| 1 | Inf | 1 | Inf | 1 | Inf | 0 | Inf | Inf | Inf |
| 1 | Inf | 1 | Inf | 1 | Inf | 0 | Inf | Inf | Inf |
| -1 | 0 | -1 | 0 | -1 | -1 | -2 | -1 | 0 | Inf |
| -1 | 0 | -1 | 0 | -1 | -1 | -2 | -1 | 0 | 0 |



## Register Minimization

- Folding may insert registers.
- Life time analysis is used for register minimization techniques in a DSP hardware
- A variable is live from the time it is produced until the time it is consumed. After then, it is dead. During that time, the variable is stored in a register.
- Linear life time chart:represents the lifetime of the variables in a linear fashion.
- Max. number of live variables in linear lifetime chart Min. number of registers in implementation


## Life Time Analysis

a procedure used to compute the minimum number of registers required to implement a DSP algorithm in hardware!
A variable is live from the time it is produced through the time it is consumed (dead).
A variable occupies one register while it is live. In lifetime analysis, the number of live variables at each time unit is computed, and the maximum number of live variables at any time unit is determined. Life time chart:


## Life Time Analysis

Life Time Analysis begins with lifetime table:
Ex: The transpose operation:

## ABCDEFGHI $\rightarrow$ ADGBEHCFI

| Sample | $T_{\text {input }}$ | $T_{\text {zlout }}$ | $T_{\text {diff }}$ | $T_{\text {output }}$ | Life Period |
| :---: | :---: | :---: | :---: | :---: | :---: |
| a | 0 | 0 | 0 | 4 | $0 \rightarrow 4$ |
| b | 1 | 3 | 2 | 7 | $1 \rightarrow 7$ |
| c | 2 | 6 | 4 | 10 | $2 \rightarrow 10$ |
| d | 3 | 1 | -2 | 5 | $3 \rightarrow 5$ |
| e | 4 | 4 | 0 | 8 | $4 \rightarrow 8$ |
| $f$ | 5 | 7 | 2 | 11 | $5 \rightarrow 11$ |
| g | 6 | 2 | -4 | 6 | $6 \rightarrow 6$ |
| h | 7 | 5 | -2 | 9 | $7 \rightarrow 9$ |
| i | 8 | 8 | 0 | 12 | $8 \rightarrow 12$ |
| $T_{\text {latency }}=\left\|\min \left\{T_{\text {diff }}\right\}\right\| T_{\text {output }}=T_{\text {zlout }}+T_{\text {latency }}$ |  |  |  |  |  |

## Life Time Analysis

| Sample | Life Period |
| :---: | :---: |
| a | $0 \rightarrow 4$ |
| b | $1 \rightarrow 7$ |
| c | $2 \rightarrow 10$ |
| d | $3 \rightarrow 5$ |
| e | $4 \rightarrow 8$ |
| f | $5 \rightarrow 11$ |
| g | $6 \rightarrow 6$ |
| h | $7 \rightarrow 9$ |
| i | $8 \rightarrow 12$ |



## Forward Backward Register Allocation

Determine the min. number of registers

- Input each variable at the time its life starts. If more than one use multiple registers such that the longest lifetime is allocated to the initial register.
- Each variable is allocated in a forward manner until it is dead or reaches the last register
- Allocation is periodic, all allocation to current iteration repeats itself after after N
- If reaches the last register and not dead allocate backward ((if more than one, choose one that has been allocated backward before), then forward again and so on.


## Forward Backward Register Allocation


allocation table

## Synthesis

| Cycle | I/P | R1 | R2 | R3 | R4 | O/P |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 0 | a |  |  |  |  |  |
| 1 | b | a |  |  |  |  |
| 2 | c |  | a |  |  |  |
| 3 | d | c | b | a |  |  |
| 4 | e | ${ }^{4}$ | c | b | a | a |
| 5 | f | e | ${ }^{\text {d }}$ | ${ }^{4}$ | b | d |
| 6 | g | ${ }^{4}$ | e | b* | c | g |
| 7 | h | ${ }^{4}$ | f | e | b | b |
| 8 | i | 4 | , | 1 | $e^{4}$ | e |
| 9 |  | 1 | h | ${ }_{c}$ | $f^{4}$ | h |
| 10 |  |  | 1 | $\mathrm{f}^{4}$ | c | c |
| 11 |  |  |  | $\mathrm{i}^{4}$ | $\mathrm{f}^{4}$ | f |
| 12 |  |  |  |  | 1 | i |
|  |  |  |  |  |  |  |



## Register Minimization for Folded Architecture

- Perform retiming for folding
- Write the folding equations
- Use the folding equations to construct a lifetime table
- Draw the lifetime chart and determine the minimum number of registers
- Perform forward-backward register allocation
- Draw the folded architecture


## Register Minimization for Folded Architecture

- Perform retiming for folding
- Write the folding equations


$$
\begin{aligned}
D_{F}(1 \rightarrow 2) & =N w(e)-P_{u}+v-u \\
& =4(1)-1+1-3=1 \\
D_{F}(1 \rightarrow 8) & =4(2)-1+1-3=5 \\
D_{F}(8 \rightarrow 4) & =4(1)-2+0-1=1
\end{aligned}
$$

## Register Minimization for Folded Architecture

- Use the folding equations to construct a lifetime table (Node $u$ is created at time $u+P_{u}$ and Node $u$ is consumed at time $u+P_{u}+\max _{v}\left\{\left(D_{F}(U \rightarrow V)\right\}\right)$

$$
\begin{aligned}
& \mathrm{D}_{\mathrm{F}}(\mathrm{U}, \mathrm{~V})=\mathrm{Nw}(\mathrm{e})-\mathrm{P}_{\mathrm{u}}+\mathrm{v}-\mathrm{u} \\
& D_{F}(1 \rightarrow 2)=4(1)-1+1-3=1 \\
& D_{F}(1 \rightarrow 5)=4(1)-1+0-3=0 \\
& D_{F}(1 \rightarrow 6)=4(1)-1+2-3=2 \\
& D_{F}(1 \rightarrow 7)=4(1)-1+3-3=3 \\
& D_{F}(1 \rightarrow 8)=4(2)-1+1-3=5 \\
& D_{F}(3 \rightarrow 1)=4(0)-1+3-2=0 \\
& D_{F}(4 \rightarrow 2)=4(0)-1+1-0=0 \\
& D_{F}(5 \rightarrow 3)=4(0)-2+2-0=0 \\
& D_{F}(6 \rightarrow 4)=4(1)-2+0-2=0 \\
& D_{F}(7 \rightarrow 3)=4(1)-2+2-3=1 \\
& D_{F}(8 \rightarrow 4)=4(1)-2+0-1=1 .
\end{aligned}
$$

| Node | $T_{\text {input }} \rightarrow T_{\text {output }}$ |
| :---: | :---: |
| 1 | $3+1 \rightarrow 3+1+5$ |
| 2 | - |
| 3 | $3 \rightarrow 3$ |
| 4 | $1 \rightarrow 1$ |
| 5 | $2 \rightarrow 2$ |
| 6 | $4 \rightarrow 4$ |
| 7 | $5 \rightarrow 6$ |
| 8 | $3 \rightarrow 4$ |

## Register Minimization for Folded Architecture

- Draw the lifetime chart and determine the minimum number of registers

| Node | $T_{\text {input }} \rightarrow T_{\text {output }}$ |
| :---: | :---: |
| 1 | $4 \rightarrow 9$ |
| 2 | - |
| 3 | $3 \rightarrow 3$ |
| 4 | $1 \rightarrow 1$ |
| 5 | $2 \rightarrow 2$ |
| 6 | $4 \rightarrow 4$ |
| 7 | $5 \rightarrow 6$ |
| 8 | $3 \rightarrow 4$ |



## Register Minimization for Folded Architecture

- Perform forward-backward register allocation


| cycle | $\mathrm{i} / \mathrm{p}$ | R 1 | R 2 | o/p |
| :---: | :--- | :--- | :--- | :--- |
| 0 |  |  |  |  |
| 1 |  |  |  |  |
| 2 |  |  |  |  |
| 3 | n 8 |  |  |  |
| 4 | n 1 | $(\mathrm{n} 8)$ |  | n 8 |
| 5 | n 7 | n 1 |  |  |
| 6 |  | $(\mathrm{n} 7)$ | n 1 | n 7 |
| 7 |  |  | n 1 |  |
| 8 |  |  | n 1 |  |
| 9 |  |  | (n1) | n 1 |

## Register Minimization for Folded Architecture

- Draw the folded architecture


$$
\begin{aligned}
& D_{F}(1 \rightarrow 2)=4(1)-1+1-3=1 \\
& D_{F}(1 \rightarrow 5)=4(1)-1+0-3=0 \\
& D_{F}(1 \rightarrow 6)=4(1)-1+2-3=2 \\
& D_{F}(1 \rightarrow 7)=4(1)-1+3-3=3 \\
& D_{F}(1 \rightarrow 8)=4(2)-1+1-3=5
\end{aligned}
$$

$$
\begin{array}{cc}
T_{\text {m,ot }} \rightarrow & T_{\text {mot }} \\
4 & 5 \\
4 & 4 \\
4 & 6- \\
4 & 7 \\
4 & 9-
\end{array}
$$

## Register Minimization for Folded Architecture

- Draw the folded architecture



## Register Minimization for Folded Architecture

- Draw the folded architecture



## Register Minimization for Folded Architecture

- Draw the folded architecture

| cycle | $\mathrm{i} / \mathrm{p}$ | R 1 | R 2 | $\mathrm{o} / \mathrm{p}$ |
| :--- | :--- | :--- | :--- | :--- | :--- |
| 0 |  |  |  |  |
| 1 |  |  |  |  |
| 2 |  |  |  |  |
| 3 | n 8 |  |  |  |
| 4 | n 1 | $\mathrm{n} 8)$ |  | n 8 |
| 5 | n 7 | n 1 |  |  |
| 6 |  | $(\mathrm{n} 7)$ | n 1 | n 7 |
| 7 |  |  | n 1 |  |
| 8 |  |  | n 1 |  |
| 9 |  |  | (n1) | n 1 |



HW: Page 167, IIR Filter Example.

