Llano do AVX

#1 2010-04-23 20:34:16

Author: OPTERON (Moderator)

A few observations suggest that AMD's Llano could execute AVX instructions.

1) A reasonably large new block next to the FP register file.
2) Something that could be a new 3-way extra decoding stage in front of the FP units.
3) A large increase in the size of the reorder buffer (from 3x24 to 3x32 or 3x36 entries).

-AVX code could run faster even if Llano still uses 128-bit hardware for the
256-bit operations, since many time slots in the FP units typically go unused.

-AVX performance would ultimately be limited by the cache bandwidth
to/from the SSE/AVX units (32 bytes/cycle versus 48 bytes/cycle for Sandy
Bridge).

-The 256-bit operations would be split into independent 128-bit operations,
which would explain the larger reorder buffer.

-The 3-way decode pack stage in front of the integer units has also grown,
which suggests that something was added to the decoding units
(cache access for 2x128-bit words?).
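To put the bandwidth point in numbers, here is a rough sketch. It assumes the quoted 32 and 48 bytes/cycle are aggregate L1 figures (loads plus stores) and that cache traffic is the only limit. The function name and the streaming `c = a + b` workload are made up for illustration and are not taken from any AMD or Intel document.

```python
VECTOR_BYTES = 32  # one 256-bit AVX register

def cycles_per_vector_add(l1_bytes_per_cycle, loads=2, stores=1):
    """Lower bound on cycles per 256-bit c = a + b when every operand goes through L1.

    Hypothetical helper: it only divides the bytes moved by the quoted bandwidth.
    """
    traffic = (loads + stores) * VECTOR_BYTES  # bytes moved per vector add
    return traffic / l1_bytes_per_cycle

llano = cycles_per_vector_add(32)         # 96 bytes at 32 bytes/cycle
sandy_bridge = cycles_per_vector_add(48)  # 96 bytes at 48 bytes/cycle
print(llano, sandy_bridge)
```

Under these assumptions a memory-bound loop would need 3 cycles per 256-bit add on Llano against 2 on Sandy Bridge, whatever the width of the FP units.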

------------------------------

Some extra points:

The second-level TLB for the data cache has been doubled from
512 entries to 1024 entries.

There is extra integer logic. A good guess would be a faster version
of the integer divider, one that can produce multiple result bits per cycle
like the ones in the Core 2 and Nehalem architectures.


#2 2010-04-23 20:35:19

Author: OPTERON (Moderator)

It's not that "crippled", and certainly not by a factor of 2 (=256/128). For example,
if a SIMD FP add takes 4 clock cycles, then:

128 bit: A+B+C takes 8 clock cycles.
256 bit: A+B+C takes 9 clock cycles. (using pipelined 128 bit hardware)

128 bit: A+B+C+D takes 9 clock cycles.
256 bit: A+B+C+D takes 11 clock cycles. (using pipelined 128 bit hardware)

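It all depends on how many unused time slots the data dependencies leave:
the second 128-bit half of a 256-bit operation can issue one cycle after the
first, in a slot that would otherwise sit idle. A bigger bottleneck for Llano
would be the L1 cache access bandwidth: 32 bytes/cycle for Llano versus
48 bytes/cycle for Sandy Bridge.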
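The cycle counts above can be reproduced with a toy scheduling model. This is only a sketch under stated assumptions, not a description of the real Llano pipeline: it assumes one fully pipelined 128-bit adder with 4-cycle latency that accepts one operation per cycle, issues operations in order, and sums the operands as a balanced tree. The function `cycles` and its parameters are hypothetical names chosen for this example.

```python
LATENCY = 4  # cycles per SIMD FP add, as assumed in the post

def cycles(n_operands, halves):
    """Cycles to sum n_operands vectors on one pipelined 128-bit adder.

    halves=1 models native 128-bit operations; halves=2 models a 256-bit
    operation split into two independent 128-bit operations.
    """
    # level[i][h] is the cycle at which half h of value i becomes available
    level = [[0] * halves for _ in range(n_operands)]
    next_issue = 0  # the adder accepts one 128-bit operation per cycle
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            out = []
            for h in range(halves):
                # wait for the adder slot and for both inputs of this half
                start = max(next_issue, level[i][h], level[i + 1][h])
                next_issue = start + 1
                out.append(start + LATENCY)
            nxt.append(out)
        if len(level) % 2:
            nxt.append(level[-1])  # odd operand is carried to the next level
        level = nxt
    return max(level[0])

for n in (3, 4):
    print(n, "operands:", cycles(n, 1), "cycles at 128 bit,",
          cycles(n, 2), "cycles at 256 bit")
```

In this model the 256-bit case costs only one or two extra cycles, because the second half of each operation fills an issue slot that the dependency chain leaves empty.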


#3 2010-04-23 20:43:14

Author: Celeron (Administrator)

OPTERON, please run it through Google Translate! My Mendocino hangs when I start translating...


Forum: Visiting Opteron and Celeron (Gladiator Fights in the Colosseum)