Discussion: WRITE_SAME performance
Martin Svec
2014-07-17 18:51:34 UTC
Hello,

I noticed that LIO's WRITE_SAME implementation doesn't perform as well as I would expect. In fact,
non-accelerated ESXi eager disk zeroing over 10GE iSCSI is twice as fast as the accelerated zeroing
in our test lab.

Initiator setup: Dell R300, ESXi 5.5, software iSCSI over 1GE/10GE.
Target setup: Dell R510, PERC H700 controller, 3x Intel S3700 SSD in RAID0, exported as iblock iSCSI
LUN, 3.15.5 kernel.

After some digging in the sources and playing with blktrace, I think the problem comes from the way
iblock_execute_write_same() handles sequential writes: it adds them to bios in small 512-byte
blocks, i.e. one sector at a time. By contrast, ESXi issues the non-accelerated writes as 128kB
blocks, and the difference in blktrace output is obvious. I tried tweaking the deadline scheduler
and other tunables, but with no luck. To verify my idea, I modified iblock_execute_write_same() so
that it submits full ZERO_PAGEs, by analogy with __blkdev_issue_zeroout(). In other words, I
increased the bio block size from 512 to 4096 bytes. This quick-and-dirty hack improved WRITE_SAME
performance several-fold; it is now close to the maximum sequential write speed of the raw device.
Moreover, target CPU usage and softirqs dropped significantly too.
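
To give an idea of the shape of the change (this is only a sketch of the approach, not my exact
patch; the function name and the end_io/private hooks are placeholders for the iblock completion
tracking, and error handling is left out):

#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/mm.h>

/*
 * Illustrative only: zero nr_sects 512-byte sectors starting at "sector" by
 * packing whole ZERO_PAGEs into each bio, the way __blkdev_issue_zeroout()
 * does, instead of one 512-byte block per bio_add_page() call. end_io and
 * private stand in for the iblock completion tracking; error handling is
 * omitted for brevity.
 */
static void iblock_submit_zero_bios(struct block_device *bdev, sector_t sector,
                                    sector_t nr_sects, bio_end_io_t *end_io,
                                    void *private)
{
        while (nr_sects) {
                struct bio *bio = bio_alloc(GFP_KERNEL,
                                min(nr_sects, (sector_t)BIO_MAX_PAGES));
                unsigned int sz;

                bio->bi_iter.bi_sector = sector; /* 3.15 immutable-biovec field */
                bio->bi_bdev = bdev;
                bio->bi_end_io = end_io;
                bio->bi_private = private;

                while (nr_sects) {
                        /* One whole ZERO_PAGE per segment instead of one sector. */
                        sz = min((sector_t)PAGE_SIZE >> 9, nr_sects);
                        if (bio_add_page(bio, ZERO_PAGE(0), sz << 9, 0) < (sz << 9))
                                break; /* bio is full, submit it and start a new one */
                        nr_sects -= sz;
                        sector += sz;
                }

                submit_bio(WRITE, bio); /* two-argument form in 3.15 */
        }
}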

Maybe iblock_execute_write_same() should prepare a longer contiguous data block before it starts
submitting it to the block layer? And/or use ZERO_PAGE when the payload is all zeroes (99% of cases
today, I guess).

Any ideas?

Thanks,
Martin
Nicholas A. Bellinger
2014-07-30 01:11:12 UTC
Hi Martin,

Apologies for the delayed response, comments are inline below.
Post by Martin Svec
Hello,
I noticed that LIO's WRITE_SAME implementation doesn't perform as well as I would expect. In fact,
non-accelerated ESXi eager disk zeroing over 10GE iSCSI is twice as fast as the accelerated zeroing
in our test lab.
Initiator setup: Dell R300, ESXi 5.5, software iSCSI over 1GE/10GE.
Target setup: Dell R510, PERC H700 controller, 3x Intel S3700 SSD in RAID0, exported as iblock iSCSI
LUN, 3.15.5 kernel.
After some digging in the sources and playing with blktrace, I think the problem comes from the way
iblock_execute_write_same() handles sequential writes: it adds them to bios in small 512-byte
blocks, i.e. one sector at a time. By contrast, ESXi issues the non-accelerated writes as 128kB
blocks, and the difference in blktrace output is obvious. I tried tweaking the deadline scheduler
and other tunables, but with no luck. To verify my idea, I modified iblock_execute_write_same() so
that it submits full ZERO_PAGEs, by analogy with __blkdev_issue_zeroout(). In other words, I
increased the bio block size from 512 to 4096 bytes. This quick-and-dirty hack improved WRITE_SAME
performance several-fold; it is now close to the maximum sequential write speed of the raw device.
Moreover, target CPU usage and softirqs dropped significantly too.
Maybe iblock_execute_write_same() should prepare a longer contiguous data block before it starts
submitting it to the block layer? And/or use ZERO_PAGE when the payload is all zeroes (99% of cases
today, I guess).
Any ideas?
Mmmm, there have been similar reports with ESX + WRITE_SAME recently, so
it's definitely an area ripe for optimization...

One approach that comes to mind is to allocate a scatterlist with a full
page (instead of a single 512-byte block), and then memcpy the single-block
payload repeatedly to fill the entire page. This should work as long as the
WRITE_SAME number of blocks (based upon the sector size) covers at least a
single page.
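
Roughly something like the sketch below (names and parameters are illustrative
only; the real payload would come from the command's data scatterlist, and the
page would need to be freed once the I/O completes):

#include <linux/gfp.h>
#include <linux/highmem.h>
#include <linux/scatterlist.h>
#include <linux/string.h>

/*
 * Illustrative only: replicate one logical block of WRITE_SAME payload
 * across a whole page, so each scatterlist entry (and bio segment) carries
 * PAGE_SIZE bytes instead of a single block. block_buf/block_size are
 * placeholders for the payload taken from the command's data scatterlist;
 * the caller would have to free the page once the I/O completes.
 */
static struct page *iblock_fill_write_same_page(const void *block_buf,
                                                unsigned int block_size,
                                                struct scatterlist *sg)
{
        struct page *page;
        void *dst;
        unsigned int off;

        /* Only handle block sizes that divide the page size evenly. */
        if (!block_size || PAGE_SIZE % block_size)
                return NULL;

        page = alloc_page(GFP_KERNEL);
        if (!page)
                return NULL;

        dst = kmap(page);
        for (off = 0; off < PAGE_SIZE; off += block_size)
                memcpy(dst + off, block_buf, block_size);
        kunmap(page);

        sg_init_table(sg, 1);
        sg_set_page(sg, page, PAGE_SIZE, 0);
        return page;
}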

If possible, I'd prefer not to have a special case for ZERO_PAGE and to treat
all payloads the same, as the existing code does.

Adding hch to the CC, as he might have a better idea for doing this.

--nab
Christoph Hellwig
2014-07-30 01:14:27 UTC
For iblock, just use the new WRITE SAME infrastructure, i.e. a REQ_WRITE_SAME
bio, preferably submitted asynchronously rather than via the synchronous
blockdev helper.
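
As an illustration only, an asynchronous submission modeled on the internals of
the synchronous blkdev_issue_write_same() helper could look roughly like this
(the function name and the end_io/private hooks are placeholders; splitting
against max_write_same_sectors and the fallback for devices without WRITE SAME
support are omitted):

#include <linux/bio.h>
#include <linux/blkdev.h>

/*
 * Illustrative only: asynchronous REQ_WRITE_SAME submission, modeled on the
 * internals of the synchronous blkdev_issue_write_same() helper in 3.15.
 * "page" holds one logical block of payload; end_io/private stand in for the
 * target's completion tracking. Splitting against the device's
 * max_write_same_sectors limit, the fallback for devices that do not support
 * WRITE SAME, and error handling are all omitted.
 */
static void iblock_submit_write_same_bio(struct block_device *bdev,
                                         sector_t sector, sector_t nr_sects,
                                         struct page *page,
                                         bio_end_io_t *end_io, void *private)
{
        struct bio *bio = bio_alloc(GFP_KERNEL, 1);

        bio->bi_iter.bi_sector = sector;
        bio->bi_bdev = bdev;
        bio->bi_end_io = end_io;
        bio->bi_private = private;

        /* A WRITE_SAME bio carries a single block of payload... */
        bio->bi_vcnt = 1;
        bio->bi_io_vec->bv_page = page;
        bio->bi_io_vec->bv_offset = 0;
        bio->bi_io_vec->bv_len = bdev_logical_block_size(bdev);

        /* ...while bi_size covers the whole range to be written. */
        bio->bi_iter.bi_size = nr_sects << 9;

        submit_bio(REQ_WRITE | REQ_WRITE_SAME, bio); /* 3.15 two-arg form */
}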
