Discussion:
System crashes with increased drive count
Jun Wu
2014-05-09 02:17:35 UTC
We are running into system crashes as the number of drives under test
increases. The test configuration is one server as initiator running
fio sessions to remote drives on a target server via FCoE VN2VN. Both
servers run Fedora 20 (kernel 3.14.2-200). Running fio sessions to up
to 7 remote drives works, but the target machine hangs when the drive
count is increased to 8. The crashes are very repeatable and were
duplicated on RHEL 6.5. The error messages on the target server follow:


[ 1503.737314] BUG: unable to handle kernel NULL pointer dereference
at 0000000000000048
[ 1503.737442] IP: [<ffffffffa0610885>] ft_sess_put+0x5/0x30 [tcm_fc]
[ 1503.737540] PGD 0
[ 1503.737575] Oops: 0000 [#1] SMP
[ 1503.737631] Modules linked in: tcm_fc target_core_pscsi
target_core_file target_core_iblock iscsi_target_mod target_core_mod
fcoe libfcoe libfc scsi_transport_fc scsi_tgt 8021q garp mrp fuse
ip6t_rpfilter ip6t_REJECT xt_conntrack ebtable_nat ebtable_broute
bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6
nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security
ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4
nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle
iptable_security iptable_raw coretemp iTCO_wdt kvm_intel kvm gpio_ich
iTCO_vendor_support ses crc32c_intel tpm_tis enclosure i7core_edac
ioatdma edac_core shpchp serio_raw tpm lpc_ich mfd_core i2c_i801
microcode acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd sunrpc radeon
drm_kms_helper ttm
[ 1503.738933] igb ixgbe drm ata_generic mdio ptp pata_acpi pps_core
pata_jmicron i2c_algo_bit aacraid dca i2c_core
[ 1503.739118] CPU: 5 PID: 6537 Comm: kworker/5:4 Not tainted
3.14.2-200.fc20.x86_64 #1
[ 1503.739225] Hardware name: Supermicro X8DTN/X8DTN, BIOS 2.1c 10/28/2011
[ 1503.739338] Workqueue: target_completion target_complete_ok_work
[target_core_mod]
[ 1503.739449] task: ffff88062071d580 ti: ffff88061a322000 task.ti:
ffff88061a322000
[ 1503.739553] RIP: 0010:[<ffffffffa0610885>] [<ffffffffa0610885>]
ft_sess_put+0x5/0x30 [tcm_fc]
[ 1503.739681] RSP: 0018:ffff88061a323ce8 EFLAGS: 00010016
[ 1503.739755] RAX: 0000000000000000 RBX: ffff880304a23498 RCX: 0000000000009010
[ 1503.739853] RDX: 0000000000009010 RSI: 00000000000000cb RDI: 0000000000000000
[ 1503.739953] RBP: ffff88061a323d08 R08: ffff88031c4c6500 R09: 000000018020000f
[ 1503.740051] R10: ffffffff815cfe87 R11: ffffea000c713180 R12: ffff88031c4c6500
[ 1503.740150] R13: ffff88031f7c1f80 R14: ffff88031f7c1fe8 R15: 0000000000000000
[ 1503.740250] FS: 0000000000000000(0000) GS:ffff88063fc20000(0000)
knlGS:0000000000000000
[ 1503.740363] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 1503.740443] CR2: 0000000000000048 CR3: 0000000001c0c000 CR4: 00000000000007e0
[ 1503.740541] Stack:
[ 1503.740572] ffffffffa060e058 ffff880304a23568 ffff88031f7c1f80
ffff880304a234a8
[ 1503.740692] ffff88061a323d18 ffffffffa060e5e2 ffff88061a323d40
ffffffffa05cff42
[ 1503.740812] ffff880304a234a8 ffff880304a23568 0000000000000246
ffff88061a323d70
[ 1503.740931] Call Trace:
[ 1503.740969] [<ffffffffa060e058>] ? ft_free_cmd+0x58/0x60 [tcm_fc]
[ 1503.741057] [<ffffffffa060e5e2>] ft_release_cmd+0x12/0x20 [tcm_fc]
[ 1503.741150] [<ffffffffa05cff42>] target_release_cmd_kref+0x52/0x80
[target_core_mod]
[ 1503.741264] [<ffffffffa05d1bd3>] transport_release_cmd+0xd3/0xf0
[target_core_mod]
[ 1503.741377] [<ffffffffa05d1c28>]
transport_generic_free_cmd+0x38/0x250 [target_core_mod]
[ 1503.741491] [<ffffffffa060e600>] ft_check_stop_free+0x10/0x20 [tcm_fc]
[ 1503.741590] [<ffffffffa05cfe32>]
transport_cmd_check_stop+0xc2/0x140 [target_core_mod]
[ 1503.741708] [<ffffffffa05d3a97>]
target_complete_ok_work+0xe7/0x2d0 [target_core_mod]
[ 1503.741824] [<ffffffff810a6886>] process_one_work+0x176/0x430
[ 1503.741907] [<ffffffff810a74db>] worker_thread+0x11b/0x3a0
[ 1503.741985] [<ffffffff810a73c0>] ? rescuer_thread+0x370/0x370
[ 1503.742069] [<ffffffff810ae211>] kthread+0xe1/0x100
[ 1503.742138] [<ffffffff810ae130>] ? insert_kthread_work+0x40/0x40
[ 1503.742227] [<ffffffff816fef7c>] ret_from_fork+0x7c/0xb0
[ 1503.746668] [<ffffffff810ae130>] ? insert_kthread_work+0x40/0x40
[ 1503.751104] Code: 48 89 f0 48 89 c7 89 d6 48 89 e5 48 8b 49 10 48
89 ca e8 4f ed ff ff 5d c3 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 66
66 66 66 90 <8b> 47 48 85 c0 74 22 48 8d 47 48 f0 83 6f 48 01 74 09 c3
0f 1f
[ 1503.760558] RIP [<ffffffffa0610885>] ft_sess_put+0x5/0x30 [tcm_fc]
[ 1503.765145] RSP <ffff88061a323ce8>
[ 1503.769687] CR2: 0000000000000048
[ 1503.789003] ---[ end trace c7457ccb45bf0bc9 ]---



Before the target hangs, many messages like the following are printed
on the initiator:

fio: io_u error on file /dev/sdl: Input/output error
read offset=1030152192, buflen=4096

[ 3787.971900] sd 8:0:0:0: [sdl] Unhandled error code
[ 3787.971907] sd 8:0:0:0: [sdl]
[ 3787.971910] Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
[ 3787.971913] sd 8:0:0:0: [sdl] CDB:
[ 3787.971915] Read(10): 28 00 00 1e b3 70 00 00 08 00
[ 3787.971924] end_request: I/O error, dev sdl, sector 2012016
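For reference, the original post does not show the fio job options; a hypothetical reconstruction of one per-drive session (every option value below is an assumption, with one such job per remote drive) might be:

```shell
# Hypothetical per-drive fio job; /dev/sdl and all option values are
# assumptions, since the actual job file was not posted.
cat > drive-sdl.fio <<'EOF'
[read-sdl]
filename=/dev/sdl
rw=randread
bs=4k
ioengine=libaio
direct=1
iodepth=32
runtime=600
time_based=1
EOF
# Then run it with:  fio drive-sdl.fio
```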


Installation steps used:
yum install lldpad
yum install fcoe-utils
modprobe fcoe
yum install targetcli
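The package installs above omit the FCoE interface configuration; assuming the usual open-fcoe layout, the remaining VN2VN setup would look roughly like the following sketch (eth2 is a placeholder interface name, and the exact variables should be checked against the fcoe-utils version shipped with the distribution):

```shell
# Sketch only: each FCoE interface gets a config file derived from the
# shipped template (eth2 is a placeholder name).
cp /etc/fcoe/cfg-ethx /etc/fcoe/cfg-eth2

# In /etc/fcoe/cfg-eth2, for VN2VN without DCB, set:
#   FCOE_ENABLE="yes"
#   DCB_REQUIRED="no"
#   MODE="vn2vn"

systemctl start lldpad
systemctl start fcoe
fcoeadm -i          # verify the VN2VN port instantiated
```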

Before these tests, we also installed RHEL 6.5 and followed the
instructions at https://www.open-fcoe.org/. On RHEL, I was only able
to run fio to 3 target drives; using 4 target drives crashed the
target machine.

Any suggestions?

Thanks,

Jun
Nicholas A. Bellinger
2014-05-12 19:34:28 UTC
Post by Jun Wu
We are running in system crashes as number of drive under test
increases. The test configuration is one initiator as server running
fio sessions to remote drives on a target server via fcoe vn2vn. Both
servers running fedora 20 (kernel 3.14.2-200). Running fio sessions up
to 7 remote drives works but target machines hangs when drive count
increased to 8. The system crashes are very repeatable and duplicated
[snip: v3.14 ft_sess_put oops, quoted in full above]
The v3.14 oops above looks like a free-after-use regression from the
v3.13 conversion to percpu-ida pre-allocation of ft_cmd descriptors.

Here's the patch that I'm applying to address this specific bug in
tcm_fc. Please apply it and verify the fix on your end.
From 1d8dc8a29cfa6d66e5068ab6dad3216fe218cc53 Mon Sep 17 00:00:00 2001
From: Nicholas Bellinger <***@linux-iscsi.org>
Date: Mon, 12 May 2014 12:18:32 -0700
Subject: [PATCH] tcm_fc: Fix free-after-use regression in ft_free_cmd

This patch fixes a free-after-use regression in ft_free_cmd(), where
percpu_ida_free() was incorrectly called to release the tag before
ft_sess_put() is called to drop the session reference.

Fix this bug by moving the percpu_ida_free() call after ft_free_cmd().

The regression was originally introduced in v3.13-rc1 commit:

commit 5f544cfac956971099e906f94568bc3fd1a7108a
Author: Nicholas Bellinger <***@daterainc.com>
Date: Mon Sep 23 12:12:42 2013 -0700

tcm_fc: Convert to per-cpu command map pre-allocation of ft_cmd

Reported-by: Jun Wu <***@stormojo.com>
Cc: Mark Rustad <***@intel.com>
Cc: Robert Love <***@intel.com>
Cc: <***@vger.kernel.org> #3.13+
Signed-off-by: Nicholas Bellinger <***@linux-iscsi.org>
---
drivers/target/tcm_fc/tfc_cmd.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/target/tcm_fc/tfc_cmd.c b/drivers/target/tcm_fc/tfc_cmd.c
index 01cf37f..28fce39 100644
--- a/drivers/target/tcm_fc/tfc_cmd.c
+++ b/drivers/target/tcm_fc/tfc_cmd.c
@@ -100,8 +100,8 @@ static void ft_free_cmd(struct ft_cmd *cmd)
 	if (fr_seq(fp))
 		lport->tt.seq_release(fr_seq(fp));
 	fc_frame_free(fp);
-	percpu_ida_free(&se_sess->sess_tag_pool, cmd->se_cmd.map_tag);
 	ft_sess_put(cmd->sess);	/* undo get from lookup at recv */
+	percpu_ida_free(&se_sess->sess_tag_pool, cmd->se_cmd.map_tag);
 }
 
 void ft_release_cmd(struct se_cmd *se_cmd)
--
1.7.10.4
Post by Jun Wu
Before target hangs, a lot of messages as follows are printed out on
fio: io_u error on file /dev/sdl: Input/output error
read offset=1030152192, buflen=4096
[ 3787.971900] sd 8:0:0:0: [sdl] Unhandled error code
[ 3787.971907] sd 8:0:0:0: [sdl]
[ 3787.971910] Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
[ 3787.971915] Read(10): 28 00 00 1e b3 70 00 00 08 00
[ 3787.971924] end_request: I/O error, dev sdl, sector 2012016
Not sure what's going on here without more information.
Post by Jun Wu
yum install lldpad
yum install fcoe-utils
modprobe fcoe
yum install targetcli
Before these tests, we also installed Redhat 6.5 and followed
instructions on https://www.open-fcoe.org/. On Redhat, I were only
able to run fio to 3 target drives. Using 4 target drives crashed the
target machine.
No idea without more info wrt RHEL 6.5, but it certainly doesn't have
the v3.13+ specific percpu-ida regression from above.

--nab
Jun Wu
2014-05-13 21:52:36 UTC
Hi Nicholas,

We had to roll the system back from 3.14 to 3.11 due to compile issues
with our software, so I am not able to verify your fix at this point.
I ran the same tests on 3.11 instead.

In one case the target crashed with the following message:

May 13 13:06:25 poc2 kernel: BUG: unable to handle kernel paging
request at ffffffffffffffa4
May 13 13:06:25 poc2 kernel: IP: [<ffffffff8164ac07>]
_raw_spin_lock_bh+0x17/0x40
May 13 13:06:25 poc2 kernel: PGD 1c0f067 PUD 1c11067 PMD 0
May 13 13:06:25 poc2 kernel: Oops: 0002 [#1] SMP
May 13 13:06:25 poc2 kernel: Modules linked in: fcoe libfcoe 8021q
garp mrp tcm_fc libfc scsi_transport_fc scsi_tgt target_core_pscsi
target_core_file target_core_iblock iscsi_target_mod target_core_mod
ip6t_rpfilter ip6t_REJECT xt_conntrack ebtable_nat ebtable_broute
bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6
nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security
ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4
nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle
iptable_security iptable_raw nfsd auth_rpcgss nfs_acl lockd sunrpc
ixgbe mdio igb ptp pps_core serio_raw ses enclosure iTCO_wdt
iTCO_vendor_support lpc_ich mfd_core shpchp i2c_i801 coretemp
kvm_intel kvm crc32c_intel microcode i7core_edac ioatdma acpi_cpufreq
edac_core dca mperf radeon i2c_algo_bit
May 13 13:06:25 poc2 kernel: drm_kms_helper ttm drm ata_generic
i2c_core pata_acpi pata_jmicron aacraid
May 13 13:06:25 poc2 kernel: CPU: 0 PID: 1810 Comm: kworker/0:0 Not
tainted 3.11.10-301.fc20.x86_64 #1
May 13 13:06:25 poc2 kernel: Hardware name: Supermicro X8DTN/X8DTN,
BIOS 2.1c 10/28/2011
May 13 13:06:25 poc2 kernel: Workqueue: target_completion
target_complete_ok_work [target_core_mod]
May 13 13:06:25 poc2 kernel: task: ffff88032c5096e0 ti:
ffff88031bb78000 task.ti: ffff88031bb78000
May 13 13:06:25 poc2 kernel: RIP: 0010:[<ffffffff8164ac07>]
[<ffffffff8164ac07>] _raw_spin_lock_bh+0x17/0x40
May 13 13:06:25 poc2 kernel: RSP: 0018:ffff88031bb79cf0 EFLAGS: 00010206
May 13 13:06:25 poc2 kernel: RAX: 0000000000000100 RBX:
ffffffffffffffa4 RCX: 0000000000000000
May 13 13:06:25 poc2 kernel: RDX: 0000000000000000 RSI:
0000000000000000 RDI: ffffffffffffffa4
May 13 13:06:25 poc2 kernel: RBP: ffff88031bb79cf8 R08:
00000000ffffffff R09: ffff88031a37f678
May 13 13:06:25 poc2 kernel: R10: 0000000000000001 R11:
0000000000000044 R12: 0000000000000000
May 13 13:06:25 poc2 kernel: R13: ffff88031a37f678 R14:
ffff88062d9fd6c8 R15: ffff88032c6da05c
May 13 13:06:25 poc2 kernel: FS: 0000000000000000(0000)
GS:ffff880333c00000(0000) knlGS:0000000000000000
May 13 13:06:25 poc2 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
May 13 13:06:25 poc2 kernel: CR2: ffffffffffffffa4 CR3:
0000000001c0c000 CR4: 00000000000007f0
May 13 13:06:25 poc2 kernel: Stack:
May 13 13:06:25 poc2 kernel: ffffffffffffffa4 ffff88031bb79d18
ffffffffa0594d2b ffff880328d13410
May 13 13:06:25 poc2 kernel: ffff88031a37c200 ffff88031bb79d58
ffffffffa05356f2 0000000000000018
May 13 13:06:25 poc2 kernel: ffff88062cea8800 0000000000000000
ffffea000c8eb640 0000000000000000
May 13 13:06:25 poc2 kernel: Call Trace:
May 13 13:06:25 poc2 kernel: [<ffffffffa0594d2b>]
fc_seq_start_next+0x1b/0x40 [libfc]
May 13 13:06:25 poc2 kernel: [<ffffffffa05356f2>]
ft_queue_status+0xf2/0x220 [tcm_fc]
May 13 13:06:25 poc2 kernel: [<ffffffffa0536972>]
ft_queue_data_in+0x72/0x5a0 [tcm_fc]
May 13 13:06:25 poc2 kernel: [<ffffffffa04f57ba>]
target_complete_ok_work+0x14a/0x2b0 [target_core_mod]
May 13 13:06:25 poc2 kernel: [<ffffffff810810f5>] process_one_work+0x175/0x430
May 13 13:06:25 poc2 kernel: [<ffffffff81081d1b>] worker_thread+0x11b/0x3a0
May 13 13:06:25 poc2 kernel: [<ffffffff81081c00>] ? rescuer_thread+0x340/0x340
May 13 13:06:25 poc2 kernel: [<ffffffff81088660>] kthread+0xc0/0xd0
May 13 13:06:25 poc2 kernel: [<ffffffff810885a0>] ?
insert_kthread_work+0x40/0x40
May 13 13:06:25 poc2 kernel: [<ffffffff8165332c>] ret_from_fork+0x7c/0xb0
May 13 13:06:25 poc2 kernel: [<ffffffff810885a0>] ?
insert_kthread_work+0x40/0x40
May 13 13:06:25 poc2 kernel: Code: 1f 44 00 00 f3 90 0f b6 07 38 d0 75
f7 5d c3 0f 1f 44 00 00 66 66 66 66 90 55 48 89 e5 53 48 89 fb e8 7e
05 a2 ff b8 00 01 00 00 <f0> 66 0f c1 03 0f b6 d4 38 c2 74 0e 0f 1f 44
00 00 f3 90 0f b6
May 13 13:06:25 poc2 kernel: RIP [<ffffffff8164ac07>]
_raw_spin_lock_bh+0x17/0x40


In another case, the initiator crashed with:

May 13 12:00:47 poc1 kernel: [ 4086.708455] WARNING: CPU: 1 PID: 1869
at lib/list_debug.c:62 __list_del_entry+0x82/0xd0()
May 13 12:00:47 poc1 kernel: [ 4086.708459] list_del corruption.
next->prev should be ffff88061dab0318, but was ffff88061d257318
May 13 12:00:47 poc1 kernel: [ 4086.708461] Modules linked in: fcoe
libfcoe 8021q garp mrp tcm_fc libfc scsi_transport_fc scsi_tgt
target_core_pscsi target_core_file target_core_iblock iscsi_target_mod
target_core_mod nf_conntrack_netbios_ns nf_conntrack_broadcast
ipt_MASQUERADE ip6t_REJECT xt_conntrack ebtable_nat ebtable_broute
bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6
nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security
ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4
nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle
iptable_security iptable_raw coretemp kvm_intel kvm crc32c_intel
iTCO_wdt iTCO_vendor_support microcode serio_raw i2c_i801 ses igb
enclosure lpc_ich mfd_core ixgbe ptp pps_core mdio i7core_edac ioatdma
edac_core dca shpchp acpi_cpufreq mperf nfsd auth_rpcgss nfs_acl lockd
sunrpc radeon i2c_algo_bit drm_kms_helper ttm drm ata_generic i2c_core
pata_acpi pata_jmicron aacraid
May 13 12:00:47 poc1 kernel: [ 4086.708556] CPU: 1 PID: 1869 Comm:
fcoethread/1 Not tainted 3.11.10-301.fc20.x86_64 #1
May 13 12:00:47 poc1 kernel: [ 4086.708558] Hardware name: Supermicro
X8DTN/X8DTN, BIOS 2.1c 10/28/2011
May 13 12:00:47 poc1 kernel: [ 4086.708561] 0000000000000009
ffff8806129dfb40 ffffffff816441db ffff8806129dfb88
May 13 12:00:47 poc1 kernel: [ 4086.708569] ffff8806129dfb78
ffffffff8106715d ffff88061dab0318 ffff88061dab0a00
May 13 12:00:47 poc1 kernel: [ 4086.708576] 0000000000000286
ffff880c1b5e4388 0000000000000030 ffff8806129dfbd8
May 13 12:00:47 poc1 kernel: [ 4086.708582] Call Trace:
May 13 12:00:47 poc1 kernel: [ 4086.708592] [<ffffffff816441db>]
dump_stack+0x45/0x56
May 13 12:00:47 poc1 kernel: [ 4086.708598] [<ffffffff8106715d>]
warn_slowpath_common+0x7d/0xa0
May 13 12:00:47 poc1 kernel: [ 4086.708602] [<ffffffff810671cc>]
warn_slowpath_fmt+0x4c/0x50
May 13 12:00:47 poc1 kernel: [ 4086.708608] [<ffffffff81311dc2>]
__list_del_entry+0x82/0xd0
May 13 12:00:47 poc1 kernel: [ 4086.708613] [<ffffffff81311e1d>]
list_del+0xd/0x30
May 13 12:00:47 poc1 kernel: [ 4086.708624] [<ffffffffa05de23c>]
fc_io_compl+0x1cc/0x710 [libfc]
May 13 12:00:47 poc1 kernel: [ 4086.708633] [<ffffffffa05de7df>]
fc_fcp_complete_locked+0x5f/0x1a0 [libfc]
May 13 12:00:47 poc1 kernel: [ 4086.708642] [<ffffffffa05dfac9>]
fc_fcp_resp.isra.22+0x79/0x2f0 [libfc]
May 13 12:00:47 poc1 kernel: [ 4086.708651] [<ffffffff810a2a33>] ?
load_balance+0xe3/0x740
May 13 12:00:47 poc1 kernel: [ 4086.708660] [<ffffffffa05e0424>]
fc_fcp_recv+0x6e4/0xef0 [libfc]
May 13 12:00:47 poc1 kernel: [ 4086.708666] [<ffffffff810115ce>] ?
__switch_to+0x13e/0x4b0
May 13 12:00:47 poc1 kernel: [ 4086.708673] [<ffffffff8164aab5>] ?
_raw_spin_unlock_bh+0x15/0x20
May 13 12:00:47 poc1 kernel: [ 4086.708682] [<ffffffffa05dfd40>] ?
fc_fcp_resp.isra.22+0x2f0/0x2f0 [libfc]
May 13 12:00:47 poc1 kernel: [ 4086.708690] [<ffffffffa05d421b>]
fc_exch_recv+0x8eb/0xd70 [libfc]
May 13 12:00:47 poc1 kernel: [ 4086.708695] [<ffffffffa0613299>]
fcoe_percpu_receive_thread+0x299/0x540 [fcoe]
May 13 12:00:47 poc1 kernel: [ 4086.708699] [<ffffffffa0613000>] ?
fcoe_set_port_id+0x50/0x50 [fcoe]
May 13 12:00:47 poc1 kernel: [ 4086.708705] [<ffffffff81088660>]
kthread+0xc0/0xd0
May 13 12:00:47 poc1 kernel: [ 4086.708710] [<ffffffff810885a0>] ?
insert_kthread_work+0x40/0x40
May 13 12:00:47 poc1 kernel: [ 4086.708717] [<ffffffff8165332c>]
ret_from_fork+0x7c/0xb0
May 13 12:00:47 poc1 kernel: [ 4086.708723] [<ffffffff810885a0>] ?
insert_kthread_work+0x40/0x40
May 13 12:00:47 poc1 kernel: [ 4086.708728] ---[ end trace 61dc774d1f379191 ]---

[***@poc1 log]# lspci | grep 82599
08:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit
SFI/SFP+ Network Connection (rev 01)

[***@poc1 log]# uname -a
Linux poc1 3.11.10-301.fc20.x86_64 #1 SMP Thu Dec 5 14:01:17 UTC 2013
x86_64 x86_64 x86_64 GNU/Linux

Thanks,

Jun

On Mon, May 12, 2014 at 12:34 PM, Nicholas A. Bellinger wrote:
[snip: full quote of previous message]
Nicholas A. Bellinger
2014-05-14 21:39:22 UTC
Post by Jun Wu
Hi Nicholas,
We had to roll back system from 3.14 to 3.11 due to compile issues of
our software. So I am not able to verify your fix at this point.
That is unfortunate; you're stuck on a now unsupported stable kernel.
There are some other libfc related fixes that went in during the
v3.13 timeframe, so I'd strongly recommend upgrading to at least that
stable version.

In any event, I'll be pushing that particular >= v3.13.y patch anyway,
as it's an obvious regression bugfix for percpu-ida pre-allocation.
Post by Jun Wu
I ran the same tests on 3.11 instead.
[snip: 3.11 target oops, quoted in full above]
So before we start debugging again, please confirm that this is a
*completely* stock v3.11.10 build, and that you're not building
out-of-tree target modules again.
Post by Jun Wu
May 13 12:00:47 poc1 kernel: [ 4086.708455] WARNING: CPU: 1 PID: 1869
at lib/list_debug.c:62 __list_del_entry+0x82/0xd0()
May 13 12:00:47 poc1 kernel: [ 4086.708459] list_del corruption.
next->prev should be ffff88061dab0318, but was ffff88061d257318
May 13 12:00:47 poc1 kernel: [ 4086.708461] Modules linked in: fcoe
libfcoe 8021q garp mrp tcm_fc libfc scsi_transport_fc scsi_tgt
target_core_pscsi target_core_file target_core_iblock iscsi_target_mod
target_core_mod nf_conntrack_netbios_ns nf_conntrack_broadcast
ipt_MASQUERADE ip6t_REJECT xt_conntrack ebtable_nat ebtable_broute
bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6
nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security
ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4
nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle
iptable_security iptable_raw coretemp kvm_intel kvm crc32c_intel
iTCO_wdt iTCO_vendor_support microcode serio_raw i2c_i801 ses igb
enclosure lpc_ich mfd_core ixgbe ptp pps_core mdio i7core_edac ioatdma
edac_core dca shpchp acpi_cpufreq mperf nfsd auth_rpcgss nfs_acl lockd
sunrpc radeon i2c_algo_bit drm_kms_helper ttm drm ata_generic i2c_core
pata_acpi pata_jmicron aacraid
CPU: 1 PID: 1869 Comm: fcoethread/1 Not tainted 3.11.10-301.fc20.x86_64 #1
May 13 12:00:47 poc1 kernel: [ 4086.708558] Hardware name: Supermicro
X8DTN/X8DTN, BIOS 2.1c 10/28/2011
May 13 12:00:47 poc1 kernel: [ 4086.708561] 0000000000000009 ffff8806129dfb40 ffffffff816441db ffff8806129dfb88
May 13 12:00:47 poc1 kernel: [ 4086.708569] ffff8806129dfb78 ffffffff8106715d ffff88061dab0318 ffff88061dab0a00
May 13 12:00:47 poc1 kernel: [ 4086.708576] 0000000000000286 ffff880c1b5e4388 0000000000000030 ffff8806129dfbd8
May 13 12:00:47 poc1 kernel: [ 4086.708592] [<ffffffff816441db>] dump_stack+0x45/0x56
May 13 12:00:47 poc1 kernel: [ 4086.708598] [<ffffffff8106715d>] warn_slowpath_common+0x7d/0xa0
May 13 12:00:47 poc1 kernel: [ 4086.708602] [<ffffffff810671cc>] warn_slowpath_fmt+0x4c/0x50
May 13 12:00:47 poc1 kernel: [ 4086.708608] [<ffffffff81311dc2>] __list_del_entry+0x82/0xd0
May 13 12:00:47 poc1 kernel: [ 4086.708613] [<ffffffff81311e1d>] list_del+0xd/0x30
May 13 12:00:47 poc1 kernel: [ 4086.708624] [<ffffffffa05de23c>] fc_io_compl+0x1cc/0x710 [libfc]
May 13 12:00:47 poc1 kernel: [ 4086.708633] [<ffffffffa05de7df>] fc_fcp_complete_locked+0x5f/0x1a0 [libfc]
May 13 12:00:47 poc1 kernel: [ 4086.708642] [<ffffffffa05dfac9>] fc_fcp_resp.isra.22+0x79/0x2f0 [libfc]
May 13 12:00:47 poc1 kernel: [ 4086.708651] [<ffffffff810a2a33>] ? load_balance+0xe3/0x740
May 13 12:00:47 poc1 kernel: [ 4086.708660] [<ffffffffa05e0424>] fc_fcp_recv+0x6e4/0xef0 [libfc]
May 13 12:00:47 poc1 kernel: [ 4086.708666] [<ffffffff810115ce>] ? __switch_to+0x13e/0x4b0
May 13 12:00:47 poc1 kernel: [ 4086.708673] [<ffffffff8164aab5>] ? _raw_spin_unlock_bh+0x15/0x20
May 13 12:00:47 poc1 kernel: [ 4086.708682] [<ffffffffa05dfd40>] ? fc_fcp_resp.isra.22+0x2f0/0x2f0 [libfc]
May 13 12:00:47 poc1 kernel: [ 4086.708690] [<ffffffffa05d421b>] fc_exch_recv+0x8eb/0xd70 [libfc]
May 13 12:00:47 poc1 kernel: [ 4086.708695] [<ffffffffa0613299>] fcoe_percpu_receive_thread+0x299/0x540 [fcoe]
May 13 12:00:47 poc1 kernel: [ 4086.708699] [<ffffffffa0613000>] ? fcoe_set_port_id+0x50/0x50 [fcoe]
May 13 12:00:47 poc1 kernel: [ 4086.708705] [<ffffffff81088660>] kthread+0xc0/0xd0
May 13 12:00:47 poc1 kernel: [ 4086.708710] [<ffffffff810885a0>] ? insert_kthread_work+0x40/0x40
May 13 12:00:47 poc1 kernel: [ 4086.708717] [<ffffffff8165332c>] ret_from_fork+0x7c/0xb0
May 13 12:00:47 poc1 kernel: [ 4086.708723] [<ffffffff810885a0>] ? insert_kthread_work+0x40/0x40
May 13 12:00:47 poc1 kernel: [ 4086.708728] ---[ end trace 61dc774d1f379191 ]---
No idea on the initiator-side issue. Intel folks..? (Adding
openfcoe-dev CC)

--nab
Post by Jun Wu
08:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit
SFI/SFP+ Network Connection (rev 01)
Linux poc1 3.11.10-301.fc20.x86_64 #1 SMP Thu Dec 5 14:01:17 UTC 2013
x86_64 x86_64 x86_64 GNU/Linux
Thanks,
Jun Wu
2014-05-15 00:23:24 UTC
Permalink
On Wed, May 14, 2014 at 2:39 PM, Nicholas A. Bellinger
Post by Nicholas A. Bellinger
Post by Jun Wu
Hi Nicholas,
We had to roll back the system from 3.14 to 3.11 due to compile issues
with our software, so I am not able to verify your fix at this point.
That is unfortunate that you're stuck on a now-unsupported stable kernel.
There are some other libfc-related fixes that have gone in during the
v3.13 timeframe, so I'd strongly recommend upgrading to at least that
stable version.
In any event, I'll be pushing that particular >= v3.13.y patch anyway,
as it's an obvious regression bugfix for percpu-ida pre-allocation.
Post by Jun Wu
I ran the same tests on 3.11 instead.
[... quoted poc2 oops snipped; identical to the _raw_spin_lock_bh trace above ...]
So before we start debugging again, please confirm that this is a
*completely* stock v3.11.10 build, and that you're not building
out-of-tree target modules again.
Yes, we installed the official Fedora 20 ISO image and then ran:

yum install targetcli
modprobe fcoe

Kernel version is 3.11.10-301.fc20.x86_64, which is a stock distribution kernel.
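For completeness, the target-side bring-up those two commands imply looks roughly like the fragment below. This is only a sketch of the kind of configuration used here, not the poster's exact steps: the interface name (eth2), backing device (/dev/sdb), and WWPN are placeholders, and the MODE="vn2vn" setting assumes the fcoe-utils package shipped with Fedora 20.

```shell
# Install the target CLI and the FCoE userland (as in the steps above).
yum install -y targetcli fcoe-utils

# Enable FCoE in VN2VN mode on the 10G interface (eth2 is a placeholder).
cp /etc/fcoe/cfg-ethx /etc/fcoe/cfg-eth2
sed -i -e 's/^FCOE_ENABLE=.*/FCOE_ENABLE="yes"/' \
       -e 's/^DCB_REQUIRED=.*/DCB_REQUIRED="no"/' /etc/fcoe/cfg-eth2
echo 'MODE="vn2vn"' >> /etc/fcoe/cfg-eth2

modprobe fcoe
systemctl start fcoe

# Export a block device through the tcm_fc fabric (WWPN is a placeholder).
targetcli /backstores/iblock create name=disk0 dev=/dev/sdb
targetcli /tcm_fc create 20:00:00:1b:21:00:00:01
targetcli /tcm_fc/20:00:00:1b:21:00:00:01/luns create /backstores/iblock/disk0
targetcli saveconfig
```

The exact targetcli paths differ between the older rtslib tree and targetcli-fb, and this fragment needs root plus FCoE-capable hardware, so treat it as illustrative rather than copy-paste.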
[... remainder of quoted traces snipped; duplicates of the initiator-side list_del trace above ...]
Thanks,
Nicholas A. Bellinger
2014-05-15 01:57:25 UTC
Permalink
Post by Jun Wu
[... full quote of the previous message and kernel traces snipped ...]
Yes, we installed the official Fedora 20 ISO image and then ran:
yum install targetcli
modprobe fcoe
Kernel version is 3.11.10-301.fc20.x86_64, which is a stock distribution kernel.
So I'd still recommend >= v3.13.y in order to pick up the other
libfc-related bugfixes that have gone in since v3.11.y ran out of stable
support some months ago. I can't tell whether that is related to the
bug(s) you're hitting on both the initiator and target sides here, but
it definitely would not hurt.

Also, here is the updated >= v3.13.y percpu-ida pre-allocation
regression fix that will be going into v3.15-rc6 shortly. Note this is
different from the original version, so please make sure you use this
one instead.

https://git.kernel.org/cgit/linux/kernel/git/nab/target-pending.git/commit/?id=e7c36ac16b91bf09e94f82d1838f0f9f114bed08

--nab
Jun Wu
2014-05-15 02:20:29 UTC
Permalink
On Wed, May 14, 2014 at 6:57 PM, Nicholas A. Bellinger
Post by Nicholas A. Bellinger
[... full quote of the previous message and kernel traces snipped ...]
So I'd still recommend >= v3.13.y in order to pick up the other
libfc-related bugfixes that have gone in since v3.11.y ran out of stable
support some months ago. I can't tell whether that is related to the
bug(s) you're hitting on both the initiator and target sides here, but
it definitely would not hurt.
Also, here is the updated >= v3.13.y percpu-ida pre-allocation
regression fix that will be going into v3.15-rc6 shortly. Note this is
different from the original version, so please make sure you use this
one instead.
That's what we are going to do. Our software compiles on the 3.13
kernel, and I will run some tests once the systems are ready.

Thanks a lot.
Post by Nicholas A. Bellinger
https://git.kernel.org/cgit/linux/kernel/git/nab/target-pending.git/commit/?id=e7c36ac16b91bf09e94f82d1838f0f9f114bed08
--nab
Jun Wu
2014-05-20 00:29:35 UTC
Permalink
Hi Nicholas,

We downloaded the source of our running kernel (3.13.10-200) and
applied your percpu-ida pre-allocation regression fix, then compiled
and installed the kernel. I repeated the same test three times,
running 10 fio sessions to 10 drives on the target through fcoe vn2vn.
In the first two tests, the target machine hung with the following
messages:

15231 May 19 11:49:27 poc1 kernel: [ 1073.783229] ft_queue_data_in:
Failed to send frame ffff880c0b188200, xid <0x2a5>, remaining 196608,
lso_max <0x10000>
15232 May 19 11:49:27 poc1 kernel: [ 1073.783238] ft_queue_data_in:
Failed to send frame ffff880c0b188200, xid <0x2a5>, remaining 131072,
lso_max <0x10000>
15233 May 19 11:49:27 poc1 kernel: [ 1073.783242] ft_queue_data_in:
Failed to send frame ffff880c0b188200, xid <0x2a5>, remaining 65536,
lso_max <0x10000>
15234 May 19 11:49:27 poc1 kernel: [ 1073.783245] ft_queue_data_in:
Failed to send frame ffff880c0b188200, xid <0x2a5>, remaining 0,
lso_max <0x10000>
15235 May 19 11:49:30 poc1 kernel: [ 1076.907061] ft_queue_data_in:
Failed to send frame ffff880c1d1df000, xid <0x305>, remaining 196608,
lso_max <0x10000>
15236 May 19 11:49:30 poc1 kernel: [ 1076.907068] ft_queue_data_in:
Failed to send frame ffff880c1d1df000, xid <0x305>, remaining 131072,
lso_max <0x10000>
15237 May 19 11:49:30 poc1 kernel: [ 1076.907073] ft_queue_data_in:
Failed to send frame ffff880c1d1df000, xid <0x305>, remaining 65536,
lso_max <0x10000>
15238 May 19 11:49:30 poc1 kernel: [ 1076.907077] ft_queue_data_in:
Failed to send frame ffff880c1d1df000, xid <0x305>, remaining 0,
lso_max <0x10000>
15239 May 19 11:50:01 poc1 kernel: [ 1107.918910] ft_queue_data_in:
Failed to send frame ffff88060cd40800, xid <0x3cb>, remaining 458752,
lso_max <0x10000>
15240 May 19 11:50:01 poc1 kernel: [ 1107.918918] ft_queue_data_in:
Failed to send frame ffff88060cd40800, xid <0x3cb>, remaining 393216,
lso_max <0x10000>
15241 May 19 11:50:01 poc1 kernel: [ 1107.918922] ft_queue_data_in:
Failed to send frame ffff88060cd40800, xid <0x3cb>, remaining 327680,
lso_max <0x10000>
15242 May 19 11:50:01 poc1 kernel: [ 1107.918925] ft_queue_data_in:
Failed to send frame ffff88060cd40800, xid <0x3cb>, remaining 262144,
lso_max <0x10000>
15243 May 19 11:50:01 poc1 kernel: [ 1107.918929] ft_queue_data_in:
Failed to send frame ffff88060cd40800, xid <0x3cb>, remaining 196608,
lso_max <0x10000>
15244 May 19 11:50:01 poc1 kernel: [ 1107.918932] ft_queue_data_in:
Failed to send frame ffff88060cd40800, xid <0x3cb>, remaining 131072,
lso_max <0x10000>
15245 May 19 11:50:01 poc1 kernel: [ 1107.918936] ft_queue_data_in:
Failed to send frame ffff88060cd40800, xid <0x3cb>, remaining 65536,
lso_max <0x10000>
15246 May 19 11:50:01 poc1 kernel: [ 1107.918939] ft_queue_data_in:
Failed to send frame ffff88060cd40800, xid <0x3cb>, remaining 0,
lso_max <0x10000>
15247 May 19 11:50:05 poc1 kernel: [ 1111.450900] ft_queue_data_in:
Failed to send frame ffff880c0b24ca00, xid <0xea6>, remaining 196608,
lso_max <0x10000>
15248 May 19 11:50:05 poc1 kernel: [ 1111.450908] ft_queue_data_in:
Failed to send frame ffff880c0b24ca00, xid <0xea6>, remaining 131072,
lso_max <0x10000>
15249 May 19 11:51:12 poc1 kernel: [ 1178.698434] ft_queue_data_in: 6
callbacks suppressed
15250 May 19 11:51:12 poc1 kernel: [ 1178.698440] ft_queue_data_in:
Failed to send frame ffff88060ba97400, xid <0xb8a>, remaining 458752,
lso_max <0x10000>
15251 May 19 11:51:12 poc1 kernel: [ 1178.698446] ft_queue_data_in:
Failed to send frame ffff88060ba97400, xid <0xb8a>, remaining 393216,
lso_max <0x10000>
15252 May 19 11:51:12 poc1 kernel: [ 1178.698449] ft_queue_data_in:
Failed to send frame ffff88060ba97400, xid <0xb8a>, remaining 327680,
lso_max <0x10000>
15253 May 19 11:51:12 poc1 kernel: [ 1178.698453] ft_queue_data_in:
Failed to send frame ffff88060ba97400, xid <0xb8a>, remaining 262144,
lso_max <0x10000>
15254 May 19 11:51:12 poc1 kernel: [ 1178.698456] ft_queue_data_in:
Failed to send frame ffff88060ba97400, xid <0xb8a>, remaining 196608,
lso_max <0x10000>
15255 May 19 11:51:12 poc1 kernel: [ 1178.698460] ft_queue_data_in:
Failed to send frame ffff88060ba97400, xid <0xb8a>, remaining 131072,
lso_max <0x10000>
15256 May 19 11:51:12 poc1 kernel: [ 1178.698463] ft_queue_data_in:
Failed to send frame ffff88060ba97400, xid <0xb8a>, remaining 65536,
lso_max <0x10000>
15257 May 19 11:51:12 poc1 kernel: [ 1178.698467] ft_queue_data_in:
Failed to send frame ffff88060ba97400, xid <0xb8a>, remaining 0,
lso_max <0x10000>
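One pattern worth noting in the log above: for each xid, the "remaining" byte count drops by exactly lso_max (0x10000 = 65536) from one message to the next. That is consistent with ft_queue_data_in carving the DATA-IN payload into lso_max-sized sequences and failing on every chunk of the same I/O (an interpretation from the numbers, not from the driver source). A quick sanity check of the arithmetic for xid 0x3cb (the 458752 ... 0 run):

```shell
# 'remaining' starts at 458752 and the log shows one failure line per
# lso_max-sized chunk, down to and including remaining == 0.
lso_max=$((0x10000))     # 65536 bytes, as printed in the log
remaining=458752         # first value reported for xid 0x3cb
count=0
while [ "$remaining" -ge 0 ]; do
    count=$((count + 1))
    remaining=$((remaining - lso_max))
done
echo "$count chunks of $lso_max bytes"   # matches the 8 log lines for xid 0x3cb
```

The total transfer for that command would then be 458752 + 65536 = 524288 bytes (512 KiB), which fits a large sequential fio read being split at the lso_max boundary.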


In the third test, after the same ft_queue_data_in messages, some additional output was printed:

15258 May 19 11:52:46 poc1 kernel: [ 1272.969464] BUG: soft lockup -
CPU#0 stuck for 22s! [fcoethread/0:3372]
15259 May 19 11:52:46 poc1 kernel: [ 1272.969468] Modules linked in:
fcoe libfcoe ipt_MASQUERADE xt_CHECKSUM tcm_fc libfc scsi_transport_fc
scsi_tgt target_core_pscsi target_core_file target_core_iblock
iscsi_target_mod target_core_mod 8021q garp mrp ip6t_rpfilter
ip6t_REJECT xt_conntrack ebtable_nat ebtable_broute bridge stp llc
ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6
nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw
ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4
nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security
iptable_raw coretemp kvm_intel kvm crc32c_intel iTCO_wdt
iTCO_vendor_support gpio_ich microcode serio_raw lpc_ich
i2c_i801 mfd_core ioatdma ses enclosure i7core_edac edac_core shpchp
acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd sunrpc radeon
drm_kms_helper ttm ixgbe drm igb ata_generic mdio pata_acpi ptp
pps_core i2c_algo_bit pata_jmicron aacraid dca i2c_core
15260 May 19 11:52:46 poc1 kernel: [ 1272.969537] CPU: 0 PID: 3372
Comm: fcoethread/0 Not tainted 3.13.10-200.zbfcoepatch.fc20.x86_64 #1
15261 May 19 11:52:46 poc1 kernel: [ 1272.969539] Hardware name:
Supermicro X8DTN/X8DTN, BIOS 2.1c 10/28/2011
15262 May 19 11:52:46 poc1 kernel: [ 1272.969541] task:
ffff88061cb29140 ti: ffff88061e14c000 task.ti: ffff88061e14c000
15263 May 19 11:52:46 poc1 kernel: [ 1272.969543] RIP:
0010:[<ffffffff8168e834>] [<ffffffff8168e834>]
_raw_spin_lock_bh+0x34/0x40
15264 May 19 11:52:46 poc1 kernel: [ 1272.969552] RSP:
0018:ffff88061e14de10 EFLAGS: 00000282
15265 May 19 11:52:46 poc1 kernel: [ 1272.969554] RAX:
000000000000aa8f RBX: ffff880627c14580 RCX: 000000000000136f
15266 May 19 11:52:46 poc1 kernel: [ 1272.969556] RDX:
000000000000136f RSI: ffff88061cb29140 RDI: ffff880627c153fc
15267 May 19 11:52:46 poc1 kernel: [ 1272.969558] RBP:
ffff88061e14de18 R08: ffff88061e14c000 R09: 0000000000000004
15268 May 19 11:52:46 poc1 kernel: [ 1272.969560] R10:
0000000000000004 R11: 0000000000000005 R12: ffff880c09bfd800
15269 May 19 11:52:46 poc1 kernel: [ 1272.969561] R13:
ffff880627c14580 R14: 0000000000000000 R15: 0000000000000000
15270 May 19 11:52:46 poc1 kernel: [ 1272.969564] FS:
0000000000000000(0000) GS:ffff880627c00000(0000)
knlGS:0000000000000000
15271 May 19 11:52:46 poc1 kernel: [ 1272.969566] CS: 0010 DS: 0000
ES: 0000 CR0: 000000008005003b
15272 May 19 11:52:46 poc1 kernel: [ 1272.969568] CR2:
00007f3c2c851000 CR3: 0000000001c0c000 CR4: 00000000000007f0
15273 May 19 11:52:46 poc1 kernel: [ 1272.969570] Stack:
15274 May 19 11:52:46 poc1 kernel: [ 1272.969572] ffff880627c153e0
ffff88061e14dec8 ffffffffa06590ed ffff880627c14580
15275 May 19 11:52:46 poc1 kernel: [ 1272.969577] ffff880c1d00d400
0000000000000000 ffff880627c14580 ffff88061cb29140
15276 May 19 11:52:46 poc1 kernel: [ 1272.969581] ffff88061cb29140
ffff880627c153fc ffff880627c153e0 ffff88060c314dce
15277 May 19 11:52:46 poc1 kernel: [ 1272.969585] Call Trace:
15278 May 19 11:52:46 poc1 kernel: [ 1272.969592]
[<ffffffffa06590ed>] fcoe_percpu_receive_thread+0xdd/0x53c [fcoe]
15279 May 19 11:52:46 poc1 kernel: [ 1272.969597]
[<ffffffffa0659010>] ? fcoe_set_port_id+0x50/0x50 [fcoe]
15280 May 19 11:52:46 poc1 kernel: [ 1272.969603]
[<ffffffff8108f2f2>] kthread+0xd2/0xf0
15281 May 19 11:52:46 poc1 kernel: [ 1272.969607]
[<ffffffff8108f220>] ? insert_kthread_work+0x40/0x40
15282 May 19 11:52:46 poc1 kernel: [ 1272.969613]
[<ffffffff81696dbc>] ret_from_fork+0x7c/0xb0
15283 May 19 11:52:46 poc1 kernel: [ 1272.969617]
[<ffffffff8108f220>] ? insert_kthread_work+0x40/0x40
15284 May 19 11:52:46 poc1 kernel: [ 1272.969618] Code: 53 48 89 fb e8
4e 34 9e ff b8 00 00 01 00 f0 0f c1 03 89 c2 c1 ea 10 66 39 c2 75 03
5b 5d c3 0f b7 03 89 d1 66 39 d0 74 f3 f3 90 <0f> b7 03 66 39 c8 75 f6
eb e7 66 90 66 66 66 66 90 55 48 89 e5

I didn't see the previous message "unable to handle kernel NULL
pointer dereference at 0000000000000048". So it must have been fixed
by your change.

Thanks,

Jun

On Wed, May 14, 2014 at 6:57 PM, Nicholas A. Bellinger
Post by Nicholas A. Bellinger
Post by Jun Wu
On Wed, May 14, 2014 at 2:39 PM, Nicholas A. Bellinger
Post by Nicholas A. Bellinger
Post by Jun Wu
Hi Nicholas,
We had to roll back system from 3.14 to 3.11 due to compile issues of
our software. So I am not able to verify your fix at this point.
That is unfortunate; you're stuck on a now-unsupported stable kernel.
There are some other libfc-related fixes that have gone in during the
v3.13 timeframe, so I'd strongly recommend upgrading to at least that
stable version.
In any event, I'll be pushing that particular >= v3.13.y patch anyway,
as it's an obvious regression bugfix for percpu-ida pre-allocation.
Post by Jun Wu
I ran the same tests on 3.11 instead.
May 13 13:06:25 poc2 kernel: BUG: unable to handle kernel paging
request at ffffffffffffffa4
May 13 13:06:25 poc2 kernel: IP: [<ffffffff8164ac07>]
_raw_spin_lock_bh+0x17/0x40
May 13 13:06:25 poc2 kernel: PGD 1c0f067 PUD 1c11067 PMD 0
May 13 13:06:25 poc2 kernel: Oops: 0002 [#1] SMP
May 13 13:06:25 poc2 kernel: Modules linked in: fcoe libfcoe 8021q
garp mrp tcm_fc libfc scsi_transport_fc scsi_tgt target_core_pscsi
Post by Jun Wu
target_core_file target_core_iblock iscsi_target_mod target_core_mod
ip6t_rpfilter ip6t_REJECT xt_conntrack ebtable_nat ebtable_broute
bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6
nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security
ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4
nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle
iptable_security iptable_raw nfsd auth_rpcgss nfs_acl lockd sunrpc
ixgbe mdio igb ptp pps_core serio_raw ses enclosure iTCO_wdt
iTCO_vendor_support lpc_ich mfd_core shpchp i2c_i801 coretemp
kvm_intel kvm crc32c_intel microcode i7core_edac ioatdma acpi_cpufreq
edac_core dca mperf radeon i2c_algo_bit
May 13 13:06:25 poc2 kernel: drm_kms_helper ttm drm ata_generic
i2c_core pata_acpi pata_jmicron aacraid
May 13 13:06:25 poc2 kernel: CPU: 0 PID: 1810 Comm: kworker/0:0 Not
tainted 3.11.10-301.fc20.x86_64 #1
May 13 13:06:25 poc2 kernel: Hardware name: Supermicro X8DTN/X8DTN,
BIOS 2.1c 10/28/2011
May 13 13:06:25 poc2 kernel: Workqueue: target_completion target_complete_ok_work [target_core_mod]
May 13 13:06:25 poc2 kernel: task: ffff88032c5096e0 ti: ffff88031bb78000 task.ti: ffff88031bb78000
May 13 13:06:25 poc2 kernel: RIP: 0010:[<ffffffff8164ac07>] [<ffffffff8164ac07>] _raw_spin_lock_bh+0x17/0x40
May 13 13:06:25 poc2 kernel: RSP: 0018:ffff88031bb79cf0 EFLAGS: 00010206
May 13 13:06:25 poc2 kernel: RAX: 0000000000000100 RBX: ffffffffffffffa4 RCX: 0000000000000000
May 13 13:06:25 poc2 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffffffffffa4
May 13 13:06:25 poc2 kernel: RBP: ffff88031bb79cf8 R08: 00000000ffffffff R09: ffff88031a37f678
May 13 13:06:25 poc2 kernel: R10: 0000000000000001 R11: 0000000000000044 R12: 0000000000000000
May 13 13:06:25 poc2 kernel: R13: ffff88031a37f678 R14: ffff88062d9fd6c8 R15: ffff88032c6da05c
May 13 13:06:25 poc2 kernel: FS: 0000000000000000(0000) GS:ffff880333c00000(0000) knlGS:0000000000000000
May 13 13:06:25 poc2 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
May 13 13:06:25 poc2 kernel: CR2: ffffffffffffffa4 CR3: 0000000001c0c000 CR4: 00000000000007f0
May 13 13:06:25 poc2 kernel: ffffffffffffffa4 ffff88031bb79d18 ffffffffa0594d2b ffff880328d13410
May 13 13:06:25 poc2 kernel: ffff88031a37c200 ffff88031bb79d58 ffffffffa05356f2 0000000000000018
May 13 13:06:25 poc2 kernel: ffff88062cea8800 0000000000000000 ffffea000c8eb640 0000000000000000
May 13 13:06:25 poc2 kernel: [<ffffffffa0594d2b>] fc_seq_start_next+0x1b/0x40 [libfc]
May 13 13:06:25 poc2 kernel: [<ffffffffa05356f2>] ft_queue_status+0xf2/0x220 [tcm_fc]
May 13 13:06:25 poc2 kernel: [<ffffffffa0536972>] ft_queue_data_in+0x72/0x5a0 [tcm_fc]
May 13 13:06:25 poc2 kernel: [<ffffffffa04f57ba>] target_complete_ok_work+0x14a/0x2b0 [target_core_mod]
May 13 13:06:25 poc2 kernel: [<ffffffff810810f5>] process_one_work+0x175/0x430
May 13 13:06:25 poc2 kernel: [<ffffffff81081d1b>] worker_thread+0x11b/0x3a0
May 13 13:06:25 poc2 kernel: [<ffffffff81081c00>] ? rescuer_thread+0x340/0x340
May 13 13:06:25 poc2 kernel: [<ffffffff81088660>] kthread+0xc0/0xd0
May 13 13:06:25 poc2 kernel: [<ffffffff810885a0>] ? insert_kthread_work+0x40/0x40
May 13 13:06:25 poc2 kernel: [<ffffffff8165332c>] ret_from_fork+0x7c/0xb0
May 13 13:06:25 poc2 kernel: [<ffffffff810885a0>] ? insert_kthread_work+0x40/0x40
May 13 13:06:25 poc2 kernel: Code: 1f 44 00 00 f3 90 0f b6 07 38 d0 75
f7 5d c3 0f 1f 44 00 00 66 66 66 66 90 55 48 89 e5 53 48 89 fb e8 7e
05 a2 ff b8 00 01 00 00 <f0> 66 0f c1 03 0f b6 d4 38 c2 74 0e 0f 1f 44
00 00 f3 90 0f b6
May 13 13:06:25 poc2 kernel: RIP [<ffffffff8164ac07>] _raw_spin_lock_bh+0x17/0x40
So before we start debugging again, please confirm that this is a
*completely* stock v3.11.10 build, and that you're not building
out-of-tree target modules again.
Yes, we installed the official Fedora 20 ISO image and then ran
yum install targetcli
modprobe fcoe
The kernel version is 3.11.10-301.fc20.x86_64, which is a stock distribution kernel.
So I'd still recommend >= v3.13.y in order to pick up the other
libfc-related bugfixes that have gone in since v3.11.y ran out of stable
support some months ago. I can't tell if that is related to the bug(s)
you're hitting on both the initiator and target sides here, but it
definitely would not hurt.
Also, here is the updated >= v3.13.y percpu-ida pre-allocation
regression fix that will be going into v3.15-rc6 shortly. Note this is
different from the original version, so please make sure you use this
one instead.
https://git.kernel.org/cgit/linux/kernel/git/nab/target-pending.git/commit/?id=e7c36ac16b91bf09e94f82d1838f0f9f114bed08
--nab
Nicholas A. Bellinger
2014-05-20 18:03:27 UTC
Permalink
Post by Jun Wu
Hi Nicholas,
We downloaded the source of our running kernel (3.13.10-200) and
applied your percpu-ida pre-allocation regression fix, then compiled
and installed the kernel. I repeated the same test three times,
running 10 fio sessions to 10 drives on the target through fcoe vn2vn.
In the first two tests, the target machine hung with the following messages:
Failed to send frame ffff880c0b188200, xid <0x2a5>, remaining 196608,
lso_max <0x10000>
Failed to send frame ffff880c0b188200, xid <0x2a5>, remaining 131072,
lso_max <0x10000>
Failed to send frame ffff880c0b188200, xid <0x2a5>, remaining 65536,
lso_max <0x10000>
Failed to send frame ffff880c0b188200, xid <0x2a5>, remaining 0,
lso_max <0x10000>
Failed to send frame ffff880c1d1df000, xid <0x305>, remaining 196608,
lso_max <0x10000>
Failed to send frame ffff880c1d1df000, xid <0x305>, remaining 131072,
lso_max <0x10000>
Failed to send frame ffff880c1d1df000, xid <0x305>, remaining 65536,
lso_max <0x10000>
Failed to send frame ffff880c1d1df000, xid <0x305>, remaining 0,
lso_max <0x10000>
Failed to send frame ffff88060cd40800, xid <0x3cb>, remaining 458752,
lso_max <0x10000>
Failed to send frame ffff88060cd40800, xid <0x3cb>, remaining 393216,
lso_max <0x10000>
Failed to send frame ffff88060cd40800, xid <0x3cb>, remaining 327680,
lso_max <0x10000>
Failed to send frame ffff88060cd40800, xid <0x3cb>, remaining 262144,
lso_max <0x10000>
Failed to send frame ffff88060cd40800, xid <0x3cb>, remaining 196608,
lso_max <0x10000>
Failed to send frame ffff88060cd40800, xid <0x3cb>, remaining 131072,
lso_max <0x10000>
Failed to send frame ffff88060cd40800, xid <0x3cb>, remaining 65536,
lso_max <0x10000>
Failed to send frame ffff88060cd40800, xid <0x3cb>, remaining 0,
lso_max <0x10000>
Failed to send frame ffff880c0b24ca00, xid <0xea6>, remaining 196608,
lso_max <0x10000>
Failed to send frame ffff880c0b24ca00, xid <0xea6>, remaining 131072,
lso_max <0x10000>
15249 May 19 11:51:12 poc1 kernel: [ 1178.698434] ft_queue_data_in: 6
callbacks suppressed
Failed to send frame ffff88060ba97400, xid <0xb8a>, remaining 458752,
lso_max <0x10000>
Failed to send frame ffff88060ba97400, xid <0xb8a>, remaining 393216,
lso_max <0x10000>
Failed to send frame ffff88060ba97400, xid <0xb8a>, remaining 327680,
lso_max <0x10000>
Failed to send frame ffff88060ba97400, xid <0xb8a>, remaining 262144,
lso_max <0x10000>
Failed to send frame ffff88060ba97400, xid <0xb8a>, remaining 196608,
lso_max <0x10000>
Failed to send frame ffff88060ba97400, xid <0xb8a>, remaining 131072,
lso_max <0x10000>
Failed to send frame ffff88060ba97400, xid <0xb8a>, remaining 65536,
lso_max <0x10000>
Failed to send frame ffff88060ba97400, xid <0xb8a>, remaining 0,
lso_max <0x10000>
The call into the lport->tt.seq_send() libfc code is failing to send
outgoing solicited data-in. From the output, note that the LSO (large
segment offload, a.k.a. TCP segmentation offload) feature has been
enabled by the underlying NIC hardware.

So in order to isolate possible issues, I'd recommend:

- Disabling hardware offloads on both initiator and target sides (LRO +
LSO) using ethtool -K
- Disabling any jumbo frames settings on either side
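A sketch of those two steps with ethtool and ip (the interface name p4p1 is taken from later in this thread; substitute your own, and run on both initiator and target):

```shell
IFACE=p4p1

# Disable segmentation/receive offloads (TSO/GSO and GRO/LRO).
ethtool -K "$IFACE" tso off gso off gro off lro off

# Verify the resulting feature state.
ethtool -k "$IFACE" | grep -E 'segmentation|receive-offload'

# Confirm no jumbo frames are configured (standard MTU is 1500).
ip link show "$IFACE" | grep -o 'mtu [0-9]*'
```

Note that some features may report "[fixed]" on a given NIC and cannot be toggled.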

Are there any other non-standard network and/or switch settings in
place..? Also, please confirm what your NIC + switch setup looks
like.

Rob & Open-FCoE folks, is there anything else to take into consideration
here..?
Post by Jun Wu
I didn't see the previous message "unable to handle kernel NULL
pointer dereference at 0000000000000048". So it must have been fixed
by your change.
Thanks for confirming that bit.

--nab
Jun Wu
2014-05-21 05:29:44 UTC
Permalink
The MTU was 1500 on both initiator and target.
I used "ethtool -K p4p1 tso off" to turn off TCP segmentation offload
on all machines. The feature settings after the command are shown below.

[***@poc3 jkong]# ethtool -k p4p1
Features for p4p1:
rx-checksumming: on
tx-checksumming: on
tx-checksum-ipv4: on
tx-checksum-ip-generic: off [fixed]
tx-checksum-ipv6: on
tx-checksum-fcoe-crc: on [fixed]
tx-checksum-sctp: on
scatter-gather: on
tx-scatter-gather: on
tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: off
tx-tcp-segmentation: off
tx-tcp-ecn-segmentation: off [fixed]
tx-tcp6-segmentation: off
udp-fragmentation-offload: off [fixed]
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: off
receive-hashing: on
highdma: on [fixed]
rx-vlan-filter: on
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: on [fixed]
tx-gre-segmentation: off [fixed]
tx-ipip-segmentation: off [fixed]
tx-sit-segmentation: off [fixed]
tx-udp_tnl-segmentation: off [fixed]
tx-mpls-segmentation: off [fixed]
fcoe-mtu: on [fixed]
tx-nocache-copy: on
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off
tx-vlan-stag-hw-insert: off [fixed]
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
l2-fwd-offload: off

Info on NIC drivers

[***@poc3 jkong]# ethtool -i p4p1
driver: ixgbe
version: 3.15.1-k
firmware-version: 0x80000208
bus-info: 0000:08:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no

After the change, I repeated the same test and got a similar failure on
the target side:

[12253.032595] ft_queue_data_in: Failed to send frame
ffff88062a638600, xid <0xa0c>, remaining 458752, lso_max <0x10000>
[12253.032605] ft_queue_data_in: Failed to send frame
ffff88062a638600, xid <0xa0c>, remaining 393216, lso_max <0x10000>
[12253.032609] ft_queue_data_in: Failed to send frame
ffff88062a638600, xid <0xa0c>, remaining 327680, lso_max <0x10000>
[12253.032613] ft_queue_data_in: Failed to send frame
ffff88062a638600, xid <0xa0c>, remaining 262144, lso_max <0x10000>
[12284.299877] ft_queue_data_in: Failed to send frame
ffff8803202ec600, xid <0x3a2>, remaining 196608, lso_max <0x10000>
[12284.299885] ft_queue_data_in: Failed to send frame
ffff8803202ec600, xid <0x3a2>, remaining 131072, lso_max <0x10000>
[12284.299889] ft_queue_data_in: Failed to send frame
ffff8803202ec600, xid <0x3a2>, remaining 65536, lso_max <0x10000>
[12284.299892] ft_queue_data_in: Failed to send frame
ffff8803202ec600, xid <0x3a2>, remaining 0, lso_max <0x10000>
[12284.451810] ft_queue_data_in: Failed to send frame
ffff88061deb1400, xid <0xecf>, remaining 458752, lso_max <0x10000>
[12284.451818] ft_queue_data_in: Failed to send frame
ffff88061deb1400, xid <0xecf>, remaining 393216, lso_max <0x10000>
[12284.451824] ft_queue_data_in: Failed to send frame
ffff88061deb1400, xid <0xecf>, remaining 327680, lso_max <0x10000>
[12284.451827] ft_queue_data_in: Failed to send frame
ffff88061deb1400, xid <0xecf>, remaining 262144, lso_max <0x10000>
[12284.451831] ft_queue_data_in: Failed to send frame
ffff88061deb1400, xid <0xecf>, remaining 196608, lso_max <0x10000>
[12284.451834] ft_queue_data_in: Failed to send frame
ffff88061deb1400, xid <0xecf>, remaining 131072, lso_max <0x10000>
[12347.503478] ft_queue_data_in: 2 callbacks suppressed
[12347.503486] ft_queue_data_in: Failed to send frame
ffff8806142bc800, xid <0xb4f>, remaining 458752, lso_max <0x10000>
[12347.503492] ft_queue_data_in: Failed to send frame
ffff8806142bc800, xid <0xb4f>, remaining 393216, lso_max <0x10000>
[12347.503496] ft_queue_data_in: Failed to send frame
ffff8806142bc800, xid <0xb4f>, remaining 327680, lso_max <0x10000>
[12347.503517] ft_queue_data_in: Failed to send frame
ffff8806142bc800, xid <0xb4f>, remaining 262144, lso_max <0x10000>
[12378.402412] ft_queue_data_in: Failed to send frame
ffff88062ddeac00, xid <0x6a5>, remaining 458752, lso_max <0x10000>
[12378.402420] ft_queue_data_in: Failed to send frame
ffff88062ddeac00, xid <0x6a5>, remaining 393216, lso_max <0x10000>
[12378.402425] ft_queue_data_in: Failed to send frame
ffff88062ddeac00, xid <0x6a5>, remaining 327680, lso_max <0x10000>
[12378.402428] ft_queue_data_in: Failed to send frame
ffff88062ddeac00, xid <0x6a5>, remaining 262144, lso_max <0x10000>
[12378.402432] ft_queue_data_in: Failed to send frame
ffff88062ddeac00, xid <0x6a5>, remaining 196608, lso_max <0x10000>
[12378.402436] ft_queue_data_in: Failed to send frame
ffff88062ddeac00, xid <0x6a5>, remaining 131072, lso_max <0x10000>
[12378.402440] ft_queue_data_in: Failed to send frame
ffff88062ddeac00, xid <0x6a5>, remaining 65536, lso_max <0x10000>
[12378.402444] ft_queue_data_in: Failed to send frame
ffff88062ddeac00, xid <0x6a5>, remaining 0, lso_max <0x10000>
[13049.224513] ft_queue_data_in: Failed to send frame
ffff880614588c00, xid <0xd2f>, remaining 196608, lso_max <0x10000>
[13049.224524] ft_queue_data_in: Failed to send frame
ffff880614588c00, xid <0xd2f>, remaining 131072, lso_max <0x10000>
[13049.224528] ft_queue_data_in: Failed to send frame
ffff880614588c00, xid <0xd2f>, remaining 65536, lso_max <0x10000>
[13049.224532] ft_queue_data_in: Failed to send frame
ffff880614588c00, xid <0xd2f>, remaining 0, lso_max <0x10000>
[13052.511306] ft_queue_data_in: Failed to send frame
ffff88062d49f000, xid <0x8ae>, remaining 196608, lso_max <0x10000>
[13052.511313] ft_queue_data_in: Failed to send frame
ffff88062d49f000, xid <0x8ae>, remaining 131072, lso_max <0x10000>
[13052.511317] ft_queue_data_in: Failed to send frame
ffff88062d49f000, xid <0x8ae>, remaining 65536, lso_max <0x10000>
[13052.511321] ft_queue_data_in: Failed to send frame
ffff88062d49f000, xid <0x8ae>, remaining 0, lso_max <0x10000>
[13087.976748] ft_queue_data_in: Failed to send frame
ffff88031afc9c00, xid <0x96b>, remaining 458752, lso_max <0x10000>
[13087.998453] ft_queue_data_in: Failed to send frame
ffff88032c881200, xid <0xb23>, remaining 458752, lso_max <0x10000>
[13087.998459] ft_queue_data_in: Failed to send frame
ffff88032c881200, xid <0xb23>, remaining 393216, lso_max <0x10000>
[13087.998463] ft_queue_data_in: Failed to send frame
ffff88032c881200, xid <0xb23>, remaining 327680, lso_max <0x10000>
[13087.998467] ft_queue_data_in: Failed to send frame
ffff88032c881200, xid <0xb23>, remaining 262144, lso_max <0x10000>
[13087.998470] ft_queue_data_in: Failed to send frame
ffff88032c881200, xid <0xb23>, remaining 196608, lso_max <0x10000>
[13087.998474] ft_queue_data_in: Failed to send frame
ffff88032c881200, xid <0xb23>, remaining 131072, lso_max <0x10000>
[13087.998478] ft_queue_data_in: Failed to send frame
ffff88032c881200, xid <0xb23>, remaining 65536, lso_max <0x10000>
[13087.998482] ft_queue_data_in: Failed to send frame
ffff88032c881200, xid <0xb23>, remaining 0, lso_max <0x10000>
[13119.177286] ft_queue_data_in: Failed to send frame
ffff88062dff7400, xid <0xfcf>, remaining 458752, lso_max <0x10000>
[13119.177297] ft_queue_data_in: Failed to send frame
ffff88062dff7400, xid <0xfcf>, remaining 393216, lso_max <0x10000>
[13119.177302] ft_queue_data_in: Failed to send frame
ffff88062dff7400, xid <0xfcf>, remaining 327680, lso_max <0x10000>
[13119.177307] ft_queue_data_in: Failed to send frame
ffff88062dff7400, xid <0xfcf>, remaining 262144, lso_max <0x10000>
[13119.177311] ft_queue_data_in: Failed to send frame
ffff88062dff7400, xid <0xfcf>, remaining 196608, lso_max <0x10000>
[13119.177316] ft_queue_data_in: Failed to send frame
ffff88062dff7400, xid <0xfcf>, remaining 131072, lso_max <0x10000>
[13119.177321] ft_queue_data_in: Failed to send frame
ffff88062dff7400, xid <0xfcf>, remaining 65536, lso_max <0x10000>
[13119.177325] ft_queue_data_in: Failed to send frame
ffff88062dff7400, xid <0xfcf>, remaining 0, lso_max <0x10000>
[13122.335322] ------------[ cut here ]------------
[13122.335336] WARNING: CPU: 6 PID: 2165 at
include/scsi/fc_frame.h:173 fcoe_percpu_receive_thread+0x507/0x53c
[fcoe]()
[13122.335338] Modules linked in: async_memcpy async_xor xor async_tx
fcoe libfcoe tcm_fc libfc scsi_transport_fc scsi_tgt target_core_pscsi
target_core_file target_core_iblock iscsi_target_mod target_core_mod
8021q garp mrp bridge stp llc iTCO_wdt gpio_ich iTCO_vendor_support
coretemp kvm_intel kvm crc32c_intel microcode serio_raw i2c_i801
lpc_ich mfd_core ses enclosure i7core_edac ioatdma edac_core shpchp
acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd sunrpc radeon
drm_kms_helper ttm drm ixgbe igb ata_generic mdio pata_acpi ptp
pata_jmicron pps_core i2c_algo_bit aacraid dca i2c_core [last
unloaded: vd]
[13122.335390] CPU: 6 PID: 2165 Comm: fcoethread/6 Tainted: GF
O 3.13.10-200.zbfcoepatch.fc20.x86_64 #1
[13122.335392] Hardware name: Supermicro X8DTN/X8DTN, BIOS 2.1c 10/28/2011
[13122.335394] 0000000000000009 ffff88062b04bdd0 ffffffff81687eac
0000000000000000
[13122.335400] ffff88062b04be08 ffffffff8106d4dd ffffe8ffffc41748
ffff88062a444700
[13122.335404] ffff8800b7e926e8 0000000000000002 ffff88062b04be88
ffff88062b04be18
[13122.335408] Call Trace:
[13122.335419] [<ffffffff81687eac>] dump_stack+0x45/0x56
[13122.335426] [<ffffffff8106d4dd>] warn_slowpath_common+0x7d/0xa0
[13122.335430] [<ffffffff8106d5ba>] warn_slowpath_null+0x1a/0x20
[13122.335435] [<ffffffffa0651517>]
fcoe_percpu_receive_thread+0x507/0x53c [fcoe]
[13122.335440] [<ffffffffa0651010>] ? fcoe_set_port_id+0x50/0x50 [fcoe]
[13122.335446] [<ffffffff8108f2f2>] kthread+0xd2/0xf0
[13122.335450] [<ffffffff8108f220>] ? insert_kthread_work+0x40/0x40
[13122.335458] [<ffffffff81696dbc>] ret_from_fork+0x7c/0xb0
[13122.335461] [<ffffffff8108f220>] ? insert_kthread_work+0x40/0x40
[13122.335464] ---[ end trace e4509e1053f499ac ]---

Thanks,

Jun

Vasu Dev
2014-05-21 16:21:06 UTC
Permalink
Post by Jun Wu
After the change, I repeated the same test and got similar failure on
[12253.032595] ft_queue_data_in: Failed to send frame
ffff88062a638600, xid <0xa0c>, remaining 458752, lso_max <0x10000>
It is a send-frame failure; to find out what caused it, more debug
info from the low-level fcoe Tx path functions will be helpful. It can
be enabled by:

# echo 0xFF > /sys/module/libfc/parameters/debug_logging
# echo 0x1 > /sys/module/fcoe/parameters/debug_logging

Disabling Tx offloads may not help here and would instead slow down Tx,
so have them restored.

Also, are you using a switch between the hosts and the target? In any
case, you would need DCB PFC or PAUSE enabled to avoid excessive Tx
retries, though that should not cause a send failure.
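A sketch of how flow-control state could be checked and enabled (interface name p4p1 is an assumption carried over from earlier in the thread; lldptool requires the lldpad service and a DCB-capable setup):

```shell
IFACE=p4p1

# Show the currently negotiated link-level PAUSE parameters.
ethtool -a "$IFACE"

# Enable PAUSE in both directions if the NIC/switch support it.
ethtool -A "$IFACE" rx on tx on

# On a DCB setup, per-priority flow control (PFC) can be inspected
# via lldpad's lldptool instead:
#   lldptool -t -i "$IFACE" -V PFC
```

With neither PFC nor PAUSE, a congested switch port will simply drop frames, which forces the retries mentioned above.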


//Vasu
ffff88062dff7400, xid <0xfcf>, remaining 0, lso_max <0x10000>
[13122.335322] ------------[ cut here ]------------
[13122.335336] WARNING: CPU: 6 PID: 2165 at
include/scsi/fc_frame.h:173 fcoe_percpu_receive_thread+0x507/0x53c
[fcoe]()
[13122.335338] Modules linked in: async_memcpy async_xor xor async_tx
fcoe libfcoe tcm_fc libfc scsi_transport_fc scsi_tgt target_core_pscsi
target_core_file target_core_iblock iscsi_target_mod target_core_mod
8021q garp mrp bridge stp llc iTCO_wdt gpio_ich iTCO_vendor_support
coretemp kvm_intel kvm crc32c_intel microcode serio_raw i2c_i801
lpc_ich mfd_core ses enclosure i7core_edac ioatdma edac_core shpchp
acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd sunrpc radeon
drm_kms_helper ttm drm ixgbe igb ata_generic mdio pata_acpi ptp
pata_jmicron pps_core i2c_algo_bit aacraid dca i2c_core [last
unloaded: vd]
[13122.335390] CPU: 6 PID: 2165 Comm: fcoethread/6 Tainted: GF
O 3.13.10-200.zbfcoepatch.fc20.x86_64 #1
[13122.335392] Hardware name: Supermicro X8DTN/X8DTN, BIOS 2.1c 10/28/2011
[13122.335394] 0000000000000009 ffff88062b04bdd0 ffffffff81687eac
0000000000000000
[13122.335400] ffff88062b04be08 ffffffff8106d4dd ffffe8ffffc41748
ffff88062a444700
[13122.335404] ffff8800b7e926e8 0000000000000002 ffff88062b04be88
ffff88062b04be18
[13122.335419] [<ffffffff81687eac>] dump_stack+0x45/0x56
[13122.335426] [<ffffffff8106d4dd>] warn_slowpath_common+0x7d/0xa0
[13122.335430] [<ffffffff8106d5ba>] warn_slowpath_null+0x1a/0x20
[13122.335435] [<ffffffffa0651517>]
fcoe_percpu_receive_thread+0x507/0x53c [fcoe]
[13122.335440] [<ffffffffa0651010>] ? fcoe_set_port_id+0x50/0x50 [fcoe]
[13122.335446] [<ffffffff8108f2f2>] kthread+0xd2/0xf0
[13122.335450] [<ffffffff8108f220>] ? insert_kthread_work+0x40/0x40
[13122.335458] [<ffffffff81696dbc>] ret_from_fork+0x7c/0xb0
[13122.335461] [<ffffffff8108f220>] ? insert_kthread_work+0x40/0x40
[13122.335464] ---[ end trace e4509e1053f499ac ]---
Thanks,
Jun
On Tue, May 20, 2014 at 11:03 AM, Nicholas A. Bellinger
Post by Nicholas A. Bellinger
Post by Jun Wu
Hi Nicholas,
We downloaded the source of our running kernel (3.13.10-200) and
applied your percpu-ida pre-allocation regression fix, then compiled
and installed the kernel. I repeated the same test three times,
running 10 fio sessions to 10 drives on the target through fcoe vn2vn.
In the first two tests, the target machine hung with the following messages:
Failed to send frame ffff880c0b188200, xid <0x2a5>, remaining 196608,
lso_max <0x10000>
Failed to send frame ffff880c0b188200, xid <0x2a5>, remaining 131072,
lso_max <0x10000>
Failed to send frame ffff880c0b188200, xid <0x2a5>, remaining 65536,
lso_max <0x10000>
Failed to send frame ffff880c0b188200, xid <0x2a5>, remaining 0,
lso_max <0x10000>
Failed to send frame ffff880c1d1df000, xid <0x305>, remaining 196608,
lso_max <0x10000>
Failed to send frame ffff880c1d1df000, xid <0x305>, remaining 131072,
lso_max <0x10000>
Failed to send frame ffff880c1d1df000, xid <0x305>, remaining 65536,
lso_max <0x10000>
Failed to send frame ffff880c1d1df000, xid <0x305>, remaining 0,
lso_max <0x10000>
Failed to send frame ffff88060cd40800, xid <0x3cb>, remaining 458752,
lso_max <0x10000>
Failed to send frame ffff88060cd40800, xid <0x3cb>, remaining 393216,
lso_max <0x10000>
Failed to send frame ffff88060cd40800, xid <0x3cb>, remaining 327680,
lso_max <0x10000>
Failed to send frame ffff88060cd40800, xid <0x3cb>, remaining 262144,
lso_max <0x10000>
Failed to send frame ffff88060cd40800, xid <0x3cb>, remaining 196608,
lso_max <0x10000>
Failed to send frame ffff88060cd40800, xid <0x3cb>, remaining 131072,
lso_max <0x10000>
Failed to send frame ffff88060cd40800, xid <0x3cb>, remaining 65536,
lso_max <0x10000>
Failed to send frame ffff88060cd40800, xid <0x3cb>, remaining 0,
lso_max <0x10000>
Failed to send frame ffff880c0b24ca00, xid <0xea6>, remaining 196608,
lso_max <0x10000>
Failed to send frame ffff880c0b24ca00, xid <0xea6>, remaining 131072,
lso_max <0x10000>
15249 May 19 11:51:12 poc1 kernel: [ 1178.698434] ft_queue_data_in: 6
callbacks suppressed
Failed to send frame ffff88060ba97400, xid <0xb8a>, remaining 458752,
lso_max <0x10000>
Failed to send frame ffff88060ba97400, xid <0xb8a>, remaining 393216,
lso_max <0x10000>
Failed to send frame ffff88060ba97400, xid <0xb8a>, remaining 327680,
lso_max <0x10000>
Failed to send frame ffff88060ba97400, xid <0xb8a>, remaining 262144,
lso_max <0x10000>
Failed to send frame ffff88060ba97400, xid <0xb8a>, remaining 196608,
lso_max <0x10000>
Failed to send frame ffff88060ba97400, xid <0xb8a>, remaining 131072,
lso_max <0x10000>
Failed to send frame ffff88060ba97400, xid <0xb8a>, remaining 65536,
lso_max <0x10000>
Failed to send frame ffff88060ba97400, xid <0xb8a>, remaining 0,
lso_max <0x10000>
The call into the lport->tt.seq_send() libfc code is failing to send
outgoing solicited data-in. From the output, note that the LSO (large
send offload, aka TCP segmentation offload) feature has been enabled by
the underlying NIC hardware.
- Disabling hardware offloads (LRO + LSO) on both initiator and target
sides using ethtool -K
- Disabling any jumbo-frame settings on either side
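For reference, those two mitigations could be applied with something
like the following sketch (`p4p1` is the interface name mentioned later
in this thread; substitute your own, and note this assumes jumbo frames
were configured via the interface MTU):

```shell
# Sketch only: disable large-send/receive offloads and jumbo frames.
# p4p1 is the interface name used elsewhere in this thread.
IFACE=p4p1
ethtool -K "$IFACE" lro off tso off   # LRO + LSO/TSO off (run on both sides)
ip link set dev "$IFACE" mtu 1500     # back to a standard, non-jumbo MTU
```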
Are there any other non-standard network and/or switch settings in
place..? Also, please confirm what your NIC + switch setup looks
like.
Rob & Open-FCoE folks, is there anything else to take into consideration
here..?
Post by Jun Wu
I didn't see the previous "unable to handle kernel NULL pointer
dereference at 0000000000000048" message, so it must have been fixed
by your change.
Thanks for confirming that bit.
--nab
_______________________________________________
fcoe-devel mailing list
http://lists.open-fcoe.org/mailman/listinfo/fcoe-devel
Jun Wu
2014-05-21 21:03:40 UTC
Permalink
I enabled Tx offload and set the debug_logging parameters as
suggested. A few minutes of running generated a 1.5 GB messages file.
The file has repeated patterns of the following:

18630713 May 21 11:01:53 poc2 kernel: [ 1528.182334] host7: xid b48:
f_ctl 800000 seq 1
18630714 May 21 11:01:53 poc2 kernel: [ 1528.182345] host7: xid b48:
f_ctl 880008 seq 2
18630715 May 21 11:01:53 poc2 kernel: [ 1528.182601] host7: xid 38e:
f_ctl 800000 seq 1
18630716 May 21 11:01:53 poc2 kernel: [ 1528.182621] host7: xid 38e:
f_ctl 880008 seq 2
18630717 May 21 11:01:53 poc2 kernel: [ 1528.182771] host7: xid 74a:
f_ctl 800000 seq 1
18630718 May 21 11:01:53 poc2 kernel: [ 1528.182785] host7: xid 74a:
f_ctl 880008 seq 2
18630719 May 21 11:01:53 poc2 kernel: [ 1528.183161] host7: xid e4f:
f_ctl 800000 seq 1
18630720 May 21 11:01:53 poc2 kernel: [ 1528.183181] host7: xid e4f:
f_ctl 880008 seq 2
18630721 May 21 11:01:53 poc2 kernel: [ 1528.184285] host7: xid 666:
f_ctl 800000 seq 1
18630722 May 21 11:01:53 poc2 kernel: [ 1528.184301] host7: xid 666:
f_ctl 880008 seq 2
18630723 May 21 11:01:53 poc2 kernel: [ 1528.184550] host7: xid c20:
f_ctl 800000 seq 1
18630724 May 21 11:01:53 poc2 kernel: [ 1528.184589] host7: xid c20:
f_ctl 880008 seq 2
18630725 May 21 11:01:53 poc2 kernel: [ 1528.185198] host7: xid 607:
f_ctl 800000 seq 1
18630726 May 21 11:01:53 poc2 kernel: [ 1528.185213] host7: xid 607:
f_ctl 880008 seq 2
18630727 May 21 11:01:53 poc2 kernel: [ 1528.185659] host7: xid 925:
f_ctl 800000 seq 1
18630728 May 21 11:01:53 poc2 kernel: [ 1528.185662] host7: xid b48:
f_ctl 800000 seq 1
18630729 May 21 11:01:53 poc2 kernel: [ 1528.185672] host7: xid b48:
f_ctl 880008 seq 2
18630730 May 21 11:01:53 poc2 kernel: [ 1528.185680] host7: xid 925:
f_ctl 880008 seq 2
18630731 May 21 11:01:53 poc2 kernel: [ 1528.185751] host7: xid b61:
f_ctl 800000 seq 1
18630732 May 21 11:01:53 poc2 kernel: [ 1528.185765] host7: xid b61:
f_ctl 880008 seq 2
18630733 May 21 11:01:53 poc2 kernel: [ 1528.186413] host7: xid 829:
f_ctl 800000 seq 1
18630734 May 21 11:01:53 poc2 kernel: [ 1528.186425] host7: xid 829:
f_ctl 880008 seq 2
18630735 May 21 11:01:53 poc2 kernel: [ 1528.186785] host7: xid 6e4:
f_ctl 800000 seq 1
18630736 May 21 11:01:53 poc2 kernel: [ 1528.186817] host7: xid 6e4:
f_ctl 880008 seq 2
18630737 May 21 11:01:53 poc2 kernel: [ 1528.186932] host7: xid dab:
f_ctl 800000 seq 1
18630738 May 21 11:01:53 poc2 kernel: [ 1528.186946] host7: xid dab:
f_ctl 880008 seq 2
18630739 May 21 11:01:53 poc2 kernel: [ 1528.187907] host7: xid 8ac:
f_ctl 800000 seq 1
18630740 May 21 11:01:53 poc2 kernel: [ 1528.187920] host7: xid 8ac:
f_ctl 880008 seq 2
18630741 May 21 11:01:53 poc2 kernel: [ 1528.188656] host7: xid 38e:
f_ctl 800000 seq 1
18630742 May 21 11:01:53 poc2 kernel: [ 1528.188675] host7: xid 38e:
f_ctl 880008 seq 2
18630743 May 21 11:01:53 poc2 kernel: [ 1528.188889] host7: xid b61:
f_ctl 800000 seq 1
18630744 May 21 11:01:53 poc2 kernel: [ 1528.188899] host7: xid b61:
f_ctl 880008 seq 2
18630745 May 21 11:01:53 poc2 kernel: [ 1528.189281] host7: xid 88d:
f_ctl 800000 seq 1
18630746 May 21 11:01:53 poc2 kernel: [ 1528.189301] host7: xid 88d:
f_ctl 880008 seq 2
18630747 May 21 11:01:53 poc2 kernel: [ 1528.189378] host7: xid c20:
f_ctl 800000 seq 1
18630748 May 21 11:01:53 poc2 kernel: [ 1528.189392] host7: xid c20:
f_ctl 880008 seq 2
18630749 May 21 11:01:53 poc2 kernel: [ 1528.189836] host7: xid 862:
f_ctl 800000 seq 1
18630750 May 21 11:01:53 poc2 kernel: [ 1528.189850] host7: xid 862:
f_ctl 880008 seq 2
18630751 May 21 11:01:53 poc2 kernel: [ 1528.191740] host7: xid 6e4:
Exchange timer armed : 0 msecs
18630752 May 21 11:01:53 poc2 kernel: [ 1528.191747] host7: xid 6e4:
f_ctl 800000 seq 1
18630753 May 21 11:01:53 poc2 kernel: [ 1528.191756] host7: xid 6e4:
f_ctl 800000 seq 2
18630754 May 21 11:01:53 poc2 kernel: [ 1528.191763] host7: xid 6e4:
Exchange timed out
18630755 May 21 11:01:53 poc2 kernel: [ 1528.191777] ft_queue_data_in:
Failed to send frame ffff8805bf245e00, xid <0x6e4>, remaining 458752,
lso_max <0x10000>
18630756 May 21 11:01:53 poc2 kernel: [ 1528.191782] ft_queue_data_in:
Failed to send frame ffff8805bf245e00, xid <0x6e4>, remaining 393216,
lso_max <0x10000>
18630757 May 21 11:01:53 poc2 kernel: [ 1528.191786] ft_queue_data_in:
Failed to send frame ffff8805bf245e00, xid <0x6e4>, remaining 327680,
lso_max <0x10000>
18630758 May 21 11:01:53 poc2 kernel: [ 1528.191790] ft_queue_data_in:
Failed to send frame ffff8805bf245e00, xid <0x6e4>, remaining 262144,
lso_max <0x10000>
18630759 May 21 11:01:53 poc2 kernel: [ 1528.191794] ft_queue_data_in:
Failed to send frame ffff8805bf245e00, xid <0x6e4>, remaining 196608,
lso_max <0x10000>
18630760 May 21 11:01:53 poc2 kernel: [ 1528.191798] ft_queue_data_in:
Failed to send frame ffff8805bf245e00, xid <0x6e4>, remaining 131072,
lso_max <0x10000>
18630761 May 21 11:01:53 poc2 kernel: [ 1528.191801] ft_queue_data_in:
Failed to send frame ffff8805bf245e00, xid <0x6e4>, remaining 65536,
lso_max <0x10000>
18630762 May 21 11:01:53 poc2 kernel: [ 1528.191805] ft_queue_data_in:
Failed to send frame ffff8805bf245e00, xid <0x6e4>, remaining 0,
lso_max <0x10000>
18630763 May 21 11:01:53 poc2 kernel: [ 1528.192163] host7: xid b48:
f_ctl 800000 seq 1
18630764 May 21 11:01:53 poc2 kernel: [ 1528.192166] host7: xid 607:
f_ctl 800000 seq 1
18630765 May 21 11:01:53 poc2 kernel: [ 1528.192176] host7: xid 607:
f_ctl 880008 seq 2
18630766 May 21 11:01:53 poc2 kernel: [ 1528.192180] host7: xid b48:
f_ctl 880008 seq 2
18630767 May 21 11:01:53 poc2 kernel: [ 1528.192266] host7: xid 666:
f_ctl 800000 seq 1
Above is the first time ft_queue_data_in message shows up in the file.
Here is another instance:
18631537 May 21 11:01:53 poc2 kernel: [ 1528.333876] host7: xid 74a:
f_ctl 800000 seq 1
18631538 May 21 11:01:53 poc2 kernel: [ 1528.333893] host7: xid 74a:
f_ctl 880008 seq 2
18631539 May 21 11:01:53 poc2 kernel: [ 1528.334816] host7: xid c20:
f_ctl 800000 seq 1
18631540 May 21 11:01:53 poc2 kernel: [ 1528.334834] host7: xid c20:
f_ctl 880008 seq 2
18631541 May 21 11:01:53 poc2 kernel: [ 1528.334847] host7: xid b81:
f_ctl 800000 seq 1
18631542 May 21 11:01:53 poc2 kernel: [ 1528.334858] host7: xid 983:
f_ctl 800000 seq 1
18631543 May 21 11:01:53 poc2 kernel: [ 1528.334864] host7: xid b81:
f_ctl 880008 seq 2
18631544 May 21 11:01:53 poc2 kernel: [ 1528.334881] host7: xid 983:
f_ctl 880008 seq 2
18631545 May 21 11:01:53 poc2 kernel: [ 1528.334972] host7: xid 686:
f_ctl 800000 seq 1
18631546 May 21 11:01:53 poc2 kernel: [ 1528.334985] host7: xid 686:
f_ctl 880008 seq 2
18631547 May 21 11:01:53 poc2 kernel: [ 1528.335036] host7: xid 704:
f_ctl 800000 seq 1
18631548 May 21 11:01:53 poc2 kernel: [ 1528.335052] host7: xid 704:
f_ctl 880008 seq 2
18631549 May 21 11:01:53 poc2 kernel: [ 1528.335078] host7: xid 627:
f_ctl 800000 seq 1
18631550 May 21 11:01:53 poc2 kernel: [ 1528.335088] host7: xid 627:
f_ctl 880008 seq 2
18631551 May 21 11:01:53 poc2 kernel: [ 1528.335202] host7: xid b48:
f_ctl 800000 seq 1
18631552 May 21 11:01:53 poc2 kernel: [ 1528.335214] host7: xid b48:
f_ctl 880008 seq 2
18631553 May 21 11:01:53 poc2 kernel: [ 1528.335381] host7: xid 74a:
Exchange timer armed : 0 msecs
18631554 May 21 11:01:53 poc2 kernel: [ 1528.335386] host7: xid 74a:
f_ctl 800000 seq 1
18631555 May 21 11:01:53 poc2 kernel: [ 1528.335388] host7: xid 74a:
Exchange timed out
18631556 May 21 11:01:53 poc2 kernel: [ 1528.335672] host7: xid 869:
f_ctl 800000 seq 1
18631557 May 21 11:01:53 poc2 kernel: [ 1528.335688] host7: xid 869:
f_ctl 880008 seq 2
18631558 May 21 11:01:53 poc2 kernel: [ 1528.336213] host7: xid 965:
f_ctl 800000 seq 1
18631559 May 21 11:01:53 poc2 kernel: [ 1528.336232] host7: xid 965:
f_ctl 880008 seq 2
18631560 May 21 11:01:53 poc2 kernel: [ 1528.336477] host7: xid 74a:
f_ctl 800000 seq 2
18631561 May 21 11:01:53 poc2 kernel: [ 1528.336482] ft_queue_data_in:
Failed to send frame ffff8802e25c1e00, xid <0x74a>, remaining 196608,
lso_max <0x10000>
18631562 May 21 11:01:53 poc2 kernel: [ 1528.336486] ft_queue_data_in:
Failed to send frame ffff8802e25c1e00, xid <0x74a>, remaining 131072,
lso_max <0x10000>
18631563 May 21 11:01:53 poc2 kernel: [ 1528.336489] host7: xid 74a:
f_ctl 800000 seq 3
18631564 May 21 11:01:53 poc2 kernel: [ 1528.337645] host7: xid 86d:
f_ctl 800000 seq 1
18631565 May 21 11:01:53 poc2 kernel: [ 1528.337759] host7: xid 86d:
f_ctl 880008 seq 2
18631566 May 21 11:01:53 poc2 kernel: [ 1528.337827] host7: xid 44e:
f_ctl 800000 seq 1
18631567 May 21 11:01:53 poc2 kernel: [ 1528.337846] host7: xid 44e:
f_ctl 880008 seq 2
18631568 May 21 11:01:53 poc2 kernel: [ 1528.340521] host7: xid 8c2:
Exchange timer armed : 0 msecs
18631569 May 21 11:01:53 poc2 kernel: [ 1528.340526] host7: xid 8c2:
f_ctl 800000 seq 1
18631570 May 21 11:01:53 poc2 kernel: [ 1528.340667] host7: xid 8c2:
Exchange timed out
18631571 May 21 11:01:53 poc2 kernel: [ 1528.341064] host7: xid 983:
f_ctl 800000 seq 1
18631572 May 21 11:01:53 poc2 kernel: [ 1528.341087] host7: xid 983:
f_ctl 880008 seq 2
18631573 May 21 11:01:53 poc2 kernel: [ 1528.341286] host7: xid b48:
f_ctl 800000 seq 1
18631574 May 21 11:01:53 poc2 kernel: [ 1528.341306] host7: xid b48:
f_ctl 880008 seq 2
18631575 May 21 11:01:53 poc2 kernel: [ 1528.341522] host7: xid 869:
Exchange timer armed : 0 msecs
18631576 May 21 11:01:53 poc2 kernel: [ 1528.341528] host7: xid 869:
f_ctl 800000 seq 1
18631577 May 21 11:01:53 poc2 kernel: [ 1528.341539] host7: xid 869:
Exchange timed out
18631578 May 21 11:01:53 poc2 kernel: [ 1528.341966] host7: xid 965:
f_ctl 800000 seq 1
18631579 May 21 11:01:53 poc2 kernel: [ 1528.341979] host7: xid 965:
f_ctl 880008 seq 2
18631580 May 21 11:01:53 poc2 kernel: [ 1528.342450] host7: xid 627:
f_ctl 800000 seq 1
18631581 May 21 11:01:53 poc2 kernel: [ 1528.342467] host7: xid 627:
f_ctl 880008 seq 2
18631582 May 21 11:01:53 poc2 kernel: [ 1528.342945] host7: xid 70a:
f_ctl 800000 seq 1
18631583 May 21 11:01:53 poc2 kernel: [ 1528.342959] host7: xid 70a:
f_ctl 880008 seq 2
After stripping out the repeated "host7: xid xxx: f_ctl 800000 seq
x" messages, we have:
18630507:May 21 11:01:53 poc2 kernel: [ 1528.148314] host7: xid d6b:
Exchange timer armed : 0 msecs
18630509:May 21 11:01:53 poc2 kernel: [ 1528.148357] host7: xid d6b:
Exchange timed out
18630642:May 21 11:01:53 poc2 kernel: [ 1528.169143] host7: xid 88c:
Exchange timer armed : 0 msecs
18630644:May 21 11:01:53 poc2 kernel: [ 1528.169197] host7: xid 88c:
Exchange timed out
18630751:May 21 11:01:53 poc2 kernel: [ 1528.191740] host7: xid 6e4:
Exchange timer armed : 0 msecs
18630754:May 21 11:01:53 poc2 kernel: [ 1528.191763] host7: xid 6e4:
Exchange timed out
18630755:May 21 11:01:53 poc2 kernel: [ 1528.191777] ft_queue_data_in:
Failed to send frame ffff8805bf245e00, xid <0x6e4>, remaining 458752,
lso_max <0x10000>
18630756:May 21 11:01:53 poc2 kernel: [ 1528.191782] ft_queue_data_in:
Failed to send frame ffff8805bf245e00, xid <0x6e4>, remaining 393216,
lso_max <0x10000>
18630757:May 21 11:01:53 poc2 kernel: [ 1528.191786] ft_queue_data_in:
Failed to send frame ffff8805bf245e00, xid <0x6e4>, remaining 327680,
lso_max <0x10000>
18630758:May 21 11:01:53 poc2 kernel: [ 1528.191790] ft_queue_data_in:
Failed to send frame ffff8805bf245e00, xid <0x6e4>, remaining 262144,
lso_max <0x10000>
18630759:May 21 11:01:53 poc2 kernel: [ 1528.191794] ft_queue_data_in:
Failed to send frame ffff8805bf245e00, xid <0x6e4>, remaining 196608,
lso_max <0x10000>
18630760:May 21 11:01:53 poc2 kernel: [ 1528.191798] ft_queue_data_in:
Failed to send frame ffff8805bf245e00, xid <0x6e4>, remaining 131072,
lso_max <0x10000>
18630761:May 21 11:01:53 poc2 kernel: [ 1528.191801] ft_queue_data_in:
Failed to send frame ffff8805bf245e00, xid <0x6e4>, remaining 65536,
lso_max <0x10000>
18630762:May 21 11:01:53 poc2 kernel: [ 1528.191805] ft_queue_data_in:
Failed to send frame ffff8805bf245e00, xid <0x6e4>, remaining 0,
lso_max <0x10000>
18630777:May 21 11:01:53 poc2 kernel: [ 1528.195937] host7: xid 666:
Exchange timer armed : 0 msecs
18630779:May 21 11:01:53 poc2 kernel: [ 1528.195983] host7: xid 666:
Exchange timed out
18630780:May 21 11:01:53 poc2 kernel: [ 1528.196336] host7: xid 862:
Exchange timer armed : 0 msecs
18630782:May 21 11:01:53 poc2 kernel: [ 1528.196348] host7: xid 862:
Exchange timed out
18630879:May 21 11:01:53 poc2 kernel: [ 1528.211080] host7: xid b61:
Exchange timer armed : 0 msecs
18630881:May 21 11:01:53 poc2 kernel: [ 1528.211137] host7: xid b61:
Exchange timed out
18630898:May 21 11:01:53 poc2 kernel: [ 1528.214253] host7: xid 38e:
Exchange timer armed : 0 msecs
18630900:May 21 11:01:53 poc2 kernel: [ 1528.214284] host7: xid 38e:
Exchange timed out
18631007:May 21 11:01:53 poc2 kernel: [ 1528.236137] host7: xid 88d:
Exchange timer armed : 0 msecs
18631009:May 21 11:01:53 poc2 kernel: [ 1528.236172] host7: xid 88d:
Exchange timed out
18631176:May 21 11:01:53 poc2 kernel: [ 1528.264223] host7: xid 607:
Exchange timer armed : 0 msecs
18631178:May 21 11:01:53 poc2 kernel: [ 1528.264239] host7: xid 607:
Exchange timed out
18631185:May 21 11:01:53 poc2 kernel: [ 1528.266310] host7: xid 82d:
Exchange timer armed : 0 msecs
18631187:May 21 11:01:53 poc2 kernel: [ 1528.266351] host7: xid 82d:
Exchange timed out
18631226:May 21 11:01:53 poc2 kernel: [ 1528.275452] host7: xid 8a2:
Exchange timer armed : 0 msecs
18631228:May 21 11:01:53 poc2 kernel: [ 1528.275463] host7: xid 8a2:
Exchange timed out
18631553:May 21 11:01:53 poc2 kernel: [ 1528.335381] host7: xid 74a:
Exchange timer armed : 0 msecs
18631555:May 21 11:01:53 poc2 kernel: [ 1528.335388] host7: xid 74a:
Exchange timed out
18631561:May 21 11:01:53 poc2 kernel: [ 1528.336482] ft_queue_data_in:
Failed to send frame ffff8802e25c1e00, xid <0x74a>, remaining 196608,
lso_max <0x10000>
18631562:May 21 11:01:53 poc2 kernel: [ 1528.336486] ft_queue_data_in:
Failed to send frame ffff8802e25c1e00, xid <0x74a>, remaining 131072,
lso_max <0x10000>
18631568:May 21 11:01:53 poc2 kernel: [ 1528.340521] host7: xid 8c2:
Exchange timer armed : 0 msecs
18631570:May 21 11:01:53 poc2 kernel: [ 1528.340667] host7: xid 8c2:
Exchange timed out
18631575:May 21 11:01:53 poc2 kernel: [ 1528.341522] host7: xid 869:
Exchange timer armed : 0 msecs
18631577:May 21 11:01:53 poc2 kernel: [ 1528.341539] host7: xid 869:
Exchange timed out
18631660:May 21 11:01:53 poc2 kernel: [ 1528.356897] host7: xid e4f:
Exchange timer armed : 0 msecs
18631662:May 21 11:01:53 poc2 kernel: [ 1528.356975] host7: xid e4f:
Exchange timed out
18631825:May 21 11:01:53 poc2 kernel: [ 1528.398431] host7: xid 965:
Exchange timer armed : 0 msecs
18631827:May 21 11:01:53 poc2 kernel: [ 1528.398509] host7: xid 965:
Exchange timed out
18631844:May 21 11:01:53 poc2 kernel: [ 1528.401772] host7: xid 44e:
Exchange timer armed : 0 msecs
18631846:May 21 11:01:53 poc2 kernel: [ 1528.401850] host7: xid 44e:
Exchange timed out
18631859:May 21 11:01:53 poc2 kernel: [ 1528.404996] host7: xid 704:
Exchange timer armed : 0 msecs
18631861:May 21 11:01:53 poc2 kernel: [ 1528.405043] host7: xid 704:
Exchange timed out
18631876:May 21 11:01:53 poc2 kernel: [ 1528.412608] host7: xid 86d:
Exchange timer armed : 0 msecs
18631878:May 21 11:01:53 poc2 kernel: [ 1528.412619] host7: xid 86d:
Exchange timed out
18631919:May 21 11:01:53 poc2 kernel: [ 1528.431939] host7: xid 686:
Exchange timer armed : 0 msecs
18631921:May 21 11:01:53 poc2 kernel: [ 1528.432000] host7: xid 686:
Exchange timed out
18631950:May 21 11:01:53 poc2 kernel: [ 1528.441542] host7: xid def:
Exchange timer armed : 0 msecs
18631952:May 21 11:01:53 poc2 kernel: [ 1528.441573] host7: xid def:
Exchange timed out
18631967:May 21 11:01:53 poc2 kernel: [ 1528.447848] host7: xid 3ce:
Exchange timer armed : 0 msecs
18631969:May 21 11:01:53 poc2 kernel: [ 1528.447863] host7: xid 3ce:
Exchange timed out
18632008:May 21 11:01:53 poc2 kernel: [ 1528.453868] host7: xid 627:
Exchange timer armed : 0 msecs
18632010:May 21 11:01:53 poc2 kernel: [ 1528.453879] host7: xid 627:
Exchange timed out
18655343:May 21 11:01:56 poc2 kernel: [ 1531.747863] host7: xid 48e:
Exchange timer armed : 0 msecs
18655345:May 21 11:01:56 poc2 kernel: [ 1531.747879] host7: xid 48e:
Exchange timed out
18655414:May 21 11:01:56 poc2 kernel: [ 1531.758175] host7: xid 667:
Exchange timer armed : 0 msecs
18655416:May 21 11:01:56 poc2 kernel: [ 1531.758204] host7: xid 667:
Exchange timed out
18655533:May 21 11:01:56 poc2 kernel: [ 1531.776753] host7: xid 6c6:
Exchange timer armed : 0 msecs
18655535:May 21 11:01:56 poc2 kernel: [ 1531.776765] host7: xid 6c6:
Exchange timed out
18655832:May 21 11:01:56 poc2 kernel: [ 1531.818184] host7: xid 647:
Exchange timer armed : 0 msecs
18655834:May 21 11:01:56 poc2 kernel: [ 1531.818227] host7: xid 647:
Exchange timed out
18655893:May 21 11:01:56 poc2 kernel: [ 1531.825287] host7: xid 46e:
Exchange timer armed : 0 msecs
18655895:May 21 11:01:56 poc2 kernel: [ 1531.825344] host7: xid 46e:
Exchange timed out
18656116:May 21 11:01:56 poc2 kernel: [ 1531.863739] host7: xid dab:
Exchange timer armed : 0 msecs
18656118:May 21 11:01:56 poc2 kernel: [ 1531.863765] host7: xid dab:
Exchange timed out
18656124:May 21 11:01:56 poc2 kernel: [ 1531.865740] host7: xid 8ac:
Exchange timer armed : 0 msecs
18656129:May 21 11:01:56 poc2 kernel: [ 1531.865800] host7: xid 8ac:
Exchange timed out
18656130:May 21 11:01:56 poc2 kernel: [ 1531.865813] host7: xid 5e7:
Exchange timer armed : 2000 msecs
18656267:May 21 11:01:56 poc2 kernel: [ 1531.891151] host7: xid e6f:
Exchange timer armed : 0 msecs
18656269:May 21 11:01:56 poc2 kernel: [ 1531.891177] host7: xid e6f:
Exchange timed out
18656320:May 21 11:01:56 poc2 kernel: [ 1531.899587] host7: xid 82c:
Exchange timer armed : 0 msecs
18656322:May 21 11:01:56 poc2 kernel: [ 1531.899606] host7: xid 82c:
Exchange timed out
18656355:May 21 11:01:56 poc2 kernel: [ 1531.906904] host7: xid 646:
Exchange timer armed : 0 msecs
18656357:May 21 11:01:56 poc2 kernel: [ 1531.906952] host7: xid 646:
Exchange timed out
18656608:May 21 11:01:56 poc2 kernel: [ 1531.960276] host7: xid 945:
Exchange timer armed : 0 msecs
18656610:May 21 11:01:56 poc2 kernel: [ 1531.960293] host7: xid 945:
Exchange timed out
18656613:May 21 11:01:56 poc2 kernel: [ 1531.960985] host7: xid 724:
Exchange timer armed : 0 msecs
18656615:May 21 11:01:56 poc2 kernel: [ 1531.961002] host7: xid 724:
Exchange timed out
18656634:May 21 11:01:56 poc2 kernel: [ 1531.968643] host7: xid 8ad:
Exchange timer armed : 0 msecs
18656636:May 21 11:01:56 poc2 kernel: [ 1531.968660] host7: xid 8ad:
Exchange timed out
18656641:May 21 11:01:56 poc2 kernel: [ 1531.974843] host7: xid 5e7:
Exchange timer armed : 10000 msecs
18656642:May 21 11:01:56 poc2 kernel: [ 1531.974848] host7: xid 5e7:
f_ctl 90000 seq 1
18656655:May 21 11:01:56 poc2 kernel: [ 1531.983896] host7: xid 687:
Exchange timer armed : 0 msecs
18656657:May 21 11:01:56 poc2 kernel: [ 1531.983912] host7: xid 687:
Exchange timed out
18656746:May 21 11:01:56 poc2 kernel: [ 1532.011935] host7: xid 5e7:
Exchange timer armed : 10000 msecs
18656747:May 21 11:01:56 poc2 kernel: [ 1532.011940] host7: xid 5e7:
f_ctl 90000 seq 2
18669030:May 21 11:01:58 poc2 kernel: [ 1533.870105] host7: xid 5e7:
Exchange timed out
18841592:May 21 11:02:27 poc2 kernel: [ 1562.646991] host7: xid 90d:
Exchange timer armed : 0 msecs
18841595:May 21 11:02:27 poc2 kernel: [ 1562.647066] host7: xid 90d:
Exchange timed out
18842498:May 21 11:02:27 poc2 kernel: [ 1562.795804] host7: xid 8ac:
Exchange timer armed : 0 msecs
18842500:May 21 11:02:27 poc2 kernel: [ 1562.795820] host7: xid 8ac:
Exchange timed out
18842547:May 21 11:02:27 poc2 kernel: [ 1562.805335] host7: xid 94c:
Exchange timer armed : 0 msecs
18842549:May 21 11:02:27 poc2 kernel: [ 1562.805351] host7: xid 94c:
Exchange timed out
18842584:May 21 11:02:27 poc2 kernel: [ 1562.810550] host7: xid f2f:
Exchange timer armed : 0 msecs
18842586:May 21 11:02:27 poc2 kernel: [ 1562.810605] host7: xid f2f:
Exchange timed out
18842593:May 21 11:02:27 poc2 kernel: [ 1562.811735] host7: xid 7e6:
Exchange timer armed : 0 msecs
18842595:May 21 11:02:27 poc2 kernel: [ 1562.811795] host7: xid 7e6:
Exchange timed out
18842740:May 21 11:02:27 poc2 kernel: [ 1562.837943] host7: xid e8b:
Exchange timer armed : 0 msecs
18842742:May 21 11:02:27 poc2 kernel: [ 1562.837966] host7: xid 8cd:
Exchange timer armed : 0 msecs
18842743:May 21 11:02:27 poc2 kernel: [ 1562.837968] host7: xid e8b:
Exchange timed out
18842745:May 21 11:02:27 poc2 kernel: [ 1562.838038] host7: xid 8cd:
Exchange timed out
18842872:May 21 11:02:27 poc2 kernel: [ 1562.863661] host7: xid 824:
Exchange timer armed : 0 msecs
18842874:May 21 11:02:27 poc2 kernel: [ 1562.863674] host7: xid 824:
Exchange timed out
18842951:May 21 11:02:27 poc2 kernel: [ 1562.876390] host7: xid 766:
Exchange timer armed : 0 msecs
18842953:May 21 11:02:27 poc2 kernel: [ 1562.876405] host7: xid 766:
Exchange timed out
18842960:May 21 11:02:27 poc2 kernel: [ 1562.878444] host7: xid ca8:
Exchange timer armed : 0 msecs
18842962:May 21 11:02:27 poc2 kernel: [ 1562.878470] host7: xid ca8:
Exchange timed out
18843025:May 21 11:02:27 poc2 kernel: [ 1562.892335] host7: xid a25:
Exchange timer armed : 0 msecs
18843027:May 21 11:02:27 poc2 kernel: [ 1562.892400] host7: xid a25:
Exchange timed out
18843092:May 21 11:02:27 poc2 kernel: [ 1562.905946] host7: xid be1:
Exchange timer armed : 0 msecs
18843094:May 21 11:02:27 poc2 kernel: [ 1562.905971] host7: xid be1:
Exchange timed out
18843239:May 21 11:02:27 poc2 kernel: [ 1562.934586] host7: xid 52e:
Exchange timer armed : 0 msecs
18843241:May 21 11:02:27 poc2 kernel: [ 1562.934601] host7: xid 52e:
Exchange timed out
18843264:May 21 11:02:27 poc2 kernel: [ 1562.938141] host7: xid bc8:
Exchange timer armed : 0 msecs
18843266:May 21 11:02:27 poc2 kernel: [ 1562.938164] host7: xid bc8:
Exchange timed out
18843353:May 21 11:02:27 poc2 kernel: [ 1562.959150] host7: xid c21:
Exchange timer armed : 0 msecs
18843356:May 21 11:02:27 poc2 kernel: [ 1562.959194] ft_queue_data_in:
10 callbacks suppressed
18843357:May 21 11:02:27 poc2 kernel: [ 1562.959196] ft_queue_data_in:
Failed to send frame ffff8800ba6e6400, xid <0xc21>, remaining 458752,
lso_max <0x10000>
18843358:May 21 11:02:27 poc2 kernel: [ 1562.959199] ft_queue_data_in:
Failed to send frame ffff8800ba6e6400, xid <0xc21>, remaining 393216,
lso_max <0x10000>
18843359:May 21 11:02:27 poc2 kernel: [ 1562.959202] ft_queue_data_in:
Failed to send frame ffff8800ba6e6400, xid <0xc21>, remaining 327680,
lso_max <0x10000>
18843360:May 21 11:02:27 poc2 kernel: [ 1562.959204] ft_queue_data_in:
Failed to send frame ffff8800ba6e6400, xid <0xc21>, remaining 262144,
lso_max <0x10000>
18843361:May 21 11:02:27 poc2 kernel: [ 1562.959208] host7: xid c21:
Exchange timed out
18843362:May 21 11:02:27 poc2 kernel: [ 1562.959209] ft_queue_data_in:
Failed to send frame ffff8800ba6e6400, xid <0xc21>, remaining 196608,
lso_max <0x10000>
18843363:May 21 11:02:27 poc2 kernel: [ 1562.959212] ft_queue_data_in:
Failed to send frame ffff8800ba6e6400, xid <0xc21>, remaining 131072,
lso_max <0x10000>
18843364:May 21 11:02:27 poc2 kernel: [ 1562.959214] ft_queue_data_in:
Failed to send frame ffff8800ba6e6400, xid <0xc21>, remaining 65536,
lso_max <0x10000>
18843365:May 21 11:02:27 poc2 kernel: [ 1562.959217] ft_queue_data_in:
Failed to send frame ffff8800ba6e6400, xid <0xc21>, remaining 0,
lso_max <0x10000>
18843416:May 21 11:02:27 poc2 kernel: [ 1562.968554] host7: xid f4f:
Exchange timer armed : 0 msecs
18843426:May 21 11:02:27 poc2 kernel: [ 1562.968656] host7: xid f4f:
Exchange timed out
18843581:May 21 11:02:27 poc2 kernel: [ 1562.991761] host7: xid 7e4:
Exchange timer armed : 0 msecs
18843583:May 21 11:02:27 poc2 kernel: [ 1562.991853] host7: xid 7e4:
Exchange timed out
18843690:May 21 11:02:27 poc2 kernel: [ 1563.011282] host7: xid a05:
Exchange timer armed : 0 msecs
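The stripping described above can be reproduced with a simple filter,
e.g. the following sketch (`messages.sample` is a placeholder; point
LOG at your captured messages file):

```shell
# Sketch: drop the per-frame "f_ctl ... seq ..." chatter and keep only
# the exchange-timer and send-failure events, with line numbers.
# messages.sample is a placeholder for the captured log file.
LOG=${LOG:-messages.sample}
if [ -r "$LOG" ]; then
  grep -nE 'Exchange timer armed|Exchange timed out|ft_queue_data_in' "$LOG"
fi
```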
Thanks
Post by Vasu Dev
Post by Jun Wu
MTU was 1500 on both initiator and target.
I used "ethtool -K p4p1 tso off" to turn off TCP segmentation offload
on all machines. The feature settings after the command are shown below.
rx-checksumming: on
tx-checksumming: on
tx-checksum-ipv4: on
tx-checksum-ip-generic: off [fixed]
tx-checksum-ipv6: on
tx-checksum-fcoe-crc: on [fixed]
tx-checksum-sctp: on
scatter-gather: on
tx-scatter-gather: on
tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: off
tx-tcp-segmentation: off
tx-tcp-ecn-segmentation: off [fixed]
tx-tcp6-segmentation: off
udp-fragmentation-offload: off [fixed]
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: off
receive-hashing: on
highdma: on [fixed]
rx-vlan-filter: on
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: on [fixed]
tx-gre-segmentation: off [fixed]
tx-ipip-segmentation: off [fixed]
tx-sit-segmentation: off [fixed]
tx-udp_tnl-segmentation: off [fixed]
tx-mpls-segmentation: off [fixed]
fcoe-mtu: on [fixed]
tx-nocache-copy: on
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off
tx-vlan-stag-hw-insert: off [fixed]
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
l2-fwd-offload: off
Info on NIC drivers
driver: ixgbe
version: 3.15.1-k
firmware-version: 0x80000208
bus-info: 0000:08:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no
After the change, I repeated the same test and got a similar failure:
[12253.032595] ft_queue_data_in: Failed to send frame
ffff88062a638600, xid <0xa0c>, remaining 458752, lso_max <0x10000>
It is a send-frame failure; to find out what caused it, more debug
info from the low-level fcoe Tx path functions will be helpful. It can
be enabled by:
# echo 0xFF > /sys/module/libfc/parameters/debug_logging
# echo 0x1 > /sys/module/fcoe/parameters/debug_logging
Disabling the Tx offloads may not help here and would instead slow down
Tx, so please restore them.
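Restoring the offloads means reverting the earlier "tso off" change. A minimal sketch with ethtool, assuming the interface name p4p1 used earlier in this thread (substitute your own):

```shell
# Re-enable TCP segmentation offload that was turned off earlier;
# p4p1 is the interface name from this thread -- adjust to your setup.
ethtool -K p4p1 tso on
# Verify the relevant offload settings afterwards.
ethtool -k p4p1 | grep -E 'tcp-segmentation-offload|large-receive-offload'
```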
Also, are you using a switch between the hosts and the target? In any
case you would need DCB PFC or PAUSE enabled to avoid excessive Tx
retries, though that should not cause send failures.
//Vasu
Post by Jun Wu
[12253.032605] ft_queue_data_in: Failed to send frame
ffff88062a638600, xid <0xa0c>, remaining 393216, lso_max <0x10000>
[12253.032609] ft_queue_data_in: Failed to send frame
ffff88062a638600, xid <0xa0c>, remaining 327680, lso_max <0x10000>
[12253.032613] ft_queue_data_in: Failed to send frame
ffff88062a638600, xid <0xa0c>, remaining 262144, lso_max <0x10000>
[12284.299877] ft_queue_data_in: Failed to send frame
ffff8803202ec600, xid <0x3a2>, remaining 196608, lso_max <0x10000>
[12284.299885] ft_queue_data_in: Failed to send frame
ffff8803202ec600, xid <0x3a2>, remaining 131072, lso_max <0x10000>
[12284.299889] ft_queue_data_in: Failed to send frame
ffff8803202ec600, xid <0x3a2>, remaining 65536, lso_max <0x10000>
[12284.299892] ft_queue_data_in: Failed to send frame
ffff8803202ec600, xid <0x3a2>, remaining 0, lso_max <0x10000>
[12284.451810] ft_queue_data_in: Failed to send frame
ffff88061deb1400, xid <0xecf>, remaining 458752, lso_max <0x10000>
[12284.451818] ft_queue_data_in: Failed to send frame
ffff88061deb1400, xid <0xecf>, remaining 393216, lso_max <0x10000>
[12284.451824] ft_queue_data_in: Failed to send frame
ffff88061deb1400, xid <0xecf>, remaining 327680, lso_max <0x10000>
[12284.451827] ft_queue_data_in: Failed to send frame
ffff88061deb1400, xid <0xecf>, remaining 262144, lso_max <0x10000>
[12284.451831] ft_queue_data_in: Failed to send frame
ffff88061deb1400, xid <0xecf>, remaining 196608, lso_max <0x10000>
[12284.451834] ft_queue_data_in: Failed to send frame
ffff88061deb1400, xid <0xecf>, remaining 131072, lso_max <0x10000>
[12347.503478] ft_queue_data_in: 2 callbacks suppressed
[12347.503486] ft_queue_data_in: Failed to send frame
ffff8806142bc800, xid <0xb4f>, remaining 458752, lso_max <0x10000>
[12347.503492] ft_queue_data_in: Failed to send frame
ffff8806142bc800, xid <0xb4f>, remaining 393216, lso_max <0x10000>
[12347.503496] ft_queue_data_in: Failed to send frame
ffff8806142bc800, xid <0xb4f>, remaining 327680, lso_max <0x10000>
[12347.503517] ft_queue_data_in: Failed to send frame
ffff8806142bc800, xid <0xb4f>, remaining 262144, lso_max <0x10000>
[12378.402412] ft_queue_data_in: Failed to send frame
ffff88062ddeac00, xid <0x6a5>, remaining 458752, lso_max <0x10000>
[12378.402420] ft_queue_data_in: Failed to send frame
ffff88062ddeac00, xid <0x6a5>, remaining 393216, lso_max <0x10000>
[12378.402425] ft_queue_data_in: Failed to send frame
ffff88062ddeac00, xid <0x6a5>, remaining 327680, lso_max <0x10000>
[12378.402428] ft_queue_data_in: Failed to send frame
ffff88062ddeac00, xid <0x6a5>, remaining 262144, lso_max <0x10000>
[12378.402432] ft_queue_data_in: Failed to send frame
ffff88062ddeac00, xid <0x6a5>, remaining 196608, lso_max <0x10000>
[12378.402436] ft_queue_data_in: Failed to send frame
ffff88062ddeac00, xid <0x6a5>, remaining 131072, lso_max <0x10000>
[12378.402440] ft_queue_data_in: Failed to send frame
ffff88062ddeac00, xid <0x6a5>, remaining 65536, lso_max <0x10000>
[12378.402444] ft_queue_data_in: Failed to send frame
ffff88062ddeac00, xid <0x6a5>, remaining 0, lso_max <0x10000>
[13049.224513] ft_queue_data_in: Failed to send frame
ffff880614588c00, xid <0xd2f>, remaining 196608, lso_max <0x10000>
[13049.224524] ft_queue_data_in: Failed to send frame
ffff880614588c00, xid <0xd2f>, remaining 131072, lso_max <0x10000>
[13049.224528] ft_queue_data_in: Failed to send frame
ffff880614588c00, xid <0xd2f>, remaining 65536, lso_max <0x10000>
[13049.224532] ft_queue_data_in: Failed to send frame
ffff880614588c00, xid <0xd2f>, remaining 0, lso_max <0x10000>
[13052.511306] ft_queue_data_in: Failed to send frame
ffff88062d49f000, xid <0x8ae>, remaining 196608, lso_max <0x10000>
[13052.511313] ft_queue_data_in: Failed to send frame
ffff88062d49f000, xid <0x8ae>, remaining 131072, lso_max <0x10000>
[13052.511317] ft_queue_data_in: Failed to send frame
ffff88062d49f000, xid <0x8ae>, remaining 65536, lso_max <0x10000>
[13052.511321] ft_queue_data_in: Failed to send frame
ffff88062d49f000, xid <0x8ae>, remaining 0, lso_max <0x10000>
[13087.976748] ft_queue_data_in: Failed to send frame
ffff88031afc9c00, xid <0x96b>, remaining 458752, lso_max <0x10000>
[13087.998453] ft_queue_data_in: Failed to send frame
ffff88032c881200, xid <0xb23>, remaining 458752, lso_max <0x10000>
[13087.998459] ft_queue_data_in: Failed to send frame
ffff88032c881200, xid <0xb23>, remaining 393216, lso_max <0x10000>
[13087.998463] ft_queue_data_in: Failed to send frame
ffff88032c881200, xid <0xb23>, remaining 327680, lso_max <0x10000>
[13087.998467] ft_queue_data_in: Failed to send frame
ffff88032c881200, xid <0xb23>, remaining 262144, lso_max <0x10000>
[13087.998470] ft_queue_data_in: Failed to send frame
ffff88032c881200, xid <0xb23>, remaining 196608, lso_max <0x10000>
[13087.998474] ft_queue_data_in: Failed to send frame
ffff88032c881200, xid <0xb23>, remaining 131072, lso_max <0x10000>
[13087.998478] ft_queue_data_in: Failed to send frame
ffff88032c881200, xid <0xb23>, remaining 65536, lso_max <0x10000>
[13087.998482] ft_queue_data_in: Failed to send frame
ffff88032c881200, xid <0xb23>, remaining 0, lso_max <0x10000>
[13119.177286] ft_queue_data_in: Failed to send frame
ffff88062dff7400, xid <0xfcf>, remaining 458752, lso_max <0x10000>
[13119.177297] ft_queue_data_in: Failed to send frame
ffff88062dff7400, xid <0xfcf>, remaining 393216, lso_max <0x10000>
[13119.177302] ft_queue_data_in: Failed to send frame
ffff88062dff7400, xid <0xfcf>, remaining 327680, lso_max <0x10000>
[13119.177307] ft_queue_data_in: Failed to send frame
ffff88062dff7400, xid <0xfcf>, remaining 262144, lso_max <0x10000>
[13119.177311] ft_queue_data_in: Failed to send frame
ffff88062dff7400, xid <0xfcf>, remaining 196608, lso_max <0x10000>
[13119.177316] ft_queue_data_in: Failed to send frame
ffff88062dff7400, xid <0xfcf>, remaining 131072, lso_max <0x10000>
[13119.177321] ft_queue_data_in: Failed to send frame
ffff88062dff7400, xid <0xfcf>, remaining 65536, lso_max <0x10000>
[13119.177325] ft_queue_data_in: Failed to send frame
ffff88062dff7400, xid <0xfcf>, remaining 0, lso_max <0x10000>
[13122.335322] ------------[ cut here ]------------
[13122.335336] WARNING: CPU: 6 PID: 2165 at
include/scsi/fc_frame.h:173 fcoe_percpu_receive_thread+0x507/0x53c
[fcoe]()
[13122.335338] Modules linked in: async_memcpy async_xor xor async_tx
fcoe libfcoe tcm_fc libfc scsi_transport_fc scsi_tgt target_core_pscsi
target_core_file target_core_iblock iscsi_target_mod target_core_mod
8021q garp mrp bridge stp llc iTCO_wdt gpio_ich iTCO_vendor_support
coretemp kvm_intel kvm crc32c_intel microcode serio_raw i2c_i801
lpc_ich mfd_core ses enclosure i7core_edac ioatdma edac_core shpchp
acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd sunrpc radeon
drm_kms_helper ttm drm ixgbe igb ata_generic mdio pata_acpi ptp
pata_jmicron pps_core i2c_algo_bit aacraid dca i2c_core [last
unloaded: vd]
[13122.335390] CPU: 6 PID: 2165 Comm: fcoethread/6 Tainted: GF
O 3.13.10-200.zbfcoepatch.fc20.x86_64 #1
[13122.335392] Hardware name: Supermicro X8DTN/X8DTN, BIOS 2.1c 10/28/2011
[13122.335394] 0000000000000009 ffff88062b04bdd0 ffffffff81687eac
0000000000000000
[13122.335400] ffff88062b04be08 ffffffff8106d4dd ffffe8ffffc41748
ffff88062a444700
[13122.335404] ffff8800b7e926e8 0000000000000002 ffff88062b04be88
ffff88062b04be18
[13122.335419] [<ffffffff81687eac>] dump_stack+0x45/0x56
[13122.335426] [<ffffffff8106d4dd>] warn_slowpath_common+0x7d/0xa0
[13122.335430] [<ffffffff8106d5ba>] warn_slowpath_null+0x1a/0x20
[13122.335435] [<ffffffffa0651517>]
fcoe_percpu_receive_thread+0x507/0x53c [fcoe]
[13122.335440] [<ffffffffa0651010>] ? fcoe_set_port_id+0x50/0x50 [fcoe]
[13122.335446] [<ffffffff8108f2f2>] kthread+0xd2/0xf0
[13122.335450] [<ffffffff8108f220>] ? insert_kthread_work+0x40/0x40
[13122.335458] [<ffffffff81696dbc>] ret_from_fork+0x7c/0xb0
[13122.335461] [<ffffffff8108f220>] ? insert_kthread_work+0x40/0x40
[13122.335464] ---[ end trace e4509e1053f499ac ]---
Thanks,
Jun
On Tue, May 20, 2014 at 11:03 AM, Nicholas A. Bellinger
Post by Nicholas A. Bellinger
Post by Jun Wu
Hi Nicholas,
We downloaded the source of our running kernel (3.13.10-200) and
applied your percpu-ida pre-allocation regression fix, then compiled
and installed the kernel. I repeated the same test three times,
running 10 fio sessions to 10 drives on the target through fcoe vn2vn.
In the first two tests, the target machine hung with the following messages:
Failed to send frame ffff880c0b188200, xid <0x2a5>, remaining 196608,
lso_max <0x10000>
Failed to send frame ffff880c0b188200, xid <0x2a5>, remaining 131072,
lso_max <0x10000>
Failed to send frame ffff880c0b188200, xid <0x2a5>, remaining 65536,
lso_max <0x10000>
Failed to send frame ffff880c0b188200, xid <0x2a5>, remaining 0,
lso_max <0x10000>
Failed to send frame ffff880c1d1df000, xid <0x305>, remaining 196608,
lso_max <0x10000>
Failed to send frame ffff880c1d1df000, xid <0x305>, remaining 131072,
lso_max <0x10000>
Failed to send frame ffff880c1d1df000, xid <0x305>, remaining 65536,
lso_max <0x10000>
Failed to send frame ffff880c1d1df000, xid <0x305>, remaining 0,
lso_max <0x10000>
Failed to send frame ffff88060cd40800, xid <0x3cb>, remaining 458752,
lso_max <0x10000>
Failed to send frame ffff88060cd40800, xid <0x3cb>, remaining 393216,
lso_max <0x10000>
Failed to send frame ffff88060cd40800, xid <0x3cb>, remaining 327680,
lso_max <0x10000>
Failed to send frame ffff88060cd40800, xid <0x3cb>, remaining 262144,
lso_max <0x10000>
Failed to send frame ffff88060cd40800, xid <0x3cb>, remaining 196608,
lso_max <0x10000>
Failed to send frame ffff88060cd40800, xid <0x3cb>, remaining 131072,
lso_max <0x10000>
Failed to send frame ffff88060cd40800, xid <0x3cb>, remaining 65536,
lso_max <0x10000>
Failed to send frame ffff88060cd40800, xid <0x3cb>, remaining 0,
lso_max <0x10000>
Failed to send frame ffff880c0b24ca00, xid <0xea6>, remaining 196608,
lso_max <0x10000>
Failed to send frame ffff880c0b24ca00, xid <0xea6>, remaining 131072,
lso_max <0x10000>
15249 May 19 11:51:12 poc1 kernel: [ 1178.698434] ft_queue_data_in: 6
callbacks suppressed
Failed to send frame ffff88060ba97400, xid <0xb8a>, remaining 458752,
lso_max <0x10000>
Failed to send frame ffff88060ba97400, xid <0xb8a>, remaining 393216,
lso_max <0x10000>
Failed to send frame ffff88060ba97400, xid <0xb8a>, remaining 327680,
lso_max <0x10000>
Failed to send frame ffff88060ba97400, xid <0xb8a>, remaining 262144,
lso_max <0x10000>
Failed to send frame ffff88060ba97400, xid <0xb8a>, remaining 196608,
lso_max <0x10000>
Failed to send frame ffff88060ba97400, xid <0xb8a>, remaining 131072,
lso_max <0x10000>
Failed to send frame ffff88060ba97400, xid <0xb8a>, remaining 65536,
lso_max <0x10000>
Failed to send frame ffff88060ba97400, xid <0xb8a>, remaining 0,
lso_max <0x10000>
The call into the lport->tt.seq_send() libfc code is failing to send
outgoing solicited data-in. From the output, note that the LSO (large
segment offload, aka TCP segmentation offload) feature has been enabled
by the underlying NIC hardware.
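As a sanity check of the numbers in those messages: the "remaining" byte count steps down by lso_max (0x10000 = 65536) per frame, so one 458752-byte (7 x 64 KiB) data-in payload whose every chunk send fails produces exactly the runs seen in the logs. This is just shell arithmetic reproducing the sequence, not kernel code:

```shell
# Reproduce the "remaining" countdown from the ft_queue_data_in messages:
# the payload is sent in lso_max-sized (64 KiB) chunks, so each failed
# frame logs a value 65536 lower than the previous one, down to 0.
lso_max=$((0x10000))
remaining=458752
while [ "$remaining" -ge 0 ]; do
    echo "remaining $remaining, lso_max 65536"
    remaining=$((remaining - lso_max))
done
```

The eight lines it prints (458752, 393216, ..., 65536, 0) match the values in the log runs above.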
A couple of things to try:
- Disabling hardware offloads on both initiator and target sides (LRO +
LSO) using ethtool -K
- Disabling any jumbo frame settings on either side
Are there any other non-standard network and/or switch settings in
place? Also, please confirm what your NIC + switch setup looks like.
Rob & Open-FCoE folks, is there anything else to take into consideration
here?
Post by Jun Wu
I didn't see the previous message "unable to handle kernel NULL
pointer dereference at 0000000000000048". So it must have been fixed
by your change.
Thanks for confirming that bit.
--nab
_______________________________________________
fcoe-devel mailing list
http://lists.open-fcoe.org/mailman/listinfo/fcoe-devel
Jun Wu
2014-05-21 22:23:41 UTC
Permalink
On the initiator side, there were also a lot of messages:

May 21 11:01:52 poc1 kernel: [ 3374.393864] host7: fcp: 00061e:
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
May 21 11:01:52 poc1 kernel: [ 3374.396149] host7: xid fc1: Exchange
timer canceled
May 21 11:01:52 poc1 kernel: [ 3374.396155] host7: fcp: 00061e:
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
May 21 11:01:52 poc1 kernel: [ 3374.397069] host7: xid 602: Exchange
timer canceled
May 21 11:01:52 poc1 kernel: [ 3374.397075] host7: fcp: 00061e:
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
May 21 11:01:52 poc1 kernel: [ 3374.397482] host7: xid a46: Exchange
timer armed : 8000 msecs
May 21 11:01:52 poc1 kernel: [ 3374.398443] host7: xid 602: Exchange
timer armed : 8000 msecs
May 21 11:01:52 poc1 kernel: [ 3374.398498] host7: xid 6ce: Exchange
timer armed : 8000 msecs
May 21 11:01:52 poc1 kernel: [ 3374.398863] host7: xid 6ce: Exchange
timer canceled
May 21 11:01:52 poc1 kernel: [ 3374.398869] host7: fcp: 00061e:
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
May 21 11:01:52 poc1 kernel: [ 3374.399449] host7: xid 6a2: Exchange
timer armed : 8000 msecs
May 21 11:01:52 poc1 kernel: [ 3374.399476] host7: xid 3e1: Exchange
timer armed : 8000 msecs
May 21 11:01:52 poc1 kernel: [ 3374.399486] host7: xid dcd: Exchange
timer armed : 8000 msecs
May 21 11:01:52 poc1 kernel: [ 3374.399866] host7: xid dcd: Exchange
timer canceled
May 21 11:01:52 poc1 kernel: [ 3374.399872] host7: fcp: 00061e:
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
May 21 11:01:52 poc1 kernel: [ 3374.400472] host7: xid d6c: Exchange
timer armed : 8000 msecs
May 21 11:01:52 poc1 kernel: [ 3374.401442] host7: xid 585: Exchange
timer canceled
May 21 11:01:52 poc1 kernel: [ 3374.401453] host7: fcp: 00061e:
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
May 21 11:01:52 poc1 kernel: [ 3374.402457] host7: xid 585: Exchange
timer armed : 8000 msecs
May 21 11:01:52 poc1 kernel: [ 3374.402480] host7: xid 742: Exchange
timer armed : 8000 msecs
May 21 11:01:52 poc1 kernel: [ 3374.403602] host7: xid 4a7: Exchange
timer canceled
May 21 11:01:52 poc1 kernel: [ 3374.403607] host7: fcp: 00061e:
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
May 21 11:01:52 poc1 kernel: [ 3374.403674] host7: xid 907: Exchange
timer canceled
May 21 11:01:52 poc1 kernel: [ 3374.403678] host7: fcp: 00061e:
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
May 21 11:01:52 poc1 kernel: [ 3374.404276] host7: fcp: 00061e: xid
0084-08c2: DDP I/O in fc_fcp_recv_data set ERROR
May 21 11:01:52 poc1 kernel: [ 3374.404281] host7: xid 84: f_ctl 90000 seq 1
May 21 11:01:52 poc1 kernel: [ 3374.404492] host7: xid dac: Exchange
timer armed : 8000 msecs
May 21 11:01:52 poc1 kernel: [ 3374.404506] host7: xid 70a: Exchange
timer armed : 8000 msecs
May 21 11:01:52 poc1 kernel: [ 3374.405384] host7: xid 585: Exchange
timer canceled
May 21 11:01:52 poc1 kernel: [ 3374.405390] host7: fcp: 00061e:
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)

More and more "fc_fcp_recv_data set ERROR" messages show up in the file later:

May 21 11:01:53 poc1 kernel: [ 3374.614142] host7: fcp: 00061e: xid
01ed-0925: DDP I/O in fc_fcp_recv_data set ERROR
May 21 11:01:53 poc1 kernel: [ 3374.614147] host7: xid 1ed: f_ctl 90000 seq 1
May 21 11:01:53 poc1 kernel: [ 3374.616658] host7: xid 4e7: Exchange
timer armed : 8000 msecs
May 21 11:01:53 poc1 kernel: [ 3374.617657] host7: xid 54a: Exchange
timer armed : 8000 msecs
May 21 11:01:53 poc1 kernel: [ 3374.617958] host7: xid 54a: Exchange
timer canceled
May 21 11:01:53 poc1 kernel: [ 3374.617965] host7: fcp: 00061e:
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
May 21 11:01:53 poc1 kernel: [ 3374.619656] host7: xid ded: Exchange
timer armed : 8000 msecs
May 21 11:01:53 poc1 kernel: [ 3374.620627] host7: xid 489: Exchange
timer armed : 8000 msecs
May 21 11:01:53 poc1 kernel: [ 3374.620744] host7: fcp: 00061e: xid
01cd-038e: DDP I/O in fc_fcp_recv_data set ERROR
May 21 11:01:53 poc1 kernel: [ 3374.620747] host7: xid 1cd: f_ctl 90000 seq 1
May 21 11:01:53 poc1 kernel: [ 3374.621078] host7: xid 1cd: BLS rctl
85 - BLS reject received
May 21 11:01:53 poc1 kernel: [ 3374.621389] host7: xid 3c1: Exchange
timer canceled
May 21 11:01:53 poc1 kernel: [ 3374.621395] host7: fcp: 00061e:
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
May 21 11:01:53 poc1 kernel: [ 3374.621622] host7: xid a64: Exchange
timer armed : 8000 msecs
May 21 11:01:53 poc1 kernel: [ 3374.622056] host7: fcp: 00061e: xid
0107-06e4: DDP I/O in fc_fcp_recv_data set ERROR
May 21 11:01:53 poc1 kernel: [ 3374.622060] host7: xid 107: f_ctl 90000 seq 1
May 21 11:01:53 poc1 kernel: [ 3374.622370] host7: xid cc: exch: BLS
rctl 84 - BLS accept
May 21 11:01:53 poc1 kernel: [ 3374.622381] host7: fcp: 00061e:
Returning DID_ERROR to scsi-ml due to FC_CMD_ABORTED
May 21 11:01:53 poc1 kernel: [ 3374.622491] host7: fcp: 00061e: xid
00a5-0862: DDP I/O in fc_fcp_recv_data set ERROR
May 21 11:01:53 poc1 kernel: [ 3374.622496] host7: xid a5: f_ctl 90000 seq 1
May 21 11:01:53 poc1 kernel: [ 3374.622866] host7: xid b86: Exchange
timer canceled
May 21 11:01:53 poc1 kernel: [ 3374.622870] host7: xid 1e6: f_ctl 90000 seq 1
May 21 11:01:53 poc1 kernel: [ 3374.622889] host7: xid dcd: Exchange
timer canceled
May 21 11:01:53 poc1 kernel: [ 3374.622897] host7: fcp: 00061e:
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
May 21 11:01:53 poc1 kernel: [ 3374.623491] host7: xid 1e6: BLS rctl
85 - BLS reject received
May 21 11:01:53 poc1 kernel: [ 3374.623637] host7: xid 4a0: Exchange
timer canceled
May 21 11:01:53 poc1 kernel: [ 3374.623719] host7: fcp: 00061e: xid
012d-0869: DDP I/O in fc_fcp_recv_data set ERROR
May 21 11:01:53 poc1 kernel: [ 3374.623724] host7: xid 12d: f_ctl 90000 seq 1
May 21 11:01:53 poc1 kernel: [ 3374.624626] host7: xid fa4: Exchange
timer armed : 8000 msecs
May 21 11:01:53 poc1 kernel: [ 3374.624824] host7: xid fa4: Exchange
timer canceled
May 21 11:01:53 poc1 kernel: [ 3374.624829] host7: fcp: 00061e:
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
May 21 11:01:53 poc1 kernel: [ 3374.625133] host7: fcp: 00061e: xid
00a4-0e4f: DDP I/O in fc_fcp_recv_data set ERROR
May 21 11:01:53 poc1 kernel: [ 3374.625137] host7: xid a4: f_ctl 90000 seq 1
May 21 11:01:53 poc1 kernel: [ 3374.625404] host7: xid a4: BLS rctl
85 - BLS reject received
May 21 11:01:53 poc1 kernel: [ 3374.625536] host7: fcp: 00061e: xid
01e7-05e7: DDP I/O in fc_fcp_recv_data set ERROR
May 21 11:01:53 poc1 kernel: [ 3374.625540] host7: xid 1e7: f_ctl 90000 seq 1
May 21 11:01:53 poc1 kernel: [ 3374.625637] host7: xid fa4: Exchange
timer armed : 8000 msecs
May 21 11:01:53 poc1 kernel: [ 3374.625842] host7: xid 1e7: BLS rctl
85 - BLS reject received
May 21 11:01:53 poc1 kernel: [ 3374.626210] host7: fcp: 00061e: xid
01a7-0666: DDP I/O in fc_fcp_recv_data set ERROR
May 21 11:01:53 poc1 kernel: [ 3374.626213] host7: xid 1a7: f_ctl 90000 seq 1
May 21 11:01:53 poc1 kernel: [ 3374.626581] host7: xid 102: exch: BLS
rctl 84 - BLS accept
May 21 11:01:53 poc1 kernel: [ 3374.626590] host7: fcp: 00061e:
Returning DID_ERROR to scsi-ml due to FC_CMD_ABORTED
May 21 11:01:53 poc1 kernel: [ 3374.626602] host7: fcp: 00061e: xid
01c7-0862: DDP I/O in fc_fcp_recv_data set ERROR
Post by Jun Wu
I re-enabled the Tx offloads and set the debug_logging parameters as
suggested. A few minutes of running generated a 1.5 GB messages file.
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
Exchange timer armed : 0 msecs
f_ctl 800000 seq 1
f_ctl 800000 seq 2
Exchange timed out
Failed to send frame ffff8805bf245e00, xid <0x6e4>, remaining 458752,
lso_max <0x10000>
Failed to send frame ffff8805bf245e00, xid <0x6e4>, remaining 393216,
lso_max <0x10000>
Failed to send frame ffff8805bf245e00, xid <0x6e4>, remaining 327680,
lso_max <0x10000>
Failed to send frame ffff8805bf245e00, xid <0x6e4>, remaining 262144,
lso_max <0x10000>
Failed to send frame ffff8805bf245e00, xid <0x6e4>, remaining 196608,
lso_max <0x10000>
Failed to send frame ffff8805bf245e00, xid <0x6e4>, remaining 131072,
lso_max <0x10000>
Failed to send frame ffff8805bf245e00, xid <0x6e4>, remaining 65536,
lso_max <0x10000>
Failed to send frame ffff8805bf245e00, xid <0x6e4>, remaining 0,
lso_max <0x10000>
f_ctl 800000 seq 1
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 880008 seq 2
f_ctl 800000 seq 1
Above is the first time the ft_queue_data_in message shows up in the file.
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
Exchange timer armed : 0 msecs
f_ctl 800000 seq 1
Exchange timed out
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 2
Failed to send frame ffff8802e25c1e00, xid <0x74a>, remaining 196608,
lso_max <0x10000>
Failed to send frame ffff8802e25c1e00, xid <0x74a>, remaining 131072,
lso_max <0x10000>
f_ctl 800000 seq 3
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
Exchange timer armed : 0 msecs
f_ctl 800000 seq 1
Exchange timed out
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
Exchange timer armed : 0 msecs
f_ctl 800000 seq 1
Exchange timed out
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
After stripping out the repeated "host7: xid xxx: f_ctl 800000 seq
x" lines, the remaining messages are:
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Failed to send frame ffff8805bf245e00, xid <0x6e4>, remaining 458752,
lso_max <0x10000>
Failed to send frame ffff8805bf245e00, xid <0x6e4>, remaining 393216,
lso_max <0x10000>
Failed to send frame ffff8805bf245e00, xid <0x6e4>, remaining 327680,
lso_max <0x10000>
Failed to send frame ffff8805bf245e00, xid <0x6e4>, remaining 262144,
lso_max <0x10000>
Failed to send frame ffff8805bf245e00, xid <0x6e4>, remaining 196608,
lso_max <0x10000>
Failed to send frame ffff8805bf245e00, xid <0x6e4>, remaining 131072,
lso_max <0x10000>
Failed to send frame ffff8805bf245e00, xid <0x6e4>, remaining 65536,
lso_max <0x10000>
Failed to send frame ffff8805bf245e00, xid <0x6e4>, remaining 0,
lso_max <0x10000>
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Failed to send frame ffff8802e25c1e00, xid <0x74a>, remaining 196608,
lso_max <0x10000>
Failed to send frame ffff8802e25c1e00, xid <0x74a>, remaining 131072,
lso_max <0x10000>
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 2000 msecs
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 10000 msecs
f_ctl 90000 seq 1
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 10000 msecs
f_ctl 90000 seq 2
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
10 callbacks suppressed
Failed to send frame ffff8800ba6e6400, xid <0xc21>, remaining 458752,
lso_max <0x10000>
Failed to send frame ffff8800ba6e6400, xid <0xc21>, remaining 393216,
lso_max <0x10000>
Failed to send frame ffff8800ba6e6400, xid <0xc21>, remaining 327680,
lso_max <0x10000>
Failed to send frame ffff8800ba6e6400, xid <0xc21>, remaining 262144,
lso_max <0x10000>
Exchange timed out
Failed to send frame ffff8800ba6e6400, xid <0xc21>, remaining 196608,
lso_max <0x10000>
Failed to send frame ffff8800ba6e6400, xid <0xc21>, remaining 131072,
lso_max <0x10000>
Failed to send frame ffff8800ba6e6400, xid <0xc21>, remaining 65536,
lso_max <0x10000>
Failed to send frame ffff8800ba6e6400, xid <0xc21>, remaining 0,
lso_max <0x10000>
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Thanks
ffff880614588c00, xid <0xd2f>, remaining 131072, lso_max <0x10000>
[13049.224528] ft_queue_data_in: Failed to send frame
ffff880614588c00, xid <0xd2f>, remaining 65536, lso_max <0x10000>
[13049.224532] ft_queue_data_in: Failed to send frame
ffff880614588c00, xid <0xd2f>, remaining 0, lso_max <0x10000>
[13052.511306] ft_queue_data_in: Failed to send frame
ffff88062d49f000, xid <0x8ae>, remaining 196608, lso_max <0x10000>
[13052.511313] ft_queue_data_in: Failed to send frame
ffff88062d49f000, xid <0x8ae>, remaining 131072, lso_max <0x10000>
[13052.511317] ft_queue_data_in: Failed to send frame
ffff88062d49f000, xid <0x8ae>, remaining 65536, lso_max <0x10000>
[13052.511321] ft_queue_data_in: Failed to send frame
ffff88062d49f000, xid <0x8ae>, remaining 0, lso_max <0x10000>
[13087.976748] ft_queue_data_in: Failed to send frame
ffff88031afc9c00, xid <0x96b>, remaining 458752, lso_max <0x10000>
[13087.998453] ft_queue_data_in: Failed to send frame
ffff88032c881200, xid <0xb23>, remaining 458752, lso_max <0x10000>
[13087.998459] ft_queue_data_in: Failed to send frame
ffff88032c881200, xid <0xb23>, remaining 393216, lso_max <0x10000>
[13087.998463] ft_queue_data_in: Failed to send frame
ffff88032c881200, xid <0xb23>, remaining 327680, lso_max <0x10000>
[13087.998467] ft_queue_data_in: Failed to send frame
ffff88032c881200, xid <0xb23>, remaining 262144, lso_max <0x10000>
[13087.998470] ft_queue_data_in: Failed to send frame
ffff88032c881200, xid <0xb23>, remaining 196608, lso_max <0x10000>
[13087.998474] ft_queue_data_in: Failed to send frame
ffff88032c881200, xid <0xb23>, remaining 131072, lso_max <0x10000>
[13087.998478] ft_queue_data_in: Failed to send frame
ffff88032c881200, xid <0xb23>, remaining 65536, lso_max <0x10000>
[13087.998482] ft_queue_data_in: Failed to send frame
ffff88032c881200, xid <0xb23>, remaining 0, lso_max <0x10000>
[13119.177286] ft_queue_data_in: Failed to send frame
ffff88062dff7400, xid <0xfcf>, remaining 458752, lso_max <0x10000>
[13119.177297] ft_queue_data_in: Failed to send frame
ffff88062dff7400, xid <0xfcf>, remaining 393216, lso_max <0x10000>
[13119.177302] ft_queue_data_in: Failed to send frame
ffff88062dff7400, xid <0xfcf>, remaining 327680, lso_max <0x10000>
[13119.177307] ft_queue_data_in: Failed to send frame
ffff88062dff7400, xid <0xfcf>, remaining 262144, lso_max <0x10000>
[13119.177311] ft_queue_data_in: Failed to send frame
ffff88062dff7400, xid <0xfcf>, remaining 196608, lso_max <0x10000>
[13119.177316] ft_queue_data_in: Failed to send frame
ffff88062dff7400, xid <0xfcf>, remaining 131072, lso_max <0x10000>
[13119.177321] ft_queue_data_in: Failed to send frame
ffff88062dff7400, xid <0xfcf>, remaining 65536, lso_max <0x10000>
[13119.177325] ft_queue_data_in: Failed to send frame
ffff88062dff7400, xid <0xfcf>, remaining 0, lso_max <0x10000>
[13122.335322] ------------[ cut here ]------------
[13122.335336] WARNING: CPU: 6 PID: 2165 at
include/scsi/fc_frame.h:173 fcoe_percpu_receive_thread+0x507/0x53c
[fcoe]()
[13122.335338] Modules linked in: async_memcpy async_xor xor async_tx
fcoe libfcoe tcm_fc libfc scsi_transport_fc scsi_tgt target_core_pscsi
target_core_file target_core_iblock iscsi_target_mod target_core_mod
8021q garp mrp bridge stp llc iTCO_wdt gpio_ich iTCO_vendor_support
coretemp kvm_intel kvm crc32c_intel microcode serio_raw i2c_i801
lpc_ich mfd_core ses enclosure i7core_edac ioatdma edac_core shpchp
acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd sunrpc radeon
drm_kms_helper ttm drm ixgbe igb ata_generic mdio pata_acpi ptp
pata_jmicron pps_core i2c_algo_bit aacraid dca i2c_core [last
unloaded: vd]
[13122.335390] CPU: 6 PID: 2165 Comm: fcoethread/6 Tainted: GF
O 3.13.10-200.zbfcoepatch.fc20.x86_64 #1
[13122.335392] Hardware name: Supermicro X8DTN/X8DTN, BIOS 2.1c 10/28/2011
[13122.335394] 0000000000000009 ffff88062b04bdd0 ffffffff81687eac
0000000000000000
[13122.335400] ffff88062b04be08 ffffffff8106d4dd ffffe8ffffc41748
ffff88062a444700
[13122.335404] ffff8800b7e926e8 0000000000000002 ffff88062b04be88
ffff88062b04be18
[13122.335419] [<ffffffff81687eac>] dump_stack+0x45/0x56
[13122.335426] [<ffffffff8106d4dd>] warn_slowpath_common+0x7d/0xa0
[13122.335430] [<ffffffff8106d5ba>] warn_slowpath_null+0x1a/0x20
[13122.335435] [<ffffffffa0651517>]
fcoe_percpu_receive_thread+0x507/0x53c [fcoe]
[13122.335440] [<ffffffffa0651010>] ? fcoe_set_port_id+0x50/0x50 [fcoe]
[13122.335446] [<ffffffff8108f2f2>] kthread+0xd2/0xf0
[13122.335450] [<ffffffff8108f220>] ? insert_kthread_work+0x40/0x40
[13122.335458] [<ffffffff81696dbc>] ret_from_fork+0x7c/0xb0
[13122.335461] [<ffffffff8108f220>] ? insert_kthread_work+0x40/0x40
[13122.335464] ---[ end trace e4509e1053f499ac ]---
Thanks,
Jun
On Tue, May 20, 2014 at 11:03 AM, Nicholas A. Bellinger
Post by Nicholas A. Bellinger
Post by Jun Wu
Hi Nicholas,
We downloaded the source of our running kernel (3.13.10-200) and
applied your percpu-ida pre-allocation regression fix, then compiled
and installed the kernel. I repeated the same test three times,
running 10 fio sessions to 10 drives on the target through fcoe vn2vn.
In the first two tests, the target machine hung with the following messages:
Failed to send frame ffff880c0b188200, xid <0x2a5>, remaining 196608,
lso_max <0x10000>
Failed to send frame ffff880c0b188200, xid <0x2a5>, remaining 131072,
lso_max <0x10000>
Failed to send frame ffff880c0b188200, xid <0x2a5>, remaining 65536,
lso_max <0x10000>
Failed to send frame ffff880c0b188200, xid <0x2a5>, remaining 0,
lso_max <0x10000>
Failed to send frame ffff880c1d1df000, xid <0x305>, remaining 196608,
lso_max <0x10000>
Failed to send frame ffff880c1d1df000, xid <0x305>, remaining 131072,
lso_max <0x10000>
Failed to send frame ffff880c1d1df000, xid <0x305>, remaining 65536,
lso_max <0x10000>
Failed to send frame ffff880c1d1df000, xid <0x305>, remaining 0,
lso_max <0x10000>
Failed to send frame ffff88060cd40800, xid <0x3cb>, remaining 458752,
lso_max <0x10000>
Failed to send frame ffff88060cd40800, xid <0x3cb>, remaining 393216,
lso_max <0x10000>
Failed to send frame ffff88060cd40800, xid <0x3cb>, remaining 327680,
lso_max <0x10000>
Failed to send frame ffff88060cd40800, xid <0x3cb>, remaining 262144,
lso_max <0x10000>
Failed to send frame ffff88060cd40800, xid <0x3cb>, remaining 196608,
lso_max <0x10000>
Failed to send frame ffff88060cd40800, xid <0x3cb>, remaining 131072,
lso_max <0x10000>
Failed to send frame ffff88060cd40800, xid <0x3cb>, remaining 65536,
lso_max <0x10000>
Failed to send frame ffff88060cd40800, xid <0x3cb>, remaining 0,
lso_max <0x10000>
Failed to send frame ffff880c0b24ca00, xid <0xea6>, remaining 196608,
lso_max <0x10000>
Failed to send frame ffff880c0b24ca00, xid <0xea6>, remaining 131072,
lso_max <0x10000>
15249 May 19 11:51:12 poc1 kernel: [ 1178.698434] ft_queue_data_in: 6
callbacks suppressed
Failed to send frame ffff88060ba97400, xid <0xb8a>, remaining 458752,
lso_max <0x10000>
Failed to send frame ffff88060ba97400, xid <0xb8a>, remaining 393216,
lso_max <0x10000>
Failed to send frame ffff88060ba97400, xid <0xb8a>, remaining 327680,
lso_max <0x10000>
Failed to send frame ffff88060ba97400, xid <0xb8a>, remaining 262144,
lso_max <0x10000>
Failed to send frame ffff88060ba97400, xid <0xb8a>, remaining 196608,
lso_max <0x10000>
Failed to send frame ffff88060ba97400, xid <0xb8a>, remaining 131072,
lso_max <0x10000>
Failed to send frame ffff88060ba97400, xid <0xb8a>, remaining 65536,
lso_max <0x10000>
Failed to send frame ffff88060ba97400, xid <0xb8a>, remaining 0,
lso_max <0x10000>
The call into the lport->tt.seq_send() libfc code is failing to send
outgoing solicited data-in. From the output, note that the LSO (large
send offload, i.e. TCP segmentation offload) feature has been enabled
by the underlying NIC hardware. A couple of things to try:
- Disabling hardware offloads (LRO + LSO) on both the initiator and
target sides using ethtool -K
- Disabling any jumbo frame settings on either side
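Incidentally, the "remaining" counts in the log drop by exactly lso_max
(0x10000 = 65536 bytes) on each line until they reach 0, i.e. the
data-in is being carved into LSO-sized frames. A toy sketch of that
accounting (illustrative only, not the driver code):

```python
LSO_MAX = 0x10000  # 65536 bytes, the lso_max value in the log

def remaining_after_each_send(total_len, lso_max=LSO_MAX):
    """Yield the 'remaining' byte count after each LSO-sized frame,
    mirroring the counter printed by the ft_queue_data_in log lines."""
    remaining = total_len
    while remaining > 0:
        remaining -= min(remaining, lso_max)
        yield remaining

# A 458752-byte data-in goes out as seven 64 KiB frames:
print(list(remaining_after_each_send(458752)))
# → [393216, 327680, 262144, 196608, 131072, 65536, 0]
```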
Are there any other non-standard network and/or switch settings in
place? Also, please confirm what your NIC + switch setup looks like.
Rob & Open-FCoE folks, is there anything else to take into consideration
here?
Post by Jun Wu
I didn't see the previous message "unable to handle kernel NULL
pointer dereference at 0000000000000048". So it must have been fixed
by your change.
Thanks for confirming that bit.
--nab
_______________________________________________
fcoe-devel mailing list
http://lists.open-fcoe.org/mailman/listinfo/fcoe-devel
Vasu Dev
2014-05-23 16:23:58 UTC
Post by Jun Wu
Post by Jun Wu
18630751 May 21 11:01:53 poc2 kernel: [ 1528.191740] host7: xid
Exchange timer armed : 0 msecs
18630752 May 21 11:01:53 poc2 kernel: [ 1528.191747] host7: xid
f_ctl 800000 seq 1
18630753 May 21 11:01:53 poc2 kernel: [ 1528.191756] host7: xid
f_ctl 800000 seq 2
18630754 May 21 11:01:53 poc2 kernel: [ 1528.191763] host7: xid
Exchange timed out
18630755 May 21 11:01:53 poc2 kernel: [ 1528.191777]
Failed to send frame ffff8805bf245e00, xid <0x6e4>, remaining
458752,
Post by Jun Wu
lso_max <0x10000>
Exchange 0x6e4 is aborted and then the target is still sending frames.
The latter should not occur, but arming the abort with a 0 msec timeout
does not look correct either, and it differs from the 8000 ms used on
the initiator side. Are you using the same switch between initiator and
target, with DCB enabled?

Reducing retries could narrow down whether early aborts are the cause
here; can you try with REC disabled on the initiator side, using this
change?

diff --git a/drivers/scsi/libfc/fc_fcp.c b/drivers/scsi/libfc/fc_fcp.c
index 1d7e76e..43fe793 100644
--- a/drivers/scsi/libfc/fc_fcp.c
+++ b/drivers/scsi/libfc/fc_fcp.c
@@ -1160,6 +1160,7 @@ static int fc_fcp_cmd_send(struct fc_lport *lport, struct fc_fcp_pkt *fsp,
 	fc_fcp_pkt_hold(fsp);	/* hold for fc_fcp_pkt_destroy */
 
 	setup_timer(&fsp->timer, fc_fcp_timeout, (unsigned long)fsp);
+	rpriv->flags &= ~FC_RP_FLAGS_REC_SUPPORTED;
 	if (rpriv->flags & FC_RP_FLAGS_REC_SUPPORTED)
 		fc_fcp_timer_set(fsp, get_fsp_rec_tov(fsp));
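The one-liner works by clearing the REC-supported bit in rpriv->flags
before the check, so the REC timer is never armed from this path. The
bit manipulation itself looks like this (the flag values below are
illustrative placeholders, not the kernel's actual constants):

```python
FC_RP_FLAGS_REC_SUPPORTED = 0x1  # illustrative bit value, not the kernel constant
FC_RP_FLAGS_RETRY = 0x4          # hypothetical second flag, to show it survives

flags = FC_RP_FLAGS_REC_SUPPORTED | FC_RP_FLAGS_RETRY

# Equivalent of the added line: rpriv->flags &= ~FC_RP_FLAGS_REC_SUPPORTED;
flags &= ~FC_RP_FLAGS_REC_SUPPORTED

print(bool(flags & FC_RP_FLAGS_REC_SUPPORTED))  # → False: the REC timer branch is skipped
print(bool(flags & FC_RP_FLAGS_RETRY))          # → True: other flags are untouched
```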
Jun Wu
2014-05-27 19:44:10 UTC
We do not have DCB PFC, so the traffic is managed by PAUSE frames by
default. The kernel has been recompiled to include your change, and I
ran the same test again, i.e. 10 fio sessions from the initiator to 10
drives on the target. The target machine hung after a few minutes.

On the initiator side the messages look like:
4453 May 27 10:44:14 poc1 kernel: [ 1726.288288] host7: fcp: 00061e:
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
4454 May 27 10:44:14 poc1 kernel: [ 1726.292240] host7: fcp: 00061e:
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
4455 May 27 10:44:14 poc1 kernel: [ 1726.295246] host7: fcp: 00061e:
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
4456 May 27 10:44:14 poc1 kernel: [ 1726.296245] host7: fcp: 00061e:
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
4457 May 27 10:44:14 poc1 kernel: [ 1726.298249] host7: fcp: 00061e:
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
4458 May 27 10:44:14 poc1 kernel: [ 1726.300288] host7: fcp: 00061e:
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
4459 May 27 10:44:14 poc1 kernel: [ 1726.303268] host7: fcp: 00061e:
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
4460 May 27 10:44:14 poc1 kernel: [ 1726.303273] host7: fcp: 00061e:
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
4461 May 27 10:44:14 poc1 kernel: [ 1726.304297] host7: fcp: 00061e:
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
4462 May 27 10:44:14 poc1 kernel: [ 1726.306257] host7: fcp: 00061e:
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
4463 May 27 10:44:14 poc1 kernel: [ 1726.307293] host7: fcp: 00061e:
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
4464 May 27 10:44:14 poc1 kernel: [ 1726.307298] host7: fcp: 00061e:
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
4465 May 27 10:44:14 poc1 kernel: [ 1726.307301] host7: fcp: 00061e:
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
4466 May 27 10:44:14 poc1 kernel: [ 1726.310317] host7: fcp: 00061e:
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
4467 May 27 10:44:14 poc1 kernel: [ 1726.313257] host7: fcp: 00061e:
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
4468 May 27 10:44:14 poc1 kernel: [ 1726.315297] host7: fcp: 00061e:
xid 01cd-0a84: DDP I/O in fc_fcp_recv_data set ERROR
4469 May 27 10:44:14 poc1 kernel: [ 1726.315302] host7: xid 1cd:
f_ctl 90000 seq 1
4470 May 27 10:44:14 poc1 kernel: [ 1726.315666] host7: xid 1cd: BLS
rctl 85 - BLS reject received
4471 May 27 10:44:14 poc1 kernel: [ 1726.316295] host7: fcp: 00061e:
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
4472 May 27 10:44:14 poc1 kernel: [ 1726.317260] host7: fcp: 00061e:
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
4473 May 27 10:44:14 poc1 kernel: [ 1726.317266] host7: fcp: 00061e:
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
4474 May 27 10:44:14 poc1 kernel: [ 1726.317417] host7: fcp: 00061e:
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
4475 May 27 10:44:14 poc1 kernel: [ 1726.320303] host7: fcp: 00061e:
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
4476 May 27 10:44:14 poc1 kernel: [ 1726.321301] host7: fcp: 00061e:
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
4477 May 27 10:44:14 poc1 kernel: [ 1726.323264] host7: fcp: 00061e:
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
4478 May 27 10:44:14 poc1 kernel: [ 1726.323299] host7: fcp: 00061e:
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
4479 May 27 10:44:14 poc1 kernel: [ 1726.324307] host7: fcp: 00061e:
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
4480 May 27 10:44:14 poc1 kernel: [ 1726.327271] host7: fcp: 00061e:
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
4481 May 27 10:44:14 poc1 kernel: [ 1726.327301] host7: fcp: 00061e:
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
4482 May 27 10:44:14 poc1 kernel: [ 1726.329274] host7: fcp: 00061e:
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
4483 May 27 10:44:14 poc1 kernel: [ 1726.330306] host7: fcp: 00061e:
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
4484 May 27 10:44:14 poc1 kernel: [ 1726.331273] host7: fcp: 00061e:
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
4485 May 27 10:44:14 poc1 kernel: [ 1726.333270] host7: fcp: 00061e:
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
4486 May 27 10:44:14 poc1 kernel: [ 1726.333313] host7: fcp: 00061e:
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
4487 May 27 10:44:14 poc1 kernel: [ 1726.338276] host7: fcp: 00061e:
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
4488 May 27 10:44:14 poc1 kernel: [ 1726.338316] host7: fcp: 00061e:
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
4489 May 27 10:44:14 poc1 kernel: [ 1726.340320] host7: fcp: 00061e:
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
4490 May 27 10:44:14 poc1 kernel: [ 1726.341313] host7: fcp: 00061e:
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
4491 May 27 10:44:14 poc1 kernel: [ 1726.342286] host7: fcp: 00061e:
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
4492 May 27 10:44:14 poc1 kernel: [ 1726.343312] host7: fcp: 00061e:
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
4493 May 27 10:44:14 poc1 kernel: [ 1726.346324] host7: fcp: 00061e:
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
4494 May 27 10:44:14 poc1 kernel: [ 1726.347283] host7: fcp: 00061e:
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
4495 May 27 10:44:14 poc1 kernel: [ 1726.350323] host7: fcp: 00061e:
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
4496 May 27 10:44:14 poc1 kernel: [ 1726.350326] host7: fcp: 00061e:
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
4497 May 27 10:44:14 poc1 kernel: [ 1726.352288] host7: fcp: 00061e:
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
4498 May 27 10:44:14 poc1 kernel: [ 1726.352289] host7: fcp: 00061e:
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
4499 May 27 10:44:14 poc1 kernel: [ 1726.352942] host7: fcp: 00061e:
xid 01ee-0c48: DDP I/O in fc_fcp_recv_data set ERROR
4500 May 27 10:44:14 poc1 kernel: [ 1726.352946] host7: xid 1ee:
f_ctl 90000 seq 1
4501 May 27 10:44:14 poc1 kernel: [ 1726.353288] host7: fcp: 00061e:
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
4502 May 27 10:44:14 poc1 kernel: [ 1726.353320] host7: fcp: 00061e:
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)

I see far fewer Exchange timer messages than last time. Here are all
the instances:

6521 May 27 10:44:46 poc1 kernel: [ 1757.762301] host7: fcp: 00061e:
xid 0181-090d: DDP I/O in fc_fcp_recv_data set ERROR
6522 May 27 10:44:46 poc1 kernel: [ 1757.762306] host7: xid 181:
f_ctl 90000 seq 1
6523 May 27 10:44:46 poc1 kernel: [ 1757.763257] host7: fcp: 00061e:
xid 0005-0327: DDP I/O in fc_fcp_recv_data set ERROR
6524 May 27 10:44:46 poc1 kernel: [ 1757.763261] host7: xid 5:
f_ctl 90000 seq 1
6525 May 27 10:44:46 poc1 kernel: [ 1757.763347] host7: fcp: 00061e:
xid 0060-0dce: DDP I/O in fc_fcp_recv_data set ERROR
6526 May 27 10:44:46 poc1 kernel: [ 1757.763351] host7: xid 60:
f_ctl 90000 seq 1
6527 May 27 10:44:46 poc1 kernel: [ 1757.763617] host7: xid 12c:
exch: BLS rctl 84 - BLS accept
6528 May 27 10:44:46 poc1 kernel: [ 1757.763627] host7: fcp: 00061e:
Returning DID_ERROR to scsi-ml due to FC_CMD_ABORTED
6529 May 27 10:44:46 poc1 kernel: [ 1757.763632] host7: xid 12c:
Exchange timer armed : 10000 msecs
6530 May 27 10:44:46 poc1 kernel: [ 1757.763813] host7: fcp: 00061e:
xid 00cc-090d: DDP I/O in fc_fcp_recv_data set ERROR
6531 May 27 10:44:46 poc1 kernel: [ 1757.763817] host7: xid cc:
f_ctl 90000 seq 1
6532 May 27 10:44:46 poc1 kernel: [ 1757.763983] host7: xid 60: BLS
rctl 85 - BLS reject received
6533 May 27 10:44:46 poc1 kernel: [ 1757.765447] host7: xid 14d: BLS
rctl 85 - BLS reject received
6534 May 27 10:44:46 poc1 kernel: [ 1757.766258] host7: xid 18b: BLS
rctl 85 - BLS reject received
6535 May 27 10:44:46 poc1 kernel: [ 1757.767116] host7: xid ed: BLS
rctl 85 - BLS reject received
6536 May 27 10:44:46 poc1 kernel: [ 1757.767144] host7: xid 21: BLS
rctl 85 - BLS reject received
6537 May 27 10:44:46 poc1 kernel: [ 1757.768161] host7: xid 122: BLS
rctl 85 - BLS reject received
6538 May 27 10:44:46 poc1 kernel: [ 1757.768184] host7: xid 45: BLS
rctl 85 - BLS reject received
6539 May 27 10:44:46 poc1 kernel: [ 1757.768437] host7: xid 16f: BLS
rctl 85 - BLS reject received
6540 May 27 10:44:46 poc1 kernel: [ 1757.768880] host7: xid 14c: BLS
rctl 85 - BLS reject received
6541 May 27 10:44:46 poc1 kernel: [ 1757.772813] host7: xid 8a: BLS
rctl 85 - BLS reject received
6542 May 27 10:44:46 poc1 kernel: [ 1757.778209] host7: xid 181: BLS
rctl 85 - BLS reject received
6543 May 27 10:44:46 poc1 kernel: [ 1757.778226] host7: xid 16b: BLS
rctl 85 - BLS reject received
6544 May 27 10:44:46 poc1 kernel: [ 1757.778905] host7: xid cc: BLS
rctl 85 - BLS reject received
6545 May 27 10:44:56 poc1 kernel: [ 1767.793465] host7: xid 12c:
Exchange timed out
6546 May 27 10:44:56 poc1 kernel: [ 1767.793478] host7: xid e65:
Exchange timer armed : 2000 msecs
6547 May 27 10:44:56 poc1 kernel: [ 1767.793861] host7: xid e65:
Exchange timer canceled
6548 May 27 10:44:56 poc1 kernel: [ 1767.793866] host7: xid 12c:
LS_RJT for RRQ
6549 May 27 10:45:16 poc1 kernel: [ 1788.033856] host7: scsi:
Resetting rport (00061e)
6550 May 27 10:45:16 poc1 kernel: [ 1788.034071] host7: scsi: lun
reset to lun 0 completed

and

7468 May 27 10:45:17 poc1 kernel: [ 1789.288870] host7: fcp: 00061e:
xid 06ae-0985: fsp=ffff880c0e176700:lso:blen=1000 lso_max=0x10000
t_blen=1000
7469 May 27 10:45:17 poc1 kernel: [ 1789.288874] host7: xid 6ae:
f_ctl 90000 seq 1
7470 May 27 10:45:17 poc1 kernel: [ 1789.291411] host7: fcp: 00061e:
xid 068e-07a3: fsp=ffff880c0e177400:lso:blen=1000 lso_max=0x10000
t_blen=1000
7471 May 27 10:45:17 poc1 kernel: [ 1789.291415] host7: xid 68e:
f_ctl 90000 seq 1
7472 May 27 10:45:17 poc1 kernel: [ 1789.291425] host7: fcp: 00061e:
xid 0c2e-0b04: fsp=ffff880c1cc4c800:lso:blen=1000 lso_max=0x10000
t_blen=1000
7473 May 27 10:45:17 poc1 kernel: [ 1789.291428] host7: xid c2e:
f_ctl 90000 seq 1
7474 May 27 10:45:17 poc1 kernel: [ 1789.296282] host7: xid 1c4: BLS
rctl 85 - BLS reject received
7475 May 27 10:45:17 poc1 kernel: [ 1789.296284] host7: xid 8c: BLS
rctl 85 - BLS reject received
7476 May 27 10:45:17 poc1 kernel: [ 1789.297554] host7: xid 6ae:
exch: BLS rctl 84 - BLS accept
7477 May 27 10:45:17 poc1 kernel: [ 1789.297559] host7: fcp: 00061e:
Returning DID_ERROR to scsi-ml due to FC_CMD_ABORTED
7478 May 27 10:45:17 poc1 kernel: [ 1789.297564] host7: xid 6ae:
Exchange timer armed : 10000 msecs
7479 May 27 10:45:17 poc1 kernel: [ 1789.298873] host7: xid ef: BLS
rctl 85 - BLS reject received
7480 May 27 10:45:17 poc1 kernel: [ 1789.298887] host7: fcp: 00061e:
xid 01cf-0e8e: DDP I/O in fc_fcp_recv_data set ERROR
7481 May 27 10:45:17 poc1 kernel: [ 1789.298891] host7: xid 1cf:
f_ctl 90000 seq 1
7482 May 27 10:45:17 poc1 kernel: [ 1789.299318] host7: xid 1cf: BLS
rctl 85 - BLS reject received
7483 May 27 10:45:17 poc1 kernel: [ 1789.299390] host7: fcp: 00061e:
xid 00c4-048c: DDP I/O in fc_fcp_recv_data set ERROR
7484 May 27 10:45:17 poc1 kernel: [ 1789.299394] host7: xid c4:
f_ctl 90000 seq 1
7485 May 27 10:45:17 poc1 kernel: [ 1789.299716] host7: xid c4: BLS
rctl 85 - BLS reject received
7486 May 27 10:45:17 poc1 kernel: [ 1789.299733] host7: xid e6: BLS
rctl 85 - BLS reject received
7487 May 27 10:45:17 poc1 kernel: [ 1789.299762] host7: xid 87: BLS
rctl 85 - BLS reject received
7488 May 27 10:45:17 poc1 kernel: [ 1789.299767] host7: xid 107: BLS
rctl 85 - BLS reject received
7489 May 27 10:45:17 poc1 kernel: [ 1789.303432] host7: fcp: 00061e:
xid 0042-050f: DDP I/O in fc_fcp_recv_data set ERROR
7490 May 27 10:45:17 poc1 kernel: [ 1789.303437] host7: xid 42:
f_ctl 90000 seq 1
7491 May 27 10:45:17 poc1 kernel: [ 1789.303813] host7: xid 42: BLS
rctl 85 - BLS reject received
7492 May 27 10:45:17 poc1 kernel: [ 1789.304596] host7: xid 7: BLS
rctl 85 - BLS reject received
7493 May 27 10:45:17 poc1 kernel: [ 1789.304607] host7: xid 1a9:
exch: BLS rctl 84 - BLS accept
7494 May 27 10:45:17 poc1 kernel: [ 1789.304615] host7: fcp: 00061e:
Returning DID_ERROR to scsi-ml due to FC_CMD_ABORTED
7495 May 27 10:45:17 poc1 kernel: [ 1789.306357] host7: xid 10d: BLS
rctl 85 - BLS reject received
7496 May 27 10:45:17 poc1 kernel: [ 1789.306797] host7: xid 18e: BLS
rctl 85 - BLS reject received
7497 May 27 10:45:17 poc1 kernel: [ 1789.307196] host7: fcp: 00061e:
xid 01e4-04a7: DDP I/O in fc_fcp_recv_data set ERROR
7498 May 27 10:45:17 poc1 kernel: [ 1789.307201] host7: xid 1e4:
f_ctl 90000 seq 1
7499 May 27 10:45:17 poc1 kernel: [ 1789.307628] host7: xid 1e4: BLS
rctl 85 - BLS reject received
7500 May 27 10:45:17 poc1 kernel: [ 1789.311225] host7: xid 8d: BLS
rctl 85 - BLS reject received
7501 May 27 10:45:17 poc1 kernel: [ 1789.316454] host7: fcp: 00061e:
xid 0c64-0de8: fsp=ffff88061c238c00:lso:blen=1000 lso_max=0x10000
t_blen=1000
7502 May 27 10:45:17 poc1 kernel: [ 1789.316458] host7: xid c64:
f_ctl 90000 seq 1
7503 May 27 10:45:27 poc1 kernel: [ 1799.306713] host7: xid 6ae:
Exchange timed out
7504 May 27 10:45:27 poc1 kernel: [ 1799.306725] host7: xid e65:
Exchange timer armed : 2000 msecs
7505 May 27 10:45:29 poc1 kernel: [ 1801.312254] host7: xid e65:
Exchange timed out
7506 May 27 10:45:29 poc1 kernel: [ 1801.312261] host7: xid e65:
f_ctl 90000 seq 1
7507 May 27 10:45:29 poc1 kernel: [ 1801.312264] host7: xid e65:
Exchange timer armed : 20000 msecs
7508 May 27 10:45:31 poc1 kernel: [ 1803.197758] host7: rport 00061e:
Remove port
7509 May 27 10:45:31 poc1 kernel: [ 1803.197762] host7: rport 00061e:
Port sending LOGO from Ready state

and finally

7724 May 27 10:45:31 poc1 kernel: [ 1803.199215] host7: fcp: 00061e:
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
7725 May 27 10:45:31 poc1 kernel: [ 1803.199219] host7: xid c44:
f_ctl 90000 seq 1
7726 May 27 10:45:31 poc1 kernel: [ 1803.199223] host7: rport 00061e:
Received a LOGO response closed
7727 May 27 10:45:31 poc1 kernel: [ 1803.199226] host7: xid e65:
Exchange timer canceled
7728 May 27 10:45:31 poc1 kernel: [ 1803.199264] host7: rport 00061e:
work delete
7729 May 27 10:46:31 poc1 kernel: [ 1863.357830] rport-7:0-0:
blocked FC remote port time out: removing target and saving binding


The messages on the target side are similar to before.
This is when the Exchange timer message first appears:
18732865 May 27 10:44:14 poc2 kernel: [ 1909.030408] host7: xid c48:
f_ctl 800000 seq 1
18732866 May 27 10:44:14 poc2 kernel: [ 1909.030423] host7: xid c48:
f_ctl 880008 seq 2
18732867 May 27 10:44:14 poc2 kernel: [ 1909.031509] host7: xid 766:
f_ctl 800000 seq 1
18732868 May 27 10:44:14 poc2 kernel: [ 1909.031528] host7: xid 766:
f_ctl 880008 seq 2
18732869 May 27 10:44:14 poc2 kernel: [ 1909.031787] host7: xid cce:
f_ctl 800000 seq 1
18732870 May 27 10:44:14 poc2 kernel: [ 1909.031799] host7: xid cce:
f_ctl 880008 seq 2
18732871 May 27 10:44:14 poc2 kernel: [ 1909.031944] host7: xid 2c7:
f_ctl 800000 seq 1
18732872 May 27 10:44:14 poc2 kernel: [ 1909.031971] host7: xid 2c7:
f_ctl 880008 seq 2
18732873 May 27 10:44:14 poc2 kernel: [ 1909.032145] host7: xid f09:
f_ctl 800000 seq 1
18732874 May 27 10:44:14 poc2 kernel: [ 1909.032166] host7: xid f09:
f_ctl 880008 seq 2
18732875 May 27 10:44:14 poc2 kernel: [ 1909.032374] host7: xid ce0:
f_ctl 800000 seq 1
18732876 May 27 10:44:14 poc2 kernel: [ 1909.032383] host7: xid ce0:
f_ctl 880008 seq 2
18732877 May 27 10:44:14 poc2 kernel: [ 1909.032729] host7: xid 8eb:
f_ctl 800000 seq 1
18732878 May 27 10:44:14 poc2 kernel: [ 1909.032736] host7: xid 32f:
f_ctl 800000 seq 1
18732879 May 27 10:44:14 poc2 kernel: [ 1909.032740] host7: xid 8eb:
f_ctl 880008 seq 2
18732880 May 27 10:44:14 poc2 kernel: [ 1909.032759] host7: xid 32f:
f_ctl 880008 seq 2
18732881 May 27 10:44:14 poc2 kernel: [ 1909.033034] host7: xid 98d:
f_ctl 800000 seq 1
18732882 May 27 10:44:14 poc2 kernel: [ 1909.033051] host7: xid 98d:
f_ctl 880008 seq 2
18732883 May 27 10:44:14 poc2 kernel: [ 1909.033262] host7: xid c01:
f_ctl 800000 seq 1
18732884 May 27 10:44:14 poc2 kernel: [ 1909.033290] host7: xid c01:
f_ctl 880008 seq 2
18732885 May 27 10:44:14 poc2 kernel: [ 1909.033373] host7: xid 6ea:
f_ctl 800000 seq 1
18732886 May 27 10:44:14 poc2 kernel: [ 1909.033387] host7: xid 6ea:
f_ctl 880008 seq 2
18732887 May 27 10:44:14 poc2 kernel: [ 1909.033734] host7: xid 766:
f_ctl 800000 seq 1
18732888 May 27 10:44:14 poc2 kernel: [ 1909.033747] host7: xid 766:
f_ctl 880008 seq 2
18732889 May 27 10:44:14 poc2 kernel: [ 1909.034621] host7: xid 30c:
f_ctl 800000 seq 1
18732890 May 27 10:44:14 poc2 kernel: [ 1909.034648] host7: xid 30c:
f_ctl 880008 seq 2
18732891 May 27 10:44:14 poc2 kernel: [ 1909.035309] host7: xid c48:
Exchange timer armed : 0 msecs
18732892 May 27 10:44:14 poc2 kernel: [ 1909.035315] host7: xid c48:
f_ctl 800000 seq 1
18732893 May 27 10:44:14 poc2 kernel: [ 1909.035384] host7: xid c48:
Exchange timed out

This is when the ft_queue_data_in failure first appears:
18919838 May 27 10:44:46 poc2 kernel: [ 1940.431788] host7: xid 327:
f_ctl 800000 seq 1
18919839 May 27 10:44:46 poc2 kernel: [ 1940.431801] host7: xid 327:
f_ctl 880008 seq 2
18919840 May 27 10:44:46 poc2 kernel: [ 1940.433931] host7: xid d68:
f_ctl 800000 seq 1
18919841 May 27 10:44:46 poc2 kernel: [ 1940.433951] host7: xid d68:
f_ctl 880008 seq 2
18919842 May 27 10:44:46 poc2 kernel: [ 1940.436138] host7: xid f89:
f_ctl 800000 seq 1
18919843 May 27 10:44:46 poc2 kernel: [ 1940.436174] host7: xid f89:
f_ctl 880008 seq 2
18919844 May 27 10:44:46 poc2 kernel: [ 1940.436889] host7: xid 74a:
f_ctl 800000 seq 1
18919845 May 27 10:44:46 poc2 kernel: [ 1940.436912] host7: xid 74a:
f_ctl 880008 seq 2
18919846 May 27 10:44:46 poc2 kernel: [ 1940.437281] host7: xid 40c:
f_ctl 800000 seq 1
18919847 May 27 10:44:46 poc2 kernel: [ 1940.437326] host7: xid 40c:
f_ctl 880008 seq 2
18919848 May 27 10:44:46 poc2 kernel: [ 1940.437338] host7: xid 92b:
f_ctl 800000 seq 1
18919849 May 27 10:44:46 poc2 kernel: [ 1940.437356] host7: xid 92b:
f_ctl 880008 seq 2
18919850 May 27 10:44:46 poc2 kernel: [ 1940.439508] host7: xid dce:
f_ctl 800000 seq 1
18919851 May 27 10:44:46 poc2 kernel: [ 1940.439539] host7: xid dce:
f_ctl 880008 seq 2
18919852 May 27 10:44:46 poc2 kernel: [ 1940.439551] host7: xid 42f:
f_ctl 800000 seq 1
18919853 May 27 10:44:46 poc2 kernel: [ 1940.439559] host7: xid 42f:
f_ctl 880008 seq 2
18919854 May 27 10:44:46 poc2 kernel: [ 1940.440408] host7: xid 90d:
f_ctl 800000 seq 1
18919855 May 27 10:44:46 poc2 kernel: [ 1940.440425] host7: xid 90d:
f_ctl 880008 seq 2
18919856 May 27 10:44:46 poc2 kernel: [ 1940.440930] host7: xid d60:
f_ctl 800000 seq 1
18919857 May 27 10:44:46 poc2 kernel: [ 1940.440950] host7: xid d60:
f_ctl 880008 seq 2
18919858 May 27 10:44:46 poc2 kernel: [ 1940.440971] host7: xid ac4:
f_ctl 800000 seq 1
18919859 May 27 10:44:46 poc2 kernel: [ 1940.440983] host7: xid ac4:
f_ctl 880008 seq 2
18919860 May 27 10:44:46 poc2 kernel: [ 1940.441482] host7: xid fc2:
f_ctl 800000 seq 1
18919861 May 27 10:44:46 poc2 kernel: [ 1940.441504] host7: xid fc2:
f_ctl 880008 seq 2
18919862 May 27 10:44:46 poc2 kernel: [ 1940.441759] host7: xid 885:
f_ctl 800000 seq 1
18919863 May 27 10:44:46 poc2 kernel: [ 1940.441772] host7: xid 885:
f_ctl 880008 seq 2
18919864 May 27 10:44:46 poc2 kernel: [ 1940.441883] host7: xid ce1:
f_ctl 800000 seq 1
18919865 May 27 10:44:46 poc2 kernel: [ 1940.441909] host7: xid ce1:
f_ctl 880008 seq 2
18919866 May 27 10:44:46 poc2 kernel: [ 1940.443033] host7: xid 7a3:
f_ctl 800000 seq 1
18919867 May 27 10:44:46 poc2 kernel: [ 1940.443063] host7: xid 7a3:
f_ctl 880008 seq 2
18919868 May 27 10:44:46 poc2 kernel: [ 1940.445410] host7: xid 7e6:
f_ctl 800000 seq 1
18919869 May 27 10:44:46 poc2 kernel: [ 1940.445424] host7: xid 7e6:
f_ctl 880008 seq 2
18919870 May 27 10:44:46 poc2 kernel: [ 1940.445609] host7: xid 327:
f_ctl 800000 seq 1
18919871 May 27 10:44:46 poc2 kernel: [ 1940.445621] host7: xid 327:
Exchange timer armed : 0 msecs
18919872 May 27 10:44:46 poc2 kernel: [ 1940.445630] host7: xid 327:
f_ctl 800008 seq 2
18919873 May 27 10:44:46 poc2 kernel: [ 1940.445639] ft_queue_data_in:
Failed to send frame ffff88062cbd3a00, xid <0x327>, remaining 131072,
lso_max <0x10000>
18919874 May 27 10:44:46 poc2 kernel: [ 1940.445645] ft_queue_data_in:
Failed to send frame ffff88062cbd3a00, xid <0x327>, remaining 65536,
lso_max <0x10000>
18919875 May 27 10:44:46 poc2 kernel: [ 1940.445649] ft_queue_data_in:
Failed to send frame ffff88062cbd3a00, xid <0x327>, remaining 0,
lso_max <0x10000>

After stripping out the "host7: xid xxx: f_ctl 8x0000 seq x" lines,
the messages on the target look like:
15476415:May 27 10:42:37 poc2 kernel: [ 1811.571399] perf samples too
long (2670 > 2500), lowering kernel.perf_event_max_sample_rate to
50000
18732891:May 27 10:44:14 poc2 kernel: [ 1909.035309] host7: xid c48:
Exchange timer armed : 0 msecs
18732893:May 27 10:44:14 poc2 kernel: [ 1909.035384] host7: xid c48:
Exchange timed out
18734191:May 27 10:44:14 poc2 kernel: [ 1909.218580] host7: xid 766:
Exchange timer armed : 0 msecs
18734193:May 27 10:44:14 poc2 kernel: [ 1909.218750] host7: xid 766:
Exchange timed out
18734419:May 27 10:44:14 poc2 kernel: [ 1909.254362] host7: xid cce:
Exchange timer armed : 0 msecs
18734422:May 27 10:44:14 poc2 kernel: [ 1909.254385] host7: xid cce:
Exchange timed out
18734439:May 27 10:44:14 poc2 kernel: [ 1909.258190] host7: xid a84:
Exchange timer armed : 0 msecs
18734441:May 27 10:44:14 poc2 kernel: [ 1909.258245] host7: xid a84:
Exchange timed out
18734444:May 27 10:44:14 poc2 kernel: [ 1909.259212] host7: xid d00:
Exchange timer armed : 0 msecs
18734446:May 27 10:44:14 poc2 kernel: [ 1909.259222] host7: xid d00:
Exchange timed out
18734465:May 27 10:44:14 poc2 kernel: [ 1909.261859] host7: xid 287:
Exchange timer armed : 0 msecs
18734467:May 27 10:44:14 poc2 kernel: [ 1909.261910] host7: xid 287:
Exchange timed out
18734632:May 27 10:44:15 poc2 kernel: [ 1909.285827] host7: xid a64:
Exchange timer armed : 0 msecs
18734634:May 27 10:44:15 poc2 kernel: [ 1909.285892] host7: xid a64:
Exchange timed out
18734699:May 27 10:44:15 poc2 kernel: [ 1909.301348] host7: xid be8:
Exchange timer armed : 0 msecs
18734701:May 27 10:44:15 poc2 kernel: [ 1909.301357] host7: xid be8:
Exchange timed out
18734806:May 27 10:44:15 poc2 kernel: [ 1909.319404] host7: xid ea9:
Exchange timer armed : 0 msecs
18734808:May 27 10:44:15 poc2 kernel: [ 1909.319446] host7: xid ea9:
Exchange timed out
18734842:May 27 10:44:15 poc2 kernel: [ 1909.324226] host7: xid c01:
Exchange timer armed : 0 msecs
18734845:May 27 10:44:15 poc2 kernel: [ 1909.324257] host7: xid c01:
Exchange timed out
18734944:May 27 10:44:15 poc2 kernel: [ 1909.347075] host7: xid 30c:
Exchange timer armed : 0 msecs
18734946:May 27 10:44:15 poc2 kernel: [ 1909.347099] host7: xid 30c:
Exchange timed out
18735067:May 27 10:44:15 poc2 kernel: [ 1909.360509] host7: xid c8e:
Exchange timer armed : 0 msecs
18735069:May 27 10:44:15 poc2 kernel: [ 1909.360571] host7: xid c8e:
Exchange timed out
18735154:May 27 10:44:15 poc2 kernel: [ 1909.380932] host7: xid 9ad:
Exchange timer armed : 0 msecs
18735156:May 27 10:44:15 poc2 kernel: [ 1909.380946] host7: xid 9ad:
Exchange timed out
18735253:May 27 10:44:15 poc2 kernel: [ 1909.395037] host7: xid ce0:
Exchange timer armed : 0 msecs
18735255:May 27 10:44:15 poc2 kernel: [ 1909.395113] host7: xid ce0:
Exchange timed out
18735308:May 27 10:44:15 poc2 kernel: [ 1909.405266] host7: xid 746:
Exchange timer armed : 0 msecs
18735312:May 27 10:44:15 poc2 kernel: [ 1909.405332] host7: xid 746:
Exchange timed out
18735317:May 27 10:44:15 poc2 kernel: [ 1909.405832] host7: xid 26c:
Exchange timer armed : 0 msecs
18735319:May 27 10:44:15 poc2 kernel: [ 1909.405852] host7: xid 26c:
Exchange timed out
18735342:May 27 10:44:15 poc2 kernel: [ 1909.412982] host7: xid f09:
Exchange timer armed : 0 msecs
18735344:May 27 10:44:15 poc2 kernel: [ 1909.413062] host7: xid f09:
Exchange timed out
18735572:May 27 10:44:15 poc2 kernel: [ 1909.453616] host7: xid d2e:
Exchange timer armed : 0 msecs
18735574:May 27 10:44:15 poc2 kernel: [ 1909.453631] host7: xid d2e:
Exchange timed out
18735575:May 27 10:44:15 poc2 kernel: [ 1909.454206] host7: xid 32f:
Exchange timer armed : 0 msecs
18735577:May 27 10:44:15 poc2 kernel: [ 1909.454218] host7: xid 32f:
Exchange timed out
18735810:May 27 10:44:15 poc2 kernel: [ 1909.490061] host7: xid c68:
Exchange timer armed : 0 msecs
18735812:May 27 10:44:15 poc2 kernel: [ 1909.490126] host7: xid c68:
Exchange timed out
18735927:May 27 10:44:15 poc2 kernel: [ 1909.515654] host7: xid 267:
Exchange timer armed : 0 msecs
18735929:May 27 10:44:15 poc2 kernel: [ 1909.515667] host7: xid 267:
Exchange timed out
18735940:May 27 10:44:15 poc2 kernel: [ 1909.523278] host7: xid 3cf:
Exchange timer armed : 0 msecs
18735942:May 27 10:44:15 poc2 kernel: [ 1909.523350] host7: xid 3cf:
Exchange timed out
18736041:May 27 10:44:15 poc2 kernel: [ 1909.554439] host7: xid 984:
Exchange timer armed : 0 msecs
18736043:May 27 10:44:15 poc2 kernel: [ 1909.554457] host7: xid 984:
Exchange timed out
18736148:May 27 10:44:15 poc2 kernel: [ 1909.583305] host7: xid 7a6:
Exchange timer armed : 0 msecs
18736150:May 27 10:44:15 poc2 kernel: [ 1909.583332] host7: xid 7a6:
Exchange timed out
18916476:May 27 10:44:45 poc2 kernel: [ 1939.812996] host7: xid 2c7:
Exchange timer armed : 0 msecs
18916478:May 27 10:44:45 poc2 kernel: [ 1939.813075] host7: xid 2c7:
Exchange timed out
18917869:May 27 10:44:45 poc2 kernel: [ 1940.028716] host7: xid 307:
Exchange timer armed : 0 msecs
18917871:May 27 10:44:45 poc2 kernel: [ 1940.028738] host7: xid 307:
Exchange timed out
18917904:May 27 10:44:45 poc2 kernel: [ 1940.036022] host7: xid 44f:
Exchange timer armed : 0 msecs
18917906:May 27 10:44:45 poc2 kernel: [ 1940.036072] host7: xid 44f:
Exchange timed out
18917907:May 27 10:44:45 poc2 kernel: [ 1940.036419] host7: xid d01:
Exchange timer armed : 0 msecs
18917910:May 27 10:44:45 poc2 kernel: [ 1940.036459] host7: xid d01:
Exchange timed out
18918008:May 27 10:44:45 poc2 kernel: [ 1940.050544] host7: xid 94b:
Exchange timer armed : 0 msecs
18918010:May 27 10:44:45 poc2 kernel: [ 1940.050553] host7: xid 94b:
Exchange timed out
18918239:May 27 10:44:45 poc2 kernel: [ 1940.100469] host7: xid 743:
Exchange timer armed : 0 msecs
18918241:May 27 10:44:45 poc2 kernel: [ 1940.100485] host7: xid 743:
Exchange timed out
18918310:May 27 10:44:45 poc2 kernel: [ 1940.111206] host7: xid f82:
Exchange timer armed : 0 msecs
18918312:May 27 10:44:45 poc2 kernel: [ 1940.111357] host7: xid f82:
Exchange timed out
18918329:May 27 10:44:45 poc2 kernel: [ 1940.117376] host7: xid e00:
Exchange timer armed : 0 msecs
18918331:May 27 10:44:45 poc2 kernel: [ 1940.117388] host7: xid e00:
Exchange timed out
18918385:May 27 10:44:45 poc2 kernel: [ 1940.127422] host7: xid d08:
Exchange timer armed : 0 msecs
18918388:May 27 10:44:45 poc2 kernel: [ 1940.127470] host7: xid d08:
Exchange timed out
18918487:May 27 10:44:45 poc2 kernel: [ 1940.142935] host7: xid 8c5:
Exchange timer armed : 0 msecs
18918489:May 27 10:44:45 poc2 kernel: [ 1940.142950] host7: xid 8c5:
Exchange timed out
18918588:May 27 10:44:45 poc2 kernel: [ 1940.158196] host7: xid 92d:
Exchange timer armed : 0 msecs
18918590:May 27 10:44:45 poc2 kernel: [ 1940.158254] host7: xid 92d:
Exchange timed out
18918597:May 27 10:44:45 poc2 kernel: [ 1940.159082] host7: xid a04:
Exchange timer armed : 0 msecs
18918599:May 27 10:44:45 poc2 kernel: [ 1940.159154] host7: xid a04:
Exchange timed out
18918866:May 27 10:44:45 poc2 kernel: [ 1940.209430] host7: xid d28:
Exchange timer armed : 0 msecs
18918868:May 27 10:44:45 poc2 kernel: [ 1940.209484] host7: xid d28:
Exchange timed out
18918909:May 27 10:44:45 poc2 kernel: [ 1940.215142] host7: xid 8ed:
Exchange timer armed : 0 msecs
18918911:May 27 10:44:45 poc2 kernel: [ 1940.215158] host7: xid 8ed:
Exchange timed out
18918918:May 27 10:44:45 poc2 kernel: [ 1940.217253] host7: xid 3ef:
Exchange timer armed : 0 msecs
18918920:May 27 10:44:45 poc2 kernel: [ 1940.217270] host7: xid 3ef:
Exchange timed out
18919025:May 27 10:44:45 poc2 kernel: [ 1940.241859] host7: xid fa9:
Exchange timer armed : 0 msecs
18919027:May 27 10:44:45 poc2 kernel: [ 1940.241904] host7: xid fa9:
Exchange timed out
18919151:May 27 10:44:45 poc2 kernel: [ 1940.267514] host7: xid 38c:
Exchange timer armed : 0 msecs
18919154:May 27 10:44:45 poc2 kernel: [ 1940.267580] host7: xid 38c:
Exchange timed out
18919163:May 27 10:44:45 poc2 kernel: [ 1940.268681] host7: xid d48:
Exchange timer armed : 0 msecs
18919165:May 27 10:44:45 poc2 kernel: [ 1940.268707] host7: xid d48:
Exchange timed out
18919212:May 27 10:44:45 poc2 kernel: [ 1940.278834] host7: xid aa4:
Exchange timer armed : 0 msecs
18919214:May 27 10:44:45 poc2 kernel: [ 1940.278881] host7: xid aa4:
Exchange timed out
18919243:May 27 10:44:45 poc2 kernel: [ 1940.285614] host7: xid f22:
Exchange timer armed : 0 msecs
18919245:May 27 10:44:45 poc2 kernel: [ 1940.285629] host7: xid f22:
Exchange timed out
18919270:May 27 10:44:45 poc2 kernel: [ 1940.293297] host7: xid da0:
Exchange timer armed : 0 msecs
18919272:May 27 10:44:45 poc2 kernel: [ 1940.293346] host7: xid da0:
Exchange timed out
18919321:May 27 10:44:45 poc2 kernel: [ 1940.301098] host7: xid 3cc:
Exchange timer armed : 0 msecs
18919325:May 27 10:44:45 poc2 kernel: [ 1940.301172] host7: xid 3cc:
Exchange timed out
18919472:May 27 10:44:46 poc2 kernel: [ 1940.333266] host7: xid e0e:
Exchange timer armed : 0 msecs
18919474:May 27 10:44:46 poc2 kernel: [ 1940.333299] host7: xid e0e:
Exchange timed out
18919641:May 27 10:44:46 poc2 kernel: [ 1940.368427] host7: xid 2e7:
Exchange timer armed : 0 msecs
18919643:May 27 10:44:46 poc2 kernel: [ 1940.368451] host7: xid 2e7:
Exchange timed out
18919694:May 27 10:44:46 poc2 kernel: [ 1940.386699] host7: xid 826:
Exchange timer armed : 0 msecs
18919696:May 27 10:44:46 poc2 kernel: [ 1940.386744] host7: xid 826:
Exchange timed out
18919785:May 27 10:44:46 poc2 kernel: [ 1940.409090] host7: xid 40f:
Exchange timer armed : 0 msecs
18919787:May 27 10:44:46 poc2 kernel: [ 1940.409125] host7: xid 40f:
Exchange timed out
18919871:May 27 10:44:46 poc2 kernel: [ 1940.445621] host7: xid 327:
Exchange timer armed : 0 msecs
18919873:May 27 10:44:46 poc2 kernel: [ 1940.445639] ft_queue_data_in:
Failed to send frame ffff88062cbd3a00, xid <0x327>, remaining 131072,
lso_max <0x10000>
18919874:May 27 10:44:46 poc2 kernel: [ 1940.445645] ft_queue_data_in:
Failed to send frame ffff88062cbd3a00, xid <0x327>, remaining 65536,
lso_max <0x10000>
18919875:May 27 10:44:46 poc2 kernel: [ 1940.445649] ft_queue_data_in:
Failed to send frame ffff88062cbd3a00, xid <0x327>, remaining 0,
lso_max <0x10000>
18919877:May 27 10:44:46 poc2 kernel: [ 1940.445671] host7: xid 327:
Exchange timed out
18919878:May 27 10:44:46 poc2 kernel: [ 1940.445683] host7: xid ac4:
Exchange timer armed : 2000 msecs
18930794:May 27 10:44:48 poc2 kernel: [ 1942.452127] host7: xid ac4:
Exchange timed out
18930795:May 27 10:44:48 poc2 kernel: [ 1942.452136] host7: xid ac4:
f_ctl 90000 seq 1
18930796:May 27 10:44:48 poc2 kernel: [ 1942.452139] host7: xid ac4:
Exchange timer armed : 20000 msecs
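
The stripping described above was done by hand here, but a minimal sketch of the same filtering (the sample lines are illustrative, not taken from a real log file):

```shell
# Drop the noisy per-frame "f_ctl ... seq ..." lines so the exchange
# timer and abort events stand out; sample input is inlined for clarity.
printf '%s\n' \
  'host7: xid 40c: f_ctl 800000 seq 1' \
  'host7: xid 40c: Exchange timed out' \
  'host7: xid 92b: f_ctl 880008 seq 2' \
| grep -v 'f_ctl 8'
# prints: host7: xid 40c: Exchange timed out
```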

Thanks
Post by Vasu Dev
Post by Jun Wu
Post by Jun Wu
18630751 May 21 11:01:53 poc2 kernel: [ 1528.191740] host7: xid
Exchange timer armed : 0 msecs
18630752 May 21 11:01:53 poc2 kernel: [ 1528.191747] host7: xid
f_ctl 800000 seq 1
18630753 May 21 11:01:53 poc2 kernel: [ 1528.191756] host7: xid
f_ctl 800000 seq 2
18630754 May 21 11:01:53 poc2 kernel: [ 1528.191763] host7: xid
Exchange timed out
18630755 May 21 11:01:53 poc2 kernel: [ 1528.191777]
Failed to send frame ffff8805bf245e00, xid <0x6e4>, remaining
458752,
Post by Jun Wu
lso_max <0x10000>
Exchange 0x6e4 is aborted and then the target is still sending frames;
the latter should not occur, but setting up the abort with a 0 msec
timeout in the first place does not look correct either, and it differs
from the 8000 ms on the initiator side. Are you using the same switch
between initiator and target, with DCB enabled ?
Reducing retries could narrow down whether early aborts are the cause
here; can you try with REC disabled on the initiator side, using this
change ?
diff --git a/drivers/scsi/libfc/fc_fcp.c b/drivers/scsi/libfc/fc_fcp.c
index 1d7e76e..43fe793 100644
--- a/drivers/scsi/libfc/fc_fcp.c
+++ b/drivers/scsi/libfc/fc_fcp.c
@@ -1160,6 +1160,7 @@ static int fc_fcp_cmd_send(struct fc_lport *lport, struct fc_fcp_pkt *fsp,
 	fc_fcp_pkt_hold(fsp);	/* hold for fc_fcp_pkt_destroy */
 	setup_timer(&fsp->timer, fc_fcp_timeout, (unsigned long)fsp);
+	rpriv->flags &= ~FC_RP_FLAGS_REC_SUPPORTED;
 	if (rpriv->flags & FC_RP_FLAGS_REC_SUPPORTED)
 		fc_fcp_timer_set(fsp, get_fsp_rec_tov(fsp));
Jun Wu
2014-06-02 23:55:02 UTC
Permalink
Hi Vasu,

Is there anything else we can do to collect more information or adjust
the timeout value?
Thanks,

Jun
Post by Jun Wu
We do not have DCB PFC so the traffic is managed by PAUSE frames by default.
The kernel has been recompiled to include your change. I ran the same
test after the kernel change, i.e. running 10 fio sessions from
initiator to 10 drives on target. The target machine hung after a few
minutes.
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
xid 01cd-0a84: DDP I/O in fc_fcp_recv_data set ERROR
f_ctl 90000 seq 1
4470 May 27 10:44:14 poc1 kernel: [ 1726.315666] host7: xid 1cd: BLS
rctl 85 - BLS reject received
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
xid 01ee-0c48: DDP I/O in fc_fcp_recv_data set ERROR
f_ctl 90000 seq 1
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
I see far fewer Exchange timer messages than last time. Here are all
xid 0181-090d: DDP I/O in fc_fcp_recv_data set ERROR
f_ctl 90000 seq 1
xid 0005-0327: DDP I/O in fc_fcp_recv_data set ERROR
f_ctl 90000 seq 1
xid 0060-0dce: DDP I/O in fc_fcp_recv_data set ERROR
f_ctl 90000 seq 1
exch: BLS rctl 84 - BLS accept
Returning DID_ERROR to scsi-ml due to FC_CMD_ABORTED
Exchange timer armed : 10000 msecs
xid 00cc-090d: DDP I/O in fc_fcp_recv_data set ERROR
f_ctl 90000 seq 1
6532 May 27 10:44:46 poc1 kernel: [ 1757.763983] host7: xid 60: BLS
rctl 85 - BLS reject received
6533 May 27 10:44:46 poc1 kernel: [ 1757.765447] host7: xid 14d: BLS
rctl 85 - BLS reject received
6534 May 27 10:44:46 poc1 kernel: [ 1757.766258] host7: xid 18b: BLS
rctl 85 - BLS reject received
6535 May 27 10:44:46 poc1 kernel: [ 1757.767116] host7: xid ed: BLS
rctl 85 - BLS reject received
6536 May 27 10:44:46 poc1 kernel: [ 1757.767144] host7: xid 21: BLS
rctl 85 - BLS reject received
6537 May 27 10:44:46 poc1 kernel: [ 1757.768161] host7: xid 122: BLS
rctl 85 - BLS reject received
6538 May 27 10:44:46 poc1 kernel: [ 1757.768184] host7: xid 45: BLS
rctl 85 - BLS reject received
6539 May 27 10:44:46 poc1 kernel: [ 1757.768437] host7: xid 16f: BLS
rctl 85 - BLS reject received
6540 May 27 10:44:46 poc1 kernel: [ 1757.768880] host7: xid 14c: BLS
rctl 85 - BLS reject received
6541 May 27 10:44:46 poc1 kernel: [ 1757.772813] host7: xid 8a: BLS
rctl 85 - BLS reject received
6542 May 27 10:44:46 poc1 kernel: [ 1757.778209] host7: xid 181: BLS
rctl 85 - BLS reject received
6543 May 27 10:44:46 poc1 kernel: [ 1757.778226] host7: xid 16b: BLS
rctl 85 - BLS reject received
6544 May 27 10:44:46 poc1 kernel: [ 1757.778905] host7: xid cc: BLS
rctl 85 - BLS reject received
Exchange timed out
Exchange timer armed : 2000 msecs
Exchange timer canceled
LS_RJT for RRQ
Resetting rport (00061e)
6550 May 27 10:45:16 poc1 kernel: [ 1788.034071] host7: scsi: lun
reset to lun 0 completed
and
xid 06ae-0985: fsp=ffff880c0e176700:lso:blen=1000 lso_max=0x10000
t_blen=1000
f_ctl 90000 seq 1
xid 068e-07a3: fsp=ffff880c0e177400:lso:blen=1000 lso_max=0x10000
t_blen=1000
f_ctl 90000 seq 1
xid 0c2e-0b04: fsp=ffff880c1cc4c800:lso:blen=1000 lso_max=0x10000
t_blen=1000
f_ctl 90000 seq 1
7474 May 27 10:45:17 poc1 kernel: [ 1789.296282] host7: xid 1c4: BLS
rctl 85 - BLS reject received
7475 May 27 10:45:17 poc1 kernel: [ 1789.296284] host7: xid 8c: BLS
rctl 85 - BLS reject received
exch: BLS rctl 84 - BLS accept
Returning DID_ERROR to scsi-ml due to FC_CMD_ABORTED
Exchange timer armed : 10000 msecs
7479 May 27 10:45:17 poc1 kernel: [ 1789.298873] host7: xid ef: BLS
rctl 85 - BLS reject received
xid 01cf-0e8e: DDP I/O in fc_fcp_recv_data set ERROR
f_ctl 90000 seq 1
7482 May 27 10:45:17 poc1 kernel: [ 1789.299318] host7: xid 1cf: BLS
rctl 85 - BLS reject received
xid 00c4-048c: DDP I/O in fc_fcp_recv_data set ERROR
f_ctl 90000 seq 1
7485 May 27 10:45:17 poc1 kernel: [ 1789.299716] host7: xid c4: BLS
rctl 85 - BLS reject received
7486 May 27 10:45:17 poc1 kernel: [ 1789.299733] host7: xid e6: BLS
rctl 85 - BLS reject received
7487 May 27 10:45:17 poc1 kernel: [ 1789.299762] host7: xid 87: BLS
rctl 85 - BLS reject received
7488 May 27 10:45:17 poc1 kernel: [ 1789.299767] host7: xid 107: BLS
rctl 85 - BLS reject received
xid 0042-050f: DDP I/O in fc_fcp_recv_data set ERROR
f_ctl 90000 seq 1
7491 May 27 10:45:17 poc1 kernel: [ 1789.303813] host7: xid 42: BLS
rctl 85 - BLS reject received
7492 May 27 10:45:17 poc1 kernel: [ 1789.304596] host7: xid 7: BLS
rctl 85 - BLS reject received
exch: BLS rctl 84 - BLS accept
Returning DID_ERROR to scsi-ml due to FC_CMD_ABORTED
7495 May 27 10:45:17 poc1 kernel: [ 1789.306357] host7: xid 10d: BLS
rctl 85 - BLS reject received
7496 May 27 10:45:17 poc1 kernel: [ 1789.306797] host7: xid 18e: BLS
rctl 85 - BLS reject received
xid 01e4-04a7: DDP I/O in fc_fcp_recv_data set ERROR
f_ctl 90000 seq 1
7499 May 27 10:45:17 poc1 kernel: [ 1789.307628] host7: xid 1e4: BLS
rctl 85 - BLS reject received
7500 May 27 10:45:17 poc1 kernel: [ 1789.311225] host7: xid 8d: BLS
rctl 85 - BLS reject received
xid 0c64-0de8: fsp=ffff88061c238c00:lso:blen=1000 lso_max=0x10000
t_blen=1000
f_ctl 90000 seq 1
Exchange timed out
Exchange timer armed : 2000 msecs
Exchange timed out
f_ctl 90000 seq 1
Exchange timer armed : 20000 msecs
Remove port
Port sending LOGO from Ready state
and finally
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
f_ctl 90000 seq 1
Received a LOGO response closed
Exchange timer canceled
work delete
blocked FC remote port time out: removing target and saving binding
The messages on the target side are similar as before.
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
Exchange timer armed : 0 msecs
f_ctl 800000 seq 1
Exchange timed out
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
Exchange timer armed : 0 msecs
f_ctl 800008 seq 2
Failed to send frame ffff88062cbd3a00, xid <0x327>, remaining 131072,
lso_max <0x10000>
Failed to send frame ffff88062cbd3a00, xid <0x327>, remaining 65536,
lso_max <0x10000>
Failed to send frame ffff88062cbd3a00, xid <0x327>, remaining 0,
lso_max <0x10000>
After stripping out the "host7: xid xxx: f_ctl 8x0000 seq x" lines,
15476415:May 27 10:42:37 poc2 kernel: [ 1811.571399] perf samples too
long (2670 > 2500), lowering kernel.perf_event_max_sample_rate to
50000
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Exchange timed out
Exchange timer armed : 0 msecs
Failed to send frame ffff88062cbd3a00, xid <0x327>, remaining 131072,
lso_max <0x10000>
Failed to send frame ffff88062cbd3a00, xid <0x327>, remaining 65536,
lso_max <0x10000>
Failed to send frame ffff88062cbd3a00, xid <0x327>, remaining 0,
lso_max <0x10000>
Exchange timed out
Exchange timer armed : 2000 msecs
Exchange timed out
f_ctl 90000 seq 1
Exchange timer armed : 20000 msecs
Thanks
Post by Vasu Dev
Post by Jun Wu
Post by Jun Wu
18630751 May 21 11:01:53 poc2 kernel: [ 1528.191740] host7: xid
Exchange timer armed : 0 msecs
18630752 May 21 11:01:53 poc2 kernel: [ 1528.191747] host7: xid
f_ctl 800000 seq 1
18630753 May 21 11:01:53 poc2 kernel: [ 1528.191756] host7: xid
f_ctl 800000 seq 2
18630754 May 21 11:01:53 poc2 kernel: [ 1528.191763] host7: xid
Exchange timed out
18630755 May 21 11:01:53 poc2 kernel: [ 1528.191777]
Failed to send frame ffff8805bf245e00, xid <0x6e4>, remaining
458752,
Post by Jun Wu
lso_max <0x10000>
Exchange 0x6e4 is aborted and then the target is still sending frames;
the latter should not occur, but setting up the abort with a 0 msec
timeout in the first place does not look correct either, and it differs
from the 8000 ms on the initiator side. Are you using the same switch
between initiator and target, with DCB enabled ?
Reducing retries could narrow down whether early aborts are the cause
here; can you try with REC disabled on the initiator side, using this
change ?
diff --git a/drivers/scsi/libfc/fc_fcp.c b/drivers/scsi/libfc/fc_fcp.c
index 1d7e76e..43fe793 100644
--- a/drivers/scsi/libfc/fc_fcp.c
+++ b/drivers/scsi/libfc/fc_fcp.c
@@ -1160,6 +1160,7 @@ static int fc_fcp_cmd_send(struct fc_lport *lport, struct fc_fcp_pkt *fsp,
 	fc_fcp_pkt_hold(fsp);	/* hold for fc_fcp_pkt_destroy */
 	setup_timer(&fsp->timer, fc_fcp_timeout, (unsigned long)fsp);
+	rpriv->flags &= ~FC_RP_FLAGS_REC_SUPPORTED;
 	if (rpriv->flags & FC_RP_FLAGS_REC_SUPPORTED)
 		fc_fcp_timer_set(fsp, get_fsp_rec_tov(fsp));
Vasu Dev
2014-06-03 00:21:09 UTC
Permalink
Post by Jun Wu
Hi Vasu,
Is there anything else we can do to collect more information or adjust
the timeout value?
Increased timeouts will help in reducing the seq_send failure prints,
and that must already be the case with the REC-disable patch try, since
that skipped REC completely and thereby defaulted to the longer SCSI
retry timeouts. In any case, the seq_send failure prints are just noise
here, since seq_send fails only for aborted exchanges, and that is very
likely with tcm_fc abort handling, especially in your oversubscribed
target setup with one target taking workloads from 10 hosts. I mean
seq_send fails only due to the following aborted-exchange check, since
thereafter fcoe_xmit always returns success.

	if (ep->esb_stat & (ESB_ST_COMPLETE | ESB_ST_ABNORMAL)) {
		fc_frame_free(fp);
		goto out;
	}

So I think it is okay to get rid of the send-failure prints, the same
as tcm_fc does at its other call sites of seq_send, with this patch:

diff --git a/drivers/target/tcm_fc/tfc_io.c b/drivers/target/tcm_fc/tfc_io.c
index a48b96d..0425fc0 100644
--- a/drivers/target/tcm_fc/tfc_io.c
+++ b/drivers/target/tcm_fc/tfc_io.c
@@ -200,15 +200,7 @@ int ft_queue_data_in(struct se_cmd *se_cmd)
 			f_ctl |= FC_FC_END_SEQ;
 		fc_fill_fc_hdr(fp, FC_RCTL_DD_SOL_DATA, ep->did, ep->sid,
 			       FC_TYPE_FCP, f_ctl, fh_off);
-		error = lport->tt.seq_send(lport, seq, fp);
-		if (error) {
-			/* XXX For now, initiator will retry */
-			pr_err_ratelimited("%s: Failed to send frame %p, "
-					   "xid <0x%x>, remaining %zu, "
-					   "lso_max <0x%x>\n",
-					   __func__, fp, ep->xid,
-					   remaining, lport->lso_max);
-		}
+		lport->tt.seq_send(lport, seq, fp);
 	}
 	return ft_queue_status(se_cmd);
 }
Post by Jun Wu
Thanks,
Jun
Post by Jun Wu
We do not have DCB PFC so the traffic is managed by PAUSE frames by default.
The kernel has been recompiled to include your change. I ran the same
test after the kernel change, i.e. running 10 fio sessions from
initiator to 10 drives on target. The target machine hung after a few
minutes.
The target could be either genuinely overloaded by your workloads in
this 10-hosts-to-single-target setup, or it could be really stuck. For
the latter case, enabling hung task detection (under Kernel Hacking)
could help, as that would dump the stack trace of the hung task. If it
is the former case, then I don't know how the target back end handles
sustained oversubscription of target bandwidth; maybe Nick could shed
some info on that, and possibly tcm_fc is missing related feedback into
the target core.
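
For reference, a hedged sketch of turning on hung task detection at runtime; it assumes a kernel built with CONFIG_DETECT_HUNG_TASK, and the 60-second timeout is only an example value:

```shell
# Runtime knobs for the hung task detector (values here are examples).
# A task blocked in uninterruptible (D) state longer than the timeout
# gets its stack trace dumped to the kernel log.
sysctl -w kernel.hung_task_timeout_secs=60
sysctl -w kernel.hung_task_panic=0   # 0 = warn only, do not panic
```

The resulting stack trace in dmesg would show where the stuck target task is blocked.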

If your workloads only exceed capacity intermittently, then adjusting
the timeouts as you mentioned above, or increasing resources throughout
the Tx/Rx paths, could help absorb such intermittent incidents and in
turn avoid host I/O timeouts/failures.

//Vasu
Post by Jun Wu
Post by Jun Wu
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
xid 01cd-0a84: DDP I/O in fc_fcp_recv_data set ERROR
f_ctl 90000 seq 1
4470 May 27 10:44:14 poc1 kernel: [ 1726.315666] host7: xid 1cd: BLS
rctl 85 - BLS reject received
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
[... the above line repeated 28 times ...]
xid 01ee-0c48: DDP I/O in fc_fcp_recv_data set ERROR
f_ctl 90000 seq 1
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
I see far fewer Exchange timer messages than last time. Here are all of them:
xid 0181-090d: DDP I/O in fc_fcp_recv_data set ERROR
f_ctl 90000 seq 1
xid 0005-0327: DDP I/O in fc_fcp_recv_data set ERROR
f_ctl 90000 seq 1
xid 0060-0dce: DDP I/O in fc_fcp_recv_data set ERROR
f_ctl 90000 seq 1
exch: BLS rctl 84 - BLS accept
Returning DID_ERROR to scsi-ml due to FC_CMD_ABORTED
Exchange timer armed : 10000 msecs
xid 00cc-090d: DDP I/O in fc_fcp_recv_data set ERROR
f_ctl 90000 seq 1
6532 May 27 10:44:46 poc1 kernel: [ 1757.763983] host7: xid 60: BLS
rctl 85 - BLS reject received
6533 May 27 10:44:46 poc1 kernel: [ 1757.765447] host7: xid 14d: BLS
rctl 85 - BLS reject received
6534 May 27 10:44:46 poc1 kernel: [ 1757.766258] host7: xid 18b: BLS
rctl 85 - BLS reject received
6535 May 27 10:44:46 poc1 kernel: [ 1757.767116] host7: xid ed: BLS
rctl 85 - BLS reject received
6536 May 27 10:44:46 poc1 kernel: [ 1757.767144] host7: xid 21: BLS
rctl 85 - BLS reject received
6537 May 27 10:44:46 poc1 kernel: [ 1757.768161] host7: xid 122: BLS
rctl 85 - BLS reject received
6538 May 27 10:44:46 poc1 kernel: [ 1757.768184] host7: xid 45: BLS
rctl 85 - BLS reject received
6539 May 27 10:44:46 poc1 kernel: [ 1757.768437] host7: xid 16f: BLS
rctl 85 - BLS reject received
6540 May 27 10:44:46 poc1 kernel: [ 1757.768880] host7: xid 14c: BLS
rctl 85 - BLS reject received
6541 May 27 10:44:46 poc1 kernel: [ 1757.772813] host7: xid 8a: BLS
rctl 85 - BLS reject received
6542 May 27 10:44:46 poc1 kernel: [ 1757.778209] host7: xid 181: BLS
rctl 85 - BLS reject received
6543 May 27 10:44:46 poc1 kernel: [ 1757.778226] host7: xid 16b: BLS
rctl 85 - BLS reject received
6544 May 27 10:44:46 poc1 kernel: [ 1757.778905] host7: xid cc: BLS
rctl 85 - BLS reject received
Exchange timed out
Exchange timer armed : 2000 msecs
Exchange timer canceled
LS_RJT for RRQ
Resetting rport (00061e)
6550 May 27 10:45:16 poc1 kernel: [ 1788.034071] host7: scsi: lun
reset to lun 0 completed
and
xid 06ae-0985: fsp=ffff880c0e176700:lso:blen=1000 lso_max=0x10000
t_blen=1000
f_ctl 90000 seq 1
xid 068e-07a3: fsp=ffff880c0e177400:lso:blen=1000 lso_max=0x10000
t_blen=1000
f_ctl 90000 seq 1
xid 0c2e-0b04: fsp=ffff880c1cc4c800:lso:blen=1000 lso_max=0x10000
t_blen=1000
f_ctl 90000 seq 1
7474 May 27 10:45:17 poc1 kernel: [ 1789.296282] host7: xid 1c4: BLS
rctl 85 - BLS reject received
7475 May 27 10:45:17 poc1 kernel: [ 1789.296284] host7: xid 8c: BLS
rctl 85 - BLS reject received
exch: BLS rctl 84 - BLS accept
Returning DID_ERROR to scsi-ml due to FC_CMD_ABORTED
Exchange timer armed : 10000 msecs
7479 May 27 10:45:17 poc1 kernel: [ 1789.298873] host7: xid ef: BLS
rctl 85 - BLS reject received
xid 01cf-0e8e: DDP I/O in fc_fcp_recv_data set ERROR
f_ctl 90000 seq 1
7482 May 27 10:45:17 poc1 kernel: [ 1789.299318] host7: xid 1cf: BLS
rctl 85 - BLS reject received
xid 00c4-048c: DDP I/O in fc_fcp_recv_data set ERROR
f_ctl 90000 seq 1
7485 May 27 10:45:17 poc1 kernel: [ 1789.299716] host7: xid c4: BLS
rctl 85 - BLS reject received
7486 May 27 10:45:17 poc1 kernel: [ 1789.299733] host7: xid e6: BLS
rctl 85 - BLS reject received
7487 May 27 10:45:17 poc1 kernel: [ 1789.299762] host7: xid 87: BLS
rctl 85 - BLS reject received
7488 May 27 10:45:17 poc1 kernel: [ 1789.299767] host7: xid 107: BLS
rctl 85 - BLS reject received
xid 0042-050f: DDP I/O in fc_fcp_recv_data set ERROR
f_ctl 90000 seq 1
7491 May 27 10:45:17 poc1 kernel: [ 1789.303813] host7: xid 42: BLS
rctl 85 - BLS reject received
7492 May 27 10:45:17 poc1 kernel: [ 1789.304596] host7: xid 7: BLS
rctl 85 - BLS reject received
exch: BLS rctl 84 - BLS accept
Returning DID_ERROR to scsi-ml due to FC_CMD_ABORTED
7495 May 27 10:45:17 poc1 kernel: [ 1789.306357] host7: xid 10d: BLS
rctl 85 - BLS reject received
7496 May 27 10:45:17 poc1 kernel: [ 1789.306797] host7: xid 18e: BLS
rctl 85 - BLS reject received
xid 01e4-04a7: DDP I/O in fc_fcp_recv_data set ERROR
f_ctl 90000 seq 1
7499 May 27 10:45:17 poc1 kernel: [ 1789.307628] host7: xid 1e4: BLS
rctl 85 - BLS reject received
7500 May 27 10:45:17 poc1 kernel: [ 1789.311225] host7: xid 8d: BLS
rctl 85 - BLS reject received
xid 0c64-0de8: fsp=ffff88061c238c00:lso:blen=1000 lso_max=0x10000
t_blen=1000
f_ctl 90000 seq 1
Exchange timed out
Exchange timer armed : 2000 msecs
Exchange timed out
f_ctl 90000 seq 1
Exchange timer armed : 20000 msecs
Remove port
Port sending LOGO from Ready state
and finally
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
f_ctl 90000 seq 1
Received a LOGO response closed
Exchange timer canceled
work delete
blocked FC remote port time out: removing target and saving binding
The messages on the target side are similar to before.
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
Exchange timer armed : 0 msecs
f_ctl 800000 seq 1
Exchange timed out
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
f_ctl 880008 seq 2
f_ctl 800000 seq 1
Exchange timer armed : 0 msecs
f_ctl 800008 seq 2
Failed to send frame ffff88062cbd3a00, xid <0x327>, remaining 131072,
lso_max <0x10000>
Failed to send frame ffff88062cbd3a00, xid <0x327>, remaining 65536,
lso_max <0x10000>
Failed to send frame ffff88062cbd3a00, xid <0x327>, remaining 0,
lso_max <0x10000>
After stripping out the "host7: xid xxx: f_ctl 8x0000 seq x" lines, the remaining messages are:
15476415:May 27 10:42:37 poc2 kernel: [ 1811.571399] perf samples too
long (2670 > 2500), lowering kernel.perf_event_max_sample_rate to
50000
Exchange timer armed : 0 msecs
Exchange timed out
[... the above two-line pattern repeats 50 times in total ...]
Exchange timer armed : 0 msecs
Failed to send frame ffff88062cbd3a00, xid <0x327>, remaining 131072,
lso_max <0x10000>
Failed to send frame ffff88062cbd3a00, xid <0x327>, remaining 65536,
lso_max <0x10000>
Failed to send frame ffff88062cbd3a00, xid <0x327>, remaining 0,
lso_max <0x10000>
Exchange timed out
Exchange timer armed : 2000 msecs
Exchange timed out
f_ctl 90000 seq 1
Exchange timer armed : 20000 msecs
Thanks
Post by Vasu Dev
Post by Jun Wu
Post by Jun Wu
18630751 May 21 11:01:53 poc2 kernel: [ 1528.191740] host7: xid
Exchange timer armed : 0 msecs
18630752 May 21 11:01:53 poc2 kernel: [ 1528.191747] host7: xid
f_ctl 800000 seq 1
18630753 May 21 11:01:53 poc2 kernel: [ 1528.191756] host7: xid
f_ctl 800000 seq 2
18630754 May 21 11:01:53 poc2 kernel: [ 1528.191763] host7: xid
Exchange timed out
18630755 May 21 11:01:53 poc2 kernel: [ 1528.191777]
Failed to send frame ffff8805bf245e00, xid <0x6e4>, remaining
458752,
lso_max <0x10000>
Exchange 0x6e4 is aborted and then the target is still sending frames; the
latter should not occur, but first setting up the abort with a 0 msec
timeout does not look correct either, and it differs from the 8000 ms on
the initiator side. Are you using the same switch between initiator and
target, with DCB enabled?
Reducing retries could help narrow down whether early aborts are the cause
here; can you try with REC disabled on the initiator side, using this
change?
diff --git a/drivers/scsi/libfc/fc_fcp.c b/drivers/scsi/libfc/fc_fcp.c
index 1d7e76e..43fe793 100644
--- a/drivers/scsi/libfc/fc_fcp.c
+++ b/drivers/scsi/libfc/fc_fcp.c
@@ -1160,6 +1160,7 @@ static int fc_fcp_cmd_send(struct fc_lport *lport,
struct fc_fcp_pkt *fsp,
fc_fcp_pkt_hold(fsp); /* hold for fc_fcp_pkt_destroy */
setup_timer(&fsp->timer, fc_fcp_timeout, (unsigned long)fsp);
+ rpriv->flags &= ~FC_RP_FLAGS_REC_SUPPORTED;
if (rpriv->flags & FC_RP_FLAGS_REC_SUPPORTED)
fc_fcp_timer_set(fsp, get_fsp_rec_tov(fsp));
Jun Wu
2014-06-04 18:45:09 UTC
Permalink
The test setup includes one host and one target. The target exposes 10
hard drives (or 10 LUNs) on one fcoe port. The single initiator runs
10 fio processes simultaneously to the 10 target drives through fcoe
vn2vn. This is a simple configuration that other people may also want
to try.
Post by Vasu Dev
Exchange 0x6e4 is aborted and then the target is still sending frames; the
latter should not occur, but first setting up the abort with a 0 msec
timeout does not look correct either, and it differs from the 8000 ms on
the initiator side.
Should the target stop sending frames after an abort? I still see a lot of
0 msec messages on the target side. Is this something that should be
addressed?
Post by Vasu Dev
Reducing retries could help narrow down whether early aborts are the cause
here; can you try with REC disabled on the initiator side, using this
change?
By disabling REC, have you confirmed that the early aborts are the cause?
Is the abort caused by the 0 msec timeout?

Thanks,

Jun
Post by Vasu Dev
Post by Jun Wu
Hi Vasu,
Is there anything else we can do to collect more information or adjust
the timeout value?
Increased timeouts will help reduce the seq_send failure prints, and that
must be what happened with the REC-disabled patch, since it skipped REC
completely and therefore defaulted to the longer SCSI retry timeouts.
In any case, the seq_send failure prints are just noise here, since
seq_send fails only for aborted exchanges, and that is very likely with
tcm_fc abort handling, especially in your oversubscribed target setup
with one target taking workloads from 10 hosts. I mean seq_send fails
only due to the following aborted-exchange check, since thereafter
fcoe_xmit always returns success.
	if (ep->esb_stat & (ESB_ST_COMPLETE | ESB_ST_ABNORMAL)) {
		fc_frame_free(fp);
		goto out;
	}
So I think it is okay to get rid of the send failure prints, the same as
tcm_fc does at its other seq_send call sites, with this patch:
diff --git a/drivers/target/tcm_fc/tfc_io.c b/drivers/target/tcm_fc/tfc_io.c
index a48b96d..0425fc0 100644
--- a/drivers/target/tcm_fc/tfc_io.c
+++ b/drivers/target/tcm_fc/tfc_io.c
@@ -200,15 +200,7 @@ int ft_queue_data_in(struct se_cmd *se_cmd)
 			f_ctl |= FC_FC_END_SEQ;
 		fc_fill_fc_hdr(fp, FC_RCTL_DD_SOL_DATA, ep->did, ep->sid,
 			       FC_TYPE_FCP, f_ctl, fh_off);
-		error = lport->tt.seq_send(lport, seq, fp);
-		if (error) {
-			/* XXX For now, initiator will retry */
-			pr_err_ratelimited("%s: Failed to send frame %p, "
-					   "xid <0x%x>, remaining %zu, "
-					   "lso_max <0x%x>\n",
-					   __func__, fp, ep->xid,
-					   remaining, lport->lso_max);
-		}
+		lport->tt.seq_send(lport, seq, fp);
 	}
 	return ft_queue_status(se_cmd);
 }
Post by Jun Wu
Thanks,
Jun
Post by Jun Wu
We do not have DCB PFC, so the traffic is managed by PAUSE frames by
default. The kernel has been recompiled to include your change. I ran the
same test after the kernel change, i.e. 10 fio sessions from the initiator
to 10 drives on the target. The target machine hung after a few minutes.
The target could either be genuinely overloaded by your workloads in this
10-hosts-to-single-target setup, or it could really be stuck. For the
latter case, enabling hung task detection (under Kernel Hacking) would
help, as that dumps the stack trace of the hung task. If it is the former
case, then I don't know how the target back end handles sustained
oversubscription of target bandwidth; maybe Nick could shed some light on
that, and possibly tcm_fc is missing related feedback into the target core.
If your workloads only intermittently exceed capacity, then adjusting the
timeouts as you mentioned above, or increasing resources throughout the
Tx/Rx paths, could help absorb such intermittent incidents and in turn
avoid host IO timeouts/failures.
//Vasu
Nicholas A. Bellinger
2014-06-04 22:21:43 UTC
Permalink
Post by Jun Wu
The test setup includes one host and one target. The target exposes 10
hard drives (or 10 LUNs) on one fcoe port. The single initiator runs
10 fio processes simultaneously to the 10 target drives through fcoe
vn2vn. This is a simple configuration that other people may also want
to try.
Post by Vasu Dev
Exchange 0x6e4 is aborted and then the target is still sending frames; the
latter should not occur, but first setting up the abort with a 0 msec
timeout does not look correct either, and it differs from the 8000 ms on
the initiator side.
Should the target stop sending frames after an abort? I still see a lot of
0 msec messages on the target side. Is this something that should be
addressed?
Post by Vasu Dev
Reducing retries could help narrow down whether early aborts are the cause
here; can you try with REC disabled on the initiator side, using this
change?
By disabling REC, have you confirmed that the early aborts are the cause?
Is the abort caused by the 0 msec timeout?
The 0 msec timeout still looks really suspicious..

IIRC, these timeout values are exchanged in the FLOGI request packet,
and/or in a separate Read Timeout Value (RTV) ELS..

It might be worthwhile to track down where these zero timeout values
are coming from, as it might be an indication of what's wrong.

How about the following patch to dump these values..?

Also just curious, have you tried running these two hosts in
point-to-point mode without the switch to see if the same types of
issues occur..? It might be useful to help isolate the problem space a
bit.

Vasu, any other ideas here..?

--nab

diff --git a/drivers/scsi/libfc/fc_lport.c b/drivers/scsi/libfc/fc_lport.c
index e01a298..72b8676 100644
--- a/drivers/scsi/libfc/fc_lport.c
+++ b/drivers/scsi/libfc/fc_lport.c
@@ -379,6 +379,7 @@ static void fc_lport_flogi_fill(struct fc_lport *lport,
sp->sp_tot_seq = htons(255); /* seq. we accept */
sp->sp_rel_off = htons(0x1f);
sp->sp_e_d_tov = htonl(lport->e_d_tov);
+ printk("fc_lport_flogi_fill lport->e_d_tov: %u\n", lport->e_d_tov);

cp->cp_rdfs = htons((u16) lport->mfs);
cp->cp_con_seq = htons(255);
@@ -1766,7 +1767,9 @@ void fc_lport_flogi_resp(struct fc_seq *sp, struct fc_frame *fp,

csp_flags = ntohs(flp->fl_csp.sp_features);
r_a_tov = ntohl(flp->fl_csp.sp_r_a_tov);
+ printk("fc_lport_flogi_resp: r_a_tov: %u\n", r_a_tov);
e_d_tov = ntohl(flp->fl_csp.sp_e_d_tov);
+ printk("fc_lport_flogi_resp: e_d_tov %u\n", e_d_tov);
if (csp_flags & FC_SP_FT_EDTR)
e_d_tov /= 1000000;

@@ -1795,6 +1798,9 @@ void fc_lport_flogi_resp(struct fc_seq *sp, struct fc_frame *fp,
fc_lport_enter_dns(lport);
}

+ printk("fc_lport_flogi_resp: lport->e_d_tov: %u\n", lport->e_d_tov);
+ printk("fc_lport_flogi_resp: lport->r_a_tov: %u\n", lport->r_a_tov);
+
out:
fc_frame_free(fp);
err:
diff --git a/drivers/scsi/libfc/fc_rport.c b/drivers/scsi/libfc/fc_rport.c
index 589ff9a..4cdb055 100644
--- a/drivers/scsi/libfc/fc_rport.c
+++ b/drivers/scsi/libfc/fc_rport.c
@@ -142,7 +142,9 @@ static struct fc_rport_priv *fc_rport_create(struct fc_lport *lport,
rdata->event = RPORT_EV_NONE;
rdata->flags = FC_RP_FLAGS_REC_SUPPORTED;
rdata->e_d_tov = lport->e_d_tov;
+ printk("fc_rport_create: rdata->e_d_tov: %u\n", rdata->e_d_tov);
rdata->r_a_tov = lport->r_a_tov;
+ printk("fc_rport_create: rdata->r_a_tov: %u\n", rdata->r_a_tov);
rdata->maxframe_size = FC_MIN_MAX_PAYLOAD;
INIT_DELAYED_WORK(&rdata->retry_work, fc_rport_timeout);
INIT_WORK(&rdata->event_work, fc_rport_work);
@@ -286,7 +288,9 @@ static void fc_rport_work(struct work_struct *work)
rpriv->rp_state = rdata->rp_state;
rpriv->flags = rdata->flags;
rpriv->e_d_tov = rdata->e_d_tov;
+ printk("fc_rport_work: rpriv->e_d_tov: %u\n", rpriv->e_d_tov);
rpriv->r_a_tov = rdata->r_a_tov;
+ printk("rpriv->r_a_tov: rpriv->r_a_tov: %u\n", rpriv->r_a_tov);
mutex_unlock(&rdata->rp_mutex);

if (rport_ops && rport_ops->event_callback) {
@@ -638,10 +642,14 @@ static int fc_rport_login_complete(struct fc_rport_priv *rdata,
* E_D_TOV is not valid on an incoming FLOGI request.
*/
e_d_tov = ntohl(flogi->fl_csp.sp_e_d_tov);
+ printk("fc_rport_login_complete e_d_tov: %u\n", e_d_tov);
if (csp_flags & FC_SP_FT_EDTR)
e_d_tov /= 1000000;
- if (e_d_tov > rdata->e_d_tov)
+ if (e_d_tov > rdata->e_d_tov) {
+ printk("fc_rport_login_complete rdata->e_d_tov %u\n",
+ rdata->e_d_tov);
rdata->e_d_tov = e_d_tov;
+ }
}
rdata->maxframe_size = fc_plogi_get_maxframe(flogi, lport->mfs);
return 0;
@@ -690,8 +698,11 @@ static void fc_rport_flogi_resp(struct fc_seq *sp, struct fc_frame *fp,
if (!flogi)
goto bad;
r_a_tov = ntohl(flogi->fl_csp.sp_r_a_tov);
- if (r_a_tov > rdata->r_a_tov)
+ printk("fc_rport_flogi_resp r_a_tov: %u\n", r_a_tov);
+ if (r_a_tov > rdata->r_a_tov) {
+ printk("fc_rport_flogi_resp rdata->r_a_tov: %u\n", rdata->r_a_tov);
rdata->r_a_tov = r_a_tov;
+ }

if (rdata->ids.port_name < lport->wwpn)
fc_rport_enter_plogi(rdata);
@@ -971,6 +982,7 @@ static void fc_rport_enter_plogi(struct fc_rport_priv *rdata)
return;
}
rdata->e_d_tov = lport->e_d_tov;
+ printk("fc_rport_enter_plogi: rdata->e_d_tov: %u\n", rdata->e_d_tov);

if (!lport->tt.elsct_send(lport, rdata->ids.port_id, fp, ELS_PLOGI,
fc_rport_plogi_resp, rdata,
@@ -1183,12 +1195,16 @@ static void fc_rport_rtv_resp(struct fc_seq *sp, struct fc_frame *fp,
if (tov == 0)
tov = 1;
rdata->r_a_tov = tov;
+ printk("fc_rport_rtv_resp rdata->r_a_tov: %u\n",
+ rdata->r_a_tov);
tov = ntohl(rtv->rtv_e_d_tov);
if (toq & FC_ELS_RTV_EDRES)
tov /= 1000000;
if (tov == 0)
tov = 1;
rdata->e_d_tov = tov;
+ printk("fc_rport_rtv_resp rdata->e_d_tov: %u\n",
+ rdata->e_d_tov);
}
}
Vasu Dev
2014-06-04 22:01:08 UTC
Permalink
Post by Nicholas A. Bellinger
Post by Jun Wu
The test setup includes one host and one target. The target exposes 10
hard drives (or 10 LUNs) on one fcoe port. The single initiator runs
10 fio processes simultaneously to the 10 target drives through fcoe
vn2vn. This is a simple configuration that other people may also want
to try.
Post by Vasu Dev
Exchange 0x6e4 is aborted and then target still sending frame, while
later should not occur but first setting up abort with 0 msec timeout
doesn not look correct either and it is different that 8000 ms on
initiator side.
Should target stop sending frame after abort? I still see a lot of 0
msec messages on target side. Is this something that should be
addressed?
Post by Vasu Dev
Reducing retries could narrow down to early aborts is the cause here,
can you try with REC disabled on initiator side for that using this
change ?
By disabling REC have you confirmed that the early aborts is the
cause? Is the abort caused by 0 msec timeout?
The 0 msec timeout still look really suspicious..
IIRC, these timeout values are exchanged in the FLOGI request packet,
and/or in a separate Request Timeout Value (RTV) packet..
It might be worthwhile to track down where these zero-length settings
are coming from, as it might be a indication of what's wrong.
How about the following patch to dump these values..?
Also just curious, have you tried running these two hosts in
point-to-point mode without the switch to see if the same types of
issues occur..? It might be useful to help isolate the problem space a
bit.
Vasu, any other ideas here..?
Your patch is good for debugging the 0 msec value; however, this may
not be the issue, since these come from incoming abort processing, and
by then the IO is already aborted and would cause seq_send failures, as
I explained in my other response.

Nab, shall tcm_fc report seq_send failures to the target core, which
could help slow down the host request rate above the fcoe transport?

Thanks,
Vasu
Post by Nicholas A. Bellinger
--nab
<snip debug patch>
Jun Wu
2014-06-05 00:22:23 UTC
Permalink
Is there a design limit for the number of target drives that we should
not cross? Is 10 a reasonable number? We did notice in our testing that
a lower number of targets has fewer problems.

Are there any additional tests that we can do to narrow down the
problem? For example, trying different IO types: random vs sequential,
read vs write. Would that help?
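One low-effort way to cover those variations is to sweep them with fio from the command line; a sketch, with the device path (/dev/sdX) as a placeholder and fio_cmd as an illustrative helper name, not an established tool:

```shell
# fio_cmd RW DEPTH DEV: print the fio invocation for one IO pattern / queue depth.
fio_cmd() {
  printf 'fio --name=probe-%s-qd%s --filename=%s --rw=%s --bs=64k --iodepth=%s --direct=1 --runtime=60 --time_based\n' \
    "$1" "$2" "$3" "$1" "$2"
}

# Print the whole sweep; pipe the output to sh (as root) to actually run it.
for rw in read write randread randwrite; do
  for depth in 1 4 8; do
    fio_cmd "$rw" "$depth" /dev/sdX
  done
done
```

Printing the commands first keeps the sketch safe to inspect before running anything against the drives.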

Nab,
We cannot change the connection between the servers. They are bare
metal cloud servers that we don't have direct access to.

Thanks,

Jun
Post by Vasu Dev
<snip>
Post by Nicholas A. Bellinger
--nab
<snip debug patch>
--
To unsubscribe from this list: send the line "unsubscribe target-devel" in
More majordomo info at http://vger.kernel.org/majordomo-info.html
Nicholas A. Bellinger
2014-06-05 22:43:48 UTC
Permalink
Post by Jun Wu
Is there design limit for the number of target drives that we should
not cross? Is 10 a reasonable number? We did notice that lower number
of target has less problems from our testing.
It completely depends on the fabric, initiator, and backend storage.

For example, some initiators (like ESX iSCSI on 1 Gb/sec ethernet) have
a problem with more than a handful of LUNs per session, that can result
in false positive timeouts on the initiator side under heavy I/O loads
due to fairness issues + I/Os not being sent out of the initiator fast
enough.

Other initiators like Qlogic FC are able to run 256 LUNs in a single
session + endpoint without issues.
Post by Jun Wu
Are there any additional tests that we can do to narrow down the
problem? For example try different IO types, random vs sequential,
read vs write. Would that help?
If the issue is really related to the number of outstanding I/Os on your
network, one easy thing to do is reduce the default queue_depth=3 to
queue_depth=1 for each LUN on the initiator side, and see if that has
any effect.

I don't recall where these values are in /sys for fcoe, but they are
easy to find using 'find /sys -name queue_depth'. Go ahead and set each
of these for your fcoe initiator's LUNs to queue_depth=1 and retest.

Also note that these values are not persistent across restart.
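For bulk changes, a small helper can write the value to every queue_depth attribute it finds; a sketch (set_qd is an illustrative name, writing the files needs root, and as noted the setting reverts on reboot):

```shell
# set_qd ROOT DEPTH: write DEPTH to every queue_depth attribute under ROOT.
# Typical use on the initiator (as root): set_qd /sys/block 1
set_qd() {
  find "$1" -name queue_depth 2>/dev/null | while read -r f; do
    echo "$2" > "$f"
  done
}
```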
Post by Jun Wu
Nab,
We cannot change the connection between the servers. They are bare
metal cloud servers that we don't have direct access.
That's a shame, as it would certainly help isolate individual networking
components.

--nab
Vasu Dev
2014-06-06 20:28:29 UTC
Permalink
Post by Nicholas A. Bellinger
Post by Jun Wu
Is there design limit for the number of target drives that we should
not cross? Is 10 a reasonable number? We did notice that lower number
of target has less problems from our testing.
It completely depends on the fabric, initiator, and backend storage.
For example, some initiators (like ESX iSCSI on 1 Gb/sec ethernet) have
a problem with more than a handful of LUNs per session, that can result
in false positive timeouts on the initiator side under heavy I/O loads
due to fairness issues + I/Os not being sent out of the initiator fast
enough.
Other initiators like Qlogic FC are able to run 256 LUNs in a single
session + endpoint without issues.
Post by Jun Wu
Are there any additional tests that we can do to narrow down the
problem? For example try different IO types, random vs sequential,
read vs write. Would that help?
If the issue is really related to the number of outstanding I/Os on your
network, one easy thing to do is reduce the default queue_depth=3 to
queue_depth=1 for each LUN on the initiator side, and see if that has
any effect.
Good idea, and possibly that is the cause, and not something else such
as the switch being a factor without the DCB and PFC PAUSE typically
used and required by fcoe.
Post by Nicholas A. Bellinger
I don't recall where these values are in /sys for fcoe, but are easy to
It is at /sys/block/sdX/device/queue_depth; it is transport agnostic
and therefore at the same location for fcoe as for any other disks.
Post by Nicholas A. Bellinger
find using 'find /sys -name queue_depth'. Go ahead and set each of
these for your fcoe initiators LUNs to queue_depth=1 + retest.
Also note that these values are not persistent across restart.
Post by Jun Wu
Nab,
We cannot change the connection between the servers. They are bare
metal cloud servers that we don't have direct access.
That's a shame, as it would certainly help isolate individual networking
components.
Yeah, that would have helped.

//Vasu
Post by Nicholas A. Bellinger
--nab
Jun Wu
2014-06-06 21:45:43 UTC
Permalink
The queue_depth is 32 by default, so all my previous tests were run
with queue_depth 32.

The results of my tests today confirm that with a higher queue_depth,
there are more aborts on the initiator side and corresponding
"Exchange timer armed : 0 msecs" messages on the target side.
I tried queue_depth = 1, 2, 3, and 6. For 1 and 2, there are no aborts
or abnormal messages. For 3, there are 15 instances of aborts and
15 instances of "0 msecs" messages. When queue_depth is increased to 6,
there are 81 instances of the aborts and 81 of the "0 msecs" messages
for the same fio test.

There seems to be a one-to-one correspondence between the abort
messages on the initiator side and the "0 msecs" messages on the
target side. The xids are the same:

On the target side:
[***@poc2 log]# grep "0 msec" messages
Jun 6 11:56:05 poc2 kernel: [ 4436.866024] host7: xid 88b: Exchange
timer armed : 0 msecs
Jun 6 11:56:05 poc2 kernel: [ 4436.866263] host7: xid 905: Exchange
timer armed : 0 msecs
Jun 6 11:56:05 poc2 kernel: [ 4436.866563] host7: xid aa4: Exchange
timer armed : 0 msecs
Jun 6 11:56:05 poc2 kernel: [ 4436.866740] host7: xid be1: Exchange
timer armed : 0 msecs
Jun 6 11:56:05 poc2 kernel: [ 4436.866972] host7: xid 925: Exchange
timer armed : 0 msecs
Jun 6 11:56:05 poc2 kernel: [ 4436.867137] host7: xid 627: Exchange
timer armed : 0 msecs
Jun 6 11:57:27 poc2 kernel: [ 4518.995622] host7: xid 926: Exchange
timer armed : 0 msecs
Jun 6 11:57:27 poc2 kernel: [ 4518.995919] host7: xid be8: Exchange
timer armed : 0 msecs
Jun 6 11:57:27 poc2 kernel: [ 4518.996231] host7: xid 229: Exchange
timer armed : 0 msecs
Jun 6 11:57:27 poc2 kernel: [ 4518.996500] host7: xid 4cc: Exchange
timer armed : 0 msecs
Jun 6 11:57:27 poc2 kernel: [ 4518.996763] host7: xid a2a: Exchange
timer armed : 0 msecs
Jun 6 12:14:54 poc2 kernel: [ 5566.776388] host7: xid 34e: Exchange
timer armed : 0 msecs
Jun 6 12:14:54 poc2 kernel: [ 5566.776657] host7: xid 9e2: Exchange
timer armed : 0 msecs
Jun 6 12:14:54 poc2 kernel: [ 5566.776950] host7: xid 964: Exchange
timer armed : 0 msecs
Jun 6 12:14:54 poc2 kernel: [ 5566.777172] host7: xid a8a: Exchange
timer armed : 0 msecs

On the initiator side:
8890680:Jun 6 11:56:05 poc1 kernel: [ 4899.764511] host7: xid 745:
f_ctl 90008 seq 2
8890681:Jun 6 11:56:05 poc1 kernel: [ 4899.764711] host7: xid 745:
exch: BLS rctl 84 - BLS accept
8890682:Jun 6 11:56:05 poc1 kernel: [ 4899.764753] host7: fcp:
00061e: xid 0745-088b: target abort cmd passed
8890683:Jun 6 11:56:05 poc1 kernel: [ 4899.764758] host7: fcp:
00061e: Returning DID_ERROR to scsi-ml due to FC_CMD_ABORTED
8890684:Jun 6 11:56:05 poc1 kernel: [ 4899.764762] host7: xid ae1:
f_ctl 90008 seq 2
8890685:Jun 6 11:56:05 poc1 kernel: [ 4899.764927] host7: xid ae1:
exch: BLS rctl 84 - BLS accept
8890686:Jun 6 11:56:05 poc1 kernel: [ 4899.764985] host7: fcp:
00061e: xid 0ae1-0905: target abort cmd passed
8890687:Jun 6 11:56:05 poc1 kernel: [ 4899.764990] host7: fcp:
00061e: Returning DID_ERROR to scsi-ml due to FC_CMD_ABORTED
8890688:Jun 6 11:56:05 poc1 kernel: [ 4899.764994] host7: xid 5a0:
f_ctl 90008 seq 2
8890689:Jun 6 11:56:05 poc1 kernel: [ 4899.765161] host7: xid 5a0:
exch: BLS rctl 84 - BLS accept
8890690:Jun 6 11:56:05 poc1 kernel: [ 4899.765210] host7: fcp:
00061e: xid 05a0-0aa4: target abort cmd passed
8890691:Jun 6 11:56:05 poc1 kernel: [ 4899.765215] host7: fcp:
00061e: Returning DID_ERROR to scsi-ml due to FC_CMD_ABORTED
8890692:Jun 6 11:56:05 poc1 kernel: [ 4899.765219] host7: xid e4c:
f_ctl 90008 seq 2
8890693:Jun 6 11:56:05 poc1 kernel: [ 4899.765424] host7: xid e4c:
exch: BLS rctl 84 - BLS accept
8890694:Jun 6 11:56:05 poc1 kernel: [ 4899.765464] host7: fcp:
00061e: xid 0e4c-0be1: target abort cmd passed
8890695:Jun 6 11:56:05 poc1 kernel: [ 4899.765469] host7: fcp:
00061e: Returning DID_ERROR to scsi-ml due to FC_CMD_ABORTED
8890696:Jun 6 11:56:05 poc1 kernel: [ 4899.765473] host7: xid 967:
f_ctl 90008 seq 2
8890697:Jun 6 11:56:05 poc1 kernel: [ 4899.765568] host7: xid 967:
exch: BLS rctl 84 - BLS accept
8890698:Jun 6 11:56:05 poc1 kernel: [ 4899.765618] host7: fcp:
00061e: xid 0967-0925: target abort cmd passed
8890699:Jun 6 11:56:05 poc1 kernel: [ 4899.765624] host7: fcp:
00061e: Returning DID_ERROR to scsi-ml due to FC_CMD_ABORTED
8890700:Jun 6 11:56:05 poc1 kernel: [ 4899.765628] host7: xid 706:
f_ctl 90008 seq 2
8890701:Jun 6 11:56:05 poc1 kernel: [ 4899.765730] host7: xid 706:
exch: BLS rctl 84 - BLS accept
8890702:Jun 6 11:56:05 poc1 kernel: [ 4899.765782] host7: fcp:
00061e: xid 0706-0627: target abort cmd passed
8890703:Jun 6 11:56:05 poc1 kernel: [ 4899.765787] host7: fcp:
00061e: Returning DID_ERROR to scsi-ml due to FC_CMD_ABORTED
9278978:Jun 6 11:57:27 poc1 kernel: [ 4981.894046] host7: xid c26:
f_ctl 90008 seq 2
9278979:Jun 6 11:57:27 poc1 kernel: [ 4981.894288] host7: xid c26:
exch: BLS rctl 84 - BLS accept
9278980:Jun 6 11:57:27 poc1 kernel: [ 4981.894335] host7: fcp:
00061e: xid 0c26-0926: target abort cmd passed
9278981:Jun 6 11:57:27 poc1 kernel: [ 4981.894341] host7: fcp:
00061e: Returning DID_ERROR to scsi-ml due to FC_CMD_ABORTED
9278982:Jun 6 11:57:27 poc1 kernel: [ 4981.894346] host7: xid b84:
f_ctl 90008 seq 2
9278983:Jun 6 11:57:27 poc1 kernel: [ 4981.894571] host7: xid b84:
exch: BLS rctl 84 - BLS accept
9278984:Jun 6 11:57:27 poc1 kernel: [ 4981.894614] host7: fcp:
00061e: xid 0b84-0be8: target abort cmd passed
9278985:Jun 6 11:57:27 poc1 kernel: [ 4981.894618] host7: fcp:
00061e: Returning DID_ERROR to scsi-ml due to FC_CMD_ABORTED
9278986:Jun 6 11:57:27 poc1 kernel: [ 4981.894622] host7: xid 9ad:
f_ctl 90008 seq 2
9278987:Jun 6 11:57:27 poc1 kernel: [ 4981.894875] host7: xid 9ad:
exch: BLS rctl 84 - BLS accept
9278988:Jun 6 11:57:27 poc1 kernel: [ 4981.894920] host7: fcp:
00061e: xid 09ad-0229: target abort cmd passed
9278989:Jun 6 11:57:27 poc1 kernel: [ 4981.894925] host7: fcp:
00061e: Returning DID_ERROR to scsi-ml due to FC_CMD_ABORTED
9278990:Jun 6 11:57:27 poc1 kernel: [ 4981.894928] host7: xid ae0:
f_ctl 90008 seq 2
9278991:Jun 6 11:57:27 poc1 kernel: [ 4981.895103] host7: xid ae0:
exch: BLS rctl 84 - BLS accept
9278992:Jun 6 11:57:27 poc1 kernel: [ 4981.895152] host7: fcp:
00061e: xid 0ae0-04cc: target abort cmd passed
9278993:Jun 6 11:57:27 poc1 kernel: [ 4981.895158] host7: fcp:
00061e: Returning DID_ERROR to scsi-ml due to FC_CMD_ABORTED
9278994:Jun 6 11:57:27 poc1 kernel: [ 4981.895163] host7: xid d41:
f_ctl 90008 seq 2
9278995:Jun 6 11:57:27 poc1 kernel: [ 4981.895382] host7: xid d41:
exch: BLS rctl 84 - BLS accept
9278996:Jun 6 11:57:27 poc1 kernel: [ 4981.895432] host7: fcp:
00061e: xid 0d41-0a2a: target abort cmd passed
9278997:Jun 6 11:57:27 poc1 kernel: [ 4981.895438] host7: fcp:
00061e: Returning DID_ERROR to scsi-ml due to FC_CMD_ABORTED
38409501:Jun 6 12:14:54 poc1 kernel: [ 6029.674694] host7: xid ee2:
f_ctl 90008 seq 2
38409502:Jun 6 12:14:54 poc1 kernel: [ 6029.674956] host7: xid ee2:
exch: BLS rctl 84 - BLS accept
38409503:Jun 6 12:14:54 poc1 kernel: [ 6029.674999] host7: fcp:
00061e: xid 0ee2-034e: target abort cmd passed
38409504:Jun 6 12:14:54 poc1 kernel: [ 6029.675005] host7: fcp:
00061e: Returning DID_ERROR to scsi-ml due to FC_CMD_ABORTED
38409505:Jun 6 12:14:54 poc1 kernel: [ 6029.675009] host7: xid 221:
f_ctl 90008 seq 2
38409506:Jun 6 12:14:54 poc1 kernel: [ 6029.675221] host7: xid 221:
exch: BLS rctl 84 - BLS accept
38409507:Jun 6 12:14:54 poc1 kernel: [ 6029.675268] host7: fcp:
00061e: xid 0221-09e2: target abort cmd passed
38409508:Jun 6 12:14:54 poc1 kernel: [ 6029.675272] host7: fcp:
00061e: Returning DID_ERROR to scsi-ml due to FC_CMD_ABORTED
38409509:Jun 6 12:14:54 poc1 kernel: [ 6029.675276] host7: xid ecd:
f_ctl 90008 seq 2
38409510:Jun 6 12:14:54 poc1 kernel: [ 6029.675523] host7: xid ecd:
exch: BLS rctl 84 - BLS accept
38409511:Jun 6 12:14:54 poc1 kernel: [ 6029.675538] host7: fcp:
00061e: xid 0ecd-0964: target abort cmd passed
38409512:Jun 6 12:14:54 poc1 kernel: [ 6029.675542] host7: fcp:
00061e: Returning DID_ERROR to scsi-ml due to FC_CMD_ABORTED
38409513:Jun 6 12:14:54 poc1 kernel: [ 6029.675546] host7: xid f02:
f_ctl 90008 seq 2
38409514:Jun 6 12:14:54 poc1 kernel: [ 6029.675639] host7: xid f02:
exch: BLS rctl 84 - BLS accept
38409515:Jun 6 12:14:54 poc1 kernel: [ 6029.675691] host7: fcp:
00061e: xid 0f02-0a8a: target abort cmd passed
38409516:Jun 6 12:14:54 poc1 kernel: [ 6029.675697] host7: fcp:
00061e: Returning DID_ERROR to scsi-ml due to FC_CMD_ABORTED
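The one-to-one correspondence can also be checked mechanically by extracting the target-side xid from each log and intersecting the two sets; a sketch, assuming the message formats quoted above (match_xids is an illustrative name, and the log paths are placeholders):

```shell
# match_xids TARGET_LOG INITIATOR_LOG: print xids seen aborted on both sides.
match_xids() {
  # Target side: "... xid 88b: Exchange timer armed : 0 msecs"
  grep 'Exchange timer armed : 0 msecs' "$1" \
    | sed -n 's/.*xid \([0-9a-f]*\):.*/\1/p' | sort -u > /tmp/t_xids.$$
  # Initiator side: "... xid 0745-088b: target abort cmd passed";
  # the second xid is the target's, so strip the leading zeros to match.
  grep 'target abort cmd passed' "$2" \
    | sed -n 's/.*xid [0-9a-f]*-0*\([0-9a-f]*\):.*/\1/p' | sort -u > /tmp/i_xids.$$
  comm -12 /tmp/t_xids.$$ /tmp/i_xids.$$
  rm -f /tmp/t_xids.$$ /tmp/i_xids.$$
}
# Usage: match_xids /var/log/target-messages /var/log/initiator-messages
```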

Nab,

We have included your changes that print the timeout values:
On the target side:
40969:Jun 6 13:05:15 poc2 kernel: [ 134.665057] fc_rport_create:
rdata->e_d_tov: 2000
40970:Jun 6 13:05:15 poc2 kernel: [ 134.665060] fc_rport_create:
rdata->r_a_tov: 4000
40971:Jun 6 13:05:15 poc2 kernel: [ 135.355969]
fc_rport_login_complete e_d_tov: 0
40972:Jun 6 13:05:15 poc2 kernel: [ 135.355972] fc_rport_flogi_resp r_a_tov: 0
40973:Jun 6 13:05:15 poc2 kernel: [ 135.355976]
fc_rport_enter_plogi: rdata->e_d_tov: 2000
40974:Jun 6 13:05:15 poc2 kernel: [ 135.356206]
fc_rport_login_complete e_d_tov: 2000
40975:Jun 6 13:05:15 poc2 kernel: [ 135.356973] fc_rport_work:
rpriv->e_d_tov: 2000
40976:Jun 6 13:05:15 poc2 kernel: [ 135.356977] rpriv->r_a_tov:
rpriv->r_a_tov: 4000

On the initiator side:
950:Jun 6 13:05:15 poc1 kernel: [ 574.652695] fc_rport_create:
rdata->e_d_tov: 2000
951:Jun 6 13:05:15 poc1 kernel: [ 574.652698] fc_rport_create:
rdata->r_a_tov: 4000
952:Jun 6 13:05:15 poc1 kernel: [ 575.344306] fc_rport_work:
rpriv->e_d_tov: 2000
953:Jun 6 13:05:15 poc1 kernel: [ 575.344310] rpriv->r_a_tov:
rpriv->r_a_tov: 4000
1024:Jun 6 13:05:23 poc1 kernel: [ 582.723206] fc_rport_work:
rpriv->e_d_tov: 2000
1025:Jun 6 13:05:23 poc1 kernel: [ 582.723211] rpriv->r_a_tov:
rpriv->r_a_tov: 4000

When the abort occurs, the "iostat -x 1" command on the target side
starts to show 0 iops on some of the drives. Sometimes all 10 target
drives have 0 iops.

What is a reasonable number to assume for the maximum number of
outstanding I/Os?

Thanks,

Jun
Post by Vasu Dev
<snip>
Rustad, Mark D
2014-06-09 17:00:53 UTC
Permalink
Post by Jun Wu
The queue_depth is 32 by default. So all my previous tests were based
on 32 queue_depth.
The result of my tests today confirms that with higher queue_depth,
there are more aborts on the initiator side and corresponding
"Exchange timer armed : 0 msecs" messages on the target side.
I tried queue_depth = 1, 2, 3, and 6. For 1 and 2, there is no abort
or any abnormal messages. For 3, there are 15 instances of aborts and
15 instances of "0 msecs" messages. When queue_depth increased to 6,
there are 81 instances of the aborts and 81 instances of "0 msecs"
messages for the same fio test.
<snip>
Post by Jun Wu
When the abort occurs, "iostat -x 1" command on the target side starts
to show 0 iops on some of the drives. Sometimes all the 10 target
drives have 0 iops.
What is the reasonable number to assume for the maximum number of
outstanding IO?
You mentioned previously that fewer drives also worked better. The combination of workarounds, reduced queue depth or fewer drives, makes me wonder if there is a power issue in the system housing the drives. It is surprisingly common for a fully-populated system to not really have adequate power for fully active drives. Under those conditions, the drives can have a variety of symptoms ranging from pausing I/O to resetting. When that happens, the target will definitely be overloaded, things will time out and aborts will happen. If things work well with one or two fewer drives with full queue depth, I would really suspect a power, or possibly cooling issue rather than a software issue.

It is even possible that there is a vibration issue in the chassis that results in the drives interfering with one another and causing error recovery related I/O interruptions that could also result in timeouts and exchange aborts.

Nab, earlier you mentioned generating a busy response. I think that is a good thing to be able to do. It is a little complicated because different initiators have different timeouts. Ideally, you'd like to generate the busy response shortly before the initiator times out. Although that might reduce the logging in this case, I really suspect that not having that capability is not the root cause of the remaining issue here.
--
Mark Rustad, Networking Division, Intel Corporation
Vasu Dev
2014-06-09 17:38:18 UTC
Permalink
Post by Rustad, Mark D
Post by Jun Wu
The queue_depth is 32 by default. So all my previous tests were based
on 32 queue_depth.
The result of my tests today confirms that with higher queue_depth,
there are more aborts on the initiator side and corresponding
"Exchange timer armed : 0 msecs" messages on the target side.
I tried queue_depth = 1, 2, 3, and 6. For 1 and 2, there is no abort
or any abnormal messages. For 3, there are 15 instances of aborts and
15 instances of "0 msecs" messages. When queue_depth increased to 6,
there are 81 instances of the aborts and 81 instances of "0 msecs"
messages for the same fio test.
<snip>
Post by Jun Wu
When the abort occurs, "iostat -x 1" command on the target side starts
to show 0 iops on some of the drives. Sometimes all the 10 target
drives have 0 iops.
What is the reasonable number to assume for the maximum number of
outstanding IO?
As mentioned before "It completely depends on the fabric, initiator, and
backend storage.", see related text for more details.
Post by Rustad, Mark D
You mentioned previously that fewer drives also worked better. The combination of workarounds, reduced queue depth or fewer drives, makes me wonder if there is a power issue in the system housing the drives. It is surprisingly common for a fully-populated system to not really have adequate power for fully active drives. Under those conditions, the drives can have a variety of symptoms ranging from pausing I/O to resetting. When that happens, the target will definitely be overloaded, things will time out and aborts will happen. If things work well with one or two fewer drives with full queue depth, I would really suspect a power, or possibly cooling issue rather than a software issue.
It is even possible that there is a vibration issue in the chassis that results in the drives interfering with one another and causing error recovery related I/O interruptions that could also result in timeouts and exchange aborts.
Using ramdisks instead of the backend disks could confirm whether the
issue is isolated to the backend.
Post by Rustad, Mark D
Nab, earlier you mentioned generating a busy response. I think that is a good thing to be able to do. It is a little complicated because different initiators have different timeouts. Ideally, you'd like to generate the busy response shortly before the initiator times out. Although that might reduce the logging in this case, I really suspect that not having that capability is not the root cause of the remaining issue here.
If the target sends QUEUE FULL to hosts based on its pending backlog,
instead of reacting to data-out failures, that could avoid host timeouts
and would be in line with the ideal scenario described above.

Thanks,
Vasu
Jun Wu
2014-06-09 18:11:15 UTC
Permalink
All the drives are SSD drives. So it shouldn't be a power issue in this case.
Post by Rustad, Mark D
Post by Jun Wu
The queue_depth is 32 by default. So all my previous tests were based
on 32 queue_depth.
The result of my tests today confirms that with higher queue_depth,
there are more aborts on the initiator side and corresponding
"Exchange timer armed : 0 msecs" messages on the target side.
I tried queue_depth = 1, 2, 3, and 6. For 1 and 2, there is no abort
or any abnormal messages. For 3, there are 15 instances of aborts and
15 instances of "0 msecs" messages. When queue_depth increased to 6,
there are 81 instances of the aborts and 81 instances of "0 msecs"
messages for the same fio test.
<snip>
Post by Jun Wu
When the abort occurs, "iostat -x 1" command on the target side starts
to show 0 iops on some of the drives. Sometimes all the 10 target
drives have 0 iops.
What is the reasonable number to assume for the maximum number of
outstanding IO?
You mentioned previously that fewer drives also worked better. The combination of workarounds, reduced queue depth or fewer drives, makes me wonder if there is a power issue in the system housing the drives. It is surprisingly common for a fully-populated system to not really have adequate power for fully active drives. Under those conditions, the drives can have a variety of symptoms ranging from pausing I/O to resetting. When that happens, the target will definitely be overloaded, things will time out and aborts will happen. If things work well with one or two fewer drives with full queue depth, I would really suspect a power, or possibly cooling issue rather than a software issue.
It is even possible that there is a vibration issue in the chassis that results in the drives interfering with one another and causing error recovery related I/O interruptions that could also result in timeouts and exchange aborts.
Nab, earlier you mentioned generating a busy response. I think that is a good thing to be able to do. It is a little complicated because different initiators have different timeouts. Ideally, you'd like to generate the busy response shortly before the initiator times out. Although that might reduce the logging in this case, I really suspect that not having that capability is not the root cause of the remaining issue here.
--
Mark Rustad, Networking Division, Intel Corporation
Rustad, Mark D
2014-06-10 16:39:30 UTC
Permalink
Post by Jun Wu
All the drives are SSD drives. So it shouldn't be power issue in this case.
Ah. Probably not power, definitely not vibration! :-) I don't have any experience with racks of SSDs so I don't know what interesting things happen with them.
--
Mark Rustad, Networking Division, Intel Corporation
Jun Wu
2014-06-10 16:46:56 UTC
Permalink
This is a Supermicro chassis with redundant power supplies. We see the
same failures with both SSDs and HDDs.
The same tests pass with non-fcoe protocols, i.e. iSCSI or AoE.
Post by Rustad, Mark D
Post by Jun Wu
All the drives are SSD drives. So it shouldn't be power issue in this case.
Ah. Probably not power, definitely not vibration! :-) I don't have any experience with racks of SSDs so I don't know what interesting things happen with them.
--
Mark Rustad, Networking Division, Intel Corporation
Vasu Dev
2014-06-10 22:38:45 UTC
Permalink
Post by Jun Wu
This a Supermicro chassis with redundant power supplies. We see the
same failures with both SSDs or HDDs.
The same tests pass with non-fcoe protocol, i.e. iSCSI or AoE.
Are the iSCSI or AoE tests run with the same TCM core kernel and the same
target and host NICs/switch?

What NICs are in your chassis? As I mentioned before, "DCB and PFC PAUSE
typically used and required by fcoe", but you are using plain PAUSE, and
the switch cannot be eliminated, as you mentioned before. These factors
affect FCoE more than other protocols, so can you ensure the IO errors are
not due to frame losses without DCB/PFC in your setup?

The aborts at the target with zero timeout values may still be an issue,
but you could avoid them completely by increasing the SCSI timeout and
disabling REC, as discussed before.
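The SCSI command timeout is also a per-device sysfs attribute; a rough
sketch of raising it (the 60-second value and the sd* glob are examples,
not tuned recommendations):

```shell
# Raise the SCSI command timeout (default is usually 30s) on every
# sd* device so slow IOs can complete before the initiator escalates
# to aborts.
for t in /sys/block/sd*/device/timeout; do
    echo 60 > "$t"
done
```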

Please use inline response and avoid top posts.

Thanks,
Vasu
Jun Wu
2014-06-11 02:40:20 UTC
Permalink
Post by Vasu Dev
Post by Jun Wu
This a Supermicro chassis with redundant power supplies. We see the
same failures with both SSDs or HDDs.
The same tests pass with non-fcoe protocol, i.e. iSCSI or AoE.
Is iSCSI or AoE tests with same TCM core kernel with same target and
host NICs/switch ?
We tested AoE with the same hardware/switch and test setup. AoE works,
except that it is not an enterprise protocol and it doesn't provide the
performance. It doesn't use TCM.
Post by Vasu Dev
What NICs in your chassis? As I mentioned before that "DCB and PFC PAUSE
typically used and required by fcoe", but you are using PAUSE and switch
cannot be eliminated as you mentioned before, these could affect more to
FCoE than other protocols, so can you ensure IO errors are not due to
frames losses w/o DCB/PFC in your setup ?
The NIC is:
[***@poc1 log]# lspci | grep 82599
08:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit
SFI/SFP+ Network Connection (rev 01)

The issue should not be caused by frame losses. The systems work fine
with other protocols.
Post by Vasu Dev
While possibly abort issues at target with zero timeout values but you
could avoid them completely by increasing scsi timeout and disabling REC
as discussed before.
Please use inline response and avoid top posts.
Thanks,
Vasu
Is the following cmd_per_lun fcoe related? Its default value is 3, and
it doesn't allow me to change it.
/sys/devices/pci0000:00/0000:00:05.0/0000:08:00.0/net/p4p1/ctlr_2/host9/scsi_host/host9/cmd_per_lun

Nab,
Could you please send the patches you mentioned for me to test?

Thanks,
Jun
Vasu Dev
2014-06-11 18:19:34 UTC
Permalink
Post by Jun Wu
Post by Vasu Dev
Post by Jun Wu
This a Supermicro chassis with redundant power supplies. We see the
same failures with both SSDs or HDDs.
The same tests pass with non-fcoe protocol, i.e. iSCSI or AoE.
Is iSCSI or AoE tests with same TCM core kernel with same target and
host NICs/switch ?
We tested AoE with the same hardware/switch and test setup. AoE works
except that it is not enterprise protocol and it doesn't provide
performance. It doesn't use TCM.
You had fcoe working at a lower queue depth, and that should be yielding
lower performance, like AoE; besides, AoE is not using TCM, so it is not a
correct comparison. What about iSCSI, is that using TCM?
Post by Jun Wu
Post by Vasu Dev
What NICs in your chassis? As I mentioned before that "DCB and PFC PAUSE
typically used and required by fcoe", but you are using PAUSE and switch
cannot be eliminated as you mentioned before, these could affect more to
FCoE than other protocols, so can you ensure IO errors are not due to
frames losses w/o DCB/PFC in your setup ?
08:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit
SFI/SFP+ Network Connection (rev 01)
The issue should not be caused by frame losses. The systems work fine
with other protocols.
FCoE is less tolerant than the others to packet losses and latency
variations, since it expects FC-like deterministic fabric performance;
therefore no-drop ethernet is a must for FCoE, unlike the other protocols
in the comparison. For instance, iSCSI adapts its tx window to frame
losses, but there is no such mechanism in FCoE. Thus you cannot conclude
that there are no frame losses just because the others work in the setup;
iSCSI and AoE should work fine without no-drop ethernet, PAUSE, or PFC. So
can you confirm there are no frame losses using "ethtool -S ethX"?
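For example (p4p1 as in this thread; exact counter names vary by driver):

```shell
# Check the NIC counters on both ends while the fio test runs.
# Non-zero missed/dropped/error counters alongside the flow-control
# counters suggest frames are being lost despite PAUSE.
ethtool -S p4p1 | egrep 'flow_control|missed|dropped|error'
```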
Post by Jun Wu
Post by Vasu Dev
While possibly abort issues at target with zero timeout values but you
could avoid them completely by increasing scsi timeout and disabling REC
as discussed before.
Now that I know an ixgbe (82599) is in use, try a few more things in
addition to the suggestions above:-

1) Disable the irq balancer.
2) Find your ethX interrupts through "cat /proc/interrupts | grep ethX".
Identifying the fcoe interrupts among them is tricky; they may be labeled
fcoe, but if not, identify them through their interrupt activity while
fcoe traffic is on. A total of 8 fcoe interrupts are used; pin them across
the first eight CPUs used in your workloads.
3) Increase the ring sizes from the default 512 to 2K or 4K; just a hunch,
in case frames are dropped due to longer PAUSE or congestion in your setup.
4) Also monitor the ethX stats, besides the fcoe hostX stats, for anything
that stands out as odd at "/sys/class/fc_host/hostX/statistics/"
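A sketch of those four steps (the IRQ numbers, CPU masks, and host number
below are placeholders; check your own /proc/interrupts output):

```shell
# 1) Stop the irq balancer so manual pinning sticks.
systemctl stop irqbalance

# 2) List candidate interrupts, then pin each fcoe vector to one CPU.
#    IRQs 64/65 and the CPU masks are examples only.
grep p4p1 /proc/interrupts
echo 01 > /proc/irq/64/smp_affinity   # vector 0 -> CPU0
echo 02 > /proc/irq/65/smp_affinity   # vector 1 -> CPU1

# 3) Grow the RX/TX rings from the 512 default.
ethtool -G p4p1 rx 4096 tx 4096

# 4) Watch the FC host statistics for anything odd.
watch -n1 'cat /sys/class/fc_host/host9/statistics/*'
```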


<snip>
Post by Jun Wu
Is the following cmd_per_lun fcoe related? Its default value is 3. And
it doesn't allow me to change.
/sys/devices/pci0000:00/0000:00:05.0/0000:08:00.0/net/p4p1/ctlr_2/host9/scsi_host/host9/cmd_per_lun
I think this doesn't matter once the device queue depth is adjusted to 32,
and that can be adjusted. I mean, cmd_per_lun is used at SCSI host
allocation as the initial queue depth; later, the SCSI device queue depth
is adjusted to 32 through the slave_alloc callback, and that can be changed
at /sys/block/sdX/device/queue_depth as you did before, but cmd_per_lun
itself cannot.

//Vasu
Jun Wu
2014-06-12 22:18:20 UTC
Permalink
Post by Vasu Dev
Post by Jun Wu
Post by Vasu Dev
Post by Jun Wu
This a Supermicro chassis with redundant power supplies. We see the
same failures with both SSDs or HDDs.
The same tests pass with non-fcoe protocol, i.e. iSCSI or AoE.
Is iSCSI or AoE tests with same TCM core kernel with same target and
host NICs/switch ?
We tested AoE with the same hardware/switch and test setup. AoE works
except that it is not enterprise protocol and it doesn't provide
performance. It doesn't use TCM.
You had fcoe working with lower queue depth and that should be yielding
lower performance as AoE beside AoE is not using TCM, so not correct
comparison. What about iSCSI, is that using TCM ?
We didn't use TCM for the iSCSI test either.
Post by Vasu Dev
Post by Jun Wu
Post by Vasu Dev
What NICs in your chassis? As I mentioned before that "DCB and PFC PAUSE
typically used and required by fcoe", but you are using PAUSE and switch
cannot be eliminated as you mentioned before, these could affect more to
FCoE than other protocols, so can you ensure IO errors are not due to
frames losses w/o DCB/PFC in your setup ?
08:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit
SFI/SFP+ Network Connection (rev 01)
The issue should not be caused by frame losses. The systems work fine
with other protocols.
FCoE is less tolerant than others to packet losses & latency variations
for more FC like deterministic fabric performance and therefore no drop
ethernet is must for FCoE unlike others in the comparison, for instance
iSCSI would adapt tx window as per frame losses but no such thing in
FCoE. Thus you cannot conclude that there is no frames losses just
because other works in the setup, iSCSI and AoE should work fine without
no drop ethernet, PAUSE or PFC, so can you confirm no frames losses
using "ethtool -S ethX" ?
I ran the same test again, that is, 10 fio sessions from one initiator
to 10 drives on the target via vn2vn. After I saw the following
messages
poc2 kernel: ft_queue_data_in: Failed to send frame
ffff8800ae64ba00, xid <0x389>, remaining 196608, lso_max <0x10000>

I checked
ethtool -S p4p1 | grep error
and
ethtool -S p4p1 | grep control
on both initiator and target, the numbers were all zero.

So the abort should not be caused by frame losses. Later the target hung.
Post by Vasu Dev
Post by Jun Wu
Post by Vasu Dev
While possibly abort issues at target with zero timeout values but you
could avoid them completely by increasing scsi timeout and disabling REC
as discussed before.
Now that I know ixgbe (82599) in use, try few more things in addition to
suggestions above:-
1) Disable irq balancer
2) Find your ethX interrupts through "cat /proc/interrupts | grep ethX".
Identifying fcoe among them is tricky, it may be labeled with fcoe but
if not then identify them through intr activity while fcoe traffic on,
total 8 fcoe intr are used and pin them across first eight set of CPUs
used in your workloads.
3) Increase rings size from default 512 to 2K or 4K, just a hunch in
case frames dropped due to longer PAUSE or congestion in your setup.
4) Also monitor ethX stats beside fcoe hostX stats for anything stand
out odd there at "/sys/class/fc_host/hostX/statistics/"
We disabled the irq balancer, found the interrupts, and pinned them. We
increased the ring sizes to 4K. These changes seem to allow the test to
run longer, but the target eventually hung. In this test, we saw non-zero
tx_flow_control_xon/off and rx_flow_control_xon/off counters, which
indicate PAUSE is working. We did see frame losses (rx_missed_errors) in
this test. So PAUSE frames were issued, but ultimately it didn't work.

Is it reasonable to expect PAUSE frames to be sufficient for end-to-end
flow control between nodes? If not, should the issue be dealt with at the
mid level by, for example, issuing BUSY to manage flow control between
nodes? Or does there need to be some management at a higher level,
limiting outstanding commands? What is the best way to manage this?

We are flooding the network link, using all SSD drives and pushing the
boundaries:)

Thanks,
Jun
Post by Vasu Dev
<snip>
Post by Jun Wu
Is the following cmd_per_lun fcoe related? Its default value is 3. And
it doesn't allow me to change.
/sys/devices/pci0000:00/0000:00:05.0/0000:08:00.0/net/p4p1/ctlr_2/host9/scsi_host/host9/cmd_per_lun
I think this doesn't matter once device queue depth adjusted to 32 and
that can be adjusted. I mean this is used at scsi host alloc as initial
queue depth and later scsi device queue depth is adjusted to 32 through
slave_alloc call back and that can be adjusted
at /sys/block/sdX/device/queue_depth as you did before but not this.
//Vasu
Rustad, Mark D
2014-06-12 22:59:00 UTC
Permalink
Post by Jun Wu
Is it reasonable to expect PAUSE frame to be sufficient for end to end
flow control between nodes?
Whether PAUSE is sufficient depends very much on the switches being used and possibly how they are configured. Very simple inexpensive switches may be tuned to expect at most 1500-byte packets, so the larger packets of FCoE could possibly result in buffer overflows in the switches resulting in packet loss.

What kind of switches are you using and how much do you know about them? Are there multiple switches in the path between your initiators and your target?
--
Mark Rustad, Networking Division, Intel Corporation
Vasu Dev
2014-06-13 00:43:14 UTC
Permalink
Post by Jun Wu
Post by Vasu Dev
Post by Jun Wu
Post by Vasu Dev
Post by Jun Wu
This a Supermicro chassis with redundant power supplies. We see the
same failures with both SSDs or HDDs.
The same tests pass with non-fcoe protocol, i.e. iSCSI or AoE.
Is iSCSI or AoE tests with same TCM core kernel with same target and
host NICs/switch ?
We tested AoE with the same hardware/switch and test setup. AoE works
except that it is not enterprise protocol and it doesn't provide
performance. It doesn't use TCM.
You had fcoe working with lower queue depth and that should be yielding
lower performance as AoE beside AoE is not using TCM, so not correct
comparison. What about iSCSI, is that using TCM ?
We didn't use TCM for the iSCSI test either.
Too many variables to compare or get any help in isolating issues here.
Post by Jun Wu
Post by Vasu Dev
Post by Jun Wu
Post by Vasu Dev
What NICs in your chassis? As I mentioned before that "DCB and PFC PAUSE
typically used and required by fcoe", but you are using PAUSE and switch
cannot be eliminated as you mentioned before, these could affect more to
FCoE than other protocols, so can you ensure IO errors are not due to
frames losses w/o DCB/PFC in your setup ?
08:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit
SFI/SFP+ Network Connection (rev 01)
The issue should not be caused by frame losses. The systems work fine
with other protocols.
FCoE is less tolerant than others to packet losses & latency variations
for more FC like deterministic fabric performance and therefore no drop
ethernet is must for FCoE unlike others in the comparison, for instance
iSCSI would adapt tx window as per frame losses but no such thing in
FCoE. Thus you cannot conclude that there is no frames losses just
because other works in the setup, iSCSI and AoE should work fine without
no drop ethernet, PAUSE or PFC, so can you confirm no frames losses
using "ethtool -S ethX" ?
I ran the same test again, that is 10 fio sessions from one initiator
to 10 drives on the target via vn2vn. After I saw the following
messages
poc2 kernel: ft_queue_data_in: Failed to send frame
ffff8800ae64ba00, xid <0x389>, remaining 196608, lso_max <0x10000>
I checked
ethtool -S p4p1 | grep error
and
ethtool -S p4p1 | grep control
on both initiator and target, the numbers were all zero.
So the abort should not be caused by frame losses.
You mentioned below that the missed-frames count went up, thus there are
packet drops. The aborts would be due either to loss or to a response not
arriving in time, so it could be frame loss. BTW, is this after increasing
the timeout as suggested above, with REC disabled?
Post by Jun Wu
Post by Vasu Dev
Post by Jun Wu
Post by Vasu Dev
While possibly abort issues at target with zero timeout values but you
could avoid them completely by increasing scsi timeout and disabling REC
as discussed before.
Now that I know ixgbe (82599) in use, try few more things in addition to
suggestions above:-
1) Disable irq balancer
2) Find your ethX interrupts through "cat /proc/interrupts | grep ethX".
Identifying fcoe among them is tricky, it may be labeled with fcoe but
if not then identify them through intr activity while fcoe traffic on,
total 8 fcoe intr are used and pin them across first eight set of CPUs
used in your workloads.
3) Increase rings size from default 512 to 2K or 4K, just a hunch in
case frames dropped due to longer PAUSE or congestion in your setup.
4) Also monitor ethX stats beside fcoe hostX stats for anything stand
out odd there at "/sys/class/fc_host/hostX/statistics/"
We disabled irq balancer, found the interrupts and pinned them.
Increased the ring sizes to 4K. It seems that these changes allow the
test to run longer. But the target eventually hung. In this test, we
saw none zero tx_flow_control_xon/off and rx_flow_control_xon/off
numbers which indicate PAUSE is working. We did see frame losses
(rx_missed_errors) in this test. So PAUSE frames were issued,
The NIC sent PAUSE frames out, but the switch still not stopping means
PAUSE is possibly not enabled on the switch side, leading to
rx_missed_errors. Again, a back-to-back connection would have helped here,
as Nab suggested before.
Post by Jun Wu
but
ultimately it didn't work.
Is it reasonable to expect PAUSE frame to be sufficient for end to end
flow control between nodes?
PAUSE/PFC is link level, but it would propagate in a multi-node case,
depending on the switch in use, so check with the switch vendor.
Post by Jun Wu
If not, should the issue be dealt with at
mid level by, for example, issuing BUSY to manage flow control between
nodes? Or there needs to be some management at higher level by
limiting outstanding commands? What is the best way to manage?
The stack should handle this, provided PAUSE is working and frames are not
dropped at L2.
Post by Jun Wu
We are flooding the network link, using all SSD drives and pushing the
boundaries:)
Yeap, I doubt anyone has done this much stress on TCM FC so far; mostly
the host SW is tested, at least at Intel, against real FC/FCoE targets.

//Vasu
Post by Jun Wu
Thanks,
Jun
Post by Vasu Dev
<snip>
Post by Jun Wu
Is the following cmd_per_lun fcoe related? Its default value is 3. And
it doesn't allow me to change.
/sys/devices/pci0000:00/0000:00:05.0/0000:08:00.0/net/p4p1/ctlr_2/host9/scsi_host/host9/cmd_per_lun
I think this doesn't matter once device queue depth adjusted to 32 and
that can be adjusted. I mean this is used at scsi host alloc as initial
queue depth and later scsi device queue depth is adjusted to 32 through
slave_alloc call back and that can be adjusted
at /sys/block/sdX/device/queue_depth as you did before but not this.
//Vasu
Jun Wu
2014-06-13 01:20:43 UTC
Permalink
Post by Vasu Dev
Post by Jun Wu
Post by Vasu Dev
Post by Jun Wu
Post by Vasu Dev
Post by Jun Wu
This a Supermicro chassis with redundant power supplies. We see the
same failures with both SSDs or HDDs.
The same tests pass with non-fcoe protocol, i.e. iSCSI or AoE.
Is iSCSI or AoE tests with same TCM core kernel with same target and
host NICs/switch ?
We tested AoE with the same hardware/switch and test setup. AoE works
except that it is not enterprise protocol and it doesn't provide
performance. It doesn't use TCM.
You had fcoe working with lower queue depth and that should be yielding
lower performance as AoE beside AoE is not using TCM, so not correct
comparison. What about iSCSI, is that using TCM ?
We didn't use TCM for the iSCSI test either.
Too many variables to compare or get any help in isolating issues here.
Post by Jun Wu
Post by Vasu Dev
Post by Jun Wu
Post by Vasu Dev
What NICs in your chassis? As I mentioned before that "DCB and PFC PAUSE
typically used and required by fcoe", but you are using PAUSE and switch
cannot be eliminated as you mentioned before, these could affect more to
FCoE than other protocols, so can you ensure IO errors are not due to
frames losses w/o DCB/PFC in your setup ?
08:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit
SFI/SFP+ Network Connection (rev 01)
The issue should not be caused by frame losses. The systems work fine
with other protocols.
FCoE is less tolerant than others to packet losses & latency variations
for more FC like deterministic fabric performance and therefore no drop
ethernet is must for FCoE unlike others in the comparison, for instance
iSCSI would adapt tx window as per frame losses but no such thing in
FCoE. Thus you cannot conclude that there is no frames losses just
because other works in the setup, iSCSI and AoE should work fine without
no drop ethernet, PAUSE or PFC, so can you confirm no frames losses
using "ethtool -S ethX" ?
I ran the same test again, that is 10 fio sessions from one initiator
to 10 drives on the target via vn2vn. After I saw the following
messages
poc2 kernel: ft_queue_data_in: Failed to send frame
ffff8800ae64ba00, xid <0x389>, remaining 196608, lso_max <0x10000>
I checked
ethtool -S p4p1 | grep error
and
ethtool -S p4p1 | grep control
on both initiator and target, the numbers were all zero.
So the abort should not be caused by frame losses.
You mentioned below frames missed count up, thus there are packet drops.
It would be either due to loss or response not in time. So it could be
due to frames loss, BTW is it after increased timeout as suggested
above with REC disabled ?
In this particular test, I saw multiple "Failed to send frame"
messages first, and after quite a while, there were no frame losses
reported. It seems to me that the frame loss was not the cause of the
abort, at least in this case.
Yes, the kernel used for the test has REC disabled.
Post by Vasu Dev
Post by Jun Wu
Post by Vasu Dev
Post by Jun Wu
Post by Vasu Dev
While possibly abort issues at target with zero timeout values but you
could avoid them completely by increasing scsi timeout and disabling REC
as discussed before.
Now that I know ixgbe (82599) in use, try few more things in addition to
suggestions above:->> >
1) Disable irq balancer
2) Find your ethX interrupts through "cat /proc/interrupts | grep ethX".
Identifying fcoe among them is tricky, it may be labeled with fcoe but
if not then identify them through intr activity while fcoe traffic on,
total 8 fcoe intr are used and pin them across first eight set of CPUs
used in your workloads.
3) Increase rings size from default 512 to 2K or 4K, just a hunch in
case frames dropped due to longer PAUSE or congestion in your setup.
4) Also monitor ethX stats beside fcoe hostX stats for anything stand
out odd there at "/sys/class/fc_host/hostX/statistics/"
We disabled irq balancer, found the interrupts and pinned them.
Increased the ring sizes to 4K. It seems that these changes allow the
test to run longer. But the target eventually hung. In this test, we
saw none zero tx_flow_control_xon/off and rx_flow_control_xon/off
numbers which indicate PAUSE is working. We did see frame losses
(rx_missed_errors) in this test. So PAUSE frames were issued,
NIC sent pause frames out but still switch not stopping means possibly
switch side pause not enabled and leading to rx_missed_errors, again
back to back would have helped here as Nab suggested before.
Post by Jun Wu
but
ultimately it didn't work.
Is it reasonable to expect PAUSE frame to be sufficient for end to end
flow control between nodes?
PAUSE/PFC is link level but would spread in case of multi nodes but
would depend on switch is use, so check with them.
Post by Jun Wu
If not, should the issue be dealt with at
mid level by, for example, issuing BUSY to manage flow control between
nodes? Or there needs to be some management at higher level by
limiting outstanding commands? What is the best way to manage?
Stack should handle this provided PAUSE is working and frames are not
dropped at L2.
Post by Jun Wu
We are flooding the network link, using all SSD drives and pushing the
boundaries:)
Yeap, I doubt anyone has done enough stress on TCM FC so far and mostly
hosts SW are tested least at Intel against real FC/FCoE targets.
//Vasu
Post by Jun Wu
Thanks,
Jun
Post by Vasu Dev
<snip>
Post by Jun Wu
Is the following cmd_per_lun fcoe related? Its default value is 3. And
it doesn't allow me to change.
/sys/devices/pci0000:00/0000:00:05.0/0000:08:00.0/net/p4p1/ctlr_2/host9/scsi_host/host9/cmd_per_lun
I think this doesn't matter once device queue depth adjusted to 32 and
that can be adjusted. I mean this is used at scsi host alloc as initial
queue depth and later scsi device queue depth is adjusted to 32 through
slave_alloc call back and that can be adjusted
at /sys/block/sdX/device/queue_depth as you did before but not this.
//Vasu
Vasu Dev
2014-06-19 17:00:37 UTC
Permalink
Post by Jun Wu
Post by Vasu Dev
Post by Jun Wu
Post by Vasu Dev
Post by Jun Wu
Post by Vasu Dev
Post by Jun Wu
This a Supermicro chassis with redundant power supplies. We see the
same failures with both SSDs or HDDs.
The same tests pass with non-fcoe protocol, i.e. iSCSI or AoE.
Is iSCSI or AoE tests with same TCM core kernel with same target and
host NICs/switch ?
We tested AoE with the same hardware/switch and test setup. AoE works
except that it is not enterprise protocol and it doesn't provide
performance. It doesn't use TCM.
You had fcoe working with lower queue depth and that should be yielding
lower performance as AoE beside AoE is not using TCM, so not correct
comparison. What about iSCSI, is that using TCM ?
We didn't use TCM for the iSCSI test either.
Too many variables to compare or get any help in isolating issues here.
Post by Jun Wu
Post by Vasu Dev
Post by Jun Wu
Post by Vasu Dev
What NICs in your chassis? As I mentioned before that "DCB and PFC PAUSE
typically used and required by fcoe", but you are using PAUSE and switch
cannot be eliminated as you mentioned before, these could affect more to
FCoE than other protocols, so can you ensure IO errors are not due to
frames losses w/o DCB/PFC in your setup ?
08:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit
SFI/SFP+ Network Connection (rev 01)
The issue should not be caused by frame losses. The systems work fine
with other protocols.
FCoE is less tolerant than other protocols to packet loss and latency
variation, since it targets FC-like deterministic fabric performance;
therefore no-drop Ethernet is a must for FCoE, unlike the others in your
comparison. For instance, iSCSI adapts its TX window to frame losses,
but there is no such mechanism in FCoE. Thus you cannot conclude that
there are no frame losses just because the others work in this setup;
iSCSI and AoE should work fine without no-drop Ethernet, PAUSE, or PFC.
So can you confirm there are no frame losses using "ethtool -S ethX"?
I ran the same test again, that is 10 fio sessions from one initiator
to 10 drives on the target via vn2vn. After I saw the following
messages
poc2 kernel: ft_queue_data_in: Failed to send frame
ffff8800ae64ba00, xid <0x389>, remaining 196608, lso_max <0x10000>
I checked
ethtool -S p4p1 | grep error
and
ethtool -S p4p1 | grep control
on both initiator and target, the numbers were all zero.
So the abort should not be caused by frame losses.
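That kind of counter check can be scripted. A minimal sketch follows; sample `ethtool -S`-style output is embedded so the snippet is self-contained, and on a live system you would use `stats=$(ethtool -S p4p1)` instead (the interface name is illustrative):

```shell
# Sum all counters whose names match a pattern (e.g. "error" or
# "control") from ethtool -S style "name: value" output.
stats='     rx_errors: 0
     tx_errors: 0
     rx_missed_errors: 0
     tx_flow_control_xon: 0
     rx_flow_control_xoff: 0'
sum_counters() {
    printf '%s\n' "$stats" | awk -F': ' "/$1/ {s += \$2} END {print s + 0}"
}
echo "error counters total:        $(sum_counters error)"
echo "flow-control counters total: $(sum_counters control)"
```

A non-zero error total after a run points at drops; non-zero flow-control totals show PAUSE frames were exchanged.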
You mentioned below that the frames-missed count went up, so there are
packet drops. The abort would be due to either frame loss or a response
not arriving in time, so it could well be frame loss. BTW, is this after
increasing the timeout as suggested above, with REC disabled?
In this particular test, I saw multiple "Failed to send frame"
messages first, and after quite a while, there were no frame losses
reported. It seems to me that the frame loss was not the cause of the
abort, at least in this case.
Yes, the kernel used for the test has REC disabled.
In that case it takes 10 seconds before the abort is issued, so I don't
know what would hold an IO that long in your test setup without any
frame loss. You might want to trace the IO end to end or profile the
system for bottlenecks. As far as the FCoE host stack goes, it is
optimized significantly; I push over 2M IOPS with just a single
dual-port 82599ES 10-Gigabit adapter, so the bottleneck is possibly
beyond the host interface, anywhere from the switch to the backend.
Profiling or eliminating some setup elements could help you locate it.

Repeating what I mentioned before: the switch in your setup cannot be
eliminated and is running without DCB/PFC, so either find a way to skip
that hop or dig into the switch for possible drops, extended PAUSE, etc.
If you tell me which switch is in use then I might try to find out more
about it, though that all goes beyond the Open-FCoE stack.

Thanks,
Vasu
Post by Jun Wu
Post by Vasu Dev
Post by Jun Wu
Post by Vasu Dev
Post by Jun Wu
Post by Vasu Dev
While there may be abort issues at the target with zero timeout values,
you could avoid them completely by increasing the scsi timeout and
disabling REC, as discussed before.
Now that I know ixgbe (82599) is in use, try a few more things in
addition to the suggestions above:
1) Disable the irq balancer.
2) Find your ethX interrupts through "cat /proc/interrupts | grep ethX".
Identifying the fcoe interrupts among them is tricky; they may be
labeled with fcoe, but if not, identify them through interrupt activity
while fcoe traffic is on. A total of 8 fcoe interrupts are used; pin
them across the first eight CPUs used in your workloads.
3) Increase the ring sizes from the default 512 to 2K or 4K, just a
hunch in case frames are dropped due to longer PAUSE or congestion in
your setup.
4) Also monitor the ethX stats besides the fcoe hostX stats, and watch
for anything that stands out as odd at
"/sys/class/fc_host/hostX/statistics/".
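A rough script for steps 1-3 above might look like this (the interface name p4p1, the CPU rotation, and the ring size 4096 are illustrative; the commands need root, so the sketch defaults to a dry run that only prints them):

```shell
set -eu
DRY_RUN=${DRY_RUN:-1}
IFACE=p4p1                 # illustrative interface name
# run either echoes the command (dry run) or executes it
run() { if [ "$DRY_RUN" = 1 ]; then echo "would run: $*"; else "$@"; fi; }

# 1) stop the irq balancer so manual pinning sticks
run systemctl stop irqbalance

# 2) pin each fcoe interrupt to one of the first eight CPUs in turn
cpu=0
for irq in $(grep "$IFACE-fcoe" /proc/interrupts 2>/dev/null | awk -F: '{print $1}' || true); do
    run sh -c "echo $((1 << cpu)) > /proc/irq/$irq/smp_affinity"
    cpu=$(( (cpu + 1) % 8 ))
done

# 3) grow the rx/tx rings from the 512 default
run ethtool -G "$IFACE" rx 4096 tx 4096
```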
We disabled the irq balancer, found the interrupts and pinned them, and
increased the ring sizes to 4K. These changes seem to let the test run
longer, but the target eventually hung. In this test we saw non-zero
tx_flow_control_xon/xoff and rx_flow_control_xon/xoff counters, which
indicate PAUSE is working. We did see frame losses (rx_missed_errors)
in this test. So PAUSE frames were issued,
The NIC sent PAUSE frames out but the switch still did not stop, which
means PAUSE is possibly not enabled on the switch side, leading to
rx_missed_errors. Again, a back-to-back setup would have helped here,
as Nab suggested before.
Post by Jun Wu
but
ultimately it didn't work.
Is it reasonable to expect PAUSE frames to be sufficient for end-to-end
flow control between nodes?
PAUSE/PFC is link level, but it would propagate across multiple nodes;
that depends on the switch in use, so check with the switch vendor.
Post by Jun Wu
If not, should the issue be dealt with at the mid level by, for example,
issuing BUSY to manage flow control between nodes? Or does there need to
be some management at a higher level, by limiting outstanding commands?
What is the best way to manage this?
The stack should handle this, provided PAUSE is working and frames are
not dropped at L2.
Post by Jun Wu
We are flooding the network link, using all SSD drives and pushing the
boundaries:)
Yeah, I doubt anyone has done this much stress on TCM FC so far; at
least at Intel, it is mostly the host-side software that is tested,
against real FC/FCoE targets.
//Vasu
Post by Jun Wu
Thanks,
Jun
Post by Vasu Dev
<snip>
Post by Jun Wu
Is the following cmd_per_lun fcoe related? Its default value is 3, and
it doesn't allow me to change it.
/sys/devices/pci0000:00/0000:00:05.0/0000:08:00.0/net/p4p1/ctlr_2/host9/scsi_host/host9/cmd_per_lun
I think this doesn't matter once the device queue depth is adjusted to
32. I mean cmd_per_lun is only used at scsi host alloc as the initial
queue depth; later the scsi device queue depth is adjusted to 32 through
the slave_alloc callback, and that can be changed
at /sys/block/sdX/device/queue_depth as you did before, but not through
this sysfs file.
//Vasu
Jun Wu
2014-06-20 00:17:27 UTC
Permalink
Post by Vasu Dev
Post by Jun Wu
<snip>
In this particular test, I saw multiple "Failed to send frame"
messages first, and after quite a while, there were no frame losses
reported. It seems to me that the frame loss was not the cause of the
abort, at least in this case.
Yes, the kernel used for the test has REC disabled.
In that case it takes 10 seconds before the abort is issued, so I don't
know what would hold an IO that long in your test setup without any
frame loss. You might want to trace the IO end to end or profile the
system for bottlenecks. As far as the FCoE host stack goes, it is
optimized significantly; I push over 2M IOPS with just a single
dual-port 82599ES 10-Gigabit adapter, so the bottleneck is possibly
beyond the host interface, anywhere from the switch to the backend.
Profiling or eliminating some setup elements could help you locate it.
For the 2M IOPS test, what IO pattern did you run? How many target
drives do you have?
We are using a Supermicro chassis, so the 10 SSD drives are all local
to the target machine.
We are also able to get around 1M IOPS with a single port for 4KB IO
size. The issue seems to be associated with large IO sizes.

I repeated the one-initiator, one-target test with 10 target SSD drives
again, with different IO patterns:
1. 1MB IO size, sequential write. There was no issue.
2. 1MB IO size, random write. No issue.
3. 1MB IO size, sequential read. Within one minute, I saw the following
message on the target, which indicates an abort from the initiator:
Jun 19 13:32:33 poc2 kernel: [ 2820.617015] ft_queue_data_in:
Failed to send frame ffff8806133a8000, xid <0x86a>, remaining 458752,
lso_max <0x10000>
iostat -x 1 showed 0 iops for multiple drives.
4. 1MB IO size, random read. This is the least stable one. Immediately,
iostat -x 1 showed zero iops for 8 out of the 10 target drives. The
following messages were printed on the initiator side:
Jun 19 15:03:46 poc1 kernel: [ 617.837398] sd 7:0:0:8: [sdt]
Unhandled error code
Jun 19 15:03:46 poc1 kernel: [ 617.837403] sd 7:0:0:8: [sdt]
Jun 19 15:03:46 poc1 kernel: [ 617.837406] Result:
hostbyte=DID_ERROR driverbyte=DRIVER_OK
Jun 19 15:03:46 poc1 kernel: [ 617.837408] sd 7:0:0:8: [sdt] CDB:
Jun 19 15:03:46 poc1 kernel: [ 617.837409] Read(10): 28 00 1c
0c d4 00 00 04 00 00
Jun 19 15:03:46 poc1 kernel: [ 617.837418] end_request: I/O
error, dev sdt, sector 470602752
There were no frame losses or PAUSE frames sent during any of the above
tests, even though each test ran quite long, around 20-30 minutes.
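For reproducibility, the failing case can be written up as a small fio job file. A sketch follows: the drive path /dev/sdt is taken from the initiator log above, while the ioengine, iodepth, and runtime values are assumptions that roughly match the test description (1MB random reads, ~30-minute runs):

```shell
# Generate a fio job matching test pattern 4 above.
cat > randread-1m.fio <<'EOF'
[global]
; 1MB random reads: the least stable pattern in the tests above
rw=randread
bs=1m
ioengine=libaio
iodepth=32
direct=1
; roughly 30 minutes, matching the test duration
runtime=1800
time_based

[drive0]
; drive name taken from the initiator log above
filename=/dev/sdt
EOF
echo "wrote randread-1m.fio"
# On the initiator you would then run: fio randread-1m.fio
```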
Post by Vasu Dev
Repeating what I mentioned before: the switch in your setup cannot be
eliminated and is running without DCB/PFC, so either find a way to skip
that hop or dig into the switch for possible drops, extended PAUSE, etc.
If you tell me which switch is in use then I might try to find out more
about it, though that all goes beyond the Open-FCoE stack.
The switch model is the Arista 7050S-52. PFC/Pause frames are not
enabled on the switch. There is only one switch between the initiator
and the target.
Thanks,
Jun
Post by Vasu Dev
<snip>
--
To unsubscribe from this list: send the line "unsubscribe target-devel" in
More majordomo info at http://vger.kernel.org/majordomo-info.html
Vasu Dev
2014-06-20 18:23:55 UTC
Permalink
Post by Jun Wu
<snip>
For the 2M IOPS test, what IO pattern did you run? How many target
drives do you have?
I used two RAM-based soft SANblaze FCoE targets on a Romley E5-2690.
Post by Jun Wu
We are using a Supermicro chassis, so the 10 SSD drives are all local
to the target machine.
We are also able to get around 1M IOPS with a single port for 4KB IO
size.
That is good.
Post by Jun Wu
The issue seems to be associated with large IO sizes.
PFC/PAUSE becomes more relevant with large IO, since with large IOs the
workload is IO bound, with the 10Gig link saturated while the CPUs stay
mostly idle, or at least not pegged. If the latter is not the case then
you have some other bottleneck not related to no-drop Ethernet. Either
way it would result in aborts, as discussed before, with the same
instances showing up again below.
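As a back-of-envelope check of why large IOs saturate the link (this uses raw line rate and ignores FCoE framing overhead, so the real ceiling is somewhat lower):

```shell
# 10 Gb/s line rate expressed in bytes/s, divided by a 1MB IO size.
line_rate=$((10 * 1000 * 1000 * 1000 / 8))   # bytes per second
io_size=$((1024 * 1024))                     # 1MB
echo "max ~$((line_rate / io_size)) x 1MB IOs per second at raw line rate"
```

So roughly 1200 such IOs per second fill the pipe, which a handful of SSDs can sustain while the CPUs stay nearly idle.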
Post by Jun Wu
I repeated the one-initiator, one-target test with 10 target SSD drives
1. 1MB IO size, sequential write. There was no issue.
2. 1MB IO size, random write. No issue.
3. 1MB IO size, sequential read. Within one minute, I saw the following
Failed to send frame ffff8806133a8000, xid <0x86a>, remaining 458752,
lso_max <0x10000>
iostat -x 1 showed 0 iops for multiple drives.
4. 1MB IO size, random read. This is the least stable one. Immediately,
iostat -x 1 showed zero iops for 8 out of the 10 target drives. The
Jun 19 15:03:46 poc1 kernel: [ 617.837398] sd 7:0:0:8: [sdt]
Unhandled error code
Jun 19 15:03:46 poc1 kernel: [ 617.837403] sd 7:0:0:8: [sdt]
hostbyte=DID_ERROR driverbyte=DRIVER_OK
Jun 19 15:03:46 poc1 kernel: [ 617.837409] Read(10): 28 00 1c
0c d4 00 00 04 00 00
Jun 19 15:03:46 poc1 kernel: [ 617.837418] end_request: I/O
error, dev sdt, sector 470602752
There were no frame losses or PAUSE frames sent during any of the above
tests, even though each test ran quite long, around 20-30 minutes.
I suppose you are checking at all access points, I mean at the switch
end as well as at the NIC. I typically see PFC from the switch under
stress; possibly your switch has larger port buffers and therefore
never issues a PAUSE.
Post by Jun Wu
Post by Vasu Dev
Repeating what I mentioned before: the switch in your setup cannot be
eliminated and is running without DCB/PFC, so either find a way to skip
that hop or dig into the switch for possible drops, extended PAUSE, etc.
If you tell me which switch is in use then I might try to find out more
about it, though that all goes beyond the Open-FCoE stack.
The switch model is the Arista 7050S-52. PFC/Pause frames are not
enabled on the switch. There is only one switch between the initiator
and the target.
I don't have that one here, but typically "show interfaces .." dumps
useful stats for frame underruns, CRC errors, etc.

//Vasu
Post by Jun Wu
Thanks,
Jun
Post by Vasu Dev
<snip>
Nicholas A. Bellinger
2014-06-11 21:33:10 UTC
Permalink
Post by Jun Wu
<snip>
Is the following cmd_per_lun fcoe related? Its default value is 3, and
it doesn't allow me to change it.
/sys/devices/pci0000:00/0000:00:05.0/0000:08:00.0/net/p4p1/ctlr_2/host9/scsi_host/host9/cmd_per_lun
Nab,
Could you please send the patches you mentioned for me to test?
The two are in for-next here:

tcm_fc: Generate TASK_SET_FULL status for DataIN failures
https://git.kernel.org/cgit/linux/kernel/git/nab/target-pending.git/commit/?h=for-next&id=b3e5fe1688b998ba5287a68667ef7cc568739e44

tcm_fc: Generate TASK_SET_FULL status for response failures
https://git.kernel.org/cgit/linux/kernel/git/nab/target-pending.git/commit/?h=for-next&id=6dbe7f4e97d55eefcb471c41c16b62fca5f10c68

--nab
Jun Wu
2014-06-12 22:23:00 UTC
Permalink
We tried the changes. The initiator was not able to see the drives on
the target. I saw the following messages:
Jun 12 09:45:58 poc2 kernel: [ 1629.051837] scsi host7: libfc: Host
reset failed, port (00061e) is not ready.
Jun 12 09:45:58 poc2 kernel: [ 1629.051843] scsi 7:0:0:0: Device
offlined - not ready after error recovery
Jun 12 09:45:58 poc2 kernel: [ 1629.052155] general protection fault:
0000 [#1] SMP
Jun 12 09:45:58 poc2 kernel: scsi host7: libfc: Host reset failed,
port (00061e) is not ready.
Jun 12 09:45:58 poc2 kernel: scsi 7:0:0:0: Device offlined - not ready
after error recovery
Jun 12 09:45:58 poc2 kernel: general protection fault: 0000 [#1] SMP
Jun 12 09:45:58 poc2 kernel: [ 1629.052245] Modules linked in: tcm_fc
target_core_pscsi target_core_file target_core_iblock iscsi_target_mod
target_core_mod ipt_MASQUERADE xt_CHECKSUM 8021q fcoe libfcoe garp mrp
libfc scsi_transport_fc scsi_tgt ip6t_rpfilter ip6t_REJECT
xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter
ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6
ip6table_mangle ip6table_security ip6table_raw ip6table_filter
ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4
nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw
iTCO_wdt iTCO_vendor_support gpio_ich coretemp kvm_intel kvm
crc32c_intel microcode serio_raw ses enclosure nfsd auth_rpcgss
i7core_edac nfs_acl lockd edac_core sunrpc lpc_ich ioatdma mfd_core
shpchp i2c_i801 acpi_cpufreq radeon drm_kms_helper ttm ixgbe igb drm
mdio ata_generic ptp pata_acpi pps_core i2c_algo_bit pata_jmicron
aacraid i2c_core dca
Jun 12 09:45:58 poc2 kernel: [ 1629.053754] CPU: 13 PID: 2488 Comm:
kworker/13:3 Tainted: GF O 3.13.10-200.zbfcoepatch.fc20.x86_64 #1
Jun 12 09:45:58 poc2 kernel: [ 1629.053898] Hardware name: Supermicro
X8DTN/X8DTN, BIOS 2.1c 10/28/2011
Jun 12 09:45:58 poc2 kernel: [ 1629.054005] Workqueue: fc_wq_7
fc_rport_final_delete [scsi_transport_fc]
Jun 12 09:45:58 poc2 kernel: [ 1629.054106] task: ffff880622794500 ti:
ffff880622600000 task.ti: ffff880622600000
Jun 12 09:45:58 poc2 kernel: [ 1629.054213] RIP:
0010:[<ffffffff81434e29>] [<ffffffff81434e29>]
scsi_device_put+0x19/0x50
Jun 12 09:45:58 poc2 kernel: [ 1629.054338] RSP: 0018:ffff880622601d90
EFLAGS: 00010202
Jun 12 09:45:58 poc2 kernel: [ 1629.054415] RAX: 6e696d7200276465 RBX:
ffff880035e32000 RCX: 00000001820001a0
Jun 12 09:45:58 poc2 kernel: [ 1629.054515] RDX: 00000001820001a1 RSI:
00000000820001a0 RDI: ffff880035e32000
Jun 12 09:45:58 poc2 kernel: [ 1629.054615] RBP: ffff880622601da0 R08:
0000000000000000 R09: 0000000000000001
Jun 12 09:45:58 poc2 kernel: [ 1629.054715] R10: ffffea0018b22600 R11:
ffffffff81316701 R12: ffff880035e32000
Jun 12 09:45:58 poc2 kernel: [ 1629.054815] R13: ffff88062dead010 R14:
ffff88062dead000 R15: ffff880035870c00
Jun 12 09:45:58 poc2 kernel: [ 1629.054916] FS: 0000000000000000(0000)
GS:ffff88063fca0000(0000) knlGS:0000000000000000
Jun 12 09:45:58 poc2 kernel: [ 1629.055031] CS: 0010 DS: 0000 ES: 0000
CR0: 000000008005003b
Jun 12 09:45:58 poc2 kernel: [ 1629.055111] CR2: 00007f63cf4d1a70 CR3:
0000000001c0c000 CR4: 00000000000007e0
Jun 12 09:45:58 poc2 kernel: [ 1629.055211] Stack:
Jun 12 09:45:58 poc2 kernel: [ 1629.055241] ffff880035e32000
ffff880035955860 ffff880622601de8 ffffffff81443348
Jun 12 09:45:58 poc2 kernel: [ 1629.055362] 0000000000000202
ffff88062dead000 ffff88062dead000 ffff880035955c40
Jun 12 09:45:58 poc2 kernel: [ 1629.055483] ffff880035955860
ffff880035955800 ffff88032e81a000 ffff880622601e20
Jun 12 09:45:58 poc2 kernel: [ 1629.055604] Call Trace:
Jun 12 09:45:58 poc2 kernel: [ 1629.055647] [<ffffffff81443348>]
scsi_remove_target+0x168/0x210
Jun 12 09:45:58 poc2 kernel: [ 1629.055737] [<ffffffffa0492e6c>]
fc_rport_final_delete+0xac/0x1f0 [scsi_transport_fc]
Jun 12 09:45:58 poc2 kernel: [ 1629.055855] [<ffffffff81087bf6>]
process_one_work+0x176/0x430
Jun 12 09:45:58 poc2 kernel: [ 1629.055941] [<ffffffff8108882b>]
worker_thread+0x11b/0x3a0
Jun 12 09:45:58 poc2 kernel: [ 1629.056022] [<ffffffff81088710>] ?
rescuer_thread+0x350/0x350
Jun 12 09:45:58 poc2 kernel: [ 1629.056110] [<ffffffff8108f2f2>]
kthread+0xd2/0xf0
Jun 12 09:45:58 poc2 kernel: [ 1629.056181] [<ffffffff8108f220>] ?
insert_kthread_work+0x40/0x40
Jun 12 09:45:58 poc2 kernel: [ 1629.056274] [<ffffffff81696dbc>]
ret_from_fork+0x7c/0xb0
Jun 12 09:45:58 poc2 kernel: [ 1629.056353] [<ffffffff8108f220>] ?
insert_kthread_work+0x40/0x40
and on the other node:
Jun 12 09:46:21 poc1 kernel: [ 1667.721292] rport-7:0-0: blocked FC
remote port time out: removing target and saving binding
Jun 12 09:46:32 poc1 kernel: [ 1678.220757] scsi host7: libfc: Host
reset succeeded on port (0003ec)
Jun 12 09:46:42 poc1 kernel: [ 1688.229744] scsi host7: libfc: Host
reset succeeded on port (0003ec)
Jun 12 09:46:52 poc1 kernel: [ 1698.238715] scsi 7:0:0:0: Device
offlined - not ready after error recovery
Jun 12 09:46:52 poc1 kernel: [ 1698.238916] BUG: unable to handle
kernel NULL pointer dereference at (null)
Jun 12 09:46:52 poc1 kernel: [ 1698.239043] IP: [<ffffffff81434e29>]
scsi_device_put+0x19/0x50
Jun 12 09:46:52 poc1 kernel: [ 1698.239138] PGD 0
Jun 12 09:46:52 poc1 kernel: [ 1698.239171] Oops: 0000 [#1] SMP
Jun 12 09:46:52 poc1 kernel: [ 1698.239227] Modules linked in: tcm_fc
target_core_pscsi target_core_file target_core_iblock iscsi_target_mod
target_core_mod ipt_MASQUERADE xt_CHECKSUM 8021q fcoe garp libfcoe mrp
libfc scsi_transport_fc scsi_tgt ip6t_rpfilter ip6t_REJECT
xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter
ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6
ip6table_mangle ip6table_security ip6table_raw ip6table_filter
ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4
nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw
iTCO_wdt gpio_ich iTCO_vendor_support coretemp kvm_intel kvm
crc32c_intel microcode serio_raw i2c_i801 lpc_ich mfd_core ioatdma ses
enclosure i7core_edac edac_core shpchp acpi_cpufreq nfsd auth_rpcgss
nfs_acl lockd sunrpc radeon drm_kms_helper ttm ixgbe igb drm mdio
ata_generic ptp pata_acpi pps_core i2c_algo_bit pata_jmicron aacraid
dca i2c_core
Jun 12 09:46:52 poc1 kernel: [ 1698.240740] CPU: 1 PID: 1155 Comm:
kworker/1:2 Tainted: GF O 3.13.10-200.zbfcoepatch.fc20.x86_64 #1
Jun 12 09:46:52 poc1 kernel: [ 1698.240879] Hardware name: Supermicro
X8DTN/X8DTN, BIOS 2.1c 10/28/2011
Jun 12 09:46:52 poc1 kernel: [ 1698.240984] Workqueue: fc_wq_7
fc_starget_delete [scsi_transport_fc]
Jun 12 09:46:52 poc1 kernel: [ 1698.241080] task: ffff88061c3e7020 ti:
ffff88061cc60000 task.ti: ffff88061cc60000
Jun 12 09:46:52 poc1 kernel: [ 1698.241186] RIP:
0010:[<ffffffff81434e29>] [<ffffffff81434e29>]
scsi_device_put+0x19/0x50
Jun 12 09:46:52 poc1 kernel: [ 1698.241309] RSP: 0018:ffff88061cc61db0
EFLAGS: 00010202
Jun 12 09:46:52 poc1 kernel: [ 1698.241385] RAX: 0000000000000000 RBX:
ffff88061d364800 RCX: 00000001820001e8
Jun 12 09:46:52 poc1 kernel: [ 1698.241485] RDX: 00000001820001e9 RSI:
00000000820001e8 RDI: ffff88061d364800
Jun 12 09:46:52 poc1 kernel: [ 1698.241585] RBP: ffff88061cc61dc0 R08:
0000000000000000 R09: 0000000000000001
Jun 12 09:46:52 poc1 kernel: [ 1698.241685] R10: ffffea0018701740 R11:
ffffffff81316701 R12: ffff88061d364800
Jun 12 09:46:52 poc1 kernel: [ 1698.241785] R13: ffff88061e13e010 R14:
ffff88061e13e000 R15: ffff88060b5f9400
Jun 12 09:46:52 poc1 kernel: [ 1698.241886] FS: 0000000000000000(0000)
GS:ffff880627c20000(0000) knlGS:0000000000000000
Jun 12 09:46:52 poc1 kernel: [ 1698.242002] CS: 0010 DS: 0000 ES: 0000
CR0: 000000008005003b
Jun 12 09:46:52 poc1 kernel: [ 1698.242082] CR2: 0000000000000000 CR3:
0000000001c0c000 CR4: 00000000000007e0
Jun 12 09:46:52 poc1 kernel: [ 1698.242182] Stack:
Jun 12 09:46:52 poc1 kernel: [ 1698.242212] ffff88061d364800
ffff8806160bb860 ffff88061cc61e08 ffffffff81443348
Jun 12 09:46:52 poc1 kernel: [ 1698.242333] 0000000000000202
ffff88061e13e000 ffff8806160bb800 ffff88061c3c3080
Jun 12 09:46:52 poc1 kernel: [ 1698.242453] ffff880627c33e00
ffffe8f9e7c22e00 0000000000000040 ffff88061cc61e20
Jun 12 09:46:52 poc1 kernel: [ 1698.242574] Call Trace:
Jun 12 09:46:52 poc1 kernel: [ 1698.242615] [<ffffffff81443348>]
scsi_remove_target+0x168/0x210
Jun 12 09:46:52 poc1 kernel: [ 1698.242706] [<ffffffffa05461b2>]
fc_starget_delete+0x22/0x30 [scsi_transport_fc]
Jun 12 09:46:52 poc1 kernel: [ 1698.242815] [<ffffffff81087bf6>]
process_one_work+0x176/0x430
Jun 12 09:46:52 poc1 kernel: [ 1698.242901] [<ffffffff8108882b>]
worker_thread+0x11b/0x3a0
Jun 12 09:46:52 poc1 kernel: [ 1698.242982] [<ffffffff81088710>] ?
rescuer_thread+0x350/0x350
Jun 12 09:46:52 poc1 kernel: [ 1698.243069] [<ffffffff8108f2f2>]
kthread+0xd2/0xf0
Jun 12 09:46:52 poc1 kernel: [ 1698.243140] [<ffffffff8108f220>] ?
insert_kthread_work+0x40/0x40
Jun 12 09:46:52 poc1 kernel: [ 1698.243231] [<ffffffff81696dbc>]
ret_from_fork+0x7c/0xb0
Jun 12 09:46:52 poc1 kernel: [ 1698.243310] [<ffffffff8108f220>] ?
insert_kthread_work+0x40/0x40
Jun 12 09:46:52 poc1 kernel: [ 1698.243397] Code: c4 08 48 89 d8 5b 41
5c 41 5d 41 5e 41 5f 5d c3 66 90 66 66 66 66 90 55 48 89 e5 41 54 49
89 fc 53 48 8b 07 48 8b 80 c0 00 00 00 <48> 8b 18 48 85 db 74 0d 48 89
df e8 f7 72 ca ff 48 85 c0 75 12
Jun 12 09:46:52 poc1 kernel: [ 1698.243942] RIP [<ffffffff81434e29>]
scsi_device_put+0x19/0x50
Jun 12 09:46:52 poc1 kernel: [ 1698.244032] RSP <ffff88061cc61db0>
Jun 12 09:46:52 poc1 kernel: [ 1698.244081] CR2: 0000000000000000
Jun 12 09:46:52 poc1 kernel: [ 1698.263415] ---[ end trace c6e30b4311d1a181 ]---
Jun 12 09:46:52 poc1 kernel: [ 1698.263501] BUG: unable to handle
kernel paging request at ffffffffffffffd8

Thanks,
Jun

On Wed, Jun 11, 2014 at 2:33 PM, Nicholas A. Bellinger
Post by Nicholas A. Bellinger
Post by Jun Wu
Post by Vasu Dev
Post by Jun Wu
This is a Supermicro chassis with redundant power supplies. We see the
same failures with both SSDs and HDDs.
The same tests pass with non-FCoE protocols, e.g. iSCSI or AoE.
Were the iSCSI or AoE tests run with the same TCM core kernel, and with
the same target and host NICs/switch?
We tested AoE with the same hardware/switch and test setup. AoE works,
except that it is not an enterprise protocol and it doesn't provide the
performance we need. It doesn't use TCM.
Post by Vasu Dev
What NICs are in your chassis? As I mentioned before, DCB and PFC PAUSE
are typically used and required by FCoE, but you are using plain PAUSE,
and, as you mentioned before, the switch cannot be eliminated. These
factors affect FCoE more than other protocols, so can you confirm the IO
errors are not due to frame losses without DCB/PFC in your setup?
08:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit
SFI/SFP+ Network Connection (rev 01)
The issue should not be caused by frame losses. The systems work fine
with other protocols.
Post by Vasu Dev
The zero timeout values could possibly cause abort issues at the target,
but you could avoid them completely by increasing the SCSI timeout and
disabling REC, as discussed before.
Please use inline response and avoid top posts.
Thanks,
Vasu
Is the following cmd_per_lun fcoe related? Its default value is 3, and
it doesn't allow me to change it.
/sys/devices/pci0000:00/0000:00:05.0/0000:08:00.0/net/p4p1/ctlr_2/host9/scsi_host/host9/cmd_per_lun
Nab,
Could you please send the patches you mentioned for me to test?
tcm_fc: Generate TASK_SET_FULL status for DataIN failures
https://git.kernel.org/cgit/linux/kernel/git/nab/target-pending.git/commit/?h=for-next&id=b3e5fe1688b998ba5287a68667ef7cc568739e44
tcm_fc: Generate TASK_SET_FULL status for response failures
https://git.kernel.org/cgit/linux/kernel/git/nab/target-pending.git/commit/?h=for-next&id=6dbe7f4e97d55eefcb471c41c16b62fca5f10c68
--nab
Vasu Dev
2014-06-20 18:29:17 UTC
Post by Jun Wu
We tried the changes. The initiator was not able to see the drives on
the target.
Jun 12 09:45:58 poc2 kernel: [ 1629.051837] scsi host7: libfc: Host
reset failed, port (00061e) is not ready.
Jun 12 09:45:58 poc2 kernel: [ 1629.051843] scsi 7:0:0:0: Device
offlined - not ready after error recovery
Jun 12 09:45:58 poc2 kernel: scsi host7: libfc: Host reset failed,
port (00061e) is not ready.
Jun 12 09:45:58 poc2 kernel: scsi 7:0:0:0: Device offlined - not ready
after error recovery
Jun 12 09:45:58 poc2 kernel: general protection fault: 0000 [#1] SMP
Jun 12 09:45:58 poc2 kernel: [ 1629.052245] Modules linked in: tcm_fc
target_core_pscsi target_core_file target_core_iblock iscsi_target_mod
target_core_mod ipt_MASQUERADE xt_CHECKSUM 8021q fcoe libfcoe garp mrp
libfc scsi_transport_fc scsi_tgt ip6t_rpfilter ip6t_REJECT
xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter
ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6
ip6table_mangle ip6table_security ip6table_raw ip6table_filter
ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4
nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw
iTCO_wdt iTCO_vendor_support gpio_ich coretemp kvm_intel kvm
crc32c_intel microcode serio_raw ses enclosure nfsd auth_rpcgss
i7core_edac nfs_acl lockd edac_core sunrpc lpc_ich ioatdma mfd_core
shpchp i2c_i801 acpi_cpufreq radeon drm_kms_helper ttm ixgbe igb drm
mdio ata_generic ptp pata_acpi pps_core i2c_algo_bit pata_jmicron
aacraid i2c_core dca
kworker/13:3 Tainted: GF O 3.13.10-200.zbfcoepatch.fc20.x86_64 #1
Jun 12 09:45:58 poc2 kernel: [ 1629.053898] Hardware name: Supermicro
X8DTN/X8DTN, BIOS 2.1c 10/28/2011
Jun 12 09:45:58 poc2 kernel: [ 1629.054005] Workqueue: fc_wq_7
fc_rport_final_delete [scsi_transport_fc]
ffff880622600000 task.ti: ffff880622600000
0010:[<ffffffff81434e29>] [<ffffffff81434e29>]
scsi_device_put+0x19/0x50
Looks like the Scsi_Host is already released while its scsi device is
being released here, but fc_remove_host is called first and only then is
the scsi host finally released. This looks like it is due to a scsi
device left over from pending scan work, which was recently fixed by
this patch from Neil:-

http://patchwork.open-fcoe.org/patch/153/

I triggered a host reset myself a few times and didn't run into it. I
also tried with DEBUG_OBJECTS_FREE enabled and saw no warning with that
either.

[1277601.499732] host7: scsi: Resetting host
[1277601.500123] host7: lport 00a959: Entered RESET state from Ready
state
[1277601.501756] host7: rport 00aaf4: Remove port
[1277601.502121] host7: rport 00aaf4: Port sending LOGO from Ready state
[1277601.503145] host7: fip: els_send op 9 d_id aaf4
[1277601.503744] host7: rport 00aaf4: Delete port
[1277601.504214] host7: rport 00aaf4: work event 3
[1277601.504489] host7: fcp: 00aaf4: Returning DID_ERROR to scsi-ml due
to FC_DATA_UNDRUN (scsi)
[1277601.504520] host7: xid 201: f_ctl 90000 seq 1
[1277601.504583] host7: rport 00aaf4: Received a LOGO response closed
[1277601.504583] host7: lport 00a959: Entered FLOGI state from reset
state
[1277601.504583] host7: lport 00a959: Entered READY from state FLOGI
[1277601.504834] line:1378line:1378
[1277601.504834] scsi host7: libfc: Host reset succeeded on port
(00a959)
[1277601.507137] host7: rport 00aaf4: callback ev 3
[1277601.508396] host7: fip: vn_rport_callback aaf4 event 3
[1277601.509136] host7: rport 00aaf4: work delete

//Vasu
Post by Jun Wu
Jun 12 09:45:58 poc2 kernel: [ 1629.054338] RSP: 0018:ffff880622601d90
EFLAGS: 00010202
ffff880035e32000 RCX: 00000001820001a0
00000000820001a0 RDI: ffff880035e32000
0000000000000000 R09: 0000000000000001
ffffffff81316701 R12: ffff880035e32000
ffff88062dead000 R15: ffff880035870c00
Jun 12 09:45:58 poc2 kernel: [ 1629.054916] FS: 0000000000000000(0000)
GS:ffff88063fca0000(0000) knlGS:0000000000000000
Jun 12 09:45:58 poc2 kernel: [ 1629.055031] CS: 0010 DS: 0000 ES: 0000
CR0: 000000008005003b
0000000001c0c000 CR4: 00000000000007e0
Jun 12 09:45:58 poc2 kernel: [ 1629.055241] ffff880035e32000
ffff880035955860 ffff880622601de8 ffffffff81443348
Jun 12 09:45:58 poc2 kernel: [ 1629.055362] 0000000000000202
ffff88062dead000 ffff88062dead000 ffff880035955c40
Jun 12 09:45:58 poc2 kernel: [ 1629.055483] ffff880035955860
ffff880035955800 ffff88032e81a000 ffff880622601e20
Jun 12 09:45:58 poc2 kernel: [ 1629.055647] [<ffffffff81443348>]
scsi_remove_target+0x168/0x210
Jun 12 09:45:58 poc2 kernel: [ 1629.055737] [<ffffffffa0492e6c>]
fc_rport_final_delete+0xac/0x1f0 [scsi_transport_fc]
Jun 12 09:45:58 poc2 kernel: [ 1629.055855] [<ffffffff81087bf6>]
process_one_work+0x176/0x430
Jun 12 09:45:58 poc2 kernel: [ 1629.055941] [<ffffffff8108882b>]
worker_thread+0x11b/0x3a0
Jun 12 09:45:58 poc2 kernel: [ 1629.056022] [<ffffffff81088710>] ?
rescuer_thread+0x350/0x350
Jun 12 09:45:58 poc2 kernel: [ 1629.056110] [<ffffffff8108f2f2>]
kthread+0xd2/0xf0
Jun 12 09:45:58 poc2 kernel: [ 1629.056181] [<ffffffff8108f220>] ?
insert_kthread_work+0x40/0x40
Jun 12 09:45:58 poc2 kernel: [ 1629.056274] [<ffffffff81696dbc>]
ret_from_fork+0x7c/0xb0
Jun 12 09:45:58 poc2 kernel: [ 1629.056353] [<ffffffff8108f220>] ?
insert_kthread_work+0x40/0x40
Jun 12 09:46:21 poc1 kernel: [ 1667.721292] rport-7:0-0: blocked FC
remote port time out: removing target and saving binding
Jun 12 09:46:32 poc1 kernel: [ 1678.220757] scsi host7: libfc: Host
reset succeeded on port (0003ec)
Jun 12 09:46:42 poc1 kernel: [ 1688.229744] scsi host7: libfc: Host
reset succeeded on port (0003ec)
Nicholas A. Bellinger
2014-06-05 22:28:28 UTC
Post by Vasu Dev
Post by Nicholas A. Bellinger
Post by Jun Wu
The test setup includes one host and one target. The target exposes 10
hard drives (or 10 LUNs) on one fcoe port. The single initiator runs
10 fio processes simultaneously to the 10 target drives through fcoe
vn2vn. This is a simple configuration that other people may also want
to try.
Post by Vasu Dev
Exchange 0x6e4 is aborted and then the target is still sending frames;
the latter should not occur, but setting up the abort with a 0 msec
timeout does not look correct either, and it differs from the 8000 ms on
the initiator side.
Should the target stop sending frames after an abort? I still see a lot
of 0 msec messages on the target side. Is this something that should be
addressed?
Post by Vasu Dev
Reducing retries could narrow down whether early aborts are the cause
here; can you try with REC disabled on the initiator side, using this
change?
By disabling REC, have you confirmed that the early aborts are the
cause? Is the abort caused by the 0 msec timeout?
The 0 msec timeout still looks really suspicious..
IIRC, these timeout values are exchanged in the FLOGI request packet,
and/or in a separate Request Timeout Value (RTV) packet..
It might be worthwhile to track down where these zero timeout settings
are coming from, as it might be an indication of what's wrong.
How about the following patch to dump these values..?
Also just curious, have you tried running these two hosts in
point-to-point mode without the switch to see if the same types of
issues occur..? It might be useful to help isolate the problem space a
bit.
Vasu, any other ideas here..?
Your patch is good for debugging the 0 msec value; however, this may not
be the issue, since these come from incoming abort processing, and by
then the IO is aborted, which would cause the seq_send failures, as I
explained in the other response.
Nab, shall tcm_fc feed seq_send failures back to the target core in some
way that could help slow down the host request rate above the fcoe
transport?
So there are two options here..

One is to simply return -EAGAIN or -ENOMEM from the ->queue_data_in() or
->queue_status() callbacks to notify the target to delay + retry sending
the data-in + response.

The second is to go ahead and set SAM_STAT_TASK_SET_FULL status once
->seq_send() fails for data-in, and immediately attempt to send the
response with non-GOOD status. If ->seq_send() for the response
subsequently fails, then keep the SAM_STAT_TASK_SET_FULL status and
return -EAGAIN to force the target to requeue + resend only the
response packet.

Once the initiator receives SAM_STAT_TASK_SET_FULL status, it will lower
its queue depth and retry the original command once a new queue slot is
available in the LLD.. Another option would be to return BUSY status
instead, which will make the initiator retry the original command
without attempting to lower the outstanding queue_depth.

So I've got a few (untested) patches to implement this in tcm_fc for the
data-in seq_send() failure case, and will be sending them out shortly.
Please review + test.

Separate from these patches, is the amount of outstanding I/Os on the
network the main culprit here after reviewing the logs..? Considering
that the upstream fcoe initiator is using cmd_per_lun=3, it seems that
even with 10 LUNs on a single endpoint, the number of total outstanding
I/Os is still pretty low..

--nab
Vasu Dev
2014-06-04 21:43:36 UTC
Post by Jun Wu
The test setup includes one host and one target. The target exposes 10
hard drives (or 10 LUNs) on one fcoe port. The single initiator runs
10 fio processes simultaneously to the 10 target drives through fcoe
vn2vn. This is a simple configuration that other people may also want
to try.
Still, seq_send could fail on an aborted exchange, as I explained below,
under stress from 10 fio sessions from the same host. Thanks for
clarifying.
Post by Jun Wu
Post by Vasu Dev
Exchange 0x6e4 is aborted and then the target is still sending frames;
the latter should not occur, but setting up the abort with a 0 msec
timeout does not look correct either, and it differs from the 8000 ms on
the initiator side.
Should the target stop sending frames after an abort? I still see a lot
of 0 msec messages on the target side.
First, the target doesn't send aborts; the 0 msec prints happen on
receipt of an abort.
Post by Jun Wu
Is this something that should be
addressed?
Maybe, but it will not help in your case, since once an exchange is
aborted, subsequent seq_send calls on that exchange will fail.
Post by Jun Wu
Post by Vasu Dev
Reducing retries could narrow down whether early aborts are the cause
here; can you try with REC disabled on the initiator side, using this
change?
By disabling REC, have you confirmed that the early aborts are the
cause?
Can you count the instances before and after through the full log for a
*same duration run*? I would guess fewer, as we discussed before in the
context of increasing the timeouts below.
Post by Jun Wu
Is the abort caused by the 0 msec timeout?
No, the target doesn't initiate the abort; these prints are from
handling a received abort, and once an exchange is aborted, seq_send
will fail with noisy prints.

//Vasu
Post by Jun Wu
Thanks,
Jun
Post by Vasu Dev
Post by Jun Wu
Hi Vasu,
Is there anything else we can do to collect more information or adjust
the timeout value?
Increased timeouts will help in reducing the seq_send failure prints,
and that must have been the case with the REC-disabled patch try, since
that skipped REC completely and thereby defaulted to the longer SCSI
retry timeouts. In any case, the seq_send failure prints are just noise
here, since seq_send fails only for aborted exchanges, and that is very
likely with tcm_fc abort handling, especially in your oversubscribed
target setup with one target taking the workload of 10 fio sessions. I
mean seq_send fails only due to the following aborted-exchange check,
since thereafter fcoe_xmit always returns success.
	if (ep->esb_stat & (ESB_ST_COMPLETE | ESB_ST_ABNORMAL)) {
		fc_frame_free(fp);
		goto out;
	}
So I think it is okay to get rid of the send failure prints, the same as
tcm_fc does elsewhere when calling seq_send, with this patch:-
diff --git a/drivers/target/tcm_fc/tfc_io.c b/drivers/target/tcm_fc/tfc_io.c
index a48b96d..0425fc0 100644
--- a/drivers/target/tcm_fc/tfc_io.c
+++ b/drivers/target/tcm_fc/tfc_io.c
@@ -200,15 +200,7 @@ int ft_queue_data_in(struct se_cmd *se_cmd)
 			f_ctl |= FC_FC_END_SEQ;
 		fc_fill_fc_hdr(fp, FC_RCTL_DD_SOL_DATA, ep->did, ep->sid,
 			       FC_TYPE_FCP, f_ctl, fh_off);
-		error = lport->tt.seq_send(lport, seq, fp);
-		if (error) {
-			/* XXX For now, initiator will retry */
-			pr_err_ratelimited("%s: Failed to send frame %p, "
-					   "xid <0x%x>, remaining %zu, "
-					   "lso_max <0x%x>\n",
-					   __func__, fp, ep->xid,
-					   remaining, lport->lso_max);
-		}
+		lport->tt.seq_send(lport, seq, fp);
 	}
 	return ft_queue_status(se_cmd);
 }
Post by Jun Wu
Thanks,
Jun
Post by Jun Wu
We do not have DCB/PFC, so the traffic is managed by PAUSE frames by default.
The kernel has been recompiled to include your change. I ran the same
test after the kernel change, i.e. running 10 fio sessions from
initiator to 10 drives on target. The target machine hung after a few
minutes.
The target could either be really overloaded by your workload in this
10-sessions-to-single-target setup, or it could really be stuck. For the
latter case, enabling hung task detection under kernel hacking could
help, as that would dump the stack trace of the hung task. If it is the
former case, then I don't know how the target back end handles sustained
oversubscription of the target bandwidth; maybe Nick could shed some
info on that, and possibly tcm_fc is missing related feedback into the
target core.
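For reference, the hung task detector mentioned above is a standard kernel-hacking option plus two sysctls; the option and sysctl names below are the upstream ones, while the timeout values are only illustrative:

```
# Build-time ("Kernel hacking" -> Detect Hung Tasks):
CONFIG_DETECT_HUNG_TASK=y
CONFIG_DEFAULT_HUNG_TASK_TIMEOUT=120

# Runtime tuning (sysctl):
kernel.hung_task_timeout_secs = 120   # dump a stack trace for tasks blocked this long
kernel.hung_task_panic = 0            # 1 = panic instead of just warning
```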
If your workload exceeds capacity only intermittently, then adjusting
the timeouts as you mentioned above, or increasing the resources along
the whole Tx/Rx path, could help absorb such intermittent incidents and
in turn avoid host IO timeouts/failures.
//Vasu
Post by Jun Wu
Post by Jun Wu
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
xid 01cd-0a84: DDP I/O in fc_fcp_recv_data set ERROR
f_ctl 90000 seq 1
4470 May 27 10:44:14 poc1 kernel: [ 1726.315666] host7: xid 1cd: BLS
rctl 85 - BLS reject received
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
xid 01ee-0c48: DDP I/O in fc_fcp_recv_data set ERROR
f_ctl 90000 seq 1
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
I see much less Exchange timer messages than last time. Here are all
xid 0181-090d: DDP I/O in fc_fcp_recv_data set ERROR
f_ctl 90000 seq 1
xid 0005-0327: DDP I/O in fc_fcp_recv_data set ERROR
f_ctl 90000 seq 1
xid 0060-0dce: DDP I/O in fc_fcp_recv_data set ERROR
f_ctl 90000 seq 1
exch: BLS rctl 84 - BLS accept
Returning DID_ERROR to scsi-ml due to FC_CMD_ABORTED
Exchange timer armed : 10000 msecs
xid 00cc-090d: DDP I/O in fc_fcp_recv_data set ERROR
f_ctl 90000 seq 1
6532 May 27 10:44:46 poc1 kernel: [ 1757.763983] host7: xid 60: BLS
rctl 85 - BLS reject received
6533 May 27 10:44:46 poc1 kernel: [ 1757.765447] host7: xid 14d: BLS
rctl 85 - BLS reject received
6534 May 27 10:44:46 poc1 kernel: [ 1757.766258] host7: xid 18b: BLS
rctl 85 - BLS reject received
6535 May 27 10:44:46 poc1 kernel: [ 1757.767116] host7: xid ed: BLS
rctl 85 - BLS reject received
6536 May 27 10:44:46 poc1 kernel: [ 1757.767144] host7: xid 21: BLS
rctl 85 - BLS reject received
6537 May 27 10:44:46 poc1 kernel: [ 1757.768161] host7: xid 122: BLS
rctl 85 - BLS reject received
6538 May 27 10:44:46 poc1 kernel: [ 1757.768184] host7: xid 45: BLS
rctl 85 - BLS reject received
6539 May 27 10:44:46 poc1 kernel: [ 1757.768437] host7: xid 16f: BLS
rctl 85 - BLS reject received
6540 May 27 10:44:46 poc1 kernel: [ 1757.768880] host7: xid 14c: BLS
rctl 85 - BLS reject received
6541 May 27 10:44:46 poc1 kernel: [ 1757.772813] host7: xid 8a: BLS
rctl 85 - BLS reject received
6542 May 27 10:44:46 poc1 kernel: [ 1757.778209] host7: xid 181: BLS
rctl 85 - BLS reject received
6543 May 27 10:44:46 poc1 kernel: [ 1757.778226] host7: xid 16b: BLS
rctl 85 - BLS reject received
6544 May 27 10:44:46 poc1 kernel: [ 1757.778905] host7: xid cc: BLS
rctl 85 - BLS reject received
Exchange timed out
Exchange timer armed : 2000 msecs
Exchange timer canceled
LS_RJT for RRQ
Resetting rport (00061e)
6550 May 27 10:45:16 poc1 kernel: [ 1788.034071] host7: scsi: lun
reset to lun 0 completed
and
xid 06ae-0985: fsp=ffff880c0e176700:lso:blen=1000 lso_max=0x10000
t_blen=1000
f_ctl 90000 seq 1
xid 068e-07a3: fsp=ffff880c0e177400:lso:blen=1000 lso_max=0x10000
t_blen=1000
f_ctl 90000 seq 1
xid 0c2e-0b04: fsp=ffff880c1cc4c800:lso:blen=1000 lso_max=0x10000
t_blen=1000
f_ctl 90000 seq 1
7474 May 27 10:45:17 poc1 kernel: [ 1789.296282] host7: xid 1c4: BLS
rctl 85 - BLS reject received
7475 May 27 10:45:17 poc1 kernel: [ 1789.296284] host7: xid 8c: BLS
rctl 85 - BLS reject received
exch: BLS rctl 84 - BLS accept
Returning DID_ERROR to scsi-ml due to FC_CMD_ABORTED
Exchange timer armed : 10000 msecs
7479 May 27 10:45:17 poc1 kernel: [ 1789.298873] host7: xid ef: BLS
rctl 85 - BLS reject received
xid 01cf-0e8e: DDP I/O in fc_fcp_recv_data set ERROR
f_ctl 90000 seq 1
7482 May 27 10:45:17 poc1 kernel: [ 1789.299318] host7: xid 1cf: BLS
rctl 85 - BLS reject received
xid 00c4-048c: DDP I/O in fc_fcp_recv_data set ERROR
f_ctl 90000 seq 1
7485 May 27 10:45:17 poc1 kernel: [ 1789.299716] host7: xid c4: BLS
rctl 85 - BLS reject received
7486 May 27 10:45:17 poc1 kernel: [ 1789.299733] host7: xid e6: BLS
rctl 85 - BLS reject received
7487 May 27 10:45:17 poc1 kernel: [ 1789.299762] host7: xid 87: BLS
rctl 85 - BLS reject received
7488 May 27 10:45:17 poc1 kernel: [ 1789.299767] host7: xid 107: BLS
rctl 85 - BLS reject received
xid 0042-050f: DDP I/O in fc_fcp_recv_data set ERROR
f_ctl 90000 seq 1
7491 May 27 10:45:17 poc1 kernel: [ 1789.303813] host7: xid 42: BLS
rctl 85 - BLS reject received
7492 May 27 10:45:17 poc1 kernel: [ 1789.304596] host7: xid 7: BLS
rctl 85 - BLS reject received
exch: BLS rctl 84 - BLS accept
Returning DID_ERROR to scsi-ml due to FC_CMD_ABORTED
7495 May 27 10:45:17 poc1 kernel: [ 1789.306357] host7: xid 10d: BLS
rctl 85 - BLS reject received
7496 May 27 10:45:17 poc1 kernel: [ 1789.306797] host7: xid 18e: BLS
rctl 85 - BLS reject received
xid 01e4-04a7: DDP I/O in fc_fcp_recv_data set ERROR
f_ctl 90000 seq 1
7499 May 27 10:45:17 poc1 kernel: [ 1789.307628] host7: xid 1e4: BLS
rctl 85 - BLS reject received
7500 May 27 10:45:17 poc1 kernel: [ 1789.311225] host7: xid 8d: BLS
rctl 85 - BLS reject received
xid 0c64-0de8: fsp=ffff88061c238c00:lso:blen=1000 lso_max=0x10000
t_blen=1000
f_ctl 90000 seq 1
Exchange timed out
Exchange timer armed : 2000 msecs
Exchange timed out
f_ctl 90000 seq 1
Exchange timer armed : 20000 msecs
Remove port
Port sending LOGO from Ready state
and finally
Returning DID_ERROR to scsi-ml due to FC_DATA_UNDRUN (scsi)
f_ctl 90000 seq 1
Received a LOGO response closed
Exchange timer canceled
work delete
blocked FC remote port time out: removing target and saving binding
The messages on the target side are similar as before.
f_ctl 800000 seq 1
f_ctl 880008 seq 2
[the "f_ctl 800000 seq 1" / "f_ctl 880008 seq 2" pair repeats roughly a
dozen more times, occasionally interleaved]
Exchange timer armed : 0 msecs
f_ctl 800000 seq 1
Exchange timed out
f_ctl 800000 seq 1
f_ctl 880008 seq 2
[the same pair repeats another fifteen times]
f_ctl 800000 seq 1
Exchange timer armed : 0 msecs
f_ctl 800008 seq 2
Failed to send frame ffff88062cbd3a00, xid <0x327>, remaining 131072,
lso_max <0x10000>
Failed to send frame ffff88062cbd3a00, xid <0x327>, remaining 65536,
lso_max <0x10000>
Failed to send frame ffff88062cbd3a00, xid <0x327>, remaining 0,
lso_max <0x10000>
After stripping out the "host7: xid xxx: f_ctl 8x0000 seq x" lines, the
remaining messages are:
15476415:May 27 10:42:37 poc2 kernel: [ 1811.571399] perf samples too
long (2670 > 2500), lowering kernel.perf_event_max_sample_rate to
50000
Exchange timer armed : 0 msecs
Exchange timed out
[the "Exchange timer armed : 0 msecs" / "Exchange timed out" pair repeats
roughly fifty times]
Exchange timer armed : 0 msecs
Failed to send frame ffff88062cbd3a00, xid <0x327>, remaining 131072,
lso_max <0x10000>
Failed to send frame ffff88062cbd3a00, xid <0x327>, remaining 65536,
lso_max <0x10000>
Failed to send frame ffff88062cbd3a00, xid <0x327>, remaining 0,
lso_max <0x10000>
Exchange timed out
Exchange timer armed : 2000 msecs
Exchange timed out
f_ctl 90000 seq 1
Exchange timer armed : 20000 msecs
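For reference, the stripping step mentioned above can be done with a grep
one-liner. This is only a sketch: the sample lines and the regex are
assumptions modeled on the messages quoted in this thread (in practice you
would run the grep against /var/log/messages on the test machine).

```shell
# Build a tiny sample log resembling the messages in this thread
# (hypothetical content, for illustration only).
printf '%s\n' \
    'host7: xid 1cf: f_ctl 800000 seq 1' \
    'host7: xid 1cf: f_ctl 880008 seq 2' \
    'host7: xid c4: BLS rctl 85 - BLS reject received' > sample.log

# Drop the per-frame "f_ctl 8x0000 seq x" noise, keep everything else.
grep -vE 'f_ctl 8[0-9a-f]000[08] seq [0-9]' sample.log > filtered.log
cat filtered.log    # only the BLS reject line survives
```

The same `grep -vE` pattern applied to the full log is what reduces the dump
to the BLS/abort/timeout messages shown above.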
Thanks
Post by Vasu Dev
Post by Jun Wu
Post by Jun Wu
18630751 May 21 11:01:53 poc2 kernel: [ 1528.191740] host7: xid
Exchange timer armed : 0 msecs
18630752 May 21 11:01:53 poc2 kernel: [ 1528.191747] host7: xid
f_ctl 800000 seq 1
18630753 May 21 11:01:53 poc2 kernel: [ 1528.191756] host7: xid
f_ctl 800000 seq 2
18630754 May 21 11:01:53 poc2 kernel: [ 1528.191763] host7: xid
Exchange timed out
18630755 May 21 11:01:53 poc2 kernel: [ 1528.191777]
Failed to send frame ffff8805bf245e00, xid <0x6e4>, remaining
458752,
lso_max <0x10000>
Exchange 0x6e4 is aborted and then the target is still sending frames; the
latter should not occur, but setting up the abort with a 0 msec timeout in
the first place does not look correct either, and it differs from the
8000 ms on the initiator side. Are you using the same switch between
initiator and target, with DCB enabled?
Reducing retries could help narrow down whether early aborts are the cause
here. Can you try with REC disabled on the initiator side, using this
change?
diff --git a/drivers/scsi/libfc/fc_fcp.c b/drivers/scsi/libfc/fc_fcp.c
index 1d7e76e..43fe793 100644
--- a/drivers/scsi/libfc/fc_fcp.c
+++ b/drivers/scsi/libfc/fc_fcp.c
@@ -1160,6 +1160,7 @@ static int fc_fcp_cmd_send(struct fc_lport *lport, struct fc_fcp_pkt *fsp,
 	fc_fcp_pkt_hold(fsp);	/* hold for fc_fcp_pkt_destroy */
 	setup_timer(&fsp->timer, fc_fcp_timeout, (unsigned long)fsp);
+	rpriv->flags &= ~FC_RP_FLAGS_REC_SUPPORTED;
 	if (rpriv->flags & FC_RP_FLAGS_REC_SUPPORTED)
 		fc_fcp_timer_set(fsp, get_fsp_rec_tov(fsp));