Discussion:
Dual target node ALUA multipathing for VMware
Robert Wood
2014-09-06 04:19:17 UTC
Good afternoon,

We are testing a dual-node Pacemaker-based cluster to deliver Fibre
Channel LUNs to VMware (ESXi 5.5). The OS is Ubuntu 14.04 (Trusty); the
LIO version is:

Target Engine Core ConfigFS Infrastructure v4.1.0 on Linux/x86_64 on
3.14.8-031408-generic

Each of the two LIO nodes (e1, kio1) has a dual-port QLA2462 Fibre
Channel HBA, and the initiator node also has a QLA2462. These are
connected via a Brocade 4Gb SilkWorm switch.

We are setting a separate TPG ID on each node while keeping the LUN
serial number the same; the back-end storage is a Ceph image presented
identically to both LIO nodes.
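
For reference, this is roughly how we wire that up through configfs on
kio1 (a sketch rather than our exact scripts; the device, group, and
port names match the listings further down, and SHARED_SERIAL is a
placeholder for our common value):

# Same unit serial on both nodes, so the initiator sees one LU behind
# all ports (SHARED_SERIAL stands in for the real value).
echo SHARED_SERIAL > /sys/kernel/config/target/core/iblock_4711/p_FCLun_test2/wwn/vpd_unit_serial

# Node-local target port group with a node-specific ID (2 on kio1).
mkdir /sys/kernel/config/target/core/iblock_4711/p_FCLun_test2/alua/kio1
echo 2 > /sys/kernel/config/target/core/iblock_4711/p_FCLun_test2/alua/kio1/tg_pt_gp_id

# This node's primary ALUA state: 2 = Standby (e1 uses 0 = Active/Optimized).
echo 2 > /sys/kernel/config/target/core/iblock_4711/p_FCLun_test2/alua/kio1/alua_access_state

# Put kio1's fabric port LUNs into its group.
echo kio1 > /sys/kernel/config/target/qla2xxx/naa.21000024ff4f0dee/tpgt_1/lun/lun_1/alua_tg_pt_gp
echo kio1 > /sys/kernel/config/target/qla2xxx/naa.21000024ff4f0def/tpgt_1/lun/lun_1/alua_tg_pt_gp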

One node sets the ALUA state to Standby, while the other sets it to
Active/Optimized. However, when both nodes are presenting the same LUN
to VMware, only one node's TPG ID is reported for all paths to storage,
the ALUA state is Standby for all LUNs, and there are many errors in
the VMware host's vmkernel.log, such as:

2014-09-06T03:38:58.748Z cpu2:45460)WARNING: NMP:
nmpDeviceAttemptFailover:603: Retry world failover device
"naa.6001405afcc298020000000000000000" - issuing command
0x41364041a100
2014-09-06T03:38:58.748Z cpu2:45460)WARNING: NMP:
nmp_SelectPathAndIssueCommand:3174: PSP selected path
"vmhba2:C0:T0:L1" in a bad state (standby) on device
"naa.6001405afcc298020000000000000000".
2014-09-06T03:38:58.748Z cpu2:45460)WARNING: NMP:
nmpCompleteRetryForPath:352: Retry cmd 0x1a (0x41364041a100) to dev
"naa.6001405afcc298020000000000000000" failed on path
"vmhba2:C0:T0:L1" H:0x1 D:0x0 P:0x0 Possible sense data: 0x2 0x3a
2014-09-06T03:38:58.748Z cpu2:45460)WARNING: NMP:
nmpCompleteRetryForPath:382: Logical device
"naa.6001405afcc298020000000000000000": awaiting fast path state
update before retrying failed command again...
2014-09-06T03:38:59.748Z cpu17:37198)WARNING: VMW_SATP_ALUA:
satp_alua_determineStatus:563: VMW_SATP_ALUA:Unknown Check condition
0/2 0x5 0x26 0x0.
2014-09-06T03:38:59.748Z cpu17:37198)WARNING: VMW_SATP_ALUA:
satp_alua_issueCommandOnPath:685: Probe cmd 0xa4 failed for path
"vmhba2:C0:T0:L1" (0x5/0x26/0x0). Check if failover mode is still
ALUA.
2014-09-06T03:38:59.748Z cpu17:37198)WARNING: VMW_SATP_ALUA:
satp_alua_issueSTPG:490: STPG failed on path "vmhba2:C0:T0:L1"

Here is some LIO info:

root@kio1:/sys/kernel/config/target# tcm_node --listtgptgps
iblock_4711/p_FCLun_test2
\------> kio1 Target Port Group ID: 2
Active ALUA Access Type(s): Implicit and Explicit
Primary Access State: Standby
Primary Access Status: Altered by Implicit ALUA
Preferred Bit: 0
Active/NonOptimized Delay in milliseconds: 100
Transition Delay in milliseconds: 0
\------> TG Port Group Members
qla2xxx/naa.21000024ff4f0dee/tpgt_1/lun_1
qla2xxx/naa.21000024ff4f0def/tpgt_1/lun_1

\------> default_tg_pt_gp Target Port Group ID: 0
Active ALUA Access Type(s): Implicit and Explicit
Primary Access State: Active/Optimized
Primary Access Status: None
Preferred Bit: 0
Active/NonOptimized Delay in milliseconds: 100
Transition Delay in milliseconds: 0
\------> TG Port Group Members
No Target Port Group Members

root@e1:/var/log# tcm_node --listtgptgps iblock_4711/p_FCLun_test
\------> e1 Target Port Group ID: 1
Active ALUA Access Type(s): Implicit and Explicit
Primary Access State: Active/Optimized
Primary Access Status: None
Preferred Bit: 0
Active/NonOptimized Delay in milliseconds: 100
Transition Delay in milliseconds: 0
\------> TG Port Group Members
qla2xxx/naa.21000024ff4f0f20/tpgt_1/lun_1
qla2xxx/naa.21000024ff4f0f21/tpgt_1/lun_1

\------> default_tg_pt_gp Target Port Group ID: 0
Active ALUA Access Type(s): Implicit and Explicit
Primary Access State: Active/Optimized
Primary Access Status: None
Preferred Bit: 0
Active/NonOptimized Delay in milliseconds: 100
Transition Delay in milliseconds: 0
\------> TG Port Group Members
No Target Port Group Members

VMware does not identify both TPGs, only one:

naa.6001405afcc298020000000000000000
Device Display Name: LIO-ORG Fibre Channel Disk
(naa.6001405afcc298020000000000000000)
Storage Array Type: VMW_SATP_ALUA
Storage Array Type Device Config:
{implicit_support=on;explicit_support=on;
explicit_allow=on;alua_followover=on;{TPG_id=1,TPG_state=AO}{TPG_id=1,TPG_state=AO}{TPG_id=1,TPG_state=AO}{TPG_id=1,TPG_state=AO}}
Path Selection Policy: VMW_PSP_MRU
Path Selection Policy Device Config: Current Path=vmhba2:C0:T0:L1
Path Selection Policy Device Custom Config:
Working Paths: vmhba2:C0:T0:L1
Is Local SAS Device: false
Is Boot USB Device: false

So the questions are:

- Are we missing something in the LIO setup that would make the VMware
host recognize two separate TPGs?

- http://www.spinics.net/lists/target-devel/msg07102.html - this
appears to be needed for reprobing, but is it also required for
detection?

Thank you,
RW
Robert Wood
2014-09-09 14:35:38 UTC
Post by Robert Wood
One node sets the ALUA state to Standby, while the other sets it to
Active/Optimized. However, when both nodes are presenting the same LUN
to VMware, only one node's TPG ID is reported for all paths to storage,
the ALUA state is Standby for all LUNs, and there are
So far we have found that target port groups of the same name must be
present on both hosts, even if they have no members. We suspect that LIO
generates some TPG IDs behind the scenes, but this is not documented. As
long as all groups with the same names are present on all hosts, the
initiator sees the correct ALUA states.
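
Concretely, the fix is along these lines (a sketch with our device and
group names; each node gains an empty copy of its peer's group with the
matching ID and state):

# On e1: create kio1's group as well, same ID, no members.
mkdir /sys/kernel/config/target/core/iblock_4711/p_FCLun_test/alua/kio1
echo 2 > /sys/kernel/config/target/core/iblock_4711/p_FCLun_test/alua/kio1/tg_pt_gp_id
echo 2 > /sys/kernel/config/target/core/iblock_4711/p_FCLun_test/alua/kio1/alua_access_state  # 2 = Standby

# On kio1: mirror e1's group, also empty.
mkdir /sys/kernel/config/target/core/iblock_4711/p_FCLun_test2/alua/e1
echo 1 > /sys/kernel/config/target/core/iblock_4711/p_FCLun_test2/alua/e1/tg_pt_gp_id
echo 0 > /sys/kernel/config/target/core/iblock_4711/p_FCLun_test2/alua/e1/alua_access_state  # 0 = Active/Optimized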

We are working on testing failover and wondering whether there are any
controls over the described behavior.

RW
Mike Christie
2014-09-10 23:15:07 UTC
Post by Robert Wood
Post by Robert Wood
One node sets the ALUA state to Standby, while the other sets it to
Active/Optimized. However, when both nodes are presenting the same LUN
to VMware, only one node's TPG ID is reported for all paths to storage,
the ALUA state is Standby for all LUNs, and there are
So far we have found that target port groups of the same name must be
present on both hosts, even if they have no members. We suspect that LIO
generates some TPG IDs behind the scenes, but this is not documented. As
long as all groups with the same names are present on all hosts, the
initiator sees the correct ALUA states.
Yeah, ESXi requires that it gets the same info from all ports that the
LU can be accessed through. This includes group id values and also ALUA
states. In your example you had group id 2 showing standby on kio1, but
that group did not exist on e1. Also e1 had a group id 1, but kio1 did not.

So in your example we would want to see:


root@kio1:/sys/kernel/config/target# tcm_node --listtgptgps
iblock_4711/p_FCLun_test2
\------> kio1 Target Port Group ID: 0
Active ALUA Access Type(s): Implicit and Explicit
Primary Access State: Standby
Primary Access Status: Altered by Implicit ALUA
Preferred Bit: 0
Active/NonOptimized Delay in milliseconds: 100
Transition Delay in milliseconds: 0
\------> TG Port Group Members
qla2xxx/naa.21000024ff4f0dee/tpgt_1/lun_1
qla2xxx/naa.21000024ff4f0def/tpgt_1/lun_1

\------> default_tg_pt_gp Target Port Group ID: 1
Active ALUA Access Type(s): Implicit and Explicit
Primary Access State: Active/Optimized
Primary Access Status: None
Preferred Bit: 0
Active/NonOptimized Delay in milliseconds: 100
Transition Delay in milliseconds: 0
\------> TG Port Group Members
No Target Port Group Members

root@e1:/var/log# tcm_node --listtgptgps iblock_4711/p_FCLun_test
\------> e1 Target Port Group ID: 1
Active ALUA Access Type(s): Implicit and Explicit
Primary Access State: Active/Optimized
Primary Access Status: None
Preferred Bit: 0
Active/NonOptimized Delay in milliseconds: 100
Transition Delay in milliseconds: 0
\------> TG Port Group Members
qla2xxx/naa.21000024ff4f0f20/tpgt_1/lun_1
qla2xxx/naa.21000024ff4f0f21/tpgt_1/lun_1

\------> default_tg_pt_gp Target Port Group ID: 0
Active ALUA Access Type(s): Implicit and Explicit
Primary Access State: Standby
Primary Access Status: None
Preferred Bit: 0
Active/NonOptimized Delay in milliseconds: 100
Transition Delay in milliseconds: 0
\------> TG Port Group Members
No Target Port Group Members
Post by Robert Wood
We are working on testing failover and wondering whether there are any
controls over the described behavior.
For failover, how are you handling the alua state changes? In my example
above e1 has the active paths. If they went down, ESXi would send an STPG
to kio1. We would then need to resync the alua state. It would look
something like this, where group ID 0 goes to active and group ID 1 goes
to standby on both nodes [note the status info might be wrong below, but
you get the idea with the groups and alua states changing]:



root@kio1:/sys/kernel/config/target# tcm_node --listtgptgps
iblock_4711/p_FCLun_test2
\------> kio1 Target Port Group ID: 0
Active ALUA Access Type(s): Implicit and Explicit
Primary Access State: Active/Optimized
Primary Access Status: Altered by Explicit ALUA
Preferred Bit: 0
Active/NonOptimized Delay in milliseconds: 100
Transition Delay in milliseconds: 0
\------> TG Port Group Members
qla2xxx/naa.21000024ff4f0dee/tpgt_1/lun_1
qla2xxx/naa.21000024ff4f0def/tpgt_1/lun_1

\------> default_tg_pt_gp Target Port Group ID: 1
Active ALUA Access Type(s): Implicit and Explicit
Primary Access State: Standby
Primary Access Status: None
Preferred Bit: 0
Active/NonOptimized Delay in milliseconds: 100
Transition Delay in milliseconds: 0
\------> TG Port Group Members
No Target Port Group Members

root@e1:/var/log# tcm_node --listtgptgps iblock_4711/p_FCLun_test
\------> e1 Target Port Group ID: 1
Active ALUA Access Type(s): Implicit and Explicit
Primary Access State: Standby
Primary Access Status: None
Preferred Bit: 0
Active/NonOptimized Delay in milliseconds: 100
Transition Delay in milliseconds: 0
\------> TG Port Group Members
qla2xxx/naa.21000024ff4f0f20/tpgt_1/lun_1
qla2xxx/naa.21000024ff4f0f21/tpgt_1/lun_1

\------> default_tg_pt_gp Target Port Group ID: 0
Active ALUA Access Type(s): Implicit and Explicit
Primary Access State: Active/Optimized
Primary Access Status: None
Preferred Bit: 0
Active/NonOptimized Delay in milliseconds: 100
Transition Delay in milliseconds: 0
\------> TG Port Group Members
No Target Port Group Members
Robert Wood
2014-09-14 01:18:18 UTC
Hi Mike,
Post by Mike Christie
Yeah, ESXi requires that it gets the same info from all ports that the
LU can be accessed through. This includes group id values and also ALUA
states. In your example you had group id 2 showing standby on kio1, but
that group did not exist on e1. Also e1 had a group id 1, but kio1 did not.
For failover, how are you handling the alua state changes? In my example
above e1 has the active paths. If they went down, ESXi would send an STPG
to kio1. We would then need to resync the alua state. It would look
something like this, where group ID 0 goes to active and group ID 1 goes
to standby on both nodes [note the status info might be wrong below, but
you
root@kio1:/sys/kernel/config/target# tcm_node --listtgptgps
iblock_4711/p_FCLun_test2
\------> kio1 Target Port Group ID: 0
Active ALUA Access Type(s): Implicit and Explicit
Primary Access State: Active/Optimized
Primary Access Status: Altered by Explicit ALUA
Preferred Bit: 0
Active/NonOptimized Delay in milliseconds: 100
Transition Delay in milliseconds: 0
\------> TG Port Group Members
qla2xxx/naa.21000024ff4f0dee/tpgt_1/lun_1
qla2xxx/naa.21000024ff4f0def/tpgt_1/lun_1
\------> default_tg_pt_gp Target Port Group ID: 1
Active ALUA Access Type(s): Implicit and Explicit
Primary Access State: Standby
Primary Access Status: None
Preferred Bit: 0
Active/NonOptimized Delay in milliseconds: 100
Transition Delay in milliseconds: 0
\------> TG Port Group Members
No Target Port Group Members
root@e1:/var/log# tcm_node --listtgptgps iblock_4711/p_FCLun_test
\------> e1 Target Port Group ID: 1
Active ALUA Access Type(s): Implicit and Explicit
Primary Access State: Standby
Primary Access Status: None
Preferred Bit: 0
Active/NonOptimized Delay in milliseconds: 100
Transition Delay in milliseconds: 0
\------> TG Port Group Members
qla2xxx/naa.21000024ff4f0f20/tpgt_1/lun_1
qla2xxx/naa.21000024ff4f0f21/tpgt_1/lun_1
\------> default_tg_pt_gp Target Port Group ID: 0
Active ALUA Access Type(s): Implicit and Explicit
Primary Access State: Active/Optimized
Primary Access Status: None
Preferred Bit: 0
Active/NonOptimized Delay in milliseconds: 100
Transition Delay in milliseconds: 0
\------> TG Port Group Members
No Target Port Group Members
If I understand you correctly, the following configuration must be
present when the LUN is active on node e1:

Node e1 (LUN is a member of both groups):
TPG name: e1, TPG ID = 1, ALUA state: Active/Optimized
TPG name: kio1, TPG ID = 2, ALUA state: Standby

Node kio1 (LUN is *not* a member of either group):
TPG name: e1, TPG ID = 1, ALUA state: Active/Optimized
TPG name: kio1, TPG ID = 2, ALUA state: Standby

And this would be the change when we want the LUN to switch to node kio1
(or node e1 crashes):

Node e1 (if present) (LUN is *not* a member of either group):
TPG name: e1, TPG ID = 1, ALUA state: Standby
TPG name: kio1, TPG ID = 2, ALUA state: Active/Optimized

Node kio1 (LUN is a member of both groups):
TPG name: e1, TPG ID = 1, ALUA state: Standby
TPG name: kio1, TPG ID = 2, ALUA state: Active/Optimized
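
In configfs terms I would expect that switch to be four state writes (a
sketch with our device and group names; 0 and 2 are the kernel's
Active/Optimized and Standby state codes):

# On kio1 (taking over):
echo 2 > /sys/kernel/config/target/core/iblock_4711/p_FCLun_test2/alua/e1/alua_access_state    # e1 group -> Standby
echo 0 > /sys/kernel/config/target/core/iblock_4711/p_FCLun_test2/alua/kio1/alua_access_state  # kio1 group -> Active/Optimized

# On e1, if it is still up, the same two writes against its device:
echo 2 > /sys/kernel/config/target/core/iblock_4711/p_FCLun_test/alua/e1/alua_access_state
echo 0 > /sys/kernel/config/target/core/iblock_4711/p_FCLun_test/alua/kio1/alua_access_state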

But shouldn't it work to just have one TPG with the same name and ID on both
nodes and simply switch the ALUA state and membership of the LUN on
failover?

We are preparing to test failover under load and will post results in
the coming week.

Best regards,
RW
Robert Wood
2014-09-15 22:31:39 UTC
Hi Mike,
Post by Robert Wood
If I understand you correctly, the following configuration must be present
TPG name: e1, TPG ID = 1, ALUA state: Active/Optimized
TPG name: kio1, TPG ID = 2, ALUA state: Standby
TPG name: e1, TPG ID = 1, ALUA state: Active/Optimized
TPG name: kio1, TPG ID = 2, ALUA state: Standby
And this would be the change when we want the LUN to switch to node kio1 (or
TPG name: e1, TPG ID = 1, ALUA state: Standby
TPG name: kio1, TPG ID = 2, ALUA state: Active/Optimized
TPG name: e1, TPG ID = 1, ALUA state: Standby
TPG name: kio1, TPG ID = 2, ALUA state: Active/Optimized
OK, I think we have progressed in understanding what VMware expects.
VMware basically does not care that we have Linux boxes with LIO
delivering storage. In fact, it sees all ports as one big array
delivering a given LU. And VMware expects to fail over from one TPG to
another TPG upon failure of all paths to a particular TPG.

Example: if the nodes are e1 and kio1, the TPGs are 0 and 1, S = Standby,
A = Active/Optimized, (l) means the LU is a member, and (n) means no
members, then in our initial state we have:

TPG -->>   0      1
kio1       S(l)   A(n)
e1         S(n)   A(l)

Then when we fail over, we just "catch" where VMware is going to go with
its STPG request:

TPG -->>   0      1
kio1       A(l)   S(n)
e1         A(n)   S(l)

We believe this can be implemented with Pacemaker and will run that test in
a couple of days. A manual test seems to work OK.
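
The manual test was essentially the following, run from an admin box (a
sketch; it assumes root ssh to both nodes, per-node group names kio1 and
e1 as before, and that kio1's group is TPG 0 in the tables above):

# Flip kio1's group to Active/Optimized and e1's group to Standby on
# both nodes, matching the second table.
for host in kio1 e1; do
    ssh root@$host 'for alua in /sys/kernel/config/target/core/iblock_4711/*/alua; do
        echo 0 > "$alua"/kio1/alua_access_state   # 0 = Active/Optimized
        echo 2 > "$alua"/e1/alua_access_state     # 2 = Standby
    done'
done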

Thanks,
RW
Mike Christie
2014-09-16 16:58:06 UTC
Post by Robert Wood
Hi Mike,
Post by Robert Wood
If I understand you correctly, the following configuration must be present
TPG name: e1, TPG ID = 1, ALUA state: Active/Optimized
TPG name: kio1, TPG ID = 2, ALUA state: Standby
TPG name: e1, TPG ID = 1, ALUA state: Active/Optimized
TPG name: kio1, TPG ID = 2, ALUA state: Standby
And this would be the change when we want the LUN to switch to node kio1 (or
TPG name: e1, TPG ID = 1, ALUA state: Standby
TPG name: kio1, TPG ID = 2, ALUA state: Active/Optimized
TPG name: e1, TPG ID = 1, ALUA state: Standby
TPG name: kio1, TPG ID = 2, ALUA state: Active/Optimized
OK, I think we have progressed in understanding what VMware expects.
VMware basically does not care that we have Linux boxes with LIO
delivering storage. In fact, it sees all ports as one big array
delivering a given LU. And VMware expects to fail over from one TPG to
another TPG upon failure of all paths to a particular TPG.
Example: if the nodes are e1 and kio1, the TPGs are 0 and 1, S = Standby,
A = Active/Optimized, (l) means the LU is a member, and (n) means no
members, then in
TPG -->>   0      1
kio1       S(l)   A(n)
e1         S(n)   A(l)
Then when we fail over, we just "catch" where VMware is going to go with
its
TPG -->>   0      1
kio1       A(l)   S(n)
e1         A(n)   S(l)
We believe this can be implemented with Pacemaker and will run that test in
a couple of days. A manual test seems to work OK.
Yes, this is what I was saying before. This brings us back to the
question about how you are going to handle that STPG.

For the STPG handling I am working on a patch, so when we get an STPG,
we can call out to userspace and do whatever needs to be done. So if you
are using Pacemaker, then you can run those cibadmin commands to move the
resource to the other node.
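
For what it's worth, the userspace end of that callout can be tiny. A
sketch, assuming a Pacemaker resource named alua-lun1 (hypothetical)
that carries the active role; exact flags vary a little across Pacemaker
versions:

# Hypothetical STPG handler: move the resource to the node whose
# target port group the initiator just asked to make active.
crm_resource --resource alua-lun1 --move --node kio1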

Are you using a different method to catch the STPGs?
