Clustered LIO Using RBD
Mike Christie, Red Hat
Oct 28, 2014
Agenda
● State of HA SCSI Target Support in Linux.
● Difficulties adding Active/Active Support.
● Future Work.
State of Linux Open Source HA SCSI Targets
● Active/Passive.
● Pacemaker support for IET, TGT, SCST and LIO.
● Node-level failover when the target node goes down.
● Relies on virtual ports/portals (IP takeover for iSCSI, NPIV for FC) or implicit ALUA failover.
● Missing the final pieces of support for distributed SCSI Persistent Reservations.
iSCSI Active/Passive With Virtual IPs
(Diagram: Server1 connects through Switch A to an active and a passive target, both named iqn.2003-04.com.test, via Virtual IP 192.168.56.22; eth1/eth3 carry iSCSI traffic, eth2/eth4 carry the Corosync/Pacemaker interconnect.)
● Server1 accesses the two targets/GWs one at a time through one or more Virtual IPs.
● eth2 and eth4 are used by Corosync/Pacemaker for cluster membership and cluster-aware devices like DRBD.
● If the active target goes down, Corosync/Pacemaker will activate the passive target.
● Server1's TCP/IP layer and/or iSCSI/multipath layer will detect the disruption and perform recovery such as packet retransmission, iSCSI/SCSI command retry, or relogin.
Active/Active HA LIO Support
● Benefits:
  ● Simple initiator support.
  ● Boot, failover, failback, setup.
  ● Support for all SCSI transports in a common implementation.
  ● Possible performance improvement.
● Drawbacks:
  ● Complex target implementation.
  ● Distributed error handling, setup, and command execution.
iSCSI HA Active/Active
(Diagram: Server1 connects through Switch A to two active targets, both named iqn.2003-04.com.test and backed by the same RBD device; eth1/eth3 carry iSCSI traffic, eth2/eth4 carry the Corosync/Pacemaker interconnect.)
● Server1 accesses the two targets/GWs through two paths: 192.168.10.22 and 192.168.1.23.
● Both targets access the same RBD devices at the same time.
● eth2 and eth4 are used by Corosync/Pacemaker for DLM/CPG and cluster membership.
● If a node or the paths to a node become unreachable, Server1's multipath layer will mark those paths as unusable until they come back online.
Implementation Challenges
● Request execution.
● Synchronizing error recovery across nodes.
● Distributing setup information.
Distributed Request Execution
● COMPARE AND WRITE:
  ● Atomically read, compare, and, if matching, write N bytes of data.
  ● Used by ESXi (known as ATS) for fine-grained locking.
● If multiple nodes are executing this request at the same time, then locking is needed.
● Patches posted upstream to push the execution to the backing device (see the sketch below).
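To make the atomicity requirement concrete, here is a minimal sketch of what a backing device has to guarantee for COMPARE AND WRITE. Everything below (the caw_dev structure, caw_execute(), the mutex) is illustrative and not from the LIO patches; across gateway nodes the same serialization would have to come from the backing store itself (for example an RBD/RADOS-level operation) or a distributed lock.

/*
 * Illustrative only: the read/compare/write sequence COMPARE AND WRITE must
 * perform atomically.  The types, lock, and helpers here are hypothetical.
 */
#include <string.h>
#include <pthread.h>

struct caw_dev {
        pthread_mutex_t lock;   /* serializes COMPARE AND WRITE on one node */
        int (*read)(struct caw_dev *dev, unsigned long long lba,
                    void *buf, unsigned int len);
        int (*write)(struct caw_dev *dev, unsigned long long lba,
                     const void *buf, unsigned int len);
};

#define CAW_MISCOMPARE 1        /* would map to SCSI MISCOMPARE sense data */

/*
 * Read 'len' bytes at 'lba', compare them with 'verify', and only if they
 * match write 'data'.  A single-node mutex is shown; in a clustered target
 * the serialization must instead come from the backing device or a
 * distributed lock.
 */
static int caw_execute(struct caw_dev *dev, unsigned long long lba,
                       const void *verify, const void *data,
                       void *scratch, unsigned int len)
{
        int rc;

        pthread_mutex_lock(&dev->lock);
        rc = dev->read(dev, lba, scratch, len);
        if (rc == 0) {
                if (memcmp(scratch, verify, len) != 0)
                        rc = CAW_MISCOMPARE;
                else
                        rc = dev->write(dev, lba, data, len);
        }
        pthread_mutex_unlock(&dev->lock);
        return rc;
}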
Persistent Reservation (PR) Support
● PRs are a set of commands used to control access to a device.
● Used by clustering software like Windows Clustering and Red Hat Cluster Suite to control which client nodes can access the device.
● The initiator sends PR requests to the target, which inform it what set of I_T nexuses (SCSI ports) can access the device and what type of access they have.
● This info must be copied across the cluster (the state involved is sketched below).
● Ports can be added/removed and access restrictions can be changed at any time.
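As a rough illustration of what "this info" amounts to, the sketch below shows the kind of per-LUN PR state each gateway node would have to keep identical. The structures, field names, and buffer size are assumptions made for illustration, not LIO's actual data structures.

/*
 * Hypothetical per-LUN PR state that every gateway node must agree on;
 * names and sizes are illustrative, not taken from the LIO source.
 */
#include <stdint.h>

/* One registration: an I_T nexus (initiator port + target port) and its key. */
struct pr_registration {
        uint64_t reservation_key;
        char     initiator_transport_id[256];   /* SCSI TransportID, illustrative size */
        uint16_t target_rtpi;                   /* relative target port identifier */
        struct pr_registration *next;
};

/* State that must look the same on every node exporting the LUN. */
struct pr_lun_state {
        struct pr_registration *registrations;  /* who may access the LUN */
        struct pr_registration *holder;         /* NULL if no reservation is held */
        uint8_t  pr_type;                       /* e.g. Write Exclusive, Exclusive Access */
        uint32_t pr_generation;                 /* bumped on register/preempt/clear */
};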
HA Active/Active PR example
(Diagram: Server1 connects through Switch A to Node1 and Node2, which both export LUNs backed by the same RBD device; the numbered arrows correspond to the steps below.)
1) Server1 sends a PR register command to register Server1's and Node1's ports to allow access to LUN $N.
2) Node1 stores the PR info locally.
3) Node1 copies the data to Node2.
4) Node1 returns a successful status to Server1.
5) The process is now repeated for Server1's and Node2's ports (the remote copy and return of status are skipped in this example).
6) Server1 sends a PR reserve command to establish the reservation. This prevents other server nodes from being able to access LUN $N (this info will also be copied to Node2, and Node1 will return a status code to Server1).
Persistent Reservation Implementation
● Possible solutions:
  ● Use Corosync/Pacemaker and the DLM to distribute PR info across nodes.
  ● Pass the PR execution to userspace and use the Corosync cpg library messaging to send the PR info to the nodes in the cluster (see the cpg sketch below).
  ● Have a cluster FS/device that is used to store the PR info in.
  ● Add callbacks to the LIO kernel modules, or pass PR execution to userspace, so devices like RBD can utilize their own locking/messaging.
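For the userspace/cpg option, the sketch below shows the basic shape of using the Corosync cpg library: join a process group and multicast a PR update that every member, including the sender, receives in the same order. The group name "pr-sync", the pr_update_msg layout, and the apply step are assumptions; error handling is minimal. Build with -lcpg.

/*
 * Minimal Corosync cpg sketch for distributing PR updates.  The message
 * format and group name are made up for illustration.
 */
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <sys/uio.h>
#include <corosync/cpg.h>

struct pr_update_msg {
        uint64_t reservation_key;
        uint32_t pr_type;
        char     initiator_transport_id[256];
};

/* Called on every member, in the same order everywhere, for each multicast.
 * A real daemon would apply the update to local LIO state (e.g. via configfs). */
static void pr_deliver(cpg_handle_t handle, const struct cpg_name *group,
                       uint32_t nodeid, uint32_t pid, void *msg, size_t len)
{
        if (len == sizeof(struct pr_update_msg))
                printf("apply PR update from node %u\n", nodeid);
}

static cpg_callbacks_t callbacks = { .cpg_deliver_fn = pr_deliver };

int main(void)
{
        cpg_handle_t handle;
        struct cpg_name group;
        struct pr_update_msg msg = { .reservation_key = 0x1234, .pr_type = 5 };
        struct iovec iov = { .iov_base = &msg, .iov_len = sizeof(msg) };

        strcpy(group.value, "pr-sync");
        group.length = strlen(group.value);

        if (cpg_initialize(&handle, &callbacks) != CS_OK ||
            cpg_join(handle, &group) != CS_OK)
                return 1;

        /* CPG_TYPE_AGREED gives totally ordered delivery on all nodes. */
        cpg_mcast_joined(handle, CPG_TYPE_AGREED, &iov, 1);
        cpg_dispatch(handle, CS_DISPATCH_ALL);  /* a real daemon loops here */
        cpg_finalize(handle);
        return 0;
}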
Distributed Task Management
● When a command times out, the OS will send SCSI task management requests (TMFs) to abort commands and reset devices.
● The SCSI device reset request is called LOGICAL UNIT RESET (LUN RESET).
● The SCSI spec defines the required behavior:
  ● Abort all commands.
  ● Terminate other task management functions.
  ● Send an event (a SCSI Unit Attention) through all paths indicating the device was reset.
HA Active/Active LUN RESET Example
(Diagram: Server1 connects through Switch A to Node1 and Node2, which both export the same RBD-backed device; the numbered arrows correspond to the steps below.)
1) Server1 cannot determine the state of a command. To get the device into a known state it sends a LUN RESET.
2) Node1 begins processing the reset by internally blocking new commands and aborting running commands.
3) Node1 sends a message to Node2 instructing it to execute the distributed reset process (a possible message format is sketched below).
4) After all of the reset steps, like command cleanup, have been completed on both nodes, Node1 returns a success status to Server1.
5) and 6) Node1 and Node2 send Unit Attention notifications through all paths that are accessing the device that was reset.
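The message in step 3 could be as simple as a small typed header multicast over cpg (or sent via the DLM). The enum and struct below are hypothetical, just to show the kind of information the nodes would need to exchange.

/*
 * Hypothetical node-to-node message for the distributed reset (and the PR
 * replication from the earlier example); not an existing wire format.
 */
#include <stdint.h>

enum ha_msg_type {
        HA_MSG_PR_UPDATE = 1,   /* replicate a PR registration/reservation */
        HA_MSG_LUN_RESET,       /* step 3: run the reset steps on the peer */
        HA_MSG_LUN_RESET_DONE,  /* peer finished blocking/aborting commands */
};

struct ha_msg {
        uint32_t type;          /* enum ha_msg_type */
        uint32_t lun;           /* which logical unit the request applies to */
        uint64_t seq;           /* lets Node1 match replies to its request */
        /* type-specific payload follows */
};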
LUN RESET Handling
● Experimenting with passing part of the TMF handling to userspace:
  ● Use cpg to interact with LIO on all nodes.
  ● Extend the LIO configfs interface so userspace can block devices and perform the required reset operations.
● Possible future work/alternative (sketched below):
  ● Add a Linux kernel block layer interface to abort commands and reset devices:
    ● request_queue->rq_abort_fn(struct request *)
    ● request_queue->reset_q_fn(struct request_queue *)
    ● New BLK_QUEUE_RESET notifier_block event.
  ● LIO would use this to allow the backing device to do the heavy lifting.
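The block layer hooks named above are a proposal, not an existing kernel API; the sketch below only illustrates what they might look like, with LIO registering for the proposed BLK_QUEUE_RESET event so it can raise Unit Attentions when a backing device reports a reset. The callback names come from the bullets; everything else is an assumption.

/*
 * Sketch of the proposed block layer reset interface.  These types,
 * the event value, and the handler are illustrative only.
 */
#include <linux/blkdev.h>
#include <linux/notifier.h>

/* Proposed per-queue callbacks a driver such as rbd would provide. */
typedef void (rq_abort_fn_t)(struct request *rq);
typedef int (reset_q_fn_t)(struct request_queue *q);

/* Proposed notifier event: "the device behind this queue was reset". */
#define BLK_QUEUE_RESET 0x0001

/* How LIO might react: send Unit Attentions on every I_T nexus that has the
 * reset device configured as a LUN. */
static int lio_queue_reset_event(struct notifier_block *nb,
                                 unsigned long event, void *data)
{
        struct request_queue *q = data;

        if (event == BLK_QUEUE_RESET)
                pr_info("device behind queue %p was reset, sending UAs\n", q);
        return NOTIFY_OK;
}

/* Registered on the (hypothetical) reset notifier chain at LUN setup time. */
static struct notifier_block lio_reset_nb = {
        .notifier_call = lio_queue_reset_event,
};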
Offloaded Task Management
(Diagram: Server1 connects through Switch A to Node1 and Node2, which both export the same RBD device; the numbered arrows correspond to the steps below.)
1) Server1 sends a LUN RESET to Node1.
2) Node1 calls RBD's request_queue->reset_q_fn(). RBD translates that to a new rbd/rados reset operation (see the sketch below).
3) RBD/rados aborts commands and sends the other clients accessing the device a notification that their commands were aborted due to the reset.
4) The RBD client on Node2 handles the rados reset notification by firing the new BLK_QUEUE_RESET event.
5) LIO handles the BLK_QUEUE_RESET event by sending SCSI UAs on paths accessing the LUN through that node.
6) The RBD client on Node1 notifies the reset_q_fn caller that the reset was successful. LIO then returns a success status and UAs as needed.
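Continuing the previous sketch, this is roughly how the rbd driver could wire those hooks up to match steps 2 and 4. rbd_rados_reset(), blk_reset_notifier_list, and both functions below are made-up names; no such code exists in the real rbd driver.

/*
 * Hypothetical rbd side of the offload, matching steps 2 and 4 above.
 * None of these symbols exist in mainline.
 */
#include <linux/blkdev.h>
#include <linux/notifier.h>

#define BLK_QUEUE_RESET 0x0001          /* from the previous sketch */

/* Chain that LIO (and others) would register on for reset events. */
static BLOCKING_NOTIFIER_HEAD(blk_reset_notifier_list);

/* Hypothetical helper: issue a RADOS-level reset that aborts in-flight
 * requests and notifies the other clients (watchers) of the image. */
int rbd_rados_reset(void *rbd_dev);

/* Step 2: LIO calls the queue's reset_q_fn; rbd turns it into a rados op. */
static int rbd_reset_q_fn(struct request_queue *q)
{
        return rbd_rados_reset(q->queuedata);
}

/* Step 4: the peer node's rbd client got the rados reset notification and
 * tells local listeners (LIO) so they can send Unit Attentions (step 5). */
static void rbd_handle_reset_notify(struct request_queue *q)
{
        blocking_notifier_call_chain(&blk_reset_notifier_list,
                                     BLK_QUEUE_RESET, q);
}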
Management
● Have only just begun to look into this.
● Must support VMware VASA, Oracle Storage Connect, Red Hat libStorageMgmt, etc.
● Must have setup info like UUIDs, inquiry info, and SCSI settings synced up on all nodes.
● Prefer to integrate with existing projects:
  ● Extend the LIO target library (rtslib) and lio-utils to support clustering?
  ● Extend existing remote configuration daemons like targetd (https://github.com/agrover/targetd)?
Questions?
● I can be reached at mchristi@redhat.com.