murali.vilayannur at gmail.com
Sun Aug 19 19:26:23 EDT 2007
Your proposal for separating the I/O path from the metadata path using
existing iSCSI-like protocols sounds quite interesting and
intriguing. Just to clarify my understanding and to spark a
discussion along these lines, I have jotted down my thoughts below;
please let me know if I understood your proposal correctly.
- This proposal calls for a split fast-path I/O that will start out
optional (and possibly remain optional), since we don't know what the
performance implications of this path are at scale.
Presumably this can be tweaked to be a per-file/per-open option..?
- mount & all metadata operations of non-opened files remain the same
using either the existing client-core model or possibly the fuse
- iSCSI target-mode code should be added to the servers so that they
can service iSCSI PDUs. This will need a fair amount of tweaking
and could possibly leverage Pete's recent OSD work.
- on an open, we upcall to the client-core and fetch a list of the
data file handles & BMI addresses of corresponding servers to the
kernel. Assume for simplicity that we will only handle simple stripe
distribution (round-robin) across all the servers.
When we return to the kernel module, we send an iSCSI login request to
each of the data servers that are part of the striped file's backing.
Once that is done, the call returns to the caller. Consequently,
every open of a file on PVFS will result in the creation of "n" SCSI
initiator end-points, where "n" is the # of data servers. (Don't know
what impact this will have on the scalability of the Linux SCSI
- Do we need to log in each time? I think login can/should be made a
one-time operation to each server.
- any metadata operation involving an already opened file such as
fstat, read, write etc should be mapped to a SCSI command, packetized
and sent over the previously created iSCSI session's connection. Some
of the offset calculations etc would therefore need to be moved to the
- Essentially, the bulk of the work involved is in presenting each
data file handle on the server as a LUN. Since we do this only for
opened files, this shouldn't be that big a scalability issue..?
- How do we respond to a REPORT LUNS command? See below for one possibility.
If, as part of the open system call implementation, we send an
out-of-band PVFS message (scalability...?) telling the servers
to add the corresponding data file handles as eligible LUNs, then we
could report all those LUN ids.
When does the Linux iSCSI initiator stack send a REPORT LUNS, btw?
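
To make the offset calculations mentioned above concrete, here is a
minimal sketch of the simple round-robin stripe mapping that would have
to run wherever the SCSI commands are built. The names map_offset,
split_request, strip_size and num_servers are mine, purely for
illustration; the actual PVFS distribution code may differ in its
details.

```python
def map_offset(logical_offset, strip_size, num_servers):
    """Map a logical file offset to (server_index, offset_in_datafile)
    under plain round-robin striping (illustrative, not PVFS code)."""
    if logical_offset < 0 or strip_size <= 0 or num_servers <= 0:
        raise ValueError("offset must be >= 0; strip_size and num_servers > 0")
    strip_index = logical_offset // strip_size      # which strip of the file
    server_index = strip_index % num_servers        # round-robin owner
    full_rounds = strip_index // num_servers        # strips already on that server
    local_offset = full_rounds * strip_size + logical_offset % strip_size
    return server_index, local_offset


def split_request(offset, length, strip_size, num_servers):
    """Split one contiguous request into per-server pieces.

    Returns a list of (server_index, offset_in_datafile, piece_length),
    one entry per strip touched, in file order.
    """
    pieces = []
    end = offset + length
    while offset < end:
        server, local = map_offset(offset, strip_size, num_servers)
        room = strip_size - offset % strip_size     # bytes left in this strip
        chunk = min(room, end - offset)
        pieces.append((server, local, chunk))
        offset += chunk
    return pieces
```

Each piece would then become one SCSI read or write against the LUN
that represents that server's data file, so a request crossing a strip
boundary turns into one command per strip touched.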
At the end of the day, it looks like we will incur a heavy cost on
open() to improve the cost of I/O which is ok, if we can do
openg()/openfh() type of calls..
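
As a rough model of that open() cost, the sketch below logs in to each
data server only once per client and then registers the data file
handles as LUNs on every open. Everything here (DataServer, Client,
add_lun, remove_lun, report_luns, the integer handles) is a hypothetical
stand-in I made up for illustration; none of it is existing PVFS or
Linux iSCSI code.

```python
class DataServer:
    """Hypothetical model of one data server acting as an iSCSI target."""

    def __init__(self, address):
        self.address = address
        self.login_count = 0
        self._luns = {}      # data file handle -> LUN id
        self._refs = {}      # data file handle -> number of opens
        self._next_lun = 0

    def login(self):
        """Pretend iSCSI login; returns an opaque session token."""
        self.login_count += 1
        return ("session", self.address)

    def add_lun(self, handle):
        """Out-of-band request: make this data file handle an eligible LUN."""
        if handle not in self._luns:
            self._luns[handle] = self._next_lun
            self._next_lun += 1
            self._refs[handle] = 0
        self._refs[handle] += 1
        return self._luns[handle]

    def remove_lun(self, handle):
        """Drop one reference; retire the LUN when the last open goes away."""
        self._refs[handle] -= 1
        if self._refs[handle] == 0:
            del self._refs[handle]
            del self._luns[handle]

    def report_luns(self):
        """What a REPORT LUNS reply could list: only currently open handles."""
        return sorted(self._luns.values())


class Client:
    """Hypothetical initiator side that caches one session per server."""

    def __init__(self):
        self._sessions = {}  # server address -> session token

    def open_file(self, layout):
        """layout: list of (server, data_file_handle) pairs from the upcall."""
        luns = []
        for server, handle in layout:
            if server.address not in self._sessions:   # one-time login
                self._sessions[server.address] = server.login()
            luns.append((server.address, server.add_lun(handle)))
        return luns

    def close_file(self, layout):
        for server, handle in layout:
            server.remove_lun(handle)
```

With the sessions cached like this, the cost that remains on every open
is the out-of-band LUN registration, one message per data server, which
is the part openg()/openfh() style calls would have to amortize.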
Did I understand your proposal correctly? Will this work?
More information about the Pvfs2-developers