--- 1/draft-ietf-nfsv4-minorversion1-22.txt 2008-05-09 21:45:18.358829506 -0700 +++ 2/draft-ietf-nfsv4-minorversion1-23.txt 2008-05-09 21:45:20.657475891 -0700 @@ -1,19 +1,19 @@ NFSv4 S. Shepler Internet-Draft M. Eisler Intended status: Standards Track D. Noveck -Expires: November 2, 2008 Editors - May 1, 2008 +Expires: November 10, 2008 Editors + May 9, 2008 NFS Version 4 Minor Version 1 - draft-ietf-nfsv4-minorversion1-22.txt + draft-ietf-nfsv4-minorversion1-23.txt Status of this Memo By submitting this Internet-Draft, each author represents that any applicable patent or other IPR claims of which he or she is aware have been or will be disclosed, and any of which he or she becomes aware will be disclosed, in accordance with Section 6 of BCP 79. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that @@ -24,21 +24,21 @@ and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt. The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html. - This Internet-Draft will expire on November 2, 2008. + This Internet-Draft will expire on November 10, 2008. Copyright Notice Copyright (C) The IETF Trust (2008). Abstract This Internet-Draft describes NFS version 4 minor version one, including features retained from the base protocol and protocol extensions made subsequently. Major extensions introduced in NFS @@ -155,28 +155,28 @@ 7.4. Multiple Roots . . . . . . . . . . . . . . . . . . . . . 147 7.5. Filehandle Volatility . . . . . . . . . . . . . . . . . 147 7.6. Exported Root . . . . . . . . . . . . . . . . . . . . . 147 7.7.
Mount Point Crossing . . . . . . . . . . . . . . . . . . 148 7.8. Security Policy and Namespace Presentation . . . . . . . 148 8. State Management . . . . . . . . . . . . . . . . . . . . . . 149 8.1. Client and Session ID . . . . . . . . . . . . . . . . . 150 8.2. Stateid Definition . . . . . . . . . . . . . . . . . . . 150 8.2.1. Stateid Types . . . . . . . . . . . . . . . . . . . 151 8.2.2. Stateid Structure . . . . . . . . . . . . . . . . . 152 - 8.2.3. Special Stateids . . . . . . . . . . . . . . . . . . 153 + 8.2.3. Special Stateids . . . . . . . . . . . . . . . . . . 154 8.2.4. Stateid Lifetime and Validation . . . . . . . . . . 155 8.2.5. Stateid Use for I/O Operations . . . . . . . . . . . 158 8.2.6. Stateid Use for SETATTR Operations . . . . . . . . . 159 8.3. Lease Renewal . . . . . . . . . . . . . . . . . . . . . 159 8.4. Crash Recovery . . . . . . . . . . . . . . . . . . . . . 161 8.4.1. Client Failure and Recovery . . . . . . . . . . . . 162 - 8.4.2. Server Failure and Recovery . . . . . . . . . . . . 162 + 8.4.2. Server Failure and Recovery . . . . . . . . . . . . 163 8.4.3. Network Partitions and Recovery . . . . . . . . . . 166 8.5. Server Revocation of Locks . . . . . . . . . . . . . . . 171 8.6. Short and Long Leases . . . . . . . . . . . . . . . . . 172 8.7. Clocks, Propagation Delay, and Calculating Lease Expiration . . . . . . . . . . . . . . . . . . . . . . . 172 8.8. Obsolete Locking Infrastructure From NFSv4.0 . . . . . . 173 9. File Locking and Share Reservations . . . . . . . . . . . . . 174 9.1. Opens and Byte-Range Locks . . . . . . . . . . . . . . . 174 9.1.1. State-owner Definition . . . . . . . . . . . . . . . 174 9.1.2. Use of the Stateid and Locking . . . . . . . . . . . 175 @@ -255,206 +255,207 @@ 11.10.3. The fs_locations_item4 Structure . . . . . . . . . . 259 11.11. The Attribute fs_status . . . . . . . . . . . . . . . . 261 12. Parallel NFS (pNFS) . . . . . . . . . . . . . . . . . . . . . 265 12.1. Introduction . . . . . . . . 
. . . . . . . . . . . . . . 265 12.2. pNFS Definitions . . . . . . . . . . . . . . . . . . . . 266 12.2.1. Metadata . . . . . . . . . . . . . . . . . . . . . . 267 12.2.2. Metadata Server . . . . . . . . . . . . . . . . . . 267 12.2.3. pNFS Client . . . . . . . . . . . . . . . . . . . . 267 12.2.4. Storage Device . . . . . . . . . . . . . . . . . . . 267 12.2.5. Storage Protocol . . . . . . . . . . . . . . . . . . 267 - 12.2.6. Control Protocol . . . . . . . . . . . . . . . . . . 268 + 12.2.6. Control Protocol . . . . . . . . . . . . . . . . . . 267 12.2.7. Layout Types . . . . . . . . . . . . . . . . . . . . 268 - 12.2.8. Layout . . . . . . . . . . . . . . . . . . . . . . . 269 + 12.2.8. Layout . . . . . . . . . . . . . . . . . . . . . . . 268 12.2.9. Layout Iomode . . . . . . . . . . . . . . . . . . . 269 12.2.10. Device IDs . . . . . . . . . . . . . . . . . . . . . 270 12.3. pNFS Operations . . . . . . . . . . . . . . . . . . . . 271 12.4. pNFS Attributes . . . . . . . . . . . . . . . . . . . . 272 12.5. Layout Semantics . . . . . . . . . . . . . . . . . . . . 272 12.5.1. Guarantees Provided by Layouts . . . . . . . . . . . 272 12.5.2. Getting a Layout . . . . . . . . . . . . . . . . . . 273 12.5.3. Layout Stateid . . . . . . . . . . . . . . . . . . . 274 12.5.4. Committing a Layout . . . . . . . . . . . . . . . . 275 - 12.5.5. Recalling a Layout . . . . . . . . . . . . . . . . . 279 + 12.5.5. Recalling a Layout . . . . . . . . . . . . . . . . . 278 12.5.6. Revoking Layouts . . . . . . . . . . . . . . . . . . 287 12.5.7. Metadata Server Write Propagation . . . . . . . . . 287 12.6. pNFS Mechanics . . . . . . . . . . . . . . . . . . . . . 287 12.7. Recovery . . . . . . . . . . . . . . . . . . . . . . . . 289 12.7.1. Recovery from Client Restart . . . . . . . . . . . . 289 - 12.7.2. Dealing with Lease Expiration on the Client . . . . 290 + 12.7.2. Dealing with Lease Expiration on the Client . . . . 289 12.7.3. 
Dealing with Loss of Layout State on the Metadata - Server . . . . . . . . . . . . . . . . . . . . . . . 291 + Server . . . . . . . . . . . . . . . . . . . . . . . 290 12.7.4. Recovery from Metadata Server Restart . . . . . . . 291 12.7.5. Operations During Metadata Server Grace Period . . . 293 - 12.7.6. Storage Device Recovery . . . . . . . . . . . . . . 294 + 12.7.6. Storage Device Recovery . . . . . . . . . . . . . . 293 12.8. Metadata and Storage Device Roles . . . . . . . . . . . 294 12.9. Security Considerations for pNFS . . . . . . . . . . . . 294 13. PNFS: NFSv4.1 File Layout Type . . . . . . . . . . . . . . . 295 - 13.1. Client ID and Session Considerations . . . . . . . . . . 296 - 13.1.1. Sessions Considerations for Data Servers . . . . . . 298 + 13.1. Client ID and Session Considerations . . . . . . . . . . 295 + 13.1.1. Sessions Considerations for Data Servers . . . . . . 297 13.2. File Layout Definitions . . . . . . . . . . . . . . . . 298 13.3. File Layout Data Types . . . . . . . . . . . . . . . . . 299 13.4. Interpreting the File Layout . . . . . . . . . . . . . . 303 13.4.1. Determining the Stripe Unit Number . . . . . . . . . 303 13.4.2. Interpreting the File Layout Using Sparse Packing . 303 - 13.4.3. Interpreting the File Layout Using Dense Packing . . 306 + 13.4.3. Interpreting the File Layout Using Dense Packing . . 305 13.4.4. Sparse and Dense Stripe Unit Packing . . . . . . . . 308 - 13.5. Data Server Multipathing . . . . . . . . . . . . . . . . 310 - 13.6. Operations Sent to NFSv4.1 Data Servers . . . . . . . . 311 + 13.5. Data Server Multipathing . . . . . . . . . . . . . . . . 309 + 13.6. Operations Sent to NFSv4.1 Data Servers . . . . . . . . 310 13.7. COMMIT Through Metadata Server . . . . . . . . . . . . . 313 - 13.8. The Layout Iomode . . . . . . . . . . . . . . . . . . . 315 - 13.9. Metadata and Data Server State Coordination . . . . . . 315 - 13.9.1. Global Stateid Requirements . . . . . . . . . . . . 315 - 13.9.2. 
Data Server State Propagation . . . . . . . . . . . 316 - 13.10. Data Server Component File Size . . . . . . . . . . . . 318 - 13.11. Layout Revocation and Fencing . . . . . . . . . . . . . 319 + 13.8. The Layout Iomode . . . . . . . . . . . . . . . . . . . 314 + 13.9. Metadata and Data Server State Coordination . . . . . . 314 + 13.9.1. Global Stateid Requirements . . . . . . . . . . . . 314 + 13.9.2. Data Server State Propagation . . . . . . . . . . . 315 + 13.10. Data Server Component File Size . . . . . . . . . . . . 317 + 13.11. Layout Revocation and Fencing . . . . . . . . . . . . . 318 13.12. Security Considerations for the File Layout Type . . . . 319 14. Internationalization . . . . . . . . . . . . . . . . . . . . 320 14.1. Stringprep profile for the utf8str_cs type . . . . . . . 321 - 14.2. Stringprep profile for the utf8str_cis type . . . . . . 323 + 14.2. Stringprep profile for the utf8str_cis type . . . . . . 322 14.3. Stringprep profile for the utf8str_mixed type . . . . . 324 - 14.4. UTF-8 Capabilities . . . . . . . . . . . . . . . . . . . 326 - 14.5. UTF-8 Related Errors . . . . . . . . . . . . . . . . . . 326 - 15. Error Values . . . . . . . . . . . . . . . . . . . . . . . . 327 - 15.1. Error Definitions . . . . . . . . . . . . . . . . . . . 327 - 15.1.1. General Errors . . . . . . . . . . . . . . . . . . . 329 - 15.1.2. Filehandle Errors . . . . . . . . . . . . . . . . . 331 + 14.4. UTF-8 Capabilities . . . . . . . . . . . . . . . . . . . 325 + 14.5. UTF-8 Related Errors . . . . . . . . . . . . . . . . . . 325 + 15. Error Values . . . . . . . . . . . . . . . . . . . . . . . . 326 + 15.1. Error Definitions . . . . . . . . . . . . . . . . . . . 326 + 15.1.1. General Errors . . . . . . . . . . . . . . . . . . . 328 + 15.1.2. Filehandle Errors . . . . . . . . . . . . . . . . . 330 15.1.3. Compound Structure Errors . . . . . . . . . . . . . 332 - 15.1.4. File System Errors . . . . . . . . . . . . . . . . . 334 - 15.1.5. State Management Errors . . . . 
. . . . . . . . . . 336 - 15.1.6. Security Errors . . . . . . . . . . . . . . . . . . 337 + 15.1.4. File System Errors . . . . . . . . . . . . . . . . . 333 + 15.1.5. State Management Errors . . . . . . . . . . . . . . 335 + 15.1.6. Security Errors . . . . . . . . . . . . . . . . . . 336 15.1.7. Name Errors . . . . . . . . . . . . . . . . . . . . 337 - 15.1.8. Locking Errors . . . . . . . . . . . . . . . . . . . 338 + 15.1.8. Locking Errors . . . . . . . . . . . . . . . . . . . 337 15.1.9. Reclaim Errors . . . . . . . . . . . . . . . . . . . 339 - 15.1.10. pNFS Errors . . . . . . . . . . . . . . . . . . . . 340 + 15.1.10. pNFS Errors . . . . . . . . . . . . . . . . . . . . 339 15.1.11. Session Use Errors . . . . . . . . . . . . . . . . . 341 - 15.1.12. Session Management Errors . . . . . . . . . . . . . 343 - 15.1.13. Client Management Errors . . . . . . . . . . . . . . 343 - 15.1.14. Delegation Errors . . . . . . . . . . . . . . . . . 344 + 15.1.12. Session Management Errors . . . . . . . . . . . . . 342 + 15.1.13. Client Management Errors . . . . . . . . . . . . . . 342 + 15.1.14. Delegation Errors . . . . . . . . . . . . . . . . . 343 15.1.15. Attribute Handling Errors . . . . . . . . . . . . . 344 - 15.1.16. Obsoleted Errors . . . . . . . . . . . . . . . . . . 345 - 15.2. Operations and their valid errors . . . . . . . . . . . 346 - 15.3. Callback operations and their valid errors . . . . . . . 362 - 15.4. Errors and the operations that use them . . . . . . . . 364 - 16. NFSv4.1 Procedures . . . . . . . . . . . . . . . . . . . . . 378 - 16.1. Procedure 0: NULL - No Operation . . . . . . . . . . . . 378 - 16.2. Procedure 1: COMPOUND - Compound Operations . . . . . . 379 - 17. Operations: REQUIRED, RECOMMENDED, or OPTIONAL . . . . . . . 390 - 18. NFSv4.1 Operations . . . . . . . . . . . . . . . . . . . . . 393 - 18.1. Operation 3: ACCESS - Check Access Rights . . . . . . . 393 - 18.2. Operation 4: CLOSE - Close File . . . . . . . . . . . . 399 - 18.3. 
Operation 5: COMMIT - Commit Cached Data . . . . . . . . 400 - 18.4. Operation 6: CREATE - Create a Non-Regular File Object . 403 + 15.1.16. Obsoleted Errors . . . . . . . . . . . . . . . . . . 344 + 15.2. Operations and their valid errors . . . . . . . . . . . 345 + 15.3. Callback operations and their valid errors . . . . . . . 361 + 15.4. Errors and the operations that use them . . . . . . . . 363 + 16. NFSv4.1 Procedures . . . . . . . . . . . . . . . . . . . . . 377 + 16.1. Procedure 0: NULL - No Operation . . . . . . . . . . . . 377 + 16.2. Procedure 1: COMPOUND - Compound Operations . . . . . . 378 + 17. Operations: REQUIRED, RECOMMENDED, or OPTIONAL . . . . . . . 389 + 18. NFSv4.1 Operations . . . . . . . . . . . . . . . . . . . . . 392 + 18.1. Operation 3: ACCESS - Check Access Rights . . . . . . . 392 + 18.2. Operation 4: CLOSE - Close File . . . . . . . . . . . . 398 + 18.3. Operation 5: COMMIT - Commit Cached Data . . . . . . . . 399 + 18.4. Operation 6: CREATE - Create a Non-Regular File Object . 402 18.5. Operation 7: DELEGPURGE - Purge Delegations Awaiting - Recovery . . . . . . . . . . . . . . . . . . . . . . . . 406 - 18.6. Operation 8: DELEGRETURN - Return Delegation . . . . . . 407 - 18.7. Operation 9: GETATTR - Get Attributes . . . . . . . . . 407 - 18.8. Operation 10: GETFH - Get Current Filehandle . . . . . . 409 - 18.9. Operation 11: LINK - Create Link to a File . . . . . . . 410 - 18.10. Operation 12: LOCK - Create Lock . . . . . . . . . . . . 413 - 18.11. Operation 13: LOCKT - Test For Lock . . . . . . . . . . 417 - 18.12. Operation 14: LOCKU - Unlock File . . . . . . . . . . . 418 - 18.13. Operation 15: LOOKUP - Lookup Filename . . . . . . . . . 420 - 18.14. Operation 16: LOOKUPP - Lookup Parent Directory . . . . 421 + Recovery . . . . . . . . . . . . . . . . . . . . . . . . 405 + 18.6. Operation 8: DELEGRETURN - Return Delegation . . . . . . 406 + 18.7. Operation 9: GETATTR - Get Attributes . . . . . . . . . 406 + 18.8. 
Operation 10: GETFH - Get Current Filehandle . . . . . . 408 + 18.9. Operation 11: LINK - Create Link to a File . . . . . . . 409 + 18.10. Operation 12: LOCK - Create Lock . . . . . . . . . . . . 412 + 18.11. Operation 13: LOCKT - Test For Lock . . . . . . . . . . 416 + 18.12. Operation 14: LOCKU - Unlock File . . . . . . . . . . . 417 + 18.13. Operation 15: LOOKUP - Lookup Filename . . . . . . . . . 419 + 18.14. Operation 16: LOOKUPP - Lookup Parent Directory . . . . 420 18.15. Operation 17: NVERIFY - Verify Difference in - Attributes . . . . . . . . . . . . . . . . . . . . . . . 423 - 18.16. Operation 18: OPEN - Open a Regular File . . . . . . . . 424 + Attributes . . . . . . . . . . . . . . . . . . . . . . . 422 + 18.16. Operation 18: OPEN - Open a Regular File . . . . . . . . 423 18.17. Operation 19: OPENATTR - Open Named Attribute - Directory . . . . . . . . . . . . . . . . . . . . . . . 443 - 18.18. Operation 21: OPEN_DOWNGRADE - Reduce Open File Access . 444 - 18.19. Operation 22: PUTFH - Set Current Filehandle . . . . . . 446 - 18.20. Operation 23: PUTPUBFH - Set Public Filehandle . . . . . 446 - 18.21. Operation 24: PUTROOTFH - Set Root Filehandle . . . . . 448 - 18.22. Operation 25: READ - Read from File . . . . . . . . . . 449 - 18.23. Operation 26: READDIR - Read Directory . . . . . . . . . 451 - 18.24. Operation 27: READLINK - Read Symbolic Link . . . . . . 455 - 18.25. Operation 28: REMOVE - Remove File System Object . . . . 456 - 18.26. Operation 29: RENAME - Rename Directory Entry . . . . . 458 - 18.27. Operation 31: RESTOREFH - Restore Saved Filehandle . . . 462 - 18.28. Operation 32: SAVEFH - Save Current Filehandle . . . . . 463 - 18.29. Operation 33: SECINFO - Obtain Available Security . . . 464 - 18.30. Operation 34: SETATTR - Set Attributes . . . . . . . . . 468 - 18.31. Operation 37: VERIFY - Verify Same Attributes . . . . . 471 - 18.32. Operation 38: WRITE - Write to File . . . . . . . . . . 472 - 18.33. 
Operation 40: BACKCHANNEL_CTL - Backchannel control . . 476 - 18.34. Operation 41: BIND_CONN_TO_SESSION . . . . . . . . . . . 478 - 18.35. Operation 42: EXCHANGE_ID - Instantiate Client ID . . . 481 + Directory . . . . . . . . . . . . . . . . . . . . . . . 442 + 18.18. Operation 21: OPEN_DOWNGRADE - Reduce Open File Access . 443 + 18.19. Operation 22: PUTFH - Set Current Filehandle . . . . . . 445 + 18.20. Operation 23: PUTPUBFH - Set Public Filehandle . . . . . 445 + 18.21. Operation 24: PUTROOTFH - Set Root Filehandle . . . . . 447 + 18.22. Operation 25: READ - Read from File . . . . . . . . . . 448 + 18.23. Operation 26: READDIR - Read Directory . . . . . . . . . 450 + 18.24. Operation 27: READLINK - Read Symbolic Link . . . . . . 454 + 18.25. Operation 28: REMOVE - Remove File System Object . . . . 455 + 18.26. Operation 29: RENAME - Rename Directory Entry . . . . . 457 + 18.27. Operation 31: RESTOREFH - Restore Saved Filehandle . . . 461 + 18.28. Operation 32: SAVEFH - Save Current Filehandle . . . . . 462 + 18.29. Operation 33: SECINFO - Obtain Available Security . . . 463 + 18.30. Operation 34: SETATTR - Set Attributes . . . . . . . . . 467 + 18.31. Operation 37: VERIFY - Verify Same Attributes . . . . . 470 + 18.32. Operation 38: WRITE - Write to File . . . . . . . . . . 471 + 18.33. Operation 40: BACKCHANNEL_CTL - Backchannel control . . 475 + 18.34. Operation 41: BIND_CONN_TO_SESSION . . . . . . . . . . . 477 + 18.35. Operation 42: EXCHANGE_ID - Instantiate Client ID . . . 480 18.36. Operation 43: CREATE_SESSION - Create New Session and - Confirm Client ID . . . . . . . . . . . . . . . . . . . 498 + Confirm Client ID . . . . . . . . . . . . . . . . . . . 497 18.37. Operation 44: DESTROY_SESSION - Destroy existing - session . . . . . . . . . . . . . . . . . . . . . . . . 508 + session . . . . . . . . . . . . . . . . . . . . . . . . 507 18.38. Operation 45: FREE_STATEID - Free stateid with no - locks . . . . . . . . . . . . . . . . . . . . . . . . . 
509 + locks . . . . . . . . . . . . . . . . . . . . . . . . . 508 18.39. Operation 46: GET_DIR_DELEGATION - Get a directory - delegation . . . . . . . . . . . . . . . . . . . . . . . 510 - 18.40. Operation 47: GETDEVICEINFO - Get Device Information . . 514 + delegation . . . . . . . . . . . . . . . . . . . . . . . 509 + 18.40. Operation 47: GETDEVICEINFO - Get Device Information . . 513 18.41. Operation 48: GETDEVICELIST - Get All Device Mappings - for a File System . . . . . . . . . . . . . . . . . . . 516 + for a File System . . . . . . . . . . . . . . . . . . . 515 18.42. Operation 49: LAYOUTCOMMIT - Commit writes made using - a layout . . . . . . . . . . . . . . . . . . . . . . . . 518 - 18.43. Operation 50: LAYOUTGET - Get Layout Information . . . . 521 + a layout . . . . . . . . . . . . . . . . . . . . . . . . 517 + 18.43. Operation 50: LAYOUTGET - Get Layout Information . . . . 520 18.44. Operation 51: LAYOUTRETURN - Release Layout - Information . . . . . . . . . . . . . . . . . . . . . . 526 + Information . . . . . . . . . . . . . . . . . . . . . . 530 18.45. Operation 52: SECINFO_NO_NAME - Get Security on - Unnamed Object . . . . . . . . . . . . . . . . . . . . . 530 + Unnamed Object . . . . . . . . . . . . . . . . . . . . . 534 18.46. Operation 53: SEQUENCE - Supply per-procedure - sequencing and control . . . . . . . . . . . . . . . . . 531 - 18.47. Operation 54: SET_SSV - Update SSV for a Client ID . . . 537 + sequencing and control . . . . . . . . . . . . . . . . . 536 + 18.47. Operation 54: SET_SSV - Update SSV for a Client ID . . . 541 18.48. Operation 55: TEST_STATEID - Test stateids for - validity . . . . . . . . . . . . . . . . . . . . . . . . 539 - 18.49. Operation 56: WANT_DELEGATION - Request Delegation . . . 541 + validity . . . . . . . . . . . . . . . . . . . . . . . . 543 + 18.49. Operation 56: WANT_DELEGATION - Request Delegation . . . 545 18.50. Operation 57: DESTROY_CLIENTID - Destroy existing - client ID . . . . . . . . . . . . . . . . . 
. . . . . . 545 + client ID . . . . . . . . . . . . . . . . . . . . . . . 549 18.51. Operation 58: RECLAIM_COMPLETE - Indicates Reclaims - Finished . . . . . . . . . . . . . . . . . . . . . . . . 545 - 18.52. Operation 10044: ILLEGAL - Illegal operation . . . . . . 548 - 19. NFSv4.1 Callback Procedures . . . . . . . . . . . . . . . . . 548 - 19.1. Procedure 0: CB_NULL - No Operation . . . . . . . . . . 549 - 19.2. Procedure 1: CB_COMPOUND - Compound Operations . . . . . 549 - 20. NFSv4.1 Callback Operations . . . . . . . . . . . . . . . . . 553 - 20.1. Operation 3: CB_GETATTR - Get Attributes . . . . . . . . 553 - 20.2. Operation 4: CB_RECALL - Recall an Open Delegation . . . 554 + Finished . . . . . . . . . . . . . . . . . . . . . . . . 549 + 18.52. Operation 10044: ILLEGAL - Illegal operation . . . . . . 552 + 19. NFSv4.1 Callback Procedures . . . . . . . . . . . . . . . . . 552 + 19.1. Procedure 0: CB_NULL - No Operation . . . . . . . . . . 553 + 19.2. Procedure 1: CB_COMPOUND - Compound Operations . . . . . 553 + 20. NFSv4.1 Callback Operations . . . . . . . . . . . . . . . . . 557 + 20.1. Operation 3: CB_GETATTR - Get Attributes . . . . . . . . 557 + 20.2. Operation 4: CB_RECALL - Recall a Delegation . . . . . . 558 20.3. Operation 5: CB_LAYOUTRECALL - Recall Layout from - Client . . . . . . . . . . . . . . . . . . . . . . . . . 555 - 20.4. Operation 6: CB_NOTIFY - Notify directory changes . . . 559 + Client . . . . . . . . . . . . . . . . . . . . . . . . . 559 + 20.4. Operation 6: CB_NOTIFY - Notify directory changes . . . 563 20.5. Operation 7: CB_PUSH_DELEG - Offer Delegation to - Client . . . . . . . . . . . . . . . . . . . . . . . . . 563 - 20.6. Operation 8: CB_RECALL_ANY - Keep any N delegations . . 564 + Client . . . . . . . . . . . . . . . . . . . . . . . . . 567 + 20.6. Operation 8: CB_RECALL_ANY - Keep any N recallable + objects . . . . . . . . . . . . . . . . . . . . . . . . 568 20.7. 
Operation 9: CB_RECALLABLE_OBJ_AVAIL - Signal - Resources for Recallable Objects . . . . . . . . . . . . 566 + Resources for Recallable Objects . . . . . . . . . . . . 571 20.8. Operation 10: CB_RECALL_SLOT - change flow control - limits . . . . . . . . . . . . . . . . . . . . . . . . . 567 + limits . . . . . . . . . . . . . . . . . . . . . . . . . 572 20.9. Operation 11: CB_SEQUENCE - Supply backchannel - sequencing and control . . . . . . . . . . . . . . . . . 568 + sequencing and control . . . . . . . . . . . . . . . . . 573 20.10. Operation 12: CB_WANTS_CANCELLED - Cancel Pending - Delegation Wants . . . . . . . . . . . . . . . . . . . . 570 + Delegation Wants . . . . . . . . . . . . . . . . . . . . 575 20.11. Operation 13: CB_NOTIFY_LOCK - Notify of possible - lock availability . . . . . . . . . . . . . . . . . . . 571 + lock availability . . . . . . . . . . . . . . . . . . . 576 20.12. Operation 14: CB_NOTIFY_DEVICEID - Notify device ID - changes . . . . . . . . . . . . . . . . . . . . . . . . 573 + changes . . . . . . . . . . . . . . . . . . . . . . . . 578 20.13. Operation 10044: CB_ILLEGAL - Illegal Callback - Operation . . . . . . . . . . . . . . . . . . . . . . . 575 - 21. Security Considerations . . . . . . . . . . . . . . . . . . . 575 - 22. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 577 - 22.1. Named Attribute Definitions . . . . . . . . . . . . . . 577 - 22.2. ONC RPC Network Identifiers (netids) . . . . . . . . . . 577 - 22.3. Defining New Notifications . . . . . . . . . . . . . . . 578 - 22.4. Defining New Layout Types . . . . . . . . . . . . . . . 578 - 22.5. Path Variable Definitions . . . . . . . . . . . . . . . 580 - 22.5.1. Path Variable Values . . . . . . . . . . . . . . . . 580 - 22.5.2. Path Variable Names . . . . . . . . . . . . . . . . 580 - 23. References . . . . . . . . . . . . . . . . . . . . . . . . . 580 - 23.1. Normative References . . . . . . . . . . . . . . . . . . 580 - 23.2. Informative References . . . . . . 
. . . . . . . . . . . 582 - Appendix A. Acknowledgments . . . . . . . . . . . . . . . . . . 584 - Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 586 - Intellectual Property and Copyright Statements . . . . . . . . . 587 + Operation . . . . . . . . . . . . . . . . . . . . . . . 580 + 21. Security Considerations . . . . . . . . . . . . . . . . . . . 580 + 22. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 582 + 22.1. Named Attribute Definitions . . . . . . . . . . . . . . 582 + 22.2. ONC RPC Network Identifiers (netids) . . . . . . . . . . 582 + 22.3. Defining New Notifications . . . . . . . . . . . . . . . 583 + 22.4. Defining New Layout Types . . . . . . . . . . . . . . . 583 + 22.5. Path Variable Definitions . . . . . . . . . . . . . . . 585 + 22.5.1. Path Variable Values . . . . . . . . . . . . . . . . 585 + 22.5.2. Path Variable Names . . . . . . . . . . . . . . . . 585 + 23. References . . . . . . . . . . . . . . . . . . . . . . . . . 585 + 23.1. Normative References . . . . . . . . . . . . . . . . . . 585 + 23.2. Informative References . . . . . . . . . . . . . . . . . 587 + Appendix A. Acknowledgments . . . . . . . . . . . . . . . . . . 589 + Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 591 + Intellectual Property and Copyright Statements . . . . . . . . . 592 1. Introduction 1.1. The NFS Version 4 Minor Version 1 Protocol The NFS version 4 minor version 1 (NFSv4.1) protocol is the second minor version of the NFS version 4 (NFSv4) protocol. The first minor version, NFSv4.0 is described in [21]. It generally follows the guidelines for minor versioning model listed in Section 10 of RFC 3530. However, it diverges from guidelines 11 ("a client and server @@ -1158,21 +1159,21 @@ information to distinguish the client from other user level clients running on the same host, such as a process identifier or other unique sequence. 
The client ID is assigned by the server (the eir_clientid result from EXCHANGE_ID) and should be chosen so that it will not conflict with a client ID previously assigned by the server. This applies across server restarts. In the event of a server restart, a client may find out that its - current client ID is no longer valid when it receives a + current client ID is no longer valid when it receives an NFS4ERR_STALE_CLIENTID error. The precise circumstances depend on the characteristics of the sessions involved, specifically whether the session is persistent (see Section 2.10.5.5), but in each case the client will receive this error when it attempts to establish a new session with the existing client ID and receives the error NFS4ERR_STALE_CLIENTID, indicating that a new client ID must be obtained via EXCHANGE_ID and the new session established with that client ID. When a session is not persistent, the client will find out that it @@ -2099,21 +2100,21 @@ two different EXCHANGE_ID requests, and the eir_clientid, eir_server_owner.so_major_id, and eir_server_scope results match in both EXCHANGE_ID results, but the eir_server_owner.so_minor_id results do not match then the client is permitted to perform client ID trunking. The client can associate each connection with different sessions, where each session is associated with the same server. Of course, even if the eir_server_owner.so_minor_id fields do match, the client is free to employ client ID trunking instead of - sessiond trunking. + session trunking. The client completes the act of client ID trunking by invoking CREATE_SESSION on each connection, using the same client ID that was returned in eir_clientid. These invocations create two sessions and also associate each connection with each session. When doing client ID trunking, locking state is shared across sessions associated with the same client ID. This requires the server to coordinate state across sessions. @@ -2368,27 +2369,27 @@ CB_SEQUENCE (e.g. 
BIND_CONN_TO_SESSION), then the RPC XID is needed for correct operation to match the reply to the request. o The SEQUENCE or CB_SEQUENCE operation may generate an error. If so, the embedded slot id, sequence id, and sessionid (if present) in the request will not be in the reply, and the requester has only the XID to match the reply to the request. Given that well formulated XIDs continue to be required, this begs the question why SEQUENCE and CB_SEQUENCE replies have a sessionid, - slot id and sequence id? Having the sessionid in the reply means the - requester does not have to use the XID to lookup the sessionid, which - would be necessary if the connection were associated with multiple - sessions. Having the slot id and sequence id in the reply means - requester does not have to use the XID to lookup the slot id and - sequence id. Furhermore, since the XID is only 32 bits, it is too - small to guarantee the re-association of a reply with its request + slot id and sequence id? Having the session id in the reply means + the requester does not have to use the XID to lookup the session id, + which would be necessary if the connection were associated with + multiple sessions. Having the slot id and sequence id in the reply + means requester does not have to use the XID to lookup the slot id + and sequence id. Furhermore, since the XID is only 32 bits, it is + too small to guarantee the re-association of a reply with its request ([27]); having sessionid, slot id, and sequence id in the reply allows the client to validate that the reply in fact belongs to the matched request. The SEQUENCE (and CB_SEQUENCE) operation also carries a "highest_slotid" value which carries additional requester slot usage information. The requester must always indicate the slot id representing the outstanding request with the highest-numbered slot value. 
The requester should in all cases provide the most conservative value possible, although it can be increased somewhat @@ -2457,43 +2458,44 @@ entries at least as large as the old value of maximum requests outstanding, until it can infer that the requester has seen a reply containing the new granted highest_slotid. The replier can infer that requester as seen such a reply when it receives a new request with the same slotid as the request replied to and the next higher sequenceid. 2.10.5.1.1. Caching of SEQUENCE and CB_SEQUENCE Replies When a SEQUENCE or CB_SEQUENCE operation is successfully executed, - its reply MUST always be cached. Specifically, sessionid, - sequenceid, and slotid MUST be cached in the reply cache. The reply - from SEQUENCE also includes the highest slotid, target highest - slotid, and status flags. Instead of caching these values, the - server MAY re-compute the values from the current state of the fore - channel, session and/or client ID as appropriate. Similarly, the - reply from CB_SEQUENCE includes a highest slotid and target highest - slotid. The client MAY re-compute the values from the current state - of the session as appropriate. + its reply MUST always be cached. Specifically, session id, sequence + id, and slot id MUST be cached in the reply cache. The reply from + SEQUENCE also includes the highest slot id, target highest slot id, + and status flags. Instead of caching these values, the server MAY + re-compute the values from the current state of the fore channel, + session and/or client ID as appropriate. Similarly, the reply from + CB_SEQUENCE includes a highest slot id and target highest slot id. + The client MAY re-compute the values from the current state of the + session as appropriate. 
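A minimal sketch (not text from either draft revision) of the caching rule in section 2.10.5.1.1: the session id, slot id, and sequence id are stored in the reply cache, while the highest slot id and target highest slot id are re-computed from current state when a cached reply is replayed. The class name `ReplyCache`, its methods, and the dictionary layout are hypothetical and chosen only for illustration; status flags are omitted for brevity.

```python
from typing import Dict, Optional


class ReplyCache:
    """Hypothetical per-session reply cache keyed by slot id."""

    def __init__(self, session_id: bytes, num_slots: int) -> None:
        self.session_id = session_id
        # Current channel state; may change between request and retry.
        self.highest_slot_id = num_slots - 1
        self.target_highest_slot_id = num_slots - 1
        self.entries: Dict[int, dict] = {}

    def store(self, slot_id: int, sequence_id: int, result: str) -> None:
        # These three identifiers are the values that MUST be cached.
        self.entries[slot_id] = {
            "session_id": self.session_id,
            "slot_id": slot_id,
            "sequence_id": sequence_id,
            "result": result,
        }

    def replay(self, slot_id: int, sequence_id: int) -> Optional[dict]:
        """Return the cached reply for a retry, or None if nothing matches."""
        entry = self.entries.get(slot_id)
        if entry is None or entry["sequence_id"] != sequence_id:
            return None
        reply = dict(entry)
        # These values MAY be re-computed rather than cached.
        reply["highest_slot_id"] = self.highest_slot_id
        reply["target_highest_slot_id"] = self.target_highest_slot_id
        return reply


cache = ReplyCache(b"session-1", num_slots=4)
cache.store(slot_id=0, sequence_id=1, result="ok")
cache.target_highest_slot_id = 1  # state changes before the retry arrives
retry_reply = cache.replay(slot_id=0, sequence_id=1)
```

In this sketch the replayed reply carries the cached identifiers unchanged but reports the slot limits as they stand when the retry is answered.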
Regardless of whether a replier is re-computing highest slotid, - target slotid, and status on replies to retries or not, the requester - MUST NOT assume the values are being re-computed whenever it receives - a reply after a retry is sent, since it has no way of knowing whether - the reply it has received was sent by the server in response to the - retry, or is a delayed response to the original request. Therefore, - it may be the case that highest slotid, target slotid, or status bits - may reflect the state of affairs when the request was first executed. - Although acting based on such delayed information is valid, it may - cause the receiver to do unneeded work. Requesters MAY choose to - send additional requests to get the current state of affairs or use - the state of affairs reported by subsequent requests, in preference - to acting immediately on data which may be out of date. + target slot id, and status on replies to retries or not, the + requester MUST NOT assume the values are being re-computed whenever + it receives a reply after a retry is sent, since it has no way of + knowing whether the reply it has received was sent by the server in + response to the retry, or is a delayed response to the original + request. Therefore, it may be the case that highest slot id, target + slot id, or status bits may reflect the state of affairs when the + request was first executed. Although acting based on such delayed + information is valid, it may cause the receiver to do unneeded work. + Requesters MAY choose to send additional requests to get the current + state of affairs or use the state of affairs reported by subsequent + requests, in preference to acting immediately on data which may be + out of date. 2.10.5.1.2. Errors from SEQUENCE and CB_SEQUENCE Any time SEQUENCE or CB_SEQUENCE return an error, the sequence id of the slot MUST NOT change. 
The replier MUST NOT modify the reply cache entry for the slot whenever an error is returned from SEQUENCE or CB_SEQUENCE. 2.10.5.1.3. Optional Reply Caching @@ -2585,44 +2587,44 @@ client may have been granted a delegation to a file it has opened, but the reply to the OPEN (informing the client of the granting of the delegation) may be delayed in the network. If a conflicting operation arrives at the server, it will recall the delegation using the backchannel, which may be on a different transport connection, perhaps even a different network, or even a different session associated with the same client ID The presence of a session between client and server alleviates this issue. When a session is in place, each client request is uniquely - identified by its { sessionid, slot id, sequence id } triple. By the - rules under which slot entries (reply cache entries) are retired, the - server has knowledge whether the client has "seen" each of the + identified by its { session id, slot id, sequence id } triple. By + the rules under which slot entries (reply cache entries) are retired, + the server has knowledge whether the client has "seen" each of the server's replies. The server can therefore provide sufficient information to the client to allow it to disambiguate between an erroneous or conflicting callback race condition. For each client operation which might result in some sort of server callback, the server SHOULD "remember" the { sessionid, slot id, sequence id } triple of the client request until the slot id retirement rules allow the server to determine that the client has, in fact, seen the server's reply. Until the time the { sessionid, slot id, sequence id } request triple can be retired, any recalls of the associated object MUST carry an array of these referring identifiers (in the CB_SEQUENCE operation's arguments), for the benefit of the client. 
After this time, it is not necessary for the server to provide this information in related callbacks, since it is certain that a race condition can no longer occur. The CB_SEQUENCE operation which begins each server callback carries a list of "referring" { sessionid, slot id, sequence id } triples. If - the client finds the request corresponding to the referring - sessionid, slot id and sequence id to be currently outstanding (i.e. - the server's reply has not been seen by the client), it can determine + the client finds the request corresponding to the referring session + id, slot id and sequence id to be currently outstanding (i.e. the + server's reply has not been seen by the client), it can determine that the callback has raced the reply, and act accordingly. If the client does not find the request corresponding the referring triple to be outstanding (including the case of a sessionid referring to a destroyed session), then there is no race with respect to this triple. The server SHOULD limit the referring triples to requests that refer to just those that apply to the objects referred to in the CB_COMPOUND procedure. The client must not simply wait forever for the expected server reply to arrive before responding to the CB_COMPOUND that won the race, @@ -2643,31 +2645,31 @@ back), the client and server negotiate the maximum sized request they will send or process (ca_maxrequestsize), the maximum sized reply they will return or process (ca_maxresponsesize), and the maximum sized reply they will store in the reply cache (ca_maxresponsesize_cached). If a request exceeds ca_maxrequestsize, the reply will have the status NFS4ERR_REQ_TOO_BIG. 
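Returning to the referring triples carried by CB_SEQUENCE, the client-side check described above can be sketched as follows. This is a hedged Python illustration rather than anything defined by the draft. The function name callback_raced_reply and the two bookkeeping sets (outstanding requests and live sessions) are hypothetical, and a real client would also have to decide how long to wait for the racing reply.

```python
from typing import Iterable, Set, Tuple

# (session id, slot id, sequence id) of a client request
Triple = Tuple[bytes, int, int]


def callback_raced_reply(referring: Iterable[Triple],
                         outstanding: Set[Triple],
                         live_sessions: Set[bytes]) -> bool:
    """Return True if any referring triple names a request whose reply
    the client has not yet seen, i.e. the callback raced the reply."""
    for triple in referring:
        session_id = triple[0]
        if session_id not in live_sessions:
            # A triple that refers to a destroyed session cannot race.
            continue
        if triple in outstanding:
            return True
    return False
```

If the function returns False, none of the referring triples is outstanding and the client can process the callback without waiting.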
A replier MAY return NFS4ERR_REQ_TOO_BIG as the status for first operation (SEQUENCE or CB_SEQUENCE) in the request (which means no operations in the request executed, and the - state of the slot in the reply cache is unchanged), or it MAY chose - to return it on a subsequent operation in the same COMPOUND or + state of the slot in the reply cache is unchanged), or it MAY opt to + return it on a subsequent operation in the same COMPOUND or CB_COMPOUND request (which means at least one operation did execute and the state of the slot in reply cache does change). The replier SHOULD set NFS4ERR_REQ_TOO_BIG on the operation that exceeds ca_maxrequestsize. If a reply exceeds ca_maxresponsesize, the reply will have the status NFS4ERR_REP_TOO_BIG. A replier MAY return NFS4ERR_REP_TOO_BIG as the status for first operation (SEQUENCE or CB_SEQUENCE) in the request, - or it MAY chose to return it on a subsequent operation (in the same + or it MAY opt to return it on a subsequent operation (in the same COMPOUND or CB_COMPOUND reply). A replier MAY return NFS4ERR_REP_TOO_BIG in the reply to SEQUENCE or CB_SEQUENCE, even if the response would still exceed ca_maxresponsesize. If sa_cachethis or csa_cachethis are TRUE, then the replier MUST cache a reply except if an error is returned by the SEQUENCE or CB_SEQUENCE operation (see Section 2.10.5.1.2). If the reply exceeds ca_maxresponsesize_cached, (and sa_cachethis or csa_cachethis are TRUE) then the server MUST return NFS4ERR_REP_TOO_BIG_TO_CACHE. Even if NFS4ERR_REP_TOO_BIG_TO_CACHE (or any other error for that matter) @@ -2748,23 +2750,23 @@ sequence id) MUST be rejected with NFS4ERR_DEADSESSION (returned by SEQUENCE). Such a session is considered dead. A server MAY re- animate a session after a server restart so that the session will accept new requests as well as retries. To re-animate a session the server needs to persist additional information through server restart: o The client ID. 
This is a prerequisite to let the client to create more sessions associated with the same client ID as the - o The client ID's sequenceid that is used for creating sessions (see - Section 18.35 and Section 18.36. This is a prerequisite to let - the client create more sessions. + o The client ID's sequence id that is used for creating sessions + (see Section 18.35 and Section 18.36). This is a prerequisite to + let the client create more sessions. o The principal that created the client ID. This allows the server to authenticate the client when it sends EXCHANGE_ID. o The SSV, if SP4_SSV state protection was specified when the client ID was created (see Section 18.35). This lets the client create new sessions, and associate connections with the new and existing sessions. o The properties of the client ID as defined in Section 18.35. @@ -3527,22 +3529,22 @@ o A catastrophe that causes the reply cache to be corrupted or lost on the media it was stored on. This applies even if the replier indicated in the CREATE_SESSION results that it would persist the cache. o The server purges the session of a client that has been inactive for a very extended period of time. Loss of reply cache is equivalent to loss of session. The replier indicates loss of session to the requester by returning - NFS4ERR_BADSESSION on the next operation that uses the sessionid that - refers to the lost session. + NFS4ERR_BADSESSION on the next operation that uses the session id + that refers to the lost session. After an event like a server restart, the client may have lost its connections. The client assumes for the moment that the session has not been lost. It reconnects, and if it specified connection association enforcement when the session was created, it invokes BIND_CONN_TO_SESSION using the sessionid. Otherwise, it invokes SEQUENCE. If BIND_CONN_TO_SESSION or SEQUENCE returns NFS4ERR_BADSESSION, the client knows the session was lost. 
If the connection survives session loss, then the next SEQUENCE operation the client sends over the connection will get back @@ -3716,24 +3718,24 @@ | | Various defined file types. | | nfsstat4 | enum nfsstat4; | | | Return value for operations. | | offset4 | typedef uint64_t offset4; | | | Various offset designations (READ, WRITE, LOCK, | | | COMMIT). | | qop4 | typedef uint32_t qop4; | | | Quality of protection designation in SECINFO. | | sec_oid4 | typedef opaque sec_oid4<>; | | | Security Object Identifier. The sec_oid4 data | - | | type is not really opaque. Instead it contains | - | | an ASN.1 OBJECT IDENTIFIER as used by GSS-API in | - | | the mech_type argument to GSS_Init_sec_context. | - | | See [7] for details. | + | | type is not really opaque. Instead it contains an | + | | ASN.1 OBJECT IDENTIFIER as used by GSS-API in the | + | | mech_type argument to GSS_Init_sec_context. See | + | | [7] for details. | | sequenceid4 | typedef uint32_t sequenceid4; | | | Sequence number used for various session | | | operations (EXCHANGE_ID, CREATE_SESSION, | | | SEQUENCE, CB_SEQUENCE). | | seqid4 | typedef uint32_t seqid4; | | | Sequence identifier used for file locking. | | sessionid4 | typedef opaque sessionid4[NFS4_SESSIONID_SIZE]; | | | Session identifier. | | slotid4 | typedef uint32_t slotid4; | | | Sequencing artifact for various session | @@ -4665,21 +4667,21 @@ Some REQUIRED and RECOMMENDED attributes are set-only, i.e. they can be set via SETATTR but not retrieved via GETATTR. Similarly, some REQUIRED and RECOMMENDED attributes are get-only, i.e. they can be retrieved GETATTR but not set via SETATTR. If a client attempts to set a get-only attribute or get a set-only attributes, the server MUST return NFS4ERR_INVAL. 5.6. REQUIRED Attributes - List and Definition References The list of REQUIRED attributes appears in Table 4. 
The meaning of - hte columns of the table are: + the columns of the table are: o Name: the name of attribute o Id: the number assigned to the attribute. In the event of conflicts between the assigned number and [12], the latter is authoritative. o Data Type: The XDR data type of the attribute. o Acc: Access allowed to the attribute. R means read-only (GETATTR @@ -6653,22 +6655,29 @@ ACE4_INHERIT_ONLY_ACE set. (In the case of a dacl or sacl attribute, both of those ACEs SHOULD also have the ACE4_INHERITED_ACE flag set.) This makes it simpler to modify the effective permissions on the directory without modifying the ACE which is to be inherited to the new directory's children. 6.4.3.2. Automatic Inheritance The acl attribute consists only of an array of ACEs, but the sacl (Section 6.2.3) and dacl (Section 6.2.2) attributes also include an - additional flag field. The flag field applies to the entire sacl or - dacl; three flag values are defined: + additional flag field. + + struct nfsacl41 { + aclflag4 na41_flag; + nfsace4 na41_aces<>; + }; + + The flag field applies to the entire sacl or dacl; three flag values + are defined: const ACL4_AUTO_INHERIT = 0x00000001; const ACL4_PROTECTED = 0x00000002; const ACL4_DEFAULTED = 0x00000004; and all other bits must be cleared. The ACE4_INHERITED_ACE flag may be set in the ACEs of the sacl or dacl (whereas it must always be cleared in the acl). Together these features allow a server to support automatic @@ -6796,21 +6805,21 @@ In NFSv3, the client expects all LOOKUP operations to remain within a single server file system. For example, the device attribute will not change. This prevents a client from taking namespace paths that span exports. In the case of NFSv3, an automounter on the client can obtain a snapshot of the server's namespace using the EXPORTS procedure of the MOUNT protocol. If it understands the server's pathname syntax, it can create an image of the server's namespace on the client. 
The parts of the namespace that are not exported by the server are filled - in with directories that might be constructed similarly to a NFSv4.1 + in with directories that might be constructed similarly to an NFSv4.1 "pseudo file system" (see Section 7.3) that allows the user to browse from one mounted file system to another. There is a drawback to this representation of the server's namespace on the client: it is static. If the server administrator adds a new export the client will be unaware of it. 7.3. Server Pseudo File System NFSv4.1 servers avoid this namespace inconsistency by presenting all the exports for a given server within the framework of a single @@ -6985,22 +6997,22 @@ which represents a client as a whole to the eventual lightweight stateid used for most client and server locking interactions. The details of this transition will vary with the type of object but it always starts with a client ID. 8.1. Client and Session ID A client must establish a client ID (see Section 2.4) and then one or more sessionids (see Section 2.10) before performing any operations to open, lock, delegate, or obtain a layout for a file object. Each - sessionid is associated with a specific client ID, and thus serves as - a shorthand reference to an NFSv4.1 client. + session id is associated with a specific client ID, and thus serves + as a shorthand reference to an NFSv4.1 client. For some types of locking interactions, the client will represent some number of internal locking entities called "owners", which normally correspond to processes internal to the client. For other types of locking-related objects, such as delegations and layouts, no such intermediate entities are provided for, and the locking-related objects are considered to be transferred directly between the server and a unitary client. 8.2. Stateid Definition @@ -7269,22 +7282,22 @@ appropriate error returned when necessary. Special and non-special stateids are handled separately. 
(See Section 8.2.3 for a discussion of special stateids.) Note that stateids are implicitly qualified by the current client ID, as derived from the client ID associated with the current session. Note however, that the semantics of the session will prevent stateids associated with a previous client or server instance from being analyzed by this procedure. - If server restart has resulted in an invalid client ID or a sessionid - which is invalid, SEQUENCE will return an error and the operation + If server restart has resulted in an invalid client ID or a session + id which is invalid, SEQUENCE will return an error and the operation that takes a stateid as an argument will never be processed. If there has been a server restart where there is a persistent session, and all leased state has been lost, then the session in question will, although valid, be marked as dead, and any operation not satisfied by means of the reply cache will receive the error NFS4ERR_DEADSESSION, and thus not be processed as indicated below. When a stateid is being tested, and the "other" field is all zeros or all ones, a check that the "other" and "seqid" fields match a defined @@ -11698,23 +11712,23 @@ referring (absent) file system nor is there any access to the fh_expire_type attribute. o All file system instances servers should be considered as of different _change_ classes. For other class assignments, handling of file system transitions depends on the reasons for the transition: o When the transition is due to migration, that is the client was - directed to new file system after receiving a NFS4ERR_MOVED error, - the target should be treated as being of the same _write-verifier_ - class as the source. + directed to new file system after receiving an NFS4ERR_MOVED + error, the target should be treated as being of the same _write- + verifier_ class as the source. 
o When the transition is due to failover to another replica, that is, the client selected another replica without receiving an NFS4ERR_MOVED error, the target should be treated as being of a different _write-verifier_ class from the source. The specific choices reflect typical implementation patterns for failover and controlled migration respectively. Since other choices are possible and useful, this information is better obtained by using fs_locations_info. When a server implementation needs to communicate @@ -12374,21 +12388,21 @@ open denies WRITE and the data is changed), that lock SHOULD be considered administratively revoked. The opaque strings fss_source and fss_current provide a way of presenting information about the source of the file system image being present. It is not intended that the client do anything with this information other than make it available to administrative tools. It is intended that this information be helpful when researching possible problems with a file system image that might arise when it is unclear if the correct image is being accessed and if not, how - that image came to be made. This kind of dianostic information will + that image came to be made. This kind of diagnostic information will be helpful, if, as seems likely, copies of file systems are made in many different ways (e.g. simple user-level copies, file system-level point-in-time copies, clones of the underlying storage), under a variety of administrative arrangements. In such environments, determining how a given set of data was constructed can be very helpful in resolving problems.
The opaque string fss_source is used to indicate the source of a given file system with the expectation that tools capable of creating a file system image propagate this information, when that is @@ -12492,36 +12506,36 @@ ||| | ||| | ||| Storage +-----------+ | ||| Protocol |+-----------+ | ||+----------------||+-----------+ Control | |+-----------------||| | Protocol| +------------------+|| Storage |------------+ +| Devices | +-----------+ - Figure 67 + Figure 68 In this model, the clients, server, and storage devices are responsible for managing file access. This is in contrast to NFSv4 without pNFS where it is primarily the server's responsibility; some of this responsibility may be delegated to the client under strictly specified conditions. pNFS takes the form of OPTIONAL operations that manage protocol - objects called 'layouts' which contain data location information. - The layout is managed in a similar fashion as NFSv4.1 data - delegations are managed. For example, the layout is leased, + objects called 'layouts' which contain a byte-range and storage + location information. The layout is managed in a similar fashion as + NFSv4.1 data delegations. For example, the layout is leased, recallable and revocable. However, layouts are distinct abstractions and are manipulated with new operations. When a client holds a - layout, it is granted the ability to access the data location - directly using the location information specified in the layout. + layout, it is granted the ability to directly access the byte-range + at the storage location specified in the layout. There are interactions between layouts and other NFSv4.1 abstractions such as data delegations and byte-range locking. Delegation issues are discussed in Section 12.5.5. Byte range locking issues are discussed in Section 12.2.9 and Section 12.5.1. The NFSv4.1 pNFS feature has been structured to allow for a variety of storage protocols to be defined and used. 
As noted in the diagram above, the storage protocol is the method used by the client to store and retrieve data directly from the storage devices. The NFSv4.1 @@ -12540,57 +12554,55 @@ o Object protocols such as OSD over iSCSI or Fibre Channel [40]. o Other storage protocols, including PVFS and other file systems that are in use in HPC environments. It is possible that various storage protocols are available to both client and server and it may be possible that a client and server do not have a matching storage protocol available to them. Because of this, the pNFS server MUST support normal NFSv4.1 access to any file accessible by the pNFS feature; this will allow for continued - interoperability between a NFSv4.1 client and server. + interoperability between an NFSv4.1 client and server. 12.2. pNFS Definitions NFSv4.1's pNFS feature partitions the file system protocol into two - parts: metadata and data. Where data is the contents of a file and - metadata is "everything else". The metadata functionality is - implemented by a metadata server that supports pNFS and the - operations described in (Section 18). The data functionality is - implemented by a storage device that supports the storage protocol. - A subset (defined in Section 13.6) of NFSv4.1 itself is one such - storage protocol. New terms are introduced to the NFSv4.1 - nomenclature and existing terms are clarified to allow for the - description of the pNFS feature. + parts: metadata and data. Where data being the contents of a file + and the metadata is "everything else". The metadata functionality is + implemented by a NFSv4.1 server that supports pNFS and the operations + described in (Section 18) (a metadata server). The data + functionality is implemented by one or more storage devices, each of + which are accessed by the client via a storage protocol. A subset + (defined in Section 13.6) of NFSv4.1 is one such storage protocol. 
+ New terms are introduced to the NFSv4.1 nomenclature and existing + terms are clarified to allow for the description of the pNFS feature. 12.2.1. Metadata Information about a file system object, such as its name, location within the namespace, owner, ACL and other attributes. Metadata may also include storage location information and this will vary based on the underlying storage mechanism that is used. 12.2.2. Metadata Server An NFSv4.1 server which supports the pNFS feature. A variety of architectural choices exists for the metadata server and its use of - what file system information is held at the server. Some servers may - contain metadata only for the file objects that reside at the - metadata server while file data resides on the associated storage - devices. Other metadata servers may hold both metadata and a varying - degree of file data. + file system information held at the server. Some servers may contain + metadata only for file objects residing at the metadata server while + the file data resides on associated storage devices. Other metadata + servers may hold both metadata and a varying degree of file data. 12.2.3. pNFS Client An NFSv4.1 client that supports pNFS operations and supports at least - one storage protocol or layout type for performing I/O to storage - devices. + one storage protocol for performing I/O to storage devices. 12.2.4. Storage Device A storage device stores a regular file's data, but leaves metadata management to the metadata server. A storage device could be another NFSv4.1 server, an object storage device (OSD), a block device accessed over a SAN (e.g., either FiberChannel or iSCSI SAN), or some other entity. 12.2.5. Storage Protocol @@ -12618,38 +12630,38 @@ devices that hold the data. A layout is said to belong to a specific layout type (data type layouttype4, see Section 3.3.13). 
The layout type allows for variants to handle different storage protocols, such as those associated with block/volume [31], object [30], and file (Section 13) layout types. A metadata server, along with its control protocol, MUST support at least one layout type. A private sub-range of the layout type name space is also defined. Values from the private layout type range MAY be used for internal testing or experimentation. - As an example, layout of the file layout type could be an array of - tuples (e.g., deviceID, file_handle), along with a definition of how - the data is stored across the devices (e.g., striping). A block/ - volume layout might be an array of tuples that store <deviceID, - block_number, block count> along with information about block size - and the associated file offset of the block number. An object layout - might be an array of tuples <deviceID, objectID> and an additional - structure (i.e., the aggregation map) that defines how the logical - byte sequence of the file data is serialized into the different - objects. Note that the actual layouts are typically more complex - than these simple expository examples. + As an example, the organization of the file layout type could be an + array of tuples (e.g., deviceID, file_handle), along with a + definition of how the data is stored across the devices (e.g., + striping). A block/volume layout might be an array of tuples that + store <deviceID, block_number, block count> along with information + about block size and the associated file offset of the block number. + An object layout might be an array of tuples <deviceID, objectID> and + an additional structure (i.e., the aggregation map) that defines how + the logical byte sequence of the file data is serialized into the + different objects. Note that the actual layouts are typically more + complex than these simple expository examples. Requests for pNFS-related operations will often specify a layout type. Examples of such operations are GETDEVICEINFO and LAYOUTGET.
The response for these operations will include structures such as a device_addr4 or a layout4, each of which includes a layout type within it. The layout type sent by the server MUST always be the - same one requested by the client. When a client sends a response + same one requested by the client. When a server sends a response that includes a different layout type, the client SHOULD ignore the response and behave as if the server had returned an error response. 12.2.8. Layout A layout defines how a file's data is organized on one or more storage devices. There are many potential layout types; each of the layout types is differentiated by the storage protocol used to access data and in the aggregation scheme that lays out the file data on the underlying storage devices. A layout is precisely identified @@ -12667,61 +12679,61 @@ permissible for layouts with different iomodes, pertaining to the same byte range, to be held by the same client. An example of this would be copy-on-write functionality for a block/volume layout type. 12.2.9. Layout Iomode The layout iomode (data type layoutiomode4, see Section 3.3.20) indicates to the metadata server the client's intent to perform either just read operations or a mixture of I/O possibly containing read and write operations. For certain layout types, it is useful - for a client to specify this intent at LAYOUTGET (Section 18.43) - time. For example, block/volume based protocols, block allocation - could occur when a READ/WRITE iomode is specified. A special - LAYOUTIOMODE4_ANY iomode is defined and can only be used for + for a client to specify this intent at the time it sends LAYOUTGET + (Section 18.43). For example, block/volume based protocols, block + allocation could occur when a READ/WRITE iomode is specified. A + special LAYOUTIOMODE4_ANY iomode is defined and can only be used for LAYOUTRETURN and CB_LAYOUTRECALL, not for LAYOUTGET.
It specifies that layouts pertaining to both READ and READ/WRITE iomodes are being returned or recalled, respectively. - A storage device may validate I/O with regards to the iomode; this is + A storage device may validate I/O with regard to the iomode; this is dependent upon storage device implementation and layout type. Thus, if the client's layout iomode is inconsistent with the I/O being performed, the storage device may reject the client's I/O with an - error indicating a new layout with the correct I/O mode should be - fetched. For example, if a client gets a layout with a READ iomode - and performs a WRITE to a storage device, the storage device is - allowed to reject that WRITE. + error indicating a new layout with the correct iomode should be + obtained via LAYOUTGET. For example, if a client gets a layout with + a READ iomode and performs a WRITE to a storage device, the storage + device is allowed to reject that WRITE. - The iomode does not conflict with OPEN share modes or lock requests; - open mode and lock conflicts are enforced as they are without the use - of pNFS, and are logically separate from the pNFS layout level. As - well, open modes and locks are the preferred method for restricting - user access to data files. For example, an OPEN of read, deny-write - does not conflict with a LAYOUTGET containing an iomode of READ/WRITE - performed by another client. Applications that depend on writing - into the same file concurrently may use byte-range locking to - serialize their accesses. + The use of the layout iomode does not conflict with OPEN share modes + or byte-range lock requests; open mode and lock conflicts are + enforced as they are without the use of pNFS, and are logically + separate from the pNFS layout level. Open modes and locks are the + preferred method for restricting user access to data files. 
For + example, an OPEN of read, deny-write does not conflict with a + LAYOUTGET containing an iomode of READ/WRITE performed by another + client. Applications that depend on writing into the same file + concurrently may use byte-range locking to serialize their accesses. 12.2.10. Device IDs - The device ID (data type deviceid4, see Section 3.3.14) names a group - of storage devices. The scope of a device ID is per pair of client - ID and layout type. In practice, a significant amount of information - may be required to fully address a storage device. Rather than - embedding all such information in a layout, layouts embed device IDs. - The NFSv4.1 operation GETDEVICEINFO (Section 18.40) is used to - retrieve the complete address information (including all device - addresses for the device ID) regarding the storage device according - to its layout type and device ID. For example, the address of an - NFSv4.1 data server or of an object storage device could be an IP - address and port. The address of a block storage device could be a - volume label. + The device ID (data type deviceid4, see Section 3.3.14) identifies a + group of storage devices. The scope of a device ID is the pair + <client ID, layout type>. In practice, a significant amount of + information may be required to fully address a storage device. + Rather than embedding all such information in a layout, layouts embed + device IDs. The NFSv4.1 operation GETDEVICEINFO (Section 18.40) is + used to retrieve the complete address information (including all + device addresses for the device ID) regarding the storage device + according to its layout type and device ID. For example, the address + of an NFSv4.1 data server or of an object storage device could be an + IP address and port. The address of a block storage device could be + a volume label. Clients cannot expect the mapping between a device ID and its storage device address(es) to persist across metadata server restart.
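One way a client might hold the device ID mapping described in Section 12.2.10 is sketched below. This is an illustrative Python sketch under stated assumptions, not part of either draft revision. The class DeviceAddressCache, its methods, and the get_device_info callable (standing in for a GETDEVICEINFO round trip) are hypothetical. The cache key reflects the scope of a device ID, which is per pair of client ID and layout type, and the whole cache is discarded on metadata server restart because the mapping is not expected to persist.

```python
from typing import Callable, Dict, List, Tuple

# (client ID, layout type, device ID)
CacheKey = Tuple[int, int, bytes]


class DeviceAddressCache:
    """Client-side cache of device ID to storage device addresses."""

    def __init__(self, client_id: int,
                 get_device_info: Callable[[int, bytes], List[str]]) -> None:
        self._client_id = client_id
        # Hypothetical stand-in for sending GETDEVICEINFO to the server.
        self._get_device_info = get_device_info
        self._addresses: Dict[CacheKey, List[str]] = {}

    def lookup(self, layout_type: int, device_id: bytes) -> List[str]:
        """Return the addresses for a device ID, fetching them on a miss."""
        key = (self._client_id, layout_type, device_id)
        if key not in self._addresses:
            self._addresses[key] = self._get_device_info(layout_type, device_id)
        return self._addresses[key]

    def on_metadata_server_restart(self) -> None:
        """Drop every mapping; none may be assumed valid after a restart."""
        self._addresses.clear()
```

Because the layout type is part of the key, the same deviceid4 value seen under two layout types is treated as two distinct entries.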
See Section 12.7.4 for a description of how recovery works in that situation. A device ID lives as long as there is a layout referring to the device ID. If there are no layouts referring to the device ID, the server is free to delete the device ID any time. Once a device ID is deleted by the server, the server MUST NOT reuse the device ID for @@ -12863,26 +12875,24 @@ is incapable of providing this check in the presence of mandatory file locks, the metadata server then MUST NOT grant layouts and mandatory file locks simultaneously. 12.5.2. Getting a Layout A client obtains a layout with the LAYOUTGET operation. The metadata server will grant layouts of a particular type (e.g., block/volume, object, or file). The client selects an appropriate layout type that the server supports and the client is prepared to use. The layout - returned to the client may not exactly align with the requested byte - range. A field within the LAYOUTGET request, loga_minlength, - specifies the minimum length of the layout. The loga_minlength field - should be at least one. As needed a client may make multiple - LAYOUTGET requests; these will result in multiple overlapping, non- - conflicting layouts. + returned to the client might not exactly match the requested byte + range as described in Section 18.43.3. As needed a client may make + multiple LAYOUTGET requests; these might result in multiple + overlapping, non-conflicting layouts (see Section 12.2.8). In order to get a layout, the client must first have opened the file via the OPEN operation. When a client has no layout on a file, it MUST present a stateid as returned by OPEN, a delegation stateid, or a byte-range lock stateid in the loga_stateid argument. A successful LAYOUTGET result includes a layout stateid. The first successful LAYOUTGET processed by the server using a non-layout stateid as an argument MUST have the "seqid" field of the layout stateid in the response set to one. 
Thereafter, the client uses a layout stateid (see Section 12.5.3) on future invocations of LAYOUTGET on the file, @@ -12944,21 +12954,21 @@ correct "seqid" is defined as the highest "seqid" value from responses of fully processed LAYOUTGET or LAYOUTRETURN operations or arguments of a fully processed CB_LAYOUTRECALL operation. Since the server is incrementing the "seqid" value on each layout operation, the client may determine the order of operation processing by inspecting the "seqid" value. In the case of overlapping layout ranges, the ordering information will provide the client the knowledge of which layout ranges are held. Note that overlapping layout ranges may occur because of the client's specific requests or because the server is allowed to expand the range of a requested - layout and notify the client in the LAYOUTRETURN results Additional + layout and notify the client in the LAYOUTRETURN results. Additional layout stateid sequencing requirements are provided in Section 12.5.5.2. The client's receipt of a "seqid" is not sufficient for subsequent use. The client must fully process the operations before the "seqid" can be used. For LAYOUTGET results, if the client is not using the forgetful model (Section 12.5.5.1), it MUST first update its record of what ranges of the file's layout it has before using the seqid. For LAYOUTRETURN results, the client MUST delete the range from its record of what ranges of the file's layout it had before using the @@ -13876,21 +13886,21 @@ NFSv4.1) what role the request to the common server network address is directed to. 12.9. Security Considerations for pNFS pNFS separates file system metadata and data and provides access to both. There are pNFS-specific operations (listed in Section 12.3) that provide access to the metadata; all existing NFSv4.1 conventional (non-pNFS) security mechanisms and features apply to accessing the metadata. 
The combination of components in a pNFS - system (see Figure 67) is required to preserve the security + system (see Figure 68) is required to preserve the security properties of NFSv4.1 with respect to an entity accessing a storage device from a client, including security countermeasures to defend against threats that NFSv4.1 provides defenses for in environments where these threats are considered significant. In some cases, the security countermeasures for connections to storage devices may take the form of physical isolation or a recommendation not to use pNFS in an environment. For example, it may be impractical to provide confidentiality protection for some storage protocols to protect against eavesdropping; in environments @@ -14870,21 +14880,21 @@ o Otherwise, there must be an open stateid for the current open-owner, and that open stateid for the open file in question is used, unless mandatory locking prevents that. See below. o If the data server had previously responded with NFS4ERR_LOCKED to use of the open stateid, then the client should use the lock stateid whenever one exists for that open file with the current lock-owner. o Special stateids should never be used and if used the data server - MUST reject the I/O with a NFS4ERR_BAD_STATEID error. + MUST reject the I/O with an NFS4ERR_BAD_STATEID error. 13.9.2. Data Server State Propagation Since the metadata server, which handles lock and open-mode state changes, as well as ACLs, may not be co-located with the data servers where I/O accesses are validated, the server implementation MUST take care of propagating changes of this state to the data servers. Once the propagation to the data servers is complete, the full effect of those changes MUST be in effect at the data servers. However, some state changes need not be propagated immediately, although all @@ -17600,25 +17608,26 @@ 16.1.1. ARGUMENTS void; 16.1.2. RESULTS void; 16.1.3. DESCRIPTION - Standard NULL procedure. Void argument, void response.
This - procedure has no functionality associated with it. Because of this - it is sometimes used to measure the overhead of processing a service - request. Therefore, the server should ensure that no unnecessary - work is done in servicing this procedure. + This is the standard NULL procedure with the standard void argument + and void response. This procedure has no functionality associated + with it. Because of this it is sometimes used to measure the + overhead of processing a service request. Therefore, the server + SHOULD ensure that no unnecessary work is done in servicing this + procedure. 16.1.4. ERRORS None. 16.2. Procedure 1: COMPOUND - Compound Operations 16.2.1. ARGUMENTS enum nfs_opnum4 { @@ -18005,21 +18014,21 @@ PUTFH fh1 {fh1} LOOKUP "compA" {fh2} GETATTR {fh2} LOOKUP "compB" {fh3} GETATTR {fh3} LOOKUP "compC" {fh4} GETATTR {fh4} GETFH - Figure 84 + Figure 85 In this example, the PUTFH (Section 18.19) operation explicitly sets the current filehandle value while the result of each LOOKUP operation sets the current filehandle value to the resultant file system object. Also, the client is able to insert GETATTR operations using the current filehandle as an argument. The PUTROOTFH (Section 18.21) and PUTPUBFH (Section 18.21) operations also set the current filehandle. The above example would replace "PUTFH fh1" with PUTROOTFH or PUTPUBFH with no filehandle argument in @@ -18047,51 +18056,51 @@ A "current stateid" is the stateid that is associated with the current filehandle. The current stateid may only be changed by an operation that modifies the current filehandle or returns a stateid. If an operation returns a stateid it MUST set the current stateid to the returned value. If an operation sets the current filehandle but does not return a stateid, the current stateid MUST be set to the all-zeros special stateid, i.e. (seqid, other) = (0, 0). 
If an operation uses a stateid as an argument but does not return a stateid, the current stateid MUST NOT be changed. E.g., PUTFH, - PUTROOFH, and PUTPUBFH will change the current server state from + PUTROOTFH, and PUTPUBFH will change the current server state from {ocfh, (osid)} to {cfh, (0, 0)} while LOCK will change the current state from {cfh, (osid)} to {cfh, (nsid)}. Operations like LOOKUP that transform a current filehandle and component name into a new current filehandle will also change the current stateid to (0, 0). The SAVEFH and RESTOREFH operations will save and restore both the current filehandle and the current stateid as a set. The following example is the common case of a simple READ operation with a supplied stateid showing that the PUTFH initializes the current stateid to (0, 0). The subsequent READ with stateid (sid1) leaves the current stateid unchanged, but does evaluate the operation. PUTFH fh1 - -> {fh1, (0, 0)} READ (sid1), 0, 1024 {fh1, (0, 0)} -> {fh1, (0, 0)} - Figure 85 + Figure 86 This next example performs an OPEN with the root filehandle and as a result generates stateid (sid1). The next operation specifies the READ with the argument stateid set such that (seqid, other) are equal to (1, 0), but the current stateid set by the previous operation is actually used when the operation is evaluated. This allows correct interaction with any existing, potentially conflicting, locks. PUTROOTFH - -> {fh1, (0, 0)} OPEN "compA" {fh1, (0, 0)} -> {fh2, (sid1)} READ (1, 0), 0, 1024 {fh2, (sid1)} -> {fh2, (sid1)} CLOSE (1, 0) {fh2, (sid1)} -> {fh2, (sid2)} - Figure 86 + Figure 87 The final example is similar to the second in how it passes the stateid sid2 generated by the LOCK operation to the next READ operation. This allows the client to explicitly surround a single I/O operation with a lock and its appropriate stateid to guarantee correctness with other client locks.
The example also shows how SAVEFH and RESTOREFH can save and later re-use a filehandle and stateid, passing them as the current filehandle and stateid to a READ operation. @@ -18100,21 +18109,21 @@ READ (1, 0), 0, 1024 {fh1, (sid2)} -> {fh1, (sid2)} LOCKU 0, 1024, (1, 0) {fh1, (sid2)} -> {fh1, (sid3)} SAVEFH {fh1, (sid3)} -> {fh1, (sid3)} PUTFH fh2 {fh1, (sid3)} -> {fh2, (0, 0)} WRITE (1, 0), 0, 1024 {fh2, (0, 0)} -> {fh2, (0, 0)} RESTOREFH {fh2, (0, 0)} -> {fh1, (sid3)} READ (1, 0), 1024, 1024 {fh1, (sid3)} -> {fh1, (sid3)} - Figure 87 + Figure 88 16.2.4. ERRORS COMPOUND will of course return every error that each operation on the fore channel can return (see Table 12). However if COMPOUND returns zero operations, obviously the error returned by COMPOUND has nothing to do with an error returned by an operation. The list of errors COMPOUND will return if it processes zero operations include: COMPOUND error returns @@ -18392,21 +18401,21 @@ NFS is not going to be acceptable to some people. Historically, NFS servers have allowed a user to READ a file if the user has execute access to the file. As a practical example, the UNIX specification [41] states that an implementation claiming conformance to UNIX may indicate in the access() programming interface's result that a privileged user has execute rights, even if no execute permission bits are set on the regular file's attributes. It is possible to claim conformance to the UNIX specification and instead not indicate execute rights in - that situation, which is true for some operating enviroments. + that situation, which is true for some operating environments. Suppose the operating environments of the client and server are implementing the access() semantics for privileged users differently, and the ACCESS operation implementations of the client and server follow their respective access() semantics. 
This can cause undesired behavior: o Suppose the client's access() interface returns X_OK if the user is privileged and no execute permission bits are set on the regular file's attribute, and the server's access() interface does not return X_OK in that situation. Then the client will be unable @@ -18875,20 +18884,26 @@ nfsstat4 status; }; 18.5.3. DESCRIPTION Purges all of the delegations awaiting recovery for a given client. This is useful for clients which do not commit delegation information to stable storage to indicate that conflicting requests need not be delayed by the server awaiting recovery of delegation information. + The client is NOT specified by the clientid field of the request. + The client SHOULD set the clientid field to zero and the server MUST + ignore the clientid field. Instead the server MUST derive the client + ID from the value of the session id in the arguments of the SEQUENCE + operation that precedes DELEGPURGE in the COMPOUND request. + This operation should be used by clients that record delegation information on stable storage on the client. In this case, DELEGPURGE should be sent immediately after doing delegation recovery on all delegations known to the client. Doing so will notify the server that no additional delegations for the client will be recovered allowing it to free resources, and avoid delaying other clients which make requests that conflict with the unrecovered delegations. The set of delegations known to the server and the client may be different. The reason for this is that a client may fail after making a request which resulted in delegation but before @@ -20165,22 +20180,22 @@ | CLAIM_DELEG_CUR_FH | OPEN as granted by the server. Generally | | | this is done as part of recalling a | | | delegation. With CLAIM_DELEGATE_CUR, the | | | file is identified by the current | | | filehandle and the specified component | | | name.
With CLAIM_DELEG_CUR_FH (new to | | | NFSv4.1), the file is identified by just | | | the current filehandle. | | CLAIM_DELEGATE_PREV, | The client is claiming a delegation | | CLAIM_DELEG_PREV_FH | granted to a previous client instance; | - | | used after the client restarts. The | - | | server MAY support CLAIM_DELEGATE_PREV or | + | | used after the client restarts. The server | + | | MAY support CLAIM_DELEGATE_PREV or | | | CLAIM_DELEG_PREV_FH (new to NFSv4.1). If | | | it does support either open type, | | | CREATE_SESSION MUST NOT remove the | | | client's delegation state, and the server | | | MUST support the DELEGPURGE operation. | +----------------------+--------------------------------------------+ For OPEN requests that reach the server during the grace period, the server returns an error of NFS4ERR_GRACE. The following claim types are exceptions: @@ -21631,21 +21646,21 @@ The SECINFO operation is expected to be used by the NFS client when the error value of NFS4ERR_WRONGSEC is returned from another NFS operation. This signifies to the client that the server's security policy is different from what the client is currently using. At this point, the client is expected to obtain a list of possible security flavors and choose what best suits its policies. As mentioned, the server's security policies will determine when a client request receives NFS4ERR_WRONGSEC. See Table 14 for a list of operations which can return NFS4ERR_WRONGSEC. In addition, when - READDIR returns attributes, the rdaddr_error (Section 5.8.1.12) can + READDIR returns attributes, the rdattr_error (Section 5.8.1.12) can contain NFS4ERR_WRONGSEC. Note that CREATE and REMOVE MUST NOT return NFS4ERR_WRONGSEC.
The rationale for CREATE is that unless the target name exists it cannot have a separate security policy from the parent directory, and the security policy of the parent was checked when its filehandle was injected into the COMPOUND request's operations stream (for similar reasons, an OPEN operation that creates the target MUST NOT return NFS4ERR_WRONGSEC). If the target name exists, while it might have a separate security policy, that is irrelevant because CREATE MUST return NFS4ERR_EXIST. The rationale for REMOVE is that while that target might have separate security @@ -23376,31 +23391,50 @@ records introduced in the description of EXCHANGE_ID is used with the following addition: clientid_arg: The value of the csa_clientid field of the CREATE_SESSION4args structure of the current request. Since CREATE_SESSION is a non-idempotent operation, we must consider the possibility that retries may occur as a result of a client restart, network partition, malfunctioning router, etc. For each client ID created by EXCHANGE_ID, the server maintains a separate - reply cache similar to the session reply cache used for SEQUENCE - operations, with two distinctions. + reply cache (called the CREATE_SESSION reply cache) similar to the + session reply cache used for SEQUENCE operations, with two + distinctions. o First this is a reply cache just for detecting and processing CREATE_SESSION requests for a given client ID. o Second, the size of the client ID reply cache is of one slot (and as a result, the CREATE_SESSION request does not carry a slot number). This means that at most one CREATE_SESSION request for a given client ID can be outstanding. + As previously stated, CREATE_SESSION can be sent with or without a + preceding SEQUENCE operation. Even if SEQUENCE precedes + CREATE_SESSION, the server MUST maintain the CREATE_SESSION reply + cache, which is separate from the reply cache for the session + associated with SEQUENCE. 
If CREATE_SESSION was originally sent by + itself, the client MAY send a retry of the CREATE_SESSION operation + within a COMPOUND preceded by SEQUENCE. If CREATE_SESSION was + originally sent in a COMPOUND that started with SEQUENCE, then the + client SHOULD send a retry in a COMPOUND that starts with SEQUENCE + that has the same session id as the SEQUENCE of the original request. + However, the client MAY send a retry in a COMPOUND that either has no + preceding SEQUENCE, or has a preceding SEQUENCE that refers to a + different session than the original CREATE_SESSION. This might be + necessary if the client sends a CREATE_SESSION in a COMPOUND preceded + by a SEQUENCE with session id X, and session X no longer exists. + Regardless, any retry of CREATE_SESSION, with or without a preceding + SEQUENCE, MUST use the same value of csa_sequence as the original. + When a client sends a successful EXCHANGE_ID and it is returned an unconfirmed client ID, the client is also returned eir_sequenceid, and the client is expected to set the value of csa_sequenceid in the client ID-confirming-CREATE_SESSION it sends with that client ID to the value of eir_sequenceid. When EXCHANGE_ID returns a new, unconfirmed client ID, the server initializes the client ID slot to be equal to eir_sequenceid - 1 (accounting for underflow), and records a contrived CREATE_SESSION result with a "cached" result of NFS4ERR_SEQ_MISORDERED. With the slot thus initialized, the processing of the CREATE_SESSION operation is divided into four @@ -24195,161 +24230,414 @@ the sessionid in the preceding SEQUENCE operation), current filehandle, layout type (loga_layout_type), and the layout stateid (loga_stateid). The use of the loga_iomode field depends upon the layout type, but should reflect the client's data access intent. If the metadata server is in a grace period, and does not persist layouts and device ID to device address mappings, then it MUST return NFS4ERR_GRACE (see Section 8.4.2.1). 
The LAYOUTGET operation returns layout information for the specified - byte range: a layout. To get a layout from a specific offset through - the end-of-file, regardless of the file's length, a loga_length field - set to NFS4_UINT64_MAX is used. If loga_length is zero, or if a - loga_length which is not NFS4_UINT64_MAX is specified, and the sum of - loga_length and loga_offset exceeds NFS4_UINT64_MAX, the error - NFS4ERR_INVAL will result. + byte range: a layout. The client actually specifies two ranges, both + starting at the offset in the loga_offset field. The first range is + between loga_offset and loga_offset + loga_length - 1 inclusive. + This range indicates the desired range the client wants the layout to + cover. The second range is between loga_offset and loga_offset + + loga_minlength - 1 inclusive. This range indicates the required + range the client needs the layout to cover. Thus, loga_minlength + MUST be less than or equal to loga_length. - The loga_minlength field specifies the minimum length of layout the - server MUST return with two exceptions: + When a length field is set to NFS4_UINT64_MAX, this indicates a + desire (when loga_length is NFS4_UINT64_MAX) or requirement (when + loga_minlength is NFS4_UINT64_MAX) to get a layout from loga_offset + through the end-of-file, regardless of the file's length. - 1. The argument loga_iomode was set to LAYOUTIOMODE_READ, and - loga_offset plus loga_minlength goes past the end of the file. + The following rules govern the relationships among, and the minima of + loga_length, loga_minlength, and loga_offset. - 2. The range from loga_offset through loga_offset + loga_minlength - - 1 overlaps two or more striping patterns. In which case, - logr_layout will contain two or more elements, and the sum of the - lo_length fields of each element MUST be at least loga_minlength - unless the first exception also applies. + o If loga_length is less than loga_minlength, the metadata server + MUST return NFS4ERR_INVAL. 
+ - If this requirement cannot be met, the server MUST NOT return a - layout and the error NFS4ERR_BADLAYOUT MUST be returned. + o If loga_minlength is zero, this is an indication to the metadata + server that the client desires any layout at offset loga_offset or + less that the metadata server has "readily available". Readily is + subjective, and depends on the layout type and the pNFS server + implementation. For example, some metadata servers might have to + pre-allocate stable storage when they receive a request for a + range of a file that goes beyond the file's current length. If + loga_minlength is zero and loga_length is greater than zero, this + tells the metadata server what range of the layout the client + would prefer to have. If loga_length and loga_minlength are both + zero, then the client is indicating it desires a layout of any + length with the ending offset of the range no less than the specified + loga_offset, and the starting offset at or below loga_offset. If + the metadata server does not have a layout that is readily + available, then it MUST return NFS4ERR_LAYOUTTRYLATER. + + o If the sum of loga_offset and loga_minlength exceeds + NFS4_UINT64_MAX, and loga_minlength is not NFS4_UINT64_MAX, the + error NFS4ERR_INVAL MUST result. + + o If the sum of loga_offset and loga_length exceeds NFS4_UINT64_MAX, + and loga_length is not NFS4_UINT64_MAX, the error NFS4ERR_INVAL + MUST result. + + After the metadata server has performed the above checks on + loga_offset, loga_minlength, and loga_length, the metadata server + MUST return a layout according to the rules in Table 21. + + Acceptable layouts based on loga_minlength. Note: u64m = + NFS4_UINT64_MAX; a_off = loga_offset; a_minlen = loga_minlength.
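As an illustration only, the three NFS4ERR_INVAL checks on loga_offset, loga_length, and loga_minlength listed above can be sketched in Python. The function name check_layoutget_range and the use of a string to stand for the error code are assumptions made for this sketch and are not defined by this document. The "readily available" decision for a zero loga_minlength is a server policy matter and is not modelled here.

```python
# All-ones 64-bit value; in a length field it means "through end-of-file".
NFS4_UINT64_MAX = 0xFFFFFFFFFFFFFFFF


def check_layoutget_range(loga_offset, loga_length, loga_minlength):
    """Hypothetical helper: return "NFS4ERR_INVAL" if a range rule is broken, else None."""
    # The required range (minlength) must not be longer than the desired range (length).
    if loga_length < loga_minlength:
        return "NFS4ERR_INVAL"
    # offset + length must not exceed NFS4_UINT64_MAX unless the length is the
    # special end-of-file value. Python integers do not wrap, so the sum can be
    # compared directly.
    for length in (loga_minlength, loga_length):
        if length != NFS4_UINT64_MAX and loga_offset + length > NFS4_UINT64_MAX:
            return "NFS4ERR_INVAL"
    return None
```

In a language with fixed-width unsigned integers, the same test would need to be written as a comparison of the length against NFS4_UINT64_MAX minus the offset, so that the addition cannot wrap.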
+
+   +-----------+-----------+----------+----------+---------------------+
+   | Layout    | Layout    | Layout   | Layout   | Layout length of    |
+   | iomode of | a_minlen  | iomode   | offset   | reply               |
+   | request   | of        | of reply | of reply |                     |
+   |           | request   |          |          |                     |
+   +-----------+-----------+----------+----------+---------------------+
+   | _READ     | u64m      | MAY be   | MUST be  | MUST be >= file     |
+   |           |           | _READ    | <= a_off | length - layout     |
+   |           |           |          |          | offset              |
+   | _READ     | u64m      | MAY be   | MUST be  | MUST be u64m        |
+   |           |           | _RW      | <= a_off |                     |
+   | _READ     | > 0 and < | MAY be   | MUST be  | MUST be >= MIN(file |
+   |           | u64m      | _READ    | <= a_off | length, a_minlen +  |
+   |           |           |          |          | a_off) - layout     |
+   |           |           |          |          | offset              |
+   | _READ     | > 0 and < | MAY be   | MUST be  | MUST be >= a_off -  |
+   |           | u64m      | _RW      | <= a_off | layout offset +     |
+   |           |           |          |          | a_minlen            |
+   | _READ     | 0         | MAY be   | MUST be  | MUST be > 0         |
+   |           |           | _READ    | <= a_off |                     |
+   | _READ     | 0         | MAY be   | MUST be  | MUST be > 0         |
+   |           |           | _RW      | <= a_off |                     |
+   | _RW       | u64m      | MUST be  | MUST be  | MUST be u64m        |
+   |           |           | _RW      | <= a_off |                     |
+   | _RW       | > 0 and < | MUST be  | MUST be  | MUST be >= a_off -  |
+   |           | u64m      | _RW      | <= a_off | layout offset +     |
+   |           |           |          |          | a_minlen            |
+   | _RW       | 0         | MUST be  | MUST be  | MUST be > 0         |
+   |           |           | _RW      | <= a_off |                     |
+   +-----------+-----------+----------+----------+---------------------+
+
+   Table 21
+
+ If loga_minlength is not zero and the metadata server cannot return a + layout according to the rules in Table 21, then the metadata server + MUST return the error NFS4ERR_BADLAYOUT. If loga_minlength is zero + and the metadata server cannot or will not return a layout according + to the rules in Table 21, then the metadata server MUST return the + error NFS4ERR_LAYOUTTRYLATER. Assuming loga_length is greater than + loga_minlength or equal to zero, the metadata server SHOULD return a + layout according to the rules in Table 22. + + Desired layouts based on loga_length. The rules of Table 21 MUST be + applied first.
Note: u64m = NFS4_UINT64_MAX; a_off = loga_offset; + a_len = loga_length.
+
+   +------------+------------+-----------+-----------+-----------------+
+   | Layout     | Layout     | Layout    | Layout    | Layout length   |
+   | iomode of  | a_len of   | iomode of | offset of | of reply        |
+   | request    | request    | reply     | reply     |                 |
+   +------------+------------+-----------+-----------+-----------------+
+   | _READ      | u64m       | MAY be    | MUST be   | SHOULD be u64m  |
+   |            |            | _READ     | <= a_off  |                 |
+   | _READ      | u64m       | MAY be    | MUST be   | SHOULD be u64m  |
+   |            |            | _RW       | <= a_off  |                 |
+   | _READ      | > 0 and <  | MAY be    | MUST be   | SHOULD be >=    |
+   |            | u64m       | _READ     | <= a_off  | a_off - layout  |
+   |            |            |           |           | offset + a_len  |
+   | _READ      | > 0 and <  | MAY be    | MUST be   | SHOULD be >=    |
+   |            | u64m       | _RW       | <= a_off  | a_off - layout  |
+   |            |            |           |           | offset + a_len  |
+   | _READ      | 0          | MAY be    | MUST be   | SHOULD be >     |
+   |            |            | _READ     | <= a_off  | a_off - layout  |
+   |            |            |           |           | offset          |
+   | _READ      | 0          | MAY be    | MUST be   | SHOULD be >     |
+   |            |            | _RW       | <= a_off  | a_off - layout  |
+   |            |            |           |           | offset          |
+   | _RW        | u64m       | MUST be   | MUST be   | SHOULD be u64m  |
+   |            |            | _RW       | <= a_off  |                 |
+   | _RW        | > 0 and <  | MUST be   | MUST be   | SHOULD be >=    |
+   |            | u64m       | _RW       | <= a_off  | a_off - layout  |
+   |            |            |           |           | offset + a_len  |
+   | _RW        | 0          | MUST be   | MUST be   | SHOULD be >     |
+   |            |            | _RW       | <= a_off  | a_off - layout  |
+   |            |            |           |           | offset          |
+   +------------+------------+-----------+-----------+-----------------+
+
+   Table 22
The loga_stateid field specifies a valid stateid. If a layout is not currently held by the client, the loga_stateid field represents a stateid reflecting the correspondingly valid open, byte-range lock, - or delegation stateid. Once a layout is held by the client for the - file, the loga_stateid field is a stateid as returned from a previous - LAYOUTGET or LAYOUTRETURN operation or provided by a CB_LAYOUTRECALL - operation (see Section 12.5.3). + or delegation stateid.
Once a layout is held on the file by the + client, the loga_stateid field MUST be a stateid as returned from a + previous LAYOUTGET or LAYOUTRETURN operation or provided by a + CB_LAYOUTRECALL operation (see Section 12.5.3). The loga_maxcount field specifies the maximum layout size (in bytes) that the client can handle. If the size of the layout structure exceeds the size specified by maxcount, the metadata server will return the NFS4ERR_TOOSMALL error. The returned layout is expressed as an array, logr_layout, with each element of type layout4. If a file has a single striping pattern, - then logr_layout will contain just one entry. Otherwise, if the + then logr_layout SHOULD contain just one entry. Otherwise, if the requested range overlaps more than one striping pattern, logr_layout will contain the required number of entries. The elements of logr_layout MUST be sorted in ascending order of the value of the lo_offset field of each element. There MUST be no gaps or overlaps in the range between two successive elements of logr_layout. The lo_iomode field in each element of logr_layout MUST be the same. - The metadata server may adjust the range of the returned layout based - on the usage implied by the loga_iomode. The client MUST be prepared - to get a layout that does not align exactly with its request. See - Section 12.5.2 for more details. + Table 21 and Table 22 both refer to a returned layout iomode, offset, + and length. Because the returned layout is encoded in the + logr_layout array, more description is required. - The metadata server may also return a layout with an lo_iomode other - than that requested by the client. If it does so, it MUST ensure - that the lo_iomode is more permissive than the loga_iomode requested. - For example, this behavior allows an implementation to upgrade read- - only requests to read/write requests at its discretion, within the - limits of the layout type specific protocol. 
A lo_iomode of either - LAYOUTIOMODE4_READ or LAYOUTIOMODE4_RW MUST be returned. + iomode + + The value of the returned layout iomode listed in Table 21 and + Table 22 is equal to the value of the lo_iomode field in each + element of logr_layout. As shown in Table 21 and Table 22, the + metadata server MAY return a layout with an lo_iomode different + from the requested iomode (field loga_iomode of the request). If + it does so, it MUST ensure that the lo_iomode is more permissive + than the loga_iomode requested. For example, this behavior allows + an implementation to upgrade read-only requests to read/write + requests at its discretion, within the limits of the layout type + specific protocol. An lo_iomode of either LAYOUTIOMODE4_READ or + LAYOUTIOMODE4_RW MUST be returned. + + offset + + The value of the returned layout offset listed in Table 21 and + Table 22 is always equal to the lo_offset field of the first + element of logr_layout. + + length + + When setting the value of the returned layout length, the + situation is complicated by the possibility that the special + layout length value NFS4_UINT64_MAX is involved. For a + logr_layout array of N elements, the lo_length field in the first + N-1 elements MUST NOT be NFS4_UINT64_MAX. The lo_length field of + the last element of logr_layout can be NFS4_UINT64_MAX under some + conditions as described in the following list. + + * If an applicable rule of Table 21 states the metadata server + MUST return a layout of length NFS4_UINT64_MAX, then lo_length + field of the last element of logr_layout MUST be + NFS4_UINT64_MAX. + + * If an applicable rule of Table 21 states the metadata server + MUST NOT return a layout of length NFS4_UINT64_MAX, then + lo_length field of the last element of logr_layout MUST NOT be + NFS4_UINT64_MAX.
+ + * If an applicable rule of Table 22 states the metadata server + SHOULD return a layout of length NFS4_UINT64_MAX, then + lo_length field of the last element of logr_layout SHOULD be + NFS4_UINT64_MAX. + + * When the value of the returned layout length of Table 21 and + Table 22 is not NFS4_UINT64_MAX, then the returned layout + length is equal to the sum of the lo_length fields of each + element of logr_layout. The logr_return_on_close result field is a directive to return the - layout before closing the file. When the server sets this return - value to TRUE, it MUST be prepared to recall the layout in the case - the client fails to return the layout before close. For the server - that knows a layout must be returned before a close of the file, this - return value can be used to communicate the desired behavior to the - client and thus remove one extra step from the client's and server's - interaction. + layout before closing the file. When the metadata server sets this + return value to TRUE, it MUST be prepared to recall the layout in the + case the client fails to return the layout before close. For the + metadata server that knows a layout must be returned before a close + of the file, this return value can be used to communicate the desired + behavior to the client and thus remove one extra step from the + client's and metadata server's interaction. The logr_stateid stateid is returned to the client for use in subsequent layout related operations. See Section 8.2, Section 12.5.3, and Section 12.5.5.2 for a further discussion and requirements. The format of the returned layout (lo_content) is specific to the layout type. The value of the layout type (lo_content.loc_type) for - each of the elements of the array of layouts returned by the server - (logr_layout) MUST be equal to the loga_layout_type specified by the - client. 
If it is not equal, the client SHOULD ignore the response as - invalid and behave as if the server returned an error, even if the - client does have support for the layout type returned. + each of the elements of the array of layouts returned by the metadata + server (logr_layout) MUST be equal to the loga_layout_type specified + by the client. If it is not equal, the client SHOULD ignore the + response as invalid and behave as if the metadata server returned an + error, even if the client does have support for the layout type + returned. If layouts are not supported for the requested file or its containing - file system the server SHOULD return NFS4ERR_LAYOUTUNAVAILABLE. If - the layout type is not supported, the metadata server should return - NFS4ERR_UNKNOWN_LAYOUTTYPE. If layouts are supported but no layout - matches the client provided layout identification, the server should - return NFS4ERR_BADLAYOUT. If an invalid loga_iomode is specified, or - a loga_iomode of LAYOUTIOMODE4_ANY is specified, the server should - return NFS4ERR_BADIOMODE. + file system the metadata server MUST return + NFS4ERR_LAYOUTUNAVAILABLE. If the layout type is not supported, the + metadata server MUST return NFS4ERR_UNKNOWN_LAYOUTTYPE. If layouts + are supported but no layout matches the client provided layout + identification, the metadata server MUST return NFS4ERR_BADLAYOUT. + If an invalid loga_iomode is specified, or a loga_iomode of + LAYOUTIOMODE4_ANY is specified, the metadata server MUST return + NFS4ERR_BADIOMODE. If the layout for the file is unavailable due to transient - conditions, e.g. file sharing prohibits layouts, the server MUST - return NFS4ERR_LAYOUTTRYLATER. + conditions, e.g. file sharing prohibits layouts, the metadata server + MUST return NFS4ERR_LAYOUTTRYLATER. If the layout request is rejected due to an overlapping layout - recall, the server MUST return NFS4ERR_RECALLCONFLICT. See + recall, the metadata server MUST return NFS4ERR_RECALLCONFLICT. 
See Section 12.5.5.2 for details. If the layout conflicts with a mandatory byte range lock held on the file, and if the storage devices have no method of enforcing mandatory locks, other than through the restriction of layouts, the - metadata server should return NFS4ERR_LOCKED. + metadata server SHOULD return NFS4ERR_LOCKED. If the client sets loga_signal_layout_avail to TRUE, then it is registering with the metadata server a "want" for a layout in the event the - layout cannot be obtained due to resource exhaustion. If the server - supports and will honor the "want", the results will have - logr_will_signal_layout_avail set to TRUE. If so the client should - expect a CB_RECALLABLE_OBJ_AVAIL operation to indicate that a layout - is available. + layout cannot be obtained due to resource exhaustion. If the + metadata server supports and will honor the "want", the results will + have logr_will_signal_layout_avail set to TRUE. If so the client + should expect a CB_RECALLABLE_OBJ_AVAIL operation to indicate that a + layout is available. On success, the current filehandle retains its value and the current stateid is updated to match the value as returned in the results. 18.43.4. IMPLEMENTATION Typically, LAYOUTGET will be called as part of a COMPOUND request after an OPEN operation and results in the client having location information for the file; this requires that loga_stateid be set to - the special stateid that tells the server to use the current stateid, - which is set by OPEN (see Section 16.2.3.1.2) . A client may also - hold a layout across multiple OPENs. The client specifies a layout - type that limits what kind of layout the server will return. This - prevents servers from issuing layouts that are unusable by the - client. + the special stateid that tells the metadata server to use the current + stateid, which is set by OPEN (see Section 16.2.3.1.2). A client + may also hold a layout across multiple OPENs.
The client specifies a + layout type that limits what kind of layout the metadata server will + return. This prevents metadata servers from granting layouts that + are unusable by the client. + + As indicated by Table 21 and Table 22 the specification of LAYOUTGET + allows a pNFS client and server considerable flexibility. A pNFS + client can take several strategies for sending LAYOUTGET. Some + examples are as follows. + + o If LAYOUTGET is preceded by OPEN in the same COMPOUND request, and + the OPEN requests read access, the client might opt to request a + _READ layout with loga_offset set to zero, loga_minlength set to + zero, and loga_length set to NFS4_UINT64_MAX. If the file has + space allocated to it, that space is striped over one or more + storage devices, and there is either no conflicting layout, or the + concept of a conflicting layout does not apply to the pNFS + server's layout type or implementation, then the metadata server + might return a layout with a starting offset of zero, and a length + equal to the length of the file, if not NFS4_UINT64_MAX. If the + length of the file is not a multiple of the pNFS server's stripe + width (see Section 13.2 for a formal definition), the metadata + server might round the returned layout's length up. + + o If LAYOUTGET is preceded by OPEN in the same COMPOUND request, and + the OPEN does not truncate the file, and requests write access, + the client might opt to request a _RW layout with loga_offset set + to zero, loga_minlength set to zero, and loga_length set to the + file's current length (if known), or NFS4_UINT64_MAX. As with the + previous case, under some conditions the metadata server might + return a layout that covers the entire length of the file or + beyond. + + o As above, but the OPEN truncates the file. 
In this case, the client + might anticipate it will be writing to the file from offset zero, + and so loga_offset and loga_minlength are set to zero, and + loga_length is set to the value of threshold4_write_iosize. The + metadata server might return a layout from offset zero with a + length at least as long as threshold4_write_iosize. + + o A process on the client invokes a request to read from offset + 10000 for length 50000. The client is using buffered I/O, and has + buffer sizes of 4096 bytes. The client intends to map the request + of the process into a series of READ requests starting at offset + 8192. The end offset needs to be higher than 10000 + 50000 = + 60000, and the next offset that is a multiple of 4096 is 61440. + The difference between 61440 and that starting offset of the + layout is 53248 (which is the product of 4096 and 13). The value + of threshold4_read_iosize is less than 53248, so the client sends + a LAYOUTGET request with loga_offset set to 8192, loga_minlength + set to 53248, and loga_length set to the file's length (if known) + minus 8192 or NFS4_UINT64_MAX (if the file's length is not known). + Since this LAYOUTGET request exceeds the metadata server's + threshold, it grants the layout, possibly with an initial offset + of 0, with an end offset of at least 8192 + 53248 - 1 = 61439, but + preferably a layout with an offset aligned on the stripe width and + a length that is a multiple of the stripe width. + + o As above, but the client is not using buffered I/O, and instead + all internal I/O requests are sent directly to the server. The + LAYOUTGET request has loga_offset equal to 10000, and + loga_minlength set to 50000. The value of loga_length is set to + the length of the file. The metadata server is free to return a + layout that fully overlaps the requested range, with a starting + offset and length aligned on the stripe width.
+ + o Again a process on the client invokes a request to read from + offset 10000 for length 50000, and buffered I/O is in use. The + client is expecting that the server might not be able to return + the layout for the full I/O range, with loga_offset set to 8192 + and loga_minlength set to 53248. The client intends to map the + request of the process into a series of READ requests starting at + offset 8192, each with length 4096, with a total length of 53248 + (which equals 13 * 4096). Because the value of + threshold4_read_iosize is equal to 4096, it is practical and + reasonable for the client to use several LAYOUTGETs to complete + the series of READs. The client sends a LAYOUTGET request with + loga_offset set to 8192, loga_minlength set to 4096, and + loga_length set to 53248 or higher. The server will grant a + layout possibly with an initial offset of 0, with an end offset of + at least 8192 + 4096 - 1 = 12287, but preferably a layout with an + offset aligned on the stripe width and a length that is a multiple + of the stripe width. This will allow the client to make forward + progress, possibly having to issue more LAYOUTGET requests for the + remainder of the range. + + o An NFS client detects a sequential read pattern, and so issues a + LAYOUTGET that goes well beyond any current or pending read + requests to the server. The server might likewise detect this + pattern, and grant the LAYOUTGET request. The client continues to + send LAYOUTGET requests once it has read from an offset of the + file that represents 50% of the way through the last layout it + received. + + o As above but the client fails to detect the pattern, but the + server does. The next time the metadata server gets a LAYOUTGET, + it returns a layout with a length that is well beyond + loga_minlength. + + o A client is using buffered I/O, and has a long queue of write + behinds to process and also detects a sequential write pattern. 
+ It issues a LAYOUTGET for a layout that spans the range of the + queued write behinds and well beyond, including ranges beyond the + file's current length. The client continues to issue LAYOUTGETs + once the write behind queue reaches 50% of the maximum queue + length. Once the client has obtained a layout referring to a particular - device ID, the server MUST NOT delete the device ID until the layout - is returned or revoked. + device ID, the metadata server MUST NOT delete the device ID until + the layout is returned or revoked. CB_NOTIFY_DEVICEID can race with LAYOUTGET. One race scenario is that LAYOUTGET returns a device ID the client does not have device - address mappings for, and the server sends a CB_NOTIFY_DEVICEID to - add the device ID to the client's awareness and meanwhile the client - sends GETDEVICEINFO on the device ID. This scenario is discussed in - Section 18.40.4. Another scenario is that the CB_NOTIFY_DEVICEID is - processed by the client before it processes the results from - LAYOUTGET. The client will send a GETDEVICEINFO on the device ID. - If the results from GETDEVICEINFO are received before the client gets - results from LAYTOUTGET, then there is no longer a race. If the - results from LAYOUTGET are received before the results from - GETDEVICEINFO, the client can either wait for results of + address mappings for, and the metadata server sends a + CB_NOTIFY_DEVICEID to add the device ID to the client's awareness and + meanwhile the client sends GETDEVICEINFO on the device ID. This + scenario is discussed in Section 18.40.4. Another scenario is that + the CB_NOTIFY_DEVICEID is processed by the client before it processes + the results from LAYOUTGET. The client will send a GETDEVICEINFO on + the device ID. If the results from GETDEVICEINFO are received before + the client gets results from LAYOUTGET, then there is no longer a + race.
If the results from LAYOUTGET are received before the results + from GETDEVICEINFO, the client can either wait for results of GETDEVICEINFO, or send another one to get possibly more up to date device address mappings for the device ID. 18.44. Operation 51: LAYOUTRETURN - Release Layout Information 18.44.1. ARGUMENT /* Constants used for LAYOUTRETURN and CB_LAYOUTRECALL */ const LAYOUT4_RET_REC_FILE = 1; const LAYOUT4_RET_REC_FSID = 2; @@ -24850,23 +25137,23 @@ If SEQUENCE returns an error, then the state of the slot (sequence id, cached reply) MUST NOT change, and the associated lease MUST NOT be renewed. If SEQUENCE returns NFS4_OK, then the associated lease MUST be renewed (see Section 8.3), except if SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED is returned in sr_status_flags. 18.46.4. IMPLEMENTATION - The server MUST maintain a mapping of sessionid to client ID in order - to validate any operations that follow SEQUENCE that take a stateid - as an argument and/or result. + The server MUST maintain a mapping of session id to client ID in + order to validate any operations that follow SEQUENCE that take a + stateid as an argument and/or result. If the client establishes a persistent session, then a SEQUENCE done after a server restart may encounter requests performed and recorded in a persistent reply cache before the server restart. In this case, SEQUENCE will be processed successfully, while requests which were not processed previously are rejected with NFS4ERR_DEADSESSION. Depending on which of the operations within the COMPOUND were successfully performed before the server restart, these operations will also have replies sent from the server reply cache. Note that @@ -25278,30 +25565,30 @@ Once a RECLAIM_COMPLETE is done, there can be no further reclaim operations for locks whose scope is defined as having completed recovery. 
Once the client sends RECLAIM_COMPLETE, the server will not allow the client to do subsequent reclaims of locking state for that scope and if these are attempted, will return NFS4ERR_NO_GRACE. Whenever a client establishes a new client ID and before it does the first non-reclaim operation that obtains a lock, it MUST do a global RECLAIM_COMPLETE, even if there are no locks to reclaim. If non- - reclaim locking operations are done before the RECLAIM_COMPLETE, a + reclaim locking operations are done before the RECLAIM_COMPLETE, an NFS4ERR_GRACE error will be returned. Similarly, when the client accesses a file system on a new server, before it sends the first non-reclaim operation that obtains a lock on this new server, it must do a RECLAIM_COMPLETE with rca_one_fs set to TRUE and current filehandle within that file system, even if there are no locks to reclaim. If non-reclaim locking operations are done - on that file system before the RECLAIM_COMPLETE, a NFS4ERR_GRACE will - be returned. + on that file system before the RECLAIM_COMPLETE, an NFS4ERR_GRACE + will be returned. Any locks not reclaimed at the point at which RECLAIM_COMPLETE is done become non-reclaimable. The client MUST NOT attempt to reclaim them, either during the current server instance or in any subsequent server instance, or on another server to which responsibility for that file system is transferred. If the client were to do so, it would be violating the protocol by representing itself as owning locks that it does not own, and so has no right to reclaim. See Section 8.4.3 for a discussion of edge conditions related to lock reclaim. @@ -25386,24 +25673,25 @@ 19.1.1. ARGUMENTS void; 19.1.2. RESULTS void; 19.1.3. DESCRIPTION - Standard NULL procedure. Void argument, void response. Even though - there is no direct functionality associated with this procedure, the - server will use CB_NULL to confirm the existence of a path for RPCs - from server to client. 
+ CB_NULL is the standard ONC RPC NULL procedure, with the standard + void argument and void response. Even though there is no direct + functionality associated with this procedure, the server will use + CB_NULL to confirm the existence of a path for RPCs from the server + to the client. 19.1.4. ERRORS None. 19.2. Procedure 1: CB_COMPOUND - Compound Operations 19.2.1. ARGUMENTS enum nfs_cb_opnum4 { @@ -25508,37 +25796,37 @@ nfs_cb_resop4 resarray<>; }; 19.2.3. DESCRIPTION The CB_COMPOUND procedure is used to combine one or more of the callback procedures into a single RPC request. The main callback RPC program has two main procedures: CB_NULL and CB_COMPOUND. All other operations use the CB_COMPOUND procedure as a wrapper. - In the processing of the CB_COMPOUND procedure, the client may find - that it does not have the available resources to execute any or all - of the operations within the CB_COMPOUND sequence. This is discussed - in Section 2.10.5.4. + During the processing of the CB_COMPOUND procedure, the client may + find that it does not have the available resources to execute any or + all of the operations within the CB_COMPOUND sequence. Refer to + Section 2.10.5.4 for details. The minorversion field of the arguments MUST be the same as the minorversion of the COMPOUND procedure used to create the client ID and session. For NFSv4.1, minorversion MUST be set to 1. Contained within the CB_COMPOUND results is a 'status' field. This status must be equivalent to the status of the last operation that was executed within the CB_COMPOUND procedure. Therefore, if an operation incurred an error then the 'status' value will be the same error value as is being returned for the operation that failed. - For a description of the "tag" field, see Section 16.2.3 where the - corresponding forward channel procedure is described. + The "tag" field is handled the same way as that of the COMPOUND procedure + (see Section 16.2.3).
Illegal operation codes are handled in the same way as they are handled for the COMPOUND procedure. 19.2.4. IMPLEMENTATION The CB_COMPOUND procedure is used to combine individual operations into a single RPC request. The client interprets each of the operations in turn. If an operation is executed by the client and the status of that operation is NFS4_OK, then the next operation in @@ -25567,21 +25855,21 @@ | NFS4ERR_INVAL | The tag argument is not in UTF-8 | | | encoding. | | NFS4ERR_MINOR_VERS_MISMATCH | | | NFS4ERR_SERVERFAULT | | | NFS4ERR_TOO_MANY_OPS | | | NFS4ERR_REP_TOO_BIG | | | NFS4ERR_REP_TOO_BIG_TO_CACHE | | | NFS4ERR_REQ_TOO_BIG | | +------------------------------+------------------------------------+ - Table 21 + Table 23 20. NFSv4.1 Callback Operations 20.1. Operation 3: CB_GETATTR - Get Attributes 20.1.1. ARGUMENT struct CB_GETATTR4args { nfs_fh4 fh; bitmap4 attr_request; @@ -25602,67 +25890,68 @@ 20.1.3. DESCRIPTION The CB_GETATTR operation is used by the server to obtain the current modified state of a file that has been write delegated. The attributes size and change are the only ones guaranteed to be serviced by the client. See Section 10.4.3 for a full description of how the client and server are to interact with the use of CB_GETATTR. If the filehandle specified is not one for which the client holds a - write open delegation, an NFS4ERR_BADHANDLE error is returned. + write delegation, an NFS4ERR_BADHANDLE error is returned. 20.1.4. IMPLEMENTATION The client returns attrmask bits and the associated attribute values only for the change attribute, and attributes that it may change (time_modify, and size). -20.2. Operation 4: CB_RECALL - Recall an Open Delegation +20.2. Operation 4: CB_RECALL - Recall a Delegation 20.2.1. ARGUMENT struct CB_RECALL4args { stateid4 stateid; bool truncate; nfs_fh4 fh; }; 20.2.2. RESULT struct CB_RECALL4res { nfsstat4 status; }; 20.2.3. 
DESCRIPTION - The CB_RECALL operation is used to begin the process of recalling an - open delegation and returning it to the server. + The CB_RECALL operation is used to begin the process of recalling a + delegation and returning it to the server. - The truncate flag is used to optimize recall for a file which is - about to be truncated to zero. When it is set, the client is freed - of obligation to propagate modified data for the file to the server, - since this data is irrelevant. + The truncate flag is used to optimize recall for a file object which + is a regular file and is about to be truncated to zero. When it is + TRUE, the client is freed of the obligation to propagate modified + data for the file to the server, since this data is irrelevant. - If the handle specified is not one for which the client holds an open + If the handle specified is not one for which the client holds a delegation, an NFS4ERR_BADHANDLE error is returned. If the stateid specified is not one corresponding to an open delegation for the file specified by the filehandle, an NFS4ERR_BAD_STATEID is returned. 20.2.4. IMPLEMENTATION - The client should reply to the callback immediately. Replying does - not complete the recall except when an error was returned. The - recall is not complete until the delegation is returned using a - DELEGRETURN. + The client SHOULD reply to the callback immediately. Replying does + not complete the recall except when the value of the reply's status + field is neither NFS4ERR_DELAY nor NFS4_OK. The recall is not + complete until the delegation is returned using a DELEGRETURN + operation. 20.3. Operation 5: CB_LAYOUTRECALL - Recall Layout from Client 20.3.1. ARGUMENT /* * NFSv4.1 callback arguments and results */ enum layoutrecall_type4 { @@ -25697,45 +25986,45 @@ 20.3.2. RESULT struct CB_LAYOUTRECALL4res { nfsstat4 clorr_status; }; 20.3.3. 
DESCRIPTION The CB_LAYOUTRECALL operation is used by the server to recall layouts from the client; as a result, the client will begin the process of - returning layouts with LAYOUTRETURN. The CB_LAYOUTRECALL operation + returning layouts via LAYOUTRETURN. The CB_LAYOUTRECALL operation specifies one of three forms of recall processing with the value of layoutrecall_type4. The recall is either for a specific layout (by file), for an entire file system (FSID), or for all file systems (ALL). The behavior of the operation varies based on the value of the layoutrecall_type4. The value and behaviors are: LAYOUTRECALL4_FILE - For a layout to match the recall request, the following fields - must match in value with the layout: clora_type, clora_iomode, - lor_fh, and the byte range specified by lor_offset, and - lor_length. The clora_iomode field may have a special value of - LAYOUTIOMODE4_ANY. The LAYOUTIOMODE4_ANY will match any value - originally returned in a layout; therefore it acts as a wild card - for iomode. The other special value used is for lor_length. If - lor_length has a value of NFS4_MAXFILELEN, the lor_length field - means the maximum possible file size. If a matching layout is - found, it MUST be returned using the LAYOUTRETURN operation, see - Section 18.44. An example of the field's special value use is if - clora_iomode is LAYOUTIOMODE4_ANY, lor_offset is zero, and - lor_length is NFS4_MAXFILELEN, then the entire layout is to be - returned. + For a layout to match the recall request, the values of the + following fields must match those of the layout: clora_type, + clora_iomode, lor_fh, and the byte range specified by lor_offset + and lor_length. The clora_iomode field may have a special value + of LAYOUTIOMODE4_ANY. The special value LAYOUTIOMODE4_ANY will + match any iomode originally returned in a layout; therefore it + acts as a wild card. The other special value used is for + lor_length. 
If lor_length has a value of NFS4_UINT64_MAX, the + lor_length field means the maximum possible file size. If a + matching layout is found, it MUST be returned using the + LAYOUTRETURN operation (see Section 18.44). An example of the + field's special value use is if clora_iomode is LAYOUTIOMODE4_ANY, + lor_offset is zero, and lor_length is NFS4_UINT64_MAX, then the + entire layout is to be returned. The NFS4ERR_NOMATCHING_LAYOUT error is only returned when the client does not hold layouts for the file or if the client does not have any overlapping layouts for the specification in the layout recall. LAYOUTRECALL4_FSID and LAYOUTRECALL4_ALL If LAYOUTRECALL4_FSID is specified, the fsid specifies the file system for which any outstanding layouts MUST be returned. If @@ -25746,65 +26035,64 @@ respective LAYOUTRETURN with either LAYOUTRETURN4_FSID or LAYOUTRETURN4_ALL acknowledges to the server that the client invalidated the said device mappings. See Section 12.5.5.2.1.5 for considerations with "bulk" recall of layouts. The NFS4ERR_NOMATCHING_LAYOUT error is only returned when the client does not hold layouts and does not have valid deviceid mappings. In processing the layout recall request, the client also varies its - behavior on the value of the clora_changed field. This field is used - by the server to provide additional context for the reason why the - layout is being recalled. A FALSE value for clora_changed indicates - that no change in the layout is expected and the client may write - modified data to the storage devices involved; this must be done - prior to returning the layout via LAYOUTRETURN. A TRUE value for - clora_changed indicates that the server is changing the layout. - Examples of layout changes and reasons for a TRUE indication are: + behavior based on the value of the clora_changed field. This field + is used by the server to provide additional context for the reason + why the layout is being recalled. 
A FALSE value for clora_changed + indicates that no change in the layout is expected and the client may + write modified data to the storage devices involved; this must be + done prior to returning the layout via LAYOUTRETURN. A TRUE value + for clora_changed indicates that the server is changing the layout. + Examples of layout changes and reasons for a TRUE indication are: the metadata server is restriping the file or a permanent error has occurred on a storage device and the metadata server would like to provide a new layout for the file. Therefore, a clora_changed value of TRUE indicates some level of change for the layout and the client SHOULD NOT write and commit modified data to the storage devices. In this case, the client writes and commits data through the metadata server. See Section 12.5.3 for a description of how the lor_stateid field in the arguments is to be constructed. Note that the "seqid" field of lor_stateid MUST NOT be zero. See Section 8.2, Section 12.5.3, and Section 12.5.5.2 for a further discussion and requirements. 20.3.4. IMPLEMENTATION The client's processing for CB_LAYOUTRECALL is similar to CB_RECALL - (recall of file delegations) in that straightforward processing of - the layout recall done and the client responds to the request before - actually returning layouts with the LAYOUTRETURN operation. While - the client responds to the CB_LAYOUTRECALL immediately, the operation - is not considered complete (i.e. considered pending) until all - affected layouts are returned to the server with the LAYOUTRETURN - operation. + (recall of file delegations) in that the client responds to the + request before actually returning layouts via the LAYOUTRETURN + operation. While the client responds to the CB_LAYOUTRECALL + immediately, the operation is not considered complete (i.e. + considered pending) until all affected layouts are returned to the + server via the LAYOUTRETURN operation. 
- Before returning the layout to the server with LAYOUTRETURN, the + Before returning the layout to the server via LAYOUTRETURN, the client should wait for the response from in-process or in-flight READ, WRITE, or COMMIT operations that use the recalled layout. - If the client is holding modified data which is effected by a + If the client is holding modified data which is affected by a recalled layout, the client has various options for writing the data to the server. As always, the client may write the data through the metadata server. In fact, the client may not have a choice other than writing to the metadata server when the clora_changed argument is TRUE and a new layout is unavailable from the server. However, the client may be able to write the modified data to the storage device if the clora_changed argument is FALSE; this needs to be done - before returning the layout with LAYOUTRETURN. If the client were to + before returning the layout via LAYOUTRETURN. If the client were to obtain a new layout covering the modified data's range, then writing to the storage devices is an available alternative. Note that before obtaining a new layout, the client must first return the original layout. In the case of modified data being written while the layout is held, the client must use LAYOUTCOMMIT operations at the appropriate time; as required LAYOUTCOMMIT must be done before the LAYOUTRETURN. If a large amount of modified data is outstanding, the client may send LAYOUTRETURNs for portions of the recalled layout; this allows the @@ -25912,57 +26200,58 @@ to clients about changes to delegated directories The registration of notifications for the directories occurs when the delegation is established using GET_DIR_DELEGATION. These notifications are sent over the backchannel. The notification is sent once the original request has been processed on the server. The server will send an array of notifications for changes that might have occurred in the directory. 
The notifications are sent as a list of pairs of bitmaps and values. See Section 3.3.7 for a description of how NFSv4.1 bitmaps work. - If the server has more notifications then can fit in the CB_COMPOUND + If the server has more notifications than can fit in the CB_COMPOUND request, it SHOULD send a sequence of serial CB_COMPOUND requests so that the client's view of the directory does not become confused. E.g., if the server indicates a file named "foo" is added, and that - the file "foo" is removed, the order it which the client receives - these notifications are processed needs to be the same as the order - in which corresponding operations occurred on the server. + the file "foo" is removed, the order in which the client receives + these notifications needs to be the same as the order in which + corresponding operations occurred on the server. If the client holding the delegation makes any changes in the directory that cause files or sub directories to be added or removed, the server will notify that client of the resulting change(s). If the client holding the delegation is making attribute or cookie verifier changes only, the server does not need to send notifications to that client. The server will send the following information for each operation: NOTIFY4_ADD_ENTRY The server will send information about the new directory entry being created along with the cookie for that entry. The entry information (data type notify_add4) includes the component name of the entry and attributes. The server will send this type of entry when a file is actually being created, when an entry is being added to a directory as a result of a rename across directories (see below), and when a hard link is being created to an existing file. If this entry is added to the end of the directory, the - server will set the nad_last_entry flag to true. If the file is + server will set the nad_last_entry flag to TRUE.
If the file is added such that there is at least one entry before it, the server will also return the previous entry information (nad_prev_entry, a variable length array of up to one element. If the array is of zero length, there is no previous entry), along with its cookie. - This is to help clients find the right location in their DNLC or - directory caches where this entry should be cached. If the new - entry's cookie is available, it will be in nad_new_entry_cookie - (another variable length array of up to one element). If the - addition of the entry causes another entry to be deleted (which - can only happen in the rename case) atomically with the addition, - then information on this entry is reported in nad_old_entry. + This is to help clients find the right location in their file name + caches and directory caches where this entry should be cached. If + the new entry's cookie is available, it will be in the + nad_new_entry_cookie (another variable length array of up to one + element) field. If the addition of the entry causes another entry + to be deleted (which can only happen in the rename case) + atomically with the addition, then information on this entry is + reported in nad_old_entry. NOTIFY4_REMOVE_ENTRY The server will send information about the directory entry being deleted. The server will also send the cookie value for the deleted entry so that clients can get to the cached information for this entry. NOTIFY4_RENAME_ENTRY The server will send information about both the old entry and the new entry. This includes name and attributes for each entry. In @@ -26013,57 +26302,56 @@ 20.5.2. RESULT struct CB_PUSH_DELEG4res { nfsstat4 cpdr_status; }; 20.5.3. DESCRIPTION CB_PUSH_DELEG is used by the server to both signal to the client that - the delegation it wants is available and to simultaneously offer the - delegation to the client. 
- The client has the choice of accepting the - delegation by returning NFS4_OK to the server, delaying the decision - to accept the offered delegation by returning NFS4ERR_DELAY or - permanently rejecting the offer of the delegation by returning - NFS4ERR_REJECT_DELEG. When a delegation is rejected in this fashion, - the want previously established is permanently deleted. - - The server MUST send in cpda_delegation a delegation which satisfies - a request made in an OPEN or WANT_DELEGATION operation. + the delegation it wants (previously indicated via a want established + from an OPEN or WANT_DELEGATION operation) is available and to + simultaneously offer the delegation to the client. The client has + the choice of accepting the delegation by returning NFS4_OK to the + server, delaying the decision to accept the offered delegation by + returning NFS4ERR_DELAY or permanently rejecting the offer of the + delegation by returning NFS4ERR_REJECT_DELEG. When a delegation is + rejected in this fashion, the want previously established is + permanently deleted and the delegation is subject to acquisition by + another client. 20.5.4. IMPLEMENTATION If the client does return NFS4ERR_DELAY and there is a conflicting delegation request, the server MAY process it at the expense of the client that returned NFS4ERR_DELAY. The client's want will typically not be cancelled, but MAY be processed behind other delegation requests or registered wants. When a client returns a status other than NFS4_OK, NFS4ERR_DELAY, or NFS4ERR_REJECT_DELEG, the want remains pending, although servers may decide to cancel the want by sending a CB_WANTS_CANCELLED. -20.6. Operation 8: CB_RECALL_ANY - Keep any N delegations +20.6. Operation 8: CB_RECALL_ANY - Keep any N recallable objects - Notify client to return delegation and keep N of them. + Notify client to return all but N recallable objects. 20.6.1.
ARGUMENT const RCA4_TYPE_MASK_RDATA_DLG = 0; const RCA4_TYPE_MASK_WDATA_DLG = 1; const RCA4_TYPE_MASK_DIR_DLG = 2; const RCA4_TYPE_MASK_FILE_LAYOUT = 3; - const RCA4_TYPE_MASK_BLK_LAYOUT_MIN = 4; - const RCA4_TYPE_MASK_BLK_LAYOUT_MAX = 7; + const RCA4_TYPE_MASK_BLK_LAYOUT = 4; const RCA4_TYPE_MASK_OBJ_LAYOUT_MIN = 8; - const RCA4_TYPE_MASK_OBJ_LAYOUT_MAX = 11; + const RCA4_TYPE_MASK_OBJ_LAYOUT_MAX = 9; const RCA4_TYPE_MASK_OTHER_LAYOUT_MIN = 12; const RCA4_TYPE_MASK_OTHER_LAYOUT_MAX = 15; struct CB_RECALL_ANY4args { uint32_t craa_objects_to_keep; bitmap4 craa_type_mask; }; 20.6.2. RESULT @@ -26097,37 +26385,67 @@ resource pools for layouts and for delegations, or further separate resources by types of delegations. When a given resource pool is over-utilized, the server can send a CB_RECALL_ANY to clients holding recallable objects of the types involved, allowing it to keep a certain number of such objects and return any excess. A mask specifies which types of objects are to be limited. The client chooses, based on its own knowledge of current usefulness, which of the objects in that class should be returned. - For NFSv4.1, a number of bits are defined. For some of these, ranges - are defined and it is up to the definition of the storage protocol to - specify how these are to be used. There are ranges for blocks-based - storage protocols, for object-based storage protocols and a reserved - range for other experimental storage protocols. The RFC defining - such a storage protocol needs to specify how particular bits within - its range are to be used. For example, it may specify a mapping - between attributes of the layout (read vs. write, size of area) and - the bit to be used or it may define a field in the layout where the - associated bit position is made available by the server to the - client. + A number of bits are defined. For some of these, ranges are defined + and it is up to the definition of the storage protocol to specify how + these are to be used. 
There are ranges reserved for object-based + storage protocols and for other experimental storage protocols. An + RFC defining such a storage protocol needs to specify how particular + bits within its range are to be used. For example, it may specify a + mapping between attributes of the layout (read vs. write, size of + area) and the bit to be used or it may define a field in the layout + where the associated bit position is made available by the server to + the client. - When an undefined bit is set in the type mask, NFS4ERR_INVAL should - be returned. If a client does not support an object of the specified - type, if the bit is defined, NFS4ERR_INVAL should not be returned. - Future minor versions of NFSv4 may expand the set of valid type mask - bits. + RCA4_TYPE_MASK_RDATA_DLG + + The client is to return read delegations on non-directory file + objects. + + RCA4_TYPE_MASK_WDATA_DLG + + The client is to return write delegations on regular file objects. + + RCA4_TYPE_MASK_DIR_DLG + + The client is to return directory delegations. + + RCA4_TYPE_MASK_FILE_LAYOUT + + The client is to return layouts of type LAYOUT4_NFSV4_1_FILES. + + RCA4_TYPE_MASK_BLK_LAYOUT + + See [31] for a description. + + RCA4_TYPE_MASK_OBJ_LAYOUT_MIN to RCA4_TYPE_MASK_OBJ_LAYOUT_MAX + + See [30] for a description. + + RCA4_TYPE_MASK_OTHER_LAYOUT_MIN to RCA4_TYPE_MASK_OTHER_LAYOUT_MAX + + This range is reserved for telling the client to recall layouts of + experimental or site specific layout types (see Section 3.3.13). + + When a bit is set in the type mask that corresponds to an undefined + type of recallable object, NFS4ERR_INVAL MUST be returned. When a + bit is set that corresponds to a defined type of object, but the + client does not support an object of the type, NFS4ERR_INVAL MUST NOT + be returned. Future minor versions of NFSv4 may expand the set of + valid type mask bits. 
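The type-mask rule in the added text above can be sketched as follows. This is a minimal illustration, not part of the draft: it assumes the bit numbers from the -23 constants in Section 20.6.1, models the bitmap4 as a plain Python integer, and uses a hypothetical helper name (check_type_mask) and a hypothetical supported_bits parameter. The numeric values of NFS4_OK and NFS4ERR_INVAL are taken from the NFSv4 nfsstat4 enumeration.

```python
NFS4_OK = 0
NFS4ERR_INVAL = 22

# Bit positions copied from the draft-23 constants in Section 20.6.1.
RCA4_TYPE_MASK_RDATA_DLG = 0
RCA4_TYPE_MASK_WDATA_DLG = 1
RCA4_TYPE_MASK_DIR_DLG = 2
RCA4_TYPE_MASK_FILE_LAYOUT = 3
RCA4_TYPE_MASK_BLK_LAYOUT = 4
RCA4_TYPE_MASK_OBJ_LAYOUT_MIN = 8
RCA4_TYPE_MASK_OBJ_LAYOUT_MAX = 9
RCA4_TYPE_MASK_OTHER_LAYOUT_MIN = 12
RCA4_TYPE_MASK_OTHER_LAYOUT_MAX = 15

# Every bit that corresponds to a defined type of recallable object.
DEFINED_BITS = (
    {
        RCA4_TYPE_MASK_RDATA_DLG,
        RCA4_TYPE_MASK_WDATA_DLG,
        RCA4_TYPE_MASK_DIR_DLG,
        RCA4_TYPE_MASK_FILE_LAYOUT,
        RCA4_TYPE_MASK_BLK_LAYOUT,
    }
    | set(range(RCA4_TYPE_MASK_OBJ_LAYOUT_MIN, RCA4_TYPE_MASK_OBJ_LAYOUT_MAX + 1))
    | set(range(RCA4_TYPE_MASK_OTHER_LAYOUT_MIN, RCA4_TYPE_MASK_OTHER_LAYOUT_MAX + 1))
)


def check_type_mask(type_mask, supported_bits):
    """Hypothetical client-side check of craa_type_mask.

    Returns (status, bits_to_act_on). An undefined bit yields NFS4ERR_INVAL;
    a defined bit for an unsupported object type is ignored, not an error.
    """
    set_bits = {bit for bit in range(type_mask.bit_length()) if (type_mask >> bit) & 1}
    if set_bits - DEFINED_BITS:
        return NFS4ERR_INVAL, set()
    return NFS4_OK, set_bits & set(supported_bits)
```

For instance, a client that only holds read delegations would call check_type_mask(mask, {RCA4_TYPE_MASK_RDATA_DLG}) and act only on the bits returned.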
CB_RECALL_ANY specifies a count of objects that the client may keep as opposed to a count that the client must return. This is to avoid potential race between a CB_RECALL_ANY that had a count of objects to free with a set of client-originated operations to return layouts or delegations. As a result of the race, the client and server would have differing ideas as to how many objects to return. Hence the client could mistakenly free too many. If resource demands prompt it, the server may send another @@ -26185,26 +26503,39 @@ nfsstat4 croa_status; }; 20.7.3. DESCRIPTION CB_RECALLABLE_OBJ_AVAIL is used by the server to signal the client that the server has resources to grant recallable objects that might previously have been denied by OPEN, WANT_DELEGATION, GET_DIR_DELEG, or LAYOUTGET. - The argument, objects_to_keep means the total number of recallable - objects of the types indicated in the argument type_mask that the - server believes it can allow the client to have, including the number - of such objects the client already has. A client that tries to - acquire more recallable objects than the server informs it can have - runs the risk of having objects recalled. + The argument craa_objects_to_keep means the total number of + recallable objects of the types indicated in the argument type_mask + that the server believes it can allow the client to have, including + the number of such objects the client already has. A client that + tries to acquire more recallable objects than the server informs it + can have runs the risk of having objects recalled. + + The server is not obligated to reserve the difference between the + number of the objects the client currently has and the value of + craa_objects_to_keep, nor does delaying the reply to + CB_RECALLABLE_OBJ_AVAIL prevent the server from using the resources + of the recallable objects for another purpose. 
Indeed, if a client + responds slowly to CB_RECALLABLE_OBJ_AVAIL, the server might + interpret the client as having reduced capability to manage + recallable objects, and so cancel or reduce any reservation it is + maintaining on behalf of the client. Thus if the client desires to + acquire more recallable objects, it needs to reply quickly to + CB_RECALLABLE_OBJ_AVAIL, and then send the appropriate operations to + acquire recallable objects. 20.8. Operation 10: CB_RECALL_SLOT - change flow control limits Change flow control limits 20.8.1. ARGUMENT struct CB_RECALL_SLOT4args { slotid4 rsa_target_highest_slotid; }; @@ -26212,24 +26543,25 @@ 20.8.2. RESULT struct CB_RECALL_SLOT4res { nfsstat4 rsr_status; }; 20.8.3. DESCRIPTION The CB_RECALL_SLOT operation requests the client to return session slots, and if applicable, transport credits (e.g. RDMA credits for - connections associated with the operations channel) to the server. - CB_RECALL_SLOT specifies rsa_target_highest_slotid, the target - highest_slot the server wants for the session. The client, should - then work toward reducing the highest_slot to the target. + connections associated with the operations channel) of the session's + fore channel. CB_RECALL_SLOT specifies rsa_target_highest_slotid, + the value of the target highest slot id the server wants for the + session. The client MUST then progress toward reducing the session's + highest slot id to the target value. If the session has only non-RDMA connections associated with its operations channel, then the client need only wait for all outstanding requests with a slotid > rsa_target_highest_slotid to complete, then send a single COMPOUND consisting of a single SEQUENCE operation, with the sa_highestslot field set to rsa_target_highest_slotid. If there are RDMA-based connections associated with operation channel, then the client needs to also send enough zero-length RDMA Sends to take the total RDMA credit count to rsa_target_highest_slotid + 1 or below. 
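The client-side handling of CB_RECALL_SLOT described above can be sketched as a small planning function. This is an illustrative sketch only: the function name plan_recall_slot and the shape of the returned dictionary are hypothetical, and the RDMA arithmetic assumes that each zero-length RDMA Send lowers the total credit count by exactly one.

```python
def plan_recall_slot(outstanding_slotids, target_highest_slotid, rdma_credits=None):
    """Hypothetical planner for a client's reaction to CB_RECALL_SLOT.

    outstanding_slotids: slot ids with requests still in flight.
    target_highest_slotid: the rsa_target_highest_slotid from the server.
    rdma_credits: current total RDMA credit count, or None for non-RDMA.
    """
    # Requests on slots above the target must complete before the client
    # sends the single-SEQUENCE COMPOUND that announces the new limit.
    wait_for = sorted(s for s in outstanding_slotids if s > target_highest_slotid)
    plan = {
        "wait_for_slotids": wait_for,
        "sa_highestslot": target_highest_slotid,
        "zero_length_sends": 0,
    }
    if rdma_credits is not None:
        # Assumption: one zero-length Send reduces the credit count by one.
        # The goal is a total of rsa_target_highest_slotid + 1 or below.
        plan["zero_length_sends"] = max(0, rdma_credits - (target_highest_slotid + 1))
    return plan
```

With outstanding slots {0, 3, 7, 9} and a target of 4, the plan waits on slots 7 and 9 and then sends SEQUENCE with sa_highestslot set to 4.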
@@ -26285,42 +26617,42 @@ case NFS4_OK: CB_SEQUENCE4resok csr_resok4; default: void; }; 20.9.3. DESCRIPTION The CB_SEQUENCE operation is used to manage operational accounting for the backchannel of the session on which a request is sent. The - contents include the session to which this request belongs, slot id - and sequence id used by the server to implement session request - control and exactly once semantics, and exchanged slot maximums which - are used to adjust the size of the reply cache. This operation MUST - appear once as the first operation in each CB_COMPOUND request or a - protocol error must result. See Section 18.46.3 for a description of - how slots are processed. + contents include the session id to which this request belongs, the + slot id and sequence id used by the server to implement session + request control and exactly once semantics, and exchanged slot id + maxima which are used to adjust the size of the reply cache. This + operation will appear once as the first operation in each CB_COMPOUND + request or a protocol error MUST result. See Section 18.46.3 for a + description of how slots are processed. If csa_cachethis is TRUE, then the server is requesting that the client cache the reply in the callback reply cache. The client MUST cache the reply (see Section 2.10.5.1.3). The csa_referring_call_lists array is the list of COMPOUND requests, identified by session id, slot id and sequence id. These are requests that the client previously sent to the server. These previous requests created state that some operation(s) in the same CB_COMPOUND - as the csa_referring_call_lists is identifying. A sessionid is + as the csa_referring_call_lists are identifying. A session id is included because leased state is tied to a client ID, and a client ID can have multiple sessions. See Section 2.10.5.3. - The value of csa_sequenceid argument relative to the cached sequence - id on the slot falls into one of three cases.
+ The value of the csa_sequenceid argument relative to the cached + sequence id on the slot falls into one of three cases. o If the difference between csa_sequenceid and the client's cached sequence id at the slot id is two (2) or more, or if csa_sequenceid is less than the cached sequence id (accounting for wraparound of the unsigned sequence id value), then the client MUST return NFS4ERR_SEQ_MISORDERED. o If csa_sequenceid and the cached sequence id are the same, this is a retry, and the client returns the CB_COMPOUND request's cached reply. @@ -26343,22 +26675,20 @@ id, cached reply) MUST NOT change. The client returns two "highest_slotid" values: csr_highest_slotid, and csr_target_highest_slotid. The former is the highest slot id the client will accept in a future CB_SEQUENCE operation, and SHOULD NOT be less than the value of csa_highest_slotid (but see Section 2.10.5.1 for an exception). The latter is the highest slot id the client would prefer the server use on a future CB_SEQUENCE operation. -20.9.4. IMPLEMENTATION - 20.10. Operation 12: CB_WANTS_CANCELLED - Cancel Pending Delegation Wants Retracts promise to signal delegation availability. 20.10.1. ARGUMENT struct CB_WANTS_CANCELLED4args { bool cwca_contended_wants_cancelled; bool cwca_resourced_wants_cancelled; @@ -26412,50 +26742,51 @@ }; 20.11.2. RESULT struct CB_NOTIFY_LOCK4res { nfsstat4 cnlr_status; }; 20.11.3. DESCRIPTION - The server can use this operation to indicate that a lock for the - given file and lock-owner, previously requested by the client via an - unsuccessful LOCK request, might be available. + The server can use this operation to indicate that a byte-range lock + for the given file and lock-owner, previously requested by the client + via an unsuccessful LOCK request, might be available. 
This callback is meant to be used by servers to help reduce the latency of blocking locks in the case where they recognize that a client which has been polling for a blocking lock may now be able to acquire the lock. If the server supports this callback for a given file, it MUST set the OPEN4_RESULT_MAY_NOTIFY_LOCK flag when responding to successful opens for that file. This does not commit - the server to use of CB_NOTIFY_LOCK, but the client may use this as a - hint to decide how frequently to poll for locks derived from that - open. + the server to the use of CB_NOTIFY_LOCK, but the client may use this + as a hint to decide how frequently to poll for locks derived from + that open. If an OPEN operation results in an upgrade, in which the stateid returned has an "other" value matching that of a stateid already allocated, with a new "seqid" indicating a change in the lock being represented, then the value of the OPEN4_RESULT_MAY_NOTIFY_LOCK flag when responding to that new OPEN controls handling from that point going forward. When parallel OPENs are done on the same file and open-owner, the ordering of the "seqid" field of the returned stateid (subject to wraparound) are to be used to select the controlling value of the OPEN4_RESULT_MAY_NOTIFY_LOCK flag. 20.11.4. IMPLEMENTATION - The server must not grant the lock to the client unless and until it - receives an actual lock request from the client. Similarly, the + The server MUST NOT grant the lock to the client unless and until it + receives an actual LOCK request from the client. Similarly, the client receiving this callback cannot assume that it now has the - lock, or that a subsequent request for the lock will be successful. + lock, or that a subsequent LOCK request for the lock will be + successful. The server is not required to implement this callback, and even if it does, it is not required to use it in any particular case. 
Therefore the client must still rely on polling for blocking locks, as described in Section 9.6. Similarly, the client is not required to implement this callback, and even if it does, it is still free to ignore it. Therefore the server MUST NOT assume that the client will act based on the callback. @@ -26493,53 +26824,52 @@ 20.12.2. RESULT struct CB_NOTIFY_DEVICEID4res { nfsstat4 cndr_status; }; 20.12.3. DESCRIPTION The CB_NOTIFY_DEVICEID operation is used by the server to send notifications to clients about changes to pNFS device IDs. The - registration of device ID notifications occurs when the device - mapping stateid is established using GETDEVICEINFO or GETDEVICELIST. - These notifications are sent over the backchannel. The notification - is sent once the original request has been processed on the server. - The server will send an array of notifications, cnda_changes, as a - list of pairs of bitmaps and values. See Section 3.3.7 for a - description of how NFSv4.1 bitmaps work. + registration of device ID notifications is optional and is done via + GETDEVICEINFO. These notifications are sent over the backchannel + once the original request has been processed on the server. The + server will send an array of notifications, cnda_changes, as a list + of pairs of bitmaps and values. See Section 3.3.7 for a description + of how NFSv4.1 bitmaps work. As with CB_NOTIFY (Section 20.4.3), it is possible the server has more notifications than can fit in a CB_COMPOUND, thus requiring multiple CB_COMPOUNDs. Unlike CB_NOTIFY, serialization is not an issue because unlike directory entries, device IDs cannot be re-used after being deleted (Section 12.2.10). All device ID notifications contain a device ID and a layout type. The layout type is necessary because two different layout types can share the same device ID, and the common device ID can have completely different mappings for each layout type.
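Because a device ID is only meaningful together with its layout type, a client-side mapping cache would plausibly be keyed on the pair. The sketch below is an assumption-laden illustration: the class DeviceMappingCache and its methods are hypothetical, the addresses are placeholder strings, and the layout type values are assumed to follow the layouttype4 enumeration.

```python
# Assumed layouttype4 values; only used here as distinct dictionary keys.
LAYOUT4_NFSV4_1_FILES = 1
LAYOUT4_BLOCK_VOLUME = 3


class DeviceMappingCache:
    """Hypothetical client cache of device ID to device address mappings."""

    def __init__(self):
        # Keyed on (layout_type, device_id): the same device ID may map to
        # entirely different addresses under different layout types.
        self._mappings = {}

    def store(self, layout_type, device_id, address):
        self._mappings[(layout_type, device_id)] = address

    def lookup(self, layout_type, device_id):
        return self._mappings.get((layout_type, device_id))

    def invalidate(self, layout_type, device_id):
        # A notification names both fields, so only one entry is dropped.
        self._mappings.pop((layout_type, device_id), None)


cache = DeviceMappingCache()
shared_id = bytes(16)  # placeholder 16-byte device ID
cache.store(LAYOUT4_NFSV4_1_FILES, shared_id, "files-address")
cache.store(LAYOUT4_BLOCK_VOLUME, shared_id, "block-address")
cache.invalidate(LAYOUT4_NFSV4_1_FILES, shared_id)
```

After the invalidation, the block layout mapping for the same device ID is still present, which is the reason the notification carries the layout type.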
The server will send the following notifications: NOTIFY_DEVICEID4_CHANGE A previously provided device ID to device address mapping has - changed and the client uses GETDEVICEINFO or GETDEVICELIST to - obtain the updated mapping. The notification is encoded in a - value of data type notify_deviceid_change4. This data type also - contains a boolean field, ndc_immediate, which if TRUE indicates - that the change will be enforced immediately, and so the client - might not be able to complete any pending I/O to the device ID. - If ndc_immediate is FALSE, then for an indefinite time, the client - can complete pending I/O. After pending I/O is complete, the - client SHOULD get the new device ID to device address mappings - before issuing new I/O to the device ID. + changed and the client uses GETDEVICEINFO to obtain the updated + mapping. The notification is encoded in a value of data type + notify_deviceid_change4. This data type also contains a boolean + field, ndc_immediate, which if TRUE indicates that the change will + be enforced immediately, and so the client might not be able to + complete any pending I/O to the device ID. If ndc_immediate is + FALSE, then for an indefinite time, the client can complete + pending I/O. After pending I/O is complete, the client SHOULD get + the new device ID to device address mappings before issuing new + I/O to the device ID. NOTIFY4_DEVICEID_DELETE Deletes a device ID from the mappings. This notification MUST NOT be sent if the client has a layout that refers to the device ID. In other words if the server is sending a delete device ID notification, one of the following is true for layouts associated with the layout type: * The client never had a layout referring to that device ID. @@ -26564,127 +26894,125 @@ /* * CB_ILLEGAL: Response for illegal operation numbers */ struct CB_ILLEGAL4res { nfsstat4 status; }; 20.13.3. 
DESCRIPTION This operation is a placeholder for encoding a result to handle the - case of the client sending an operation code within COMPOUND that is - not defined in the NFSv4.1 specification. See Section 16.2.3 for + case of the server sending an operation code within CB_COMPOUND that + is not defined in the NFSv4.1 specification. See Section 19.2.3 for more details. The status field of CB_ILLEGAL4res MUST be set to NFS4ERR_OP_ILLEGAL. 20.13.4. IMPLEMENTATION A server will probably not send an operation with code OP_CB_ILLEGAL but if it does, the response will be CB_ILLEGAL4res just as it would be with any other invalid operation code. Note that if the client gets an illegal operation code that is not OP_ILLEGAL, and if the client checks for legal operation codes during the XDR decode phase, - then the CB_ILLEGAL4res would not be returned. + then an instance of data type CB_ILLEGAL4res will not be returned. 21. Security Considerations - NFS has historically used a model where, from an authentication - perspective, the client was the entire machine, or at least the - source network address of the machine. The NFS server relied on the - NFS client to make the proper authentication of the end-user. The - NFS server in turn shared its files only to specific clients, as - identified by the client's source network address. Given this model, - the AUTH_SYS RPC security flavor simply identified the end-user using - the client to the NFS server. When processing NFS responses, the - client ensured that the responses came from the same network address - and port number that the request was sent to. While such a model is - easy to implement and simple to deploy and use, it is certainly not a - safe model. 
Thus, NFSv4.1 implementations are REQUIRED to support a - security model that uses end to end authentication, where an end-user - on a client mutually authenticates (via cryptographic schemes that do - not expose passwords or keys in the clear on the network) to a - principal on an NFS server. Consideration should also be given to - the integrity and privacy of NFS requests and responses. The issues - of end to end mutual authentication, integrity, and privacy are - discussed Section 2.2.1.1.1. + Historically the authentication model of NFS had the entire + machine being the NFS client, and the NFS server trusting the NFS + client to authenticate the end-user. The NFS server in turn shared + its files only to specific clients, as identified by the client's + source network address. Given this model, the AUTH_SYS RPC security + flavor simply identified the end-user using the client to the NFS + server. When processing NFS responses, the client ensured that the + responses came from the same network address and port number that the + request was sent to. While such a model is easy to implement and + simple to deploy and use, it is unsafe. Thus, NFSv4.1 + implementations are REQUIRED to support a security model that uses + end to end authentication, where an end-user on a client mutually + authenticates (via cryptographic schemes that do not expose passwords + or keys in the clear on the network) to a principal on an NFS server. + Consideration is also given to the integrity and privacy of NFS + requests and responses. The issues of end to end mutual + authentication, integrity, and privacy are discussed in + Section 2.2.1.1.1.
- Note that while NFSv4.1 mandates an end to end mutual authentication - model, the "classic" model of machine authentication via network - address checking and AUTH_SYS identification can still be supported - with the caveat that the AUTH_SYS flavor is neither REQUIRED nor - RECOMMENDED by this specification, and so interoperability via - AUTH_SYS is not assured. + Note that being REQUIRED to implement does not mean REQUIRED to use; + AUTH_SYS can be used by NFSv4.1 clients and servers. However, + AUTH_SYS is merely an OPTIONAL security flavor in NFSv4.1, and so + interoperability via AUTH_SYS is not assured. For reasons of reduced administration overhead, better performance and/or reduction of CPU utilization, users of NFSv4.1 implementations may opt to not use security mechanisms that enable integrity protection on each remote procedure call and response. The use of mechanisms without integrity leaves the user vulnerable to an attacker in the middle of the NFS client and server that modifies the RPC request and/or the response. While implementations are free to provide the option to use weaker security mechanisms, there are three operations in particular that warrant the implementation overriding user choices. - The first two such operations are SECINFO SECINFO_NO_NAME. It is - RECOMMENDED that the client send the either operation such that it is - protected with a security flavor that has integrity protection, such - as RPCSEC_GSS with either the rpc_gss_svc_integrity or + o The first two such operations are SECINFO and SECINFO_NO_NAME. It + is RECOMMENDED that the client send both operations such that they + are protected with a security flavor that has integrity protection, + such as RPCSEC_GSS with either the rpc_gss_svc_integrity or rpc_gss_svc_privacy service.
Without integrity protection encapsulating SECINFO and SECINFO_NO_NAME and their results, an attacker in the middle could modify results such that the client - might select a weaker algorithm in the set allowed by server, making - the client and/or server vulnerable to further attacks. + might select a weaker algorithm in the set allowed by server, + making the client and/or server vulnerable to further attacks. - The second operation that should definitely use integrity protection - is any GETATTR for the fs_locations attribute. The attack has two - steps. First the attacker modifies the unprotected results of some - operation to return NFS4ERR_MOVED. Second, when the client follows - up with a GETATTR for the fs_locations attribute, the attacker - modifies the results to cause the client migrate its traffic to a - server controlled by the attacker. + o The third operation that should definitely use integrity + protection is any GETATTR for the fs_locations and + fs_locations_info attributes. The attack has two steps. First + the attacker modifies the unprotected results of some operation to + return NFS4ERR_MOVED. Second, when the client follows up with a + GETATTR for the fs_locations or fs_locations_info attributes, the + attacker modifies the results to cause the client to migrate its + traffic to a server controlled by the attacker. Relative to previous NFS versions, NFSv4.1 has additional security considerations for pNFS (see Section 12.9 and Section 13.12), locking and session state (see Section 2.10.7.3). 22. IANA Considerations 22.1. Named Attribute Definitions - The NFSv4.1 protocol provides for the association of named attributes - to files. The name space identifiers for these attributes are - defined as string names. The protocol does not define the specific - assignment of the name space for these file attributes.
Even though - the name space is not specifically controlled to prevent collisions, - an IANA registry has been created for the registration of NFSv4.1 - named attributes. Registration will be achieved through the + The NFSv4.1 protocol supports the association of a file with zero or + more named attributes. The name space identifiers for these + attributes are defined as string names. The protocol does not define + the specific assignment of the name space for these file attributes. + Even though the name space is not specifically controlled to prevent + collisions, an IANA registry has been created for the registration of + NFSv4.1 named attributes. Registration will be achieved through the publication of an Informational RFC and will require not only the name of the attribute but the syntax and semantics of the named attribute contents; the intent is to promote interoperability where common interests exist. While application developers are allowed to define and use attributes as needed, they are encouraged to register the attributes with IANA. Such registered named attributes are presumed to apply to all minor versions of NFSv4, including those defined subsequently to the registration. Where the named attribute is intended to be limited with regard to the minor versions for which it is not to be used, the Informational RFC must clearly state the applicable limits. 22.2. ONC RPC Network Identifiers (netids) Section 3.3.9 discussed the r_netid field and the corresponding r_addr field within a netaddr4 structure. The NFSv4 protocol depends on the syntax and semantics of these fields to effectively - communicate callback information between client and server. + communicate callback and other information between client and server. Therefore, an IANA registry has been created to include the values defined in this document and to allow for future expansion based on transport usage/availability.
Additions to this ONC RPC Network Identifier registry must be done with the publication of an RFC. The initial values for this registry are as follows (some of this text is replicated from Section 3.3.9 for clarity): The Network Identifier (or r_netid for short) is used to specify a transport protocol and associated universal address (or r_addr for @@ -26723,22 +27051,23 @@ to NFSv4. This requires a new minor version of NFSv4, and requires a standards track document from IETF. Another way to add a notification is to specify a new layout type. Notifications for new layout types would be requested via GETDEVICELIST (Section 18.41) and GETDEVICEINFO (Section 18.40). See Section 22.4. 22.4. Defining New Layout Types New layout type numbers will be requested from IANA. IANA will only provide layout type numbers for Standards Track RFCs approved by the - IESG, in accordance with Standards Action policy defined in RFC2434 - [20]. + IESG, in accordance with Standards Action policy defined in [20]. + All layout types assigned by IANA MUST be in the range 0x00000001 to + 0x7FFFFFFF. The author of a new pNFS layout specification must follow these steps to obtain acceptance of the layout type as a standard: 1. The author devises the new layout specification. 2. The new layout type specification MUST, at a minimum: * Define the contents of the layout-type-specific fields of the following data types: @@ -26762,20 +27091,22 @@ 1. Failure and restart for client, server, storage device. 2. Lease expiration from perspective of the active client, server, storage device. 3. Loss of layout state resulting in fencing of client access to storage devices (for an example, see Section 12.7.3). * A list of any new notification values for CB_NOTIFY_DEVICEID. + * A list of any new recallable object types for CB_RECALL_ANY. + * Include an IANA considerations section. * Include a security considerations section. 3. The author documents the new layout specification as an Internet Draft. 4.
The author submits the Internet Draft for review through the IETF standards process as defined in "Internet Official Protocol Standards" (STD 1). The new layout specification will be @@ -26924,26 +27255,26 @@ [27] Werme, R., "RPC XID Issues", USENIX Conference Proceedings , February 1996. [28] Nowicki, B., "NFS: Network File System Protocol specification", RFC 1094, March 1989. [29] Bhide, A., Elnozahy, E., and S. Morgan, "A Highly Available Network Server", USENIX Conference Proceedings , January 1991. [30] Halevy, B., Welch, B., and J. Zelenka, "Object-based pNFS - Operations", September 2007, . + Operations", April 2008, . [31] Black, D., Fridella, S., and J. Glasgow, "pNFS Block/Volume - Layout", November 2007, . + Layout", April 2008, . [32] Callaghan, B., "WebNFS Client Specification", RFC 2054, October 1996. [33] Callaghan, B., "WebNFS Server Specification", RFC 2055, October 1996. [34] Shepler, S., "NFS Version 4 Design Considerations", RFC 2624, June 1999. @@ -26993,29 +27324,32 @@ Burnett, and Charles Fan with contributions from Ted Anderson, Neil Brown, and Jon Haswell. The initial drafts for the Directory Delegations support were contributed by Saadia Khan with input from Dave Noveck, Mike Eisler, Carl Burnett, Ted Anderson and Tom Talpey. The initial drafts for the ACL explanations were contributed by Sam Falkner and Lisa Week. + The pNFS work was inspired by the NASD and OSD work done by Garth + Gibson. Gary Grider has also been a champion of high-performance + parallel I/O. Garth Gibson and Peter Corbett started the pNFS effort + with a problem statement document for IETF that formed the basis for + the pNFS work in NFSv4.1. + The initial drafts for the parallel NFS support were edited by Brent Welch and Garth Goodson. Additional authors for those documents were Benny Halevy, David Black, and Andy Adamson. 
Additional input came from the informal group which contributed to the construction of the initial pNFS drafts; specific acknowledgement goes to Gary Grider, Peter Corbett, Dave Noveck, Peter Honeyman, and Stephen Fridella. - The pNFS work was inspired by the NASD and OSD work done by Garth - Gibson. Gary Grider of the national labs (LANL) has also been a - champion of high-performance parallel I/O. Fredric Isaman found several errors in draft versions of the ONC RPC XDR description of the NFSv4.1 protocol. Audrey Van Bellingham provided, in numerous ways, essential coordination and management of the process of editing the specification drafts. Richard Jernigan gave feedback on the file layout's striping pattern design. @@ -27061,21 +27395,22 @@ Iyer, Suchit Kaura, Trond Myklebust, Anatoly Pinchuk, Spencer Shepler, Renu Tewari, Lisa Week, and Brent Welch. A review team worked together to generate the tables of assignments of error sets to operations and make sure that each such assignment had two or more people validating it. Participating in the process were: Andy Adamson, Mike Eisler, Sam Falkner, Garth Goodson, Robert Gordon, Trond Myklebust, Dave Noveck, Spencer Shepler, Tom Talpey, Amy Weaver, and Lisa Week. - Others who provided comments include: Mahesh Siddheshwar. + Others who provided comments include: Jason Goldschmidt and Mahesh + Siddheshwar. Authors' Addresses Spencer Shepler Sun Microsystems, Inc. 7808 Moonflower Drive Austin, TX 78750 USA Phone: +1-512-401-1080