| .TH VENTI 7 |
| .SH NAME |
| venti \- archival storage server |
| .SH DESCRIPTION |
| Venti is a block storage server intended for archival data. |
| In a Venti server, the SHA1 hash of a block's contents acts |
| as the block identifier for read and write operations. |
| This approach enforces a write-once policy, preventing |
| accidental or malicious destruction of data. In addition, |
| duplicate copies of a block are coalesced, reducing the |
| consumption of storage and simplifying the implementation |
| of clients. |
| .PP |
| This manual page documents the basic concepts of |
| block storage using Venti as well as the Venti network protocol. |
| .PP |
| .IR Venti (1) |
| documents some simple clients. |
| .IR Vac (1), |
| .IR vacfs (4), |
| and |
| .IR vbackup (8) |
| are more complex clients. |
| .PP |
| .IR Venti (3) |
| describes a C library interface for accessing |
| Venti servers and manipulating Venti data structures. |
| .PP |
| .IR Venti (8) |
| describes the programs used to run a Venti server. |
| .PP |
| .SS "Scores |
| The SHA1 hash that identifies a block is called its |
| .IR score . |
| The score of the zero-length block is called the |
| .IR "zero score" . |
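| .PP |
| As a minimal sketch of the definition above, a score can be computed |
| with any SHA1 implementation. |
| The helper name |
| .I score |
| is illustrative and is not part of any Venti library. |

```python
import hashlib

def score(block: bytes) -> str:
    """Return the score of a block: the SHA1 hash of its contents, in hex."""
    return hashlib.sha1(block).hexdigest()

# The zero score is the score of the zero-length block.
ZERO_SCORE = score(b"")

print(ZERO_SCORE)  # da39a3ee5e6b4b0d3255bfef95601890afd80709
```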
| .PP |
| Scores may have an optional |
| .IB label : |
| prefix, typically used to |
| describe the format of the data. |
| For example, |
| .IR vac (1) |
| uses a |
| .B vac: |
| prefix, while |
| .IR vbackup (8) |
| uses prefixes corresponding to the file system |
| types: |
| .BR ext2: , |
| .BR ffs: , |
| and so on. |
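| .PP |
| The following sketch separates an optional label from a score written |
| as 40 hexadecimal digits. |
| The function name |
| .I parse_score |
| and its error handling are assumptions made for illustration; |
| real clients may be more or less strict. |

```python
def parse_score(text: str):
    """Split an optional 'label:' prefix from a 40-digit hexadecimal score.

    Returns (label, raw_score), where label is None when no prefix is present.
    """
    label, sep, hexpart = text.rpartition(":")
    if not sep:
        label = None            # no prefix at all
    elif not label:
        raise ValueError("empty label before ':'")
    if len(hexpart) != 40:
        raise ValueError("a score is 40 hexadecimal digits")
    raw = bytes.fromhex(hexpart)  # raises ValueError on non-hex input
    if len(raw) != 20:
        raise ValueError("a score is 20 bytes")
    return label, raw

label, raw = parse_score("vac:da39a3ee5e6b4b0d3255bfef95601890afd80709")
```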
| .SS "Files and Directories |
| Venti accepts blocks up to 56 kilobytes in size. |
| By convention, Venti clients use hash trees of blocks to |
| represent arbitrary-size data |
| .IR files . |
| The data to be stored is split into fixed-size |
| blocks and written to the server, producing a list |
| of scores. |
| The resulting list of scores is split into fixed-size pointer |
| blocks (using only an integral number of scores per block) |
| and written to the server, producing a smaller list |
| of scores. |
| The process continues, eventually ending with the |
| score for the hash tree's top-most block. |
| Each file stored this way is summarized by |
| a |
| .B VtEntry |
| structure recording the top-most score, the depth |
| of the tree, the data block size, and the pointer block size. |
| One or more |
| .B VtEntry |
| structures can be concatenated |
| and stored as a special file called a |
| .IR directory . |
| In this |
| manner, arbitrary trees of files can be constructed |
| and stored. |
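| .PP |
| The construction can be sketched as follows, using a dictionary as a |
| stand-in for the server. |
| The names |
| .IR write_block , |
| .IR write_file , |
| and |
| .I read_file |
| are hypothetical. |
| The sketch assumes that the depth counts the levels of pointer blocks, |
| and it omits zero truncation and the encoding of the |
| .B VtEntry |
| structure. |

```python
import hashlib

SCORE_SIZE = 20  # bytes in a SHA1 score

def write_block(store: dict, data: bytes) -> bytes:
    """Stand-in for a server write: keep the block under its SHA1 score."""
    s = hashlib.sha1(data).digest()
    store[s] = data
    return s

def write_file(store: dict, data: bytes, dsize: int, psize: int):
    """Write data as a hash tree and return (top-most score, depth)."""
    per_ptr = psize // SCORE_SIZE  # integral number of scores per pointer block
    if per_ptr < 2:
        raise ValueError("pointer blocks must hold at least two scores")
    # Split the data into fixed-size blocks; an empty file is one empty block.
    scores = [write_block(store, data[i:i + dsize])
              for i in range(0, len(data), dsize)] or [write_block(store, b"")]
    depth = 0
    # Repeatedly pack the list of scores into pointer blocks until one remains.
    while len(scores) > 1:
        blocks = [b"".join(scores[i:i + per_ptr])
                  for i in range(0, len(scores), per_ptr)]
        scores = [write_block(store, b) for b in blocks]
        depth += 1
    return scores[0], depth

def read_file(store: dict, top: bytes, depth: int) -> bytes:
    """Walk the tree from the top-most score down to the data blocks."""
    blocks = [store[top]]
    for _ in range(depth):
        blocks = [store[b[i:i + SCORE_SIZE]]
                  for b in blocks for i in range(0, len(b), SCORE_SIZE)]
    return b"".join(blocks)

store = {}
top, depth = write_file(store, b"0123456789", dsize=4, psize=40)
```

| With these deliberately tiny block sizes, ten bytes of data become three |
| data blocks and two levels of pointer blocks. |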
| .PP |
| Scores passed between programs conventionally refer |
| to |
| .B VtRoot |
| blocks, which contain descriptive information |
| as well as the score of a directory block containing a small number |
| of directory entries. |
| .PP |
| Conventionally, programs do not mix data and directory entries |
| in the same file. Instead, they keep two separate files, one with |
| directory entries and one with metadata referencing those |
| entries by position. |
| Keeping this parallel representation is a minor annoyance |
| but makes it possible for general programs like |
| .I venti/copy |
| (see |
| .IR venti (1)) |
| to traverse the block tree without knowing the specific details |
| of any particular program's data. |
| .SS "Block Types |
| To allow programs to traverse these structures without |
| needing to understand their higher-level meanings, |
| Venti tags each block with a type. The types are: |
| .PP |
| .nf |
| .ft L |
| VtDataType 000 \f1data\fL |
| VtDataType+1 001 \fRscores of \fPVtDataType\fR blocks\fL |
| VtDataType+2 002 \fRscores of \fPVtDataType+1\fR blocks\fL |
| \fR\&...\fL |
| VtDirType 010 VtEntry\fR structures\fL |
| VtDirType+1 011 \fRscores of \fLVtDirType\fR blocks\fL |
| VtDirType+2 012 \fRscores of \fLVtDirType+1\fR blocks\fL |
| \fR\&...\fL |
| VtRootType 020 VtRoot\fR structure\fL |
| .fi |
| .PP |
| The octal numbers listed are the type numbers used |
| by the commands below. |
| (For historical reasons, the type numbers used on |
| disk and on the wire are different from the above. |
| They do not distinguish |
| .BI VtDataType+ n |
| blocks from |
| .BI VtDirType+ n |
| blocks.) |
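| .PP |
| As a sketch of how a generic traversal might use the types in the table, |
| the function below maps a pointer block's type to the type of the blocks |
| its scores refer to. |
| The constant values come from the table; the function name and the |
| assumption that the low three bits hold the pointer depth are illustrative. |

```python
VT_DATA_TYPE = 0o00
VT_DIR_TYPE = 0o10
VT_ROOT_TYPE = 0o20

def pointer_child_type(block_type: int):
    """Return the type of the blocks a pointer block's scores refer to.

    Returns None when block_type is not a pointer block, that is, when it is
    VtDataType, VtDirType, VtRootType, or outside the table.
    """
    if block_type not in range(VT_ROOT_TYPE):
        return None
    depth = block_type & 0o7  # assumed: low three bits count pointer levels
    if depth == 0:
        return None
    return block_type - 1

child = pointer_child_type(VT_DIR_TYPE + 2)
```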
| .SS "Zero Truncation |
| To avoid storing the same short data blocks padded with |
| differing numbers of zeros, Venti clients working with fixed-size |
| blocks conventionally |
| `zero truncate' the blocks before writing them to the server. |
| For example, if a 1024-byte data block contains the |
| 11-byte string |
| .RB ` hello " " world ' |
| followed by 1013 zero bytes, |
| a client would store only the 11-byte block. |
| When the client later read the block from the server, |
| it would append zero bytes to the end as necessary to |
| reach the expected size. |
| .PP |
| When truncating pointer blocks |
| .RB ( VtDataType+ \fIn |
| and |
| .BI VtDirType+ n |
| blocks), |
| trailing zero scores are removed |
| instead of trailing zero bytes. |
| .PP |
| Because of the truncation convention, |
| any file consisting entirely of zero bytes, |
| no matter what its length, will be represented by the zero score: |
| the data blocks contain all zeros and are thus truncated |
| to the empty block, and the pointer blocks contain all zero scores |
| and are thus also truncated to the empty block, |
| and so on up the hash tree. |
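| .PP |
| The convention can be sketched as follows. |
| The function names are hypothetical, and the pointer-block helpers assume |
| that a pointer block holds a whole number of 20-byte scores. |

```python
import hashlib

SCORE_SIZE = 20
ZERO_SCORE = hashlib.sha1(b"").digest()  # score of the zero-length block

def truncate_data(block: bytes) -> bytes:
    """Remove trailing zero bytes from a data block."""
    return block.rstrip(b"\x00")

def pad_data(block: bytes, size: int) -> bytes:
    """Restore a data block to its expected size after reading it."""
    return block.ljust(size, b"\x00")

def truncate_pointer(block: bytes) -> bytes:
    """Remove trailing zero scores (not zero bytes) from a pointer block."""
    if len(block) % SCORE_SIZE:
        raise ValueError("pointer block is not a whole number of scores")
    n = len(block)
    while n >= SCORE_SIZE and block[n - SCORE_SIZE:n] == ZERO_SCORE:
        n -= SCORE_SIZE
    return block[:n]

def pad_pointer(block: bytes, nscores: int) -> bytes:
    """Restore a pointer block to nscores entries after reading it."""
    return block + ZERO_SCORE * (nscores - len(block) // SCORE_SIZE)

stored = truncate_data(b"hello world" + bytes(1013))
```

| An all-zero data block truncates to the empty block, whose score is the |
| zero score; a pointer block holding only zero scores then truncates to |
| the empty block as well. |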
| .SS "Network Protocol |
| A Venti session begins when a |
| .I client |
| connects to the network address served by a Venti |
| .IR server ; |
| the conventional address is |
| .BI tcp! server !venti |
| (the |
| .B venti |
| port is 17034). |
| Both client and server begin by sending a version |
| string of the form |
| .BI venti- versions - comment \en \fR. |
| The |
| .I versions |
| field is a list of acceptable versions separated by |
| colons. |
| The protocol described here is version |
| .BR 02 . |
| The client is responsible for choosing a common |
| version and sending it in the |
| .B VtThello |
| message, described below. |
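| .PP |
| A sketch of the version exchange follows. |
| The function names are hypothetical, and the selection rule (take the |
| first of the client's preferred versions that the server also lists) is |
| an assumption; the text above requires only that the client choose a |
| common version. |

```python
def parse_version_string(line: str):
    """Parse 'venti-versions-comment' plus newline into (versions, comment)."""
    if not line.endswith("\n"):
        raise ValueError("version string must end with a newline")
    # The comment may itself contain '-', so split at most twice.
    prefix, versions, comment = line[:-1].split("-", 2)
    if prefix != "venti":
        raise ValueError("not a Venti version string")
    return versions.split(":"), comment

def choose_version(preferred, offered):
    """Return the first version in preferred that also appears in offered."""
    for v in preferred:
        if v in offered:
            return v
    return None

versions, comment = parse_version_string("venti-02:04-example server\n")
chosen = choose_version(["04", "02"], versions)
```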
| .PP |
| After the initial version exchange, the client transmits |
| .I requests |
| .RI ( T-messages ) |
| to the server, which subsequently returns |
| .I replies |
| .RI ( R-messages ) |
| to the client. |
| The combined act of transmitting (receiving) a request |
| of a particular type, and receiving (transmitting) its reply |
| is called a |
| .I transaction |
| of that type. |
| .PP |
| Each message consists of a sequence of bytes. |
| Two-byte fields hold unsigned integers represented |
| in big-endian order (most significant byte first). |
| Data items of variable lengths are represented by |
| a one-byte field specifying a count, |
| .IR n , |
| followed by |
| .I n |
| bytes of data. |
| Text strings are represented similarly, |
| using a two-byte count with |
| the text itself stored as a UTF-encoded sequence |
| of Unicode characters (see |
| .IR utf (7)). |
| Text strings are not |
| .SM NUL\c |
| -terminated: |
| .I n |
| counts the bytes of UTF data, which include no final |
| zero byte. |
| The |
| .SM NUL |
| character is illegal in text strings in the Venti protocol. |
| The maximum string length in Venti is 1024 bytes. |
| .PP |
| Each Venti message begins with a two-byte size field |
| specifying the length in bytes of the message, |
| not including the length field itself. |
| The next byte is the message type, one of the constants |
| in the enumeration in the include file |
| .BR <venti.h> . |
| The next byte is an identifying |
| .IR tag , |
| used to match responses to requests. |
| The remaining bytes are parameters of different sizes. |
| In the message descriptions, the number of bytes in a field |
| is given in brackets after the field name. |
| The notation |
| .IR parameter [ n ] |
| where |
| .I n |
| is not a constant represents a variable-length parameter: |
| .IR n [1] |
| followed by |
| .I n |
| bytes of data forming the |
| .IR parameter . |
| The notation |
| .IR string [ s ] |
| (using a literal |
| .I s |
| character) |
| is shorthand for |
| .IR s [2] |
| followed by |
| .I s |
| bytes of UTF-8 text. |
| The notation |
| .IR parameter [] |
| where |
| .I parameter |
| is the last field in the message represents a |
| variable-length field that comprises all remaining |
| bytes in the message. |
| .PP |
| All Venti RPC messages are prefixed with a field |
| .IR size [2] |
| giving the length of the message that follows |
| (not including the |
| .I size |
| field itself). |
| The message bodies are: |
| .ta \w'\fLVtTgoodbye 'u |
| .IP |
| .ne 2v |
| .B VtThello |
| .IR tag [1] |
| .IR version [ s ] |
| .IR uid [ s ] |
| .IR strength [1] |
| .IR crypto [ n ] |
| .IR codec [ n ] |
| .br |
| .B VtRhello |
| .IR tag [1] |
| .IR sid [ s ] |
| .IR rcrypto [1] |
| .IR rcodec [1] |
| .IP |
| .ne 2v |
| .B VtTping |
| .IR tag [1] |
| .br |
| .B VtRping |
| .IR tag [1] |
| .IP |
| .ne 2v |
| .B VtTread |
| .IR tag [1] |
| .IR score [20] |
| .IR type [1] |
| .IR pad [1] |
| .IR count [2] |
| .br |
| .B VtRread |
| .IR tag [1] |
| .IR data [] |
| .IP |
| .ne 2v |
| .B VtTwrite |
| .IR tag [1] |
| .IR type [1] |
| .IR pad [3] |
| .IR data [] |
| .br |
| .B VtRwrite |
| .IR tag [1] |
| .IR score [20] |
| .IP |
| .ne 2v |
| .B VtTsync |
| .IR tag [1] |
| .br |
| .B VtRsync |
| .IR tag [1] |
| .IP |
| .ne 2v |
| .B VtRerror |
| .IR tag [1] |
| .IR error [ s ] |
| .IP |
| .ne 2v |
| .B VtTgoodbye |
| .IR tag [1] |
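| .PP |
| The field notation and the version |
| .B 02 |
| framing can be sketched as follows. |
| The helper names are hypothetical. |
| The numeric message types are passed in as parameters, and the two values |
| used in the example are assumptions that should be checked against |
| .BR <venti.h> . |

```python
import struct

MAX_STRING = 1024  # maximum string length in bytes

def pack_string(text: str) -> bytes:
    """Encode string[s]: a two-byte big-endian count, then UTF-8 bytes."""
    raw = text.encode("utf-8")
    if len(raw) > MAX_STRING or b"\x00" in raw:
        raise ValueError("string too long or contains NUL")
    return struct.pack(">H", len(raw)) + raw

def pack_param(data: bytes) -> bytes:
    """Encode parameter[n]: a one-byte count, then n bytes of data."""
    if len(data) > 255:
        raise ValueError("variable-length parameter exceeds 255 bytes")
    return bytes([len(data)]) + data

def frame(msg_type: int, tag: int, body: bytes = b"") -> bytes:
    """Prefix type, tag, and body with size[2], which excludes itself."""
    payload = bytes([msg_type, tag]) + body
    return struct.pack(">H", len(payload)) + payload

def thello(msg_type, tag, version, uid, strength=0, crypto=b"", codec=b""):
    """Build a VtThello message from the field list above."""
    body = (pack_string(version) + pack_string(uid) + bytes([strength])
            + pack_param(crypto) + pack_param(codec))
    return frame(msg_type, tag, body)

ASSUMED_VTTPING = 2   # assumed value; confirm against <venti.h>
ASSUMED_VTTHELLO = 4  # assumed value; confirm against <venti.h>

ping = frame(ASSUMED_VTTPING, tag=7)
hello = thello(ASSUMED_VTTHELLO, 0, "02", "anonymous")
```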
| .PP |
| Each T-message has a one-byte |
| .I tag |
| field, chosen and used by the client to identify the message. |
| The server will echo the request's |
| .I tag |
| field in the reply. |
| Clients should arrange that no two outstanding |
| messages have the same tag field so that responses |
| can be distinguished. |
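| .PP |
| One way for a client to keep outstanding tags distinct is a small pool |
| of free tags, sketched below. |
| The class name |
| .I TagPool |
| is hypothetical, and the sketch assumes that all 256 one-byte values are |
| usable as tags. |

```python
class TagPool:
    """Hand out one-byte tags so no two outstanding requests share one."""

    def __init__(self):
        self.free = list(range(255, -1, -1))  # pop() hands out 0 first
        self.pending = {}                     # tag -> request awaiting a reply

    def start(self, request):
        """Reserve a tag for a request that is about to be sent."""
        if not self.free:
            raise RuntimeError("all 256 tags are outstanding")
        tag = self.free.pop()
        self.pending[tag] = request
        return tag

    def finish(self, tag):
        """Match a reply's echoed tag to its request and release the tag."""
        request = self.pending.pop(tag)
        self.free.append(tag)
        return request

pool = TagPool()
a = pool.start("read block")
b = pool.start("write block")
```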
| .PP |
| The type of an R-message will either be one greater than |
| the type of the corresponding T-message or |
| .BR VtRerror , |
| indicating that the request failed. |
| In the latter case, the |
| .I error |
| field contains a string describing the reason for failure. |
| .PP |
| Venti connections must begin with a |
| .B hello |
| transaction. |
| The |
| .B VtThello |
| message contains the protocol |
| .I version |
| that the client has chosen to use. |
| The fields |
| .IR strength , |
| .IR crypto , |
| and |
| .IR codec |
| could be used to add authentication, encryption, |
| and compression to the Venti session |
| but are currently ignored. |
| The |
| .I rcrypto |
| and |
| .I rcodec |
| fields in the |
| .B VtRhello |
| response are similarly ignored. |
| The |
| .IR uid |
| and |
| .IR sid |
| fields are intended to be the identity |
| of the client and server but, given the lack of |
| authentication, should be treated only as advisory. |
| The initial |
| .B hello |
| should be the only |
| .B hello |
| transaction during the session. |
| .PP |
| The |
| .B ping |
| message has no effect and |
| is used mainly for debugging. |
| Servers should respond immediately to pings. |
| .PP |
| The |
| .B read |
| message requests a block with the given |
| .I score |
| and |
| .IR type . |
| Use |
| .I vttodisktype |
| and |
| .I vtfromdisktype |
| (see |
| .IR venti (3)) |
| to convert between a block type enumeration value |
| .RB ( VtDataType , |
| etc.) |
| and the |
| .I type |
| used on disk and in the protocol. |
| The |
| .I count |
| field specifies the maximum expected size |
| of the block. |
| The |
| .I data |
| in the reply is the block's contents. |
| .PP |
| The |
| .B write |
| message writes a new block of the given |
| .I type |
| with contents |
| .I data |
| to the server. |
| The response includes the |
| .I score |
| to use to read the block, |
| which should be the SHA1 hash of |
| .IR data . |
| .PP |
| The Venti server may buffer written blocks in memory, |
| waiting until after responding to the |
| .B write |
| message before writing them to |
| permanent storage. |
| The server will delay the response to a |
| .B sync |
| message until after all blocks in earlier |
| .B write |
| messages have been written to permanent storage. |
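| .PP |
| The interplay of |
| .BR write , |
| .BR read , |
| and |
| .B sync |
| can be modeled with a toy in-memory server. |
| This is only an illustration of the semantics described above, not of |
| any real server's design; the class name, the keying of blocks by score |
| and type, and the error raised when a block exceeds |
| .I count |
| are all assumptions. |

```python
import hashlib

class ToyVenti:
    """In-memory model: blocks are keyed by (score, type) and never change."""

    def __init__(self):
        self.buffered = {}   # written, not yet on "permanent storage"
        self.permanent = {}  # stand-in for permanent storage

    def write(self, block_type: int, data: bytes) -> bytes:
        """Accept a block and return its score; duplicates are coalesced."""
        score = hashlib.sha1(data).digest()
        key = (score, block_type)
        if key not in self.permanent:
            self.buffered[key] = data
        return score

    def read(self, score: bytes, block_type: int, count: int) -> bytes:
        """Return the block's contents, which may still be buffered."""
        key = (score, block_type)
        if key in self.buffered:
            data = self.buffered[key]
        elif key in self.permanent:
            data = self.permanent[key]
        else:
            raise KeyError("no such block")
        if len(data) > count:
            raise ValueError("block larger than count")
        return data

    def sync(self) -> None:
        """Return only after every earlier write is on permanent storage."""
        self.permanent.update(self.buffered)
        self.buffered.clear()

server = ToyVenti()
s1 = server.write(0, b"abc")
s2 = server.write(0, b"abc")  # same contents, same score, stored once
server.sync()
```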
| .PP |
| The |
| .B goodbye |
| message ends a session. There is no |
| .BR VtRgoodbye : |
| upon receiving the |
| .BR VtTgoodbye |
| message, the server terminates the connection. |
| .PP |
| Version |
| .B 04 |
| of the Venti protocol is similar to version |
| .B 02 |
| (described above) |
| but has two changes to accommodate larger payloads. |
| First, it replaces the leading 2-byte packet size with |
| a 4-byte size. |
| Second, the |
| .I count |
| in the |
| .B VtTread |
| packet may be either 2 or 4 bytes; |
| the total packet length distinguishes the two cases. |
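| .PP |
| A receiver might distinguish the two forms of |
| .B VtTread |
| as sketched below. |
| The function name and the returned dictionary are hypothetical; the field |
| offsets follow the message layout given earlier, and the message type |
| byte is returned unchecked because its numeric values are defined in |
| .BR <venti.h> . |

```python
FIXED_LEN = 1 + 1 + 20 + 1 + 1  # msgtype, tag, score, type, pad

def parse_tread(packet: bytes, version: str = "02") -> dict:
    """Parse a complete VtTread packet, including its leading size field."""
    size_len = 4 if version == "04" else 2  # version 04 widens the size field
    if len(packet) < size_len:
        raise ValueError("packet shorter than its size field")
    size = int.from_bytes(packet[:size_len], "big")
    msg = packet[size_len:]
    if len(msg) != size:
        raise ValueError("size field does not match packet length")
    # Whatever follows the fixed fields is the count: 2 bytes, or 4 in version 04.
    count_len = len(msg) - FIXED_LEN
    if count_len not in (2, 4) or (count_len == 4 and version != "04"):
        raise ValueError("unexpected VtTread length")
    return {
        "msgtype": msg[0],
        "tag": msg[1],
        "score": msg[2:22],
        "type": msg[22],
        "count": int.from_bytes(msg[24:], "big"),
    }

# A version 04 packet carrying a 4-byte count; the type byte 12 is a placeholder.
fields = bytes([12, 5]) + bytes(20) + bytes([0, 0])
wide = (28).to_bytes(4, "big") + fields + (100000).to_bytes(4, "big")
parsed = parse_tread(wide, "04")
```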
| .SH SEE ALSO |
| .IR venti (1), |
| .IR venti (3), |
| .IR venti (8) |
| .br |
| Sean Quinlan and Sean Dorward, |
| ``Venti: a new approach to archival storage'', |
| .I "Usenix Conference on File and Storage Technologies" , |
| 2002. |