| .TH VENTI 8 |
| .SH NAME |
| venti \- archival storage server |
| .SH SYNOPSIS |
| .in +0.25i |
| .ti -0.25i |
| .B venti/venti |
| [ |
| .B -Ldrs |
| ] |
| [ |
| .B -a |
| .I address |
| ] |
| [ |
| .B -B |
| .I blockcachesize |
| ] |
| [ |
| .B -c |
| .I config |
| ] |
| [ |
| .B -C |
| .I lumpcachesize |
| ] |
| [ |
| .B -h |
| .I httpaddress |
| ] |
| [ |
| .B -I |
| .I indexcachesize |
| ] |
| [ |
| .B -W |
| .I webroot |
| ] |
| .SH DESCRIPTION |
| .I Venti |
| is a SHA1-addressed archival storage server. |
| See |
| .IR venti (7) |
| for a full introduction to the system. |
| This page documents the structure and operation of the server. |
| .PP |
| A venti server requires multiple disks or disk partitions, |
| each of which must be properly formatted before the server |
| can be run. |
| .SS Disk |
| The venti server maintains three disk structures, typically |
| stored on raw disk partitions: |
| the append-only |
| .IR "data log" , |
| which holds, in sequential order, |
| the contents of every block written to the server; |
| the |
| .IR index , |
| which helps locate a block in the data log given its score; |
| and optionally the |
| .IR "bloom filter" , |
| a concise summary of which scores are present in the index. |
| The data log is the primary storage. |
| To improve robustness, it should be stored on |
| a device that provides RAID functionality. |
| The index and the bloom filter are optimizations |
| employed to access the data log efficiently and can be rebuilt |
| if lost or damaged. |
| .PP |
| The data log is logically split into sections called |
| .IR arenas , |
| typically sized for easy offline backup |
| (e.g., 500MB). |
| A data log may comprise many disks, each storing |
| one or more arenas. |
| Such disks are called |
| .IR "arena partitions" . |
| Arena partitions are filled in the order given in the configuration. |
| .PP |
| The index is logically split into block-sized pieces called |
| .IR buckets , |
| each of which is responsible for a particular range of scores. |
| An index may be split across many disks, each storing many buckets. |
| Such disks are called |
| .IR "index sections" . |
| .PP |
| The index must be sized so that no bucket is full. |
| When a bucket fills, the server must be shut down and |
| the index made larger. |
| Since scores appear random, each bucket will contain |
| approximately the same number of entries. |
| Index entries are 40 bytes long. Assuming that a typical block |
| being written to the server is 8192 bytes and compresses to 4096 |
| bytes, the active index is expected to be about 1% of |
| the active data log. |
| Storing smaller blocks increases the relative index footprint; |
| storing larger blocks decreases it. |
| To allow for variation in both block size and the random distribution |
| of scores to buckets, the suggested index size is 5% of |
| the active data log. |
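| .PP |
| The sizing arithmetic above can be sketched in Python. |
| This is only an illustration: the function names are hypothetical |
| and nothing here is part of venti or its formatting tools. |
| .IP |
| .EX |
```python
INDEX_ENTRY_BYTES = 40  # index entry size stated in the text above


def index_fraction(compressed_block_bytes):
    """Expected index size as a fraction of the active data log."""
    return INDEX_ENTRY_BYTES / compressed_block_bytes


def suggested_index_bytes(data_log_bytes, percent=5):
    """Suggested index size: 5% of the data log, per the text above."""
    return data_log_bytes * percent // 100


# 8192-byte blocks that compress to 4096 bytes: roughly 1% of the log.
print(round(index_fraction(4096) * 100, 2), "percent")
# A 100GB data log (10**9 bytes per GB assumed) suggests a 5GB index.
print(suggested_index_bytes(100 * 10**9))
```
| .EE |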
| .PP |
| The (optional) bloom filter is a large bitmap that is stored on disk but |
| also kept completely in memory while the venti server runs. |
| It helps the venti server efficiently detect scores that are |
| .I not |
| already stored in the index. |
| The bloom filter starts out zeroed. |
| Each score recorded in the bloom filter is hashed to choose |
| .I nhash |
| bits to set in the bloom filter. |
| A score is definitely not stored in the index if any of its |
| .I nhash |
| bits are not set. |
| The bloom filter thus has two parameters: |
| .I nhash |
| (maximum 32) |
| and the total bitmap size |
| (maximum 512MB, 2\s-2\u32\d\s+2 bits). |
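| .PP |
| The mechanism can be illustrated with a toy bloom filter in Python. |
| This is a minimal sketch, not venti's implementation: |
| the class name and the way bit positions are derived from a score |
| (SHA1 of the score plus a counter byte) are assumptions made for the example. |
| .IP |
| .EX |
```python
import hashlib


class ToyBloom:
    """Toy bloom filter; the bit-selection scheme is an assumption."""

    def __init__(self, nbits, nhash):
        self.nbits = nbits
        self.nhash = nhash
        self.bits = bytearray((nbits + 7) // 8)  # starts out zeroed

    def _positions(self, score):
        # Derive nhash bit positions by hashing the score with a counter.
        for i in range(self.nhash):
            digest = hashlib.sha1(score + bytes([i])).digest()
            yield int.from_bytes(digest[:8], "big") % self.nbits

    def add(self, score):
        for pos in self._positions(score):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, score):
        # False: definitely absent.  True: the index must still be checked.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(score))


bloom = ToyBloom(nbits=1 << 20, nhash=16)
score = hashlib.sha1(b"hello").digest()
bloom.add(score)
print(bloom.might_contain(score))
```
| .EE |
| .PP |
| A lookup that returns false lets the server skip the index entirely; |
| a lookup that returns true may still be a false positive. |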
| .PP |
| The bloom filter should be sized so that |
| .I nhash |
| \(mu |
| .I nblock |
| \(<= |
| 0.7 \(mu |
| .IR b , |
| where |
| .I nblock |
| is the expected number of blocks stored on the server |
| and |
| .I b |
| is the bitmap size in bits. |
| The false positive rate of the bloom filter when sized |
| this way is approximately 2\s-2\u\-\fInhash\fR\d\s+2. |
| Values of |
| .I nhash |
| less than 10 are not very useful; |
| values greater than 24 are probably a waste of memory. |
| .I Fmtbloom |
| (see |
| .IR venti-fmt (8)) |
| can be given either |
| .I nhash |
| or |
| .IR nblock ; |
| if given |
| .IR nblock , |
| it will derive an appropriate |
| .IR nhash . |
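| .PP |
| The sizing rule can be worked through in Python. |
| The sketch below picks the largest |
| .I nhash |
| that satisfies the inequality above for a given bitmap size; |
| it is an assumption about one reasonable choice, not a description of how |
| .I fmtbloom |
| derives its value. |
| Under the standard bloom filter analysis, a filter sized this way |
| has a false positive rate of about 2\s-2\u\-\fInhash\fR\d\s+2. |
| .IP |
| .EX |
```python
def max_nhash(nblock, bitmap_bits, limit=32):
    """Largest nhash with nhash * nblock <= 0.7 * bitmap_bits, capped at limit."""
    return min(limit, (7 * bitmap_bits) // (10 * nblock))


def false_positive_rate(nhash):
    """Approximate false positive rate for a filter sized by the rule above."""
    return 2.0 ** -nhash


# A maximum-size bitmap (2**32 bits, 512MB) and 200 million expected blocks.
nhash = max_nhash(200_000_000, 2**32)
print(nhash, false_positive_rate(nhash))
```
| .EE |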
| .SS Memory |
| Venti can make effective use of large amounts of memory |
| for various caches. |
| .PP |
| The |
| .I "lump cache |
| holds recently-accessed venti data blocks, which the server refers to as |
| .IR lumps . |
| The lump cache should be at least 1MB but can profitably be much larger. |
| The lump cache can be thought of as the level-1 cache: |
| read requests handled by the lump cache can |
| be served instantly. |
| .PP |
| The |
| .I "block cache |
| holds recently-accessed |
| .I disk |
| blocks from the arena partitions. |
| The block cache needs to be able to simultaneously hold two blocks |
| from each arena plus four blocks for the currently-filling arena. |
| The block cache can be thought of as the level-2 cache: |
| read requests handled by the block cache are slower than those |
| handled by the lump cache, since the lump data must be extracted |
| from the raw disk blocks and possibly decompressed, but no |
| disk accesses are necessary. |
| .PP |
| The |
| .I "index cache |
| holds recently-accessed or prefetched |
| index entries. |
| The index cache needs to be able to hold index entries |
| for three or four arenas, at least, in order for prefetching |
| to work properly. Each index entry is 50 bytes. |
| Assuming 500MB arenas of |
| 128,000 blocks that are 4096 bytes each after compression, |
| the minimum index cache size is about 6MB. |
| The index cache can be thought of as the level-3 cache: |
| read requests handled by the index cache must still go |
| to disk to fetch the arena blocks, but the costly random |
| access to the index is avoided. |
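| .PP |
| The per-arena arithmetic above can be reproduced as follows. |
| The sketch assumes that 500MB means 500 \(mu 2\s-2\u20\d\s+2 bytes, |
| which yields exactly the 128,000 blocks quoted; the function names are hypothetical. |
| .IP |
| .EX |
```python
INDEX_CACHE_ENTRY_BYTES = 50  # index cache entry size stated above


def arena_entries(arena_bytes, compressed_block_bytes):
    """Number of blocks (and so index entries) that fit in one arena."""
    return arena_bytes // compressed_block_bytes


def index_cache_bytes(entries):
    """Memory needed to hold that many index cache entries."""
    return entries * INDEX_CACHE_ENTRY_BYTES


entries = arena_entries(500 * 1024 * 1024, 4096)
print(entries, index_cache_bytes(entries))  # prints: 128000 6400000
```
| .EE |
| .PP |
| That is roughly 6MB of entries for one arena; |
| holding entries for several arenas scales the figure linearly. |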
| .PP |
| The size of the index cache determines how long venti |
| can sustain its `burst' write throughput, during which time |
| the only disk accesses on the critical path |
| are sequential writes to the arena partitions. |
| For example, if you want to be able to sustain 10MB/s |
| for an hour, you need enough index cache to hold entries |
| for 36GB of blocks. Assuming 8192-byte blocks, |
| you need room for almost five million index entries. |
| Since index entries are 50 bytes each, you need 250MB |
| of index cache. |
| If the background index update process can make a single |
| pass through the index in an hour, which is possible, |
| then you can sustain the 10MB/s indefinitely (at least until |
| the arenas are all filled). |
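| .PP |
| The burst calculation can be checked with a short Python sketch. |
| It assumes 10MB/s means 10\s-2\u7\d\s+2 bytes per second; |
| the function name is hypothetical. |
| .IP |
| .EX |
```python
def burst_index_cache(rate_bytes_per_sec, seconds, block_bytes, entry_bytes=50):
    """Return (entries, cache_bytes) needed to absorb a write burst."""
    total_bytes = rate_bytes_per_sec * seconds
    entries = -(-total_bytes // block_bytes)  # ceiling division
    return entries, entries * entry_bytes


# 10MB/s for one hour with 8192-byte blocks.
entries, cache_bytes = burst_index_cache(10 * 10**6, 3600, 8192)
print(entries, cache_bytes)  # prints: 4394532 219726600
```
| .EE |
| .PP |
| The exact result is about 4.4 million entries and 220MB; |
| the text above rounds up to five million entries and 250MB. |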
| .PP |
| The |
| .I "bloom filter |
| requires memory equal to its size on disk, |
| as discussed above. |
| .PP |
| A reasonable starting allocation is to |
| divide memory equally (in thirds) between |
| the bloom filter, the index cache, and the lump and block caches; |
| the third of memory allocated to the lump and block caches |
| should be split unevenly, with more (say, two thirds) |
| going to the block cache. |
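| .PP |
| As an illustration of this starting allocation, the Python sketch below |
| splits a memory budget and prints matching |
| .BR mem , |
| .BR bcmem , |
| and |
| .B icmem |
| values. |
| The function name is hypothetical, and the bloom filter share is only a target: |
| its actual size is fixed when the filter is formatted. |
| .IP |
| .EX |
```python
def split_memory(total_bytes):
    """Split a memory budget in thirds, then split the cache third 1:2."""
    third = total_bytes // 3
    bcmem = third * 2 // 3   # block cache gets two thirds of the cache share
    mem = third - bcmem      # lump cache gets the remainder
    return {"bloom": third, "icmem": third, "mem": mem, "bcmem": bcmem}


MB = 2**20  # assumed meaning of the M suffix
plan = split_memory(900 * MB)
for key in ("mem", "bcmem", "icmem"):
    print(key, str(plan[key] // MB) + "M")
```
| .EE |
| .PP |
| For a 900MB budget this prints |
| .BR "mem 100M" , |
| .BR "bcmem 200M" , |
| and |
| .BR "icmem 300M" , |
| leaving 300MB for the bloom filter. |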
| .SS Network |
| The venti server announces two network services, one |
| (conventionally TCP port |
| .BR venti , |
| 17034) serving |
| the venti protocol as described in |
| .IR venti (7), |
| and one serving HTTP |
| (conventionally TCP port |
| .BR http , |
| 80). |
| .PP |
| The venti web server provides the following |
| URLs for accessing status information: |
| .TF "\fL/storage" |
| .PD |
| .TP |
| .B /index |
| A summary of the usage of the arenas and index sections. |
| .TP |
| .B /xindex |
| An XML version of |
| .BR /index . |
| .TP |
| .B /storage |
| Brief storage totals. |
| .TP |
| .B /set |
| Display the values of all variables. |
| Variables are: |
| .BR compress , |
| whether or not to compress blocks |
| (for debugging); |
| .BR logging , |
| whether to write entries to the debugging logs; |
| .BR stats , |
| whether to collect run-time statistics; |
| .BR icachesleeptime , |
| the time in milliseconds between successive updates |
| of megabytes of the index cache; |
| .BR arenasumsleeptime , |
| the time in milliseconds between reads while |
| checksumming an arena in the background. |
| The two sleep times should be (but are not) managed by venti; |
| they exist to provide more experience with their effects. |
| The other variables exist only for debugging and |
| performance measurement. |
| .TP |
| .BI /set?name= variable |
| Show the current setting of |
| .IR variable . |
| .TP |
| .BI /set?name= variable &value= value |
| Set |
| .I variable |
| to |
| .IR value . |
| .TP |
| .BI /graph/ name / param / param / \fR... |
| A PNG image graphing the named run-time statistic over time. |
| The details of names and parameters are undocumented; |
| see |
| .B httpd.c |
| in the venti sources. |
| .TP |
| .B /log |
| A list of all debugging logs present in the server's memory. |
| .TP |
| .BI /log/ name |
| The contents of the debugging log with the given |
| .IR name . |
| .TP |
| .B /flushicache |
| Force venti to begin flushing the index cache to disk. |
| The request response will not be sent until the flush |
| has completed. |
| .TP |
| .B /flushdcache |
| Force venti to begin flushing the arena block cache to disk. |
| The request response will not be sent until the flush |
| has completed. |
| .PD |
| .PP |
| Requests for other files are served by consulting a |
| directory named in the configuration file |
| (see |
| .B webroot |
| below). |
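| .PP |
| The three forms of the |
| .B /set |
| URL described above can be assembled as in the following Python sketch. |
| The helper name and the base address |
| .B http://localhost |
| are placeholders; substitute the address on which the server announces HTTP. |
| .IP |
| .EX |
```python
from urllib.parse import urlencode


def set_url(base, name=None, value=None):
    """Build a /set URL: list all, show one variable, or set one variable."""
    if name is None:
        return base + "/set"
    query = {"name": name}
    if value is not None:
        query["value"] = str(value)
    return base + "/set?" + urlencode(query)


base = "http://localhost"  # placeholder address
print(set_url(base))
print(set_url(base, "icachesleeptime"))
print(set_url(base, "icachesleeptime", 1000))
```
| .EE |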
| .SS Configuration File |
| A venti configuration file |
| enumerates the various index sections and |
| arenas that constitute a venti system. |
| The components are indicated by the name of the file, typically |
| a disk partition, in which they reside. The configuration |
| file is the only place where file names are used. Internally, |
| venti uses the names assigned when the components were formatted |
| with |
| .I fmtarenas |
| or |
| .I fmtisect |
| (see |
| .IR venti-fmt (8)). |
| In particular, only the configuration needs to be |
| changed if a component is moved to a different file. |
| .PP |
| The configuration file consists of lines in the form described below. |
| Lines starting with |
| .B # |
| are comments. |
| .TF "\fLindex\fI name " |
| .PD |
| .TP |
| .BI index " name |
| Names the index for the system. |
| .TP |
| .BI arenas " file |
| .I File |
| is an arena partition, formatted using |
| .IR fmtarenas . |
| .TP |
| .BI isect " file |
| .I File |
| is an index section, formatted using |
| .IR fmtisect . |
| .TP |
| .BI bloom " file |
| .I File |
| is a bloom filter, formatted using |
| .IR fmtbloom . |
| .PD |
| .PP |
| After formatting a venti system using |
| .IR fmtindex , |
| the order of arenas and index sections should not be changed. |
| Additional arenas can be appended to the configuration; |
| run |
| .I fmtindex |
| with the |
| .B -a |
| flag to update the index. |
| .PP |
| The configuration file also holds configuration parameters |
| for the venti server itself. |
| These are: |
| .TF "\fLhttpaddr\fI netaddr " |
| .TP |
| .BI mem " size |
| lump cache size |
| .TP |
| .BI bcmem " size |
| block cache size |
| .TP |
| .BI icmem " size |
| index cache size |
| .TP |
| .BI addr " netaddr |
| network address to announce venti service |
| (default |
| .BR tcp!*!venti ) |
| .TP |
| .BI httpaddr " netaddr |
| network address to announce HTTP service |
| (default |
| .BR tcp!*!http ) |
| .TP |
| .B queuewrites |
| queue writes in memory |
| (default is not to queue) |
| .TP |
| .BI webroot " dir |
| directory tree containing files for |
| .IR venti 's |
| internal HTTP server to consult for unrecognized URLs |
| .PD |
| .PP |
| The units for the various cache sizes above can be specified by appending a |
| .LR k , |
| .LR m , |
| or |
| .LR g |
| (case-insensitive) |
| to indicate kilobytes, megabytes, or gigabytes respectively. |
| .PP |
| The |
| .I file |
| name in the configuration lines above can be of the form |
| .IB file : lo - hi |
| to specify a range of the file. |
| .I Lo |
| and |
| .I hi |
| are specified in bytes but can have the usual |
| .BR k , |
| .BR m , |
| or |
| .B g |
| suffixes. |
| Either |
| .I lo |
| or |
| .I hi |
| may be omitted. |
| This notation eliminates the need to |
| partition raw disks on non-Plan 9 systems. |
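| .PP |
| To make the range notation concrete, here is a minimal Python sketch of a parser for it. |
| It is not venti's parser: the function names are hypothetical, |
| and the suffixes are assumed to denote powers of 1024. |
| .IP |
| .EX |
```python
UNITS = {"k": 1024, "m": 1024**2, "g": 1024**3}  # assumed powers of 1024


def parse_size(text):
    """Parse a byte count with an optional case-insensitive k, m, or g suffix."""
    text = text.strip().lower()
    if text and text[-1] in UNITS:
        return int(text[:-1]) * UNITS[text[-1]]
    return int(text)


def parse_range(spec):
    """Return (file, lo, hi); lo or hi is None when omitted."""
    name, sep, rng = spec.rpartition(":")
    if not sep or "-" not in rng or "/" in rng:
        return spec, None, None  # plain file name, no range given
    lo, _, hi = rng.partition("-")
    return (name,
            parse_size(lo) if lo else None,
            parse_size(hi) if hi else None)


print(parse_range("/dev/sda:1g-2g"))
print(parse_range("/dev/sda:512m-"))
print(parse_range("/tmp/disks/arenas"))
```
| .EE |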
| .SS Command Line |
| Many of the options to Venti duplicate parameters that |
| can be specified in the configuration file. |
| The command line options override those found in a |
| configuration file. |
| Additional options are: |
| .TF "\fL-c\fI config" |
| .PD |
| .TP |
| .BI -c " config |
| The server configuration file |
| (default |
| .BR venti.conf ) |
| .TP |
| .B -d |
| Produce various debugging information on standard error. |
| Implies |
| .BR -s . |
| .TP |
| .B -L |
| Enable logging. By default all logging is disabled. |
| Logging slows server operation considerably. |
| .TP |
| .B -r |
| Allow only read access to the venti data. |
| .TP |
| .B -s |
| Do not run in the background. |
| Normally, |
| the foreground process will exit once the Venti server |
| is initialized and ready for connections. |
| .PD |
| .SH EXAMPLE |
| A simple configuration: |
| .IP |
| .EX |
| % cat venti.conf |
| index main |
| isect /tmp/disks/isect0 |
| isect /tmp/disks/isect1 |
| arenas /tmp/disks/arenas |
| bloom /tmp/disks/bloom |
| mem 10M |
| bcmem 20M |
| icmem 30M |
| % |
| .EE |
| .PP |
| Format the index sections, the arena partition, |
| the bloom filter, and |
| finally the main index: |
| .IP |
| .EX |
| % venti/fmtisect isect0. /tmp/disks/isect0 |
| % venti/fmtisect isect1. /tmp/disks/isect1 |
| % venti/fmtarenas arenas0. /tmp/disks/arenas & |
| % venti/fmtbloom /tmp/disks/bloom & |
| % wait |
| % venti/fmtindex venti.conf |
| % |
| .EE |
| .PP |
| Start the server and check the storage statistics: |
| .IP |
| .EX |
| % venti/venti |
| % hget http://$sysname/storage |
| .EE |
| .SH SOURCE |
| .B \*9/src/cmd/venti/srv |
| .SH "SEE ALSO" |
| .IR venti (1), |
| .IR venti (3), |
| .IR venti (7), |
| .IR venti-backup (8), |
| .IR venti-fmt (8) |
| .br |
| Sean Quinlan and Sean Dorward, |
| ``Venti: a new approach to archival storage'', |
| .IR "Usenix Conference on File and Storage Technologies" , |
| 2002. |
| .SH BUGS |
| Setting up a venti server is too complicated. |
| .PP |
| Venti should not require the user to decide how to |
| partition its memory usage. |
| .PP |
| Users of shells other than |
| .IR rc (1) |
| will not be able to use the program names shown. |
| One solution is to define |
| .B "V=$PLAN9/bin/venti" |
| and then substitute |
| .B $V/ |
| for |
| .B venti/ |
| in the paths above. |