Because of the simplicity of information Atropos exposes to applications, the interface to Atropos can be readily implemented with small extensions to the commands already defined in the SCSI protocol. The parameters p and w could be exposed in a new mode page returned by the MODE SENSE SCSI command. To ensure that Atropos executes all requests to non-contiguous VLBNs for the other-major access together, an application can link the appropriate requests. To do so, the READ or WRITE commands for semi-sequential access are issued with the Link bit set.

3.3.4 Implementation details

Our Atropos logical volume manager implementation is a stand-alone process that accepts I/O requests via a socket. It issues individual disk I/Os directly to the attached SCSI disks using the Linux raw SCSI device /dev/sg. With an SMP host, the process can run on a separate CPU of the same host, to minimize the effect on the execution of the main application.

An application using Atropos is linked with a stub library providing API functions for reading and writing. The library uses shared memory to avoid data copies and communicates through the socket with the Atropos LVM process. The Atropos LVM organization is specified by a configuration file, which functions in lieu of a format command. The file lists the number of disks, p, the desired block size, b, and the list of disks to be used.

For convenience, the interface stub also includes three functions. The function get_boundaries(LBN) returns the stripe unit boundaries between which the given LBN falls. Hence, these boundaries form a collection of w contiguous LBNs for constructing efficient I/Os. The get_rectangle(LBN) function returns the wp contiguous LBNs in a single row across all disks. These functions are just convenient wrappers that calculate the proper LBNs from the w and p parameters. Finally, the stub interface also includes a batch() function to explicitly group READ and WRITE commands (e.g., for semi-sequential access).
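To make the arithmetic behind the two wrapper functions concrete, the following is a minimal sketch rather than the actual stub library code. It assumes that stripe units begin at LBNs that are multiples of w (counting from LBN 0), that a row spans w·p consecutive LBNs, and that each function returns an inclusive (first, last) pair. The explicit w and p arguments and the return format are assumptions made for illustration.

```python
def get_boundaries(lbn: int, w: int) -> tuple[int, int]:
    """Return the inclusive (first, last) LBNs of the stripe unit holding lbn.

    Assumption: stripe units are w LBNs long and start at multiples of w.
    """
    if lbn < 0 or w <= 0:
        raise ValueError("lbn must be >= 0 and w must be > 0")
    first = (lbn // w) * w
    return first, first + w - 1


def get_rectangle(lbn: int, w: int, p: int) -> tuple[int, int]:
    """Return the inclusive (first, last) LBNs of the row holding lbn.

    Assumption: a row is w * p contiguous LBNs, one stripe unit per disk.
    """
    if lbn < 0 or w <= 0 or p <= 0:
        raise ValueError("lbn must be >= 0; w and p must be > 0")
    row_len = w * p
    first = (lbn // row_len) * row_len
    return first, first + row_len - 1
```

Under these assumptions, with w = 4 and p = 4, LBN 10 falls in the stripe unit spanning LBNs 8 to 11 and in the row spanning LBNs 0 to 15.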
With no outstanding requests in the queue (i.e., the disk is idle), current SCSI disks will immediately schedule the first received request of a batch, even though it may not be the one with the smallest rotational latency. This diminishes the effectiveness of semi-sequential access. To overcome this problem, our Atropos implementation “pre-schedules” the batch of requests by sending first the request that will incur the smallest rotational latency. It uses known techniques for SPTF scheduling outside of disk firmware [14]. With the help of a detailed and validated model of the disk mechanics [2, 21], the disk head position is deduced from the location and time of the last-completed request. If disks waited for all requests of a batch before making a scheduling decision, this pre-scheduling would not be necessary.

Our implementation of the Atropos logical volume manager is about 2000 lines of C++ code and includes implementations of RAID levels 0 and 1. Another 600 lines of C code implement methods for automatically extracting track boundaries and head switch time [22, 26].

4 Efficient access in database systems

Efficient access to database tables in both dimensions can significantly improve performance of a variety of queries doing selective table scans. These queries can request (i) a subset of columns (restricting access along the primary dimension, if the order is column-major), which is prevalent in decision support workloads (TPC-H), (ii) a subset of rows (restricting access along the secondary dimension), which is prevalent in online transaction processing (TPC-C), or (iii) a combination of both.

A companion project [24] to Atropos extends the Shore database storage manager [3] to support a page layout that takes advantage of Atropos’s efficient accesses in both dimensions. The page layout is based on a cache-efficient page layout, called PAX [1], which extends the NSM page layout to group values of a single attribute into units called “minipages”.
Minipages in PAX exist to take advantage of CPU cache prefetchers to minimize cache misses during single-attribute memory accesses. We use minipages as well, but they are aligned and sized to fit into one or more 512-byte LBNs, depending on the relative sizes of the attributes within a single page.

The mapping of 8 KB pages onto the quadrangles of the Atropos logical volume is depicted in Figure 6. A single page contains 16 equally-sized attributes, labeled A1–A16, where each attribute is stored in a separate minipage that maps to a single VLBN. Accessing a single page is thus done by issuing 16 batched requests to every 16th (or more generally, wp-th) VLBN. Internally, the VLBNs comprising this page are mapped diagonally to the blocks marked with the dashed arrow. Hence, 4 semi-sequential accesses proceeding in parallel can fetch the entire page (i.e., row-major order access).

Individual minipages are mapped across sequential runs of VLBNs. For example, to fetch attribute A1 for records 0–399, the database storage manager can issue one efficient sequential I/O to fetch the appropriate minipages. Atropos breaks this I/O into four efficient, track-based disk accesses proceeding in parallel. The database storage manager then reassembles these minipages into appropriate 8 KB pages [24].
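The two access patterns described above can be illustrated with a short sketch of the VLBN arithmetic. This is not the storage manager's code: it assumes a stride of wp = 16 VLBNs, that the minipages of one attribute for 16 consecutive pages occupy one contiguous run of 16 VLBNs, and that the runs for A1–A16 follow one another. The function names and parameters are hypothetical.

```python
def page_vlbns(first_vlbn: int, n_attrs: int = 16, stride: int = 16) -> list[int]:
    """VLBNs of one page's minipages: every stride-th VLBN (row-major access).

    Assumption: minipage a of the page lives at first_vlbn + a * stride.
    """
    return [first_vlbn + a * stride for a in range(n_attrs)]


def attribute_run(attr: int, stride: int = 16, base_vlbn: int = 0) -> tuple[int, int]:
    """Inclusive VLBN range holding one attribute's minipages (column-major access).

    Assumption: attribute runs are laid out back to back, each stride VLBNs long.
    """
    first = base_vlbn + attr * stride
    return first, first + stride - 1


# A page whose A1 minipage sits at VLBN 3 is fetched as 16 batched requests.
batch = page_vlbns(3)

# All A1 minipages in this group are fetched with one sequential I/O.
a1_first, a1_last = attribute_run(0)
```

Under these assumptions, the batch for a page touches VLBNs 3, 19, 35, and so on up to 243, while the minipages of A1 occupy VLBNs 0 to 15 and can be read with a single sequential request.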