A Disk RAID array Volume Manager for Disk

while optionally exploiting the exposed information to
orchestrate its data layouts and access patterns.
7 Conclusions
The Atropos disk array LVMemploys a new data organization
that allows applications to take advantage of features
built into modern disks. Striping data in track-sized
units lets them take advantage of zero-latency access to
achieve efficient access for sequential access patterns.
Taking advantage of request scheduling and knowing
exact head switch times enables semi-sequential access,
which results in efficient access to diagonal sets of noncontiguous
blocks.
By exploiting disk characteristics, a new data organization,
and exposing high-level constructs about this
organization, Atropos can deliver efficient accesses for
database systems, resulting in up to 4 speed-ups for
decision support workloads, without compromising performance
of OLTP workloads.
Acknowledgements
We thank the members and companies of the PDL
Consortium (including EMC, Hewlett-Packard, Hitachi,
IBM, Intel, LSI Logic, Microsoft, Network Appliance,
Oracle, Panasas, Seagate, Sun, and Veritas) for their
interest, insights, feedback, and support. This work
is funded in part by NSF grants CCR-0113660, IIS-
0133686, and CCR-0205544, and by an IBM faculty
partnership award.
References
[1] A. Ailamaki, D. J. DeWitt, M. D. Hill, and M. Skounakis. Weaving
relations for cache performance. International Conference
on Very Large Databases (Rome, Italy, 11–14 September 2001),
pages 169–180. Morgan Kaufmann Publishing, Inc., 2001.
[2] J. S. Bucy and G. R. Ganger. The DiskSim simulation environment
version 3.0 reference manual. Technical Report CMU–
CS–03–102. Department of Computer Science Carnegie-Mellon
University, Pittsburgh, PA, January 2003.
[3] M. J. Carey et al. Shoring up persistent applications. ACM SIGMOD
International Conference on Management of Data (Minneapolis,
MN, 24–27 May 1994). Published as SIGMOD Record,
23(2):383–394, 1994.
[4] P. M. Chen and D. A. Patterson. Maximizing performance in a
striped disk array. ACM International Symposium on Computer
Architecture (Seattle, WA), pages 322–331, June 1990.
[5] G. P. Copeland and S. Khoshafian. A decomposition storage
model. ACM SIGMOD International Conference on Management
of Data (Austin, TX, 28–31 May 1985), pages 268–279.
ACM Press, 1985.
[6] IBM DB2 Universal Database Administration Guide: Implementation,
Document number SC09-2944-005.
[7] T. E. Denehy, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau.
Bridging the information gap in storage protocol stacks. Summer
USENIX Technical Conference (Monterey, CA, 10–15 June
2002), pages 177–190, 2002.
[8] G. R. Ganger. Blurring the line between OSs and storage devices.
Technical report CMU–CS–01–166. Carnegie Mellon
University, December 2001.
[9] G. G. Gorbatenko and D. J. Lilja. Performance of twodimensional
data models for I/O limited non-numeric applications.
Laboratory for Advanced Research in Computing Technology
and Compilers Technical report ARCTiC–02–04. University
of Minnesota, February 2002.
[10] J. L. Hennessy and D. A. Patterson. Computer Architecture: A
Quantitative Approach, 3rd ed. Morgan Kaufmann Publishers,
Inc., San Francisco, CA, 2003.
[11] R. Y. Hou and Y. N. Patt. Track piggybacking: an improved rebuild
algorithm for RAID5 disk arrays. International Conference
on Parallel Processing (Urbana, Illinois), 14–18 August 1995.
[12] D. M. Jacobson and J. Wilkes. Disk scheduling algorithms
based on rotational position. Technical report HPL–CSP–91–
7. Hewlett-Packard Laboratories, Palo Alto, CA, 24 February
1991, revised 1 March 1991.
[13] M. Livny, S. Khoshafian, and H. Boral. Multi-disk management
algorithms. ACM SIGMETRICS Conference on Measurement
and Modeling of Computer Systems, pages 69–77, 1987.
[14] C. R. Lumb, J. Schindler, and G. R. Ganger. Freeblock scheduling
outside of disk firmware. Conference on File and Storage
Technologies (Monterey, CA, 28–30 January 2002), pages 275–
288. USENIX Association, 2002.
[15] R. Ramamurthy, D. J. DeWitt, and Q. Su. A case for fractured
mirrors. International Conference on Very Large Databases
(Hong Kong, China, 20–23 August 2002), pages 430–441. Morgan
Kaufmann Publishers, Inc., 2002.
[16] A. L. N. Reddy and P. Banerjee. A study of parallel disk organizations.
Computer Architecture News, 17(5):40–47, September
1989.
[17] M. Rosenblum and J. K. Ousterhout. The design and implementation
of a log-structured file system. ACMTransactions on Computer
Systems, 10(1):26–52. ACM Press, February 1992.
[18] K. Salem and H. Garcia-Molina. Disk striping. International
Conference on Data Engineering (Los Angeles, CA), pages 336–
342. IEEE, Catalog number 86CH2261-6, February 1986.
[19] J. Schindler. Matching application access patterns to storage
device characteristics. PhD thesis. Carnegie Mellon University,
2004.
[20] J. Schindler, A. Ailamaki, and G. R. Ganger. Lachesis: robust
database storage management based on device-specific performance
characteristics. International Conference on Very Large
Databases (Berlin, Germany, 9–12 September 2003). Morgan
Kaufmann Publishing, Inc., 2003.
[21] J. Schindler and G. R. Ganger. Automated disk drive characterization.
Technical report CMU–CS–99–176. Carnegie-Mellon
University, Pittsburgh, PA, December 1999.
[22] J. Schindler, J. L. Griffin, C. R. Lumb, and G. R. Ganger. Trackaligned
extents: matching access patterns to disk drive characteristics.
Conference on File and Storage Technologies (Monterey,
CA, 28–30 January 2002), pages 259–274. USENIX Association,
2002.
[23] M. Seltzer, P. Chen, and J. Ousterhout. Disk scheduling revisited.
Winter USENIX Technical Conference (Washington, DC, 22–26
January 1990), pages 313–323, 1990.
[24] M. Shao, J. Schindler, S. W. Schlosser, A. Ailamaki, and G. R.
Ganger. Clotho: decoupling memory page layout from storage
organization. Technical report CMU–PDL–04–102. Carnegie-
Mellon University, Pittsburgh, PA, February 2004.
[25] J. Wilkes, R. Golding, C. Staelin, and T. Sullivan. The HP AutoRAID
hierarchical storage system. ACMTransactions on Computer
Systems, 14(1):108–136, February 1996.
[26] B. L. Worthington, G. R. Ganger, Y. N. Patt, and J. Wilkes. Online
extraction of SCSI disk drive parameters. ACM SIGMETRICS
Conference on Measurement and Modeling of Computer
Systems (Ottawa, Canada), pages 146–156, May 1995.
A Access Efficiency Calculations
Let T(N;K) be the time it takes to service a request of
K sectors that fit onto a single track of a disk with N
sectors per track (i.e., track-aligned access). Ignoring
seek, and assuming no zero-latency access, this time can
be expressed as
Tnzl(N;K) =
N ��1
2N
+
K
N
where the first term is the average rotational latency, and
the second term is the media access time. For disks with
zero-latency access, the first term is not constant; rotational
latency decreases with increasing K. Thus,
Tzl(N;K) =
(N ��K+1)(N+K)
2N2 +
K ��1
N
These expressions are derived elsewhere [19].
The efficiency of track-based access is the ratio between
the raw one revolution time, Trev, and the time it
takes to read S = kN sectors for some large k. Hence,
En =
kTrev
Tn(N;N)+(k��1)(Ths+Trev)
kTrev
k (Ths+Trev)
where Tn(N;N) is the time to read data on the first track,
and (k��1)(Ths+Trev) is the time spent in head switches
and accessing the remaining tracks. In the limit, the access
efficiency is
En(N;H) =1��
H
N
(6)
which is the maximal streaming efficiency of a disk.
The maximal efficiency of semi-sequential quadrangle
access is simply
Eq(N;H) =
Trev
Tq(N;S)
=
Trev
Tzl (N;db+(d��1)H)
(7)
with d and b set accordingly.
B Relaxing the one-revolution constraint
Suppose that semi-sequential access to d blocks, each of
size b, from a single quadrangle takes more than one revolution.
Then the inequality in Equation 1 will be larger
than 1. With probability 1=N, a seek will finish with
disk heads positioned exactly at the beginning of the b
sectors mapped to the first track (the upper left corner of
the quadrangle in Figure 12). In this case, the disk will
access all db sectors with maximal efficiency (only incurring
head switch of H sectors for every b-sector read).
However, with probability 1 ��1=N, the disk heads
will land somewhere “in the middle” of the b sectors after
a seek, as illustrated by the arrow in Figure 12. Then,
the access will incur a small rotational latency to access
the beginning of the nearest b sectors, which are, say, on
the k-th track. After this initial rotational latency, which
is, on average, equal to (b��1)=2N, the (d��k)b sectors
mapped onto (d ��k) tracks can be read with maximal
efficiency of the semi-sequential quadrangle access.
L
b b b b
b
N-(K mod N)
Collapsing quadrangle with d = 6 into a request of size db+(d-1)H
b
b
b
b
b
N
d
Physical layout of a quadrangle across disk tracks of size N
H
H H H H
H
head position after seek
Figure 12: An alternative representation of quadrangle access.
To read the remaining k tracks, the disk heads will
need at be positioned to the beginning of the b sectors
on the first track. This will incur a small seek and additional
rotational latency of L=N. Hence, the resulting
efficiency is much lower than when the one-revolution
constraint holds, which avoids this rotational latency.
We can express the total response time for quadrangle
access without the one-revolution constraint as
Tq(N;S) =
b��1
2N
+
K
N
+Plat
L
N
(8)
where Plat = (N ��H ��b��1)=N is the probability of incurring
the additional rotational latency after reading k
out of d tracks, K = db��(d ��1)H is the effective request
size, L = N ��(K mod N), and S = db is the original
request size. To understand this equation, it may be
helpful to refer to the bottom portion of Figure 12.
The efficiencies of the quadrangle accesses with and
without the one-revolution constraint are approximately
the same when the time spent in rotational latency and
seek for the unconstrained access equals to the time
spent in rotational latency incurred during passing over
dR residual sectors. Hence,
dR
N
=
N ��1
N N ��1
2N
+Seek
Ignoring seek and approximating N ��1 to be N, this occurs
when R 6= 0 and
d
N
2R:
Thus, in order to achieve the same efficiency for the
non-constrained access, we will have to access at least
d VLBNs. However, this will significantly increase I/O
latency. If R = 0 i.e., when there are no residual sectors,
the one-revolution constraint already yields the most efficient
quadrangle access.