Venti-delta Delta NEW
|
|
Bookmark Venti-delta Delta NEW |
About Venti-delta Delta NEWHere you can find all about Venti-delta Delta NEW like manual and other informations. For example: review.
Venti-delta Delta NEW manual (user guide) is ready to download for free.
On the bottom of page users can write a review. If you own a Venti-delta Delta NEW please write about it to help other people. [ Report abuse or wrong photo | Share your Venti-delta Delta NEW photo ]
Manual
Download
(Portuguese)Venti-delta Delta NEW, size: 908 KB |
Download
(English)Check if your language version is avaliable. Most of manuals are avaliable in many languages. |
Venti-delta Delta NEW
Video review
Venti Delta, sagna, ventiladores, presidente prudente
User reviews and opinions
No opinions have been provided. Be the first and add a new opinion/review.
Documents

Application-specic Delta-encoding via Resemblance Detection
Fred Douglis IBM T. J. Watson Research Center Hawthorne, NY 10532
douglis@acm.org
Arun Iyengar IBM T. J. Watson Research Center Hawthorne, NY 10532
aruni@us.ibm.com via a slow link, techniques which minimize bandwidth required for updates are highly desirable. However, in each of these environments, it is not always possible to identify an appropriate base version to take advantage of delta-encoding. Our work therefore addresses a domain in which there are very many objects with arbitrary overlap among different pairs of objects, and the relationships between these pairs are not known a priori. If one can identify which pairs are suitable candidates, delta-encoding can reduce the size of one relative to another, thereby reducing storage or transmission costs in exchange for computation. We consider several application domains for this technique, which we refer to as delta-encoding via resemblance detection, or DERD: web trafc, email, and les in a le system. We defer additional discussion of our research until after a more detailed discussion of delta-encoding and resemblance detection, which appears in the following subsection. After that, the next section describes the framework of our analysis in greater detail, including the metrics we consider. Section 3 presents the various datasets we used. Section 4 describes the experiments, and Section 5 provides the results of these experiments. Section 6 discusses the resource usage issues that would arise in a practical implementation of DERD. Section 7 surveys related work, and Section 8 summarizes and describes possible future work.
Abstract
Many objects, such as les, electronic messages, and web pages, contain overlapping content. Numerous past research projects have observed that one can compress one object relative to another one by computing the differences between the two, but these delta-encoding systems have almost invariably required knowledge of a specic relationship between themmost commonly, two versions using the same name at different points in time. We consider cases in which this relationship is determined dynamically, by efciently determining when a sufcient resemblance exists between two objects in a relatively large collection. We look at specic examples of this technique, namely web pages, email, and les in a le system, and evaluate the potential data reduction and the factors that inuence this reduction. We nd that delta-encoding using this resemblance detection technique can improve on simple compression by up to a factor of two, depending on workload, and that a small fraction of objects can potentially account for a large portion of these savings.
1 Introduction
Delta-encoding is the act of compressing a data object, such as a le or web page, relative to another object [1, 13]. Usually there is a temporal relationship between the two objects: the latter object exists, and when it is subsequently modied, the changes can be represented in a small fraction of the size of the entire object. There is often also a naming relationship between the objects, since a modied le can have the same name as the original copy. In these cases, identifying the base version against which to compute a delta is straightforward. Delta-encoding is particularly attractive for situations where information is being updated across a network with limited bandwidth. For example, web sites are often replicated both for higher performance and availability. The bandwidth between the replicas can be limited. Another example would be replicated mail systems. Electronic mail systems often allow clients to replicate copies of mail messages locally. Clients may be connected to the network via phone lines with limited bandwidth. For an email client connected to a mail server
1.1 Background
It is difcult to describe our approach without providing a general overview of both delta-encoding and resemblance detection. We cover enough of each of these areas here to set the stage for combining the two, then return to a more comprehensive comparison with related work toward the end of the paper. Deltas are useful for reducing resource requirements, and existing applications of deltas generally fall into two categories: storage and networking. For storage, when one already stores a base version of a le, subsequent versions can be represented by changes. This lowers storage demands within le systems (the Revision Control System (RCS) [25] is a longstanding example of this), backup-restore systems [1], and similar environ-
1.3 Summary of Results
We have found that the benets of application-specic deltas vary depending on the mix of content types. For example, HTML and email messages display a great deal of redundancy across large datasets, resulting in deltas that are signicantly smaller than simply compressing the data, while mail attachments are often dominated by non-textual data that do not lend themselves to the technique. A few large les can contribute much of the total savings if they are particularly amenable to delta-encoding. Application-specic techniques, such as delta-encoding an unzipped version of a zip or gzip le and then zipping the result, can signicantly improve results for a particular le, but unless an entire dataset consists of such les, overall results improve by just a couple of percent. Numerous parameters can be varied in assessing the benets of deltas in this context, and we have evaluated several. The results do not appear to be sensitive to the size of shingles or the delta-encoding algorithm, within reason. The extent of the match of the number of features is a good predictor of the delta size. Perhaps most importantly, when multiple les match the same number of features, there is minimal difference between the best deltathe smallest delta obtained across all the lesand the average delta. The latter two results suggest that while it is benecial to determine the le(s) with the maximal number of matching features, only one delta need be computed. This is crucial because nding matching features, given a precomputed database of the features of other les and the dynamically computed feature set of the le being delta-encoded, is far more efcient than computing an actual delta.
1.2 Goals
As Manber suggested, one can use the features of documents to identify when les overlap and then deltaencode pairs of overlapping les to save space or bandwidth. One goal of this work was to assess whether this technique is generally applicable, and if not, to identify some specic instances in which it is applicable. A second goal was to evaluate a number of the parameters used in this process, such as:
2 Framework
This section describes our approach to the problem of delta-encoding with resemblance detection in greater detail. We discuss the types of data we considered and the way in which we evaluate the potential benets of DERD.
2.1 Types of Data
In the past, delta-encoding has been used for many types of data in numerous environments. Our interest has fo-
the size of a shingle,
ments. Over a network, transmitting data that are already known to the recipient can be avoided. The most common approach in this case is to work from a common base version known to the sender and recipient, compute the delta, and transmit it. This technique has been applied to web trafc [16], IP-level network communication [24], and other domains. An extension to the traditional web delta-encoding approach is to select the base version by nding similar, rather than identical, URLs [7]. What if one wishes to nd a similar le based on content rather than name, among a large collection of les? Manber devised a method for extracting features of les based on their contents, in order to nd les with overlapping content efciently [14]. He computed hashes of overlapping sequences of bytes (also known as shingles), then looked for how many of these hashes were shared by different les. Manber indicated that clustering similar les for improved compression would be an application of this technique. Broder used a similar approach but used a deterministic sampling of the hash values to dramatically reduce the amount of data needed for each le [5, 6]. With his approach, a subset of features of a le is used to represent the le, and if two les share many of those features in common, there is a high probability of signicant content in common as well. A common use for this technique is to suppress near-duplicates in search engine results [6], and variations of the technique have been used in link-level duplicate suppression [24] and le systems [8, 17, 20]. Because the shingling technique has seen so much use in the systems community of late, we refrain from providing a detailed description of it. Briey, it uses Rabin ngerprints [21] to compute a hash of consecutive bytes; the key properties of Rabin ngerprints are that they are efcient to compute over a sliding window, and they are uniformly distributed over all possible values. Thus, Broders approach of selecting the ngerprints with the smallest values effectively selects random features in a deterministic fashion, and two documents with many features in common overall would hopefully have many of these features in common.
pressed version of the le. For example, we made a copy of the Redhat 7.1 /usr/share/dict/words (409,276 bytes, 45,424 one-word lines) and changed line six from abandon to xyzzy. We call the copy words1. Both words and words1 generated gzipped les of about 131 Kbytes, with a difference of just four bytes in size. Encoding the differences between the uncompressed words1 and words, using vcdiff, represented the differences in just 79 bytes. In stark contrast, deltaencoding words1.gz against words.gz generated about 93 Kbytes. Therefore, delta-encoding two compressed les by encoding their uncompressed versions and compressing the result (if needed) has the potential for signicant gains. Since zip can store an arbitrarily large number of les and directories as a single compressed le, comparing its contents individually and zip-ing the results into a single zip le can have similar benets. One might assume that tar need not be handled specially, since it concatenates its input without compression. We nd below that this hypothesis is incorrect for the three delta-encoding programs we tried. For all these datatypes, however, the overall effects depend on the mix of data: in practice, the number and size of compressed les that can benet from this approach may be dwarfed by all the other data. Delta-encoding algorithm and parameters There are a few possible delta-encoding programs. We did not nd signicant differences in output sizes among the available programs; therefore, following the approach of delta-encoding in HTTP [16], we report numbers using Korn and Vos vcdiff [13]. Delta-encoding versus compression We vary a parameter that species how much smaller a delta must be than simply compressing a le before the delta is used. If no delta is small enough, of the les used as potential base versions, the compressed version is used instead. We use vcdiff for compression (deltaencoding a le against /dev/null), due to historical reasons. Its data reduction is comparable to gzip, though typically slightly worse. Identical les When an identical le appears multiple times in a dataset, it can be trivially encoded against another instance through the use of hash functions such as MD5. Past stud-
ies have investigated the prevalence of mirrors on the web [4] and techniques for suppressing duplicate payloads [12]. We chose to suppress duplicates from consideration in our analysis, since they are trivially handled through other means, except when a le contained in a zip archive is duplicated (since two zip les may have many identical les and some changed content, and our unzip-rezip procedure would match up the identical les).
3.2 File Data
We used two types of le data, which are summarized in Table 2. First, we scanned the entire /usr directory in a nearly unmodied Redhat Linux 7.1 distribution, totaling just under 2 Gbytes of data in over 100K les. Second, we examined email from several users and in several formats. Much of our analysis used over 500 Mbytes of one users UNIX-based email, which is stored individually in separate les by the MH mail system. The remaining data came from Lotus Notes, which stores message bodies and attachments as separate objects in a at-le document database. We studied the attachments of ve users and the message bodies of two.
3 Datasets
We separate our analyses into two types of data: web pages and les in a le system. We lump email into the latter category, since in general we expect the benets to be greater for static encoding (space reduction) than network transmission. Note that not all the datasets we analyzed are discussed further in this paper, but we include them in the tables to give a sense of the variability of the results.
4 Experiments
As described in Section 2.2, we varied a number of parameters in the delta-encoding and resemblance detection process. Our general goals were to determine how much more data could be eliminated by using deltas rather than just compression, and how sensitive that result would be to this set of parameters. In particular, we wanted to estimate the minimal work a system might do to get a reasonable benet (i.e., the point of diminishing returns). In general, we xed the parameters to a common set. We then varied each parameter to evaluate its effect. Table 3 lists these parameters, with a brief description of each one, the default value in boldface, and other tested parameters. The parameters are clustered into two sets: the rst controls the pass over the data to compute the features, and the second controls the comparison of those features and computation of the deltas. In some cases, due to space constraints, we do not present additional details about variations in parameters that did not signicantly affect results; these are denoted by italic text. Additional descriptions of many of the parameters were given above in Section 2.2. Note that min features ratio is special, in that it is possible to compute the savings for each number of matching features and then compute a cumulative benet for each number of matches in a later stage, as demonstrated in Section 5.1.
3.1 Web Data
Ideally, to analyze the benets of DERD for the web, one would study a live implementation over an extended time, and/or use full content traces to simulate an implementation. The latter approach was used effectively to study delta-encoding based on identical URLs [16], but such traces are difcult to obtain. Instead, we used the w3get program to download a small set of root web pages, and recursively the pages linked from them, up to two levels. We specically excluded le sufxes that suggested image data, such as JPG and GIF, focusing instead on the base pages. This is partly because delta-encoding has already been demonstrated to be ineffective across two different image les, even having the same name [16], and partly because images change more slowly than HTML [9] and are more likely to be cached in the rst place. While periodic downloads of specic web pages have been used in the past to evaluate delta-encoding [13], cross-page comparisons require a single snapshot of a large number of pages. We believe these pages, and the results obtained from them, demonstrate a high degree of overlap in content between pages on the same site; this has been observed in other research due to the high use of templates for creating dynamic pages [3, 23]. Table 1 lists the sites accessed, all between 24-26 July 2002, with the number of pages and total size. Note that in the case of Yahoo!, the download was aborted after about 27 Mbytes were downloaded, as that offered sufcient data to perform an analysis, and it was unclear how much additional data would be retrieved if left unchecked.
4.1 Implementation Details
Most of the work to encode differences based on similarity is performed by a pair of Perl scripts. One of these recursively descends over a set of directories and invokes a Java program to compute the features. Each computation is a separate invocation of Java, though that could be optimized. Once a les features have been computed, they are cached in a separate le. The other script takes the precomputed set of lenames and features, and for each le determines which
Name Yahoo IBM Masters CNN Wimbledon
Files From. yahoo.com ibm.com masters.com cnn.com wimbledon.com
Files 3,190
Size (Mbytes) 27.55 3.21 3.19 2.53 2.40
Delta% 10
Comp% 35
Table 1: Web datasets evaluated. Delta and compression percentages refer to the size of the encoded dataset relative to the
original.
Name /usr MH User1 User1 User2 User2 User3 User4
Files From. /usr one users MH directory User 1s Notes mail bodies User 1s Notes mail attach. User 2s Notes mail bodies User 2s Notes mail attach. User 3s Notes mail attach. User 4s Notes mail attach.
Bod Att Bod Att Att Att
Files Included Excluded 102,932 1,250 87,005 3,445 1,1,982
Size (Mbytes) 1,964.16 565.69 5.97 81.29 1.18 417.35 36.18 991.45
Delta% 52 53
Comp% 61 66
Table 2: File datasets evaluated. Excluded les are explained in the text. Delta and compression percentages refer to the size of
the encoded dataset relative to the original.
other les have the maximum number of matching features. Currently this is done by identifying which features a le has, and incrementing counters for all other les with a given feature in common, using the value of the feature as a hash key. This records the most features in common at any point,. After all features are processed, any les that have at least one feature in common are sorted by the number of matching features. Typically, only the les that match exactly features are considered as base versions, up to the max comparisons parameter, but if the best matches fail to produce a small enough delta, poorer matches are considered until the maximum is reached. There are methods to optimize this comparison by precomputing the overlap of les, as well as through estimation [22], which we intend to integrate at a later date. Delta-encoding is performed by one of a set of programs, all written in C. Once a pair of les has been so encoded, the size of the output is cached. Occasionally, the delta-encoding program might generate a delta that is larger than the compressed le, or even larger than the original le. In those cases, the minimum of the other values is used. For a given dataset, the results are reported by listing how many les have a maximum features match for a given number of features, with statistics aggregated over those les: the original size, the size of the delta-encoded output, and the size of the output using vc-
diff compression (delta-encoding against /dev/null, comparable to gzip). Table 4 is an example of this output. The rows at the top show dissimilar les, where deltas made no difference, while the rows at the bottom had the greatest similarity and the smallest deltas. The BestDelta and AvgDelta columns show that, in general, there was at most a 1% difference in size (relative to the original le) between the best of up to ten matching les and the average of all ten. This characteristic was common to all the datasets. Correspondingly, in all the gures, the curves for the savings for delta-encoding depict the average cases. There are two apparent anomalies in Table 4 worth noting. First, there is a substantial jump in size at the complete 30/30 features match, despite a consistent number of les, showing a much higher average le size. This is skewed by a large number of nearly identical les, resulting from form letters attaching manuscripts for review; if each manuscript was sent to three persons and the features in the large common data were all selected by the minimization process, they all match in every feature. (This is a desirable behavior, but may not be typical of all datasets.) Second, the les with 02 out of 30 features matching have a dramatically worse compression ratio than the other data. We believe these are attributable to types of data that neither match other les to a great extent nor exhibit particularly good compressibility from internally repeated text
Processing Stage
Parameter shingle size num features
Description Number of bytes in a ngerprinted shingle Number of features compared Minimum size of an individual le to include in statistics Should zip les be unzipped before comparison Should gz les be unzipped before comparison Whether encoding A against B precludes encoding B against A Program to perform delta-encoding Whether to compare against all les, or just best matches Maximum number of les to compare against, with equal maximal matching features What fraction of features must match to compute a delta? What is the maximum size of a delta, relative to simple compression, for it to be used? 20, 30 30, 100
Values
Preprocessing
min size unzip gunzip static files program exhaustive search
MH Sizes by Feature Relative Size (%) Total Size (MB) 30 0
MH Cumulative Sizes, 30 features
(a) Total data sizes for the original dataset, using compression, and using DERD , for individual numbers of matching features. Most of the data match very few features in any other le, or match all the features. The y-axis is on a log scale.
Relative Size (%)
skew suggests that heuristics for intelligently selecting a subset of potential delta-encoded pairs, or compressed les, could be quite benecial.
5.3 Effects of File Blocking
Section 2.2 referred to an impact on size reduction from rounding to xed block sizes. In some workloads, such as le backups, this is a non-issue, but in others it can have a moderate impact for small blocks and a substantial impact for large ones. Figure 3(b) shows how varying the blocksize affects overall savings for the MH dataset. Like Figure 3(a), it plots the cumulative savings sorted by contribution, but it accounts for block rounding effects. A 1-Kbyte min-
1 Original Compressed DERD Maximum Number of Matching Features
Compressed DERD Matching Features (M >= X)
(b) Cumulative benets. The y-axis shows the relative size, in percent, of compressing or delta-encoding each le. A point on the x-axis shows the benet from performing this on all les that match at least that many features.
Figure 1: Effect of matching features, for the MH data. These gures graphically depict the the data in Table 4.
Cumulative Sizes, multiple static datasets 100 100
Cumulative Sizes, multiple web datasets IBM comp IBM DERD Yahoo comp Yahoo DERD
40 /usr comp /usr DERD MH comp MH DERD 20 Matching Features (M >= X) 25 30
Matching Features (M >= X) 25 30
(a) Static datasets.
(b) Web datasets.
Figure 2: Effect of matching features, cumulative, for several datasets. imum blocksize, typical for many UNIX systems with fragmented le blocks, reduces the total possible benet of delta-encoding from around 66% (assuming no rounding) to 61%, but a 4-Kbyte blocksize brings the benet down to 40% since so many messages are smaller than 4 Kbytes.
5.4 Handling Compressed and Tarred Files
Section 2.2 provided a justication for comparing the uncompressed versions of zip and gzip les, as well as a hypothesis that tar les would not need special treatment. For some workloads this is irrelevant, since for example the MH repository stored all messages with full
Cumulative Savings by Contribution 10 Savings (%) Savings (%) 30 DERD by File% Comp by File% DERD by Byte% Comp by Byte% 0.1 0.01 0.10 Percent of original files/bytes to obtain savings 10
Cumulative DERD Savings by Contribution and Block Size
(a) Relative savings as a function of cumulative les, by count or by bytes. Plotted on a log-log scale.
(b) Relative savings assuming no le blocking, or rounding to 1-Kbyte or 4-Kbyte units.
Figure 3: Cumulative savings from MH les, sorted in order of contribution to total savings. bodies, uncompressed. An attachment might contain MIME-encoded compressed les, but these would be part of the single le being examined, and one would have to be more sophisticated about extracting these attachments. In fact, there was no single workload in our study with large numbers of both zip and gzip les, and overall benets from including this feature were only 12% of the original data size in any dataset. For example, the User4 Attach workload, which had the most zip les, only saved an additional 2% over the case without special handling. Even though the zip les themselves were reduced by about a third, overall storage was dominated by other le types. We expected directly delta-encoding one tar le against a similar tar le to generate a small delta if individual les had much overlap, but this was not the case in some limited experiments. Vcdiff generated a delta about the size of the original gzipped tar le, and two other delta programs used within IBM performed similarly. We tried a sample test, using two email tar le attachments unpacked into two directories, and then using DERD to encode all les in the two directories. We selected the delta-encoded and compressed sizes of the individual les in the smaller of the tar les, and found delta-encoding saved 85% of the bytes, compared to 71% for simple compression of individual les and 79% when the entire tar le was compressed as a whole. Depending on how this extends to an entire workload, just as with zip and gzip, these savings may not justify the added effort. used. There are reasons why this might not be desirable, such as a web server using a cached compressed version rather than computing a specialized delta for a given request. As another example, consider a le system backup that would require both a base le and a delta to be retrieved before producing a saved le: if the compressed version were 25% larger than the delta, it would consume that extra storage, but restoring the le would involve retrieving 125% of the deltas size rather than the delta and a base version that would undoubtedly be much larger than that 25%. We varied the threshold for using a delta to be 25 100% of the compressed size, in increments of 25%. Figure 4 shows the result of this experiment on the MH dataset. There is a dramatic increase in the relative size of the delta-encoded data at higher numbers of matching features, because in some cases, there is no longer a usable match at a given level. The most interesting metric is the overall savings if all les are included, since that no longer suffers from this shift; the relative size increases from about 35% to about 45% as the threshold is reduced.
7 Related Work
Mogul, et al., analyzed the potential benets of compression and delta-encoding in the context of HTTP [16]. They found that delta-encoding could dramatically reduce network trafc in cases where a client and server shared a past version of a web page, termed a deltaeligible response. When a delta was available, it reduced network bandwidth requirements by about an order of magnitude. However, in the traces evaluated in that study, responses were delta-eligible only a small fraction of the time: 10% in one trace and 30% in the other, but the one with 30% excluded binary data such as images. On the other hand, most resources were compressible, and they estimated that compressing those re-
sources dynamically would still offer signicant savings in bandwidth and end-to-end transfer timesfactors of 2-3 improvement in size were typical. Later, Chan and Woo devised a method to increase the frequency of delta-eligible responses by comparing resources to other cached resources with similar URLs [7]. Their assumption was that resources near each other on a server would have pieces in common, something they then validated experimentally. They also described an algorithm for comparing a le against several other les, rather than the one-on-one comparison typically performed in this context. However, they did not explain how a server would select the particular related resources in practice, assuming that it has no specic knowledge of a clients cache. We believe there is an implicit assumption that this approach is in fact limited to personal proxies with exact knowledge of the clients cache [11, 2], in which case it has limited applicability. Ouyang, et al., similarly clustered related web pages by URL, and tried to select the best base version for a given cluster by computing deltas from a small sample [18]. While they were not focused on a caching context, and are more similar to the general applications described herein, they did not initially use the more efcient resemblance detection methods of Manber and Broder to best select the base versions. Subsequently, they applied resemblance detection techniques to scale the technique to larger collections [19]. This work, roughly concurrent with our own, is similar in its general approach. However, the largest dataset they analyzed was just over 20,000 web pages, and they did not consider other types of data such as email. Another possibly signicant distinction is that they used shingle sizes of only 4 bytes, whereas we used 20-30 bytes. (We did not obtain this paper in time to repeat our analyses with such a small shingle size.) Spring and Weatherall [24] essentially generalized Chan and Woos work by applying it to all data sent over a specic communication channel, and using resemblance detection to detect duplicate sequences in a collection of data. This was done by computing ngerprints of shingles, selecting those with a predetermined number of zeroes in the low-order bits (deterministically selecting a fraction of features), and scanning before and after the matching shingle to nd the longest duplicate data sequence. Like Chan and Woos work, this system worked only with a close coupling between clients and servers, so both sides would know what redundant data existed in the client. In addition, the communication channel approach requires a separate cache of packets exchanged in the past, which may compete with the browser cache and other applications for resources. In some cases, the suppression of redundancy is at a very coarse level, for instance identifying when an en-
For web content, we have found substantial overlap among pages on a single site. This is consistent with Chan and Woo [7], Ouyang, et al. [19], and recent work on automatic detection of common fragments within pages [23]. For the ve web datasets we considered, deltas reduced the total size of the dataset to 819% of the original data, compared to 2936% using compression. For les and email, there was much more variability, and the overall benets are not as dramatic, but they are signicant: two of the largest datasets reduced the overall storage needs by 1020% beyond compression. There was signicant skew in at least one dataset, with a small fraction of les accounting for a large portion of the savings. Factors such as shingle size and the number of features compared do not dramatically affect these results. Given a particular number of maximal matching features, there is not a wide variation across base les in the size of the resulting deltas. A new le will often be created by making a small number of changes to an older le; the new le may even have the same name as the old le. In these cases, the new le can often be delta-encoded from the old le with minimal overhead. For the most part, our datasets did not consider these scenarios. For situations where this type of update is prevalent, the benets from deltaencoding are likely to be higher. Now that we have demonstrated the potential savings of DERD, in the abstract, we would like to implement underlying systems using this technology. The smaller deltas for web data suggest that an obvious approach is to integrate DERD into a web server and/or cache, and then use a live system over time. However, supporting resemblance-based deltas in HTTP involves extra overheads and protocol support [10] that do not affect other applications such as backups. We are also interested in methods to reduce storage and network costs in email systems, and hope to implement our approach in commonly used mail platforms. As the system scales to larger datasets, we can add heuristics for more efcient resemblance detection and feature computation. We can also evaluate additional application-specic methods, such as encoding individual elements of tar les, and compare the various delta-based approaches against other systems such as LBFS and rsync in greater depth.
is a LotusScript guru extraordinaire. Ziv Bar-Yossef, Sridhar Rajagopalan, and Lakshmish Ramaswamy provided code for computing features. Several people have permitted us to analyze their data, including Lisa Amini, Frank Eskesen and Andy Walter. Ramesh Agarwal, Andrei Broder, Ron Fagin, Chris Howson, Ray Jennings, Jason LaVoie, Srini Seshan, John Tracey, and Andrew Tridgell have provided helpful comments on some of the ideas presented in this paper and/or earlier drafts of this paper. Finally, we thank the anonymous reviewers and our shepherd, Darrell Long, for their advice and feedback.
References
[1] M. Ajtai, R. Burns, R. Fagin, D. Long, and L. Stockmeyer. Compactly encoding unstructured input with differential compression. Journal of the ACM, 49(3):318367, May 2002. [2] Gaurav Banga, Fred Douglis, and Michael Rabinovich. Optimistic deltas for WWW latency reduction. In Proceedings of 1997 USENIX Technical Conference, pages 289303, January 1997. [3] Ziv Bar-Yossef and Sridhar Rajagopalan. Template detection via data mining and its applications. In Proceedings of the Eleventh International Conference on World Wide Web, pages 580591. ACM Press, 2002. [4] K. Bharat and A. Broder. Mirror, mirror on the web: A study of host pairs with replicated content. In Proceedings of the 8th International World Wide Web Conference, pages 501512, May 1999. [5] Andrei Z. Broder. On the resemblance and containment of documents. In Compression and Complexity of Sequences (SEQUENCES97), 1997. [6] Andrei Z. Broder. Identifying and ltering nearduplicate documents. In Combinatorial Pattern Matching, 11th Annual Symposium, pages 110, June 2000. [7] Mun Choon Chan and Thomas Y. C. Woo. Cachebased compaction: A new technique for optimizing web transfer. In Proceedings of Infocom99, pages 117125, 1999. [8] L. P. Cox, C. D. Murray, and B. D. Noble. Pastiche: Making backup cheap and easy. In Proceedings of the 5th Symposium on Operating Systems Design and Implementation (OSDI02), pages 285 298. USENIX, December 2002. [9] Fred Douglis, Anja Feldmann, Balachander Krishnamurthy, and Jeffrey Mogul. Rate of change and other metrics: a live study of the World Wide Web.
Acknowledgments
Kiem-Phong Vo jointly developed the idea of web-based DERD , resulting in a research report [10] from which a small amount of the text in this manuscript has been taken. Andrei Broder has been extremely helpful in understanding the intricacies of resemblance detection, Randal Burns and Kiem-Phong Vo have similarly been helpful in providing and helping us to understand their delta-encoding software packages, and Laurence Marks
In Proceedings of the Symposium on Internet Technologies and Systems, pages 147158. USENIX, December 1997. [10] Fred Douglis, Arun K. Iyengar, and Kiem-Phong Vo. Dynamic suppression of similarity in the web: a case for deployable detection mechanisms. Technical Report RC22514, IBM Research, July 2002. [11] Barron C. Housel and David B. Lindquist. WebExpress: A system for optimizing Web browsing in a wireless environment. In Proceedings of the Second Annual International Conference on Mobile Computing and Networking, pages 108116. ACM, November 1996. [12] Terence Kelly and Jeffrey Mogul. Aliasing on the World Wide Web: Prevalence and Performance Implications. In Proceedings of the 11th International World Wide Web Conference, May 2002. [13] David G. Korn and Kiem-Phong Vo. Engineering a differencing and compression data format. In Proceedings of the 2002 Usenix Conference. USENIX Association, June 2002. [14] U. Manber. Finding similar les in a large le system. In Proceedings of the USENIX Winter 1994 Technical Conference, pages 110, January 1994. [15] J. Mogul, B. Krishnamurthy, F. Douglis, A. Feldmann, Y. Goland, A. van Hoff, and D. Hellerstein. Delta encoding in HTTP, January 2002. RFC 3229. [16] Jeffrey Mogul, Fred Douglis, Anja Feldmann, and Balachander Krishnamurthy. Potential benets of delta-encoding and data compression for HTTP. In Proceedings of ACM SIGCOMM97 Conference, pages 181194, September 1997. [17] Athicha Muthitacharoen, Benjie Chen, and David Mazieres. A low-bandwidth network le system. In Symposium on Operating Systems Principles, pages 174187, 2001. [18] Zan Ouyang, Nasir Memon, and Torsten Suel. Using delta encoding for compressing related web pages. In Data Compression Conference, page 507, March 2001. Poster. [19] Zan Ouyang, Nasir Memon, Torsten Suel, and Dimitre Trendalov. Cluster-based delta compression of a collection of les. In International Conference on Web Information Systems Engineering (WISE), December 2002. [20] S. Quinlan and S. Dorward. Venti: a new approach to archival storage. In Proceedings of the First USENIX Conference on File and Storage Technologies, Monterey,CA, 2002. [21] Michael O. Rabin. Fingerprinting by random polynomials. Technical Report TR-15-81, Center for
Research in Computing Technology, Harvard University, 1981. [22] Sridhar Rajagopalan, 2002. Personal Communication. [23] Lakshmish Ramaswamy, Arun Iyengar, Ling Liu, and Fred Douglis. Techniques for efcient detection of fragments in web pages. Manuscript, November 2002. [24] Neil T. Spring and David Wetherall. A protocolindependent technique for eliminating redundant network trafc. In Proceedings of ACM SIGCOMM, August 2000. [25] W. Tichy. RCS: a system for version control. SoftwarePractice & Experience, 15(7):637654, July 1985. [26] Andrew Tridgell. Efcient Algorithms for Sorting and Synchronization. PhD thesis, Australian National University, 1999.
Tags
Deploy 3 Audio 625 LP290 Alarm KX-TG1070HG VDR-D150EP Dictionary Lesabre 2001 Rout 54-N FZ6-2004 ZKF221 13L-M150B W130A F64080IM 100ET 2200 EU Navigon 7210 TSU9400 Dvdr7310H 58 OT-802 Aspire 1400 PC-D450 DSC-T100 LBT-ZX8 Ariston 129 DAV-DZ520K AOI-892 Carens Review Aficio1060 GR-DV1 Digitv CDP-M205 RD2010 R1994 WM-FX521 WM-EX570 WM-FX290 LE37A457 D845gebv2 63100X DM 90 Samsung E908 RQ-SX41 VCT-D680RM 72600 Family Adat-HD24 FP120 8120 USB GNX3000 Simulator 98 LN-406 And VST Player A100-ST3211 SGH-T229 CA-5II AS720 LUX CP-S210W Explorist 210 HD ZD DCR-1200 Dvdr3455H-37B Samsung D780 Expedition-2001 DI8512 Aero 1500 Uniden BCT7 400Z ABS 18-25 C II X-360 SL-DZ1200 BCT-1540 HP-147E KDC-BT6544U STR-DA2000ES Mcculloch PM85 KW-XR411 SP-1203N Bullet 2 RP-42PX11H PS42A410c1D Photo CY-M9054EUC T59840 LX3950W 10 3 Saab 9-5 DM1050 FP547 Studio Pc 1201 TX-906 300W101 RSH7pnrs ZTE240 EL-6620 Cooker C 2
manuel d'instructions, Guide de l'utilisateur | Manual de instrucciones, Instrucciones de uso | Bedienungsanleitung, Bedienungsanleitung | Manual de Instruções, guia do usuário | инструкция | návod na použitie, Užívateľská príručka, návod k použití | bruksanvisningen | instrukcja, podręcznik użytkownika | kullanım kılavuzu, Kullanım | kézikönyv, használati útmutató | manuale di istruzioni, istruzioni d'uso | handleiding, gebruikershandleiding
Sitemap
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101
