
DedupEstimate

Description

Deduplication estimator - evaluates how efficiently the files in a directory could be deduplicated. The deduplication granularity can be chosen from a wide range by setting the number of bits in the rolling-hash mask. For more details about the concept, see the PACK project and the related publication.
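
The role of the mask bits can be illustrated with a minimal sketch of content-defined chunking. This is not DedupEstimate's code: the class name MaskChunker, the method chunkEnds, and the toy shift-and-add hash are assumptions made for illustration, and the tool's real rolling hash may differ. The idea shown is only that a chunk ends wherever the low <chunk-bits> bits of the hash are all zero, so on data that looks random a boundary occurs about once every 2^bits bytes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Illustrative content-defined chunker (hypothetical, not the tool's code). */
final class MaskChunker {
    /**
     * Returns the exclusive end offset of every chunk in data.
     * A chunk ends where the low `bits` bits of the hash are all zero.
     */
    static List<Integer> chunkEnds(byte[] data, int bits) {
        int mask = (1 << bits) - 1;
        List<Integer> ends = new ArrayList<>();
        int hash = 0;
        for (int i = 0; i < data.length; i++) {
            // Toy hash (assumption): the shift lets old bytes fall out of the
            // 32-bit state, so only recent bytes decide where a chunk ends.
            hash = (hash << 1) + (data[i] & 0xFF);
            if ((hash & mask) == 0) {
                ends.add(i + 1);
            }
        }
        // Close the trailing chunk if the data did not end on a boundary.
        if (data.length > 0
                && (ends.isEmpty() || ends.get(ends.size() - 1) != data.length)) {
            ends.add(data.length);
        }
        return ends;
    }

    public static void main(String[] args) {
        byte[] data = new byte[1 << 20];
        new Random(1).nextBytes(data); // random bytes as stand-in input
        for (int bits = 6; bits <= 7; bits++) {
            int chunks = chunkEnds(data, bits).size();
            System.out.printf("bits=%d chunks=%d avg_chunk=%d%n",
                    bits, chunks, data.length / chunks);
        }
    }
}
```

A larger bit count makes boundaries rarer, which gives larger chunks and a coarser deduplication granularity.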

Download

File               Version  Date         Comments
DedupEstimate.jar  1.0      23-Jan-2014

Running

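Usage: java -jar DedupEstimate.jar <dir-name> <chunk-bits>

In the example below, <chunk-bits> is given as the range 6-7, and results are reported for each value in that range.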

Example:

java -jar DedupEstimate.jar c:\temp\yahoo 6-7
Directory: c:\temp\yahoo

serial     file_size bits avg_chunk    chunks    self_bytes    glob_bytes dedup_ratio file_name
1            366,680    6        68     5,347        63,823             0     17.406% Yahoo.ver1.htm
2            363,465    6        68     5,311        62,885       189,873     69.541% Yahoo.ver2.htm
3            357,188    6        67     5,253        62,392       201,823     73.971% Yahoo.ver3.htm
total      1,087,333    6        68    15,911       189,100       391,696     53.415% -
1            366,680    7       136     2,690        32,575             0      8.884% Yahoo.ver1.htm
2            363,465    7       127     2,851        37,081       172,357     57.623% Yahoo.ver2.htm
3            357,188    7       124     2,872        36,846       192,278     64.147% Yahoo.ver3.htm

The results are reported in the following three columns:
  • self_bytes - Number of redundant bytes within the file itself: bytes in chunks identical to chunks that appear earlier in the same file, before any other file is considered.
  • glob_bytes - Number of redundant bytes found by comparing against the global chunk collection, which is populated from the previously processed files.
  • dedup_ratio - The final result, computed as (self_bytes + glob_bytes) / file_size and shown as a percentage.
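
The accounting behind these three columns can be sketched as follows. This is a hypothetical illustration, not the tool's code: the class DedupAccounting, its method names, the use of strings as stand-ins for chunks, and the rule that a repeat within the same file is counted as self_bytes before the global collection is consulted are all assumptions.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative accounting for self_bytes, glob_bytes and dedup_ratio. */
final class DedupAccounting {
    /** Chunks seen in previously processed files (hypothetical store). */
    private final Set<String> globalChunks = new HashSet<>();

    /**
     * Processes one file, given as its list of chunks, and returns
     * {selfBytes, globBytes}. String length stands in for chunk size in bytes.
     */
    long[] addFile(List<String> chunks) {
        Set<String> seenInFile = new HashSet<>();
        long self = 0;
        long glob = 0;
        for (String chunk : chunks) {
            if (seenInFile.contains(chunk)) {
                self += chunk.length();   // repeats an earlier chunk of this file
            } else if (globalChunks.contains(chunk)) {
                glob += chunk.length();   // already known from an earlier file
            }
            seenInFile.add(chunk);
        }
        // Only now do this file's chunks become visible to later files.
        globalChunks.addAll(seenInFile);
        return new long[] {self, glob};
    }

    /** (self_bytes + glob_bytes) / file_size, as a fraction rather than a percentage. */
    static double dedupRatio(long selfBytes, long globBytes, long fileSize) {
        return (double) (selfBytes + globBytes) / fileSize;
    }
}
```

Applied to the first row of the example, dedupRatio(63823, 0, 366680) gives about 0.17406, which matches the 17.406% shown for Yahoo.ver1.htm at 6 bits.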

Technical

Each file is first loaded into memory and then processed as a whole.
Very large files may therefore require the -Xmx flag to raise the JVM's maximum heap size. For example, java -Xmx4000m -jar ...
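
Whether a given file is likely to need a larger -Xmx value can be estimated before a run. The sketch below is a hypothetical helper, not part of DedupEstimate: the class HeapCheck and the 2x headroom factor are assumptions, since the tool's actual memory overhead per file is not documented here.

```java
import java.io.File;

/** Hypothetical pre-flight check; not part of DedupEstimate. */
final class HeapCheck {
    /**
     * Rough guess at whether a file of fileSize bytes fits in a heap of
     * maxHeap bytes. The 2x headroom is an arbitrary assumption.
     */
    static boolean likelyFits(long fileSize, long maxHeap) {
        return fileSize <= maxHeap / 2;
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: java HeapCheck.java <file>");
            return;
        }
        long maxHeap = Runtime.getRuntime().maxMemory(); // upper bound set by -Xmx
        long size = new File(args[0]).length();          // 0 if the file does not exist
        System.out.printf("file=%,d bytes, max heap=%,d bytes%n", size, maxHeap);
        if (!likelyFits(size, maxHeap)) {
            System.out.println("Consider a larger -Xmx value.");
        }
    }
}
```

Running it with the same -Xmx setting intended for the estimator shows the heap limit the JVM will actually apply.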

Attachment: DedupEstimate.jar (79k), Eyal Zohar, Jan 22, 2014, 10:38 PM