QDDA update – 2.2

Posted on February 8, 2019 by Bart Sjerps

Quick post to announce QDDA version 2.2 has been published on Github and in the Outrun-Extras YUM repository.

Reminder: The Quick and Dirty Dedupe Analyzer is an Open Source Linux tool that scans disks or files block by block to find duplicate blocks and compression ratios, so that it can report – in detail – what the expected data reduction rate is on a storage array capable of these things. It can be downloaded as standalone executable (QDDA download), as RPM package via YUM or compiled from source (QDDA Install)

QDDA 2.2 adds:

DellEMC VMAX and PowerMAX support (using the DEFLATE compression algorithm)
bash-completion (by entering on the command line, RPM version only)
Improved options for custom storage definitions
Internal C++ code improvements (not relevant to the normal user)

Note to other storage vendors: If you’d like your array to be included in the tool, drop me a note with dedupe/compression algorithm details and I’ll see what is possible.

Continue reading →

Announcing qdda 2.0

Posted on May 14, 2018 by Bart Sjerps

It’s almost a year since I blogged about qdda (the Quick & Dirty Dedupe Analyzer).

qdda is a tool that lets you scan any Linux disk or file (or multiple disks) and predicts potential thin, dedupe and compression savings if you would move that disk/file to an All Flash array like DellEMC XtremIO or VMAX All-flash. In contrast to similar (usually vendor-based) tools, qdda can run completely independent. It does NOT require a registration or sending a binary dataset back to the mothership (which would be a security risk). Anyone can inspect the source code and run it so there are no hidden secrets.

It’s based upon the most widely deployed database engine, SQLite, and uses MD5 hashing and LZ4 compression to produce data reduction estimates.

The reason it took a while to follow-up is because I spent a lot of evening hours to almost completely rewrite the tool. A summary of changes:

Run completely as non-privileged user (i.e. ‘nobody’) to make it safe to run on production systems
Increased the hash to 60 bits so it scales to at least 80 Terabyte without compromising accuracy
Decreased the database space consumption by 50%
Multithreading so there are separate readers, workers and a single database updater which allows qdda to use multiple CPU cores
Many other huge performance improvements (qdda has demonstrated to scan data at about 7GB/s on a fast server, bottleneck was IO and theoretically could handle double that bandwidth before maxing out on database updates)
Very detailed embedded man page (manual). The qdda executable itself can show its own man page (on Linux with ‘man’ installed)
Improved standard reports and detailed reports with compression and dedupe histograms
Option to define your own custom array definitions
Removed system dependencies (SQLite, LZ4, and other libraries) to allow qdda to run at almost any Linux system and can be downloaded as a single executable (no more requirements to install RPM packages)
Many other small improvements and additions
Completely moved to github – where you can also download the binary

Read the overview and animated demo on the project homepage here: https://github.com/outrunnl/qdda

HTML version of the detailed manual page: https://github.com/outrunnl/qdda/blob/master/doc/qdda.md

As qdda is licensed under GPL it offers no guarantee on anything. My recommendation is to use it for learning purposes or do a first what-if analysis, and if you’re interested in data reduction numbers from the vendor, then ask them for a formal analysis using their own tools. That said, I did a few comparison tests and the data reduction numbers were within 1% of the results from vendor-supported tools. The manpage has a section on accuracy explaining the differences.

Continue reading →

Time for a change: Migrating Exadata to a modern All-Flash array

Posted on February 28, 2018 by Bart Sjerps

inflate With Oracle’s uncertain SPARC future, the rise of very fast and capable All-Flash arrays and existing Exadata customers looking to refresh their hardware, I increasingly get questions on what platform we can offer as an alternative to Exadata or Supercluster. A big challenge can be to break away from the lock-in effect of HCC (Hybrid Columnar Compression although I’d like to call it Hotel California Compression) as it seems hard to get an estimate on how much storage capacity is needed in other storage or migrating to other platforms. Note that any storage could theoretically use HCC but Oracle disabled it on purpose on anything other than Exadata, ZFS appliance or Pillar (huh?) storage.

As far as I know, there is no direct query to figure out how big a HCC compressed table would get after decompressing. HCC tables can get very large and the compression ratio can be fairly high which makes sizing a new environment a challenge.

So in order to provide a reasonable guesstimate, I created a typical scenario on my VMware homelab to estimate generic storage requirements for HCC compressed environments.

Continue reading →

The Quick and Dirty Dedupe Analyzer – Part 1 – Hands on

Posted on May 17, 2017 by Bart Sjerps

As announced in my last blogpost, qdda is a tool that analyzes potential storage savings by scanning data and giving a deduplication, compression and thin provisioning estimate. The results are an indication whether a modern All-Flash Array (AFA) like Dell EMC XtremIO would be worth considering.

In this (lenghty) post I will go over the basics of qdda and run a few synthetic test scenarios to show what’s possible. The next posts will cover more advanced scenarios such as running against Oracle database data, multiple nodes and other exotic ones such as running against ZFS storage pools.

[ Warning: Lengthy technical content, Rated T, parental advisory required ]

Continue reading →

The Quick and Dirty Deduplication Analyzer

Posted on April 18, 2017 by Bart Sjerps

The best thing about being me… There are so many “me”s.

— Agent Smith, The Matrix Reloaded

One of our customers reported less than optimal space savings on XtremIO running Oracle. In order to test various scenarios with Oracle I was in search of a deduplication analysis method or tool – only to find out that there was nothing available that qualified.

TL;DR: QDDA is an Open Source tool I wrote to analyze Linux files, devices or data streams for duplicate blocks and compression estimates. It can quickly give you an idea of how much storage savings you could get using a modern All-Flash Array like XtremIO. It is safe to use on production systems and allows quick analysis of various test scenarios giving direct results, and even works with files/devices that are in use. No registration or uploading of your confidential data is required.

Continue reading →

Dirty Cache

A storage infrastructure perspective on optimizing business applications

Category Archives: Deduplication

QDDA update – 2.2

Announcing qdda 2.0

Time for a change: Migrating Exadata to a modern All-Flash array

The Quick and Dirty Dedupe Analyzer – Part 1 – Hands on

The Quick and Dirty Deduplication Analyzer

Rate this:

Rate this:

Rate this:

Rate this:

Rate this: