File Compression Problem

Mar 9, 2009
Me, being the math dork that I am, was thinking about ways to increase lossless file compression ratios (formats like .RAR and .ZIP are lossless file compressions). I figured that since you had a lossless file compression of a sort, that once you compressed a file once, you could keep compressing it to impressively small sizes (after all, a .RAR file is still a file, so why not?)

Thus, I derived that for some lossless file compression function R(b), where b = the number of bytes in the original file:

R(b) = x * b : x is a compression ratio less than 1

R(R(b)) = x^2 * b
R(R(R(b))) = x^3 * b
and so on...

thus, lim (n → ∞) x^n * b = 0 (this means that if you compressed file b a very large number of times, its file size would get arbitrarily close to zero)
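If that assumption held, the sizes would just follow a geometric sequence. Here's a quick sketch of the premise, plugging in the numbers from my experiment below (it doesn't hold in practice, as it turns out):

```python
# Geometric-decay premise: if each pass really multiplied the size by a
# fixed ratio x < 1, repeated compression would shrink toward zero.
b = 232501   # bytes: the original minecraft.exe size from the post
x = 0.40     # the ratio observed on the first WinRAR pass

sizes = [round(b * x**n) for n in range(6)]
print(sizes)  # each entry is supposedly 40% of the one before it
```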

I thus set out to test this. I took the minecraft.exe file (232501 bytes) and, using WinRAR, compressed it to a file of 93162 bytes, which works out to a compression ratio of about forty percent. I then compressed the minecraft.rar file into a file called minecraft2.rar. However, minecraft2.rar came out BIGGER than minecraft.rar by about 100 bytes, which made no sense to me. I kept compressing it too, and each file only got bigger.

Does anyone know why this happens? Any computer nerds willing to explain this? I know math but not a lot about the specific nuances of computers. Any help would be appreciated. This puzzles me and I can't think of any reason why it would happen.
 

BonsaiK

Music Industry Corporate Whore
Nov 14, 2007
mrpenguinismyhomeboy said:
Thus, I derived that for some lossless file compression function R(b) where, b = number of bytes of original file, that:

R(b) = x * b : x is a compression ratio less than 1

R(R(b)) = x^2 * b
R(R(R(b))) = x^3 * b
and so on...

thus, lim (n → ∞) x^n * b = 0 (this means that if you compressed file b a very large number of times, its file size would get arbitrarily close to zero)
Lossless compression doesn't work like that (as you discovered the hard way). It's not a simple mathematical ratio; it's more like a "map" for reconstructing the original data file, which may contain strings of repetitive or pattern-based data that can be compressed by representing them with a simpler algorithm, in much the same way that an ASCII character may only be 1 byte but generates an 8x8 bit glyph (8 bytes) via a font's lookup table. That data "map", in the case of your .zip or .rar file, is designed to use as little data as possible, so nothing is wasted. If you try to compress it again, the program looks at the data and goes "hang on, I can't actually afford to chop anything out here, this is very economical data and I have to leave it all in", so it does nothing except leave the file as-is and then tack on the archive headers and footers, which is why you get the small file size increase. I'm not very geeky but I'm kind of semi-geeky so I hope this explanation makes some sense.
 

Veylon

New member
Aug 15, 2008
It goes something like this. Imagine you have two files that look like this:
00000000
000001

Eight Zeroes. Five Zeroes and a One.

One compression scheme could compress them like this:
80
5011

The first number is the run length (how many times a digit repeats), followed by the digit itself. Repeat as necessary.
Eight Zeroes = Eight, Zero. Two digits instead of eight. You save six spaces.
But the second file, even though shorter, compresses less well. Five zeroes, then one one becomes Five, Zero, One, One. A saving of only two.

Now lets "compress" them again:
1810
151021

They both got longer this time.

The general rule of thumb is that files with less variety get compressed more and files with more variety get compressed less. Compressing a file makes it more varied and thus lessens (or even reverses) the value of additional compressing.

(Note to fellow nerds: This is just a simple example. Please don't tear it to pieces with nitpicking.)
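The toy scheme above can be sketched in a few lines. This `rle_encode` helper is hypothetical, capped at single-digit run lengths to match the examples:

```python
# Toy run-length encoder: each run of a repeated character becomes
# (run length, character). Run lengths are capped at 9 so one digit
# always suffices, matching the post's examples.
def rle_encode(s: str) -> str:
    out = []
    i = 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i] and j - i < 9:
            j += 1
        out.append(str(j - i) + s[i])   # run length, then the character
        i = j
    return "".join(out)

print(rle_encode("00000000"))              # → 80
print(rle_encode("000001"))                # → 5011
print(rle_encode(rle_encode("00000000")))  # → 1810 (got longer)
print(rle_encode(rle_encode("000001")))    # → 151021 (got longer)
```

Running the encoder on its own output makes things worse, exactly as described: the compressed strings have no runs left, so every character costs two.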
 

BonsaiK

That post there above this one is actually a really good explanation.

A really smart compression program will look at this sort of thing described in the post above, go "gosh, I'm going to make it bigger", and refuse to even run the compression algorithm on the data at all.