Free Write: Compression Algorithms

So, I still think about compression algorithms from time to time for giggles, and I’ve come to more conclusions. The combination of statistics is quite inadequate as a means of compression. For a bit of background, the way it works is that it gets statistics on the whole, then splits the data into groups. It gets statistics on each group and consolidates them as much as possible. The problem with this is ordering and redistributing resources amongst the groups. With redistribution, I have attempted to find ways to create strong distinctions between the groups. Statistics such as ‘sum of unique pattern lengths’, ‘number of unique pattern lengths’, ‘number of unique prime patterns’ and ‘total factors in lengths’ let us figure out some of the patterns simply. But in some cases it’s not that simple.
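
To make those statistics concrete, here is a quick sketch of what I mean. The helper names are mine, and I’m reading ‘total factors in lengths’ as prime factors counted with multiplicity; none of this comes from a real codec.

  def is_prime(n):
      if n < 2:
          return False
      return all(n % d for d in range(2, int(n ** 0.5) + 1))

  def prime_factor_count(n):
      # prime factors counted with multiplicity (e.g. 12 -> 3)
      count, d = 0, 2
      while d * d <= n:
          while n % d == 0:
              n //= d
              count += 1
          d += 1
      return count + (1 if n > 1 else 0)

  def pattern_stats(lengths):
      unique = set(lengths)
      return {
          "sum_of_unique_lengths": sum(unique),
          "number_of_unique_lengths": len(unique),
          "number_of_unique_primes": sum(1 for n in unique if is_prime(n)),
          "total_factors_in_lengths": sum(prime_factor_count(n) for n in unique),
      }

  print(pattern_stats([3, 13, 5, 3, 13]))
  # sum 21, 3 unique lengths, 3 primes, 3 total factors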

Given a unique pattern sum of 21, 3 unique patterns, 3 primes and 3 total factors, find all possible combinations:

  • 21 = 11 + 7 + 3
  • 21 = 13 + 5 + 3

As you can see, we have a problem: there are two possibilities where we want only one. If we specify the largest value, are we finished? Nope:

  • 42 = 19 + 13 + 7 + 3
  • 42 = 19 + 11 + 7 + 5
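
A quick brute force makes the ambiguity obvious. This is purely illustrative (the sieve and the function names are mine): it just enumerates every way to write a total as distinct primes, optionally pinning the largest one.

  from itertools import combinations

  def primes_up_to(n):
      sieve = [True] * (n + 1)
      sieve[0:2] = [False, False]
      for i in range(2, int(n ** 0.5) + 1):
          if sieve[i]:
              sieve[i*i::i] = [False] * len(sieve[i*i::i])
      return [i for i, p in enumerate(sieve) if p]

  def distinct_prime_sums(total, k, largest=None):
      pool = primes_up_to(total)
      return [c for c in combinations(pool, k)
              if sum(c) == total and (largest is None or max(c) == largest)]

  print(distinct_prime_sums(21, 3))              # [(3, 5, 13), (3, 7, 11)]
  print(distinct_prime_sums(42, 4, largest=19))  # [(3, 7, 13, 19), (5, 7, 11, 19)]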

Arguably this is fine, because we can enumerate the possibilities. However, then we are generating all possible values that meet this requirement. How do you even begin with that?

  • With 4 values, we know that 2, 3 and 5 are the lowest possible. That means the number that is actually fluid is 42 - 10 = 32.
  • 32’s closest prime is 31, which leaves 1, and we can only add that to the greatest low prime (5): 1 + 5 = 6.
  • However, 6 is not a prime. As a result we need to drop to the next highest prime (29) and distribute the leftover 3 until all the values are prime.
  • That turns out not to be possible either, so we do it again with 23 (the next available prime), and so on.
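
Here is a rough sketch of that descent: reserve room for the smallest primes in the low slots, try the largest prime that still fits at the top, and recurse on whatever is left. It is my reading of the steps above, not an optimized enumerator.

  def primes_up_to(n):
      if n < 2:
          return []
      sieve = [True] * (n + 1)
      sieve[0:2] = [False, False]
      for i in range(2, int(n ** 0.5) + 1):
          if sieve[i]:
              sieve[i*i::i] = [False] * len(sieve[i*i::i])
      return [i for i, p in enumerate(sieve) if p]

  def descend(total, k, cap=None):
      """Yield every way to write total as k distinct primes, largest first."""
      if k == 0:
          if total == 0:
              yield ()
          return
      candidates = primes_up_to(total if cap is None else min(total, cap))
      for p in reversed(candidates):            # walk down from the largest fit
          smaller = [q for q in candidates if q < p]
          # the other k-1 slots must at least hold the k-1 smallest primes below p
          if len(smaller) < k - 1 or sum(smaller[:k - 1]) > total - p:
              continue
          for rest in descend(total - p, k - 1, cap=p - 1):
              yield (p,) + rest

  for combo in descend(42, 4):
      print(combo)
  # (23, 11, 5, 3)   <- 31 and 29 fail, exactly as in the steps above
  # (19, 13, 7, 3)
  # (19, 11, 7, 5)
  # (17, 13, 7, 5)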

Perhaps there are fewer than 2^32 values, perhaps not. But let’s say there are. Now we know the segment lengths. How are the repetitions distributed? We know there are X total repetitions distributed amongst Y of the possible unique segment lengths that total up to Z of the complete resource. Let’s enumerate that! Why not? We’re getting sloppy already, why not another enumeration? OK, so now we know what percentage of all repetitions belongs to each unique segment length. Now, how many times does each unique segment length occur? How are the repetitions split within each segment length’s domain?
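
Just to show why each extra “let’s enumerate that too” layer hurts: even the single question of how X repetitions split across Y unique segment lengths (each getting at least one, which is my own reading) is a stars-and-bars count, and it grows absurdly fast.

  from math import comb

  def repetition_splits(total_reps, num_lengths):
      # compositions of total_reps into num_lengths positive parts: C(X-1, Y-1)
      return comb(total_reps - 1, num_lengths - 1)

  print(repetition_splits(100, 5))    # 3764376
  print(repetition_splits(1000, 12))  # on the order of 10**25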

Let’s make sure we know all the questions. Our method is divide and conquer: we will split the data into parts based on how it repeats (111000111000, etc.). There’s a small sketch of the bookkeeping after the list.

  • How many unique segment lengths are there amongst the patterns?
  • How many patterns are there?
  • How many of each pattern are one of the unique segments?
  • What are the unique segment lengths?
  • How many repetitions are there?
  • How many of the repetitions belong to each unique segment length?
  • How does each unique segment length distribute its repetitions?
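
Here is the bookkeeping sketch I promised. I’m reading a ‘pattern’ as a run of identical bits and a ‘segment length’ as that run’s length; that mapping is my assumption, and the dictionary keys are just my names for the questions.

  from itertools import groupby
  from collections import Counter
  from pprint import pprint

  def analyze(bits):
      runs = ["".join(g) for _, g in groupby(bits)]   # e.g. '111', '000', ...
      lengths = [len(r) for r in runs]
      length_counts = Counter(lengths)                # occurrences of each length
      return {
          "number_of_patterns": len(runs),
          "unique_segment_lengths": sorted(length_counts),
          "number_of_unique_segment_lengths": len(length_counts),
          "repetitions_per_unique_length": dict(length_counts),
      }

  pprint(analyze("111000111000"))
  # 4 patterns, one unique length (3), which occurs 4 times
  pprint(analyze("1110010000110"))
  # 6 patterns, unique lengths 1-4, with lengths 1 and 2 each occurring twice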

Then we have ordering. I had tried to split this up into different types of ordering. Here’s what we know.

  1. Each pattern starts with a different bit than the previous one ends with – that means if the previous pattern ended with 1, the current one starts with 0
    1. Example: 111100001010, 00011001100
    2. We can treat all patterns as 1/0, 1/1 || 0/0, 0/1
    3. All similarities can be combined (1/0, 1/1+0/0, 1/0 -> 1/0)
    4. This can be done any number of times
  2. Each pattern length is different from the next pattern length – this means that there will be a rise or a fall between each pattern
    1. There are cases where knowing the highs and lows may not tell us the complete pattern – High, Low, High, Low
    2. If this is done recursively for each high and low, the only way the complete order cannot be recovered is when two equal lengths are compared
  3. There is a minimum start point for each segment length
    1. Example: 5 segment A, 4 unique – A must surround unique
    2. Example: 5 segment B, 20 unique – B must start before 16
  4. There are special cases, such as
    1. Patterns next to each other have the same number of bits
    2. Patterns next to each other have the same number of 1s/0s
    3. Patterns next to each other have the same number of repetitions

These properties can filter the overall orderings significantly, but there is still a problem: coming up with algorithms to generate the possibilities. And in the end, enumeration becomes an issue again. The sketch below shows the kind of filtering I mean, using just property 2.
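
This toy example takes a multiset of segment lengths, throws away every ordering where two neighbours have the same length (property 2 above), and counts what survives. It ignores the other properties and is only meant to show the scale of the pruning.

  from itertools import permutations

  def filtered_orders(lengths):
      seen = set()
      for perm in permutations(lengths):
          if perm in seen:
              continue                     # skip duplicate orderings of equal lengths
          seen.add(perm)
          if all(a != b for a, b in zip(perm, perm[1:])):
              yield perm                   # no two neighbouring lengths are equal

  lengths = [3, 3, 5, 5, 7]
  total = len(set(permutations(lengths)))
  kept = list(filtered_orders(lengths))
  print(total, len(kept))   # 30 distinct orderings, 12 survive the filter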

Why is Enumeration Evil?

The reason it’s evil is simple: data is an enumeration of itself. There are 1000 possibilities for a number up to 1000, 10 possibilities for a number up to 10 and 9999 possibilities for a number up to 9999. Really, a file, or any piece of data, is just a really big number in base 256. As a result, any time we enumerate we are possibly creating a whole new file for this specific situation. It may be big, it may be small; we don’t know.
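
That “really big number” comment can be made literal in a couple of lines:

  data = b"hello world"
  as_number = int.from_bytes(data, "big")       # the file, read as one base-256 number
  print(as_number)                              # an 87-bit integer
  print(as_number.to_bytes(len(data), "big"))   # b'hello world' again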

How do we avoid Enumeration?

The way we avoid enumeration is by looking for ways to make patterns or rules within a piece of data. We see a series of bytes get repeated? We remove the series and add a pointer back to the first occurrence. This is at least the gzip method. Another possibility is finding positions in indefinite random sequences. It’s important to note that none of these random sequences should overlap. The issue with this approach is that we basically have to spin up a thread for each random seed until one of them gives us the number we want. The discovery process is terrible.
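
Here is a crude sketch of the “remove the series and point back to the first copy” idea. Real gzip/DEFLATE is far more involved (sliding windows, Huffman coding, clever match finding); the window size and minimum match length here are arbitrary numbers I picked.

  MIN_MATCH = 4      # don't bother with pointers for tiny matches
  WINDOW = 255       # how far back we are willing to look

  def naive_lz(data):
      out = []       # a mix of literal bytes and (distance, length) pointers
      i = 0
      while i < len(data):
          best_len, best_dist = 0, 0
          for j in range(max(0, i - WINDOW), i):
              length = 0
              while (i + length < len(data)
                     and data[j + length] == data[i + length]
                     and length < 255):
                  length += 1
              if length > best_len:
                  best_len, best_dist = length, i - j
          if best_len >= MIN_MATCH:
              out.append((best_dist, best_len))   # copy best_len bytes from best_dist back
              i += best_len
          else:
              out.append(data[i])                 # literal byte
              i += 1
      return out

  print(naive_lz(b"abcabcabcabcxyz"))
  # [97, 98, 99, (3, 9), 120, 121, 122]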

One idea I had was that the data can be broken up into two equations:

  • y = X%2
  • y = (X + 1)%2

From these two equations, we then figure out each start and stop:

  • X Offset
  • On, Off

We can then create two strings with the X offsets. However, this brings us to a list of on/offs. Each on/off takes up 16 bytes for the offset and is only streamable once we specify the length of the ‘on’ run and reach at least halfway. It’s quite possible that this would in fact be larger than the data itself.
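
A rough sketch of that start/stop encoding, costed at the 16 bytes per entry mentioned above, shows how badly it can lose on anything that flips often:

  from itertools import groupby

  BYTES_PER_ENTRY = 16   # the per-offset cost assumed above

  def run_starts(bits):
      entries, pos = [], 0
      for value, group in groupby(bits):
          entries.append((pos, value))    # (X offset, on/off)
          pos += sum(1 for _ in group)
      return entries

  bits = "10" * 32                        # 64 bits = 8 bytes of raw data
  entries = run_starts(bits)
  print(len(bits) // 8, "raw bytes vs", len(entries) * BYTES_PER_ENTRY, "encoded bytes")
  # 8 raw bytes vs 1024 encoded bytes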

Another form of enumeration avoidance is a different divide and conquer approach. Let’s say you take a file and split it up into bytes. You then remove all the zeroes and mark where they were as a separate bit pattern (100111001100, where 1 means there was a zero and 0 means the byte comes from the original list). If there are 100 zeroes in a 200 byte file, the file has effectively been reduced to 100 + Ceil(200/8), or 125 bytes. These are significant gains, and they only increase as more zeroes are found. Except when there are only 8 or 16 zeroes, the gains are minuscule. What if the entire file was split up this way? Well, then we have to mark how long each of the patterns is going to be (probably with a 64 bit integer) and then we have to do it for all 256 byte values. For a file that is 200 bytes, the result will be something like 8 + 256 * 200/8, or 8 + 256 * 25 = 6408 bytes. Basically, we are effectively increasing the overall size of the document. And if we were to do it only for the most common byte values, then in situations where all the bytes occur equally no compression could be done.
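
And a sketch of the zero-stripping idea, using the same 200-byte / 100-zero numbers as above (one bit per original byte, which is my own choice of encoding):

  import math

  def strip_zeros(data):
      bitmap = bytearray(math.ceil(len(data) / 8))
      kept = bytearray()
      for i, b in enumerate(data):
          if b == 0:
              bitmap[i // 8] |= 1 << (i % 8)   # 1 = "a zero byte lived here"
          else:
              kept.append(b)
      return bytes(bitmap), bytes(kept)

  data = bytes([0] * 100 + list(range(1, 101)))    # 200 bytes, 100 of them zero
  bitmap, kept = strip_zeros(data)
  print(len(data), "->", len(kept) + len(bitmap))  # 200 -> 125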

Moving Forward

Ultimately, here’s where I stand on compression:

  • Describing data through properties is fine for verifying it, but trying to compress it with properties, while cool, is ultimately a red herring.
  • Trying to force a file to fit within a constraint leads to ‘maybe it compresses’ type situations.
  • Trying to match a file up to a random number requires multiple threads to be created.
  • Trying to match a sequence up with something that has happened previously requires big storage.

Compression isn’t simple but it sure is a lot of fun!