Failure and Compression

The last few days I was hammering away at some interesting ideas about turning any file into a 16-character hexadecimal string. The idea came about when I saw a post on Hacker News about distributed systems and URL standards for them. When it comes to distributed systems and sharing of files, I’m of this opinion:

  • You want the requester able to get chunks from multiple computers and consolidate them into one – This also means that all the chunks know they are part of a whole and are able to point to the whole
  • You want the requester to know what’s coming to them – This can be metrics about the file like length, checksum, hash, etc.
  • Security is secondary to ensuring the system works without hiccups. Secure requests should be a matter of encrypted messages with keys that are already available to each computer.

I want to take the second one a step further. I want to make the URL so detailed that the computer can actually recreate the file from that tiny little hash! Impossible? Never! Here is the product of my work: https://github.com/formula1/compression-failures

Technique 1: File attributes and filtering possibilities

So the first thing I went after was attempting to turn any file into a few 32-bit integers. This included obvious things like file length, to ensure I have a finite number of possibilities, and checksums, to rule out extremes and limit the possibilities even more. From there it’s kind of free shooting. I figured my best next step was to separate the file out into what I will call ‘patterns’. Patterns work like so:

  • A pattern has a segment length – 010 has a length of 1, 001100 a length of 2, 000111 a length of 3, 1111 of 4, 00000 of 5, etc.
  • A pattern has a repetition count – 010 has 3 repetitions, 001100 also has 3, 000111 has 2, 1111 has 1 and 00000 has 1
  • A pattern must also either start with a 1 or a 0
  • A pattern ends when what would be the next repetition either ends early or goes on for too long.
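
As a sketch of these rules (my own code, not from the repo), splitting a bit string into runs of identical bits and then grouping equal-length runs reproduces the examples above:

```javascript
// Split a bit string into patterns: maximal groups of consecutive
// equal-length runs. Names are mine, not from the repo.
function patterns(bits) {
  // First split into runs of identical bits: "001100" -> [2, 2, 2]
  const runs = [];
  let count = 1;
  for (let i = 1; i <= bits.length; i++) {
    if (bits[i] === bits[i - 1]) count++;
    else { runs.push(count); count = 1; }
  }
  // Then group consecutive runs of equal length into one pattern each.
  const out = [];
  let pos = 0;
  for (const len of runs) {
    const last = out[out.length - 1];
    if (last && last.segmentLength === len) last.repetitions++;
    else out.push({ segmentLength: len, repetitions: 1, startsWith: bits[pos] });
    pos += len;
  }
  return out;
}

console.log(patterns("001100")); // one pattern: segment length 2, 3 repetitions
console.log(patterns("000111")); // one pattern: segment length 3, 2 repetitions
```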

Splitting the file up into patterns sounds like a swell idea, right? Well, with the 64 KB file I used, it had around 10,000 patterns. More interestingly, the segment lengths maxed out at 9 and the repetitions at 7. These two attributes filter the number of possibilities even further (though I did not mathematically calculate by how much). Ideally, each of these attributes would get me closer and closer to finally pinning down what the file actually contains. However, I realized this isn’t going to be nearly as simple as I thought.

  • given
    • 00110011
    • 11001100
  • can you tell the difference between them by length, checksum, unique segment lengths, unique repetition counts, total repetitions and the starts-with-1 flag? yes you can!
    • 8, 4, 1, 1, 4, 0
    • 8, 4, 1, 1, 4, 1
  • what about
    • 010011000111
    • 001101000111
  • You cannot
    • 12, 6, 3, 1, 6, 0
    • 12, 6, 3, 1, 6, 0
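
The six attributes can be sketched in the same order as above – length, checksum (count of 1 bits), unique segment lengths, unique repetition counts, total repetitions, starts-with-1 – and the collision falls out immediately (function and attribute names are mine):

```javascript
// Compute the six-attribute signature of a bit string.
function signature(bits) {
  // Run lengths: "001101000111" -> [2, 2, 1, 1, 3, 3]
  const runs = [];
  let count = 1;
  for (let i = 1; i <= bits.length; i++) {
    if (bits[i] === bits[i - 1]) count++;
    else { runs.push(count); count = 1; }
  }
  // Patterns: consecutive runs of equal length, e.g. [2, 2, 1, 1, 3, 3]
  // -> segment lengths [2, 1, 3] with 2 repetitions each.
  const lengths = [];
  const reps = [];
  for (const len of runs) {
    if (lengths[lengths.length - 1] === len) reps[reps.length - 1]++;
    else { lengths.push(len); reps.push(1); }
  }
  return [
    bits.length,                                // length
    [...bits].filter((b) => b === "1").length,  // checksum (count of 1s)
    new Set(lengths).size,                      // unique segment lengths
    new Set(reps).size,                         // unique repetition counts
    reps.reduce((a, b) => a + b, 0),            // total repetitions
    bits[0] === "1" ? 1 : 0,                    // starts with 1
  ];
}

console.log(signature("00110011"));     // [8, 4, 1, 1, 4, 0]
console.log(signature("010011000111")); // [12, 6, 3, 1, 6, 0]
console.log(signature("001101000111")); // [12, 6, 3, 1, 6, 0] -- collision!
```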

So I had this brilliant idea! What if I could figure out the order? Not the actual counts, just the order from largest to least. To do this I would check the difference between each pair of adjacent patterns: if the segment length increased, record a 1; if it decreased, a 0. What could go wrong?

  • given
    • 010011000111
    • 001101000111
  • You can!
    • 12, 6, 3, 1, 6, 0, 11
    • 12, 6, 3, 1, 6, 0, 01
  • What about
    • 001101000111
    • 000111010011
  • You cannot
    • 12, 6, 3, 1, 6, 0, 01
    • 12, 6, 3, 1, 6, 0, 01
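
A sketch of the direction bits, assuming the same run-grouping as before (again my own naming – one bit per adjacent pattern pair, 1 if the segment length grew, 0 if it shrank):

```javascript
// Segment length of each pattern, in order of appearance.
function patternLengths(bits) {
  // Run lengths: "001101000111" -> [2, 2, 1, 1, 3, 3]
  const runs = [];
  let count = 1;
  for (let i = 1; i <= bits.length; i++) {
    if (bits[i] === bits[i - 1]) count++;
    else { runs.push(count); count = 1; }
  }
  // Collapse consecutive equal runs into one pattern length each.
  const lengths = [];
  for (const len of runs) {
    if (lengths[lengths.length - 1] !== len) lengths.push(len);
  }
  return lengths;
}

// One bit per adjacent pair: 1 if segment length increased, 0 if it decreased.
function directionBits(bits) {
  const lengths = patternLengths(bits);
  let out = "";
  for (let i = 1; i < lengths.length; i++) {
    out += lengths[i] > lengths[i - 1] ? "1" : "0";
  }
  return out;
}

console.log(directionBits("010011000111")); // "11"
console.log(directionBits("001101000111")); // "01"
console.log(directionBits("000111010011")); // "01" -- same as the line above
```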

How about if I get the order of the highest and lowest? Technically I can do this indefinitely until I order them completely!

  • given
    • 001101000111
    • 000111010011
  • You can!
    • 12, 6, 3, 1, 6, 0, 01, 1
    • 12, 6, 3, 1, 6, 0, 01, 0
  • what about
    • 0000011111010011
    • 0000111101000111
  • You cannot
    • 16, 8, 3, 1, 6, 0, 01, 0
    • 16, 8, 3, 1, 6, 0, 01, 0
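
The examples are consistent with reading the extra bit as “does the longest segment come after the shortest?” – that exact rule is my guess, but it reproduces all four rows:

```javascript
// Segment length of each pattern, in order of appearance.
function patternLengths(bits) {
  const runs = [];
  let count = 1;
  for (let i = 1; i <= bits.length; i++) {
    if (bits[i] === bits[i - 1]) count++;
    else { runs.push(count); count = 1; }
  }
  const lengths = [];
  for (const len of runs) {
    if (lengths[lengths.length - 1] !== len) lengths.push(len);
  }
  return lengths;
}

// 1 if the longest segment length appears after the shortest, else 0.
function extremeOrderBit(bits) {
  const lengths = patternLengths(bits);
  const hi = lengths.indexOf(Math.max(...lengths));
  const lo = lengths.indexOf(Math.min(...lengths));
  return hi > lo ? "1" : "0";
}

console.log(extremeOrderBit("001101000111"));     // "1"
console.log(extremeOrderBit("000111010011"));     // "0"
console.log(extremeOrderBit("0000011111010011")); // "0"
console.log(extremeOrderBit("0000111101000111")); // "0" -- still a collision
```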

I would need to be able to point out exactly how much is in each of those highest amounts. Perhaps I can count the number of unique differences? Nope, because both are unique. This was my snag. I still don’t know how to handle it. I am also fully aware that the ordering may actually be far more data than I anticipate.

Technique 2: Prime numbers!! 😀

:C So the way I planned to do it was like so:

  • Turn the file into a gigantic number
  • Boil the number down to prime factors and exponents
  • If a prime factor is too big, represent it as the ‘nth prime’ instead
  • If the number of factors gets too long
    • find the closest prime number below the gigantic number
    • subtract the prime from the gigantic number
    • do it again with the leftover

What I would be left with is a sequence like this

mul(2,3,nth(7),add(nth(13), mul(5,7)))

Looks great, right? I thought so too! I deliberately designed the syntax so I can format anything into 4-bit operations. Unfortunately, I didn’t realize how slow prime number generation could really be. I ended up creating my own generator because I needed to constantly stream primes and/or hold a binary list. However, the problem with prime generation is that it always needs the previous primes to go forward. Finding a prime is actually about finding what isn’t a prime and collecting the leftovers. This sounds straightforward, but it’s actually pretty ugly and leaves me waiting for minutes at a time to handle 24-bit primes.
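
That expression can be evaluated with plain functions standing in for the syntax; here `nth(k)` is assumed to be 1-indexed, which the post doesn’t specify:

```javascript
// k-th prime by trial division against the primes found so far.
// Slow, but fine for small k.
function nthPrime(n) {
  const primes = [];
  for (let candidate = 2; primes.length < n; candidate++) {
    if (primes.every((p) => candidate % p !== 0)) primes.push(candidate);
  }
  return primes[n - 1];
}

// The three operations of the syntax, as ordinary functions.
const mul = (...xs) => xs.reduce((a, b) => a * b, 1);
const add = (...xs) => xs.reduce((a, b) => a + b, 0);
const nth = nthPrime;

// mul(2,3,nth(7),add(nth(13), mul(5,7)))
// nth(7) = 17, nth(13) = 41, so this is 2 * 3 * 17 * (41 + 35) = 7752
console.log(mul(2, 3, nth(7), add(nth(13), mul(5, 7)))); // 7752
```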

  • stack = [] // sorted by value; values kept unique
  • i = 2
  • do
    • yield i // i was never marked as composite, so it is prime
    • insertSorted(stack, { value: i + i, jump: i })
    • while(stack[0].value == ++i)
      • // i is composite: advance the entry that marked it
      • var entry = stack.shift()
      • var value = entry.value
      • do
        • value += entry.jump
      • while(contains(stack, value)) // skip values another prime already marked
      • insertSorted(stack, { value: value, jump: entry.jump })
  • while(true)
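
The pseudocode above, made runnable as a JavaScript generator. A plain sorted array stands in for a real priority queue, so inserts are linear; the binary-list idea would be the real fix:

```javascript
// Incremental sieve: each stack entry marks the next known composite
// for one prime ("jump"). Whatever i is never marked is prime.
function* primes() {
  const stack = []; // kept sorted by .value, values unique

  const insertSorted = (entry) => {
    // Advance past values already present so every value appears once.
    while (stack.some((e) => e.value === entry.value)) entry.value += entry.jump;
    let idx = stack.findIndex((e) => e.value > entry.value);
    if (idx === -1) idx = stack.length;
    stack.splice(idx, 0, entry);
  };

  let i = 2;
  while (true) {
    yield i; // i reached without being marked: prime
    insertSorted({ value: i + i, jump: i });
    while (stack.length && stack[0].value === ++i) {
      // i is composite: advance the entry that marked it.
      const entry = stack.shift();
      entry.value += entry.jump;
      insertSorted(entry);
    }
  }
}

const gen = primes();
const first10 = Array.from({ length: 10 }, () => gen.next().value);
console.log(first10); // [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```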

This generates primes fast, don’t get me wrong. I was surprised at myself at how well it worked. Truly. Not only that, I can take full credit (with inspiration from some sieves). But really it doesn’t create primes so much as find what isn’t prime. If it can’t handle 24 bits, what’s to say it can handle 10 bytes or even 1,000 bytes?

When I was writing the readme I decided I would write a bit about making workers and making it threaded. This is kind of a neat idea, but it’s still not perfect, considering each worker must still wait for the primes from the workers before it. As we get into huge numbers that matters less, because 2 * current may often be several workers later.

Another concept is using a less absolute source of primes, like Mersenne primes. These are easy to calculate and interact well with logs, so it’s possible I could speed up the algorithm to a huge degree. Instead of testing primality directly, I’d check how far a number is from the next power of 2: if the distance is 1, I consider it a Mersenne candidate; otherwise, I take the Mersenne prime below it, multiply it a few times, subtract the total, and do it again with the leftovers. This seems just as good, but prime numbers are pretty special: not every 2^n − 1 is actually prime (2^11 − 1 = 2047 = 23 × 89), so as good as Mersenne primes are, they will probably not always be good enough for my purposes.
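
To make the Mersenne idea concrete, here is a hypothetical greedy decomposition over a short hard-coded list of known Mersenne primes; the strategy and the names are my sketch of the idea, not code from the repo:

```javascript
// Known Mersenne primes 2^p - 1 for p = 2, 3, 5, 7, 13, 17, 19, 31.
const MERSENNE = [3n, 7n, 31n, 127n, 8191n, 131071n, 524287n, 2147483647n];

// Greedily peel off the largest Mersenne prime that fits, as many times
// as it fits, then continue with the remainder. Returns [base, count] pairs.
function decompose(n) {
  const terms = [];
  for (let i = MERSENNE.length - 1; i >= 0; i--) {
    const m = MERSENNE[i];
    if (n >= m) {
      const count = n / m; // BigInt division truncates
      terms.push([m, count]);
      n -= m * count;
    }
  }
  if (n > 0n) terms.push([n, 1n]); // leftover smaller than 3: a 1 or a 2
  return terms;
}

// Reconstruct to check: the sum of base * count gives the original back.
const terms = decompose(1000000n);
console.log(terms.map(([m, c]) => `${m}*${c}`).join(" + "));
console.log(terms.reduce((sum, [m, c]) => sum + m * c, 0n)); // 1000000n
```

The greedy split keeps the term list short for numbers near a big Mersenne prime, but it says nothing about primality of the remainder, which is exactly the “not always good enough” caveat above.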

What can I say about it?

What I love about working on projects like these is that I tend to want to end them. Usually I want to come to some conclusion like “it’s impossible” or “way too slow”, but I always find a way to make it work. From there it’s just about implementing it or ignoring it. A project like this has huge uses, but many of the uses I don’t have direct interaction with – only potential. I do believe growing myself is important, but I don’t want to lose it over a cute little experiment with big potential.