Richard Robertson Posted August 23, 2010
I think the biggest problem is that you are thinking in terms of values that aren't strictly a (multi)byte format. When it comes down to it, you're trying to eliminate bytes, right? So working in any format other than bytes is going to get messy.
czardas Posted August 23, 2010 (edited)
Messy? Er... that is something I can't deny. It might also not work at all, but without trying I won't know the answer, unless of course I find an obvious refutation. Here's how it might work (or not). Say this is the start of some binary data:

1011 1111 1010 0010 1111 1101 0111 1110 1011 1111 0111

I may find multiple repetitions of a long section (scaled down here for simplicity):

101(1 1111 101)0 0010 (1111 1101) 0(111 1110 1)0(11 1111 01)11

A = 11111101

So we can create a new representation:

101 A 00010 A 0 A 0 A 11

In this case there won't be any compression, but imagine that such patterns happen to occur over much longer sections when searched for in a different base. I don't know what possibilities will arise. It may be highly unpredictable.
Edited August 23, 2010 by czardas
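As a minimal sketch of the substitution described above (a Python illustration, not code from the thread), the repeated 8-bit pattern is pulled out and every occurrence replaced with a marker. As the post notes, the marker itself still has to be encoded somehow, so a tiny example like this saves nothing; the hope is that much longer repeats would.

```python
# The 44 bits from the post and the repeated pattern czardas labels A.
bits = "10111111101000101111110101111110101111110111"
A = "11111101"

# Replace every (non-overlapping) occurrence of the pattern with a marker.
encoded = bits.replace(A, " A ")
print(encoded)                      # 101 A 00010 A 0 A 0 A 11

# Decoding is just the reverse substitution, so the step is lossless.
decoded = encoded.replace(" A ", A)
assert decoded == bits
```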
Richard Robertson Posted August 23, 2010 (edited)
Searching for the matched patterns requires a lot of processing time. As your window gets bigger, it'll take longer and longer. For instance, you would need to make 666 (scary coincidence) checks in "1011 1111 1010 0010 1111 1101 0111 1110 1011 1111 0111" using one-byte-wide blocks. The first 36 checks would compare "1011 1111" to every later window, starting at every bit. The next 35 would compare "011 1111 1" to every window after it, and so on, down to the last check, which would compare the second-to-last window to the last byte. 666 checks for five and a half bytes isn't going to scale well.
Edited August 23, 2010 by Richard Robertson
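To get a concrete feel for how quickly those checks add up, here is a small Python sketch (an illustration, not code from the thread) that counts every pairwise comparison of 8-bit windows in the 44-bit example. The count grows roughly with the square of the input length for a fixed window, and gets worse again if the window length is also varied.

```python
# Count the brute-force comparisons: every 8-bit window against every
# later 8-bit window, starting at every bit position.
bits = "10111111101000101111110101111110101111110111"   # 44 bits
width = 8

windows = [bits[i:i + width] for i in range(len(bits) - width + 1)]

checks = 0
for i in range(len(windows)):
    for j in range(i + 1, len(windows)):
        checks += 1                  # one comparison per pair of windows
        _ = (windows[i] == windows[j])

print(len(windows), checks)          # 37 windows, 666 checks
```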
czardas Posted August 23, 2010
I see the point that you are making. Designing an efficient search algorithm represents a challenge, and that part is confusing me. I think some form of guesswork will be required: only searching a fraction of the possible patterns and then guessing again. Or some kind of fuzzy logic - whatever that means?
Richard Robertson Posted August 23, 2010
Fuzzy logic means that your values are "true", "false" and "maybe".
czardas Posted August 23, 2010 (edited)
I get the true and false part. The other part is still a bit fuzzy. Richard, your input is much appreciated.
Edited August 23, 2010 by czardas
Drifter (Author) Posted September 13, 2010
cameronsdad: Still willing to assist here? I have started university, so unfortunately I can't spend as much time here as I'd like, but I might take all of what you said to a professor if I get the time. I did talk to a professor who works with compression, and they said you can't obtain a much better system than the common algorithms already in use unless you are willing to throw a lot of extra computation at it, in which case you might be able to squeeze out another 1% compression. I do wonder if there really is any new, good way to compress.
seandisanti Posted September 13, 2010
Yes, still willing to help; I've actually just finished a couple of big projects at work that were keeping me pretty busy. As for the assertion that compression is as good as it's going to get, that's just silly. I'm not claiming that I have any profound insight on how to accomplish said goal, BUT every new way to do something was thought impossible or not worth trying until someone actually did it, and then it just became the way.
Drifter (Author) Posted September 13, 2010
I agree with that one, but I do think that any further compression will come at the cost of processing time, and since hard drives can now hold 1 TB and are cheap, processing time is viewed as more valuable. So we are possibly going against the grain of what most compression aims for (something fast). However, I will again add that MAYBE this isn't entirely true either; I'm still interested in what your method is. Unfortunately I still have trouble understanding what you've posted so far on the issue.
seandisanti Posted September 13, 2010
Perhaps a better way to describe what I was talking about before is as a simple cipher. Think of a substitution cipher, where typically you'll have one or more replacement letters for each plain (uncompressed) letter - except that instead of increasing the length of the replacements, you're increasing the length of the plaintext segments that each replacement substitutes for. Maybe that doesn't clear it up much... Look at the ASCII characters: each is represented by a code, a direct 1-to-1 relationship. What if, instead of having one ASCII code per letter, you had one code for each 3-character permutation of letters? Then the longer the file, the more it would benefit from switching to that scheme. Now, assuming you're still with me to this point: because text varies so much, 3 may not be the optimal number of characters for any given text; maybe a book can be broken up into fewer unique 4-character combinations than 3-character ones, or 5 characters, and so on. Using regular expressions, or any other set-based algorithm, finding the ideal chunk length and mapping out the replacements should be trivial even for large files.
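As a rough sketch of the chunk-substitution scheme seandisanti describes (one possible reading of it in Python; the chunk length, the one-byte codes, and the sample text are assumptions for illustration), the text is split into fixed-length chunks and each distinct chunk is assigned a one-byte code. The table mapping codes back to chunks has to be stored alongside the output for decoding.

```python
# Break the text into fixed-length chunks and replace each distinct chunk
# with a one-byte code. Works only while there are at most 256 distinct
# chunks; the code table must be kept to reverse the substitution.
def chunk_encode(text, chunk_len):
    chunks = [text[i:i + chunk_len] for i in range(0, len(text), chunk_len)]
    table = {}                                   # chunk -> one-byte code
    for c in chunks:
        if c not in table:
            table[c] = len(table)
    if len(table) > 256:
        raise ValueError("too many distinct chunks for one-byte codes")
    encoded = bytes(table[c] for c in chunks)
    return encoded, {v: k for k, v in table.items()}

def chunk_decode(encoded, reverse_table):
    return "".join(reverse_table[b] for b in encoded)

text = "the cat sat on the mat the cat sat on the mat "
encoded, table = chunk_encode(text, 3)
assert chunk_decode(encoded, table) == text
print(len(text), "->", len(encoded), "bytes, plus the table")
```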
Drifter (Author) Posted September 13, 2010
I think I do understand the basic idea of what you are doing, then; it's just the details now. Thank you for your effort in getting it across.
seandisanti Posted September 13, 2010
Perhaps an example is in order. Figure that in the English alphabet, ignoring case and whitespace, you've got 26 (1A in hex) different letters. There are only 676 (2A4 in hex) possible 2-character combinations, though any given file is probably going to contain fewer than 255 (FF in hex) different ones. If you find a chunk length for which the file breaks into 255 or fewer unique chunks (the fewer the better), then replacing the chunks with the numbers that represent them would save you space and give you lossless compression, because you'd be storing 3+ characters in 2 characters' worth of space. I fear I may have started repeating myself a bit now; did I add ANY clarity to the idea?
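Continuing the earlier sketch (again my own illustration, not code from the thread), one hedged way to hunt for the chunk length the post mentions is to scan candidate lengths and keep the longest one that still yields few enough distinct chunks to fit in one-byte codes. A byte actually allows 256 values; the post's figure of 255 simply leaves one code spare.

```python
# Find the longest chunk length whose distinct-chunk count fits in one byte.
def best_chunk_len(text, max_codes=256, max_len=16):
    best = None
    for length in range(2, max_len + 1):
        chunks = {text[i:i + length] for i in range(0, len(text), length)}
        if len(chunks) <= max_codes:
            best = length             # longer qualifying lengths win
    return best

sample = "banana bandana banana bandana " * 10
print(best_chunk_len(sample))         # 16 here, because the sample is so
                                      # repetitive; real text is far less forgiving
```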
Drifter (Author) Posted September 13, 2010
A little, in that there most likely won't be every pattern that is theoretically possible. This is becoming easier to grasp now.
seandisanti Posted September 13, 2010
Exactly. So by replacing the patterns with single- or double-character replacements, the longer the patterns or chunks that we can replace, the greater the compression ratio.
Drifter (Author) Posted September 14, 2010
This is correct. I learned this through a compression method I just tried involving an English dictionary for replacement. Since my method uses at most 3 characters to represent every word in my list, the best compression comes from the longer words.
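A minimal sketch along the lines Drifter describes (the word list and the 3-character codes below are invented for illustration; his actual dictionary and code format aren't shown in the thread): each dictionary word maps to a short code, so long words compress the most and words of three letters or fewer gain nothing.

```python
# Stand-in dictionary: every word gets a fixed 3-character code.
words = ["the", "compression", "dictionary", "method", "represents"]
codes = {w: format(i, "03d") for i, w in enumerate(words)}     # "000", "001", ...

def encode(text):
    # Words not in the dictionary pass through unchanged.
    # (A real version would also need to escape words that look like codes.)
    return " ".join(codes.get(w, w) for w in text.split())

def decode(text):
    back = {v: k for k, v in codes.items()}
    return " ".join(back.get(w, w) for w in text.split())

msg = "the compression method uses a dictionary"
packed = encode(msg)
print(packed)                      # 000 001 003 uses a 002
assert decode(packed) == msg
```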