how does compression work? why does a compressed JPEG look different, but a JPEG in a compressed zip file looks the same? (long, serious) 

there are many different methods of file compression. one of the simplest methods is run length encoding (RLE). the idea is simple: say you have a file like this:
you could store it as:
to represent that there are 3 a's, 4 b's, etc. you would then simply need a program to reverse the process - decompression. this is seldom used, however, as it had a fairly major flaw:
which is twice as large! RLE compression is best used for data with lots of repetition, and will generally make more random files *larger* rather than smaller.

a more complex method is to make a "dictionary" of things contained in the file and use that to make things smaller. for example, you could replace all occurrences of "the united states of america" with "🇺🇲", and then state that "🇺🇲" refers to "the united states of america" in the dictionary. this would allow you to save a (relatively) huge amount of space if the full phrase appears dozens of times.

what i've been talking about so far are lossless compression methods. the file is exactly the same after being compressed and decompressed. lossy compression, on the other hand, allows for some data loss. this would be unacceptable in, for example, a computer program or a text file, because you can't just remove a chunk of text or code and approximate what used to be there. you can, however, do this with an image. this is how JPEGs work. the compression is lossy, which means that some data is removed. this is relatively imperceptible at higher quality settings, but becomes more obvious the more you sacrifice quality for size. PNG files are (almost always) lossless, however. your phone camera takes photos in JPEG instead of PNG, though, because even though some quality is lost, a photo stored as a PNG would be much, much larger.

some examples of file formats that typically use lossy compression are JPEG, MP4, MP3, OGG, and AVI. some examples of lossless compression formats are FLAC, PNG, ZIP, RAR, and ALAC. some examples of lossless, uncompressed files are WAV, TXT, JS, BMP, and TAR. in terms of file size, you'll always find that lossy files are smaller than the lossless files they were created from (unless it's an horrendously inefficient compression format), and that losslessly compressed files are smaller than uncompressed ones.

you'll find that putting a long text file in a zip makes it much smaller, but putting an MP3 in a zip has a much less major effect. this is because MP3 files are already compressed quite efficiently, and there's not really much that a lossless algorithm can do.

there are benefits to all three types of formats. lossily compressed files are much smaller, losslessly compressed files are perfectly true to the original sound/image/etc while being much smaller, and uncompressed data is very easy for computers to work with, as they don't have to apply any decompression or compression algorithms to it. this is (partly) why BMP and WAV files still exist, despite PNG and FLAC being much more efficient.

as an example of how dramatic these differences often are, i looked at the file sizes for the downloadable version of master boot record's album "internet protocol" in three formats: WAV, FLAC, and MP3. you can see that the file size (shown in megabytes) is nearly 90 megabytes smaller with the FLAC version, and the MP3 version is only ~13% of the size of the WAV version. note that these downloads are in ZIP format - the WAV files would be even larger than shown here. this is not representative of all compression algorithms, nor is it representative of all music - this is just an illustrative example. TV static in particular compresses very poorly, because it's so random, which makes it hard for algorithms to find patterns. watch a youtube video of TV static to see this in effect - you'll notice obvious "block" shapes and blurriness that shouldn't be there as the algorithm struggles to do its job. the compression on youtube is particularly strong to ensure that the servers can keep up with the enormous demand, but not so much so that videos become blurry, unwatchable messes.

i actually made a slightly improved RLE algorithm for use in a video game once - i could store the tile data much more efficiently by specifying that there were, say, ten trees in a row rather than listing "tree" ten times. i got around the a1b1c1d1e1 problem by making it possible to write something like "abcd2e2". if there was no number present, the algorithm assumed 1.

the game was never finished (never anywhere close!). honestly, i had a lot more fun working on the code than i did the actual game...

also i skipped a few details in this post because compression gets really complex as you move onto more sophisticated algorithms, so sorry about that 0u0;

how does compression work? why does a compressed JPEG look different, but a JPEG in a compressed zip file looks the same? (long, serious) 

@lynnesbian Nice explanation!

I just want to add some videos for those who want to go further:
Lossless text compression:
Lossy video compression (and where it fails):

@anathem thank you! tom scott's videos are always fantastic. i got the idea to mention youtube's inability to properly handle static from that second video, actually!

Sign in to participate in the conversation
Lynnestodon's anti-chud pro-skub instance for funtimes