
Still useful today.

Trying to transmit a 100 GB file through any service is usually a pain, especially if one end has an unstable Internet connection.



That's a very bad way of solving that issue. If transmission is a problem, either use a proper retry-friendly protocol (such as BitTorrent) or split the file. Using hacks in the data format just leads to additional pain.


> or split the file

Wait, I'm confused. Isn't this what OP was talking about?


Splitting the file doesn’t need to be part of the file format itself. I could split a file into N parts, then concatenate the parts together at a later time, regardless of what is actually in the file.

The OP was saying that zip files can specify their own special type of splitting, done within the format itself, rather than operating on the raw bytes of a saved file.


> Splitting the file doesn’t need to be part of the file format itself. I could split a file into N parts, then concatenate the parts together at a later time, regardless of what is actually in the file.

I'm inclined to agree with you.

You can see good examples of this with the various multipart upload APIs used by cloud object storage platforms like S3. There's nothing particularly fancy about it: each part is individually retryable, with checksumming of the parts and the whole, so you get a nice, reliable approach.

On the *nix side, you can run split over a file to the desired part size and later cat all the parts back together; it's about as simple as it gets. It would also be easy to build a CLI or full UI tool that pauses between `cat`s while you swap media in and out, if we hark back to the days of zip archives spanning floppy disks.
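
A rough sketch of that workflow (file names and sizes here are purely illustrative):

    # record a checksum, then cut the archive into 100 MB pieces
    sha256sum big.zip > big.zip.sha256
    split -b 100m big.zip big.zip.part.

    # later: reassemble in glob (lexicographic) order and verify
    cat big.zip.part.* > big.zip
    sha256sum -c big.zip.sha256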


Without knowing the specifics of what's being talked about, I guess it makes sense that zip did that, because the OS doesn't make it easy for the average user to concatenate files, and it would be hard to concatenate 10+ files in the right order. If you have to use a CLI, then it's not really a solution for most people, nor is it something I want to have to do anyway.

An OS-level solution might be a naming convention like "{filename}.{ext}.{n}", e.g. "videos.zip.1", where you right-click a part, choose "concatenate {n} files", and it turns them into "{filename}.{ext}".


> the OS doesn't make it easy for the average user to concatenate files

Bwah! You are probably thinking too much GUI.

    X301 c:\Users\justsomehnguy>copy /?
    Copies one or more files to another location.

    COPY [/D] [/V] [/N] [/Y | /-Y] [/Z] [/L] [/A | /B ] source [/A | /B]
         [+ source [/A | /B] [+ ...]] [destination [/A | /B]]

    [skipped]

    To append files, specify a single file for destination, but multiple files
    for source (using wildcards or file1+file2+file3 format).
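
So reassembling some hypothetical parts in binary mode would be, for example:

    copy /b video.zip.001+video.zip.002+video.zip.003 video.zip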


Try concatenating 1000 files with naturally sorted names using `copy`. I did this regularly, and I had to write a Python script to make it easier.

It's much easier to just right-click any of the zip part files and let 7-Zip unzip it; it will tell me if any part is missing or corrupt.
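
For what it's worth, all such a script really has to do is natural-sort the names and concatenate them; a rough shell equivalent of the same idea, assuming GNU sort's -V natural ordering is available (part names hypothetical):

    # sort -V puts part.2 before part.10
    ls video.zip.part.* | sort -V | xargs cat > video.zip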


Why would you use manual tools to achieve what the ZIP format can give you out of the box? E.g. if you do this manually, you'd need to worry about file checksums to ensure you put it back together correctly.


Because, as said before, zip managing splits ends up with two sources of truth in the file format that can differ while the whole file is still valid.


I've had good luck piping large files, such as ZFS snapshots, through tools like `mbuffer` [1]; it worked like a charm.

[1] https://man.freebsd.org/cgi/man.cgi?query=mbuffer&sektion=1&...
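
For example, something along these lines (host names, dataset names, and buffer size are just placeholders):

    # buffer the send stream in 1 GB of RAM to smooth out stalls on either end
    zfs send tank/data@snap | mbuffer -m 1G | ssh backuphost "zfs receive tank/backup/data"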


couldn't agree more!

We need to separate concerns and design modules to be as unitary as possible:

- zip should ARCHIVE/COMPRESS, i.e. reduce the file size and create a single file from the file system point of view. The complexity should go in the compression algorithm.

- Sharding/sending multiple coherent pieces of the same file (zip or not) is a different module and should be handled by specialized and agnostic protocols that do this like the ones you mentioned.

People are always building tools that handle two or more use cases instead of following the UNIX principle of creating generic, good single respectability tools that can be combined together (thus allowing a 'whitelist' of combinations which is safe). Quite frankly, it's annoying, and it very often leads to issues such as this one that weren't even thought of in the original design, because of the exponential problem of combining tools together.


Well, (1) is zip with compression into a single file, and (2) is zip without compression into multiple files. You can also combine the two. And in all cases, you need a container format.

The tasks are related enough that I don't really see the problem here.


I meant that they should be separate tools that can be piped together. For example, say you have one directory of many files (1 GB in total):

`zip -r out.zip dir/`

This results in a single out.zip file that is, let's say, 500 MB (1:2 compression).

If you want to shard it, you have a separate tool, let's call it `shard`, that works on any type of byte stream:

`shard -I out.zip -O out_shards/ --shard_size 100Mb`

This results in `out_shards/1.shard, ..., out_shards/5.shard`, each 100 MB.

And then you have the opposite: `unshard` (back into 1 zip file) and `unzip`.

No need for 'sharding' to exist as a feature in the zip utility.

And... if you want only the shards from the get-go, without the intermediate single-file archive, you can do something like:

`zip -r - dir/ | shard -O out_shards/`

Now these can be copied to floppy disks (as discussed above), sent over the network, etc. The main thing here is that the sharding tool works on bytes only (it doesn't know whether it's an mp4 file, a zip file, a txt file, etc.) and does no compression, while the zip tool does no sharding and focuses on compression.


In Unix, that is split https://en.wikipedia.org/wiki/Split_(Unix) (and its companion, cat).

The problem is that DOS (and Windows) didn't have the Unix philosophy of a tool that does one thing well, and you couldn't depend on the necessary small tools being available. Thus, each compression tool included its own file-spanning system.

https://en.wikipedia.org/wiki/File_spanning
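
The pipeline sketched above maps onto split/cat almost one-to-one, e.g. (assuming Info-ZIP's zip, which writes the archive to stdout when the archive name is '-'; sizes and paths illustrative):

    # compress a directory and cut the stream into 100 MB pieces on the fly
    mkdir -p out_shards
    zip -r - dir/ | split -b 100m - out_shards/part.

    # later, put the stream back together
    cat out_shards/part.* > out.zip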


The key thing you get by integrating the two tools is the ability to extract a single file from a multipart archive more easily: instead of having to reconstruct the entire file, you can look at the part/diskette that holds the index to find out which other part/diskette you need in order to get at the file you want.


Don't forget that with this two-step method you also need enough disk space to hold the entire ZIP archive before it's sharded.

AFAIK you can create a ZIP archive saved to floppy disks even if your source hard disk has low/almost no free space.

Phil Katz (creator of the ZIP file format) had a different set of design constraints.


The problem seems to be that each individual split part is valid in itself, which means the central directory at the end of the archive can diverge from the individual entries. That's the original issue.


Why do you believe that archiving and compressing belong in the same layer more than sharding does? The unixy tool isn't zip, it's tar | gzip.


tar|gzip does not allow random access to files. You have to decompress the tarball up to the file you want.


Even worse, in the general case, you should really decompress the whole tarball up to the end because the traditional mechanism for efficiently overwriting a file in a tarball is to append another copy of it to the end. (This is similar to why you should only trust the central directory for zip files.)
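
A quick way to see this (with GNU tar; file names are illustrative):

    tar -cf notes.tar notes.txt    # archive the original file
    echo "new version" > notes.txt
    tar -rf notes.tar notes.txt    # "overwrite" by appending a second copy
    tar -tvf notes.tar             # lists notes.txt twice
    tar -xf notes.tar              # extracts in order, so the last copy wins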


I agree!

Also, I enjoyed your Freudian slip:

single respectability tools

->

single responsibility tools


If the point is being able to access some files even if the whole archive isn’t uploaded, why not create 100 separate archives each with a partial set of files?

Or use a protocol that supports resume of partial transmits.


Because sometimes your files are very large, and it's not easy to create separate archives of (roughly) even size.

A single video can easily be over 20GB, for example.


This carries the information that all those files belong together as one pack, in an inseparable and immutable way, as opposed to encoding that in the archive's name or via some parallel channel.


Presumably it compresses better if it's all one archive?


nncp, bittorrent...


I recently had to do this with about 700 GB, and yeah, OneDrive hated that. I ended up concatenating tars together.



