TL;DR: xsv is probably what you want, or maybe zsv and/or awk
awk can do this super easily. Here's an example snippet that not only shards, but compresses your shards.
```
# expects shard_size, target_prefix, and file_type to be passed in via awk -v
NR == 1 { headrow = $0 }                # capture the header row for reuse in every shard
(NR - 1) % shard_size == 0 {            # ready to start a new shard
    current_n = current_n + 1
    output_file = sprintf("%s%04d.%s.bz2", target_prefix, current_n, file_type)
    print "writing to " output_file > "/dev/stderr"
    # close any prior-opened output_command (else will err on too many open files)
    if (output_command != "")
        close(output_command)
    output_command = "bzip2 > " output_file
    # print header at the top of the new shard
    print headrow | output_command
}
NR != 1 {                               # data rows only; the header is handled above
    print $0 | output_command
}
```
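For example, a minimal sketch of how it could be invoked (the variable values and the `shard.awk` filename are placeholders):
```
# writes shards of 1,000,000 rows each as shard_0001.csv.bz2, shard_0002.csv.bz2, ...
awk -v shard_size=1000000 -v target_prefix=shard_ -v file_type=csv -f shard.awk input.csv
```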
This of course assumes that each line is a single record, so you'll need some preprocessing if your CSV might contain embedded line breaks. For that preprocessing, you can use something like the `2tsv` or `select -e ...` commands of https://github.com/liquidaty/zsv (disclaimer: I'm its author) to ensure each record is a single line.
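A sketch of that preprocessing step, assuming zsv's `2tsv` subcommand reads a CSV file and writes one record per line as TSV (the shard script above is delimiter-agnostic, so the shards simply come out as TSV instead of CSV):
```
# flatten embedded newlines by converting to TSV, then shard with the awk script above
zsv 2tsv input.csv \
  | awk -v shard_size=1000000 -v target_prefix=shard_ -v file_type=tsv -f shard.awk
```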
You can also use something like `xsv split` (see https://lib.rs/crates/xsv), which frankly is probably your best option as of today (though zsv will be getting its own shard command soon).
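For reference, a rough sketch of what that looks like, assuming a chunk size of 1,000,000 rows (the directory and file names are placeholders):
```
# write chunks of 1,000,000 records each into ./chunks/, repeating the header row in every chunk
xsv split --size 1000000 chunks/ input.csv
```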
I think $40 a year just for the split feature is a little too much. However, adding more useful features for manipulating CSV files would probably change my mind. For example, reordering or preprocessing the files as part of the split: if people split into X files, they'd otherwise have to repeat that action X times in Excel.
Check out the Didgets tool at https://www.Didgets.com, which lets you import data from CSV, JSON, or JSON Lines; filter out any unwanted rows; and then export it to CSV, JSON, JSON Lines, HTML, or XML files while splitting them up like this does.
Note: if you're pulling in a lot of S3 list_objects_v2() data and have some honking big object sizes, the Long Integer type craps out at representing a 2 GB file. You need to use Double.
It's interesting how people use and abuse Excel. How many people try to use Excel to process millions of rows? Time to use tools that can directly query CSV files to aggregate the data.
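As a minimal sketch of that kind of direct querying, assuming a simple CSV with no quoted or embedded commas (the file name and column choices are hypothetical), plain awk can already do a group-by aggregation:
```
# sum column 3 per distinct value of column 1, skipping the header row
awk -F, 'NR > 1 { total[$1] += $3 } END { for (k in total) print k "," total[k] }' data.csv
```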
The answer is actually PowerPivot. With Access you can handle a few GB, but with PowerPivot you can do a billion or more rows. Don't expect any insane performance, though.
When downloading open data files from government sites there can be millions of rows.
It would be nice if excel worked with 50M rows to do quick filters and pivots and stuff.
But of course there are other tools for that.
Comically, I'm not sure how splitting a 50M-line CSV into 50 files helps, as you can't filter or pivot across 50 files unless you want to do lots of manual work.
What do you mean by no update? This is a new app completely.
I submitted Superintendent.app 2 times and stopped because the second time hit the front page. The other 2 times it was posted by someone else, which I have no control over...
There are already open-source utilities [0] for users who aren't proficient in Unix commands.
If anything this should be the definition of a one time fee.
You’re free to charge whatever you like, but it seems odd that anyone would pay you year after year to use your app.
It’s not the price, as $40 isn’t that much, but the value and principle of the thing.
[0] https://github.com/philoushka/LargeFileSplitter