Some (most?) tools that output data in columns and fit each one to the largest value in that column need to scan the whole file as a first pass just to start displaying data.
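To make that concrete, here is a minimal sketch of the fit-to-widest approach (not any particular tool's code; it assumes naive comma-separated input). Nothing can be printed until every row has been seen, and the simplest way to do that is to hold all rows in memory:

```rust
use std::io::{self, BufRead, Write};

fn main() -> io::Result<()> {
    // Hold every row in memory: any later row could still widen a column.
    let stdin = io::stdin();
    let mut rows: Vec<Vec<String>> = Vec::new();
    for line in stdin.lock().lines() {
        let line = line?;
        rows.push(line.split(',').map(str::to_owned).collect());
    }

    // Pass 1: find the maximum width of each column.
    let ncols = rows.iter().map(Vec::len).max().unwrap_or(0);
    let mut widths = vec![0usize; ncols];
    for row in &rows {
        for (i, cell) in row.iter().enumerate() {
            widths[i] = widths[i].max(cell.len());
        }
    }

    // Pass 2: only now can the first row be printed.
    let stdout = io::stdout();
    let mut out = stdout.lock();
    for row in &rows {
        for (i, cell) in row.iter().enumerate() {
            write!(out, "{:width$} ", cell, width = widths[i])?;
        }
        writeln!(out)?;
    }
    Ok(())
}
```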
Not only is that the case with this tool, but from what I'm reading in main.rs, it looks like it's also loading the whole file into memory. I was going to say that scanning the file was a deal-breaker, but if this is true it's far more resource-intensive.
This looks like a nice tool, but these design choices seem to limit its use to relatively small files. It could be updated to use a read-ahead buffer instead and adjust its output as new lines with values of different widths are discovered, although doing this without a jarring resize could be challenging.
Could someone with better knowledge of Rust than mine confirm this?
I see the full dataset being loaded here [1] and the column widths being computed here [2].
> these design choices seem to limit its use to relatively small files
1. As a rule of thumb, I have been working on functionality before optimization. That said, `tv` is really fast. It is simply false that `tv` only works for relatively small files: I just pushed a 624MB file through `tv` and it ran in 2.8 seconds, while `column` takes 5.0 seconds on the same file. Now, I would love help from programmers smarter than me; I am sure there are plenty of optimization gains to be had in `tv`. I just wanted to make sure potential users are not misled: `tv` is performant.
> Some (most?) tools that output data in columns and fit each one to the largest value in that column need to scan the whole file as a first pass just to start displaying data.
> Not only is that the case with this tool, but from what I'm reading in main.rs, it looks like it's also loading the whole file into memory.
2. `tv` reads once, but parses partially. That is, it reads the full file only to grab the number of rows, and it parses (takes) only the first n rows.
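A minimal sketch of that "read fully, parse partially" strategy (illustrative only, not `tv`'s actual code; the filename, the row limit N, and the naive comma split are all assumptions):

```rust
use std::fs::File;
use std::io::{self, BufRead, BufReader};

const N: usize = 25; // hypothetical display limit, standing in for tv's first n rows

fn main() -> io::Result<()> {
    let reader = BufReader::new(File::open("data.csv")?); // assumed input file

    let mut total_rows = 0usize;
    let mut parsed: Vec<Vec<String>> = Vec::with_capacity(N);

    for line in reader.lines() {
        let line = line?;
        // Only the first N rows are parsed; every remaining
        // line is merely read and counted.
        if total_rows < N {
            parsed.push(line.split(',').map(str::to_owned).collect());
        }
        total_rows += 1;
    }

    // Column widths can now be computed from `parsed` alone,
    // while the full dimensions of the file are still known.
    println!(
        "dimensions: {} rows x {} cols",
        total_rows,
        parsed.first().map_or(0, Vec::len)
    );
    Ok(())
}
```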
If the goal is to calculate the correct column width, you have to do one pass through the data before writing the first row.
If the file can be read multiple times (not a UNIX stream), you can just read the file twice.
If the file is a stream, instead of retaining the entire dataset in memory, you can write to a temporary file and re-parse it after calculating the widths.
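A rough sketch of that temp-file approach (assuming the `tempfile` crate as a dependency; the comma split stands in for real CSV parsing):

```rust
use std::io::{self, BufRead, BufReader, Seek, SeekFrom, Write};

fn main() -> io::Result<()> {
    let mut spool = tempfile::tempfile()?; // deleted automatically on drop
    let mut widths: Vec<usize> = Vec::new();

    // Pass 1: spool the stream to disk while tracking each column's max width.
    let stdin = io::stdin();
    for line in stdin.lock().lines() {
        let line = line?;
        for (i, cell) in line.split(',').enumerate() {
            if i == widths.len() {
                widths.push(0);
            }
            widths[i] = widths[i].max(cell.len());
        }
        writeln!(spool, "{}", line)?;
    }

    // Pass 2: rewind the temp file and print with the final widths.
    spool.seek(SeekFrom::Start(0))?;
    for line in BufReader::new(spool).lines() {
        let line = line?;
        for (i, cell) in line.split(',').enumerate() {
            print!("{:width$} ", cell, width = widths[i]);
        }
        println!();
    }
    Ok(())
}
```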
The correct column width is calculated from the first n rows, not the full file.
A stream does not work for `tv` because a stream does not know how many rows are in the file a priori. Displaying the dimensions of the file is a priority for `tv`, and I am very happy with that trade-off: I would rather know the dimensions of a file than have a stream of unknown dimensions.
If you did it the way he's describing, you would stream through the file to count the rows while writing it out as a temp file, which you could then re-parse for the actual data.
I'm not saying you should or shouldn't, but your use case doesn't bar you from using streams.
I like this idea. I don't think it would be jarring if the read-ahead buffer were a fixed number of lines, i.e. if the output looked like distinct pages. The default could be at least the line height of the terminal, or some multiple of it.
There could be an option to redisplay the header row for resized "pages".
There could be a CLI switch giving the user control, i.e. making everyone happy.
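A sketch of what that paged behavior could look like (the page size, the header handling, and the comma split are all illustrative choices, not anything `tv` currently does):

```rust
use std::io::{self, BufRead};

const PAGE: usize = 40; // e.g. roughly one terminal height

// Compute widths for this page only, then print it.
fn print_page(rows: &[Vec<String>]) {
    let ncols = rows.iter().map(Vec::len).max().unwrap_or(0);
    let mut widths = vec![0usize; ncols];
    for row in rows {
        for (i, cell) in row.iter().enumerate() {
            widths[i] = widths[i].max(cell.len());
        }
    }
    for row in rows {
        for (i, cell) in row.iter().enumerate() {
            print!("{:width$} ", cell, width = widths[i]);
        }
        println!();
    }
}

fn main() -> io::Result<()> {
    let mut page: Vec<Vec<String>> = Vec::with_capacity(PAGE);
    let mut header: Option<Vec<String>> = None;

    let stdin = io::stdin();
    for line in stdin.lock().lines() {
        let row: Vec<String> = line?.split(',').map(str::to_owned).collect();
        if header.is_none() {
            header = Some(row.clone());
        } else if page.is_empty() {
            // Redisplay the header at the top of each new page.
            page.push(header.clone().unwrap());
        }
        page.push(row);
        if page.len() >= PAGE {
            print_page(&page); // column widths may change between pages
            page.clear();
        }
    }
    if !page.is_empty() {
        print_page(&page);
    }
    Ok(())
}
```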
It is more resource-intensive, but it pushes the problem you mentioned onto `tv`. If `tv` doesn't work with embedded EOLs, then you need to fix your data or fix your tool.
> Just show me the top 5 rows. That's all most people are looking for.
Is it? I'd wager that's at most half of its use. Accessing a specific section that could be anywhere in the file is very common in my experience, as is truly random access. Both of those, along with the first-few-rows use case, are far better served by a paging system.
[1] https://github.com/alexhallam/tv/blob/main/src/main.rs#L183-...
[2] https://github.com/alexhallam/tv/blob/main/src/main.rs#L218-...