We can blame CSV, or we can blame the way people use CSV. Either way, CSV is so unreliable that I try to fail fast, as early as possible, in any automated pipeline.
At work, we explicitly define a data-structuring step that converts CSV to Parquet with a strict schema and technical/structural validation. We assign interns and new-grad engineers to this work; with minimal training, it is well within their capabilities.
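To give a rough idea of what I mean by failing fast on structure, here is a minimal sketch using only the Python standard library. It is not our actual pipeline: the column names, the `SCHEMA` mapping, `CsvSchemaError`, and `read_strict` are all made up for illustration. It checks the header, the field count, and the type of every value, and stops at the first violation.

```python
import csv
import io
from datetime import date

# Hypothetical schema for illustration: column name -> parser.
# Each parser must raise ValueError on input it cannot convert.
SCHEMA = {
    "id": int,
    "amount": float,
    "day": date.fromisoformat,
}


class CsvSchemaError(ValueError):
    """Raised on the first structural or type violation."""


def read_strict(text_stream, schema=SCHEMA):
    """Parse CSV text into typed dicts, raising on the first problem."""
    expected = list(schema)
    # strict=True makes the csv module raise csv.Error on malformed quoting.
    reader = csv.reader(text_stream, strict=True)
    rows = []
    try:
        header = next(reader, None)
        if header is None:
            raise CsvSchemaError("empty file: no header row")
        if header != expected:
            raise CsvSchemaError(f"header {header!r} != expected {expected!r}")

        for fields in reader:
            line = reader.line_num
            if len(fields) != len(expected):
                raise CsvSchemaError(
                    f"line {line}: expected {len(expected)} fields, got {len(fields)}"
                )
            row = {}
            for name, raw in zip(expected, fields):
                try:
                    row[name] = schema[name](raw)
                except ValueError as exc:
                    raise CsvSchemaError(
                        f"line {line}, column {name!r}: bad value {raw!r}"
                    ) from exc
            rows.append(row)
    except csv.Error as exc:
        raise CsvSchemaError(f"line {reader.line_num}: malformed CSV: {exc}") from exc
    return rows


if __name__ == "__main__":
    good = "id,amount,day\n1,2.5,2024-01-05\n"
    print(read_strict(io.StringIO(good)))

    bad = "id,amount,day\n1,not-a-number,2024-01-05\n"
    try:
        read_strict(io.StringIO(bad))
    except CsvSchemaError as err:
        print("rejected:", err)
```

Only rows that pass would go on to the Parquet writer. That step is left out here; with a library such as pyarrow it would be something along the lines of building a table from the typed rows and writing it with an explicit schema, but treat that as an assumption about tooling rather than a description of what we run.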