Hacker News | dataflowmapper's comments

Thanks for sharing, good point about garbage in and garbage out. I mostly work in software implementation, so this use case is pretty insightful.


Nice, sounds like a pretty standard approach. Have you had any issues or time sinks around repeatability where you need to tweak scripts or create new ones for different CSVs?


Thanks for sharing, these kinds of tasks can definitely become a time sink. Out of curiosity, you mentioned apps using the APIs aren't reliable at scale; if you don't mind sharing, where do things break down when using them?


Wow, thanks for the detailed and insightful breakdown of your process. This really resonates with a lot of the complexities I've been exploring and have encountered as an implementation consultant.

A few things particularly stood out to me:

  - Your point about semantic mapping ("understanding the structure... what it means and where it should go") being more irritating than the actual coding of transformations is spot on. Domain expertise on these things can make a big difference.

  - The version tracking issues with Excel and the need to trace back decisions or errors are a big reason it isn't scalable. I totally agree that Excel is bad for this kind of work and that repeatable transformations/processes are important.
  - Your desire for an easier way to flag/share errors with partners for correction and then effectively store those changes as part of a repeatable process (like 'code' or a versioned configuration) is a fascinating challenge in collaborative data onboarding. It seems crucial for those fixes not to be one-offs, especially when dealing with partners/clients, who are notoriously slow. I've used Flatfile in the past, but it's limited in that it's all about one-offs and transformations aren't repeated.
It sounds like you've built a robust internal Python library to manage a lot of this, which is impressive. When you use Jupyter notebooks for each incoming file:

  - How do you manage the reusability of common logic or tests across those notebooks, especially if a core cleaning function needs an update that should apply to many different partner files?
  - And how do less technical team members (like the interns you mentioned) interact with or contribute to that Python/notebook-based process, particularly if they need to understand or slightly modify a transformation rule?
And for your usage of LLMs, is that mostly for mapping source fields to target fields?

Thanks again for sharing such a comprehensive view, it's incredibly helpful for understanding the friction points and the need for traceable, repeatable solutions.


Of course! These are one-time migrations, so at most we have 3-4 projects happening concurrently, and we don't need to worry about backwards compatibility either. We can just import a specific version or branch of the library, or at worst copy and paste the function into the notebook and make the change there. But the functions we've put the most work into in the library are the ones consistently applied across any incoming data files - API scripts, de-duplication, assigning IDs... honestly, changes like that are usually pretty easy to make.

The tests are always consistent across files and primarily check the validity of the upload (right type, for example) or its logic (start times have end times, etc.). Every test should work for every file ever, since they are based on the platform and upload constraints rather than partner data.
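To make that concrete, a rough sketch of the kind of platform-level check I mean (pandas assumed, column names made up):

  import pandas as pd

  def check_upload_validity(df: pd.DataFrame) -> list[str]:
      """Platform-level checks that apply to every incoming file."""
      errors = []
      for col in ("start_time", "end_time"):
          if col not in df.columns:
              return [f"missing required column: {col}"]
      starts = pd.to_datetime(df["start_time"], errors="coerce")
      ends = pd.to_datetime(df["end_time"], errors="coerce")
      errors += [f"row {i}: unparseable start_time" for i in df.index[starts.isna()]]
      errors += [f"row {i}: start_time without end_time"
                 for i in df.index[starts.notna() & ends.isna()]]
      return errors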

Short answer is they don't. For onboarding interns into the process, I write a skeleton notebook that imports the internal library and walks them through cleaning a file. But we would hire interns that have a background in coding and data cleaning. Starting out, rather than change an existing function, they might add a line of code in the notebook that reshapes the data so that an existing function now works, for example. There are cases where business-specific logic needs to be coded into a function, but we would just write those ad hoc. This isn't an upload that needs to be done automatically or very quickly, so that hasn't been a problem.

The only reason other team members would contribute to this is to shape the business logic of which fields should be uploaded where, how they should be formatted, etc. But the data review that sometimes needs to be done is very tricky, e.g., checking that the transformed results are what they want. It mostly happens in Excel - we did build a POC UI where they could upload a CSV, click through each record, and review and suggest changes in a cleaner way.

For LLMs, we don't use them for mapping, but we've tested it and it works fine. The LLM mapping doesn't really save us a ton of time compared to us just looking through the file and assigning columns in our mapping file (which is composed of about 120 columns). If we wanted to deploy it and allow the partner to upload a sheet and the LLM to suggest a mapping, that could work. (The mapping step in the demo video on your site looks great!) Instead we use them to format and extract unstructured data -> for example, we might need to pull tags out of descriptive text, or extract event hours out of a string of text describing the event. The LLM can do this really well now with structured JSON outputs, but reviewing it to ensure it is correct is still a manual process.
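To give a sense of what that extraction step looks like, here's a rough sketch (call_llm is a stand-in for whatever client you use, and the fields are made up):

  import json

  def extract_event_hours(description: str, call_llm) -> dict | None:
      # call_llm is a placeholder: it takes a prompt string and returns
      # the model's text response.
      prompt = (
          "Extract the event hours from the text below. Respond with JSON only, "
          'e.g. {"open": "09:00", "close": "17:00"}, or nulls if not stated.\n\n'
          + description
      )
      try:
          data = json.loads(call_llm(prompt))
      except json.JSONDecodeError:
          return None  # flag for manual review instead of guessing
      # Structured output still needs checking before you trust it
      if not isinstance(data, dict) or set(data) != {"open", "close"}:
          return None
      return data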


Appreciate the detailed follow-up and all this information; it's been super insightful to get a more real-world perspective on this. Your processes all sound very well thought out and address most of the shortcomings/pain points I've experienced personally. There's definitely a lot of valuable info in there for anyone in this space. And thank you for taking the time to check out the site; feedback is limited at this stage, so hearing that something resonated is great. Hoping it all makes sense.


Yeah, programming definitely offers the most flexibility if you have that skillset. I'm particularly interested in your 'last 20% is like going to the moon' analogy for special-purpose tools or even Excel/Bash. Do you have any examples off the top of your head of the kinds of transformation or validation challenges that fall into that really difficult 20%, where only a 'real programming language' can effectively get the job done?


For one thing, a lot of tools like Excel do unwanted data transformations: when importing from a CSV, they try to guess whether text is meant to be a string or a number and sometimes guess wrong. You can spend a lot of time disabling this behavior, preventing it, or fixing it up afterwards, but frequently the bad data just plain corrupts the analysis or the target system.
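In pandas, for instance, the usual workaround is to read everything as text and convert explicitly; a minimal sketch (file and column names are just for illustration):

  import pandas as pd

  # Read everything as text so nothing gets silently coerced (leading zeros kept,
  # "1E5" stays a string, etc.), then convert only the columns you actually want.
  df = pd.read_csv("partner_file.csv", dtype=str, keep_default_na=False)
  df["quantity"] = pd.to_numeric(df["quantity"], errors="raise")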

You can't really trust the time-handling functions on any platform, which is part of the reason languages like Python and Java have two or three libraries for working with dates and times in the standard library: people realized the old one was unfixable. Plenty of JavaScript date-handling libraries have told people, "look, this is obsolete, it's time to move on," not just because that's the JavaScript way, but because the libraries really were error-prone.

In a real programming language it's straightforward to fix those problems; in a fake programming language it is difficult or impossible.

If you've got a strange index structure in the source or destination data, many tools fall down. For instance, if you want to convert nested set trees

https://blog.uniauth.com/nested-set-model

to something more conventional like an adjacency list (or vice versa), a lot of simple tools are going to struggle.
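A rough sketch of that direction of the conversion in Python, assuming each node carries (id, left, right) values:

  def nested_set_to_adjacency(nodes):
      """nodes: iterable of (id, lft, rgt) tuples; returns {id: parent_id}."""
      parents = {}
      stack = []  # ancestors whose subtree we are still inside
      for node_id, lft, rgt in sorted(nodes, key=lambda n: n[1]):
          while stack and stack[-1][2] < lft:  # pop subtrees that have closed
              stack.pop()
          parents[node_id] = stack[-1][0] if stack else None
          stack.append((node_id, lft, rgt))
      return parents

  # Example tree: a(1,8) -> b(2,5) -> c(3,4), and a -> d(6,7)
  nested_set_to_adjacency([("a", 1, 8), ("b", 2, 5), ("c", 3, 4), ("d", 6, 7)])
  # {'a': None, 'b': 'a', 'c': 'b', 'd': 'a'}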


Gotcha, totally agree on those points. I think everyone's dealt with the Excel typing crap. My team uses Workato for some integrations, and we use scripts any time we need math because of questionable precision, so I see your take on the unreliable-functions part.


Thanks for sharing that workflow; for more straightforward flows, that sounds like a decent approach. My main thoughts on where it could be improved, or where I see potential struggles, are when:

  - People aren't comfortable or familiar with coding/Python.
  - You get into more complex imports like historical data, transactional data, etc. There you might have 15 transaction types that have to be mapped, each with different fields, math, and conditional logic, where the requirements become too much to prompt ChatGPT with effectively, and iterating on the Python can get pretty involved.
  - The source data structure and transformation needs aren't consistently the same, leading to a lot of 'throwaway' or heavily modified scripts for each unique case.
  - Tasks like VLOOKUPs or enriching data come into play, which might add manual coding or complexity beyond a simple 1-to-1 source-to-destination script (essentially a join against a lookup table; see the sketch below).
These are the areas where I'm exploring if a more structured way could offer benefits in terms of repeatability and accessibility for a wider range of users or complex scenarios. Appreciate the insight into your process and your thoughts on this.
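On that VLOOKUP point, the script equivalent is usually a merge against a lookup table; a rough sketch with made-up file and column names:

  import pandas as pd

  # Script equivalent of a VLOOKUP: enrich source rows from a reference sheet.
  source = pd.read_csv("transactions.csv", dtype=str)
  lookup = pd.read_csv("account_mapping.csv", dtype=str)  # old_code, new_code, gl_account
  enriched = source.merge(lookup, how="left", on="old_code")
  needs_review = enriched[enriched["new_code"].isna()]  # rows with no match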


From doing historical data restorations (pre-JSON), using documented industry/standard structures & protocols and normalized data formats/layouts helps out quite a bit.

Note: awk/shell scripts, circa the '80s, because they're relatively portable across platforms (Mac, PC, DEC, IBM (ignoring EBCDIC), etc.).


Thanks for sharing, this is actually pretty cool. I can see how this would be a really flexible tool for power users. How'd you get into developing it? Were you trying to solve a specific problem you ran into?


I've played around with regex for some use cases like cleaning up names, but I don't know the syntax well enough to manipulate data on the fly. How have you been leveraging AI for those use cases you mentioned?

