I ended up learning how to do this from the various SQL shells. But that was a bit of a cognitive load, especially when the CSV files were complicated. AFAIK, for example, SQLite won't properly ignore commas in quoted fields, so you have to throw in an extra utility (like csvkit) to change the delimiter before importing into SQLite.
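One way to sidestep the delimiter dance entirely is to do the import from Python, since the stdlib csv module does parse quoted fields (embedded commas, escaped quotes) correctly. This is just a sketch; the function and table names are made up for the example:

```python
import csv
import sqlite3

def load_csv(db_path, table, csv_path):
    # Parse with the csv module, which handles commas inside quoted
    # fields, then bulk-insert via parameterized statements.
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = list(reader)

    conn = sqlite3.connect(db_path)
    cols = ", ".join(f'"{c}"' for c in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({cols})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    conn.commit()
    return conn
```

All columns land as TEXT this way, but at least the field boundaries come out right.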
sidenote: through Odo's homepage, I discovered this amazing library for generating network graph diagrams, NetworkX: http://networkx.github.io/
Nice! Have definitely heard of Gephi but haven't made an effort to use it, out of reluctance to learn a new GUI/system and because I rarely try to solve problems that require graph analysis. For that subreddit visualization, how much data prep/wrangling did you do (after making the API requests, of course) before you worked with the data in Gephi?
Not too bad. I just had to get it in the format of one line per source-target pair. The data came from /u/uglyasblasphemy in SQL format though apparently he's removed the link. Most of the fun was in arranging nodes with the move tool. With the layout I used (Fruchterman-Reingold) the nodes form a circle and you can pick up clusters and move them where you want them. Useful to make things more meaningful to the eye, if probably less mathematically correct.
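For anyone curious what "one line per source-target pair" looks like in practice: Gephi's spreadsheet importer accepts a plain edge list with Source and Target columns (an optional Weight column works too). A minimal sketch with invented subreddit pairs:

```python
import csv

# Hypothetical subreddit-overlap pairs (source, target, weight).
pairs = [
    ("python", "programming", 120),
    ("python", "learnpython", 300),
    ("programming", "linux", 95),
]

# Write the edge list in the shape Gephi's CSV importer expects.
with open("edges.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Source", "Target", "Weight"])
    writer.writerows(pairs)
```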
Odo is the name of a shapeshifter on Star Trek: Deep Space 9, who doesn't know his origins or homeworld until around season 3. I guess the homeworld isn't revisited as a major plot point until season 6? I don't remember...seasons 6 and 7 were honestly kind of a blur for me.
I think prior to season 6, the Klingons and Romulans tried to destroy the Dominion by taking out Odo's homeworld in a sneak attack. Odo's people turned out to be the rulers of a large empire that sought to conquer solids (non-shapeshifters).
Yes, but later on, it will be forced to hard-fork because of a merge conflict and lose most of its functions until it merges again with a young downstream project.
I've loved the idea of Odo since I first saw it, but I've always been wary: the devil is in the details. I'm curious when and how information gets lost during each transfer because of the peculiarities of each format, or how those decisions are made and exposed.
Min/max limits, truncation, nulls, floating-point precision, encodings, picking CHAR vs VARCHAR or string vs categorical, metadata like indices, etc. are some of the hard problems behind bulk loading.
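The floating-point case is easy to demonstrate: if the destination format only offers 32-bit floats, a round-trip through it silently changes the value. A quick illustration using the stdlib:

```python
import struct

# Round-trip a Python float (64-bit) through a 32-bit representation,
# simulating a store that only supports single-precision floats.
x = 0.1
x32 = struct.unpack("<f", struct.pack("<f", x))[0]

# The value that comes back is the nearest float32, not the original.
print(x == x32)  # False: precision was lost in the transfer
```

The same kind of silent change happens with out-of-range integers, lossy encodings, and truncated strings; the question is whether the tool errors, warns, or just quietly does it.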
Very nice! I have been working on a similar library in Swift [1] which does this, but also has a nice user interface on Mac and (soon) iOS [2]. Coincidentally also uses a Star Trek themed name :-)
Neat! Unless I'm misunderstanding, I think Odo is more general, though, in that you don't need to pass data through any sort of unified Dataset protocol: there's a graph of direct translators between formats.
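As I understand that design (this is a toy sketch of the idea, not Odo's actual API; every name here is invented), pairwise converters form edges in a graph, and converting between two formats means finding a path through it:

```python
from collections import deque

# Registry mapping (source_format, dest_format) -> converter function.
converters = {}

def register(src, dst):
    def deco(fn):
        converters[(src, dst)] = fn
        return fn
    return deco

def convert(value, src, dst):
    # BFS for the shortest chain of registered converters, then apply it.
    queue = deque([(src, [])])
    seen = {src}
    while queue:
        fmt, path = queue.popleft()
        if fmt == dst:
            for step in path:
                value = step(value)
            return value
        for (a, b), fn in converters.items():
            if a == fmt and b not in seen:
                seen.add(b)
                queue.append((b, path + [fn]))
    raise ValueError(f"no conversion path from {src} to {dst}")

@register("csv_text", "rows")
def csv_to_rows(text):
    return [line.split(",") for line in text.splitlines()]

@register("rows", "dicts")
def rows_to_dicts(rows):
    header, *body = rows
    return [dict(zip(header, r)) for r in body]
```

With that registry, `convert("a,b\n1,2", "csv_text", "dicts")` chains the two converters automatically, and no format ever has to pass through a single common protocol.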
Not until you linked me. :) They look like good steps but not quite what I need, for different reasons. I'll have to check when I get back to the office next week.
Part of the problem is the scale, but another part is that writing partitioned Parquet files seems poorly documented (I would love corrections; I spent a decent amount of time last week looking for good information).