Odo: Shapeshifting for your data (odo.readthedocs.io)
181 points by jonbaer on Jan 3, 2017 | 35 comments


Very nice. Having a one-line wrapper for the fast, native ways to do bulk data import for databases is fantastic on its own: http://odo.readthedocs.io/en/latest/perf.html

I ended up learning how to do this from the various SQL shells. But, that was a bit of a cognitive load, especially when the CSV files were complicated. AFAIK, for example, SQLite won't properly ignore commas in quoted fields, so you have to throw in an extra utility (like csvkit) to change the delimiter before importing into SQLite.
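For anyone who wants to skip the delimiter-swapping step, here is a minimal sketch that parses the CSV with Python's standard `csv` module (which does respect quoted commas) and bulk-inserts the rows into SQLite. The sample data, table name, and `load_csv` helper are made up for illustration, and every column is created untyped; this sidesteps the SQLite shell entirely rather than showing what the shell itself does.

```python
import csv
import io
import sqlite3

# Hypothetical sample data: some fields contain commas inside quotes.
SAMPLE_CSV = 'id,name,city\n1,"Smith, Jane",Boston\n2,"Doe, John","Portland, OR"\n'


def load_csv(conn, table, text):
    """Parse CSV text with the csv module and bulk-insert it into SQLite."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    columns = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    # Untyped columns: SQLite stores each value as the text the csv module produced.
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    # executemany consumes the remaining rows of the reader in one call.
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)
    conn.commit()


conn = sqlite3.connect(":memory:")
load_csv(conn, "people", SAMPLE_CSV)
rows = conn.execute("SELECT name, city FROM people ORDER BY id").fetchall()
```

The quoted commas end up inside the `name` and `city` values instead of splitting them into extra columns.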

Sidenote: through Odo's homepage, I discovered this amazing library for generating network graph diagrams, NetworkX: http://networkx.github.io/


NetworkX is very powerful, however for more hands-on network graph diagrams (perhaps more art than science) I rely heavily on http://gephi.org.

(Here's my Gephi map of NSFW subreddits according to their links to each other: http://electronsoup.net/nsfw_subreddits/)


Nice! I've definitely heard of Gephi but haven't made an effort to use it, partly because I'm reluctant to learn a new GUI/system and partly because I rarely try to solve problems that require graph analysis. For that subreddit visualization, how much data prep/wrangling did you do (after making the API requests, of course) before you worked with the data in Gephi?


Not too bad. I just had to get it into the format of one line per source-target pair. The data came from /u/uglyasblasphemy in SQL format, though apparently he's removed the link. Most of the fun was in arranging nodes with the move tool. With the layout I used (Fruchterman-Reingold) the nodes form a circle and you can pick up clusters and move them where you want them. Useful to make things more meaningful to the eye, if probably less mathematically correct.
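As a rough illustration of the "one line per source-target pair" format, here is a small sketch that flattens a mapping of links into an edge-list CSV with `Source` and `Target` columns. The `links` data and the `to_edge_list` helper are hypothetical; the actual subreddit data came from a SQL dump that isn't shown here.

```python
import csv
import io

# Hypothetical input: each node mapped to the nodes it links to.
links = {"a": ["b", "c"], "b": ["c"], "c": []}


def to_edge_list(links):
    """Return CSV text with a header row and one source-target pair per line."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Source", "Target"])
    for source in sorted(links):
        for target in links[source]:
            writer.writerow([source, target])
    return out.getvalue()


edge_csv = to_edge_list(links)
```

Nodes with no outgoing links (like `c` here) only show up as targets, so isolated nodes would need a separate node list.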


What a great tool. I'm going to bookmark odo's homepage so that I don't have to wait three seasons to find it again.


Bigger nerd? You for making that joke, me for getting it.


Statistically it was likely that at least one person in this audience would get it. Therefore I think that OP wins this nerd throwdown.

Full disclosure: I still don't.


Throw us a clue?


Odo is the name of a shapeshifter on Star Trek: Deep Space 9, who doesn't know his origins or homeworld until around season 3. I guess the homeworld isn't revisited as a major plot point until season 6? I don't remember...seasons 6 and 7 were honestly kind of a blur for me.

http://memory-alpha.wikia.com/wiki/Odo


I think prior to season 6, the Klingons and Romulans tried to destroy the Dominion by taking out Odo's homeworld in a sneak attack. Odo's people turned out to be the rulers of a large empire that sought to conquer solids (non-shapeshifters).


Cardassians, not Klingons.


Also when Odo was forced to return to the Link to be judged (end of season 4).



Star Trek, Deep Space 9.

Odo is a shape-shifting being who is removed from his home planet. But we don't find that out til the 3rd season.


Does it have to travel to its homeworld to learn the full range of its shapeshifting abilities?


Yes, and it also has to be electrocuted repeatedly before it runs for the first time.


Yes, but later on, it will be forced to hard-fork because of a merge conflict and lose most of its functions until it merges again with a young downstream project.


Also today is the 23rd anniversary of the premiere of DS9.


I love how much of the DS9 community got sparked to life after Netflix decided to put it up.


TNG and DS9 being on Netflix is the very reason I signed up and stayed.


I've loved the idea of Odo since I first saw it, but I've always been wary: the devil is in the details. I'm curious when and how information gets lost during each transfer because of the peculiarities of each format, or how those decisions are made and exposed.

Min/max limits, truncation, nulls, floating point precision, encodings, picking CHAR vs VARCHAR or string vs categorical, metadata like indices, etc. are some of the hard problems behind bulk loading.
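To make a few of those concrete, here is a small stdlib-only sketch of a round-trip check: convert a value to a narrower representation, convert it back, and see whether it survived. The `survives` helper and the specific conversions are illustrative assumptions, not anything Odo does or exposes.

```python
import struct


def survives(value, convert, restore):
    """Return True if value comes back unchanged after convert then restore."""
    return restore(convert(value)) == value


def to_float32(x):
    # Pack as a 4-byte IEEE 754 float, the way a FLOAT/REAL column might store it.
    return struct.pack("<f", x)


def from_float32(raw):
    return struct.unpack("<f", raw)[0]


checks = {
    # Integers above 2**53 are not all representable as 64-bit floats.
    "big int via float64": survives(2**53 + 1, float, int),
    "small int via float64": survives(2**53, float, int),
    # 0.1 has a different nearest value in 32-bit than in 64-bit floats.
    "0.1 via float32": survives(0.1, to_float32, from_float32),
    "0.5 via float32": survives(0.5, to_float32, from_float32),
    # Non-ASCII characters are replaced when forced into an ASCII column.
    "text via ascii": survives(
        "naïve", lambda s: s.encode("ascii", "replace"), lambda b: b.decode("ascii")
    ),
    # A fixed-width CHAR(5)-style column silently truncates longer strings.
    "text via 5-char field": survives("Portland", lambda s: s[:5], lambda s: s),
}
```

A check like this only catches loss for the values you feed it, so it complements schema inspection rather than replacing it.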


Very nice! I have been working on a similar library in Swift [1] which does this, but also has a nice user interface on Mac and (soon) iOS [2]. Coincidentally also uses a Star Trek themed name :-)

[1] http://github.com/pixelspark/warp [2] https://warp.one


Neat! Unless I'm misunderstanding, I think Odo is more general, though, in that you don't need to pass data through any sort of unified Dataset protocol: there's a graph of direct translators between formats.
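To illustrate what I mean by a graph of translators, here is a toy sketch: formats are nodes, each converter is a weighted edge, and a conversion is the cheapest path through the graph. The format names, costs, and converter functions are all made up for the example, and the search is a plain Dijkstra over a dict; it is not Odo's actual code or API.

```python
import heapq


def split_rows(text):
    return [line.split(",") for line in text.splitlines()]


def rows_to_dicts(rows):
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


def dicts_to_columns(dicts):
    return {key: [d[key] for d in dicts] for key in dicts[0]}


def rows_to_columns(rows):
    header, *body = rows
    return {name: [row[i] for row in body] for i, name in enumerate(header)}


# Hypothetical conversion graph: source -> [(target, relative cost, converter)].
EDGES = {
    "csv_text": [("rows", 1.0, split_rows)],
    "rows": [("dicts", 1.0, rows_to_dicts), ("columns", 5.0, rows_to_columns)],
    "dicts": [("columns", 1.0, dicts_to_columns)],
}


def cheapest_path(source, target):
    """Dijkstra over EDGES; returns the converter functions to apply in order."""
    counter = 0  # tie-breaker so the heap never has to compare function lists
    queue = [(0.0, counter, source, [])]
    seen = set()
    while queue:
        cost, _, node, funcs = heapq.heappop(queue)
        if node == target:
            return funcs
        if node in seen:
            continue
        seen.add(node)
        for nxt, weight, func in EDGES.get(node, []):
            counter += 1
            heapq.heappush(queue, (cost + weight, counter, nxt, funcs + [func]))
    raise ValueError(f"no conversion path from {source} to {target}")


def convert(data, source, target):
    """Apply each converter along the cheapest path from source to target."""
    for func in cheapest_path(source, target):
        data = func(data)
    return data


result = convert("a,b\n1,2\n3,4", "csv_text", "columns")
```

Adding a new format then only means adding edges to its nearest neighbours; every other conversion falls out of the path search.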


Well done.

By the way, I have a problem opening that site: it has an SSL certificate problem in Mozilla Firefox.


I use odo a lot to take any data source into pandas or blaze and it's been great so far. For me it's the lazy and easy way of moving data around.


I wish it supported large csv to partitioned parquet. THAT is something I need a good solution for.


Are you familiar with fastparquet (https://github.com/dask/fastparquet) and pyarrow (https://pyarrow.readthedocs.io/en/latest) ?


Not until you linked me. :) They look like good steps but not quite what I need, for different reasons. I'll have to check when I get back to the office next week.


Your link is malformed. https://github.com/dask/fastparquet


...and it's fixed.


(Py)Spark? I run it locally and do that kind of operation pretty frequently.


On what scale?

Part of the problem is the scale, but another part is that writing partitioned parquets seems poorly documented (I would love corrections; I spent a decent amount of time last week looking for good information).


I was recently working with a 38GB compressed text dataset locally. I can't remember how big it was in parquet.

It's literally one line of code. See http://labs.vistarmedia.com/2016/12/27/indexing-json-logs-wi... for an example (except you write to a local file system rather than HDFS).


That looks very helpful, I'll have to check it out!


Oh my god it looks beautiful



