Odo: Shapeshifting for your data

danso · on Jan 3, 2017

Very nice. Having a one-line wrapper for the fast, native ways to do bulk data import for databases is fantastic on its own: http://odo.readthedocs.io/en/latest/perf.html

I ended up learning how to do this from the various SQL shells. But, that was a bit of a cognitive load, especially when the CSV files were complicated. AFAIK, for example, SQLite won't properly ignore commas in quoted fields, so you have to throw in an extra utility (like csvkit) to change the delimiter before importing into SQLite.
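
For comparison, here is a minimal sketch of a loader that never needs the delimiter swapped, using only Python's standard library rather than Odo. The `csv` module parses quoted fields with embedded commas, and `sqlite3` takes the rows through placeholders. The sample data, the table name `people`, and the helper `load_csv` are made up for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical sample data: several fields contain commas inside quotes.
SAMPLE_CSV = 'id,name,city\n1,"Smith, Jane",Boston\n2,"O\'Neil, Pat","Portland, OR"\n'


def quote_ident(name):
    """Quote an SQL identifier, doubling any embedded double quotes."""
    return '"{}"'.format(name.replace('"', '""'))


def load_csv(conn, table, text):
    """Create `table` from the CSV header row and bulk-insert the remaining rows."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    cols = ", ".join(quote_ident(c) for c in header)
    marks = ", ".join("?" for _ in header)
    conn.execute("CREATE TABLE {} ({})".format(quote_ident(table), cols))
    # Values go through placeholders, never through string formatting.
    conn.executemany(
        "INSERT INTO {} VALUES ({})".format(quote_ident(table), marks), reader
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
load_csv(conn, "people", SAMPLE_CSV)
rows = conn.execute("SELECT name, city FROM people ORDER BY id").fetchall()
print(rows)
```

Every column is created without a declared type here, so this is a sketch of the parsing step and not a substitute for a native bulk loader.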

sidenote: through Odo's homepage, I discovered this amazing library for generating network graph diagrams, NetworkX: http://networkx.github.io/

flashman · on Jan 3, 2017

NetworkX is very powerful; however, for more hands-on network graph diagrams (perhaps more art than science) I rely heavily on http://gephi.org.

(Here's my Gephi map of NSFW subreddits according to their links to each other: http://electronsoup.net/nsfw_subreddits/)

danso · on Jan 3, 2017

Nice! I've definitely heard of Gephi but haven't made an effort to use it, partly out of reluctance to learn a new GUI/system and partly because I rarely try to solve problems that require graph analysis. For that subreddit visualization, how much data prep/wrangling did you do (after making the API requests of course) before you worked with the data in Gephi?

flashman · on Jan 4, 2017

Not too bad. I just had to get it in the format of one line per source-target pair. The data came from /u/uglyasblasphemy in SQL format though apparently he's removed the link. Most of the fun was in arranging nodes with the move tool. With the layout I used (Fruchterman-Reingold) the nodes form a circle and you can pick up clusters and move them where you want them. Useful to make things more meaningful to the eye, if probably less mathematically correct.
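
A rough sketch of that "one line per source-target pair" format, written with Python's standard library. The link records and the helper `write_edge_list` are hypothetical, and the `Source`/`Target` header names are an assumption about what Gephi's CSV import expects, so check them against the import dialog. This is not the script used for the map above.

```python
import csv
import io

# Hypothetical link records: (linking community, linked community).
links = [("a", "b"), ("a", "c"), ("b", "c"), ("a", "b")]


def write_edge_list(pairs, out):
    """Write unique directed pairs as CSV, one source-target pair per line."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Source", "Target"])  # assumed header names
    seen = set()
    for pair in pairs:
        if pair not in seen:  # drop repeated links, keep first-seen order
            seen.add(pair)
            writer.writerow(pair)


buf = io.StringIO()
write_edge_list(links, buf)
edge_csv = buf.getvalue()
print(edge_csv)
```

Dropping duplicates is a design choice; counting them instead would give an edge weight column.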

marchenko · on Jan 3, 2017

What a great tool. I'm going to bookmark odo's homepage so that I don't have to wait three seasons to find it again.

skyrw · on Jan 3, 2017

Bigger nerd? You for making that joke, me for getting it.

GrinningFool · on Jan 3, 2017

Statistically it was likely that at least one person in this audience would get it. Therefore I think that OP wins this nerd throwdown.

Full disclosure: I still don't.

cpr · on Jan 3, 2017

Throw us a clue?

danso · on Jan 3, 2017

Odo is the name of a shapeshifter on Star Trek: Deep Space 9, who doesn't know his origins or homeworld until around season 3. I guess the homeworld isn't revisited as a major plot point until season 6? I don't remember...seasons 6 and 7 were honestly kind of a blur for me.

http://memory-alpha.wikia.com/wiki/Odo

goatlover · on Jan 3, 2017

I think prior to season 6, the Klingons and Romulans tried to destroy the Dominion by taking out Odo's homeworld in a sneak attack. Odo's people turned out to be the rulers of a large empire that sought to conquer solids (non-shapeshifters).

moosingin3space · on Jan 4, 2017

Cardassians, not Klingons.

cholantesh · on Jan 4, 2017

Also when Odo was forced to return to the Link to be judged (end of season 4).

acemarke · on Jan 3, 2017

Pretty sure that's a reference to http://memory-alpha.wikia.com/wiki/The_Search,_Part_II_%28ep... and http://memory-alpha.wikia.com/wiki/Great_Link .

mahmud · on Jan 3, 2017

Star Trek, Deep Space 9.

Odo is a shape-shifting being who is removed from his home planet. But we don't find that out til the 3rd season.

RileyKyeden · on Jan 3, 2017

Does it have to travel to its homeworld to learn the full range of its shapeshifting abilities?

cholantesh · on Jan 3, 2017

Yes, and it also has to be electrocuted repeatedly before it runs for the first time.

oneplane · on Jan 4, 2017

Yes, but later on, it will be forced to hard-fork because of a merge conflict and lose most of its functions until it merges again with a young downstream project.

ElComradio · on Jan 4, 2017

Also today is the 23rd anniversary of the premiere of DS9.

oneplane · on Jan 4, 2017

I love how more of the DS9 community got sparked to life after Netflix decided to put it up.

TeMPOraL · on Jan 4, 2017

TNG and DS9 being on Netflix is the very reason I signed up and stayed.

makmanalp · on Jan 4, 2017

I've loved the idea of Odo since I first saw it, but I've always been wary: the devil is in the details. I'm curious when and how information gets lost during each transfer because of the peculiarities of each format, or how those decisions are made and exposed.

Min/max limits, truncation, nulls, floating point precision, encodings, picking CHAR vs VARCHAR or string vs categorical, metadata like indices, etc. are some of the hard problems behind bulk loading.
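
To make a few of these concrete, here is a small standard-library sketch of three ways a value can change silently on the way into a narrower column. The helper names are hypothetical, and nothing here reflects how Odo itself handles these cases.

```python
import struct


def float32_roundtrip(x):
    """Store x as a 4-byte float and read it back, as a single-precision column would."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def fits_int32(n):
    """Return True if n fits in a signed 32-bit integer column."""
    return -2**31 <= n <= 2**31 - 1


def truncate_to(s, width):
    """Return (stored value, whether characters were lost) for a VARCHAR(width)-style limit."""
    return s[:width], len(s) > width


print(float32_roundtrip(0.1) == 0.1)   # precision: does 0.1 survive the narrower float?
print(fits_int32(2**31))               # range: one past the largest 32-bit value
print(truncate_to("shapeshifter", 5))  # truncation: value cut to the column width
```

Checks like these are cheap to run on a sample before a bulk load, which is one way to find out where a given transfer would lose information.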

misterdata · on Jan 4, 2017

Very nice! I have been working on a similar library in Swift [1] which does this, but also has a nice user interface on Mac and (soon) iOS [2]. Coincidentally also uses a Star Trek themed name :-)

[1] http://github.com/pixelspark/warp [2] https://warp.one

panic · on Jan 4, 2017

Neat! Unless I'm misunderstanding, I think Odo is more general, though, in that you don't need to pass data through any sort of unified Dataset protocol: there's a graph of direct translators between formats.
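
A toy sketch of the "graph of direct translators" idea: each edge is a function from one format to another, and a conversion is a path through the graph. The format names and converters below are placeholders, and the unweighted breadth-first search is a simplification. This is not Odo's actual code; a real router would likely weight edges by cost.

```python
from collections import deque

# Toy registry of direct converters: (source format, target format) -> function.
CONVERTERS = {
    ("csv_text", "rows"): lambda text: [line.split(",") for line in text.splitlines()],
    ("rows", "dicts"): lambda rows: [dict(zip(rows[0], r)) for r in rows[1:]],
    ("dicts", "names"): lambda dicts: [d["name"] for d in dicts],
}


def find_path(src, dst):
    """Breadth-first search for the shortest chain of direct converters."""
    queue = deque([[src]])
    visited = {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for (a, b) in CONVERTERS:
            if a == path[-1] and b not in visited:
                visited.add(b)
                queue.append(path + [b])
    raise ValueError("no conversion path from {} to {}".format(src, dst))


def convert(data, src, dst):
    """Apply each converter along the path from src to dst in turn."""
    path = find_path(src, dst)
    for a, b in zip(path, path[1:]):
        data = CONVERTERS[(a, b)](data)
    return data


path = find_path("csv_text", "names")
names = convert("id,name\n1,odo\n2,kira", "csv_text", "names")
print(path)
print(names)
```

Adding a new format then means registering one or two edges instead of teaching every existing format about it.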

0x54MUR41 · on Jan 4, 2017

Well done.

By the way, I have a problem opening that site: Mozilla Firefox reports a problem with its SSL certificate.

merqurio · on Jan 3, 2017

I use odo a lot to take any data source into pandas or blaze and it's been great so far. For me it's the lazy and easy way of moving data around.

eximius · on Jan 4, 2017

I wish it supported large csv to partitioned parquet. THAT is something I need a good solution for.

quasiben · on Jan 4, 2017

Are you familiar with fastparquet (https://github.com/dask/fastparquet) and pyarrow (https://pyarrow.readthedocs.io/en/latest) ?

eximius · on Jan 4, 2017

Not until you linked me. :) They look like good steps but not quite what I need, for different reasons. I'll have to check when I get back to the office next week.

ColanR · on Jan 4, 2017

Your link is malformed. https://github.com/dask/fastparquet

ColanR · on Jan 4, 2017

...and it's fixed.

nl · on Jan 4, 2017

(Py)Spark? I run it locally and do that kind of operation pretty frequently.

eximius · on Jan 4, 2017

On what scale?

Part of the problem is the scale, but another part is that writing partitioned parquets seems poorly documented (I would love corrections, I spent a decent amount of time last week looking for good information)

nl · on Jan 4, 2017

I was recently working with a 38GB compressed text dataset locally. I can't remember how big it was in parquet.

It's literally one line of code. See http://labs.vistarmedia.com/2016/12/27/indexing-json-logs-wi... for an example (except you write to a local file system rather than HDFS).
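
A sketch of what that can look like in PySpark, writing to the local file system. The function name, paths, and partition column are placeholders, and it assumes a local PySpark install with a working Java runtime. It is not taken from the linked post.

```python
def csv_to_partitioned_parquet(csv_path, out_dir, partition_col):
    """Read a CSV with a header row and write it as Parquet partitioned by one column."""
    # Imported lazily so this module still loads where PySpark is not installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[*]")  # run on the local machine, using all cores
        .appName("csv-to-partitioned-parquet")
        .getOrCreate()
    )
    try:
        # inferSchema costs an extra pass over the data; pass an explicit schema to avoid it.
        df = spark.read.csv(csv_path, header=True, inferSchema=True)
        # The "one line": one subdirectory per distinct value of partition_col.
        df.write.partitionBy(partition_col).parquet(out_dir, mode="overwrite")
    finally:
        spark.stop()
```

A call would look like `csv_to_partitioned_parquet("events.csv", "events_parquet", "year")`, with both names hypothetical. The output directory then holds subdirectories named like `year=2016/`, each containing Parquet part files.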

eximius · on Jan 4, 2017

That looks very helpful, I'll have to check it out!

kevinwang · on Jan 3, 2017

Oh my god it looks beautiful