Hacker News | digisth's comments

If one is already well-versed in multiple areas of software technology (especially development and database administration), this is an excellent book. It surveys the landscape of software data storage technologies, talks about (at a modest level of depth) some of the theory behind things like quorums in distributed database systems, resiliency/redundancy strategies during data loss, and a host of other interesting topics.

I'd consider its level of depth somewhere in the middle between specialist books and 10k-foot overview books. I recommend it to anyone who has been a software developer or DBA for 5+ years, as I think they'd get the most value out of it.


If one prefers to use vim as a pager for psql, try this:

https://unencumberedbyfacts.com/2016/01/04/psql-vim-happy-fa...

I've been using it for a few months, and it works great for me.


The newer matching services (as opposed to boards) are all worth checking out: Hired, Vettery, Underdog.io. I found many good leads through all of them, and my current position is through one of them (Vettery.)

AngelList Jobs is also a place to find interesting positions (startup-centric ones in this case, as one might expect.)


My experience with placement services is that none of these people seem to work with remote candidates, and many have a narrow focus on the Bay Area. They are also high-friction (since now you need to interview and go through the process of 8 placement services instead of cutting out the middleman and just applying to the 8 places you'd like to work) and pretty restrictive in their interpretation of candidacies.


These (and probably most) services are currently geared toward technology hubs and in-office work, that's true. That's still the norm in the industry, so they reflect it. I'm in NYC and saw plenty of outreach.

For remote, remoteok.io seemed pretty good when I used it.


Location: New York, NY

Remote: OK

Willing to relocate: No

Technologies: Python, Django, Ruby, Rails, AWS, Linux

Résumé/CV: http://www.panix.com/~sth/resume2017.docx

Email: spencer.hoffman@gmail.com

Interested in a backend role, especially web/data API building and/or data processing, broadly construed.


Do you know of a source that compares these different libraries in terms of capabilities, focus/use cases, size limits, performance, format support, etc.?

Googling turned up very little for me.

TIA

Edit: libraries mentioned in thread:

PMML, Arrow, Dill, marshmallow, pytables, parquet/fastparquet (and pickle, obviously)


No, I don't, but some of these are apples and oranges, that was part of my point. You're conflating many different types of things.

Specifically, the ones I talked about are for storing large tabular datasets on disk. Stuff that lays out data on disk so that it's easy and efficient to query only a part of the dataset, e.g. only certain columns or only certain rows that match a predicate or within a range of indexes. These can store hundreds of GB, no problem. They often have some sort of compression, like LZ, snappy or blosc that has relatively low CPU overhead while giving decent compression. I tried to separate the file formats (which are readable from other languages) from the Python libraries that write them. For this, I'd default to pytables / HDF5, barring some specific use case where you'd already know what other one you need.

Dill / pickle are for serializing generic Python objects. I wouldn't really use them to store anything big, but they're very convenient for complicated data structures, like hierarchies of objects and classes. E.g. to save the current running state of your program. You don't have to think about storage formats and layouts and serialization routines; if you have a list of Python objects, you can pickle it. Pickle is built in, while dill is an external library that nicely handles a bunch more edge cases.
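To make the pickle case concrete, here's a minimal sketch using only the standard library. The `state` dict is a made-up stand-in for "the current running state of your program"; it deliberately includes a shared reference and a cycle, which a plain text format would not round-trip without extra work.

```python
import pickle

# Hypothetical program state: nested built-in containers with a
# shared reference (`shared`) and a cycle (`state["self"]`).
shared = {"retries": 3}
state = {
    "jobs": [("fetch", shared), ("parse", shared)],
    "seen": {1, 2, 3},
}
state["self"] = state

# Serialize to bytes. pickle.dump(obj, f) does the same to a file
# opened in binary mode.
blob = pickle.dumps(state)

# Restore an independent copy with the same internal structure.
restored = pickle.loads(blob)
```

Instances of your own classes work the same way, as long as the class can be imported when you load. Only unpickle data you trust, since loading a pickle can run arbitrary code.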

PMML seems like an XML based format specifically for trained machine learning models. Don't really know much about this.


The rule of thumb I've always used for when to use OO is "will there be more than one extant object at once or not?" If yes, and especially if these objects need real behavior, then use OO.

If you're essentially going through one object at a time, then discarding them, you may just be doing conduit data processing, and so there's little advantage to using objects. I think what's missing in this (well-written) analysis is this distinction; if you're slurping data from one place, making a few changes (or especially if you're not making any), then sticking it into a DB or vice versa, OO may be the wrong choice.

Ask yourself while writing the code: "are these active, behavior-driven objects that need encapsulation and relatively sophisticated behaviors, or is this just data I'm doing some relatively simple processing on?"
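A rough sketch of the two cases, with made-up names (`clean_rows`, `Account`) that are only there for illustration. The first is the conduit case: each row is touched once and passed along, so a plain function over dicts is enough. The second is the case where several objects are alive at once and interact, which is where a class starts to pay for itself.

```python
import csv
import io


def clean_rows(lines):
    """Conduit style: read a row, tweak it, hand it on, forget it."""
    for row in csv.DictReader(lines):
        row["email"] = row["email"].strip().lower()
        yield row


class Account:
    """Object style: many instances live at once, each guarding its own state."""

    def __init__(self, owner, balance=0):
        self.owner = owner
        self._balance = balance

    @property
    def balance(self):
        return self._balance

    def transfer_to(self, other, amount):
        # Behavior that involves two live objects and an invariant.
        if amount > self._balance:
            raise ValueError("insufficient funds")
        self._balance -= amount
        other._balance += amount


# Conduit usage: in real code the rows would go straight into a DB insert.
raw = io.StringIO("name,email\nAda, ADA@Example.com \n")
rows = list(clean_rows(raw))

# Object usage: two objects exist together and interact.
alice = Account("alice", 100)
bob = Account("bob")
alice.transfer_to(bob, 30)
```

Wrapping each CSV row in a class here would add ceremony without adding behavior, which is the distinction the question above is meant to tease out.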


I have a pile of links for getting started with DL in my comment history you can use: https://news.ycombinator.com/item?id=10676455

What really helped advance my understanding from zero to knowledgeable novice was rewriting some existing code line by line (using expanded variable names and comments), and thinking about each line and what it does as you go. It's the software development equivalent of Hunter S. Thompson re-typing The Great Gatsby just to get the feel of writing a great novel. Here's one I did based on Denny Britz's tutorial:

Britz's Original: http://www.wildml.com/2015/09/implementing-a-neural-network-...

My version: https://gist.github.com/sthware/c47824c116e6a61a56d9

HTH


Mixins have their place, but perhaps we are at the point where we need general advice (like the classic "Prefer composition over inheritance") for mixins ("Prefer traits over mixins"), especially as there's a good analog between the two. Matthijs Hollemans wrote an article which argues for that as well:

http://matthijshollemans.com/2015/07/22/mixins-and-traits-in...



If you want to know more about RNNs in general, I can't recommend watching the videos/reading the notes from this course enough:

http://cs224d.stanford.edu/syllabus.html

If you want something more basic to get your head around NNs, I recommend Denny Britz's "Neural Networks from Scratch":

http://www.wildml.com/2015/09/implementing-a-neural-network-...

I created a gist with a heavily commented version of his code:

https://gist.github.com/sthware/c47824c116e6a61a56d9

