This reads like a pretty "wet-behind-the-ears" professional who doesn't know what he doesn't know.
> There's no Java awfulness like ... instead it's just `cars = []`
I mean, there are very good reasons for static typing. And if he were using Kotlin, he could specify whether the variable `cars` was itself immutable and whether the list was immutable (`val/var cars : List<String>/MutableList<String>`).
> notebooks
Yeah, Jupyter kernels exist for almost every language. This is not a Python advantage.
> debugging
Good IDEs have the ability to set breakpoints, inspect variables, test methods, etc.
> type hints
"oh, forget what I said earlier about how Java had ugly boilerplate, now I have an `import` and a type def after all" - except nothing here is actually enforced
> parallelism
parallelism is relative... a lot of compiled JVM code will run much faster than Python to start with, and even with `multiprocessing`, Python won't catch up (and JVM languages have their own concurrency solutions, of course)
> I've put a large chunk of my money in leveraged index funds and etfs.
Written by a person who's never seen the slightest hint of a bear market, or rising interest rates. That's ok, you wouldn't be the first smart person to be seduced by leverage: https://www.investopedia.com/terms/m/myron-scholes.asp
> Stimulants like caffeine, adderall, and modafinil are magic... People do stay on adderall and modafinil indefinitely
Look, I'm no doctor (and I'm aware I'm out of the loop on things like this), but mental & concentration stimulants are the kinds of things associated with old people, not recent graduates.
This is an oddly snarky response to someone just sharing their experience, but you seem to be reading it matter-of-factly. She's upfront about having just graduated, and having two years of experience with a math background and not a CS background.
She clearly covered a lot of good ground in that time, and even took the time to write a 6000+ word article.
I don't think it's controversial that you can be much more concise with python. My experience first learning Java was that everything was 2-3x as verbose as in python. The difference is smaller if you're using type hints in python, but it's still more concise.
I talked about repl's/notebooks for other languages. They're still an especially great tool for python/data science since they make it very easy to visualize data and share analyses.
I played around with breakpoints in pycharm and I don't think it would work for me. You need to run your code from pycharm in debug mode for the breakpoint to trigger, whereas I always run things from the command line or a notebook.
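For what it's worth, the standard library's pdb doesn't need IDE integration; a minimal sketch (script name made up) that drops into the debugger when run as plain `python demo.py` from the command line:

    # demo.py - hit a debugger breakpoint without an IDE.
    def running_total(xs):
        total = 0
        for x in xs:
            breakpoint()  # Python 3.7+; equivalent to `import pdb; pdb.set_trace()`
            total += x
        return total

    print(running_total([1, 2, 3]))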
I believe you that there are times when python is slower. At least it's not noticeably slower when simply calling C behind the scenes or when you're i/o blocked anyway.
Re investing, I mean, everyone has seen phenomenal returns since they were born: this century is unprecedented. Also there was the pandemic crash very recently, so everyone has experienced an extremely harsh (albeit brief) bear market too. LTCM was like 100x-leveraged, which I would not advocate for, since you'll almost certainly get wiped out if you hold that position for more than a few hours...
Eh, lots of kids have ADD, and something like 10% of college students used adderall in 2016 according to the first hit on Google. In any case they've been magic for me the few times I've tried them, e.g. working 12+ good hours in a day.
Comparing net worth when you're young makes no sense. It only makes sense in 20 years after things have had time to play out, and even then you still have a lot of time left.
Even if you have a far superior strategy and an equal amount of starting capital you might not be that far ahead of your peers after only a few years, and even if you are there is no guarantee your strategy will continue working.
In reality the amount of starting capital, income and investment strategy will vary wildly among your peers so almost any comparison is a wild extrapolation.
FWIW I'm younger than you and still have negative net worth (loans), but growth rate > net worth. I have many years left to earn money, invest it and let it compound.
I don't know about that, there are lots of people around my age like OP that aren't well represented in personal finance advice columns. My net worth is only around $300k as of today, which I'd imagine is on the lower end of most of these kinds of people (Dartmouth grads, ubermensch, Google engineers etc).
You're taking the top 0.1% of people your age that everything went right for and comparing yourself to them.
The only purpose to this is to throw a pity party for yourself.
Again, it doesn't make sense to compare net worth when you're young. Even if they have more money right now, if you are compounding better you will win in the long term.
Seems like "data scientist" is now code for "junior Python developer".
I need a person who actually knows statistics, how to integrate, what Bayes' theorem means in practice, the difference between a confidence and a credible interval, etc.
What would this person be called in 2021? (I understand that searching for "data scientist" candidates would be a waste of my time.)
> Seems like "data scientist" is now code for "junior Python developer".
No, it isn't.
> I need a person who actually knows statistics, how to integrate...
Maybe someone with a strong mathematical education and commercial experience in research... like the author of the post? Just because they focused on some more basic SWE concepts in a blog post doesn't mean that they don't know statistics.
> What would this person be called in 2021?
Data scientist, research scientist, probably some other titles too.
Forgive me, but the blog is titled "what I learned from my first two years as a data scientist", and I'd rather trust the author on their word and not speculate.
(And my experience working with dozens of data scientists corroborates this - "data science" means basic Python programming and the kind of boring trial-and-error feature engineering tasks you'd typically assign to a junior software developer.)
I want a person who actually uses statistics and math to drive business decisions, not just someone who took some statistics courses in university before becoming a software developer.
P.S. Honest question, really. I don't want to sift through 500 resumes before finding the person I want.
Ha, I did maths at uni and when I first graduated, the role you described was basically my dream job.
What title you use is up to you; “data scientist” will probably net you a lot of people who mostly want to build machine-learning models, so maybe “applied statistician” or “data analyst” is a better bet? Hiring out of a local uni is possibly also a decent choice if you can deal with having to bring some maths/stats grads’ business knowledge up to scratch.
"Useless" accurately describes the majority of published books in the industry these days. So you can spend a lot of time finding the right book... or you can just dive straight for the spec.
The remote connection to safety (did you mean security?) would be that static source analysis tools don't work as reliably with dynamic languages. That matters at Google. But you don't even have to think as hard about it: Python is simply comparatively slow and inefficient. Google's fleet is large. It pays off to use more efficient languages.
(There's also the whole thing about Python being largely single threaded and computers being very wide these days, as well as being a terrible memory hog and memory making up half the cost of servers.)
I felt like this article was a bit light on data scientist specific advice, and while I am not one, I do herd them for a living, so I thought I'd put some random thoughts together:
1) Quite often you are not training a machine to be the best at something. You're training a machine to help a human to be the best at something. Be sure to optimise for this when necessary.
2) Push predictions, don't ask others to pull them. Focus on decoupling your data science team and their customers early on. The worst thing that can happen is duplicating logic, first to craft features during training, and later to submit those features to an API from clients. Even if you hide the feature engineering behind the API, this can either slow down predictions, or still require bulky requests from the client in the case of sequence data. Instead, stream data into your feature store, and stream predictions out onto your event bus. Then your data science team can truly be a black box.
3) Unit test invariants in your model's outputs. While you can't write tests for exact outputs, you can say "such and such a case should output a higher value than some other case, all things being equal" (see the sketch after this list). When your model disagrees, do at least consider that the model may be correct though.
4) Do ablation tests in reverse, and unit test each addition to your model's architecture to prove it helps.
5) Often you will train a model on historical data, and content yourself that all future predictions will be outside this training set. However, don't forget that sometimes updates to historical data will trigger a prediction to be recalculated, and this might be overfit. Sometimes you can serve cached results, but small feature changes make this harder.
6) Your data scientists are probably the people who are most intimate with your data. They will be the first to stumble on bugs and biases, so give them very good channels to report QA issues. If you are a lone data scientist in a larger organisation, seek out and forge these channels early.
7) Don't treat labelling tools as grubby little hacked together apps. Resource them properly, make sure you watch and listen to the humans building and using them.
8) Have objective ways of comparing models that are thematically similar but may differ in their exact goal variables. If you can't directly compare log loss or whatever like-for-like, find some more external criteria.
9) Much of your job is building trust in your models with stakeholders. Don't be afraid to build simple stuff that captures established intuitions before going deep - show people the machine gets the basics first.
10) If you're struggling to replicate a result from a paper, either with or without the original code, welcome to academia.
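For point 3, a minimal sketch of what an invariant test can look like (pytest style; `predict_price` is a dummy stand-in so the file runs on its own, not anything from the article):

    # test_invariants.py - test relative behaviour, not exact outputs.
    def predict_price(features: dict) -> float:
        # Dummy stand-in for a real model.
        return 50_000 * features["bedrooms"] + 120 * features["sqft"] - 300 * features["age"]

    def test_extra_bedroom_not_cheaper():
        base = {"bedrooms": 2, "sqft": 900, "age": 10}
        bigger = {**base, "bedrooms": 3}
        # All else being equal, an extra bedroom should not lower the prediction.
        assert predict_price(bigger) >= predict_price(base)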
Good list. One thing I'd add, which you kind of hint at:
Good practices from software engineering are just as applicable to Data Science. In particular:
Notebooks are great for performing an EDA and testing out new concepts. They're not great for running production code. Put your non-once-off code in regular source code files and source control it.
Break your code into separately testable and composable functions. Write unit tests to verify behavior where you can. Speaking from experience, you almost certainly will find bugs.
Implement a peer review process for the methodology used and the code. Approaches should be explainable and justifiable. Bugs and poor assumptions can lead to incorrect results.
Focus on making your model training process end-to-end reproducible. Document the training data used. Document the configuration used. Link back to the commit hash of the exact code used. Make sure your environment is reproducible.
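To illustrate the reproducibility point, a minimal sketch of recording a run's metadata (config fields and file names are made up; assumes the code lives in a git checkout):

    # record_run.py - capture enough metadata to reproduce a training run.
    import json
    import subprocess
    import sys
    from datetime import datetime, timezone

    config = {"learning_rate": 3e-4, "batch_size": 64, "train_data": "data/train.parquet"}

    run_record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "python": sys.version,
        "git_commit": subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip(),
        "config": config,
    }

    with open("run_record.json", "w") as f:
        json.dump(run_record, f, indent=2)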
I believe the available evidence suggests that over the long term, a small amount of leverage increases returns. Obviously it increases volatility as well, but if your time horizon is long, you can cope with that.
You do not want to invest in ETFs that are themselves leveraged & rebalance daily however, that’s going to eat all your money if you hold them for any length of time - those products are designed to be held for short periods only.
The expense ratios of leveraged ETFs are nothing compared to the volatility drag. There are far cheaper and more effective ways than leveraged ETFs for buy-and-hold investors to obtain leverage, notably LEAPS and index futures. (Disclaimer: Not investing advice, do your own research, etc.)
We don't like to talk about those. It's just not done in polite company. Like that distro that concerns itself with hats. RPMs should apply to cars, never packages.
I was hoping for more DS related stuff. It almost sounds like you're learning to be a SWE!
The investing section is curious.
> On brilliant advice from the man who arguably went from mere millions to decabillions faster than anyone in modern history, I've put a large chunk of my money in leveraged index funds and etfs
Who are you referencing here and did you do any DD other than taking his advice?
I'm wondering what the downsides of leveraged index funds and ETFs are since I'm not sure how they work.
I did some backtesting simulations that made leveraged investing look pretty awesome. The effective borrow rate for funds like spxl is crazy low, way better than if I were to borrow myself. (Also, fwiw I was pretty conservative and am overall only around 2x-leveraged.)
The internet is very opposed to leveraged investing imo, but I think most of the concerns are pretty dumb. There was this one blog post where a guy ran ten simulations of his own: most showed the leveraged portfolio doing comparably to the baseline, a couple showed it doing worse, and one saw the leveraged portfolio 100x'ing or something... and he concluded that it wasn't worth it??
People will also appeal to volatility drag as a superficially sophisticated knockdown (in short, imagine all four two-step paths in which the market goes up or down by 10% at each step. Then the baseline market averages out to (.81 + .99 + .99 + 1.21)/4 = 1, and a 3x leveraged portfolio averages out to (.49 + .91 + .91 + 1.69)/4 = 1. Volatility drag is those two middle worlds where the leveraged portfolio does badly despite the market as a whole basically ending up where it started.)
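The toy example above, spelled out as a few lines of code (pure arithmetic, nothing market-specific):

    # Four equally likely two-step paths of +/-10% market moves.
    paths = [(-0.10, -0.10), (-0.10, 0.10), (0.10, -0.10), (0.10, 0.10)]

    def terminal_value(path, leverage):
        value = 1.0
        for r in path:
            value *= 1 + leverage * r
        return value

    for lev in (1, 3):
        values = [terminal_value(p, lev) for p in paths]
        print(lev, [round(v, 2) for v in values], round(sum(values) / len(values), 4))
    # 1x: [0.81, 0.99, 0.99, 1.21] -> mean 1.0
    # 3x: [0.49, 0.91, 0.91, 1.69] -> mean 1.0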
Use the keyword “HedgeFundie” (the username that first started the discussion on Bogleheads; most now refer to this strategy by his name) to search discussions about leveraged ETFs on the Bogleheads forum and Reddit. There are over 300 pages worth of discussion on this topic on Bogleheads alone.
Also check out information about the permanent portfolio put forward by Bridgewater Associates.
> I was hoping for more DS related stuff. It almost sounds like you're learning to be a SWE!
It kind of felt that way too :). Some more data sciency things were learning how transformers work, hacking pytorch, using visualization tools like tensorboard and wandb, web scraping, better using parallelism, tuning hyperparameters (mostly the learning rate tbh), better fluency with the command line than I assume most swe's need, getting very comfortable inspecting data, making experiments more reproducible, reading lots of papers, writing papers, and trying (somewhat half-heartedly) to get published.
Almost definitely Warren Buffett, who went so far as to make a very public bet that index funds performed better than hedge funds (and thus most active investing). You can see details on the deal and outcome here https://www.investopedia.com/articles/investing/030916/buffe...
Also, he's got tens of billions.
Edit: Or maybe not on reflection about the "faster than anyone" bit. I dunno.
The leverage means they go up faster, but they also go down faster.
In regular investing you feel good when you get a modest return, and bad when you experience a modest loss. With leveraged funds you feel like a genius when the market goes up and like a complete moron when you get wiped out.
It's worth noting that "except:" is not the same as "except Exception:" in Python. A bare "except:" catches BaseException, which is often not what you want: BaseException covers SystemExit, amongst other things.
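A minimal illustration of the difference:

    import sys

    try:
        sys.exit(1)          # raises SystemExit, which derives from BaseException
    except:                  # a bare except catches it, so the program keeps going
        print("bare except swallowed SystemExit")

    try:
        sys.exit(1)
    except Exception:        # SystemExit is not an Exception, so this does NOT catch it
        print("never reached")  # the program actually exits on the sys.exit above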
import java.util.ArrayList;
var cars = new ArrayList<String>();
Python would be:
cars: list[str] = []
The big difference seems to be that ArrayList is not a "default" data structure in Java, but it is in Python. While I like the Python example better, I'm not offended by the modern Java.
The author goes full circle by introducing types in python.
Almost every Java developer uses an IDE with autocomplete, so even where they wouldn't use a `var` they'd just type the interface they want, add the assignment operator, and then let the IDE suggest the implementation, add the import, and even format the line/file.
Such trivialities don't help when attempting to distinguish the flexibility Python brings for arbitrary/quick code.
Actually, as a Java dev, I quite like modern Java for data work. Streams + static typing make aggregating data a breeze.
I assume for more advanced work like the one OP mentions you'd still want to stick to Python because of the superior ecosystem, but I was pleasantly surprised by Java.
Can't believe I'm writing this, but I actually like modern Java.
I've only followed Java from far away (last time I used it was Java 7 in college) but modern Java seem like a very nice language. There are also Graal, alternative ecosystems to Spring, good things like that.
I think you're in the minority thinking that Python is "killed" by adding static (not strong, Python is already strongly typed) type hints. They are also optional, so you're free to not use them.
To go back to the code example, if you want to express the same thing in Python as in Java, you have to add a type hint to be able to statically check the code. A good thing about Python is that you can choose when and if you want to use a static type hint, while in Java you're forced to use them.
I love the tip about using the python debugger "pdb". This reminds me of the similar Ansible debug feature (e.g. "debugger: on_failed") which lets you jump into an in-flight playbook.
I remember using OCaml a bit and it has a time-travelling debugger. You launch the program with the debugger, it crashes and then you can inspect the program as you wish. I was really impressed by this.
Someone wrote a time-travel debugger for Python, but it seems like a one-off project and I'm not sure if there's an actively maintained/developed tool. https://github.com/TomOnTime/timetravelpdb
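Plain pdb already covers the "crash, then poke around" part of that workflow (minus the time travel); a minimal sketch:

    import pdb
    import sys

    def divide(a, b):
        return a / b

    try:
        divide(1, 0)
    except ZeroDivisionError:
        # Drop into pdb at the frame where the exception was raised,
        # so you can inspect locals, walk the stack, etc.
        pdb.post_mortem(sys.exc_info()[2])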
Am I reading this correctly: you were hired at MS with only cursory python knowledge? I’m very jealous and thinking more and more about going into IT after I leave uni for engineering (not software engineering). I know python well, along with a few other things I’ve picked up over the years (emacs/elisp, vim/vimscript, LaTeX formatting, JS, Common Lisp, APL, bash scripting, mathematica and matlab etc.). Would this be enough to land a position like yours? I am lacking in the AI area, but I can begin that on my weekends.
She probably did an interview focused on data analysis, statistics and experimentation. Her python skills probably weren't that relevant, as long as she can use libraries.
Of course I have. Being able to integrate everything into a single piece of software would be great. Every couple of months I’ve tried using notebooks in VSCode, and each time I’ve switched back. It’s so bittersweet