This reads like a pretty "wet-behind-the-ears" professional who doesn't know what he doesn't know.
> There's no Java awfulness like ... instead it's just `cars = []`
I mean, there are very good reasons for static typing. And if he were using Kotlin, he could specify whether the variable `cars` was itself immutable and whether the list was immutable (`val/var cars : List<String>/MutableList<String>`).
> notebooks
Yeah, Jupyter kernels exist for almost every language. This is not a Python advantage.
> debugging
Good IDEs have the ability to set breakpoints, inspect variables, test methods, etc.
> type hints
"oh, forget what I said earlier about how Java had ugly boilerplate, now I have an `import` and a type def after all" - except nothing here is actually enforced
> parallelism
parallelism is relative... a lot of compiled JVM code will run much faster than Python to start with, and even with `multiprocessing`, Python won't catch up (and JVM languages have their own concurrency solutions, of course)
> I've put a large chunk of my money in leveraged index funds and etfs.
Written by a person who's never seen the slightest hint of a bear market, or rising interest rates. That's ok, you wouldn't be the first smart person to be seduced by leverage: https://www.investopedia.com/terms/m/myron-scholes.asp
> Stimulants like caffeine, adderall, and modafinil are magic... People do stay on adderall and modafinil indefinitely
Look, I'm no doctor (and I'm aware I'm out of the loop on things like this), but mental & concentration stimulants are the kinds of things associated with old people, not recent graduates.
This is an oddly snarky response to someone just sharing their experience, but you seem to be reading it matter-of-factly. She's upfront about having just graduated, and having two years of experience with a math background and not a CS background.
She clearly covered a lot of good ground in that time, and even took the time to write a 6000+ word article.
I don't think it's controversial that you can be much more concise with python. My experience first learning Java was that everything was 2-3x as verbose as in python. The difference is smaller if you're using type hints in python, but it's still more concise.
I talked about repl's/notebooks for other languages. They're still an especially great tool for python/data science since they make it very easy to visualize data and share analyses.
I played around with breakpoints in pycharm and I don't think it would work for me. You need to run your code from pycharm in debug mode for the breakpoint to trigger, whereas I always run things from the command line or a notebook.
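For what it's worth, the standard library's pdb doesn't need IDE integration; a minimal sketch (script name made up) that drops into the debugger when run as plain `python demo.py` from the command line:

    # demo.py - hit a debugger breakpoint without an IDE.
    def running_total(xs):
        total = 0
        for x in xs:
            breakpoint()  # Python 3.7+; equivalent to `import pdb; pdb.set_trace()`
            total += x
        return total

    print(running_total([1, 2, 3]))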
I believe you that there are times when python is slower. At least it's not noticeably slower when simply calling C behind the scenes or when you're i/o blocked anyway.
Re investing, I mean, everyone has seen phenomenal returns since they were born: this century is unprecedented. Also there was the pandemic crash very recently, so everyone has experienced an extremely harsh (albeit brief) bear market too. LTCM was like 100x-leveraged, which I would not advocate for, since you'll almost certainly get wiped out if you hold that position for more than a few hours...
Eh, lots of kids have ADD, and something like 10% of college students used adderall in 2016 according to the first hit on Google. In any case they've been magic for me the few times I've tried them, e.g. working 12+ good hours in a day.
Comparing net worth when you're young makes no sense. It only makes sense in 20 years after things have had time to play out, and even then you still have a lot of time left.
Even if you have a far superior strategy and an equal amount of starting capital you might not be that far ahead of your peers after only a few years, and even if you are there is no guarantee your strategy will continue working.
In reality the amount of starting capital, income and investment strategy will vary wildly among your peers so almost any comparison is a wild extrapolation.
FWIW I'm younger than you and still have negative net worth (loans), but growth rate > net worth. I have many years left to earn money, invest it and let it compound.
I don't know about that, there are lots of people around my age like OP that aren't well represented in personal finance advice columns. My net worth is only around $300k as of today, which I'd imagine is on the lower end of most of these kinds of people (Dartmouth grads, ubermensch, Google engineers etc).
You're taking the top 0.1% of people your age that everything went right for and comparing yourself to them.
The only purpose to this is to throw a pity party for yourself.
Again, it doesn't make sense to compare net worth when you're young. Even if they have more money right now, if you are compounding better you will win in the long term.
Seems like "data scientist" is now code for "junior Python developer".
I need a person who actually knows statistics, how to integrate, what Bayes' theorem means in practice, the difference between a confidence and a credible interval, etc.
What would this person be called in 2021? (I understand that searching for "data scientist" candidates would be a waste of my time.)
> Seems like "data scientist" is now code for "junior Python developer".
No, it isn't.
> I need a person who actually knows statistics, how to integrate...
Maybe someone with a strong mathematical education and commercial experience in research... like the author of the post? Just because they focused on some more basic SWE concepts in a blog post doesn't mean that they don't know statistics.
> What would this person be called in 2021?
Data scientist, research scientist, probably some other titles too.
Forgive me, but the blog is titled "what I learned from my first two years as a data scientist", and I'd rather trust the author on their word and not speculate.
(And my experience working with dozens of data scientists corroborates this - "data science" means basic Python programming and the kind of boring trial-and-error feature engineering tasks you'd typically assign to a junior software developer.)
I want a person who actually uses statistics and math to drive business decisions, not just someone who took some statistics courses in university before becoming a software developer.
P.S. Honest question, really. I don't want to sift through 500 resumes before finding the person I want.
Ha, I did maths at uni and when I first graduated, the role you described was basically my dream job.
What title you use is up to you; “data scientist” will probably net you a lot of people who mostly want to build machine-learning models, so maybe “applied statistician” or “data analyst” is a better bet? Hiring out of a local uni is possibly also a decent choice if you can deal with having to bring some maths/stats grads’ business knowledge up to scratch.
"Useless" accurately describes the majority of published books in the industry these days. So you can spend a lot of time finding the right book... or you can just dive straight for the spec.
The remote connection to safety (did you mean security?) would be that static source analysis tools don't work as reliably with dynamic languages. That matters at Google. But you don't even have to think as hard about it: Python is simply comparatively slow and inefficient. Google's fleet is large. It pays off to use more efficient languages.
(There's also the whole thing about Python being largely single threaded and computers being very wide these days, as well as being a terrible memory hog and memory making up half the cost of servers.)
I felt like this article was a bit light on data scientist specific advice, and while I am not one, I do herd them for a living, so I thought I'd put some random thoughts together:
1) Quite often you are not training a machine to be the best at something. You're training a machine to help a human to be the best at something. Be sure to optimise for this when necessary.
2) Push predictions, don't ask others to pull them. Focus on decoupling your data science team and their customers early on. The worst thing that can happen is duplicating logic, first to craft features during training, and later to submit those features to an API from clients. Even if you hide the feature engineering behind the API, this can either slow down predictions, or still require bulky requests from the client in the case of sequence data. Instead, stream data into your feature store, and stream predictions out onto your event bus. Then your data science team can truly be a black box.
3) Unit test invariants in your model's outputs. While you can't write tests for exact outputs, you can say "such and such a case should output a higher value than some other case, all things being equal" (see the sketch after this list). When your model disagrees, do at least consider that the model may be correct though.
4) Do ablation tests in reverse, and unit test each addition to your model's architecture to prove it helps.
5) Often you will train a model on historical data, and content yourself that all future predictions will be outside this training set. However, don't forget that sometimes updates to historical data will trigger a prediction to be recalculated, and this might be overfit. Sometimes you can serve cached results, but small feature changes make this harder.
6) Your data scientists are probably the people who are most intimate with your data. They will be the first to stumble on bugs and biases, so give them very good channels to report QA issues. If you are a lone data scientist in a larger organisation, seek out and forge these channels early.
7) Don't treat labelling tools as grubby little hacked together apps. Resource them properly, make sure you watch and listen to the humans building and using them.
8) Have objective ways of comparing models that are thematically similar but may differ in their exact goal variables. If you can't directly compare log loss or whatever like-for-like, find some more external criteria.
9) Much of your job is building trust in your models with stakeholders. Don't be afraid to build simple stuff that captures established intuitions before going deep - show people the machine gets the basics first.
10) If you're struggling to replicate a result from a paper, either with or without the original code, welcome to academia.
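For point 3, a minimal sketch of what an invariant test can look like (pytest style; `predict_price` is a dummy stand-in so the file runs on its own, not anything from the article):

    # test_invariants.py - test relative behaviour, not exact outputs.
    def predict_price(features: dict) -> float:
        # Dummy stand-in for a real model.
        return 50_000 * features["bedrooms"] + 120 * features["sqft"] - 300 * features["age"]

    def test_extra_bedroom_not_cheaper():
        base = {"bedrooms": 2, "sqft": 900, "age": 10}
        bigger = {**base, "bedrooms": 3}
        # All else being equal, an extra bedroom should not lower the prediction.
        assert predict_price(bigger) >= predict_price(base)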
Good list. One thing I'd add, which you kind of hint at:
Good practices from software engineering are just as applicable to Data Science. In particular:
Notebooks are great for performing an EDA and testing out new concepts. They're not great for running production code. Put your non-once-off code in regular source code files and source control it.
Break your code into separately testable and composable functions. Write unit tests to verify behavior where you can. Speaking from experience, you almost certainly will find bugs.
Implement a peer review process for the methodology used and the code. Approaches should be explainable and justifiable. Bugs and poor assumptions can lead to incorrect results.
Focus on making your model training process end-to-end reproducible. Document the training data used. Document the configuration used. Link back to the commit hash of the exact code used. Make sure your environment is reproducible.
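To illustrate the reproducibility point, a minimal sketch of recording a run's metadata (config fields and file names are made up; assumes the code lives in a git checkout):

    # record_run.py - capture enough metadata to reproduce a training run.
    import json
    import subprocess
    import sys
    from datetime import datetime, timezone

    config = {"learning_rate": 3e-4, "batch_size": 64, "train_data": "data/train.parquet"}

    run_record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "python": sys.version,
        "git_commit": subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip(),
        "config": config,
    }

    with open("run_record.json", "w") as f:
        json.dump(run_record, f, indent=2)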
I believe the available evidence suggests that over the long term, a small amount of leverage increases returns. Obviously it increases volatility as well, but if your time horizon is long, you can cope with that.
You do not want to invest in ETFs that are themselves leveraged & rebalance daily however, that’s going to eat all your money if you hold them for any length of time - those products are designed to be held for short periods only.
The expense ratios of leveraged ETFs are nothing compared to the volatility drag. There are far cheaper and more effective ways than leveraged ETFs for buy-and-hold investors to obtain leverage, notably LEAPS and index futures. (Disclaimer: Not investing advice, do your own research, etc.)
We don't like to talk about those. It's just not done in polite company. Like that distro that concerns itself with hats. RPMs should apply to cars, never packages.
I was hoping for more DS related stuff. It almost sounds like you're learning to be a SWE!
The investing section is curious.
> On brilliant advice from the man who arguably went from mere millions to decabillions faster than anyone in modern history, I've put a large chunk of my money in leveraged index funds and etfs
Who are you referencing here and did you do any DD other than taking his advice?
I'm wondering what the downsides of leveraged index funds and ETFs are since I'm not sure how they work.
I did some backtesting simulations that made leveraged investing look pretty awesome. The effective borrow rate for funds like spxl is crazy low, way better than if I were to borrow myself. (Also, fwiw I was pretty conservative and am overall only around 2x-leveraged.)
The internet is very opposed to leveraged investing imo, but I think most of the concerns are pretty dumb. There was this one blog post where a guy ran ten simulations of his own: most showed the leveraged portfolio doing comparably to the baseline, a couple showed it doing worse, and one saw the leveraged portfolio 100x'ing or something... and he concluded that it wasn't worth it??
People will also appeal to volatility drag as a superficially sophisticated knockdown (in short, imagine all four two-step paths in which the market goes up or down by 10% at each step. Then the baseline market averages out to (.81 + .99 + .99 + 1.21)/4 = 1, and a 3x leveraged portfolio averages out to (.49 + .91 + .91 + 1.69)/4 = 1. Volatility drag is those two middle worlds where the leveraged portfolio does badly despite the market as a whole basically ending up where it started.)
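The toy example above, spelled out as a few lines of code (pure arithmetic, nothing market-specific):

    # Four equally likely two-step paths of +/-10% market moves.
    paths = [(-0.10, -0.10), (-0.10, 0.10), (0.10, -0.10), (0.10, 0.10)]

    def terminal_value(path, leverage):
        value = 1.0
        for r in path:
            value *= 1 + leverage * r
        return value

    for lev in (1, 3):
        values = [terminal_value(p, lev) for p in paths]
        print(lev, [round(v, 2) for v in values], round(sum(values) / len(values), 4))
    # 1x: [0.81, 0.99, 0.99, 1.21] -> mean 1.0
    # 3x: [0.49, 0.91, 0.91, 1.69] -> mean 1.0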
Use the keyword “HedgeFundie” (the username that first started the discussion on Bogleheads; most now refer to this strategy by his name) to search discussions about leveraged ETFs on the Bogleheads forum and Reddit. There are over 300 pages worth of discussion on this topic on Bogleheads alone.
Also check out information about the permanent portfolio put forward by Bridgewater Associates.
> I was hoping for more DS related stuff. It almost sounds like you're learning to be a SWE!
It kind of felt that way too :). Some more data sciency things were learning how transformers work, hacking pytorch, using visualization tools like tensorboard and wandb, web scraping, better using parallelism, tuning hyperparameters (mostly the learning rate tbh), better fluency with the command line than I assume most swe's need, getting very comfortable inspecting data, making experiments more reproducible, reading lots of papers, writing papers, and trying (somewhat half-heartedly) to get published.
Almost definitely Warren Buffett, who went so far as to make a very public bet that index funds performed better than hedge funds (and thus most active investing). You can see details on the deal and outcome here https://www.investopedia.com/articles/investing/030916/buffe...
Also, he's got tens of billions.
Edit: Or maybe not on reflection about the "faster than anyone" bit. I dunno.
The leverage means they go up faster, but they also go down faster.
In regular investing you feel good when you get a modest return, and bad when you experience a modest loss. With leveraged funds you feel like a genius when the market goes up and like a complete moron when you get wiped out.
It's worth noting that "except:" is not the same as "except Exception:" in Python. A bare "except:" catches BaseException, which is often not what you want: BaseException covers SystemExit, amongst other things.
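A minimal illustration of the difference:

    import sys

    try:
        sys.exit(1)          # raises SystemExit, which derives from BaseException
    except:                  # a bare except catches it, so the program keeps going
        print("bare except swallowed SystemExit")

    try:
        sys.exit(1)
    except Exception:        # SystemExit is not an Exception, so this does NOT catch it
        print("never reached")  # the program actually exits on the sys.exit above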
import java.util.ArrayList;
var cars = new ArrayList<String>();
Python would be:
cars: list[str] = []
The big difference seems to be that ArrayList is not a "default" data structure in Java, but it is in Python. While I like the Python example better, I'm not offended by the modern Java.
The author goes full circle by introducing types in python.
Almost every Java developer uses an IDE with autocomplete, so even where they wouldn't use a `var` they'd just type the interface they want, add the assignment operator, and then let the IDE suggest the implementation, add the import, and even format the line/file.
Such trivialities don't help when attempting to distinguish the flexibility Python brings for arbitrary/quick code.
Actually, as a Java dev, I quite like modern Java for data work. Streams + static typing make aggregating data a breeze.
I assume for more advanced work like the one OP mentions you'd still want to stick to Python because of the superior ecosystem, but I was pleasantly surprised by Java.
Can't believe I'm writing this, but I actually like modern Java.
I've only followed Java from far away (last time I used it was Java 7 in college) but modern Java seem like a very nice language. There are also Graal, alternative ecosystems to Spring, good things like that.
I think you're in the minority thinking that Python is "killed" by adding static (not strong, Python is already strongly typed) type hints. They are also optional, so you're free to not use them.
To go back to the code example, if you want to express the same thing in Python as in Java, you have to add a type hint to be able to statically check the code. A good thing about Python is that you can choose when and if you want to use a static type hint, while in Java you're forced to use them.
I love the tip about using the python debugger "pdb". This reminds me of the similar Ansible debug feature (e.g. "debugger: on_failed") which lets you jump into an in-flight playbook.
I remember using OCaml a bit and it has a time-travelling debugger. You launch the program with the debugger, it crashes and then you can inspect the program as you wish. I was really impressed by this.
Someone wrote a time-travel debugger for Python, but it seems like a one-off project and I'm not sure if there's an actively maintained/developed tool. https://github.com/TomOnTime/timetravelpdb
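Plain pdb already covers the "crash, then poke around" part of that workflow (minus the time travel); a minimal sketch:

    import pdb
    import sys

    def divide(a, b):
        return a / b

    try:
        divide(1, 0)
    except ZeroDivisionError:
        # Drop into pdb at the frame where the exception was raised,
        # so you can inspect locals, walk the stack, etc.
        pdb.post_mortem(sys.exc_info()[2])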
Am I reading this correctly: you were hired at MS with only cursory python knowledge? I’m very jealous and thinking more and more about going into IT after I leave uni for engineering (not software engineering). I know python well, along with a few other things I’ve picked up over the years (emacs/elisp, vim/vimscript, LaTeX formatting, JS, Common Lisp, APL, bash scripting, mathematica and matlab etc.). Would this be enough to land a position like yours? I am lacking in the AI area, but I can begin that on my weekends.
She probably did an interview focused on data analysis, statistics and experimentation. Her python skills probably weren't that relevant, as long as she can use libraries.
Of course I have. Being able to integrate everything into a single piece of software would be great. Every couple of months I’ve tried using notebooks in VSCode, and each time I’ve switched back. It’s so bittersweet