An Introduction to Stock Market Data Analysis with R – Part 1

ktta · on March 28, 2017

I don't have much experience with any kind of stock market analysis/ HFT programming.

I've seen people use python (or R) a lot in online tutorials and courses (there's one on coursera[1] check it out). It's understandable that python is preferred since it's an easy language to get started with, but I've never really seen C/C++ used by people online (Although I've heard about the use of FPGAs by the hardcore guys)

Is the only reason C++ isn't recommended because it is difficult? The performance difference seems rather large to me. Especially if you're doing lots of number crunching, you would really benefit from using C/C++, wouldn't you?

Do people use them but I don't hear about them because anyone above 'tutorial level' doesn't share their code or talk about it?

Or are the R and Python Computational libraries close to the performance of C/C++ that it's a viable compromise?

[1]: https://www.coursera.org/learn/computational-investing

ryankennedyio · on March 28, 2017

I don't think any retail trader can trade at the scale and speed that would require C/C++ over Python. With lower capital and significantly higher transaction costs, retail traders (mostly) need holding periods of at least hours, so speed isn't really an issue. It's more about finding longer-term anomalies and deploying maybe a few dozen positions at most, as opposed to reacting to billions of quotes over any trading day.

I think Python is a great tool for finance, particularly with libraries like Pandas, Numpy, etc. Most of the work in those libraries is C/Fortran/etc anyway. Great community. For more 'academic' style work/exploration, I think R might have an edge, but I personally find it more difficult to write and maintain high quality code in R.

I'd like to just throw a quick plug in for an open source project I'm working on [1]. It's an event-driven trading library written in Python. We're pretty close to having live trade execution ready in an alpha state.

The motivation is that you can put together a pretty simple strategy in ~40 lines of code. The codebase is designed to be pretty flexible & modular. Mike from http://quantstart.com has some great resources & many of his recent articles use the same library.

[1]: https://github.com/mhallsmoore/qstrader

MR4D · on March 28, 2017

Given the network latency of 43ms between, say, Houston, and NYC, you are completely correct.

Source: https://ipnetwork.bgtmo.ip.att.net/pws/network_delay.html
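For a sense of why that figure cannot be engineered away, here is a rough back-of-the-envelope sketch of the physical lower bound. The city coordinates and the two-thirds-of-c fibre speed are assumptions for illustration, and the comment above does not say whether 43 ms is one-way or round-trip.

```python
import math

# Approximate city-centre coordinates in degrees (assumed for illustration).
HOUSTON = (29.76, -95.37)
NEW_YORK = (40.71, -74.01)

EARTH_RADIUS_KM = 6371.0
# Light in optical fibre travels at roughly two thirds of c
# (assumes a refractive index of about 1.5).
FIBRE_SPEED_KM_PER_S = 299_792.458 / 1.5


def great_circle_km(a, b):
    """Haversine distance between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


distance_km = great_circle_km(HOUSTON, NEW_YORK)
one_way_ms = distance_km / FIBRE_SPEED_KM_PER_S * 1000
round_trip_ms = 2 * one_way_ms

print(f"distance: {distance_km:.0f} km")
print(f"best-case round trip over fibre: {round_trip_ms:.1f} ms")
```

Under these assumptions the straight-line distance is roughly 2,300 km and the best-case round trip is in the low twenties of milliseconds, before any routing or switching overhead. No choice of programming language changes that.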

dagw · on March 28, 2017

HFT is only a small part of financial data analysis. Everybody I know who works with mathematical modelling in finance uses tools like SAS, Matlab, R and even Excel to build and run their models. If you're working at something like a pension fund and thinking years into the future, then shaving a few milliseconds or even minutes here and there just isn't a priority.

rampage101 · on March 28, 2017

People use Python for finance and time series because of how easy it is to read in the data. There is a library called pandas which allows you to read in a txt, csv, or excel file easily and do any type of array manipulation you want. The pandas library is written in C++, so it's basically a python wrapper.

Conversely I don't think C++ has this same type of library for reading in data in a single line.

While C++ is faster than Python overall, for most financial simulations Python is definitely fast enough.
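As a minimal sketch of the one-line read described above: the CSV contents and column names below are made up for illustration, and an in-memory string stands in for a file on disk.

```python
import io

import pandas as pd

# Hypothetical price data standing in for a CSV file on disk.
csv_text = """date,close
2017-03-24,100.0
2017-03-27,102.0
2017-03-28,101.0
"""

# The one-line read: parse the date column and use it as the index.
prices = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"], index_col="date")

# Column-wise manipulation: simple day-over-day returns.
prices["return"] = prices["close"].pct_change()

print(prices)
```

With a real file you would pass its path to `pd.read_csv` instead of the `io.StringIO` wrapper.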

jzwinck · on March 28, 2017

The Pandas library is not written in C at all actually. You can view the source here: https://github.com/pandas-dev/pandas/tree/master/pandas

It's mostly Python with a bit of Cython, and pull requests that are not pure Python are more likely to be rejected. There's basically zero C++ in Pandas itself.

em500 · on March 28, 2017

Pandas was started as a skunkworks project inside a big hedge fund (AQR) by Wes McKinney (who after a few years at Cloudera is now at another hedge fund, Two Sigma). So it's no coincidence that pandas is very well adapted to financial / investment analysis.

cluoma · on March 28, 2017

Speaking only for R, most of the performance-critical functions included in packages these days are just wrappers to a C++ implementation. It isn't until you need to implement something custom that performance starts to be an issue.

That, combined with its ease of use when working with tabular data and the myriad of packages available, makes it a very popular choice.

cgmil · on March 28, 2017

If Rcpp were not a thing, R might have died from being slow. Since it is a thing, though, people can easily write packages that are certainly fast enough for what they want to do, and that keeps the language useful.

j7ake · on March 28, 2017

The main bottleneck in data analysis is not running the code but thinking about your data, visualizing it in different ways, and looking for interesting corner cases your model is not good at predicting.

In that sense it is much faster to do these analyses in Python or R than in C++.

bicubic · on March 28, 2017

>Or are the R and Python Computational libraries close to the performance of C/C++ that it's a viable compromise?

That's a big part of it. Python's data analysis tooling is generally written on top of Numpy, which is insanely optimised code that an average C/C++ developer couldn't compete with in terms of perf. Numpy will completely smoke naive C code to do common data manipulation tasks. So it's a double win, really. You get very good perf out of the box, and you get a high level dynamic typed language which lets you focus on high level logic and iterate quickly.
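To make the idea concrete, here is a small sketch of the same calculation written as an interpreted Python loop and as a single NumPy array expression. The price series is synthetic (a seeded random walk), and the example only checks that both versions agree. It says nothing about how NumPy compares with hand-written C.

```python
import numpy as np

# Synthetic closing prices: a seeded geometric random walk, not real market data.
rng = np.random.default_rng(0)
close = 100.0 * np.exp(np.cumsum(rng.normal(0.0, 0.01, size=10_000)))


def returns_loop(prices):
    """Simple returns computed one element at a time in the Python interpreter."""
    out = []
    for i in range(1, len(prices)):
        out.append(prices[i] / prices[i - 1] - 1.0)
    return out


def returns_vectorized(prices):
    """Same calculation as one array expression; the loop runs in NumPy's compiled code."""
    return prices[1:] / prices[:-1] - 1.0


loop_result = np.array(returns_loop(close))
vec_result = returns_vectorized(close)

print(np.allclose(loop_result, vec_result))
```

Timing the two functions with the standard `timeit` module on your own machine is the easiest way to see the gap for yourself.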

The other part is the ecosystem. For a variety of reasons, Python has become the premier language for Big Data™, and over the past 5 years has accreted a huge collection of libraries for analytics, visualization, ML, distributed compute, etc. A single developer can now click these libraries together to achieve stuff that could only be done with huge teams before. You can build a distributed training cluster for a deep learning algorithm and deploy it to Amazon in maybe 2k lines of code. C++? I don't even know where to begin.

neshibble · on March 28, 2017

This. Python's strength as a language IMO comes from its ability to interface with more efficient code. You can write the meat of your library at near C levels of efficiency, and have Python treat it like a "black box", essentially. That often seems to be how NN code works as well.

robertk · on March 28, 2017

Given that the R interpreter is written in C and admits trivial FFI bindings, as demonstrated by libraries like glmnet or gbm calling out to it, I don't see how this is an inherent advantage of Python.

EmlynC · on March 28, 2017

While it doesn't have an inherent advantage, it has the mindshare and momentum of a community that has these tools now.

R could be just as capable as Python, but I think Python has largely won the race to be the most popular language for data analysis, which in turn encourages more developers to commit to it, cementing Python's advantage.

R still has a solid lead in statistics and good mindshare amongst academics.

baldfat · on March 28, 2017

> R could be just as capable as Python, but I think Python has largely won the race to be the most popular language for data analysis, which in turn encourages more developers to commit to it, cementing Python's advantage.

You're comparing apples and oranges. R is a domain-specific language and will never be a general-purpose language.

It is not true that Python won any race in statistics. http://www.kdnuggets.com/2015/05/r-vs-python-data-science.ht...

Let alone in industry investment coming from Microsoft and other major players.

R is ahead of Python in statistics in both momentum and numbers. Python is a good choice, but it is still playing catch-up because of the speed at which R is developing. With data.table, the Hadleyverse (https://www.r-bloggers.com/welcome-to-the-hadleyverse/), and RStudio, the momentum has clearly been on the side of R.

Just 5 years ago R had a fraction of the users it has today.

Python and R are both good choices with equal speed, but the difference is that R is a domain-specific language with a very healthy ecosystem.

robertk · on March 28, 2017

R is a LISP. I would disagree heavily with it being domain-specific. It is as capable and Turing complete as any language. The only argument you can create is about performance and the judiciousness of putting stats functions in the base library, as opposed to Common Lisp which ships with even less. Not only "will" it be a general purpose programming language, it already is.

Ntrails · on March 28, 2017

Worth noting that there's a difference between analysis/research and a trading platform etc.

I'd guess that the low effort threshold to try out and backtest ideas is very attractive, even if you're going to have to rewrite it faster/better/stronger in another language to trade in real time?
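To show how low that threshold can be, here is a toy backtest sketch in plain Python. The moving-average crossover rule, the window lengths, and the price list are all hypothetical choices made for illustration. This is not a recommended strategy and it ignores transaction costs.

```python
def moving_average(values, window):
    """Trailing simple moving average; None until enough data is available."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(values[i + 1 - window:i + 1]) / window)
    return out


def backtest_crossover(prices, fast=2, slow=3):
    """Hold one unit on any day after the fast average closed above the slow one.

    Returns the total profit and loss in price units, ignoring costs.
    """
    fast_ma = moving_average(prices, fast)
    slow_ma = moving_average(prices, slow)
    pnl = 0.0
    for t in range(1, len(prices)):
        f, s = fast_ma[t - 1], slow_ma[t - 1]
        # Use yesterday's signal so the rule never sees today's price in advance.
        if f is not None and s is not None and f > s:
            pnl += prices[t] - prices[t - 1]
    return pnl


# Hypothetical closing prices, made up for illustration.
prices = [10.0, 11.0, 12.0, 13.0, 12.0, 11.0, 10.0, 11.0, 12.0]
print(backtest_crossover(prices))
```

Trying a different rule means changing a few lines and rerunning, which is the kind of quick iteration that is harder to get in a compiled trading system.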

tnecniv · on March 28, 2017

It mostly depends on what you want to do. If you want to try to beat the speed of the internet like some of the HFT stories you hear, you're going to want to cut as much fat out as you can. If you are more interested in doing some analysis of historic data where you don't care about speed so much, it's a lot easier to play around with things in a language like R or Python.

cgmil · on March 28, 2017

I hear that R/Python's role in the project is initial modelling and testing. You train trading algorithms in R or Python, then deploy them in a C/C++ algorithm for speed.

xapata · on March 28, 2017

NumPy is C++, Fortran, and sometimes CUDA. Depending on usage.

baldfat · on March 28, 2017

> It's understandable that python is preferred since it's an easy language to get started with

Python is a good choice, but R is an amazingly good language that doesn't deserve to be pushed aside.

It's not easier to learn Python than R. They both have pluses and minuses, but knowing both, if I were to teach someone statistical programming I absolutely would teach R over Python.

collyw · on March 28, 2017

Numpy for Python provides wrappers around C arrays, so (apparently) comes close in terms of performance.

branchless · on March 28, 2017

Kdb is used in many large banks.

fapjacks · on March 28, 2017

> DISCLAIMER: THIS IS NOT FINANCIAL ADVICE!!! Furthermore, I have ZERO experience as a trader (a lot of this knowledge comes from a one-semester course on stock trading I took at Salt Lake Community College)!

This disclaimer is a good idea, but I'd have gone as far as saying that getting into HFT as a hobbyist programmer is going to be throwing your money into a giant hole.

ccc111 · on March 28, 2017

agree.

i will say as an avid stock guy i think most of this statistical stuff is bs.

i think we like to find patterns, but the ones that worked in the past don't really work in the future.

stocks are run off news and articles and whether the big bank sells today or buys, among a myriad of other things.

i believe hft are only really meant for big companies that have to trade 300 positions and maintain certain standards.

Not for an average guy running one in his basement to make $200 a day like clockwork, if only he worked a little harder, studied more phd statistics and found this "amazing unbeatable formula".

by the way would you like to make $5242 a month from working at home? insertscamwebsitehere,com

csomar · on March 28, 2017

I disagree, strongly. While the learning experience was super stressful and expensive and had me on stress drugs, the financial advantage was nothing short of insane. I'm a small-time trader who started from 5 figures and is now up to a good 6 figures of trading capital, and I have taken close to 6 figures of earnings already.

TheColorYellow · on March 28, 2017

Really? I haven't met many people who have claimed to have gotten an "insane" financial advantage from the stock market, let alone HFT. I've met a few, but none of them had such a positive reaction that it seemed worth investing a significant amount of time.

Any advice?

csomar · on March 28, 2017

Well, that's the catch. I'm very unlikely to give you any hints on what markets I trade and the advantages I'm using. I'm also very unlikely to take any person that does that very seriously (I really don't).

Also remember that this market (or any market with big money) is brutal and will eat you alive. I don't expect the average person and his capital to last for a long time.

fapjacks · on March 28, 2017

Well that was the intent of my original comment, that your money is going to disappear. Whether or not you stick around and continue throwing money into the hole long enough to eventually end up at the bottom of someone else's hole is a different matter, I think. You are making money, but you persisted -- indeed, between stress drug prescriptions and visits to the doctor -- when I think most hobbyist programmers will have wished they'd never started in the first place.

csomar · on March 28, 2017

Indeed. If anyone asks me if they should start trading, my reaction is "DON'T DO IT". But that is for the average person and maybe software developer. For people who have a fighter personality, good mathematical skills, good programming skills, good economic skills, no kids/family and a nice pile of cash, I think they have a chance.

user5994461 · on March 28, 2017

These people have no edge. They have the most to lose and are the least likely to risk much, while not realizing the risks they are taking.

They should not trade.

fapjacks · on March 28, 2017

Can I ask approximately how much money and time did you have to sink into learning the ropes before you started to turn a profit?

csomar · on March 28, 2017

Not quite accurate but I started with a low 5 figures. Wiped through about 1/3 of it in the first few months, then recovered. It took me about 1.5 years to start making a return on my trades. That doesn't count, certainly, my medical cost, occasional spending spree, and potential future health damage.

neshibble · on March 28, 2017

I think this is one of the reasons Econ as a field has stagnated. (This is very hard to defend, but bear with me.) People have no reason to share / peer review their findings. It is extremely advantageous to share no advice, or bad advice. Hence why you get the 2007/08 housing bubble, as hundreds of econ professors peddle COMPLETELY false economic theory and results-based analysis.

In fact, many still are. Not much has changed in the field of economics. How data science isn't a mandatory requirement for such a data driven field just shows to me how immature the field is.

sjg007 · on March 28, 2017

? I mean economics has been politicized extensively. Other than that economics is applied math and behavior. Systems of equations, Markov chains, game theory, ... Then there are specific levers one can push (inject money, take out money, regulate or not regulate). These basically modify the transition probabilities on specific states. 2007/08 was predictable fundamentally because wages didn't keep up with rising house prices.

arca_vorago · on March 28, 2017

Looks like my data science degree is gonna be worth it then!

justonepost · on March 28, 2017

Not if you are careful. The problem is reliably beating a low-cost index fund by enough to do better than what you'd be making in a real job. Good luck with that.

empath75 · on March 28, 2017

Is there a similar article for analyzing stock fundamentals? (P/e, revenue, etc)

cgmil · on March 28, 2017

No, but I'm always looking for new content ideas. People like my R/Python for Stocks posts, and ask for more, but I'm not really sure where to start in terms of giving them what they want.

soheil · on March 28, 2017

It'd be great if this was posted on Kaggle.

ccc111 · on March 28, 2017

Awesome article!

have you guys been on https://www.quantopian.com/

very informative

very good blog statistic articles on top of backtesting

tempodox · on March 28, 2017

You might just as well “analyze” dice rolls at the casino. Except that people routinely lie to you when they want your money (in case you didn't know already), there's no insight to be had. It is already a well established fact that in gambling, the house always wins.

user5994461 · on March 28, 2017

> It is already a well established fact that in gambling, the house always wins.

I had a friend who used to run a financial exchange. That was his favorite sentence.

The house always wins.

jdonaldson · on March 28, 2017

You can still profit off of HFT, even if you can't compete directly. The key is to find the opportunity they create: HFT will typically exit HARD on a stock with weak earnings or bad news. They need to find a better short term growth option. I like to find those situations, buy low, and hang on for a year.