splitforce's comments

Qualitative data is OK, but without a really strong understanding of social theory and stuff like desirability bias there's a big risk of not asking the right questions and making misinformed decisions.

One way we've found to limit this risk is by supplementing hypotheses developed through qualitative research with a quantitative approach.

So, ask users what they like/dislike about an experience to formulate an idea of what changes to your app may better the experience, BUT make sure to then TEST those changes using a rigorous method (i.e.: experimentation or A/B testing) to validate that the feedback you're hearing is not just noise...


Agree on the premise. I think there's a spectrum of questions, from those that work great in short, simple polls:

"What's frustrating you most right now? Level is boring, level is too difficult, loading time is too slow etc

vs. ones that require detailed framing and context, for example: "How would you approach building your army?"

The former can work very well, but you need to be careful to get good results out of the less concrete ones, where many factors are at play.


Next step is to take some input about user preferences and generate a personalized ranking of places to go. Here's a quick stab at building a 'personalized' desirability index: https://docs.google.com/a/splitforce.com/spreadsheets/d/1u-6...
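
To give a rough idea of the mechanics (hypothetical place names, features and weights - not what's actually in that sheet), the index is basically just a preference-weighted average per place:

    # Sketch of a preference-weighted 'desirability index'
    # (hypothetical feature names and scores, not the real spreadsheet).

    # Per-place scores on a 0-10 scale for a few features.
    places = {
        "Lisbon":  {"cost": 8, "weather": 9, "nightlife": 7},
        "Oslo":    {"cost": 3, "weather": 5, "nightlife": 6},
        "Bangkok": {"cost": 9, "weather": 7, "nightlife": 9},
    }

    # User preferences: how much each feature matters (weights sum to 1).
    preferences = {"cost": 0.5, "weather": 0.3, "nightlife": 0.2}

    def desirability(scores, weights):
        """Weighted average of feature scores."""
        return sum(scores[f] * w for f, w in weights.items())

    ranking = sorted(places,
                     key=lambda p: desirability(places[p], preferences),
                     reverse=True)
    print(ranking)  # e.g. ['Bangkok', 'Lisbon', 'Oslo']

Swap in each user's own weights and every person gets a different ranking.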


Thanks for the article Kevin, but I am afraid that most of your audience will find that just tweaking colors or copy will not produce any meaningful results for their businesses. Let me explain…

We’ve found that a successful approach to A/B testing is really dependent on the type of company you’re operating and product that you’re offering. Small cosmetic changes to the UI or copy often result in equally small changes to click-thru or conversion rates, and so these A/B tests require relatively greater levels of statistical power in order to achieve significance.

For mega-traffic companies like Google or Amazon, these kinds of tests are worth the cost because a sub-1% lift still contributes substantially to their bottom line.

But for everyone else, ‘shallow’ A/B tests of a button color or call to action will often yield inconclusive results. Here’s an article from the founder of GrooveHQ detailing such an experience: http://www.groovehq.com/blog/failed-ab-tests.

If you’re running a small or medium business – or even a larger one that does not have the scalable testing practices of a tech giant like Amazon in place – testing deeper changes to the product, UI layouts or entire UX workflows is what moves the needle. This is what we’re now calling ‘empathic A/B testing’ – where tests are designed with empathy for users.

Ask the questions: What changes can I make to my product or website that would motivate my users to take the actions I want them to take? What are they looking for? What do they care about? And why? More often than not, I think you’ll find that the answer is not ‘a different button color’.

In the end, A/B testing is really a very unsophisticated way of answering the question ‘What works better?’ because you are sending a fixed proportion of your users to a suboptimal variant for the duration of the test. We’ve done a lot of research into better solutions to this problem, and have found that a dynamic approach using a learning algorithm almost always leads to faster results and higher average conversion rates. You can read more about that here: http://splitforce.com/resources/auto-optimization/


Nice article as always, thanks Alex!

What I’ve found is that a successful approach to A/B testing is really dependent on the type of company you’re operating and product that you’re offering. Small cosmetic changes to the UI or copy often result in equally small changes to click-thru or conversion rates, and so these A/B tests require relatively greater levels of statistical power in order to achieve significance.

For mega-traffic companies like Google or Amazon, these kinds of tests are worth it because a sub-1% lift still contributes substantially to their bottom line. They also have the traffic numbers to properly power tests of smaller changes in a reasonable amount of time.

But for everyone else, ‘shallow’ A/B tests of a button color or call to action will often yield inconclusive results because they don’t have the traffic numbers. For these types of companies, we’ve seen that deeper changes to the product, UI layouts or entire UX workflows are what move the needle. Designing these tests requires more thought and development work up-front – but at least you’ll be making substantial improvements in an experimentally rigorous way instead of just spinning your wheels with some one-off design tweaks.

To avoid these kinds of disappointing tests, another thing to consider is setting a minimum detectable effect. The idea here is that validating a small improvement requires more statistical power (i.e.: more test subjects) than validating a large one, and at some point, in order to justify continuing the test, you’ll want to achieve some minimum amount of lift. Once you can say with statistical confidence that this desired lift isn’t achievable, you can stop the test early and move on to the next one.
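
As a rough back-of-the-envelope illustration (the standard two-proportion power approximation, not any particular tool's formula), here's how the required sample size balloons as the minimum detectable effect shrinks:

    from scipy.stats import norm

    def sample_size_per_variant(baseline_rate, mde, alpha=0.05, power=0.8):
        """Approximate subjects needed per variant to detect an absolute
        lift of `mde` over `baseline_rate` (two-sided test)."""
        z_alpha = norm.ppf(1 - alpha / 2)   # ~1.96 for a 5% significance level
        z_beta = norm.ppf(power)            # ~0.84 for 80% power
        p1, p2 = baseline_rate, baseline_rate + mde
        variance = p1 * (1 - p1) + p2 * (1 - p2)
        return int((z_alpha + z_beta) ** 2 * variance / mde ** 2)

    # With a 5% baseline conversion rate:
    print(sample_size_per_variant(0.05, 0.01))   # 1% absolute lift  -> ~8,000 users per variant
    print(sample_size_per_variant(0.05, 0.025))  # 2.5% absolute lift -> ~1,500 users per variant

If you only have a few thousand visitors a month, that first test will take a very long time to resolve, which is exactly why chasing tiny effects usually isn't worth it.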

Most importantly, you should be designing these tests with empathy for your audience. Ask the questions: What changes can I make to my product or website that would motivate my users to take the actions I want them to take? What are they looking for? What do they care about? And why? More often than not, I think you’ll find that the answer is not ‘a different button color’ :-D

In the end, A/B testing is really a very unsophisticated way of answering the question ‘What works better?’ We’ve done a lot of research into better solutions to this problem, and have found that an automated approach using a learning algorithm almost always leads to faster results and higher average conversion rates. You can read more about that here: http://splitforce.com/resources/auto-optimization/


Congrats Paras! New dashboard is looking nice, and I especially like the ability to retroactively segment results based on different customer dimensions and discover new opportunities for personalization.

One thing that I've noticed is that traditional A/B testing is a pretty sub-optimal way of answering the question: 'What works better, A or B?'

In the most basic example of an A/B test, you have a variation A and a variation B, each shown to 50% of your user base. By definition, this approach sends half of your users to the worse performing version for the entire duration of the test!

The automated approach is based on a bandit algorithm that dynamically updates the proportion of users shown a given variation. With each new piece of data that you collect on the test variations' conversion rates and confidence, the algorithm adjusts the percentages automatically so that better performing variations are promoted and worse performers are pruned away.

This leads to:

1) faster results, because you’re directing test resources (i.e.: users and their data) to validate what you actually care about (i.e.: confidence in the best variation’s performance),

2) a higher average conversion rate during the test itself, because relatively more users are being sent to the better performing variation automatically, and

3) less time and effort required to actively manage your experiments.

Though the math behind this approach is slightly more complex than a traditional A/B test, it’s a no-brainer for anyone who is really interested in making data-driven decisions, because the results it produces are so much better.
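
For anyone who wants a concrete picture of the mechanics, here's a minimal Beta-Bernoulli Thompson Sampling loop - a sketch of the general technique, not our production implementation:

    import random

    # One Beta(successes + 1, failures + 1) posterior per variation.
    variations = {"A": {"successes": 0, "failures": 0},
                  "B": {"successes": 0, "failures": 0}}

    def choose_variation():
        """Sample a plausible conversion rate from each posterior and
        show the variation whose sample is highest."""
        draws = {name: random.betavariate(v["successes"] + 1, v["failures"] + 1)
                 for name, v in variations.items()}
        return max(draws, key=draws.get)

    def record_result(name, converted):
        """Update the chosen variation's posterior with the observed outcome."""
        key = "successes" if converted else "failures"
        variations[name][key] += 1

    # As evidence accumulates, the better variation gets sampled (and shown)
    # more and more often, while the worse one is gradually pruned away.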

For anyone interested, here’s a post we put together on how it works: http://splitforce.com/resources/auto-optimization/


Nice post, Sergey. We've been using Thompson Sampling to deal with delayed feedback, as we're dealing with data coming from mobile applications which are not always connected. The results have been pretty good, here's a breakdown of how it works if you're interested: https://splitforce.com/resources/auto-optimization/
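
For the curious, the gist of handling delayed feedback is to queue results on the device and fold them into the posteriors whenever a batch makes it back to the server - a simplified sketch (hypothetical names, not our actual SDK):

    import random

    # Beta posterior parameters per variation: [alpha, beta].
    posteriors = {"A": [1, 1], "B": [1, 1]}

    def assign(user_id):
        """Thompson Sampling assignment. It degrades gracefully under delayed
        feedback because allocation is randomized from the current posteriors
        rather than tied to the very latest observation."""
        draws = {v: random.betavariate(a, b) for v, (a, b) in posteriors.items()}
        return max(draws, key=draws.get)

    def apply_batch(batch):
        """Fold in a batch of delayed results, e.g. uploaded when a device
        comes back online: [(variation, converted), ...]."""
        for variation, converted in batch:
            if converted:
                posteriors[variation][0] += 1
            else:
                posteriors[variation][1] += 1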

Have you thought about how to deal with changes in environmental factors over relatively longer periods of time? For example, seasonality or changes in popular taste.


Thanks, we think it's pretty cool too! And you've totally hit on why this is awesome - by automatically funneling users towards better-performing variations you can focus your resources on validating what matters and cut down on the deadweight loss associated with a traditional A/B. For more on this here's a recommended read: http://stevehanov.ca/blog/index.php?id=132?utm_medium=referr...


Do you guys do any sort of logistic regression for optimizing numeric values? That would be really cool.


Right now auto-optimization only supports binary goal types. But we support A/B testing with Time and Quantity (numeric) goal types as well, so we're working on a solution to automate those types of experiments.


Thanks for the heads up - looking into it...


True, but why? Activision spends over $10M annually on an analytics team of about a dozen PhDs who build Rubin causal models to show the right thing at the right time to players. That kind of budget makes sense for a $1 billion title like Call of Duty, but where does that leave the other 99%? The fact is that even if you don't employ a full-time team of statisticians, if you're running your app or game like a true business you should be leveraging data to make better business decisions. Bandit algorithms like those we're proposing just let you do that at a fraction of the cost ;-)


Nice post Alex, thanks for that.

A lot of people forget how important two particular industries have been in terms of pushing the envelope when it comes to computer processing and Internet bandwidth technology: Pornography and gaming.

While porn and games are certainly among the more hedonistic (and certainly less virtuous) of products, the fact that people care so much about them is in large part the reason why we have more powerful CPUs/GPUs - for example - or faster connection speeds. (I guess you can thank U.S. military investments for some of this stuff as well.)

My point is, gaming is important. Like, really important. It might not have the direct impact on African schoolchildren that Kiva or Doctors Without Borders does, but one could argue that those organizations would not be able to leverage the technology they rely on so much if others hadn't paved the way. Keep doin' the good work, son! ;-)

