
My 7 Rules for A/B Testing That Triple Conversion Rates

September 11, 2015 By Lars Lofgren

I really don’t care how any given A/B test turns out.

That’s right. Not one bit.

But wait, how do I double or triple conversion rates without caring how a test performs?

I actually care about the whole SYSTEM of testing. All the pieces need to fit together just right. If not, you’ll waste a ton of time A/B testing without getting anywhere. This is what happens to most teams.

But if you do it right. If you play by the right rules. And you get all the pieces to fit just right, it’s simply a matter of time before you triple conversions at any step of your funnel.

I set up my system so that the more I play, the more I win. I stack enough wins on top of each other that conversion rates triple. And any given test can fail along the way. I don’t care.

What does my A/B testing strategy look like? It’s pretty simple.

  • Cycle through as many tests as possible to find a couple of 10-40% wins.
  • Stack those wins on top of each other in order to double and triple conversion rates.
  • Avoid launching any false winners that drag conversions back down.

For all this to work, you’ll need to follow 7 very specific rules. Each of them is critical. Skip one and the whole system breaks down. Follow them and you’ll drive your funnel relentlessly up and to the right.

Rule 1: Above all else, the control stands

I look at A/B tests very differently from most people.

Usually, when someone runs a test, they’ll consider each of their variants as equals. The control and the variant are both viable and their goal is to see which one is better.

I can’t stand that approach.

We’re not here for a definitive answer. We’re here to cycle through tests to find a couple of big winners that we can stack on top of each other.

If there’s a 2% difference between the variant and the control, I really don’t care which one is the TRUE winner. Yes, yes, yes, I’d care about a 2% win if I had enough data to hit statistical significance on those tests (more on this in a minute). But unless you’re Facebook or Amazon, you probably don’t have that kind of volume. I’ve worked on multiple sites with more than 1 million visitors/month and it’s exceedingly rare to have enough data hitting a single asset in order to detect those kinds of changes.

In order for this system to work, you have to approach the variant and control differently. Unless a variant PROVES itself as a clear winner, the control stands. In other words, the control is ALWAYS assumed to be the winner. The burden of proof is on the variant. No changes unless the variant wins.

This ensures that we’re only making positive changes to assets going forward.

Rule 2: Get 2000+ people through the test within 30 days

So you don’t have any traffic? Then don’t A/B test. It’s that simple. Do complete revamps on your assets and then eyeball it.

Remember, we need the A/B testing SYSTEM working together. And we’re playing the long game. Which means we need a decent volume of data so we can cycle through a bunch of different test ideas. If it takes you 6 months to run a single test, you’ll never be able to run enough tests to find the few winners.

In general, I look for 2000 or more people hitting the asset that I’m testing within 30 days. So if you want to A/B test your homepage, it better get 2000 unique visitors every month. I even prefer 10K-20K people but I’ll get started with as little as 2000/month. Anything less than that and it’s just not worth it.

Rule 3: Always wait at least a week

Inside of a week, data is just too volatile. I’ve had tests with 240% improvements at 99% certainty within 24 hours of launching the test. This is NOT a winner. It always comes crashing down. Best-case scenario, it’s really just a 30-40% win. Worst case, it flip-flops and is actually a 20% decline.

It also lets you get a full weekly cycle’s worth of data. Visitors don’t always behave the same on weekends as they do during the week. So a solid week’s worth of data gives you a much more consistent sample set.

Here’s an interesting result that I had on one of my tests. Right out of the gate, it looked like I had a 10% lift. After a week of running the test, it did a COMPLETE flip-flop on me and became a 10% loser (at 99% certainty too):

Flip Flop A/B Test

One of my sneaking suspicions is that most of the 250% lift case studies floating around the interwebs are just tests that had extreme results in the first few days. And if they had run a bit longer, they would have come down to a modest gain. Some of them would even flip-flop into losers. But because people declare winners too soon, they run around on Twitter declaring victory.

Rule 4: Only launch variants at 99% statistical significance

Wait, 99%? What happened to 95%?

If you’ve done an A/B test, you’ve probably run across the recommendation that you should wait until you hit 95% significance. That way, you’ll only pick a false winner 1 out of every 20 tests. And none of us want to pick losers so we typically follow this advice.

You’ve run a bunch of A/B tests. You find a bunch of wins. You’re proud of those wins. You feel a giant, happy A/B testing bubble of pride.

Well, I’m going to pop your A/B testing bubble of pride.

Your results didn’t mean anything. You picked a lot more losers than just 1 in 20. Sorry.

Let’s back up a minute. Where does the 95% statistical significance rule come from?

Dig up any academic or scientific journal that has quantitative research and you’ll find 95% statistical significance everywhere. It’s the gold standard.

When marketers started running tests, it was a smart move to use this same standard to see if our data actually told us anything. But we forgot a key piece along the way.

See, you can’t just check statistical confidence on a test that’s already running and call it a day. You need to determine your sample size first. We do this by deciding the minimal improvement that we want to detect. Something like 5% or 10%. Then we figure out the statistical power we need and, from there, determine our sample size. Confused yet? Yeah, you kind of need to know some statistics to do this stuff. I need to look it up in a textbook each time it comes up.
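To make that concrete, here’s a minimal sketch of the sample-size calculation using the standard two-proportion formula. The 10% baseline conversion rate, 80% power, and 95% confidence level are purely illustrative assumptions, not numbers from my tests:

    # Minimal sketch of "determine your sample size first": pick the smallest
    # lift you care about, a confidence level, and a power level, then compute
    # how many visitors each variant needs. Baseline rate and power below are
    # illustrative assumptions.
    from statistics import NormalDist

    def sample_size_per_arm(p_base, min_relative_lift, alpha=0.05, power=0.80):
        """Visitors needed in EACH variant (standard two-proportion formula)."""
        p_var = p_base * (1 + min_relative_lift)
        p_bar = (p_base + p_var) / 2
        z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided
        z_beta = NormalDist().inv_cdf(power)
        n = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
             + z_beta * (p_base * (1 - p_base) + p_var * (1 - p_var)) ** 0.5) ** 2
        return n / (p_var - p_base) ** 2

    # How long would each test take at the Rule 2 minimum of 2,000 visitors/month?
    per_arm_per_month = 2000 / 2   # 50/50 split
    for lift in (0.02, 0.10, 0.20, 0.40):
        n = sample_size_per_arm(p_base=0.10, min_relative_lift=lift)
        print(f"{lift:.0%} lift: ~{n:,.0f} per arm (~{n / per_arm_per_month:.1f} months)")

Under those assumptions, a 2% lift needs hundreds of thousands of visitors per arm while a 40% lift needs only around a thousand. That’s the math behind cycling for 10-40% wins instead of chasing tiny ones.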

So what happens if we skip all the fancy shmancy stats stuff and just run tests to 95% confidence without worrying about it? You come up with false positives WAY more frequently than just 1 out of 20 tests.

Here’s an example test I ran. In the first two days, we got a 58.7% increase in conversions at 97.7% confidence:

Chasing Statistical Significance with A/B Tests - 2 Day Results

That’s more than good enough for most marketers. Most people I know would have called it a winner, launched it, and moved on.

Now let’s fast-forward 1 week. That giant 58.7% win? Gone. We’re at a 17.4% lift with only 92% confidence:


Chasing Statistical Significance with A/B Tests - 1 Week Results

And the results after 4 weeks? Down to an 11.7% win at 95.7% certainty. We’ve gone from a major win to a marginal win in a few weeks. It might stabilize here. It might not.

Chasing Statistical Significance with A/B Tests - 4 Week Results

We have tests popping in and out of significance as they collect data. This is why determining your required sample size is so important. You want to make sure that a test doesn’t trick you early on.
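You can see how badly this kind of peeking inflates the false-positive rate with a quick simulation. The sketch below runs A/A tests, where the variant is identical to the control, peeks at significance every day for 30 days, and counts how often a “winner” gets declared anyway. The daily traffic and 10% conversion rate are illustrative assumptions:

    # Simulate "peeking": control and variant are identical (A/A test), but we
    # check significance daily for 30 days and call a winner the first time the
    # test crosses 95% confidence. A fixed-sample test would only fool us about
    # 5% of the time; peeking fools us far more often.
    import random
    from statistics import NormalDist

    def p_value(conv_a, n_a, conv_b, n_b):
        """Two-sided p-value for a difference between two conversion rates."""
        p_pool = (conv_a + conv_b) / (n_a + n_b)
        se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
        if se == 0:
            return 1.0
        z = (conv_b / n_b - conv_a / n_a) / se
        return 2 * (1 - NormalDist().cdf(abs(z)))

    random.seed(42)
    DAYS, DAILY_VISITORS_PER_ARM, RATE = 30, 100, 0.10   # illustrative numbers
    runs, false_winners = 500, 0
    for _ in range(runs):
        conv_a = n_a = conv_b = n_b = 0
        for _ in range(DAYS):
            n_a += DAILY_VISITORS_PER_ARM
            n_b += DAILY_VISITORS_PER_ARM
            conv_a += sum(random.random() < RATE for _ in range(DAILY_VISITORS_PER_ARM))
            conv_b += sum(random.random() < RATE for _ in range(DAILY_VISITORS_PER_ARM))
            if p_value(conv_a, n_a, conv_b, n_b) < 0.05:   # "95% confident!"
                false_winners += 1
                break
    print(f"False winners when peeking daily: {false_winners / runs:.0%}")

Even though there’s no real difference at all, daily peeking at 95% confidence declares a false winner far more often than 1 test in 20.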

But Lars! It still looks like a winner even if it’s a small winner! Shouldn’t we still launch it? There are two problems with launching early:

  1. There’s no guarantee that it would have turned out a winner in the long run. If we had kept running the test, it might have dropped even further. And every once in a while, it’ll flip-flop on you and become a loser. Then we’ve lost hard-earned wins from previous winners.
  2. We would have vastly over-inflated the expected impact on the business. A 60% win moves mountains. They crush your metrics and eat board decks for breakfast. 11% wins, on the other hand, have a much gentler impact on your growth. They give your metrics a soothing spa package and nudge them a bit in the right direction. Calling that early win at 60% gets the whole team way too excited. Those same hopes and dreams get crushed in the coming weeks when growth is far more modest. Do that too many times and people stop trusting A/B test results. They’ll also take the wrong lessons from it and start focusing on elements that don’t have a real impact on the business.

So what do we do if 95% statistical significance is unreliable?

There’s an easier way to do all this.

While I was at Kissmetrics, I worked with Will Kurt, our Growth Engineer at the time. He’s a wicked smart guy who runs his own statistics blog now.

We modeled out a bunch of A/B testing strategies over the long term. There’s a blog post that goes over all our data and I also did a webinar on it. How does a super disciplined academic research strategy compare to the fast and loose 95% online marketing strategy? What if we bump it to 99% statistical significance instead?

We discovered that you’d get very similar results over the long term if you just used a 99% statistical significance rule. It’s just as reliable as the academic research strategy without needing to do the heavy stats work for each test. And using 95% statistical significance without a required sample size isn’t as reliable as most people think it is.

The 99% rule is the cornerstone of my A/B testing strategy. I only make changes at 99% statistical significance. Any less than that and I don’t change the control. This reduces the odds of launching false winners to a more manageable level and allows us to stack wins on top of each other without accidentally negating our wins with a bad variant.
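In code, Rules 1 and 4 collapse into a single launch check: the control stands unless the variant clears 99%. Here’s a minimal sketch using a two-proportion z-test; that’s a common choice, but your testing tool may compute confidence differently, so treat the test itself and the counts below as assumptions:

    # Minimal sketch of the launch decision: the control always stands unless
    # the variant clears 99% confidence (Rule 1 + Rule 4). Uses a two-proportion
    # z-test, which is a common choice, not necessarily what your tool uses.
    from statistics import NormalDist

    def should_launch_variant(control_conv, control_n, variant_conv, variant_n,
                              required_confidence=0.99):
        p_c = control_conv / control_n
        p_v = variant_conv / variant_n
        p_pool = (control_conv + variant_conv) / (control_n + variant_n)
        se = (p_pool * (1 - p_pool) * (1 / control_n + 1 / variant_n)) ** 0.5
        z = (p_v - p_c) / se
        confidence = 1 - 2 * (1 - NormalDist().cdf(abs(z)))   # two-sided
        lift = (p_v - p_c) / p_c
        # Burden of proof is on the variant: it must be better AND above 99%.
        launch = p_v > p_c and confidence >= required_confidence
        return launch, lift, confidence

    # Hypothetical counts: 1,000/12,000 control conversions vs 1,150/12,000 variant.
    launch, lift, conf = should_launch_variant(1000, 12000, 1150, 12000)
    print(f"lift={lift:.1%}, confidence={conf:.1%}, launch={launch}")

With those hypothetical counts, a 15% lift clears 99% and ships. Shrink the lift or the sample and the control stands.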

Rule 5: If a test drops below a 10% lift, kill it.

Great, we’re now waiting for 99% certainty on all our tests.

Doesn’t that dramatically increase the time it takes to run all our tests? Indeed it does.

Which is why this is my first kill rule.

Again, we care about the whole system here. We’re cycling to find the winners. So we can’t just let a 2-5% test run for 6 months.

What would you rather have?

  • A confirmed 5% winner that took 6 months to reach
  • A 20% winner after cycling through 6-12 tests in that same 6 month period

To hell with that 5% win, give me the 20%!

So the longer we let a test run, the more our opportunity costs stack up. If we wait too long, we’re forgoing serious wins that we could have found by launching other tests.

If a test drops below a 10% lift, it’s now too small to matter. Kill it. Shut it down and move on to your next test.

What if we have an 8% projected win at 96% certainty? It’s SO close! Or what if we have enough data to find 5% wins quickly?

Then we ask ourselves one very simple question: will this test hit certainty within 30 days? If you’re 2 weeks into the test and close to 99% certainty, let it run a bit longer. I do this myself.
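One way to answer that question without gut feel is to estimate the sample you’d need to confirm the lift you’re seeing so far at 99%, then compare it to the traffic the test will actually have by day 30. A rough sketch with placeholder numbers, not data from a real test:

    # Rough "will it hit certainty by day 30?" check: estimate the sample needed
    # for a good chance (80% power) of confirming the observed lift at 99%
    # confidence, then compare against the visitors the test will have by day 30.
    # All numbers below are placeholders.
    from statistics import NormalDist

    def n_needed_per_arm(p_control, observed_lift, alpha=0.01, power=0.80):
        p_variant = p_control * (1 + observed_lift)
        p_bar = (p_control + p_variant) / 2
        z_a = NormalDist().inv_cdf(1 - alpha / 2)
        z_b = NormalDist().inv_cdf(power)
        n = (z_a * (2 * p_bar * (1 - p_bar)) ** 0.5
             + z_b * (p_control * (1 - p_control) + p_variant * (1 - p_variant)) ** 0.5) ** 2
        return n / (p_variant - p_control) ** 2

    days_run, visitors_per_arm_so_far = 14, 2800    # where the test is today
    p_control, observed_lift = 0.09, 0.08           # the "8% projected win"
    projected_by_day_30 = visitors_per_arm_so_far / days_run * 30
    needed = n_needed_per_arm(p_control, observed_lift)
    print(f"Need ~{needed:,.0f} per arm, will have ~{projected_by_day_30:,.0f} by day 30")
    print("Let it run" if projected_by_day_30 >= needed else "It won't make it; kill it at day 30")

With these placeholder numbers, an 8% lift on a 9% baseline needs tens of thousands of visitors per arm to confirm at 99%, which is exactly why small lifts rarely survive the kill rules.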

What happens at day 30? That leads us to our next kill rule.

Rule 6: If no winner after 1 month, kill it.

Chasing A/B test wins can be addictive. JUST. ONE. MORE. DAY. OF. DATA.

We’re emotionally invested in our idea. We love the new page that we just launched. And IT’S SO CLOSE TO WINNING. Just let it run a bit longer? PLEEEEEASE?

I get it, each of these tests becomes a personal pet project. And it’s heartbreaking to give up on it.

If you have a test that’s trending towards a win, let it keep going for the moment. But we have to cut ourselves off at some point. The problem is that many of these “small-win” tests are mirages. First they look like 15% wins. Then 10%. Then 5%. Then 2%. The more data you collect, the more the variant converges with your control.

CUT YOURSELF OFF. We need a rule that keeps our emotions in check. You gotta do it. Kill that flop of a test and move on to your next idea.

That’s why I have a 30-day kill rule. If the variant doesn’t hit 99% certainty by day 30, we kill it. Even if it’s at 98%, we shut it down on the spot and move on.

Rule 7: Build your next test while waiting for your data

Cycling through tests as fast as we can is the name of the game. We need to keep our testing pipeline STACKED.

There should be absolutely NO downtime between tests. How long does it take you to build a new variant? Starting with the initial idea, how long until it goes live? 2 weeks? 3 weeks? Maybe even an entire month?

If you wait to start on the next test until the current test is finished, you’ve wasted enough data for 1-2 other tests. That’s 1-2 other chances that you could have found that 20% win to stack on top of your other wins.

Do not waste data. Keep those tests running at full speed.

As soon as one test comes down, the next test goes up. Every time.

Yes, you’ll need to get a team in place that’s dedicated to A/B tests. This is not a trivial amount of work. You’ll be launching A/B tests full time. And your team will need to be moving at full speed without any barriers.

If it were easy, everyone would be doing it.

Follow All 7 A/B Testing Rules to Consistently Drive Conversions Up and to the Right

Follow the system with discipline and it’s only a matter of time before you double or triple your conversion rates. The longer you play, the more likely you are to win.

Here are all the rules in one spot:

  1. Above all else, the control stands
  2. Get 2000+ people through the test within 30 days
  3. Always wait at least a week
  4. Only launch variants at 99% certainty
  5. If a test drops below a 10% lift, kill it.
  6. If no winner after 1 month, kill it.
  7. Build your next test while waiting for your data

How Live Chat Tools Impact Conversions and Why I Launched a Bad Variant

July 21, 2015 By Lars Lofgren

Do those live chat tools actually help your business? Will they get you more customers by allowing your visitors to chat directly with your team?

Like most tests, you can come up with theories that sound great for both sides.

Pro Live Chat Theory: Having a live chat tool helps people get their questions answered faster and see the value of your product, which leads to more signups when people see how willing you are to help them.

Anti Live Chat Theory: It’s one more element on your site that will distract people from your primary CTAs so conversions will drop when you add it to your site.

These aren’t the only theories either; we could come up with dozens on both sides.

But which is it? Do signups go up or down when you put a live chat tool on the marketing site of your SaaS app?

It just so happens I ran this exact test while I was at Kissmetrics.

How We Set Up the Live Chat Tool Test

Before we ran the test, we already had Olark running on our pricing page. The Sales team requested it and we launched without running it through an A/B test. Anecdotally, it seemed helpful. An occasional high-quality lead would come through and it would help our SDR team disqualify poor leads faster.

Around September 2014, the Sales team started pushing to have Olark across our entire marketing site. Since I had taken ownership of signups, our marketing site, and our A/B tests, I pushed back. We weren’t just going to launch it; it needed to go through an A/B test first. I was pro-Olark at this point but wanted to make sure we weren’t cannibalizing our funnel by accident.

We got it slotted for an A/B test in Oct 2014 and decided to test it on 3 core pages of our marketing site: our Features, Customers, and Pricing pages.

Our control didn’t have Olark running at all, which meant stripping it from our pricing page for the control. Only the variant would have Olark on any pages.

Here’s what our Olark popup looked like during business hours:

Kissmetrics Olark Popup Business Hours

And here it is after-hours:

Kissmetrics Olark Popup After Hours

Looking at the popups now, I wish I had done a once-over on the copy. It’s pretty bland and generic. Better copy might have gotten us better results. At the time, I decided to test whatever Sales wanted since this test was coming from them.

Setting up the A/B test was pretty simple. We used an internal tool to split visitors into variants randomly (this is how we ran most of our A/B tests at Kissmetrics). Half our visitors randomly got Olark, the other half never saw it. Then we tagged each group with Kissmetrics properties and used our own Kissmetrics A/B Test Report to see how conversions changed in our funnel.
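A common way to do this kind of split is to hash a stable visitor ID so each visitor lands in the same variant on every page view, which matters when a test runs across three different pages. Here’s a minimal sketch of that pattern; it is not the actual internal tool, and the test name, visitor ID, and property handling are made up for illustration:

    # Hedged sketch of a deterministic 50/50 split. Hashing a stable visitor ID
    # keeps each visitor in the same variant on every page view.
    import hashlib

    def assign_variant(visitor_id: str, test_name: str = "olark-live-chat") -> str:
        digest = hashlib.sha256(f"{test_name}:{visitor_id}".encode()).hexdigest()
        bucket = int(digest, 16) % 100            # 0-99, stable for this visitor
        return "variant-olark" if bucket < 50 else "control-no-olark"

    variant = assign_variant("visitor-8675309")
    # The assignment would then be recorded as a property on the visitor in your
    # analytics tool so the funnel report can segment conversions by variant.
    print(variant)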

So how did the data play out anyway?

Not great.

Our Live Chat A/B Test Results

Here’s what Olark did to our signups:

Live Chat Tool Impact on Signup Conversions

A decrease of 8.59% at 81.38% statistical significance. I can’t say that we have a confirmed loser at this point. I prefer 99% statistical significance for those kinds of claims. But the data is not trending towards a winner.

How about activations? Did it improve signup quality and get more people to install Kissmetrics? That step of the funnel looked even worse:

Live Chat Tool Impact on Activations

A 22.14% decrease in activations at 97.32% statistical significance. Most marketers would declare this a confirmed loser since we hit the 95% statistical significance threshold. Even if you push for 99% statistical significance, the results are not looking good at this point.

What about customers? Maybe it increased the total number of new customers somehow? I can’t share that data but the test was inconclusive that far down the funnel.

The Decision – Derailed by Internal Politics

So here’s what we know:

  • Olark might decrease signups by a small amount.
  • Olark is probably decreasing Kissmetrics installs.
  • The impact on customer counts is unknown.

Seems like a pretty straightforward decision, right? We’re looking at possible hits on signups and activations, then a complete roll of the dice on customers. These aren’t the kind of odds I like to play with. Downside at the top of the funnel with a slim chance of success at the bottom. We should have taken it down, right?

Unfortunately, that’s not what happened.

Olark is still live on the Kissmetrics site 9 months after we did the test. If you go to the pricing page, it’s still there:

Kissmetrics Live Chat Tool on Pricing Page

Why wouldn’t we kill a bad test? Why would we let a bad, risky variant live on?

Internal politics.

Here’s the thing: just because you have data doesn’t mean that decisions get made rationally.

I took these test results to one of our Sales directors at the time and said that I was going to take Olark off the site completely. That caused a bit of a firestorm. Alarms got passed up the Sales chain and I found myself in a meeting with the entire Sales leadership.

I wanted Olark gone. Sales was 100% against me.

Live chat is considered a best practice (or at least it was a best practice at one point). It’s a safe choice for any SaaS leadership team. I have no idea HOW it became a best practice considering the data I found, but that’s not the point. There are plenty of best practices that sound great but actually make things worse.

Here’s what the head of Sales told me: “Salesforce uses live chat so it should work for us too.”

But following tactics from industry leaders is the fastest path to mediocrity for a few reasons:

  • They might be testing it themselves to see if it works; you don’t know if it’s still mid-test or a win they’ve decided to keep.
  • They might not have tested it at all; they could be following best practices themselves and have no idea if it actually helps.
  • They may have gotten bad data but decided to keep it because of internal politics.
  • Even if it does work for them, there’s no guarantee that it’ll work for you. I’ve actually found most tactics to be very situational. There are a few cases where a tactic helps immensely, but most of the time it’s a waste of effort with no impact.

It’s also difficult to understand how a live chat tool could decrease conversions. Maybe it’s a distraction, maybe not. But when you’re an SDR watching good opportunities come in through chat and help you meet your qualified lead quotas, it’s not easy to separate that anecdotal experience from data on the entire system.

But none of this mattered. Sales was completely adamant about keeping it.

The ambiguity on customer counts didn’t help either. As long as it was an unknown, arguments could still be made in favor of Olark.

Why didn’t I let the test run longer and get enough data on how it impacted new customer counts? With how close the data was, we would have needed to run the test for several months before getting anywhere close to an answer. Since I had several other tests in my pipeline, I faced serious opportunity costs if I let the test run. Running one test for 3 months means not running 3-4 other tests that have a chance at being major wins.

So I faced a choice. I could have removed Olark if I was stubborn enough. My team had access to the marketing site; Sales didn’t. But standing my ground would start an internal battle between Marketing and Sales. It’d get escalated to our CEO and I’d spend the next couple of weeks arguing in meetings instead of trying to find other wins for the company. Regardless of the final decision, the whole ordeal would fray relationships between the teams. I’d also burn a lot of social capital if I decided to push my decision through. With the decrease in trust, there would be all sorts of long-term costs that would prevent us from executing effectively on future projects.

I pushed back and luckily got agreement not to launch it on the Features or Customers pages. But Sales wouldn’t budge on the Pricing page. I chose to let it drop, and it lives there to this day.

That’s how I launched a variant that decreased conversions.

Should You Use a Live Chat Tool on Your Site?

Could a live chat tool increase the conversions on your site? Possibly. Just because it didn’t work for me doesn’t mean it won’t work for you.

Are there other places that I would place a live chat tool? Maybe a support site or within a product? Certainly. There are plenty of cases where acquisition matters less than helping people as quickly as possible.

Would I use a live chat tool at an early stage startup to collect every possible bit of feedback I could? Regardless of what it did to signups? Most definitely. Any qualitative feedback at this stage is immensely valuable as you iterate to product/market fit. Sacrificing a few signups is well worth the cost of being able to chat with prospects.

If I was trying to increase conversions to signups, activations, and customers, would I launch a live chat tool on a SaaS marketing site without A/B testing it first? Absolutely not. Since this test didn’t go well, I wouldn’t launch a live chat tool without conclusive data proving that it helped conversions.

Olark and the rest of the live chat companies have great products. There are definitely ways for them to add a ton of value. Getting lots of qualitative feedback at an early stage startup is probably the strongest use case that I see. But if your goal is to increase signups, activations, and customers, I’d be very careful about assuming that a live chat tool will help you.
