Why Toofr Uses Heroku For Its Email Finding Service


A peek under the hood of how Toofr works

Updated on January 3, 2018

The Setup

Like most web apps with real customers and revenue, Toofr has had many iterations in software and hardware. The original Toofr scripts were written in Python and pushed into a local MySQL database. I added a PHP CodeIgniter website framework around the Python scripts to make Toofr available to the public. I hosted the website and database on a single Digital Ocean server for a couple of years.

When I got some heavy Toofr API users, I split the database off into its own server, which allowed me to then have multiple web servers. I moved everything onto Rackspace so I could use their load balancers. It worked well, but there was a ton of technical debt. About two years ago I started the long process of rewriting Toofr into Ruby on Rails. Things got busy at my real job at the time so I hired a developer to complete the rewrite for me. It still took months. We tried hosting it on Heroku and AWS Elastic Beanstalk but couldn't get it to work. When we finally launched "Toofr on Rails" we did it with our own virtual private servers running on Rackspace. We had a worker server, an API server, two web servers, and a database server. The web servers were fronted with a cloud load balancer.

This felt to me like a lot of overhead. We deployed using Capistrano, so every time we changed a server we had to update the deploy script. It's not a huge deal, but I never felt comfortable deploying code myself. What if the deploy failed? What if I lost my internet connection midway through? I wouldn't know how to fix it. So I'd push code and then have my developer do the deployment. It wasn't ideal for me, but it worked. For about a year that's how we did it.

Then one day I decided to try Heroku again.

Re-Enter Heroku

About a month ago I forked my main Toofr repository and tried again to get it working on Heroku. To my surprise, I had it up and running in less than three hours. The last time we tried it, a year or so earlier, I spent days trying to get the build to succeed. Some code changes we made along the way, or perhaps some upgrades to Heroku's own infrastructure, made the difference. I was elated! Hosting infrastructure shouldn't make me this happy, but it did, and since I'm writing this post about it, I suppose it still does.

The benefits to Heroku are many, but the ones that matter most to me are:

  • Provision new servers with one click -- literally, one click
  • Or if you love Terminal (I'm getting there...) then you can use Heroku's Command Line Interface to do it
  • Since you choose how many web servers (they call them "dynos") you want, load balancing is on by default
  • Deployment is also automated -- just tell Heroku which GitHub branch to watch, and when you push code to that branch, Heroku will deploy it
  • If that's not enough, SSL certificates are provisioned and updated for free! That was always a pain on private servers and load balancers

This is getting a little into the weeds, but one last thing needed to happen before I could fully take advantage of rapid deploys on Heroku. Toofr's bulk list import originally imported and processed the CSV file in one long-running background job. Each record was processed serially, at a rate of about 350 records per hour. Some users uploaded files with 10,000 records, and those took over 24 hours to run. During that time we couldn't deploy, because Capistrano would reset the worker server and that big job would start over from zero. I wasn't happy with this design. My developer had to wait until the wee hours of the night (his daytime, since he's in India) to sneak a deployment in before someone else uploaded a huge file. That, of course, would not scale.

A big improvement to file uploads

I figured we could do better, and the move to Heroku made it as good a time as any to completely refactor that import process. Now when a CSV file is imported, a job is queued in Sidekiq which simply reads each row into a table. As each record is successfully imported, another job is queued which runs the Toofr process on it. So now when a 10,000-record CSV file is imported, instead of one HUGE job we see 10,001 tiny jobs (one for the import itself and 10,000 to process the individual records). My worker dyno can run 50 jobs simultaneously, so those 10,001 jobs are handled in chunks of 50. This effectively makes it 50x faster than the old way. That's a big win!
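The fan-out design can be sketched with a toy in-memory queue standing in for Sidekiq. Everything here is illustrative, not Toofr's actual code: the real app would define Sidekiq worker classes and call `perform_async`, and the "processing" of each record would be the Toofr lookup itself.

```ruby
require 'csv'

# Illustrative stand-in for a Sidekiq queue: each element is one pending
# per-record job.
QUEUE = []

# Job 1: import the CSV and enqueue one tiny job per row.
# In the real app this would insert the row into a table and call
# something like ProcessRecordJob.perform_async(record_id).
def import_csv(csv_text)
  CSV.parse(csv_text, headers: true).each do |row|
    QUEUE << row.to_h
  end
end

# The worker drains the queue in chunks of `concurrency` jobs at a time,
# mirroring a worker dyno configured with :concurrency: 50.
def drain(concurrency: 50)
  batches = 0
  until QUEUE.empty?
    QUEUE.shift(concurrency).each { |job| job } # process up to 50 records
    batches += 1
  end
  batches
end

# 125 records fan out into 125 tiny jobs, drained in 3 batches of up to 50.
csv = "email\n" + Array.new(125) { |i| "user#{i}@example.com" }.join("\n")
import_csv(csv)
batches = drain(concurrency: 50)
```

The key property is that each job is tiny and independent, so a deploy that restarts the worker only loses the handful of jobs in flight, not the whole import.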

I also began thinking about how Toofr users might want to organize their data. Lists are a common and logical construct, but they didn't exist in Toofr yet. I was selling lists of founders whose funding announcements were captured on Crunchbase, but those were simply files hosted on Amazon S3. There was nothing Toofr-ish about it. It also reeked of a refactor, so I decided to do a major merger of files and lists. Instead of just being a static file hosted on S3, a list became a group of records in a table in Toofr's database. That's essentially what the file import refactor I described above required.

So now when you import a file, it first makes a new list, and then it imports your CSV rows into records in your Toofr list. Then it runs whichever Toofr process you requested on those records. The finished file is created from database records and stored on S3, but it can be updated and rewritten any time based on the associated records. It's a far more elegant approach and I'm very, very happy with it.
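A minimal sketch of the "file regenerated from records" idea, using Ruby's standard CSV library. The field names and the plain array standing in for the list's database records are hypothetical; the real code would pull ActiveRecord rows and upload the generated file to S3.

```ruby
require 'csv'

# Regenerate a list's CSV from its records. Because the records live in the
# database, this can be re-run any time the records change, and the stale
# file on S3 simply gets overwritten.
def list_to_csv(records)
  CSV.generate do |csv|
    csv << %w[name email] # hypothetical columns
    records.each { |r| csv << [r[:name], r[:email]] }
  end
end

records = [
  { name: 'Ada',  email: 'ada@example.com' },
  { name: 'Alan', email: 'alan@example.com' }
]
csv_text = list_to_csv(records)
```

The design choice is that the database records are the source of truth and the S3 file is just a cached rendering of them, which is what makes the file updatable after the fact.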

What's next?

I ask myself this every day! For now, I'm trying to figure out why I'm getting all of these timeouts. Some API requests are taking the maximum 30 seconds to respond, and that's not right. Each timeout shows up as a red dot in the (amazing, supercool) Heroku metrics chart.

I also need to learn how to persist data in Redis so I can finally integrate some classification data I've compiled that trains an artificial intelligence tool how to decide what industry a website is in and whether or not it's an agency. Pretty cool stuff! Since Redis is hosted by Heroku, and I've been very hands-off on it, letting Heroku simply work its magic, I'm going to need to actually learn what's going on there.

And just in case you found this post while Googling a Heroku deploy problem, I'm closing with my Procfile, Puma, and Sidekiq config files. I hope they help you on your quest.

Here's my Procfile

web: bundle exec puma -C config/puma.rb
worker: bundle exec sidekiq -C config/sidekiq.yml

And config/sidekiq.yml

development:  
  :concurrency: 1
staging:  
  :concurrency: 4
production:  
  :concurrency: 50
:queues:
  - high
  - default
  - mailers
  - low

config/initializers/sidekiq.rb

if Rails.env.production?
  Sidekiq.configure_client do |config|
    config.redis = { url: ENV['REDIS_URL'] }
  end

  Sidekiq.configure_server do |config|
    config.redis = { url: ENV['REDIS_URL'] }
  end
end

And finally the config/puma.rb

workers Integer(ENV['WEB_CONCURRENCY'] || 1)
threads_count = Integer(ENV['RAILS_MAX_THREADS'] || 25)
threads threads_count, threads_count

preload_app!

rackup      DefaultRackup
port        ENV['PORT']     || 3000
environment ENV['RACK_ENV'] || 'development'

on_worker_boot do
  ActiveRecord::Base.establish_connection
end