Updated on November 4, 2019
A couple of weekends ago I built my first Sinatra app to find some more business email patterns for Toofr. Gotta say... 'tis wonderful to work with such a clean, uncluttered Ruby framework.
Email Scraping: The Birth of the Blues
My little scraper is essentially a web app without a website. I didn't know very much about Sinatra when I got started. In fact, I still don't know much about it. I'm probably only scratching the surface of what it can do. What I do know is that it requires only a few files and that you can call its code from a Rake task. This is important, as I'll describe later on.
So why did I choose Sinatra? Simply put, I wanted a few things:
- Keep it lightweight. My pure Ruby scraper script was just one file, so I figured a full-blown Rails app was wayyyy overpowered. I just need the bare bones.
- Use ActiveRecord for database reading and writing. This way I wouldn't have to write any raw SQL for Postgres. I'm familiar with ActiveRecord syntax and wanted to build fast without constantly looking up SQL queries.
- Play nicely with Heroku. I decided from the beginning that I was going to run this little scraper on Heroku. I found from testing my script locally that the target site would ban my IP after a very small number of requests. Since Heroku spins up a new dyno each time a scheduled task runs, a handy way to avoid using proxy servers is to let Heroku give you a fresh dyno whenever you get blocked.
Sales Hacking: All of Me
Exactly how lightweight is Sinatra? I was amazed. It's super light. Here's the file list. It's six files, and one of them is blank! (That's the Procfile, in my case, since I'm not running a website.)
- app.rb - This file seems to be the guts. It's loading all the Sinatra goodies and includes my models.
- config.ru - I read that Heroku likes seeing this, so I included it.
- environments.rb - This defines my Heroku and local database connections.
- Gemfile - Just like a Rails app, this controls my libraries.
- Procfile - Since I don't need Heroku to run any web or worker dynos, I include this file but leave it blank.
- Rakefile - Here's where I put the actual scraper code. The scraper itself became a Rake task that gets called by Heroku Scheduler.
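To make the Rakefile idea concrete, here's a minimal sketch of a scraper wrapped in a Rake task. It is not my actual scraper: the `Scraper` module, the `scraper:run` task name, and the example domains are all made up for illustration, and the "scraping" is a stub that just returns a string. In a real Rakefile you'd normally write `task :run do ... end`; I use `Rake::Task.define_task` here so the snippet also runs as a plain Ruby script.

```ruby
require 'rake'

# Hypothetical stand-in for the real scraping logic, which isn't shown here.
module Scraper
  # Collects results in memory so the task's effect is visible without a database.
  def self.results
    @results ||= []
  end

  # Stub: a real version would fetch pages and save records.
  def self.scrape(domain)
    "scraped #{domain}"
  end
end

# Define a task named scraper:run (the name is an assumption).
Rake::Task.define_task('scraper:run') do
  %w[example.com example.org].each do |domain|
    Scraper.results << Scraper.scrape(domain)
  end
end

# From a shell or Heroku Scheduler this would be `rake scraper:run`;
# invoking the task programmatically runs the same block.
Rake::Task['scraper:run'].invoke
```

The nice part of this setup is that the scheduler only needs a single command to run, and no web or worker process has to stay up in between.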
The Best is Yet to Come
I'll describe the Ruby scraper in more detail in my next post. It's a pretty brute-force technique, but it's working really well!
A quick teaser: here are the contents of the app.rb file.
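# app.rb
require 'sinatra'
require 'sinatra/activerecord'  # ActiveRecord integration for Sinatra
require './environments'        # database connection settings

# Models. ActiveRecord maps each class to a table by naming convention.
class Domain < ActiveRecord::Base
end

class Page < ActiveRecord::Base
end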
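Since app.rb requires ./environments, here's a rough sketch of the kind of logic a file like that might hold. This is an illustration, not my actual environments.rb: the `database_config` helper and the local database name are hypothetical. It relies on the fact that Heroku Postgres exposes its connection string in the `DATABASE_URL` environment variable, and falls back to a local database when that variable isn't set.

```ruby
require 'uri'

# Hypothetical local fallback; the database name is an assumption.
LOCAL_DATABASE_URL = 'postgres://localhost/scraper_development'

# Turn a postgres:// URL into a connection settings hash.
def database_config(url)
  uri = URI.parse(url)
  {
    adapter:  'postgresql',
    host:     uri.host,
    port:     uri.port || 5432,             # default Postgres port if none given
    username: uri.user,
    password: uri.password,
    database: uri.path.sub(%r{\A/}, ''),    # strip the leading slash
    encoding: 'utf8'
  }
end

# Heroku sets DATABASE_URL; locally we fall back to the development database.
config = database_config(ENV['DATABASE_URL'] || LOCAL_DATABASE_URL)

# In the real app, a hash like this would be handed to
# ActiveRecord::Base.establish_connection(config).
```

With something like this in place, the same code runs locally and on Heroku without any changes.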