How to Scrape Email Data Using Ruby and Nokogiri

If you don't know, now you know

Updated on November 4, 2019

In a previous post, I wrote about a Sinatra app I built to pull data from a site that publicly lists email domain patterns. Here's an explanation of how the code works.

This is just a review of the relevant code snippets. The site's URL is deliberately left out of them: I don't want to give away all my tricks or upset the owner of the site, so I'm not disclosing which one it is. Instead, I'll just tell you how I scraped it.

First things first.

If you pick up scraping, you'll start a lot of your scripts like this.

# l is the letter, p is the page number; the site's URL is omitted from the string
page = Nokogiri::HTML(open("#{l}&page=#{p}"))

A lot of magic happens here! This little line does a few things:

  • It takes a page number and a letter and passes it into a URL. This line is in a nested loop (for each letter, go through these page numbers).
  • It opens the web page. For those new to Ruby, yes, it is as easy as open(put-url-here) once you require 'open-uri'. (On Ruby 3.0 and later, call URI.open instead.)
  • It loads the HTML into Nokogiri (a very important Ruby library that parses HTML and XML into a tree of nodes) and then stores that into a variable I'm calling "page."
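
Here is a minimal sketch of the nested loop described above. The base URL (https://example.com/domains) and the "letter" query parameter are placeholders I made up for illustration, since the real site and its URL scheme are not disclosed. It only builds the URLs; nothing is fetched.

```ruby
require "uri"

# Build one listing URL per (letter, page) pair.
# base_url and the "letter" parameter name are hypothetical placeholders.
def listing_urls(letters, pages, base_url: "https://example.com/domains")
  letters.flat_map do |l|
    pages.map do |p|
      "#{base_url}?#{URI.encode_www_form(letter: l, page: p)}"
    end
  end
end

urls = listing_urls("a".."b", 1..2)
puts urls

# Each URL would then be opened and parsed, for example:
#   page = Nokogiri::HTML(URI.open(url))
```

Building the list first keeps the fetching step separate, which makes it easier to add a pause between requests or to resume after a failure.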

In other words, we got a website opened and parsed. Now what?

domains = page.xpath("//a[@class='block']")

Well, I actually need to open more pages, because the page I just opened is full of links to other pages I want to open. I want to get all those links, so this line says:

  • Find all the anchor tags with class='block'. The HTML would look like this: <a href='somewhere' class='block'>something something </a>. Nokogiri supports XPath, a powerful, standardized language for selecting nodes in XML and HTML documents.
  • Store them into a new variable I'm calling "domains."

I know I want the domains with that class because I looked at the HTML and they all have that in common. If you do a lot of scraping, you'll find that discovering these patterns is critical.

So now I have a list of all of these domain nodes.

Each of them has the text of the domain and a link to the interior page that actually has the data I want. Here’s what one of the domain nodes actually looks like:

#<Nokogiri::XML::Element:0x3fd620921f40 name="a" attributes=[#<Nokogiri::XML::Attr:0x3fd620921ea0 name="class" value="block">, #<Nokogiri::XML::Attr:0x3fd620921e8c name="href" value="/d/">] children=[#<Nokogiri::XML::Text:0x3fd620921298 "\n\t\t\t\t\n\t\t\t\t ">]>

I actually want that href, so all I have to do is domain['href']. To get the domain text itself, it's just domain.content.strip, which takes away all the \n\t\t\t crap!

So now what? You guessed it: we open that href!

# the href is site-relative; the site's base URL is omitted from the string
domain_page = Nokogiri::HTML(open("#{domain['href']}"))

And now I again use xpath to get the exact data I want on the page:

containers = domain_page.xpath("//div[@class='format_score_container']")
containers.each do |c|
  # these paths are relative to c, so they only look inside the current container
  pattern = c.xpath("div[@class='format fl']").first.content.strip
  confidence = c.xpath("div[@class='score_container fl']/div[@class='confidence_value fl']").first.content.strip
end

And that’s really it, folks. The pattern data I want is in an HTML block that looks like this:

<div class='format_score_container'>yummy juicy data I want</div>

I iterate through that yummy juicy data and get the pattern and confidence score for each one. What I'm not showing here is the database connection I use to store these guys, but that's pretty basic Postgres / ActiveRecord stuff. You can google it.
