Updated on November 4, 2019
In a previous post, I wrote about a Sinatra app I built to pull data from a site that publicly lists email domain patterns. Here's an explanation of how the code works.
This is just a review of the relevant code snippets. And by the way, itsasecret.com is not the actual site. I don't want to give away all my tricks or upset the owner of the site, so I'm not disclosing which one it is. Instead, I'll just tell you how I scraped it.
First things first.
If you pick up scraping, you'll start a lot of your scripts like this.
page = Nokogiri::HTML(open("http://www.itsasecret.com/i/browse_letter/?letter=#{l}&page=#{p}"))
A lot of magic happens here! This little line does a few things:
- It takes a letter and a page number and interpolates them into a URL. This line sits inside a nested loop (for each letter, go through these page numbers).
- It opens the web page. For those new to Ruby, yes, it is as easy as open(put-url-here), once you have required open-uri.
- It loads the HTML into Nokogiri (a very important Ruby library that parses HTML and XML into nodes and such) and then stores the result in a variable I'm calling "page."
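The nested loop itself isn't shown in the snippet, so here is a minimal sketch of how it might be set up. The helper names (browse_url, browse_urls), the letter range, and the page count are assumptions for illustration, not the original code. The fetch is left as a comment so the sketch runs without touching the network; note that on Ruby 3.0 and later the call is URI.open rather than a bare open.

```ruby
# Placeholder base URL from the post (itsasecret.com is a stand-in, not the real site).
BASE_URL = "http://www.itsasecret.com/i/browse_letter/"

# Build the browse URL for one letter and one page number.
def browse_url(letter, page_number)
  "#{BASE_URL}?letter=#{letter}&page=#{page_number}"
end

# Nested loop: for each letter, go through each page number.
# The default ranges are assumptions; the real site's limits are unknown.
def browse_urls(letters: ("a".."z"), pages: (1..3))
  letters.flat_map do |letter|
    pages.map { |page_number| browse_url(letter, page_number) }
  end
end

browse_urls(letters: ("a".."b"), pages: (1..2)).each { |url| puts url }

# To actually fetch and parse each page, you would do something like:
#   require 'open-uri'
#   require 'nokogiri'
#   page = Nokogiri::HTML(URI.open(url))
```

Building the list of URLs separately from fetching them also makes it easy to add a pause between requests, which is polite when you're hitting someone else's site.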
In other words, we got a website opened and parsed. Now what?
domains = page.xpath("//a[@class='block']")
Well, I actually need to open more pages, because the page I just opened is full of links to other pages I want to open. I want to get all those links, so this line says:
- Find all the anchor tags with class='block'. The HTML would look like this: <a href='somewhere' class='block'>something something </a>. Nokogiri allows xpath syntax, which is a very powerful and standardized way to parse XML.
- Store them into a new variable I'm calling "domains."
I know I want the domains with that class because I looked at the HTML and they all have that in common. If you do a lot of scraping, you'll find that discovering these patterns is critical.
So now I have a list of all of these domain nodes.
Each of them has the text of the domain and a link to the interior page that actually has the data I want. Here’s what one of the domain nodes actually looks like:
#<Nokogiri::XML::Element:0x3fd620921f40 name="a" attributes=[#<Nokogiri::XML::Attr:0x3fd620921ea0 name="class" value="block">, #<Nokogiri::XML::Attr:0x3fd620921e8c name="href" value="/d/a--design.com/">] children=[#<Nokogiri::XML::Text:0x3fd620921298 "\n\t\t\t\t a--design.com\n\t\t\t\t ">]>
I actually want that href, so all I have to do is domain['href'], and to get the domain text itself, it's just domain.content.strip to take away all the \n\t\t\t crap!
So now what? You guessed it: we open that href!
domain_page = Nokogiri::HTML(open("http://www.stillasecret.com#{domain['href']}"))
And now I again use xpath to get the exact data I want on the page:
containers = domain_page.xpath("//div[@class='format_score_container']")
containers.each do |c|
  pattern = c.xpath("div[@class='format fl']").first.content.strip
  confidence = c.xpath("div[@class='score_container fl']/div[@class='confidence_value fl']").first.content.strip
end
And that’s really it, folks. The pattern data I want is in an HTML block that looks like this:
<div class='format_score_container'>yummy juicy data I want</div>
I iterate through that yummy juicy data and get the pattern and confidence score for each one. What I'm not showing here is the database connection I use to store these guys, but that's pretty basic Postgres / ActiveRecord stuff. You can google it.