Updated on November 4, 2019
The thing about AI / NLP / ML (that's artificial intelligence, natural language processing, and machine learning for the uninitiated) is that it requires data, and lots of it. For a lot of companies, that can be a non-starter. You can't venture into this hot new field unless there's something for the computer to chew on.
I got into it by simply having an idea. What if I could use this new technology to tell me what industry a webpage belongs to and whether or not it's an agency? Here's basically what I did and how the new Toofr AI service works.
I already had my training data
Natural language processing, and classifiers in particular, requires "training data" in order to operate. This is the raw material you feed into the machine so it can learn. Classifiers require very simple key-value pairings. I'll steal this example from the Ruby gem that I used, Stuff Classifier. Once you've created an instance of the classifier (that's the cls below) you can start training it.
cls.train(:dog, "Dogs are awesome, cats too. I love my dog")
cls.train(:cat, "Cats are more preferred by software developers. I never could stand cats. I have a dog")
cls.train(:dog, "My dog's name is Willy. He likes to play with my wife's cat all day long. I love dogs")
cls.train(:cat, "Cats are difficult animals, unlike dogs, really annoying, I hate them all")
This is a silly example but it gets the point across. The train method on the classifier takes two inputs. The first is the key, or the concept that you're training the bot to understand, and the second is the value, or an example of how the concept might be defined or put in some other context in the real world. Behind the scenes, the classifier uses one of two algorithms (the gem ships a naive Bayes classifier and a tf-idf classifier) to figure out, given a body of these key-value pairs, what the key should be for a value it's never seen before.
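To make the "behind the scenes" part a bit more concrete, here's a minimal sketch of a naive Bayes text classifier in plain Ruby. This is not Stuff Classifier's actual code. The TinyBayes class name, the tokenizer, and the add-one smoothing are all assumptions I've made for illustration, and the real gem does more (stemming and stop words, for example).

```ruby
# Toy naive Bayes classifier: an illustrative sketch, not the gem's implementation.
class TinyBayes
  def initialize
    @word_counts = Hash.new { |h, k| h[k] = Hash.new(0) } # category => word => count
    @doc_counts  = Hash.new(0)                             # category => number of examples
    @vocabulary  = {}                                      # every distinct word seen so far
  end

  # Record one key-value pair: the category (key) and an example text (value).
  def train(category, text)
    @doc_counts[category] += 1
    tokenize(text).each do |word|
      @word_counts[category][word] += 1
      @vocabulary[word] = true
    end
  end

  # Return the category with the highest log-probability score for the text.
  def classify(text)
    words = tokenize(text)
    total_docs = @doc_counts.values.sum.to_f
    @doc_counts.keys.max_by do |category|
      score = Math.log(@doc_counts[category] / total_docs) # prior for the category
      total_words = @word_counts[category].values.sum
      words.each do |word|
        # Add-one smoothing so a word never seen in a category doesn't zero out the score.
        count = @word_counts[category][word] + 1
        score += Math.log(count.to_f / (total_words + @vocabulary.size))
      end
      score
    end
  end

  private

  # Assumed tokenizer: lowercase words, apostrophes kept.
  def tokenize(text)
    text.downcase.scan(/[a-z']+/)
  end
end

cls = TinyBayes.new
cls.train(:dog, "Dogs are awesome, cats too. I love my dog")
cls.train(:cat, "Cats are more preferred by software developers. I never could stand cats. I have a dog")
cls.train(:dog, "My dog's name is Willy. He likes to play with my wife's cat all day long. I love dogs")
cls.train(:cat, "Cats are difficult animals, unlike dogs, really annoying, I hate them all")
puts cls.classify("I love playing with my dog")
```

The idea is just counting: the more often a word shows up under a key during training, the more that word pulls new text toward that key.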
My training set looks something like this.
cls.train(:seo, "Our team develops online marketing strategies to help promote your services and...")
cls.train(:web_design, "A full service Atlanta web development and digital marketing agency located in Alpharetta...")
cls.train(:advertising, "A full service boutique creative agency in Austin Texas We uncover the truth that drives a brand...")
cls.train(:copywriters, "Cincinnati freelance copywriter food stylist freelance copywriter...")
You might have figured out that in this example I'm training a classifier to figure out what kind of agency it's looking at. I gathered this dataset from a few different sources which simply gave me the agency website and a category the agency was associated with. It's critical to get a large number of associations between these categories and the websites, especially if there are a lot of categories (or keys, per my description above). Based on the documentation for IBM Watson's API, it looks like you should have at least 30 values per key you're trying to train.
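A quick way to check that rule of thumb is to count the examples per category before training. Here's a small sketch. The undertrained_categories helper and the sample pairs are hypothetical, and the 30 is just the threshold mentioned above.

```ruby
MIN_EXAMPLES = 30 # rule-of-thumb threshold from the text, not a hard limit

# Returns a hash of category => example count for every category below the minimum.
def undertrained_categories(pairs, minimum = MIN_EXAMPLES)
  counts = pairs.each_with_object(Hash.new(0)) do |(category, _content), tally|
    tally[category] += 1
  end
  counts.select { |_category, count| count < minimum }
end

# Hypothetical (category, content) pairs standing in for the real dataset.
pairs = [
  [:seo, "search rankings and online marketing"],
  [:seo, "keyword research and link building"],
  [:web_design, "responsive websites for small businesses"]
]

undertrained_categories(pairs).each do |category, count|
  puts "#{category}: only #{count} examples"
end
```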
The values I need aren't simply the website, of course. Instead, I need some content about the website. To capture that, I built a quick and dirty bot that opens up the website and extracts the title, meta description, and content of the header tags (h1, h2, h3, etc). I also looked for any links with the word "about" in them and opened those and scraped the same content. I stored all of this into a single field in my database. My hefty little bot did this about 3500 times in less than 15 minutes. Web technology still amazes me. Even the basic stuff.
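I'm not going to paste the whole bot here, but the extraction step can be sketched roughly like this. The extract_page_text name and the sample HTML are made up for illustration, and I'm using regular expressions only to keep the example free of dependencies. A real scraper would more likely use an HTML parser, and it would also need the fetching and "about" link parts that this sketch leaves out.

```ruby
# Rough sketch: collect the title, meta description, and heading text from an HTML string.
# Assumes the meta tag lists name="description" before content="...".
def extract_page_text(html)
  title = html[/<title[^>]*>(.*?)<\/title>/mi, 1]
  description = html[/<meta\s+name=["']description["']\s+content=(["'])(.*?)\1/mi, 2]
  # scan returns [level, inner_html] pairs; keep only the inner HTML of each heading
  headings = html.scan(/<h([1-6])[^>]*>(.*?)<\/h\1>/mi).map(&:last)

  [title, description, *headings]
    .compact
    .map { |text| text.gsub(/<[^>]+>/, " ").gsub(/\s+/, " ").strip } # drop nested tags, squeeze whitespace
    .reject(&:empty?)
    .join(" ")
end

# Hypothetical page used only to exercise the method.
sample_html = <<~HTML
  <html><head>
    <title>Acme Creative | Austin Agency</title>
    <meta name="description" content="A hypothetical full-service agency.">
  </head><body>
    <h1>We build <em>brands</em></h1>
    <h2>About us</h2>
  </body></html>
HTML

puts extract_page_text(sample_html)
```

Everything ends up in one string, which is the kind of single field I stored per website.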
If you want to follow along...
If you want to follow along, run this in a Rails console. You'll need to initialize your classifier first. Rather than type that part out for you, I'll point you to this tutorial, which was really good. I'm saving you the hassle of getting the content yourself and instead letting you tap into a CSV file with some sample data.
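require 'csv'
require 'open-uri'

# Download the sample CSV; it is ISO-8859-1 encoded, so convert it to UTF-8 before parsing
url = 'https://s3.amazonaws.com/toofr-dev/nlp_agencies_sample.csv'
data = URI.open(url).read.force_encoding("ISO-8859-1").encode("UTF-8")
csv = CSV.parse(data)
StuffClassifier::Bayes.open("industries") do |cls|
  csv.each do |row|
    # Column 1 is the category label, column 2 is the scraped content
    cls.train(row.first.parameterize.underscore.to_sym, row.second)
  end
end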
Even if you're new to Ruby or programming in general, this should be pretty easy to read. I've opened a file, parsed it as CSV, and then iterated through the rows to train my classifier.
So, hey, congrats! Now you can say that you too are adept at ML / AI / NLP. You've trained a classifier. Next step is to classify. When you use Toofr's Get Company Data feature, here's all that's happening.
classifier = StuffClassifier::Bayes.new("industries")
industry = classifier.classify(content)
The content variable is the content I scraped from a new website. It's basically saying to the classifier, "Hey, you've seen a whole bunch of key-value pairs. Here's a value (the content). Tell me which key (or agency category) it looks like."
And that, in a very small nutshell, is how to use Ruby to run some natural language processing in a sales data context.
What's next?
Well, the way the matches will get better over time is if I can give feedback to the classifier. Sometimes the data is off. Like, way off. It's really difficult to know why without dissecting the algorithm. However, I can go back to the training material and see if something was wrong. So, I need a way for users to tell me when the data is bad. It would also be helpful to know when it's good, and I can then re-classify based on new data when a particular industry isn't classifying properly.
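I haven't built this yet, so take the following as a sketch of one way the feedback could be captured. Every name here (Feedback, FeedbackLog, and its methods) is hypothetical and nothing in it comes from Toofr or the gem.

```ruby
# One piece of user feedback: the scraped content, what the classifier predicted,
# and what the user says the category should be (nil if they only flagged it as wrong).
Feedback = Struct.new(:content, :predicted, :correct)

class FeedbackLog
  def initialize
    @entries = []
  end

  def record(content:, predicted:, correct:)
    @entries << Feedback.new(content, predicted, correct)
  end

  # Share of feedback entries where the prediction was confirmed, per predicted category.
  def accuracy_by_category
    @entries.group_by(&:predicted).transform_values do |entries|
      entries.count { |e| e.predicted == e.correct }.fdiv(entries.size)
    end
  end

  # Entries with a known correct category, as (category, content) pairs ready for retraining.
  def training_pairs
    @entries.reject { |e| e.correct.nil? }.map { |e| [e.correct, e.content] }
  end
end

log = FeedbackLog.new
log.record(content: "We design websites", predicted: :seo, correct: :web_design)
log.record(content: "We improve search rankings", predicted: :seo, correct: :seo)
log.training_pairs.each { |category, content| puts "#{category}: #{content}" }
```

A category with a low score in accuracy_by_category would be the first place to go back and look at the training material.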