October 20, 2014
How to train your inbox using Microsoft Azure Machine Learning (part two)
In this three-part series, Theta Software’s product architect and lead consultant Jim Taylor takes a closer look at Microsoft Azure Machine Learning. New here? Read part one first: What is machine learning and how might it be useful?
Could I train my inbox using Microsoft Azure Machine Learning?
When I’m looking at new technologies or trying to get to grips with a new concept, I like to get my hands dirty and try things out for myself. So I set about finding a problem that a predictive analysis experiment could solve.
The problem...
At Theta we have a number of people based at various locations: working in our two offices, working from home, working at client sites and so on.
Most people are in the habit of sending what I call “whereabouts emails” with subject lines like:
<out> An hour or so
<WFH> Sick today
<WFH>
<late>in the office from 9.45am
Out-ClientName
WFH
<in later> In after lunch. At XXX from 11:30am to 1:30pm then office
Bob on leave until 3rd
Out for lunch
As you’d expect, I have a rule set up for this. Most people follow the convention of using angle brackets to denote WFH (working from home), OUT etc., and these emails are pretty much never sent directly to me.
Apart from the odd false positive and emails that slip through the rule, this works well enough most of the time.
So it turns out I have a readily available source of training and test data (my inbox and whereabouts folder) and a problem looking for a solution. Can I make an experiment that predicts whether an email is a “whereabouts email” without explicitly defining rules? Can I train a model by using the emails in my inbox as training and test data – by providing details of all emails and those that have ended up in my whereabouts folder?
Designing the experiment
The first step is to decide what data to use in the experiment. What attributes of each email in my inbox and other folders would provide useful indicators?
I decided to try the following:
- Is in whereabouts folder – this is what we are predicting so we provide this to train the model.
- Has attachments – Whereabouts emails tend not to have attachments
- Sent direct – Whereabouts emails tend to be sent to groups of people rather than directly to me
- May contain a time – The subject contains text that could be interpreted as a time, e.g. 1pm, 12:30
- Is reply or forward – Whereabouts emails tend not to be a reply or forwarded email
- Received day of week – Interested to see if this is a factor
- Received hour – Interested to see if there is a pattern here (I tend to see these emails arrive in the morning and evening)
- Subject word count – Tends to be low
- Body word count – Tends to be low
- Sender domain – Usually from the company domain
- Has CC – Whereabouts emails rarely contain a CC
- Importance – Interested to see if this is a factor
- Body format – Interested to see if this is a factor
- Special character count – Included to see if this can help, given the convention of using angle brackets, but not so specific as to lead the experiment too much.
- Subject number count – Included to see whether this is a factor
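To make the list above concrete, here is a minimal Python sketch of how most of these attributes might be derived from a single email. It is not the console application described in this post: the `Email` class, the `extract_features` function, the column names and the time pattern are all hypothetical, chosen for illustration. "Sent direct" is assumed to mean that my own address appears on the To line, and "subject number count" is assumed to mean the number of digit runs in the subject. Importance and body format are left out because they would be read straight from the mail item rather than computed.

```python
import re
from dataclasses import dataclass, field
from datetime import datetime

# Hypothetical pattern for "something that may be a time": 1pm, 9.45am, 12:30, 11:30am.
TIME_PATTERN = re.compile(
    r"\b(?:\d{1,2}[:.]\d{2}\s?(?:am|pm)?|\d{1,2}\s?(?:am|pm))\b",
    re.IGNORECASE,
)


@dataclass
class Email:
    """Illustrative stand-in for a mail item; not an Outlook type."""
    subject: str
    body: str
    sender: str
    received: datetime
    to: list = field(default_factory=list)
    cc: list = field(default_factory=list)
    attachment_count: int = 0


def extract_features(email, my_address):
    """Return one row of features for a single email."""
    subject = email.subject
    recipients = [address.lower() for address in email.to]
    return {
        "HasAttachments": email.attachment_count > 0,
        # Assumption: "direct" means my address is on the To line.
        "SentDirect": my_address.lower() in recipients,
        "MayContainTime": bool(TIME_PATTERN.search(subject)),
        "IsReplyOrForward": subject.strip().lower().startswith(("re:", "fw:", "fwd:")),
        "ReceivedDayOfWeek": email.received.isoweekday(),  # 1 = Monday ... 7 = Sunday
        "ReceivedHour": email.received.hour,
        "SubjectWordCount": len(subject.split()),
        "BodyWordCount": len(email.body.split()),
        "SenderDomain": email.sender.rsplit("@", 1)[-1].lower(),
        "HasCC": len(email.cc) > 0,
        # Anything that is neither a letter, a digit nor whitespace, e.g. < and >.
        "SpecialCharCount": sum(
            1 for ch in subject if not ch.isalnum() and not ch.isspace()
        ),
        # Assumption: count runs of digits, so "9.45am" contributes two.
        "SubjectNumberCount": len(re.findall(r"\d+", subject)),
    }
```

With a subject such as "<late>in the office from 9.45am", this sketch flags a possible time and counts three special characters (the two angle brackets and the full stop). The pattern is deliberately loose, so it will also flag things like version numbers; that fits the "may contain" wording of the feature.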
So I created a console application that uses Outlook Office automation to iterate over all folders and emails in my inbox and produce CSV output.
The application source code can be found on GitHub.
The resulting CSV file (I had 2000+ rows) can be uploaded to Azure Machine Learning as a dataset.
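For a rough idea of the file's shape, here is a small Python sketch that writes one row per email with a header line, using the standard `csv` module. The column names and the `write_dataset` function are hypothetical and simply mirror the attribute list earlier in this post; the real application may name or order its columns differently. Booleans are written as 0 and 1, which is an assumption made to keep the label column numeric.

```python
import csv

# Hypothetical column names mirroring the attribute list; the label comes first.
COLUMNS = [
    "IsInWhereaboutsFolder",
    "HasAttachments",
    "SentDirect",
    "MayContainTime",
    "IsReplyOrForward",
    "ReceivedDayOfWeek",
    "ReceivedHour",
    "SubjectWordCount",
    "BodyWordCount",
    "SenderDomain",
    "HasCC",
    "Importance",
    "BodyFormat",
    "SpecialCharCount",
    "SubjectNumberCount",
]


def write_dataset(rows, out):
    """Write feature rows (dicts keyed by column name) as CSV to a file-like object."""
    writer = csv.DictWriter(out, fieldnames=COLUMNS)
    writer.writeheader()
    for row in rows:
        # Store booleans as 0/1; leave every other value as it is.
        writer.writerow(
            {key: int(value) if isinstance(value, bool) else value
             for key, value in row.items()}
        )
```

To write to disk, open the target with `open(path, "w", newline="")` and pass the file object as `out`. Missing columns are left empty, while a row containing a key that is not in `COLUMNS` raises a `ValueError`, which helps catch typos in feature names early.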
In my next post, I'll go through, step by step, how to run this experiment, evaluate it and publish it as a web service. Go to part three of the Microsoft Azure Machine Learning series.