Since all SMTP / Internet mail contains headers, there are many easily recognized header patterns that can be used to categorize messages both quickly and precisely. The biggest downside of this technique is that it requires regular manual curation of the classification file. However, this is also one of its greatest strengths, as it provides a "manual override" to immediately address misclassifications and implement new rules for business or legal reasons.
Regex complements other classification techniques because it is immediately available to new users. It requires no external input like training data or the user's social graph. A good analogy is the SpamAssassin rule set (ref, cookbook, ref), which is also regex-based. Thus far, we've been really pleased with how our Regex Classifier has performed, so I thought I'd outline its design, limitations, and configuration below.
The Regex Classifier is designed to:
- Be entirely data driven. All classification heuristics are to be in config / YAML files, not in code.
- Provide a confidence factor for each regex, to be used as a weight when its result is combined with other classifiers.
- Check each header against all rules and return the match with the highest confidence factor.
- Be fast and memory efficient enough for online execution.
- Maintain an error rate below 0.5%, measured on a broad sample of email.
- Require no user configuration, but take advantage of it when it is present.
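To make the "check each header against all rules, return the highest confidence" goal concrete, here is a minimal Ruby sketch. It is not our production code: the Rule struct, the RULES list, the classify method, and the sample header values are all hypothetical names chosen for illustration.

```ruby
# Hypothetical rule shape; the field names are illustrative assumptions.
Rule = Struct.new(:classification, :header, :pattern, :confidence)

# In the real classifier these would come from the YAML file, not from code.
RULES = [
  Rule.new("Newsletter", "List-Unsubscribe", /.+/i, 0.80),
  Rule.new("Colleague", "From", /@ripariandata\.com/i, 0.95),
].freeze

# Check every header against every rule and keep the match with the
# highest confidence. Returns nil when no rule matches.
def classify(headers, rules = RULES)
  matches = []
  headers.each do |name, value|
    rules.each do |rule|
      next unless rule.header.casecmp?(name) # header names are not case sensitive
      matches << rule if rule.pattern.match?(value)
    end
  end
  matches.max_by(&:confidence)
end

sample = {
  "from" => "Alice <alice@ripariandata.com>",
  "List-Unsubscribe" => "<mailto:unsubscribe@example.com>",
}
best = classify(sample)
puts "#{best.classification} (#{best.confidence})"
```

Both rules match the sample message, and the From rule wins because its confidence (0.95) is higher than the newsletter rule's (0.80).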
The current version has the following limitations:
- Only the email's header is parsed, not the body.
- All headers are checked against all rules, making this an O(n²) problem.
- A given email has only a single classification based on the best match.
- No multifactor rules (e.g. AND and NOT operators, like SpamAssassin's meta rules).
- No attachment matching / MIME type pattern matching, although the MIME email headers can be used.
- No parameterized regexes (by user name, or other run-time criteria)
- No automatic rule reloading when the definition file changes in production.
- There is currently no provision for overriding system defaults on a per-user basis.
- There is no way to recategorize an existing message. For example, the system configuration may classify a message as Newsletter when a user wants that message only in her Banking folder.
- There is no pattern matching on the header name. It is not possible to match all fields beginning with F with a pattern such as "^F.*". Regular expressions are applied only to the header value.
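The last limitation is easy to trip over, so here is a small Ruby sketch of the behaviour described above. The rule_matches? helper and the example.com addresses are hypothetical; the point is only that the header name is compared literally while the value is compiled as a regex.

```ruby
# Hypothetical helper mirroring the described behaviour: the header name is
# compared literally (ignoring case); only the value is treated as a regex.
def rule_matches?(rule_header, rule_value, name, value)
  rule_header.casecmp?(name) &&
    Regexp.new(rule_value, Regexp::IGNORECASE).match?(value)
end

# The name matches literally (case ignored) and the value matches by regex.
puts rule_matches?("From", '@example\.com', "FROM", "bob@example.com")

# A pattern in the header name is not interpreted, so this rule never fires.
puts rule_matches?("^F.*", '@example\.com', "From", "bob@example.com")
```

The first call matches and the second does not, even though "From" begins with F.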
Classification Configuration File
The configuration file is in YAML format, with the following hierarchy:
description: <description A>
category: <major category>
- header: <header1>
  value: <value1 regex>
- header: <header2>
  value: <value2 regex>
- <subcategory-A> is the desired classification for this pattern match
- <description A> is a description of the rule. It is optional but suggested.
- <category> is the optional major category, e.g. "respond", "read", "skim".
- <header1> is the header to find, written without the trailing ':', e.g. "X-Mailer". Headers are not case sensitive.
- <value1 regex> is a standard Ruby regular expression, e.g. "@ripariandata\.com". Matching is case insensitive.
- <confidence> is a manually entered likelihood of accuracy. Only a completely unambiguous regex should have a confidence close to, but less than, 1.0. The default should be 0.80, which presumes the rule is quite likely to be accurate.
- A confidence of 1.0 or more is a special value that overrides every other means of classification. When a rule with this confidence matches, no other automated classification should be applied.
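As a rough illustration of how such a file could be loaded, here is a Ruby sketch. The exact key layout is an assumption: it treats the top-level key as the classification and uses hypothetical "confidence" and "patterns" keys, and the bank address is made up. The load_rules and override? helpers are illustrative names too.

```ruby
require "yaml"

DEFAULT_CONFIDENCE = 0.80 # default suggested in the text above

# Assumed layout: top-level key = classification; "confidence" and
# "patterns" are hypothetical key names used only for this sketch.
CONFIG = <<~'YAML'
  Banking:
    description: Statements from the bank
    category: read
    confidence: 0.95
    patterns:
      - header: From
        value: '@bank\.example\.com'
  Newsletter:
    description: Bulk mail with an unsubscribe header
    category: skim
    patterns:
      - header: List-Unsubscribe
        value: '.+'
YAML

# Turn the YAML text into a flat list of rule hashes with compiled,
# case-insensitive regexes. Each pattern is an independent rule.
def load_rules(yaml_text)
  YAML.safe_load(yaml_text).flat_map do |classification, entry|
    entry.fetch("patterns").map do |pattern|
      {
        classification: classification,
        category: entry["category"],
        confidence: entry.fetch("confidence", DEFAULT_CONFIDENCE).to_f,
        header: pattern.fetch("header"),
        regex: Regexp.new(pattern.fetch("value"), Regexp::IGNORECASE),
      }
    end
  end
end

# A confidence of 1.0 or more means the regex result overrides other classifiers.
def override?(rule)
  rule[:confidence] >= 1.0
end

load_rules(CONFIG).each do |rule|
  puts "#{rule[:classification]}: #{rule[:header]} =~ #{rule[:regex].source} " \
       "(confidence #{rule[:confidence]}, override: #{override?(rule)})"
end
```

The Newsletter entry omits a confidence, so it falls back to the 0.80 default, and neither rule reaches the 1.0 override threshold.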
While you can't build a truly intelligent inbox with Regex alone, we've found that combining its rules with social graphing, dyadic reciprocity, and Naive Bayesian classifiers creates one heckuva email classification machine.
Don't believe us? Test it out for yourselves!