Parse email content from quoted reply

c# ruby email email-parsing

I did a lot more searching on this and here's what I've found. There are basically two situations under which you are doing this: when you have the entire thread and when you don't. I'll break it up into those two categories:

When you have the thread:

If you have the entire series of emails, you can achieve a very high level of assurance that what you are removing is actually quoted text. There are two ways to do this. One, you could use the message's Message-ID, In-Reply-To ID, and Thread-Index to determine the individual message, its parent, and the thread it belongs to. For more information on this, see RFC822, RFC2822, this interesting article on threading, or this article on threading. Once you have re-assembled the thread, you can then remove the external text (such as the To, From, and CC lines) and you're done.
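As a rough illustration of the header-based approach, here is a minimal Ruby sketch that follows In-Reply-To links back to the start of a thread. It assumes the headers have already been parsed into plain hashes; the keys :message_id, :in_reply_to and :body and both method names are made up for this example, and Thread-Index is not handled.

```ruby
# Each message is a plain Hash. The keys :message_id, :in_reply_to and :body
# are assumed names for values already parsed from the message headers.

# Returns the message this one replies to, or nil if it is a thread root
# or the parent is not in the collection.
def parent_of(message, messages)
  return nil if message[:in_reply_to].nil?
  messages.find { |m| m[:message_id] == message[:in_reply_to] }
end

# Walks up the In-Reply-To chain to the earliest message we have.
# The `seen` hash guards against malformed header loops.
def thread_root(message, messages)
  seen = {}
  current = message
  while (parent = parent_of(current, messages)) && !seen[parent[:message_id]]
    seen[current[:message_id]] = true
    current = parent
  end
  current
end
```

Once the parent is known, its body can be compared against the reply to decide which text is quoted.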

If the messages you are working with do not have the headers, you are stuck with similarity matching: comparing a message against the earlier ones to determine which text is repeated, and therefore which part is the new reply. In this case you might want to look into a Levenshtein Distance algorithm such as this one on Code Project or this one.
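To make the similarity idea concrete, here is a small Ruby sketch: a standard dynamic-programming Levenshtein distance, plus a helper that drops lines of a reply that are near-copies of lines in the earlier message. The helper names and the 0.2 threshold are assumptions for illustration, not something taken from the linked articles, and the threshold would need tuning on real mail.

```ruby
# Levenshtein edit distance between two strings (row-by-row dynamic programming).
def levenshtein(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Drops leading ">" quote markers and surrounding whitespace from a line.
def normalize(line)
  line.sub(/\A(\s*>)+/, "").strip
end

# Keeps only the lines of `reply` that are not near-copies of a line in
# `original`. `max_ratio` (assumed value) is the allowed edit distance
# relative to the longer of the two lines being compared.
def strip_repeated_lines(reply, original, max_ratio = 0.2)
  known = original.lines.map { |l| normalize(l) }.reject(&:empty?)
  kept = reply.lines.reject do |line|
    candidate = normalize(line)
    next false if candidate.empty?
    known.any? do |k|
      levenshtein(candidate, k) <= max_ratio * [candidate.length, k.length].max
    end
  end
  kept.join.strip
end
```

Comparing every line against every earlier line is quadratic, so this is only reasonable for short threads.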

No matter what, if you're interested in the threading process, check out this great PDF on reassembling email threads.

When you don't have the thread:

If you are stuck with only one message from the thread, you're going to have to try to guess what the quote is. In that case, here are the different quotation methods I have seen:

A line (as seen in Outlook)
Angle Brackets
"---Original Message---"
"On such-and-such day, so-and-so wrote:"

Remove the text from there down and you're done. The downside to any of these is that they all assume that the sender put their reply on top of the quoted text and did not interleave it (as was the old style on the internet). If that happens, good luck. I hope this helps some of you out there!
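A minimal Ruby sketch of the "remove the text from there down" step, under the top-posting assumption described above. The method names and the exact patterns are guesses at the four styles in the list (in particular, the Outlook line is assumed to be a run of five or more underscores), so they will need adjusting against real samples.

```ruby
# Assumed marker patterns, one per quotation style listed above.
def quote_markers
  [
    /^_{5,}\s*$/,                        # horizontal line (assumed: 5+ underscores)
    /^>/,                                # angle-bracket quoting
    /^-+\s*Original Message\s*-+\s*$/i,  # "---Original Message---"
    /^On .+ wrote:\s*$/i                 # "On such-and-such day, so-and-so wrote:"
  ]
end

# Cuts the text at the earliest marker and returns what comes before it.
# If no marker is found, the whole text is returned.
def strip_quoted_tail(text)
  cut = quote_markers.map { |re| text.index(re) }.compact.min
  (cut ? text[0, cut] : text).strip
end
```

Taking the earliest match across all patterns, rather than the first pattern that matches, avoids leaving an attribution line behind when several markers appear in one message.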

First of all, this is a tricky task.

You should collect typical responses from different e-mail clients and prepare correct regular expressions (or whatever) to parse them. I've collected responses from Outlook, Thunderbird, Gmail, Apple Mail, and mail.ru.

I am using regular expressions to parse responses in the following manner: if an expression did not match, I try to use the next one.

new Regex("From:\\s*" + Regex.Escape(_mail), RegexOptions.IgnoreCase);
new Regex("<" + Regex.Escape(_mail) + ">", RegexOptions.IgnoreCase);
new Regex(Regex.Escape(_mail) + "\\s+wrote:", RegexOptions.IgnoreCase);
new Regex("\\n.*On.*(\\r\\n)?wrote:\\r\\n", RegexOptions.IgnoreCase | RegexOptions.Multiline);
new Regex("-+original\\s+message-+\\s*$", RegexOptions.IgnoreCase);
new Regex("from:\\s*$", RegexOptions.IgnoreCase);

To remove the quotation at the end:

new Regex("^>.*$", RegexOptions.IgnoreCase | RegexOptions.Multiline);
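For readers working in Ruby, a rough equivalent of that last expression might look like the following. The method name is made up for this example, and it simply drops every line that begins with ">", wherever it appears.

```ruby
# Removes every line that starts with ">" and trims the result.
def remove_quoted_lines(text)
  text.lines.reject { |line| line.start_with?(">") }.join.strip
end
```

Because it filters line by line, this also copes with replies interleaved between quoted lines, although the attribution line ("... wrote:") is left in place.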

Here is my small collection of test responses (samples divided by --- ):

From: test@test.com [mailto:test@test.com]
Sent: Tuesday, January 13, 2009 1:27 PM
----
2008/12/26 <test@test.com>
>  text
----
test@test.com wrote:
> text
----
      test@test.com wrote:
         text
text
----
2009/1/13 <test@test.com>
>  text
----
 test@test.com wrote:
         text
 text
----
2009/1/13 <test@test.com>
> text
> text
----
2009/1/13 <test@test.com>
> text
> text
----
test@test.com wrote:
> text
> text
<response here>
----
--- On Fri, 23/1/09, test@test.com <test@test.com> wrote:
> text
> text

Thank you, Goleg, for the regexes! Really helped. This isn't C#, but for the googlers out there, here's my Ruby parsing script:

def extract_reply(text, address)
  escaped = Regexp.escape(address)
  # Regexp literals are used so that \s means "any whitespace";
  # inside a double-quoted Ruby string, "\s" is just a literal space.
  regex_arr = [
    /From:\s*#{escaped}/i,
    /<#{escaped}>/i,
    /#{escaped}\s+wrote:/i,
    /^.*On.*(\n)?wrote:$/i,
    /-+original\s+message-+\s*$/i,
    /from:\s*$/i
  ]
  text_length = text.length
  # Find the position of the matching regex closest to the top of the message
  index = regex_arr.inject(text_length) do |min, regex|
    [(text.index(regex) || text_length), min].min
  end
  text[0, index].strip
end

It's worked pretty well so far.
