Parsing Bodystructure with RegEx

Aug 20, 2010 at 6:44 AM
Edited Aug 20, 2010 at 6:51 AM
I've been trying to find a regex to parse at least the first part of the BODYSTRUCTURE response and was wondering if anyone would mind helping me. So far, what I have seems to work on messages from Gmail and Dovecot. Obviously there are some blanks where I have the capture group named "unknown". Any help is appreciated.

("(?<type>[^"]*)"|(?<type>NIL))\s
"(?<lang>[^"]*)"\s\(
("(?<unknown>[^"]*)"\s"(?<charset>[^"]*)"\)|(?<charset>NIL))\s
(?<unknown>NIL)\s
(?<unknown>NIL)\s
("(?<encoding>[^"]*)"|(?<encoding>NIL))\s
(?<size>\d+|NIL)\s
(?<lines>\d+|NIL)\s
(?<unknown>NIL)\s
(\(("(?<disposition>[^"]*)"|(?<disposition>NIL))\s
("(?<offset>[^"]*)"|(?<offset>NIL))\)
|(?<disposition>NIL))\s
(?<unknown>NIL)\s
("(?<filename>[^"]*)"|(?<filename>NIL))
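
For comparison, here is a rough Python sketch of the same idea with the slots named after the RFC 3501 grammar for a single text part. As I read that grammar, the order is type, subtype, parameter list, id, description, encoding, size, lines, then the optional extension fields MD5, disposition, language and location. That would make the "unknown" captures above the id, description, MD5 and language fields, and "lang", "offset" and "filename" the subtype, the disposition parameters and the location. The names NSTRING, PLIST, TEXT_PART and parse_text_part are made up for this example. It assumes quoted strings contain no escaped quotes or parentheses, that no IMAP literals ({n} syntax) appear, and that lists nest at most one level deep.

```python
import re

# A quoted string or NIL. Assumes no escaped quotes and no IMAP literals.
NSTRING = r'(?:"[^"]*"|NIL)'
# A parenthesized list with at most one level of nesting, or NIL.
PLIST = r'(?:\((?:[^()]|\([^()]*\))*\)|NIL)'

# One text/* part. Whitespace in the pattern is ignored (re.VERBOSE).
TEXT_PART = re.compile(rf'''
    \(
    (?P<type>{NSTRING})        \s
    (?P<subtype>{NSTRING})     \s
    (?P<params>{PLIST})        \s
    (?P<id>{NSTRING})          \s
    (?P<description>{NSTRING}) \s
    (?P<encoding>{NSTRING})    \s
    (?P<size>\d+)              \s
    (?P<lines>\d+)
    (?: \s (?P<md5>{NSTRING})
    (?: \s (?P<disposition>{PLIST})
    (?: \s (?P<language>{NSTRING}|{PLIST})
    (?: \s (?P<location>{NSTRING}) )? )? )? )?
    \)
''', re.VERBOSE)


def parse_text_part(bodystructure):
    """Return the fields of the first text part found, or None."""
    match = TEXT_PART.search(bodystructure)
    if match is None:
        return None
    fields = {}
    for name, value in match.groupdict().items():
        if value is None or value == "NIL":
            fields[name] = None          # absent or NIL
        else:
            fields[name] = value.strip('"')  # drop surrounding quotes
    fields["size"] = int(fields["size"])
    fields["lines"] = int(fields["lines"])
    return fields


part = parse_text_part(
    '("text" "plain" ("charset" "us-ascii") NIL NIL "7bit" 7122 100 '
    'NIL ("inline" NIL) NIL "292.txt")'
)
print(part)
```

The parameter and disposition lists are returned as raw text, so pulling out the charset would still need a second step. Parts that are not text/* have no lines field and would need their own pattern.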
The examples I've been testing against are listed below:
(("text" "html" ("charset" "us-ascii") NIL NIL "7bit" 7541 123 NIL NIL NIL NIL) "alternative" ("boundary" "----=_Part_892054_1709053796.1144914819345") NIL NIL NIL) "related" ("boundary" "----=_Part_892053_1703335780.1144914819345") NIL NIL NIL)
("text" "plain" ("charset" "windows-1252") NIL NIL "quoted-printable" 1253 59 NIL NIL NIL NIL)("text" "html" ("charset" "windows-1252") NIL NIL "quoted-printable" 11965 208 NIL NIL NIL NIL) "alternative" ("boundary" "--NextPart_048F8BC8A2197DE2036A") NIL NIL NIL)
("text" "plain" ("charset" "us-ascii") NIL NIL "7bit" 7122 100 NIL ("inline" NIL) NIL "292.txt")("text" "html" ("charset" "us-ascii") NIL NIL "7bit" 12975 188 NIL ("inline" NIL) NIL "292.html") "alternative" ("boundary" "----=_NextPart_001_AEA6_74B0DC51.19495CFF") NIL NIL NIL)
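
Since these samples are nested multipart structures, a single regex can only ever cover one level. An alternative, which is not what the pattern above does, is to use a regex only as a tokenizer and build the nesting with a stack. The sketch below does that in Python. TOKEN and parse_parenthesized are hypothetical names, IMAP literals ({n} syntax) are not handled, and the sample string has a leading parenthesis added on the assumption that the pasted examples lost theirs.

```python
import re

# Tokens: "(", ")", a quoted string (backslash escapes allowed), or a bare atom.
TOKEN = re.compile(r'\(|\)|"((?:[^"\\]|\\.)*)"|([^\s()"]+)')


def parse_parenthesized(text):
    """Parse IMAP parenthesized data into nested Python lists.

    Returns the list of top-level items. NIL becomes None and
    all-digit atoms become int. IMAP literals are not supported.
    """
    stack = [[]]
    for match in TOKEN.finditer(text):
        token = match.group(0)
        if token == "(":
            stack.append([])                 # open a new nested list
        elif token == ")":
            if len(stack) == 1:
                raise ValueError("unbalanced ')'")
            finished = stack.pop()
            stack[-1].append(finished)       # attach to the parent list
        elif match.group(1) is not None:
            # Quoted string: undo backslash escapes.
            stack[-1].append(re.sub(r'\\(.)', r'\1', match.group(1)))
        else:
            atom = match.group(2)
            if atom == "NIL":
                stack[-1].append(None)
            elif atom.isdigit():
                stack[-1].append(int(atom))
            else:
                stack[-1].append(atom)
    if len(stack) != 1:
        raise ValueError("unbalanced '('")
    return stack[0]


sample = (
    '(("text" "plain" ("charset" "windows-1252") NIL NIL '
    '"quoted-printable" 1253 59 NIL NIL NIL NIL)'
    '("text" "html" ("charset" "windows-1252") NIL NIL '
    '"quoted-printable" 11965 208 NIL NIL NIL NIL) '
    '"alternative" ("boundary" "--NextPart_048F8BC8A2197DE2036A") '
    'NIL NIL NIL)'
)
body = parse_parenthesized(sample)[0]
print(body[0])   # fields of the text/plain part
print(body[2])   # multipart subtype
```

With the nesting recovered, each part is just a list, so the fields can be read by position instead of by capture group.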

I have found the following tool to be extremely useful while developing and testing these regular expressions: