Regular Expressions And Log Parsing
My company needs a small utility to parse log files generated by some of the speech engines we are using in our products. They generate old-style ASCII log files that are easy to read but cumbersome to parse.
I started playing around with regular expressions and found a treasure! Boy, was I missing out on something really powerful or what? Using Regex I was able to create a small LogToXML converter class in just under 2 hours (with almost no prior experience with regular expressions!). It is amazing.
I'm currently trying to write this whole experience up as an article for CodeProject. Hopefully it will be out in the next couple of days. Using Expresso really teaches you about the possibilities just by poking around on all the option tabs. The thing I love most about this is that I keep finding out cooler stuff about Regex. For instance, I can create named groups to use for my parsing. As an example, I'll show you a line from my Babylon-Pro translator and thesaurus log file:
It contains this line:
26/4/03 23:43:52 4.0.2.3:Roy:3004984200 - warning 998: Error opening file 'C:\Program Files\Babylon\ocr_data' "rb" - (Path not found)
Using this (kinda scary) regular expression, I can parse this into named fields and go over them using the Match object provided by the .NET Framework:
(?<Date>(?<Day>\d{1,2})/(?<Month>\d{1,2})/(?<Year>(?:\d{4}|\d{2})))\s(?<Time>(?<Hour>\d{1,2}):(?<Minute>\d{1,2}):(?<Seconds>\d{1,2}))\s(?<Version>\d\.\d\.\d\.\d):(?<User>\w*):(?<SomeNumber>\d*)\s-\s(?<StatusLevel>\w*)\s(?<MessageID>\d*):\s(?<Message>.*)
What I get in Expresso is a hierarchical tree view of the results for this line:
- 26/4/03 23:43:52 4.0.2.3:Roy:3004984200 - warning 998: Error opening file 'C:\Program Files\Babylon\ocr_data' "rb" - (Path not found)
-Date[26/4/03]
  -Day[26]
  -Month[4]
  -Year[03]
-Time[23:43:52]
  -Hour[23]
  -Minute[43]
  -Seconds[52]
-Version[4.0.2.3]
-User[Roy]
-SomeNumber[3004984200]
-StatusLevel[warning]
-MessageID[998]
-Message[Error opening file 'C:\Program Files\Babylon\ocr_data' "rb" - (Path not found)]
Isn't that beautiful?
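For anyone who wants to poke at the named-group idea outside Expresso, here is a minimal sketch in Python. It is not the .NET code from my converter. Python spells named groups `(?P<Name>...)` instead of `(?<Name>...)`, and the names `LOG_LINE` and `SAMPLE` are just made up for this example. The day is read first, matching the tree view above, and the dots in the version number are escaped so they only match literal dots.

```python
import re

# Same fields as the expression above, split over several lines
# with re.VERBOSE so it is a little less scary to read.
LOG_LINE = re.compile(r"""
    (?P<Date>(?P<Day>\d{1,2})/(?P<Month>\d{1,2})/(?P<Year>\d{4}|\d{2}))\s
    (?P<Time>(?P<Hour>\d{1,2}):(?P<Minute>\d{1,2}):(?P<Seconds>\d{1,2}))\s
    (?P<Version>\d\.\d\.\d\.\d):(?P<User>\w*):(?P<SomeNumber>\d*)\s-\s
    (?P<StatusLevel>\w*)\s(?P<MessageID>\d*):\s
    (?P<Message>.*)
""", re.VERBOSE)

# The sample line from the Babylon log, kept as a raw string
# so the backslashes in the path are left alone.
SAMPLE = r"""26/4/03 23:43:52 4.0.2.3:Roy:3004984200 - warning 998: Error opening file 'C:\Program Files\Babylon\ocr_data' "rb" - (Path not found)"""

match = LOG_LINE.match(SAMPLE)
fields = match.groupdict()  # maps every group name to the text it captured

for name, value in fields.items():
    print(f"{name}[{value}]")
```

The `groupdict()` call plays roughly the role of walking the named groups on the .NET Match object: one dictionary entry per group name.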
The possibilities after transforming this to XML are endless:
- Use XPath to query the XML - allowing your user to execute queries on your log file
- Use XSLT to create an HTML report of the Log file
- Load this into a DataSet, make it write its schema, and voila - you have a typed DataSet of the log file. You can execute queries on it (although I haven't figured out a way yet to execute global queries on a DataSet with several tables. The only select I can do is with the DataTable.Select function, which only lets me pass in a filter expression. Is there no other way?)
- Many, many more...
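To make the XPath idea from the list a bit more concrete, here is a small Python sketch using the standard library's ElementTree, which understands a limited subset of XPath. The XML layout (`Log`, `Entry`, the attribute names) is an assumption for illustration only, not the format my converter actually writes, and the second entry is invented so the query has something to filter out.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML layout for a converted log file.
# The first entry uses the values from the sample line; the second is made up.
SAMPLE_XML = r"""<Log>
  <Entry StatusLevel="warning" MessageID="998">
    <User>Roy</User>
    <Message>Error opening file 'C:\Program Files\Babylon\ocr_data' "rb" - (Path not found)</Message>
  </Entry>
  <Entry StatusLevel="info" MessageID="1">
    <User>Roy</User>
    <Message>Made-up entry for illustration</Message>
  </Entry>
</Log>"""

root = ET.fromstring(SAMPLE_XML)

# XPath predicate: keep only the entries whose StatusLevel attribute is "warning".
warnings = root.findall(".//Entry[@StatusLevel='warning']")

for entry in warnings:
    print(entry.get("MessageID"), entry.findtext("Message"))
```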
Did I mention that to create the XML file while parsing I use the XmlTextWriter class? Yet another cool little class! :)
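The write-as-you-parse approach can be sketched in Python too, with `XMLGenerator` from the standard library standing in for the .NET writer class. This is only a rough analogue under my own assumptions: the function name `write_log_xml` and the element names `Log` and `Entry` are hypothetical, and the field values are copied from the sample line rather than parsed here.

```python
import io
from xml.sax.saxutils import XMLGenerator

def write_log_xml(entries, out):
    """Stream each parsed entry (a dict of field name -> text) to `out` as XML."""
    writer = XMLGenerator(out, encoding="utf-8")
    writer.startDocument()
    writer.startElement("Log", {})          # hypothetical root element
    for fields in entries:
        writer.startElement("Entry", {})    # one element per log line
        for name, value in fields.items():
            writer.startElement(name, {})
            writer.characters(value)        # escapes &, < and > for us
            writer.endElement(name)
        writer.endElement("Entry")
    writer.endElement("Log")
    writer.endDocument()

# Field values taken from the sample log line.
entry = {
    "Date": "26/4/03",
    "Time": "23:43:52",
    "User": "Roy",
    "StatusLevel": "warning",
    "MessageID": "998",
    "Message": r"""Error opening file 'C:\Program Files\Babylon\ocr_data' "rb" - (Path not found)""",
}

buffer = io.StringIO()
write_log_xml([entry], buffer)
xml_text = buffer.getvalue()
print(xml_text)
```

Because the writer emits elements as they come, nothing has to be held in memory beyond the current line, which is handy for big log files.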
P.S.:
A nice collection of regular expression links lies here.