Parsing Microsoft XML Event Logs in Pig

Reading XML in Hadoop can be a challenge.  Hadoop, by default, will read a "line" in the file and pass that on as a record.  XML documents are always contained on a single line.  Additionally, the whole document (file) will most likely contain several records, based on the element structure.

In the case of a Microsoft XML Event Log, it contains an array of elements, "ROW", for each record.  The first step is to isolate the record and get that processed as a single event record.  That can be done with a UDF in the "piggybank.jar" contributor jar, located in the /usr/lib/bin directory of an HDP installation.

-- Register the contributor jar to make the UDF available.
register '/usr/lib/pig/piggybank.jar';

-- Using the custom Loader, identify and extract a chararray that represents all the content between (and including) the "ROW" xml element.
A = load '$source' using'ROW') 
    as (row:chararray);

The variable 'A' now represents an array of records.  Each record contains the "full" text of the XML element.  Now all we need to do is identify the elements, extract and/or query them as desired.

There are two methods that I tried for this.  The first is using REGEX patterns to identify the group elements, extracting those to a record and continuing on.  My attempt at that looked like this:

B = foreach A GENERATE FLATTEN(REGEX_EXTRACT_ALL(row,'<ROW>\\n\\s*<EventLog>\\n\\s*(.*)\\n\\s*</EventLog>\\n\\s*<RecordNumber>\\n\\s*(.*)\\n\\s*</RecordNumber>\\n\\s*<TimeGenerated>\\n\\s*(.*)\\n\\s*</TimeGenerated>\\n\\s*<EventID>\\n\\s*(.*)\\n\\s*</EventID>\\n\\s*<ComputerName>\\n\\s*(.*)\\n\\s*</ComputerName>\\n\\s*<SID>\\n\\s*</SID>\\n\\s*<EventType>\\n\\s*(.*)\\n\\s*</EventType>\\n\\s*<SourceName>\\n\\s*(.*)\\n\\s*</SourceName>\\n\\s*<EventCategory>\\n\\s*(.*)\\n\\s*</EventCategory>\\n\\s*<EventTypeName>\\n\\s*(.*)\\n\\s*</EventTypeName>\\n\\s*<EventCategoryName>\\n\\s*(.*)\\n\\s*</EventCategoryName>\\n\\s*<Strings>\\n\\s*(.*)\\n\\s*</Strings>\\n\\s*<Message>\\n\\s*(.*)\\n\\s*</Message>\\n\\s*<Data>\\n\\s*</Data>\\n\\s*</ROW>')) 
    as (eventlog:chararray, recordnumber:chararray, timegenerated:chararray, eventid:chararray,
        computername:chararray, eventtype:chararray, sourcename:chararray, eventcategory:chararray,
        eventtypename:chararray, eventcategoryname:chararray, strings:chararray, message:chararray);

This basically states that for each element in 'A', which is identified by 'row', apply this REGEX and extract the grouping to the structure defined below.  For more basic stuff, this might work as a quick and dirty solution.  But for the MS XML Event Logs, it was "REALLY SLOW" and error prone.  And the results regularly dropped data elements.  I'm sure it was because the REGEX wasn't perfect and the fact that a bunch of "random" data could disrupt the REGEX's interpretation.

The second preferred approach is to write a custom UDF to replace the "FLATTEN(REGEX_EXTRACT_ALL..." built in UDF.  While this took a bit more time, it was well worth it.  Here's an example of how to load and store the data with the custom UDF.

register '/usr/lib/pig/piggybank.jar';
register '/home/vagrant/hwx.pso/com.hortonworks.pso.pig.udf-1.0-SNAPSHOT-shaded.jar';
A = load '$source' using'ROW') 
    as (row:chararray);
B = foreach A generate EXTRACT_XML_LOG_EVENT(row);
STORE B INTO '$target' USING PigStorage();

So, what's the impact?  Pretty significant.  Using the same dataset in both methods, the REGEX process took 2+ hours.  While most of the work completed in 20 minutes, a number of files from certain types of server play havoc with the REGEX, through a lot of errors, slow the process down AND most importantly, DROPPING fields and possibly records from the import.

With the custom UDF, which is based on an XML parser and not a REGEX, the process took less then 2 minutes.  And we were able to get a deeper, nested, element structure while preserving the entire record.  No dropped elements.

The source for the UDF can be found here on my github account.