Monday, 24 July 2017

Importing XML Data into MySQL Tables Using a Stored Procedure

In theory, it would seem that a stored procedure would be the perfect vehicle for importing XML data from a file. In practice, there are several mitigating factors to doing so.  For starters, the LOAD XML INFILE and LOAD DATA INFILE statements have not been permitted to run within a Stored Procedure since MySQL v5.0.7.  On top of that, there are serious limitations in how dynamic you can make your procs; too many to make them as flexible that you'd need them to be in order to handle a variety of input files.  Finally, right now there are no means for mapping XML data to table structures, so you really have to know what's coming in.  But it's not all bad news. If you're willing to target an XML file with a rigid and known structure per proc, you can still get a lot of use out of one call.

The Sample Input File and Target Table


For the purposes of today's article, we'll be sticking with a simple one-to-one XML file to table mapping.  However, the same principles can equally apply to one-to-many, many-to-one, and many-to-many mappings as well.

The XML file that we'll be working with contains a list of applicants, each containing three attributes.  The exact structure of the XML data isn’t so important; what really matters is that the proc knows what it is!  In our test file, each row to be inserted is contained within an <applicant /> tag.  Each column is represented by a named attribute.

<?xml version="1.0"?>
<applicant_list>
  <applicant id="1" fname="Rob" lname="Gravelle"/>
  <applicant id="2" fname="Al" lname="Bundy"/>
  <applicant id="3" fname="Little" lname="Richard"/>
</applicant_list>

The target table contains the same three fields, an int ID and two varchars; the ID is the primary key, so it does not allow nulls or duplicate values.  The last_name allows duplicates, but not nulls.  The first_name field is the most flexible and allows just about any type of data up to 100 characters in length.

CREATE  TABLE `applicants` (  
    `id`         INT          NOT NULL ,
    `last_name`  VARCHAR(100) NOT NULL ,
    `first_name` VARCHAR(100) NULL ,  
 PRIMARY KEY (`id`) );

Some people have employed User Defined Functions (UDFs) as a means of circumventing MySQL's imposed limitations.  UDFs are typically written in C++ and compiled as object files.  Once added to the server, UDFs can be called just like any native MySQL function.  While there's nothing wrong with UDFs, I can't advocate going that route for a couple of reasons.  First, writing C++ code is something usually better suited to an application developer than a Database Administrator.   If you really want to take the UDF approach it would probably be better to find an existing one that fits your needs.  In particular, I have heard good things about lib_mysqludf_sys. The second reason is that, quite frankly,  MySQL already provides the tools to get the job done.  Using the proc-friendly Load_File() and ExtractValue() functions, we can accomplish everything we need to without resorting to any third-party solutions.

The proc accepts two varchar input parameters: the path contains the location of the xml file and the node is the tag drilldown to the applicant data.  In our case these would equate to something like the following:

path: 'C:\\applicants1.xml', node: '/applicant_list/applicant'

Note that on Windows platforms, backslashes are used as the path separator, so these must be escaped using another backslash.  This is not necessary on UNIX platforms because paths are separated by forward slashes.

In the proc, Load_File() imports the xml data into a local variable, while ExtractValue() then queries the XML data using XPath.  For instance, in the code below, it retrieves a count of applicants from the xml_content variable:

    declare xml_content text;
     declare v_row_count int unsigned;
     set xml_content = load_file(path);
     set v_row_count = extractValue(xml_content, concat('count(', node, ')'));  

Iterating Over the XML Records


Chances are that you won't know how many records will be imported on each run.  Therefore the row count is essential to be able to iterate over every XML row.  In the while loop we have to increment the row index by one in order to retrieve the current XML row.  One specific row can be fetched using the square array brackets [n], where n is the v_row_index.  The @* after the last forward slash tells ExtractValue() to fetch the entire contents of the node.  We can then access each attribute in the same way as the rows, again using the square array brackets [n].   The Insert statement syntax that I used here does not include the column names.  As such, each parameter is inserted in order.

    declare v_row_index int unsigned default 0;  
    while v_row_index < v_row_count do              
        set v_row_index = v_row_index + 1;      
        set v_xpath_row = concat(node, '[', v_row_index, ']/@*');
       
        insert into applicants values (
            extractValue(xml_content, concat(v_xpath_row, '[1]')),
            extractValue(xml_content, concat(v_xpath_row, '[2]')),
            extractValue(xml_content, concat(v_xpath_row, '[3]'))
        );
    end while;

Here is the full proc code:

DELIMITER $$
CREATE DEFINER=`root`@`localhost` PROCEDURE `import_applicant_xml`(path varchar(255), node varchar(255))
BEGIN
    declare xml_content text;
    declare v_row_index int unsigned default 0;  
    declare v_row_count int unsigned;
    declare v_xpath_row varchar(255);

    set xml_content = load_file(path);

    -- calculate the number of row elements.  
    set v_row_count  = extractValue(xml_content, concat('count(', node, ')'));
   
    -- loop through all the row elements  
    while v_row_index < v_row_count do              
        set v_row_index = v_row_index + 1;      
        set v_xpath_row = concat(node, '[', v_row_index, ']/@*');
        insert into applicants values (
            extractValue(xml_content, concat(v_xpath_row, '[1]')),
            extractValue(xml_content, concat(v_xpath_row, '[2]')),
            extractValue(xml_content, concat(v_xpath_row, '[3]'))
        );
    end while;
END

Performing a Test Run


Calling the proc is just a matter of using the call command with the proc name and two input parameters.   (Remember to escape backslashes in the file path on Windows platforms.)

MySQL> call import_applicants_xml('C:\\applicants1.xml', '/applicant_list/applicant');
MySQL> select * from applicants;
+---+----------+-------------+
 |1  |Rob       |Gravelle     |
 +---+----------+-------------+
 |2  |Al        |Bundy        |
 +---+----------+-------------+
 |3  |Little    |Richard      |
 +---+----------+-------------+
 3 row(s) returned