Channel: ABAP Development

A Way of Reading Huge Excel Files


Recently, a colleague of mine had to write an ABAP application with an import function for Excel files. I recommended that he use the abap2xlsx package, which serves as a fully featured mediator between the ABAP world and the Excel world. In particular, abap2xlsx contains a reader class, ZCL_EXCEL_READER_2007, designed for importing an .xlsx file into the ABAP data structures on which the package is based.

 

To my astonishment, the reader dumped with a memory overflow error for a sample Excel file.

  • Yes - with about 28'000 rows and 50 columns it was a large file!
  • And, yes, it is crazy to have a business process with files of these sizes! No discussion about that.

 

On the other hand, the file had a size of 6.5 MB, which is moderate compared to the heap memory limit of 2 GB that had been reached according to the short dump.

 

I was curious how a memory consumption of far more than 100 times the size of the original data could be explained.

 

mem_alloc_failed.png

So I wrote a little test report for analyzing the situation with the memory inspector.

 

report zz_test_abap2xlsx_reader.

parameters: p_file type string memory id fil.

at selection-screen on value-request for p_file.
  perform get_file(zz_file_io_forms) using p_file.

start-of-selection.
  perform start.

* ---
form start.
  data: lo_excel_reader type ref to zif_excel_reader,
        lo_excel        type ref to zcl_excel.
  create object lo_excel_reader type zcl_excel_reader_2007.
  lo_excel = lo_excel_reader->load_file( p_file ).
  break-point.
endform.                    "start

 

Starting the report with one of those large Excel files and stepping through the code with the debugger, I arrive at the place where the worksheet is about to be parsed. Everything looks normal. Some 9 MB are allocated for the XSTRING containing the file. At this point, this is the largest object in memory. No problem. The second largest object is the "table of X", the result of the file read process (more or less a redundant copy of the same data, but negligible compared to the 2 GB we are talking about).

 

Peanuts!

 

worksheet1.png

At that point in the debugger, I am just about to step into the method get_ixml_from_zip_archive( ). This is a general-purpose method of the reader class which loads an XML document contained in the zip archive. After loading, the document is parsed with the methods of the IF_IXML family, and a reference to the resulting XML DOM object is passed back to the caller.
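For orientation, the iXML steps described here look roughly like the following. This is a generic sketch of DOM parsing with the IF_IXML family, not the actual code of get_ixml_from_zip_archive( ); all variable names are made up for illustration, and lv_xml is assumed to already contain the decompressed XML part.

```abap
data: lo_ixml     type ref to if_ixml,
      lo_factory  type ref to if_ixml_stream_factory,
      lo_istream  type ref to if_ixml_istream,
      lo_document type ref to if_ixml_document,
      lo_parser   type ref to if_ixml_parser,
      lv_xml      type xstring.  " decompressed XML part (assumed to be filled)

lo_ixml     = cl_ixml=>create( ).
lo_factory  = lo_ixml->create_stream_factory( ).
lo_istream  = lo_factory->create_istream_xstring( lv_xml ).
lo_document = lo_ixml->create_document( ).
lo_parser   = lo_ixml->create_parser( stream_factory = lo_factory
                                      istream        = lo_istream
                                      document       = lo_document ).

* parse( ) builds the complete DOM tree for lo_document in memory
if lo_parser->parse( ) ne 0.
  " Parsing failed; details are available from the parser's error list
  return.
endif.
```

The point to note is that the whole document tree exists in memory once parse( ) returns, whether or not the caller needs all of it.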

 

Here, I am immediately before the call of the parse() method of the if_ixml_parser object. The memory consumption still looks more than modest:

worksheet2.png

Now, with a single step, I trigger the parse() method. Processing takes a few minutes until control returns to the debugger. From a memory perspective, the result is amazing:

 

workshhet3.png

With this single method call (which is implemented "by kernel module" and thus cannot be analyzed further with ABAP means), the overall memory consumption rises to more than 2 GB!

 

Even more amazing: The 2 GB cannot be explained from the memory inspector's individual object view:

 

worksheet4.png

 

Here, the top object only requires some 100 MB. This is the decompressed XML worksheet, as produced by the zip loader. Of the rest of the 2 GB, we see absolutely nothing.

 

So we have to face the fact that, in this case, the iXML DOM parser requires about 20 times as much memory as the raw document to be parsed.

 

Since nothing could be done at this point (after all, iXML is a kernel component), I looked for alternatives. My idea was: If we omit the DOM parsing and are instead satisfied with picking up the element contents from the stream while reading it, the memory footprint could be reduced considerably.

 

I could prove that this really is the case by implementing an alternative parser class, ZCL_XLSX_PARSER, on the basis of the sXML family, in particular of the IF_SXML_READER implementation provided by this family. If you are happy with simply reading the raw data from the file (as we were in our particular case), leaving out further functionality like macros, cell styles etc., the class ZCL_XLSX_PARSER will be fully sufficient.
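As a rough sketch of the setup such an approach needs (this is not the code of ZCL_XLSX_PARSER): the worksheet XML is unpacked from the zip archive with CL_ABAP_ZIP and handed to an sXML string reader. The worksheet path xl/worksheets/sheet1.xml and all variable names are assumptions for illustration; a real reader would look up the path in the workbook's relationship files.

```abap
data: lo_zip    type ref to cl_abap_zip,
      lv_xlsx   type xstring,   " the uploaded .xlsx file (assumed to be filled)
      lv_sheet  type xstring,   " decompressed worksheet XML
      lo_reader type ref to if_sxml_reader.

create object lo_zip.
lo_zip->load(
  exporting  zip             = lv_xlsx
  exceptions zip_parse_error = 1
             others          = 2 ).
if sy-subrc ne 0.
  return.  " not a readable zip archive
endif.

* Assumed path of the first worksheet inside the archive
lo_zip->get(
  exporting  name                    = 'xl/worksheets/sheet1.xml'
  importing  content                 = lv_sheet
  exceptions zip_index_error         = 1
             zip_decompression_error = 2
             others                  = 3 ).
if sy-subrc ne 0.
  return.  " worksheet not found or not decompressible
endif.

* The reader only scans the byte stream; no DOM is built
lo_reader = cl_sxml_string_reader=>create( lv_sheet ).
```

From here on, the caller drives the reader node by node with next_node( ) and decides which element contents to keep.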

 

At the same place as above, immediately before parsing, the memory in the debugger looks similar: On top, we see the decompressed XML file, allocating about 100 MB as above:

 

worksheet5.png

 

Now I execute the parse_cell_data( ) method.

 

worksheet6.png

For the same file that I tested above, the memory consumption increased only moderately through the parsing: It was 117,707,688 bytes before the call of parse_cell_data( ) and is 141,485,696 bytes afterwards. The difference, about 23 MB, is needed for storing the result (the deep structure ES_EXCEL above, containing internal tables with the cell data). All the figures can be explained by looking at the consumption of the ABAP data alone. We are still far away from the 2 GB limit.


Also, the execution speed is better. With the sXML reader, the full reading process needs about 52 seconds. With the iXML reader, the call of if_ixml_parser->parse() alone took 340 seconds for the same worksheet.

 

The main difference between IF_IXML_PARSER and IF_SXML_READER is that the latter is only a scanner. It doesn't generate an internal image of the XML file, but only detects the beginning, the end, the attributes and the contents of XML elements while reading the stream. That's it. With IF_SXML_READER, I have to instrument the reader myself if I want to extract data during the reading process. Here is the implementation of the parse_cell_data( ) method, just to give you the idea. If you want to study the whole class, have a look at it here.

 

method parse_cell_data.
  data: ls_cell   type ty_excel_cell,
        lv_number type decfloat34,
        lv_time   type t,
        lv_date   type d.

* Main parse loop
  while io_reader->node_type ne if_sxml_node=>co_nt_final.
    io_reader->next_node( ).
    case io_reader->name.
      when 'row'.  " A row
        check io_reader->node_type = if_sxml_node=>co_nt_element_open.
        add 1 to ls_cell-row.
        clear: ls_cell-col, ls_cell-ref, ls_cell-type.
      when 'c'.  " A column, child of a row
        case io_reader->node_type.
          when if_sxml_node=>co_nt_element_open.
            add 1 to ls_cell-col.
            ls_cell-type = get_cell_type( io_reader ).
          when if_sxml_node=>co_nt_element_close.
            insert ls_cell into table cs_excel-cells.
        endcase.
      when 'v'.  " A value - may be a reference to the string table
        if io_reader->node_type eq if_sxml_node=>co_nt_element_close.
          case ls_cell-type.
            when gc_cell_datatype-string.
              " Reference index into the string table
              ls_cell-ref = io_reader->value + 1. " is 0-based in xlsx
            when gc_cell_datatype-number.
              lv_number = io_reader->value.
              append lv_number to cs_excel-numbers.
              ls_cell-ref = sy-tabix.
            when gc_cell_datatype-date.
* Convert to the ABAP-conformal internal date representation
              lv_number = io_reader->value + 693595.
              append lv_number to cs_excel-numbers.
              ls_cell-ref = sy-tabix.
            when gc_cell_datatype-time.
* Convert to the ABAP-conformal internal time representation
* Work with milliseconds, however, for further improvements
              lv_number = round( val = 86400 * io_reader->value dec = 3 ).
              append lv_number to cs_excel-numbers.
              ls_cell-ref = sy-tabix.
          endcase.
        endif.
      when 'is'.  " Inline Strings: Just append them to the stringtab
        if io_reader->node_type eq if_sxml_node=>co_nt_element_close.
          append io_reader->value to cs_excel-strings.
          ls_cell-ref  = sy-tabix.
          ls_cell-type = gc_cell_datatype-string.
        endif.
    endcase.
  endwhile.
endmethod.
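The helper get_cell_type( ) called above is not shown here. To illustrate how attributes can be read with the token-based reader, the following hypothetical form routine fetches the raw value of the "t" attribute of a cell element. The routine name and its parameters are made up for this sketch, and it is not the author's implementation: telling dates and times apart from plain numbers, for instance, additionally needs the cell's style information.

```abap
* Hypothetical helper: returns the raw "t" attribute of the current element.
* Expects the reader to be positioned on an opening <c> element.
form get_t_attribute using    io_reader type ref to if_sxml_reader
                     changing cv_type   type string.
  clear cv_type.
  do.
    io_reader->next_attribute( ).
    if io_reader->node_type ne if_sxml_node=>co_nt_attribute.
      exit.  " no more attributes on this element
    endif.
    if io_reader->name eq 't'.
      cv_type = io_reader->value.  " e.g. 's' for a shared string reference
    endif.
  enddo.
endform.
```

If the attribute is missing, cv_type stays empty, which in the xlsx format means that the cell holds a number.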

 

By the way: The code has been written with the support of unit tests. The above method, like the rest of the core methods, is covered to 100%. Only a few boundary cases are not handled in the unit tests, for example the case of a corrupt zip file (which cannot be unpacked properly). See the ABAP unit test section in the code repository for details.

