This bug has been squashed in:
* Power BI Desktop Version: 2.69.5467.1751 64-bit (May 2019)
* Excel for Office 365 MSO (16.0.11601.20174)
The focus of this post is an optimization tip for Excel sheets that are used in Power BI queries. The tip, which is to avoid using formatting beyond the range you actually need, is a result of a potential bug in the way Power BI processes empty but filled cells. I discovered the bug whilst trying to solve a problem as to why a 512 row by 19 column sheet would never load in Power BI. There was something about the sheet that caused Power BI to consume every available resource when loading the sheet. It took a while to discover the root cause. I tried several things including only accessing the specific columns I needed in the query, trying a different machine and saving the file locally. Nothing would stop Power BI consuming every available resource when loading the sheet. Then I noticed that several columns and rows had a background fill which had been copied down the full height and across the width of the sheet. Removing the fill allowed the sheet to load in seconds. I suspect that what I have discovered is a bug but it also serves to illustrate a way by which you can optimize sheets for processing.
I’ll demonstrate the phenomenon in the reminder of this post.
In order to illustrate my point I’ve created a simple spreadsheet. In Cell G2 I have added a drop down selection using Data Validation.
The source for the drop down is in cells K2 to K5. The first item in the list is blank so ‘ignore blanks’ is not selected in Data Validation.
For the demonstration I am using a typical simple Power BI query to access a sheet saved locally. The phenomenon occurs regardless where the file is saved.
Without background fill
First up I use autofill to fill down to the end of Column G. I’ve chosen to fill to the end of the sheet as this mimics common user behavior and the situation in the problematic sheet. The fill starts at cell G2 and the cell contains the drop down but no background fill.
The Fill Down option fills the column down to row 1048576 which is the last row in the sheet. Loading the sheet into Power BI yields 11 columns with headings and 4 rows of data.
You might be wondering why the region is this size. If you review the sheet again you’ll notice that there is data in cells A1 and K5. As I am using a simple query it imports a rectangular region which is defined by cells that contain data i.e. top left corner, the lowest row and furthest column to the right. It does not return data to row 1048576 as whilst it contains data validation the cells appear empty. The data refreshes in a blink of an eye.
With background fill
Now repeating the exercise but this time with a background fill to the cells yields a completely different result:
Power BI reports that the cells below row 4 contain null values but yet the region extends to row 999+. The data refresh takes much longer:
To my mind this is a bug. The Excel file is only 9kb in size and, whilst the cells contain a validation formula and have a defined fill, Excel considers them as empty. Clearly this can have a significant impact on the ability of Power BI to import and refresh data as it changes the definition of the area to be imported to 11 columns with headings and 1,048,575 rows of data i.e. 11,534,336 items to be processed!
Returning to the problem I was trying to solve. The reason why the original 512 row by 19 column sheet never loaded in Power BI was that several columns had been filled with colour to the bottom of the sheet (row 1,048,576) and 2 rows had been filled to the extreme right (column XFD) (I suspect by selecting the row and then filling it which is another common user action). This defined a region of 16,384 columns and 1,048,576 rows i.e. 17,179,869,184 items to be processed. That yielded in excess of 2GB of data to process with the result that the refresh failed to conclude.
I know the bug can be mitigated by defining the region as a list or with additional steps in the query but that should not be necessary if the cells are null. I suspect that it also serves to highlight that Power BI considers any formatting when processing sheets and so query performance can be optimized by removing formatting.