Merging Tables with Different Columns and Layouts in PowerPivot
Posted by Alan on March 24, 2011
So I have not really worked in-depth with PowerPivot up to this point. I have played around with it but never really done anything substantial. However, I recently embarked on a pilot project for a client where PowerPivot was potentially the right solution for them. This client’s data resides in multiple systems, which is fairly normal now-a-days. The kicker for them was that most of their systems are hosted by outside providers and the options to get data out of those systems is limited. Some of the limitations are forced by the systems themselves while some are financial as the provider wants extra money for every change in the export format or for each export. So basically I am dealing with data that comes to the client in 8 or 9 spreadsheets, all with different columns and all with different column names. To top it off, not all the columns are consistent across the spreadsheets.
So my first challenge was how to get these spreadsheets into PowerPivot. Now PowerPivot can easily bring in tables from Excel, but it is designed to bring in different data that can be logically linked. My need was to consolidate all these data sources into a single table. After pulling my hair out on how to do this, my colleague Richard Mintz (he is blogging now, check him out) pointed me to a blog post that helped me figure out what I needed to do. Now this post assumed all the files were the same so I took his method and modified it somewhat as detailed below:
- I created a blank “master” Excel workbook that contained the fields I wanted to populate with “user friendly” names rather than start with one of the data sources first.
- I created my data connections to all my data source spreadsheets.
- I brought in my master Excel workbook first and then began modifying the import query for the rest of the imports.
It was in step #3 where I began to forge my own way. So after the “UNION ALL” statement, instead of a “Select * From C:\\file_path\datasource.xlsx.[TabName$]” statement I modified it to be a “Select As” type statement where I mapped my raw field names to my user friendly names. An example is below:
SELECT [DataSource1$].XX_ID as ‘UniqueID’,
[DataSource1$].TITLE as ‘Salutation’,
[DataSource1$].FNAME as ‘First Name’,
[DataSource1$].LNAME as ‘Last Name’,
” as ‘Title’,
” as ‘Company ID’,
[DataSource1$].COMPANY as ‘Company Name’,
[DataSource1$].ADDR1 as ‘Address 1’,
[DataSource1$].ADDR2 as ‘Address 2’,
[DataSource1$].CITY as ‘City’,
[DataSource1$].PROVINCE as ‘Province’,
[DataSource1$].POSTAL as ‘Postal Code’,
” as ‘Phone’,
[DataSource1$].LAST_CONTACT_DATE as ‘Imported Date’,
‘M/D/Y’ as ‘Imported Date Format’,
[DataSource1$].SALE_AMOUNT as ‘Sale Amount’,
‘DataSource1’ as ‘Source’
UNION ALL (repeat with next data source and modified as necessary)……
So a couple of things I learned by doing this whole process:
- It can be done but it is not the easiest or most intuitive thing to do.
- You have to map to every field in your master document. If a data source does not have that field, then you have fill it in with a blank or “n/a” or some other value. This is what happened to me for in the sample above Title, Company ID and Phone. Each of my queries had to be tweaked individually.
- If you are bringing in data from multiple sources I found it helpful to add a column to track where the data came from hence the field “Source” that I added.
- Hand coding the complete query in the PowerPivot window was tedious at best. I gave up and wrote it in Notepad and then copied it over.
- My different data sources all treated dates differently and this was very problematic. However that is the subject of my next post.
If anyone has an easier way to do this I would love to hear about it.