An Advanced Cmdlet to Merge Csv Files in PowerShell

From Svendsen Tech Powershell Wiki
Jump to: navigation, search

So you have two or more CSV files (or custom PowerShell objects) that share one or more columns/headers/properties, and you want to merge them based on this or these headers/properties. The most typical scenario will be using only one ID column, such as a username that's common in two or more files, and you want for instance everything from one CSV file joined with a separate CSV file that has some info that's missing in the other one, and you want the username (or some other ID field) to be the thing that ties it together.

As of right now, I have no idea how this performs on large files or data sets, but I suspect it'll be fairly slow and highly resource-intensive both with regards to memory and CPU usage. The testing you see below is basically all the testing I have done as of right now. Feel free to provide feedback, check the "about" page at the bottom for my mail address. I might put this on github later.

Since Import-Csv turns stuff into custom PowerShell objects, and this script works with those objects, I suppose the cmdlet could more accurately have been named "Merge-Object", but there it is.

Since writing this I also found out about Join-Object, written by the PowerShell team. It's a bit different from mine. It only handles two CSV files / custom PS objects at the time.

Contents






For simple concatenation, of two files with shared headers, all you need is basically:

@(Import-Csv file1.csv) + @(Import-Csv file2.csv) | Export-Csv joined.csv

There's a more thorough article on simple concatenation/appending, "for dummies", here, written by the Microsoft scripting guys.

This Merge-Csv cmdlet also does some consistency checking on the merge and reports discrepancies, as seen here:

Merge-Csv-Example-Small.png

Download

Merge-Csv.ps1.txt - right-click and download.

2014-07-16: Fixed so it handles spaces in titles.

2014-03-18: Uploaded new version with -Id instead of -SharedColumn, and regex escaping of the optional -Separator that's used with the -split operator. Updated screenshots and documentation.

2014-02-03: Worked around a bug (in a stupid way) that occurs when working with only two files and the first file has an ID that's not in the second file. It would report "none". Now it correctly reports "1".

Older versions: File:Merge-Csv.ps1.txt.


Parameters

Id Shared columns/headers that serve as ID fields. Typically something like a username.
Path Path to CSV files to merge. Cannot be used with -InputObject.
InputObject Custom PowerShell objects to merge. Cannot be used with -Path.
Delimiter CSV delimiter character.
Separator Default separator string used between multiple ID fields. Default is "#Merge-Csv-Separator#". Shouldn't ever have to be changed, but is here just in case you happen to have that string in your ID headers (this is not likely!).

Examples

The Basic Example

I have users.csv which contains usernames and the users' departments.

I have user-mail.csv which contains usernames and the users' email addresses.

PS D:\temp> ipcsv users.csv | ft -AutoSize

Username Department
-------- ----------
John     IT        
Jane     HR        

PS D:\temp> ipcsv user-mail.csv | ft -AutoSize

Username Email           
-------- -----           
John     john@example.com
Jane     jane@example.com

Now I want to merge them, so I use this fabulous cmdlet and simply run this:

PS D:\temp> . .\Merge-Csv.ps1
PS D:\temp> Merge-Csv -Path users.csv, user-mail.csv -Id Username | Export-Csv -enc UTF8 merged.csv

PS D:\temp> ipcsv .\merged.csv | ft -AutoSize

Username Department Email           
-------- ---------- -----           
John     IT         john@example.com
Jane     HR         jane@example.com

Voila.


A Practical Walk-through/example

Import CSV files as objects to do some work on them. The Merge-Csv cmdlet also has an -InputObject parameter that can be used instead of -Path.

PS C:\> $Csv1 = ipcsv csv-file1.csv -Delimiter ';' # semi-colon as delimiter in this one
PS C:\> $Csv2 = ipcsv csv-file2.csv # comma as delimiter

PS P:\> $csv1.Count
12898

PS P:\> $csv2.Count
804

You could look at the headers in Excel/notepad/whatever, but to list them using PowerShell, you can do it like this:

PS P:\> $csv2[0] | Get-Member -MemberType NoteProperty | select -exp Name
Company
Department
DN
FirstName
LastLogonTimestamp
LastName
SamAccountName
whenCreated

PS P:\> $csv1[0] | Get-Member -MemberType NoteProperty | select -exp Name
DEAKTIVERT
DEAKTIVERTDATO
ETTERNAVN
FORNAVN
KSTEDNAVN
PAGAAVD
PAGAAVDNR
TITTEL
USERID

So I notice one file uses "USERID", while the other uses "SamAccountName" for the ID field. Again, you could edit this in Excel or using a text editor, before the CSV import, but I demonstrate how to handle this anyway in case someone is working with custom objects from other sources than CSV files.

Here I "copy" the field "SamAccountName", and give it the name "USERID":

PS P:\> $csv2 = $csv2 | select *,@{'n'='USERID';e={$_.SamAccountName}}

# To anonymize I show only that the header is added, but just skip
# " | Get-Member ..." to verify that the field is copied properly
PS P:\> $csv2[0] | Get-Member -MemberType NoteProperty | select -exp Name
Company
Department
DN
FirstName
LastLogonTimestamp
LastName
SamAccountName
USERID
whenCreated

Now when I merge this, I will get thousands of warnings about inconsistencies, because a lot of the users exist in $Csv1, but not in $Csv2. The syntax I use to redirect the warning stream requires PowerShell version 3 or later, so this is strongly recommended for situations like this. I measure how long it takes. As we saw earlier, $Csv1 has almost 13,000 records, while $Csv2 has about 800. This took five minutes and 54 seconds. The PowerShell.exe process used about 608 MB RAM and the single CPU core that hosted the powershell.exe process was pegged for most of the processing time (but not all of it).

PS P:\> Measure-Command {
    (Merge-Csv -InputObject $Csv1, $Csv2 -Id USERID | export-csv
    -enc utf8 -notype merged.csv) 3> warnings.txt } | select TotalSeconds | ft -a

TotalSeconds
------------
 354,3457868

This generated 13366 warnings:

PS P:\> (gc .\warnings.txt).Count
13366

Be aware that to redirect the warnings, you need to enclose the Merge-Csv command and possibly following pipeline in parentheses before " 3> warnings.txt".

Now to filter the merged file down to only those that have all values, I check for a "random" value I know will always be populated in $Csv1 and one for $Csv2 (different).

Here's some anonymized data from warnings.txt:

PS P:\> (gc warnings.txt | select -first 10 -skip 20) -replace "'([^']+)'", "'anon'"
Identifying column entry 'anon' was not found in all CSV data objects/files. Found in object/file no.: 1
Identifying column entry 'anon' was not found in all CSV data objects/files. Found in object/file no.: 1
Identifying column entry 'anon' was not found in all CSV data objects/files. Found in object/file no.: 1
Identifying column entry 'anon' was not found in all CSV data objects/files. Found in object/file no.: 1
Identifying column entry 'anon' was not found in all CSV data objects/files. Found in object/file no.: 2
Identifying column entry 'anon' was not found in all CSV data objects/files. Found in object/file no.: 1
Identifying column entry 'anon' was not found in all CSV data objects/files. Found in object/file no.: 2
Identifying column entry 'anon' was not found in all CSV data objects/files. Found in object/file no.: 1
Identifying column entry 'anon' was not found in all CSV data objects/files. Found in object/file no.: 1
Identifying column entry 'anon' was not found in all CSV data objects/files. Found in object/file no.: 1

Then I filter on the "DEAKTIVERT" column from $Csv1 and the "SamAccountName" column from $Csv2 to create a report of "matches" (I should probably somehow automate(!) this and add it to the cmdlet:

PS P:\> ipcsv .\merged.csv | where { $_.SamAccountName -and $_.DEAKTIVERT } |
    Export-Csv -enc utf8 -notype merged-extract.csv

# And apparently we have 169 matches.
PS P:\> (ipcsv .\merged-extract.csv).Count
169

Example of Merging Three Files

Firstly, I dot-source the Merge-Csv.ps1 file to get the Merge-Csv function/cmdlet into the PowerShell session. You can also use "Import-Module .\Merge-Csv.ps1".

PS D:\temp> . .\Merge-Csv.ps1

Then I display what's currently in the CSV files. You can clearly see that the "Username" column is the only one that exists in all three files, and that they have unique columns/headers. You can also see a duplicate entry in the first file (user42) and that some of the usernames exist in one or more files, but not all of them. This is reported via warning messages when I merge.

Lastly, I perform the merge, we see the warnings and the end result, which you could then pipe to Export-Csv to create the merged CSV file. You can filter columns with Select-Object and filter out rows with empty values in specific fields with Where-Object, as demonstrated briefly below.

Merge-Csv-Example.png

Filtering the Merged Results

Firstly, I export to a file called merged.csv (to save screen space in these docs).

PS D:\temp> Merge-Csv -Path csvmerge1.csv, csvmerge2.csv, csvmerge3.csv -Id Username |
    Sort Username | Export-Csv merged.csv

WARNING: Duplicate identifying (shared column(s) ID) entry found in CSV data/file 1: user42
WARNING: Identifying column entry 'firstOnly' was not found in all CSV data objects/files. Found in object/file no.: 1
WARNING: Identifying column entry '2only' was not found in all CSV data objects/files. Found in object/file no.: 2
WARNING: Identifying column entry 'user2and3only' was not found in all CSV data objects/files. Found in object/file no.: 2, 3

Then I use the built-in cmdlet Where-Object (or just "Where") to filter on whether the "File3" and "File1B" columns are populated by something with a true value (not $null or an empty string in this case) and display only the results for which these fields have values.

PS D:\temp> ipcsv merged.csv | Where { $_.File3 -and $_.File1B } | ft -a

Username File1A File1B TestID File2A  File2B  TestX   File3 
-------- ------ ------ ------ ------  ------  -----   ----- 
user1    1A1    1B1    same   1A3     2A3     same    same  
user2    2A1    2B1    diff2  2A3     2B3     diff2_2 testC2
user3    3A1    3B1    same   3A3     3B3     same    same  
user42   42A1   42B1   same42 testA42 testB42 same42  same42

Example of Merging Three Files Based on Two ID Columns

Now I change the csvmerge2.csv file's header "TestXID" to "TestID", which is shared with a column title in csvmerge1.csv. Similarly, for csvmerge3.csv, I change the title/header "TestXXID" to "TestID".

Then I attempt the same merge as before to see what happens.

Merge-Csv-Example-Error.png


Mostly out of consideration for the blind or visually impaired, I repeat it in text here:

PS D:\temp> Merge-Csv -Path csvmerge1.csv, csvmerge2.csv, csvmerge3.csv -Id Username | sort username | ft -a
Merge-Csv : Some headers are shared. Are you just looking for '@(ipcsv csv1) + @(ipcsv.csv2) | Export-Csv ...'?
To remove duplicate (between the files to merge) headers from a CSV file, Import-Csv it, pass it to Select-Object, and 
omit the duplicate header(s)/column(s).
Exiting.
At line:1 char:1
+ Merge-Csv -Path csvmerge1.csv, csvmerge2.csv, csvmerge3.csv -Id Userna ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (:) [Write-Error], WriteErrorException
    + FullyQualifiedErrorId : Microsoft.PowerShell.Commands.WriteErrorException,Merge-Csv

Now, what I do is to take advantage of the fact that I allow for -Id to be an array of strings (multiple ID columns/headers), and I add "TestID", which is now shared among all the files. The CSV now looks like this:

PS D:\temp> Merge-Csv -Path csvmerge1.csv,csvmerge2.csv,csvmerge3.csv -Id Username, TestID | sort username | ft -a
WARNING: Duplicate identifying (shared column(s) ID) entry found in CSV data/file 1: user42, same42
WARNING: Identifying column entry 'user2, diff2' was not found in all CSV data objects/files. Found in object/file no.: 1
WARNING: Identifying column entry 'user2and3only, testID' was not found in all CSV data objects/files. Found in object/file no.: 3
WARNING: Identifying column entry 'user2, testC2' was not found in all CSV data objects/files. Found in object/file no.: 3
WARNING: Identifying column entry '2only, c' was not found in all CSV data objects/files. Found in object/file no.: 2
WARNING: Identifying column entry 'user2and3only, test2and3X' was not found in all CSV data objects/files. Found in object/file no.: 2
WARNING: Identifying column entry 'user2, diff2_2' was not found in all CSV data objects/files. Found in object/file no.: 2
WARNING: Identifying column entry 'firstOnly, foo' was not found in all CSV data objects/files. Found in object/file no.: 1

Username      TestID     File1A      File1B      File2A  File2B  Data3     
--------      ------     ------      ------      ------  ------  -----     
2only         c                                  a       b                 
firstOnly     foo        firstOnlyA1 firstOnlyB1                           
user1         same       1A1         1B1         1A3     2A3     usr1Data3 
user2         diff2      2A1         2B1                                   
user2         diff2_2                            2A3     2B3               
user2         testC2                                             usr2Data3 
user2and3only testID                                             2and3Data3
user2and3only test2and3X                         2and3A2 2and3B2           
user3         same       3A1         3B1         3A3     3B3     usr3Data3 
user42        same42     42A1        42B1        testA42 testB42 usr42Data3

Mostly out of consideration for the sighted among us, I also provide a screenshot of the same (I'm funny, aren't I).

Merge-Csv-Example-Multiple-Columns.png

Observe the warnings which are now about the combination of the two fields (just simple string comparisons using the values joined with the default -Separator string, which can be changed if needed).

Personal tools