Convert from Most Encodings to UTF-8 with PowerShell
You have an ANSI-encoded file, or a file in some other supported encoding, and want to convert it to UTF-8 (or another supported encoding). I ran into this when working with data exported from Excel, which was in Latin-1 (ISO-8859-1) by default, and I couldn't find a way to specify UTF-8 in Excel.
The problem occurred when I wanted to work on the CSV file using the PowerShell cmdlet Import-Csv, which, as far as I can tell, doesn't work correctly with Latin-1-encoded files exported from Excel or ANSI files created with Notepad if they contain non-ASCII characters.
The command you are looking for is Set-Content. Type "Get-Help Set-Content -Full" at a PowerShell prompt to read the help text, and see the example below.
Also see the part about using Get-Content file.csv | ConvertFrom-Csv.
There is also a separate article on how to convert using iconv on Linux.
Listing the cmdlet Set-Content's Supported Encodings
A hack to list the supported encodings is to use one that doesn't exist:
    PS C:\> 'foo' | Set-Content -Encoding whatever
    Set-Content : Cannot bind parameter 'Encoding'. Cannot convert value "whatever" to type
    "Microsoft.PowerShell.Commands.FileSystemCmdletProviderEncoding" due to invalid enumeration values.
    Specify one of the following enumeration values and try again. The possible enumeration values are
    "Unknown, String, Unicode, Byte, BigEndianUnicode, UTF8, UTF7, Ascii".
    At line:1 char:30
    + 'foo' | Set-Content -Encoding <<<<  whatever
        + CategoryInfo          : InvalidArgument: (:) [Set-Content], ParameterBindingException
        + FullyQualifiedErrorId : CannotConvertArgumentNoMessage,Microsoft.PowerShell.Commands.SetContentCommand
Notice the part with the possible enumeration values:
- Unknown (probably not very useful)
- String
- Unicode
- Byte
- BigEndianUnicode
- UTF8
- UTF7
- Ascii
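The same list can probably be read without provoking an error, by asking .NET for the members of the enumeration type named in the error message. This is only a sketch: it assumes a Windows PowerShell session in which Microsoft.PowerShell.Commands.FileSystemCmdletProviderEncoding is available, and the values returned may differ between PowerShell versions. The variable names are illustrative.

```powershell
# Assumption: the enum type quoted in the error message exists in this session
# (Windows PowerShell; other editions may expose encodings differently).
$encodingType = [Microsoft.PowerShell.Commands.FileSystemCmdletProviderEncoding]

# [System.Enum]::GetNames returns the member names of an enumeration as strings.
$encodingNames = [System.Enum]::GetNames($encodingType)

# Print one name per line.
$encodingNames
```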
The rest of this article presents a solution, or rather a workaround.
Create Problematic Data
To simulate the situation, I open Notepad and manually enter some data that causes the issue. Notepad has some logic that determines which file encoding it uses, but the default is ANSI, and that is what it uses in this example. The data is a manually crafted CSV file containing the "extra" Norwegian vowels "æ", "ø" and "å" and their positions in the Norwegian alphabet.
    PS C:\> notepad .\norwegian-vowels.txt
    PS C:\> gc .\norwegian-vowels.txt
    vowel,position
    æ,27
    ø,28
    å,29
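For a test file that does not depend on Notepad's encoding logic, the same content can be written from a script. A minimal sketch, assuming ISO-8859-1 is an acceptable stand-in for the ANSI code page Notepad used here. The variable names are illustrative, and the vowels are built from their code points so the script does not depend on its own file encoding.

```powershell
# Build the vowels from their Unicode code points so this script stays pure ASCII:
# U+00E6 (ae), U+00F8 (o with stroke), U+00E5 (a with ring).
$ae = [char]0xE6
$oe = [char]0xF8
$aa = [char]0xE5

$lines = @('vowel,position', "$ae,27", "$oe,28", "$aa,29")
$text  = ($lines -join "`r`n") + "`r`n"

# Assumption: ISO-8859-1 stands in for the ANSI code page.
$latin1 = [System.Text.Encoding]::GetEncoding('iso-8859-1')

# .NET methods resolve relative paths against the process directory, not the
# current PowerShell location, so build an absolute path first.
$path = Join-Path (Get-Location).Path 'norwegian-vowels.txt'
[System.IO.File]::WriteAllText($path, $text, $latin1)

# In this encoding each vowel is stored as a single byte (E6, F8, E5).
$bytes = [System.IO.File]::ReadAllBytes($path)
$bytes.Length
```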
Verify It Displays Incorrectly
Here you see the Norwegian vowels are incorrectly displayed as question marks ("?") after being processed by Import-Csv.
    PS C:\> Import-Csv .\norwegian-vowels.txt

    vowel position
    ----- --------
    ?     27
    ?     28
    ?     29
Convert to UTF-8 and Verify It Displays Correctly
Here I use the cmdlet Get-Content to get the content of the current problematic file (norwegian-vowels.txt), pipe it to Set-Content with the parameter -Encoding utf8 and a new file name as the output file (norwegian-vowels-utf8.txt).
Then I just pass it to Import-Csv to verify it's displayed correctly.
    PS C:\> Get-Content .\norwegian-vowels.txt | Set-Content -Encoding utf8 norwegian-vowels-utf8.txt
    PS C:\> Import-Csv .\norwegian-vowels-utf8.txt

    vowel position
    ----- --------
    æ     27
    ø     28
    å     29
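The pipeline above leaves it to Get-Content to pick the source encoding. Where that cannot be relied on, both encodings can be named explicitly through the .NET file APIs. The following is a sketch rather than a drop-in tool: the function name Convert-ToUtf8 and its parameters are made up for illustration, and ISO-8859-1 as the default source encoding is an assumption.

```powershell
function Convert-ToUtf8 {
    param(
        [Parameter(Mandatory = $true)][string]$Path,
        [Parameter(Mandatory = $true)][string]$Destination,
        # Assumption: the source is ISO-8859-1 unless the caller says otherwise.
        [string]$SourceEncoding = 'iso-8859-1'
    )

    # Resolve to absolute file system paths, because .NET methods ignore
    # the current PowerShell location.
    $inPath  = (Resolve-Path -LiteralPath $Path).ProviderPath
    $outPath = $ExecutionContext.SessionState.Path.GetUnresolvedProviderPathFromPSPath($Destination)

    $source = [System.Text.Encoding]::GetEncoding($SourceEncoding)
    # Passing $true asks for the UTF-8 byte order mark to be written.
    $target = New-Object System.Text.UTF8Encoding($true)

    # Decode with the source encoding, then re-encode as UTF-8.
    $text = [System.IO.File]::ReadAllText($inPath, $source)
    [System.IO.File]::WriteAllText($outPath, $text, $target)
}
```

With the file from this example it would be called as Convert-ToUtf8 -Path .\norwegian-vowels.txt -Destination .\norwegian-vowels-utf8.txt, after which Import-Csv can be pointed at the new file as before.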
Additional Information and Avoiding a Temporary File
Ron Delzer left the following comment (quoted):
In looking at why Import-Csv doesn't work as expected found that the missing element is simply the UTF-8 BOM (see http://en.wikipedia.org/wiki/Byte_order_mark). The Get-Content cmdlet correctly determines the encoding at UTF-8 if the BOM is present or not, Import-Csv only works if the BOM is present.

I tried specifying the encoding to Import-Csv and that does not work either:

    PS C:\> Import-Csv -Encoding UTF8 .\norwegian-vowels.txt

You can eliminate the interim file encoding step like this:

    PS C:\> Get-Content .\norwegian-vowels.txt | ConvertFrom-Csv

I submitted a bug to Microsoft Connect: https://connect.microsoft.com/PowerShell/feedback/details/1371244/import-csv-does-not-correctly-detect-encoding-for-utf-8-files-without-bom
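Since the quoted comment attributes the difference to the byte order mark, it can help to check whether a given file actually starts with one. A minimal sketch: the function name Test-Utf8Bom is hypothetical, and it only looks for the three-byte UTF-8 BOM (EF BB BF), not for UTF-16 or UTF-32 marks.

```powershell
function Test-Utf8Bom {
    param([Parameter(Mandatory = $true)][string]$Path)

    # Resolve to an absolute file system path for the .NET call.
    $fullPath = (Resolve-Path -LiteralPath $Path).ProviderPath

    $stream = [System.IO.File]::OpenRead($fullPath)
    try {
        # Read at most the first three bytes of the file.
        $buffer = New-Object byte[] 3
        $read = $stream.Read($buffer, 0, 3)
    }
    finally {
        $stream.Dispose()
    }

    # The UTF-8 byte order mark is the byte sequence EF BB BF.
    return ($read -eq 3 -and
            $buffer[0] -eq 0xEF -and
            $buffer[1] -eq 0xBB -and
            $buffer[2] -eq 0xBF)
}
```

Running Test-Utf8Bom .\norwegian-vowels.txt and Test-Utf8Bom .\norwegian-vowels-utf8.txt side by side shows which of the two files carries the mark.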