Convert from Most Encodings to UTF-8 with PowerShell

From Svendsen Tech PowerShell Wiki




Scenario

You have an ANSI-encoded file, or a file in some other supported encoding, and want to convert it to UTF-8 (or another supported encoding). I ran into this when working with data exported from Excel, which was in latin1/ISO-8859-1 by default, and I couldn't find a way to specify UTF-8 in Excel.

The problem occurred when I wanted to work on the CSV file with the PowerShell cmdlet Import-Csv, which, as far as I can tell, doesn't work correctly with latin1-encoded files exported from Excel or ANSI files created with Notepad if they contain non-ASCII characters.

The command you are looking for is Set-Content. Type "Get-Help Set-Content -Full" at a PowerShell prompt to read the help text, and see the example below.

Also see the part about using Get-Content file.csv | ConvertFrom-Csv.

There is also a separate article on how to convert using iconv on Linux.

Listing the cmdlet Set-Content's Supported Encodings

A hack to list the supported encodings is to use one that doesn't exist:

PS C:\> 'foo' | Set-Content -Encoding whatever
Set-Content : Cannot bind parameter 'Encoding'. Cannot convert value "whatever" to type "Microsoft.PowerShell.Commands.
FileSystemCmdletProviderEncoding" due to invalid enumeration values. Specify one of the following enumeration values an
d try again. The possible enumeration values are "Unknown, String, Unicode, Byte, BigEndianUnicode, UTF8, UTF7, Ascii".
At line:1 char:30
+ 'foo' | Set-Content -Encoding <<<<  whatever
    + CategoryInfo          : InvalidArgument: (:) [Set-Content], ParameterBindingException
    + FullyQualifiedErrorId : CannotConvertArgumentNoMessage,Microsoft.PowerShell.Commands.SetContentCommand

Notice the part with the possible enumeration values:

  • Unknown (probably not very useful)
  • String
  • Unicode
  • Byte
  • BigEndianUnicode
  • UTF8
  • UTF7
  • ASCII
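
The same list can probably be obtained without provoking an error. The following is a minimal sketch, not something from the original article: it asks Get-Command for the type of the -Encoding parameter and lists the names if that type is an enumeration. The function name Get-SetContentEncodingInfo is hypothetical, and the sketch assumes the current location is on a file system drive, since -Encoding is a parameter added by the file system provider.

```powershell
# Hypothetical helper: reports what Set-Content's -Encoding parameter accepts.
function Get-SetContentEncodingInfo {
    # Assumption: the current location is on a file system drive, so that
    # Get-Command can resolve the provider-specific -Encoding parameter.
    $parameter = (Get-Command Set-Content).Parameters['Encoding']

    if ($null -eq $parameter) {
        return 'No -Encoding parameter was found from the current location.'
    }

    $type = $parameter.ParameterType
    if ($type.IsEnum) {
        # The parameter is an enumeration: list its value names.
        return [Enum]::GetNames($type)
    }

    # Other PowerShell editions may type the parameter differently;
    # in that case only the type name is reported.
    return "Encoding parameter type: $($type.FullName)"
}

Get-SetContentEncodingInfo
```

The exact output depends on the PowerShell version, so the list may be longer than the one shown in the error message above.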

Solution

Here I present a solution or work-around.

Create Problematic Data

To simulate the situation, I open Notepad and manually enter some data that causes issues. Notepad has some logic that determines which file encoding it uses, but the default is ANSI, and that is what it uses in this example. The data is a manually crafted CSV file containing the "extra" Norwegian vowels "æ", "ø" and "å" and their positions in the Norwegian alphabet.

PS C:\> notepad .\norwegian-vowels.txt
PS C:\> gc .\norwegian-vowels.txt
vowel,position
æ,27
ø,28
å,29
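
If you would rather not depend on Notepad's encoding detection, a similar test file can be written from PowerShell. This is a sketch under assumptions: the function name New-Latin1SampleCsv and the temporary file location are hypothetical, and it uses ISO-8859-1 through .NET rather than Notepad's ANSI code page. For these three letters the bytes are the same in ISO-8859-1 and Windows-1252.

```powershell
# Hypothetical helper: writes the sample CSV as single-byte ISO-8859-1 text.
function New-Latin1SampleCsv {
    param([Parameter(Mandatory = $true)] [string] $Path)

    # Build the letters from code points so that the encoding of this
    # script file itself does not matter.
    $ae = [char] 0xE6   # æ
    $oe = [char] 0xF8   # ø
    $aa = [char] 0xE5   # å
    $lines = @('vowel,position', "$ae,27", "$oe,28", "$aa,29")

    # ISO-8859-1 stores each of these letters as one byte and writes no BOM.
    $latin1 = [System.Text.Encoding]::GetEncoding('iso-8859-1')
    [System.IO.File]::WriteAllLines($Path, [string[]] $lines, $latin1)
}

# .NET file methods resolve relative paths against the process working
# directory, not the PowerShell location, so a full path is used here.
$samplePath = Join-Path ([System.IO.Path]::GetTempPath()) 'norwegian-vowels.txt'
New-Latin1SampleCsv -Path $samplePath
```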

Verify It Displays Incorrectly

Here you see the Norwegian vowels are incorrectly displayed as question marks ("?") after being processed by Import-Csv.

PS C:\> Import-Csv .\norwegian-vowels.txt

vowel                                                       position
-----                                                       --------
?                                                           27
?                                                           28
?                                                           29
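
One plausible explanation, which the article itself does not confirm, is that the bytes are being decoded with a different encoding than the one they were written in. The sketch below only demonstrates that general effect with .NET: the same bytes are decoded once as ISO-8859-1 and once as UTF-8. It makes no claim about which decoder Import-Csv actually uses.

```powershell
$latin1 = [System.Text.Encoding]::GetEncoding('iso-8859-1')

# One data row of the sample file: the letter ae, a comma, and the number 27.
$text  = "$([char] 0xE6),27"
$bytes = $latin1.GetBytes($text)

# Show the raw bytes; the letter occupies a single byte.
$hex = ($bytes | ForEach-Object { $_.ToString('X2') }) -join ' '
$hex                                   # E6 2C 32 37

# Decoding with the encoding that was used for writing round-trips cleanly.
$asLatin1 = $latin1.GetString($bytes)

# Decoding the same bytes as UTF-8 does not: a lone E6 byte is not valid
# UTF-8, so the decoder substitutes the replacement character U+FFFD.
$asUtf8 = [System.Text.Encoding]::UTF8.GetString($bytes)

$asLatin1 -eq $text                    # True
[int] $asUtf8[0] -eq 0xFFFD            # True
```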

Convert to UTF-8 and Verify It Displays Correctly

Here I use the cmdlet Get-Content to get the content of the current problematic file (norwegian-vowels.txt), pipe it to Set-Content with the parameter -Encoding utf8 and a new file name as the output file (norwegian-vowels-utf8.txt).

Then I just pass it to Import-Csv to verify it's displayed correctly.

PS C:\> Get-Content .\norwegian-vowels.txt | Set-Content -Encoding utf8 norwegian-vowels-utf8.txt
PS C:\> Import-Csv .\norwegian-vowels-utf8.txt

vowel                                                       position
-----                                                       --------
æ                                                           27
ø                                                           28
å                                                           29
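
The pipeline above relies on Get-Content detecting the source encoding. When the source encoding is known, it can instead be named explicitly through .NET. The following is a minimal sketch, not the article's method: the function name Convert-FileToUtf8, its parameters and the sample file names are all hypothetical, and the default source encoding of ISO-8859-1 is an assumption that matches the Excel case described earlier.

```powershell
# Hypothetical helper: re-encodes a text file as UTF-8 with a BOM.
function Convert-FileToUtf8 {
    param(
        [Parameter(Mandatory = $true)] [string] $Path,
        [Parameter(Mandatory = $true)] [string] $Destination,
        # Assumption: the source is ISO-8859-1 unless stated otherwise.
        [string] $SourceEncodingName = 'iso-8859-1'
    )

    # Read the whole file using the explicitly named source encoding.
    $source = [System.Text.Encoding]::GetEncoding($SourceEncodingName)
    $text   = [System.IO.File]::ReadAllText($Path, $source)

    # The $true argument asks UTF8Encoding to emit a byte order mark.
    $utf8WithBom = New-Object System.Text.UTF8Encoding -ArgumentList $true
    [System.IO.File]::WriteAllText($Destination, $text, $utf8WithBom)
}

# Usage with throwaway files in the temp directory (full paths, because
# .NET does not follow the PowerShell location for relative paths).
$tempDir    = [System.IO.Path]::GetTempPath()
$sourceFile = Join-Path $tempDir 'latin1-sample.csv'
$targetFile = Join-Path $tempDir 'utf8-sample.csv'

$sampleText = "vowel,position`r`n$([char] 0xE6),27`r`n"
[System.IO.File]::WriteAllText($sourceFile, $sampleText,
    [System.Text.Encoding]::GetEncoding('iso-8859-1'))

Convert-FileToUtf8 -Path $sourceFile -Destination $targetFile
Import-Csv $targetFile
```

Reading the whole file into memory keeps the sketch short; for very large files a streaming approach would be more appropriate.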

Additional Information and Avoiding a Temporary File

Ron Delzer commented the following (quote):

In looking at why Import-Csv doesn't work as expected found that
the missing element is simply the UTF-8 BOM
(see http://en.wikipedia.org/wiki/Byte_order_mark)
 
The Get-Content cmdlet correctly determines the encoding at UTF-8 if the
BOM is present or not, Import-Csv only works if the BOM is present.
 
I tried specifying the encoding to Import-Csv and that does not work
either: PS C:\> Import-Csv -Encoding UTF8 .\norwegian-vowels.txt
 
You can eliminate the interim file encoding step like this:
PS C:\> Get-Content .\norwegian-vowels.txt | ConvertFrom-Csv
 
I submitted a bug to Microsoft Connect:

https://connect.microsoft.com/PowerShell/feedback/details/1371244/import-csv-does-not-correctly-detect-encoding-for-utf-8-files-without-bom
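
Since the comment above hinges on whether a UTF-8 BOM is present, it can help to check a file for one directly. This is a small sketch with a hypothetical function name, Test-Utf8Bom; it simply compares the first three bytes of the file with EF BB BF, the UTF-8 byte order mark.

```powershell
# Hypothetical helper: returns $true if the file starts with the UTF-8 BOM.
function Test-Utf8Bom {
    param([Parameter(Mandatory = $true)] [string] $Path)

    # Reads the whole file, which is fine for small files like the samples here.
    $bytes = [System.IO.File]::ReadAllBytes($Path)

    return ($bytes.Length -ge 3 -and
            $bytes[0] -eq 0xEF -and
            $bytes[1] -eq 0xBB -and
            $bytes[2] -eq 0xBF)
}

# Usage: write the same text with and without a BOM and compare.
$tempDir     = [System.IO.Path]::GetTempPath()
$withBom     = Join-Path $tempDir 'bom-yes.txt'
$withoutBom  = Join-Path $tempDir 'bom-no.txt'
$text        = "vowel,position`r`n$([char] 0xE6),27`r`n"

[System.IO.File]::WriteAllText($withBom, $text,
    (New-Object System.Text.UTF8Encoding -ArgumentList $true))
[System.IO.File]::WriteAllText($withoutBom, $text,
    (New-Object System.Text.UTF8Encoding -ArgumentList $false))

Test-Utf8Bom -Path $withBom       # True
Test-Utf8Bom -Path $withoutBom    # False
```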

Screenshot Example

Convert-from-ansi-to-utf8-example.png

