Convert from most encodings to UTF-8 with PowerShell

This article was originally written some time between 2011 and 2015. Last update: 2022-01-26.

Scenario

If you have an ANSI-encoded file, or a file encoded using some other (supported) encoding, and want to convert it to UTF-8 (or another supported encoding), this article is for you. I ran into this when working with exported data from Excel which was in latin1/ISO8859-1 by default, and I couldn't find a way to specify UTF-8 in Excel.

The problem occurred when I wanted to work on the CSV file using the PowerShell cmdlet Import-Csv, which, as far as I can tell, doesn't work correctly with latin1-encoded files exported from Excel or ANSI files created with Notepad if they contain characters outside US-ASCII. Update 2022-01-26: It's a known bug that has probably been fixed in PowerShell v7, but it still seems to be present in PowerShell 5.1. The bug occurs when the file is missing the UTF-8 BOM (more on that below). The bug was submitted to Microsoft Connect years ago; the link is in the quoted comment near the end of this article.

A command you may be looking for is Set-Content. Type "Get-Help Set-Content -Online" at a PowerShell prompt to read the help text, and see the example below.

Also see the part about using Get-Content file.csv | ConvertFrom-Csv.

A separate article covers how to convert using iconv on Linux.

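Internally in PowerShell, a string is a sequence of 16-bit UTF-16 code units (.NET System.Char values). Most characters take one code unit, but characters outside the Basic Multilingual Plane take two (a surrogate pair), so a code unit is not the same thing as a Unicode code point or Unicode scalar value. It's implemented directly using the .NET System.String type, which is a reference type (read more about that in my deep copying article).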

A string can be arbitrarily long (computer memory and physics as we currently understand it allowing) and it is immutable, meaning it cannot be changed without creating an entirely new altered version/"copy" of the string.
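
A minimal sketch of what this means in practice, assuming any PowerShell version running on .NET. The Length property counts 16-bit code units, so a character outside the Basic Multilingual Plane counts as two, and string methods hand back new strings instead of changing the original. The variable names are only for illustration.

```powershell
# U+00E5 (the Norwegian vowel a-with-ring) fits in one 16-bit code unit.
$single = [string][char]0x00E5
$single.Length    # 1

# U+1F600 lies outside the Basic Multilingual Plane, so .NET stores it as a
# surrogate pair: two 16-bit code units for a single character.
$pair = [char]::ConvertFromUtf32(0x1F600)
$pair.Length      # 2

# Strings are immutable: ToUpper() returns a new string and leaves the
# original untouched.
$original = 'abc'
$upper = $original.ToUpper()
$original         # abc
$upper            # ABC
```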

Listing the cmdlet Set-Content's Supported Encodings

A hack to list the supported encodings is to use one that doesn't exist:

PS C:\> 'foo' | Set-Content -Encoding whatever
Set-Content : Cannot bind parameter 'Encoding'. Cannot convert value "whatever" to type "Microsoft.PowerShell.Commands.
FileSystemCmdletProviderEncoding" due to invalid enumeration values. Specify one of the following enumeration values an
d try again. The possible enumeration values are "Unknown, String, Unicode, Byte, BigEndianUnicode, UTF8, UTF7, Ascii".
At line:1 char:30
+ 'foo' | Set-Content -Encoding <<<<  whatever
    + CategoryInfo          : InvalidArgument: (:) [Set-Content], ParameterBindingException
    + FullyQualifiedErrorId : CannotConvertArgumentNoMessage,Microsoft.PowerShell.Commands.SetContentCommand

Notice the part with the possible enumeration values:

Unknown (probably not very useful)
String
Unicode
Byte
BigEndianUnicode
UTF8
UTF7
ASCII
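
The same list can probably be obtained without provoking an error, by asking .NET for the names of the enumeration type mentioned in the error message. This is a sketch that assumes Windows PowerShell (for example 5.1), where that type backs the -Encoding parameter; newer versions may report additional values, and PowerShell 7 handles -Encoding differently, so treat the result as version dependent.

```powershell
# Assumption: Windows PowerShell, where this enum type (named in the error
# message above) is the type of Set-Content's -Encoding parameter.
$encodingType = [Microsoft.PowerShell.Commands.FileSystemCmdletProviderEncoding]

# [Enum]::GetNames returns the value names as an array of strings.
$encodingNames = [Enum]::GetNames($encodingType)
$encodingNames
```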

Solution

Here I present a solution or work-around.

Create Problematic Data

To simulate the situation, I open Notepad and manually enter some data that causes the issue. Notepad has some logic that determines what file encoding it uses, but the default is ANSI, and that is what it uses in this example. The data is a manually crafted CSV file containing the "extra" Norwegian vowels "æ", "ø" and "å" and their positions in the Norwegian alphabet.

PS C:\> notepad .\norwegian-vowels.txt
PS C:\> gc .\norwegian-vowels.txt
vowel,position
æ,27
ø,28
å,29
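
If you would rather not type the file by hand, the following sketch creates roughly the same test file from a script. It is not what I did above; it assumes that ISO-8859-1 is a fair stand-in for Notepad's "ANSI" (for these three vowels the bytes are the same as in Windows-1252), and the temp-directory location is a hypothetical choice.

```powershell
# Hypothetical location: a file in the temp directory. A full path is used
# because .NET methods do not follow PowerShell's current location.
$ansiPath = Join-Path ([System.IO.Path]::GetTempPath()) 'norwegian-vowels.txt'

# Build the vowels from their code points so that the encoding of this
# script file itself cannot affect the result.
$ae = [char]0x00E6   # ae ligature
$oe = [char]0x00F8   # o with stroke
$aa = [char]0x00E5   # a with ring
$lines = @('vowel,position', "$ae,27", "$oe,28", "$aa,29")

# ISO-8859-1 encodes each of these vowels as a single byte and writes no BOM.
$latin1 = [System.Text.Encoding]::GetEncoding('iso-8859-1')
[System.IO.File]::WriteAllLines($ansiPath, [string[]]$lines, $latin1)
```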

Verify It Displays Incorrectly

Here you see the Norwegian vowels are incorrectly displayed as question marks ("?") after being processed by Import-Csv.

PS C:\> Import-Csv .\norwegian-vowels.txt

vowel position
----- --------
?     27
?     28
?     29

Convert To UTF-8 and Verify It Displays Correctly

Here I use the cmdlet Get-Content to get the content of the current problematic file (norwegian-vowels.txt), pipe it to Set-Content with the parameter -Encoding utf8 and a new file name as the output file (norwegian-vowels-utf8.txt).

Then I just pass it to Import-Csv to verify it's displayed correctly.

PS C:\> Get-Content .\norwegian-vowels.txt | Set-Content -Encoding utf8 norwegian-vowels-utf8.txt
PS C:\> Import-Csv .\norwegian-vowels-utf8.txt

vowel position
----- --------
æ     27
ø     28
å     29
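
The pipeline above relies on Get-Content guessing the source encoding. As an alternative sketch, not the method used in this article, the source encoding can be named explicitly through the .NET System.IO.File methods. It assumes the input is ISO-8859-1, uses hypothetical file locations in the temp directory, and writes its own one-line sample input so it can be run as is.

```powershell
$dir = [System.IO.Path]::GetTempPath()
$sourcePath = Join-Path $dir 'norwegian-vowels.txt'        # hypothetical input
$targetPath = Join-Path $dir 'norwegian-vowels-utf8.txt'   # hypothetical output

$latin1 = [System.Text.Encoding]::GetEncoding('iso-8859-1')
$utf8WithBom = New-Object System.Text.UTF8Encoding $true   # $true = emit a BOM

# Sample input so the sketch runs on its own: the four bytes of the line
# "a-ring,29" in ISO-8859-1, where the vowel is the single byte 0xE5.
[System.IO.File]::WriteAllBytes($sourcePath, [byte[]](0xE5, 0x2C, 0x32, 0x39))

# Decode with the stated source encoding, then re-encode as UTF-8 with a BOM.
$text = [System.IO.File]::ReadAllText($sourcePath, $latin1)
[System.IO.File]::WriteAllText($targetPath, $text, $utf8WithBom)
```

Swapping in a different source encoding should only require changing the name passed to GetEncoding, provided the running .NET version supports that encoding.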

Additional Information and Avoiding a Temporary File

Ron Delzer commented the following (quote):
In looking at why Import-Csv doesn't work as expected found that the missing element is simply the UTF-8 BOM

(see https://en.wikipedia.org/wiki/Byte_order_mark )

The Get-Content cmdlet correctly determines the encoding at UTF-8 if the BOM is present or not, Import-Csv only works if the BOM is present.

I tried specifying the encoding to Import-Csv and that does not work either: PS C:\> Import-Csv -Encoding UTF8 .\norwegian-vowels.txt

You can eliminate the interim file encoding step like this: PS C:\> Get-Content .\norwegian-vowels.txt | ConvertFrom-Csv

I submitted a bug to Microsoft Connect: https://connect.microsoft.com/PowerShell/feedback/details/1371244/import-csv-does-not-correctly-detect-encoding-for-utf-8-files-without-bom
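
Since the comment quoted above comes down to whether the file starts with a UTF-8 BOM, here is a small sketch for checking that. Test-Utf8Bom is a hypothetical helper name, not a built-in cmdlet; it compares the first three bytes of a file with EF BB BF. The two sample files are created only so the sketch is runnable.

```powershell
function Test-Utf8Bom {
    param([Parameter(Mandatory = $true)][string]$Path)
    # Pass a full path: .NET methods do not follow PowerShell's current location.
    $bytes = [System.IO.File]::ReadAllBytes($Path)
    return ($bytes.Length -ge 3 -and
        $bytes[0] -eq 0xEF -and $bytes[1] -eq 0xBB -and $bytes[2] -eq 0xBF)
}

# Hypothetical sample files: the same text written with and without a BOM.
$dir = [System.IO.Path]::GetTempPath()
$withBom = Join-Path $dir 'bom-yes.txt'
$withoutBom = Join-Path $dir 'bom-no.txt'
[System.IO.File]::WriteAllText($withBom, 'a,b', (New-Object System.Text.UTF8Encoding $true))
[System.IO.File]::WriteAllText($withoutBom, 'a,b', (New-Object System.Text.UTF8Encoding $false))

Test-Utf8Bom -Path $withBom      # True
Test-Utf8Bom -Path $withoutBom   # False
```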

Screenshot Example

[Screenshot: example of how to convert from ANSI to UTF-8]
