PowerShell split operator

From Svendsen Tech PowerShell Wiki

The main PowerShell Regular Expressions article is here. In this article I describe how to use the -split operator for a few common tasks - and also provide some insight into regular expressions. This article was one of the first articles I wrote, so it's a bit of a mess, but the information itself, and knowledge conveyed, is good.

Splitting on Whitespace Using the PowerShell -split Operator

The -split operator takes a regular expression, and to split on an arbitrary amount of whitespace, you can use the regexp "\s+". "\s" is a shorthand character class that matches whitespace: spaces, tabs, vertical tabs, newlines, carriage returns and form feeds (' ', \t, \v, \n, \r, \f). The most common of these are spaces, tabs and newlines. Windows text traditionally uses the line ending "\r\n" rather than a bare "\n". Cmdlets such as Get-Content split their input into lines for you, so you often never see the line endings, but a string read in one piece (for example with Get-Content -Raw) keeps its carriage returns. In PowerShell's double-quoted strings a newline is written "`n" and a carriage return "`r".
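
As a quick check of the list above, here is a minimal sketch that tests each of the six characters against "\s". The variable names are only for illustration.

```powershell
# The six whitespace characters listed above, written with PowerShell's backtick escapes.
$whitespace = " ", "`t", "`v", "`n", "`r", "`f"

# Keep only the characters that \s matches; all six should survive the filter.
$matched = @($whitespace | Where-Object { $_ -match '^\s$' })
$matched.Count

# A letter is not whitespace, so this test fails.
$letterIsSpace = 'a' -match '\s'
$letterIsSpace
```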

The "+" that follows is a quantifier meaning "one or more". It will try to match as many characters as it can, so-called greedy matching. If you want non-greedy matching, use "\s+?". The only difference is a trailing question mark, which after a quantifier means "make the quantifier non-greedy". Non-greedy means it will try to match as few characters as possible, while still getting a complete regexp match, instead of as many as possible. A question mark anywhere else, after a character, a group (in parentheses) or a character class, makes the preceding element optional: the regexp will match whether the element is there or not.

For instance, you can split on newlines and consume an optional preceding carriage return with the regex:

$MultiLineString -split "\r?\n"

If you have mangled or wrongly formatted data, this can also be useful; it will split on any consecutive newlines and/or carriage returns it finds:

$MultiLineString -split "[\r\n]+"
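
To see how the two patterns differ, here is a small sketch on a made-up string with mixed line endings and one blank line. The sample text and variable names are assumptions for illustration.

```powershell
# Sample text: CRLF, then LF, then CRLF followed by an empty CRLF line.
$text = "one`r`ntwo`nthree`r`n`r`nfour"

# \r?\n treats CRLF and LF alike, but keeps the blank line as an empty element.
$lines = $text -split "\r?\n"
$lines.Count        # 5 elements: one, two, three, '', four

# [\r\n]+ swallows any run of CR/LF characters, so the blank line disappears.
$compact = $text -split "[\r\n]+"
$compact.Count      # 4 elements: one, two, three, four
```

Which one you want depends on whether blank lines carry meaning in your data.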

Another quantifier is "*", which means "zero or more". So ".*?" means "match zero or more of any character" (except newlines, unless you use the SingleLine option: "(?s)"). It will try to match as few as possible while still having a successful complete regexp match, including the surrounding regexp parts. Be aware that ".*" always matches, even if you pass in an empty string:

PS C:\> '' -match '.*'
True
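
The greedy and non-greedy behaviour described above is easier to see on a match than on a split. This sketch uses [regex]::Match() and looks at what each pattern actually consumed; the sample strings are made up.

```powershell
# Greedy: + grabs all three letters. Non-greedy: +? stops after the first one.
$greedy = [regex]::Match('aaa', 'a+').Value      # 'aaa'
$lazy   = [regex]::Match('aaa', 'a+?').Value     # 'a'

# The same idea with .* between two delimiters.
$tagGreedy = [regex]::Match('<a><b>', '<.*>').Value    # '<a><b>'
$tagLazy   = [regex]::Match('<a><b>', '<.*?>').Value   # '<a>'

# Without (?s) the dot stops at the newline; with it, the dot crosses lines.
$oneLine  = [regex]::Match("one`ntwo", '.*').Value       # 'one'
$allLines = [regex]::Match("one`ntwo", '(?s).*').Value   # both lines
```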

Here's an example where I split a string with different whitespace using the regexp "\s+" which, as I described above, means "match one or more whitespace characters, greedily (as many as you can)":

PS C:\> "a  `t  b`t`t`t c `n `td e`tf" -split '\s+'
a
b
c
d
e
f

To verify that no whitespace is left around the elements, I pipe the result to ForEach-Object, which wraps each element in single quotes:

PS C:\> "a  `t  b`t`t`t c `n `td e`tf" -split '\s+' | Foreach { "'" + $_ + "'" }
'a'
'b'
'c'
'd'
'e'
'f'

To output it on one line, separated by commas, you could enclose the whole pipeline in parentheses and use the -join operator, like this:

PS C:\> ("a  `t  b`t`t`t c `n `td e`tf" -split '\s+' | Foreach { "'$_'" } ) -join ', '
'a', 'b', 'c', 'd', 'e', 'f'

Indexing/Retrieving Specific Elements From the List/Array/Collection

To retrieve the third and fifth elements, you could enclose the split command in parentheses and use normal array indexing. Remember that indexing starts at 0, not 1.

PS C:\> ("a  `t  b`t`t`t c `n `td e`tf" -split '\s+')[2,4]
c
e

To get the first three elements, you could index and use the range operator (..):

PS C:\> ("a  `t  b`t`t`t c `n `td e`tf" -split '\s+')[0..2]
a
b
c
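
Two related tricks that fit here, sketched on the same string: negative indexes count from the end, and multiple assignment can peel off the first element. The variable names are just for illustration.

```powershell
$parts = "a  `t  b`t`t`t c `n `td e`tf" -split '\s+'

$last    = $parts[-1]         # last element: f
$lastTwo = $parts[-2, -1]     # last two elements: e, f

# Multiple assignment: the first element goes to $first, everything else to $rest.
$first, $rest = $parts
$rest.Count                   # 5 (b, c, d, e, f)
```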

Splitting On Single Spaces

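To split on a single space, you just need a literal space; it has no special meaning in a regexp by default. (The [regex]::Escape() method does escape spaces anyway, as "\ ", which still matches a literal space.) Two consecutive spaces will result in an "empty" field.

PS C:\> 'foo bar  baz' -split ' '
foo
bar

baz
PS C:\>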

The obvious way to avoid empty elements here is to split on one or more consecutive spaces, matching as many as possible:

PS C:\> 'foo bar  baz' -split ' +'
foo
bar
baz
PS C:\>

You can also use Where-Object and something like this to filter out empty elements:

# Skip elements/lines that do NOT contain non-whitespace.
# "\S" is the complement/opposite of \s, so it includes
# anything that is not defined as whitespace.
PS C:\> 'foo bar  baz' -split ' ' | Where { $_ -match '\S' }
foo
bar
baz

# Skip elements that do not have a length greater than zero
PS C:\> 'foo bar  baz' -split ' ' | Where { $_.Length -gt 0 }
foo
bar
baz

# Since an empty string is considered false, you can also simply use:
PS C:\> 'foo bar  baz' -split ' ' | Where { $_ }
foo
bar
baz
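
Empty elements also show up when the string starts or ends with the delimiter. Here is a sketch of two ways around that, using a made-up padded string: trimming first, or using the unary form of -split, which splits on whitespace and ignores leading and trailing whitespace.

```powershell
$padded = '  foo bar  baz '

# Binary -split on '\s+' leaves an empty element at each end.
$binary = $padded -split '\s+'
$binary.Count                 # 5: '', foo, bar, baz, ''

# Trimming first avoids that.
$trimmed = $padded.Trim() -split '\s+'
$trimmed.Count                # 3

# The unary form needs no pattern and no trimming.
$unary = -split $padded
$unary.Count                  # 3
```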

Splitting On Multiple Delimiters

To split on multiple characters, you can put them in a character class. There are some pre-defined classes, like "\w", which includes "A-Z", "a-z", "_" and "0-9", as well as other alphabet characters, as defined by the .NET framework.

Regexp Character Classes

A regexp character class is like a meta language inside the regular expressions language, with its own rules, different from outside a character class. A character class begins with a "[" and ends with a "]", and anything between them will be included in the class. Most meta characters have no special meaning inside a character class; "|", for example, is a literal pipe inside a class and an "OR" operator outside. Also, if you use the same character multiple times, it is no different from using it once. A common beginner mistake is putting whole words inside character classes. The character class [dinner] is no different from, say, [ndire]. Each occurrence of a character only counts once, and the order is insignificant, except in ranges.

There are a few exceptions, most notably that a "^" character first in a character class will negate the rest of the characters in the character class. So [^abc] will match anything except the letters a, b and c.

The character class [foo\s+] will match one instance of either whitespace (\s), the letter "f", the letter "o" or the literal symbol "+".

Another meta character inside a character class is "-", which indicates a range, like "[a-f0-9]" for matching hex digits or "[a-z]" for matching English alphabet letters.
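
The rules above can be checked with a few one-liners. This is only a sketch with made-up sample strings; the "^" and "$" anchors force the class to account for the whole string.

```powershell
# Order and repetition do not matter: both classes match the same single characters.
$sameA   = 'r' -match '^[dinner]$'         # True
$sameB   = 'r' -match '^[ndire]$'          # True
$notWord = 'dinner' -match '^[dinner]$'    # False: a class matches exactly one character

# Inside a class, | and + are literal characters.
$pipeParts = 'x|y' -split '[|]'            # x, y
$plusParts = 'a+b c' -split '[foo\s+]'     # a, b, c

# Negation and ranges.
$negated = 'b' -match '^[^abc]$'           # False
$hex     = 'ff09' -match '^[a-f0-9]+$'     # True
```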

The Actual Splitting

To split on, for instance, the characters "-", "," and "_" (dash, comma, underscore), you can use the character class [,_-]. I said above that "-" is special inside character classes, but when it comes first or last it is interpreted literally, since there is nothing for it to form a range with. To avoid future errors where you might add something before or after the dash, I recommend always escaping it; once escaped, it can go anywhere in the class. A small quirk is that you need to escape it with the regexp escape character, a backslash ("\"), not the PowerShell escape character "`", as the examples below demonstrate.

PS C:\> 'a-b,c,d_e' -split '[,_\-]'
a
b
c
d
e

Like I mentioned above, you can escape the character class range operator with a backslash ("\"). Here I -join the string, and effectively replace the characters I split on with a hash sign ("#"):

PS C:\> 'a-b,c,d_e' -split '[,\-_]' -join '#'
a#b#c#d#e
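
A character class only covers single-character delimiters. As a sketch of two neighbouring cases, with made-up input: alternation handles delimiters longer than one character, and wrapping the pattern in capturing parentheses makes -split keep the delimiters in the output.

```powershell
# For delimiters longer than one character, use alternation instead of a class.
$multi = 'a::b--c' -split '::|--'
$multi -join '#'              # a#b#c

# Capturing parentheses around the pattern keep the delimiters as elements.
$kept = 'a-b,c' -split '([,\-])'
$kept.Count                   # 5: a, -, b, ",", c
```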

The Special Meaning Of A Period In A Regex Passed To -split

You might try to split on something with a period/dot (".") and expect something like "'a.b.c' -split '.'" to produce what you want, but since the -split operator takes a regular expression, a period has special meaning, namely "any character" (except newlines without the SingleLine regexp option). So it will split on every single character and produce an array of six empty string elements, as demonstrated in this example:

PS C:\> '3.2.1' -split '.'

PS C:\> ('3.2.1' -split '.').count
6
PS C:\> ('3.2.1' -split '.')[0].length
0
PS C:\> ('3.2.1' -split '.')[0].GetType().FullName
System.String

To split on a literal period/dot, you need to escape it, and not using the PowerShell escape character, "`", but a backslash: "\". Here's the result we wanted in this case:

PS C:\> '3.2.1' -split '\.'
3
2
1
PS C:\>
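
If you would rather not escape by hand, two alternatives are sketched here: letting [regex]::Escape() build the pattern, or passing the SimpleMatch option so that -split treats the delimiter as plain text. The 0 in the second form is the max-substrings argument and means "no limit".

```powershell
# Let .NET do the escaping: [regex]::Escape('.') returns '\.'.
$escaped   = [regex]::Escape('.')
$viaEscape = '3.2.1' -split $escaped

# Or switch off regexp interpretation for this split.
$viaSimple = '3.2.1' -split '.', 0, 'SimpleMatch'

$viaEscape -join ' '          # 3 2 1
$viaSimple -join ' '          # 3 2 1
```

The first form is handy when the delimiter comes from a variable and may contain any regexp meta characters.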

The String Method .Split()

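To split on a single, or multiple, characters, you can also use the System.String object method Split(). It's documented by Microsoft here. Note that the second example below, which passes the string ',;' in order to split on either character, shows Windows PowerShell behavior. PowerShell 7, built on a newer .NET, picks a Split() overload that treats ',;' as one two-character delimiter, so the string is not split there.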

PS C:\> 'a,b;c,d'.Split(',') -join ' | '
a | b;c | d
PS C:\> 'a,b;c,d'.Split(',;') -join ' | '
a | b | c | d
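
Which Split() overload PowerShell picks for a plain string argument depends on the .NET version underneath. As a hedged sketch, casting the delimiters to [char[]] asks explicitly for the "split on any of these characters" overload, which should behave the same in Windows PowerShell and PowerShell 7.

```powershell
# The cast turns ',;' into an array of two characters: ',' and ';'.
$fields = 'a,b;c,d'.Split([char[]]',;')
$fields -join ' | '     # a | b | c | d
```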

You can also remove empty elements using [StringSplitOptions]::RemoveEmptyEntries. Below, I demonstrate how you get empty elements, and how they can be removed by adding a parameter to the Split() method.

PS C:\> 'a,,,d'.Split(',') -join ' | '
a |  |  | d
PS C:\> 'a,,,d'.Split(',', [StringSplitOptions]::RemoveEmptyEntries) -join ' | '
a | d

An alternative way of effectively removing empty elements in the form of these doubled-up delimiters (commas), is by using -replace to replace multiple commas with a single comma. This of course doesn't consider quoted fields containing delimiters/commas in the data.

PS C:\> ('a,,,d' -replace ',+', ',').Split(',') -join ' | '
a | d

# or with -split:

PS C:\> 'a,,,d' -replace ',+', ',' -split ',' -join ' | '
a | d

# or slicker still:

PS C:\> 'a,,,d' -split ',+' -join ' | '
a | d
