PowerShell vs. Perl at text processing

From Svendsen Tech PowerShell Wiki

I'll try to write this little article in a neutral way. In it, I compare PowerShell to Perl with regard to text processing. I'm not doing proper benchmarking here, but the numbers do serve as indicators, especially with differences as significant as 25 seconds versus 0.4 seconds.

I'm dealing with a specific, real-world problem: parsing an IRC log. A small spanner in the works is that I needed to run a regular expression against the contents of the file, and the matches needed to span multiple lines. This means I have to read the entire file into memory as a single string, which PowerShell's Get-Content does anyway. The log file is 43 MB in size.
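The reason a multi-line match forces slurping the whole file is the (?s) "single-line"/"dotall" regex flag, which makes . match newlines too. Here is the idea sketched in Python for illustration (the sample string is made up, not from the real log):

```python
import re

# Two-line sample string; made-up data for illustration only
text = "Vinneren er...\nAlice wins"

# Without (?s), "." does not match "\n", so the pattern cannot
# bridge the two lines and the search fails
print(re.search(r"er\.{3}.*Alice", text))          # None

# With (?s) ("dotall"), "." matches "\n" too, so one match
# can span both lines
print(re.search(r"(?s)er\.{3}.*Alice", text).group())
```

If the file were processed line by line, neither engine would ever see both halves of the match at once, which is why the whole file has to be in memory.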

Both Perl and PowerShell use only a single core of the four available. The CPU is a quad-core Q9550 at 2.83 GHz.

The computer is mostly idle, though I'm running a music player, chat clients, mail clients and so on, as usual. This is far from an ideal test environment, but I ran the scripts several times consecutively and the numbers were always very similar, so I'm satisfied they are representative. Proper benchmarking would require many runs and consideration of averages, standard deviations, medians and all that jazz; honestly, it doesn't seem necessary here.

By the way, since writing this article I have written a generic PowerShell benchmarking module.




First Attempt With PowerShell

So I wrote up the code below. Don't worry about the regular expression; suffice it to say it's what I needed.

# Read the entire file into a single string (Get-Content returns an array
# of lines; the [string] cast joins them with spaces)
$IRCLog = [string] (Get-Content e:\temp\irc.log)

# (?s) makes "." match newlines as well, so the pattern can span lines
$Regex = [regex] '(?s)Delectus> -\*\).*?Vinneren er\.{3}.+?Delectus> -\*> (.+?) med\s+\d+\s+poeng'

# Tally of wins per nick
$IRCWinners = @{}

[regex]::Matches($IRCLog, $Regex) |
    ForEach-Object { $IRCWinners.($_.Groups[1].Value) += 1 }

# Sort by win count descending, then by nick ascending
$IRCWinners.GetEnumerator() | Sort-Object @{e={$_.Value}; Ascending=$false}, @{e={$_.Name}; Ascending=$true}
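The tallying and the two-key sort above (win count descending, then nick ascending) are the same idea in any language; here is a minimal sketch of that logic in Python, using made-up data rather than the real log:

```python
from collections import Counter

# Made-up match results for illustration; in the real script each element
# would be capture group 1 of a regex match (the winner's nick)
wins = Counter(["carol", "alice", "bob", "alice", "carol", "carol"])

# Sort by win count descending, then by name ascending; negating the
# count lets one ascending sort handle both keys
ranking = sorted(wins.items(), key=lambda kv: (-kv[1], kv[0]))
print(ranking)
```

The PowerShell version achieves the same thing with two Sort-Object property expressions, one with Ascending=$false and one with Ascending=$true.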

Then I measured how long it took to run a few times and got numbers similar to this:

PS E:\temp> (Measure-Command { .\IRC-Winner-Extract.ps1 > foo.txt }).TotalSeconds
24,6283702

About 24 seconds. Not too impressive?

Second Attempt With PowerShell

So I looked into .NET methods for reading files and found the static ReadAllText method on the [System.IO.File] class, which reads a file into a single string in one call. It is significantly faster.

I replaced the line where I read the file:

$IRCLog = [string] (Get-Content e:\temp\irc.log)

With this:

$IRCLog = [System.IO.File]::ReadAllText('e:\temp\irc.log')

Then I ran the code again and measured how long it took:

PS E:\temp> (Measure-Command { .\IRC-Winner-Extract.ps1 > foo.txt }).TotalSeconds
3,7554382

About 3.76 seconds on that run, a lot faster than the previous attempt. Subsequent runs consistently gave similar results.

I suppose one might conclude that Get-Content is slow for this type of operation. Even slower is passing "-Delimiter `0" to Get-Content; I didn't have the patience to wait for that version to finish even after a few minutes. I know this article is pretty haphazardly put together; my apologies.

First Perl Attempt

So I decided to write a Perl version to see how it compared. Text processing is known as one of Perl's strengths.

Code:

use warnings;
use strict;

my $regex = qr'(?s)Delectus> -\*\).*?Vinneren er\.{3}.+?Delectus> -\*> (.+?) med\s+\d+\s+poeng';

open my $fh, '<', 'irc.log' or die "Failed to open file: $!\n$^E";

# Slurp the whole file: with the input record separator ($/) locally
# undefined, <$fh> returns the entire file as one string
my $irc_log = do { local $/; <$fh> };

my %winners;

# Tally capture group 1 (the winner's nick) for every match
$winners{$1} += 1 while $irc_log =~ m/$regex/g;

# Sort by win count descending, then by nick ascending
foreach my $key (sort { $winners{$b} <=> $winners{$a} or $a cmp $b } keys %winners) {
    printf '%-35s%s', $key, "$winners{$key}\n";
}

Then I ran it from PowerShell using Measure-Command:

PS E:\temp> (Measure-Command { perl .\IRC-Winner-Extract.pl > foo2.txt }).TotalSeconds
0,3780953

About 0.38 seconds. It seems Perl is faster for this specific problem.

I've run the scripts numerous times and the numbers are always close to the results I've posted here.

Second Perl Attempt With File::Slurp

For the second Perl attempt, I tried File::Slurp, but it actually increases the duration of the script for this specific example. This is due to the time it takes to load the module, which becomes significant in such a short-lived script.

In the output below you can see that the first run takes a while longer; there's probably some caching going on. All the previous example times are from consecutive runs, so whatever caching there is should already be in effect.

The code to read the file is now:

use File::Slurp;
my $irc_log = read_file('e:/temp/irc.log');

And giving it a few spins, we get:

PS E:\temp> 1..3 | %{ (Measure-Command { perl .\IRC-Winner-Extract.pl > foo2.txt }).TotalSeconds }
1,3639568
0,648764
0,6400252
PS E:\temp> 1..3 | %{ (Measure-Command { perl .\IRC-Winner-Extract.pl > foo2.txt }).TotalSeconds }
0,6360413
0,6393869
0,6458041

About 0.64 seconds with File::Slurp.

On larger files you should see greater relative savings with File::Slurp. The optional scalar_ref => 1 parameter to the read_file() function, which makes it return a reference to the file contents instead of copying the string, shaves off a significant amount of time (relatively speaking):

PS E:\temp> 1..5 | % { (Measure-Command { perl .\IRC-Winner-Extract.pl }).TotalSeconds }
0,5304287
0,5253521
0,5210729
0,5206464
0,5241358

About 0.52 seconds.

Load Time of the Perl Interpreter Itself from PowerShell

Conclusion: it's insignificant in most situations.

PS E:\> 1..10 | %{ (Measure-Command { perl -e 1 }).TotalSeconds }
0,0310774
0,0273425
0,0264974
0,0275955
0,0282403
0,0282031
0,0286994
0,0288328
0,0287991
0,0301134

I also calculated the average over 10 and 1000 iterations:

PS E:\> 1..10 | %{ (Measure-Command { perl -e 1 }).TotalSeconds } | Measure-Object -Sum | %{ $_.Sum / $_.Count }
0,02691432
PS E:\> 1..1000 | %{ (Measure-Command { perl -e 1 }).TotalSeconds } | Measure-Object -Sum | %{ $_.Sum / $_.Count }
0,027541721

Very similar to the individual results.