4
\$\begingroup\$

I have a very large JSON Lines File with 4.000.000 Rows, and I need to convert several events from every row. The resulted CSV File contains 15.000.000 rows. How can I optimize this script?

I'm using Powershell core 7 and it takes around 50 hours to complete the conversion.

My Powershell script:

$stopwatch = [system.diagnostics.stopwatch]::StartNew()
$totalrows = 4000000
$encoding = [System.Text.Encoding]::UTF8 
$i = 0
$ig = 0
$output = @()
$Importfile = "C:\file.jsonl"
$Exportfile = "C:\file.csv"
if (test-path $Exportfile) {
 Remove-Item -path $Exportfile
}
foreach ($line in [System.IO.File]::ReadLines($Importfile, $encoding)) {
 $json = $line | ConvertFrom-Json
 foreach ($item in $json.events.items) {
 $CSVLine = [pscustomobject]@{
 Key = $json.Register.Key
 CompanyID = $json.id
 Eventtype = $item.type
 Eventdate = $item.date
 Eventdescription = $item.description
 }
 $output += $CSVLine
 }
 $i++
 $ig++
 if ($i -ge 30000) {
 $output | Export-Csv -Path $Exportfile -NoTypeInformation -Delimiter ";" -Encoding UTF8 -Append
 $i = 0
 $output = @()
 $minutes = $stopwatch.elapsed.TotalMinutes
 $percentage = $ig / $totalrows * 100
 $totalestimatedtime = $minutes * (100/$percentage)
 $timeremaining = $totalestimatedtime - $minutes
 Write-Host "Events: Total minutes passed: $minutes. Total minutes remaining: $timeremaining. Percentage: $percentage"
 }
}
$output | Export-Csv -Path $Exportfile -NoTypeInformation -Delimiter ";" -Encoding UTF8 -Append
$stopwatch.Stop()
asked May 5, 2020 at 17:26
\$\endgroup\$

1 Answer 1

2
\$\begingroup\$

There's a good article about performance to get your started: Slow Code: Top 5 ways to make your Powershell scripts run faster; at first sight, here are applicable both parts of Problem #2 & #3: Appending stuff:

  • Appending to files: Export-CSV -Append repeats roughly 133 times (= 4E+6/30E+3 i.e. 4E+6 rows written in chunks of 30E+3 rows each) which could considerably worsen performance for big output file;
  • Appending to arrays: (heart of the matter) when the += operator is used, it's actually destroying the array and creating a new one (source).

To eliminate appending to arrays, you can use .NET array list object as follows (merely a hint): replace

  • $output = @() with $output = [System.Collections.ArrayList]::new(), and
  • $output += $CSVLine with [void]$output.Add($CSVLine).

Note: do not compute one-purpose variable $CSVLine at all; instead, use

[void]$output.Add(
 [pscustomobject]@{
 Key = $json.Register.Key
 CompanyID = $json.id
 Eventtype = $item.type
 Eventdate = $item.date
 Eventdescription = $item.description
 })

Read the remark about the generic List<T> class and Performance Considerations as well:

In deciding whether to use the List<T> or ArrayList class, both of which have similar functionality, remember that the List<T> class performs better in most cases and is type safe. If a reference type is used for type T of the List<T> class, the behavior of the two classes is identical.

answered May 16, 2020 at 15:47
\$\endgroup\$

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.