I have a very large JSON Lines File with 4.000.000 Rows, and I need to convert several events from every row. The resulted CSV File contains 15.000.000 rows. How can I optimize this script?
I'm using Powershell core 7 and it takes around 50 hours to complete the conversion.
My Powershell script:
$stopwatch = [system.diagnostics.stopwatch]::StartNew()
$totalrows = 4000000
$encoding = [System.Text.Encoding]::UTF8
$i = 0
$ig = 0
$output = @()
$Importfile = "C:\file.jsonl"
$Exportfile = "C:\file.csv"
if (test-path $Exportfile) {
Remove-Item -path $Exportfile
}
foreach ($line in [System.IO.File]::ReadLines($Importfile, $encoding)) {
$json = $line | ConvertFrom-Json
foreach ($item in $json.events.items) {
$CSVLine = [pscustomobject]@{
Key = $json.Register.Key
CompanyID = $json.id
Eventtype = $item.type
Eventdate = $item.date
Eventdescription = $item.description
}
$output += $CSVLine
}
$i++
$ig++
if ($i -ge 30000) {
$output | Export-Csv -Path $Exportfile -NoTypeInformation -Delimiter ";" -Encoding UTF8 -Append
$i = 0
$output = @()
$minutes = $stopwatch.elapsed.TotalMinutes
$percentage = $ig / $totalrows * 100
$totalestimatedtime = $minutes * (100/$percentage)
$timeremaining = $totalestimatedtime - $minutes
Write-Host "Events: Total minutes passed: $minutes. Total minutes remaining: $timeremaining. Percentage: $percentage"
}
}
$output | Export-Csv -Path $Exportfile -NoTypeInformation -Delimiter ";" -Encoding UTF8 -Append
$stopwatch.Stop()
1 Answer 1
There's a good article about performance to get your started: Slow Code: Top 5 ways to make your Powershell scripts run faster; at first sight, here are applicable both parts of Problem #2 & #3: Appending stuff:
- Appending to files:
Export-CSV -Append
repeats roughly 133 times (=4E+6/30E+3
i.e.4E+6
rows written in chunks of30E+3
rows each) which could considerably worsen performance for big output file; - Appending to arrays: (heart of the matter) when the
+=
operator is used, it's actually destroying the array and creating a new one (source).
To eliminate appending to arrays, you can use .NET array list object as follows (merely a hint): replace
$output = @()
with$output = [System.Collections.ArrayList]::new()
, and$output += $CSVLine
with[void]$output.Add($CSVLine)
.
Note: do not compute one-purpose variable $CSVLine
at all; instead, use
[void]$output.Add(
[pscustomobject]@{
Key = $json.Register.Key
CompanyID = $json.id
Eventtype = $item.type
Eventdate = $item.date
Eventdescription = $item.description
})
Read the remark about the generic List<T>
class and Performance Considerations as well:
In deciding whether to use the
List<T>
orArrayList
class, both of which have similar functionality, remember that theList<T>
class performs better in most cases and is type safe. If a reference type is used for typeT
of theList<T>
class, the behavior of the two classes is identical.