I'm a bit new to PowerShell (version 5) and was wondering if there are any improvements I could make to the following script. Any suggestions for style, code, etc. are all welcome.
The script creates a csv of file attributes and extended properties recursively from a directory parameter. I'm going to then load the csv files into SQL Server.
I'm not sure if there is a more efficient way of filtering the files by last modified date. I need to run it against 30 million files in batches of 500,000.
I've included an example of the UNC path that I use to produce the results file name in the $searchPath
variable.
I'm currently running this script in a .bat
file. Is there a better way of running these scripts concurrently?
PowerShell Script
$searchPath = '\\server\ca$\los angeles\documents\*.*'; #An example of the path format used as a parameter for $args[0]
$resultsFileName = '';
$startDate = '01-Jul-2018';
$endDate = '31-Jul-2019';
$shell = New-Object -COMObject Shell.Application;
#set date defaults
$dateTimeFormat = 'dd-MMM-yyyy HH:mm:ss.fff';
$executeStart = Get-Date;
$executeStart = $executeStart.ToString($dateTimeFormat);
Write-Host 'Execute Start:' $executeStart;
#get the parent folder from the search path
Write-Host 'Search Path:' $searchPath;
$folder = ($searchPath).Substring(1, ($searchPath).Length - 4);
$parent = Split-Path $folder;
#if the path is not in the expected format dont set the results file name
if($parent.Length -gt 2)
{
#get the state name
$state = Split-Path $parent;
$state = Split-Path $state -Leaf;
$state = $state -replace '\$','';
$state = $state.ToLower();
#get the office name
$office = Split-Path $parent -Leaf;
$office = $office -replace '\W','-';
$office = $office.ToLower();
$resultsFileName = $state + '_' + $office;
};
#format the result file name and path
$resultFileTimestamp = Get-Date -format 'yyyyMMdd_HHmmss.fff';
if($resultsFileName -eq '') {$resultsFileName = 'results_' + $resultsFileName + '_' + $resultFileTimestamp};
$resultsFile = "C:\Temp\Results\$resultsFileName.csv";
Write-Host 'Results File:' $resultsFile;
$linenumber = 1;
#get the file attributes from the recursively from the search path
Get-Childitem -recurse -Path $searchPath | ? {($_.LastWriteTime -gt $startDate -AND $_.LastWriteTime -lt $endDate) -OR ($_.CreationTime -gt $startDate -AND $_.CreationTime -lt $endDate)} | ForEach-Object {
$fullName = $_.FullName;
$folder = Split-Path $fullName;
$file = Split-Path $fullName -Leaf;
$shell = New-Object -COMObject Shell.Application;
$shellfolder = $shell.Namespace($folder);
$shellfile = $shellfolder.ParseName($file);
#loop through the extended properties looking for the columns we want
for ($a ; $a -le 325; $a++)
{
if($shellfolder.getDetailsOf($File, $a))
{
$keyValue = $shellfolder.GetDetailsOf($null, $a)
switch ( $keyValue )
{
'Attributes' { $Attributes = $shellfolder.GetDetailsOf($shellfile, $a) }
'Title' { $Title = $shellfolder.GetDetailsOf($shellfile, $a) }
'Authors' { $Authors = $shellfolder.GetDetailsOf($shellfile, $a) }
'Last printed' { $LastPrinted = $shellfolder.GetDetailsOf($shellfile, $a) }
'Date last saved' { $DateLastSaved = $shellfolder.GetDetailsOf($shellfile, $a) }
'Pages' { $Pages = $shellfolder.GetDetailsOf($shellfile, $a) }
'Word count' { $WordCount = $shellfolder.GetDetailsOf($shellfile, $a) }
'Total editing time' { $TotalEditingTime = $shellfolder.GetDetailsOf($shellfile, $a) }
'File count' { $FileCount = $shellfolder.GetDetailsOf($shellfile, $a) }
}
}
}
$a=0;
#format extended properties
$LastPrinted = $LastPrinted -replace '[^\p{L}\p{Nd}/(/}/_/:/ ]', ''; #replace non date characters
if($LastPrinted -ne '') {$LastPrinted = [datetime]::parseexact($LastPrinted, 'd/MM/yyyy h:mm tt', $null).ToString($dateTimeFormat) } else {$LastPrinted = 'NULL'};
$DateLastSaved = $DateLastSaved -replace '[^\p{L}\p{Nd}/(/}/_/:/ ]', ''; #replace non date characters
if($DateLastSaved -ne '') {$DateLastSaved = [datetime]::parseexact($DateLastSaved, 'd/MM/yyyy h:mm tt', $null).ToString($dateTimeFormat) } else {$DateLastSaved = 'NULL'};
$Title = $Title.replace("`n","").replace("`r",""); #remove carriage return line feed from string
$Authors = $Authors.replace("`n","").replace("`r",""); #remove carriage return line feed from string
#show the user what file number the script is on
$currentFileDateTime = Get-Date -format $dateTimeFormat;
Write-Host $linenumber, $currentFileDateTime, $fullName;
#format the output of the csv file
Get-Content $fullName | Measure-Object -Character -Word | `
Select-Object -ExcludeProperty Property `
@{ Name = 'LineNumber'; Expression={$linenumber}} `
, @{ Name = 'ExtractTime'; Expression={ Get-Date -format $dateTimeFormat }} `
, @{ Name = 'FullName'; Expression={ $fullName }} `
, @{ Name = 'FilePath'; Expression={ $folder }} `
, @{ Name = 'FileName'; Expression={ $file }} `
, @{ Name = 'FileSizeKb'; Expression={ (Get-Item $fullName).length / 1024 }} `
, @{ Name = 'CreationTime'; Expression={(Get-ChildItem $fullName).CreationTime.ToString($dateTimeFormat) }} `
, @{ Name = 'LastWriteTime'; Expression={(Get-ItemProperty $fullName).LastWriteTime.ToString($dateTimeFormat) }} `
, @{ Name = 'Attributes'; Expression={ $Attributes.ToString() }} `
, @{ Name = 'Title'; Expression={ $Title.ToString() }} `
, @{ Name = 'Authors'; Expression={ $Authors.ToString() }} `
, @{ Name = 'LastPrinted'; Expression={ $LastPrinted.ToString() }} `
, @{ Name = 'LastSaved'; Expression={ $DateLastSaved.ToString() }} `
, @{ Name = 'PageCount'; Expression={ $Pages.ToString() }} `
, @{ Name = 'WordCount'; Expression={ $WordCount.ToString() }} `
, Characters `
, @{ Name = 'TotalEditingTime'; Expression={ $TotalEditingTime.ToString() }} `
, @{ Name = 'FileCount'; Expression={ $FileCount.ToString() }}; `
$linenumber ++ `
} | Export-Csv -NoTypeInformation -Path $resultsFile;
$executeEnd = Get-Date;
$executeEnd = $executeEnd.ToString($dateTimeFormat);
Write-Host 'Execute End:' $executeEnd;
Example of CSV file
-
1\$\begingroup\$ What is the purpose of the script? \$\endgroup\$Dangph– Dangph2019年06月17日 14:24:21 +00:00Commented Jun 17, 2019 at 14:24
-
\$\begingroup\$ To create a CSV of file attributes from multiple directories. I'm going to then load the CSV files into SQL server. \$\endgroup\$aduguid– aduguid2019年06月17日 15:02:25 +00:00Commented Jun 17, 2019 at 15:02
1 Answer 1
As ForEach-Object
and Where-Object
are slow, you had better to use the foreach
statement instead of them when working with large data.
Also, because it is useless to get all extended properties, narrow down the property numbers to be obtained in advance.
Although Select-Object
is useful, it will be slower if you use "calculated property", so it is better to create a new object directly using [pscustomobject]@{}
.
# settings
$startDate = Get-Date "01-Jul-2018"
$endDate = Get-Date "31-Mar-2029"
$resultDir = mkdir "E:\tets\Results" -Force
$dateTimeFormat = "dd-MMM-yyyy HH:mm:ss.fff"
$pathList = @(
"\\server\ca$\los angeles\documents"
"\\server\ca$\san diego\documents"
"\\server\ca$\san francisco\documents"
)
$exAttr = @{
Title = 21
Authors = 20
LastPrinted = 200
LastSaved = 154
Pages = 157
WordCount = 160
TotalEditingTime = 159
FileCount = 163
}
# start
"Execute Start: {0}" -f (Get-Date -Format $dateTimeFormat) | Write-Host
$shell = New-Object -Com Shell.Application
foreach ($path in $pathList) {
# create a result file.
$targetDir = Get-Item -LiteralPath $path
$resultFileName = "{0}_{1}_{2}.csv" -f @(
$targetDir.Parent.Parent.Name.TrimEnd("$").ToLower()
$targetDir.Parent.Name.ToLower() -replace "\W","-"
Get-Date -Format yyyyMMdd_HHmmss.fff
)
$resultsFile = New-Item -Path $resultDir -Name $resultFileName -Force
$lineNumber = 1
# create customized file information objects.
$list = foreach ($file in Get-Childitem $targetDir -File -Recurse) {
if ($file.LastWriteTime -le $startDate -or $file.LastWriteTime -ge $endDate) { continue }
$exData = @{}
# get extended properties of office file.
if ($file.Extension -in ".docx",".xlsx") {
$dirObj = $shell.NameSpace($file.DirectoryName)
$itemObj = $dirObj.ParseName($file.Name)
$exAttr.GetEnumerator() | & { process { $exData[$_.Key] = $dirObj.GetDetailsOf($itemObj, $_.Value) -replace "\p{Cf}" } }
[void][Runtime.InteropServices.Marshal]::ReleaseComObject($itemObj)
[void][Runtime.InteropServices.Marshal]::ReleaseComObject($dirObj)
}
[pscustomobject]@{
LineNumber = $lineNumber++
ExtractTime = Get-Date -format $dateTimeFormat
FullName = $file.FullName
FilePath = $file.DirectoryName
FileName = $file.Name
FileSize = $file.Length / 1KB
CreationTime = $file.CreationTime.ToString($dateTimeFormat)
LastWriteTime = $file.LastWriteTime.ToString($dateTimeFormat)
Attributes = -join ($file.Attributes.Split(",") | & { process { $_.TrimStart()[0] } })
Title = $exData.Title
Authors = $exData.Authors
LastPrinted = if ($exData.LastPrinted) { Get-Date $exData.LastPrinted -Format $dateTimeFormat } else { $null }
LastSaved = if ($exData.LastSaved) { Get-Date $exData.LastSaved -Format $dateTimeFormat } else { $null }
PageCount = $exData.Pages
WordCount = $exData.WordCount
TotalEditingTime = $exData.TotalEditingTime
FileCount = $exData.FileCount
}
}
# output to csv file.
$list | Export-Csv $resultsFile -NoTypeInformation
$resultsFile
}
[void][Runtime.InteropServices.Marshal]::ReleaseComObject($shell)
"`nExecute End: {0}`n" -f (Get-Date -Format $dateTimeFormat) | Write-Host
If you want to run in parallel, there is a way to use Start-Job
.
$scriptBlock = {
param(
$path,
$startDate,
$endDate,
$resultDir,
$dateTimeFormat
)
$exAttr = @{
...
}
$shell = New-Object -Com Shell.Application
# create a result file.
$targetDir = Get-Item -LiteralPath $path
...
...
}
$pathList | foreach { $i = 0 } { Start-Job -Name $_ -ScriptBlock $scriptBlock -ArgumentList $_,$startDate,$endDate,$resultDir,$dateTimeFormat }
If file retrieval and filtering is a bottleneck, you can write partially in C#.
Add-Type -TypeDefinition @"
using System;
using System.IO;
using System.Collections.Generic;
public static class MyFileIO
{
public static List<FileInfo> GetFiles(DirectoryInfo dir, DateTime startDate, DateTime endDate)
{
var files = new List<FileInfo>();
var queue = new Queue<DirectoryInfo>();
queue.Enqueue(dir);
while (queue.Count > 0)
{
dir = queue.Dequeue();
foreach (var subDir in dir.GetDirectories()) queue.Enqueue(subDir);
foreach(var file in dir.GetFiles())
if(file.LastWriteTime > startDate && file.LastWriteTime < endDate) files.Add(file);
}
return files;
}
}
"@
foreach ($file in [MyFileIO]::GetFiles($targetDir, $startDate, $endDate)) {
...
}