2
\$\begingroup\$

I'm a bit new to PowerShell (version 5) and was wondering if there are any improvements I could make to the following script. Any suggestions for style, code, etc. are all welcome.

The script creates a csv of file attributes and extended properties recursively from a directory parameter. I'm going to then load the csv files into SQL Server.

I'm not sure if there is a more efficient way of filtering the files by last modified date. I need to run it against 30 million files in batches of 500,000.

I've included an example of the UNC path that I use to produce the results file name in the $searchPath variable.

I'm currently running this script in a .bat file. Is there a better way of running these scripts concurrently?

PowerShell Script

$searchPath = '\\server\ca$\los angeles\documents\*.*'; #An example of the path format used as a parameter for $args[0]
$resultsFileName = '';
$startDate = '01-Jul-2018';
$endDate = '31-Jul-2019';
$shell = New-Object -COMObject Shell.Application;
#set date defaults
$dateTimeFormat = 'dd-MMM-yyyy HH:mm:ss.fff';
$executeStart = Get-Date;
$executeStart = $executeStart.ToString($dateTimeFormat);
Write-Host 'Execute Start:' $executeStart;
#get the parent folder from the search path
Write-Host 'Search Path:' $searchPath;
$folder = ($searchPath).Substring(1, ($searchPath).Length - 4);
$parent = Split-Path $folder;
#if the path is not in the expected format dont set the results file name
 if($parent.Length -gt 2)
{
 #get the state name
 $state = Split-Path $parent;
 $state = Split-Path $state -Leaf;
 $state = $state -replace '\$','';
 $state = $state.ToLower();
 #get the office name
 $office = Split-Path $parent -Leaf;
 $office = $office -replace '\W','-';
 $office = $office.ToLower(); 
 $resultsFileName = $state + '_' + $office;
};
#format the result file name and path
$resultFileTimestamp = Get-Date -format 'yyyyMMdd_HHmmss.fff';
if($resultsFileName -eq '') {$resultsFileName = 'results_' + $resultsFileName + '_' + $resultFileTimestamp};
$resultsFile = "C:\Temp\Results\$resultsFileName.csv";
Write-Host 'Results File:' $resultsFile;
 $linenumber = 1;
 #get the file attributes from the recursively from the search path
 Get-Childitem -recurse -Path $searchPath | ? {($_.LastWriteTime -gt $startDate -AND $_.LastWriteTime -lt $endDate) -OR ($_.CreationTime -gt $startDate -AND $_.CreationTime -lt $endDate)} | ForEach-Object { 
 $fullName = $_.FullName;
 $folder = Split-Path $fullName;
 $file = Split-Path $fullName -Leaf;
 $shell = New-Object -COMObject Shell.Application;
 $shellfolder = $shell.Namespace($folder);
 $shellfile = $shellfolder.ParseName($file);
 #loop through the extended properties looking for the columns we want
 for ($a ; $a -le 325; $a++)
 { 
 if($shellfolder.getDetailsOf($File, $a))
 {
 $keyValue = $shellfolder.GetDetailsOf($null, $a)
 switch ( $keyValue )
 { 
 'Attributes' { $Attributes = $shellfolder.GetDetailsOf($shellfile, $a) }
 'Title' { $Title = $shellfolder.GetDetailsOf($shellfile, $a) }
 'Authors' { $Authors = $shellfolder.GetDetailsOf($shellfile, $a) }
 'Last printed' { $LastPrinted = $shellfolder.GetDetailsOf($shellfile, $a) }
 'Date last saved' { $DateLastSaved = $shellfolder.GetDetailsOf($shellfile, $a) }
 'Pages' { $Pages = $shellfolder.GetDetailsOf($shellfile, $a) }
 'Word count' { $WordCount = $shellfolder.GetDetailsOf($shellfile, $a) }
 'Total editing time' { $TotalEditingTime = $shellfolder.GetDetailsOf($shellfile, $a) }
 'File count' { $FileCount = $shellfolder.GetDetailsOf($shellfile, $a) }
 }
 } 
 } 
 $a=0;
 #format extended properties
 $LastPrinted = $LastPrinted -replace '[^\p{L}\p{Nd}/(/}/_/:/ ]', ''; #replace non date characters
 if($LastPrinted -ne '') {$LastPrinted = [datetime]::parseexact($LastPrinted, 'd/MM/yyyy h:mm tt', $null).ToString($dateTimeFormat) } else {$LastPrinted = 'NULL'};
 $DateLastSaved = $DateLastSaved -replace '[^\p{L}\p{Nd}/(/}/_/:/ ]', ''; #replace non date characters
 if($DateLastSaved -ne '') {$DateLastSaved = [datetime]::parseexact($DateLastSaved, 'd/MM/yyyy h:mm tt', $null).ToString($dateTimeFormat) } else {$DateLastSaved = 'NULL'};
 $Title = $Title.replace("`n","").replace("`r",""); #remove carriage return line feed from string
 $Authors = $Authors.replace("`n","").replace("`r",""); #remove carriage return line feed from string
 #show the user what file number the script is on
 $currentFileDateTime = Get-Date -format $dateTimeFormat;
 Write-Host $linenumber, $currentFileDateTime, $fullName;
 #format the output of the csv file
 Get-Content $fullName | Measure-Object -Character -Word | `
 Select-Object -ExcludeProperty Property `
 @{ Name = 'LineNumber'; Expression={$linenumber}} `
 , @{ Name = 'ExtractTime'; Expression={ Get-Date -format $dateTimeFormat }} `
 , @{ Name = 'FullName'; Expression={ $fullName }} `
 , @{ Name = 'FilePath'; Expression={ $folder }} `
 , @{ Name = 'FileName'; Expression={ $file }} `
 , @{ Name = 'FileSizeKb'; Expression={ (Get-Item $fullName).length / 1024 }} `
 , @{ Name = 'CreationTime'; Expression={(Get-ChildItem $fullName).CreationTime.ToString($dateTimeFormat) }} `
 , @{ Name = 'LastWriteTime'; Expression={(Get-ItemProperty $fullName).LastWriteTime.ToString($dateTimeFormat) }} `
 , @{ Name = 'Attributes'; Expression={ $Attributes.ToString() }} `
 , @{ Name = 'Title'; Expression={ $Title.ToString() }} `
 , @{ Name = 'Authors'; Expression={ $Authors.ToString() }} `
 , @{ Name = 'LastPrinted'; Expression={ $LastPrinted.ToString() }} `
 , @{ Name = 'LastSaved'; Expression={ $DateLastSaved.ToString() }} `
 , @{ Name = 'PageCount'; Expression={ $Pages.ToString() }} `
 , @{ Name = 'WordCount'; Expression={ $WordCount.ToString() }} `
 , Characters `
 , @{ Name = 'TotalEditingTime'; Expression={ $TotalEditingTime.ToString() }} `
 , @{ Name = 'FileCount'; Expression={ $FileCount.ToString() }}; `
 $linenumber ++ `
 } | Export-Csv -NoTypeInformation -Path $resultsFile;
$executeEnd = Get-Date;
$executeEnd = $executeEnd.ToString($dateTimeFormat);
Write-Host 'Execute End:' $executeEnd;

Example of CSV file

screenshot

asked Jun 17, 2019 at 12:33
\$\endgroup\$
2
  • 1
    \$\begingroup\$ What is the purpose of the script? \$\endgroup\$ Commented Jun 17, 2019 at 14:24
  • \$\begingroup\$ To create a CSV of file attributes from multiple directories. I'm going to then load the CSV files into SQL server. \$\endgroup\$ Commented Jun 17, 2019 at 15:02

1 Answer 1

2
\$\begingroup\$

As ForEach-Object and Where-Object are slow, you had better to use the foreach statement instead of them when working with large data.
Also, because it is useless to get all extended properties, narrow down the property numbers to be obtained in advance.
Although Select-Object is useful, it will be slower if you use "calculated property", so it is better to create a new object directly using [pscustomobject]@{}.

# settings
$startDate = Get-Date "01-Jul-2018"
$endDate = Get-Date "31-Mar-2029"
$resultDir = mkdir "E:\tets\Results" -Force
$dateTimeFormat = "dd-MMM-yyyy HH:mm:ss.fff"
$pathList = @(
 "\\server\ca$\los angeles\documents"
 "\\server\ca$\san diego\documents"
 "\\server\ca$\san francisco\documents"
)
$exAttr = @{
 Title = 21
 Authors = 20
 LastPrinted = 200
 LastSaved = 154
 Pages = 157
 WordCount = 160
 TotalEditingTime = 159
 FileCount = 163
}
# start
"Execute Start: {0}" -f (Get-Date -Format $dateTimeFormat) | Write-Host
$shell = New-Object -Com Shell.Application
foreach ($path in $pathList) {
 # create a result file.
 $targetDir = Get-Item -LiteralPath $path
 $resultFileName = "{0}_{1}_{2}.csv" -f @(
 $targetDir.Parent.Parent.Name.TrimEnd("$").ToLower()
 $targetDir.Parent.Name.ToLower() -replace "\W","-"
 Get-Date -Format yyyyMMdd_HHmmss.fff
 )
 $resultsFile = New-Item -Path $resultDir -Name $resultFileName -Force
 $lineNumber = 1
 # create customized file information objects.
 $list = foreach ($file in Get-Childitem $targetDir -File -Recurse) {
 if ($file.LastWriteTime -le $startDate -or $file.LastWriteTime -ge $endDate) { continue }
 $exData = @{}
 # get extended properties of office file.
 if ($file.Extension -in ".docx",".xlsx") {
 $dirObj = $shell.NameSpace($file.DirectoryName)
 $itemObj = $dirObj.ParseName($file.Name)
 $exAttr.GetEnumerator() | & { process { $exData[$_.Key] = $dirObj.GetDetailsOf($itemObj, $_.Value) -replace "\p{Cf}" } }
 [void][Runtime.InteropServices.Marshal]::ReleaseComObject($itemObj)
 [void][Runtime.InteropServices.Marshal]::ReleaseComObject($dirObj)
 }
 [pscustomobject]@{
 LineNumber = $lineNumber++
 ExtractTime = Get-Date -format $dateTimeFormat
 FullName = $file.FullName
 FilePath = $file.DirectoryName
 FileName = $file.Name
 FileSize = $file.Length / 1KB
 CreationTime = $file.CreationTime.ToString($dateTimeFormat)
 LastWriteTime = $file.LastWriteTime.ToString($dateTimeFormat)
 Attributes = -join ($file.Attributes.Split(",") | & { process { $_.TrimStart()[0] } })
 Title = $exData.Title
 Authors = $exData.Authors
 LastPrinted = if ($exData.LastPrinted) { Get-Date $exData.LastPrinted -Format $dateTimeFormat } else { $null }
 LastSaved = if ($exData.LastSaved) { Get-Date $exData.LastSaved -Format $dateTimeFormat } else { $null }
 PageCount = $exData.Pages
 WordCount = $exData.WordCount
 TotalEditingTime = $exData.TotalEditingTime
 FileCount = $exData.FileCount
 }
 }
 # output to csv file.
 $list | Export-Csv $resultsFile -NoTypeInformation
 $resultsFile
}
[void][Runtime.InteropServices.Marshal]::ReleaseComObject($shell)
"`nExecute End: {0}`n" -f (Get-Date -Format $dateTimeFormat) | Write-Host

If you want to run in parallel, there is a way to use Start-Job.

$scriptBlock = {
 param(
 $path,
 $startDate,
 $endDate,
 $resultDir,
 $dateTimeFormat
 )
 $exAttr = @{
 ...
 }
 $shell = New-Object -Com Shell.Application
 # create a result file.
 $targetDir = Get-Item -LiteralPath $path
 ...
 ...
}
$pathList | foreach { $i = 0 } { Start-Job -Name $_ -ScriptBlock $scriptBlock -ArgumentList $_,$startDate,$endDate,$resultDir,$dateTimeFormat }

If file retrieval and filtering is a bottleneck, you can write partially in C#.

Add-Type -TypeDefinition @"
using System;
using System.IO;
using System.Collections.Generic;
public static class MyFileIO
{
 public static List<FileInfo> GetFiles(DirectoryInfo dir, DateTime startDate, DateTime endDate)
 {
 var files = new List<FileInfo>();
 var queue = new Queue<DirectoryInfo>();
 queue.Enqueue(dir);
 while (queue.Count > 0)
 {
 dir = queue.Dequeue();
 foreach (var subDir in dir.GetDirectories()) queue.Enqueue(subDir);
 foreach(var file in dir.GetFiles())
 if(file.LastWriteTime > startDate && file.LastWriteTime < endDate) files.Add(file);
 }
 return files;
 }
}
"@
foreach ($file in [MyFileIO]::GetFiles($targetDir, $startDate, $endDate)) {
 ...
}
answered Jun 23, 2019 at 23:24
\$\endgroup\$
0

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.