Microsoft Excel cells basically have two values: the formatted value that is shown to the user, and the actual value of the cell. These two values can be extremely different when the cell has certain formatting types such as date, datetime, currency, numbers with commas, etc.
I am writing a script that cleans up database data. It uses find, replace, RegEx, and other functions that act on strings of text. I need to be able to convert data into very specific formats so that my upload tool will accept the data. Therefore I need all of the cells in my Excel spreadsheets to be TEXT. I need the displayed values to match the actual values.
This is surprisingly hard to do in Excel. There is no paste special option that I could find that pastes what you see on your screen as text. "Paste Special -> Values" pastes, predictably, unformatted values. So if you try to "Paste Special -> Values" into cells formatted as TEXT, and they are coming from cells formatted as DATES, you might get something like this:
Old Cell: 4/7/2001
New Cell: 36988
If you paste normally, then the paste goes in as 4/7/2001, but in date format. So if I create a RegEx that looks for 4/7/2001 and changes it to 4/7/01, it will not work, because the underlying value is 36988.
Anyway, I discovered the Cell.Text
property and wrote some code that converts all cells in a spreadsheet from formatted to text. But the code has two major problems:
1) It does not work on cells with more than 255 characters. This is because Cell.Text
literally displays what the cell is actually showing to the user, and if the column is too narrow, it will display pound signs (########
), and columns have a maximum width of 255 characters.
2) It is extremely slow. In my 1.5 million cell test sheet, code execution took 394 seconds.
I'm looking for ideas to optimize the code's speed, and to make the code work on all display value lengths.
Function ConvertAllCellsToText()
' Convert all cells from non-text to text
' This is helpful in a variety of ways
' Dates and datetimes can now be manipulated by replace, RegEx, etc., since their value is now 4/7/2001 instead of 36988
' Phone numbers and zip codes won't get leading zeros deleted if we try to add those
StartNewTask "Converting all cells to text"
' first draft of the code, does not work as intended, dates get scrambled
'Cells.Select
'Selection.Copy
'Selection.PasteSpecial Paste:=xlPasteValues, Operation:=xlNone, SkipBlanks _
:=False, Transpose:=False
'Application.CutCopyMode = False
Dim MyRange As Range
Set MyRange = ActiveSheet.UsedRange
' prevent Cell.Text from being ########
MyRange.ColumnWidth = 255
Dim i As Long
Dim CellText As String
For Each Cell In MyRange
If Cell.Text <> Cell.Value And Len(Cell.Value) <= 255 Then
CellText = Cell.Text
Cell.NumberFormat = "@"
Cell.Value = CellText
End If
i = i + 1
If i Mod 2500 = 0 Then
RefreshScreen
End If
Next
MyRange.ColumnWidth = 20
' set all cells to Text format
Cells.Select
Selection.NumberFormat = "@"
End Function
-
\$\begingroup\$ Do you control the code that produces the excel sheet in the first place? \$\endgroup\$Taemyr– Taemyr2016年03月28日 08:26:41 +00:00Commented Mar 28, 2016 at 8:26
-
\$\begingroup\$ @Taemyr No. The incoming data is either exported from various online registration websites, or it is a spreadsheet created by a human. My script formats it so that an upload tool can then upload it into our website. Our upload tool is strict, and one error will abort the upload process. Also, the tool is not verbose, and often won't pinpoint the cause of the error. Finally, the tool uses a queue, so if the import fails you will not know for a while. That is why I decided to tackle the problem in Excel. If the data is high enough quality, it will import smoothly. \$\endgroup\$RedDragonWebDesign– RedDragonWebDesign2016年03月28日 08:59:06 +00:00Commented Mar 28, 2016 at 8:59
2 Answers 2
This code has basically the same performance issue as your last question, and the performance solution is pretty much the same. Other things to note:
- This is more of a nag than anything, but you should really remove dead code that is commented out.
- The variable
Cell
is never declared. You should putOption Explicit
at the top of all of your code modules. See Why should I dimension my variables in VBA really? over on SO. - You don't return a value, so this should really be a
Sub
, not aFunction
. If you don't want it to appear in the macro list, declare it asPrivate
. I'd get into the habit of always explicitly declaring your scope. - You have a bug. You explain pretty clearly in your question that
the
.Value
of a cell can differ dramatically from the.Text
of a cell depending on the formatting. However, when you test the length of the cell values, you're checking against the Value that you already point out is not dependant on the formatting. You need to test the Text against your width. - You have another possible bug (not addressed in the code below) -
for proportional width fonts, the value referred to by
ColumnWidth
is based on the width of the character '0' (see the documentation). If you happen to be using a font where any character is wider than '0' (for example 'X' in my browser's default font), you can't reliably use theColumnWidth
as a proxy for the length of a string that it can display. - 255 is the magic number, yes it is. Since you're testing the maximum width of a text rendered cell to this value, and using it in multiple places in your code to provide formatting information, make it a constant and give it an appropriate name.
- I'm assuming that the call to
RefreshScreen
is to provide some sort of indication to the user that something is actually happening instead of a hung script. I'd remove it completely. If you callDoEvents
Excel will have a chance to update the screen buffer. This is not only more performant (it avoids a function call), but it will also be clearer to the reader what is going on. Even better would be to simply omit it, as redrawing the screen is what is killing your in performance in the first place (see below). - Personally, I would cache the column widths and reset them when you are done instead of using the arbitrary value '20'. This upholds the principle of least astonishment.
- You set the
NumberFormat
to text for the entire Worksheet at the end of theFunction
. Setting it for individual cells inside your loop is not only redundant, but forces aScreenUpdate
. This is killing your performance. Basically, what is happening is that you are updating the display once for every cell in theUsedRange
. This is incredibly expensive. TurnScreenUpdating
off instead. - As I mentioned in comments to your previous question, updating cells is more expensive in terms of performance than you might guess. When you are working with large ranges of cells, you should pull them into an array if at all possible, then write the entire array back at once. This lowers your "Excel overhead" almost quadratically.
- You should avoid using the
Cells
collection (it is entirely dependant on what object is active) and theSelection
object (almost every operation can be performed with theRange
you already have a reference to). - Use
With
blocks to your advantage, especially in loops. Every '.' operator that you see in your code causes an object deference call. Wrapping sections where the same object is used repeatedly in aWith
block essentially caches the result of the deference instead of repeatedly performing it. - Similar note with pulling cell values from a
Range
. Accessing the object is much more expensive than storing it. If you have to reference the same value twice (looking at you,If Cell.Text <> Cell.Value And Len(Cell.Value) <= 255 Then
), cache it.
The above notes basically boil down to something more like this:
Const MaxTextLength = 255
Private Sub ConvertAllCellsToText()
StartNewTask "Converting all cells to text"
Application.ScreenUpdating = False
Dim MyRange As Range
Set MyRange = ActiveSheet.UsedRange
Dim cache() As Long
cache = GetColumnWidths(MyRange)
With MyRange
.ColumnWidth = MaxTextLength
Dim Values() As Variant
ReDim Values(.Rows.Count, .Columns.Count)
Dim col As Long
Dim row As Long
For row = 0 To UBound(Values, 1)
For col = 0 To UBound(Values, 2)
Dim temp As String
temp = .Cells(row + 1, col + 1).Text
If Len(temp) <= MaxTextLength Then
Values(row, col) = temp
End If
Next col
Next row
.NumberFormat = "@"
End With
MyRange = Values
SetColumnWidths MyRange, cache
Application.ScreenUpdating = True
End Sub
Private Function GetColumnWidths(Target As Range) As Long()
Dim output() As Long
ReDim output(1 To Target.Columns.Count)
Dim index As Long
For index = 1 To Target.Columns.Count
output(index) = Target.Columns(index).ColumnWidth
Next index
GetColumnWidths = output
End Function
Private Sub SetColumnWidths(Target As Range, widths() As Long)
Dim index As Long
For index = LBound(widths) To UBound(widths)
Target.Columns(index).ColumnWidth = widths(index)
Next index
End Sub
NOTE: I didn't benchmark this, but I'd ballpark about 100 times faster.
EDIT: Almost forgot. If you have a bunch of numbers in your sheet, you'll want to ignore the numbers stored as text errors to get another nominal performance gain. Call it before you turn ScreenUpdating
back on.
-
\$\begingroup\$ Thanks for the code review and the working code! 1) I benchmarked it and it is 3 times faster. Task execution time is down to 126 seconds now. 2) The reason I use RefreshScreen is to call DoEvents so the program doesn't look hung, and also to update a timer label I have on my UserForm. VBA doesn't let the timer run separately like in Visual Basic. 3) This function still doesn't work for values greater than 255 characters. Any ideas? Perhaps finding the Excel function that converts values to text? Text = Function(Value, Formatting)? Then build an array? Thanks again. \$\endgroup\$RedDragonWebDesign– RedDragonWebDesign2016年03月31日 08:26:38 +00:00Commented Mar 31, 2016 at 8:26
-
1\$\begingroup\$ @AdmiralAdama - The underlying issue is that Excel won't let you size a column wider than 255 characters. You could try
.WrapText = True
. Just out of curiosity, was your benchmark with or without theRefreshScreen
call? \$\endgroup\$Comintern– Comintern2016年03月31日 12:46:04 +00:00Commented Mar 31, 2016 at 12:46 -
\$\begingroup\$ 1) The benchmark was without the RefreshScreen call. 2) I did some experimentation and discovered that the column width only messes up Cell.Text if it is a date, because of the ###### thing. Running Cell.Text on a cell formatted General with lots of text in it will return all the text, even if all of it isn't visible to the user. So the 255 trick solves the error, since a date will never be wider than 255. \$\endgroup\$RedDragonWebDesign– RedDragonWebDesign2016年03月31日 14:49:13 +00:00Commented Mar 31, 2016 at 14:49
@Comintern's answer has already covered your current approach quite comprehensively, so I won't add anything there. Instead, I'm going to critique your approach itself.
Speaking as somebody who's entire job is to (ab)use Excel in ways it was not really meant to be used, please take this to heart:
Excel is not a database
However, since you don't have a choice in the matter, here's some advice for you:
Always keep an original copy of the data
Right now, you are taking your data and re-formatting it. If you screw up the formatting, do you have a reliable way to recover the original data? I generally designate one sheet as the "Source" sheet, then any further work always starts with making a copy of that data.
Use DataTypes to your advantage
You have no control over the data you receive. There are an effectively infinite number of formats it could be in. Right now, as soon as some data appears in a format your regex hasn't accounted for, you're screwed.
So, assume nothing, and leverage the power of VBA's Data Type recognition. Take dates for example. You mention DateValues (E.G. 36988
) as a problem. It's the exact opposite. Rather than work around them, work with them.
Pass your sheet values into an Array (has the added advantage that you can now ignore text and column widths and all that mess). Now all your dates are in a consistent, known format (a DateValue). Then, convert them all to a specific string format using the Format()
function E.G.
If IsDate(value) then values(x, y) = Format(value, "dd/mm/yy")
Et voila, now you don't care what format your dates are in when you receive them, because Excel will do the hard work of converting them to a DateValue for you, and you can convert them to a specific output format with a single line.
Don't try to second guess your data. Hundreds of thousands of Developer-Hours went into giving Excel the ability to do the hard work for you. So, let it figure out what kind of data you're receiving, take those values, define formats for the handful of different Data Types and never worry about your source formatting again.
-
\$\begingroup\$ Thanks for your idea! I upvoted your answer and mulled this over for a few days. In the end, I decided I am going to stick with the "convert all to text" strategy, and add any "weird" dates to my RegEx on a case-by-case basis. My reason for this is 1) I don't want to have to deal with situations where the user doesn't choose the right formatting for that cell or column (I'd have to code something pretty complicated to check and fix the formatting of all the cells), and 2) I want to make sure that zip codes and phone numbers don't get their leading zeros chopped off. \$\endgroup\$RedDragonWebDesign– RedDragonWebDesign2016年03月31日 08:30:01 +00:00Commented Mar 31, 2016 at 8:30