5
\$\begingroup\$

Microsoft Excel cells basically have two values: the formatted value that is shown to the user, and the actual value of the cell. These two values can be extremely different when the cell has certain formatting types such as date, datetime, currency, numbers with commas, etc.

I am writing a script that cleans up database data. It uses find, replace, RegEx, and other functions that act on strings of text. I need to be able to convert data into very specific formats so that my upload tool will accept the data. Therefore I need all of the cells in my Excel spreadsheets to be TEXT. I need the displayed values to match the actual values.

This is surprisingly hard to do in Excel. There is no paste special option that I could find that pastes what you see on your screen as text. "Paste Special -> Values" pastes, predictably, unformatted values. So if you try to "Paste Special -> Values" into cells formatted as TEXT, and they are coming from cells formatted as DATES, you might get something like this:

Old Cell: 4/7/2001

New Cell: 36988

If you paste normally, then the paste goes in as 4/7/2001, but in date format. So if I create a RegEx that looks for 4/7/2001 and changes it to 4/7/01, it will not work, because the underlying value is 36988.

Anyway, I discovered the Cell.Text property and wrote some code that converts all cells in a spreadsheet from formatted to text. But the code has two major problems:

1) It does not work on cells with more than 255 characters. This is because Cell.Text literally displays what the cell is actually showing to the user, and if the column is too narrow, it will display pound signs (########), and columns have a maximum width of 255 characters.

2) It is extremely slow. In my 1.5 million cell test sheet, code execution took 394 seconds.

I'm looking for ideas to optimize the code's speed, and to make the code work on all display value lengths.

Function ConvertAllCellsToText()
 ' Convert all cells from non-text to text
 ' This is helpful in a variety of ways
 ' Dates and datetimes can now be manipulated by replace, RegEx, etc., since their value is now 4/7/2001 instead of 36988
 ' Phone numbers and zip codes won't get leading zeros deleted if we try to add those
 StartNewTask "Converting all cells to text"
 ' first draft of the code, does not work as intended, dates get scrambled
 'Cells.Select
 'Selection.Copy
 'Selection.PasteSpecial Paste:=xlPasteValues, Operation:=xlNone, SkipBlanks _
 :=False, Transpose:=False
 'Application.CutCopyMode = False
 Dim MyRange As Range
 Set MyRange = ActiveSheet.UsedRange
 ' prevent Cell.Text from being ########
 MyRange.ColumnWidth = 255
 Dim i As Long
 Dim CellText As String
 For Each Cell In MyRange
 If Cell.Text <> Cell.Value And Len(Cell.Value) <= 255 Then
 CellText = Cell.Text
 Cell.NumberFormat = "@"
 Cell.Value = CellText
 End If
 i = i + 1
 If i Mod 2500 = 0 Then
 RefreshScreen
 End If
 Next
 MyRange.ColumnWidth = 20
 ' set all cells to Text format
 Cells.Select
 Selection.NumberFormat = "@"
End Function
Phrancis
20.5k6 gold badges69 silver badges155 bronze badges
asked Mar 28, 2016 at 2:51
\$\endgroup\$
2
  • \$\begingroup\$ Do you control the code that produces the excel sheet in the first place? \$\endgroup\$ Commented Mar 28, 2016 at 8:26
  • \$\begingroup\$ @Taemyr No. The incoming data is either exported from various online registration websites, or it is a spreadsheet created by a human. My script formats it so that an upload tool can then upload it into our website. Our upload tool is strict, and one error will abort the upload process. Also, the tool is not verbose, and often won't pinpoint the cause of the error. Finally, the tool uses a queue, so if the import fails you will not know for a while. That is why I decided to tackle the problem in Excel. If the data is high enough quality, it will import smoothly. \$\endgroup\$ Commented Mar 28, 2016 at 8:59

2 Answers 2

7
\$\begingroup\$

This code has basically the same performance issue as your last question, and the performance solution is pretty much the same. Other things to note:

  1. This is more of a nag than anything, but you should really remove dead code that is commented out.
  2. The variable Cell is never declared. You should put Option Explicit at the top of all of your code modules. See Why should I dimension my variables in VBA really? over on SO.
  3. You don't return a value, so this should really be a Sub, not a Function. If you don't want it to appear in the macro list, declare it as Private. I'd get into the habit of always explicitly declaring your scope.
  4. You have a bug. You explain pretty clearly in your question that the .Value of a cell can differ dramatically from the .Text of a cell depending on the formatting. However, when you test the length of the cell values, you're checking against the Value that you already point out is not dependant on the formatting. You need to test the Text against your width.
  5. You have another possible bug (not addressed in the code below) - for proportional width fonts, the value referred to by ColumnWidth is based on the width of the character '0' (see the documentation). If you happen to be using a font where any character is wider than '0' (for example 'X' in my browser's default font), you can't reliably use the ColumnWidth as a proxy for the length of a string that it can display.
  6. 255 is the magic number, yes it is. Since you're testing the maximum width of a text rendered cell to this value, and using it in multiple places in your code to provide formatting information, make it a constant and give it an appropriate name.
  7. I'm assuming that the call to RefreshScreen is to provide some sort of indication to the user that something is actually happening instead of a hung script. I'd remove it completely. If you call DoEvents Excel will have a chance to update the screen buffer. This is not only more performant (it avoids a function call), but it will also be clearer to the reader what is going on. Even better would be to simply omit it, as redrawing the screen is what is killing your in performance in the first place (see below).
  8. Personally, I would cache the column widths and reset them when you are done instead of using the arbitrary value '20'. This upholds the principle of least astonishment.
  9. You set the NumberFormat to text for the entire Worksheet at the end of the Function. Setting it for individual cells inside your loop is not only redundant, but forces a ScreenUpdate. This is killing your performance. Basically, what is happening is that you are updating the display once for every cell in the UsedRange. This is incredibly expensive. Turn ScreenUpdating off instead.
  10. As I mentioned in comments to your previous question, updating cells is more expensive in terms of performance than you might guess. When you are working with large ranges of cells, you should pull them into an array if at all possible, then write the entire array back at once. This lowers your "Excel overhead" almost quadratically.
  11. You should avoid using the Cells collection (it is entirely dependant on what object is active) and the Selection object (almost every operation can be performed with the Range you already have a reference to).
  12. Use With blocks to your advantage, especially in loops. Every '.' operator that you see in your code causes an object deference call. Wrapping sections where the same object is used repeatedly in a With block essentially caches the result of the deference instead of repeatedly performing it.
  13. Similar note with pulling cell values from a Range. Accessing the object is much more expensive than storing it. If you have to reference the same value twice (looking at you, If Cell.Text <> Cell.Value And Len(Cell.Value) <= 255 Then), cache it.

The above notes basically boil down to something more like this:

Const MaxTextLength = 255
Private Sub ConvertAllCellsToText()
 StartNewTask "Converting all cells to text"
 Application.ScreenUpdating = False
 Dim MyRange As Range
 Set MyRange = ActiveSheet.UsedRange
 Dim cache() As Long
 cache = GetColumnWidths(MyRange)
 With MyRange
 .ColumnWidth = MaxTextLength
 Dim Values() As Variant
 ReDim Values(.Rows.Count, .Columns.Count)
 Dim col As Long
 Dim row As Long
 For row = 0 To UBound(Values, 1)
 For col = 0 To UBound(Values, 2)
 Dim temp As String
 temp = .Cells(row + 1, col + 1).Text
 If Len(temp) <= MaxTextLength Then
 Values(row, col) = temp
 End If
 Next col
 Next row
 .NumberFormat = "@"
 End With
 MyRange = Values
 SetColumnWidths MyRange, cache
 Application.ScreenUpdating = True
End Sub
Private Function GetColumnWidths(Target As Range) As Long()
 Dim output() As Long
 ReDim output(1 To Target.Columns.Count)
 Dim index As Long
 For index = 1 To Target.Columns.Count
 output(index) = Target.Columns(index).ColumnWidth
 Next index
 GetColumnWidths = output
End Function
Private Sub SetColumnWidths(Target As Range, widths() As Long)
 Dim index As Long
 For index = LBound(widths) To UBound(widths)
 Target.Columns(index).ColumnWidth = widths(index)
 Next index
End Sub

NOTE: I didn't benchmark this, but I'd ballpark about 100 times faster.

EDIT: Almost forgot. If you have a bunch of numbers in your sheet, you'll want to ignore the numbers stored as text errors to get another nominal performance gain. Call it before you turn ScreenUpdating back on.

answered Mar 28, 2016 at 4:18
\$\endgroup\$
3
  • \$\begingroup\$ Thanks for the code review and the working code! 1) I benchmarked it and it is 3 times faster. Task execution time is down to 126 seconds now. 2) The reason I use RefreshScreen is to call DoEvents so the program doesn't look hung, and also to update a timer label I have on my UserForm. VBA doesn't let the timer run separately like in Visual Basic. 3) This function still doesn't work for values greater than 255 characters. Any ideas? Perhaps finding the Excel function that converts values to text? Text = Function(Value, Formatting)? Then build an array? Thanks again. \$\endgroup\$ Commented Mar 31, 2016 at 8:26
  • 1
    \$\begingroup\$ @AdmiralAdama - The underlying issue is that Excel won't let you size a column wider than 255 characters. You could try .WrapText = True. Just out of curiosity, was your benchmark with or without the RefreshScreen call? \$\endgroup\$ Commented Mar 31, 2016 at 12:46
  • \$\begingroup\$ 1) The benchmark was without the RefreshScreen call. 2) I did some experimentation and discovered that the column width only messes up Cell.Text if it is a date, because of the ###### thing. Running Cell.Text on a cell formatted General with lots of text in it will return all the text, even if all of it isn't visible to the user. So the 255 trick solves the error, since a date will never be wider than 255. \$\endgroup\$ Commented Mar 31, 2016 at 14:49
2
\$\begingroup\$

@Comintern's answer has already covered your current approach quite comprehensively, so I won't add anything there. Instead, I'm going to critique your approach itself.

Speaking as somebody who's entire job is to (ab)use Excel in ways it was not really meant to be used, please take this to heart:


Excel is not a database

However, since you don't have a choice in the matter, here's some advice for you:


Always keep an original copy of the data

Right now, you are taking your data and re-formatting it. If you screw up the formatting, do you have a reliable way to recover the original data? I generally designate one sheet as the "Source" sheet, then any further work always starts with making a copy of that data.


Use DataTypes to your advantage

You have no control over the data you receive. There are an effectively infinite number of formats it could be in. Right now, as soon as some data appears in a format your regex hasn't accounted for, you're screwed.

So, assume nothing, and leverage the power of VBA's Data Type recognition. Take dates for example. You mention DateValues (E.G. 36988) as a problem. It's the exact opposite. Rather than work around them, work with them.

Pass your sheet values into an Array (has the added advantage that you can now ignore text and column widths and all that mess). Now all your dates are in a consistent, known format (a DateValue). Then, convert them all to a specific string format using the Format() function E.G.

If IsDate(value) then values(x, y) = Format(value, "dd/mm/yy")

Et voila, now you don't care what format your dates are in when you receive them, because Excel will do the hard work of converting them to a DateValue for you, and you can convert them to a specific output format with a single line.

Don't try to second guess your data. Hundreds of thousands of Developer-Hours went into giving Excel the ability to do the hard work for you. So, let it figure out what kind of data you're receiving, take those values, define formats for the handful of different Data Types and never worry about your source formatting again.

answered Mar 29, 2016 at 8:40
\$\endgroup\$
1
  • \$\begingroup\$ Thanks for your idea! I upvoted your answer and mulled this over for a few days. In the end, I decided I am going to stick with the "convert all to text" strategy, and add any "weird" dates to my RegEx on a case-by-case basis. My reason for this is 1) I don't want to have to deal with situations where the user doesn't choose the right formatting for that cell or column (I'd have to code something pretty complicated to check and fix the formatting of all the cells), and 2) I want to make sure that zip codes and phone numbers don't get their leading zeros chopped off. \$\endgroup\$ Commented Mar 31, 2016 at 8:30

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.