Using Java, I have a class which retrieves a webpage as a byte array. I then need to strip out some content if it exists. (The application monitors web pages for changes, but needs to remove session Ids from the html which are created by php, and would mean changes were detected each visit to the page).
Some of the resulting byte arrays could be 10s of 1000s bytes long. They're not stored like this - a 16 byte MD5 of the page is stored. However, it is the original full size byte array which needs to be processed.
(UPDATE - the code does not work. See comment from A.H. below) A test showing my code:
public void testSessionIDGetsRemovedFromData() throws IOException
{
byte[] forumContent = "<li class=\"icon-logout\"><a href=\"./ucp.php?mode=logout&sid=3a4043284674572e35881e022c68fcd8\" title=\"Logout [ barry ]\" accesskey=\"x\">Logout [ barry ]</a></li>".getBytes();
byte[] sidPattern = "&sid=".getBytes();
int sidIndex = ArrayCleaner.getPatternIndex(forumContent, sidPattern);
assertEquals(54, sidIndex);
// start of cleaning code
ArrayList<Byte> forumContentList = new ArrayList<Byte>();
forumContentList.addAll(forumContent);
forumContentList.removeAll(Arrays.asList(sidPattern));
byte[] forumContentCleaned = new byte[forumContentList.size()];
for (int i = 0; i < forumContentCleaned.length; i++)
{
forumContentCleaned[i] = (byte)forumContentList.get(i);
}
//end of cleaning code
sidIndex = ArrayCleaner.getPatternIndex(forumContentCleaned, sidPattern);
assertEquals(-1, sidIndex);
}
This all works fine, but I'm worried about the efficiency of the cleaning section. I had hoped to operate solely on arrays, but the ArrayList has nice built in functions to removed a collection from the ArrayList, etc, which is just what I need. So I have had to create an ArrayList of Byte, as I can't have an ArrayList of the primitive byte (can anyone tell me why?), convert the pattern to remove to another ArrayList (I suppose this could be an ArrayList all along) to pass to removeAll(). I then need to create another byte[] and cast each element of the ArrayList of Bytes to a byte and add it to the byte[].
Is there a more efficient way of doing all this? Can it be performed using arrays?
UPDATE This is the same functionality using strings:
public void testSessionIDGetsRemovedFromDataUsingStrings() throws IOException
{
String forumContent = "<li class=\"icon-logout\"><a href=\"./ucp.php?mode=logout&sid=3a4043284674572e35881e022c68fcd8\" title=\"Logout [ barry ]\" accesskey=\"x\">Logout [ barry ]</a></li>";
String sidPattern = "&sid=";
int sidIndex = forumContent.indexOf(sidPattern);
assertEquals(54, sidIndex);
forumContent = forumContent.replaceAll(sidPattern, "");
sidIndex = forumContent.indexOf(sidPattern);
assertEquals(-1, sidIndex);
}
Is this as efficient as the array/arrayList method?
Thanks, Barry
-
3I'm just curious if a web page parser such as Jsoup would work better.Hovercraft Full Of Eels– Hovercraft Full Of Eels2011年10月30日 15:03:57 +00:00Commented Oct 30, 2011 at 15:03
-
2Why are you using byte arrays and not Strings?Matt Ball– Matt Ball2011年10月30日 15:17:46 +00:00Commented Oct 30, 2011 at 15:17
-
Err... to be honest, I can't think of a good answer to that. I suppose it felt natural to use a byte[] to store this data in a database. Would using strings not use more memory?barry– barry2011年10月30日 15:36:24 +00:00Commented Oct 30, 2011 at 15:36
-
You also have yet to address my question: why not simply strip the data that you do want to check with Jsoup or a similar web parser?Hovercraft Full Of Eels– Hovercraft Full Of Eels2011年10月30日 15:37:36 +00:00Commented Oct 30, 2011 at 15:37
-
Well, I'll have to go and check it out before I can answer that. If there's an existing component that would do the job then that's great, so thanks for the pointer.barry– barry2011年10月30日 15:42:04 +00:00Commented Oct 30, 2011 at 15:42
3 Answers 3
You can use List#toArray()
to convert any list to an array.
Things are a bit more complicated in this specific use case because there is no elegant way to auto-unbox (from Byte
to byte
) when converting the list. Good ol' Java generics. Which is a nice segue into...
So I have had to create an ArrayList of Byte, as I can't have an ArrayList of the primitive byte (can anyone tell me why?)
Because, in Java, generic type parameters cannot be primitives. See Why can Java Collections not directly store Primitives types?
Side note: as a matter of style, you should almost always declare ArrayList
types as List
:
List<Byte> forumContentList = new ArrayList<Byte>();
See Java - declaring from Interface type instead of Class and Type List vs type ArrayList in Java.
Comments
This all works fine, I'm worried about the efficiency of the cleaning section...
Really? Did you inspect the resulting "string"? On my machine the data in forumContentCleaned
still contains the &sid=...
data.
That's because
forumContentList.removeAll(Arrays.asList(sidPattern));
tries to remove a List<byte[]>
from a List<Byte>
. This will do nothing. And even if you replace the argument of removeAll
with a real List<Byte>
containing the bytes of "&sid="
, then you will remove ALL occurences of each a
, each m
, each p
and so forth. The resulting data will look like this:
<l cl"con-logout">< href"./uc.h?oelogout34043284674572e35881e022c68fc8" ttle....
Well, strictly speaking, the &sid=
part is gone, but I'm quite sure this is not what you wanted.
Therefore take a step back and think: You are doing string manipulation here, so use a StringBuilder
, feed it with the String(forumContent)
and do your manipulation there.
Edit
Looking at the given example input string, I guess, that also the value of sid
should be removed, not only the key. This code should do it efficiently without regular expresions:
String removeSecrets(String input){
StringBuilder sb = new StringBuilder(input);
String sidStart = "&sid=";
String sidEnd = "\"";
int posStart = 0;
while ((posStart = sb.indexOf(sidStart, posStart)) >= 0) {
int posEnd = sb.indexOf(sidEnd, posStart);
if (posEnd < 0) // delete as far as possible - YMMV
posEnd = sb.length();
sb.delete(posStart, posEnd);
}
return sb.toString();
}
Edit 2
Here is a small benchmark between StringBuilder
and String.replaceAll
:
public class ReplaceAllBenchmark {
public static void main(String[] args) throws Throwable {
final int N = 1000000;
String input = "<li class=\"icon-logout\"><a href=\"./ucp.php?mode=logout&sid=3a4043284674572e35881e022c68fcd8\" title=\"Logout [ barry ]\" accesskey=\"x\">Logout [ barry ]</a>&sid=3a4043284674572e35881e022c68fcd8\"</li>";
stringBuilderBench(input, N);
regularExpressionBench(input, N);
}
static void stringBuilderBench(String input, final int N) throws Throwable{
for(int run=0; run<5; ++run){
long t1 = System.nanoTime();
for(int i=0; i<N; ++i)
removeSecrets(input);
long t2 = System.nanoTime();
System.out.println("sb: "+(t2-t1)+"ns, "+(t2-t1)/N+"ns/call");
Thread.sleep(1000);
}
}
static void regularExpressionBench(String input, final int N) throws Throwable{
for(int run=0; run<5; ++run){
long t1 = System.nanoTime();
for(int i=0; i<N; ++i)
removeSecrets2(input);
long t2 = System.nanoTime();
System.out.println("regexp: "+(t2-t1)+"ns, "+(t2-t1)/N+"ns/call");
Thread.sleep(1000);
}
}
static String removeSecrets2(String input){
return input.replaceAll("&sid=[^\"]*\"", "\"");
}
}
Results:
java version "1.6.0_20"
OpenJDK Runtime Environment (IcedTea6 1.9.9) (6b20-1.9.9-0ubuntu1~10.04.2)
OpenJDK 64-Bit Server VM (build 19.0-b09, mixed mode)
sb: 538735438ns, 538ns/call
sb: 457107726ns, 457ns/call
sb: 443282145ns, 443ns/call
sb: 453978805ns, 453ns/call
sb: 458895308ns, 458ns/call
regexp: 2404818405ns, 2404ns/call
regexp: 2196834572ns, 2196ns/call
regexp: 2239056178ns, 2239ns/call
regexp: 2164337638ns, 2164ns/call
regexp: 2177091893ns, 2177ns/call
11 Comments
String.replaceAll
then please note, that his uses regular expressions internally and compiles the pattern for each call. Your question was about doing thing efficiently. Say goodbye to that with replaceAll
.I dont think two codes have the same function.
the first code removes all characters in the sidPattern from forumContent. the second code removes the sidPattern string from forumContnt, maybe not functional, cause replaceAll() accept the argument as regular expression pattern.
are you sure you want to remove "&sid=" rather than "&sid=3a4043284674572e35881e022c68fcd8" ?
anyway, I think String is fine, List is a little bit heavy.