0

Using Java, I have a class which retrieves a webpage as a byte array. I then need to strip out some content if it exists. (The application monitors web pages for changes, but needs to remove session Ids from the html which are created by php, and would mean changes were detected each visit to the page).

Some of the resulting byte arrays could be 10s of 1000s bytes long. They're not stored like this - a 16 byte MD5 of the page is stored. However, it is the original full size byte array which needs to be processed.

(UPDATE - the code does not work. See comment from A.H. below) A test showing my code:

public void testSessionIDGetsRemovedFromData() throws IOException
 {
 byte[] forumContent = "<li class=\"icon-logout\"><a href=\"./ucp.php?mode=logout&amp;sid=3a4043284674572e35881e022c68fcd8\" title=\"Logout [ barry ]\" accesskey=\"x\">Logout [ barry ]</a></li>".getBytes();
 byte[] sidPattern = "&amp;sid=".getBytes();
 int sidIndex = ArrayCleaner.getPatternIndex(forumContent, sidPattern);
 assertEquals(54, sidIndex);
 // start of cleaning code
 ArrayList<Byte> forumContentList = new ArrayList<Byte>();
 forumContentList.addAll(forumContent);
 forumContentList.removeAll(Arrays.asList(sidPattern));
 byte[] forumContentCleaned = new byte[forumContentList.size()];
 for (int i = 0; i < forumContentCleaned.length; i++)
 {
 forumContentCleaned[i] = (byte)forumContentList.get(i);
 }
 //end of cleaning code
 sidIndex = ArrayCleaner.getPatternIndex(forumContentCleaned, sidPattern);
 assertEquals(-1, sidIndex);
 }

This all works fine, but I'm worried about the efficiency of the cleaning section. I had hoped to operate solely on arrays, but the ArrayList has nice built in functions to removed a collection from the ArrayList, etc, which is just what I need. So I have had to create an ArrayList of Byte, as I can't have an ArrayList of the primitive byte (can anyone tell me why?), convert the pattern to remove to another ArrayList (I suppose this could be an ArrayList all along) to pass to removeAll(). I then need to create another byte[] and cast each element of the ArrayList of Bytes to a byte and add it to the byte[].

Is there a more efficient way of doing all this? Can it be performed using arrays?

UPDATE This is the same functionality using strings:

 public void testSessionIDGetsRemovedFromDataUsingStrings() throws IOException
{ 
 String forumContent = "<li class=\"icon-logout\"><a href=\"./ucp.php?mode=logout&amp;sid=3a4043284674572e35881e022c68fcd8\" title=\"Logout [ barry ]\" accesskey=\"x\">Logout [ barry ]</a></li>";
 String sidPattern = "&amp;sid=";
 int sidIndex = forumContent.indexOf(sidPattern);
 assertEquals(54, sidIndex);
 forumContent = forumContent.replaceAll(sidPattern, "");
 sidIndex = forumContent.indexOf(sidPattern);
 assertEquals(-1, sidIndex);
}

Is this as efficient as the array/arrayList method?

Thanks, Barry

asked Oct 30, 2011 at 15:01
6
  • 3
    I'm just curious if a web page parser such as Jsoup would work better. Commented Oct 30, 2011 at 15:03
  • 2
    Why are you using byte arrays and not Strings? Commented Oct 30, 2011 at 15:17
  • Err... to be honest, I can't think of a good answer to that. I suppose it felt natural to use a byte[] to store this data in a database. Would using strings not use more memory? Commented Oct 30, 2011 at 15:36
  • You also have yet to address my question: why not simply strip the data that you do want to check with Jsoup or a similar web parser? Commented Oct 30, 2011 at 15:37
  • Well, I'll have to go and check it out before I can answer that. If there's an existing component that would do the job then that's great, so thanks for the pointer. Commented Oct 30, 2011 at 15:42

3 Answers 3

5

You can use List#toArray() to convert any list to an array.

Things are a bit more complicated in this specific use case because there is no elegant way to auto-unbox (from Byte to byte) when converting the list. Good ol' Java generics. Which is a nice segue into...

So I have had to create an ArrayList of Byte, as I can't have an ArrayList of the primitive byte (can anyone tell me why?)

Because, in Java, generic type parameters cannot be primitives. See Why can Java Collections not directly store Primitives types?


Side note: as a matter of style, you should almost always declare ArrayList types as List:

List<Byte> forumContentList = new ArrayList<Byte>();

See Java - declaring from Interface type instead of Class and Type List vs type ArrayList in Java.

answered Oct 30, 2011 at 15:03

Comments

3

This all works fine, I'm worried about the efficiency of the cleaning section...

Really? Did you inspect the resulting "string"? On my machine the data in forumContentCleaned still contains the &amp;sid=... data.

That's because

forumContentList.removeAll(Arrays.asList(sidPattern));

tries to remove a List<byte[]> from a List<Byte>. This will do nothing. And even if you replace the argument of removeAll with a real List<Byte> containing the bytes of "&amp;sid=", then you will remove ALL occurences of each a, each m, each p and so forth. The resulting data will look like this:

<l cl"con-logout">< href"./uc.h?oelogout34043284674572e35881e022c68fc8" ttle....

Well, strictly speaking, the &amp;sid= part is gone, but I'm quite sure this is not what you wanted.

Therefore take a step back and think: You are doing string manipulation here, so use a StringBuilder, feed it with the String(forumContent) and do your manipulation there.

Edit

Looking at the given example input string, I guess, that also the value of sid should be removed, not only the key. This code should do it efficiently without regular expresions:

String removeSecrets(String input){
 StringBuilder sb = new StringBuilder(input);
 String sidStart = "&amp;sid=";
 String sidEnd = "\"";
 int posStart = 0;
 while ((posStart = sb.indexOf(sidStart, posStart)) >= 0) {
 int posEnd = sb.indexOf(sidEnd, posStart);
 if (posEnd < 0) // delete as far as possible - YMMV
 posEnd = sb.length();
 sb.delete(posStart, posEnd);
 }
 return sb.toString();
}

Edit 2

Here is a small benchmark between StringBuilder and String.replaceAll:

public class ReplaceAllBenchmark {
 public static void main(String[] args) throws Throwable {
 final int N = 1000000;
 String input = "<li class=\"icon-logout\"><a href=\"./ucp.php?mode=logout&amp;sid=3a4043284674572e35881e022c68fcd8\" title=\"Logout [ barry ]\" accesskey=\"x\">Logout [ barry ]</a>&amp;sid=3a4043284674572e35881e022c68fcd8\"</li>";
 stringBuilderBench(input, N);
 regularExpressionBench(input, N);
 }
 static void stringBuilderBench(String input, final int N) throws Throwable{
 for(int run=0; run<5; ++run){
 long t1 = System.nanoTime();
 for(int i=0; i<N; ++i)
 removeSecrets(input);
 long t2 = System.nanoTime();
 System.out.println("sb: "+(t2-t1)+"ns, "+(t2-t1)/N+"ns/call");
 Thread.sleep(1000);
 }
 }
 static void regularExpressionBench(String input, final int N) throws Throwable{
 for(int run=0; run<5; ++run){
 long t1 = System.nanoTime();
 for(int i=0; i<N; ++i)
 removeSecrets2(input);
 long t2 = System.nanoTime();
 System.out.println("regexp: "+(t2-t1)+"ns, "+(t2-t1)/N+"ns/call");
 Thread.sleep(1000);
 }
 }
 static String removeSecrets2(String input){
 return input.replaceAll("&amp;sid=[^\"]*\"", "\"");
 }
}

Results:

java version "1.6.0_20"
OpenJDK Runtime Environment (IcedTea6 1.9.9) (6b20-1.9.9-0ubuntu1~10.04.2)
OpenJDK 64-Bit Server VM (build 19.0-b09, mixed mode)
sb: 538735438ns, 538ns/call
sb: 457107726ns, 457ns/call
sb: 443282145ns, 443ns/call
sb: 453978805ns, 453ns/call
sb: 458895308ns, 458ns/call
regexp: 2404818405ns, 2404ns/call
regexp: 2196834572ns, 2196ns/call
regexp: 2239056178ns, 2239ns/call
regexp: 2164337638ns, 2164ns/call
regexp: 2177091893ns, 2177ns/call
answered Oct 30, 2011 at 16:00

11 Comments

I'm just working on it now... Apologies for posting dodgy code :-( +1
As an aside, the test did pass when I posted the question, but afterwards I had realised I'd made a mistake at the forumContentList.addAll(forumContent); line. I'd actually pasted forumContentList.addAll(forumContentList); which is nonsense but I got the JUnit green line :-/ No excuse, I know...
I've looked at StringBuilder, but I thing String looks more suited to my needs, as the replaceAll() method does exactly what I need.
If you use String.replaceAll then please note, that his uses regular expressions internally and compiles the pattern for each call. Your question was about doing thing efficiently. Say goodbye to that with replaceAll.
Ok, can you suggest how to do it efficiently?
|
1

I dont think two codes have the same function.

the first code removes all characters in the sidPattern from forumContent. the second code removes the sidPattern string from forumContnt, maybe not functional, cause replaceAll() accept the argument as regular expression pattern.

are you sure you want to remove "&sid=" rather than "&sid=3a4043284674572e35881e022c68fcd8" ?

anyway, I think String is fine, List is a little bit heavy.

answered Oct 30, 2011 at 16:22

2 Comments

Yes, eventually I want to remove the whole sid id, up to the closing ". But I was just wondering about efficiency at this point. Thanks for the answer.
F*k, all I edited lost. First, List will cost mn for removeAll() method loop, m*n means contentListSize * sidListSize. you may not avoid regular expression, cause you have to reomve the whole sid including the value of that parameter, unless you implement is yourself.

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.