HTTP Authorization header parser

Question 1

I'm writing a parser for the HTTP Authorization header (see RFC2616#14.8 and RFC2617#1.2). Note that I explicitly don't care about the base64-encoded syntax used by HTTP Basic authentication; I'm only interested in the auth-param syntax used by Digest authentication (to be more specific, I'm implementing a custom Authorization header similar to this question on SO). Basically, it's just a list of key=value pairs separated by commas, where each value may be quoted or unquoted.

Here's my code, which seems to parse the examples from the RFC just fine:

package com.example.sample;

import java.util.HashMap;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AuthorizationHeaderParser {
    private static final String SEPARATORS = "()<>@,;:\\\\\"/\\[\\]?={} \t";
    private static final Pattern TOKEN_PATTERN = Pattern
            .compile("[[\\p{ASCII}]&&[^" + SEPARATORS + "]&&[^\\p{Cntrl}]]+");
    private static final Pattern EQ_PATTERN = Pattern.compile("=");
    private static final Pattern TOKEN_QUOTED_PATTERN = Pattern
            .compile("\"([^\"]|\\\\\\p{ASCII})*\"");
    private static final Pattern COMMA_PATTERN = Pattern.compile(",");
    private static final Pattern LWS_PATTERN = Pattern
            .compile("(\r?\n)?[ \t]+");

    private static class Tokenizer {
        private String remaining;

        public Tokenizer(String input) {
            remaining = input;
        }

        private void skipSpaces() {
            Matcher m = LWS_PATTERN.matcher(remaining);
            if (!m.lookingAt()) {
                return;
            }
            String match = m.group();
            remaining = remaining.substring(match.length());
        }

        public String match(Pattern p) {
            skipSpaces();
            Matcher m = p.matcher(remaining);
            if (!m.lookingAt()) {
                return null;
            }
            String match = m.group();
            remaining = remaining.substring(match.length());
            return match;
        }

        public String mustMatch(Pattern p) {
            String match = match(p);
            if (match == null) {
                throw new NoSuchElementException();
            }
            return match;
        }

        public boolean hasMore() {
            skipSpaces();
            return remaining.length() > 0;
        }
    }

    public static Map<String, String> parse(String input) {
        Tokenizer t = new Tokenizer(input);
        Map<String, String> map = new HashMap<String, String>();
        String authScheme = t.match(TOKEN_PATTERN);
        map.put(":auth-scheme", authScheme);
        while (true) {
            while (t.match(COMMA_PATTERN) != null) {
                // Skip null list elements
            }
            if (!t.hasMore()) {
                break;
            }
            String key = t.mustMatch(TOKEN_PATTERN);
            t.mustMatch(EQ_PATTERN);
            String value = t.match(TOKEN_PATTERN);
            if (value == null) {
                value = t.mustMatch(TOKEN_QUOTED_PATTERN);
                // trim quotes
                value = value.substring(1, value.length() - 1);
            }
            map.put(key, value);
            if (t.hasMore()) {
                t.mustMatch(COMMA_PATTERN);
            }
        }
        return map;
    }

    public static void main(String args[]) {
        String test1 = "Digest\n"
                + " realm=\"testrealm@host.com\",\n"
                + " qop=\"auth,auth-int\",\n"
                + " nonce=\"dcd98b7102dd2f0e8b11d0f600bfb0c093\",\n"
                + " opaque=\"5ccc069c403ebaf9f0171e9517f40e41\"";
        String test2 = "Digest username=\"Mufasa\",\n"
                + " realm=\"testrealm@host.com\",\n"
                + " nonce=\"dcd98b7102dd2f0e8b11d0f600bfb0c093\",\n"
                + " uri=\"/dir/index.html\",\n"
                + " qop=auth,\n"
                + " nc=00000001,\n"
                + " cnonce=\"0a4f113b\",\n"
                + " response=\"6629fae49393a05397450978507c4ef1\",\n"
                + " opaque=\"5ccc069c403ebaf9f0171e9517f40e41\"";
        System.out.println(parse(test1));
        System.out.println(parse(test2));
    }
}

My questions:

For something as simple as this, is my approach (using regex) good enough or should I write a "real" parser?
Is my translation from the RFC BNF to regex correct, or have I made any mistakes that fail on a valid header or pass an invalid header?
The regular expressions seem too complex; can they be simplified?
Any other suggestions?

Answer

Going through your specific questions, I have the following suggestions:

Should you write a 'real' parser? - It depends. Parsers can be complicated, and they make assumptions. Regardless, you have already written your own parser, and it is 'real'.
This is the BIG question: is it right? - With regexes it is often hard to tell, and finding out requires careful analysis of both the regex and the data. I have looked at your code and inspected the regexes and, frankly, they were more complicated than I could easily understand in one sitting (and without 'playing' with the code). So, is it right? I don't know.
Are the regexes too complicated (can they be simplified)? - Yes, they are too complicated, but I am unsure whether they can be simplified.
Other suggestions? - Yes, a few, which leads on to:

OK, so what are the other suggestions?

Since you have a class called Tokenizer, it is apparent you are breaking the input into tokens. Why not use the tools Java already provides (such as java.util.Scanner) to do that work for you?
This problem is also commonly solved with a state machine, which is sometimes much faster, and quite interesting to write.

So, as an exercise, I took your code and implemented both a state-machine version and a Scanner version. I have used state machines in the past to parse comma-separated value files, and the process was very fast, so I figured it made sense here too. The Scanner version is more complicated than I would have hoped, but you may find the implementation educational (I did).

As for a review of your code: I found it 'easier' to write it again myself than to try to understand yours. In a sense, that says a lot.

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AuthorizationHeaderParser {

    /* ****************************************
     * OP Mechanism
     * **************************************** */
    private static final String SEPARATORS = "()<>@,;:\\\\\"/\\[\\]?={} \t";
    private static final Pattern TOKEN_PATTERN = Pattern
            .compile("[[\\p{ASCII}]&&[^" + SEPARATORS + "]&&[^\\p{Cntrl}]]+");
    private static final Pattern EQ_PATTERN = Pattern.compile("=");
    private static final Pattern TOKEN_QUOTED_PATTERN = Pattern
            .compile("\"([^\"]|\\\\\\p{ASCII})*\"");
    private static final Pattern COMMA_PATTERN = Pattern.compile(",");
    private static final Pattern LWS_PATTERN = Pattern
            .compile("(\r?\n)?[ \t]+");

    private static class Tokenizer {
        private String remaining;

        public Tokenizer(String input) {
            remaining = input;
        }

        private void skipSpaces() {
            Matcher m = LWS_PATTERN.matcher(remaining);
            if (!m.lookingAt()) {
                return;
            }
            String match = m.group();
            remaining = remaining.substring(match.length());
        }

        public String match(Pattern p) {
            skipSpaces();
            Matcher m = p.matcher(remaining);
            if (!m.lookingAt()) {
                return null;
            }
            String match = m.group();
            remaining = remaining.substring(match.length());
            return match;
        }

        public String mustMatch(Pattern p) {
            String match = match(p);
            if (match == null) {
                throw new NoSuchElementException();
            }
            return match;
        }

        public boolean hasMore() {
            skipSpaces();
            return remaining.length() > 0;
        }
    }

    public static Map<String, String> parse(String input) {
        Tokenizer t = new Tokenizer(input);
        Map<String, String> map = new HashMap<String, String>();
        String authScheme = t.match(TOKEN_PATTERN);
        map.put(":auth-scheme", authScheme);
        while (true) {
            while (t.match(COMMA_PATTERN) != null) {
                // Skip null list elements
            }
            if (!t.hasMore()) {
                break;
            }
            String key = t.mustMatch(TOKEN_PATTERN);
            t.mustMatch(EQ_PATTERN);
            String value = t.match(TOKEN_PATTERN);
            if (value == null) {
                value = t.mustMatch(TOKEN_QUOTED_PATTERN);
                // trim quotes
                value = value.substring(1, value.length() - 1);
            }
            map.put(key, value);
            if (t.hasMore()) {
                t.mustMatch(COMMA_PATTERN);
            }
        }
        return map;
    }

    /* ****************************************
     * State Machine Mechanism
     * **************************************** */
    private enum ParseState {
        PROLOGSPACE,
        PROLOGWORD,
        KEY,
        KEYVALGAP,
        VALUE,
        QUOTEDVALUE,
        SEPARATOR,
        COMPLETE
    }

    private static final String WHITESPACE = " \t\r\n";

    public static Map<String, String> parseSM(String value) {
        Map<String, String> result = new HashMap<>();
        ParseState currentstate = ParseState.PROLOGSPACE;
        char[] valchars = value.toCharArray();
        // add a null character at the end.
        valchars = Arrays.copyOf(valchars, valchars.length + 1);
        int mark = 0;
        String key = null;
        for (int i = 0; i < valchars.length; i++) {
            final char ch = valchars[i];
            switch (currentstate) {
                case PROLOGSPACE: {
                    // we are in any whitespace before the 'Digest' :auth-scheme
                    if (WHITESPACE.indexOf(ch) < 0) {
                        // no longer in white-space, mark the spot, and move on.
                        mark = i;
                        currentstate = ParseState.PROLOGWORD;
                    }
                    break;
                }
                case PROLOGWORD: {
                    // we are in the 'Digest' :auth-scheme
                    if (WHITESPACE.indexOf(ch) >= 0) {
                        // no longer on the word, handle it....
                        result.put(":auth-scheme", new String(valchars, mark, i - mark));
                        currentstate = ParseState.SEPARATOR;
                    }
                    break;
                }
                case SEPARATOR: {
                    // processing the gap before/between key=value pairs.
                    if (ch == 0) {
                        currentstate = ParseState.COMPLETE;
                    } else if (ch != ',' && WHITESPACE.indexOf(ch) < 0) {
                        mark = i;
                        currentstate = ParseState.KEY;
                    }
                    break;
                }
                case KEY: {
                    // processing a key=value key.
                    if (ch == '=' /* || WHITESPACE.indexOf(ch) >= 0 */ ) {
                        // no longer in key
                        key = new String(valchars, mark, i - mark);
                        currentstate = ParseState.KEYVALGAP;
                    }
                    break;
                }
                case KEYVALGAP: {
                    if (ch != '=' /* && WHITESPACE.indexOf(ch) < 0 */) {
                        mark = 0;
                        if (ch == '"') {
                            currentstate = ParseState.QUOTEDVALUE;
                            mark = i + 1;
                        } else {
                            currentstate = ParseState.VALUE;
                            mark = i;
                        }
                    }
                    break;
                }
                case VALUE: {
                    if (ch == ',' || ch == 0 || WHITESPACE.indexOf(ch) >= 0) {
                        result.put(key, new String(valchars, mark, i - mark));
                        currentstate = ParseState.SEPARATOR;
                    }
                    break;
                }
                case QUOTEDVALUE: {
                    if (ch == '"') {
                        result.put(key, new String(valchars, mark, i - mark));
                        currentstate = ParseState.SEPARATOR;
                    }
                    break;
                }
                case COMPLETE: {
                    throw new IllegalStateException("There should be no characters after COMPLETE");
                }
            }
        }
        if (currentstate != ParseState.COMPLETE) {
            throw new IllegalStateException("Unexpected parse path ended before completion (ended at " + currentstate + ").");
        }
        return result;
    }

    /* ****************************************
     * Scanner Mechanism
     * **************************************** */
    private static final Pattern SCANWHITESPACE = Pattern.compile("\\s+");
    private static final Pattern SCANEQUALS = Pattern.compile("=");
    private static final Pattern SCANONECHAR = Pattern.compile("\\s*");
    private static final Pattern SCANCOMMA = Pattern.compile("\\s*,\\s*");
    private static final Pattern SCANQUOTEEND = Pattern.compile("\"");

    public static Map<String, String> parseScanner(String value) {
        Map<String, String> result = new HashMap<>();
        try (Scanner scanner = new Scanner(value)) {
            scanner.useDelimiter(SCANWHITESPACE);
            if (scanner.hasNext(SCANWHITESPACE)) {
                scanner.skip(SCANWHITESPACE);
            }
            result.put(":auth-scheme", scanner.next());
            while (scanner.hasNext()) {
                scanner.skip(scanner.delimiter());
                scanner.useDelimiter(SCANEQUALS);
                String key = scanner.next();
                scanner.skip(scanner.delimiter());
                scanner.useDelimiter(SCANONECHAR);
                if (scanner.hasNext()) {
                    String firstchar = scanner.next();
                    if ("\"".equals(firstchar)) {
                        scanner.useDelimiter(SCANQUOTEEND);
                        String val = scanner.next();
                        result.put(key, val);
                        scanner.skip(scanner.delimiter());
                        scanner.useDelimiter(SCANCOMMA);
                    } else {
                        scanner.useDelimiter(SCANCOMMA);
                        result.put(key, firstchar + scanner.next());
                    }
                }
            }
        }
        return result;
    }

    public static void main(String args[]) {
        String test1 = "Digest\n"
                + " realm=\"testrealm@host.com\",\n"
                + " qop=\"auth,auth-int\",\n"
                + " nonce=\"dcd98b7102dd2f0e8b11d0f600bfb0c093\",\n"
                + " opaque=\"5ccc069c403ebaf9f0171e9517f40e41\"";
        String test2 = "Digest username=\"Mufasa\",\n"
                + " realm=\"testrealm@host.com\",\n"
                + " nonce=\"dcd98b7102dd2f0e8b11d0f600bfb0c093\",\n"
                + " uri=\"/dir/index.html\",\n"
                + " qop=auth,\n"
                + " nc=00000001,\n"
                + " cnonce=\"0a4f113b\",\n"
                + " response=\"6629fae49393a05397450978507c4ef1\",\n"
                + " opaque=\"5ccc069c403ebaf9f0171e9517f40e41\"";
        System.out.println(parse(test1));
        System.out.println(parseSM(test1));
        System.out.println(parseScanner(test1));
        System.out.println(parse(test2));
        System.out.println(parseSM(test2));
        System.out.println(parseScanner(test2));
    }
}
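
On the open question of whether the regexes are right, one way to do the "careful analysis" is to probe a single pattern with an input built to stress it. The sketch below is only an illustration: the class name QuotedStringCheck, the helper prefixMatch and the STRICTER pattern are assumptions of mine, not part of the posted code. It feeds a quoted string containing an escaped quote to the question's quoted-string pattern, using lookingAt() as the Tokenizer does.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical probe class, not part of the original post.
class QuotedStringCheck {
    // The quoted-string pattern exactly as posted in the question.
    static final Pattern ORIGINAL = Pattern.compile("\"([^\"]|\\\\\\p{ASCII})*\"");

    // Assumed alternative: the first branch also excludes the backslash, so a
    // backslash can only be consumed together with the character after it.
    static final Pattern STRICTER = Pattern.compile("\"(?:[^\"\\\\]|\\\\\\p{ASCII})*\"");

    // Returns the prefix of input matched by p, or null when nothing matches.
    static String prefixMatch(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.lookingAt() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "\"a\\\"b\""; // the six characters "a\"b"
        System.out.println(prefixMatch(ORIGINAL, input)); // prints "a\"
        System.out.println(prefixMatch(STRICTER, input)); // prints "a\"b"
    }
}
```

With the posted pattern the match stops at the escaped quote, because [^"] happily consumes the backslash on its own. That suggests the original parse would throw on a header whose quoted value contains \", although I have only checked this one input.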

Comment

I was trying (perhaps too hard) to conform to the RFCs. I took a quick look at your code: your examples accept input with no commas between pairs, and they don't accept backslash-escaped characters inside quoted strings. Nevertheless, I found them instructive.
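
To make the two points in this comment concrete, here is a minimal sketch of a hand-written scanner that rejects pairs not separated by a comma and resolves backslash escapes inside quoted values. It is not code from either post: the class StrictAuthParamParser and the method parseParams are hypothetical names, and it makes simplifying assumptions. It expects the auth-scheme to have been stripped already, does not validate token characters against the RFC separator list, and does not allow empty list elements such as ",,".

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical class, not part of the original post.
class StrictAuthParamParser {

    // Parses the text that follows the auth-scheme, e.g.  realm="a\"b", qop=auth
    static Map<String, String> parseParams(String s) {
        Map<String, String> result = new LinkedHashMap<>();
        int i = skipWhitespace(s, 0);
        while (i < s.length()) {
            // Key: runs up to '=', ',', '"' or whitespace.
            int start = i;
            while (i < s.length() && "=,\"".indexOf(s.charAt(i)) < 0
                    && !Character.isWhitespace(s.charAt(i))) {
                i++;
            }
            if (i == start) {
                throw new IllegalArgumentException("Expected a key at index " + i);
            }
            String key = s.substring(start, i);
            i = skipWhitespace(s, i);
            if (i >= s.length() || s.charAt(i) != '=') {
                throw new IllegalArgumentException("Expected '=' after key " + key);
            }
            i = skipWhitespace(s, i + 1);

            StringBuilder value = new StringBuilder();
            if (i < s.length() && s.charAt(i) == '"') {
                i++; // skip the opening quote
                boolean closed = false;
                while (i < s.length() && !closed) {
                    char ch = s.charAt(i++);
                    if (ch == '\\' && i < s.length()) {
                        value.append(s.charAt(i++)); // escape: take the next char literally
                    } else if (ch == '"') {
                        closed = true;
                    } else {
                        value.append(ch);
                    }
                }
                if (!closed) {
                    throw new IllegalArgumentException("Unterminated quoted string for " + key);
                }
            } else {
                // Unquoted value: runs up to the next comma or whitespace.
                while (i < s.length() && s.charAt(i) != ','
                        && !Character.isWhitespace(s.charAt(i))) {
                    value.append(s.charAt(i++));
                }
            }
            result.put(key, value.toString());

            // Anything left after a pair must start with a comma.
            i = skipWhitespace(s, i);
            if (i < s.length()) {
                if (s.charAt(i) != ',') {
                    throw new IllegalArgumentException("Expected ',' at index " + i);
                }
                i = skipWhitespace(s, i + 1);
            }
        }
        return result;
    }

    private static int skipWhitespace(String s, int i) {
        while (i < s.length() && Character.isWhitespace(s.charAt(i))) {
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        System.out.println(parseParams("realm=\"a\\\"b\", qop=auth")); // {realm=a"b, qop=auth}
    }
}
```

Under these assumptions, an input such as a=1 b=2 is rejected with an IllegalArgumentException instead of being parsed as two pairs.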

rolfl · Accepted Answer · 2014-02-09 19:19:53Z
