Friday, 13 September 2013

Regular expression breaking due to line break character (\n)

I have a regex that uses the pattern "'''.*?'''|'.*?'" to find text
between triple quotes (''') and single quotes ('). When line breaks are
added to the input string, the pattern fails to read to the end of the
triple quote. Any idea how to change the regex so that it reads to the
closing triple quote and does not break on the \n?
Works:
'''<html><head></head></html>'''
Fails:
User Entered Value:
'''<html>
<head></head>
</html>'''
Java Representation:
'''<html>\n<head></head>\n</html>'''
Parsing Logic:
public static final Pattern QUOTE_PATTERN =
        Pattern.compile("'''.*?'''|'.*?'");

Matcher quoteMatcher = ContentCommonConstants.QUOTE_PATTERN.matcher(value);
int normalPos = 0, length = value.length();
while (normalPos < length && quoteMatcher.find()) {
    int quotePos = quoteMatcher.start(), quoteEnd = quoteMatcher.end();
    if (normalPos < quotePos) {
        copyBuilder.append(stripHTML(value.substring(normalPos, quotePos)));
    }
    // quoteEnd fails to read to the end due to \n
    copyBuilder.append(value.substring(quotePos, quoteEnd));
    normalPos = quoteEnd;
}
if (normalPos < length)
    copyBuilder.append(stripHTML(value.substring(normalPos)));
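
One likely cause: in java.util.regex, "." does not match line terminators unless the pattern is compiled with Pattern.DOTALL. The triple-quote alternative therefore cannot cross the \n, and the single-quote alternative ends up matching the empty pair '' instead. The sketch below is only an illustration of that idea, not the original code: the class QuoteMatchSketch and the helper quotedSpans are hypothetical names, and stripHTML and copyBuilder from the question are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name, used only for this illustration.
class QuoteMatchSketch {
    // Same pattern as in the question, but compiled with DOTALL
    // so that '.' also matches line terminators such as \n.
    static final Pattern QUOTE_PATTERN =
            Pattern.compile("'''.*?'''|'.*?'", Pattern.DOTALL);

    // Hypothetical helper: collects every quoted span found in value.
    static List<String> quotedSpans(String value) {
        List<String> spans = new ArrayList<>();
        Matcher m = QUOTE_PATTERN.matcher(value);
        while (m.find()) {
            spans.add(m.group());
        }
        return spans;
    }

    public static void main(String[] args) {
        String value = "before '''<html>\n<head></head>\n</html>''' after";
        // With DOTALL the whole triple-quoted block is a single match.
        System.out.println(quotedSpans(value));
    }
}
```

The same effect should be available without touching the compile call by prefixing the pattern with the embedded flag (?s), i.e. "(?s)'''.*?'''|'.*?'". Note that with DOTALL the single-quote alternative can also span lines, which may or may not be what is wanted.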
