Java's regular expression, String, and things..

Posted by hellofadude on March 31, 2014 at 10:29 AM PDT

Manipulating strings is one of the most common tasks a programmer undertakes. In Java, a string is an object of the String class, and every string literal in your source code is an instance of that class. String objects are immutable: once created, their contents cannot be changed, and every operation that appears to modify a string actually returns a new one. This is a property with which one must strive to become familiar. The following example is meant to illustrate this point:-

package regex.strings.java;

public class ImmutableString {

     static String trim(String immutableRef) {
         String trimmed =  immutableRef.trim();
         return trimmed;
     }
     public static void main(String[] args) {
         String immutable = "  Arbitrarily replace parts of this string";
         System.out.println("\"" + immutable + "\"" );
         String aString = ImmutableString.trim(immutable).replaceAll("tr|rep|p|hi", "x");
         System.out.println("\"" + aString + "\"");
         System.out.println("\"" + immutable + "\"");
     }
}
/* Output
  "  Arbitrarily replace parts of this string"
  "Arbixarily xlace xarts of txs sxing"
  "  Arbitrarily replace parts of this string"
*//

In this very simple example, we first call our own ImmutableString.trim() method to remove the leading whitespace, and then pass a regular expression to the replaceAll() method of the String class, which returns a modified copy of our immutable String object. The replaceAll() method replaces every substring that matches the regular expression with the string "x". The expression uses the alternation operator '|' to list four alternatives, "tr", "rep", "p" and "hi", any one of which counts as a match wherever it occurs in the string.

Following these operations, you will notice that our original string is unchanged. Only a copy of the reference to the original string is passed to the trim() method, which returns a reference to a new String object holding the trimmed text. The same thing happens in the call to replaceAll(). If you examine the bytecode produced by Java's class file disassembler (javap with the -c switch) for this class, compiled with a Java 8 or earlier compiler, you will notice that three StringBuilder objects are created in total. These come from the three '+' concatenations in the println() calls, not from trim() or replaceAll(), which build their results internally. (Compilers from Java 9 onwards use a different concatenation strategy, so the bytecode will differ.)
This is what we mean when we refer to the immutable property of String objects. It can also become a source of inefficiency, since every modification produces a new object, which is why the compiler discreetly makes use of the StringBuilder class whenever strings are concatenated with the '+' operator. Java's API documentation describes StringBuilder as a mutable sequence of characters, designed for use by a single thread.
Some very talented programmers recommend using the StringBuilder class explicitly, in particular when a string is modified repeatedly, in order to avoid the overhead associated with immutability, particularly in environments where unnecessary overhead is not considered acceptable.
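The difference between the two classes can be checked directly. The following is only a minimal sketch; the class name ImmutabilityCheck and its report() method are hypothetical and not part of the original example. It compares object references to show that trim() hands back a new String, while append() modifies the StringBuilder in place and returns the same object:

```java
// Hypothetical helper class, written only to illustrate the point above.
class ImmutabilityCheck {
    // Builds a short report so the behaviour can be printed or inspected.
    static String report() {
        String original = "  padded";
        String trimmed = original.trim();                 // a new String; original is untouched
        StringBuilder builder = new StringBuilder(original);
        StringBuilder sameBuilder = builder.append("!");  // mutates builder and returns it
        return "original unchanged: " + original.equals("  padded")
                + ", trim() returned same object: " + (original == trimmed)
                + ", append() returned same object: " + (builder == sameBuilder)
                + ", builder now: \"" + builder + "\"";
    }

    public static void main(String[] args) {
        System.out.println(report());
    }
}
```

Note that trim() is only guaranteed to return a different object when there is actually whitespace to remove, which is why the sample string is padded.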

For instance, one of the interesting things you can do with String objects is to concatenate two or more of them end to end. Java provides a number of ways to do this, for example by means of the overloaded '+' operator, overloaded in the sense that it takes on an extra meaning when used with the String class:-

package regex.strings.java;

public class ConcatenateString {
     String compoundOperator = "";
     public ConcatenateString(String args) {
         String[] splitter = args.split(" ");
         compoundOperator += " \"";
         for(int i = 0; i < splitter.length; i++) {
             compoundOperator += splitter[i] + " ";
         }
         compoundOperator += "\"";
     }
     public static void main(String[] args)  {
         String aDefinition = "Cognitive computing is the development of computer \n" +
                                       "systems modelled on the human brain";
         ConcatenateString overloadedPlus = new ConcatenateString(aDefinition);
         String listOfMetals = "Hydrogen"  + ", " + "Lithium" + ", " +  "Sodium" ;
         String statementOfFact = "A List of metals include: " +  listOfMetals;
         System.out.println(statementOfFact + "\n");
         System.out.println(overloadedPlus.compoundOperator);
     }
}
/* Output
A List of metals include: Hydrogen, Lithium, Sodium

"Cognitive computing is the development of computer
systems modelled on the human brain "
*//

In the main() method we make use of the '+' operator to combine a number of individual strings into one coherent String object. Within the class constructor, notice the use of the overloaded '+=' operator, which also acts like an append method when used with String objects. Both of these operators are suitable for simple operations, but become inefficient when used in loops like the one in the above constructor, because the compiler (at least up to Java 8) creates a new StringBuilder object at every iteration which, as has been pointed out before now, is really not ideal. A more efficient way is to create a single StringBuilder object explicitly and call its append() method within the loop, as the following example demonstrates:-
package regex.strings.java;

public class ConcatenateString2 {
     StringBuilder buildString = new StringBuilder();
     public ConcatenateString2(String args) {
         String[] splitter = args.split(" ");
         buildString.append(" \"");
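         for(int i = 0; i < splitter.length; i++) {
             // chain append() calls so that no temporary string is built with '+'
             buildString.append(splitter[i]).append(" ");
         }
         // closing quote without an extra space, as in ConcatenateString
         buildString.append("\"");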
     }
     public static void main(String[] args)  {
         String aDefinition = "Cognitive computing is the development of computer \n" +
                                 "systems modelled on the human brain";
         ConcatenateString2 stringBuilder = new ConcatenateString2(aDefinition);
         System.out.println(stringBuilder.buildString);
     }
}
/* Output
"Cognitive computing is the development of computer
systems modelled on the human brain "
*//

By explicitly creating a StringBuilder before you enter your loop, you produce better and more efficient code.

With the release of Java SE 7 came the ability to use String objects in a switch statement, which was not previously possible. Since Java SE 8 was released while I was putting this piece together, I figured I might as well use the opportunity this example provided to test the utility of the new Date/Time API, which altogether seems to be quite an improvement over the previous version:-
package regex.strings.java;

import java.time.Month;
import java.time.YearMonth;
import java.time.format.TextStyle;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class GregorianCalendar {
     private List<Month> leapYear = new ArrayList<Month>();
     public void setMonths() {
         for(int month = 1; month < 13; month++) {
             leapYear.add(Month.of(month));
         }
     }
     public void daysOfTheMonth(int month) {
         Month whichMonth = leapYear.get(month);
         switch(whichMonth.getDisplayName(TextStyle.FULL, Locale.ENGLISH)) {
         case "April":
         case "June":
         case "September":
         case "November":
             System.out.println(whichMonth.getDisplayName(TextStyle.FULL,
                     Locale.ENGLISH) + " = 30 days");
             break;
         case "February":
             System.out.println(whichMonth.getDisplayName(TextStyle.FULL,
                     Locale.ENGLISH) + "  = 29 days");
             break;
         default:
             System.out.println(whichMonth.getDisplayName(TextStyle.FULL,
                     Locale.ENGLISH) + "  = 31 days");
         }
     }
     public static void main(String[] args) {
         GregorianCalendar aCalendar = new GregorianCalendar();
         aCalendar.setMonths();
         for(int i = 0; i < aCalendar.leapYear.size(); i++)
             aCalendar.daysOfTheMonth(i);
     }
}

Java 8's new Date/Time API consists of five packages (java.time and four sub-packages) that provide you with a comprehensive framework for representing dates and times. In our example, we make use of Month, which is an enumerated type, and switch on the English display name of each month to print the number of days it has in a leap year.
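As an aside, the Month enum can report its own length, so the string switch is not strictly necessary for this task. The following is a minimal sketch rather than a replacement for the example above; the class name MonthLengths and the describe() helper are hypothetical. It relies on Month.length(boolean), which takes a leap-year flag, and on YearMonth.lengthOfMonth() for a concrete year:

```java
import java.time.Month;
import java.time.YearMonth;
import java.time.format.TextStyle;
import java.util.Locale;

// Hypothetical helper class, not part of the original example.
class MonthLengths {
    // Days in the given month; the leapYear flag only affects February.
    static String describe(Month month, boolean leapYear) {
        return month.getDisplayName(TextStyle.FULL, Locale.ENGLISH)
                + " = " + month.length(leapYear) + " days";
    }

    public static void main(String[] args) {
        for (Month month : Month.values()) {
            System.out.println(describe(month, true));
        }
        // YearMonth applies the leap-year rule for a specific year
        System.out.println(YearMonth.of(2014, 2).lengthOfMonth()); // 28, as 2014 is not a leap year
    }
}
```

The string switch remains a fair demonstration of the Java SE 7 feature; this version simply shows what the new API offers out of the box.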

Regular expressions

A regular expression is a string that describes a pattern to be found in text. It does so by giving special meaning to certain characters, known as metacharacters, when they appear in a particular context. A metacharacter is a character that has been given a special meaning by Java's regular expression interpreter. It is therefore no mistake to think of regular expressions as a small language in their own right, separate from Java, even though they are written as Java strings and are subsequently applied to exactly this same type.

There are a number of predefined character classes that represent the basic building blocks of regular expressions, with which you must become properly acquainted if it is your wish to attain an expert level of proficiency in this subject. For instance, a word character is represented by a backslash followed by the letter w, like so: '\w'. Similarly, a digit is represented by '\d' and a whitespace character by '\s', while '\W', '\S' and '\D' represent a non-word character, a non-whitespace character and a non-digit respectively.

In addition, because the backslash is also the escape character in Java string literals, you must precede each of these with an additional backslash '\' when writing it in Java source, such that the string "\\w" is used to indicate a word character, "\\d" to indicate a digit, et cetera.
You should also become familiar with arrangements of characters enclosed in square brackets [...], known as character classes, which match any one of the characters they list. The arrangement "[abc]" matches a, b or c, and "[a-zA-Z]" matches any character in the range a-z or A-Z. The '-' metacharacter acts as a range-forming operator when used inside square brackets. Character classes can also be combined to make more complex operations possible: the arrangement "[a-e[l-p]]" is a union that matches any character in the range a through e or l through p, and the expression "[a-z&&[hij]]" is an intersection that matches h, i or j.
And then of course you also have a number of logical operators, including 'AB' to indicate that B follows A, 'A|B' to indicate A or B, and '(A)' to mark a capturing group, an explanation of which is better sought within the pages of your official documentation. All of this is in addition to other categories of metacharacters, boundary matchers, quantifiers and a multiplicity of character classes competing to represent every conceivable character combination you can possibly imagine.
For this reason, you will probably have to become used to consulting your official documentation for the full list of Java's regular expression metacharacters and the valid range of arrangements.
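The arrangements described above can be tried out one at a time. The following is a minimal sketch, assuming nothing beyond String.matches(); the class name CharacterClassDemo and its check() helper are hypothetical:

```java
// Hypothetical helper class for experimenting with character classes.
class CharacterClassDemo {
    // True when the whole input matches the regular expression.
    static boolean check(String regex, String input) {
        return input.matches(regex);
    }

    public static void main(String[] args) {
        // In Java source, "\\d" is a two-character string: a backslash and the letter d
        System.out.println("\\d".length());                // 2
        System.out.println(check("\\d", "7"));             // true: a digit
        System.out.println(check("[abc]", "b"));           // true: one of a, b or c
        System.out.println(check("[a-e[l-p]]", "m"));      // true: union of two ranges
        System.out.println(check("[a-e[l-p]]", "g"));      // false: g is in neither range
        System.out.println(check("[a-z&&[hij]]", "i"));    // true: intersection
        System.out.println(check("[a-z&&[hij]]", "k"));    // false
        System.out.println(check("cat|dog", "dog"));       // true: A or B
    }
}
```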

Nevertheless, the easiest way to use regular expressions in Java is to pass one to the convenience method matches() of the String class, which tests whether or not the expression matches the entire string:-

package regex.strings.java;

public class StringMatcher {
     public static void main(String[] args) {
         System.out.println("Flight 8957".matches("\\w+\\s?\\d+"));
         System.out.println("Flight8957".matches("[Ff]li[huog]ht\\s?\\w+"));
         System.out.println(" Flight-8957".matches("\\s?\\w+-?\\d+"));
         System.out.println("Flight8957".matches("[a-zA-Z]+[0-9]+"));
         System.out.println("Flight8957".matches("\\w+"));
     }
}
/* Output
true
true
true
true
true
*//

In addition to some of Java's predefined character classes, the above example uses 'quantifiers' to specify how many times a part of the pattern may occur. So for instance, the first expression uses the '+' quantifier to match a word character "\\w" one or more times, then the '?' quantifier to match a whitespace character "\\s" once or not at all, and finally "\\d+" to match a digit one or more times in the string "Flight 8957". The matches() method returns a boolean value that indicates whether the entire string matches the pattern.

With quantifiers, it is important to bear in mind that they apply only to the element that immediately precedes them, whether that is a single character, a character class, or a group in parentheses. So for instance the expression "\\w+", which indicates a word character one or more times, will match the string "Flight8957"; but to match the string "Flight-8957", which contains a non-word character, you will need to write "\\w+-?\\d+", which reads: match a word character one or more times, followed by a hyphen once or not at all, and then a digit one or more times. You may apply this reasoning to unravel the mystery behind the remaining regular expression statements.

Parentheses allow you to group individual characters to form a pattern. For instance, the '*' metacharacter is a third quantifier that matches zero or more times, so that the expression "(abc)*" matches the sequence 'abc' zero or more times, while the expression "abc*" reads: match 'ab' followed by 'c' zero or more times. This quantifier, together with those referenced earlier, belongs to the category of so-called 'greedy' quantifiers, which match as much of the input as they can and give characters back only if the rest of the pattern would otherwise fail. Appending '?' to a quantifier makes it 'reluctant', so that it matches as little as possible; appending '+' makes it 'possessive', so that it matches as much as possible and never gives anything back.
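The three categories are easiest to tell apart by running the same input through each. The following is a minimal sketch; the class name QuantifierKinds, the findAll() helper and the sample input are all chosen for illustration and are not part of the original examples:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class comparing greedy, reluctant and possessive quantifiers.
class QuantifierKinds {
    // Collects every match of regex in input as "text@startIndex".
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            found.add(matcher.group() + "@" + matcher.start());
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "xfooxxxxxxfoo";
        System.out.println(findAll(".*foo", input));    // greedy:     [xfooxxxxxxfoo@0]
        System.out.println(findAll(".*?foo", input));   // reluctant:  [xfoo@0, xxxxxxfoo@4]
        System.out.println(findAll(".*+foo", input));   // possessive: []
        // Grouping changes what the quantifier applies to
        System.out.println("abcabc".matches("(abc)*")); // true
        System.out.println("abccc".matches("abc*"));    // true
        System.out.println("abcabc".matches("abc*"));   // false
    }
}
```

The possessive version finds nothing because ".*+" consumes the whole input and refuses to give back the final "foo".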

Java provides a package (java.util.regex) with more advanced features for creating, compiling and matching regular expressions against text in a standardised way. The classes that should concern you initially are Pattern and Matcher. With this technique, you obtain a Pattern object by passing a valid regular expression to the static compile() method of the Pattern class. You then pass your search string to the matcher() method of that Pattern object to obtain a Matcher object, which you can query for detailed information about each match. In the following example, we apply this technique to locate patterns, first in a file and then manually on a test string:-

package pattern.matching.java;

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Parser {
     static void parseString(String arg1, String arg2) {
         Pattern regex = Pattern.compile(arg1);
         Matcher interpreter = regex.matcher(arg2);
         // find each pattern match in the search string
         while (interpreter.find()) {
             System.out.println("found pattern: " + interpreter.group() + " at index " + interpreter.start());
         }
     }
     public static void main(String[] args) throws Exception {
         boolean reading = false;
         // search pattern beginning http[^\s]
         String regex = "h\\w+p\\S+";
         // try-with-resources closes the scanner and the underlying readers
         try (Scanner input = new Scanner(new BufferedReader(
                 new FileReader("/home/geekfest/eclipse/about.html")))) {
             // hasNextLine() ends the loop cleanly; nextLine() never returns null
             while (input.hasNextLine()) {
                 String searchString = input.nextLine();
                 // match search pattern in search string
                 //parseString(regex, searchString);
                 reading = true;
             }
         }
         if (!reading)
             return;
         String testString = "abracadabracadab";
         parseString("abr?", testString);
         System.out.println("-----------------------------");
         parseString("abr??", testString);
         System.out.println("-----------------------------");
         parseString("(abr)+", testString);
     }
}
/* Output
found pattern: abr at index 0
found pattern: abr at index 7
found pattern: ab at index 14
-----------------------------
found pattern: ab at index 0
found pattern: ab at index 7
found pattern: ab at index 14
-----------------------------
found pattern: abr at index 0
found pattern: abr at index 7
*//

This example is useful to demonstrate a number of simple techniques. Firstly, we implement a simple parseString() method by creating a Pattern object, and then calling the matcher() method of this object to create a Matcher object, which we are then able to query repeatedly within our loop for information about each match, in particular using the group() and start() methods.

In main(), we make use of a Scanner object to read a file line by line into a String object, against which we would test our regular expression, which describes a pattern that matches the first few letters of a URL. In this example, we have commented out the call that would print these matches, to keep the output of the illustration as compact as possible.
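To see what the commented-out call would produce without depending on a particular file, the same pattern can be run over text held in memory. The following is a minimal sketch under that assumption; the class name UrlScan, the scan() helper and the sample text with its example.org addresses are all made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; reads from a String instead of a file.
class UrlScan {
    // Scans the text line by line and collects every match of the URL pattern used above.
    static List<String> scan(String text) {
        Pattern url = Pattern.compile("h\\w+p\\S+");
        List<String> hits = new ArrayList<>();
        try (Scanner input = new Scanner(text)) {
            while (input.hasNextLine()) {
                Matcher matcher = url.matcher(input.nextLine());
                while (matcher.find()) {
                    hits.add(matcher.group());
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Made-up sample text standing in for the contents of a file
        String sample = "Docs: http://example.org/docs\n"
                + "no link on this line\n"
                + "Mirror: https://example.org/mirror";
        System.out.println(scan(sample)); // [http://example.org/docs, https://example.org/mirror]
    }
}
```

Bear in mind that the pattern is deliberately loose: it accepts any run that starts with 'h', reaches a 'p' and continues with non-whitespace characters, so it would also pick up ordinary words such as "helps".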

Nevertheless, we manually apply this method to a simple search string, the result of which is provided for your perusal. Notice the difference between the first two statements, which make use of different classes of quantifiers. The second statement uses the reluctant quantifier '??', which matches the minimum number of characters necessary to satisfy the pattern, while the other two statements use greedy quantifiers, which match as many characters as possible. Furthermore, consider the effect of the parentheses on the output from the third statement: because '+' applies to the whole group "(abr)", the trailing "ab" at index 14 no longer matches.
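Parentheses do more than group: each pair also captures the text it matched, which can be retrieved from the Matcher by number. The following is a minimal sketch of that idea; the class name GroupDemo, the split() helper and the pattern are hypothetical and simply reuse the flight strings from the earlier example:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class illustrating capturing groups.
class GroupDemo {
    // Splits a code such as "Flight-8957" into its letters and its digits.
    // Returns an empty array when the input does not match at all.
    static String[] split(String code) {
        Matcher matcher = Pattern.compile("([a-zA-Z]+)-?(\\d+)").matcher(code);
        if (!matcher.matches()) {
            return new String[0];
        }
        // group(0) is the whole match; groups 1 and 2 are the parenthesised parts
        return new String[] { matcher.group(1), matcher.group(2) };
    }

    public static void main(String[] args) {
        String[] parts = split("Flight-8957");
        System.out.println(parts[0] + " / " + parts[1]); // Flight / 8957
    }
}
```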

Formatting String

Another interesting thing you are able to do with strings is to control how they are written to a particular output destination. The Formatter class provides you with this kind of capability by acting as an interpreter for "format strings", by which means you are able to control the justification, alignment and spacing of string based output. Format strings are a combination of static text and format specifiers that describe the conversions, alignment, spacing and layout justification to be applied to output. Java's format specifiers follow this general syntax:-

%[argument_index$][flags][width][.precision]conversion

The first four elements, shown in square brackets, are optional and may take on a different meaning depending on the conversion to which they are applied. The conversion is mandatory and indicates how the argument should be formatted, which in turn determines the data types it accepts; the argument itself may optionally be selected by means of the argument_index$ element. Some of the typical conversions you should expect to be able to make include character conversions denoted by 'c', string conversions denoted by 's', numeric conversions of the integral and floating point kind denoted by 'd' and 'f' respectively, date/time conversions denoted by 't', and a number of less frequent others. For some of these conversions an upper case variant is equally valid, in which case the result is converted to upper case according to the rules of the prevailing locale, so for instance the conversions 't' and 'T' both refer to date/time conversions, except that the result of the latter will be in upper case.

The argument_index$ element is an integer that indicates the position of the required argument in the argument list, the first argument being referenced by "1$", while the optional flags element consists of a set of characters that modify the output in any one of a number of ways depending on the conversion. By contrast, the .precision element is a non-negative integer that usually acts to restrict the number of characters to be written, but whose precise functionality also depends on the conversion to which it relates, in exactly the same way as the flags. Finally, the width element is a positive integer that indicates the minimum number of characters to be written to output.
The Formatter class includes two variations of the format() method, both of which accept a format string and a list of arguments and write formatted output to the formatter's destination, with the difference being that one of the two takes an explicit locale. PrintStream offers equivalent format() methods, which is what we make use of with System.out in the following example:-

package regex.strings.java;

public class MatchResult {
     public static void main(String[] args) {
         String s = "Team";
         int i = 6;
         System.out.format("%2$s-B (%1$d - 0)  %2$s-A ", i, s);
     }
}
/* Output
Team-B (6 - 0)  Team-A
*//

In the above example, we can discern the pieces of static text in the format string argument to our format() method: "-B", "-", "0" and "-A", as well as the parentheses grouping the integers that represent team scores. Format specifiers make up the remainder of the string, each introduced by the percent '%' character. Reading the format string from left to right: the specifier %2$s places the second argument, a string, to the left of the static text "-B" to produce "Team-B"; %1$d then places the first argument, an integer, to produce "6"; and finally %2$s places the second argument again to produce "Team-A". The argument list comprises two arguments, an integer and a string.
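
Because '%' always introduces a format specifier, a literal percent sign has to be written as "%%". The following is a minimal sketch showing this alongside the argument index; the class name ArgumentIndexDemo and its two helpers are hypothetical, and Locale.ROOT is passed only so that the digits do not depend on the machine's default locale:

```java
import java.util.Locale;

// Hypothetical helper class illustrating argument_index$ and the %% escape.
class ArgumentIndexDemo {
    // %2$s picks the second argument (used twice), %1$d picks the first.
    static String score(String team, int goals) {
        return String.format(Locale.ROOT, "%2$s-B (%1$d - 0)  %2$s-A", goals, team);
    }

    // %% produces a single literal percent sign in the output.
    static String percent(int value) {
        return String.format(Locale.ROOT, "%d%%", value);
    }

    public static void main(String[] args) {
        System.out.println(score("Team", 6)); // Team-B (6 - 0)  Team-A
        System.out.println(percent(50));      // 50%
    }
}
```
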
Here is another example that makes use of the flags, width and precision elements:-
package regex.strings.java;

public class FormatString {
     public static void main(String[] args) {
         System.out.format("%-20s  \n%(.2f  \n%25s", "left-justify", -3.186329, "right-justify");

     }

}
/* Output
left-justify
(3.19)
             right-justify
*//

The format string in this example is separated into three lines of text by the interposing newline escape '\n'. The first and third lines specify a width immediately after the percent '%' character (20 and 25 respectively). The first line also uses the '-' flag to left-justify its output within that width; the default is right-justified, as may be deduced from the position of the last line of output. The second line of the format string uses the '(' flag to enclose the negative value in parentheses, and the .precision element to indicate the number of digits after the radix point for our floating point conversion.
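
The interplay of flags, width and precision is easier to see with visible delimiters around each result. The following is a minimal sketch; the class name WidthAndPrecision and its fmt() helper are hypothetical, and Locale.ROOT is used so that the decimal separator is always a full stop:

```java
import java.util.Locale;

// Hypothetical helper class; square brackets in the format strings mark the field edges.
class WidthAndPrecision {
    // Formats a single argument with a fixed, locale-neutral formatter.
    static String fmt(String format, Object arg) {
        return String.format(Locale.ROOT, format, arg);
    }

    public static void main(String[] args) {
        System.out.println(fmt("[%8s]", "ab"));        // [      ab]  width 8, right-justified
        System.out.println(fmt("[%-8s]", "ab"));       // [ab      ]  '-' flag left-justifies
        System.out.println(fmt("[%.3s]", "abcdef"));   // [abc]       precision truncates a string
        System.out.println(fmt("[%8.3s]", "abcdef"));  // [     abc]  width and precision together
        System.out.println(fmt("[%(.2f]", -3.186329)); // [(3.19)]    '(' flag, two decimal places
        System.out.println(fmt("[%08.2f]", 3.14159));  // [00003.14]  '0' flag pads with zeros
    }
}
```
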
To employ this technique to your advantage, you must become proficient with the format specifiers that form a fundamental part of the format string, in order to write more efficient and flexible code. The following class formats a list of staff as a table using format strings:-
package pattern.matching.java;

class Format {
     // method names start with a lower-case letter, per Java convention
     void formatHeader() {
         System.out.format("%-10s %5s %10s  \n", "Name", "DoB", "Age");
         System.out.format("%-10s %5s %10s  \n", "----", "---", "---");
     }
     void formatEntry(String name, String dob, int age) {
         System.out.format("%-9s %-11s %4d  \n", name, dob, age);
     }
}
public class Printer {
     public static void main(String[] args) {
         Format style = new Format();
         style.formatHeader();
         style.formatEntry("John", "09/06/1981", 33);
         style.formatEntry("Amy", "20/11/1985", 29);
         style.formatEntry("Karyn", "02/02/1978", 36);
     }
}
/* Output
Name         DoB        Age
----         ---        ---
John      09/06/1981    33
Amy       20/11/1985    29
Karyn     02/02/1978    36
*//

Essentially, each format string divides the output into three columns of unequal width, to some of which we apply the '-' flag to left-justify output. You will find that the '-' flag and the width element most often work closely together to control the spacing and position of output. In this example, the arguments consist mostly of string conversions and one integer conversion. The example demonstrates how format strings make it possible to write output whose presentation we can tightly control. You should consult your official documentation for an in-depth explanation of format specifiers.
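
The same format strings can also be handed to a Formatter object directly, which is useful when the output should end up somewhere other than the console. The following is a minimal sketch that collects one table row in a StringBuilder; the class name StaffTable and the row() helper are hypothetical, and Locale.ROOT is an assumption made to keep the result independent of the default locale:

```java
import java.util.Formatter;
import java.util.Locale;

// Hypothetical helper class; formats a row into a String instead of printing it.
class StaffTable {
    // Uses the same column layout as the entry format in the example above.
    static String row(String name, String dob, int age) {
        StringBuilder out = new StringBuilder();
        // The Formatter appends to the StringBuilder it was constructed with
        try (Formatter formatter = new Formatter(out, Locale.ROOT)) {
            formatter.format("%-9s %-11s %4d", name, dob, age);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(row("John", "09/06/1981", 33));
        System.out.println(row("Amy", "20/11/1985", 29));
    }
}
```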

Strings are not a trivial topic by any means, and overall they involve a level of complexity sufficient to satisfy any curious mind. In particular, regular expressions can be challenging, but there is absolutely nothing about them that should constitute a barrier to those interested in discovering their truth.