


Jack Shirazi's Blog

Integrating HTML validation to my site building process

Posted by jacksjpt on July 14, 2004 at 04:46 AM | Comments (8)

I generate my website using a local servlet container and JSP pages that convert text source to HTML pages, and then I upload all the pages to the server. Inspired by reading Cleaning Your Web Pages with HTML Tidy, I decided it was about time I had my HTML validated. But I wanted to do it as an integral part of the build process, not as an afterthought. That way, if HTML errors crept into the pages for whatever reason, they would be flagged immediately. It turned out to be extremely easy to do.

First off, I am already building my pages locally using a Java program which connects to my local servlet container, asks for each page, and then stores it locally. This allows me to have a dynamic page display process for building my pages, giving me all the power and flexibility of servlets and JSPs. The result is a set of static pages which I can upload to my internet site, JavaPerformanceTuning.com, where they download extremely fast.
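
The fetch-and-store step described above might look roughly like the sketch below. This is not the actual build program, which the post does not show; the class name PageFetcher and the method savePage are hypothetical, and the sketch assumes each page can be requested with a plain URL.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, for illustration only.
final class PageFetcher {

    /**
     * Requests one page from the given URL and stores the response body
     * in the target file, overwriting any previous copy.
     */
    static void savePage(URL pageUrl, Path target) throws IOException {
        try (InputStream in = pageUrl.openStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
```

A build loop would call savePage once per page, passing the page's address on the local servlet container and the local file it should be written to, and could then hand that file to the validation step.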

So all I had to do to add HTML validation was add one method to my build process. Once each page is complete and loaded into a local file, I simply added a call to a new validateHTML(File destinationfile) method.

My validateHTML method basically calls the "Tidy" executable on the newly created HTML file (Tidy validates and corrects HTML, and is available here). Then I check Tidy's output for anything I'm interested in. If there is a problem, I throw an exception.

I use Process to execute Tidy as an external process. I could process Tidy's stdout and stderr directly from the program, but there is no need: it is much simpler to have Tidy dump these to files and then check those files. I don't actually use Tidy's HTML output for my web pages; I'm really using it only as a validator. It is worth noting that the W3 organization has a validator at http://validator.w3.org/ if you only need to check some pages, but in my case I wanted to have all my pages checked each time I re-built the site.
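
As a rough sketch of launching Tidy this way, the snippet below uses ProcessBuilder with Tidy's -o (corrected HTML file) and -f (warnings and errors file) options. The class and method names (TidyRunner, tidyCommand, runTidy) are hypothetical, and the sketch assumes a Tidy executable can be found under the name passed in.

```java
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, for illustration only.
final class TidyRunner {

    /** Builds the argument list: -o names the corrected-HTML file, -f the report file. */
    static List<String> tidyCommand(String tidyExecutable, File fixedOut, File report, File html) {
        return Arrays.asList(tidyExecutable,
                "-o", fixedOut.getPath(),
                "-f", report.getPath(),
                html.getPath());
    }

    /** Runs Tidy, waits for it to finish, and returns its exit status. */
    static int runTidy(String tidyExecutable, File fixedOut, File report, File html)
            throws IOException, InterruptedException {
        Process p = new ProcessBuilder(tidyCommand(tidyExecutable, fixedOut, report, html))
                .inheritIO() // anything Tidy still prints goes to our console, so no pipe can fill up
                .start();
        return p.waitFor();
    }
}
```

Passing the arguments as a list rather than one concatenated string means a file name containing spaces reaches Tidy as a single argument.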

I am only interested in the line notification warnings and errors that Tidy emits, so I use a regular expression to detect and parse those lines. In addition, there are some warnings that I don't really care to fix at the moment, so I have added the ability to ignore those, either on a per-file basis or globally (see the two entries in the TidyNoficationsToIgnore HashMap for examples).
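
To make the parsing and ignore rules concrete, here is a small standalone sketch. The names TidyLineFilter and reportable are hypothetical, and it uses a Set of keys rather than a HashMap, which is a simplification on my part: a key is either a bare message (ignored everywhere) or "filename+message" (ignored in that file only).

```java
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class TidyLineFilter {

    // Matches report lines of the form "line 12 column 5 - Warning: ..."
    static final Pattern LINE_NOTIFICATION =
            Pattern.compile("^line\\s+(\\d+)\\s+column\\s+(\\d+)\\s+-\\s+(.*)$");

    /**
     * Returns the message part of a report line if it is a line notification
     * that is not on the ignore list, otherwise null.
     */
    static String reportable(String reportLine, String fileName, Set<String> ignored) {
        Matcher m = LINE_NOTIFICATION.matcher(reportLine);
        if (!m.matches()) {
            return null; // not a "line N column M - ..." notification
        }
        String message = m.group(3);
        if (ignored.contains(message) || ignored.contains(fileName + "+" + message)) {
            return null; // ignored globally or for this file
        }
        return message;
    }
}
```

With "Warning: trimming empty <p>" in the ignore set, a report line carrying that warning is skipped for every file, while a "newsletter013.shtml+..." entry only silences the warning in that one file.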

Finally, if I do find a problem, I like to print the error and relevant line from the HTML file so that I can see where it is and what to fix.
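
One way to assemble such a message is sketched below: given the lines of the HTML file and the 1-based line number from Tidy's report, it returns the error followed by the preceding line and the flagged line. The names ErrorContext and describe are hypothetical, and holding the whole file as a list of lines is an assumption made for brevity.

```java
import java.util.List;

// Hypothetical helper, for illustration only.
final class ErrorContext {

    /**
     * Builds "line N / message" followed by the line before the flagged one
     * and the flagged line itself. lineNumber is 1-based, as in Tidy's report.
     */
    static String describe(List<String> htmlLines, int lineNumber, String message) {
        String nl = System.lineSeparator();
        StringBuilder sb = new StringBuilder("line " + lineNumber + " / " + message);
        // Previous line, if there is one
        if (lineNumber >= 2 && lineNumber - 1 <= htmlLines.size()) {
            sb.append(nl).append(htmlLines.get(lineNumber - 2));
        }
        // The flagged line itself
        if (lineNumber >= 1 && lineNumber <= htmlLines.size()) {
            sb.append(nl).append(htmlLines.get(lineNumber - 1));
        }
        return sb.toString();
    }
}
```

The list of lines could come from something like Files.readAllLines on the generated page.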

Here's the code in case anyone else needs to resolve this problem in a similar way. If you have problems getting Tidy to execute, it's probably a path issue so you might try using the path to the executable in the command, e.g. .\Tidy or ./Tidy
//Note I am putting this code fragment in the public domain
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

//The class name is a placeholder; the original fragment does not show it
public class SiteBuilder
{
  //Matches Tidy report lines such as "line 12 column 5 - Warning: ..."
  public static final Pattern TidyHTMLLineNotification = Pattern.compile("^line\\s+(\\d+)\\s+column\\s+(\\d+)\\s+-\\s+(.*)$");
  //Keys are either "message" (ignored everywhere) or "filename+message" (ignored in that file only)
  static Map<String, Boolean> TidyNoficationsToIgnore = new HashMap<String, Boolean>();
  static
  {
    TidyNoficationsToIgnore.put("newsletter013.shtml+Warning: discarding unexpected </p>", Boolean.TRUE); //ignore in this file only
    TidyNoficationsToIgnore.put("Warning: trimming empty <p>", Boolean.TRUE); //always ignore
  }
  public static void validateHTML(File destinationfile)
    throws IOException, InterruptedException
  {
    //tt.txt contains fixed HTML if you want it.
    //t2.txt contains Tidy's warnings and errors.
    //Passing the arguments as an array keeps file names containing spaces intact
    String[] command = {"Tidy", "-o", "tt.txt", "-f", "t2.txt", destinationfile.getPath()};
    Runtime.getRuntime().exec(command).waitFor();
    BufferedReader rdr = new BufferedReader(new FileReader("t2.txt"));
    try
    {
      String line;
      while( (line = rdr.readLine()) != null)
      {
        //Only interested in lines beginning with "line"
        if (line.startsWith("line "))
        {
          Matcher m = TidyHTMLLineNotification.matcher(line);
          if (m.matches())
          {
            String linenumstr = m.group(1); //group(2) is the column, not used here
            String message = m.group(3);
            //Per-file keys use the bare file name, so they match whatever directory the file is in
            if ( !TidyNoficationsToIgnore.containsKey(message) &&
                 !TidyNoficationsToIgnore.containsKey(destinationfile.getName()+'+'+message) )
            {
              //line number in destinationfile of problem. Read the file
              //and get that line and the line before
              int linenum = Integer.parseInt(linenumstr);
              String l2 = null, l1 = null;
              BufferedReader rdr2 = new BufferedReader(new FileReader(destinationfile));
              try
              {
                for (int i = 0; i < linenum; i++)
                {
                  l1 = l2;
                  l2 = rdr2.readLine();
                }
              }
              finally
              {
                rdr2.close();
              }
              String nl = System.getProperty("line.separator");
              throw new IOException("HTML Validation Problem Identified by Tidy in file " + destinationfile + ": line " +
                linenum + " / " + message + nl + l1 + nl + l2);
            }
          }
        }
      }
    }
    finally
    {
      //Close the report whether or not a problem was thrown
      rdr.close();
    }
  }
}

Comments
Comments are listed in date ascending order (oldest first)

  • One alternative and one improvement
    Another way which is also very powerful is Apache Forrest.

    You can speed up the upload time using Apache Ant and its selector, especially if you're using the current CVS version, which contains delayed creation of the cachefile.

    Jan Matèrne

    Posted by: unknown1 on July 14, 2004 at 06:55 PM

  • One alternative and one improvement
    Thanks, Forrest looks to me like it is an alternative way to generate docs. I prefer using the JSPs; it allows me to essentially have a purely local website which I can dynamically test at any time for any type of change, on a page-by-page basis if I like.

    The blog here was about adding HTML validation to the process, which I think you would also have to do if you use Forrest.

    I don't have any problems with time. Even including the integration with Tidy, my entire site is re-generated in under one minute on my laptop.

    Posted by: unknown1 on July 14, 2004 at 07:29 PM

  • One alternative and one improvement
    Yep - HTML validation is something which should also be done in other generation environments (JSP, Velocity, Forrest, Maven, ...). It seems that your solution could also be converted to an Ant task easily. So you can do


    The performance boost I suggested using the selector was about uploading the generated site: only upload the files whose content changed (not only the timestamp).

    So your process could be:




    Jan

    Posted by: unknown1 on July 14, 2004 at 08:31 PM

  • One alternative and one improvement
    Good points. In my case I am already only uploading the changed pages, but it is a proprietary routine so certainly the ANT config is better.

    The only problem with using ANT for the HTML validation is that I like to get the context of the error along with the error, as my solution provided. Mind you I suspect that could be done by post processing the Tidy error output and combining that with the HTML input files using some script, or possibly there is a better HTML validator to use which would give you the error in context rather than only line and column as Tidy provides.

    Posted by: jacksjpt on July 14, 2004 at 11:03 PM

  • One alternative and one improvement
    But looking more closely, I see your solution would work fine since it would use the proposed validate method :-) Nice.

    Posted by: jacksjpt on July 14, 2004 at 11:09 PM

  • JTidy
    See http://jtidy.sourceforge.net/

    Posted by: ejain on July 15, 2004 at 04:47 AM

  • JTidy
    Thanks! Silly of me not to have looked for it in the first place. If my current setup breaks, I'll see about moving to JTidy (I follow the "if it ain't broke, don't fix it" paradigm).

    Posted by: jacksjpt on July 15, 2004 at 05:21 AM





