Read office files with Java API

Posted by claudio on February 20, 2008 at 4:23 PM PST

Last year, while working on a project, I had a lot of documents (requirements, user guides, architecture, etc.) gathered from different sources (email attachments, file shares, backups, old version control). Many had the same name but a different date and size.

So, how do you find out which copy is the latest and delete the others?

There were two ways to do that:

  1. Open each file and look at its properties, like date and last author.
  2. Programmatically read its properties and print them out.

And these are their costs:

  1. Very time consuming if you have many files, like I had (300 files)
  2. Time consuming, but you do it only once and can write a blog post about it (wow)

Naturally, I went with the second option.

In a short time I wrote a small program that reads a set of files and prints each one's name, last modification date, and complete path.
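Once name, date, and path are known for every copy, picking the latest one per file name is a small grouping step. The sketch below is only an illustration of that idea and is not taken from DocViewer.java: the `LatestCopy` and `Copy` names are hypothetical, and the timestamp is assumed to be already available as epoch milliseconds.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of DocViewer.java.
class LatestCopy {

    /** One copy of a document: file name, full path, modification time in epoch milliseconds. */
    static final class Copy {
        final String name;
        final String path;
        final long modifiedMillis;

        Copy(String name, String path, long modifiedMillis) {
            this.name = name;
            this.path = path;
            this.modifiedMillis = modifiedMillis;
        }
    }

    /** Returns, for each distinct file name, the copy with the newest modification time. */
    static Map<String, Copy> latestByName(List<Copy> copies) {
        Map<String, Copy> latest = new LinkedHashMap<String, Copy>();
        for (Copy c : copies) {
            Copy seen = latest.get(c.name);
            if (seen == null || c.modifiedMillis > seen.modifiedMillis) {
                latest.put(c.name, c);
            }
        }
        return latest;
    }

    public static void main(String[] args) {
        // Made-up sample data, only to show the call.
        List<Copy> copies = new ArrayList<Copy>();
        copies.add(new Copy("guide.odt", "/mail/guide.odt", 1000L));
        copies.add(new Copy("guide.odt", "/backup/guide.odt", 3000L));
        copies.add(new Copy("specs.doc", "/share/specs.doc", 2000L));
        for (Copy c : latestByName(copies).values()) {
            System.out.println(c.name + " -> " + c.path);
        }
    }
}
```

Every copy that is not in the returned map is a candidate for deletion, though it is probably wise to review the list by hand first.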

It uses the OpenOffice SDK libraries, so you need OpenOffice 2.x installed somewhere. Currently it reads a small set of office file types: sxw, doc, xls, odt, ods, pps, ppt and odp. Feel free to modify it to suit your needs, or even extend its functionality.
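To give an idea of how the file selection could work, here is a minimal sketch using only the standard library. It is an assumption about the approach, not the code from DocViewer.java: the `OfficeFileFinder` class and its methods are hypothetical, and it prints the file system timestamp rather than the document properties that the real program reads through OpenOffice.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Date;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of DocViewer.java.
class OfficeFileFinder {

    // Extensions named in the post; edit this set to support more formats.
    static final Set<String> EXTENSIONS = new HashSet<String>(Arrays.asList(
            "sxw", "doc", "xls", "odt", "ods", "pps", "ppt", "odp"));

    /** True when the file name ends in one of the known extensions (case-insensitive). */
    static boolean isOfficeFile(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return false; // no extension at all
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return EXTENSIONS.contains(ext);
    }

    /** Collects matching files under root, descending into subdirectories. */
    static List<File> find(File root) {
        List<File> found = new ArrayList<File>();
        collect(root, found);
        return found;
    }

    private static void collect(File entry, List<File> found) {
        if (entry.isDirectory()) {
            File[] children = entry.listFiles();
            if (children == null) {
                return; // directory could not be read
            }
            for (File child : children) {
                collect(child, found);
            }
        } else if (entry.isFile() && isOfficeFile(entry.getName())) {
            found.add(entry);
        }
    }

    public static void main(String[] args) {
        File root = new File(args.length > 0 ? args[0] : ".");
        for (File f : find(root)) {
            // File system timestamp only; document properties need OpenOffice.
            System.out.println(f.getName() + "  " + new Date(f.lastModified())
                    + "  " + f.getAbsolutePath());
        }
    }
}
```

Supporting another format is then a matter of adding its extension to `EXTENSIONS`, provided OpenOffice itself can open that format.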

The Java source code can be downloaded at DocViewer.java

* it's a UTF-8 file

** remove .txt extension after download

As I am a Linux user, this was written with Linux in mind. Windows users have reported that it works with small modifications to its runtime settings, but I don't know which modifications are needed.

Requirements

Compile time

The following libraries are needed:

$OO_HOME/program/classes/juh.jar
$OO_HOME/program/classes/jurt.jar
$OO_HOME/program/classes/jut.jar
$OO_HOME/program/classes/ridl.jar
$OO_HOME/program/classes/unoil.jar

$OO_HOME points to the OpenOffice installation. On my machine it is installed at /opt/broffice.org2.3

* BrOffice is the official Brazilian version of OpenOffice

At runtime

  • OpenOffice 2.x installation
  • X Virtual Frame Buffer (Xvfb)
  • Java (version 5 or more recent)

Compile

Compiling is straightforward:

javac -classpath /opt/broffice.org2.3/program/classes/\* src/claudius/DocViewer.java

Note that I used a classpath wildcard, which requires JDK 6 or later. To compile with JDK 5, list the five jars above explicitly in the classpath instead.

Run

To run it, the OpenOffice standalone program needs to be running. To avoid a graphical program popping up in a window hundreds of times, it can run without a visible display. For that I used X Virtual Frame Buffer (Xvfb), an X server that keeps its display in memory, which is useful for running graphical libraries on server machines.

The OpenOffice SDK connects to the OpenOffice program through a socket, and the standalone program does the real work of reading the office file.

  1. Run the Xvfb program
    Xvfb :5 -screen 0 800x600x16 &
  2. Load the OpenOffice program into memory and tell it to use the X server at display :5

    $OO_HOME/program/soffice -accept="socket,host=127.0.0.1,port=8100;urp;" -display :5 -headless -norestore -invisible &
  3. Run the java program
    java -classpath $CP claudius.DocViewer <path>

    $CP points to the classpath defined previously

    <path> points to a single file or a directory. If it is a directory, its subdirectories are searched too.
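Before step 3 it can help to confirm that soffice is really accepting connections on the port given in step 2. The following is a small sketch with plain Java sockets; the `OfficeProbe` class is hypothetical and not part of DocViewer.java. The `unoUrl` method builds the usual UNO connection string for a socket listener, but whether DocViewer.java uses exactly this string is an assumption.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper, not part of DocViewer.java.
class OfficeProbe {

    /** True when something accepts a TCP connection on host:port within the timeout. */
    static boolean isListening(String host, int port, int timeoutMillis) {
        Socket socket = new Socket();
        try {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false; // refused, unreachable, or timed out
        } finally {
            try {
                socket.close();
            } catch (IOException ignored) {
                // nothing useful to do if closing fails
            }
        }
    }

    /** Builds a UNO URL matching the -accept option used when starting soffice. */
    static String unoUrl(String host, int port) {
        return "uno:socket,host=" + host + ",port=" + port
                + ";urp;StarOffice.ComponentContext";
    }

    public static void main(String[] args) {
        String host = "127.0.0.1"; // same values as in the -accept option
        int port = 8100;
        if (isListening(host, port, 2000)) {
            System.out.println("soffice is reachable, UNO URL: " + unoUrl(host, port));
        } else {
            System.out.println("Nothing is listening on " + host + ":" + port);
        }
    }
}
```

A successful probe only shows that some process owns the port. It does not prove that the process is OpenOffice, so treat it as a quick sanity check.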

Result

The output will look similar to this:

dir  = /home/claudio/resources/palestras/2007/10_justjava
file = diagnostico2.odp
Modified by: Claudio Miranda 5/10/2007 17:46:8

If this piece of code is useful to you, or if you modify it, please share it and write a comment.

Comments

I recommend taking a look at other information sources, such as http://development.openoffice.org/#COMPONENTS. Also, take a look at http://www.jopendocument.org/

Thanks, it's very useful info, but how can I iterate over paragraphs and words, and how can I get character properties (font name, color, size, etc.)?

With the OpenOffice SDK it's possible to create/modify MS Office documents. See its SDK documentation.

There's another API for this particular task: Apache POI (http://poi.apache.org/index.html). Check it out.

Do these libraries allow you to write M$ Office compatible documents (.doc/.ppt etc.)? If not, can you recommend anything?