Read office files with Java API
Last year, while working on a project, I ended up with a lot of documents (requirements, user guides, architecture, etc.) from different sources (email attachments, file shares, backups, old version control). Many had the same name but a different date and size.
So how do you know which one is the latest, so that the others can be deleted?
There were two ways to achieve that:
- Open each file and look at its properties, like date and last author. Very time consuming if you have many files, like I had (300 files).
- Programmatically read the properties and print them out. Time consuming too, but you do this only once and can write a blog about it (WoW).
So, in a short time, I developed a small program that reads a set of files and prints each file's name, last modification date and complete path.
It uses the OpenOffice SDK libraries, so you need OpenOffice 2.x installed somewhere. Currently it reads a small set of office file types: sxw, doc, xls, odt, ods, pps, ppt and odp. Feel free to modify it to suit your needs, or even extend its functionality.
The Java source code can be downloaded at DocViewer.java
* it's a UTF-8 file
** remove the .txt extension after download
As I am a Linux user, this was written with Linux in mind. Windows users have reported that it works with small modifications to its runtime settings, but I don't know which modifications are needed.
Requirements
Compile time
The following libraries are needed:
$OO_HOME/program/classes/juh.jar
$OO_HOME/program/classes/jurt.jar
$OO_HOME/program/classes/jut.jar
$OO_HOME/program/classes/ridl.jar
$OO_HOME/program/classes/unoil.jar
$OO_HOME points to the OpenOffice installation. For me it is installed at /opt/broffice.org2.3
* BrOffice is the official Brazilian version of OpenOffice
At runtime
- OpenOffice 2.x installation
- X Virtual Frame Buffer (Xvfb)
- Java (version 5 or more recent)
Compile
Very easy to compile
javac -classpath /opt/broffice.org2.3/program/classes/\* src/claudius/DocViewer.java
Note that I used a classpath wildcard, which requires JDK 6 or later. To compile with JDK 5, list each jar file explicitly in the classpath instead.
Run
To run it, the OpenOffice standalone program needs to be running. To avoid a graphical program popping up in a window hundreds of times, it can run in a non-graphical way. To achieve that I used X Virtual Frame Buffer (Xvfb), an X server that keeps its display in memory, which is useful for running graphical libraries on server machines.
The OpenOffice SDK connects to the OpenOffice program through sockets, as the standalone program does the real job of reading the office file.
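Since everything depends on that socket being available, it can help to check it before processing hundreds of files. The following is a minimal sketch using only the Java standard library, assuming the 127.0.0.1:8100 address used by the soffice command in this post. The class and method names are made up for illustration and are not part of DocViewer.java or the OpenOffice SDK. It only tests that something accepts TCP connections on the port, not that it actually speaks the UNO protocol.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper: checks whether something accepts TCP connections
// on the host/port where soffice was told to listen.
final class OfficeListenerCheck {

    // Returns true if a TCP connection can be opened within the timeout.
    static boolean isListening(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Connection refused or timed out: nothing usable is listening.
            return false;
        }
    }

    // Retries a few times, because soffice may need a moment to start up.
    static boolean waitForListener(String host, int port, int attempts, long pauseMillis)
            throws InterruptedException {
        for (int i = 0; i < attempts; i++) {
            if (isListening(host, port, 1000)) {
                return true;
            }
            Thread.sleep(pauseMillis);
        }
        return false;
    }

    public static void main(String[] args) throws InterruptedException {
        // 127.0.0.1:8100 matches the -accept option used in this post.
        boolean up = waitForListener("127.0.0.1", 8100, 10, 500);
        System.out.println(up
                ? "soffice is accepting connections"
                : "nothing is listening on port 8100");
    }
}
```

Running a check like this right after starting soffice avoids a confusing connection error on the first document.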
- Run the Xvfb program
Xvfb :5 -screen 0 800x600x16 &
- Load the OpenOffice program into memory and tell it to use the X server at display :5
$OO_HOME/program/soffice -accept="socket,host=127.0.0.1,port=8100;urp;" -display :5 -headless -norestore -invisible &
- Run the java program
java -classpath $CP claudius.DocViewer <path>
$CP points to the classpath defined previously
<path> points to a single file or a directory. If it is a directory, its subdirectories are searched too.
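The way a path argument like this can be expanded into a list of office files can be sketched with the standard library alone. This is not the code from DocViewer.java: the class name is hypothetical, the case-insensitive extension match is an assumption, and it uses java.nio.file, so it needs a much newer JDK than the Java 5 mentioned above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: collects office files below a file or directory.
final class OfficeFileFinder {

    // The file types listed in this post.
    static final Set<String> EXTENSIONS = new HashSet<>(Arrays.asList(
            "sxw", "doc", "xls", "odt", "ods", "pps", "ppt", "odp"));

    // True if the file name ends in one of the known extensions.
    // Ignoring case is an assumption, not necessarily what DocViewer does.
    static boolean isOfficeFile(Path file) {
        String name = file.getFileName().toString();
        int dot = name.lastIndexOf('.');
        if (dot < 0) {
            return false;
        }
        String ext = name.substring(dot + 1).toLowerCase(Locale.ROOT);
        return EXTENSIONS.contains(ext);
    }

    // Accepts a single file or a directory; directories are walked recursively.
    static List<Path> find(Path start) throws IOException {
        try (Stream<Path> paths = Files.walk(start)) {
            return paths.filter(Files::isRegularFile)
                        .filter(OfficeFileFinder::isOfficeFile)
                        .sorted()
                        .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        Path start = Paths.get(args.length > 0 ? args[0] : ".");
        for (Path p : find(start)) {
            Path abs = p.toAbsolutePath();
            System.out.println("dir = " + abs.getParent() + " file = " + abs.getFileName());
        }
    }
}
```

Because Files.walk also accepts a plain file, the same method covers both the single-file and the directory case.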
Result
The output will look similar to this
dir = /home/claudio/resources/palestras/2007/10_justjava file = diagnostico2.odp Modified by: Claudio Miranda 5/10/2007 17:46:8
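With records like this, picking the latest copy of each document becomes a small grouping step. The following is only a sketch of that idea and is not part of DocViewer.java: the Entry class, the sample paths, names and dates are all made up for illustration, and it assumes that two files with the same name are copies of the same document.

```java
import java.time.LocalDateTime;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: keeps the most recently modified copy per file name.
final class LatestCopy {

    // One report line; a made-up holder, not a type from DocViewer.java.
    static final class Entry {
        final String dir;
        final String file;
        final String modifiedBy;
        final LocalDateTime modified;

        Entry(String dir, String file, String modifiedBy, LocalDateTime modified) {
            this.dir = dir;
            this.file = file;
            this.modifiedBy = modifiedBy;
            this.modified = modified;
        }
    }

    // Groups entries by file name and keeps the one with the latest date.
    static Map<String, Entry> latestByName(List<Entry> entries) {
        Map<String, Entry> latest = new LinkedHashMap<>();
        for (Entry e : entries) {
            Entry current = latest.get(e.file);
            if (current == null || e.modified.isAfter(current.modified)) {
                latest.put(e.file, e);
            }
        }
        return latest;
    }

    public static void main(String[] args) {
        // Sample data invented for this example.
        List<Entry> entries = Arrays.asList(
                new Entry("/backup/docs", "guide.odt", "alice", LocalDateTime.of(2007, 3, 1, 9, 30)),
                new Entry("/share/docs", "guide.odt", "bob", LocalDateTime.of(2007, 6, 12, 14, 5)),
                new Entry("/mail/attachments", "plan.ods", "carol", LocalDateTime.of(2006, 11, 20, 8, 0)));

        for (Entry e : latestByName(entries).values()) {
            System.out.println("keep " + e.dir + "/" + e.file
                    + " (modified by " + e.modifiedBy + " at " + e.modified + ")");
        }
    }
}
```

Everything not printed as "keep" is a candidate for deletion, though it is worth reviewing the list by hand before removing anything.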
If this piece of code is useful or if you made any modification, please share it and write a comment.
Comments
by claudio - 2009-03-05 12:55
I recommend you take a look at other information sources, such as http://development.openoffice.org/#COMPONENTS Also, take a look at http://www.jopendocument.org/
by kszkaresz - 2009-03-04 07:46
Thanks, it's very useful info, but how can I iterate over paragraphs and words, and how can I get information about a character's properties (like font name, color, size, etc.)?
by claudio - 2008-02-21 06:18
With the OpenOffice SDK it's possible to create/modify MS Office documents. See its SDK documentation.
by gadominas - 2008-02-21 03:08
There's another API for this particular task: Apache POI (http://poi.apache.org/index.html). Check it out.
by nbw - 2008-02-20 20:12
Do these libraries allow you to write M$ Office compatible documents (.doc/.ppt etc.)? If not, can you recommend anything?