|
YOUR FEEDBACK
Did you read today's front page stories & breaking news?
SYS-CON.TV |
TOP THREE LINKS YOU MUST CLICK ON Java Desktop Unlocking Microsoft Office Documents
An open source alternative
By: Ryan Ackley; Avik Sengupta
Aug. 5, 2004 12:00 AM
If you've ever written software to be used by business managers, you will no doubt have received requests for interoperability with the Microsoft Office Applications. "Get me the report in Excel; HTML doesn't cut it and I need to run my own analysis on it"; "Can you index the zillion word documents I have so that the whole organization can search on them?"; "I have all this data in Excel; do I have to enter it again on this Web page?".... These are things we commonly hear as application developers, which is not surprising given the ubiquity of MS Office. Does this mean you're forced to tie your application to Windows to interface with the COM APIs of Excel or Word? Apart from the fact that you don't want your language or platform decision to be constrained by a lack of choice, it's also important to note that these APIs can be unstable because they're automating a desktop application. Because of this, they are unreliable for any server-side deployment. For the Java developer, however, the power of Jakarta POI is close at hand.
POI is a pure Java application library for reading and writing the Microsoft OLE2 Compound Document Format (OLE2CDF) file formats. This format is used by (among others) various MS Office applications. As the name suggests, this is a format for storing multiple documents (or streams) in one file, for example, storing an embedded spreadsheet along with a presentation. Within this structure are stored the records that contain the application-specific data. POI is structured along these lines. At its base it has a component known as the POIFS or the POI File System, which is the most complete implementation of the OLE2CDF structure in Java. Layered above this are the components to read the Excel record structures (HSSF) or the Word record structures (HWPF). HSSF How easy does HSSF make reading Excel files? See for yourself!
InputStream in = new FileInputStream("data.xls"));
HSSFWorkbook wb = new HSSFWorkbook(in);
HSSFSheet sheet = wb.getSheetAt(0); // the 1st sheet
HSSFRow row = sheet.getRow(1); // get the 2rd row
HSSFCell cell = row.getCell((short)1); // the 2nd cell of the 2nd row
The model of an Excel document in HSSF begins with the HSSFWorkbook object. This object provides access to the sheets (by name or number), which in turn provides access to the rows (HSSFRow) in the sheet. Each row provides access to the individual cells (HSSFCell) it contains. From the cell object you can retrieve data contained in that cell via accessor methods, depending on the type of data. Listing 1 provides an example. Given this object model, writing is equally simple. Instead of "get"-ing rows and columns, you "create" them and then "set" the values in the cells as in Listing 2. Once again, start with the HSSFWorkbook class, whose default constructor provides a new workbook object; then populate the workbook by creating a sheet in which you create rows. In each row create the cells you need. Finally, populate the cells with the data. As Listing 2 shows, a cell can contain integers, floats, strings, and dates. Styles To start applying styles to cells, first create an instance of an HSSFStyle class: HSSFStyle myStyle = wb.createCellStyle() // wb is an HSSFWorkbook object The style object will now provide you with methods to set various style parameters, such as foreground and background colors, fonts, borders, and data formats, via conventionally named setters. Data Formats It's easy to get the index, however. For a built-in format, use the static getBuiltinFormat method in the HSSFDataFormat class. Give it the format string and it will return the correct index, the proper index for you. To set a format: myStyle.setDataFormat(HSSFDataFormat.getBuiltinFormat("d-mmm-yy"); For a user-defined format, first get an instance of HSSFDataFormat from an HSSFWorkbook object to ensure that your format is registered with the workbook:
HSSFDataFormat df = wb.createDataFormat();
myStyle.setDataFormat(df.getFormat("dd%MMM%yyyy"));
If you don't want to worry about which formats are user defined (it's documented in the Javadocs for HSSFDataFormat), simply use the nonstatic method and it will take care of this issue internally. When you have defined the style you want, just set it to the cell: cell.setCellStyle(myStyle); Reuse the same style object for cells that are similarly formatted - do not create new style objects for each cell, since Excel has an upper limit on the number of styles that can be referenced in a workbook. For example, you could create one style object for the table headers, one for the body, and one for the footer and use them throughout your spreadsheet. Formulas Formulas are created using the setCellFormula method of an HSSFCell object. The input to this method is a string containing the formula you want at that cell. It should be in the same format that you would type into the edit box in Excel (without a leading "="), thus: cell.setCellFormula("A1+A2^2"); You could use any built-in VBA function, or even a user-defined function, in the formulas:
Cell.setCellFormula("average(A1:B1)");
cell.setCellFormula("mySpecialFunction(A1/A2)");
If you need to provide your users with the ability to copy-paste or drag an Excel formula in the resultant sheet correctly, you might want to use absolute references instead of relative. If formulas with relative cell references (the default, e.g., A1) are copied from one cell and pasted to another, the cell references in the formulas change relative to the destination cell. cell.setCellFormula("A1/$A$25"); However, if the formula contains references that are absolute, they stay the same irrespective of the destination cell. Absolute references are specified by adding a $ symbol to the reference, viz. $A$1. Note that the row and the column can be individually addressed while specifying absolute references, viz. A$1 vs $A1. You can also reference other sheets in the same workbook in the formula. HSSF does not yet support the ability to write formulas referencing external workbook files.
Cell.setCellFormula("SUM(Sheet1!A1-Sheet1!A2)");
// formula in cell A1 of Sheet2.
Note, however, that the formula results are not calculated by HSSF, which is really a file format reader and writer, not a functional replacement for a spreadsheet application. The formula is merely written into the file in the proper format and evaluated when the file is opened in Excel. Finally However, there are always features of an Excel file that POI does not yet support. In such scenarios, templates are invaluable. The idea is to create an empty Excel spreadsheet populated with the attributes that POI doesn't support. You could, for example, create a chart in the spreadsheet referencing named ranges, or create a pivot table in a certain area. At runtime, in Java code, you could read the workbook in with POI and fill in the cells with data from your application. Now when the user opens the workbook in Excel, it comes loaded with data, charts, and pivot tables. Listing 3 provides an example. Hopefully this overview of HSSF has convinced you that HSSF has almost all it takes to create professionally produced Excel spreadsheets that'll be a joy to your users, and leave them asking for more. Word Documents with HWPF I am going to give a short introduction to the high-level structure of a Word document. These are basic concepts that can be applied to most styled document formats and they will make later sections of this article easier to digest. A Word document can be modeled as a tree-like structure. Figure 1 illustrates this. The document has sections, a section has paragraphs, and a paragraph has character runs. Each instance of these is associated with a range of text.
1 FileInputStream in =
2 new FileInputStream("C:\\test.doc");
3 HWPFDocument doc =
4 new HWPFDocument(in);
The Section, Paragraph, and CharacterRun classes represent the document tree that I explained earlier. I walk that tree in Listing 4. First, I get the Range object for the entire document. This is the entry point to the object model. The Range class is an important piece of the HWPF API. It represents an arbitrary range of text in the document, with one to many sections, paragraphs, and character runs. The Section, Paragraph, and CharacterRun classes extend the Range class. The methods numSections(), numParagraphs(), and numCharacterRuns() and the correlating getters are actually implemented in the Range class. Of course, if you call numSections() on a Paragraph object, it will return one. That would be the parent Section of that Paragraph object. Another important method in the Range class is text(). This can be used to get the plain text for a particular range. To get the text for a document, use the following code: String plainText = doc.getRange().text(); Once we have an instance of a Section, Paragraph, or CharacterRun object, we can read its properties by calling its various getters. //Check the number of columns //for this section Section sect = r.getSection(x); sect.getNumColumns(); //See if a paragraph is set to //have a page break before it. Paragraph par = sect.getParagraph(y); boolean breakBefore = par.pageBreakBefore() //Get the font name of a //character run CharacterRun run = par.getCharacterRun(z); String font = run.getFontName(); These are quick examples. There are dozens of settings and there isn't enough space to cover them all. I encourage you to read the Javadoc to see what is possible. Tables Listing 5 touches every paragraph in the document, looking for one with the table flag set. When it finds one, it passes it to the getTable method on line 8. Notice on line 12 that it's necessary to increment x so that the paragraphs that were part of the table aren't processed again. Tables have TableRows, which in turn have TableCells. All these classes extend Range so you can use all the methods that I've already talked about for getting the contents of these entities. Lists
1 for (int x = 0; x < numPars; x++)
2 {
3 Paragraph par =
4 range.getParagraph(x);
5
6 if (par instanceof ListEntry))
7 {
8 ListEntry entry = (ListEntry)par;
9
10 //do something with the entry...
11 }
12 }
Adding New Content The writing functionality of HWPF is somewhat experimental so expect some bugs and limited features. Modifying an existing document or creating a new Word document from scratch starts the same way - simply create a new HWPFDocument as shown in an earlier example. The only difference is that if you want to create one from scratch, you start with a blank document. The POI distribution comes with one called "blank.doc." To commit any changes to a physical file and see what they do, you must write out the modified document. The following code writes out a Word document that contains any changes made to the original object model.
FileOutputStream docOut =
new FileOutputStream(
"C:\\testout.doc");
doc.write(docOut);
To be safe, I wouldn't recommend overwriting the original document. HWPF attempts to keep things that it doesn't directly support in the file, but this doesn't guarantee that they will be there when it writes the file out again. The Section, Paragraph, and CharacterRun classes define setters that allow the various properties of existing content to be changed. The Range class defines the following methods for adding text and paragraphs to a document.
The methods that insert a paragraph require a style index. Paragraphs and character runs store their settings as deltas from a style stored in the stylesheet. Styles provide a convenient way to maintain a consistent look and feel in a document. They also help a person creating a Word document through a user interface to be more efficient. To a programmer this may not matter. No matter what the style is, whatever properties are set for a particular Paragraph or CharacterRun object will appear in the document. I recommend just using the number 0 for a style index. This will always refer to the "Normal" style in the stylesheet. Editing Tables Adding Lists 1 HWPFList list = new HWPFList(true, 2 doc.getStyleSheet()); 3 4 int listID = doc.registerList(list); The HWPFList constructor takes two arguments. The first one is a boolean determining whether the list should be bulleted (if the argument is false, the list will be numbered), and the second is the stylesheet of the document to which the list will belong. The register List method that I call on the method on line 4 is defined in HWPFDocument. It returns a unique ID that's needed when adding a list entry to the document. The Range class defines more insert methods for adding list entries.
Summary If POI doesn't cut it, there is a wide selection of commercial libraries for working with Excel, such as SoftArtisans OfficeWriter. SoftArtisans (www.softartisans.com) is the only vendor I could find that also offers a product that can create Word documents in pure Java. OfficeWriter also supports every feature of Word and Excel. With the new agreement between Sun and Microsoft, we may one day see the opening of the Microsoft file formats. While you wait for this day to come, POI provides a free open source alternative. References SIDEBAR A Guide to POI Versions The 1.5.1 version released early 2002 was the preferred production version for a long time. But after a long series of new features, followed by a longer period of bugfixes and stabilization, the 2.0 version was released in January 2004. Subsequently, the 2.5 version was released in late February 2004 to incorporate a major new piece of functionality - the ability to create drawings in Excel sheets via what is known as the Escher Layer. Meanwhile, development had been ongoing in an experimental branch to enable the reading and writing of Word documents (HWPF). Unfortunately, it's necessary to download this piece of POI directly from CVS and compile it yourself. There are many excellent and free client applications for accessing CVS repositories such as WinCVS and jCVS. SIDEBAR 2 Getting Started YOUR FEEDBACK
LATEST JAVA STORIES & POSTS
SUBSCRIBE TO THE WORLD'S MOST POWERFUL NEWSLETTERS SUBSCRIBE TO OUR RSS FEEDS & GET YOUR SYS-CON NEWS LIVE!
|
SYS-CON FEATURED WHITEPAPERS MOST READ THIS WEEK SPONSORED BY INFRAGISTICS
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||