Creating Microsoft Office files on Google App Engine

Posted in Google App Engine, Java, Microsoft Office on January 1, 2010 by stephenhuey

Happy New Year!

One month ago, I was a bit frustrated trying to figure out how to create zip files and Microsoft Office documents with Java on Google App Engine.  As you can see in the Will it Play? list, some tried-and-true libraries like Apache POI are not yet supported there.  Our project requires our GAE application to generate pretty Microsoft Word and Excel reports that our users can download for further editing.  I briefly considered using the Google Docs API, passing data from GAE to Google Docs and constructing the reports there, but that looked like a huge number of API calls outside of GAE, every single one of which would cost money once we passed the free quota!

An alternative would be to construct the relatively new Microsoft Office 2007 format in the GAE sandbox, because it’s essentially text-based whereas the old format was binary.  You may have started receiving MS Office attachments a couple of years ago that you couldn’t open in Word or Excel, and if you looked closely, you’d have noticed the Word extension was .docx instead of .doc and the Excel one was .xlsx instead of .xls (no, I also don’t get warm fuzzies from the fact that they merely added an “x” onto the file extension).  Rename one of these files so it ends in .zip and you’ll find any zip utility will open it, because it’s actually a zip file containing text files (XML, to be specific) and images.  Microsoft named the format Office Open XML, and it has been their standard since MS Office 2007; you can even open these files with older versions of Microsoft Office if you download and install the free Microsoft Office Compatibility Pack.  Granted, a zip file is binary, but if I could successfully generate one on GAE, then everything else I’d be working with would be supported on GAE (text files and binary image files that I didn’t have to manipulate).
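To see the "it's just a zip file" point without renaming anything, here is a minimal sketch that lists the parts inside an Office Open XML file using only java.util.zip.  The class and method names (OoxmlPartLister, listParts) are made up for illustration and are not part of any library.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Hypothetical helper: lists the part names stored inside a .docx or .xlsx file. */
final class OoxmlPartLister {

    /** Reads the stream as a zip archive and returns every entry name in stored order. */
    static List<String> listParts(InputStream ooxml) throws IOException {
        List<String> names = new ArrayList<String>();
        ZipInputStream zin = new ZipInputStream(ooxml);
        try {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                names.add(entry.getName()); // e.g. "word/document.xml"
            }
        } finally {
            zin.close();
        }
        return names;
    }
}
```

On a desktop JVM you could call it as OoxmlPartLister.listParts(new FileInputStream("report.docx")), where report.docx is a placeholder for any Word 2007 document you have lying around.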

So I was excited to hear about GaeVFS, a virtual file system for Google App Engine (GAE gives you no file system access, so a lot of the usual Java calls related to files are not supported).  After playing with it, I was gung-ho about trying to create the relatively new Microsoft Office format, since GaeVFS would make it easier for me to construct these zip files of XML files and images.  My GAE-generated zip files were recognized on Mac OS X first, and WinZip on Windows wasn’t difficult to please either, but the last holdout was Windows XP Compressed Folders (which had to accept a file before any MS Office program would).  Finally, I got the free Microsoft Word Viewer to happily open a GAE-generated .docx file right around mid-December, and I used the same code to generate a valid .xlsx file as well.

Because I was aiming for a proof-of-concept to verify that I could create these Office documents on GAE, I manually unzipped a .docx file created in Microsoft Word and uploaded its files into GAE so I could write some Java to stick them all into a zip file.  I know, I know, you’re probably already groaning at the thought of having to do all that.  Building these zip and .docx files should really be a simple matter in Java, and in a traditional environment, there’d be no fuss about it at all.  But due to some limitations in GAE, this slightly more painful workaround is necessary, and it’d be a showstopper if we couldn’t do this on GAE, so I had to make sure it would work before doing any more development!

I snagged a photo from this MSDN page so you can see what to expect when you unzip one of these newfangled MS Office files:


The contents of a .docx file

Love it or hate it, that’s what we’re dealing with folks!  Note that the media folder is where the images go, and the top-level _rels folder has a file in it that has no name before the extension, so it’s just called .rels (which means it won’t show up in the Finder in Mac OS X even though it’s really there).  Okay, now for my sample code…

GaeVFS ships with a servlet that handles file uploads and abstracts how it writes them to its virtual file system.  You can use a simple upload page like this to get your files into the GAE datastore:

 <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
 <head>
 <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
 <title>File Upload</title>
 </head>

 <body>
 <form action="/gaevfs/" enctype="multipart/form-data" method="post">
 <p>
 Path on server:<br>
 <input type="text" name="path" size="30" value="/gaevfs">
 </p>
 <p>
 Block size in KB (leave blank for default):<br>
 <input type="text" name="blocksize" size="10">
 </p>
 <p>
 File to upload:<br>
 <input type="file" name="filename" size="40">
 </p>
 <div>
 <input type="submit" value="Send">
 </div>
 </form>
 </body>
</html>

Make sure to use the correct paths when saving these files into the GaeVFS file system:


Uploading the .docx file parts

That should work if your web.xml has mappings like this:


<?xml version="1.0" encoding="utf-8"?>
<web-app xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xmlns="http://java.sun.com/xml/ns/javaee"
         xmlns:web="http://java.sun.com/xml/ns/javaee/web-app_2_5.xsd"
         xsi:schemaLocation="http://java.sun.com/xml/ns/javaee
         http://java.sun.com/xml/ns/javaee/web-app_2_5.xsd" version="2.5">
    <!-- the servlet that builds the .docx (package matches the Java source below) -->
    <servlet>
        <servlet-name>DocxWriterServlet</servlet-name>
        <servlet-class>com.stephenhuey.docx.DocxWriterServlet</servlet-class>
    </servlet>
    <servlet-mapping>
        <servlet-name>DocxWriterServlet</servlet-name>
        <url-pattern>/word</url-pattern>
    </servlet-mapping>
    <!-- the upload and directory-listing servlet that ships with GaeVFS -->
    <servlet>
        <servlet-name>gaevfs</servlet-name>
        <servlet-class>com.newatlanta.commons.vfs.provider.gae.GaeVfsServlet</servlet-class>
        <init-param>
            <param-name>dirListingAllowed</param-name>
            <param-value>true</param-value>
        </init-param>
        <init-param>
            <param-name>initDirs</param-name>
            <param-value>/gaevfs/images,/gaevfs/docs</param-value>
        </init-param>
    </servlet>
    <servlet-mapping>
        <servlet-name>gaevfs</servlet-name>
        <url-pattern>/gaevfs/*</url-pattern>
    </servlet-mapping>
    <servlet-mapping>
        <servlet-name>gaevfs</servlet-name>
        <url-pattern>/WEB-INF/*</url-pattern>
    </servlet-mapping>
    <welcome-file-list>
        <welcome-file>index.html</welcome-file>
    </welcome-file-list>
</web-app>

GaeVFS is built on top of Apache Commons VFS, and you use their FileObject instead of the standard File class.  There are some examples on the GaeVFS wiki, but I had to play around for a bit before I figured out some of the differences I needed to know.  Here’s a lil’ class I made for my servlet to use:

package com.stephenhuey.docx;

/*
 * @author Stephen Huey
 *
 */

import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

import org.apache.commons.vfs.FileObject;
import org.apache.commons.vfs.FileSystemException;
import org.apache.commons.vfs.FileSystemManager;
import org.apache.commons.vfs.FileType;

public class FileObjectHelper {

    // Resolves the path and creates the folder only if it is not already there.
    public static FileObject createFolder(FileSystemManager fsManager, String absolutePath) throws FileSystemException {
        FileObject theFolder = fsManager.resolveFile(absolutePath);
        if (!theFolder.exists()) {
            theFolder.createFolder();
        }
        return theFolder;
    }

    // Resolves the path and creates an empty file only if it is not already there.
    public static FileObject createFile(FileSystemManager fsManager, String absolutePath) throws FileSystemException {
        FileObject theFile = fsManager.resolveFile(absolutePath);
        if (!theFile.exists()) {
            theFile.createFile();
        }
        return theFile;
    }

    // Zips everything under directoryToZip into docxZipFile.
    public static void zipDir(FileObject docxZipFile, FileObject directoryToZip) throws IOException {
        OutputStream out = docxZipFile.getContent().getOutputStream();
        ZipOutputStream zout = new ZipOutputStream(out);
        try {
            addDir(directoryToZip, zout, "");
        } finally {
            zout.close(); // make sure you close the ZipOutputStream, not the OutputStream!
        }
    }

    // Recursively adds the children of dirObj; basePathSoFar is the entry-name prefix ("" at the root).
    public static void addDir(FileObject dirObj, ZipOutputStream zout, String basePathSoFar) throws IOException {
        FileObject[] files = dirObj.getChildren();
        byte[] tmpBuf = new byte[1024];

        for (int i = 0; i < files.length; i++) {
            FileObject currentFile = files[i];
            String currentFileBaseName = currentFile.getName().getBaseName();

            if (currentFile.getType().equals(FileType.FOLDER)) {
                addDir(currentFile, zout, basePathSoFar + currentFileBaseName + "/");

            } else { // else it's a file, not a directory
                BufferedInputStream bis = new BufferedInputStream(currentFile.getContent().getInputStream());
                try {
                    zout.putNextEntry(new ZipEntry(basePathSoFar + currentFileBaseName));
                    int len;
                    while ((len = bis.read(tmpBuf)) != -1) {
                        zout.write(tmpBuf, 0, len);
                    }
                    zout.closeEntry();
                } finally {
                    bis.close(); // release the input stream even if writing the entry fails
                }
            } // end if
        } // end for loop
    } // end addDir
}
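The comment in zipDir about closing the ZipOutputStream deserves a word of explanation.  A likely reason it matters: a zip archive ends with a central directory, and java.util.zip.ZipOutputStream only writes it when close() (or finish()) is called on the zip stream itself.  Here is a small sketch that shows the difference using in-memory streams only; the class name ZipCloseDemo and its methods are hypothetical, and the entry content is a placeholder rather than real document XML.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

/** Hypothetical demo: compares a properly closed zip with one where only the sink was closed. */
final class ZipCloseDemo {

    /** True if the bytes end with a comment-free end-of-central-directory record (22 bytes, "PK\5\6..."). */
    static boolean endsWithCentralDirectory(byte[] zip) {
        int start = zip.length - 22;
        return start >= 0
                && zip[start] == 'P' && zip[start + 1] == 'K'
                && zip[start + 2] == 5 && zip[start + 3] == 6;
    }

    /** Writes one entry and returns the raw bytes, optionally skipping close() on the zip stream. */
    static byte[] writeOneEntry(boolean closeZipStream) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        ZipOutputStream zout = new ZipOutputStream(sink);
        zout.putNextEntry(new ZipEntry("word/document.xml"));
        zout.write("<placeholder/>".getBytes("UTF-8"));
        zout.closeEntry();
        if (closeZipStream) {
            zout.close();  // writes the central directory, then closes the sink
        } else {
            zout.flush();
            sink.close();  // closing only the underlying stream leaves the archive unfinished
        }
        return sink.toByteArray();
    }
}
```

ZipCloseDemo.endsWithCentralDirectory(ZipCloseDemo.writeOneEntry(true)) returns true, while the same check on writeOneEntry(false) returns false, and many zip readers locate entries through that trailing record.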

And this is my servlet that makes sure the files are there and calls for the zip files to be created and saved with a .docx extension:

package com.stephenhuey.docx;

/*
 * @author Stephen Huey
 *
 */

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import org.apache.commons.vfs.FileObject;
import org.apache.commons.vfs.FileSystemManager;

import com.newatlanta.commons.vfs.provider.gae.GaeVFS;

@SuppressWarnings("serial")
public class DocxWriterServlet extends HttpServlet {

    public void doGet( HttpServletRequest req, HttpServletResponse res ) throws IOException {

        GaeVFS.setRootPath( getServletContext().getRealPath( "/" ) );
        FileSystemManager fsManager = GaeVFS.getManager();
        try {

            List<FileObject> files = new ArrayList<FileObject>();

            FileObject relsFolder = FileObjectHelper.createFolder(fsManager, "gae://gaevfs/docxFiles/_rels");
            FileObject docPropsFolder = FileObjectHelper.createFolder(fsManager, "gae://gaevfs/docxFiles/docProps");
            FileObject wordFolder = FileObjectHelper.createFolder(fsManager, "gae://gaevfs/docxFiles/word");

            // subfolders under the word folder
            FileObject relsSubfolder = FileObjectHelper.createFolder(fsManager, "gae://gaevfs/docxFiles/word/_rels");
            FileObject mediaSubfolder = FileObjectHelper.createFolder(fsManager, "gae://gaevfs/docxFiles/word/media");
            FileObject themeSubfolder = FileObjectHelper.createFolder(fsManager, "gae://gaevfs/docxFiles/word/theme");

            // the only file in the top-level directory
            files.add(FileObjectHelper.createFile(fsManager, "gae://gaevfs/docxFiles/[Content_Types].xml"));

            // files in the docProps folder
            files.add(FileObjectHelper.createFile(fsManager, "gae://gaevfs/docxFiles/docProps/app.xml"));
            files.add(FileObjectHelper.createFile(fsManager, "gae://gaevfs/docxFiles/docProps/core.xml"));

            // files in the word folder
            files.add(FileObjectHelper.createFile(fsManager, "gae://gaevfs/docxFiles/word/document.xml"));
            files.add(FileObjectHelper.createFile(fsManager, "gae://gaevfs/docxFiles/word/fontTable.xml"));
            files.add(FileObjectHelper.createFile(fsManager, "gae://gaevfs/docxFiles/word/settings.xml"));
            files.add(FileObjectHelper.createFile(fsManager, "gae://gaevfs/docxFiles/word/styles.xml"));
            files.add(FileObjectHelper.createFile(fsManager, "gae://gaevfs/docxFiles/word/webSettings.xml"));

            // files in the _rels subfolder that's within the word folder
            files.add(FileObjectHelper.createFile(fsManager, "gae://gaevfs/docxFiles/word/_rels/document.xml.rels"));

            // files in the theme subfolder that's within the word folder
            files.add(FileObjectHelper.createFile(fsManager, "gae://gaevfs/docxFiles/word/theme/theme1.xml"));

            // files in the media subfolder that's within the word folder
            files.add(FileObjectHelper.createFile(fsManager, "gae://gaevfs/docxFiles/word/media/image1.jpeg"));
            files.add(FileObjectHelper.createFile(fsManager, "gae://gaevfs/docxFiles/word/media/image2.png"));

            try {
                FileObject docxRootFolder = FileObjectHelper.createFolder(fsManager, "gae://gaevfs/docxFiles");
                FileObject docxZipFile = FileObjectHelper.createFile(fsManager, "gae://gaevfs/generatedZip/wordDocumentGAE3.docx");
                FileObjectHelper.zipDir(docxZipFile, docxRootFolder);
            } catch (Exception e) {
                e.printStackTrace();
            }
        } finally {
            GaeVFS.clearFilesCache(); // this is important!
        }
        res.setContentType("text/plain");
        res.getWriter().println("Done!\n");
    }

    public void destroy() {
        GaeVFS.close(); // this is not mandatory, but nice to do
    }

}

Now just go to
http://localhost:8888/word

in your web browser, and once that runs, you can find your .docx file at

http://localhost:8888/gaevfs/generatedZip

Here’s what it looks like:


Finding the generated .docx in the GaeVFS file system

If you can open .docx files in your version of Microsoft Office, or if you have installed the free Microsoft Word Viewer along with the free Microsoft Office Compatibility Pack, then you can download that file right away and verify that it looks exactly the same as the one you used as a starting point.

Of course, if you really need to create MS Office files on Google App Engine, you’ll most likely be dynamically generating the parts. I already knew I could create XML and handle images in GAE, so the unknown for me was getting to this point. A website Microsoft created called OpenXMLDeveloper.org is unfortunately not very well-organized, and the Java examples aren’t all that helpful, but you may find something there you can use. I’ll probably just write my own helper classes to build exactly the kinds of documents I need.
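As a rough idea of what generating the parts dynamically could look like, here is a sketch that builds a bare-bones .docx entirely in memory with java.util.zip, skipping the uploaded template files.  It assumes that three parts ([Content_Types].xml, _rels/.rels and word/document.xml) are enough for a plain one-paragraph document, which is the commonly cited minimum but has not been checked here against every Office version or viewer.  MinimalDocxBuilder and its methods are hypothetical names, not the helper classes mentioned above.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

/** Hypothetical sketch: builds a one-paragraph .docx in memory from three generated XML parts. */
final class MinimalDocxBuilder {

    private static final String XML_HEADER = "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?>";

    private static final String CONTENT_TYPES = XML_HEADER
            + "<Types xmlns=\"http://schemas.openxmlformats.org/package/2006/content-types\">"
            + "<Default Extension=\"rels\" ContentType=\"application/vnd.openxmlformats-package.relationships+xml\"/>"
            + "<Default Extension=\"xml\" ContentType=\"application/xml\"/>"
            + "<Override PartName=\"/word/document.xml\" ContentType="
            + "\"application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml\"/>"
            + "</Types>";

    private static final String ROOT_RELS = XML_HEADER
            + "<Relationships xmlns=\"http://schemas.openxmlformats.org/package/2006/relationships\">"
            + "<Relationship Id=\"rId1\" Type="
            + "\"http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument\""
            + " Target=\"word/document.xml\"/>"
            + "</Relationships>";

    /** Returns the bytes of a .docx whose body is a single paragraph containing the given text. */
    static byte[] build(String text) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        ZipOutputStream zout = new ZipOutputStream(sink);
        try {
            addPart(zout, "[Content_Types].xml", CONTENT_TYPES);
            addPart(zout, "_rels/.rels", ROOT_RELS);
            addPart(zout, "word/document.xml", documentXml(text));
        } finally {
            zout.close(); // finishes the archive by writing the central directory
        }
        return sink.toByteArray();
    }

    /** Wraps the XML-escaped text in the smallest WordprocessingML body: one paragraph, one run. */
    static String documentXml(String text) {
        return XML_HEADER
                + "<w:document xmlns:w=\"http://schemas.openxmlformats.org/wordprocessingml/2006/main\">"
                + "<w:body><w:p><w:r><w:t>" + escape(text) + "</w:t></w:r></w:p></w:body>"
                + "</w:document>";
    }

    /** Escapes the three characters that are unsafe in XML element text. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    private static void addPart(ZipOutputStream zout, String name, String xml) throws IOException {
        zout.putNextEntry(new ZipEntry(name)); // forward slashes, no leading slash
        zout.write(xml.getBytes("UTF-8"));
        zout.closeEntry();
    }
}
```

Because the result is just a byte array, the same bytes could be written to a GaeVFS FileObject's output stream or straight to a servlet response.  Images, styles and themes would need extra parts and relationship entries that this sketch leaves out.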

A final word of caution for you:  I set this stuff aside for a few weeks after I got my solution working and started focusing on other parts of our application.  In the meantime I upgraded my Eclipse installation’s App Engine SDK from 1.2.8 to 1.3.0, so when I came back to get an example ready for this blog entry, I found that my generated .docx file was no longer valid!  In other words, MS Word would no longer open the file for some reason.  That was pretty scary, but I speculated that perhaps the datastore had become corrupted even though all the individual files in there seemed fine.  I reverted to 1.2.8 and uploaded them again and everything worked, and I’ve found other folks online saying that the local datastore can easily become corrupted.  When I switched my SDK back to 1.3.0, it stopped working again.  I had to upload the files with the GaeVFS servlet running on 1.3.0 to get my code to generate a valid .docx file on 1.3.0, which makes sense since the underlying datastore implementation could’ve changed enough to cause a problem.

By the way, my generated files are not recognizable by MS Office if I run this code on Mac OS X.  It’s fine if I run the code from my local app on Windows and also from my production app on appspot.com, but I suspect there may be an issue with how the virtual file system abstracts things on OS X or something like that which causes MS Office to reject the .docx and .xlsx files even though they’re recognized as zip files by Windows XP Compressed Folders.  Most important of course is the fact that the production version on appspot.com generates valid files!

Anyway, I’m glad it works and so far I’m really enjoying playing with Google App Engine.  It sounds like there are plenty of improvements planned in the near future for their Java support, so I’m looking forward to more goodies from the GAE team.

Let me know if you have any questions or if I need to fix something in this post.  Here’s to a great start on 2010!

And so it begins…

Posted in Uncategorized on December 21, 2009 by stephenhuey

Here’s hoping this site will be helpful to people.  The only reason I finally decided to put a blog up is because I have some GAE code I want to share with interested parties, and I figured this would be a good place to do it.  As much as I love other writing, this may end up being mostly technical stuff here, but we’ll see how it goes.

I always thought I’d run a blog myself with a custom WordPress install or my own hacked-up solution, but this free WordPress.com hosted account already took me long enough to get going without even paying to use my own custom CSS because I had to pick a free theme, edit a couple photos to put on here and add some basic widgets, so I have no regrets at the moment–I have plenty of other things to occupy my time today!  Rather than writing my own blogging app, I’d rather spend my time writing more unusual code and share that with the world instead.

Someday soon I’ll post some Java snippets related to Google App Engine, but right now I need to go for a quick jog.  In the meantime, I’ll leave you with a well-known poem.  While I don’t agree with all of it, Rudyard Kipling forces me to ponder my life’s direction and reflect on how I’m living my day:

If you can keep your head when all about you
Are losing theirs and blaming it on you,
If you can trust yourself when all men doubt you
But make allowance for their doubting too,
If you can wait and not be tired by waiting,
Or being lied about, don’t deal in lies,
Or being hated, don’t give way to hating,
And yet don’t look too good, nor talk too wise:

If you can dream–and not make dreams your master,
If you can think–and not make thoughts your aim;
If you can meet with Triumph and Disaster
And treat those two impostors just the same;
If you can bear to hear the truth you’ve spoken
Twisted by knaves to make a trap for fools,
Or watch the things you gave your life to, broken,
And stoop and build ‘em up with worn-out tools:

If you can make one heap of all your winnings
And risk it all on one turn of pitch-and-toss,
And lose, and start again at your beginnings
And never breathe a word about your loss;
If you can force your heart and nerve and sinew
To serve your turn long after they are gone,
And so hold on when there is nothing in you
Except the Will which says to them: “Hold on!”

If you can talk with crowds and keep your virtue,
Or walk with kings–nor lose the common touch,
If neither foes nor loving friends can hurt you;
If all men count with you, but none too much,
If you can fill the unforgiving minute
With sixty seconds’ worth of distance run,
Yours is the Earth and everything that’s in it,
And–which is more–you’ll be a Man, my son!

If, by Rudyard Kipling

Posted in Uncategorized on December 12, 2007 by stephenhuey

Some Gmail engineers have put up this video.  The first engineer is Julie, and that girl was one of my most appreciated programming partners at Rice.  She definitely was instrumental in helping me make it through several classes (not just computer science–thank God she was around during computational numerical analysis and anything that made me write mathematical proofs).  Julie grew up near Austin and was annoyed at the restrictions when she took her Texas-sized truck over to California.  I haven’t bothered to scroll back through my xanga to look, but I think there’s a photo of her when I drove to California eleven months ago.  She has gotten married since then. 

Posted in Uncategorized on June 9, 2007 by stephenhuey

I’ve already seen Once twice. 

So far, it’s only playing at the Landmark River Oaks.  If you want to see it, I recommend you don’t read any reviews or plot summaries, or watch any of the videos or trailers on the film’s website except perhaps the one linked above (of them performing live at Sundance). 

Posted in Uncategorized on May 19, 2007 by stephenhuey

Just watched Blood Diamond for the first time.  Despite the overwhelming presence of Hollywood throughout the film, that stuff is largely true…De Beers running the diamond world, etc.  But while I may have gotten hot-headed about topics such as that many times in my past life, that’s not where my thoughts were as I lay on the couch afterward. 

I remember discussing with a friend half a decade ago….it seems like much of my imagining of life in Africa is romanticized.  Still, there are so many memories and so many fantasies of what life could be like there, and with whom…people I’ve wanted to be with, people I miss, people I just realized I might never see again even though they’re in the same country as me.  Some people I’ve missed out on having real, transparent and unselfish conversations with through no fault other than my own.  How many times since I was a boy have I imagined taking the woman I marry to places that almost don’t exist anymore? 

It always seemed as if I remember a lot from when I was very young.  Tonight I realized how many remembrances have been forgotten, and I don’t even know when they stopped coming to mind.  So much to draw upon, so much to fuel my everyday living.  I don’t want to forget anymore!  

I don’t want to fear anymore.  I don’t want to be afraid to dream, to expect that the best is possible. 

Careful with the laptops, boys…

Posted in Uncategorized on May 15, 2007 by stephenhuey

This isn’t the first time I’ve heard of this!  I admit I’ve been more skeptical than protective, and I hope that doesn’t come back to burn me (them). 

I was wrong…

Posted in Uncategorized on May 15, 2007 by stephenhuey

Wired pointed out that on this day in 1939, a Peruvian girl gave birth to a son at the age of 5.  The urban legend watchdog site Snopes confirms the veracity of this story, and it’s repeated on thousands of other webpages.   Lina Medina experienced precocious puberty, so she had her first period at 8 months and became fully sexually developed within a few years.  In case you’d rather not stumble upon it, be advised that there is a nude photo of her on Wikipedia and other sites. 

Her parents didn’t even realize she was pregnant until they took her to see the doctor about the massive “tumor” growing in her belly!  This 1939 TIME article shows the early skepticism among medical professionals in the United States, but from current accounts, it sounds as if they came to accept the story as true and unsuccessfully tried to get the girl up for a visit.  Apparently, there was a similar pregnancy with a 6-year-old Russian girl, and precocious puberty is known to show up from time to time. 

My tennis coach in Nigeria showed me a tabloid newspaper article about some Argentinian girl having a child at the age of 4, and I argued with him about whether or not that was possible.  Now about 15 years later I finally found out I was wrong!