Why Excel
Even though there are many file formats for software localization, Excel is here to stay. To a loc engineer, this might look crazy at first: why do people use Excel when there is XML, JSON, PO, and many other formats that are well defined, reliable, predictable, and, from the engineering perspective, just work? Then, as you continue to work with teams relying on Excel in real life, you realize Excel’s strong points. For example, it works relatively well with large amounts of text to localize. A PM can open an Excel file in Excel itself and has standard methods to verify its contents, to search, sort, and filter, or even to make some changes. With just about any other localization format, this requires special knowledge about the format, which PMs may not have, or they might be afraid of messing up the files when they try to edit them. With Excel, engineers might even be able to develop macros to automate some preparation, finalization, or verification tasks. For translation teams, and for customers as well, Excel is powerful and flexible, and above all, familiar. The initial learning curve is gentle, since everybody knows (some) Excel, and the potential to improve processes with software tricks is great.
What you get
You’ll learn how you can deepen your knowledge and solve hard-core loc engineering challenges involving memoq and Excel. The information below can save you, as an engineer or developer, months of learning and trial and error, with inevitable failures. (At the very least it may start you on the right track; it contains explanations of ideas rather than production source code.) If you are not the developer type but want to improve your processes for dealing with Excel files, it might give you some pointers about what is possible. I’ll refer to memoq regularly in the text because that is the translation tool I’ve spent most of my professional life with (including the many years at memoq Ltd designing and building it), but the ideas may apply to loc engineering with other tools as well.
Multilingual Excel columns
One very general problem with Excel in software localization concerns “multilingual” Excel files with a source string column and translation columns for target languages. The problem is that the column setup is almost always all over the place. Even from the same customer, you just about never get files that have the same columns in the same place. Sometimes source is in column A, sometimes in B because A holds a string ID this time. Headers typically don’t help much because there’s no consistency there either: the source column may be called “Source string”, or “EN”, or “English”, and so on.
They could look like this:
Or like this, or something different again:
Even if you managed to convince all your customers to use a standard column setup, memoq still trips you up if the columns ever “move”: you can’t just tell memoq that your source column is the one that says “English” in the header row. memoq needs to know the column letter instead; what you need to tell memoq is that the source column is ‘A’. If it is ‘B’ next time (because the string ID is in ‘A’), tough luck: you need to create a new filter configuration for each new column setup. I’ve seen teams that literally created hundreds or thousands of filter configurations over a few years. (And then you can start to have various performance or technical issues in memoq, because the memoq developers never expected their users to have so many filter configs…)
My solution to this was to develop a tool that “standardizes” Excel files before the translation workflow. It reads each column, and follows a simple logic:
The user can create a template that lists each column header and tells the tool what to do with it. For example, if a header cell says any of “Source”, “English”, “EN”, etc., then that column is identified as the source string column, and its contents are copied over to column A in the “standardized” sheet. If it is any of “German”, “DE”, etc., then it is moved to a column that is dedicated to German translations; for example, column K could be for German strings, always. The purpose of the template file is to define the mappings between the headers that occur in your translatable files and the headers and header order you want in your standardized files.
When I developed this tool for my colleagues, the template was an Excel file so that everybody could work with it, and it looked something like this:
The first row is the actual header that the standardized sheet will contain. (This is a minimal example; in practice I’ve created templates with columns for many target languages, and other columns for comments or length limits.) Any additional rows hold all the header name “mutations” from the files we need to support. So, in our example, if an incoming file has any of “ID”, “String identifier”, or “String ID” in a header cell of any of its columns, then the template file tells us that it is our “string ID” column, and every cell from that column is copied over to column ‘A’ of the standardized sheet, named “String ID”. The same goes for each language column name. If a new file type comes in with yet another new name for the English column in yet another location, we just need to add it under “English” in the template file. The template file was placed on a network share available to every PM, and they could edit it as they wished.
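To make the idea concrete, here is a rough Python sketch of the standardization logic. The template shape, header names, and function names are all made up for illustration; the real tool read its template from an Excel file and handled far more columns:

```python
# A minimal sketch of the header-standardization idea, with made-up names.
# Each sheet is just a list of rows; the template maps a standard header to
# the header "mutations" that may occur in incoming files.
TEMPLATE = {
    "String ID": ["ID", "String identifier", "String ID"],
    "English":   ["Source", "Source string", "EN", "English"],
    "German":    ["DE", "German", "Deutsch"],
}

def standardize(sheet):
    """Rearrange the columns of `sheet` into the template's standard order."""
    header = sheet[0]
    # Invert the template: every known header variant points to its standard name.
    variant_to_standard = {
        v.casefold(): std for std, variants in TEMPLATE.items() for v in variants
    }
    # Find which incoming column (if any) feeds each standard column.
    source_index = {}
    for i, cell in enumerate(header):
        std = variant_to_standard.get(cell.strip().casefold())
        if std is not None:
            source_index[std] = i
    standard_headers = list(TEMPLATE)
    out = [standard_headers]
    for row in sheet[1:]:
        out.append([
            row[source_index[h]]
            if h in source_index and source_index[h] < len(row) else ""
            for h in standard_headers
        ])
    return out

incoming = [
    ["String identifier", "EN"],
    ["MSG_001", "Hello"],
]
print(standardize(incoming))
# → [['String ID', 'English', 'German'], ['MSG_001', 'Hello', '']]
```

Because the matching is case-insensitive and driven entirely by the template, supporting a new header spelling means editing the template, not the code.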
In the translation workflow, what is imported into memoq is this standardized sheet. The string ID is always in column A, English in B, German always in K, and so on. This way, you need just one filter config in memoq, and you always use that. The original sheet remains in the file but is hidden and skipped by the translation tool.
When translation is finished, there is a post-processing step to perform, which takes the translations from your standardized sheet and copies them back to the original sheet they came from. Finally, the standardized sheet is deleted, and the customer receives a translated Excel file in the same form it was sent to you, with all the columns under their own lovely unique names and in their original locations.
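The copy-back step just reverses the mapping: for each column of the original sheet whose header is recognized, overwrite its cells with the corresponding column of the standardized sheet. Again a hypothetical Python illustration, not the actual tool, with sheets as plain lists of rows:

```python
# A hedged sketch of the post-processing (copy-back) step; all names and
# structures are made up. `original` keeps its own header order, while
# `standardized` uses the standard one. Row counts are assumed to match.
def copy_back(original, standardized, header_variants):
    """Copy cells from `standardized` back into `original`, column by column.

    `header_variants` maps each standard header (e.g. "German") to the
    header spellings that may occur in the original file.
    """
    std_index = {h: i for i, h in enumerate(standardized[0])}
    variant_to_standard = {
        v.casefold(): std for std, vs in header_variants.items() for v in vs
    }
    for col, cell in enumerate(original[0]):
        std = variant_to_standard.get(cell.strip().casefold())
        if std in std_index:
            for r in range(1, len(original)):
                original[r][col] = standardized[r][std_index[std]]
    return original

original = [
    ["ID", "Source", "DE"],
    ["MSG_001", "Hello", ""],
]
standardized = [
    ["String ID", "English", "German"],
    ["MSG_001", "Hello", "Hallo"],
]
variants = {"String ID": ["ID"], "English": ["Source"], "German": ["DE"]}
print(copy_back(original, standardized, variants))
# → [['ID', 'Source', 'DE'], ['MSG_001', 'Hello', 'Hallo']]
```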
Importing CSV instead of Excel into memoq
When you have to import really enormous Excel files into memoq, which isn’t that uncommon in games localization, you will find that the import is slow and also uses a lot of memory. This can lead to different types of issues, depending on how the file import happens:
If your PMs import them manually, it might take forever because their laptops will run out of RAM and start “swapping” (using their hard drives or SSDs as RAM), which is orders of magnitude slower than real RAM. Running out of RAM can literally make the difference between an import that finishes in minutes and one that takes hours.
If the file imports are done via any sort of automation (memoq’s API, or its simple connector sidekick called “content connector”), then the import actually happens on your memoq server machine. Then your IT people will be scratching their heads, because they’ll see a memoq server that hardly uses any RAM in normal daily operation but has some crazy occasional RAM use peaks whenever these huge Excel files are imported. If your team doesn’t have the right expertise, they might never even figure out where those RAM peaks come from.
I found that when you need to import huge amounts of content from Excel, and there is no significant formatting (for example, because it is software localization, so everything is plain text), then importing the exact same content from a CSV file is both much faster and takes way less RAM. If you think about it, the real-life burden placed on the server is practically “RAM use” multiplied by “time to import”. Even if importing a file still uses 1 GB of RAM, it is beneficial if it finishes twice as fast, because it occupies that RAM for a shorter time.
Excel files can be converted to CSV before the memoq translation workflow and converted back to Excel after export. If you have such a piece of software developed (the right way!), you can also use memoq’s template automation to make this work for you without any manual steps: you just drop an Excel file on memoq’s PM window, and the file will get converted to CSV by memoq, and then imported. When exporting after translation, the back-conversion is also done under the hood, and what the PM gets out of memoq is an Excel file again.
Unfortunately, CSV isn’t a proper file format standard like XML is, for example. I would call it a semi-standard with some rules that are often broken by software that produces CSV files. At first sight, a CSV file is just rows of cell values separated by commas. But things get interesting when your cell value contains a comma itself, or a new line. You can put that cell value inside a pair of double quotes, but then what happens if the value itself needs to contain a double quote character? I’ve covered the practicalities of working with CSV instead of Excel for memoq users here:
https://www.linkedin.com/pulse/csv-localization-when-why-how-use-place-excel-gergely-vandor/
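As an illustration of those quoting rules: Python’s standard csv module applies the same convention Excel uses, namely that a field containing a comma, newline, or double quote is wrapped in double quotes, and embedded double quotes are escaped by doubling them. This is exactly the kind of detail worth delegating to a battle-tested library rather than hand-rolling:

```python
import csv
import io

# A field containing a comma, a newline, and an embedded double quote --
# all three of the cases that break naive "split on comma" CSV handling.
tricky = 'She said "hi", then\nleft'

buf = io.StringIO()
csv.writer(buf, lineterminator="\n").writerow(["MSG_001", tricky])
encoded = buf.getvalue()
print(encoded)
# MSG_001,"She said ""hi"", then
# left"

# Reading it back recovers the original value exactly, newline and all.
row = next(csv.reader(io.StringIO(encoded)))
assert row == ["MSG_001", tricky]
```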
After a fair bit of blood, sweat, and tears, as well as some pulling of hair, I managed to develop a tool that converts an Excel file to CSV exactly as Excel itself does, such that the conversion to CSV plus the CSV import into memoq is much faster and uses way less RAM than importing the Excel file into memoq directly.
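The core of such a converter can be sketched as a streaming loop. This is a hypothetical sketch, not the actual tool: the in-memory row list at the bottom stands in for a lazy XLSX reader (such as openpyxl’s read-only mode), which is what keeps RAM use flat for huge files:

```python
import csv

def rows_to_csv(rows, out_path):
    """Stream rows to a CSV file one at a time, with Excel-style quoting.

    `rows` is any iterable of row sequences; in a real converter it would be
    a lazy reader over the XLSX, so memory use stays flat regardless of file
    size. newline="" is required so the csv module controls line endings, and
    utf-8-sig writes a BOM so Excel reliably detects the file as UTF-8.
    """
    with open(out_path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.writer(f, lineterminator="\n")
        for row in rows:
            writer.writerow(row)

# Hypothetical usage, with an in-memory list standing in for a streaming
# XLSX reader:
rows = [("String ID", "English"), ("MSG_001", "Hello, world")]
rows_to_csv(rows, "strings.csv")
```

The back-conversion after export is the mirror image: read the CSV row by row and write the cells into the original workbook’s columns.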
Developing pre- and post-processing tools for Excel
In the previous section, I mentioned a pair of pre-processing and post-processing tools that I developed to convert an Excel file to CSV before translation, and back to Excel again after translation. But you might need to do other things with Excel files, like re-arranging the columns.
If you are a developer who sets out to do something like this, or you have access to a developer, there are many different software libraries and frameworks that can process Excel files for you.
Arguably the easiest way, or at least the one with the gentlest initial learning curve for loc engineers who already have Excel experience, is using Microsoft’s “Excel Interop” library, or even just plain old Excel macros. These are frameworks that, simply put, allow you to write software code that automates a copy of Excel and does things for you, often orders of magnitude faster than a human clicking around in Excel could. When you develop a macro or a piece of standalone software using Excel Interop, you practically have a set of software tools in your hands that can do just about anything you can do manually in Excel. You want to freeze panes to fix your header row in place and make it bold? You can do it with a few lines of code. You want to save to CSV? Again, just one line of code with some parameters to a SaveAs method. Using Interop or macros is all wonderful until you realize that your code will never be usable in automation or integration in a server environment, because of how it works at the basic level: it starts an instance of Excel for you (or even runs from inside Excel itself, if it is a macro), and that Excel instance performs the tasks you tell it to do. This is just impossible on a server machine, not only because the server may not even have Excel installed, but for a list of other technical reasons. And if you somehow managed to jump through all the hoops, it is also forbidden by Microsoft’s licensing terms.
The second option in my search was a family of commercial software libraries called Aspose. Aspose has all the same benefits as Excel Interop or macros: you can write code to do just about anything that one can do in Excel manually, and it is reasonably straightforward for a developer who understands Excel. The big selling point is that it does all this without requiring Excel itself, so you can develop software that runs in your automation and on your server machine just fine. This combination of giving you most of the functionality of Excel without actually relying on Excel in any way is Aspose’s killer feature. Also, Aspose has libraries not just for Excel but also for the other Office formats, and much more. The big drawback is that you need to fork over quite a bit of money for an Aspose license, in a licensing scheme that depends on the number of developers, the scale and method of distribution of the software you develop using Aspose, and so on. It may be a great option if you need a commercial solution with (pretty good and responsive) tech support behind it, and money is no object. Another caveat: if you run into situations where performance and efficiency are important (as I did with the huge Excel files I needed to convert to CSV and back), Aspose might not be the option that is fastest or uses the least RAM.
The third common option is Microsoft’s set of OpenXML libraries. Now, one might say that the “new” (yes, I’m old) Office formats with an x at the end of the file extension (XLSX, DOCX, PPTX) are just *.zip files with a bunch of standardized and well-documented XML files inside that describe an Excel workbook, a Word document, and so on. This is all true, but the file formats are extremely complex and not very intuitive. There are millions of features, and if you set out to develop a tool that directly manipulates these files, there’s a lot of room for error and many ways to get lost. The OpenXML libraries were created to make this easier for you as a developer. If you are already familiar with Excel macros or Interop, then moving to OpenXML means that you are no longer using Excel as a robot to do this and that for you: saving to CSV or sorting your sheet is no longer just one line of code. Instead, you have a set of developer tools to read or manipulate an XLSX file in a way that is more reasonable than unzipping it and directly manipulating its contents. On the plus side, the OpenXML libraries are supported by Microsoft and don’t need Excel to run, so, again, they are a good fit for your automation or integration tasks on the server side.
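You can see the “zip of XML parts” structure with nothing but a standard library. The sketch below builds a stand-in archive with the typical part names of a workbook (the XML bodies are placeholders, not a valid workbook) and then lists its parts, exactly the way you could inspect a real XLSX:

```python
import zipfile

# Build a tiny stand-in archive using the part names a real XLSX contains.
# (The XML bodies here are placeholders, not valid SpreadsheetML.)
parts = {
    "[Content_Types].xml": "<Types/>",
    "xl/workbook.xml": "<workbook/>",
    "xl/worksheets/sheet1.xml": "<worksheet/>",
    "xl/sharedStrings.xml": "<sst/>",
}
with zipfile.ZipFile("demo.xlsx", "w") as z:
    for name, xml in parts.items():
        z.writestr(name, xml)

# Any XLSX, including a real one saved by Excel, can be opened the same way:
with zipfile.ZipFile("demo.xlsx") as z:
    print(sorted(z.namelist()))
# → ['[Content_Types].xml', 'xl/sharedStrings.xml', 'xl/workbook.xml',
#    'xl/worksheets/sheet1.xml']
```

Renaming a real .xlsx to .zip and unzipping it shows the same layout, which is precisely the complexity the OpenXML libraries shield you from.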
The list of options doesn’t end here: there are some free and open-source software libraries that can read and manipulate Excel files. They have different purposes: some of them you can use in Python, some of them are extremely performant if you just need to read an Excel sheet from top to bottom to import it into a database or to convert to CSV. Some of them work on top of OpenXML but make development much easier for you.
In my role as a custom tools developer, I’ve worked with all of these options (Aspose, OpenXML, Interop, and open-source libraries). For the converter from XLSX to CSV and back, I used OpenXML, but I would probably look at some of the best open-source libraries if I had to do it again today, because they are faster and more efficient for a “simple” top-down, cell-by-cell conversion. When I had to create something more complex, I used Interop to get easy access to powerful Excel features, and sacrificed the ability to use the resulting tool in server-side automation. Instead, I developed tools that PMs could use on their own PCs (with Excel installed), interactively, with a GUI.