Two OCR packages for Linux compared

Joe Barr looks at two applications that could make document-management a snap for SOHOs

(LinuxWorld) — Linux is everywhere you look, from mainframes to handhelds to servers to desktops. One place where its growth has been something less-than-spectacular is in SOHO (small office/home office) use. Now that full-featured office suites are available for Linux, more SOHOs than ever are beginning to migrate from Windows to Linux. However, SOHO users often require more than just an office suite; they need accounting software for the books and payroll, they need fax capabilities (see Resources for these sorts of things), and sometimes they need something a bit more exotic. Like OCR (Optical Character Recognition), for example.

In the legal and medical fields, document management is a very big deal. In modern office environments, OCR often plays a key role in solving that problem. Because OCR for Linux is one area I don't hear or read a lot about, I decided to do some digging and see what I could find. This week, I'll tell you about two solutions I found: one from the free-software camp and one proprietary application.

Kooka & Gocr

Let's start with free. The free-software solution is actually a combination of two projects: Kooka and Gocr. Kooka is a KDE application that's part of the kdegraphics package. It provides a front end for SANE access to your scanner, and it calls Gocr as its OCR engine.

Those of you unfamiliar with the recent advances in compatibility between KDE and GNOME apps — and between competing distributions, as well — might think that getting Kooka and Gocr cooperating on my Red Hat 8.0 GNOME desktop would be a chore. Not so. It was as easy as falling off a log file.

First I ran up2date kooka. A little later on, I would have to run up2date kdebase as well, but I didn't know that yet. The kdebase package is required in order to dump Gocr's ASCII-text output into Kate, a KDE text editor.

When I tried up2date gocr, it wasn't found. No problem. I downloaded an RPM binary for Mandrake Cooker from rpmfind.net and installed it manually with rpm -Uvh gocr-0.37-2mdk.i586.rpm.

The up2date kooka process took care of everything, including adding Kooka to the Red Hat menu. I found Kooka under Extras -> Graphics, described simply as "Scan & OCR Program." Because my HP 5200C scanner had previously been configured for SANE, I was ready to go immediately.
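Readers who want to confirm that all the pieces are in place before scanning could script the check. The following is a minimal sketch in Python, not part of the original walkthrough; the list of executable names (kooka, gocr, kate) is an assumption drawn from the packages mentioned above, and the names may differ on other distributions.

```python
import shutil

# Executables the walkthrough relies on. This list is an assumption of the
# sketch; adjust it to match how your distribution names the programs.
REQUIRED_TOOLS = ("kooka", "gocr", "kate")


def find_tools(names=REQUIRED_TOOLS):
    """Map each executable name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


def missing_tools(names=REQUIRED_TOOLS):
    """Return the names that could not be found on PATH, in the order given."""
    found = find_tools(names)
    return [name for name in names if found[name] is None]


if __name__ == "__main__":
    for name, path in find_tools().items():
        print(f"{name}: {path or 'not found'}")
```

An empty list from missing_tools() suggests the scan-and-OCR chain is installed; a missing kate would correspond to the kdebase gap described earlier.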

I placed a recent mailing from AT&T in the scanner and set up the scanner in Lineart mode with a resolution of 300 dpi. Then I clicked the "Preview Scan" button. The HP 5200C woke up from its long slumber, whirred and clicked a bit, then made a fast pass under the letter. The preview window showed that it was seeing the letter and had it properly aligned. Next, I clicked the "Final Scan" button. When the final scan was complete, I saved the scanned image in PNG format. At this point, Kooka looked as you see in the image below:
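The same scan settings can probably be reproduced without the GUI through SANE's scanimage command-line tool, which is a different route from the Kooka buttons described above. This sketch only builds and runs the command; the --mode and --resolution options are provided by the scanner backend, so whether they exist, and which values they accept, is an assumption to verify with scanimage --help for your device.

```python
import subprocess


def scanimage_command(mode="Lineart", resolution=300, device=None):
    """Build a scanimage argv list for the given scan mode and resolution.

    --mode and --resolution are backend-specific options; they are assumed
    here to be available for the scanner in use.
    """
    cmd = ["scanimage"]
    if device is not None:
        # -d selects a SANE device; the device name is whatever
        # `scanimage -L` reports on your system.
        cmd += ["-d", device]
    cmd += ["--mode", mode, "--resolution", str(resolution)]
    return cmd


def scan_to_file(path, **options):
    """Run scanimage and write the image data it prints on stdout to path."""
    with open(path, "wb") as out:
        subprocess.run(scanimage_command(**options), stdout=out, check=True)
```

Note that scanimage writes PNM data by default rather than the PNG file saved from Kooka in the walkthrough.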

Kooka Scan/OCR
Editor's note: The above image is reduced in size to allow it to display on this page. Click on this image to see the original.

From the toolbar across the top of the Kooka GUI, I clicked on the "OCR Image" icon. This brought up a sub-window explaining that Kooka calls Gocr for OCR. The window also allows you to modify the path to the Gocr executable and to adjust gray level, dust size and space width. I left them all at their default values and clicked on the "Start OCR" button. The OCR took about 13 seconds, and then a two-paned window appeared. Neither pane seemed to have much readable text in it. A button on the window offered to load the output in Kate (the aforementioned editor), so I clicked it. That produced the window you see below. As you can see, several characters were either not recognized or not recognized correctly, but I would judge the quality of the OCR as decent to good, especially for a relatively low-resolution image.
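The three tunables in Kooka's dialog appear to correspond to options of the gocr command itself, so the same run can be sketched outside the GUI. The mapping used below (-l for gray level, -d for dust size, -s for space width, -i and -o for input and output files) follows gocr's documented options, but treat it as an assumption and confirm it with gocr -h for your version. The function names are hypothetical.

```python
import subprocess


def gocr_command(image, output, gray_level=None, dust_size=None,
                 space_width=None, gocr_path="gocr"):
    """Build a gocr argv list; options left as None use gocr's own defaults."""
    cmd = [gocr_path, "-i", str(image), "-o", str(output)]
    if gray_level is not None:
        cmd += ["-l", str(gray_level)]    # gray-level threshold
    if dust_size is not None:
        cmd += ["-d", str(dust_size)]     # ignore specks up to this size
    if space_width is not None:
        cmd += ["-s", str(space_width)]   # width of the gap between words
    return cmd


def run_gocr(image, output, **options):
    """Run gocr on image and write the recognized text to output."""
    subprocess.run(gocr_command(image, output, **options), check=True)
```

Calling run_gocr("kscan_0001.png", "letter.txt") would mirror the all-defaults run described above. Gocr reads PNM images natively, and other formats such as PNG may depend on helper converters being installed.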

Kate edit of OCR'd image

OCR Shop

Now let's look at OCR Shop from Vividata. I filled out a brief registration form and agreed to the license terms to download a free 30-day evaluation copy of OCR Shop and have a license key e-mailed to me. I also downloaded a copy of the OCR Shop user manual in PDF format.

Just like the free-software solution, OCR Shop can handle both the scanning and the OCR. Unfortunately, my scanner was not one of those supported. No problem. I simply used the image file created and saved by Kooka. When the downloads were finished, I untarred the OCR Shop download and entered the vivadata_linux_4.61 directory created by tar. As root, I entered ./installer. Then I copied and pasted the license-key information from the e-mail into the appropriate windowpane when asked for it. At that point, I was good to go.

Starting OCR Shop (by entering /usr/vividata/bin/ocrshop) produced the main window you see below. It's small, compact, and loaded with things to tune and tweak. The options window covers everything from language to user dictionaries to proofing-editor setup. Speaking of editors, I added a new one to the default list, eschewing both vi and emacs so that I could use gedit, which is the one I usually write with. You can also choose recognition options and select the format (from a very long list of word-processing file types) in which you want the output produced. Vividata's Web site, however, points out that the Linux version of OCR Shop is limited to ASCII output. The program still lets you select other formats, but doing so just names the file with an .ami extension and adds strange statements to the text.

OCR Shop main menu

To start OCR on the previously scanned image, I clicked "Auto Recognize" and selected "File" as the input source and "Whole Page" as the area to recognize, entered the file name (kscan_0001.png) as the document name, and then clicked "Start Recognition." That produced a file-selection dialog. I located and selected the PNG image created by Kooka, then I highlighted it in the selected-file pane and clicked OK. I'm not sure why I needed to both enter the file name and select it; perhaps I did something wrong along the way. In any case, clicking OK resulted in the gedit window containing the OCR output that you see below.

OCR Shop proof-editing

The recognition was done so much more quickly with OCR Shop than with Kooka/Gocr that I initially thought I had broken something. In well under two seconds, the image had been fully scanned, the output had been formatted and the proofing editor had started with the output in place. My guess is that the actual recognition was done in one second. I was very impressed.
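The timings quoted here were judged by eye. For engines that can be driven from the command line, a rough wall-clock measurement can be sketched as follows; the helper names are hypothetical, and the command timed in the demo is a stand-in, since OCR Shop was run through its GUI in this comparison.

```python
import subprocess
import sys
import time


def time_command(argv):
    """Run argv to completion and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(argv, check=True,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return time.perf_counter() - start


def speedup(slow_seconds, fast_seconds):
    """How many times faster the quicker run was."""
    return slow_seconds / fast_seconds


if __name__ == "__main__":
    # Placeholder command; substitute a real OCR invocation such as
    # ["gocr", "-i", "kscan_0001.png", "-o", "out.txt"].
    elapsed = time_command([sys.executable, "-c", "pass"])
    print(f"elapsed: {elapsed:.2f} s")
    # About 13 s for Gocr versus at most 2 s for OCR Shop, per the text.
    print(f"speedup: at least {speedup(13, 2):.1f}x")
```

With the figures above, 13 seconds against an upper bound of two seconds works out to a speedup of at least 6.5 times.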

After inspecting the output from OCR Shop, I was even more impressed. Gocr had not recognized symbols such as the dollar sign, percent sign and asterisk; made a couple of errors in spacing between words; and misrecognized very similar characters, such as reading the lower-case letter "L" as a capital "I". However, I could find nothing incorrect in OCR Shop's output. Not a single error.
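Judgments like "decent to good" versus "not a single error" can be made measurable by comparing OCR output against a hand-typed reference. One common approach, offered here as a sketch rather than the method used in this review, is the character error rate: the Levenshtein edit distance between the two texts divided by the length of the reference. The sample strings below are invented illustrations of the error types mentioned above.

```python
def edit_distance(reference, candidate):
    """Levenshtein distance: minimum single-character insertions,
    deletions and substitutions needed to turn reference into candidate."""
    prev = list(range(len(candidate) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, cand_char in enumerate(candidate, start=1):
            cost = 0 if ref_char == cand_char else 1
            curr.append(min(prev[j] + 1,          # delete from reference
                            curr[j - 1] + 1,      # insert into reference
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(reference, ocr_output):
    """Edit distance normalized by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, ocr_output) / len(reference)


if __name__ == "__main__":
    print(edit_distance("bill", "biII"))   # two lower-case L's read as capital I
    print(edit_distance("$599", "599"))    # dropped dollar sign
    print(character_error_rate("enjoying", "en joying"))  # stray space
```

A perfect transcription scores 0.0, so an error-free result like the one reported for OCR Shop would show up as a zero rate against the reference.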

Don't get me wrong; I'm a big fan of open source and believe that, in the end, it will become the dominant genre in many areas (read: operating systems, for sure). I am not bad-mouthing Gocr in the least when I point out that OCR Shop is clearly superior in terms of speed and accuracy.

The choice is yours

Of these two choices, which one would be right for your OCR needs? Depending on who you are and how you'd use the software, the answer differs.

Kooka/Gocr is the choice for me; harsh economic reality makes it so. If it weren't for free software, I would have no OCR capability at all. While Vividata's OCR Shop is clearly the performance winner, it is pricey. Corporate pricing for the desktop version starts at $1,495. If you want an annual maintenance contract with that license, that's another $299. OCR Shop is based on ScanSoft's award-winning OmniPage engine. A quick check on the ScanSoft Web site shows OmniPage Pro (for Windows) on sale for $599. For serious use where accuracy is paramount, OCR Shop is the clear choice.
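To put the quoted prices side by side, here is a small worked calculation. It uses only the figures given above; the assumption that the maintenance fee stays at $299 for each additional year is this sketch's, not Vividata's.

```python
OCR_SHOP_LICENSE = 1495      # corporate desktop license, USD
OCR_SHOP_MAINTENANCE = 299   # annual maintenance contract, USD
OMNIPAGE_PRO = 599           # OmniPage Pro for Windows sale price, USD


def ocr_shop_cost(years_of_maintenance=0):
    """License plus maintenance, assuming a flat yearly maintenance fee."""
    return OCR_SHOP_LICENSE + OCR_SHOP_MAINTENANCE * years_of_maintenance


if __name__ == "__main__":
    print(ocr_shop_cost(0))   # license only
    print(ocr_shop_cost(1))   # license plus one year of maintenance
    print(round(OCR_SHOP_LICENSE / OMNIPAGE_PRO, 2))  # license vs. OmniPage Pro
```

The license alone costs roughly two and a half times the OmniPage Pro sale price, and one year of maintenance brings the first-year total to $1,794.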

I spoke briefly with Radcliffe Goddard, national sales director for Vividata, to see what I could learn about the future of OCR Shop for Linux. Although she did not have the actual numbers at hand, she did say that most Vividata customers are running Linux. The company is hard at work on new OCR development, though the direction seems to be towards industrial-strength server applications rather than the desktop usage expected in the SOHO market. OCR Shop XTR, for example, is available only in a CLI configuration, and it offers even greater power and more of everything.

OCR is definitely an area in which I have a lot to learn. Have I missed other OCR solutions available for Linux? I'm aware of CLARA, but what else is out there? Which way would you go if your choice were limited to the two solutions covered here today? Let me know in the forum or by e-mail if you prefer.

More Stories By Joe Barr

Joe Barr is a freelance journalist covering Linux, open source and network security. His 'Version Control' column has been a regular feature of Linux.SYS-CON.com since its inception. As far as we know, he is the only living journalist whose works have appeared both in Phrack, the legendary underground zine, and IBM Personal Systems Magazine.

Most Recent Comments
LjL 03/26/04 01:26:50 PM EST

There is ocrad: http://www.gnu.org/software/ocrad/ocrad.html

dkite 10/05/03 10:50:35 AM EDT

http://www.suse.de/us/company/press/press_releases/archive03/82.html says that suse linux includes a commercial ocr, kadmos. I believe that it uses kooka, with kadmos backend.

Haven't tried it. There was a licensing change in kooka a while back to allow this.

Derek

gilkyboy 10/04/03 04:44:06 PM EDT

One error: en joying. it added an extra space.
