Place of Optical Character Recognition in the Current Industry

What is OCR?
Good quality Optical Character Recognition (OCR) is an unquestionably effective tool for extracting textual content in machine readable and editable format from a digitized image. The following steps are involved in the process – from scanning a paper document to getting the textual content out of the scanned image.

1. Use a compatible, good quality document scanner to scan a paper document into a suitable electronic format, such as JPEG, TIFF, PNG, PDF etc.

2. Preprocess the scanned document for skew correction, noise removal, and some mathematical transformations.

3. Perform Line Segmentation from the processed document, i.e., segregate each line from the scanned document.

4. Perform Word Segmentation, i.e., segregate the words of each line from the previous step.

5. Perform Character Separation, i.e., separate the characters of the segregated words from the previous step.

6. Perform Character Recognition, i.e., recognize each of the above segregated characters by looking into its pattern. Once the patterns are recognized, the corresponding textual characters are obtained. The pattern recognition may involve different types of methods, such as statistical analysis, neural networks, structural matching etc.

7. Reassemble the recognized characters into words, the words into lines, and the lines into the entire document to obtain its textual version.

In short, this is the main process and objective of Optical Character Recognition (OCR).
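The segmentation steps (3 to 5) can be illustrated with a projection-profile sketch. This is only one common textbook approach, not necessarily what any particular OCR product does. It assumes the page has already been binarized into a grid where 1 means ink and 0 means background; the names `runs`, `segment_lines`, `segment_columns`, and `group_words`, and the `max_gap` parameter, are hypothetical.

```python
from typing import List, Tuple

Image = List[List[int]]  # assumed binarized page: 1 = ink, 0 = background
Span = Tuple[int, int]   # (start, end) with end exclusive


def runs(profile: List[int]) -> List[Span]:
    """Return the spans of consecutive non-zero values in a profile."""
    spans = []
    start = None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i
        elif not value and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(profile)))
    return spans


def segment_lines(image: Image) -> List[Span]:
    """Step 3: group rows that contain ink into text lines."""
    return runs([sum(row) for row in image])


def segment_columns(image: Image, line: Span) -> List[Span]:
    """Step 5: find column spans that contain ink within one text line."""
    top, bottom = line
    width = len(image[0]) if image else 0
    profile = [sum(image[r][c] for r in range(top, bottom)) for c in range(width)]
    return runs(profile)


def group_words(chars: List[Span], max_gap: int = 1) -> List[Span]:
    """Step 4: merge character spans whose gap is at most max_gap into words."""
    words: List[Span] = []
    for start, end in chars:
        if words and start - words[-1][1] <= max_gap:
            words[-1] = (words[-1][0], end)
        else:
            words.append((start, end))
    return words


rows = [
    "0000000000",
    "1101100011",
    "1101100011",
    "0000000000",
    "1110000000",
]
image = [[int(ch) for ch in row] for row in rows]
lines = segment_lines(image)
chars = segment_columns(image, lines[0])
words = group_words(chars)
```

In this toy image the first text line holds three character-sized blobs, and the narrow gap between the first two makes them one word. Real scans need skew correction and noise removal first, because a tilted or speckled page has no clean blank rows or columns to split on.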

There are different OCR products, some commercial and some open source. A widely used open source example is “Tesseract”, which is distributed as part of various product offerings from different companies. SARANGSoft offers a version of Tesseract as a free download from its website. It adds OCR functionality to digipaper (http://sarangsoft.com/product/digipaper).

Different Types of OCR
OCR is available for different languages, including English. A good amount of successful research has already been done, and more is still going on. Sometimes a document (such as a form) may contain multiple languages. In that case OCR becomes more complex. An OCR engine for one language, such as English, is not sufficient. The output would also contain text in different languages / scripts. This is known as Multiscript OCR, as opposed to Single Script OCR in the simpler case of only one language / script.

It is also important to keep in mind that the textual content may be in printed form, or in handwritten form, or both. Very few OCR tools are effective in recognizing handwritten text, and most OCR tools focus on and handle only printed documents.

Tesseract supports recognition of text in multiple languages. However, our experience is more focused on the English language. From what we have noticed, Tesseract doesn’t handle a mixture of languages too well.

Accuracy
No OCR tool can guarantee 100% accurate output, for several reasons:

  1. A document might contain clean text, or it can be a bit fuzzy / unclear / noisy, especially if it’s an old document where the contents have faded. Unclear / noisy documents should be run through some preprocessing like color and brightness adjustment, noise removal, filtering etc. to improve their clarity. The cleaner the document, the more accurate the OCR result.
  2. Documents in grid format (i.e., in rows and columns) may also cause problems in accurately recognizing the textual contents. At times some of the (printed) text crosses the grid cell boundaries, making it more prone to error during recognition.
  3. Scanning of the paper document is an important factor in the recognition process. If the document is shrunk or skewed by a significant amount, the preprocessing may not always bring the document back to its actual form. This also often leads to inaccurate OCR results.
  4. Recognizing handwriting is always hard, especially if it’s cursive, in which case special processing (including neural network techniques) may be needed to recognize the characters from the flowing nature of the writing. Bad handwriting is hard to recognize (both by humans and machines).
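One common way to put a number on OCR accuracy is the character error rate (CER): the edit distance between the OCR output and a manually verified reference text, divided by the length of the reference. The sketch below is an illustration under that assumption, not something taken from a specific OCR tool; the function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, ocr_output: str) -> float:
    """Edit distance normalized by the length of the verified reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, ocr_output) / len(reference)


# A typical OCR confusion: the letter "i" read as the digit "1".
cer = character_error_rate("invoice", "invo1ce")
```

Here one wrong character out of seven gives a CER of about 14%. Measuring CER on a small, hand-checked sample of your own documents is a practical way to see how much the factors listed above affect results.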

Use
In the business world, OCR can be used for long-term digital storage of important papers, such as invoices, financial statements, reports, legal documents, applications, permits etc., especially when the volume of such papers grows too large. In some organizations, dedicated staff members manually enter the details from those paper documents into digital formats (e.g., spreadsheets) to collect the data and make it available for processing. Another reason for digitizing paper documents is safekeeping and organizing them for long-term storage. If the digitized documents are properly “tagged”, an added benefit is the ease and speed of retrieving them when needed.

However, manual data entry and manual tagging are error-prone, time consuming, and often inconsistent between individuals. Any error in manually entered data can also cause major problems for the organization. OCR can be a great help in reducing the use of manual steps in the overall digitization process. The textual content can be extracted from the scanned document images using OCR, and then put into appropriate places, such as database fields, rows and columns of spreadsheets, or “tags” for the concerned documents.

Please note the word “reducing” above. A lot of organizations think that by using OCR they can completely get rid of manual intervention, which is a bit too much to expect. As no OCR is 100% accurate, the output of OCR needs to be manually verified as appropriate for the case. That means instead of “entering” every data item manually, there is a need to “check” the correctness of the OCR-generated data. It’s not safe to blindly trust the OCR output in a lot of cases.
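A minimal sketch of the “check instead of enter” idea: if the OCR engine reports a confidence score for each extracted value (many engines do, but treat that as an assumption here), low-confidence values can be routed to a person for verification. The `OcrField` type, the `split_for_review` function, and the 0.90 threshold are all hypothetical and would need tuning per use case.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class OcrField:
    name: str          # e.g. a database column or spreadsheet header
    text: str          # value extracted by OCR
    confidence: float  # assumed engine-reported score between 0.0 and 1.0


def split_for_review(fields: List[OcrField],
                     threshold: float = 0.90) -> Tuple[List[OcrField], List[OcrField]]:
    """Separate fields into (accepted, needs_review) by confidence."""
    accepted = [f for f in fields if f.confidence >= threshold]
    needs_review = [f for f in fields if f.confidence < threshold]
    return accepted, needs_review


fields = [
    OcrField("invoice_number", "INV-2041", 0.98),
    OcrField("total_amount", "1,2S0.00", 0.61),  # "5" misread as "S"
    OcrField("vendor", "Acme Corp", 0.93),
]
accepted, needs_review = split_for_review(fields)
```

For critical data, the threshold can be set above 1.0 so that every field goes to a human checker. A high confidence score is not proof of correctness, so spot checks of the accepted fields are still worthwhile.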

In the banking sector, OCR may be used to recognize the check number, account number, bank name, routing number etc. In the educational sector, OCR may be able to help with form processing. In any domain, OCR can be used as part of the digitization process to extract textual content after scanning and to use that content as (secondary) tags for identifying the documents later.

In short, OCR reduces manual data entry, thereby saving time and reducing / avoiding errors to a good extent. At the same time, keep in mind that OCR output should be carefully checked before using it for any critical purpose.
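Some banking fields carry a built-in check that can catch OCR misreads automatically. US bank routing numbers, for example, are nine digits whose weighted sum (weights 3, 7, 1 repeating) must be divisible by 10. The sketch below applies that check to an OCR-extracted string; the function name is hypothetical, and passing the check only shows the digits are self-consistent, not that the number belongs to a real bank.

```python
def passes_routing_checksum(text: str) -> bool:
    """Check a 9-digit US routing number against its 3-7-1 weighted checksum."""
    if len(text) != 9 or not (text.isascii() and text.isdigit()):
        return False  # wrong length, or OCR produced a non-digit such as "O"
    weights = (3, 7, 1, 3, 7, 1, 3, 7, 1)
    total = sum(w * int(ch) for w, ch in zip(weights, text))
    return total % 10 == 0


ok = passes_routing_checksum("021000021")         # weighted sum is 30
misread_digit = passes_routing_checksum("821000021")   # leading 0 read as 8
misread_letter = passes_routing_checksum("O21000021")  # leading 0 read as letter O
```

A failed check is a good signal to send the document for manual verification. Some misreads will still pass the checksum, so it complements human checking rather than replacing it.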

2016: IT Security Challenges

A recently published report by Gartner & Raytheon (Dec-2015) makes some security predictions for the year 2016. The picture is not comforting at all. The already scary level of attacks by cyber criminals will rise even more because of the cyber terrorists (including the “Syrian Electronic Army”, or SEA for short), who will be working in sync with ISIS and other such groups.

1.  The US Elections Cycle Will Drive Significant Themed Attacks: The use of social and online media in the US Presidential election process will exceed all earlier instances. The candidates have started opening websites with their own profiles and are regularly updating those with campaign schedules, time tables, issue-based debates etc. They are also using Facebook, Twitter, Instagram etc. as campaign tools. A 2014 survey showed nearly 74% of US adults use social networking. According to a recent survey by Pew Research Center, nearly 92% of Americans are on social media. Of those, 96% of adults read news on the Presidential election there. They have less interest and trust in traditional media like TV, newspapers etc. The candidates are also paying more attention to their ads on social media sites.
This will make things easier for the hackers and spammers. Pretending to campaign on behalf of some candidate(s), they will present attractive / interesting topics or use offers as bait to trap / cheat users visiting social media, as well as push malware, spam etc. into their email / computers.

2.  The attacks on Google, Bing etc. will reach an extremely high level. There will be attacks through Facebook and Twitter “friends” / “connections”. Serious attacks like Highly Transient Web Threat (HTWT) will also happen.

3.  Addition of the gTLD system will provide new opportunities for attackers: The top ten botnets, like “Cutwail”, “Rustock”, “Mega-D” etc., will become even more powerful and active. They have been spreading spam to about 100 million computers around the world, which accounts for 88% of the hundreds of billions of spam messages sent daily. In 2016, it might grow by 15 times or more!
Since multinational corporations and marketing agencies are becoming increasingly dependent on online services and web-based systems, there is big growth in “cloud computing”. Now the cyber criminals / terrorists are making “cloud computing” systems their major target.

4.  The cyber criminals will attack the “traditional customer authentication” methods used for online banking and financial transactions to steal funds from bank accounts. There will be tremendous rise in the “Man in the Browser” (MITB) Trojan attack incidents.

5.  The cyber terrorists will also attack in the guise of lucrative offers in emails (possibly as attachments) with attractive topics, pictures, and invites, as well as fake web links, so that you step into their trap and reveal important personal information.

6.  The criminals will also use “BlackHatSEO” to get the fake sites and/or links in front of you in search engine results by suppressing the genuine websites. For this they will use various SEO techniques, including paid SEO.

7.  Fake advertisements in the name of reputed media houses will be used to inject viruses into those organizations’ websites. The hackers and spammers will exploit outdated technologies that are still in use, such as unsupported and unpatched old software.

8.  The tiny URLs used on Facebook and Twitter are quite popular among users. Since those are easy to exploit, the criminals will target tiny URLs to bring people to hundreds of thousands of malware-ridden fake websites.
According to an estimate from a few years back by a security software firm, nearly 300,000 fake websites are launched EVERY DAY just to lure unsuspecting users and infect their computers with malware and viruses.

9.  The cyber criminals are going to use “SQL Injection” attacks against the famous multinational banks and commercial and marketing companies around the world, including in the USA. Along with that they will use Phishing (stealing data through browser / email), Vishing (stealing data via phone calls), and Smishing (via SMS to mobile phones) attacks.
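For readers who build software, here is a small illustration of what “SQL Injection” means and of the standard defense, parameterized queries. It uses Python's built-in `sqlite3` module with an in-memory database; the `accounts` table, the sample rows, and the function names are made up for the example.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical accounts table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (owner TEXT, balance INTEGER)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                     [("alice", 100), ("bob", 250)])
    return conn


def find_unsafe(conn: sqlite3.Connection, owner: str):
    # VULNERABLE: user input is pasted directly into the SQL text.
    query = "SELECT owner, balance FROM accounts WHERE owner = '" + owner + "'"
    return conn.execute(query).fetchall()


def find_safe(conn: sqlite3.Connection, owner: str):
    # Parameterized: the driver passes the input as data, never as SQL.
    return conn.execute(
        "SELECT owner, balance FROM accounts WHERE owner = ?", (owner,)
    ).fetchall()


conn = make_demo_db()
payload = "x' OR '1'='1"                   # attacker-supplied "owner name"
leaked = find_unsafe(conn, payload)        # condition is always true: all rows
blocked = find_safe(conn, payload)         # no owner has that literal name
```

With the unsafe version, the attacker's input rewrites the query so that it returns every account. With the parameterized version, the same input is treated as an ordinary (non-matching) name.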

10. There will be major increase in the cyber terrorists’ use of “foreign language spam” as well as “identity theft” attacks to steal our “digital signatures” for online (commercial / legal / financial) activities.

The only protection is to be super-careful (being paranoid is OK), even for individuals, because our own personal finances can be ruined by such attacks. A whole lot of people have already been burnt by “ransomware” (a kind of malware). Phishing and Vishing are still going on, and people continue to fall for those. On the other hand, a lot of computer users neglect to upgrade their software (Operating Systems, Applications, Browsers etc.), even when free upgrades are widely available. There are a lot of people who derive extra pleasure from using pirated software, without understanding how dangerous it is for THEMSELVES. The big software companies can afford to lose a couple of billion in revenue to piracy, but a compromised computer can terribly affect an individual’s life or a small business. Legitimate software really doesn’t cost much when the price is spread over the lifetime of a computer and its software. However, some people still find it necessary to avoid paying their dues and lead a risky life. Also, some computer users indiscriminately download and install “free” software from the Internet. Is “free” a business model for anyone? Yes, there are some legitimate “free” (mostly open source) software organizations, but they are well known. Why use software from a random company that pops up in a Google search? Does anyone buy anything else like that? In real life, do you use an item handed out by a complete stranger? Hopefully not.

It’s important to practice “Safe Computing”:
a)  Use ONLY legitimate software
b)  Use RELIABLE anti-virus from a REPUTED company
c)  Regularly update / patch software
d)  Monitor network to detect intrusion / infection
e)  Take automatic backup of all important data

The challenges are grave. The threats are real. The repercussions can be devastating. It’s worth being extra careful.

Welcome to SARANGSoft New Blog

Welcome to our new blog. It took us a long time to transition from our earlier setup (website and blog) to this new version. Hopefully the blog will be better and more active from now on.

The goal of our blog is to share our thoughts, actions, findings (and more) with a wide audience. While working on various projects – building our own software products as well as developing custom software solutions for our customers, we come across different questions and challenges, which we sometimes solve by ourselves and at times depend on the wisdom shared by others like us in the industry. We know firsthand how much information-sharing helps, and we greatly appreciate that. In this day and age of collaboration and networking, we would like to put forward some of the hard lessons we learn as part of our work, so that others do not have to reinvent the same wheel or go through similar frustrations. We all win through sharing, because it goes both ways.

Our posts will be mostly technical in nature, but there will also be some business-related topics, because we are in the “technology business”! We develop and use technology for businesses, to make them more capable and efficient. That’s where the value of technology is: enabling new possibilities, improving existing processes, and even identifying fresh opportunities.

We welcome all meaningful and relevant discussions, including constructive criticism, of our posts. The different viewpoints from all concerned will enrich all of us in various ways.