Steve Marcus, MobiSave
This is the second installment in a three-part series discussing accuracy in receipt scanning and achieving the goal of 95% accuracy. In the first installment I reviewed the inherent advantages of receipt scanning in eliminating nearly all the fraud and misredemption challenges associated with paper coupons. In this installment I will discuss the accuracy challenges associated with converting receipt images into text files that can be searched to provide proof of purchase for paying rebates on specific products. For further reading on this topic you can review our patent filing dating back to November 2009 at this link: http://www.google.com/patents/WO2011063177A1?cl=en
Receipt scanning is still a relatively young technology, and although we have been at this for seven years, we are constantly learning and refining the system. Our approach is probably somewhat different from others in that we believe the long-term viability of receipt scanning depends on the ability to machine read receipts with a high degree of accuracy. That enables us to come as close as possible to the immediate gratification of presenting coupons at the checkout while eliminating processing costs so that the fees charged to manufacturers are a fraction of the cost of paper.
There are at least nine dependencies between the time a receipt is printed and the time we can accurately determine whether the correct purchase was made and the rebate can be paid. Here’s what we are doing at each one to ensure maximum accuracy.
- The original receipt quality: In our experience only 5% of the receipts we see have printing issues severe enough that they cannot be machine read. The most common are faint text, vertical lines with missing text or, occasionally, vertical colored stripes. If the receipt is still readable by human eyes, it is routed for manual processing rather than rejected. In our view, this is not the fault of the user. In the rare case when a receipt is totally illegible, we do not accept it and advise the user that they are entitled to a readable receipt that we can process.
- User receipt abuse: We are constantly evaluating the latest developments in software to clean up receipt abuse problems such as wrinkles, creases or stains. In the event the receipt abuse cannot be sufficiently corrected to make it machine readable, we prefer not to reward bad behavior as this leads to matching errors, increased manual processing costs and delayed rewards. In these cases the receipt is rejected and returned to the user with an explanation and suggestions to correct the situation. By doing this we rapidly move cooperative users up the learning curve. The few users who continue to send us abused receipts eventually move on.
- Camera quality: The latest smartphones have features that make taking bad receipt snaps nearly impossible provided the user adheres to a few guidelines I’ll discuss in the next section. The newest cameras are benefitting from better low-light sensors, faster auto focusing and wider apertures. If the original receipt is good quality and the receipt has not been abused, with these cameras the only remaining challenge is for the user to follow some simple guidelines to snap an acceptable image.
- Camera usability in the app: All of the companies in receipt scanning have made substantial progress through trial and error and usability studies to guide the user to make the best possible snaps. This includes guide marks for proper sizing, overlap assists to eliminate gaps, soft flashes to eliminate shadows, cropping, anti-shake and autofocus lock, to name a few.
- User skills: Despite all the improvements in camera quality and usability aids in the app, we still occasionally get receipts where the user has sent us an image that is too small, is out of focus because of the wrong camera settings or has a thumb hiding key information. These are handled the same as receipt abuse and returned to the user with suggestions for taking usable snaps. In addition, our app, like others, has a review and retake option that tells users that if they can’t read the text in the image, neither can we, so please take a new snap.
- Image cleanup: Some image defects can be corrected during a clean-up phase: focus can be sharpened, contrast improved, extraneous noise in the image or the background removed, and misaligned images straightened so that they are readable by OCR.
- OCR stage converting the image to text: The accuracy of the best OCR software under our conditions is between 90 and 95%. At that error rate, character confusions are common: a zero can be read as the letter “o,” or the letter “l” as the number “1.” If our machine is searching the receipt text for a UPC code, a single incorrect digit would, in theory, cause an exact-match search to fail.
- Matching stage: To compensate for the inexactness of OCR, we, and I assume others, use similarity and approximate matching algorithms that yield a confidence score to determine whether a character string matches a predetermined item description. For example, if we are looking for a 12-digit UPC-A and we find 11 correct digits in the correct sequence and one incorrect digit, it is overwhelmingly likely that this is a correct match.
- Data dictionary: While 60% of our receipts contain number strings and yield a near exact match at the SKU level, matching is more difficult with text strings and heavily dependent on having a comprehensive data dictionary. Candidly, even with a comprehensive dictionary, if the short product description on the receipt is cryptic, we cannot be 100% certain it is correct and certainly cannot identify it at the SKU level. We have now processed receipts from over 2000 different retailers, which makes it practically impossible for us to know how each of these retailers describes the product on their receipts. The good news is that we have done a spot check by purchasing the item at some of these retailers. In nearly every instance where we’ve done this, the user had purchased a qualifying item, once again renewing our faith in the honesty of the vast majority of consumers.
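To make the OCR and matching stages above concrete, here is a minimal sketch in Python of how a confidence score for a number string might be computed. It is an illustration only, not MobiSave's actual implementation: the function names, the confusion table, the 0.9 threshold and the sample UPC are all assumptions chosen for the example.

```python
# Hypothetical confusion table: letters OCR may substitute for digits.
# The o/0 and l/1 pairs come from the text above; "I" is an added assumption.
OCR_DIGIT_FIXES = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1"})


def normalize_digits(token: str) -> str:
    """Map commonly confused letters back to the digits they resemble."""
    return token.translate(OCR_DIGIT_FIXES)


def upc_confidence(candidate: str, target_upc: str) -> float:
    """Fraction of positions that agree after normalization.

    Returns 0.0 when the lengths differ, so only same-length tokens compete.
    """
    cleaned = normalize_digits(candidate)
    if len(cleaned) != len(target_upc):
        return 0.0
    matches = sum(a == b for a, b in zip(cleaned, target_upc))
    return matches / len(target_upc)


def find_upc(receipt_text: str, target_upc: str, threshold: float = 0.9):
    """Return (score, token) for the best token, or (score, None) below threshold."""
    best_score, best_token = 0.0, None
    for token in receipt_text.split():
        score = upc_confidence(token, target_upc)
        if score > best_score:
            best_score, best_token = score, token
    if best_score >= threshold:
        return best_score, best_token
    return best_score, None


if __name__ == "__main__":
    target = "036000291452"  # illustrative 12-digit value
    print(find_upc("TISSUE 03600O29l452 3.49", target))  # letters in place of digits
    print(find_upc("TISSUE 036000291453 3.49", target))  # one wrong digit
```

With a 12-digit target, 11 agreeing positions score 11/12, which clears the assumed 0.9 threshold, while a token with two wrong digits scores 10/12 and does not. A production system would more likely use an edit-distance measure that also tolerates dropped or inserted characters.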
Finally, with respect to the 95% accuracy goal, if we are dealing with number strings we are above that goal. With text strings we are not at 95% but within striking distance. Keep in mind that we look to make decisions on accuracy with a degree of confidence. Our shortfall on text strings does not necessarily mean we made an error. It just means it does not pass our high threshold.
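The idea of deciding with a degree of confidence can be sketched for text strings as well. The Python example below scores a receipt line against a tiny data dictionary using the standard library's `difflib.SequenceMatcher` and then applies two thresholds. The dictionary entries, the threshold values and the three-way outcome (accept, manual review, no match) are assumptions made for illustration, not a description of our production rules.

```python
from difflib import SequenceMatcher

# Hypothetical data dictionary: receipt abbreviation -> product description.
DATA_DICTIONARY = {
    "ACME OAT CRL 12Z": "Acme Oat Cereal 12 oz",
    "ACME DISH SOAP 24Z": "Acme Dish Soap 24 oz",
}


def best_match(receipt_line: str, dictionary: dict):
    """Return (key, score) for the dictionary key most similar to the line."""
    best_key, best_score = None, 0.0
    for key in dictionary:
        score = SequenceMatcher(None, receipt_line.upper(), key).ratio()
        if score > best_score:
            best_key, best_score = key, score
    return best_key, best_score


def route(receipt_line: str, dictionary: dict,
          accept_at: float = 0.9, review_at: float = 0.6):
    """Classify a receipt line as accept, manual_review or no_match.

    Both thresholds are illustrative assumptions; a score below accept_at
    is not treated as an error, only as not confident enough to auto-approve.
    """
    key, score = best_match(receipt_line, dictionary)
    if key is not None and score >= accept_at:
        return "accept", dictionary[key], score
    if key is not None and score >= review_at:
        return "manual_review", dictionary[key], score
    return "no_match", None, score


if __name__ == "__main__":
    for line in ["ACME OAT CRL 12Z", "ACME 0AT CRL 12Z", "QQQQ"]:
        print(line, "->", route(line, DATA_DICTIONARY))
```

Raising `accept_at` trades fewer wrong automatic approvals for more lines falling into the lower bands, which is the sense in which a shortfall against a high threshold is not the same thing as an error.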
The first installment in this series discussed “Receipt Scanning Fraud and Misredemption.” The third and final installment will recommend steps that the receipt scanning industry, manufacturers and the ACP can take together to get us all to the desired 95% level. ACP members will find all three installments in the member area of the ACP website. Non-members can contact John Morgan, email@example.com for access.
About the author
Steven Marcus, founder and president of MobiSave, has spent a lifetime in consumer marketing across consumer packaged goods, marketing services and financial services. He is an entrepreneur and an Internet pioneer, having digitized a multiple rebate system in the mid-1990s that became the basis for the receipt scanning applications. He served on the USPS anti-fraud rebate task force. You can contact Steve at firstname.lastname@example.org.