
Expert Systems With Applications 188 (2022) 115942


Contents lists available at ScienceDirect

Expert Systems With Applications

journal homepage: www.elsevier.com/locate/eswa

Product verification using OCR classification and Mondrian conformal prediction

Rachid Oucheikh a,∗,1, Tobias Pettersson b,a,c,∗∗,1, Tuwe Löfström a

a Department of Computing, Jönköping AI Lab, Jönköping University, Sweden
b ITAB Shop Products AB, Sweden
c Department of Engineering, University of Skövde, Sweden

ARTICLE INFO

Keywords:
OCR classification
Retail product verification
Mondrian conformal prediction
Smart self-checkout system

ABSTRACT

The retail sector is undergoing an apparent digital transformation that completely revolutionises shopping operations. To stay competitive, retailer stakeholders are forced to rethink and improve their business models to provide an attractive personalised experience to consumers. The self-service checkout process is at the heart of this transformation and should be designed to identify the products accurately and detect any possible anomalous behaviour. In this paper, we introduce a product verification system based on OCR classification and Mondrian conformal prediction. The proposed system includes three components: OCR reading, text classification and product verification. By using image data from existing grocery stores, the system can detect anomalies with high performance, even when there is partial text information on the products. This makes the system applicable for reducing shrinkage loss (caused, for example, by employee theft or shoplifting) in grocery stores by identifying fraudulent behaviours such as barcode switching and miss-scan. Additionally, OCR reading with NLP classification shows that it is in itself a powerful classifier of products.

1. Introduction

The retail sector is considered one of the most active and powerful businesses in the global economy. Being the biggest employer, the retail industry accounts for around 40% of Gross Domestic Product (GDP) in the USA. The intensified struggle faced by the retail sector with the arrival of growing numbers of new competitors in both local and global markets has obliged retailers to critically analyse and redesign their operations and their marketing strategies (Perdikaki, 2009). Several retailers have differentiated themselves by creating improved in-store shopping experiences and various other tactics to gain a competitive edge and stay competitive (Joseph et al., 2019). Furthermore, retailers must constantly strive for quality of operations. These practices have provided an abundant foundation and an innovative context for research and the involvement of technology in the retail operations space.

Over the last century, there has been a significant transformation in the way grocery shopping is done. The evolution has shifted from over-the-counter shopping, through picking products and paying at manned checkout stations and self-service checkout (SCO) systems, to, more recently, fully autonomous stores (Amazon.com, 2018; Beck, 2018; Taylor, 2016). This transformation has enabled retailers to deliver an enhanced and more efficient experience for customers while decreasing the cost of staff. With this change, new methods of fraud have been enabled that increase the retail shrinkage within a store. Shrinkage is a term used within the retailing domain as a "catch-all phrase to categorise the financial losses that retailers face through spoilage, damage, error and theft" (Beck & Peacock, 2009). Measures to counter this need to be developed to reduce the shrinkage loss at SCO systems.

∗ Corresponding author.
∗∗ Corresponding author at: ITAB Shop Products AB, Sweden.
E-mail addresses: [email protected] (R. Oucheikh), [email protected] (T. Pettersson), [email protected] (T. Löfström).
URL: https://ju.se/jail/datakind (R. Oucheikh).
1 Equal contribution.

Vision sensors, such as 2D/3D cameras and Light Detection And Ranging (LiDAR), can be used to complement and improve the performance achieved using only scales. Vision sensors could be used in combination with machine learning algorithms, such as Convolutional Neural Networks (CNN), to classify or verify products during scanning. It is also possible to analyse shopping patterns and notify staff if a customer's behaviour is deviating. Companies are already benefiting from this technology, and it is being used in SCO systems for monitoring fraudulent behaviour (Everseen, 2019; Malong, 2020). This solution is powerful, but it still has some issues. It requires collecting large amounts of training data to classify products with high accuracy.

Available online 30 September 2021
0957-4174/© 2021 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
https://doi.org/10.1016/j.eswa.2021.115942
Received 17 December 2020; Received in revised form 13 July 2021; Accepted 18 September 2021


The different types of products are also heavily imbalanced at grocery stores. For example, bananas and milk are sold in large quantities, but other types of products, such as dragon fruit, are sold in significantly lower quantities. This makes it challenging to collect and create a dataset where products sold in low quantities are not suppressed during classification. Additionally, it is also hard for image classification algorithms to distinguish between classes with subtle differences, such as different types of milk where only a small fraction of the text might be changed between packages.

One promising technique for reducing the shrinkage of SCO systems is Optical Character Recognition (OCR) reading. OCR reading is an alternative to solutions based solely on image recognition. It can use the images taken by the SCO to classify products and improve the self-scan process or detect fraudulent behaviour. Although it is not a new technology, OCR reading has seen great strides in both speed and performance in recent years with the introduction of deep learning techniques (Liu et al., 2020; Xing et al., 2019). Along with advances in OCR techniques, Natural Language Processing (NLP) has seen even greater progress (Mathew & Bindu, 2020). Using NLP techniques such as text embedding and deep learning models, one could capture patterns in retail product metadata and use it for classification or verification. The classification can serve, for example, product identification or store inventory and provisions, while verification will allow customer behaviour analysis or staff error detection. OCR readings in combination with NLP classification could address many challenges faced by CNNs. First of all, the training data to store would be just words and sentences for each class. This can be compared to the large amount of image data needed for training CNN models. We believe that classification with NLP could also mitigate the class imbalance problem, as it could take significantly fewer samples to acquire all OCR information of a class. Furthermore, it might also be easier to distinguish between classes where just a few words are different, for example, different types of milk.

With an SCO system consisting of an OCR reader and an NLP classifier, it would be possible to replace the scale while providing improved identification of fraudulent behaviour. Furthermore, it should be less sensitive to false positive blocking events generated at the scale, such as baskets, wallets, or gloves in the SCO area.

As the system is meant to decrease shrinkage, the final component in the system performs product verification. One of the purposes of the product verification component is to detect outliers, abnormalities, or activities that do not conform to the normal behaviour, commonly referred to as anomalies. Chandola et al. (2009) presented an overview of anomaly detection techniques and their utilisation in different research areas and application domains. Various methods have been applied to the anomaly detection problem in the literature, such as statistical methods and methods using Neural Networks, Support Vector Machines, Bayesian Networks, etc. (Taha & Hadi, 2019). In recent years, considering the effectiveness of the deep learning approach, several studies use deep learning to detect anomalies, so-called Deep Anomaly Detection (DAD).

The purpose of the product verification component is to use the information retrieved from the text on the products to identify when a scanned product is unlikely to match the scanned barcode. Mondrian conformal prediction has been used to allow the administrator of the SCO to set a user-defined confidence level for when to dismiss a product as suspicious. The benefit of using the conformal prediction framework (Vovk et al., 2005) for this task is that the framework provides a guarantee that a disregarded product is sufficiently atypical.

The SCO verification solutions based on image classification face several challenges, primarily due to the generalisation problem and the fact that many product classes are visually similar in terms of colour, shape, texture, and size. Furthermore, the scale is a limited sensor for preventing fraud while not blocking the customer too frequently. On the other hand, solutions based on OCR technology combined with NLP have promising results given their ability to grasp the fine differences between classes and thanks to their scalability and generalisability. Based on this analysis, we propose an OCR-based product verification system to be integrated into an end-to-end SCO system, enabling accurate and error-free checkout with a robust mechanism for anomalous event detection.

The paper's outline is as follows: The second section reviews related work. The third section introduces the proposed system and its subsystems on a high level. The fourth section presents the background, with details about the techniques and data transformations used within the subsystems. The fifth section presents the experiments referred to in the results section and provides details about the experimental setup used in the experiments. The results are presented in the sixth section, and the concluding remarks, including future work, are presented in the final section.

2. Related work

Retail products which belong to the same category or have the same brand usually have extremely similar appearance features such as colours, shapes and sizes. The differences between these categories are subtle and cannot be detected or even distinguished by conventional classification methods. The most effective way to solve this problem is fine-grained classification methods, which aim to divide coarse-grained classes into detailed sub-classes, such as distinguishing the type of milk products. The idea is to perform fine feature representations on a target local region. Wang et al. (2020a) proposed an improved fine-grained classification method for retail product recognition. The method is based on self-attention destruction and construction learning (SADCL), where the self-attention mechanism is utilised to effectively prevent noise overfitting after image destruction. The experiments are performed on a large product identification dataset called the Retail Product Checkout (RPC) dataset created by MEGVII's Nanjing Research Institute. The accuracy achieved is 81.4%. Fuchs et al. (2019) aimed to assess the feasibility of identifying retail products instead of barcodes in realistic, fast-paced retail environments. The experiments are performed using transfer learning techniques and multi-product object detection on multiple CNN architectures. One goal was to determine the number of image instances required per product to achieve meaningful predictions.

Some researchers have tried to use OCR in the retail sector. In Gundimeda et al. (2019), the authors addressed the problem of retail product recognition on grocery shelf images using a low time complexity technique based on SVM for image classification. The same authors used OCR for the analysis of retail product metadata and particularly for the recognition of expiration dates written on food packages (Gundimeda et al., 2018). Joseph et al. (2019) proposed a method to separate foreground and background from retail product images and enhance them after checking with a neural network classifier whether the background has a mono-colour gradient or not. Their ultimate goal is to identify the product region of interest.

To the best of our knowledge, no published paper proposes an end-to-end system for the identification and verification of retail products using OCR and deep learning techniques. In addition, the images considered in most papers have high quality. They do not consider certain challenges existing in the store environment, such as occlusion, luminosity, and product deformation. We believe that the use of OCR and deep learning can achieve high accuracy in classification even for images in the wild and detect the small nuances that distinguish products. It should be noted that the ultimate goal of this paper, which is not addressed in the literature, is the verification of purchased products. This can be done by providing statistical guarantees, determining accurate confidence levels in the prediction and avoiding false alarms in fraudulent behaviour detection. The verification is required to reduce shrinkage loss and offer high-quality service for the customers.


Fig. 1. Overview of the data flow within the product verification system and its components.

Fig. 2. Example of product gallery from the EasyFlow and HyperFlow system.

3. System overview

Fig. 1 illustrates the data flow of our proposed product verification system that will be part of the smart SCO system. It consists of three main components: OCR reading, text classification and product verification. When the consumer scans the product in the SCO, an image of the product is automatically taken using cameras integrated into the SCO. The image of the scanned barcode is sent to the first component of the product verification system, the OCR reading. The output of this component is snippets of text extracted from the product image. Then, the text snippets and metadata are fed into the text classifier, which generates probability estimates for all the different products. The product verification component uses the underlying classifier output and the scanned barcode to decide whether the product scan should be considered normal or not. If the product verification fails, i.e., if the product scan is not considered normal, the scanning process will be halted with instructions for handling the situation and the SCO may potentially call for an attendant. The purpose and general functionality of each of the components are briefly explained in the subsections below. Still, the design choices and implementation details for each system component are given and motivated in Section 5.

The SCO systems that we have used as our testbed are two systems produced by Itab, a Swedish company developing checkout solutions for stores. The images we have used have been extracted using Itab's automatic scanning systems, EasyFlow and HyperFlow (ITAB, 2019). These systems use a tunnel scanner system that identifies products passing through the tunnel with the help of barcode scanners and other types of sensors, of which one is an RGB camera. For each product that passes through the tunnel, the system automatically captures an image frame and saves it to disk along with information about the barcode of the scanned product. The triggering is generated by a light curtain sensor which ensures that the camera captures images where the product is centred in the frame. An example gallery of captured images can be seen in Fig. 2.

3.1. OCR reading

The purpose of the OCR reading component is to perform two tasks. First, it preprocesses the image received from the SCO to extract the positions in the image containing text. Secondly, OCR is applied to each part of the image containing text to extract text snippets. These two tasks can be performed either sequentially or simultaneously. Text detection techniques determine the presence of text and locate it by generating a bounding box around that text region. By focusing on these regions, the text recognition process aims to identify and understand texts. If necessary, text segmentation is performed to separate the text from its background. The resulting text snippets are then fed forward to the text classification subsystem.

3.2. Text classification

The text classifier component receives the extracted raw text snippets with text from the image that describes the product. Due to illumination issues, curved lines and since the image is taken from a certain angle, the extracted text is only an excerpt from the full description of the product. This component aims to automatically analyse the text and then make a classification of product identity based on the captured text. Before that, some preprocessing operations are performed, namely cleaning and tokenisation. Various techniques for NLP can be used when performing this kind of text classification. We opted for deep learning techniques and made a comparison between some of them before selecting our solution. Each type of product is considered a class in this component, and the classifier learns to identify the product by using the latent features of its description text.

Fig. 3. Example of an SCO used for manual registration of products.

3.3. Product verification

The purpose of the product verification system is to detect any anomalous behaviour during the self-checkout process. The anomalies can be caused by many factors such as consumer fraud, staff mistakes, mismatched scanning, sensor damage or wear, or even cyber-attacks. It is crucial to minimise the errors and faults of the checkout by developing a product verification component capable of detecting anomalies and identifying them in real-time.

To build this component, Mondrian Conformal Prediction (MCP) has been used. MCP relies on validity within categories, i.e. product types, to ensure that only sufficiently abnormal products are flagged as anomalous. What is sufficiently abnormal is determined by user-defined thresholds. The scanned barcode is used to select the category to be compared with. The prediction results from the text classification component are used to compare the product under verification with all other products of the same type. Ultimately, the component will indicate whether a product passes the verification, that is, whether it is normal compared to all reference products of the same type, or whether it fails the verification. The component can also provide the SCO with the degree of normality, which could be used, for example, to decide how the SCO should handle a product that has failed the verification test.

4. Background

4.1. Shrinkage at SCO systems

SCO systems where customers manually register products are now a commodity. These systems are sometimes referred to as fixed SCO systems in contrast to mobile solutions where the customer uses a scanning device to register products. An example of a fixed SCO system can be seen in Fig. 3. Throughout this paper, an SCO is considered to be a fixed SCO.

By enabling a greater sense of freedom for customers using SCO systems compared to traditional manned checkouts, the shrinkage for retailers is expected to increase. One of the most common shrinkage categories is theft by customers (Taylor, 2016). Theft by customers can be seen as an anomalous event that should be detected automatically. An anomalous behaviour is generally a potentially interesting object or event and can, depending on the context, also be referred to as a strange, suspicious, unusual, novel, or unexpected behaviour (Laxhammar, 2014).

A store with SCO systems will have a larger shrinkage loss than one without (Beck, 2018). According to Beck, the shrinkage loss will increase by between 33% and 147% for a store when equipped with SCO. The number of SCO systems in the store and their utilisation rate are also correlated with an increased shrinkage loss. Taylor (2018) states that stores with SCO systems are experiencing more shrinkage loss than stores without SCO systems and further adds that stores equipped with SCO systems will see an increase in shrinkage over time because customers are learning the machines and how to exploit them.

Beck (2018) presents several fraudulent techniques as factors for shrinkage in fixed SCO systems:

• Non-scanning: A product passes the scanning area and is placed in the bagging area without a valid scan. This could either be by mistake or with the purpose of not paying for the product.

• Walk-aways/Non-payment: The customer has scanned some or all products and walks away without paying for the products.

• Promotion errors: In this case, the customer only scans the number of products that they will be charged for. The free products in the promotion will not be scanned. This may generate inconsistencies in store inventory records.

• Multiple variety errors: The customer only scans one product many times while having several different products at the same price. This could happen when a product has different flavours but the same price, leading to corruption of inventory records.

• Double-scanning: A customer scans the same product more than once. While this is not a loss for the retailer, it will cause a drop in stock accuracy, leading to over-ordering of double-scanned products and resulting in dissatisfied customers.

• Miss-scanning: This can be performed in two ways; the customer places a product on a scale and selects another product from the list of options, or the customer places more products in the bag after weighing.

• Barcode switching: The customer uses or replaces the barcode of a product with that of a cheaper type and scans it instead.

To minimise shrinkage, SCO systems are equipped with different types of sensors. The most commonly used sensor is the scale, which helps to verify the weight of products. Although the scale is far from perfect, it can mitigate the effects of shrinkage loss for non-scanning, promotion errors, double scanning and barcode switching. A case study showed that the shrinkage loss for a retailer increased by 147% when having no weight check (Beck, 2018). The check is typically implemented so that a product is put on the scale for verification after it has been scanned. If the product does not match the statistical weight range of the product, the user will be prompted by a notification message about the weight error directly or at the end of the shopping session. For this to work, several weight samples must be stored for every product so that the system can build up statistical weight ranges that are accurate. Another issue is that people tend to put other products or things such as gloves on the scale, and this will require the assistance of an attendant who will inform the customer on how to use the SCO as designed. Weight range generation is also problematic for weight-priced products such as produce. Produce products tend to have a large deviation in weight, and the mean weight could also vary depending on seasonality. To increase the efficiency of the customer flow, the retailers have to either accept a larger weight range or disable/remove the scale. The consequence of either choice is that the security of the SCO will be affected.
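To make the scale-based check concrete, the following is a minimal illustrative sketch, not taken from the paper or from any SCO vendor's code, of how a statistical weight range could be built from stored samples and used to verify a scan; the sample values, the tolerance factor k and the function name are invented for illustration.

```python
import statistics

def weight_check(measured_weight, stored_weights, k=3.0):
    # Accept the scan if the measured weight lies within mean +/- k * std
    # of the weights previously stored for this barcode.
    mean = statistics.mean(stored_weights)
    std = statistics.stdev(stored_weights)
    return abs(measured_weight - mean) <= k * std

# Hypothetical stored samples (grams) for one barcode, and two scanned items.
samples = [1005.0, 998.0, 1010.0, 1002.0, 995.0]
print(weight_check(1004.0, samples))   # True  -> passes the weight check
print(weight_check(350.0, samples))    # False -> possible barcode switch
```

A wider tolerance (larger k) reduces customer blocking but weakens the check, which mirrors the trade-off described above.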

A product verification system using OCR would be able to address the fraudulent techniques of barcode switching and miss-scanning. When a product is scanned, the verification system will check if it is plausible that the scanned barcode is correct. If not, a potential case of barcode switching is detected. In the same way, a miss-scan would be detected when a customer selects a product from the list of options which the verification system does not agree with.


4.2. OCR techniques

Generally, the data on which OCR techniques achieve high performance is black-and-white and line-based documents. However, they fail to provide good results for text in natural scene images. This type of image, like those of retail products, varies widely in appearance and layout, contains a large number of fonts and styles, suffers from occlusion and orientation issues, and includes incoherent lighting. Additionally, the images may be very noisy if any objects are captured in the background.

This makes the task of automatic detection and recognition of text in natural or scene images, called text spotting, a more challenging problem than document OCR. Some techniques, called stepwise methods, try to solve this task in two separate stages: text detection with a pre-trained detection model followed by recognition using a separate model (Long et al., 2020). In the work of Jaderberg et al. (2015), the text is detected based on a region proposal mechanism combining edge boxes and a weak aggregate channel features detector. Then, a deep CNN is used for recognition by performing classification across a predefined dictionary of words.

Liao et al. (2018) used TextBoxes++ to detect arbitrary-oriented scene text and adopted a convolutional recurrent neural network (CRNN) for recognition. In Liu et al. (2018), the authors proposed a unified end-to-end trainable Fast Oriented Text Spotting (FOTS) network for detection and recognition of scene text. The key idea is to perform these two tasks simultaneously and share features and computation between them, taking advantage of the convolution sharing strategy.

In our proposed system, Google Vision API (Google, 2019) was used to extract textual information from image data. This solution was selected after comparison with other solutions such as OCRopus (Breuel, 2008), FOTS (Liu et al., 2018), Tesseract (Čakić et al., 2020), and a hybrid solution consisting of EAST (Zhou et al., 2017) and Tesseract. The pre-trained model of Google Vision API deals more efficiently with images having text with various fonts, orientations, and densities.

4.3. Text classification

Text classification is the automatic process of classifying texts into categories or assigning one or more classes to natural language texts based on their content. The text content can be classified according to four levels:

• Document level, in which the algorithm obtains the relevant features of the entire document.

• Paragraph level, which learns the category of the document through the features of every single paragraph.

• Sentence level, in which the techniques rely on every single sentence to perform the classification.

• Character level, in which each character is used as a data unit to learn the classification task.

Generally, classification techniques can be dissected into statistical and machine learning (ML) approaches. Statistical techniques are purely mathematical processes functioning like a computer program that executes the given explicit instructions without any ability of its own. They are not automated and satisfy the proclaimed hypotheses manually (Allahyari et al., 2017). To achieve a high classification performance, the amount of data to be processed and its dimensionality must be low. The important statistical feature extraction techniques used in text classification are Principal Component Analysis, Biased Discriminant Analysis, and Average Neighbourhood Margin Maximisation. They are generally inefficient for large datasets, not suitable for nonlinear data, and achieve good results particularly in binary text classification. Further details can be found in the review paper by Deng et al. (2019). In practice, term frequency–inverse document frequency (TF–IDF) is the most widely used statistical feature (Beel et al., 2015). It is a numerical statistic that uses word frequency information to measure the importance of each word for a document in a collection corpus. The TF–IDF value increases proportionally to the occurrence frequency of a term in a document. It is inversely proportional to the number of documents in the corpus that contain this term.
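As an illustration of the TF–IDF weighting described above, the following minimal sketch uses scikit-learn's TfidfVectorizer on a few invented OCR snippets; the texts and the choice of library are illustrative assumptions only, not the paper's implementation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy OCR snippets (invented examples) for three products.
docs = [
    "organic milk 1.5 fat 1 litre",
    "whole milk 3 fat 1 litre",
    "free range eggs 12 pack",
]

# A term's TF-IDF weight grows with its frequency in a document and
# shrinks with the number of documents that contain it.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))
```

Words such as "milk" that occur in several documents receive lower weights than product-specific words such as "eggs", which is exactly the discriminative behaviour exploited for classification.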

Many machine learning approaches have outperformed the results obtained in NLP using conventional statistical methods. The success of ML approaches is due to their ability to understand complex and nonlinear relationships within data. The challenge for these approaches is to find appropriate structures, architectures, and techniques to increase learning performance. To analyse sentiment in short texts, Dos Santos and Gatti (2014) proposed a deep network called Character to Sentence Convolutional Neural Network, which uses two convolutional layers to extract relevant features from words and sentences of any size. The network extracts features from the character level up to the sentence level. Lee and Dernoncourt (2016) proposed a model based on RNN and CNN for short text classification. Their results show that including sequential information improves the quality of the prediction. Mirończuk and Protasiewicz provide an overview of the state-of-the-art elements and techniques of text classification and some of their applications (Mirończuk & Protasiewicz, 2018). The authors present a review of what has been done on OCR, including the developed techniques and their limitations.

Xu et al. (2020) proposed a Dual Embeddings CNN that concatenates two different embeddings: word embedding and concept embedding. This model utilises two layers to extract concepts and context respectively and then employs an attention layer to extract context-relevant concepts. These concepts are incorporated into the text representation for short text classification jointly with word embedding.

4.3.1. Text representation

The deep learning model cannot deal directly with raw text. Therefore, the raw text should be converted to numerical tensors before being fed to the model input. This conversion process is called text vectorisation and is categorised into two main methods: one-hot encoding and text embedding.

One-hot encoding is a representation of categorical variables as binary vectors. In NLP, this technique associates each unique word in the document with a unique integer value and then maps the resulting integer values into a set of binary vectors 𝑉. Each vector 𝑉𝑖 ∈ 𝑉 has a fixed length 𝑁, which is equal to the vocabulary size, and has only the 𝑖th element as 1, while the remaining elements are all 0. This technique is memory consuming, and one-hot vectors are sparse since they are high dimensional and include many zeros. In addition, the syntactic and semantic relations between words are lost. To reduce the sparsity and complexity, bag-of-words models assign each word an integer index and construct vectors for chunks of text. These vectors contain the indexes of the words present in the corresponding chunk. Thus, bag-of-words models use only information about which words are used in the text without considering the order or structure of words in the document. Bag of n-grams enables using partial information about the structure of the text. To capture the syntactic and semantic features of the text, word embedding has emerged.
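The following short sketch, using an invented toy vocabulary, illustrates the difference between a one-hot vector for a single word and a bag-of-words vector for a chunk of text; it is only a didactic example, not part of the proposed system.

```python
import numpy as np

vocab = {"organic": 0, "milk": 1, "eggs": 2, "free": 3, "range": 4}

def one_hot(word):
    # One-hot: a |V|-dimensional binary vector with a single 1.
    v = np.zeros(len(vocab), dtype=int)
    v[vocab[word]] = 1
    return v

def bag_of_words(text):
    # Bag-of-words: word counts for a chunk of text; word order is discarded.
    v = np.zeros(len(vocab), dtype=int)
    for w in text.split():
        if w in vocab:
            v[vocab[w]] += 1
    return v

print(one_hot("milk"))                    # [0 1 0 0 0]
print(bag_of_words("organic milk milk"))  # [1 2 0 0 0]
```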

Word embedding - Word embedding is a set of language modelling and text feature learning techniques. It aims to represent language items (words or sentences) as float vectors. For that, statistical or neural network models learn a parameterised function that maps words to multidimensional vectors. The dimensions (tens or hundreds) are still much smaller than the one-hot encoded dimensions (thousands or millions). Using continuous floating-point number vectors, more information can be incorporated into lower-dimensional spaces (Mikolov et al., 2013). Word embedding achieves a high level of generalisation that is not possible with n-gram language models. In fact, the resulting vectors incorporate a set of properties. They quantify and categorise the semantic similarities between words or sentences based on their distribution in large text datasets. First, similar words will have similar vectors. In the multidimensional space, similar words will be located in the same position and have the same coordinates. The cosine metric, which measures the similarity between words (or documents), is close to 1 for synonyms and morphological variations of the same word. Thus, the semantics of the document is easily captured even if the appearances (spelling) of the words are completely different. This is achieved thanks to the learning of context and the use of words. In contrast, in bag-of-words techniques, different words have different representations regardless of how they are used. Second, some algebraic transformations encompass semantic features. For instance, the differences between vectors reflect the meaning included in the difference between the words represented by those vectors.
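As a small illustration of the cosine metric, the sketch below compares hypothetical low-dimensional embedding vectors; real embeddings are learned from large corpora and have far more dimensions, and the vectors shown here are invented for illustration.

```python
import numpy as np

def cosine_similarity(u, v):
    # Cosine similarity: close to 1 for words used in similar contexts.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical 4-dimensional embeddings (placeholder values).
milk  = np.array([0.9, 0.1, 0.4, 0.0])
mjolk = np.array([0.8, 0.2, 0.5, 0.1])   # e.g. the same product word in another language
eggs  = np.array([0.1, 0.9, 0.0, 0.6])

print(cosine_similarity(milk, mjolk))  # high similarity
print(cosine_similarity(milk, eggs))   # lower similarity
```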

There are two main categories of word embedding methods (Wang et al., 2020b).

• Static: in this technique, the word embedding is trained offline using another machine learning model and a big text dataset. Then the trained vectors are embedded into the classifier. During the classification process, the model does not update the word vectors according to the context. The static mode is generally used where the amount of data is small.

• Non-static: the word embeddings are learned online and simultaneously with text classification, and a word can then be represented differently according to the context. The vectors are either randomly initialised and then updated during the classification, or initialised by pre-trained Word2Vec vectors and then adjusted during training. The second way accelerates the convergence, requires less computation and is more efficient if the training data is insufficient (see the sketch below).

Among the solutions proposed to construct word embeddings, we can mention Word2Vec (Mikolov et al., 2013) developed by Google, Global Vectors (GloVe) (Pennington et al., 2014) built by Stanford University, and FastText (Bojanowski et al., 2017) created by Facebook AI Research. Our attention is drawn to Word2Vec and GloVe. Word2Vec is a two-layer neural network that converts a text corpus received at its input to feature vectors representing words in that corpus, considering the context and text structure. It is an unsupervised technique driven by the context: while the feature vector assigned to a word cannot be used to accurately predict that word's context, the components of the vector are adjusted. The numerical form of the output is useful for deep neural networks, which can be a classifier, translator, summariser, or any other model. Word2Vec utilises one of two model architectures: (1) Continuous bag of words (CBOW): the model predicts the current word (the centre word) based on a window of surrounding context words. (2) Skip-gram: it tries to achieve the reverse of what the CBOW model does. Using a word (the centre word), it learns to predict the context words (surrounding words). GloVe is a log-bilinear regression model which learns the word representation in an unsupervised way. It encodes the semantic relationships between words using the ratio of co-occurrences. Choosing a suitable loss function, the resulting statistical relationships are used to train a log-linear model to maximise the similarity of the word pairs.
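As a hedged illustration of the two Word2Vec architectures, the gensim library exposes them through the sg parameter; the toy corpus and hyperparameters below are invented and are not those used in the paper.

```python
from gensim.models import Word2Vec

# Toy corpus of tokenised OCR snippets (invented examples).
sentences = [
    ["organic", "milk", "1.5", "fat"],
    ["whole", "milk", "3", "fat"],
    ["free", "range", "eggs", "12", "pack"],
]

# sg=0 selects CBOW (predict the centre word from its context);
# sg=1 selects skip-gram (predict the context words from the centre word).
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(cbow.wv["milk"][:5])               # first components of the learned vector
print(skipgram.wv.most_similar("milk"))  # nearest words by cosine similarity
```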

Document embedding - Word embedding was a notable breakthrough in NLP and has proved to be efficient in almost every NLP task. It provides a rich representation of the text to all machine learning models that rely on vector representations as input, preserving semantic and syntactic information on words. This efficiency has led researchers to think of innovative solutions to create embeddings and rich representations for larger units of text, ranging from sentences and paragraphs, through short texts and products, to books and more complex documents. This effort has resulted in some notable techniques.

Le and Mikolov (2014) proposed an unsupervised algorithm, called Paragraph Vector, that learns fixed-length features from variable-length pieces of text (sentences, paragraphs, and documents). They showed that it outperforms bag-of-words and demonstrated the technique's capabilities on various text classification and sentiment analysis tasks. The same technique is evaluated on document similarity datasets (Dai et al., 2015), and on the question duplication task and the Semantic Textual Similarity (STS) task (Lau & Baldwin, 2016). Another unsupervised learning technique is developed by Kiros et al. (2015), called Skip-Thought. It is based on an encoder–decoder model trained to reconstruct the surrounding sentences of an encoded sequence. It scales up the job made by skip-gram on words to sentences, thus predicting the surrounding sentences of a given sentence. The authors evaluated Skip-Thought Vectors on 8 tasks: semantic relatedness, image-sentence ranking, question-type classification, and sentiment and subjectivity analysis.

Short text - Our data is a special case of textual content called short texts. The need for short text analysis arises in several contexts other than product descriptions, such as online product reviews, chat messages, comments, Twitter feeds, etc. In general, short texts, and in particular our textual data, which are extracted using OCR, have certain specific characteristics that present real challenges for the classification task (Chen et al., 2019).

• Shortness: A short text is not rich and contains only a few words. This may result in an inadequate representation of the document and, therefore, difficulties in extracting features. If the data is noisy and needs to be cleaned, the texts will become even shorter, and the complexity will increase.

• Sparsity: The dataset includes a high percentage of empty samples. The length of texts describing the same product ranges from empty text to a few paragraphs.

• Misspelling: Due to image occlusion, luminosity, and product shape, the extracted texts eventually include misspelt words.

• Multi-languages: Usually, the products sold in the same store are made in various countries, and their metadata is written in different languages. Moreover, the same product is often described in multiple languages.

4.3.2. Machine learning for text classification

In the following, we formally describe the three models that form the building blocks of our framework, namely CNN, Long Short-Term Memory (LSTM) and Random Multimodel Deep Learning (RMDL). A CNN contains at least one convolutional layer followed by at least one fully connected layer. In the convolutional layer, CNN uses the mathematical convolution operation instead of the usual matrix multiplication to generate a feature map chained to further convolutions or the dense layer. The shallow layers, which are close to the input, learn low-level features. Going deeper in the model, the deep layers can learn high-order and more abstract features. CNNs are considered universal nonlinear function approximators and are mainly used to extract spatial information from the data. In contrast to classical statistical approaches, they do not need any handcrafted features.

A Recurrent Neural Network (RNN) is a category of deep neural networks that includes a recurrent connection in their layers. The node's output is redirected as the input of the node again, with the new data forming a temporal chain. This architecture is mainly designed to learn the temporal dependencies in data and memorise a sequence of data. The main challenge in using basic RNNs is the vanishing gradient encountered during training, making the traditional RNN not suitable for capturing long-term dependencies in sequence data. LSTM is a specific subcategory of RNNs which solves this issue by incorporating three gates controlling the input, output, and memory state (Pascanu et al., 2013). The input gate makes the addition of the states and determines how far a new value flows into the cell. The memory of LSTM, managed by the forget gate, is developed by training to memorise past measurements and learn long-term dependencies in the data sequence through controlling the extent to which a new value remains in the cell. The output gate controls the impact of the value in the cell on the output activation of the LSTM unit. RMDL is an ensemble deep learning approach for classification which contains three models, DNN, CNN and RNN, with a random number of layers (Kowsari et al., 2018). The output uses majority vote to provide the final result. This approach tries to find the best deep learning structure and architecture which provides high robustness and accuracy using ensembles of deep learning models.

The proposed system contains a classifier that aims to classify the products based on their descriptions. Fig. 4 illustrates the general workflow of the implementation of this process. Since the data extracted from images using OCR is basically raw text snippets, several preprocessing techniques are applied to transform the data before feeding it to the text classification component, such as tokenisation or cleaning. The next step in the workflow is feature engineering, which can be either feature extraction based on statistical techniques or text embedding, explained in detail in Section 4.3.1. We opted for using text embedding since it improves the ability of neural networks to learn textual data and outperforms the conventional statistical feature extraction methods in capturing both syntax and semantics of the text.
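As an illustrative sketch only (the paper's exact architectures and hyperparameters are not reproduced here), one of the building blocks described above could be expressed in Keras as an embedding layer followed by an LSTM and a softmax layer producing the per-product probability estimates used later for verification; all sizes below are placeholders, and the 59 classes correspond to the products in the dataset described in Section 5.1.

```python
from tensorflow import keras
from tensorflow.keras import layers

vocab_size, embed_dim, seq_len, n_classes = 5000, 100, 40, 59

# Sketch of an LSTM text classifier over embedded OCR snippets.
inputs = keras.Input(shape=(seq_len,))
x = layers.Embedding(vocab_size, embed_dim)(inputs)         # text embedding (Section 4.3.1)
x = layers.LSTM(128)(x)                                      # sequence model
x = layers.Dropout(0.3)(x)
outputs = layers.Dense(n_classes, activation="softmax")(x)   # probability estimate per product

model = keras.Model(inputs, outputs)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
# model.fit(X_train, y_train, validation_split=0.1, epochs=10)  # hypothetical data
```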

4.4. Product verification using Mondrian conformal prediction

Conformal Prediction (CP) (Vovk et al., 2005) was introduced as an approach for associating classification predictions with confidence measures. The central component of the CP framework is the nonconformity function. The nonconformity function assigns a nonconformity score to each instance-label pair in classification. When predicting a specific test instance, a nonconformity score is assigned to each possible class label, and the scores are compared to the scores obtained from instances with known class labels. The instances with known class labels are only assigned a nonconformity score for the true class. The labels assigned to a test instance that are found to be nonconforming compared to the scores of the instances with known labels are excluded. A label is considered nonconforming if the nonconformity score for that label is higher than a predefined fraction (significance level $\epsilon$) of the scores assigned to the instances with known labels. Therefore, clearly, CP is a form of hypothesis testing. Each possible class could be considered a null hypothesis that may be disproved if it is significantly different from the labelled data. The prediction for the test instance is the set of class labels that were not excluded.

In Inductive Conformal Prediction (ICP) (Papadopoulos, 2008), the training data is split into a proper training set, used to train the predictive model, and a calibration set, used to calibrate the nonconformity scores for a test example to identify whether it is conforming or not. In the following more formal description of ICP, a similar notation is used as in Vovk (2013). The set of instances is denoted $Z$, which is the Cartesian product $X \times Y$ of the independent variables $X$, henceforth called the object space, and the dependent variable $Y$. Consequently, each example $z \in Z$ consists of two parts: $z = (x, y)$, where $x \in X$ is the object and $y \in Y$ is the dependent variable. In classification, $Y$ is a finite set usually referred to as the class variable, and in regression, $Y$ is the real line $\mathbb{R}$. Since our proposed system treats product verification as a classification problem, the presentation will only describe CP for classification.

Let us consider a set $z_1, \ldots, z_l$ of training instances, where $l$ is the number of available instances and $z_i = (x_i, y_i) \in Z$. We split the set into a proper training set $z_1, \ldots, z_m$ of size $m < l$ and a calibration set of size $n := l - m$.

Let $x_{l+1}$ be a new test object. The idea of conformal prediction is to try all possible class labels $\hat{y} \in Y$ for the test object to measure how well each label conforms to the proper training set. In other words, for an object and class label, $z = (x_{l+1}, \hat{y})$, the aim is to determine if it is possible that the label $\hat{y}$ can be the true class label for the object $x_{l+1}$. To determine if that is possible, a nonconformity score $A((z_1, \ldots, z_m), z)$ needs to be calculated using the inductive conformity function. The conformity function is often defined as

$$A((z_1, \ldots, z_m), (x, \hat{y})) := \Delta(\hat{y}, f(x)),$$

where $f$ is a predictive model, trained using the proper training set $\{z_1, \ldots, z_m\}$. $\Delta$ measures the distance between the class label $\hat{y} \in Y$ and the prediction $f(x)$ of the underlying model. The model $f$ is trained using a machine learning algorithm, such as neural networks, decision trees, k-nearest neighbours, ensembles, etc.

An inductive conformal predictor (ICP) using the nonconformity function $A$ is defined as the set predictor

$$\Gamma^{\epsilon}(z_1, \ldots, z_l, x) := \{\hat{y} \mid p_{\hat{y}} > \epsilon\},$$

where $\epsilon \in (0, 1)$ is the chosen significance level and $p_{\hat{y}}$, $\hat{y} \in Y$, is defined by

$$p_{\hat{y}} = \frac{|\{i = m+1, \ldots, l : \alpha_i \geq \alpha_{\hat{y}}\}| + 1}{l - m + 1},$$

where $\alpha_i = A((z_1, \ldots, z_m), z_i)$, $i = m+1, \ldots, l$, and $\alpha_{\hat{y}} = A((z_1, \ldots, z_m), (x, \hat{y}))$ are the nonconformity scores for the calibration set and the test example, respectively. Nonconformity scores are only calculated for the true target on the calibration instances, while one nonconformity score is calculated for each class label $\hat{y} \in Y$ for the test object $x$.

When CP makes a prediction, the error rate is guaranteed for the prediction set. However, for each individual class, nothing can be guaranteed since all errors may be made on only one of the classes (Löfström et al., 2015). In product verification, this is severe since it is only interesting to identify when something does not belong to the verified product. Fortunately, CP can be transformed into a Mondrian conformal predictor (MCP), which can handle this situation with only minor alterations. An MCP is guaranteed to be valid for each class separately (Vovk, 2013).

Transforming CP into an MCP is essentially a very straightforward operation. The main difference is that instead of considering all the instances in the calibration set when calculating the $p$-value for an object and class label, $z = (x_{l+1}, \hat{y})$, only calibration instances with the same class as the tested class $\hat{y}$ are considered. Furthermore, the user can choose to only calculate $p$-values for the individual classes, skipping classes that are out of context. This is relevant to the product verification case since the only relevant test is whether an instance belongs to the product it is assumed by the system to belong to or if it can be discarded as not belonging to that product due to a too low $p$-value. In the following description, we refer to the class representing the product assumed by the system as being the normal class in each application of product verification. This means that the $p$-value instead needs to be calculated in the following way

$$p_{\text{normal}} = \frac{|\{i = m+1, \ldots, l : y_i = \text{normal} \wedge \alpha_i \leq \alpha_c\}| + 1}{|\{i = m+1, \ldots, l : y_i = \text{normal}\}| + 1},$$

where $y_i$ is the class of calibration example $z_i$ and $\alpha_c$ is the nonconformity score of the test object for the normal class.

CP produces, for each new test object, a set prediction that may contain all possible classes. When considering an MCP used for product verification, only the $p$-value of the normal class is considered. To determine whether an instance is abnormal, the $p$-value is compared with the selected $\epsilon$. If the $p$-value is lower than the selected $\epsilon$, it means that the instance is among the $100\epsilon$% most abnormal instances.
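The following is a minimal sketch (not the authors' code) of the Mondrian p-value and the verification decision; it assumes that the nonconformity scores have already been computed, e.g. from the text classifier's probability estimates, and that larger scores mean more nonconforming, as in the ICP p-value above (the comparison would be flipped for a conformity score).

```python
import numpy as np

def mondrian_p_value(alpha_test, calib_alphas, calib_labels, normal_class):
    # Only calibration instances of the barcode's class are used,
    # which is what gives per-class (Mondrian) validity.
    same = calib_alphas[calib_labels == normal_class]
    return (np.sum(same >= alpha_test) + 1) / (len(same) + 1)

def passes_verification(alpha_test, calib_alphas, calib_labels, normal_class, epsilon):
    # Flag the scan as anomalous only if it is among the most abnormal
    # fraction epsilon of instances of the barcode's class.
    return mondrian_p_value(alpha_test, calib_alphas, calib_labels, normal_class) >= epsilon

# Hypothetical calibration scores for two products (0 = milk, 1 = eggs).
calib_alphas = np.array([0.05, 0.10, 0.08, 0.90, 0.85, 0.12])
calib_labels = np.array([0, 0, 0, 1, 1, 0])
print(passes_verification(0.07, calib_alphas, calib_labels, 0, epsilon=0.25))  # True  (p = 0.8)
print(passes_verification(0.95, calib_alphas, calib_labels, 0, epsilon=0.25))  # False (p = 0.2)
```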

For the conformal framework to work, the data must be exchangeable. For data to be exchangeable, future data must behave like earlier instances. Exchangeability is a slightly weaker assumption than the independent and identically distributed (IID) assumption assumed by most machine learning algorithms.

4.4.1. Nonconformity function

How well the product verification solution works is partly determined by the underlying model. In principle, any nonconformity function can be used. Still, the better the nonconformity scores produced by the function can distinguish between normal and abnormal instances, the more effective the solution will be. In CP, the nonconformity function is usually defined by the predictions of a predictive machine learning algorithm, such as deep learning or ensemble techniques. The most common choice is to base the nonconformity score upon a probability estimate from the predictive model, so that a lower estimate from the predictive model for one class indicates that the class is more nonconforming for that instance, i.e. that the class is less likely.
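Assuming the underlying model exposes probability estimates (for example a scikit-learn-style predict_proba, which is an illustrative assumption rather than the paper's implementation), such a probability-based nonconformity function can be sketched as follows.

```python
def nonconformity(model, x, class_index):
    # A lower probability estimate for the class yields a higher
    # nonconformity score, i.e. the class is less likely for this instance.
    return 1.0 - model.predict_proba([x])[0][class_index]
```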


Fig. 4. The general workflow of the classifier implementation.

5. Implementation and design evaluation

When designing the system, the previously described techniques have all been considered. This section will describe and motivate the design choices we have made and present the experimental setup for the system evaluation that we have performed. The results of the evaluation are presented in Section 6.

5.1. Source data extraction and labelling

The image data used in this paper has been extracted from a large retail chain during regular operation. The dataset consists of 59 different types of products. The selection of products to include has been based on the sales figures of the retailer. That is why the dataset includes both produce products (banana, apple), which typically have limited text information, and barcode products (milk, packaged eggs), which contain more textual information. Additionally, some of the products are the same type of product but with a different brand. For example, there are four different types of packaged eggs within the dataset. Furthermore, the sales figures have been used for sampling an imbalanced dataset. The design choice of selecting the most common products while also reflecting the data distribution has been made to indicate how the verification system would perform in a real-world environment. The number of images in the training dataset was 10395, and the test dataset consisted of 1520 images.

The collection of image data has been performed by two different methods: manual and automatic. The manual mode was used whenever the EasyFlow/HyperFlow system was unable to classify a product. An attendant had to manually assign the type of product that had passed through the scanning tunnel. For the automatic method, barcode reading and machine learning algorithms were used to classify the product. If the confidence was above a certain threshold, the image was labelled with the barcode of the classified product type. With this labelling process, incorrect labels may occur for some images. We have not been able to quantify the number of incorrect labels due to close similarity between classes, for example, banana and organic banana. Manual subsampling of the dataset has shown that the number of incorrect labels is negligible.

The data used in the text classification experiments are summarised in Table 2, listing the size of the classes and their text density. Text density expresses the length of the product metadata qualitatively. Using the text density and the number of samples containing text, four categories were defined through manual inspection of the dataset. Products with dense text quality have longer description texts than the other products; then come the other qualities in descending order: partial, minimal and none. The proportions of the train and test data differ only for a few classes. Ids 19, 30 and 50 have some additional test samples, while some test samples have been removed from id 33. However, this is just a small fraction of the dataset and will not impact any of the results.

Table 1 shows the sparsity level (number of empty texts) and the average number of characters per sample for each category. The minimal and none categories contain high rates of empty samples. The empty samples are generally related to products having short description texts that are often not captured due to occlusion or OCR reading imperfections, or to images in which no text is present because any existing text on the product is invisible to the camera. Fig. 5 illustrates a sample from each category of text quality.


Table 1
Text sparsity levels and average number of characters for each text quality category.

Category   Sparsity level   Average characters
dense      1%               500
partial    12.5%            153
minimal    65.6%            10
none       84.7%            0.5

Table 2
Summary of the 59 products of the dataset used for the classification experiments reported in Section 5.1. The number of samples for each product reflects the sales figures for the product.

Id   Text density   #Images Train/Test   Id   Text density   #Images Train/Test
1    dense          100/15               31   partial        100/15
2    dense          100/15               32   minimal        200/30
3    dense          100/15               33   minimal        1000/100
4    dense          100/15               34   minimal        200/30
5    dense          100/15               35   minimal        100/15
6    dense          200/30               36   minimal        100/15
7    dense          100/15               37   minimal        200/30
8    dense          200/30               38   minimal        200/30
9    dense          200/30               39   minimal        300/45
10   dense          100/15               40   minimal        100/15
11   dense          400/60               41   minimal        100/15
12   dense          200/30               42   minimal        300/45
13   dense          100/15               43   minimal        100/15
14   dense          200/30               44   minimal        300/45
15   dense          100/15               45   minimal        100/15
16   dense          100/15               46   minimal        300/45
17   partial        200/30               47   minimal        200/30
18   partial        200/30               48   minimal        100/15
19   partial        40/10                49   minimal        100/15
20   partial        300/45               50   minimal        40/10
21   partial        300/45               51   minimal        100/15
22   partial        100/15               52   minimal        100/15
23   partial        200/30               53   none           100/15
24   partial        100/15               54   none           200/30
25   partial        200/30               55   none           200/30
26   partial        200/30               56   none           100/15
27   partial        200/30               57   none           200/30
28   partial        300/45               58   none           100/15
29   partial        300/45               59   none           100/15
30   partial        15/5

To obtain a sufficiently large calibration set, as required for the MCP experiment, we increased the size of the dataset. The training data used includes about 500 samples for each product with dense or partial quality and about 200 samples for the minimal and none qualities, and the test data includes 100 samples for each product with dense or partial quality and 60 samples for products with minimal or none quality.

5.2. OCR reading component

The first component, which is based on the Google Vision API, receives the images from the source system. The result obtained from the API is represented in JSON format, consisting of the text information in the image with corresponding metadata. Fig. 6 shows partial results of the OCR extraction from a banana. Metadata such as bounding boxes for each element and the locale are also provided in the resulting JSON file.
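As an illustration of this component, the snippet below sketches a text-detection request with the google-cloud-vision Python client. The helper name extract_text and the byte-string input are our own, credentials and error handling are omitted, and the exact client call may vary between library versions; it is only a sketch of how such a request could look, not the implementation used in the paper.

from google.cloud import vision

def extract_text(image_bytes):
    """Send one product image to the Vision API and return the full detected text.
    The first annotation in the response aggregates all detected text blocks."""
    client = vision.ImageAnnotatorClient()   # relies on GOOGLE_APPLICATION_CREDENTIALS
    response = client.text_detection(image=vision.Image(content=image_bytes))
    annotations = response.text_annotations
    return annotations[0].description if annotations else ""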


Fig. 5. Examples of the different categories of text quality. From left to right: none, minimal, partial and dense.

Fig. 6. Image and text representation in JSON of a banana in the dataset.

5.3. Text classification component

The second component of the system is a classifier which aims to classify the products based on their descriptions. It includes three phases: data preprocessing, feature engineering, and classification, which represents the core of the component. To decide which form of embedding to use, we tested character, word and document embeddings and their combinations. Feature extraction is used to assess the impact of channel boosting on the classifier's performance by feeding it with auxiliary text features. The channel boosting technique feeds the learning model with multiple data features through different channels. The roles of these channels may be complementary, allowing the model to extract more knowledge than it could from data received via a single channel.
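To make the idea concrete, the following sketch builds a two-channel Keras model in the spirit of channel boosting: one channel processes a word-embedding sequence with a small text CNN, and a second channel receives auxiliary TF–IDF features. All sizes (vocabulary, sequence length, TF–IDF dimension) are illustrative assumptions, not the values used in the paper.

import tensorflow as tf
from tensorflow.keras import layers

# Hypothetical sizes; the paper's exact dimensions differ per experiment.
VOCAB_SIZE, SEQ_LEN, EMB_DIM, TFIDF_DIM, N_CLASSES = 20000, 100, 100, 5000, 59

# Channel 1: word-embedding sequence processed by a small text CNN.
words_in = layers.Input(shape=(SEQ_LEN,), name="word_ids")
x = layers.Embedding(VOCAB_SIZE, EMB_DIM)(words_in)
x = layers.Conv1D(128, 5, activation="relu")(x)
x = layers.GlobalMaxPooling1D()(x)

# Channel 2: auxiliary TF-IDF features fed through a dense projection.
tfidf_in = layers.Input(shape=(TFIDF_DIM,), name="tfidf")
y = layers.Dense(128, activation="relu")(tfidf_in)

# The channels are concatenated so the classifier can exploit both views.
merged = layers.concatenate([x, y])
out = layers.Dense(N_CLASSES, activation="softmax")(merged)

model = tf.keras.Model(inputs=[words_in, tfidf_in], outputs=out)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])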

The various experiments carried out on this system component aim to tackle the points listed below; their experimental setup is detailed in Section 5.5.

• Evaluation of different models for the classification task, namely CNN, LSTM, and RMDL.
• Comparative analysis of the text representation methods.
• Assessment of the impact of text sparsity on the classifier performance.
• Selection of the feature extraction technique and evaluation of the channel boosting concept.

5.4. Product verification component

The third and last component utilises Mondrian conformal prediction (MCP) to provide the system user with the ability to set individual significance levels for each product. MCP was selected to help handle the class imbalance problem inherent in the application. Each Mondrian category is defined by a barcode, meaning that each product has its own calibration set. The nonconformity score is defined by the probability estimates received from the text classification component.
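A minimal sketch of this component is given below, assuming that nonconformity scores for each product's calibration set have already been computed from the classifier's probability estimates. The function names and the dictionary layout are our own illustration of a class-conditional (Mondrian) inductive conformal step, not the authors' code.

import numpy as np

def mondrian_p_value(alpha_cal_per_class, alpha_test, barcode):
    """Class-conditional p-value: the test score is compared only against the
    calibration scores of the scanned product's own class (its barcode)."""
    cal = np.asarray(alpha_cal_per_class[barcode])
    return (np.sum(cal >= alpha_test) + 1) / (len(cal) + 1)

def verify_scan(alpha_cal_per_class, alpha_test, barcode, significance=0.05):
    """Raise an alarm when the scanned barcode is rejected at the chosen
    significance level, i.e. when its p-value falls below the threshold."""
    return mondrian_p_value(alpha_cal_per_class, alpha_test, barcode) < significance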

The experiments carried out to test the performance of this component are listed below, and their experimental setup is detailed in Section 5.5:

• Ensure the preliminary conditions for the proper functioning of the conformal predictor, mainly the exchangeability of the data.
• Guarantee the validity of the Mondrian conformal predictor.
• Show the utility and efficiency of the proposed system as a product verifier.

5.5. Experimental setup

The evaluation of the system is divided into two parts. The first part presents the results from each of the components, as described above. In the second part, the performance of the product verification system is evaluated using several experiments in which the system is stress tested. The implementation is done using the Python programming language and the TensorFlow 2.0 learning library. A machine with an Intel i7 microprocessor and 16 GB of RAM is used to run the implemented code.

5.5.1. Evaluation of the components
OCR Reading Component - As we do not have access to the ground truth, i.e., the actual texts printed on all products used, no comparison of the performance of the OCR solutions was performed.

Text Classification Component - In the first experiment, we investigate the impact of data quality and text sparsity on the classifier performance. Then, in the second experiment, we explore the efficiency of different embedding levels for the classification task. The objective is to determine the most suitable level of text representation and the space dimension allowing the optimal discrimination of text features.



Table 3
Hyper-parameter tuning search spaces used for tuning the CNN-based model structure and the obtained optimal values.

Hyper-parameter                    Search spaces          Optimal values
Hidden layers                      (3, 4, 5)              3
Filter sizes                       (3, 5, 7)              5, 5, 3
# units in convolutional layers    (64, 128, 160, 256)    256, 128, 128
# units in dense layers            (128, 160, 250)        128, 128
Dropout                            [0, 0.3]               0.2

Table 4
Hyper-parameter tuning search spaces used for tuning the CNN-based model and the obtained optimal values.

Hyper-parameter         Search spaces      Optimal values
# Epochs                [10, 20]           12
Batch size              (128, 160, 256)    128
Learning rate           [1e-4, 1e-2]       8e-3
Decay                   [0, 1e-4]          1e-2
Regulariser l1 and l2   [0, 1e-6]          5e-5 / 1e-4

In the experiments carried out using word and document embeddings, some text preprocessing operations are performed. First, the text tokenising process splits the text into a sequence of words. To obtain the grammatical root of the obtained words, a stemming operation is performed. The benefit of stemming is that it reduces the vocabulary size of the training documents. Then, special characters and words which do not carry any discriminative features, called stop words, are removed. For the case of character embedding, the raw dataset is kept without any preprocessing. After data loading and preprocessing, the deep learning model is created, and its hyperparameters are tuned. After the model has been fully trained for an optimal number of epochs, it is evaluated, and the performance results are reported.
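The short sketch below illustrates the preprocessing steps just described (tokenisation, stemming and stop-word removal) using NLTK. The paper does not name a specific library, and the choice of an English Snowball stemmer is an assumption for illustration; in practice the product texts may be in other languages.

# nltk.download('punkt') and nltk.download('stopwords') may be required once.
from nltk.tokenize import word_tokenize
from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import stopwords

stemmer = SnowballStemmer("english")            # assumed language for illustration
stop_words = set(stopwords.words("english"))

def preprocess(text):
    """Tokenise, stem and drop stop words / non-alphanumeric tokens, mirroring
    the preprocessing applied before word and document embeddings."""
    tokens = word_tokenize(text.lower())
    return [stemmer.stem(t) for t in tokens if t.isalnum() and t not in stop_words]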

The channel boosting technique is evaluated in the third experiment. The aim is to investigate possible rich textual ancillary features and to assess the benefit, in terms of overall accuracy, of providing them as input to the learning models. In our case, the input of the deep model can be fed with diverse embeddings and features extracted by other techniques. TF–IDF is the main feature used in the experiments.

The obtained embeddings and features are used to train three deep learning models, namely an LSTM-based recurrent neural network, a convolutional neural network, and a random multimodel deep learning (RMDL) model.

Hyper-parameter optimisation (HPO) is used to tune the most important parameters of the model. We used Bayesian optimisation based on Gaussian process regression to find the optimal structure of the models and to tune the hyper-parameters of the model training process. The search spaces for the different hyper-parameters of the CNN-based model and their optimal values are described in Tables 3 and 4, where continuous intervals are denoted with brackets and discrete sets with parentheses.
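As an illustration of such Gaussian-process-based tuning, the sketch below uses scikit-optimize's gp_minimize over a search space shaped like Table 4. The helper build_and_train, the number of calls and the exact space encoding are assumptions made for illustration; the authors' own tuning setup may differ.

from skopt import gp_minimize
from skopt.space import Integer, Categorical, Real

# Search space shaped like Table 4: brackets -> continuous ranges, parentheses -> discrete sets.
space = [
    Integer(10, 20, name="epochs"),
    Categorical([128, 160, 256], name="batch_size"),
    Real(1e-4, 1e-2, prior="log-uniform", name="learning_rate"),
]

def objective(params):
    epochs, batch_size, learning_rate = params
    # build_and_train is a hypothetical helper that builds the CNN of Table 3,
    # trains it with the given settings and returns the validation accuracy.
    val_acc = build_and_train(epochs=epochs, batch_size=batch_size, lr=learning_rate)
    return -val_acc                       # gp_minimize minimises, so accuracy is negated

result = gp_minimize(objective, space, n_calls=30, random_state=0)
best_epochs, best_batch_size, best_lr = result.x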

Product Verification Component - The best model, having the highest performance metrics and providing the best predictions, is used as the underlying model for the Mondrian conformal predictor (MCP). The goal of the fourth experiment, which is performed on this component, is to prove both the validity and the efficiency of the MCP. For the MCP to function properly, exchangeability among the observations is necessary to achieve validity. This is achieved by subsampling and shuffling the data before feeding it to the underlying model. The validity of the MCP under this condition means that it should, in the long run, obtain an error level equal to the given significance level. This means that any method claiming a certain level of confidence should make true predictions with a probability equal to that level if the experiment is repeated a large number of times, according to the law of large numbers. This is verified by setting up various significance levels and averaging the error rates of multiple runs.


Table 5
Impact of data quality on the classifier accuracy.

OCR quality      CNN (%)   LSTM (%)   RMDL (%)   Total (%)
dense            90.1      85.3       87.4       26.5
partial          77.9      74.8       75.6       23.0
minimal          26.2      24.5       24.3       40.8
none             21.0      20.0       20.3       9.6
Entire dataset   51.8      47.6       46.2       100.0

As the number of experiments is limited, a slight variation in the errors can be expected, meaning that the resulting averages will only be approximately equal to the predefined significance levels.

The efficiency of the MCP is shown through the same experiments performed for the evaluation of the entire product verification system, as explained in the next subsection.

5.5.2. Evaluation of the Product Verification System
The final set of experiments is designed to test the performance of the product verification in situations where it is supposed to fail and to ensure that the system is designed properly.

To test the proposed Product Verification System (PVS), we consider the scenarios of barcode switching or miss-scan, where the consumer scans a product with an incorrect barcode. The underlying model of the PVS has learned the features of both the scanned product (true product) and the one to which the barcode belongs (false product), since both exist in the dataset used for training. This scenario is performed in the fifth experiment, which includes three test cases depending on the similarity between the scanned product and the one to which the barcode belongs: either they are clearly different, or they are clearly similar, or the similarity information is not considered.

In the first and second cases, with clearly different and clearly similar products, five sets of pairs have been selected for each as our test cases.

Regarding the last case, we randomly shuffled the data labels for the whole dataset and averaged the error rates of assigning every other barcode to each product. Furthermore, in the sixth experiment, we aim to evaluate the overall performance of the system on the entire dataset. For that purpose, we switch the barcodes of all pairwise product combinations and compute the detection rate of products with incorrect barcodes. The average of the resulting rates is taken as the overall detection efficiency.

6. Results

This section shows the results obtained in the experiments explained in Section 5.5 and performed on the classifier and the product verification.

6.1. Experimental evaluation of the classification component

6.1.1. Results of the first experiment - Classifier evaluation
Table 5 shows the results of the first experiment and, in particular, the accuracy of the three models used for the different data qualities. The fourth column presents the percentage of each data category in the entire dataset.

The optimal number of epochs is 12 for CNN, 20 for LSTM and 14 for RMDL when training on the entire dataset. The Adam optimiser is used to minimise the categorical cross-entropy during the training process. The hyper-parameters of the training process are selected using a Bayesian optimiser based on the Gaussian process. As shown in Table 5, the dataset having dense quality is classified accurately, with an accuracy of 90.1% using the CNN-based model. In contrast, the performance of the models in classifying the minimal and none quality datasets is lower, with best accuracies of only 26.2% and 21.0%, respectively. The overall best accuracy on the entire dataset (51.8%) is achieved using the CNN-based model.



Table 6
Impact of data sparsity on the classifier performance.

Accuracy (%)         CNN    LSTM   RMDL
No empty samples     72.6   69.3   70.4
With empty samples   51.8   47.6   46.2

Table 7
Classification performance using different embeddings.

Embedding input               Dimension   Tr. time (s)   Accuracy (%)
Word embedding                300         1430           71.5
Word embedding                100         862            71.3
Character embedding           100         960            58.1
Character embedding           50          540            57.9
Document embeddings           4000        374            47.5
Word + document embeddings    100–4000    1230           55.5
Word + character embeddings   100–50      1452           72.6

6.1.2. Results of the second experiment - Embeddings investigation
To further investigate the effect of data quality on the classification performance, the second experiment compares the results obtained with a cleaned dataset which does not include empty samples. Table 6 shows the impact of these samples on the model accuracy. If all empty samples are removed from the dataset, the overall accuracy increases by about 20 percentage points: the accuracy obtained using the CNN-based model becomes 72.6% instead of 51.8%. These comparisons give a panoramic view of the OCR-based solution's performance and of how much it can leverage from the collected image dataset. In the remaining experiments, the clean dataset that does not include any empty samples is used for training to increase performance.

In Table 7, the results of the experiment are shown. The table lists the size of the embeddings, the time spent in model training (Tr. time (s)), and the performance, measured by the accuracy metric, of all the evaluated setups.

The entire dataset and the CNN-based model are used to evaluatethe text representation methods. Using pre-trained GloVe embeddingsof size 100 shows the same results as using a higher dimension of 300with accuracies of 71.3% and 71.5%, respectively. Thus, for the sakeof low computational complexity in time and space, the embedding ofwords in 100-dimensional space is preferable. The document embed-dings obtained using Doc2Vec do not provide a good representation ofthe text dataset, and concatenating them with word embedding reducesthe classifier accuracy to 55.5%. On the other hand, when the word andcharacter embeddings are used together, the accuracy is higher thanthe accuracies achieved separately by the two. This can be explainedby the ability of the character level to deal efficiently with misspeltwords, symbols (existing, for example, in the nutrition facts metadata)or abbreviations. Thus, these encodings can be complementary to wordembeddings.
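For reference, a common way to use such pre-trained GloVe vectors is to load them into a fixed embedding matrix for a Keras Embedding layer, as sketched below. The file name, the small example vocabulary and the dimensions are illustrative assumptions, not the paper's exact setup.

import numpy as np
from tensorflow.keras.layers import Embedding

EMB_DIM = 100
word_index = {"milk": 1, "organic": 2, "banana": 3}   # hypothetical vocabulary mapping

# Fill an embedding matrix from a pre-trained GloVe file (assumed local path).
embedding_matrix = np.zeros((len(word_index) + 1, EMB_DIM))
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        values = line.split()
        word, vector = values[0], np.asarray(values[1:], dtype="float32")
        if word in word_index:
            embedding_matrix[word_index[word]] = vector

# Frozen embedding layer initialised with the pre-trained vectors.
embedding_layer = Embedding(len(word_index) + 1, EMB_DIM,
                            weights=[embedding_matrix], trainable=False)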

6.1.3. Results of the third experiment - Channel boosting
The third experiment evaluates the impact of channel boosting on the overall accuracy of the models. It investigates the possible advantages of feeding auxiliary features to the deep models. Three different cases are considered, as depicted in Table 8. ‘‘None’’ refers to the case where no feature is added as an auxiliary channel to the model's input. The text length is the number of characters in each text sample. In the last case, TF–IDF is used as a supplementary feature that quantifies the importance of a word in each product description. It is obtained by multiplying two metrics: the number of occurrences of a word in the data of each product and the inverse frequency of the word across the entire dataset.
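A compact way to obtain such a feature channel is scikit-learn's TfidfVectorizer, as sketched below. The example texts and the vocabulary cap are illustrative assumptions; in the real system the texts would come from the OCR component.

from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical OCR outputs standing in for real product texts.
train_texts = ["organic whole milk 1.5 l", "free range eggs 12 pack", "banana"]

vectorizer = TfidfVectorizer(max_features=5000)        # cap the vocabulary to keep the channel compact
tfidf_train = vectorizer.fit_transform(train_texts)    # term frequency weighted by inverse document frequency
# At inference time only transform() is applied, using the fitted vocabulary:
# tfidf_test = vectorizer.transform(test_texts)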

Table 8 shows that a simple feature like the text length does not provide good additional information. It slightly improves the accuracy of the models based on CNN and RMDL but negatively influences the performance of the LSTM-based model.

Table 8
Channel boosting using word embedding and extracted features.

Accuracy (%)   CNN    LSTM   RMDL
None           70.3   68.5   69.2
Text length    70.7   68.3   69.3
TF–IDF         72.6   69.3   70.4

Table 9
Average error rate of the MCP for different significance levels.

Significance         𝜖 = 0.1   𝜖 = 0.05   𝜖 = 0.01
Error rate average   0.095     0.046      0.009

On the other hand, advanced features like TF–IDF can more clearly improve the classification performance, increasing the model accuracy by 0.8–2.3 percentage points.

The best accuracy reported in Tables 6–8 is obtained using a CNN-based model fed with TF–IDF features together with word and character embeddings. The same model is used for the evaluation of the product verification system in the next section.

6.2. Experimental evaluation of product verification

6.2.1. Results of the fourth experiment - MCP validity
In this section, we investigate the validity and efficiency of the product verification system and show the results of the fourth experiment. The first objective is to ensure the exchangeability of the data. This is guaranteed through random shuffling of the data in 10 iterations and comparing the error rates obtained by the MCP. The second main objective is to determine whether the error rate of the MCP is well-calibrated, which means that it is less than or equal to the predefined significance level, i.e. the threshold. The design of the experiment is as follows: 70% of the text dataset is allocated to the proper training set and serves for training the underlying machine learning algorithm, while 30% is set aside as a calibration set. The test is made for significance levels of 0.01, 0.05 and 0.1. The average alarm rate across 20 repetitions is reported in Table 9 and shows that the error rate does not exceed the predefined significance level. Thus, the conformal predictor is well-calibrated or even slightly conservative.
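Such a calibration check can be expressed in a few lines, as sketched below: for each significance level, the empirical error rate is the fraction of test instances whose true class receives a p-value below that level, averaged over repeated shuffles. The run_mcp and shuffle helpers in the comment are placeholders for the procedure described above, not the authors' code.

import numpy as np

def empirical_error_rate(p_values_true_class, significance):
    """Fraction of test instances whose true class is rejected; for a valid
    (Mondrian) conformal predictor this should not exceed the significance level."""
    return float(np.mean(np.asarray(p_values_true_class) < significance))

# Averaged over repeated random shuffles, e.g. for the levels used in the paper:
# for eps in (0.01, 0.05, 0.1):
#     rates = [empirical_error_rate(run_mcp(shuffle(data)), eps) for _ in range(20)]
#     print(eps, np.mean(rates))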

6.2.2. Results of the fifth experiment - Target change/barcode switching
After showing that the system is practically valid, we turn to its efficiency. To test the performance and ability of the proposed product verification system (PVS) to detect when the consumer scans a product with an incorrect barcode, we consider three test cases depending on the similarity between the scanned product and the one to which the barcode belongs: either they are clearly different, or they are clearly similar, or the similarity information is not considered.

The calibration set should include a sufficient number of samples for each individual product to provide statistically accurate results. For this reason, we had to increase the size of the dataset, as explained in Section 5.

In the first two test cases, we manually selected similar and different products to match with each other. The goal is to assess the ability of the PVS to detect the inconsistency between the scanned product and the barcode attached to it in extreme cases. Table 10 shows the type of the scanned products and of those associated with the barcode, and gives the anomaly detection rates corresponding to the significance levels 0.01, 0.05 and 0.1. These rates measure the capacity of the product verification system to detect the anomalies under a given confidence level; each rate equals the percentage of scan mismatches identified as anomalous while permitting a false alarm rate equal to the predefined significance level.



Table 10
Verification rate of switching barcodes of similar products.

Type of articles        𝜖 = 0.01   𝜖 = 0.05   𝜖 = 0.1
Milk                    66.7       80.0       89.3
Banana/Organic banana   32.8       43.0       50.0
Carrots                 24.8       86.7       86.7
Packaged eggs           100.0      100.0      100.0
Butter                  40.0       55.3       60.0

Table 11
Verification rate of switching barcodes of clearly different products.

Type of articles         𝜖 = 0.01   𝜖 = 0.05   𝜖 = 0.1
Banana/cucumber          40.0       60.0       74.9
Mini-cucumbers/carrots   86.7       93.3       93.3
Yoghurt/lettuce          71.3       93.3       100.0
Cauliflower/grapes       0.0        40.0       50.0
Barbecue chicken/milk    100.0      100.0      100.0

Table 12
Anomaly detection rates per category.

Quality   𝜖 = 0.01   𝜖 = 0.05   𝜖 = 0.1
dense     52.1       78.1       88.7
partial   41.8       66.5       77.8
minimal   16.6       34.5       43.6
none      10.2       25.7       33.7

The results show that 66.7% of the mismatches between two similar milk products can be identified by the system at a false alarm rate of 1%. This number increases to 80.0% if we allow a false alarm rate of 5%, and to 89.3% with up to 10% false alarms. On the other hand, only 32.8% of mismatches between banana and organic banana items are identified with a margin of identification error of 1%, increasing to 43.0% (resp. 50.0%) if the margin increases to 5% (resp. 10%). A high mismatch detection rate can be achieved for some similar products, such as the packaged egg products, for which mismatches are identified at 100.0% with a 1% error margin.

On the other hand, Table 11 shows the system operation in the case of mismatching clearly different products, and the system detects such mismatches better. For mismatched mini-cucumbers/carrots, it detects 86.7% of the anomalies with a confidence of 99%, and it performs even better when the barcodes of barbecue chicken and milk are switched, detecting 100.0% of the cases. The system shows low performance in detecting the mismatch between cauliflower and grapes, as it detects only 50.0% of the cases with a 10% error margin.

6.2.3. Results of the sixth experiment - PVS efficiency assessment
The sixth and last experiment aims to quantify the overall product verification efficiency for all the products. For this purpose, we compute the anomaly detection rate when mismatching each product with all the others. This rate represents the ability of the system to discern each individual product from the others and to detect cases where it is switched with other products. This is of particular interest if one wants to know how easily a specific product (e.g., an expensive product) can be stolen by barcode falsification. Another advantage of these results is that they help determine appropriate significance levels for each product.

Table 12 summarises the results per quality category. Mismatches of products belonging to the dense category can be detected more easily than for the other categories, with an average of 78.1% at significance level 0.05. Then come the partial and minimal qualities, with averages of 66.5% and 34.5% for the same significance level. Finally, only 25.7% of mismatched scans can be detected for products having quality none.

Delving deeper into the information that can be revealed from the conformal prediction results, Table 13 shows the average rate of detecting mismatched products with different text density qualities. The rows represent the qualities of the actual products, and the columns represent the qualities of the false products, i.e., the products with switched barcodes.


Table 13
Anomaly detection rate for mismatching different categories (actual qualities lie in the rows and false qualities in the columns).

          𝜖 = 0.01                           𝜖 = 0.1
Quality   dense   partial   minimal   none   dense   partial   minimal   none
dense     87.6    71.8      28.3      4.9    99.1    96.0      88.9      50.1
partial   72.2    55.2      22.2      4.1    94.9    84.7      72.3      41.5
minimal   36.4    14.0      7.9       1.3    80.0    48.8      21.7      12.2
none      27.4    3.8       4.4       0.3    75.5    38.6      7.0       3.8

Thus, the table values can be interpreted as the average rate of detecting the mismatch of an actual product having the quality given in the row with a false product having the quality given in the column. For illustration, if the actual product has dense text quality, the average detection rate is 99.1% when the false product has the same text quality, and only 50.1% if the false product's quality is none, under the significance level 0.1. If the actual and false products both have dense or partial quality, the detection rate is high and lies in the interval [84.7, 99.1] at significance level 0.1.

The significance level is a threshold determining when an alarm is triggered. Thus, the significance level can be assigned according to the store's objectives. The confidence level is inversely related to the checkout error detection rate and directly related to the service quality: when it is decreased, the set of suspected behaviours is enlarged, which increases the anomaly detection rate, but the quality of service is reduced due to the increase in generated false alarms. The significance level can also be assigned so that products susceptible to being stolen, or expensive products, especially if they are not purchased often, receive higher significance levels (i.e., lower confidence levels), leading to more alarms and staff intervention.

The system is designed to work in the following way: in the self-checkout phase, the customer scans the purchased product, and an image of this product is taken and transmitted to the OCR reading component. The extracted text, which is part of the product description, is fed to the classifier, which provides probability estimates to the conformal predictor. The latter determines whether the checkout is flawless or not. The use of this system is therefore straightforward: an image of the product and its barcode are given as input, and an alarm is triggered only if malicious behaviour is detected. The significance level is the only parameter that needs to be set for the conformal predictor.
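Putting the pieces together, the end-to-end flow just described can be sketched as below, reusing the illustrative helpers from the earlier sketches (extract_text, preprocess, mondrian_p_value). The names classifier, vectorize_for_model and calibration_scores are likewise assumed placeholders rather than the authors' actual code.

def verify_checkout(image_bytes, scanned_barcode, significance=0.05):
    """One checkout event: OCR reading, text classification, Mondrian conformal
    p-value for the scanned barcode, and the alarm decision."""
    text = extract_text(image_bytes)                     # OCR reading component
    features = vectorize_for_model(preprocess(text))     # preprocessing + feature engineering
    proba = classifier.predict(features)[0]              # text classification component
    alpha = 1.0 - proba[scanned_barcode]                 # nonconformity of the scanned product
    p_value = mondrian_p_value(calibration_scores, alpha, scanned_barcode)
    return p_value < significance                        # True -> trigger an alarm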

7. Concluding discussion

In this work, a product verification system using OCR classification and Mondrian conformal prediction is proposed for the retail sector. By using image data from existing grocery stores, the system can detect anomalies with high performance, even when there is only partial text information on the products. This makes the system applicable for reducing shrinkage loss in grocery stores by identifying fraudulent techniques such as barcode switching and miss-scan. Additionally, OCR reading with NLP classification shows that it is in itself a powerful classifier of products.

To make this verification system more widely applicable, the effects of changing the resolution and position of the cameras should be considered to better capture images with text present. Other areas of improvement include a thorough analysis of how missing and misclassified words affect overall performance, how the size of the training dataset affects performance, and the effect of class imbalance together with the measures that can be taken to handle it. Using auxiliary features increased the accuracy of the models, which makes it compelling to test other types of features, such as text positioning.

The operation of the proposed system may face some challenges which require further investigation. We can mention, for instance, the change of product metadata and the introduction of new products to the store, which may require a long time to gather a sufficient number of images to create sufficiently large training and calibration sets.



Specific measures need to be developed to detect such emerging cases and deal with each of them properly. Another interesting aspect that could be explored is the inclusion of multilingual embeddings in the classifier input, since the metadata of some products is written in multiple languages.

CRediT authorship contribution statement

Rachid Oucheikh: Conceptualization, Visualization, Methodology, Software, Writing – review & editing. Tobias Pettersson: Data curation, Conceptualization, Methodology, Writing – review & editing. Tuwe Löfström: Conceptualization, Methodology, Supervision, Funding acquisition, Writing – review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work was supported by the Swedish Knowledge Foundation (DATAKIND 20190194) and by the Swedish Governmental Agency for Innovation Systems (Airflow 2018-03581).

References

Allahyari, M., Pouriyeh, S., Assefi, M., Safaei, S., Trippe, E. D., Gutierrez, J. B., & Kochut, K. (2017). Text summarization techniques: A brief survey. International Journal of Advanced Computer Science and Applications, 8(10), 398–405. http://dx.doi.org/10.14569/IJACSA.2017.081052.

Amazon.com (2018). Amazon Go. Retrieved from https://www.amazon.com/b?ie=UTF8&node=16008589011. (Accessed 15 December 2020).

Beck, A. (2018). Self-checkout in retail: Measuring the loss: Technical report, Efficient Consumer Response Community, http://dx.doi.org/10.13140/RG.2.2.14100.55686.

Beck, A., & Peacock, C. (2009). New loss prevention: Redefining shrinkage management. London: Palgrave Macmillan.

Beel, J., Gipp, B., Langer, S., & Breitinger, C. (2015). Research-paper recommender systems: A literature survey. International Journal on Digital Libraries, 17(4), 305–338. http://dx.doi.org/10.1007/s00799-015-0156-0.

Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5, 135–146. http://dx.doi.org/10.1162/tacl_a_00051.

Breuel, T. M. (2008). The OCRopus open source OCR system. In B. A. Yanikoglu, & K. Berkner (Eds.), Document recognition and retrieval XV (vol. 6815) (pp. 120–134). International Society for Optics and Photonics, SPIE, http://dx.doi.org/10.1117/12.783598.

Čakić, S., Popović, T., Šandi, S., Krčo, S., & Gazivoda, A. (2020). The use of Tesseract OCR number recognition for food tracking and tracing. In 24th international conference on information technology (pp. 1–4). http://dx.doi.org/10.1109/IT48810.2020.9070558.

Chandola, V., Banerjee, A., & Kumar, V. (2009). Anomaly detection: A survey. ACM Computing Surveys, 41(3), 1–58.

Chen, J., Hu, Y., Liu, J., Xiao, Y., & Jiang, H. (2019). Deep short text classification with knowledge powered attention. In Proceedings of the AAAI conference on artificial intelligence (vol. 33), no. 01 (pp. 6252–6259). http://dx.doi.org/10.1609/aaai.v33i01.33016252, URL: https://ojs.aaai.org/index.php/AAAI/article/view/4585.

Dai, A. M., Olah, C., & Le, Q. V. (2015). Document embedding with paragraph vectors. CoRR, abs/1507.07998, arXiv:1507.07998.

Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools and Applications, 78, 3797–3816. http://dx.doi.org/10.1007/s11042-018-6083-5.

Dos Santos, C. N., & Gatti, M. (2014). Deep convolutional neural networks for sentiment analysis of short texts. In 25th international conference on computational linguistics, proceedings of COLING 2014: Technical papers (pp. 69–78).

Everseen (2019). Everseen. Retrieved from https://everseen.com. (Accessed 15 December 2020).

Fuchs, K., Grundmann, T., & Fleisch, E. (2019). Towards identification of packaged products via computer vision: Convolutional neural networks for object detection and image classification in retail environments. In Proceedings of the 9th international conference on the internet of things. New York, NY, USA: Association for Computing Machinery, http://dx.doi.org/10.1145/3365871.3365899.


Google (2019). Google Vision API. Retrieved from https://cloud.google.com/vision. (Accessed 15 December 2020).

Gundimeda, V., Murali, R. S., Joseph, R., & Babu, N. T. N. (2018). An automated computer vision system for extraction of retail food product metadata. In Advances in intelligent systems and computing (pp. 199–216). Springer Singapore, http://dx.doi.org/10.1007/978-981-13-1580-0_20.

Gundimeda, V., Murali, R. S., Joseph, R., & Naresh Babu, N. T. (2019). An automated computer vision system for extraction of retail food product metadata. In R. S. Bapi, K. S. Rao, & M. V. N. K. Prasad (Eds.), First international conference on artificial intelligence and cognitive computing (pp. 199–216). Singapore: Springer Singapore, http://dx.doi.org/10.1007/978-981-13-1580-0_20.

ITAB (2019). EasyFlow/HyperFlow. Retrieved from https://itab.com/en/itab/checkout/self-checkouts/. (Accessed 15 December 2020).

Jaderberg, M., Simonyan, K., Vedaldi, A., & Zisserman, A. (2015). Reading text in the wild with convolutional neural networks. International Journal of Computer Vision, 116(1), 1–20. http://dx.doi.org/10.1007/s11263-015-0823-z.

Joseph, R., Naresh Babu, N. T., Murali, R. S., & Gundimeda, V. (2019). Automatic retail product image enhancement and background removal. In R. S. Bapi, K. S. Rao, & M. V. N. K. Prasad (Eds.), First international conference on artificial intelligence and cognitive computing (pp. 1–15). Springer Singapore.

Kiros, R., Zhu, Y., Salakhutdinov, R. R., Zemel, R., Urtasun, R., Torralba, A., & Fidler, S. (2015). Skip-thought vectors. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, & R. Garnett (Eds.), Advances in neural information processing systems (pp. 3294–3302). Curran Associates, Inc.

Kowsari, K., Heidarysafa, M., Brown, D. E., Meimandi, K. J., & Barnes, L. E. (2018). RMDL: Random multimodel deep learning for classification. In Proceedings of the 2nd international conference on information system and data mining (pp. 19–28). New York, NY, USA: Association for Computing Machinery, http://dx.doi.org/10.1145/3206098.3206111.

Lau, J. H., & Baldwin, T. (2016). An empirical evaluation of doc2vec with practical insights into document embedding generation. In Proceedings of the 1st workshop on representation learning for NLP (pp. 78–86). Berlin, Germany: Association for Computational Linguistics, http://dx.doi.org/10.18653/v1/W16-1609.

Laxhammar, R. (2014). Conformal anomaly detection: Detecting abnormal trajectories in surveillance applications (Ph.D. thesis), University of Skövde.

Le, Q., & Mikolov, T. (2014). Distributed representations of sentences and documents. In E. P. Xing, & T. Jebara (Eds.), Proceedings of the 31st international conference on international conference on machine learning (pp. 1188–1196). Beijing, China.

Lee, J. Y., & Dernoncourt, F. (2016). Sequential short-text classification with recurrent and convolutional neural networks. In Proceedings of conference of the North American chapter of the Association for Computational Linguistics: Human language technologies (pp. 515–520). http://dx.doi.org/10.18653/v1/n16-1062.

Liao, M., Shi, B., & Bai, X. (2018). TextBoxes++: A single-shot oriented scene text detector. IEEE Transactions on Image Processing, 27(8), 3676–3690.

Liu, Y., Chen, H., Shen, C., He, T., Jin, L., & Wang, L. (2020). ABCNet: Real-time scene text spotting with adaptive Bezier-curve network. In IEEE/CVF conference on computer vision and pattern recognition. IEEE, http://dx.doi.org/10.1109/cvpr42600.2020.00983.

Liu, X., Liang, D., Yan, S., Chen, D., Qiao, Y., & Yan, J. (2018). FOTS: Fast oriented text spotting with a unified network. In Proceedings of the IEEE computer society conference on computer vision and pattern recognition (pp. 5676–5685). http://dx.doi.org/10.1109/CVPR.2018.00595.

Löfström, T., Boström, H., Linusson, H., & Johansson, U. (2015). Bias reduction through conditional conformal prediction. Intelligent Data Analysis, 19(6), 1355–1375.

Long, S., He, X., & Yao, C. (2020). Scene text detection and recognition: The deep learning era. International Journal of Computer Vision, 129(1), 161–184. http://dx.doi.org/10.1007/s11263-020-01369-0.

Malong (2020). Malong Technologies. Retrieved from https://www.getretailai.com. (Accessed 15 December 2020).

Mathew, L., & Bindu, V. R. (2020). A review of natural language processing techniques for sentiment analysis using pre-trained models. In 4th international conference on computing methodologies and communication (pp. 340–345). IEEE, http://dx.doi.org/10.1109/ICCMC48092.2020.ICCMC-00064.

Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. In ICLR workshop papers (pp. 1–12). arXiv:1301.3781.

Mirończuk, M. M., & Protasiewicz, J. (2018). A recent overview of the state-of-the-art elements of text classification. Expert Systems with Applications, 106, 36–54. http://dx.doi.org/10.1016/j.eswa.2018.03.058.

Papadopoulos, H. (2008). Inductive conformal prediction: Theory and application to neural networks. Tools in Artificial Intelligence, 315–330. http://dx.doi.org/10.5772/6078.

Pascanu, R., Mikolov, T., & Bengio, Y. (2013). On the difficulty of training recurrent neural networks. In Proceedings of the 30th international conference on international conference on machine learning (pp. 1310–1318). JMLR, http://dx.doi.org/10.5555/3042817.3043083.

Pennington, J., Socher, R., & Manning, C. (2014). GloVe: Global vectors for word representation. In Proceedings of the conference on empirical methods in natural language processing (pp. 1532–1543). Doha, Qatar: Association for Computational Linguistics, http://dx.doi.org/10.3115/v1/D14-1162.


Perdikaki, O. (2009). Essays on retail operations (Ph.D. thesis), Kenan-Flagler Business School, http://dx.doi.org/10.17615/157y-zs47.

Taha, A., & Hadi, A. S. (2019). Anomaly detection methods for categorical data. ACM Computing Surveys, 52(2), 1–35. http://dx.doi.org/10.1145/3312739.

Taylor, E. (2016). Supermarket self-checkouts and retail theft: The curious case of the SWIPERS. Criminology & Criminal Justice, 16(5), 552–567.

Taylor, E. (2018). COPS and robbers: Customer operated payment systems, self-service checkout and the impact on retail crime. In Retail crime: International evidence and prevention (pp. 99–119). Springer International Publishing, http://dx.doi.org/10.1007/978-3-319-73065-3_5, chapter 5.

Vovk, V. (2013). Conditional validity of inductive conformal predictors. Machine Learning, 92(2–3), 349–376. http://dx.doi.org/10.1007/s10994-013-5355-6.

Vovk, V., Gammerman, A., & Shafer, G. (2005). Algorithmic learning in a random world. Springer Science & Business Media.

Wang, W., Cui, Y., Li, G., Jiang, C., & Deng, S. (2020). A self-attention-based destruction and construction learning fine-grained image classification method for retail product recognition. Neural Computing and Applications, 32(18), 14613–14622. http://dx.doi.org/10.1007/s00521-020-05148-3.


Wang, Y., Hou, Y., Che, W., & Liu, T. (2020). From static to dynamic word representations: A survey. International Journal of Machine Learning and Cybernetics, 11(7), 1611–1630. http://dx.doi.org/10.1007/s13042-020-01069-8.

Xing, L., Tian, Z., Huang, W., & Scott, M. R. (2019). Convolutional character networks. CoRR, abs/1910.07954, arXiv:1910.07954.

Xu, J., Cai, Y., Wu, X., Lei, X., Huang, Q., Leung, H.-f., & Li, Q. (2020). Incorporating context-relevant concepts into convolutional neural networks for short text classification. Neurocomputing, 386, 42–53. http://dx.doi.org/10.1016/j.neucom.2019.08.080.

Zhou, X., Yao, C., Wen, H., Wang, Y., Zhou, S., He, W., & Liang, J. (2017). EAST: An efficient and accurate scene text detector. In IEEE conference on computer vision and pattern recognition (pp. 2642–2651). http://dx.doi.org/10.1109/CVPR.2017.283.