The new Artificial Intelligence Formula: Image + Metadata

Antonio Rubio, Senior Computer Vision Researcher at Wide Eyes Technologies and Research Scientist at the Institut de Robòtica i Informàtica Industrial (CSIC-UPC), shares in this article the latest research highlights on how image recognition technology is reshaping the fashion industry by using deep learning to address two fashion challenges: multi-modal product retrieval and main product detection.

To this end, artificial intelligence is used in a new multi-modal information fusion technology that automatically combines the power of computer vision with metadata. This fashion discovery research won the Best Paper Award at the Computer Vision of Fashion Workshop at ICCV 2017, a top worldwide event in computer vision.

Learning to use multi-modal information

BY ANTONIO RUBIO, Senior CV Researcher at Wide Eyes Technologies & Research Scientist at CSIC-UPC

At Wide Eyes Technologies we offer several visual solutions for fashion retailers, such as similar recommendation, shop the look, shop by image, attribute discovery, style advisor and image classification. All these methods rely on artificial intelligence systems trained with huge amounts of data from the fashion industry.

But one year ago we realized that an important source of information was being completely ignored: text. Many of the images in our dataset are accompanied by textual information, such as tags, brief descriptions or user comments, and we were not using any of that data at the time.

Hence, we came up with a way to incorporate text data into our networks in order to solve two different problems:

  • Multi-modal product retrieval: searching for a product using either images or texts as the query, and obtaining either images or texts as results.
  • Main product detection: given an image in which several products may appear, together with the textual metadata of the product being sold, identifying that product in the image.

This article explains how we tackled these problems. For more information on each of the solutions, please take a look at our papers: Multi-modal joint embedding for fashion product retrieval [1] and Multi-Modal Embedding for Main Product Detection in Fashion [2].


[1] Dealing with text

Cleaning the dataset

The first task consisted of taking a close look at our data, which comes from very different sources, and unifying all the available textual information. We found essentially two types of textual information: descriptions and series of tags.

Series of tags look like this:

Item Type: Suits // Model Number: Suits,Men Suits,903C1220 // Front Style: Pleated // Pant Closure Type: Zipper Fly // Closure Type: Single Breasted // Fit Type: Straight // Clothing Length: Regular // Material: Polyester // For: Men, Male, Man, Groom, Gentleman, Boyfriend, Him, Husband // Fabric Type: Broadcloth // Clothing length: Regular // Front Style: Flat // Style: Casual, Fashion, Luxury, High Top Quality, Latest Design, Vintage, Handmade // Collar: Mandarin Collar, Chinese Style, Stand Collar // Pattern Type: Solid // is_customized: Yes // Occasion: Wedding, Prom, Party,Formal, Business, Stage Clothes, Anniversary  // Color: Black // Gender: Men

Descriptions, as you can imagine, consist of longer sentences with some grammatical structure, for instance:

Satisfy your summer needs. This is an easy fitting, straight leg pant with pockets, belt loops and center front closure.

Nonetheless, not all descriptions look that good. There are also lots of noisy descriptions filled with many almost irrelevant words just to maximize the probability of the product being found by textual search, such as:

Seven7 Brand Men Suits Chinese Mandarin Collar Male Suit Slim Fit Blazer Wedding Terno Tuxedo 2 Piece (Jacket and Pant) [+info]

So the first job consisted of filtering and unifying our dataset, which left us with approximately half a million products with valid textual metadata for training and testing the proposed algorithms. Each product is classified into one of the following 32 fashion categories:

vest, hats, boots, polo, jewelry, skirt, clutch/wallet, cardigan, shirt, dress, backpack, swimwear, suits, travel bags, glasses/sunglasses, pants/leggings, flats, shorts, coat/cape, tops, pump/wedge, sweatshirt/hoodie, sandals, crossbody/messenger bag, blazer, top handles, belts, jacket, other accessories, jumpsuits, sweater and joggers.

How to represent texts?

Once the texts are clean, we want to feed them into a neural network. To do that, we need to generate vectors representing the texts, and we tried three different methods for this. Before generating these descriptors, the texts go through a pre-processing step consisting of:

  1. Deleting numbers and punctuation marks.
  2. Switching all characters to lowercase.
  3. Lemmatization: defined as “grouping together the inflected forms of a word so they can be analysed as a single item”.
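The three pre-processing steps above can be sketched in a few lines of Python. This is only an illustration: the lemma table TOY_LEMMAS is a hypothetical stand-in for a real lemmatizer, the function name is ours, and the sample sentence is a slightly modified version of the description shown earlier.

```python
import re
import string

# Hypothetical, tiny lemma table used instead of a real lemmatizer (illustration only).
TOY_LEMMAS = {"needs": "need", "fitting": "fit", "pockets": "pocket", "loops": "loop"}

def preprocess(text, lemmas=TOY_LEMMAS):
    """Return the list of cleaned, lowercased, lemmatized tokens of a text."""
    # 1. Delete numbers and punctuation marks (replaced by spaces so words do not merge).
    text = re.sub(r"[0-9]", " ", text)
    text = text.translate(str.maketrans(string.punctuation, " " * len(string.punctuation)))
    # 2. Switch all characters to lowercase.
    text = text.lower()
    # 3. Lemmatize: map each inflected form to its base form when the table knows it.
    return [lemmas.get(token, token) for token in text.split()]

tokens = preprocess("Satisfy your summer needs. This is an easy fitting pant with 2 pockets.")
print(tokens)
```

In a real pipeline the lookup table would be replaced by a proper lemmatizer; the order of the steps is the part that mirrors the list above.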

After these steps, we generate the descriptors with one of the following approaches:

  • Bag of words: we build a Bag of Words model from our fashion texts, discarding infrequent words. The result is a descriptor vector whose length equals the size of the dictionary.
  • word2vec [3]: this neural network-based distributed representation lets us choose the dimension of the output descriptor. We experimented with 500, 300 and 100 dimensions.
  • doc2vec [4]: an evolution of the previous approach that takes a broader context into account when generating the descriptors.
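As a rough illustration of the first option, the sketch below builds a Bag of Words dictionary while discarding infrequent words and turns a tokenized text into a count vector. The threshold min_count and the toy documents are assumptions made for the example, not the values used in our experiments.

```python
from collections import Counter

def build_vocabulary(token_lists, min_count=2):
    """Map each sufficiently frequent word to a position in the descriptor."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    kept = sorted(token for token, count in counts.items() if count >= min_count)
    return {token: index for index, token in enumerate(kept)}

def bow_vector(tokens, vocabulary):
    """Count how often each dictionary word occurs; unknown words are ignored."""
    vector = [0] * len(vocabulary)
    for token in tokens:
        index = vocabulary.get(token)
        if index is not None:
            vector[index] += 1
    return vector

# Toy, already pre-processed documents (hypothetical data).
docs = [
    ["black", "suit", "slim", "fit"],
    ["black", "dress", "party"],
    ["slim", "fit", "shirt", "black"],
]
vocabulary = build_vocabulary(docs, min_count=2)
vectors = [bow_vector(doc, vocabulary) for doc in docs]
print(vocabulary, vectors)
```

Every descriptor has exactly as many entries as the dictionary, which is why this representation grows with the vocabulary while word2vec and doc2vec keep a fixed, chosen dimension.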

We also considered using GloVe [5] if the previous approaches didn’t provide good enough results, but it was not necessary.

To choose one of the methods, we ran a quick classification test using descriptors generated by each of the three methods and a small neural network with a few fully connected layers. Distributed representations (word2vec and doc2vec) achieved higher accuracy than Bag of Words. We chose word2vec because its difference from doc2vec was negligible and it was faster to compute.

We also decided to take a look at the word2vec model, choosing one word and selecting the most similar words according to its word2vec representation, and the results were surprisingly good. Take a look at the following results, taking into account that the model was exclusively trained with fashion texts:

Here we see that the most similar words also refer to shirts, and the rest of the results are somewhat related to shirts. What happens if we input a more abstract concept, such as a color name?

The results are other color names, and the most similar ones seem to be colors related to blue. But what happens if we play around with words that are not related to fashion? The results still make sense:

Turns out our model learned relationships between words referring to superheroes!
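The nearest-word queries described above can be sketched as a ranking by similarity between word vectors. The example assumes cosine similarity, which is the usual choice for word2vec vectors, and uses hand-made three-dimensional toy vectors; they are not taken from our trained model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(word, embeddings, top_n=3):
    """Return the top_n other words ranked by similarity to the given word."""
    query = embeddings[word]
    scored = [
        (other, cosine_similarity(query, vector))
        for other, vector in embeddings.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

# Hypothetical toy embeddings: color words point one way, garment words another.
toy_embeddings = {
    "blue": [0.9, 0.1, 0.0],
    "navy": [0.8, 0.2, 0.1],
    "turquoise": [0.7, 0.3, 0.0],
    "shirt": [0.1, 0.9, 0.2],
    "blouse": [0.0, 0.8, 0.3],
}
neighbours = most_similar("blue", toy_embeddings, top_n=2)
print(neighbours)
```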

Considering that the training included only fashion descriptions, we are pretty satisfied with these results and therefore ready to move on to the next step: once we know how to input our textual metadata into a neural network, let’s focus on how we solve the two aforementioned problems.


[2] Multi-Modal Product Retrieval

We wanted to be able to search a catalogue using either images or texts as queries, and to retrieve either images or texts. We trained a neural network to produce 128-dimensional representations of the products’ images and textual descriptions that lie in the same space, so that we can compare them with simple Euclidean distances.

We tackle this problem with a two-branch neural network, as seen in the figure:

We feed the images forward through an AlexNet-like CNN, and the word2vec representations of the texts through four fully connected layers (ReLU and Batch Normalization layers are present in the model but omitted for clarity). Both branches are connected via several fully connected layers to a contrastive loss function [6].

This loss function receives two vectors and a similarity label as inputs. It reduces the distance between the vectors if they represent similar objects, and increases it otherwise (as long as it is smaller than a specific margin value).

Our contrastive loss function follows the definition in [6], with m denoting the margin value; the full equations are given in our paper [1].
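As a minimal sketch of this behaviour, the function below uses the standard contrastive loss of [6] without its constant factors: the squared distance for similar pairs, and the squared amount by which the distance falls short of the margin m for dissimilar pairs. The function names and the default margin are assumptions made for the example and may differ in detail from the formulation in our paper.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two embedded descriptors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def contrastive_loss(v_image, v_text, similar, margin=1.0):
    """Standard contrastive loss (constant factors omitted).

    similar=True : penalize any distance between the two vectors.
    similar=False: penalize only distances smaller than the margin.
    """
    distance = euclidean_distance(v_image, v_text)
    if similar:
        return distance ** 2
    return max(0.0, margin - distance) ** 2

# A matching pair that is 5 units apart is penalized heavily...
print(contrastive_loss([0, 0], [3, 4], similar=True))
# ...while a non-matching pair is only penalized if closer than the margin.
print(contrastive_loss([0, 0], [3, 4], similar=False, margin=6.0))
print(contrastive_loss([0, 0], [3, 4], similar=False, margin=2.0))
```

Once a dissimilar pair is farther apart than the margin its loss is zero, so training effort concentrates on the negative pairs that are still too close.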


But, as you probably noticed, there’s another loss function in the network diagram, called classification loss. This is just a classical cross-entropy loss with 32 fashion categories that we added so that it encourages the embedding to have semantic meaning. There’s one for texts and one for images. We combine the contrastive loss with image and text classification losses as a weighted sum:
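A plausible way to write that weighted sum is shown below. The symbols are illustrative assumptions: L_C stands for the contrastive loss, L_{X_I} and L_{X_T} for the image and text cross-entropy losses, and α, β, γ for weighting hyperparameters; the notation in our paper may differ.

```latex
L = \alpha \, L_C + \beta \, L_{X_I} + \gamma \, L_{X_T}
```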

With this global loss function, we train our model using almost half a million images and texts coming from our database (formed by products from our clients and products crawled from the internet) to learn the embedded descriptors. Positive pairs are formed with corresponding texts and images, and negative pairs are formed using texts and images from different categories. You can take a look at our paper [1] to check more details of the dataset and the training process.
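The pair construction just described can be sketched as follows. The field names, the number of negatives per product and the random sampling are assumptions made for the example, and the three products are made-up data.

```python
import random

def build_pairs(products, negatives_per_product=1, seed=0):
    """Return (image, text, label) triplets: label 1 for positive pairs, 0 for negative ones."""
    rng = random.Random(seed)
    pairs = []
    for product in products:
        # Positive pair: an image with its own text.
        pairs.append((product["image"], product["text"], 1))
        # Negative pairs: the same image with texts of products from other categories.
        candidates = [p for p in products if p["category"] != product["category"]]
        count = min(negatives_per_product, len(candidates))
        for other in rng.sample(candidates, count):
            pairs.append((product["image"], other["text"], 0))
    return pairs

# Hypothetical toy catalogue.
products = [
    {"image": "img_suit", "text": "black slim suit", "category": "suits"},
    {"image": "img_dress", "text": "red party dress", "category": "dress"},
    {"image": "img_boots", "text": "leather ankle boots", "category": "boots"},
]
pairs = build_pairs(products, negatives_per_product=1)
for image, text, label in pairs:
    print(label, image, "|", text)
```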

The results we get from all this process are evaluated as follows (and compared against KCCA and against a simpler version of our method using Bag of Words instead of word2vec):

  • First, we compute all the 128-d descriptors for images and texts in our test dataset, and use them to perform two tasks: image retrieval with text queries, and text retrieval with image queries.
  • We sort all the retrieved products by their distance to the query and check the position (rank) of the correct result. The median of all these ranks is the median rank, which we express as a percentage of the test set size. Since there are many similar items in our dataset, the first results might be items that are not the exact match but very similar ones, so we only check whether the correct one is among the top results. As the table shows, with our method the exact match is normally in the top 2% of the results, which we consider a good result given almost 130K test products.
  • We also check in how many cases the correct match is among the top 5% or 10% of the results (recall@K). With our method, in almost 80% of the cases the exact match is among the top 5% of retrieved items.
  • Since we’re using classification to help the creation of the embedding, we also check the classification accuracy, which is very high even considering that it is not the main task and can be negatively influenced by the contrastive loss gradient.
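The median rank and recall@K measures above can be sketched as follows. The code assumes that query i matches candidate i and that ties are resolved in favour of the correct item; the four two-dimensional vectors are toy data standing in for the 128-d descriptors.

```python
import math
from statistics import median

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank_of_match(query, candidates, correct_index):
    """1-based position of the correct candidate when sorting by distance to the query."""
    target = euclidean_distance(query, candidates[correct_index])
    closer = sum(1 for c in candidates if euclidean_distance(query, c) < target)
    return 1 + closer

def evaluate(queries, candidates, top_percent=5.0):
    """Return (median rank as % of the test set, recall within the top top_percent %)."""
    total = len(candidates)
    ranks = [rank_of_match(q, candidates, i) for i, q in enumerate(queries)]
    median_rank_percent = 100.0 * median(ranks) / total
    cutoff = math.ceil(total * top_percent / 100.0)
    recall = sum(1 for r in ranks if r <= cutoff) / len(ranks)
    return median_rank_percent, recall

# Toy embedded descriptors: text queries and their corresponding images.
text_vectors = [[0, 0], [1, 0], [0, 1], [5, 5]]
image_vectors = [[0.1, 0], [0.9, 0], [4, 4], [5, 5.1]]
median_rank_percent, recall = evaluate(text_vectors, image_vectors, top_percent=50.0)
print(median_rank_percent, recall)
```

Swapping the two lists evaluates the opposite direction, text retrieval with image queries.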

To get an idea of how well our contrastive loss learns to separate positive and negative pairs, we check the mean difference between the distances of positive pairs and those of negative pairs (we want this difference to be as large as possible). Again, with our method, positive and negative pairs are more separated.

Here are some qualitative results of the method (text query and closest images):


We also studied the influence of the different text fields on the result. The fields “title” and “description” turn out to be the most relevant to the method, which makes sense since they are usually the most closely related to the product’s main features. We also see that there is a trade-off between classification and the distance between pairs in the embedding. If we don’t use the classification loss, the difference between pair distances increases, but the resulting embedding space has less semantic meaning (lower classification accuracy, and products of the same category aren’t grouped as well as with the classification loss). On the other hand, using both loss types slightly reduces the difference between positive and negative pairs while increasing the classification accuracy by ~10%.


[3] Main Product Detection

Once we know how to create multi-modal embeddings, we can use them for different tasks. In this case, we observed that many of our images contain not only the main product (the one being sold) but also the rest of the garments worn by the model, and we thought: why not use text to find the main product in these images? And so we did. The goal of the paper can be summed up in the following animation:

In the pursuit of this goal, we went through four phases (the loss functions and the text part are exactly the same as before, and remain unchanged for all the approaches):

  • Full images: we build a network like the one in the previous paper: two branches, for texts and images respectively, joined by a contrastive loss function and with their own classification losses. We train the network with full images. In the evaluation phase, we obtain the embedded descriptors of several bounding boxes of an image and compute their distance to the embedded descriptor of the text describing the main product (A in the network scheme).
  • Bounding boxes: since we have to compute the bounding boxes for the evaluation phase anyway, why not train with them? We extract around 300 bounding boxes per image with the Edge Boxes method of Zitnick and Dollár [7] and use them as inputs. We form positive pairs using bounding boxes that have a large overlap with the ground truth, and negative pairs using bounding boxes whose overlap with the ground truth is very low, as well as bounding boxes of items from a different category. The network is the same as before; only the image inputs are different (B in the network scheme).
  • Bounding box descriptors: backpropagating through hundreds of bounding boxes per image is costly, so we avoid it by freezing the convolutional layers of the network. We use weights pretrained on fashion images and RoI Pooling to extract a descriptor for each bounding box, and feed these descriptors to a few fully connected layers that end in our loss functions (C in the network scheme).
  • Overlap loss: since we are trying to maximize the overlap between a predicted bounding box and the ground truth bounding box (the main product), we decided to add another term to the loss function to predict the overlap of a bounding box with the ground truth. This term is an L1 regression loss, added to the previous global loss function with a weighting hyperparameter. With this loss we achieve a 6% error in the overlap prediction, which is very good, but it doesn’t help much with the main product detection task. This loss is omitted from the scheme for clarity.
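The overlap-based pair labelling and the overlap regression term described in the list above can be sketched as follows. The example assumes that overlap is measured as intersection over union (IoU) and uses hypothetical thresholds of 0.7 and 0.3; boxes are (x1, y1, x2, y2) tuples.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

def label_box(box, ground_truth, pos_threshold=0.7, neg_threshold=0.3):
    """1 for a positive pair, 0 for a negative pair, None if the box is ambiguous."""
    overlap = iou(box, ground_truth)
    if overlap >= pos_threshold:
        return 1
    if overlap <= neg_threshold:
        return 0
    return None  # neither clearly positive nor clearly negative: not used for pairs

def overlap_l1_loss(predicted_overlap, box, ground_truth):
    """L1 regression error between the predicted and the actual overlap."""
    return abs(predicted_overlap - iou(box, ground_truth))

ground_truth = (0, 0, 10, 10)
print(label_box((0, 0, 10, 10), ground_truth))    # identical box
print(label_box((20, 20, 30, 30), ground_truth))  # disjoint box
print(label_box((5, 0, 15, 10), ground_truth))    # partial overlap
```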

The evaluation process followed for this task is shown in the following video. Here are the quantitative results, evaluated on 3,000 images from an e-commerce site that was never seen in training:

As we see, the RoI pooling approach gives the best results: in 80% of the cases it detects a valid bounding box among its top 3 out of 300 bounding boxes.
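A top-k measure of this kind can be sketched as below: the boxes of each image are ranked by the distance between their embedded descriptor and the text descriptor, and the image counts as a hit if any of the k closest boxes is valid. Treating a box as valid when its overlap with the ground truth reaches min_overlap is an assumption of the example, as are the field names and the toy data.

```python
import math

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def top_k_hit(text_vec, box_vecs, box_overlaps, k=3, min_overlap=0.5):
    """True if one of the k boxes closest to the text embedding is a valid detection."""
    order = sorted(
        range(len(box_vecs)),
        key=lambda i: euclidean_distance(text_vec, box_vecs[i]),
    )
    return any(box_overlaps[i] >= min_overlap for i in order[:k])

def top_k_accuracy(samples, k=3, min_overlap=0.5):
    """Fraction of images with a valid box among their k closest boxes."""
    hits = sum(
        top_k_hit(s["text_vec"], s["box_vecs"], s["box_overlaps"], k, min_overlap)
        for s in samples
    )
    return hits / len(samples)

# Two toy images with three candidate boxes each (hypothetical embeddings and overlaps).
samples = [
    {"text_vec": [0, 0], "box_vecs": [[0.1, 0], [2, 2], [3, 3]], "box_overlaps": [0.8, 0.1, 0.2]},
    {"text_vec": [1, 1], "box_vecs": [[5, 5], [1, 1.1], [0, 0]], "box_overlaps": [0.9, 0.2, 0.1]},
]
print(top_k_accuracy(samples, k=1), top_k_accuracy(samples, k=3))
```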

From a qualitative point of view, we can take a look at the resulting embedding:

And at some particular results of the main product detection task:

We conclude this post by remarking that images aren’t the only source of information in artificial intelligence applications for fashion. Textual information is very important and can be quite useful for many tasks.

[1] Rubio, A., Yu, L., Simo-Serra, E., & Moreno-Noguer, F. (2017, April). Multi-Modal Fashion Product Retrieval. In The 6th Workshop on Vision and Language (p. 43).
[2] Rubio, A., Yu, L., Simo-Serra, E., Moreno-Noguer, F., Song, Y., Li, Y., … & Takagi, M. (2017). Multi-Modal Embedding for Main Product Detection in Fashion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2236-2242).
[3] Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111-3119).
[4] Le, Q., & Mikolov, T. (2014). Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning (ICML-14) (pp. 1188-1196).
[5] Pennington, J., Socher, R., & Manning, C. (2014). GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 1532-1543).
[6] Hadsell, R., Chopra, S., & LeCun, Y. (2006). Dimensionality reduction by learning an invariant mapping. In Computer vision and pattern recognition, 2006 IEEE computer society conference on (Vol. 2, pp. 1735-1742). IEEE.
[7] Zitnick, C. L., & Dollár, P. (2014, September). Edge boxes: Locating object proposals from edges. In European Conference on Computer Vision (pp. 391-405). Springer, Cham.
