VisionNLPCommend
level: CCF_A CVPR
author: Amaia Salvador (Facebook AI Research)
date: 2019
keyword:
- image understanding; information retrieval
Salvador, Amaia, et al. “Inverse cooking: Recipe generation from food images.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019.
Paper: Inverse Cooking
Summary
- introduce an inverse cooking system that recreates cooking recipes given food images.
- predicts ingredients as a set by means of a novel architecture that models their dependencies without imposing any order, then generates cooking instructions by attending to both the image and its inferred ingredients simultaneously (a minimal sketch of the set idea follows this summary).
- given the dataset's structure, the generated recipe consists of a title, ingredients, and cooking instructions, produced directly from the image.
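A minimal sketch of the order-free ingredient prediction idea, not the paper's exact set architecture: treat the ingredients as a multi-label target over the ingredient vocabulary, so no ordering is imposed. The class name `ImageToIngredientSet` and all sizes are illustrative assumptions.

```python
# Hedged sketch: ingredient prediction as an unordered set (multi-label) problem.
# The paper proposes a dedicated set-prediction architecture; this simpler head
# only illustrates the "no imposed order" aspect. Sizes and names are assumptions.
import torch
import torch.nn as nn

NUM_INGREDIENTS = 1500  # illustrative ingredient vocabulary size

class ImageToIngredientSet(nn.Module):
    def __init__(self, image_feat_dim=512, hidden_dim=512):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(image_feat_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, NUM_INGREDIENTS),
        )

    def forward(self, image_features):
        # image_features: (batch, image_feat_dim) pooled CNN features
        return self.head(image_features)  # one logit per ingredient

model = ImageToIngredientSet()
logits = model(torch.randn(4, 512))
targets = torch.randint(0, 2, (4, NUM_INGREDIENTS)).float()  # 1 = ingredient present
loss = nn.BCEWithLogitsLoss()(logits, targets)  # order-free set loss
```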
Research Objective
- Application Area: information retrieval, food image understanding and recommendation.
- Purpose: not only recognize the type of meal and its ingredients, but also understand its preparation process.
Problem Statement
- food and its components have high intra-class variability and undergo heavy deformations during the cooking process;
- ingredients are frequently occluded in a cooked dish and come in a variety of colors, forms and textures;
- visual ingredient detection requires high level reasoning and prior knowledge.
Previous work:
- Food Understanding: Food-101 and Recipe1M datasets, with a focus on image classification, estimating the number of calories given a food image, estimating food quantities, predicting the list of present ingredients, and finding the recipe for a given image.
- [34] provides a detailed cross-region analysis of food recipes, considering images, attributes and recipe ingredients.
Methods
- Problem Formulation: generating a cooking recipe (title, ingredients, and cooking instructions) directly from a food image.
- System overview:
【Module One】Generating recipes from images
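A hedged sketch of the generation step in Module One: a transformer decoder produces instruction words while cross-attending to a joint memory of image features and embedded ingredients. The paper studies several attention strategies; concatenating the two modalities, as done here, is only one option, and all sizes and names below are illustrative assumptions.

```python
# Sketch only: instruction decoding conditioned on image features + ingredients.
import torch
import torch.nn as nn

d_model, vocab_size, num_ingr = 512, 20000, 1500  # illustrative sizes, not the paper's

decoder_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=4)
word_emb = nn.Embedding(vocab_size, d_model)
ingr_emb = nn.Embedding(num_ingr, d_model)
out_proj = nn.Linear(d_model, vocab_size)

image_tokens = torch.randn(2, 49, d_model)          # e.g. a flattened 7x7 CNN feature map
ingredients = torch.randint(0, num_ingr, (2, 10))   # ids of the inferred ingredients
memory = torch.cat([image_tokens, ingr_emb(ingredients)], dim=1)  # joint conditioning

prev_words = torch.randint(0, vocab_size, (2, 20))  # instruction tokens generated so far
hidden = decoder(tgt=word_emb(prev_words), memory=memory)
next_word_logits = out_proj(hidden[:, -1])          # distribution over the next word
```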
Evaluation
Conclusion
- present an inverse cooking system, which generates cooking instructions conditioned on an image and its ingredients, exploring different attention strategies to reason about both modalities simultaneously.
- exhaustively study ingredients as both a list and a set, and propose a new architecture for ingredient prediction that exploits co-dependencies among ingredients without imposing order.
- show that ingredient prediction is indeed a difficult task, and demonstrate the superiority of the proposed system over image-to-recipe retrieval approaches.
Notes (to study further)
- did not finish the key parts of the paper and could not fully understand them.
level:
author: Valentin Gabeur, Chen Sun (Google Research)
date: 2020
keyword:
- retrieval, caption-to-video, video-to-caption
Gabeur, Valentin, Chen Sun, Karteek Alahari, and Cordelia Schmid. “Multi-modal Transformer for Video Retrieval.” arXiv preprint arXiv:2007.10639 (2020).
Paper: Multi-modal Transformer
Methods
- Problem Formulation:
- how to learn accurate representations of both caption and video on which to base the similarity estimation?
- video data varies in terms of appearance, motion, audio, overlaid text, speech, etc.;
- given a dataset of $n$ video-caption pairs $\{(v_1, c_1), \dots, (v_n, c_n)\}$, the goal is to learn a similarity function $s(v_i, c_j)$ between video $v_i$ and caption $c_j$ that is high if $i = j$ and low if $i \neq j$.
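A minimal sketch of this kind of objective: a generic bi-directional max-margin ranking loss, common in this line of retrieval work, not necessarily the authors' exact implementation. It assumes a precomputed similarity matrix `sim[i, j] = s(v_i, c_j)`.

```python
# Hedged sketch of a bi-directional max-margin ranking loss over a batch of pairs.
import torch

def bidirectional_margin_ranking_loss(sim, margin=0.2):
    # sim: (n, n) similarity matrix, sim[i, j] = s(v_i, c_j); diagonal = matching pairs.
    n = sim.size(0)
    pos = sim.diag().view(n, 1)
    cost_v2c = (margin + sim - pos).clamp(min=0)      # video i vs. mismatched captions
    cost_c2v = (margin + sim - pos.t()).clamp(min=0)  # caption j vs. mismatched videos
    mask = ~torch.eye(n, dtype=torch.bool)            # drop the positive (diagonal) terms
    return (cost_v2c[mask].sum() + cost_c2v[mask].sum()) / n

loss = bidirectional_margin_ranking_loss(torch.randn(8, 8))  # toy batch of 8 pairs
```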
- system overview:
【Module one】Video Representation $$ \Omega(v)=F(v)+E(v)+T(v) $$
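A small sketch of this input construction, assuming pre-extracted expert features: each feature $F(v)$ is summed with a learned expert embedding $E(v)$ and a temporal embedding $T(v)$ before entering the multi-modal transformer. The number of experts and all dimensions are illustrative.

```python
# Sketch of Omega(v) = F(v) + E(v) + T(v); sizes are assumptions, not the paper's config.
import torch
import torch.nn as nn

d_model, num_experts, max_time = 512, 7, 30       # e.g. appearance, motion, audio, OCR, speech...

expert_emb = nn.Embedding(num_experts, d_model)   # E(v): which expert produced the feature
time_emb = nn.Embedding(max_time + 1, d_model)    # T(v): when in the video it was extracted

features = torch.randn(16, d_model)               # F(v): 16 pre-extracted expert features
expert_ids = torch.randint(0, num_experts, (16,))
timestamps = torch.randint(0, max_time + 1, (16,))

omega = features + expert_emb(expert_ids) + time_emb(timestamps)  # transformer input
```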
【Module two】Caption Representation: obtain an embedding $h(c)$ of the caption, then project it with a function $g$ into $N$ different spaces, i.e. $\varphi = g \circ h$; $$ \varphi(c)=\{\varphi^i(c)\}_{i=1}^N $$
【Module three】Similarity estimation $$ s(v,c)=\sum_{i=1}^N w_i(c)\,\langle\varphi^i(c),\psi_{agg}^i(v)\rangle,\qquad w_i(c)=\frac{e^{h(c)^\top a_i}}{\sum^N_{j=1}e^{h(c)^\top a_j}} $$
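A hedged sketch of how the final score in Module three can be computed from the pieces above; the tensor names (`phi`, `psi_agg`, `h_c`, `a`) and sizes are illustrative, not the authors' code.

```python
# Sketch: similarity as a gated, weighted sum of per-expert dot products.
import torch
import torch.nn.functional as F

N, d, d_caption = 7, 512, 768                 # experts, embedding dim, caption encoder dim
phi = torch.randn(N, d)                        # phi^i(c): caption embedding per expert space
psi_agg = torch.randn(N, d)                    # psi_agg^i(v): aggregated video embedding per expert
h_c = torch.randn(d_caption)                   # h(c): pooled caption representation
a = torch.randn(N, d_caption)                  # gating vectors a_i (learnable in practice)

w = F.softmax(a @ h_c, dim=0)                  # w_i(c) = exp(h(c)^T a_i) / sum_j exp(h(c)^T a_j)
s = (w * (phi * psi_agg).sum(dim=-1)).sum()    # s(v,c) = sum_i w_i(c) <phi^i(c), psi_agg^i(v)>
```

The softmax gating lets the caption decide how much each expert (appearance, motion, audio, speech, ...) should contribute to the match.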
Notes (to study further)
- http://thoth.inrialpes.fr/research/MMT
- similarity learning [29]