VisionNLPCommend
level: CCF_A CVPR
author: Amaia Salvador (Facebook AI Research)
date: 2019
keyword:
- image understanding; information retrieval
Salvador, Amaia, et al. “Inverse cooking: Recipe generation from food images.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019.
Paper: Inverse Cooking
Summary
- introduce an inverse cooking system that recreates cooking recipes given food images.
- predicts ingredients as a set by means of a novel architecture that models their dependencies without imposing any order, then generates cooking instructions by attending to both the image and its inferred ingredients simultaneously (a minimal sketch of the set idea follows this summary).
- given the dataset's structure, the generated recipe consists of a title, ingredients, and cooking instructions, produced directly from the image.
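A minimal sketch of the order-free ingredient prediction idea, not the paper's exact set architecture: treat the ingredients as a multi-label target over the ingredient vocabulary, so no ordering is imposed. The class name `ImageToIngredientSet` and all sizes are illustrative assumptions.

```python
# Hedged sketch: ingredient prediction as an unordered set (multi-label) problem.
# The paper proposes a dedicated set-prediction architecture; this simpler head
# only illustrates the "no imposed order" aspect. Sizes and names are assumptions.
import torch
import torch.nn as nn

NUM_INGREDIENTS = 1500  # illustrative ingredient vocabulary size

class ImageToIngredientSet(nn.Module):
    def __init__(self, image_feat_dim=512, hidden_dim=512):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(image_feat_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, NUM_INGREDIENTS),
        )

    def forward(self, image_features):
        # image_features: (batch, image_feat_dim) pooled CNN features
        return self.head(image_features)  # one logit per ingredient

model = ImageToIngredientSet()
logits = model(torch.randn(4, 512))
targets = torch.randint(0, 2, (4, NUM_INGREDIENTS)).float()  # 1 = ingredient present
loss = nn.BCEWithLogitsLoss()(logits, targets)  # order-free set loss
```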
Research Objective
- Application Area: information retrieval, food image understanding and recommendation.
- Purpose: not only recognize the type of meal and its ingredients, but also understand its preparation process.
Problem Statement
- food and its components have high intra-class variability and undergo heavy deformations during the cooking process;
- ingredients are frequently occluded in a cooked dish and come in a variety of colors, forms and textures;
- visual ingredient detection requires high level reasoning and prior knowledge.
Previous work:
- Food Understanding: Food-101 and Recipe1M datasets, with a focus on image classification, estimating the number of calories given a food image, estimating food quantities, predicting the list of present ingredients, and finding the recipe for a given image.
- [34] provides a detailed cross-region analysis of food recipes, considering images, attributes and recipe ingredients.
Methods
- Problem Formulation: generating a cooking recipe (title, ingredients, and cooking instructions) directly from a food image.
- System overview:
【Module One】Generating recipes from images
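A hedged sketch of the generation step in Module One: a transformer decoder produces instruction words while cross-attending to a joint memory of image features and embedded ingredients. The paper studies several attention strategies; concatenating the two modalities, as done here, is only one option, and all sizes and names below are illustrative assumptions.

```python
# Sketch only: instruction decoding conditioned on image features + ingredients.
import torch
import torch.nn as nn

d_model, vocab_size, num_ingr = 512, 20000, 1500  # illustrative sizes, not the paper's

decoder_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=4)
word_emb = nn.Embedding(vocab_size, d_model)
ingr_emb = nn.Embedding(num_ingr, d_model)
out_proj = nn.Linear(d_model, vocab_size)

image_tokens = torch.randn(2, 49, d_model)          # e.g. a flattened 7x7 CNN feature map
ingredients = torch.randint(0, num_ingr, (2, 10))   # ids of the inferred ingredients
memory = torch.cat([image_tokens, ingr_emb(ingredients)], dim=1)  # joint conditioning

prev_words = torch.randint(0, vocab_size, (2, 20))  # instruction tokens generated so far
hidden = decoder(tgt=word_emb(prev_words), memory=memory)
next_word_logits = out_proj(hidden[:, -1])          # distribution over the next word
```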
Evaluation
Conclusion
- present an inverse cooking system, which generates cooking instructions conditioned on an image and its ingredients, exploring different attention strategies to reason about both modalities simultaneously.
- exhaustively study ingredients as both a list and a set, and propose a new architecture for ingredient prediction that exploits co-dependencies among ingredients without imposing order.
- show that ingredient prediction is indeed a difficult task, and demonstrate the superiority of the proposed system over image-to-recipe retrieval approaches.
Notes (to study further)
- did not finish the key parts of the paper and could not fully understand them.
level:
author: Valentin Gabeur, Chen Sun (Google Research)
date: 2020
keyword:
- retrieval, caption-to-video, video-to-caption
Gabeur, Valentin, Chen Sun, Karteek Alahari, and Cordelia Schmid. “Multi-modal Transformer for Video Retrieval.” arXiv preprint arXiv:2007.10639 (2020).
Paper: Multi-modal Transformer
Methods
- Problem Formulation:
- how to learn accurate representations of both caption and video on which to base the similarity estimation?
- video data varies in terms of appearance, motion, audio, overlaid text, speech, etc.;
- given a dataset of $n$ video-caption pairs $\{(v_1, c_1), \dots, (v_n, c_n)\}$, the goal is to learn a similarity function $s(v_i, c_j)$ between video $v_i$ and caption $c_j$ that is high if $i = j$ and low if $i \neq j$.
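A minimal sketch of this kind of objective: a generic bi-directional max-margin ranking loss, common in this line of retrieval work, not necessarily the authors' exact implementation. It assumes a precomputed similarity matrix `sim[i, j] = s(v_i, c_j)`.

```python
# Hedged sketch of a bi-directional max-margin ranking loss over a batch of pairs.
import torch

def bidirectional_margin_ranking_loss(sim, margin=0.2):
    # sim: (n, n) similarity matrix, sim[i, j] = s(v_i, c_j); diagonal = matching pairs.
    n = sim.size(0)
    pos = sim.diag().view(n, 1)
    cost_v2c = (margin + sim - pos).clamp(min=0)      # video i vs. mismatched captions
    cost_c2v = (margin + sim - pos.t()).clamp(min=0)  # caption j vs. mismatched videos
    mask = ~torch.eye(n, dtype=torch.bool)            # drop the positive (diagonal) terms
    return (cost_v2c[mask].sum() + cost_c2v[mask].sum()) / n

loss = bidirectional_margin_ranking_loss(torch.randn(8, 8))  # toy batch of 8 pairs
```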
- system overview:
【Module one】Video Representation $$ \Omega(v)=F(v)+E(v)+T(v) $$
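A small sketch of this input construction, assuming pre-extracted expert features: each feature $F(v)$ is summed with a learned expert embedding $E(v)$ and a temporal embedding $T(v)$ before entering the multi-modal transformer. The number of experts and all dimensions are illustrative.

```python
# Sketch of Omega(v) = F(v) + E(v) + T(v); sizes are assumptions, not the paper's config.
import torch
import torch.nn as nn

d_model, num_experts, max_time = 512, 7, 30       # e.g. appearance, motion, audio, OCR, speech...

expert_emb = nn.Embedding(num_experts, d_model)   # E(v): which expert produced the feature
time_emb = nn.Embedding(max_time + 1, d_model)    # T(v): when in the video it was extracted

features = torch.randn(16, d_model)               # F(v): 16 pre-extracted expert features
expert_ids = torch.randint(0, num_experts, (16,))
timestamps = torch.randint(0, max_time + 1, (16,))

omega = features + expert_emb(expert_ids) + time_emb(timestamps)  # transformer input
```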
【Module two】Caption Representation: obtain an embedding $h(c)$ of the caption, then project it with a function $g$ into $N$ different spaces, i.e. $\varphi = g \circ h$; $$ \varphi(c)=\{\varphi^i(c)\}_{i=1}^N $$
【Module three】Similarity estimation $$ s(v,c)=\sum_{i=1}^N w_i(c)\,\langle\varphi^i(c),\psi_{agg}^i(v)\rangle,\qquad w_i(c)=\frac{e^{h(c)^\top a_i}}{\sum^N_{j=1}e^{h(c)^\top a_j}} $$
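A hedged sketch of how the final score in Module three can be computed from the pieces above; the tensor names (`phi`, `psi_agg`, `h_c`, `a`) and sizes are illustrative, not the authors' code.

```python
# Sketch: similarity as a gated, weighted sum of per-expert dot products.
import torch
import torch.nn.functional as F

N, d, d_caption = 7, 512, 768                 # experts, embedding dim, caption encoder dim
phi = torch.randn(N, d)                        # phi^i(c): caption embedding per expert space
psi_agg = torch.randn(N, d)                    # psi_agg^i(v): aggregated video embedding per expert
h_c = torch.randn(d_caption)                   # h(c): pooled caption representation
a = torch.randn(N, d_caption)                  # gating vectors a_i (learnable in practice)

w = F.softmax(a @ h_c, dim=0)                  # w_i(c) = exp(h(c)^T a_i) / sum_j exp(h(c)^T a_j)
s = (w * (phi * psi_agg).sum(dim=-1)).sum()    # s(v,c) = sum_i w_i(c) <phi^i(c), psi_agg^i(v)>
```

The softmax gating lets the caption decide how much each expert (appearance, motion, audio, speech, ...) should contribute to the match.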
Notes (to study further)
- http://thoth.inrialpes.fr/research/MMT
- similarity learning [29]