Video_Understanding

Security and surveillance: face recognition, behavior recognition, motion tracking, crowd analysis, and so on. By deploying video monitoring at precisely positioned checkpoints, anomalies in the monitored area can be identified automatically, e.g., real-time matching of faces in live video against a blacklist, multi-view collaborative analysis of movement trajectories, and retrieval of key targets after the video data has been structured;

Internet entertainment: photo enhancement, video enhancement, real-time portrait beautification, AR effects, custom backgrounds, and so on, enriching live-streaming, short-video, and other internet entertainment applications;

Financial identity authentication: various face-based financial applications such as remote account opening, payment, and cash withdrawal;

Unmanned stores and advertising/marketing: offline retail, product recognition, AR-enhanced advertising, and so on;

Machine vision for industry: item sorting, defect inspection, and so on, usually combining automatic image analysis with optical imaging and other techniques;

Control of drones and autonomous vehicles: visual navigation, pedestrian analysis, obstacle detection, and so on; vision usually serves as one sensor that is fused with LiDAR, millimeter-wave radar, infrared sensors, and inertial measurement units to produce information for autonomous decision-making;

0. Video Understanding Tasks

  • Task 1: Untrimmed Video Classification. This is somewhat similar to image classification. An untrimmed video is usually long and contains multiple actions, many of which may not be of interest. The task is to analyze the whole long video globally and softly assign it to multiple categories.
  • Task 2: Trimmed Action Recognition. This has been studied in computer vision for many years: given a trimmed video containing a single action, classify the video.
  • Task 3: Temporal Action Proposal. Analogous to candidate-box extraction in image object detection: a long video usually contains many actions, and the task is to find the video segments that may contain an action.
  • Task 4: Temporal Action Localization. Compared with temporal action proposal, temporal action localization corresponds to what we usually call detection: find the segments that may contain an action and classify each segment (matches are typically scored with temporal IoU; see the sketch after this list).
  • Task 5: Dense-Captioning Events. It is called dense captioning because the task requires describing video behavior on top of temporal action localization (detection): an untrimmed video is first temporally localized into many action-containing segments, and each segment is then described, e.g., "man playing a piano".
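Temporal proposals and localizations are commonly scored against ground truth with temporal IoU. A minimal sketch (the function name and the example numbers are mine, not from any specific benchmark):

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two segments given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# Example: a proposal [12.0, 20.0] vs. a ground-truth action [15.0, 25.0]
print(temporal_iou((12.0, 20.0), (15.0, 25.0)))  # ~0.385
```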

1. Sign Language Papers

Industry:

Tencent YouTu Lab AI sign language recognition: https://www.jiqizhixin.com/articles/2019-05-16-16

USTC and Microsoft released a Kinect-based sign language translation system; the University of California once released a sign-language-recognition glove.

Potential demand analysis

1. The hearing-impaired population is large. According to the latest WHO data [1], about 466 million people worldwide have disabling hearing loss, more than 5% of the world's population, and it is estimated that by 2050 over 900 million people (about one in ten) will have disabling hearing loss. According to data published by the Beijing Hearing Association in 2017, China is estimated to have 72 million people with disabling hearing impairment [2].

  1. Accessibility penetration still needs to improve, and the needs of the hearing-impaired are often ignored

  2. Provide a two-way translator compatible with sign languages worldwide, or at least a simple recognizer

  • It would immediately give tens of millions of deaf and mute people more control over computers
  • It could be combined with IFTTT and smart-home controllers such as Google Home
  • It could well grow into an industry around dedicated embedded hardware
Problems

1. Automatically distinguish the various hand shapes and movements in a sign language expression and the transitions between them, and finally translate the expression into text. Traditional methods usually design hand-crafted features for a specific dataset and then use these features to classify movements and hand shapes. Limited by manual feature design and dataset size, such methods have poor adaptability, generalization, and robustness.

Use the multiple sensors of a Kinect camera to obtain the signer's body joints in advance; alternatively, use sensor gloves or wristbands equipped with EMG and IMU sensors to capture arm and hand activity.

level: CVPR CCF_A author:Junfu Pu CAS Key Laboratory of GIPAS, University of Science and Technology of China date: 2019 keyword:

  • ASL , CTC

Paper: Iterative Alignment Network

Summary

Research Objective

  • Application Area:
    • sign language (SL) is used by millions of people with hearing or speech impairments in their daily life
    • because of the lack of systematic tools for sign language, it is very difficult for many people to communicate with the deaf-mute
  • Purpose: propose an alignment network with iterative optimization for weakly supervised continuous sign language recognition

Problem Statement

previous work:

  • isolated SLR recognition [16, 22, 42, 43]
  • video representation: 3D-CNN ResNet P3D
  • sequence modeling:
    • attention-based encoder-decoder network
      • Bahdanau et al. [1] introduce attention mechanism into encoder-decoder network to learn the correspondence between source sequence and target sequence
    • connectionist temporal classification (CTC) based network
      • CTC is able to deal with unsegmented input data and learn the correspondence between the input sequence and the output sequence (see the PyTorch sketch after this list).
  • continuous SLR
    • hand-crafted feature based
      • Hidden Markov Model (HMM) or Hidden Conditional Random Fields (HCRF)
      • [35] two real-time HMM-based systems for recognizing sentence-level continuous American Sign Language (ASL).
      • [40]a discriminative sequence model with Hidden Conditional Random Field (HCRF) for gesture recognition
    • deep learning based [9, 23, 25] (look into these datasets)
      • video representations by residual network ResNet [18], 3D-CNN [33, 37]
      • [23] with hierarchical attention in latent space
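CTC needs only the ordered gloss labels, not frame-level segmentation. A minimal sketch of how such a loss is typically wired up; PyTorch, the tensor shapes, and the dummy labels here are my own illustration, not the paper's implementation:

```python
import torch
import torch.nn as nn

T, N, C = 50, 2, 20                                     # frames, batch size, vocab size (blank = 0)
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(dim=2)  # per-frame log-probs
targets = torch.tensor([[3, 7, 1, 0], [5, 2, 0, 0]])    # padded gloss label sequences
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.tensor([3, 2])                   # true label lengths (ignore padding)

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()   # gradients flow back into the sequence model / feature extractor
```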

Methods

  • Problem Formulation:

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191223092845392.png

  • system overview:

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191223092906125.png

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191223093357368.png

CTC_Loss:

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191223093649877.png

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191223093518165.png

LSTM_Loss:

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191223093740963.png

image-20191223093756890

The Whole NetworkLoss:

image-20191223093851361

image-20191223093859610

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191223093025875.png

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191223093947114.png

Evaluation

  • Environment:
    • Dataset:
      • RWTH-PHOENIX-Weather multi-signer [25] for German SLR
      • CSL [23] for Chinese SLR
  • Evaluate Methods: image-20191223094117184
  • The window size is set to 8 with a stride of 4, and the 3D-ResNet is pre-trained on an isolated sign language recognition dataset released in [43]
  • Performance:

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191223094423096.png

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191223094431353.png

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191223094443353.png

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191223094507331.png

Conclusion

  • A unified deep learning architecture integrating an encoder-decoder network and connectionist temporal classification (CTC) for continuous sign language recognition.
  • A soft dynamic time warping (soft-DTW) alignment constraint between the LSTM and CTC decoders, which indicates the temporal segmentation in sign videos
  • Iterative optimization strategy to train feature extractor and encoder-decoder network alternately with alignment proposals by warping path

Notes (to study further)

  • Paper [23]: Video-based sign language recognition without temporal segmentation (China)
  • Paper [25] (German dataset): Continuous sign language recognition: Towards large vocabulary statistical recognition systems handling multiple signers
  • SubUNets: End-to-end hand shape and continuous sign language recognition
  • Online early-late fusion based on adaptive HMM for sign language recognition
  • Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and imagenet
  • Attention based 3D-CNNs for large-vocabulary sign language recognition
  • Video-based sign language recognition without temporal segmentation
  • Dilated convolutional network with iterative optimization for continuous sign language recognition
  • Continuous sign language recognition: Towards large vocabulary statistical recognition systems handling multiple signers
  • Online early-late fusion based on adaptive HMM for sign language recognition
  • Joint CTC/attention decoding for end-to-end speech recognition
  • Attention based 3D-CNNs for large-vocabulary sign language recognition
  • Video-based sign language recognition without temporal segmentation
  • Deep sign: hybrid CNN-HMM for continuous sign language recognition
  • Re-sign: Re-aligned end-to-end sequence modelling with deep recurrent CNN-HMMs.
  • Online detection and classification of dynamic hand gestures with recurrent 3D convolutional neural networks
  • Dilated convolutional network with iterative optimization for continuous sign language recognition

level: SenSys CCF_B author: Biyi Fang, Michigan State University date: 2017 keyword:

  • ASL, Leap Motion (an infrared light-based sensing device)

Paper: DeepASL

  1. performance at both word level and sentence level (unseen ASL sentences ,unseen users)
  2. robustness under various real-world settings (various ambient lighting conditions, body postures,and interference sources )
  3. system performance test in terms of runtime , memory usage and energy consumption.

Research Objective

  • Application Area: seeking help from a sign language interpreter, writing on paper, or typing on a mobile phone; each of these methods has its own key limitations in terms of cost, availability, or convenience
  • Purpose:

Problem Statement

  • ASL: hand shape, hand movement, relative location of the two hands, body movement, facial expressions
  • Electromyography (EMG) sensors, RGB cameras, and Kinect sensors are either intrusive (sensors have to be attached to fingers and palms of users), lack the resolution to capture the key characteristics of signs, or are significantly constrained by ambient lighting conditions or backgrounds in real-world settings
  • existing sign language translation systems can only translate a single sign at a time, thus requiring users to pause between adjacent signs.

previous work:

  • wearable sensor-based: motion sensors (accelerometers, gyroscopes), EMG sensors, finger-bending sensors to infer the performed signs; intrusive and impractical for daily usage
  • radio-frequency-based: wireless signals have very limited resolution to capture the hands
  • RGB camera-based: poor lighting conditions or generally uncontrolled backgrounds, privacy concerns
  • Kinect-based: hard to capture hand shape information

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191224090129594.png

  • Leap Motion is able to extract skeleton joints of the fingers, palms and forearms from the raw infrared images.https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191224090111350.png

Methods

  • system overview:

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191224084552045.png

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191224090211496.png

  1. a temporal sequence of 3D coordinates of the skeleton joints of fingers, palms and forearms
  2. the key characteristics of ASL signs, including hand shape, hand movement and relative location of the two hands, as spatio-temporal trajectories of ASL characteristics
  3. models the spatial structure and temporal dynamics of the spatio-temporal trajectories of ASL characteristics for word-level ASL translation
  4. a CTC-based framework that leverages the captured probabilistic dependencies between words in one complete sentence and translates the whole sentence end-to-end without requiring users to pause between adjacent signs.

【ASL Characteristics Extraction】

  • Savitzky-Golay filter [37] to improve the signal-to-noise ratio of the raw skeleton joint data

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191224090953165.png

  • extract hand shape: https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191224092151069.png
  • hand movement information:image-20191224092217948

【Word-Level ASL Translation】: translation errors when different signs share very similar characteristics at the beginning of the signs

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191224092513888.png

  • Hierarchical Bidirectional RNN for Single-Sign Modeling:https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191224092932396.png

【Sentence level Translation】 using CTC network

2. Video Understanding

level: CVPR_CCFA author:Romero Morais date: keyword:

  • video analysis, anomaly detection

Paper: Anomaly Detection

Summary

  1. model the normal patterns of human movement in surveillance video for anomaly detection using dynamic skeleton features.
  2. decompose the skeletal movements into two sub-components: global body movement and local body posture. The global body movement tracks the dynamics of the whole body in the scene, while the local body posture describes the skeleton configuration in the canonical coordinate frame of the body's bounding box.
  3. model the dynamics and interactions of the coupled features in our novel Message-Passing Encoder-Decoder Recurrent Network.
  4. skeleton features are compact, strongly structured, semantically rich, and highly descriptive of human action and movement, which are keys to anomaly detection.

Problem Statement

  • Human behavioral irregularity can be factorized into a few factors regarding body motion and posture: location, velocity, direction, pose, and action.

Methods

【Question 1】The scales of human skeletons vary largely depending on their location and actions

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20200422104952789.png

$$ f_t^i = f_t^g + f_t^{l,i};\quad f^g = (x^g, y^g, w, h),\ f^{l,i} = (x^{l,i}, y^{l,i}) $$

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20200422105234027.png
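A minimal sketch of this decomposition, assuming (as the summary above suggests) that the global component is the skeleton's bounding box and the local component is each joint normalized into that box; variable names are mine:

```python
import numpy as np

def decompose_skeleton(joints):
    """joints: (K, 2) array of 2D joint coordinates for one frame -> (f_g, f_l)."""
    x_min, y_min = joints.min(axis=0)
    x_max, y_max = joints.max(axis=0)
    w, h = x_max - x_min + 1e-6, y_max - y_min + 1e-6
    f_g = np.array([x_min + w / 2, y_min + h / 2, w, h])           # global: box center + size
    f_l = (joints - np.array([x_min, y_min])) / np.array([w, h])   # local: joints in the box frame
    return f_g, f_l

f_g, f_l = decompose_skeleton(np.random.rand(17, 2) * 100)
```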

【Question 2】How to fuse local and global features?

  • propose the MPED-RNN model, consisting of two recurrent encoder-decoder branches, each dedicated to one of the components; each branch has a single-encoder-dual-decoder architecture with three RNNs: Encoder, Reconstructing Decoder and Predicting Decoder.
  • use Gated Recurrent Units in every segment of MPED-RNN for their simplicity and performance similar to LSTM

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20200422105321588.png

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20200422105656331.png

【Question 3】How to detect video anomalies?

  1. Extract segments: select overlapping skeleton segments by sliding a window of size T with stride s over the trajectory
  2. Estimate segment losses: decompose each segment into the two sub-components, feed all segment features to the trained MPED-RNN, and output the normality loss
  3. Gather skeleton anomaly scores: measure the conformity of a sequence to the model given both past and future context, using a voting scheme to gather the losses of related segments into an anomaly score for each skeleton instance: https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20200422111136464.png
  4. Calculate the frame anomaly score (see the sketch after this list): https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20200422111221056.png
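A compact sketch of steps 1 and 3–4 above, assuming each segment's loss is voted back onto the frames it covers and the frame score is the maximum over the skeletons visible in that frame (all names, the window size, and the voting details are illustrative):

```python
import numpy as np

def extract_segments(trajectory, T=12, stride=6):
    """trajectory: (L, D) per-frame skeleton features -> list of (start, (T, D) segment)."""
    return [(s, trajectory[s:s + T]) for s in range(0, len(trajectory) - T + 1, stride)]

def frame_anomaly_scores(segment_losses, num_frames, T=12):
    """segment_losses: {skeleton_id: [(start, loss), ...]} normality losses from the trained model."""
    frame_scores = np.zeros(num_frames)
    for seg_list in segment_losses.values():
        acc, votes = np.zeros(num_frames), np.zeros(num_frames)
        for start, loss in seg_list:                 # vote each segment's loss onto its frames
            acc[start:start + T] += loss
            votes[start:start + T] += 1
        skel = np.divide(acc, votes, out=np.zeros(num_frames), where=votes > 0)
        frame_scores = np.maximum(frame_scores, skel)  # frame score = max over skeleton instances
    return frame_scores
```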

Evaluation

  • Environment:
    • Dataset: ShanghaiTech Campus dataset, one of the largest datasets currently available for video anomaly detection; it combines footage from 13 different cameras.

Notes (to study further)

  • video anomaly detection
  • human trajectory modeling
  • sequence consistency

level: author: Waqas Sultani, UCF date: keyword:

  • anomaly detection, video analysis

Paper: Real-world Anomaly Detection

  1. propose to learn anomalies through a deep multiple instance ranking framework by leveraging weakly labeled training videos; the training labels (anomalous or normal) are at video level instead of clip level.
  2. introduce a new large-scale dataset of 128 hours of video with 13 realistic anomalies such as fighting, road accident, burglary, robbery.
  3. propose a MIL solution to anomaly detection by leveraging only weakly labeled training videos; propose a MIL ranking loss with sparsity and smoothness constraints for a deep network to learn anomaly scores for video segments.

Research Objective

  • Application Area: traffic accidents, crimes or illegal activities.

Problem Statement

  • Anomaly detection:
    • considering all anomalies in one group and all normal activities in another group
    • recognise specific activities.
    • impossible to define a normal event which takes all possible normal patterns/behaviors into account.
    • detect human violence by exploiting motion and limbs orientation of people
    • employed video and audio data to detect aggressive actions in surveillance videos.
    • violent flow descriptors to detect violence in crowd videos.
    • using deep learning based autoencoders to learn the model of normal behaviors and employed reconstruction loss to detect anomalies.
  • Ranking: focus on improving relative scores of the items instead of individual scores.
    • deep ranking networks: used for feature learning, highlight detection, graphics interchange format generation, face detection and verification, person re-identification, place recognition, metric learning and image retrieval.

Methods

  • system overview:

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20200422150129902.png

【Question 1】Learning with less annotation

  • only video-level labels indicating the presence of an anomaly in the whole video are needed. A video containing anomalies is labeled as positive and a video without any anomaly is labeled as negative.

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20200422150330960.png

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20200422150529485.png

【Question 2】How to detect anomalous activities without precise annotation?

  • Deep MIL ranking model: the scores of instances in the anomalous bag should be sparse, and the anomaly score should vary smoothly between adjacent video segments (a hedged sketch follows below).
  • https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20200422150817733.png
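A hedged sketch of such a MIL ranking objective: a hinge ranking term between the highest-scored segments of the anomalous and normal bags, plus temporal-smoothness and sparsity terms on the anomalous bag. The λ values and shapes are placeholders, not the paper's exact settings:

```python
import torch

def mil_ranking_loss(scores_anom, scores_norm, lam_smooth=8e-5, lam_sparse=8e-5):
    """scores_*: (num_segments,) anomaly scores in [0, 1] for one anomalous / one normal video."""
    rank = torch.clamp(1.0 - scores_anom.max() + scores_norm.max(), min=0.0)   # hinge ranking
    smooth = ((scores_anom[1:] - scores_anom[:-1]) ** 2).sum()                  # temporal smoothness
    sparse = scores_anom.sum()                                                   # sparsity
    return rank + lam_smooth * smooth + lam_sparse * sparse

loss = mil_ranking_loss(torch.rand(32, requires_grad=True), torch.rand(32, requires_grad=True))
```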

Notes (to study further)

level: AAAI CCF_A author: Yijun Cai , Haoxin Li, Jian-Fang Hu , Wei-Shi Zheng date: 2019 keyword:


Paper: Action Knowledge Transfer

Summary

  1. Use full video action sequences to guide the prediction of partially observed video sequences?

Research Objective

  • Application Area: reducing computational resources, traffic systems.
  • Purpose: Propose to transfer action knowledge learned from fully observed videos for improving the prediction of partially observed videos

Proble Statement

  • The main difficulty of action prediction lies in the lack of discriminative action information in partially observed videos: partially observed videos often contain incomplete action executions and thus carry less action information than fully observed ones.

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191202141323929.png

  • the existing action recognition systems can be directly used for action prediction by treating partial videos as full videos.

previous work:

  • focus on improving the discriminative power of partial videos by developing max-margin learning (Kong and Fu 2015) or soft regression frameworks (Hu 2016)
  • Action prediction:
    • Ryoo et al. (Ryoo 2012) proposed to use integral and dynamic bag-of-words for action prediction
    • Kong and Fu 2015: a max-margin learning framework was presented to learn discriminative features for prediction
    • Vondrick, Pirsiavash and Torralba 2016 propose to predict the features of future frames to learn better representations for action prediction
    • Lan, Chen developed hierarchical representations at multiple granularities to predict human actions
    • they don't seek to make use of action knowledge learned from full sequences for prediction; we propose to mine rich action knowledge from full videos
  • Knowledge distillation: (Hinton, Vinyals and Dean 2015; Huang and Wang 2017; Yim et al. 2017) the knowledge contained in a large network is distilled and transferred to a small network by enforcing the outputs or intermediate activations of the small network to match those of the large network; our goal is to improve the discriminative power of partially observed videos

Methods

  • system overview:

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191202143008477.png

[Question 1] How to learn action knowledge from full videos?

Given a set of full videos $\{x_i\}$ with corresponding features $\{f_i\}$ and labels $\{y_i\}$, we intend to learn an embedding function G to project the original feature onto an embedding space, and a discriminative classifier D to project the embedding to the label space: $$ e_i = G(f_i),\quad p_i = D(e_i). $$
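A minimal sketch of this formulation with G and D as simple layers; the feature dimension, architectures, and class count below are assumptions for illustration only:

```python
import torch
import torch.nn as nn

feat_dim, embed_dim, num_classes = 2048, 256, 10
G = nn.Sequential(nn.Linear(feat_dim, embed_dim), nn.ReLU())  # embedding function G
D = nn.Linear(embed_dim, num_classes)                         # discriminative classifier D

f = torch.randn(8, feat_dim)                                  # features f_i of 8 full videos
e = G(f)                                                      # e_i = G(f_i)
p = D(e)                                                      # p_i = D(e_i)
loss = nn.CrossEntropyLoss()(p, torch.randint(0, num_classes, (8,)))
```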

To encourage large distances between embeddings from different classes:

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191202143440633.png

[Question 2] Transferring Action Knowledge to Partial Videos

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191202143637577.png

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191202143706194.png

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191202143144128.png

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191202143845522.png

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191202143753309.png

Conclusion

  • propose a novel knowledge transfer framework to boost the performance of action prediction with partial videos, by transferring knowledge from the feature embeddings and discriminative classifier of full videos.
  • the method shows remarkable improvement for action prediction

Notes (to study further)

  • Kong Tao and Fu 2017 Qin et al.2017
  • paper 18
  • Additive Margin (AM) Softmax (Wang et al. 2018)
  • Max-margin action prediction machine
  • Spatiotemporal multiplier networks for video action recognition
  • Distilling the knowledge in a neural network (Hinton 2015)
  • early action prediction by soft regression
  • Like what you like: Knowledge distill via neuron selectivity transfer
  • Deep sequential context networks for action prediction
  • learning activity progression in lstm for activity detection and early detection
  • Learning spatiotemporal features with 3D convolutional networks
  • action recognition with improved trajectories
  • action recognition by dense trajectories
  • A gift from knowledge distillation: fast optimization, network minimization and transfer learning

level: ECCV CCF_A author: Tian Lan, Tsung-Chuan, Silvio Savarese, Stanford University date: 2014 keyword:

  • action prediction

Paper: Hierarchical Representation

Summary

  1. adopt a hierarchical structure to predict actions at different granularities.

Research Objective

  • Application Area: autonomous robots, surveillance and health care , robotic applications[24], [29]
  • Purpose: predict future action

Problem Statement

  • capture the subtle details inherent in human movements that may imply a future action
  • humans are highly articulated objects
  • actions can be described at different levels of semantic granularities.
  • prediction should be carried out as quickly as possible
  • from recognizing simple human actions such as walking and standing in constrained settings [19] to understanding complex actions in realistic videos and still images collected from movies, TV shows, sports games, the Internet (background clutter, occlusions, viewpoint changes)
    • in video: bag-of-features representations of local space-time features [22]
    • in images: contextual information such as attributes, objects, and poses is jointly modeled with actions.

previous work:

  • The human visual system has the ability to predict future actions based on previous observations of interactions among humans
  • recent early event detection: expand spectrum of human action recognition to actions in future
    • [18] addresses the problem of early recognition of unfinished activities
    • [6] SVM framework for early event detection
    • predicting motion from still images[29]
    • predicting the future trajectories of pedestrians [15, 7]
  • different from previous work
    • predict future actions from any timestamp in a video , don’t constrain the input to the “early stage of an action”
    • predict from a short video clip or even a static image
    • expand the scope of action prediction from controlled lab settings to unconstrained “in-the-wild” footage
    • predicting future actions from still images or short video clips in unconstrained data

Methods

  • Problem Formulation:

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191202132446815.png

  • system overview:

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191202134658266.png

【Question 1】How to construct the hierarchy?

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191202132617020.png

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191202134937117.png

【Question 2】Model formulation

Define X as a person example and $Y=\{y_i\}_{i=1}^L$, where L is the total number of levels of the hierarchy and $y_i$ is the index of the corresponding moveme at level $i$.

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191202135449997.png

image-20191202135543989

【Question 3】Optimization problem

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191202135652819.png

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191202135725669.png

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191202135751580.png

Conclusion

  • predict future actions from a single frame in challenging real-world scenarios
  • a hierarchical "moveme" representation to capture multiple levels of granularity in human movements
  • develop a max-margin learning framework that jointly learns the appearance models of different movemes as well as their relations

Notes

  • Paper [1] (moveme concept): Learning and recognizing human dynamics in video sequences
  • Paper [22]: Action recognition with improved trajectories
  • Paper [18]: Early recognition of ongoing activities from streaming videos
  • Paper [24]: Probabilistic modeling of human movements for intention inference
  • Paper [29]: A data-driven approach for event prediction
  • Paper [9]: anticipating future activities from RGB-D data by considering human-object interactions; Anticipating human activities using object affordances for reactive robotic response
  • Is there an implementation? Run it to see the prediction quality and deepen understanding through the code; the optimization-problem formulation needs further study.

level: ECCV CCF_A author: George Papandreou, Tyler Zhu, Google Research


Paper: PersonLab

Summary

  1. present a box-free bottom-up approach for the tasks of pose estimation and instance segmentation of people in multi-person images using an efficient single-shot model
  2. tackles both semantic-level reasoning and object-part associations using part-based modeling; employs a convolutional network that learns to detect individual keypoints and predict their relative displacements, then groups keypoints into person pose instances
  3. propose a part-induced geometric embedding descriptor which allows us to associate semantic person pixels with their corresponding person instance, delivering instance-level person segmentations

Research Objective

  • Application Area:
  • Purpose:

Methods

  • system overview:

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191130152232740.png

  • Keypoint detection: detect all visible keypoints belonging to any person in the image.
    • (The detailed heatmap computation is to be revisited.) The short-range offset vector improves keypoint localization accuracy; the heatmap and short-range offsets are aggregated via Hough voting into 2-D Hough score maps
  • Grouping keypoints into person detection instances: fast greedy decoding algorithm
  • Instance-level person segmentation: given the set of keypoint-level person instance detections, the task of the method's segmentation stage is to identify pixels that belong to people (recognition) and associate them with the detected person instances (grouping) https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191130154003763.png
  • Semantic person segmentation && Associating segments with instances via geometric embeddings

Notes 去加强了解

  • not read carefully

3. Video Action Prediction (different granularities)

level: CVPR
author: Agrim Gupta, Li Fei-Fei date: 2018 keyword:

  • trajectory prediction

Paper: Social GAN

Summary

  1. Use an LSTM to encode each person's motion, a Social-LSTM-style pooling layer to represent interactions with people farther away, a generative model to produce multiple candidate paths, and a discriminator to select the best path among them

Research Objective

  • Purpose: predict the future trajectory

Proble Statement

  • InterPersonal: humans have an innate ability to read the behavior of others when navigating crowds
  • Socially Acceptable: social norms
  • Multimodal: multiple trajectories

previous work:

  • they model a local neighborhood around each person
  • they tend to learn average behavior, whereas we aim to learn multiple socially acceptable trajectories

Methods

  • system overview:

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191118093649383.png

【Question 1】Use an LSTM to encode the location of each person, and model human-human interaction via a Pooling Module (PM). After $t_{obs}$ we pool the hidden states of all the people present in the scene to get a pooled tensor $P_i$ for each person.

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191118094457342.png

Condition the generation of output trajectories by initializing the hidden state of the decoder so as to produce future scenarios which are consistent with the past: https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191118094655624.png

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191118094908418.png

Discriminator. The discriminator consists of a separate encoder. Specifically, it takes as input $T_{real} = [X_i, Y_i]$ or $T_{fake} = [X_i, \hat{Y}_i]$ and classifies them as real/fake.

[Question 2] Pooling Module. Challenges: 1. Variable and possibly large number of people in a scene; we need a compact representation which combines information from all the people. 2. Scattered human-human interaction; the network needs to model the global configuration.

  • pass the input coordinates through an MLP followed by a symmetric function (max-pooling); use relative coordinates for translation invariance, augmenting the input to the pooling module with the relative position of each person with respect to person i (see the sketch below).

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191118095339973.png
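A hedged sketch of that pooling idea: embed each neighbor's relative position, concatenate it with that neighbor's hidden state, pass the result through an MLP, and max-pool over neighbors to obtain one pooled vector per person. Layer sizes and names are illustrative, not the paper's configuration:

```python
import torch
import torch.nn as nn

hidden_dim, embed_dim, bottleneck = 32, 16, 8
pos_embed = nn.Linear(2, embed_dim)
mlp = nn.Sequential(nn.Linear(embed_dim + hidden_dim, bottleneck), nn.ReLU())

def pool_hidden_states(pos, h):
    """pos: (N, 2) last positions, h: (N, hidden_dim) encoder states -> (N, bottleneck)."""
    N = pos.size(0)
    rel = pos.unsqueeze(0) - pos.unsqueeze(1)            # (N, N, 2): position of j relative to i
    rel_e = pos_embed(rel)                               # (N, N, embed_dim)
    h_rep = h.unsqueeze(0).expand(N, N, -1)              # (N, N, hidden_dim): neighbor states
    pooled = mlp(torch.cat([rel_e, h_rep], dim=-1))      # (N, N, bottleneck)
    return pooled.max(dim=1)[0]                          # symmetric max-pool over neighbors

P = pool_hidden_states(torch.randn(5, 2), torch.randn(5, hidden_dim))
```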

[Question 3] Diverse sample generation: propose a variety loss function that encourages the network to produce diverse samples; generate k possible output samples and choose the best prediction in the L2 sense (a minimal sketch follows): https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191118100138050.png
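A minimal sketch of this variety loss: generate k trajectory samples per person and back-propagate only through the best one in the L2 sense (shapes and names are illustrative):

```python
import torch

def variety_loss(pred_samples, gt):
    """pred_samples: (k, T_pred, 2) candidate trajectories, gt: (T_pred, 2) ground truth."""
    l2 = ((pred_samples - gt.unsqueeze(0)) ** 2).sum(dim=-1).sqrt().sum(dim=-1)  # (k,) path errors
    return l2.min()                                   # keep only the best of the k samples

loss = variety_loss(torch.randn(20, 12, 2, requires_grad=True), torch.randn(12, 2))
```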

Conclusion

  • introduce variety loss which encourages the generative network of GAN to spread its distribution and cover the space of possible paths while being consistent with the observed inputs.
  • a new pooling mechanism that learns a global pooling vector which encodes the subtle cues for all people involved in a scene.

level: CVPR author:Chih-Yao Ma ,Min-Hung Chen date: 30 Mar 2017 keyword:

  • LSTM, action prediction

Paper: TS-LSTM and Temporal-Inception

Proble Statement

  • methods extending the basic two-stream ConvNet have not systematically explored possible network architectures to further exploit spatiotemporal dynamics within video sequences; such works often use different baseline two-stream networks.
  • traditional two-stream ConvNets are unable to exploit the most critical component in action recognition: visual appearance across both spatial and temporal streams and their correlations are not considered
  • previous work mostly tries individual methods with little analysis of whether and how they successfully use temporal information.
  • each individual work uses a different network for the baseline two-stream approach, with performance varying depending on the training and testing procedure as well as the optical flow method used.

previous work:

  • hand-crafted or learned features for training 3D ConvNets: [9] stacked consecutive video frames and extended the first convolutional layer to learn spatiotemporal features while exploring different fusion approaches including early fusion and slow fusion. C3D [20] replaced all 2D convolutional kernels with 3D kernels at the expense of GPU memory. [16] factorized the original 3D kernels into 2D spatial and 1D temporal kernels and achieved comparable performance. Multiple layers can extract temporal correlations at different time scales and provide better capability to distinguish different types of actions
  • ConvNets with RNNs: directly take variable-length inputs and learn long-term dependencies.
  • Two-stream ConvNets: spatial features plus temporal features from optical flow images; we only use the feature-vector representations instead of feature maps

Methods

  • system overview:

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191124134305103.png

【Question 1】Spatial stream & Temporal stream

Spatial stream: the ResNet-101 spatial-stream ConvNet is pre-trained on ImageNet and fine-tuned on RGB images extracted from the UCF101 dataset with a classification loss for predicting activities.

Temporal stream: stacking 10 optical flow images for the temporal stream has become a standard for two-stream ConvNets [13, 6, 28, 25, 27]; follow the same pre-training procedure shown by [25].

[Model 1] Temporal Segment LSTM: divide the 25 sampled video frames into several segments, apply a temporal pooling layer to extract distinguishing features from each segment, and use an LSTM to extract the embedded features from all segments (see the sketch below).

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191124140942805.png
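A hedged sketch of the temporal-segment idea: split per-frame feature vectors into segments, temporal-pool each segment, and run an LSTM over the pooled segment features. The segment count, feature dimension, and mean pooling below are assumptions:

```python
import torch
import torch.nn as nn

frames, feat_dim, segments = 25, 2048, 5
features = torch.randn(1, frames, feat_dim)                                  # per-frame ConvNet features
seg = features.view(1, segments, frames // segments, feat_dim).mean(dim=2)   # temporal pooling per segment
lstm = nn.LSTM(feat_dim, 512, batch_first=True)
out, _ = lstm(seg)                                                            # embed the segment sequence
logits = nn.Linear(512, 101)(out[:, -1])                                      # classify (e.g., UCF101)
```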

[Model 2] Temporal-ConvNet: leveraging the temporal relations across different frames.

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191124141556429.png

image-20191124141419689

Different types of action have different temporal characteristics, and different kernels in different layers essentially search for different actions by exploiting different receptive fields to encode the temporal characteristics.

Evaluation

  • Environment:
    • Dataset: experiments on the spatial stream, temporal stream, and two-stream setting on three different splits of the UCF101 and HMDB51 datasets.
  • comparison evaluation (this part was skipped; revisit if needed later)

Conclusion

  • first demonstrate a strong baseline two-stream ConvNet using ResNet-101.
  • propose and investigate two different networks to further integrate spatiotemporal information: temporal segment RNN and Inception-style Temporal-ConvNet, both of which need proper care.

Notes 去加强了解

  • [13] incorporates spatial and temporal information extracted from RGB and optical flow images: Two-stream convolutional networks for action recognition in videos
  • Study the models in [14], [18], [8], [25], [28], [7]
  • code available:
  • [6] fusion stage Convolutional two-stream network fusion for video action recognition
  • Temporal segment networks: Towards good practices for deep action recognition
  • optical flow methods: Brox [2] or TV-L1 [29]; results: https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191124141955787.png

level: 2019 Winter Conference on Applications of Computer Vision (WACV) author: Erwin Wu, Tokyo Institute of Technology date: 2019 keyword:

  • action prediction

Paper: FuturePose

Summary

  1. This paper combines 2D human joint coordinates with optical flow and uses an LSTM network to forecast the 2D pose coordinates 0.5 s into the future; the pose is then lifted to a 3D model with a VNect-style network, and several points on the 3D model are checked by a numerical model to decide whether a collision occurs.
  2. shortcomings:
    1. only experiments on boxing; there are still other activities
    2. focuses on inference and accuracy across different algorithms, e.g., hyper-parameters such as d for the lattice-point flow and the threshold for the noise filter
    3. the forecasting information is limited; an orientation-based 3D pose estimation could be built by dividing the human body into parts and learning bone rotations, related not only to their parent joints but to the entire body
    4. the frame rate limits normal high-speed movements such as a kick from a professional martial-arts athlete
    5. this paper focuses on a single person; what if there are multiple persons?

Research Objective

  • Application Area: analyse a player's habits, determine strengths and predict the next movement
  • Purpose: a novel mixed reality martial arts training system using deep learning based real time human pose forecasting.

Proble Statement

  • Recent 3D motion capture systems are based on fabric-based wearable technology, requiring the user to wear specific suits or sensors.
  • special cameras: RGB-Depth, IR cameras
  • normal dense optical flow requires heavy computation and leads to long inference time in the LSTM

previous work:

  • Martial sports in AR/VR: wear a VR HMD and hold a pair of controllers
  • Real-time 3D pose estimation:
    • VNect: provides better accuracy for 3D skeleton recognition with less computation and good real-time ability, but cannot be used for multi-person detection.
    • OpenPose detects multiple people in a single image, but the inference time is longer.
    • [20] 3D pose recovery using a simple deep neural network with only two linear layers and two residual blocks, demonstrating that a 3D pose can be created from 2D joint positions.
  • Pose forecasting:
    • 3D-PFNet: the first to forecast human dynamics from single RGB images, forecasting 2D skeletal poses and converting them into 3D space (87.6 mm error); an offline network requiring large computation
    • [12] forecasts human body motion 0.5 s in advance using a five-layer neural network (7.9 cm error); its IR sensor is not suitable for outdoor environments or large areas, nor for more complicated athletic movements such as boxing

Methods

  • Problem Formulation:forecasting of 3D pose from a single image and the model fitting and collision detection.
  • system overview:https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191116143629118.png
  • https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191116152552658.png

【Question 1】How to estimate the 2D pose? The person is cropped using a bounding-box tracker.

Use ResNet-50 [10] and let the convolutional layers regress the 2D joint data.

【Question 2】2D pose forecasting?

Using optical flow and joint-position data as input, regression is done with LSTMs. The authors developed a sparse optical flow called Keypoint Lattice Optical Flow, which creates several lattice points and only calculates optical flow at the lattice points close to the keypoints (see the sketch below).

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191116153025285.png
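A rough sketch of computing sparse flow only at lattice points placed around each 2D joint, using OpenCV's pyramidal Lucas-Kanade tracker; the 3×3 lattice layout and spacing d are my own illustration of the idea, not the paper's exact scheme:

```python
import numpy as np
import cv2

def lattice_flow(prev_gray, next_gray, joints, d=8):
    """joints: (K, 2) 2D joint positions; returns flow vectors at a 3x3 lattice around each joint."""
    offsets = np.array([[dx, dy] for dx in (-d, 0, d) for dy in (-d, 0, d)], dtype=np.float32)
    pts = (joints[:, None, :] + offsets[None]).reshape(-1, 1, 2).astype(np.float32)  # (K*9, 1, 2)
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, pts, None)        # track lattice points
    flow = (nxt - pts).reshape(len(joints), 9, 2)                                      # per-joint lattice flow
    return flow, status.reshape(len(joints), 9)
```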

【Question 3】3D pose recovery? [20] an effective 3D pose recovery using the VNect network

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191116153558928.png

【Question 4】How to understand the person's position and detect collisions in the virtual environment?

A 3D model is used to represent each user, providing surfaces that can collide with one another.

The MakeHuman API is used to generate the 3D model: https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191116154051111.png

The model is divided into more than 200 segments, called 'hulls'; each hull contains a convex collider. A collision between two hulls is detected using basic convex polytopes.

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191116154239879.png

Evaluation

  • Environment:

    • Hardware: TensorFlow on TSUBAME 3.0 (Xeon E5-2680 v4 CPU ×2, Nvidia SXM2 P100 GPU ×4), TensorFlow 1.4.1, CUDA 8.0, cuDNN 5.1; HTC Vive (VR HMD), Sony DSC-QX10 camera, Logitech C270 webcam.
    • Dataset: MPI-INF-3DHP and Human3.6M datasets for pre-training and validation; ratio of 6:2:2 for training, testing, and validation
  • result

    Real-time performance:

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191116155736081.png

Pose forecasting accuracy:

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191116155725964.png

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191116155800489.png

User case study:

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191116161933715.png

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191116162001653.png

Contribution

  • the first to realize real-time 3D human pose forecasting based on normal video frames and apply it to mixed-reality martial-arts training
  • a customized residual network [10] to obtain 2D human joints; recurrent networks are used to learn the temporal features of human motion.
  • use a lattice optical flow algorithm to calculate joint movement with less computation

Notes (to study further)

  • papers [20], [12], [22], [6], [19], [18]
  • 3D-PFNet [12]
  • PCKh@0.5 evaluation measure [1], which calculates the percentage of correct keypoints using a matching threshold of 50% of the head segment length.
  • RMSE: the root-mean-squared error (RMSE) was also calculated to show the deviation of the predicted data

level: ACM
author: Yuuki Horiuchi , Yasutoshi Makino date: 2017 .10 keyword:

  • Machine learning , Motion estimation,Human-centered computing ,computing methodologies

Paper: Computational Foresight

Research Objective

  • Application Area: instructing sports actions, preventing the elderly from falling, preventing accidents in advance, reducing delay in remote interactive systems
  • Purpose: forecast human body pose 0.5 s before the actual motion in real time, with an accuracy of 7.9 cm

Proble Statement

  • diverse communication with remote areas has become possible, but information transmission delay remains a problem

previous work:

  • The Holoportation system [1] lets users communicate with remote people using an HMD.
  • The TELESAR V system [2] lets users feel objects through a remotely connected robot with haptic sensing and feedback.
  • Pattern categorization or estimation using DNNs, e.g., predicting trajectories in real time; there is no research that forecasts body motions that are non-repetitive yet person-independent in real time and visualizes them to the user.

Methods

  • system overview:

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191122164724372.png

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191122164824683.png

【Question 1】How to extract the 25 body joints and the COG? (Paper [14], need to study it.)

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191122165225765.png

[Question 2] Neural network design?

Combine the past 10 frames of 26 data points (25 joints + COG position) as one training sample: 26 points × 3 dimensions (x, y, z) × 10 frames = 780 inputs (see the sketch below).

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191122165657463.png

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191122165302517.png
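A tiny sketch of how such an input vector could be assembled; the joint count and frame window follow the text above, while the array layout is my own:

```python
import numpy as np

frames, points, dims = 10, 26, 3                   # past 10 frames, 25 joints + COG, (x, y, z)
history = np.random.rand(frames, points, dims)     # stand-in for tracked Kinect joint data
x = history.reshape(-1)                            # 10 * 26 * 3 = 780-dimensional network input
assert x.shape == (780,)
```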

[Question 3] Loss function and optimizer choice

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191122165849130.png

image-20191122170046854

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191122171352648.png

Evaluation

  • Environment:
    • Dataset: Kinect V2; eleven subjects jumped as many times as they could, one minute per session. They were allowed to jump in either direction in random order, with distances under 2.5 m.
    • Laptop (CPU: Intel Core i7-7820HK 2.9-3.9 GHz, GPU: Nvidia GeForce GTX 1080): 3.32 ms for the COG NN matrix operation, 5.75 ms for rendering the bone image, less than 33 ms for measuring the depth map and the 3D positions of the 25 body joints and COG
    • https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191122171323598.png
    • https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191122170458248.png

Conclusion

  • Mean squared error (MSE); note that RMSE is the square root of MSE (both written out below): https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191122171538058.png
  • Mean absolute error (MAE): https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191122171610156.png
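For reference, the standard definitions of the two error measures:

$$ \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}, \qquad \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right| $$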

Notes 去加强了解

  • human gesture recognition using depth maps [10]; neural network for dynamic human motion prediction
  • Paper [14]: computing the center of gravity

level: CVPR CCF A author: Junwei Liang ,Li Fei-Fei date: ‘2019-05-31’ keyword:

  • LSTM, activity prediction

Paper: Peeking into the future

Summary

This paper predicts a person's future trajectory and future activities by analyzing the person's location, behavior, distances to surrounding objects, and the surrounding scene context, and it uses an activity-location prediction module to reduce the accumulated error in the predicted locations.

  1. When encoding a person's interactions with the surroundings, could the correspondences between people's actions be encoded as well, rather than just simple distances?

Research Objective

  • Application Area:Future person path/trajectory activity prediction (accident avoidance , smart personal assistance , self-driving car , socially-aware robots , anticipating pedestrian movement at traffic intersections or a road)
  • Purpose: deciphering human behaviors to predict pedestrian’s future path jointly with future activities.

Problem Statement

  • Humans navigate through public spaces often with specific purposes in mind.

previous work:

  • Person-person models for trajectory prediction:
    • [32, 34] predict a person's path by considering human social interactions and behaviors in crowded scenes.
    • [36] learned human behavior in crowds by imitating a decision-making process
    • Social-LSTM [1] added social pooling to model nearby pedestrian trajectory patterns.
    • Social-GAN [7] added adversarial training on Social-LSTM to improve performance.
    • they simply consider a person as a point; we use geometric relations to explicitly model the person-scene interactions and the person-object relations
  • Person-scene models for trajectory prediction: learning the effect of the physical scene
    • [13] used Inverse Reinforcement Learning to forecast human trajectories
    • Scene-LSTM divided the static scene into a Manhattan Grid and predicts the pedestrian's location using an LSTM
    • CAR-Net proposed an attention network on top of a scene-semantic CNN to predict person trajectories
    • SoPhie combined deep features from a scene semantic segmentation model with a generative adversarial network, using attention to model person trajectories
    • we explicitly pool scene semantic features around each person at each time instant, so the model learns directly from such interactions
  • Person visual features for trajectory prediction: using an individual's visual features instead of considering them as points in the scene.
    • [14] looked at pedestrians' faces to model their awareness and predict whether they will cross the road using a Dynamic Bayesian Network.
    • [33] used person keypoint features with a convolutional neural network to predict the future path.
    • we consider both person behavior and their interactions with the surroundings
  • Activity prediction / early recognition
    • [29] utilized unsupervised learning with LSTM to reconstruct and predict video representations.
  • Multiple cues for tracking / group activity recognition:
    • previous works take into account multiple cues in video for tracking and group activity recognition
    • rich visual features, focal attention, and location prediction bridge the two tasks
  • most existing work [31, 1, 7, 26, 21, 31] oversimplifies a person as a point in space; we encode a person through rich semantic features about visual appearance, body movement, and interaction with the surroundings

Methods

  • Problem Formulation:https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191109103802813.png

  • system overview:

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191109103550662.png

【Question 1】How to model the appearance and body movement of every individual in a scene?

  • utilize a pre-trained object detection model with "RoIAlign [8]" to extract fixed-size CNN features for each person bounding box; for every person, average the features along the spatial dimensions and feed them into an LSTM encoder -> obtain T*d, where d is the hidden size of the LSTM.
  • utilize a person keypoint detection model trained on the MSCOCO dataset [6] to extract person keypoint information; a linear transformation is applied to embed the keypoint coordinates (this linear processing is not fully understood) before feeding them into an LSTM -> obtain T*d, where d is the hidden size of the LSTM https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191109111014991.png

【Question 2】How to model the interaction between a person and their surroundings (person-scene and person-object)?

  • Person-scene: whether the person is near the sidewalk or the grass. Use a pre-trained scene segmentation model [4] to extract pixel-level scene semantic classes (10 classes, e.g., roads, sidewalks...) for each frame. The scene semantic features are integers of size T * h * w. Given a person's xy coordinates, we pool the scene features at the person's current location from the convolution feature map. https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191109111844802.png

  • Person-object: how far away the person is from other persons or objects; models the geometric relation and the object type of all objects/persons in the scene (the effectiveness of this modeling is demonstrated in [9]) https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191109112732633.png

    • geometric relation:

    https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191109112036904.png

    • object type:

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191109112647171.png

【Question 3】How to predict the trajectory? Using effective focal attention [17] (the original model is in [7])

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191109113310088.png

image-20191109113735888

【Question 4】How to predict activities? Introduce an auxiliary task, activity location prediction, in addition to predicting the person's future activity label. https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191109114434310.png

  • activity location prediction with the Manhattan Grid: location classification (predict the correct grid block in which the final location resides) and location regression (predict the deviation of the grid-block center from the final location coordinates); how to achieve accurate localization using multi-scale features in a cost-effective way
  • activity label prediction: https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191109142619975.png

Evaluation

  • Environment:
    • Dataset: ActEV/ViRAT
  • model the intention in terms of a predefined set of 29 activities provided by NIST .

Conclusion

  • propose an end-to-end multi-task learning system utilizing rich visual features about human behavioral information and interaction with the surroundings.
  • the first empirical evidence that joint modeling of paths and activities benefits future path prediction.
    • learning activity together with the path may benefit the future path prediction
    • joint model advances the capability of understanding not only the future path but also the future activity by taking into account the rich semantic context in videos.
    • introduce an auxiliary task for future activity prediction,activity location.
  • propose multi-task learning framework with new techniques to tackle the challenge of joint future path and activity prediction.
  • validate the model on two benchmarks: ETH&UCY , and ActEV/VIRAT.

Notes 去加强了解

  • Effective focal attention was originally proposed to carry out multimodal inference over a sequence of images for visual question answering. The key idea is to project multiple features into a space of correlation, where discriminative features can be captured more easily by the attention mechanism.
  • Attention mechanism ???
  • Paper [37]: what is the decision-making process method?
  • [13]: what is the Inverse Reinforcement Learning method?
  • [33]: person keypoint features to predict trajectory?
  • RoIAlign [8]: learn to use this network
  • pre-trained scene segmentation model [4]: learn about scene segmentation techniques
  • Code (study and use): https://github.com/google/next-prediction

level: CVPR CCF_A author: Alexandre Alahi, Kratarth Goel, Stanford University date: keyword:


Paper: Social LSTM

Research Objective

  • Application Area: socially aware robots [41], intelligent tracking systems [43]
  • Purpose: predict the motion dynamics in crowded scenes.

Proble Statement

previous work:

  • they use hand-crafted functions (manually designed features) to model interactions for specific settings rather than inferring them in a data-driven fashion.
  • they focus on modeling interactions among people in close proximity to each other (to avoid immediate collisions) and don't anticipate interactions that could occur in the more distant future.
  • RNN models for sequence prediction (speech recognition, caption generation, machine translation, image/video classification, human dynamics)
    • LSTM and Gated Recurrent Units[12] most common methods.
    • [20] predict isolated handwriting sequence

Methods

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191112092925006.png

【Question 1】Every person has a different motion pattern: they move with different velocities and accelerations and have different gaits. How to model person-specific motion properties from a limited set of initial observations corresponding to the person?

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191112092949875.png

image-20191112093015584

【Question 2】Every person has a different number of neighbors, and in very dense crowds the number can be prohibitively high.

A compact representation: the "Social" pooling layer, which preserves spatial information through grid-based pooling (a sketch follows the figure below).

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191112094414250.png
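A hedged sketch of grid-based social pooling: for person i, sum the hidden states of neighbors that fall into each cell of a small grid centered on i. Grid size, neighborhood extent, and names are illustrative:

```python
import numpy as np

def social_pooling(pos, h, i, grid=4, radius=2.0):
    """pos: (N, 2) positions, h: (N, D) hidden states -> (grid, grid, D) social tensor for person i."""
    D = h.shape[1]
    H = np.zeros((grid, grid, D))
    cell = 2 * radius / grid
    for j in range(len(pos)):
        if j == i:
            continue
        dx, dy = pos[j] - pos[i]
        gx, gy = int((dx + radius) // cell), int((dy + radius) // cell)
        if 0 <= gx < grid and 0 <= gy < grid:        # neighbor j falls inside person i's grid
            H[gx, gy] += h[j]                         # sum neighbors' hidden states per cell
    return H

H_i = social_pooling(np.random.rand(6, 2) * 4, np.random.rand(6, 16), i=0)
```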

【Question 3】How to estimate the position? (A sampling sketch follows below.)

image-20191112100947577

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191112101003868.png
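The LSTM output at each step parameterizes a bivariate Gaussian over the next position; a minimal sketch of turning the five predicted parameters into a sampled position, using the standard covariance construction from σ and ρ:

```python
import numpy as np

def sample_position(mu_x, mu_y, sigma_x, sigma_y, rho):
    """Sample the next (x, y) from the predicted bivariate Gaussian."""
    cov = [[sigma_x ** 2, rho * sigma_x * sigma_y],
           [rho * sigma_x * sigma_y, sigma_y ** 2]]
    return np.random.multivariate_normal([mu_x, mu_y], cov)

x, y = sample_position(0.5, 1.2, 0.1, 0.15, 0.3)
```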

【Question 4】How to deal with occupancy-map pooling?

The Social-LSTM model can be used to pool any set of features from neighboring trajectories and learn to reposition a trajectory to avoid immediate collision with neighbors. (This part is not fully understood.)

Evaluation

  • Environment: ETH ,UCY
  • https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191112101541086.png

Conclusion

  • introduce the Social pooling layer, which allows the LSTMs of spatially proximal sequences to share their hidden states with each other.
  • analyze the trajectory patterns generated by our model to understand the social constraints learned from the trajectory datasets.
  • predict the trajectories of pedestrians much more accurately than state-of-the-art models on ETH and UCY

Notes 去加强了解

  • Generating sequences with recurrent neural networks

  • LSTM speech generation [21] demo: look for the code on GitHub

  • [32]: learn about Inverse Reinforcement Learning to predict human paths in static scenes.

  • Theano: A cpu and gpu math compiler in python

  • bivariate Gaussian distribution: https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191121101725026.png


level: IEEE Access CCF_B author: 10.25.2019 date: ‘2019-10-25’ keyword:

  • Action recognition,deep learning ,pedestrian detection ,time-to-cross estimation

Paper: Multi-Task Pedestrian

Summary

This paper addresses how to detect pedestrians and recognize their actions (using the JAAD dataset and existing methods), and it predicts the time-to-cross for pedestrians who are crossing the road. RetinaNet is used for detection and an LSTM network for prediction.

Questions:

  1. How are multiple people, and the corresponding actions of each, detected? The cited work does not explain this in detail; need to check.
  2. During prediction the LSTM uses only bounding-box coordinates; each person's walking speed and stride may differ, so how should this be handled?
  3. RF-based skeleton detection, gesture recognition, and per-body-part motion detection already exist; could an LSTM predict the next action? This work predicts pedestrians crossing the road; indoors, which actions would need to be detected or predicted?

Research Objective

  • Application Area:understand the intention of road users involved to ensure their safety and secure the traffic flow.
  • Purpose: estimate TTC.
  • System_Design:
  • https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191106112647294.png
  • https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191106111040275.png
  • https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191106111058015.png
  • https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191106111116174.png
  • https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191106111520160.png

Proble Statement

  • pedestrian detection problem: progress in pedestrian detection is hindered by the difficulty of detecting all(partially)occluded pedestrians and the problem of operating efficiently in severe weather conditions.
  • ADAS need to solve three problems: 1. a detection model for localizing and recognizing the pedestrians among other road users 2. a prediction model to estimate the pedestrian actions over next frames(short,medium,long-time prediction)
  • Data shortcoming: there are no public databases annotated with pedestrian time-to-cross, while there are several interesting large pedestrian detection databases (KITTI, Caltech, among others), and some databases don't provide any pedestrian action labels
  • Estimating the pedestrian intention, and especially the pedestrian actions, is even more challenging because of the ambiguities in pedestrian motion.

previous work:

  • pedestrian movement and pedestrian behaviors [13], [14], interactions between pedestrians [15], [16], pedestrian tracking paths [9], [10], a review of predicting pedestrian behavior [12]; pedestrian intention requires pedestrian-specific dynamic information and the contextual road environment; [17] presents pedestrian action recognition based on AlexNet on the JAAD dataset and uses temporal and spatio-temporal contextual information to increase prediction performance
  • [9] pedestrian position estimation based on the Extended Kalman Filter and an Interacting Multiple Model algorithm using Constant Velocity
  • [18] combination of Gaussian Process Dynamic Models and a Probabilistic Hierarchical Trajectory Machine with a Kalman Filter and Interacting Multiple Model, based on the Daimler data.
  • [10] short-term prediction of pedestrian behaviors using Daimler datasets, predicting the pedestrian trajectory and its final destination using a CNN, an LSTM, and path planning.
  • [13] mixture of CNN-based pedestrian detection, tracking and pose estimation to predict pedestrian crossing actions on the JAAD dataset
  • Summary: previous work only discriminates the pedestrian from other road users and estimates the pedestrian's action or final destination over the next frames (short, medium and long term); time-to-cross estimation is more challenging than predicting the pedestrian action since it requires a fine spatio-temporal analysis of the pedestrian motion and the whole scene

Methods [2],[19]

【Question 0】No public databases are annotated with pedestrian time-to-cross, and existing databases don't provide pedestrian action labels?

We select some cues from the JAAD [1] public dataset to solve this issue and then make our own pedestrian TTC annotation for all videos. (What exactly do these cues refer to? JAAD already includes pedestrian bounding boxes for pedestrian detection and pedestrian attributes.)

【Question 1】How to detect pedestrians?

A generic object detector based on the public RetinaNet [2] is applied; the authors use the ResNet50 [19] CNN architecture for the classification task with the public open-source Keras implementation described in [2]. The whole training process is based on the JAAD dataset, which provides annotations of pedestrians with behavioral tags and pedestrians without behavioral tags.

image-20191106112333433

【Question 2】How to split the pedestrian Joint Attention for Autonomous Driving data into four classes? (previous work)

【Question 3】How to estimate the time to cross?

image-20191106112423122

Evaluation

  • Environment: dataset: JAAD dataset[17] provides pedestrian bounding boxes for pedestrian detection,pedestrian attributes for estimating the pedestrian behavior and traffic scene elements.

Conclusion

  • Train on all pedestrian bounding box samples with RetinaNet for pedestrian detection.
  • Split the pedestrian Joint Attention for Autonomous Driving (JAAD) dataset into four classes for pedestrian action: pedestrian is preparing to cross the street, pedestrian is crossing the street, pedestrian is about to cross the street, and pedestrian intention is ambiguous.
  • Train an LSTM model using only bounding-box (BB) coordinates in order to estimate the time to cross of each pedestrian (see the sketch below).
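
As a rough illustration of the last point, a minimal PyTorch sketch of an LSTM that regresses time-to-cross from a sequence of bounding-box coordinates; the sequence length, hidden size, and the (x1, y1, x2, y2) input format are assumptions, not the paper's exact configuration.

```python
# Hedged sketch: LSTM regressing time-to-cross (TTC) from bounding-box sequences.
# Input format (x1, y1, x2, y2) per frame; hidden size and sequence length are illustrative.
import torch
import torch.nn as nn

class BBoxTTCNet(nn.Module):
    def __init__(self, input_size=4, hidden_size=64):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)   # scalar TTC estimate

    def forward(self, bbox_seq):                # bbox_seq: (batch, frames, 4)
        _, (h_n, _) = self.lstm(bbox_seq)
        return self.head(h_n[-1]).squeeze(-1)   # (batch,)

model = BBoxTTCNet()
dummy_seq = torch.randn(8, 30, 4)               # 8 pedestrians, 30 frames of BB coordinates
print(model(dummy_seq).shape)                   # torch.Size([8])
```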

Notes (topics to study further)

  • Understand the format of the JAAD dataset
  • Study and run the RetinaNet network
  • Learn how to use the AlexNet network
  • Understand the LSTM network
  • Run and use an LSTM network
  • The methods in papers [9], [18], [10]
  • Study and use the SSD network
  • Study and use the Faster R-CNN network
  • Study and use the YOLOv3 network

4. Perception Systems

level: Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. author: Karan Ahuja, Carnegie Mellon University
date: 2019.9 keyword:

  • Human-centered computing, interactive systems and tools, classroom sensing, computer vision, speech

Paper: EduSense

Problem Statement

  • needing an expert to instruct is expensive
  • lack of sufficient feedback opportunities on pedagogical skill

previous work:

  • Instrumented Classrooms
    • instrumented with pressure sensors to characterize varying levels of interest and engagement, such as slumped back vs. sitting upright.
    • Affectiva's wrist-worn Q sensor [62], which senses the wearer's skin conductance, temperature, and motion to infer engagement level.
    • EngageMeter [32] used electroencephalography headsets to detect shifts in student engagement, alertness, and workload.
  • Non-Invasive Class Sensing
    • [19] an omnidirectional room microphone and a head-mounted teacher microphone to automatically segment teacher and student speech events and intervals of silence.
    • Oral presentation practice systems AwareMe [11], Presentation Sensei [46], and RoboCOP [75] compute speech quality metrics (pitch variety, pauses, fillers, speaking rate).
    • Equally versatile camera-based approaches: detecting hand raises using skin tone and edge detection.
    • Robust face detection to find and count students, estimate their head orientation (coarsely signaling their area of focus), and extract facial landmarks to analyze engagement, frustration, and off-task behavior.

Methods

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191112104005467.png

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191112114633868.png

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191112114906902.png

【Problem 1】Featurization Modules

  • Sit & Stand detection: using body keypoints (hips, knees, feet) and the ratios of distances between the chest and foot, and the chest and knee, for both legs.

  • Hand Raise detection: using neck, chest, shoulder, elbow, and wrist keypoints; compute the direction unit vectors between all pairs of these points, as well as the normalized distances between all pairs (see the featurization sketch after this list).

  • Upper Body detection: utilize the same eight upper-body keypoints to predict arms at rest, arms closed, and hands on face. https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191112115847153.png

  • Smile Detection: ten mouth landmarks on the outer lip and ten landmarks on the inner lip; compute direction unit vectors from the left lip corner to all other points, and use an SVM for binary classification.

  • Mouth Open Detection: estimate whether a mouth is open, to produce a talking confidence; use a binary SVM and two highly descriptive features adapted from [71] (which predicts eyes open & closed): the height of the mouth to the left and right of center, divided by the width of the mouth.

  • Head Orientation & Class Gaze: using a perspective-n-point algorithm [50] in combination with anthropometric face data [53], produces a coarse 3D orientation of the head for each body.

    https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191112121139650.png

  • Body Position & Classroom Topology: perspective-n-point produces the orientation for each body found in a scene, and its 3D position in real-world coordinates is estimated, to reveal the classroom topology and help illuminate spatial patterns in the class.

  • Synthetic Accelerometer: use the 3D head position produced during scene parsing and calculate a delta X/Y/Z.

  • Student & Instructor Speech: the RMS of the student-facing camera's microphone, the RMS of the instructor-facing camera's microphone, and the ratio between the two values; a random forest classifier predicts whether the current speech comes from the instructor or the students.

  • Speech Act delimiting:
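
A small sketch of the kind of geometric featurization used above: pairwise direction unit vectors and normalized distances between 2D keypoints that could feed a hand-raise or upper-body classifier. The keypoint names and the neck-chest normalizer are illustrative assumptions, not EduSense's exact recipe.

```python
# Hedged sketch: pairwise geometric features from upper-body keypoints
# (direction unit vectors + normalized distances) for an SVM / random forest classifier.
import itertools
import numpy as np

def pairwise_features(keypoints):
    """keypoints: dict of name -> np.array([x, y]) for neck, chest, shoulders, elbows, wrists."""
    names = sorted(keypoints)
    norm = np.linalg.norm(keypoints["neck"] - keypoints["chest"]) + 1e-6  # assumed normalizer
    feats = []
    for a, b in itertools.combinations(names, 2):
        delta = keypoints[b] - keypoints[a]
        dist = np.linalg.norm(delta)
        feats.extend(delta / (dist + 1e-6))   # direction unit vector between the pair
        feats.append(dist / norm)             # normalized distance between the pair
    return np.array(feats)

kp = {n: np.random.rand(2) * 100 for n in
      ["neck", "chest", "l_shoulder", "r_shoulder", "l_elbow", "r_elbow", "l_wrist", "r_wrist"]}
print(pairwise_features(kp).shape)            # feature vector length for the classifier
```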

Evaluation

  • Environment:
  • https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191112121150907.png

Conclusion

  • a comprehensive sensing system that produces a plethora of theoretically-motivated visual and audio features correlated with effective instruction.
  • the first to unify them into a cohesive, real-time, in-the-wild evaluated, and practically deployable system.

Notes

  • Classroom Discourse Analyzer [15]: look into this system
  • [19]: read this paper
  • Look into the "equally versatile" camera-based sensing approaches mentioned above
  • Learn about the CERT technique
  • Learn about FFmpeg
  • What is the NVIDIA Visual Profiler?
  • Learn to use OpenPose and the dlib 68-point face landmarks [44]
  • How does the adaptive background noise filter remove background noise?
  • Open-source system: http://www.EduSense.io

5. Skeleton Extraction

5.1. 40 Open-Source Skeleton Extraction Projects

level: CVPR CCF_A author: Zhe Cao date: 2019-5-30 keyword:

  • 2D human pose estimation, 2D foot keypoint estimation, real-time, multi-person, part affinity fields

Paper: OpenPose

Summary

  • prove that PAF refinement is critical and sufficient for high accuracy, removing the body part confidence map refinement while increasing the network depth.
  • use a combined body and foot keypoint detector.
  • the OpenPose library

Research Objective

  • Application Area: keypoint detection, e.g. body skeleton, hand skeleton, and face skeleton; the detected body part locations can be further exploited for downstream tasks such as prediction.
  • Purpose: use part affinity fields to compute multi-person 2D pose in real time.

Problem Statement

  • each image may contain an unknown number of people that can appear at any position or scale
  • interactions between people induce complex spatial interference, due to contact, occlusion, or limb articulation, making the association of parts difficult
  • runtime complexity tends to grow with the number of people

previous work:

  • Single Person Pose Estimation: perform inference over a combination of local observations on body parts and their spatial dependencies. The spatial model for articulated pose is either based on tree-structured graphical models, which parametrically encode the spatial relationship between adjacent parts following a kinematic chain, or on non-tree models that augment the tree structure with additional edges to capture occlusion, symmetry, and long-range relationships.

    • [34] used a multi-stage architecture based on a sequential prediction framework, incorporating global context to refine part confidence maps and preserving multi-modal uncertainty from previous iterations.
    • all these methods assume a single person, where the location and scale of the person of interest are given.
  • Multi-person Pose Estimation: top-down strategies first detect people and then estimate the pose of each person independently on the detected region; they suffer from early commitment to person detection and fail to capture the spatial dependencies across different people. Bottom-up approaches jointly label part detection candidates and associate them with individual people, with pairwise scores regressed from the spatial offsets of detections.

    • [47] further simplified their body-part relationship graph for faster inference in a single-frame model and formulated articulated human tracking as spatio-temporal grouping of part proposals.
    • [49] detect individual keypoints and predict their relative displacements, using a greedy decoding method.

    https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191111163031878.png

Methods

system overview: given a color image of size w×h, produce the 2D positions of anatomical keypoints for each person in the image.

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191111163106168.png

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191111182334970.png

【Question 1】How to detect limbs and body parts?

Define the body part confidence maps S and the vector fields of PAFs L:
$$ S=(S_1,S_2,\dots,S_J),\quad S_j\in\mathbb{R}^{w\times h},\ j\in\{1,\dots,J\} $$
$$ L=(L_1,L_2,\dots,L_C),\quad L_c\in\mathbb{R}^{w\times h\times 2},\ c\in\{1,\dots,C\}\ \text{(one vector field per limb)} $$

The image is first analyzed by a CNN (the first layers of VGG-19, fine-tuned), generating a set of feature maps $F$. The first $T_P$ stages predict and refine the part affinity fields, and the following $T_C$ stages refine the body part confidence maps:
$$ L^1=\phi^1(F),\qquad L^t=\phi^t(F,L^{t-1}),\quad \forall\, 2\le t\le T_P $$
$$ S^{T_P}=\rho^{T_P}(F,L^{T_P}),\qquad S^t=\rho^t(F,L^{T_P},S^{t-1}),\quad \forall\, T_P< t\le T_P+T_C $$

Define the loss functions, where $W$ is a binary mask with $W(p)=0$ when the annotation is missing at pixel $p$:
$$ f_L^{t}=\sum_{c=1}^{C}\sum_{p}W(p)\,\|L_c^{t}(p)-L_c^{*}(p)\|_2^2,\qquad f_S^{t}=\sum_{j=1}^{J}\sum_{p}W(p)\,\|S_j^{t}(p)-S_j^{*}(p)\|_2^2 $$
Intermediate supervision at every stage addresses the vanishing gradient problem; the overall objective is
$$ f=\sum_{t=1}^{T_P}f_L^{t}+\sum_{t=T_P+1}^{T_P+T_C}f_S^{t} $$
Each confidence map encodes the belief that a particular body part is located at any given pixel; its maxima give the part candidates.

Obtain body part candidates: non-maximum suppression on the confidence maps.

Ground-truth confidence maps: $S_{j,k}^{*}$ is the individual map for person $k$, where $x_{j,k}\in\mathbb{R}^2$ is the ground-truth position of body part $j$ for person $k$.

To compute the confidence for a body part, the value at location $p\in\mathbb{R}^2$ is $S_{j,k}^{*}(p)=\exp\!\left(-\frac{\|p-x_{j,k}\|_2^2}{\sigma^2}\right)$, and the individual maps are aggregated as $S_j^{*}(p)=\max_k S_{j,k}^{*}(p)$ (a max rather than an average, so nearby peaks stay distinct).
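
A minimal sketch of generating the ground-truth confidence map above: one Gaussian peak per person at the annotated part location, aggregated with a max. Image size and sigma are arbitrary here.

```python
# Hedged sketch: ground-truth confidence map S*_j(p) = max_k exp(-||p - x_{j,k}||^2 / sigma^2).
import numpy as np

def part_confidence_map(part_locations, height, width, sigma=7.0):
    """part_locations: list of (x, y) ground-truth positions of part j, one per person k."""
    ys, xs = np.mgrid[0:height, 0:width]
    heatmap = np.zeros((height, width), dtype=np.float32)
    for x, y in part_locations:
        gaussian = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / sigma ** 2)
        heatmap = np.maximum(heatmap, gaussian)  # max, not sum, keeps nearby peaks distinct
    return heatmap

S_j = part_confidence_map([(40, 60), (52, 66)], height=120, width=160)
print(S_j.shape, S_j.max())
```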

【Question 2】Given a set of detected body parts, how do we assemble them to form the full-body poses of an unknown number of people? PAFs encode both the location and orientation of limbs and do not reduce the region of support of a limb to a single point.

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191129113237565.png

$x_{j_1,k}, x_{j_2,k}$: the ground-truth positions of body parts $j_1$ and $j_2$ of limb $c$ for person $k$.
$$ L_{c,k}^{*}(p)=\begin{cases} v & \text{if } p \text{ lies on limb } (c,k)\\ 0 & \text{otherwise}\end{cases},\qquad v=\frac{x_{j_2,k}-x_{j_1,k}}{\|x_{j_2,k}-x_{j_1,k}\|_2} $$
A point $p$ lies on the limb if $0\le v\cdot(p-x_{j_1,k})\le l_{c,k}$ and $|v_{\perp}\cdot(p-x_{j_1,k})|\le \sigma_l$, where $l_{c,k}$ is the limb length and $\sigma_l$ the limb width. The ground-truth part affinity field averages over people: $L_c^{*}(p)=\frac{1}{n_c(p)}\sum_k L_{c,k}^{*}(p)$, where $n_c(p)$ is the number of non-zero vectors at point $p$ across all $k$ people.

e.g., for two candidate part locations $d_{j_1}, d_{j_2}$, we sample the predicted part affinity field $L_c$ along the line segment connecting them to measure the confidence in their association:
$$ E=\int_{u=0}^{u=1} L_c(p(u))\cdot \frac{d_{j_2}-d_{j_1}}{\|d_{j_2}-d_{j_1}\|_2}\,du,\qquad p(u)=(1-u)\,d_{j_1}+u\,d_{j_2} $$
where $p(u)$ interpolates between the two body part candidates $d_{j_1}$ and $d_{j_2}$.
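
A short sketch approximating the association score E above by sampling the PAF at a few points along the segment between two candidates; the number of samples is an implementation choice, not a value from the paper.

```python
# Hedged sketch: approximate E = integral of L_c(p(u)) . v du by sampling the PAF along the segment.
import numpy as np

def paf_association_score(paf, d1, d2, num_samples=10):
    """paf: (H, W, 2) predicted part affinity field for limb c; d1, d2: candidate (x, y) points."""
    d1, d2 = np.asarray(d1, float), np.asarray(d2, float)
    v = d2 - d1
    v = v / (np.linalg.norm(v) + 1e-6)             # unit vector from d1 to d2
    score = 0.0
    for u in np.linspace(0.0, 1.0, num_samples):
        x, y = d1 + u * (d2 - d1)                  # p(u) interpolates between the candidates
        score += np.dot(paf[int(round(y)), int(round(x))], v)
    return score / num_samples

paf = np.random.rand(120, 160, 2)                  # dummy field for illustration
print(paf_association_score(paf, (10, 20), (40, 80)))
```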

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191125091725286-1577068664572.png

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191129115349521.png

A set of body part detection candidates $\mathcal{D}_J$ for multiple people: $\mathcal{D}_J=\{d_j^m : j\in\{1,\dots,J\},\ m\in\{1,\dots,N_j\}\}$, where $N_j$ is the number of candidates of part $j$ and $d_j^m\in\mathbb{R}^2$ is the location of the $m$-th detection candidate of body part $j$.

Define $z_{j_1 j_2}^{mn}\in\{0,1\}$ to indicate whether two detection candidates $d_{j_1}^m$ and $d_{j_2}^n$ are connected; the goal is to find the optimal assignment over the set of all possible connections,
$$ \mathcal{Z}=\{z_{j_1 j_2}^{mn} : j_1,j_2\in\{1,\dots,J\},\ m\in\{1,\dots,N_{j_1}\},\ n\in\{1,\dots,N_{j_2}\}\} $$
https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191129121106421.png

  1. choose the minimal number of edges to obtain a spanning tree skeleton of human pose rather than using the complete graph
  2. decompose the matching problem into a set of bipartite matching subproblems and determine the matching in adjacent tree nodes independently.
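
Each bipartite subproblem (one per limb of the tree skeleton) can be solved with the Hungarian algorithm, as noted below; a minimal sketch with scipy, where the score matrix holds the PAF association scores E between candidates of the two parts and the acceptance threshold is illustrative.

```python
# Hedged sketch: solve one per-limb bipartite matching with the Hungarian algorithm.
# scores[m, n] is the PAF association score E between candidate m of part j1 and candidate n of part j2.
import numpy as np
from scipy.optimize import linear_sum_assignment

scores = np.array([[0.9, 0.1, 0.2],
                   [0.2, 0.8, 0.1]])               # 2 candidates of j1 vs. 3 candidates of j2
rows, cols = linear_sum_assignment(-scores)         # maximize total score (minimize its negative)
for m, n in zip(rows, cols):
    if scores[m, n] > 0.5:                          # illustrative acceptance threshold
        print(f"connect d_j1^{m} with d_j2^{n} (score={scores[m, n]:.2f})")
```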

Evaluation

  • Environment:
    • Dataset: the MPII human multi-person dataset [66], consisting of 3844 training and 1758 testing groups of multiple interacting individuals in highly articulated poses with 14 body parts; the COCO keypoint challenge dataset requires simultaneously detecting people and localizing 17 keypoints.

Conclusion

  • present an explicit nonparametric representation of keypoint association that encodes both the position and orientation of human limbs
  • design an architecture that jointly learns part detection and association
  • demonstrate that a greedy parsing algorithm is sufficient to produce high-quality parses of body poses and preserves efficiency regardless of the number of people
  • show that PAF refinement alone matters far more than combined PAF and body part location refinement
  • combining body and foot estimation into a single model boosts the accuracy of each component individually and reduces the inference time compared with running them sequentially
  • open-sourced the OpenPose system, which has been included in the OpenCV library

Notes (topics to study further; papers to record)

  • Learn how to use the OpenPose system (about halfway done; the model files still need to be downloaded)
  • [20] Convolutional pose machines, DenseNet [52] Densely connected convolutional networks, [3] Realtime multi-person 2d pose estimation using part affinity fields; the networks are roughly understood, but some code should be run to check the results
  • Study and use Mask R-CNN [5]; need to practice with code
  • Study and use AlphaPose [6]
  • ResNet [46]: study and use; need to practice with code
  • [2] pairwise representations: find out what this is
  • Papers [34], [47], [49], [50] need to be studied
  • [49] detect individual keypoints and predict their relative displacements, allowing a greedy decoding process to group keypoints.
  • Study the VGG-19 [53] model and learn how to fine-tune it
  • The Hungarian algorithm is used to solve the bipartite matching problem

level: CCF_A CVPR

author: Chao Li, Qiaoyong Zhong, Di Xie, Shiliang Pu, Hikvision Research Institute

date: 2018-4-17

keyword:

  • skeleton based action recognition

Paper: Co-occurrence Feature

  1. point-level information of each joint is encoded independently, and then assembled into a semantic representation in both the spatial and temporal domains
  2. independent point-level feature learning and cross-joint co-occurrence feature learning

Research Objective

  • Application Area: intelligent surveillance system, human-computer interaction, game-control and robotics.
    • skeleton provides good representation for describing human actions
    • skeleton data are inherently robust against background noise and provide abstract information and high-level features of human action
    • compared with RGB data, skeleton data are extremely small in size
  • Purpose:

Problem Statement

previous work:

  • design and extract co-occurrence features from skeleton sequences
    • pairwise relative position of each joint
    • spatial orientation of pairwise joints
    • statistics-based features like Cov3DJ and HOJ3D
  • RNNs with LSTM are prevalently used to model the time series of skeletons
  • CNN models to learn spatial-temporal features from skeletons
    • cast the frame, joint, and coordinate dimensions of skeleton sequence into width, height, and channel of an image respectively [Du et al., 2016]
    • 3D coordinates are separated into three gray-scale images [Ke et al.,2017]
    • a new skeleton transformer module to incorporate skeleton motion features [Li et al., 2017b]
    • Shortcoming: the co-occurrence features are aggregated locally, which may not capture the long-range joint interactions involved in actions like wearing …

Methods

  • system overview:

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20200229112040352.png

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20200229115749562.png

[Input Define]

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20200229114736296.png

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20200229114716246.png

[Multiple Persons]

  • early fusion: all joints from multiple persons are stacked as input to the network, with zero padding if there are fewer persons than the pre-defined maximal number
  • late fusion: the inputs of multiple persons go through the same subnetwork and their conv6 feature maps are merged with either concatenation along channels or an element-wise maximum/mean operation
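
A tiny sketch contrasting the two fusion options for multi-person skeleton input: stacking persons at the input (early fusion) versus merging per-person feature maps with element-wise max or channel concatenation (late fusion). Tensor shapes are illustrative, not the paper's configuration.

```python
# Hedged sketch: early vs. late fusion of multi-person skeleton features (shapes illustrative).
import torch

# Early fusion: stack the joints of all persons (zero-padded to a fixed count) into one input tensor.
person1 = torch.randn(1, 3, 32, 25)        # (batch, xyz-channels, frames, joints)
person2 = torch.zeros(1, 3, 32, 25)        # zero padding when fewer persons are present
early_input = torch.cat([person1, person2], dim=3)   # joints of both persons side by side

# Late fusion: run each person through the same subnetwork, then merge the feature maps.
feat1 = torch.randn(1, 256, 8, 8)          # hypothetical conv6 feature map of person 1
feat2 = torch.randn(1, 256, 8, 8)          # hypothetical conv6 feature map of person 2
late_max = torch.max(feat1, feat2)         # element-wise maximum
late_concat = torch.cat([feat1, feat2], dim=1)  # concatenation along channels
print(early_input.shape, late_max.shape, late_concat.shape)
```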

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20200229114756753.png

[Loss Function Define]

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20200229115411034.png

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20200229115422855.png

Evaluation

  • Environment:
    • Dataset:
  • https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20200229115851023.png
  • https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20200229115907932.png

Conclusion

  • CNN model for learning global co-occurrences from skeleton data
  • end-to-end hierarchical feature learning network, where features are aggregated gradually from point level features to global co-occurrence features
  • exploit multi-person feature fusion strategies

Notes (topics to study further)

  • recognition and detection benchmarks: NTU RGB+D, SBU Kinect Interaction, and PKU-MMD
  • Learning actionlet ensemble for 3d human actionrecognition 2014
  • Essential body-joint and atomic action detection for human activity recognition using longest common subsequence algorithm 2012
  • A New Representation of Skeleton Sequences for 3D Action Recognition, 2017
  • Co-occurrence feature learning for skeleton based action recognition using regularized deep lstm networks 2016
  • PKU-MMD: A large scale benchmark for continuous multi-modal human action understanding. 2017
  • Two-stream convolutional networks for action recognition in videos 2014
  • window regression Girshick et al., 2014
  • Cascade region proposal and global context for deep object detection 2016
  • An end-to-end spatiotemporal attention model for human action recognition from skeleton data, 2017 AAAI

level: author: date: 2016 keyword:

  • point detection; heatmap

Paper: Stacked Hourglass

Summary

  1. On MPII there is over a 2% average accuracy improvement across all joints, with as much as a 4-5% improvement on more difficult joints like the knees and ankles.
  2. propose stacked hourglass networks for human pose estimation;

Methods

  • system overview:

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20201019210140895.png

【**Single Hourglass Module**】

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20201019210700629.png

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20201019210520022.png

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20201020134658441.png

#coding=utf-8
import torch
import torch.nn as nn
from torch.nn import Upsample
from torch.autograd import Variable
#https://sourcegraph.com/github.com/raymon-tian/hourglass-facekeypoints-detection/-/blob/models.py
class HourGlass(nn.Module):
    """不改变特征图的高宽"""
    def __init__(self,n=4,f=128):
        """
        :param n: hourglass模块的层级数目
        :param f: hourglass模块中的特征图数量
        :return:
        """
        super(HourGlass,self).__init__()
        self._n = n
        self._f = f
        self._init_layers(self._n,self._f)

    def _init_layers(self,n,f):
        # upper (skip) branch
        setattr(self,'res'+str(n)+'_1',Residual(f,f))
        # lower (downsampled) branch
        setattr(self,'pool'+str(n)+'_1',nn.MaxPool2d(2,2))
        setattr(self,'res'+str(n)+'_2',Residual(f,f))
        if n > 1:
            self._init_layers(n-1,f)
        else:
            self.res_center = Residual(f,f)
        setattr(self,'res'+str(n)+'_3',Residual(f,f))
        setattr(self,'upsample'+str(n),Upsample(scale_factor=2))


    def _forward(self,x,n,f):
        # upper (skip) branch
        up1 = x
        up1 = eval('self.res'+str(n)+'_1')(up1)
        # lower (downsampled) branch
        low1 = eval('self.pool'+str(n)+'_1')(x)
        low1 = eval('self.res'+str(n)+'_2')(low1)
        if n > 1:
            low2 = self._forward(low1,n-1,f)
        else:
            low2 = self.res_center(low1)
        low3 = low2
        low3 = eval('self.'+'res'+str(n)+'_3')(low3)
        up2 = eval('self.'+'upsample'+str(n)).forward(low3)
        # print(up1.shape,up2.shape)  # debug: both branches should have identical shapes
        return up1+up2

    def forward(self,x):
        return self._forward(x,self._n,self._f)

class Residual(nn.Module):
    """
    Residual block; does not change the feature map width/height.
    """
    def __init__(self,ins,outs):
        super(Residual,self).__init__()
        # convolution block
        self.convBlock = nn.Sequential(
            nn.BatchNorm2d(ins),
            nn.ReLU(inplace=True),
            nn.Conv2d(ins,outs//2,1),
            nn.BatchNorm2d(outs//2),
            nn.ReLU(inplace=True),
            nn.Conv2d(outs//2,outs//2,3,1,1),
            nn.BatchNorm2d(outs//2),
            nn.ReLU(inplace=True),
            nn.Conv2d(outs//2,outs,1)
        )
        # skip connection (1x1 conv when channel counts differ)
        if ins != outs:
            self.skipConv = nn.Conv2d(ins,outs,1)
        self.ins = ins
        self.outs = outs
    def forward(self,x):
        residual = x
        x = self.convBlock(x)
        if self.ins != self.outs:
            residual = self.skipConv(residual)
        x += residual
        return x

class Lin(nn.Module):
    def __init__(self,numIn=128,numout=4):
        super(Lin,self).__init__()
        self.conv = nn.Conv2d(numIn,numout,1)
        self.bn = nn.BatchNorm2d(numout)
        self.relu = nn.ReLU(inplace=True)
    def forward(self,x):
        return self.relu(self.bn(self.conv(x)))


class KFSGNet(nn.Module):

    def __init__(self):
        super(KFSGNet,self).__init__()
        self.__conv1 = nn.Conv2d(1,64,1)
        self.__relu1 = nn.ReLU(inplace=True)
        self.__conv2 = nn.Conv2d(64,128,1)
        self.__relu2 = nn.ReLU(inplace=True)
        self.__hg = HourGlass()
        self.__lin = Lin()
    def forward(self,x):
        x = self.__relu1(self.__conv1(x))
        x = self.__relu2(self.__conv2(x))
        x = self.__hg(x)
        x = self.__lin(x)
        return x


from torch.utils.data import Dataset,DataLoader
import numpy as np
import torch.optim as optim

class tempDataset(Dataset):
    def __init__(self):
        self.X = np.random.randn(16,1,512, 512)
        self.Y = np.random.randn(16,4,512, 512)
    def __len__(self):
        return len(self.X)
    def __getitem__(self, item):
        # return a single sample here; batching is handled by the DataLoader
        return self.X[item],self.Y[item]

if __name__ == '__main__':
    from torch.nn import MSELoss
    critical = MSELoss()

    dataset = tempDataset()
    dataLoader = DataLoader(dataset=dataset,batch_size=64)
    shg = KFSGNet().cuda()
    optimizer = optim.SGD(shg.parameters(), lr=0.001, momentum=0.9,weight_decay=1e-4)

    for e in range(200):
        for i,(x,y) in enumerate(dataLoader):
            x = Variable(x,requires_grad=True).float().cuda()
            y = Variable(y).float().cuda()
            y_pred = shg.forward(x)
            #print(y_pred.shape,y.shape)
            loss = critical(y_pred, y)  # MSE over the full batch of predicted heatmaps
            #print('loss : {}'.format(loss.data))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if i>2:
                break
        break

level: CVPR author: Shih-En Wei date: keyword:

  • skeleton extract

Paper: CPM

Summary

  1. show a systematic design for how convolutional networks can be incorporated into the pose machine framework for learning image features and image-dependent spatial models for the task of pose estimation.
  2. CPM: consists of a sequence of convolutional networks that repeatedly produce 2D belief maps for the location of each part; at each stage in a CPM, image features and the belief maps produced by the previous stage are used as input
    1. learn feature representations for both image and spatial context directly from data
    2. a differentiable architecture that allows for globally joint training with backpropagation
    3. efficiently handle large training datasets
  3. large receptive fields on the belief maps are crucial for learning long-range spatial relationships and result in improved accuracy.

Research Objective

previous work:

  • classic approach:

    • pictorial structures: spatial correlations between parts of the body are expressed as a tree-structured graphical model with kinematic priors that couple connected limbs.
    • Hierarchical models: represent relationships between parts at different scales and size in a hierarchical tree structure.
    • Non-tree models: incorporate interactions that introduce loops, augmenting the tree structure with additional edges that capture symmetry, occlusion, and long-range relationships; they rely on approximate inference.
    • sequential prediction: learn an implicit model with potentially complex interactions between variables by directly training an inference procedure.

Methods

  • system overview:

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20200309105620041.png

Pose Machines


【Convolutional Pose Machines】

  • Keypoint Localization Using Local Image Evidence:

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20200309105837745.png

  • Sequential Prediction with Learned Spatial Context Features:

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20200309105933784.png

  • Learning in Convolutional Pose Machines

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20200309110012428.png

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20200309110025606.png

Conclusion

  • learning implicit spatial models via a sequential composition of convolutional architectures
  • a systematic approach to designing and training such an architecture to learn both image features and image-dependent spatial models for structured prediction tasks, without the need for any graphical-model-style inference.

Notes 去加强了解

level: CVPR CCF_A author: Charles R. Qi, Stanford University date:
keyword:

  • 3D object detection, Point Cloud

Paper: Frustum PointNets

Summary

  1. 3D sensor data is often in the form of point clouds.

Research Objective

  • Application Area: autonomous driving, augmented reality
  • Purpose: how to effectively localize objects in point clouds of large-scale scenes.

Problem Statement

  • study 3D object detection from RGB-D data in both indoor and outdoor scenes.
  • previous work focuses on images or 3D voxels, often obscuring natural 3D patterns and invariances of the data.

previous work:

  • object detection and instance segmentation based on 2D image.
  • most existing works convert 3D point clouds to images by projection, or to volumetric grids by quantization and then apply convolutional networks.
  • 3D object detection from RGB-D data
    • front view image based methods:
      • take monocular RGB images and shape priors or occlusion patterns to infer 3D bounding boxes
      • represent depth data as 2D maps.
    • Bird’s eye view based methods:
      • projects the LiDAR point cloud to a bird's-eye view and trains an RPN
    • 3D based methods:
      • train 3D object classifiers by SVMs on hand-designed geometry features extracted from point cloud and localize objects using window search,
  • Deep Learning on Point Clouds:
    • convert point clouds to images or volumetric forms before feature learning
    • require quantization of point clouds at a certain voxel resolution

Methods

  • Problem Formulation:

Input: RGB-D data; classify and localize objects in 3D space.

  • system overview:

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20200307115907354.png

【Model 1】Frustum Proposal

  • with known camera projection matrix, a 2D bounding box can be lifted to a frustum(with near and far planes specified by depth sensor range) that defines a 3D search space for the object.
  • using FPN, pre-train the model weights on the ImageNet classification and COCO object detection datasets, and further fine-tune on KITTI 2D object detection to classify and predict amodal 2D boxes.
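
A sketch of the frustum-lifting idea just described: with a known camera projection, keep only the 3D points whose image projection falls inside the 2D detection box. The intrinsics, points, and box values below are placeholders, not KITTI calibration values.

```python
# Hedged sketch: lift a 2D detection box to a frustum point cloud using the camera projection.
import numpy as np

def frustum_points(points_cam, K, box2d):
    """points_cam: (N, 3) points in camera coordinates; box2d: (x1, y1, x2, y2) in pixels."""
    x1, y1, x2, y2 = box2d
    uvw = points_cam @ K.T                      # pinhole projection to the image plane
    u, v = uvw[:, 0] / uvw[:, 2], uvw[:, 1] / uvw[:, 2]
    mask = (u >= x1) & (u <= x2) & (v >= y1) & (v <= y2) & (points_cam[:, 2] > 0)
    return points_cam[mask]                     # points inside the 3D search frustum

K = np.array([[700.0, 0.0, 320.0],
              [0.0, 700.0, 240.0],
              [0.0, 0.0, 1.0]])                 # placeholder intrinsics
points = np.random.uniform([-10, -2, 0.5], [10, 2, 40], size=(5000, 3))
print(frustum_points(points, K, (300, 200, 360, 280)).shape)
```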

【Model 2】3D Instance Segmentation

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20200307120512402.png

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20200307120608061.png

【Model 3】3D Instance Segmentation PointNet

  • Similar to the case of 2D instance segmentation, depending on the position of the frustum, object points in one frustum may become cluttered with or occluded by points in another.
  • Transform the point cloud into local coordinates by subtracting the centroid from the XYZ values, considering that the bounding sphere size of a partial point cloud can be greatly affected by the viewpoint, while the real size of the point cloud helps box size estimation.

【Model 4】Amodal 3D Box Estimation

  • learning-based 3D alignment by T-Net
  • amodal 3D box estimation pointnet

Evaluation

  • Environment:
    • Dataset: KITTI ,

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20200307121349449.png

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20200307121605990.png

Conclusion

  • propose a novel framework Frustum PointNets for RGB-D data based 3D object detection
  • provide extensive quantitative evaluations to validate the design.

Notes (topics to study further)

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20200307122059591.png

level: ICCV CCF_B author: National Institute of Advanced Industrial Science and Technology date: 2017 keyword:

  • Spatio-temporal, action recognition

Paper: 3D-Resnet

Summary

  1. exploring the effectiveness of ResNets with 3D convolutional kernels
  2. learn to use this as a basic video feature extraction method

Research Objective

  • Application Area: surveillance systems, video indexing, and human-computer interaction
  • Purpose: propose a 3D CNNs based on ResNets toward a better action representation

Problem Statement

  • Action Recognition Database:

    • HMDB51 [13]
    • UCF101 [16]
    • Kinetics human action video dataset [12]
  • Residual block

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191223101841019.png
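
A minimal 3D residual block with nn.Conv3d, illustrating how the 2D ResNet block is extended with an extra temporal dimension; channel counts, kernel sizes, and the clip shape are illustrative, not the exact 3D-ResNet configuration.

```python
# Hedged sketch: a basic 3D residual block (Conv3d over time x height x width).
import torch
import torch.nn as nn

class BasicBlock3D(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.conv1 = nn.Conv3d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm3d(channels)
        self.conv2 = nn.Conv3d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm3d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):                      # x: (batch, channels, frames, height, width)
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)              # identity shortcut

clip = torch.randn(2, 64, 16, 56, 56)          # 16-frame clip feature, illustrative shape
print(BasicBlock3D()(clip).shape)
```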

Methods

  • network design:

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191223101025170.png

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191223101347554.png

Notes (topics to study further)

level: CVPR CCF_A author: Dushyant Mehta date: 2017 keyword:

  • 3D human pose estimation

Paper: VNect

Research Objective

  • Application Area: real-time motion-driven 3D game character control, self-immersion in 3D virtual and augmented reality, and human-computer interaction.

    https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191118101846038.png

  • Purpose: stable 3D skeletal motion capture from a single camera in real-time

Problem Statement

previous work:

  • Multi-view: using multi-view setups, markerless motion-capture solutions attain high accuracy; this work combines discriminative pose estimation with kinematic fitting to succeed in the under-constrained monocular setting.
  • Monocular depth-based: RGB-D sensors overcome the forward/backward ambiguities of monocular pose estimation.
  • Monocular RGB: structure-from-motion techniques exploit motion cues in a batch of frames and have also been applied to human motion estimation.
  • Given 2D joint locations, existing approaches use bone length and depth ordering constraints, sparsity assumptions, joint limits, inter-penetration constraints, temporal dependencies, and regression to create 3D pose; a sparse set of 2D locations loses image evidence -> discriminative methods.
  • Previous work only obtains temporally unstable, coarse poses that are not directly usable in applications.

Methods

  • system overview:

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191118103828661.png

【Question 1】How to use a CNN to regress pose from a single RGB image?

extending the 2D heatmap formulation to 3D using three additional location maps (x, y, z) per joint j, capturing the root-relative locations (x, y, z) respectively.
https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191118110049931.png
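
A short sketch of reading the root-relative 3D joint position from such location maps: find the 2D heatmap maximum for joint j, then read x, y, z from the three location maps at that pixel. Array shapes and values are illustrative.

```python
# Hedged sketch: read root-relative 3D joint positions from a heatmap plus three location maps.
import numpy as np

def read_joint_3d(heatmap, loc_x, loc_y, loc_z):
    """heatmap, loc_x, loc_y, loc_z: (H, W) maps for a single joint j."""
    v, u = np.unravel_index(np.argmax(heatmap), heatmap.shape)  # 2D maximum of the heatmap
    return np.array([loc_x[v, u], loc_y[v, u], loc_z[v, u]])    # root-relative (x, y, z)

H, W = 64, 64
heatmap = np.zeros((H, W)); heatmap[20, 30] = 1.0               # toy heatmap with one peak
loc_x, loc_y, loc_z = (np.full((H, W), c) for c in (0.1, -0.4, 0.9))
print(read_joint_3d(heatmap, loc_x, loc_y, loc_z))              # [ 0.1 -0.4  0.9]
```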

https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191121100234719.png

Contribution

  • the first real-time method to capture the full global 3D skeletal pose of a human in a stable, temporally consistent manner using a single RGB camera.
  • a novel fully convolutional pose formulation regresses 2D and 3D joint positions jointly in real time, doesn't require tightly cropped input frames, and forgoes the need to perform expensive bounding box computations.
  • model-based kinematic skeleton fitting against the 2D/3D pose predictions to produce temporally stable joint angles of a metric global 3D skeleton in real time.
  • more applicable to outdoor scenes, community video, and low-quality commodity RGB cameras.

Notes

  • heatmap-based body joint detection formulation: a heatmap is the distribution of the confidence probability of a body part over the image.

    https://lddpicture.oss-cn-beijing.aliyuncs.com/picture/image-20191121094023973.png
