
About: Beyond Equal-Length Snippets: How Long is Sufficient to Recognize an Audio Scene?

An Entity of Type: bibo:AcademicArticle, within Data Space: demo.openlinksw.com, associated with source document(s)

Attributes / Values
type
seeAlso
http://eprints.org/ontology/hasAccepted
http://eprints.org/ontology/hasDocument
dc:hasVersion
Title
  • Beyond Equal-Length Snippets: How Long is Sufficient to Recognize an Audio Scene?
described by
Date
  • 2019-06-08
Creator
status
Publisher
abstract
  • Due to the variability in characteristics of audio scenes, some scenes can naturally be recognized earlier than others. In this work, rather than using equal-length snippets for all scene categories, as is common in the literature, we study to what temporal extent an audio scene can be reliably recognized given state-of-the-art models. Moreover, as model fusion with deep network ensembles is prevalent in audio scene classification, we further study whether, and if so, when model fusion is necessary for this task. To achieve these goals, we employ two single-network systems relying on a convolutional neural network and a recurrent neural network for classification, as well as early fusion and late fusion of these networks. Experimental results on the LITIS-Rouen dataset show that some scenes can be reliably recognized within a few seconds while other scenes require significantly longer durations. In addition, model fusion is shown to be most beneficial when the signal length is short.
Is Part Of
Subject
list of authors
presented at
is topic of
is primary topic of
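The abstract above mentions late fusion of a CNN and an RNN for audio scene classification. As a minimal illustrative sketch (not the paper's actual implementation), late fusion can combine the class posteriors of two independently trained networks, here by simple averaging; the probability values below are made up for demonstration.

```python
import numpy as np

# Hypothetical posterior probabilities over three scene classes from two
# independently trained models (e.g. a CNN and an RNN). Values are
# illustrative only, not taken from the paper.
cnn_probs = np.array([0.6, 0.3, 0.1])
rnn_probs = np.array([0.2, 0.7, 0.1])

# Late fusion: average the per-class posteriors of the two networks,
# then pick the most likely scene class from the fused distribution.
fused = (cnn_probs + rnn_probs) / 2.0
predicted_class = int(np.argmax(fused))
```

Early fusion, by contrast, would merge the networks' internal feature representations before a single shared classifier, rather than combining their final predictions as shown here.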
Faceted Search & Find service v1.17_git151 as of Feb 20 2025


OpenLink Virtuoso version 08.03.3332 as of Mar 17 2025, on Linux (x86_64-generic-linux-glibc25), Single-Server Edition (378 GB total memory, 16 GB memory in use)
Data on this page belongs to its respective rights holders.
Virtuoso Faceted Browser Copyright © 2009-2025 OpenLink Software