Detecting change in stochastic sound sequences

Open Access

Peer-reviewed

  • Benjamin Skerritt-Davis,
  • Mounya Elhilali

Abstract

Our ability to parse our acoustic environment relies on the brain’s capacity to extract statistical regularities from surrounding sounds. Previous work in regularity extraction has predominantly focused on the brain’s sensitivity to predictable patterns in sound sequences. However, natural sound environments are rarely completely predictable, often containing some level of randomness, yet the brain is able to effectively interpret its surroundings by extracting useful information from stochastic sounds. It has been previously shown that the brain is sensitive to the marginal lower-order statistics of sound sequences (i.e., mean and variance). In this work, we investigate the brain’s sensitivity to higher-order statistics describing temporal dependencies between sound events through a series of change detection experiments, where listeners are asked to detect changes in randomness in the pitch of tone sequences. Behavioral data indicate listeners collect statistical estimates to process incoming sounds, and a perceptual model based on Bayesian inference shows a capacity in the brain to track higher-order statistics. Further analysis of individual subjects’ behavior indicates an important role of perceptual constraints in listeners’ ability to track these sensory statistics with high fidelity. In addition, the inference model facilitates analysis of neural electroencephalography (EEG) responses, anchoring the analysis relative to the statistics of each stochastic stimulus. This reveals both a deviance response and a change-related disruption in phase of the stimulus-locked response that follow the higher-order statistics. These results shed light on the brain’s ability to process stochastic sound sequences.

Author summary

To understand our auditory surroundings, the brain extracts invariant representations from sounds over time that are robust to the randomness inherent in real-world sound sources, while staying flexible to adapt to a dynamic environment. The computational mechanisms used to achieve this in auditory perception are not well understood. Typically, this ability is investigated using predictable patterns in a sequence of sounds, asking: “How does the brain detect the pattern embedded in this sequence?”, which does not generalize well to natural listening. Here, we examine processing of stochastic sounds that contain uncertainty in their interpretation, asking: “How does the brain detect the statistical structure instantiated by this sequence?”. We present human experimental evidence employing a perceptual model for predictive processing to show that the brain collects higher-order statistics about the temporal dependencies between sounds. In addition, the model reveals correlates between task performance and individual differences in perception, as well as deviance effects in the neural response that would be otherwise hidden with conventional, stimulus-driven analyses. This model guides our interpretation of both behavioral and neural responses in the presence of stimulus uncertainty, allowing for the study of perception of more natural stimuli in the laboratory.

Citation: Skerritt-Davis B, Elhilali M (2018) Detecting change in stochastic sound sequences. PLoS Comput Biol 14(5): e1006162. https://doi.org/10.1371/journal.pcbi.1006162

Editor: Wolfgang Einhäuser, Technische Universität Chemnitz, GERMANY

Received: February 1, 2018; Accepted: April 30, 2018; Published: May 29, 2018

Copyright: © 2018 Skerritt-Davis, Elhilali. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: Model code is available online at https://engineering.jhu.edu/lcap/index.php?id=software. Experimental data is available at https://engineering.jhu.edu/lcap/index.php?id=research.

Funding: This work was supported by grants from the National Institutes of Health (R01 HL133043), the Office of Naval Research (N00014-16-1-2045, N00014-17-1-2736, N00014-16-1-2879), and a Johns Hopkins University Catalyst Award. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

To understand soundscapes, the brain parses incoming sounds into distinct sources and tracks these sources through time. This process relies on the brain’s ability to sequentially collect information from sounds as they evolve over time, building representations of the underlying sources that are invariant to the randomness present in real-world sounds, while being flexible to adapt to changes in the acoustic scene. Extracting these representations from ongoing sounds is automatic and effortless for the average listener, but the underlying computations in the brain are largely unknown. To better understand how the brain processes real-world sounds, we investigate how the brain builds invariant representations from sounds containing randomness.

Invariant representations of sound sources are referred to in the literature as regularities, where regularity extraction is the brain’s ability to access these representations for use in auditory scene analysis [1, 2]. We differentiate between two types of regularities: deterministic regularities that describe a repeating or predictable pattern, and stochastic regularities that contain some randomness and are not fully predictable. Deterministic regularities can be as simple as a repeating tone or sequence, or they can be quite complex, for example: two interleaved deterministic patterns [3], an abstract pattern within a single acoustic feature (“falling pitch within tone-pairs” [4]) or one spanning multiple features (“the higher the pitch, the louder the intensity” [5]). The signature trait of deterministic regularities is the absence of ambiguity: a new sound can immediately be interpreted as a continuation of or a deviation from the regularity with certainty.

Stochastic regularities, on the other hand, are characterized by the lack of certainty, as their inherent randomness leaves room for multiple possible interpretations of a sequence of sounds. A new sound belongs to a stochastic regularity probabilistically according to how well it fits relative to other possible interpretations. For example, consider a sequence of tones with frequencies drawn from an arbitrary distribution, such as in [6]. Each tone could be drawn from the same distribution as the preceding tones or it could be drawn from a new distribution. Given a new tone, deciding between these two alternatives (i.e., “same or different?”) cannot be done with certainty, but rather proportionally to how likely the new tone is given its preceding context. Implicit in this example is that the brain is able to extract meaningful contextual information from previously heard sounds to characterize the stochastic regularity, and represent this abstracted information for interpreting new sounds.

One possible mechanism for how the brain represents stochastic regularities is through statistical estimates, which entails extracting representative parameters from observed sensory cues [7]. The nature and extent of statistics collected by the brain remains unknown. Previous studies have focused on the marginal statistics of tones within a sequence, showing that the brain is sensitive to changes in mean and variance [8, 9]. We refer to these as lower-order statistics, describing sounds independent of their context. In the present work, we investigate whether the brain collects higher-order statistics about the dependencies between sounds over time; namely, we examine how the brain gathers information about the temporal covariance structure in a stochastic sequence of sounds. We use melody stimuli with pitches based on random fractals, which exhibit long-range dependencies and cannot be described solely by lower-order statistics. We specifically use random fractals because of their ecological relevance: previous work has demonstrated the presence of random fractals in music [10], speech [11], and natural sounds [12] and shown the brain is sensitive to the amount of randomness, or entropy, in random fractal melodies [13, 14].

Change detection experiments are well-suited for investigating regularity extraction, where the task is to detect deviation from an established regularity in a sequence of sounds. A detection can be reported behaviorally or recorded in the brain’s response (e.g., the mismatch negativity, MMN). A correct detection indicates the brain is sensitive to the tested regularity, for a change response is necessarily preceded by knowledge of what is being changed. Change detection experiments in the auditory domain using electroencephalography (EEG) and magnetoencephalography (MEG) have shown the brain is sensitive to a wide range of deterministic regularities [15–17]. Stochastic regularities, however, have mostly been studied using discrimination experiments, where the task is to differentiate between different regularities, with both behavioral [12] and brain imaging results [8, 13, 14, 18, 19] showing the brain is sensitive to various stochastic regularities. Compared to discrimination, the change detection paradigm more closely mirrors how the brain processes sounds in the real world, where boundaries between sound sources are not known a priori, but must be inferred from changes in ongoing sound.

The mechanisms needed for change detection may differ depending on the type of regularity. With deterministic regularities, the brain can explicitly test whether each incoming sound deviates from the extracted pattern or not with near certainty. Deviation from a stochastic regularity, on the other hand, emerges gradually as evidence is accumulated over time, causing a delay in the perceived moment of change proportional to the amount of evidence needed to detect the change. This uncertainty unavoidably introduces variability in perception across trials and across subjects, which is particularly problematic for time-locked analyses such as in EEG, where low SNR necessitates many repetitions and precise temporal alignment across trials and subjects to get meaningful results. To account for this variability and facilitate the study of stochastic regularities in change detection, we need a suitable perceptual model of the mechanisms for extracting and using regularities in a changing scene to guide our analysis.

While there have been several theoretical accounts of regularity extraction in the brain [2, 20–23], there are very few mathematical implementations of these concepts into concrete models for tracking regularities in sound inputs. One popular model is the CHAINS model, which examines pattern discovery and competition between alternate partitions of a sequence into concurrent, interleaved patterns [24]. This model has been very insightful in shedding light on principles of bistable perception in stream segregation; yet, its limitation to deterministic patterns impedes its applicability to stochastic regularities in the signal. Another model, IDyOM, initially formulated for application to music perception, uses information-theoretic principles to model auditory expectation, collecting occurrences of previously seen events to build predictions, similar to the n-grams used in language models for speech recognition or text processing [25]. While the IDyOM model is able to capture the statistical structure of both stochastic and deterministic regularities, it is formulated to operate only on a discrete, unordered, small set of possible events, and therefore does not generalize well to sounds that vary on a continuum like pitch or loudness.

In this work, we employ a Bayesian framework to model the tracking of sensory statistics by the auditory system [26]. One of the advantages of Bayesian theory is that it is agnostic to the form of the underlying distributions, optimally integrating prior beliefs and sensory evidence in the inference process. In particular, this framework makes minimal assumptions about the stationarity of the observed sequence and offers an ideal scheme for tracking statistics and detecting changes in underlying probability distributions. Bayesian frameworks have been widely used in various incarnations to model data ranging from financial markets to human behavior in reading-inference, change detection, and reinforcement learning tasks [26–31]. In the present application, this mathematical platform allows us to directly probe the degree of optimality in the brain processes observed and to test alternative hypotheses about the computations involved.

Here, we adapt this Bayesian framework for perceptual processing to investigate the extent to which auditory statistical information is represented in memory. We introduce perceptual parameters to the model that represent resource limitations (i.e., finite working memory and observation noise) and provide constraints on performance that are valuable to interpret sub-optimal detection performance and variability across listener behaviors. By fitting the model to human behavior from a series of change detection experiments, we can explore questions regarding auditory stochastic regularity extraction: Which statistics are sufficient to explain human behavior? How do the perceptual parameters of the model account for differences in behavior across subjects? Finally, we use the model to guide analysis of EEG data, revealing effects that would be otherwise hidden using conventional EEG analyses.

Results are presented in three parts: the first section presents psychophysics results from a series of change detection experiments, the second section introduces the model and presents results from fitting the model to human behavior, and the third section presents neural results obtained by using the model to guide EEG analysis. We believe this model opens up new avenues into investigating how the brain collects information from stochastic sounds that are more relevant to everyday perception.

Results

Psychophysics

A series of experiments probed listeners' ability to detect changes in fractal melodies. Stimuli were constructed from melodies at four levels of randomness, or entropy, in pitch (both terms are used interchangeably). Melody entropy is parameterized by β, where β = 0 corresponds to the highest entropy (white noise), and entropy decreases as β increases (see Fig 1a for examples of fractal melodies at different levels of β). Lower-order statistics (mean and variance) were normalized across the melody, so in change stimuli only the higher-order statistics change half-way through the melody (see Fig 1b for examples of change stimuli). The task in all experiments was the same: detect a change in the entropy of the melody.
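To make the stimulus construction concrete, the sketch below generates a 1/f^β contour by spectral synthesis with random phases, a standard construction for random fractals. It is offered only as an illustration under that assumption, not necessarily the authors' exact stimulus-generation procedure, and the mapping from contour values to tone frequencies is omitted.

```python
import numpy as np

def fractal_contour(n_tones=60, beta=2.0, seed=None):
    """Sketch: a 1/f^beta random-fractal sequence via spectral synthesis.

    beta = 0 gives white noise (highest entropy); larger beta gives
    smoother, lower-entropy contours.
    """
    rng = np.random.default_rng(seed)
    freqs = np.fft.rfftfreq(n_tones)
    amps = np.zeros_like(freqs)
    amps[1:] = freqs[1:] ** (-beta / 2.0)        # power spectrum ~ 1/f^beta
    phases = rng.uniform(0, 2 * np.pi, len(freqs))
    x = np.fft.irfft(amps * np.exp(1j * phases), n=n_tones)
    # Normalize the lower-order statistics (mean and variance), as in the
    # stimuli, so that only higher-order structure differs across beta.
    return (x - x.mean()) / x.std()
```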

Fig 1. Examples of random fractal melodies.

Schematic spectrograms shown with frequency and time along the vertical and horizontal axes, respectively (see S1–S6 Audio for accompanying audio). a) Melodies at four levels of entropy, parameterized by β. Higher β corresponds with lower entropy, and vice versa. b) Change stimuli for each change direction; INCR and DECR stimuli always end and begin, respectively, with the highest level of entropy (β = 0, i.e., white noise).

https://doi.org/10.1371/journal.pcbi.1006162.g001

Experiment 1.

We tested how well listeners could detect changes in the entropy of tone sequences and whether the direction of change affected detection performance; see Fig 1b for example stimuli. Listeners (N = 10) heard stimuli with three degrees of change in entropy (between β = 0 and β = 1.5, 2, 2.5) in both directions (INCR and DECR), with control stimuli containing no change (with β = 0, 1.5, 2, 2.5). Each melody trial contained 60 tones presented isochronously over 10.5 seconds (175 ms inter-onset interval); there were 150 trials in total, with 15 trials per condition. After each melody trial, listeners responded whether they heard a change and received immediate feedback.

Detection performance as measured by d′ is shown in Fig 2a; d′ comprises both hits and false-alarms (FAs), with higher d′ corresponding to better detection performance and d′ = 0 corresponding to chance performance. Repeated-measures ANOVAs were used in all analyses to account for between-subject variability. An ANOVA with 2 within-subjects factors (3 change degree x 2 direction) showed a strong effect of degree (F(2, 18) = 31.5, p < 0.0001), no significant effect of direction, and a significant interaction (F(2, 18) = 9.4, p < 0.01). We investigated this interaction further by applying ANOVAs separately to hit- and FA-rates (see S1 Fig). The hit-ANOVA showed a strong effect of degree (F(2, 18) = 21.9, p < 0.0001) but no effect of direction or interaction, while the FA-ANOVA showed an effect of entropy level (F(3, 27) = 4.7, p < 0.01), with FAs increasing with entropy (Note the increase in degrees-of-freedom is due to the 4 levels of β for control stimuli). The significant interaction between degree and direction seen in d′ above is therefore only due to the effect of entropy on FAs: all DECR stimuli begin with the same high level of entropy (β = 0), thus increasing FAs and decreasing d′ for DECR compared to INCR stimuli.
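For reference, d′ is computed from the hit and FA rates via the inverse normal CDF. A minimal sketch follows; the clipping of extreme rates is a common convention, and the paper does not state which correction, if any, was used.

```python
from scipy.stats import norm

def d_prime(hit_rate, fa_rate, n_trials=15):
    """d' = z(hit) - z(FA); d' = 0 is chance, higher is better detection.

    Rates of exactly 0 or 1 are nudged by half a trial so the
    z-transform stays finite (one common correction among several).
    """
    def clip(p):
        return min(max(p, 0.5 / n_trials), 1 - 0.5 / n_trials)
    return norm.ppf(clip(hit_rate)) - norm.ppf(clip(fa_rate))

print(d_prime(0.8, 0.2))  # ~1.68
print(d_prime(0.5, 0.5))  # 0.0 (chance)
```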

Fig 2. Psychophysics results from Experiments 1 and 2.

Average change detection performance (d′) across subjects is shown by stimulus condition. Error bars indicate 95% bootstrap confidence interval across subjects. a) In Experiment 1 (N = 10), melody entropy changed with different degrees (Δβ, abscissa) and in both INCR and DECR direction (color). Detection performance increased with Δβ but did not differ by direction, although there was a weak interaction between Δβ and direction due to FAs only (see S1 Fig). b) In Experiment 2 (N = 10), an additional factor of melody length was introduced (color). Detection performance increased with both Δβ and melody length.

https://doi.org/10.1371/journal.pcbi.1006162.g002

It is surprising that there is no effect of change direction on hit-rates. If listeners are relying solely on lower-order statistics, INCR changes should be easier to detect than DECR changes by listening for outliers. We look closely at this effect in a follow-up experiment (Experiment 1b) to contrast response time (RT) to INCR versus DECR changes.

Experiment 1b.

In this experiment, listeners (N = 21) responded as soon as they heard a change during melody presentation; otherwise, the stimuli and procedure were the same as in Experiment 1. To confirm that the difference in task itself had no effect on detection performance, we ran two-sample t-tests of d′ for each condition; these showed no difference across the two experiments (p > 0.05 for all tests, using Bonferroni correction for multiple comparisons). In addition, ANOVAs applied to hit- and FA-rates as in Experiment 1 showed the same significant effects.

A repeated-measures ANOVA applied to the RT data averaged within conditions for change-trials (3 change degree x 2 direction) showed a significant main effect of change degree (F(2, 40) = 14.3, p < 0.0001) but no main effect of direction and no significant interaction. This confirms the result from Experiment 1, with no effect of change direction on detection performance.

Experiment 2.

Next, we tested the effect of sequence length on change detection performance. In addition to the same change degree and direction manipulations from Experiment 1, listeners (N = 10) heard melodies with different lengths (20, 40, and 60 tones), with the change always occurring at the midpoint of the melody. As there was no effect of change direction on performance seen in Experiments 1 and 1b, we pooled results across INCR and DECR trials. As in Experiment 1, listeners responded whether they heard a change after the melody presentation and received immediate feedback.

Detection performance as measured by d′ is shown in Fig 2b. A repeated-measures ANOVA with 2 factors (3 change degree and 3 melody length) showed significant main effects of both change degree (F(2, 18) = 23.9, p < 0.0001) and melody length (F(2, 18) = 17.7, p < 0.0001), with a weak interaction (F(4, 36) = 2.8, p < 0.05). Post-hoc tests indicated the weak interaction was due to chance performance in the most difficult conditions: Δβ = 1.5 with lengths of 20 and 40 tones. In separate ANOVAs for hit- and FA-rates (see S2 Fig), hit-rates showed both main effects of change degree (F(2, 18) = 10.2, p < 0.01) and length (F(2, 18) = 29.6, p < 0.0001) with no significant interaction, while the FA-rates only showed a significant effect of entropy level (F(2, 18) = 14.6, p < 0.001) and no effect of length or interaction.

Model

To model brain processes involved in extracting information from stochastic sequences, we adapted a Bayesian sequential prediction model [26], incorporating perceptually plausible constraints to the model’s resources. Fig 3 shows a schematic of the model and its outputs.

Fig 3. Schematic of perceptual model and model outputs.

a) At time t, the model contains multiple parameter estimates, θ̂, collected over run-lengths from 0 up to the memory constraint m. Each estimate yields a prediction for the next observation, with increased uncertainty due to observation noise n. Upon observing x_{t+1}, the model updates the run-length beliefs using the predictive probability for each hypothesis. Note that the prediction for length m is used to update all beliefs with length greater than or equal to m, thus limiting the number of past observations used in the update. A new belief with length 0 is added with probability π, the change-prior. Finally, parameter estimates are updated with x_{t+1}; these are in turn used to predict the next observation. b) Outputs from the model for an example change stimulus (top, foreground). At each time, the predictive distribution (top, background) combines predictions across run-length hypotheses weighted by their beliefs, thus “integrating out” run-length. Surprisal (middle) measures how well each new observation fits the prediction. The change probability (bottom) is the probability at least one changepoint has occurred, as inferred using the run-length beliefs. The model detects a change if the change probability exceeds the threshold τ. Model parameters (m, n, π, τ) are in red.

https://doi.org/10.1371/journal.pcbi.1006162.g003

The input to the model is a sequence of observations {x_t}; in our case, the observations are the pitches from the melody stimulus. The model sequentially builds a predictive distribution of the next observation at time t + 1 given the previous observations: P(x_{t+1} | x_{1:t}). Observations are assumed to be distributed according to some probability distribution with unknown parameters θ. At unknown changepoint times, the parameters θ change, and all following observations are drawn from this new distribution, independent of observations before the change. Observations between changepoints drawn from the same distribution form a run, and the time between changepoints is referred to as the run-length. If the most recent changepoint (or equivalently, the current run-length) were known, the independence of observations across changepoints could be used to simplify the prediction equation: given the current run-length r_t, only the observations within the current run matter, i.e.,

P(x_{t+1} \mid r_t, x_{1:t}) = P(x_{t+1} \mid r_t, x_{t-r_t+1:t})

Because changepoints must rather be inferred from the observations, the model maintains multiple hypotheses across all possible run-lengths and integrates them to predict the next observation:

P(x_{t+1} \mid x_{1:t}) = \sum_{r_t} P(x_{t+1} \mid r_t, x_{1:t}) \, P(r_t \mid x_{1:t})

In the sum, the prediction given run-length r_t (the first term) is weighted by the model's belief that the current run-length is r_t (the second term). With each incoming observation, these run-length beliefs are incrementally updated and a new belief is added with length zero and weight π, the change-prior, re-weighting the predictions in the sum. The change-prior is a parameter of the model that represents the prior belief that a change will occur at any time before evidence for a change is observed (see S1 Text). Maintaining multiple run-length hypotheses is a key aspect of the model. Rather than making a hard decision about when a changepoint occurs and “resetting” the statistics, the model uses the observations as evidence to weight different interpretations of the sequence.
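The belief update itself has a compact recursive form in the style of the sequential prediction model [26]. The sketch below shows one step, assuming the per-hypothesis predictive probabilities have already been evaluated; function and variable names are ours for illustration, not the released model code.

```python
import numpy as np

def update_run_length_beliefs(beliefs, pred_probs, change_prior=0.01):
    """One step of the run-length belief update.

    beliefs    : beliefs[r] = P(r_t = r | x_{1:t}) for run-lengths r = 0..t
    pred_probs : pred_probs[r] = P(x_{t+1} | r_t = r, x_{1:t})
    Returns updated beliefs over run-lengths 0..t+1.
    """
    # "Growth": each hypothesis survives (no change) and its run-length
    # increments by one, weighted by how well it predicted x_{t+1}.
    growth = beliefs * pred_probs * (1.0 - change_prior)
    # "Change": a new run of length 0 starts, drawing mass from every
    # existing hypothesis with prior weight equal to the change-prior.
    change = np.sum(beliefs * pred_probs * change_prior)
    new_beliefs = np.concatenate(([change], growth))
    return new_beliefs / np.sum(new_beliefs)  # renormalize to a distribution
```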

In the present application of the model, the generating distribution is assumed to be a D-dimensional multivariate Gaussian with unknown mean and covariance structure, where the dimensionality D specifies the amount of temporal dependence in the model. As new observations come in, the model incrementally collects sufficient statistics whose form depends on D (see Methods). Here, we ask whether human behavior from Experiments 1–2 can be captured by a model that collects marginal lower-order statistics (D = 1, i.e., mean and variance) or if higher-order statistics (D = 2, i.e., mean, variance, and covariance) are needed; we refer to these two versions of the model as the LOS model and HOS model, respectively.
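As a sketch of what the two models accumulate, the LOS model (D = 1) treats each tone as a one-dimensional observation, while the HOS model (D = 2) treats overlapping pairs of adjacent tones as two-dimensional observations whose covariance captures the lag-one temporal dependence. The batch version below is for illustration only; the model proper updates these statistics incrementally under a conjugate prior (see Methods).

```python
import numpy as np

def sufficient_stats(pitches, D):
    """Illustrative sufficient statistics for a D-dimensional Gaussian.

    D = 1 (LOS): marginal mean and variance of tone pitches.
    D = 2 (HOS): mean and covariance of adjacent tone pairs, adding
    the covariance between successive pitches to mean and variance.
    """
    x = np.asarray(pitches, dtype=float)
    if D == 1:
        obs = x[:, None]                         # each tone, one observation
    else:
        obs = np.stack([x[:-1], x[1:]], axis=1)  # overlapping adjacent pairs
    n = len(obs)
    mean = obs.mean(axis=0)
    cov = np.cov(obs, rowvar=False).reshape(D, D)
    return n, mean, cov
```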

Perceptual parameters.

As described thus far, the model can maintain an infinite number of hypotheses, predicting the next observation in a Bayes-optimal manner [26]. To introduce more perceptual plausibility, we imposed two constraints on the model. First, a memory parameter (m) represents finite working memory capacity, limiting how many past observations can be used to build predictions and update run-length beliefs. Second, an observation noise parameter (n) sets a lower bound on prediction uncertainty by adding a constant variance to the predictive distributions (see Methods for details).
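One way to realize the two constraints, shown here as a sketch operating on per-run-length predictive parameters (the exact mechanics are in Methods):

```python
import numpy as np

def apply_perceptual_constraints(pred_means, pred_vars, m, n):
    """pred_means/pred_vars: predictive mean/variance per run-length 0..t."""
    means = np.array(pred_means, dtype=float)
    varis = np.array(pred_vars, dtype=float)
    if len(means) > m:
        # Finite memory: hypotheses with run-length >= m reuse the estimate
        # built from the m most recent observations only (cf. Fig 3a).
        means[m:] = means[m]
        varis[m:] = varis[m]
    # Observation noise: a constant variance added to every prediction,
    # setting a lower bound on predictive uncertainty.
    varis = varis + n ** 2
    return means, varis
```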

Model output.

Fig 3b shows the outputs of the model for an example sequence of observations (top-foreground). The predictive distribution (top-background) integrates predictions across all hypotheses and provides a single posterior prediction given previous observations. After a new observation is “observed”, the surprisal (Fig 3b: middle) measures how well the observation was predicted by the model:

S_t = -\log P(x_t = X_t \mid x_{1:t-1})

where S_t is the surprisal for X_t, the new observation at time t, and P(x_t = X_t | x_{1:t−1}) is the predictive probability of observing X_t. Note surprisal is inversely related to the predictive probability—an observation with low probability has high surprisal, and vice versa.

We also derive a change probability—the probability a change has occurred—from the run-length beliefs, P(r_t | x_{1:t}). The probability that a change has not occurred before time t is equal to the belief that the current run-length is equal to the length of the entire observed sequence (i.e., P(r_t = t | x_{1:t})); the probability that at least one change has occurred is then the converse of this, or the sum of beliefs in run-lengths less than the length of the observed sequence:

P(\text{change} \mid x_{1:t}) = 1 - P(r_t = t \mid x_{1:t}) = \sum_{r_t = 0}^{t-1} P(r_t \mid x_{1:t})

An example of how this change probability unfolds is shown in Fig 3b (bottom). Importantly, the model is causal, so the predictive distribution, surprisal, and change probability depend only on the preceding observations and are updated sequentially with each new observation.

Finally, to collect responses from the model that are comparable to those collected from human listeners in Experiments 1–2, we use a simple decision rule. At the end of the melody (i.e., post-trial), the model makes a change decision by comparing the final change probability to a decision threshold:

\text{respond “change”} \iff P(\text{change} \mid x_{1:T}) > \tau

where T is the full melody length and the threshold τ is an additional parameter of the model. We then define the model changepoint as the earliest time at which the change probability exceeds this threshold:

t_c = \min \{\, t : P(\text{change} \mid x_{1:t}) > \tau \,\}
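Putting these outputs together, a minimal sketch of the change probability, post-trial decision, and changepoint extraction (assuming the full, unconstrained belief vector indexed by run-length 0..t):

```python
import numpy as np

def change_probability(beliefs):
    """P(at least one change by time t) = 1 - P(r_t = t | x_{1:t}),
    i.e., the summed belief in all run-lengths shorter than the sequence."""
    return 1.0 - beliefs[-1]

def model_response(change_probs, tau):
    """Post-trial decision and changepoint from the change-probability trace.

    change_probs[t] = P(at least one change | x_{1:t}); tau is the threshold.
    Returns (detected, changepoint); changepoint is None if never exceeded.
    """
    change_probs = np.asarray(change_probs)
    detected = change_probs[-1] > tau            # decision at end of melody
    exceed = np.flatnonzero(change_probs > tau)  # earliest threshold crossing
    changepoint = int(exceed[0]) if exceed.size else None
    return detected, changepoint
```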

Perceptual parameters and model behavior.

We first examined the model detection performance for different sets of model parameters: memory (m), observation noise (n), change-prior (π), and threshold (τ). Using a parameter sweep, we collected model change decision responses to the same stimuli used in Experiments 1–2 and measured model performance for each operating point in the sweep.

Fig 4 shows model performance for Experiment 1. Performance is displayed in Receiver Operating Characteristic space (ROC-space); ROC-space is a method for visualizing the trade-off between Hit- and FA-rates in system performance at multiple operating points (i.e., parameter sets); the upper-left corner is perfect performance (Hit = 1, FA = 0), and the diagonal is chance performance (Hit = FA). Fig 4a displays the coverage of model performance in ROC-space for the LOS and HOS model (in blue and red, respectively); for example, at every red-colored coordinate in ROC-space, there is a set of parameters {m, n, τ, π} in the HOS model with that performance (i.e., Hit- and FA-rate). In this manner, we can compare the range of performance between the two models across the entire parameter sweep. Individual human performance from Experiments 1 and 1b (with the same stimuli, N = 31) and equal-d′ curves are overlaid in the same space for comparison. Results from Experiment 2 were similar.

Fig 4. Range of model behavior in Experiment 1.

Model detection performance measured at different operating points in a parameter sweep. a) Comparison of detection performance for LOS and HOS models displayed in ROC-space across the parameter sweep, with model type denoted by color. Each blue (red) coordinate indicates existence of a parameter set for the LOS (HOS) model yielding that performance. Individual human performance from Experiments 1 and 1b is overlaid, along with equal-d′ curves. b) d′ surface as a function of memory (m) and observation noise (n) parameters for LOS model (top) and HOS model (bottom). π and τ were held constant at 0.01 and 0.5, respectively.

https://doi.org/10.1371/journal.pcbi.1006162.g004

There is a clear contrast in the range of performance in ROC-space between LOS and HOS models, with the HOS model having both wider coverage and higher ceiling performance overall compared to the LOS model. While the LOS model only overlaps with poorer performing subjects (d′ < 1.5), the HOS model overlaps with all human performance points. Additionally, human performance never exceeds the range of the HOS model, indicating that with unconstrained resources (i.e., infinite memory and zero observation noise) the HOS model can act as an “ideal observer”, providing an upper bound for human performance.

Fig 4b shows the d′ surface for the LOS model (top) and HOS model (bottom) as a function of the two perceptual parameters, allowing us to assess which parameters are responsible for the performance variability seen in Fig 4a for each model. With the LOS model, the memory m is largely responsible for performance variability, with only a narrow band around m = 10 where the LOS model performs well above chance (d′ = 0). The HOS model performance, on the other hand, varies jointly with both memory m and observation noise n, with the best performance around {n = 0, m = 30}.

Fitting the model to subject behavior.

We fit the model parameters to each subject from Experiments 1–2. There was very high between-subject variability in performance (e.g., see human performance plotted in ROC-space in Fig 4a), so we examined how the parameters from the fitted model explain this variance. Model performance was measured for each set of parameters in the parameter sweep, and the best set of parameters was selected for each subject using minimum Euclidean distance between model and subject performance. Performance was measured using hit- and FA-rate within each change direction, which provided a more stringent criterion for distinguishing between parameters with equal overall hit- and FA-rates.
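The fitting procedure amounts to a nearest-neighbor search over the parameter sweep; a sketch, with data structures that are ours for illustration:

```python
import numpy as np

def fit_subject(subject_perf, sweep_results):
    """Pick the parameter set whose model performance is closest to a
    subject's, by Euclidean distance over per-direction hit/FA rates.

    subject_perf  : array [hit_INCR, fa_INCR, hit_DECR, fa_DECR]
    sweep_results : list of (params_dict, performance_array) pairs from
                    running the model over the (m, n, pi, tau) grid.
    """
    dists = [np.linalg.norm(np.asarray(perf) - np.asarray(subject_perf))
             for _, perf in sweep_results]
    return sweep_results[int(np.argmin(dists))][0]
```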

Fig 5 shows results from fitting the model to subjects from Experiments 1–2 (N = 41). In Fig 5a, subject d′ is plotted against model d′ for both LOS and HOS models. Using a linear regression with zero-intercept, the HOS model provided a better fit to subject behavior (r2 = 0.85, p < 0.0001) compared to the LOS model (r2 = 0.23, p < 0.0001), which cannot match the better-performing subjects.

Fig 5. Model fit to subject behavior from Experiments 1–2.

a) Subject d′ plotted against fitted model d′ for both LOS and HOS models, denoted by color. Legend shows r2-value from zero-intercept linear regression. b) Fitted perceptual parameters plotted against subject d′ for m (top) and n (bottom), with LOS model on the left and HOS model on the right. r2 and p-values shown for standard linear regression.

https://doi.org/10.1371/journal.pcbi.1006162.g005

Fig 5b shows the fitted perceptual parameters (m and n) plotted against subject d′ for the LOS and HOS models. With the LOS model (left), neither perceptual parameter has a significant linear relationship with subject d′ (m: r2 = 0.009, F(1, 39) = 0.359, p > 0.05; n: r2 = 0.05, F(1, 39) = 2.03, p > 0.05). With the HOS model (right), both memory and observation noise exhibit significant linear relationships with subject d′ (m: r2 = 0.423, F(1, 39) = 28.6, p < 0.0001; n: r2 = 0.352, F(1, 39) = 21.1, p < 0.0001), with higher memory and lower observation noise corresponding with better subject performance. Similar analysis with the other model parameters (π and τ) showed no correlation with subject d′ for either model.

To determine whether both perceptual parameters are needed to fit the HOS model to subject behavior, we tested a reduced model with only one of the perceptual parameters free. The memory-only HOS model, holding observation noise at n = 0, provided a poorer fit compared to the full HOS model shown in Fig 5a (r2 = 0.60, p < 0.001), as did the observation noise-only HOS model, holding memory at the maximum stimulus length m = 60 (r2 = −0.29, p < 0.001). Both memory and observation noise are needed as constraints to the model to fit the full range of human behavior.

Additionally, we compared the model changepoints to the RTs collected in Experiment 1b. Using a linear regression, the HOS model showed a significant linear relationship between model changepoint and subject RTs (r2 = 0.05, F(1, 1512) = 86.9, p < 0.0001), while the LOS model showed no significant relationship. Importantly, the model was fitted using the Yes/No response only and not the RTs themselves.

Electroencephalography

Next, we examined neural underpinnings of higher-order stochastic regularities in the brain. In an experiment structured similarly to Experiments 1 and 2 above, listeners were asked to detect changes in stochastic melodies while EEG was simultaneously recorded from central and frontal locations on the scalp. Stimuli were generated at two levels of entropy (i.e., one change degree) with both INCR and DECR change direction.

Deviance response according to melody entropy.

We first examined effects of melody entropy on ERPs to individual tones. Magnitude of frequency deviation (ΔF) is known to affect ERP morphology [32], so to determine any additional effect of entropy on the ERP, we computed average ERPs for both small and large ΔF (ΔF = 1 and 4 semitones (s.t.) from the previous tone) at each entropy level (LOW and HIGH). Large ΔF tones are rarer in LOW entropy melodies than in HIGH entropy melodies, so we might expect a deviance response that reflects this difference in relative occurrence (as seen in [32]). ΔF = 1 was chosen because it is the most frequent in both entropy levels, and ΔF = 4 was chosen to maximize frequency deviation magnitude while ensuring an adequate number of trials in the LOW entropy condition. We note that this analysis is more closely aligned with lower-order statistics, where deviance is always proportional to ΔF.

Fig 6a (top) shows grand-average ERPs for the four conditions averaged across frontal electrodes, which exhibited the strongest effect (described below). There is a divergence around 150-280 ms post-onset, where the ERP to large ΔF in LOW entropy (purple-dotted line) increases relative to the corresponding ERPs with the same ΔF (gray-dotted line) or the same entropy context (purple-solid line). Fig 6a (bottom) shows the mean amplitude in two time windows: ① 90–150ms and ② 170–260ms, corresponding roughly to N1/MMN and P2 time ranges [32]. A repeated-measures ANOVA with 2 factors (entropy and ΔF) applied to the later window showed a main effect of entropy (F(1, 7) = 7.49, p < 0.05) and a trend due to ΔF (F(1, 7) = 4.57, p < 0.07) with no interaction effect. Considering large-ΔF amplitudes only, a post-hoc paired t-test showed a significant difference between LOW and HIGH entropy contexts (p < 0.05). We performed the same t-test for each electrode; Fig 6a (bottom, far right) shows the p-values by electrode plotted on the scalp, with significant differences at frontal electrodes only. Similar analysis on the earlier window ① showed no effects of frequency deviation or entropy context.

Fig 6. Contextual effects on tone ERP.

a) Grand-average ERPs (top) for large and small ΔF in LOW and HIGH entropy melodies show a positivity for large ΔF in LOW entropy context around 200ms after tone onset. Mean amplitudes are shown for ① and ② time windows (bottom). Scalp map (right) shows frontal distribution of t-test p-values for large ΔF deflection between entropy contexts. b) Using model surprisal, regression-ERP analysis teases out distinct components depending on the set of statistics used in the model: a positivity 150-230ms after onset with LOS surprisal (similar to a) above) and a MMN-like negativity 100-200ms after onset with HOS surprisal. Error bars show 95% bootstrap confidence interval across subjects.

https://doi.org/10.1371/journal.pcbi.1006162.g006

An MMN response is notably absent from the ERPs in Fig 6a, even though large frequency deviations are rare in LOW entropy melodies. Assuming an MMN response in the brain to regularity deviations, this indicates a discrepancy between the “regularity” as defined in this analysis and the regularity collected by the brain: the MMN response is not well-differentiated by frequency deviation alone, and therefore it does not show up in this analysis. To see an MMN response, we need the proper definition of regularity in our analysis.

Deviance response according to model surprisal.

The model outputs surprisal as a continuous measure of regularity violation, where the regularity is defined by the statistics collected by the model. We used a linear regression analysis to find contributors to the tone-elicited ERPs attributable to surprisal from the LOS and HOS models fit to individual subject behavior [33, 34]. The resulting regression ERPs (or rERPs) give a fitted regression to single-trial ERPs at each time-point for each measure of surprisal, and their interpretation is straightforward: the surprisal rERP shows the change in the baseline ERP for a unit increase in surprisal (see Methods).
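A minimal sketch of the rERP computation, as a mass-univariate ordinary-least-squares regression over time points with both surprisal regressors entered jointly (variable names are ours; the cited rERP framework [33, 34] supports richer designs):

```python
import numpy as np

def surprisal_rerps(epochs, los_surprisal, hos_surprisal):
    """Regress single-trial amplitudes on LOS and HOS surprisal.

    epochs : (n_trials, n_times) tone-locked EEG amplitudes
    Returns (baseline_erp, los_rerp, hos_rerp), each of length n_times;
    each rERP is the amplitude change per unit increase in that surprisal.
    """
    X = np.column_stack([np.ones(len(epochs)), los_surprisal, hos_surprisal])
    # One least-squares solve covers every time-point at once.
    coefs, *_ = np.linalg.lstsq(X, epochs, rcond=None)
    return coefs[0], coefs[1], coefs[2]
```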

Fig 6b shows the surprisal rERP for the LOS model (top) and HOS model (bottom). The rERPs show two distinct contributors to the ERP differing both in polarity and latency, with the LOS-rERP containing a positive deflection around 150–250ms post-onset and the HOS-rERP containing a negative deflection around 100–200ms.

To test the significance of these rERP deflections, we applied a linear mixed effects (LME) model to single trial amplitudes in the same two windows as the analysis above: 90-150ms and 170-260ms after tone onset, roughly corresponding to N1/MMN and P2 time windows. LME models are well-suited for testing single-trial effects with unbalanced designs [35], which is the case with surprisal (by definition, there are fewer surprising events than unsurprising events). In the later time window, the LME model showed a significant effect of LOS-surprisal (p < 0.01) on mean amplitude and no effect from HOS-surprisal. The same model applied to mean amplitude in the earlier time window showed the opposite: no significant effect from LOS-surprisal and a significant effect from HOS-surprisal (p < 0.001). This analysis shows deviance responses in the tone-ERP that differ depending on the statistics, or regularities, collected by the model, and an MMN-like response only to tones surprising according to the higher-order statistics of the preceding melody.

Disruption in phase-locking at model changepoint.

We examined neural phase-locking to tone onsets before and after changepoints obtained from the LOS and HOS models. Phase-locking at the tone presentation rate (6.25 Hz) was measured from EEG data averaged across all 32 electrodes using the phase-locking value (PLV). PLV provides a measure of the phase agreement of the stimulus-locked response across trials, independent of power [36]. The difference in PLV before and after the changepoint (ΔPLV) measures the disruption in phase-locking at that time (see Fig 7a for illustration of ΔPLV calculation).
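In code, the PLV and ΔPLV computations are compact; the sketch below assumes the 6.25 Hz phase has already been extracted at each tone position for every trial (e.g., via a Hilbert transform of band-passed EEG), and that the changepoint lies at least one window from the melody edges.

```python
import numpy as np

def plv(phases):
    """Phase-locking value: magnitude of the mean unit phasor across
    trials (1 = perfect phase agreement, 0 = uniform phase)."""
    return np.abs(np.mean(np.exp(1j * phases), axis=0))

def delta_plv(trial_phases, changepoint, win=7):
    """Difference in PLV between 7-tone windows after vs. before the
    changepoint. trial_phases: (n_trials, n_tones) phase at each tone."""
    before = plv(trial_phases[:, changepoint - win:changepoint]).mean()
    after = plv(trial_phases[:, changepoint:changepoint + win]).mean()
    return after - before  # negative values indicate disrupted locking
```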

Fig 7. Phase-locking analysis at model changepoints.

ΔPLV is used to measure disruptions in phase-locking of EEG to the tone presentation rate (6.25 Hz) at the time when the model detects a change in the stimulus (i.e., at the changepoint). a) Illustration of ΔPLV calculation. PLV measures phase agreement across trials independent of power; an example PLV calculation (right) shows the phase of individual EEG trials (in grey)—PLV is the magnitude of the mean of these normalized phasors (in black). ΔPLV is then the difference in PLV within a 7-tone (1-sec) window before and after the changepoint (left, shown at the HOS changepoint in the melody). For each subject, ΔPLV was calculated for three sets of changepoints: the changepoints output from the LOS and HOS models, and the nominal changepoint (i.e., midpoint) used to generate the stimuli. Additionally, as a control, the same HOS changepoints were applied to responses to no-change stimuli. b) Empirical distributions of ΔPLV at the LOS-, HOS-, Nominal-, and Control-changepoints (line) calculated by bootstrap sampling across subjects, along with the null distribution (solid gray) calculated by performing the same analysis with random sampling of the changepoint position. This null distribution estimates variability in ΔPLV present throughout the melody. Significant change from zero and from the null distribution is seen in the HOS-changepoint only.

https://doi.org/10.1371/journal.pcbi.1006162.g007

ΔPLV was measured at four sets of changepoints: the LOS and HOS model-changepoints, the nominal changepoint, and a control condition. The nominal changepoint (i.e., the midpoint) is the time where the generating distributions before and after have the greatest contrast. As a control for this analysis, HOS-changepoints were randomly assigned to control trials to ensure that any difference in PLV was due to the neural response recorded during change trials, and not simply due to the position of the changepoints.

Fig 7b shows the bootstrap distributions of the mean ΔPLV for each set of changepoints (lines). A paired t-test shows a significant decrease in PLV at the HOS-changepoints (p < 0.001), while there was no significant difference for the other changepoints. We also tested the ΔPLV measured at the changepoints against the variation in phase-locking present throughout the melody by estimating a null distribution, sampling null-changepoints at random positions in the melody and calculating ΔPLV. There was again a significant difference for the HOS-changepoints only (p < 0.001). These results together indicate there is a disruption in phase-locking that is specifically related to the changepoints obtained from the fitted HOS model.

Discussion

How the brain extracts information from stochastic sound sources for auditory scene analysis is not well understood. We investigated stochastic regularity processing using change detection experiments, where listeners detected changes in the entropy of pitches in melodies. Results from Experiments 1–2 confirmed results from previous work showing that listeners represent information about stochastic sounds through statistical estimates [6, 8]. Listeners’ detection performance scaled with change degree (Experiments 1, 1b) and with the length of the sequence (Experiment 2), consistent with the use of a sufficient statistic to detect changes: a larger change in the statistic and a larger pool of sensory evidence both improved detection performance.

What statistics are collected by the brain?

We introduced a perceptual model for stochastic regularity extraction and applied this model to the same change detection experiments as our human listeners. We used different sets of statistics in the model to determine which best replicate human behavior: a lower-order statistics (LOS) model that collects the marginal mean and variance of tone pitches or a higher-order statistics (HOS) model that additionally collects the covariance between successive tone pitches. Comparing the performance range for LOS and HOS models to human performance, we showed that higher-order statistics are necessary to capture all human behaviors, while lower-order statistics are insufficient to capture the full range of subject behaviors. This disparity strongly suggests the brain is collecting and using higher-order statistics about the temporal dependencies between incoming sounds. Furthermore, the model revealed effects in EEG that are only discernible using higher-order statistics: ERP evidence showed an MMN response elicited by tones that are surprising according to the higher-order statistics of the preceding melody, and cortical phase-locking was disrupted at the changepoints specified by the HOS model.

Interestingly, both LOS and HOS models were able to replicate behavior from poorer performing subjects (d′ < 1.5), but the LOS model is unable to mirror behaviors with high hit-rates without also increasing the FA-rate (Fig 4a). Intuition suggests that marginal statistics within the local context (i.e., short memory or small m) might be effective for detecting changes in local variance in the fractal sequences; this notion is supported by the model, where m = 10 tones yields the best LOS model performance (Fig 4b). Yet this local LOS model, with limited sampling in the statistics collected, is unable to match the performance exhibited by better performing subjects. In other words: if listeners (or the LOS model) rely solely on marginal statistics, then their ability to accurately flag changes in random fractal structure is highly constrained. Furthermore, relying on lower-order statistics should produce an effect of the direction of change (from low to high entropy or vice versa) on the hit-rates. Behavioral data show no such effect of change direction on hit-rates (Experiments 1 and 1b), which further corroborates that listeners cannot be relying solely on lower-order statistics.

While these results strongly argue for the brain’s ability to track higher-order statistics in sound sequences, they do not disagree with previous work demonstrating sensitivity to lower-order statistics [8, 9]. Rather, by designing a task in which higher-order statistics are beneficial, we show that listeners are additionally sensitive to the temporal covariance structure of stochastic sequences. We also do not argue that the statistics collected by the brain are limited to these, but could include longer-range covariances. We performed the same analysis using a D = 3 model that collects covariance between non-adjacent sounds, but it did not provide any improvement over the D = 2 (HOS) model. This merely means that for our stimuli, there was no additional information to aid in change detection beyond the adjacent covariances. Additional experiments with stimuli that specifically control for this are needed to determine the extent of the temporal range of statistics collected by the brain.

Individual differences revealed by stochastic processing

By their very nature, the stimuli used here exhibit a high degree of irregularity and randomness across individual instances of sequences. For the listener, deciding where the actual change in regularity occurs in a particular stimulus is a noisy process that arises with some level of uncertainty. Perceptually, most trials do not contain an obvious “aha moment” when change is detected; rather, the accumulation of evidence for statistical change emerges as a gradual process. Similarly from a data analysis point of view, determining the exact point of time when the statistical structure undergoes a notable change is a nontrivial problem, given that the perception of statistical change is not binary but continuous and varies both between trials and between listeners. As such, the study of stochastic processing hinges on the use of a model that is well-matched to the computations occurring in the brain, combining the right granularity of statistics with the right scheme for cue integration and decision making. And with the introduction of perceptual parameters to the model, we gain flexibility in the behaviors that can be reproduced by the model with clear interpretation as to the computational constraints leading to these behaviors.

Taking a close look at individual differences through the lens of the model, we were able to inspect underlying roots of this variability. Rather than simply a difference in decision threshold (i.e., “trigger-happiness”), we argue the variability across listeners was due to individual differences in the limitations of the perceptual system. We incorporated these limitations into the model via perceptual parameters. The memory parameter represents differences in working memory capacity [37, 38], and the observation noise parameter represents individual differences in pitch perception fidelity [39]. We should note that these parameters may also be capturing other factors that affect listener performance like task engagement, neural noise, or task understanding, which could be contributing noise to these results.

By fitting the model to individual listeners through their behavior, we showed correlates between human performance and the perceptual parameters of the model, and we found that neither perceptual parameter alone was adequate to fit all subjects. Rather than a nuisance, we see the inter-subject variability in these results as a consequence of individual differences in the perceptual system that are amplified by the uncertainty present in stochastic processing.

Neural response depends on statistical context

We found effects of the statistical context on the neural response. First, examining ERP responses to individual tones, we found an enhanced P2 response to large frequency deviations in low-entropy melodies compared to high-entropy melodies and a frontal distribution of this difference consistent with sources in the auditory cortex. This result corresponds with previous work where large frequency deviations that were less likely given the previous context showed an enhanced P2 amplitude [32]. Similarly, we interpret this result reflecting a release from adaptation, where the low-entropy melody has a narrow local frequency range. Importantly, we do not see an MMN effect, arguably because frequency deviation alone is too crude to provide an adequate definition of “deviant” with our stochastic stimuli: large frequency deviations do not always violate the regularities in our stimuli, which may explain the lack of an observable MMN in the average differential response.

Using the fitted model, we were able to tease out distinct surprisal effects on the tone ERP that differ both in statistics and in temporal integration window: the LOS surprisal measured how well each tone was predicted by the lower-order statistics of the local context, while the HOS surprisal measured how well each tone was predicted by the higher-order statistics of the longer context, as fit by the model to individual behavior. Because LOS and HOS surprisal are partially (and unavoidably) correlated, both LOS and HOS surprisal were included in a single regression in order to find components in the ERP that correlate with each independent of the other [34].

We found an enhanced P2 amplitude with increasing LOS surprisal that is similar in amplitude and latency to the P2 difference discussed above; indeed, LOS surprisal provides a similar definition of regularity to the ERP analysis based on melody entropy above, for large frequency deviations are always “deviants” according to the lower-order statistics. We again attribute this increased P2 to a release from adaptation. Consequently, we can then attribute the MMN response to HOS surprisal as a deviance response according to higher-order statistics independent from lower-order adaptation effects.

There has been much discussion on whether the MMN response is truly a deviance response or merely due to adaptation [40, 41]. Many experiments suffer from confounding frequency deviance with regularity deviance, making it difficult to definitively attribute MMN to one or the other. With our stochastic stimuli differing in higher-order statistics, we were able to disentangle the two interpretations. We again stress that this result is not in conflict with previous results showing effects of lower-order statistics on the MMN [8, 42], because deviants in these studies could also be considered deviants according to their higher-order statistics (i.e., the HOS model reduces to the LOS model when the covariance between sounds is zero).

Finally, we found a disruption in the brain’s phase-locked response to tone onsets that coincides with HOS model changepoints, where the model detects a change in the higher-order statistics of each stimulus. Contrasting various controls using different estimates of when the change point occurs, we observed a notable phase disruption with changes in higher-order statistics only. The change in phase synchrony across trials could be due to the combined modulation of multiple ERPs to tones following the changepoint, or it could reflect a change in the oscillatory activity of the brain, which has been shown to correspond with both changes in predictive processing and attentional effects [43, 44].


Digital Performer: Audio Pitch & Tempo Manipulation

Digital Performer Tips & Techniques By Robin Bigwood

The Transpose pop-up menu in the Soundbite window's info pane is where you can choose between Standard and PureDSP pitch-shift algorithms.

A guide to DP's options for audio pitch and tempo manipulation is on the agenda for this month's instalment of Performer Notes.

Ten years ago, the 'big thing' was the development of non-linear multitrack recording systems, with applications such as Performer (later to become Digital Performer) allowing users the freedom to edit, cut, paste and duplicate sections of audio in a way that was unthinkable with tape. Time will tell, but it looks as though the next 'big thing' — and perhaps just as significant to the way we all work — might well be high-quality pitch and tempo manipulation of audio such as is already offered in Ableton's Live, and even more specialist applications such as Celemony's Melodyne. The new Cubase SX 3 is making moves towards this kind of technology, and if Logic 7 introduces some sort of 'liquid audio' capabilities (see our preview in this issue), most other sequencers, DP included, are going to have to follow suit.

For the time being, though, if you want this kind of audio manipulation you're going to have to get Ableton Live, which, when Rewired to a host sequencer, makes a lovely setup. I'll be covering the DP/Live partnership in next month's Performer Notes, but this month I'm taking a look at DP's existing features for manipulating the pitch and tempo of audio — a far cry from Live, maybe, but useful anyway.

Perfect Pitch?

DP offers two main approaches to manipulating the pitch of audio: the Transpose function, and the purpose-built Spectral Effects dialogue box. DP also has two different pitch-shift algorithms: Standard and PureDSP.

To cut a long story a little shorter, I've summarised recommended pitch-shift type and algorithm combinations for different musical applications (see table below).

Type of Material                                   Pitch Shift
Polyphonic/chordal/rhythmic                        Transpose/Standard
Monophonic, melodic                                Transpose/PureDSP
Monophonic, melodic, with formant control          Spectral Effects
Monophonic, with deliberate 'side effects'         Transpose/Standard, or Spectral Effects

DP's Transpose window works for both MIDI and audio, albeit with some slight differences. To apply a pitch-shift to audio using it, you first need to choose some audio, either by selecting an entire Soundbite (or Soundbites) or by selecting a region, using the I-Beam tool, for example. Then open the Transpose window, either by hitting Apple-9 or choosing 'Transpose...' from the Region menu. Because DP can transpose audio and MIDI at the same time if necessary, you have to ignore some of the MIDI-only options, such as transposing by anything other than Interval, and the Harmonise function. Additionally, you need to check the 'Transpose audio' box and, of course, choose an appropriate interval. One audio-only option is the 'Fine-tune audio' setting, where you can specify an additional pitch-shift amount in hundredths of a semitone (cents), up or down. After all that has been set, you just hit Apply, then DP does the processing in the background, allowing you to get on with other tasks while it works.
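As background to the Interval and 'Fine-tune audio' settings, the frequency ratio implied by a transposition follows the standard equal-tempered formula; a quick sketch of the general relationship (not DP's internal code):

```python
def pitch_shift_ratio(semitones, cents=0):
    """Frequency ratio for a transposition of `semitones` plus a
    fine-tune offset in cents (hundredths of a semitone).

    e.g. +7 semitones -> ~1.498 (a perfect fifth up);
         -12 semitones -> 0.5 (an octave down).
    """
    return 2.0 ** ((semitones + cents / 100.0) / 12.0)
```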

The Spectral Effects window is an intuitive front end for DP's PureDSP pitch algorithm, and can also be used to time-compress (or expand) audio.

You choose between DP's Standard and PureDSP algorithms by selecting a Soundbite in the Soundbites window, then using the Transpose pop-up menu in the Info pane. Because this is a preference that is set on an individual Soundbite basis, it's possible to pitch-shift several Soundbites with different algorithms simultaneously. It's also possible to prevent a Soundbite from being pitch-shifted at all, by choosing 'Don't Pitch Shift'. This is handy for unpitched parts such as drums when you're pitch-shifting other Soundbites en masse.

The difference between Standard and PureDSP pitch-shifting, as you may have guessed already, is that PureDSP is optimised for monophonic parts. In fact, if you try to use it to transpose chordal or polyphonic parts it's rarely successful at all. Standard pitch-shifting does a better job of these, but the working range is limited to just a few semitones before serious artifacts become obvious.

If you are working with monophonic parts, though, you might want to consider DP's Spectral Effects instead of PureDSP Transposition. The Spectral Effects dialogue box is accessible via the Audio menu, after you've selected some audio in one of DP's editing windows, and offers control of formant-frequency shifting independently of pitch. This works particularly well on voices, and can be used to quickly mock up fairly believable backing vocal 'ensembles' with just one singer, dialling in small formant shifts to alter character and pitch changes to swap gender! The user interface is rather nice, with the current pitch-shift (and/or time-compression) settings represented by the location of a red ball in a pseudo-3D space. Moving the mouse carefully around the ball selects different cursors, which in turn allow the ball to be moved up/down, left/right, or to the front/rear. In practice, however, it's probably easier just to interact with the value boxes — clicking and dragging in them allows for very fine value changes. There's even a preset scheme, which showcases some of the more extreme treatments that are possible.

It's About Time

DP is capable of both timestretching and time-compressing Soundbites, and therefore changing the tempo of rhythmic material. There aren't any choices of algorithm (sadly), but the stretch or compression can be effected in several ways.

DP's Scale Time dialogue box, accessible from the Region menu, is one of the easiest ways to apply time compression or expansion to Soundbites.

One way, as we've seen, is to use Spectral Effects, and its Tempo setting, to impose percentage changes in length. But perhaps the easiest way is to simply point at the upper left-hand or right-hand corner of a Soundbite in the Sequence Editor until the cursor turns into a 'pulling' hand, then click and drag. If you have the edit grid turned on, the length changes will be quantised to the grid, though this behaviour can be toggled off (or, indeed, on) in the usual way, by holding down the Apple key as you drag. If you need more accuracy, or prefer to work in a numerical way, try the Scale Time command. You start by selecting all (or part) of a Soundbite, then choose 'Scale Time...' from the Region menu. In the dialogue box that appears you're shown the current start and end time of your selection, and the duration. In the lower of the two lines you can select new values, and specify length in percentage terms if necessary. Time units are switchable using the little button at top right. To apply the change, just make sure 'Time-scale audio' is selected, and hit OK.

Using Scale Time you can do some clever things, such as taking a rhythmic loop Soundbite that doesn't match your sequence tempo and, using the Measures time format, define a new length for it that fits in with your tempo. Similarly, you can specify duration in SMPTE time format, making the fitting of ambiences and effects to specific time locations easier.
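If you'd rather work the percentage out before opening Scale Time, the relationship between tempo and duration is a simple inverse ratio. This hypothetical helper is not a DP feature, just the underlying arithmetic:

    def scale_time_percent(loop_bpm, sequence_bpm):
        # A slower target tempo means a longer Soundbite, hence the inverse ratio.
        return loop_bpm / sequence_bpm * 100.0

    scale_time_percent(120.0, 100.0)  # a 120bpm loop in a 100bpm sequence: enter 120.0%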

If your needs are primarily musical, however, it's well worth looking into DP's Soundbite tempo functions. By specifying tempo information for Soundbites you can make them follow tempo changes in the Conductor Track, actually tracking any speeding up or slowing down of the music. For more on this, see the Performer Notes column from way back in June 2001.

Quick Tips

  • If you're always working with loops and often need to change their tempo, remember that DP 4 can deal quite effectively with REX files — more about this in a forthcoming Performer Notes article. For instant looping gratification you can't go far wrong with the AU version of Phatmatik Pro, by Bitshift Audio, which works great in DP.
  • When you're using DP's PureDSP algorithm or Spectral Effects to pitch-shift monophonic audio, be careful to leave most plug-in treatments, especially reverb and delay, until after the pitch-shift has taken place. These treatments effectively destroy the monophonic nature of the original, and may cause all sorts of unpredictable effects in the pitch-shifting process.
  • DP generates analysis files in your project folder to help with time compression and expansion, but these can take up quite a bit of disk space. If you're archiving a project, don't be afraid to throw them away — DP can always regenerate them if you ever need to do any further editing.

Yet More New Toys

I covered a bumper crop of goodies in last month's Performer Notes, but the flow of high-quality new plug-ins shows no sign of abating.

This is Roger, one of the three quirky but great plug-ins that make up Audioease's Rocket Science bundle, now available in MAS format for DP 4.

Making a welcome return to DP in OS X is PSP's Mixpack. This is a bundle of four plug-ins — Mixbass, Mixsaturator, Mixpressor and Mixtreble — whose applications go well beyond final mixing or mastering. In particular, I find Mixpressor an excellent compressor which works well on both individual tracks and whole mixes. Mixsaturator can also be put to great use in fattening up individual tracks, and Mixtreble has various handy uses, including cutting hiss, widening the perceived stereo image, and perking up flat-sounding transients, as well as acting as a 'conventional' enhancer. All in all, Mixpack is a versatile set of plug-ins that you may find become 'bread and butter' tools, handling basic tonal and dynamic processing tasks in a very sophisticated and subtle way. Mixpack costs $149 from www.pspaudioware.com.

Hot on the heels of Audioease's Nautilus bundle (see last month's column) comes their even more idiosyncratic Rocket Science bundle. Rocket Science is older than Nautilus and presents an even odder selection of processors. For example, you may never have considered the possibility of treating your tracks with a 'Multiple Gender Vowel Bank', but should you need to, there's Roger. As single-purpose plug-ins go, this has got to be one of the most limited, but it might be just what you need one day, and it can certainly produce some funky and often funny treatments!

Next up is Orbit, a cross between a reverb and a Doppler processor. It localises sounds in an acoustic environment, allows them to move on various paths, and employs psychoacoustic techniques to make them seem to pass behind a listener monitoring in stereo. Maybe it's not a plug-in you'd use every day, but it comes into its own when you need that special something. Finally, Follo is a resonant band-pass filter whose cutoff frequency is controlled by the dry signal's amplitude. Applications include autowah and weird vocal treatments. Rocket Science might not appeal to those looking for ever more high-end EQs and compressors, but it offers plenty of scope for anyone who likes their audio treatments 'out there'. It costs $199 from www.audioease.com.

US20110113357A1 - Manipulating results of a media archive search - Google Patents

Manipulating results of a media archive search

Info

Publication number: US20110113357A1
Authority: US (United States)
Prior art keywords: archive, items, search, method, display
Legal status: Abandoned (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Application number: US12/616,903
Inventors: Marcel C. Rosu, Nathaniel Ayewah
Original assignee: International Business Machines Corp
Current assignee: International Business Machines Corp (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Priority date: 2009-11-12 (the priority date is an assumption and is not a legal conclusion)
Filing date: 2009-11-12
Application filed by International Business Machines Corp
Assigned to International Business Machines Corporation; assignors: Ayewah, Nathaniel; Rosu, Marcel C.
Publication of US20110113357A1

Classifications

    • G06F16/4387 — Presentation of query results by the use of playlists
    • G06F16/438 — Presentation of query results
    • G06F16/447 — Temporal browsing, e.g. timeline
    • G06F16/48 — Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    (All within G — Physics; G06 — Computing, calculating, counting; G06F — Electric digital data processing; G06F16/00 — Information retrieval, database and file system structures therefor; G06F16/40 — of multimedia data, e.g. slideshows comprising image and additional audio data; G06F16/43 — Querying; G06F16/44 — Browsing, visualisation.)

Description

  • The present invention relates to searching media archives, and more specifically, to a dynamic interface for manipulating media archive search results.
  • Audio and video recordings, such as conference call recordings, podcasts, videos, and recordings of presentations or lectures, are increasingly used for information dissemination and storage. Searching media archives is a more difficult task than searching text archives because it relies mainly on indexing and searching the voice-to-text synchronized translations of the sound tracks of the archive recordings, which are rarely accurate. The precision of the transcription can vary widely with the quality of the recording and with the speaker characteristics. As such, media transcripts may include errors such as misspellings or incorrectly transcribed words, because recognition of a word is context-dependent. As a consequence, the result of a search can include many more irrelevant elements than a text search on an archive of manual transcripts would include (i.e., precision and recall can be lower as compared to a non-transcript text search). Furthermore, once the search completes, it is more difficult for the user to determine the relevance of the results returned by a media archive search than by a text archive search, as the visualization of the latter includes short text fragments highlighting the search terms in context. Enhancing the results of a media archive search with text fragments surrounding search terms from the transcript is possible but difficult because automatic transcripts include many errors, especially for short, common words (which are rarely used in search but are crucial when trying to understand the meaning of a sentence or short fragment), and are rarely capable of segmenting the word stream into sentences or identifying punctuation, new paragraphs, or speakers. Furthermore, the transcription is of limited value because words not included in the transcriber's vocabulary are never present in the index and cannot be used for searching. Therefore, transcription accuracy affects the ranking of search results, which takes into account the frequency of the search terms in each of the recordings that satisfy the Boolean query.
  • Existing systems identify the location of the search terms in the stream to quickly allow users to gather context by listening to the recording segment surrounding the search term position. Such user interfaces are static, and they do not allow users to properly react to what they have listened to, such as by updating the relevance of the just-listened-to recording(s). Identifying the relevant information among the results of a media search is also more difficult than for the results of document searches. A quick look at a document is typically enough to determine if it includes the information needed; the document formatting elements, such as paragraphs or fonts, play an important role in helping us find the relevant sentences, phrases, or data (tables, graphs, enumerations, etc.). Unfortunately, such visual cues cannot be generated accurately using existing voice-to-text computer programs. Typically, in a media search, the relevant information is retrieved by playing variable-length segments of the media recordings retrieved by the Boolean search. This process is lengthy, it may span more than one session, and the user may be interrupted by other events before the search task completes. To speed up the identification task, the system should precisely identify the relevant segments for the user, and it should allow the user to edit the ranked set as desired during the identification process, with the goals of maximizing user productivity across sessions, minimizing the negative impact of interruptions, and saving a customization of the search results for later use or sharing.
  • Therefore, there is a need for the user to easily manipulate the search results, save the outcome of this effort, and possibly share it with other users of the system.
  • Exemplary embodiments include a method for manipulating the results of a media archive search, the method including sending search terms related to one or more archive items in the media archive, receiving search results from the media archive, displaying the search results on a display, sending manipulation commands, performing manipulation operations based on the manipulation commands, displaying modified search results on the screen based on the manipulation operations and identifying attributes for each of the one or more archive items.
  • Further exemplary embodiments include a method in a computer system having a graphical user interface including a display and a selection device, the method for manipulating the results of a media archive search on the display, and including retrieving a set of items in a media search, displaying the set of items on the display, receiving a manipulation selection command indicative of the selection device pointing at selected items of the media search, and, in response to the manipulation selection command, performing a manipulation action at the selected items of the media search.
  • Additional embodiments include a computer program product for manipulating the results of a media archive search, the computer program product including instructions for causing a computer to implement a method, the method including sending search terms related to one or more media archive items in the media archive, receiving search results from the media archive, displaying the search results on the display, sending manipulation commands, performing manipulation operations based on the manipulation commands, displaying modified search results on the screen based on the manipulation operations and identifying attributes for each of the one or more archive items.
  • Additional features and advantages are realized through the techniques of the present invention. Other embodiments and aspects of the invention are described in detail herein and are considered a part of the claimed invention. For a better understanding of the invention with the advantages and the features, refer to the description and to the drawings.
  • Exemplary embodiments include a method and system for enhancing the user interfaces used for searching media archives, enabling media archive searches and precise random access based on the positions of the search terms in the voice stream. The method and system also enable users to remove streams found irrelevant, re-rank media streams in the result set, zoom into certain streams and keep only a fragment of the stream in the final result set, which can be a result of accessing the returned media streams. The method and system can further create new streams from the concatenation of existing streams or stream fragments, annotating both original and new streams and saving the result of these user actions for future access or for sharing it with other users of the media archive.
  • illustrates a system 100 for manipulating the results of searching media archives. In exemplary embodiments, the system 100, having a media search application 112, is part of a client-server architecture. The client (i.e., the application 112) can be a browser-based application residing on a general purpose computer 101. The server side, which can be in communication with a network 165, is responsible for the interaction with the media archive while the client component implements the user interface. In exemplary embodiments, the application 112 can be distributed over the client side and the server side. For illustrative purposes, the exemplary embodiments described herein are illustrated with respect to the client side. It is appreciated that the methods can be implemented across both the client and server. The server side responsible for interacting with the media archive can be implemented by one or multiple server machines. The client application 112, as a browser-based application, can be retrieved from the local client machine or, more commonly, from a server machine 170, possibly different from the servers performing the searches of the media archive.
  • In exemplary embodiments, users initiate media searches by providing the search terms used to form the Boolean query, which is sent over the network 165 to the server component for execution. Search results, which include references to the media streams in the archive, static metadata related to each of the streams (stream title, author, date, length, and annotations), and dynamic metadata (e.g., position(s) of the search terms in the said streams, transcript fragments surrounding search terms), are returned to the client component. Using the values returned by the server and user preferences, the client component constructs the result screen. User preferences determine the order in which the streams are displayed (e.g., by their rank in the result set, increasing/decreasing length, by date, by title or by author), the static metadata that is displayed, and which of the received dynamic metadata elements are displayed and their display format (e.g., whether transcript fragments surrounding the search terms are displayed, the length of those fragments, and the analysis used to filter them). User preferences can be static, while the best representation of the search results is dependent on the results of the search. Furthermore, transcript errors lead to incorrect relevance ranking and false positives, which reduce precision, defined as true positives divided by the sum of true and false positives. In addition, transcript errors often lead to false negatives, which reduce recall, i.e., the percentage of relevant items retrieved by the Boolean search. False negatives are likely when the searched terms occur only once or a few times in the recording and all instances of the searched terms are translated incorrectly. To increase recall, users typically make searches more inclusive, which lowers precision. As a result, the search result (or ranked) set is large and users need help in locating the relevant items. In exemplary embodiments, users can dynamically customize the result screen using domain-specific information or information collected from listening to fragments of the retrieved streams. The customization enables listening to a series/collection of recordings on a desired topic, or sharing a collection of (one or more) podcasts with colleagues as part of a collaborative activity. Customization of search results includes, but is not limited to: reordering the elements in the result set, extending the result set with stream fragments, removing elements of the result set, setting the visibility of the various search terms marking the streams in the result set, and editing the transcript fragments associated with the search terms (e.g., to compensate for the transcriber's inability to identify out-of-vocabulary terms or the start of a sentence).
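The precision and recall definitions used above translate directly into code; the following sketch uses illustrative counts only:

    def precision(true_pos, false_pos):
        # Fraction of retrieved items that are actually relevant.
        return true_pos / (true_pos + false_pos)

    def recall(true_pos, false_neg):
        # Fraction of all relevant items that the Boolean search retrieved.
        return true_pos / (true_pos + false_neg)

    # Transcript errors raise false negatives (hurting recall); broadening the
    # query to compensate raises false positives (hurting precision).
    precision(8, 12)  # 0.4
    recall(8, 2)      # 0.8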
  • The exemplary methods described herein can be implemented in software (e.g., firmware), hardware, or a combination thereof. In exemplary embodiments, the methods described herein are implemented in software, as an executable program, and are executed by a special or general-purpose digital computer, such as a personal computer, workstation, minicomputer, or mainframe computer. The system 100 therefore includes the general-purpose computer 101. Other embodiments include a software implementation with the client and server components running on the same machine, or a monolithic software implementation with the previously described client and server functionality implemented in one application running on a personal computer.
  • In exemplary embodiments, in terms of hardware architecture, as shown in , the computer 101 includes a processor 105, memory 110 coupled to a memory controller 115, and one or more input and/or output (I/O) devices 140, 145 (or peripherals) that are communicatively coupled via a local input/output controller 135. The input/output controller 135 can be, for example but not limited to, one or more buses or other wired or wireless connections, as is known in the art. The input/output controller 135 may have additional elements, which are omitted for simplicity, such as controllers, buffers (caches), drivers, repeaters, and receivers, to enable communications. Further, the local interface may include address, control, and/or data connections to enable appropriate communications among the aforementioned components.
  • The processor 105 is a hardware device for executing software, particularly that stored in memory 110. The processor 105 can be any custom made or commercially available processor, a central processing unit (CPU), an auxiliary processor among several processors associated with the computer 101, a semiconductor based microprocessor (in the form of a microchip or chip set), a macroprocessor, or generally any device for executing software instructions.
  • The memory 110 can include any one or combination of volatile memory elements (e.g., random access memory (RAM, such as DRAM, SRAM, SDRAM, etc.)) and nonvolatile memory elements (e.g., ROM, erasable programmable read only memory (EPROM), electronically erasable programmable read only memory (EEPROM), programmable read only memory (PROM), tape, compact disc read only memory (CD-ROM), disk, diskette, cartridge, cassette or the like, etc.). Moreover, the memory 110 may incorporate electronic, magnetic, optical, and/or other types of storage media. Note that the memory 110 can have a distributed architecture, where various components are situated remote from one another, but can be accessed by the processor 105.
  • The software in memory 110 may include one or more separate programs, each of which comprises an ordered listing of executable instructions for implementing logical functions. In the example of , the software in the memory 110 includes the media archives search manipulation methods described herein in accordance with exemplary embodiments and a suitable operating system (OS) 111. The operating system 111 essentially controls the execution of other computer programs, such as the media archives search manipulation systems and methods described herein, and provides scheduling, input-output control, file and data management, memory management, and communication control and related services.
  • The media archives search manipulation methods described herein may be in the form of a source program, executable program (object code), script, or any other entity comprising a set of instructions to be performed. When in the form of a source program, the program needs to be translated via a compiler, assembler, interpreter, or the like, which may or may not be included within the memory 110, so as to operate properly in connection with the OS 111. Furthermore, the media archives search manipulation methods can be written in an object oriented programming language, which has classes of data and methods, or a procedural programming language, which has routines, subroutines, and/or functions.
  • In exemplary embodiments, a conventional keyboard 150 and mouse 155 can be coupled to the input/output controller 135. Other output devices such as the I/O devices 140, 145 may include input devices, for example but not limited to a printer, a scanner, microphone, and the like. Finally, the I/O devices 140, 145 may further include devices that communicate both inputs and outputs, for instance but not limited to, a network interface card (NIC) or modulator/demodulator (for accessing other files, devices, systems, or a network), a radio frequency (RF) or other transceiver, a telephonic interface, a bridge, a router, and the like. The system 100 can further include a display controller 125 coupled to a display 130. In exemplary embodiments, the system 100 can further include a network interface 160 for coupling to a network 165. The network 165 can be an IP-based network for communication between the computer 101 and any external server, client and the like via a broadband connection. The network 165 transmits and receives data between the computer 101 and external systems. In exemplary embodiments, network 165 can be a managed IP network administered by a service provider. The network 165 may be implemented in a wireless fashion, e.g., using wireless protocols and technologies, such as WiFi, WiMax, etc. The network 165 can also be a packet-switched network such as a local area network, wide area network, metropolitan area network, Internet network, or other similar type of network environment. The network 165 may be a fixed wireless network, a wireless local area network (LAN), a wireless wide area network (WAN) a personal area network (PAN), a virtual private network (VPN), intranet or other suitable network system and includes equipment for receiving and transmitting signals.
  • If the computer 101 is a PC, workstation, intelligent device or the like, the software in the memory 110 may further include a basic input output system (BIOS) (omitted for simplicity). The BIOS is a set of essential software routines that initialize and test hardware at startup, start the OS 111, and support the transfer of data among the hardware devices. The BIOS is stored in ROM so that the BIOS can be executed when the computer 101 is activated.
  • When the computer 101 is in operation, the processor 105 is configured to execute software stored within the memory 110, to communicate data to and from the memory 110, and to generally control operations of the computer 101 pursuant to the software. The media archives search manipulation methods described herein and the OS 111, in whole or in part, but typically the latter, are read by the processor 105, perhaps buffered within the processor 105, and then executed.
  • As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or computer program product. Accordingly, aspects of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” Furthermore, aspects of the present invention may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.
  • Any combination of one or more computer readable medium(s) may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
  • A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
  • Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
  • Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
  • Aspects of the present invention are described below with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
  • These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
  • The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
  • The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
  • In exemplary embodiments, where the media archives search manipulation methods are implemented in hardware, the media archives search manipulation methods described herein can be implemented with any or a combination of the following technologies, which are each well known in the art: a discrete logic circuit(s) having logic gates for implementing logic functions upon data signals, an application specific integrated circuit (ASIC) having appropriate combinational logic gates, a programmable gate array(s) (PGA), a field programmable gate array (FPGA), etc.
  • The following figures illustrate screenshots of an exemplary user interface in accordance with exemplary embodiments. The screenshots illustrate examples of user manipulation of media search results.
  • illustrates a screenshot 200 of an example of a browser-based user interface (UI) for manipulating the results of speech archive searches. In the example, the UI can include a first query field 205, which in the example, includes the word “software”. The first query field is for selecting recordings in the archive that contain all the words in query field 205. In the particular archive being searched, one result is displayed in a first result field 210. The result illustrated is an audio recording “When will we see applications for multicore systems?” The example further illustrates that the search result can further include a score indicating a weight of the search result, which can be based on the number of times the search query word occurs in the result with respect to the total number of words in the result. The UI can further include an Add Annotation field 215, in which the user can manually add annotations to the search.
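One plausible reading of that score is a plain term-frequency ratio; the description does not give an exact formula, so the sketch below is an assumption:

    def score(transcript_words, query_terms):
        # Occurrences of any (lowercase) query term, relative to the word count.
        hits = sum(1 for w in transcript_words if w.lower() in query_terms)
        return hits / len(transcript_words) if transcript_words else 0.0

    words = "when will we see software applications for multicore software systems".split()
    score(words, {"software"})  # 2 matches out of 10 words: 0.2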
  • illustrates a screenshot 300 of an example of a browser-based UI for manipulating the results of speech archive searches, showing a two-recording result set. In this example, the user includes two words in a second query field 305 that provides a query based on any of the words entered. In the example, the words “software” and “name” are used in the query. A two-stream result is illustrated. The result “When will we see applications for multicore systems?” is shown again in the first result field 210, and a result “Short 2” is shown in a second result field 310. Each of the results includes a score. The score for “When will we see applications for multicore systems?” is different from that illustrated in . The difference in the score is a result of the different relative weightings of the presence of any of the words used in the search query, as further described herein.
  • illustrates a screenshot 400 of an example of a browser-based user interface (UI) for manipulating the results of speech archive searches, after a user manually changes the order of the streams in the result set. This example illustrates that a user can manually change the ordering of the media archive search results as further described herein.
  • illustrates a flowchart of a method 500 for manipulating the results of a media archive search in accordance with exemplary embodiments. The method 500 illustrates an example of the sequence of actions the user can perform on the results of a search on a media archive. In exemplary embodiments, these actions can be performed on the results returned by a search tool, which can be a component of the application 112 or a self-contained search tool residing on the computer 101 or on the server. As such, at block 501, the search terms can be input and the search started. In exemplary embodiments, the actions can also be performed on a previously edited list of results, where the previous edits were performed by the same user or a colleague/friend/collaborator. As such, at block 502, the search results can be restored or received. An example of such a search result list is shown in , with each horizontal line representing a recording in the media archive that satisfies the Boolean query condition. More details for each recording are shown in and further described herein.
  • Referring still to , in exemplary embodiments, the user can select an action at blocks 511-516 to perform after visually inspecting the results on the display 130. The actions can include, but are not limited to, removing an item at block 511, moving an item up and down at block 512, zooming an item up or down at block 513, panning an item up or down at block 514, copying an item at block 515, and creating a new item from other items at block 516. In exemplary embodiments, a user can also decide to play a segment of a media recording at block 504 and, based on the result, select one of the actions at blocks 511-516 or play an additional segment of one of the media items on the display 130. In exemplary embodiments, the actions at blocks 511-516 can be performed either on the visual attributes on the display 130 at block 503 or on the heard/viewed information at block 505.
  • In exemplary embodiments, visual inspection takes into consideration a number of attributes of the media items in the result set. Some of the attributes are intrinsic to the media items and are stored in the archive together with the item. Other attributes are specific to the search action and are generated by the media search tool. Another category of attributes is generated by the user's actions. Visual attributes are described with respect to .
  • In exemplary embodiments, items that the user considers to be irrelevant can be removed from the result set at block 511. This action allows the user to focus on a smaller set of results. Items can also be reordered at block 512 so that the user can focus on one group at a time or prioritize play actions. If, for example, only a segment of a media item is deemed interesting, the user can zoom at block 513 and pan at block 514 on the relevant section. If more than one section of an item is considered interesting, the user can create one or more copies of the item at block 515, followed by different zoom (block 513) and pan (block 514) operations applied to each copy.
  • In addition to copying items, the user can create new items by cut & paste operations involving one or more items at block 516. For example, a user can remove uninteresting segments from a media recording, effectively shortening it, or create a longer one from several related recordings. The aforementioned operations are virtual in the sense that no new recordings are generated; instead, each new recording is represented by the tool as a sequence of operations for the embedded media player. In exemplary embodiments, at any time, the user can select to save the modified result list for later or for sharing it with colleagues at block 520.
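A hypothetical data model for these virtual recordings (the names and fields below are illustrative, not taken from the patent): each edited item is an ordered list of (source, start, end) segments for the embedded media player, so cut and paste never touch the underlying audio.

    from dataclasses import dataclass, field

    @dataclass
    class Segment:
        source_id: str   # recording in the media archive
        start_s: float   # segment start, in seconds
        end_s: float     # segment end, in seconds

    @dataclass
    class VirtualItem:
        title: str
        segments: list = field(default_factory=list)

        def cut(self, index):
            del self.segments[index]         # drop an uninteresting segment

        def paste(self, other):
            self.segments += other.segments  # concatenate related recordings

    talk = VirtualItem("multicore talk", [Segment("rec-17", 0.0, 600.0)])
    qa = VirtualItem("audience Q&A", [Segment("rec-17", 610.0, 900.0)])
    talk.paste(qa)  # one longer virtual recording; no new audio is generated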
  • illustrates a representation of media archive search result items, showing several result items. In exemplary embodiments, items can include one or more segments, which can be displayed in different shades of grey or color, with the shade/color of each segment representing the speaker and blanks representing quiet periods, for example. Even when speaker identities are unknown, speech-to-text systems can typically differentiate between speakers. A quick visual inspection can help the user identify the type of media item. This type of identification is very helpful for recordings with which the user is familiar.
  • In exemplary embodiments, a user can have a pattern in mind that comes from a previous experience, such as attending the recorded event, or from a description of the event (or its recording) received from someone else. As such, in , the user has a familiarity with the results and has an idea of the items for which she or he is looking.
  • As illustrated, a first item 601, from top to bottom, appears to be a presentation by one speaker followed by a Q&A session with three questions/comments from different people in the audience. The second item 602 appears, to the user, to be a meeting with four participants, two of whom are more active than the other two, as indicated by the varying lengths of the shaded segments; one of the two most active participants is possibly the host, as indicated by him/her being the first speaker in the meeting. The third item 603 looks, to the user, like the recording of an interview, with short questions followed by longer answers and with the host starting and finishing the recording, possibly with introduction- and conclusion-like sections, respectively. The fourth item 604 is a two-way meeting or phone conversation with two quiet periods, which are more likely to happen in phone conversations; the word ‘two’ in ‘two-way’ comes from the transcription system identifying two speakers. As such, item 601 appears more likely than the remaining items to be the desired recording. For example, the speaker in item 602 changes too often and unpredictably (it is not clear whether there is one presenter or not). In addition, if item 603 were a presentation, then the questions were asked during the presentation, not in a Q&A session. Finally, item 604 has some quiet periods, which do not occur in a typical presentation, and an almost even distribution between two speakers, which does not fit the pattern with which the user is familiar.
  • In exemplary embodiments, certain attributes, such as recording quality, can be used to infer whether an item is a recording of a phone conversation or not. If a database of speaker speech signatures is available, the speaker ID can be inferred with high probability by the speech-to-text tool(s). Other attributes, such as title, author, recording duration, date and place, may already be included in the recording file(s) (e.g., MP3 file attributes). Others may be added manually to the recording at transcription or indexing time. All these attributes are considered to be intrinsic to the recording. The “Score” attribute is generated by the tool and is dependent on the search terms used and the content of the media archive.
  • illustrates a representation of media archive search results, showing additional visual attributes attached to a single media item. In exemplary embodiments, the user can turn off or disable attribute types/classes at any time. The beginnings and ends of the speaker segments 711, 712, 713, 714, 715 in a recording are one type of intrinsic attribute of the recording. The positions of the search terms 701, 702, 703, 704, 705, 706, 707 in an item included in the result set of the Boolean query represent an example of a search-specific attribute. In addition to the term position, the tool can display the confidence attached by the speech-to-text tool to the specific term in that position. With regard to the confidence, the speech-to-text translation process is probabilistic: the system selects the most likely word at each point in the process. “Most likely” is based on a number, a probability, which is computed from the recorded sound at that point in the transcription and the context, i.e., the previous words, which is captured in what is called the language model. Some systems output alternative text translations together with the associated/computed probabilities.
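As a toy illustration of that selection rule (not a real speech-to-text system), a decoder picks the candidate word that maximizes the product of an acoustic probability and a language-model probability:

    def most_likely_word(candidates, acoustic_p, lm_p, context):
        # acoustic_p(w): P(recorded sound | w); lm_p(w, context): P(w | previous words).
        return max(candidates, key=lambda w: acoustic_p(w) * lm_p(w, context))

Systems that output alternative translations simply keep the runner-up candidates together with their computed probabilities.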
  • As a result of user zoom and pan operations, the left and right ends of the line may represent moments in the media recording after its start or before its end, respectively. The visual attributes 721, 722 are examples of attributes generated by user actions. Additional visual attributes 731, 732 are shown as hashed areas marking segments played by the user; they are generated by the tool as a result of user actions.
  • The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, element components, and/or groups thereof.
  • The corresponding structures, materials, acts, and equivalents of all means or step plus function elements in the claims below are intended to include any structure, material, or act for performing the function in combination with other claimed elements as specifically claimed. The description of the present invention has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the invention. The embodiment was chosen and described in order to best explain the principles of the invention and the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.
  • The flow diagrams depicted herein are just one example. There may be many variations to this diagram or the steps (or operations) described therein without departing from the spirit of the invention. For instance, the steps may be performed in a differing order or steps may be added, deleted or modified. All of these variations are considered a part of the claimed invention.
  • While the preferred embodiment of the invention has been described, it will be understood that those skilled in the art, both now and in the future, may make various improvements and enhancements which fall within the scope of the claims which follow. These claims should be construed to maintain the proper protection for the invention first described.

Claims (20)

1. A method for manipulating the results of a media archive search, the method comprising:
sending search terms related to one or more archive items in the media archive;
receiving search results from the media archive;
displaying the search results on a display;
sending manipulation commands;
performing manipulation operations based on the manipulation commands;
displaying modified search results on the screen based on the manipulation operations; and
identifying attributes for each of the one or more archive items.
2. The method as claimed in wherein the manipulation operations include at least one of moving the one or more archive items on the display, zooming the one or more archive items on the display, panning the one or more archive items on the display, copying the one or more archive items on the display, and creating a new item from the one or more archive items.
3. The method as claimed in further comprising playing a segment of the one or more archive items.
4. The method as claimed in wherein the attributes are intrinsic qualities to the one or more archive items.
5. The method as claimed in wherein the attributes are specific to a search action based on the search terms and are generated by a search tool.
6. The method as claimed in wherein the attributes are generated by user actions.
7. The method as claimed in wherein the attributes are visual, wherein each of the one or more archive items can be displayed as segments.
8. The method as claimed in wherein the segments are identified by at least one of grey scale and color.
9. The method as claimed in wherein the segments are identified by hash marks.
10. In a computer system having a graphical user interface including a display and a selection device, a method for manipulating the results of a media archive search on the display, the method comprising:
retrieving a set of items in a media search;
displaying the set of items on the display;
receiving a manipulation selection command indicative of the selection device pointing at selected items of the media search; and
in response to the manipulation selection command, performing a manipulation action at the selected items of the media search.
11. The method as claimed in wherein the manipulation actions include at least one of moving the one or more archive items on the display, zooming the one or more archive items on the display, panning the one or more archive items on the display, copying the one or more archive items on the display, and creating a new item from the one or more archive items.
12. The method as claimed in further comprising:
receiving a play selection signal indicative of the selection device pointing at the one or more archive items on the display; and
in response to the play selection signal playing a segment of the one or more archive items.
13. The method as claimed in wherein the attributes are intrinsic qualities to the one or more archive items.
14. The method as claimed in wherein the attributes are specific to a search action based on the search terms and are generated by a search tool.
15. The method as claimed in wherein the attributes are generated by user actions.
16. The method as claimed in wherein the attributes are visual, wherein each of the one or more archive items can be displayed as segments.
17. The method as claimed in wherein the segments are identified by at least one of grey scale and color.
18. The method as claimed in wherein the segments are identified by hash marks.
19. A computer program product for manipulating the results of a media archive search, the computer program product including instructions for causing a computer to implement a method, the method comprising:
sending search terms related to one or more media archive items in the media archive;
receiving search results from the media archive;
displaying the search results on the display;
sending manipulation commands;
performing manipulation operations based on the manipulation commands;
displaying modified search results on the screen based on the manipulation operations; and
identifying attributes for each of the one or more archive items.
20. The computer program product as claimed in wherein the manipulation operations include at least one of moving the one or more archive items on the display, zooming the one or more archive items on the display, panning the one or more archive items on the display, copying the one or more archive items on the display, and creating a new item from the one or more archive items.
US12/616,903 (filed 2009-11-12, priority 2009-11-12): Manipulating results of a media archive search. Status: Abandoned. Published as US20110113357A1 (en).

Priority Applications (1)

Application Number: US12/616,903
Priority Date: 2009-11-12
Filing Date: 2009-11-12
Title: Manipulating results of a media archive search (US20110113357A1, en)

Applications Claiming Priority (1)

Application Number: US12/616,903
Priority Date: 2009-11-12
Filing Date: 2009-11-12
Title: Manipulating results of a media archive search (US20110113357A1, en)

Publications (1)

US20110113357A1 (en): Manipulating results of a media archive search
