Edited by: Masanobu Miura, Hachinohe Institute of Technology, Japan
Reviewed by: Steven Robert Livingstone, University of Wisconsin-River Falls, United States; Jasna Leder Horina, Faculty of Mechanical Engineering and Naval Architecture, University of Zagreb, Croatia; Charalampos Saitis, Technische Universität Berlin, Germany
This article was submitted to Performance Science, a section of the journal Frontiers in Psychology
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
When practicing an instrument on their own, music students focus mostly on playing correctly, and not enough on playing expressively. Software tools for assisting practice could address this issue by offering suggestions and feedback about expressive aspects of performance such as dynamics, articulation, and timbre quality, ensuring students keep these concerns in mind at all times. However, the implementation of this feature requires equipping software tools with models of performance that enable them to generate coherent suggestions for any musical piece included in the practice routine. This work explores the feasibility of that approach by proposing such a model and analyzing its suggested variations in dynamics and note onset timing of solo violin pieces.
Much of the work in the field of expressive performance modeling deals with predicting expressive features of notes in isolation. In this particular scenario, however, where the goal is providing guidance to a human player, long–term movements and character of expression take precedence over the minute variations that occur note–by–note. For that reason, the devised model bases its outputs on features of compositions that play a strong role in representing musical context:
As for
Expression in music performance has been actively studied for over two decades from several standpoints. Many authors dedicate their efforts to computationally modeling expressive performance actions as surveyed by Kirke and Miranda (
Existing approaches to performance modeling have mostly been developed for piano (Kirke and Miranda,
Specifically targeting the violin, the works of Maestre (
Evaluating expression in violin performances presents several distinct challenges when contrasted with piano performances. The advantage of the piano lies in the fact that the interaction between performer and instrument can be summarized with only a few parameters (mainly the instants and velocities of key and pedal presses and releases). Thus, for piano, modeling expression in these parameters is often sufficient for synthesizing convincing performances. Since the violin offers musicians several other expressive dimensions (e.g., attack speed, vibrato, intonation, etc.), our approach to validating the restricted set of modeled features (timing and dynamics) has been to ensure that the synthesized sounds lacked any other type of expressive variation. However, in many situations, these variations work in conjunction and justify one another.
The model as implemented forms expressive suggestions based on information taken from the musical score only, with the purpose of presenting them to students either visually, in the manner of an orchestra conductor or as graphs overlaid on the score itself, or aurally, by providing synthesized examples or even accompaniment. At this stage, however, the model output is raw, and has been analyzed in terms of similarity to an actual expert performance and perceived human-likeness, essentially exploring whether an automatic recognition of phrasing and melody from a score can be used as a predictor of its performance attributes. Consequently, this is an indirect way of addressing the broader question: to what extent do these musical structures influence performance?
As previously mentioned, the developed algorithm models the dynamics and timing of a given musical piece, meaning that, given the computed score features, it outputs recommended loudness levels for its performance and indications of when to rush or drag the tempo. Since its intended application is as a pedagogical aid, we are interested in modeling the long–term variations in expression that arise from the players' interpretation of the music rather than short–term variations, which are not only impractical to communicate in real time but are also, to a large extent, influenced by the same long–term intentions as well as by unconscious factors such as playing technique. For that reason, the model is trained on phrases rather than notes. Nevertheless, it is built on top of a note–level structure, which means all training data are initially processed in the form of note–level inputs and outputs, and later aggregated into phrase features as informed by a phrase–boundary detection method.
The note–level input features of interest are note pitch and note duration. These were extracted from musicXML by a parser written in MATLAB (code available in GitHub
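Although the published parser is written in MATLAB, the same extraction can be sketched in Python. The snippet below is an illustration only: it assumes the music21 library (not used by the authors) and returns, for every note in a MusicXML file, the two note–level input features named above (pitch and nominal duration).

```python
# Hypothetical re-implementation of the note-level feature extraction.
# Assumes the music21 library; the paper's own parser is written in MATLAB.
from music21 import converter

def note_features(musicxml_path):
    """Return (midi_pitch, quarter_length) pairs for every note in a MusicXML score."""
    score = converter.parse(musicxml_path)
    feats = []
    for n in score.flatten().notes:
        if n.isNote:  # skip chords and rests for this monophonic use case
            feats.append((n.pitch.midi, float(n.duration.quarterLength)))
    return feats
```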
The note–level output features are the mean level of loudness at which each note was performed in the recordings, and their onset times and durations. To compute them, the audio files were normalized to −0.1 dB and then input in the Tony software (Mauch et al.,
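The per–note loudness measurement can likewise be approximated in code. The sketch below is not the exact procedure used here: it assumes that note onset and offset times are already available (e.g., from the Tony annotations) and uses the librosa library to average frame–wise RMS energy, in dB, over each note's duration.

```python
# Rough sketch of per-note mean loudness, given known onset/offset times.
import numpy as np
import librosa

def note_loudness(audio_path, onsets, offsets, hop=512):
    """Mean RMS level (dB) of each note, delimited by its onset and offset times."""
    y, sr = librosa.load(audio_path, sr=None, mono=True)
    rms = librosa.feature.rms(y=y, hop_length=hop)[0]
    db = librosa.amplitude_to_db(rms, ref=1.0)
    times = librosa.frames_to_time(np.arange(len(db)), sr=sr, hop_length=hop)
    means = []
    for t0, t1 in zip(onsets, offsets):
        mask = (times >= t0) & (times < t1)
        means.append(float(db[mask].mean()) if mask.any() else float("nan"))
    return means
```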
The flowchart in
Summary of the steps for producing suggestions of expression.
The automatic detection of melodically–coherent phrases uses a top-down approach based on the Local Boundary Detection Model (LBDM) (Cambouropoulos,
[Algorithm listing: Piece Segmentation, steps 1–9; step 5 calculates z-scores from the computed values.]
As a result of its structure, the algorithm gravitates toward phrases of approximately 10 notes without imposing a hard restriction. Even though phrases of a single note might be musicologically acceptable, we intentionally prevent their occurrence since pieces with ambiguous phrase boundaries often cause the LBDM to output high likelihood values for consecutive notes in situations where one-note phrases would not be reasonable.
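Since the published listing is not reproduced here, the following sketch illustrates one plausible reading of the segmentation step: a simplified LBDM boundary–strength profile (pitch–interval and inter–onset–interval parameters only) is computed and z–scored, and the piece is split top–down at the strongest boundaries until phrases approach the ten–note target, never producing single–note phrases. The profile weights, the target length, and the function names are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def degree_of_change(x):
    """Degree of change between consecutive values of a non-negative profile."""
    x = np.asarray(x, dtype=float)
    a, b = x[:-1], x[1:]
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(a + b != 0, np.abs(a - b) / (a + b), 0.0)

def lbdm_strength(x):
    """Boundary strength s_i = x_i * (r_{i-1,i} + r_{i,i+1}), normalized to [0, 1]."""
    x = np.asarray(x, dtype=float)
    r = degree_of_change(x)
    rpad = np.concatenate(([0.0], r, [0.0]))
    s = x * (rpad[:-1] + rpad[1:])
    return s / s.max() if s.max() > 0 else s

def segment_piece(pitches, onsets, target_len=10, min_len=2):
    """Split a piece into phrases by recursively cutting at the strongest boundaries."""
    pitches = np.asarray(pitches, dtype=float)
    onsets = np.asarray(onsets, dtype=float)
    pitch_profile = np.abs(np.diff(pitches)) + 1.0   # +1 keeps repeated notes from zeroing out
    ioi_profile = np.diff(onsets)
    strength = 0.4 * lbdm_strength(pitch_profile) + 0.6 * lbdm_strength(ioi_profile)
    z = (strength - strength.mean()) / (strength.std() + 1e-9)

    boundaries = [0, len(pitches)]                   # phrase start indices plus sentinel end
    while True:
        lengths = np.diff(boundaries)
        k = int(np.argmax(lengths))
        if lengths[k] <= target_len:                 # every phrase is short enough: stop
            break
        lo, hi = boundaries[k], boundaries[k + 1]
        candidates = np.arange(lo + min_len, hi - min_len + 1)  # keep both halves >= min_len
        if len(candidates) == 0:
            break
        best = int(candidates[np.argmax(z[candidates - 1])])    # z[i-1] scores a cut before note i
        boundaries.insert(k + 1, best)
    return boundaries                                # e.g., [0, 9, 21, ..., n]
```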
Once a piece is split into phrases, each phrase must be associated with a set of characteristics—phrase input features—that allow us to describe its expressive character as phrase output features.
The input features are essentially the key–invariant melodic contour and the tempo–invariant rhythmic contour. The melodic contour is a time–series with one data point per phrase note, beginning at zero and with each subsequent value equal to the difference in semitones between consecutive note pitches. As an example, a major triad in root position would be encoded as the sequence: 0, 4, 3. The rhythmic contour is also a time–series with one data point per phrase note; its first value is one and each subsequent value is equal to the ratio between the duration of that note and the duration of the previous one.
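A minimal sketch of these two phrase input features, using the triad example above as a check:

```python
def melodic_contour(pitches):
    """Key-invariant contour: 0 followed by semitone differences between consecutive pitches."""
    return [0] + [b - a for a, b in zip(pitches[:-1], pitches[1:])]

def rhythmic_contour(durations):
    """Tempo-invariant contour: 1 followed by duration ratios between consecutive notes."""
    return [1.0] + [b / a for a, b in zip(durations[:-1], durations[1:])]

# Example: a root-position major triad (C4, E4, G4) in equal note values
assert melodic_contour([60, 64, 67]) == [0, 4, 3]
assert rhythmic_contour([1.0, 1.0, 1.0]) == [1.0, 1.0, 1.0]
```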
The intelligence built into the model follows the principle that similar–sounding melodies are more likely to be performed similarly. In practice, this is implemented by applying a k-NN algorithm that, for each input phrase, locates its most similar phrases in the training set with respect to the described features. The measure applied for determining the degree of similarity between two phrases is an implementation of the method proposed by Stammen and Pennycook (
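In code, the retrieval step could look like the sketch below. Phrases are assumed to be represented as dictionaries holding their two contours; the distance is a plain dynamic–time–warping cost standing in for the Stammen and Pennycook measure, and the contour weighting is an illustrative choice rather than the paper's.

```python
import numpy as np

def dtw_distance(a, b):
    """Standard dynamic-time-warping cost between two 1-D sequences."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def phrase_distance(p, q, w_mel=0.7, w_rhy=0.3):
    """Weighted combination of melodic and rhythmic contour distances (weights are illustrative)."""
    return (w_mel * dtw_distance(p["melodic"], q["melodic"])
            + w_rhy * dtw_distance(p["rhythmic"], q["rhythmic"]))

def k_nearest(target, training_phrases, k=1):
    """Return the k training phrases most similar to the target phrase."""
    return sorted(training_phrases, key=lambda q: phrase_distance(target, q))[:k]
```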
The phrase output features must describe the dynamics and timing of a phrase in terms that can be transposed to other pieces and contexts. Our implementation defines four of them: mean dynamic level, dynamic range, dynamic contour, and local tempo curve. These features are computed for every phrase in the training set recordings and act as references for the model's operation. When suggesting how to express a new phrase, all four features are decided by the algorithm for the new phrase using the available references.
The formal definition of each of these features depends on some metrics related to the entire piece. In a piece with
Essentially,
Using these metrics, we define the mean dynamic level ℓ
This feature (ℓ
The second output feature is the phrase dynamic range
Analogously to what happens between ℓ
Lastly, phrase contour (
This definition has two implications. The first is that we can determine the values of
The second implication of Equation 5 is that determining ℓ
Finally, the local tempo curve is defined as the function that describes how the tempo changes throughout the phrase. For each note
Where
For a suggested local tempo curve τ, one can use Equation 6 to compute the IOI of each note, since
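For illustration, the sketch below computes approximations of the four phrase output features from note–level measurements. The exact equations are defined in the text relative to piece–level statistics; here the relative mean level, the within–phrase range, a parabolic fit of the normalized loudness contour, and a note–by–note tempo curve derived from performed IOIs are assumed as stand–ins, and all names are hypothetical.

```python
import numpy as np

def phrase_output_features(note_db, onsets, score_durations_beats, piece_mean_db):
    """Approximate the four phrase output features from note-level measurements."""
    note_db = np.asarray(note_db, dtype=float)

    mean_level = note_db.mean() - piece_mean_db          # mean dynamic level, relative to the piece
    dyn_range = note_db.max() - note_db.min()            # dynamic range within the phrase

    # Dynamic contour: parabola fitted to the loudness curve after removing level and range.
    x = np.linspace(0.0, 1.0, len(note_db))
    norm = (note_db - note_db.mean()) / (dyn_range + 1e-9)
    contour_coeffs = np.polyfit(x, norm, deg=2)          # three coefficients describe the shape

    # Local tempo curve: performed IOIs compared with the nominal (score) durations.
    ioi = np.diff(onsets)
    nominal = np.asarray(score_durations_beats[:-1], dtype=float)
    local_tempo = nominal / (ioi + 1e-9)                 # beats per second (x 60 for BPM)
    return mean_level, dyn_range, contour_coeffs, local_tempo
```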
It should be highlighted that the only processing step in the modeling that required manual intervention was the onset detection for note alignment. However, since this task was partially automated and completed without making use of score information, we are confident that the entire modeling process could be carried out automatically with satisfactory results, enabling its application in the desired pedagogical context.
The evaluation is divided into numeric and perceptual analyses. The numeric analysis checks if the score–related metrics of phrasing and melody correlate with the dynamics and timing of a performance whereas the perceptual analysis verifies if synthesized performances based on modeled expression possess human–like qualities.
Eight short (approximately 50 s each) musical excerpts were recorded by a professional violinist to be used for both model generation and evaluation, making up dataset DS1.
In the numeric analysis, the suggested loudness values computed from Equation 5 were interpreted as estimations and compared against the measured values in the recorded performances of the pieces. Likewise, the suggested onset times were compared to the performed ones for each piece. Test sets were built in a leave–one–out approach, meaning that dynamics and timing suggestions were generated for each available piece using the other seven recorded pieces as training set.
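The sketch below outlines this leave–one–out loop; `predict_dynamics` is a hypothetical placeholder for the full model pipeline and is not defined here.

```python
import numpy as np

def leave_one_out_mae(pieces, predict_dynamics):
    """For each piece, train on the remaining pieces and return note-level MAEs of dynamics."""
    errors = []
    for i, test_piece in enumerate(pieces):
        train = pieces[:i] + pieces[i + 1:]
        predicted = np.asarray(predict_dynamics(test_piece["score"], train))
        performed = np.asarray(test_piece["note_db"])
        errors.append(float(np.mean(np.abs(predicted - performed))))
    return errors
```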
For the perceptual analysis, the note loudness values obtained from the model were converted into MIDI velocity values used to control the dynamics of synthesized versions of the pieces. The syntheses were made using Apple Logic Pro X's EXS24 sampler
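The text does not specify the loudness–to–velocity mapping, so the sketch below assumes a simple linear rescaling of the suggested dB values onto a MIDI velocity range; the range limits are illustrative.

```python
import numpy as np

def db_to_velocity(note_db, vel_min=20, vel_max=110):
    """Linearly rescale suggested loudness values (dB) onto a MIDI velocity range."""
    db = np.asarray(note_db, dtype=float)
    span = db.max() - db.min()
    if span == 0:
        return np.full(len(db), (vel_min + vel_max) // 2, dtype=int)
    scaled = (db - db.min()) / span
    return np.round(vel_min + scaled * (vel_max - vel_min)).astype(int)
```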
Three versions of each piece were synthesized for the evaluation, the only difference being the supplied velocity values and note onset times and durations, resulting in different dynamics and timing for each of them: one version used velocity and onset values derived from the model suggestions as described above; a second version corresponds to the expression of the performer in the recordings, as measured for usage in the training set. The third and last version serves as baseline and scientific control: it uses the same velocity for all notes, its value being the mean value used in the “human” version to minimize discrepancies in volume level, and its timing has no fluctuation, the tempo being set to the mean tempo of the “human” version. Each of the three versions of the original 8 pieces was manually divided into 3 excerpts of approximately 15 s each and their audio normalized (applying the same gain to all three versions of an excerpt to avoid modifying their relative dynamic range). Finally, the eight most complete, melodic-sounding of those 24 excerpts were selected for the evaluation.
The evaluation was conducted by means of an online survey. Participants were instructed to listen to randomized pairs of audio samples from the synthesized pieces, always consisting of two of the three existing versions of an excerpt. They were then presented with two questions for which to choose between audio samples 1 or 2: “In terms of dynamics (the intensity of volume with which notes are expressed), select which audio sample sounds most like a human performance to you.” and “Which performance did you like best?” Finally, participants were asked to rate, from 1 to 5, “How clearly do you perceive the distinction between the two audio samples?” A space for free comments was also included on each screen to encourage participants to share insights about their thought process.
A total of 20 people participated in the experiment. Recruitment was carried out by personal invitation and each participant was assisted in accessing the web page containing the survey and its instructions using their own computers and audio equipment. Each of them was asked to provide answers to 16 pairs of melodies as described above, but early abandonment was allowed. This provided a total of 305 pairwise comparisons.
Summary of perceptual evaluation participants information.
Numeric analysis results are presented in
Aggregate mean absolute errors when considering dynamics suggestions as performance predictors.
Boxplot “kNN (exact,
In the perceptual analysis, a total of 305 pairs of melodies were compared by listeners in terms of human–likeness and personal preference. The mean perceived distinction between pairs was 3.41 ± 0.13 (on a 1–5 scale, α = 0.05).
Aggregate results of perceptual survey pairwise comparisons.
Comparison | Measure | Measured
C1 | Human–likeness | 0.7500
C1 | Preference | 0.8016
C2 | Human–likeness | 0.1440
C2 | Preference | 0.1440
C3 | Human–likeness | 0.7500
C3 | Preference | 0.8378
Lastly,
Subset of perceptual survey results for musically active participants.
The large variance observed in the boxplots of model errors in the numeric analysis indicates that predictions for some sections of the pieces share similar dynamics with those performed in the reference recordings whereas other sections differ. Increasing the number of neighbors considered from one to three is effective at pruning out eccentric predictions, as can be seen by the shorter tail of the distribution, but has also been observed to reduce the overall dynamic range of the output, making renditions a bit “dull.” In fact, this effect is expected for a small dataset such as DS1, since there ought to be very few examples of sufficiently similar melodies to be selected as nearest neighbors. Consequently, in such conditions, employing a single nearest neighbor is the most promising approach for a perceptually valid output, since the most melodically–similar phrase represents the available data point most likely to have applicable expression data for a given target melody, and copying those parameters from a single sample retains the coherence between the different expressive output variables. For this reason, and considering the success of the parabolic representation as a parametric model of contour indicated by their similar error distributions, the model with
The overall higher median errors observed in all measurements indicate that with DS1 as training set the model is not accurate at predicting timing and dynamics, but since there is no single correct interpretation of a musical piece, this result is not enough to dismiss the model as a tool for suggestions of expression, hence the utility of the perceptual validation.
Regarding perceptual analysis results, typically (Katayose et al.,
A deeper investigation of the results does offer fruitful insights, however. Ratings for the measure of perceived distinction between audio clips were generally high across all comparisons. For C1, in particular, the mean value was 3.31 with a standard error of 0.12 (on a scale of 1–5). This strongly suggests that participants were able to perceive differences between the renditions, but still reached conflicting decisions. Reflecting upon this fact and contrasting the melodies present in our dataset against pieces typically found in benchmark datasets (e.g., as used by Oore et al.,
This view is reinforced by some participant comments. One of them states, after declaring preference for the deadpan rendering over the human–based one: “
From the musicians' results graph, it can be seen that the percentage of choices favoring the deadpan renditions is smaller in this subset than in the full result set, which could reflect a greater ability among the musicians to interpret the performances even out of context. Furthermore, what is encouraging in the musicians' data is that the percentage of choices favoring the modeled performance is larger than in the full set, which lends some support to our own perception that the proposed modeling approach can yield convincing results under some conditions.
For further analysis, the previously presented dataset (DS1) was complemented by the recordings of the first violin from the String Quartet number 4,
Finally, for a look at the model's performance under a large collection of data, a random sample of approximately 1 h of audio was taken from the 2017 recordings in the Maestro dataset (Hawthorne et al.,
Summary of all datasets used.
DS1 | Violin | 6′43″ | 68 | MusicXML |
DS2 | Violin | 13′01″ | 192 | MusicXML |
DS3 | Piano | 57′08″ | 2706 | MIDI |
To ensure consistently sized test sets in spite of the larger variance in piece durations, all tests on both DS2 and DS3 were effectively 10–fold cross–validations over phrases; that is, each dataset was split into 10 subsets by randomly sampling phrases without repetition, and each subset was used as the test set in a different round. As in the previous scenario, note–level mean absolute errors between performance and model predictions were the chosen metric for all modeled expressive features.
To observe how the proposed modeling approach fares against more conventional models that rely on note features rather than phrase features, we computed 41 note features from score information and derived musicological inferences using dataset DS2, and employed the resulting feature vectors for predicting note velocity values and local tempi using various algorithms as implemented in the Weka machine learning software tool, version 3.8.3 (Frank et al.,
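The baselines were run in Weka; an equivalent setup for one of them (the random forest) can be sketched with scikit–learn as below, assuming a feature matrix X of the 41 note features and a target vector y of performed velocities. This is an analogue for illustration, not the configuration actually used.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

def note_level_baseline(X, y, folds=10):
    """Cross-validated mean absolute error of a random-forest note-level velocity predictor."""
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    pred = cross_val_predict(model, X, y, cv=folds)
    return float(np.mean(np.abs(pred - y)))
```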
As an exploration of the impact of larger datasets on model performance,
Performance of note–level algorithms vs. proposed phrase–level method on DS2.
SVM | 0.4557 | 15.44 | 14.11 |
ANN | 0.3789 | 22.71 | 20.08 |
kNN | 0.5910 | 13.17 | 12.57 |
Random forest | 0.7319 | 11.18 | 10.49 |
Phrase-level kNN (ours) | 0.2956 | 16.82 | — |
Velocity predictions across notes in a violin piece vs. performed ground truth.
Lastly,
Mean absolute error distribution as a function of dataset size.
Three of the four algorithms tested at the note level were able to outperform our phrase–level model in terms of mean absolute error, and all five exhibited some prediction success when compared with the deadpan MAE of 18.02. Despite our model's poor rank, these results are encouraging, since they indicate that most of the information relevant for predicting expression that was present in the note features was retained and summarized in the phrase–level form of the dataset.
Visually inspecting the velocity values predicted by the best–performing model, using random forests, against our phrase–level predictions and ground truth values from a recording (
Results of the DS3 analysis show that MAEs drop for larger datasets but eventually stabilize at a plateau. Though the median error sits below the baseline for large datasets, the large variance (represented by the quartiles indicated as Q1 and Q3) shows that precision does not improve as much. Using a higher number of neighbors (
As we have argued and demonstrated in previous sections, modeling expression in violin performance is a challenging task in many ways. Examples observed in our first dataset include prolonged, loud notes, which we found sounded harsh in the synthesized version used for perceptual evaluation but pleasant in the original recording due to the presence of vibrato. We also met difficulties with notes having very slow attacks in the recordings, for which the placement of a crisp onset in the synthesis inevitably led to rhythmically odd melodies. In many such cases, participants in the perceptual evaluation rejected the human–based audio samples in favor of the robotic–sounding renditions. Although these findings prevented us from evaluating our modeling strategy as intended, we feel that these results provide a valuable account of the importance of preserving a cohesive set of expressive features, as well as the musical context where they appear, in order to retain the character of a performance.
Despite not having been able to predict the dynamics or the timing deviations applied by our reference performer, our modeling approach has produced some convincing suggestions of expression, at times worthy of praise by listeners in a blind setting, with considerably less training data than most state–of–the–art models and virtually no time spent on model training, thanks to the musically coherent approach of processing entire phrases rather than isolated notes.
For the desired pedagogical applications, the ability to produce musically valid expressive performances from few examples gives the model versatility, allowing students and teachers to select the most relevant reference recordings to make up a training set, for instance to study the style of a particular performer or of a specific musical genre. When contrasting phrase–level against note–level modeling, our phrase–level approach was able to achieve comparable results despite the information compression that comes from summarizing note features in terms of melodic similarity. Additionally, the smoothness inherent in the curves output by our model makes the expressive movements they represent much easier for a student to follow in real time.
Perhaps as important as the results concerning our modeling approach are our findings about the methodology of evaluation of expressive performance models for instruments other than piano, for which realistic synthesis is an issue and expression can potentially involve several variables. In those scenarios, we conclude that the modeling is best evaluated if all relevant expressive capabilities offered by the instrument are included in the sound, and preferably modeled as a group to avoid conflicting intentions in the different expressive outputs.
As melodic similarity is central to the expressive engine, expanding the reference violin datasets to include a sufficiently wide variety of melodies is a natural evolution of this work, allowing a detailed investigation of the particularities of this instrument when it comes to expression and of how much the model can be improved as a performance predictor.
As briefly mentioned above, there are a number of features, from score markings to the harmonic context of each
Lastly, the exploration of different modes of student feedback that can be provided with the outputs of this machine learning model, from auditory to visual to tactile, is an important step to understanding the functions our models should compute as well as being essential to achieving our end goal of improving how people learn and internalize expression in music.
This study was carried out in accordance with the recommendations of the British Psychological Society with written informed consent from all subjects. All subjects gave written informed consent in accordance with the Declaration of Helsinki. The protocol was approved by the Conservatoires UK Research Ethics Committee on 04/04/2017. The consent form presented to subjects read (along with experimental instructions) as follows: “Note on Participation and Data Usage. Participation in this survey is voluntary and open to anyone aged 16 or older. All responses will be anonymised, and the data collected will be presented at national and international conferences as well as published in academic journals, and may be used for subsequent research. If you decide to take part, by beginning this survey you are providing your informed consent. You will still be free to withdraw at any time. If you are interested to learn more about the results or if you would like your data removed from the project please contact the researchers.”
FO, SG, and RR contributed to the design of the model. FO, AP, and RR designed the evaluation methods. FO and SG wrote the code for the model and the perceptual evaluation. FO analyzed the data and wrote the paper.
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
The authors would like to thank all participants in the study, whose patience and feedback have helped immensely to enrich our understanding of a topic so vast as music performance.
2. MusicXML 3.1 Specification. The W3C Music Notation Community Group, 2017.
5. Violin samples from user ldk1609 at freesound.org, licensed with Creative Commons v.1.0 (