Preface

During my university studies, sight-singing and ear training was a course that gave me quite a headache. The course is fundamental, yet it is not easy for many music majors, and I was one of them (laughs). As my studies progressed, I noticed a significant problem with practicing sight-singing on one’s own: students often fail to notice their own mistakes in time. Whether the issue is pitch or rhythm, a learner’s subjective judgment is often not entirely accurate, so practicing alone is especially error-prone.

The quality of sight-singing practice depends largely on the student’s own aural ability and level of ear training. Even conscientious, diligent students whose pitch recognition is relatively weak often cannot spot problems in their sight-singing in time, even when they check against a piano. Moreover, many problems that arise in practice are not simply “right” or “wrong”; they are mixed with redundant information and subtle deviations caused by singing habits, all of which affect the overall impression on the ear. From a teacher’s point of view, scoring sight-singing is itself a multi-dimensional, comprehensive judgment.

After later learning computer technology and engineering methods, I began to re-examine this long-standing practical problem from an engineering perspective. As a classically trained music student, I could not understand why no mature, widely used system existed for automatically analyzing and scoring sight-singing. Once I reconsidered the question with both a professional music background and engineering knowledge, I gradually realized that sight-singing scoring is a task with many dimensions, highly varied situations, and very high implementation difficulty.


Research

Student Sight-Singing Habits and Problems

Long-term observation, both as a learner and in teaching practice, shows that students exhibit a series of highly regular behavioral patterns during sight-singing. Experienced teachers can usually identify these behaviors quickly by ear; however, once the behaviors are converted into audio signals and analyzed by automated systems, they often interfere significantly with recognition and scoring models built on acoustic features and time-series assumptions.

1. Backing up and re-singing after a mistake

When students realize they have made a mistake, they often instinctively correct themselves by returning to the previous phrase or even repeating part of the current one. In teaching, this behavior is generally seen as a positive sign that the student has basic self-monitoring awareness, but at the audio level it breaks the monotonic forward progress of the singing timeline and introduces backtracking, repetition, or even local loops. This nonlinear temporal structure severely disrupts the assumed one-to-one correspondence between audio and score, which makes automatic alignment and scoring difficult.
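To make the alignment problem concrete, here is a minimal sketch that works on note names rather than audio. It assumes the recording has already been transcribed into a note sequence (a hard step in its own right), and it uses Python's standard `difflib.SequenceMatcher` only as a stand-in for a real score-following algorithm. The function name `align_notes` and the toy melody are made up for illustration.

```python
from difflib import SequenceMatcher

def align_notes(score, sung):
    """Align two note-name sequences and report where they diverge.

    Returns a list of (tag, score_slice, sung_slice) for non-matching regions.
    'insert' = sung material with no counterpart in the score (e.g. a restart);
    'delete' = score material that was never sung (e.g. a skipped passage).
    """
    matcher = SequenceMatcher(a=score, b=sung, autojunk=False)
    return [
        (tag, score[i1:i2], sung[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]

score = ["C4", "D4", "E4", "F4", "G4", "A4"]
restart = ["C4", "D4", "E4"] + score   # student restarts after the third note
skipped = ["C4", "D4", "G4", "A4"]     # student leaves out E4 and F4

print(align_notes(score, restart))
print(align_notes(score, skipped))
```

In this toy case the restart shows up as an extra "insert" region and the skip as a "delete" region. Real melodies contain repeated patterns, so the same sung passage can often be matched to the score in more than one way, which is part of why this problem is hard.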

2. Pitch fluctuates around the target note with obvious trembling

Some students can get close to the target pitch but find it difficult to hold it steadily, so the pitch fluctuates up and down around the target value. This is mostly related to nervousness, insufficient breath control, or lack of confidence in the target pitch. Human listeners can usually judge from experience that the note is “basically in tune,” while automatic systems find it difficult to extract a clear and reliable pitch decision from a continuously fluctuating fundamental frequency trajectory.
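One possible way to imitate the listener's "basically in tune" judgment is to summarize the whole note with a robust statistic instead of judging every frame. The sketch below is only an illustration under stated assumptions: the f0 track is given as a list of Hz values with 0 marking unvoiced frames, A4 is tuned to 440 Hz, and the 50-cent tolerance is an arbitrary choice of mine, not an established standard. The names `hz_to_midi` and `judge_wobbly_note` are hypothetical.

```python
import math
from statistics import median

def hz_to_midi(f_hz):
    """Convert a frequency in Hz to a fractional MIDI note number (A4 = 440 Hz = 69)."""
    return 69.0 + 12.0 * math.log2(f_hz / 440.0)

def judge_wobbly_note(f0_hz, target_midi, tolerance_cents=50.0):
    """Summarize a fluctuating f0 track for a single note.

    Returns (center, spread, in_tune), where center is the median deviation
    from the target in cents, spread is the median absolute deviation around
    that center, and in_tune tells whether |center| is within the tolerance.
    Returns None if the note contains no voiced frames.
    """
    cents = [100.0 * (hz_to_midi(f) - target_midi) for f in f0_hz if f > 0]
    if not cents:
        return None
    center = median(cents)
    spread = median(abs(c - center) for c in cents)
    return center, spread, abs(center) <= tolerance_cents

# A note aimed at A4 that wobbles by up to about a third of a semitone.
frames = [440.0 * 2 ** (c / 1200) for c in (-30, 20, 35, -25, 10, 0, 30, -15)]
print(judge_wobbly_note(frames, target_midi=69))
```

Reporting the spread separately keeps the two questions apart: whether the student found the pitch, and how steadily they held it.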

3. Approaching the target pitch through continuous glissando

Some students will gradually slide from a lower or higher pitch toward the target pitch to complete the singing, trying to avoid the technical difficulty of directly hitting the pitch. In this case, the pitch change presents as a continuous trajectory rather than a clear discrete jump. Although they may eventually reach the correct pitch, the entire pitch structure deviates significantly from the score requirements in both time and form, making note boundaries and pitch determination ambiguous.

4. Overall or local instability in rhythm, with speed changing according to circumstances

Students often make subjective adjustments to timing during sight-singing based on their psychological state and difficulty level. For example, the speed significantly increases in nervous or unfamiliar passages, deliberately slows down at uncertain or technically difficult points, and even rushes beats in simple passages. This indicates that students usually still have basic beat sense, but rhythm control is easily affected by cognitive load and emotional factors, causing the timeline to be compressed or stretched locally.

5. Sudden interruption of vocalization during singing

In some cases, students suddenly stop singing at a position that is not the natural end of a phrase or passage. The interruption usually stems from a memory lapse, a failed pitch judgment, or loss of rhythmic control rather than from having finished the piece. In the audio it appears as an unexpected termination of the time series, and an automatic analysis system has difficulty telling which of these causes is responsible.

6. “Stuck” during ascending or descending pitch, unable to reach target note

Some students move in the correct pitch direction but stall noticeably before reaching the target pitch and end on a pitch that is close to the target without actually hitting it. This may be caused by vocal range limitations, insufficient breath support, or nervousness. The pitch trajectory enters a plateau prematurely, producing a systematic deviation.

7. Vocal register shift during singing

Within the same melodic line, students may experience obvious vocal mechanism switching due to register changes or unstable technical control, causing sudden changes in timbre, loudness stability, and overtone distribution. Even if the pitch value itself does not change much, this change in vocal state is still very significant acoustically and easily misjudged by automatic systems as pitch or stability problems.

8. Obvious problems in the attack phase

Attack problems are concentrated in the moment when notes begin, possibly manifesting as delayed attack, vague attack, or initial pitch deviation. Human listeners can usually automatically ignore these unstable moments in the overall musical context, while automatic systems are often highly sensitive to attack position and easily use it as a basis for judging pitch or rhythm errors.

9. Improper handling of connection relationships between notes

Students may add pauses where the score does not call for them, break up notes that should be connected, or inappropriately connect notes that should be separated. This usually reflects an insufficient understanding of rhythmic structure and melodic coherence, or the habit of reducing sight-singing to reading notes one by one.

10. Overall beat alignment error, but internal proportional relationships basically correct

In some performances, students can maintain the relative duration relationships between beats relatively well, but the overall tempo is too fast or too slow, causing the beat landing points to shift overall from the external reference. This indicates that students have a certain sense of relative rhythm but have not yet established a stable absolute tempo reference. In automatic scoring, this type of problem needs to be distinguished from true rhythmic chaos.
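As a rough illustration of how a global tempo offset might be separated from errors in the internal proportions, the sketch below compares each note's share of the total duration instead of its absolute length. It assumes the sung notes have already been matched one-to-one with the score, which is itself a strong assumption, and the function name `rhythm_profile` is hypothetical.

```python
def rhythm_profile(score_beats, sung_seconds, score_bpm):
    """Separate a global tempo offset from errors in internal proportions.

    score_beats  : notated duration of each note, in beats
    sung_seconds : measured duration of each sung note, in seconds
    score_bpm    : reference tempo in beats per minute

    Returns (tempo_ratio, max_share_error). tempo_ratio > 1 means the student
    sang faster than the reference tempo. max_share_error is the largest
    difference between a note's share of the total duration in the score and
    in the performance (0 means the proportions are identical).
    """
    if len(score_beats) != len(sung_seconds) or not score_beats:
        raise ValueError("sequences must be non-empty and aligned note-for-note")
    total_beats = sum(score_beats)
    sung_total = sum(sung_seconds)
    expected_total = total_beats * 60.0 / score_bpm
    tempo_ratio = expected_total / sung_total
    score_share = [b / total_beats for b in score_beats]
    sung_share = [s / sung_total for s in sung_seconds]
    max_share_error = max(abs(a - b) for a, b in zip(score_share, sung_share))
    return tempo_ratio, max_share_error

score_beats = [1, 0.5, 0.5, 2]
print(rhythm_profile(score_beats, [0.8, 0.4, 0.4, 1.6], score_bpm=60))  # uniformly faster
print(rhythm_profile(score_beats, [1.0, 1.0, 1.0, 1.0], score_bpm=60))  # evened-out durations
```

In the first call the tempo ratio is about 1.25 while the proportion error stays near zero, which matches the "too fast but internally correct" case. In the second call the total duration is right but the proportion error is large, which is what genuine rhythmic distortion looks like under this measure.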

11. Relative intervals correct, but overall pitch systematically shifted

This is an extremely typical type of problem in sight-singing, manifesting as the entire melody section being shifted up or down by a semitone or whole tone, while the internal interval relationships and melodic direction remain basically correct. This reflects that students have a good grasp of relative pitch relationships, but have deviations in the establishment of tonal sense or starting pitch reference.

12. Slow pitch drift within long notes

During sustained long notes, students may be relatively accurate at the start but gradually go flat or sharp as the note continues. This phenomenon is different from momentary trembling and is more directly related to breath control and vocal stability, posing additional challenges for pitch modeling within individual notes.
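One simple way to tell slow drift apart from trembling is to fit a straight line to the pitch deviation over the length of the note: trembling largely cancels out, while drift leaves a clear slope. The sketch below is a plain least-squares fit under my own assumptions: the input is a list of frame times in seconds and the deviation from the target in cents, and the name `drift_cents_per_second` is hypothetical.

```python
def drift_cents_per_second(times_s, cents):
    """Least-squares slope of pitch deviation (cents) against time (seconds).

    A negative slope means the note is going flat, a positive one sharp.
    """
    n = len(times_s)
    if n < 2 or n != len(cents):
        raise ValueError("need at least two aligned samples")
    mean_t = sum(times_s) / n
    mean_c = sum(cents) / n
    var_t = sum((t - mean_t) ** 2 for t in times_s)
    if var_t == 0:
        raise ValueError("time stamps must not all be equal")
    cov = sum((t - mean_t) * (c - mean_c) for t, c in zip(times_s, cents))
    return cov / var_t

# A 2-second note sampled every 0.1 s: a 10-cent wobble on top of a steady
# fall of 15 cents per second (synthetic data for illustration).
times = [k * 0.1 for k in range(21)]
wobble = [10.0 if k % 2 == 0 else -10.0 for k in range(21)]
cents = [-15.0 * t + w for t, w in zip(times, wobble)]
print(round(drift_cents_per_second(times, cents), 2))
```

For this synthetic note the fitted slope comes out close to -15 cents per second despite the wobble. Any threshold for "too much drift" would still have to be chosen by the teacher or calibrated on real recordings.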

13. Pitch and rhythm are separately correct, but not synchronized in time

Students often know cognitively what pitch and rhythm pattern should come next, but when actually vocalizing, the timing of pitch switching does not accurately fall on the corresponding beat position. Teachers can make error-tolerant judgments through overall musical sense, while for automatic systems, pitch sequences and rhythm sequences cannot be aligned and are easily judged as multiple errors simultaneously.

14. Internal beat structure flattened or smoothed

Students may maintain the correct number of beats and total duration macroscopically, but the durations of notes within beats tend to average out, and the originally present long-short contrast and hierarchical structure disappear. Aurally, the rhythm lacks hierarchy, but from an automatic analysis perspective, the total duration may be close to correct, making this type of structural problem difficult to identify.

15. Accompanying vocalization interferes with main melody judgment

During singing, students may add grace notes, glissandos, or tail notes not marked in the score before and after notes, or sing a single note as a compound structure of “main note plus auxiliary note.” This is often not intentional ornamentation but the result of imprecise pitch control. Teachers usually automatically ignore these “rough edges,” while machine systems may identify them as extra notes or wrong notes.

16. Incorrect accent position

Even when the pitch sequence and durations are basically correct, students may fail to place the accent on the appropriate beat or note. This type of problem reflects a deviation in musical understanding rather than a pure pitch or rhythm error, and it is difficult to model separately in systems that rely only on acoustic features.

17. Local “memorization-style” jumps

When students experience a break in memory, they may directly skip a small section in the middle or suddenly jump from the current note to a position further back. From the audio surface, these pitches and rhythms may still be locally valid, but are not continuous in score logic, severely interfering with automatic alignment.

18. Mixed use of note name or solfège systems

Some students unconsciously switch between the fixed-do and movable-do systems while singing, especially when key changes or accidentals appear. Teachers can quickly recognize that the student is confusing the two systems, while automatic systems can only observe systematic, non-random pitch shifts.

19. Unreasonable breathing point placement

Students may breathe at positions not allowed by the score, artificially truncating a complete note. This type of problem is not a rhythmic calculation error but interference with musical structure by physiological behavior. Automatic systems can often only view it as a note ending prematurely.

20. Pitch features missing due to low volume

Some notes are sung extremely lightly; although the pitch itself may be correct, the signal-to-noise ratio is too low, causing pitch estimation failure or direct feature loss. Teachers can usually supplement auditory information by experience, while machines find it difficult to do so.

21. Strategic “abandonment of control” after continuous errors

After a run of errors in the preceding section, some students enter a psychological state of reduced control precision: the pitch direction is roughly preserved, but the rhythm becomes loose and stability drops significantly. This is a degradation that sets in at a particular stage because of a change in psychological state. Automatic systems can often observe only the overall decline in quality and find it difficult to identify the cause.

| No. | Problem Type | Typical Manifestation | Understanding from Teaching Perspective | Impact on Automatic Analysis/Scoring |
| --- | --- | --- | --- | --- |
| 1 | Pitch fluctuates, voice trembles | Pitch fluctuates up and down around target; lacks stable residence point | Nervousness, insufficient breath control, or lack of confidence | Fundamental frequency difficult to converge, difficult to judge if target pitch is hit |
| 2 | Sliding from one pitch to another | Sliding from low pitch to target; or sliding down from high pitch | Using continuous change to avoid directly hitting pitch | Pitch presents continuous trajectory, violating discrete pitch assumption |
| 3 | Cannot reach high/low notes | Stops before approaching target; appears “stuck” | Vocal range limitation, nervousness, or insufficient breath | Pitch trend correct but not reaching target, forming systematic deviation |
| 4 | Vocal register shift | Timbre, loudness, overtone distribution sudden change | Vocal mechanism switching | Acoustic feature mutation, easily misjudged as pitch or stability error |
| 5 | Attack problems | Delayed attack; vague; initial pitch deviation | Human hearing tolerant and automatically ignores | System highly sensitive to attack, prone to misjudgment |
| 6 | Relative intervals correct but overall pitch shifted | Entire melody section shifts up or down | Tonal sense or starting pitch reference error | Systematic pitch shift rather than random error |
| 7 | Pitch drift in long notes | Attack accurate; sinks or floats during sustain | Insufficient breath and vocal stability | Pitch slowly changes over time |
| 8 | Decorative or accompanying vocalization interference | Extra grace notes, glissandos, tail notes | Imprecise pitch control “rough edges” | Easily identified as extra notes or wrong notes |
| 9 | Mixed note name/solfège systems | Mixed fixed-do and movable-do systems | Cognitive system confusion | Systematic, non-random pitch shifts |
| 10 | Volume changes causing reduced pitch visibility | Extremely light singing; low signal-to-noise ratio | Teachers can supplement by experience | Pitch estimation failure or feature loss |

| No. | Problem Type | Typical Manifestation | Understanding from Teaching Perspective | Impact on Automatic Analysis/Scoring |
| --- | --- | --- | --- | --- |
| 1 | Sudden repetition from a section after mistake | Returning to previous sentence to re-sing; local multiple repetitions | Self-monitoring and correction awareness | Timeline retreat or loop, destroying linear alignment |
| 2 | Rhythm unstable, speed suddenly fast or slow | Acceleration, slowing down, dragging, rushing beats | Subjective time adjustment | Local rhythmic stretching or compression |
| 3 | Sudden stop during singing | Abruptly stops at non-phrase positions | Memory or judgment failure | Unexpected time series termination |
| 4 | Connection problems between notes | Unnecessary pauses; excessive segmentation | Insufficient coherence understanding | Note boundaries inconsistent with score |
| 5 | Beat alignment error but proportions correct | Overall too fast or too slow | Relative rhythm sense exists | Need to distinguish overall shift from chaos |
| 6 | Pitch and rhythm separately correct but not synchronized | Switching points misaligned with beats | Execution not synchronized | Pitch sequence and rhythm sequence difficult to align |
| 7 | Internal beat structure flattened or smoothed | Durations within beats tend to average | Insufficient rhythmic hierarchy | Total duration correct but structure wrong |
| 8 | Incorrect accent position | Accent not on appropriate position | Musical understanding deviation | Accent difficult to model separately |
| 9 | Local “memorization-style” jumps | Skipping sections and jumping directly | Memory break | Score logic not continuous |
| 10 | Unreasonable breathing point position | Illegal breathing; truncating notes | Physiological behavior interfering with structure | Easily judged as premature ending |
| 11 | Strategic abandonment after continuous errors | Reduced control precision in later sections | Psychological state degradation | Stage-wise quality changes difficult to identify |

Sight-Singing Evaluation Criteria

Given the problems above, when we score sight-singing practice and tests we weigh both the problems and the basic abilities a student shows. For example, if a student’s starting pitch is off but the interval relationships throughout the singing are very accurate, a teacher can deduct points as appropriate for that one problem. A system that simply compares each sung pitch with the score, however, would find every note inaccurate and give a very low score.

In addition, the evaluation criteria should also be multi-dimensional and multi-angled. In actual teaching, we generally use the following standards as sight-singing assessment and scoring criteria:

  • Pitch 40%
  • Rhythm 30%
  • Fluency/Completeness 20%
  • Expression/Style 10%
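As a small worked example of how these weights combine, the sketch below computes a weighted total. The weights come from the list above; the 0 to 100 scale for each dimension, the dictionary keys, and the function name `overall_score` are my own assumptions for illustration.

```python
# Weights taken from the assessment criteria listed above.
WEIGHTS = {"pitch": 0.40, "rhythm": 0.30, "fluency": 0.20, "expression": 0.10}

def overall_score(sub_scores):
    """Combine per-dimension scores (each assumed to be 0-100) into a weighted total."""
    missing = set(WEIGHTS) - set(sub_scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0 <= sub_scores[name] <= 100:
            raise ValueError(f"{name} score must be between 0 and 100")
    return sum(WEIGHTS[name] * sub_scores[name] for name in WEIGHTS)

# 0.40 * 80 + 0.30 * 90 + 0.20 * 70 + 0.10 * 60 = 32 + 27 + 14 + 6 = 79
print(round(overall_score({"pitch": 80, "rhythm": 90, "fluency": 70, "expression": 60}), 2))
```

The arithmetic is trivial; the difficult part, as the rest of this section argues, is producing trustworthy sub-scores for each dimension in the first place.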

I. Pitch (40%)

Pitch is the most core element in sight-singing ability, but its evaluation should not be limited to “whether individual pitches are sung accurately,” but should comprehensively consider pitch stability, interval relationships, and overall tonal perception ability. This dimension can be subdivided into the following aspects:

  • Individual pitch accuracy: The degree of deviation between the actual pitch sung by the student and the target pitch, including persistent sharp or flat situations.
  • Pitch stability: Whether the pitch remains stable during sustained vocalization, and whether there are obvious problems of trembling, fluctuation, or inability to reside at the target pitch.
  • Interval relationship correctness: Whether the interval relationships between adjacent notes are correct, and whether the interval structure remains accurate even if there is a systematic shift in overall pitch.
  • Glissando and imprecise arrival: Whether the student slides from one pitch to the next instead of hitting the target pitch directly.

From a teaching perspective, if a student’s overall pitch is shifted but the interval relationships are accurate, it can be considered that they have good relative pitch ability and points can be deducted accordingly; in engineering analysis, this situation will cause multiple pitch points to mismatch the score and requires additional rules or models to distinguish.
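One simple "additional rule" of this kind is to estimate a single global offset first and judge the notes only after removing it. The sketch below is one possible approach, not an established method: it assumes sung and notated pitches are already paired note-for-note and expressed as MIDI note numbers (semitones), and it uses the median difference as the offset estimate. The name `split_transposition` is hypothetical.

```python
from statistics import median

def split_transposition(score_midi, sung_midi):
    """Split pitch error into a global offset and per-note residuals (in semitones).

    The offset is the median of (sung - score), so a few wrong notes do not
    distort it. Residuals are what remains after the offset is removed.
    """
    if len(score_midi) != len(sung_midi) or not score_midi:
        raise ValueError("sequences must be non-empty and aligned note-for-note")
    diffs = [s - t for s, t in zip(sung_midi, score_midi)]
    offset = median(diffs)
    residuals = [d - offset for d in diffs]
    return offset, residuals

score = [60, 62, 64, 65, 67]   # C4 D4 E4 F4 G4
shifted = [61, 63, 65, 66, 68]  # everything one semitone sharp
one_slip = [61, 63, 66, 66, 68]  # same shift, plus one wrong note

print(split_transposition(score, shifted))
print(split_transposition(score, one_slip))
```

A naive note-by-note comparison would mark all five notes wrong in both cases. With the offset separated out, the first performance has no residual errors and the second has exactly one, so the offset itself can be penalized once as a tonal reference problem.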

II. Rhythm (30%)

The rhythm dimension mainly examines students’ understanding and control ability of temporal structure. The evaluation focus is not on “whether to completely mechanically align with the beat,” but on the stability and rationality of overall rhythmic sense. This dimension can be subdivided into:

  • Beat accuracy: Whether note starting times roughly fall near expected beat points, and whether there is systematic rushing or dragging.
  • Rhythm stability: Whether the overall tempo remains relatively stable, and whether there are obvious fluctuations in speed.
  • Duration relationship correctness: Whether the duration proportions between different notes are correct, and whether the rhythmic structure remains consistent even if the overall tempo is too fast or too slow.
  • Pause and prolongation handling: Whether rhythmic elements such as rests and sustained notes are correctly understood and executed.

In actual teaching, teachers often focus more on whether the rhythmic structure is correct; in automatic analysis, fluctuations in the timeline directly affect note alignment and scoring accuracy.

III. Fluency and Completeness (20%)

This dimension mainly reflects students’ ability to grasp the sight-singing content as a whole, focusing on whether the singing process is coherent and whether the score content is completely presented. Specifically, it includes:

  • Singing continuity: Whether there are frequent interruptions, pauses, or hesitations during singing.
  • Retreat and repetition: Whether the student, after a mistake, returns to the previous phrase or measure and sings it again, so that the temporal structure no longer progresses monotonically.
  • Completeness: Whether the student sings through to the end of the score, or gives up midway or stops abruptly.
  • Self-correction method: Whether correction is made through an obvious interruption or retreat, or whether the student carries on in a relatively natural way.

This dimension is often viewed in teaching as a reflection of “proficiency” and “psychological stability,” while in engineering systems, it is directly related to whether alignment algorithms and overall scoring processes can execute normally.

IV. Expression and Style (10%)

The expression dimension is mainly used to distinguish between “merely completing note tasks” and “having musical expression awareness” in singing. Its scoring should maintain relative flexibility and avoid excessive subjectivity. Main considerations:

  • Dynamics and timbre control: Whether there is basic awareness of dynamic changes, and whether timbre is overly rigid or monotonous.
  • Breathing and phrasing awareness: Whether breathing occurs at appropriate positions, and whether basic phrase structure is reflected.
  • Style consistency: Whether singing conforms to basic musical style requirements and avoids obviously inappropriate handling methods.

This dimension is usually difficult to quantify accurately in automatic scoring, so it is more often used for teaching evaluation or as an auxiliary weighting item.

Research Summary and Reflection

A systematic review of my own university-level sight-singing learning, combined with reading the relevant academic literature, reveals a fairly obvious and regrettable gap: within the considerable body of research on sight-singing teaching and training, few studies systematically summarize and classify the specific problems that students encounter when actually sight-singing, and even fewer offer a detailed, operational analysis of sight-singing assessment and evaluation criteria.

From the perspective of existing research content, current academic discussions related to sight-singing teaching are more concentrated in two directions: first, psychological factors in students’ sight-singing learning process, such as nervous emotions, stage anxiety, self-efficacy, etc.; second, training methods and teaching strategies centered on pitch and rhythm, such as interval imitation, rhythm pattern training, auditory memory reinforcement, etc. This type of research has clear value in teaching practice, but its focus often stays at the level of “how to train” and “how to alleviate problems,” and rarely delves into more basic and critical questions such as “in what specific ways students actually make mistakes,” “how errors manifest at temporal structure and acoustic levels,” and “how these errors should be distinguished and evaluated in assessments.”

Looking further, sight-singing is a highly comprehensive ability affected by multiple factors, including pitch control, rhythm stability, temporal continuity, the mapping between sight-reading and hearing, and immediate error-correction behavior. However, in most teaching research and assessment practice, these dimensions are collapsed into macro indicators such as “whether pitch is correct” and “whether rhythm is accurate,” with evaluation results relying largely on teachers’ overall listening impression and experiential judgment and lacking a clear, reproducible analytical framework. This evaluation method may work among experienced teachers, but in standardized teaching, large-scale assessment, and automated analysis scenarios, its limitations are particularly prominent.

Based on the above observations, it is reasonable to reflect: Does current sight-singing teaching research to a certain extent show tendencies toward “coarse-grained problem description” and “experiential evaluation systems”? While emphasizing training effectiveness and psychological adjustment, is systematic characterization and structured analysis of students’ actual singing behavior itself being neglected? Without clear understanding of students’ common error patterns and clear definition of how different errors should be weighed in assessments, the fairness, interpretability, and targeting of teaching feedback in sight-singing assessments will all be limited.