As a team of four, we attempted to teach Nao to play the glockenspiel over the course of about six weeks. The work included visual detection of the instrument and playing sticks, grasp planning to retrieve the sticks, and a simple method for learning the ideal joint configuration for each note once Nao was ready to play. The visual estimate of the instrument's pose, combined with an IK solution, gave a good initial guess of which note would be played, and audio processing allowed Nao to determine when the correct note was played.
To locate the sticks, Nao searched first for the red tips and, given the size of each, could estimate the location of the handles well enough. From this, Nao would approach the sticks, reach for the handles with IK, and attempt to grasp them. Since Nao's grippers are neither very reliable nor very sensitive, it would then hold each stick above its head to verify visually that a stick was held in each hand and to estimate the actual transform from each hand to the end of its stick (this varied each time the sticks were grasped, due to small differences in the IK solutions, slipping of the sticks in the grippers, etc.). Having an accurate estimate of the tool endpoints is crucial for manipulating them effectively.
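One way to turn the apparent size of a red tip into a 3D position is the pinhole camera model, sketched below. This is an illustrative assumption rather than the method we necessarily used: the function names, the focal length, the principal point, and the tip diameter are all hypothetical values chosen for the example.

```python
def distance_from_apparent_size(focal_length_px, real_diameter_m, apparent_diameter_px):
    """Pinhole model: distance = focal length * real size / apparent size."""
    if apparent_diameter_px <= 0:
        raise ValueError("apparent diameter must be positive")
    return focal_length_px * real_diameter_m / apparent_diameter_px


def back_project(u, v, depth_m, focal_length_px, cx, cy):
    """Convert a pixel (u, v) at a known depth into camera-frame coordinates.

    (cx, cy) is the principal point in pixels (assumed known from calibration).
    """
    x = (u - cx) * depth_m / focal_length_px
    y = (v - cy) * depth_m / focal_length_px
    return x, y, depth_m


# Hypothetical numbers: a 3 cm tip that appears 30 px wide with a 600 px focal length.
depth = distance_from_apparent_size(600.0, 0.03, 30.0)
tip_xyz = back_project(420.0, 240.0, depth, 600.0, 320.0, 240.0)
```

With both tip positions known, the handle location could then be inferred from the known stick length and the observed stick direction in the image.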
Sticks in hand, Nao then moved on to visually locate the instrument, identify a good place to stand based on the instrument's pose, and visually estimate the location of the center of each note. Intuitively, this should be enough to begin playing, but in practice the small errors in the visual estimates of both the instrument and the sticks, together with small errors in Nao's own motion planning, mean there is still more work to do. The IK solver provided with Nao works reasonably well, but its results are inconsistent, so the only hope of reliably playing the same note twice (using the provided naoqi APIs, anyway) comes from finding an ideal joint configuration for each arm and moving there directly.
To find such a configuration for each note, an initial guess is made using IK. Given the transform from the hand to the end of the playing stick, the hand is moved to a position above the instrument from which the stick is expected to make contact when the wrist is rotated. From the visual estimate of the instrument, it is known in advance which note should be heard if the strike succeeds, so if playing from this position produces a note higher or lower than expected, we simply command Nao to adjust the hand position slightly to the left or right and try again. However, because the notes on the instrument are very small, the hand position may also be too far forward or back, which can result in either missing the instrument entirely or hitting one of the support posts that hold the notes in place. These outcomes can also be detected, but they don't tell us in which direction the error lies. When Nao retries a note, it samples from an elliptical area around the (estimated) center of the target note. If the note is missed or a post is hit, one axis of the ellipse is widened, while if the resulting sound is just a higher or lower note, the other axis is widened (but in only one direction). This allows Nao to incrementally expand its search space for the ideal playing position. In some cases the search space can expand too far, in which case it is reset and the search begins again from the initial guess.
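The expanding elliptical search described above can be sketched roughly as follows. This is a minimal illustration, not our actual implementation: the outcome labels, the `SearchEllipse` class, the `play_at` callback, the step sizes, and the convention that higher notes lie in the +x direction are all assumptions made for the example.

```python
import math
import random

# Hypothetical outcome labels; on the robot these would come from audio processing.
HIT, TOO_LOW, TOO_HIGH, MISSED_OR_POST = "hit", "too_low", "too_high", "missed_or_post"


class SearchEllipse:
    """Sampling region around the estimated note center (meters, assumed)."""

    def __init__(self, cx, cy, initial=0.002, step=0.002, max_extent=0.02):
        self.cx, self.cy = cx, cy
        self.initial, self.step, self.max_extent = initial, step, max_extent
        self.reset()

    def reset(self):
        # left/right: extents along the row of notes; depth: forward/back extent.
        self.left = self.right = self.depth = self.initial

    def sample(self, rng):
        # Draw a point from the unit disk, then scale each half independently.
        r = math.sqrt(rng.random())
        theta = rng.uniform(0.0, 2.0 * math.pi)
        ux, uy = r * math.cos(theta), r * math.sin(theta)
        sx = self.right if ux >= 0 else self.left
        return self.cx + ux * sx, self.cy + uy * self.depth

    def update(self, outcome):
        if outcome == MISSED_OR_POST:
            self.depth += self.step   # direction unknown: widen both ways
        elif outcome == TOO_LOW:
            self.right += self.step   # heard a lower note: look further toward +x
        elif outcome == TOO_HIGH:
            self.left += self.step    # heard a higher note: look further toward -x
        if max(self.left, self.right, self.depth) > self.max_extent:
            self.reset()              # search grew too far: start over


def find_playing_position(play_at, ellipse, max_tries=200, rng=None):
    """Sample positions until play_at(x, y) reports HIT; return that position or None."""
    rng = rng or random.Random()
    for _ in range(max_tries):
        x, y = ellipse.sample(rng)
        outcome = play_at(x, y)
        if outcome == HIT:
            return x, y
        ellipse.update(outcome)
    return None
```

In this sketch, `play_at` stands in for the whole "move the hand, rotate the wrist, listen" cycle. Once a position produces the target note, the corresponding arm joint angles would be stored and reused directly instead of calling the IK solver again.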
This search process is shown here: