Music playing skills

Cross-posted to openconversational.ai and OpenVoiceOS forum:

Y’all know I am hyper-focused on music playing. I have worked with:

- Mycroft: Too bad they went belly-up - but what a trailblazer :)),
- Minimy:  Thanks to Ken S. for all he did, but it's too much to maintain, and English-only is an issue,
- OVOS:    I could not get music to play, but was told "Just wait for ovos-media ... it's coming ..." I patiently wait ...,
- Neon:    I got a skill to find music and it returned the correct data, but it would not play.  But ovos-media is coming to Neon too, right?

Let me ask, what fueled the need to switch to ovos-media, and how will that help slay the Dragon of Complexity?
I appreciate all everyone has done, but I read https://github.com/OpenVoiceOS/ovos-media?tab=readme-ov-file#ocp-fallback
and to me it seems we still have the Dragon of Complexity. I contend we don’t need an abstracted way to “fall back”. Rather, if music is requested, do this:

  • Search local music; if none is found,
  • Search the Internet. No other fallback is needed. People definitely don’t want information from Wikipedia, etc.
    They want music, and searching the Internet will find something, every fricking time: the user requested music, so some type of music plays (see the sketch below).
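
To show how little logic that flow needs, here is a minimal sketch. It assumes an mpd server on localhost:6600, as the linked mpc_client.py does, and the python-mpd2 library; search_internet_music() is a hypothetical placeholder for whatever Internet source is configured, not a real OVOS or Minimy API.

```python
from mpd import MPDClient   # python-mpd2, the same player the linked mpc_client.py drives


def search_internet_music(query: str) -> list[str]:
    """Hypothetical placeholder: return stream URLs from whatever Internet
    source is configured (radio search, streaming service, ...)."""
    return []


def handle_music_request(query: str) -> None:
    client = MPDClient()
    client.connect("localhost", 6600)
    # Step 1: search the local library across any tag (artist, album, title, ...).
    local_hits = client.search("any", query)
    uris = [track["file"] for track in local_hits]
    if not uris:
        # Step 2: nothing local, so search the Internet. That is the only fallback.
        uris = search_internet_music(query)
    if uris:
        client.clear()
        for uri in uris:
            client.add(uri)
        client.play()
```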

Is there a way to make the new ovos-media way, way simpler? Here is some simpler code that fits: https://github.com/mike99mac/minimy-mike99mac/blob/main/skills/user_skills/mpc/mpc_client.py

I would like to help.

In short, we need a more flexible framework to plug in different playback backends and do better with ambiguous requests. The previous CommonPlay and OCP implementations work pretty well where the QT GUI is available, but weren’t really designed to be as flexible as we wanted.

I think this is mostly true, but there are some complications around disambiguation, i.e. “put on the bedroom lights” and “put on something by Lights” are syntactically similar but indicate very different intents from the speaker. Within a playback request, there is also some necessary disambiguation; I could say “play Alfred Hitchcock” and there’s probably some music match somewhere, but if I have a movie collection available then there’s likely a better match there.
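
One way to picture that kind of disambiguation is to score the bare media phrase against each local catalogue and keep the best hit. The sketch below is purely illustrative: the catalogue contents, the difflib scoring, and the 0.4 floor are all made up, and this is not how OCP/ovos-media actually classifies requests.

```python
import difflib

# Hypothetical local catalogues; in reality these would come from the music and video skills.
MUSIC_TITLES = ["Lights - Siberia", "Bedroom Pop Mix", "Psycho Killer - Talking Heads"]
MOVIE_TITLES = ["Psycho (Alfred Hitchcock)", "The Birds (Alfred Hitchcock)"]


def best_score(phrase: str, titles: list[str]) -> float:
    """Best fuzzy similarity between the requested phrase and any catalogue title."""
    return max((difflib.SequenceMatcher(None, phrase.lower(), t.lower()).ratio()
                for t in titles), default=0.0)


def classify_media(phrase: str) -> str:
    music, movie = best_score(phrase, MUSIC_TITLES), best_score(phrase, MOVIE_TITLES)
    if max(music, movie) < 0.4:           # nothing local looks close (made-up floor)
        return "music via Internet search"
    return "movie" if movie > music else "music"


print(classify_media("Alfred Hitchcock"))   # the movie catalogue wins here
```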

Under the hood, maybe a little, but @JarbasAl would know best about that. There may be ways to simplify the interface for skills to make it easier to integrate with.

I appreciate all the time you’ve put into music playback thus far; I think you may even have more time into ovos-media than I have (I’ve still yet to get that tested and validated under Neon but it’s on my list). I don’t have much to suggest here until I get ovos-media integrated in Neon and have something ready to evaluate/test.

Daniel, thanks for the feedback.

… there are some complications around disambiguation, i.e. “put on the bedroom lights” and “put on something by Lights” are syntactically similar but indicate very different intents from the speaker.

Great discussion. I contend that our users should not expect music to be played with requests starting with “Put on”. If it doesn’t work, they will learn quickly.

My first voice platform was an '08 Ford Focus with “Microsoft Sync” - one of the first voice input systems in a car. I thought it was so cool, but its vocabulary was extremely limited. However, a limited vocabulary is easier to learn than a more complicated one.

… I could say “play Alfred Hitchcock” and there’s probably some music match somewhere, but if I have a movie collection available then there’s likely a better match there.
Our users should not expect a movie to be played if the request starts with “Play …”. Rather, we should teach them to start with “Play video” or “Play movie”.

If we agree that one of the two top uses of Personal Voice Assistants is music playing, then when our users say “Play …”, they should expect music.

Following that vein, if one of the two most-used features is question answering, then shouldn’t any request starting with “Question” be sent to a question-answering skill? I just asked “Question, what is 2 times 2” and got a failure.

… until I get ovos-media integrated in Neon and have something ready to evaluate/test.

… waiting patiently …

Common query definitely also needs some love; that “question” example is a good feature request!


So could I ask that any request starting with “Question” go to the main question-answering skill? If there is a more formal way to file a feature request, I will follow the instructions. -thanks -Mike Mac


There is no “question answering skill”; there is the common query framework, which sends the question to all query skills and selects the best answer.

In this case, the common query framework is apparently not recognizing that as a question; I think it only checks for questions starting with what/when/why/who/how.
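
For what it’s worth, here is a rough sketch of the routing being asked for: treat an explicit “Question” prefix as a forced hand-off to common query, in addition to the question words mentioned above. The function name and return values are hypothetical and not part of the real common query API.

```python
import re

# Question words the common query framework reportedly checks for,
# plus an explicit "question" prefix as proposed above.
QUESTION_WORDS = {"what", "when", "why", "who", "how"}


def route_utterance(utterance: str) -> str:
    """Hypothetical router: decide whether an utterance goes to common query."""
    text = utterance.strip().lower()
    # "Question, what is 2 times 2" -> strip the prefix and force common query.
    match = re.match(r"question[,:]?\s+(.*)", text)
    if match:
        return f"common_query: {match.group(1)}"
    first_word = text.split()[0] if text else ""
    if first_word in QUESTION_WORDS:
        return f"common_query: {text}"
    return f"other skills: {text}"


print(route_utterance("Question, what is 2 times 2"))  # -> common_query: what is 2 times 2
```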

To provide another counter-example, “play Space Invader” could also be an ambiguous request. I definitely agree that certain requests are more likely media-related than others, but I think we always reach some point where we want some further parsing.

I agree with this. Some of the documentation we have includes a set of known valid phrases to show what the various default skills are and how to use them. I also believe, though, that we should try to handle as many user requests as we can, since “I don’t know” basically indicates a failure of the assistant.

From a technical perspective, we do this by using confidence levels; “play X” should always return a high-confidence playback response, whereas “put on Y” would maybe be medium or low confidence depending on whether or not “Y” looks like something media-related.
I haven’t thought specifically about “play” without something we know how to play back, but I could be convinced that it should always default to music if there’s no explicit other media requested (i.e. a movie or game). Maybe a specific skill could be specified as the default in this case.
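
A rough sketch of that confidence idea is below, with made-up thresholds and keyword hints; real OCP/ovos-media skills report confidence through their own API, which this does not try to reproduce.

```python
from enum import Enum


class Confidence(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Hypothetical hints that make an ambiguous phrasing look media-related.
MEDIA_HINTS = ("song", "album", "artist", "music", "movie", "radio", "playlist", "something by")


def playback_confidence(utterance: str) -> Confidence:
    """Hypothetical scoring of how likely an utterance is a playback request."""
    text = utterance.lower()
    if text.startswith("play "):
        return Confidence.HIGH            # "play X" is always treated as playback
    if text.startswith("put on "):
        # "put on" is ambiguous (lights vs. music): look for media-sounding words.
        return Confidence.MEDIUM if any(h in text for h in MEDIA_HINTS) else Confidence.LOW
    return Confidence.LOW


print(playback_confidence("play something by Lights"))   # Confidence.HIGH
print(playback_confidence("put on the bedroom lights"))  # Confidence.LOW
```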