Lessons & guides
Short guides and lessons on Daydream, the engine underneath, and getting it running in your DAW.
Contents
6 sections · 17 guides · 10-term glossary
Section 01 · 4 guides
Download Daydream Effect, install it in your DAW, and connect your account. Up and running in a couple of minutes.
Daydream is the first AI-native instrument. Built on current audio synthesis research, it gives you near-real-time control over audio generation. It's registered as an Effect in most DAWs, but it behaves more like an instrument: feed it any audio and it continuously generates music from that input.
Download the plugin and get it showing up in your DAW. A couple of minutes, start to finish.
Your key links the plugin to your account. Grab it once, after signing in to Daydream.
Paste your key into the plugin, connect, and start remixing audio in real time.
Section 02
Daydream inside your DAW as a VST/AU plugin. The plugin is in alpha — start with what it does, then open the DAW Guides for host-specific steps and Controls for what every knob, slider, and switch does.
The live controls, drawn the way they look in the app. Open any card for what the control does and its range.
How hard the model reshapes your source audio.
Strength is the most expressive control on the panel. The engine calls it denoise under the hood, but the knob just says Strength. Keep it low and you get a subtle remix that stays close to your original. Push it up and the model takes over, all the way to a full transformation.
It's the first knob to reach for. Sweeping it live while a track plays is the fastest way to feel what Daydream actually does.
Think of it like the wet/dry mix on an effect, except the wet signal is the model reimagining your track from scratch.
How closely it follows the song's arrangement.
Structure decides how tightly the output tracks your original's sections, rhythm, and dynamics (the engine calls it hint_strength). Turn it up to keep the arrangement intact. Bring it down and the model is free to rearrange. Near zero, it stops following your track and starts writing its own.
This is how you choose whether a remix stays on the rails of the original song or wanders off to dream up something new.
Like a follow-the-chart dial: full means stick to the original, low means take liberties.
How much of the original's instrument character carries through.
Timbre sets how much of your source's tone and color survives into the output (the engine calls it timbre_strength). High keeps the original instruments recognizable. Low frees the model to swap them for whatever fits the prompt.
It separates what is being played from what it sounds like, so you can keep the arrangement while the model recasts the actual instruments.
Like reamping a part: same performance, but you decide how much of the original tone bleeds through.
The text that tells the model what to generate.
In Daydream the prompt field is labelled Tags: a short description of genre, mood, instruments, and tempo. You can run two sets, Tags A and Tags B, and crossfade between them live. Editing the text doesn't send it on its own; you hit Send Tags to commit it.
Specificity wins. “Deep house, muted bass, warm rhodes” lands far closer than “electronic beat.” Re-roll the Seed for a fresh take on the same tags, or lock the Seed to reproduce one exactly.
Like calling out a vibe to a session band. The clearer the brief, the closer the first take.
A small add-on file that teaches the model a style.
A LoRA (Low-Rank Adaptation) is a small add-on file, far smaller than the model itself, that nudges it toward a particular genre or sound without retraining the whole thing. In the app they live in the LoRA Library; you can enable up to four at once, each with its own strength fader, and they stack. A prompt is your instruction for this take. A LoRA is a baked-in aesthetic the model carries across every prompt.
Pick the style with a LoRA, then steer the specifics with tags. 16 genre LoRAs ship out of the box, and they hot-swap into the running engine in about 1.2 seconds, so you can audition styles while the music plays.
A session player who can do anything, taking a quick lesson in one genre: cheap to teach, fast to swap, and it colors everything they play.
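A minimal sketch of how stacked LoRAs with individual strength faders could combine, assuming the standard Low-Rank Adaptation form where each add-on contributes a low-rank update B·A to a base weight matrix. The function names, matrix sizes, and the LoRA labels are hypothetical; this is not Daydream's loader.

```python
# Hypothetical sketch: names and shapes are illustrative, not Daydream's API.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def apply_loras(base, loras):
    """Return base + sum(strength * (B @ A)) over the enabled LoRAs.

    base:  d_out x d_in weight matrix
    loras: list of (strength, B, A) with B of shape d_out x r and A of shape r x d_in
    """
    if len(loras) > 4:
        raise ValueError("the app allows at most four LoRAs at once")
    merged = [row[:] for row in base]  # copy so the base weights stay untouched
    for strength, b, a in loras:
        delta = matmul(b, a)  # low-rank update, same shape as base
        for i, row in enumerate(delta):
            for j, value in enumerate(row):
                merged[i][j] += strength * value
    return merged


base = [[1.0, 0.0], [0.0, 1.0]]
lora_one = (0.5, [[1.0], [0.0]], [[0.0, 2.0]])  # rank-1 update at half strength
lora_two = (1.0, [[0.0], [1.0]], [[1.0, 0.0]])  # rank-1 update at full strength
merged = apply_loras(base, [lora_one, lora_two])
print(merged)
```

Because each update stores only r·(d_in + d_out) numbers instead of d_in·d_out, the add-on file stays far smaller than the model, and changing a fader only rescales one term of the sum.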
How much each new generation echoes the last.
Feedback sets how similar each new generation is to the previous one. Low gives you fresh variety on every refresh; higher gives a continuous evolution where each generation flows into the next.
It's the difference between constant reinvention and a smooth, evolving morph. 0.3–0.5 is the sweet spot for continuity without everything sounding the same.
How far back in time the feedback reaches.
Feedback depth sets how far back the Feedback knob looks. At 1 (default) it blends with the most recent generation; higher values reach back several ticks for an echo or ghost effect, where a faint repeat of an earlier moment surfaces in the current output.
It lets you get distant, ghostly feedback without cranking Feedback all the way up.
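One way to picture Feedback and Feedback depth together is a blend between the fresh generation and an earlier one pulled from a short history. The sketch below assumes a simple linear blend and a history of past outputs; the class name, the `max_depth` limit, and the blend rule are illustrative assumptions, not the engine's actual implementation.

```python
from collections import deque


class FeedbackBlender:
    """Toy model: blend each fresh generation with the one `depth` ticks back."""

    def __init__(self, feedback=0.4, depth=1, max_depth=8):
        if not 1 <= depth <= max_depth:
            raise ValueError("depth must be between 1 and max_depth")
        self.feedback = feedback  # 0 = all fresh, 1 = all past
        self.depth = depth        # 1 = most recent generation
        self.history = deque(maxlen=max_depth)

    def process(self, fresh):
        if len(self.history) >= self.depth:
            past = self.history[-self.depth]
            out = [(1 - self.feedback) * f + self.feedback * p
                   for f, p in zip(fresh, past)]
        else:
            out = list(fresh)  # not enough history yet, pass through
        self.history.append(out)
        return out


near = FeedbackBlender(feedback=0.5, depth=1)
near_out = [near.process([1.0]), near.process([0.0])]

ghost = FeedbackBlender(feedback=0.5, depth=2)
ghost_out = [ghost.process([1.0]), ghost.process([0.0]), ghost.process([0.0])]
print(near_out, ghost_out)
```

With depth 2, the first generation resurfaces two ticks later even though the tick in between was silent, which is the echo or ghost effect described above.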
Where the model concentrates its work across denoising.
An advanced control that changes where the model focuses effort across the denoising steps. The default is tuned for the turbo engine and works well in most cases.
Leave it alone unless you're chasing a specific feel — it's a fine-tuning knob, not an everyday one.
How many diffusion steps each generation runs.
The diffusion step count. Fewer steps run faster but cost quality; more steps improve quality but add latency. Changing it rebuilds the streaming pipeline, so expect a brief audio glitch when you move it.
It's the direct quality-versus-latency trade. Most of the time the default is right; raise it only if you can spare the latency.
Concurrent denoising slots in the streaming ring buffer.
How many generations the StreamDiffusion ring buffer keeps in flight at once. Low depth means lower parameter-update latency (best for snappy, discrete changes); high depth means higher throughput, smoother glides, and better GPU utilization. It's capped at the engine's max batch size.
It tunes the engine to your playing style: shallow for stabby, reactive moves; deep for liquid, continuous sweeps.
How hard the output is pushed toward the prompt.
Classifier-free guidance (CFG) strength. It only takes effect when the guidance mode (RCFG) is set to something other than Off. Higher values push the output further toward the prompt at the cost of more artifacts. The turbo model is CFG-distilled, so its useful range is narrower than a base model's.
It's your prompt-adherence dial — but turbo likes a light touch, around 3 to 8.
Tames the harshness that high guidance can add.
After CFG is applied, this mixes the guided signal's loudness back toward what the un-pushed pass produced. 0 keeps raw CFG; 1 fully snaps the magnitude back. Pair it with high guidance to keep the prompt-push without the harshness high CFG causes on its own.
It lets you chase strong prompt adherence without the output turning brittle or clipped.
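The guidance and rescale knobs can be sketched with the commonly used classifier-free guidance formula followed by a standard-deviation rescale. This assumes the "un-pushed pass" is the prompt-conditioned prediction before guidance is applied, and that loudness is measured as standard deviation; both are assumptions, and the function names are hypothetical.

```python
from statistics import pstdev


def guided(cond, uncond, scale):
    """Classifier-free guidance: push away from the unconditional prediction."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def cfg_rescale(cond, uncond, scale, rescale):
    """Blend raw CFG with a copy whose magnitude matches the conditional pass.

    rescale = 0 keeps raw CFG, rescale = 1 fully restores the magnitude.
    """
    raw = guided(cond, uncond, scale)
    raw_std = pstdev(raw)
    if raw_std == 0.0:
        return raw  # nothing to rescale
    factor = pstdev(cond) / raw_std
    matched = [x * factor for x in raw]
    return [rescale * m + (1 - rescale) * x for m, x in zip(matched, raw)]


cond = [1.0, -1.0]
uncond = [0.5, -0.5]
raw_cfg = cfg_rescale(cond, uncond, scale=3.0, rescale=0.0)
half = cfg_rescale(cond, uncond, scale=3.0, rescale=0.5)
full = cfg_rescale(cond, uncond, scale=3.0, rescale=1.0)
print(raw_cfg, half, full)
```

In this toy case guidance doubles the signal's magnitude, and the rescale knob walks it back toward the original level while keeping the direction that guidance chose.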
Whether guidance is on, and in what mode.
Off means no guidance — the turbo default. The other modes re-introduce classifier-free guidance at near-zero cost over the baseline, which is what brings the Guidance scale and CFG rescale knobs to life.
It's the master switch for prompt-guidance. Off is fastest; turn it on when you want the prompt to bite harder.
Experimental band scalers for the model's self-correction.
DCW is an internal correction the model applies to itself during generation. The low and high knobs adjust its strength in each band — low acts in the early part of the run, high in the later part. The exact audio mapping is still being explored.
Pure sound-design territory: sweep it to discover what it does to your source. Extreme values can be unpredictable, but interesting.
Tilts the sound brighter, with more highs.
An activation-steering knob: it nudges the model's internal representation toward a brighter spectrum (a higher spectral centroid), independent of the prompt. 0 is off.
A direct tone-shaping move that works whatever you've prompted — reach for it when a mix needs air. Useful range is roughly 5 to 15 by ear.
Tilts the sound warmer, toward the bass.
An activation-steering knob that shifts the spectrum toward the low end for a warmer feel. The counterpart to Bright. 0 is off.
Pulls the tone down into the chest without touching the prompt. Useful range is roughly 5 to 15 by ear.
Adds grit and noise to the texture.
An activation-steering knob that increases spectral flatness — grittier, noisier output. The effect builds slowly as you push it. 0 is off.
Dirties a clean generation up, useful when something sounds too polished. Useful range is roughly 5 to 15 by ear.
Thins the texture toward sparse and minimal.
An activation-steering knob that thins the sound toward a sparser, more minimal texture. 0 is off.
Opens space in a busy generation, pulling it toward minimal arrangements. Useful range is roughly 5 to 15 by ear.
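Activation steering is usually described as adding a fixed direction vector, scaled by a knob, to the model's hidden activations. The sketch below shows that general idea only; the direction vectors, their length, and the knob names used as keys are placeholders, since how Daydream derives its directions is not stated here.

```python
def steer(hidden, directions, amounts):
    """Add amount * direction to the hidden vector for every active knob."""
    out = list(hidden)
    for name, amount in amounts.items():
        if amount == 0:
            continue  # 0 is off, the activations pass through unchanged
        direction = directions[name]
        out = [h + amount * d for h, d in zip(out, direction)]
    return out


# Hypothetical two-dimensional directions, purely for illustration.
directions = {"bright": [1.0, 0.0], "warm": [0.0, 1.0]}
steered = steer([0.0, 0.0], directions, {"bright": 5, "warm": 0})
print(steered)
```

Because the shift is applied inside the model rather than through the tags, it works the same way whatever the prompt says.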
Experimental feature. These are not traditional audio channels and gains — they manipulate different dimensions of the model's latent space, and produce results ranging from nuanced and beautiful to abrupt and discordant. Use at your own risk.
Steer individual dimensions of the model's latent space.
Not traditional audio channels or gains. Channel highlights nudge individual latent channels (ch13, ch14, and friends); channel groups (ch g0–g7) move whole bands at once. Each defaults to 1.0 — turn one and you push the model along a dimension that has no neat audio name.
It's the deepest steering Daydream exposes. Reach for it when the prompt and tone knobs can't get you somewhere, and treat the result as discovery rather than control.
Section 03 · 4 guides
Four ways people put Daydream to work — building sample libraries, processing parts in a production, designing sound, and using it as a creative partner.
Build coherent sample packs by performing Daydream live and pulling the moments that work.
Use Daydream as an instrument in your session — play a part into existence rather than arranging it.
Turn arbitrary input — field recordings, found sounds, noise — into designed sound, performed in real time.
Play Daydream as an instrument — set conditions, listen to what comes back, and respond.
Section 04 · Coming soon
Daydream as a Max for Live device, native to Ableton's session view and clip workflow.
Section 05 · 2 lessons
How the engine works, for the curious. You don't need any of this to play, but it helps to know what the knobs are talking to.
Section 06 · 10 terms
Plain-language definitions for the engine, the tech, and the gear behind it all — the words the lessons throw at you, in one place.
The real-time engine that makes the music playable.
The engine is the runtime and control layer behind the instrument — “StreamDiffusion, for audio.” It takes a model that would normally render a song in one batch and makes that generation streamable and steerable as it plays. It's open source, and it can run on your own GPU.
This is why Daydream feels like an instrument instead of a render queue. You move a knob and hear the change, instead of submitting a prompt and waiting.
If the model is the band, the engine is the live mixing desk that lets you ride the faders mid-performance.
The open music model the engine actually runs.
ACE-Step v1.5 is the open-source model that writes the music, and the Daydream engine wraps around it. The split is worth knowing: ACE-Step composes, the engine makes it playable in real time. The default checkpoint is a 2B-parameter turbo model (a larger XL version also exists), released by the ACE-Step team under an MIT license.
The split tells you what's swappable. The engine stays put while the model underneath can change.
ACE-Step is the songwriter. The engine is the touring rig that lets you perform what they wrote, reshaped, every night.
Generating sound by clearing away noise, step by step.
A diffusion model starts from pure noise and removes it in small steps until coherent audio emerges. It learned the trick by watching real audio get buried in static and practicing the reverse. Music models do this in a compressed space rather than on raw waveforms, and the turbo model finishes in just 8 steps.
It's why the Strength knob feels the way it does. You're telling the model how far back into the noise to start, which is how much room it has to reinvent before it settles.
Like a Polaroid developing, sharpening into focus a little more with each pass.
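To make "how far back into the noise to start" concrete, here is a sketch that maps a Strength value to a number of denoising steps and a starting mix of source and noise. It assumes Strength runs from 0 to 1, an 8-step schedule, and a linear source-to-noise interpolation; all three are illustrative assumptions rather than the engine's documented behavior.

```python
def steps_to_run(strength, total_steps=8):
    """How many of the schedule's steps are actually denoised."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength is assumed to lie in [0, 1]")
    return int(round(strength * total_steps))


def noised_start(source, noise, strength):
    """Starting point: the source pushed `strength` of the way toward noise."""
    return [(1 - strength) * s + strength * n for s, n in zip(source, noise)]


light = steps_to_run(0.25)   # subtle remix, starts close to the source
full = steps_to_run(1.0)     # full transformation, starts from pure noise
start = noised_start([1.0], [0.0], 0.25)
print(light, full, start)
```

At low Strength most of the source survives in the starting point and only a couple of steps run, so the model has little room to reinvent. At full Strength nothing of the source is left in the starting point.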
The real-time-image trick the engine borrows for audio.
StreamDiffusion made image diffusion run in real time by keeping several generations in flight at once, each frozen at a different stage of denoising on one assembly line, all advanced by a single pass. The Daydream engine is the audio version of that idea. Its “ring depth” is how many generations it keeps in flight, from 1 to 8.
It's the reason you get continuous, gap-free audio you can steer live. The engine is always working ahead, so the next sound is ready the moment you need it.
A kitchen line where every station works a different dish at once, so plates come out steadily instead of one at a time.
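The assembly-line idea can be simulated with a toy ring in which every tick admits one new generation and advances all in-flight generations by one stage. This is a simplified model, assuming one stage per ring slot; the class and method names are hypothetical and it does not reproduce the engine's scheduling.

```python
from collections import deque


class RingPipeline:
    """Toy ring buffer: `depth` generations in flight, one finishes per tick."""

    def __init__(self, depth):
        if not 1 <= depth <= 8:
            raise ValueError("ring depth is described as 1 to 8")
        self.depth = depth
        self.slots = deque()  # each slot is [label, stages_done]

    def tick(self, label):
        """Admit a new generation, advance every slot, return a finished label or None."""
        self.slots.append([label, 0])
        for slot in self.slots:
            slot[1] += 1  # a single pass advances all slots together
        if self.slots[0][1] == self.depth:
            return self.slots.popleft()[0]
        return None


shallow = RingPipeline(depth=1)
shallow_out = [shallow.tick(x) for x in "abcd"]

deep = RingPipeline(depth=3)
deep_out = [deep.tick(x) for x in "abcd"]
print(shallow_out, deep_out)
```

Once the ring has filled, one generation finishes on every tick regardless of depth. What changes with depth is how many ticks pass before something admitted now comes out, which matches the trade-off described for ring depth: shallow reacts sooner, deep keeps more work in flight.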
The codec that shrinks audio so the model can work fast.
A VAE compresses audio into a small code the model paints in, then decodes it back to a waveform. To stream without gaps, the engine decodes in overlapping one-second windows and keeps only the middle slice, trimming the edges that exist just to avoid seams. Decoding only the window you need, instead of the whole song, is what keeps the latency low.
It's the unglamorous part that makes “no clicks, no gaps, no waiting” actually true.
Like crossfading loops in a DAW so the splice is inaudible. The decoder overlaps its chunks for the same reason.
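The overlap-and-trim decode can be sketched as a plan of windows over the latent timeline: each window is decoded with extra margin on both sides, and only the middle is kept. Frame counts and the function name below are illustrative assumptions; the real window and margin sizes are not given beyond "one-second windows".

```python
def window_plan(total, window, margin):
    """Return (decode_start, decode_end, keep_start, keep_end) tuples in frame indices.

    Each window decodes up to `window` frames but keeps only the middle
    `window - 2 * margin`, so consecutive kept slices meet without a seam.
    """
    hop = window - 2 * margin
    if hop <= 0:
        raise ValueError("margin is too large for the window")
    plan = []
    keep_start = 0
    while keep_start < total:
        keep_end = min(keep_start + hop, total)
        decode_start = max(keep_start - margin, 0)   # no left margin at the very start
        decode_end = min(keep_end + margin, total)   # no right margin at the very end
        plan.append((decode_start, decode_end, keep_start, keep_end))
        keep_start = keep_end
    return plan


plan = window_plan(total=10, window=6, margin=1)
kept = [i for _, _, start, end in plan for i in range(start, end)]
print(plan)
```

The kept slices cover every frame exactly once, while each decode call only ever touches one window's worth of frames instead of the whole song.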
How fast your knob move reaches the audio.
Latency is the gap between your hand and the sound. The engine's per-frame knobs land in roughly 14 ms at shallow ring depth, rising to about 81 ms at the deepest, with 25 control points every second. Sending a whole new prompt takes longer (around 112 to 649 ms, depending on depth) because the model has to converge on the new idea.
Knob latency this low, under a tenth of a second even at the deepest ring depth, is what makes a tool stop feeling like a render and start feeling like an instrument under your hands. It's also why a knob sweep feels instant while a fresh prompt takes a beat to land.
Like the difference between bending a string, which is instant, and calling a key change to the band, who need a bar to land it.
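The figures above can be put side by side with a little arithmetic. The sketch assumes the lower prompt figure pairs with shallow ring depth and the higher one with the deepest; the variable names are made up for the example.

```python
CONTROL_RATE_HZ = 25
control_period_ms = 1000 / CONTROL_RATE_HZ  # time between control points

knob_ms = {"shallow": 14, "deepest": 81}
prompt_ms = {"shallow": 112, "deepest": 649}  # pairing with depth is assumed

# How many knob-latencies one prompt change costs at each depth.
ratios = {depth: prompt_ms[depth] / knob_ms[depth] for depth in knob_ms}
print(control_period_ms, ratios)
```

A control point arrives every 40 ms, and at both ends of the range a prompt change costs about eight times a knob move. That factor happens to match the turbo model's 8 steps, which would fit a new prompt needing a full pass through the schedule, though the source does not state this.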
Bind a physical knob to any on-screen control.
MIDI is a control protocol. It carries notes, timing, and parameter messages, not audio, and it lets a hardware controller drive software. In Daydream there's no separate setup screen: right-click most knobs, sliders, or buttons, wiggle the physical control you want, and it binds on the spot. That's “MIDI learn.”
It turns Daydream from a mouse-driven app into something you play with your hands. Map Strength to a fader and you're performing the model, not clicking it.
Exactly like MIDI-learning a plugin parameter in your DAW: same gesture, same muscle memory.
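MIDI learn boils down to "arm a parameter, then bind it to the first controller message that arrives". The sketch below models that with MIDI Control Change messages, whose values run from 0 to 127. The class, its methods, and the parameter range are hypothetical and stand in for whatever Daydream does internally.

```python
class MidiLearn:
    """Toy MIDI-learn table: bind incoming CC messages to named parameters."""

    def __init__(self):
        self.bindings = {}  # (channel, cc_number) -> (param, low, high)
        self.armed = None   # parameter waiting for its first CC message
        self.values = {}    # current parameter values

    def arm(self, param, low, high):
        """Called when the user right-clicks a control to learn it."""
        self.armed = (param, low, high)

    def on_cc(self, channel, cc_number, value):
        """Handle one Control Change message (value is 0..127)."""
        key = (channel, cc_number)
        if self.armed is not None:
            self.bindings[key] = self.armed  # wiggling the control binds it
            self.armed = None
        if key in self.bindings:
            param, low, high = self.bindings[key]
            self.values[param] = low + (high - low) * value / 127


midi = MidiLearn()
midi.arm("strength", 0.0, 1.0)
midi.on_cc(channel=0, cc_number=7, value=127)  # first wiggle binds and sets
midi.on_cc(channel=0, cc_number=8, value=64)   # unbound controller, ignored
print(midi.values)
```

Binding on the first incoming message is what removes the need for a separate setup screen: the controller identifies itself by being moved.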
Isolated layers, like vocals and instruments, of your source.
A stem is one grouped layer of a song bounced to its own track: all the vocals, or the whole instrumental bed. Daydream pulls vocal and instrumental stems out of your source, so you can mix each layer into the output on its own, or feed just the vocals or just the instruments into generation.
It lets you keep one part recognizable while the model transforms the rest. Lock the vocal and let it rebuild the backing, or the other way around.
Like soloing the instrumental bus versus the vocal bus on a mixing desk.
The plugin format versus the studio it plugs into.
A DAW (Digital Audio Workstation) is the software where you record, sequence, mix, and arrange, like Ableton, FL Studio, or Logic. A VST is a plugin format: add-on software that runs inside a DAW. AU is Apple's Mac and iOS equivalent. A standalone app is the same tool packaged to run on its own. Daydream's web app runs in your browser, and a VST/AU plugin is available in alpha.
It tells you how Daydream fits your workflow. The browser app needs nothing installed, while the plugin drops the engine straight into your DAW session alongside your other plugins.
The DAW is your studio. A VST is a piece of gear that racks into it. A standalone is that same gear on its own stand.
The graphics-card memory you need to self-host.
VRAM is the dedicated memory on your graphics card, and it's the main thing that decides whether a model runs locally, because the model has to fit. You only need it if you want to self-host the open-source engine; the hosted web app needs no GPU at all. The realtime engine's practical floor is around 16 GB of NVIDIA VRAM, with a 24 GB card like the RTX 4090 comfortable. The benchmarks were run on an RTX 5090 with 32 GB.
It draws the line between just opening the browser and running it yourself, and if you do run it yourself, more headroom means lower latency.
Like track count and plugin headroom on an old machine: run out and everything chokes, have plenty and it flies.
Artists congregate in the Daydream Discord. If you've got a question, chances are someone will know the answer.