Advanced Machine Learning Day 3: Neural Program Synthesis


>>I’m a researcher here at MSR AI.
I work primarily on
Program Synthesis or
more generally on
various different ways
to combine together
your traditional technologies
from the good old fashioned AI,
symbolic search, formal logic,
ways to analyze programs,
and so on with deep learning,
reinforcement learning,
neural technologies, and all
the advanced machine learning
that we have nowadays.
Today’s lecture is generally
an overview of the field,
overview of the ways you can
apply Program Synthesis
on your products,
ways you can benefit from it,
all the different applications of it,
and that’s generally about it.
Feel free to ask questions at
any point during the hour.
We are going to stay roughly at
the high level and give
you a lot of pointers,
so that if you’re interested in
any specific topic
throughout the hour,
feel free to follow it
offline; you’re welcome.
That’s good. Okay. So, let’s start
with what Program Synthesis is and
why do we even want
to generate programs.
Programming is fun, why automate it?
Well, back in the early
days of AI in the 1950s,
1960s, when people assembled
and talked together about
what we want AI to be
and what this term even means,
automatically writing code was
one of the problems that they
designated as a key feature
that we want the computer
to be capable of.
Back in those days,
it basically meant, one second,
I’m going to turn
off the outer switch
because you don’t want to be
going faster than it’s necessary.
Back on track.
So, back in those days,
that basically meant that we
are going to give the computer
some sort of specification over
what we want our program to do.
Like, if you want the computer
to give us a sorting routine,
we are going to say that
the sorting function is
something that takes
an input in array and
produces another array such
that it’s a permutation
of the original one,
but it’s also sorted.
The idea was that,
we would write this down,
give it to a computer,
and it would spit us back a piece of
code that satisfies
the specification.
As many problem specifications
back in those days,
it was a little too ambitious
and we never got to
the point where we can
automatically write us a web browser
or a Cortana or [inaudible]. However,
modern Program Synthesis still
lives and finds a ton of
different applications
even though it doesn’t
give us applications of
the size of a web browser.
It looks a little bit different now.
So, modern day Program Synthesis
has three big components.
The first is that it still
starts with user intent.
Although, what we found
over the years is
that people generally don’t like to
write logical complete specifications
over what we want our programs to do.
Not only can you not always come
up with it in the first place,
but it’s also laborious.
There are other ways to
specify intent, like,
you can give examples of
what you want your program
to do, or you can,
in a natural language,
in English say,
describe the behavior of the program.
On the other hand,
we have the program space.
That describes all the space
of possible programs from
which you want to select one
that satisfies your intent.
You can imagine that being, say,
any program in a general purpose
programming language like C#
or Python or more often than not,
we actually write
domain-specific languages
for particular tasks and purposes
and say pick me a program
from this language.
Both of these components
are inputs to
a search algorithm whose goal is
to actually find the right program.
Historically, that’s
usually been some search;
enumerative or deductive
logical search,
but nowadays we also have neural
networks at our disposal and they
do this job fairly well as
we’ll see in many cases.
Out of this combination,
comes a program.
Different projects pick
different combinations
of these three components,
and we will see several examples
throughout the talk.
There are several applications
depending on how you pick these.
Well, let’s say we want to take
a question from a user and generate
an interpretation of that question as
a command and then plan
how to satisfy a query.
So, that gives us a dialog system
or conversational agent.
Semantic Machines, a startup
recently acquired by Microsoft,
has Program Synthesis technology as
the back end for
their conversational system.
Who here is familiar with
Flash Fill in Excel? A few people.
Good. So, basically we’ll look at
Flash Fill a little bit
later, but the idea is,
if you have a lot of
data in your spreadsheet
that you want to clean up and
standardize in the right format,
instead of writing a bunch of
regular expressions yourself,
you can give a few
input-output examples
of what you want
the transformation to do and it
will give you
a string-to-string program
that cleans up your data
according to the examples.
That’s another instantiation
of the same paradigm.
Or we can go all the way
and actually talk
about general purpose
code generation,
like IntelliSense or
Autocompletion if you think
about it on the level not of a single
token but more of
complete expressions.
There, your intent is
what’s the context of
the current program and
the program space is any expression
in the language that
you’re programming in.
So, that concludes
the general outline.
In what follows, I’m going to
highlight some more applications,
with particular choices of
these components and we are
going to progress through
several different ways to
satisfy this application
and implement them.
We’ll start with
the traditional logical one
and cover it at a very high level:
how do you take
input-output examples
and search for the program?
That’s called Program Synthesis
[inaudible] and we’ll cover
pros and cons of this.
Then, we will talk about how to add
more Machine Learning techniques into
the mix and talk about
several projects
that combine together
symbolic implementations
of program synthesis and neural ones.
Any questions? Excellent.
So, let’s move on.
Well, if you’re talking about
programming by examples,
let’s start with probably
the most famous Microsoft
application of it, Flash Fill.
It’s a feature in
Microsoft Excel that
started the wide industrial application
of Program Synthesis.
Basically, what it allows
you to do is to say, “Hey,
if I have a lot of data in
my spreadsheet and I don’t feel
like writing the
desired transformation
of this data all by myself.”
Like in this case, combining
the names in the right format,
I can just give
one or two examples of what
I want this transformation to do.
In this case, one example
given in the second line will
suffice and Flash Fill will come
back and fill in the rest of
your data according to the template
that it thinks you want.
The actual template that
it generates behind
the scenes looks somewhat like this.
It’s a program, a program in
a specific language of
string-to-string transformations.
So, in this case, what you
probably want amounts to, like,
take the last name, add a comma,
add the first name, add a space,
then add a substring of
the middle name starting
from the first character
and ending at
the last uppercase character,
and then add a dot at the end.
Except, if you look at it,
you’ll actually notice that the data
is in two different formats.
So, in order to complete the task,
you’d probably want to give
a second example and then
Flash Fill will amend the program
and add a conditional to adapt,
based on whether the middle name
is present or not.
So, that’s the full
program that satisfies
the task here specifically
in the spreadsheet.
Couple things to notice about this.
First of all, the program is
not in Excel formula language.
The program is not in C#.
The program is not in BASIC.
The program is in particular language
that the designers of Flash Fill
came up with in order
to cover the space of tasks that is
useful for this feature,
string-to-string transformations.
A domain-specific
language or a DSL is
crucial for most programming
by example tasks.
It gives you the
expressivity required to
cover the space of
tasks without making
the problem so hard that
there is an infinite number
of programs in
a general purpose language
and no hope of
finding the right one because
there are just way too many of them.
The general principle
that we described
here is called Programming
by Examples, or PBE.
This is one flavor of
Program Synthesis,
we’ll see several more
later in the talk,
and [inaudible] by examples, after
Flash Fill came into
the mix, became widely
popular and successful.
You can take the same
principle and implement it
for selecting data from
a web page, transforming tables,
joining them automatically,
transforming data
from one JSON schema to
another JSON schema,
clustering data based
on syntactic patterns.
Lots and lots of different
things of the same form.
PBE works excellently, but
in order to make it work,
you have to overcome
a few challenges.
So, you can already see from
this example what they are.
For instance, the number one rule,
if someone gave you examples,
you must satisfy them.
That’s what the user expects.
At the very next slide,
we’re going to see how
that actually happens.
Second, examples are great,
but a single example doesn’t
actually necessarily constrain
your problem enough in order
to satisfy everything
from the first example.
Examples can be ambiguous;
even in that case,
we needed two in order
to get to the right one.
So, ideally, we would like to
learn the intended program,
not just the one that
satisfies the examples,
and we cannot read
the mind of the user,
so we need to do
something smarter here.
Even if we can get
the intended program,
how do we know that for sure,
because the user can just
keep giving examples?
How do you know when to stop and say,
this is definitely the same program,
your examples don’t
really change anything?
Now, the process that I showed
in the previous slide is interactive,
it’s back and forth
communication with the user.
Because it happens in real time,
you want every iteration of
it to be extremely fast.
The DSL design is
a little bit of an art.
It must be expressive enough to
cover the space of tasks but
still concise enough so that
program synthesis finishes
within a few seconds.
So, for instance, for Flash Fill,
this is an excerpt.
That’s not the full DSL,
but basically, it works in layers.
You’re saying, a program is
either a single piece or
a concatenation of multiple pieces.
A single piece can be
either a constant string or
a substring of one of the inputs
based on two positions,
and each of these positions is either
an absolute position in
a string based on an index or
regular expression-based
position in the string
based on a pair of
regular expressions surrounding it.
You can design these,
stick more and more
operators like that.
It’s limited only by your
imagination, essentially.
That’s literally what
you would write,
a context-free grammar
of a small language,
and if you make it
expressive enough to cover
the tasks, that will be sufficient.
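To make the layered grammar concrete, here is a minimal sketch of such a string-transformation DSL written as plain Python classes with an interpreter. The class names, helper methods, and the example program are illustrative assumptions, not the actual Flash Fill implementation.

```python
# A toy Flash Fill-style DSL: program := Concat(piece...), piece := ConstStr | SubStr,
# position := AbsPos | RegexPos.  Hypothetical names; for illustration only.
import re
from dataclasses import dataclass

class Expr:
    def eval(self, inputs):            # inputs: list of input cell values
        raise NotImplementedError

@dataclass
class Concat(Expr):                    # concatenation of multiple pieces
    parts: list
    def eval(self, inputs):
        return "".join(p.eval(inputs) for p in self.parts)

@dataclass
class ConstStr(Expr):                  # a constant string
    s: str
    def eval(self, inputs):
        return self.s

@dataclass
class SubStr(Expr):                    # substring of input k between two positions
    k: int
    start: object
    end: object
    def eval(self, inputs):
        s = inputs[self.k]
        return s[self.start.resolve(s):self.end.resolve(s)]

@dataclass
class AbsPos:                          # absolute index (None = end of string)
    index: object
    def resolve(self, s):
        return self.index

@dataclass
class RegexPos:                        # start of the `which`-th regex match
    pattern: str
    which: int
    def resolve(self, s):
        return list(re.finditer(self.pattern, s))[self.which].start()

# "Last, F." built from inputs [first_name, last_name]
prog = Concat([SubStr(1, AbsPos(0), AbsPos(None)),   # whole last name
               ConstStr(", "),
               SubStr(0, AbsPos(0), AbsPos(1)),      # first initial
               ConstStr(".")])
print(prog.eval(["Jane", "Doe"]))                    # -> "Doe, J."
```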
Questions so far? Yes.
>>What do you do with
conflicting examples?
>>In this setup, we assume
the examples are golden,
and if the user made
a typo or anything,
then you would just come
up with an empty set.
However, what you can actually do
in practice, because
that’s unsatisfactory,
is to synthesize a set of
programs from each example
alone, and then see, if you
take out one example,
is it possible that the rest of them
actually gives you
a consistent program.
Then, that will give you a mechanism
to point to the user and say,
you made an error in this one.
It’s not consistent
for anything else.
>>Do you also need to write
the bindings to
a real programming language?
>>Yes. So, you need
an implementation for each of
these operators if you
want this program to run.
In this case, you imagine this
is like a couple of lines of C#.
So, how do you actually
find the right program?
You have a grammar,
you have examples.
Now, we want a program that
definitely satisfies these examples.
I’m going to describe the
principle in one slide,
and we’re going to make it
a little bit more complex later on.
So, for instance, let’s use
Flash Fill as a running example.
We start with some
specification of the intent.
So, for instance, we have a couple
of these input-output examples.
We want this string to be translated
to that string and
this one to that one.
At every point of time during
the search, you do the following.
So, the search proceeds top
down throughout the grammar
that I showed in the previous slide.
So, you’re going to be building
your program in the order of
its Abstract Syntax tree.
If you haven’t seen
the term AST before,
it’s basically the tree
representation of
the program that I showed
in the previous slide.
So, the program that we
want here is some concatenation
of several pieces,
some of these are constants,
some of these are substrings.
Of course, while we’re
doing the search,
we don’t know that yet.
So, at every point of time,
we have a partial program.
For instance, suppose in
this moment during the search,
we’ve already somehow
decided that, okay,
the left part of the program must
be a constant “To” colon space,
but we don’t know what the
right part should be yet.
So, this starts as a hole.
There is always some hole
in our partial program.
Initially, it’s just the root.
From the previous program,
we know that these are
the examples that this hole
should satisfy in order for the rest
of it to satisfy the original ones.
Again, initially, this is just the
original examples given to us.
I’m going to describe the principle
recursively so that it applies
at the top as well as
at every hole below.
What we want to do is
the following: first,
look at the grammar and say,
what possible operators
could appear at this point?
For instance, let’s say,
we decided that here,
we also want another concatenation.
There are other options.
Now, we have a program
with two holes,
a left portion of a concat and
the right portion of a concat.
Again, we don’t know yet which
programs should appear in there,
but whatever they are,
based on these examples,
the left one must
output A and B in order
for the entire program to satisfy
the examples, and the right
one must output L and O.
What we just did is, we
propagated the examples
top down throughout the program,
and you can imagine
writing this rule in,
again, a few lines of code.
This rule basically looks like this:
Look at the output
examples that I’m given.
I know that the program operator
at this point is concat.
Now, a specification for
the constituents is,
just split the output string
at all possible locations.
Some of these splits should
give us the right program.
In this case, there was only one
split if we ignore empty strings,
but there could be multiple.
So, it looks like this,
essentially, a couple of lines. Yes?
>>Can the program give
us the empty strings?
>>Just to make the search
faster, essentially.
In practice, for this domain,
it’s almost never the case that
the right program actually
gives us empty string.
If it turned out to be the
case for a specific user task,
you can always go back
and re-synthesize.
This process now continues
for every hole top down all
the way until the leaves,
until you get the full program.
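As a concrete illustration of this top-down propagation, here is a minimal Python sketch for a toy DSL with just Concat, ConstStr, and SubStr of the input, working from a single example. The function names and the tiny ranking heuristic are hypothetical; a real system factors the propagation into per-operator witness functions and handles multiple examples, caching, and proper ranking.

```python
# Toy deductive search: propagate the output example through Concat splits.
def propagate_concat(examples):
    """For each split point, produce the (left, right) specs that the two
    children of a Concat would have to satisfy.  Single example only."""
    (inp, out), = examples
    candidates = []
    for i in range(1, len(out)):
        candidates.append(([(inp, out[:i])], [(inp, out[i:])]))
    return candidates

def synthesize_all(examples, depth=3):
    """Yield programs (as nested tuples) consistent with the example."""
    (inp, out), = examples
    if out in inp:                                   # substring of the input
        i = inp.index(out)
        yield ("SubStr", i, i + len(out))
    yield ("ConstStr", out)                          # always satisfies the spec
    if depth > 0:
        for left_spec, right_spec in propagate_concat(examples):
            for left in synthesize_all(left_spec, depth - 1):
                for right in synthesize_all(right_spec, depth - 1):
                    yield ("Concat", left, right)

def constant_chars(prog):
    """Toy ranking: fewer constant characters = more likely to generalize."""
    if prog[0] == "ConstStr":
        return len(prog[1])
    if prog[0] == "Concat":
        return constant_chars(prog[1]) + constant_chars(prog[2])
    return 0

best = min(synthesize_all([("adam", "To: adam")]), key=constant_chars)
print(best)   # some minimal-constant program: constants spelling "To: " plus ('SubStr', 0, 4)
```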
So, what do you get out of it?
First, it’s correct by construction.
If your propagation procedures
are valid and sound,
you know that the
program that you get out
of it must satisfy
the examples, at least.
If you wanted to not just satisfy
the examples but actually
be preferred ideally,
you want to incorporate a ranking
function that will look at
the program and give
you some score over how
likely this program is
to be generalizable.
You can machine-learn
this ranking function.
There will be some pointers
in the slides.
If you have the ranking function,
you can propagate it throughout
the whole search process and
actually return the top-k programs
and not just a single good one.
Finally, this principle applies
to lots and lots of different
operators and domains.
I just showed it on
string transformations,
but you can imagine the same
for any data transformations,
web pages, you name it.
The only thing that you need is
a way to propagate the examples,
essentially invert
the implementation of
the operator even approximately,
and for many cases,
it is possible to write
efficiently. Yes.
>>It looks like exponential complexity.
>>Yes. We’ll cover that
just in a bit. Okay.
If you want to apply this principle,
we’ve built a framework in which you
can just write down
your domain-specific language,
write down these propagation rules.
They are really short and simple
and it will spit out a synthesizer,
so you can build
yourself a copy of Flash
Fill in a matter of hours or days.
That’s been applied in
dozen different places
inside and outside the company
for various domains.
Many of these domains are already
integrated in the framework,
so if all you need is
just by example data
wrangling that one is
available out of the box.
But if you don’t,
basically what happens
is that you have
a framework that has all the
synthesis strategies like
deductive search, and you give it
the definition of the language and
it generates
a specific synthesizer
tailored to this
domain-specific language.
On top of which, you can now build
apps and services and UI experiences,
which are the ones that
at runtime take examples as
input and spit out
the programs for the user.
Okay. I’m not going to return to
PROSE, so if you have
any questions at this point,
feel free to ask.
>>Open source?
>>Not open source.
It’s available inside the company,
not open-sourced yet.
Yes. Okay, so the obvious question
was asked right now.
It’s a great search procedure
but as many search procedures,
it suffers from a tiny drawback.
It can be slow, exponentially slow,
because at every point
you are going to be
splitting your spec
into possible cases of
how you can satisfy it and that
grows very fast, very quickly.
So, what do we do about this?
Which brings us to
the modern approaches to program
synthesis that combine the power
of two different worlds.
Statistical and symbolic.
So, let’s talk about making deductive
search faster first of all.
What is the issue with
deductive search?
Well, you have
the entire search space
with lots of different branches.
Unless you can guarantee
that some branch is
definitely a dead end and cannot possibly
satisfy the examples
and prune them out,
you’re going to explore them
trying to find the right program.
Well, but that doesn’t
have to be the case.
So, if you look at
any individual problem,
you as a human can
immediately see that some of
these branches are unviable even
if they are theoretically viable.
So, here, it can’t possibly be
a sub-string program alone
that satisfies this spec.
There’s a colon in here.
There’s no colon in the input; you
cannot satisfy this with
a single sub-string extraction.
You need to concatenate
at least something.
So, if you know this,
why can’t we teach a
neural network to do this?
There are two approaches
to doing that
one more shallow and other more deep.
The shallow one is very
simple to implement and train.
It’s called DeepCoder,
if you want to look up the paper,
and the principle is simple.
We are going to take
our examples and feed them to a
neural network that will give us
a distribution over the operators in
the language that it
believes should appear in
the program that satisfies
these examples and then once
we have this distribution,
we can look at
the least likely parts of
the distribution and say how
about we just eliminate them
from the search, because
the network believes
these paths cannot satisfy the
example and that’s essentially it.
If you want to train this,
the only thing that you
need is a database of
tasks and programs
even random ones will
often suffice, and
from each program you
extract a set of operators
and give them to
the network saying this one
is present, this one is not.
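A rough sketch of that idea in PyTorch, with an assumed fixed-size example encoding and an illustrative operator set (not the DeepCoder paper's exact architecture):

```python
# DeepCoder-style operator prediction: the network looks at the I/O examples
# and predicts, per DSL operator, whether it appears in the target program.
import torch
import torch.nn as nn

OPERATORS = ["Concat", "ConstStr", "SubStr", "AbsPos", "RegexPos"]

class OperatorPredictor(nn.Module):
    def __init__(self, example_dim=64, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(example_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, len(OPERATORS)))
    def forward(self, example_features):         # (batch, example_dim)
        return self.net(example_features)         # one logit per operator

model = OperatorPredictor()
# Training: each program (even a random one) gives a multi-hot operator label.
loss_fn = nn.BCEWithLogitsLoss()
features = torch.randn(32, 64)                    # placeholder example encodings
labels = torch.randint(0, 2, (32, len(OPERATORS))).float()
loss_fn(model(features), labels).backward()

# Inference: drop operators the network believes are absent from the search.
with torch.no_grad():
    probs = torch.sigmoid(model(torch.randn(1, 64)))[0]
allowed = [op for op, p in zip(OPERATORS, probs) if p > 0.1]
print("search restricted to:", allowed)
```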
Now, this already cuts out
quite a bit and works well,
but you can notice that this process
runs only once, before starting
the search.
However, the search continues
throughout the entire
tree recursively.
It seems like we are not using
the opportunity as often as we could.
If you do that at every step
instead of just at the beginning,
you arrive at something
that we call the
neural guided deductive search.
Where basically, at every
step of the way in
the search, we are going to
take our partial search
state and feed it
to a network which is going
to look at it and reorder
the operators and give us
a different distribution over
the branches at the next step.
Now, that allows you to cut off
another part of the
search and you can keep
doing it at every step of
the way until you arrive
at the right program.
Okay. Now, how does it
actually look technically?
What you want to do,
is the following.
So, you want to first
create a complete data set of
intermediate search results.
Run your program synthesis without
any neural networks or
without any guidance.
Let it explore
all the exponential space,
but collect all the results and
say at this search branch when we had
the choice between K operators and we
had this spec of inputs
and output example,
we explored all K
branches which gave us
back K top-level programs and
according to the ranking
function H that we had
available, program P1 had this score,
program PK had that score, and so on.
What we want, is to pick
a branch that will maximize
the score of the program.
That is, the score that tells us
the program is
generalizable and intended.
So, we are going to learn
a predictive model
which can take as input
a branch ID at this state and
approximate how good this score is.
To clarify, you don’t want
to just choose a branch.
You want to make use of these scores
because scores are not uniform,
they actually have meaning.
A score of that range will tell
you this is the program that
is highly likely and a
score of that range will
tell you it’s likely but
not that much and so on.
So, you want to optimize
the scores and we
train using the squared error
objective over these scores.
Which model architecture you
actually use depends on your task.
For instance, for strings
you’re going to say,
“I have my input example,
I have my output example.
I’m going to embed them
one character at a time, process
them with an LSTM for the inputs and
the outputs, and then for
each individual production ID,
I’m going to feed that to
a multi-layer perceptron which
will emit a program score.”
That pipeline can be run for
each production independently for
each operator and then you just
select a few top ranked ones.
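A rough PyTorch sketch of such a branch-scoring model, with illustrative dimensions and names (a simplification of the neural-guided deductive search setup just described):

```python
# Score a candidate production given the I/O example of the current branch.
import torch
import torch.nn as nn

class BranchScorer(nn.Module):
    def __init__(self, n_chars=128, n_productions=50, emb=32, hidden=64):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, emb)
        self.input_lstm = nn.LSTM(emb, hidden, batch_first=True)
        self.output_lstm = nn.LSTM(emb, hidden, batch_first=True)
        self.prod_emb = nn.Embedding(n_productions, emb)
        self.mlp = nn.Sequential(
            nn.Linear(2 * hidden + emb, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, inp_chars, out_chars, production_id):
        # Encode the I/O example one character at a time.
        _, (h_in, _) = self.input_lstm(self.char_emb(inp_chars))
        _, (h_out, _) = self.output_lstm(self.char_emb(out_chars))
        state = torch.cat([h_in[-1], h_out[-1],
                           self.prod_emb(production_id)], dim=-1)
        return self.mlp(state).squeeze(-1)        # predicted ranking score

scorer = BranchScorer()
inp = torch.randint(0, 128, (1, 12))              # input example as char ids
out = torch.randint(0, 128, (1, 16))              # output example as char ids
score = scorer(inp, out, torch.tensor([3]))       # score for production #3
# Train with squared error against the best score reachable in that branch.
nn.MSELoss()(score, torch.tensor([0.87])).backward()
```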
There are some [inaudible]
through this process.
So, the good part is,
if implemented well,
it guides the search towards
higher-quality programs and it
eliminates the branches
that are unproductive.
They don’t get you to the goal.
The downside and the thing that
you need to be careful of,
is that if you choose wrongly at
any point throughout the tree,
you are going to
eliminate the branch with
the right program and you’re done.
So, if you don’t want that to happen,
you don’t want to always pick
just the best branch as
predicted by this model.
You want to pick several of them and
ideally implement something
like branch-and-bound or A*
in order to explore
a frontier of the search
process instead of just
blindly picking
the best prediction. Okay.
So, that concludes
the first approach to
combining the symbolic
and the neural world.
If we go all the way to
the sky and look at it
from a bird’s eye view,
what basically happened is that we
had a completely logical procedure,
a symbolic search
process and we found
a way to guide it using
a neural network.
Well, the obvious question
is: Why don’t
you just try and learn the
neural network in the first place?
It turns out that works in certain
cases and doesn’t work in others,
and we are going to see
a few examples of how to
design the right architecture
of a neural network,
if you want that to solve the
entire program synthesis problem.
The obvious downsides to just
using a neural network is that,
you cannot communicate
the descriptions
of all the operators
in the program to it.
You cannot guarantee a 100 percent
by construction that the program
that the neural network is going to
emit is going to
satisfy the examples,
and, forget about satisfying
the examples, that it is even going to compile.
So, if you want
that neural network to be at
the core of the search process,
you need a way to guide that
using the symbolic processes.
Have a component that
tells the network
that this is how you make
your programs compile and this
is how you make them
satisfy the examples.
So, let’s look at how to do
that on a few other domains.
The setting where this
can be illustrated best,
is the one dear to
many of our hearts,
which is writing SQL queries.
It’s been a longstanding dream of
the program synthesis community
to have
something that can just
take a question over
a given table and spit out
a SQL program that actually
corresponds to this question,
so that you don’t need
to write it yourself.
It turns out we can,
if we phrase it
as a machine translation
problem of sorts and
implement a standard encoder-
decoder LSTM with attention and
pointers, that takes as input
one sequence and outputs
another sequence.
There are quite a few things that
this architecture already gets right.
So, for instance, it
can tell you that,
okay, based on the
embeddings of the words,
this word is associated with
temporal properties
and this column name
is also about temporal properties,
so maybe this month should
correspond to the year.
I can tell you that,
this column contains numbers so it’s
ordinal and there is
a word last in here,
so last often corresponds to
a max operator in this
case based on my dataset.
I can tell you that this
constant appears in
the table so it should probably
appear also in the program,
if it appears in
the question as well.
These things can be noticed
statistically pretty reliably.
If you want details,
you can look up, for
instance, this paper.
So, specifically, how you implement
the sequence-to-sequence
architecture in this case
is that you have an LSTM that
encodes your schema,
the column names.
You have another LSTM that
encodes the question and then
the decoder LSTM at every point
can output one of three things;
either it outputs
a column name, in which case,
what you actually
want to do is to have
a copying mechanism
that will look at
all the tokens in the column names,
produce a distribution over all
of them, and pick the best one.
Or it just outputs a
constant token, in which case,
you just have a softmax
distribution over
all the possible constants
that appear in your grammar,
like an equal sign or
a max operator or things like that.
Or it outputs a particular
value, in which case,
that’s another copying
mechanism that just
looks at the distribution over
the words of the question,
instead of the words of the schema.
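A simplified sketch of one decoder step with those three output choices, in PyTorch; the module names, scoring functions, and dimensions are illustrative assumptions, and the pointer architectures in the cited paper are more involved:

```python
# One decoder step: choose between copying a column name, emitting a SQL
# keyword/constant, or copying a value from the question.
import torch
import torch.nn as nn
import torch.nn.functional as F

HIDDEN = 64
SQL_VOCAB = ["SELECT", "WHERE", "MAX", "MIN", "=", ">", "<", "AND"]

class DecoderStep(nn.Module):
    def __init__(self):
        super().__init__()
        self.kind = nn.Linear(HIDDEN, 3)                 # column / keyword / value
        self.vocab = nn.Linear(HIDDEN, len(SQL_VOCAB))
        self.col_ptr = nn.Bilinear(HIDDEN, HIDDEN, 1)    # pointer over columns
        self.q_ptr = nn.Bilinear(HIDDEN, HIDDEN, 1)      # pointer over question

    def forward(self, dec_state, col_encodings, question_encodings):
        kind_logp = F.log_softmax(self.kind(dec_state), dim=-1)
        vocab_logp = F.log_softmax(self.vocab(dec_state), dim=-1)
        col_scores = self.col_ptr(col_encodings,
                                  dec_state.expand_as(col_encodings)).squeeze(-1)
        q_scores = self.q_ptr(question_encodings,
                              dec_state.expand_as(question_encodings)).squeeze(-1)
        return {"column":  F.log_softmax(col_scores, dim=-1) + kind_logp[0],
                "keyword": vocab_logp + kind_logp[1],
                "value":   F.log_softmax(q_scores, dim=-1) + kind_logp[2]}

step = DecoderStep()
out = step(torch.randn(HIDDEN),
           torch.randn(5, HIDDEN),      # 5 encoded column names
           torch.randn(12, HIDDEN))     # 12 encoded question tokens
```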
Now, if you implement that,
you’ll see that it
works to some extent,
like it gets you 80 percent of
the way figuratively speaking.
However, as we mentioned it
cannot guarantee you that
the program that you are
generating is actually correct.
Now, in this case,
we have natural language as
a specification, not examples.
So, there is no way to say,
“Look at the program
and check whether it
satisfied the spec or
not by construction.”
However, at least we would like to
generate programs that are valid
SQL and likely to be correct,
and compile, and execute,
and we delegate to
the neural network the job of
making this program likely,
as in, likely to be the one that
satisfies the intent
expressed in the language.
If we want this guarantee
to be present there,
what you want to add on
top of the neural network
is something that we
call execution guidance.
Execution guidance is a set of
post hoc filtering criteria
that enforce that the generation
of the neural network
is a valid program.
What does it mean, specifically?
So, let’s look at this example.
We have a schema.
We have a question,
we have some neural network
that processes all of it and
then one token at a time
outputs us a SQL query.
However, in practice, as
you probably already know
from the standard architectures
for machine translation
and whatnot,
you don’t usually output
a single prediction.
You implement something
called a Beam Search and
output multiple predictions
at every step of the way.
So, as you progress
through time during
these decoding and generation,
you’re going to say,
“Give me the top 50 predictions
at this point based on
their joint probability with
the prefix that they’ve
generated so far.
Then I am going to eliminate
all but the top 50 ones
and then move to the next time step
and also ask for
the top predictions, and so on.”
If your beam width is not 50 but one,
you get the greedy
single token generation
that actually is pretty
fragile, because again,
if a single token during
generation erroneously has
a lower probability
than some other one,
you will not reconstruct
the right program,
which is why people implement
something like Beam Search.
Another thing that
Beam Search gives us,
yes? Is there a question?
Okay. Another thing that
Beam Search gives us,
is a way to look at
these partial programs and
eliminate not just the ones
that are low probability,
but also the ones that
are unlikely to be good.
So, specifically, let’s
look at some cases in here.
Okay. There is a SUM
and there is a city.
It’s an error,
I don’t know how to sum cities.
If you had this full program,
you could give it to
a SQL compiler and it will tell
us that cannot possibly compile.
But even at this step,
you can already look at
the partial program,
give it to a compiler
and it will tell
us this is already a syntax error,
so you might as well
eliminate that from the beam.
Same here. This is
not a syntax error.
This is technically a valid SQL
but if you try to execute it,
it will already give
you an empty set.
If, for instance, your SQL dialect
doesn’t allow ORs,
then you know that sticking
further more and more
clauses on top of
this program is only going to
make the output more empty.
Most of the time,
the question is actually answerable,
so you might want to decide not to
output a program whose output
is guaranteed to be empty.
If you want to allow
unanswerable questions,
that’s perfectly fine,
just don’t include
that particular filtering criteria.
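A minimal sketch of beam search with such a viability filter; `expand` and `is_viable` are hypothetical hooks into a model and a SQL checker, not any particular library's API:

```python
# Beam search with execution guidance: partial programs that cannot compile,
# or that are guaranteed to return an empty result, are dropped early.
import heapq

def beam_search(initial_prefix, expand, is_viable, beam_width=50, max_len=30):
    """expand(prefix) -> list of (next_token, log_prob) from the model."""
    beam = [(0.0, initial_prefix)]              # (cumulative log-prob, prefix)
    for _ in range(max_len):
        candidates = []
        for logp, prefix in beam:
            for token, token_logp in expand(prefix):
                new_prefix = prefix + [token]
                if not is_viable(new_prefix):   # execution guidance check
                    continue
                candidates.append((logp + token_logp, new_prefix))
        if not candidates:
            break
        beam = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    return beam

def is_viable(partial_sql_tokens):
    # Placeholder for the checks a real system would run against a SQL engine:
    #  1. the partial program still parses (no SUM-over-city type errors),
    #  2. executing the partial query doesn't already yield an empty result
    #     (adding more clauses can only shrink it further).
    return True

# Toy demo with a fake model that always proposes two tokens.
toy_expand = lambda prefix: [("SELECT", -0.1), ("WHERE", -0.5)] if len(prefix) < 3 else []
print(beam_search([], toy_expand, is_viable, beam_width=2, max_len=5))
```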
So, SQL actually, like
any language, has a grammar.
At any point of time,
as you are generating the program,
you can look at the last
token and say, “Okay,
based on my grammar,
I know which tokens can appear at
the next position in principle.”
Like I just generated ‘where’,
the next thing must be a column name.
It cannot be some other
operator or whatever.
In practice, it means that as
you are generating things,
you want to do this analysis,
come up with a set of tokens that
can come at the next position,
and then all the other ones mask
out of the output vocabulary.
Give them the mask of minus
infinity so that the network
cannot possibly choose them.
That gives you a very simple way
to make the programs
compile by construction.
This technique you can actually
incorporate in training as well,
not just in the inference,
so that as you are training
the network, at every point,
you give it a choice only
of the valid tokens,
not of all ones that are
possible in the grammar.
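A minimal sketch of that grammar-based masking; the toy vocabulary and the `legal_next_tokens` rule are stand-ins for a real SQL grammar:

```python
# Tokens that cannot legally follow the current prefix get a logit of -inf,
# so the network cannot choose them.
import torch

VOCAB = ["SELECT", "WHERE", "col:year", "col:month", "=", "MAX", "42"]

def legal_next_tokens(prefix):
    # Toy rule from the talk: right after WHERE, only a column name may appear.
    if prefix and prefix[-1] == "WHERE":
        return {t for t in VOCAB if t.startswith("col:")}
    return set(VOCAB)

def masked_log_probs(logits, prefix):
    allowed = legal_next_tokens(prefix)
    mask = torch.tensor([0.0 if t in allowed else float("-inf") for t in VOCAB])
    return torch.log_softmax(logits + mask, dim=-1)

logits = torch.randn(len(VOCAB))
print(masked_log_probs(logits, ["SELECT", "col:year", "WHERE"]))
# Non-column tokens come out with probability exactly zero (log-prob -inf).
```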
That works in programming
languages because
programming languages
have a formal grammar.
It’s a little harder to do that
if it’s natural language. Okay.
Questions?
Another good way to
use insight from the programming
languages world,
is- I’m going to illustrate
this on the same problem
of taking language and outputting
SQL or something like SQL.
We are going to talk
about sketch generation.
So, intuitively, when you’re
writing a program as a human,
you don’t usually
write it all the way
end-to-end like I
illustrated right now.
You don’t sit down at
your computer and say “Okay.
I thought really hard.
This is the right program.
I’m going to write it,
from the start token to the end token,
and it is going to be correct
on the first attempt”.
I mean, this sometimes happens,
but not as often as we would like.
What happens most of the time,
is that you do that in stages.
You write some skeleton
of the target program,
and then you fill in the holes.
Sometimes, you go back and iterate on
that skeleton and maybe correct
some parts that you wrote before and
fill in the next holes and so on.
So, there are neural network
architectures for program synthesis,
that emulate that process and strive
to make it more akin to how
human programmers operate,
and that is surprisingly effective.
The simplest way to do that,
is to split the
generation in two stages,
what’s called sketch generation
and program generation.
A sketch is a program with holes.
This is the way the program
is going to look as
a template, but without
any specific leaves,
constants, or variables that are going
to make it a valid program.
Like in here, if this is an
expression that I want to generate,
this is a function that
selects all the flights
whose departure time
is less than
some value, and that’s
essentially it.
So, remove all the constants
and variables from this program
and you’ll get a template.
The template is a
function that selects
all flights whose departure time
is less than something.
So you split your program
generation into two phases.
The first LSTM outputs
this sketch, and the second LSTM
is conditioned on
the sketch and spits out
specific ways to fill in that sketch.
There are several different ways
to implement that.
The simplest one that I’ve
shown here is to just have two LSTMs.
One outputs the
tokens of the sketch,
and the other outputs the
tokens of the program.
In cases when a token of the target program
actually appears in the sketch,
the second LSTM will just copy it from
the first one, and if it does not,
it will generate a new one.
So, this idea works
not just for SQL;
it works basically for
almost any program
generation process as long
as you can come up with
a particular definition of
what constitutes a sketch.
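A heavily simplified two-stage sketch-then-fill generator in PyTorch; it omits the copying mechanism between the two decoders and uses illustrative sizes, so it is a sketch of the idea rather than the published model:

```python
# Stage 1 decodes a sketch such as "SELECT flight WHERE departure_time < HOLE";
# stage 2 is conditioned on the encoded sketch and decodes the full program.
import torch
import torch.nn as nn

class TwoStageGenerator(nn.Module):
    def __init__(self, vocab_size=200, emb=64, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.sketch_decoder = nn.LSTM(emb, hidden, batch_first=True)
        self.sketch_encoder = nn.LSTM(emb, hidden, batch_first=True)
        self.program_decoder = nn.LSTM(emb + hidden, hidden, batch_first=True)
        self.sketch_out = nn.Linear(hidden, vocab_size)
        self.program_out = nn.Linear(hidden, vocab_size)

    def forward(self, question_encoding, sketch_tokens, program_tokens):
        # Stage 1: decode sketch tokens, initialized from the question encoding.
        h0 = (question_encoding.unsqueeze(0),
              torch.zeros_like(question_encoding).unsqueeze(0))
        sketch_hidden, _ = self.sketch_decoder(self.embed(sketch_tokens), h0)
        sketch_logits = self.sketch_out(sketch_hidden)
        # Stage 2: summarize the sketch, append the summary to every program
        # token embedding, and decode the full program.
        _, (sketch_summary, _) = self.sketch_encoder(self.embed(sketch_tokens))
        summary = sketch_summary[-1].unsqueeze(1).expand(-1, program_tokens.size(1), -1)
        prog_in = torch.cat([self.embed(program_tokens), summary], dim=-1)
        program_hidden, _ = self.program_decoder(prog_in, h0)
        return sketch_logits, self.program_out(program_hidden)

gen = TwoStageGenerator()
q = torch.randn(1, 128)                        # encoded question (batch of 1)
sketch = torch.randint(0, 200, (1, 8))         # teacher-forced sketch tokens
program = torch.randint(0, 200, (1, 12))       # teacher-forced program tokens
sketch_logits, program_logits = gen(q, sketch, program)
```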
Okay. Questions at this point?
We’ve come all the way
from your symbolic,
we are approaching the last part
of the presentation.
No questions. Excellent.
So, let’s talk
about Neural Program Synthesis
if we don’t have the
search available to us.
Now, some of you may be asking, well,
why did we come all
this way and try to take
the methodology from
machine translation
and try to apply it to
program generation.
I mean, there are some similarities,
but it’s not exactly
the same task. And you are right.
So, it was the perspective
of the machine learning community
in the beginning,
to say, “Well, a program is
just a sequence of words.
We know how to deal with
sequences of words.
Let’s try to apply the same ideas.”
That works, but it has
a few problems. See,
programs are very different
from natural language.
They have keywords,
the semantics of which
are well-defined and
executable, but that’s not
known to the neural network.
Programs are a lot more sparse
than natural language sentences.
See, a lot of
the identifiers in programs
are only seldom used.
Someone came up with them and
they’re used a few times in
your source code and that’s it.
The distribution
has a very long tail.
The natural language distribution
also has a long tail
but not even closely as long.
In practice, it means that if
the neural network doesn’t see
enough examples of using
a particular identifier,
it doesn’t know what to do with it.
Programs have much
longer distance dependencies.
See, we’ve probably talked about
the long distance dependency problem in
natural language, for which
the LSTM and
some other techniques were invented,
but a long distance in a natural
language sentence means what?
Like 20 words.
A long distance in a program means
there is this file somewhere in
the other subdirectory
that you’ve probably
forgotten by now but I suddenly
call a function from it.
So, when people realized all that,
they tried to graduate from sequences
of words to trees of words and
tried to process
whole ASTs in order to
give structure to the
program they’re generating,
but even that doesn’t make use of
all the known rich information
available in the source
code of a program.
So, the current approach is to
encode the programs as graphs.
This is the one that brings
the best of both worlds.
What do I mean by graphs?
So, specifically, we’re going to
have nodes that correspond to
particular locations of our program,
and edges that correspond to
semantic not just syntactic relations
between the variables in the
program and the ways they’re used.
I’m going to show that in
a lot of detail just in
a few slides, so don’t worry.
The nice part about this approach is
that we can run
static analysis algorithms,
like the ones that are present in
the compilers that we’ve
known for decades,
and get all of
the semantic information.
Then, embed it so that the
neural network can make good use of
it and know when the program is
executable and if
it’s executable how.
So, let’s see how we do that.
We are going to try to analyze or
generate a program that
looks somewhat like this.
Every location in this program,
you’re going to represent it
as a node in a giant graph.
Then, there will be
different types of
edges that connect
these nodes together.
What are the types of
edges and what are
the semantic relationships
that we want
to encode in this representation?
For instance, the most
obvious one is just syntax;
this is what you are already seeing.
So, for instance, if you just have
a single type of edge
called “Next Token”,
what you’ll get is a chain of
nodes connected with edges
that corresponds to
a single sequence of words,
and that’s exactly what an LSTM
takes as input and outputs.
That’s your baseline. If you just
add a single other type of edge,
the abstract syntax tree child,
we now have our
representation that has
both the sequence relationships
and the tree relationships,
the AST on top of this.
So, now, this already gives you
two different ways to
look at the program,
its structure and
its order in the file.
But that’s not enough,
we actually want to understand what
the program does, not
just what it looks like.
A way to do that, is to also add
semantic edges from the Data Flow.
So if you haven’t heard of
the Data Flow frameworks,
what they do is they look at the
program and they come up with
some approximation of how
the relations between
the variables can flow throughout
the program if it’s executed.
Like, for instance, you can say, okay
I’m looking at this variable
here right now.
When was the last time
it was possibly written?
If you are looking here for instance,
well the last write is
probably somewhere around,
either here or here.
I don’t know in which iteration
of the loop I am right now,
so I’m going to add
both of these edges.
If I’m looking at this location,
when could possibly
be the last write?
Again, either here or here.
Just add all of them.
Don’t try to analyze it
on a specific input,
this is a generalization over all of them.
Another edge. When was
a variable last used?
Similar but just about
the read instead of writes.
Another one is, if I’m looking
at an assignment, which
particular variables is it
computed from, and so on.
Crucially, you add them to the same graph,
and now you have
a representation that encodes
the program as a mixture
of both syntax
and semantics, which is
what we didn’t have
before in the previous approaches,
which looked at a program
just as a syntactic object,
not as an executable object.
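A tiny illustration of what such a program graph might look like for a small snippet; the token positions and edges below are written by hand for clarity, whereas a real pipeline derives them from the AST and a standard data-flow analysis:

```python
# Program graph with typed edges mixing syntax (NextToken, Child) and
# data flow (LastWrite, LastRead, ComputedFrom).
program = "x = a + b ; y = x * x"
tokens = program.split()     # node 0: 'x', 1: '=', 2: 'a', 3: '+', ... 10: 'x'

edges = {
    "NextToken":    [(i, i + 1) for i in range(len(tokens) - 1)],
    "Child":        [(1, 0), (1, 3), (3, 2), (3, 4),     # first assignment's AST
                     (7, 6), (7, 9), (9, 8), (9, 10)],   # second assignment's AST
    "LastWrite":    [(8, 0), (10, 0)],   # both reads of x point to its write
    "LastRead":     [(10, 8)],           # second read of x -> previous read of x
    "ComputedFrom": [(0, 2), (0, 4),     # x is computed from a and b
                     (6, 8), (6, 10)],   # y is computed from x and x
}
```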
Now, after you have all of these,
the important part is, of course,
how do we process all
of this graph with
a neural network and actually
make any kind of sense out of it?
Let’s talk about that. So,
we’ll start with RNNs.
RNNs are our baseline,
so when we had the chain
structured data like text,
we processed it using
a very simple algorithm which is,
we had a chain of
so-called recurrent units,
a cell that does a computation
based on the vector
representations of individual words,
and we process it in
a single sequence.
Saying, let’s take the representation
of the word, embed it as a message,
a vector, and then at each time step,
the recurrent unit is going
to transform this message,
pass it to the next one,
which is going to
transform it again,
pass it to its neighbor and so on.
So, an operation at this point is
just take your current
representation,
take the representation passed to
you from the prefix, process
it, and then pass it further, and
that continues until the end.
That’s essentially what
an RNN is, y’all know RNNs.
Now, if you add more edges
to this structure,
we now have graph structure data
not just a chain.
So, how do we adapt
this computation scheme?
Now, there is
no simple order anymore.
Well, let’s look at how to do that.
So, we are going to start
with a graph and our graph
has states which are
locations of a program,
which have embeddings just
like words had embeddings.
For instance, you could say, take
the identifier of the variable,
embed it as a word, and
represent it as a vector.
We also have edges.
So, we had multiple
different types of edges
and each of them is going to be
also represented by a
particular recurrent unit.
So now we have a structure that
has a recurrent unit at each node,
just like with text, and a
representation at each node,
just like with text, and we
have multiple types of
edges instead of just one.
So, the idea is, you just take
a node state and you pass it through
the edge network to every neighbor.
Every neighbor at once.
So, take the node state, pass it,
do that with every possible neighbor
according to that edge type.
Now, we have more than one message
arriving at a given node,
instead of one as in
the case of RNNs,
so you’ll need some way
to aggregate them.
Sums usually work well, but you can
experiment with
averages or other aggregations.
So and the recurrent unit
again takes your current
representation of the node,
takes the arriving messages from
the neighbors and updates
its representation.
The structure is very
similar to RNNs,
it just happens at
more than one node and it happens
at more than one neighbor.
We do that for every edge type
separately and we do that for
every node concurrently and what we
get is a way to take
the representation of a
node and use the information
about its neighbors in
order to compute the new one.
If you do that multiple times,
unroll the network for
several different layers,
we now get some processing of
a whole program based on
the propagation of the messages
throughout its structure.
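A bare-bones sketch of that propagation scheme in PyTorch, in the spirit of gated graph neural networks but not the exact published architecture; the edge types match the toy graph above:

```python
# Message passing over a typed program graph: one edge network per edge type,
# one shared GRU cell to update node states from the summed incoming messages.
import torch
import torch.nn as nn

class GraphPropagation(nn.Module):
    def __init__(self, edge_types, dim=64):
        super().__init__()
        self.edge_nets = nn.ModuleDict({t: nn.Linear(dim, dim) for t in edge_types})
        self.update = nn.GRUCell(dim, dim)

    def forward(self, node_states, edges, steps=4):
        for _ in range(steps):
            messages = torch.zeros_like(node_states)
            for edge_type, pairs in edges.items():
                net = self.edge_nets[edge_type]
                for src, dst in pairs:
                    # Pass the source state through the edge network and sum
                    # all messages arriving at the destination node.
                    messages[dst] = messages[dst] + net(node_states[src])
            node_states = self.update(messages, node_states)
        return node_states

edge_types = ["NextToken", "Child", "LastWrite", "LastRead", "ComputedFrom"]
gnn = GraphPropagation(edge_types)
states = torch.randn(11, 64)        # one embedding per token of the snippet
new_states = gnn(states, {"NextToken": [(i, i + 1) for i in range(10)]})
```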
Okay. So, this is about GNNs,
natural question is what does this
all have to do with synthesis.
Well, see what it gives us
is a great way to encode
the program and if you have
a particularly good representation of
a program that is able to
make sense of its syntax,
as well as its semantics,
you can apply this encoding method
to the programs you are
generating as well as
the programs you are conditioning
on if you have something
to condition on.
So for instance, suppose that you
want to do program completion,
you have a particular program
and you want a way to
step in at
particular points as
the user is typing and
give suggestions.
IntelliSense but not just a
single token IntelliSense.
IntelliSense on the level
of whole expressions.
In this setting, there are
no specifications like
examples or natural language;
the intent expressed is:
here’s the context of my program,
what is the most likely
completion of that context?
So, the principle,
now that you have
seen the encoding, is actually
relatively straightforward.
You’re saying, “Let’s take
this context and encode it
using a graph neural network.”
After we encode it, we get some
representation at the hole node,
which is going to be a vector
from which we start
generating our program,
and representations
of the variables around
it, which we will
pass to a sequential
generation procedure
that is going to give
us the generated code,
like this expression here.
The generation procedure
is very similar to
the deductive search
that you already saw,
just implemented
using a neural network
instead of the logical search.
So, essentially it’s going to build
the expression top-down again,
top-down left to right.
Just do this using message
passing and propagation
of a graph neural network instead
of the logical derivations.
So we’ll start here,
focus on the hole,
we want to select what should
be expanded in this hole.
We have productions; at
this point it could be
a minus or a plus or a
function call and whatnot.
Just like in the deductive
search in Flash Fill we had
productions, but there we
expanded all of them.
Here we’re going to just have
a softmax that selects over them.
Focus on the next hole, expand it,
now we have a terminal node,
so at a terminal node we’ll
select one of our variables.
Add the edges that
correspond to the semantics,
propagate through them
like the last use edge
that looks at the variable node
in the context and says,
well where it was last used,
and continue the same process
step-by-step left to right.
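A small sketch of what expanding one hole could look like: a softmax over grammar productions from the hole's graph state, and a pointer-style choice among in-scope variables for terminal nodes. This is illustrative PyTorch with assumed names, not the published model:

```python
# Expand one hole: choose a production, or choose a variable at a terminal.
import torch
import torch.nn as nn

PRODUCTIONS = ["Plus", "Minus", "Call", "Variable", "Literal"]

class HoleExpander(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.production_head = nn.Linear(dim, len(PRODUCTIONS))
        self.variable_score = nn.Bilinear(dim, dim, 1)

    def choose_production(self, hole_state):
        return torch.softmax(self.production_head(hole_state), dim=-1)

    def choose_variable(self, hole_state, variable_states):
        # Score each in-scope variable's graph representation against the hole.
        scores = self.variable_score(
            variable_states, hole_state.expand_as(variable_states)).squeeze(-1)
        return torch.softmax(scores, dim=-1)

expander = HoleExpander()
hole = torch.randn(64)                   # hole node state after GNN encoding
variables = torch.randn(3, 64)           # states of 3 variables in scope
print(expander.choose_production(hole))  # distribution over Plus/Minus/Call/...
print(expander.choose_variable(hole, variables))
```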
If you look at this particular
setting and apply it,
you get a network that is actually
surprisingly good at
looking at small snippets
of code and producing
expressions at every point
and rankings of them.
So, for instance, this
example you’ve already seen;
sometimes it’s not quite correct.
So, for instance, here
the right expression
appears at the second spot
instead of the first one,
but their probabilities
are not that different.
Sometimes it’s surprisingly good,
like in this expression it knows that
in the right one a character variable
should be compared to some constant.
It can’t generate
the right constant,
but it knows it should be
some constant, and so on.
So I’m going to
conclude at this point,
and that’s a higher level overview
of multiple different things.
If any of these specific
things interests you,
here’s a good set of links.
So the survey paper at
the very top is probably the
best one for the whole field,
and it covers the symbolic
sections, it covers
the neural sections, and many
particular projects and applications,
depending on the ones
that interest you.
I encourage you to check out
specific frameworks because
people have built them to
simplify particular ways to apply
program synthesis in
end-user driven applications,
and PROSE and Sketch are
the two most popular ones,
so make use of them.
There are a couple of
blog posts that point to
more recent research and
papers and results if
you want to do those.
Before we go, I want
to leave you with
some takeaways
from this entire talk.
Program synthesis as
a whole is a way to
generate programs in
a particular application domain,
and in order to implement that,
you need three main components.
Define your language of programs,
define how you take a specification,
and define how you
do the search.
There are several particular
choices for all of them.
If your specification is examples,
you want to take as few of
them as possible and you
want to do that fast.
An enumerative or deductive search
usually is a great way to do that.
It ensures correctness
of your programs,
at least with respect to examples,
but it’s not always
the most performant way.
So if you want to speed it up,
a neural guided search is
a good way to do that.
For other specifications,
where you can’t quite use
search because you need
to make sense of language
or you need to make
sense of the context
of your program or
other domains in which
neural networks are
a more effective way to
encode that, use neural networks.
These are either recurrent ones
or graph-based ones,
depending on the complexity
of your programs.
You give up execution correctness.
So guide them using
partial program execution or
the structure artificially added to
the programs in order to ensure
that the programs compile.
Hints help a lot.
You don’t want to just
generate any program.
If you have any hint about what
the program should be, give it.
If you can write a sketch of
the right program, write one.
If you have a grammar of
the language in which you’re
generating your program,
communicate that grammar
to the network.
Like, for instance, limit
the tokens that are available
at each position based on
the partially generated program.
If you can train a
ranking function that
can prioritize some programs
better than others,
that is another hint.
There are frameworks that
people have built for
particular kinds of program synthesis
that you can leverage
in different settings.
That concludes the high-level
overview of Program Synthesis,
and happy to take questions.
Thank you.
Questions? Yes.
>>Can you elaborate on
the ranking functions?
Because I think you
covered how to do ranking
using neural networks on
the node level but not
on the program level.
>>Yes. So a ranking function is
something that takes
as input the program,
optionally takes as input the task,
like what are the inputs and
outputs that you wanted
to satisfy for instance.
And it outputs some kind
of score over it.
This program is probably
more generalizable.
If you want to train that,
what you want to do
is build your data
set of all possible programs
that could satisfy
the examples and then mark
them as generalizable or not.
You can do that because
you usually have
more input data examples
than the ones that are given
to this specification.
Like in a spreadsheet,
in the case of Flash Fill,
you can say, “Okay,
I’m generating the program
based on the top two rows,
but I’m going to
judge the programs as
good or not based on their behavior
on the rest of the rows.”
That gives you training data.
There are lots of
different objectives and
ways to actually set
up the learning problem,
but what essentially you
want to achieve is to say
any good program should be higher
than all possible bad programs.
So the score outputted by the
ranking function should separate
a good program further from
the scores of all the bad programs.
That corresponds to
something called the
max-margin loss, where you say,
“I want to make the margin between
good programs and
bad programs as large as
possible,” at a high level.
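A minimal sketch of such a max-margin objective in PyTorch, with made-up scores standing in for ranker outputs:

```python
# Pairwise max-margin ranking loss: every good (generalizing) program should
# score higher than every bad one by at least a margin.
import torch

def max_margin_loss(good_scores, bad_scores, margin=1.0):
    # Hinge over all good/bad pairs: relu(margin - (good - bad)).
    diff = good_scores.unsqueeze(1) - bad_scores.unsqueeze(0)
    return torch.clamp(margin - diff, min=0).mean()

good = torch.tensor([2.3, 1.7])        # ranker scores of generalizing programs
bad = torch.tensor([0.4, 1.9, -0.2])   # scores of spurious programs
print(max_margin_loss(good, bad))
```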
So this is one way to do that:
you just
encode full programs and the tasks,
and that gives you
a ranking function. You could
also do that at every level.
If you can collect
the training data from
all the levels of the search process
just like we did within
neural guided search.
Of course, if you don’t
want to train one,
you can just
write a ranking function by hand.
Actually, some simple heuristics
that say programs that are shorter are
better, programs that
have fewer constants are
better, yada yada, already
get you 80% of the way.
Questions? Yes.
>>Related question.
How do you ensure
the performance of the generated
programs in a given domain?
Would the generated program be
an n-squared program,
or if it’s SQL,
does it take care of the indexing,
that system-related knowledge? How do you kind
of build that knowledge
into the generation?
>>Yes. Great question. So
I’m going to repeat that.
The question is how do you
ensure the performance as in
the speed of the generated
programs is up to par,
and for different domains
that corresponds to
different notions of say complexity
or indexing and what not?
Okay. Performance is another metric
on which you want to score
the generated program.
We just talked about
ranking functions.
The ranking functions are about
generalization or
applicability of a program.
You can build similar thing
for performance.
If you look at the program and say
“These operators are in practice has
slower implementations than those
operators,” and now as you’re
generating you want to optimize
a mixture of two objectives.
This is if you can build
a simple cost model.
For SQL indexes I suppose,
again you can say I have
indexes for these columns.
I know that these operators
are more efficient
than those operators,
and that is a second component
to my objective.
So the network learns
to optimize that,
in addition to actually generating
just the right program.
You can also often do
that in two stages,
where you generate a set of
programs that are
likely to be correct,
and then among those, you’d
pick the most efficient one.
Yes.
More questions? All right.
Thank you again, and
let’s get back to coffee.
