AI for Imperfect-Information Games: Beating Top Humans in No-Limit Poker


>>I'm going to talk about AI for large imperfect-information games, in particular about how we built an AI that beat top humans in no-limit poker.
Okay. So, for starters,
this talk is going to be about
imperfect-information
games in general.
I'm not going to talk about perfect-information games like chess or Go. What I'll cover applies to poker, but also more generally to any strategic interaction that involves hidden information, for example, security interactions or negotiations.
I think this is really important
for bringing AI into
the real world,
because the truth is
most real-world
strategic interactions
involve some amount of
hidden information.
So, when it comes to these games,
poker has served as
the primary benchmark challenge
going back decades.
In fact, if you look at
the original papers
on game theory,
pretty much the only
application they
talk about is poker,
because it so accurately captures the challenge of hidden information.
Particularly, there’s
a variant of poker called
heads-up no-limit
Texas hold’em that has
emerged as the primary benchmark
for these games.
Heads-up no-limit Texas hold'em is a massive game.
It has about 10 to the 161
different decision points.
It is also the most
popular variant
of poker in the world.
For example, no-limit
Texas hold’em
is the game that is played at
The World Series of
Poker main event. Every year, the winner is determined by heads-up no-limit Texas hold'em.
It’s also featured in
popular movies about poker.
For example, Casino Royale
and Rounders.
In some ways, you
could argue it’s
the purest form of poker.
It’s subjective, but it
is a very strategic game,
whether you win or lose is entirely up to your skill. It's not up to the other players at the table, except for your own opponent, I guess. So, there are no kingmaker effects, for example. And no prior AI had been able to beat top humans in this game. That is, until 2017.
So, in 2017, we organized
something called the
Brains vs AI Challenge.
We created an AI called Libratus, which we played against four of the world's best heads-up no-limit Texas hold'em specialists.
These are all people that
make about seven figures per
year playing this game online.
We played 120,000 hands of poker over the course of 20 days, and there was a $200,000 prize pool divided among the pros to incentivize them to play their best. So, they weren't risking their own money, but how much of the prize money they won depended on how well they did relative to the other players.
Now, unless you follow poker closely, you might not have heard of these pros, so I want to say a word about how strong they are, because it really is important to play against the top players.
Unfortunately, there are
no objective rankings
of professional poker players.
But like I said,
these are all players
that make millions
of dollars a year.
In fact, here's a question from the poker subreddit, where somebody was asking how good the players we were playing against are. Somebody responded, "These players will absolutely trounce all the 2000s heroes that you might have heard of. The heroes from the 2000s would be division three college players, whereas these guys are all-star caliber pros." That's a pretty accurate description, I would say.
There is a big skill difference between the pros that you see on ESPN and these guys, who actually play this game for a living. The guys you see on ESPN are basically celebrities; these guys are the ones who actually make a living playing this game.
The final result is
that Libratus beat
the humans in this game by a lot.
The victory margin
was 147 mbb/game,
which is a measure of win rate in poker. Unless you're an actual poker player, that doesn't mean much, but to give you some perspective, it's about three times the win rate of a top pro playing against an average pro.
It was statistically
significant at
about four standard deviations,
and each human lost
individually to the AI.
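A quick back-of-the-envelope conversion of that win rate, for readers who don't play poker: an mbb is a thousandth of a big blind, so 147 mbb/game is 14.7 big blinds per 100 hands. The dollar figure below assumes a $100 big blind, which is my own assumption rather than a number given in this talk.

```python
# Rough conversion of the reported win rate; the $100 big blind is assumed.
win_rate_mbb = 147        # milli-big-blinds won per hand
hands = 120_000
big_blind_usd = 100       # assumed stake, not stated in this talk

bb_per_100_hands = win_rate_mbb / 1000 * 100
total_big_blinds = win_rate_mbb / 1000 * hands
print(f"{bb_per_100_hands:.1f} big blinds per 100 hands")               # 14.7
print(f"~${total_big_blinds * big_blind_usd:,.0f} over 120,000 hands")  # ~$1,764,000
```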
This was a big surprise
to everybody.
In fact, when we announced
the competition,
there was a betting market
on the outcome,
because it's the poker world, and they obviously like to gamble on these things.
When we first announced
that we’re going to
do this competition,
the betting odds were
four to one against us.
In fact, even after we
won on the first day,
the betting odds were still
two to one against us.
I think it wasn't until about the third day that the betting odds were even,
and by the eighth day,
you couldn’t even
bet on the outcome of
the competition anymore.
You could just bet on how much
each human would lose
on each individual day,
because it was clear at
that point that this AI
is going to win.
In fact, even if you asked us,
we were not very confident
that we would win.
I put our odds at about 60 percent, maybe 65, but I didn't think we would have such a decisive victory.
Actually, after this competition,
we did another competition
against these Chinese pros.
So, basically, somebody called Kai-Fu Lee in China called us and said, "We would like you to do another competition in China against Chinese players. We will broadcast it; it would be a lot of fun." We were like, "Well, why should we do this? We just played against the top humans, and these Chinese players are not as good."
He said that he would pay us.
So, we said, ”Okay, great.”
So, we played 36,000 hands
against six Chinese players.
We beat them by
even more than we beat
the top humans in America.
That was actually
a huge hit in China.
It was watched live
by millions of
people during that competition.
They had really nice production
where you could see
a poster like this.
It was way better than
what we did in America.
All right. So, why are
imperfect-information
games so hard?
After all, we have AIs that can
beat humans in games like chess,
we have AIs that
beat humans in Go.
In fact, you might have heard recently about AlphaZero, which is essentially superhuman in chess, Go, and shogi, all using the same algorithm. So, what is it about imperfect-information games that makes them so difficult?
One of the major challenges,
not the only one, but
one of the major ones,
is that in an
imperfect-information game,
the optimal strategy
for a subgame,
for part of the game, cannot
be determined in isolation.
It cannot be determined using
information in just
that subgame alone.
So, let me show you what I mean.
Before I get to that,
deep learning has taken
a lot of credit recently for
a lot of the breakthroughs in AI.
Actually, our AI did not use any deep learning, no deep learning at all. But I would also argue that a big reason why all these AIs are superhuman in various games, like chess, Go, even backgammon, is that they use real-time planning. The planning component is huge. AlphaGo, for example, used Monte Carlo tree search; Deep Blue used alpha-beta pruning.
So, in fact, if you look at AlphaZero without real-time planning, I guess this is washed out on the slide, but its rating ends up right around there without Monte Carlo tree search at test time, and top human performance is right around here. So, without Monte Carlo tree search, AlphaZero is not superhuman; the tree search gets you about 2,000 Elo. So, real-time planning is really important, not just in Go but, it turns out, also in poker. Figuring out how to do real-time planning was actually the key breakthrough that allowed us to beat top humans. But it turns out that in poker, real-time planning is way harder, and that's what I'll get to now.
So in perfect-information games,
you take some action,
your opponent takes some action,
you find yourself in a particular subgame. Now, you can forget about everything that came before, all the other situations you did not encounter.
The only thing that matters
is the situation that you’re in,
and the situations that can
be reached from this point on.
So, in perfect-information games, for example, if I were to show you this chess board, you don't have to know how we ended up in this situation; you don't have to know about the Sicilian Defense or the Queen's Gambit. You can just look at this board, and if you're White, you can say, "Okay, well, if I do a search, I can see that if I move my white queen there, then it's checkmate and the game is over. So, I should just do that." You don't have to know anything about the strategy of chess.
But in imperfect-information
games,
if you take some action,
and your opponent
takes some action,
and you find yourself in a particular subgame, then some other subgame that you are not in, and that in fact you might not even be able to reach from this point on, can affect what the optimal strategy is for the subgame that you are in.
This is counter-intuitive,
but I’m going to give you
a concrete example in a little
bit that illustrates this.
Now, before I get to that,
I want to talk a little bit
about what our goal
is in these games.
Our goal is to find a Nash equilibrium, which in two-player zero-sum games is the same thing as a minimax equilibrium. I won't get too technical about the definition, but basically, in a two-player zero-sum game, if you're playing a Nash equilibrium, you are guaranteed not to lose in expectation. Now, it's not always easy to find a Nash equilibrium, but one is always guaranteed to exist in a finite two-player zero-sum game.
So, for example, in rock-paper-scissors, the Nash equilibrium is to mix randomly between rock, paper, and scissors with equal probability, because if you do that, then no matter what your opponent does, you will not lose in expectation. Now, in rock-paper-scissors, that also means you're not going to win in expectation, but in a complicated game like poker, where there are a lot of suboptimal actions that aren't actually played in the Nash equilibrium, it's likely that your opponent will make mistakes, and in practice you will end up winning as well.
Yes.
>>How important is it that the game is heads-up? If I compare heads-up to, say, a game with six or seven players, how would this go?
>>That is a great question. So, I was going to get to this later, but let's talk about it now.
In poker, it doesn’t
really matter.
So, in poker, if you were to use
these same techniques
for six player poker,
you would almost certainly win.
That said, in general, poker is a special game. I don't know if you play poker, but there are two special things about it.
One is, it’s really hard to
collaborate with other players.
So, you can’t say, “Hey,
let’s team up against
this other person at the table.”
In fact, if you try to do that,
that’ll be against
the rules of poker.
The other thing that’s
unique about poker
is that people fold in the game.
So, even if you have
six players at
the start of the game,
it very quickly comes down to
two players because people fold.
So, you can use these techniques that are only guaranteed for two-player zero-sum games, and they will just work in six-player poker.
But a big challenge is
extending these techniques to
other games that do
allow for collaboration.
And we don't really have a good approach for those games yet.
So, for now,
I’m just going to assume
that we’re working in
the two-player
zero-sum setting and
it does extend in some cases
to other situations as well.
So, our goal is to find
an approximate Nash equilibrium.
We’re going to
measure performance
in terms of exploitability.
You can think of it as distance from a Nash equilibrium: it's how well we would do against a worst-case adversary, relative to how we would have done if we had played a Nash equilibrium instead. So, it measures how exploitable we are.
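To make that definition concrete, here is a minimal sketch of exploitability in a tiny two-player zero-sum matrix game, ordinary rock-paper-scissors, whose game value is zero; the function simply measures how much a best-responding opponent wins against a given mixed strategy.

```python
import numpy as np

# Payoff matrix for us in rock-paper-scissors (rows: our action, columns: opponent's).
A = np.array([[ 0, -1,  1],   # rock
              [ 1,  0, -1],   # paper
              [-1,  1,  0]])  # scissors

def exploitability(our_mix):
    """How much a worst-case opponent wins against our mixed strategy,
    relative to the game value of zero."""
    payoff_vs_each_response = np.asarray(our_mix) @ A
    return 0 - payoff_vs_each_response.min()   # the opponent picks our worst column

print(exploitability([1/3, 1/3, 1/3]))  # 0.0: the equilibrium is unexploitable
print(exploitability([0.8, 0.1, 0.1]))  # 0.7: an opponent who always plays paper wins 0.7
```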
I would argue that exploitability
is actually extremely
important and has
been overlooked in the AI
community as a whole.
I think two recent
man-machine matches
actually really highlight this.
One is the OpenAI,
one versus one Dota2
Matches that you might
have heard about,
and the other is
Fan Hui versus AlphaGo.
In the OpenAI Matches,
they made this AI
that was able to beat
top humans in one versus
one Dota2 over three games.
But after they won
against the top human,
they actually opened it up to
the public and they invited
random mediocre players to
play against it to see if they
could find any weaknesses.
In fact, pretty quickly, within a few thousand games, weak players were able to find certain tricks to basically fool the AI; they figured out how to exploit it and beat it.
Also, in Fan Hui versus AlphaGo,
so they famously
beat Fan Hui 5-0.
But then after they
published the Nature paper,
they invited him to play
several more matches
against it to see
if he could find
out any weaknesses in the AI.
In fact, he was able to find weaknesses where he could consistently beat the AI, and they had to patch this.
So, I think what this really
demonstrates is that
it’s not enough to beat
top humans in three or
five or even 10 games.
You really have to be able to
consistently beat top humans,
especially if you want to deploy
an AI into the real world.
If you're Microsoft and you're trying to deploy products to real users, there are millions or billions of them, and if there's a weakness, they're going to find it.
But with Libratus, we played the top humans not just in three or five hands of poker; we played them over 120,000 hands of poker across 20 days, and that whole time, all four players were working as a team to try to exploit the AI in any way they could.
In fact, actually, I
had lunch with one
of the players,
just a couple months ago.
He said that the
thing they found most
shocking about the
competition is that,
at the end of each day,
we gave them a log of
all the hands that were
played and we told them what
the bot had on each
hand that was played.
This is big because in poker,
a big part of the game is
actually keeping
your strategy hidden.
If you fold, your opponent doesn't see what cards you have. In fact, even if you don't fold but you lose the hand, you still don't have to show what your cards are. So, you only see your opponent's hand about 20 or 25 percent of the time.
So, like poker players
will sometimes even call,
just to see what
their opponent had.
But here, we’re just giving
them that information.
We’re telling them
what the bot had on
every single hand that it played.
So, they didn't have to worry about that part at all,
and they found it absolutely
amazing that they
could not figure
out how to exploit the AI,
even though we were
showing them the hands that
the bot was playing every
single time, and the bot's strategy
wasn’t really changing
that much between days.
All right, so I
think exploitability is
extremely important.
I think it has been overlooked by the AI community at large, and it is something that the imperfect-information game-solving community has focused on throughout its existence.
All right, so now,
I want to get to
the example of why imperfect
information games are hard.
I’m going to talk
about a simple example
that I call a Coin Toss.
It starts with a coin flip: a coin is flipped and lands heads or tails with 50-50 probability.
Player one is going to observe
the outcome of the coin toss.
Player two is not.
So, after this coin lands,
Player one has a choice.
They can either sell the coin
or they can choose play.
We'll say that if they choose sell, it leads to some separate subgame, the details of which are not important; the only thing that matters is its expected value. So, we'll say that if the coin landed heads, then the coin was lucky and they can sell it for $0.50. On the other hand, if the coin landed tails, we'll say it's unlucky and Player one loses $0.50 by selling it.
On the other hand, they
could choose play,
and if they choose play,
then it leads to Player two,
and Player two has to
then guess how the coin
landed without having observed
how it actually landed.
So, if they guess correctly, that is, Player two guesses heads and the coin actually landed heads, then Player one loses one dollar and Player two gains one dollar.
Here, the payoffs are shown for
Player one because this is
a two-player zero-sum game.
So, Player two just receives
the opposite payoff.
Now, on the other hand,
if Player two
guesses incorrectly that is
they guess tails and the coin
actually landed heads,
then Player one gains one dollar
and Player two loses one dollar.
You can see there's a dotted line between Player two's two nodes; this signifies that Player two is in what's called an information set. It means that Player two, because they did not observe how the coin landed, does not know which of those two states they are actually in.
So, now imagine that you are Player two in this game. You've just observed Player one choose the play action, so you know that you are in this imperfect-information subgame.
So, what should
you do? Should you
guess heads or should
you guess tails?
Well, one option is to just always guess heads. But that's obviously a really bad strategy, because now Player one can just sell the coin when it lands heads and get $0.50, and choose play when the coin lands tails and gain a dollar. So, on average, they're getting $0.75.
On the other hand, you could always guess tails, but that's also a really bad idea, because now Player one can choose play when the coin lands heads and gain a dollar, and choose sell when the coin lands tails and lose $0.50, which is better than losing a dollar. So, on average, they're still getting $0.25 in this game.
So, it turns out that
the optimal strategy is to mix.
It’s to guess heads with
25 percent probability and
tails with 75
percent probability.
If you do that, then no matter what Player one does, the best they can do is break even, getting on average zero dollars in this game. So, this is the Nash equilibrium strategy for Player two, at least for this subgame.
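As a quick sanity check of that claim, here is a minimal sketch that computes Player one's expected values in the coin-toss game as a function of Player two's probability of guessing heads, using the payoffs from the example above.

```python
def p1_values(p_guess_heads):
    """Player one's expected value for playing in each state, plus their overall
    value, given Player two's probability of guessing heads after 'play'."""
    p = p_guess_heads
    sell = {"heads": 0.5, "tails": -0.5}             # selling the coin
    play = {"heads": 1 - 2 * p, "tails": 2 * p - 1}  # +1 if P2 guesses wrong, -1 if right
    # Player one sees the coin, so they take the better option in each state.
    overall = (0.5 * max(sell["heads"], play["heads"])
               + 0.5 * max(sell["tails"], play["tails"]))
    return play["heads"], play["tails"], overall

print(p1_values(0.25))  # (0.5, -0.5, 0.0): Player one can only break even
print(p1_values(1.00))  # (-1.0, 1.0, 0.75): always guessing heads gives up $0.75 on average
print(p1_values(0.00))  # (1.0, -1.0, 0.25): always guessing tails gives up $0.25 on average
```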
But now, let’s say we change
the game a little bit.
Let's say we change the payoff for the sell action. Now, in expectation, Player one loses $0.50 for choosing sell when the coin lands heads, and gains $0.50 for choosing sell when the coin lands tails. Well, it's pretty easy to see that, as Player two, your strategy in this subgame should now change as well: now you should be guessing heads with 75 percent probability and tails with 25 percent probability.
But you can see what’s
happened here is that,
by changing the expected
value of the sell action,
we have affected what
the optimal strategy is
in the play subgame.
Even though
the sell action is not
part of the play
subgame and in fact,
it’s not even on the path
leading to the play subgame.
So, this is something that
happens in imperfect
information games.
It does not happen in
perfect information games.
In perfect information games,
if you wanted to determine
the optimal strategy in subgame,
you only need to look at
that subgame by itself.
But in imperfect
information games,
you have to look at
the game as a whole.
So, you can think of perfect-information games as a special case, where you don't have to worry about all this stuff, and imperfect-information games as the more general case, where this is a problem.
So, what do we do? Well, it
turns out that we
don’t actually have
to know the strategy for
the entire game as a whole.
I mentioned that this sell
action leads to a subgame,
where both players
might take actions.
But you don't have to worry about that; the only thing that really matters for determining the optimal strategy in this play subgame is the expected value of Player one choosing sell. So, what we can do is try to estimate what that value is to Player one, and if we have that, then we can determine the optimal strategy in the play subgame. That's what we actually did in Libratus.
We also have a theorem that
says, “If this estimate is
within delta of the true
Nash equilibrium value,
then we can solve the play subgame and get within delta of the Nash equilibrium."
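Here is a minimal sketch of that idea on the coin-toss game: given an estimate of Player one's expected value for selling in each state, pick Player two's mix in the play subgame so that Player one gains as little as possible by entering it. This is only an illustration of the principle, with a coarse grid search standing in for the real solver; it is not the exact subgame-solving formulation used in Libratus.

```python
def p1_best_value(p_guess_heads, sell_ev):
    """Player one's overall value when Player two guesses heads with probability
    p_guess_heads and the sell action is worth sell_ev[state] in each state."""
    p = p_guess_heads
    play_ev = {"heads": 1 - 2 * p, "tails": 2 * p - 1}
    return (0.5 * max(sell_ev["heads"], play_ev["heads"])
            + 0.5 * max(sell_ev["tails"], play_ev["tails"]))

def solve_play_subgame(sell_ev, grid=1001):
    """Choose Player two's mix to minimize Player one's best achievable value,
    given only the estimated sell values (coarse grid search for illustration)."""
    candidates = [i / (grid - 1) for i in range(grid)]
    return min(candidates, key=lambda p: p1_best_value(p, sell_ev))

# Original sell values (+$0.50 on heads, -$0.50 on tails): guess heads 25% of the time.
print(solve_play_subgame({"heads": 0.5, "tails": -0.5}))   # 0.25
# Modified sell values (-$0.50 on heads, +$0.50 on tails): guess heads 75% of the time.
print(solve_play_subgame({"heads": -0.5, "tails": 0.5}))   # 0.75
```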
So, in Libratus, we actually do this.
We have this massive game
which is simply way too
large to solve upfront.
So, we come up with
a really good strategy just for
the early part of the game,
and we estimate what
the optimal strategy is and what
the expected values are in
the later parts of the game.
Now, when we're actually playing and we find ourselves in a particular subgame, we come up with a much better strategy for that particular subgame, using information about the expected values from the other subgames. Then we repeat this process: we come up with a really good strategy for the early parts that are coming up and just estimate how to play in the later parts, and when we find ourselves in a new subgame, we again compute a much better strategy for that particular subgame, using information about the expected values of the other subgames. That's called nested subgame solving. This was the key breakthrough that allowed us to beat top humans.
So, when I, yes?
>>Just [inaudible].
>>Yes, that’s a great question.
So, actually, when we do this... this is sort of how we would do it in general. But in poker, there are four betting rounds, and we solved the first two with a pre-computed strategy, because each round grows exponentially in size, so the first two rounds are actually pretty small.
When we got to the end of the second betting round, that's when we applied subgame solving: we came up with a much better strategy for the remainder of the game. We also abstracted the bets: instead of considering all the 20,000 different bet sizes, we would just consider a small fraction of them.
Then each time
the opponent acted,
each time they’ve made a bet,
then we would solve
a new subgame for that bet size.
So, we would apply
this recursive solving thing
every time the opponent made an
action beyond
the second betting round.
So, when I mention
this idea of what’s called
Safe Subgame Solving,
where we use the expected values
from the other subgames,
people always ask
about this thing
called Unsafe Subgame Solving,
which is the more intuitive
approach to doing this.
The idea here is: well, why don't we just estimate what the opponent's strategy is? Let's say we've played a bunch of hands against them, or we can estimate what the Nash equilibrium is for them, and we figure out that they should be choosing play 80 percent of the time when the coin lands heads and 30 percent of the time when the coin lands tails, just for example. Now, if we assume the opponent is playing this strategy, can we then reason about the distribution of states that we might be in, and solve optimally using that distribution?
It turns out that doesn’t
work. So, let me give you
an example of what
this would look like.
The coin lands either heads or tails, so initially we reason that we're in one of these two states with 50-50 probability. Now, if we observe Player one choose play, we would say: okay, well, in a Nash equilibrium we would expect Player one to choose play 80 percent of the time if we were in the heads state and 30 percent of the time if we were in the tails state. So, we update our belief about which state we're in using Bayes' rule, and now we reason that we're in the heads state with 73 percent probability and in the tails state with 27 percent probability.
Now, we would just say, well, if
we assume this
distribution is correct,
then the optimal strategy
is to always choose heads.
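Here is a minimal sketch of that reasoning with the numbers from the example: it applies Bayes' rule under the (risky) assumption that the opponent's play probabilities are fixed, and then picks the guess that is best against the resulting belief.

```python
# Prior over the coin, and the assumed probability that Player one chooses "play"
# in each state; the 0.8 / 0.3 numbers are the ones from the example.
prior = {"heads": 0.5, "tails": 0.5}
p_play = {"heads": 0.8, "tails": 0.3}

# Bayes' rule: P(state | play) is proportional to P(play | state) * P(state).
joint = {s: prior[s] * p_play[s] for s in prior}
posterior = {s: joint[s] / sum(joint.values()) for s in joint}
print(posterior)  # heads: ~0.73, tails: ~0.27

# Player two's expected payoff for each guess under this fixed belief
# (+1 for guessing correctly, -1 for guessing incorrectly).
ev_heads = posterior["heads"] - posterior["tails"]
ev_tails = posterior["tails"] - posterior["heads"]
print(ev_heads, ev_tails)  # ~0.45 vs ~-0.45, so "always guess heads" looks optimal
```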
But we've already established that's a really bad idea, because now the opponent can simply shift to selling the coin when it lands heads and choosing play when it lands tails.
So, the problem with
this approach is that
we’re making an assumption about
how the opponent is playing.
If it were true, if this distribution were right and they really were choosing play with 80 percent probability on heads and 30 percent probability on tails, then yes, we could apply this reasoning. But the opponent's strategy can always change; they can always change adversarially to us. Yes?
>>There’s one thing that I’ve
always been interested in,
when you play the Nash or when
you play against the opponent.
It seems like they’re
not going to shift.
Even if you're playing the wrong strategy, they wouldn't exploit it immediately.
They have to learn to exploit it.
I guess it’s safe to
definitely model the Nash,
but I am curious about
this intermediate space where
you’d play against how they’ve
been playing in the past,
recognize that you
need to shift in
some way because they
may shift as well.
>>So, yeah, that’s
a great question.
We actually did not,
so one of the interesting things
about humans playing poker,
is that they’re actually
really good at exploiting.
They are phenomenal at it.
Way better than
computers are currently.
So, we actually did a competition
against them in
2015 where we lost,
and we would sometimes
change the bot that
they were playing
against between days.
Within 50 hands of playing,
they could figure out
how the bot had changed.
So yes, if you could make an AI that could figure out how to do this better than humans, that might be valuable.
But we’re playing against
really talented humans
and we didn’t think that we
could beat them at that game,
but then also why bother
playing that game?
Why bother trying to play
that mind game if we can just
approximate a Nash equilibrium
and guarantee that
we’re going to win.
So, I would argue that in
the two player zero-sum game,
if you want to beat top humans
in a two-player zero-sum game,
the best approach is to just
approximate the
Nash equilibrium because now,
no matter what
they’re going to do,
you’re going to beat them.
Now, I would argue that if
your objectives are
different, so for example,
if you really want to beat up on
a weak player and
exploit them, then yeah,
you don’t want to necessarily
play Nash equilibrium.
You want to adapt to
their weaknesses.
This is challenging
to do correctly,
because if you try to adapt to a weak player's weaknesses, you never know if they're just fooling you.
Like if you’re playing
rock, paper, scissors
against somebody and they
throw rocks three times in
a row, and you say, well,
he’s clearly an idiot who’s
throwing rock every single time,
I’m going to throw
paper next time,
they could just throw scissors.
So, except in special cases, there's no safe way to do that kind of opponent exploitation and still guarantee that you're going to beat top humans in expectation.
So, I think that is
an excellent avenue
for future research,
but I think that in
the two-player zero-sum setting,
where we’re just trying
to beat top humans,
I think this is the better
way to go about it.
So, Unsafe Subgame Solving,
is very risky for
this reason because if you
make an assumption about how
the opponent is playing,
they can always
shift to a different
strategy and take
advantage of that.
So, we must account for the opponent's ability to adapt. Now, that said, in practice, unsafe subgame solving actually works unusually well in poker.
It turns out that if you
just approximate what
the Nash Equilibrium strategy is
and then assume that
the opponent is playing that,
and apply Subgame
solving in this way,
that actually works really
well in this domain.
But we have found situations
where this does not work well,
and I think in
more general settings,
it would not do well.
So, we actually use this in
a few situations in Libratus.
But in general, I would not recommend doing it, unless the domain is specially structured so that it works.
So, in safe subgame solving, the idea is instead that we're just going to estimate the expected values of the opponent's actions in the different subgames, and use that information to determine the optimal strategy for the subgame that we're in.
Now, this works if your
expected values are perfect,
but if they’re not perfect
you’re obviously not going
to compute an exact
Nash equilibrium.
So, it turns out that there’s
room for improvement here.
By the way, this idea has been around for a while; it was first introduced in 2014. But it was never really used in practice, because it didn't give good results in practice, since you don't have perfect estimates. What we came up with is a way to dramatically improve the performance, without giving up any theoretical guarantees, with something called reach subgame solving.
So, here’s an example
of how this works.
This is going to get
a little tricky,
so if you have any questions in
the next few slides
please let me know.
So, let’s say that we have
this slightly larger game now.
It’s still basically
the same structure,
there’s a coin flip that
only player one observes.
Player one takes
a sequence of actions,
they eventually end up in
this choice between selling
the coin or playing,
choosing play.
Now, if they choose play,
player two has to guess
how the coin landed.
Well, let’s say your estimates
are off in this game.
Let’s say we estimate
that for choosing Sell,
they will get an
expected value of minus
one regardless of
which state they’re in.
Well, the best that we can do is just guess 50-50 between heads and tails, which guarantees that they get an expected value of zero for choosing play.
But maybe we can use
information about
the earlier actions
to improve upon this.
So, maybe there is
this earlier action that player
one could have chosen,
if the coin landed heads,
where they could have gotten
expected value of 0.5.
Well, that means that in
the Nash equilibrium,
they would choose that action
and get an expected value of 0.5, and in the tails case, they would come down here, choose play, and get an expected value of zero. So, they're getting an average of 0.25 in this game. But we can now shift our strategy as Player two to guess tails more often, which guarantees that Player one now gets negative 0.5 in the tails case. In the heads case, that means they would get 0.5 for choosing play, but that doesn't really matter, because they're already getting 0.5 from this earlier deviate action. So, we're not really giving up anything in this situation; we're just making ourselves better off, because in the heads case they would take the 0.5 deviate action and never reach this subgame anyway.
This seems really
intuitive but there’s
a problem with this,
which is really subtle.
I’m going to have to
go to a bigger game,
which is going to get
even more complicated
to really illustrate it.
So, here is this bigger game.
It’s still pretty similar,
there's a coin that lands heads or tails with 50-50 probability.
Player one in both cases now,
let’s say has
this deviate action.
In the heads case,
they can get 0.5,
and at tails case,
they get minus 0.5.
Or they can choose to continue,
in which case we run
into this chance node.
This chance node is public.
It just leads to
two different Subgames,
so both players observe
the outcome of this chance node.
It just leads to two different situations that are strategically identical.
It’s an irrelevant chance node,
but it is a chance node.
Then after this chance node,
player one let’s say,
chooses play and we estimate
the expected value of them
choosing play is now zero.
So, let's say we're Player two in this game, and we observe Player one choose play, which means that we are in one of two different situations. Either the coin landed heads, Player one chose to continue, we observed the chance node go left, and then they chose play; or the coin landed tails, Player one chose to continue, we observed the chance node go left, and they chose play. So, we're in either this situation or that situation.
Well, we know that they had this deviate action where they could have gotten an expected value of 0.5 if the coin landed heads.
So, maybe we would say: okay, well, we can increase the expected value of this play action to one, and lower it to minus one in the other case, for example by always guessing tails. That seems okay because, since this situation is only encountered 50 percent of the time, the expected value for this action is now just 0.5, which matches the deviate action, so we're not giving up anything.
Does anybody see the problem
with this? All right.
The problem is, if
that chance node had
gone the other way,
if it had gone right,
we would apply
this same exact reasoning.
We would say: okay, well, we can increase this expected value to one, because we're only encountering this situation half the time, so the expected value of this action goes up to just 0.5, and we're not giving up anything.
But if we apply this reasoning
regardless of which way
this chance node goes,
then what that means is
our strategy is to always guess
tails in both situations.
So, in reality, the expected value in this case is one and in this case is one, which means that the expected value of choosing play is actually one, not 0.5. So, now Player one would be better off choosing to continue instead of choosing the deviate action.
So, what this illustrates is that when you are doing this subgame solving, this real-time reasoning, you can't just look at the expected values of what we call the blueprint strategy, the pre-computed strategy. You have to think about what the expected values would have been if we had entered the other subgame and applied subgame solving there, too. So, that makes things way more complicated.
But, fortunately, with reach subgame solving... by the way, two prior papers had actually discussed this idea of: okay, we're encountering this situation, so let's just increase the expected value here, because the opponent could have gotten a higher expected value earlier on. And they missed this problem, that you have to think about all the subgames the opponent could end up in. So, two prior papers were published about this, and they both got it wrong. Our NIPS 2017 paper recognized this problem and came up with a fix that allows you to do reach subgame solving while still guaranteeing that your exploitability is not going to go up.
The basic idea is to increase the expected value in each of these situations by only 0.5. The actual details get a little more complicated, but they aren't too important for this talk. The point is that you increase the expected values by less, depending on how many subgames the opponent could end up in. You have a question?
>>Well, I was just wondering, I know this is a really simple thing, but is it actually wrong, versus just weird, if you only increase the values for the left side of the public chance node?
>>Yeah, so let’s see
where the situation-
>>You set those
first to explain it?
>>Yeah.
>>It seems like this
is still correct.
It’s just weird because you’re
saying if nature
flips a coin heads,
then I’m going to do
this weird reaction,
and if it’s tails, then I’m not.
>>So, I’ll say this.
If you were actually
only increasing
the expected value
to one in this situation,
and keeping the
expected value at zero
in this situation,
that’s totally fine.
>>Okay.
>>But you have to think about
what would have happened.
Imagine this from player
one’s perspective.
If we would have increased the expected value to one in this situation because the chance node went left, and we would have increased the other expected value to one if the chance node had gone right, then what Player one is thinking is that if they choose this action, then regardless of which way the chance node goes, they're getting an expected value of one.
>>Yeah, yeah. If our algorithm actually did that.
>>Yeah. So, that's what I'm saying: you have to think about what your algorithm would have done, as Player two, in all these other situations that didn't come up.
>>Okay.
>>Okay. So, that’s
Reach subgame solving.
Yeah, so the idea is that for off-path actions, we have to consider how we would have applied subgame solving in all these other situations that didn't come up. We have a solution for this, which is basically to split what we call the slack among the different subgames.
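Here is a toy numeric version of that slack-splitting idea, using the example above; it is only the arithmetic of this one example, not the full reach subgame solving procedure from the paper.

```python
# Player one's options in the example: a deviate action worth 0.5 on heads, versus
# continuing into play subgames whose blueprint value for them is 0.
deviate_ev = 0.5
blueprint_play_ev = 0.0
slack = deviate_ev - blueprint_play_ev   # how much we can safely give back overall
n_subgames = 2                           # outcomes of the public chance node
reach_prob = 1 / n_subgames

def ev_of_continuing(gift_per_subgame):
    """Player one's value for continuing if we give back gift_per_subgame in
    every subgame they could reach (each reached with probability reach_prob)."""
    return blueprint_play_ev + gift_per_subgame

# Naive reasoning: scale the slack by 1/reach_prob inside the one subgame we saw.
naive_gift = slack / reach_prob
print(ev_of_continuing(naive_gift))   # 1.0 > 0.5: unsafe, continuing now beats deviating

# Splitting the slack: give back at most the slack in expectation (0.5 in each subgame).
safe_gift = slack
print(ev_of_continuing(safe_gift))    # 0.5 <= 0.5: no new incentive to deviate
```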
We have a theorem that says reach subgame solving will not have higher exploitability than the safe subgame solving technique it builds on. And if there are these earlier actions where the opponent could have gotten a higher expected value, then it will actually improve performance relative to traditional subgame solving.
In terms of actual, yes.
>>[inaudible].
>>Yeah.
>>Then you’ll have to know
the expected values for
the subgames that are
on the path from there.
>>That’s correct.
So, you look at the whole path, yes.
You look at all the situations
where the opponent could
have gone off path.
>>Yeah.
>>You have to know the expected values for those. Yeah.
>>I understand that
these are values you put.
So, is it correct
when you’re learning?
So, it seems that
you’re allowed to do the
updating some of these
expected values,
but then you are making
this per all [inaudible].
>>Yes, so maybe I should
have clarified this earlier.
So, I'm assuming that we've basically run an algorithm that approximates a Nash equilibrium for the entire game, and that's where these expected values are coming from. We have a rough approximation of the Nash equilibrium for the entire game, and that's giving us expected values for all these different actions, but they're not perfect.
It’s like with
AlphaZero for example.
AlphaZero gives you a policy and values for all the different states that you might be in, but that's obviously not perfect, and you can improve upon it by doing Monte Carlo tree search in real time.
>>What do you mean formally by the safe technique, in terms of exploitability? So, you assume that [inaudible] is off by a certain fixed delta?
>>That’s a great question.
So, by safe subgames solving,
I mean that there is
some exploitability,
our strategy, this
pre-computed strategy that
we have is exploitable
by some amount.
>>Assuming that all of these
are correct or they have-?
>>Well, I’m just saying
that we’ve run this.
Let’s just say we’ve run a
traditional reinforcement
learning algorithm
on the entire game,
no real-time planning, just a pre-computed strategy, and that strategy is exploitable by some amount.
I am saying that we would
like to improve upon
this in practice by doing
real-time planning.
But we want to guarantee
that by doing real-time planning,
we’re at least not
going to increase
the exploitability
of our strategy
relative to what we had before.
Now, in practice, it ends up having way lower exploitability, but we want the guarantee. We can't really guarantee that it's going to decrease in most situations, but we can at least guarantee that it's not going to increase. So, that is what I mean by safe subgame solving.
>>The purple values are estimates from your pre-computed, suboptimal value function?
>>Yeah, those are the values that we've estimated based on our pre-computed strategy for the game.
>>We can use those to
compute the red basically?
>>Yeah.
>>Okay.
>>So, the red is the real-time
planning expected values.
>>So, what’s
the right procedure?
>>For time reasons, I decided not to really talk about how we're actually doing all this computation of the strategy. We use something called counterfactual regret minimization, which converges to an approximate solution in one over square root of T time. So, if you do T iterations, you will get within one over square root of T of the Nash equilibrium. Okay.
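For anyone curious about what that looks like, here is a minimal sketch of regret matching, the core building block of counterfactual regret minimization, running self-play on ordinary rock-paper-scissors; the average strategies converge toward the uniform equilibrium. This is only an illustration of the idea, not the CFR implementation used in Libratus.

```python
import random

# Payoffs to player 1 in rock-paper-scissors (row: p1's action, column: p2's).
A = [[0, -1, 1],
     [1, 0, -1],
     [-1, 1, 0]]

def strategy_from_regrets(regrets):
    """Regret matching: play each action in proportion to its positive regret."""
    positives = [max(r, 0.0) for r in regrets]
    total = sum(positives)
    return [p / total for p in positives] if total > 0 else [1/3, 1/3, 1/3]

def train(iterations=100_000):
    regrets = [[0.0] * 3, [0.0] * 3]
    strategy_sums = [[0.0] * 3, [0.0] * 3]
    for _ in range(iterations):
        strategies = [strategy_from_regrets(r) for r in regrets]
        actions = [random.choices(range(3), weights=s)[0] for s in strategies]
        payoffs = [A[actions[0]][actions[1]], -A[actions[0]][actions[1]]]
        for player in range(2):
            opp = actions[1 - player]
            for a in range(3):
                # Regret of not having played action a against the opponent's action.
                alt = A[a][opp] if player == 0 else -A[opp][a]
                regrets[player][a] += alt - payoffs[player]
            for a in range(3):
                strategy_sums[player][a] += strategies[player][a]
    # The *average* strategy is what converges to the equilibrium.
    return [[s / iterations for s in sums] for sums in strategy_sums]

print(train())  # both players' average strategies approach [1/3, 1/3, 1/3]
```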
>>And how does it scale with the size of the game?
>>It is also linear, yes. So, it's linear in the number of information sets.
>>So, like terabytes?
>>Well, okay. So, with Libratus, we actually used several terabytes of data, and we used millions of core hours of computation time.
>>In real-time?
>>In real time, no. In
real time, it was lower.
>>[inaudible] .
>>So, that was for
the pre-computed strategy.
For real time, we ended up using about 1,000 cores, so about 50 nodes, and the memory footprint is actually really small, probably less than 100 megabytes. We've actually figured out how to improve upon this massively, which I'll get to in a little bit.
>>Is it a 100 like
per core or per?
>>No, it’s the whole thing,
100 megabytes for
the whole thing.
The actual game, when you're solving from the turn onwards, from the third betting round to the end of the game, is pretty small, but because you have to compute an equilibrium for it, it takes a lot of computational power.
>>So, was there a limit
on how much time you
have to make a decision?
>>Yeah, we ended up doing it in
less than 20 seconds or so.
There wasn’t like
an official time limit,
but we gave them some guidelines
on how long it would
take on average.
We also didn't hold the humans to a time limit, so if they wanted to take 10 minutes for a decision, that was fine with us.
I'll also mention something about timing.
So, one of the interesting
challenges of poker is that,
you don't want to give away timing tells, right?
So, if it takes you two seconds
to make a decision,
then the opponent might think you
have a really easy
decision whereas,
if you take two minutes,
and it might be
a difficult decision,
they can figure out
what hand you have.
So, if you look at the World Series of Poker,
it gets really annoying
because, at the final table,
they all take the same amount
of time for
every single decision.
So, they take like two minutes
for every single decision,
even if it’s a
really trivial one.
We didn’t want the
humans to have to do
this, because it would've taken forever and would have pissed them off, and it would have pissed us off,
so we told them flat
out that the bot is not going
to look at timing tells.
So, if they took two seconds
to make a decision,
that’s totally fine, we
won’t change anything.
But we can't expect them to do the same for the bot: if the bot took two seconds versus two minutes, they would pick up on that; they can't not pick up on that, right? So, we had to make the bot take the same amount of time for every single decision it made, to prevent that.
So, that’s also why
it ends up taking
longer to do this thing.
There’s a lot of
decisions that are easy,
but we can’t make that obvious.
All right. So, experiments
on medium-size games.
So, it turns out that our reach subgame solving technique that I just described is about three times less exploitable than prior safe subgame solving techniques. And nested subgame solving, this idea of applying subgame solving repeatedly as you go down the game tree, is 12 times less exploitable than the prior state of the art, which was to just say: well, if the guy bet $60, and in our precomputed solution we have a strategy for if he had bet $50, then we'll round it and treat it as if he had bet $50 instead. That was the previous state of the art. So, nested subgame solving is 12 times less exploitable than that in heads-up no-limit Texas hold'em... sorry, in smaller versions of heads-up no-limit Texas hold'em.
Okay. So, that is one reason why imperfect-information games are hard.
There is a second reason that I want to get to, and I think I still have time to do it.
have time to do it.
This is more recent research.
The second reason is that
states don’t have well defined
values in
imperfect-information games.
So, let me show you
what I mean here.
In a perfect-information game,
or in a single-agent setting, if you take some action and your opponent takes an action, you find yourself at a particular decision point, in a particular subgame.
You don’t solve to
the end of that subgame.
Right? You do what’s called
depth limited reasoning.
So, the remainder of
the game is too large,
so you come up with
a strategy for
the next several actions.
Then, once you get to
a certain depth limit,
you say: okay, I've
looked far enough,
I’m just going to substitute
a value for this leaf node,
and say the value of
this arrangement of
pieces on the board
looks like player one,
white wins with
60 percent probability.
So, you assign a value
to that state,
and then you solve the depth
limited subgame using
those values at the leaf nodes.
It turns out that this does not
work in
imperfect-information games.
I can give you a really
simple example of why.
This is a game that I call Rock-Paper-Scissors+. It's exactly like rock-paper-scissors, except that if either player throws scissors, then the winner gets two points and the loser loses two points. That's just to break the symmetry of the game.
So, the Nash equilibrium in this game is to throw rock and paper with 40 percent probability each, and scissors with 20 percent probability.
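As a quick check, here is a minimal sketch verifying that 40/40/20 is an equilibrium of Rock-Paper-Scissors+: against that mix, every action has the same expected value, so neither player can gain by deviating.

```python
import numpy as np

# Payoffs to player 1 in Rock-Paper-Scissors+ (rows: p1, columns: p2).
# Any win or loss involving scissors is worth 2 instead of 1.
A = np.array([[ 0, -1,  2],   # rock
              [ 1,  0, -2],   # paper
              [-2,  2,  0]])  # scissors

mix = np.array([0.4, 0.4, 0.2])
print(A @ mix)   # p1's EV for each action against the mix: [0. 0. 0.]
print(mix @ A)   # p1's EV, playing the mix, against each p2 action: [0. 0. 0.]
```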
Now, imagine that we're trying to do a depth-limited solve of this game as player one. So, we look one move ahead, and then we're going to substitute the Nash equilibrium value at each of those states, instead of going to the end of the game.
This is the
depth-limited subgame.
It’s really easy to
see that if we were
to try to solve
this depth limited subgame,
there is no way that
we’re going to find
the optimal strategy of
40 percent, 40
percent, 20 percent.
Right? There's just not enough information in this depth-limited subgame to find the Nash equilibrium. Why is that?
Well, it turns out
the reason is because we are
essentially assuming that
beyond this decision point,
player two is going to play
the Nash equilibrium strategy.
Right? That's where we got the 0, 0, 0 from: it assumes that if we had chosen, say, scissors, and player two plays the Nash equilibrium strategy beyond this point, then the expected value is zero.
But in reality, player two's strategy beyond the depth limit depends on what our strategy is above the depth limit.
If we choose rock 80 percent of the time, player two's strategy isn't going to be the Nash equilibrium; it's going to be choosing paper 100 percent of the time, and this is what the state values would look like. Or, if we were to choose paper 80 percent of the time, then they would switch to always choosing scissors, and this is what the state values would look like.
So, in an imperfect-information game, the values of the states at the depth limit depend on what our policy is in the earlier parts of the game.
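Here is a minimal sketch of that point: if the opponent best-responds to whatever policy we use above the depth limit, the "values" of the leaf states change completely from one policy to another, rather than being fixed numbers.

```python
import numpy as np

# Payoffs to us in Rock-Paper-Scissors+ (rows: our action, columns: opponent's).
A = np.array([[ 0, -1,  2],
              [ 1,  0, -2],
              [-2,  2,  0]])

def leaf_values(our_policy):
    """Value of each depth-limit state (one per action of ours) when the opponent
    best-responds to our policy instead of playing a fixed Nash strategy."""
    opponent_payoffs = -(np.asarray(our_policy) @ A)   # opponent's EV per response
    best_response = int(np.argmax(opponent_payoffs))
    return A[:, best_response]                          # our value in each state

print(leaf_values([0.4, 0.4, 0.2]))  # vs the Nash mix every response ties; here [ 0  1 -2]
print(leaf_values([0.8, 0.1, 0.1]))  # opponent switches to paper:    [-1  0  2]
print(leaf_values([0.1, 0.8, 0.1]))  # opponent switches to scissors: [ 2 -2  0]
```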
So, how do we deal with this?
Well, one option is to just
actually make the state values
dependent on our policy,
and say, “Okay.
Well, the value of a state is
a function of the
description of that state,
and our policy for
the entire game.”
Well, that is theoretically correct, but the problem is that it's extremely expensive, I mean absurdly expensive.
The other option is something that, well, you may have heard about this AI called DeepStack.
They condition
the value of a state on
the belief distribution of
both players at that state.
So, they said, okay.
Well, at this decision point,
I’m not going to condition on
the strategy for
the early part of the game,
I’m going to look at all
the different states
that I might be in,
and the probability that I
believe I’m in each
of those states,
which in this case is 80
percent, 10 percent, 10 percent.
In this game, that ends up being exactly the same thing as conditioning on the policy, but in general that's not the case.
The problem is that this is
still extremely expensive.
DeepStack, for example, in heads-up no-limit Texas hold'em,
used 1.5 million core hours of computation, and could not beat prior top AIs.
The other problem is that
the technique currently does not
scale to larger games,
basically games where you have
more states in an
information set.
You can get by with this in heads-up no-limit Texas hold'em because, at any single decision point, in any single information set, there are only about 1,000 different states. But in a game like five-card draw, there could be 5 billion, and in a game like Stratego, it could be 10 to the 20th or something.
So, this would not work
in those larger games.
So, what we do instead is described in a paper we just had accepted at NIPS 2018, called depth-limited solving. We used it in an AI we created called Modicum, and let me walk you through the algorithm here.
The idea is, instead of assuming
that there is a single value
at this depth-limit,
we’re actually going to let
the opponent choose between
multiple different values
for these states.
We create these different values
in an iterative process.
So, we start off by assuming
that player two, beyond
the depth-limit,
is going to play
the Nash equilibrium strategy
that we precomputed.
Then, we solve this depth-limited subgame, which in this case means, let's say, we solve it and our strategy comes out to one-third, one-third, one-third.
Now, we’re going to
calculate a player
two best response.
So, we can say, okay.
Well, if our strategy
here is to play
one-third, one-third,
one-third, player two could
exploit us by always
choosing rock.
Now, we add that best response to
the set of strategies
that player two can choose
at the depth-limit.
So, now, we're going to solve this depth-limited subgame again, and we're going to say: at the depth limit, player two can choose.
They can either choose the
Nash equilibrium
strategy that we had
before or they can
choose this best response
that we just calculated,
which was always play rock.
They make this decision at the information-set level: because all of these states share an information set, they can't say, "Okay,
well, in this state, I’m
going to choose this policy,
and in this state I'm going to choose this policy."
They have to make
the same decision at
each of these different
states that share
an information set.
So, we solve this
depth-limited subgame again.
Then, we again calculate a player
two best response
to that strategy,
and then we add
that strategy to the set of
best responses that player
two can choose at
the depth-limit.
We can repeat this process
as many times as we want.
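Here is a minimal sketch of that loop on Rock-Paper-Scissors+, under simplifying assumptions of mine: the game is only one move deep, so each of the opponent's "continuation strategies" is just a response to our action, and each restricted game is solved with plain regret matching. It is meant to show the iterative structure, not to reproduce the Modicum implementation.

```python
import numpy as np

# Payoffs to us in Rock-Paper-Scissors+ (rows: our action, columns: opponent's).
A = np.array([[ 0., -1.,  2.],
              [ 1.,  0., -2.],
              [-2.,  2.,  0.]])

def solve_restricted(payoffs, iters=20_000):
    """Approximate equilibrium of a small zero-sum matrix game via regret matching.
    payoffs[i, k]: our payoff when we play i and the opponent picks column k."""
    n, m = payoffs.shape
    regrets = [np.zeros(n), np.zeros(m)]
    strategy_sums = [np.zeros(n), np.zeros(m)]
    for _ in range(iters):
        strats = []
        for r in regrets:
            pos = np.maximum(r, 0)
            strats.append(pos / pos.sum() if pos.sum() > 0 else np.full(len(r), 1 / len(r)))
        ev_rows = payoffs @ strats[1]      # our EV for each of our actions
        ev_cols = strats[0] @ payoffs      # our EV for each opponent column
        value = strats[0] @ ev_rows
        regrets[0] += ev_rows - value      # we regret not playing better rows
        regrets[1] += value - ev_cols      # the opponent regrets columns that pay us less
        strategy_sums[0] += strats[0]
        strategy_sums[1] += strats[1]
    return strategy_sums[0] / iters        # our average strategy

def depth_limited_solve(blueprint_opponent, outer_iterations=8):
    """Iteratively grow the set of opponent continuation strategies at the depth
    limit, starting from the precomputed (blueprint) strategy."""
    opponent_set = [np.asarray(blueprint_opponent)]
    our_strategy = None
    for _ in range(outer_iterations):
        # One column per continuation strategy the opponent may pick at the depth limit.
        restricted = np.stack([A @ sigma for sigma in opponent_set], axis=1)
        our_strategy = solve_restricted(restricted)
        # Opponent best response over *all* pure continuations, not just the current set.
        br = np.eye(3)[int(np.argmin(our_strategy @ A))]
        if not any(np.allclose(br, s) for s in opponent_set):
            opponent_set.append(br)
    return our_strategy

print(depth_limited_solve([0.4, 0.4, 0.2]))   # approaches [0.4, 0.4, 0.2]
```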
Now, some details
about this technique,
it might seem like this is really expensive and might not give good performance, because we can't add a million different strategies for player two to choose from at the depth limit. But it turns out that, because they are making this choice separately at each information set, even if we only give them a choice
between 10 different strategies,
beyond the depth limit,
if they are making
that choice individually
at 100 different
information sets,
they’re actually
choosing between 10 to
the 100 different strategies
for the entire remainder
of the game.
So, it actually
grows very quickly.
The other thing for player one,
I talked about
what player two does,
so player two is choosing
between these different
strategies that they
could play for
the remainder of the game,
and for player one, which is us, we're going to assume we're playing according to the approximate Nash equilibrium strategy for the remainder of the game.
The set of player
two’s strategies
is precomputed; it's not determined in real time. We could determine it in real time, but that would be too expensive in practice.
Okay. So, this is what
performance looks like if
we do this depth-limited
solving thing.
So, on the x-axis here,
we have the number of values per
leaf node that the opponent
can choose between
at the depth limit,
and on the y-axis, we have exploitability measured in milli-big-blinds per game. This is on a simplified version of no-limit Texas hold'em that only has two betting rounds instead of four. You can see that if we assume each state has a unique value, which is essentially the Nash equilibrium value, like we would in a perfect-information game, exploitability is extremely high. But as we add more strategies for the opponent to choose between at the depth limit, exploitability drops off very quickly. In fact, with 16 different choices, we're essentially at the Nash equilibrium.
So, the beautiful
thing about this. Yes.
>>So why is it that, with one value, [inaudible] blue will be the same?
>>No. Because actually blue,
sorry, so for blue here,
it’s not doing any
real-time reasoning,
it's doing this thing where, if they had bet $60, I'm going to round that to $50. So, red is actually doing real-time reasoning, but it's assuming that each state has a well-defined, unique value.
>>So, [inaudible] generation or frequency
or something, so what [inaudible].
>>There are a lot of
similarities between this
and things like
double oracle methods and
things like that. Yes.
All right. So, in terms of
head-to-head performance,
the really cool thing about
this technique is that
it allows us to make
a really strong poker AI
using very few resources.
To give you some
perspective, we had
this bot called Tartanian8
that we made in 2016,
which won the Annual
Computer Poker Competition,
which is a competition among poker AIs. It used two million core hours of computation, 18 terabytes of memory, and no real-time reasoning.
Then there's Slumbot, which won the 2018 competition; that wasn't us, it's a different bot. It used 250,000 core hours, two terabytes of memory, and no real-time reasoning. Modicum, which is the bot that uses this depth-limited solving, used just 700 core hours and 16 gigabytes of memory, plus real-time planning with a 4-core CPU in under 20 seconds per hand, and it beats both of those other bots.
So, to put this in perspective even further, Libratus, which is the AI that we played against the top humans, used millions of core hours, I think something like two million or five million, probably 20 terabytes of memory, and played in real time using 1,000 cores.
So, we're able to get what is probably superhuman performance, we haven't actually tested this one against humans, but I'm pretty sure it's a superhuman poker AI, using the resources of a laptop. In fact, since I published this paper, I've put another paper on arXiv where we figured out how to make this three times faster, so it could probably run on a smartphone now. Yeah.
>>Where is the [inaudible]
you can just run.
>>It turns out that
there is a huge
amount of variance in poker.
>>Yes.
>>Because we’re doing
real-time reasoning
and we’re taking
20 seconds per hand,
the variance is massive.
In fact, we trained this using 700 core hours, but it took us like a million core hours to actually compute all these results.
So, this has been a problem
in the entire field,
that the variance is just absurd.
So, this is a graph of
head-to-head performance.
Here we have Libratus, which beat top humans.
Here is Modicum
which is the AI we
just created that uses
way fewer resources.
Here are some other
benchmark bots.
A bot from 2016,
a bot from 2014.
Here is DeepStack, which is a bot from the University of Alberta, which actually has very low exploitability.
But in terms of head-to-head performance, it didn't end up being that strong.
It also uses this
real-time reasoning as well.
Though a different form of it.
All right so the
key takeaways, yes.
>>You said that one has low exploitability, but it's not that strong?
>>In terms of
head-to-head performance
it’s not as strong.
So, in terms of
head-to-head performance
it actually doesn’t
beat prior benchmark bots.
>>Yes, I guess
that’s curious to me,
because if you’re not
exploitable. Okay so-.
>>When I say low exploitability,
I mean just relative
to the past bots.
So, the exploitability could still be extremely high; we actually don't know. We can't calculate exploitability exactly in heads-up no-limit Texas hold'em, so we don't know. But it appears to have lower exploitability compared to the absurdly high exploitability of the previous bots. Yes.
So, key takeaways. In
real-time planning,
you always have to consider how
the opponent can adapt to
changes in your policy.
That is something that is really important in imperfect-information games. In perfect-information games, you can mostly ignore it, but not completely. Imperfect-information subgames cannot be solved in isolation. And states in imperfect-information games do not have well-defined values.
All right. So, I
have done some other work
that I did not discuss.
One thing I did not discuss, well, I guess I talked about it briefly, is how we actually solve these games.
We use an algorithm
called counterfactual
regret minimization,
which has been around
for about 10 years now.
It works extremely well, even though the theoretical guarantee on convergence is only one over square root of T. I just had a paper that I released,
had a paper that I released,
where I developed
a new form of CFR which
beats the prior state-of-the-art
by a factor of three.
So, I’m really
excited about that,
and that’s going to be used
in all the future research.
I have some work
on pruning in CFR.
It turns out that with CFR you end up exploring the entire game tree, which is a big waste,
because a lot of actions are
suboptimal and you don’t
want to waste time
coming up with what
you should do if
you play a really
crappy poker hand,
because in reality you
just fold it right away.
So, I have a pruning technique that provably reduces the computation and memory costs of running CFR asymptotically. In practice, it speeds things up by an order of magnitude or two.
I also have a paper on
determining the optimal actions
in a continuous action space.
So, one of the interesting things
about no limits exist hold
them is that you have
this continuous action space
where you can choose to bet
any amount between $100 and $20,000.
The truth is that, what
we do right now is
just domain-specific
abstraction techniques,
where we say okay, well,
you probably just want to
bet either half the pot
or one times the pot
or two times the pot,
and it doesn’t really matter if
you are betting 0.6 times
the pot or 0.5 times the pot.
But that relies on
domain knowledge that we know
the optimal bet fractions
are roughly in that range.
So, it would be nice to have an algorithm that doesn't rely on domain knowledge and can actually determine the bets to use without any human knowledge.
So, that's what this 2014 paper does,
and hopefully we’ll have
some follow-up work on that in
the future. For
future directions.
So, I mentioned before,
a lot of this work has been
looking at poker as a domain,
and that’s because
it takes a lot of
infrastructure to actually build
up to do experiments
on large games.
So, we have a lot of
expertise that has been
developed over the years
on how to run these techniques
efficiently on a game like poker.
And if we wanted to test
on another large game,
it would take years to
build up that expertise
on how to do experiments
on those other games.
There's also the issue that we can't really compare performance to other bots in other games, because there are no other games where there are competitive bots.
But I would love to
move beyond poker,
in a few different directions.
One is, we have
these techniques for
perfect information
games like AlphaZero,
and we have these techniques for
imperfect-information
games like poker,
and it’ll be nice to
bridge the gap and find
a single algorithm that works
really well in
all these different games.
Another thing that I’m
really interested in is
going beyond two-player
zero-sum games.
So, I mentioned that if you try to move to general-sum games, a bunch of theoretical challenges pop up.
So, right now, we
don’t know how to
cope with those challenges.
But dealing with
general-sum games is really
important because
most real-world situations are general-sum and not zero-sum, except for maybe military interactions or security.
So, in particular, working on something like negotiation,
I think is a really
interesting line of research.
In general moving things more
towards real-world domains,
I think we’re at the point
right now where we
can actually start bringing these
techniques into the real world,
and I think that’s
going to be a really
interesting line of
research as well.
All right. So, I’ll stop
there and I’ll take
some last minute
questions thank you.
>>Yes so, I guess,
in the real world often you don’t
know the rules of the game,
you don’t know them in advance,
[inaudible] situation.
>>Right.
>>But you do observe
the outcomes of players.
Have you thought about trying to, well, what works when you go into a situation where you don't know the rules of the game, but you can observe the outcomes of the players?
>>That’s a good question so yes.
So, all of our work
assumes that we
have an accurate model
of the world or the game
that’s being played.
I think a lot of these techniques
will carry over if you
want to try to figure
out the structure of
the game as you go.
There was actually a paper
from another student at
CMU recently on this problem.
So, people are starting
to look in this
direction, I have not.
But it’s something that I
think is very interesting.
It is also a way harder problem,
because to figure out
what information
your opponent knows
or figuring out what they
could know is a really
difficult challenge.
I’m not sure how to
go about that. Yes.
>>I think in
many applications in
something like
reinforcement learning.
So, right now I can imagine splitting the environment into the portion that I can model, which would be like chess moves, and then the parts that I don't necessarily want to model, because I'm very risk-averse. Those could become like adversarial moves, and I want to be robust to them. Have you thought about how well these techniques would apply there, or is that completely out of the ballpark?
>>To be honest, I
never really considered
that direction for the research.
I think there’s a lot
of potential here.
This is an area of
research that I think has
been overlooked by
a lot of people.
So, it’s actually been
a very small community
that has been working
on the space,
and I think people are starting to appreciate that it can be applied to a lot of different things.
We’re starting to
see papers on how
to apply something like
counterfactual regret
minimization to
more mainstream reinforcement
learning topics.
So, I think there was a paper
from Berkeley recently on
regret minimization in
single-agent settings.
So, I think there is definitely potential to extend this research to more traditional reinforcement learning settings,
but I have not looked
into that yet.
>>[inaudible] a little bit, is there anything on learning? So, in the real world, one of the things that's really difficult is, I don't know the rules of the game, and I often have no idea what my payoffs are. Is that something that people have looked at? Trying to simultaneously improve my own payoffs and trying to get a model of what's going on with my opponent.
>>Yes. I guess that
would be kind of a subset
of the previous case
we discussed,
where it would be like, maybe you know the structure of the game, but you don't know what the opponent's payoffs are.
I think this has
been looked at in
the game theory community,
but more in simple cases, not in large-scale computer science settings. So, I could be wrong about that; I don't know it too well.
>>As you said I should know
the game theory, yes sure.
So, there are models where
there’s a point in time but it’s
unknown but I parameterized
the function, sure.
>>Yes. So, I don’t know
of any large-scale work
on that sort of thing,
but I think it would be a lot easier to extend this work to that domain than to one where you don't know the rules of the game at all. Okay.
>>Thank you.
