# AI for Imperfect-Information Games: Beating Top Humans in No-Limit Poker

>>I'm going to talk about AI for large imperfect-information games, in particular how we built an AI that beat top humans in no-limit poker.

Okay. So, for starters, this talk is going to be about imperfect-information games in general. I'm not going to talk about perfect-information games like chess or Go. It will be applicable to poker, but also, more generally, to any strategic interaction that involves hidden information, for example, security interactions or negotiations. I think this is really important for bringing AI into the real world, because the truth is most real-world strategic interactions involve some amount of hidden information.

So, when it comes to these games, poker has served as the primary benchmark challenge going back decades. In fact, if you look at the original papers on game theory, pretty much the only application they talk about is poker, because it so accurately captures the challenge of hidden information. In particular, a variant of poker called heads-up no-limit Texas hold'em has emerged as the primary benchmark for these games.

Heads-up no-limit Texas hold'em is a massive game: it has about 10^161 different decision points. It is also the most popular variant of poker in the world. For example, no-limit Texas hold'em is the game that is played at the World Series of Poker main event; every year the winner is determined by heads-up no-limit Texas hold'em. It's also featured in popular movies about poker, for example, Casino Royale and Rounders. In some ways, you could argue it's the purest form of poker. That's subjective, but it is a very strategic game: whether you win or lose is entirely up to your skill. It's not up to the other players at the table, except for your opponent, I guess. So, there are no kingmaker effects, for example. And no prior AI had been able to beat top humans in this game. That is, until 2017.

So, in 2017, we organized something called the Brains vs. AI challenge. We created an AI called Libratus, which we played against four of the world's best heads-up no-limit Texas hold'em specialists. These are all people that make about seven figures per year playing this game online. We played 120,000 hands of poker over the course of 20 days, and there was a $200,000 prize pool divided among the pros to incentivize them to play their best. So, they weren't risking money, but how much money they won depended on how well they did relative to the other players.

So, obviously, even if you're familiar with poker, you might not have heard of these pros. So, I wanted to say a word about how strong they are, because it really is important to play against the top players. Unfortunately, there are no objective rankings of professional poker players. But like I said, these are all players that make millions of dollars a year. In fact, here's a question from the poker subreddit, where somebody was asking how good these players we were playing against are. Somebody responded, "These players will absolutely trounce all the 2000s heroes that you might have heard of. The heroes from the 2000s would be Division III college players, whereas these guys are all-star-caliber pros." So, this is a pretty accurate description, I would say. There is a big skill difference between the pros that you see on ESPN and these guys, who actually play this game for a living. The guys you see on ESPN are basically celebrities. These guys are the ones that actually make a living playing this game.

The final result is that Libratus beat the humans in this game by a lot. The victory margin was 147 mbb/game, a measurement of win rate in poker (thousandths of a big blind won per hand), which, unless you are an actual poker player, doesn't mean much. But to give you some perspective, this is about three times the win rate of a top pro versus an average pro. It was statistically significant at about four standard deviations, and each human lost individually to the AI.

This was a big surprise to everybody. In fact, when we announced the competition, there was a betting market on the outcome, because it's the poker world, and obviously they like to gamble on these things. When we first announced that we were going to do this competition, the betting odds were four to one against us. In fact, even after we won on the first day, the betting odds were still two to one against us. I think it wasn't until about the third day that the betting odds were even, and by the eighth day, you couldn't even bet on the outcome of the competition anymore. You could just bet on how much each human would lose on each individual day, because it was clear at that point that the AI was going to win. In fact, even if you had asked us, we were not very confident that we would win. I put our odds at about 60 percent, maybe 65, but I didn't think we would win by as much as we did.

Actually, after this competition, we did another competition against Chinese pros. Basically, Kai-Fu Lee in China called us and said, "We would like you to do another competition in China against Chinese players. We will broadcast it; it would be a lot of fun." We were like, "Well, why should we do this? We just played against the top humans. These Chinese players are not as good." He said that he would pay us. So, we said, "Okay, great." So, we played 36,000 hands against six Chinese players. We beat them by even more than we beat the top humans in America. That was actually a huge hit in China. It was watched live by millions of people during that competition. They had really nice production, where you could see a poster like this. It was way better than what we did in America.

All right. So, why are imperfect-information games so hard? After all, we have AIs that can beat humans in games like chess, and we have AIs that beat humans in Go. In fact, you might have heard recently about AlphaZero, which is essentially superhuman in chess, Go, and shogi, all using the same algorithm. So, what is it about imperfect-information games that is so difficult? One of the major challenges, not the only one, but one of the major ones, is that in an imperfect-information game, the optimal strategy for a subgame, for part of the game, cannot be determined in isolation. It cannot be determined using information in just that subgame alone. So, let me show you what I mean.

Before I get to that: deep learning has taken a lot of the credit recently for the breakthroughs in AI. Actually, our AI did not use any deep learning, no deep learning at all. But I would also argue that a big reason why all these AIs are superhuman in various games like chess, Go, even backgammon, is because they use real-time planning. The planning component is huge. AlphaGo, for example, used Monte Carlo tree search; Deep Blue used alpha-beta pruning. In fact, if you look at AlphaZero without real-time planning, I guess this is washed out on the slide, but it ends up being right around there without Monte Carlo tree search at play time, and top human performance is right around here. So, in fact, without Monte Carlo tree search, AlphaZero is not superhuman. The tree search gains it about 2,000 Elo. So, real-time planning is really important, not just in Go, but also, it turns out, in poker. Figuring out how to do real-time planning is actually the key breakthrough that allowed us to beat top humans. But it turns out that in poker, real-time planning ends up being way harder.

So, in perfect-information games, you take some action, your opponent takes some action, and you find yourself in a particular subgame. Now, you can forget about everything that came before, all the other situations you did not encounter. The only thing that matters is the situation that you're in, and the situations that can be reached from this point on. For example, if I were to show you this chess board, you don't have to know how we ended up in this situation; you don't have to know about the Sicilian Defense or the Queen's Gambit. You can just look at this board, and if you're white, you can say, "Okay, well, if I do a search, I can see that if I move my white queen there, then it's checkmate, and the game is over. So, I should just do that." You don't have to know anything about the strategy of chess. But in imperfect-information games, if you take some action, and your opponent takes some action, and you find yourself in a particular subgame, now some other subgame that you are not in, and in fact might not even be able to reach from this point on, can affect what the optimal strategy is for the subgame that you are in. This is counterintuitive, but I'm going to give you a concrete example in a little bit that illustrates this.

Now, before I get to that, I want to talk a little bit about what our goal is in these games. Our goal is to find a Nash equilibrium, which in two-player zero-sum games is the same thing as a minimax equilibrium. I won't get too technical about the definition, but basically, in a two-player zero-sum game, if you're playing the Nash equilibrium, you are guaranteed not to lose in expectation. Now, it's not always easy to find a Nash equilibrium, but one is always guaranteed to exist in a finite two-player zero-sum game. So, for example, in rock-paper-scissors, the Nash equilibrium is to mix randomly between rock, paper, and scissors with equal probability, because if you do that, then no matter what your opponent does, you will not lose in expectation. Now, in rock-paper-scissors, that also means you're not going to win in expectation. But in a complicated game like poker, where there are a lot of suboptimal actions that aren't actually played in the Nash equilibrium, it's likely that your opponent will make mistakes, and in practice you will end up winning as well.

Yes?

>>How important is it that the game is heads-up? If I compare this to, say, a game with about seven players, how do these techniques hold up?

>>That is a great question. I'll get to this later, but let's talk about it now. In poker, it doesn't really matter: if you were to use these same techniques for six-player poker, you would almost certainly win. That said, in general, poker is a special game because, and I don't know if you play poker, there are two special things about it. One is that it's really hard to collaborate with other players. You can't say, "Hey, let's team up against this other person at the table." In fact, if you tried to do that, it would be against the rules of poker. The other thing that's unique about poker is that people fold. So, even if you have six players at the start of the game, it very quickly comes down to two players, because people fold. So, you can use these techniques that are only guaranteed for two-player zero-sum games, and they will just work in six-player poker. But a big challenge is extending these techniques to other games that do allow for collaboration, and we don't really have a good approach for those games yet. So, for now, I'm just going to assume that we're working in the two-player zero-sum setting; it does extend in some cases to other situations as well.

So, our goal is to find an approximate Nash equilibrium. We're going to measure performance in terms of exploitability. You can think of it as distance from a Nash equilibrium: it's how much worse we would do against a worst-case adversary relative to if we had played a Nash equilibrium instead. In other words, how exploitable are we?
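To make that concrete, here is a minimal sketch, my own example rather than anything from the talk, of exploitability in a two-player zero-sum matrix game, using rock-paper-scissors:

```python
# Minimal sketch (my example, not from the talk): exploitability of a strategy
# in a two-player zero-sum matrix game, using rock-paper-scissors.
import numpy as np

# Payoff matrix for the row player; rows/columns are (rock, paper, scissors).
A = np.array([[ 0, -1,  1],
              [ 1,  0, -1],
              [-1,  1,  0]], dtype=float)

def exploitability(row_strategy):
    """How far below the game value (0 for symmetric RPS) the row player falls
    against a best-responding opponent."""
    payoff_per_column = row_strategy @ A   # row player's payoff vs. each pure response
    return 0.0 - payoff_per_column.min()   # worst case over opponent best responses

print(exploitability(np.array([1/3, 1/3, 1/3])))    # 0.0  -> exact equilibrium
print(exploitability(np.array([0.5, 0.25, 0.25])))  # 0.25 -> opponent plays paper
```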

I would argue that exploitability is actually extremely important and has been overlooked in the AI community as a whole. I think two recent man-machine matches really highlight this. One is the OpenAI one-versus-one Dota 2 matches that you might have heard about, and the other is Fan Hui versus AlphaGo.

In the OpenAI matches, they made an AI that was able to beat top humans in one-versus-one Dota 2 over three games. But after they won against the top human, they actually opened it up to the public and invited random mediocre players to play against it, to see if they could find any weaknesses. In fact, pretty quickly, within a few thousand games, weak players were able to find certain tricks with which they could basically fool the AI; they figured out how to exploit it and beat it.

Also, in Fan Hui versus AlphaGo: AlphaGo famously beat Fan Hui 5-0. But then, after they published the Nature paper, they invited him to play several more matches against it, to see if he could find any weaknesses in the AI. In fact, he was able to find weaknesses that let him consistently beat the AI, and they had to patch this. So, I think what this really demonstrates is that it's not enough to beat top humans in three or five or even ten games. You really have to be able to consistently beat top humans, especially if you want to deploy an AI into the real world.

If you're Microsoft and you're trying to deploy products with real users, and there are millions or billions of them, then if there's a weakness, they're going to find it. But with the [inaudible], we played the top humans not just over three or five hands of poker; we played them over 120,000 hands of poker across 20 days. That whole time, all four players were working as a team to try to exploit the AI in any way they could find.

In fact, I had lunch with one of the players just a couple of months ago. He said that the thing they found most shocking about the competition is that, at the end of each day, we gave them a log of all the hands that were played, and we told them what the bot had on each hand. This is a big deal, because in poker, a big part of the game is keeping your strategy hidden. If you fold, your opponent doesn't see what cards you have. In fact, even if you don't fold but you lose the hand, you still don't have to show what your cards were. So, you only see your opponent's hand about 20 or 25 percent of the time. Poker players will sometimes even call just to see what their opponent had. But here, we were just giving them that information. We were telling them what the bot had on every single hand that it played. So, they didn't have to worry about that part at all, and they found it absolutely amazing that they could not figure out how to exploit the AI, even though we were showing them the hands that the bot was playing every single time, and the bot's strategy wasn't really changing that much between days.

All right. So, I think exploitability is extremely important, and I think it has been overlooked by the AI community; it's something that the imperfect-information game-solving community has focused on throughout its existence.

All right, so now, I want to get to the example of why imperfect-information games are hard. I'm going to talk about a simple example that I call Coin Toss. It starts with a coin flip: a coin is flipped and lands heads or tails with 50-50 probability. Player one is going to observe the outcome of the coin toss; Player two is not. So, after the coin lands, Player one has a choice: they can either sell the coin, or they can choose play. We'll say that if they choose sell, it leads to some separate subgame, the details of which are not important. The only thing that's important is the expected value.

So, we'll say that if the coin landed heads, then the coin was lucky, and Player one can sell it for $0.50. On the other hand, if the coin landed tails, we'll say it's unlucky, and Player one loses $0.50 by selling it. Alternatively, they can choose play, and if they do, it leads to Player two, who then has to guess how the coin landed without having observed how it actually landed. If Player two guesses correctly, that is, they guess heads and the coin actually landed heads, then Player one loses one dollar and Player two gains one dollar. Here, the payoffs are shown for Player one; because this is a two-player zero-sum game, Player two just receives the opposite payoff.

On the other hand, if Player two guesses incorrectly, that is, they guess tails and the coin actually landed heads, then Player one gains one dollar and Player two loses one dollar. You can see there's a dotted line between Player two's two nodes. This signifies that Player two is in what's called an information set: because they did not observe how the coin landed, they do not know which of those two states they are actually in.

So, now imagine that you are Player two in this game. You've just observed Player one choose the play action, so you know that you are in this imperfect-information subgame. What should you do? Should you guess heads, or should you guess tails?

One option is to just always guess heads. But if you do that, that's obviously a really bad strategy, because now Player one can just sell the coin when it lands heads and get $0.50, and choose play when the coin lands tails and gain a dollar. So, on average, they're getting $0.75. On the other hand, you could always guess tails, but that's also a really bad idea, because now Player one can choose play when the coin lands heads and gain a dollar, and choose sell when the coin lands tails and lose $0.50, which is better than losing a dollar. So, on average, they're still getting $0.25 in this game.

So, it turns out that the optimal strategy is to mix: guess heads with 25 percent probability and tails with 75 percent probability. If you do that, then no matter what Player one does, the best they can do is break even, getting on average zero dollars in this game. So, this is the Nash equilibrium strategy for Player two in this game, at least for this subgame.
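As a sanity check, here is a tiny script, my own, using the payoffs just described, that computes Player one's best-response value against any guessing mix for Player two:

```python
# Sanity check (my sketch) of the coin-toss example, using the payoffs above.

def p1_value(p_heads, sell_heads=0.5, sell_tails=-0.5):
    """Player one's best-response value when Player two guesses heads w.p. p_heads."""
    play_heads = -1 * p_heads + 1 * (1 - p_heads)   # play EV holding heads
    play_tails = +1 * p_heads - 1 * (1 - p_heads)   # play EV holding tails
    # Player one takes whichever of sell/play is better in each state.
    return 0.5 * max(sell_heads, play_heads) + 0.5 * max(sell_tails, play_tails)

print(p1_value(0.25))  # 0.0  -> the 25/75 mix: Player one can only break even
print(p1_value(1.00))  # 0.75 -> always guessing heads, as computed above
print(p1_value(0.00))  # 0.25 -> always guessing tails, as computed above
```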

But now, let's say we change the game a little bit. Let's say we change the payoff for the sell action, so that now, in expectation, Player one loses $0.50 for choosing sell when the coin lands heads, and gains $0.50 for choosing sell when the coin lands tails. Well, it's pretty easy to see that, as Player two, your strategy in this subgame should now change as well. Now, you should be guessing heads with 75 percent probability and tails with 25 percent probability.

But you can see what's happened here: by changing the expected value of the sell action, we have affected what the optimal strategy is in the play subgame, even though the sell action is not part of the play subgame and, in fact, is not even on the path leading to the play subgame. This is something that happens in imperfect-information games; it does not happen in perfect-information games. In perfect-information games, if you want to determine the optimal strategy in a subgame, you only need to look at that subgame by itself. But in imperfect-information games, you have to look at the game as a whole. So, you can think of perfect-information games as a special case where you don't have to worry about all this stuff; imperfect-information games are the more general case where this is a problem.

So, what do we do? Well, it turns out that we don't actually have to know the strategy for the entire game as a whole. I mentioned that the sell action leads to a subgame where both players might take actions. But you don't have to worry about that: the only thing that really matters for determining the optimal strategy in this play subgame is the expected value of Player one choosing sell. So, what we can do is try to estimate what that value is to Player one, and if we have that, then we can determine the optimal strategy in the play subgame.
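Here is a toy version of that idea, my own construction rather than the production algorithm: given estimated values for Player one's sell alternative, pick Player two's mix so that no Player-one type prefers entering the subgame over its alternative.

```python
# Toy safe subgame solving on the coin-toss game (my sketch, not Libratus's
# code): anchor the re-solve on the estimated values of the opponent's
# alternative (sell) action, and maximize the worst-case margin.
import numpy as np

def solve_play_subgame(sell_ev_heads, sell_ev_tails):
    best_p, best_margin = None, -np.inf
    for p in np.linspace(0, 1, 10001):   # p = P(guess heads) for Player two
        play_ev_heads = 1 - 2 * p        # Player one's play EV holding heads
        play_ev_tails = 2 * p - 1        # Player one's play EV holding tails
        # Margin: how much each type prefers its alternative to entering.
        margin = min(sell_ev_heads - play_ev_heads,
                     sell_ev_tails - play_ev_tails)
        if margin > best_margin:
            best_p, best_margin = p, margin
    return best_p

print(solve_play_subgame(+0.5, -0.5))  # ~0.25, the equilibrium mix from above
print(solve_play_subgame(-0.5, +0.5))  # ~0.75, flipping with the sell payoffs
```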

So, that's what we actually did in [inaudible]. We also have a theorem that says: if this estimate is within delta of the true Nash equilibrium value, then we can solve the play subgame and get within delta of the Nash equilibrium.
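In symbols, the guarantee is roughly the following (the notation here is mine, a loose paraphrase of the statement in the talk, not the paper's exact theorem):

```latex
% Loose paraphrase of the guarantee (notation mine): if every estimated
% alternative value \hat v(a) is delta-accurate, the re-solved subgame
% strategy is within delta of equilibrium.
\big|\hat v(a) - v^{*}(a)\big| \le \delta \ \ \text{for each alternative action } a
\quad\Longrightarrow\quad
\mathrm{exploitability}(\sigma_{\text{re-solved}}) \le \delta
```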

So, in [inaudible], we actually do this. We have this massive game, which is simply way too large to solve upfront. So, we come up with a really good strategy just for the early part of the game, and we estimate what the optimal strategy is and what the expected values are in the later parts of the game. Then, when we're actually playing and find ourselves in a particular subgame, we come up with a much better strategy for that particular subgame, using information about the expected values from the other subgames. Then we repeat this process: we come up with a really good strategy for the early parts that are coming up and just estimate how to play in the later parts; when we find ourselves in a later subgame, we again compute a much better strategy for that particular subgame, using information about the expected values of the other subgames.

That's called nested subgame solving. This was the key breakthrough that allowed us to beat top humans. So, when I, yes?

>>Just [inaudible].

>>Yes, that's a great question. What I described is how we would do this in general. But in poker, there are four betting rounds, and we solved the first two with a precomputed strategy, because each round grows exponentially in size, so the first two rounds are actually pretty small. When we got to the end of the second betting round, that's when we applied subgame solving, coming up with a much better strategy for the remainder of the game. We abstracted the bets: rather than considering all 20,000 different bet sizes, we would just consider a small fraction of them. Then, each time the opponent acted, each time they made a bet, we would solve a new subgame for that bet size. So, we would apply this recursive solving every time the opponent made an action beyond the second betting round.

So, when I mention this idea of what's called safe subgame solving, where we use the expected values from the other subgames, people always ask about this thing called unsafe subgame solving, which is the more intuitive approach. The idea here is: why don't we just estimate what the opponent's strategy is? Let's say we've played a bunch of hands against them, or we can estimate what the Nash equilibrium is for them, and we've figured out that they should be choosing play 80 percent of the time when the coin lands heads and 30 percent of the time when the coin lands tails, just for example.

Now, if we assume the opponent is playing this strategy, we can reason about the distribution of states that we might be in, and then solve optimally using that distribution. It turns out that doesn't work, so let me give you an example of what this would look like. When the coin lands, either heads or tails, we reason that we're in one of these states with 50-50 probability. Now, if we observe Player one choose play, we would say: okay, in a Nash equilibrium, we would expect Player one to choose play 80 percent of the time if we were in the heads state, and 30 percent of the time if we were in the tails state. So, we update our belief about what state we're in using Bayes' rule, and now we can reason that we're in the heads state with 73 percent probability, and in the tails state with 27 percent probability.
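That posterior is just Bayes' rule with the slide's numbers; here is a quick check (the code itself is mine):

```python
# Bayes'-rule update from the unsafe-subgame-solving example (numbers are the
# ones on the slide; the check itself is my sketch).
prior = {"heads": 0.5, "tails": 0.5}
p_play = {"heads": 0.8, "tails": 0.3}   # assumed P(Player one plays | state)

joint = {s: prior[s] * p_play[s] for s in prior}
total = sum(joint.values())
posterior = {s: round(joint[s] / total, 3) for s in joint}
print(posterior)   # {'heads': 0.727, 'tails': 0.273} -> the 73/27 on the slide
```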

Now, we would just say: if we assume this distribution is correct, then the optimal strategy is to always guess heads. But we've already established that that's a really bad idea, because now the opponent can simply shift to selling the coin when it lands heads and choosing play when it lands tails. So, the problem with this approach is that we're making an assumption about how the opponent is playing. If it were true, if this distribution were right and they were choosing play with 80 percent probability on heads and 30 percent probability on tails, then yes, we could apply this reasoning. But the opponent's strategy can always change, and it can change adversarially to us. Yes?

>>There's one thing that I've always been curious about, when you play the Nash versus when you play against the opponent. It seems like they're not going to shift immediately. Even if you're playing the wrong strategy, they won't exploit it right away; they have to learn to exploit it. I guess it's safe to just model the Nash, but I am curious about this intermediate space, where you play against how they've been playing in the past, and recognize that you may need to shift in some way because they may shift as well.

>>So, yeah, that's a great question. One of the interesting things about humans playing poker is that they're actually really good at exploiting. They are phenomenal at it, way better than computers currently are. We actually did a competition against them in 2015, which we lost, and we would sometimes change the bot that they were playing against between days. Within 50 hands of play, they could figure out how the bot had changed. So yes, if you could make an AI that could figure out how to do this better than humans, that might be valuable. But we were playing against really talented humans, and we didn't think we could beat them at that game. And then, also, why bother playing that game? Why bother trying to play that mind game if we can just approximate a Nash equilibrium and guarantee that we're going to win?

So, I would argue that if you want to beat top humans in a two-player zero-sum game, the best approach is to just approximate the Nash equilibrium, because then, no matter what they do, you're going to beat them. Now, if your objectives are different, for example, if you really want to beat up on a weak player and exploit them, then yes, you don't necessarily want to play the Nash equilibrium; you want to adapt to their weaknesses. This is challenging to do correctly, because if you try to adapt to a weak player's weaknesses, you never know if they're just fooling you. If you're playing rock-paper-scissors against somebody and they throw rock three times in a row, and you say, "Well, he's clearly an idiot who's throwing rock every single time, I'm going to throw paper next time," they could just throw scissors. So, except in special cases, there's no safe way to do that kind of opponent exploitation and still guarantee that you're going to beat top humans in expectation. I think that's an excellent avenue for future research, but in the two-player zero-sum setting, where we're just trying to beat top humans, I think this is the better way to go about it.

So, unsafe subgame solving is very risky for this reason: if you make an assumption about how the opponent is playing, they can always shift to a different strategy and take advantage of it. We must account for the opponent's ability to adapt. Now, that said, in practice, unsafe subgame solving works unusually well in poker. It turns out that if you just approximate the Nash equilibrium strategy, assume the opponent is playing it, and apply subgame solving in this way, that actually works really well in this domain. But we have found situations where it does not work well, and I think in more general settings, it would not do well. So, we actually used it in a few situations in Libratus, but in general, I would not recommend doing this, unless the domain is specially structured so that it would work.

So, in safe subgame solving, the idea is instead that we're just going to estimate the expected values of the opponent's actions leading to the different subgames, and use that information to determine the optimal strategy for the subgame that we're in. Now, this works if your expected values are perfect, but if they're not, you're obviously not going to compute an exact Nash equilibrium. So, it turns out there's room for improvement here. By the way, this idea has been around for a while; it was first introduced in 2014. But it was never really used in practice, because it didn't actually give good results in practice, because you don't have perfect estimates. What we came up with is a way to dramatically improve the performance without giving up any theoretical guarantees, with this thing called reach subgame solving.

So, here's an example of how this works. This is going to get a little tricky, so if you have any questions in the next few slides, please let me know. Let's say that we have this slightly larger game now. It's still basically the same structure: there's a coin flip that only Player one observes, Player one takes a sequence of actions, and they eventually end up in this choice between selling the coin or choosing play. If they choose play, Player two has to guess how the coin landed.

Well, let's say our estimates are off in this game. Let's say we estimate that for choosing sell, Player one gets an expected value of minus one, regardless of which state they're in. Then the best we can do is just guess 50-50 between heads and tails, which guarantees that Player one gets an expected value of zero for choosing play.

But maybe we can use information about the earlier actions to improve upon this. So, maybe there is this earlier action that Player one could have chosen if the coin landed heads, where they could have gotten an expected value of 0.5. That means that in the Nash equilibrium, they would choose that action and get an expected value of 0.5, and in the other case, they would come down here, choose play, and get an expected value of zero. So, they're getting an average of 0.25 in this game. But we can now shift our strategy as Player two to guess tails more often, which guarantees that Player one gets negative 0.5 in the tails case. In the heads case, that means they would get 0.5 for choosing play, but that doesn't really matter, because they're already getting 0.5 from this earlier deviate action; they would never reach the situation where they would choose the 0.5 anyway. So, we're not really giving up anything in this situation; we're just making ourselves better off.

This seems really intuitive, but there's a problem with it, which is really subtle. I'm going to have to go to a bigger game, which is going to get even more complicated, to really illustrate it.

So, here is this bigger game. It's still pretty similar: there's a coin that lands heads or tails with 50-50 probability. Player one, in both cases now, let's say, has this deviate action. In the heads case, they can get 0.5, and in the tails case, they can get minus 0.5. Or they can choose to continue, in which case we run into this chance node. This chance node is public: it just leads to two different subgames, and both players observe its outcome. The two situations it leads to are strategically identical; it's an irrelevant chance node, but it is a chance node. Then, after this chance node, Player one, let's say, chooses play, and we estimate that the expected value of them choosing play is zero.

So, let's say we are Player two in this game, and we observe Player one choose play. That means we are in one of two different situations: either the coin landed heads, Player one chose to continue, we observed the chance node go left, and then they chose play; or the coin landed tails, Player one chose to continue, we observed the chance node go left, and they chose play. So, we're in either this situation or this situation. Well, we observed that they have this deviate action where they could have gotten an expected value of 0.5 if the coin landed heads. So, maybe we would say: okay, we can increase the expected value for this action to one, and lower it to minus one in the other case, for example by always guessing tails. That seems okay, because since this situation is only encountered 50 percent of the time, the expected value for this action is now just 0.5, and that matches the deviate action, so we're not giving up anything.

Does anybody see the problem with this? All right. The problem is: if that chance node had gone the other way, if it had gone right, we would apply the same exact reasoning. We would say, okay, we can increase this expected value to one, because we're only encountering this situation half the time, so the expected value only goes up to 0.5. But if we apply this reasoning regardless of which way the chance node goes, then our strategy is to always guess tails in both situations. So, in reality, the expected value in this case is one, and in this case is one, which means the expected value of continuing is actually one, not 0.5.

So, now Player one would be better off choosing to continue instead of choosing the deviate action. What this illustrates is that when you're doing this subgame solving, this real-time reasoning, you can't just look at the expected values of what we call the blueprint strategy, the precomputed strategy. You have to think about what the expected values would have been if we had entered that other subgame and applied subgame solving there, too. So, that makes things way more complicated.

By the way, two prior papers had actually discussed this idea of: okay, we're encountering this situation, let's just increase the expected value here, because the opponent could have gotten a higher expected value earlier on. And they missed this problem, that you have to consider all the subgames the opponent could end up in. So, two prior papers were published about this, and they both got it wrong. Our paper at NIPS 2017 recognized this problem and came up with a fix that allows you to do this reach subgame solving while still guaranteeing that your exploitability is not going to go up.

The basic idea is to only increase the expected value for both of these situations by 0.5. The actual details get a little more complicated, but they aren't too important for this talk. The idea is that you increase the expected values by less, depending on how many subgames the opponent could end up in. You have a question?

>>Well, I was just wondering, and I know this is a really simple case: is it actually wrong, or just weird, if you only increase the values for the left branch of the public chance node?

>>Yeah, so let's see, where's the situation-

>>You set those first to explain it?

>>Yeah.

>>It seems like this is still correct. It's just weird, because you're saying if nature flips the coin heads, then I'm going to do this weird reaction, and if it's tails, then I'm not.

>>So, I'll say this. If you were actually only increasing the expected value to one in this situation, and keeping the expected value at zero in the other situation, that's totally fine.

>>Okay.

>>But you have to think about what would have happened. Imagine this from Player one's perspective. If we would increase the expected value to one in this situation because the chance node went left, and we would also have increased the expected value to one if the chance node had gone right, then what Player one is thinking in this situation is: if I choose this action, regardless of which way the chance node goes, I'm getting an expected value of one.

>>Yeah, yeah. If that's what our algorithm actually did.

>>Yeah. So, that's what I'm saying: you have to think about what your algorithm would have done in all these other situations that didn't come up, as Player two, yeah.

>>Okay.

>>Okay. So, that's reach subgame solving. The idea is that for off-path actions, we have to consider how we would have applied subgame solving in all these other situations that didn't come up. We have a solution for it, which is basically to split what we call the slack among the different subgames.
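Here's a tiny numeric illustration of that slack-splitting, using the made-up numbers from the example above (the code is my sketch, not the paper's algorithm):

```python
# Slack-splitting in reach subgame solving, with the example's numbers
# (my sketch). Baseline play EV is 0; the opponent gave up a 0.5 "gift"
# by not taking the earlier deviate action.
gift = 0.5
chance = {"left": 0.5, "right": 0.5}     # public chance node probabilities

def overall_play_ev(ev_per_branch):
    return sum(chance[b] * ev_per_branch[b] for b in chance)

# Wrong: each branch, solved independently, raises its EV to the full 1.0
# ("we only reach this branch half the time, so it only counts as 0.5").
wrong = overall_play_ev({"left": 1.0, "right": 1.0})   # = 1.0 > 0.5 gift
# Reach fix: split the slack so the weighted total never exceeds the gift.
safe = overall_play_ev({"left": 0.5, "right": 0.5})    # = 0.5 <= 0.5 gift
print(wrong, safe)   # 1.0 0.5 -> only the second keeps the deviate action best
```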

We have a theorem that says reach subgame solving will not increase the exploitability of a safe subgame solving technique. And if there are earlier actions where the opponent could have gotten a higher expected value, then it will actually improve performance relative to traditional subgame solving. In terms of actual, yes?

>>[inaudible].

>>Yeah.

>>Then you'll have to know the expected values for the subgames that are on the path from there.

>>That's correct. So, you look at all the paths, yeah, yes. You look at all the situations where the opponent could have gone off the path.

>>Yeah.

>>You have to know the expected values for those, yeah.

>>I understand that these are values you put in. So, is it correct that when you're learning, you're allowed to be updating some of these expected values, but then you are making this per all [inaudible]?

>>Yes, so maybe I should have clarified this earlier. I'm assuming that we've basically run an algorithm that approximates a Nash equilibrium for the entire game, and that's where these expected values are coming from. So, we have a rough approximation of what the Nash equilibrium is for the entire game, and that's giving us expected values for all these different actions, but they're not perfect. It's like with AlphaZero, for example: AlphaZero gives you a policy and values for all the different states that you might be in, but those are obviously not perfect, and you can improve upon them by doing Monte Carlo tree search in real time.

>>What do you mean formally by the safe technique's exploitability? Do you assume that [inaudible] is off by a certain fixed delta?

>>That's a great question. By safe subgame solving, I mean that our strategy, this precomputed strategy that we have, is exploitable by some amount.

>>Assuming that all of these are correct, or they have-?

>>Well, I'm just saying that we've run this. Let's say we've run a traditional reinforcement learning algorithm on the entire game, no real-time planning, just a precomputed strategy. That strategy is exploitable by some amount. We would like to improve upon it in practice by doing real-time planning. But we want to guarantee that by doing real-time planning, we're at least not going to increase the exploitability of our strategy relative to what we had before. Now, in practice, the exploitability ends up being way lower, but we want the guarantee. We can't really guarantee that it's going to decrease in most situations, but we can at least guarantee that it's not going to increase. So, that is what I mean by safe subgame solving.

>>The purple values are estimates from your precomputed, suboptimal value function?

>>Yeah, those are the values that we've estimated based on our precomputed strategy for the game.

>>And we can use those to compute the red, basically?

>>Yeah.

>>Okay.

>>So, the red values are the real-time planning expected values.

>>So, what's the actual solving procedure?

>>For time reasons, I decided not to really talk about how we're actually doing all this computation of the strategy. We use something called counterfactual regret minimization, which converges to an approximate solution at a one-over-square-root-T rate: if you do T iterations, you will get within one over square root of T of the Nash equilibrium. Okay?
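For readers who haven't seen it, here is a minimal self-play sketch of the core update, regret matching, on rock-paper-scissors; this is my illustration, not Libratus's code (full CFR applies this same update at every information set of a sequential game):

```python
# Minimal regret-matching self-play on rock-paper-scissors (my sketch; full
# CFR runs this update at every information set). The AVERAGE strategy
# converges to equilibrium at an O(1/sqrt(T)) rate.
import numpy as np

A = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]], dtype=float)  # row payoffs

def strategy_from(regret):
    pos = np.maximum(regret, 0.0)
    return pos / pos.sum() if pos.sum() > 0 else np.full(3, 1 / 3)

regret = [np.zeros(3), np.zeros(3)]
avg = [np.zeros(3), np.zeros(3)]
for t in range(100_000):
    s = [strategy_from(r) for r in regret]
    for i in range(2):
        payoff = A if i == 0 else -A.T        # each player's payoff matrix
        action_values = payoff @ s[1 - i]     # EV of each pure action
        regret[i] += action_values - s[i] @ action_values
        avg[i] += s[i]

print(avg[0] / avg[0].sum())   # ~[0.333, 0.333, 0.333]
```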

>>What does that look like in terms of actual numbers?

>>It is also linear in the number of information sets.

>>So, like terabytes?

>>Well, okay. For Libratus, we actually used several terabytes of data, and we used millions of core hours of computation time.

>>In real time?

>>In real time, no. In real time, it was lower.

>>[inaudible].

>>So, that was for the precomputed strategy. For real time, we ended up using about 1,000 cores, so about 50 nodes, and the memory footprint was actually really small; it was probably less than 100 megabytes. We also figured out how to improve upon this massively, which I'll get to in a little bit.

>>Is that 100 megabytes per core, or per-?

>>No, it's the whole thing, 100 megabytes for the whole thing. The actual game, when you're solving from the turn onwards, the third betting round to the end of the game, is pretty small, but because you have to compute an equilibrium solution for it, it takes a lot of computational power.

>>So, was there a limit on how much time you have to make a decision?

>>Yeah, we ended up doing it in less than 20 seconds or so. There wasn't an official time limit, but we gave them some guidelines on how long it would take on average. We also didn't time the humans. So, if they wanted to take 10 minutes for a decision, that was fine with us.

I'll also say, since we're thinking about timing: one of the interesting challenges of poker is that you don't want to give away timing tells, right? If it takes you two seconds to make a decision, then the opponent might think you have a really easy decision, whereas if you take two minutes, it might be a difficult decision, and they can figure out what hand you have. So, if you look at the World Series of Poker, it gets really annoying, because at the final table, they all take the same amount of time for every single decision. They take like two minutes for every single decision, even if it's a really trivial one. We didn't want the humans to have to do this, because it would have taken forever and would have pissed them off, and would have pissed us off, so we told them flat out that the bot was not going to look at timing tells. If they took two seconds to make a decision, that was totally fine; we wouldn't change anything. But we couldn't make them extend the same courtesy to the bot: if the bot took two seconds on one decision versus two minutes on another, they would pick up on that, and they can't not pick up on that, right? So, we had to make the bot take the same amount of time for every single decision it made, to prevent that. That's also why it ends up taking longer: there are a lot of decisions that are easy, but we can't make that obvious.

All right. So, experiments on medium-size games. It turns out that the reach subgame solving technique I just described is about three times less exploitable than prior safe subgame solving techniques. And nested subgame solving, this idea of applying subgame solving repeatedly as you go down the game tree, is 12 times less exploitable than the prior state of the art, which was to just say: well, if the guy bet $60, and our precomputed solution has a strategy for if he had bet $50, then we'll round it and treat it as if he had bet $50 instead. That was the previous state of the art, and nested subgame solving is 12 times less exploitable than that in heads-up no-limit Texas hold'em. Sorry, in smaller versions of heads-up no-limit Texas hold'em.
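That rounding baseline, often called action translation, is simple enough to show in a couple of lines (the abstraction sizes here are made up for illustration; real systems use more careful mappings):

```python
# The rounding baseline ("action translation") in its simplest form (my
# sketch; the bet sizes in the abstraction are made up for illustration).
def translate(bet, abstraction=(25, 50, 100, 200, 400)):
    """Map an off-tree bet to the nearest size in the precomputed abstraction."""
    return min(abstraction, key=lambda size: abs(size - bet))

print(translate(60))   # -> 50: a $60 bet is treated as if it were the $50 bet
```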

Okay. So, that is one reason why imperfect-information games are hard. There is a second reason that I wanted to get to, and I think I still have time to do it. This is more recent research. The second reason is that states don't have well-defined values in imperfect-information games. Let me show you what I mean.

In a perfect-information game, or in single-agent settings, if you take some action and your opponent takes an action, you find yourself at a particular decision point, in a particular subgame. You don't solve to the end of that subgame, right? You do what's called depth-limited reasoning. The remainder of the game is too large, so you come up with a strategy for the next several actions. Then, once you get to a certain depth limit, you say: okay, I've looked far enough, I'm just going to substitute a value for this leaf node, and say that this arrangement of pieces on the board looks like, say, white wins with 60 percent probability. So, you assign a value to that state, and then you solve the depth-limited subgame using those values at the leaf nodes. It turns out that this does not work in imperfect-information games. I can give you a really simple example of why.

This is a game that I call Rock-Paper-Scissors+. It's exactly like rock-paper-scissors, except that if either player throws scissors, then the winner gets two points and the loser loses two points. That's just to break the symmetry of the game. So, the Nash equilibrium in this game is to throw rock and paper with 40 percent probability each, and scissors with 20 percent probability.
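It's easy to check that this mix makes the opponent indifferent (a two-line check, mine):

```python
# Check (my sketch) that 40/40/20 is an equilibrium of Rock-Paper-Scissors+:
# any win involving scissors is worth 2 points instead of 1.
import numpy as np

A = np.array([[ 0, -1,  2],
              [ 1,  0, -2],
              [-2,  2,  0]], dtype=float)   # row player's payoffs (R, P, S)
s = np.array([0.4, 0.4, 0.2])

print(s @ A)   # [0. 0. 0.] -> every opponent response breaks even, so s is optimal
```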

Now, imagine that we're trying to do a depth-limited solve of this game as player one. We look one move ahead, and then we substitute the Nash equilibrium value at each of those states, instead of going to the end of the game. This is the depth-limited subgame. It's really easy to see that if we try to solve this depth-limited subgame, there is no way we're going to find the optimal strategy of 40 percent, 40 percent, 20 percent. There's just not enough information in this depth-limited subgame to find the Nash equilibrium. Why is that?

Well, it turns out the reason is that we are essentially assuming that beyond this decision point, player two is going to play the Nash equilibrium strategy. That's where the 0, 0, 0 values come from: this is assuming that if we had chosen scissors, and player two plays the Nash equilibrium strategy beyond this point, the expected value is zero. But in reality, player two's strategy beyond the depth limit depends on what our strategy is above the depth limit. If we choose rock 80 percent of the time, player two's strategy isn't going to be the Nash equilibrium; it's going to be choosing paper 100 percent of the time, and this is what the state values would look like. Or, if we were to choose paper 80 percent of the time, then they would switch to always choosing scissors, and this is what the state values would look like. So, in an imperfect-information game, the values of the states at the depth limit depend on what our policy is in the earlier parts of the game.

So, how do we deal with this? Well, one option is to just actually make the state values dependent on our policy, and say: okay, the value of a state is a function of the description of that state and our policy for the entire game. That is theoretically correct; the problem is that it's extremely expensive, I mean absurdly expensive. The other option is something that was actually used in an AI called DeepStack: they condition the value of a state on the belief distribution of both players at that state. So, they said: okay, at this decision point, I'm not going to condition on the strategy for the early part of the game; I'm going to look at all the different states that I might be in, and the probability that I believe I'm in each of those states, which in this case is 80 percent, 10 percent, 10 percent. In this game, that ends up being the same exact thing as conditioning on the policy, but in general, that's not the case.

The problem is that this is still extremely expensive. DeepStack, for example, used 1.5 million core hours of computation in heads-up no-limit Texas hold'em, and could not beat prior top AIs. The other problem is that the technique currently does not scale to larger games, basically games where you have more states in an information set. You can get by with this in heads-up no-limit Texas hold'em, because at any single decision point, in every information set, there are only about 1,000 different states. But in a game like Five Card Draw, there could be five billion, and in a game like Stratego, it could be 10^20 or something. So, this would not work in those larger games.

So, what we do instead, in a paper we just had accepted at NIPS 2018 called depth-limited solving, which we used in an AI we created called Modicum, is the following; let me walk you through the algorithm. The idea is that instead of assuming there is a single value at the depth limit, we're actually going to let the opponent choose between multiple different values for these states.

We create these different values in an iterative process. We start off by assuming that player two, beyond the depth limit, is going to play the Nash equilibrium strategy that we precomputed. Then, we solve this depth-limited subgame; let's say we solve it and our strategy comes out to be one-third, one-third, one-third. Now, we calculate a player-two best response. So, we can say: okay, if our strategy is to play one-third, one-third, one-third, player two could exploit us by always choosing rock. Now, we add that best response to the set of strategies that player two can choose at the depth limit. So, we solve this depth-limited subgame again, and we say: at the depth limit, player two can now choose. They can either choose the Nash equilibrium strategy that we had before, or they can choose this best response that we just calculated, which was always play rock. They make this decision at the level of information sets: because all of these states share an information set, they can't say, "Okay, in this state I'm going to choose this policy, and in that state I'm going to choose that policy." They have to make the same choice at each of the states that share an information set. So, we solve this depth-limited subgame again, then we again calculate a player-two best response to that strategy, and we add that strategy to the set of continuation strategies that player two can choose at the depth limit. We can repeat this process as many times as we want.
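Here is a runnable toy of that loop on Rock-Paper-Scissors+; the construction is mine, not Modicum's code (the real system operates on poker subgames, and the solver here is just the regret matching shown earlier):

```python
# Toy depth-limited solving on Rock-Paper-Scissors+ (my sketch, not Modicum's
# code): the opponent picks, at the depth limit, among a growing set of
# continuation strategies; we re-solve, add their best response, repeat.
import numpy as np

A = np.array([[0, -1, 2], [1, 0, -2], [-2, 2, 0]], dtype=float)  # P1 payoffs

def solve_meta(continuations, iters=20_000):
    """Regret matching for P1 vs. a P2 who picks the worst continuation."""
    regret, avg = np.zeros(3), np.zeros(3)
    for _ in range(iters):
        pos = np.maximum(regret, 0.0)
        s = pos / pos.sum() if pos.sum() > 0 else np.full(3, 1 / 3)
        # P2 selects the continuation minimizing P1's expected value.
        v = min((A @ c for c in continuations), key=lambda vals: s @ vals)
        regret += v - s @ v
        avg += s
    return avg / avg.sum()

conts = [np.full(3, 1 / 3)]        # start: assume P2 plays the blueprint
for _ in range(8):
    s1 = solve_meta(conts)
    br = np.zeros(3)
    br[np.argmin(s1 @ A)] = 1.0    # P2's best response to our current strategy
    conts.append(br)

print(solve_meta(conts))           # -> approaches [0.4, 0.4, 0.2]
```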

Now, some details about this technique. It might seem like this is really expensive and may not get good performance, because we can't add a million different strategies for player two to choose from at the depth limit. But it turns out that because they are making this choice separately at each information set, even if we only give them a choice between 10 different strategies beyond the depth limit, if they are making that choice individually at 100 different information sets, they're actually choosing between 10^100 different strategies for the entire remainder of the game. So, the space actually grows very quickly.

The other thing is player one: I talked about what player two does, choosing between these different strategies they could play for the remainder of the game. For player one, which is us, we assume we're playing according to the approximate Nash equilibrium strategy for the remainder of the game. Also, the set of player two's continuation strategies is precomputed; it's not determined in real time. You could do it in real time, but it would be too expensive in practice.

Okay. So, this is what performance looks like when we do this depth-limited solving. On the x-axis, we have the number of values per leaf node that the opponent can choose between at the depth limit, and on the y-axis, we have exploitability measured in milli-big-blinds per game. This is on a simplified version of no-limit Texas hold'em that only has two betting rounds instead of four. You can see that if we assume each state has a unique value, which is essentially the Nash equilibrium value, like we would in a perfect-information game, exploitability is extremely high. But as we add more strategies for the opponent to choose between at the depth limit, exploitability drops off very quickly. In fact, with 16 different choices, we're essentially at the Nash equilibrium. So, the beautiful thing about this. Yes?

>>So, why is it that at one value, [inaudible] blue wouldn't be the same?

>>No, because blue, sorry, so blue here is not doing any real-time reasoning; it's doing the thing where, if they had bet $60, I'm going to round that to $50. Red is actually doing real-time reasoning, but it's assuming that each state has a well-defined, unique value.

>>So, is this like [inaudible] generation or something, so what [inaudible]?

>>There are a lot of similarities between this and things like double oracle methods, yes.

All right. So, in terms of head-to-head performance, the really cool thing about this technique is that it allows us to make a really strong poker AI using very few resources. To give you some perspective: we had this bot called Tartanian8 that we made in 2016, which won the Annual Computer Poker Competition, a competition among poker AIs. It used two million core hours of computation and 18 terabytes of memory, with no real-time reasoning. Then there's Slumbot, which won the 2018 competition; that wasn't us, it's a different bot. It used 250,000 core hours and two terabytes of memory, with no real-time reasoning. Modicum, the bot that uses this depth-limited solving, used just 700 core hours and 16 gigabytes of memory, plus real-time reasoning with a 4-core CPU in under 20 seconds per hand, and it beats both of those other bots.

To put this in perspective even further: Libratus, the AI that we played against the top humans, used millions of core hours, I think it was something like two million or five million, probably around 20 terabytes of memory, and played in real time using 1,000 cores. So, with Modicum we're able to get what is probably superhuman performance, we haven't actually tested it against humans, but I'm pretty sure it's a superhuman poker AI, using the resources of a laptop. In fact, since I published this paper, I've put another paper on arXiv where we figured out how to make this three times faster. So, it could probably run on a smartphone now. Yeah?

>>Where is the [inaudible] you can just run?

>>It turns out that there is a huge amount of variance in poker.

>>Yes.

>>Because we're doing real-time reasoning and taking 20 seconds per hand, the variance is massive. In fact, we trained this using 700 core hours, but it took us something like a million core hours to actually compute all these results. This has been a problem in the entire field; the variance is just absurd.

So, this is a graph of head-to-head performance. Here we have Libratus, which beat top humans. Here is Modicum, the AI we just created that uses far fewer resources. Here are some other benchmark bots: a bot from 2016 and a bot from 2014. And here is DeepStack, a bot from the University of Alberta, which actually has very low exploitability but didn't end up being that strong in terms of head-to-head performance. It also uses real-time reasoning, though a different form of it.

All right, so the key takeaways... yes?

>>You said that one has low exploitability but it's not that strong?

>>In terms of head-to-head performance it's not as strong; it actually doesn't beat the prior benchmark bots.

>>Yes, I guess that's curious to me, because if you're not exploitable... okay, so...

>>When I say low exploitability, I mean relative to the past bots. The exploitability could still be extremely high; we actually don't know. We can't calculate exploitability exactly in heads-up no-limit Texas hold'em, so we don't know. But it appears to have lower exploitability compared to the absurdly high exploitability of the previous bots. Yes.
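
For reference, this is the standard definition from the literature, not anything specific to these bots: in a two-player zero-sum game, the exploitability of a strategy profile $\sigma = (\sigma_1, \sigma_2)$ is

$$
\mathrm{exploitability}(\sigma) = \frac{\max_{\sigma_1'} u_1(\sigma_1', \sigma_2) + \max_{\sigma_2'} u_2(\sigma_1, \sigma_2')}{2},
$$

which is zero exactly when $\sigma$ is a Nash equilibrium. Each max requires a full best-response traversal of the game tree, which is why it cannot be computed exactly at the scale of heads-up no-limit Texas hold'em.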

So, key takeaways. In real-time planning, you always have to consider how the opponent can adapt to changes in your policy. That is really important in imperfect-information games; in perfect-information games you can mostly, though not completely, ignore it. An imperfect-information subgame cannot be solved in isolation, and states in imperfect-information games do not have well-defined values.

All right. So, I have done some other work that I did not discuss. One thing I only touched on briefly is how we actually solve these games. We use an algorithm called counterfactual regret minimization (CFR), which has been around for about 10 years now. It works extremely well, even though the theoretical guarantee on convergence is only on the order of 1/√T. I just released a paper where I developed a new form of CFR that beats the prior state of the art by a factor of three. So, I'm really excited about that, and it's going to be used in all our future research.
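
To give a sense of what CFR looks like, here is a minimal vanilla-CFR implementation for Kuhn poker, the standard three-card toy game, following the well-known tutorial formulation. It shows the shape of the algorithm, not the optimized variants behind the bots discussed here.

```python
import random
import numpy as np

ACTIONS = 2                      # 0 = pass/check/fold, 1 = bet/call
A_CHARS = "pb"

class Node:
    def __init__(self):
        self.regret_sum = np.zeros(ACTIONS)
        self.strategy_sum = np.zeros(ACTIONS)

    def strategy(self, reach):
        """Regret matching: play actions in proportion to positive regret."""
        s = np.maximum(self.regret_sum, 0.0)
        s = s / s.sum() if s.sum() > 0 else np.full(ACTIONS, 0.5)
        self.strategy_sum += reach * s       # accumulate for average strategy
        return s

    def average_strategy(self):
        t = self.strategy_sum.sum()
        return self.strategy_sum / t if t > 0 else np.full(ACTIONS, 0.5)

nodes = {}

def cfr(cards, history, p0, p1):
    """Returns expected utility for the player to act at `history`."""
    player = len(history) % 2
    if len(history) > 1:                     # terminal checks
        higher = cards[player] > cards[1 - player]
        if history[-1] == "p":               # check-check showdown or a fold
            return (1 if higher else -1) if history == "pp" else 1
        if history[-2:] == "bb":             # bet-call showdown
            return 2 if higher else -2
    node = nodes.setdefault(str(cards[player]) + history, Node())
    strat = node.strategy(p0 if player == 0 else p1)
    util = np.zeros(ACTIONS)
    for a in range(ACTIONS):
        h = history + A_CHARS[a]
        if player == 0:
            util[a] = -cfr(cards, h, p0 * strat[a], p1)
        else:
            util[a] = -cfr(cards, h, p0, p1 * strat[a])
    node_util = float(strat @ util)
    # Counterfactual regret update, weighted by the opponent's reach prob.
    node.regret_sum += (p1 if player == 0 else p0) * (util - node_util)
    return node_util

cards = [0, 1, 2]                            # jack, queen, king
for _ in range(20000):
    random.shuffle(cards)
    cfr(cards, "", 1.0, 1.0)
# With the king, player 1 should (almost) always call a bet:
print(nodes["2pb"].average_strategy())       # -> roughly [0, 1]
```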

I also have some work on pruning in CFR. It turns out that in CFR you end up exploring the entire game tree, which is a big waste, because a lot of actions are suboptimal: you don't want to spend time working out what to do after playing a really crappy poker hand that, in reality, you would just fold right away. So, I have a pruning technique that provably reduces the computational and memory costs of running CFR asymptotically. In practice, it speeds things up by an order of magnitude or two.
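
Since the talk doesn't give details, here is just the scheduling arithmetic behind the idea, a sketch rather than the full algorithm from the paper. The one assumption, true in CFR, is that a single iteration can raise an action's cumulative regret by at most the payoff range; the names and constants are mine.

```python
# Illustrative arithmetic behind regret-based pruning. Under regret
# matching, an action with negative cumulative regret R is played with
# probability zero, and one CFR iteration can raise R by at most
# delta = u_max - u_min. So the action cannot re-enter the support for
# at least -R / delta iterations, and its subtree can safely be skipped
# until then instead of being traversed every iteration.

def prune_horizon(regret: float, u_max: float, u_min: float) -> int:
    """Iterations for which this action provably stays at probability zero."""
    if regret >= 0:
        return 0                          # action may be in the support now
    return int(-regret // (u_max - u_min))

# Example with payoffs bounded in [-2, 2], as in Kuhn poker:
print(prune_horizon(-900.0, 2.0, -2.0))   # -> 225 skippable iterations
```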

I also have a paper on determining the optimal actions in a continuous action space. One of the interesting things about no-limit Texas hold'em is that you have this continuous action space where you can choose to bet any amount between, say, $100 and $20,000. The truth is that what we do right now is just domain-specific abstraction techniques, where we say: okay, you probably just want to bet either half the pot, one times the pot, or two times the pot, and it doesn't really matter whether you bet 0.6 times the pot or 0.5 times the pot. But that relies on domain knowledge, namely that we know the optimal bet fractions are roughly in that range.
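
The deterministic rounding mentioned earlier ("a $60 bet rounds to $50") is the simplest version of this. For illustration, here is a sketch of the randomized pseudo-harmonic action translation published by Ganzfried and Sandholm (2013) applied to a hypothetical half-pot/pot/two-pot grid; the grid is just an example, not the bots' actual abstraction.

```python
import random

GRID = [0.5, 1.0, 2.0]   # bet sizes as fractions of the pot (illustrative)

def translate(x: float) -> float:
    """Map an observed pot-fraction bet x onto the abstraction grid.

    Between neighboring grid points A < x < B, x maps to A with
    probability f = ((B - x)(1 + A)) / ((B - A)(1 + x)), which is scale
    invariant and harder to exploit than deterministic rounding."""
    if x <= GRID[0]:
        return GRID[0]
    if x >= GRID[-1]:
        return GRID[-1]
    for a, b in zip(GRID, GRID[1:]):      # find the bracketing grid points
        if a <= x <= b:
            p_lower = ((b - x) * (1 + a)) / ((b - a) * (1 + x))
            return a if random.random() < p_lower else b

print(translate(0.6))   # usually 0.5 (p = 0.75), sometimes 1.0
```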

So, it would be nice to have an algorithm that can determine the bets to use without relying on any human domain knowledge. That's what this 2014 paper does, and hopefully we'll have some follow-up work on it in the future. Now, for future directions.

As I mentioned before, a lot of this work has used poker as the domain, and that's because it takes a lot of infrastructure to do experiments on large games. We have a lot of expertise, built up over the years, on how to run these techniques efficiently on a game like poker, and if we wanted to test on another large game, it would take years to build up that same expertise. There's also the problem that we can't really compare performance against bots in other games, because there are no other games with competitive bots.

But I would love to move beyond poker, and there are a few different directions for that. One is that we have techniques for perfect-information games, like AlphaZero, and we have techniques for imperfect-information games, like the ones for poker, and it would be nice to bridge the gap and find a single algorithm that works really well across all these different games.

Another thing that I'm really interested in is going beyond two-player zero-sum games. I mentioned that if you try to move to general-sum games, a bunch of theoretical challenges pop up, and right now we don't know how to cope with those challenges. But dealing with general-sum games is really important, because most real-world situations are general-sum, not zero-sum, except maybe military interactions or security.

In particular, I think working on something like negotiation is a really interesting line of research. More generally, moving toward real-world domains: I think we're at the point right now where we can actually start bringing these techniques into the real world, and that's going to be a really interesting line of research as well. All right, I'll stop there and take some last-minute questions. Thank you.

>>Yes, so I guess in the real world you often don't know the rules of the game; you don't know them in advance, [inaudible] situation.

>>Right.

>>But you do observe the outcomes of players. Have you thought about what happens when you go into a situation where you don't know the rules of the game, but you can observe the outcomes of the players?

>>That's a good question, so, yes. All of our work assumes that we have an accurate model of the world, of the game that's being played. I think a lot of these techniques would carry over if you wanted to figure out the structure of the game as you go. There was actually a recent paper from another student at CMU on this problem, so people are starting to look in this direction; I have not, but it's something I think is very interesting. It is also a much harder problem, because figuring out what information your opponent knows, or could know, is a really difficult challenge. I'm not sure how to go about that. Yes.

>>I think of many applications in something like reinforcement learning. Right now, I can imagine splitting the environment into the portion that I can model, which would be like the chess moves, and the parts that I don't necessarily want to model because I'm very risk-averse. Those could become adversarial moves, and I want to be robust. Have you thought about how well these techniques would apply, or is that completely out of the ballpark?

>>To be honest, I never really considered that direction for the research. I think there's a lot of potential there. This is an area of research that I think has been overlooked by a lot of people; it's been a very small community working in this space, and I think people are starting to appreciate that it can be applied to a lot of different things. We're starting to see papers on how to apply something like counterfactual regret minimization to more mainstream reinforcement learning topics; I think there was a recent paper from Berkeley on regret minimization in single-agent settings. So, I think there is definitely potential to extend this research to more traditional reinforcement learning settings, but I have not looked into that yet.

>>[inaudible] a little bit, is there anything on learning? In the real world, one of the things that's really difficult is that I don't know the rules of the game, and I often have no idea what my opponent's payoffs are. Is that something that people have looked at? Basically, simultaneously trying to improve my own payoffs while trying to build a model of what's going on with my opponent.

>>Yes. I guess that would be kind of a subset of the previous case we discussed, where maybe you know the structure of the game but you don't know the opponent's payoffs. I think this has been looked at in the game theory community, but more in simple cases, not in large-scale computer-science settings. I could be wrong about that; I don't know it too well.

>>As you said, I should know the game theory, yes, sure. So, there are models where the payoff is unknown but I parameterize the function, sure.

>>Yes. So, I don't know of any large-scale work on that sort of thing, but I think it would be a lot easier to extend this work to that setting than to one where you don't know the rules of the game at all. Okay.

>>Thank you.
