CS261: A Second Course in Algorithms
Lecture #1: Course Goals and Introduction to
Maximum Flow∗
Tim Roughgarden†
January 5, 2016
1 Course Goals
CS261 has two major course goals, and the course splits roughly in half along these lines.
1.1 Well-Solved Problems
This first goal is very much in the spirit of an introductory course on algorithms. Indeed,
the first few weeks of CS261 are pretty much a direct continuation of CS161 — the topics
that we’d cover at the end of CS161 at a semester school.
Course Goal 1 Learn efficient algorithms for fundamental and well-solved problems.
There’s a collection of problems that are flexible enough to model many applications and
can also be solved quickly and exactly, in both theory and practice. For example, in CS161
you studied shortest-path algorithms. You should have learned all of the following:
1. The formal definition of one or more variants of the shortest-path problem.

2. Some famous shortest-path algorithms, like Dijkstra's algorithm and the Bellman-Ford algorithm, which belong in the canon of algorithms' greatest hits.

3. Applications of shortest-path algorithms, including to problems that don't explicitly involve paths in a network. For example, to the problem of planning a sequence of decisions over time.
∗©2016, Tim Roughgarden.
†Department of Computer Science, Stanford University, 474 Gates Building, 353 Serra Mall, Stanford, CA 94305. Email: tim@cs.stanford.edu.
The study of such problems is top priority in a course like CS161 or CS261. One of
the biggest benefits of these courses is that they prevent you from reinventing the wheel
(or trying to invent something that doesn’t exist), instead allowing you to stand on the
shoulders of the many brilliant computer scientists who preceded us. When you encounter
such problems, you already have good algorithms in your toolbox and don’t have to design
one from scratch. This course will also give you practice spotting applications that are just
thinly disguised versions of these problems.
Specifically, in the first half of the course we’ll study:
1. the maximum flow problem;

2. the minimum cut problem;

3. graph matching problems;

4. linear programming, one of the most general polynomial-time solvable problems known.
Our algorithms for these problems will have running times a bit bigger than those you
studied in CS161 (where almost everything runs in near-linear time). Still, these algorithms
are sufficiently fast that you should be happy if a problem that you care about reduces to
one of these problems.
1.2 Not-So-Well-Solved Problems
Course Goal 2 Learn tools for tackling not-so-well-solved problems.
Unfortunately, many real-world problems fall into this camp, for many different reasons.
We’ll focus on two classes of such problems.
1. NP-hard problems, for which we don't expect there to be any exact polynomial-time algorithms. We'll study several broadly useful techniques for designing and analyzing heuristics for such problems.

2. Online problems. The anachronistic name does not refer to the Internet or social networks, but rather to the realistic case where an algorithm must make irrevocable decisions without knowing the future (i.e., without knowing the whole input).
We’ll focus on algorithms for NP-hard and online problems that are guaranteed to output
a solution reasonably close to an optimal one.
1.3 Intended Audience
CS261 has two audiences, both important. The first is students who are taking their final
algorithms course. For this group, the goal is to pack the course with essential and likely-
to-be-useful material. The second is students who are contemplating a deeper study of
algorithms. With this group in mind, when the opportunity presents itself, we’ll discuss
recent research developments and give you a glimpse of what you’ll see in future algorithms
courses. For this second audience, CS261 has a third goal.
Course Goal 3 Provide a gateway to the study of advanced algorithms.
After completing CS261, you’ll be well equipped to take any of the many 200- and 300-
level algorithms courses that the department offers. The pace and difficulty level of CS261
interpolates between that of CS161 and more advanced theory courses.
When you speak to an audience, it's good to have one or a few "canonical audience members"
in mind. For your reference and amusement, here’s your instructor’s mental model for
canonical students in courses at different levels:
1. CS161: a constant fraction of the students do not want to be there, and/or hate math.

2. CS261: a self-selecting group of students who like algorithms and want to learn much more about them. Students may or may not love math, but they shouldn't hate it.

3. CS3xx: geared toward students who are doing or would like to do research in algorithms.
2 Introduction to the Maximum Flow Problem
Figure 1: (a, left) Our first flow network. Each edge is associated with a capacity. (b, right)
A sample flow of value 5, where the red, green and blue paths have flow values of 2, 1, 2
respectively.
2.1 Problem Definition
The maximum flow problem is a stone-cold classic in the design and analysis of algorithms.
It’s easy to understand intuitively, so let’s do an informal example before giving the formal
definition.
The picture in Figure 1(a) resembles the ones you saw when studying shortest paths, but
the semantics are different. Each edge is labeled with a capacity, the maximum amount of
stuff that it can carry. The goal is to figure out how much stuff can be pushed from the
vertex s to the vertex t.
For example, Figure 1(b) exhibits a method of pushing five units of flow from s to t, while
respecting all edges’ capacities. Can we do better? Certainly not, since at most 5 units of
flow can escape s on its two outgoing edges.
Formally, an instance of the maximum flow problem is specified by the following ingre-
dients:
• a directed graph G, with vertices V and directed edges E;1

• a source vertex s ∈ V ;

• a sink vertex t ∈ V ;

• a nonnegative and integral capacity ue for each edge e ∈ E.
Figure 2: Denoting a flow by keeping track of the amount of flow on each edge. Flow amount
is given in brackets.
Since the point is to push flow from s to t, we can assume without loss of generality that s has no incoming edges and t has no outgoing edges.
Given such an input, the feasible solutions are the flows in the network. While Figure 1(b)
depicts a flow in terms of several paths, for algorithms, it works better to just keep track of
the amount of flow on each edge (as in Figure 2).2 Formally, a flow is a nonnegative vector
{fe}e∈E, indexed by the edges of G, that satisfies two constraints:

1 All of our maximum flow algorithms can be extended to undirected graphs; see Exercise Set #1.
2 Every flow in this sense arises as the superposition of flow paths and flow cycles; see Problem #1.
Capacity constraints: fe ≤ ue for every edge e ∈ E;

Conservation constraints: for every vertex v other than s and t,

    amount of flow entering v = amount of flow exiting v.

The left-hand side is the sum of the fe's over the edges incoming to v; likewise with the outgoing edges for the right-hand side.
The objective is to compute a maximum flow — a flow with the maximum-possible value,
meaning the total amount of flow that leaves s. (As we’ll see, this is the same as the total
amount of flow that enters t.)
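To make these definitions concrete, here is a short Python sketch (the encoding of a network as an edge list with a parallel capacity list is my own choice, not from the lecture) that checks the two constraints and computes a flow's value. For instance, the flow of Figure 2, with fe values 3, 2, 1, 2, 3 on edges (s, v), (s, w), (v, w), (v, t), (w, t), passes both checks and has value 5.

```python
def is_feasible(edges, capacities, flow, s, t):
    """Check the capacity and conservation constraints for a flow vector,
    where edges[i] = (tail, head) and flow[i] is the flow on that edge."""
    # Capacity constraints: 0 <= f_e <= u_e for every edge e.
    if any(f < 0 or f > u for f, u in zip(flow, capacities)):
        return False
    # Conservation constraints: flow in == flow out at every v other than s, t.
    net = {}  # net[v] = flow out of v minus flow into v
    for (v, w), f in zip(edges, flow):
        net[v] = net.get(v, 0) + f
        net[w] = net.get(w, 0) - f
    return all(net[v] == 0 for v in net if v not in (s, t))

def flow_value(edges, flow, s):
    """The value of a flow: the total amount leaving the source s."""
    return sum(f for (v, w), f in zip(edges, flow) if v == s)
```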
2.2 Applications
Why should we care about the maximum flow problem? Like all central algorithmic prob-
lems, the maximum flow problem is useful in its own right, plus many different problems are
really just thinly disguised versions of maximum flow. For some relatively obvious and literal
applications, the maximum flow problem can model the routing of traffic through a trans-
portation network, packets through a data network, or oil through a distribution network.3
In upcoming lectures we’ll prove the less obvious fact that problems ranging from bipartite
matching to image segmentation reduce to the maximum flow problem.
2.3 A Naive Greedy Algorithm
We now turn our attention to the design of efficient algorithms for the maximum flow prob-
lem. A priori, it is not clear that any such algorithms exist (for all we know right now, the
problem is NP-hard).
We begin by considering greedy algorithms. Recall that a greedy algorithm is one that
makes a sequence of myopic and irrevocable decisions, with the hope that everything some-
how works out at the end. For most problems, greedy algorithms do not generally produce
the best-possible solution. But it’s still worth trying them, because the ways in which greedy
algorithms break often yield insights that lead to better algorithms.
The simplest greedy approach to the maximum flow problem is to start with the all-zero
flow and greedily produce flows with ever-higher value. The natural way to proceed from
one to the next is to send more flow on some path from s to t (cf., Figure 1(b)).
3 A flow corresponds to a steady-state solution, with a constant rate of arrivals at s and departures at t. The model does not capture the time at which flow reaches different vertices. However, it's not hard to extend the model to capture temporal aspects as well.
A Naive Greedy Algorithm

initialize fe = 0 for all e ∈ E
repeat
    search for an s-t path P such that fe < ue for every e ∈ P
        // takes O(|E|) time using BFS or DFS
    if no such path then
        halt with current flow {fe}e∈E
    else
        let ∆ = min_{e∈P} (ue − fe)
            // ue − fe is the "room" on e; ∆ is the room on P
        for all edges e of P do
            increase fe by ∆
Note that the path search just needs to determine whether or not there is an s-t path in the subgraph of edges e with fe < ue. This is easily done in linear time using your favorite graph search subroutine, such as breadth-first or depth-first search. There may be many such paths; for now, we allow the algorithm to choose one arbitrarily. The algorithm then pushes as much flow as possible on this path, subject to capacity constraints.
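A runnable Python sketch of this greedy algorithm (the dict-of-dicts graph encoding and the use of breadth-first search as the path search are my own choices; the lecture allows any path-finding subroutine):

```python
from collections import deque

def naive_greedy_flow(cap, s, t):
    """cap[v][w] = capacity of edge (v, w). Repeatedly finds an s-t path on
    which every edge has f_e < u_e and saturates it. Returns (flow, value)."""
    flow = {v: {w: 0 for w in cap[v]} for v in cap}
    value = 0
    while True:
        # BFS restricted to edges with leftover room (f_e < u_e).
        parent, queue = {s: None}, deque([s])
        while queue and t not in parent:
            v = queue.popleft()
            for w in cap.get(v, {}):
                if w not in parent and flow[v][w] < cap[v][w]:
                    parent[w] = v
                    queue.append(w)
        if t not in parent:
            return flow, value  # no path of non-full edges remains
        # Recover the path s -> ... -> t and push as much flow as possible.
        path, w = [], t
        while parent[w] is not None:
            path.append((parent[w], w))
            w = parent[w]
        delta = min(cap[v][w] - flow[v][w] for v, w in path)
        for v, w in path:
            flow[v][w] += delta
        value += delta
```

On the network of Figure 1, breadth-first search happens to pick benign paths and reaches value 5; to reproduce the failure discussed next (Figure 3), force the first path to be the zig-zag s → v → w → t.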
Figure 3: Greedy algorithm returns suboptimal result if first path picked is s-v-w-t.
This greedy algorithm is natural enough, but does it work? That is, when it terminates with a flow, need this flow be a maximum flow? Our sole example thus far already provides a negative answer (Figure 3). Initially, with the all-zero flow, all s-t paths are fair game. If the algorithm happens to pick the zig-zag path, then ∆ = min{3, 5, 3} = 3 and it routes 3 units of flow along the path. This saturates the upper-left and lower-right edges, at which point there is no s-t path such that fe < ue on every edge. The algorithm terminates at this point with a flow with value 3. We already know that the maximum flow value is 5, and we conclude that the naive greedy algorithm can terminate with a non-maximum flow.4
2.4 Residual Graphs
The second idea is to extend the naive greedy algorithm by allowing “undo” operations. For
example, from the point where this algorithm gets stuck in Figure 3, we’d like to route two
more units of flow along the edge (s, w), then backward along the edge (v, w), undoing 2 of the 3 units we routed in the previous iteration, and finally along the edge (v, t). This would
yield the maximum flow of Figure 1(b).
Figure 4: (a) original edge capacity and flow and (b) resultant edges in residual network.
Figure 5: Residual network of flow in Figure 3.
We need a way of formally specifying the allowable "undo" operations. This motivates the following simple but important definition, of a residual network. The idea is that, given a graph G and a flow f in it, we form a new flow network Gf that has the same vertex set as G and that has two edges for each edge of G. An edge e = (v, w) of G that carries flow fe and has capacity ue (Figure 4(a)) spawns a "forward edge" (v, w) of Gf with capacity ue − fe (the room remaining) and a "backward edge" (w, v) of Gf with capacity fe (the amount of previously routed flow that can be undone). See Figure 4(b).5 Observe that s-t paths with fe < ue for all edges, as searched for by the naive greedy algorithm, correspond to the special case of s-t paths of Gf that comprise only forward edges.

4 It does compute what's known as a "blocking flow"; more on this next lecture.

For example, with G our running example and f the flow in Figure 3, the corresponding residual network Gf is shown in Figure 5. The four edges with zero capacity in Gf are omitted from the picture.6
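In code, the residual network need not be built as a separate graph. One standard implementation trick (my own illustration, not something prescribed by the notes) is to create both the forward and backward edge up front and have each store a pointer to its twin, so that "undoing" flow is just moving residual capacity between the pair:

```python
def make_network():
    """Adjacency lists in which every edge is a mutable triple
    [head, residual_capacity, twin], where twin is the paired reverse edge."""
    graph = {}

    def add_edge(v, w, capacity):
        graph.setdefault(v, [])
        graph.setdefault(w, [])
        fwd = [w, capacity, None]  # forward edge: residual u_e - f_e, starts at u_e
        bwd = [v, 0, fwd]          # backward edge: residual f_e, starts at 0
        fwd[2] = bwd
        graph[v].append(fwd)
        graph[w].append(bwd)

    def push(edge, delta):
        # Routing delta units across an edge shrinks its residual capacity
        # and grows its twin's by the same amount.
        edge[1] -= delta
        edge[2][1] += delta

    return graph, add_edge, push
```

With this representation, augmenting along a path is a sequence of `push` calls, and the flow on an original edge can be read off its backward twin's residual capacity.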
2.5 The Ford-Fulkerson Algorithm
Happily, if we just run the natural greedy algorithm in the current residual network, we get
a correct algorithm, the Ford-Fulkerson algorithm.7
Ford-Fulkerson Algorithm

initialize fe = 0 for all e ∈ E
repeat
    search for an s-t path P in the current residual graph Gf such that
    every edge of P has positive residual capacity
        // takes O(|E|) time using BFS or DFS
    if no such path then
        halt with current flow {fe}e∈E
    else
        // augment the flow f using the path P
        let ∆ = min_{e∈P} (e's residual capacity in Gf)
        for all edges e of G whose corresponding forward edge is in P do
            increase fe by ∆
        for all edges e of G whose corresponding reverse edge is in P do
            decrease fe by ∆
For example, starting from the residual network of Figure 5, the Ford-Fulkerson algorithm will augment the flow by 2 units along the path s → w → v → t. This augmentation produces the maximum flow of Figure 1(b).
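For concreteness, here is a self-contained Python sketch of the Ford-Fulkerson algorithm (the cap[v][w] encoding is my own, and the residual network is represented implicitly: the residual capacity of (v, w) is the leftover room on the forward edge plus the undoable flow on the reverse edge):

```python
def ford_fulkerson(cap, s, t):
    """cap[v][w] = capacity of edge (v, w); integer capacities assumed.
    Returns the value of a maximum flow from s to t."""
    vertices = set(cap) | {w for v in cap for w in cap[v]}
    flow = {v: {w: 0 for w in vertices} for v in vertices}

    def residual(v, w):
        # room left on the forward edge, plus flow on (w, v) that can be undone
        return cap.get(v, {}).get(w, 0) - flow[v][w] + flow[w][v]

    def augmenting_path():
        # DFS for an s-t path with positive residual capacity on every edge.
        parent, stack = {s: None}, [s]
        while stack:
            v = stack.pop()
            for w in vertices:
                if w not in parent and residual(v, w) > 0:
                    parent[w] = v
                    stack.append(w)
        if t not in parent:
            return None
        path, w = [], t
        while parent[w] is not None:
            path.append((parent[w], w))
            w = parent[w]
        return path

    value = 0
    while (path := augmenting_path()) is not None:
        delta = min(residual(v, w) for v, w in path)
        for v, w in path:
            undo = min(delta, flow[w][v])  # cancel previously routed reverse flow first
            flow[w][v] -= undo
            flow[v][w] += delta - undo
        value += delta
    return value
```

On the running example (capacities 3, 2, 5, 2, 3) this returns 5, no matter which augmenting paths the search happens to pick.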
We now turn our attention to the correctness of the Ford-Fulkerson algorithm. We’ll
worry about optimizing the running time in future lectures.
5 If G already has two edges (v, w) and (w, v) that go in opposite directions between the same two vertices, then Gf will have two parallel edges going in either direction. This is not a problem for any of the algorithms that we discuss.
6 More generally, when we speak about "the residual graph," we usually mean after all edges with zero residual capacity have been removed.
7 Yes, it's the same Ford from the Bellman-Ford algorithm.
2.6 Termination
We claim that the Ford-Fulkerson algorithm eventually terminates with a feasible flow. This
follows from two invariants, both proved by induction on the number of iterations.
First, the algorithm maintains the invariant that {fe}e∈E is a flow. This is clearly true initially. The parameter ∆ is defined so that no flow value fe becomes negative or exceeds
the capacity ue. For the conservation constraints, consider a vertex v. If v is not on the
augmenting path P in Gf , then the flow into and out of v remain the same. If v is on P,
with edges (x, v) and (v, w) belonging to P, then there are four cases, depending on whether
or not (x, v) and (v, w) correspond to forward or reverse edges. For example, if both are
forward edges, then the flow augmentation increases both the flow into and the flow out of v by ∆. If both are reverse edges, then both the flow into and the flow out of v
decrease by ∆. In all four cases, the flow in and flow out change by the same amount, so
conservation constraints are preserved.
Second, the Ford-Fulkerson algorithm maintains the property that every flow amount fe
is an integer. (Recall we are assuming that every edge capacity ue is an integer.) Inductively,
all residual capacities are integral, so the parameter ∆ is integral, so the flow stays integral.
Every iteration of the Ford-Fulkerson algorithm increases the value of the current flow by
the current value of ∆. The second invariant implies that ∆ ≥ 1 in every iteration of the
Ford-Fulkerson algorithm. Since only a finite amount of flow can escape the source vertex,
the Ford-Fulkerson algorithm eventually halts. By the first invariant, it halts with a feasible
flow.8
Of course, all of this applies equally well to the naive greedy algorithm of Section 2.3. How do we know whether or not the Ford-Fulkerson algorithm can also terminate with a non-maximum flow? The hope is that because the Ford-Fulkerson algorithm has more paths eligible for augmentation, it progresses further before halting. But is it guaranteed to compute a maximum flow?
2.7 Optimality Conditions
Answering the following question will be a major theme of the first half of CS261, culminating
with our study of linear programming duality.
HOW DO WE KNOW WHEN WE’RE DONE?
For example, given a flow, how do we know if it’s a maximum flow? Any correct maximum
flow algorithm must answer this question, explicitly or implicitly. If I handed you an allegedly
maximum flow, how could I convince you that I’m not lying? It’s easy to convince someone
that a flow is not maximum, just by exhibiting a flow with higher value.
8 The Ford-Fulkerson algorithm continues to terminate if edges' capacities are rational numbers, not necessarily integers. (Proof: scaling all capacities by a common number doesn't change the problem, so we can clear denominators to reduce the rational capacity case to the integral capacity case.) It is a bizarre mathematical curiosity that the Ford-Fulkerson algorithm need not terminate when edges' capacities are irrational.
Returning to our original example (Figure 1), answering this question didn’t seem like a
big deal. We exhibited a flow of value 5, and because the total capacity escaping s is only 5, it's clear that there can't be any flow with higher value. But what about the network in
Figure 6(a)? The flow shown in Figure 6(b) has value only 3. Could it really be a maximum
flow?
Figure 6: (a) A given network and (b) the alleged maximum flow of value 3.
We’ll tackle several fundamental computational problems by following a two-step paradigm.
Two-Step Paradigm
1. Identify "optimality conditions" for the problem. These are sufficient conditions for a feasible solution to be an optimal solution. This step is structural, and not necessarily algorithmic. The optimality conditions vary with the problem, but they are often quite intuitive.

2. Design an algorithm that terminates with the optimality conditions satisfied. Such an algorithm is necessarily correct.
This paradigm is a guide for proving algorithms correct. Correctness proofs didn’t get too
much airtime in CS161, because almost all of them are straightforward inductions — think
of MergeSort, or Dijkstra’s algorithm, or any dynamic programming algorithm. The harder
problems studied in CS261 demand a more sophisticated and principled approach (with which
you’ll get plenty of practice).
So how would we apply this two-step paradigm to the maximum flow problem? Consider
the following claim.
Claim 2.1 (Optimality Conditions for Maximum Flow) If f is a flow in G such that
the residual network Gf has no s-t path, then f is a maximum flow.
This claim implements the first step of the paradigm. The Ford-Fulkerson algorithm, which can only terminate with this optimality condition satisfied, already provides a solution to the second step. We conclude:

Corollary 2.2 The Ford-Fulkerson algorithm is guaranteed to terminate with a maximum flow.
Next lecture we’ll prove (a generalization of) the claim, derive the famous maximum-
flow/minimum-cut problem, and design faster maximum flow algorithms.
CS261: A Second Course in Algorithms
Lecture #2: Augmenting Path Algorithms for
Maximum Flow∗
Tim Roughgarden†
January 7, 2016
1 Recap
Figure 1: (a) original edge capacity and flow and (b) resultant edges in residual network.
Recall where we left off last lecture. We're considering a directed graph G = (V, E) with a source s, sink t, and an integer capacity ue for each edge e ∈ E. A flow is a nonnegative vector {fe}e∈E that satisfies capacity constraints (fe ≤ ue for all e) and conservation constraints (flow in = flow out, except at s and t).

Recall that given a flow f in a graph G, the corresponding residual network has two edges for each edge e of G, a forward edge with residual capacity ue − fe and a reverse edge with residual capacity fe that allows us to "undo" previously routed flow. See also Figure 1.1
The Ford-Fulkerson algorithm repeatedly finds an s-t path P in the current residual graph Gf, and augments along P as much as possible subject to the capacity constraints of the residual network.2 We argued that the algorithm eventually terminates with a feasible flow. But is it a maximum flow? More generally, a major course theme is to understand

How do we know when we're done?

1 We usually implicitly assume that all edges with zero residual capacity are omitted from the residual network.
For example, could the maximum flow value in the network in Figure 2 really just be 3?
Figure 2: (a) A given network and (b) the alleged maximum flow of value 3.
2 Around the Maximum-Flow/Minimum-Cut Theorem
We ended last lecture with a claim that if there is no s-t path (with positive residual capacity on every edge) in the residual graph Gf, then f is a maximum flow in G. It's convenient to prove a stronger statement, from which we can also derive the famous maximum-flow/minimum-cut theorem.
2.1 (s, t)-Cuts
To state the stronger result, we need an important definition, of objects that are “dual” to
flows in a sense we’ll make precise later.
Definition 2.1 (s-t Cut) An (s, t)-cut of a graph G = (V, E) is a partition of V into sets
A, B with s ∈ A and t ∈ B.
Sometimes we’ll simply say “cut” instead of “(s, t)-cut.”
Figure 3 depicts a good (if cartoonish) way to think about an (s, t)-cut of a graph. Such a cut buckets the edges of the graph into four categories: those with both endpoints in A, those with both endpoints in B, those sticking out of A (with tail in A and head in B), and those sticking into A (with head in A and tail in B).
2 To be precise, the algorithm finds an s-t path in Gf such that every edge has strictly positive residual capacity. Unless otherwise noted, in this lecture by "Gf" we mean the edges with positive residual capacity.
Figure 3: cartoonish visualization of cuts. The squiggly line splits the vertices into two sets
A and B and edges in the graph into 4 categories.
The capacity of an (s, t)-cut (A, B) is defined as

    ∑_{e∈δ+(A)} ue,

where δ+(A) denotes the set of edges sticking out of A. (Similarly, we later use δ−(A) to denote the set of edges sticking into A.)
Note that edges sticking in to the source-side of an (s, t)-cut do not contribute to its capacity. For example, in Figure 2, the cut {s, w}, {v, t} has capacity 3 (with three outgoing edges, each with capacity 1). Different cuts have different capacities. For example, the cut {s}, {v, w, t} in Figure 2 has capacity 101. A minimum cut is one with the smallest capacity.
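Computing the capacity of a given cut is a one-liner. Here is a Python sketch; the dict-of-dicts encoding and the reconstruction of Figure 2's edge set (capacities chosen to match the cut values 101 and 3 quoted above) are assumptions on my part:

```python
def cut_capacity(cap, A):
    """Capacity of the (s, t)-cut (A, B): total capacity of the edges
    sticking out of A. Edges sticking into A contribute nothing."""
    return sum(u for v in cap for w, u in cap[v].items()
               if v in A and w not in A)
```

On the reconstructed Figure 2 network, the cut {s} has capacity 101 and the cut {s, w} has capacity 3, matching the text.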
2.2 Optimality Conditions for the Maximum Flow Problem
We next prove the following basic result.
Theorem 2.2 (Optimality Conditions for Max Flow) Let f be a flow in a graph G.
The following are equivalent:3
(1) f is a maximum flow of G;
(2) there is an (s, t)-cut (A, B) such that the value of f equals the capacity of (A, B);
(3) there is no s-t path (with positive residual capacity) in the residual network Gf .
Theorem 2.2 asserts that any one of the three statements implies the other two. The
special case that (3) implies (1) recovers the claim from the end of last lecture.
3 Meaning, either all three statements hold, or none of the three statements hold.
Corollary 2.3 If f is a flow in G such that the residual network Gf has no s-t path, then f is a maximum flow.
Recall that Corollary 2.3 implies the correctness of the Ford-Fulkerson algorithm, and more
generally of any algorithm that terminates with a flow and a residual network with no s-t
path.
Proof of Theorem 2.2: We prove a cycle of implications: (2) implies (1), (1) implies (3), and
(3) implies (2). It follows that any one of the statements implies the other two.
Step 1: (2) implies (1): We claim that, for every flow f and every (s, t)-cut (A, B),
value of f ≤ capacity of (A, B).
This claim implies that all flow values are at most all cut values; for a cartoon of this, see
Figure 4. The claim implies that there is no "x" strictly to the right of the "o".
Figure 4: cartoon illustrating that no flow value (x) is greater than a cut value (o).
To see why the claim yields the desired implication, suppose that (2) holds. This corre-
sponds to an “x” and “o” that are co-located in Figure 4. By the claim, no “x”s can appear
to the right of this point. Thus no flow has larger value than f, as desired.
We now prove the claim. If it seems intuitively obvious, then great, your intuition is
spot-on. For completeness, we provide a brief algebraic proof.
Fix f and (A, B). By definition,

    value of f = ∑_{e∈δ+(s)} fe = ∑_{e∈δ+(s)} fe − ∑_{e∈δ−(s)} fe;        (1)

the second equation is stated for convenience, and follows from our standing assumption that s has no incoming edges (the subtracted sum is vacuous). Recall that the conservation constraints state that

    ∑_{e∈δ+(v)} fe − ∑_{e∈δ−(v)} fe = 0        (2)

(flow out of v minus flow into v) for every v ≠ s, t. Adding the equations (2) corresponding to all of the vertices of A \ {s} to equation (1) gives

    value of f = ∑_{v∈A} ( ∑_{e∈δ+(v)} fe − ∑_{e∈δ−(v)} fe ).        (3)
Next we want to think about the expression in (3) from an edge-centric, rather than vertex-centric, perspective. How much does an edge e contribute to (3)? The answer depends on which of the four buckets e falls into (Figure 3). If both of e's endpoints are in B, then e is not involved in the sum (3) at all. If e = (v, w) with both endpoints in A, then it contributes fe once (in the subexpression ∑_{e∈δ+(v)} fe) and −fe once (in the subexpression ∑_{e∈δ−(w)} fe). Thus edges inside A contribute net zero to (3). Similarly, an edge e sticking out of A contributes fe, while an edge sticking into A contributes −fe. Summarizing, we have

    value of f = ∑_{e∈δ+(A)} fe − ∑_{e∈δ−(A)} fe.
This equation states that the net flow (flow forward minus flow backward) across every cut
is exactly the same, namely the value of the flow f.
Finally, using the capacity constraints (fe ≤ ue) and the fact that all flow values are nonnegative, we have

    value of f = ∑_{e∈δ+(A)} fe − ∑_{e∈δ−(A)} fe
               ≤ ∑_{e∈δ+(A)} ue                        (4)
               = capacity of (A, B),                    (5)
which completes the proof of the first implication.
Step 2: (1) implies (3): This step is easy. We prove the contrapositive. Suppose f is a
flow such that Gf has an s-t path P with positive residual capacity. As in the Ford-Fulkerson
algorithm, we augment along P to produce a new flow f′ with strictly larger value. This
shows that f is not a maximum flow.
Step 3: (3) implies (2): The final step is short and sweet. The trick is to define
    A = {v ∈ V : there is an s-v path in Gf}.
Conceptually, start your favorite graph search subroutine (e.g., BFS or DFS) from s until
you get stuck; A is the set of vertices you get stuck at. (We’re running this graph search
only in our minds, for the purposes of the proof, and not in any actual algorithm.)
Note that (A, V − A) is an (s, t)-cut. Certainly s ∈ A, since s can reach itself in Gf. By assumption, Gf has no s-t path, so t ∉ A. This cut must look like the cartoon in Figure 5, with no edges (with positive residual capacity) sticking out of A. The reason is that if there were such an edge sticking out of A, then our graph search would not have gotten stuck at A, and A would be a bigger set.
Figure 5: Cartoon of the cut. Note that edges crossing the cut only go from B to A.
Let’s translate the picture in Figure 5, which concerns the residual network Gf , back to
the flow f in the original network G.
1. Every edge sticking out of A in G (i.e., in δ+(A)) is saturated (meaning fe = ue). For if fe < ue for some e ∈ δ+(A), then the residual network Gf would contain a forward version of e (with positive residual capacity), which would be an edge sticking out of A in Gf (contradicting Figure 5).

2. Every edge sticking into A in G (i.e., in δ−(A)) is zeroed out (meaning fe = 0). For if fe > 0 for some e ∈ δ−(A), then the residual network Gf would contain a reverse version of e (with positive residual capacity), which would be an edge sticking out of A in Gf (contradicting Figure 5).
These two points imply that the inequality (4) holds with equality, with
value of f = capacity of (A, V − A).
This completes the proof. □
We can immediately derive some interesting corollaries of Theorem 2.2. First is the
famous Max-Flow/Min-Cut Theorem.4
Corollary 2.4 (Max-Flow/Min-Cut Theorem) In every network,
maximum value of a flow = minimum capacity of an (s, t)-cut.
Proof: The first part of the proof of Theorem 2.2 implies that the maximum value of a flow
cannot exceed the minimum capacity of an (s, t)-cut. The third part of the proof implies
that there cannot be a gap between the maximum flow value and the minimum cut capacity.
□
Next is an algorithmic consequence: the minimum cut problem reduces to the maximum
flow problem.
Corollary 2.5 Given a maximum flow, a minimum cut can be computed in linear time.
4 This is the theorem that, long ago, seduced your instructor into a career in algorithms.
Proof: Use BFS or DFS to compute, in linear time, the set A from the third part of the
proof of Theorem 2.2. The proof shows that (A, V − A) is a minimum cut. □
In practice, minimum cuts are typically computed using a maximum flow algorithm and
this reduction.
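The reduction is easy to code: given the final flow, run breadth-first search from s over edges with positive residual capacity; the reachable set is the source side A of a minimum cut. A Python sketch (the dict-of-dicts encoding cap[v][w] / flow[v][w] is my own choice, not the notes'):

```python
from collections import deque

def min_cut_source_side(cap, flow, s):
    """Given a maximum flow, return A = vertices reachable from s in G_f;
    by the proof of Theorem 2.2, (A, V - A) is a minimum cut."""
    vertices = set(cap) | {w for v in cap for w in cap[v]}

    def residual(v, w):
        # forward room plus undoable reverse flow
        return (cap.get(v, {}).get(w, 0) - flow.get(v, {}).get(w, 0)
                + flow.get(w, {}).get(v, 0))

    A, queue = {s}, deque([s])
    while queue:
        v = queue.popleft()
        for w in vertices:
            if w not in A and residual(v, w) > 0:
                A.add(w)
                queue.append(w)
    return A
```

On Lecture #1's running example with its maximum flow of value 5, both edges out of s are saturated, so A = {s}, and the cut {s}, {v, w, t} (capacity 5) is minimum.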
2.3 Backstory
Ford and Fulkerson published the max-flow/min-cut theorem in 1955, while they were
working at the RAND Corporation (a military think tank created after World War II). Note
that this was in the depths of the Cold War between the (then) Soviet Union and the United
States. Ford and Fulkerson got the problem from Air Force researcher Theodore Harris and
retired Army general Frank Ross. Harris and Ross had been given, by the CIA, a map of the
rail network connecting the Soviet Union to Eastern Bloc countries like Poland, Czechoslo-
vakia, and Eastern Germany. Harris and Ross formed a graph, with vertices corresponding
to administrative districts and edge capacities corresponding to the rail capacity between
two districts. Using heuristics, Harris and Ross computed both a maximum flow and mini-
mum cut of the graph, noting that they had equal value. They were rather more interested
in the minimum cut problem (i.e., blowing up the least amount of train tracks to sever con-
nectivity) than the maximum flow problem! Ford and Fulkerson proved more generally that
in every network, the maximum flow value equals the minimum cut capacity. See [?] for
further details.
3 The Edmonds-Karp Algorithm: Shortest Augmenting Paths
3.1 The Algorithm
With a solid understanding of when and why maximum flow algorithms are correct, we
now focus on optimizing the running time. Exercise Set #1 asks you to show that the Ford-Fulkerson algorithm is not a polynomial-time algorithm. It is a "pseudopolynomial-time" algorithm, meaning that it runs in polynomial time provided all edge capacities are polynomially bounded integers. With big edge capacities, however, the algorithm can require a
very large number of iterations to complete. The problem is that the algorithm can keep
choosing a “bad path” over and over again. (Recall that when the current residual network
has multiple s-t paths, the Ford-Fulkerson algorithm chooses arbitrarily.) This motivates
choosing augmenting paths more intelligently. The Edmonds-Karp algorithm is the same as
the Ford-Fulkerson algorithm, except that it always chooses a shortest augmenting path of
the residual graph (i.e., with the fewest number of hops). Upon hearing “shortest paths”
you may immediately think of Dijkstra’s algorithm, but this is overkill here — breadth-first
search already computes (in linear time) a path with the fewest number of hops.
Edmonds-Karp Algorithm
initialize fe = 0 for all e ∈ E
repeat
    compute an s-t path P (with positive residual capacity) in the
        current residual graph Gf with the fewest number of edges
        // takes O(|E|) time using BFS
    if no such path then
        halt with current flow {fe}e∈E
    else
        let ∆ = min_{e∈P} (e’s residual capacity in Gf)
        // augment the flow f using the path P
        for all edges e of G whose corresponding forward edge is in P do
            increase fe by ∆
        for all edges e of G whose corresponding reverse edge is in P do
            decrease fe by ∆
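As a concrete illustration, the pseudocode above might be rendered in Python as follows. This is a sketch under assumed conventions — capacities are given as a dict mapping ordered vertex pairs to integers, and the name `edmonds_karp` and the data layout are hypothetical, not from the course materials.

```python
from collections import deque

def edmonds_karp(cap, s, t):
    """Max flow via shortest (fewest-hop) augmenting paths.
    cap: dict mapping (v, w) -> capacity; missing pairs have capacity 0."""
    resid = dict(cap)          # residual capacities, initially the capacities
    adj = {}
    for (v, w) in cap:
        adj.setdefault(v, set()).add(w)
        adj.setdefault(w, set()).add(v)   # reverse edges can appear in G_f
        resid.setdefault((w, v), 0)
    flow = 0
    while True:
        # BFS for a fewest-hop s-t path with positive residual capacity.
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            v = queue.popleft()
            for w in adj.get(v, ()):
                if w not in parent and resid[(v, w)] > 0:
                    parent[w] = v
                    queue.append(w)
        if t not in parent:
            return flow        # no augmenting path: current flow is maximum
        # Recover the path and its bottleneck residual capacity Delta.
        path, w = [], t
        while parent[w] is not None:
            path.append((parent[w], w))
            w = parent[w]
        delta = min(resid[e] for e in path)
        for (v, w) in path:
            resid[(v, w)] -= delta   # forward residual capacity shrinks
            resid[(w, v)] += delta   # reverse residual capacity grows
        flow += delta
```

On the running example from Lecture #1 (capacities 3 and 2 out of s, 2 and 3 into t, and a capacity-5 edge from v to w), this returns the maximum flow value 5.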
3.2 The Analysis
As a specialization of the Ford-Fulkerson algorithm, the Edmonds-Karp algorithm inherits
its correctness. What about the running time?
Theorem 3.1 (Running Time of Edmonds-Karp [?]) The Edmonds-Karp algorithm runs
in O(m²n) time.⁵
Recall that m typically varies between ≈ n (the sparse case) and ≈ n² (the dense case),
so the running time in Theorem 3.1 is between n³ and n⁵. This is quite slow, but at least
the running time is polynomial, no matter how big the edge capacities are. See below and
Problem Set #1 for some faster algorithms.⁶ Why study Edmonds-Karp, when we’re just
going to learn faster algorithms later? Because it provides a gentle introduction to some
fundamental ideas in the analysis of maximum flow algorithms.
Lemma 3.2 (EK Progress Lemma) Fix a network G. For a flow f, let d(f) denote the
number of hops in a shortest s-t path (with positive residual capacity) in Gf, or +∞ if no
such paths exist.
(a) d(f) never decreases during the execution of the Edmonds-Karp algorithm.
(b) d(f) increases at least once per m iterations.
⁵ In this course, m always denotes the number |E| of edges, and n the number |V | of vertices.
⁶ Many different methods yield running times in the O(mn) range, and state-of-the-art algorithms are still
a bit faster. It’s an open question whether or not there is a near-linear maximum flow algorithm.
Since d(f) ∈ {0, 1, 2, . . . , n − 2, n − 1, +∞}, once d(f) ≥ n we know that d(f) = +∞ and s
and t are disconnected in Gf.⁷ Thus, Lemma 3.2 implies that the Edmonds-Karp algorithm
terminates after at most mn iterations. Since each iteration just involves a breadth-first-search
computation, we get the running time of O(m²n) promised in Theorem 3.1.
For the analysis, imagine running breadth-first search (BFS) in Gf starting from the
source s. Recall that BFS discovers vertices in “layers,” with s in the 0th layer, and layer
i + 1 consisting of those vertices not in a previous layer and reachable in one hop from a
vertex in the ith layer. We can then classify the edges of Gf as forward (meaning going from
layer i to layer i + 1, for some i), sideways (meaning both endpoints are in the same layer),
and backwards (traveling from a layer i to some layer j with j < i). By the definition of
breadth-first search, no forward edge of Gf can shortcut over a layer; every forward edge
goes only to the next layer.
We define Lf, with the L standing for “layered,” as the subgraph of Gf consisting only
of the forward edges (Figure 6). (Vertices in layers after the one containing t are irrelevant,
so they can be discarded if desired.)
Figure 6: Layered subgraph Lf
Why bother defining Lf? Because it is a succinct encoding of all of the shortest s-t paths
of Gf — the paths on which the Edmonds-Karp algorithm might augment. Formally, every
s-t path in Lf comprises only forward edges of the BFS and hence has exactly d(f) hops, the
minimum possible. Conversely, an s-t path that is in Gf but not Lf must contain at least
one detour (a sideways or backward edge) and hence requires at least d(f) + 1 hops to get
to t.
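The construction of the layered network takes a single breadth-first search. The following sketch (hypothetical names; residual capacities again given as a dict) computes the BFS layers and keeps exactly the forward edges, i.e., those going from some layer i to layer i + 1.

```python
from collections import deque

def layered_network(resid, s):
    """Return (layer, forward): BFS layers from s in the residual graph,
    and the forward edges (layer i -> layer i+1) that make up L_f."""
    out = {}
    for (v, w), c in resid.items():
        if c > 0:                       # only edges with positive residual cap
            out.setdefault(v, []).append(w)
    layer = {s: 0}
    queue = deque([s])
    while queue:
        v = queue.popleft()
        for w in out.get(v, ()):
            if w not in layer:          # first discovery fixes w's layer
                layer[w] = layer[v] + 1
                queue.append(w)
    forward = [(v, w) for (v, w), c in resid.items()
               if c > 0 and v in layer and w in layer
               and layer[w] == layer[v] + 1]
    return layer, forward
```

On the zero flow of the running example, the layers are {s}, {v, w}, {t}, so d(f) = layer[t] = 2, and the sideways edge (v, w) is correctly excluded from Lf.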
⁷ Any path with n or more edges has a repeated vertex, and deleting the corresponding cycle yields a path
with the same endpoints and fewer hops.
Figure 7: Example from first lecture. Initially, 0th layer is {s}, 1st layer is {v, w}, 2nd layer
is {t}.
Figure 8: Residual graph after sending flow on s → v → t. 0th layer is {s}, 1st layer is
{v, w}, 2nd layer is {t}.
Figure 9: Residual graph after sending additional flow on s → w → t. 0th layer is {s}, 1st
layer is {v}, 2nd layer is {w}, 3rd layer is {t}.
For example, let’s return to our first example in Lecture #1, shown in Figure 7. Let’s
watch how d(f) changes as we simulate the algorithm. Since we begin with the zero flow,
initially the residual graph Gf is the original graph G. The 0th layer is {s}, the first layer is
{v, w}, and the second layer is {t}. Thus d(f) = 2 initially. There are two shortest paths,
s → v → t and s → w → t. Suppose the Edmonds-Karp algorithm chooses to augment on
the upper path, sending two units of flow. The new residual graph is shown in Figure 8. The
layers remain the same: {s}, {v, w}, and {t}, with d(f) still equal to 2. There is only one
shortest path, s → w → t. The Edmonds-Karp algorithm sends two units along this path,
resulting in the new residual graph in Figure 9. Now, no two-hop paths remain: the first
layer contains only v, with w in the second layer and t in the third layer. Thus, d(f) has jumped
from 2 to 3. The unique shortest path is s → v → w → t, and after the Edmonds-Karp
algorithm pushes one unit of flow on this path it terminates with a maximum flow.
Proof of Lemma 3.2: We start with part (a) of the lemma. Note that the only thing
we’re worried about is that an augmentation somehow introduces a new, shorter path that
shortcuts over some layers of Lf (as defined above).
Suppose the Edmonds-Karp algorithm augments the current flow f by routing flow on
the path P. Because P is a shortest s-t path in Gf, it is also a path in the layered graph Lf.
The only new edges created by augmenting on P are edges that go in the reverse direction
of P. These are all backward edges, so any s-t path of Gf that uses such an edge has at least
d(f) + 2 hops. Thus, no new shorter paths are formed in Gf after the augmentation.
Now consider a run of t iterations of the Edmonds-Karp algorithm in which the value of
d(f) = c stays constant. We need to show that t ≤ m. Before the first of these iterations,
we save a copy of the current layered network: let F denote the edges of Lf at this time,
and V0 = {s}, V1, V2, . . . , Vc the vertices of the various layers.⁸
Consider the first of these t iterations. As in the proof of part (a), the only new edges
introduced go from some Vi to Vi−1. By assumption, after the augmentation, there is still
an s-t path in the new residual graph with only c hops. Since no edge of such a path can
shortcut over one of the layers V0, V1, . . . , Vc, it must consist only of edges in F. Inductively,
every one of these t iterations augments on a path consisting solely of edges in F. Each
such iteration zeroes out at least one edge e = (v, w) of F (the one with minimum residual
capacity), at which point edge e drops out of the current residual graph. The only way e
can reappear in the residual graph is if there is an augmentation in the reverse direction
(the direction (w, v)). But since (w, v) goes backward (from some Vi to Vi−1) and all of the
t iterations route flow only on edges of F (from some Vi to Vi+1), this can never happen.
Since F contains at most m edges, there can be only m iterations before d(f) increases (or
the algorithm terminates). ∎
4 Dinic’s Algorithm: Blocking Flows
The next algorithm bears a strong resemblance to the Edmonds-Karp algorithm, though it
was developed independently and contemporaneously by Dinic. Unlike the Edmonds-Karp
algorithm, Dinic’s algorithm enjoys a modularity that lends itself to optimized algorithms
with faster running times.
⁸ The residual and layered networks change during these iterations, but F and V0, . . . , Vc always refer to
the networks before the first of these iterations.
Dinic’s Algorithm
initialize fe = 0 for all e ∈ E
while there is an s-t path in the current residual network Gf do
    construct the layered network Lf from Gf using breadth-first search,
        as in the proof of Lemma 3.2
        // takes O(|E|) time
    compute a blocking flow g (Definition 4.1) in Lf
    // augment the flow f using the flow g
    for all edges (v, w) of G for which the corresponding forward edge
        of Gf carries flow (gvw > 0) do
        increase fe by ge
    for all edges (v, w) of G for which the corresponding reverse edge
        of Gf carries flow (gwv > 0) do
        decrease fe by ge
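A compact Python sketch of the algorithm follows. The blocking flow here is computed by the simple method of repeatedly saturating single forward paths of Lf (the O(m²)-per-phase approach); the names and the dict-based representation are hypothetical conventions, not code from the course.

```python
from collections import deque

def dinic(cap, s, t):
    """Max flow via blocking flows in layered networks.
    cap: dict mapping (v, w) -> capacity."""
    resid = dict(cap)
    for (v, w) in list(cap):
        resid.setdefault((w, v), 0)       # reverse residual edges
    out = {}
    for (v, w) in resid:
        out.setdefault(v, []).append(w)
        out.setdefault(w, [])
    flow = 0
    while True:
        # Phase: BFS layers of the current residual graph G_f.
        layer = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            for w in out[v]:
                if resid[(v, w)] > 0 and w not in layer:
                    layer[w] = layer[v] + 1
                    queue.append(w)
        if t not in layer:
            return flow                    # no s-t path in G_f: f is maximum
        # Blocking flow in L_f: saturate forward paths until every
        # s-t path of L_f contains a saturated edge.
        dead = set()                       # vertices with no path to t left

        def dfs(v, pushed):
            if v == t:
                return pushed
            for w in out[v]:
                if (w not in dead and resid[(v, w)] > 0
                        and layer.get(w) == layer[v] + 1):
                    d = dfs(w, min(pushed, resid[(v, w)]))
                    if d > 0:
                        resid[(v, w)] -= d
                        resid[(w, v)] += d
                        return d
            dead.add(v)                    # dead end: prune for this phase
            return 0

        while True:
            pushed = dfs(s, float('inf'))
            if pushed == 0:
                break
            flow += pushed
```

By Lemma 4.2, the outer loop runs at most n times, one phase per value of d(f).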
Dinic’s algorithm can only terminate with a residual network with no s-t path, that is, with a
maximum flow (by Corollary 2.3). While in the Edmonds-Karp algorithm we only formed the
layered network Lf in the analysis (in the proof of Lemma 3.2), Dinic’s algorithm explicitly
constructs this network in each iteration.
A blocking flow is, intuitively, a bunch of shortest augmenting paths that get processed
as a batch. Somewhat more formally, blocking flows are precisely the possible outputs of the
naive greedy algorithm discussed at the beginning of Lecture #1. Completely formally:
Definition 4.1 (Blocking Flow) A blocking flow g in a network G is a feasible flow such
that, for every s-t path P of G, some edge e of P is saturated by g (i.e., ge = ue).
That is, a blocking flow zeroes out an edge of every s-t path.
Figure 10: Example of blocking flow. This is not a maximum flow.
Recall from Lecture #1 that a blocking flow need not be a maximum flow; the blocking
flow in Figure 10 has value 3, while the maximum flow value is 5. While the blocking flow
in Figure 10 uses only one path, generally a blocking flow uses many paths. Indeed, every
flow that is maximum (equivalently, no s-t paths in the residual network) is also a blocking
flow (equivalently, no s-t paths in the residual network comprising only forward edges).
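The parenthetical characterization suggests a direct test: a feasible flow is blocking exactly when t is unreachable from s using only unsaturated edges of G. A small sketch (hypothetical names, dict-based representation):

```python
from collections import deque

def is_blocking(cap, flow, s, t):
    """A feasible flow is blocking iff t is unreachable from s along
    edges of G that the flow leaves unsaturated (flow[e] < cap[e])."""
    out = {}
    for (v, w), c in cap.items():
        if flow.get((v, w), 0) < c:       # keep only unsaturated edges
            out.setdefault(v, []).append(w)
    seen = {s}
    queue = deque([s])
    while queue:
        v = queue.popleft()
        for w in out.get(v, ()):
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return t not in seen
```

For the flow in Figure 10 (3 units on the single path s → v → w → t), the unsaturated edges are (s, w), (v, w), and (v, t), which reach only {s, w, v} — so the flow is blocking even though its value 3 is not maximum.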
The running time analysis of Dinic’s algorithm is anchored by the following progress
lemma.
Lemma 4.2 (Dinic Progress Lemma) Fix a network G. For a flow f, let d(f) denote
the number of hops in a shortest s-t path (with positive residual capacity) in Gf, or +∞ if
no such paths exist. If h is obtained from f by augmenting f by a blocking flow g in Gf,
then d(h) > d(f).
That is, every iteration of Dinic’s algorithm strictly increases the s-t distance in the current
residual graph.
We leave the proof of Lemma 4.2 as Exercise #5; the proof uses the same ideas as that
of Lemma 3.2. For an example, observe that after augmenting our running example by the
blocking flow in Figure 10, we obtain the residual network in Figure 11. We had d(f) = 2
initially, and d(f) = 3 after the augmentation.
Figure 11: Residual network after augmenting by the blocking flow in Figure 10. d(f) = 3
in this residual graph.
Since d(f) can only go up to n − 1 before becoming infinite (i.e., disconnecting s and t
in Gf), Lemma 4.2 immediately implies that Dinic’s algorithm terminates after at most n
iterations. In this sense, the maximum flow problem reduces to n instances of the blocking
flow problem (in layered networks). The running time of Dinic’s algorithm is O(n · BF),
where BF denotes the running time required to compute a blocking flow in a layered network.
The Edmonds-Karp algorithm and its proof effectively show how to compute a blocking
flow in O(m²) time, by repeatedly sending as much flow as possible on a single path of Lf
with positive residual capacity. On Problem Set #1 you’ll see an algorithm, based on depth-first
search, that computes a blocking flow in time O(mn). With this subroutine, Dinic’s
algorithm runs in O(n²m) time, improving over the Edmonds-Karp algorithm. (Remember,
it’s always a win to replace an m with an n.)
Using fancy data structures, it’s known how to compute a blocking flow in near-linear
time (with just one extra logarithmic factor), yielding a maximum flow algorithm with running
time close to O(mn). This running time is no longer so embarrassing, and resembles
time bounds that you saw in CS161, for example for the Bellman-Ford shortest-path algorithm
and for various all-pairs shortest-path algorithms.
5 Looking Ahead
Thus far, we have focused on “augmenting path” maximum flow algorithms. Properly implemented,
such algorithms are reasonably practical. Our motivation here is pedagogical: these
algorithms remain the best way to develop your initial intuition about the maximum flow
problem.
Next lecture introduces a different paradigm for computing maximum flows, known as
the “push-relabel” framework. Such algorithms are reasonably simple, but somewhat less
intuitive than augmenting path algorithms. Properly implemented, they are blazingly fast
and are often the method of choice for solving the maximum flow problem in practice.
CS261: A Second Course in Algorithms
Lecture #3: The Push-Relabel Algorithm for Maximum
Flow∗
Tim Roughgarden†
January 12, 2016
1 Motivation
The maximum flow algorithms that we’ve studied so far are augmenting path algorithms,
meaning that they maintain a flow and augment it each iteration to increase its value. In
Lecture #1 we studied the Ford-Fulkerson algorithm, which augments along an arbitrary
s-t path of the residual network, and only runs in pseudopolynomial time. In Lecture #2
we studied the Edmonds-Karp specialization of the Ford-Fulkerson algorithm, where in each
iteration a shortest s-t path in the residual network is chosen for augmentation. We proved
a running time bound of O(m²n) for this algorithm (as always, m = |E| and n = |V |).
Lecture #2 and Problem Set #1 discuss Dinic’s algorithm, where each iteration augments
the current flow by a blocking flow in a layered subgraph of the residual network. On Problem
Set #1 you will prove a running time bound of O(n²m) for this algorithm.
In the mid-1980s, a new approach to the maximum flow problem was developed. It is
known as the “push-relabel” paradigm. To this day, push-relabel algorithms are often the
method of choice in practice (even if they’ve never quite been the champion for the best
worst-case asymptotic running time).
To motivate the push-relabel approach, consider the network in Figure 1, where k is a
large number (like 100,000). Observe that the maximum flow value is k. The Ford-Fulkerson and
Edmonds-Karp algorithms run in Ω(k²) time in this network. Moreover, much of the work
feels wasted: in each iteration, the long path of high-capacity edges has to be re-explored, even
though it hasn’t changed from the previous iteration. In this network, we’d rather route k
units of flow from s to x (in O(k) time), and then distribute this flow across the k paths from
x to t (in O(k) time, linear-time overall). This is the idea behind push-relabel algorithms.¹
Of course, if there were fewer than k paths from x to t, then not all of the k units of
flow can be routed from x to t, and the remainder must be sent back to the source. What is a
principled way to organize such a procedure in an arbitrary network?
Figure 1: The edge {s, x} has a large capacity k, and there are k paths from x to t via
k different vertices vi for 1 ≤ i ≤ k (3 are drawn for illustrative purposes). Both Ford-Fulkerson
and Edmonds-Karp take Ω(k²) time, but ideally we only need O(k) time if we can
somehow push k units of flow from s to x in one step.
2 Preliminaries
The first order of business is to relax the conservation constraints. For example, in Figure 1,
if we’ve routed k units of flow to x but not yet distributed them over the paths to t, then the
vertex x has k units of flow incoming and zero units outgoing.
Definition 2.1 (Preflow) A preflow is a nonnegative vector {fe}e∈E that satisfies two constraints:
Capacity constraints: fe ≤ ue for every edge e ∈ E;
Relaxed conservation constraints: for every vertex v other than s,
amount of flow entering v ≥ amount of flow exiting v.
The left-hand side is the sum of the fe’s over the edges incoming to v; likewise with the
outgoing edges for the right-hand side.
¹ The push-relabel framework is not the unique way to address this issue. For example, fancy data
structures (“dynamic trees” and their ilk) can be used to remember the work performed by previous searches
and obtain faster running times.
The definition of a preflow is exactly the same as a flow (Lecture #1), except that the
conservation constraints have been relaxed so that the amount of flow into a vertex is allowed
to exceed the amount of flow out of the vertex.
We define the residual graph Gf with respect to a preflow f exactly as we did for the
case of a flow f. That is, for an edge e that carries flow fe and has capacity ue, Gf includes a
forward version of e with residual capacity ue − fe and a reverse version of e with residual
capacity fe. Edges with zero residual capacity are omitted from Gf.
Push-relabel algorithms work with preflows throughout their execution, but at the end
of the day they need to terminate with an actual flow. This motivates a measure of the
“degree of violation” of the conservation constraints.
Definition 2.2 (Excess) For a preflow f and a vertex v ≠ s, t of a network, the excess αf(v)
is
amount of flow entering v − amount of flow exiting v.
For a preflow f, all excesses are nonnegative. A preflow is a flow if and only if the excess
of every vertex v ≠ s, t is zero. Thus transforming a preflow to recover feasibility involves
reducing and eventually eliminating all excesses.
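The two definitions above translate directly into code. The following sketch (hypothetical names, edges as a dict of flow values) computes excesses and checks the preflow conditions:

```python
def excess(f, v):
    """alpha_f(v): amount of flow entering v minus amount exiting v."""
    return (sum(x for (a, b), x in f.items() if b == v)
            - sum(x for (a, b), x in f.items() if a == v))

def is_preflow(f, cap, s):
    """Check the capacity constraints and the relaxed conservation
    constraints (nonnegative excess at every vertex other than s)."""
    vertices = {x for e in cap for x in e}
    return (all(0 <= f.get(e, 0) <= cap[e] for e in cap)
            and all(excess(f, v) >= 0 for v in vertices if v != s))
```

For instance, saturating a single edge (s, x) of capacity 5 yields a valid preflow in which x has excess 5; routing a unit out of x with nothing coming in would violate relaxed conservation.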
3 The Push Subroutine
How do we augment a preflow? When we were restricting attention to flows only, our hands
were tied — to maintain the conservation constraints, we could only augment along an s-t path
(or, for a blocking flow, a collection of such paths). With the relaxed conservation constraints,
we have much more flexibility. All we need to do is augment along a single edge at a
time, routing flow from one of its endpoints to the other.
Push(v)
    choose an outgoing edge (v, w) of v in Gf (if any)
    // push as much flow as possible
    let ∆ = min{αf(v), residual capacity of (v, w)}
    push ∆ units of flow along (v, w)
The point of the second step is to send as much flow as possible from v to w using the edge
(v, w) of Gf , subject to the two constraints that define a preflow. There are two possible
bottlenecks. One is the residual capacity of the edge (v, w) (as dictated by nonnegativ-
ity/capacity constraints); if this binds, then the push is called saturating. The other is the
amount of excess at the vertex v (as dictated by the relaxed conservation constraints); if
this binds, the push is non-saturating. In the final step, the preflow is updated as in our
augmenting path algorithms: if (v, w) is the forward version of edge e = (v, w) in G, then fe
is increased by ∆; if (v, w) is the reverse version of edge e = (w, v) in G, then fe is decreased
by ∆. As always, the residual network is then updated accordingly. Note that after pushing
flow from v to w, w has positive excess (if it didn’t already).
4 Heights and Invariants
Just pushing flow around the residual network is not enough to obtain a correct maximum
flow algorithm. One worry is illustrated by the graph in Figure 2 — after initially pushing
one unit of flow from s to v, how do we avoid just pushing the excess around the cycle v →
w → x → y → v forevermore? Obviously we want to push the excess to t when it gets to x,
but how can we be systematic about it?
Figure 2: When we push flows in the above graph, how do we ensure that we do not push
flows in the cycle v → w → x → y → v?
The next key idea will ensure termination of our algorithm, and will also imply correctness
at termination. The idea is to maintain a height h(v) for each vertex v of G. Heights
will always be nonnegative integers. You are encouraged to visualize a network in 3D, with
the height of a vertex giving its z-coordinate, and with edges going “uphill” and “downhill,”
or possibly staying flat. The plan for the algorithm is to always maintain three invariants
(two trivial and one non-trivial):
Invariants
1. h(s) = n at all times (where n = |V |);
2. h(t) = 0;
3. for every edge (v, w) of the current residual network (with positive residual
capacity), h(v) ≤ h(w) + 1.
Visually, the third invariant says that edges of the residual network are only allowed to go downhill
gradually (by one per hop). For example, if a vertex v has three outgoing edges (v, w1),
(v, w2), and (v, w3), with h(w1) = 3, h(w2) = 4, and h(w3) = 6, then the third invariant
requires that h(v) be 4 or less (Figure 3). Note that edges are allowed to go uphill, stay flat,
or go downhill (gradually).
Figure 3: Given that h(w1) = 3, h(w2) = 4, h(w3) = 6, it must be that h(v) ≤ 4.
Where did these invariants come from? For one motivation, recall from Lecture #2 our
optimality conditions for the maximum flow problem: a flow is maximum if and only if there
is no s-t path (with positive residual capacity) in its residual graph. So clearly we want this
property at termination. The new idea is to satisfy the optimality conditions at all times,
and this is what the invariants guarantee. Indeed, since the invariants imply that s is at
height n, t is at height 0, and each edge of the residual graph only goes downhill by at
most 1, there can be no s-t path with at most n − 1 edges (and hence no s-t path at all).
It follows that if we find a preflow that is feasible (i.e., is actually a flow, with no excesses)
and the invariants hold (for suitable heights), then the flow must be a maximum flow.
It is illuminating to compare and contrast the high-level strategies of augmenting path
algorithms and of push-relabel algorithms.
Augmenting Path Strategy
Invariant: maintain a feasible flow.
Work toward: disconnecting s and t in the current residual network.
Push-Relabel Strategy
Invariant: maintain that s, t disconnected in the current residual network.
Work toward: feasibility (i.e., conservation constraints).
While there is a clear symmetry between the two approaches, most people find it less intuitive
to relax feasibility and only restore it at the end of the algorithm. This is probably why the
push-relabel framework only came along in the 1980s, while the augmenting path algorithms
we studied date from the 1950s-1970s. The idea of relaxing feasibility is useful for many
different problems.
In both cases, algorithm design is guided by an explicitly articulated strategy for guar-
anteeing correctness. The maximum flow problem, while polynomial-time solvable (as we
know), is complex enough that solutions require significant discipline. Contrast this with,
for example, the minimum spanning tree algorithms, where it’s easy to come up with cor-
rect algorithms (like Kruskal or Prim) without any advance understanding of why they are
correct.
5 The Algorithm
The high-level strategy of the algorithm is to maintain the three invariants above while trying
to zero out any remaining excesses. Let’s begin with the initialization. Since the invariants
reference both the current preflow and the current vertex heights, we need to initialize both. Let’s
start with the heights. Clearly we set h(s) = n and h(t) = 0. The first non-trivial decision
is to set h(v) = 0 also for all v ≠ s, t. Moving on to the initial preflow, the obvious idea
is to start with the zero flow. But this violates the third invariant: edges going out of s
would travel from height n to 0, while edges of the residual graph are supposed to only go
downhill by 1. With the current choice of height function, no edges out of s can appear
(with non-zero capacity) in the residual network. So the obvious fix is to initially saturate
all such edges.
Initialization
    set h(s) = n
    set h(v) = 0 for all v ≠ s
    set fe = ue for all edges e outgoing from s
    set fe = 0 for all other edges
All three invariants hold after the initialization (the only possible violation is the edges out
of s, which don’t appear in the initial residual network). Also, f is initialized to a preflow
(with flow in ≥ flow out except at s).
Next, we restrict the Push operation from Section 3 so that it maintains the invari-
ants. The restriction is that flow is only allowed to be pushed downhill in the residual
network.
Push(v) [revised]
    choose an outgoing edge (v, w) of v in Gf with h(v) = h(w) + 1 (if any)
    // push as much flow as possible
    let ∆ = min{αf(v), residual capacity of (v, w)}
    push ∆ units of flow along (v, w)
Here’s the main loop of the push-relabel algorithm:
Main Loop
    while there is a vertex v ≠ s, t with αf(v) > 0 do
        choose such a vertex v with the maximum height h(v)
            // break ties arbitrarily
        if there is an outgoing edge (v, w) of v in Gf with h(v) = h(w) + 1 then
            Push(v)
        else
            increment h(v)  // called a ‘‘relabel’’
Every iteration, among all vertices that have positive excess, the algorithm processes the
highest one. When such a vertex v is chosen, there may or may not be a downhill edge
emanating from v (see Figure 4(a) vs. Figure 4(b)). Push(v) is only invoked if there is
such an edge (in which case Push will push flow on it), otherwise the vertex is “relabeled,”
meaning its height is increased by one.
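Putting the initialization, the revised Push, and the main loop together gives the following Python sketch. It is an illustrative implementation under assumed conventions (dict of capacities, hypothetical names); the highest active vertex is found by a simple scan rather than the O(1)-time bookkeeping a serious implementation would use.

```python
def push_relabel(cap, s, t):
    """Max flow via push-relabel with the highest-vertex selection rule."""
    vertices = {x for e in cap for x in e}
    n = len(vertices)
    resid = dict(cap)
    for (v, w) in list(cap):
        resid.setdefault((w, v), 0)          # reverse residual edges
    height = {v: 0 for v in vertices}
    height[s] = n                            # invariant 1: h(s) = n
    excess = {v: 0 for v in vertices}
    # Initialization: saturate every edge out of s.
    for (v, w) in cap:
        if v == s:
            delta = resid[(s, w)]
            resid[(s, w)] = 0
            resid[(w, s)] += delta
            excess[w] += delta
            excess[s] -= delta
    while True:
        # Highest vertex (other than s, t) with positive excess.
        active = [v for v in vertices if v not in (s, t) and excess[v] > 0]
        if not active:
            break                            # preflow is now a (maximum) flow
        v = max(active, key=lambda u: height[u])
        # Push along a downhill residual edge if one exists, else relabel.
        for (a, w) in resid:
            if a == v and resid[(v, w)] > 0 and height[v] == height[w] + 1:
                delta = min(excess[v], resid[(v, w)])
                resid[(v, w)] -= delta
                resid[(w, v)] += delta
                excess[v] -= delta
                excess[w] += delta
                break
        else:
            height[v] += 1                   # relabel
    return excess[t]                         # value of the final flow
```

On the four-vertex example of Section 6 (where the maximum flow value is 3), the returned value is 3, matching the hand simulation.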
Figure 4: (a) v → w1 is a downhill edge (4 to 3) (b) there are no downhill edges
Lemma 5.1 (Invariants Are Maintained) The three invariants are maintained through-
out the execution of the algorithm.
Proof: Neither s nor t is ever relabeled, so the first two invariants are always satisfied. For
the third invariant, we consider separately a relabel (which changes the height function but
not the preflow) and a push (which changes the preflow but not the height function). The
only worry with a relabel at v is that, afterwards, some outgoing edge of v in the residual
network goes downhill by more than one step. But the precondition for relabeling is that all
outgoing edges are either flat or uphill, so this never happens. The only worry with a push
from v to w is that it could introduce a new edge (w, v) to the residual network that might
go downhill by more than one step. But we only push flow downward, so a newly created
reverse edge can only go upward. ∎
The claim implies that if the push-relabel algorithm ever terminates, then it does so with
a maximum flow. The invariants imply the maximum flow optimality conditions (no s-t path
in the residual network), while the termination condition implies that the final preflow f is
in fact a feasible flow.
6 Example
Before proceeding to the running time analysis, let’s go through an example in detail to make
sure that the algorithm makes sense. The initial network is shown in Figure 5(a). After the
initialization (of both the height function and the preflow) we obtain the residual network
in Figure 5(b). (Edges are labeled with their residual capacities, vertices with both their
heights and their excesses.)²
Figure 5: (a) Example network (b) Network after initialization. For v and w, the pair (a, b)
denotes that the vertex has height a and excess b. Note that we ignore excess of s and t, so
s and t both only have a single number denoting height.
In the first iteration of the main loop, there are two vertices with positive excess (v and
w), both with height 0, and the algorithm can choose arbitrarily which one to process. Let’s
process v. Since v currently has height 0, it certainly doesn’t have any outgoing edges in the
residual network that go downhill. So, we relabel v, and its height increases to 1. In the second
iteration of the algorithm, there is no choice about which vertex to process: v is now the
unique highest vertex with excess, so it is chosen again. Now v does have downhill outgoing
² We looked at this network last lecture and determined that the maximum flow value is 3. So we should
be skeptical of the 100 units of flow currently on edge (s, w); it will have to return home to roost at some
point.
edges, namely (v, w) and (v, t). The algorithm is allowed to choose arbitrarily between such
edges. You’re probably rooting for the algorithm to push v’s excess straight to t, but to
keep things interesting let’s assume that the algorithm pushes it to w instead. This is a
non-saturating push, and the excess at v drops to zero. The excess at w increases from 100
to 101. The new residual network is shown in Figure 6.
Figure 6: Residual network after non-saturating push from v to w.
In the next iteration, w is the only vertex with positive excess so it is chosen for processing.
It has no outgoing downhill edges, so it gets relabeled (so now h(w) = 1). Now w does have a
downhill outgoing edge (w, t). The algorithm pushes one unit of flow on (w, t) — a saturating
push — and the excess at w goes back down to 100. In the next iteration, w still has excess but has
no downhill edges in the new residual network, so it gets relabeled. With its new height
of 2, in the next iteration the edges from w to v go downhill. After pushing two units of flow
from w to v — one on the original (w, v) edge and one on the reverse edge corresponding
to (v, w) — the excess at w drops to 98, and v now again has an excess (of 2). The new
residual network is shown in Figure 7.
Figure 7: Residual network after pushing two units of flow from w to v.
Of the two vertices with excess, w is higher. It again has no downhill edges, however,
so the algorithm relabels it three times in a row until it does. When its height reaches 5,
the reverse edge (w, s) goes downhill, and the algorithm pushes w’s entire excess back to s. Now v is
the only vertex remaining with excess. Its edge (v, t) goes downhill, and after pushing two
units of flow on it the algorithm halts with a maximum flow (with value 3).
7 The Analysis
7.1 Formal Statement and Discussion
Verifying that the push-relabel algorithm computes a maximum flow in one particular network
is all fine and good, but it’s not at all clear that it is correct (or even terminates) in
general. Happily, the following theorem holds.³
Theorem 7.1 The push-relabel algorithm terminates after O(n²) relabel operations and
O(n³) push operations.
The hidden constants in Theorem 7.1 are at most 2. Properly implemented, the push-relabel
algorithm has running time O(n³); we leave the details to Exercise Set #2. The one point
that requires some thought is to maintain suitable data structures so that a highest vertex
with excess can be identified in O(1) time.⁴ In practice, the algorithm tends to run in
sub-quadratic time.
³ A sharper analysis yields the better bound of O(n²√m); see Problem Set #1. Believe it or not, the
worst-case running time of the algorithm is in fact Ω(n²√m).
⁴ Or rather, O(1) “amortized” time, meaning in total time O(n³) over all of the O(n³) iterations.
The proof of Theorem 7.1 is more indirect than our running time analyses of augmenting
path algorithms. For the latter algorithms, there are clear progress measures that we can use
(like the difference between the current and maximum flow values, or the distance between
s and t in the current residual network). For push-relabel, we require less intuitive progress
measures.
7.2 Bounding the Relabels
The analysis begins with the following key lemma, proved at the end of the lecture.
Lemma 7.2 (Key Lemma) If the vertex v has positive excess in the preflow f, then there
is a path from v to s in the residual network Gf.
The intuition behind the lemma is that, since the excess flow got to v somehow, it should
be possible to “undo” this flow in the residual network.
For the rest of this section, we assume that Lemma 7.2 is true and use it to prove
Theorem 7.1. The lemma has some immediate corollaries.
Corollary 7.3 (Height Bound) In the push-relabel algorithm, every vertex always has
height at most 2n.
Proof: A vertex v is only relabeled when it has excess. Lemma 7.2 implies that, at this
point, there is a path from v to s in the current residual network Gf. There is therefore such
a path with at most n − 1 edges (more edges would create a cycle, which can be removed to
obtain a shorter path). By the first invariant (Section 4), the height of s is always n. By the
third invariant, edges of Gf can only go downhill by one step. So traversing the path from
v to s decreases the height by at most n − 1, and winds up at height n. Thus v has height
2n − 1 or less, and at most one more than this after it is relabeled for the final time. ∎
The bound in Theorem 7.1 on the number of relabels follows immediately.

Corollary 7.4 (Relabel Bound) The push-relabel algorithm performs O(n²) relabels.
7.3 Bounding the Saturating Pushes
We now bound the number of pushes. We piggyback on Corollary 7.4 by using the number
of relabels as a progress measure. We’ll show that lots of pushes happen only when there
are already lots of relabels, and then apply our upper bound on the number of relabels.
We handle the cases of saturating pushes (which saturate the edge) and non-saturating
pushes (which exhaust a vertex’s excess) separately.5 For saturating pushes, think about a
particular edge (v, w). What has to happen for this edge to suffer two saturating pushes in
the same direction?
5 To be concrete, in case of a tie let's call it a non-saturating push.
Lemma 7.5 (Saturating Pushes) Between two saturating pushes on the same edge (v, w) in the same direction, each of v, w is relabeled at least twice.

Since each vertex is relabeled O(n) times (Corollary 7.3), each edge (v, w) can only suffer O(n) saturating pushes. This yields a bound of O(mn) on the number of saturating pushes. Since m = O(n²), this is even better than the bound of O(n³) that we're shooting for.6
Proof of Lemma 7.5: Suppose there is a saturating push on the edge (v, w). Since the push-relabel algorithm only pushes downhill, v is higher than w (h(v) = h(w) + 1). Because the push saturates (v, w), the edge drops out of the residual network. Clearly, a prerequisite for another saturating push on (v, w) is for (v, w) to reappear in the residual network. The only way this can happen is via a push in the opposite direction (on (w, v)). For this to occur, w must first reach a height larger than that of v (i.e., h(w) > h(v)), which requires w to be relabeled at least twice. After (v, w) has reappeared in the residual network (with h(v) < h(w)), no flow will be pushed on it until v is again higher than w. This requires at least two relabels of v. ∎
7.4 Bounding the Non-Saturating Pushes
We now proceed to the non-saturating pushes. Note that nothing we’ve said so far relies
on our greedy criterion for the vertex to process in each iteration (the highest vertex with
excess). This feature of the algorithm plays an important role in this final step.
Lemma 7.6 (Non-Saturating Pushes) Between any two relabel operations, there are at
most n non-saturating pushes.
Corollary 7.4 and Lemma 7.6 immediately imply a bound of O(n³) on the number of non-saturating pushes, which completes the proof of Theorem 7.1 (modulo the key lemma).

Proof of Lemma 7.6: Think about the entire sequence of operations performed by the algorithm. "Zoom in" to an interval bracketed by two relabel operations (possibly of different vertices), with no relabels in between. Call such an interval a phase of the algorithm. See Figure 8.
6 We're assuming that the input network has no parallel edges, between the same pair of vertices and in the same direction. This is effectively without loss of generality — multiple edges in the same direction can be replaced by a single one with capacity equal to the sum of the capacities of the parallel edges.
Figure 8: A timeline showing all operations ('O' represents relabels, 'X' represents non-saturating pushes). An interval between two relabels ('O's) is called a phase. There are O(n²) phases, and each phase contains at most n non-saturating pushes.
How does a non-saturating push at a vertex v make progress? By zeroing out the excess at v. Intuitively, we'd like to use the number of zero-excess vertices as a progress measure within a phase. But a non-saturating push can create a new excess elsewhere. To argue that this can't go on forever, we use the fact that excess is only transferred from higher vertices to lower vertices.
Formally, by the choice of v as the highest vertex with excess, we have

    h(v) ≥ h(w)    for all vertices w with excess        (1)

at the time of a non-saturating push at v. Inequality (1) continues to hold as long as there is no relabel: pushes only send flow downhill, so can only transfer excess from higher vertices to lower vertices.
After the non-saturating push at v, its excess is zero. How can it become positive again
in the future?7 It would have to receive flow from a higher vertex (with excess). This cannot
happen as long as (1) holds, and so can’t happen until there’s a relabel. We conclude that,
within a phase, there cannot be two non-saturating pushes at the same vertex v. The lemma
follows. ∎
7.5 Analysis Recap
The proof of Theorem 7.1 has several cleverly arranged steps.
1. Each vertex can only be relabeled O(n) times (Corollary 7.3 via Lemma 7.2), for a total of O(n²) relabels.

2. Each edge can only suffer O(n) saturating pushes (only 1 between each time both endpoints are relabeled twice, by Lemma 7.5), for a total of O(mn) saturating pushes.

3. Each vertex can only suffer O(n²) non-saturating pushes (only 1 per phase, by Lemma 7.6), for a total of O(n³) such pushes.
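To make the recap concrete, here is a minimal generic push-relabel implementation that counts its push and relabel operations. It is a sketch for small instances: the dict-of-capacities input format and the linear scan for the highest vertex with excess are our simplifications, not a tuned implementation.

```python
from collections import defaultdict

def push_relabel_max_flow(n, capacity, s, t):
    """Generic push-relabel with highest-vertex selection. Vertices are
    0..n-1; capacity maps (u, v) -> capacity. Returns the flow value and
    a count of push/relabel operations."""
    res = defaultdict(int)           # residual capacities
    adj = defaultdict(set)
    for (u, v), c in capacity.items():
        res[(u, v)] += c
        adj[u].add(v)
        adj[v].add(u)
    height = [0] * n
    height[s] = n                    # first invariant: s stays at height n
    excess = [0] * n
    for v in list(adj[s]):           # initialization: saturate edges out of s
        delta = res[(s, v)]
        res[(s, v)] -= delta
        res[(v, s)] += delta
        excess[v] += delta
        excess[s] -= delta
    counts = {"push": 0, "relabel": 0}

    def highest_with_excess():
        cand = [v for v in range(n) if v not in (s, t) and excess[v] > 0]
        return max(cand, key=lambda v: height[v]) if cand else None

    v = highest_with_excess()
    while v is not None:
        pushed = False
        for w in adj[v]:
            # push on an admissible (downhill, residual) edge
            if res[(v, w)] > 0 and height[v] == height[w] + 1:
                delta = min(excess[v], res[(v, w)])
                res[(v, w)] -= delta
                res[(w, v)] += delta
                excess[v] -= delta
                excess[w] += delta
                counts["push"] += 1
                pushed = True
                break
        if not pushed:               # relabel: smallest neighboring height + 1
            height[v] = 1 + min(height[w] for w in adj[v] if res[(v, w)] > 0)
            counts["relabel"] += 1
        v = highest_with_excess()
    return excess[t], counts
```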
7 For example, recall what happened to the vertex v in the example in Section 6.
8 Proof of Key Lemma
We now prove Lemma 7.2, that there is a path from every vertex with excess back to the
source s in the residual network. Recall the intuition: excess got to v from s somehow, and
the reverse edges should form a breadcrumb trail back to s.
Proof of Lemma 7.2: Fix a preflow f.8 Define

    A = {v ∈ V : there is an s ⇝ v path P in G with f_e > 0 for all e ∈ P}.
Conceptually, run your favorite graph search algorithm, starting from s, in the subgraph of
G consisting of the edges that carry positive flow. A is where you get stuck. (This is the
second example we’ve seen of the “reachable vertices” proof trick; there are many more.)
Why define A? Note that for a vertex v ∈ A, there is a path of reverse edges (with positive residual capacity) from v to s in the residual network G_f. So we just have to prove that all vertices with excess have to be in A.
Figure 9: Visualization of a cut. Recall that we can partition edges into 4 categories: (i) edges with both endpoints in A; (ii) edges with both endpoints in B; (iii) edges sticking out of B; (iv) edges sticking into B.
Define B = V − A. Certainly s is in A, and hence not in B. (As we'll see, t isn't in B either.) We might have B = ∅, but this is fine with us (we just want no vertices with excess in B).
The key trick is to consider the quantity

    Σ_{v∈B} [flow out of v − flow into v].        (2)

8 The argument bears some resemblance to the final step of the proof of the max-flow/min-cut theorem (Lecture #2) — the part where, given a residual network with no s-t path, we exhibited an s-t cut with value equal to that of the current flow.
Because f is a preflow (with flow in at least flow out, except at s) and s ∉ B, every term of (2) is non-positive. On the other hand, recall from Lecture #2 that we can write the sum in a different way, focusing on edges rather than vertices. The partition of V into A and
B buckets edges into four categories (Figure 9): (i) edges with both endpoints in A; (ii)
edges with both endpoints in B; (iii) edges sticking out of B; (iv) edges sticking into B.
Edges of type (i) are clearly irrelevant for (2) (the sum only concerns vertices of B). An
edge e = (v, w) of type (ii) contributes the value f_e once positively (as flow out of v) and once negatively (as flow into w), and these cancel out. By the same reasoning, edges of type (iii) and (iv) contribute once positively and once negatively, respectively. When the dust settles, we find that the quantity in (2) can also be written as

    Σ_{e∈δ+(B)} f_e − Σ_{e∈δ−(B)} f_e;        (3)

recall the notation δ+(B) and δ−(B) for the edges of G that stick out of and into B, respectively. Clearly each term in the first sum is nonnegative. Each term in the second sum must be zero: an edge e ∈ δ−(B) sticks out of A, so if f_e > 0 then the set A of vertices reachable by flow-carrying edges would not have gotten stuck as soon as it did.
The quantities (2) and (3) are equal, yet one is non-positive and the other non-negative.
Thus, they must both be 0. Since every term in (2) is non-positive, every term is 0. This
implies that conservation constraints (flow in = flow out) hold for all vertices of B. Thus all
vertices with excess are in A. By the definition of A, there are paths of reverse edges in the
residual network from these vertices to s, as desired. ∎
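The set A from this proof is easy to compute, and the lemma can be checked on small examples. The sketch below (with a made-up toy preflow, not an example from the lecture) performs the graph search along flow-carrying edges and computes excesses so that one can verify every vertex with excess lands in A.

```python
from collections import deque

def reachable_by_flow(flow, s):
    """Compute the set A from the proof of Lemma 7.2: all vertices reachable
    from s along edges carrying positive flow. flow maps (u, v) -> f_uv."""
    out = {}
    for (u, v), f in flow.items():
        if f > 0:
            out.setdefault(u, []).append(v)
    A, queue = {s}, deque([s])
    while queue:
        u = queue.popleft()
        for v in out.get(u, []):
            if v not in A:
                A.add(v)
                queue.append(v)
    return A

def excess_at(flow, v):
    """Flow into v minus flow out of v under the preflow."""
    inflow = sum(f for (x, y), f in flow.items() if y == v)
    outflow = sum(f for (x, y), f in flow.items() if x == v)
    return inflow - outflow
```

For example, on a toy preflow where s = 0 pushes 2 units to vertex 1 and vertex 1 forwards 1 unit onward, vertex 1 has excess 1 and is indeed reachable from s along the flow-carrying edge (0, 1), as the lemma promises.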
CS261: A Second Course in Algorithms
Lecture #4: Applications of Maximum Flows and
Minimum Cuts∗
Tim Roughgarden†
January 14, 2016
1 From Algorithms to Applications
The first three lectures covered four maximum flow algorithms (Ford-Fulkerson, Edmonds-Karp, Dinic's blocking flow-based algorithm, and the Goldberg-Tarjan push-relabel algorithm). We could talk about maximum flow algorithms till the cows come home — there have been decades of intense work on the problem, including some interesting breakthroughs just in the last couple of years. But four algorithms is enough for a course like CS261; it's time to move on to applications of the algorithms, and then on to study other fundamental problems.
Let's remind ourselves why we studied these algorithms.

1. Often the best way to get a good understanding of a computational problem is to study algorithms for it. For example, the Ford-Fulkerson algorithm introduced the crucial concept of a residual network, and gave us an excellent initial feel for the maximum flow problem.

2. These algorithms are part of the canon, among the greatest hits of algorithms. So it's fun to know how they work.

3. Maximum flow problems really do come up in practice, so it's good to know how you might solve them quickly. The push-relabel algorithm is an excellent starting point for implementing fast maximum flow algorithms.
The above reasons assume that we care about the maximum flow problem. And why do we
care? Because like all central algorithmic problems, it directly models several well-motivated
problems (traffic in transportation networks, oil in a distribution network, data packets in a
communication network), and also a surprising number of problems are really just maximum
flow in disguise. The lecture gives two examples, in computer vision and in graph matching,
and the exercise and problem sets contain several more. Perhaps the most useful skill you
can learn in CS261, for both practical and theoretical work, is how to recognize when the
tools of the course apply. Hopefully, practice makes perfect.
2 The Minimum Cut Problem
Figure 1: Example of an (s, t)-cut.
The minimum (s, t)-cut problem made a brief cameo in Lecture #2. It is the "dual" problem to maximum flow, in a sense we'll make precise in later lectures, and it is just as ubiquitous in applications. In the minimum (s, t)-cut problem, the input is the same as in the maximum flow problem (a directed graph, source and sink vertices, and edge capacities). The feasible solutions are the (s, t)-cuts, meaning the partitions of the vertex set V into two sets A and B with s ∈ A and t ∈ B (Figure 1). The objective is to compute the s-t cut with the minimum capacity, meaning the total capacity on edges sticking out of the source-side of the cut (those sticking in don't count):

    capacity of (A, B) = Σ_{e∈δ+(A)} u_e.
In Lecture #2 we noted a simple but extremely useful fact.
Corollary 2.1 The minimum s-t cut problem reduces in linear time to the maximum flow
problem.
Recall the argument: given a maximum flow, just do breadth- or depth-first search from s
in the residual graph (in linear time). We proved that if this search gets stuck at A, then
(A, V − A) is an (s, t)-cut with capacity equal to that of the flow; since no cut has capacity
less than any flow, the cut (A, V − A) must be a minimum cut.
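That argument translates directly into code. Below is a sketch (the function name and the dicts-keyed-by-directed-edge input format are our own conventions): build the residual graph from a maximum flow, search from s, and the stuck set is the source side of a minimum cut.

```python
from collections import defaultdict, deque

def min_cut_source_side(capacity, flow, s):
    """Given a maximum flow, return the source side A of a minimum (s, t)-cut:
    the vertices reachable from s in the residual graph. capacity and flow
    map directed edges (u, v) to numbers."""
    residual = defaultdict(int)
    for e, c in capacity.items():
        residual[e] += c - flow.get(e, 0)   # leftover forward capacity
    for (u, v), f in flow.items():
        residual[(v, u)] += f               # reverse ("undo") capacity
    out = defaultdict(list)
    for (u, v), r in residual.items():
        if r > 0:
            out[u].append(v)
    A, queue = {s}, deque([s])              # breadth-first search from s
    while queue:
        u = queue.popleft()
        for v in out[u]:
            if v not in A:
                A.add(v)
                queue.append(v)
    return A
```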
While there are some algorithms for solving the minimum (s, t)-cut problem without
going through maximum flows (especially for undirected graphs), in practice it is very com-
mon to solve it via this reduction. Next is an application of the problem to a basic image
segmentation task.
3 Image Segmentation

3.1 The Problem
We consider the problem of classifying the pixels of an image as either foreground or background. We model the problem as follows. The input is an undirected graph G = (V, E), where V is the set of pixels. The edges E designate pairs of pixels as neighbors. For example, a common input is a grid graph (Figure 2(a)), with an edge between two pixels that differ by 1 in one of the two coordinates. (Sometimes one also throws in the diagonals.) In any case, the solution we present works no matter what the graph G is.
Figure 2: Example of a grid network. In each vertex, the first value denotes a_v and the second value denotes b_v (here, 1/0 for the eight boundary pixels and 0/1 for the center pixel).
The input also contains 2|V| + |E| parameter values. Each vertex v is annotated with two nonnegative numbers a_v and b_v, and each edge e has a nonnegative value p_e. We discuss the semantics of these shortly.
The feasible outputs are the partitions of V into a foreground X and background Y; it's OK if X or Y is empty. We assess the quality of a solution by the objective function

    Σ_{v∈X} a_v + Σ_{v∈Y} b_v − Σ_{e∈δ(X)} p_e,        (1)

which we want to make as large as possible. (δ(X) denotes the edges cut by the partition (X, Y), with one endpoint on each side.)
We see that a vertex v earns a "prize" of a_v if it is included in X and b_v otherwise. In practice, these parameter values come from a prior as to whether a pixel v is more "likely" to be in the foreground (in which case a_v is big and b_v small) or in the background (leading to a big b_v and small a_v). It's not important for our purposes how this prior or these parameters are chosen, but it's easy to imagine examples. Perhaps a light blue pixel is typically part of the background (namely, the sky). Or perhaps one already knows a similar image that has already been segmented, like one taken earlier from the same position, and then declares that each pixel's region is likely to be the same as in the reference image.
If all we had were the a's and b's, the problem would be trivial — independently for each pixel, you would just assign it optimally to either X (if a_v > b_v) or Y (if b_v > a_v). The point of the neighboring relation E is that we also expect that images are mostly "smooth," with neighboring pixels much more likely to be in the same region than in different regions. The penalty p_e is incurred whenever the endpoints of e violate this prior belief. In machine learning terminology, the final objective (1) corresponds to a massaged version of the "maximum likelihood" objective function.
For example, suppose all p_e's are 0 in Figure 2(a). Then, the optimal solution assigns the entire boundary to the foreground and the middle pixel to the background. The objective function would be 9. If all the p_e's were 1, however, then this feasible solution would have value only 5 (because of the four cut edges). The optimal solution assigns all 9 pixels to the foreground, for a value of 8. The latter computation effectively recovers a corrupted pixel inside some homogeneous region.
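These calculations are easy to reproduce. Here is a sketch of objective (1) evaluated on the 3×3 grid of Figure 2 (the data layout — dicts keyed by pixel coordinates, edges as frozensets — is our own):

```python
def objective(a, b, p, X):
    """Objective (1): sum of a_v over the foreground X, plus b_v over the
    background, minus p_e over the cut edges delta(X)."""
    value = sum(a[v] if v in X else b[v] for v in a)
    value -= sum(pe for e, pe in p.items() if len(e & X) == 1)  # cut edges
    return value

# The 3x3 grid of Figure 2: boundary pixels have a_v = 1, b_v = 0; the
# center pixel has a_v = 0, b_v = 1.
pixels = [(i, j) for i in range(3) for j in range(3)]
a = {v: 0 if v == (1, 1) else 1 for v in pixels}
b = {v: 1 if v == (1, 1) else 0 for v in pixels}
edges = [frozenset({(i, j), (i + di, j + dj)})
         for (i, j) in pixels for di, dj in [(1, 0), (0, 1)]
         if i + di < 3 and j + dj < 3]
boundary = set(pixels) - {(1, 1)}
```

With all penalties 0, the boundary/center split scores 9; with all penalties 1 it drops to 5, while putting all 9 pixels in the foreground scores 8, matching the text.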
3.2 Toward a Reduction
Theorem 3.1 The image segmentation problem reduces, in linear time, to the minimum
(s, t)-cut problem (and hence to the maximum flow problem).
How would one ever suspect that such a reduction exists? The big clue is the form of the
output of the image segmentation problem, as the partition of a vertex set into two pieces.
This sure sounds like a cut problem. The coolest thing that could be true is that the problem
reduces to a cut problem that we already know how to solve, like the minimum (s, t)-cut
problem.
Digging deeper, there are several differences between image segmentation and (s, t)-cut
that might give us pause (Table 1). For example, while both problems have one parameter
per edge, the image segmentation problem has two parameters per vertex that seem to have
no analog in the minimum (s, t)-cut problem. Happily, all of these issues can be addressed
with the right reduction.
    Minimum (s, t)-cut          Image segmentation
    minimization objective      maximization objective
    source s, sink t            no source, sink vertices
    directed                    undirected
    no vertex parameters        a_v, b_v for each v ∈ V

Table 1: Differences between the image segmentation problem and the minimum (s, t)-cut problem.
3.3 Transforming the Objective Function
First, it's easy to convert the maximization objective function into a minimization one by multiplying through by −1:

    min_{(X,Y)} Σ_{e∈δ(X)} p_e − Σ_{v∈X} a_v − Σ_{v∈Y} b_v.
Clearly, the optimal solution under this objective is the same as under the original objective.
It's hard not to be a little spooked by the negative numbers in this objective function (e.g., in max flow or min cut, edge capacities are always nonnegative). This is also easy to fix. We just shift the objective function by adding the constant value Σ_{v∈V} a_v + Σ_{v∈V} b_v to every feasible solution. This gives the objective function

    min_{(X,Y)} Σ_{e∈δ(X)} p_e + Σ_{v∈Y} a_v + Σ_{v∈X} b_v.        (2)
Since we shifted all feasible solutions by the same amount, the optimal solution remains
unchanged.
3.4 Transforming the Graph
We use tricks familiar from Exercise Set #1. Given the undirected graph G = (V, E), we construct a directed graph G′ = (V′, E′) as follows:

• V′ = V ∪ {s, t} (i.e., add a new source and sink).

• E′ has two directed edges for each edge e in E (one in each direction). The capacity of both directed edges is defined to be p_e, the given penalty of edge e (Figure 3).
Figure 3: The (undirected) edges of G are bidirected in G′.
• E′ also has an edge (s, v) for every pixel v ∈ V, with capacity u_sv = a_v.

• E′ has an edge (v, t) for every pixel v ∈ V, with capacity u_vt = b_v.

See Figure 4 for a small example of the transformation.

Figure 4: (a) the initial network and (b) the transformation.
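In code, the whole construction is a few lines. A sketch (the function name and the dict-of-capacities output format are ours; the labels 's' and 't' are assumed not to clash with pixel names):

```python
def build_cut_network(vertices, edges, a, b, p):
    """Construct the directed graph G' of the reduction. edges is a list of
    frozensets {v, w}; a, b map vertices to prizes and p maps edges to
    penalties. Returns capacities keyed by directed edge (tail, head)."""
    cap = {}
    for e in edges:                      # bidirect each undirected edge
        v, w = tuple(e)
        cap[(v, w)] = cap.get((v, w), 0) + p[e]
        cap[(w, v)] = cap.get((w, v), 0) + p[e]
    for v in vertices:
        cap[('s', v)] = a[v]             # capacity u_sv = a_v
        cap[(v, 't')] = b[v]             # capacity u_vt = b_v
    return cap
```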
3.5 Proof of Theorem 3.1
Consider an input G = (V, E) to the image segmentation problem and the directed graph G′ = (V′, E′) constructed by the reduction above. There is a natural bijection between partitions (X, Y) of V and (s, t)-cuts (A, B) of G′, with A ↔ X ∪ {s} and B ↔ Y ∪ {t}. The key claim is that this correspondence preserves objective function value — that the capacity of every (s, t)-cut (A, B) of G′ is precisely the objective function value (under (2)) of the partition (A \ {s}, B \ {t}).

So fix an (s, t)-cut (X ∪ {s}, Y ∪ {t}) of G′. Here are the edges sticking out of X ∪ {s}:
1. for every v ∈ Y, δ+(X ∪ {s}) contains the edge (s, v), which has capacity a_v;

2. for every v ∈ X, δ+(X ∪ {s}) contains the edge (v, t), which has capacity b_v;

3. for every edge e ∈ δ(X), δ+(X ∪ {s}) contains exactly one of the two corresponding directed edges of G′ (the other one goes backward), and it has capacity p_e.
These are precisely the edges of δ+(X ∪ {s}). We compute the cut's capacity just by summing up, for a total of

    Σ_{v∈Y} a_v + Σ_{v∈X} b_v + Σ_{e∈δ(X)} p_e.

This is identical to the objective function value (2) of the partition (X, Y). We conclude that computing the optimal such partition reduces to computing a minimum (s, t)-cut of G′. The reduction can be implemented in linear time.
4 Bipartite Matching
Figure 5: Visualization of bipartite graph. Edges exist only between the partitions V and
W.
We next give a famous application of maximum flow. This application also serves as a segue
between the first two major topics of the course, the maximum flow problem and graph
matching problems.
In the bipartite matching problem, the input is an undirected bipartite graph G = (V ∪
W, E), with every edge of E having one endpoint in each of V and W. That is, no edges
internal to V or W are allowed (Figure 5). The feasible solutions are the matchings of the
graph, meaning subsets S ⊆ E of edges that share no endpoints. The goal of the problem is
to compute a matching with the maximum-possible cardinality. Said differently, the goal is
to pair up as many vertices as possible (using edges of E).
For example, the square graph (Figure 6(a)) is bipartite, and the maximum-cardinality
matching has size 2. It matches all of the vertices, which is obviously the best-case scenario.
Such a matching is called perfect.
Figure 6: (a) square graph with a perfect matching of size 2. (b) star graph with maximum-cardinality matching of size 1. (c) non-bipartite graph with maximum matching of size 1.
Not all graphs have perfect matchings. For example, in the star graph (Figure 6(b)), which is also bipartite, no matter how many vertices there are, the maximum-cardinality matching has size only 1.
It's also interesting to discuss the maximum-cardinality matching problem in general (non-bipartite) graphs (like Figure 6(c)), but this is a harder topic that we won't cover here. While one can of course consider the bipartite special case of any graph problem, in matching problems bipartite graphs play a particularly fundamental role. First, matching theory is nicer and matching algorithms are faster for bipartite graphs than for non-bipartite graphs. Second, a majority of the applications are already in the bipartite special case — assigning workers to jobs, courses to room/time slots, medical residents to hospitals, etc.
Claim: maximum-cardinality matching reduces in linear time to maximum flow.
Proof sketch: Given an undirected bipartite graph (V ∪ W, E), construct a directed graph G′ as in Figure 7(b). We add a source and sink, so the new vertex set is V′ = V ∪ W ∪ {s, t}. To obtain E′ from E, we direct all edges of G from V to W and also add edges from s to every vertex of V and from every vertex of W to t. Edges incident to s or t have capacity 1, reflecting the constraint that each vertex of V ∪ W can only be matched to one other vertex. Each edge (v, w) directed from V to W can be given any capacity that is at least 1 (v can only receive one unit of flow, anyway); for simplicity, give all these edges infinite capacity. You should check that there is a one-to-one correspondence between matchings of G and integer-valued flows in G′, with edge (v, w) corresponding to one unit of flow on the path s → v → w → t in G′ (Figure 7). This bijection preserves the objective function value.
Thus, given an integral maximum flow in G′, the edges from V to W that carry flow form a maximum matching.1
1 All of the maximum flow algorithms that we've discussed return an integral maximum flow provided all the edge capacities are integers. The reason is that inductively, the current (pre)flow, and hence the residual capacities, and hence the augmentation amount, stay integral throughout these algorithms.
Figure 7: (a) the original bipartite graph G and (b) the constructed directed graph G′. There is a one-to-one correspondence between matchings of G and integer-valued flows of G′; e.g., (v, w) in G corresponds to one unit of flow on s → v → w → t in G′.
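The reduction, plus any augmenting-path maximum flow algorithm, gives a complete matching routine. Below is a sketch using plain Ford-Fulkerson on the constructed network (we use capacity 1 rather than ∞ on the V-to-W edges, which the text notes is equally valid; the labels 's' and 't' are assumed not to clash with vertex names):

```python
from collections import defaultdict

def max_bipartite_matching(V, W, E):
    """Maximum-cardinality bipartite matching via the flow reduction:
    s -> V -> W -> t with unit capacities, solved by augmenting paths."""
    res = defaultdict(int)          # residual capacities
    adj = defaultdict(set)
    def add(u, v, c):
        res[(u, v)] += c
        adj[u].add(v)
        adj[v].add(u)
    for v in V:
        add('s', v, 1)
    for w in W:
        add(w, 't', 1)
    for v, w in E:
        add(v, w, 1)

    def augment():                  # DFS for an s-t path in the residual graph
        stack, parent = ['s'], {'s': None}
        while stack:
            u = stack.pop()
            if u == 't':            # push one unit along the path found
                while parent[u] is not None:
                    p = parent[u]
                    res[(p, u)] -= 1
                    res[(u, p)] += 1
                    u = p
                return True
            for x in adj[u]:
                if x not in parent and res[(u, x)] > 0:
                    parent[x] = u
                    stack.append(x)
        return False

    while augment():
        pass
    # flow-carrying V-to-W edges (reverse residual = 1) form the matching
    return {(v, w) for (v, w) in E if res[(w, v)] > 0}
```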
5 Hall's Theorem
In this final section we tie together a number of the course's ongoing themes. We previously asked the question
How do we know when we’re done (i.e., optimal)?
for the maximum flow problem. Let’s ask it again for the maximum-cardinality bipartite
matching problem. Using the reduction in Section 4, we can translate the optimality con-
ditions for the maximum flow problem (i.e., the max-flow/min-cut theorem) into a famous
optimality condition for bipartite matchings.
Consider a bipartite graph G = (V ∪ W, E) with |V | ≤ |W|, renaming V, W if necessary.
Call a matching of G perfect if it matches every vertex in V ; clearly, a perfect matching is a
maximum matching. Let’s first understand which bipartite graphs admit a perfect matching.
Some notation: for a subset S ⊆ V , let N(S) denote the union of the neighborhoods of
the vertices of S: N(S) = {w ∈ W : ∃v ∈ S s.t. (v, w) ∈ E}. See Figure 8 for two examples
of such neighbor sets.
Figure 8: Two examples of vertex sets S and T and their respective neighbor sets N(S) and N(T).
Does the graph in Figure 8 have a perfect matching? A little thought shows that the
answer is “no.” The three vertices of S have only two distinct neighbors between them.
Since each vertex can only be matched to one other vertex, there is no hope of matching
more than two of the three vertices of S.
More generally, if a bipartite graph has a constricting set S ⊆ V, meaning one with |N(S)| < |S|, then it has no perfect matching. But what about the converse? If a bipartite graph admits no perfect matching, can you always find a short convincing argument of this fact, in the form of a constricting set? Or could there be obstructions to perfect matchings beyond just constricting sets? Hall's Theorem gives the beautiful answer that constricting sets are the only obstacles to perfect matchings.2
Theorem 5.1 (Hall’s Theorem) A bipartite graph (V ∪ W, E) with |V | ≤ |W| has a per-
fect matching if and only if, for every subset S ⊆ V , |N(S)| ≥ |S|.
2 Hall's theorem actually predates the max-flow/min-cut theorem by 20 years.
Thus, it’s not only easy to convince someone that a graph has a perfect matching (just
exhibit a matching), it’s also easy to convince someone that a graph does not have a perfect
matching (just exhibit a constricting set).
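On small graphs, the search for a constricting set can even be done by brute force over all subsets of V. The sketch below is our own illustration (exponential in |V|, so for tiny examples only):

```python
from itertools import combinations

def find_constricting_set(V, W, E):
    """Return a constricting set S (with |N(S)| < |S|) if one exists,
    else None; by Hall's Theorem, None means a perfect matching exists."""
    nbrs = {v: {w for (x, w) in E if x == v} for v in V}
    for k in range(1, len(V) + 1):
        for S in combinations(V, k):
            N = set().union(*(nbrs[v] for v in S))
            if len(N) < len(S):
                return set(S)
    return None
```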
Proof of Theorem 5.1: We already argued the easy "only if" direction. For the "if" direction, suppose that |N(S)| ≥ |S| for every S ⊆ V.

Claim: in the flow network G′ that corresponds to G (Figure 7), every (s, t)-cut has capacity at least |V|.

To see why the claim implies the theorem, note that it implies that the minimum cut value in G′ is at least |V|, so the maximum flow in G′ is at least |V| (by the max-flow/min-cut theorem), and an integral flow with value |V| corresponds to a perfect matching of G.
Proof of claim: Fix an (s, t)-cut (A, B) of G′. Let S = A ∩ V denote the vertices of V that lie on the source side. Since s ∈ A, all (unit-capacity) edges from s to vertices of V − A contribute to the capacity of (A, B). Recall that we gave the edges directed from V to W infinite capacity. Thus, if some vertex w of N(S) fails to also be in A, then the cut (A, B) has infinite capacity (because of the edge from S to w) and there is nothing to prove. So suppose all of N(S) belongs to A. Then all of the (unit-capacity) edges from vertices of N(S) to t contribute to the capacity of (A, B). Summing up, we have

    capacity of (A, B) ≥ (|V| − |S|) + |N(S)| ≥ |V|,        (3)

where the first term counts the edges from s to V − S, the second term counts the edges from N(S) to t, and the final inequality in (3) follows from the assumption that |N(S)| ≥ |S| for every S ⊆ V. ∎
On Exercise Set #2 you will extend this proof to show that, more generally, for every bipartite graph (V ∪ W, E) with |V| ≤ |W|,

    size of maximum matching = min_{S⊆V} (|V| − (|S| − |N(S)|)).

Note that at least |S| − |N(S)| vertices of S are unmatched in every matching.
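This defect formula is easy to sanity-check by brute force on tiny graphs. Both helpers below are our own illustrations (exponential-time, for verification only), not the Exercise Set #2 proof:

```python
from itertools import combinations

def brute_max_matching(E):
    """Maximum matching size by brute force (tiny edge lists only)."""
    for k in range(len(E), 0, -1):
        for M in combinations(E, k):
            ends = [x for e in M for x in e]
            if len(ends) == len(set(ends)):   # no shared endpoints
                return k
    return 0

def defect_formula(V, E):
    """min over S of |V| - (|S| - |N(S)|), computed by brute force;
    S = empty set contributes the baseline value |V|."""
    nbrs = {v: {w for (x, w) in E if x == v} for v in V}
    best = len(V)
    for k in range(1, len(V) + 1):
        for S in combinations(V, k):
            N = set().union(*(nbrs[v] for v in S))
            best = min(best, len(V) - (len(S) - len(N)))
    return best
```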
CS261: A Second Course in Algorithms
Lecture #5: Minimum-Cost Bipartite Matching∗
Tim Roughgarden†
January 19, 2016
1 Preliminaries
Figure 1: Example of a bipartite graph. The edges {a, b} and {c, d} constitute a matching.
Last lecture introduced the maximum-cardinality bipartite matching problem. Recall that
a bipartite graph G = (V ∪ W, E) is one whose vertices are split into two sets such that
every edge has one endpoint in each set (no edges internal to V or W allowed). Recall that
a matching is a subset M ⊆ E of edges with no shared endpoints (e.g., Figure 1). Last
lecture, we sketched a simple reduction from this problem to the maximum flow problem.
Moreover, we deduced from this reduction and the max-flow/min-cut theorem a famous
optimality condition for bipartite matchings. A special case is Hall's theorem, which states that a bipartite graph with |V| ≤ |W| has a perfect matching if and only if for every subset S ⊆ V of the left-hand side, the number |N(S)| of neighbors of S on the right-hand side is at least |S|. See Problem Set #2 for quite good running time bounds for the problem.
But what if a bipartite graph has many perfect matchings? In applications, there are often reasons to prefer one over another. For example, when assigning jobs to workers, perhaps there are many workers who can perform a particular job, but some of them are better at
it than others. The simplest way to model such preferences is to attach a cost c_e to each edge e ∈ E of the input bipartite graph G = (V ∪ W, E).
We also make three assumptions. These are for convenience, and are not crucial for any
of our results.
1. The sets V and W have the same size, call it n. This assumption is easily enforced by adding "dummy vertices" (with no incident edges) to the smaller side.

2. The graph G has at least one perfect matching. This is easily enforced by adding "dummy edges" that have a very high cost (e.g., one such edge from the ith vertex of V to the ith vertex of W, for each i).

3. Edge costs are nonnegative. This can be enforced in the obvious way: if the most negative edge cost is −M, just add M to the cost of every edge. This adds the same number (nM) to every perfect matching, and thus does not change the problem.
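All three assumptions can be enforced by a short preprocessing pass. A sketch (the function name and the 'dummy...' naming scheme are ours, and the dummy names are assumed not to clash with real vertex names):

```python
def normalize_instance(V, W, costs):
    """Enforce the three assumptions: pad V and W to equal size with dummy
    vertices, shift costs to be nonnegative, and add high-cost dummy edges
    guaranteeing a perfect matching. costs maps (v, w) -> cost."""
    V, W = list(V), list(W)
    while len(V) < len(W):                       # assumption 1: equal sides
        V.append(f"dummy{len(V)}")
    while len(W) < len(V):
        W.append(f"dummy{len(W)}")
    shift = max(0, -min(costs.values(), default=0))   # assumption 3
    costs = {e: c + shift for e, c in costs.items()}
    big = sum(costs.values()) + 1                # "very high" dummy cost
    for v, w in zip(V, W):                       # assumption 2: ith-to-ith
        costs.setdefault((v, w), big)
    return V, W, costs
```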
The goal in the minimum-cost perfect bipartite matching problem is to compute the perfect matching M that minimizes Σ_{e∈M} c_e. The feasible solutions to the problem are the perfect matchings of G. An equivalent problem is the maximum-weight perfect bipartite matching problem (just multiply all weights by −1 to transform them into costs).
When every edge has the same cost and we only care about cardinality, the problem
reduces to the maximum flow problem (Lecture #4). With general costs, there does not
seem to be a natural reduction to the maximum flow problem. It’s true that edges in a flow
network come with attached numbers (their capacities), but there is a type mismatch: edge
capacities affect the set of feasible solutions but not their objective function values, while
edge costs do the opposite. Thus, the minimum-cost perfect bipartite matching problem
seems like a new problem, for which we have to design an algorithm from scratch.
We’ll follow the same kind of disciplined approach that served us so well in the maximum
flow problem. First, we identify optimality conditions, which tell us when a given perfect
matching is in fact minimum-cost. This step is structural, not algorithmic, and is analogous
to our result in Lecture #2 that a flow is maximum if and only if there is no s-t path in the
residual network. Then, we design an algorithm that can only terminate with the feasibility
and optimality conditions satisfied. For maximum flow, we had one algorithmic paradigm
that maintained feasibility and worked toward the optimality conditions (augmenting path
algorithms), and a second paradigm that maintained the optimality conditions and worked
toward feasibility (push-relabel). Here, we follow the second approach. We’ll identify invari-
ants that imply the optimality condition, and design an algorithm that keeps them satisfied
at all times and works toward a feasible solution (i.e., a perfect matching).
2 Optimality Conditions
How do we know if a given perfect matching has the minimum-possible cost? Optimality
conditions are different for different problems, but for the problems studied in CS261 they are
all quite natural in hindsight. We first need an analog of a residual network. This requires
some definitions (see also Figure 2).
Figure 2: If our matching contains {a, b} and {c, d}, then a → b → d → c → a is both an
M-alternating cycle and a negative cycle.
Definition 2.1 (Negative Cycle) Let M be a matching in the bipartite graph G = (V ∪ W, E).

(a) A cycle C of G is M-alternating if every other edge of C belongs to M (Figure 2).¹

(b) An M-alternating cycle is negative if the edges in the matching have higher cost than
those outside the matching:

    Σ_{e∈C∩M} c_e > Σ_{e∈C\M} c_e.

Otherwise, it is nonnegative.
One interesting thing about alternating cycles is that “toggling” the edges of C with
respect to M — that is, removing the edges of C ∩ M and plugging in the edges of C \ M —
yields a new matching M′ that matches exactly the same set of vertices. (Vertices outside
of C are clearly unaffected; vertices inside C remain matched to precisely one other vertex
of C, just a different one than before.)
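This toggling operation is just a symmetric difference of edge sets. Here is a minimal Python sketch (the edge and matching representations are my own choices, not from the lecture):

```python
def toggle(matching, cycle_edges):
    """Toggle an M-alternating cycle: drop the cycle edges that are in
    the matching, add the cycle edges that are not."""
    # Normalize undirected edges so that (a, b) and (b, a) compare equal.
    m = {frozenset(e) for e in matching}
    c = {frozenset(e) for e in cycle_edges}
    return m ^ c  # symmetric difference: edges in exactly one of M, C

# The cycle a -> b -> d -> c -> a from Figure 2, with M = {{a,b}, {c,d}}:
M = [("a", "b"), ("c", "d")]
C = [("a", "b"), ("b", "d"), ("d", "c"), ("c", "a")]
M2 = toggle(M, C)  # the other perfect matching on {a, b, c, d}
```

As claimed above, M2 matches exactly the same four vertices as M, just paired differently.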
Suppose M is a perfect matching, and we toggle the edges of an M-alternating cycle
to get another (perfect) matching M′. Dropping the edges from C ∩ M saves us a cost of
Σ_{e∈C∩M} c_e, while adding the edges of C \ M costs us Σ_{e∈C\M} c_e. Then M′ has smaller cost
than M if and only if C is a negative cycle.
The point of a negative cycle is that it offers a quick and convincing proof that a per-
fect matching is not minimum-cost (since toggling the edges of the cycle yields a cheaper
matching). But what about the converse? If a perfect matching is not minimum-cost, are
we guaranteed such a short and convincing proof of this fact? Or are there “obstacles” to
optimality beyond the obvious ones of negative cycles?
¹ Since G is bipartite, C is necessarily an even cycle. One certainly can't have more than every other edge
of C contained in the matching M.
Theorem 2.2 (Optimality Conditions for Min-Cost Bipartite Matching) A perfect
matching M of a bipartite graph has minimum cost if and only if there is no negative M-alternating cycle.
Proof: We have already argued the “only if” direction. For the harder “if” direction, suppose
that M is a perfect matching and that there is no negative M-alternating cycle. Let M′
be any other perfect matching; we want to show that the cost of M′ is at least that of M.
Consider M ⊕ M′, meaning the symmetric difference of M and M′ (if you want to think of them
as sets) or their XOR (if you want to think of them as 0/1 vectors). See Figure 3 for two
examples.
Figure 3: Two examples that show what happens when we XOR two matchings (the dashed
edges).
In general, M ⊕ M′ is a union of (vertex-)disjoint cycles. The reason is that, since every
vertex has degree 1 in both M and M′, every vertex v has degree either 0 (if it is matched
to the same vertex in both M and M′) or 2 (otherwise). A graph with all degrees either 0
or 2 must be the union of disjoint cycles.
Since taking the symmetric difference/XOR with the same set two times in a row recovers
the initial set, (M ⊕ M′) ⊕ M′ = M. Since M ⊕ M′ is a disjoint union of cycles, taking the
symmetric difference/XOR with M ⊕ M′ just means toggling the edges in each of its cycles
(since they are disjoint, they don't interfere and the toggling can be done in parallel). Each
of these cycles is M-alternating, and by assumption each is nonnegative. Thus toggling the
edges of the cycles can only produce a perfect matching M′ that is at least as expensive. Since M′ was
an arbitrary perfect matching, M must be a minimum-cost perfect matching. ∎
3 Reduced Costs and Invariants
Now that we know when we're done, we work toward algorithms that terminate with the
optimality conditions satisfied. Following the push-relabel approach (Lecture #3), we next
identify invariants that will imply the optimality conditions at all times. Our algorithm will
maintain these as it works toward a feasible solution (i.e., a perfect matching). Continuing
the analogy with the push-relabel paradigm, we maintain an extra number p_v for each vertex
v ∈ V ∪ W, called a price (analogous to the “heights” in Lecture #3). Prices are allowed to
be positive or negative. We use prices to force us to add edges to our current matching only
in a disciplined way, somewhat analogous to how we only pushed flow “downhill” in Lecture #3.
Formally, for a price vector p (indexed by vertices), we define the reduced cost of an edge
e = (v, w) by

    c_e^p = c_e − p_v − p_w.    (1)

Here are our invariants, which are with respect to a current matching M and a current vector p
of prices.
Invariants

1. Every edge of G has nonnegative reduced cost.

2. Every edge of M is tight, meaning it has zero reduced cost.
Figure 4: For the given (perfect) matching (dashed edges), (a) violates invariant 1, while (b)
satisfies all invariants.
For example, consider the (perfect) matching in Figure 4. Is it possible to define prices
so that the invariants hold? To satisfy the second invariant, we need to make the edges
(v, w) and (x, y) tight. We could try setting the prices of w and y to 0, which then dictates
setting p_v = 7 and p_x = 2 (Figure 4(a)). This violates the first invariant, however, since
the reduced cost of edge (v, y) is −1. We can satisfy both invariants by resetting p_v = 5 and
p_w = 2; then both edges in the matching are tight and the other two edges have reduced
cost 1 (Figure 4(b)).
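Both invariants are mechanical to check. A small Python sketch, using the edge costs inferred from Figure 4 (c(v,w) = 7, c(v,y) = 6, c(x,w) = 5, c(x,y) = 2; the particular numbers are read off the figure, so treat them as an assumption):

```python
def reduced_cost(c, p, v, w):
    # c^p_e = c_e - p_v - p_w, as in (1)
    return c[(v, w)] - p[v] - p[w]

def invariants_hold(c, p, matching):
    nonneg = all(reduced_cost(c, p, v, w) >= 0 for (v, w) in c)        # invariant 1
    tight = all(reduced_cost(c, p, v, w) == 0 for (v, w) in matching)  # invariant 2
    return nonneg and tight

c = {("v", "w"): 7, ("v", "y"): 6, ("x", "w"): 5, ("x", "y"): 2}
M = [("v", "w"), ("x", "y")]
bad_p = {"v": 7, "x": 2, "w": 0, "y": 0}   # Figure 4(a): (v, y) has reduced cost -1
good_p = {"v": 5, "x": 2, "w": 2, "y": 0}  # Figure 4(b): both invariants hold
```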
The matching in Figure 4 is a min-cost perfect matching. This is no coincidence.
Lemma 3.1 (Invariants Imply Optimality Condition) If M is a perfect matching and
both invariants hold, then M is a minimum-cost perfect matching.
Proof: Let M be a perfect matching such that both invariants hold. By our optimality
condition (Theorem 2.2), we just need to check that there is no negative cycle. So consider
any M-alternating cycle C (remember a negative cycle must be M-alternating, by definition).
We want to show that the edges of C that are in M have cost at most that of the edges of
C not in M. Adding and subtracting Σ_{v∈C} p_v and using the fact that every vertex of C is
the endpoint of exactly one edge of C ∩ M and of C \ M (e.g., Figure 5), we can write

    Σ_{e∈C∩M} c_e = Σ_{e∈C∩M} c_e^p + Σ_{v∈C} p_v    (2)

and

    Σ_{e∈C\M} c_e = Σ_{e∈C\M} c_e^p + Σ_{v∈C} p_v.    (3)

(We are abusing notation and using C both to denote the vertices in the cycle and the edges
in the cycle; hopefully the meaning is always clear from context.) Clearly, the third terms
in (2) and (3) are the same. By the second invariant (edges of M are tight), the second term
in (2) is 0. By the first invariant (all edges have nonnegative reduced cost), the second term
in (3) is at least 0. We conclude that the left-hand side of (2) is at most that of (3), which
proves that C is not a negative cycle. Since C was an arbitrary M-alternating cycle, the proof
is complete. ∎
Figure 5: In the example M-alternating cycle and matching shown above, every vertex is an
endpoint of exactly one edge in M and one edge not in M.
4 The Hungarian Algorithm
Lemma 3.1 reduces the problem of designing a correct algorithm for the minimum-cost
perfect bipartite matching problem to that of designing an algorithm that maintains the
two invariants and computes an arbitrary perfect matching. This section presents such an
algorithm.
4.1 Backstory
The algorithm we present goes by various names, the two most common being the Hungarian
algorithm and the Kuhn-Munkres algorithm. You might therefore find it weird that Kuhn
and Munkres are American. Here's the story. In the early/mid-1950s, Kuhn really wanted
an algorithm for solving the minimum-cost bipartite matching problem. So he was reading
a graph theory book by Kőnig. This was actually the first graph theory book ever written
— in the 1930s, and available in the U.S. only in 1950 (even then, only in German). Kuhn
was intrigued by an offhand citation in the book, to a paper of Egerváry. Kuhn tracked
down the paper, which was written in Hungarian. This was way before Google Translate, so
he bought a big English-Hungarian dictionary and translated the whole thing. And indeed,
Egerváry's paper had the key ideas necessary for a good algorithm. Kőnig and Egerváry were
both Hungarian, so Kuhn called his algorithm the Hungarian algorithm. Kuhn only proved
termination of his algorithm, and soon thereafter Munkres observed a polynomial time bound
(basically the bound proved in this lecture). Hence, it is also called the Kuhn-Munkres algorithm.

In a (final?) twist to the story, in 2006 it was discovered that Jacobi, the famous mathematician (you've studied multiple concepts named after him in your math classes), came
up with an equivalent algorithm in the 1840s! (Published only posthumously, in 1890.)
Kuhn, then in his 80s, was a good sport about it, giving talks with the title “The Hungarian
Algorithm and How Jacobi Beat Me By 100 Years.”
4.2 The Algorithm: High-Level Structure
The Hungarian algorithm maintains both a matching M and prices p. The initialization is
straightforward.
Initialization

set M = ∅
set p_v = 0 for all v ∈ V ∪ W
The second invariant holds vacuously. The first invariant holds because we are assuming
that all edge costs (and hence initial reduced costs) are nonnegative.
Informally (and way underspecified), the main loop works as follows. The terms “aug-
ment,” “good path,” and “good set” will be defined shortly.
Main Loop (High-Level)
while M is not a perfect matching do
if there is a good path P then
augment M by P
else
find a good set S; update prices accordingly
4.3 Good Paths
We now start filling in the details. Fix the current matching M and current prices p. Call a
path P from v to w good if:

1. both endpoints v, w are unmatched in M, with v ∈ V and w ∈ W (hence P has odd
length);

2. it alternates edges out of M with edges in M (since v, w are unmatched, the first and
last edges are not in M);

3. every edge of P is tight (i.e., has zero reduced cost and hence is eligible to be included
in the current matching).
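These three conditions translate directly into a checker. A Python sketch (the path and matching representations are my own choices, and the test data below assumes {b, c} is the matched edge of Figure 6):

```python
def is_good_path(path, partner, reduced_cost):
    """path: list of vertices, alternating V-side and W-side.
    partner: dict mapping each matched vertex to its match.
    reduced_cost: dict mapping frozenset({a, b}) to the edge's reduced cost."""
    v, w = path[0], path[-1]
    if v in partner or w in partner:     # 1. both endpoints unmatched
        return False
    if len(path) % 2 != 0:               # 1. odd number of edges (even # of vertices)
        return False
    for i in range(len(path) - 1):
        a, b = path[i], path[i + 1]
        in_m = partner.get(a) == b
        if in_m != (i % 2 == 1):         # 2. alternation; first/last edges not in M
            return False
        if reduced_cost[frozenset((a, b))] != 0:  # 3. every edge tight
            return False
    return True

# A path like Figure 6's: a-b-c-d, all edges tight, {b, c} the only matched edge.
rc = {frozenset(e): 0 for e in [("a", "b"), ("b", "c"), ("c", "d")]}
partner = {"b": "c", "c": "b"}
```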
Figure 6 depicts a simple example of a good path.
Figure 6: Dashed edges denote edges in the matching and red edges denote a good path.
The reason we care about good paths is that such a path allows us to increase the
cardinality of M without breaking either invariant. Specifically, consider replacing M by
M′ = M ⊕ P. This can be thought of as toggling which edges of P are in the current
matching. By definition, a good path is M-alternating, with first and last hops not in M;
thus, |P ∩ M| = |P \ M| − 1, and the size of M′ is one more than that of M. (E.g., if P is a 9-hop
path, this toggling removes 4 edges from M but adds in 5 other edges.) No reduced
costs have changed, so certainly the first invariant still holds. All edges of P are tight by
definition, so the second invariant also continues to hold.
Augmentation Step
given a good path P, replace M by M ⊕ P
Finding a good path is definitely progress — after n such augmentations, the current
matching M must be perfect and (since the invariants hold) we’re done. How can we effi-
ciently find such a path? And what do we do if there’s no such path?
To efficiently search for such a path, let’s just follow our nose. It turns out that breadth-
first search (BFS), with a twist to enforce M-alternation, is all we need.
Figure 7: Dashed edges are the edges in the matching. Only tight edges are shown.
The algorithm will be clear from an example. Consider the graph in Figure 7; only the
tight edges are shown. Note that the graph does not contain a good path (if it did, we
could use it to augment the current matching to obtain a perfect matching, but vertex #4 is
isolated so there is no perfect matching).² So we know in advance that our search will fail.
But it's useful to see what happens when it fails.
Figure 8: BFS spanning tree if we start the search from node 3. Note that the edge {2, 6}
is not used.
We start a graph search from an unmatched vertex of V (the first such vertex, say); see
also Figure 8. In the example, this is vertex #3. Layer 0 of our search tree is {3}. We
obtain layer 1 from layer 0 by BFS; thus, layer 1 is {2, 7}. Note that if either 2 or 7 is
unmatched, then we have found a (one-hop) good path and we can stop the search. Both 2
and 7 are already matched in the example, however. Here is the twist to BFS: at the next
layer 2 we put only the vertices to which 2 and 7 are matched, namely 1 and 8. Conspicuous
in its absence is vertex #6; in regular BFS it would be included in layer 2, but here we
omit it because it is not matched to a vertex of layer 1. The reason for this twist is that
we want every path in our search tree to be M-alternating (since good paths need to be
M-alternating).
² Remember we assume only that G contains a perfect matching; the subgraph of tight edges at any given
time will generally not contain a perfect matching.
We then switch back to BFS. At vertex #8 we're stuck (we've already seen its only
neighbor, #7). At vertex #1, we've already seen its neighbor 2 but have not yet seen vertex
#5, so the third layer is {5}. Note that if 5 were unmatched, we would have found a good
path, from 5 back to the root 3. (All edges in the tree are tight by definition; the path is
alternating and of odd length, joining two unmatched vertices of V and W.) But 5 is already
matched to 6, so layer 4 of the search tree is {6}. We've already seen both of 6's neighbors
before, so at this point we're stuck and the search terminates.
In general, here is the search procedure for finding a good path (given a current match-
ing M and prices p).
Searching for a Good Path

level 0 = the first unmatched vertex r of V
while not stuck and no other unmatched vertex found do
    if next level i is odd then
        define level i from level i − 1 via BFS
        // i.e., neighbors of level i − 1 not already seen
    else if next level i is even then
        define level i as the vertices matched in M to vertices at level i − 1
if found another unmatched vertex w then
    return the search tree path between the root r and w
else
    return “stuck”
To understand this subroutine, consider an edge (v, w) ∈ M, and suppose that v is
reached first, at level i. Importantly, it is not possible that w is also reached at level i. This
is where we use the assumption that G is bipartite: if v, w are reached in the same level,
then pasting together the paths from r to v and from r to w (which have the same length)
with the edge (v, w) exhibits an odd cycle, contradicting bipartiteness. Second, we claim
that i must be odd (cf., Figure 8). The reason is just that, by construction, every vertex
at an even level (other than 0) is the second endpoint reached of some matched edge (and
hence cannot be the endpoint of any other matched edge). We conclude that:
(*) if either endpoint of an edge of M is reached in the search tree, then both endpoints
are reached, and they appear at consecutive levels i, i + 1 with i odd.
Suppose the search tree reaches an unmatched vertex w other than the root r. Since
every vertex at an even level (after 0) is matched to a vertex at the previous level, w must
be at an odd level (and hence in W). By construction, every edge of the search tree is tight,
and every path in the tree is M-alternating. Thus the r-w path in the search tree is a good
path, allowing us to increase the size of M by 1.
4.4 Good Sets
Suppose the search gets stuck, as in our example. How do we make progress, and in what
sense? In this case, we keep the matching the same but update the prices.

Define S ⊆ V as the vertices at even levels. Define N(S) ⊆ W as the neighbors of S via
tight edges, i.e.,

    N(S) = {w : ∃v ∈ S with (v, w) tight}.    (4)

We claim that N(S) is precisely the set of vertices that appear in the odd levels of the search tree.
In proof, first note that every vertex at an odd level is (by construction/BFS) adjacent via a
tight edge to a vertex at the previous (even) level. For the converse, every vertex w ∈ N(S)
must be reached in the search, because (by basic properties of graph search) the search can
only get stuck if there are no unexplored edges out of any even vertex.
The set S is a good set, in that it satisfies:

1. S contains an unmatched vertex;

2. every vertex of N(S) is matched in M to a vertex of S (since the search failed, every
vertex in an odd level is matched to some vertex at the next (even) level).
See also Figure 9.
Figure 9: S = {1, 2, 3, 4} is an example of a good set, with N(S) = {5, 6}. Only black edges
are tight (i.e., (4, 7) is not tight). The matching edges are dashed.
Having found such a good set S, the Hungarian algorithm updates prices as follows.
Price Update Step

given a good set S, with neighbors via tight edges N(S)
for all v ∈ S do
    increase p_v by ∆
for all w ∈ N(S) do
    decrease p_w by ∆
// ∆ is as large as possible, subject to invariants
Prices in S (on the left-hand side) are increased, while prices in N(S) (on the right-hand
side) are decreased by the same amount. How does this affect the reduced cost of each edge
of G (Figure 9)?

1. for an edge (v, w) with v ∉ S and w ∉ N(S), the prices of v, w are unchanged, so c_vw^p
is unchanged;

2. for an edge (v, w) with v ∈ S and w ∈ N(S), the sum of the prices of v, w is unchanged
(one increased by ∆, the other decreased by ∆), so c_vw^p is unchanged;

3. for an edge (v, w) with v ∉ S and w ∈ N(S), p_v stays the same while p_w goes down by
∆, so c_vw^p goes up by ∆;

4. for an edge (v, w) with v ∈ S and w ∉ N(S), p_w stays the same while p_v goes up by
∆, so c_vw^p goes down by ∆.
So what happens with the invariants? Recalling (*) from Section 4.3, we see that edges of M
are in either the first or second category. Thus they stay tight, and the second invariant
remains satisfied. The first invariant is endangered by edges in the fourth category, whose
reduced costs are dropping with ∆.3 By the definition of N(S), edges in this category are
not tight. So we increase ∆ to the largest-possible value subject to the first invariant — the
first point at which the reduced cost of some edge in the fourth category is zeroed out.4
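Computing the largest safe ∆ is then a one-liner: it is the minimum reduced cost over the fourth-category edges. A Python sketch, exercised on the state just before the last price update of the Section 4.7 example (at that point (v, w) and (x, w) are tight, while (v, y) and (x, y) have reduced costs 1 and 2):

```python
def price_update_delta(reduced_cost, S, N):
    """Largest Delta preserving invariant 1: the minimum reduced cost over
    edges (v, w) with v in S and w outside N(S). Assumes such an edge
    exists (footnote 4 rules out the alternative)."""
    return min(rc for (v, w), rc in reduced_cost.items()
               if v in S and w not in N)

rc = {("v", "w"): 0, ("x", "w"): 0, ("v", "y"): 1, ("x", "y"): 2}
delta = price_update_delta(rc, S={"v", "x"}, N={"w"})
```

With these numbers the update uses ∆ = 1, matching the walkthrough in Section 4.7.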
Every price update makes progress, in the sense that it strictly increases the size of the search
tree. To see this, suppose a price update causes the edge (v, w) to become tight (with v ∈ S,
w ∈/ N(S)). What happens in the next iteration, when we search from the same vertex r
for a good path? All edges in the previous search tree fall in the second category, and hence
are again tight in the next iteration. Thus, the search procedure will regrow exactly the
same search tree as before, will again reach the vertex v, and now will also explore along the
newly tight edge (v, w), which adds the additional vertex w ∈ W to the tree. This can only
happen n times in a row before finding a good path, since there are only n vertices in W.
³ Edges in the third category might go from tight to non-tight, but these edges are not in M (every vertex of
N(S) is matched to a vertex of S) and so no invariant is violated.

⁴ A detail: how do we know that such an edge exists? If not, then all neighbors of S in G (via tight edges
or not) belong to N(S). The two properties of good sets imply that |N(S)| < |S|. But this violates Hall's
condition for perfect matchings (Lecture #4), contradicting our standing assumption that G has at least one
perfect matching.
4.5 The Hungarian Algorithm (All in One Place)
The Hungarian Algorithm

set M = ∅
set p_v = 0 for all v ∈ V ∪ W
while M is not a perfect matching do
    level 0 of search tree T = the first unmatched vertex r of V
    while not stuck and no other unmatched vertex found do
        if next level i is odd then
            define level i of T from level i − 1 via BFS
            // i.e., neighbors of level i − 1 not already seen
        else if next level i is even then
            define level i of T as the vertices matched in M to vertices at level i − 1
    if T contains an unmatched vertex w ∈ W then
        let P denote the r-w path in T
        replace M by M ⊕ P
    else
        let S denote the vertices of T in even levels
        let N(S) denote the vertices of T in odd levels
        for all v ∈ S do
            increase p_v by ∆
        for all w ∈ N(S) do
            decrease p_w by ∆
        // ∆ is as large as possible, subject to invariants
return M
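For concreteness, here is one way the pseudocode above might be rendered in Python. This is an unoptimized sketch of my own, not official course code; it assumes nonnegative costs (so the initialization satisfies invariant 1) and that a perfect matching exists, and it is exercised below on the Section 4.7 example.

```python
def hungarian(V, W, cost):
    """Minimum-cost perfect bipartite matching via the invariant-maintaining
    scheme above. cost maps (v, w) pairs (v in V, w in W) to nonnegative
    costs; a perfect matching is assumed to exist."""
    p = {u: 0 for u in list(V) + list(W)}      # prices, all initially 0
    partner = {}                               # vertex -> its match (both sides)

    def rc(v, w):                              # reduced cost, as in (1)
        return cost[(v, w)] - p[v] - p[w]

    while len(partner) < len(V) + len(W):      # until M is perfect
        r = next(v for v in V if v not in partner)
        while True:                            # search; update prices when stuck
            parent = {r: None}
            even, odd = {r}, set()             # S and N(S)
            frontier, goal = [r], None
            while frontier and goal is None:
                nxt = []
                for v in frontier:
                    for w in W:
                        if (v, w) in cost and rc(v, w) == 0 and w not in odd:
                            odd.add(w)
                            parent[w] = v
                            if w not in partner:
                                goal = w       # found a good path r ~> w
                                break
                            u = partner[w]     # next even level: w's match
                            parent[u] = w
                            even.add(u)
                            nxt.append(u)
                    if goal is not None:
                        break
                frontier = nxt
            if goal is not None:               # augment: toggle the r-goal path
                w = goal
                while w is not None:
                    v = parent[w]
                    partner[w], partner[v] = v, w
                    w = parent[v]              # skip over the old matched edge
                break
            # Stuck: raise prices in S, lower in N(S), by the largest safe Delta
            # (footnote 4 guarantees a fourth-category edge exists).
            delta = min(rc(v, w) for (v, w) in cost if v in even and w not in odd)
            for v in even:
                p[v] += delta
            for w in odd:
                p[w] -= delta
    return {(partner[w], w) for w in W}

# The Section 4.7 instance (edge costs read off Figure 10):
V, W = ["v", "x"], ["w", "y"]
cost = {("v", "w"): 2, ("v", "y"): 3, ("x", "w"): 5, ("x", "y"): 7}
M = hungarian(V, W, cost)
```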
4.6 Running Time
Since M can only contain n edges, there can only be n iterations that find a good path.
Since the search tree can only contain n vertices of W, there can only be n price updates
between iterations that find good paths. Computing the search tree (and hence P or S and
N(S)) and ∆ (if necessary) can be done in O(m) time. This gives a running time bound of
O(mn²). See Problem Set #2 for an implementation with running time O(mn log n).
4.7 Example
We reinforce the algorithm via an example. Consider the graph in Figure 10.
Figure 10: Example graph. Initially, all prices are 0.
We initialize all prices to 0 and the current matching to the empty set. Initially, there
are no tight edges, so there is certainly no good path. The search for such a path gets stuck
where it starts, at vertex v. So S = {v} and N(S) = ∅. We execute a price update step,
raising the price of v to 2, at which point the edge (v, w) becomes tight. Next iteration,
the search starts at v, explores the tight edge (v, w), and encounters vertex w, which is
unmatched. Thus this edge is added to the current matching. Next iteration, a new search
starts from the only remaining unmatched vertex on the left (x). It has no tight incident
edges, so the search gets stuck immediately, with S = {x} and N(S) = ∅. We thus do a price
update step, with ∆ = 5, at which point the edge (x, w) becomes newly tight. Note that
the edges (v, y) and (x, y) have reduced costs 1 and 2, respectively, so neither is tight. Next
iteration, the search from x explores the incident tight edge (x, w). If w were unmatched,
we could stop the search and add the edge (x, w). But w is already matched, to v, so w and
v are placed at levels 1 and 2 of the search tree. v has no tight incident edges other than
to w, so the search gets stuck here, with S = {x, v} and N(S) = {w}. So we do another
price update step, increasing the prices of x and v by ∆ and decreasing the price of w by ∆.
With ∆ = 1, the reduced cost of edge (v, y) gets zeroed out. The final iteration discovers
the good path x → w → v → y. Augmenting on this path yields the minimum-cost perfect
matching {(v, y), (x, w)}.
CS261: A Second Course in Algorithms
Lecture #6: Generalizations of Maximum Flow and
Bipartite Matching∗
Tim Roughgarden†
January 21, 2016
1 Fundamental Problems in Combinatorial Optimization
Figure 1: Web of six fundamental problems in combinatorial optimization. The ones covered
thus far are in red. Each arrow points from a problem to a generalization of that problem.
We started the course by studying the maximum flow problem and the closely related s-t
cut problem. We observed (Lecture #4) that the maximum-cardinality bipartite matching
problem can be viewed as a special case of the maximum flow problem (Figure 1). We
then generalized the former problem to include edge costs, which seemed to give a problem
incomparable to maximum flow.
The inquisitive student might be wondering the following:

1. Is there a natural common generalization of the maximum flow and minimum-cost
bipartite matching problems?

2. What's up with graph matching in non-bipartite graphs?
The answer to the first question is “yes,” and it’s a problem known as minimum-cost flow
(Figure 1). For the second question, there is a nice theory of graph matchings in non-
bipartite graphs, both for the maximum-cardinality and minimum-cost cases, although the
theory is more difficult and the algorithms are slower than in the bipartite case. This lecture
introduces the three new problems in Figure 1 and some essential facts you should know
about them. The six problems in Figure 1, along with the minimum spanning tree and
shortest path problems that you already know well from CS161, arguably form the complete
list of the most fundamental problems in combinatorial optimization, the study of efficiently
optimizing over large collections of discrete structures.
The main takeaways from this lecture's high-level discussion are:
1. You should know about the existence of the minimum-cost flow and non-bipartite
matching problems. They do come up in applications, if somewhat less frequently
than the problems studied in the first five lectures.

2. There are reasonably efficient algorithms for all of these problems, if a bit slower than
the state-of-the-art algorithms for the problems discussed previously. We won't discuss
running times in any detail, but think of roughly O(mn) or O(n³) as a typical time
bound of a smart algorithm for these problems.

3. The algorithms and analysis for these problems follow exactly the same principles that
you've been studying in previous lectures. They use optimality conditions, various
progress measures, well-chosen invariants, and so on. So you're well-positioned to
study these problems, and algorithms for them, deeply in another course or on your own.
Indeed, if CS261 were a semester-long course, we would cover this material in detail
over the next 4-5 lectures. (Alas, it will be time to move on to linear programming.)
2 The Minimum Cost Flow Problem
An instance of the minimum-cost flow problem consists of the following ingredients:

• a directed graph G = (V, E);
• a source s ∈ V and sink t ∈ V ;
• a target flow value d;
• a nonnegative capacity u_e for each edge e ∈ E;
• a real-valued cost c_e for each edge e ∈ E.
The goal is to compute a flow f with value d — that is, pushing d units of flow from s to
t, subject to the usual conservation and capacity constraints — that minimizes the overall
cost

    Σ_{e∈E} c_e f_e.    (1)

Note that, for each edge e, we think of c_e as a “per-flow unit” cost, so with f_e units of flow
the contribution of edge e to the overall cost is c_e f_e.¹
There are two differences with the maximum flow problem. The important one is that
now every edge has a cost. (In maximum flow, one can think of all the costs being 0.) The
second difference, which is artificial, is that we specified a specific amount of flow d to send.
There are multiple other equivalent formulations of the minimum-cost flow problem. For
example, one can ask for the maximum flow with the minimum cost. Alternatively, instead
of having a source s and sink t, one can ask for a “circulation” — meaning a flow that
satisfies conservation constraints at every vertex of V — with the minimum cost (in the
sense of (1)).2
Impressively, the minimum-cost flow problem captures three different problems that
you've studied as special cases.

1. Shortest paths. Suppose you are given a “black box” that quickly does minimum-cost
flow computations, and you want to compute the shortest path between some s
and some t in a directed graph with edge costs. The black box is expecting a flow
value d and edge capacities u_e (in addition to G, s, t, and the edge costs); we just
set d = 1 and u_e = 1 (say) for every edge e. An integral minimum-cost flow in this
network will be a shortest path from s to t (why?).

2. Maximum flow. Given an instance of the maximum flow problem, we need to define
d and edge costs before feeding the input into our minimum-cost flow black box. The
edge costs should presumably be set to 0. Then, to compute the maximum flow value,
we can just use binary search to find the largest value of d for which the black box
returns a feasible solution.

3. Minimum-cost perfect bipartite matching. The reduction here is the same as
that from maximum-cardinality bipartite matching to maximum flow (Lecture #4) —
the edge costs just carry over. The value d should be set to n, the number of vertices
on each side of the bipartite graph (why?).
¹ If there is no flow of value d, then an algorithm should report this fact. Note this is easy to check with
a single maximum flow computation.

² Of course, if all edge costs are nonnegative, then the all-zero solution is optimal. But with negative
cycles, this is a nontrivial problem.
Problem Set #2 explores various aspects of minimum-cost flows. Like the other problems we've studied, there are nice optimality conditions for minimum-cost flows. First, one
extends the notion of a residual network to networks with costs — the only twist is that
if an edge (w, v) of the residual network is the reverse edge corresponding to (v, w) ∈ E,
then the cost c_wv should be set to −c_vw. (Which makes sense, given that reverse edges
correspond to “undo” operations.) Then, a flow with value d is minimum-cost if and only
if the corresponding residual network has no negative cycle. This then suggests a simple
“cycle-canceling” algorithm, analogous to the Ford-Fulkerson algorithm. Polynomial-time
algorithms can be designed using the same ideas we used for maximum flow in Lectures
#2 and #3 and Problem Set #1 (blocking flows, push-relabel, scaling, etc.). There are
algorithms along these lines with running time roughly O(mn) that are also quite fast in
practice. (Theoretically, it is also known how to do a bit better.) In general, you should be
happy if a problem that you care about reduces to the minimum-cost flow problem.
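To make the cycle-canceling idea concrete, here is a Python sketch of my own (not from the lecture): given a feasible flow, it repeatedly finds a negative-cost cycle in the residual network with Bellman-Ford and pushes flow around it. It assumes integer capacities and a feasible starting flow; real implementations are far more careful about running time.

```python
def find_negative_cycle(nodes, edges):
    """edges: list of (u, v, cost). Returns a list of indices into `edges`
    forming a negative-cost cycle, or None. Standard Bellman-Ford trick:
    if some edge still relaxes after enough rounds, walk parent pointers
    back onto the cycle."""
    dist = {u: 0 for u in nodes}              # implicit source to every node
    parent = {u: None for u in nodes}
    x = None
    for _ in range(len(nodes) + 1):
        x = None
        for i, (u, v, c) in enumerate(edges):
            if dist[u] + c < dist[v]:
                dist[v] = dist[u] + c
                parent[v] = i
                x = v
        if x is None:
            return None                       # converged: no negative cycle
    for _ in range(len(nodes)):               # ensure x lies on the cycle itself
        x = edges[parent[x]][0]
    cycle, v = [], x
    while True:
        cycle.append(parent[v])
        v = edges[parent[v]][0]
        if v == x:
            return cycle

def cycle_cancel(nodes, cap, cost, flow):
    """Drive a feasible flow (dict: edge -> units) to minimum cost by
    canceling negative cycles in the residual network. Reverse residual
    edges carry cost -c, matching the "undo" interpretation above."""
    while True:
        # (u, v, cost, key, sign): pushing along (u, v) does flow[key] += sign
        res = []
        for (u, v) in cap:
            key = (u, v)
            if flow[key] < cap[key]:
                res.append((u, v, cost[key], key, +1))   # forward residual edge
            if flow[key] > 0:
                res.append((v, u, -cost[key], key, -1))  # reverse residual edge
        cyc = find_negative_cycle(nodes, [(u, v, c) for (u, v, c, _, _) in res])
        if cyc is None:
            return flow                                  # optimal: no negative cycle
        chosen = [res[i] for i in cyc]
        amt = min(cap[k] - flow[k] if s > 0 else flow[k]
                  for (_, _, _, k, s) in chosen)
        for (_, _, _, k, s) in chosen:
            flow[k] += s * amt

# Reroute one unit off an expensive path: start with flow on s->b->t (cost 20).
cap = {("s", "a"): 1, ("a", "t"): 1, ("s", "b"): 1, ("b", "t"): 1}
cost = {("s", "a"): 1, ("a", "t"): 1, ("s", "b"): 10, ("b", "t"): 10}
flow = {("s", "a"): 0, ("a", "t"): 0, ("s", "b"): 1, ("b", "t"): 1}
opt = cycle_cancel(["s", "a", "b", "t"], cap, cost, flow)
```

Canceling the cycle s → a → t → b → s (cost 1 + 1 − 10 − 10 = −18) moves the unit onto the cheap path.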
3 Non-Bipartite Matching

3.1 Maximum-Cardinality Non-Bipartite Matching
In the general (non-bipartite) matching problem, the input is an undirected graph G =
(V, E), not necessarily bipartite. The goal is to compute a matching (as before, a subset
M ⊆ E with no shared endpoints) with the largest cardinality. Recall that the simplest
non-bipartite graphs are odd cycles (Figure 2).
Figure 2: Example of a non-bipartite graph: an odd cycle.
A priori, it is far from obvious that the general graph matching problem is solvable in
polynomial time (as opposed to being NP-hard). It appears to be significantly more difficult
than the special case of bipartite matching. For example, there does not seem to be a natural
reduction from non-bipartite matching to the maximum flow problem. Once again, we need
to develop algorithms and strategies for correctness from scratch.
The non-bipartite matching problem admits some remarkable optimality conditions. For
motivation, what is the maximum size of a matching in the graph in Figure 3? There are 16
vertices, so clearly a matching has at most 8 edges. It's easy to exhibit a matching of size 6
(Figure 3), but can we do better?
Figure 3: Example graph. A matching of size 6 is denoted by dashed edges.
Here’s one way to argue that there is no better matching. In each of the 5 triangles, at
most 2 of the 3 vertices can be matched to each other. This leaves at least five vertices,
one from each triangle, that, if matched, can only be matched to the center vertex. The
center vertex can only be matched to one of these five, so every matching leaves at least four
vertices unmatched. This translates to matching at most 12 vertices, and hence containing
at most 6 edges.
In general, we have the following.

Lemma 3.1 In every graph G = (V, E), the maximum cardinality of a matching is at most

    (1/2) · min_{S⊆V} [|V| − (oc(S) − |S|)],    (2)

where oc(S) denotes the number of odd-size connected components in the graph G \ S.
Note that G \ S consists of the pieces left over after ripping the vertices in S out of the graph
G (Figure 4).
Figure 4: Suppose removing S results in 4 connected components, A, B, C, and D. If 3 of
them are odd-sized, then oc(S) = 3.
For example, in Figure 3, we effectively took S to be the center vertex, so oc(S) = 5
(since G \ S is the five triangles) and (2) is (1/2)(16 − (5 − 1)) = 6. The proof is a straightforward
generalization of our earlier argument.
Proof of Lemma 3.1: Fix S ⊆ V. For every odd-size connected component C of G \ S,
at least one vertex of C is not matched to another vertex of C. These oc(S) vertices
can only be matched to vertices of S (if two vertices of C_1 and C_2 could be matched to
each other, then C_1 and C_2 would not be separate connected components of G \ S). Thus,
every matching leaves at least oc(S) − |S| vertices unmatched, and hence matches at most
|V| − (oc(S) − |S|) vertices, and hence has at most (1/2)(|V| − (oc(S) − |S|)) edges. Ranging
over all choices of S ⊆ V yields the upper bound in (2). ∎
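Both oc(S) and the bound (2) are easy to compute for a given choice of S. Here is a minimal Python sketch (function names are ours, not from the lecture) that checks the bound on a Figure 3-style graph: a center vertex joined to one corner of each of five triangles.

```python
from collections import deque

def tutte_berge_bound(n, edges, S):
    """Upper bound (2) from Lemma 3.1 for one choice of S:
    (1/2) * (n - (oc(S) - |S|)), where oc(S) counts the odd-size
    connected components of the graph with the vertices in S removed."""
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    removed = set(S)
    seen = set(removed)
    odd = 0
    for start in range(n):
        if start in seen:
            continue
        # BFS over one connected component of G \ S
        comp, q = 0, deque([start])
        seen.add(start)
        while q:
            u = q.popleft()
            comp += 1
            for w in adj[u] - removed:
                if w not in seen:
                    seen.add(w)
                    q.append(w)
        odd += comp % 2
    return (n - (odd - len(S))) / 2

# Figure 3-style graph: center vertex 0 plus five triangles,
# one corner of each triangle joined to the center.
edges = []
for t in range(5):
    a, b, c = 1 + 3*t, 2 + 3*t, 3 + 3*t
    edges += [(a, b), (b, c), (a, c), (0, a)]

# Taking S = {center} recovers the bound 6 from the text;
# S = empty set gives only the trivial bound 8.
print(tutte_berge_bound(16, edges, [0]))  # → 6.0
print(tutte_berge_bound(16, edges, []))   # → 8.0
```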
Lemma 3.1 is an analog of the fact that a maximum flow is at most the value of a
minimum s-t cut. We can think of (2) as the best upper bound that we can prove if we
restrict ourselves to “obvious obstructions” to large matchings. Certainly, if we ever find a
matching with size equal to (2), then no other matching could be bigger. But can there be a
gap between the maximum size of a matching and the upper bound in (2)? Could there be
obstructions to large matchings more subtle than the simple parity argument used to prove
Lemma 3.1? One of the more beautiful theorems in combinatorics asserts that there can
never be a gap.
Theorem 3.2 (Tutte-Berge Formula) In Lemma 3.1, equality always holds:
    max matching size = min_{S⊆V} (1/2) [ |V| − (oc(S) − |S|) ].
The original proof of the Tutte-Berge formula is via induction, and does not seem to lead
to an efficient algorithm.³ In 1965, Edmonds gave the first polynomial-time algorithm for
computing a maximum-cardinality matching.⁴ Since the algorithm is guaranteed to produce
a matching with cardinality equal to (2), Edmonds' algorithm provides an algorithmic proof
of the Tutte-Berge formula.

³Tutte characterized the graphs with perfect matchings in the 1940s; in the 1950s, Berge extended this
characterization to prove Theorem 3.2.
A key challenge in non-bipartite matching is searching for a good path to use to increase
the size of the current matching. Recall that in the Hungarian algorithm (Lecture #5), we
used the bipartite assumption to argue that there’s no way to encounter both endpoints of
an edge in the current matching in the same level of the search tree. But this certainly can
happen in non-bipartite graphs, even just in the triangle. Edmonds called these odd cycles
"blossoms," and his algorithm is often called the "blossom algorithm." When a blossom is
encountered, it's not clear how to proceed with the search. Edmonds' idea was to "shrink,"
meaning contract, a blossom when one is found. The blossom becomes a super-vertex in
the new (smaller) graph, and the algorithm can continue. All blossoms are uncontracted in
reverse order at the end of the algorithm.⁵
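Edmonds' full algorithm is beyond these notes, but the contraction step itself is simple. A minimal sketch (adjacency-set representation; names are ours) of shrinking a blossom into a super-vertex:

```python
def contract(adj, blossom, super_v):
    """Contract the vertex set `blossom` into one super-vertex `super_v`.
    adj: dict mapping each vertex to its set of neighbors (undirected)."""
    blossom = set(blossom)
    nbrs = set()
    for v in blossom:
        nbrs |= adj.pop(v)      # collect neighbors, remove blossom vertices
    nbrs -= blossom             # drop edges internal to the blossom
    adj[super_v] = nbrs
    for u in nbrs:              # redirect outside edges to the super-vertex
        adj[u] -= blossom
        adj[u].add(super_v)
    return adj

# Triangle {1, 2, 3} (a smallest blossom), with 3 also adjacent to 4:
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
contract(adj, {1, 2, 3}, "B")
print(adj)  # → {4: {'B'}, 'B': {4}}
```

The search for an augmenting path then continues in the smaller graph; undoing the contractions at the end recovers a matching in the original graph.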
3.2 Minimum-Cost Non-Bipartite Matching
An algorithm designer is never satisfied, always wanting better and more general solutions
to computational problems. So it’s natural to consider the graph matching problem with
both of the complications that we’ve studied so far: general (non-bipartite) graphs and edge
costs.
The minimum-cost non-bipartite matching problem is again polynomial-time solvable,
again first proved by Edmonds. From 30,000 feet, the idea is to combine the blossom-shrinking
idea above (which handles non-bipartiteness) with the vertex prices we used in Lecture #5
for the Hungarian algorithm (which handle costs). This is not as easy as it sounds, however
— it’s not clear what prices should be given to super-vertices when they are created, and
such super-vertices may need to be uncontracted mid-algorithm. With some care, however,
this idea can be made to work and yields a polynomial-time algorithm.
While polynomial-time solvable, the minimum-cost matching problem is a relatively hard
problem within the class P. State-of-the-art algorithms can handle graphs with 100s of
vertices, but graphs with 1000s of vertices are already a challenge. From your other computer
science courses, you know that in applications one often wants to handle graphs that are
bigger than this by 1–6 orders of magnitude. This motivates the design of heuristics for
matching that are very fast, even if not fully correct.6
For example, the following Kruskal-like greedy algorithm is a natural one to try. For
convenience, we work with the equivalent maximum-weight version of the problem (each edge
has a weight w_e; the goal is to compute the matching with the largest total weight).

⁴In this remarkable paper, titled "Paths, Trees, and Flowers," Edmonds defines the class of polynomial-
time solvable problems and conjectures that the traveling salesman problem is not in the class (i.e., that
P ≠ NP). Keep in mind that NP-completeness wasn't defined (by Cook and Levin) until 1971.
⁵Your instructor covered this algorithm in last year's CS261, in honor of the algorithm's 50th anniversary.
It takes two lectures, however, and has been cut this year in favor of other topics.
⁶In the last part of the course, we explore this idea in the context of approximation algorithms for
NP-hard problems. It's worth remembering that for sufficiently large data sets, approximation is the most
appropriate solution even for problems that are polynomial-time solvable.
Greedy Matching Algorithm

sort and rename the edges E = {1, 2, . . . , m} so that w_1 ≥ w_2 ≥ · · · ≥ w_m
M = ∅
for i = 1 to m do
    if e_i shares no endpoint with any edge in M then
        add e_i to M
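The pseudocode above translates directly to Python. A minimal sketch (the edge representation is our choice, not from the lecture):

```python
def greedy_matching(edges):
    """Greedy algorithm for maximum-weight matching: scan edges from
    heaviest to lightest, adding an edge whenever both endpoints are free.
    edges: list of (weight, u, v) tuples."""
    matched = set()   # endpoints used so far
    M = []
    for w, u, v in sorted(edges, reverse=True):  # heaviest first
        if u not in matched and v not in matched:
            M.append((u, v))
            matched |= {u, v}
    return M

# The path a-b-c-d from Figure 5, with the middle edge slightly heavier:
eps = 0.1
edges = [(1, "a", "b"), (1 + eps, "b", "c"), (1, "c", "d")]
# Greedy grabs (b, c) for weight 1 + eps; the optimum takes the two
# outer edges for weight 2 — roughly the 50% worst case from the text.
print(greedy_matching(edges))  # → [('b', 'c')]
```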
Figure 5: The greedy algorithm picks the edge (b, c), while the optimal matching consists of
(a, b) and (c, d).
A simple example (Figure 5) shows that, at least for some graphs, the greedy algorithm
can produce a matching with weight only 50% of the maximum possible. On Problem Set
#2 you will prove that there are no worse examples — for every (non-bipartite) graph and
edge weights, the matching output by the greedy algorithm has weight at least 50% of the
maximum possible. Just over the past few years, new matching approximation algorithms
have been developed, and it's now possible to get a (1 − ε)-approximation in O(m) time, for
any constant ε > 0 (the hidden constant in the "big-oh" depends on 1/ε) [?].
CS261: A Second Course in Algorithms
Lecture #7: Linear Programming: Introduction and
Applications∗
Tim Roughgarden†
January 26, 2016
1 Preamble
With this lecture we commence the second part of the course, on linear programming, with
an emphasis on applications and duality theory.¹ We'll spend a fair amount of quality time
with linear programs for two reasons.
First, linear programming is very useful algorithmically, both for proving theorems and
for solving real-world problems.
Linear programming is a remarkable sweet spot between power/generality and
computational efficiency.
For example, all of the problems studied in previous lectures can be viewed as special cases
of linear programming, and there are also zillions of other examples. Despite this generality,
linear programs can be solved efficiently, both in theory (meaning in polynomial time) and
in practice (with input sizes up into the millions).
Even when a computational problem that you care about does not reduce directly to
solving a linear program, linear programming is an extremely helpful subroutine to have in
your pocket. For example, in the fourth and last part of the course, we’ll design approx-
imation algorithms for NP-hard problems that use linear programming in the algorithm
and/or analysis. In practice, probably most of the cycles spent on solving linear programs
are in service of solving integer programs (which are generally NP-hard). State-of-the-art
∗©2016, Tim Roughgarden.
†Department of Computer Science, Stanford University, 474 Gates Building, 353 Serra Mall, Stanford,
CA 94305. Email: tim@cs.stanford.edu.
¹The term "programming" here is not meant in the same sense as computer programming (linear program-
ming pre-dates modern computers). It's in the same spirit as "television programming," meaning assembling
a schedule of planned activities. (See also "dynamic programming.")
algorithms for the latter problem invoke a linear programming solver over and over again to
make consistent progress.
Second, linear programming is conceptually useful — understanding it, and especially
LP duality, gives you the “right way” to think about a host of different problems in a simple
and consistent way. For example, the optimality conditions we’ve studied in past lectures
(like the max-flow/min-cut theorem and Hall’s theorem) can be viewed as special cases of
linear programming duality. LP duality is more or less the ultimate answer to the question
“how do we know when we’re done?” As such, it’s extremely useful for proving that an
algorithm is correct (or approximately correct).
We’ll talk about both these aspects of linear programming at length.
2 How to Think About Linear Programming

2.1 Comparison to Systems of Linear Equations
Once upon a time, in some course you may have forgotten, you learned about linear systems
of equations. Such a system consists of m linear equations in real-valued variables x_1, . . . , x_n:

    a_{11} x_1 + a_{12} x_2 + · · · + a_{1n} x_n = b_1
    a_{21} x_1 + a_{22} x_2 + · · · + a_{2n} x_n = b_2
                        ⋮
    a_{m1} x_1 + a_{m2} x_2 + · · · + a_{mn} x_n = b_m.
The a_{ij}'s and the b_i's are given; the goal is to check whether or not there are values for the
x_j's such that all m constraints are satisfied. You learned at some point that this problem
can be solved efficiently, for example by Gaussian elimination. By “solved” we mean that
the algorithm returns a feasible solution, or correctly reports that no feasible solution exists.
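As a quick refresher, here is a minimal Gaussian elimination sketch in Python (partial pivoting only, square nonsingular systems assumed; names are ours, not from the lecture):

```python
def solve_linear_system(A, b):
    """Solve Ax = b by Gaussian elimination with partial pivoting.
    A: list of n rows (square, assumed nonsingular); b: right-hand side."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]   # augmented matrix [A | b]
    for col in range(n):
        # pivot: swap in the row with the largest entry in this column
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):                # eliminate below the pivot
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                 # back-substitution
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

# x1 + 2*x2 = 5 and 3*x1 - x2 = 1 have the unique solution x1 = 1, x2 = 2:
print(solve_linear_system([[1, 2], [3, -1]], [5, 1]))  # ≈ [1.0, 2.0]
```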
Here’s an issue, though: what about inequalities? For example, recall the maximum flow
problem. There are conservation constraints, which are equations and hence OK. But the
capacity constraints are fundamentally inequalities. (There is also the constraint that flow
values should be nonnegative.) Inequalities are part of the problem description of many
other problems that we’d like to solve. The point of linear programming is to solve systems
of linear equations and inequalities. Moreover, when there are multiple feasible solutions, we
would like to compute the “best” one.
2.2 Ingredients of a Linear Program
There is a convenient and flexible language for specifying linear programs, and we’ll get lots
of practice using it during this lecture. Sometimes it’s easy to translate a computational
problem into this language, sometimes it takes some tricks (we’ll see examples of both).
To specify a linear program, you need to declare what’s allowed and what you want.
Ingredients of a Linear Program

1. Decision variables x_1, . . . , x_n ∈ R.

2. Linear constraints, each of the form

       Σ_{j=1}^{n} a_j x_j (∗) b_i,

   where (∗) could be ≤, ≥, or =.

3. A linear objective function, of the form

       max Σ_{j=1}^{n} c_j x_j    or    min Σ_{j=1}^{n} c_j x_j.
Several comments. First, the a_{ij}'s, b_i's, and c_j's are constants, meaning they are part of the
input, numbers hard-wired into the linear program (like 5, −1, 10, etc.). The x_j's are free, and
it is the job of a linear programming algorithm to figure out the best values for them. Second,
when specifying constraints, there is no need to make use of both "≤" and "≥" inequalities
— one can be transformed into the other just by multiplying all the coefficients by −1 (the
a_{ij}'s and b_i's are allowed to be positive or negative). Similarly, equality constraints are
superfluous, in that the constraint that a quantity equals b_i is equivalent to the pair of
inequality constraints stating that the quantity is both at least b_i and at most b_i. Finally,
there is also no difference between the "min" and "max" cases for the objective function
— one is easily converted into the other just by multiplying all the c_j's by −1 (the c_j's are
allowed to be positive or negative).
So what's not allowed in a linear program? Terms like x_j², x_j x_k, log(1 + x_j), etc. So
whenever a decision variable appears in an expression, it is alone, possibly multiplied by
a constant (and then summed with other such terms). While these linearity requirements
may seem restrictive, we'll see that many real-world problems can be formulated as or well
approximated by linear programs.
2.3 A Simple Example

Figure 1: a toy example of a linear program.
To make linear programs more concrete and develop your geometric intuition about them,
let's look at a toy example. (Many "real" examples of linear programs are coming shortly.)
Suppose there are two decision variables x_1 and x_2 — so we can visualize solutions as
points (x_1, x_2) in the plane. See Figure 1. Let's consider the (linear) objective function of
maximizing the sum of the decision variables:

    max x_1 + x_2.

We'll look at four (linear) constraints:

    x_1 ≥ 0
    x_2 ≥ 0
    2x_1 + x_2 ≤ 1
    x_1 + 2x_2 ≤ 1.
The first two inequalities restrict feasible solutions to the non-negative quadrant of the
plane. The second two inequalities further restrict feasible solutions to lie in the shaded
region depicted in Figure 1. Geometrically, the objective function asks for the feasible
point furthest in the direction of the coefficient vector (1, 1) — the "most northeastern"
feasible point. Put differently, the level sets of the objective function are parallel lines
running northwest to southeast.² Eyeballing the feasible region, this point is (1/3, 1/3), for an
optimal objective function value of 2/3. This is the "last point of intersection" between a
level set of the objective function and the feasible region (as one sweeps from southwest to
northeast).

²Recall that a level set of a function g has the form {x : g(x) = c}, for some constant c. That is, all
points in a level set have equal objective function value.
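The "optimum at a vertex" picture can be checked by brute force on the toy example: enumerate the intersection points of pairs of constraint boundaries, keep the feasible ones, and take the best. A small sketch (names are ours):

```python
from itertools import combinations

# Toy LP from Section 2.3: max x1 + x2 subject to
#   x1 >= 0, x2 >= 0, 2*x1 + x2 <= 1, x1 + 2*x2 <= 1.
# Each constraint is stored in the form a1*x1 + a2*x2 <= b:
cons = [(-1, 0, 0), (0, -1, 0), (2, 1, 1), (1, 2, 1)]

def feasible(x1, x2, eps=1e-9):
    return all(a1*x1 + a2*x2 <= b + eps for a1, a2, b in cons)

# Candidate optima: intersections of pairs of constraint boundaries
# (i.e., the vertices of the feasible region, plus some infeasible points).
best = None
for (a1, a2, b), (c1, c2, d) in combinations(cons, 2):
    det = a1*c2 - a2*c1
    if det == 0:
        continue  # parallel boundaries never meet
    x1 = (b*c2 - a2*d) / det
    x2 = (a1*d - b*c1) / det
    if feasible(x1, x2) and (best is None or x1 + x2 > best[0]):
        best = (x1 + x2, x1, x2)

print(best)  # best value 2/3, attained at the vertex (1/3, 1/3)
```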
2.4 Geometric Intuition
While it’s always dangerous to extrapolate from two or three dimensions to an arbitrary
number, the geometric intuition above remains valid for general linear programs, with an ar-
bitrary number of dimensions (i.e., decision variables) and constraints. Even though we can’t
draw pictures when there are many dimensions, the relevant algebra carries over without any
difficulties. Specifically:
1. A linear constraint in n dimensions corresponds to a halfspace in R^n. Thus a feasible
   region is an intersection of halfspaces, the higher-dimensional analog of a polygon.³

2. The level sets of the objective function are parallel (n − 1)-dimensional hyperplanes in
   R^n, each orthogonal to the coefficient vector c of the objective function.

3. The optimal solution is the feasible point furthest in the direction of c (for a maximiza-
   tion problem) or −c (for a minimization problem). Equivalently, it is the last point of
   intersection (traveling in the direction c or −c) of a level set of the objective function
   and the feasible region.

4. When there is a unique optimal solution, it is a vertex (i.e., "corner") of the feasible
   region.
There are a few edge cases which can occur but are not especially important in CS261.

1. There might be no feasible solutions at all. For example, if we add the constraint
   x_1 + x_2 ≥ 1 to our toy example, then there are no longer any feasible solutions. Linear
   programming algorithms correctly detect when this case occurs.

2. The optimal objective function value is unbounded (+∞ for a maximization problem,
   −∞ for a minimization problem). Note that a necessary but not sufficient condition for
   this case is that the feasible region is unbounded. For example, if we dropped the
   constraints 2x_1 + x_2 ≤ 1 and x_1 + 2x_2 ≤ 1 from our toy example, then it would have
   unbounded objective function value. Again, linear programming algorithms correctly
   detect when this case occurs.

3. The optimal solution need not be unique, as a "side" of the feasible region might
   be parallel to the level sets of the objective function. Whenever the feasible region
   is bounded, however, there always exists an optimal solution that is a vertex of the
   feasible region.⁴

³A finite intersection of halfspaces is also called a "polyhedron;" in the common special case where the
feasible region is bounded, it is called a "polytope."
⁴There are some annoying edge cases for unbounded feasible regions, for example the linear program
max(x_1 + x_2) subject to x_1 + x_2 = 1.
3 Some Applications of Linear Programming
Zillions of problems reduce to linear programming. It would take an entire course to cover
even just its most famous applications. Some of these applications are conceptually a bit
boring but still very important — as early as the 1940s, the military was using linear pro-
gramming to figure out the most efficient way to ship supplies from factories to where they
were needed.5 Several central problems in computer science reduce to linear programming,
and we describe some of these in detail in this section. Throughout, keep in mind that all
of these linear programs can be solved efficiently, both in theory and in practice. We’ll say
more about algorithms for linear programming in a later lecture.
3.1 Maximum Flow
If we return to the definition of the maximum flow problem in Lecture #1, we see that it
translates quite directly to a linear program.

1. Decision variables: what are we trying to solve for? A flow, of course. Specifically, the
   amount f_e of flow on each edge e. So our variables are just {f_e}_{e∈E}.

2. Constraints: Recall we have conservation constraints and capacity constraints. We
   can write the former as

       Σ_{e∈δ⁻(v)} f_e  −  Σ_{e∈δ⁺(v)} f_e  =  0
        (flow in)          (flow out)

   for every vertex v ≠ s, t.⁶ We can write the latter as

       f_e ≤ u_e

   for every edge e ∈ E. Since decision variables of linear programs are by default allowed
   to take on arbitrary real values (positive or negative), we also need to remember to
   add nonnegativity constraints:

       f_e ≥ 0

   for every edge e ∈ E. Observe that every one of these 2m + n − 2 constraints (where
   m = |E| and n = |V|) is linear — each decision variable f_e only appears by itself (with
   a coefficient of 1 or −1).

3. Objective function: We just copy the same one we used in Lecture #1:

       max Σ_{e∈δ⁺(s)} f_e.

   Note that this is again a linear function.

⁵Note that this is well before computer science was a field; for example, Stanford's Computer Science Department
was founded only in 1965.
⁶Recall that δ⁻(v) and δ⁺(v) denote the edges incoming to and outgoing from v, respectively.
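To make the translation concrete, here is a small sketch (graph, flow, and names are illustrative, not from the lecture) that checks the three constraint families against a claimed flow on a four-edge graph:

```python
# Edges of a small graph: s -> a -> t and s -> b -> t, with capacities u_e.
edges = [("s", "a", 2), ("a", "t", 1), ("s", "b", 1), ("b", "t", 2)]

def is_feasible_flow(f, edges, s="s", t="t"):
    """Check the LP constraints from Section 3.1: 0 <= f_e <= u_e for
    every edge, and flow conservation at every vertex other than s, t."""
    if any(not (0 <= f[i] <= u) for i, (_, _, u) in enumerate(edges)):
        return False
    verts = {v for e in edges for v in e[:2]} - {s, t}
    for v in verts:
        inflow = sum(f[i] for i, (_, head, _) in enumerate(edges) if head == v)
        outflow = sum(f[i] for i, (tail, _, _) in enumerate(edges) if tail == v)
        if inflow != outflow:
            return False
    return True

def value(f, edges, s="s"):
    """The LP objective: total flow out of the source s."""
    return sum(f[i] for i, (tail, _, _) in enumerate(edges) if tail == s)

f = [1, 1, 1, 1]          # one unit of flow on each of the two s-t paths
print(is_feasible_flow(f, edges), value(f, edges))  # True 2
```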
3.2 Minimum-Cost Flow
In Lecture #6 we introduced the minimum-cost flow problem. Extending specialized maximum
flow algorithms to this more general problem takes non-trivial work (see Problem
Set #2 for starters). If we're just using linear programming, however, the generalization
is immediate.⁷ The main change is in the objective function. As defined last lecture, it is
simply

    min Σ_{e∈E} c_e f_e,

where c_e is the cost of edge e. Since the c_e's are fixed numbers (i.e., part of the input), this
is a linear objective function.
For the version of the minimum-cost flow problem defined last lecture, we should also
add the constraint

    Σ_{e∈δ⁺(s)} f_e = d,

where d is the target flow value. (One can also add the analogous constraint for t, but this
is already implied by the other constraints.)
To further highlight how flexible linear programs can be, suppose we want to impose a
lower bound ℓ_e (other than 0) on the amount of flow on each edge e, in addition to the
usual upper bound u_e. This is trivial to accommodate in our linear program — just replace
"f_e ≥ 0" by "f_e ≥ ℓ_e."⁸
3.3 Fitting a Line
We now consider two less obvious applications of linear programming, to basic problems in
machine learning. We first consider the problem of fitting a line to data points (i.e., linear
regression), perhaps the simplest non-trivial machine learning problem.

Formally, the input consists of m data points p^1, . . . , p^m ∈ R^d, each with d real-valued
"features" (i.e., coordinates).⁹ For example, perhaps d = 3, and each data point corresponds
to a 3rd-grader, listing the household income, number of owned books, and number of years
of parental education. Also part of the input is a "label" ℓ_i ∈ R for each point p^i.¹⁰ For
example, ℓ_i could be the score earned by the 3rd-grader in question on a standardized test.
We reiterate that the p^i's and ℓ_i's are fixed (part of the input), not decision variables.
⁷While linear programming is a reasonable way to solve the maximum flow and minimum-cost flow
problems, especially if the goal is to have a "quick and dirty" solution, the best specialized algorithms
for these problems are generally faster.
⁸If you prefer to use flow algorithms, there is a simple reduction from this problem to the special case
with ℓ_e = 0 for all e ∈ E (do you see it?).
⁹Feel free to take d = 1 throughout the rest of the lecture, which is already a practically relevant and
computationally interesting case.
¹⁰This is a canonical "supervised learning" problem, meaning that the algorithm is provided with labeled
data.
Informally, the goal is to express the ℓ_i's as well as possible as a linear function of the
p^i's. That is, the goal is to compute a linear function h : R^d → R such that h(p^i) ≈ ℓ_i for
every data point i.
The two most common motivations for computing a “best-fit” linear function are pre-
diction and data analysis. In the first scenario, one uses labeled data to identify a linear
function h that, at least for these data points, does a good job of predicting the label `i
from the feature values pi. The hope is that this linear function “generalizes,” meaning that
it also makes accurate predictions for other data points for which the label is not already
known. There is a lot of beautiful and useful theory in statistics and machine learning about
when one can and cannot expect a hypothesis to generalize, which you’ll learn about if you
take courses in those areas. In the second scenario, the goal is to understand the relationship
between each feature of the data points and the labels, and also the relationships between
the different features. As a simple example, it’s clearly interesting to know when one of the d
features is much more strongly correlated with the label `i than any of the others.
We now show that computing the best line, for one definition of "best," reduces to linear
programming. Recall that every linear function h : R^d → R has the form

    h(z) = Σ_{j=1}^{d} a_j z_j + b

for some coefficients a_1, . . . , a_d and intercept b. (This is one of several equivalent definitions
of a linear function.¹¹) So it's natural to take a_1, . . . , a_d, b as our decision variables.

What's our objective function? Clearly if the data points are collinear we want to compute
the line that passes through all of them. But this will never happen, so we must compromise
between how well we approximate different points.
For a given choice of a_1, . . . , a_d, b, define the error on point i as

    E_i(a, b) = | (Σ_{j=1}^{d} a_j p_j^i + b) − ℓ_i |,    (1)

where the parenthesized term is the prediction h(p^i) and ℓ_i is the "ground truth."
Geometrically, when d = 1, we can think of each (p^i, ℓ_i) as a point in the plane, and (1) is
just the vertical distance between this point and the computed line.
In this lecture, we consider the objective function of minimizing the sum of errors:

    min_{a,b} Σ_{i=1}^{m} E_i(a, b).    (2)

This is not the most common objective for linear regression; more standard is minimizing the
squared error Σ_{i=1}^{m} E_i(a, b)². While our motivation for choosing (2) is primarily pedagogical,
this objective is reasonable and is sometimes used in practice. The advantage over squared
error is that it is more robust to outliers. Squaring the error of an outlier makes it a squeakier
wheel. That is, a stray point (e.g., from a faulty sensor or a data entry error) will influence the
line chosen under (2) less than it would with the squared error objective (Figure 2).¹²

¹¹Sometimes people use "linear function" to mean the special case where b = 0, and "affine function" for
the case of arbitrary b.
Figure 2: When there is an outlier (red point), using the objective function defined
in (2) causes the best-fit line not to "stray" as far from the non-outliers (blue line) as
when using the squared error objective (red line), because the squared error objective
penalizes the chosen line more heavily when it is far from the outlier.
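The robustness claim is easy to see numerically. A small sketch (data and candidate lines are illustrative only, not from the lecture): with one outlier, the line through the other points wins under the sum-of-absolute-errors objective (2) but loses under squared error:

```python
# 1-D data (p_i, l_i); the last point is a gross outlier.
pts = [(0, 0.0), (1, 1.0), (2, 2.0), (3, 3.0), (4, 20.0)]

def total_errors(a, b):
    """Sum of absolute errors (objective (2)) and sum of squared errors
    for the line l = a*p + b on the data above."""
    abs_err = sum(abs(a*p + b - l) for p, l in pts)
    sq_err = sum((a*p + b - l)**2 for p, l in pts)
    return abs_err, sq_err

line_fit = total_errors(1, 0)    # the line through the four non-outliers
line_drag = total_errors(4, -2)  # a steeper line dragged toward the outlier
# line_fit has the smaller absolute error (16 vs 20), but the much larger
# squared error (256 vs 106) — squared error prefers the dragged line.
print(line_fit, line_drag)
```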
Consider the problem of choosing a, b to minimize (2). (Since the a_j's and b can be
anything, there are no constraints.) The problem: this is not a linear program. The source
of nonlinearity is the absolute value sign | · | in (1). Happily, in this case and many others,
absolute values can be made linear with a simple trick.

The trick is to introduce extra variables e_1, . . . , e_m, one per data point. The intent is for
e_i to take on the value E_i(a, b). Motivated by the identity |x| = max{x, −x}, we add two
constraints for each data point:
    e_i ≥ (Σ_{j=1}^{d} a_j p_j^i + b) − ℓ_i    (3)

and

    e_i ≥ − [ (Σ_{j=1}^{d} a_j p_j^i + b) − ℓ_i ].    (4)
¹²Squared error can be minimized efficiently using an extension of linear programming known as convex
programming. (For the present "ordinary least squares" version of the problem, it can even be solved
analytically, in closed form.) We may discuss convex programming in a future lecture.
We change the objective function to

    min Σ_{i=1}^{m} e_i.    (5)

Note that optimizing (5) subject to all constraints of the form (3) and (4) is a linear program,
with decision variables e_1, . . . , e_m, a_1, . . . , a_d, b.
The key point is: at an optimal solution to this linear program, it must be that e_i =
E_i(a, b) for every data point i. Feasibility of the solution already implies that e_i ≥ E_i(a, b) for
every i. And if e_i > E_i(a, b) for some i, then we can decrease e_i slightly, so that (3) and (4)
still hold, to obtain a superior feasible solution. We conclude that an optimal solution to
this linear program represents the line minimizing the sum of errors (2).
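The trick can be sanity-checked in a few lines (d = 1; names are ours): for a fixed line, the smallest e_i allowed by (3) and (4) together is exactly the error E_i(a, b):

```python
# The identity behind constraints (3) and (4): for any real r,
# |r| = max(r, -r), so "e >= r and e >= -r" is the same as "e >= |r|".
for r in [-2.5, 0.0, 3.7]:
    assert max(r, -r) == abs(r)

def smallest_feasible_e(a, b, p, l):
    """Smallest e_i satisfying (3) and (4) for a fixed line (d = 1);
    this equals E_i(a, b), which objective (5) drives each e_i down to."""
    r = (a * p + b) - l        # prediction minus ground truth
    return max(r, -r)          # = |r| = E_i(a, b)

print(smallest_feasible_e(2.0, 1.0, 3.0, 10.0))  # → 3.0, i.e. |2*3 + 1 - 10|
```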
3.4 Computing a Linear Classifier
Figure 3: We want to find a linear function that separates the positive points (plus signs)
from the negative points (minus signs).
Next we consider a second fundamental problem in machine learning, that of learning a
linear classifier.¹³ While in Section 3.3 we sought a real-valued function (from R^d to R),
here we're looking for a binary function (from R^d to {0, 1}). For example, data points could
represent images, and we want to know which ones contain a cat and which ones don't.

Formally, the input consists of m "positive" data points p^1, . . . , p^m ∈ R^d and m′ "negative"
data points q^1, . . . , q^{m′} ∈ R^d. In the terminology of the previous section, all of the labels
are "1" or "0," and we have partitioned the data accordingly. (So this is again a supervised
learning problem.)

¹³Also called halfspaces, perceptrons, linear threshold functions, etc.
The goal is to compute a linear function h(z) = Σ_{j=1}^{d} a_j z_j + b (from R^d to R) such that

    h(p^i) > 0    (6)

for all positive points and

    h(q^i) < 0    (7)

for all negative points. Geometrically, we are looking for a hyperplane in R^d such that all positive
points are on one side and all negative points on the other; the coefficients a_j specify the
normal vector of the hyperplane and the intercept b specifies its shift. See Figure 3. Such a
hyperplane can be used for predicting the labels of other, unlabeled points (check which side
of the hyperplane the point is on and predict that it is positive or negative, accordingly). If there is
no such hyperplane, an algorithm should correctly report this fact.
This problem almost looks like a linear program by definition. The only issue is that
the constraints (6) and (7) are strict inequalities, which are not allowed in linear programs.
Again, the simple trick of adding an extra decision variable solves the problem. The new
decision variable δ represents the "margin" by which the hyperplane satisfies (6) and (7). So
we solve

    max δ

subject to

    Σ_{j=1}^{d} a_j p_j^i + b − δ ≥ 0    for all positive points p^i
    Σ_{j=1}^{d} a_j q_j^i + b + δ ≤ 0    for all negative points q^i,

which is a linear program with decision variables δ, a_1, . . . , a_d, b. If the optimal solution
to this linear program has strictly positive objective function value, then the values of the
variables a_1, . . . , a_d, b define the desired separating hyperplane. If not, then there is no such
hyperplane. We conclude that computing a linear classifier reduces to linear programming.
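Given a candidate hyperplane, the margin δ it achieves on the data is easy to compute directly; a minimal sketch (points and coefficients are illustrative, not from the lecture):

```python
# Margin check for a candidate hyperplane h(z) = a·z + b, with d = 2.
pos = [(2, 2), (3, 1)]       # positive points
neg = [(0, 0), (-1, 1)]      # negative points
a, b = (1.0, 1.0), -2.5      # candidate: h(z) = z1 + z2 - 2.5

def h(z):
    return sum(ai * zi for ai, zi in zip(a, z)) + b

# The largest delta satisfying the two constraint families for this
# fixed (a, b): h(p) - delta >= 0 for positives, h(q) + delta <= 0
# for negatives.
delta = min(min(h(p) for p in pos), min(-h(q) for q in neg))
print(delta > 0)  # True: this hyperplane separates the two point sets
```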
3.5 Extension: Minimizing Hinge Loss
There is an obvious issue with the problem setup in Section 3.4: what if the data set is not
as nice as the picture in Figure 3, and there is no separating hyperplane? This is usually the
case in practice, for example if the data is noisy (as it always is). Even if there’s no perfect
hyperplane, we’d still like to compute something that we can use to predict the labels of
unlabeled points.
We outline two ways to extend the linear programming approach in Section 3.4 to handle
non-separable data.¹⁴ The first idea is to compute the hyperplane that minimizes some notion
of "classification error." After all, this is what we did in Section 3.3, where we computed
the line minimizing the sum of the errors.

¹⁴In practice, these two approaches are often combined.
Probably the most natural plan would be to compute the hyperplane that puts the
fewest points on the wrong side of the hyperplane — that minimizes the number
of inequalities of the form (6) or (7) that are violated. Unfortunately, this is an NP-hard
problem, and one typically uses notions of error that are more computationally tractable.
Here, we'll discuss the widely used notion of hinge loss.
Let's say that in a perfect world, we would like a linear function h such that

    h(p^i) ≥ 1    (8)

for all positive points p^i and

    h(q^i) ≤ −1    (9)

for all negative points q^i; the "1" here is somewhat arbitrary, but we need to pick some
constant for the purposes of normalization. The hinge loss incurred by a linear function h on
a point is just the extent to which the corresponding inequality (8) or (9) fails to hold. For a
positive point p^i, this is max{1 − h(p^i), 0}; for a negative point q^i, this is max{1 + h(q^i), 0}.
Note that taking the maximum with zero ensures that we don't reward a linear function for
classifying a point "extra-correctly." Geometrically, when d = 1, the hinge loss is the vertical
distance that a data point would have to travel to be on the correct side of the hyperplane,
with a "buffer" of 1 between the point and the hyperplane.
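The definition translates to a one-liner; a small sketch with d = 1 (data and the candidate line are illustrative):

```python
def hinge_loss(h, pos, neg):
    """Total hinge loss of a linear function h on labeled points, as
    defined above: max{1 - h(p), 0} for positives, max{1 + h(q), 0}
    for negatives."""
    return (sum(max(1 - h(p), 0) for p in pos)
            + sum(max(1 + h(q), 0) for q in neg))

h = lambda z: 2 * z - 1          # candidate line, d = 1
pos, neg = [2.0, 0.9], [0.0, 0.6]
# Contributions: 0 (well classified), 0.2 (inside the buffer),
#                0 (well classified), 1.2 (on the wrong side).
print(hinge_loss(h, pos, neg))   # ≈ 1.4
```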
Computing the linear function that minimizes the total hinge loss can be formulated as a
linear program. While hinge loss is not linear, it is just the maximum of two linear functions.
So by introducing one extra variable and two extra constraints per data point, just like in
Section 3.3, we obtain the linear program

    min Σ_{i=1}^{m} e_i

subject to:

    e_i ≥ 1 − (Σ_{j=1}^{d} a_j p_j^i + b)    for every positive point p^i
    e_i ≥ 1 + (Σ_{j=1}^{d} a_j q_j^i + b)    for every negative point q^i
    e_i ≥ 0                                  for every point

in the decision variables e_1, . . . , e_m, a_1, . . . , a_d, b.
3.6 Extension: Increasing the Dimension
Figure 4: The points are not linearly separable, but they can be separated by a quadratic
curve.
A second approach to dealing with non-linearly-separable data is to use nonlinear boundaries.
E.g., in Figure 4, the positive and negative points cannot be separated perfectly by any line,
but they can be separated by a relatively simple boundary (e.g., of a quadratic function).
But how can we allow nonlinear boundaries while retaining the computational tractability
of our previous solutions?
The key idea is to generate extra features (i.e., dimensions) for each data point. That
is, for some dimension d′ ≥ d and some function ϕ : R^d → R^{d′}, we map each p^i to ϕ(p^i)
and each q^i to ϕ(q^i). We'll then try to separate the images of these points in d′-dimensional
space using a linear function.¹⁵
A concrete example of such a function ϕ is the map

    (z_1, . . . , z_d) → (z_1, . . . , z_d, z_1^2, . . . , z_d^2, z_1 z_2, z_1 z_3, . . . , z_{d−1} z_d);        (10)

that is, each data point is expanded with the squares and all of the pairwise products of its
features. This map is interesting even when d = 1:

    z → (z, z^2).        (11)
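The expansion map is simple enough to implement directly. This sketch follows (10), appending squares and then pairwise products (the function name `phi` is ours):

```python
from itertools import combinations

# The map (10): original features, then squares, then all pairwise
# products z_i * z_j with i < j.
def phi(z):
    squares = [zi * zi for zi in z]
    pairs = [zi * zj for zi, zj in combinations(z, 2)]
    return list(z) + squares + pairs

print(phi([3]))       # [3, 9]           -- the d = 1 special case (11)
print(phi([1, 2]))    # [1, 2, 1, 4, 2]  -- (z1, z2, z1^2, z2^2, z1*z2)
```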
Our goal is now to compute a linear function in the expanded space, meaning coefficients
a_1, . . . , a_{d′} and an intercept b, that separates the positive and negative points:

    Σ_{j=1}^{d′} a_j ϕ(p^i)_j + b > 0        (12)

for all positive points and

    Σ_{j=1}^{d′} a_j ϕ(q^i)_j + b < 0        (13)

for all negative points. Note that if the new feature set includes all of the original features,
as in (10), then every hyperplane in the original d-dimensional space remains available in
the expanded space (just set a_{d+1}, a_{d+2}, . . . , a_{d′} = 0). But there are also many new options,
and hence it is more likely that there is a way to perfectly separate the (images under ϕ of
the) data points. For example, even with d = 1 and the map (11), linear functions in the
expanded space have the form h(z) = a_1 z^2 + a_2 z + b, which is a quadratic function in the
original space.

15 This is the basic idea behind "support vector machines;" see CS229 for much more on the topic.
We can think of the map ϕ as being applied in a preprocessing step. Then, the resulting
problem of meeting all the constraints (12) and (13) is exactly the problem that we already
solved in Section 3.4. The resulting linear program has decision variables δ, a_1, . . . , a_{d′}, b
(d′ + 2 in all, up from d + 2 in the original space).16

16 The magic of support vector machines is that, for many maps ϕ, including (10) and (11), and for many
methods of computing a separating hyperplane, the computation required scales only with the original
dimension d, even if the expanded dimension d′ is radically larger. This is known as the "kernel trick;" see
CS229 for more details.
CS261: A Second Course in Algorithms
Lecture #8: Linear Programming Duality (Part 1)∗
Tim Roughgarden†
January 28, 2016
1 Warm-Up
This lecture begins our discussion of linear programming duality, which is really the
heart and soul of CS261. It is the topic of this lecture, the next lecture, and (as will become
clear) pretty much all of the succeeding lectures as well.
Recall from last lecture the ingredients of a linear program: decision variables, linear
constraints (equalities or inequalities), and a linear objective function. Last lecture we saw
that lots of interesting problems in combinatorial optimization and machine learning reduce
to linear programming.
Figure 1: A toy example to illustrate duality.
To start getting a feel for linear programming duality, let’s begin with a toy example. It
is a minor variation on our toy example from last time. There are two decision variables x1
and x2 and we want to
    max x1 + x2        (1)

subject to

    4x1 + x2 ≤ 2        (2)
    x1 + 2x2 ≤ 1        (3)
    x1 ≥ 0              (4)
    x2 ≥ 0.             (5)

(Last lecture, the first constraint of our toy example read 2x1 + x2 ≤ 1; everything else is
the same.)
Like last lecture, we can solve this LP just by eyeballing the feasible region (Figure 1)
and searching for the "most northeastern" feasible point, which in this case is the vertex
(i.e., "corner") at (3/7, 2/7). Thus the optimal objective function value is 5/7.
When we go beyond three dimensions (i.e., decision variables), it seems hopeless to solve
linear programs by inspection. With a general linear program, even if we are handed on a
silver platter an allegedly optimal solution, how do we know that it really is optimal?
Let’s try to answer this question at least in our toy example. What’s an easy and
convincing proof that the optimal objective function value of the linear program can’t be
too large? For starters, for any feasible point (x , x ), we certainly have
1
2
x + x ≤ 4x + x ≤
objective
2
,
|
1 {z }2
1
2
|
{z}
upper bound
with the first inequality following from x ≥ 0 and the second from the first constraint. We
1
can immediately conclude that the optimal value of the linear program is at most 2. But
actually, it’s obvious that we can do better by using the second constraint instead:
x + x ≤ x + 2x ≤ 1,
1
2
1
2
giving us a better (i.e., smaller) upper bound of 1. Can we do better? There's no reason
we need to stop at using just one constraint at a time; we are free to blend two or more
constraints. The best blending takes 1/7 of the first constraint and 3/7 of the second to give

    x1 + x2 ≤ (1/7)(4x1 + x2) + (3/7)(x1 + 2x2) ≤ (1/7) · 2 + (3/7) · 1 = 5/7,        (6)

where the second inequality uses 4x1 + x2 ≤ 2 (by (2)) and x1 + 2x2 ≤ 1 (by (3)).
(The first inequality actually holds with equality, but we don't need the stronger statement.)
So this is a convincing proof that the optimal objective function value is at most 5/7. Given
the feasible point (3/7, 2/7) that actually does realize this upper bound, we can conclude that
5/7 really is the optimal value for the linear program.

Summarizing, for the linear program (1)–(5), there is a quick and convincing proof that
the optimal solution has value at least 5/7 (namely, the feasible point (3/7, 2/7)) and also such a
proof that the optimal solution has value at most 5/7 (given in (6)). This is the essence of
linear programming duality.
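Both halves of this argument can be checked mechanically. The following sketch verifies, in exact arithmetic, that (3/7, 2/7) is feasible and that the multipliers (1/7, 3/7) certify the matching upper bound of 5/7:

```python
from fractions import Fraction as F

A = [[4, 1], [1, 2]]            # constraint matrix of (2)-(3)
b = [F(2), F(1)]
c = [F(1), F(1)]                # objective: max x1 + x2

x = [F(3, 7), F(2, 7)]          # claimed optimal point
y = [F(1, 7), F(3, 7)]          # claimed optimal multipliers

# x is feasible: Ax <= b and x >= 0
assert all(sum(A[i][j] * x[j] for j in range(2)) <= b[i] for i in range(2))
assert all(xj >= 0 for xj in x)
# y proves a valid upper bound: y >= 0 and the blend dominates the objective
assert all(yi >= 0 for yi in y)
assert all(sum(y[i] * A[i][j] for i in range(2)) >= c[j] for j in range(2))

obj = sum(c[j] * x[j] for j in range(2))        # value achieved by x
bound = sum(b[i] * y[i] for i in range(2))      # upper bound proved by y
print(obj, bound)   # 5/7 5/7
```

Since the achieved value and the proved upper bound coincide, both x and y are optimal.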
2 The Dual Linear Program
We now generalize the ideas of the previous section. Consider an arbitrary linear program
(call it (P)) of the form

    max Σ_{j=1}^n c_j x_j                    (7)

subject to

    Σ_{j=1}^n a_{1j} x_j ≤ b_1              (8)
    Σ_{j=1}^n a_{2j} x_j ≤ b_2              (9)
        · · ·                                (10)
    Σ_{j=1}^n a_{mj} x_j ≤ b_m              (11)
    x_1, . . . , x_n ≥ 0.                    (12)

This linear program has n nonnegative decision variables x_1, . . . , x_n and m constraints (not
counting the nonnegativity constraints). The a_{ij}'s, b_i's, and c_j's are all part of the input
(i.e., fixed constants).1
You may have forgotten your linear algebra, but it’s worth paging the basics back in
when learning linear programming duality. It’s very convenient to write linear programs in
matrix-vector notation. For example, the linear program above translates to the succinct
description
    max c^T x
    subject to
    Ax ≤ b
    x ≥ 0,
1 Remember that different types of linear programs are easily transformed to each other. A minimization
objective can be turned into a maximization objective by multiplying all c_j's by −1. An equality constraint
can be simulated by two inequality constraints. An inequality constraint can be flipped by multiplying by
−1. Real-valued decision variables can be simulated by the difference of two nonnegative decision variables.
An inequality constraint can be turned into an equality constraint by adding an extra "slack" variable.
where c and x are n-vectors, b is an m-vector, A is an m × n matrix (of the a_{ij}'s), and the
inequalities are componentwise.
Remember our strategy for deriving upper bounds on the optimal objective function
value of our toy example: take a nonnegative linear combination of the constraints that
(componentwise) dominates the objective function. In general, for the above linear program
with m constraints, we denote by y_1, . . . , y_m ≥ 0 the corresponding multipliers that we use.
The goal of dominating the objective function translates to the conditions

    Σ_{i=1}^m y_i a_{ij} ≥ c_j        (13)

for each objective function coefficient (i.e., for j = 1, 2, . . . , n). In matrix notation, we are
interested in nonnegative m-vectors y ≥ 0 such that

    A^T y ≥ c;

note the sum in (13) is over the rows i of A, which corresponds to an inner product with the
jth column of A, or equivalently with the jth row of A^T.
By design, every such choice of multipliers y1, . . . , ym implies an upper bound on the
optimal objective function value of the linear program (7)–(12): for every feasible solution
(x_1, . . . , x_n),

    Σ_{j=1}^n c_j x_j  ≤  Σ_{j=1}^n (Σ_{i=1}^m y_i a_{ij}) x_j        (14)
                       =  Σ_{i=1}^m y_i · (Σ_{j=1}^n a_{ij} x_j)      (15)
                       ≤  Σ_{i=1}^m y_i b_i,                           (16)

where the left-hand side is x's objective function value and the final expression is the
promised upper bound. In this derivation, inequality (14) follows from the domination
condition in (13) and the nonnegativity of x_1, . . . , x_n; equation (15) follows from reversing
the order of summation; and inequality (16) follows from the feasibility of x and the
nonnegativity of y_1, . . . , y_m.

Alternatively, the derivation may be more transparent in matrix-vector notation:

    c^T x ≤ (A^T y)^T x = y^T (Ax) ≤ y^T b.

The upshot is that, whenever y ≥ 0 and (13) holds,

    OPT of (P) ≤ Σ_{i=1}^m b_i y_i.
In our toy example of Section 1, the first upper bound of 2 corresponds to taking y1 = 1
and y2 = 0. The second upper bound of 1 corresponds to y1 = 0 and y2 = 1. The final upper
bound of 5/7 corresponds to y1 = 1/7 and y2 = 3/7.
Our toy example illustrates that there can be many different ways of choosing the yi’s,
and different choices lead to different upper bounds on the optimal value of the linear pro-
gram (P). Obviously, the most interesting of these upper bounds is the tightest (i.e., smallest)
one. So we really want to range over all possible y’s and consider the minimum such upper
bound.2
Here’s the key point: the tightest upper bound on OPT is itself the optimal solution to a
linear program. Namely:
    min Σ_{i=1}^m b_i y_i

subject to

    Σ_{i=1}^m a_{i1} y_i ≥ c_1
    Σ_{i=1}^m a_{i2} y_i ≥ c_2
        · · ·
    Σ_{i=1}^m a_{in} y_i ≥ c_n
    y_1, . . . , y_m ≥ 0.

Or, in matrix-vector form:

    min b^T y
    subject to
    A^T y ≥ c
    y ≥ 0.

This linear program is called the dual to (P), and we sometimes denote it by (D).
For example, to derive the dual to our toy linear program, we just swap the objective
and the right-hand side and take the transpose of the constraint matrix:
    min 2y1 + y2

subject to

    4y1 + y2 ≥ 1
    y1 + 2y2 ≥ 1
    y1, y2 ≥ 0.

2 For an analogy, among all s-t cuts, each of which upper bounds the value of a maximum flow, the
minimum cut is the most interesting one (Lecture #2). Similarly, in the Tutte-Berge formula (Lecture #5),
we were interested in the tightest (i.e., minimum) upper bound of the form |V | − (oc(S) − |S|), over all
choices of the set S.
1
3
The objective function values of the feasible solutions (1, 0), (0, 1), and ( , ) (of 2, 1, and
7
7
5
7
)
correspond to our three upper bounds in Section 1.
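Each of those three upper bounds can be re-derived from its dual feasible solution; a quick check in exact arithmetic:

```python
from fractions import Fraction as F

# the three feasible solutions of the dual named in the text
duals = [(F(1), F(0)), (F(0), F(1)), (F(1, 7), F(3, 7))]

bounds = []
for y1, y2 in duals:
    # feasibility for (D): 4y1 + y2 >= 1, y1 + 2y2 >= 1, y1, y2 >= 0
    assert 4 * y1 + y2 >= 1 and y1 + 2 * y2 >= 1 and y1 >= 0 and y2 >= 0
    bounds.append(2 * y1 + y2)      # dual objective 2y1 + y2

print(bounds)   # the three upper bounds 2, 1, and 5/7 from Section 1
```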
The following important result follows from the definition of the dual and the deriva-
tion (14)–(16).
Theorem 2.1 (Weak Duality) For every linear program of the form (P) and correspond-
ing dual linear program (D),
OPT value for (P) ≤ OPT value for (D).
(17)
(Since the derivation (14)–(16) applies to any pair of feasible solutions, it holds in particular
for a pair of optimal solutions.) Next lecture we'll discuss strong duality, which asserts
that (17) always holds with equality (as long as both (P) and (D) are feasible).
3 Duality Example #1: Max-Flow/Min-Cut Revisited
This section brings linear programming duality back down to earth by relating it to an old
friend, the maximum flow problem. Last lecture we showed how this problem translates easily
to a linear program. This lecture, for convenience, we will use a different linear programming
formulation. The new linear program is much bigger but also simpler, so it is easier to take
and interpret its dual.
3.1 The Primal
The idea is to work directly with path decompositions, rather than flows. So the decision
variables have the form f_P, where P is an s-t path. Let P denote the set of all such paths.
The benefit of working with paths is that there is no need to explicitly state the conservation
constraints. We do still have the capacity (and nonnegativity) constraints, however.
    max Σ_{P∈P} f_P                                        (18)

subject to

    Σ_{P∈P : e∈P} f_P ≤ u_e        for all e ∈ E          (19)
    f_P ≥ 0                         for all P ∈ P.         (20)

(The left-hand side of (19) is the total flow on edge e.)
Again, call this (P). The optimal value of this linear program is the same as that of the linear
programming formulation of the maximum flow problem given last lecture. Every feasible
solution to (18)–(20) can be transformed into one of equal value for last lecture’s LP, just by
setting fe equal to the left-hand side of (19) for each e. For the reverse direction, one takes
a path decomposition (Problem Set #1). See Exercise Set #4 for details.
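The forward direction of this equivalence is simple enough to sketch in code; the tiny network and its path decomposition below are hypothetical:

```python
from fractions import Fraction as F

u = {('s', 'v'): 2, ('v', 't'): 1, ('s', 't'): 1}      # edge capacities
# a feasible path-based solution: f_P for each s-t path P (a tuple of edges)
f_path = {(('s', 'v'), ('v', 't')): F(1), (('s', 't'),): F(1)}

# edge flow = total flow on paths through the edge, i.e. the LHS of (19)
f_edge = {e: sum(fp for P, fp in f_path.items() if e in P) for e in u}

assert all(f_edge[e] <= u[e] for e in u)   # capacity constraints (19) hold
print(f_edge)   # every edge carries one unit of flow
```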
3.2 The Dual
The linear program (18)–(20) conforms to the format covered in Section 2, so it has a well-
defined dual. What is it? It’s usually easier to take the dual in matrix-vector notation:
    max 1^T f
    subject to
    Af ≤ u
    f ≥ 0,

where the vector f is indexed by the paths P, 1 stands for the (|P|-dimensional) all-ones
vector, u is indexed by E, and A is an E × P matrix (one row per edge, one column per
path). Then, the dual (D) has decision variables indexed by E (denoted {ℓ_e}_{e∈E}, for reasons
to become clear) and is

    min u^T ℓ
    subject to
    A^T ℓ ≥ 1
    ℓ ≥ 0.
Typically, the hardest thing about understanding a dual is interpreting what the transpose
operation on the constraint matrix (A → A^T) is doing. By definition, each row (corresponding
to an edge e) of A has a 1 in the column corresponding to a path P if e ∈ P, and a 0
otherwise; that is, the entry a_{eP} of A is 1 if e ∈ P and 0 otherwise. In the column of A (and
hence row of A^T) corresponding to a path P, there is a 1 in each row corresponding to an
edge e of P (and zeroes in the other rows).
Now that we understand A^T, we can unpack the dual and write it as

    min Σ_{e∈E} u_e ℓ_e

subject to

    Σ_{e∈P} ℓ_e ≥ 1        for all P ∈ P        (21)
    ℓ_e ≥ 0                 for all e ∈ E.
3.3 Interpretation of Dual
The duals of natural linear programs are often meaningful in their own right, and this one
is a good example. A key observation is that every s-t cut corresponds to a feasible solution
to this dual linear program. To see this, fix a cut (A, B), with s ∈ A and t ∈ B, and set

    ℓ_e = 1 if e ∈ δ+(A), and ℓ_e = 0 otherwise.

(Recall that δ+(A) denotes the edges sticking out of A, with tail in A and head in B; see
Figure 2.) To verify the constraints (21) and hence feasibility for the dual linear program,
note that every s-t path must cross the cut (A, B) at some point (since it starts in A and
ends in B). Thus every s-t path has at least one edge e with ℓ_e = 1, and (21) holds. The
objective function value of this feasible solution is
    Σ_{e∈E} u_e ℓ_e = Σ_{e∈δ+(A)} u_e = capacity of (A, B),
where the second equality is by definition (recall Lecture #2).
s-t cuts correspond to one type of feasible solution to this dual linear program, where
every decision variable is set to either 0 or 1. Not all feasible solutions have this property:
any assignment of nonnegative “lengths” `e to the edges of G satisfying (21) is feasible. Note
that (21) is equivalent to the constraint that the shortest-path distance from s to t, with
respect to the edge lengths {`e}e∈E, is at least 1.3
Figure 2: δ+(A) denotes the two edges that point from A to B.
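Footnote 3's fractional solution can be checked directly; the unit capacities here are our own assumption:

```python
from fractions import Fraction as F

lengths = {('s', 'v'): F(1, 2), ('v', 't'): F(1, 2)}   # the l_e values
paths = [(('s', 'v'), ('v', 't'))]                      # all s-t paths here

# constraint (21): every s-t path has length at least 1
assert all(sum(lengths[e] for e in P) >= 1 for P in paths)

# with unit capacities u_e = 1, the dual objective sums the lengths; it
# equals 1, matching the minimum cut of this two-edge path graph
total = sum(lengths.values())
print(total)   # 1
```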
3.4 Relation to Max-Flow/Min-Cut
Summarizing, we have shown that
    max flow value = OPT of (P) ≤ OPT of (D) ≤ min cut value.        (22)
3 To give a simple example, in the graph s → v → t, one feasible solution assigns ℓ_sv = ℓ_vt = 1/2. If the
edges (s, v) and (v, t) have the same capacity, then this is also an optimal solution.
The first equation is just the statement that the maximum flow problem can be formulated as
the linear program (P). The first inequality is weak duality. The second inequality holds
because the feasible region of (D) includes all (0-1 solutions corresponding to) s-t cuts; since
it minimizes over a superset of the s-t cuts, its optimal value can be no larger than that of
the minimum cut.
In Lecture #2 we used the Ford-Fulkerson algorithm to prove the maximum flow/minimum
cut theorem, stating that there is never a gap between the maximum flow and minimum cut
values. So the first and last terms of (22) are equal, which means that both of the inequalities
are actually equalities. The fact that
OPT of (P) = OPT of (D)
is interesting because it proves a natural special case of strong duality, for flow linear pro-
grams and their duals. The fact that
OPT of (D) = min cut value
is interesting because it implies that the linear program (D), despite allowing fractional
solutions, always admits an optimal solution in which each decision variable is either 0 or 1.
3.5 Take-Aways
The example in this section illustrates three general points.

1. The duals of natural linear programs are often natural in their own right.

2. Strong duality. (We verified it in a special case, and will prove it in general next
lecture.)

3. Some natural linear programs are guaranteed to have integral optimal solutions.
4 Recipe for Taking Duals
Section 2 defines the dual linear program for primal linear programs of a specific form
(maximization objective, inequality constraints, and nonnegative decision variables). As
we’ve mentioned, different types of linear programs are easily converted to each other. So
one perfectly legitimate way to take the dual of an arbitrary linear program is to first convert
it into the form in Section 2 and then apply that definition. But it’s more convenient to be
able to take the dual of any linear program directly, using a general recipe.
The high-level points of the recipe are familiar: dual variables correspond to primal
constraints, dual constraints correspond to primal variables, maximization and minimization
get exchanged, the objective function and right-hand side get exchanged, and the constraint
matrix gets transposed. The details concern the different types of constraints (inequality vs.
equality) and whether or not decision variables are nonnegative.
Here is the general recipe for maximization linear programs:
    Primal                              Dual
    variables x_1, . . . , x_n          n constraints
    m constraints                       variables y_1, . . . , y_m
    objective function c                right-hand side c
    right-hand side b                   objective function b
    max c^T x                           min b^T y
    constraint matrix A                 constraint matrix A^T
    ith constraint is "≤"               y_i ≥ 0
    ith constraint is "≥"               y_i ≤ 0
    ith constraint is "="               y_i ∈ R
    x_j ≥ 0                             jth constraint is "≥"
    x_j ≤ 0                             jth constraint is "≤"
    x_j ∈ R                             jth constraint is "="
For minimization linear programs, we define the dual as the reverse operation (from the right
column to the left). Thus, by definition, the dual of the dual is the original primal.
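The table can be transcribed into code. This is a hedged sketch (the helper `dualize_max` is our own, not from the notes) that returns the data of the dual of a maximization LP:

```python
# Dualize max{c^T x : Ax (senses) b, x in domains}, following the table:
# primal constraint senses fix dual variable domains, and primal variable
# domains fix dual constraint senses.
def dualize_max(A, b, c, senses, domains):
    At = [list(col) for col in zip(*A)]                  # transpose A
    y_domain = {'<=': '>=0', '>=': '<=0', '=': 'R'}      # rows 7-9 of table
    y_sense = {'>=0': '>=', '<=0': '<=', 'R': '='}       # rows 10-12
    # dual: min b^T y  s.t.  A^T y (dual senses) c,  y in dual domains
    return At, c, b, [y_sense[d] for d in domains], [y_domain[s] for s in senses]

# the toy LP from Lecture #8: max x1 + x2, 4x1 + x2 <= 2, x1 + 2x2 <= 1, x >= 0
At, rhs, obj, d_senses, d_domains = dualize_max(
    [[4, 1], [1, 2]], [2, 1], [1, 1], ['<=', '<='], ['>=0', '>=0'])
print(obj, d_senses, d_domains)   # dual: min 2y1 + y2, ">="-constraints, y >= 0
```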
5 Weak Duality
The above recipe allows you to take duals in a mechanical way, without thinking about
it. This can be very useful, but don’t forget the true meaning of the dual (which holds in
all cases): feasible dual solutions correspond to bounds on the best-possible primal objective
function value (derived from taking linear combinations of the constraints), and the optimal
dual solution is the tightest-possible such bound.
If you remember the meaning of duals, then it’s clear that weak duality holds in all cases
(essentially by definition).4
Theorem 5.1 (Weak Duality) For every maximization linear program (P) and corre-
sponding dual linear program (D),
OPT value for (P) ≤ OPT value for (D);
for every minimization linear program (P) and corresponding dual linear program (D),
OPT value for (P) ≥ OPT value for (D).
Weak duality can be visualized as in Figure 3. Strong duality also holds in all cases; see next
lecture.
4 Math classes often teach mathematical definitions as if they fell from the sky. This is not representative
of how mathematics actually develops. Typically, definitions are reverse engineered so that you get the
"right" theorems (like weak/strong duality).
Figure 3: visualization of weak duality. X represents feasible solutions for P while O repre-
sents feasible solutions for D.
Weak duality already has some very interesting corollaries.
Corollary 5.2 Let (P),(D) be a primal-dual pair of linear programs.
(a) If the optimal objective function value of (P) is unbounded, then (D) is infeasible.
(b) If the optimal objective function value of (D) is unbounded, then (P) is infeasible.
(c) If x, y are feasible for (P),(D) and c^T x = y^T b, then x and y are both optimal.
Parts (a) and (b) hold because any feasible solution to the dual of a linear program offers
a bound on the best-possible objective function value of the primal (so if there is no such
bound, then there is no such feasible solution). The hypothesis in (c) asserts that Figure 3
contains an “x” and an “o” that are superimposed. It is immediate that no other primal
solution can be better, and that no other dual solution can be better. (For an analogy, in
Lecture #2 we proved that the capacity of every cut bounds from above the value of every
flow, so if you ever find a flow and a cut with equal value, both must be optimal.)
CS261: A Second Course in Algorithms
Lecture #9: Linear Programming Duality (Part 2)∗
Tim Roughgarden†
February 2, 2016
1 Recap
This is our third lecture on linear programming, and the second on linear programming
duality. Let’s page back in the relevant stuff from last lecture.
One type of linear program has the form
    max Σ_{j=1}^n c_j x_j

subject to

    Σ_{j=1}^n a_{1j} x_j ≤ b_1
    Σ_{j=1}^n a_{2j} x_j ≤ b_2
        · · ·
    Σ_{j=1}^n a_{mj} x_j ≤ b_m
    x_1, . . . , x_n ≥ 0.
Call this linear program (P), for “primal.” Alternatively, in matrix-vector notation it is
    max c^T x
subject to
Ax ≤ b
x ≥ 0,
where c and x are n-vectors, b is an m-vector, A is an m × n matrix (of the a_{ij}'s), and the
inequalities are componentwise.
We then discussed a method for generating upper bounds on the maximum-possible
objective function value of (P): take a nonnegative linear combination of the constraints
so that the result dominates the objective c, and you get an upper bound equal to the
corresponding nonnegative linear combination of the right-hand side b. A key point is that
the tightest upper bound of this form is the solution to another linear program, known as
the “dual.” We gave a general recipe for taking duals: the dual has one variable per primal
constraint and one constraint per primal variable; “max” and “min” get interchanged; the
objective function and the right-hand side get interchanged; and the constraint matrix gets
transposed. (There are some details about whether decision variables are nonnegative or
not, and whether the constraints are equalities or inequalities; see the table last lecture.)
For example, the dual linear program for (P), call it (D), is
    min y^T b
    subject to
    A^T y ≥ c
    y ≥ 0
in matrix-vector form. Or, if you prefer the expanded version,

    min Σ_{i=1}^m b_i y_i

subject to

    Σ_{i=1}^m a_{i1} y_i ≥ c_1
    Σ_{i=1}^m a_{i2} y_i ≥ c_2
        · · ·
    Σ_{i=1}^m a_{in} y_i ≥ c_n
    y_1, . . . , y_m ≥ 0.
In all cases, the meaning of the dual is the tightest upper bound that can be proved on
the optimal primal objective function value by taking suitable linear combinations of the
primal constraints. With this understanding, we see that weak duality holds (for all forms
of LPs), essentially by construction.
For example, for a primal-dual pair (P),(D) of the form above, for every pair x, y of
feasible solutions to (P),(D), we have
    Σ_{j=1}^n c_j x_j  ≤  Σ_{j=1}^n (Σ_{i=1}^m y_i a_{ij}) x_j        (1)
                       =  Σ_{i=1}^m y_i (Σ_{j=1}^n a_{ij} x_j)        (2)
                       ≤  Σ_{i=1}^m y_i b_i,                           (3)

where the left-hand side is x's objective function value and the final expression is y's
objective function value.
Or, in matrix-vector notation,

    c^T x ≤ (A^T y)^T x = y^T (Ax) ≤ y^T b.

The first inequality uses that x ≥ 0 and A^T y ≥ c; the second that y ≥ 0 and Ax ≤ b.
We concluded last lecture with the following sufficient condition for optimality.1
Corollary 1.1 Let (P),(D) be a primal-dual pair of linear programs. If x, y are feasible
solutions to (P),(D), and c^T x = y^T b, then x and y are both optimal.
To see why, recall Figure 1: no "x" can be to the right of an "o", so if an "x" and an
"o" are superimposed it must be the rightmost "x" and the leftmost "o." For an analogy,
whenever you find a flow and an s-t cut with the same value, the flow must be maximum and
the cut minimum.
Figure 1: Illustrative figure showing feasible solutions for the primal (x) and the dual (o).
1 We also noted that weak duality implies that whenever the optimal objective function value of (P) is
unbounded, the linear program (D) is infeasible, and vice versa.
2 Complementary Slackness Conditions

2.1 The Conditions
Next is a corollary of Corollary 1.1. It is another sufficient (and as we’ll see later, necessary)
condition for optimality.
Corollary 2.1 (Complementary Slackness Conditions) Let (P),(D) be a primal-dual
pair of linear programs. If x, y are feasible solutions to (P),(D), and the following two
conditions hold, then x and y are both optimal.

(1) Whenever xj ≠ 0, y satisfies the jth constraint of (D) with equality.

(2) Whenever yi ≠ 0, x satisfies the ith constraint of (P) with equality.

The conditions assert that no decision variable and corresponding constraint are simultane-
ously "slack" (i.e., they forbid a decision variable that is nonzero while its corresponding
constraint is not tight).
Proof of Corollary 2.1: We prove the corollary for the case of primal and dual programs of
the form (P) and (D) in Section 1; the other cases are all the same.

The first condition implies that

    c_j x_j = (Σ_{i=1}^m y_i a_{ij}) x_j

for each j = 1, . . . , n (either x_j = 0 or c_j = Σ_{i=1}^m y_i a_{ij}). Hence, inequality (1) holds with
equality. Similarly, the second condition implies that

    y_i (Σ_{j=1}^n a_{ij} x_j) = y_i b_i

for each i = 1, . . . , m. Hence inequality (3) also holds with equality. Thus c^T x = y^T b, and
Corollary 1.1 implies that both x and y are optimal. ∎
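For a concrete instance, the two conditions can be verified on the toy primal-dual pair from Lecture #8 (its data restated here):

```python
from fractions import Fraction as F

A = [[4, 1], [1, 2]]
b = [F(2), F(1)]
c = [F(1), F(1)]
x = [F(3, 7), F(2, 7)]          # feasible for (P)
y = [F(1, 7), F(3, 7)]          # feasible for (D)

# condition (1): x_j != 0 forces the jth dual constraint to be tight
for j in range(2):
    if x[j] != 0:
        assert sum(y[i] * A[i][j] for i in range(2)) == c[j]
# condition (2): y_i != 0 forces the ith primal constraint to be tight
for i in range(2):
    if y[i] != 0:
        assert sum(A[i][j] * x[j] for j in range(2)) == b[i]

# hence c^T x = y^T b, and both are optimal by Corollary 1.1
assert sum(c[j] * x[j] for j in range(2)) == sum(y[i] * b[i] for i in range(2))
print("complementary slackness holds")
```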
2.2 Physical Interpretation
Figure 2: Physical interpretation of complementary slackness. The objective function pushes
a particle in the direction c until it rests at x∗. Walls also exert a force on the particle;
complementary slackness asserts that only walls touching the particle exert a force, and that
the sum of all forces equals 0.
We offer the following informal physical metaphor for the complementary slackness condi-
tions, which some students find helpful (Figure 2). For a linear program of the form (P) in
Section 1, think of the objective function as exerting “force” in the direction c. This pushes
a particle in the direction c (within the feasible region) until it cannot move any further in
this direction. When the particle comes to rest at position x∗, the sum of the forces acting on
it must sum to 0. What else exerts force on the particle? The “walls” of the feasible region,
corresponding to the constraints. The direction of the force exerted by the ith constraint of
the form Σ_{j=1}^n a_{ij} x_j ≤ b_i is perpendicular to the wall, that is, −a_i, where a_i is the ith row of
the constraint matrix. We can interpret the corresponding dual variable y_i as the magnitude
of the force exerted in this direction −a_i. The assertion that the sum of the forces equals 0
corresponds to the equation c = Σ_{i=1}^m y_i a_i. The complementary slackness conditions assert
that y_i > 0 only when a_i^T x = b_i; that is, only the walls that the particle touches are
allowed to exert force on it.
2.3 A General Algorithm Design Paradigm
So why are the complementary slackness conditions interesting? One reason is that they
offer three principled strategies for designing algorithms for solving linear programs and
their special cases. Consider the following three conditions.
A General Algorithm Design Paradigm

1. x is feasible for (P).

2. y is feasible for (D).

3. x, y satisfy the complementary slackness conditions (Corollary 2.1).

Pick two of these three conditions to maintain at all times, and work
toward achieving the third.
By Corollary 2.1, we know that achieving these three conditions simultaneously implies that
both x and y are optimal. Each choice of a condition to relax offers a disciplined way
of working toward optimality, and in many cases all three approaches can lead to good
algorithms. Countless algorithms for linear programs and their special cases can be viewed
as instantiations of this general paradigm. We next revisit an old friend, the Hungarian
algorithm, which is a particularly transparent example of this design paradigm in action.
3 Example #2: The Hungarian Algorithm Revisited

3.1 Recap of Example #1
Recall that in Lecture #8 we reinterpreted the max-flow/min-cut theorem through the lens
of LP duality (this was “Example #1”). We had a primal linear program formulation of
the maximum flow problem. In the corresponding dual linear program, we observed that s-t
cuts translate to 0-1 solutions to this dual, with the dual objective function value equal to
the capacity of the cut. Using the max-flow/min-cut theorem, we concluded two interesting
properties: first, we verified strong duality (i.e., no gap between the optimal primal and dual
objective function values) for primal-dual pairs corresponding to flows and (fractional) cuts;
second, we concluded that these dual linear programs are always guaranteed to possess an
integral optimal solution (i.e., fractions don’t help).
3.2 The Primal Linear Program
Back in Lecture #7 we claimed that all of the problems studied thus far are special cases
of linear programs. For the maximum flow problem, this is easy to believe, because flows
can be fractional. But for matchings? They are supposed to be integral, so how could they
be modeled with a linear program? Example #1 provides the clue — sometimes, linear
programs are guaranteed to have an optimal integral solution. As we’ll see, this also turns
out to be the case for bipartite matching.
Given a bipartite graph G = (V ∪ W, E) with a cost c_e for each edge, the relevant linear
program (P-BM) is

    min Σ_{e∈E} c_e x_e
subject to

    Σ_{e∈δ(v)} x_e = 1        for all v ∈ V ∪ W
    x_e ≥ 0                    for all e ∈ E,
where δ(v) denotes the edges incident to v. The intended semantics is that each xe is either
equal to 1 (if e is in the chosen matching) or 0 (otherwise). Of course, the linear program is
also free to use fractional values for the decision variables.2
In matrix-vector form, this linear program is
    min c^T x
    subject to
    Ax = 1
    x ≥ 0,

where A is the (V ∪ W) × E matrix with entries

    a_ve = 1 if e ∈ δ(v), and a_ve = 0 otherwise.        (4)
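For a hypothetical 2 × 2 bipartite graph, we can build A from (4) and check that a perfect matching's 0-1 indicator vector is feasible:

```python
edges = [('v1', 'w1'), ('v1', 'w2'), ('v2', 'w1'), ('v2', 'w2')]
vertices = ['v1', 'v2', 'w1', 'w2']

# the matrix of (4): a_ve = 1 if e is incident to v, else 0
A = [[1 if v in e else 0 for e in edges] for v in vertices]

# indicator of the perfect matching {(v1, w1), (v2, w2)}
x = [1, 0, 0, 1]

# Ax = 1: every vertex is matched exactly once
assert all(sum(A[i][j] * x[j] for j in range(len(edges))) == 1
           for i in range(len(vertices)))
print("Ax = 1 holds")
```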
3.3 The Dual Linear Program
We now turn to the dual linear program. Note that (P-BM) differs from our usual form
both by having a minimization objective and by having equality (rather than inequality)
constraints. But our recipe for taking duals from Lecture #8 applies to all types of linear
programs, including this one.
When taking a dual, usually the trickiest point is to understand the effect of the transpose
operation (on the constraint matrix). In the constraint matrix A in (4), each row (indexed
by v ∈ V ∪ W) has a 1 in each column (indexed by e ∈ E) for which e is incident to v (and
0s in other columns). Thus, a column of A (and hence row of A^T) corresponding to edge e
has 1s in precisely the rows (indexed by v) such that e is incident to v — that is, in the two
rows corresponding to e’s endpoints.
Applying our recipe for duals to (P-BM), initially in matrix-vector form for simplicity,
yields

    max p^T 1
    subject to
    A^T p ≤ c
    p ∈ R^{V ∪ W}.

2 If you're tempted to also add in the constraints that x_e ≤ 1 for every e ∈ E, note that these are already
implied by the current constraints (why?).
We are using the notation p_v for the dual variable corresponding to a vertex v ∈ V ∪ W, for
reasons that will become clear shortly. Note that these decision variables can be positive
or negative, because of the equality constraints in (P-BM).
Unpacking this dual linear program, (D-BM), we get

    max Σ_{v∈V ∪W} p_v

subject to

    p_v + p_w ≤ c_vw        for all (v, w) ∈ E
    p_v ∈ R                  for all v ∈ V ∪ W.
Here’s the punchline: the “vertex prices” in the Hungarian algorithm (Lecture #5) corre-
spond exactly to the decision variables of the dual (D-BM). Indeed, without thinking about
this dual linear program, how would you ever think to maintain numbers attached to the
vertices of a graph matching instance, when the problem definition seems to only concern
the graph’s edges?3
It gets better: rewrite the constraints of (D-BM) as

    c_vw − p_v − p_w ≥ 0        (5)
for every edge (v, w) ∈ E. The left-hand side of (5) is exactly our definition in the Hungarian
algorithm of the “reduced cost” of an edge (with respect to prices p). Thus the first invariant
of the Hungarian algorithm, asserting that all edges have nonnegative reduced costs, is
exactly the same as maintaining the dual feasibility of p!
To seal the deal, let’s check out the complementary slackness conditions for the primal-
dual pair (P-BM),(D-BM). Because all constraints in (P-BM) are equations (not counting
the nonnegativity constraints), the second condition is trivial. The first condition states
that whenever xe > 0, the corresponding constraint (5) should hold with equality — that is,
edge e should have zero reduced cost. Thus the second invariant of the Hungarian algorithm
(that edges in the current matching should be “tight”) is just the complementary slackness
condition!
We conclude that, in terms of the general algorithm design paradigm in Section 2.3,
the Hungarian algorithm maintains the second two conditions (p is feasible for (D-BM)
and complementary slackness conditions) at all times, and works toward the first condition
(primal feasibility, i.e., a perfect matching). Algorithms of this type are called primal-dual
algorithms, and the Hungarian algorithm is a canonical example.
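The two invariants above are easy to state as executable checks. The following is a minimal sketch (the helper names and the tiny instance are illustrative, not from the lecture): given edge costs c, vertex prices p, and a candidate matching, it verifies dual feasibility (all reduced costs nonnegative) and complementary slackness (matched edges are tight).

```python
# Check the two Hungarian-algorithm invariants for a bipartite instance:
#   (1) dual feasibility: every edge (v, w) has nonnegative reduced cost
#       c[v, w] - p[v] - p[w] >= 0;
#   (2) complementary slackness: every matched edge is tight (reduced cost 0).
# The instance below is a made-up example for illustration.

def reduced_cost(c, p, v, w):
    """Reduced cost of edge (v, w) with respect to vertex prices p."""
    return c[(v, w)] - p[v] - p[w]

def invariants_hold(c, p, matching, eps=1e-9):
    feasible = all(reduced_cost(c, p, v, w) >= -eps for (v, w) in c)
    tight = all(abs(reduced_cost(c, p, v, w)) <= eps for (v, w) in matching)
    return feasible and tight

# V = {a, b}, W = {x, y}; edge costs c_vw.
c = {("a", "x"): 3, ("a", "y"): 1, ("b", "x"): 2, ("b", "y"): 4}
p = {"a": 1, "b": 2, "x": 0, "y": 0}          # candidate dual solution
matching = [("a", "y"), ("b", "x")]           # cost 1 + 2 = 3

print(invariants_hold(c, p, matching))        # -> True
```

Note that the matching's cost (3) equals the dual objective Σ_v p_v (3), so by the argument of Corollary 2.1 both are optimal.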
3 In Lecture #5 we motivated vertex prices via an analogy with the vertex labels maintained by the push-relabel maximum flow algorithm. But the latter is from the 1980s and the former from the 1950s, so that was a pretty ahistorical analogy. Linear programming (and duality) were only developed in the late 1940s, and so it was a new subject when Kuhn designed the Hungarian algorithm. But he was one of the first masters of the subject, and he put his expertise to good use.
3.4 Consequences
We know that

    OPT of (D-BM) ≤ OPT of (P-BM) ≤ cost of the min-cost perfect matching.    (6)
The first inequality is just weak duality (for the case where the primal linear program has
a minimization objective). The second inequality follows from the fact that every perfect
matching corresponds to a feasible (0-1) solution of (P-BM); since the linear program min-
imizes over a superset of these solutions, it can only have a better (i.e., smaller) optimal
objective function value.
In Lecture #5 we proved that the Hungarian algorithm always terminates with a perfect
matching (provided there is at least one). The algorithm maintains a feasible dual and the
complementary slackness conditions. As in the proof of Corollary 2.1, this implies that the
cost of the constructed perfect matching equals the dual objective function value attained
by the final prices. That is, both inequalities in (6) must hold with equality.
As in Example #1 (max flow/min cut), both of these equalities are interesting. The first
equation verifies another special case of strong LP duality, for linear programs of the form
(P-BM) and (D-BM). The second equation provides another example of a natural family of
linear programs — those of the form (P-BM) — that are guaranteed to have 0-1 optimal
solutions.4
4 Strong LP Duality

4.1 Formal Statement
Strong linear programming duality (“no gap”) holds in general, not just for the special cases
that we’ve seen thus far.
Theorem 4.1 (Strong LP Duality) When a primal-dual pair (P),(D) of linear programs
are both feasible,
OPT for (P) = OPT for (D).
Amazingly, our simple method of deriving bounds on the optimal objective function value of
(P) through suitable linear combinations of the constraints is always guaranteed to produce
the tightest-possible bound! Strong duality can be thought of as a generalization of the max-
flow/min-cut theorem (Lecture #2) and Hall’s theorem (Lecture #5), and as the ultimate
answer to the question “how do we know when we’re done?”5
4 See also Exercise Set #4 for a direct proof of this.
5 When at least one of (P),(D) is infeasible, there are three possibilities, all of which can occur. First, (P) might have unbounded objective function value, in which case (by weak duality) (D) is infeasible. It is also possible that (P) is infeasible while (D) has unbounded objective function value. Finally, sometimes both (P) and (D) are infeasible (an uninteresting case).
4.2 Consequent Optimality Conditions
Strong duality immediately implies that the sufficient conditions for optimality identified
earlier (Corollaries 1.1 and 2.1) are also necessary conditions — that is, they are optimality
conditions in the sense derived earlier for the maximum flow and minimum-cost perfect
bipartite matching problems.
Corollary 4.2 (LP Optimality Conditions) Let x, y be feasible solutions to the primal-dual pair (P),(D). Then

    x, y are both optimal ⟺ c^T x = y^T b ⟺ the complementary slackness conditions hold.
The first if and only if follows from strong duality: since both (P),(D) are feasible by assumption, strong duality assures us of feasible solutions x*, y* with c^T x* = (y*)^T b. If x, y fail to satisfy this equality, then either c^T x is worse than c^T x* or y^T b is worse than (y*)^T b (or both). The second if and only if does not require strong duality; it follows from the proof of Corollary 2.1 (see also Exercise Set #4).
4.3 Proof Sketch: The Road Map
We conclude the lecture with a proof sketch of Theorem 4.1. Our proof sketch leaves some
details to Problem Set #3, and also takes on faith one intuitive geometric fact. The goal of
the proof sketch is to at least partially demystify strong LP duality, and convince you that
it ultimately boils down to some simple geometric intuition.
Here’s the plan:
    separating hyperplane (will assume) ⟹ Farkas's Lemma (will prove) ⟹ strong LP duality (PSet #3).
The “separating hyperplane theorem” is the intuitive geometric fact that we assume (Sec-
tion 4.4). Section 4.5 derives from this fact Farkas’s Lemma, a “feasibility version” of strong
LP duality. Problem Set #3 asks you to reduce strong LP duality to Farkas’s Lemma.
4.4 The Separating Hyperplane Theorem
In Lecture #7 we discussed separating hyperplanes, in the context of separating data points
labeled “positive” from those labeled “negative.” There, the point was to show that the
computational problem of finding such a hyperplane reduces to linear programming. Here,
we again discuss separating hyperplanes, with two differences: first, our goal is to separate
a convex set from a point not in the set (rather than two different sets of points); second,
the point here is to prove strong LP duality, not to give an algorithm for a computational
problem.
We assume the following result.
Theorem 4.3 (Separating Hyperplane) Let C be a closed and convex subset of R^n, and z a point in R^n not in C. Then there is a separating hyperplane, meaning coefficients α ∈ R^n and an intercept β ∈ R such that:

    (1) α^T x ≥ β for all x ∈ C    (all of C on one side of the hyperplane);

    (2) α^T z < β                  (z on the other side).

See also Figure 3. Note that the set C is not assumed to be bounded.
Figure 3: Illustration of separating hyperplane theorem.
If you’ve forgotten what “convex” or “closed” means, both are very intuitive. A convex
set is “filled in,” meaning it contains all of its chords. Formally, this translates to
    λx + (1 − λ)y ∈ C    (the point on the chord between x and y)
for all x, y ∈ C and λ ∈ [0, 1]. See Figure 4 for an example (a filled-in polygon) and a
non-example (an annulus).
A closed set is one that includes its boundary.6 See Figure 5 for an example (the unit
disc) and a non-example (the open unit disc).
6 One formal definition is that whenever a sequence of points in C converges to a point x*, then x* should also be in C.
Figure 4: (a) a convex set (filled-in polygon) and (b) a non-convex set (annulus)
Figure 5: (a) a closed set (unit disc) and (b) non-closed set (open unit disc)
Hopefully Theorem 4.3 seems geometrically obvious, at least in two and three dimensions.
It turns out that the math one would use to prove this formally extends without trouble to
an arbitrary number of dimensions.7 It also turns out that strong LP duality boils down to
exactly this fact.
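The nearest-neighbor construction in footnote 7 can be made completely concrete when C is a simple set like the closed unit disc, where projection has a closed form. The following sketch (an illustrative instance, not general-purpose code) builds the hyperplane from the theorem for an outside point z and spot-checks conditions (1) and (2).

```python
import math

# For C = the closed unit disc in R^2 and a point z outside it, the nearest
# point of C to z is y = z / ||z||. Take the hyperplane perpendicular to the
# segment yz, through its midpoint, as in the footnote's proof sketch.
def separating_hyperplane_unit_disc(z):
    norm = math.hypot(*z)
    assert norm > 1, "z must lie outside the disc"
    y = (z[0] / norm, z[1] / norm)            # nearest point of C to z
    alpha = (y[0] - z[0], y[1] - z[1])        # normal pointing from z toward C
    mid = ((y[0] + z[0]) / 2, (y[1] + z[1]) / 2)
    beta = alpha[0] * mid[0] + alpha[1] * mid[1]
    return alpha, beta

alpha, beta = separating_hyperplane_unit_disc((3.0, 0.0))
# Condition (2): alpha^T z < beta.
print(alpha[0] * 3.0 + alpha[1] * 0.0 < beta)                     # -> True
# Condition (1), spot-checked on boundary points of C: alpha^T x >= beta.
ok = all(alpha[0] * math.cos(t) + alpha[1] * math.sin(t) >= beta
         for t in [k * 0.1 for k in range(63)])
print(ok)                                                         # -> True
```

For z = (3, 0) this produces α = (−2, 0) and β = −4: every point of the disc has α^T x ≥ −2 > −4, while α^T z = −6 < −4.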
4.5 Farkas's Lemma
It’s easy to convince someone whether or not a system of linear equations has a solution: just
run Gaussian elimination and see whether or not it finds a solution (if there is a solution,
Gaussian elimination will find one). For a system of linear inequalities, it’s easy to convince
someone that there is a solution — just exhibit it and let them verify all the constraints. But
how would you convince someone that a system of linear inequalities has no solution? You
can’t very well enumerate the infinite number of possibilities and check that each doesn’t
work. Farkas's Lemma is a satisfying answer to this question, and can be thought of as the “feasibility version” of strong LP duality.
Theorem 4.4 (Farkas’s Lemma) Given a matrix A ∈ Rm and a right-hand side b
×
n
∈
Rm, exactly one of the following holds:
(i) There exists x ∈ Rn such that x ≥ 0 and Ax = b;
(ii) There exists y ∈ Rm such that yT A ≥ 0 and yT b < 0.
7 If you know undergraduate analysis, then even the formal proof is not hard: let y be the nearest neighbor to z in C (such a point exists because C is closed), and take a hyperplane perpendicular to the line segment between y and z, through the midpoint of this segment (cf. Figure 3). All of C lies on the same side of this hyperplane (opposite of z) because C is convex and y is the nearest neighbor of z in C.
To connect the statement to the previous paragraph, think of Ax = b and x ≥ 0 as the
linear system of inequalities that we care about, and solutions to (ii) as proofs that this
system has no feasible solution.
Just like there are many variants of linear programs, there are many variants of Farkas’s
Lemma. Given Theorem 4.4, it is not hard to translate it to analogous statements for other
linear systems of inequalities (e.g., with both inequality and nonnegativity constraints); see
Problem Set #3.
Proof of Theorem 4.4: First, we have deliberately set up (i) and (ii) so that it’s impossible
for both to have a feasible solution. For if there were such an x and y, we would have
    (y^T A) x ≥ 0

(the first factor is componentwise nonnegative since y^T A ≥ 0, and x ≥ 0), and yet

    y^T (Ax) = y^T b < 0,

a contradiction. In this sense, solutions to (ii) are proofs of infeasibility of the system (i)
(and vice versa).
But why can’t both (i) and (ii) be infeasible? We’ll show that this can’t happen by proving
that, whenever (i) is infeasible, (ii) is feasible. Thus the “proofs of infeasibility” encoded
by (ii) are all that we’ll ever need — whenever the linear system (i) is infeasible, there is
a proof of it of the prescribed type. There is a clear analogy between this interpretation
of Farkas’s Lemma and strong LP duality, which says that there is always a feasible dual
solution proving the tightest-possible bound on the optimal objective function value of the
primal.
Assume that (i) is infeasible. We need to somehow exhibit a solution to (ii), but where
could it come from? The trick is to get it from the separating hyperplane theorem (Theo-
rem 4.3) — the coefficients defining the hyperplane will turn out to be a solution to (ii). To
apply this theorem, we need a closed convex set and a point not in the set.
Define
Q = {d : ∃x ≥ 0 s.t. Ax = d}.
Note that Q is a subset of Rm. There are two different and equally useful ways to think
about Q. First, for the given constraint matrix A, Q is the set of all right-hand sides d
that are feasible (in x ≥ 0) with this constraint matrix. Thus by assumption, b ∈/ Q.
Equivalently, considering all vectors of the form Ax, with x ranging over all nonnegative
vectors in Rn, generates precisely the set of feasible right-hand sides. Thus Q equals the
set of all nonnegative linear combinations of the columns of A.8 This definition makes it
obvious that Q is convex (an average of two nonnegative linear combinations is just another
nonnegative linear combination). Q is also closed (the limit of a convergent sequence of
nonnegative linear combinations is just another nonnegative linear combination).
8 Called the “cone generated by” the columns of A.
Since Q is closed and convex and b ∈/ Q, we can apply Theorem 4.3. In return, we are
granted a coefficient vector α ∈ Rm and an intercept β ∈ R such that
    α^T d ≥ β    for all d ∈ Q, and

    α^T b < β.

An exercise shows that, since Q is a cone, we can take β = 0 without loss of generality (see Exercise Set #5). Thus

    α^T d ≥ 0    for all d ∈ Q,    (7)

while

    α^T b < 0.    (8)
A solution y to (ii) satisfies y^T A ≥ 0 and y^T b < 0. Suppose we just take y = α. Inequality (8) implies the second condition, so we just have to check that α^T A ≥ 0. But what is α^T A? An n-vector whose jth coordinate is the inner product of α and the jth column a_j of A. Since each a_j ∈ Q — the jth column is obviously one particular nonnegative linear combination of A's columns — inequality (7) implies that every coordinate of α^T A is nonnegative. Thus α is a solution to (ii), as desired. ∎
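The dichotomy in Theorem 4.4 is easy to see on a tiny instance. The following sketch (the instance is illustrative) exhibits a system of type (i) that is infeasible and numerically verifies a certificate of type (ii) for it.

```python
import numpy as np

# A tiny instance of Farkas's Lemma. With A = I and b = (1, -1), system (i)
# (Ax = b, x >= 0) is infeasible, since it would force x_2 = -1 < 0.
# The vector y = (0, 1) is a certificate of type (ii):
# y^T A >= 0 componentwise, yet y^T b < 0.
A = np.eye(2)
b = np.array([1.0, -1.0])
y = np.array([0.0, 1.0])

print(np.all(y @ A >= 0))   # -> True  (y^T A = (0, 1) >= 0)
print(float(y @ b))         # -> -1.0  (y^T b < 0)
```

Any would-be solution x of (i) would give the contradiction (y^T A) x ≥ 0 and y^T (Ax) = y^T b = −1 < 0, exactly as in the first half of the proof.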
4.6 Epilogue
On Problem Set #3 you will use Theorem 4.4 to prove strong LP duality. The idea is simple: let OPT(D) denote the optimal value of the dual linear program, add a constraint to the primal stating that the (primal) objective function value must be equal to or better than OPT(D), and use Farkas's Lemma to prove that this augmented linear program is feasible.
In summary, strong LP duality is amazing and powerful, yet it ultimately boils down to
the highly intuitive existence of a separating hyperplane between a closed convex set and a
point not in the set.
CS261: A Second Course in Algorithms
Lecture #10: The Minimax Theorem and Algorithms
for Linear Programming∗
Tim Roughgarden†
February 4, 2016
1 Zero-Sum Games and the Minimax Theorem

1.1 Rock-Paper Scissors
Recall rock-paper-scissors (or roshambo). Two players simultaneously choose one of rock,
paper, or scissors, with rock beating scissors, scissors beating paper, and paper beating rock.1
Here’s an idea: what if I made you go first? That’s obviously unfair — whatever you do,
I can respond with the winning move.
But what if I only forced you to commit to a probability distribution over rock, paper,
and scissors? (Then I respond, then nature flips coins on your behalf.) If you prefer, imagine
that you submit your code for a (randomized) algorithm for choosing an action, then I have
to choose my action, and then we run your algorithm and see what happens.
In the second case, going first no longer seems to doom you. You can protect yourself by
randomizing uniformly among the three options — then, no matter what I do, I’m equally
likely to win, lose, or tie. The minimax theorem states that, in general games of “pure
competition,” a player moving first can always protect herself by randomizing appropriately.
1 Here are some fun facts about rock-paper-scissors. There's a World Series of RPS every year, with a top prize of at least $50K. If you watch some videos of them, you will see pure psychological warfare. Maybe this explains why some of the same players seem to end up in the later rounds of the tournament every year. There's also a robot hand, built at the University of Tokyo, that plays rock-paper-scissors with a winning probability of 100% (check out the video). No surprise, a very high-speed camera is involved.
1.2 Zero-Sum Games
A zero-sum game is specified by a real-valued m × n matrix A. One player, the row
player, picks a row. The other (column) player picks a column. Rows and columns are also
called strategies. By definition, the entry aij of the matrix A is the row player’s payoff when
she chooses row i and the column player chooses column j. The column player’s payoff in
this case is defined as −a_ij; hence the term “zero-sum.” In effect, a_ij is the amount that the column player pays to the row player in the outcome (i, j). (Don't forget, a_ij might be negative, corresponding to a payment in the opposite direction.) Thus, the row and column players prefer bigger and smaller numbers, respectively.
The following matrix describes the payoffs in the Rock-Paper-Scissors game in our current
language.
               Rock   Paper   Scissors
    Rock         0     -1        1
    Paper        1      0       -1
    Scissors    -1      1        0
1.3 The Minimax Theorem
We can write the expected payoff of the row player when payoffs are given by an m × n
matrix A, the row strategy is x (a distribution over rows), and the column strategy is y (a
distribution over columns), as
    Σ_{i=1}^m Σ_{j=1}^n Pr[outcome (i, j)] · a_ij = Σ_{i=1}^m Σ_{j=1}^n Pr[row i chosen] · Pr[column j chosen] · a_ij
                                                  = x^T Ay,

where Pr[row i chosen] = x_i and Pr[column j chosen] = y_j.
The first term is just the definition of expectation, and the first equality holds because the
row and column players randomize independently. That is, x^T Ay is just the expected payoff
to the row player (and negative payoff to the second player) when the row and column
strategies are x and y.
In a two-player zero-sum game, would you prefer to commit to a mixed strategy before or
after the other player commits to hers? Intuitively, there is only a first-mover disadvantage,
since the second player can adapt to the first player’s strategy. The minimax theorem is the
amazing statement that it doesn’t matter.
Theorem 1.1 (Minimax Theorem) For every two-player zero-sum game A,

    max_x min_y x^T Ay = min_y max_x x^T Ay.    (1)

On the left-hand side of (1), the row player moves first and the column player second. The column player plays optimally given the strategy chosen by the row player, and the row player plays optimally anticipating the column player's response. On the right-hand side of (1), the roles of the two players are reversed. The minimax theorem asserts that, under optimal play, the expected payoff of each player is the same in the two scenarios.
For example, in Rock-Paper-Scissors, both sides of (1) are 0 (with the first player playing
uniformly and the second player responding arbitrarily). When a zero-sum game is asym-
metric and skewed toward one of the players, both sides of (1) will be non-zero (but still
equal). The common number on both sides of (1) is called the value of the game.
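The claim that both sides of (1) equal 0 for Rock-Paper-Scissors is a one-line numerical check: with the uniform row strategy, x^T A e_j = 0 for every pure column strategy e_j, hence for every mixed y as well. A small sketch:

```python
import numpy as np

# Payoff matrix for Rock-Paper-Scissors (rows/columns: Rock, Paper, Scissors).
A = np.array([[ 0, -1,  1],
              [ 1,  0, -1],
              [-1,  1,  0]], dtype=float)
x = np.full(3, 1 / 3)          # uniform row strategy

# Expected payoff x^T A e_j against each pure column strategy e_j.
for j in range(3):
    e = np.zeros(3)
    e[j] = 1.0
    print(float(x @ A @ e))    # each prints 0.0
```

Since any mixed y is a convex combination of the e_j's, x^T A y = 0 for every y, so the uniform strategy guarantees the row player the value 0 even when moving first.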
1.4 From LP Duality to Minimax
Theorem 1.1 was originally proved by John von Neumann in the 1920s, using fixed-point-
style arguments. Much later, in the 1940s, von Neumann proved it again using arguments
equivalent to strong LP duality (as we’ll do here). This second proof is the reason that,
when a very nervous George Dantzig (more on him later) explained his new ideas about
linear programming and the simplex method to von Neumann, the latter was able, off the
top of his head, to immediately give an hour-plus response that outlined the theory of LP
duality.
We now proceed to derive Theorem 1.1 from LP duality. The first step is to formalize
the problem of computing the best strategy for the player forced to go first.
Looking at the left-hand side (say) of (1), it doesn’t seem like linear programming should
apply. The first issue is the nested min/max, which is not allowed in a linear program. The
second issue is the quadratic (nonlinear) character of x^T Ay in the decision variables x, y.
But we can work these issues out.
A simple but important observation is: the second player never needs to randomize. For
example, suppose the row player goes first and chooses a distribution x. The column player
can then simply compute the expected payoff of each column (the expectation with respect
to x) and choose the best column (deterministically). If multiple columns are tied for the
best, then it is also optimal to randomize arbitrarily among these; but there is no need for
the player moving second to do so.
In math, we have argued that
    max_x min_y x^T Ay = max_x min_{j=1,...,n} x^T Ae_j
                       = max_x min_{j=1,...,n} ( Σ_{i=1}^m a_ij x_i ),    (2)

where e_j is the jth standard basis vector, corresponding to the column player deterministically choosing column j.
We’ve solved one of our problems by getting rid of y. But there is still the nested
max/min. Here we recall a trick from Lecture #7, that a minimum or maximum can often
be simulated by additional variables and constraints. The same trick works here, in exactly
the same way.
Specifically, we introduce a decision variable v, intended to be equal to (2), and solve:

    max v
    subject to
        v − Σ_{i=1}^m a_ij x_i ≤ 0    for all j = 1, . . . , n    (3)
        Σ_{i=1}^m x_i = 1
        x_1, . . . , x_m ≥ 0 and v ∈ R.
Note that this is a linear program. Rewriting the constraints (3) in the form
    v ≤ Σ_{i=1}^m a_ij x_i    for all j = 1, . . . , n

makes it clear that they force v to be at most min_{j=1,...,n} Σ_{i=1}^m a_ij x_i.

We claim that if (v*, x*) is an optimal solution, then v* = min_{j=1,...,n} Σ_{i=1}^m a_ij x_i*. This follows from the same arguments used in Lecture #7. As already noted, by feasibility, v* cannot be larger than min_{j=1,...,n} Σ_{i=1}^m a_ij x_i*. If it were strictly less, then we could increase v* slightly without destroying feasibility, yielding a better feasible solution (contradicting optimality).
Since the linear program explicitly maximizes v over all distributions x, its optimal
objective function value is
    v* = max_x min_{j=1,...,n} x^T Ae_j = max_x min_y x^T Ay.    (4)
Thus we can compute with a linear program the optimal strategy for the row player, when it
moves first, and the expected payoff obtained (assuming optimal play by the column player).
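The linear program above can be handed directly to an off-the-shelf solver. The following sketch uses scipy's `linprog` (used here purely for illustration; the lecture does not prescribe a solver) to solve the row player's LP for Rock-Paper-Scissors, recovering value 0 and the uniform strategy.

```python
import numpy as np
from scipy.optimize import linprog

# Row player's LP for Rock-Paper-Scissors. Decision variables are
# (v, x_1, ..., x_m); we minimize -v (i.e., maximize v) subject to (3).
A = np.array([[ 0, -1,  1],
              [ 1,  0, -1],
              [-1,  1,  0]], dtype=float)
m, n = A.shape

c = np.zeros(m + 1)
c[0] = -1.0                                          # minimize -v
# One constraint per column j:  v - sum_i a_ij x_i <= 0.
A_ub = np.hstack([np.ones((n, 1)), -A.T])
b_ub = np.zeros(n)
A_eq = np.hstack([[[0.0]], np.ones((1, m))])         # sum_i x_i = 1
b_eq = np.array([1.0])
bounds = [(None, None)] + [(0, None)] * m            # v free, x >= 0

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
v_star, x_star = res.x[0], res.x[1:]
print(round(v_star, 6), np.round(x_star, 6))
```

For this game the optimizer reports v* = 0 with x* = (1/3, 1/3, 1/3), matching the analysis in Section 1.3.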
Repeating the exercise for the column player gives the linear program
    min w
    subject to
        w − Σ_{j=1}^n a_ij y_j ≥ 0    for all i = 1, . . . , m
        Σ_{j=1}^n y_j = 1
        y_1, . . . , y_n ≥ 0 and w ∈ R.
At an optimal solution (w*, y*), y* is the optimal strategy for the column player (when going first, assuming optimal play by the row player) and

    w* = min_y max_{i=1,...,m} e_i^T Ay = min_y max_x x^T Ay.    (5)
Here’s the punch line: these two linear programs are duals. This can be seen by looking
up our recipe for taking duals (Lecture #8) and verifying that these two linear programs
conform to the recipe (see Exercise Set #5). For example, the one unrestricted variable (v or w) corresponds to the one equality constraint in the other linear program (Σ_{j=1}^n y_j = 1 or Σ_{i=1}^m x_i = 1, respectively).

Strong duality implies that v* = w*; in light of (4) and (5), the minimax theorem follows directly.2
2 Survey of Linear Programming Algorithms
We’ve established that linear programs capture lots of different problems that we’d like to
solve. So how do we efficiently solve a linear program?
2.1 The High-Order Bit
If you only remember one thing about linear programming, make it this:
Linear programs can be solved efficiently, in both theory and practice.
By “in theory,” we mean that linear programs can be solved in polynomial time in the worst case. By “in practice,” we mean that commercial solvers routinely solve linear programs
with input size in the millions. (Warning: the algorithms used in these two cases are not
necessarily the same.)
2.2 The Simplex Method

2.2.1 Backstory
In 1947 George Dantzig developed both the general formalism of linear programming and
also the first general algorithm for solving linear programs, the simplex method.3 Amazingly,
the simplex method remains the dominant paradigm today for solving linear programs.
2 The minimax theorem is obviously interesting in its own right, and it also has applications in algorithms, specifically to proving lower bounds on what randomized algorithms can do.
3 Dantzig spent the final 40 years of his career at Stanford (1966-2005). You've probably heard the story about a student who is late to class, sees two problems written on the blackboard, assumes they're homework problems, and then goes home and solves them, not realizing that they are the major open questions in the field. (A partial inspiration for Good Will Hunting, among other things.) Turns out this story is not apocryphal: it was Dantzig, as a PhD student in the late 1930s, in a statistics course at UC Berkeley.
2.2.2 Geometry
Figure 1: Illustration of a feasible set and an optimal solution x∗. We know that there always
exists an optimal solution at a vertex of the feasible set, in the direction of the objective
function.
In Lecture #7 we developed geometric intuition about what it means to solve a linear
program, and one of our findings was that there is always an optimal solution at a vertex
(i.e., “corner”) of the feasible region (e.g., Figure 1).4 This observation implies a finite
(but bad) algorithm for linear programming. (This is not trivial, since there are an infinite
number of feasible solutions.) The reason is that every vertex satisfies at least n constraints
with equality (where n is the number of decision variables). Or contrapositively: for a
feasible solution x that satisfies at most n − 1 constraints with equality, there is a direction
along which moving x continues to satisfy these constraints, and moving x locally in either
direction on this line yields two feasible points whose midpoint is x. But a vertex of a feasible
region cannot be written as a non-trivial convex combination of other feasible points.5 See
also Exercise Set #5. The finite algorithm is then: enumerate all (finitely many) subsets of
n linearly independent constraints, check if the unique point of Rn that satisfies all of them
is a feasible solution to the linear program, and remember the best feasible solution found
in this way.
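The finite (but bad) algorithm just described is short enough to write out in full. The following sketch implements it for LPs in the form max c^T x subject to Ax ≤ b; the instance (the unit square) is illustrative.

```python
import itertools
import numpy as np

# The brute-force vertex-enumeration algorithm from the text, for an LP
# of the form: max c^T x subject to Ax <= b. Try every subset of n
# constraints, solve them as equations, and keep the best feasible point.
def enumerate_vertices(A, b, c):
    n = A.shape[1]
    best_x, best_val = None, -np.inf
    for rows in itertools.combinations(range(A.shape[0]), n):
        sub_A, sub_b = A[list(rows)], b[list(rows)]
        if abs(np.linalg.det(sub_A)) < 1e-12:
            continue                          # constraints not linearly independent
        x = np.linalg.solve(sub_A, sub_b)     # unique point satisfying them with equality
        if np.all(A @ x <= b + 1e-9) and c @ x > best_val:
            best_x, best_val = x, c @ x
    return best_x, best_val

# Maximize x1 + x2 over the unit square [0,1]^2; the optimum is the vertex (1,1).
A = np.array([[1, 0], [0, 1], [-1, 0], [0, -1]], dtype=float)
b = np.array([1, 1, 0, 0], dtype=float)
c = np.array([1, 1], dtype=float)
x, val = enumerate_vertices(A, b, c)
print(x, val)
```

With m constraints and n variables this examines C(m, n) subsets, which is exponential in n — exactly why the simplex method's guided walk through the vertices is such an improvement.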
The simplex algorithm also searches through the vertices of the feasible region, but does
so in a smarter and more principled way. The basic idea is to use local search — if there is
a “neighboring” vertex which is better, move to it, otherwise halt. The idea of neighboring
vertices should be clear from Figure 1 — two endpoints of an “edge” of the feasible region.
In general, we can define two different vertices to be neighboring if and only if they satisfy
n − 1 common constraints with equality. Moving from one vertex to a neighbor then just
involves swapping out one of the old tight constraints for a new tight constraint; each such
swap (also called a pivot) corresponds to a “move” along an edge of the feasible region.6
4 There are a few edge cases, including unbounded or empty feasible regions, which can be handled and which we'll ignore here.
5 Making all of this completely precise is somewhat annoying. But everything your geometric intuition suggests about these statements is indeed true.
6 One important issue is “degeneracy,” meaning a vertex that satisfies strictly more than n constraints
In an iteration of the simplex method, the current vertex may have multiple neighboring
vertices with better objective function value. The choice of which of these to move to is
known as a pivot rule.
2.2.3 Correctness
The simplex method is guaranteed to terminate at an optimal solution.7 The intuition for
this fact should be clear from Figure 1 — since the objective function is linear and the
feasible region is convex, if no “local move” from a vertex is improving, then there should
be no direction at all within the feasible region that leads to a better solution. Formally,
the simplex method “knows that it's done” by, at termination, exhibiting a feasible dual
solution such that the complementary slackness conditions hold (see Lecture #9). Indeed,
the proof that the simplex method is guaranteed to terminate with an optimal solution
provides another proof of strong LP duality.
In terms of our three-step design paradigm (Lecture #9), we can think of the simplex
method as maintaining primal feasibility and the complementary slackness conditions and
working toward dual feasibility.8
2.2.4 Worst-Case Running Time
As mentioned, the simplex method is very fast in practice, and routinely solves linear pro-
grams with hundreds of thousands or even millions of variables and constraints. However,
it is a bizarre mathematical fact that the worst-case running time of the simplex method
is exponential in the input size. To understand the issue, first note that the number of
vertices of a feasible region can be exponential in the dimension (e.g., the 2n vertices of the
n-dimensional hypercube). Much harder is constructing a linear program where the simplex
method actually visits all of the vertices of the feasible region. Such an example was given
by Klee and Minty in the early 1970s (25 years after simplex was invented). Their example
is a “squashed” version of an n-dimensional hypercube. Such exponential lower bounds are
known for all natural deterministic pivot rules.9
The number of iterations required by the simplex method is also related to one of the most
famous open problems in combinatorial geometry, the Hirsch conjecture. This conjecture
concerns the “diameter of polytopes,” meaning the diameter of the graph derived from the
with equality. (E.g., in the plane, this would be 3 constraints whose boundaries meet at a common point.) In this case, a constraint swap can result in staying at the same vertex. There are simple ways to avoid cycling, however, which we won't discuss here.
7 Assuming that the linear program is feasible and has a finite optimum. If not, the simplex method correctly detects which of these cases the linear program falls in.
8 How does the simplex method find the initial primal feasible point? For some linear programs this is easy (e.g., the all-0 vector is feasible). In general, one can add an additional variable, highly penalized in the objective function, to make finding an initial feasible point trivial.
9 Interestingly, some randomized pivot rules (e.g., among the neighboring vertices that are better, pick one at random) require, in expectation, at most ≈ 2^√n iterations to converge on every instance. There are now nearly matching upper and lower bounds on the required number of iterations for all the natural randomized rules.
skeleton of the polytope (with vertices and edges of the polytope inducing, um, vertices
and edges of the graph). The conjecture asserts that the diameter is always at most linear
(in the number of variables and constraints). The best known upper bound on the worst-
case diameter of polytopes is “quasi-polynomial” (of the form ≈ nlog n), due to Kalai and
Kleitman in the early 1990s. Since the trajectory of the simplex method is a walk along the
edges of the feasible region, the number of iterations required (for a worst-case starting point
and objective function) is at least the polytope diameter. Put differently, sufficiently good
upper bounds on the number of iterations required by the simplex method (for some pivot
rule) would automatically yield progress on the Hirsch conjecture.
2.2.5 Average-Case and Smoothed Running Time
The worst-case running time of the wildly practical simplex method poses a real quandary
for the mathematical analysis of algorithms. Can we “correct” the theory so that it better
reflects reality?
In the 1980s, a number of researchers (Borgwardt, Smale, Adler-Karp, etc.) showed that
the simplex method (with a suitable pivot rule) runs in polynomial time “on average” with
respect to various distributions over linear programs. Note that it is not at all obvious how
to define a “random linear program.” Indeed, many natural attempts lead to linear programs
that are almost always infeasible.
At the start of the 21st century, Spielman and Teng proved that the simplex method has
polynomial “smoothed complexity.” This is like a robust version of an average-case analysis.
The model is to take a worst-case initial linear program, and then to randomly perturb it a
small amount. The main result here is that, for every initial linear program, in expectation
over the perturbed version of the linear program, the running time of simplex is polynomial
in the input size. The take-away is that bad examples for the simplex method are both
rare and isolated, in a precise sense. See the instructor’s CS264 course (“Beyond Worst-Case
Analysis”) for much more on smoothed analysis.
2.3 The Ellipsoid Method

2.3.1 Worst-Case Running Time
The ellipsoid method was originally proposed (by Shor and others) in the early/mid-1970s
as an algorithm for nonlinear programming. In 1979 Khachiyan proved that, for linear
programs, the algorithm is actually guaranteed to run in polynomial time. This was the
first-ever polynomial-time algorithm for linear programming, a big enough deal at the time
to make the front page of the New York Times (if below the fold).
The ellipsoid method is very slow in practice — usually multiple orders of magnitude
slower than the fastest methods. How can a polynomial-time algorithm be so much worse
than the exponential-time simplex method? There are two issues. First, the degree in
the polynomial bounding the ellipsoid method’s running time is pretty big (like 4 or 5,
depending on the implementation details). Second, the performance of the ellipsoid method
on “typical cases” is generally close to its worst-case performance. This is in sharp contrast
to the simplex method, which almost always solves linear programs in time far less than its
worst-case (exponential) running time.
2.3.2 Separation Oracles
Figure 2: The responsibility of a separation oracle.
The ellipsoid method is uniquely useful for proving theorems — for establishing that other
problems are worst-case polynomial-time solvable, and thus are at least efficiently solvable
in principle. The reason is that the ellipsoid method can solve some linear programs with
n variables and an exponential (in n) number of constraints in time polynomial in n. How
is this possible? Doesn’t it take exponential time just to read in all of the constraints?
For other linear programming algorithms, yes. But the ellipsoid method doesn’t need an
explicit description of the linear program — all it needs is a helper subroutine known as a
separation oracle. The responsibility of a separation oracle is to take as input an allegedly
feasible solution x to a linear program, and to either verify feasibility (if x is indeed feasible)
or produce a constraint violated by x (otherwise). See Figure 2. Of course, the separation
oracle should also run in polynomial time.10
How could one possibly check an exponential number of constraints in polynomial time?
You’ve actually already seen some examples of this. For example, recall the dual of the
path-based linear programming formulation of the maximum flow problem (Lecture #8):
min Σ_{e∈E} u_e ℓ_e

subject to

    Σ_{e∈P} ℓ_e ≥ 1    for all P ∈ P
    ℓ_e ≥ 0            for all e ∈ E.    (6)

Here P denotes the set of s-t flow paths of a maximum flow instance (with edge capacities u_e). Since a graph can have an exponential number of s-t paths, this linear program has a potentially exponential number of constraints.[11] But it has a polynomial-time separation oracle. The key observation is: at least one constraint is violated if and only if

min_{P∈P} Σ_{e∈P} ℓ_e < 1.

Thus, the separation oracle is just Dijkstra’s algorithm! In detail: given an allegedly feasible solution {ℓ_e}_{e∈E} to the linear program, the separation oracle first checks that each ℓ_e is nonnegative (if ℓ_e < 0, it returns the violated constraint ℓ_e ≥ 0). If the solution passes this test, then the separation oracle runs Dijkstra’s algorithm to compute a shortest s-t path, using the ℓ_e’s as (nonnegative) edge lengths. If the shortest path has length at least 1, then all of the constraints (6) are satisfied and the oracle reports “feasible.” If the shortest path P* has length less than 1, then it returns the violated constraint Σ_{e∈P*} ℓ_e ≥ 1. Thus, we can solve the above linear program in polynomial time using the ellipsoid method.[12]

[10] Such separation oracles are also useful in some practical linear programming algorithms: in “cutting plane methods,” for linear programs with a large number of constraints (where the oracle is used in the same way as in the ellipsoid method); and in the simplex method for linear programs with a large number of variables (where the oracle is used to generate variables on the fly, a technique called “column generation”).
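This oracle can be sketched in a few lines. The implementation below is our own illustration, not from the lecture (the adjacency-list graph representation, names, and return convention are all assumptions): it checks the nonnegativity constraints, then runs a textbook Dijkstra with the ℓ_e’s as lengths.

```python
import heapq

def separation_oracle(graph, s, t, ell):
    """Separation oracle for LP (6). `graph` maps a node to a list of
    (neighbor, edge_name) pairs; `ell` maps each edge name to its candidate
    value. Returns None if feasible, else a description of a violated constraint."""
    # Nonnegativity constraints first.
    for e, val in ell.items():
        if val < 0:
            return ("ell >= 0", e)
    # Dijkstra's algorithm with the ell_e's as (nonnegative) edge lengths.
    dist, prev = {s: 0.0}, {}
    pq = [(0.0, s)]
    while pq:
        d, v = heapq.heappop(pq)
        if d > dist.get(v, float("inf")):
            continue  # stale queue entry
        for w, e in graph.get(v, []):
            nd = d + ell[e]
            if nd < dist.get(w, float("inf")):
                dist[w], prev[w] = nd, (v, e)
                heapq.heappush(pq, (nd, w))
    if dist.get(t, float("inf")) >= 1.0:
        return None  # every path constraint in (6) is satisfied
    # Shortest path P* has length < 1: its constraint is violated.
    path, v = [], t
    while v != s:
        v, e = prev[v]
        path.append(e)
    return ("sum over path >= 1", list(reversed(path)))

# Tiny instance: s -> a -> t.
g = {'s': [('a', 'e1')], 'a': [('t', 'e2')]}
print(separation_oracle(g, 's', 't', {'e1': 0.6, 'e2': 0.6}))  # None (feasible)
print(separation_oracle(g, 's', 't', {'e1': 0.2, 'e2': 0.3}))  # violated path constraint
```

On the second call the shortest path has length 0.5 < 1, so the oracle returns that path’s constraint as the violated one.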
2.3.3 How the Ellipsoid Method Works
Here is a sketch of how the ellipsoid method works. The first step is to reduce optimization
to feasibility. That is, if the objective is max cT x, one replaces the objective function by
the constraint cT x ≥ M for some target objective function value M. If one can solve this
feasibility problem in polynomial time, then one can solve the original optimization problem
using binary search on the target objective M.
There’s a silly story about how to hunt a lion in the Sahara. The solution goes: encircle
the Sahara with a high fence and then bifurcate it with another fence. Figure out which side
has the lion in it (e.g., looking for tracks), and recurse. Eventually, the lion is trapped in
such a small area that you know exactly where it is.
[11] For example, consider the graph s = v_1, v_2, . . . , v_n = t, with two parallel edges directed from each v_i to v_{i+1}.
[12] Of course, we already know how to solve this particular linear program in polynomial time — just compute a minimum s-t cut (see Lecture #8). But there are harder problems where the only known proof of polynomial-time solvability goes through the ellipsoid method.
Figure 3: The ellipsoid method first initializes a huge sphere (blue circle) that encompasses
the feasible region (yellow pentagon). If the ellipsoid center is not feasible, the separation
oracle produces a violated constraint (dashed line) that splits the ellipsoid into two regions,
one containing the feasible region and one that does not. A new ellipsoid (red oval) is drawn
that contains the feasible half-ellipsoid, and the method continues recursively.
Believe it or not, this story is a pretty good cartoon of how the ellipsoid method works.
The ellipsoid method maintains at all times an ellipsoid which is guaranteed to contain the
entire feasible region (Figure 3). It starts with a huge sphere to ensure the invariant at
initialization. It then invokes the separation oracle on the center of the current ellipsoid.
If the ellipsoid center is feasible, then the problem is solved. If not, the separation oracle
produces a constraint satisfied by all feasible points that is violated by the ellipsoid center.
Geometrically, the feasible region and the ellipsoid center are on opposite sides of the corre-
sponding halfspace boundary (Figure 3). Thus we know we can recurse on the appropriate
half-ellipsoid. Before recursing, however, the ellipsoid method redraws a new ellipsoid that
contains this half-ellipsoid (and hence the feasible region).13 Elementary but tedious calcu-
lations show that the volume of the current ellipsoid is guaranteed to shrink at a certain rate
at each iteration, and this yields a polynomial bound on the number of iterations required.
The algorithm stops when the current ellipsoid is so small that it cannot possibly contain a
feasible point (given the precision of the input data).
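The loop just described is short in code. Below is a bare-bones sketch of the feasibility version in pure Python, using the standard central-cut update formulas; the initial radius, iteration cap, and toy box oracle are our own assumptions, and we ignore the bit-precision issues that the formal analysis handles.

```python
import math

def ellipsoid_feasibility(oracle, n, radius=10.0, max_iter=1000):
    """Find a feasible point using only a separation oracle.
    oracle(x) returns None if x is feasible, else (a, b) for a violated
    constraint a.x <= b. Assumes the feasible region lies in the ball of
    the given radius around the origin. Returns None if it gives up."""
    x = [0.0] * n                                  # ellipsoid center
    E = [[radius**2 if i == j else 0.0 for j in range(n)]
         for i in range(n)]                        # shape matrix of the ellipsoid
    for _ in range(max_iter):
        cut = oracle(x)
        if cut is None:
            return x                               # the center is feasible
        a, _b = cut
        Ea = [sum(E[i][j] * a[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(a[i] * Ea[i] for i in range(n)))
        # Move the center away from the violated halfspace...
        x = [x[i] - Ea[i] / ((n + 1) * norm) for i in range(n)]
        # ...and redraw the smallest ellipsoid containing the feasible half.
        f = n * n / (n * n - 1.0)
        E = [[f * (E[i][j] - 2.0 * Ea[i] * Ea[j] / ((n + 1) * norm**2))
              for j in range(n)] for i in range(n)]
    return None

# Toy usage: find a point in the box [1,2] x [1,2], seen only through an oracle.
def box_oracle(x):
    constraints = [([-1.0, 0.0], -1.0), ([1.0, 0.0], 2.0),
                   ([0.0, -1.0], -1.0), ([0.0, 1.0], 2.0)]  # a.x <= b form
    for a, b in constraints:
        if sum(ai * xi for ai, xi in zip(a, x)) > b:
            return (a, b)
    return None

point = ellipsoid_feasibility(box_oracle, 2)
assert point is not None and box_oracle(point) is None
```

Because the ellipsoid’s volume shrinks by a constant factor per iteration while always containing the box, the center must become feasible within a few dozen iterations here.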
Now that we understand how the ellipsoid method works at a high level, we see why it
can solve linear programs with an exponential number of constraints. It never works with an
explicit description of the constraints, and just generates constraints on the fly on a “need
to know” basis. Because it terminates in a polynomial number of iterations, it only ever generates a polynomial number of constraints.[14]

[13] Why the obsession with ellipsoids? Basically, they are the simplest shapes that can decently approximate all shapes of polytopes (“fat” ones, “skinny” ones, etc.). In particular, every ellipsoid has a well defined and easy-to-compute center.
2.4 Interior-Point Methods
While the simplex method works “along the boundary” of the feasible region, and the ellip-
soid method works “outside in,” the third and again quite different paradigm of interior-point
methods works “inside out.” There are many genres of interior-point methods, beginning
with Karmarkar’s algorithm in 1984 (which again made the New York Times, this time
above the fold). Perhaps the most popular are “central path” methods. The idea is, instead
of maximizing the given objective cᵀx, to maximize

cᵀx − λ · f(distance between x and boundary),

where the second term is the “barrier” part: λ ≥ 0 is a parameter and f is a barrier function that blows up (to +∞) as its argument z goes to 0 (e.g., f(z) = log(1/z)). Initially, one sets λ so big that the problem becomes easy (when f(z) = log(1/z), the solution is the “analytic center” of the feasible region, and can be computed using e.g. Newton’s method). Then one gradually decreases the parameter λ, tracking the corresponding optimal point along the way. (The “central path” is the set of optimal points as λ varies from ∞ to 0.) When λ = 0, the optimal point is an optimal solution to the linear program, as desired.
The two things you should know about interior-point methods are: (i) many such algo-
rithms run in time polynomial in the worst case; and (ii) such methods are also competitive
with the simplex method in practice. For example, one of Matlab’s LP solvers uses an
interior-point algorithm.
There are many linear programs where interior-point methods beat the best simplex codes
(especially on larger LPs), but also vice versa. There is no good understanding of when one
is likely to outperform the other. Despite the fact that it’s 70 years old, the simplex method
remains the most commonly used linear programming algorithm in practice.
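To make the central path concrete, consider a toy LP (our own example, not from the lecture): maximize x subject to 0 ≤ x ≤ 1, with barrier f(z) = log(1/z) applied to each constraint boundary. The barrier objective is x + λ(ln x + ln(1−x)); setting its derivative to zero gives a closed form for the central-path point x(λ), so we can trace the path directly.

```python
import math

def central_path_point(lam):
    """Maximizer of x + lam*(ln x + ln(1-x)) on (0, 1): the central-path
    point x(lam) for the toy LP  max x  s.t.  0 <= x <= 1."""
    # Setting the derivative 1 + lam/x - lam/(1-x) to zero and solving the
    # resulting quadratic x^2 - (1-2*lam)*x - lam = 0 for its root in (0, 1):
    b = 1.0 - 2.0 * lam
    return (b + math.sqrt(b * b + 4.0 * lam)) / 2.0

# Huge lam: essentially the analytic center of [0, 1], i.e. 1/2.
# As lam -> 0, x(lam) tracks toward the LP optimum x* = 1.
for lam in [100.0, 1.0, 0.1, 0.001]:
    print(lam, central_path_point(lam))
```

The printed points move monotonically from near 1/2 (the analytic center) toward 1 (the LP optimum) as λ shrinks, which is exactly the central-path picture described above.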
[14] As a sanity check, recall that every vertex of a feasible region in Rⁿ is the unique point satisfying some subset of n constraints with equality. Thus in principle there are always n constraints that are sufficient to describe one feasible point (given a separation oracle to verify feasibility). The magic of the ellipsoid method is that, even though a priori it has no idea which subset of constraints is the right one, it always finds a feasible point while generating only a polynomial number of constraints.
CS261: A Second Course in Algorithms
Lecture #11: Online Learning and the Multiplicative
Weights Algorithm∗
Tim Roughgarden†
February 9, 2016
1 Online Algorithms
This lecture begins the third module of the course (out of four), which is about online
algorithms. This term was coined in the 1980s and sounds anachronistic these days — it has
nothing to do with the Internet, social networks, etc. It refers to computational problems of
the following type:
An Online Problem
1. The input arrives “one piece at a time.”

2. An algorithm makes an irrevocable decision each time it receives a new piece of the input.
For example, in job scheduling problems, one often thinks of the jobs as arriving online (i.e.,
one-by-one), with a new job needing to be scheduled on some machine immediately. Or in a
graph problem, perhaps the vertices of a graph show up one by one (with whatever edges are
incident to previously arriving vertices). Thus the meaning of “one piece at a time” varies
with the problem, but in many scenarios it makes perfect sense. While online algorithms
don’t get any airtime in an introductory course like CS161, many problems in the real world
(computational and otherwise) are inherently online problems.
2 Online Decision-Making

2.1 The Model
Consider a set A of n ≥ 2 actions and a time horizon T ≥ 1. We consider the following
setup.
Online Decision-Making
At each time step t = 1, 2, . . . , T:
a decision-maker picks a probability distribution pt over her actions A
an adversary picks a reward vector rt : A → [−1, 1]
an action at is chosen according to the distribution pt, and the
decision-maker receives reward rt(at)
the decision-maker learns rt, the entire reward vector
An online decision-making algorithm specifies for each t the probability distribution pt,
as a function of the reward vectors r1, . . . , rt−1 and realized actions a1, . . . , at−1 of the first
t − 1 time steps. An adversary for such an algorithm A specifies for each t the reward vector r^t, as a function of the probability distributions p^1, . . . , p^t used by A on the first t days and the realized actions a^1, . . . , a^{t−1} of the first t − 1 days.
For example, A could represent different investment strategies, different driving routes
between home and work, or different strategies in a zero-sum game.
2.2 Definitions and Examples
We seek a “good” online decision-making algorithm. But the setup seems a bit unfair, no?
The adversary is allowed to choose each reward function rt after the decision-maker has
committed to her probability distribution pt. With such asymmetry, what kind of guarantee
can we hope for? This section gives three examples that establish limitations on what is
possible.1
The first example shows that there is no hope of achieving reward close to that of the best action sequence in hindsight. This benchmark, Σ_{t=1}^T max_{a∈A} r^t(a), is just too strong.
Example 2.1 (Comparing to the Best Action Sequence) Suppose A = {1, 2} and fix
an arbitrary online decision-making algorithm. Each day t, the adversary chooses the reward
vector r^t as follows: if the algorithm chooses a distribution p^t for which the probability on action 1 is at least 1/2, then r^t is set to the vector (−1, 1). Otherwise, the adversary sets r^t equal to (1, −1). This adversary forces the expected reward of the algorithm to be nonpositive, while ensuring that the reward of the best action sequence in hindsight is T.

[1] In the first half of the course, we always sought algorithms that are always correct (i.e., optimal). In an online setting, where you have to make decisions without knowing the future, we expect to compromise on an algorithm’s guarantee.
Example 2.1 motivates the following important definitions. Rather than comparing the
expected reward of an algorithm to that of the best action sequence in hindsight, we compare
it to the reward incurred by the best fixed action in hindsight. In words, we change our benchmark from Σ_{t=1}^T max_{a∈A} r^t(a) to max_{a∈A} Σ_{t=1}^T r^t(a).
Definition 2.2 (Regret) Fix reward vectors r1, . . . , rT . The regret of the action sequence
a1, . . . , aT is
max_{a∈A} Σ_{t=1}^T r^t(a)  −  Σ_{t=1}^T r^t(a^t),    (1)

where the first term is the cumulative reward of the best fixed action and the second term is the cumulative reward of our algorithm.

We’d like an online decision-making algorithm that achieves low regret, as close to 0 as possible (and negative regret would be even better).[2] Notice that the worst-possible regret is 2T (since rewards lie in [−1, 1]). We think of regret Ω(T) as an epic fail for an algorithm.
What is the justification for the benchmark of the best fixed action in hindsight? First,
simple and natural learning algorithms can compete with this benchmark. Second, achieving
this is non-trivial: as the following examples make clear, some ingenuity is required. Third,
competing with this benchmark is already sufficient to obtain many interesting applications
(see end of this lecture and all of next lecture).
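Regret (1) is mechanical to compute in code. Here is a small helper, with our own naming conventions, run against an adversary in the spirit of Example 2.3 below (the adversary zeroes out whatever our algorithm plays):

```python
def regret(rewards, actions):
    """Regret (1): the best fixed action's cumulative reward minus ours.
    `rewards` is a list of dicts r^t mapping action -> reward in [-1, 1];
    `actions` is the realized action sequence a^1, ..., a^T."""
    action_set = rewards[0].keys()
    best_fixed = max(sum(r[a] for r in rewards) for a in action_set)
    ours = sum(r[a] for r, a in zip(rewards, actions))
    return best_fixed - ours

# An adversary punishing an algorithm that deterministically plays action 1:
T = 100
rewards = [{1: 0, 2: 1} for _ in range(T)]  # our action's reward zeroed out
print(regret(rewards, [1] * T))             # prints 100: regret linear in T
```

Playing the other action for all T steps would instead give regret 0, matching the best fixed action in hindsight.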
One natural online decision-making algorithm is follow-the-leader, which at time step t chooses the action a with maximum cumulative reward Σ_{u=1}^{t−1} r^u(a) so far. The next example shows that follow-the-leader, and more generally every deterministic algorithm, can have regret that grows linearly with T.
Example 2.3 (Randomization Is Necessary for No Regret) Fix a deterministic on-
line decision-making algorithm. At each time step t, the algorithm commits to a single
action at. The obvious strategy for the adversary is to set the reward of action at to 0, and
the reward of every other action to 1. Then, the cumulative reward of the algorithm is 0
while the cumulative reward of the best action in hindsight is at least T(1 − 1/n). Even when there are only 2 actions, for arbitrarily large T, the worst-case regret of the algorithm is at least T/2.
For randomized algorithms, the next example limits the rate at which regret can vanish
as the time horizon T grows.
Example 2.4 (√((ln n)/T) Regret Lower Bound) Suppose there are n = 2 actions, and that we choose each reward vector r^t independently and equally likely to be (1, −1) or (−1, 1). No matter how smart or dumb an online decision-making algorithm is, with respect to this random choice of reward vectors, its expected reward at each time step is exactly 0 and its expected cumulative reward is thus also 0. The expected cumulative reward of the best fixed action in hindsight is b√T, where b is some constant independent of T. This follows from the fact that if a fair coin is flipped T times, then the expected number of heads is T/2 and the standard deviation is (1/2)√T.

[2] Sometimes this goal is referred to as “combining expert advice” — if we think of each action as an “expert,” then we want to do as well as the best expert.
Fix an online decision-making algorithm A. A random choice of reward vectors causes A to experience expected regret at least b√T, where the expectation is over both the random choice of reward vectors and the action realizations. Hence at least one choice of reward vectors induces an adversary that causes A to have expected regret at least b√T, where the expectation is now over the action realizations alone.

A similar argument shows that, with n actions, the expected regret of an online decision-making algorithm cannot grow more slowly than b√(T ln n), where b > 0 is some constant independent of n and T.
3 The Multiplicative Weights Algorithm
We now give a simple and natural algorithm with optimal worst-case expected regret, match-
ing the lower bound in Example 2.4 up to constant factors.
Theorem 3.1 There is an online decision-making algorithm that, for every adversary, has expected regret at most 2√(T ln n).

An immediate corollary is that the number of time steps needed to drive the expected time-averaged regret down to a small constant is only logarithmic in the number of actions.[3]

Corollary 3.2 There is an online decision-making algorithm that, for every adversary and ε > 0, has expected time-averaged regret at most ε after at most (4 ln n)/ε² time steps.
In our applications in this and next lecture, we will use the guarantee in the form of Corol-
lary 3.2.
The guarantees of Theorem 3.1 and Corollary 3.2 are achieved by the multiplicative
weights (MW) algorithm.4 Its design follows two guiding principles.
No-Regret Algorithm Design Principles

1. Past performance of actions should guide which action is chosen at each time step, with the probability of choosing an action increasing in its cumulative reward. (Recall from Example 2.3 that we need a randomized algorithm to have any chance.)

2. The probability of choosing a poorly performing action should decrease at an exponential rate.

The first principle is essential for obtaining regret sublinear in T, and the second for optimal regret bounds.

[3] Time-averaged regret just means the regret, divided by T.
[4] This and closely related algorithms are sometimes called the multiplicative weight update (MWU) algorithm, Polynomial Weights, Hedge, and Randomized Weighted Majority.
The MW algorithm maintains a weight, intuitively a “credibility,” for each action. At
each time step the algorithm chooses an action with probability proportional to its cur-
rent weight. The weight of each action evolves over time according to the action’s past
performance.
Multiplicative Weights (MW) Algorithm

initialize w^1(a) = 1 for every a ∈ A
for each time step t = 1, 2, . . . , T do
    use the distribution p^t := w^t/Γ^t over actions, where Γ^t = Σ_{a∈A} w^t(a) is the sum of the weights
    given the reward vector r^t, for every action a ∈ A use the formula w^{t+1}(a) = w^t(a) · (1 + ηr^t(a)) to update its weight

For example, if all rewards are either −1 or 1, then the weight of each action a either goes up by a 1 + η factor or down by a 1 − η factor. The parameter η lies between 0 and 1/2, and is chosen at the end of the proof of Theorem 3.1 as a function of n and T. For intuition, note that when η is close to 0, the distributions p^t will hew close to the uniform distribution.
Thus small values of η encourage exploration. Large values of η correspond to algorithms
in the spirit of follow-the-leader. Thus large values of η encourage exploitation, and η is a
knob for interpolating between these two extremes. The MW algorithm is obviously simple
to implement, since the only requirement is to update the weight of each action at each time
step.
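Indeed, the algorithm is a few lines of Python. The sketch below is our own (names and conventions are assumptions, not from the lecture); for ease of testing it tracks the cumulative expected reward Σ_t ν^t, which is exactly the quantity bounded in the proof of Theorem 3.1 below.

```python
import math

def multiplicative_weights(reward_vectors, eta):
    """Run MW over actions 0..n-1 for T = len(reward_vectors) steps.
    reward_vectors[t][a] is r^t(a) in [-1, 1]. Returns the cumulative
    expected reward  sum_t nu^t = sum_t sum_a p^t(a) r^t(a)."""
    n = len(reward_vectors[0])
    w = [1.0] * n                         # one weight ("credibility") per action
    expected_reward = 0.0
    for r in reward_vectors:
        gamma = sum(w)
        p = [wa / gamma for wa in w]      # p^t proportional to current weights
        expected_reward += sum(p[a] * r[a] for a in range(n))
        w = [w[a] * (1.0 + eta * r[a]) for a in range(n)]  # multiplicative update
    return expected_reward

# Sanity check against Theorem 3.1: n = 2 actions, T = 2500 steps.
n, T = 2, 2500
eta = math.sqrt(math.log(n) / T)
# Action 1 earns 1 every step; action 0 alternates +1, -1.
rs = [[1.0 if t % 2 == 0 else -1.0, 1.0] for t in range(T)]
opt = T                                   # best fixed action in hindsight
regret = opt - multiplicative_weights(rs, eta)
assert 0 <= regret <= 2 * math.sqrt(T * math.log(n))
```

The weight of action 1 grows as (1 + η)^t while action 0’s oscillates, so the distribution quickly concentrates on the good action and the regret stays far below linear in T.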
4 Proof of Theorem 3.1
Fix a sequence r1, . . . , rT of reward vectors.5 The challenge is that the two quantities that
we care about, the expected reward of the MW algorithm and the reward of the best fixed
action, seem to have nothing to do with each other. The fairly inspired idea is to relate both
of these quantities to an intermediate quantity, namely the sum Γ^{T+1} = Σ_{a∈A} w^{T+1}(a) of the actions’ weights at the conclusion of the MW algorithm. Theorem 3.1 then follows from some simple algebra and approximations.

[5] We’re glossing over a subtle point, the difference between “adaptive adversaries” (like those defined in Section 2) and “oblivious adversaries” which specify all reward vectors in advance. Because the behavior of the MW algorithm is independent of the realized actions, it turns out that the worst-case adaptive adversary for the algorithm is in fact oblivious.
The first step, and the step which is special to the MW algorithm, shows that the sum
of the weights Γt evolves together with the expected reward earned by the MW algorithm.
In detail, denote the expected reward of the MW algorithm at time step t by ν^t, and write

ν^t = Σ_{a∈A} p^t(a) · r^t(a) = Σ_{a∈A} (w^t(a)/Γ^t) · r^t(a).    (2)
Thus we want to lower bound the sum of the νt’s.
To understand Γ^{t+1} as a function of Γ^t and the expected reward (2), we derive

Γ^{t+1} = Σ_{a∈A} w^{t+1}(a) = Σ_{a∈A} w^t(a) · (1 + ηr^t(a)) = Γ^t(1 + ην^t).    (3)
For convenience, we’ll bound this quantity from above, using the fact that 1 + x ≤ eˣ for all real-valued x.[6] Then we can write

Γ^{t+1} ≤ Γ^t · e^{ην^t}

for each t and hence

Γ^{T+1} ≤ Γ^1 · Π_{t=1}^T e^{ην^t} = n · e^{η Σ_{t=1}^T ν^t},    (4)

using that Γ^1 = n. This expresses a lower bound on the expected reward of the MW algorithm as a relatively simple function of the intermediate quantity Γ^{T+1}.
Figure 1: 1 + x ≤ ex for all real-valued x.
[6] See Figure 1 for a proof by picture. A formal proof is easy using convexity, a Taylor expansion, or other methods.
The second step is to show that if there is a good fixed action, then the weight of this
action single-handedly shows that the final value ΓT+1 is pretty big. Combining with the
first step, this will imply that the MW algorithm only does poorly if every fixed action is bad.

Formally, let OPT denote the cumulative reward Σ_{t=1}^T r^t(a*) of the best fixed action a* for the reward vector sequence. Then,

Γ^{T+1} ≥ w^{T+1}(a*) = w^1(a*) · Π_{t=1}^T (1 + ηr^t(a*)),    (5)

using that w^1(a*) = 1.
OPT is the sum of the r^t(a*)’s, so we’d like to massage the expression above to involve this sum. Products become sums in exponents. So the first idea is to use the same trick as before, replacing 1 + x by eˣ. Unfortunately, we can’t have it both ways — before we wanted an upper bound on 1 + x, whereas now we want a lower bound. But looking at Figure 1, it’s clear that the two functions are very close to each other for x near 0. This can be made precise through the Taylor expansion

ln(1 + x) = x − x²/2 + x³/3 − x⁴/4 + · · · .

Provided |x| ≤ 1/2, we can obtain a lower bound on ln(1 + x) by throwing out all terms but the first two, and doubling the second term to compensate: ln(1 + x) ≥ x − x². (The magnitudes of the rest of the terms can be bounded above by the geometric series (x²/2)(1/2 + 1/4 + · · · ), so the extra −x²/2 term blows them all away.)
Since η ≤ 1/2 and |r^t(a*)| ≤ 1 for every t, we can plug this estimate into (5) to obtain

Γ^{T+1} ≥ Π_{t=1}^T e^{ηr^t(a*) − η²(r^t(a*))²} ≥ e^{ηOPT − η²T},    (6)

where in (6) we’re just using the crude estimate (r^t(a*))² ≤ 1 for all t.

Through (4) and (6), we’ve connected the cumulative expected reward Σ_{t=1}^T ν^t of the MW algorithm with the reward OPT of the best fixed action through the intermediate quantity Γ^{T+1}:

n · e^{η Σ_{t=1}^T ν^t} ≥ Γ^{T+1} ≥ e^{ηOPT − η²T},

and hence (taking the natural logarithm of both sides and dividing through by η):

Σ_{t=1}^T ν^t ≥ OPT − ηT − (ln n)/η.    (7)
Finally, we set the free parameter η. There are two error terms in (7), the first one corresponding to inaccurate learning (higher for larger learning rates), the second corresponding to overhead before converging (higher for smaller learning rates). To equalize the two terms, we choose η = √((ln n)/T). (Or η = 1/2, if this is smaller.) Then, the cumulative expected reward of the MW algorithm is at most 2√(T ln n) less than the cumulative reward of the best fixed action. This completes the proof of Theorem 3.1.
Remark 4.1 (Unknown Time Horizons) The choice of η above assumes knowledge of
the time horizon T. Minor modifications extend the multiplicative weights algorithm and
its regret guarantee to the case where T is not known a priori, with the “2” in Theorem 3.1
replaced by a modestly larger constant factor.
5 Minimax Revisited
Recall that a two-player zero-sum game can be specified by an m × n matrix A, where a_ij denotes the payoff of the row player and the negative payoff of the column player when row i
and column j are chosen. It is easy to see that going first in a zero-sum game can only be
worse than going second — in the latter case, a player has the opportunity to adapt to the
first player’s strategy. Last lecture we derived the minimax theorem from strong LP duality.
It states that, provided the players randomize optimally, it makes no difference who goes
first.
Theorem 5.1 (Minimax Theorem) For every two-player zero-sum game A,

max_x min_y xᵀAy = min_y max_x xᵀAy.    (8)
We next sketch an argument for deriving Theorem 5.1 directly from the guarantee pro-
vided by the multiplicative weights algorithm (Theorem 3.1). Exercise Set #6 asks you to
provide the details.
Fix a zero-sum game A with payoffs in [−1, 1] and a value for a parameter ε > 0. Let n denote the number of rows or the number of columns, whichever is larger. Consider the following thought experiment:

• At each time step t = 1, 2, . . . , T = (4 ln n)/ε²:

  – The row and column players each choose a mixed strategy (p^t and q^t, respectively) using their own copies of the multiplicative weights algorithm (with the action set equal to the rows or columns, as appropriate).

  – The row player feeds the reward vector r^t = Aq^t into (its copy of) the multiplicative weights algorithm. (This is just the expected payoff of each row, given that the column player chose the mixed strategy q^t.)

  – Analogously, the column player feeds the reward vector r^t = −(p^t)ᵀA into the multiplicative weights algorithm.
Let

v = (1/T) Σ_{t=1}^T (p^t)ᵀAq^t

denote the time-averaged payoff of the row player. The first claim is that applying Theorem 3.1 (in the form of Corollary 3.2) to the row and column players implies that

v ≥ max_p pᵀAq̂ − ε    and    v ≤ min_q p̂ᵀAq + ε,

respectively, where p̂ = (1/T) Σ_{t=1}^T p^t and q̂ = (1/T) Σ_{t=1}^T q^t denote the time-averaged row and column strategies.
Given this, a short derivation shows that

max_p min_q pᵀAq ≥ min_q max_p pᵀAq − 2ε.

Letting ε → 0 and recalling the easy direction of the minimax theorem (max_p min_q pᵀAq ≤ min_q max_p pᵀAq) completes the proof.
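The thought experiment is easy to run in code. Below is a minimal sketch on a small 2×2 game of our own choosing (its value is −1/3 for the row player; all names are ours, and we use the same step count and learning rate as in the proof of Theorem 3.1):

```python
import math

A = [[1.0, -1.0], [-1.0, 0.0]]    # a zero-sum game whose value is -1/3

def simulate(T):
    """Both players run their own copies of multiplicative weights, as in the
    thought experiment; returns the row player's time-averaged payoff v."""
    eta = math.sqrt(math.log(2) / T)
    wr, wc = [1.0, 1.0], [1.0, 1.0]          # row and column weight vectors
    v = 0.0
    for _ in range(T):
        p = [x / sum(wr) for x in wr]        # row mixed strategy p^t
        q = [x / sum(wc) for x in wc]        # column mixed strategy q^t
        v += sum(p[i] * A[i][j] * q[j] for i in range(2) for j in range(2))
        r_row = [sum(A[i][j] * q[j] for j in range(2)) for i in range(2)]   # A q^t
        r_col = [-sum(p[i] * A[i][j] for i in range(2)) for j in range(2)]  # -(p^t)^T A
        wr = [wr[i] * (1.0 + eta * r_row[i]) for i in range(2)]
        wc = [wc[j] * (1.0 + eta * r_col[j]) for j in range(2)]
    return v / T

v = simulate(10000)
assert abs(v - (-1/3)) < 0.05    # v is within epsilon of the game's value
```

By the two claims above, v is sandwiched within ε of the game’s value, where ε shrinks like 2√((ln n)/T); here that guarantee is roughly 0.017.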
CS261: A Second Course in Algorithms
Lecture #12: Applications of Multiplicative Weights to
Games and Linear Programs∗
Tim Roughgarden†
February 11, 2016
1 Extensions of the Multiplicative Weights Guarantee
Last lecture we introduced the multiplicative weights algorithm for online decision-making.
You don’t need to remember the algorithm details for this lecture, but you should remember
that it’s a simple and natural algorithm (just one simple update per action per time step).
You should also remember its regret guarantee, which we proved last lecture and will use
today several times as a black box.1
Theorem 1.1 The expected regret of the multiplicative weights algorithm is always at most 2√(T ln n), where n is the number of actions and T is the time horizon.
Recall the definition of regret, where A denotes the action set:
max_{a∈A} Σ_{t=1}^T r^t(a)  −  Σ_{t=1}^T r^t(a^t),

where the first term is the cumulative reward of the best fixed action and the second term is the cumulative reward of our algorithm.
The expectation in Theorem 1.1 is over the random choice of action in each time step; the
reward vectors r1, . . . , rT are arbitrary.
The regret guarantee in Theorem 1.1 applies not only with respect to the best
fixed action in hindsight, but more generally to the best fixed probability distribution in
[1] This lecture is a detour from our current study of online algorithms. While the multiplicative weights algorithm works online, the applications we discuss today are not online problems.
hindsight. The reason is that, in hindsight, the best fixed action is as good as the best fixed distribution over actions. Formally, for every distribution p over A,

Σ_{t=1}^T Σ_{a∈A} p_a · r^t(a) = Σ_{a∈A} p_a · (Σ_{t=1}^T r^t(a)) ≤ max_{b∈A} Σ_{t=1}^T r^t(b),

using that the p_a’s sum to 1 and that each inner sum Σ_{t=1}^T r^t(a) is at most max_{b∈A} Σ_{t=1}^T r^t(b).

We’ll apply Theorem 1.1 in the following form (where time-averaged just means divided by T).
Corollary 1.2 The expected time-averaged regret of the multiplicative weights algorithm is
at most ꢀ after at most (4 ln n)/ꢀ2 time steps.
As noted above, the guarantee of Corollary 1.2 applies with respect to any fixed distribution
over the actions.
Another useful extension is to rewards that lie in [−M, M], rather than in [−1, 1]. This scenario reduces to the previous one by scaling. To obtain time-averaged regret at most ε:

1. scale all rewards down by M;

2. run the multiplicative weights algorithm until the time-averaged expected regret is at most ε/M;

3. scale everything back up.

Equivalently, rather than explicitly scaling the reward vectors, one can change the weight update rule from w^{t+1}(a) = w^t(a)(1 + ηr^t(a)) to w^{t+1}(a) = w^t(a)(1 + (η/M)·r^t(a)). In any case, Corollary 1.2 implies that after T = (4M² ln n)/ε² iterations, the time-averaged expected regret is at most ε.
2 Minimax Revisited (Again)
Last lecture we sketched how to use the multiplicative weights algorithm to prove the min-
imax theorem (details on Exercise Set #6). The idea was to have both the row and the
column player play a zero-sum game repeatedly, using their own copies of the multiplicative
weights algorithm to choose strategies simultaneously at each time step. We next discuss an
alternative thought experiment, where the players move sequentially at each time step with
only the row player using multiplicative weights (the column player just best responds). This
alternative has similar consequences and translates more directly into interesting algorithmic
applications.
Fix a zero-sum game A with payoffs in [−M, M] and a value for a parameter ε > 0. Let m denote the number of rows of A. Consider the following thought experiment, in which the row player has to move first and the column player gets to move second:
Thought Experiment

• At each time step t = 1, 2, . . . , T = (4M² ln m)/ε²:

  – The row player chooses a mixed strategy p^t using the multiplicative weights algorithm (with the action set equal to the rows).

  – The column player responds optimally with the deterministic strategy q^t.[2]

  – If the column player chooses column j, then set r^t(i) = a_ij for every row i, and feed the reward vector r^t into the multiplicative weights algorithm. (This is just the payoff of each row in hindsight, given the column player’s strategy at time t.)
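This thought experiment is also a few lines of code. A sketch on a small game of our own choosing (payoffs lie in [−1, 1], so M = 1, and the game’s value is −1/3 for the row player; all names are assumptions):

```python
import math

A = [[1.0, -1.0], [-1.0, 0.0]]    # zero-sum game with value -1/3 for the row player

def mw_vs_best_response(T):
    """Row player runs multiplicative weights; the column player best-responds
    to p^t at each step. Returns the row player's time-averaged payoff."""
    m, n_cols = len(A), len(A[0])
    eta = math.sqrt(math.log(m) / T)
    w = [1.0] * m
    total = 0.0
    for _ in range(T):
        p = [x / sum(w) for x in w]                               # row strategy p^t
        # Deterministic best response: the column minimizing the row's payoff.
        col_payoffs = [sum(p[i] * A[i][j] for i in range(m)) for j in range(n_cols)]
        j = min(range(n_cols), key=lambda jj: col_payoffs[jj])
        total += col_payoffs[j]
        r = [A[i][j] for i in range(m)]    # payoff of each row vs. column j
        w = [w[i] * (1.0 + eta * r[i]) for i in range(m)]
    return total / T

v = mw_vs_best_response(20000)
assert abs(v - (-1/3)) < 0.05
```

Per Claims 1 and 2 below, this time-averaged payoff is at most the game’s value and at least the value minus ε, so it approaches −1/3 as T grows.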
We claim that the column player gets at least its minimax payoff, and the row player gets at least its minimax payoff minus ε.

Claim 1: In the thought experiment above, the negative time-averaged expected payoff of the column player is at most

max_p min_q pᵀAq.

Note that the benchmark used in this claim is the more advantageous one for the column player, where it gets to move second.[3]
Proof: The column player only does better than its minimax value because, not only does the player get to go second, but the player can tailor its best responses on each day to the row player’s mixed strategy on that day. Formally, we let p̂ = (1/T) Σ_{t=1}^T p^t denote the time-averaged row strategy and q* an optimal response to p̂, and derive

max_p min_q pᵀAq ≥ min_q p̂ᵀAq = p̂ᵀAq* = (1/T) Σ_{t=1}^T (p^t)ᵀAq* ≥ (1/T) Σ_{t=1}^T (p^t)ᵀAq^t,

with the last inequality following because q^t is an optimal response to p^t for each t. (Recall the column player wants to minimize pᵀAq.) Since the last term is the negative time-averaged payoff of the column player in the thought experiment, the proof is complete. □

[2] Recall from last lecture that the player who goes second has no need to randomize: choosing a column with the best expected payoff (given the row player’s strategy p^t) is the best thing to do.
[3] Of course, we’ve already proved the minimax theorem, which states that it doesn’t matter who goes first. But here we want to reprove the minimax theorem, and hence don’t want to assume it.
Claim 2: In the thought experiment above, the time-averaged expected payoff of the row player is at least

min_q max_p p^T Aq − ε.

We are again using the stronger benchmark from the player's perspective, here with the row player going second.
Proof: Let q̂ = (1/T) Σ_{t=1}^T q^t denote the time-averaged column strategy. The multiplicative weights guarantee, after being extended as in Section 1, states that the time-averaged expected payoff of the row player is within ε of what it could have attained using any fixed mixed strategy p. That is,

(1/T) Σ_{t=1}^T (p^t)^T Aq^t ≥ max_p [ (1/T) Σ_{t=1}^T p^T Aq^t ] − ε
                             = max_p p^T Aq̂ − ε
                             ≥ min_q max_p p^T Aq − ε. ∎
Letting ε → 0, Claims 1 and 2 provide yet another proof of the minimax theorem (recall the "easy direction," that max_p min_q p^T Aq ≤ min_q max_p p^T Aq). The next order of business is to translate this thought experiment into fast algorithms for approximately solving linear programs.
3 Linear Classifiers Revisited

3.1 Recap
Recall from Lecture #7 the problem of computing a linear classifier — geometrically, of
separating a bunch of “+”s and “-”s with a hyperplane (Figure 1).
Figure 1: We want to find a linear function that separates the positive points (plus signs)
from the negative points (minus signs)
Formally, the input consists of m "positive" data points p^1, . . . , p^m ∈ R^d and m′ "negative" data points q^1, . . . , q^{m′} ∈ R^d. This corresponds to labeled data, with the positive and negative points having labels +1 and −1, respectively.

The goal is to compute a linear function h(z) = Σ_{j=1}^d a_j z_j + b (from R^d to R) such that

h(p^i) > 0 for all positive points and h(q^i) < 0 for all negative points.

In Lecture #7 we saw how to compute a linear classifier (if one exists) via linear programming. (It was almost immediate; the only trick was to introduce an additional variable to turn the strict inequality constraints into the usual weak inequality constraints.)
We’ve said in the past that linear programs with 100,000s of variables and constraints
are usually no problem to solve, and sometimes millions of variables and constraints are
also doable. But as you probably know from your other computer science courses, in many
cases we’re interested in considerably larger data sets. Can we compute a linear classifier
faster, perhaps under some assumptions and/or allowing for some approximation? The
multiplicative weights algorithm provides an affirmative answer.
3.2 Preprocessing
We first execute a few preprocessing steps to transform the problem into a more convenient
form.
First, we can force the intercept b to be 0. The trick is to add an additional (d + 1)th
variable, with the new coefficient ad+1 corresponding to the old intercept b. Each positive
and negative data point gets a new (d + 1)th coordinate, equal to 1. Geometrically, we’re
now looking for a hyperplane separating the positive and negative points that passes through
the origin.
Second, if we multiply all the coordinates of each negative point yi ∈ Rd+1 by -1, then
we can write the constraints as

h(x^i), h(y^i) > 0
for all positive and negative data points. (For this reason, we will no longer distinguish
positive and negative points.) Geometrically, we’re now looking for a hyperplane (through
the origin) such that all of the data points are on the same side of the hyperplane.
Third, we can insist that every coefficient a_j is nonnegative. (Don't forget that the coordinates of the x^i's can be both positive and negative.) The trick here is to make two copies of every coordinate (blowing up the dimension from d + 1 to 2d + 2), and interpreting the two coefficients a′_j, a″_j corresponding to the jth coordinate as indicating the coefficient a_j = a′_j − a″_j in the original space. For this to work, each entry x^i_j of a data point is replaced by two entries, x^i_j and −x^i_j. Geometrically, we're now looking for a hyperplane, through the origin and with a normal vector in the nonnegative orthant, with all the data points on the same side (and the same side as the normal vector).

For the rest of this section, we use d to denote the number of dimensions after all of this preprocessing (i.e., we redefine d to be what was previously 2d + 2).
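The three preprocessing steps are purely mechanical; here is a minimal sketch. The concrete points and the interleaving order of the duplicated coordinates are illustrative assumptions — all that matters is that the original instance has a classifier if and only if the transformed one has a nonnegative coefficient vector with every inner product positive.

```python
def preprocess(positives, negatives):
    # Step 1: append a constant 1 so the intercept b becomes coefficient a_{d+1}.
    # Step 2: negate every coordinate of each (extended) negative point, so all
    #         constraints become h(x) > 0.
    points = [p + [1.0] for p in positives]
    points += [[-v for v in q] + [-1.0] for q in negatives]
    # Step 3: replace each entry x_j by the pair (x_j, -x_j), doubling the
    #         dimension so we may insist on nonnegative coefficients
    #         (a_j = a'_j - a''_j in the original space).
    return [[v for x in pt for v in (x, -x)] for pt in points]

# Hypothetical instance in R^2:
positives = [[2.0, 1.0], [1.0, 3.0]]
negatives = [[-1.0, -2.0], [-3.0, -1.0]]
data = preprocess(positives, negatives)
```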
3.3 Assumption
We assume that the problem is feasible — that there is a linear function of the desired type. Actually, we assume a bit more: that there is a solution with some "wiggle room."

Assumption: There is a coefficient vector a* ∈ R^d_+ such that:

1. Σ_{j=1}^d a*_j = 1; and

2. Σ_{j=1}^d a*_j x^i_j > ε for all data points x^i (the parameter ε is the "margin").

Note that if there is any solution to the problem, then there is a solution satisfying the first condition (just by scaling the coefficients). The second condition insists on wiggle room after normalizing the coefficients to sum to 1.

Let M be such that |x^i_j| ≤ M for every i and j. The running time of our algorithm will depend on both ε and M.
3.4 Algorithm

Here is the algorithm.
1. Define an action set A = {1, 2, . . . , d}, with actions corresponding to coordinates.

2. For t = 1, 2, . . . , T = (4M^2 ln d)/ε^2:

(a) Use the multiplicative weights algorithm to generate a probability distribution a^t ∈ R^d over the actions/coordinates.

(b) If Σ_{j=1}^d a^t_j x^i_j > 0 for every data point x^i, then halt and return a^t (which is a feasible solution).

(c) Otherwise, choose some data point x^i with Σ_{j=1}^d a^t_j x^i_j ≤ 0, and define a reward vector r^t with r^t(j) = x^i_j for each coordinate j.

(d) Feed the reward vector r^t into the multiplicative weights algorithm.
To motivate the choice of reward vector, suppose the coefficient vector a^t fails to have a positive inner product Σ_{j=1}^d a^t_j x^i_j with the data point x^i. We want to nudge the coefficients so that this inner product will go up in the next iteration. (Of course we might screw up some other inner products, but we're hoping it'll work out OK in the end.) For coordinates j with x^i_j > 0 we want to increase a_j; for coordinates j with x^i_j < 0 we want to do the opposite. Recalling the multiplicative weight update rule (w^{t+1}(a) = w^t(a)(1 + ηr^t(a))), we see that the reward vector r^t = x^i will have the intended effect.
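The loop above fits in a few lines of Python. This is a sketch under the assumptions of Section 3.2 (preprocessed points) and Section 3.3 (margin ε, bound M); the step size eta is a standard multiplicative-weights choice that the lecture does not pin down, and the sample instance is hypothetical.

```python
import math

def mw_linear_classifier(points, margin, M):
    # points: preprocessed data, each x in R^d with entries in [-M, M];
    # we seek a distribution a over coordinates with a . x > 0 for all x.
    d = len(points[0])
    T = int(4 * M * M * math.log(d) / margin ** 2) + 1
    eta = margin / (2 * M)     # assumed step size
    w = [1.0] * d              # one weight per coordinate
    for _ in range(T):
        total = sum(w)
        a = [wi / total for wi in w]
        violated = next((x for x in points
                         if sum(a[j] * x[j] for j in range(d)) <= 0), None)
        if violated is None:
            return a           # every inner product is positive: feasible
        # Reward vector r^t = the violated point; this nudges the
        # coefficients so its inner product increases next iteration.
        w = [w[j] * (1 + eta * violated[j]) for j in range(d)]
    return None                # unreachable under the margin assumption

# Hypothetical preprocessed instance with margin 0.5 (and M = 1):
points = [[1.0, 0.2], [0.5, 1.0], [0.8, 0.6]]
a = mw_linear_classifier(points, 0.5, 1.0)
```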
3.5 Analysis
We claim that the algorithm above halts (necessarily with a feasible solution) by the time it
gets to the final iteration T.
In the algorithm, the reward vectors are nefariously defined so that, at every time step t,
the inner product of at and rt is non-positive. Viewing at as a probability distribution over
the actions {1, 2, . . . , d}, this means that the expected reward of the multiplicative weights
algorithm is non-positive at every time step, and hence its time-averaged expected reward
is at most 0.
On the other hand, by assumption (Section 3.3), there exists a coefficient vector (equivalently, a distribution over {1, 2, . . . , d}) a* such that, at every time step t, the expected payoff of playing a* would have been

Σ_{j=1}^d a*_j r^t(j) ≥ min_{i=1,...,m} Σ_{j=1}^d a*_j x^i_j > ε.
Combining these two observations, we see that as long as the algorithm has not yet found a feasible solution, the time-averaged regret of the multiplicative weights subroutine is strictly more than ε. The multiplicative weights guarantee says that after T = (4M^2 ln d)/ε^2 iterations, the time-averaged regret is at most ε.⁴ We conclude that our algorithm halts, with a feasible linear classifier, within T iterations.

⁴We're using the extended version of the guarantee (Section 1), which holds against every fixed distribution (like a*) and not just every fixed action.
3.6 Interpretation as a Zero-Sum Game
Our last two topics were a thought experiment leading to minimax payoffs in zero-sum games
and an algorithm for computing a linear classifier. The latter is just a special case of the
former.
To translate the linear classifier problem to a zero-sum game, introduce one row for each
of the d coordinates and one column for each of the data points xi. Define the payoff matrix
A by

a_{ji} = x^i_j.
Recall that in our thought experiment (Section 2), the row player generates a strategy at
each time step using the multiplicative weights algorithm. This is exactly how we generate
the coefficient vectors a1, . . . , aT in the algorithm in Section 3.4. In the thought experiment,
the column player, knowing the row player’s distribution, chooses the column that minimizes
the expected payoff of the row player. In the linear classifier context, given a^t, this corresponds to picking a data point x^i that minimizes Σ_{j=1}^d a^t_j x^i_j. This ensures that a violated data point (with nonpositive dot product) is chosen, provided one exists. In the thought
experiment, the reward vector rt fed into the multiplicative weights algorithm is the payoff of
each row in hindsight, given the column player’s strategy at time t. With the payoff matrix
A above, this vector corresponds to the data point xi chosen by the column player at time t.
These are exactly the reward vectors used in our algorithm for computing a linear classifier.
Finally, the assumption (Section 3.3) implies that the value of the constructed zero-sum
game is bigger than ε (since the row player could always choose a*). The regret guarantee in Section 2 translates to the row player having time-averaged expected payoff bigger than 0 once T exceeds (4M^2 ln m)/ε^2. The algorithm has no choice but to halt (with a feasible solution) before this time.
4 Maximum Flow Revisited

4.1 Multiplicative Weights and Linear Programs
We’ve now seen a concrete example of how to approximately solve a linear program using the
multiplicative weights algorithm, by modeling the linear program as a zero-sum game and
then applying the thought experiment from Section 2. The resulting algorithm is extremely
fast (faster than solving the linear program exactly) provided the margin ε is not overly small and the radius M of the ℓ∞ ball enclosing all of the data points x^i is not overly big.
This same idea — associating one player with the decision variables and a second player
with the constraints — can be used to quickly approximate many other linear programs.
We’ll prove this point by considering one more example, our old friend the maximum flow
problem. Of course, we already know some pretty good algorithms (faster than linear pro-
grams) for maximum flow problems, but the ideas we’ll discuss extend also to multicom-
modity flow problems (see Exercise Set #6 and Problem Set #3), where we don’t know any
exact algorithms that are significantly faster than linear programming.
4.2 A Zero-Sum Game for the Maximum Flow Problem
Recall the primal-dual pair of linear programs corresponding to the maximum flow and
minimum cut problems (Lecture #8):
max Σ_{P∈𝒫} f_P

subject to

Σ_{P∈𝒫 : e∈P} f_P ≤ 1   for all e ∈ E   (the left-hand side is the total flow on e)
f_P ≥ 0   for all P ∈ 𝒫

and

min Σ_{e∈E} ℓ_e

subject to

Σ_{e∈P} ℓ_e ≥ 1   for all P ∈ 𝒫
ℓ_e ≥ 0   for all e ∈ E,

where 𝒫 denotes the set of s-t paths. To reduce notation, here we'll only consider the case where all edges have unit capacity (u_e = 1). The general case, with u_e's on the right-hand side of the primal and in the objective function of the dual, can be solved using the same ideas (Exercise Set #6).⁵
We begin by defining a zero-sum game. The row player will be associated with edges
(i.e., dual variables) and the column player with paths (i.e., primal variables). The payoff
matrix is

a_{eP} = 1 if e ∈ P, and 0 otherwise.

Note that all payoffs are 0 or 1. (Yes, this is a huge matrix, but we'll never have to write it down explicitly; see the algorithm below.)
Let OPT denote the optimal objective function value of the linear programs. (The same for each, by strong duality.) Recall that the value of a zero-sum game is defined as the expected payoff of the row player under optimal play by both players (max_x min_y x^T Ay or, equivalently by the minimax theorem, min_y max_x x^T Ay).

Claim: The value of this zero-sum game is 1/OPT.
⁵Although the running time scales quadratically with the ratio of the maximum and minimum edge capacities, which is not ideal. One additional idea ("width reduction"), not covered here, recovers a polynomial-time algorithm for general edge capacities.
Proof: Let {ℓ*_e}_{e∈E} be an optimal solution to the dual, with Σ_{e∈E} ℓ*_e = OPT. Obtain x_e's from the ℓ*_e's by scaling down by OPT — then the x_e's form a probability distribution. If the row player uses this mixed strategy x, then each column P ∈ 𝒫 results in expected payoff

Σ_{e∈P} x_e = (1/OPT) Σ_{e∈P} ℓ*_e ≥ 1/OPT,

where the inequality follows from the dual feasibility of {ℓ*_e}_{e∈E}. This shows that the value of the game is at least 1/OPT.

Conversely, let x be an optimal strategy for the row player, with min_y x^T Ay equal to the game's value v. This means that, no matter what strategy the column player chooses, the row player's expected payoff is at least v. This translates to

Σ_{e∈P} x_e ≥ v

for every P ∈ 𝒫. Thus {x_e/v}_{e∈E} is a dual feasible solution, with objective function value (Σ_{e∈E} x_e)/v = 1/v. Since this can only be larger than OPT, v ≤ 1/OPT. ∎
e
4
.3 Algorithm
For simplicity, assume that OPT is known.6 Translating the thought experiment from
Section 2 to this zero-sum game, we get the following algorithm:
1. Associate an action with each edge e ∈ E.

2. For t = 1, 2, . . . , T = (4·OPT^2 ln |E|)/ε^2:

(a) Use the multiplicative weights algorithm to generate a probability distribution x^t ∈ R^E over the actions/edges.

(b) Let P^t be a column that minimizes the row player's expected payoff (with the expectation with respect to x^t). That is,

P^t ∈ argmin_{P∈𝒫} Σ_{e∈P} x^t_e.   (1)

(c) Define a reward vector r^t with r^t(e) = 1 for e ∈ P^t and r^t(e) = 0 for e ∉ P^t (i.e., the P^t-th column of A). Feed the reward vector r^t into the multiplicative weights algorithm.
⁶For example, embed the algorithm into an outer loop that uses successive doubling to "guess" the value of OPT (i.e., take OPT = 1, 2, 4, 8, . . . until the algorithm succeeds).
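A minimal sketch of the algorithm above, assuming unit capacities and a known OPT. Dijkstra's algorithm serves as the subroutine for step (b); the step size eta and the small example graph are illustrative assumptions.

```python
import heapq
import math

def shortest_path(n, edges, lengths, s, t):
    # Dijkstra on an undirected edge list; returns edge indices of a shortest s-t path.
    adj = [[] for _ in range(n)]
    for idx, (u, v) in enumerate(edges):
        adj[u].append((v, idx))
        adj[v].append((u, idx))
    dist = [math.inf] * n
    prev = [None] * n
    dist[s] = 0.0
    pq = [(0.0, s)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist[u]:
            continue
        for v, idx in adj[u]:
            if d + lengths[idx] < dist[v]:
                dist[v] = d + lengths[idx]
                prev[v] = (u, idx)
                heapq.heappush(pq, (dist[v], v))
    path, u = [], t
    while u != s:              # walk predecessor pointers back to s
        u, idx = prev[u]
        path.append(idx)
    return path

def mw_max_flow(n, edges, s, t, OPT, eps):
    # Section 4.3 sketch: unit capacities, OPT assumed known.
    m = len(edges)
    T = int(4 * OPT * OPT * math.log(m) / eps ** 2) + 1
    eta = eps / (2.0 * OPT)    # assumed step size
    w = [1.0] * m              # one multiplicative weight per edge
    flow = [0.0] * m
    for _ in range(T):
        total = sum(w)
        x = [wi / total for wi in w]               # row player's distribution
        P = set(shortest_path(n, edges, x, s, t))  # column player's best response
        for idx in P:
            flow[idx] += OPT / T    # f* averages OPT units routed on each P^t
        # Reward 1 on the edges of P^t, 0 elsewhere.
        w = [w[i] * (1 + eta) if i in P else w[i] for i in range(m)]
    return flow

# Hypothetical example: two disjoint unit-capacity s-t paths, so OPT = 2.
f = mw_max_flow(4, [(0, 1), (1, 3), (0, 2), (2, 3)], s=0, t=3, OPT=2, eps=0.5)
```

On this instance the returned f* routes exactly OPT units out of s and, per the claim in Section 4.5, at most 1 + ε units on each edge.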
4.4 Running Time
An important observation is that this algorithm never explicitly writes down the payoff ma-
trix A. It maintains one weight per edge, which is a reasonable amount of state. To compute
P^t and the induced reward vector r^t, all that is needed is a subroutine that solves (1) — that is, given the x^t_e's, returns a shortest s-t path (viewing the x^t_e's as edge lengths). Dijkstra's algorithm, for example, works just fine.⁷ Assuming Dijkstra's algorithm is implemented in O(m log n) time, where m and n denote the number of edges and vertices, respectively, the total running time of the algorithm is O((OPT^2/ε^2) · m log m log n). (Note that with unit capacities, OPT ≤ m. If there are no parallel edges, then OPT ≤ n − 1.) This is comparable to some
of the running times we saw for (exact) maximum flow algorithms, but more importantly
these ideas extend to more general problems, including multicommodity flow.
4.5 Approximate Correctness
So how do we extract an approximately optimal flow from this algorithm? After running the
algorithm above, let P1, . . . , PT denote the sequence of paths chosen by the column player
(the same path can be chosen multiple times). Let ft denote the flow that routes OPT units
of flow on the path Pt. (Of course, this probably violates the edge capacity constraints.)
Finally, define f* = (1/T) Σ_{t=1}^T f^t as the "time-average" of these path flows. Note that since each f^t routes OPT units of flow from the source to the sink, so does f*. But is f* feasible?
Claim: f* routes at most 1 + ε units of flow on every edge.
Proof: We proceed by contradiction. If f* routes more than 1 + ε units of flow on the edge e, then more than (1 + ε)T/OPT of the paths in P^1, . . . , P^T include the edge e. Returning to our zero-sum game A, consider the row player strategy z that deterministically plays the edge e. The time-averaged payoff to the row player, in hindsight given the paths chosen by the column player, would have been

(1/T) Σ_{t=1}^T z^T Ay^t = (1/T) Σ_{t : e∈P^t} 1 > (1 + ε)/OPT.

The row player's guarantee (Claim 2 in Section 2) then implies that

(1/T) Σ_{t=1}^T (x^t)^T Ay^t ≥ (1/T) Σ_{t=1}^T z^T Ay^t − ε/OPT > (1 + ε)/OPT − ε/OPT = 1/OPT.

But this contradicts the guarantee that the column player does at least as well as the minimax value of the game (Claim 1 in Section 2), which is 1/OPT by the Claim in Section 4.2. ∎

Scaling down f* by a factor of 1 + ε yields a feasible flow with value at least OPT/(1 + ε).
⁷This subroutine is precisely the "separation oracle" for the dual linear program, as discussed in Lecture #10 in the context of the ellipsoid method.
CS261: A Second Course in Algorithms
Lecture #13: Online Scheduling and Online Steiner Tree∗
Tim Roughgarden†
February 16, 2016
1 Preamble
Last week we began our study of online algorithms with the multiplicative weights algorithm
for online decision-making. We also covered (non-online) applications of this algorithm to
zero-sum games and the fast approximation of certain linear programs. This week covers
more “traditional” results in online algorithms, with applications in scheduling, matching,
and more.
Recall from Lecture #11 what we mean by an online problem.
An Online Problem

1. The input arrives "one piece at a time."

2. An algorithm makes an irrevocable decision each time it receives a new piece of the input.
2 Online Scheduling
A canonical application domain for online algorithms is scheduling, with jobs arriving online
(i.e., one-by-one). There are many algorithms and results for online scheduling problems;
we’ll cover only what is arguably the most classic result.
2.1 The Problem
To specify an online problem, we need to define how the input arrives and what action must
be taken at each step. There are m identical machines on which jobs can be scheduled;
these are known up front. Jobs then arrive online, one at a time, with job j having a known
processing time pj. A job must be assigned to a machine immediately upon its arrival.
A schedule is an assignment of each job to one machine. The load of a machine in a
schedule is the sum of the processing times of the jobs assigned to it. The makespan of a
schedule is the maximum load of any machine. For example, see Figure 1.
Figure 1: Example of makespan assignments. (a) has makespan 4 and (b) has makespan 5.
We consider the objective function of minimizing the makespan. This is arguably the
most practically relevant scheduling objective. For example, if jobs represent pieces of a
task to be processed in parallel (e.g., MapReduce/Hadoop jobs), then for many tasks the
most important statistic is the time at which the last job completes. Minimizing this last
completion time is equivalent to minimizing the makespan.
2.2 Graham's Algorithm
We analyze what is perhaps the most natural approach to the problem, proposed and ana-
lyzed by Ron Graham 50 years ago.
Graham’s Scheduling Algorithm
when a new job arrives, assign it to the machine that currently has the smallest
load (breaking ties arbitrarily)
We measure the performance of this algorithm against the strongest-possible benchmark,
the minimum makespan in hindsight (or equivalently, the optimal clairvoyant solution).1
Since the minimum makespan problem is NP-hard, this benchmark is both omniscient about
the future and also has unbounded computational power. So any algorithm that does almost
as well is a pretty good algorithm!
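Graham's rule is a few lines of Python with a min-heap of machine loads. The four-job instance below is hypothetical; on it the greedy schedule has makespan 5 while the optimum in hindsight is 4, illustrating (but not attaining) the factor of 2 proved below.

```python
import heapq

def graham_online(num_machines, jobs):
    # Maintain a min-heap of (load, machine); each arriving job goes to the
    # currently least-loaded machine. Returns the resulting makespan.
    loads = [(0.0, i) for i in range(num_machines)]
    heapq.heapify(loads)
    for p in jobs:                      # jobs arrive online, one at a time
        load, i = heapq.heappop(loads)
        heapq.heappush(loads, (load + p, i))
    return max(load for load, _ in loads)

# Hypothetical instance: two machines, jobs 3, 1, 1, 3.
# Greedy yields makespan 5; the optimum in hindsight is 4 ({3,1} and {1,3}).
makespan = graham_online(2, [3, 1, 1, 3])
```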
¹Note that the "best fixed action" idea from online decision-making doesn't really make sense here.
2.3 Analysis
In the first half of CS261, we were always asking “how do we know when we’re done (i.e.,
optimal)?” This was the appropriate question when the goal was to design an algorithm
that always computes an optimal solution. In an online problem, we don’t expect any online
algorithm to always compute the optimal-in-hindsight solution. We expect to compromise
on the guarantees provided by online algorithms with respect to this benchmark.
In the first half of CS261, we were obsessed with “optimality conditions” — necessary
and sufficient conditions on a feasible solution for it to be an optimal solution. In the second
half of CS261, we’ll be obsessed with bounds on the optimal solution — quantities that are
"only better than optimal." If our algorithm's performance is not too far from our bound, then it is also not too far from the optimal solution.
Where do such bounds come from? For the two case studies today, simple bounds suffice
for our purposes. Next lecture we’ll use LP duality to obtain such bounds — this will
demonstrate that the same tools that we developed to prove the optimality of an algorithm
can also be useful in proving approximate optimality.
The next two lemmas give two different simple lower bounds on the minimum-possible
makespan (call it OPT), given m machines and jobs with processing times p_1, . . . , p_n.

Lemma 2.1 (Lower Bound #1)

OPT ≥ max_{j=1}^n p_j.
Lemma 2.1 should be clear enough — the biggest job has to go somewhere, and wherever it
is assigned, that machine’s load (and hence the makespan) will be at least as big as the size
of this job.
The second lower bound is almost as simple.
Lemma 2.2 (Lower Bound #2)

OPT ≥ (1/m) Σ_{j=1}^n p_j.

Proof: In every schedule, we have

maximum load of a machine ≥ average load of a machine = (1/m) Σ_{j=1}^n p_j. ∎
These two lemmas imply the following guarantee for Graham’s algorithm.
Theorem 2.3 The makespan of the schedule output by Graham’s algorithm is always at
most twice the minimum-possible makespan (in hindsight).
In online algorithms jargon, Theorem 2.3 asserts that Graham’s algorithm is 2-competitive,
or equivalently has a competitive ratio of at most 2.
Theorem 2.3 is tight in the worst case (as m → ∞), though better bounds are possible
in the (often realistic) special case where all jobs are relatively small (see Exercise Set #7).
Proof of Theorem 2.3: Consider the final schedule produced by Graham’s algorithm, and
suppose machine i determines the makespan (i.e., has the largest load). Let j denote the
last job assigned to i. Why was j assigned to i at that point? It must have been that, at
that time, machine i had the smallest load (by the definition of the algorithm). Thus prior
to j’s assignment, we had
load of i = minimum load of a machine (at that time)
≤
average load of a machine (at that time)
Xj−1
1
=
p .
k
m
k=1
Thus,
Xj−1
1
final load of machine i ≤
p + p
k
j
m
|{z}
k=1
{z }
|
≤
OPT
≤
2OP T,
OPT
≤
with the last inequality following from our two lower bounds on OPT (Lemma 2.1 and 2.2).
ꢀ
Theorem 2.3 should be taken as a representative result in a very large literature. Many
good guarantees are known for different online scheduling algorithms and different scheduling
problems.
3 Online Steiner Tree
We have two more case studies in online algorithms: the online Steiner tree problem (this
lecture) and the online bipartite matching problem (next lecture).2
3.1 Problem Definition
In the online Steiner tree problem:
• an algorithm is given in advance a connected undirected graph G = (V, E) with a nonnegative cost c_e ≥ 0 for each edge e ∈ E;

• "terminals" t_1, . . . , t_k ∈ V arrive online (i.e., one-by-one).

²Because the benchmark of the best-possible solution in hindsight is so strong, for many important problems, all online algorithms have terrible competitive ratios. In these cases, it is important to change the setup so that theory can still give useful advice about which algorithm to use. See the instructor's CS264 course ("beyond worst-case analysis") for much more on this. In CS261, we'll cherrypick a few problems where there are natural online algorithms with good competitive ratios.
The requirement for an online algorithm is to maintain at all times a subgraph of G that
spans all of the terminals that have arrived thus far. Thus when a new terminal arrives, the
algorithm must connect it to the subgraph-so-far. Think, for example, of a cable company
as it builds new infrastructure to reach emerging markets. The gold standard is to compute
the minimum-cost subgraph that spans all of the terminals (the “Steiner tree”).3 The goal
of an online algorithm is to get as close as possible to this gold standard.
3.2 Metric Case vs. General Case
A seemingly special case of the online Steiner tree problem is the metric case. Here, we
assume that:
1. The graph G is the complete graph.⁴

2. The edges satisfy the triangle inequality: for every triple u, v, w ∈ V of vertices, c_uw ≤ c_uv + c_vw.
The triangle inequality asserts that the shortest path between any two vertices is the direct
edge between the vertices (which exists, since G is complete) — that is, adding intermediate
destinations can’t help. The condition states that one-hop paths are always at least as good
as two-hop paths; by induction, one-hop paths are as good as arbitrary paths between the
two endpoints.
For example, distances between points in a normed space (like Euclidean space) satisfy
the triangle inequality. Fares for airline tickets are a non-example: often it’s possible to get
a cheaper price by adding intermediate stops.
It turns out that the metric case of the online Steiner tree problem is no less general than
the general case.
Lemma 3.1 Every α-competitive online algorithm for the metric case of the online Steiner
tree problem can be transformed into an α-competitive online algorithm for the general online
Steiner tree problem.
Exercise Set #7 asks you to supply the proof.
³Since costs are nonnegative, this is a tree, without loss of generality.
⁴By itself, this is not a substantial assumption — one could always complete an arbitrary graph with super-high-cost edges.
3.3 The Greedy Algorithm
We’ll study arguably the most natural online Steiner tree algorithm, which greedily connects
a new vertex to the subgraph-so-far in the cheapest-possible way.5
Greedy Online Steiner Tree

initialize T ⊆ E to the empty set
for each terminal arrival t_i, i = 2, . . . , k do
    add to T the cheapest edge of the form (t_i, t_j) with j < i

For example, in the 11th iteration of the algorithm, the algorithm looks at the 10 edges between the new terminal and the terminals that have already arrived, and connects the new terminal via the cheapest of these edges.⁶
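The greedy algorithm can be sketched directly; the dictionary-of-pairwise-costs representation of the metric is an assumption made for brevity. On the Figure 2 instance (three terminals pairwise at distance 2), greedy pays 4 while the optimal Steiner tree, through the center vertex, pays 3.

```python
def greedy_steiner(cost, terminals):
    # cost[(u, v)] gives the metric distance between u and v (assumed
    # representation: one entry per unordered pair).
    def c(u, v):
        return cost[(u, v)] if (u, v) in cost else cost[(v, u)]
    arrived = [terminals[0]]
    total = 0.0
    for t in terminals[1:]:             # terminals arrive online, one at a time
        nearest = min(arrived, key=lambda s: c(t, s))
        total += c(t, nearest)          # connect t via its cheapest edge
        arrived.append(t)
    return total

# The Figure 2 instance: three terminals pairwise at distance 2. Greedy pays
# 2 + 2 = 4, while the optimum (the three unit-cost spokes) pays 3.
cost = {('t1', 't2'): 2.0, ('t1', 't3'): 2.0, ('t2', 't3'): 2.0}
greedy_cost = greedy_steiner(cost, ['t1', 't2', 't3'])
```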
3.4 Two Examples
Figure 2: First example: terminals t1, t2, t3 are pairwise at distance 2, and each is at distance 1 from a center vertex a.
For example, consider the graph in Figure 2, with edge costs as shown. (Note that the triangle inequality holds.) When the first terminal t1 arrives, the online algorithm doesn't have to do anything. When the second terminal t2 arrives, the algorithm adds the edge (t1, t2), which has cost 2. When terminal t3 arrives, the algorithm is free to connect it to either t1 or t2 (both edges have cost 2). In any case, the greedy algorithm constructs a subgraph with total cost 4. Note that the optimal Steiner tree in hindsight has cost 3 (the spokes).

⁵What else could you do? An alternative would be to build some extra infrastructure, hedging against the possibility of future terminals that would otherwise require redundant infrastructure. This idea actually beats the greedy algorithm in non-worst-case models (see CS264).
⁶This is somewhat reminiscent of Prim's minimum-spanning-tree algorithm. The difference is that Prim's algorithm processes the vertices in a greedy order (the next vertex to connect is the closest one), while the greedy algorithm here is online, and has to process the terminals in the order provided.
Figure 3: Second example.
For a second example, consider the graph in Figure 3. Again, the edge costs obey the triangle inequality. When t1 arrives, the algorithm does nothing. When t2 arrives, the algorithm adds the edge (t1, t2), which has cost 4. When t3 arrives, there is a tie between the edges (t1, t3) and (t2, t3), which both have cost 2. Let's say that the algorithm picks the latter. When terminals t4 and t5 arrive, in each case there are two unit-cost options, and it doesn't matter which one the algorithm picks. At the end of the day, the total cost of the greedy solution is 4 + 2 + 1 + 1 = 8. The optimal solution in hindsight is the path graph t1-t5-t3-t4-t2, which has cost 4.
3.5 Lower Bounds
The second example above shows that the greedy algorithm cannot be better than 2-
competitive. In fact, it is not c-competitive for any constant c.
Proposition 3.2 The (worst-case) competitive ratio of the greedy online Steiner tree algo-
rithm is Ω(log k), where k is the number of terminals.
Exercise Set #7 asks you to supply the proof, by extending the second example above.
The following result is harder to prove, but true.
Proposition 3.3 The (worst-case) competitive ratio of every online Steiner tree algorithm,
deterministic or randomized, is Ω(log k).
3.6 Analysis of the Greedy Algorithm
We conclude the lecture with the following result.
Theorem 3.4 The greedy online Steiner tree algorithm is 2 ln k-competitive, where k is the
number of terminals.
In light of Proposition 3.3, we conclude that the greedy algorithm is an optimal online
algorithm (in the worst case, up to a small constant factor).
The theorem follows easily from the following key lemma, which relates the costs incurred
by the greedy algorithm to that of the optimal solution in hindsight.
Lemma 3.5 For every i = 1, 2, . . . , k − 1, the ith most expensive edge in the greedy solution T has cost at most 2OPT/i, where OPT is the cost of the optimal Steiner tree in hindsight.

Thus, the most expensive edge in the greedy solution has cost at most 2OPT, the second-most expensive edge costs at most OPT, the third-most at most 2OPT/3, and so on. Recall that the greedy algorithm adds exactly one edge in each of the k − 1 iterations after the first, so Lemma 3.5 applies (with a suitable choice of i) to each edge in the greedy solution.
To apply the key lemma, imagine sorting the edges in the final greedy solution from most to least expensive, and then applying Lemma 3.5 to each (for successive values of i = 1, 2, . . . , k − 1). This gives

greedy cost ≤ Σ_{i=1}^{k−1} 2OPT/i = 2OPT Σ_{i=1}^{k−1} 1/i ≤ (2 ln k) · OPT,

where the last inequality follows by estimating the sum by an integral.
It remains to prove the key lemma.
Proof of Lemma 3.5: The proof uses two nice tricks, “tree-doubling” and “shortcutting,”
both of which we’ll reuse later when we discuss the Traveling Salesman Problem.
We first recall an easy fact from graph theory. Suppose H is a connected multi-graph (i.e.,
parallel copies of an edge are OK) in which every vertex has even degree (a.k.a. an “Eulerian
graph”). Then H has an Euler tour, meaning a closed walk (i.e., a not-necessarily-simple
cycle) that uses every edge exactly once. See Figure 4. The all-even-degrees condition is
clearly necessary, since if the tour visits a vertex k times then it must have degree 2k. You’ve
probably seen the proof of sufficiency in a discrete math course; we leave it to Exercise Set #7.⁷
Figure 4: Example graph with Euler tour t1-t2-t3-t1-t4-t1.
⁷Basically, you just peel off cycles one-by-one until you reach the empty graph.
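The two tricks in the proof below — finding an Euler tour of a doubled tree and shortcutting it — can be sketched as follows. The multigraph representation (adjacency lists with one entry per half-edge) is an assumption; euler_tour is Hierholzer's cycle-peeling construction from footnote 7.

```python
def euler_tour(adj, start):
    # Hierholzer's algorithm: repeatedly peel off cycles (footnote 7).
    # adj maps each vertex to a list of neighbors, one entry per half-edge;
    # every vertex must have even degree.
    adj = {u: list(vs) for u, vs in adj.items()}   # mutable local copy
    stack, tour = [start], []
    while stack:
        u = stack[-1]
        if adj[u]:
            v = adj[u].pop()
            adj[v].remove(u)           # consume the matching half-edge
            stack.append(v)
        else:
            tour.append(stack.pop())
    return tour[::-1]                  # closed walk using every edge once

def shortcut(tour, targets):
    # Keep only the first occurrence of each target vertex, in tour order.
    seen, cycle = set(), []
    for v in tour:
        if v in targets and v not in seen:
            seen.add(v)
            cycle.append(v)
    return cycle

# Doubling the two edges of the tree t1-t2, t1-t3 makes every degree even:
H = {'t1': ['t2', 't2', 't3', 't3'], 't2': ['t1', 't1'], 't3': ['t1', 't1']}
C = euler_tour(H, 't1')                # closed walk visiting all terminals
cycle = shortcut(C, {'t1', 't2', 't3'})
```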
Next, let T* be the optimal Steiner tree (in hindsight) spanning all of the terminals t_1, . . . , t_k. Let OPT = Σ_{e∈T*} c_e denote its cost. Obtain H from T* by adding a second copy of every edge (Figure 5). Obviously, H is Eulerian (every vertex degree got doubled) and Σ_{e∈H} c_e = 2OPT. Let C denote an Euler tour of H. C visits each of the terminals at least once, perhaps multiple times, and perhaps visits some other vertices as well. Since C uses every edge of H once, Σ_{e∈C} c_e = 2OPT.
Figure 5: (a) Before doubling edges and (b) after doubling edges.
Now fix a value for the parameter i ∈ {1, 2, . . . , k − 1} in the lemma statement. Define
the "connection cost" of a terminal tj with j > 1 as the cost of the edge that was added to
the greedy solution when tj arrived (from tj to some previous terminal). Sort the terminals
in hindsight in nonincreasing order of connection cost, and let s1, . . . , si be the first (most
expensive) i terminals. The lemma asserts that the cheapest of these has connection cost
at most 2OPT/i. (The ith most expensive terminal is the cheapest of the i most expensive
terminals.)
The tour C visits each of s1, . . . , si at least once. "Shortcut" it to obtain a simple cycle Ci
on the vertex set {s1, . . . , si} (Figure 6). For example, if the first occurrences of the terminals
in C happen to be in the order s1, . . . , si, then Ci is just the edges (s1, s2), (s2, s3), . . . , (si, s1).
In any case, the order of terminals on Ci is the same as that of their first occurrences in C.
Since the edge costs satisfy the triangle inequality, replacing a path by a direct edge between
its endpoints can only decrease the cost. Thus Σ_{e∈Ci} ce ≤ Σ_{e∈C} ce = 2OPT. Since Ci
has only i edges,

    min_{e∈Ci} ce ≤ (1/i) · Σ_{e∈Ci} ce ≤ 2OPT/i,

where the left-hand side is the cheapest edge cost on Ci and the middle term is the average
edge cost on Ci. Thus some edge (sh, sj) ∈ Ci has cost at most 2OPT/i.
Figure 6: Solid edges represent original edges, and dashed edges represent the edges obtained
after shortcutting from t1 to t2, t2 to t3, and t3 to t1.
Consider whichever of sh, sj arrives later in the online ordering, say sj. Since sh arrived
earlier, the edge (sh, sj) is one option for connecting sj to a previous terminal; the greedy
algorithm either connects sj via this edge or by one that is even cheaper. Thus at least one
vertex of {s1, . . . , si}, namely sj, has connection cost at most 2OPT/i. Since these are by
definition the terminals with the i largest connection costs, the proof is complete. □
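For concreteness, the greedy algorithm that the lemma analyzes — connect each arriving terminal by the cheapest edge to some previously arrived terminal — can be sketched as follows (the function name and the distance-matrix input are my own conventions):

```python
def greedy_steiner_cost(dist, arrival_order):
    """Online greedy Steiner tree: connect each arriving terminal to its
    closest previously-arrived terminal. `dist[u][v]` is a metric on the
    vertices. Returns the total cost (the sum of the k-1 connection costs)."""
    seen = [arrival_order[0]]          # the first terminal incurs no cost
    cost = 0
    for t in arrival_order[1:]:
        cost += min(dist[t][s] for s in seen)  # cheapest connecting edge
        seen.append(t)
    return cost
```

For example, with four points 0, 1, 2, 3 on a line (dist[i][j] = |i − j|) arriving in the order 0, 3, 1, 2, the connection costs are 3, 1, 1, for a total of 5.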
CS261: A Second Course in Algorithms
Lecture #14: Online Bipartite Matching∗
Tim Roughgarden†
February 18, 2016
1  Online Bipartite Matching
Our final lecture on online algorithms concerns the online bipartite matching problem. As
usual, we need to specify how the input arrives, and what decision the algorithm has to make
at each time step. The setup is:
• The left-hand side vertices L are known up front.

• The right-hand side vertices R arrive online (i.e., one-by-one). A vertex w ∈ R arrives
  together with all of the incident edges (the graph is bipartite, so all of w's neighbors
  are in L).

• The only time that a new vertex w ∈ R can be matched is immediately on arrival.
The goal is to construct as large a matching as possible. (There are no edge weights; we're
just talking about maximum-cardinality bipartite matching.) We'd love to just wait until all
of the vertices of R arrive and then compute an optimal matching at the end (e.g., via a max
flow computation). But with the vertices of R arriving online, we can’t expect to always do
as well as the best matching in hindsight.
This lecture presents the ideas behind optimal (in terms of worst-case competitive ratio)
deterministic and randomized online algorithms for online bipartite matching. The random-
ized algorithm is based on a non-obvious greedy algorithm. While the algorithms do not
reference any linear programs, we will nonetheless prove the near-optimality of our algo-
rithms by exhibiting a feasible solution to the dual of the maximum matching problem. This
demonstrates that the tools we developed for proving the optimality of algorithms (for max
flow, linear programming, etc.) are more generally useful for establishing the approximate
optimality of algorithms. We will see many more examples of this in future lectures.
Online bipartite matching was first studied in 1990 (when online algorithms were first
hot), but a new 21st-century killer application has rekindled interest in the problem over
the past 7-8 years. (Indeed, the main proof we present was only discovered in 2013!)
The killer application is Web advertising. The vertices of L, which are known up front,
represent advertisers who have purchased a contract for having their ad shown to users that
meet specified demographic criteria. For example, an advertiser might pay (in advance) to
have their ad shown to women between the ages of 25 and 35 who live within 100 miles of
New York City. If an advertiser purchased 5000 views, then there will be 5000 corresponding
vertices on the left-hand side. The right-hand side vertices, which arrive online, correspond
to “eyeballs.” When someone types in a search query or accesses a content page (a new
opportunity to show ads), it corresponds to the arrival of a vertex w ∈ R. The edges incident
to w correspond to the advertisers for whom w meets their targeting criteria. Adding an
edge to the matching then corresponds to showing a given ad to the newly arriving eyeball.
Both Google and Microsoft (and probably other companies) employ multiple people whose
primary job is adapting and fine-tuning the algorithms discussed in this lecture to generate
as much revenue as possible.
2  Deterministic Algorithms

Figure 1: Graph (with left-hand vertices v1, v2 and right-hand vertices w1, w2) where no
deterministic algorithm has competitive ratio better than 1/2.

We first observe that no deterministic algorithm has a competitive ratio better than 1/2.
Consider the example in Figure 1. The two vertices v1, v2 on the left are known up front,
and the first vertex w1 to arrive on the right is connected to both. Every deterministic
algorithm picks either the edge (v1, w1) or (v2, w1).1 In the former case, suppose the second
vertex w2 to arrive is connected only to v1, which is already matched. In this case the online
algorithm's solution has 1 edge, while the best matching in hindsight has size 2. The other
case is symmetric. Thus for every deterministic algorithm, there is an instance where the
matching it outputs is at most 1/2 times the maximum possible in hindsight.

1Technically, the algorithm could pick neither, but then its competitive ratio would be 0 (what if no more
vertices arrive?).
The obvious greedy algorithm has a matching competitive ratio of 1/2. By the "obvious
algorithm" we mean: when a new vertex w ∈ R arrives, match w to an arbitrary unmatched
neighbor (or to no one, if it has no unmatched neighbors).
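A sketch of this greedy rule in code (the input encoding — a count of left-hand vertices plus each arriving vertex's neighbor list — is my own, not from the lecture):

```python
def greedy_online_matching(n_left, arrivals):
    """Obvious greedy algorithm: when a right-hand vertex arrives with its
    list of neighbors in L, match it to the first unmatched neighbor
    (an arbitrary choice), or leave it unmatched if there is none."""
    matched = [False] * n_left
    matching = []                     # list of (left vertex, right vertex)
    for w, neighbors in enumerate(arrivals):
        for v in neighbors:
            if not matched[v]:
                matched[v] = True
                matching.append((v, w))
                break
    return matching
```

On the bad instance of Figure 1 (arrivals [[0, 1], [0]]) it outputs a single edge, while the best matching in hindsight has two.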
Proposition 2.1 The deterministic greedy algorithm has a competitive ratio of 1/2.
Proof: The proposition is easy to prove directly, but here we’ll give a more-sophisticated-
than-necessary proof because it introduces ideas that we’ll build on in the randomized case.
Our proof uses a dual feasible solution as an upper bound on the size of a maximum matching.
Recall the relevant primal-dual pair ((P) and (D), respectively):
    max Σ_{e∈E} xe
    subject to
        Σ_{e∈δ(v)} xe ≤ 1    for all v ∈ L ∪ R
        xe ≥ 0               for all e ∈ E,

and

    min Σ_{v∈L∪R} pv
    subject to
        pv + pw ≥ 1    for all (v, w) ∈ E
        pv ≥ 0         for all v ∈ L ∪ R.
There are some minor differences with the primal-dual pair that we considered in Lecture
#9, when we discussed the minimum-cost perfect matching problem. First, in (P), we're
maximizing cardinality rather than minimizing cost. Second, we allow matchings that are
not perfect, so the constraints in (P) are inequalities rather than equalities. This leads to the
expected modifications of the dual: it is a minimization problem rather than a maximization
problem, therefore with greater-than-or-equal-to constraints rather than less-than-or-equal-
to constraints. Because the constraints in the primal are now inequality constraints, the dual
variables are now nonnegative (rather than unrestricted).
We use these linear programs (specifically, the dual) only for the analysis; the algorithm,
remember, is just the obvious greedy algorithm. We next define a “pre-dual solution” as
follows: for every v ∈ L ∪ R, set
    qv = 1/2 if the greedy algorithm matches v, and qv = 0 otherwise.
The q's are defined in hindsight, purely for the sake of analysis. Or if you prefer, we can
imagine initializing all of the qv's to 0 and then updating them in tandem with the execution
of the greedy algorithm — when the algorithm adds an edge (v, w) to its matching, we set
both qv and qw to 1/2. (Since the chosen edges form a matching, a vertex has its q-value
set to 1/2 at most once.) This alternative description makes it clear that

    |M| = Σ_{v∈L∪R} qv,        (1)

where M is the matching output by the greedy algorithm. (Whenever one edge is added to
the matching, two vertices have their q-values increased to 1/2.)
Next, observe that for every edge (v, w) of the final graph (L ∪ R, E), at least one of qv, qw
is 1/2 (if not both). For if qv = 0, then v was not matched by the algorithm, which means
that w had at least one unmatched neighbor when it arrived, which means the greedy
algorithm matched it (presumably to some other unmatched neighbor) and hence qw = 1/2.
This observation does not imply that q is a feasible solution to the dual linear pro-
gram (D), which requires a sum of at least 1 from the endpoints of every edge. But it does
imply that after scaling up q by a factor of 2 to obtain p = 2q, p is feasible for (D). Thus

    |M| = (1/2) · Σ_{v∈L∪R} pv ≥ (1/2) · OPT,

where Σ_{v∈L∪R} pv is the objective function value of p and OPT denotes the size of the
maximum matching in hindsight. The first equation is from (1) and the definition of p, and
the inequality is from weak duality (when the primal is a maximization problem, every
feasible dual solution provides an upper bound on the optimum). □
3  Online Fractional Bipartite Matching

3.1  The Problem
We won’t actually discuss randomized algorithms in this lecture. Instead, we’ll discuss a
deterministic algorithm for the fractional bipartite matching problem. The keen reader will
object that this is a stupid idea, because we’ve already seen that the fractional and integral
bipartite matching problems are really the same.2 While it’s true that fractions don’t help
the optimal solution, they do help an online algorithm, intuitively by allowing it to “hedge.”
This is already evident in our simple bad example for deterministic algorithms (Figure 1).
When w1 shows up, in the integral case, a deterministic online algorithm has to match w1
fully to either v1 or v2. But in the fractional case, it can match w1 50/50 to both v1 and
v2. Then when w2 arrives, with only one neighbor on the left-hand side, it can at least be
matched with a fractional value of 1/2. The online algorithm produces a fractional matching
with value 3/2 while the optimal solution has size 2. So this example only proves an upper
bound of 3/4 on the best-possible competitive ratio, leaving open the possibility of online
algorithms with competitive ratio bigger than 1/2.

2In Lecture #9 we used the correctness of the Hungarian algorithm to argue that the fractional problem
always has a 0-1 optimal solution (since the algorithm terminates with an integral solution and a dual-feasible
solution with the same objective function value). See also Exercise Set #5 for a direct proof of this.
3.2  The Water Level (WL) Algorithm

We consider the following "Water Level" algorithm, which is a natural way to define "hedging"
in general.

Water-Level (WL) Algorithm

Physical metaphor:
    think of each vertex v ∈ L as a water container with a capacity of 1
    think of each vertex w ∈ R as a source of one unit of water
when w ∈ R arrives:
    drain water from w to its neighbors, always preferring the containers
    with the lowest current water level, until either
        (i) all neighbors of w are full; or
        (ii) w is empty (i.e., has sent all its water)
See also Figure 2 for a cartoon of the water being transferred to the neighbors of a vertex
w. Initially the second neighbor has the lowest level so w only sends water to it; when the
water level reaches that of the next-lowest (the fifth neighbor), w routes water at an equal
rate to both the second and fifth neighbors; when their common level reaches that of the
third neighbor, w routes water at an equal rate to these three neighbors with the lowest
current water level. In this cartoon, the vertex w successfully transfers its entire unit of
water (case (ii)).
Figure 2: Cartoon of water being transferred to vertices.
For example, in the example in Figure 1, the WL algorithm replicates our earlier hedging,
with vertex w1 distributing its water equally between v1 and v2 (triggering case (ii)) and
vertex w2 distributing 1/2 unit of water to its unique neighbor (triggering case (i)).
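The draining step itself is easy to implement exactly: sort the neighbors' current levels and raise the lowest ones to a common level until the unit of water is used up or everything is full. A sketch (names and input format are my own; levels are assumed to be at most the capacity):

```python
def water_fill(levels, amount=1.0, capacity=1.0):
    """Pour `amount` units of water into containers with the given current
    `levels`, always filling the lowest containers first (the water-level
    rule). Returns (new levels, leftover water that could not be placed)."""
    ys = sorted(levels)
    n = len(ys)
    poured, level = 0.0, ys[0]
    for k in range(n):
        # Raising the k+1 lowest containers to the next breakpoint (the
        # (k+2)-th lowest level, or the capacity) costs this much water:
        nxt = min(ys[k + 1] if k + 1 < n else capacity, capacity)
        cost = (k + 1) * (nxt - level)
        if poured + cost >= amount:
            level += (amount - poured) / (k + 1)   # water runs out here
            break
        poured += cost
        level = nxt
    level = min(level, capacity)
    new_levels = [max(y, level) for y in levels]
    leftover = amount - (sum(new_levels) - sum(levels))
    return new_levels, leftover
```

On four empty containers this yields levels of 1/4 each (as in the example of Section 3.3 below); on two containers at level 7/12 it fills both to 1 and reports 1/6 of a unit left over.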
This algorithm is natural enough, but all you’ll have to remember for the analysis is the
following key property.
Lemma 3.1 (Key Property of the WL Algorithm) Let (v, w) ∈ E be an edge of the
final graph G and yv = Σ_{e∈δ(v)} xe the final water level of the vertex v ∈ L. Then w only
sent water to containers when their current water level was yv or less.
Proof: Fix an edge (v, w) with v ∈ L and w ∈ R. The lemma is trivial if yv = 1, so suppose
yv < 1 — that is, the container v is not full at the end of the WL algorithm. This means
that case (i) did not get triggered, so case (ii) was triggered, so the vertex w successfully
routed all of its water to its neighbors. At the time when this transfer was completed, all
containers to which w sent some water have a common level ℓ, and all other neighbors of w
have current water level at least ℓ (cf. Figure 2). At the end of the algorithm, since water
levels only increase, all neighbors of w have final water level ℓ or more. Since w only sent
flow to containers when their current water level was ℓ or less, the proof is complete. □
3.3  Analysis: A False Start
To prove a bound on the competitive ratio of the WL algorithm, a natural idea is to copy
the same analysis approach that worked so well for the integral case (Proposition 2.1). That
is, we define a pre-dual solution in tandem with the execution of the WL algorithm, and
then scale it up to get a solution feasible for the dual linear program (D) in Section 2.
Idea #1:

• initialize qv = 0 for all v ∈ L ∪ R;

• whenever the amount xvw of water sent from w to v goes up by ∆, increase both qv
  and qw by ∆/2.
Inductively, this process maintains at all times the invariant that the value of the current
fractional matching equals Σ_{v∈L∪R} qv. (Whenever the matching size increases by ∆, the
sum of q-values increases by the same amount.)

The hope is that, for some constant c > 1/2, the scaled-up vector p = (1/c) · q is feasible
for (D). If this is the case, then we have proved that the competitive ratio of the WL
algorithm is at least c (since its solution value equals c times the objective function value
Σ_{v∈L∪R} pv of the dual feasible solution p, which in turn is an upper bound on the optimal
matching size).
To see why this doesn't work, consider the example shown in Figure 3. Initially there
are four vertices on the left-hand side. The first vertex w1 ∈ R is connected to every vertex
of L, so the WL algorithm routes one unit of water evenly across the four edges. Now every
container has a water level of 1/4. The second vertex w2 ∈ R is connected to v2, v3, v4. Since
all neighbors have the same water level, w2 splits its unit of water evenly between the three
containers, bringing their water levels up to 1/4 + 1/3 = 7/12. The third vertex w3 ∈ R is
connected only to v3 and v4. The vertex splits its water evenly between these two containers,
but it cannot transfer all of its water; after sending 5/12 units to each of v3 and v4, both
containers are full (triggering case (i)). The last vertex w4 ∈ R is connected only to v4.
Since v4 is already full, w4 can't get rid of any of its water.
The question now is: by what factor do we have to scale up q to get a feasible solution
p = (1/c) · q to (D)? Recall that dual feasibility boils down to the sum of p-values of the
endpoints of every edge being at least 1. We can spot the problem by examining the edge
(v4, w4). The vertex v4 got filled, so its final q-value is 1/2 (as high as it could be with the
current approach). The vertex w4 didn't participate in the fractional matching at all, so its
q-value is 0. Since qv4 + qw4 = 1/2, we would need to scale up by a factor of 2 to achieve
dual feasibility. This does not improve over the competitive ratio of 1/2.
Figure 3: Example showing why Idea #1 does not work. (Edge labels give the amounts of
water routed: 1/4 from w1 to each of v1, . . . , v4; 1/3 from w2 to each of v2, v3, v4; and 5/12
from w3 to each of v3, v4.)
On the other hand, the solution computed by the WL algorithm for this example, while
not optimal, is also not that bad. Its value is 1 + 1 + 5/6 + 0 = 17/6, which is substantially
bigger than 1/2 times the optimal solution (which is 4). Thus this is a bad example only for
the analysis approach, and not for the WL algorithm itself. Can we keep the algorithm the
same, and just be smarter with its analysis?
3.4  Analysis: The Main Idea
Idea #2: when the amount xvw of water sent from w to v goes up by ∆, split the increase
unequally between qv and qw.

To see the motivation for this idea, consider the bottom edge in Figure 3. The WL
algorithm never sends any water on any edge incident to w4, so it's hard to imagine how
its q-value will wind up anything other than 0. So if we want to beat 1/2, we need to make
sure that v4 finishes with a q-value bigger than 1/2. A naive fix for this example would be to
only increase the q-values for vertices of L, and not those of R; but this would fail miserably
if w1 were the only vertex to arrive (then all q-values on the left would be 1/4, all those on
the right 0). To hedge between the various possibilities, as a vertex v ∈ L gets more and
more full, we will increase its q-value more and more quickly. Provided it increases quickly
enough as v becomes full, it is conceivable that v could end up with a q-value bigger than
1/2.
Summarizing, we'll use unequal splits between the q-values of the endpoints of an edge,
with the splitting ratio evolving over the course of the algorithm.

There are zillions of ways to split an increase of ∆ on xvw between qv and qw (as a function
of v's current water level). The plan is to give a general analysis that is parameterized by such
a "splitting function," and solve for the splitting function that leads to the best competitive
ratio. Don't forget that all of this is purely for the analysis; the algorithm is always the WL
algorithm.
So fix a nondecreasing "splitting function" g : [0, 1] → [0, 1]. Then:

• initialize qv = 0 for all v ∈ L ∪ R;

• whenever the amount xvw of water sent from w to v goes up by an infinitesimal amount
  dz, and the current water level of v is yv = Σ_{e∈δ(v)} xe:

    – increase qv by g(yv)dz;

    – increase qw by (1 − g(yv))dz.

For example, if g is the constant function always equal to 0 (respectively, 1), then only
the vertices of R (respectively, vertices of L) receive positive q-values. If g is the constant
function always equal to 1/2, then we recover our initial analysis attempt, with the increase
on an edge split equally between its endpoints.

By construction, no matter how we choose the function g, we have

    current value of the WL fractional matching = current value of Σ_{v∈L∪R} qv,
at all times, and in particular at the conclusion of the algorithm.
For the analysis (parameterized by the choice of g), fix an arbitrary edge (v, w) of the
final graph. We want a worst-case lower bound on qv + qw (hopefully, bigger than 1/2).
For the first case, suppose that at the termination of the WL algorithm, the vertex v ∈ L
is full (i.e., yv = Σ_{e∈δ(v)} xe = 1). At the time that v's current water level was z, it accrued
q-value at rate g(z). Integrating over these accruals, we have

    qv + qw ≥ qv = ∫_0^1 g(z)dz.        (2)

(It may seem sloppy to throw out the contribution of qw ≥ 0, but Figure 3 shows that when
v is full it might well be the case that some of its neighbors have q-value 0.) Note that the
bigger the function g is, the bigger the lower bound in (2).
For the second case, suppose that v only has water level yv < 1 at the conclusion of the
WL algorithm. It follows that w successfully routed its entire unit of water to its neighbors
(otherwise, the WL algorithm would have routed more water to the non-full container v).
Here's where we use the key property of the WL algorithm (Lemma 3.1): whenever w sent
water to a container, the current water level of that container was at most yv. Thus, since
the function g is nondecreasing, whenever w routed any water, it accrued q-value at rate at
least 1 − g(yv). Integrating over the unit of water sent, we obtain

    qw ≥ ∫_0^1 (1 − g(yv))dz = 1 − g(yv).
As in the first case, we have

    qv = ∫_0^{yv} g(z)dz,

and hence

    qv + qw ≥ (∫_0^{yv} g(z)dz) + 1 − g(yv).        (3)

Note the lower bound in (3) is generally larger for smaller functions g (since 1 − g(yv) is
bigger). This is the tension between the two cases.
For example, if we take g to be identically 0, then the lower bounds (2) and (3) read 0
and 1, respectively. With g identically equal to 1, the values are reversed. With g identically
equal to 1/2, as in our initial attempt, the right-hand sides of both (2) and (3) are guaranteed
to be at least 1/2 (though not larger).
3.5  Solving for the Optimal Splitting Function
With our lower bounds (2) and (3) on the worst-case value of qv + qw for an edge (v, w), our
task is clear: we want to solve for the splitting function g that makes the minimum of these
two lower bounds as large as possible. If we can find a function g such that the right-hand
sides of (2) and (3) (for any yv ∈ [0, 1]) are both at least c, then we will have proved that
the WL algorithm is c-competitive. (Recall the argument: the value of the WL matching is
Σ_v qv, and p = (1/c) · q is a feasible dual solution, whose value is an upper bound on the
maximum matching.)
Solving for the best nondecreasing splitting function g may seem an intimidating prospect
— there are an infinite number of functions to choose from. In situations like this, a good
strategy is to "guess and check" — try to develop intuition for what the right answer might
look like and then verify your guess. There are many ways to guess, but often in an optimal
analysis there is "no slack anywhere" (since otherwise, a better solution could take advantage
of this slack). In our context, this corresponds to guessing that the optimal function g
equalizes the lower bound in (2) with that in (3), and with the second lower bound tight
simultaneously for all values of yv ∈ [0, 1]. There is no a priori guarantee that such a g exists,
and if such a g exists, its optimality still needs to be verified. But it's still a good strategy
for generating a guess.
Let's start with the guess that the lower bound in (3) is the same for all values of
yv ∈ [0, 1]. This means that

    (∫_0^{yv} g(z)dz) + 1 − g(yv),

when viewed as a function of yv, is a constant function. This means its derivative (w.r.t. yv)
is 0, so

    g(yv) − g′(yv) = 0,

i.e., the derivative of g is the same as g.3 This implies that g(z) has the form g(z) = ke^z for
a constant k > 0. This is great progress: instead of an infinite-dimensional g to solve for,
we now just have the single parameter k to solve for.
Now let's use the guess that the two lower bounds in (2) and (3) are the same. Plugging
ke^z into the lower bound in (2) gives

    ∫_0^1 ke^z dz = k [e^z]_0^1 = k(e − 1),

which gets larger with k. Plugging ke^z into the lower bound in (3) gives (for any yv ∈ [0, 1])

    ∫_0^{yv} ke^z dz + 1 − ke^{yv} = k(e^{yv} − 1) + 1 − ke^{yv} = 1 − k.

This lower bound is independent of the choice of yv — we knew that would happen, it's how
we chose g(z) = ke^z — and gets larger with smaller k. Equalizing the two lower bounds
k(e − 1) and 1 − k and solving for k, we get k = 1/e, and so the splitting function is
g(y) = e^{y−1}.
(Thus when a vertex v ∈ L is empty it gets a 1/e share of the increase of an incident edge;
the share increases as v gets more full, and approaches 100% as v becomes completely full.)
Our lower bounds in (2) and (3) are then both equal to

    1 − 1/e ≈ 63.2%.

This proves that the WL algorithm is (1 − 1/e)-competitive, a significant improvement over
the more obvious 1/2-competitive algorithm.

3I don't know about you, but this is pretty much the only differential equation that I remember how to
solve.
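As a quick numerical sanity check of this derivation (a sketch; the midpoint rule stands in for exact integration, and all names are my own), one can verify that with g(z) = e^{z−1} the right-hand sides of (2) and (3) both come out to 1 − 1/e for every value of yv:

```python
import math

def g(z):
    # the splitting function derived above: g(z) = e^(z-1)
    return math.exp(z - 1.0)

def integral(f, a, b, n=100_000):
    # midpoint-rule approximation of the integral of f over [a, b]
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

target = 1 - 1 / math.e
bound2 = integral(g, 0.0, 1.0)                    # right-hand side of (2)
assert abs(bound2 - target) < 1e-6
for yv in (0.0, 0.25, 0.5, 0.75, 1.0):
    bound3 = integral(g, 0.0, yv) + 1 - g(yv)     # right-hand side of (3)
    assert abs(bound3 - target) < 1e-6
```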
3.6  Epilogue
In this lecture we gave a (1 − 1/e)-competitive (deterministic) online algorithm for the online
fractional bipartite matching problem. The same ideas can be used to design a randomized
online algorithm for the original integral online bipartite matching problem that always
outputs a matching with expected size at least 1 − 1/e times the maximum possible. (The
expectation is over the random coin flips made by the algorithm.) The rough idea is to set
things up so that the probability that a given edge is included in the matching plays the same
role as its fractional value in the WL algorithm. Implementing this idea is not trivial, and
the details are outlined in Problem Set #4.
But can we do better? Either with a smarter algorithm, or with a smarter analysis
of these same algorithms? (Recall that being smarter improved the analysis of the WL
algorithm from 1/2 to 1 − 1/e.) Even though 1 − 1/e may seem like a weird number, the
answer is negative: no online algorithm, deterministic or randomized, has a competitive
ratio better than 1 − 1/e for maximum bipartite matching. The details of this argument are
outlined in Problem Set #3.
CS261: A Second Course in Algorithms
Lecture #15: Introduction to Approximation
Algorithms∗
Tim Roughgarden†
February 23, 2016
1  Coping with NP-Completeness
All of CS161 and the first half of CS261 focus on problems that can be solved in polynomial
time. A sad fact is that many practically important and frequently occurring problems do
not seem to be polynomial-time solvable, that is, are NP-hard.1
As an algorithm designer, what does it mean if a problem is NP-hard? After all, a
real-world problem doesn’t just go away after you realize that it’s NP-hard. The good news
is that NP-hardness is not a death sentence — it doesn’t mean that you can’t do anything
practically useful. But NP-hardness does throw the gauntlet to the algorithm designer, and
suggests that compromises may be necessary. Generally, more effort (computational and
human) will lead to better solutions to NP-hard problems. The right effort vs. solution
quality trade-off depends on the context, as well as the relevant problem size. We’ll discuss
algorithmic techniques across the spectrum — from low-effort decent-quality approaches to
high-effort high-quality approaches.
So what are some possible compromises? First, you can restrict attention to a relevant
special case of an NP-hard problem. In some cases, the special case will be polynomial-
time solvable. (Example: the Vertex Cover problem is NP-hard in general graphs, but on
Problem Set #2 you proved that, in bipartite graphs, the problem reduces to max flow/min
cut.) In other cases, the special case remains NP-hard but is still easier than the general
case. (Example: the Traveling Salesman Problem in Lecture #16.) Note that this approach
requires non-trivial human effort — implementing it requires understanding and articulating
1I will assume that you're familiar with the basics of NP-completeness from your other courses, like
CS154. If you want a refresher, see the videos on the Course site.
whatever special structure your particular application has, and then figuring out how to
exploit it algorithmically.
A second compromise is to spend more than a polynomial amount of time solving the
problem, presumably using tons of hardware and/or restricting to relatively modest problem
sizes. Hopefully, it is still possible to achieve a running time that is faster than naive brute-
force search. While NP-completeness is sometimes interpreted as “there’s probably nothing
better than brute-force search,” the real story is more nuanced. Many NP-complete problems
can be solved with algorithms that, while running in exponential time, are significantly faster
than brute-force search. Examples that we’ll discuss later include 3SAT (with a running
time of (4/3)^n rather than 2^n) and the Traveling Salesman Problem (with a running time
of 2^n instead of n!). Even for NP-hard problems where we don't know any algorithms that
provably beat brute-force search in the worst case, there are almost always speed-up tricks
that help a lot in practice. These tricks tend to be highly dependent on the particular
application, so we won’t really talk about any in CS261 (where the focus is on general
techniques).
A third compromise, and the one that will occupy most of the rest of the course, is to
relax correctness. For an optimization problem, this means settling for a feasible solution
that is only approximately optimal. Of course one would like the approximation to be as
good as possible. Algorithms that are guaranteed to run in polynomial time and also be
near-optimal are called approximation algorithms, and they are the subject of this and the
next several lectures.
2  Approximation Algorithms
In approximation algorithm design, the hard constraint is that the designed algorithm should
run in polynomial time on every input. For an NP-hard problem, assuming P ≠ NP, this
necessarily implies that the algorithm will compute a suboptimal solution in some cases.
The obvious goal is then to get as close to an optimal solution as possible (ideally, on every
input).
There is a massive literature on approximation algorithms — a good chunk of the algo-
rithms research community has been obsessed with them for the past 25+ years. As a result,
many interesting design techniques have been developed. We’ll only scratch the surface in
our lectures, and will focus on the most broadly useful ideas and problems.
One take-away from our study of approximation algorithms is that the entire algorithmic
toolbox that you’ve developed during CS161 and CS261 remains useful for the design and
analysis of approximation algorithms. For example, greedy algorithms, divide and conquer,
dynamic programming, and linear programming all have multiple killer applications in ap-
proximation algorithms (we’ll see a few). And there are other techniques, like local search,
which usually don’t yield exact algorithms (even for polynomial-time solvable problems) but
seem particularly well suited for designing good heuristics.
The rest of this lecture sets the stage with four relatively simple approximation algorithms
for fundamental NP-hard optimization problems.
2.1  Example: Minimum-Makespan Scheduling
We’ve already seen a couple of examples of approximation algorithms in CS261. For example,
recall the problem of minimum-makespan scheduling, which we studied in Lecture #13.
There are m identical machines, and n jobs with processing times p1, . . . , pn. The goal is to
schedule all of the jobs to minimize the makespan (the maximum load, where the load of a
machine is the sum of the processing times of the jobs assigned to it) — that is, to balance
the loads of the machines as evenly as possible.
In Lecture #13, we studied the online version of this problem, with jobs arriving one-
by-one. But it’s easy to imagine applications where you get to schedule a batch of jobs all
at once. This is the offline version of the problem, with all n jobs known up front. This
problem is NP-hard.2
Recall Graham’s algorithm, which processes the jobs in the given (arbitrary) order, al-
ways scheduling the next job on the machine that currently has the lightest load. This
algorithm can certainly be implemented in polynomial time, so we can reuse it as a legiti-
mate approximation algorithm for the offline problem. (Now the fact that it processes the
jobs online is just a bonus.) Because it always produces a schedule with makespan at most
twice the minimum possible (as we proved in Lecture #13), it is a 2-approximation algo-
rithm. The factor “2” here is called the approximation ratio of the algorithm, and it plays
the same role as the competitive ratio in online algorithms.
Can we do better? We can, by exploiting the fact that an (offline) algorithm knows all of
the jobs up front. A simple thing that an offline algorithm can do that an online algorithm
cannot is sort the jobs in a favorable order. Just running Graham's algorithm on the jobs
in order from largest to smallest already improves the approximation ratio to 4/3 (a good
homework problem).
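Both variants can be sketched in a few lines of Python (a minimal illustration of my own, not code from the notes). A min-heap tracks the currently least-loaded machine:

```python
import heapq

def graham_makespan(processing_times, m, sort_first=False):
    """Greedily assign each job to the currently least-loaded of m machines.

    With sort_first=False this is Graham's algorithm (a 2-approximation);
    with sort_first=True jobs are processed from largest to smallest,
    improving the approximation ratio to 4/3.
    Returns the makespan (the maximum machine load).
    """
    jobs = sorted(processing_times, reverse=True) if sort_first else processing_times
    loads = [0] * m  # min-heap of current machine loads
    heapq.heapify(loads)
    for p in jobs:
        lightest = heapq.heappop(loads)
        heapq.heappush(loads, lightest + p)
    return max(loads)
```

For instance, on jobs [2, 3, 4, 6, 2, 2] with m = 3, the arbitrary-order rule produces makespan 8, while the sorted rule finds the optimal makespan 7.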
2.2 Example: Knapsack
Another example that you might have seen in CS161 (depending on who you took it from)
is the Knapsack problem. We’ll just give an executive summary; if you haven’t seen this
material before, refer to the videos posted on the course site.
An instance of the Knapsack problem is n items, each with a value and a weight. Also
given is a capacity W. The goal is to identify the subset of items with the maximum total
value, subject to having total weight at most W. The problem gets its name from a silly
story of a burglar trying to fill up a sack with the most valuable items. But the problem
comes up all the time, either directly or as a subroutine in a more complicated problem —
whenever you have a shared resource with a hard capacity, you have a knapsack problem.
Students usually first encounter the Knapsack problem as a killer application of dynamic
programming. For example, one such algorithm, which works as long as all item weights
are integers, runs in time O(nW). Note that this is not a polynomial-time algorithm, since
the input size (the number of keystrokes needed to type in the input) is only O(n log W).
(Writing down the number W only takes log W digits.) And in fact, the knapsack problem
is NP-hard, so we don't expect there to be a polynomial-time algorithm. Thus the O(nW)
dynamic programming solution is an example of an algorithm for an NP-hard problem that
beats brute-force search (unless W is exponential in n), while still running in time exponential
in the input size.

²For the most part, we won't bother to prove any NP-hardness results in CS261. The NP-hardness proofs are all of the exact form that you studied in a course like CS154 — one just exhibits a polynomial-time reduction from a known NP-hard problem to the current problem. Many of the problems that we study were among the first batch of NP-complete problems identified by Karp in 1972.
What if we want a truly polynomial-time algorithm? NP-hardness says that we’ll have
to settle for an approximation. A natural greedy algorithm, which processes the items in
order of value divided by size ("bang-per-buck"), achieves a 1/2-approximation; that is, it is
guaranteed to output a feasible solution with total value at least 50% of the maximum
possible.³ If you're willing to work harder, then by rounding the data (basically throwing out
the lower-order bits) and then using dynamic programming (on an instance with relatively
small numbers), one obtains a (1 − ε)-approximation, for a user-specified parameter ε > 0,
in time polynomial in n and 1/ε. (By NP-hardness, we expect the running time to blow up
as ε gets close to 0.) This is pretty much the best-case scenario for an NP-hard problem —
arbitrarily close approximation in polynomial time.
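Here is a short Python sketch of the greedy 1/2-approximation (my own illustration, not code from the notes). Note the final comparison against the single most valuable item, which is what the guarantee relies on:

```python
def knapsack_half_approx(items, capacity):
    """Greedy 1/2-approximation for Knapsack.

    items: list of (value, weight) pairs, each with weight <= capacity.
    Packs items in decreasing value/weight ("bang-per-buck") order, then
    returns the better of the greedy pack and the single most valuable
    item -- this safeguard is what guarantees the 1/2 bound.
    """
    greedy_value, remaining = 0, capacity
    for value, weight in sorted(items, key=lambda it: it[0] / it[1], reverse=True):
        if weight <= remaining:
            greedy_value += value
            remaining -= weight
    best_single = max(value for value, _ in items)
    return max(greedy_value, best_single)
```

On items [(60, 10), (100, 20), (120, 30)] with capacity 50, the greedy pack has value 160 (optimal is 220, so the output is comfortably above half of optimal); on [(2, 1), (10, 10)] with capacity 10, the bang-per-buck order alone would return only 2, and the single-item safeguard rescues the answer 10.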
2.3 Example: Steiner Tree
Next we revisit the other problem that we studied in Lecture #13, the Steiner tree problem.
Recall that the input is an undirected graph G = (V, E) with a nonnegative cost ce ≥ 0 for
each edge e ∈ E. Recall also that there is no loss of generality in assuming that G is the
complete graph and that the edge costs satisfy the triangle inequality (i.e., cuw ≤ cuv + cvw
for all u, v, w ∈ V ); see Exercise Set #7. Finally, there is a set R = {t1, . . . , tk} of vertices
called “terminals.” The goal is to compute the minimum-cost subgraph that spans all of the
terminals. We previously studied this problem with the terminals arriving online, but the
offline version of the problem, with all terminals known up front, also makes perfect sense.
In Lecture #13 we studied the natural greedy algorithm for the online Steiner tree prob-
lem, where the next terminal is connected via a direct edge to a previously arriving terminal
in the cheapest-possible way. We proved that the algorithm always computes a Steiner tree
with cost at most 2 ln k times the best-possible solution in hindsight. Since the algorithm is
easy to implement in polynomial time, we can equally well regard it as a 2 ln k-approximation
algorithm (with the fact that it processes terminals online just a bonus). Can we do some-
thing smarter if we know all the terminals up front?
As with job scheduling, better bounds are possible in the offline model because of the
ability to sort the terminals in a favorable order. Probably the most natural order in which
to process the terminals is to always process next the terminal that is the cheapest to connect
to a previous terminal. If you think about it a minute, you realize that this is equivalent to
running Prim’s MST algorithm on the subgraph induced by the terminals. This motivates:
³Technically, to achieve this for every input, the algorithm takes the better of this greedy solution and the maximum-value item.
The MST heuristic for metric Steiner tree: output the minimum spanning tree of
the subgraph induced by the terminals.
Since the Steiner tree problem is NP-hard and the MST can be computed in polynomial
time, we expect this heuristic to produce a suboptimal solution in some cases. A concrete
example is shown in Figure 1, where the MST of {t1, t2, t3} costs 4 while the optimal Steiner
tree has cost 3. (Thus the cost can be decreased by spanning additional vertices; this is what
makes the Steiner tree problem hard.) Using larger "wheel" graphs of the same type, it can
be shown that the MST heuristic can be off by a factor arbitrarily close to 2 (Exercise Set
#8). It turns out that there are no worse examples.
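To make the heuristic concrete, here is a Python sketch (an O(n²) Prim implementation of my own; the names are illustrative), checked on the Figure 1 wheel example:

```python
def mst_cost(vertices, cost):
    """Cost of a minimum spanning tree of the complete graph on `vertices`.

    Prim's algorithm in O(n^2); cost(u, v) gives the (metric) edge cost.
    """
    vertices = list(vertices)
    best = {v: cost(vertices[0], v) for v in vertices[1:]}
    total = 0
    while best:
        v = min(best, key=best.get)  # cheapest vertex to attach to the tree
        total += best.pop(v)
        for w in best:
            best[w] = min(best[w], cost(v, w))
    return total

def mst_heuristic_steiner(terminals, cost):
    """The MST heuristic: span the terminals only, ignoring Steiner vertices."""
    return mst_cost(terminals, cost)
```

On Figure 1's instance (terminal-terminal edges of cost 2, edges to the hub a of cost 1), the heuristic returns 4, while spanning the extra vertex a gives the optimal cost 3.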
Figure 1: MST heuristic will pick {t1, t2}, {t2, t3}, but the best Steiner tree (dashed edges) is {a, t1}, {a, t2}, {a, t3}.
Theorem 2.1 In the metric Steiner tree problem, the cost of the minimum spanning tree of
the terminals is always at most twice the cost of an optimal solution.
Proof: The proof is similar to our analysis of the online Steiner tree problem (Lecture #13),
only easier. It's easier to relate the cost of the MST heuristic to that of an optimal solution
than for the online greedy algorithm — the comparison can be done in one shot, rather than
on an edge-by-edge basis.

For the analysis, let T* denote a minimum-cost Steiner tree. Obtain H from T* by adding
a second copy of every edge (Figure 2(a)). Obviously, H is Eulerian (every vertex degree got
doubled) and Σe∈H ce = 2OPT. Let C denote an Euler tour of H — a (non-simple) closed
walk using every edge of H exactly once. We again have Σe∈C ce = 2OPT.

The tour C visits each of t1, . . . , tk at least once. "Shortcut" it to obtain a simple cycle Ĉ
on the vertex set {t1, . . . , tk} (Figure 2(b)); since the edge costs satisfy the triangle inequality,
this only decreases the cost. Ĉ minus an edge is a spanning tree of the subgraph induced by
R that has cost at most 2OPT; the MST can only be better. ꢀ
Figure 2: (a) Adding a second copy of each edge in T* to form H. Note that H is Eulerian. (b) Shortcutting edge pairs ({t1, a}, {a, t2}), ({t2, a}, {a, t3}), ({t3, a}, {a, t1}) to {t1, t2}, {t2, t3}, {t3, t1}, respectively.
2.4 Example: Set Coverage
Next we study a problem that we haven’t seen before, set coverage. This problem is a
killer application for greedy algorithms in approximation algorithm design. The input is a
collection S1, . . . , Sm of subsets of some ground set U (each subset described by a list of its
elements), and a budget k. The goal is to pick k subsets to maximize the size of their union
(Figure 3). All else being equal, bigger sets are better for the set coverage problem. But
it’s not so simple — some sets are largely redundant, while others are uniquely useful (cf.,
Figure 3).
Figure 3: Example set coverage problem. If k = 2, we should pick the blue sets. Although
the red set is the largest, picking it is redundant.
Set coverage is a basic problem that comes up all the time (often not even disguised). For
example, suppose your start-up only has the budget to hire k new people. Each applicant
can be thought of as a set of skills. The problem of hiring to maximize the number of distinct
skills covered is a set coverage problem. Similarly for choosing locations for factories/fire
engines/Web caches/artisanal chocolate shops to cover as many neighborhoods as possible.
Or, in machine learning, picking a small number of features to explain as much of the data
as possible. Or, in HCI, given a budget on the number of articles/windows/menus/etc. that
can be displayed at any given time, maximizing the coverage of topics/functionality/etc.
The set coverage problem is NP-hard. Turning to approximation algorithms, the follow-
ing greedy algorithm, which increases the union size as much as possible at each iteration,
seems like a natural and good idea.
Greedy Algorithm for Set Coverage

for i = 1, 2, . . . , k: do
compute the set Ai maximizing the number of new elements covered (relative to A1 ∪ · · · ∪ Ai−1)
return {A1, . . . , Ak}
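The pseudocode above translates directly to Python (a minimal sketch of my own, with ties broken arbitrarily):

```python
def greedy_set_coverage(sets, k):
    """Pick k sets greedily, maximizing newly covered elements each round.

    sets: list of Python sets over some ground set.
    Returns (indices of the chosen sets, the covered elements).
    Theorem 2.2 guarantees coverage >= (1 - (1 - 1/k)^k) * OPT.
    """
    covered, chosen = set(), []
    for _ in range(k):
        # index of the set contributing the most new elements
        i = max(range(len(sets)), key=lambda j: len(sets[j] - covered))
        chosen.append(i)
        covered |= sets[i]
    return chosen, covered
```

On the k = 2 bad example described next (two halves of {0, …, 99} plus a decoy covering 51 elements), the greedy algorithm covers 76 elements versus the optimal 100 — right around the 3/4 bound.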
This algorithm can clearly be implemented in polynomial time, so we don't expect it to
always compute an optimal solution. It's useful to see some concrete examples of what can
go wrong.
Figure 4: (a) Bad example when k = 2 (b) Bad example when k = 3.
For the first example (Figure 4(a)), set the budget k = 2. There are three subsets. S1
and S2 partition the ground set U half-half, so the optimal solution has size |U|. We trick
the greedy algorithm by adding a third subset S3 that covers slightly more than half the
elements. The greedy algorithm then picks S3 in its first iteration, and can only choose
one of S1, S2 in the second iteration (it doesn't matter which). Thus the size of the greedy
solution is ≈ (3/4)|U|. Thus even when k = 2, the best-case scenario would be that the greedy
algorithm is a 3/4-approximation.
We next extend this example (Figure 4(b)). Take k = 3. Now the optimal solution is
S1, S2, S3, which partition the ground set into equal-size parts. To trick the greedy algorithm
in the first iteration (i.e., prevent it from taking one of the optimal sets S1, S2, S3), we add a
set S4 that covers slightly more than 1/3 of the elements and overlaps evenly with S1, S2, S3.
To trick it again in the second iteration, note that, given S4, choosing any of S1, S2, S3 would
cover (1/3) · (2/3) · |U| = (2/9)|U| new elements. Thus we add a set S5, disjoint from S4, covering slightly
more than a 2/9 fraction of U. In the third iteration we allow the greedy algorithm to pick one
of S1, S2, S3. The value of the greedy solution is ≈ |U|(1/3 + 2/9 + 4/27) = (19/27)|U|. This is roughly
70% of |U|, so it is a worse example for the greedy algorithm than the first.
Exercise Set #8 asks you to extend this family of bad examples to show that, for all k,
the greedy solution could be as small as

1 − (1 − 1/k)^k

times the size of an optimal solution. (Note that with k = 2, 3 we get 3/4 and 19/27.) This
expression is decreasing with k, and approaches 1 − 1/e ≈ 63.2% in the limit (since 1 − x
approaches e^(−x) for x going to 0, recall Figure 5).⁴
Figure 5: Graph showing 1 − x approaching e^(−x) for small x.
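A quick numerical check (my own illustration, not from the notes) confirms the k = 2, 3 values and the limit:

```python
import math

def greedy_guarantee(k):
    """Worst-case fraction of OPT covered by greedy set coverage: 1 - (1 - 1/k)^k."""
    return 1 - (1 - 1 / k) ** k

# matches the hand-built bad examples from the text
assert abs(greedy_guarantee(2) - 3 / 4) < 1e-12
assert abs(greedy_guarantee(3) - 19 / 27) < 1e-12

# the guarantee decreases toward 1 - 1/e ~ 0.632 as k grows
assert abs(greedy_guarantee(10**6) - (1 - 1 / math.e)) < 1e-6
```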
These examples show that the following guarantee is remarkable.
Theorem 2.2 For every k ≥ 1, the greedy algorithm is a (1 − (1 − 1/k)^k)-approximation
algorithm for set coverage instances with budget k.
Thus there are no worse examples for the greedy algorithm than the ones we identified
above. Here’s what’s even more amazing: under standard complexity assumptions, there is
no polynomial-time algorithm with a better approximation ratio!5 In this sense, the greedy
algorithm is an optimal approximation algorithm for the set coverage problem.
We now turn to the proof of Theorem 2.2. The following lemma proves a sense in which
the greedy algorithm makes healthy progress at every step. (This is the most common way
to analyze a greedy algorithm, whether for exact or approximate guarantees.)
⁴There's that strange number again!
⁵As k grows large, that is. When k is a constant, the problem can be solved optimally in polynomial time using brute-force search.
Lemma 2.3 Suppose that the first i − 1 sets A1, . . . , Ai−1 computed by the greedy algorithm
cover ℓ elements. Then the next set Ai chosen by the algorithm covers at least

(1/k)(OPT − ℓ)

new elements, where OPT is the value of an optimal solution.

Proof: As a thought experiment, suppose that the greedy algorithm were allowed to pick k
new sets in this iteration. Certainly it could cover OPT − ℓ new elements — just pick all of
the k subsets in the optimal solution. One of these k sets must cover at least (1/k)(OPT − ℓ)
new elements, and the set Ai chosen by the greedy algorithm is at least as good. ꢀ
Now we just need a little algebra to prove the approximation guarantee.
Proof of Theorem 2.2: Let gi = |A1 ∪ · · · ∪ Ai| denote the number of elements covered by the
greedy solution after i iterations. Applying Lemma 2.3, we get

gk = (gk − gk−1) + gk−1 ≥ (1/k)(OPT − gk−1) + gk−1 = OPT/k + (1 − 1/k)gk−1.

Applying it again we get

gk ≥ OPT/k + (1 − 1/k)[OPT/k + (1 − 1/k)gk−2] = OPT/k + (1 − 1/k)(OPT/k) + (1 − 1/k)²gk−2.

Iterating, we wind up with

gk ≥ (OPT/k)[1 + (1 − 1/k) + (1 − 1/k)² + · · · + (1 − 1/k)^(k−1)].

(There are k terms, one per iteration of the greedy algorithm.) Recalling from your discrete
math class the identity

1 + z + z² + · · · + z^(k−1) = (1 − z^k)/(1 − z)

for z ∈ (0, 1) — just multiply both sides by 1 − z to verify — we get

gk ≥ (OPT/k) · (1 − (1 − 1/k)^k)/(1 − (1 − 1/k)) = OPT(1 − (1 − 1/k)^k),

as desired. ꢀ
2.5 Influence Maximization
Guarantees for the greedy algorithm for set coverage and various generalizations were already
known in the 1970s. But just over the last dozen years, these ideas have taken off in the
data mining and machine learning communities. We’ll just mention one representative and
influential (no pun intended) example, due to Kempe, Kleinberg, and Tardos in 2003.
Consider a “social network,” meaning a directed graph G = (V, E). For our purposes, we
interpret an edge (v, w) as “v influences w.” (For example, maybe w follows v on Twitter.)
We next posit a simple model of how an idea/news item/meme/etc. “goes viral,” called
a “cascade model.”6
• Initially the vertices in some set S are "active," all other vertices are "inactive." Every edge is initially "undetermined."

• While there is an active vertex v and an undetermined edge (v, w):

– with probability p, edge (v, w) is marked "active," otherwise it is marked "inactive;"

– if (v, w) is active and w is inactive, then mark w as active.
Thus whenever a vertex gets activated, it has the opportunity to activate all of the vertices
that it influences (if they're not already activated). Note that once a vertex is activated, it
is active forevermore. A vertex can get multiple chances to be activated, corresponding to
the number of its influencers who get activated. See Figure 6. In the example, note that a
vertex winds up getting activated if and only if there is a path of activated edges to it from
an initially active vertex.
Figure 6: Example cascade model. Initially, only a is activated. b (and similarly c) can get activated by a with probability p. d has a chance to get activated by either a, b, or c.
The influence maximization problem is, given a directed graph G = (V, E) and a budget k,
to compute the subset S ⊆ V of size k that maximizes the expected number of active vertices
at the conclusion of the cascade, given that the vertices of S are active at the beginning.
⁶Such models were originally proposed in epidemiology, to understand the spread of diseases.
(The expectation is over the coin flips made for the edges.) Denote this expected value for
a set S by f(S).
There is a natural greedy algorithm for influence maximization, where at each iteration
we increase the function f as much as possible.
Greedy Algorithm for Influence Maximization
S = ∅
for i = 1, 2, . . . , k: do
add to S the vertex v maximizing f(S ∪ {v})
return S
The same analysis we used for set coverage can be used to prove that this greedy algorithm
is a (1 − (1 − 1/k)^k)-approximation algorithm for influence maximization. The greedy algo-
rithm's guarantee holds for every function f that is "monotone" and "submodular," and the
function f above is one such example (it is basically a convex combination of set coverage
functions). See Problem Set #4 for details.
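The cascade model and the greedy rule can be sketched in Python (my own illustration; the lecture's analysis assumes exact evaluation of f, and Monte Carlo estimation is a standard practical stand-in for it):

```python
import random

def cascade_size(graph, seeds, p, rng):
    """Simulate one cascade: each newly active v flips a coin for each out-edge (v, w)."""
    active, frontier = set(seeds), list(seeds)
    while frontier:
        v = frontier.pop()
        for w in graph.get(v, []):
            if w not in active and rng.random() < p:
                active.add(w)
                frontier.append(w)
    return len(active)

def greedy_influence(graph, k, p, trials=200, seed=0):
    """Greedy seed selection, with f(S) estimated by averaging simulated cascades."""
    rng = random.Random(seed)
    f = lambda s: sum(cascade_size(graph, s, p, rng) for _ in range(trials)) / trials
    S = set()
    for _ in range(k):
        best = max((v for v in graph if v not in S), key=lambda v: f(S | {v}))
        S.add(best)
    return S
```

On the Figure 6 graph with p = 1 (deterministic spreading), seeding {a} activates all of a, b, c, d, and the greedy algorithm with k = 1 picks a.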
CS261: A Second Course in Algorithms
Lecture #16: The Traveling Salesman Problem∗
Tim Roughgarden†
February 25, 2016
1 The Traveling Salesman Problem (TSP)
In this lecture we study a famous computational problem, the Traveling Salesman Problem
(TSP). For roughly 70 years, the TSP has served as the best kind of challenge problem, mo-
tivating many different general approaches to coping with NP-hard optimization problems.
For example, George Dantzig (who you’ll recall from Lecture #10) spent a fair bit of his time
in the 1950s figuring out how to use linear programming as a subroutine to solve ever-bigger
instances of TSP. Well before the development of NP-completeness in 1971, experts were
well aware that the TSP is a “hard” problem in some sense of the word.
So what's the problem? The input is a complete undirected graph G = (V, E), with a
nonnegative cost ce ≥ 0 for each edge e ∈ E. By a TSP tour, we mean a simple cycle that
visits each vertex exactly once. (Not to be confused with an Euler tour, which uses each
edge exactly once.) The goal is to compute the TSP tour with the minimum total cost. For
example, in Figure 1, the optimal objective function value is 13.
The TSP gets its name from a silly story about a salesperson who has to make a number
of stops, and wants to visit them all in an optimal order. But the TSP definitely comes up in
real-world scenarios. For example, suppose a number of tasks need to get done, and between
two tasks there is a setup cost (from, say, setting up different equipment or locating different
workers). Choosing the order of operations so that the tasks get done as soon as possible is
exactly the TSP. Or think about a scenario where a disk has a number of outstanding read
requests; figuring out the optimal order in which to serve them again corresponds to TSP.
Figure 1: Example TSP graph. Best TSP tour is a-c-b-d-a with cost 13.
The TSP is hard, even to approximate.
Theorem 1.1 If P ≠ NP, then there is no α-approximation algorithm for the TSP (for
any α).
Recall that an α-approximation algorithm for a minimization problem runs in polynomial
time and always returns a feasible solution with cost at most α times the minimum possible.
Proof of Theorem 1.1: We prove the theorem using a reduction from the Hamiltonian cycle
problem. The Hamiltonian cycle problem is: given an undirected graph, does it contain a
simple cycle that visits every vertex exactly once? For example, the graph in Figure 2 does
not have a Hamiltonian cycle.1 This problem is NP-complete, and usually one proves it in
a course like CS154 (e.g., via a reduction from 3SAT).
Figure 2: Example graph without Hamiltonian cycle.
¹While it's generally difficult to convince someone that a graph has no Hamiltonian cycle, in this case there is a slick argument: color the four corners and the center vertex green, and the other four vertices red. Then every closed walk alternates green and red vertices, so a Hamiltonian cycle would have the same number of green and red vertices (impossible, since there are 9 vertices).
For the reduction, we need to show how to use a good TSP approximation algorithm to
solve the Hamiltonian cycle problem. Given an instance G = (V, E) of the latter problem,
we transform it into an instance G′ = (V′, E′, c) of TSP, where:

• V′ = V;

• E′ is all edges (so (V′, E′) is the complete graph);

• for each e ∈ E′, set ce = 1 if e ∈ E, and ce > α · n if e ∉ E,

where n is the number of vertices and α is the approximation factor that we want to rule out.
For example, in Figure 2, all the edges of the grid get a cost of 1, and all the missing edges
get a cost greater than αn.
The key point is that there is a one-to-one correspondence between the Hamiltonian
cycles of G and the TSP tours of G0 that use only unit-cost edges. Thus:
(i) If G has a Hamiltonian cycle, then there is a TSP tour with total cost n.
(ii) If G has no Hamiltonian cycle, then every TSP tour has cost larger than αn.
Now suppose there were an α-approximation algorithm A for the TSP. We could use A to
solve the Hamiltonian cycle problem: given an instance G of the problem, run the reduction
above and then invoke A on the produced TSP instance. Since there is more than an α
factor gap between cases (i) and (ii) and A is an α-approximation algorithm, the output of
A indicates whether or not G is Hamiltonian. (If yes, then it must return a TSP tour with
cost at most αn; if no, then it can only return a TSP tour with cost bigger than αn.) ꢀ
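The construction in the proof is easy to code (a sketch of my own; vertices are 0, …, n−1, and αn + 1 is one concrete choice of a non-edge cost exceeding αn):

```python
def hamiltonian_to_tsp_costs(n, edges, alpha):
    """Build the TSP cost matrix from a Hamiltonian cycle instance.

    Edges of G cost 1; non-edges cost alpha*n + 1, so any alpha-approximate
    TSP tour reveals whether G has a Hamiltonian cycle (cost n vs. > alpha*n).
    """
    edge_set = {frozenset(e) for e in edges}
    big = alpha * n + 1
    return [[0 if i == j else (1 if frozenset((i, j)) in edge_set else big)
             for j in range(n)] for i in range(n)]
```

For a path graph on 3 vertices with edges (0, 1) and (1, 2) and α = 2, the present edges get cost 1 and the missing edge (0, 2) gets cost 7 = 2 · 3 + 1.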
2 Metric TSP

2.1 Toward a Tractable Special Case
Theorem 1.1 indicates that, to prove anything interesting about approximation algorithms for
the TSP, we need to restrict to a special case of the problem. In the metric TSP, we assume
that the edge costs satisfy the triangle inequality (with cuw ≤ cuv + cvw for all u, v, w ∈ V ).
We previously saw the triangle inequality when studying the Steiner tree problem (Lectures
#13 and #15). The big difference is that in the Steiner tree problem the metric assumption
is without loss of generality (see Exercise Set #7) while in the TSP it makes the problem
significantly easier.²
The metric TSP problem is still NP-hard, as shown by a variant of the proof of Theo-
rem 1.1. We can't use the big edge costs αn because this would violate the triangle inequality.
But if we use edge costs of 2 for edges not in the given Hamiltonian cycle instance G, then
the triangle inequality holds trivially (why?). The optimal TSP tour still has value at most
n when G has a Hamiltonian cycle, and value at least n + 1 when it does not. This shows
that there is no exact polynomial-time algorithm for metric TSP (assuming P ≠ NP). It
does not rule out good approximation algorithms, however. And we'll see next that there
are pretty good approximation algorithms for metric TSP.

²This is of course what we're hoping for, because the general case is impossible to approximate.
2.2 The MST Heuristic
Recall that in approximation algorithm design and analysis, the challenge is to relate the
solution output by an algorithm to the optimal solution. The optimal solution itself is often
hard to get a handle on (it's NP-hard to compute, after all), so one usually resorts to bounds
on the optimal objective function value — quantities that are "only better than optimal."
Here’s a simple lower bound for the TSP, with or without the triangle inequality.
Lemma 2.1 For every instance G = (V, E, c), the minimum-possible cost of a TSP tour is
at least the cost of a minimum spanning tree (MST).
Proof: Removing an edge from the minimum-cost TSP tour yields a spanning tree whose
cost is no larger. The minimum spanning tree can only have smaller cost. ꢀ
Lemma 2.1 motivates using the MST as a starting point for building a TSP tour — if
we can turn the MST into a tour without suffering too much extra cost, then the tour will
be near-optimal. The idea of transforming a tree into a tour should ring some bells — recall
our online (Lecture #13) and offline (Lecture #15) algorithms for the Steiner tree problem.
We’ll reuse the ideas developed for Steiner tree, like doubling and shortcutting, here for the
TSP. The main difference is that while these ideas were used only in the analysis of our
Steiner tree algorithms, to relate the cost of our algorithm’s tree to the minimum-possible
cost, here we’ll use these ideas in the algorithm itself. This is because, in TSP, we have to
output a tour rather than a tree.
MST Heuristic for Metric TSP
compute the MST T of the input G
construct the graph H by doubling every edge of T
compute an Euler tour C of H
// every v ∈ V is visited at least once in C
shortcut repeated occurrences of vertices in C to obtain a TSP tour
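In code, the whole pipeline collapses nicely: shortcutting the Euler tour of the doubled MST visits vertices in depth-first preorder of the tree, so a DFS over the MST yields the tour directly (a sketch of my own, with an O(n²) Prim step; names are illustrative):

```python
def mst_tsp_tour(vertices, cost):
    """2-approximate metric TSP tour via the MST heuristic.

    Shortcutting the Euler tour of the doubled MST visits vertices in
    DFS preorder of the tree, so we compute the MST (Prim, O(n^2)) and
    return its preorder directly.
    """
    vertices = list(vertices)
    root = vertices[0]
    best = {v: (cost(root, v), root) for v in vertices[1:]}
    children = {v: [] for v in vertices}
    while best:
        v = min(best, key=lambda u: best[u][0])
        _, p = best.pop(v)
        children[p].append(v)  # attach v to the tree under p
        for w in best:
            if cost(v, w) < best[w][0]:
                best[w] = (cost(v, w), v)
    tour, stack = [], [root]  # iterative DFS preorder = shortcut Euler tour
    while stack:
        v = stack.pop()
        tour.append(v)
        stack.extend(reversed(children[v]))
    return tour  # the tour closes by returning from tour[-1] to tour[0]

def tour_cost(tour, cost):
    return sum(cost(tour[i], tour[(i + 1) % len(tour)]) for i in range(len(tour)))
```

On four points on a line with cost |i − j| (a metric), the heuristic happens to return an optimal tour of cost 6.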
When we studied the Steiner tree problem, steps 2–4 were used only in the analysis. But
all of these steps, and hence the entire algorithm, are easy to implement in polynomial (even
near-linear) time.³

³Recall from CS161 that there are many fast algorithms for computing an MST, including Kruskal's and Prim's algorithms.
Theorem 2.2 The MST heuristic is a 2-approximation algorithm for the metric TSP.
Proof: We have

cost of our TSP tour ≤ cost of C = Σe∈H ce = 2 Σe∈T ce ≤ 2 · cost of optimal TSP tour,

where the first inequality holds because the edge costs obey the triangle inequality, the
second equation holds because the Euler tour C uses every edge of H exactly once, the third
equation follows from the definition of H, and the final inequality follows from Lemma 2.1.
ꢀ
The analysis of the MST heuristic in Theorem 2.2 is tight — for every constant c < 2,
there is a metric TSP instance such that the MST heuristic outputs a tour with cost more
than c times that of an optimal tour (Exercise Set #8).
Can we do better with a different algorithm? This is the subject of the next section.
2.3 Christofides's Algorithm
Why were we off by a factor of 2 in the MST heuristic? Because we doubled every edge of
the MST T. Why did we double every edge? Because we need an Eulerian graph, to get
an Euler tour that we can shortcut down to a TSP tour. But perhaps it’s overkill to double
every edge of the MST. Can we augment the MST T to get an Eulerian graph without paying
the full cost of an optimal solution?
The answer is yes, and the key is the following slick lemma. It gives a second lower bound
on the cost of an optimal TSP tour, complementing Lemma 2.1.
Lemma 2.3 Let G = (V, E) be a metric TSP instance. Let S ⊆ V be an even subset of
vertices and M a minimum-cost perfect matching of the (complete) graph induced by S. Then

Σe∈M ce ≤ (1/2) · OPT,

where OPT denotes the cost of an optimal TSP tour.
Proof: Fix S. Let C* denote an optimal TSP tour. Since the edges obey the triangle
inequality, we can shortcut C* to get a tour CS of S that has cost at most OPT. Since |S| is
even, CS is a (simple) cycle of even length (Figure 3). CS is the union of two disjoint perfect
matchings (alternate coloring the edges of CS red and green). Since the sum of the costs of
these matchings is that of CS (which is at most OPT), the cheaper of these two matchings
has cost at most OPT/2. The minimum-cost perfect matching of S can only be cheaper. ꢀ
Figure 3: CS is a simple cycle of even length representing the union of two disjoint perfect matchings (red and green).
Lemma 2.3 brings us to Christofides’s algorithm, which differs from the MST heuristic
only in substituting a perfect matching computation in place of the doubling step.
Christofides’s Algorithm
compute the MST T of the input G
compute the set W of vertices with odd degree in T
compute a minimum-cost perfect matching M of W
construct the graph H by adding M to T
compute an Euler tour C of H
// every v ∈ V is visited at least once in C
shortcut repeated occurrences of vertices in C to obtain a TSP tour
In the second step, the set W always has even size. (The sum of the vertex degrees of a graph
is double the number of edges, so there cannot be an odd number of odd-degree vertices.) In
the third step, note that the relevant matching instance is the graph induced by W, which
is the complete graph on W. Since this is not a bipartite graph (at least if |W| ≥ 4), this is
an instance of nonbipartite matching. We haven’t covered any algorithms for this problem,
but we mentioned in Lecture #6 that the ideas behind the Hungarian algorithm (Lecture
#5) can, with additional ideas, be extended to also solve the nonbipartite case in polynomial
time. In the fourth step, there may be edges that appear in both T and M. The graph H
contains two copies of such edges, which is not a problem for us. The last two steps are
the same as in the MST heuristic. Note that the graph H is indeed Eulerian — adding the
matching M to T increases the degree of each vertex v ∈ W by exactly one (and leaves other
degrees unaffected), so T + M has all even degrees.4 This algorithm can be implemented in
polynomial time — the overall running time is dominated by the matching computation in
the third step.
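The new ingredient relative to the MST heuristic is the matching step. As a stand-in for Edmonds' blossom algorithm (which we haven't covered), here is a brute-force sketch of my own, exact but only practical for small |W|:

```python
def matching_cost(matching, cost):
    return sum(cost(u, v) for u, v in matching)

def min_cost_perfect_matching(odd, cost):
    """Exact minimum-cost perfect matching on an even-size vertex list.

    Brute-force recursion: match the first vertex with each candidate and
    recurse. Real implementations use Edmonds' blossom algorithm, which
    solves nonbipartite matching in polynomial time.
    """
    if not odd:
        return []
    first, rest = odd[0], odd[1:]
    best = None
    for i, mate in enumerate(rest):
        sub = min_cost_perfect_matching(rest[:i] + rest[i + 1:], cost)
        candidate = [(first, mate)] + sub
        if best is None or matching_cost(candidate, cost) < matching_cost(best, cost):
            best = candidate
    return best

def odd_degree_vertices(tree_edges):
    """The set W of odd-degree vertices of T; |W| is always even."""
    deg = {}
    for u, v in tree_edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return sorted(v for v, d in deg.items() if d % 2 == 1)
```

For a path-shaped tree on vertices 0–3, only the two endpoints have odd degree, and with line-metric costs the cheapest matching on all four vertices pairs up adjacent vertices.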
Theorem 2.4 Christofides's algorithm is a 3/2-approximation algorithm for the metric TSP.
⁴And as usual, H is connected because T is connected.
Proof: We have

cost of our TSP tour ≤ cost of C = Σe∈H ce = Σe∈T ce + Σe∈M ce ≤ OPT + OPT/2 = (3/2) · cost of optimal TSP tour,

where the first inequality holds because the edge costs obey the triangle inequality, the
second equation holds because the Euler tour C uses every edge of H exactly once, the third
equation follows from the definition of H, and the final inequality follows from Lemma 2.1
(Σe∈T ce ≤ OPT) and Lemma 2.3 (Σe∈M ce ≤ OPT/2). ꢀ
The analysis of Christofides's algorithm in Theorem 2.4 is tight — for every constant
c < 3/2, there is a metric TSP instance such that the algorithm outputs a tour with cost more
than c times that of an optimal tour (Exercise Set #8).
Christofides's algorithm is from 1976. Amazingly, to this day we still don't know whether
or not there is an approximation algorithm for metric TSP better than Christofides's algo-
rithm. It's possible that no such algorithm exists (assuming P ≠ NP, since if P = NP the
problem can be solved optimally in polynomial time), but it is widely conjectured that 4/3 (if
not better) is possible. This is one of the biggest open questions in the field of approximation
algorithms.
3 Asymmetric TSP
Figure 4: Example ATSP graph. Note that edges going in opposite directions need not have the same cost.
We conclude with an approximation algorithm for the asymmetric TSP (ATSP) problem,
the directed version of TSP. That is, the input is a complete directed graph, with an edge
7
in each direction between each pair of vertices, and a nonnegative cost c ≥ 0 for each edge
e
(Figure 4). The edges going in opposite directions between a pair of vertices need not have
the same cost.5 The “normal” TSP is equivalent to the special case in which opposite edges
(between the same pair of vertices) have the same cost. The goal is to compute the directed
TSP tour — a simple directed cycle, visiting each vertex exactly once — with minimum-
possible cost. Since the ATSP includes the TSP as a special case, it can only harder (and
appears to be strictly harder). Thus we’ll continue to assume that the edge costs obey the
triangle inequality (cuw ≤ c +c for every u, v, w ∈ V ) — note that this assumption makes
uv
vw
perfect sense in directed graphs as well as undirected graphs.
Our high-level strategy mirrors that in our metric TSP approximation algorithms.

1. Construct a not-too-expensive Eulerian directed graph H.

2. Shortcut H to get a directed TSP tour; by the triangle inequality, the cost of this tour
   is at most Σ_{e∈H} ce.
Recall that a directed graph H is Eulerian if (i) it is strongly connected (i.e., for every v, w
there is a directed path from v to w and also a directed path from w to v); and (ii) for
every vertex v, the in-degree of v in H equals its out-degree in H. Every directed Eulerian
graph admits a directed Euler tour — a directed closed walk that uses every (directed) edge
exactly once. Assumptions (i) and (ii) are clearly necessary for a graph to have a directed
Euler tour (since such a tour enters and exits each vertex the same number of times). The proof of
sufficiency is basically the same as in the undirected case (cf. Exercise Set #7).
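The sufficiency proof is constructive, via Hierholzer's algorithm. Here is a minimal sketch for the directed case (it assumes the input edge list really is Eulerian; the two-cycles-through-vertex-0 graph in the usage line is a made-up example):

```python
from collections import defaultdict

def directed_euler_tour(edges):
    """Hierholzer's algorithm: return a closed walk using every directed
    edge exactly once. Assumes the graph is Eulerian (strongly connected,
    in-degree = out-degree at every vertex)."""
    adj = defaultdict(list)
    for v, w in edges:
        adj[v].append(w)
    start = edges[0][0]
    stack, tour = [start], []
    while stack:
        v = stack[-1]
        if adj[v]:                  # unused outgoing edge: follow it
            stack.append(adj[v].pop())
        else:                       # dead end: vertex joins the tour
            tour.append(stack.pop())
    return tour[::-1]               # closed walk: first vertex == last

tour = directed_euler_tour([(0, 1), (1, 2), (2, 0), (0, 3), (3, 4), (4, 0)])
```

Each edge is popped from its adjacency list exactly once, so the walk uses every edge exactly once; backtracking off dead ends stitches the sub-cycles together.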
The big question is how to implement the first step of constructing a low-cost Eulerian
graph. In the metric case, we used the minimum spanning tree as a starting point. In the
directed case, we’ll use a different subroutine, for computing a minimum-cost cycle cover.
Figure 5: Example cycle cover of vertices.
A cycle cover of a directed graph is a collection C1, . . . , Ck of directed cycles, each
with at least two vertices, such that each vertex v ∈ V appears in exactly one of the cycles.
(That is, the cycles partition the vertex set.) See Figure 5. Note that directed TSP tours
[5] Recalling the motivating scenario of scheduling the order of operations to minimize the overall setup
time, it's easy to think of cases where the setup time between task i and task j is not the same when the
order of i and j is reversed.
are exactly the cycle covers with k = 1. Thus, the minimum-cost cycle cover can only be
cheaper than the minimum-cost TSP tour.
Lemma 3.1 For every instance G = (V, E, c) of ATSP, the minimum-possible cost of a
directed TSP tour is at least that of a minimum-cost cycle cover.
The minimum-cost cycle cover of a directed graph can be computed in polynomial time. This
is not obvious, but as a student in CS261 you’re well-equipped to prove it (via a reduction
to minimum-cost bipartite perfect matching, see Problem Set #4).
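The reduction is worth seeing concretely (this is our rendering of it, not the problem set's): view each vertex as the tail of its one outgoing edge and match it to the head it points to; a min-cost perfect matching between a "tail" copy and a "head" copy of V, with the diagonal forbidden, is exactly a min-cost cycle cover. A sketch using SciPy's assignment solver (the 4-vertex cost matrix in the usage line is made up):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def min_cost_cycle_cover(c):
    """c[v][w] = cost of directed edge (v, w). A perfect matching choosing
    exactly one outgoing and one incoming edge per vertex is a union of
    vertex-disjoint directed cycles, i.e., a cycle cover."""
    c = np.array(c, dtype=float)
    np.fill_diagonal(c, c.sum() + 1.0)   # forbid self-loops (1-cycles)
    rows, cols = linear_sum_assignment(c)
    return [(int(v), int(w)) for v, w in zip(rows, cols)]

# Two cheap 2-cycles {0,1} and {2,3}:
edges = min_cost_cycle_cover([[0, 1, 9, 9], [1, 0, 9, 9], [9, 9, 0, 1], [9, 9, 1, 0]])
```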
Approximation Algorithm for ATSP

initialize F = ∅
initialize G to the input graph
while G has at least 2 vertices do
    compute a minimum-cost cycle cover C1, . . . , Ck of the current G
    add to F the edges in C1, . . . , Ck
    for i = 1, 2, . . . , k do
        delete from G all but one vertex from Ci
compute a directed Euler tour C of H = (V, F)    // H is Eulerian, see discussion below
shortcut repeated occurrences of vertices on C to obtain a TSP tour
For the last two steps of the algorithm to make sense, we need the following claim.
Claim: The graph H = (V, F) constructed by the algorithm is Eulerian.
Proof: Note that H = (V, F) is the union of all the cycle covers computed over all iterations
of the while loop. We prove two invariants of (V, F) over these iterations.
First, the in-degree and out-degree of a vertex are always the same in (V, F). This is
trivial at the beginning, when F = ∅. When we add in the first cycle cover to F, every vertex
then has in-degree and out-degree equal to 1. The vertices that get deleted never receive
any more incoming or outgoing edges, so they have the same in-degree and out-degree at the
conclusion of the while loop. The undeleted vertices participate in the cycle cover computed
in the second iteration; when this cycle cover is added to H, the in-degree and out-degree
of each vertex in (V, F) increases by 1 (from 1 to 2). And so on. At the end, the in- and
out-degree of a vertex v is exactly the number of while loop iterations in which it participated
(before getting deleted).
Second, at all times, for all vertices v that have been deleted so far, there is a vertex w
that has not yet been deleted such that (V, F) contains both a directed path from v to w
and from w to v. That is, in (V, F), every deleted vertex can reach and be reached by some
undeleted vertex.
To see why this second invariant holds, consider the first iteration. Every deleted vertex
v belongs to some cycle Ci of the cycle cover, and some vertex w on Ci was left undeleted. Ci
contains a directed path from v to w and vice versa, and F contains all of Ci. By the same
reasoning, every vertex v that was deleted in the second iteration has a path in (V, F) to and
from some vertex w that was not deleted. A vertex u that was deleted in the first iteration
has, at worst, paths in (V, F) to and from a vertex v deleted in the second iteration; stitching
these paths together with the paths from v to an undeleted vertex w, we see that (V, F)
contains a path from u to this undeleted vertex w, and vice versa. In the final iteration of
the while loop, the cycle cover contains only one cycle C. (Otherwise, at least 2 vertices
would not be deleted and the while loop would continue.) The edges of C allow every vertex
remaining in the final iteration to reach every other such vertex. Since every deleted vertex
can reach and be reached by the vertices remaining in the final iteration, the while loop
concludes with a graph (V, F) where everybody can reach everybody (i.e., which is strongly
connected). ∎
The claim implies that our ATSP algorithm is well defined. We now give the easy
argument bounding the cost of the tour it produces.
Lemma 3.2 In every iteration of the algorithm’s main while loop, there exists a directed
TSP tour of the current graph G with cost at most OPT, the minimum cost of a TSP tour
in the original input graph.
Proof: Shortcutting the optimal TSP tour for the original graph down to one on the current
graph G yields a TSP tour with cost at most OPT (using the triangle inequality). ∎
By Lemmas 3.1 and 3.2:
Corollary 3.3 In every iteration of the algorithm’s main while loop, the cost of the edges
added to F is at most OPT.
Lemma 3.4 There are at most log2 n iterations of the algorithm’s main while loop.
Proof: Recall that every cycle in a cycle cover has, by definition, at least two vertices. The
algorithm deletes all but one vertex from each cycle in each iteration, so it deletes at least
one vertex for each vertex that remains. Since the number of remaining vertices drops by a
factor of at least 2 in each iteration, there can only be log2 n iterations. ∎
Corollary 3.3 and Lemma 3.4 immediately give the following.
Theorem 3.5 The ATSP algorithm above is a log2 n-approximation algorithm.
This algorithm is from the early 1980s, and progress since then has been modest. The
best-known approximation algorithm for ATSP has an approximation ratio of O(log n/ log log n),
and even this improvement is only from 2010! Another of the biggest open questions in all of
approximation algorithms is: is there a constant-factor approximation algorithm for ATSP?
CS261: A Second Course in Algorithms
Lecture #17: Linear Programming and Approximation
Algorithms∗
Tim Roughgarden†
March 1, 2016
1  Preamble
Recall that a key ingredient in the design and analysis of approximation algorithms is getting
a handle on the optimal solution, to compare it to the solution returned by an algorithm. Since
the optimal solution itself is often hard to understand (it’s NP-hard to compute, after all),
this generally entails bounds on the optimal objective function value — quantities that are
“only better than optimal.” If the output of an algorithm is within an α factor of this bound,
then it is also within an α factor of optimal.
So where do such bounds on the optimal objective function value come from? Last
week, we saw a bunch of ad hoc examples, including the maximum job size and the average
load in the makespan-minimization problem, and the minimum spanning tree for the metric
TSP. Today we’ll see how to use linear programs and their duals to generate systematically
such bounds. Linear programming and approximation algorithms are a natural marriage
—
for example, recall that dual feasible solutions are by definition bounds on the best-
possible (primal) objective function value. We’ll see that some approximation algorithms
explicitly solve a linear program; some use linear programming to guide the design of an
algorithm without ever actually solving a linear program to optimality; and some use linear
programming duality to analyze the performance of a natural (non-LP-based) algorithm.
2  A Greedy Algorithm for Set Cover (Without Costs)
We warm up with an algorithm that builds on our greedy algorithm for set coverage (Lecture #15)
and doesn't require linear programming at all. In the set cover problem, the input is a list
S1, . . . , Sm ⊆ U of sets, each specified as a list of elements from a ground set U. The goal
is to pick as few sets as possible, subject to the constraint that their union is all of U (i.e., that
they form a set cover). For example, in Figure 1, the optimal solution consists of picking
the blue sets.
Figure 1: Example set cover instance. The optimal solution consists of picking the blue
sets.
In the set coverage problem (Lecture #15), the input included a parameter k. The
hard constraint was to pick at most k sets, and subject to this the goal was to cover as
many elements as possible. Here, the constraint and the objective are reversed: the hard
constraint is to cover all elements and, subject to this, to use as few sets as possible. Potential
applications of the set cover problem are the same as for set coverage, and which problem
is a better fit for reality depends on the context. For example, if you are choosing where to
build fire stations, you can imagine that it’s a hard constraint to have reasonable coverage
of all of the neighborhoods of a city.
The set cover problem is NP-hard, for essentially the same reasons as the set coverage
problem. There is again a tension between the size of a set and how “redundant” it is with
other sets that might get chosen anyway.
Turning to approximation algorithms, we note that the greedy algorithm for set coverage
makes perfect sense for set cover. The only difference is in the stopping condition — rather
than stopping after k iterations, the algorithm stops when it has found a set cover.
Greedy Algorithm for Set Cover (No Costs)

C = ∅
while C not a set cover do
    add to C the set Si that covers the largest number of new elements
    // elements covered by previously chosen sets don't count
return C
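In Python the loop is a few lines (one assumption on our part: each set is given as a Python set, and the sets do cover the ground set). On the three-set instance below, greedy uses 3 sets even though 2 suffice:

```python
def greedy_set_cover(sets, universe):
    """Repeatedly pick the set covering the most still-uncovered elements.
    Assumes the sets collectively cover the universe."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best = max(sets, key=lambda s: len(uncovered & s))
        chosen.append(best)
        uncovered -= best
    return chosen

# Greedy grabs {1,2,3,4} first and then needs two more sets,
# although {1,2,5} and {3,4,6} alone would cover everything.
cover = greedy_set_cover([{1, 2, 3, 4}, {1, 2, 5}, {3, 4, 6}], {1, 2, 3, 4, 5, 6})
```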
The same bad examples from Lecture #15 show that the greedy algorithm is not in
general optimal. In the first example of that lecture, the greedy algorithm uses 3 sets even
though 2 are enough; in the second example, it uses 5 sets even though 3 are enough. (And
there are worse examples than these.) We next prove an approximation guarantee for the
algorithm.
Theorem 2.1 The greedy algorithm is a ln n-approximation algorithm for the set cover prob-
lem, where n = |U| is the size of the ground set.
Proof: We can usefully piggyback on our analysis of the greedy algorithm for the set coverage
problem (Lecture #15). Consider a set cover instance, and let OPT denote the size of the
smallest set cover. The key observation is: the current solution after OPT iterations of the
set cover greedy algorithm is the same as the output of the set coverage greedy algorithm
with a budget of k = OPT. (In both cases, in every iteration, the algorithm picks the set
that covers the maximum number of new elements.) Recall from Lecture #15 that the greedy
algorithm is a (1 − 1/e)-approximation algorithm for set coverage. Since there is a collection
of OPT sets covering all |U| elements, the greedy algorithm, after OPT iterations, will have
covered at least (1 − 1/e)|U| elements, leaving at most |U|/e elements uncovered. Iterating,
every OPT iterations of the greedy algorithm will reduce the number of uncovered elements
by a factor of e. Thus all elements are covered within OPT · log_e n = OPT · ln n iterations.
Thus the number of sets chosen by the greedy algorithm is at most ln n times the size of an
optimal set cover, as desired. ∎
3  A Greedy Algorithm for Set Cover (with Costs)
It’s easy to imagine scenarios where the different sets of a set cover instance have different
costs. (E.g., if sets model the skills of potential hires, different positions/seniority may
command different salaries.) In the general version of the set cover problem, each set Si also
has a nonnegative cost ci ≥ 0. Since there were no costs in the set coverage problem, we can
no longer piggyback on our analysis there — we'll need a new idea.
The greedy algorithm is easy to extend to the general case. If one set costs twice as much
as another, then to be competitive, it should cover at least twice as many elements. This
idea translates to the following algorithm.
Greedy Algorithm for Set Cover (With Costs)

C = ∅
while C not a set cover do
    add to C the set Si with the minimum ratio

        ri = ci / (# newly covered elements)        (1)

return C
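Assuming sets are given as Python sets with a parallel list of costs (our convention, not the notes'), the ratio rule translates directly:

```python
def greedy_weighted_set_cover(sets, costs, universe):
    """Repeatedly add the set of minimum ratio (1):
    cost / (number of newly covered elements). Assumes a cover exists."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best = min(
            (i for i, s in enumerate(sets) if uncovered & s),
            key=lambda i: costs[i] / len(uncovered & sets[i]),
        )
        chosen.append(best)
        uncovered -= sets[best]
    return chosen

# The cheap singleton {1} has ratio 0.1, beating {1,2,3}'s ratio 1/3;
# after that the big set's ratio only gets worse, so greedy takes all
# three singletons (total cost 0.9 < 1.0).
picks = greedy_weighted_set_cover([{1, 2, 3}, {1}, {2}, {3}], [1.0, 0.1, 0.4, 0.4], {1, 2, 3})
```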
Note that if all of the ci’s are identical, then we recover the previous greedy algorithm —
in this case, minimizing the ratio is equivalent to maximizing the number of newly covered
elements. In general, the ratio is the "average cost per newly covered element," and it makes
sense to greedily minimize this.
The best-case scenario is that the approximation guarantee for the greedy algorithm does
not degrade when we allow arbitrary set costs. This is indeed the case.
Theorem 3.1 The greedy algorithm is a ≈ ln n-approximation algorithm for the general set
cover problem (with costs), where n = |U| is the size of the ground set.1
To prove Theorem 3.1, the first order of business is to understand how to make use of
the greedy nature of the algorithm. The following simple lemma, reminiscent of a lemma in
Lecture #15 for set coverage, addresses this point.
Lemma 3.2 Suppose that the current greedy solution covers ℓ elements of the set Si. Then
the next set chosen by the algorithm has ratio at most

    ci / (|Si| − ℓ).        (2)

Indeed, choosing the set Si would attain the ratio in (2); the ratio of the set chosen by the
greedy algorithm can only be smaller.
For every element e ∈ U, define
qe = ratio of the first set chosen by the greedy algorithm that covers e.
Since the greedy algorithm terminates with a set cover, every element has a well-defined
q-value.2 See Figure 2 for a concrete example.
Figure 2: Example set with q-value of the elements.
[1] Inspection of the proof shows that the approximation ratio is ≈ ln s, where s = maxi |Si| is the maximum
size of an input set.
[2] The notation is meant to invoke the q-values in our online bipartite matching analysis (Lecture #13);
as we'll see, something similar is going on here.
Corollary 3.3 For every set Si, the jth element e of Si to be covered by the greedy algorithm
satisfies

    qe ≤ ci / (|Si| − (j − 1)).        (3)
Corollary 3.3 follows immediately from Lemma 3.2 in the case where the elements of Si are
covered one-by-one (with j − 1 playing the role of ℓ, for each j). In general, several elements
of Si might be covered at once. (E.g., the greedy algorithm might actually pick Si itself.) But
in this case the corollary is only "more true" — if the jth element is covered as part of a batch,
then the number of already-covered elements of Si before the current selection was j − 1 or less. For
example, in Figure 2, Corollary 3.3 only asserts that the q-values of the largest set are at
most 1/3, 1/2, and 1, when in fact all are only 1/3. Similarly, for the last set chosen, Corollary 3.3
only guarantees that the q-values are at most 1/2 and 1, while in fact they are 1/2 and 1.
We can translate Corollary 3.3 into a bound on the sum of the q-values of the elements
of a set Si:
    Σ_{e∈Si} qe ≤ ci/|Si| + ci/(|Si| − 1) + · · · + ci/2 + ci/1
                ≈ ci ln |Si|        (4)
                ≤ ci ln n,          (5)

where n = |U| is the ground set size.[3]
We also have

    Σ_{e∈U} qe = cost of the greedy set cover.        (6)

This identity holds inductively at all times. (If e has not been covered yet, then we define
qe = 0.) Initially, both sides are 0. When a new set Si is chosen by the greedy algorithm,
the right-hand side goes up by ci. The left-hand side also increases, because all of the newly
covered elements receive a q-value (equal to the ratio of the set Si), and this increase is

    ri · (# of newly covered elements) = ci.
(Recall the definition (1) of the ratio.)
[3] Our estimate Σ_{j=1}^{|Si|} 1/j ≈ ln |Si| in (4), which follows by approximating the sum by an integral, is actually
off by an additive constant less than 1 (known as "Euler's constant"). We ignore this additive constant for
simplicity.

Proof of Theorem 3.1: Let {S∗1, . . . , S∗k} denote the sets of an optimal set cover, and OPT
its cost. We have
    cost of the greedy set cover = Σ_{e∈U} qe
                                 ≤ Σ_{i=1}^{k} Σ_{e∈S∗i} qe
                                 ≤ Σ_{i=1}^{k} c(S∗i) ln n
                                 = OPT · ln n,

where the first equation is (6), the first inequality follows because S∗1, . . . , S∗k form a set cover
(each e ∈ U is counted at least once), and the second inequality follows from (5). This completes
the proof. ∎
Our analysis of the greedy algorithm is tight. To see this, let U = {1, 2, . . . , n}, S0 = U
with c0 = 1 + ε for small ε, and Si = {i} with cost ci = 1/i for i = 1, 2, . . . , n. The optimal
solution (S0) has cost 1 + ε. The greedy algorithm chooses Sn, Sn−1, . . . , S1 (why?), for a
total cost of Σ_{i=1}^{n} 1/i ≈ ln n.
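This instance is small enough to simulate. A self-contained sketch with n = 5 and ε = 1/100 (made-up values), re-implementing the ratio rule (1) inline and using exact rational arithmetic to avoid floating-point tie-breaking artifacts:

```python
from fractions import Fraction

n, eps = 5, Fraction(1, 100)
sets = [set(range(1, n + 1))] + [{i} for i in range(1, n + 1)]    # S0, S1..Sn
costs = [1 + eps] + [Fraction(1, i) for i in range(1, n + 1)]

uncovered, picks = set(range(1, n + 1)), []
while uncovered:
    best = min(
        (i for i, s in enumerate(sets) if uncovered & s),
        key=lambda i: costs[i] / len(uncovered & sets[i]),
    )
    picks.append(best)
    uncovered -= sets[best]

# Greedy takes the singletons Sn, ..., S1 and never the cheap cover S0:
print(picks)                            # [5, 4, 3, 2, 1]
print(sum(costs[i] for i in picks))     # 137/60 = H_5, versus OPT = 1 + ε
```

At every step the big set's ratio (1 + ε)/|uncovered| loses, just barely, to the cheapest remaining singleton's ratio 1/|uncovered|.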
More generally, the approximation factor of ≈ ln n cannot be beaten by any polynomial-
time algorithm, no matter how clever (under standard complexity assumptions). In this
sense, the greedy algorithm is optimal for the set cover problem.
4  Interpretation via Linear Programming Duality
Our proof of Theorem 3.1 is reasonably natural — using the greedy nature of the algorithm
to prove the easy Lemma 3.2 and then compiling the resulting upper bounds via (5) and (6)
— but it still seems a bit mysterious in hindsight. How would one come up with this type
of argument for some other problem?
We next re-interpret the proof of Theorem 3.1 through the lens of linear programming
duality. With this interpretation, the proof becomes much more systematic. Indeed, it
follows exactly the same template that we already used in Lecture #13 to analyze the
WaterLevel algorithm for online bipartite matching.
To talk about a dual, we need a primal. So consider the following linear program (P):

    min  Σ_{i=1}^{m} ci xi

    subject to:
        Σ_{i : e∈Si} xi ≥ 1    for all e ∈ U
        xi ≥ 0                 for all Si.
The intended semantics is for xi to be 1 if the set Si is chosen in the set cover, and 0
otherwise.[4] In particular, every set cover corresponds to a 0-1 solution to (P) with the same
objective function value, and conversely. For this reason, we call (P) a linear programming
relaxation of the set cover problem — it includes all of the feasible solutions to the set cover
instance (with the same cost), in addition to other (fractional) feasible solutions. Because
the LP relaxation minimizes over a superset of the feasible set covers, its optimal objective
function value ("fractional OPT") can only be smaller than that of a minimum-cost set cover
("OPT"):

    fractional OPT ≤ OPT.
We’ve seen a couple of examples of LP relaxations that are guaranteed to have optimal
-1 solutions — for the minimum s-t cut problem (Lecture #8) and for bipartite matching
0
(Lecture #9). Here, because the set cover problem is NP-hard and the linear programming
relaxation can be solved in polynomial time, we don’t expect the optimal LP solution to
always be integral. (Whenever we get lucky and the optimal LP solution is integral, it’s
handing us the optimal set cover on a silver platter.) It’s useful to see a concrete example of
this. In Figure 3, the ground set has 3 elements and the sets are the subsets with cardinality 2.
All costs are 1. The minimum cost of a set cover is clearly 2 (no set covers everything). But
1
setting xi = for every set yields a feasible fractional solution with the strictly smaller
2
objective function value of .
3
2
Figure 3: Example where all sets have cost 1. The minimum cost of a set cover is 2, but
setting all xi = 1/2 gives a feasible fractional solution with value 3/2.
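The gap is easy to verify with an off-the-shelf LP solver. A sketch of (P) for this 3-element instance using SciPy's `linprog` (the ≥ coverage constraints are negated to fit `linprog`'s ≤ convention):

```python
import numpy as np
from scipy.optimize import linprog

# Ground set {0, 1, 2}; the sets are the three 2-element subsets, cost 1 each.
sets = [{0, 1}, {0, 2}, {1, 2}]
# Coverage constraint sum_{i : e in S_i} x_i >= 1, written as -A x <= -1.
A = np.array([[-1 if e in s else 0 for s in sets] for e in range(3)])
res = linprog(c=[1, 1, 1], A_ub=A, b_ub=[-1, -1, -1], bounds=(0, None))
print(res.x, res.fun)   # x ≈ [0.5, 0.5, 0.5], fractional OPT = 1.5
```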
Deriving the dual (D) of (P) is straightforward, using the standard recipe (Lecture #8):

    max  Σ_{e∈U} pe

    subject to:
        Σ_{e∈Si} pe ≤ ci    for every set Si
        pe ≥ 0              for every e ∈ U.

[4] If you're tempted to also include the constraints xi ≤ 1 for every Si, note that these will hold anyways
at an optimal solution.
Lemma 4.1 If {pe}e∈U is a feasible solution to (D), then

    Σ_{e∈U} pe ≤ fractional OPT ≤ OPT.

The first inequality follows from weak duality — for a minimization problem, every feasible
dual solution gives (by construction) a lower bound on the optimal primal objective function
value — and the second inequality follows because (P) is an LP relaxation of the set cover problem.
Recall the derivation from Section 3 that, for every set Si,

    Σ_{e∈Si} qe ≤ ci ln n;

see (5). Looking at the constraints in the dual (D), the purpose of this derivation is now
transparent:

Lemma 4.2 The vector p := q/ln n is feasible for the dual (D).

As such, the dual objective function value Σ_{e∈U} pe provides a lower bound on the minimum
cost of a set cover (Lemma 4.1).[5] Using the identity (6) from Section 3, we get

    cost of the greedy set cover = ln n · Σ_{e∈U} pe ≤ ln n · OPT.
So, while one certainly doesn’t need to know linear programming to come up with the
greedy set cover algorithm, or even to analyze it, linear programming duality renders the
analysis transparent and reproducible for other problems. We next examine a couple of
algorithms whose design is explicitly guided by linear programming.
5  A Linear Programming Rounding Algorithm for Vertex Cover
Recall from Problem Set #2 the vertex cover problem: the input is an undirected graph
G = (V, E) with a nonnegative cost cv for each vertex v ∈ V, and the goal is to compute
a minimum-cost subset S ⊆ V that contains at least one endpoint of every edge. On
Problem Set #2 you saw that, in bipartite graphs, this problem reduces to a max-flow/min-cut
computation. In general graphs, the problem is NP-hard.

[5] This is entirely analogous to what happened in Lecture #13, for maximum bipartite matching: we
defined a vector q with sum equal to the size of the computed matching, and we scaled up q to get a feasible
dual solution and hence an upper bound on the maximum-possible size of a matching.
The vertex cover problem can be regarded as a special case of the set cover problem. The
elements needing to be covered are the edges. There is one set per vertex v, consisting of the
edges incident to v (with cost cv). Thus, we’re hoping for an approximation guarantee better
than what we’ve already obtained for the general set cover problem. The first question to
ask is: does the greedy algorithm already have a better approximation ratio when we restrict
attention to the special case of vertex cover instances? The answer is no (Exercise Set #9),
so to do better we’ll need a different algorithm.
This section analyzes an algorithm that explicitly solves a linear programming relaxation
of the vertex cover problem (as opposed to using it only for the analysis). The LP
relaxation (P) is the same one as in Section 4, specialized to the vertex cover problem:

    min  Σ_{v∈V} cv xv

    subject to:
        xv + xw ≥ 1    for all e = (v, w) ∈ E
        xv ≥ 0         for all v ∈ V.
There is a one-to-one and cost-preserving correspondence between 0-1 feasible solutions to
this linear program and vertex covers. (We won't care about the dual of this LP relaxation
until the next section.)
Again, because the vertex cover problem is NP-hard, we don't expect the LP relaxation
to always solve to integers. We can reinterpret the example from Section 4 (Figure 3) as a
vertex cover instance — the graph G is a triangle (all unit vertex costs), the smallest vertex
cover has size 2, but setting xv = 1/2 for all three vertices yields a feasible fractional solution
with objective function value 3/2.
LP Rounding Algorithm for Vertex Cover

compute an optimal solution x∗ to the LP relaxation (P)
return S = {v ∈ V : x∗v ≥ 1/2}
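Both steps fit in a few lines. A sketch using SciPy's `linprog` (the triangle from Figure 3 serves as the usage example):

```python
import numpy as np
from scipy.optimize import linprog

def lp_round_vertex_cover(n, edges, costs):
    """Solve the LP relaxation (P), then keep every vertex with x*_v >= 1/2."""
    # Each edge (v, w) contributes x_v + x_w >= 1, i.e., -x_v - x_w <= -1.
    A = np.zeros((len(edges), n))
    for row, (v, w) in enumerate(edges):
        A[row, v] = A[row, w] = -1
    res = linprog(c=costs, A_ub=A, b_ub=-np.ones(len(edges)), bounds=(0, None))
    cover = {v for v in range(n) if res.x[v] >= 0.5 - 1e-9}
    return cover, res.fun   # cover costs at most 2 * fractional OPT

# Triangle with unit costs: LP optimum is (1/2, 1/2, 1/2), value 1.5,
# and rounding keeps all three vertices (cost 3 <= 2 * 1.5).
cover, frac = lp_round_vertex_cover(3, [(0, 1), (0, 2), (1, 2)], [1, 1, 1])
```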
The first step of our new approximation algorithm computes an optimal (fractional)
solution to the LP relaxation (P). The second step transforms this fractional feasible solution
into an integral feasible solution (i.e., a vertex cover). In general, such a procedure is
called a rounding algorithm. The goal is to round to an integral solution without affecting
the objective function value too much.[6] The simplest approach to LP rounding, and a
common heuristic in practice, is to round fractional values to the nearest integer (subject
to feasibility). The vertex cover problem is a happy case where this heuristic gives a good
worst-case approximation guarantee.

[6] This is analogous to our metric TSP algorithms, where we started with an infeasible solution that was
only better than optimal (the MST) and then transformed it into a feasible solution (i.e., a TSP tour) without
suffering too much extra cost.
Lemma 5.1 The LP rounding algorithm above outputs a feasible vertex cover S.
Proof: Since the solution x∗ is feasible for (P), x∗v + x∗w ≥ 1 for every (v, w) ∈ E. Hence, for
every (v, w) ∈ E, at least one of x∗v, x∗w is at least 1/2. Hence at least one endpoint of every
edge is included in the final output S. ∎
The approximation guarantee follows from the fact that the algorithm pays at most twice
what the optimal LP solution x∗ pays.
Theorem 5.2 The LP rounding algorithm above is a 2-approximation algorithm.
Proof: We have

    Σ_{v∈S} cv ≤ Σ_{v∈V} cv (2x∗v) = 2 · fractional OPT ≤ 2 · OPT

(the left-hand side is the cost of the algorithm's solution), where the first inequality holds
because v ∈ S only if x∗v ≥ 1/2, the equation holds because x∗ is
an optimal solution to (P), and the second inequality follows because (P) is an LP relaxation
of the vertex cover problem. ∎
6  A Primal-Dual Algorithm for Vertex Cover
Can we do better than Theorem 5.2? In terms of worst-case approximation ratio, the answer
seems to be no.7 But we can still ask if we can improve the running time. For example,
can we get a 2-approximation algorithm without explicitly solving the linear programming
relaxation? (E.g., for set cover, we used linear programs only in the analysis, not in the
algorithm itself.)
Our plan is to use the LP relaxation (P) and its dual (below) to guide the decisions made
by our algorithm, without ever solving either linear program explicitly (or exactly). The
dual linear program (D) is again just a specialization of that for the set cover problem:

    max  Σ_{e∈E} pe

    subject to:
        Σ_{e∈δ(v)} pe ≤ cv    for every v ∈ V
        pe ≥ 0                for every e ∈ E.

[7] Assuming the "Unique Games Conjecture," a significant strengthening of the P ≠ NP conjecture, there
is no (2 − ε)-approximation algorithm for vertex cover, for any constant ε > 0.
We consider the following algorithm, which maintains a dual feasible solution and iteratively
works toward a vertex cover.

Primal-Dual Algorithm for Vertex Cover

initialize pe = 0 for every edge e ∈ E
initialize S = ∅
while S is not a vertex cover do
    pick an edge e = (v, w) with v, w ∈/ S
    increase pe until the dual constraint corresponding to v or w goes tight
    add the vertex corresponding to the tight dual constraint to S
In the while loop, such an edge (v, w) ∈ E must exist (otherwise S would be a vertex
cover). By a dual constraint “going tight,” we mean that it holds with equality. It is easy to
implement this algorithm, using a single pass over the edges, in linear time. This algorithm
is very natural when you’re staring at the primal-dual pair of linear programs. Without
knowing these linear programs, it’s not clear how one would come up with it.
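A sketch of the single-pass implementation in Python (vertex costs as a dict is our convention): raising pe by the smaller of the two endpoints' remaining slacks makes at least one constraint tight, and we add every endpoint that went tight. If both endpoints go tight at once, this adds both; the invariants below still hold.

```python
def primal_dual_vertex_cover(costs, edges):
    """One pass over the edges: for each still-uncovered edge, raise its
    dual variable p_e until some endpoint's constraint
    sum_{e in delta(v)} p_e <= c_v goes tight; tight vertices join the cover."""
    slack = dict(costs)          # c_v minus the dual value already charged to v
    cover, p = set(), {}
    for v, w in edges:
        if v in cover or w in cover:
            continue             # edge already covered
        p[(v, w)] = min(slack[v], slack[w])   # raise p_e as far as feasible
        slack[v] -= p[(v, w)]
        slack[w] -= p[(v, w)]
        cover.update(u for u in (v, w) if slack[u] == 0)
    return cover
```

On a triangle with unit costs, the first edge makes both its endpoints tight, which already covers the other two edges: the output has cost 2, matching the optimum here.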
For the analysis, we note three invariants of the algorithm.

(P1) p is feasible for (D). This is clearly true at the beginning when pe = 0 for every e ∈ E
(vertex costs are nonnegative), and the algorithm (by definition) never violates a dual
constraint in subsequent iterations.

(P2) If v ∈ S, then Σ_{e∈δ(v)} pe = cv. This is obviously true initially, and we only add a vertex
to S when this condition holds for it.

(P3) If pe > 0 for e = (v, w) ∈ E, then |S ∩ {v, w}| ≤ 2. This is trivially true (whether or
not pe > 0).
Furthermore, by the stopping condition, at termination we have:
(P4) S is a vertex cover.
That is, the algorithm maintains dual feasibility and works toward primal feasibility. The
second and third invariants should be interpreted as an approximate version of the complementary
slackness conditions.[8]

[8] Recall the complementary slackness conditions from Lecture #9: (i) whenever a primal variable is
nonzero, the corresponding dual constraint is tight; (ii) whenever a dual variable is nonzero, the corresponding
primal constraint is tight. Recall that the complementary slackness conditions are precisely the conditions
under which the derivation of weak duality holds with equality. Recall that a primal-dual pair of feasible
solutions are both optimal if and only if the complementary slackness conditions hold.

The second invariant is exactly the first set of complementary
slackness conditions — it says that a primal variable is positive (i.e., v ∈ S) only if
the corresponding dual constraint is tight. The second set of exact complementary slackness
conditions would assert that whenever pe > 0 for e = (v, w) ∈ E, the corresponding primal
constraint is tight (i.e., exactly one of v, w is in S). These conditions will not in general hold
for the algorithm above (if they did, then the algorithm would always solve the problem ex-
actly). They do hold approximately, in the sense that tightness is violated only by a factor
of 2. This is exactly where the approximation factor of the algorithm comes from.
Since the algorithm maintains dual feasibility and approximate complementary slackness
and works toward primal feasibility, it is a primal-dual algorithm, in exactly the same sense
as the Hungarian algorithm for minimum-cost perfect bipartite matching (Lecture #9). The
only difference is that the Hungarian algorithm maintains exact complementary slackness
and hence terminates with an optimal solution, while our primal-dual vertex cover algorithm
only maintains approximate complementary slackness, and for this reason terminates with
an approximately optimal solution.
Theorem 6.1 The primal-dual algorithm above is a 2-approximation algorithm for the ver-
tex cover problem.
Proof: The derivation is familiar from when we derived weak duality (Lecture #8). Letting
S denote the vertex cover returned by the primal-dual algorithm, OPT the minimum cost
of a vertex cover, and “fractional OPT” the optimal objective function value of the LP
relaxation, we have
    Σ_{v∈S} cv = Σ_{v∈S} Σ_{e∈δ(v)} pe
               = Σ_{e=(v,w)∈E} pe · |S ∩ {v, w}|
               ≤ 2 · Σ_{e∈E} pe
               ≤ 2 · fractional OPT
               ≤ 2 · OPT.
The first equation is the first (exact) set of complementary slackness conditions (P2), the
second equation is just a reversal of the order of summation, the first inequality follows from
the approximate version of the second set of complementary slackness conditions (P3), the
second inequality follows from dual feasibility (P1) and weak duality, and the final inequality
follows because (P) is an LP relaxation of the vertex cover problem. This completes the proof.
∎
CS261: A Second Course in Algorithms
Lecture #18: Five Essential Tools for the Analysis of
Randomized Algorithms∗
Tim Roughgarden†
March 3, 2016
1  Preamble
In CS109 and CS161, you learned some tricks of the trade in the analysis of randomized
algorithms, with applications to the analysis of QuickSort and hashing. There's also CS265,
where you'll learn more than you ever wanted to know about randomized algorithms (a
great class; you should take it). In CS261, we build a bridge between what's covered in
CS161 and CS265. Specifically, this lecture covers five essential tools for the analysis of
randomized algorithms. Some you’ve probably seen before (like linearity of expectation and
the union bound) while others may be new (like Chernoff bounds). You will need these
tools in most 200- and 300-level theory courses that you may take in the future, and in other
courses (like in machine learning) as well. We’ll point out some applications in approximation
algorithms, but keep in mind that these tools are used constantly across all of theoretical
computer science.
Recall the standard probability setup. There is a state space Ω; for our purposes, Ω is
always finite, for example corresponding to the coin flip outcomes of a randomized algorithm.
A random variable is a real-valued function X : Ω → R defined on Ω. For example, for a
fixed instance of a problem, we might be interested in the running time or solution quality
produced by a randomized algorithm (as a function of the algorithm’s coin flips). The
expectation of a random variable is just its average value, with the averaging weights given
by a specified probability distribution on Ω:
    E[X] = Σ_{ω∈Ω} Pr[ω] · X(ω).
An event is a subset of Ω. The indicator random variable for an event E ⊆ Ω takes on the
value 1 for ω ∈ E and 0 for ω ∉ E. Two events E₁, E₂ are independent if their probabilities
factor: Pr[E₁ ∧ E₂] = Pr[E₁] · Pr[E₂]. Two random variables X₁, X₂ are independent if, for
every x₁ and x₂, the events {ω : X₁(ω) = x₁} and {ω : X₂(ω) = x₂} are independent. In
this case, expectations factor: E[X₁X₂] = E[X₁] · E[X₂]. Independence for sets of 3 or more
events or random variables is defined analogously (for every subset, probabilities should
factor). Probabilities and expectations generally don't factor for non-independent random
variables, for example if E₁, E₂ are complementary events (so Pr[E₁ ∧ E₂] = 0).
2 Linearity of Expectation and MAX 3SAT
2.1 Linearity of Expectation
The first of our five essential tools is linearity of expectation. Like most of these tools, it
somehow manages to be both near-trivial and insanely useful. You've surely seen it before.¹
To remind you, suppose X₁, . . . , Xₙ are random variables defined on a common state space
Ω. Crucially, the Xᵢ's need not be independent. Linearity of expectation says that we can
freely exchange expectations with summations:

    E[Σ_{i=1}^n Xᵢ] = Σ_{i=1}^n E[Xᵢ] .
The proof is trivial — just expand the expectations as sums over Ω, and reverse the order
of summation.
The analogous statement for, say, products of random variables is not generally true
(when the Xi’s are not independent). Again, just think of two indicator random variables
for complementary events.
As an algorithm designer, why should you care about linearity of expectation? A typical
use case works as follows. Suppose there is some complex random variable X that we
care about — like the number of comparisons used by QuickSort, or the objective function
value of the solution returned by some randomized algorithm. In many cases, it is possible
to express the complex random variable X as the sum Σ_{i=1}^n Xᵢ of much simpler random
variables X₁, . . . , Xₙ, for example indicator random variables. One can then analyze the
expectation of the simple random variables directly, and exploit linearity of expectation to
deduce the expected value of the complex random variable of interest. You should have seen
this recipe in action already in CS109 and/or CS161, for example when analyzing QuickSort
or hash tables. Remarkably, linearity of expectation is already enough to derive interesting
results in approximation algorithms.
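As a toy illustration of the recipe (my own example, not from the notes), the expectation of a "complex" random variable, the number of heads in n coin flips, can be computed exactly by decomposing it into indicator variables, and the decomposition works even for perfectly correlated indicators:

```python
import itertools

# State space: all 2^n coin-flip outcomes, each equally likely.
n = 4
omega = list(itertools.product([0, 1], repeat=n))

def expectation(X):
    # Average of X over the uniform distribution on omega.
    return sum(X(w) for w in omega) / len(omega)

# Complex random variable: total number of heads.
total = lambda w: sum(w)

# Simple random variables: indicator that flip i came up heads.
indicators = [lambda w, i=i: w[i] for i in range(n)]

lhs = expectation(total)                         # E[sum of X_i]
rhs = sum(expectation(Xi) for Xi in indicators)  # sum of E[X_i]
assert lhs == rhs == n / 2

# Linearity needs no independence: X1 is the indicator of "first flip
# heads" and X2 the indicator of its complement, so X1 + X2 is always 1.
X1 = lambda w: w[0]
X2 = lambda w: 1 - w[0]
assert expectation(lambda w: X1(w) + X2(w)) == expectation(X1) + expectation(X2) == 1.0
```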
¹When I teach CS161, out of all twenty lectures, exactly one equation gets a box drawn around it for
emphasis — linearity of expectation.
2.2 A 7/8-Approximation Algorithm for MAX 3SAT
An input of MAX 3SAT is just like an input of 3SAT — there are n Boolean variables
x₁, . . . , xₙ and m clauses. Each clause is the disjunction ("or") of 3 literals (where a literal is
a variable or its negation). For example, a clause might have the form x₃ ∨ ¬x₆ ∨ ¬x₁₀. For
simplicity, assume that the 3 literals in each clause correspond to distinct variables. The goal
is to output a truth assignment (an assignment of each xᵢ to {true, false}) that satisfies the
maximum-possible number of clauses. Since 3SAT is the special case of checking whether or
not the optimal objective function value equals m, MAX 3SAT is an NP-hard problem.
A very simple algorithm has a pretty good approximation ratio.
Theorem 2.1 The expected number of clauses satisfied by a random truth assignment, chosen
uniformly at random from all 2ⁿ truth assignments, is 7m/8.

Since the optimal solution can't possibly satisfy more than m clauses, we conclude that
the algorithm that chooses a random assignment is a 7/8-approximation (in expectation).
Proof of Theorem 2.1: Identify the state space Ω with all 2ⁿ possible truth assignments (with
the uniform distribution). For each clause j, let Xⱼ denote the indicator random variable for
the event that clause j is satisfied. Observe that the random variable X that we really care
about, the number of satisfied clauses, is the sum Σ_{j=1}^m Xⱼ of these simple random variables.
We now follow the recipe above, analyzing the simple random variables directly and using
linearity of expectation to analyze X. As always with an indicator random variable, the
expectation is just the probability of the corresponding event:
    E[Xⱼ] = 1 · Pr[Xⱼ = 1] + 0 · Pr[Xⱼ = 0] = Pr[clause j satisfied] .
The key observation is that clause j is satisfied by a random assignment with probability
exactly 7/8. For example, suppose the clause is x₁ ∨ x₂ ∨ x₃. Then a random truth assignment
satisfies the clause unless we are unlucky enough to set each of x₁, x₂, x₃ to false — for all of
the other 7 combinations, at least one variable is true and hence the clause is satisfied. But
there's nothing special about this clause — for any clause with 3 literals corresponding to
distinct variables, only 1 of the 8 possible assignments to these three variables fails to satisfy
the clause.
Putting the pieces together and using linearity of expectation, we have

    E[X] = E[Σ_{j=1}^m Xⱼ] = Σ_{j=1}^m E[Xⱼ] = Σ_{j=1}^m 7/8 = 7m/8,

as claimed. ■
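The random-assignment algorithm is only a few lines of code. The sketch below uses a DIMACS-style clause encoding (signed integers: +i for xᵢ, −i for ¬xᵢ), which is my choice rather than anything from the notes; by Theorem 2.1 the empirical average over many trials should hover near 7m/8.

```python
import random

def random_assignment_satisfied(clauses, n, rng):
    # A clause is a tuple of 3 nonzero ints: +i means x_i, -i means NOT x_i.
    assignment = [rng.random() < 0.5 for _ in range(n + 1)]  # indices 1..n

    def lit_true(l):
        return assignment[abs(l)] if l > 0 else not assignment[abs(l)]

    return sum(any(lit_true(l) for l in c) for c in clauses)

rng = random.Random(0)
n = 6
# Each clause has 3 literals over distinct variables, as Theorem 2.1 assumes.
clauses = [(1, -2, 3), (-1, 4, 5), (2, -5, 6), (-3, -4, -6)]
m = len(clauses)
trials = 20000
avg = sum(random_assignment_satisfied(clauses, n, rng) for _ in range(trials)) / trials
# The true expectation is exactly 7m/8 = 3.5 for these m = 4 clauses.
assert abs(avg - 7 * m / 8) < 0.05
```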
If a random assignment satisfies 7m/8 clauses on average, then certainly some truth as-
signment does as well as this average.²
²It is not hard to derandomize the randomized algorithm to compute such a truth assignment determin-
istically in polynomial time, but this is outside the scope of this lecture.
Corollary 2.2 For every 3SAT formula, there exists a truth assignment satisfying at least
87.5% of the clauses.
Corollary 2.2 is counterintuitive to many people the first time they see it, but it is a near-
trivial consequence of linearity of expectation (which itself is near-trivial!).
Remarkably, and perhaps depressingly, there is no better approximation algorithm: as-
suming P ≠ NP, there is no (7/8 + ε)-approximation algorithm for MAX 3SAT, for any
constant ε > 0. This is one of the major results in "hardness of approximation."
3 Tail Inequalities
If you only care about the expected value of a random variable, then linearity of expectation
is often the only tool you need. But in many cases one wants to prove that an algorithm
is good not only on average, but is also good almost all the time (“with high probability”).
Such high-probability statements require different tools.
The point of a tail inequality is to prove that a random variable is very likely to be
close to its expected value — that the random variable “concentrates.” In the world of tail
inequalities, there is always a trade-off between how much you assume about your random
variable, and the degree of concentration that you can prove. This section looks at the three
most commonly used points on this trade-off curve. We use hashing as a simple running
example to illustrate these three inequalities; the next section connects these ideas back to
approximation algorithms.
3.1 Hashing
Figure 1: A hash function h that maps a large universe U to a relatively small number of
buckets n.
Throughout this section, we consider a family H of hash functions, with each h ∈ H mapping
a large universe U to a relatively small number of “buckets” {1, 2, . . . , n} (Figure 1). We’ll be
thinking about the following experiment, which should be familiar from CS161: an adversary
picks an arbitrary data set S ⊆ U, then we pick a hash function h ∈ H uniformly at random
and use it to hash all of the elements of S. We’d like these objects to be distributed evenly
across the buckets, and the maximum load of a bucket (i.e., the number of items hashing
to it) is a natural measure of distance from this ideal case. For example, in a hash table
with chaining, the maximum load of a bucket governs the worst-case search time, a highly
relevant statistic.
3.2 Markov's Inequality
For now, all we assume about H is that each object is equally likely to map to each bucket
(though not necessarily independently).
(P1) For every x ∈ U and i ∈ {1, 2, . . . , n}, Pr_{h∈H}[h(x) = i] = 1/n.
This property is already enough to analyze the expected load of a bucket. For simplicity,
suppose that the size |S| of the data set being hashed equals the number of buckets n.
Then, for any bucket i, by linearity of expectation (applied to indicator random variables
for elements getting mapped to i), its expected load is
X
|
S|
Pr[h(x) = i] =
= 1.
(1)
|
{z
}
n
x∈S
=
1/n by (P1)
This is good — the expectations seem to indicate that things are balanced on average. But
can we prove a concentration result, stating that loads are close to these expectations?
The following tail inequality gives a weak bound but applies under minimal assumptions;
it is our second (of 5) essential tools for the analysis of randomized algorithms.
Theorem 3.1 (Markov’s Inequality) If X is a non-negative random variable with finite
expectation, then for every constant c ≥ 1,
1
Pr[X ≥ c · E[X]] ≤ .
c
For example, such a random variable is at least 10 times its expectation at most 10% of the
time, and is at least 100 times its expectation at most 1% of the time. In general, Markov’s
inequality is useful when a constant probability guarantee is good enough. The proof of
Markov's inequality is easy, and we leave it to Exercise Set #9.³
³Both hypotheses are necessary. For example, random variables that are equally likely to be M or −M
exhibit no concentration whatsoever as M → ∞.
We now apply Markov's inequality to the random variable equal to the load of our favorite
bucket i. We can choose any c ≥ 1 we want in Theorem 3.1. For example, choosing c = n
and recalling that the relevant expectation is 1 (assuming |S| = n), we obtain

    Pr[load of i ≥ n] ≤ 1/n .
The good news is that 1/n is not a very big number when n is large. But let's look at the
event we're talking about: the load of i being at least n means that every single element of
S hashes to i. And this sounds crazy, like it should happen much less often than 1/n of the
time. (If you hash 100 things into a hash table with 100 buckets, would you really expect
everything to hash to the same bucket 1% of the time?)
If we're only assuming the property (P1), however, it's impossible to prove a better bound.
To see this, consider the set H = {h₁, . . . , hₙ} of constant hash functions, where hᵢ maps
every item to bucket i. Observe that H satisfies property (P1). But the probability that all
items hash to the bucket i is indeed 1/n.
3.3 Chebyshev's Inequality
A totally reasonable objection is that the example above uses a stupid family of hash functions
that no one would ever use. So what about a good family of hash functions, like those you
studied in CS161? Specifically, we now assume:
(P2) For every pair x, y ∈ U of distinct elements, and every i, j ∈ {1, 2, . . . , n},

    Pr_{h∈H}[h(x) = i and h(y) = j] = 1/n² .
That is, when looking at only two elements, the joint distribution of their buckets is as if
the function h is a totally random function. (Property (P1) asserts an analogous statement
when looking at only a single element.) A family of hash functions satisfying (P2) is called
a pairwise or 2-wise independent family. This is almost the same as (and for practical
purposes equivalent to) the notion of “universal hashing” that you saw in CS161. The
family of constant hash functions (above) clearly fails to satisfy property (P2).
So how do we use this stronger assumption to prove sharper concentration bounds?
Recall that the variance Var[X] of a random variable is its expected squared deviation from
its mean E[(X − E[X])2], and that the standard deviation is the square root of the variance.
Assumption (P2) buys us control over the variance of the load of a bucket. Chebyshev’s
inequality, the third of our five essential tools, is the inequality you want to use when the
best thing you’ve got going for you is a good bound on the variance of a random variable.
Theorem 3.2 (Chebyshev’s Inequality) If X is a random variable with finite expecta-
tion and variance, then for every constant t ≥ 1,
    Pr[|X − E[X]| > t · StdDev[X]] ≤ 1/t² .
For example, the probability that a random variable differs from its expectation by at least
two standard deviations is at most 25%, and the probability that it differs by at least 10
standard deviations is at most 1%. Chebyshev’s inequality follows easily from Markov’s
inequality; see Exercise Set #9.
Now let's go back to the load of our favorite bucket i, where a data set S ⊆ U with size
|S| = n is hashed using a hash function h chosen uniformly at random from H. Call this
random variable X. We can write

    X = Σ_{y∈S} X_y ,

where X_y is the indicator random variable for whether or not h(y) = i. We noted earlier
that, by (P1), E[X] = Σ_{y∈S} 1/n = 1.
Now consider the variance of X. We claim that

    Var[X] = Σ_{y∈S} Var[X_y] ,   (2)
analogous to linearity of expectation. Note that this statement is not true in general — e.g.,
if X₁ and X₂ are indicator random variables of complementary events, then X₁ + X₂ is always
equal to 1 and hence has variance 0. In CS109 you saw a proof that for independent random
variables, variances add as in (2). If you go back and look at this derivation — seriously,
go look at it — you’ll see that the variance of a sum equals the sum of the variances of
the summands, plus correction terms that involve the covariances of pairs of summands.
The covariance of independent random variables is zero. Here, we are only dealing with
pairwise independent random variables (by assumption (P2)), but still, this implies that the
covariance of any two summands is 0. We conclude that (2) holds not only for sums of
independent random variables, but also of pairwise independent random variables.
Each indicator random variable X_y is a Bernoulli variable with parameter 1/n, and so

    Var[X_y] = (1/n)(1 − 1/n) ≤ 1/n .

Using (2), we have Var[X] = Σ_{y∈S} Var[X_y] ≤ n · (1/n) = 1. (By
contrast, when H is the set of constant hash functions, Var[X] ≈ n.)
Applying Chebyshev's inequality with t = n (and ignoring "+1" terms for simplicity),
we obtain

    Pr_{h∈H}[X ≥ n] ≤ 1/n² .
This is a better bound than what we got from Markov's inequality, but it still doesn't seem
that small — when hashing 10 elements into 10 buckets, do you really expect to see all of them
in a single bucket 1% of the time? But again, without assuming more than property (P2),
we can't do better — there exist families of pairwise independent hash functions such that
all elements hash to the same bucket with probability 1/n²; showing this is a nice puzzle.
3.4 Chernoff Bounds
In this section we assume that:
(P3) All h(x)'s are uniformly and independently distributed in {1, 2, . . . , n}. Equivalently,
h is a completely random function.
How can we use this strong assumption to prove sharper concentration bounds?
The fourth of our five essential tools for analyzing randomized algorithms is the Chernoff
bounds. They are the centerpiece of this lecture, and are used all the time in the analysis of
algorithms (and also complexity theory, machine learning, etc.).
The point of the Chernoff bounds is to prove sharp concentration for sums of independent
and bounded random variables.
Theorem 3.3 (Chernoff Bounds) Let X₁, . . . , Xₙ be random variables, defined on the
same state space and taking values in [0, 1], and set X = Σ_{j=1}^n Xⱼ. Then:

(i) for every δ > 0,

    Pr[X > (1 + δ)E[X]] < (e/(1 + δ))^{(1+δ)E[X]} .

(ii) for every δ ∈ (0, 1),

    Pr[X < (1 − δ)E[X]] < e^{−δ²E[X]/2} .
The key thing to notice in Theorem 3.3 is that the deviation probability decays exponentially
in both the factor of the deviation (1+δ) and the expectation of the random variable (E[X]).
So if either of these quantities is even modestly big, then the deviation probability is going
to be very small.⁴
We could prove Theorem 3.3 in 30 minutes or less, but the right place to spend time
on the proof is a randomized algorithms class (like CS265). So we’ll just use the Chernoff
bounds as a “black box” — this is how almost everybody thinks about them, anyways. It’s
notable that, of our five essential tools for the analysis of randomized algorithms, only the
Chernoff bounds require a non-trivial proof. We’ll only use part (i) in this lecture, but (ii)
is also useful in many situations. An analog of Theorem 3.3 for random variables that are
nonnegative and bounded (not necessarily in [0, 1]) follows from a simple scaling argument.
The independence assumption can be relaxed, for example to negatively correlated random
variables, although the proof then requires a bit more work.
Now let's apply the Chernoff bounds to analyze the number of items hashing to our
favorite bucket i, under the assumption (P3) that h is a uniformly random function. Again
using X_y to denote the indicator random variable for the event that h(y) = i, we see that
X = Σ_{y∈S} X_y is now the sum of independent 0-1 random variables, and hence is right in
the wheelhouse of the Chernoff bounds. For example, setting 1 + δ = ln n and recalling that
E[X] = 1, Theorem 3.3 implies that

    Pr[X > ln n] < (e/ln n)^{ln n} .   (3)

To interpret this bound, note that (1/e)^{ln n} = 1/n, so a constant less than one
raised to a logarithmic power yields an inverse polynomial. Now e/ln n is smaller than any
constant as n grows large, and hence the probability bound in (3) is smaller than any inverse
polynomial. Notice how much better this is than what we could prove using Markov's or
Chebyshev's inequality — we're looking at a much smaller deviation (ln n instead of n) yet
obtaining a much smaller probability bound (smaller than any inverse polynomial).

⁴For the first bound (i), it is common to state the tighter probability upper bound of [e^δ/(1 + δ)^{1+δ}]^{E[X]},
but the simpler bound here suffices for almost all applications.
Theorem 3.3 even implies that

    Pr[X > 3 ln n / ln ln n] ≤ 1/n² ,   (4)

as you should verify. Why ln n/ln ln n? Because this is roughly the solution to the equation
xˣ = n (this is relevant in Theorem 3.3 because of the (1 + δ)^{−(1+δ)} term). Again, this is a
huge improvement over what we obtained using Markov’s and Chebyshev’s inequalities. For
a more direct comparison, note that Chernoff bounds imply that the probability Pr[X ≥ n] is
at most an inverse exponential function of n (as opposed to an inverse polynomial function).
3.5 The Union Bound
Figure 2: Area of union is bounded by sum of areas of the circles.
Our fifth essential analysis tool is the union bound, which is not a tail inequality but is
often used in conjunction with tail inequalities. The union bound just says that for events
E₁, . . . , Eₖ,

    Pr[at least one of the Eᵢ occurs] ≤ Σ_{i=1}^k Pr[Eᵢ] .
Importantly, the events are completely arbitrary, and do not need to be independent. The
proof is a one-liner. In terms of Figure 2, the union bound just says that the area (i.e.,
probability mass) in the union is bounded above by the sum of the areas of the circles.
The bound is tight if the events are disjoint; otherwise the right-hand side is larger, due to
double-counting. (It’s like inclusion-exclusion, but without any of the correction terms.) In
applications, the events E₁, . . . , Eₖ are often "bad events" that we're hoping don't happen;
the union bound says that as long as each event occurs with low probability and there aren't
too many events, then with high probability none of them occur.
Returning to our running hashing example, let Eᵢ denote the event that bucket i receives
a load larger than 3 ln n/ln ln n. Using (4) and the union bound, we conclude that with
probability at least 1 − 1/n, none of the buckets receive a load larger than 3 ln n/ln ln n. That
is, the maximum load is O(log n/log log n) with high probability.⁵
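A quick simulation of this conclusion (the parameters and seed are my own): hash n items into n buckets with a truly random function and record the maximum load. With the seed fixed below, every trial stays well under the 3 ln n/ln ln n threshold.

```python
import math
import random
from collections import Counter

rng = random.Random(42)
n = 1000
threshold = 3 * math.log(n) / math.log(math.log(n))  # about 10.7 for n = 1000

max_loads = []
for _ in range(50):
    # Simulate a uniformly random function: each item picks a bucket
    # independently and uniformly at random.
    loads = Counter(rng.randrange(n) for _ in range(n))
    max_loads.append(max(loads.values()))

# With high probability every trial stays below the threshold;
# typical maximum loads here are around 4-6.
assert all(m <= threshold for m in max_loads)
```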
3.6 Chernoff Bounds: The Large Expectation Regime
We previously noted that the Chernoff bounds yield very good probability bounds once the
deviation (1+δ) or the expectation (E[X]) becomes large. In our hashing application above,
we were in the former regime. To illustrate the latter regime, suppose that we hash a data
set S ⊆ U with |S| = n ln n (instead of n). Now, the expected load of every bucket is ln n.
Applying Theorem 3.3 with 1 + δ = 4, we get that, for each bucket i,
    Pr[load on i is > 4 ln n] ≤ (e/4)^{4 ln n} ≤ 1/n² .
Using the union bound as before, we conclude that with high probability, no bucket receives
a load more than a small constant factor times its expectation.
Summarizing, when loads are light there can be non-trivial deviations from expected
loads (though still only logarithmic). Once loads are even modestly larger, however, the
buckets are quite evenly balanced with high probability. This is a useful lesson to remember,
for example in load-balancing applications (in data centers, etc.).
4 Randomized Rounding
We now return to the design and analysis of approximation algorithms, and give a classic
application of the Chernoff bounds to the problem of low-congestion routing.
Figure 3: Example of edge-disjoint path problem. Note that vertices can be shared, as shown
in this example.
⁵There is also a matching lower bound (up to constant factors).
In the edge-disjoint paths problem, the input is a graph G = (V, E) (directed or undi-
rected) and source-sink pairs (s₁, t₁), . . . , (sₖ, tₖ). The goal is to determine whether or not
there is an sᵢ-tᵢ path Pᵢ for each i such that no edge appears in more than one of the Pᵢ's.
See Figure 3. The problem is NP-hard (for directed graphs, even when k = 2).
Recall from last lecture the linear programming rounding approach to approximation
algorithms:
1. Solve an LP relaxation of the problem. (For an NP-hard problem, we expect the
optimal solution to be fractional, and hence not immediately meaningful.)

2. "Round" the resulting fractional solution to a feasible (integral) solution, hopefully
without degrading the objective function value by too much.
Last lecture applied LP rounding to the vertex cover problem. For the edge-disjoint paths
problem, we’ll use randomized LP rounding. The idea is to interpret the fractional values
in an LP solution as specifying a probability distribution, and then to round variables to
integers randomly according to this distribution.
The first step of the algorithm is to solve the natural linear programming relaxation of
the edge-disjoint paths problem. This is just a multicommodity flow problem (as in Exercise
Set #5 and Problem Set #3). In this relaxation the question is whether or not it is possible
to send simultaneously one unit of (fractional) flow from each source si to the corresponding
sink ti, where every edge has a capacity of 1. 0-1 solutions to this multicommodity flow
problem correspond to edge-disjoint paths. As we’ve seen, this LP relaxation can be solved
in polynomial time. If this LP relaxation is infeasible, then we can conclude that the original
edge-disjoint paths problem is infeasible as well.
Assume now that the LP relaxation is feasible. The second step rounds each sᵢ-tᵢ pair
independently. Consider a path decomposition (Problem Set #1) of the flow being pushed
from sᵢ to tᵢ. This gives a collection of paths, together with some amount of flow on each
path. Since exactly one unit of flow is sent, we can interpret this path decomposition as
a probability distribution over sᵢ-tᵢ paths. The algorithm then just selects an sᵢ-tᵢ path
randomly according to this probability distribution.
The rounding step yields paths P₁, . . . , Pₖ. In general, they will not be disjoint (this
would solve an NP-hard problem), and the goal is to prove that they are approximately
disjoint in some sense. The following result is the original and still canonical application of
randomized rounding.
Theorem 4.1 Assume that the LP relaxation is feasible. Then with high probability, the
randomized rounding algorithm above outputs a collection of paths such that no edge is used
by more than 3 ln m/ln ln m of the paths, where m is the number of edges.
The outline of the proof is:

1. Fix an edge e. The expected number of paths that include e is at most 1. (By linearity
of expectation, it is precisely the amount of flow sent on e by the multicommodity flow
relaxation, which is at most 1 since all edges were given unit capacity.)

2. Like in the hashing analysis in Section 3.6,

       Pr[# paths on e > 3 ln m / ln ln m] ≤ 1/m² ,

   where m is the number of edges. (Edges are playing the role of buckets, and sᵢ-tᵢ pairs
as items.)

3. Taking a union bound over the m edges, we conclude that with all but probability 1/m,
every edge winds up with at most 3 ln m/ln ln m paths using it.
Zillions of analyses in algorithms (and theoretical computer science more broadly) use this
one-two punch of the Chernoff bound and the union bound.
Interestingly, for directed graphs, the approximation guarantee in Theorem 4.1 is optimal,
up to a constant factor (assuming P ≠ NP). For undirected graphs, there is an intriguing
gap between the O(log n/log log n) upper bound of Theorem 4.1 and the best-known lower
bound of Ω(log log n) (assuming P ≠ NP).
5 Epilogue
To recap the top 5 essential tools for the analysis of randomized algorithms:
1. Linearity of expectation. If all you care about is the expectation of a random variable,
this is often good enough.

2. Markov's inequality. This inequality usually suffices if you're satisfied with a constant-
probability bound.

3. Chebyshev's inequality. This inequality is the appropriate one when you have a good
handle on the variance of your random variable.

4. Chernoff bounds. This inequality gives sharp concentration bounds for random vari-
ables that are sums of independent and bounded random variables (most commonly,
sums of independent indicator random variables).

5. Union bound. This inequality allows you to avoid lots of bad low-probability events.
All five of these tools are insanely useful. And four out of the five have one-line proofs!
CS261: A Second Course in Algorithms
Lecture #19: Beating Brute-Force Search∗
Tim Roughgarden†
March 8, 2016
A popular myth is that, for NP-hard problems, there are no algorithms with worst-case
running time better than that of brute-force search. Reality is more nuanced, and for many
natural NP-hard problems, there are algorithms with (worst-case) running time much better
than the naive brute-force algorithm (albeit still exponential). This lecture proves this point
by revisiting three problems studied in previous lectures: vertex cover, the traveling salesman
problem, and 3-SAT.
1 Vertex Cover and Fixed-Parameter Tractability
This section studies the special case of the vertex cover problem (Lecture #18) in which
every vertex has unit weight. That is, given an undirected graph G = (V, E), the goal is to
compute a minimum-cardinality subset S ⊆ V that contains at least one endpoint of every
edge.
We study the problem of checking whether or not a vertex cover instance admits a vertex
cover of size at most k (for a given k). This problem is no easier than the general problem,
since the latter reduces to the former by trying all possible values of k. Here, you should
think of k as “small,” for example between 10 and 20. The graph G can be arbitrarily
large, but think of the number of vertices as somewhere between 100 and 1000. We’ll show
how to beat brute-force search for small k. This will be our only glimpse of “parameterized
algorithms and complexity,” which is a vibrant subfield of theoretical computer science.
The naive brute-force search algorithm for checking whether or not there is a vertex cover
of size at most k is: for every subset S ⊆ V of k vertices, check whether or not S is a vertex
cover. The running time of this algorithm scales as the binomial coefficient (n choose k),
which is Θ(nᵏ) when k is small.
While technically polynomial for any constant k, there is no hope of running this algorithm
unless k is extremely small (like 3 or 4).
If we aim to do better, what can we hope for? Better than Θ(nᵏ) would be a running time
of the form poly(n) · f(k), where the dependence on k and on n can be separated, with
the latter dependence only polynomial. Even better would be a running time of the form
poly(n) + f(k) for some function f. Of course, we'd like the poly(n) term to be as close to
linear as possible. We’d also like the function f(k) to be as small as possible, but because
the vertex cover problem is NP-hard for general k, we expect f(k) to be at least exponential
in k. An algorithm with such a running time is called fixed-parameter tractable (FPT) with
respect to the parameter k.
We claim that the following is an FPT algorithm for the minimum-cardinality vertex
cover problem (with budget k).
FPT Algorithm for Vertex Cover

    set S = {v ∈ V : deg(v) ≥ k + 1}
    set G′ = G \ S
    set G′′ equal to G′ with all isolated vertices removed
    if G′′ has more than k² edges then
        return "no vertex cover with size ≤ k"
    else
        compute a minimum-size vertex cover T of G′′ by brute-force search
        return "yes" if and only if |S| + |T| ≤ k
We next explain why the algorithm is correct. First, notice that if G has a vertex cover S
of size at most k, then every vertex with degree at least k + 1 must be in S. For if such a
vertex v is not in S, then the other endpoint of each of the (at least k + 1) edges incident
to v must be in the vertex cover; but then |S| ≥ k + 1. In the second step, G′ is obtained
from G by deleting S and all edges incident to a vertex in S. The edges that survive in G′
are precisely the edges not already covered by S. Thus, the vertex covers of size at most k
in G are precisely the sets of the form S ∪ T, where T is a vertex cover of G′ of size at most
k − |S|. Given that every vertex cover with size at most k contains the set S, there is no loss
in discarding the isolated vertices of G′ (all incident edges of such a vertex in G′ are already
covered by vertices in S). Thus, G has a vertex cover of size at most k if and only if G′′ has
a vertex cover of size at most k − |S|. In the fourth step, if G′′ has more than k² edges, then
it cannot possibly have a vertex cover of size at most k (let alone k − |S|). The reason is
that every vertex of G′′ has degree at most k (all higher-degree vertices were placed in S),
so each vertex of G′′ can only cover k edges, so G′′ has a vertex cover of size at most k only
if it has at most k² edges. The final step computes the minimum-size vertex cover of G′′ by
brute force, and so is clearly correct.
Next, observe that in the final step (if reached), the graph G'' has at most k^2 edges (by
assumption) and hence at most 2k^2 vertices (since every vertex of G'' has degree at least 1).
It follows that the brute-force search step can be implemented in 2^{O(k^2)} time. Steps 1–4 can
be implemented in linear time, so the overall running time is O(m) + 2^{O(k^2)}, and hence the
algorithm is fixed-parameter tractable. In FPT jargon, the graph G'' is called a kernel (of
size O(k^2)), meaning that the original problem (on an arbitrarily large graph, with a given
budget k) reduces to the same problem on a graph whose size depends only on k. Using
linear programming techniques, it is possible to show that every unweighted vertex cover
instance actually admits a kernel with size only O(k), leading to a running time dependence
on k of 2^{O(k)} rather than 2^{O(k^2)}. Such singly-exponential dependence is pretty much the
best-case scenario in fixed-parameter tractability.
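The kernelization just described is short enough to sketch directly. The following Python
sketch (function name and edge-list representation are illustrative, not from the notes)
implements the five steps, with the brute-force stage trying all subsets of the kernel's
vertices up to the remaining budget:

```python
import itertools

def vertex_cover_fpt(edges, k):
    """Decide whether the graph has a vertex cover of size <= k, via the
    kernelization above (a sketch, not an optimized implementation)."""
    # Step 1: every vertex of degree >= k+1 must be in any size-<=k cover.
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    S = {v for v, d in deg.items() if d >= k + 1}
    if len(S) > k:
        return False
    # Steps 2-3: delete S; isolated vertices disappear automatically
    # because we only keep the surviving edges (the kernel G'').
    kernel = [(u, v) for u, v in edges if u not in S and v not in S]
    # Step 4: the kernel has maximum degree <= k, so more than k^2 edges
    # means no cover of size <= k exists.
    if len(kernel) > k * k:
        return False
    # Step 5: brute-force search over vertex subsets of the kernel.
    verts = sorted({w for e in kernel for w in e})
    budget = k - len(S)
    for size in range(budget + 1):
        for T in itertools.combinations(verts, size):
            Tset = set(T)
            if all(u in Tset or v in Tset for u, v in kernel):
                return True
    return False
```

On the triangle graph, for example, this correctly reports that no single vertex covers all
three edges, while two vertices do.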
Just as some problems admit good approximation algorithms and others do not (assuming
P ≠ NP), some problems (and parameters) admit fixed-parameter tractable algorithms
while others do not (under appropriate complexity assumptions). This is made precise
primarily via the theory of “W[1]-hardness,” which parallels the familiar theory of NP-
hardness. For example, the independent set problem, despite its close similarity to the
vertex cover problem (the complement of a vertex cover is an independent set and vice
versa), is W[1]-hard and hence does not seem to admit a fixed-parameter tractable algorithm
(parameterized by the size of the largest independent set).
2  TSP and Dynamic Programming
Recall from Lecture #16 the traveling salesman problem (TSP): the input is a complete
undirected graph with non-negative edge weights, and the goal is to compute the minimum-
cost TSP tour, meaning a simple cycle that visits every vertex exactly once. We saw in
Lecture #16 that the TSP problem is hard to even approximate, and for this reason we
focused on approximation algorithms for the (still NP-hard) special case of the metric TSP.
Here, we’ll give an exact algorithm for TSP, and we won’t even assume that the edges satisfy
the triangle inequality.
The naive brute-force search algorithm for TSP tries every possible tour, leading to
a running time of roughly n!, where n is the number of vertices. Recall that n! grows
considerably faster than any function of the form c^n for a constant c (see also Section 3).
Naive brute-force search is feasible with modern computers only for n in the range of 12
or 13. This section gives a dynamic programming algorithm for TSP that runs in O(n^2 · 2^n)
time. This extends the “tractability frontier” for n into the 20s. One drawback of the
dynamic programming algorithm is that it also uses exponential space (unlike brute-force
search). It is an open question whether or not there is an exact algorithm for TSP that has
running time O(c^n) for a constant c > 1 and also uses only a polynomial amount of space.
Two take-aways from the following algorithm are: (i) TSP is another fundamental NP-hard
problem for which algorithmic ingenuity beats brute-force search; and (ii) your algorithmic
toolbox (here, dynamic programming) continues to be extremely useful for the design of
exact algorithms for NP-hard problems.
Like any dynamic programming algorithm, the plan is to solve systematically a collection
of subproblems, from “smallest” to “largest,” and then read off the final answer from the
biggest subproblems. Coming up with the right subproblems is usually the hardest part of
designing a dynamic programming algorithm. Here, in the interests of time, we’ll just cut
to the chase and state the relevant subproblems.
Let V = {1, 2, . . . , n} be the vertex set. The algorithm populates a two-dimensional
array A, with one dimension indexed by a subset S ⊆ V of vertices and the other dimension
indexed by a single vertex j. At the end of the algorithm, the entry A[S, j] will contain the
cost of the minimum-cost path that:
(i) visits every vertex v ∈ S exactly once (and no other vertices);
(ii) starts at the vertex 1 (so 1 better be in S);
(iii) ends at the vertex j (so j better be in S).
There are O(n · 2^n) subproblems. Since the TSP is NP-hard, we should not be surprised to
see an exponential number of subproblems.
After solving all of the subproblems, it is easy to compute the cost of an optimal tour
in linear time. Since A[{1, 2, . . . , n}, j] contains the length of the shortest path from 1 to j
that visits every vertex exactly once, we can just "guess" (i.e., do brute-force search over)
the vertex preceding 1 on the tour:

    OPT = min_{j=2,...,n} ( A[{1, 2, . . . , n}, j] + c_{j1} ),

where the first term is the length of the best path from 1 to j and c_{j1} is the cost of the
last hop back to 1.
Next, we need a principled way to solve all of the subproblems, using solutions to pre-
viously solved “smaller” subproblems to quickly solve “larger” subproblems. That is, we
need a recurrence relating the solutions of different subproblems. So consider a subproblem
A[S, j], where the goal is to compute the minimum cost of a path subject to (i)–(iii) above.
What must the optimal solution look like? If we only knew the penultimate vertex k on the
path (right before j), then we would know what the path looks like: it would be the cheapest
possible path visiting each of the vertices of S \ {j} exactly once, starting at 1, and ending
at k (why?), followed of course by the final hop from k to j. Our recurrence just executes
brute-force search over all of the legitimate choices of k:
    A[S, j] = min_{k ∈ S \ {1, j}} ( A[S \ {j}, k] + c_{kj} ).
This recurrence assumes that |S| ≥ 3. If |S| = 1 then A[S, j] is 0 if S = {1} and j = 1 and
is +∞ otherwise. If |S| = 2, then the only legitimate choice of k is 1.
The algorithm first solves all subproblems with |S| = 1, then all subproblems with
|S| = 2, . . . , and finally all subproblems with |S| = n (i.e., S = {1, 2, . . . , n}). When solving
a subproblem, the solutions to all relevant smaller subproblems are available for constant-
time lookup. Each subproblem can thus be solved in O(n) time. Since there are O(n · 2^n)
subproblems, we obtain the claimed running time bound of O(n^2 · 2^n).
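The subproblems and recurrence above are exactly the classic Held-Karp dynamic program.
Here is a minimal Python sketch (0-indexed, with vertex 0 playing the role of vertex 1 in
the notes; the function name and dictionary-keyed table are illustrative choices):

```python
from itertools import combinations

def held_karp(dist):
    """Exact TSP via the dynamic program above, in O(n^2 2^n) time.
    dist[i][j] is the cost of edge (i, j); vertices are 0, ..., n-1."""
    n = len(dist)
    # A[(S, j)] = min cost of a path that starts at 0, visits exactly the
    # vertices of frozenset S (each once), and ends at j.  Base case: S = {0}.
    A = {(frozenset([0]), 0): 0}
    for size in range(2, n + 1):
        for rest in combinations(range(1, n), size - 1):
            S = frozenset(rest) | {0}
            for j in rest:
                A[(S, j)] = min(
                    A[(S - {j}, k)] + dist[k][j]
                    for k in S - {j}
                    if k != 0 or size == 2   # k = 0 only when S = {0, j}
                )
    full = frozenset(range(n))
    # Close the tour with the last hop from j back to 0.
    return min(A[(full, j)] + dist[j][0] for j in range(1, n))
```

On a 4-vertex cycle with unit edge weights and heavier diagonals, this returns the optimal
tour cost of 4.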
3  3SAT and Random Search

3.1  Schöning's Algorithm

Recall from last lecture that a 3SAT formula involves n Boolean variables x_1, . . . , x_n and m
clauses, where each clause is the disjunction of three literals (where a literal is a variable or
its negation). Last lecture we studied MAX 3SAT, the optimization problem of satisfying as
many of the clauses as possible. Here, we'll study the simpler decision problem, where the
goal is to check whether or not there is an assignment that satisfies all m clauses. Recall that
this is the canonical example of an NP-complete problem (cf. the Cook-Levin theorem).
Naive brute-force search would try all 2^n truth assignments. Can we do better than
exhaustive search? Intriguingly, we can, with a simple algorithm and by a pretty wide
margin. Specifically, we'll study Schöning's random search algorithm (from 1999). The
parameter T will be determined later.
Random Search Algorithm for 3SAT (Version 1)
repeat T times (or until a satisfying assignment is found):
choose a truth assignment a uniformly at random
repeat n times (or until a satisfying assignment is found):
choose a clause C violated by the current assignment a
choose one of the three literals from C uniformly at random, and
modify a by flipping the value of the corresponding variable
(from “true” to “false” or vice versa)
if a satisfying assignment was found then
return “satisfiable”
else
return “unsatisfiable”
And that’s it!1
3.2  Analysis (Version 1)
We give three analyses of Schöning's algorithm (and a minor variant), each a bit more
sophisticated and establishing a better running time bound than the last. The first observation
is that the algorithm never makes a mistake when the formula is unsatisfiable — it will never
find a satisfying assignment (no matter what its coin flips are), and hence reports “unsatis-
fiable.” So what we’re worried about is the algorithm failing to find a satisfying assignment
when one exists. So for the rest of the lecture, we consider only satisfiable instances. We
use a∗ to denote a reference satisfying assignment (if there are many, we pick one arbitrar-
ily). The high-level idea is to track the “Hamming distance” between a∗ and our current
truth assignment a (i.e., the number of variables with different values in a and a∗). If this
Hamming distance ever drops to 0, then a = a∗ and the algorithm has found a satisfying
assignment.
1 A little backstory: an analogous algorithm for 2SAT (2 literals per clause) was studied earlier by
Papadimitriou. 2SAT is polynomial-time solvable — for example, it can be solved in linear time via a reduction
to computing the strongly connected components of a suitable directed graph. Papadimitriou's random search
algorithm is slower but still polynomial (O(n^2)), with the analysis being a nice exercise in random walks
(covered in the instructor's Coursera videos).
A simple observation is that, if the current assignment a fails to satisfy a clause C, then
a∗ assigns at least one of the three variables in C a different value than a does (as a∗ satisfies
the clause). Thus, when the random search algorithm chooses a variable of a violated clause
to flip, there is at least a 1/3 chance that the algorithm chooses a "good variable," the
flipping of which decreases the Hamming distance between a and a∗ by one. (If a and a∗
differ on more than one variable of C, then the probability is higher.) In the other case,
when the algorithm chooses a "bad variable," where a and a∗ give it the same value, flipping
the value of the variable in a increases the Hamming distance between a and a∗ by 1. This
happens with probability at most 2/3.2
All of the analyses proceed by identifying simple sufficient conditions for the random
search algorithm to find a satisfying assignment, bounding below the probability that these
sufficient conditions are met, and then choosing T large enough that the algorithm is guar-
anteed to succeed with high probability.
To begin, suppose that the initial random assignment a chosen in an iteration of the outer
loop differs from the reference satisfying assignment a∗ in k variables. A sufficient condition
for the algorithm to succeed is that, in every one of the first k iterations of the inner loop, the
algorithm gets lucky and flips the value of a variable on which a, a∗ differ. Since each inner
loop iteration has a probability of at least 1/3 of choosing wisely, and the random choices
are independent, this sufficient condition for correctness holds with probability at least 3^{−k}.
(The algorithm might stop early if it stumbles on a satisfying assignment other than a∗; this
is obviously fine with us.)
For our first analysis, we'll use a sloppy argument to analyze the parameter k (the distance
between a and a∗ at the beginning of an outer loop iteration). By symmetry, a agrees with a∗
on at least half the variables (i.e., k ≤ n/2) with probability at least 1/2. Conditioning on this
event, we conclude that a single outer loop iteration successfully finds a satisfying assignment
with probability at least p = 1/(2 · 3^{n/2}). Hence, the algorithm finds a satisfying assignment
in one of the T outer loop iterations except with probability at most (1 − p)^T ≤ e^{−pT}.3 If
we take T = (d ln n)/p for a constant d > 0, then the algorithm succeeds except with inverse
polynomial probability 1/n^d. Substituting for p, we conclude that

    T = Θ((√3)^n log n)

outer loop iterations are enough to be correct with high probability. This gives us an
algorithm with running time O((1.74)^n), which is already significantly better than the 2^n
dependence in brute-force search.
2 The fact that the random process is biased toward moving farther away from a∗ is what gives rise to
the exponential running time. In the case of 2SAT, each random move is at least as likely to decrease the
distance as increase the distance, which in turn leads to a polynomial running time.
3 Recall the useful inequality 1 + x ≤ e^x for all x ∈ R, used also in Lectures #11 (see the plot there) and
#15.
3.3  Analysis (Version 2)
We next give a refined analysis of the same algorithm. The plan is to count the probability
of success for all values of the initial distance k, not just when k ≤ n/2 (and not assuming
the worst case of k = n/2).
For a given choice of k ∈ {1, 2, . . . , n}, what is the probability that the initial assignment
a and a∗ differ in their values to exactly k variables? There is one such assignment for each
of the (n choose k) choices of a set S of k out of n variables. (The corresponding assignment a
disagrees with a∗ on S and agrees with a∗ outside of S.) Since all truth assignments are
equally likely (probability 2^{−n} each),

    Pr[dist(a, a∗) = k] = (n choose k) · 2^{−n}.
We can now lower bound the probability of success of an outer loop iteration by
conditioning on k:

    Pr[success] = Σ_{k=0}^{n} Pr[dist(a, a∗) = k] · Pr[success | dist(a, a∗) = k]
                ≥ Σ_{k=0}^{n} (n choose k) · 2^{−n} · (1/3)^k
                = 2^{−n} · (1 + 1/3)^n
                = (2/3)^n,

where the penultimate equality follows from a slick application of the binomial formula.4
Thus, taking T = Θ((3/2)^n log n), the random search algorithm is correct with high
probability.
3.4  Analysis (Version 3)

For the final analysis, we tweak the version of Schöning's algorithm above slightly, replacing
"repeat n times" in the inner loop by "repeat 3n times." This only increases the running
time by a constant factor.
Our two previous analyses only considered the cases where the random search algorithm
made a beeline for the reference satisfying assignment a∗, never making an incorrect choice
of which variable to flip. There are also other cases where the algorithm will succeed.
For example, if the algorithm chooses a bad variable once (increasing dist(a, a∗) by 1),
but then a good variable k + 1 times, then after these k + 2 iterations a is the same as
the satisfying assignment a∗ (unless the algorithm stopped early due to finding a different
satisfying assignment).
4 I.e., the formula (a + b)^n = Σ_{k=0}^{n} (n choose k) · a^k · b^{n−k}.
For the analysis, we'll focus on the specific case where, in the first 3k inner loop iterations,
the algorithm chooses a bad variable k times and a good variable 2k times. This idea leads
to

    Pr[success] ≥ Σ_{k=0}^{n} (n choose k) · 2^{−n} · (3k choose k) · (1/3)^{2k} · (2/3)^k,    (1)

since the probability that the random local search algorithm chooses a good variable 2k
times in the first 3k inner loop iterations is at least (3k choose k) · (1/3)^{2k} · (2/3)^k.
This inequality is pretty messy, with no less than two binomial coefficients complicating
each summand. We'll be able to handle the (n choose k) terms using the same slick binomial
expansion trick from the previous analysis, but the (3k choose k) terms are more annoying.
To deal with them, recall Stirling's approximation for the factorial function:

    n! = Θ(√n · (n/e)^n).

(The hidden constant is √(2π), but we won't need to worry about that.) Thus, in the grand
scheme of things, n! is not all that much smaller than n^n.
We can use Stirling's approximation to simplify (3k choose k):

    (3k choose k) = (3k)! / ((2k)! · k!)
                  = Θ( √(3k) · (3k/e)^{3k} / (√(2k) · (2k/e)^{2k} · √k · (k/e)^k) )
                  = Θ( (1/√k) · 3^{3k}/2^{2k} ).
Thus,

    (3k choose k) · (1/3)^{2k} · (2/3)^k = Θ(2^{−k}/√k),

using the fact that (3k choose k) = Θ(3^{3k}/(2^{2k} · √k)).
Substituting back into (1), we find that for some constant c > 0 (hidden in the Θ notation),

    Pr[success] ≥ Σ_{k=0}^{n} (n choose k) · 2^{−n} · c · 2^{−k}/√k
                ≥ (c/√n) · 2^{−n} · Σ_{k=0}^{n} (n choose k) · 2^{−k}
                = (c/√n) · 2^{−n} · (1 + 1/2)^n
                = (c/√n) · (3/4)^n.

We conclude that with T = Θ((4/3)^n · √n · log n), the algorithm is correct with high
probability. This running time of ≈ (4/3)^n has been improved somewhat since 1999, but this
is still quite close to the state of the art, and it is an impressive improvement over the ≈ 2^n
running time required by brute-force search. Can we do even better? This is an open question.
The exponential time hypothesis (ETH) asserts that every correct algorithm for 3SAT has
worst-case running time at least c^n for some constant c > 1. (For example, this rules out a
"quasi-polynomial-time" algorithm, with running time n^{polylog(n)}.) The ETH is certainly a
stronger assumption than P ≠ NP, but most experts believe that it is true.
The random search idea can be extended from 3SAT to k-SAT for all constant values
of k. For every constant k, the result is an algorithm that runs in time O(c^n) for a constant
c < 2. However, the constant c tends to 2 as k tends to infinity. The strong exponential
time hypothesis (SETH) asserts that this is necessary — that there is no algorithm for the
general SAT problem (with k arbitrary) that runs in worst-case time O(c^n) for some
constant c < 2 (independent of k). Expert opinion is mixed on whether or not SETH holds.
If it does hold, then there are interesting consequences for lots of different problems, ranging
from the prospects of fixed-parameter tractable algorithms for NP-hard problems (Section 1)
to lower bounds for classic algorithmic problems like computing the edit distance between
two strings.
CS261: A Second Course in Algorithms
Lecture #20: The Maximum Cut Problem and
Semidefinite Programming
Tim Roughgarden
March 10, 2016
1  Introduction
Now that you’re finishing CS261, you’re well equipped to comprehend a lot of advanced
material on algorithms. This lecture illustrates this point by teaching you about a cool and
famous approximation algorithm.
In the maximum cut problem, the input is an undirected graph G = (V, E) with a
nonnegative weight w_e ≥ 0 for each edge e ∈ E. The goal is to compute a cut — a partition
of the vertex set into sets A and B — that maximizes the total weight of the cut edges (the
edges with one endpoint in each of A and B).
Now, if it were the minimum cut problem, we’d know what to do — that problem reduces
to the maximum flow problem (Exercise Set #2). It’s tempting to think that we can reduce
the maximum cut problem to the minimum cut problem just by negating the weights of all
of the edges. Such a reduction would yield a minimum cut problem with negative weights
(or capacities). But if you look back at our polynomial-time algorithms for computing
minimum cuts, you’ll notice that we assumed nonnegative edge capacities, and that our
proofs depended on this assumption. Indeed, it’s not hard to prove that the maximum cut
problem is NP-hard. So, let’s talk about polynomial-time approximation algorithms.
It's easy to come up with a 1/2-approximation algorithm for the maximum cut problem.
Almost anything works — a greedy algorithm, local search, picking a random cut, linear
programming rounding, and so on. But frustratingly, none of these techniques seemed capable
of proving an approximation factor better than 1/2. This made it remarkable when, in 1994,
Goemans and Williamson showed how a new technique, "semidefinite programming
rounding," could be used to blow away all previous approximation algorithms for the
maximum cut problem.
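For contrast with what follows, the "picking a random cut" baseline mentioned above can
be sketched in a few lines (names are illustrative). Each edge is cut with probability 1/2,
so the expected cut weight is half the total edge weight, which is at least half the maximum
cut's weight:

```python
import random

def random_cut(vertices, weighted_edges):
    """Trivial randomized 1/2-approximation for maximum cut: put each
    vertex on a uniformly random side.  weighted_edges is a list of
    (u, v, weight) triples."""
    side = {v: random.random() < 0.5 for v in vertices}
    A = {v for v in vertices if side[v]}
    B = set(vertices) - A
    weight = sum(w for u, v, w in weighted_edges if side[u] != side[v])
    return A, B, weight
```

Running this a few times and keeping the best cut found is a common way to sharpen the
guarantee in practice.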
2  A Semidefinite Programming Relaxation for the Maximum Cut Problem

2.1  A Quadratic Programming Formulation
To motivate a novel relaxation for the maximum cut problem, we first reformulate the
problem exactly via a quadratic program. (So solving this program is also NP-hard.) The
idea is to have one decision variable y_i for each vertex i ∈ V, indicating which side of the
cut the vertex is on. It's convenient to restrict y_i to lie in {−1, +1}, as opposed to {0, 1}.
There's no need for any other constraints. In the objective function, we want an edge (i, j)
of the input graph G = (V, E) to contribute w_ij whenever i, j are on different sides of the
cut, and 0 if they are on the same side of the cut. Note that y_i y_j = +1 if i, j are on the
same side of the cut and y_i y_j = −1 otherwise. Thus, we can formulate the maximum cut
objective function exactly as

    max Σ_{(i,j)∈E} w_ij · (1 − y_i y_j)/2.

Note that the contribution of edge (i, j) to the objective function is w_ij if i and j are on
different sides of the cut and 0 otherwise, as desired. There is a one-to-one and objective-
function-preserving correspondence between cuts of the input graph and feasible solutions
to this quadratic program.
This quadratic programming formulation has two features that make it a non-linear
program: the integer constraints y_i ∈ {±1} for every i ∈ V, and the quadratic terms y_i y_j in
the objective function.
2.2  A Vector Relaxation
Here’s an inspired idea for a relaxation: rather than requiring each yi to be either -1 or +1,
we only ask that each decision variable is a unit vector in Rn, where n = |V | denotes the
number of vertices. We henceforth use x to denote the (vector-valued) decision variable
i
corresponding to the vertex i ∈ V . We can think of the values +1 and -1 as the special cases
of the unit vectors (1, 0, 0, . . . , 0) and (−1, 0, 0, . . . , 0). There is an obvious question of what
we mean by the quadratic term y y˙ when we switch to decision variables that are n-vectors;
i
j
the most natural answer is to replace the scalar product y · y by the inner product hx , x i.
i
We then have the following “vector programming relaxation” of the maximum cut problem:
j
i
j
X
1
2
max
w (1 − hx , x i)
ij
i
j
(i,j)∈E
subject to
2
2
kx k = 1
for every i ∈ V .
i
It may seem obscure to write kx k2 = 1 rather than kx k2 = 1 (which is equivalent); the
i
reason for this will become clear later in the lecture. Since every cut of the input graph G
2
i
2
2
maps to a feasible solution of this relaxation with the same objective function value, and the
vector program only maximizes over more stuff, we have
vector OPT ≥ OP T.
Geometrically, this relaxation maps all the vertices of the input graph G to the unit
sphere in R^n, while attempting to map the endpoints of each edge to points that are as close
to antipodal as possible (to get ⟨x_i, x_j⟩ as close to −1 as possible).
2.3  Disguised Convexity
Figure 1: (a) a circle is convex, but (b) is not convex; the chord shown is not contained
entirely in the set.
It turns out that the relaxation above can be solved to optimality in polynomial time.1 You
might well find this counterintuitive, given that the inner products in the objective function
seem hopelessly quadratic. The moral reason for computational tractability is convexity.
Indeed, a good rule of thumb very generally is to equate computational tractability with
convexity. A mathematical program can be convex in two senses. The first sense is the same
as that we discussed back in Lecture #9 — a subset of R^n is convex if it contains all of its
chords. (See Figure 1.) Recall that the feasible region of a linear program is always convex
in this sense. The second sense is that the objective function can be a convex function. (A
linear function is a special case of a convex function.) We won’t need this second type of
convexity in this lecture, but it’s extremely useful in other contexts, especially in machine
learning.
OK. . . but where’s the convexity in the vector relaxation above? After all, if you take the
average of two points on the unit sphere, you don’t get another point on the unit sphere.
We next expose the disguised convexity. A natural idea to remove the quadratic (inner
product) character of the vector program above is to linearize it, meaning to introduce a
new decision variable p_ij for each i, j ∈ V, with the intention that p_ij will take on the value
⟨x_i, x_j⟩. But without further constraints, this will lead to a relaxation of the relaxation —
nothing is enforcing the p_ij's to actually be of the form ⟨x_i, x_j⟩ for some collection x_1, . . . , x_n
of n-vectors, and the p_ij's could form an arbitrary matrix instead. So how can we enforce
the intended semantics?

1 Strictly speaking, since the optimal solution might be irrational, we only solve it up to arbitrarily small
error.
This is where elementary linear algebra comes to the rescue. We’ll use some facts that
you’ve almost surely seen in a previous course, and also have almost surely forgotten. That’s
OK — if you spend 20-30 minutes with your favorite linear algebra textbook (or Wikipedia),
you’ll remember why all of these relevant facts are true (none are difficult).
First, let’s observe that a V × V matrix P = {p } is of the form p = hx , x i for some
ij
ij
i
j
vectors x , . . . , x (for every i, j ∈ V ) if and only if we can write
1
n
P = XT X
(1)
for some matrix X ∈ RV . Recalling the definition of matrix multiplication, the (i, j) entry
of XT X is the inner product of the ith row of XT and the jth column of X, or equivalently
the inner product of the ith and jth columns of X. Thus, for matrices P of the desired form,
the columns of the matrix X provide the n-vectors whose inner products define all of the
entries of P.
×
V
Matrices that are “squares” in the sense of (1) are extremely well understood, and they are
called (symmetric) positive semidefinite (psd) matrices. There are many characterizations of
symmetric psd matrices, and none are particularly hard to prove. For example, a symmetric
matrix is psd if and only if all of its eigenvalues are nonnegative. (Recall that a symmetric
matrix has a full set of real-valued eigenvalues.) The characterization that exposes the latent
convexity in the vector program above is that a symmetric matrix P is psd if and only if
    z^T P z ≥ 0                                                    (2)

for every vector z ∈ R^n (the left-hand side is called a "quadratic form"). Note that the
forward direction is easy to see (if P can be written P = X^T X then z^T P z = (Xz)^T (Xz) =
||Xz||_2^2 ≥ 0); the (contrapositive of the) reverse direction follows easily from the eigenvalue
characterization already mentioned.
For a fixed vector z ∈ R^n, the inequality (2) reads

    Σ_{i,j∈V} p_ij z_i z_j ≥ 0,

which is linear in the p_ij's (for fixed z_i's). And remember that the p_ij's are our decision
variables!
2.4  A Semidefinite Relaxation

Summarizing the discussion so far, we've argued that the vector relaxation in Section 2.2 is
equivalent to the linear program

    max Σ_{(i,j)∈E} w_ij · (1 − p_ij)/2

subject to:

    Σ_{i,j∈V} p_ij z_i z_j ≥ 0    for every z ∈ R^n    (3)
    p_ij = p_ji                   for every i, j ∈ V   (4)
    p_ii = 1                      for every i ∈ V.     (5)

The constraints (3) and (4) enforce the psd and symmetry constraints on the p_ij's. Their
presence makes this program a semidefinite program (SDP). The final constraints (5)
correspond to the constraints that ||x_i||_2^2 = 1 for every i ∈ V — that the matrix formed by
the p_ij's not only has the form X^T X, but has this form for a matrix X whose columns are
unit vectors.
2.5  Solving SDPs Efficiently
The good news about the SDP above is that every constraint is linear in the p_ij's, so we're in
the familiar realm of linear programming. The obvious issue is that the linear program has
an infinite number of constraints of the form (3) — one for each real-valued vector z ∈ Rn.
So there’s no hope of even writing this SDP down. But wait, didn’t we discuss an algorithm
for linear programming that can solve linear programs efficiently even when there are too
many constraints to write down?
The first way around the infinite number of constraints is to use the ellipsoid method
(Lecture #10) to solve the SDP. Recall that the ellipsoid method runs in time polynomial in
the number of variables (n^2 variables in our case), provided that there is a polynomial-time
separation oracle for the constraints. The responsibility of a separation oracle is, given an
allegedly feasible solution, to either verify feasibility or else produce a violated constraint. For
the SDP above, the constraints (4) and (5) can be checked directly. The constraints (3) can be
checked by computing the eigenvalues and eigenvectors of the matrix formed by the p_ij's.2 As
mentioned earlier, the constraints (3) are equivalent to this matrix having only nonnegative
eigenvalues. Moreover, if the p_ij's are not feasible and there is a negative eigenvalue, then
the corresponding eigenvector serves as a vector z such that the constraint (3) is violated.3
This separation oracle allows us to solve SDPs using the ellipsoid method.
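The separation oracle just described can be sketched with a standard eigenvalue routine.
This sketch assumes NumPy is available (a tooling choice, not part of the notes) and uses
a small tolerance to absorb floating-point error:

```python
import numpy as np

def psd_separation_oracle(P, tol=1e-9):
    """Separation oracle for constraints (3): given a symmetric matrix P
    of candidate p_ij's, either report feasibility (return None) or
    return a vector z with z^T P z < 0, i.e., a violated constraint."""
    eigenvalues, eigenvectors = np.linalg.eigh(P)  # eigh: symmetric input
    i = int(np.argmin(eigenvalues))
    if eigenvalues[i] >= -tol:
        return None            # all eigenvalues nonnegative: P is psd
    z = eigenvectors[:, i]     # eigenvector of the most negative eigenvalue
    return z                   # then z^T P z = lambda_min * ||z||^2 < 0
```

If the returned z is not None, the linear constraint Σ p_ij z_i z_j ≥ 0 for that particular z is
violated, which is exactly what the ellipsoid method needs.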
The second solution is to use “interior-point methods,” which were also mentioned briefly
at the end of Lecture #10. State-of-the-art interior-point algorithms can solve SDPs both in
theory (meaning in polynomial time) and in practice, meaning for medium-sized problems.
SDPs are definitely harder in practice than linear programs, though — modern solvers have
trouble going beyond thousands of variables and constraints, which is a couple orders of
magnitude smaller than the linear programs that are routinely solved by commercial solvers.
2 There are standard and polynomial-time matrix algorithms for this task; see any textbook on numerical
analysis.
3 If z is an eigenvector of a symmetric matrix P with eigenvalue λ, then z^T P z = z^T (λz) = λ · ||z||_2^2,
which is negative if and only if λ is negative.
A third option for many SDPs is to use an extension of the multiplicative weights algo-
rithm (Lecture #11) to quickly compute an approximately optimal solution. This is similar
in spirit to but somewhat more complicated than the application to approximate maximum
flows discussed in Lecture #12.4
Henceforth, we’ll just take it on faith that our SDP relaxation can be solved in polynomial
time. But the question remains: what do we do with the solution to the relaxation?
3  Randomized Hyperplane Rounding
The SDP relaxation above of the maximum cut problem was already known in the 1980s.
But only in 1994 did Goemans and Williamson figure out how to round its solution to
a near-optimal cut. First, it’s natural to round the solution of the vector programming
relaxation (Section 2.2) rather than the equivalent SDP relaxation (Section 2.4), since the
former ascribes one object (a vector) to each vertex i ∈ V , while the latter uses one scalar
for each pair of vertices.5 Thus, we “just” need to round each vector to a binary value, while
approximately preserving the objective function value.
The first key idea is to use randomized rounding, as first discussed in Lecture #18. The
second key idea is that a simple way to round a vector to a binary value is to look at
which side of some hyperplane it lies on (cf., the machine learning examples in Lectures #7
and #12). See Figure 2. Combining these two ideas, we arrive at randomized hyperplane
rounding.
Figure 2: Randomized hyperplane rounding: points with positive dot product in set A,
points with negative dot product in set B.
4 Strictly speaking, the first two solutions also only compute an approximately optimal solution. This
is necessary, because the optimal solution to an SDP (with all integer coefficients) might be irrational.
(This can't happen with a linear program.) For a given approximation ε, the running time of the ellipsoid
method and interior-point methods depend on log(1/ε), while that of multiplicative weights depends inverse
polynomially on 1/ε.
5 After solving the SDP relaxation to get the matrix P of the p_ij's, another standard matrix algorithm
("Cholesky decomposition") can be used to efficiently recover the matrix X in the equation P = X^T X and
hence the vectors (which are the columns of X).
Randomized Hyperplane Rounding

given: one vector x_i for each i ∈ V
choose a random unit vector r ∈ R^n
set A = {i ∈ V : ⟨x_i, r⟩ ≥ 0}
set B = {i ∈ V : ⟨x_i, r⟩ < 0}
return the cut (A, B)
Thus, vertices are partitioned according to which side of the hyperplane with normal vector r they lie on. You may be wondering how to choose a random unit vector in R^n in an algorithm. One simple way is: sample n independent standard Gaussian random variables (with mean 0 and variance 1) g_1, . . . , g_n, and normalize to get a unit vector:

r = (g_1, . . . , g_n) / ||(g_1, . . . , g_n)||.

(Or, note that the computed cut doesn't change if we don't bother to normalize.) The main property we need of the distribution of r is spherical symmetry — that all vectors at a given distance from the origin are equally likely.
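The recipe above is easy to implement. Here is a short Python sketch of randomized hyperplane rounding; the input format (a dict `vecs` mapping each vertex to its SDP vector) is an illustrative assumption, since in the full algorithm these vectors would come from solving the SDP relaxation.

```python
import random

def random_unit_vector(n):
    # Spherically symmetric: n independent standard Gaussians, then normalize.
    g = [random.gauss(0.0, 1.0) for _ in range(n)]
    norm = sum(x * x for x in g) ** 0.5
    return [x / norm for x in g]

def hyperplane_rounding(vecs):
    """vecs: dict mapping each vertex i to its SDP vector x_i in R^n.
    Returns the cut (A, B) induced by a random hyperplane through the origin."""
    n = len(next(iter(vecs.values())))
    r = random_unit_vector(n)
    A = {i for i, x in vecs.items() if sum(a * b for a, b in zip(x, r)) >= 0}
    B = set(vecs) - A
    return A, B
```

Note that, as remarked above, the normalization in `random_unit_vector` could be dropped without changing the computed cut.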
We have the following remarkable theorem.
Theorem 3.1 The expected weight of the cut produced by randomized hyperplane rounding
is at least .878 times the maximum possible.
The theorem follows easily from the following lemma.
Lemma 3.2 For every edge (i, j) ∈ E of the input graph,

Pr[(i, j) is cut] ≥ .878 · (1/2)(1 − ⟨x_i, x_j⟩),

where the term (1/2)(1 − ⟨x_i, x_j⟩) is precisely the edge's contribution to the SDP objective.
Proof of Theorem 3.1: We can derive

E[weight of (A, B)] = Σ_{(i,j)∈E} w_ij · Pr[(i, j) is cut]
                    ≥ .878 · Σ_{(i,j)∈E} w_ij · (1/2)(1 − ⟨x_i, x_j⟩)
                    ≥ .878 · OPT,

where the equation follows from linearity of expectation (using one indicator random variable per edge), the first inequality from Lemma 3.2, and the second inequality from the fact that the x_i's are an optimal solution to the vector programming relaxation of the maximum cut problem. ∎
We conclude by proving the key lemma.
Figure 3: x_i and x_j are placed on different sides of the cut with probability θ/π.

Proof of Lemma 3.2: Fix an edge (i, j) ∈ E. Consider the two-dimensional subspace (through the origin) spanned by the vectors x_i and x_j. Since r was chosen from a spherically symmetric distribution, its projection onto this subspace is also spherically symmetric — it's equally likely to point in any direction. The vertices x_i and x_j are placed on different sides of the cut if and only if they are "split" by the projection of r. (Figure 3.) If we let θ denote the angle between x_i and x_j in this subspace, then 2θ out of the 2π radians of possible directions result in the edge (i, j) getting cut. So we know the cutting probability, as a function of θ:

Pr[(i, j) is cut] = θ/π.
We still need to understand (1/2)(1 − ⟨x_i, x_j⟩) as a function of θ. But remember from precalculus that ⟨x_i, x_j⟩ = ||x_i|| ||x_j|| cos θ. And since x_i and x_j are both unit vectors (in the original space and also the subspace that they span), we have

(1/2)(1 − ⟨x_i, x_j⟩) = (1/2)(1 − cos θ).

The lemma thus boils down to verifying that

θ/π ≥ .878 · (1/2)(1 − cos θ)

for all possible values of θ ∈ [0, π]. This inequality is easily seen by plotting both sides, or if you're a stickler for rigor, by computations familiar from first-year calculus. ∎
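In lieu of a plot, here is a quick numerical check of the inequality over a fine grid of θ values (evidence, not a proof; the grid resolution below is an arbitrary choice):

```python
import math

def min_ratio(steps=100000):
    # Minimize (θ/π) / ((1/2)(1 − cos θ)) over a grid of θ in (0, π].
    # As θ → 0 both sides vanish and the ratio tends to infinity,
    # so skipping θ = 0 loses nothing.
    best = float("inf")
    for k in range(1, steps + 1):
        theta = math.pi * k / steps
        ratio = (theta / math.pi) / (0.5 * (1 - math.cos(theta)))
        best = min(best, ratio)
    return best
```

The minimum of the ratio is ≈ 0.87856, attained near θ ≈ 2.33, which is where the Goemans-Williamson constant comes from; in particular the ratio never dips below .878.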
4 Going Beyond .878

For several lectures we were haunted by the number 1 − 1/e, which seemed like a pretty weird number. Even more bizarrely, it is provably the best-possible approximation guarantee for several natural problems, including online bipartite matching (Lecture #14) and, assuming P ≠ NP, set coverage (Lecture #15).

Now the .878 in this lecture seems like a really weird number. But there is some evidence that it might be optimal! Specifically, in 2005 it was proved that, assuming that the "Unique Games Conjecture (UGC)" is true (and P ≠ NP), there is no polynomial-time algorithm for the maximum cut problem with approximation factor larger than the one proved by Goemans and Williamson. The UGC (which is only from 2002) is somewhat technical to state precisely — it asserts that a certain constraint satisfaction problem is NP-hard. Unlike the P ≠ NP conjecture, which is widely believed, it is highly unclear whether the UGC is true or false. But it's amazing that any plausible complexity hypothesis implies the optimality of randomized hyperplane rounding for the maximum cut problem.
CS261: A Second Course in Algorithms
The Top 10 List∗
Tim Roughgarden†
March 10, 2016
If you’ve kept up with this class, then you’ve learned a tremendous amount of material.
You know now more about algorithms than most people who don’t have a PhD in the field,
and are well prepared to tackle more advanced courses in theoretical computer science. To
recall how far you’ve traveled, let’s wrap up with a course top 10 list.
1. The max-flow/min-cut theorem, and the corresponding polynomial-time algorithms for computing them (augmenting paths, push-relabel, etc.). This is the theorem that seduced your instructor into a career in algorithms. Who knew that objects as seemingly complex and practically useful as flows and cuts could be so beautifully characterized? This theorem also introduced the running question of "how do we know when we're done?" We proved that a maximum flow algorithm is done (i.e., can correctly terminate with the current flow) when the residual graph contains no s-t path or, equivalently, when the current flow saturates some s-t cut.
2. Bipartite matching, including the Hungarian algorithm for the minimum-cost perfect bipartite matching problem. In this algorithm, we convinced ourselves we were done by exhibiting a suitable dual solution (which at the time we called "vertex prices") certifying optimality.

3. Linear programming is in P. We didn't have time to go into the details of any linear programming algorithms, but just knowing this fact as a "black box" is already extremely powerful. On the theoretical side, there are polynomial-time algorithms for solving linear programs — even those whose constraints are specified implicitly through a polynomial-time separation oracle — and countless theorems rely on this fact. In practice, commercial linear program solvers routinely solve problem instances with millions of variables and constraints and are a crucial tool in many real-world applications.
4. Linear programming duality. For linear programming problems, there's a generic way to know when you're done. Whatever the optimal solution of the linear program is, strong LP duality guarantees that there's a dual solution that proves its optimality. While powerful and perhaps surprising, the proof of strong duality boils down to the highly intuitive statement that, given a closed convex set and a point not in the set, there's a hyperplane with the set on one side and the point on the other.

5. Online algorithms. It's easy to think of real-world situations where decisions need to be made before all of the relevant information is available. In online algorithms, the input arrives "online" in pieces, and an irrevocable decision must be made at each time step. For some problems, there are online algorithms with good (close to 1) competitive ratios — algorithms that compute a solution with objective function value close to that of the optimal solution. Such algorithms perform almost as well as if the entire input was known in advance. For example, in online bipartite matching, we achieved a competitive ratio of 1 − 1/e ≈ 63% (which is the best possible).
6. The multiplicative weights algorithm. This simple online algorithm, in the spirit of "reinforcement learning," achieves per-time-step regret approaching 0 as the time horizon T approaches infinity. That is, the algorithm does almost as well as the best fixed action in hindsight. This result is interesting in its own right as a strategy for making decisions over time. It also has some surprising applications, such as a proof of the minimax theorem for zero-sum games (if both players randomize optimally, then it doesn't matter who goes first) and fast approximation algorithms for several problems (maximum flow, multicommodity flow, etc.).
7. The Traveling Salesman Problem (TSP). The TSP is a famous NP-hard problem with a long history, and several of the most notorious open problems in approximation algorithms concern different variants of the TSP. For the metric TSP, you now know the state-of-the-art — Christofides's 3/2-approximation algorithm, which is nearly 40 years old. Most researchers believe that better approximation algorithms exist. (You also know close to the state-of-the-art for asymmetric TSP, where again it seems that better approximation algorithms should exist.)
8. Linear programming and approximation algorithms. Linear programs are useful not only for solving problems exactly in polynomial time, but also in the design and analysis of polynomial-time approximation algorithms for NP-hard optimization problems. In some cases, linear programming is used only in the analysis of an algorithm, and not explicitly in the algorithm itself. A good example is our analysis of the greedy set cover algorithm, where we used a feasible dual solution as a lower bound on the cost of an optimal set cover. In other applications, such as vertex cover and low-congestion routing, the approximation algorithm first explicitly solves an LP relaxation of the problem, and then "rounds" the resulting fractional solution into a near-optimal integral solution. Finally, some algorithms, like our primal-dual algorithm for vertex cover, use linear programs to guide their decisions, without ever explicitly or exactly solving the linear programs.
9. Five essential tools for the analysis of randomized algorithms. And in particular, the Chernoff bounds, which prove sharp concentration around the expected value for random variables that are sums of bounded independent random variables. Chernoff bounds are used all the time. We saw an application in randomized rounding, leading to an O(log n/ log log n)-approximation algorithm for low-congestion routing.

We also reviewed four easy-to-prove tools that you've probably seen before: linearity of expectation (which is trivial but super-useful), Markov's inequality (which is good for constant-probability bounds), Chebyshev's inequality (good for random variables with small variance), and the union bound (which is good for avoiding lots of low-probability events simultaneously).
10. Beating brute-force search. NP-hardness is not a death sentence — it just means that you need to make some compromises. In approximation algorithms, one insists on a polynomial running time and compromises on correctness (i.e., on exact optimality). But one can also insist on correctness, resigning oneself to an exponential running time (but still as fast as possible). We saw three examples of NP-hard problems that admit exact algorithms that are significantly faster than brute-force search: the unweighted vertex cover problem (an example of a "fixed-parameter tractable" algorithm, with running time of the form poly(n) + f(k) rather than O(n^k)); TSP (where dynamic programming reduces the running time from roughly O(n!) to roughly O(2^n)); and 3-SAT (where random search reduces the running time from roughly O(2^n) to roughly O((4/3)^n)).
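As a concrete reminder of the second example, here is a sketch of the Bellman-Held-Karp dynamic program for TSP, which runs in O(2^n · n^2) time instead of the O(n!) of brute-force enumeration (the input format and the 4-city test instance are made-up illustrations):

```python
def held_karp(dist):
    """dist: n x n distance matrix. Returns the cost of an optimal
    tour starting and ending at city 0 (Bellman-Held-Karp DP)."""
    n = len(dist)
    INF = float("inf")
    # dp[(mask, j)] = cheapest path that starts at city 0, visits exactly
    # the cities in bitmask `mask` (which includes 0 and j), and ends at j.
    dp = {(1, 0): 0}
    for mask in range(1, 1 << n):
        if not mask & 1:
            continue  # every partial path must contain city 0
        for j in range(n):
            cur = dp.get((mask, j), INF)
            if cur == INF:
                continue
            for k in range(n):
                if mask >> k & 1:
                    continue  # city k already visited
                nm = mask | (1 << k)
                if cur + dist[j][k] < dp.get((nm, k), INF):
                    dp[(nm, k)] = cur + dist[j][k]
    full = (1 << n) - 1
    return min(dp[(full, j)] + dist[j][0] for j in range(1, n))
```

Processing bitmasks in increasing order guarantees that each subproblem is final before it is extended.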
CS261: Exercise Set #1
For the week of January 4–8, 2016
Instructions:
(1) Do not turn anything in.
(2) The course staff is happy to discuss the solutions of these exercises with you in office hours or on
Piazza.
(3) While these exercises are certainly not trivial, you should be able to complete them on your own
(perhaps after consulting with the course staff or a friend for hints).
Exercise 1
Suppose we generalize the maximum flow problem so that there are multiple source vertices s_1, . . . , s_k ∈ V and sink vertices t_1, . . . , t_ℓ ∈ V . (As usual, the rest of the input is a directed graph with integer edge capacities.) You should assume that no vertex is both a source and sink, that source vertices have no incoming edges, and that sink vertices have no outgoing edges. A flow is defined as before: a nonnegative number f_e for each e ∈ E such that capacity constraints are obeyed on every edge and such that conservation constraints hold at all vertices that are neither a source nor a sink. The value of a flow is the total amount of outgoing flow at the sources: Σ_{i=1}^{k} Σ_{e∈δ+(s_i)} f_e.

Prove that the maximum flow problem in graphs with multiple sources and sinks reduces to the single-source single-sink version of the problem. That is, given an instance of the multi-source multi-sink version of the problem, show how to (i) produce a single-source single-sink instance such that (ii) given a maximum flow to this single-source single-sink instance, you can recover a maximum flow of the original multi-source multi-sink instance. Your implementations of steps (i) and (ii) should run in linear time. Include a brief proof of correctness.

[Hint: consider adding additional vertices and/or edges.]
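Following the hint, here is a sketch of one natural construction for step (i) (a super-source and super-sink, with the capacity C an arbitrary safe upper bound); verifying that it works, and implementing step (ii), is exactly what the exercise asks you to do.

```python
def to_single_source_sink(edges, sources, sinks):
    """edges: list of (u, v, capacity) for a multi-source multi-sink instance.
    Returns a new edge list with a single source 's*' and single sink 't*'."""
    C = sum(cap for _, _, cap in edges)  # large enough to never be binding
    new_edges = list(edges)
    for s in sources:
        new_edges.append(("s*", s, C))   # super-source feeds every source
    for t in sinks:
        new_edges.append((t, "t*", C))   # every sink drains to the super-sink
    return new_edges
```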
Exercise 2
In lecture we've focused on the maximum flow problem in directed graphs. In the undirected version of the problem, the input is an undirected graph G = (V, E), a source vertex s ∈ V , a sink vertex t ∈ V , and an integer capacity u_e ≥ 0 for each edge e ∈ E.

Flows are defined exactly as before, and remain directed. Formally, a flow consists of two nonnegative numbers f_uv and f_vu for each (undirected) edge (u, v) ∈ E, indicating the amount of traffic traversing the edge in each direction. Conservation constraints (flow in = flow out) are defined as before. Capacity constraints now state that, for every edge e = (u, v) ∈ E, the total amount of flow f_uv + f_vu on the edge is at most the edge's capacity u_e. The value of a flow is the net amount Σ_{(s,v)∈E} f_sv − Σ_{(v,s)∈E} f_vs going out of the source.

Prove that the maximum flow problem in undirected graphs reduces to the maximum flow problem in directed graphs. That is, given an instance of the undirected problem, show how to (i) produce an instance of the directed problem such that (ii) given a maximum flow to this directed instance, you can recover a maximum flow of the original undirected instance. Your implementations of steps (i) and (ii) should run in linear time. Include a brief proof of correctness.

[Hint: consider bidirecting each edge.]
Exercise 3
For every positive integer U, show that there is an instance of the maximum flow problem with edge capacities
in {1, 2, . . . , U} and a choice of augmenting paths so that the Ford-Fulkerson algorithm runs for at least U
iterations before terminating. The number of vertices and edges in your networks should be bounded above
by a constant, independent of U. (This shows that the algorithm is only "pseudopolynomial.")

[Hint: use a network similar to the examples discussed in lecture.]
Exercise 4
Consider the special case of the maximum flow problem in which every edge has capacity 1. (This is called
the unit-capacity case.) Explain why a suitable implementation of the Ford-Fulkerson algorithm runs in
O(mn) time in this special case. (As always, m denotes the number of edges and n the number of vertices.)
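For reference while thinking about this exercise, here is a minimal Ford-Fulkerson implementation (DFS-based augmenting paths over a residual graph); the dict-of-dicts graph encoding is an arbitrary illustrative choice, and the code itself does not answer the "why O(mn)" question.

```python
def max_flow(cap, s, t):
    """cap: dict-of-dicts, cap[u][v] = capacity of directed edge (u, v).
    Returns the value of a maximum s-t flow (Ford-Fulkerson)."""
    # Residual capacities, with reverse edges initialized to 0.
    res = {u: dict(nbrs) for u, nbrs in cap.items()}
    for u in cap:
        for v in cap[u]:
            res.setdefault(v, {}).setdefault(u, 0)

    def augment():
        # Depth-first search for an s-t path with positive residual capacity.
        stack, parent = [s], {s: None}
        while stack:
            u = stack.pop()
            if u == t:
                break
            for v, c in res[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    stack.append(v)
        if t not in parent:
            return 0
        # Recover the path, find the bottleneck, and push flow along it.
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        b = min(res[u][v] for u, v in path)
        for u, v in path:
            res[u][v] -= b
            res[v][u] += b
        return b

    total = 0
    while True:
        pushed = augment()
        if pushed == 0:
            return total
        total += pushed
```

In the unit-capacity case every augmentation pushes exactly one unit of flow, which is the starting point for the running-time analysis the exercise asks for.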
Exercise 5
Consider a directed graph G = (V, E) with source s and sink t for which each edge e has a positive integral capacity u_e. For a flow f in G, define the "layered graph" L_f as in Lecture #2, by computing the residual graph G_f and running breadth-first search (BFS) in G_f starting from s, aborting once the sink t is reached, and retaining only the forward edges. (Recall that a forward edge in BFS goes from layer i to layer (i + 1), for some i.)

Recall from Lecture #2 that a blocking flow in a network is a flow that saturates at least one edge on each s-t path. Prove that for every flow f and every blocking flow g in L_f , the shortest-path distance between s and t in the new residual graph G_{f+g} is strictly larger than that in G_f .
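The layered-graph construction itself (not the claim to be proved) can be coded directly from the definition above; here the residual graph is a hypothetical adjacency-dict of edges with positive residual capacity.

```python
from collections import deque

def layered_graph(res, s, t):
    """res: dict u -> set of v with positive residual capacity on (u, v).
    Returns the edge set of the layered graph L_f: BFS forward edges,
    up to the layer containing t."""
    layer = {s: 0}
    q = deque([s])
    while q:
        u = q.popleft()
        if u == t:
            break  # abort once the sink is reached
        for v in res.get(u, ()):
            if v not in layer:
                layer[v] = layer[u] + 1
                q.append(v)
    # Retain only forward edges: from layer i to layer i + 1.
    return {(u, v) for u in layer for v in res.get(u, ())
            if v in layer and layer[v] == layer[u] + 1}
```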
CS261: Exercise Set #2
For the week of January 11–15, 2016
Instructions:
(1) Do not turn anything in.
(2) The course staff is happy to discuss the solutions of these exercises with you in office hours or on
Piazza.
(3) While these exercises are certainly not trivial, you should be able to complete them on your own
(perhaps after consulting with the course staff or a friend for hints).
Exercise 6
In the s-t directed edge-disjoint paths problem, the input is a directed graph G = (V, E), a source vertex s, and a sink vertex t. The goal is to output a maximum-cardinality set of edge-disjoint s-t paths P_1, . . . , P_k. (I.e., P_i and P_j should share no edges for each i ≠ j, and k should be as large as possible.)

Prove that this problem reduces to the maximum flow problem. That is, given an instance of the disjoint paths problem, show how to (i) produce an instance of the maximum flow problem such that (ii) given a maximum flow to this instance, you can compute an optimal solution to the disjoint paths instance. Your implementations of steps (i) and (ii) should run in linear and polynomial time, respectively. (Can you achieve linear time also for (ii)?) Include a brief proof of correctness.

[Hint: for (ii), make use of your solution to Problem 1 (from Problem Set #1).]
Exercise 7
In the s-t directed vertex-disjoint paths problem, the input is a directed graph G = (V, E), a source vertex s, and a sink vertex t. The goal is to output a maximum-cardinality set of internally vertex-disjoint s-t paths P_1, . . . , P_k. (I.e., P_i and P_j should share no vertices other than s and t for each i ≠ j, and k should be as large as possible.) Give a polynomial-time algorithm for this problem.

[Hint: reduce the problem either directly to the maximum flow problem or to the edge-disjoint version solved in the previous exercise.]
Exercise 8
In the (undirected) global minimum cut problem, the input is an undirected graph G = (V, E) with a nonnegative capacity u_e for each edge e ∈ E, and the goal is to identify a cut (A, B) — i.e., a partition of V into non-empty sets A and B — that minimizes the total capacity Σ_{e∈δ(A)} u_e of the cut edges. (Here, δ(A) denotes the edges with exactly one endpoint in A.)

Prove that this problem reduces to solving n − 1 maximum flow problems in undirected graphs.1 That is, given an instance of the global minimum cut problem, show how to (i) produce n − 1 instances of the maximum flow problem (in undirected graphs) such that (ii) given maximum flows to these n − 1 instances, you can compute an optimal solution to the global minimum cut instance. Your implementations of steps (i) and (ii) should run in polynomial time. Include a brief proof of correctness.

1 And hence to solving n − 1 maximum flow problems in directed graphs.
Exercise 9
Extend the proof of Hall's Theorem (end of Lecture #4) to show that, for every bipartite graph G = (V ∪ W, E) with |V | ≤ |W|,

maximum cardinality of a matching in G = min_{S⊆V} [ |V | − (|S| − |N(S)|) ].
Exercise 10
In lecture we proved a bound of O(n^3) on the number of operations needed by the Push-Relabel algorithm (where in each iteration, we select the highest vertex with excess to Push or Relabel) before it terminates with a maximum flow. Give an implementation of this algorithm that runs in O(n^3) time.

[Hints: first prove the running time bound assuming that, in each iteration, you can identify the highest vertex with positive excess in O(1) time. The hard part is to maintain the vertices with positive excess in a data structure such that, summed over all of the iterations of the algorithm, only O(n^3) total time is used to identify these vertices. Can you get away with just a collection of buckets (implemented as lists), sorted by height?]
CS261: Exercise Set #3
For the week of January 18–22, 2016
Instructions:
(1) Do not turn anything in.
(2) The course staff is happy to discuss the solutions of these exercises with you in office hours or on
Piazza.
(3) While these exercises are certainly not trivial, you should be able to complete them on your own
(perhaps after consulting with the course staff or a friend for hints).
Exercise 11
Recall that in the maximum-weight bipartite matching problem, the input is a bipartite graph G = (V ∪ W, E) with a nonnegative weight w_e per edge, and the goal is to compute a matching M that maximizes Σ_{e∈M} w_e. In the minimum-cost perfect bipartite matching problem, the input is a bipartite graph G = (V ∪ W, E) such that |V | = |W| and G contains a perfect matching, and a nonnegative cost c_e per edge, and the goal is to compute a perfect matching M that minimizes Σ_{e∈M} c_e.

Give a linear-time reduction from the former problem to the latter problem.
Exercise 12
Suppose you are given an undirected bipartite graph G = (V ∪ W, E) and a positive integer b_v for every vertex v ∈ V ∪ W. A b-matching is a subset M ⊆ E of edges such that each vertex v is incident to at most b_v edges of M. (The standard bipartite matching problem corresponds to the case where b_v = 1 for every v ∈ V ∪ W.)

Prove that the problem of computing a maximum-cardinality bipartite b-matching reduces to the problem of computing a (standard) maximum-cardinality bipartite matching in a bigger graph. Your reduction should run in time polynomial in the size of G and in max_{v∈V ∪W} b_v.
Exercise 13
A graph is d-regular if every vertex has d incident edges. Prove that every d-regular bipartite graph is the
union of d perfect matchings. Does the same statement hold for d-regular non-bipartite graphs?
[Hint: Hall's theorem.]
Exercise 14
Prove that the minimum-cost perfect bipartite matching problem reduces, in linear time, to the minimum-
cost flow problem defined in Lecture #6.
Exercise 15
In the edge cover problem, the input is a graph G = (V, E) (not necessarily bipartite) with no isolated vertices, and the goal is to compute a minimum-cardinality subset F ⊆ E of edges such that every vertex v ∈ V is the endpoint of at least one edge in F. Prove that this problem reduces to the maximum-cardinality (non-bipartite) matching problem.
CS261: Exercise Set #4
For the week of January 25–29, 2016
Instructions:
(1) Do not turn anything in.
(2) The course staff is happy to discuss the solutions of these exercises with you in office hours or on
Piazza.
(3) While these exercises are certainly not trivial, you should be able to complete them on your own
(perhaps after consulting with the course staff or a friend for hints).
Exercise 16
In Lecture #7 we noted that the maximum flow problem translates quite directly into a linear program:

max Σ_{e∈δ+(s)} f_e

subject to

Σ_{e∈δ−(v)} f_e − Σ_{e∈δ+(v)} f_e = 0   for all v ≠ s, t
f_e ≤ u_e   for all e ∈ E
f_e ≥ 0   for all e ∈ E.

(As usual, we are assuming that s has no incoming edges.) In Lecture #8 we considered the following alternative linear program, where P denotes the set of s-t paths of G:

max Σ_{P∈P} f_P

subject to

Σ_{P∈P : e∈P} f_P ≤ u_e   for all e ∈ E
f_P ≥ 0   for all P ∈ P.

Prove that these two linear programs always have equal optimal objective function value.
Exercise 17
In the multicommodity flow problem, the input is a directed graph G = (V, E) with k source vertices s_1, . . . , s_k, k sink vertices t_1, . . . , t_k, and a nonnegative capacity u_e for each edge e ∈ E. An s_i-t_i pair is called a commodity. A multicommodity flow is a set of k flows f(1), . . . , f(k) such that (i) for each i = 1, 2, . . . , k, f(i) is an s_i-t_i flow (in the usual max flow sense); and (ii) for every edge e, the total amount of flow (summing over all commodities) sent on e is at most the edge capacity u_e. The value of a multicommodity flow is the sum of the values (in the usual max flow sense) of the flows f(1), . . . , f(k).

Prove that the problem of finding a multicommodity flow of maximum-possible value reduces in polynomial time to solving a linear program.
Exercise 18
Consider a primal linear program (P) of the form

max c^T x
subject to
Ax = b
x ≥ 0.

The recipe from Lecture #8 gives the following dual linear program (D):

min b^T y
subject to
A^T y ≥ c
y ∈ R^m.

Prove weak duality for primal-dual pairs of this form: the (primal) objective function value of every feasible solution to (P) is bounded above by the (dual) objective function value of every feasible solution to (D).1
Exercise 19
Consider a primal linear program (P) of the form

max c^T x
subject to
Ax ≤ b
x ≥ 0

and corresponding dual program (D)

min b^T y
subject to
A^T y ≥ c
y ≥ 0.

Suppose x̂ and ŷ are feasible for (P) and (D), respectively. Prove that if x̂, ŷ do not satisfy the complementary slackness conditions, then c^T x̂ ≠ b^T ŷ.
Exercise 20
Recall the linear programming relaxation of the minimum-cost bipartite matching problem:

min Σ_{e∈E} c_e x_e

subject to

Σ_{e∈δ(v)} x_e = 1   for all v ∈ V ∪ W
x_e ≥ 0   for all e ∈ E.

1 In Lecture #8, we only proved weak duality for primal linear programs with only inequality constraints (and hence dual programs with nonnegative variables), like those in Exercise 19.
In Lecture #8 we appealed to the Hungarian algorithm to prove that this linear program is guaranteed to have an optimal solution that is 0-1. The point of this exercise is to give a direct proof of this fact, without recourse to the Hungarian algorithm.

(a) By a fractional solution, we mean a feasible solution to the above linear program such that 0 < x_e < 1 for some edge e ∈ E. Prove that, for every fractional solution, there is an even cycle C of edges with 0 < x_e < 1 for every e ∈ C.

(b) Prove that, for all ε sufficiently close to 0 (positive or negative), adding ε to x_e for every other edge of C and subtracting ε from x_e for the other edges of C yields another feasible solution to the linear program.

(c) Show how to transform a fractional solution x into another fractional solution x′ such that: (i) x′ has fewer fractional coordinates than x; and (ii) the objective function value of x′ is no larger than that of x.

(d) Conclude that the linear programming relaxation above is guaranteed to possess an optimal solution that is 0-1 (i.e., not fractional).
CS261: Exercise Set #5
For the week of February 1–5, 2016
Instructions:
(1) Do not turn anything in.
(2) The course staff is happy to discuss the solutions of these exercises with you in office hours or on
Piazza.
(3) While these exercises are certainly not trivial, you should be able to complete them on your own
(perhaps after consulting with the course staff or a friend for hints).
Exercise 21
Consider the following linear programming relaxation of the maximum-cardinality matching problem:

max Σ_{e∈E} x_e

subject to

Σ_{e∈δ(v)} x_e ≤ 1   for all v ∈ V
x_e ≥ 0   for all e ∈ E,

where δ(v) denotes the set of edges incident to vertex v.

We know from Lecture #9 that for bipartite graphs, this linear program always has an optimal 0-1 solution. Is this also true for non-bipartite graphs?
Exercise 22
Let x_1, . . . , x_n ∈ R^m be a set of n m-vectors. Define C as the cone of x_1, . . . , x_n, meaning all linear combinations of the x_i's that use only nonnegative coefficients:

C = { Σ_{i=1}^{n} λ_i x_i : λ_1, . . . , λ_n ≥ 0 }.

Suppose α ∈ R^m, β ∈ R define a "valid inequality" for C, meaning that

α^T x ≥ β

for every x ∈ C. Prove that

α^T x ≥ 0

for every x ∈ C, so α and 0 also define a valid inequality.

[Hint: Show that β > 0 is impossible. Then use the fact that if x ∈ C then λx ∈ C for all scalars λ ≥ 0.]
Exercise 23
Verify that the two linear programs discussed in the proof of the minimax theorem (Lecture #10),

max v
subject to
v − Σ_{i=1}^{m} a_ij x_i ≤ 0   for all j = 1, . . . , n
Σ_{i=1}^{m} x_i = 1
x_i ≥ 0   for all i = 1, . . . , m
v ∈ R,

and

min w
subject to
w − Σ_{j=1}^{n} a_ij y_j ≥ 0   for all i = 1, . . . , m
Σ_{j=1}^{n} y_j = 1
y_j ≥ 0   for all j = 1, . . . , n
w ∈ R,

are both feasible and are dual linear programs. (As in lecture, A is an m × n matrix, with a_ij specifying the payoff of the row player and the negative of the payoff of the column player when the former chooses row i and the latter chooses column j.)
Exercise 24
Consider a linear program with n decision variables, and a feasible solution x ∈ R^n at which fewer than n of the constraints hold with equality (i.e., the rest of the constraints hold as strict inequalities).

(a) Prove that there is a direction y ∈ R^n such that, for all sufficiently small ε > 0, x + εy and x − εy are both feasible.

(b) Prove that at least one of x + εy, x − εy has objective function value at least as good as x.

[Context: these are the two observations that drive the fact that a linear program with a bounded feasible region always has an optimal solution at a vertex. Do you see why?]
Exercise 25
Recall from Problem #12(e) (in Problem Set #2) the following linear programming formulation of the s-t shortest path problem:

min Σ_{e∈E} c_e x_e

subject to

Σ_{e∈δ+(S)} x_e ≥ 1   for all S ⊆ V with s ∈ S, t ∉ S
x_e ≥ 0   for all e ∈ E.

Prove that this linear program, while having exponentially many constraints, admits a polynomial-time separation oracle (in the sense of the ellipsoid method, see Lecture #10).
CS261: Exercise Set #6
For the week of February 8–12, 2016
Instructions:
(1) Do not turn anything in.
(2) The course staff is happy to discuss the solutions of these exercises with you in office hours or on
Piazza.
(3) While these exercises are certainly not trivial, you should be able to complete them on your own
(perhaps after consulting with the course staff or a friend for hints).
Exercise 26
In the online decision-making problem (Lecture #11), suppose that you know in advance an upper bound Q on the sum of squared rewards (Σ_{t=1}^{T} r_t(a)^2) for every action a ∈ A. Explain how to modify the multiplicative weights algorithm and analysis to obtain a regret bound of O(√(Q log n) + log n).
Exercise 27
Consider the thought experiment sketched at the end of Lecture #11: for a zero-sum game specified by the n × n matrix A:

• At each time step t = 1, 2, . . . , T = (4 ln n)/ε²:

  – The row and column players each choose a mixed strategy (p^t and q^t, respectively) using their own copies of the multiplicative weights algorithm (with the action set equal to the rows or columns, as appropriate).

  – The row player feeds the reward vector r^t = Aq^t into (its copy of) the multiplicative weights algorithm. (This is just the expected payoff of each row, given that the column player chose the mixed strategy q^t.)

  – The column player feeds the reward vector r^t = −(p^t)^T A into the multiplicative weights algorithm.

Let

v = (1/T) Σ_{t=1}^{T} (p^t)^T A q^t

denote the time-averaged payoff of the row player. Use the multiplicative weights guarantee for the row and column players to prove that

v ≥ max_p ( p^T A q̂ ) − ε

and

v ≤ min_q ( p̂^T A q ) + ε,

respectively, where p̂ = (1/T) Σ_{t=1}^{T} p^t and q̂ = (1/T) Σ_{t=1}^{T} q^t denote the time-averaged row and column strategies.

[Hint: first consider the maximum and minimum over all deterministic row and column strategies, respectively, rather than over all mixed strategies p and q.]
Exercise 28
Use the previous exercise to prove the minimax theorem:

max_p min_q ( p^T A q ) = min_q max_p ( p^T A q )

for every zero-sum game A.
Exercise 29
There are also other notions of regret. One useful one is swap regret, which for an action sequence a^1, . . . , a^T and a reward vector sequence r^1, . . . , r^T is defined as

max_{δ:A→A} Σ_{t=1}^{T} r^t(δ(a^t)) − Σ_{t=1}^{T} r^t(a^t),

where the maximum ranges over all functions from A to itself. Thus the swap regret measures how much better you could do in hindsight by, for each action a, switching your action from a to some other action (on the days where you previously chose a). Prove that, even with just 3 actions, the swap regret of an action sequence can be arbitrarily larger (as T → ∞) than the standard regret (as defined in Lecture #11).1
Exercise 30
At the end of Lecture #12 we showed how to use the multiplicative weights algorithm (as a black box) to obtain a (1 − ε)-approximate maximum flow in O(OPT² log n / ε²) iterations in networks where all edges have capacity 1. (We are ignoring the outer loop that does binary search on the value of OPT.) Extend this idea to obtain the same result for maximum flow instances in which every edge capacity is at least 1.

[Hint: if {ℓ*_e}_{e∈E} is an optimal dual solution, with value OPT = Σ_{e∈E} c_e ℓ*_e, then obtain a distribution by scaling each c_e ℓ*_e down by OPT. What are the relevant edge lengths after this scaling?]
1 Despite this, there are algorithms (a bit more complicated than multiplicative weights, but still reasonably simple) that guarantee swap regret sublinear in T.
CS261: Exercise Set #7
For the week of February 15–19, 2016
Instructions:
(1) Do not turn anything in.
(2) The course staff is happy to discuss the solutions of these exercises with you in office hours or on
Piazza.
(3) While these exercises are certainly not trivial, you should be able to complete them on your own
(perhaps after consulting with the course staff or a friend for hints).
Exercise 31
Recall Graham’s algorithm from Lecture #13: given a parameter m (the number of machines) and n jobs
arriving online with processing times p_1, . . . , p_n, always assign the current job to the machine that currently
has the smallest load. We proved that the schedule produced by this algorithm always has makespan (i.e.,
maximum machine load) at most twice the minimum possible in hindsight.
Show that for every constant c < 2, there exists an instance for which the schedule produced by Graham’s
algorithm has makespan more than c times the minimum possible.
[Hint: Your bad instances will need to grow larger as c approaches 2.]
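Graham’s rule itself is easy to implement with a heap of machine loads; a minimal illustrative sketch (the function name is assumed, not from lecture):

```python
import heapq

def graham_makespan(m, processing_times):
    # assign each arriving job to the currently least-loaded machine
    loads = [0.0] * m
    heap = [(0.0, i) for i in range(m)]   # (load, machine) pairs
    heapq.heapify(heap)
    for p in processing_times:
        load, i = heapq.heappop(heap)     # least-loaded machine
        loads[i] = load + p
        heapq.heappush(heap, (loads[i], i))
    return max(loads)                     # the makespan
```

The classic near-tight instance is m(m − 1) unit jobs followed by one job of size m: Graham’s schedule has makespan 2m − 1 while the optimum is m.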
Exercise 32
In Lecture #13 we considered the online Steiner tree problem, where the input is a connected undirected
graph G = (V, E) with nonnegative edge costs c_e, and a sequence t_1, . . . , t_k ∈ V of “terminals” arrive
online. The goal is to output a subgraph that spans all the terminals and has total cost as small as possible.
In lecture we only considered the metric special case, where the graph G is complete and the edge costs
satisfy the triangle inequality. (I.e., for every triple u, v, w ∈ V, c_{uw} ≤ c_{uv} + c_{vw}.) Show how to convert
an α-competitive online algorithm for the metric Steiner tree problem into one for the general Steiner tree
problem.1
[Hint: Define a metric instance where the edges represent paths in the original (non-metric) instance.]
Exercise 33
Give an infinite family of instances (with the number k of terminals tending to infinity) demonstrating that
the greedy algorithm for the online Steiner tree problem is Ω(log k)-competitive (in the worst case).
Exercise 34
Let G = (V, E) be an undirected graph that is connected and Eulerian (i.e., all vertices have even degree).
Show that G admits an Euler tour — a (not necessarily simple) cycle that uses every edge exactly once. Can
you turn your proof into an O(m)-time algorithm, where m = |E|?
[Hint: Induction on |E|.]
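One way the inductive idea becomes an O(m)-time algorithm is Hierholzer’s pointer-based construction: walk until stuck, and splice in detours as exhausted vertices are popped. A hedged Python sketch, assuming the input graph is connected and Eulerian (names illustrative):

```python
def euler_tour(n, edges):
    # n vertices 0..n-1; edges is a list of undirected (u, v) pairs
    adj = [[] for _ in range(n)]
    for idx, (u, v) in enumerate(edges):
        adj[u].append((v, idx))
        adj[v].append((u, idx))
    used = [False] * len(edges)
    ptr = [0] * n                  # per-vertex edge pointer => O(m) total work
    stack, tour = [0], []
    while stack:
        v = stack[-1]
        # skip over already-used edge copies
        while ptr[v] < len(adj[v]) and used[adj[v][ptr[v]][1]]:
            ptr[v] += 1
        if ptr[v] == len(adj[v]):
            tour.append(stack.pop())   # vertex exhausted: emit it
        else:
            w, idx = adj[v][ptr[v]]
            used[idx] = True           # traverse the edge exactly once
            stack.append(w)
    return tour                        # closed walk using every edge once
```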
1. This extends the 2 ln k competitive ratio given in lecture to the general online Steiner tree problem.
Exercise 35
Consider the following online matching problem in general, not necessarily bipartite graphs. No information
about the graph G = (V, E) is given up front. Vertices arrive one-by-one. When a vertex v ∈ V arrives, and
S ⊆ V are the vertices that arrived previously, the algorithm learns about all of the edges between v and
vertices in S. Equivalently, after i time steps, the algorithm knows the graph G[S_i] induced by the set S_i of
the first i vertices.
Give a 1/2-competitive online algorithm for this problem.
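A natural candidate is the greedy rule: match each arriving vertex to an arbitrary unmatched earlier neighbor, if one exists. A minimal sketch of this rule in the arrival model above (the input format and names are illustrative, and this is offered as a candidate, not as the certified solution):

```python
def online_greedy_matching(arrivals):
    # arrivals: list of (vertex, neighbors_among_earlier_vertices) pairs,
    # in arrival order
    matched = {}
    for v, neighbors in arrivals:
        for u in neighbors:
            if u not in matched:       # first unmatched earlier neighbor
                matched[u] = v
                matched[v] = u
                break
    # each matched edge appears twice in `matched`; deduplicate
    return {tuple(sorted(e)) for e in matched.items()}
```

Greedy always produces a maximal matching, which is the usual route to a 1/2-competitive guarantee.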
CS261: Exercise Set #8
For the week of February 22–26, 2016
Instructions:
(1) Do not turn anything in.
(2) The course staff is happy to discuss the solutions of these exercises with you in office hours or on
Piazza.
(3) While these exercises are certainly not trivial, you should be able to complete them on your own
(perhaps after consulting with the course staff or a friend for hints).
Exercise 36
Recall the MST heuristic for the Steiner tree problem — in Lecture #15, we showed that this is a
2-approximation algorithm. Show that, for every constant c < 2, there is an instance of the Steiner tree
problem such that the MST heuristic returns a tree with cost more than c times that of an optimal Steiner
tree.
Exercise 37
Recall the greedy algorithm for set coverage (Lecture #15). Prove that for every k ≥ 1, there is an example
where the value of the greedy solution is at most 1 − (1 − 1/k)^k times that of an optimal solution.
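For reference, the greedy rule in question repeatedly picks the set covering the most not-yet-covered elements; a minimal illustrative sketch of the unweighted version (names assumed, ties broken arbitrarily):

```python
def greedy_coverage(sets, k):
    # sets: list of Python sets; pick k of them greedily by marginal coverage
    covered = set()
    chosen = []
    for _ in range(k):
        best = max(sets, key=lambda S: len(S - covered))  # max marginal gain
        chosen.append(best)
        covered |= best
    return chosen, len(covered)
```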
Exercise 38
Recall the MST heuristic for the metric TSP problem — in Lecture #16, we showed that this is a
2-approximation algorithm. Show that, for every constant c < 2, there is an instance of the metric TSP
problem such that the MST heuristic returns a tour with cost more than c times the minimum possible.
Exercise 39
Recall Christofides’s 3/2-approximation algorithm for the metric TSP problem. Prove that the analysis given
in Lecture #16 is tight: for every constant c < 3/2, there is an instance of the metric TSP problem such that
Christofides’s algorithm returns a tour with cost more than c times the minimum possible.
Exercise 40
Consider the following variant of the traveling salesman problem (TSP). The input is an undirected complete
graph with edge costs. These edge costs need not satisfy the triangle inequality. The desired output is the
minimum-cost cycle, not necessarily simple, that visits every vertex at least once.
Show how to convert a polynomial-time α-approximation algorithm for the metric TSP problem into a
polynomial-time α-approximation algorithm for this (non-metric) TSP problem with repeated visits allowed.
[Hint: Compare to Exercise 32.]
CS261: Exercise Set #9
For the week of February 29–March 4, 2016
Instructions:
(1) Do not turn anything in.
(2) The course staff is happy to discuss the solutions of these exercises with you in office hours or on
Piazza.
(3) While these exercises are certainly not trivial, you should be able to complete them on your own
(perhaps after consulting with the course staff or a friend for hints).
Exercise 41
Recall the Vertex Cover problem from Lecture #17: the input is an undirected graph G = (V, E) and a
non-negative cost cv for each vertex v ∈ V . The goal is to compute a minimum-cost subset S ⊆ V that
includes at least one endpoint of each edge.
The natural greedy algorithm is:
• S = ∅
• while S is not a vertex cover:
  – add to S the vertex v minimizing (c_v / # newly covered edges)
• return S
Prove that this algorithm is not a constant-factor approximation algorithm for the vertex cover problem.
Exercise 42
Recall from Lecture #17 our linear programming relaxation of the Vertex Cover problem (with nonnegative
vertex costs):

    min Σ_{v∈V} c_v x_v
    subject to  x_v + x_w ≥ 1   for all edges e = (v, w) ∈ E
    and         x_v ≥ 0         for all vertices v ∈ V.

Prove that there is always a half-integral optimal solution x* of this linear program, meaning that
x*_v ∈ {0, 1/2, 1} for every v ∈ V.
[Hint: start from an arbitrary feasible solution and show how to make it “closer to half-integral” while only
improving the objective function value.]
Exercise 43
Recall the primal-dual algorithm for the vertex cover problem — in Lecture #17, we showed that this is a
2-approximation algorithm. Show that, for every constant c < 2, there is an instance of the vertex cover
problem such that this algorithm returns a vertex cover with cost more than c times that of an optimal
vertex cover.
Exercise 44
Prove Markov’s inequality: if X is a non-negative random variable with finite expectation and c > 1, then

    Pr[X ≥ c · E[X]] ≤ 1/c.
Exercise 45
Let X be a random variable with finite expectation and variance; recall that Var[X] = E[(X − E[X])^2] and
StdDev[X] = √(Var[X]). Prove Chebyshev’s inequality: for every t > 1,

    Pr[|X − E[X]| ≥ t · StdDev[X]] ≤ 1/t^2.

[Hint: apply Markov’s inequality to the (non-negative!) random variable (X − E[X])^2.]
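Both inequalities are easy to check empirically. The following sketch samples from an exponential distribution (an arbitrary choice of a non-negative random variable) and verifies that the observed tail frequencies respect the two bounds; the sample size and seed are arbitrary.

```python
import random
import statistics

random.seed(0)
xs = [random.expovariate(1.0) for _ in range(100_000)]  # non-negative X

mean = statistics.fmean(xs)
sd = statistics.pstdev(xs)

c, t = 3.0, 2.0
markov_freq = sum(x >= c * mean for x in xs) / len(xs)
cheb_freq = sum(abs(x - mean) >= t * sd for x in xs) / len(xs)

# empirical tail frequencies should respect Markov and Chebyshev
assert markov_freq <= 1 / c
assert cheb_freq <= 1 / t ** 2
```

For the exponential distribution the true tail probability Pr[X ≥ 3 E[X]] = e^{−3} ≈ 0.05, comfortably below the Markov bound of 1/3, so the check passes with a wide margin.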
CS261: Problem Set #1
Due by 11:59 PM on Tuesday, January 26, 2016
Instructions:
(1) Form a group of 1-3 students. You should turn in only one write-up for your entire group.
(2) Submission instructions: We are using Gradescope for the homework submissions. Go to www.gradescope.com
to either login or create a new account. Use the course code 9B3BEM to register for CS261. Only one
group member needs to submit the assignment. When submitting, please remember to add all group
member names in Gradescope.
(3) Please type your solutions if possible and we encourage you to use the LaTeX template provided on
the course home page.
(4) Write convincingly but not excessively.
(5) Some of these problems are difficult, so your group may not solve them all to completion. In this case,
you can write up what you’ve got (subject to (3), above): partial proofs, lemmas, high-level ideas,
counterexamples, and so on.
(6) Except where otherwise noted, you may refer to the course lecture notes only. You can also review any
relevant materials from your undergraduate algorithms course.
(7) You can discuss the problems verbally at a high level with other groups. And of course, you are
encouraged to contact the course staff (via Piazza or office hours) for additional help.
(8) If you discuss solution approaches with anyone outside of your group, you must list their names on the
front page of your write-up.
(9) Refer to the course Web page for the late day policy.
Problem 1
This problem explores “path decompositions” of a flow. The input is a flow network (as usual, a directed
graph G = (V, E), a source s, a sink t, and a positive integral capacity u_e for each edge), as well as a flow f
in G. As always with graphs, m denotes |E| and n denotes |V|.
(a) A flow is acyclic if the subgraph of directed edges with positive flow contains no directed cycles. Prove
that for every flow f, there is an acyclic flow with the same value as f. (In particular, this implies that
some maximum flow is acyclic.)
(b) A path flow assigns positive values only to the edges of one simple directed path from s to t. Prove
that every acyclic flow can be written as the sum of at most m path flows.
(c) Is the Ford-Fulkerson algorithm guaranteed to produce an acyclic maximum flow?
(d) A cycle flow assigns positive values only to the edges of one simple directed cycle. Prove that every
flow can be written as the sum of at most m path and cycle flows.
(e) Can you compute the decomposition in (d) in O(mn) time?
Problem 2
Consider a directed graph G = (V, E) with source s and sink t for which each edge e has a positive integral
capacity u_e. Recall from Lecture #2 that a blocking flow in such a network is a flow {f_e}_{e∈E} with the
property that, for every s-t path P of G, there is at least one edge e of P such that f_e = u_e. For example, our
first (broken) greedy algorithm from Lecture #1 terminates with a blocking flow (which, as we saw, is not
necessarily a maximum flow).
Dinic’s Algorithm
initialize f_e = 0 for all e ∈ E
while there is an s-t path in the current residual network G_f do
    construct the layered graph L_f, by computing the residual graph G_f and running
      breadth-first search (BFS) in G_f starting from s, stopping once the sink t is
      reached, and retaining only the forward edges1
    compute a blocking flow g in L_f
    // augment the flow f using the flow g
    for all edges (v, w) of G for which the corresponding forward edge of G_f carries
      flow (g_{vw} > 0) do
        increase f_{vw} by g_{vw}
    for all edges (v, w) of G for which the corresponding reverse edge of G_f carries
      flow (g_{wv} > 0) do
        decrease f_{vw} by g_{wv}
The termination condition implies that the algorithm can only halt with a maximum flow. Exercise Set #1
argues that every iteration of the main loop increases d(f), the length (i.e., number of hops) of a shortest
s-t path in G_f, and therefore the algorithm stops after at most n iterations. Its running time is therefore
O(n · BF), where BF is the amount of time required to compute a blocking flow in the layered graph L_f. We
know that BF = O(m^2) — our first broken greedy algorithm already proves this — but we can do better.
Consider the following algorithm, inspired by depth-first search, for computing a blocking flow in L_f:
A Blocking Flow Algorithm
Initialize. Initialize the flow variables g_e to 0 for all e ∈ E. Initialize the path variable
P as the empty path, from s to itself. Go to Advance.
Advance. Let v denote the current endpoint of the path P. If there is no edge out
of v, go to Retreat. Otherwise, append one such edge (v, w) to the path P. If w ≠ t
then go to Advance. If w = t then go to Augment.
Retreat. Let v denote the current endpoint of the path P. If v = s then halt.
Otherwise, delete v and all of its incident edges from L_f. Remove from P its last edge.
Go to Advance.
Augment. Let ∆ denote the smallest residual capacity of an edge on the path P
(which must be an s-t path). Increase g_e by ∆ on all edges e ∈ P. Delete newly
saturated edges from L_f, and let e = (v, w) denote the first such edge on P. Retain
only the subpath of P from s to v. Go to Advance.
And now the analysis:
(a) Prove that the running time of the algorithm, suitably implemented, is O(mn). (As always, m denotes
|E| and n denotes |V|.)
[Hint: How many times can Retreat be called? How many times can Augment be called? How many
times can Advance be called before a call to Retreat or Augment?]
1. Recall that a forward edge in BFS goes from layer i to layer (i + 1), for some i.
(b) Prove that the algorithm terminates with a blocking flow g in L_f.
[For example, you could argue by contradiction.]
(c) Suppose that every edge of L_f has capacity 1 (cf., Exercise #4). Prove that the algorithm above
computes a blocking flow in linear (i.e., O(m)) time.
[Hint: can an edge (v, w) be chosen in two different calls to Advance?]
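The Advance/Retreat/Augment routine is commonly implemented with per-vertex edge pointers, so that dead edges are never revisited. Below is a compact, illustrative Python sketch of Dinic’s algorithm in that style — a standard textbook implementation, not the lecture’s exact pseudocode.

```python
from collections import deque

class Dinic:
    def __init__(self, n):
        self.n = n
        self.adj = [[] for _ in range(n)]

    def add_edge(self, u, v, cap):
        # forward edge and its zero-capacity reverse, cross-linked by index
        self.adj[u].append([v, cap, len(self.adj[v])])
        self.adj[v].append([u, 0, len(self.adj[u]) - 1])

    def _bfs(self, s, t):
        # build BFS levels of the residual graph (the layered graph L_f)
        self.level = [-1] * self.n
        self.level[s] = 0
        q = deque([s])
        while q:
            u = q.popleft()
            for v, cap, _ in self.adj[u]:
                if cap > 0 and self.level[v] < 0:
                    self.level[v] = self.level[u] + 1
                    q.append(v)
        return self.level[t] >= 0

    def _dfs(self, u, t, pushed):
        # advance along forward (level-increasing) edges; the per-vertex
        # pointer self.it makes retreats implicit
        if u == t:
            return pushed
        while self.it[u] < len(self.adj[u]):
            e = self.adj[u][self.it[u]]
            v, cap, rev = e
            if cap > 0 and self.level[v] == self.level[u] + 1:
                d = self._dfs(v, t, min(pushed, cap))
                if d > 0:                       # augment along the found path
                    e[1] -= d
                    self.adj[v][rev][1] += d
                    return d
            self.it[u] += 1                     # retreat past a dead edge
        return 0

    def max_flow(self, s, t):
        flow = 0
        while self._bfs(s, t):                  # at most n phases
            self.it = [0] * self.n
            while True:
                f = self._dfs(s, t, float("inf"))
                if f == 0:
                    break                       # blocking flow reached
                flow += f
        return flow
```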
Problem 3
In this problem we’ll analyze a different augmenting path-based algorithm for the maximum flow problem.
Consider a flow network with integral edge capacities. Suppose we modify the Edmonds-Karp algorithm
(Lecture #2) so that, instead of choosing a shortest augmenting path in the residual network Gf , it chooses
an augmenting path on which it can push the most flow. (That is, it maximizes the minimum residual
capacity of an edge in the path.) For example, in the network in Figure 1, this algorithm would push 3 units
of flow on the path s → v → w → t in the first iteration. (And 2 units on s → w → v → t in the second
iteration.)
[Figure 1: Problem 3. Edges are labeled with their capacities, with flow amounts in parentheses. The
network has edges s → v (capacity 3), s → w (capacity 2), v → w (capacity 5), v → t (capacity 2), and
w → t (capacity 3).]
(a) Show how to modify Dijkstra’s shortest-path algorithm, without affecting its asymptotic running time,
so that it computes an s-t path with the maximum-possible minimum residual edge capacity.
(b) Suppose the current flow f has value F and the maximum flow value in G is F*. Prove that there
is an augmenting path in G_f such that every edge has residual capacity at least (F* − F)/m, where
m = |E|.
[Hint: if ∆ is the maximum amount of flow that can be pushed on any s-t path of G_f, consider the set
of vertices reachable from s along edges in G_f with residual capacity more than ∆. Relate the residual
capacity of this (s, t)-cut to F* − F.]
(c) Prove that this variant of the Edmonds-Karp algorithm terminates within O(m log F∗) iterations,
where F∗ is defined as in the previous problem.
[Hint: you might find the inequality 1 − x ≤ e^{−x} for x ∈ [0, 1] useful.]
(d) Assume that all edge capacities are integers in {1, 2, . . . , U}. Give an upper bound on the running time
of your algorithm as a function of n = |V |, m, and U. Is this bound polynomial in the input size?
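For part (a), the usual approach is Dijkstra’s algorithm with path length replaced by path bottleneck: propagate, for each vertex, the largest achievable minimum edge capacity on any path from s. A hedged sketch (the graph format and names are illustrative):

```python
import heapq

def widest_path_value(n, edges, s, t):
    # edges: list of (u, v, capacity); returns the maximum over s-t paths
    # of the minimum capacity along the path (0 if t is unreachable)
    adj = [[] for _ in range(n)]
    for u, v, c in edges:
        adj[u].append((v, c))
    best = [0] * n
    best[s] = float("inf")
    heap = [(-best[s], s)]               # max-heap via negated widths
    while heap:
        width, u = heapq.heappop(heap)
        width = -width
        if width < best[u]:
            continue                     # stale entry
        for v, c in adj[u]:
            w = min(width, c)            # bottleneck of the extended path
            if w > best[v]:
                best[v] = w
                heapq.heappush(heap, (-w, v))
    return best[t]
```

The only change from Dijkstra is the update rule: min(width, c) and a max comparison replace addition and a min comparison, so the asymptotic running time is unaffected.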
Problem 4
In this problem we’ll revisit the special case of unit-capacity networks, where every edge has capacity 1 (see
also Exercise 4).
(a) Recall the notation d(f) for the length (in hops) of a shortest s-t path in the residual network G_f.
Suppose G is a unit-capacity network and f is a flow with value F. Prove that the maximum flow
value is at most F + m/d(f).
[Hint: use the layered graph L_f discussed in Problem 2 to identify an s-t cut of the residual graph that
has small residual capacity. Then argue along the lines of Problem 3(b).]
(b) Explain how to compute a maximum flow in a unit-capacity network in O(m^{3/2}) time.
[Hints: use Dinic’s algorithm and Problem 2(c). Also, in light of part (a) of this problem, consider the
question: if you know that the value of the current flow f is only c less than the maximum flow value
in G, then what’s a crude upper bound on the number of additional blocking flows required before
you’re sure to terminate with a maximum flow?]
Problem 5
(Difficult.) This problem sharpens the analysis of the highest-label push-relabel algorithm (Lecture #3) to
improve the running time bound from O(n^3) to O(n^2 √m).2 (Replacing an n by a √m is always a good
thing.) Recall from the Lecture #3 analysis that it suffices to prove that the number of non-saturating
pushes is O(n^2 √m) (since there are only O(n^2) relabels and O(nm) saturating pushes, anyways).
For convenience, we augment the algorithm with some bookkeeping: each vertex v maintains at most one
successor, which is a vertex w such that (v, w) has positive residual capacity and h(v) = h(w) + 1 (i.e., (v, w)
goes “downhill”). (If there is no such w, v’s successor is NULL.) When a push is called on the vertex v, flow
is pushed from v to its successor w. Successors are updated as needed after each saturating push or relabel.3
For a preflow f and corresponding residual graph G_f, we denote by S_f the subgraph of G_f consisting of the
edges {(v, w) ∈ G_f : w is v’s successor}.
[Figure 2: (a) Sample instance of running the push-relabel algorithm. As usual, for edges, the flow values
are in brackets. For vertices, the bracketed values denote the heights of vertices. (b) S_f for the given preflow
in (a). Maximal vertices are denoted by two circles.]
(a) Note that every vertex of S_f has out-degree 0 or 1. Prove that S_f is a directed forest, meaning a
collection of disjoint directed trees (in each tree, all edges are directed inward toward the root).
(b) Define D(v) as the number of descendants of v in its directed tree (including v itself). Equivalently,
D(v) is the number of vertices that can reach v by repeatedly following successor edges. (The D(v)’s
can change each time the preflow, height function, or successor edges change.)
Prove that the push-relabel algorithm only pushes flow from v to w when D(w) > D(v).
2. Believe it or not, this is a tight upper bound — the algorithm requires Ω(n^2 √m) operations in the worst case.
3. We leave it as an exercise to think about how to implement this to get an algorithm with overall running time O(n^2 √m).
(c) Call a vertex with excess maximal if none of its descendants have excess. (Every highest vertex with
excess is maximal — do you see why? — but the converse need not hold.) For such a vertex, define

    φ(v) = max{K − D(v) + 1, 0},

where K is a parameter to be chosen in part (i). For the other vertices, define φ(v) = 0. Define

    Φ = Σ_{v∈V} φ(v).

Prove that a non-saturating push, from a highest vertex v with positive excess, cannot increase Φ.
Moreover, such a push strictly decreases Φ if D(v) ≤ K.
(d) Prove that changing a vertex’s successor from NULL to a non-NULL value cannot increase Φ.
(e) Prove that each relabel increases Φ by at most K.
[Hint: before a relabel at v, v has out-degree 0 in S_f. After the relabel, it has in-degree 0. Can this
create new maximal vertices? And how do the different D(w)’s change?]
(f) Prove that each saturating push increases Φ by at most K.
(g) A phase is a maximal sequence of operations such that the maximum height of a vertex with excess
remains unchanged. (The set of such vertices can change.) Prove that there are O(n2) phases.
(h) Arguing as in Lecture #3 shows that each phase performs at most n non-saturating pushes (why?), but
we want to beat the O(n^3) bound. Suppose that a phase performs at least 2n/K non-saturating pushes.
Show that at least half of these strictly decrease Φ.
[Hint: prove that if a phase does a non-saturating push at both v and w during a phase, then v and
w share no descendants during the phase. How many such vertices can there be with more than K
descendants?]
(i) Prove a bound of O(n^3/K + nmK) on the total number of non-saturating pushes across all phases.
Choose K so that the bound simplifies to O(n^2 √m).
Problem 6
Suppose we are given an array A[1..m][1..n] of non-negative real numbers. We want to round A to an integer
matrix, by replacing each entry x in A with either ⌊x⌋ or ⌈x⌉, without changing the sum of entries in any
row or column of A. (Assume that all row and column sums of A are integral.) For example:

    1.2 3.4 2.4       1 4 2
    3.9 4.0 2.1   →   4 4 2
    7.9 1.6 0.5       8 1 1
(a) Describe and analyze an efficient algorithm that either rounds A in this fashion, or reports correctly
that no such rounding is possible.
[Hint: don’t solve the problem from scratch, use a reduction instead.]
(b) Prove that such a rounding is guaranteed to exist.
CS261: Problem Set #2
Due by 11:59 PM on Tuesday, February 9, 2016
Instructions:
(1) Form a group of 1-3 students. You should turn in only one write-up for your entire group.
(2) Submission instructions: We are using Gradescope for the homework submissions. Go to www.gradescope.com
to either login or create a new account. Use the course code 9B3BEM to register for CS261. Only one
group member needs to submit the assignment. When submitting, please remember to add all group
member names in Gradescope.
(3) Please type your solutions if possible and we encourage you to use the LaTeX template provided on
the course home page.
(4) Write convincingly but not excessively.
(5) Some of these problems are difficult, so your group may not solve them all to completion. In this case,
you can write up what you’ve got (subject to (3), above): partial proofs, lemmas, high-level ideas,
counterexamples, and so on.
(6) Except where otherwise noted, you may refer to the course lecture notes only. You can also review any
relevant materials from your undergraduate algorithms course.
(7) You can discuss the problems verbally at a high level with other groups. And of course, you are
encouraged to contact the course staff (via Piazza or office hours) for additional help.
(8) If you discuss solution approaches with anyone outside of your group, you must list their names on the
front page of your write-up.
(9) Refer to the course Web page for the late day policy.
Problem 7
A vertex cover of an undirected graph (V, E) is a subset S ⊆ V such that, for every edge e ∈ E, at least one
of e’s endpoints lies in S.1
(a) Prove that in every graph, the minimum size of a vertex cover is at least the size of a maximum
matching.
(b) Give a non-bipartite graph in which the minimum size of a vertex cover is strictly bigger than the size
of a maximum matching.
(c) Prove that the problem of computing a minimum-cardinality vertex cover can be solved in polynomial
time in bipartite graphs.2
[Hint: reduction to maximum flow.]
(d) Prove that in every bipartite graph, the minimum size of a vertex cover equals the size of a maximum
matching.
1. Yes, the problem is confusingly named.
2. In general graphs, the problem turns out to be NP-hard (you don’t need to prove this).
Problem 8
This problem considers the special case of maximum flow instances where edges have integral capacities and
also
(*) for every vertex v other than s and t, either (i) there is at most one edge entering v, and this edge
(if it exists) has capacity 1; or (ii) there is at most one edge exiting v, and this edge (if it exists) has
capacity 1.
Your tasks:
(a) Prove that the maximum flow problem can be solved in O(m √n) time in networks that satisfy (*).
(As always, m is the number of edges and n is the number of vertices.)
[Hint: proceed as in Problem 4, but prove a stronger version of part (a) of that problem.]
(b) Prove that the maximum bipartite matching problem can be solved in O(m √n) time.
[Hint: examine the reduction in Lecture #4.]
Problem 9
This problem considers approximation algorithms for graph matching problems.
(a) For the maximum-cardinality matching problem in bipartite graphs, prove that for every constant
ε > 0, there is an O(m)-time algorithm that computes a matching with size at most εn less than the
maximum possible (where n is the number of vertices). (The hidden constant in the big-oh notation
can depend on 1/ε.)
[Hint: ideas from Problem 8(b) should be useful.]
(b) Now consider non-bipartite graphs where each edge e has a real-valued weight w_e. Recall the greedy
algorithm from Lecture #6:
Greedy Matching Algorithm
sort and rename the edges E = {1, 2, . . . , m} so that w_1 ≥ w_2 ≥ · · · ≥ w_m
M = ∅
for i = 1 to m do
    if w_i > 0 and e_i shares no endpoint with edges in M then
        add e_i to M
How fast can you implement this algorithm?
(c) Prove that the greedy algorithm always outputs a matching with total weight at least 50% of the
maximum possible.
[Hint: if the greedy algorithm adds an edge e to M, how many edges in the optimal matching can this
edge “block”? How do the weights of the blocked edges compare to that of e?]
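As a point of reference for part (b), sorting dominates the work, so a direct implementation of the greedy algorithm runs in O(m log m) time with a set of matched endpoints; an illustrative sketch (input format assumed):

```python
def greedy_matching(edges):
    # edges: list of (weight, u, v) triples; returns the chosen matching
    M, used = [], set()
    for w, u, v in sorted(edges, reverse=True):   # nonincreasing weight
        if w > 0 and u not in used and v not in used:
            M.append((u, v))
            used.add(u)
            used.add(v)
    return M
```

On the path a–b–c–d with weights 2, 3, 2, greedy takes only the middle edge (weight 3) while the optimum takes the two outer edges (weight 4) — consistent with, and nearly tight for, the 50% guarantee of part (c).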
Problem 10
This problem concerns running time optimizations to the Hungarian algorithm for computing minimum-cost
perfect bipartite matchings (Lecture #5). Recall the O(mn^2) running time analysis from lecture: there are
at most n augmentation steps, at most n price update steps between two augmentation steps, and each
iteration can be implemented in O(m) time.
(a) By a phase, we mean a maximal sequence of price update iterations (between two augmentation
iterations). The naive implementation in lecture regrows the search tree from scratch after each price
update in a phase, spending O(m) time on this for each of up to n iterations. Show how to reuse work
from previous iterations so that the total amount of work done searching for good paths, in total over
all iterations in the phase, is only O(m).
[Hint: compare to Problem 2(a).]
(b) The other non-trivial work in a price update phase is computing the value of ∆ (the magnitude of the
update). This is easy to do in O(m) time per iteration. Explain how to maintain a heap data structure
so that the total time spent computing ∆ over all iterations in the phase is only O(m log n). Be sure
to explain what heap operations you perform while growing the search tree and when executing a price
update.
[This yields an O(mn log n) time implementation of the Hungarian algorithm.]
Problem 11
In the minimum-cost flow problem, the input is a directed graph G = (V, E), a source s ∈ V, a sink t ∈ V,
a target flow value d, and a capacity u_e ≥ 0 and cost c_e ∈ R for each edge e ∈ E. The goal is to compute
a flow {f_e}_{e∈E} sending d units from s to t with the minimum-possible cost Σ_{e∈E} c_e f_e. (If there is no such
flow, the algorithm should correctly report this fact.)
Given a min-cost flow instance and a feasible flow f with value d, the corresponding residual network G_f
is defined as follows. The vertex set remains V. For every edge (v, w) ∈ E with f_{vw} < u_{vw}, there is an edge
(v, w) in G_f with cost c_{vw} and residual capacity u_{vw} − f_{vw}. For every edge (v, w) ∈ E with f_{vw} > 0, there is a
reverse edge (w, v) in G_f with the cost −c_{vw} and residual capacity f_{vw}.
A negative cycle of G_f is a directed cycle C of G_f such that the sum of the edge costs in C is negative.
(E.g., v → w → x → y → v, with c_{vw} = 2, c_{wx} = −1, c_{xy} = 3, and c_{yv} = −5.)
(a) Prove that if the residual network Gf of a flow f has a negative cycle, then f is not a minimum-cost
flow.
(b) Prove that if the residual network Gf of a flow f has no negative cycles, then f is a minimum-cost
flow.
[Hint: look to the proof of the minimum-cost bipartite matching optimality conditions (Lecture #5)
for inspiration.]
(c) Give a polynomial-time algorithm that, given a residual network Gf , either returns a negative cycle or
correctly reports that no negative cycle exists.
[Hint: feel free to use an algorithm from CS161. Be clear about which properties of the algorithm
you’re using.]
(d) Assume that all edge costs and capacities are integers with magnitude at most M. Give an algorithm
that is guaranteed to terminate with a minimum-cost flow and has running time polynomial in n = |V|,
m = |E|, and M.3
[Hint: what would the analog of Ford-Fulkerson be?]
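For part (c), a Bellman-Ford-style detector is the standard CS161 tool: after n full passes of relaxation (from a virtual source attached to every vertex), any remaining improvement witnesses a negative cycle, which can be recovered by walking predecessor pointers. A hedged sketch (vertex/edge format illustrative):

```python
def find_negative_cycle(n, edges):
    # edges: list of (u, v, cost); returns a list of vertices on a negative
    # cycle, or None if no negative cycle exists
    dist = [0] * n            # dist 0 everywhere = virtual source to all
    pred = [None] * n
    x = None
    for _ in range(n):
        x = None
        for u, v, c in edges:
            if dist[u] + c < dist[v]:
                dist[v] = dist[u] + c
                pred[v] = u
                x = v         # remember the last relaxed vertex
        if x is None:
            return None       # a full pass with no update: no negative cycle
    for _ in range(n):        # walk back n steps to land on the cycle itself
        x = pred[x]
    cycle, v = [x], pred[x]
    while v != x:
        cycle.append(v)
        v = pred[v]
    return cycle[::-1]
```

On the four-vertex example above (costs 2, −1, 3, −5, summing to −1), the detector returns the whole cycle.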
Problem 12
The goal of this problem is to revisit two problems you studied in CS161 — the minimum spanning tree
and shortest path problems — and to prove the optimality of Kruskal’s and Dijkstra’s algorithms via the
complementary slackness conditions of judiciously chosen linear programs.
3. Thus this algorithm is only “pseudo-polynomial.” A polynomial algorithm would run in time polynomial in n, m, and
log M. Such algorithms can be derived for the minimum-cost flow problem using additional ideas.
(a) For convenience, we consider the maximum spanning tree problem (equivalent to the minimum spanning
tree problem, after multiplying everything by −1). Consider a connected undirected graph G = (V, E)
in which each edge e has a weight w_e.
For a subset F ⊆ E, let κ(F) denote the number of connected components in the subgraph (V, F).
Prove that the spanning trees of G are in an objective function-preserving one-to-one correspondence
with the 0-1 feasible solutions of the following linear program (with decision variables {x_e}_{e∈E}):

    max Σ_{e∈E} w_e x_e
    subject to
        Σ_{e∈F} x_e ≤ |V| − κ(F)   for all F ⊆ E
        Σ_{e∈E} x_e = |V| − 1
        x_e ≥ 0                    for all e ∈ E.
(While this linear program has a huge number of constraints, we are using it purely for the analysis of
Kruskal’s algorithm.)
(b) What is the dual of this linear program?
(c) What are the complementary slackness conditions?
(d) Recall that Kruskal’s algorithm, adapted to the current maximization setting, works as follows: do a
single pass over the edges from the highest weight to lowest weight (breaking ties arbitrarily), adding
an edge to the solution-so-far if and only if it creates no cycle with previously chosen edges. Prove that
the corresponding solution to the linear program in (a) is in fact an optimal solution to that linear
program, by exhibiting a feasible solution to the dual program in (b) such that the complementary
slackness conditions hold.4
[Hint: for the dual variables of the form y_F, it is enough to use only those that correspond to subsets
F ⊆ E that comprise the i edges with the largest weights (for some i).]
(e) Now consider the problem of computing a shortest path from s to t in a directed graph G = (V, E)
with a nonnegative cost c_e on each edge e ∈ E. Prove that every simple s-t path of G corresponds to
a 0-1 feasible solution of the following linear program with the same objective function value:5

    min Σ_{e∈E} c_e x_e
    subject to
        Σ_{e∈δ+(S)} x_e ≥ 1   for all S ⊆ V with s ∈ S, t ∉ S
        x_e ≥ 0               for all e ∈ E.
(Again, this huge linear program is for analysis only.)
(f) What is the dual of this linear program?
(g) What are the complementary slackness conditions?
4. You can assume without proof that Kruskal’s algorithm outputs a feasible solution (i.e., a spanning tree), and focus on
proving its optimality.
5. Recall that δ+(S) denotes the edges sticking out of S.
(h) Let P denote the s-t path returned by Dijkstra’s algorithm. Prove that the solution to the linear
program in (e) corresponding to P is in fact an optimal solution to that linear program, by exhibiting
a feasible solution to the dual program in (f) such that the complementary slackness conditions hold.
[Hint: it is enough to use only dual variables of the form y_S for subsets S ⊆ V that comprise the first i
vertices processed by Dijkstra’s algorithm (for some i).]
CS261: Problem Set #3
Due by 11:59 PM on Tuesday, February 23, 2016
Instructions:
(1) Form a group of 1-3 students. You should turn in only one write-up for your entire group.
(2) Submission instructions: We are using Gradescope for the homework submissions. Go to www.gradescope.com
to either login or create a new account. Use the course code 9B3BEM to register for CS261. Only one
group member needs to submit the assignment. When submitting, please remember to add all group
member names in Gradescope.
(3) Please type your solutions if possible and we encourage you to use the LaTeX template provided on
the course home page.
(4) Write convincingly but not excessively.
(5) Some of these problems are difficult, so your group may not solve them all to completion. In this case,
you can write up what you’ve got (subject to (3), above): partial proofs, lemmas, high-level ideas,
counterexamples, and so on.
(6) Except where otherwise noted, you may refer to the course lecture notes only. You can also review any
relevant materials from your undergraduate algorithms course.
(7) You can discuss the problems verbally at a high level with other groups. And of course, you are
encouraged to contact the course staff (via Piazza or office hours) for additional help.
(8) If you discuss solution approaches with anyone outside of your group, you must list their names on the
front page of your write-up.
(9) Refer to the course Web page for the late day policy.
Problem 13
This problem fills in some gaps in our proof sketch of strong linear programming duality.
(a) For this part, assume the version of Farkas’s Lemma stated in Lecture #9: given A ∈ R^{m×n} and
b ∈ R^m, exactly one of the following statements holds: (i) there is an x ∈ R^n such that Ax = b and
x ≥ 0; (ii) there is a y ∈ R^m such that y^T A ≥ 0 and y^T b < 0.
Deduce from this a second version of Farkas’s Lemma, stating that for A and b as above, exactly one
of the following statements holds: (iii) there is an x ∈ R^n such that Ax ≤ b; (iv) there is a y ∈ R^m
such that y ≥ 0, y^T A = 0, and y^T b < 0.
[Hint: note the similarity between (i) and (iv). Also note that if (iv) has a solution, then it has a
solution with y^T b = −1.]
(b) Use the second version of Farkas’s Lemma to prove the following version of strong LP duality: if the
linear programs
max c^T x
subject to
Ax ≤ b
with x unrestricted, and
min b^T y
subject to
A^T y = c, y ≥ 0
are both feasible, then they have equal optimal objective function values.
[Hint: weak duality is easy to prove directly. For strong duality, let γ* denote the optimal objective
function value of the dual linear program. Add the constraint c^T x ≥ γ* to the primal linear program
and use Farkas’s Lemma to show that the feasible region is non-empty.]
Problem 14
Recall the multicommodity flow problem from Exercise 17. The input consists of a directed graph
G = (V, E), k “commodities” or source-sink pairs (s_1, t_1), . . . , (s_k, t_k), and a positive capacity u_e for each
edge.
Consider also the multicut problem, where the input is the same as in the multicommodity flow problem,
and feasible solutions are subsets F ⊆ E of edges such that, for every commodity (s_i, t_i), there is no s_i-t_i
path in G = (V, E \ F). (Assume that s_i and t_i are distinct for each i.) The value of a multicut F is just
the total capacity Σ_{e∈F} u_e.
(a) Formulate the multicommodity flow problem as a linear program with one decision variable for each
path P that travels from a source s_i to the corresponding sink t_i. Aside from nonnegativity constraints,
there should be only m constraints (one per edge).
[Note: this is a different linear programming formulation than the one asked for in Exercise 21.]
(b) Take the dual of the linear program in (a). Prove that every optimal 0-1 solution of this dual —
i.e., among all feasible solutions that assign each decision variable the value 0 or 1, one of minimum
objective function value — is the characteristic vector of a minimum-value multicut.
(c) Show by example that the optimal solution to this dual linear program can have objective function
value strictly smaller than that of every 0-1 feasible solution. In light of your example, explain a sense
in which there is no max-flow/min-cut theorem for multicommodity flows and multicuts.
Problem 15
This problem gives a linear-time (!) randomized algorithm for solving linear programs that have a large
number m of constraints and a small number n of decision variables. (The constant in the linear-time
guarantee O(m) will depend exponentially on n.)
Consider a linear program of the form
max c^T x
subject to
Ax ≤ b.
For simplicity, assume that the linear program is feasible with a bounded feasible region, and let M be large
enough that |x_j| < M for every coordinate x_j of every feasible solution. Assume also that the linear program
is “non-degenerate,” in the sense that no feasible point satisfies more than n constraints with equality. For
example, in the plane (two decision variables), this just means that no three different constraints (i.e.,
halfplanes) have boundaries that meet at a common point. Finally, assume that the linear program has a
unique optimal solution.1
Let C = {1, 2, . . . , m} denote the set of constraints of the linear program. Let B denote additional
constraints asserting that −M ≤ x_j ≤ M for every j. The high-level idea of the algorithm is: (i) drop a
random constraint and recursively compute the optimal solution x* of the smaller linear program; (ii) if x*
is feasible for the original linear program, return it; (iii) else, if x* violates the constraint a_i^T x ≤ b_i, then
change this inequality to an equality and recursively solve the resulting linear program.
More precisely, consider the following recursive algorithm with two arguments. The first argument C_1 is
a subset of inequality constraints that must be satisfied (initially equal to C). The second argument is a
subset C_2 of constraints that must be satisfied with equality (initially ∅). The responsibility of a recursive call
is to return a point maximizing c^T x over all points that satisfy all the constraints of C_1 ∪ B (as inequalities)
and also those of C_2 (as equations).

1. All of these simplifying assumptions can be removed without affecting the asymptotic running time; we leave the details to
the interested reader.
Linear-Time Linear Programming
Input: two disjoint subsets C_1, C_2 ⊆ C of constraints
Base case #1: if |C_2| = n, return the unique point that satisfies every constraint
of C_2 with equality
Base case #2: if |C_1| + |C_2| = n, return the point that maximizes c^T x subject to
a_i^T x ≤ b_i for every i ∈ C_1, a_i^T x = b_i for every i ∈ C_2, and the constraints in B
Recursive step:
choose i ∈ C_1 uniformly at random
recurse with the sets C_1 \ {i} and C_2 to obtain a point x*
if a_i^T x* ≤ b_i then
return x*
else
recurse with the sets C_1 \ {i} and C_2 ∪ {i}, and return the result
(a) Prove that this algorithm terminates with the optimal solution x* of the original linear program.
[Hint: be sure to explain why, in the “else” case, it’s OK to recurse with the ith constraint set to an
equation.]
(b) Let T(m, s) denote the expected number of recursive calls made by the algorithm to solve an instance
with |C_1| = m and |C_2| = s (with the number n of variables fixed). Prove that T satisfies the following
recurrence:
T(m, s) = 1 if s = n or m + s = n, and
T(m, s) = T(m − 1, s) + ((n − s)/m) · T(m − 1, s + 1) otherwise.
[Hint: you should use the non-degeneracy assumption in this part.]
(c) Prove that T(m, 0) ≤ n! · m.
[Hint: it might be easiest to make the variable substitution δ = n − s and proceed by simultaneous
induction on m and δ.]
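The recurrence in (b) is easy to evaluate numerically, which gives a quick sanity check of the bound in (c) before proving it. A minimal sketch (the function name and the spot-check loop are ours, not part of the problem statement):

```python
import math

def T(m, s, n):
    """Expected number of recursive calls, per the recurrence in (b)."""
    if s == n or m + s == n:
        return 1.0
    # one call on (m - 1, s); with probability (n - s)/m the dropped
    # constraint is violated and a second call fixes it as an equality
    return T(m - 1, s, n) + ((n - s) / m) * T(m - 1, s + 1, n)

# spot-check the claim of part (c): T(m, 0) <= n! * m
for n in (2, 3):
    for m in range(n, 12):
        assert T(m, 0, n) <= math.factorial(n) * m
```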
(d) Conclude that, for every fixed constant n, the algorithm above can be implemented so that the expected
running time is O(m) (where the hidden constant can depend arbitrarily on n).
Problem 16
This problem considers a variant of the online decision-making problem. There are n “experts,” where n is
a power of 2.
Combining Expert Advice
At each time step t = 1, 2, . . . , T:
each expert offers a prediction of the realization of a binary event (e.g., whether a
stock will go up or down)
a decision-maker picks a probability distribution p^t over the possible realizations 0
and 1 of the event
the actual realization r^t ∈ {0, 1} of the event is revealed
a 0 or 1 is chosen according to the distribution p^t, and a mistake occurs whenever
it is different from r^t
(a) Prove that the minimum worst-case number of mistakes that a deterministic algorithm can make is
precisely log2 n.
(b) Prove that the minimum worst-case expected number of mistakes that a randomized algorithm can
make is precisely (1/2) log2 n.
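For intuition on the upper bound in part (a), recall the classic halving strategy: predict with the majority vote of the experts that have made no mistakes so far. Each mistake at least halves this set, and the promised perfect expert keeps it non-empty, so at most log2 n mistakes occur. A minimal sketch (function name and input encoding are ours):

```python
def halving(predictions, outcomes):
    """predictions[t][e] is expert e's 0/1 prediction at time t;
    outcomes[t] is the realized bit. Returns the number of mistakes
    made by majority vote over the still-consistent experts."""
    n = len(predictions[0])
    alive = set(range(n))          # experts with no mistakes so far
    mistakes = 0
    for preds, r in zip(predictions, outcomes):
        ones = sum(preds[e] for e in alive)
        guess = 1 if 2 * ones >= len(alive) else 0   # majority vote
        if guess != r:
            mistakes += 1
        # keep only the experts that predicted correctly
        alive = {e for e in alive if preds[e] == r}
    return mistakes
```

When the prediction is wrong, the (weak) majority of `alive` was wrong and is discarded, so `alive` at least halves on every mistake.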
Problem 17
In Lecture #11 we saw that the follow-the-leader (FTL) algorithm, and more generally every deterministic
algorithm, can have regret that grows linearly with T. This problem outlines a randomized variant of
FTL, the follow-the-perturbed-leader (FTPL) algorithm, with worst-case regret comparable to that of the
multiplicative weights algorithm. In the description of FTPL, we define each probability distribution pt over
actions implicitly through a randomized subroutine.
Follow-the-Perturbed-Leader (FTPL) Algorithm
for each action a ∈ A do
independently sample a geometric random variable with parameter η,2 denoted by X_a
for each time step t = 1, 2, . . . , T do
choose the action a that maximizes the perturbed cumulative reward
X_a + Σ_{u=1}^{t−1} r^u(a) so far
For convenience, assume that, at every time step t, there is no pair of actions whose (unperturbed) cumulative
rewards-so-far differ by an integer.
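The boxed algorithm is short enough to simulate directly; a sketch (the function name and the encoding of rewards as a T × n matrix are our assumptions):

```python
import random

def ftpl(reward_rows, eta, rng=None):
    """Run FTPL on a T x n matrix of per-step rewards; returns the total
    reward collected. Sketch of the boxed algorithm, for experimentation."""
    rng = rng or random.Random(0)
    n = len(reward_rows[0])
    # geometric perturbation X_a: number of coin flips up to and
    # including the first "heads," with heads-probability eta
    perturb = [0] * n
    for a in range(n):
        flips = 1
        while rng.random() >= eta:
            flips += 1
        perturb[a] = flips
    cumulative = [0.0] * n
    total = 0.0
    for row in reward_rows:
        # follow the perturbed leader so far
        a = max(range(n), key=lambda i: perturb[i] + cumulative[i])
        total += row[a]
        for i in range(n):
            cumulative[i] += row[i]
    return total
```

With η close to 1 the perturbations are almost always 1, so the algorithm essentially follows the unperturbed leader; smaller η gives larger, more protective perturbations, matching the trade-off in parts (a)–(d).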
(a) Prove that, at each time step t = 1, 2, . . . , T, with probability at least 1 − η, the largest perturbed
cumulative reward of an action prior to t is more than 1 larger than the second-largest such perturbed
reward.
[Hint: Sample the X_a’s gradually by flipping coins only as needed, pausing once the action a* with
largest perturbed cumulative reward is identified. Resuming, only X_{a*} is not yet fully determined.
What can you say if the next coin flip comes up “tails?”]
2. Equivalently, when repeatedly flipping a coin that comes up “heads” with probability η, count the number of flips up to
and including the first “heads.”
(b) As a thought experiment, consider the (unimplementable) algorithm that, at each time step t, picks
the action that maximizes the perturbed cumulative reward X_a + Σ_{u=1}^{t} r^u(a) over a ∈ A, taking into
account the current reward vector. Prove that the regret of this algorithm is at most max_{a∈A} X_a.
[Hint: Consider first the special case where X_a = 0 for all a. Iteratively transform the action sequence
that always selects the best action in hindsight to the sequence chosen by the proposed algorithm. Work
backward from time T, showing that the reward only increases with each step of the transformation.]
(c) Prove that E[max_{a∈A} X_a] ≤ b·η^{−1} ln n, where n is the number of actions and b > 0 is a constant
independent of η and n.
[Hint: use the definition of a geometric random variable and remind yourself about “the union bound.”]
(d) Prove that, for a suitable choice of η, the worst-case expected regret of the FTPL algorithm is at
most b√(T ln n), where b > 0 is a constant independent of n and T.
Problem 18
In this problem we’ll show that there is no online algorithm for the online bipartite matching problem with
competitive ratio better than 1 − 1/e ≈ 63.2%.
Consider the following probability distribution over online bipartite matching instances. There are n
left-hand side vertices L, which are known up front. Let π be an ordering of L, chosen uniformly at random.
The n vertices of the right-hand side R arrive one by one, with the ith vertex of R connected to the last
n − i + 1 vertices of L (according to the random ordering π).
(a) Explain why OPT = n for every such instance.
(b) Consider an arbitrary deterministic online algorithm A. Prove that for every i ∈ {1, 2, . . . , n}, the
probability (over the choice of π) that A matches the ith vertex of L (according to π) is at most
min{ Σ_{j=1}^{i} 1/(n − j + 1), 1 }.
[Hint: for example, in the first iteration, assume that A matches the first vertex of R to the vertex
v ∈ L. Note that A must make this decision without knowing π. What can you say if v does not
happen to be the first vertex of π?]
(c) Prove that for every deterministic online algorithm A, the expected (over π) size of the matching
produced by A is at most
Σ_{i=1}^{n} min{ Σ_{j=1}^{i} 1/(n − j + 1), 1 },    (1)
and prove that (1) approaches n(1 − 1/e) as n → ∞.
[Hint: for the second part, recall that Σ_{j=1}^{d} 1/j ≈ ln d (up to an additive constant less than 1). For
what value of i is the inner sum roughly equal to 1?]
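The limit claimed in (c) can be checked numerically before proving it. A small sketch evaluating expression (1) (the function name is ours):

```python
import math

def matching_upper_bound(n):
    """Evaluate expression (1): sum over i of min{sum_{j<=i} 1/(n-j+1), 1}."""
    total = 0.0
    inner = 0.0   # running value of the inner sum for the current i
    for i in range(1, n + 1):
        inner += 1.0 / (n - i + 1)
        total += min(inner, 1.0)
    return total

# per part (c), the bound divided by n should approach 1 - 1/e ~ 0.632
print(matching_upper_bound(5000) / 5000)
```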
(d) Extend (c) to randomized online algorithms A, where the expectation is now over both π and the
internal coin flips of A.
[Hint: use the fact that a randomized online algorithm is a probability distribution over deterministic
online algorithms (as flipping all of A’s coins in advance yields a deterministic algorithm).]
(e) Prove that for every ε > 0 and (possibly randomized) online bipartite matching algorithm A, there
exists an input such that the expected (over A’s coin flips) size of A’s output is no more than 1 − 1/e + ε
times that of an optimal solution.
CS261: Problem Set #4
Due by 11:59 PM on Tuesday, March 8, 2016
Instructions:
(1) Form a group of 1-3 students. You should turn in only one write-up for your entire group.
(2) Submission instructions: We are using Gradescope for the homework submissions. Go to www.gradescope.com
to either login or create a new account. Use the course code 9B3BEM to register for CS261. Only one
group member needs to submit the assignment. When submitting, please remember to add all group
member names in Gradescope.
(3) Please type your solutions if possible and we encourage you to use the LaTeX template provided on
the course home page.
(4) Write convincingly but not excessively.
(5) Some of these problems are difficult, so your group may not solve them all to completion. In this case,
you can write up what you’ve got (subject to (3), above): partial proofs, lemmas, high-level ideas,
counterexamples, and so on.
(6) Except where otherwise noted, you may refer to the course lecture notes only. You can also review any
relevant materials from your undergraduate algorithms course.
(7) You can discuss the problems verbally at a high level with other groups. And of course, you are
encouraged to contact the course staff (via Piazza or office hours) for additional help.
(8) If you discuss solution approaches with anyone outside of your group, you must list their names on the
front page of your write-up.
(9) Refer to the course Web page for the late day policy.
Problem 19
This problem considers randomized algorithms for the online (integral) bipartite matching problem (as in
Lecture #14).
(a) Consider the following algorithm: when a new vertex w ∈ R arrives, among the unmatched neighbors
of w (if any), choose one uniformly at random to match to w.
Prove that the competitive ratio of this algorithm is strictly smaller than 1 − 1/e.
(b) The remaining parts consider the following algorithm: before any vertices of R arrive, independently
pick a number y_v uniformly at random from [0, 1] for each vertex v ∈ L. Then, when a new vertex
w ∈ R arrives, match w to its unmatched neighbor with the smallest y-value (or to no one if all its
neighbors are already matched).
For the analysis, when v and w are matched, define q_v = g(y_v) and q_w = 1 − g(y_v), where g(y) = e^{y−1}
is the same function used in Lecture #14.
Prove that with probability 1, at the end of the algorithm, Σ_{v∈L∪R} q_v equals the size of the computed
matching.
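The algorithm in (b) is straightforward to simulate; a sketch (the function name and data layout are ours; adj[w] lists, in arrival order, the L-neighbors of each online vertex w):

```python
import random

def ranking_matching(adj, n_left, rng=None):
    """Online matching via random y-values, as in part (b).
    Returns the matching as a dict {online vertex w: offline vertex v}."""
    rng = rng or random.Random(0)
    y = [rng.random() for _ in range(n_left)]
    matched = [False] * n_left
    matching = {}
    for w, neighbors in enumerate(adj):
        free = [v for v in neighbors if not matched[v]]
        if free:
            v = min(free, key=lambda u: y[u])  # smallest y-value wins
            matched[v] = True
            matching[w] = v
    return matching
```

Drawing the y-values up front makes this the RANKING algorithm: the random values induce a fixed priority order on L that every arrival consults.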
(c) Fix an edge (v, w) in the final graph. Condition on the choice of y_x for every vertex x ∈ L ∪ R \ {v};
q_v remains random. As a thought experiment, suppose we re-run the online algorithm
from scratch with v deleted (the rest of the input and the y-values stay the same), and let t ∈ L denote
the vertex to which w is matched (if any).
Prove that the conditional expectation of q_v (given q_x for all x ∈ L ∪ R \ {v}) is at least ∫_0^{y_t} g(z) dz.
(If t does not exist, interpret y_t as 1.)
[Hint: prove that v is matched (in the online algorithm with the original input, not in the thought
experiment) whenever y_v < y_t. Conditioned on this event, what is the distribution of y_v?]
(d) Prove that, conditioned on q_x for all x ∈ L ∪ R \ {v}, q_w ≥ 1 − g(y_t).
[Hint: prove that w is always matched (in the online algorithm with the original input) to a vertex
with y-value at most y_t.]
(e) Prove that the randomized algorithm in (b) is (1 − 1/e)-competitive, meaning that for every input, the
expected value of the computed matching (over the algorithm’s coin flips) is at least 1 − 1/e times the
size of a maximum matching.
[Hint: use the expectation of the q-values to define a feasible dual solution.]
Problem 20
A set function f : 2^U → R+ is monotone if f(S) ≤ f(T) whenever S ⊆ T ⊆ U. Such a function is submodular
if it has diminishing returns: whenever S ⊆ T ⊆ U and i ∉ T, then
f(T ∪ {i}) − f(T) ≤ f(S ∪ {i}) − f(S).    (1)
We consider the problem of, given a function f and a budget k, computing1
max_{S⊆U:|S|=k} f(S).    (2)
(a) Prove that the set coverage problem (Lecture #15) is a special case of this problem.
(b) Let G = (V, E) be a directed graph and p ∈ [0, 1] a parameter. Recall the cascade model from Lecture
#15:
• Initially the vertices in some set S are “active,” all other vertices are “inactive.” Every edge is
initially “undetermined.”
• While there is an active vertex v and an undetermined edge (v, w):
– with probability p, edge (v, w) is marked “active,” otherwise it is marked “inactive;”
– if (v, w) is active and w is inactive, then mark w as active.
Let f(S) denote the expected number of active vertices at the conclusion of the cascade, given that the
vertices of S are active at the beginning. (The expectation is over the coin flips made for the edges.)
Prove that f is monotone and submodular.
[Hint: prove that the condition (1) is preserved under convex combinations.]
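One run of the cascade is easy to simulate, and averaging many runs estimates f(S); a sketch (the function name and edge-list encoding are ours):

```python
import random

def cascade_run(edges, seeds, p, rng):
    """Single run of the cascade model: starting from the active set
    `seeds`, each edge (v, w) out of an active vertex is determined once,
    becoming active with probability p. Returns the number of active
    vertices at the end; f(S) is the expectation of this quantity."""
    out = {}
    for v, w in edges:
        out.setdefault(v, []).append(w)
    active = set(seeds)
    frontier = list(active)
    while frontier:
        v = frontier.pop()
        for w in out.get(v, []):     # edges out of v, determined now
            if rng.random() < p and w not in active:
                active.add(w)
                frontier.append(w)
    return len(active)
```

Each vertex enters `frontier` at most once, so each edge is determined at most once, matching the model.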
(c) Let f be a monotone submodular function. Define the greedy algorithm in the obvious way: at each
of k iterations, add to S the element that increases f the most. Suppose at some iteration the current
greedy solution is S and the algorithm decides to add i to S. Prove that
f(S ∪ {i}) − f(S) ≥ (1/k) · (OPT − f(S)),
where OPT is the optimal value in (2).
[Hint: If you added every element in the optimal solution to S, where would you end up? Then use
submodularity.]

1. Don’t worry about how f is represented in the input. We assume that it is possible to compute f(S) from S in a reasonable
amount of time.
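The greedy algorithm of (c) in code, exercised on a small coverage function (the special case from (a)); the names and the tiny instance are ours:

```python
def greedy_submodular(f, universe, k):
    """Greedy for max f(S) subject to |S| = k: at each of k iterations,
    add the element with the largest marginal gain f(S + e) - f(S)."""
    S = frozenset()
    for _ in range(k):
        best = max((e for e in universe if e not in S),
                   key=lambda e: f(S | {e}) - f(S))
        S = S | {best}
    return S

# coverage instance: f(S) = number of ground elements covered by sets in S
sets = {0: {1, 2, 3}, 1: {4, 5}, 2: {5}}
cover = lambda S: len(set().union(*(sets[i] for i in S))) if S else 0
```

On this instance greedy picks set 0 (gain 3), then set 1 (gain 2 versus 1 for set 2), covering all five elements.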
(d) Prove that for every monotone submodular function f, the greedy algorithm is a (1 − 1/e)-approximation
algorithm.
Problem 21
This problem considers the “{1, 2}” special case of the asymmetric traveling salesman problem (ATSP). The
input is a complete directed graph G = (V, E), with all n(n − 1) directed edges present, where each edge e
has a cost c_e that is either 1 or 2. Note that the triangle inequality holds in every such graph.
(a) Explain why the {1, 2} special case of ATSP is NP-hard.
(b) Explain why it’s trivial to obtain a polynomial-time 2-approximation algorithm for the {1, 2} special
case of ATSP.
(c) This part considers a useful relaxation of the ATSP problem. A cycle cover of a directed graph
G = (V, E) is a collection C_1, . . . , C_k of simple directed cycles, each with at least two edges, such that
every vertex of G belongs to exactly one of the cycles. (A traveling salesman tour is the special case
where k = 1.) Prove that given a directed graph with edge costs, a cycle cover with minimum total
cost can be computed in polynomial time.
[Hint: bipartite matching.]
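To see the correspondence the hint is pointing at: choosing a successor σ(v) ≠ v for every vertex, with all successors distinct, is exactly a perfect matching between V (as tails) and V (as heads), and the edges {(v, σ(v))} form a cycle cover. A brute-force sketch over permutations for tiny n (a polynomial-time solution replaces this loop with min-cost bipartite matching; the names are ours):

```python
from itertools import permutations

def min_cycle_cover(cost):
    """Cycle covers of a complete digraph correspond to permutations
    sigma with sigma(v) != v, where sigma(v) is v's successor on its
    cycle. Brute force over permutations, for illustration only."""
    n = len(cost)
    best, best_sigma = float("inf"), None
    for sigma in permutations(range(n)):
        if any(sigma[v] == v for v in range(n)):
            continue   # cycles need at least two edges: no self-loops
        c = sum(cost[v][sigma[v]] for v in range(n))
        if c < best:
            best, best_sigma = c, sigma
    return best, best_sigma
```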
(d) Using (c) as a subroutine, give a 3/2-approximation algorithm for the {1, 2} special case of the ATSP
problem.
Problem 22
This problem gives an application of randomized linear programming rounding in approximation algorithms.
In the uniform labeling problem, we are given an undirected graph G = (V, E), costs c_e ≥ 0 for all edges
e ∈ E, and a set L of labels that can be assigned to the vertices of V. There is a non-negative cost c_v^i ≥ 0 for
assigning label i ∈ L to vertex v ∈ V, and the edge cost c_e is incurred if and only if e’s endpoints are given
distinct labels. The goal of the problem is to assign each vertex a label so as to minimize the total cost.2
(a) Prove that the following is a linear programming relaxation of the problem:
min (1/2) Σ_{e∈E} c_e Σ_{i∈L} z_e^i + Σ_{v∈V} Σ_{i∈L} c_v^i x_v^i
subject to
Σ_{i∈L} x_v^i = 1    for all v ∈ V
z_e^i ≥ x_u^i − x_v^i    for all e = (u, v) ∈ E and i ∈ L
z_e^i ≥ x_v^i − x_u^i    for all e = (u, v) ∈ E and i ∈ L
z_e^i ≥ 0    for all e ∈ E and i ∈ L
x_v^i ≥ 0    for all v ∈ V and i ∈ L.
Specifically, prove that for every feasible solution to the uniform labeling problem, there is a corresponding
0-1 feasible solution to this linear program that has the same objective function value.

2. The motivation for the problem comes from image segmentation, generalizing the foreground-background segmentation
problem discussed in Lecture #4.
(b) Consider now the following algorithm. First, the algorithm solves the linear programming relaxation
above. The algorithm then proceeds in phases. In each phase, it picks a label i ∈ L uniformly at
random, and independently a number α ∈ [0, 1] uniformly at random. For each vertex v ∈ V that has
not yet been assigned a label, if α ≤ x_v^i, then we assign v the label i (otherwise it remains unassigned).
To begin the analysis of this randomized rounding algorithm, consider the start of a phase and suppose
that the vertex v ∈ V has not yet been assigned a label. Prove that (i) the probability that v is
assigned the label i in the current phase is exactly x_v^i/|L|; and (ii) the probability that it is assigned
some label in the current phase is exactly 1/|L|.
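The phase structure of the rounding algorithm, as a sketch (the function name and the dict encoding of the fractional solution x are our assumptions):

```python
import random

def round_lp(x, labels, rng):
    """Randomized rounding from (b): each phase draws a label i and a
    threshold alpha uniformly at random, and assigns label i to every
    still-unassigned vertex v with alpha <= x[v][i]. x is a feasible
    fractional solution, with x[v][i] summing to 1 over i for each v."""
    assignment = {}
    while len(assignment) < len(x):
        i = rng.choice(labels)
        alpha = rng.random()
        for v in x:
            if v not in assignment and alpha <= x[v][i]:
                assignment[v] = i
    return assignment
```

Termination holds with probability 1: each unassigned vertex has some x[v][i] ≥ 1/|L|, so every phase assigns it with probability at least 1/|L|^2.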
(c) Prove that the algorithm assigns the label i ∈ L to the vertex v ∈ V with probability exactly x_v^i.
(d) We say that an edge e is separated by a phase if both endpoints were not assigned prior to the phase,
and exactly one of the endpoints is assigned a label in this phase. Prove that, conditioned on neither
endpoint being assigned yet, the probability that an edge e is separated by a given phase is at most
(1/|L|) Σ_{i∈L} z_e^i.
(e) Prove that, for every edge e, the probability that the algorithm assigns different labels to e’s endpoints
is at most Σ_{i∈L} z_e^i.
[Hint: it might help to identify a sufficient condition for an edge e = (u, v) to not be separated, and to
relate the probability of this to the quantity Σ_{i∈L} min{x_u^i, x_v^i}.]
(f) Prove that the expected cost of the solution returned by the algorithm is at most twice the cost of an
optimal solution.
Problem 23
This problem explores local search as a technique for designing good approximation algorithms.
(a) In the Max k-Cut problem, the input is an undirected graph G = (V, E) and a nonnegative weight w_e
for each edge, and the goal is to partition V into at most k sets such that the sum of the weights of
the cut edges (edges with endpoints in different sets of the partition) is as large as possible. The
obvious local search algorithm for the problem is:
1. Initialize (S_1, . . . , S_k) to an arbitrary partition of V.
2. While there exists an improving move:
(a) Choose an arbitrary improving move and execute it: move the vertex v from S_i to S_j.
[An improving move is a vertex v ∈ S_i and a set S_j such that moving v from S_i to S_j strictly
increases the objective function.]
Since each iteration increases the objective function value, this algorithm cannot cycle and eventually
terminates, at a “local maximum.”
Prove that this local search algorithm is guaranteed to terminate at a solution with objective function
value at least (k − 1)/k times the maximum possible.
[Hint: prove the statement first for k = 2; your argument should generalize easily. Also, you might
find it easier to prove the stronger statement that the algorithm’s final partition has objective function
value at least (k − 1)/k times the sum of all the edge weights.]
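The local search procedure above is a few lines of code; a sketch (the names and the (u, v, weight) edge-list encoding are ours):

```python
def local_search_max_cut(edges, n, k, parts=None):
    """Local search for Max k-Cut, as in part (a): repeatedly move one
    vertex to a different part while this strictly increases the total
    weight of the cut edges. Returns (partition, cut weight)."""
    parts = parts if parts is not None else [v % k for v in range(n)]

    def cut_weight(assign):
        return sum(w for u, v, w in edges if assign[u] != assign[v])

    improved = True
    while improved:
        improved = False
        for v in range(n):
            for j in range(k):
                if j == parts[v]:
                    continue
                old, cur = parts[v], cut_weight(parts)
                parts[v] = j
                if cut_weight(parts) > cur:
                    improved = True      # keep the improving move
                else:
                    parts[v] = old       # revert
    return parts, cut_weight(parts)
```

Each accepted move strictly increases the cut weight, so with rational weights the loop terminates at a local maximum, the object the problem asks you to analyze.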
(b) Recall the uniform labeling problem from Problem 22. We now give an equally good approximation
algorithm based on local search.
Our local search algorithm uses the following local move. Given a current assignment of labels to
vertices in V, it picks some label i ∈ L and considers the minimum-cost i-expansion of the label i; that
is, it considers the minimum-cost assignment of labels to vertices in V in which each vertex either keeps
its current label or is relabeled with label i (note that all vertices currently with label i do not change
their label). If the cost of the labeling from the i-expansion is cheaper than the current labeling, then
we switch to the labeling from the i-expansion. We continue until we find a locally optimal solution;
that is, an assignment of labels to vertices such that every i-expansion can only increase the cost of
the current assignment.
Give a polynomial-time algorithm that computes an improving i-expansion, or correctly decides that
no such improving move exists.
[Hint: recall Lecture #4.]
(c) Prove that the local search algorithm in (b) is guaranteed to terminate at an assignment with cost at
most twice the minimum possible.
[Hint: the optimal solution suggests some local moves. By assumption, these are not improving. What
do these inequalities imply about the overall cost of the local minimum?]
Problem 24
This problem considers a natural clustering problem, where it’s relatively easy to obtain a good approximation
algorithm and a matching hardness of approximation bound.
The input to the metric k-center problem is the same as that in the metric TSP problem: a complete
undirected graph G = (V, E) where each edge e has a nonnegative cost c_e, and the edge costs satisfy the
triangle inequality (c_uv + c_vw ≥ c_uw for all u, v, w ∈ V). Also given is a parameter k. Feasible solutions
correspond to choices of k centers, meaning subsets S ⊆ V of size k. The objective function is to minimize
the furthest distance from a point to its nearest center:
min_{S⊆V : |S|=k} max_{v∈V} min_{s∈S} c_sv.    (3)
We’ll also refer to the well-known NP-complete Dominating Set problem, where given an undirected
graph G and a parameter k, the goal is to decide whether or not G has a dominating set of size at most k.3
(a) (No need to hand in.) Let OPT denote the optimal objective function value in (3). Observe that OPT
equals the cost c_e of some edge, which immediately narrows down its possible values to a set of
(n choose 2) different possibilities (where n = |V |).
(b) Given an instance G to the metric k-center problem, let G_D denote the graph with vertices V and
with an edge (u, v) if and only if the edge cost c_uv in G is at most 2D. Prove that if we can efficiently
compute a dominating set of size at most k in G_D, then we can efficiently compute a solution to the
k-center instance that has objective function value at most 2D.
(c) Prove that the following greedy algorithm computes a dominating set in G_OPT with size at most k:
– S = ∅
– While S is not a dominating set in G_OPT:
∗ Let v be a vertex that is not in S and has no neighbor in S (there must be one, by the
definition of a dominating set) and add v to S.
[Hint: the optimal k-center solution partitions the vertex set V into k “clusters,” where the ith group
consists of those vertices for which the ith center is the closest center. Argue that the algorithm above
never picks two different vertices from the same cluster.]
(d) Put (a)–(c) together to obtain a 2-approximation algorithm for the metric k-center problem. (The
running time of your algorithm should be polynomial in both n and k.)
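Parts (a)–(d) assemble into the following 2-approximation; a sketch (the names are ours; cost is a symmetric matrix of metric distances):

```python
def k_center_2approx(cost, k):
    """2-approximation assembled from (a)-(d): for each candidate value D
    (an edge cost, in increasing order), run the greedy of (c) on the
    threshold graph G_D (edges of cost <= 2D); the first D for which the
    greedy set has size <= k gives a dominating set, hence value <= 2D."""
    n = len(cost)
    for D in sorted({cost[u][v] for u in range(n) for v in range(u + 1, n)}):
        S = []
        for v in range(n):
            # v joins S only if it has no neighbor in S within G_D
            if all(cost[v][s] > 2 * D for s in S):
                S.append(v)
        if len(S) <= k:
            return S, 2 * D     # S dominates G_D: every vertex within 2D
    return list(range(n)), 0
```

Since candidate values are tried in increasing order, the first success has D ≤ OPT by part (c), so the returned value is at most 2·OPT.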
(e) Using a reduction from the Dominating Set problem, prove that for every ε > 0, there is no (2 − ε)-
approximation algorithm for the metric k-center problem, unless P = NP.
[Hint: look to our reduction to TSP (Lecture #16) for inspiration.]

3. A dominating set is a subset S ⊆ V of vertices such that every vertex v ∈ V either belongs to S or has a neighbor in S.