[POW] A Low-Rank Approximation Approach to Learning Joint Embeddings of News Stories and Images for Timeline Summarization

Link: https://www.cs.cmu.edu/~yww/papers/naacl2016.pdf

Authors: William Yang Wang, Yashar Mehdad, Dragomir R. Radev, Amanda Stent


The paper tackles the problem of timeline summarization: extracting the milestones of a news story and arranging them in temporal order. It casts this as a recommendation problem and solves it with a low-rank approximation method.

Novelty (to me)

Problem formulation: timeline summarization is formulated as a sentence recommendation task. Specifically, for a given period of time, the job is to recommend the sentences most related to the events happening during that period. With this formulation, matrix factorization is a natural choice of solution.

Visual information can be useful on top of textual information. Yahoo! image searches are performed on sentences of news articles to filter related images. Those images are converted into visual features by a convolutional neural network.

Extending the notion of context: “we shall know a word by the company it keeps.” Context should not be limited to text; it can be anything that helps infer a word’s meaning.

Notable details

Feature extraction: importance score (computed by comparing with a human summary), word features, event features (SVO events from dependency parses), time features, and CNN-based features of the top related image.
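
To make the layout concrete for myself, here is a minimal sketch of how one sentence could become one row of the matrix, with the importance score in the first column. The function name `build_row`, the toy feature values, and the use of NaN plus a boolean `mask` for the missing score are all my own assumptions, not details taken from the paper.

```python
import numpy as np

def build_row(importance, word_feats, event_feats, time_feats, image_feats):
    """Concatenate one sentence's feature groups into a single row.

    Hypothetical helper: the importance score goes in column 0 and is
    stored as NaN when it is unknown (e.g. for a test-time sentence).
    """
    first = np.nan if importance is None else float(importance)
    return np.concatenate([[first], word_feats, event_feats, time_feats, image_feats])

# Toy values, made up purely for illustration.
rows = [
    build_row(0.8, [1, 0, 1], [1, 0], [1, 0], [0.2, 0.7]),   # training sentence
    build_row(None, [0, 1, 1], [0, 1], [0, 1], [0.5, 0.1]),  # test sentence, no score
]
M = np.vstack(rows)   # sentence-by-feature matrix
mask = ~np.isnan(M)   # True where an entry is observed
```

The only entry marked unobserved is the test sentence's importance score, which is the value the factorization is supposed to fill in.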

Matrix factorization objective: least squares with an L2 penalty. Each row and each column has its own representation vector.
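
As a rough sketch of what such an objective looks like in code, the snippet below fits row and column vectors to the observed entries of a matrix with a squared-error loss and an L2 penalty. Plain gradient descent, the function name `factorize`, and the hyperparameters `rank`, `lam`, `lr`, and `epochs` are my own choices for illustration; I am not claiming this is the optimizer or the settings used in the paper.

```python
import numpy as np

def factorize(M, mask, rank=2, lam=0.01, lr=0.05, epochs=2000, seed=0):
    """Fit M ~ R @ C.T on observed entries (mask == 1) with an L2 penalty.

    Minimizes 0.5 * sum(mask * (R @ C.T - M)**2)
              + 0.5 * lam * (||R||^2 + ||C||^2)
    by full-batch gradient descent (an assumed, simple optimizer).
    """
    rng = np.random.default_rng(seed)
    n_rows, n_cols = M.shape
    R = 0.01 * rng.standard_normal((n_rows, rank))  # one vector per row
    C = 0.01 * rng.standard_normal((n_cols, rank))  # one vector per column
    for _ in range(epochs):
        err = mask * (R @ C.T - M)      # error on observed entries only
        grad_R = err @ C + lam * R
        grad_C = err.T @ R + lam * C
        R -= lr * grad_R
        C -= lr * grad_C
    return R, C
```

Calling `R, C = factorize(M, mask)` gives one embedding per row in `R` and one per column in `C`, and `R @ C.T` fills in the unobserved cells. Unobserved cells of `M` should hold a finite placeholder such as 0, since they are multiplied by the mask rather than skipped.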

How to produce a timeline? At training time, they learn the vector representations of the rows and columns. At test time, for each period of time, they extract features for the sentences, except the importance score, which is hallucinated by taking the dot product of the sentence’s embedding vector and the first column’s embedding vector. They then use the predicted importance score to rank the sentences and pick the top ones (with their images) for the timeline. However, it is unclear to me how they obtain the embedding of a new sentence.
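
The ranking step as I understand it can be sketched in a few lines. Because I do not know how the embedding of a new sentence is obtained, the sketch simply takes the sentence embeddings as given; `rank_sentences`, `top_k`, and the toy vectors are hypothetical names and values of my own.

```python
import numpy as np

def rank_sentences(sentence_vecs, importance_col_vec, top_k=2):
    """Score each sentence by the dot product of its embedding with the
    embedding of the importance-score column, then return the indices of
    the top_k highest-scoring sentences along with all scores."""
    scores = sentence_vecs @ importance_col_vec
    order = np.argsort(-scores)[:top_k]   # indices sorted by descending score
    return order.tolist(), scores

# Toy embeddings for three candidate sentences in one time period.
sentence_vecs = np.array([[0.9, 0.1],
                          [0.2, 0.4],
                          [0.5, 0.5]])
importance_col_vec = np.array([1.0, 0.5])  # embedding of the first column

top, scores = rank_sentences(sentence_vecs, importance_col_vec)
print(top)  # [0, 2]
```

Here the predicted scores are 0.95, 0.40, and 0.75, so sentences 0 and 2 would go on the timeline for this period.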

Last words

Timeline extraction is a very interesting application for the current era. The technique in this paper is similar to skip-gram and other context-based representation learning methods, but the notion of context includes visual cues. The “results” section is a little short for me. I wish they had explained in more detail why and how visual context helps, since this is the main point of the paper. Also, are the sentence representations meaningful?