Video Mosaics

Steve L. Martin, Charless Fowlkes, Alexander Berg
U.C. Berkeley, Fall 2003


Abstract
We present a novel method for creating a video mosaic, a two-dimensional arrangement of video tiles placed in a manner that resembles a larger video. As part of this method, we develop several metrics used to measure aspects of video for comparison. In addition, we seek to achieve good results without excessive post-processing of the mosaic video.


1. Introduction
Methods of creating image mosaics, or images made up of smaller images that suggest a larger overall picture, have been well researched and published in the past. In addition, there has also been some previous work on video mosaics, which extend the idea of image mosaics from images made up of smaller images to videos made up of smaller videos.

Our goal is to address the problem of creating video mosaics in a new manner that does not rely on post-production color enhancement for visual accuracy. To do this, we seek to solve the problem of finding excellent spatial and temporal matches for a block of video through the comparison of features that describe that video. The methods for both computing and using these features to create accurate video mosaics are the major contributions of this project. In addition, we hope to create and present some exciting example videos using the results of our research.

The organization of this preliminary write-up is as follows: Section 2 will give a brief summary of previous work on creating video mosaics. Section 3 discusses the overall method used to create a video mosaic. Section 4 gives specifics on the metrics used to compare video. Section 5 presents some preliminary results of what we have implemented so far. Section 6 gives some suggestions for future work that have been discussed. Finally, Section 7 concludes this document with some thoughts on video mosaics and some of the challenges overcome during the course of this project.


2. Related Work


Methods for creating mosaics made up of many smaller images arranged in a manner that suggests a larger picture have been well researched in the past. Artists such as Chuck Close, Salvador Dali, and Arthur Mole create these mosaics by hand, arranging hundreds and often thousands of tiles to achieve this effect. More recently, several papers on this topic appeared in the 1990s at the SIGGRAPH graphics conference, and soon after a company was founded to produce mosaics commercially.

As with image mosaics, video mosaics themselves are also not a new idea. A recent commercial for Wells Fargo[1] includes an example of video mosaics that looks extremely interesting, although we believe it was done by hand. Klein et al [2] at Princeton University developed a method for creating video mosaics algorithmically in much in the same manner photo mosaics are assembled. The method implemented by Klein involves matching the tiles on color and then over time. The first step they take is to do a wavelet decomposition on each color channel of each frame in each video tile, and then for each color color channel store in a database the average color over the frame and the indices and signs of the thirty largest-magnitude coefficients in that channel. This wavelet decomposition is then performed on each frame of each tile of the source video, and the database searched for the sequence of frames with the most similar values using dynamic programming. After the video mosaic is assembled, post-processing is performed on each pixel, altering its color to better match the pixel it replaced in the original movie.

In addition, to the NPAR 2002 paper, there has also been previous work on content-based video retrieval, which involves coming up with good metrics to describe movie content for multimedia database queries. A variety of methods have been proposed in this field. Although none of these ideas have been implemented as of yet in this project, they may be useful in the future.


3. Overall Method


The method we use for the creation of a video mosaic consists of two steps. First, once a repository of replacement videos has been assembled, it is processed by a database generation program to create a searchable file of features describing each video tile. The nest step is to actually assemble a mosaic from a source movie using that database.  Figure 1 gives a visual description of the entire process.
Figure 1. A visual explanation of the mosaic creation process.

As can be inferred from figure 1, the ability of our program to create video mosaics is highly dependant on the contents of the video repository. Thanks to a generous contribution by RocketClips Inc, a stock video company, we currently have a repository of roughly 165 gigabytes of high quality video broken up into more than 3,000 clips of around 15 seconds each. This is an enormous amount of data to process; for these preliminary results, we use around 10% of the available repository.

When generating our database, we further subdivide each video in the repository into tiles of arbitrary size. Each tile is then divided even further into sections of frames. To make the mosaic more visually interesting, we made the tiles an average length of 30 frames (one second of video) plus or minus 10 frames. Then, the program runs four separate metrics on each tile, generating features describing that piece of video which are then recorded in the database. The database is organized such that each tile has its own row, and each column is either the result or part of the result of a feature describing that tile. The program that generates the database is separate from the mosaic application, and several copies can be run in parallel over any repository using command line switches to divide the work. The program terminates once all video it has been told to process has been finished.

Once the database has been generated, the next step is to actually create the mosaic from a source video. After loading the database into memory for faster searching, the mosaic creation program opens and decompresses the movie to mosaic. Next the original video is divided up into tiles of the same size as those in the database. Each tile is then processed in 30 frame segments using the same metrics library the database generation program used to build the database. Once a vector of features has been built, the database searched for the best match among all the possible replacements. After a tile is selected for the mosaic it replaces a tile of the source video. 


4. Video Comparison Metrics


We came up with four metrics to generate the set of features used to describe each tile of video. These are average color, color histograms, edge histograms, and energy histograms. Together, these features form a 33-dimensional vector that describes a given video tile.

4.1 Average Color

The first metric calculated is the average color of a video tile.  To generate this value, we sum up the color values for all pixels in all frames of the tile and divide to get the average color.

By itself, the average color of a tile does not do a that bad of a job of indicating which tiles should be placed where.  However, it can be easily fooled, and its accuracy as a descriptive metric is highly dependant on the number of frames in the video tile.  As an example, a movie that pans from a solid red element to a solid green element will have an average color of brown; were this tile to actually be chosen to replace a tile with a lot of brown on the original video, it would look very incorrect.

4.2 Color Histograms

While average color is a good starting point, we also want a measure of the frequency of colors in the tile.  To this end, we create a histogram of the colors used in a video tile.

To calculate this histogram, the color of each pixel if each frame of the video is converted from RBG space to HSB space.  The hue portion of the color value is then recorded, and a the values placed into an arbitrary number of histogram bins on a per-frame basis. The number of bins is usually 10, although this is changeable from the command line.  The frame histograms are then summed and normalized over the entire video tile.

Color histograms, when combined with average color, help catch a lot of the cases that average color does poorly on.  However, this metric can also be fooled; figure 2 gives an example of this, also comparing the method to the results of straight average color.

Figure 2.  Squares a) and b) have the exact same average color, and also the same color histogram values.  Squares c) and d) have the exact same average color, but very different color histograms.  This shows how a color histogram feature would help determine between frames like c) and d), whereas still more information is needed to be able to determine between frames like a) and b).


4.3 Edge Histograms

In addition to color metrics, we also wish to take edges into account.  The method we use to do this involves creating a histogram of edge data.  Like color histograms, edge histograms are created per frame, and then summed and normalized over the entire video tile.

The first step in creating an edge histogram given a frame of a video tile is to discard color information; edges are preserved with image intensity alone.  Next, the frame is resized down to ten by ten pixels using a bicubic resize algorithm.  The bicubic resize was implemented so that the image is blurred slightly as it is resized, which will help eliminate noise from the resulting histogram.  Finally, for every pixel in the resized image, the gradient in the x and y directions is measured.  If the energy of this pixel (the sum of the squares of the gradients in each direction) is above a certain threshold, then the arctangent of these values is calculated and then placed into a bin for the histogram of the current frame. The histograms are then summed and normalized over the entire video tile.  The thresholding is done to prevent noise in the histogram caused by regions of close to the same intensity; these aren't really edges, so we remove them from the histogram.  Figure 3 gives a visual example of this process over a given pixel in a frame.  Figure 4 shows the output of the gradient arctangent function over two dimensional input consisting of a black dot.

Figure 3. A graphic explanation of how an edge histogram is calculated.  The quantity that is placed into the histogram is theta, the arctangent of the ratio of the gradient in the y direction over the gradient in the x direction.   Figure 4. The output of theta in figure 3 plotted over an image of a black dot.  The white circle shows areas where theta is high, which corresponds exactly to edges in the image.

Including edge histograms in our vector or features appreciably helps the quality of the resulting mosaic.  However, they are far from perfect, as only the quantity and rough orientation of edges is matched.  An improvement that is currently being implemented to help place tiles more intelligently is to subdivide the resized tile into quarters, and then to histogram each one separately.  This would improve the orientation of the tile when it is placed into the source video.

4.4 Energy Histograms

Finally, along with the color and edge metrics, we also consider the energy of a given tile when performing matching.  The energy of a tile is the sum of the sums of the squared gradients in each direction  for every pixel in the movie.  This gives a measure of how 'busy' the movie is.  This captures another important visual aspect of each tile.

To create an energy histogram, we consider each frame of the tile separately and resize it in exactly the same way as we do when creating edge histograms.  For every pixel in the resized image, the gradient in the x and y directions is measured and the energy of this pixel (the sum of the squares of the gradients in each direction) calculated.  If the energy is above a certain threshold, then it is recorded in the histogram for this frame.   The histograms are then summed and normalized over the entire video tile.

Although none of the preliminary video results for this write-up use energy histograms, they have been added to the mosaic creation programs, and initial tests show that they do help improve the quality of the results.  They will be incorporated into the next round of examples once the mosaic database is regenerated.


5. Preliminary Results


This section will present some preliminary results that were created using a recent version of the mosaic software.  These examples were created using 283 clips from the stock video repository provided by Mark Adams of RocketClips, Inc.  

5.1 Single Frame Results

To show some clear results of the performance of our metrics, a set of single frame mosaics were generated.  Figures 5 through 8 show the original frame of a random movie clip from the video repository, the results of running the mosaic program using all the metrics discussed in section 4, the results of using just the average color, and the results of using just the color histogram metric; mosaics generated using just the other metrics were not visually interesting. These images were created using a database generated with just the first frame of 283 videos from our stock video library.  Each frame was then subdivided into twenty by twenty pixel tiles, giving 864 tiles per video for a repository size of about 300,000 possible replacement tiles.  Generating this database took about 30 minutes of computation on a 3.0 gigahertz Intel Pentium 4 with one gigabyte of RAM, and created a 76.5 megabyte file.

Figures 5-8.  Results of creating mosaics with just a single frame of video.  The top left image is the original frame.  The top right is a mosaic created using all of the metrics presented in section 3.  The bottom left image is created using just average color, and the bottom right with just color histograms.


5.2 Video Results

To generate video mosaics with more than one frame, a new database was created with 283 videos from our stock video library.  Subdividing these clips into thirty by thirty pixel tiles with a  mean length of thirty frames yields about 2000 tiles per video clip, giving the mosaic creation program a total of roughly half a million video tiles to choose from.  Indexing these clips took about 36 hours and generated a 296 megabyte database on a 3.0 gigahertz Intel Pentium 4 with one gigabyte of RAM.

Table 1 presents links to video mosaic results.  All of the movies listed in table 1 were encoded with DivX 5.0 and have no audio.  

Table 1. Video results using all tile metrics.
Video Name Original Video Mosaic Video Side-by-Side Comparison Video
Kayaak.avi 1.64 MB 3.54 MB 1.99 MB
Necklace.avi 978 KB 1.85 MB 1.18 MB
Woman.avi 317 KB 1.54 MB 961 KB


Note that these results are still preliminary.  They present a proof of concept that thus far our program can create reasonable mosaics from a variety of source material, but there is much to be done in order to present results on par with previous research without post-processing.


6. Future Work


The purpose of this preliminary write-up is to present the initial results of our work. There remains much more research to be done, both in terms of the improving quality of results and in enhancing or adding to the effects made possible by the video mosaic technique.

The single frame examples demonstrate that the metrics we have come up with are capable of generating good mosaics. However, once the time dimension is increased to beyond 1 frame, the results break down. In fact, the more frames there are in a tile used to actually create a mosaic, in general the worse the results become. This means that first item on the list of improvements is to come up with some method of choosing movie frames by somehow taking into account how much the tiles change over time. It might be that the database generation and search is overhauled and some kind of dynamic programming used to find tile matches on a per-frame basis.

Aside from improvements to existing work, we strongly believe that the surface has barely been scratched in terms of the sheer number of interesting visual effects this project could generate.  In addition, new metrics are being discussed to improve results, as well as different ways of improving the metrics already in use. 

That said, here are some brief ideas I think would generate interesting visual effects:

  • Include a metric that takes edges with neighboring tiles into account, or perhaps bend a bit and implement some post-process smoothing. This would reduce the 'compression artifact' effect given by Princeton's videos.
  • Be able to place multiple-sized video tiles, some overlapping others. A 'patchwork' effect could look amazing.
  • Allow the video tiles to move--i.e.. have the videos tiles shift to maintain good matches while playing.  This could give some very interesting effects as well.

7. Conclusion
While there has been a lot of progress on this project over this semester, its obvious that there remains much more to do. The project framework is solid; there exists a solid library of tools implemented for creating these databases and mosaics from different sources.  However, the actual research, while exciting, is in its infancy.

Some of the challenges that have been overcome during the course of this project include obtaining a large library of video to use when making mosaics, building a multipurpose video library on top of the QuickTime SDK that can handle video of any compression type and size, building an image library from the ground up that can do specialized comparisons, bicubic resize, and a few other operations, building a database library to create and handle large database files, and actually writing programs that can process terabytes of video information.

In short, much of the base has been written; we are now actively engaged in the research. We anticipate work on this extending well into next semester, and anticipate a high ceiling on the results.


Acknowledgements
We would like to thank Mark Adams of RocketClips, Inc. for his most generous contribution of stock video footage for our project.


References
[1] Wells Fargo, Wells Fargo Video Mosaic Ad

[2] Allison W. Klein, Tyler Grant, Adam Finkelstein and Michael F. Cohen. Video Mosaics. NPAR 2002: Second International Symposium on Non Photorealistic Rendering. pp. 21-28, June 2002.

Various content based image/video retrieval papers (still under discussion)

Various websites on Image Mosaics

Various websites on the QuickTime and Video for Windows SDKs.

CS283 Final Presentation (slides)