Video Mosaics Steve L. Martin, Charless Fowlkes, Alexander Berg Abstract We present a novel method for creating a video mosaic, a two-dimensional arrangement of video tiles placed in a manner that resembles a larger video. As part of this method, we develop several metrics used to measure aspects of video for comparison. In addition, we seek to achieve good results without excessive post-processing of the mosaic video. 1. Introduction Methods of creating image mosaics, or images made up of smaller images that suggest a larger overall picture, have been well researched and published in the past. In addition, there has also been some previous work on video mosaics, which extend the idea of image mosaics from images made up of smaller images to videos made up of smaller videos. Our goal is to address the problem of creating video mosaics in a new manner that does not rely on post-production color enhancement for visual accuracy. To do this, we seek to solve the problem of finding excellent spatial and temporal matches for a block of video through the comparison of features that describe that video. The methods for both computing and using these features to create accurate video mosaics are the major contributions of this project. In addition, we hope to create and present some exciting example videos using the results of our research. The organization of this preliminary write-up is as follows: Section 2 will give a brief summary of previous work on creating video mosaics. Section 3 discusses the overall method used to create a video mosaic. Section 4 gives specifics on the metrics used to compare video. Section 5 presents some
preliminary results of what we have implemented so far. Section 6 gives some suggestions for future work that have been discussed. Finally, Section 7 concludes this document with some thoughts on video mosaics and some of the challenges overcome during the course of this project.
Methods for creating mosaics made up of many smaller images arranged in a manner that suggests a larger picture have been well researched in the past. Artists such as Chuck Close, Salvador Dali, and Arthur Mole create these mosaics by hand, arranging hundreds and often thousands of tiles to achieve this effect. More recently, several papers on this topic appeared in the 1990s at the SIGGRAPH graphics conference, and soon after a company was founded to produce mosaics commercially. As with image mosaics, video mosaics themselves are also not a new idea. A recent commercial for Wells Fargo[1] includes an example of video mosaics that looks extremely interesting, although we believe it was done by hand. Klein et al [2] at Princeton University developed a method for creating video mosaics algorithmically in much in the same manner photo mosaics are assembled. The method implemented by Klein involves matching the tiles on color and then over time. The first step they take is to do a wavelet decomposition on each color channel of each frame in each video tile, and then for each color color channel store in a database the average color over the frame and the indices and signs of the thirty largest-magnitude coefficients in that channel. This wavelet decomposition is then performed on each frame of each tile of the source video, and the database searched for the sequence of frames with the most similar values using dynamic programming. After the video mosaic is assembled, post-processing is performed on each pixel, altering its color to better match the pixel it replaced in the original movie. In addition, to the NPAR 2002 paper, there has also been previous work on content-based video retrieval, which involves coming up with good metrics to describe movie content for multimedia database queries. A variety of methods have been proposed in this field. Although none of these ideas have been implemented as of yet in this project, they may be useful in the future.
The method we use for the creation of a video mosaic consists of two steps. First, once a repository of replacement videos has been assembled, it is processed by a database generation program to create a searchable file of features describing each video tile. The nest step is to actually assemble a mosaic from a source movie using that database. Figure 1 gives a visual description of the entire process.
As can be inferred from figure 1, the ability of our program to create video mosaics is highly dependant on the contents of the video repository. Thanks to a generous contribution by RocketClips Inc, a stock video company, we currently have a repository of roughly 165 gigabytes of high quality video broken up into more than 3,000 clips of around 15 seconds each. This is an enormous amount of data to process; for these preliminary results, we use around 10% of the available repository. When generating our database, we further subdivide each video in the repository into tiles of arbitrary size. Each tile is then divided even further into sections of frames. To make the mosaic more visually interesting, we made the tiles an average length of 30 frames (one second of video) plus or minus 10 frames. Then, the program runs four separate metrics on each tile, generating features describing that piece of video which are then recorded in the database. The database is organized such that each tile has its own row, and each column is either the result or part of the result of a feature describing that tile. The program that generates the database is separate from the mosaic application, and several copies can be run in parallel over any repository using command line switches to divide the work. The program terminates once all video it has been told to process has been finished. Once the database has been generated, the next step is to actually create the mosaic from a source video. After loading the database into memory for faster searching, the mosaic creation program opens and decompresses the movie to mosaic. Next the original video is divided up into tiles of the same size as those in the database. Each tile is then processed in 30 frame segments using the same metrics library the database generation program used to build the database. Once a vector of features has been built, the database searched for the best match among all the possible replacements. After a tile is selected for the mosaic it replaces a tile of the source video.
We came up with four metrics to generate the set of features used to describe each tile of video. These are average color, color histograms, edge histograms, and energy histograms. Together, these features form a 33-dimensional vector that describes a given video tile. 4.1 Average Color The first metric calculated is the average color of a video tile. To generate this value, we sum up the color values for all pixels in all frames of the tile and divide to get the average color. By itself, the average
color of a tile does not do a that bad of a job of indicating which tiles
should be placed where. However, it can be easily fooled, and its
accuracy as a descriptive metric is highly dependant on the number of
frames in the video tile. As an example, a movie that pans from a solid red element to a solid green element will have an
average color of brown; were this tile to actually be chosen to replace
a tile with a lot of brown on the original video, it would look very
incorrect. To calculate this histogram, the color of each pixel if each frame of the video is converted from RBG space to HSB space. The hue portion of the color value is then recorded, and a the values placed into an arbitrary number of histogram bins on a per-frame basis. The number of bins is usually 10, although this is changeable from the command line. The frame histograms are then summed and normalized over the entire video tile. Color histograms, when combined with average color,
help catch a lot of the cases that average color does poorly on.
However, this metric can also be fooled; figure 2 gives an example of
this, also comparing the method to the results of straight average color.
The first step in creating an edge histogram given a frame of a
video tile is to discard color information; edges are preserved with image intensity alone. Next, the frame is resized down
to ten by ten pixels using a bicubic resize algorithm. The bicubic
resize was implemented so that the image is blurred slightly
as it is resized, which will help eliminate noise from the resulting
histogram. Finally, for every pixel in the resized image, the
gradient in the x and y directions is measured. If the energy of
this pixel (the sum of the squares of the gradients in each direction)
is above a certain threshold, then the arctangent of these values is calculated and then placed into a bin for the histogram of the current frame. The histograms are then summed and normalized over the
entire video tile. The thresholding is done to prevent noise in the
histogram caused by regions of close to the same intensity; these aren't
really edges, so we remove them from the histogram. Figure 3
gives a visual example of this process over a given pixel in a
frame. Figure 4 shows the output of the gradient arctangent
function over two dimensional input consisting of a black dot.
Including edge histograms in our vector or features
appreciably helps the quality of the resulting mosaic. However,
they are far from perfect, as only the quantity and rough orientation of edges
is matched. An improvement that is currently being implemented
to help place tiles more intelligently is to subdivide the resized tile
into quarters, and then to histogram each one separately. This
would improve the orientation of the tile when it is placed into the source
video.
To create an energy histogram, we consider each frame of the tile separately and resize it in exactly the same way as we do when creating edge histograms. For every pixel in the resized image, the gradient in the x and y directions is measured and the energy of this pixel (the sum of the squares of the gradients in each direction) calculated. If the energy is above a certain threshold, then it is recorded in the histogram for this frame. The histograms are then summed and normalized over the entire video tile. Although none of the preliminary video results for this
write-up use energy histograms, they have been added to the mosaic
creation programs, and initial tests show that they do help improve the
quality of the results. They will be incorporated into the next
round of examples once the mosaic database is regenerated.
This section will present some preliminary results that were created using a recent version of the mosaic software. These examples were created using 283 clips from the stock video repository provided by Mark Adams of RocketClips, Inc. 5.1 Single Frame Results To show some clear results of the performance of our metrics, a set of single frame mosaics were generated. Figures 5 through 8 show the original frame of a random movie clip from the video repository, the results of running the mosaic program using all the metrics discussed in section 4, the results of using just the average color, and the results of using just the color histogram metric; mosaics generated using just the other metrics were not visually interesting. These images were created using a database generated with just the first frame of 283 videos from our stock video library. Each frame was then subdivided into twenty by twenty pixel tiles, giving 864 tiles per video for a repository size of about 300,000 possible replacement tiles. Generating this database took about 30 minutes of computation on a 3.0 gigahertz Intel Pentium 4 with one gigabyte of RAM, and created a 76.5 megabyte file.
5.2 Video Results To generate video mosaics with more than one frame, a new database was created with 283 videos from our stock video library. Subdividing these clips into thirty by thirty pixel tiles with a mean length of thirty frames yields about 2000 tiles per video clip, giving the mosaic creation program a total of roughly half a million video tiles to choose from. Indexing these clips took about 36 hours and generated a 296 megabyte database on a 3.0 gigahertz Intel Pentium 4 with one gigabyte of RAM. Table 1 presents links to video mosaic results. All of the movies listed in table 1 were encoded with DivX
5.0 and have no audio.
The purpose of this preliminary write-up is to present the initial results of our work. There remains much more research to be done, both in terms of the improving quality of results and in enhancing or adding to the effects made possible by the video mosaic technique. The single frame examples demonstrate that the metrics we have come up with are capable of generating good mosaics. However, once the time dimension is increased to beyond 1 frame, the results break down. In fact, the more frames there are in a tile used to actually create a mosaic, in general the worse the results become. This means that first item on the list of improvements is to come up with some method of choosing movie frames by somehow taking into account how much the tiles change over time. It might be that the database generation and search is overhauled and some kind of dynamic programming used to find tile matches on a per-frame basis. Aside from improvements to existing work, we strongly believe that the surface has barely been scratched in terms of the sheer number of interesting visual effects this project could generate. In addition, new metrics are being discussed to improve results, as well as different ways of improving the metrics already in use. That said, here are some brief ideas I think would generate interesting visual effects:
7. Conclusion While there has been a lot of progress on this project over this semester, its obvious that there remains much more to do. The project framework is solid; there exists a solid library of tools implemented for creating these databases and mosaics from different sources. However, the actual research, while exciting, is in its infancy. Some of the challenges that have been overcome during the course of this project include obtaining a large library of video to use when making mosaics, building a multipurpose video library on top of the QuickTime SDK that can handle video of any compression type and size, building an image library from the ground up that can do specialized comparisons, bicubic resize, and a few other operations, building a database library to create and handle large database files, and actually writing programs that can process terabytes of video information. In short, much of the base has been written; we are now actively engaged in the research. We anticipate work on this extending well into next semester, and anticipate a high ceiling on the results. Acknowledgements We would like to thank Mark Adams of RocketClips, Inc. for his most generous contribution of stock video footage for our project. References [1] Wells Fargo, Wells Fargo Video Mosaic Ad [2] Allison W. Klein, Tyler Grant, Adam Finkelstein and Michael F. Cohen. Video Mosaics. NPAR 2002: Second International Symposium on Non Photorealistic Rendering. pp. 21-28, June 2002. Various content based image/video retrieval papers (still under discussion) Various websites on Image Mosaics Various websites on the QuickTime and Video for Windows SDKs. CS283 Final Presentation (slides) |
||||||||||||||||||||||||||||||||||||