The Spotting by Association method is a novel method for detecting video segments with typical semantics. Video data conveys various kinds of information through continuous images, natural language, and sound. For videos to be stored and retrieved in a Digital Library, it is essential to segment the video data into meaningful pieces. To detect meaningful segments, we need to identify the segment in each modality (video, language, and sound) that corresponds to the same story. For this purpose, we propose a new method for establishing correspondences between image clues detected by image analysis and language clues detected by natural language analysis. As a result, relevant video segments with sufficient information from every modality are obtained. We applied our method to closed-captioned CNN Headline News; video segments containing important events, such as a public speech, meeting, or visit, are detected fairly well.
Digital Library, Video Analysis, Semantic Analysis, Situation Spotting, Image Analysis, Natural Language Analysis
Digital Libraries gather a large amount of video data for public or commercial use. The Informedia project [WKSS96] is one such Digital Library, in which news and documentary videos are stored. Its experimental system provides news and documentary video retrieval from user queries given as text or speech input.
Since the amount of data stored in these libraries is enormous, data presentation techniques are required in addition to efficient retrieval in order to show large amounts of data to the users. Suppose a user is looking for video portions in which the U.S. president gave a talk about peace in Ireland at some location. If the user simply asks for video segments related to ``Mr.Clinton'' and/or ``Ireland'' from news data in 1995 or 1996, hundreds of video segments may be retrieved, and it may take a considerable amount of time to find the right data in that set. In this sense, we need two kinds of data management: one is semantic organization and tagging of the data, and the other is data presentation that is structured and clearly understandable.
For this purpose, it is effective to convey the essence of a topic in terms of one to several representative pairs of image and language data, for example, three pairs of a picture and a sentence. In this selection, image and language data corresponding to the same portion of a story should be chosen. These segments are the portions which the film/TV producers want to report, and they are easily understandable even when shown independently. Therefore, detecting those segments and organizing video archives based on them will be an essential technique for digital video libraries.
So far, little research has dealt with this problem. It is common to give a topic explanation by using the first frame/image of the first cut/shot and the first sentence in a transcript. This representative pair is often a poor topic explanation, for example, an anchorperson's close-up with an overly general description. To cope with the image selection problem, Zhang et al. proposed a method for key-frame selection using several image features such as colors, textures, and temporal features including camera operations [ZLSW95]. Smith and Kanade proposed video skimming by selecting video segments based on TF-IDF, camera motion, human faces, captions on video, and so on [SK97]. By joining the selected segments, a new video which gives a rough idea of the topic is obtained. These are good techniques which are broadly applicable, since they do not require deep content analysis.
There are, however, still open problems to tackle. One is the semantic classification of each segment: for effective topic indexing or explanation, we need to know what a segment describes. Another is the correspondence problem between image and language. As mentioned above, we need to detect image and language data corresponding to the same portion of a story; if they are taken from different portions, the pair may be misleading to the users[1].
To handle these problems, we introduce the Spotting by Association method, which detects relevant video segments by associating image data and language data. This method aims to make the retrieval process more efficient and to allow for more sophisticated queries. First, we define language clues and image clues which are common in news videos, and introduce the basic idea of situation detection. Then, we describe inter-modal association between images and language. By this method, relevant video segments with sufficient information from every modality are obtained.
We applied our method to closed-captioned CNN Headline News, from which segments with typical important situations, such as public speeches, meetings, or visits, are detected fairly well.
[1] For example, a close-up of a victim's face and the name of the criminal.
When we watch a news video, we can understand topics at least partially, even if either images or audio is missing. For example, when we see an image like the one shown in Figure 1(a), we guess that someone's speech is the focus; a facial close-up and changes in lip shape are the basis of this assumption. Similarly, Figure 1(b) suggests that the news reports a car accident and the extent of the damage[2].
Figure 1: example news video frames (a)-(d).
However, video content extraction from language or image data alone is not reliable. Suppose that we are trying to detect a speech or lecture scene. Figure 1(c) is a face close-up, but it is a criminal's face, and the video portion is devoted to a crime report. The same can be said about the language portion. Suppose that we need to detect someone's opinion from a news video. A human can do this perfectly by reading the transcript and considering the context. However, current natural language processing techniques are far from human ability. For a sentence which starts with ``They say'', it is difficult to determine, without deep knowledge, whether the sentence mentions a rumor or reports an actually spoken opinion.
[2]Actually, the car was destroyed by a missile attack, not in a car accident.
From the above discussion, it is clear that the association between language and image is an important key to video content detection. Moreover, we believe that an important video segment must have mutually consistent image and language data. Based on this idea, we propose the ``Spotting by Association'' method, which detects important clues in each modality and associates them across modalities. This method has two advantages: detection becomes more reliable by utilizing both images and language, and data explained by both modalities is clearly understandable to the users.
For these clues, we introduce several categories which are common in news videos: for language, SPEECH/OPINION, MEETING/CONFERENCE, CROWD, VISIT/TRAVEL, and LOCATION; for image, FACE, PEOPLE, and OUTDOOR SCENE. They are shown in Table 1.
Inter-modal coincidence among those clues expresses important situations. Examples are shown in Figure 2. A pair of SPEECH/OPINION and FACE shows one of the most typical situations, in which someone talks about his opinion or reports something. A pair of MEETING/CONFERENCE and PEOPLE shows a conventional situation such as the Congress.
A brief overview of the spotting for a speech or lecture situation is shown in Figure 3. The language clue can be characterized by typical phrases such as ``He says'' or ``I think'', while the image clue can be characterized by face close-ups. By finding and associating these images and sentences, we can expect to obtain speech or lecture situations.
language clue | description
---|---
SPEECH/OPINION | speech, lecture, opinion, etc.
MEETING/CONFERENCE | conference, congress, etc.
CROWD/PEOPLE | gathering people, demonstration, etc.
VISIT/TRAVEL | VIP's visit, etc.
LOCATION | explanation for a location, city, country, or natural phenomena

image clue | description
---|---
FACE | human face close-up (not too small)
PEOPLE | more than one person, faces or human figures
OUTDOOR SCENE | outdoor scene, regardless of natural or artificial
The transcripts of news videos are automatically extracted from the NTSC signal and stored as text. The simplest way to detect language clues is keyword spotting over these texts. However, since keyword spotting picks up many unnecessary words, we apply additional screening by parsing and a lexical meaning check.
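The following minimal sketch illustrates this keyword-spotting step over closed-caption sentences; the keyword lists shown are illustrative placeholders, not the full word lists discussed below.

```python
import re

# Illustrative keyword lists only; the actual word lists are those discussed
# in the statistics of this section.
CLUE_KEYWORDS = {
    "SPEECH/OPINION":     {"SAY", "SAID", "SAYS", "TALK", "THINK"},
    "MEETING/CONFERENCE": {"MEETING", "CONFERENCE", "CONGRESS"},
    "CROWD":              {"CROWD", "DEMONSTRATION", "PROTEST"},
    "VISIT/TRAVEL":       {"VISIT", "VISITED", "ARRIVE", "ARRIVED"},
}

def spot_keywords(sentence: str) -> dict:
    """Return, per clue category, the keywords found in a closed-caption sentence."""
    tokens = set(re.findall(r"[A-Z']+", sentence.upper()))  # captions are upper case
    hits = {cat: tokens & words for cat, words in CLUE_KEYWORDS.items()}
    return {cat: found for cat, found in hits.items() if found}

# A sentence flagged as a SPEECH/OPINION (and VISIT/TRAVEL) candidate:
print(spot_keywords("THE PRESIDENT SAID HE WOULD VISIT BELFAST."))
```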
In a speech or lecture situation, two groups of words frequently appear[3].
The first group is a set of words expressing indirect narration, in which a reporter or an anchorperson mentions someone's speech. The second group is a set of words expressing direct narration, which often corresponds to live video portions in news videos; in those portions, people are usually talking about their opinions.
The actual statistics on those words are shown in Table 3. Each row shows the number of word occurrences in speech portions and in other portions[4]. For example, if we detect ``say'' in an affirmative sentence in the present or past tense, the segment is a speech or lecture scene at a rate of 92%. Some words suggesting meeting/conference, crowd, and visit/travel situations are shown in Table 4. Similarly, a location name often appears with outdoor scenes showing the actual location.
[3]Since they are taken from the closed captions, they are all in upper case.
[4]In these statistics, words in future-tense or negative sentences are not counted, since real scenes rarely accompany them.
As we can see in Table 3, some words such as ``talk'' are not sufficient keys. One reason is that ``talk'' is often used as a noun, as in ``peace talk''; in such cases, it sometimes mentions only the topic of the speech, not the speech action itself. Moreover, negative sentences and those in the future tense are rarely accompanied by real images showing the mentioned content. Consequently, keyword spotting may cause a large number of false detections which cannot be recovered by the association with image data.
To cope with this problem, we parse each sentence in the transcripts, check the role of each keyword, and check the semantics of the subject, the verb, and the objects. Each word is also checked for whether it expresses a location.
In key-sentence detection, keywords are detected in the transcripts. Separately, the transcripts are parsed by the Link Parser [ST93]. Keywords are then syntactically and semantically checked and evaluated using the parsing results. Since the transcripts of CNN Headline News are rather complicated, less than one third of the sentences are parsed perfectly. However, if we focus only on subjects and verbs, the results are more acceptable: in our experiments, subjects and verbs are correctly detected at a rate close to 80%.
Using these results, the part of speech of each keyword and the lexical meanings of the subject, verb, and object in a sentence are checked. The words to be checked and the conditions are listed in Table 5. A sentence including one or more keywords which satisfy these conditions is considered a key-sentence.
The results are shown in Table 6. The figures (X/Y/Z) in the table show the numbers of detected key-sentences: X is the number of sentences which include keywords; Y is the number of sentences removed by the above keyword screening; Z is the number of sentences incorrectly removed[5].
type | condition |
---|---|
SPEECH/OPINION | active voice and affirmative, not future tense, subject as a human or a social group, not ``it'' |
MEETING/CONFERENCE | affirmative, not future tense |
CROWD | affirmative, not future tense |
VISIT/TRAVEL | affirmative, not future tense, subject as human, at least one location name in a sentence |
LOCATION | preposition (in, at, on, to, etc.) + location name |
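The screening conditions in Table 5 can be sketched as follows. The `ParsedSentence` record is a hypothetical summary of what the parsing and lexical checks provide (tense, voice, negation, subject semantics, location names); its field names are illustrative rather than those of an actual parser interface.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical parse summary; in practice this information would be derived
# from the Link Parser output plus a lexical check of subject/verb/object.
@dataclass
class ParsedSentence:
    text: str
    subject: str = ""
    subject_is_human_or_group: bool = False
    tense: str = "past"          # "past" | "present" | "future"
    voice: str = "active"        # "active" | "passive"
    negated: bool = False
    location_names: List[str] = field(default_factory=list)

def passes_screening(clue_type: str, s: ParsedSentence) -> bool:
    """Apply the per-category conditions of Table 5 to a keyword-bearing sentence."""
    affirmative_nonfuture = not s.negated and s.tense != "future"
    if clue_type == "SPEECH/OPINION":
        return (s.voice == "active" and affirmative_nonfuture
                and s.subject_is_human_or_group and s.subject.lower() != "it")
    if clue_type in ("MEETING/CONFERENCE", "CROWD"):
        return affirmative_nonfuture
    if clue_type == "VISIT/TRAVEL":
        return (affirmative_nonfuture and s.subject_is_human_or_group
                and len(s.location_names) > 0)
    if clue_type == "LOCATION":
        return len(s.location_names) > 0   # the "in/at/on/to + location" check is omitted here
    return False
```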
video | speech | meeting | crowd | visit | location |
---|---|---|---|---|---|
Video1 | 40/3/1 | 20/1/0 | 33/4/0 | 41/33/0 | 89/59/5 |
Video2 | 28/3/0 | 22/6/0 | 24/3/0 | 39/34/1 | 65/39/2 |
Video3 | 34/5/1 | 15/2/1 | 22/2/0 | 39/33/0 | 70/50/4 |
[5]In this evaluation, difficult and implicit expressions which do not include words implying the clues are not considered; we therefore assume the keyword spotting results include all of the needed language clues.
A dominant portion of a news video is occupied by human activities. Consequently, human images, especially faces and human figures, play important roles. In the case of human visits or movement, outdoor scenes carry important information: who went where, what the place looked like, etc. We consider such an image a unit of image clue and call it a key image.
In this research, three types of images, face close-ups, people, and outdoor scenes, are considered as image clues. Although these image clues are not strong enough for classifying a topic, their usage is strongly biased toward several typical situations. Therefore, by associating the key images and key-sentences, the topic of an image can be clarified, and the focus of the news segment can be detected.
The actual usage of the three kinds of images is shown in Tables 7, 8, and 9. The predominant usage of face close-ups is for speech, though a human face close-up can also identify the subject of other acts: a visitor at a ceremony, a criminal in a crime report, etc. Similarly, an image with small faces or small human figures suggests a meeting, conference, crowd, demonstration, etc. Among these, the predominant usage is the depiction of a meeting or conference; in such a case, the name of the conference, such as ``Senate'', is mentioned, while the people attending are not always mentioned. Another usage of people images is the description of crowds, such as people in a demonstration.
In the case of outdoor scenes, images describe the place, the degree of a disaster, etc. Since a clear distinction of the roles is difficult, only the number of images with outdoor scenes is shown in Table 9.
video | speech | others | total |
---|---|---|---|
Video1 | 59 | 10 | 69 |
Video2 | 80 | 12 | 92 |
Other usages are personal introduction (4), action (2), audience/attendee (3), movie (2), anonymous (2), exercising (2), sports (1), and singing (4).
video | meeting | crowd | total |
---|---|---|---|
Video1 | 16 | 16 | 32 |
Video2 | 9 | 43 | 52 |
video | outdoor scenes |
---|---|
Video1 | 34 |
Video2 | 39 |
First, the videos are segmented into cuts by histogram-based scene change detection [HS95], [SH95]; the tenth frame[6] of each cut is regarded as the representative frame for the cut. Next, the following feature extractions are performed on each representative frame.
[6]The first few frames are skipped because they often have scene change effects.
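A minimal sketch of histogram-based cut detection is given below, assuming decoded frames are available as 8-bit RGB arrays; the L1 histogram distance and the threshold value are illustrative choices rather than the exact measures of [HS95], [SH95].

```python
import numpy as np

def color_histogram(frame: np.ndarray, bins: int = 8) -> np.ndarray:
    """Normalized RGB histogram of an 8-bit color frame (H x W x 3)."""
    hist, _ = np.histogramdd(frame.reshape(-1, 3),
                             bins=(bins, bins, bins), range=((0, 256),) * 3)
    return hist.ravel() / hist.sum()

def detect_cuts(frames, threshold: float = 0.4, rep_offset: int = 10):
    """Histogram-difference cut detection; returns (cut_start, representative) index pairs."""
    cuts = [0]
    prev = color_histogram(frames[0])
    for i in range(1, len(frames)):
        h = color_histogram(frames[i])
        if np.abs(h - prev).sum() > threshold:   # L1 distance between successive histograms
            cuts.append(i)
        prev = h
    # Roughly the tenth frame of each cut stands in for the cut,
    # skipping the first few frames that may contain transition effects.
    reps = [min(start + rep_offset - 1, len(frames) - 1) for start in cuts]
    return list(zip(cuts, reps))
```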
In this research, human faces are detected by a neural-network-based face detection program [RBK96]. Most face close-ups are easily detected because they are large and frontal. As a result, most frontal faces[7] are detected, but less than half of the small faces and profiles are.
[7]As described in [RBK96], the face detection accuracy for frontal face close-up is nearly satisfactory.
As for images with many people, the problem becomes difficult because small faces and human figures are more difficult to detect. The same can be said of outdoor scene detection.
Automatic face and outdoor scene detection is still under development. For the experiments in this paper, we pick them manually. Since the representative image of each cut is detected automatically, it takes only a few minutes to pick those images from a 30-minute news video.
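As an illustration of how FACE-clue detection could be automated with off-the-shelf tools, the sketch below uses OpenCV's Haar-cascade frontal face detector as a stand-in (it is not the neural-network detector of [RBK96]); the minimum relative face size is an assumed parameter.

```python
import cv2

# Stand-in detector: OpenCV's Haar-cascade frontal face detector.
_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def is_face_closeup(frame_bgr, min_face_ratio: float = 0.15) -> bool:
    """FACE clue: at least one detected face that is large relative to the frame."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = _cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    h, w = gray.shape
    return any(fw * fh > min_face_ratio * w * h for (_, _, fw, fh) in faces)
```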
The sequence of key-sentences and that of key images are associated by Dynamic Programming.
The detected data are a sequence of key images and a sequence of key-sentences, each item of which is given a starting and ending time. If the duration of a key image and that of a key-sentence overlap sufficiently (or are close to each other) and the suggested situations are compatible, they should be associated.
In addition, we impose the basic assumption that the order of the key image sequence and that of the key-sentence sequence are the same; in other words, there are no reverse-order correspondences. Consequently, dynamic programming can be used to find the correspondence.
The basic idea is to minimize the following penalty value P.
$$P = \sum_{j \in S_n} \mathrm{Skip}_s(j) + \sum_{k \in I_n} \mathrm{Skip}_i(k) + \sum_{j \in S,\ k \in I} \mathrm{Match}(j, k)$$
In the DP path calculation, we allow any inter-modal correspondence unless the duration of a key image and that of a key-sentence are too far apart to be matched[8]. Any key-sentence or key image may be skipped (warped), that is, left unmatched.
[8]In our experiments, the threshold value is 20 seconds.
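The association step can be sketched as the following order-preserving alignment with skips. To keep the code simple, it maximizes a total association score (match scores of paired clues minus the importance of skipped ones), which corresponds to the penalty formulation above up to a sign convention. The `Clue` record is an illustrative data structure; the match function passed in is sketched in the next subsection.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Clue:
    start: float       # seconds
    end: float
    kind: str          # e.g. "speech" or "face"
    importance: float  # E = E_type * E_data

def associate(sents: List[Clue], imgs: List[Clue],
              match: Callable[[Clue, Clue], float],
              max_gap: float = 20.0) -> List[Tuple[int, int]]:
    """Order-preserving DP association of key-sentences with key images.

    Score = sum of match(s, i) over matched pairs minus the importance of every
    skipped clue.  Matching is only allowed when the two durations are within
    max_gap seconds of each other (the 20-second threshold above).
    """
    NEG = float("-inf")
    nS, nI = len(sents), len(imgs)
    score = [[NEG] * (nI + 1) for _ in range(nS + 1)]
    back = [[None] * (nI + 1) for _ in range(nS + 1)]
    score[0][0] = 0.0
    for j in range(nS + 1):
        for k in range(nI + 1):
            if score[j][k] == NEG:
                continue
            if j < nS and score[j][k] - sents[j].importance > score[j + 1][k]:
                score[j + 1][k] = score[j][k] - sents[j].importance  # skip sentence j
                back[j + 1][k] = (j, k, None)
            if k < nI and score[j][k] - imgs[k].importance > score[j][k + 1]:
                score[j][k + 1] = score[j][k] - imgs[k].importance   # skip image k
                back[j][k + 1] = (j, k, None)
            if j < nS and k < nI:
                gap = max(sents[j].start - imgs[k].end,
                          imgs[k].start - sents[j].end, 0.0)
                if gap <= max_gap:                                   # durations close enough
                    cand = score[j][k] + match(sents[j], imgs[k])
                    if cand > score[j + 1][k + 1]:
                        score[j + 1][k + 1] = cand
                        back[j + 1][k + 1] = (j, k, (j, k))
    # Trace back the matched (sentence index, image index) pairs.
    pairs, j, k = [], nS, nI
    while (j, k) != (0, 0):
        pj, pk, matched = back[j][k]
        if matched is not None:
            pairs.append(matched)
        j, k = pj, pk
    return list(reversed(pairs))
```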
Basically, the penalty values are determined by the importance of the data, that is, the likelihood that each datum has an inter-modal correspondence. In this research, the importance $E$ of each clue is calculated by the following formula; the skip penalty Skip is taken to be $-E$.
$$E = E_{\mathrm{type}} \, E_{\mathrm{data}}$$
Similarly, $E_{\mathrm{data}}$ for key-sentences is calculated based on the keyword's part of speech, the lexical meaning of the subject, etc. An example of these coefficients is shown in Table 11.
The evaluation of correspondences is calculated by the following formula.
$$\mathrm{Match}(i, j) = M_{\mathrm{time}}(i, j) \, M_{\mathrm{type}}(i, j)$$
A key image's duration ($d_i$) is the duration of the cut from which the key image is taken; the starting and ending times of a sentence in the speech are used as the key-sentence duration ($d_s$). Where the exact speech time is difficult to obtain, it is approximated by the time when the closed caption appears.
The actual values for $M_{\mathrm{type}}$ are shown in Table 12. They are roughly determined by the number of correspondences in our sample videos.
type | speech | meeting | crowd | visit | location |
---|---|---|---|---|---|
face | 1.0 | 0.25 | 0.25 | 0.25 | 0.0 |
people | 0.75 | 1.0 | 1.0 | 0.5 | 0.5 |
outdoor scene | 0.0 | 0.25 | 0.25 | 1.0 | 1.0 |
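Continuing the sketch above, a match function can be built from these values. The $M_{\mathrm{type}}$ entries below are taken directly from Table 12, while the linear decay used for $M_{\mathrm{time}}$ (full score for overlapping durations, falling to zero at the 20-second threshold) is an illustrative assumption; the `Clue` record is the one from the DP sketch.

```python
# M_type values from Table 12, indexed by image-clue type, then language-clue type.
M_TYPE = {
    "face":    {"speech": 1.0,  "meeting": 0.25, "crowd": 0.25, "visit": 0.25, "location": 0.0},
    "people":  {"speech": 0.75, "meeting": 1.0,  "crowd": 1.0,  "visit": 0.5,  "location": 0.5},
    "outdoor": {"speech": 0.0,  "meeting": 0.25, "crowd": 0.25, "visit": 1.0,  "location": 1.0},
}

def m_time(sent: Clue, img: Clue, max_gap: float = 20.0) -> float:
    """Temporal compatibility in [0, 1]: 1.0 when the two durations overlap,
    falling linearly to 0.0 as the gap between them approaches max_gap."""
    gap = max(sent.start - img.end, img.start - sent.end, 0.0)
    return max(0.0, 1.0 - gap / max_gap)

def match(sent: Clue, img: Clue) -> float:
    """Match(i, j) = M_time(i, j) * M_type(i, j), as in the formula above."""
    return m_time(sent, img) * M_TYPE[img.kind][sent.kind]

# Usage with the DP sketch: pairs = associate(key_sentences, key_images, match)
```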
We chose six CNN Headline News videos from the Informedia testbed, each 30 minutes in length. They are segmented into cuts by scene change detection, and a poster frame, i.e., a representative image, is selected for each cut. Next, face detection, people detection, and outdoor scene detection are applied to each poster frame. Currently, only face close-up detection is automated; the rest are produced manually. Each datum is registered as a key image, and its importance is evaluated.
Transcripts are automatically obtained from the closed captions. They are segmented into sentences and parsed by the Link Parser. Then, through keyword detection and screening by semantic checks, key-sentences are detected. All transcript processing is done without human assistance, since the key-sentence detection results are satisfactory. For each key-sentence, importance is calculated in the same way as for key images. Finally, inter-modal correspondences between the obtained key images and key-sentences are computed by DP.
Figure 6 shows the association results obtained by DP. The columns show the key-sentences and the rows show the key images; the correspondences are derived from the path cost. In this example, 167 key images and 122 key-sentences are detected, and 69 correspondences are successfully obtained.
Total numbers of matched and unmatched key-data in 6 news videos are shown in Table 13. Details are in Table 14.
A is the total number of key data; B is the number of key data for which inter-modal correspondences are found; C is the number of key data associated with correct correspondences; D is the number of missing associations, that is, clues for which association failed despite real correspondences existing; E is the number of wrong associations, i.e., mismatches.
type | A (all) | B (matched) | C (correct) | D (miss) | E (wrong)
---|---|---|---|---|---
speech | 292 | 226 | 178 | 40 | 48 |
meeting | 47 | 26 | 19 | 18 | 7 |
crowd | 63 | 35 | 26 | 19 | 9 |
travel | 15 | 8 | 7 | 6 | 1 |
location | 76 | 34 | 27 | 32 | 7 |
face | 472 | 217 | 173 | 0 | 44 |
people | 220 | 84 | 63 | 0 | 21 |
scene | 168 | 25 | 21 | 0 | 4 |
type | face | people | scene
---|---|---|---
speech | 199/165 | 24/12 | 2/1 |
meeting | 9/6 | 15/12 | 1/1 |
crowd | 5/1 | 28/25 | 1/0 |
visit | 1/0 | 4/4 | 3/3 |
location | 3/1 | 13/10 | 18/16 |
As shown in the above example, the accuracy of the association process is good enough to assist manual tagging. About 70 segments are spotted for each video, and around 50 of them are correct. Although there are many unmatched key images, most of them are taken from commercial messages, for which corresponding key-sentences do not exist. However, there is still a considerable number of association failures, mainly caused by the following factors:
Given the spotting results, the following uses can be considered.
Around 70 segments are spotted in each 30-minute news video, an average of two to three segments per minute. If a topic is not too long, we can place all of the segments of one topic in a single window. This view can serve as a good presentation of the topic as well as a summarization tool.
An example is shown in Figures 7 and 8. Each pair of a picture and a sentence is an associated pair: the picture is a key image and the sentence is a key-sentence. The position of each pair is determined by the situations defined in the previous section: segments for VISIT/TRAVEL or LOCATION are placed in the top row; MEETING or CROWD segments are in the second row; SPEECH/OPINION segments are in the bottom row. Thus, the first row shows Mr.Clinton's visit to Ireland and the preparations for him in Belfast; the second row explains the politicians and people in that country; the third row shows individual speeches and opinions about peace in Ireland.

In this view, the time order of segments is kept only inside each row, mainly to save space. If we keep the order across rows, i.e., if all segments are placed in the order of their presentation time, we get the view shown in Figure 9. This view lets us see at a glance how the topic is organized: visit and place information is given first, meeting information second, and then a few public speeches and opinions. As this example shows, we can grasp the rough structure of the topic with a brief look at this view.
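A small sketch of this row layout, assuming each associated pair is given as (key image, key-sentence) objects with `kind` and `start` fields as in the earlier sketches:

```python
# Row index per language-clue type: top = visit/location, middle = meeting/crowd,
# bottom = speech/opinion, as in the topic view described above.
ROW_OF = {"visit": 0, "location": 0, "meeting": 1, "crowd": 1, "speech": 2}

def topic_view(pairs):
    """Arrange associated (key image, key-sentence) pairs into three rows,
    keeping time order only within each row."""
    rows = [[], [], []]
    for image, sentence in sorted(pairs, key=lambda p: p[1].start):
        rows[ROW_OF[sentence.kind]].append((image, sentence))
    return rows
```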
As mentioned before, situations such as the ``speech scene'' can be good tags for video segments. Currently, we are trying to extract additional information from the transcripts, such as the name of a speaker, the attendees of a meeting/conference, and a visitor and the location of a visit. With these data, video segment retrieval can be much more efficient.
We have described the idea of Spotting by Association in news video. With this method, video segments with typical semantics are detected by associating language clues and image clues.
Our experiments have shown that many correct segments can be detected with our method, and most of the detected segments fit the typical situations introduced in this paper. We also proposed new applications using the detected news segments.
There are many areas for future work. One of the most important areas is the improvement of key image and key-sentence detection. Another is to check the effectiveness of this method with other kinds of videos.