Broadcast News Navigation using Story Segmentation
Andrew Merlino, Daryl Morey, Mark Maybury
Advanced Information Systems Center
The MITRE Corporation
202 Burlington Road, Bedford, MA 01730, USA
{andy, dmorey, maybury}@mitre.org
Abstract
In this paper we describe the techniques developed for, and lessons learned from, an operational multimedia exploitation system comprising the Broadcast News Editor (BNE) and the Broadcast News Navigator (BNN). BNE captures, analyzes, annotates, segments, summarizes, and stores broadcast news audio, video and textual data within a multimedia database system. BNN provides web-based tools for retrieval from that multimedia database. The key innovation of the system is the detection and segmentation of story segments from the multimedia broadcast stream. This paper discusses:
BNE and BNN are currently used every evening at MITRE's Multimedia Research Lab and to date have automatically processed over 6,011 news stories from over 349 broadcasts of CNN Prime News™.
1. Introduction

As massive amounts of multimedia data (e.g., interactive web pages, television broadcasts, surveillance videos) are created, more effective tools for multimedia search, retrieval and exploitation are necessary. Multimedia data analysts must search, annotate and segment multimedia data to find subjects of interest or to discover trends. Costly manual approaches are currently in use at many facilities (e.g., government agencies, film studios, broadcast agencies). The BNE and BNN systems were created to assist internationally located multimedia data analysts in viewing broadcast news stories and trends.

This project exploits the parallel signals found in a multimedia data source to enable story segmentation and summarization of broadcast news [MANI]. Our initial investigation looked for discourse cues in domestic broadcast news, such as hand-offs from anchor to reporter (e.g., "to our senior correspondent in Washington, Britt Hume") and from reporter to anchor (e.g., "This is Britt Hume, CNN, Washington"). Using these cues embedded within a broadcast's closed-caption transcript, story segments (distinct portions of a news broadcast in which one story is discussed) were discovered. This initial technique proved inadequate: many segments were missed or misidentified, and the technique was not robust. If a particular cue was given in a slightly different manner than anticipated, neither the cue nor the story segment was detected.

To improve segmentation accuracy and make the technique more robust, other cues were added. In the closed-caption transcript, detection of cues such as ">>" (speaker change) and blank lines introduced in the captioning process improved story segmentation. A Natural Language Processing (NLP) text tagging tool, Alembic [ABERDEEN], provided named entity detection (i.e., people, organizations, locations). This paper discusses the textual, video and audio cues we now use and the techniques we have developed for correlating them to improve broadcast, commercial and story segmentation.

2. Efficacy of Story Segmentation

Breaking a news broadcast into its reported news stories enables browsing that requires the data analyst to review far less material than linear browsing of unsegmented content. To demonstrate the efficiency of a story segment search, we gathered metrics in a task-based retrieval experiment. Before describing the experiment, we motivate the utility of story segmentation.

To find a particular news story in a given program with a linear search, the analyst steps sequentially through the video until the story is found. This is obviously time consuming, but it provides a useful baseline for comparison. A keyword search over the associated time-stamped video transcript provides a time-indexed pointer into the video stream for each instance where the keyword occurs. For example, a keyword search on "Peru" returns a pointer into the video for each location where the word Peru was spoken. Because a keyword search may return multiple pointers to the same story, it is intuitive that a story segment search is superior to a keyword search. To confirm this intuition, we performed the following experiment.

In the experiment, a user was asked to find stories on three topics over a one-month period using each of the three techniques above: linear search, keyword search and story segment search.
The linear search was performed with a shuttle-control VCR. The keyword search was performed by searching the multimedia database for the dates and times at which the news program referenced the given keyword; for each date and time retrieved, the user manually searched through the videotape using the VCR shuttle control. The story segment search was performed using our BNN system. The data set was the nightly half-hour CNN Prime News™ programs from 12/14/96 - 1/13/97.
Story Topic        | Actual Stories | Linear Search (Time / # Stories) | Keyword Search (Time / # Stories) | BNN (Time / # Stories)
Peru               | 17             | 3:10 / 16                        | 2:40 / 18                         | 0:02 / 22
Middle East        | 16             | 3:16 / 16                        | 2:53 / 17                         | 0:02 / 25
Gulf War Chemicals | 3              | 4:30 / 3                         | 0:33 / 3                          | 0:02 / 4
Average            |                | 3:39                             | 2:02                              | 0:02

Table 1. Search Comparisons (times in hh:mm)
As seen in Table 1, the manual search took 80% longer than the keyword search and 10,850% longer than the BNN search. Three anomalies were discovered in the test. First, in the manual process, when a story was found in a news program, the searcher stopped searching the remainder of that program on the assumption that the story would not recur. Second, in the keyword search, keywords detected in the first minute of a broadcast were ignored because they pointed to the highlights of the news; this method nonetheless had better recall and retrieved more of the relevant stories because stories that recurred later in the broadcast were detected. Third, in the BNN search, the system over-generated story segments, which increased the number of stories found. In three cases of over-segmentation, a story crossed a commercial boundary and was broken into two individual stories; in one case, a story consisted of two sub-stories, the Peruvian Army and the Peruvian Economy.
3. Story Segmentation Techniques
The technique we created to detect story segments is a multi-source technique that correlates various video, audio and closed-caption cues to detect when a story segment occurs. Because each broadcast news program tends to follow a general format across the entire program and within a story segment, the broadcast can be broken down into a series of "states", such as "start of broadcast", "advertising", "new story" and "end of broadcast". The multi-source cues, including time, are then used to detect when a state transition occurs.
Consider our observations of CNN's half-hour Prime News program as an example. The CNN broadcast typically follows this format:
3.1 Textual, Video, and Audio Segment Cues
To detect when a state transition occurs, cues from the video, audio and text (closed-caption) stream are used, as well as time. We will describe each cue that is used and how it is generated. Automated analysis programs have been written to detect each of these cues. When an analysis program detects a cue, the discovery is loaded into an integrated relational table by broadcast, cue type and time stamp. This integrated relational table allows rapid and efficient story segmentation and will be described in more detail later in the paper. Within this section, we show the latest analysis from ten months of broadcast news of which 96% is CNN Prime News.
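As a concrete illustration of how detected cues feed the integrated relational table, the following is a minimal sketch; the table name, column names and cue identifiers are assumptions for illustration, not the actual BNE schema.

from dataclasses import dataclass
import sqlite3

@dataclass
class Cue:
    broadcast_id: str   # e.g., "CNN-PRIMENEWS-1997-03-14" (hypothetical identifier)
    cue_type: str       # e.g., "BLACK_FRAME", "SILENCE_START", "Signoff"
    time_stamp: float   # seconds from the start of the captured broadcast

def store_cue(conn, cue):
    # One row per detected cue; segmentation later reads this table ordered
    # by time_stamp for a single broadcast.
    conn.execute(
        "INSERT INTO cue (broadcast_id, cue_type, time_stamp) VALUES (?, ?, ?)",
        (cue.broadcast_id, cue.cue_type, cue.time_stamp),
    )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cue (broadcast_id TEXT, cue_type TEXT, time_stamp REAL)")
store_cue(conn, Cue("CNN-PRIMENEWS-1997-03-14", "BLACK_FRAME", 612.4))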
3.1.1 Text (Closed-Caption) Cues
In the closed-caption channel, we have found highly frequent word patterns that can be used as text cues. The first pattern is the anchor introduction. Typically, the anchors introduce themselves with {"I'm" the anchor's name}. We use MITRE's text tagging tool, Alembic, to automatically detect person, location and organization names. With these detections, a search for the phrase pattern {"I'm" <Person>} is performed. As seen in figure 1, we also exploit the fact that anchor introductions occur within 90 seconds of the start of the news program and within 30 seconds of its end. Figure 1 plots the occurrences of this pattern by minute of the broadcast (a sketch of the check follows the figure).
Figure 1. Occurrences of "I’M <Person>"
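The check below is a minimal sketch of this pattern together with its temporal constraint. The inline <PERSON> tag format, the function name and the example line are assumptions for illustration and do not reflect Alembic's actual output format.

import re

# Assume the caption line has already been run through a named-entity tagger
# that wraps person names, e.g. "I'M <PERSON>JUDY WOODRUFF</PERSON>".
ANCHOR_INTRO = re.compile(r"\bI'M\s+<PERSON>[^<]+</PERSON>", re.IGNORECASE)

def is_anchor_intro(tagged_line, seconds_into_broadcast, broadcast_length_s=1800.0):
    # Accept the pattern only where it is expected: within the first 90
    # seconds or the final 30 seconds of the program.
    in_window = (seconds_into_broadcast <= 90.0 or
                 seconds_into_broadcast >= broadcast_length_s - 30.0)
    return in_window and ANCHOR_INTRO.search(tagged_line) is not None

print(is_anchor_intro("I'M <PERSON>JUDY WOODRUFF</PERSON>.", 45.0))  # True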
From our database queries and analysis of word frequencies and their temporal locations, we have identified introductory phrases that occur across many different news sources. Our current, domain-specific list of terms is shown in figure 2. As figure 3 shows, we can also exploit the knowledge that a program introduction occurs within 90 seconds of the start of the news.
HELLO AND WELCOME
HELLO FROM
WELCOME TO
THANKS FOR WATCHING
THANKS FOR JOINING US
HERE ON PRIMENEWS
TONIGHT ON PRIMENEWS
PRIMENEWS
Figure 2. Introductory CNN Prime NewsTM Anchor Terms
Figure 3. Occurrences of introductions
Also from our analysis, we have identified terms that occur during story segments about the weather. Our current, growing list of weather terms is shown in Figure 4. As Figure 5 shows, the weather report begins on average at 22 minutes and 30 seconds into the broadcast and ends on average at 25 minutes and 15 seconds. Using this information, we can modify our detection program to tag a story as weather if it falls within this time window and uses the listed terms (a minimal sketch of such a check follows Figure 5).
WEATHER
FORECAST
FRONTAL SYSTEM
LOW PRESSURE
HIGH PRESSURE
RAIN
SNOW
ICE
HAIL
STORM
CLOUD
PRECIPITATION
TORNADO
HURRICANE
LIGHTNING
THUNDER
Figure 4. Weather Story Segment Terms
Figure 5. Occurrences of Weather Terms
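The following is a minimal sketch of such a time-and-term check, assuming the term list of Figure 4 and the average times reported above; the function name and the exact window bounds are illustrative, not the values used in BNE.

WEATHER_TERMS = {
    "WEATHER", "FORECAST", "FRONTAL SYSTEM", "LOW PRESSURE", "HIGH PRESSURE",
    "RAIN", "SNOW", "ICE", "HAIL", "STORM", "CLOUD", "PRECIPITATION",
    "TORNADO", "HURRICANE", "LIGHTNING", "THUNDER",
}

def is_weather_story(caption_text, story_start_seconds):
    # Average weather slot: begins about 22:30 and ends about 25:15 into the program.
    in_weather_slot = (22 * 60 + 30) <= story_start_seconds <= (25 * 60 + 15)
    text = caption_text.upper()
    return in_weather_slot and any(term in text for term in WEATHER_TERMS)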
As reported in previous work, story segments can be detected by looking at anchor-to-reporter and reporter-to-anchor hand-offs. For anchor-to-reporter detections, we use the phrases illustrated in figure 6, where the people and locations are tagged by Alembic. For reporter-to-anchor hand-off detections, we use the phrases illustrated in figure 7, where again the people and locations are tagged using Alembic (a simplified sketch of these patterns follows figure 7).
<varying phrase> "CNN’S" <Person> (e.g., "HERE'S CNN'S GARY TUCHMAN")
<Person> "JOINS US" (e.g., SENIOR WHITE HOUSE CORRESPONDENT WOLF BLITZER JOINS US")
<Person> "REPORTS" (e.g., "CNN'S JOHN HOLLIMAN REPORTS")
Figure 6. Anchor to Reporter Phrases
<Person> "CNN," <Location> (e.g., "BRENT SADLER, CNN, GAZA")
"BACK TO YOU" (e.g., "BACK TO YOU IN ATLANTA")
"THANK YOU" <Person> (e.g., "THANK YOU, MARTIN")
Figure 7. Reporter to Anchor Phrases
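As an illustration, the hand-off phrases of figures 6 and 7 can be expressed as patterns over caption text in which a named-entity tagger has already marked people and places. The sketch below is a simplification under assumptions (inline tag format, pattern subset and function name are illustrative only):

import re

HANDOFF_PATTERNS = {
    "ANCHOR_TO_REPORTER": [
        re.compile(r"CNN'S\s+<PERSON>[^<]+</PERSON>", re.IGNORECASE),
        re.compile(r"<PERSON>[^<]+</PERSON>\s+(JOINS US|REPORTS)", re.IGNORECASE),
    ],
    "REPORTER_TO_ANCHOR": [
        re.compile(r"<PERSON>[^<]+</PERSON>,\s*CNN,\s*<LOCATION>[^<]+</LOCATION>", re.IGNORECASE),
        re.compile(r"BACK TO YOU", re.IGNORECASE),
        re.compile(r"THANK YOU,?\s+<PERSON>[^<]+</PERSON>", re.IGNORECASE),
    ],
}

def handoff_cues(tagged_line, time_stamp):
    # Yield (cue_type, time_stamp) records for any hand-off phrase on this line;
    # such records would feed the integrated cue table described in section 3.1.
    for cue_type, patterns in HANDOFF_PATTERNS.items():
        if any(p.search(tagged_line) for p in patterns):
            yield cue_type, time_stamp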
On the closed-caption channel, there are instances in the program when the anchor or reporter gives highlights of upcoming news stories. These teasers can be found by looking for the phrases found in figure 8.
COMING UP ON PRIMENEWS
NEXT ON PRIMENEWS
AHEAD ON PRIMENEWS
WHEN PRIMENEWS RETURNS
ALSO AHEAD
Figure 8. Story Previews
Certain anchor booth phrases are used to provide a detection cue for the end of a broadcast. As seen in figure 9, these phrases are mostly sign off phrases heard throughout various broadcast news programs. These phrases occur in 97% of the news programs we have analyzed.
THAT WRAPS UP
THAT IS ALL
THAT'S ALL
THAT'S PRIMENEWS
THANKS FOR WATCHING
THANKS FOR JOINING US
Figure 9. Sign-off Phrases
Figure 10. Occurrences of Sign off Terms
Finally, in the closed-caption stream, the operator frequently inserts three very useful closed-caption cues. These cues are:
3.1.2 Audio
While analyzing the audio channel, we have discovered that there are detectable periods of silence at least 0.7 seconds long at the beginning and end of commercial boundaries. Although there may be other periods of silence at the maximal noise-energy level, knowledge of these data points will be shown to be useful.
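A minimal sketch of such a silence detector follows, assuming the audio has been decoded to PCM samples; the window size and energy threshold are illustrative parameters, not BNE's actual settings.

import numpy as np

def silence_spans(samples, sample_rate, window_s=0.1,
                  energy_threshold=1e-4, min_silence_s=0.7):
    # Return (start, end) times, in seconds, of low-energy runs lasting at
    # least min_silence_s.
    window = int(window_s * sample_rate)
    spans, run_start = [], None
    for i in range(0, len(samples) - window, window):
        energy = float(np.mean(samples[i:i + window] ** 2))  # mean-square energy
        t = i / sample_rate
        if energy < energy_threshold:
            if run_start is None:
                run_start = t
        else:
            if run_start is not None and t - run_start >= min_silence_s:
                spans.append((run_start, t))
            run_start = None
    return spans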
3.1.3 Video
Currently, we have a program that discovers the location of black frames, logos and single (i.e., one anchor is visible) and double anchor (i.e., two anchors are visible) booth scenes from an MPEG file. Black frames can be used to detect commercials and logos can be used to detect the beginning and the end of a broadcast. With the single and double anchor booth recognitions, story segmentation boundaries can start to be established from the video.
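As an illustration of the simplest of these video detectors, a decoded frame can be flagged as black when its average luminance falls below a small threshold. This is a sketch under assumed inputs (an H x W x 3 RGB array and an illustrative threshold), not the detector actually used in BNE:

import numpy as np

def is_black_frame(frame, luminance_threshold=16.0):
    # frame: H x W x 3 array of 0-255 RGB values for one decoded video frame.
    r, g, b = frame[..., 0], frame[..., 1], frame[..., 2]
    luminance = 0.299 * r + 0.587 * g + 0.114 * b  # ITU-R BT.601 luma weights
    return float(luminance.mean()) < luminance_threshold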
3.1.4 Cue Correlation
Commercial and story boundaries are detected through the correlation of the cues discussed in the previous sections. Looking at figure 3.1.4-1, broadcast boundaries are primarily found by correlating audio silence, video logo, and black frame cues along with closed-caption tokens. The commercials, which occur within the Ad brackets, are primarily found by correlating audio silence, black frame, and closed-caption blank line cues. Finally, story segments are primarily found by correlating closed-caption symbols (>>>, >>, <person>:), anchor to reporter and reporter to anchor cues.
Figure 3.1.4-1. Cue Correlation Chart
3.1.5 Identifying Story Segments
To detect story segments, the cue correlation technique must predict each time a new "Start of Story" state has occurred. When deciding on the technique used for prediction, there were two requirements. First, the technique must be flexible enough to allow the quick addition of new cues into the system. Second, the technique must be able to handle cues that are highly correlated with each other (e.g., Black Frame and Silence). The technique we use is a finite state automaton (FSA) enhanced with time transitions. The states and transitions of the FSA are represented in a relational database. Each detected cue and token is represented by a state that is instantiated by a record in the state table. The time transition attribute of each state allows state changes based on the amount of time the FSA has been in a certain state. For example, since we know the highlights section never lasts longer than 90 seconds, a time transition is created to move the FSA from the "Highlights" state to the "Start of Story" state whenever the FSA has been in the "Highlights" state for more than 90 seconds. The time transitions thus provide a buffer against the possibility that none of the cues normally used to trigger a transition to another state are seen within the time in which the system expects them.
A story segment is detected each time there is a transition into the "Start of Story" state. Commercials are detected when the current segment is in the FSA "Advertising" state. Below, we list the cues that are primarily used to determine each state. A picture of the FSA can be seen in figure 3.1.5-1; a full map of the FSA states and transitions is given in Appendices A and B.
Figure 3.1.5-1. FSA for CNN Prime News (See Appendix A and B for detailed State-Transition Map)
The cues highlighted here are the primary ones used for detecting each state. The system is robust enough to detect state transitions even if the major cues are not present.
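The sketch below shows how such a time-enhanced FSA can be driven by a transition table like the one in Appendix B. The state names, the cue subset and the 90-second timeout come from the description above, but the code itself is an illustrative simplification, not the BNE implementation.

# Transitions keyed by (current_state, cue); a small subset of Appendix B.
TRANSITIONS = {
    ("Start", "CNN Prime News"): "Wait for Broadcast Start",
    ("Wait for Broadcast Start", "Signon"): "Highlights Begin",
    ("Highlights Begin", "Anchor to Reporter"): "Story Segment Detected",
}

# Time transitions: leave a state once the FSA has dwelt in it too long.
TIME_TRANSITIONS = {
    ("Highlights Begin", 90.0): "Start of Story",   # highlights never exceed 90 s
}

def run_fsa(cues):
    # cues: iterable of (time_stamp, cue_name) in temporal order.
    # Yields (time_stamp, new_state) whenever the FSA changes state; each entry
    # into "Start of Story" marks a new story segment.
    state, entered_at = "Start", 0.0
    for t, cue in cues:
        # Fire any time transition that expired before this cue arrived.
        for (from_state, max_dwell), target in TIME_TRANSITIONS.items():
            if from_state == state and t - entered_at > max_dwell:
                entered_at += max_dwell
                state = target
                yield entered_at, state
        target = TRANSITIONS.get((state, cue))
        if target is not None:
            state, entered_at = target, t
            yield t, state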
How well does our segmentation technique perform? We looked at 5 news broadcasts from 3/12/97-3/16/97 to gather metrics of segmentation performance. We measured both the precision (% of detected segments that were actual segments) and recall (% of actual segments that were detected).
Date    | # Detected Stories (Precision) | # Correct Detections (Precision Truth) | # Actual Stories Detected (Recall) | # Actual Stories (Recall Truth)
3/16/97 | 24  | 19 | 19 | 19
3/15/97 | 22  | 16 | 16 | 16
3/14/97 | 27  | 21 | 21 | 22
3/13/97 | 26  | 16 | 16 | 17
3/12/97 | 22  | 18 | 18 | 19
Total   | 121 | 90 | 90 | 93
Percent |     | 0.74 (precision) | | 0.97 (recall)

Table 3.1.5-1. Precision and Recall of Story Segments
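Using the standard definitions given above and the totals in Table 3.1.5-1, the two figures follow directly; the lines below simply check the arithmetic.

detected_segments, correct_detections, actual_segments = 121, 90, 93
precision = correct_detections / detected_segments   # 90 / 121, detected segments that were actual
recall = correct_detections / actual_segments        # 90 / 93, actual segments that were detected
print(round(precision, 2), round(recall, 2))          # 0.74 0.97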
Recall is our primary concern, and as the table shows, our technique excels in this area. Recall is more important because an over-segmented story is still very easy to navigate using our tool: BNN displays stories in temporal order, and the video is played back from the full video file starting at the story's start point until the user stops the playback.
4. The BNE and BNN System

BNE and BNN are the two subsystems that comprise our system. BNE consists of the detection, correlation and segmentation algorithms described above; it runs automatically every evening against a pre-programmed news broadcast schedule. BNN consists of dynamically built web pages used to browse the broadcast news.
4.1 System Architecture
The system consists of a PC, used to capture a news source, and a Sun server, used to process and serve the data. As shown in figure 4.1-1, the conceptual system is broken up into the processing subsystem (BNE) and the dissemination subsystem (BNN). The PC is used in the BNE portion and the Sun server is used in both subsystems. After the PC captures the imagery (MPEG), audio (MPA) and closed-caption information, it passes the created files to the UNIX server for processing. With the MPEG file, scene change detection and video classification (i.e., black frame, logo, anchor booth and reporter scene detection) are performed. Periods of silence are detected from the MPA file. With the closed-caption file, named entity tagging and token detection are performed. With all of this information, the previously described correlation process is performed to detect stories. For each detected story segment, a theme, gist and key frame are automatically generated and stored in the multimedia database (Oracle Relational Database Management System 7.3, Oracle Video Server 2.1). Once the information is available in the database, the end user queries the system through web pages served by Oracle Web Server 2.0.
Figure 4.1-1. BNE and BNN Architecture
The underlying data in the system is stored in a relational database management system. The conceptual level of the database can be seen in figure 4.1-2. The key to relating the textual data to the video is a video file pointer together with time codes. Within the video table there is a reference to the MPEG video file, and within each video's child tables there is a time stamp. With the pointer to the video file name and a time stamp, the BNN system gets direct access to the point of interest in the video. Note that, due to the 2-4 second delay in the closed-caption stream, the video is started five seconds before the desired time stamp.
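A minimal sketch of this lookup is shown below. The table and column names are hypothetical stand-ins for the conceptual model of figure 4.1-2, not the actual BNE schema; only the five-second caption-delay offset comes from the text above.

import sqlite3

def playback_point(conn, story_id, caption_lead_s=5.0):
    # Return (video_file, playback_start_seconds) for a story, backing the
    # start time up five seconds to compensate for the 2-4 second
    # closed-caption delay.
    row = conn.execute(
        "SELECT v.file_name, s.start_time "
        "FROM story s JOIN video v ON s.video_id = v.video_id "
        "WHERE s.story_id = ?", (story_id,)
    ).fetchone()
    file_name, start_time = row
    return file_name, max(0.0, start_time - caption_lead_s)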
Figure 4.1-2. Conceptual Video and Metadata Model
4.2 Sample Session
BNN enables a user to search and browse the original video by program, date, person, organization, location or topic of interest. One popular query is to search for news stories that have occurred in the last week. Figure 4.2-1 illustrates the response to such a query; notice that there were many references to "FAA", "GINGRICH" and "ZAIRE".
Figure 4.2-1. Named Entity Frequencies for a One Month Time Period
With the frequency screen displayed, the user can view the stories for one of the values by selecting it, for example "ZAIRE". Upon selection, BNN searches the multimedia database and displays the related stories, as seen in Figure 4.2-2. The returned stories are sorted in descending order of keyword occurrence. Each returned story contains the key frame, the date, the source, the six most frequent tags, a summary, and the ability to view the closed-caption text, the video and all of the tags found for the story. The summary is currently the first significant closed-caption line of the segment; in the future, the system will extract the sentence most relevant to the query.
While viewing the story segment, the user has the ability to access the digitized video of the full source, typically 30 minutes. Thus, if the story segment starts six minutes and twelve seconds into the news broadcast, the streaming of the video to the user starts at that point. While viewing the streaming video, the user can scroll through the video with VCR-like controls.
Figure 4.2-2. BNN Story Browse Window
4.3 System Direction
In the future, we will be integrating English and foreign-language speech transcription into BNE to supplement multimedia sources whose closed-captioning is incomplete or nonexistent. We will also decrease the execution time of the system so that the news is ready within an hour, compared to 1½ hours currently. In addition, because of the time required to process audio files with speech transcription algorithms, we will use video and audio segmentation techniques to detect the broadcast and commercial boundaries in an initial pass over the multimedia source. With these detected boundaries, we will be able to process the smaller broadcast segments (three to eight minutes each) in parallel, as opposed to processing the complete broadcast (typically 30 minutes) serially.
The following cues will also be added to BNE:
We will also be adding the following to BNN:
5. Conclusion

In this paper we discuss how we correlate cues detected from the video, audio and closed-caption streams to improve broadcast, commercial and story segmentation. By using the three streams, we demonstrate how we have increased segmentation accuracy over previous single-stream techniques. With these techniques, we plan to annotate and segment other domestic broadcast news sources. The challenges for different captioned news sources will be creating the new FSA and creating new video models for the video classification programs. With the addition of speech transcription, foreign-language sources will be annotated and segmented using the same techniques. Although our current techniques were created for a structured multimedia source (i.e., domestic broadcast news), we believe they can be applied to other multimedia sources (e.g., usability study videos, documentaries, films, surveillance video).
References

Aberdeen, J., Burger, J., Day, D., Hirschman, L., Robinson, P., and Vilain, M. 1995. Description of the Alembic System Used for MUC-6. In Proceedings of the Sixth Message Understanding Conference, 6-8. Columbia, MD: Advanced Research Projects Agency Information Technology Office.

Dubner, B. 1995. Automatic Scene Detector and Videotape Logging System, User Guide, 14. Dubner International, Inc.

Mani, I. 1995. "Very Large Scale Text Summarization." Technical Note, The MITRE Corporation.

Maybury, M., Merlino, A., and Rayson, J. 1997. Segmentation, Content Extraction and Visualization of Broadcast News Video Using Multistream Analysis. AAAI Spring Symposium, Stanford, CA.
Appendix A
State_ID | Description
1  | Start
2  | Wait for Broadcast Start
3  | Highlights Begin
4  | Highlights are ending
5  | Story Segment Detected
6  | Start of Story
7  | Near end of Story
8  | Advert Story Segment
9  | Wait for Advert End
10 | Broadcast End Story Segment
11 | Broadcast Over
12 | Story Buffer
13 | Possible Advertisement
14 | Very Possible Advertisement
15 | Prime Possible Advertisement
16 | Prime Highlight
17 | Very Near Start
18 | Possible end of Broadcast
19 | Very Possible End
20 | End Story Segment
21 | End Broadcast
22 | Reporter Segment
Appendix B
Start State | End State | Transition Cue
1  | 2  | CNN Prime News
2  | 3  | Anchor to Weather
2  | 3  | Triple_Greater
2  | 3  | Name Colon
2  | 3  | PRIMENEWS
2  | 3  | Signon
2  | 3  | Anchor to Reporter
2  | 3  | Reporter to Anchor
2  | 3  | Weather to Anchor
2  | 3  | Story Preview
2  | 3  | Im Start
2  | 3  | Double_Greater
2  | 17 | LogoBegin
3  | 4  | TIME
3  | 4  | Signon
3  | 4  | Im Start
3  | 4  | PRIMENEWS
3  | 5  | Reporter to Anchor
3  | 5  | Anchor to Reporter
4  | 5  | Triple_Greater
4  | 5  | PRIMENEWS
4  | 5  | Signon
4  | 5  | Im Start
4  | 5  | Reporter to Anchor
4  | 5  | Anchor to Reporter
4  | 5  | Double_Greater
5  | 12 | DEFAULT
6  | 5  | Triple_Greater
6  | 7  | TIME
6  | 7  | Reporter to Anchor
6  | 7  | Weather to Anchor
6  | 13 | SILENCE_START
6  | 13 | BlackScreen
6  | 13 | BLANK_LINE
6  | 15 | PRIMENEWS
6  | 15 | Story Preview
6  | 18 | Im End
6  | 18 | Signoff
6  | 22 | Anchor to Reporter
7  | 5  | Triple_Greater
7  | 5  | Name Colon
7  | 13 | SILENCE_START
7  | 13 | BLANK_LINE
7  | 13 | BlackScreen
7  | 15 | PRIMENEWS
7  | 15 | Story Preview
7  | 18 | Im End
7  | 18 | Signoff
7  | 22 | Anchor to Reporter
8  | 9  | DEFAULT
9  | 12 | Triple_Greater
9  | 12 | Name Colon
9  | 12 | Story Preview
9  | 12 | Anchor to Reporter
9  | 12 | Anchor to Weather
9  | 12 | Weather to Anchor
9  | 12 | Reporter to Anchor
9  | 12 | Double_Greater
9  | 12 | PRIMENEWS
9  | 18 | Im End
9  | 18 | Signoff
9  | 21 | TIME
10 | 11 | DEFAULT
12 | 6  | TIME
13 | 5  | Triple_Greater
13 | 5  | Name Colon
13 | 6  | Anchor to Reporter
13 | 7  | Double_Greater
13 | 7  | TIME
13 | 8  | SILENCE_START
13 | 8  | BlackScreen
13 | 14 | BLANK_LINE
13 | 15 | PRIMENEWS
13 | 15 | Story Preview
13 | 19 | Im End
13 | 19 | Signoff
14 | 5  | Triple_Greater
14 | 5  | Name Colon
14 | 6  | Anchor to Reporter
14 | 7  | Double_Greater
14 | 8  | SILENCE_START
14 | 8  | BLANK_LINE
14 | 13 | TIME
14 | 15 | PRIMENEWS
14 | 15 | Story Preview
14 | 19 | Im End
14 | 19 | Signoff
15 | 5  | Triple_Greater
15 | 6  | Anchor to Reporter
15 | 7  | TIME
15 | 8  | SILENCE_START
15 | 8  | BLANK_LINE
15 | 8  | BlackScreen
15 | 19 | Im End
15 | 19 | Signoff
17 | 3  | BlackScreen
17 | 3  | Triple_Greater
17 | 3  | Im Start
17 | 3  | Story Preview
17 | 3  | Anchor to Weather
17 | 3  | Weather to Anchor
17 | 3  | Anchor to Reporter
17 | 3  | Signon
17 | 3  | Double_Greater
17 | 3  | PRIMENEWS
17 | 3  | Name Colon
18 | 5  | Triple_Greater
18 | 5  | Name Colon
18 | 7  | TIME
18 | 19 | PRIMENEWS
18 | 19 | Im End
18 | 19 | Signoff
19 | 5  | Triple_Greater
19 | 20 | SILENCE_START
19 | 20 | BLANK_LINE
19 | 20 | BlackScreen
20 | 21 | DEFAULT
22 | 6  | TIME
22 | 7  | Reporter to Anchor
22 | 18 | Im End
22 | 18 | Signoff