Researchers are inventing better ways to find and make sense of information. Efforts to improve data mining and searching are being driven by the deluge of information in this increasingly networked world and by companies’ need to respond ever faster to changes. And, sadly, the field has gotten a big boost from the terrorist attacks on the U.S. last fall.
Computerworld looked at some of this research and found companies perfecting techniques for machine learning, real-time analysis of data flows, distributed data mining and the discovery of “nonobvious” relationships.
Known Associates
Systems Research & Development (SRD) developed its Non-Obvious Relationship Awareness (NORA) technology to help casinos identify cheaters by correlating information from multiple sources about relationships and earlier transactions.
Las Vegas-based SRD, which received funding from the CIA, is now developing several NORA plug-ins to reach further into the world of criminals and terrorists. Last month, the company unveiled a “degrees of separation” capability that finds deeper connections among people.
“It will tell you that the Drug Enforcement Agency’s agent’s college roommate’s ex-wife’s current husband is the drug lord,” says Jeff Jonas, chief technology officer at SRD. NORA can bridge up to 30 such links, he says.
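In spirit, a degrees-of-separation search of this kind is a bounded walk over a graph of known relationships. The sketch below is an illustration only, not SRD’s implementation; the names, links and 30-hop limit are hypothetical stand-ins for the behavior Jonas describes.

```python
from collections import deque

def degrees_of_separation(graph, start, target, max_links=30):
    """Breadth-first search for a chain of relationships linking two
    identities, stopping after max_links hops."""
    queue = deque([(start, [start])])
    visited = {start}
    while queue:
        person, path = queue.popleft()
        if person == target:
            return path  # the full chain, start through target
        if len(path) - 1 >= max_links:
            continue  # do not follow links beyond the hop limit
        for associate in graph.get(person, ()):
            if associate not in visited:
                visited.add(associate)
                queue.append((associate, path + [associate]))
    return None  # no connection within max_links hops

# Hypothetical relationship graph: identity -> known associates
links = {
    "dea_agent": {"college_roommate"},
    "college_roommate": {"ex_wife"},
    "ex_wife": {"current_husband_aka_drug_lord"},
}
print(degrees_of_separation(links, "dea_agent", "current_husband_aka_drug_lord"))
# ['dea_agent', 'college_roommate', 'ex_wife', 'current_husband_aka_drug_lord']
```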
The new NORA module uses streaming technology that scans data and extracts information in real time as it flows by. That would allow it to, for example, instantly discover that a man at an airline ticket counter shares a phone number with a known terrorist and then issue an alert before he can board his flight. Jonas calls it “perpetual analytics,” to distinguish it from periodic queries against an occasionally updated database.
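The following toy example suggests how such a streaming check might look: each record is matched against watchlist attributes the moment it arrives, rather than queried later from an archive. The field names, watchlist and alert callback are invented for illustration and are not SRD’s interfaces.

```python
# "Perpetual analytics" in miniature: check every transaction as it streams by,
# instead of running periodic queries against an occasionally updated database.

watchlist_phone_numbers = {"+1-702-555-0142"}  # hypothetical known-bad number

def on_transaction(record, alert):
    """Called once per record as it flows past; never waits for a batch load."""
    if record.get("phone") in watchlist_phone_numbers:
        alert(f"Ticket-counter match: {record['name']} shares a phone number "
              f"with a watchlisted identity")

# Example stream of ticket-counter records
stream = [
    {"name": "J. Doe", "phone": "+1-702-555-0142", "flight": "LV-214"},
]
for record in stream:
    on_transaction(record, alert=print)
```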
SRD is also developing the concept of “cascading” NORA data warehouses for really big problems.
For example, Jonas says, each airline might have a copy of NORA processing its passenger data and sending the summarized results to a midtier NORA system at the Federal Aviation Administration. Car rental agencies might send their NORA results to a rental car association. And the U.S. Immigration and Naturalization Service could collect data from ports of entry.
All three midtier NORA systems would then send transactions to the top-tier system at the Office of Homeland Security in Washington. They would communicate with one another in a “zero administration” arrangement in which rules and filters would determine whether a piece of information got passed up or down the chain, Jonas says.
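A rough sketch of that cascade, assuming a simple escalation rule; the tier names, fields and rule below are hypothetical, not SRD’s design:

```python
class NoraTier:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent  # the next tier up the chain, if any

    def escalate(self, finding):
        """'Zero administration' style rule: only watchlist hits move upward."""
        return finding.get("watchlist_hit", False)

    def process(self, finding):
        print(f"{self.name} received finding from {finding['source']}")
        if self.parent and self.escalate(finding):
            # Pass the finding up the chain, tagged with this tier as the source.
            self.parent.process({**finding, "source": self.name})

homeland_security = NoraTier("Office of Homeland Security (top tier)")
faa_midtier = NoraTier("FAA mid-tier", parent=homeland_security)
airline_nora = NoraTier("Airline NORA", parent=faa_midtier)

# A watchlist hit at the airline bubbles up through the FAA to the top tier.
airline_nora.process({"source": "ticket counter", "watchlist_hit": True})
```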
Outbreak Detection
If a bioterrorist attack occurred, it would be critical for health and law enforcement officials to find out quickly, even before people were diagnosed with a specific disease.
The key to doing that lies in distributed data mining, says Tom Mitchell, a computer science professor at Carnegie Mellon University in Pittsburgh. Carnegie Mellon and the University of Pittsburgh recently fielded the Real-time Outbreak and Disease Surveillance (RODS) system, which takes data feeds from the emergency rooms of 17 local hospitals, loads them into a database and applies statistical techniques to predict the occurrence of diseases such as anthrax and smallpox. The universities also used RODS during this year’s Olympic Winter Games in Salt Lake City.
The system considers 30 to 100 variables every few minutes over a large geographic area, says project co-director Andrew Moore, who is also director of the Biomedical Security Institute in Pittsburgh. “We are looking at between 1 million and 1 trillion possible strange things going on, possible indicators of various kinds of disease,” says Moore. “If we are not careful, we’ll use a year’s worth of supercomputer time every day.”
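In miniature, one of those checks might look like the following: compare the current count of one symptom category in one area against its recent history and flag unusual spikes. The data, threshold and field names are invented for illustration and are not the RODS algorithms.

```python
import statistics

def flag_outbreak(recent_daily_counts, todays_count, z_threshold=3.0):
    """Flag a count that sits more than z_threshold standard deviations
    above the recent mean for one symptom/region pair."""
    mean = statistics.mean(recent_daily_counts)
    stdev = statistics.pstdev(recent_daily_counts) or 1.0  # avoid divide-by-zero
    z = (todays_count - mean) / stdev
    return z > z_threshold, z

# Hypothetical ER visits coded "respiratory distress" in one ZIP code
history = [4, 6, 5, 7, 5, 6, 4, 5, 6, 5]
alert, z = flag_outbreak(history, todays_count=19)
print(alert, round(z, 1))  # True 15.2 -- far above the recent baseline
```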
Project members are working on better algorithms and have increased processing efficiency by a factor of 10,000 in the past year, Moore says, but more improvements are needed. The system may be expanded to look at pharmacy cash-register data, school attendance records, animal sickness data, phone call records and vehicular traffic patterns, all of which may hold real-time clues about changes in a population’s health.
But gathering all that information raises privacy and confidentiality concerns. At present, the hospital data comes into a central repository where it’s carefully scrubbed of information that could be used to identify anyone. Carnegie Mellon researchers are looking at ways to push that scrubbing activity out to the data source.
“How can you design a data mining system that, instead of running on a central repository, allows each hospital, store and so on to keep their own records and not reveal the identities?” asks Mitchell, director of Carnegie Mellon’s Center for Automated Learning and Discovery. “What you want to do is give them some software that they can use to put their own privacy restrictions on.”
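One minimal way to picture that, assuming each hospital strips or one-way hashes its identifying fields before sharing anything with the central system; the field list and salt handling below are simplified assumptions, not the researchers’ software.

```python
import hashlib

IDENTIFYING_FIELDS = {"name", "address", "phone", "ssn"}

def deidentify(record, site_salt):
    """Run at the hospital, not at the central repository."""
    shared = {}
    for field, value in record.items():
        if field in IDENTIFYING_FIELDS:
            # Replace the identifier with a keyed hash so the hospital can still
            # link the same patient across its own records without revealing who it is.
            digest = hashlib.sha256((site_salt + str(value)).encode()).hexdigest()
            shared[field] = digest[:12]
        else:
            shared[field] = value
    return shared

record = {"name": "Jane Q. Public", "ssn": "000-00-0000",
          "zip": "15213", "symptom": "fever", "visit_date": "2002-03-01"}
print(deidentify(record, site_salt="hospital-17-secret"))
```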
That concept could be applied in many domains, Mitchell says. For example, intelligence agencies could use it to allow information-sharing across departments while protecting the sources of the information, he says.
Upside Down
“Instead of archiving data and running search queries through it, we archive search queries and run data through it,” says Val Jerdes, vice-president for business development at Streamlogic Inc. “It’s a search engine on its head.”
The advantage of an inverted search engine, he claims, is that it’s 6,000 times more efficient than the conventional approach. It can handle huge volumes of data that would be expensive or impossible to process using the standard method of loading data into an archive, indexing it and then retroactively querying it.
Los Altos Hills, Calif.-based Streamlogic’s feed-monitoring technology “strains” the information through query rules in real time, eliminating the archival requirement entirely. A demonstration at www.streamlogic.com runs all the postings to some 50,000 Usenet newsgroups (10 postings per second, or 2GB per day) through a database of user-specified topics and instantly sends an alert every time one of those topics appears in a post. It also turns unstructured information into data that can be put into a relational database for further analysis.
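A simplified sketch of that inversion: the stored objects are the users’ standing queries, and each arriving post is strained through them once, with no archive or index of the posts themselves. The keyword-containment matching rule and names below are assumptions for illustration, not Streamlogic’s engine.

```python
stored_queries = {
    "alice": {"anthrax", "smallpox"},
    "bob": {"usenet", "streaming"},
}

def strain(post_text, queries):
    """Return the users whose standing query matches this single post."""
    words = set(post_text.lower().split())
    return [user for user, terms in queries.items() if terms & words]

for post in ["New streaming toolkit announced for Usenet feeds",
             "Nothing to see here"]:
    for user in strain(post, stored_queries):
        print(f"alert {user}: matched post -> {post!r}")
```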
A feed-processing engine plucks out information based on user-specified topics or keywords. A feed analysis engine uses statistical techniques to analyze, categorize and summarize information for trend identification, advertisement targeting and other applications. The engine improves with use as it learns the most relevant words and phrases, says Streamlogic.
The future of these concepts lies in applications that others will develop with Streamlogic’s tool kit, which includes a collection of “metaware” and a language similar to SQL. For example, it could be used to speed and unify the flow of data throughout an enterprise, Jerdes says.
“So when a customer’s order comes in, instead of moving from one database to another in functional silos, we are able to dissolve the walls so that the order gets through to manufacturing, customer relations, financial and sales systems,” he says. “And all that could happen instantaneously.”
What’s the Answer?
When someone types the query “What is the population of the world?” into an Internet search engine, he most likely wants the numerical answer, 6.2 billion, not pointers to hundreds of documents containing the words “population” and “world.” Unfortunately, today’s search engines produce more document hits than answers.
But Verity Inc. is developing software that will be a lot smarter, says Prabhakar Raghavan, chief technology officer at the Sunnyvale, Calif.-based company. The approach involves putting human learning, or rules, into the software and enabling that software to teach itself in a process called machine learning.
Suppose you want to build a recruiting system that automatically extracts information from the scanned r