Topic Modeling for OLAP on Multidimensional Text Databases: Topic Cube and its Applications

As the amount of textual information in business systems grows explosively, it is increasingly desirable to analyze structured data records and unstructured text data together. Although online analytical processing (OLAP) techniques have proven very useful for analyzing and mining structured data, they face challenges in handling text data. Probabilistic topic models, on the other hand, are among the most effective approaches to latent topic analysis and mining on text data. In this paper, we study a new data model called topic cube, which combines OLAP with probabilistic topic modeling and enables OLAP on the text dimension of a multidimensional text database. Topic cube extends the traditional data cube to cope with a topic hierarchy and stores probabilistic content measures of text documents learned through a probabilistic topic model. To materialize topic cubes efficiently, we propose two heuristic aggregations that speed up the iterative Expectation-Maximization (EM) algorithm for estimating topic models: the models learned on component data cells are used to choose a good starting point for iteration. Experimental results show that these heuristic aggregations are much faster than the baseline method of computing each topic cube from scratch. We also discuss some potential uses of topic cube and show sample experimental results.
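
To make the warm-start idea in the abstract concrete, the following is a minimal sketch in Python with NumPy, not the paper's actual procedure. It assumes a PLSA-style topic model, assumes that topic index z refers to the same topic in every component cell, and uses a size-weighted average of the component cells' word distributions as the EM starting point for the aggregated cell. The names `plsa_em`, `aggregate_topics`, and `sample_cell`, and the synthetic data, are hypothetical and introduced only for illustration.

```python
import numpy as np


def plsa_em(counts, k, init_topics=None, max_iter=200, tol=1e-6, seed=0):
    """Fit a PLSA-style model to a (docs x words) count matrix with EM.

    Returns (topics, doc_mix, n_iter): topics is k x words, p(w|z);
    doc_mix is docs x k, p(z|d). Assumes every document is non-empty.
    """
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    if init_topics is None:
        topics = rng.random((k, n_words)) + 1e-3       # cold start
    else:
        topics = np.asarray(init_topics, dtype=float) + 1e-12  # warm start
    topics = topics / topics.sum(axis=1, keepdims=True)
    doc_mix = np.full((n_docs, k), 1.0 / k)
    prev_ll = -np.inf
    for it in range(1, max_iter + 1):
        mix = np.maximum(doc_mix @ topics, 1e-300)     # p(w|d), docs x words
        ll = float((counts * np.log(mix)).sum())
        if ll - prev_ll < tol * abs(ll):               # relative improvement
            break
        prev_ll = ll
        # E-step and M-step fused: expected counts n(d,w) * p(z|d,w)
        ratio = counts / mix
        new_topics = topics * (doc_mix.T @ ratio)
        new_doc_mix = doc_mix * (ratio @ topics.T)
        topics = new_topics / new_topics.sum(axis=1, keepdims=True)
        doc_mix = new_doc_mix / new_doc_mix.sum(axis=1, keepdims=True)
    return topics, doc_mix, it


def aggregate_topics(child_topics, child_sizes):
    """Size-weighted average of child-cell p(w|z); assumes aligned topics."""
    weights = np.asarray(child_sizes, dtype=float)
    weights = weights / weights.sum()
    return np.tensordot(weights, np.stack(child_topics), axes=1)


# Synthetic data: two component cells drawn from the same three topics.
rng = np.random.default_rng(1)
true_topics = rng.dirichlet(np.full(30, 0.5), size=3)


def sample_cell(n_docs, doc_len=50):
    mixes = rng.dirichlet(np.ones(3), size=n_docs)
    return np.stack([rng.multinomial(doc_len, m @ true_topics) for m in mixes])


cell_a, cell_b = sample_cell(40), sample_cell(60)
topics_a, _, _ = plsa_em(cell_a, k=3)
topics_b, _, _ = plsa_em(cell_b, k=3)

# Aggregated (parent) cell: cold start versus warm start from the children.
parent = np.vstack([cell_a, cell_b])
warm = aggregate_topics([topics_a, topics_b], [len(cell_a), len(cell_b)])
topics_cold, _, it_cold = plsa_em(parent, k=3)
topics_warm, _, it_warm = plsa_em(parent, k=3, init_topics=warm)
print("EM iterations, cold start:", it_cold, "warm start:", it_warm)
```

Comparing `it_cold` and `it_warm` gives a rough sense of how much a starting point built from component cells can shorten EM on the aggregated cell. The saving depends on the data and on how well the topics are aligned across cells, so the sketch makes no claim about the speedups reported in the paper.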

Additional Info

Field: Value
Maintainer: Nikunj Oza
Last Updated: March 31, 2025, 19:47 (UTC)
Created: March 31, 2025, 19:47 (UTC)
accessLevel: public
accrualPeriodicity: irregular
bureauCode: {026:00}
catalog_@context: https://project-open-data.cio.gov/v1.1/schema/catalog.jsonld
catalog_@id: https://data.nasa.gov/data.json
catalog_conformsTo: https://project-open-data.cio.gov/v1.1/schema
catalog_describedBy: https://project-open-data.cio.gov/v1.1/schema/catalog.json
harvest_object_id: b51aa099-dc8c-4b10-843a-52da9a01a253
harvest_source_id: 61638e72-b36c-4866-9d28-551a3062f158
harvest_source_title: DNG Legacy Data
identifier: DASHLINK_541
issued: 2012-02-26
landingPage: https://c3.nasa.gov/dashlink/resources/541/
modified: 2020-01-29
programCode: {026:029}
publisher: Dashlink
resource-type: Dataset
source_datajson_identifier: true
source_hash: 6a791b93464d296f835a812b1a108c6da8c1769cb2d8cc612dd3aece62ea679f
source_schema_version: 1.1