
Topic Modeling with MALLET: Hyperparameter Optimization

By Christof Schöch, 13 November 2016
Source: https://dragonfly.hypotheses.org/1051
Tags: MALLET, Articles, My research, Tools, featured, hyperparameters, optimization, topic modeling

    This is a short technical post about an interesting feature of Mallet which I have recently discovered, or rather, whose (for me) unexpected effect on the resulting topic models I have recently discovered: the parameter that controls the hyperparameter optimization interval in Mallet.[1] Yes, there are parameters, there are hyperparameters, and there are parameters controlling how hyperparameters are optimized.

    For a long time, I believed that my most important decisions when topic modeling involved choosing parameters such as the appropriate number of topics, deciding on how best to split my long novels into smaller segments, and (something I still find extremely difficult) deciding which terms to include in my stoplist. By contrast, I fully trusted Mallet to choose the hyperparameters in an appropriate way, particularly because I was aware of the Mallet hyperparameter optimization functionality. This “optimize-interval” parameter is described as follows on the Mallet website: “This option turns on hyperparameter optimization, which allows the model to better fit the data by allowing some topics to be more prominent than others. Optimization every 10 iterations is reasonable.” My rationale was not to worry about this. After all, rather than meddling with parameters myself, introducing more subjective choices, it appeared to be a much more reasonable choice to let Mallet solve the issue for me.
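
    As a concrete illustration (my own sketch, not from the original post), this is roughly how the option is passed to Mallet’s train-topics command, here wrapped in a small Python call; the file names corpus.mallet, keys.txt and doctopics.txt are hypothetical placeholders, and the corpus is assumed to have been imported beforehand with Mallet’s import-file or import-dir command.

        import subprocess

        # Hypothetical paths; the corpus is assumed to exist already,
        # e.g. created with `mallet import-file --keep-sequence ...`.
        cmd = [
            "mallet", "train-topics",
            "--input", "corpus.mallet",
            "--num-topics", "100",
            "--num-iterations", "5000",
            "--optimize-interval", "100",       # the parameter discussed in this post
            "--output-topic-keys", "keys.txt",
            "--output-doc-topics", "doctopics.txt",
        ]
        subprocess.run(cmd, check=True)

    Leaving out --optimize-interval (or setting it to 0, the default) keeps the symmetric hyperparameters unchanged throughout the run.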

    At some point, I did get back to the issue of the hyperparameters and did some more reading, for example Hanna Wallach et al.’s 2009 piece on “Rethinking LDA: Why Priors Matter”. But I also went back to the relevant sections in Steyvers & Griffiths’s 2006 piece on “Probabilistic Topic Models” to get a firmer grasp of what hyperparameters really do; and Allen Riddell pointed me to a very useful paper on non-parametric topic models by Buntine and Mishra from 2014.[2] In a nutshell, the hyperparameters alpha and beta affect the distributional profile of the topics in each document and that of the words in each topic, respectively. And instead of keeping the hyperparameters fixed at some level, hyperparameter optimization re-estimates them from the data at regular intervals, which determines how far they can drift away from their initial values over the course of a given modeling process.
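
    To spell out the roles of the two hyperparameters in the usual LDA notation (a clarifying aside, not part of the original post): with K topics, each document’s topic distribution and each topic’s word distribution are drawn from Dirichlet priors,

        \theta_d \sim \mathrm{Dirichlet}(\alpha_1, \dots, \alpha_K) \qquad \text{(topic proportions of document } d\text{)}
        \varphi_k \sim \mathrm{Dirichlet}(\beta, \dots, \beta) \qquad \text{(word probabilities of topic } k\text{)}

    Without optimization, all alpha_k stay equal (a symmetric prior). With hyperparameter optimization switched on, Mallet, as far as I understand its implementation, re-estimates the alpha_k (and beta) from the data at the specified interval, so that the prior over topics can become asymmetric and some topics can become globally more prominent than others.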

    In fact, the major insight of Wallach et al. was that rather than deciding on fixed hyperparameters for the entire collection (with each topic having a similar probability in the model, and each word having a similar probability in each topic), it makes much more sense to allow for some differentiation between overall topic probabilities in a model: after all, it makes perfect sense that some topics are more general and therefore widespread while others are more specific and therefore less common. This intuition is implemented in the hyperparameter optimization function of Mallet.
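
    To get a feeling for what such differentiated priors do, here is a small illustrative sketch of my own (not from the original post): it draws per-document topic proportions from a symmetric and from an asymmetric Dirichlet prior and compares the average prominence of each topic. The concrete alpha values are arbitrary examples, not Mallet defaults.

        import numpy as np

        rng = np.random.default_rng(0)
        K, n_docs = 10, 5000

        # Symmetric prior: every topic has the same expected share in every document.
        alpha_sym = np.full(K, 0.5)
        # Asymmetric prior (illustrative values): a few topics are allowed to be
        # globally prominent, most are pushed towards the background.
        alpha_asym = np.array([2.0, 1.0, 0.5, 0.2, 0.1, 0.1, 0.05, 0.05, 0.05, 0.05])

        theta_sym = rng.dirichlet(alpha_sym, size=n_docs)
        theta_asym = rng.dirichlet(alpha_asym, size=n_docs)

        # Average topic proportion across documents, i.e. the overall topic probability.
        print("symmetric :", np.round(theta_sym.mean(axis=0), 3))   # roughly uniform
        print("asymmetric:", np.round(theta_asym.mean(axis=0), 3))  # strongly skewed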

    So far, so good! Obviously, optimization can only make things better (or so I thought), and performance doesn’t appear to be much affected by it either. As a consequence, when topic modeling, I decided to let Mallet do its thing and optimize every 100 iterations, running the process for 5,000 to 10,000 iterations. That is, until I did a series of test runs and began to understand the effect of Mallet’s hyperparameter optimization interval on the resulting model.

    In fact, exactly as described in the documentation, each time Mallet performs an optimization step, the topic probability distribution departs a bit more from the initial homogeneous distribution. If you run the modeling process without any optimization, the distribution stays flat at whatever hyperparameter you have chosen to start with (or at the default hyperparameter). In short, the model looks like the one shown in Figure 1:

    [Figure 1: Topic probability distribution without hyperparameter optimization]

    Here, each topic has exactly the same probability in the model overall as any other topic. This doesn’t mean topics can’t be more or less probable in any one document of the collection, of course. It just describes the overall probability in the entire collection. The default value of alpha per topic appears to be somewhere in the vicinity of 5 / number of topics.
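
    For readers who want to plot such overall topic probability distributions themselves, the sketch below shows one possible way to compute them from the file written by --output-doc-topics (my own sketch, not the code behind the figures). It assumes the newer output format, in which each row contains the document index, the document name, and then one proportion per topic in topic order; older Mallet versions write sorted topic/proportion pairs instead and would need different parsing.

        import numpy as np

        # Hypothetical file name, produced by `mallet train-topics ... --output-doc-topics doctopics.txt`.
        # Assumed row format: doc_index <TAB> doc_name <TAB> p_topic0 <TAB> p_topic1 ...
        rows = []
        with open("doctopics.txt", encoding="utf-8") as f:
            for line in f:
                if line.startswith("#"):           # skip a header line, if present
                    continue
                fields = line.rstrip("\n").split("\t")
                rows.append([float(p) for p in fields[2:]])

        doc_topics = np.array(rows)                # shape: (n_documents, n_topics)

        # Overall topic probability: mean proportion across all documents,
        # sorted in descending order as in the figures above.
        overall = np.sort(doc_topics.mean(axis=0))[::-1]
        print(overall)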

    However, once you start introducing optimization steps, things keep changing. The smaller the interval between these optimization steps, or the more iterations you perform (that is, the higher the absolute number of optimization steps), the more strongly the topic probability distribution departs from the flat distribution. This results in a small number of topics with ever higher probabilities (indicating widespread topics), an increasingly sharp drop-off, and a large number of topics with extremely low probabilities (indicating topics which are present only in a small proportion of the documents). Figure 2 shows the topic probability distributions for a number of models whose only difference is the optimization interval setting.

    [Figure 2: Topic probability distributions for models that differ only in their optimization interval]

    The smaller the optimization interval, the steeper the curve becomes. This may be useful in some cases, but I think it may be detrimental in others. If your goal is to identify small numbers of texts about specific themes in a large collection, then a lot of optimization may be good. However, if your goal is to identify topics typical of certain authors, periods, genres or some other reasonably large subset of your collection, then it may be better to optimize a bit less. In any case, it seems to me that it is quite possible to do too much or too little optimization for a given task.

    Of course, far from the nifty optimization feature taking the whole issue of choosing appropriate hyperparameters out of our hands, this appears to make matters only worse. In fact, it replaces one choice with another one: the question ‘What is a good value for the hyperparameters?’ becomes another question: ‘What is a good optimization interval?’ I think the answer to this question is the same as for all other parameters, and it has three parts.

    • The first part is: all parameter setting should be done with a reasonably good understanding of the topic modeling procedure, combined with good knowledge of the dataset under scrutiny and a clear idea of what purpose or research objective the model is supposed to serve.
    • The second part is: if there is a clear purpose or function, we should design a task (for example a classification task) using the topic model data as input, or choose some measure of model quality (for example topic coherence), and adjust the parameters so as to optimize the performance on the task or the topic coherence scores.
    • And the third part of the answer, you ask? Well, with so many decisions to take, the most important thing is to make the entire procedure as transparent as possible by publishing your complete code and data. (That’s what I did recently for a piece on French drama to appear in DHQ, whose underlying data, with 35 different models and their diagnostic data, is already on GitHub.)

    A note on the second part of the answer: using topic coherence scores as a measure of model quality may actually be misleading, at least in some cases. The highly-optimized models described above yield beautifully coherent, crisp topics, but depending on the task at hand they may not have much heuristic value, because each of these clear and coherent topics is relevant only to a tiny proportion of the collection under scrutiny.
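
    For completeness, here is a hedged sketch of how topic coherence scores of the kind mentioned in the second part of the answer could be computed, using the gensim library (my choice of tool, not necessarily the one used in the projects described here). The lists topics and texts are tiny placeholders; in practice, topics would be read from Mallet’s topic keys file and texts would be the tokenized segments the model was trained on.

        from gensim.corpora import Dictionary
        from gensim.models.coherencemodel import CoherenceModel

        # Placeholder data: top words per topic and the tokenized segments.
        topics = [["inspector", "letter", "found"], ["love", "heart", "letter"]]
        texts = [["the", "inspector", "found", "the", "letter"],
                 ["a", "love", "letter", "to", "the", "heart"]]

        dictionary = Dictionary(texts)
        cm = CoherenceModel(topics=topics, texts=texts, dictionary=dictionary, coherence="c_v")
        print(cm.get_coherence())          # one aggregate coherence score per model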

    In another recent project, I was able to test this approach again. First, when modeling a collection of 600 French novels (split into 38,000 segments), I generated 56 different models with various numbers of topics (60, 80, 100, 120, 160, 200, 240, 280) and various optimization intervals (10, 50, 100, 500, 1000, 2000, and no optimization). Then, I designed a ten-fold cross-validation classification task in which a classifier is asked to learn the distinction between crime fiction novels and other novels from the topic score data for 90% of the novels, and then to predict the correct label for the remaining 10%. The following figure shows an overview of the mean accuracy obtained with the different parameter settings.

    [Figure: Mean accuracy in the classification task for models with different numbers of topics and optimization intervals]

    Clearly, there is not a huge difference in performance, but the overall trend becomes clear: fewer topics, better performance on this two-class task, which makes sense. More importantly, the best-performing model (on average, across different classifiers) was one with not only just 80 topics, but also with hyperparameter optimization completely switched off. This is not to say that this is the best setting in all cases, but if you are interested in detecting trends affecting rather large groups in the data rather than in a fine-grained discovery tool, then it may make sense.
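
    To make the setup of such an evaluation concrete, here is a minimal sketch of a ten-fold cross-validation run on top of topic scores, using scikit-learn. X and y are random placeholders standing in for the real data (one row of averaged topic proportions per novel, and the label crime fiction vs. other), and the choice of logistic regression as the classifier is mine, not necessarily the one used in the project.

        import numpy as np
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import cross_val_score

        # Placeholder data: 600 novels, 80 topic-proportion features each, binary labels.
        rng = np.random.default_rng(0)
        X = rng.random((600, 80))
        y = rng.integers(0, 2, size=600)

        clf = LogisticRegression(max_iter=1000)
        scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
        print(scores.mean())               # mean accuracy over the ten folds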

    It all depends on the project’s aims. But it is important that we are aware of the massive effects that Mallet’s inconspicuous hyperparameter optimization parameter can have on the resulting models.

    Notes
    1. Another recent discovery, by the way: the diagnostics file which MALLET allows you to output, containing a host of interesting data both about the topics as a whole and about the words contained in each topic. But there is no need to write a blog post about it, because it has wonderfully detailed documentation.
    2. Buntine, Wray L., and Swapnil Mishra. “Experiments with Non-parametric Topic Models”. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 881–890. KDD ’14. New York, NY, USA: ACM, 2014. doi:10.1145/2623330.2623691. // Steyvers, Mark, and Tom Griffiths. “Probabilistic Topic Models”. In Latent Semantic Analysis: A Road to Meaning, edited by T. Landauer, D. McNamara, S. Dennis, and W. Kintsch. Lawrence Erlbaum, 2006. // Wallach, Hanna M., David M. Mimno, and Andrew McCallum. “Rethinking LDA: Why Priors Matter”. In Advances in Neural Information Processing Systems 22, 1973–1981. Curran Associates, Inc., 2009. http://papers.nips.cc/paper/3854-rethinking-lda-why-priors-matter.pdf.