Résumé
The quantitative analysis of cultural history was opened up by the release of massive corpora such as Google Books (500 billion words, 5 million volumes, roughly 4% of the world's literature) and was popularized under the name of "culturomics." It is now available to researchers, promising deep access to cultural facts and their evolution as they surface through their textual traces in digitized corpora. Yet one must still be able to query these corpora, whose size and nature raise new scientific problems: their sheer dimension makes them unreadable directly, defeats traditional mining methods and standard tools of statistical data analysis, and requires both new statistical methods and a leap toward original forms of visual intelligence. Within the framework of a project conducted between the Labex OBVIL (Paris-Sorbonne) and the Stanford Literary Lab on the history of the idea of literature (the definition of literature as a word, as a concept, and as a field), and aiming to produce an empirical history of literature, we have spent two years running mining experiments on a corpus of literary criticism of 1,618 titles and 140 million words (including more than 50,000 occurrences of the lemma "littérature"), from the end of the Ancien Régime to the Second World War. Presenting examples developed in this first large-scale experiment in measuring the history of ideas, we survey contemporary text-mining methods and test their heuristic relevance and their capacity to bring out data that are meaningful for literary history and theory. We hypothesize that any serious quantitative inquiry now relies not on a standard, immediately readable intermediate scale, but on the handling of statistical tools whose interpretation in the humanities raises specific problems which, paradoxically, can only be resolved by articulating them closely with close reading and fine-grained measurements.
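The closed, frequency-based measure evoked here (tracking the lemma "littérature" against corpus size over time) can be illustrated by a minimal sketch. This is not the project's actual pipeline; the corpus layout is a hypothetical assumption (lemmatized plain-text files named YYYY_title.txt in a ./corpus/ directory).

```python
"""Minimal sketch: relative frequency of a lemma per year over a corpus.
Assumptions: files in ./corpus/ are named YYYY_title.txt and are already
lemmatized (e.g. with TreeTagger), one lemma per whitespace-separated token."""
from collections import Counter
from pathlib import Path

TARGET = "littérature"

tokens_per_year = Counter()   # total number of lemmas per year
hits_per_year = Counter()     # occurrences of the target lemma per year

for path in Path("corpus").glob("*.txt"):
    year = int(path.name[:4])                          # YYYY_ prefix (assumed)
    lemmas = path.read_text(encoding="utf-8").lower().split()
    tokens_per_year[year] += len(lemmas)
    hits_per_year[year] += sum(1 for lemma in lemmas if lemma == TARGET)

# Report a size-independent measure: occurrences per million tokens.
for year in sorted(tokens_per_year):
    rate = 1_000_000 * hits_per_year[year] / tokens_per_year[year]
    print(f"{year}\t{rate:.1f}")
```

Normalizing by yearly token counts rather than reporting raw hits is what makes curves comparable across periods in which the corpus is unevenly sampled.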
Abstract
Quantitative analysis of cultural history began with the release of massive open data such as Google Books and became known as "culturomics." It is now open to researchers and literary critics, giving access to cultural facts and their evolution through textual traces within digitized data. These massive corpora cannot be analyzed blindly: they may lack substantial metadata or, in the worst cases, be very noisy. For massive corpora, that is, corpora of billions of words, common visualization tools such as Voyant Tools or TXM, and the methods these programs use to analyze data, are not reliably efficient. Within the framework of a project on literary history between the Labex OBVIL and the Stanford Literary Lab, aiming to define literature as a word, a concept, and a semantic field, and to draw an empirical history of literature, we analyzed 1,618 French books, i.e. a 140-million-word corpus, from the end of the Ancien Régime up to the Second World War. To do so, we used several experimental text-mining techniques, combining distant and close reading. In this article, we explore different kinds of text mining, such as closed (frequency-based) measures, unsupervised machine analysis (topic modeling), and semi-open methods (collocations), each time pointing out their benefits and drawbacks. We then show how necessary it is to apply deeper and more precise text mining that relies on substantial metadata, such as lemmatized data, syntactic structure, and semantic analysis (e.g. word vectors). We finally show that a substantial study of big literary corpora cannot separate distant and close reading, as each tends to confirm or contradict the other in a most effective way for producing evolving representations of the history of literature.
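For the "word vectors" step mentioned above, a minimal sketch is given below using the gensim library. This is an assumption, not the project's documented toolchain (the references point to Mikolov et al.'s word2vec and to GloVe); the corpus file name and layout (one lemmatized sentence per line in corpus_sentences.txt) are hypothetical.

```python
"""Minimal word-embedding sketch with gensim (>= 4.0 parameter names).
Assumed input: corpus_sentences.txt, one lemmatized sentence per line."""
from gensim.models import Word2Vec

# Load sentences as lists of tokens (gensim accepts any iterable of token lists).
with open("corpus_sentences.txt", encoding="utf-8") as fh:
    sentences = [line.lower().split() for line in fh]

# Train a skip-gram model (sg=1), which tends to work well for rarer lemmas.
model = Word2Vec(
    sentences,
    vector_size=200,   # dimensionality of the embedding space
    window=5,          # context window on each side of the target word
    min_count=10,      # ignore very rare lemmas
    sg=1,
    workers=4,
)

# Inspect the semantic neighbourhood of the lemma under study.
for neighbour, similarity in model.wv.most_similar("littérature", topn=10):
    print(f"{neighbour}\t{similarity:.3f}")
```

Comparing the nearest neighbours of the same lemma in models trained on different chronological slices of the corpus is one simple way to turn such vectors into evidence for conceptual change, which can then be checked against close reading.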
References
Archer, Jodie & Matthew L. Jockers. 2016. The bestseller code: Anatomy of the blockbuster novel. New York: St. Martin's Press.
Ardanuy, Mariona Coll & Caroline Sporleder. 2014. Structure-based clustering of novels. In Proceedings of the Third Workshop on Computational Linguistics for Literature (CLfL) @ EACL, 31–39.
Burnard, Lou. 2010. TEI P5: Guidelines for electronic text encoding and interchange, version 1.6.0. http://www.tei-c.org/release/doc/tei-p5-doc/en/Guidelines.pdf (accessed 3 June 2019).
Glorieux, Frédéric. 2017. Nuage de mots. http://obvil.lip6.fr/alix/wordcloud.jsp?&bibcode=proust_recherche&frantext=on (accessed 3 June 2019).
Harris, Zellig S. 1954. Distributional structure. Word 10(2–3). 146–162. doi:10.1080/00437956.1954.11659520.
Heiden, Serge, Jean-Philippe Magué & Bénédicte Pincemin. 2010. TXM : Une plateforme logicielle open-source pour la textométrie – conception et développement. In Tenth International Conference on the Statistical Analysis of Textual Data (JADT 2010), vol. 2, 1021–1032. Milan: Edizioni Universitarie di Lettere Economia Diritto.
Manning, Christopher D., Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard & David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In ACL (System Demonstrations), 55–60. doi:10.3115/v1/P14-5010.
Mikolov, Tomas, Kai Chen, Greg Corrado & Jeffrey Dean. 2014. word2vec. https://code.google.com/p/word2vec (accessed 3 June 2019).
Moretti, Franco. 2005. Graphs, maps, trees: Abstract models for a literary history. London: Verso.
Moretti, Franco. 2013. Operationalizing: or, the function of measurement in modern literary theory (Literary Lab Pamphlet 6). Stanford, CA: Stanford Literary Lab.
Pennington, Jeffrey, Richard Socher & Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–1543. doi:10.3115/v1/D14-1162.
Posner, Miriam. 2012. Very basic strategies for interpreting results from the topic modeling tool. https://miriamposner.com/blog/very-basic-strategies-for-interpreting-results-from-the-topic-modeling-tool/ (accessed 3 June 2019).
Schmid, Helmut. 1995. TreeTagger, a language independent part-of-speech tagger. Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart 43. 28.
Sinclair, Stéfan & Geoffrey Rockwell. 2014. Voyant Tools. https://voyant-tools.org/docs/#!/guide/about (accessed 3 June 2019).
Snow, Rion, Daniel Jurafsky & Andrew Y. Ng. 2005. Learning syntactic patterns for automatic hypernym discovery. In Lawrence K. Saul, Yair Weiss & Léon Bottou (eds.), Advances in neural information processing systems, vol. 17, 1297–1304. Cambridge, MA: MIT Press.
Underwood, Ted. 2012. The stone and the shell. https://tedunderwood.com/ (accessed 3 June 2019).
Wieviorka, Michel. 2013. L'impératif numérique ou La nouvelle ère des sciences humaines et sociales ? Paris: CNRS. doi:10.3917/cnrs.wiev.2013.01.
Zhang, Sarah. 2015. The pitfalls of using Google Ngram to study language. Wired. https://www.wired.com/2015/10/pitfalls-of-studying-language-with-google-ngram/ (accessed 3 June 2019).