Exploring Web Archives: Challenges and Solutions - KBS
Exploring Web Archives: Challenges and Solutions
Vaibhav Kasturia (vbh18kas@gmail.com)
Supervisor: Prof. Dr. Wolfgang Nejdl (nejdl@l3s.de)
12 July 2016
Outline
• Social Media Growth: Twitter
• Social Media Content Loss
• Need for Web Archiving
• Temporal Information
• Temporal Tagging
• Applications and Challenges
• Conclusion
Source: http://iabireland.ie/wp-content/uploads/2015/11/social-media-original.jpg
Tremendous Growth of Social Media
Source: http://www.infinitdatum.com/wp-content/uploads/2014/12/social-media-data.jpg
What Do We Preserve?
Sources: http://tinyurl.com/hqgp4te; http://tinyurl.com/hdhcam8; http://tinyurl.com/hyarvmn
How Much Social Media Content Gets Lost? [1]
• Culturally Significant Events (June 2009 - March 2012)
  - H1N1 Virus Outbreak
  - Syrian Uprising
  - Egyptian Revolution
  - Iranian Elections
  - Michael Jackson’s Death
  - Obama Receives the Nobel Peace Prize
[1] Salaheldeen, H.; Nelson, M. L.: Losing My Revolution: How Many Resources Shared on Social Media Have Been Lost? JCDL, Washington, USA, 2012
Sources: http://cdni.wired.co.uk/620x413/d_f/FLU1.jpg; http://tinyurl.com/jy32puj; http://tinyurl.com/hwozc4o; http://tinyurl.com/z3uztlc; http://tinyurl.com/grvutxu; http://tinyurl.com/bc893pf
Tweets from Twitter
• T = Timestamp
• U = Link to the user posting the tweet
• W = Tweet content
Sources: http://tinyurl.com/jlktjg7; http://tinyurl.com/zz55q38
Finding Relevant Tweets
• Swine Flu versus Common Cold: #h1n1 versus #flu
Sources: http://tinyurl.com/hvh66mx; http://tinyurl.com/jo3tsj9
Finding Relevant Tweets
• Michael Jackson’s Death versus Paul Walker’s Death: #michaeljackson or #mj versus #rip
Sources: http://tinyurl.com/z2k6zo5; http://tinyurl.com/zghqh8t; http://tinyurl.com/hxzgr5g
Finding Relevant Tweets
• #obama?
  - White House Correspondents’ Dinner
  - Receiving the Nobel Peace Prize
  - Visit to Hannover
Sources: http://tinyurl.com/jf48jek; http://tinyurl.com/jtpbqrq; http://tinyurl.com/hxdhmvv
Finding Relevant Tweets
Table 1: Twitter hashtags generated for filtering and their frequency of occurrence [1]
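The hashtag-based filtering described on the preceding slides can be sketched roughly as follows. This is only an illustration, not the procedure used in [1]: the tuple layout (T, U, W) follows the earlier slide, while the function names and the sample tweets are invented for the example.

```python
import re

# Matches a hashtag such as "#h1n1" and captures the tag without the "#".
HASHTAG = re.compile(r"#(\w+)")

def hashtags(text):
    """Return the set of lowercase hashtags found in a tweet's text."""
    return {tag.lower() for tag in HASHTAG.findall(text)}

def filter_tweets(tweets, wanted):
    """Keep tweets (T, U, W) whose content W carries at least one wanted hashtag."""
    wanted = {w.lower().lstrip("#") for w in wanted}
    return [tweet for tweet in tweets if hashtags(tweet[2]) & wanted]

# Invented sample data in (timestamp, user link, content) form.
SAMPLE_TWEETS = [
    ("2009-06-25T22:10:00", "http://twitter.com/user_a", "Can't believe it #MJ"),
    ("2009-06-26T08:00:00", "http://twitter.com/user_b", "Staying in bed today #flu"),
    ("2009-06-26T09:30:00", "http://twitter.com/user_c", "New #h1n1 cases reported"),
]

if __name__ == "__main__":
    for tweet in filter_tweets(SAMPLE_TWEETS, ["#michaeljackson", "#mj"]):
        print(tweet)
```

A specific tag such as #h1n1 or #mj keeps the result set on topic, whereas a broad tag such as #flu or #rip would also pull in unrelated tweets.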
Uniqueness Check and Duplicate Elimination
• http://www.formula1.com and http://www.f1.com both resolve to http://www.formula1.com
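A minimal sketch of duplicate elimination by final target URL, under stated assumptions: the redirect table is a hypothetical stand-in for live HTTP lookups (modeled on the slide's example), and the normalization rules (lowercase scheme and host, empty path treated as "/", fragment dropped) are illustrative choices rather than the rules used in [1].

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical redirect table; a real pipeline would follow HTTP redirects.
REDIRECTS = {"http://www.f1.com/": "http://www.formula1.com/"}

def normalize(url):
    """Lowercase scheme and host, use "/" for an empty path, drop the fragment."""
    parts = urlsplit(url.strip())
    path = parts.path or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))

def resolve(url, redirects=REDIRECTS, max_hops=10):
    """Follow the redirect table until a URL with no further redirect is reached."""
    current = normalize(url)
    for _ in range(max_hops):
        target = redirects.get(current)
        if target is None:
            return current
        current = normalize(target)
    return current  # give up after max_hops to avoid redirect loops

def unique_resources(urls, redirects=REDIRECTS):
    """Return final URLs in first-seen order, with duplicates removed."""
    seen = set()
    result = []
    for url in urls:
        final = resolve(url, redirects)
        if final not in seen:
            seen.add(final)
            result.append(final)
    return result

if __name__ == "__main__":
    print(unique_resources(["http://www.formula1.com", "http://www.f1.com"]))
```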
Checking for Lost and Archived Resources
• Success Class
  - 200 OK
• Failure Class
  - 404 Not Found
  - 403 Forbidden
  - 410 Gone
  - 30X Redirect Family
  - 50X Server Error
  - Soft 404s
• Soft 404 Detection [2], example URLs: http://www.ibm.com/us; http://www.ibm.com/us/blahblah; http://www.ibm.com/us-en/
[2] Bar-Yossef, Z., Broder, A.Z., Kumar, R., Tomkins, A.: Sic Transit Gloria Telae: Towards an Understanding of the Web’s Decay. In: Proceedings of the 13th International Conference on World Wide Web, WWW 2004, pp. 328–337
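The status classes above, together with a simplified soft-404 probe in the spirit of [2], might be sketched as below. This is not the published algorithm: the `fetch` callable (returning a status code and body), the word-overlap similarity, and the 0.9 threshold are all assumptions made for illustration. The idea is to request a deliberately nonexistent sibling URL and flag the original as a soft 404 if the server answers both with 200 and near-identical content.

```python
import random
import string
from urllib.parse import urlsplit

def classify_status(status):
    """Map an HTTP status code to the slide's success/failure classes."""
    if status == 200:
        return "success"
    if status in (403, 404, 410) or 300 <= status < 400 or 500 <= status < 600:
        return "failure"
    return "other"  # codes the slide does not list

def jaccard(a, b):
    """Word-set overlap between two page bodies (1.0 means identical sets)."""
    words_a, words_b = set(a.split()), set(b.split())
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)

def probe_url(url, rng=random):
    """Build a sibling URL with a random last path segment, assumed not to exist."""
    parts = urlsplit(url)
    directory = parts.path.rsplit("/", 1)[0]
    junk = "".join(rng.choice(string.ascii_lowercase) for _ in range(20))
    return parts._replace(path=directory + "/" + junk, query="", fragment="").geturl()

def is_soft_404(url, fetch, threshold=0.9):
    """fetch(url) -> (status, body) is a caller-supplied, hypothetical HTTP client."""
    status, body = fetch(url)
    if status != 200:
        return False  # a hard failure, not a soft 404
    probe_status, probe_body = fetch(probe_url(url))
    if probe_status != 200:
        return False  # the server reports missing pages honestly
    return jaccard(body, probe_body) >= threshold
```

Passing `fetch` in as a parameter keeps the sketch independent of any particular HTTP library and makes it easy to exercise with canned responses.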
Building Model
Fig. 1. URIs shared per day corresponding to each event [1]
Building Model
Table 2: The Split Dataset [1]
Fig. 2. Percentage of content missing and archived as a function of time [1]
Observations from Model
• Linear relationship between
  - Content Lost Percentage or Content Archived Percentage
  - Age in Days
• Content Lost Percentage = 0.02 × (Age in Days) + 4.20
• Content Archived Percentage = 0.04 × (Age in Days) + 6.74
• A year after content is published on social media, about 11% of it is gone
• After this point, roughly 0.02% of content is lost per day
• Two and three years later, about 19% and 26% of content is lost
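As a quick check of the two linear fits, the short sketch below evaluates them at one, two and three years, assuming 365 days per year. The function names are illustrative. The lost-content fit gives 11.5%, 18.8% and 26.1%, in line with the rounded figures of about 11%, 19% and 26% quoted above.

```python
def lost_percentage(age_in_days):
    """Content Lost Percentage = 0.02 * (Age in Days) + 4.20"""
    return 0.02 * age_in_days + 4.20

def archived_percentage(age_in_days):
    """Content Archived Percentage = 0.04 * (Age in Days) + 6.74"""
    return 0.04 * age_in_days + 6.74

if __name__ == "__main__":
    for years in (1, 2, 3):
        days = 365 * years  # assumption: leap days are ignored
        print(f"{years} year(s): lost {lost_percentage(days):.1f}%, "
              f"archived {archived_percentage(days):.1f}%")
```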
Twitter Content Generation [3]
• 50% of content on Twitter is generated by 0.05% of users (e.g. Lady Gaga, Ashton Kutcher, Oprah Winfrey)
• Content reaches the masses through an intermediate layer of opinion leaders (not celebrities)
[3] Wu, S., Hofman, J.M., Mason, W.A., Watts, D.J.: Who Says What to Whom on Twitter. In: Proceedings of the 20th International Conference on World Wide Web, WWW 2011, pp. 705–714 (2011)
Sources: http://tinyurl.com/hovtg77; http://tinyurl.com/jsq2qo6; http://tinyurl.com/hg4pj2g
Tweet Lifetimes [3]
• Media-generated content URIs (e.g. breaking news): short-lived
• Blog content URIs (e.g. cooking tips, parenting tips): live longer
• Music video URIs: longest-lived
• Examples pictured: Merkel visits CeBIT 2016; Cooking Tips; Music Videos
Sources: http://tinyurl.com/j3tewp9; http://tinyurl.com/gtxopdo; http://tinyurl.com/gvsfllv
Web Archives [4]
• Important to archive culturally significant resources
• Need to develop tools, models and techniques
• Research at L3S: ALEXANDRIA Project
• Searching: semantic-based, time-based, or both
• Searching along the time dimension: Temporal Information Retrieval
[4] 1st ALEXANDRIA Workshop (http://alexandria-project.eu/1st_alex_ws/)
Source: http://tinyurl.com/h7lygpc
Characteristics of Temporal Information [5]
• Clear Relationship between Events
  - Before: Attack on Charlie Hebdo (7 Jan 2015) came before the Paris Attacks (13 Nov 2015)
  - Overlap: European Migrant Crisis (Jan 2015-Today) overlaps with the Russian Intervention in Syria (Sep 2015-Today)
[5] Alonso, O.; Strötgen, J.; Baeza-Yates, R.; Gertz, M.: Temporal Information Retrieval: Challenges and Opportunities. Temporal Web Analytics Workshop (TWAW), WWW, Hyderabad, India, 2011
Sources: http://tinyurl.com/hzwjw5o; http://tinyurl.com/j2cr7ks; http://tinyurl.com/j2ffp8v; http://tinyurl.com/hew6huv
Characteristics of Temporal Information [5]
• Clear Relationship between Events
  - After: Iran and Saudi Arabia cut diplomatic ties (4 Jan 2016) after the execution of Shia cleric Sheikh al-Nimr (2 Jan 2016)
• Temporal information can be normalized
• A suitable granularity can be chosen (coarse or fine)
Sources: http://tinyurl.com/gvct8d6; http://tinyurl.com/z7thpmt
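The three relations (before, overlap, after), normalization, and choice of granularity can be illustrated with a small sketch. It assumes events are represented as closed (start, end) date intervals and that granularity is one of day, month or year; the function names are illustrative, and the end date chosen for "Today" (the date of this talk) is an assumption.

```python
from datetime import date

def relation(a, b):
    """Relate interval a to interval b; each is a (start, end) pair of dates."""
    if a[1] < b[0]:
        return "before"
    if b[1] < a[0]:
        return "after"
    return "overlap"

def to_granularity(day, granularity):
    """Normalize a date to a string at day, month or year granularity."""
    formats = {"day": "%Y-%m-%d", "month": "%Y-%m", "year": "%Y"}
    return day.strftime(formats[granularity])

TODAY = date(2016, 7, 12)  # assumption: "Today" is the date of the talk

charlie_hebdo = (date(2015, 1, 7), date(2015, 1, 7))
paris_attacks = (date(2015, 11, 13), date(2015, 11, 13))
migrant_crisis = (date(2015, 1, 1), TODAY)
syria_intervention = (date(2015, 9, 1), TODAY)
ties_cut = (date(2016, 1, 4), date(2016, 1, 4))
execution = (date(2016, 1, 2), date(2016, 1, 2))

if __name__ == "__main__":
    print(relation(charlie_hebdo, paris_attacks))
    print(relation(migrant_crisis, syria_intervention))
    print(relation(ties_cut, execution))
    print(to_granularity(paris_attacks[0], "month"))
```

Once every expression is normalized to the same representation, comparing events reduces to comparing interval endpoints, and a coarser granularity simply truncates the normalized value.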
Clustering & Exploring Search Results using Timelines [6]
• TCluster Algorithm
Fig. 3. Timeline cluster for the query [football world cup] [6]
Fig. 4. Timeline cluster for [avian flu] tweets [6]
[6] O. Alonso, M. Gertz, and R. Baeza-Yates. Clustering and Exploring Search Results Using Timeline Constructions. In Proceedings of the 18th ACM International Conference on Information and Knowledge Management (CIKM ’09), pages 97–106, 2009
Types of Temporal Information
• Explicit Temporal Information
  - December 25, 2015
• Implicit Temporal Information
  - New Year 2016
Sources: http://tinyurl.com/hzehprd; http://tinyurl.com/jnjyres
Types of Temporal Information
• Relative Information
  - Migrant Clashes: “Tear gas was fired at refugees at the Greece border yesterday”
  - UK’s Future in EU: “On Monday, voting was conducted to decide whether UK should remain part of the EU”
  - Greek Financial Crisis: “Over the past few years, pressure has been rising on Greece to pay off its EU debt”
Sources: http://tinyurl.com/z6tpgfm; http://tinyurl.com/goukwbu; http://tinyurl.com/jgh2clc
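Relative expressions such as "yesterday" or "on Monday" only become dates once a reference time is known, typically the document's creation date. The sketch below shows the idea for a handful of expressions. It is not how HeidelTime works: the function name, the supported vocabulary, and the rule that a bare weekday means the most recent such day strictly before the reference date are all assumptions.

```python
from datetime import date, timedelta

# Python's date.weekday() numbering: Monday is 0, Sunday is 6.
WEEKDAYS = {
    "monday": 0, "tuesday": 1, "wednesday": 2, "thursday": 3,
    "friday": 4, "saturday": 5, "sunday": 6,
}

def resolve_relative(expression, reference):
    """Resolve a relative expression against a reference date, or return None."""
    expr = expression.strip().lower()
    if expr == "today":
        return reference
    if expr == "yesterday":
        return reference - timedelta(days=1)
    if expr == "tomorrow":
        return reference + timedelta(days=1)
    if expr in WEEKDAYS:
        # Assumption: past-tense reporting, so pick the latest matching
        # weekday strictly before the reference date.
        delta = (reference.weekday() - WEEKDAYS[expr]) % 7
        if delta == 0:
            delta = 7
        return reference - timedelta(days=delta)
    return None  # vague expressions ("over the past few years") are not handled

if __name__ == "__main__":
    created = date(2016, 7, 12)  # hypothetical document creation date
    for expr in ("yesterday", "Monday", "over the past few years"):
        print(expr, "->", resolve_relative(expr, created))
```

Expressions like "over the past few years" have no single normalized value, which is why the sketch returns None for them rather than guessing.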
Temporal Tagging
• TempEval-2 Challenge: HeidelTime Temporal Tagger
Fig. 5. HeidelTime System Architecture [7]
[7] J. Strötgen and M. Gertz. HeidelTime: High Quality Rule-based Extraction and Normalization of Temporal Expressions. In Proceedings of the 5th International Workshop on Semantic Evaluation (SemEval ’10), pages 321–324, 2010
Temporal Tagger Application
• TimeTrails: HeidelTime used as Temporal Tagger
Fig. 6. TimeTrails System Architecture [8]
[8] J. Strötgen and M. Gertz. TimeTrails: A System for Exploring Spatio-Temporal Information in Documents. In Proceedings of the 36th International Conference on Very Large Data Bases (VLDB ’10), pages 1569–1572, 2010
Temporal Tagger Application
• Visualizes extracted information as Document Trajectories
• Intersection of Trajectories: documents (may) have the same Spatio-Temporal Scope
Fig. 7. TimeTrails: Multiple Document View and Intersection of Trajectories (James Joyce, Samuel Beckett) [8]
Sources: http://tinyurl.com/z2o99mf; http://tinyurl.com/z29xwx6
Further Applications and Challenges
• Enhancing the functionality of Temporal Information Retrieval applications
• Finding trending news on Twitter before it is published as an article
• Temporal Summaries for Search Results
• Performing
  - Temporal Clustering
  - Temporal Querying
  - Temporal Question-Answering
  - Temporal Similarity between Documents
• Web Archiving: predicting how often web content changes, for efficient web crawling
• Many open research challenges
• Huge scope for future development
References
[1] Salaheldeen, H.; Nelson, M. L.: Losing My Revolution: How Many Resources Shared on Social Media Have Been Lost? JCDL, Washington, USA, 2012
[2] Bar-Yossef, Z., Broder, A.Z., Kumar, R., Tomkins, A.: Sic Transit Gloria Telae: Towards an Understanding of the Web’s Decay. In: Proceedings of the 13th International Conference on World Wide Web, WWW 2004, pp. 328–337 (2004)
[3] Wu, S., Hofman, J.M., Mason, W.A., Watts, D.J.: Who Says What to Whom on Twitter. In: Proceedings of the 20th International Conference on World Wide Web, WWW 2011, pp. 705–714 (2011)
[4] 1st ALEXANDRIA Workshop (http://alexandria-project.eu/1st_alex_ws/)
[5] Alonso, O.; Strötgen, J.; Baeza-Yates, R.; Gertz, M.: Temporal Information Retrieval: Challenges and Opportunities. Temporal Web Analytics Workshop (TWAW), WWW, Hyderabad, India, 2011
[6] O. Alonso, M. Gertz, and R. Baeza-Yates. Clustering and Exploring Search Results Using Timeline Constructions. In Proceedings of the 18th ACM International Conference on Information and Knowledge Management (CIKM ’09), pages 97–106, 2009
[7] J. Strötgen and M. Gertz. HeidelTime: High Quality Rule-based Extraction and Normalization of Temporal Expressions. In Proceedings of the 5th International Workshop on Semantic Evaluation (SemEval ’10), pages 321–324, 2010
[8] J. Strötgen and M. Gertz. TimeTrails: A System for Exploring Spatio-Temporal Information in Documents. In Proceedings of the 36th International Conference on Very Large Data Bases (VLDB ’10), pages 1569–1572, 2010
[9] BBC News (bbc.com/news)
[10] CNBC: Major Global Events of 2015 (http://www.cnbc.com/2015/12/31/major-global-events-that-shook-2015.html)
Discussion