
Thread: Data Mining the London Gazette Website

  1. #1
    Join Date
    Jan 2008
    Location
    Delaware, USA
    Posts
    903
    Thanks
    167
    Thanked 14 Times in 10 Posts

    Has anyone tried to "mine" the London Gazette website, e.g. to build a single database of all RAF Officer / WO names, or a searchable citations index? How useful would it be if these were built?

  2. #2
    Join Date
    Nov 2007
    Location
    Brisbane
    Posts
    1,303
    Thanks
    18
    Thanked 3 Times in 3 Posts

    There was a London Gazette Finder a few years ago at http://www.rafaircraftaccidents.com/...nderindex.html but the site is no longer active.

    Steve
    41 (F) Squadron RAF at War and Peace, April 1916-March 1946
    http://brew.clients.ch/41sqnraf.htm

  3. #3
    Join Date
    Oct 2008
    Posts
    6,403
    Thanks
    0
    Thanked 33 Times in 32 Posts

    Jagan,

    Some time ago, our own Ross McNeill had a Gazette search engine. It worked well, but coverage was about 75% due to entries not being recognized. It was removed from the site quite a while back. Talk to Ross about it or, better still, Ross himself might reply and explain how it "worked".

    Col.
    Last edited by COL BRUGGY; 29th March 2018 at 18:31.

  4. #4
    Join Date
    Nov 2007
    Location
    Bewdley, UK
    Posts
    2,700
    Thanks
    0
    Thanked 1 Time in 1 Post

    Needed quite a bit of manipulation to get over the inherent flaws in the LG website.

    OCR on the LG was, and mostly still is, quite flaky, with spaces, full stops and non-printing characters peppering the page scans.

    This is why the LG search engine via their site is hit and miss.
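
    A minimal sketch of the kind of post-processing this implies, assuming Python and assuming the noise is limited to stray whitespace, spaces before punctuation and invisible characters. The helper names `normalise_ocr` and `match_key` are made up for illustration and are not taken from the Finder.

```python
import re
import unicodedata


def normalise_ocr(text: str) -> str:
    """Tidy a line of OCR text (illustrative helper, not from the Finder)."""
    # Fold compatibility forms, e.g. a non-breaking space becomes a plain space.
    text = unicodedata.normalize("NFKC", text)
    # Turn tabs, newlines and other whitespace into spaces so they
    # survive the next step as word separators.
    text = re.sub(r"\s", " ", text)
    # Drop non-printing characters: Unicode categories starting with "C"
    # cover control and format characters such as zero-width spaces.
    text = "".join(ch for ch in text if not unicodedata.category(ch).startswith("C"))
    # Collapse runs of spaces, then remove the stray space before punctuation.
    text = re.sub(r" +", " ", text)
    text = re.sub(r" ([.,;:])", r"\1", text)
    return text.strip()


def match_key(text: str) -> str:
    """Reduce a name to upper-case letters and digits for tolerant comparison."""
    return "".join(ch for ch in normalise_ocr(text) if ch.isalnum()).upper()


print(normalise_ocr("SMITH ,\tJ .\u00adA."))
print(match_key("Mc NEILL,\u200b R.") == match_key("McNEILL, R"))
```

    Comparing on `match_key` rather than the raw text means a stray space or full stop in the scan no longer stops two spellings of the same name from lining up.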

    As this is the primary detail you mine from the site, all you will be doing is importing the problems and having to address them by post-processing.

    I used the OCR Air Force Lists from the National Library of Scotland as a better source of names for yes/no validation of a Commission, mined the Personal Number, and then used both for a JSON search on the LG and TNA to compensate for scan errors.

    Not 100% success rate but better than LG engine alone.
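
    The yes/no validation step could look something like the following. This is only a sketch under assumptions: the reference list is taken to be a simple mapping of personal number to surname, the 0.8 similarity threshold is arbitrary, and `names_agree` and `validate` are hypothetical names. The sample data is invented.

```python
from difflib import SequenceMatcher


def names_agree(ocr_name: str, reference_name: str, threshold: float = 0.8) -> bool:
    """True when two surnames are similar enough to be the same person."""
    a = "".join(ch for ch in ocr_name if ch.isalnum()).upper()
    b = "".join(ch for ch in reference_name if ch.isalnum()).upper()
    return SequenceMatcher(None, a, b).ratio() >= threshold


def validate(ocr_name: str, personal_number: str, reference: dict[str, str]) -> bool:
    """Yes/no check of an OCR'd Gazette entry against a cleaner reference list.

    `reference` maps personal number -> surname, as might be mined from the
    Air Force Lists. That layout is an assumption made for this sketch.
    """
    known = reference.get(personal_number)
    return known is not None and names_agree(ocr_name, known)


# Invented sample data, not real service records.
reference = {"00001": "EXAMPLE", "00002": "SAMPLE"}
print(validate("EXAMP1E", "00001", reference))  # scan error, right number
print(validate("EXAMP1E", "00002", reference))  # same text, wrong number
```

    Using the number as the key and the name only as a fuzzy cross-check is what lets a misread letter through while still rejecting a genuine mismatch.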

    This also indirectly gave a sequence of LG dates, which enabled early missed promotions to be estimated and then found by searching for intake/gradation classmates.

    Citations etc were found using two methods.

    Search engine - least useful due to scan errors
    Manton etc references for issue and page - 99.9% accurate

    For issue/page search I created a new database of LG page URLs for the front page of each issue and then used the input page number to offset the URL to display the wanted item.
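
    A rough sketch of that lookup, assuming each front-page record stores the page number printed on the front page and a URL with a slot for the offset. The `IssueFront` record, the `{offset}` URL template and the example.org address are all hypothetical, as is the sample row.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IssueFront:
    """One row of the hypothetical front-page table."""
    issue: int         # Gazette issue number
    first_page: int    # page number printed on the issue's front page
    url_template: str  # URL with an {offset} placeholder (assumed format)


def page_url(fronts: dict[int, IssueFront], issue: int, page: int) -> str:
    """Turn an issue/page reference into a URL by offsetting from the front page."""
    front = fronts[issue]  # raises KeyError if the issue is not in the table
    offset = page - front.first_page
    if offset < 0:
        raise ValueError(f"page {page} is before the front page of issue {issue}")
    return front.url_template.format(offset=offset)


# Invented example row; example.org stands in for the real site.
fronts = {12345: IssueFront(12345, 100, "https://example.org/issue/12345/scan/{offset}")}
print(page_url(fronts, 12345, 103))
```

    Because the lookup never touches the OCR text, it only depends on the printed reference being right, which fits the accuracy quoted for issue/page references.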

    Additionally, the finder used the info for a TNA JSON search to tease the remaining details from Discovery, but this needed prior Gov copyright waivers for fair use. These are possible to get, but hoops had to be jumped through and limitations were put on what info was open to view outside EU data protection rules.

    All in all it needed all sources to compensate for errors and this was the biggest admin headache as TNA had regular purges and changes in their JSON API which meant periodic code rewrites to keep the finder finding.

    LG would also change page URLs as they updated scans, but they were also moving towards a JSON API format which could kill the finder at a stroke.

    The final straw was the inherent problem of online database portals being targeted by increasingly sophisticated injection attacks to gain backdoor access to the site via the server.

    Several injection attacks were being made by teeny script hackers, keen to establish their reputation by getting into a "Royal Air Force" website, when I closed off this portal.
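
    For anyone building a similar portal, the usual first defence against SQL injection is to bind user input as parameters rather than pasting it into the query text. The sketch below uses Python's built-in `sqlite3` purely as an illustration; it is not how the Finder was built, and the table, column names and rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE officer (personal_no TEXT, surname TEXT)")
# Invented rows for the demonstration.
conn.executemany(
    "INSERT INTO officer VALUES (?, ?)",
    [("00001", "EXAMPLE"), ("00002", "SAMPLE")],
)


def find_by_surname(conn: sqlite3.Connection, surname: str) -> list[tuple[str, str]]:
    """Look up officers by surname using a bound parameter.

    The user's text is passed separately from the SQL, so the driver treats
    it purely as data and never as part of the statement.
    """
    cur = conn.execute(
        "SELECT personal_no, surname FROM officer WHERE surname = ?",
        (surname,),
    )
    return cur.fetchall()


print(find_by_surname(conn, "EXAMPLE"))
# A classic injection string is just an unmatched surname here.
print(find_by_surname(conn, "x' OR '1'='1"))
```

    Bound parameters deal with this one class of attack only; a public portal would still need the usual server hardening on top.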

    Ross
    The Intellectual Property contained in this message has been assigned specifically to this web site.
    Copyright Ross McNeill 2015/2018 - All rights reserved.

  5. #5
    Join Date
    Jan 2008
    Location
    Delaware, USA
    Posts
    903
    Thanks
    167
    Thanked 14 Times in 10 Posts

    Ross. That sounded like a real nightmare programming project.

    Second your thought about the hacking headaches

  6. #6
    Join Date
    Nov 2007
    Location
    Bewdley, UK
    Posts
    2,700
    Thanks
    0
    Thanked 1 Time in 1 Post

    For those wishing to descend the rabbit hole after the White Rabbit then the Gazette Datasets are here
    https://www.thegazette.co.uk/data/longitudinal-datasets

    The SPARQL endpoints (pre 1997) that I was using in the final Finder program before I pulled it are
    https://www.thegazette.co.uk/longitu...dataset/sparql

    Flint is quite good for testing the endpoint query before committing to code it up
    https://www.thegazette.co.uk/flint
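
    Once a query works in Flint, sending it from code is mostly a matter of URL-encoding it. A minimal sketch, assuming the endpoint accepts the standard SPARQL 1.1 Protocol GET form; the endpoint below is a placeholder, and the query is deliberately generic because the dataset's vocabulary is not described in this thread.

```python
from urllib.parse import urlencode

# Deliberately generic query: it lists ten arbitrary triples and assumes
# nothing about the vocabulary the Gazette dataset actually uses.
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"


def sparql_get_url(endpoint: str, query: str) -> str:
    """Build a SPARQL 1.1 Protocol GET URL: the query goes in the `query` parameter."""
    return endpoint + "?" + urlencode({"query": query})


# Placeholder endpoint; substitute the real one from the Gazette data pages.
url = sparql_get_url("https://example.org/sparql", QUERY)
print(url)
```

    The resulting URL can then be fetched with `urllib.request`, typically with an `Accept: application/sparql-results+json` header so the results come back as JSON.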

    But for offline mining, the year indexes will give the OCR surnames along with the triples
    ftp://ftp.thegazette.co.uk/

    Click on, say, 1940 to download, then extract and wade through until you see datasets headed e.g. "<https://www.thegazette.co.uk/id/surname/".
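
    The wading can be scripted. The sketch below assumes the extracted files are plain text with IRIs in angle brackets, and that the surname is whatever follows the prefix quoted above up to the closing ">". Both are guesses about the file layout, `count_surnames` is a hypothetical helper, and the sample lines are invented.

```python
import re
from collections import Counter
from typing import Iterable
from urllib.parse import unquote

# The prefix comes from the post; treating the rest of the IRI as the
# surname is an assumption about the file layout.
SURNAME_IRI = re.compile(r"<https://www\.thegazette\.co\.uk/id/surname/([^>]+)>")


def count_surnames(lines: Iterable[str]) -> Counter:
    """Tally surname IRIs found in a stream of text lines."""
    counts: Counter = Counter()
    for line in lines:
        for raw in SURNAME_IRI.findall(line):
            # Undo percent-encoding and fold case before counting.
            counts[unquote(raw).upper()] += 1
    return counts


# Invented lines in the style of a triples file; the subject and
# predicate IRIs are placeholders.
sample = [
    "<https://example.org/notice/1> <https://example.org/p> <https://www.thegazette.co.uk/id/surname/Example> .",
    "<https://example.org/notice/2> <https://example.org/p> <https://www.thegazette.co.uk/id/surname/Example> .",
    "<https://example.org/notice/3> <https://example.org/p> <https://www.thegazette.co.uk/id/surname/O%27Sample> .",
]
print(count_surnames(sample).most_common())
```

    For a real download, pass an open file object instead of `sample`, so the file is read line by line rather than loaded into memory in one go.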

    For The National Archives the API is much better but they are still playing with it to decide what format to go with.

    Needs a request for access to
    http://www.nationalarchives.gov.uk/h...interface-api/

    Found them quite receptive to allowing specialist formatting of JSON queries to Discovery - think that this was the resource better suited to our needs in developing a better way than Discovery to unlock the records kept.

    Regards
    Last edited by Ross_McNeill; 29th March 2018 at 20:21.

  7. #7
    Join Date
    Jan 2008
    Location
    Delaware, USA
    Posts
    903
    Thanks
    167
    Thanked 14 Times in 10 Posts

    This SPARQL and Flint stuff is just above my head - I am an old-school VBScript / PHP / MySQL guy

    I have a different rabbit hole that I will venture in - started work on it

  8. #8
    Join Date
    Apr 2011
    Location
    Kent
    Posts
    126
    Thanks
    0
    Thanked 1 Time in 1 Post

    Blimey, I thought I was on a technical forum for a while. Most of that was so far over my head it was leaving contrails! I haven't touched SQL since Office 97 and then I got Access to write it and I would tweak it.

    Anyway, am I to understand that the LG search is unreliable? If so that would explain why I got so few hits on Notices to Airmen or Seamen, despite finding a few not in the search by browsing page by page.

  9. #9
    Join Date
    Nov 2007
    Location
    Reading, Berkshire, UK
    Posts
    3,528
    Thanks
    3
    Thanked 11 Times in 11 Posts

    I can get three different lists of 'hits' from the LG for precisely the same input parameters. And that's when it decides it's going to talk to me at all! Not wildly different lists - but different.
    All is clearly not well at the on-line LG (as Ross has enumerated). It would be nice if they upgraded their systems, and OCR/scanning. But that will require a great deal of money and expertise. In the current UK public service financial climate that would seem a bit unlikely. Hen's teeth do not, I'm assured, grow on trees!!
    We'll just have to soldier on!
    HTH
    Peter Davies
    Meteorology is a science; good meteorology is an art!
    We might not know - but we might know who does!

  10. #10
    Join Date
    Nov 2007
    Posts
    944
    Thanks
    0
    Thanked 1 Time in 1 Post

    I understand that Jagan is not after another LG search tool but rather wants to import the data in a single process and then convert it into a database. Hopefully, service numbers should be less prone to errors than names, and following conversion any errors and gaps should be plainly visible. If it is made interactive, users will be able to correct errors and fill gaps. As Peter notes, I would not expect any new OCR of the files, so improvements to the database or search tool alone will not solve the problem.
    Of course, such a database could then be amended with data from other databases, like BMD, etc.
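
    One possible shape for such a database, sketched with Python's built-in `sqlite3`. The table layout, column names and sample rows are all assumptions made for illustration. The idea is to keep the imported OCR text untouched, store user corrections beside it, and make entries with a missing service number easy to list.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE entry (
        id          INTEGER PRIMARY KEY,
        issue       INTEGER NOT NULL,
        page        INTEGER NOT NULL,
        service_no  TEXT,            -- NULL when the import could not recover it
        surname_ocr TEXT NOT NULL,   -- text exactly as imported, never overwritten
        surname_fix TEXT             -- user-supplied correction, if any
    )
    """
)
# Invented rows, not real Gazette entries.
conn.executemany(
    "INSERT INTO entry (issue, page, service_no, surname_ocr) VALUES (?, ?, ?, ?)",
    [(12345, 100, "00001", "EXAMPLE"), (12345, 101, None, "SAMP1E")],
)


def gaps(conn: sqlite3.Connection) -> list[tuple[int, str]]:
    """Entries still missing a service number, so the holes are plain to see."""
    return conn.execute(
        "SELECT id, COALESCE(surname_fix, surname_ocr) FROM entry "
        "WHERE service_no IS NULL ORDER BY id"
    ).fetchall()


def correct(conn: sqlite3.Connection, entry_id: int, service_no: str, surname: str) -> None:
    """Record a user's correction alongside the original OCR text."""
    conn.execute(
        "UPDATE entry SET service_no = ?, surname_fix = ? WHERE id = ?",
        (service_no, surname, entry_id),
    )


print(gaps(conn))
correct(conn, 2, "00002", "SAMPLE")
print(gaps(conn))
```

    Keeping `surname_ocr` separate from `surname_fix` means a bad correction can always be checked against what the scan actually said.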
    https://www.facebook.com/Franciszek-Grabowski-241360809684411/
