Complete Refresh is Evil! Make your SIS integration go at lightspeed

Version 4

    This document is taken from a presentation Mark O'Neil delivered at the 2017 Teaching & Learning Conference in Milan, Italy. You can download the presentation and script files at the end of this document.

    Complete Refresh vs. Store/Delete

    Complete Refresh
    • You pass all your data – every single record – into a black box.
    • Learn chews on data and takes actions based on data differences.

     

    Complete Refresh takes a file and manages deletes based on the differences between the data provided and what is stored in Learn. This has substantial overhead: every record must be compared to determine whether it differs from what is stored (update), is new (store), or exists in Learn but not in the data file (delete).
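    To make those three determinations concrete, here is a rough local illustration (my sketch, not Learn's actual implementation) using comm over two sorted feed files, named as in the worked example later in this document:

    # Illustration only -- not what Learn runs internally.
    # comm requires sorted input; -1/-2/-3 suppress its output columns.
    $ comm -23 sortedOld.txt sortedNew.txt   # only in Old: candidates to delete
    $ comm -13 sortedOld.txt sortedNew.txt   # only in New: candidates to store
    # A changed record appears in BOTH outputs (old version vs. new version),
    # so matching on EXTERNAL_PERSON_KEY is what separates updates from deletes.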

     

    Complete Refresh is often easier on the SIS side: you pull all the relevant data and create the feed file.

    Explicit Store and Delete

    • You pass in only the data to act on, decreasing data scale.
    • You pass in explicit data sets, and Learn only adds/updates the records in them.

     

    Using Store or Delete gives you explicit control over granular sets of data. This is substantially faster as only the data present in the file is acted upon.

    Store and Delete files are more difficult to generate: the SIS must compare successive data pulls and produce feed files containing only the differences between them. However, the efficiency gained in processing usually makes Store and Delete the more worthwhile method, as the worked example below shows.

     

    Example

    Given:

    Old.txt: A Complete Refresh feed from yesterday

    EXTERNAL_PERSON_KEY|USER_ID|SYSTEM_ROLE|FIRSTNAME|LASTNAME|EMAIL|INSTITUTION_ROLE

    2006|ddishez001|none|Dan|Dishez|ddishez@sununiversity.edu|Student

    2000|jsmith001|none|John|Smith1|jsmith@sununiversity.edu|Student

    2007|rrondelle001|none|Ronda|Rondelle|rrondella@sununiversity.edu|Student

    2002|ttsai001|none|Tevis|Tsai|ttsai@sununiversity.edu|Student

    2003|lgonzales001|none|Lois|Gonzales|lgonzales@sununiversity.edu|Student

    2004|mmacneil001|none|Megan|MacNeil|mmacneil@sununiversity.edu|Student

    2001|kthomas001|none|Kyle|Thomas2|kthomas@sununiversity.edu|Student

    2005|mmacneil002|none|Mark|MacNeil|mmacneil1@sununiversity.edu|Student

     

     

    And New.txt: A Complete Refresh feed from today

    EXTERNAL_PERSON_KEY|USER_ID|SYSTEM_ROLE|FIRSTNAME|LASTNAME|EMAIL|INSTITUTION_ROLE

    2006|ddishez001|none|Daniel|Dishez|ddishez@sununiversity.edu|Student

    2005|mmacneil002|none|Mark|MacNeil|mmacneil1@sununiversity.edu|Student

    2007|rrondelle001|none|Ronda|Rondelle|rrondella@sununiversity.edu|Student

    2003|lgonzales001|none|Lois|Gonzales|lgonzales@sununiversity.edu|Student

    2004|mmacneil001|none|Megan|MacNeil|mmacneil@sununiversity.edu|Student

    2000|jsmith001|none|John|Smith1|jsmith@sununiversity.edu|Student

    2020|bbaggins|none|bilbo|baggins|bbaggins@sununiversity.edu|Student

     

    Process

    1. We need to compare the new and old Complete Refresh feeds to find the differences
      • sort the files and diff them
    2. We want our DISABLE document to contain only records that are not in the new feed
    3. We want our STORE document to contain only records that are CHANGED or ADDED in the new feed
      • CHANGED means the record exists in both old and new with different values, so it is a special case
        • changed records land in the DISABLE document after the initial diff and must be removed from it
      • ADDED means the record exists only in the new feed
        • these are discovered via a second diff between old and new

    This may look like a lot of costly steps against a large set of refresh data, but I guarantee that it, plus the Snapshot DELETE/STORE processing, will be significantly faster than a Complete Refresh.

     

    Create Sorted files:

    • $ sort New.txt > sortedNew.txt
    • $ sort Old.txt > sortedOld.txt
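    If your feeds can contain non-ASCII names, it is safer to pin the collation so that sort and the later diff agree (my suggestion, not part of the original presentation):

    • $ LC_ALL=C sort New.txt > sortedNew.txt
    • $ LC_ALL=C sort Old.txt > sortedOld.txt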

     

    Create DISABLE.txt file and write the proper header:

    • $ echo 'EXTERNAL_PERSON_KEY|…|LASTNAME|EMAIL|INSTITUTION_ROLE' > DISABLE.txt

     

    Generate the DISABLEd users (those not in the new file):

    • $ diff sortedNew.txt sortedOld.txt | grep '> ' | sed 's/> //g' >> DISABLE.txt
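    If your platform has comm, an equivalent single command (again my suggestion, not the presentation's) yields the same lines, those unique to the old file:

    • $ comm -13 sortedNew.txt sortedOld.txt >> DISABLE.txt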

     

    Remove the updated records from the DISABLE file:

    • $ ./sedRemoveUpdates.sh

     

    DISABLE.txt contents before sedRemoveUpdates:

    EXTERNAL_PERSON_KEY|USER_ID|SYSTEM_ROLE|FIRSTNAME|LASTNAME|EMAIL|INSTITUTION_ROLE

    2002|ttsai001|none|Tevis|Tsai|ttsai@sununiversity.edu|Student

    2001|kthomas001|none|Kyle|Thomas2|kthomas@sununiversity.edu|Student

    2006|ddishez001|none|Daniel|Dishez|ddishez@sununiversity.edu|Student

     

    DISABLE.txt contents after sedRemoveUpdates:

    EXTERNAL_PERSON_KEY|USER_ID|SYSTEM_ROLE|FIRSTNAME|LASTNAME|EMAIL|INSTITUTION_ROLE

    2002|ttsai001|none|Tevis|Tsai|ttsai@sununiversity.edu|Student

    2001|kthomas001|none|Kyle|Thomas2|kthomas@sununiversity.edu|Student

     

    Running sedRemoveUpdates removes the updates (which were matched in the diff), where sedRemoveUpdates.sh is:

     

    #!/bin/bash

    # Remove from DISABLE.txt any record whose EXTERNAL_PERSON_KEY still
    # appears in sortedNew.txt -- those are updates, not deletes.
    while read -r line; do
        matchMe=$(cut -d'|' -f1 <<< "$line")
        # Skip the header row; its key would otherwise match sortedNew.txt's header
        [ "$matchMe" = "EXTERNAL_PERSON_KEY" ] && continue
        # Anchor the match to the key field so e.g. 200 cannot match 2006
        if grep -q "^$matchMe|" sortedNew.txt; then
            # sed -i.bak rewrites DISABLE.txt (keeping a .bak backup); the
            # loop keeps reading the original file via its open handle
            sed -i.bak "/^$matchMe|/d" ./DISABLE.txt
        fi
    done < DISABLE.txt

    exit 0

     

    Now DISABLE.txt contains only those EXTERNAL_PERSON_KEYs that are not in the sortedNew.txt file.
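    If the read/sed loop is slow on very large files, a single-pass awk equivalent (my addition, not from the presentation) performs the same filtering:

    # Pass 1 (sortedNew.txt): remember every key.
    # Pass 2 (DISABLE.txt): keep the header plus any key not seen in pass 1.
    $ awk -F'|' 'NR==FNR { seen[$1]; next } FNR==1 || !($1 in seen)' \
          sortedNew.txt DISABLE.txt > DISABLE.tmp && mv DISABLE.tmp DISABLE.txt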

     

    We can then use a similar process, reversing the direction of the diff, to find the added and updated users and build a smaller STORE file that reflects only the additions and updated lines in sortedNew.txt. We do this by copying only new and updated entries from a slightly altered sortedNew file into a STORE.txt file.

     

    Copy sortedNew to storeNew:

    $ cp sortedNew.txt storeNew.txt

     

    Create STORE.txt file and write the proper header:

    $ echo 'EXTERNAL_PERSON_KEY|…|FIRSTNAME|LASTNAME|EMAIL|INSTITUTION_ROLE' > STORE.txt

     

    Add the disabled records back into the new data by appending DISABLE.txt to storeNew.txt and re-sorting, so that records present only in the old file do not show up in the next diff (skipping DISABLE.txt's header line so it does not leak into STORE.txt):

    $ tail -n +2 DISABLE.txt >> storeNew.txt

    $ sort storeNew.txt > sortedStoreNew.txt

     

    Diff the augmented new data against the old file and append the lines unique to the new side to STORE.txt:

    $ diff sortedStoreNew.txt sortedOld.txt | grep '< ' | sed 's/< //g' >> STORE.txt

     

    STORE.txt:

    EXTERNAL_PERSON_KEY|USER_ID|SYSTEM_ROLE|FIRSTNAME|LASTNAME|EMAIL|INSTITUTION_ROLE

    2006|ddishez001|none|Daniel|Dishez|ddishez@sununiversity.edu|Student

    2020|bbaggins|none|bilbo|baggins|bbaggins@sununiversity.edu|Student

     

    DISABLE.txt:

    EXTERNAL_PERSON_KEY|USER_ID|SYSTEM_ROLE|FIRSTNAME|LASTNAME|EMAIL|INSTITUTION_ROLE

    2002|ttsai001|none|Tevis|Tsai|ttsai@sununiversity.edu|Student

    2001|kthomas001|none|Kyle|Thomas2|kthomas@sununiversity.edu|Student
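    Before archiving, a quick sanity check is worthwhile (my addition, assuming the file names used above): every line unique to the new feed should have ended up in STORE.txt, so the following should print nothing:

    # comm -23 lists lines unique to sortedNew.txt (the added/updated records);
    # they should exactly match the data rows of STORE.txt.
    $ comm -23 sortedNew.txt sortedOld.txt | diff - <(tail -n +2 STORE.txt | sort)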

     

    Archive Old.txt and save New.txt as Old.txt for the next update.

     

    Additional thoughts

    • Faster by orders of magnitude
    • Easy to insert into Snapshot SIS Framework flow
    • Less prone to “where’d my data go?” scenarios
    • Timestamp and archive your files, and prep for the next iteration, for example:
      • archive_dir="./archives"
      • now=$(date +%Y-%m-%d.%H:%M:%S)
      • mv Old.txt "$archive_dir/archive.$now"
      • mv New.txt Old.txt
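
    Putting it all together, here is a condensed sketch of the whole pipeline (my consolidation of the steps above, using the same file names; the downloadable scripts below are the authoritative versions):

    #!/bin/bash
    # makeDeltas.sh -- consolidates the steps described in this document.
    # Assumes Old.txt (yesterday's feed) and New.txt (today's feed) exist
    # and share the user header used in the examples above.
    set -e

    header='EXTERNAL_PERSON_KEY|USER_ID|SYSTEM_ROLE|FIRSTNAME|LASTNAME|EMAIL|INSTITUTION_ROLE'

    # 1. Sort both feeds so diff can compare them line by line
    sort New.txt > sortedNew.txt
    sort Old.txt > sortedOld.txt

    # 2. DISABLE.txt: records in the old feed but not the new one
    echo "$header" > DISABLE.txt
    diff sortedNew.txt sortedOld.txt | grep '> ' | sed 's/> //g' >> DISABLE.txt

    # 3. Drop records that were merely updated (key still present in New)
    while read -r line; do
        key=$(cut -d'|' -f1 <<< "$line")
        [ "$key" = "EXTERNAL_PERSON_KEY" ] && continue
        if grep -q "^$key|" sortedNew.txt; then
            sed -i.bak "/^$key|/d" ./DISABLE.txt
        fi
    done < DISABLE.txt

    # 4. STORE.txt: records added or changed in the new feed
    echo "$header" > STORE.txt
    cp sortedNew.txt storeNew.txt
    tail -n +2 DISABLE.txt >> storeNew.txt   # skip DISABLE's header line
    sort storeNew.txt > sortedStoreNew.txt
    diff sortedStoreNew.txt sortedOld.txt | grep '< ' | sed 's/< //g' >> STORE.txt

    # 5. Archive and roll the feeds for the next run
    archive_dir="./archives"
    mkdir -p "$archive_dir"
    now=$(date +%Y-%m-%d.%H:%M:%S)
    mv Old.txt "$archive_dir/archive.$now"
    mv New.txt Old.txt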

     

    Downloads

    • Scripts required to provide SIS deltas
    • 'Complete Refresh is Evil!' Presentation