Complete Refresh is Evil! Make your SIS integration go at lightspeed


This document is taken from a presentation Mark O'Neil delivered at the 2017 Teaching & Learning Conference in Milan, Italy. You can download the presentation and script files at the end of this document.

Complete Refresh vs. Store/Delete

Complete Refresh
  • You pass all your data – every single record – into a black box.
  • Learn chews on data and takes actions based on data differences.

 

Complete Refresh takes a file and manages deletes from a set of data based on differences between the data provided and what is stored in Learn. This has substantial overhead: every record must be compared to determine whether it has changed (update), whether it is new (store), or whether it exists in Learn but not in the data file (delete).
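
To make that overhead concrete, here is a rough sketch of the three-way classification a Complete Refresh implies, using comm on two hypothetical key extracts (feed.txt for the incoming feed, learn.txt for what Learn already stores). This illustrates the idea only; it is not how Learn implements it internally:

  • $ cut -d'|' -f1 feed.txt | sort > feedKeys.txt
  • $ cut -d'|' -f1 learn.txt | sort > learnKeys.txt
  • $ comm -23 feedKeys.txt learnKeys.txt    # keys only in the feed -> store (new)
  • $ comm -13 feedKeys.txt learnKeys.txt    # keys only in Learn -> delete
  • $ comm -12 feedKeys.txt learnKeys.txt    # keys in both -> compare full records for updates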

 

Complete Refresh is often easier from the SIS-extraction standpoint, as you simply pull all the relevant data and create the feed file.

Explicit Store and Delete

  • You pass in only the data to act on – decreasing data scale
  • You pass in explicit data sets – and Learn adds, updates, or deletes only those records

 

Using Store or Delete gives you explicit control over granular sets of data. This is substantially faster as only the data present in the file is acted upon.

Store and Delete files are more difficult to generate, as the SIS needs to process the data pulls and create feed files containing only the differences between pulls. However, the efficiency gained in processing often makes Store and Delete the more worthwhile method.
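
For example, a STORE feed that updates a single user's first name (using a record from the example below) need contain nothing but the header and that one record:

EXTERNAL_PERSON_KEY|USER_ID|SYSTEM_ROLE|FIRSTNAME|LASTNAME|EMAIL|INSTITUTION_ROLE
2006|ddishez001|none|Daniel|Dishez|ddishez@sununiversity.edu|Student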

 

Example

Given:

Old.txt: A Complete Refresh feed from yesterday

EXTERNAL_PERSON_KEY|USER_ID|SYSTEM_ROLE|FIRSTNAME|LASTNAME|EMAIL|INSTITUTION_ROLE
2006|ddishez001|none|Dan|Dishez|ddishez@sununiversity.edu|Student
2000|jsmith001|none|John|Smith1|jsmith@sununiversity.edu|Student
2007|rrondelle001|none|Ronda|Rondelle|rrondella@sununiversity.edu|Student
2002|ttsai001|none|Tevis|Tsai|ttsai@sununiversity.edu|Student
2003|lgonzales001|none|Lois|Gonzales|lgonzales@sununiversity.edu|Student
2004|mmacneil001|none|Megan|MacNeil|mmacneil@sununiversity.edu|Student
2001|kthomas001|none|Kyle|Thomas2|kthomas@sununiversity.edu|Student
2005|mmacneil002|none|Mark|MacNeil|mmacneil1@sununiversity.edu|Student

 

 

And New.txt: A Complete Refresh feed from today

EXTERNAL_PERSON_KEY|USER_ID|SYSTEM_ROLE|FIRSTNAME|LASTNAME|EMAIL|INSTITUTION_ROLE
2006|ddishez001|none|Daniel|Dishez|ddishez@sununiversity.edu|Student
2005|mmacneil002|none|Mark|MacNeil|mmacneil1@sununiversity.edu|Student
2007|rrondelle001|none|Ronda|Rondelle|rrondella@sununiversity.edu|Student
2003|lgonzales001|none|Lois|Gonzales|lgonzales@sununiversity.edu|Student
2004|mmacneil001|none|Megan|MacNeil|mmacneil@sununiversity.edu|Student
2000|jsmith001|none|John|Smith1|jsmith@sununiversity.edu|Student
2020|bbaggins|none|bilbo|baggins|bbaggins@sununiversity.edu|Student

 

Process

  1. We need to compare the new and old Complete Refresh feeds to find the differences
    • sort the files and diff them
  2. We want our DISABLE document to contain only records that are not in the new feed
  3. We want our STORE document to contain only records that are CHANGED or ADDED in the new feed
    • CHANGED means the record appears in both old and new (with different field values), so it is a special case
      • These records land in the DISABLE document after the initial diff and must be removed from it
    • ADDED means the record is only in the new feed
      • This is discovered via a second diff between old and new

This may look like a lot of costly steps against a large set of refresh data, but I guarantee that these steps plus the Snapshot DELETE/STORE processing will be significantly faster than a Complete Refresh.

 

Create Sorted files:

  • $ sort New.txt > sortedNew.txt
  • $ sort Old.txt > sortedOld.txt
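
One caution, an addition to the original steps: if the feeds can contain mixed-case or non-ASCII values, pin the collation so that sort and diff behave identically from run to run:

  • $ LC_ALL=C sort New.txt > sortedNew.txt
  • $ LC_ALL=C sort Old.txt > sortedOld.txt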

 

Create DISABLE.txt file and write the proper header:

  • $echo 'EXTERNAL_PERSON_KEY|…|LASTNAME|EMAIL|INSTITUTION_ROLE' > DISABLE.txt

 

Generate the DISABLEd users (those not in the new file):

  • $ diff sortedNew.txt sortedOld.txt | grep '> ' | sed 's/> //g' >> DISABLE.txt
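
For reference, in the raw diff output '<' marks lines unique to the first file (sortedNew.txt) and '>' marks lines unique to the second (sortedOld.txt); the grep keeps only the '>' lines and the sed strips the marker. Abbreviated output for the example feeds:

< 2020|bbaggins|none|bilbo|baggins|bbaggins@sununiversity.edu|Student
> 2002|ttsai001|none|Tevis|Tsai|ttsai@sununiversity.edu|Student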

 

Remove the updated records from the DISABLE file:

  • $ ./sedRemoveUpdates.sh

 

DISABLE.txt contents before sedRemoveUpdates:

EXTERNAL_PERSON_KEY|USER_ID|SYSTEM_ROLE|FIRSTNAME|LASTNAME|EMAIL|INSTITUTION_ROLE
2002|ttsai001|none|Tevis|Tsai|ttsai@sununiversity.edu|Student
2001|kthomas001|none|Kyle|Thomas2|kthomas@sununiversity.edu|Student
2006|ddishez001|none|Dan|Dishez|ddishez@sununiversity.edu|Student

 

DISABLE.txt contents after sedRemoveUpdates:

EXTERNAL_PERSON_KEY|USER_ID|SYSTEM_ROLE|FIRSTNAME|LASTNAME|EMAIL|INSTITUTION_ROLE
2002|ttsai001|none|Tevis|Tsai|ttsai@sununiversity.edu|Student
2001|kthomas001|none|Kyle|Thomas2|kthomas@sununiversity.edu|Student

 

Running sedRemoveUpdates removes the updated records (those whose EXTERNAL_PERSON_KEY still appears in the new feed). sedRemoveUpdates.sh is:

 

#!/bin/bash
# For each record in DISABLE.txt, check whether its EXTERNAL_PERSON_KEY is
# still present in sortedNew.txt. If it is, the record is an update rather
# than a delete, so remove it from DISABLE.txt.
# (The loop reads the original file through the open redirection, so the
# in-place edits sed makes below do not disturb the read.)
while read -r line; do
    matchMe=$(cut -d'|' -f1 <<< "$line")
    echo "$matchMe"
    # Never match the header row against the new feed
    if [ "$matchMe" = "EXTERNAL_PERSON_KEY" ]; then
        continue
    fi
    searchResult=$(grep "^$matchMe|" sortedNew.txt)
    if [ -n "$searchResult" ]; then
        # Key is still in the new feed: drop this record from DISABLE.txt
        sed -i.bak "/^$matchMe|/d" ./DISABLE.txt
    fi
done < DISABLE.txt
exit 0

 

Now DISABLE.txt contains only those EXTERNAL_PERSON_KEY records that are not in the sortedNew.txt file.
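
As an aside, the same pruning can be done in a single pass with awk instead of the per-record sed loop above (an alternative technique, not part of the original presentation):

$ awk -F'|' 'NR==FNR { seen[$1] = 1; next } FNR==1 || !($1 in seen)' \
      sortedNew.txt DISABLE.txt > DISABLE.tmp && mv DISABLE.tmp DISABLE.txt

The first pass records every key in the new feed; the second pass prints only the DISABLE.txt header (FNR==1) and the rows whose key was never seen.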

 

We can then use a similar process with the diff reversed to find added and updated users, building a smaller STORE file that reflects only the additions and changed lines in sortedNew.txt. We do this by copying the new and updated entries from a slightly altered sortedNew file into a STORE.txt file.

 

Copy sortedNew to storeNew:

$cp sortedNew.txt storeNew.txt

 

Create STORE.txt file and write the proper header:

$echo 'EXTERNAL_PERSON_KEY|…|FIRSTNAME|LASTNAME|EMAIL|INSTITUTION_ROLE' > STORE.txt

 

Add the records that exist only in the old feed back into the new data by appending the DISABLE.txt records (skipping its header line, so it does not pollute the diff) to storeNew.txt and re-sorting:

$ tail -n +2 DISABLE.txt >> storeNew.txt

$ sort storeNew.txt > sortedStoreNew.txt

 

Diff the combined file against the old feed and append the lines unique to the new data to STORE.txt. The DISABLE records we appended now match their counterparts in sortedOld exactly, so they cancel out of the diff, leaving only added and changed records:

$ diff sortedStoreNew.txt sortedOld.txt | grep '< ' | sed 's/< //g' >> STORE.txt

 

STORE.txt:

EXTERNAL_PERSON_KEY|USER_ID|SYSTEM_ROLE|FIRSTNAME|LASTNAME|EMAIL|INSTITUTION_ROLE
2006|ddishez001|none|Daniel|Dishez|ddishez@sununiversity.edu|Student
2020|bbaggins|none|bilbo|baggins|bbaggins@sununiversity.edu|Student

 

DISABLE.txt:

EXTERNAL_PERSON_KEY|USER_ID|SYSTEM_ROLE|FIRSTNAME|LASTNAME|EMAIL|INSTITUTION_ROLE
2002|ttsai001|none|Tevis|Tsai|ttsai@sununiversity.edu|Student
2001|kthomas001|none|Kyle|Thomas2|kthomas@sununiversity.edu|Student

 

Archive Old.txt and save New.txt as Old.txt for the next update.

 

Additional thoughts

  • Faster by orders of magnitude
  • Easy to insert into Snapshot SIS Framework flow
  • Less prone to “where’d my data go?” scenarios
  • Timestamp and archive your files, and prep for the next iteration:

archive_dir="./archives"
now=$(date +%Y-%m-%d.%H:%M:%S)
mv Old.txt "$archive_dir/archive.$now"
mv New.txt Old.txt
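
Putting it all together, here is a consolidated sketch of the whole delta pipeline as a single script, based on the steps above. The header line is the one used by the example feeds in this document; adjust it to your own columns:

#!/bin/bash
# Consolidated sketch of the delta pipeline described in this document.
# Assumes New.txt (today's feed) and Old.txt (yesterday's feed) exist and
# share the header below.
HEADER='EXTERNAL_PERSON_KEY|USER_ID|SYSTEM_ROLE|FIRSTNAME|LASTNAME|EMAIL|INSTITUTION_ROLE'

sort New.txt > sortedNew.txt
sort Old.txt > sortedOld.txt

# Records present only in the old feed are candidates for disabling
echo "$HEADER" > DISABLE.txt
diff sortedNew.txt sortedOld.txt | grep '> ' | sed 's/> //g' >> DISABLE.txt
./sedRemoveUpdates.sh   # prune updated records, leaving true deletes

# Records added or changed in the new feed go to STORE.txt
echo "$HEADER" > STORE.txt
cp sortedNew.txt storeNew.txt
tail -n +2 DISABLE.txt >> storeNew.txt   # skip DISABLE.txt's header line
sort storeNew.txt > sortedStoreNew.txt
diff sortedStoreNew.txt sortedOld.txt | grep '< ' | sed 's/< //g' >> STORE.txt

# Archive and prep for the next run
archive_dir="./archives"
now=$(date +%Y-%m-%d.%H:%M:%S)
mkdir -p "$archive_dir"
mv Old.txt "$archive_dir/archive.$now"
mv New.txt Old.txt
exit 0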

 

Downloads

  • Scripts required to provide SIS deltas
  • 'Complete Refresh is Evil!' Presentation