String Patterns: Searching for Interesting Words and Numbers


  • 8/3/2019 String Patterns: Searching for Interesting Words and Numbers


    String Patterns: Searching for Interesting Words and Numbers

    Roger Bilisoly, PhD

    Associate Professor of Statistics

    Central Connecticut State University

    Department of Mathematics, Amherst College, Amherst, Massachusetts

    Thursday, October 6, 2011


    Overview of Talk

    String Patterns and Examples

    Unusual words, squares, and primes

    Anagrams of Words and Numbers, including square anagrams

    Birthday Problem and Pangrams

    Analyzing Dickens's A Christmas Carol


    1. String Patterns

    Regular expressions (also called regexes) are used to find string patterns. A variety of software packages implement them, e.g., Mathematica, Perl, SAS, Emacs, and so forth.

    We'll use them to find interesting words (from wordlists available on the web) and interesting numbers (e.g., squares with unusual digit patterns).


    Perl Regexes

    For a wordlist with one word per line:

    /cat/ matches cat, cats, and scatter, but NOT Cat.

    /[cC]at/ matches cat, Cat, and Catcher.

    /cat/i matches cat, CaT, and sCaTtEr; the i stands for case insensitive.

    /cat|dog/ matches any word containing either cat or dog.

    We'll see examples of more complex string patterns.

    See Chapter 2 of Bilisoly (2008b).
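The four patterns above can be tried out directly. The following is a minimal sketch in Python rather than Perl (an assumption made only so the example is easy to run); the mini wordlist `words` and the helper `matching` are hypothetical names, not part of the talk.

```python
import re

# Hypothetical mini wordlist standing in for a one-word-per-line file.
words = ["cat", "cats", "scatter", "Cat", "Catcher", "sCaTtEr", "dog", "bird"]

def matching(pattern, flags=0):
    """Return the words that contain a match for the pattern."""
    regex = re.compile(pattern, flags)
    return [w for w in words if regex.search(w)]

print(matching(r"cat"))                 # case-sensitive substring match
print(matching(r"[cC]at"))              # either case for the first letter
print(matching(r"cat", re.IGNORECASE))  # plays the role of Perl's /i flag
print(matching(r"cat|dog"))             # alternation
```

Here `regex.search` looks for the pattern anywhere in the word, which mirrors how an unanchored Perl match behaves.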


    Example: Some Unusual Words

    Are there other words like bookkeeper with three double letters in a row?

    Essentially no: {bookkeeper, bookkeepers, bookkeeping, bookkeepings}

    Are there words containing mile in their interior? {besmiled, besmiles, camomiles, facsimiles, homiletic, outsmiled, outsmiles, similes, smiled, smiler, smilers, smiles}

    wordlist = Import["c:\\CROSSWD.TXT", "Lines"];

    (* words with three consecutive doubled letters *)
    threepair = Pick[wordlist, StringMatchQ[wordlist,
      RegularExpression[".*(.)\\1(.)\\2(.)\\3.*"]]]

    (* words with at least one letter before and after "mile" *)
    milewords = Pick[wordlist, StringMatchQ[wordlist,
      RegularExpression[".+mile.+"]]]
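For readers without Mathematica, the two searches can be sketched in Python (a stand-in language chosen for this illustration). The sample list and the function names are hypothetical; a real run would read the wordlist file instead.

```python
import re

# Three consecutive doubled letters anywhere in the word.
THREE_PAIRS = re.compile(r"(.)\1(.)\2(.)\3")
# "mile" with at least one character before and after it.
MILE_INSIDE = re.compile(r".+mile.+")

def three_double_letters(words):
    return [w for w in words if THREE_PAIRS.search(w)]

def mile_in_interior(words):
    # fullmatch plays the role of StringMatchQ, which tests the whole string.
    return [w for w in words if MILE_INSIDE.fullmatch(w)]

sample = ["bookkeeper", "balloon", "committee", "smiles", "mile", "miles", "simile"]
print(three_double_letters(sample))
print(mile_in_interior(sample))
```

Note that the whole-string match is what excludes mile, miles, and simile: each lacks a letter on one side of mile.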


    Word Graphs

    [Figure: word graphs for before, state, within, prepared, eagle, concern, decency, rather, and prospers.]

    Each node is a distinct letter, and each edge connects letters that are adjacent in the word. The graphs are directed; the arrows are understood.

    From Section 24 of Eckler (1996).


    Searching for word graphs

    Linear words have no branches, e.g., A-M-H-E-R-S-T.
      o What is the longest such word?
      o Answer: ambidextrously has 14 letters.
      o lycanthropies, metalworkings, multibranched, and unpredictably are the only examples with 13 letters.

    How many square cyclic words are there (like EAGLE)?
      o A word must match the regular expression /(.)...\1/ and have 4 distinct letters. The latter can be checked by taking an intersection.
      o There are 417 such words, including dazed (and other 4-letter weak verbs starting with d, in the past tense), sails (and other 4-letter nouns starting with s), etc.

    Longest cyclic words are 12 letters long: spaceflights, speculations, subharmonics, subordinates, switchblades, switchboards, sympathizers.
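The test for a square cyclic word can be sketched as follows. This is a Python illustration, not the talk's Mathematica code; it counts distinct letters with a set instead of the intersection mentioned above, and the function name is hypothetical.

```python
import re

# Five letters with the first equal to the last, as in /(.)...\1/.
CYCLIC5 = re.compile(r"(.)...\1")

def is_square_cyclic(word):
    """True when the word is 5 letters, starts and ends with the same
    letter, and uses exactly 4 distinct letters (so its graph is a 4-cycle)."""
    return CYCLIC5.fullmatch(word) is not None and len(set(word)) == 4

for w in ["eagle", "dazed", "sails", "level", "amherst"]:
    print(w, is_square_cyclic(w))
```

The distinct-letter check matters: level matches the regex but has only three distinct letters, so its graph is not a square.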


    C(3) through C(8) Alphabets

    C(3)  C(4)   C(5)    C(6)     C(7)      C(8)
    area  aroma  asthma  amnesia  asphyxia  angostura
    bomb  blurb  benumb  brewpub  --------  ---------
    chic  comic  cosmic  chronic  catholic  chromatic
    dead  dread  demand  dogsled  disabled  diagnosed
    ease  eagle  excuse  eclipse  earphone  enjoyable
    fief  -----  ------  -------  --------  ---------
    gong  going  gaming  glowing  gambling  gathering
    high  hatch  health  hawkish  hyacinth  haircloth
    impi  iambi  incubi  intagli  --------  ---------
    ----  -----  ------  -------  --------  ---------
    kick  knock  kopeck  kinfolk  kinsfolk  ---------
    leal  local  lawful  logical  lightful  ---------
    maim  maxim  medium  midterm  mealworm  mechanism
    noun  nylon  napkin  newborn  nitrogen  ---------
    ouzo  outdo  overdo  oregano  obligato  ---------
    pump  plump  pickup  parsnip  pawnshop  playgroup
    ----  -----  ------  -------  --------  ---------
    roar  razor  rather  regular  roadster  regulator
    saws  stars  shirts  sailors  sardines  seafronts
    text  theft  throat  tourist  tolerant  therapist
    unau  -----  ------  -------  --------  ---------
    ----  -----  ------  -------  --------  ---------
    whew  widow  window  whipsaw  withdraw  worldview
    ----  xerox  ------  -------  --------  ---------
    ----  yolky  yearly  -------  yeastily  ---------
    ----  -----  ------  -------  --------  ---------


    Cyclic Word Mathematica Code


    Applying String Patterns to Square Integers (Squares)

    Numbers are strings, too, so they are amenable to regexes.

    We'll apply regexes to find some unusual squares.

    We will also investigate the randomness of the digits of squares.


    Squares without Doubled Adjacent Digits

    Suppose the digits in a square (in base 10) are random.

    Then P(two adjacent digits are unequal) = 9/10.

    A 100-digit number has 99 adjacent pairs, so P(100-digit square has no adjacent digits equal) = 0.9^99 = 0.0000295127, or about 30 in a million.

    We can check this estimate by stochastic computer searches.

    51737187749414391248343906418265954222307660356158 ^2 =

    2676736596218154562635394063470527614352180564931651707698543214192825076469507142492903267408520964

    92247034353159048872216907430912506658722672391337 ^2 =

    8509515346952905702167423230637367421917540367430367521058492897148214941569637843428692738072647569

    46684912798986257941402247358809860358456884854826 ^2 =

    2179481083048950920786370528301507294579140187693485325858417852926259858127813068545985375095490276

    72400570183600937943662908692744405525154232178778 ^2 =

    5241842562910525153020949368218058035430952649615191319089815789036984216473806049461870608953573284

    etc.
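The estimate quoted above is a one-line calculation. The sketch below, in Python (a stand-in language for this illustration), reproduces it under the same assumption that adjacent digits behave like independent random digits; the variable names are hypothetical.

```python
digits = 100
pairs = digits - 1                 # adjacent digit pairs in a 100-digit number
p_no_double = 0.9 ** pairs         # every pair unequal, assuming random digits
expected_per_million = 1_000_000 * p_no_double

print(p_no_double)
print(expected_per_million)
```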


    Code to Compute Counts of Such Squares

    Do[
      total = 0;
      nsteps = 1000000;
      ndigits = 100;
      (* smallest and largest roots whose squares have ndigits digits *)
      start = Ceiling[Sqrt[10^(ndigits - 1)]];
      stop = Floor[Sqrt[10^ndigits - 1]];
      Do[
        square = RandomInteger[{start, stop}]^2;
        (* look for any doubled adjacent digit *)
        match = StringCases[ToString[square], RegularExpression["(.)\\1"]];
        If[Length[match] == 0, ++total],
        {i, 1, nsteps}
      ];
      Print[total],
      {nreps, 1, 50}
    ]

    50 repetitions of searching a million squares.
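The same stochastic search can be sketched in Python (a stand-in language, not the talk's). The function names, the seed, and the smaller trial count are all hypothetical choices for illustration; exact integer square roots via `math.isqrt` avoid floating-point trouble with 100-digit numbers.

```python
import math
import random
import re

DOUBLED = re.compile(r"(.)\1")   # any digit immediately repeated

def root_range(ndigits):
    """Smallest and largest integers whose squares have exactly ndigits digits."""
    lo = math.isqrt(10 ** (ndigits - 1) - 1) + 1   # ceiling of the square root
    hi = math.isqrt(10 ** ndigits - 1)
    return lo, hi

def count_without_doubles(ndigits, nsteps, rng):
    """Count random ndigits-digit squares that have no doubled adjacent digit."""
    lo, hi = root_range(ndigits)
    total = 0
    for _ in range(nsteps):
        square = rng.randint(lo, hi) ** 2
        if DOUBLED.search(str(square)) is None:
            total += 1
    return total

rng = random.Random(2011)        # arbitrary seed, for repeatability only
print(count_without_doubles(100, 10_000, rng))
```

With only 10,000 trials the count will usually be 0 or 1; the talk's runs use a million trials per repetition, where the estimate predicts roughly 30 hits.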


    Are the Digits of Squares Random? The initial digits are not.

    Lower and Upper are the smallest and largest roots whose squares begin with the given digit(s).

    Initial digit(s)  Lower  Upper  Lower^2  Upper^2
    1                 100    141    10000    19881
    2                 142    173    20164    29929
    3                 174    199    30276    39601
    4                 200    223    40000    49729
    5                 224    244    50176    59536
    6                 245    264    60025    69696
    7                 265    282    70225    79524
    8                 283    299    80089    89401
    9                 300    316    90000    99856
    10                317    447    100489   199809
    20                448    547    200704   299209
    30                548    632    300304   399424
    40                633    707    400689   499849
    50                708    774    501264   599076
    60                775    836    600625   698896
    70                837    894    700569   799236
    80                895    948    801025   898704
    90                949    999    900601   998001

    The limiting proportion of initial digit k, for 1 ≤ k ≤ 9, is given by:

    (Sqrt[k+1] - Sqrt[k] + Sqrt[10(k+1)] - Sqrt[10k])/9.

    Digit  Prob.
    1      19.16%
    2      14.70%
    3      12.39%
    4      10.92%
    5      9.87%
    6      9.08%
    7      8.45%
    8      7.93%
    9      7.50%
    Total  100.00%
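The limiting proportions can be recomputed from the formula and compared with a direct count. The sketch below is in Python (a stand-in language); the function names are hypothetical, and the observed count uses roots 100 through 999, i.e., all 5- and 6-digit squares.

```python
import math
from collections import Counter

def limiting_proportion(k):
    """Limiting share of squares whose initial digit is k (1 <= k <= 9)."""
    return (math.sqrt(k + 1) - math.sqrt(k)
            + math.sqrt(10 * (k + 1)) - math.sqrt(10 * k)) / 9

def observed_proportions(lo, hi):
    """Share of each initial digit among the squares of lo, lo+1, ..., hi."""
    counts = Counter(str(r * r)[0] for r in range(lo, hi + 1))
    n = hi - lo + 1
    return {k: counts[str(k)] / n for k in range(1, 10)}

observed = observed_proportions(100, 999)
for k in range(1, 10):
    print(k, f"{100 * limiting_proportion(k):.2f}%", f"{100 * observed[k]:.2f}%")
```

The nine limiting proportions telescope to (Sqrt[10] - 1 + 10 - Sqrt[10])/9 = 1, which is why the table totals 100%.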


    Aside: Benford's Law

    Benford's Law says that initial digits often follow the following probability distribution:

    Log[10, (k + 1)/k] for 1 ≤ k ≤ 9.

    A sequence {a_n} satisfies Benford's law iff Log[10, a_n] (mod 1) is uniformly distributed. See http://en.wikipedia.org/wiki/Benford's_law.

    Benford's Law does not fit the distribution of initial digits of squares.

    Digit  Prob.
    1      30.10%
    2      17.61%
    3      12.49%
    4      9.69%
    5      7.92%
    6      6.69%
    7      5.80%
    8      5.12%
    9      4.58%
    Total  100.00%
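To see the mismatch side by side, the Benford probabilities can be tabulated next to the limiting proportions for squares. This is a small Python sketch (a stand-in language); the dictionary names are hypothetical.

```python
import math

# Benford's distribution for the initial digit k.
benford = {k: math.log10((k + 1) / k) for k in range(1, 10)}

# Limiting initial-digit proportions for squares, from the earlier formula.
squares = {
    k: (math.sqrt(k + 1) - math.sqrt(k)
        + math.sqrt(10 * (k + 1)) - math.sqrt(10 * k)) / 9
    for k in range(1, 10)
}

for k in range(1, 10):
    print(k, f"Benford {100 * benford[k]:.2f}%", f"squares {100 * squares[k]:.2f}%")
```

Both distributions favor small initial digits, but Benford's Law puts far more weight on the digit 1 than squares do.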


    The final digits of squares are not random.

    Final digit can only be: {0,1,4,5,6,9} (only 60% of possible one-digit endings are allowed)

    Final two digits can only be: {00,01,04,09,16,21,24,25,29,36,41,44,49,56,61,64,69,76,81,84,89,96} (only 22% of possible two-digit endings allowed)

    Final three digits can only be: {000,001,004,009,016,024,025,036,041,044,049,056,064,076,081,084,089,096,100,104,116,121,124,129,136,144,156,161,164,169,176,184,196,201,204,209,216,224,225,236,241,244,249,256,264,276,281,284,289,296,304,316,321,324,329,336,344,356,361,364,369,376,384,396,400,401,404,409,416,424,436,441,444,449,456,464,476,481,484,489,496,500,504,516,521,524,529,536,544,556,561,564,569,576,584,596,600,601,604,609,616,624,625,636,641,644,649,656,664,676,681,684,689,696,704,716,721,724,729,736,744,756,761,764,769,776,784,796,801,804,809,816,824,836,841,844,849,856,864,876,881,884,889,896,900,904,916,921,924,929,936,944,956,961,964,969,976,984,996} (only 15.9% allowed)

    The percentages are decreasing, but to what value?
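The allowed endings can be enumerated by brute force, since the last n digits of x^2 depend only on the last n digits of x. The following Python sketch (a stand-in language; the function names are hypothetical) squares every residue modulo 10^n.

```python
def square_endings(n):
    """Sorted list of the possible final n digits of a square, as integers."""
    modulus = 10 ** n
    return sorted({(x * x) % modulus for x in range(modulus)})

def fraction_allowed(n):
    """Proportion of n-digit endings that can end a square."""
    return len(square_endings(n)) / 10 ** n

for n in range(1, 5):
    print(n, len(square_endings(n)), fraction_allowed(n))
```

This approach takes 10^n steps, so it is only practical for small n; the closed form on the next slide handles the general case.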


    Walter Penney's Theorem

    Pick n digits at random, and let P(n) = probability that these n digits are the final digits of a square.

    Theorem (Penney, 1960): As n → ∞, P(n) → 5/72 ≈ 6.94%.

    Penney shows that $P(n) = (2^{n-1} + 4)(5^{n+1} + 7)/(36 \cdot 10^n)$ for n even and $P(n) = (2^{n-1} + 5)(5^{n+1} + 11)/(36 \cdot 10^n)$ for n odd.

    For a proof see Walter Penney (1960), On the Final Digits of Squares. Also see Walter Stangl (1996), Counting Squares in Z_n.

    n  P(n)      P(n)/P(n-1)
    1  .6000000  .6000
    2  .2200000  .3667
    3  .1590000  .7227
    4  .1044000  .6566
    5  .0912100  .8736
    6  .0781320  .8564
    7  .0748719  .9590
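Penney's closed form is easy to evaluate exactly with rational arithmetic. The sketch below is in Python (a stand-in language) and the function name `penney` is hypothetical; it reproduces the table entries and shows the approach to 5/72.

```python
from fractions import Fraction

def penney(n):
    """P(n): probability that n random digits are the final digits of a square."""
    if n % 2 == 0:
        numerator = (2 ** (n - 1) + 4) * (5 ** (n + 1) + 7)
    else:
        numerator = (2 ** (n - 1) + 5) * (5 ** (n + 1) + 11)
    return Fraction(numerator, 36 * 10 ** n)

for n in range(1, 8):
    print(n, float(penney(n)), float(penney(n) / penney(n - 1)) if n > 1 else "")
print("limit 5/72 =", 5 / 72)
```

The limit follows from the leading terms: 2^(n-1) * 5^(n+1) / (36 * 10^n) = (5/2)/36 = 5/72.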


    Equi-Pandigital Primes

    An equi-pandigital number in base b contains each digit from 0 through (b − 1) exactly the same number of times.

    Theorem. For b > 3, there are no equi-pandigital primes.

    Proof. Let n be an equi-pandigital number in base b. Then mod (b − 1), n is congruent to the sum of its digits because $b^k \equiv 1^k = 1$ for every power k. Let r be the number of repetitions of each of the digits 0, 1, 2, …, b − 1, which sum to b(b − 1)/2. So we have:

    $n \equiv r\,b(b-1)/2 \pmod{b-1}$

    If b is even, then n ≡ 0 (mod b − 1) since b/2 is an integer, so (b − 1) divides n.

    If b is odd, then (b − 1)/2 is an integer, so either n ≡ 0 or n ≡ (b − 1)/2 (mod b − 1). In both cases, (b − 1)/2 divides n.

    For b > 3, (b − 1) and (b − 1)/2 are nontrivial, so n is not prime. QED

    Remark 1: Finding a base 10 equi-pandigital prime will take some trickery.

    Remark 2: 10_2, 100101_2, 101001_2, 10001011_2, etc. are prime, as are 102_3, 201_3, 100012212_3, 100022112_3, etc.
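The divisibility argument and the base 2 and base 3 exceptions can be spot-checked numerically. This is a Python sketch (a stand-in language); the helper names are hypothetical and the primality test is plain trial division, adequate only for small inputs.

```python
from collections import Counter
from itertools import permutations

def digits_in_base(n, b):
    """Digits of the positive integer n in base b, most significant first."""
    out = []
    while n > 0:
        n, r = divmod(n, b)
        out.append(r)
    return out[::-1]

def is_equipandigital(n, b):
    """True when every digit 0..b-1 occurs in n equally often (at least once)."""
    counts = Counter(digits_in_base(n, b))
    return len(counts) == b and len(set(counts.values())) == 1

def is_prime(n):
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    i = 3
    while i * i <= n:
        if n % i == 0:
            return False
        i += 2
    return True

# Base 2 and base 3 examples from Remark 2.
for text, base in [("100101", 2), ("101001", 2), ("10001011", 2), ("102", 3), ("201", 3)]:
    n = int(text, base)
    print(text, base, n, is_equipandigital(n, base), is_prime(n))

# Base 4 (even): every arrangement of 0,1,2,3 is divisible by b - 1 = 3.
print(all(int("".join(p), 4) % 3 == 0 for p in permutations("0123") if p[0] != "0"))
```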


    Equi-Pandigital Gaussian Primes

    Theorem 9.15 of Deskins (1964): Let p be a prime in Z. Then all Gaussian primes in Z[i] fall into one of the following three cases, up to units.

    (a) If p ≡ 3 (mod 4), then p is a Gaussian prime.

    (b) If p ≡ 1 (mod 4), then p = (a + b i)(a − b i) is the Gaussian prime factorization, where a, b is the unique (up to order) solution to p = a^2 + b^2.

    (c) If p = 2, then 2 = (1 + i)(1 − i) is the factorization into Gaussian primes.

    Proof: See Deskins (1964).

    Remark: Brillhart's algorithm can find a and b for p ≡ 1 (mod 4). See Williams (1995).

    Let's make an estimate of the number of (5,5)-digit pandigital primes.

    Since P(m ∈ Z is prime) ≈ 1/log(m) (just differentiate the logarithmic integral), we need to multiply this by the number of (a + b i)'s satisfying a > b and condition (b) above. Hence 10^5 > a > b > 10^4.5, which implies that 2(10^10) > a^2 + b^2. Pick two digits from 1 through 9 to use as the leading digits of a and b, and the rest of the digits can be put in any order, so the total number of expected Gaussian primes is approximately:


    Search Results

    By computer search, there are 69774 equi-pandigital Gaussian primes of the form a + b i, a > b > 0. Here are some interesting ones:

    Pandigital Gaussian prime  Distinguishing property
    96530 + 87421i             Max norm
    20468 + 13597i             Min norm
    98765 + 10234i             Max real - imaginary parts
    60143 + 59872i             Min real - imaginary parts
    86420 + 79513i             Largest real part with all even digits
    20864 + 13579i             Smallest imaginary part with all odd digits
    97531 + 82604i             Largest real part with all odd digits
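A search of this kind needs two tests: whether a + b i is a Gaussian prime, and whether the digits of a and b together use 0 through 9 once each. The sketch below is in Python (a stand-in language) with hypothetical function names. For a and b both nonzero it relies on the standard fact that a + b i is a Gaussian prime exactly when the norm a^2 + b^2 is prime, which corresponds to cases (b) and (c) of the theorem; trial division is used for simplicity and is not what a large search would want.

```python
def is_prime(n):
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    i = 3
    while i * i <= n:
        if n % i == 0:
            return False
        i += 2
    return True

def is_gaussian_prime(a, b):
    """Test a + b*i for primality in Z[i]."""
    if a == 0 or b == 0:
        m = abs(a + b)                 # the nonzero coordinate, up to a unit
        return is_prime(m) and m % 4 == 3
    return is_prime(a * a + b * b)     # norm must be an ordinary prime

def is_pandigital_pair(a, b):
    """True when the digits of a and b together are 0-9, each used once."""
    return sorted(str(a) + str(b)) == list("0123456789")

table = [(96530, 87421), (20468, 13597), (98765, 10234), (60143, 59872),
         (86420, 79513), (20864, 13579), (97531, 82604)]
for a, b in table:
    print(a, b, is_pandigital_pair(a, b), is_gaussian_prime(a, b))
```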


    All the Equi-Pandigital Gaussian Primes


    2. Anagrams of Words and Numbers

    An anagram of a word is a non-identity permutation of that word's letters.

    E.g., Amherst is an anagram of hamster.

    One-word anagrams are sometimes called transpositions in wordplay. In wordplay, some require an anagram to have a meaning related to the original word.

    We'll also consider anagrams of numbers. In what follows, initial zeros are forbidden. E.g., 13^2 = 169, 14^2 = 196, and 31^2 = 961 are anagrams of each other (in base 10).


    How to find English word anagrams

    Step 1: Obtain a wordlist.

    There are now a variety of sources available:

    American Cryptogram Association at http://cryptogram.org/cdb/words/words.html

    National Puzzlers' League at http://www.puzzlers.org/dokuwiki/doku.php?id=solving:wordlists:about:start

    Grady Ward's Moby word lists (in the public domain) at http://icon.shef.ac.uk/Moby/

    The above wordlists include all the inflected forms of words: nouns with both singular and plural forms, adjectives with comparative forms, verbs with all conjugated forms, etc.


    Step 2: Read in the wordlist and store it in a hash.

    The wordlist begins:

    aa
    aah
    aahed
    aahing
    aahs
    aal
    aalii
    aaliis
    aals
    aardvark
    aardvarks
    aardwolf
    aardwolves
    aas
    aasvogel
    aasvogels
    aba
    abaca
    abacas
    abaci
    aback
    abacus
    abacuses

    Perl program from Section 3.7.2 of Bilisoly (2008b):

    open(WORDS, "CROSSWD.TXT") or die;
    while (<WORDS>) {
      chomp;
      # The key is the word's letters in alphabetical order.
      @letters = split(//);
      $key = join('', sort(@letters));
      if ( exists($dictionary{$key}) ) {
        # Key seen before: this word is an anagram of an earlier one.
        $dictionary{$key} .= ",$_";
      } else {
        $dictionary{$key} = $_;
      }
    }
    # Print the anagram dictionary with the keys in alphabetical order.
    foreach $key (sort keys %dictionary) {
      print "$key, $dictionary{$key}\n";
    }

    The hash key equals the letters of the word sorted in alphabetical order. Examples:

    aah -> aah
    aahed -> aadeh
    aahing -> aaghin
    aardvark -> aaadkrrv

    If the key already exists, then an anagram has been discovered. Example: evil, live, vile, and veil all have the key eilv.
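The same sorted-letters idea can be sketched in Python (a stand-in for the Perl program, used here only for illustration). The function name and the in-memory sample list are hypothetical; a real run would read the wordlist file line by line.

```python
from collections import defaultdict

def anagram_dictionary(words):
    """Map each sorted-letter key to the list of words sharing that key."""
    table = defaultdict(list)
    for word in words:
        key = "".join(sorted(word))   # letters in alphabetical order
        table[key].append(word)
    return dict(table)

sample = ["aah", "aahed", "evil", "live", "vile", "veil", "aardvark"]
dictionary = anagram_dictionary(sample)
for key in sorted(dictionary):
    print(key + ", " + ",".join(dictionary[key]))
```

Storing a list per key, rather than a comma-joined string as in the Perl version, makes it simple to pick out the keys with two or more words.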


    Step 3: Print out the hash with the keys sorted in alphabetical order.

    The result (sample below) is an anagram dictionary.

    Invaluable for word games such as Scrabble and Jumble: just sort the letters at hand and check if they form a word.

    Looking for entries with two or more commas reveals word anagrams. Most words do not have anagrams.

    aa, aa
    aaaaabbcdrr, abracadabra
    aaaabcceelrstu, baccalaureates
    aaaabcceelrtu, baccalaureate
    aaaabdilmorss, ambassadorial
    aaaabenn, anabaena
    aaaabenns, anabaenas
    aaaaccdiiklllsy, lackadaisically
    aaaaccdiiklls, lackadaisical
    aaaaccrr, caracara
    aaaaccrrs, caracaras
    aaaacgnr, caragana
    aaaacgnrs, caraganas
    aaaacmnrst, catamarans
    aaaacmnrt, catamaran


    Numbers are formed from the alphabet {0,1,2,3,4,5,6,7,8,9}.

    The program above can easily be modified to find anagrams of a set of numbers.

    In recreational mathematics, it is well known that 12^2 = 144 and 21^2 = 441, and that 13^2 = 169, 31^2 = 961, and 14^2 = 196.

    Unlike words, it turns out that it is easy to find two or more squares that are anagrams. For example, the following 87 squares are anagrams of each other:

    1026753849, 1042385796, 1098524736, 1237069584, 1248703569, 1278563049,1285437609, 1382054976, 1436789025, 1503267984, 1532487609, 1547320896,1643897025, 1827049536, 1927385604, 1937408256, 2076351489, 2081549376,2170348569, 2386517904, 2431870596, 2435718609, 2571098436, 2913408576,3015986724, 3074258916, 3082914576, 3089247561, 3094251876, 3195867024,3285697041, 3412078569, 3416987025, 3428570916, 3528716409, 3719048256,3791480625, 3827401956, 3928657041, 3964087521, 3975428601, 3985270641,4307821956, 4308215769, 4369871025, 4392508176, 4580176329, 4728350169,4730825961, 4832057169, 5102673489, 5273809641, 5739426081, 5783146209,

    5803697124, 5982403716, 6095237184, 6154873209, 6457890321, 6471398025,6597013284, 6714983025, 7042398561, 7165283904, 7285134609, 7351862049,7362154809, 7408561329, 7680594321, 7854036129, 7935068241, 7946831025,7984316025, 8014367529, 8125940736, 8127563409, 8135679204, 8326197504,8391476025, 8503421796, 8967143025, 9054283716, 9351276804, 9560732841,9614783025, 9761835204, 9814072356.
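Grouping squares by their sorted digits finds these anagram classes directly. The sketch below is in Python (a stand-in language) and its function names are hypothetical; it hashes every d-digit square on its sorted digit string, exactly as the word program hashes on sorted letters.

```python
from collections import defaultdict
from math import isqrt

def squares_with_digits(d):
    """All squares with exactly d digits in base 10 (no leading zeros)."""
    lo = isqrt(10 ** (d - 1) - 1) + 1
    hi = isqrt(10 ** d - 1)
    return [r * r for r in range(lo, hi + 1)]

def groups_by_key(d):
    """Map each sorted-digit key to the d-digit squares having that key."""
    groups = defaultdict(list)
    for sq in squares_with_digits(d):
        groups["".join(sorted(str(sq)))].append(sq)
    return groups

def count_anasquares(d):
    """Number of d-digit squares that have another d-digit square as an anagram."""
    return sum(len(g) for g in groups_by_key(d).values() if len(g) > 1)

print(groups_by_key(3)["144"], groups_by_key(3)["169"])
print(len(groups_by_key(10)["0123456789"]))
```

The key "0123456789" picks out the 10-digit squares that use every digit once, which is the class listed above.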


    A Pattern Emerges

    # Digits  # Squares  # Anasquares  Proportion
    1         3          0             0.00%
    2         6          0             0.00%
    3         22         7             31.82%
    4         68         13            19.12%
    5         217        86            39.63%
    6         683        293           42.90%
    7         2163       1212          56.03%
    8         6837       4699          68.73%
    9         21623      17380         80.38%
    10        68377      60623         88.66%

    In fact, looking at n-digit squares, it seems that as n increases, the proportion of squares with square anagrams (let's call these anasquares) keeps increasing. What is the limit?

    The above table is Table 1 from Bilisoly (2008a). Also see http://oeis.org/A177952.


    The limit is 100%!

    Let $S_{d,b}$ be the set of squares with exactly d digits when written in base b.

    Define a pattern of a number n to be the digits of n in base b sorted from least to greatest. Note that a pattern is a hash key.

    Theorem (Bilisoly, 2008a): The proportion of anasquares in $S_{d,b}$ → 1 as d → ∞ with b fixed.

    Proof: A lower bound to the number of anasquares occurs when as many patterns as possible correspond to exactly 1 square. To find this lower bound, we count the number of patterns and d-digit squares.

    First, thinking back to the Perl program, the hash key of a number is obtained by sorting its digits. Let $d_i$ be the number of times the digit i appears in a square. Then this hash key can be represented by:

    $$\underbrace{**\cdots*}_{d_0} \,|\, \underbrace{**\cdots*}_{d_1} \,|\, \cdots \,|\, \underbrace{**\cdots*}_{d_{b-1}}$$


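    End of Proof

    The number of distinct hash keys is:

    $$\binom{d+b-1}{d} = \frac{(d+b-1)(d+b-2)\cdots(d+1)}{(b-1)!}$$

    Second, the number of d-digit squares is:

    $$\lceil b^{d/2} \rceil - \lceil b^{(d-1)/2} \rceil \ge b^{d/2} - b^{(d-1)/2} - 1 = b^{d/2}\,(1 - 1/\sqrt{b}) - 1.$$

    Hence the number of d-digit squares is exponential (in d), but the number of patterns is a polynomial (in d), so the proportion of anasquares is bounded below by the following (for d large enough that the bracket is positive), which → 1 as d → ∞ (with b fixed):

    $$\max\left\{ 1 - \frac{(d+b-1)(d+b-2)\cdots(d+1)}{(b-1)!\,\left[\, b^{d/2}\,(1 - 1/\sqrt{b}) - 1 \,\right]},\; 0 \right\}$$

    QED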
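The counting argument can be made concrete by comparing the two counts for base 10. The sketch below is in Python (a stand-in language) with hypothetical function names. It uses the exact number of d-digit squares rather than an analytic estimate, and it counts patterns by stars and bars, which is an upper bound on the number of hash keys that actually occur.

```python
from math import comb, isqrt

def ceil_sqrt(n):
    """Smallest integer r with r*r >= n, for n >= 1."""
    return isqrt(n - 1) + 1

def num_patterns(d, b):
    """Stars-and-bars count of sorted digit strings of length d in base b."""
    return comb(d + b - 1, b - 1)

def num_squares(d, b):
    """Exact number of squares with d digits in base b."""
    return ceil_sqrt(b ** d) - ceil_sqrt(b ** (d - 1))

def anasquare_lower_bound(d, b):
    """Lower bound on the proportion of d-digit squares that are anasquares."""
    return max(1 - num_patterns(d, b) / num_squares(d, b), 0.0)

for d in (3, 10, 20, 30, 40, 60):
    print(d, num_squares(d, 10), num_patterns(d, 10), anasquare_lower_bound(d, 10))
```

For small d the bound is the trivial 0, because there are more possible patterns than squares; it only becomes informative once the exponential count of squares overtakes the polynomial count of patterns.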


    3. Birthday Problem and Pangrams

    The basic birthday problem is famous: For n people, assuming all days are equally likely, what is the probability that at least two people share the same birthday?

    The following are related:

    Let N_shared = number of people needed until 1 birthday appears at least 2 times.

    Let N_all = number of people needed until all 365 birthdays appear at least once.

    Note E(N_shared) = 24.6166 > the usual # of people quoted. Why?

    P(22 people, 2 share) = 0.475695

    P(23 people, 2 share) = 0.507297
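These numbers can be reproduced for equally likely days. The sketch below is in Python (a stand-in language) with hypothetical function names. The expectation uses the tail-sum identity E(N) = sum over n ≥ 0 of P(N > n), where P(N_shared > n) is the probability that the first n birthdays are all distinct.

```python
def prob_all_distinct(n, days=365):
    """Probability that n people all have different birthdays."""
    p = 1.0
    for i in range(n):
        p *= (days - i) / days
    return p

def prob_shared(n, days=365):
    """Probability that at least two of n people share a birthday."""
    return 1.0 - prob_all_distinct(n, days)

def expected_until_shared(days=365):
    """E(N_shared): expected number of people until some birthday repeats."""
    return sum(prob_all_distinct(n, days) for n in range(days + 1))

print(prob_shared(22), prob_shared(23))
print(expected_until_shared())
```

The familiar figure of 23 is the median-style answer (the smallest n with probability above 1/2), while 24.6166 is the mean of N_shared; the mean is pulled up by the long right tail.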


    Results from: Flajolet, Gardy, and Thimonier (1992)

    Suppose all days are not equally likely; then let $p_i$ = P(day i is a birthday).

    Corollary (The Birthday Problem). We need j = 1 day to appear at least k = 2 times.

    $$E(N_{\text{shared}}) = \int_0^\infty \prod_{i=1}^{365} (1 + p_i t)\, \exp(-t)\, dt$$

    Corollary (The Coupon Problem). We need all letters to appear at least once.

    $$E(N_{\text{all}}) = \int_0^\infty \left( 1 - \prod_{i=1}^{365} \bigl(1 - \exp(-p_i t)\bigr) \right) dt$$
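The birthday integral can be evaluated without numerical integration. Expanding the product as a polynomial in t and using the fact that the integral of t^k exp(-t) over [0, ∞) is k!, the expectation becomes the sum over k of k! times the k-th elementary symmetric polynomial of the p_i. The Python sketch below (a stand-in language; the function name is hypothetical) accumulates those terms directly, keeping each term already multiplied by k! so nothing overflows.

```python
def expected_until_shared(probs):
    """E(N_shared) for day probabilities probs, which must sum to 1.

    c[k] holds k! * e_k(p), where e_k is the k-th elementary symmetric
    polynomial of the probabilities processed so far.
    """
    c = [1.0]
    for p in probs:
        nxt = c + [0.0]
        for k in range(1, len(nxt)):
            # e_k picks up p * e_(k-1); the factor k keeps the k! scaling.
            nxt[k] += p * k * c[k - 1]
        c = nxt
    return sum(c)

uniform = [1 / 365] * 365
print(expected_until_shared(uniform))
```

Each c[k] equals the probability that k independent draws land on k different days, so the sum is the same tail-sum expectation as in the equally likely case, now valid for unequal p_i.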


    Application to Birthdays

    What is the expected number of people needed so that 2 people share a birthday?

    Mathematica gives E(N_shared) = 24.6166, which assumes each day is equally likely. Note E(N_all) = 2364.65.

    What is the expected number of people born in 1978 needed so that 2 people share a birthday?

    Mathematica gives E(N_shared) = 24.5262, and note E(N_all) = 2435.14.

    [Figure: plot of Julian day vs. proportion of births on that day for 1978.] Which days does the lower band represent?

    Data source: Todd Swanson's home page, http://www.math.hope.edu/swanson/data/birthdays.txt


    Pangrammatic Windows

    The Spirit dropped beneath it, so that the extinguisher covered its whole form; but though Scrooge pressed it down with all his force, he could not hide the light: which streamed from under it, in an unbroken flood upon the ground.

    He was conscious of being exhausted, and overcome by an irresistible drowsiness; and, further, of being in his own bedroom. He gave the cap a parting squeeze, in which his hand relaxed; and had barely time to reel to bed, before he sank into a heavy sleep.

    AWAKING in the middle of a prodigiously tough snore, and sitting up in bed to get his thoughts together, Scrooge had no occasion to be told that the bell was again upon the stroke of One. He felt that he was restored to consciousness in the right nick of time, for the especial purpose of holding a conference with the second messenger dispatched to him through Jacob Marley's intervention.

    This text is from Charles Dickens's A Christmas Carol.

    The blue portion is a pangrammatic window, i.e., it contains each letter of the alphabet at least once.

    There are 679 letters in color.

    The search started with "The Spirit", and the window could be shortened by dropping letters from the beginning.
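Finding a pangrammatic window from a given starting point, and then shortening it from the front, can be sketched as follows. This is a Python illustration (a stand-in language) with hypothetical function names; it works on the letters only, ignoring case, spaces, and punctuation, and the classic pangram sentence stands in for the novel's text.

```python
from collections import Counter

def letters_only(text):
    """Lowercase a-z letters of the text, everything else dropped."""
    return "".join(ch for ch in text.lower() if "a" <= ch <= "z")

def window_end(letters, start):
    """Index just past the first position where all 26 letters have
    appeared since start, or None if that never happens."""
    seen = set()
    for i in range(start, len(letters)):
        seen.add(letters[i])
        if len(seen) == 26:
            return i + 1
    return None

def shrink_start(letters, start, end):
    """Move start right while the dropped letter still occurs later in the window."""
    counts = Counter(letters[start:end])
    while counts[letters[start]] > 1:
        counts[letters[start]] -= 1
        start += 1
    return start

text = letters_only("The quick brown fox jumps over the lazy dog")
end = window_end(text, 0)
start = shrink_start(text, 0, end)
print(end, start, text[start:end])
```

In this toy example the leading "the" can be dropped because t, h, and e all recur later, but the q cannot, so the window shrinks by three letters.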


    Pangrams in A Christmas Carol

    We'll search A Christmas Carol for pangrams by selecting random starting positions. Then we compare this to independently generated letters using the letter frequencies of this novel. The counts and the proportions are listed below.

    Of course, letters are not independent, but the question is this: How do the actual pangram lengths differ from the simulated independent pangram lengths?

    Letter Frequencies in A Christmas Carol

    Letter  Count  Proportion
    a       9308   0.076892
    b       1943   0.016051
    c       3035   0.025072
    d       5674   0.046872
    e       14850  0.122674
    f       2433   0.020099
    g       2979   0.024609
    h       8368   0.069127
    i       8294   0.068515
    j       113    0.00093
    k       1031   0.008517
    l       4553   0.037612
    m       2840   0.023461
    n       7960   0.065756
    o       9690   0.080048
    p       2119   0.017505
    q       97     0.000801
    r       7031   0.058082
    s       7900   0.065261
    t       10869  0.089787
    u       3335   0.02755
    v       1022   0.008443
    w       3096   0.025576
    x       131    0.001082
    y       2298   0.018983
    z       84     0.000694


    Pangram Lengths

    [Figure: two histograms of pangram lengths, N = 1000 each.]

    The left histogram shows lengths of pangrams found in A Christmas Carol using random starting points.

    The right histogram shows lengths of pangrams found in a simulated string of independent letters using the proportions found in A Christmas Carol.

    Note the long right tail.

    Theoretical mean = 2473.8

    Figures from Bilisoly (2009)
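Both the simulation and the theoretical mean for independent letters can be sketched from the letter counts of the novel. The code below is in Python (a stand-in language); the function names, the seed, and the integration step and cutoff are hypothetical choices. The mean is computed by applying the trapezoid rule to the coupon-problem integral, with the 26 letter proportions in place of the 365 day probabilities.

```python
import math
import random

# Letter counts for A Christmas Carol, as tabulated in the talk.
COUNTS = {
    "a": 9308, "b": 1943, "c": 3035, "d": 5674, "e": 14850, "f": 2433,
    "g": 2979, "h": 8368, "i": 8294, "j": 113, "k": 1031, "l": 4553,
    "m": 2840, "n": 7960, "o": 9690, "p": 2119, "q": 97, "r": 7031,
    "s": 7900, "t": 10869, "u": 3335, "v": 1022, "w": 3096, "x": 131,
    "y": 2298, "z": 84,
}
TOTAL = sum(COUNTS.values())
PROBS = [c / TOTAL for c in COUNTS.values()]

def simulated_pangram_length(rng):
    """Draw independent letters until all 26 have appeared; return the count."""
    letters, weights = list(COUNTS), list(COUNTS.values())
    seen, length = set(), 0
    while len(seen) < 26:
        for ch in rng.choices(letters, weights=weights, k=256):
            length += 1
            seen.add(ch)
            if len(seen) == 26:
                break
    return length

def theoretical_mean(probs, step=2.0, upper=40000.0):
    """Trapezoid-rule estimate of the integral of 1 - prod(1 - exp(-p*t))."""
    def integrand(t):
        prod = 1.0
        for p in probs:
            prod *= 1.0 - math.exp(-p * t)
        return 1.0 - prod
    n = int(upper / step)
    total = 0.5 * (integrand(0.0) + integrand(upper))
    for i in range(1, n):
        total += integrand(i * step)
    return total * step

rng = random.Random(1843)   # arbitrary seed, for repeatability only
print([simulated_pangram_length(rng) for _ in range(5)])
print(theoretical_mean(PROBS))
```

The rare letters z, q, j, and x dominate the waiting time, which is why the mean is in the thousands even though there are only 26 letters to collect.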


    Concordancing words with the letter z reveals

    "Why, it's old Fezziwig! Bless his heart; i

    ess his heart; it's Fezziwig alive again!"

    Old Fezziwig laid down his pen,

    "Yo ho, there! Ebenezer! Dick!"

    ho, my boys!" said Fezziwig. "No more work to-n

    e, Dick. Christmas, Ebenezer! Let's have the shu

    ters up," cried old Fezziwig, with a sharp clap

    illi-ho!" cried old Fezziwig, skipping down from

    -ho, Dick! Chirrup, Ebenezer!"

    ared away, with old Fezziwig looking on. It was

    aches. In came Mrs. Fezziwig, one vast substanti

    came the three Miss Fezziwigs, beaming and lovabl

    brought about, old Fezziwig, clapping his hands

    overley." Then old Fezziwig stood out to dance

    to dance with Mrs. Fezziwig. Top couple, too; w

    ah, four times--old Fezziwig would have been a m

    , and so would Mrs. Fezziwig. As to her, she was

    eared to issue from Fezziwig's calves. They shon

    next. And when old Fezziwig and Mrs. Fezziwig h

    d Fezziwig and Mrs. Fezziwig had gone all throug

    gain to your place; Fezziwig "cut"--cut so deftl

    ke up. Mr. and Mrs. Fezziwig took their stations

    hearts in praise of Fezziwig: and when he had do

    luence over him, he seized the extinguisher-ca

    e the cap a parting squeeze, in which his hand

    ore and centre of a blaze of ruddy light, whi

    ore alarming than a dozen ghosts, as he was p

    ; and such a mighty blaze went roaring up the

    , half thawed, half frozen, whose heavier part

    ught fire, and were blazing away to their dear

    anding his gigantic size, he could accommoda

    chit, kissing her a dozen times, and taking o

    erness and flavour, size and cheapness, were

    , so hard and firm, blazing in half of half-a-q

    e flickering of the blaze showed preparations

    g grew but moss and furze, and coarse rank gr

    of endeavouring to seize you, which would ha

    relents," she said, amazed, "there is! Nothing

    grave his own name, EBENEZER SCROOGE.

    er they've sold the prize Turkey that was han

    re?--Not the little prize Turkey: the big one

    it. It's twice the size of Tiny Tim. Joe Mi

    e passed the door a dozen times, before he ha

    out after dark in a breezy spot--say Saint Pau

    d-stone, Scrooge! a squeezing, wrenching, graspin

    The cold within him froze his old features, n

    e court outside, go wheezing up and down, beatin

    'em through a round dozen of months presented

    e chattering in its frozen head up there. The

    d a great fire in a brazier, round which a part

    eir eyes before the blaze in rapture. The wat

    Scrooge seized the ruler with such

    n the gloom. Half-a-dozen gas-lamps out of th

    her-beds, Abrahams, Belshazzars, Apostles putting o

    ring at those fixed glazed eyes, in silence fo

    the vision's stony gaze from himself.

    from other regions, Ebenezer Scrooge, and is con

    pe of my procuring, Ebenezer."

    Exchange pay to Mr. Ebenezer Scrooge or his orde

    st have sunk into a doze unconsciously, and

    er a long way below freezing; that he was clad b

    The Spirit gazed upon him mildly. It

    Middle section has 43 of the 84 z's, but represents only 3 of 83 pages of Dickens (1986).
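A keyword-in-context listing like the one above can be produced with a short regex loop. The sketch below is in Python (a stand-in language, not the tool used for the talk); the function name, the context width, and the two-sentence sample are hypothetical.

```python
import re

def kwic(text, pattern, width=20):
    """Keyword-in-context lines: each match with width characters on each side."""
    lines = []
    for m in re.finditer(pattern, text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} {m.group()} {right}")
    return lines

sample = 'Old Fezziwig laid down his pen. "Yo ho, there! Ebenezer! Dick!"'
for line in kwic(sample, r"\b\w*[zZ]\w*\b"):
    print(line)
```

Sorting or simply scanning the aligned keyword column is what makes a cluster such as the run of Fezziwig lines easy to spot.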


    Pangram Lengths: Fezziwig Effect

    N = 1000

    Simulated pangrams.

    Theoretical Mean = 3620.5

    N = 1000

    Actual pangrams.

    Endpoint with Fezziwig


    100,000 Simulated Pangram Lengths

    Best fit lognormal distribution shown.


    References

    Roger Bilisoly (2008a). Anasquares: Square Anagrams of Squares. Mathematical Gazette, 92, 58-63.

    Roger Bilisoly (2008b). Practical Text Mining with Perl, Wiley.

    Roger Bilisoly (2009). Two Language-based Examples for Use in the Statistics Classroom. American Statistical Association Proceedings of the Joint Statistical Meetings, Section on Statistical Education.

    Gunnar Blom, Lars Holst, and Dennis Sandell (1993). Problems and Snapshots from the World of Probability, Springer.

    W. E. Deskins (1964). Abstract Algebra, MacMillan.

    Charles Dickens (1986). A Christmas Carol, Bantam.

    Philippe Flajolet, Daniele Gardy, and Loys Thimonier (1992). Birthday Paradox, Coupon Collectors, Caching Algorithms and Self-Organizing Search. Discrete Applied Mathematics, 39, 207-229.

    Walter Penney (1960). On the Final Digits of Squares. The American Mathematical Monthly, Vol. 67, No. 10, pp. 1000-1002.

    Walter Stangl (1996). Counting Squares in Z_n. Mathematics Magazine, Vol. 69, No. 4, pp. 285-289.

    Kenneth Williams (1995). "Some Refinements of an Algorithm of Brillhart," Canadian Mathematical Society Conference Proceedings, Volume 15, 409-416. Available at http://www.math.carleton.ca/~williams/papers/pdf/202.pdf .


    Web References

    Benford's Law: http://mathworld.wolfram.com/BenfordsLaw.html and http://en.wikipedia.org/wiki/Benford's_law

    Squares with 3 distinct digits: http://www.asahi-net.or.jp/~KC2H-MSM/mathland/math02/math0203.htm

    Counterexample to Ed Pegg: http://mathworld.wolfram.com/Baxter-HickersonFunction.html

    American Cryptogram Association: http://cryptogram.org/

    National Puzzlers Association: http://www.puzzlers.org/

    Moby Word Lists: http://icon.shef.ac.uk/Moby/

    Anasquare counts: http://oeis.org/A177952

    1978 birthday data: http://www.math.hope.edu/swanson/data/birthdays.txt

    Word Ways: http://wordways.com/


    Wordplay References

    Tony Augarde (1994). The Oxford A to Z of Word Games, Oxford.

    Tony Augarde (2003). The Oxford Guide to Word Games, Oxford.
      o Has historical information.

    Dmitri Borgmann (1967). Beyond Language, Scribners.

    Ross Eckler (1979). Word Recreations, Dover.
      o Most examples originally appeared in Word Ways.

    Ross Eckler (1996). Making the Alphabet Dance, St. Martin's.
      o Most examples originally appeared in Word Ways.

    Dave Morice (1997). Alphabet Avenue, Chicago Review Press.

    Dave Morice (2001). The Dictionary of Word Play, Teachers and Writers Collaborative.

    Warren F. Motte, Jr. (1998). Oulipo: A Primer of Potential Literature, Dalkey Archive.
      o Oulipo stands for Ouvroir de Litterature Potentielle, which is a group of writers, mathematicians, and other people interested in literary structures.


    The Key Wordplay Resource: Word Ways: A Journal of Recreational Linguistics

    Established by Dmitri Borgmann in 1968.
      o He is the author of Language on Vacation and Beyond Language.

    Bought by A. Ross Eckler, Jr. in 1968. He was editor and publisher from 1968-2006.
      o PhD in mathematics from Princeton, 1954
      o Worked at Bell Labs, 1954-84
      o Published Word Recreations (1979), Names and Games: Onomastics and Recreational Linguistics (1986), Making the Alphabet Dance (1996)

    Current editor is Jeremiah Farrell, professor emeritus of mathematics at Butler University.

    Online at http://wordways.com/


    Open question: What are the upper and lower bounds of this plot? Points are squares in base 10 with 12 or fewer digits. This is Figure 2 of Bilisoly (2008a).


    Brillhart Algorithm (See Slide 18)


    Let us generalize the birthday problem.

    Let $\mathcal{A}$ represent an alphabet of size $n_a$.

    For birthdays let $\mathcal{A} = \{d_1, d_2, \ldots, d_{365}\}$, so $n_a$ equals 365.

    Let $p_i = P(d_i \text{ occurs})$, so that each day need not be equally likely.

    Define $N_{jk}$ = number of letters drawn from $\mathcal{A}$ (with replacement) so that there are $j$ distinct letters that each appear at least $k$ times.

    Let $e_k(t)$ = kth order Taylor series expansion of $\exp(t)$. Theorem 1 of Flajolet, Gardy and Thimonier (1992) states:

    $$E[N_{jk}] = \sum_{l=0}^{j-1} \int_0^\infty [x^l] \prod_{i=1}^{n_a} \Big( e_{k-1}(p_i t) + x \big( \exp(p_i t) - e_{k-1}(p_i t) \big) \Big) \exp(-t) \, dt$$

    Here $[x^l]$ takes the coefficient of $x^l$ in the product, which is a product of 1st degree polynomials in $x$.

    Corollary (The Birthday Problem): We need j = 1 day to appear at least k = 2 times. Note that the sum has only one term, and $N_{12} = N_{shared}$.

    $$E[N_{12}] = \int_0^\infty \prod_{i=1}^{365} (1 + p_i t) \exp(-t) \, dt$$

    See Corollary 1 of Flajolet et al. (1992).
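The birthday corollary can be checked numerically. The sketch below is only an illustration, not code from the talk: the function names, the Simpson's-rule integrator, and the truncation point of the integral are all assumptions made here. It evaluates the integral of prod_i (1 + p_i t) exp(-t) for equally likely days and compares it with the exact sum of P(first n draws are all distinct).

```python
import math


def simpson(f, a, b, n=4000):
    """Composite Simpson's rule on [a, b] with n subintervals (n must be even)."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3


def expected_draws_until_repeat(probs):
    """E[N_12]: integral over t >= 0 of prod_i (1 + p_i t) * exp(-t)."""
    n = len(probs)
    # Truncation point (an assumption of this sketch): the integrand is
    # bounded by the equally-likely case, which is negligible beyond this.
    upper = 12 * math.sqrt(n) + 50

    def integrand(t):
        # Sum logarithms so the large product cannot overflow.
        return math.exp(sum(math.log1p(p * t) for p in probs) - t)

    return simpson(integrand, 0.0, upper)


def exact_uniform(m):
    """Exact E[N_12] for m equally likely days: sum over n of P(first n draws distinct)."""
    total, prob_distinct = 0.0, 1.0
    for n in range(m + 1):
        total += prob_distinct
        prob_distinct *= 1 - n / m
    return total


if __name__ == "__main__":
    uniform = [1 / 365] * 365
    print(expected_draws_until_repeat(uniform), exact_uniform(365))
```

Passing a list of unequal probabilities (for example, observed daily birth frequencies) to `expected_draws_until_repeat` gives the non-uniform version of the same quantity.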


    Generalized Coupon Collector's Problem

    Theorem 2 of Flajolet, Gardy and Thimonier (1992): The expected number of letters drawn to get the complete alphabet, $\mathcal{A}$, is given below. Their proof follows fairly easily from Theorem 1.

    $$E[N_{all}] = \int_0^\infty \Big( 1 - \prod_{i=1}^{n_a} \big( 1 - \exp(-p_i t) \big) \Big) \, dt$$

    For uniformly likely birthdays, 2364.65 people are needed on average to get all 365 days to appear. For the 1978 birthday data, we expect to need 2435.14 people.

    Pangrams have $\mathcal{A} = \{a, b, c, \ldots, z\}$, $n_a = 26$, and $p_i$ determined by frequencies found in a text sample.
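As a rough check of the coupon collector integral, the following sketch integrates 1 - prod_i (1 - exp(-p_i t)) numerically. It is an illustration under stated assumptions: the function names, the Simpson's-rule integrator, the step count, and the cutoff 40 / min(p_i) are choices made here, not part of the talk. For equally likely days the result should agree with the classical value 365 * (1 + 1/2 + ... + 1/365).

```python
import math


def simpson(f, a, b, n=6000):
    """Composite Simpson's rule on [a, b] with n subintervals (n must be even)."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3


def expected_draws_for_full_alphabet(probs):
    """E[N_all]: integral over t >= 0 of 1 - prod_i (1 - exp(-p_i t))."""
    # Cutoff (an assumption of this sketch): beyond 40 / min(p) the
    # integrand is below roughly len(probs) * exp(-40).
    upper = 40.0 / min(probs)

    def integrand(t):
        product = 1.0
        for p in probs:
            product *= 1.0 - math.exp(-p * t)
        return 1.0 - product

    return simpson(integrand, 0.0, upper)


if __name__ == "__main__":
    uniform = [1 / 365] * 365
    harmonic = 365 * sum(1 / k for k in range(1, 366))
    print(expected_draws_for_full_alphabet(uniform), harmonic)
```

For pangrams one would pass 26 letter probabilities estimated from a text sample; no such frequencies are supplied here because the talk's values are not reproduced on the slide.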


    Example of Mathematica 8 code to find 14-letter words with no multiple edges and diameter = 2.


    Squares Having only Three Distinct Digits: Investigated by Hisanori Mishima

    Largest known sporadic example:
    81401637345465395512991484^2 = 6626226562522666562566262626266252566552622656522256

    However, there are infinitely many patterned squares with three distinct digits.

    n          n^2
    97         9409
    997        994009
    9997       99940009
    99997      9999400009
    999997     999994000009

    n          n^2
    1235       1525225
    12335      152152225
    123335     15211522225
    1233335    1521115222225
    12333335   152111152222225

    See http://www.asahi-net.or.jp/~KC2H-MSM/mathland/math02/math0203.htm
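The two patterned families in the table are easy to generate and test by machine. The sketch below is illustrative only; the helper names `nines_family` and `threes_family` are hypothetical labels introduced here for the patterns 9...97 and 123...35. It also compares the sporadic example against the square quoted on the slide rather than asserting the outcome.

```python
def distinct_digits(n):
    """Return the set of decimal digits appearing in n."""
    return set(str(n))


def nines_family(k):
    """97, 997, 9997, ...: the k-digit number 10^k - 3 (k >= 2)."""
    return 10**k - 3


def threes_family(m):
    """1235, 12335, 123335, ...: '12', then m threes, then '5' (m >= 1)."""
    return int("12" + "3" * m + "5")


if __name__ == "__main__":
    for k in range(2, 7):
        n = nines_family(k)
        print(n, n * n, sorted(distinct_digits(n * n)))
    for m in range(1, 6):
        n = threes_family(m)
        print(n, n * n, sorted(distinct_digits(n * n)))

    # Compare the sporadic example with the value quoted on the slide.
    root = 81401637345465395512991484
    quoted = 6626226562522666562566262626266252566552622656522256
    print(root * root == quoted, sorted(distinct_digits(root * root)))
```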


    How many sporadic solutions?

    Assume that digits of squares are independent.

    Binomial[10,3] (3/10)^n ≈ P(n-digit square with 3 distinct digits)

    10^(n/2) (1 - 1/Sqrt[10]) ≈ number of n-digit squares

    Expected # of n-digit squares with 3 distinct digits ≈ Binomial[10,3] (3/10)^n * 10^(n/2) (1 - 1/Sqrt[10]) = constant * (3/Sqrt[10])^n

    But the sum over n of (3/Sqrt[10])^n converges, which suggests a finite # of solutions.

    Sum = 360 (1 - 1/Sqrt[10]) (3 + Sqrt[10]) ≈ 1517.

    However, the analogous argument for squares with 4 distinct digits results in the sum over n of (4/Sqrt[10])^n, which diverges.
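The arithmetic of this heuristic can be reproduced in a few lines. This is a minimal sketch with function names invented here: it sums Binomial[10,d] (1 - 1/Sqrt[10]) (d/Sqrt[10])^n over n and, for d = 3, compares the result with the closed form 360 (1 - 1/Sqrt[10]) (3 + Sqrt[10]).

```python
import math


def expected_count(distinct, max_digits=2000):
    """Heuristic expected number of squares with at most max_digits digits
    that use only `distinct` different digits (digits assumed independent)."""
    ratio = distinct / math.sqrt(10)
    constant = math.comb(10, distinct) * (1 - 1 / math.sqrt(10))
    return constant * sum(ratio**n for n in range(1, max_digits + 1))


def closed_form_three_digits():
    """Sum of the geometric series for 3 distinct digits."""
    return 360 * (1 - 1 / math.sqrt(10)) * (3 + math.sqrt(10))


if __name__ == "__main__":
    print(expected_count(3), closed_form_three_digits())
    # For 4 distinct digits the ratio 4/Sqrt[10] exceeds 1, so the
    # partial sums keep growing as more digit lengths are included.
    print(expected_count(4, 25), expected_count(4, 50))
```

The geometric ratio 3/Sqrt[10] is about 0.949, so the series converges slowly; that is why the partial sum is taken out to a couple of thousand digits.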


    Three Examples of 150-digit Squares with exactly 9 distinct digits.

    This square has no 1s.

    590286760507408218847058025821601275020644462041449539546951025992081988403 ^2 =

    348438459630330307258664735742002640854590982059075240585330803623545707923

    270640840935682702648690975443382535993405344806539300344597863650226490409

    This square has no 8s.

    705635480731670264258949343062158505097813112879657762505544377338770578675 ^2 =

    497921431667415396779522503575462537630442531656110964327039001763590201921

    651000116765556629356041206321334237073176796236102099130225925794364755625

    This square has no 7s.

    624228579548317386188320909320329013264935555613585034560249848356296289668 ^2 =

    389661319524910006943311531238425925016482192128500480080280569984852381981

    266389000443336616268805696010466681530390083280245809868986959183363550224


    Squares with 9 Distinct Digits

    How common is an n-digit square with only 9 distinct digits?

    Binomial[10,9] (9/10)^n ≈ probability of a square with 9 distinct digits

    For n = 150, this gives 1.36891 E-6, roughly 1 in a million.

    Hence a computer program checking 5,000,000 random 150-digit squares should find 5,000,000 * 1.36891 E-6 = 6.84 such squares. This was done 30 times, and the counts are given in the histogram.

    Mean = 7.27

    SD = 2.65
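A scaled-down version of this experiment might look like the sketch below. It is not the program used for the talk: the function names, the sampling scheme (choosing a uniformly random root whose square has exactly 150 digits), and the seed are assumptions made for illustration. The default run uses far fewer than 5,000,000 trials so it finishes quickly.

```python
import math
import random


def prob_nine_digits(n):
    """Heuristic probability Binomial[10,9] (9/10)^n from the slide."""
    return math.comb(10, 9) * 0.9**n


def count_nine_digit_squares(trials, n_digits=150, seed=0):
    """Count random n_digits-digit squares that use exactly 9 distinct digits."""
    rng = random.Random(seed)
    lowest_root = math.isqrt(10 ** (n_digits - 1) - 1) + 1  # square has n_digits digits
    highest_root = math.isqrt(10**n_digits - 1)
    count = 0
    for _ in range(trials):
        root = rng.randint(lowest_root, highest_root)
        if len(set(str(root * root))) == 9:
            count += 1
    return count


if __name__ == "__main__":
    p = prob_nine_digits(150)
    print(p, 5_000_000 * p)
    # Small run for speed; raise the trial count to mimic the slide.
    print(count_nine_digit_squares(100_000))
```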


    Ed Pegg's Failed Conjecture

    In April of 1999, on sci.math, Ed Pegg conjectured that there are only finitely many cubes without the digit 0.

    D. Hickerson found a counterexample and a few days later Lew Baxter found the example given below.

    baxter[n_] := (2 10^(5 n)-10^(4 n)+2 10^(3 n)+10^(2 n)+10^n+1)/3

    Do[Print[{baxter[i],baxter[i]^3}],{i,1,5}]

    {64037, 262598918898653}

    {6634003367, 291962492648791178822648631863}

    {666334000333667, 295852962482593148779111778815593148629851963}

    {66663334000033336667, 296251862962481592598148777911117778814892598148629651852963}

    {6666633334000003333366667, 296291851962962481492592648148777791111177778814822592648148629631851862963}

    Function given at http://mathworld.wolfram.com/Baxter-HickersonFunction.html
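For readers without Mathematica, the same formula can be translated into Python, whose integers have arbitrary precision. This is a direct transcription of the `baxter` definition above with an added check for zero digits; the helper name `cube_has_no_zero` is introduced here for illustration.

```python
def baxter(n):
    """(2*10^(5n) - 10^(4n) + 2*10^(3n) + 10^(2n) + 10^n + 1) / 3 as an exact integer."""
    numerator = (2 * 10 ** (5 * n) - 10 ** (4 * n) + 2 * 10 ** (3 * n)
                 + 10 ** (2 * n) + 10**n + 1)
    quotient, remainder = divmod(numerator, 3)
    if remainder:
        raise ArithmeticError("numerator is not divisible by 3")
    return quotient


def cube_has_no_zero(n):
    """True if the decimal expansion of baxter(n)^3 contains no digit 0."""
    return "0" not in str(baxter(n) ** 3)


if __name__ == "__main__":
    for i in range(1, 6):
        print(baxter(i), baxter(i) ** 3, cube_has_no_zero(i))
```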


    Miscellaneous Example: Number Words and Numbers Graph

    Let A = 1, B = 2, C = 3, ..., Z = 26.

    Let s(word) = sum of its alphabetic values.
      o Example: s(bad) = 2 + 1 + 4 = 7

    Let nn(number) = its number name.
      o Example: nn(3) = three

    Consider the dynamical system of composing s and nn.
      o That is, iterate n -> nn(n) -> s(nn(n)) -> nn(s(nn(n))), etc.
      o Example: 1, 34, 160, 205, 174, 278, 291, 253, 254, 258, 247, 281, 240, 216, 228, 288, 255, 240

    1 becomes a 5-cycle, so what else can happen?
      o Answer first published by Dmitri Borgmann in 1967 in Beyond Language.
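The iteration is simple to reproduce. The sketch below assumes American-style number names without "and" or hyphens (for example, "one hundred sixty"), which is the convention that matches the orbit listed for 1; the function names are invented here, and `number_name` only covers 0 through 999, which is enough because letter sums of such names stay well below 1000.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_name(n):
    """nn(n): English name of n for 0 <= n < 1000, without 'and' or hyphens."""
    if not 0 <= n < 1000:
        raise ValueError("this sketch only names 0..999")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + (" " + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    return ONES[hundreds] + " hundred" + (" " + number_name(rest) if rest else "")


def letter_sum(word):
    """s(word): sum of alphabetic values with a = 1, ..., z = 26."""
    return sum(ord(c) - ord("a") + 1 for c in word.lower() if "a" <= c <= "z")


def orbit(start):
    """Iterate n -> s(nn(n)) until a value repeats; return (path, cycle)."""
    first_seen, path, n = {}, [], start
    while n not in first_seen:
        first_seen[n] = len(path)
        path.append(n)
        n = letter_sum(number_name(n))
    return path, path[first_seen[n]:]


if __name__ == "__main__":
    path, cycle = orbit(1)
    print(path)
    print(cycle)
```

Calling `orbit` on every start value from 0 to 999 and collecting the distinct cycles is one way to explore what else can happen.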
